AI Dictionary
Intermediate · ~2 min read · #reranker #cross-encoder #rag

Reranker

The second-pass ranker

A secondary layer that takes coarse vector-search results and reorders them with a more precise model — surfacing the truly relevant ones.

[Diagram: from many results to few precise ones. Vector search returns a coarse top-50; a cross-encoder reranker reorders it into the true top-5. Vectors are fast but coarse; rerankers are slower but precise.]
Definition

Vector search is fast but coarse: cosine similarity over millions of vectors returns a top-50 in sub-second time. The problem is that those 50 aren't always correctly ordered for the user, because the embeddings encode query and doc separately, missing the deep interaction between them.

A reranker takes those 50 results and feeds each as a (query, doc) pair to a cross-encoder model. The cross-encoder looks at the pair together and produces a much more accurate relevance score. Result: top-50 → reranked top-5, with real precision.

Common tools: Cohere Rerank, bge-reranker, Voyage rerank, mxbai-rerank. Can boost RAG retrieval quality by 20-30%.
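The two-stage pipeline above can be sketched as follows. The `cross_encoder_score` function here is a toy stand-in based on token overlap, purely for illustration; in practice you would call a real cross-encoder model such as bge-reranker or Cohere Rerank at this step.

```python
# Two-stage retrieval sketch: fast vector search hands off candidates to a
# precise reranker. `cross_encoder_score` is a toy stand-in (token overlap),
# NOT a real model; swap in an actual cross-encoder in production.

def cross_encoder_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query tokens present in the doc."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, doc) pair together, then keep the best top_k."""
    scored = [(cross_encoder_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Usage: pretend vector search already returned these coarse candidates.
candidates = [
    "pricing tiers and plan comparison",
    "throttling guide: how to handle and bypass the rate limit",
    "authentication with api keys",
]
top = rerank("bypass the api rate limit", candidates, top_k=2)
```

The key design point is visible in the signature: the scorer sees the query and the document *together*, which is exactly the interaction a bi-encoder misses.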

Analogy

In a library, first check the catalog (vector search — fast, rough) and pull 50 titles. Then have an editor go through them and say "these 5 are actually closest to your topic" (reranker — slow, precise). The catalog alone isn't enough; the editor layer lifts quality.

Real-world example

A SaaS docs RAG. User asks: "how do I bypass the API rate limit?" Vector search returns 50 chunks; top-5:
  1. "Rate limit basics" (0.91)
  2. "Pricing tiers" (0.88)
  3. "Authentication" (0.85)
  4. "Throttling guide" (0.82) ← actual answer here
  5. "Plan comparison" (0.80)

The reranker scores all 50 against the query as pairs; "Throttling guide" wins:
  1. "Throttling guide" (0.96)
  2. "Rate limit basics" (0.92)
  3. "Plan comparison" (0.78)
  ...

LLM gets the right chunk first; answer quality jumps noticeably.

When to use
  • RAG retrieval quality is poor — reranker is the first intervention
  • Top-K is large (50-100) but LLM context fits few (5-10)
  • High-stakes domains (legal, medical) — wrong chunk = wrong answer
  • Multilingual search — vectors alone miss linguistic nuance
When not to use
  • Top-3 is already 95% correct — extra layer not worth it
  • Latency-critical — rerankers add 100-500ms
  • Tiny corpus (<1K chunks) — vector search alone is fine
Common pitfalls

Cross-encoders are slow

Bi-encoders (vector search) encode documents once, offline; a cross-encoder runs a fresh forward pass for every query-doc pair at query time. 50 results = 50 forward passes. Batching and a smaller model are key.
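Batching can be sketched like this. `score_batch` is a hypothetical stand-in for a real model's batch-scoring API (real cross-encoder libraries typically accept a list of pairs in one call).

```python
# Sketch of batched reranking: instead of one forward pass per pair,
# group (query, doc) pairs into batches the model scores in one call.
# `score_batch` is a toy stand-in for a real model's batch API.

def score_batch(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy batch scorer: token overlap per pair (stand-in for a model)."""
    scores = []
    for query, doc in pairs:
        q = set(query.lower().split())
        scores.append(len(q & set(doc.lower().split())) / len(q))
    return scores

def rerank_batched(query: str, docs: list[str], batch_size: int = 16) -> list[float]:
    """Score all docs in batches of `batch_size` instead of one at a time."""
    pairs = [(query, d) for d in docs]
    scores: list[float] = []
    for i in range(0, len(pairs), batch_size):
        scores.extend(score_batch(pairs[i:i + batch_size]))
    return scores
```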

Wrong model choice

Multilingual rerankers are less optimized for English; monolingual ones are weak on non-English. Match the model to your data. Cohere Rerank v3 is multilingual and production-ready.

Reranker isn't a silver bullet

First good chunking, then good retrieval, then reranker. Don't use a reranker to save bad retrieval; fix the earlier layers too.