Reranker
The second-pass ranker
A secondary layer that takes coarse vector-search results and reorders them with a more precise model — surfacing the truly relevant ones.
Vector search is fast but coarse: cosine similarity over millions of vectors returns a top-50 in sub-second time. The problem: the top 5 of those 50 aren't always ordered correctly for the user, because bi-encoder embeddings encode the query and the document separately and miss their deep interaction.
A reranker takes those 50 results and feeds each one to a cross-encoder model as a (query, doc) pair. The cross-encoder reads the pair jointly and produces a much more accurate relevance score. Result: top-50 in, a precisely reranked top-5 out.
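The two-stage shape can be sketched in a few lines. This is a minimal illustration, not any library's API: `bi_encoder_score` and `cross_encoder_score` are hypothetical stand-ins (cosine similarity and crude token overlap) for real embedding and cross-encoder models.

```python
import math

def bi_encoder_score(query_vec, doc_vec):
    # First pass: cosine similarity between independently computed vectors.
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm = math.sqrt(sum(q * q for q in query_vec)) * math.sqrt(sum(d * d for d in doc_vec))
    return dot / norm

def cross_encoder_score(query, doc):
    # Second pass: a real cross-encoder reads (query, doc) jointly.
    # Hypothetical stand-in: token overlap as a crude joint signal.
    q_tokens, d_tokens = set(query.lower().split()), set(doc.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

def retrieve_then_rerank(query, query_vec, corpus, k_coarse=50, k_final=5):
    # corpus: list of (doc_text, doc_vec) pairs.
    # Stage 1: cheap vector search over everything, keep k_coarse candidates.
    coarse = sorted(corpus, key=lambda d: bi_encoder_score(query_vec, d[1]),
                    reverse=True)[:k_coarse]
    # Stage 2: expensive joint scoring over the candidates only.
    reranked = sorted(coarse, key=lambda d: cross_encoder_score(query, d[0]),
                      reverse=True)
    return [doc for doc, _ in reranked[:k_final]]
```

The design point is the asymmetry: the cheap scorer runs over the whole corpus, the expensive one only over the shortlist.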
Common tools: Cohere Rerank, bge-reranker, Voyage rerank, mxbai-rerank. A reranker can noticeably boost RAG retrieval quality, with gains often cited in the 20-30% range.
In a library, first check the catalog (vector search — fast, rough) and pull 50 titles. Then have an editor go through them and say "these 5 are actually closest to your topic" (reranker — slow, precise). The catalog alone isn't enough; the editor layer lifts quality.
A SaaS docs RAG. User asks: "how do I bypass the API rate limit?" Vector search returns 50 chunks; top 5:
1. "Rate limit basics" (0.91)
2. "Pricing tiers" (0.88)
3. "Authentication" (0.85)
4. "Throttling guide" (0.82) ← actual answer here
5. "Plan comparison" (0.80)
The reranker scores all 50 against the query as (query, doc) pairs; "Throttling guide" wins:
1. "Throttling guide" (0.96)
2. "Rate limit basics" (0.92)
3. "Plan comparison" (0.78)
...
LLM gets the right chunk first; answer quality jumps noticeably.
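The reorder itself is just a sort by the second-pass score. A minimal sketch using the scores from the example above (the two scores not shown in the text are illustrative assumptions):

```python
# Cross-encoder scores per candidate chunk. The first three come from the
# worked example; "Pricing tiers" and "Authentication" are illustrative.
reranker_scores = {
    "Throttling guide": 0.96,
    "Rate limit basics": 0.92,
    "Plan comparison": 0.78,
    "Pricing tiers": 0.40,
    "Authentication": 0.35,
}

# The order vector search produced (bi-encoder cosine similarity).
vector_top5 = ["Rate limit basics", "Pricing tiers", "Authentication",
               "Throttling guide", "Plan comparison"]

# Rerank: sort candidates by the cross-encoder score, keep the top 3 for the LLM.
reranked = sorted(vector_top5, key=lambda c: reranker_scores[c], reverse=True)[:3]
print(reranked)  # ['Throttling guide', 'Rate limit basics', 'Plan comparison']
```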
When to use
- RAG retrieval quality is poor — reranker is the first intervention
- Top-K is large (50-100) but LLM context fits few (5-10)
- High-stakes domains (legal, medical) — wrong chunk = wrong answer
- Multilingual search — vectors alone miss linguistic nuance
When to skip
- Top-3 is already 95% correct — extra layer not worth it
- Latency-critical — rerankers add 100-500ms
- Tiny corpus (<1K chunks) — vector search alone is fine
Cross-encoders are slow
Bi-encoders (vector search) encode documents once, offline; a cross-encoder runs a fresh forward pass for every (query, doc) pair at query time. 50 results = 50 forward passes. Batching and a smaller model are the key mitigations.
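Batching amortizes those forward passes: instead of one model call per pair, group the pairs and run one call per batch. A sketch under that assumption, where `score_batch` is a hypothetical stand-in for one batched cross-encoder forward pass:

```python
def score_batch(pairs):
    # Hypothetical stand-in for a single batched cross-encoder forward pass;
    # a real model returns one relevance score per (query, doc) pair.
    return [float(len(set(q.split()) & set(d.split()))) for q, d in pairs]

def rerank_batched(query, docs, batch_size=16):
    pairs = [(query, d) for d in docs]
    scores, calls = [], 0
    for i in range(0, len(pairs), batch_size):
        scores.extend(score_batch(pairs[i:i + batch_size]))  # one pass per batch
        calls += 1
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return ranked, calls

# 50 candidates at batch_size=16 → 4 batched passes instead of 50 single ones.
```

Batch size trades latency against GPU memory; tune it to the model size and hardware.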
Wrong model choice
Multilingual rerankers are less optimized for English, and monolingual ones are weak on non-English text. Match the model to your data; Cohere Rerank v3, for example, is multilingual and production-ready.
Reranker isn't a silver bullet
First good chunking, then good retrieval, then the reranker. Don't use a reranker to paper over bad chunking or retrieval; fix the earlier layers first.