Reranking
Cross-encoders, bi-encoders, and why two-stage retrieval works
Embedding models sacrifice accuracy for speed. They encode queries and documents independently, enabling fast nearest-neighbor search. But this independence limits how well they can capture query-document interactions. Reranking with more powerful models recovers this accuracy.
Bi-Encoders vs Cross-Encoders
[Interactive: bi-encoder vs cross-encoder. Speed: fast, encode once and compare with a dot product. Accuracy: limited, no direct token interaction between query and document.]
Bi-encoders (what we have been discussing):
- Encode query and document separately
- Compare embeddings with cosine similarity
- Fast: encode once, compare with dot product
- Limited: no direct query-document interaction
Cross-encoders:
- Take query and document together as input
- Output a relevance score directly
- Slow: must run full model for each pair
- Powerful: can model complex query-document relationships
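The shape of the two interfaces can be sketched in a few lines of Python. Everything below is a toy stand-in (a hashed bag-of-words "encoder" and a word-overlap "cross-encoder" with a crude negation penalty), not a real model; the point is only the contrast between encode-then-compare and score-the-pair.

```python
import numpy as np

def encode(text: str) -> np.ndarray:
    """Stand-in for a bi-encoder: map text to a fixed-size vector
    independently of any query (here, a hashed bag of words)."""
    vec = np.zeros(64)
    for token in text.lower().split():
        vec[hash(token) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def cross_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: sees query and document TOGETHER,
    so it can react to interactions (here, a crude negation penalty)
    that independently computed embeddings cannot capture."""
    q_tokens = set(query.lower().split())
    d_tokens = doc.lower().split()
    overlap = len(q_tokens & set(d_tokens))
    penalty = 2.0 if "not" in d_tokens else 0.0
    return overlap - penalty

query = "does the plan support refunds"
doc = "the basic plan does not support refunds"

# Bi-encoder path: encode each side once, then a cheap dot product.
bi_score = float(encode(query) @ encode(doc))

# Cross-encoder path: the full model runs on every (query, doc) pair.
pair_score = cross_score(query, doc)
```

The bi-encoder path can reuse `encode(doc)` across all queries; the cross-encoder path must rerun the scorer for every pair, which is exactly why it is reserved for small candidate sets.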
Cross-encoders are too slow for initial retrieval (imagine running BERT on 1 million documents per query). But they are perfect for reranking a small candidate set.
Two-Stage Retrieval
[Diagram: two-stage retrieval pipeline. A user query flows through Stage 1 (retrieve candidates) and Stage 2 (rerank candidates).]
Stage 1: Retrieve candidates
- Use bi-encoder + vector index
- Fast approximate search
- Retrieve top 100-1000 candidates
- Recall-focused: cast a wide net
Stage 2: Rerank candidates
- Use cross-encoder
- Score each candidate against the query
- Re-sort by cross-encoder scores
- Return top 10-20
This combines the speed of bi-encoders (search millions in milliseconds) with the accuracy of cross-encoders (compare 100 pairs with high precision).
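A minimal sketch of the pipeline in Python. The corpus, the random embeddings, and the word-overlap `cross_scorer` are all invented for illustration; in practice stage 1 would use a vector index and stage 2 a real cross-encoder model.

```python
import numpy as np

def two_stage_search(query, query_vec, docs, doc_vecs, cross_scorer,
                     n_candidates=100, n_results=10):
    """Stage 1: score ALL documents with one cheap matrix-vector product.
       Stage 2: rerank only the candidates with an expensive pair scorer."""
    sims = doc_vecs @ query_vec                    # stage 1: fast, recall-focused
    candidates = np.argsort(-sims)[:n_candidates]  # cast a wide net
    rescored = sorted(candidates,
                      key=lambda i: cross_scorer(query, docs[i]),
                      reverse=True)                # stage 2: precise re-sort
    return rescored[:n_results]

# Toy corpus: 1,000 random unit vectors standing in for document embeddings.
rng = np.random.default_rng(0)
docs = [f"document {i}" for i in range(1000)]
doc_vecs = rng.normal(size=(1000, 32))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec = doc_vecs[42] + 0.05 * rng.normal(size=32)  # query near doc 42

# Word-overlap stand-in for a cross-encoder; it uniquely favors doc 42.
cross_scorer = lambda q, d: len(set(q.split()) & set(d.split()))
top = two_stage_search("document 42", query_vec, docs, doc_vecs, cross_scorer)
```

Note that the expensive scorer runs only `n_candidates` times per query, no matter how large the corpus is.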
Why Cross-Encoders Are More Accurate
Cross-encoders see query and document tokens together. This enables:
Attention across both: A token in the query can directly attend to tokens in the document. Relationships are computed, not inferred from separate embeddings.
Fine-grained matching: Exact matches, synonyms, and paraphrases are detected in context.
Negation and qualification: "Does NOT support X" correctly matches queries about X not being supported.
Numerical reasoning: "Temperature above 100" correctly matches documents mentioning "105 degrees".
Bi-encoders compress documents into fixed vectors before seeing the query. Information is lost.
Score Distributions
[Chart: score distributions. Bi-encoder scores are clustered, making the top hard to distinguish from the rest; cross-encoder scores are spread out, clearly separating relevant from irrelevant.]
Key insight: cross-encoder scores are more spread out, giving clearer separation between relevant and irrelevant candidates and making it easier to identify the truly relevant results. This is why reranking improves precision.
Bi-encoder similarity scores often cluster tightly: the top-100 candidates might all score between 0.78 and 0.82, making them hard to tell apart.
Cross-encoder scores spread out: the same 100 candidates might score anywhere from 0.3 to 0.95, with clear separation between relevant and irrelevant documents.
This improved discrimination is why reranking helps: it separates candidates that bi-encoders cannot distinguish.
Reranking in Practice
[Interactive: reranking effect. Bi-encoder scores are similar, making it hard to tell which candidates are truly relevant.]
Common reranking models:
- ms-marco MiniLM cross-encoders, e.g. cross-encoder/ms-marco-MiniLM-L-6-v2 (fast, good quality, the classic choice)
- BGE-reranker (strong performance)
- Cohere Rerank (API service)
Latency budget:
- Retrieving 100 candidates: ~10ms
- Reranking 100 with MiniLM: ~50ms
- Total: ~60ms (acceptable for most applications)
How many to retrieve/rerank:
- More candidates → better recall → higher latency
- Typical: retrieve 100, rerank to top 10
- High-stakes: retrieve 500-1000, rerank to top 20
When to Use Reranking
Use reranking when:
- Initial retrieval quality is insufficient
- You have latency budget for cross-encoder
- Candidates are hard to distinguish by embedding similarity
- Precision matters more than cost
Skip reranking when:
- Bi-encoder quality is good enough
- Latency is extremely constrained
- Cost per query matters (reranking uses more compute)
- Candidates are already well-separated
Reranking Beyond Cross-Encoders
LLM-based reranking: Ask an LLM to rank candidates by relevance. More expensive but can use reasoning.
Learned sparse models: SPLADE and similar produce sparse vectors that can be compared efficiently. Not quite reranking but improves over dense-only.
Multi-vector models: ColBERT uses multiple vectors per document, enabling richer matching at moderate cost.
Ensemble methods: Combine multiple reranker scores with fusion techniques.
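For example, reciprocal rank fusion (RRF) combines ranked lists using only ranks, sidestepping the fact that different rerankers produce scores on incompatible scales. A sketch, with k=60 as the commonly used constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: each list contributes 1/(k + rank)
    per document, so agreement near the top of the lists dominates."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Two rerankers disagree: "c" is 3rd in one list but 1st in the other,
# so after fusion it overtakes "b" (ranked 2nd and 3rd).
order = reciprocal_rank_fusion([["a", "b", "c"], ["c", "a", "b"]])
# order == ["a", "c", "b"]
```

Because RRF only needs ranks, it can fuse a cross-encoder ranking with, say, a BM25 ranking without any score calibration.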
Key Takeaways
- Bi-encoders encode query and document separately—fast but limited interaction modeling
- Cross-encoders process query and document together—slow but much more accurate
- Two-stage retrieval: fast bi-encoder retrieval, then accurate cross-encoder reranking
- Cross-encoders can model negation, qualification, and fine-grained matching that bi-encoders miss
- Reranking spreads out scores, making it easier to distinguish truly relevant results
- Common pattern: retrieve 100 candidates, rerank to top 10, with ~50ms reranking latency
- Use reranking when precision matters and latency budget allows