Reranking
Cross-encoders, bi-encoders, and why two-stage retrieval works
Embedding models sacrifice accuracy for speed. They encode queries and documents independently, enabling fast nearest-neighbor search. But this independence limits how well they can capture query-document interactions. Reranking with more powerful models recovers this accuracy.
Bi-Encoders vs Cross-Encoders
[Interactive: bi-encoder vs cross-encoder. Speed: fast, encode once and compare with a dot product. Accuracy: limited, no direct token interaction between query and document.]
Bi-encoders (what we have been discussing):
- Encode query and document separately
- Compare embeddings with cosine similarity
- Fast: encode once, compare with dot product
- Limited: no direct query-document interaction
Cross-encoders:
- Take query and document together as input
- Output a relevance score directly
- Slow: must run full model for each pair
- Powerful: can model complex query-document relationships
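The shape of the two interfaces can be sketched in a few lines of Python. Everything below is a toy stand-in (a hashed bag-of-words "encoder" and a word-overlap "cross-encoder" with a crude negation penalty), not a real model; the point is only the contrast between encode-then-compare and score-the-pair.

```python
import numpy as np

def encode(text: str) -> np.ndarray:
    """Stand-in for a bi-encoder: map text to a fixed-size vector
    independently of any query (here, a hashed bag of words)."""
    vec = np.zeros(64)
    for token in text.lower().split():
        vec[hash(token) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def cross_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: sees query and document TOGETHER,
    so it can react to interactions (here, a crude negation penalty)
    that independently computed embeddings cannot capture."""
    q_tokens = set(query.lower().split())
    d_tokens = doc.lower().split()
    overlap = len(q_tokens & set(d_tokens))
    penalty = 2.0 if "not" in d_tokens else 0.0
    return overlap - penalty

query = "does the plan support refunds"
doc = "the basic plan does not support refunds"

# Bi-encoder path: encode each side once, then a cheap dot product.
bi_score = float(encode(query) @ encode(doc))

# Cross-encoder path: the full model runs on every (query, doc) pair.
pair_score = cross_score(query, doc)
```

The bi-encoder path can reuse `encode(doc)` across all queries; the cross-encoder path must rerun the scorer for every pair, which is exactly why it is reserved for small candidate sets.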
Cross-encoders are too slow for initial retrieval (imagine running BERT on 1 million documents per query). But they are perfect for reranking a small candidate set.
Two-Stage Retrieval
[Diagram: two-stage retrieval pipeline. A user query flows through Stage 1 (retrieve candidates) and Stage 2 (rerank candidates).]
Stage 1: Retrieve candidates
- Use bi-encoder + vector index
- Fast approximate search
- Retrieve top 100-1000 candidates
- Recall-focused: cast a wide net
Stage 2: Rerank candidates
- Use cross-encoder
- Score each candidate against the query
- Re-sort by cross-encoder scores
- Return top 10-20
This combines the speed of bi-encoders (search millions in milliseconds) with the accuracy of cross-encoders (compare 100 pairs with high precision).
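A minimal sketch of the pipeline in Python. The corpus, the random embeddings, and the word-overlap `cross_scorer` are all invented for illustration; in practice stage 1 would use a vector index and stage 2 a real cross-encoder model.

```python
import numpy as np

def two_stage_search(query, query_vec, docs, doc_vecs, cross_scorer,
                     n_candidates=100, n_results=10):
    """Stage 1: score ALL documents with one cheap matrix-vector product.
       Stage 2: rerank only the candidates with an expensive pair scorer."""
    sims = doc_vecs @ query_vec                    # stage 1: fast, recall-focused
    candidates = np.argsort(-sims)[:n_candidates]  # cast a wide net
    rescored = sorted(candidates,
                      key=lambda i: cross_scorer(query, docs[i]),
                      reverse=True)                # stage 2: precise re-sort
    return rescored[:n_results]

# Toy corpus: 1,000 random unit vectors standing in for document embeddings.
rng = np.random.default_rng(0)
docs = [f"document {i}" for i in range(1000)]
doc_vecs = rng.normal(size=(1000, 32))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec = doc_vecs[42] + 0.05 * rng.normal(size=32)  # query near doc 42

# Word-overlap stand-in for a cross-encoder; it uniquely favors doc 42.
cross_scorer = lambda q, d: len(set(q.split()) & set(d.split()))
top = two_stage_search("document 42", query_vec, docs, doc_vecs, cross_scorer)
```

Note that the expensive scorer runs only `n_candidates` times per query, no matter how large the corpus is.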
Why Cross-Encoders Are More Accurate
Cross-encoders see query and document tokens together. This enables:
Attention across both: A token in the query can directly attend to tokens in the document. Relationships are computed, not inferred from separate embeddings.
Fine-grained matching: Exact matches, synonyms, and paraphrases are detected in context.
Negation and qualification: "Does NOT support X" correctly matches queries about X not being supported.
Numerical reasoning: "Temperature above 100" correctly matches documents mentioning "105 degrees".
Bi-encoders compress documents into fixed vectors before seeing the query. Information is lost.
Score Distributions
[Chart: score distributions. Bi-encoder scores are clustered, making the top hard to distinguish from the rest; cross-encoder scores are spread out, clearly separating relevant from irrelevant.]
Key insight: cross-encoder scores are more spread out, giving clearer separation between relevant and irrelevant candidates and making it easier to identify the truly relevant results. This is why reranking improves precision.
Bi-encoder similarity scores often cluster tightly: the top-100 candidates might all score between 0.78 and 0.82, making them hard to tell apart.
Cross-encoder scores spread out: the same 100 candidates might score anywhere from 0.3 to 0.95, with clear separation between relevant and irrelevant documents.
This improved discrimination is why reranking helps: it separates candidates that bi-encoders cannot distinguish.
Reranking in Practice
[Interactive: reranking effect. Bi-encoder scores are similar, making it hard to tell which candidates are truly relevant.]
Common reranking models:
- ms-marco MiniLM cross-encoders, e.g. cross-encoder/ms-marco-MiniLM-L-6-v2 (fast, good quality, the classic choice)
- BGE-reranker (strong performance)
- Cohere Rerank (API service)
Latency budget:
- Retrieving 100 candidates: ~10ms
- Reranking 100 with MiniLM: ~50ms
- Total: ~60ms (acceptable for most applications)
How many to retrieve/rerank:
- More candidates → better recall → higher latency
- Typical: retrieve 100, rerank to top 10
- High-stakes: retrieve 500-1000, rerank to top 20
When to Use Reranking
Use reranking when:
- Initial retrieval quality is insufficient
- You have latency budget for cross-encoder
- Candidates are hard to distinguish by embedding similarity
- Precision matters more than cost
Skip reranking when:
- Bi-encoder quality is good enough
- Latency is extremely constrained
- Cost per query matters (reranking uses more compute)
- Candidates are already well-separated
Reranking Beyond Cross-Encoders
LLM-based reranking: Ask an LLM to rank candidates by relevance. More expensive but can use reasoning.
Learned sparse models: SPLADE and similar produce sparse vectors that can be compared efficiently. Not quite reranking but improves over dense-only.
Multi-vector models: ColBERT uses multiple vectors per document, enabling richer matching at moderate cost.
Ensemble methods: Combine multiple reranker scores with fusion techniques.
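For example, reciprocal rank fusion (RRF) combines ranked lists using only ranks, sidestepping the fact that different rerankers produce scores on incompatible scales. A sketch, with k=60 as the commonly used constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: each list contributes 1/(k + rank)
    per document, so agreement near the top of the lists dominates."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Two rerankers disagree: "c" is 3rd in one list but 1st in the other,
# so after fusion it overtakes "b" (ranked 2nd and 3rd).
order = reciprocal_rank_fusion([["a", "b", "c"], ["c", "a", "b"]])
# order == ["a", "c", "b"]
```

Because RRF only needs ranks, it can fuse a cross-encoder ranking with, say, a BM25 ranking without any score calibration.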
Key Takeaways
- Bi-encoders encode query and document separately—fast but limited interaction modeling
- Cross-encoders process query and document together—slow but much more accurate
- Two-stage retrieval: fast bi-encoder retrieval, then accurate cross-encoder reranking
- Cross-encoders can model negation, qualification, and fine-grained matching that bi-encoders miss
- Reranking spreads out scores, making it easier to distinguish truly relevant results
- Common pattern: retrieve 100 candidates, rerank to top 10, with ~50ms reranking latency
- Use reranking when precision matters and latency budget allows