Reranking

Cross-encoders, bi-encoders, and why two-stage retrieval works

Embedding models sacrifice accuracy for speed. They encode queries and documents independently, enabling fast nearest-neighbor search. But this independence limits how well they can capture query-document interactions. Reranking with more powerful models recovers this accuracy.

Bi-Encoders vs Cross-Encoders

[Interactive figure: bi-encoder vs cross-encoder. A bi-encoder encodes the query ("What causes rain?") and the document ("Rain forms when...") independently, then compares embeddings: cosine(query_emb, doc_emb) = 0.82. Speed: fast, one dot product per comparison. Accuracy: limited, no direct token interaction between query and document.]

Bi-encoders (what we have been discussing):

  • Encode query and document separately
  • Compare embeddings with cosine similarity
  • Fast: encode once, compare with dot product
  • Limited: no direct query-document interaction

Cross-encoders:

  • Take query and document together as input
  • Output a relevance score directly
  • Slow: must run full model for each pair
  • Powerful: can model complex query-document relationships

Cross-encoders are too slow for initial retrieval (imagine running BERT on 1 million documents per query). But they are perfect for reranking a small candidate set.
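The contrast between the two interfaces can be sketched in pure Python. The embeddings and `score_fn` here are toy stand-ins for real model outputs, not an actual encoder:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def bi_encoder_score(query_emb, doc_emb):
    # Bi-encoder: query and document were embedded independently,
    # so doc_emb can be precomputed and stored in a vector index.
    return cosine(query_emb, doc_emb)

def cross_encoder_score(score_fn, query, doc):
    # Cross-encoder: the model sees both texts in one input and
    # outputs a relevance score directly. score_fn stands in for
    # a full transformer forward pass (hypothetical placeholder).
    return score_fn(query + " [SEP] " + doc)
```

The key structural difference: `bi_encoder_score` only touches vectors, so documents can be indexed once and searched cheaply, while `cross_encoder_score` needs a fresh model call for every query-document pair.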

Two-Stage Retrieval

[Figure: two-stage retrieval pipeline. The user's query flows into Stage 1 (retrieve candidates), whose output feeds Stage 2 (rerank candidates).]

Stage 1: Retrieve candidates

  • Use bi-encoder + vector index
  • Fast approximate search
  • Retrieve top 100-1000 candidates
  • Recall-focused: cast a wide net

Stage 2: Rerank candidates

  • Use cross-encoder
  • Score each candidate against the query
  • Re-sort by cross-encoder scores
  • Return top 10-20

This combines the speed of bi-encoders (search millions in milliseconds) with the accuracy of cross-encoders (compare 100 pairs with high precision).
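A minimal sketch of the pipeline, using toy vectors and a stand-in `rerank_score` callable in place of a real cross-encoder:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query_emb, doc_embs, rerank_score,
                     k_retrieve=100, k_final=10):
    # Stage 1: fast, recall-focused. Rank all docs by dot product;
    # in practice an ANN index does this approximately.
    candidates = sorted(range(len(doc_embs)),
                        key=lambda i: dot(query_emb, doc_embs[i]),
                        reverse=True)[:k_retrieve]
    # Stage 2: precise. Run the expensive scorer only on the
    # small candidate set, then re-sort by its scores.
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:k_final]
```

The expensive scorer runs `k_retrieve` times per query rather than once per document in the corpus, which is the entire point of the two-stage design.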

Why Cross-Encoders Are More Accurate

Cross-encoders see query and document tokens together. This enables:

Attention across both: A token in the query can directly attend to tokens in the document. Relationships are computed, not inferred from separate embeddings.

Fine-grained matching: Exact matches, synonyms, and paraphrases are detected in context.

Negation and qualification: "Does NOT support X" correctly matches queries about X not being supported.

Numerical reasoning: "Temperature above 100" correctly matches documents mentioning "105 degrees".

Bi-encoders compress documents into fixed vectors before seeing the query. Information is lost.

Score Distributions

[Figure: score distributions for the top 50 results. Bi-encoder scores are clustered, making it hard to distinguish the top result from the rest; cross-encoder scores are spread out, with clear separation of relevant from irrelevant.]

Key insight: Cross-encoder scores have higher variance, making it easier to identify truly relevant results. This is why reranking improves precision.

Bi-encoder similarity scores often cluster. Top-100 candidates might all have scores between 0.78 and 0.82. Hard to distinguish.

Cross-encoder scores spread out. The same 100 candidates might score between 0.3 and 0.95. Clear separation between relevant and irrelevant.

This improved discrimination is why reranking helps: it separates candidates that bi-encoders cannot distinguish.
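The difference in spread can be quantified with plain statistics. The score lists below are illustrative, chosen to match the ranges mentioned above:

```python
import statistics

bi_scores    = [0.82, 0.81, 0.80, 0.79, 0.78]  # clustered
cross_scores = [0.95, 0.70, 0.50, 0.35, 0.30]  # spread out

# Higher standard deviation means a clearer gap between the
# top result and the rest of the candidate set.
bi_spread = statistics.pstdev(bi_scores)
cross_spread = statistics.pstdev(cross_scores)
```

Here `cross_spread` is more than an order of magnitude larger than `bi_spread`, which is the discrimination gain reranking buys.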

Reranking in Practice

[Interactive figure: reranking effect. Initial bi-encoder ranking:]

  1. Rain Formation Process (0.89)
  2. Weather Patterns Overview (0.87)
  3. Precipitation Types (0.86)
  4. Cloud Formation (0.85)
  5. Climate vs Weather (0.84)
  6. Water Cycle Explained (0.83)

The bi-encoder scores are nearly identical, so it is hard to tell which results are truly relevant.

Common reranking models:

  • ms-marco-MiniLM (fast, good quality)
  • BGE-reranker (strong performance)
  • Cohere Rerank (API service)
  • cross-encoder/ms-marco (classic)

Latency budget:

  • Retrieving 100 candidates: ~10ms
  • Reranking 100 with MiniLM: ~50ms
  • Total: ~60ms (acceptable for most applications)
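The budget above works out arithmetically; the ~0.5 ms per-pair figure is an assumption chosen to be consistent with the totals quoted:

```python
def pipeline_latency_ms(retrieve_ms, n_candidates, rerank_ms_per_pair):
    # Total = one retrieval pass + one cross-encoder call per candidate.
    return retrieve_ms + n_candidates * rerank_ms_per_pair

# 10 ms retrieval + 100 candidates x 0.5 ms each = 60 ms
total = pipeline_latency_ms(10, 100, 0.5)
```

Because reranking cost scales linearly with the candidate count, doubling candidates roughly doubles the reranking portion of the budget.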

How many to retrieve/rerank:

  • More candidates → better recall → higher latency
  • Typical: retrieve 100, rerank to top 10
  • High-stakes: retrieve 500-1000, rerank to top 20

When to Use Reranking

Use reranking when:

  • Initial retrieval quality is insufficient
  • You have latency budget for cross-encoder
  • Candidates are hard to distinguish by embedding similarity
  • Precision matters more than cost

Skip reranking when:

  • Bi-encoder quality is good enough
  • Latency is extremely constrained
  • Cost per query matters (reranking uses more compute)
  • Candidates are already well-separated

Reranking Beyond Cross-Encoders

LLM-based reranking: Ask an LLM to rank candidates by relevance. More expensive but can use reasoning.

Learned sparse models: SPLADE and similar produce sparse vectors that can be compared efficiently. Not quite reranking but improves over dense-only.

Multi-vector models: ColBERT uses multiple vectors per document, enabling richer matching at moderate cost.

Ensemble methods: Combine multiple reranker scores with fusion techniques.
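One widely used fusion technique is reciprocal rank fusion (RRF), which combines multiple rankings without requiring calibrated scores; `k=60` is the conventional default constant:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists (best first) into one ranking.

    Each list contributes 1 / (k + rank) for every doc it contains,
    so documents ranked highly by multiple rankers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF only looks at ranks, so it can fuse a bi-encoder ranking, a cross-encoder ranking, and a keyword ranking even though their raw scores live on different scales.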

Key Takeaways

  • Bi-encoders encode query and document separately—fast but limited interaction modeling
  • Cross-encoders process query and document together—slow but much more accurate
  • Two-stage retrieval: fast bi-encoder retrieval, then accurate cross-encoder reranking
  • Cross-encoders can model negation, qualification, and fine-grained matching that bi-encoders miss
  • Reranking spreads out scores, making it easier to distinguish truly relevant results
  • Common pattern: retrieve 100 candidates, rerank to top 10, with ~50ms reranking latency
  • Use reranking when precision matters and latency budget allows