Hybrid Retrieval

Combining BM25 and dense retrieval with reciprocal rank fusion and beyond

Dense retrieval with embeddings captures semantic similarity. Keyword search with BM25 captures exact matches. Neither is universally better. Hybrid retrieval combines both, getting the best of each approach.

BM25 vs Dense Retrieval

Example: results for the query "Python dict"

BM25 results:

  1. Python dict documentation
  2. Dictionary methods in Python
  3. dict vs list performance

Dense results:

  1. Python dictionary tutorial
  2. Hash maps in Python
  3. Key-value data structures

Observation: BM25 finds exact 'dict' matches; dense retrieval finds semantically related 'dictionary' and 'hash map' content.

BM25 (sparse/keyword search) matches exact terms. It excels at finding specific names, codes, and rare terms. It has no semantic understanding—the words must match—but it is fast and interpretable.

Dense retrieval (embedding search) captures meaning. It handles synonyms and paraphrases naturally. It may miss exact matches if the embedding model was not trained on that terminology, and it requires embedding computation at query time.

Each wins in different scenarios. BM25 wins when you search "Error code E1234" and need an exact match, when dealing with rare technical terms not in the embedding training data, and when looking for specific proper nouns or product names. Dense wins when you search "How to make code faster" and want documents about "performance optimization," for cross-lingual retrieval where the same concept is expressed in different languages, and whenever conceptual similarity matters more than lexical overlap.
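To make the sparse side concrete, here is a minimal sketch of Okapi BM25 scoring. The tokenization, corpus, and parameter values (k1 = 1.5, b = 0.75) are illustrative assumptions, not a production implementation; real systems precompute document frequencies in an inverted index.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with Okapi BM25."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    n_docs = len(corpus)
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        # document frequency: how many documents contain the term
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        freq = tf[term]
        score += idf * (freq * (k1 + 1)) / (
            freq + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

docs = [
    "python dict documentation".split(),
    "dictionary methods in python".split(),
    "hash maps in python".split(),
]
query = "python dict".split()
scores = [bm25_score(query, d, docs) for d in docs]
best = max(range(len(docs)), key=lambda i: scores[i])
# the document containing the exact token "dict" scores highest
```

Note how the rare term "dict" gets a high IDF weight while the common term "python" contributes little: exactly the "exact, rare terms win" behavior described above.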

Why Combine Them?

Each method has blind spots the other covers:

| Query | BM25 | Dense | Hybrid |
|---|---|---|---|
| "Python dict" | Finds "dict" exactly | Finds "dictionary" | Finds both |
| "make code fast" | Misses "optimization" | Finds it | Finds it |
| "error ABC123" | Exact match | May miss | Exact match |
| "deployment problems" | Misses "issues" | Finds it | Finds it |

Hybrid achieves higher recall than either alone.

Reciprocal Rank Fusion (RRF)

The most common fusion method: combine rankings, not scores.

Example fusion of two ranked lists, with RRF(d) = Σ 1/(k + rank_i(d)) and k = 60 ("—" means the document is absent from that list):

| Document | BM25 Rank | Dense Rank | RRF Score | Final Rank |
|---|---|---|---|---|
| Fast algorithms explained | 1 | 4 | 0.0320 | 1 |
| Performance optimization guide | 5 | 1 | 0.0318 | 2 |
| Quick start guide | 3 | 6 | 0.0310 | 3 |
| Speed up your code | 8 | 2 | 0.0308 | 4 |
| Faster build times | 2 | — | 0.0161 | 5 |
| Code efficiency tips | — | 3 | 0.0159 | 6 |

For each document appearing in any result list:

\text{RRF}(d) = \sum_{r \in \text{rankings}} \frac{1}{k + \text{rank}_r(d)}

Where k is a constant (typically 60) that dampens the effect of rank.

RRF is rank-based, so there is no need to normalize scores across different retrieval systems. Documents that rank highly in both lists get boosted. Documents appearing in only one list are still included but weighted lower. The constant k controls how much to trust top ranks versus spreading weight across the ranking.
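The whole method fits in a few lines. This is a minimal sketch assuming each ranking is a best-first list of document IDs; the sample lists are made up for illustration.

```python
from collections import defaultdict

def rrf_merge(rankings, k=60):
    """Fuse multiple ranked lists with Reciprocal Rank Fusion.

    rankings: list of ranked result lists (best first), each a list of doc IDs.
    Returns doc IDs sorted by descending RRF score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["fast-algos", "build-times", "quick-start"]
dense_hits = ["perf-guide", "speedup", "fast-algos"]
fused = rrf_merge([bm25_hits, dense_hits])
# "fast-algos" ranks first because it appears in both lists
```

Because only ranks are used, the BM25 scores (unbounded) and cosine similarities (in [-1, 1]) never need to be put on a common scale.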

Score Normalization and Blending

An alternative: normalize scores from each retriever and blend.

Example with α = 0.6 (Dense: 60%, BM25: 40%):

| Rank | Document | Dense | BM25 | Hybrid |
|---|---|---|---|---|
| 1 | Doc B | 0.85 | 0.88 | 0.86 |
| 2 | Doc D | 0.71 | 0.95 | 0.81 |
| 3 | Doc C | 0.78 | 0.72 | 0.76 |
| 4 | Doc A | 0.92 | 0.45 | 0.73 |

\text{hybrid}(d) = \alpha \cdot \text{normalize}(\text{dense}(d)) + (1-\alpha) \cdot \text{normalize}(\text{bm25}(d))

Normalization can be done several ways. Min-max scaling puts scores in the [0, 1] range. Z-score normalization centers at mean 0 with standard deviation 1. Softmax converts scores to probabilities.

The alpha parameter controls the blend. At α = 1.0, you get pure dense retrieval. At α = 0.0, you get pure BM25. At α = 0.5, both contribute equally. Most systems find optimal performance around 0.5-0.7, slightly favoring dense retrieval.
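A sketch of min-max normalization followed by alpha blending, assuming each retriever returns a dict mapping document IDs to raw scores (the scores below are illustrative):

```python
def min_max(scores):
    """Rescale a {doc: score} dict into the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores are equal
    return {doc: (s - lo) / span for doc, s in scores.items()}

def alpha_blend(dense, bm25, alpha=0.6):
    """Blend normalized dense and BM25 scores; missing docs score 0."""
    dense_n, bm25_n = min_max(dense), min_max(bm25)
    docs = set(dense_n) | set(bm25_n)
    return {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0)
            for d in docs}

dense = {"A": 0.92, "B": 0.85, "C": 0.78, "D": 0.71}
bm25 = {"A": 0.45, "B": 0.88, "C": 0.72, "D": 0.95}
hybrid = alpha_blend(dense, bm25, alpha=0.6)
top = max(hybrid, key=hybrid.get)
```

One subtlety worth noting: because min-max rescaling stretches each list to span [0, 1], the blended ranking can differ from blending the raw scores directly, which is why the normalization choice is itself a tuning decision.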

Recall Comparison

Hybrid gains roughly 10 points of Recall@10 over dense-only (about a 13% relative improvement) by capturing documents that exact matching finds but semantic search misses, and vice versa.

On benchmark datasets, hybrid consistently outperforms either alone:

| Method | Recall@10 | Recall@100 |
|---|---|---|
| BM25 only | 65% | 82% |
| Dense only | 75% | 88% |
| Hybrid (RRF) | 85% | 94% |

The improvement comes from capturing documents that one method misses.

Implementation Patterns

Pattern 1: Parallel retrieval

```python
def parallel_hybrid(query, query_embedding):
    bm25_results = bm25_search(query, k=100)
    dense_results = dense_search(query_embedding, k=100)
    combined = rrf_merge(bm25_results, dense_results)
    return combined[:10]
```

Pattern 2: Dense first, BM25 boost

```python
def dense_with_bm25_boost(query, query_embedding):
    dense_results = dense_search(query_embedding, k=100)
    for doc in dense_results:
        bm25_score = compute_bm25(query, doc)
        doc.score = 0.7 * doc.dense_score + 0.3 * bm25_score
    return sorted(dense_results, key=lambda d: d.score, reverse=True)[:10]
```

Pattern 3: BM25 filter, dense rerank

```python
def bm25_filter_dense_rerank(query, query_embedding):
    bm25_candidates = bm25_search(query, k=1000)
    embeddings = get_embeddings(bm25_candidates)
    dense_scores = cosine_similarity(query_embedding, embeddings)
    return sort_by(bm25_candidates, dense_scores)[:10]
```
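The reranking step of Pattern 3 can be sketched in plain Python. The toy embeddings stand in for a real `get_embeddings()` call, which would come from an embedding model; only the cosine-similarity rerank itself is shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# toy 3-dimensional embeddings standing in for model output
query_emb = [1.0, 0.0, 0.5]
candidates = {
    "doc1": [0.9, 0.1, 0.4],   # close to the query direction
    "doc2": [0.0, 1.0, 0.0],   # orthogonal topic
    "doc3": [0.5, 0.5, 0.5],
}
reranked = sorted(candidates,
                  key=lambda d: cosine(query_emb, candidates[d]),
                  reverse=True)
```

Pattern 3 is attractive when you already run BM25 at scale: only the (much smaller) candidate set needs embeddings, so the vector work stays cheap.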

Learned Sparse Retrieval

Modern alternatives to BM25:

SPLADE: Learn sparse vectors where dimensions correspond to vocabulary terms. Combines semantic learning with sparse representation.

ColBERT: Multiple vectors per document, late interaction. Neither purely sparse nor dense.

Learned term weighting: Replace BM25's statistical weights with learned weights.

These can replace BM25 in hybrid setups for better performance.

Practical Considerations

Latency is the main concern. Running two retrievers doubles compute. Mitigate this by executing retrievals in parallel, using a smaller BM25 candidate set, or caching embeddings where possible.
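Parallel execution is straightforward with a thread pool, since both retrievers are typically I/O-bound. A minimal sketch, with stub retrievers standing in for real index calls:

```python
from concurrent.futures import ThreadPoolExecutor

# stub retrievers for illustration; real systems would query an
# inverted index and a vector index here
def bm25_search(query, k):
    return ["doc-a", "doc-b"][:k]

def dense_search(query, k):
    return ["doc-c", "doc-a"][:k]

def hybrid_search(query, k=100):
    """Run both retrievers concurrently and return both result lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        bm25_future = pool.submit(bm25_search, query, k)
        dense_future = pool.submit(dense_search, query, k)
        # total latency is roughly max(bm25, dense), not their sum
        return bm25_future.result(), dense_future.result()

bm25_hits, dense_hits = hybrid_search("python dict", k=2)
```

With parallel execution, hybrid latency is bounded by the slower of the two retrievers plus the (cheap) fusion step, rather than their sum.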

Indexing requires maintaining both a vector index and an inverted index. This means more storage and more maintenance overhead. Ensure both indices stay synchronized when documents are added or removed.

Tuning is dataset-dependent. The optimal alpha or k parameters vary significantly across domains. Always tune on a validation set with labeled relevance judgments.

Sometimes you should skip hybrid entirely. If latency is critical and dense retrieval is good enough, the added complexity is not worth it. If your domain has no rare or specific terms that BM25 would catch, dense alone may suffice. And if resources are constrained, maintaining two indices may not be feasible.

Key Takeaways

  • Dense retrieval captures semantics; BM25 captures exact matches—each has blind spots
  • Hybrid retrieval combines both, achieving higher recall than either alone
  • Reciprocal Rank Fusion (RRF) merges rankings without needing score normalization
  • Alpha blending normalizes and interpolates scores; tune alpha empirically
  • Hybrid adds latency (two retrievers) and storage (two indices)—worth it for recall-sensitive applications
  • Learned sparse methods (SPLADE) offer an alternative to traditional BM25
  • Typical hybrid gains: 10-20% relative improvement in recall