Hybrid Retrieval
Combining BM25 and dense retrieval with reciprocal rank fusion and beyond
Dense retrieval with embeddings captures semantic similarity. Keyword search with BM25 captures exact matches. Neither is universally better. Hybrid retrieval combines both, getting the best of each approach.
BM25 vs Dense Retrieval
Example: top results for a query like "Python dict"

BM25 results:

1. Python dict documentation
2. Dictionary methods in Python
3. dict vs list performance

Dense results:

1. Python dictionary tutorial
2. Hash maps in Python
3. Key-value data structures

BM25 finds exact "dict" matches; dense retrieval finds semantically related "dictionary" and "hash map" content.
BM25 (sparse/keyword search) matches exact terms. It excels at finding specific names, codes, and rare terms. It has no semantic understanding—the words must match—but it is fast and interpretable.
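To make BM25's term matching concrete, here is a minimal, self-contained Okapi BM25 scorer. The function name, toy corpus, and default parameters (k1 = 1.5, b = 0.75, which are common defaults) are illustrative, not taken from any particular library:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query.

    corpus: list of tokenized documents, used for IDF and average length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(term)  # term frequency in this document
        score += idf * (tf * (k1 + 1)) / (
            tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        )
    return score

corpus = [
    "python dict documentation".split(),
    "dictionary methods in python".split(),
    "hash maps in python".split(),
]
query = "python dict".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
# The document containing the exact token "dict" scores highest;
# "dictionary" and "hash maps" get no credit because the tokens differ.
```

Note how the rare term "dict" carries most of the weight: IDF rewards terms that appear in few documents, which is exactly why BM25 excels at rare technical terms.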
Dense retrieval (embedding search) captures meaning. It handles synonyms and paraphrases naturally. It may miss exact matches if the embedding model was not trained on that terminology, and it requires embedding computation at query time.
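Dense retrieval's core operation, by contrast, is nearest-neighbor search over embedding vectors. A sketch with made-up 3-dimensional vectors standing in for real embeddings (which typically have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Toy 3-d "embeddings"; the values are invented for illustration
query_vec = [0.9, 0.1, 0.3]
docs = {
    "Python dictionary tutorial": [0.8, 0.2, 0.4],
    "Key-value data structures": [0.5, 0.5, 0.5],
    "Gardening for beginners": [0.1, 0.9, 0.2],
}
ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
# Documents whose vectors point in a similar direction to the query rank first,
# regardless of whether any tokens overlap.
```

No token in the query needs to appear in the document; similarity lives entirely in the vector space, which is both the strength (synonyms, paraphrases) and the weakness (rare terms the model never learned) described above.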
Each wins in different scenarios. BM25 wins when you search "Error code E1234" and need an exact match, when dealing with rare technical terms not in the embedding training data, and when looking for specific proper nouns or product names. Dense wins when you search "How to make code faster" and want documents about "performance optimization," for cross-lingual retrieval where the same concept is expressed in different languages, and whenever conceptual similarity matters more than lexical overlap.
Why Combine Them?
Each method has blind spots the other covers:
| Query | BM25 | Dense | Hybrid |
|---|---|---|---|
| "Python dict" | Finds "dict" exactly | Finds "dictionary" | Finds both |
| "make code fast" | Misses "optimization" | Finds it | Finds it |
| "error ABC123" | Exact match | May miss | Exact match |
| "deployment problems" | Misses "issues" | Finds it | Finds it |
Hybrid achieves higher recall than either alone.
Reciprocal Rank Fusion (RRF)
The most common fusion method: combine rankings, not scores.
Example: fusing a BM25 ranking and a dense ranking with k = 60
| Document | BM25 Rank | Dense Rank | RRF Score | Final Rank |
|---|---|---|---|---|
| Fast algorithms explained | 1 | 4 | 0.0320 | 1 |
| Performance optimization guide | 5 | 1 | 0.0318 | 2 |
| Quick start guide | 3 | 6 | 0.0310 | 3 |
| Speed up your code | 8 | 2 | 0.0308 | 4 |
| Faster build times | 2 | — | 0.0161 | 5 |
| Code efficiency tips | — | 3 | 0.0159 | 6 |
For each document d appearing in any result list, sum a reciprocal-rank contribution from every retriever:

RRF(d) = Σ_i 1 / (k + rank_i(d))

where k is a constant (typically 60) that dampens the effect of rank, and rank_i(d) is d's 1-based rank in retriever i's list; a document missing from a list contributes nothing from that list.
RRF is rank-based, so there is no need to normalize scores across different retrieval systems. Documents that rank highly in both lists get boosted. Documents appearing in only one list are still included but weighted lower. The constant k controls how much to trust top ranks versus spreading weight across the ranking.
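A minimal RRF implementation along these lines (the function name and toy document IDs are illustrative):

```python
def rrf_merge(*ranked_lists, k=60):
    """Fuse ranked result lists with Reciprocal Rank Fusion.

    Each input is a list of doc IDs ordered best-first; ranks are 1-based.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["fast-algos", "faster-builds", "quickstart", "speedup"]
dense = ["perf-guide", "speedup", "efficiency", "fast-algos"]
fused = rrf_merge(bm25, dense)
# "fast-algos" and "speedup" appear in both lists and rise to the top;
# single-list documents are kept but score lower.
```

Because only ranks enter the formula, the raw BM25 and cosine scores never need to be put on a common scale.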
Score Normalization and Blending
An alternative: normalize scores from each retriever and blend.
Normalization can be done several ways. Min-max scaling puts scores in the [0, 1] range. Z-score normalization centers at mean 0 with standard deviation 1. Softmax converts scores to probabilities.
The alpha parameter controls the blend: combined = α × dense + (1 − α) × bm25. At α = 1.0, you get pure dense retrieval. At α = 0.0, you get pure BM25. At α = 0.5, both contribute equally. Most systems find optimal performance around 0.5-0.7, slightly favoring dense retrieval.
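A sketch of min-max normalization plus alpha blending, using made-up scores. The normalization step matters because raw BM25 scores are unbounded while cosine similarities fall in [−1, 1]:

```python
def min_max(scores):
    """Scale a dict of raw scores into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {doc: (s - lo) / span for doc, s in scores.items()}

def blend(bm25_scores, dense_scores, alpha=0.6):
    """combined = alpha * dense + (1 - alpha) * bm25, after min-max scaling."""
    b, d = min_max(bm25_scores), min_max(dense_scores)
    docs = set(b) | set(d)
    combined = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
        for doc in docs
    }
    return sorted(combined, key=combined.get, reverse=True)

# Invented raw scores: BM25 is unbounded, dense is a cosine similarity
bm25_scores = {"doc-a": 12.3, "doc-b": 7.1, "doc-c": 2.4}
dense_scores = {"doc-b": 0.91, "doc-c": 0.88, "doc-d": 0.52}
ranking = blend(bm25_scores, dense_scores, alpha=0.6)
# "doc-b" wins: it scores well under both retrievers.
```

Documents found by only one retriever get 0 for the missing side; a variant is to backfill with that retriever's minimum score instead.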
Recall Comparison
Hybrid advantage: roughly +13% relative improvement in Recall@10 over dense-only (85% vs. 75%), by also capturing documents that exact matching finds but semantic search misses.
On benchmark datasets, hybrid consistently outperforms either alone:
| Method | Recall@10 | Recall@100 |
|---|---|---|
| BM25 only | 65% | 82% |
| Dense only | 75% | 88% |
| Hybrid (RRF) | 85% | 94% |
The improvement comes from capturing documents that one method misses.
Implementation Patterns
Pattern 1: Parallel retrieval

```python
def hybrid_search(query, query_embedding):
    # Run both retrievers independently, then fuse the rankings
    bm25_results = bm25_search(query, k=100)
    dense_results = dense_search(query_embedding, k=100)
    combined = rrf_merge(bm25_results, dense_results)
    return combined[:10]
```

Pattern 2: Dense first, BM25 boost

```python
def boosted_search(query, query_embedding):
    dense_results = dense_search(query_embedding, k=100)
    for doc in dense_results:
        # Blend each dense score with an on-the-fly BM25 score
        bm25_score = compute_bm25(query, doc)
        doc.score = 0.7 * doc.dense_score + 0.3 * bm25_score
    return sorted(dense_results, key=lambda d: d.score, reverse=True)[:10]
```

Pattern 3: BM25 filter, dense rerank

```python
def filtered_rerank(query, query_embedding):
    # Cheap keyword filter down to 1,000 candidates, then semantic rerank
    bm25_candidates = bm25_search(query, k=1000)
    embeddings = get_embeddings(bm25_candidates)
    dense_scores = cosine_similarity(query_embedding, embeddings)
    return sort_by(bm25_candidates, dense_scores)[:10]
```

Learned Sparse Retrieval
Modern alternatives to BM25:
SPLADE: Learn sparse vectors where dimensions correspond to vocabulary terms. Combines semantic learning with sparse representation.
ColBERT: Multiple vectors per document, late interaction. Neither purely sparse nor dense.
Learned term weighting: Replace BM25's statistical weights with learned weights.
These can replace BM25 in hybrid setups for better performance.
Practical Considerations
Latency is the main concern. Running two retrievers doubles compute. Mitigate this by executing retrievals in parallel, using a smaller BM25 candidate set, or caching embeddings where possible.
Indexing requires maintaining both a vector index and an inverted index. This means more storage and more maintenance overhead. Ensure both indices stay synchronized when documents are added or removed.
Tuning is dataset-dependent. The optimal alpha or k parameters vary significantly across domains. Always tune on a validation set with labeled relevance judgments.
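One way to tune alpha is a simple grid sweep over the validation set. A toy sketch, assuming score dicts that are already normalized to [0, 1]; the data, helper names, and tiny two-query "validation set" are invented, and a real setup would evaluate hundreds of labeled queries:

```python
def recall_at_k(ranking, relevant, k):
    """Fraction of relevant docs found in the top k."""
    return len(set(ranking[:k]) & relevant) / len(relevant)

def tune_alpha(validation, alphas, k=10):
    """Return the alpha with the best mean recall@k on labeled queries.

    validation: (bm25_scores, dense_scores, relevant_ids) triples, with
    both score dicts already normalized to [0, 1].
    """
    best_alpha, best_recall = None, -1.0
    for alpha in alphas:
        total = 0.0
        for bm25, dense, relevant in validation:
            docs = set(bm25) | set(dense)
            ranking = sorted(
                docs,
                key=lambda d: alpha * dense.get(d, 0.0)
                + (1 - alpha) * bm25.get(d, 0.0),
                reverse=True,
            )
            total += recall_at_k(ranking, relevant, k)
        mean_recall = total / len(validation)
        if mean_recall > best_recall:
            best_alpha, best_recall = alpha, mean_recall
    return best_alpha, best_recall

# Two labeled toy queries: one keyword-heavy, one semantic
validation = [
    ({"a": 1.0, "b": 0.0}, {"a": 0.0, "b": 0.9}, {"a"}),
    ({"c": 0.9, "d": 0.0}, {"c": 0.0, "d": 1.0}, {"d"}),
]
best_alpha, best_recall = tune_alpha(validation, alphas=[0.0, 0.5, 1.0], k=1)
# Neither pure BM25 nor pure dense satisfies both queries; the blend does.
```

The same sweep structure works for RRF's k constant: substitute rrf-style scoring for the blend and compare mean recall across candidate values.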
Sometimes you should skip hybrid entirely. If latency is critical and dense retrieval is good enough, the added complexity is not worth it. If your domain has no rare or specific terms that BM25 would catch, dense alone may suffice. And if resources are constrained, maintaining two indices may not be feasible.
Key Takeaways
- Dense retrieval captures semantics; BM25 captures exact matches—each has blind spots
- Hybrid retrieval combines both, achieving higher recall than either alone
- Reciprocal Rank Fusion (RRF) merges rankings without needing score normalization
- Alpha blending normalizes and interpolates scores; tune alpha empirically
- Hybrid adds latency (two retrievers) and storage (two indices)—worth it for recall-sensitive applications
- Learned sparse methods (SPLADE) offer an alternative to traditional BM25
- Typical hybrid gains: 10-20% relative improvement in recall