Hybrid Retrieval

Combining BM25 and dense retrieval with reciprocal rank fusion and beyond

Dense retrieval with embeddings captures semantic similarity. Keyword search with BM25 captures exact matches. Neither is universally better. Hybrid retrieval combines both, getting the best of each approach.

BM25 vs Dense Retrieval

Example: results for the query "Python dict"

BM25 results:

  1. Python dict documentation
  2. Dictionary methods in Python
  3. dict vs list performance

Dense results:

  1. Python dictionary tutorial
  2. Hash maps in Python
  3. Key-value data structures

Observation: BM25 finds exact 'dict' matches; dense retrieval finds semantically related 'dictionary' and 'hash map' content.

BM25 (sparse/keyword search) matches exact terms. It excels at finding specific names, codes, and rare terms. It has no semantic understanding—the words must match—but it is fast and interpretable.

Dense retrieval (embedding search) captures meaning. It handles synonyms and paraphrases naturally. It may miss exact matches if the embedding model was not trained on that terminology, and it requires embedding computation at query time.

Each wins in different scenarios. BM25 wins when you search "Error code E1234" and need an exact match, when dealing with rare technical terms not in the embedding training data, and when looking for specific proper nouns or product names. Dense wins when you search "How to make code faster" and want documents about "performance optimization," for cross-lingual retrieval where the same concept is expressed in different languages, and whenever conceptual similarity matters more than lexical overlap.
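To make the sparse side concrete, here is a minimal sketch of Okapi BM25 scoring. The tokenization, corpus, and parameter values (k1 = 1.5, b = 0.75) are illustrative assumptions, not a production implementation; real systems precompute document frequencies in an inverted index.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with Okapi BM25."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    n_docs = len(corpus)
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        # document frequency: how many documents contain the term
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        freq = tf[term]
        score += idf * (freq * (k1 + 1)) / (
            freq + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

docs = [
    "python dict documentation".split(),
    "dictionary methods in python".split(),
    "hash maps in python".split(),
]
query = "python dict".split()
scores = [bm25_score(query, d, docs) for d in docs]
best = max(range(len(docs)), key=lambda i: scores[i])
# the document containing the exact token "dict" scores highest
```

Note how the rare term "dict" gets a high IDF weight while the common term "python" contributes little: exactly the "exact, rare terms win" behavior described above.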

Why Combine Them?

Each method has blind spots the other covers:

| Query | BM25 | Dense | Hybrid |
|---|---|---|---|
| "Python dict" | Finds "dict" exactly | Finds "dictionary" | Finds both |
| "make code fast" | Misses "optimization" | Finds it | Finds it |
| "error ABC123" | Exact match | May miss | Exact match |
| "deployment problems" | Misses "issues" | Finds it | Finds it |

Hybrid achieves higher recall than either alone.

Reciprocal Rank Fusion (RRF)

The most common fusion method: combine rankings, not scores.

Example fusion of two ranked lists, with RRF(d) = Σ 1/(k + rank_i(d)) and k = 60 ("—" means the document is absent from that list):

| Document | BM25 Rank | Dense Rank | RRF Score | Final Rank |
|---|---|---|---|---|
| Fast algorithms explained | 1 | 4 | 0.0320 | 1 |
| Performance optimization guide | 5 | 1 | 0.0318 | 2 |
| Quick start guide | 3 | 6 | 0.0310 | 3 |
| Speed up your code | 8 | 2 | 0.0308 | 4 |
| Faster build times | 2 | — | 0.0161 | 5 |
| Code efficiency tips | — | 3 | 0.0159 | 6 |

For each document appearing in any result list:

\text{RRF}(d) = \sum_{r \in \text{rankings}} \frac{1}{k + \text{rank}_r(d)}

Where k is a constant (typically 60) that dampens the effect of rank.

RRF is rank-based, so there is no need to normalize scores across different retrieval systems. Documents that rank highly in both lists get boosted. Documents appearing in only one list are still included but weighted lower. The constant k controls how much to trust top ranks versus spreading weight across the ranking.
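The whole method fits in a few lines. This is a minimal sketch assuming each ranking is a best-first list of document IDs; the sample lists are made up for illustration.

```python
from collections import defaultdict

def rrf_merge(rankings, k=60):
    """Fuse multiple ranked lists with Reciprocal Rank Fusion.

    rankings: list of ranked result lists (best first), each a list of doc IDs.
    Returns doc IDs sorted by descending RRF score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["fast-algos", "build-times", "quick-start"]
dense_hits = ["perf-guide", "speedup", "fast-algos"]
fused = rrf_merge([bm25_hits, dense_hits])
# "fast-algos" ranks first because it appears in both lists
```

Because only ranks are used, the BM25 scores (unbounded) and cosine similarities (in [-1, 1]) never need to be put on a common scale.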

Score Normalization and Blending

An alternative: normalize scores from each retriever and blend.

Example with α = 0.6 (Dense: 60%, BM25: 40%):

| Rank | Document | Dense | BM25 | Hybrid |
|---|---|---|---|---|
| 1 | Doc B | 0.85 | 0.88 | 0.86 |
| 2 | Doc D | 0.71 | 0.95 | 0.81 |
| 3 | Doc C | 0.78 | 0.72 | 0.76 |
| 4 | Doc A | 0.92 | 0.45 | 0.73 |

\text{hybrid}(d) = \alpha \cdot \text{normalize}(\text{dense}(d)) + (1-\alpha) \cdot \text{normalize}(\text{bm25}(d))

Normalization can be done several ways. Min-max scaling puts scores in the [0, 1] range. Z-score normalization centers at mean 0 with standard deviation 1. Softmax converts scores to probabilities.

The alpha parameter controls the blend. At α = 1.0, you get pure dense retrieval. At α = 0.0, you get pure BM25. At α = 0.5, both contribute equally. Most systems find optimal performance around 0.5-0.7, slightly favoring dense retrieval.
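A sketch of min-max normalization followed by alpha blending, assuming each retriever returns a dict mapping document IDs to raw scores (the scores below are illustrative):

```python
def min_max(scores):
    """Rescale a {doc: score} dict into the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores are equal
    return {doc: (s - lo) / span for doc, s in scores.items()}

def alpha_blend(dense, bm25, alpha=0.6):
    """Blend normalized dense and BM25 scores; missing docs score 0."""
    dense_n, bm25_n = min_max(dense), min_max(bm25)
    docs = set(dense_n) | set(bm25_n)
    return {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0)
            for d in docs}

dense = {"A": 0.92, "B": 0.85, "C": 0.78, "D": 0.71}
bm25 = {"A": 0.45, "B": 0.88, "C": 0.72, "D": 0.95}
hybrid = alpha_blend(dense, bm25, alpha=0.6)
top = max(hybrid, key=hybrid.get)
```

One subtlety worth noting: because min-max rescaling stretches each list to span [0, 1], the blended ranking can differ from blending the raw scores directly, which is why the normalization choice is itself a tuning decision.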

Recall Comparison

Hybrid gains roughly 10 points of Recall@10 over dense-only (about a 13% relative improvement) by capturing documents that exact matching finds but semantic search misses, and vice versa.

On benchmark datasets, hybrid consistently outperforms either alone:

| Method | Recall@10 | Recall@100 |
|---|---|---|
| BM25 only | 65% | 82% |
| Dense only | 75% | 88% |
| Hybrid (RRF) | 85% | 94% |

The improvement comes from capturing documents that one method misses.

Implementation Patterns

Pattern 1: Parallel retrieval

```python
def parallel_hybrid(query, query_embedding):
    bm25_results = bm25_search(query, k=100)
    dense_results = dense_search(query_embedding, k=100)
    combined = rrf_merge(bm25_results, dense_results)
    return combined[:10]
```

Pattern 2: Dense first, BM25 boost

```python
def dense_with_bm25_boost(query, query_embedding):
    dense_results = dense_search(query_embedding, k=100)
    for doc in dense_results:
        bm25_score = compute_bm25(query, doc)
        doc.score = 0.7 * doc.dense_score + 0.3 * bm25_score
    return sorted(dense_results, key=lambda d: d.score, reverse=True)[:10]
```

Pattern 3: BM25 filter, dense rerank

```python
def bm25_filter_dense_rerank(query, query_embedding):
    bm25_candidates = bm25_search(query, k=1000)
    embeddings = get_embeddings(bm25_candidates)
    dense_scores = cosine_similarity(query_embedding, embeddings)
    return sort_by(bm25_candidates, dense_scores)[:10]
```
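The reranking step of Pattern 3 can be sketched in plain Python. The toy embeddings stand in for a real `get_embeddings()` call, which would come from an embedding model; only the cosine-similarity rerank itself is shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# toy 3-dimensional embeddings standing in for model output
query_emb = [1.0, 0.0, 0.5]
candidates = {
    "doc1": [0.9, 0.1, 0.4],   # close to the query direction
    "doc2": [0.0, 1.0, 0.0],   # orthogonal topic
    "doc3": [0.5, 0.5, 0.5],
}
reranked = sorted(candidates,
                  key=lambda d: cosine(query_emb, candidates[d]),
                  reverse=True)
```

Pattern 3 is attractive when you already run BM25 at scale: only the (much smaller) candidate set needs embeddings, so the vector work stays cheap.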

Learned Sparse Retrieval

Modern alternatives to BM25:

SPLADE: Learn sparse vectors where dimensions correspond to vocabulary terms. Combines semantic learning with sparse representation.

ColBERT: Multiple vectors per document, late interaction. Neither purely sparse nor dense.

Learned term weighting: Replace BM25's statistical weights with learned weights.

These can replace BM25 in hybrid setups for better performance.

Practical Considerations

Latency is the main concern. Running two retrievers doubles compute. Mitigate this by executing retrievals in parallel, using a smaller BM25 candidate set, or caching embeddings where possible.
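Parallel execution is straightforward with a thread pool, since both retrievers are typically I/O-bound. A minimal sketch, with stub retrievers standing in for real index calls:

```python
from concurrent.futures import ThreadPoolExecutor

# stub retrievers for illustration; real systems would query an
# inverted index and a vector index here
def bm25_search(query, k):
    return ["doc-a", "doc-b"][:k]

def dense_search(query, k):
    return ["doc-c", "doc-a"][:k]

def hybrid_search(query, k=100):
    """Run both retrievers concurrently and return both result lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        bm25_future = pool.submit(bm25_search, query, k)
        dense_future = pool.submit(dense_search, query, k)
        # total latency is roughly max(bm25, dense), not their sum
        return bm25_future.result(), dense_future.result()

bm25_hits, dense_hits = hybrid_search("python dict", k=2)
```

With parallel execution, hybrid latency is bounded by the slower of the two retrievers plus the (cheap) fusion step, rather than their sum.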

Indexing requires maintaining both a vector index and an inverted index. This means more storage and more maintenance overhead. Ensure both indices stay synchronized when documents are added or removed.

Tuning is dataset-dependent. The optimal alpha or k parameters vary significantly across domains. Always tune on a validation set with labeled relevance judgments.

Sometimes you should skip hybrid entirely. If latency is critical and dense retrieval is good enough, the added complexity is not worth it. If your domain has no rare or specific terms that BM25 would catch, dense alone may suffice. And if resources are constrained, maintaining two indices may not be feasible.

Key Takeaways

  • Dense retrieval captures semantics; BM25 captures exact matches—each has blind spots
  • Hybrid retrieval combines both, achieving higher recall than either alone
  • Reciprocal Rank Fusion (RRF) merges rankings without needing score normalization
  • Alpha blending normalizes and interpolates scores; tune alpha empirically
  • Hybrid adds latency (two retrievers) and storage (two indices)—worth it for recall-sensitive applications
  • Learned sparse methods (SPLADE) offer an alternative to traditional BM25
  • Typical hybrid gains: 10-20% relative improvement in recall