Evaluation Metrics
Precision, recall, MRR, NDCG: measuring retrieval quality with worked examples
You cannot improve what you do not measure. Retrieval evaluation quantifies how well your system finds relevant documents. Different metrics capture different aspects of quality: Did we find everything? Are the results ranked correctly? Is the first result good?
Precision and Recall
The fundamental trade-off in retrieval.
Precision: Of the documents you retrieved, what fraction are relevant?
Recall: Of all relevant documents, what fraction did you retrieve?
Precision@k: Precision computed on top k results only.
Recall@k: Recall computed on top k results only.
Trade-off: Retrieving more documents increases recall but often decreases precision. Retrieving fewer increases precision but misses relevant documents.
Worked Example: Precision and Recall
Query: "Python error handling"
- Total relevant documents in corpus: 10
- You retrieve 8 documents
- Of those 8, 5 are relevant
Precision = 5/8 = 0.625, Recall = 5/10 = 0.5.
You found half the relevant documents (moderate recall), and most of what you retrieved was relevant (decent precision).
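A minimal sketch of these two metrics in Python. The doc IDs are made up to reproduce the worked example's numbers (10 relevant docs, 8 retrieved, 5 relevant among them):

```python
# Sketch: precision@k and recall@k for a single query.
# The doc IDs below are hypothetical, chosen to match the worked example.

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / len(relevant)

# 10 relevant docs in the corpus; we retrieve 8, of which 5 are relevant.
relevant = {f"rel{i}" for i in range(10)}
retrieved = ["rel0", "rel1", "junk0", "rel2", "junk1", "rel3", "junk2", "rel4"]

print(precision_at_k(retrieved, relevant, 8))  # 0.625
print(recall_at_k(retrieved, relevant, 8))     # 0.5
```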
Mean Reciprocal Rank (MRR)
When you care most about the first relevant result.
| Query | First Relevant Rank | Reciprocal Rank |
|---|---|---|
| Python error handling | 1 | 1.000 |
| Machine learning basics | 3 | 0.333 |
| REST API design | 2 | 0.500 |
For each query, find the rank of the first relevant document; its reciprocal rank is 1/rank. MRR is the mean of the reciprocal ranks across all queries.
Example:
- Query 1: first relevant at rank 1 → RR = 1/1 = 1.0
- Query 2: first relevant at rank 3 → RR = 1/3 ≈ 0.333
- Query 3: first relevant at rank 2 → RR = 1/2 = 0.5
MRR = (1.0 + 0.333 + 0.5) / 3 ≈ 0.611
Use MRR when: Users typically click the first good result (search engines, Q&A).
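The calculation above can be sketched as follows; the per-query relevance lists are illustrative, matching the three queries in the table:

```python
# Sketch: Mean Reciprocal Rank over a query set.
# Each run lists, in rank order, whether that result is relevant.
# The runs below mirror the three example queries.

def reciprocal_rank(labels):
    """1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    return sum(reciprocal_rank(labels) for labels in runs) / len(runs)

runs = [
    [True, False, False],   # "Python error handling": first hit at rank 1
    [False, False, True],   # "Machine learning basics": first hit at rank 3
    [False, True, False],   # "REST API design": first hit at rank 2
]
print(round(mean_reciprocal_rank(runs), 3))  # 0.611
```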
Normalized Discounted Cumulative Gain (NDCG)
When you care about the ranking quality of all results.
NDCG accounts for:
- Graded relevance (not just relevant vs. not relevant)
- Position (earlier is better)
- Normalization (compare across queries)
DCG (Discounted Cumulative Gain): the gain from each result is discounted by the log of its position:
DCG@k = Σ (from i = 1 to k) rel_i / log2(i + 1)
NDCG:
Normalize by the ideal DCG (the DCG of a perfect ranking of the same results):
NDCG@k = DCG@k / IDCG@k
Worked Example: NDCG
Results with relevance scores [3, 2, 0, 1, 2] (3 = highly relevant):
DCG@5 = 3/log2(2) + 2/log2(3) + 0/log2(4) + 1/log2(5) + 2/log2(6) ≈ 3.000 + 1.262 + 0 + 0.431 + 0.774 ≈ 5.47
Ideal ranking [3, 2, 2, 1, 0] (the same scores sorted descending, i.e. the perfect ranking for this set):
IDCG@5 = 3/log2(2) + 2/log2(3) + 2/log2(4) + 1/log2(5) + 0/log2(6) ≈ 3.000 + 1.262 + 1.000 + 0.431 + 0 ≈ 5.69
NDCG@5 = 5.47 / 5.69 ≈ 0.96
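The worked example can be checked with a short sketch using the same log2(position + 1) discount:

```python
import math

# Sketch: DCG and NDCG with the log2(position + 1) discount used above.

def dcg(relevances):
    # enumerate starts at i = 0, so position i+1 gets discount log2(i + 2)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = sorted(relevances, reverse=True)  # perfect ranking of same scores
    return dcg(relevances) / dcg(ideal)

scores = [3, 2, 0, 1, 2]  # the ranking from the worked example
print(round(dcg(scores), 2))                        # 5.47
print(round(dcg(sorted(scores, reverse=True)), 2))  # 5.69
print(round(ndcg(scores), 2))                       # 0.96
```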
Which Metric to Use?
| Metric | Use When | Captures |
|---|---|---|
| Precision@k | You show fixed k results | Quality of what's shown |
| Recall@k | Missing documents is costly | Coverage of relevant docs |
| MRR | First result matters most | Quality of top result |
| NDCG@k | Ranking order matters | Graded ranking quality |
For semantic search: NDCG and Recall are typically most important. You want high recall (find everything relevant) with good ranking (best first).
For RAG: Recall@k matters most. If the LLM sees the relevant passages, it can answer. Ranking within context matters less.
Evaluation Best Practices
1. Create a test set: Queries with labeled relevant documents. 50-100 queries minimum.
2. Use multiple metrics: No single metric tells the whole story.
3. Segment by query type: Performance may vary by category, length, or difficulty.
4. Compare to baseline: Measure improvement relative to previous system or simple baseline.
5. Statistical significance: Improvements should be significant, not random variation.
6. Regular evaluation: Re-evaluate as data and queries evolve.
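The practices above (a labeled test set, multiple metrics) can be combined in a small evaluation loop. This is a minimal sketch; the `retrieve` function, test set, and doc IDs are hypothetical stand-ins for your own system:

```python
# Sketch: evaluate a retriever over a labeled test set, reporting
# precision@k, recall@k, and MRR together (no single metric suffices).

def evaluate(test_set, retrieve, k=5):
    """test_set: list of (query, set_of_relevant_doc_ids).
    retrieve: function mapping a query to a ranked list of doc ids."""
    per_query = []
    for query, relevant in test_set:
        top_k = retrieve(query)[:k]
        hits = [doc for doc in top_k if doc in relevant]
        precision = len(hits) / k
        recall = len(hits) / len(relevant)
        rr = 0.0  # reciprocal rank of the first relevant result
        for rank, doc in enumerate(top_k, start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break
        per_query.append((precision, recall, rr))
    n = len(per_query)
    return {
        f"precision@{k}": sum(p for p, _, _ in per_query) / n,
        f"recall@{k}": sum(r for _, r, _ in per_query) / n,
        "mrr": sum(rr for _, _, rr in per_query) / n,
    }

# Toy usage with a fake retriever: "d1" and "d2" are the relevant docs.
test_set = [("python error handling", {"d1", "d2"})]
fake_retrieve = lambda q: ["d1", "d9", "d2", "d8", "d7"]
print(evaluate(test_set, fake_retrieve))
# precision@5 = 0.4, recall@5 = 1.0, mrr = 1.0
```

In practice you would segment `test_set` by query type and compare these numbers against a baseline run rather than reading them in isolation.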
Common Pitfalls
Incomplete labels: If you only label top-retrieved docs, recall is artificially inflated.
Position bias in labels: Labelers rate higher-ranked docs as more relevant.
Query set bias: Test queries not representative of production traffic.
Metric gaming: Optimizing a metric without improving the user experience.
Key Takeaways
- Precision measures quality of retrieved results; Recall measures coverage of relevant documents
- MRR focuses on the first relevant result—use for single-answer scenarios
- NDCG measures graded ranking quality—use when result order matters
- No single metric tells the whole story; use multiple metrics
- For RAG, Recall@k is typically most important (LLM needs to see relevant passages)
- Build a labeled test set and evaluate regularly
- Watch for evaluation pitfalls: incomplete labels, bias, unrepresentative queries