Evaluation Metrics
Precision, recall, MRR, NDCG: measuring retrieval quality with worked examples
You cannot improve what you do not measure. Retrieval evaluation quantifies how well your system finds relevant documents. Different metrics capture different aspects of quality: Did we find everything? Are the results ranked correctly? Is the first result good?
Precision and Recall
The fundamental trade-off in retrieval.
Precision: Of the documents you retrieved, what fraction are relevant?
Recall: Of all relevant documents, what fraction did you retrieve?
Precision@k: Precision computed on top k results only.
Recall@k: Recall computed on top k results only.
Trade-off: Retrieving more documents increases recall but often decreases precision. Retrieving fewer increases precision but misses relevant documents.
Worked Example: Precision and Recall
Query: "Python error handling"
- Total relevant documents in corpus: 10
- You retrieve 8 documents
- Of those 8, 5 are relevant
Precision = 5/8 = 0.625, Recall = 5/10 = 0.5.
You found half the relevant documents (moderate recall), and most of what you retrieved was relevant (decent precision).
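A minimal sketch of these two metrics in Python. The doc IDs are made up to reproduce the worked example's numbers (10 relevant docs, 8 retrieved, 5 relevant among them):

```python
# Sketch: precision@k and recall@k for a single query.
# The doc IDs below are hypothetical, chosen to match the worked example.

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / len(relevant)

# 10 relevant docs in the corpus; we retrieve 8, of which 5 are relevant.
relevant = {f"rel{i}" for i in range(10)}
retrieved = ["rel0", "rel1", "junk0", "rel2", "junk1", "rel3", "junk2", "rel4"]

print(precision_at_k(retrieved, relevant, 8))  # 0.625
print(recall_at_k(retrieved, relevant, 8))     # 0.5
```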
Mean Reciprocal Rank (MRR)
When you care most about the first relevant result.
| Query | First Relevant Rank | Reciprocal Rank |
|---|---|---|
| Python error handling | 1 | 1.000 |
| Machine learning basics | 3 | 0.333 |
| REST API design | 2 | 0.500 |
For each query, find the rank of the first relevant document; its reciprocal rank is 1/rank. MRR is the mean of the reciprocal ranks across all queries.
Example:
- Query 1: first relevant at rank 1 → RR = 1/1 = 1.0
- Query 2: first relevant at rank 3 → RR = 1/3 ≈ 0.333
- Query 3: first relevant at rank 2 → RR = 1/2 = 0.5
MRR = (1.0 + 0.333 + 0.5) / 3 ≈ 0.611
Use MRR when: Users typically click the first good result (search engines, Q&A).
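The calculation above can be sketched as follows; the per-query relevance lists are illustrative, matching the three queries in the table:

```python
# Sketch: Mean Reciprocal Rank over a query set.
# Each run lists, in rank order, whether that result is relevant.
# The runs below mirror the three example queries.

def reciprocal_rank(labels):
    """1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    return sum(reciprocal_rank(labels) for labels in runs) / len(runs)

runs = [
    [True, False, False],   # "Python error handling": first hit at rank 1
    [False, False, True],   # "Machine learning basics": first hit at rank 3
    [False, True, False],   # "REST API design": first hit at rank 2
]
print(round(mean_reciprocal_rank(runs), 3))  # 0.611
```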
Normalized Discounted Cumulative Gain (NDCG)
When you care about the ranking quality of all results.
NDCG accounts for:
- Graded relevance (not just relevant vs. not relevant)
- Position (earlier is better)
- Normalization (compare across queries)
DCG (Discounted Cumulative Gain): the gain from each result is discounted by the log of its position:
DCG@k = Σ (from i = 1 to k) rel_i / log2(i + 1)
NDCG:
Normalize by the ideal DCG (the DCG of a perfect ranking of the same results):
NDCG@k = DCG@k / IDCG@k
Worked Example: NDCG
Results with relevance scores [3, 2, 0, 1, 2] (3 = highly relevant):
DCG@5 = 3/log2(2) + 2/log2(3) + 0/log2(4) + 1/log2(5) + 2/log2(6) ≈ 3.000 + 1.262 + 0 + 0.431 + 0.774 ≈ 5.47
Ideal ranking [3, 2, 2, 1, 0] (the same scores sorted descending, i.e. the perfect ranking for this set):
IDCG@5 = 3/log2(2) + 2/log2(3) + 2/log2(4) + 1/log2(5) + 0/log2(6) ≈ 3.000 + 1.262 + 1.000 + 0.431 + 0 ≈ 5.69
NDCG@5 = 5.47 / 5.69 ≈ 0.96
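The worked example can be checked with a short sketch using the same log2(position + 1) discount:

```python
import math

# Sketch: DCG and NDCG with the log2(position + 1) discount used above.

def dcg(relevances):
    # enumerate starts at i = 0, so position i+1 gets discount log2(i + 2)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = sorted(relevances, reverse=True)  # perfect ranking of same scores
    return dcg(relevances) / dcg(ideal)

scores = [3, 2, 0, 1, 2]  # the ranking from the worked example
print(round(dcg(scores), 2))                        # 5.47
print(round(dcg(sorted(scores, reverse=True)), 2))  # 5.69
print(round(ndcg(scores), 2))                       # 0.96
```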
Which Metric to Use?
| Metric | Use When | Captures |
|---|---|---|
| Precision@k | You show fixed k results | Quality of what's shown |
| Recall@k | Missing documents is costly | Coverage of relevant docs |
| MRR | First result matters most | Quality of top result |
| NDCG@k | Ranking order matters | Graded ranking quality |
For semantic search: NDCG and Recall are typically most important. You want high recall (find everything relevant) with good ranking (best first).
For RAG: Recall@k matters most. If the LLM sees the relevant passages, it can answer. Ranking within context matters less.
Evaluation Best Practices
1. Create a test set: Queries with labeled relevant documents. 50-100 queries minimum.
2. Use multiple metrics: No single metric tells the whole story.
3. Segment by query type: Performance may vary by category, length, or difficulty.
4. Compare to baseline: Measure improvement relative to previous system or simple baseline.
5. Statistical significance: Improvements should be significant, not random variation.
6. Regular evaluation: Re-evaluate as data and queries evolve.
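The practices above (a labeled test set, multiple metrics) can be combined in a small evaluation loop. This is a minimal sketch; the `retrieve` function, test set, and doc IDs are hypothetical stand-ins for your own system:

```python
# Sketch: evaluate a retriever over a labeled test set, reporting
# precision@k, recall@k, and MRR together (no single metric suffices).

def evaluate(test_set, retrieve, k=5):
    """test_set: list of (query, set_of_relevant_doc_ids).
    retrieve: function mapping a query to a ranked list of doc ids."""
    per_query = []
    for query, relevant in test_set:
        top_k = retrieve(query)[:k]
        hits = [doc for doc in top_k if doc in relevant]
        precision = len(hits) / k
        recall = len(hits) / len(relevant)
        rr = 0.0  # reciprocal rank of the first relevant result
        for rank, doc in enumerate(top_k, start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break
        per_query.append((precision, recall, rr))
    n = len(per_query)
    return {
        f"precision@{k}": sum(p for p, _, _ in per_query) / n,
        f"recall@{k}": sum(r for _, r, _ in per_query) / n,
        "mrr": sum(rr for _, _, rr in per_query) / n,
    }

# Toy usage with a fake retriever: "d1" and "d2" are the relevant docs.
test_set = [("python error handling", {"d1", "d2"})]
fake_retrieve = lambda q: ["d1", "d9", "d2", "d8", "d7"]
print(evaluate(test_set, fake_retrieve))
# precision@5 = 0.4, recall@5 = 1.0, mrr = 1.0
```

In practice you would segment `test_set` by query type and compare these numbers against a baseline run rather than reading them in isolation.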
Common Pitfalls
Incomplete labels: If you only label top-retrieved docs, recall is artificially inflated.
Position bias in labels: Labelers rate higher-ranked docs as more relevant.
Query set bias: Test queries not representative of production traffic.
Metric gaming: Optimizing a metric without improving the user experience.
Key Takeaways
- Precision measures quality of retrieved results; Recall measures coverage of relevant documents
- MRR focuses on the first relevant result—use for single-answer scenarios
- NDCG measures graded ranking quality—use when result order matters
- No single metric tells the whole story; use multiple metrics
- For RAG, Recall@k is typically most important (LLM needs to see relevant passages)
- Build a labeled test set and evaluate regularly
- Watch for evaluation pitfalls: incomplete labels, bias, unrepresentative queries