Context Optimization
How much to retrieve, what to include, and prompt engineering for RAG
The context you provide to the LLM determines answer quality. Too little context: missing information. Too much: confusion, cost, and potential hallucination. Finding the right balance is essential for effective RAG.
Context Window Constraints
LLMs have finite context windows (4K to 200K+ tokens). Within that window, you must fit:
- System prompt: Instructions and guidelines (~100-500 tokens)
- Retrieved context: Your passages (~1000-8000 tokens typical)
- User question: The query (~20-200 tokens)
- Response space: Room for the answer (~500-2000 tokens)
Common mistakes:
- Stuffing too much context, leaving no room for response
- Using entire context window when a fraction would suffice
- Not accounting for response length in planning
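The budget arithmetic above can be sketched in a few lines. This is an illustrative helper, not a real tokenizer: the 4-characters-per-token heuristic, the `fits_budget` name, and the default numbers are all assumptions; in practice, use your model provider's tokenizer and limits.

```python
# Rough token budget for a RAG request. The 4-chars-per-token
# heuristic is an approximation; use the model's tokenizer in practice.
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_budget(system_prompt, passages, question,
                context_window=8000, response_reserve=1000):
    """Check whether the prompt fits, leaving room for the response."""
    used = count_tokens(system_prompt) + count_tokens(question)
    used += sum(count_tokens(p) for p in passages)
    return used + response_reserve <= context_window
```

Note that `response_reserve` is subtracted up front: forgetting it is exactly the "no room for response" mistake listed above.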
How Many Passages?
More is not always better:
Too few (1-2 passages):
- May miss relevant information
- Faster, cheaper
- Higher risk of incomplete answers
Too many (10+ passages):
- Diminishing returns on relevance
- Higher cost (more tokens)
- "Lost in the middle" effect: models neglect mid-context information
- Potential confusion from contradictory passages
Sweet spot (3-5 passages): Often optimal. Enough coverage without overwhelming.
Calibrate empirically for your domain.
Lost in the Middle
Research shows LLMs attend more to the beginning and end of context than the middle. Relevant information buried in the middle may be missed.
Implications:
- Put most relevant passages first
- Consider ending with important context too
- Avoid long middle sections with critical information
Context Ordering Strategies
- Relevance-ordered: Most similar passages first. Natural and usually effective.
- Reverse-ordered: Least similar first, most similar last. Places key info at the end.
- Interleaved: Alternate high/low relevance. Distributes importance.
- Source-grouped: Group by document source. Good for multi-source answers.
- Chronological: Order by date for time-sensitive queries.
Empirically, relevance-ordered with most important first works best for most cases.
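The two most common strategies can be sketched as follows. The function name and the "sandwich" label are illustrative (the sandwich variant splits top passages between the start and end of the context to work around the lost-in-the-middle effect):

```python
def order_passages(passages_with_scores, strategy="relevance"):
    """Order retrieved (passage, score) pairs for the prompt.

    'relevance' puts the most similar passage first; 'sandwich'
    alternates top passages between the front and back of the
    context so the middle holds the least important ones.
    """
    ranked = sorted(passages_with_scores, key=lambda x: x[1], reverse=True)
    if strategy == "relevance":
        return [p for p, _ in ranked]
    if strategy == "sandwich":
        front, back = [], []
        for i, (p, _) in enumerate(ranked):
            (front if i % 2 == 0 else back).append(p)
        return front + back[::-1]  # best at start, second-best at end
    raise ValueError(f"unknown strategy: {strategy}")

hits = [("A", 0.9), ("B", 0.7), ("C", 0.5), ("D", 0.3)]
print(order_passages(hits, "sandwich"))  # ['A', 'C', 'D', 'B']
```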
Prompt Templates
A basic template:

```
Use ONLY the following context to answer. If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}

Answer:
```

The grounding instruction reduces hallucination by telling the model to admit uncertainty.
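Filling such a template is plain string substitution. A minimal sketch (the `build_prompt` helper and the passage separator are illustrative choices):

```python
TEMPLATE = """Use ONLY the following context to answer. \
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}

Answer:"""

def build_prompt(passages, question):
    # Join passages with blank lines so their boundaries stay visible.
    context = "\n\n".join(passages)
    return TEMPLATE.format(context=context, question=question)
```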
Effective RAG prompts include:

Clear role definition:

```
You are a technical documentation assistant.
```

Context delineation:

```
Use ONLY the following context to answer:
---
{context}
---
```

Grounding instructions:

```
If the answer is not in the context, say "I don't have that information."
Do not make up information.
```

Citation guidance:

```
Cite sources using [Source: title] format.
```

Response format:

```
Provide a concise answer in 2-3 sentences.
```

Passage Formatting
How you format passages matters:
Numbered passages:

```
[1] First passage text here...
[2] Second passage text here...
```

Enables citation by number.

Source-labeled:

```
From "User Guide v2.1":
Passage text here...

From "API Reference":
More passage text...
```

Good for transparency.

XML-style:

```
<passage source="doc1.pdf" page="5">
Passage text here...
</passage>
```

Structured, machine-readable.
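The three formats above are simple string transforms. A sketch with illustrative helper names (the input shapes — plain strings, `(source, text)` pairs, and dicts — are assumptions for the example):

```python
def format_numbered(passages):
    """[1] ... style; enables citation by number."""
    return "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))

def format_source_labeled(passages):
    """passages is a list of (source_title, text) pairs."""
    return "\n\n".join(f'From "{src}":\n{text}' for src, text in passages)

def format_xml(passages):
    """passages is a list of dicts with source, page, and text keys."""
    return "\n".join(
        f'<passage source="{p["source"]}" page="{p["page"]}">\n'
        f'{p["text"]}\n</passage>'
        for p in passages
    )
```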
Context Compression
When context is too long, compress it:
Extractive summarization: Pull key sentences from each passage.
Abstractive summarization: Rewrite passages to be shorter while preserving meaning.
Query-focused extraction: Keep only sentences relevant to the query.
Progressive summarization: Summarize long documents in stages.
Trade-off: Compression may lose nuance. Use when necessary, not by default.
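Query-focused extraction can be approximated with a crude lexical filter. This sketch keeps sentences sharing terms with the query; real systems typically score sentences with the same embedding model used for retrieval (the function name and the naive sentence split are assumptions):

```python
def query_focused_extract(passage: str, query: str, min_overlap: int = 1) -> str:
    """Keep only sentences sharing at least min_overlap terms with the query.

    Naive: splits sentences on '. ' and matches on lowercased words.
    """
    query_terms = set(query.lower().split())
    kept = []
    for sentence in passage.split(". "):
        terms = set(sentence.lower().replace(".", "").split())
        if len(terms & query_terms) >= min_overlap:
            kept.append(sentence.rstrip("."))
    return ". ".join(kept) + ("." if kept else "")
```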
Dynamic Context Selection
Adapt context to the query:
Confidence-based cutoff: Include passages above similarity threshold.
Marginal relevance: Diversify context to cover different aspects.
Query-type aware: Different query types need different context amounts.
Iterative retrieval: Start with less, retrieve more if initial answer is uncertain.
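The confidence-based cutoff is the simplest of these to implement. A minimal sketch (the threshold and cap values are placeholders; thresholds are corpus- and embedding-model-specific, so calibrate on held-out queries):

```python
def select_context(ranked_hits, min_score=0.3, max_passages=5):
    """Keep (passage, score) hits above a similarity threshold,
    capped at max_passages. Assumes hits are sorted by score."""
    selected = [p for p, score in ranked_hits if score >= min_score]
    return selected[:max_passages]
```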
Cost Optimization
Context tokens cost money:
Input tokens are typically priced well below output tokens, but a RAG request contains far more of them: 3000 context tokens + 500 output tokens = 3500 total, dominated by context.
Strategies:
- Use shorter context for simple queries
- Cache common contexts
- Summarize verbose passages
- Truncate passages to relevant sections
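Truncation to a token budget can be done greedily. A sketch reusing the 4-chars-per-token approximation from earlier (the helper name and the character-level cut are assumptions; cutting at sentence boundaries is usually kinder to the model):

```python
def truncate_to_budget(passages, budget_tokens, chars_per_token=4):
    """Greedily pack passages until the token budget is exhausted,
    character-truncating the last passage that only partially fits."""
    out, used = [], 0
    for p in passages:
        cost = len(p) // chars_per_token + 1
        if used + cost <= budget_tokens:
            out.append(p)
            used += cost
        else:
            remaining_chars = (budget_tokens - used) * chars_per_token
            if remaining_chars > 0:
                out.append(p[:remaining_chars])
            break
    return out
```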
Key Takeaways
- Balance context size: too little misses info, too much confuses and costs
- Account for response space when planning context window usage
- "Lost in the middle" effect: put important passages first (or last)
- Use clear formatting: numbered passages or source labels for citations
- Include grounding instructions: tell the model to admit uncertainty
- Compress context when necessary but preserve essential information
- Typical sweet spot: 3-5 well-chosen passages