RAG Architecture
The retrieve-then-generate pattern and its variants
Retrieval-Augmented Generation (RAG) bridges semantic search and language models. Instead of relying solely on what an LLM knows from training, RAG retrieves relevant documents and provides them as context. The LLM generates answers grounded in actual sources.
The Basic RAG Pattern
The flow is straightforward:
- User query: "What are the side effects of aspirin?"
- Retrieve: Semantic search finds relevant passages from a medical database
- Augment: Insert retrieved passages into the prompt as context
- Generate: LLM reads context and produces an answer
- Return: User sees the answer (optionally with source citations)
The power: LLMs are excellent at reading comprehension. Given relevant context, they can synthesize, summarize, and answer accurately—without needing that knowledge baked into weights.
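The five steps above can be sketched end to end. This is a toy illustration, not a specific library's API: the keyword-overlap retriever stands in for real semantic search, and no actual LLM is called.

```python
import re

# A minimal retrieve-augment sketch. `retrieve` and `augment` are
# illustrative names; real systems swap in vector search and an LLM call.

def words(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank passages by word overlap with the query."""
    query_words = words(query)
    scored = sorted(
        corpus,
        key=lambda passage: len(query_words & words(passage)),
        reverse=True,
    )
    return scored[:k]

def augment(question: str, passages: list[str]) -> str:
    """Insert retrieved passages into the prompt as numbered context."""
    context = "\n".join(f"Passage {i}: {p}" for i, p in enumerate(passages, 1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

corpus = [
    "Aspirin can cause stomach irritation and bleeding.",
    "Ibuprofen is a different NSAID with its own risk profile.",
    "Common aspirin side effects include nausea and heartburn.",
]
question = "What are the side effects of aspirin?"
prompt = augment(question, retrieve(question, corpus))
# `prompt` would now be sent to the LLM for the generate step.
```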
Why RAG Works
LLMs have impressive knowledge but fundamental limitations.
Knowledge cutoff means training data ends at some date. Events after that date are unknown to the model. RAG addresses this because retrieval can include current data, updated as recently as needed.
Hallucination is the tendency of LLMs to confidently generate false information. RAG mitigates this because answers are grounded in actual documents—the model reads and synthesizes rather than recalls from uncertain memory.
No private data means your internal documentation, code, and policies are not in the training set. RAG solves this directly: any document collection can be indexed and retrieved.
Verification is impossible with a pure LLM—there is no way to check where information came from. RAG enables citations because you know exactly which retrieved passages informed the answer.
Prompt Construction
A typical RAG prompt structure:
System: You are a helpful assistant. Answer questions based on
the provided context. If the answer is not in the context,
say "I don't have information about that."
Context:
[Retrieved passage 1]
[Retrieved passage 2]
[Retrieved passage 3]
Question: {user_question}
Answer:

The key elements of this prompt structure are worth examining. The system instructions guide behavior and set expectations for how the model should respond. The context block contains the retrieved passages, clearly delineated so the model knows what is source material versus what is instruction. The question is the user's actual query. And the grounding instruction tells the model explicitly to use the provided context and admit uncertainty when the answer is not present.
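Assembling that template is a few lines of string formatting. A sketch under the structure shown above; the exact delimiters and wording are illustrative and usually tuned to the model in use:

```python
# Builds the system / context / question / answer prompt described above.
# Delimiter style and instruction wording are illustrative choices.

SYSTEM = (
    "You are a helpful assistant. Answer questions based on the provided "
    "context. If the answer is not in the context, say "
    '"I don\'t have information about that."'
)

def build_rag_prompt(question: str, passages: list[str]) -> str:
    context = "\n".join(
        f"[Retrieved passage {i}]\n{text}" for i, text in enumerate(passages, 1)
    )
    return (
        f"System: {SYSTEM}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What are the side effects of aspirin?",
    ["Aspirin can cause stomach irritation and bleeding.",
     "Common side effects include nausea and heartburn."],
)
```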
RAG Variants
Naive RAG retrieves the top-k results, stuffs them into the prompt, and generates. It is simple, fast, and easy to debug, making it a solid baseline, though it may retrieve irrelevant passages and offers no iterative refinement.
Sentence-window RAG embeds individual sentences for precise matching but returns surrounding context—typically a paragraph or more—to give the model rich context for generation.
Parent document RAG embeds small chunks for precise retrieval but returns the parent documents those chunks came from. This preserves full context at the cost of longer prompts.
Multi-query RAG generates variations of the original query, retrieves for each variation, then merges results. This improves recall by catching documents that might match alternative phrasings.
Iterative RAG generates an initial answer, evaluates whether it is sufficient, and retrieves more if needed. This adapts the number of retrieval rounds to the question's complexity.
Self-RAG goes further: the LLM decides when to retrieve and evaluates retrieval quality itself. Fully autonomous but more complex to implement.
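The multi-query variant is easy to sketch: retrieve for each phrasing of the question, then merge with order-preserving deduplication. The `rewrite` and `retrieve` functions below are toy stand-ins (hypothetical, not a library API) for an LLM query rewriter and a vector search call:

```python
# Multi-query RAG sketch: retrieve per query variant, merge, deduplicate.

def multi_query_retrieve(query, rewrite, retrieve, k=3):
    results, seen = [], set()
    for variant in [query, *rewrite(query)]:
        for doc in retrieve(variant):
            if doc not in seen:  # keep first occurrence, preserve order
                seen.add(doc)
                results.append(doc)
    return results[:k]

# Toy stand-ins to show the merge behaviour.
def rewrite(q):
    return [q.replace("side effects", "adverse reactions")]

index = {
    "side effects": ["doc_a", "doc_b"],
    "adverse reactions": ["doc_b", "doc_c"],
}

def retrieve(q):
    return next((docs for key, docs in index.items() if key in q), [])

merged = multi_query_retrieve("aspirin side effects", rewrite, retrieve)
# merged == ["doc_a", "doc_b", "doc_c"]
```

The alternative phrasing surfaces `doc_c`, which the original query alone would have missed; that is the recall gain the variant is after.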
The Retrieval-Generation Interface
What goes into the prompt matters significantly.
Number of passages involves a direct trade-off: more context means more information but also higher cost and potential confusion if passages contradict each other.
Passage ordering affects model behavior. Should the most relevant passage come first? The most recent? Grouped by source? Models are sensitive to ordering, and the best approach depends on your use case.
Passage formatting helps the model understand structure. Numbered lists, headers, or XML tags all work. The key is clear delineation so the model knows where one passage ends and another begins.
Passage length trades precision against context. Short snippets are more likely to be directly relevant but may lack necessary context. Long passages provide more context but may include irrelevant information.
Metadata inclusion—source, date, author—is useful for citations and helps the model understand context but consumes tokens from your context window.
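One common way to combine clear delineation with metadata is XML-style tags around each passage. The tag and attribute names here are illustrative conventions, not a standard:

```python
# Format passages with XML-style delimiters carrying source metadata.
# Tag and attribute names are illustrative choices.

def format_passage(i: int, text: str, source: str, date: str) -> str:
    return (
        f'<passage id="{i}" source="{source}" date="{date}">\n'
        f"{text}\n"
        "</passage>"
    )

passages = [
    {"text": "Aspirin can cause stomach irritation.",
     "source": "drug_handbook.pdf", "date": "2023-04-01"},
]
block = "\n".join(
    format_passage(i, p["text"], p["source"], p["date"])
    for i, p in enumerate(passages, 1)
)
```

The `id` attribute also gives the model something stable to cite ("per passage 1 ..."), which makes source attribution in the answer easier to check.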
Latency Considerations
Typical RAG latency breaks down as follows:
- Query embedding: 10-50 ms
- Vector search: 5-20 ms
- Passage retrieval from storage: 10-50 ms
- LLM generation: 500-3000 ms
The LLM dominates. Optimizing retrieval helps but the generation step is usually 90%+ of total latency.
Streaming: Start returning tokens before full generation completes. Perceived latency drops dramatically.
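The streaming shape is a generator loop. In this sketch, `fake_llm_stream` is a stand-in for a streaming LLM API (the names are illustrative); the caller measures time to first token, which is what streaming improves:

```python
import time

# Streaming sketch: render tokens as they arrive instead of waiting
# for the full answer. `fake_llm_stream` stands in for a streaming API.

def fake_llm_stream(prompt):
    """Stand-in for a streaming LLM call: yields one token at a time."""
    for token in ["Aspirin", " may", " cause", " stomach", " irritation."]:
        yield token

def answer_with_streaming(prompt):
    start = time.perf_counter()
    first_token_latency = None
    pieces = []
    for token in fake_llm_stream(prompt):
        if first_token_latency is None:
            # Perceived latency: time until the user sees the first token.
            first_token_latency = time.perf_counter() - start
        pieces.append(token)  # in a real UI, render the token here
    return "".join(pieces), first_token_latency

answer, ttft = answer_with_streaming("What are the side effects of aspirin?")
```

Total generation time is unchanged; only the wait before the first visible token shrinks.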
Architectural Patterns
Synchronous RAG:
query → embed → search → retrieve → generate → return

Simple, predictable, higher latency.
Streaming RAG:
query → embed → search → retrieve → stream(generate)

Same pipeline but streams output. Better UX.
Parallel retrieval:
query → embed → [search₁ ∥ search₂] → merge → retrieve → generate

Multiple indices searched in parallel. Lower latency for hybrid search.
Cached retrieval:
query → check_cache → (hit: return) / (miss: full pipeline)

For repeated queries. Common in production.
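The hit/miss shape is a dictionary keyed on the normalized query. A minimal sketch; production caches add TTLs, size bounds, and often semantic (embedding-based) keys rather than exact strings:

```python
# Cached-retrieval sketch: repeated queries skip the full pipeline.

cache: dict[str, str] = {}
calls = {"pipeline": 0}  # counts pipeline runs, to show the hit

def full_pipeline(query: str) -> str:
    """Stand-in for embed -> search -> retrieve -> generate."""
    calls["pipeline"] += 1
    return f"answer to: {query}"

def cached_answer(query: str) -> str:
    key = " ".join(query.lower().split())  # normalize case and whitespace
    if key not in cache:                   # miss: run the full pipeline once
        cache[key] = full_pipeline(query)
    return cache[key]                      # hit: return the stored answer

cached_answer("What is aspirin?")
cached_answer("what is  ASPIRIN?")  # normalizes to the same key: cache hit
# calls["pipeline"] == 1
```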
Key Takeaways
- RAG retrieves relevant documents and provides them as context to an LLM for generation
- It addresses hallucination, knowledge cutoffs, and private data access
- Prompt construction includes system instructions, clearly formatted context, and grounding guidance
- Variants include sentence-window, parent document, multi-query, iterative, and self-RAG patterns
- LLM generation dominates latency—optimize there with streaming and caching
- The retrieval-generation interface (what context, how formatted) significantly impacts quality