RAG Architecture
The retrieve-then-generate pattern and its variants
Retrieval-Augmented Generation (RAG) bridges semantic search and language models. Instead of relying solely on what an LLM knows from training, RAG retrieves relevant documents and provides them as context. The LLM generates answers grounded in actual sources.
The Basic RAG Pattern
The flow is straightforward:
- User query: "What are the side effects of aspirin?"
- Retrieve: Semantic search finds relevant passages from a medical database
- Augment: Insert retrieved passages into the prompt as context
- Generate: LLM reads context and produces an answer
- Return: User sees the answer (optionally with source citations)
The power: LLMs are excellent at reading comprehension. Given relevant context, they can synthesize, summarize, and answer accurately—without needing that knowledge baked into weights.
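The five steps above can be sketched end to end. This is a toy illustration, not a specific library's API: the keyword-overlap retriever stands in for real semantic search, and no actual LLM is called.

```python
import re

# A minimal retrieve-augment sketch. `retrieve` and `augment` are
# illustrative names; real systems swap in vector search and an LLM call.

def words(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank passages by word overlap with the query."""
    query_words = words(query)
    scored = sorted(
        corpus,
        key=lambda passage: len(query_words & words(passage)),
        reverse=True,
    )
    return scored[:k]

def augment(question: str, passages: list[str]) -> str:
    """Insert retrieved passages into the prompt as numbered context."""
    context = "\n".join(f"Passage {i}: {p}" for i, p in enumerate(passages, 1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

corpus = [
    "Aspirin can cause stomach irritation and bleeding.",
    "Ibuprofen is a different NSAID with its own risk profile.",
    "Common aspirin side effects include nausea and heartburn.",
]
question = "What are the side effects of aspirin?"
prompt = augment(question, retrieve(question, corpus))
# `prompt` would now be sent to the LLM for the generate step.
```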
Why RAG Works
LLMs have impressive knowledge but fundamental limitations.
Knowledge cutoff means training data ends at some date. Events after that date are unknown to the model. RAG addresses this because retrieval can include current data, updated as recently as needed.
Hallucination is the tendency of LLMs to confidently generate false information. RAG mitigates this because answers are grounded in actual documents—the model reads and synthesizes rather than recalls from uncertain memory.
No private data means your internal documentation, code, and policies are not in the training set. RAG solves this directly: any document collection can be indexed and retrieved.
Verification is impossible with a pure LLM—there is no way to check where information came from. RAG enables citations because you know exactly which retrieved passages informed the answer.
Prompt Construction
A typical RAG prompt structure:
System: You are a helpful assistant. Answer questions based on
the provided context. If the answer is not in the context,
say "I don't have information about that."
Context:
[Retrieved passage 1]
[Retrieved passage 2]
[Retrieved passage 3]
Question: {user_question}
Answer:

The key elements of this prompt structure are worth examining. The system instructions guide behavior and set expectations for how the model should respond. The context block contains the retrieved passages, clearly delineated so the model knows what is source material versus what is instruction. The question is the user's actual query. And the grounding instruction tells the model explicitly to use the provided context and admit uncertainty when the answer is not present.
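Assembling that template is a few lines of string formatting. A sketch under the structure shown above; the exact delimiters and wording are illustrative and usually tuned to the model in use:

```python
# Builds the system / context / question / answer prompt described above.
# Delimiter style and instruction wording are illustrative choices.

SYSTEM = (
    "You are a helpful assistant. Answer questions based on the provided "
    "context. If the answer is not in the context, say "
    '"I don\'t have information about that."'
)

def build_rag_prompt(question: str, passages: list[str]) -> str:
    context = "\n".join(
        f"[Retrieved passage {i}]\n{text}" for i, text in enumerate(passages, 1)
    )
    return (
        f"System: {SYSTEM}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What are the side effects of aspirin?",
    ["Aspirin can cause stomach irritation and bleeding.",
     "Common side effects include nausea and heartburn."],
)
```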
RAG Variants
Naive RAG retrieves the top-k results, stuffs them into the prompt, and generates. It is simple, fast, and easy to debug, making it a solid baseline, though it may retrieve irrelevant passages and offers no iterative refinement.
Sentence-window RAG embeds individual sentences for precise matching but returns surrounding context—typically a paragraph or more—to give the model rich context for generation.
Parent document RAG embeds small chunks for precise retrieval but returns the parent documents those chunks came from. This preserves full context at the cost of longer prompts.
Multi-query RAG generates variations of the original query, retrieves for each variation, then merges results. This improves recall by catching documents that might match alternative phrasings.
Iterative RAG generates an initial answer, evaluates whether it is sufficient, and retrieves more if needed. This adapts the number of retrieval rounds to the question's complexity.
Self-RAG goes further: the LLM decides when to retrieve and evaluates retrieval quality itself. Fully autonomous but more complex to implement.
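The multi-query variant is easy to sketch: retrieve for each phrasing of the question, then merge with order-preserving deduplication. The `rewrite` and `retrieve` functions below are toy stand-ins (hypothetical, not a library API) for an LLM query rewriter and a vector search call:

```python
# Multi-query RAG sketch: retrieve per query variant, merge, deduplicate.

def multi_query_retrieve(query, rewrite, retrieve, k=3):
    results, seen = [], set()
    for variant in [query, *rewrite(query)]:
        for doc in retrieve(variant):
            if doc not in seen:  # keep first occurrence, preserve order
                seen.add(doc)
                results.append(doc)
    return results[:k]

# Toy stand-ins to show the merge behaviour.
def rewrite(q):
    return [q.replace("side effects", "adverse reactions")]

index = {
    "side effects": ["doc_a", "doc_b"],
    "adverse reactions": ["doc_b", "doc_c"],
}

def retrieve(q):
    return next((docs for key, docs in index.items() if key in q), [])

merged = multi_query_retrieve("aspirin side effects", rewrite, retrieve)
# merged == ["doc_a", "doc_b", "doc_c"]
```

The alternative phrasing surfaces `doc_c`, which the original query alone would have missed; that is the recall gain the variant is after.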
The Retrieval-Generation Interface
What goes into the prompt matters significantly.
Number of passages involves a direct trade-off: more context means more information but also higher cost and potential confusion if passages contradict each other.
Passage ordering affects model behavior. Should the most relevant passage come first? The most recent? Grouped by source? Models are sensitive to ordering, and the best approach depends on your use case.
Passage formatting helps the model understand structure. Numbered lists, headers, or XML tags all work. The key is clear delineation so the model knows where one passage ends and another begins.
Passage length trades precision against context. Short snippets are more likely to be directly relevant but may lack necessary context. Long passages provide more context but may include irrelevant information.
Metadata inclusion—source, date, author—is useful for citations and helps the model understand context but consumes tokens from your context window.
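One common way to combine clear delineation with metadata is XML-style tags around each passage. The tag and attribute names here are illustrative conventions, not a standard:

```python
# Format passages with XML-style delimiters carrying source metadata.
# Tag and attribute names are illustrative choices.

def format_passage(i: int, text: str, source: str, date: str) -> str:
    return (
        f'<passage id="{i}" source="{source}" date="{date}">\n'
        f"{text}\n"
        "</passage>"
    )

passages = [
    {"text": "Aspirin can cause stomach irritation.",
     "source": "drug_handbook.pdf", "date": "2023-04-01"},
]
block = "\n".join(
    format_passage(i, p["text"], p["source"], p["date"])
    for i, p in enumerate(passages, 1)
)
```

The `id` attribute also gives the model something stable to cite ("per passage 1 ..."), which makes source attribution in the answer easier to check.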
Latency Considerations
Typical RAG latency breaks down as follows:
- Query embedding: 10-50 ms
- Vector search: 5-20 ms
- Passage retrieval from storage: 10-50 ms
- LLM generation: 500-3000 ms
The LLM dominates. Optimizing retrieval helps but the generation step is usually 90%+ of total latency.
Streaming: Start returning tokens before full generation completes. Perceived latency drops dramatically.
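The streaming shape is a generator loop. In this sketch, `fake_llm_stream` is a stand-in for a streaming LLM API (the names are illustrative); the caller measures time to first token, which is what streaming improves:

```python
import time

# Streaming sketch: render tokens as they arrive instead of waiting
# for the full answer. `fake_llm_stream` stands in for a streaming API.

def fake_llm_stream(prompt):
    """Stand-in for a streaming LLM call: yields one token at a time."""
    for token in ["Aspirin", " may", " cause", " stomach", " irritation."]:
        yield token

def answer_with_streaming(prompt):
    start = time.perf_counter()
    first_token_latency = None
    pieces = []
    for token in fake_llm_stream(prompt):
        if first_token_latency is None:
            # Perceived latency: time until the user sees the first token.
            first_token_latency = time.perf_counter() - start
        pieces.append(token)  # in a real UI, render the token here
    return "".join(pieces), first_token_latency

answer, ttft = answer_with_streaming("What are the side effects of aspirin?")
```

Total generation time is unchanged; only the wait before the first visible token shrinks.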
Architectural Patterns
Synchronous RAG:
query → embed → search → retrieve → generate → return

Simple, predictable, higher latency.
Streaming RAG:
query → embed → search → retrieve → stream(generate)

Same pipeline but streams output. Better UX.
Parallel retrieval:
query → embed → [search₁ ∥ search₂] → merge → retrieve → generate

Multiple indices searched in parallel. Lower latency for hybrid search.
Cached retrieval:
query → check_cache → (hit: return) / (miss: full pipeline)

For repeated queries. Common in production.
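The hit/miss shape is a dictionary keyed on the normalized query. A minimal sketch; production caches add TTLs, size bounds, and often semantic (embedding-based) keys rather than exact strings:

```python
# Cached-retrieval sketch: repeated queries skip the full pipeline.

cache: dict[str, str] = {}
calls = {"pipeline": 0}  # counts pipeline runs, to show the hit

def full_pipeline(query: str) -> str:
    """Stand-in for embed -> search -> retrieve -> generate."""
    calls["pipeline"] += 1
    return f"answer to: {query}"

def cached_answer(query: str) -> str:
    key = " ".join(query.lower().split())  # normalize case and whitespace
    if key not in cache:                   # miss: run the full pipeline once
        cache[key] = full_pipeline(query)
    return cache[key]                      # hit: return the stored answer

cached_answer("What is aspirin?")
cached_answer("what is  ASPIRIN?")  # normalizes to the same key: cache hit
# calls["pipeline"] == 1
```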
Key Takeaways
- RAG retrieves relevant documents and provides them as context to an LLM for generation
- It addresses hallucination, knowledge cutoffs, and private data access
- Prompt construction includes system instructions, clearly formatted context, and grounding guidance
- Variants include sentence-window, parent document, multi-query, iterative, and self-RAG patterns
- LLM generation dominates latency—optimize there with streaming and caching
- The retrieval-generation interface (what context, how formatted) significantly impacts quality