Chunking Strategies
Fixed-size, semantic, hierarchical: trade-offs and implementations
Documents are too long to embed whole. A 10-page PDF embedded as a single vector is too coarse for precise retrieval. Chunking splits documents into smaller pieces, each embedded separately. But chunking is not neutral: how you chunk determines what you can retrieve.
Why Chunking Matters
Embedding models have limited context windows (typically 512-8192 tokens), so long documents must be split. But even if models could embed long documents whole, chunking improves retrieval:
Precision: A query about a specific fact should return the paragraph containing that fact, not the entire document.
Diversity: Multiple relevant sections from one document can appear in results.
Relevance signal: The embedding of a focused chunk is more informative than a diluted whole-document embedding.
Fixed-Size Chunking
The simplest approach: split text every N characters or tokens.
Interactive: chunk size effects (medium chunks balance precision and context)
Parameters:
- Chunk size (tokens or characters)
- Overlap between chunks
Pros:
- Simple to implement
- Predictable chunk sizes
- Easy to parallelize
Cons:
- Cuts mid-sentence, mid-paragraph
- No awareness of document structure
- May separate related content
Typical values: 256-512 tokens with 10-20% overlap.
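Fixed-size chunking with overlap can be sketched in a few lines. The version below is character-based for simplicity; a token-based version would slide the same window over a tokenizer's token IDs instead (the function name and defaults are illustrative):

```python
def fixed_size_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks with overlap (character-based sketch)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # last window already covers the tail
            break
    return chunks
```

Note that each chunk's last `overlap` characters reappear at the start of the next chunk, which is exactly the boundary redundancy discussed in the next section.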
Overlap: Preserving Context
Chunks with no overlap can split important content across boundaries. Overlap ensures information at boundaries appears in multiple chunks.
Interactive: chunk overlap (light overlap keeps boundary content in both chunks)
No overlap: content at boundaries is retrieved only if the right chunk is found.
10-20% overlap: Most boundary content appears in both chunks. Redundancy in storage but better retrieval.
50%+ overlap: Significant redundancy. Rarely necessary.
Overlap increases storage but improves robustness to boundary effects.
Semantic Chunking
Instead of fixed size, split at natural boundaries:
- Paragraph breaks
- Section headings
- Sentence boundaries
Semantic vs. fixed chunking: fixed chunking cuts mid-sentence, producing incoherent fragments.
Recursive text splitting: Try to split at paragraphs. If a paragraph is too long, split at sentences. If a sentence is too long, split at words.
Markdown/HTML-aware: Respect document structure. Keep lists together. Split at headings.
Sentence-based: Split every N sentences. Never cut mid-sentence.
Pros:
- Respects document structure
- More coherent chunks
- Better for structured documents
Cons:
- Variable chunk sizes
- Complex implementation
- May produce very small or very large chunks
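The recursive strategy can be sketched as follows. This is a simplified take on the "recursive splitter" pattern popularized by libraries like LangChain; the separator list and size limit are illustrative, and a production version would also merge small adjacent pieces back up toward the limit and reattach separators:

```python
def recursive_split(text: str, max_len: int = 512,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split at the coarsest separator that works; recurse with finer ones."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) <= 1:
        # This separator did not help; try the next, finer one.
        return recursive_split(text, max_len, rest)
    chunks = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, max_len, rest))
    return chunks
```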
Hierarchical Chunking
Different granularities for different purposes:
Hierarchical chunking: paragraph-level chunks enable precise retrieval; return the parent section for context.
Parent-child relationships:
- Embed sentences (fine-grained retrieval)
- Store sentences with parent paragraph IDs
- Return parent paragraphs for context
Multi-level indices:
- Document-level embeddings for topic matching
- Section-level for specific content
- Paragraph-level for precise facts
Retrieve small, return large: Search on sentence embeddings, return the enclosing paragraph or section. Precise retrieval with rich context.
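A minimal parent-child index for "retrieve small, return large" might look like this. Vector search is stubbed out with keyword overlap so the sketch stays self-contained; in practice you would embed the child chunks and query a vector index (all names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ChildChunk:
    text: str       # small unit that gets embedded (e.g. a sentence)
    parent_id: int  # index of the enclosing paragraph

def build_index(paragraphs: list[str]) -> list[ChildChunk]:
    """Index sentences as children, remembering their parent paragraph."""
    children = []
    for pid, para in enumerate(paragraphs):
        for sentence in para.split(". "):
            if sentence.strip():
                children.append(ChildChunk(sentence.strip(". "), pid))
    return children

def retrieve(query: str, children: list[ChildChunk], paragraphs: list[str]) -> str:
    """Match on the small child chunks, but return the enclosing paragraph."""
    def score(c: ChildChunk) -> int:
        # Keyword overlap stands in for embedding similarity in this sketch.
        return len(set(query.lower().split()) & set(c.text.lower().split()))
    best = max(children, key=score)
    return paragraphs[best.parent_id]
```

The key design point is the `parent_id` link: search operates on fine-grained units, while the caller receives the coarser unit for context.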
Chunking for Different Content Types
Prose (articles, books):
- Paragraph-based semantic chunking
- 256-512 token chunks
- Overlap for context preservation
Technical documentation:
- Section-aware chunking
- Keep code blocks intact
- Include section headers in each chunk
Conversations (chat logs, transcripts):
- Message or turn-based chunks
- Include speaker identification
- Sliding window for context
Code:
- Function or class-based chunks
- Include imports and signatures
- Separate documentation from implementation
Tables and structured data:
- Row-based chunking with headers
- Or convert to prose descriptions
- Preserve relationships
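For Python source, function- and class-level chunking can be sketched with the standard-library `ast` module; other languages would need their own parser (e.g. tree-sitter). This minimal version omits the imports-and-signatures enrichment mentioned above:

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # ast.get_source_segment recovers the exact source text of the node.
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```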
Chunk Metadata
Chunks should carry metadata for filtering and context:
Source information:
- Document ID, title, URL
- Section/chapter/page number
- Timestamp, author
Structural context:
- Parent section heading
- Position in document (beginning/middle/end)
- Previous and next chunk IDs
Processing info:
- Chunk index within document
- Overlap with neighbors
- Token count
This metadata enables filtered retrieval and context reconstruction.
Practical Recommendations
- Start with 512 tokens, 20% overlap: a reasonable default for most text.
- Use recursive splitting: prefer a paragraph → sentence → word fallback.
- Include context in chunks: add the section header and document title to each chunk.
- Test on your data: chunk-size effects vary by domain and query patterns.
- Consider query length: short queries need small chunks; long queries may benefit from larger ones.
- Balance precision and context: too small loses context; too large dilutes relevance.
Key Takeaways
- Chunking is necessary because embedding models have context limits and retrieval benefits from precision
- Fixed-size chunking is simple but cuts arbitrarily; semantic chunking respects document structure
- Overlap (10-20%) helps preserve information at chunk boundaries
- Hierarchical chunking enables "retrieve small, return large" for precise yet contextual results
- Different content types require different strategies: prose, code, tables, conversations
- Always include metadata for filtering and context reconstruction
- Default to 512 tokens with 20% overlap, then tune based on empirical retrieval quality