Chunking Strategies
Fixed-size, semantic, hierarchical: trade-offs and implementations
Documents are too long to embed whole. A 10-page PDF embedded as a single vector is too coarse for precise retrieval. Chunking splits documents into smaller pieces, each embedded separately. But chunking is not neutral: how you chunk determines what you can retrieve.
Why Chunking Matters
Embedding models have limited context windows (typically 512-8192 tokens), so long documents must be split. But even if models could embed long documents whole, chunking improves retrieval:
Precision: A query about a specific fact should return the paragraph containing that fact, not the entire document.
Diversity: Multiple relevant sections from one document can appear in results.
Relevance signal: The embedding of a focused chunk is more informative than a diluted whole-document embedding.
Fixed-Size Chunking
The simplest approach: split text every N characters or tokens.
Interactive: chunk size effects (medium chunks balance precision and context)
Parameters:
- Chunk size (tokens or characters)
- Overlap between chunks
Pros:
- Simple to implement
- Predictable chunk sizes
- Easy to parallelize
Cons:
- Cuts mid-sentence, mid-paragraph
- No awareness of document structure
- May separate related content
Typical values: 256-512 tokens with 10-20% overlap.
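Fixed-size chunking with overlap can be sketched in a few lines. The version below is character-based for simplicity; a token-based version would slide the same window over a tokenizer's token IDs instead (the function name and defaults are illustrative):

```python
def fixed_size_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks with overlap (character-based sketch)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # last window already covers the tail
            break
    return chunks
```

Note that each chunk's last `overlap` characters reappear at the start of the next chunk, which is exactly the boundary redundancy discussed in the next section.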
Overlap: Preserving Context
Chunks with no overlap can split important content across boundaries. Overlap ensures information at boundaries appears in multiple chunks.
Interactive: chunk overlap (light overlap keeps boundary content in both chunks)
No overlap: content at boundaries is retrieved only if the right chunk is found.
10-20% overlap: Most boundary content appears in both chunks. Redundancy in storage but better retrieval.
50%+ overlap: Significant redundancy. Rarely necessary.
Overlap increases storage but improves robustness to boundary effects.
Semantic Chunking
Instead of fixed size, split at natural boundaries:
- Paragraph breaks
- Section headings
- Sentence boundaries
Semantic vs. fixed chunking: fixed chunking cuts mid-sentence, producing incoherent fragments.
Recursive text splitting: Try to split at paragraphs. If a paragraph is too long, split at sentences. If a sentence is too long, split at words.
Markdown/HTML-aware: Respect document structure. Keep lists together. Split at headings.
Sentence-based: Split every N sentences. Never cut mid-sentence.
Pros:
- Respects document structure
- More coherent chunks
- Better for structured documents
Cons:
- Variable chunk sizes
- Complex implementation
- May produce very small or very large chunks
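The recursive strategy can be sketched as follows. This is a simplified take on the "recursive splitter" pattern popularized by libraries like LangChain; the separator list and size limit are illustrative, and a production version would also merge small adjacent pieces back up toward the limit and reattach separators:

```python
def recursive_split(text: str, max_len: int = 512,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split at the coarsest separator that works; recurse with finer ones."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) <= 1:
        # This separator did not help; try the next, finer one.
        return recursive_split(text, max_len, rest)
    chunks = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, max_len, rest))
    return chunks
```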
Hierarchical Chunking
Different granularities for different purposes:
Hierarchical chunking: paragraph-level chunks enable precise retrieval; return the parent section for context.
Parent-child relationships:
- Embed sentences (fine-grained retrieval)
- Store sentences with parent paragraph IDs
- Return parent paragraphs for context
Multi-level indices:
- Document-level embeddings for topic matching
- Section-level for specific content
- Paragraph-level for precise facts
Retrieve small, return large: Search on sentence embeddings, return the enclosing paragraph or section. Precise retrieval with rich context.
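A minimal parent-child index for "retrieve small, return large" might look like this. Vector search is stubbed out with keyword overlap so the sketch stays self-contained; in practice you would embed the child chunks and query a vector index (all names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ChildChunk:
    text: str       # small unit that gets embedded (e.g. a sentence)
    parent_id: int  # index of the enclosing paragraph

def build_index(paragraphs: list[str]) -> list[ChildChunk]:
    """Index sentences as children, remembering their parent paragraph."""
    children = []
    for pid, para in enumerate(paragraphs):
        for sentence in para.split(". "):
            if sentence.strip():
                children.append(ChildChunk(sentence.strip(". "), pid))
    return children

def retrieve(query: str, children: list[ChildChunk], paragraphs: list[str]) -> str:
    """Match on the small child chunks, but return the enclosing paragraph."""
    def score(c: ChildChunk) -> int:
        # Keyword overlap stands in for embedding similarity in this sketch.
        return len(set(query.lower().split()) & set(c.text.lower().split()))
    best = max(children, key=score)
    return paragraphs[best.parent_id]
```

The key design point is the `parent_id` link: search operates on fine-grained units, while the caller receives the coarser unit for context.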
Chunking for Different Content Types
Prose (articles, books):
- Paragraph-based semantic chunking
- 256-512 token chunks
- Overlap for context preservation
Technical documentation:
- Section-aware chunking
- Keep code blocks intact
- Include section headers in each chunk
Conversations (chat logs, transcripts):
- Message or turn-based chunks
- Include speaker identification
- Sliding window for context
Code:
- Function or class-based chunks
- Include imports and signatures
- Separate documentation from implementation
Tables and structured data:
- Row-based chunking with headers
- Or convert to prose descriptions
- Preserve relationships
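For Python source, function- and class-level chunking can be sketched with the standard-library `ast` module; other languages would need their own parser (e.g. tree-sitter). This minimal version omits the imports-and-signatures enrichment mentioned above:

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # ast.get_source_segment recovers the exact source text of the node.
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```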
Chunk Metadata
Chunks should carry metadata for filtering and context:
Source information:
- Document ID, title, URL
- Section/chapter/page number
- Timestamp, author
Structural context:
- Parent section heading
- Position in document (beginning/middle/end)
- Previous and next chunk IDs
Processing info:
- Chunk index within document
- Overlap with neighbors
- Token count
This metadata enables filtered retrieval and context reconstruction.
Practical Recommendations
- Start with 512 tokens, 20% overlap: a reasonable default for most text.
- Use recursive splitting: prefer a paragraph → sentence → word fallback.
- Include context in chunks: add the section header and document title to each chunk.
- Test on your data: chunk-size effects vary by domain and query patterns.
- Consider query length: short queries need small chunks; long queries may benefit from larger ones.
- Balance precision and context: too small loses context; too large dilutes relevance.
Key Takeaways
- Chunking is necessary because embedding models have context limits and retrieval benefits from precision
- Fixed-size chunking is simple but cuts arbitrarily; semantic chunking respects document structure
- Overlap (10-20%) helps preserve information at chunk boundaries
- Hierarchical chunking enables "retrieve small, return large" for precise yet contextual results
- Different content types require different strategies: prose, code, tables, conversations
- Always include metadata for filtering and context reconstruction
- Default to 512 tokens with 20% overlap, then tune based on empirical retrieval quality