Chunking Strategies

Fixed-size, semantic, hierarchical: trade-offs and implementations

Documents are too long to embed whole. A 10-page PDF produces a single vector—too coarse for precise retrieval. Chunking splits documents into smaller pieces, each embedded separately. But chunking is not neutral: how you chunk affects what you can retrieve.

Why Chunking Matters

Embedding models have limited context windows (typically 512-8192 tokens), so long documents must be split. But even if models could handle arbitrarily long inputs, chunking would still improve retrieval:

Precision: A query about a specific fact should return the paragraph containing that fact, not the entire document.

Diversity: Multiple relevant sections from one document can appear in results.

Relevance signal: The embedding of a focused chunk is more informative than a diluted whole-document embedding.

Fixed-Size Chunking

The simplest approach: split text every N characters or tokens.

For example, splitting a short passage about machine learning into five fixed-size chunks shows the boundary problem: most chunks begin or end mid-sentence.

Chunk 1: Machine learning is a subset of artificial intelligence that enables systems to learn from data. Unlike traditional programming where rules are
Chunk 2: explicitly coded, ML systems discover patterns automatically. Neural networks are a key technology in modern ML. They consist of layers of
Chunk 3: interconnected nodes that process information. Deep learning uses neural networks with many layers, enabling complex pattern recognition. Training a
Chunk 4: neural network involves showing it many examples and adjusting weights to minimize errors. This process is called backpropagation and uses gradient
Chunk 5: descent optimization.

Parameters:

  • Chunk size (tokens or characters)
  • Overlap between chunks

Pros:

  • Simple to implement
  • Predictable chunk sizes
  • Easy to parallelize

Cons:

  • Cuts mid-sentence, mid-paragraph
  • No awareness of document structure
  • May separate related content

Typical values: 256-512 tokens with 10-20% overlap.
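The approach can be sketched in a few lines. Below is a minimal character-based version (a token-based version would count tokens with a tokenizer instead); the chunk size and overlap values are illustrative:

```python
def chunk_fixed(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into fixed-size character chunks with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the remainder is already covered
    return chunks

doc = "word " * 500  # ~2500 characters of placeholder text
chunks = chunk_fixed(doc, chunk_size=400, overlap=80)
```

Because each chunk advances by `chunk_size - overlap`, consecutive chunks share exactly `overlap` characters at their boundary.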

Overlap: Preserving Context

Chunks with no overlap can split important content across boundaries. Overlap ensures information at boundaries appears in multiple chunks.

No overlap: Content at a boundary is retrieved only if the query happens to match the chunk that contains it.

10-20% overlap: Most boundary content appears in both chunks. Redundancy in storage but better retrieval.

50%+ overlap: Significant redundancy. Rarely necessary.

Overlap increases storage but improves robustness to boundary effects.
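The storage cost is easy to quantify: each new chunk advances by a fraction (1 - f) of a chunk, so the total text stored is roughly 1 / (1 - f) times the original length. A small helper makes the trade-off concrete (an approximation that ignores the final partial chunk):

```python
def storage_multiplier(overlap_fraction: float) -> float:
    """Approximate storage cost of overlapping fixed-size chunks.

    With overlap fraction f, each chunk advances by (1 - f) of a chunk
    width, so total stored text is roughly 1 / (1 - f) times the
    original length (ignoring the final partial chunk).
    """
    if not 0 <= overlap_fraction < 1:
        raise ValueError("overlap fraction must be in [0, 1)")
    return 1 / (1 - overlap_fraction)

# 10-20% overlap costs only ~1.1-1.25x storage; 50% overlap doubles it.
```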

Semantic Chunking

Instead of fixed size, split at natural boundaries:

  • Paragraph breaks
  • Section headings
  • Sentence boundaries

For contrast, here is fixed-size chunking applied to a short passage; |cut| marks each boundary, and every one falls mid-word or mid-sentence, producing incoherent fragments:

Chunk 1: Machine learning enables systems to learn from data rather than following explic|cut|
Chunk 2: it rules. Neural networks consist of layers of interconnected nodes that process|cut|
Chunk 3: information hierarchically. Training involves adjusting weights through backpro|cut|
Chunk 4: pagation to minimize prediction errors.

Recursive text splitting: Try to split at paragraphs. If a paragraph is too long, split at sentences. If a sentence is too long, split at words.

Markdown/HTML-aware: Respect document structure. Keep lists together. Split at headings.

Sentence-based: Split every N sentences. Never cut mid-sentence.

Pros:

  • Respects document structure
  • More coherent chunks
  • Better for structured documents

Cons:

  • Variable chunk sizes
  • Complex implementation
  • May produce very small or very large chunks
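Recursive splitting, described above, can be sketched as follows. The separator list and size limit are illustrative, and separators are dropped at chunk boundaries for brevity:

```python
def recursive_split(text, max_len=500, separators=("\n\n", ". ", " ")):
    """Split at the coarsest separator whose pieces fit within max_len."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # separator not present; try a finer one
        chunks, current = [], ""
        for part in parts:
            candidate = f"{current}{sep}{part}" if current else part
            if len(candidate) <= max_len:
                current = candidate  # greedily pack parts into one chunk
            else:
                if current:
                    chunks.append(current)
                current = part
        if current:
            chunks.append(current)
        # Any single part still too long recurses to a finer separator.
        out = []
        for chunk in chunks:
            out.extend(recursive_split(chunk, max_len, separators))
        return out
    # No separator left: hard-cut as a last resort.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

text = ("First point. " * 12).strip() + "\n\n" + ("Second point. " * 12).strip()
chunks = recursive_split(text, max_len=100)
```

Paragraphs that fit stay whole; oversized paragraphs fall through to sentence splits, then word splits, which is the paragraph → sentence → word fallback recommended later.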

Hierarchical Chunking

Different granularities for different purposes:

For example, a short document split at paragraph level yields six fine-grained chunks:

P1: Machine learning is a subset of AI that enables systems to learn from data.
P2: Unlike traditional programming, ML discovers patterns automatically.
P3: Neural networks consist of layers of interconnected nodes.
P4: Deep learning uses many layers for complex pattern recognition.
P5: Training adjusts weights to minimize prediction errors.
P6: Backpropagation uses gradient descent for optimization.

Paragraph-level chunks enable precise retrieval; returning the parent section restores context.

Parent-child relationships:

  • Embed sentences (fine-grained retrieval)
  • Store sentences with parent paragraph IDs
  • Return parent paragraphs for context

Multi-level indices:

  • Document-level embeddings for topic matching
  • Section-level for specific content
  • Paragraph-level for precise facts

Retrieve small, return large: Search on sentence embeddings, return the enclosing paragraph or section. Precise retrieval with rich context.
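This pattern can be sketched with a toy in-memory index; word-overlap scoring stands in for real embedding similarity, and all of the data here is illustrative:

```python
# Toy corpus: sentence-level chunks, each linked to a parent paragraph.
sentences = [
    {"id": "s1", "parent": "p1", "text": "Neural networks consist of layers of nodes."},
    {"id": "s2", "parent": "p1", "text": "Each layer transforms its input."},
    {"id": "s3", "parent": "p2", "text": "Training adjusts weights via backpropagation."},
]
parents = {
    "p1": "Neural networks consist of layers of nodes. Each layer transforms its input.",
    "p2": "Training adjusts weights via backpropagation. Gradient descent drives the updates.",
}

def retrieve(query: str) -> str:
    """Search the small sentence chunks, return the larger parent paragraph."""
    q = set(query.lower().replace(".", "").split())
    def score(sentence):
        # Word overlap stands in for embedding similarity here.
        return len(q & set(sentence["text"].lower().replace(".", "").split()))
    best = max(sentences, key=score)
    return parents[best["parent"]]
```

The match is scored against a single focused sentence, but the caller receives the whole parent paragraph, giving precise retrieval with enough surrounding context to be useful.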

Chunking for Different Content Types

Prose (articles, books):

  • Paragraph-based semantic chunking
  • 256-512 token chunks
  • Overlap for context preservation

Technical documentation:

  • Section-aware chunking
  • Keep code blocks intact
  • Include section headers in each chunk

Conversations (chat logs, transcripts):

  • Message or turn-based chunks
  • Include speaker identification
  • Sliding window for context

Code:

  • Function or class-based chunks
  • Include imports and signatures
  • Separate documentation from implementation

Tables and structured data:

  • Row-based chunking with headers
  • Or convert to prose descriptions
  • Preserve relationships
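Row-based table chunking, for instance, can pair each cell with its column header so every row remains self-describing in isolation; the table and field names below are hypothetical:

```python
headers = ["country", "capital", "population_m"]
rows = [
    ["France", "Paris", "68"],
    ["Japan", "Tokyo", "125"],
]

def row_to_chunk(headers, row, table_name="countries"):
    """Render one table row as a self-describing text chunk."""
    pairs = ", ".join(f"{h}: {v}" for h, v in zip(headers, row))
    return f"[table: {table_name}] {pairs}"

chunks = [row_to_chunk(headers, row) for row in rows]
```

Without the headers, an embedded row like "France, Paris, 68" carries little retrievable meaning; with them, a query about capitals or populations can match it.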

Chunk Metadata

Chunks should carry metadata for filtering and context:

Source information:

  • Document ID, title, URL
  • Section/chapter/page number
  • Timestamp, author

Structural context:

  • Parent section heading
  • Position in document (beginning/middle/end)
  • Previous and next chunk IDs

Processing info:

  • Chunk index within document
  • Overlap with neighbors
  • Token count

This metadata enables filtered retrieval and context reconstruction.
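A chunk record carrying this metadata might look like the following dataclass; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    """One chunk plus metadata for filtering and context reconstruction."""
    text: str
    doc_id: str
    chunk_index: int               # position within the source document
    section_heading: str = ""      # parent heading, embedded or stored for context
    token_count: int = 0
    prev_id: Optional[str] = None  # neighbor links for reassembling context
    next_id: Optional[str] = None

chunk = Chunk(
    text="Neural networks consist of layers of interconnected nodes.",
    doc_id="doc-42",
    chunk_index=3,
    section_heading="Neural Networks",
    token_count=9,
)
```

The neighbor links (`prev_id`/`next_id`) let a retriever expand a hit into its surrounding chunks, and `doc_id` plus `section_heading` support metadata filtering at query time.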

Practical Recommendations

  1. Start with 512 tokens, 20% overlap: A reasonable default for most text.

  2. Use recursive splitting: Prefer paragraph → sentence → word fallback.

  3. Include context in chunks: Add section headers, document title to each chunk.

  4. Test on your data: Chunk size effects vary by domain and query patterns.

  5. Consider query length: Short queries need small chunks; long, detailed queries may benefit from larger ones.

  6. Balance precision and context: Too small loses context; too large dilutes relevance.

Key Takeaways

  • Chunking is necessary because embedding models have context limits and retrieval benefits from precision
  • Fixed-size chunking is simple but cuts arbitrarily; semantic chunking respects document structure
  • Overlap (10-20%) helps preserve information at chunk boundaries
  • Hierarchical chunking enables "retrieve small, return large" for precise yet contextual results
  • Different content types require different strategies: prose, code, tables, conversations
  • Always include metadata for filtering and context reconstruction
  • Default to 512 tokens with 20% overlap, then tune based on empirical retrieval quality