TL;DR
- Chunking is the process of splitting documents into smaller passages that get embedded and indexed. It happens once at ingestion time and constrains everything downstream.
- Four strategies dominate: fixed-size, recursive character splitting, sentence-window, and semantic chunking. Each makes different trade-offs between embedding fidelity, context preservation, and operational simplicity.
- Chunk size is bounded above by the embedding model's context window and below by the need to give the LLM enough surrounding context to answer well. Typical sweet spot is 200-800 tokens with 10-20% overlap.
- Poor chunking is one of the most common causes of poor retrieval quality. Re-tuning chunking is often a higher-leverage fix than swapping embedding models.
Fixed-Size Chunking
The simplest strategy: split the document every N tokens (or characters), with an overlap of O tokens between adjacent chunks. The implementation is a single loop. It works surprisingly well on prose-heavy corpora, where sentences and paragraphs are short relative to the window, so most of them land whole inside a chunk.
Two parameters: chunk_size (typically 256-1024 tokens) and overlap (typically 10-20% of chunk_size). Overlap exists so that a passage straddling a chunk boundary still appears intact in at least one chunk, and a query about it can still retrieve it.
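As a rough illustration of the loop described above, here is a minimal sketch. The function name `fixed_size_chunks` is hypothetical, and a plain Python list stands in for the output of a real tokenizer (an assumption made to keep the example self-contained).

```python
def fixed_size_chunks(tokens, chunk_size=256, overlap=32):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder tokens; a real pipeline would use the embedding model's tokenizer.
tokens = [f"t{i}" for i in range(10)]
chunks = fixed_size_chunks(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With chunk_size=4 and overlap=1 the window advances 3 tokens at a time, so the last token of each chunk is repeated as the first token of the next.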
Recursive Character Splitting
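Popularised by LangChain's RecursiveCharacterTextSplitter, this strategy works through an ordered list of separators. LangChain's default list is '\n\n' (paragraphs), '\n' (lines), ' ' (words), '' (characters); a sentence separator such as '. ' is a common custom addition. The splitter splits on the first separator in the list that occurs in the text, merges adjacent pieces back together up to the size limit, and recurses with the remaining separators on any piece that is still too large. The result respects paragraph and sentence boundaries when possible and falls back gracefully when not. Note that chunk_size is measured in characters by default; counting tokens requires configuring a token-based length function.
It is the most widely used strategy because it is deterministic, fast, and produces noticeably better chunks than naive fixed-size splitting at no extra inference cost.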
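The recursion can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not LangChain's implementation: the function name `recursive_split` is hypothetical, length is counted in characters, and overlap is left out for brevity.

```python
def recursive_split(text, max_len=200, separators=("\n\n", "\n", ". ", " ", "")):
    """Split on the coarsest separator first; recurse with finer ones on oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, max_len, finer)

    chunks, buffer = [], ""
    pieces = [p for p in text.split(sep) if p.strip()]
    for piece in pieces:
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= max_len:
            buffer = candidate  # keep merging small pieces into one chunk
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(piece) <= max_len:
            buffer = piece
        else:
            # This piece alone is too long: try the finer separators on it.
            chunks.extend(recursive_split(piece, max_len, finer))
    if buffer:
        chunks.append(buffer)
    return chunks


text = "Para one is short.\n\nPara two is also short.\n\n" + "word " * 30
chunks = recursive_split(text, max_len=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

In this example the two short paragraphs survive intact, while the long run of words has no paragraph, line, or sentence breaks and is therefore split on spaces.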
Sentence-Window Retrieval
Sentence-window retrieval (introduced as a pattern in LlamaIndex) embeds individual sentences but stores a small window of surrounding sentences alongside each one. At query time, the retriever matches the query against the single-sentence embeddings; the stored window then replaces the matched sentence before the text is passed to the LLM, so the LLM has context.
The advantage is that the embedding represents one specific sentence — high signal — while the LLM still sees enough surrounding context to answer well. Particularly effective on long, dense documents (legal contracts, technical specifications) where the precise answer is in one sentence and surrounding text matters for context.
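The ingestion side of this pattern might look like the sketch below. It is an illustration only, not LlamaIndex's code: the function name `sentence_window_records`, the record keys, and the regex-based sentence splitter are all assumptions, and a production pipeline would use a proper sentence tokenizer.

```python
import re


def sentence_window_records(text, window=1):
    """Build one record per sentence: the sentence to embed plus its surrounding window."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        records.append({
            "embed_text": sentence,                 # what gets embedded and indexed
            "window": " ".join(sentences[lo:hi]),   # what the LLM sees after retrieval
        })
    return records


records = sentence_window_records("A one. B two. C three. D four.", window=1)
for record in records:
    print(record)
```

Only `embed_text` goes to the embedding model; `window` is stored as metadata and substituted in after the match.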
Semantic Chunking
Semantic chunking, popularised by Greg Kamradt's 2023 talks and built into LangChain and LlamaIndex, computes embeddings for every sentence and splits at the points where consecutive sentence embeddings differ most — the assumption being that those are the boundaries of distinct topics.
It is more expensive than the other strategies (it requires embedding every sentence at ingest time), and the quality lift over recursive splitting is uneven. It is worth trying on heterogeneous documents that mix topics within a single file, and rarely worth it on clean prose where paragraph breaks already encode topic shifts.
Semantic chunking is a frequently cited 'magic bullet' that turns out to be modest in practice. Benchmark it against recursive splitting on your own corpus before adopting it as the default.
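To make the breakpoint idea concrete, here is a small sketch under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the function names are hypothetical, and the percentile threshold is one common way to decide which gaps count as "differ most", not the only one.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Bag-of-words counts; a stand-in for a real embedding model (assumption)."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences, percentile=90, embed=toy_embed):
    """Split where the distance between consecutive sentence vectors is unusually large."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    gaps = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    # Threshold: the gap found at the requested percentile of all gaps.
    ranked = sorted(gaps)
    threshold = ranked[min(len(ranked) - 1, int(len(ranked) * percentile / 100))]
    chunks, current = [], [sentences[0]]
    for sentence, gap in zip(sentences[1:], gaps):
        if gap >= threshold and gap > 0:
            chunks.append(" ".join(current))  # topic shift: close the current chunk
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Cats purr when cats are happy.",
    "Happy cats purr and cats nap.",
    "Bond yields rose as bond prices fell.",
    "Bond prices fell while yields rose.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

The cost noted above shows up in `embed`: with a real model, that call runs once per sentence at ingest time.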
Structural Chunking
For documents with structure — Markdown, HTML, source code, PDFs with reliable headings — splitting by structural elements often outperforms text-only strategies. Markdown chunkers split at heading boundaries (preserving the heading path as metadata). Code chunkers split at function or class boundaries using a language-aware parser (Tree-sitter). PDF chunkers respect section structure when extraction quality permits.
Structural chunking is usually combined with size limits — split at structural boundaries, but recursively split any too-large chunk further.
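For Markdown, a heading-aware splitter with a size cap could be sketched as follows. The function name `markdown_chunks` and the " > " path format are hypothetical. The sketch hard-cuts oversized sections where a real pipeline would more likely hand them to a recursive splitter, and it does not skip '#' lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def markdown_chunks(markdown, max_chars=1000):
    """Split at headings, keeping the heading path as metadata and capping chunk size."""
    chunks, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        body.clear()
        heading_path = " > ".join(title for _, title in path)
        # Size limit: cut any section longer than max_chars into pieces.
        for start in range(0, len(text), max_chars):
            chunks.append({"heading_path": heading_path,
                           "text": text[start:start + max_chars]})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # the body collected so far belongs to the previous heading
            level, title = len(match.group(1)), match.group(2)
            # Drop headings at the same or a deeper level before adding the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, title))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API.\n# FAQ\nAsk away."
chunks = markdown_chunks(doc)
for chunk in chunks:
    print(chunk)
```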
| Strategy | Cost at ingest | Quality on prose | Quality on structure |
|---|---|---|---|
| Fixed-size | Lowest | Acceptable | Poor |
| Recursive character | Low | Good | Acceptable |
| Sentence-window | Low | Very good | Good |
| Semantic | High (embedding per sentence) | Modest lift over recursive | Limited benefit |
| Structural | Medium | N/A — preserves doc structure | Best |
Practical Defaults
- Start with recursive character splitting at 512 tokens with 64-token overlap; it is the strongest single default.
- Move to sentence-window for long-form dense documents where the answer is usually one sentence.
- Use structural chunking for Markdown, code, or any corpus with reliable headings.
- Store the original document id and chunk offset as metadata so you can stitch context back together at generation time.
- Always store the chunk's parent section heading (if any) as metadata; rerankers and LLMs both use it.
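The metadata advice in the last two bullets can be illustrated with a small sketch. The `Chunk` fields, the `stitch` helper, and the " [...] " gap marker are all hypothetical choices; the sketch assumes character offsets and that the caller has already filtered the chunks to a single document id.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str   # id of the source document
    start: int    # character offset of the chunk in the source document
    end: int      # exclusive end offset
    heading: str  # parent section heading, "" if none
    text: str


def stitch(chunks):
    """Merge chunks of one document in offset order, dropping overlapped characters."""
    pieces, cursor = [], None
    for chunk in sorted(chunks, key=lambda c: c.start):
        if cursor is None:
            pieces.append(chunk.text)
        elif chunk.start > cursor:
            pieces.append(" [...] " + chunk.text)  # mark text missing between chunks
        elif chunk.end > cursor:
            pieces.append(chunk.text[cursor - chunk.start:])  # skip the overlap
        cursor = chunk.end if cursor is None else max(cursor, chunk.end)
    return "".join(pieces)


source = "abcdefghijklmnopqrstuvwxyz"


def make(start, end):
    return Chunk("doc-1", start, end, "Alphabet", source[start:end])


print(stitch([make(8, 16), make(0, 10)]))
print(stitch([make(0, 5), make(20, 26)]))
```

Because each chunk records where it came from, overlapping neighbours can be merged without duplicated text, and non-adjacent hits are visibly separated rather than silently concatenated.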
References
- LangChain RecursiveCharacterTextSplitter documentation · LangChain
- LlamaIndex Sentence Window Retrieval · LlamaIndex Docs
- The 5 Levels of Text Splitting for Retrieval · Greg Kamradt, GitHub