Chunking
Splitting documents for retrieval
Breaking long documents into smaller pieces ('chunks') that can be embedded, stored, and searched individually — the invisible but critical step in RAG.
LLM context windows are limited, and many embedding models accept only a few hundred to a few thousand tokens per input. You can't embed a 500-page PDF as one blob. Chunking splits the document into smaller, meaningful pieces; each chunk is embedded, stored, and retrieved separately.
Critical parameters:
- Chunk size (typically 200–1000 tokens): too small = context lost, too big = retrieval precision drops.
- Overlap (typically 10–20%): nearby chunks share content so a sentence cut at the boundary isn't lost.
- Splitting strategy: by sentence, paragraph, heading, semantic boundary, or fixed token count.
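To make chunk size and overlap concrete, here is a minimal sketch of a sliding-window chunker. It assumes the text has already been turned into a list of tokens; the function names `chunk_tokens` and `expected_chunk_count` are hypothetical, and a real pipeline would count tokens with the embedding model's own tokenizer.

```python
import math

def chunk_tokens(tokens, chunk_size=500, overlap=75):
    """Slide a fixed-size window over tokens; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

def expected_chunk_count(n_tokens, chunk_size=500, overlap=75):
    """Closed-form count of the windows produced by chunk_tokens."""
    if n_tokens <= chunk_size:
        return 1 if n_tokens else 0
    return math.ceil((n_tokens - overlap) / (chunk_size - overlap))

# Stand-in tokens for illustration; 75 tokens is 15% of a 500-token chunk.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=500, overlap=75)
shared = chunks[0][-75:] == chunks[1][:75]  # tail of chunk 0 equals head of chunk 1
```

With these settings the window advances 425 tokens at a time, so raising the overlap increases the number of chunks (and the storage cost) for the same document.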
A common rule of thumb holds that roughly half of RAG quality lives in chunking. Many "bad retrieval" issues turn out to be "bad chunking" issues.
You want to add a book to a library. Writing the whole book on one catalog card means no one can find a specific topic ("this book is about everything"). Writing each page as a separate card fragments the context. The right move: split by chapter and include the table of contents.
A law firm has an archive of 200-page contracts. Naive approach: embed each contract as one document. Query: "rent increase rate". 5 contracts come back — but on which page? The user has to read each one.
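After chunking: each contract is split into chunks of ~500 tokens with 15% overlap, which for a 200-page contract means on the order of a few hundred chunks. The same query now returns 5 specific chunks, such as one containing "rent shall increase annually at the CPI rate", each tied to a known position in its contract. The user gets the answer in 3 seconds.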
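The retrieval step in this example can be sketched with a toy similarity search. Word counts stand in for real embeddings here (an assumption made only to keep the sketch self-contained), and the contract names, chunk numbers, and the second and third texts are invented for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_chunks(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c["text"])), reverse=True)
    return ranked[:k]

# Each stored chunk keeps a pointer back to its source document and position.
chunks = [
    {"contract": "lease-A", "chunk": 12, "text": "The rent shall increase annually at the CPI rate."},
    {"contract": "lease-A", "chunk": 3, "text": "The tenant shall keep the premises in good repair."},
    {"contract": "lease-B", "chunk": 7, "text": "Either party may terminate with ninety days notice."},
]
best = top_chunks("rent increase rate", chunks, k=1)[0]
```

Because every chunk carries its contract and position, the result points the user at a specific passage rather than at a whole 200-page document.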
When to use
- Building RAG over documents longer than about 1000 tokens
- Large knowledge base (wiki, manuals, contract archive)
- In-page navigation (telling the user 'this answer is in §3.2 of page X')
When not to use
- Short documents (< 1000 tokens): embed the whole thing
- Highly structured data (JSON, CSV): index by field instead
- One-shot Q&A with no RAG: chunking is unnecessary
Splitting at fixed token boundaries
Cutting mid-sentence destroys context. If 'because of X, the policy applies' lands in one chunk and '...only when Y is true' in the next, each chunk is misleading on its own. Respect sentence and paragraph boundaries.
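One way to respect sentence boundaries is to pack whole sentences into chunks until a token budget is reached. The sketch below assumes a naive regex sentence splitter and whitespace-separated words as a rough proxy for tokens; both are simplifications, and the function names are hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def pack_sentences(text, max_tokens=500):
    """Greedily pack whole sentences into chunks of at most max_tokens words.

    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))  # flush before the budget is exceeded
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = (
    "Because of X, the policy applies only when Y is true. "
    "Otherwise the default rule holds. Appeals go to the board."
)
sentence_chunks = pack_sentences(text, max_tokens=12)
```

With a 12-word budget the condition "only when Y is true" stays in the same chunk as the rule it qualifies, where a fixed 6-word cut would have separated them.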
No overlap
Without overlap, a sentence or idea that straddles a chunk boundary is split in two, and neither chunk retrieves well. A 10–20% overlap (or semantic chunking) is standard practice.
One strategy fits all
Legal text, code, and blog posts each call for different chunking. Legal: by clause. Code: by function. Blog: by paragraph and heading. There's no single recipe.
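As a rough sketch of per-type strategies, the dispatcher below picks a splitter by document type. It rests on several assumptions that are not from the original: clauses start with a number such as "1." or "2.3" at the beginning of a line, code is Python source, and blog posts use Markdown-style `#` headings separated from paragraphs by blank lines. All function names are hypothetical.

```python
import ast
import re

def split_legal(text):
    """Split before lines that open a numbered clause such as '1.' or '2.3'."""
    parts = re.split(r"^(?=\d+(?:\.\d+)*\.?\s)", text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

def split_code(source):
    """Return the source of each top-level function or class in a Python module."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]

def split_blog(text):
    """Split on blank lines and prefix each paragraph with the nearest heading."""
    chunks, heading = [], ""
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if block.startswith("#"):
            heading = block.lstrip("#").strip()  # remember the current heading
        elif block:
            chunks.append(f"{heading}\n{block}" if heading else block)
    return chunks

SPLITTERS = {"legal": split_legal, "code": split_code, "blog": split_blog}

def chunk_document(text, doc_type):
    """Dispatch to the splitter registered for doc_type."""
    return SPLITTERS[doc_type](text)

legal = "1. Rent is due monthly.\n2. Rent shall increase annually at the CPI rate.\n"
code = "def a():\n    return 1\n\ndef b():\n    return 2\n"
blog = "# Chunking\n\nSplit by paragraph.\n\nKeep the heading."

legal_chunks = chunk_document(legal, "legal")
code_chunks = chunk_document(code, "code")
blog_chunks = chunk_document(blog, "blog")
```

Carrying the heading into each blog chunk mirrors the library analogy of splitting by chapter while keeping the table of contents.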