Chunking — Explained

Definition

LLM context windows are limited, and many embedding models accept only a few hundred to a few thousand tokens per input. You can't embed a 500-page PDF as one blob. Chunking splits the document into smaller, meaningful pieces; each chunk is embedded, stored, and retrieved separately.

Critical parameters:
- Chunk size (typically 200–1000 tokens): too small = context lost, too big = retrieval precision drops.
- Overlap (typically 10–20%): nearby chunks share content so a sentence cut at the boundary isn't lost.
- Splitting strategy: by sentence, paragraph, heading, semantic boundary, or fixed token count.
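A minimal sketch of how chunk size and overlap interact, assuming whitespace-separated words as a stand-in for tokens (a real pipeline would count tokens with the embedding model's own tokenizer). The function name `chunk_tokens` and its defaults are illustrative, not taken from any library.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=75):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    stride = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Assumption: words approximate tokens for the sake of the demo.
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
# one two three four
# four five six seven
# seven eight nine ten
```

Each window repeats the last word of the previous one, which is the overlap at work. This sketch still cuts at arbitrary positions; it only shows the size and overlap parameters, not a splitting strategy.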

As a rule of thumb, roughly half of RAG quality is decided by chunking. Many "bad retrieval" issues turn out to be "bad chunking" issues.

Analogy

You want to add a book to a library. Writing the whole book on one catalog card means no one can find a specific topic ("this book is about everything"). Writing each page as a separate card fragments the context. The right move: split by chapter and include the table of contents.

Real-world example

A law firm has an archive of 200-page contracts. Naive approach: embed each contract as one document. Query: "rent increase rate". 5 contracts come back — but on which page? The user has to read each one.

After chunking: each contract becomes a few hundred chunks (~500 tokens each, 15% overlap). The same query now returns 5 specific chunks like "rent shall increase annually at the CPI rate". The user gets the answer in 3 seconds.
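The number of chunks per document follows directly from its token length, the chunk size, and the overlap. A rough sketch of that arithmetic is below; `chunk_count` is a hypothetical helper, and the token totals in the usage lines are illustrative assumptions, not measurements of any real contract.

```python
def chunk_count(total_tokens, chunk_size, overlap):
    """Number of sliding-window chunks needed to cover `total_tokens`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # new tokens contributed by each extra chunk
    remaining = total_tokens - chunk_size
    return 1 + (remaining + stride - 1) // stride  # integer ceiling division


# 15% overlap on 500-token chunks is 75 tokens, so each chunk adds 425 new tokens.
print(chunk_count(12_825, 500, 75))   # 30
print(chunk_count(100_000, 500, 75))  # 236
```

Thirty chunks at these settings cover about 12,800 tokens. A document of roughly 100,000 tokens (an assumed ~500 tokens per page over 200 pages) needs a few hundred.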

When to use

You are building RAG and your documents run past 1000 tokens
Large knowledge base (wiki, manuals, contract archive)
In-page navigation (telling the user 'this answer is in §3.2 of page X')
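For the in-page navigation case, each chunk has to carry its location alongside its text. One possible shape is sketched here; the `Chunk` fields, the `cite` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str      # the passage that gets embedded
    doc_id: str    # which document the passage came from
    page: int      # page number in the source document
    section: str   # section label, e.g. "3.2"


def cite(chunk: Chunk) -> str:
    """Build a human-readable pointer back to the source location."""
    return f"{chunk.doc_id}, §{chunk.section}, page {chunk.page}"


# Hypothetical retrieved chunk.
hit = Chunk(
    text="Rent shall increase annually at the CPI rate.",
    doc_id="lease-0042",
    page=17,
    section="3.2",
)
print(cite(hit))  # lease-0042, §3.2, page 17
```

Only `text` is embedded; the other fields travel as metadata so the answer can point the user to the exact spot.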

When not to use

Short documents (< 1000 tokens) — embed the whole thing
Highly structured data (JSON, CSV) — index by field instead
One-shot Q&A with no RAG — chunking is unnecessary

Common pitfalls

Splitting at fixed token boundaries

Cutting mid-sentence destroys context. If 'because of X, the policy applies' ends chunk 1 and '...only when Y is true' opens chunk 2, each chunk is misleading on its own. Split at sentence or paragraph boundaries instead.
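One way to respect sentence boundaries is to pack whole sentences into a chunk until a token budget is reached. The sketch below assumes a naive regex sentence splitter and uses word counts as a proxy for tokens; `split_sentences` and `pack_sentences` are illustrative names.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(text, max_tokens=500):
    """Group whole sentences into chunks of at most `max_tokens` words.

    A single sentence longer than the budget becomes its own oversized chunk
    rather than being cut in the middle.
    """
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())  # assumption: words approximate tokens
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Because of X, the policy applies only when Y is true. "
    "Refunds take ten days. Contact support first."
)
for chunk in pack_sentences(text, max_tokens=12):
    print(chunk)
# Because of X, the policy applies only when Y is true.
# Refunds take ten days. Contact support first.
```

The condition and its qualifier stay in the same chunk because the boundary can only fall between sentences. The regex will mis-split abbreviations such as "e.g."; a dedicated sentence tokenizer would handle those better.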

No overlap

Without overlap, any statement that straddles a chunk boundary is split in two, and neither half retrieves well on its own. 10–20% overlap (or semantic chunking) is standard practice.

One strategy fits all

Legal text, code, and blog posts each call for different chunking. Legal: by clause. Code: by function. Blog: by paragraph + heading. There's no single recipe.
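A rough sketch of per-type dispatch is below. It rests on several assumptions: legal clauses start on a line with a number such as "1." or "2.3", code is Python source, and blog posts are Markdown with "#" headings separated from paragraphs by blank lines. All function names and the `SPLITTERS` table are hypothetical.

```python
import ast
import re


def split_legal(text):
    """Start a new chunk at each line that begins with a clause number."""
    parts = re.split(r"(?m)^(?=\d+\.(?:\d+\.?)*\s)", text)
    return [p.strip() for p in parts if p.strip()]


def split_code(source):
    """One chunk per top-level function or class; other statements are skipped."""
    tree = ast.parse(source)
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return [ast.get_source_segment(source, node)
            for node in tree.body if isinstance(node, wanted)]


def split_blog(text):
    """One chunk per paragraph, prefixed with the most recent heading."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, heading = [], ""
    for p in paragraphs:
        if p.startswith("#"):
            heading = p  # remember the heading, do not emit it alone
        else:
            chunks.append(f"{heading}\n{p}" if heading else p)
    return chunks


SPLITTERS = {"legal": split_legal, "code": split_code, "blog": split_blog}


def chunk_document(text, doc_type):
    """Pick the splitter registered for `doc_type`."""
    try:
        splitter = SPLITTERS[doc_type]
    except KeyError:
        raise ValueError(f"no chunking strategy for {doc_type!r}") from None
    return splitter(text)


contract = "1. Term\nThe lease runs one year.\n2. Rent\nRent increases annually.\n"
print(chunk_document(contract, "legal"))
# ['1. Term\nThe lease runs one year.', '2. Rent\nRent increases annually.']
```

Keeping the splitters in a lookup table makes it cheap to add a new document type without touching the others.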