AI Dictionary
Intermediate · ~2 min read · #context #token #llm

Context Window

How much an LLM can see at once

The maximum number of tokens an LLM can attend to in a single request, prompt and response combined.

[Diagram] What the model can see at once: a context window of N tokens. New tokens arrive on one side while old tokens drop off the other; a bigger window means more text the model can reason about at once.
Definition

Every LLM has a context-window limit: 8K, 32K, 128K, 1M tokens, even 10M+ on some. Prompt + system message + chat history + documents + response all have to fit. Once full, older messages drop or you get an error.
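A quick fit-check is easy to sketch with the tiktoken library (cl100k_base here is an approximation; every model family has its own tokenizer, so treat the counts as estimates):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # rough stand-in for the model's real tokenizer

    def count_tokens(text: str) -> int:
        return len(enc.encode(text))

    CONTEXT_WINDOW = 128_000    # e.g. GPT-4o
    RESPONSE_BUDGET = 4_000     # reserve room for the answer

    system_msg = "You are a helpful support bot."
    history = ["...earlier turns..."]
    documents = ["...pasted document text..."]
    question = "Why was my order delayed?"

    used = sum(count_tokens(t) for t in [system_msg, question, *history, *documents])
    print(f"input tokens: {used}, room left: {CONTEXT_WINDOW - used}")
    if used + RESPONSE_BUDGET > CONTEXT_WINDOW:
        print("over budget: trim history, summarize, or switch to RAG")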

Bigger isn't strictly better. The "lost in the middle" effect: models pay more attention to the start and end of long inputs, and skim the middle. Critical info buried in a 1M-token prompt may help less than a well-structured 32K one.

When the window fills up: summarization, sliding window, RAG to fetch only relevant chunks, prompt caching for static parts.
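The sliding window is simple to sketch in pure Python; the character-based count_tokens below is a crude stand-in for a real tokenizer:

    def count_tokens(text: str) -> int:
        # crude estimate: roughly 4 characters per token
        return max(1, len(text) // 4)

    def sliding_window(messages: list[dict], budget: int) -> list[dict]:
        """Keep the most recent messages whose combined size fits the token budget."""
        kept, used = [], 0
        for msg in reversed(messages):        # walk newest to oldest
            cost = count_tokens(msg["content"])
            if used + cost > budget:
                break                         # everything older is dropped
            kept.append(msg)
            used += cost
        return list(reversed(kept))           # restore chronological order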

Analogy

Picture the monitor on your desk. The LLM only sees what fits on that screen at any moment. If a document doesn't fit, the model can't read it. Bigger monitor = more visible at once — but still not the whole library.

Real-world example

A support bot remembers chat history. After 50 turns, 80% of the context window is history. No room left for the new question.

Fixes:
  1. Rolling window: keep the last 20 messages, drop the rest.
  2. Summarization: replace older history with a 200-token summary.
  3. RAG: store all messages in a vector DB, fetch only relevant ones.
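A sketch of fixes 1 and 2 combined; summarize() is a hypothetical helper that in practice would be one extra LLM call:

    def summarize(messages: list[dict]) -> str:
        # hypothetical: an extra LLM call that compresses old turns into ~200 tokens
        return "Earlier conversation, summarized: " + "; ".join(m["content"][:40] for m in messages)

    def compact_history(messages: list[dict], keep_last: int = 20) -> list[dict]:
        """Keep the newest turns verbatim, fold everything older into one summary message."""
        if len(messages) <= keep_last:
            return messages
        old, recent = messages[:-keep_last], messages[-keep_last:]
        return [{"role": "system", "content": summarize(old)}] + recent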

GPT-4o (128K), Claude Sonnet (200K), and Gemini 2.5 Pro (1M) stretch this limit to different degrees. But there's no free lunch: long windows are expensive, slow, and prone to lost-in-the-middle.

When to use
  • Long document analysis (legal, academic, technical manuals)
  • Multi-turn conversation — when carrying history matters
  • Multimodal: images burn tokens fast and fill the window
  • Complex agent workflows — tool results + plan + context
When not to use
  • Single-shot, short tasks — big-window cost is wasted
  • Questions solvable with RAG — use retrieval instead of long context
  • Latency-sensitive paths — long context = long response time
Common pitfalls

The 'paste the whole doc' trap

Pasting a 100-page PDF into the prompt rarely works. Expensive, lost-in-the-middle, the model conflates topics. Chunking + RAG is usually better.
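Chunking itself is a few lines; a sketch with fixed-size character windows and overlap (real pipelines usually split on sentence or heading boundaries and embed each chunk for retrieval):

    def chunk(text: str, size: int = 2_000, overlap: int = 200) -> list[str]:
        """Split a long document into overlapping pieces small enough to retrieve one at a time."""
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start:start + size])
            start += size - overlap           # overlap keeps sentences from being cut in half
        return chunks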

Forgetting output tokens

Context window = input + output! With 128K total and 100K input, only 28K is left for the response. Keep a buffer.
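Worth spelling out when you set max_tokens; a tiny sketch with the numbers above plus a safety buffer:

    CONTEXT_WINDOW = 128_000   # the model's total window (input + output)
    input_tokens   = 100_000   # measured size of the prompt
    safety_buffer  = 1_000     # slack for tokenizer differences

    max_output = CONTEXT_WINDOW - input_tokens - safety_buffer
    print(max_output)          # 27000 tokens available for the response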

Not using prompt caching

Static prompts (system message, instructions, examples) get re-sent with every request. Anthropic, OpenAI, and Google all support prompt caching, which can cut that cost by up to 90%.
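With the Anthropic Python SDK, for example, caching is opt-in per content block via cache_control (a sketch; the model name is illustrative and field details may change, so check the current docs). OpenAI caches long prompt prefixes automatically and Gemini has an explicit cached-content API; in all cases the static part belongs at the front of the prompt.

    import anthropic

    client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

    LONG_STATIC_INSTRUCTIONS = "You are a support bot. ..."   # imagine thousands of tokens of policy + examples

    response = client.messages.create(
        model="claude-sonnet-4-20250514",    # illustrative model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": LONG_STATIC_INSTRUCTIONS,
                "cache_control": {"type": "ephemeral"},   # cache this block across requests
            }
        ],
        messages=[{"role": "user", "content": "New customer question goes here"}],
    )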