AI Dictionary
Intermediate · ~2 min read · #context #token #llm

Context Window

How much an LLM can see at once

The maximum number of tokens an LLM can attend to in a single request, prompt and response combined.

[Diagram] What the model can see at once: a context window of N tokens. New tokens arrive on one side while old tokens drop off the other; a bigger window means more text the model can reason about at once.
Definition

Every LLM has a context-window limit: 8K, 32K, 128K, 1M tokens, even 10M+ on some. Prompt + system message + chat history + documents + response all have to fit. Once full, older messages drop or you get an error.
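A quick fit-check is easy to sketch with the tiktoken library (cl100k_base here is an approximation; every model family has its own tokenizer, so treat the counts as estimates):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # rough stand-in for the model's real tokenizer

    def count_tokens(text: str) -> int:
        return len(enc.encode(text))

    CONTEXT_WINDOW = 128_000    # e.g. GPT-4o
    RESPONSE_BUDGET = 4_000     # reserve room for the answer

    system_msg = "You are a helpful support bot."
    history = ["...earlier turns..."]
    documents = ["...pasted document text..."]
    question = "Why was my order delayed?"

    used = sum(count_tokens(t) for t in [system_msg, question, *history, *documents])
    print(f"input tokens: {used}, room left: {CONTEXT_WINDOW - used}")
    if used + RESPONSE_BUDGET > CONTEXT_WINDOW:
        print("over budget: trim history, summarize, or switch to RAG")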

Bigger isn't strictly better. The "lost in the middle" effect: models pay more attention to the start and end of long inputs, and skim the middle. Critical info buried in a 1M-token prompt may help less than a well-structured 32K one.

When the window fills up: summarization, sliding window, RAG to fetch only relevant chunks, prompt caching for static parts.
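The sliding window is simple to sketch in pure Python; the character-based count_tokens below is a crude stand-in for a real tokenizer:

    def count_tokens(text: str) -> int:
        # crude estimate: roughly 4 characters per token
        return max(1, len(text) // 4)

    def sliding_window(messages: list[dict], budget: int) -> list[dict]:
        """Keep the most recent messages whose combined size fits the token budget."""
        kept, used = [], 0
        for msg in reversed(messages):        # walk newest to oldest
            cost = count_tokens(msg["content"])
            if used + cost > budget:
                break                         # everything older is dropped
            kept.append(msg)
            used += cost
        return list(reversed(kept))           # restore chronological order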

Analogy

Picture the monitor on your desk. The LLM only sees what fits on that screen at any moment. If a document doesn't fit, the model can't read it. Bigger monitor = more visible at once — but still not the whole library.

Real-world example

A support bot remembers chat history. After 50 turns, 80% of the context window is history. No room left for the new question.

Fixes:
  1. Rolling window: keep the last 20 messages, drop the rest.
  2. Summarization: replace older history with a 200-token summary.
  3. RAG: store all messages in a vector DB, fetch only relevant ones.
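A sketch of fixes 1 and 2 combined; summarize() is a hypothetical helper that in practice would be one extra LLM call:

    def summarize(messages: list[dict]) -> str:
        # hypothetical: an extra LLM call that compresses old turns into ~200 tokens
        return "Earlier conversation, summarized: " + "; ".join(m["content"][:40] for m in messages)

    def compact_history(messages: list[dict], keep_last: int = 20) -> list[dict]:
        """Keep the newest turns verbatim, fold everything older into one summary message."""
        if len(messages) <= keep_last:
            return messages
        old, recent = messages[:-keep_last], messages[-keep_last:]
        return [{"role": "system", "content": summarize(old)}] + recent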

GPT-4o (128K), Claude Sonnet (200K), and Gemini 2.5 Pro (1M) stretch this limit to different degrees. But there's no free lunch: long windows are expensive, slow, and prone to lost-in-the-middle.

When to use
  • Long document analysis (legal, academic, technical manuals)
  • Multi-turn conversation — when carrying history matters
  • Multimodal: images burn tokens fast and fill the window
  • Complex agent workflows — tool results + plan + context
When not to use
  • Single-shot, short tasks — big-window cost is wasted
  • Questions solvable with RAG — use retrieval instead of long context
  • Latency-sensitive paths — long context = long response time
Common pitfalls

The 'paste the whole doc' trap

Pasting a 100-page PDF into the prompt rarely works. Expensive, lost-in-the-middle, the model conflates topics. Chunking + RAG is usually better.
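Chunking itself is a few lines; a sketch with fixed-size character windows and overlap (real pipelines usually split on sentence or heading boundaries and embed each chunk for retrieval):

    def chunk(text: str, size: int = 2_000, overlap: int = 200) -> list[str]:
        """Split a long document into overlapping pieces small enough to retrieve one at a time."""
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start:start + size])
            start += size - overlap           # overlap keeps sentences from being cut in half
        return chunks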

Forgetting output tokens

Context window = input + output! With 128K total and 100K input, only 28K is left for the response. Keep a buffer.
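Worth spelling out when you set max_tokens; a tiny sketch with the numbers above plus a safety buffer:

    CONTEXT_WINDOW = 128_000   # the model's total window (input + output)
    input_tokens   = 100_000   # measured size of the prompt
    safety_buffer  = 1_000     # slack for tokenizer differences

    max_output = CONTEXT_WINDOW - input_tokens - safety_buffer
    print(max_output)          # 27000 tokens available for the response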

Not using prompt caching

Static prompts (system message, instructions, examples) get re-sent with every request. Anthropic, OpenAI, and Google all support prompt caching, which can cut that cost by up to 90%.
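With the Anthropic Python SDK, for example, caching is opt-in per content block via cache_control (a sketch; the model name is illustrative and field details may change, so check the current docs). OpenAI caches long prompt prefixes automatically and Gemini has an explicit cached-content API; in all cases the static part belongs at the front of the prompt.

    import anthropic

    client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

    LONG_STATIC_INSTRUCTIONS = "You are a support bot. ..."   # imagine thousands of tokens of policy + examples

    response = client.messages.create(
        model="claude-sonnet-4-20250514",    # illustrative model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": LONG_STATIC_INSTRUCTIONS,
                "cache_control": {"type": "ephemeral"},   # cache this block across requests
            }
        ],
        messages=[{"role": "user", "content": "New customer question goes here"}],
    )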