num_ctx
Runtime context length parameter
On local inference engines (Ollama, llama.cpp), this sets the actual context-window memory the model allocates at runtime.
The context window is the model's theoretical maximum (Llama 3.1 8B = 128K). num_ctx is the slice of it you actually allocate at runtime. Most engines default low (Ollama: 2048, llama.cpp: 512!) because KV-cache VRAM scales linearly with context size.
Math: KV cache ≈ 2 (K and V) × num_layers × num_kv_heads × head_dim × num_ctx × bytes_per_value. On Llama 3.1 8B (32 layers, 8 KV heads via GQA, head_dim 128, fp16 cache) that's ~131 KB per context token. 8K context → ~1 GB of VRAM. 32K → ~4 GB. 128K → ~16 GB just for the KV cache, on top of the model weights.
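A quick sanity check of that arithmetic, assuming the Llama 3.1 8B shape above (32 layers, 8 KV heads, head_dim 128, fp16 cache); this is plain shell math, nothing is queried from the model:
```
# bytes ≈ 2 (K and V) × layers × kv_heads × head_dim × bytes/value × num_ctx
echo $((2 * 32 * 8 * 128 * 2))           # per token: 131072 bytes = 128 KiB
echo $((2 * 32 * 8 * 128 * 2 * 8192))    # num_ctx=8192:   1 GiB
echo $((2 * 32 * 8 * 128 * 2 * 32768))   # num_ctx=32768:  4 GiB
echo $((2 * 32 * 8 * 128 * 2 * 131072))  # num_ctx=131072: 16 GiB
```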
Provider variations (launch examples below):
- Ollama: OLLAMA_CONTEXT_LENGTH=8192 env var, or PARAMETER num_ctx 8192 in the Modelfile. Default 2048.
- llama.cpp: -c 8192 flag. Default 512 (surprisingly low).
- vLLM: --max-model-len 8192. By default it uses the model's full context length; typical deployments set 4K-32K to save VRAM.
- OpenAI/Anthropic: not exposed; the API manages the context window for you.
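For comparison, the same 8K setting on each local engine. Binary names and model identifiers are illustrative; adjust them to whatever your install provides:
```
# Ollama: server-wide default via environment variable
OLLAMA_CONTEXT_LENGTH=8192 ollama serve

# llama.cpp: per-run flag (same flag for llama-cli and llama-server)
llama-server -m ./llama-3.1-8b-q4.gguf -c 8192

# vLLM: cap the context length to keep KV-cache VRAM in check
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192
```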
Think of it as choosing how many pages your notebook has. The model's "memory capacity" is finite and you reserve the pages up front: a thin notebook (num_ctx=2048) is fast but forgetful; a thick one (128K) takes up a lot of desk space (VRAM) but remembers long conversations.
Running Llama 3.1 8B on Ollama for chat:
```
ollama run llama3.1:8b
> Pasting a 50K-token doc...
> [model only "sees" the last 2048 tokens]
```
Cause: num_ctx defaults to 2048, so all but the last ~2K tokens of the document are silently dropped.
Fix: start the server with OLLAMA_CONTEXT_LENGTH=32768 ollama serve, or use a Modelfile:
```
FROM llama3.1:8b
PARAMETER num_ctx 32768
```
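Then build and run a model from that Modelfile; the tag name below is just an example:
```
ollama create llama3.1-32k -f Modelfile
ollama run llama3.1-32k
```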
Now 32K of the 50K-token doc fits in the window; the remaining ~18K tokens are still truncated.
VRAM: num_ctx=2048 → ~5 GB total (8B Q4 + cache). num_ctx=32768 → ~9 GB. Fits comfortably on a 16 GB Mac.
When to raise it:
- RAG or long-doc analysis: the default 2K is not enough
- Multi-turn chat: 8K+ to keep the history
- Code editing: 16K+ for the function plus surrounding context
- VRAM headroom available: go ahead, set 32K-128K
When to keep it small (or cap it):
- One-shot short queries: 2K-4K is fine, save the VRAM
- Tight VRAM (8 GB Mac) with a big model (13B+): trade off num_ctx against model size
- Beyond the model's true maximum: Llama 3.1 caps at 128K; asking for more errors out
The 2048 default surprise
Llama 3.1 says 'supports 128K' but Ollama defaults to 2048! For long context, set it explicitly. 80% of 'why doesn't it understand the long doc' bugs trace back to this.
Skipping the VRAM math
num_ctx=128K with an 8B Q4 model on a 16 GB Mac → OOM. Compute it first: model weights + (2 × num_layers × num_kv_heads × head_dim × num_ctx × bytes_per_value). Quantization (Q4) shrinks the weights, not the KV cache.
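A rough budget for that failing setup, assuming ~4.5 GB of Q4 weights and the same Llama 3.1 8B KV-cache shape as above:
```
# weights + KV cache at num_ctx=131072
echo $((4500000000 + 2 * 32 * 8 * 128 * 2 * 131072))  # ≈ 21.7 GB -> OOM on a 16 GB Mac
```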
Pushing past the model's real ceiling
'128K supported' doesn't mean it produces good output at 128K. Lost-in-the-middle is real. Cap at 32K and use RAG to manage the window — better results.