AI Dictionary
Intermediate · ~2 min read · #num_ctx #context-length #inference

num_ctx

Runtime context length parameter

On local inference engines (Ollama, llama.cpp), this sets the actual context-window memory the model allocates at runtime.

Allocated context = allocated VRAM (Llama 3.1 8B · KV cache footprint):
  • num_ctx=2K → 0.27 GB
  • num_ctx=8K → 1.05 GB
  • num_ctx=32K → 4.20 GB
  • num_ctx=128K → 16.80 GB
Ollama defaults to 2048 — bump it explicitly for long documents.
Definition

The context window is the model's theoretical max (Llama 3.1 8B = 128K). num_ctx is the memory you actually allocate at runtime. Most engines default low (Ollama: 2048, llama.cpp: 512!) because KV-cache memory scales linearly with context size.

Math: KV cache ≈ 2 (K+V) × num_layers × num_kv_heads × head_dim × num_ctx × bytes per element. On Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128, fp16 cache) that's ~131 KB per context token. 8K context → ~1 GB VRAM. 32K → ~4 GB. 128K → ~17 GB just for the KV cache, on top of model weights.
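
A quick sketch of that math in Python, using the published Llama 3.1 8B shape (32 layers, 8 KV heads, head_dim 128) and assuming the common fp16 cache; the results line up with the footprint table above, give or take rounding:

```python
# Back-of-the-envelope KV cache sizing for Llama 3.1 8B.
# Shape values come from the published model config; fp16 cache is an
# assumption (quantized KV caches would be smaller).

def kv_cache_bytes(num_ctx: int,
                   num_layers: int = 32,     # Llama 3.1 8B
                   num_kv_heads: int = 8,    # GQA: 8 KV heads, not 32
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:  # fp16
    # The leading 2 accounts for storing both K and V per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_ctx

for ctx in (2048, 8192, 32768, 131072):
    print(f"num_ctx={ctx:>6}: {kv_cache_bytes(ctx) / 1e9:.2f} GB")
# num_ctx=  2048: 0.27 GB
# num_ctx=  8192: 1.07 GB
# num_ctx= 32768: 4.29 GB
# num_ctx=131072: 17.18 GB
```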

Provider variations:
  • Ollama: OLLAMA_CONTEXT_LENGTH=8192 env var or PARAMETER num_ctx 8192 in the Modelfile. Default 2048.
  • llama.cpp: -c 8192 flag. Default 512 (surprisingly low).
  • vLLM: --max-model-len 8192. Rarely smaller; typical 4K-32K.
  • OpenAI/Anthropic: not exposed — API handles it.
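
Ollama also accepts num_ctx per request, inside the options field of its REST API. A minimal sketch, assuming a local Ollama server on the default port 11434 with llama3.1:8b already pulled:

```python
# Per-request num_ctx override via Ollama's /api/generate endpoint.
import json
import urllib.request

payload = {
    "model": "llama3.1:8b",
    "prompt": "Summarize the attached document...",
    "stream": False,
    "options": {"num_ctx": 8192},  # runtime context allocation for this call
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```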

Analogy

Setting how many pages a notebook has. The model's "memory capacity" is finite — you reserve pages up front. 2 pages (num_ctx=2048) = fast but forgetful. 100 pages (128K) = takes lots of desk space (VRAM) but remembers long conversations.

Real-world example

Running Llama 3.1 8B on Ollama for chat:

```
ollama run llama3.1:8b
> Pasting a 50K-token doc...
> [model only "sees" the last 2048 tokens]
```

Cause: num_ctx defaults to 2048, so most of the doc is silently dropped.

Fix: start the server with OLLAMA_CONTEXT_LENGTH=32768 ollama serve, or set it in a Modelfile:

```
FROM llama3.1:8b
PARAMETER num_ctx 32768
```

Now 32K of the 50K-token doc fits in context; the remaining ~18K is still truncated.

VRAM: num_ctx=2048 → ~5 GB total (8B Q4 + cache). num_ctx=32768 → ~9 GB. Fits comfortably on a 16 GB Mac.

When to use
  • RAG or long-doc analysis — default 2K is not enough
  • Multi-turn chat — 8K+ to keep history
  • Code editing — 16K+ for function + context
  • VRAM headroom available — go ahead, set 32K-128K
When not to use
  • One-shot short queries — 2K-4K is fine, save VRAM
  • Tight VRAM (8 GB Mac) with a big model (13B+) — trade off num_ctx vs model size
  • Beyond the model's true max — Llama 3.1 caps at 128K; setting more just errors out
Common pitfalls

The 2048 default surprise

Llama 3.1 says 'supports 128K', but Ollama defaults to 2048! For long context, set num_ctx explicitly. 80% of 'why doesn't it understand the long doc' bugs trace back to this.
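
One way to catch this: ask the running Ollama server what the model actually carries. A sketch against the /api/show endpoint; the request/response field names have shifted across Ollama versions, so treat them as assumptions and inspect the raw JSON if they don't match:

```python
# Sanity check (sketch): dump the PARAMETER overrides Ollama has for a model.
# If num_ctx doesn't appear here or in your request options, you're running
# on the 2048 default.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/show",
    data=json.dumps({"model": "llama3.1:8b"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    info = json.loads(resp.read())

# "parameters" is a string of Modelfile PARAMETER lines (may be absent).
print(info.get("parameters", "<no PARAMETER overrides set>"))
```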

Skipping the VRAM math

num_ctx=128K, 8B Q4 model, 16 GB Mac → OOM. Compute first: model weights + (num_layers × num_kv_heads × head_dim × 4 × num_ctx) bytes, where the factor of 4 is K and V at 2 bytes each (fp16). Quantization (Q4) shrinks the weights, not the cache.
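
A rough pre-flight check in the same spirit; the ~4.7 GB Q4 weight size and the 16 GB budget are assumptions to swap for your own model and machine:

```python
# "Will it fit?" check before raising num_ctx: model weights plus KV cache
# must fit in available (V)RAM. Weight size is an assumption for an 8B model
# at Q4 (~4.7 GB); runtimes add extra overhead on top.

def fits(num_ctx: int, weights_gb: float = 4.7, budget_gb: float = 16.0,
         num_layers: int = 32, num_kv_heads: int = 8,
         head_dim: int = 128, bytes_per_elem: int = 2) -> bool:
    kv_gb = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_ctx / 1e9
    total = weights_gb + kv_gb
    print(f"num_ctx={num_ctx}: ~{total:.1f} GB needed of {budget_gb} GB")
    return total <= budget_gb

fits(32768)    # ~9.0 GB  -> fits on a 16 GB Mac
fits(131072)   # ~21.9 GB -> does not fit; expect OOM or heavy swapping
```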

Pushing past the model's real ceiling

'128K supported' doesn't mean it produces good output at 128K. Lost-in-the-middle is real. Capping at 32K and using RAG to manage the window usually gives better results.