num_ctx
Runtime context length parameter
On local inference engines (Ollama, llama.cpp), this sets the actual context-window memory the model allocates at runtime.
The context window is the model's theoretical maximum (Llama 3.1 8B = 128K). num_ctx is the slice of it you actually allocate at runtime. Most engines default low (Ollama: 2048, llama.cpp: 512!) because KV-cache VRAM scales linearly with context size.
Math: KV cache ≈ 2 (K and V) × num_layers × num_kv_heads × head_dim × num_ctx × bytes_per_value. On Llama 3.1 8B (32 layers, 8 KV heads via GQA, head_dim 128, fp16 cache) that's ~131 KB per context token. 8K context → ~1 GB of VRAM. 32K → ~4 GB. 128K → ~16 GB just for the KV cache, on top of the model weights.
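A quick sanity check of that arithmetic, assuming the Llama 3.1 8B shape above (32 layers, 8 KV heads, head_dim 128, fp16 cache); this is plain shell math, nothing is queried from the model:
```
# bytes ≈ 2 (K and V) × layers × kv_heads × head_dim × bytes/value × num_ctx
echo $((2 * 32 * 8 * 128 * 2))           # per token: 131072 bytes = 128 KiB
echo $((2 * 32 * 8 * 128 * 2 * 8192))    # num_ctx=8192:   1 GiB
echo $((2 * 32 * 8 * 128 * 2 * 32768))   # num_ctx=32768:  4 GiB
echo $((2 * 32 * 8 * 128 * 2 * 131072))  # num_ctx=131072: 16 GiB
```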
Provider variations (launch examples below):
- Ollama: OLLAMA_CONTEXT_LENGTH=8192 env var, or PARAMETER num_ctx 8192 in the Modelfile. Default 2048.
- llama.cpp: -c 8192 flag. Default 512 (surprisingly low).
- vLLM: --max-model-len 8192. By default it uses the model's full context length; typical deployments set 4K-32K to save VRAM.
- OpenAI/Anthropic: not exposed; the API manages the context window for you.
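For comparison, the same 8K setting on each local engine. Binary names and model identifiers are illustrative; adjust them to whatever your install provides:
```
# Ollama: server-wide default via environment variable
OLLAMA_CONTEXT_LENGTH=8192 ollama serve

# llama.cpp: per-run flag (same flag for llama-cli and llama-server)
llama-server -m ./llama-3.1-8b-q4.gguf -c 8192

# vLLM: cap the context length to keep KV-cache VRAM in check
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192
```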
Think of it as choosing how many pages your notebook has. The model's "memory capacity" is finite and you reserve the pages up front: a thin notebook (num_ctx=2048) is fast but forgetful; a thick one (128K) takes up a lot of desk space (VRAM) but remembers long conversations.
Running Llama 3.1 8B on Ollama for chat:
```
ollama run llama3.1:8b
> Pasting a 50K-token doc...
> [model only "sees" the last 2048 tokens]
```
Cause: num_ctx defaults to 2048, so all but the last ~2K tokens of the document are silently dropped.
Fix: start the server with OLLAMA_CONTEXT_LENGTH=32768 ollama serve, or use a Modelfile:
```
FROM llama3.1:8b
PARAMETER num_ctx 32768
```
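Then build and run a model from that Modelfile; the tag name below is just an example:
```
ollama create llama3.1-32k -f Modelfile
ollama run llama3.1-32k
```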
Now 32K of the 50K-token doc fits in the window; the remaining ~18K tokens are still truncated.
VRAM: num_ctx=2048 → ~5 GB total (8B Q4 + cache). num_ctx=32768 → ~9 GB. Fits comfortably on a 16 GB Mac.
When to raise it:
- RAG or long-doc analysis: the default 2K is not enough
- Multi-turn chat: 8K+ to keep the history
- Code editing: 16K+ for the function plus surrounding context
- VRAM headroom available: go ahead, set 32K-128K
When to keep it small (or cap it):
- One-shot short queries: 2K-4K is fine, save the VRAM
- Tight VRAM (8 GB Mac) with a big model (13B+): trade off num_ctx against model size
- Beyond the model's true maximum: Llama 3.1 caps at 128K; asking for more errors out
The 2048 default surprise
Llama 3.1 says 'supports 128K' but Ollama defaults to 2048! For long context, set it explicitly. 80% of 'why doesn't it understand the long doc' bugs trace back to this.
Skipping the VRAM math
num_ctx=128K with an 8B Q4 model on a 16 GB Mac → OOM. Compute it first: model weights + (2 × num_layers × num_kv_heads × head_dim × num_ctx × bytes_per_value). Quantization (Q4) shrinks the weights, not the KV cache.
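A rough budget for that failing setup, assuming ~4.5 GB of Q4 weights and the same Llama 3.1 8B KV-cache shape as above:
```
# weights + KV cache at num_ctx=131072
echo $((4500000000 + 2 * 32 * 8 * 128 * 2 * 131072))  # ≈ 21.7 GB -> OOM on a 16 GB Mac
```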
Pushing past the model's real ceiling
'128K supported' doesn't mean it produces good output at 128K. Lost-in-the-middle is real. Cap at 32K and use RAG to manage the window — better results.