KV Cache
Key-Value Cache
An optimization that stores attention's key/value tensors so the model doesn't recompute them for every new token. The trade-off: the cache can consume a large share of inference VRAM, at long contexts even rivaling the weights themselves.
Transformers compute K (key) and V (value) matrices for attention at every token. A naive implementation recomputes K and V for all N-1 prior tokens at step N, which adds up to O(N²) work over the sequence.
The KV cache: store K and V in memory after they are first computed; each new token only computes its own K and V and appends them to the cache. Per-token K/V computation drops to O(1) and the total to O(N), a dramatic improvement. Non-negotiable for modern inference.
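A minimal single-head sketch of the idea (NumPy, toy dimensions, random weights, no batching or positional encoding): each step computes K/V only for the newest token and appends them to the cache, while attention reads the whole cache.

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for one query vector over the cached K/V.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

d = 64                                    # toy head_dim
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

K_cache = np.empty((0, d))                # grows by one row per generated token
V_cache = np.empty((0, d))

for x in rng.standard_normal((16, d)):    # 16 toy token embeddings
    # Compute K/V for the new token only; everything already in the cache is reused.
    K_cache = np.vstack([K_cache, (x @ Wk)[None, :]])
    V_cache = np.vstack([V_cache, (x @ Wv)[None, :]])
    out = attend(x @ Wq, K_cache, V_cache)   # O(cache length) per step, not O(N²) recompute
```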
Cost: VRAM. Cache size = 2 (K and V) × num_layers × num_kv_heads × head_dim × num_ctx × bytes_per_value. That's ~131 KB per token for Llama 3.1 8B at FP16. 32K context → ~4.3 GB just for the cache. 128K → ~17 GB. The weights alone might fit in VRAM; the cache can blow the budget.
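The same arithmetic as a quick sanity check, using Llama 3.1 8B's config (32 layers, 8 KV heads, head_dim 128) and an FP16 cache; the helper name is mine:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_ctx, bytes_per_value=2):
    # The leading 2 is because both K and V are stored for every layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * num_ctx * bytes_per_value

llama31_8b = dict(num_layers=32, num_kv_heads=8, head_dim=128)

print(kv_cache_bytes(**llama31_8b, num_ctx=1))               # 131_072 bytes ≈ 131 KB per token
print(kv_cache_bytes(**llama31_8b, num_ctx=32_768) / 1e9)    # ≈ 4.3 GB at 32K context
print(kv_cache_bytes(**llama31_8b, num_ctx=131_072) / 1e9)   # ≈ 17.2 GB (16 GiB) at 128K context
```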
Optimizations:
- PagedAttention (vLLM): paged cache management, ~25% less fragmentation.
- GQA (Grouped-Query Attention, Llama 3): several query heads share one KV head, shrinking the cache 4-8×.
- Sliding window (Mistral): only cache the last N tokens.
- KV cache quantization (FP8, INT4): roughly another 50% saving, with a small accuracy hit.
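Rough numbers for those savings on the same 32-layer, head_dim-128 model at 32K context (a back-of-envelope sketch; the helper is hypothetical, and the 4K window is Mistral 7B's published value applied here purely for illustration):

```python
def cache_gb(num_kv_heads, num_ctx, bytes_per_value=2, num_layers=32, head_dim=128):
    return 2 * num_layers * num_kv_heads * head_dim * num_ctx * bytes_per_value / 1e9

print(cache_gb(num_kv_heads=32, num_ctx=32_768))                      # no GQA, 32 KV heads: ~17.2 GB
print(cache_gb(num_kv_heads=8,  num_ctx=32_768))                      # GQA, 8 KV heads:      ~4.3 GB (4× smaller)
print(cache_gb(num_kv_heads=8,  num_ctx=4_096))                       # + 4K sliding window:  ~0.5 GB
print(cache_gb(num_kv_heads=8,  num_ctx=32_768, bytes_per_value=1))   # + FP8 cache:          ~2.1 GB
```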
Instead of an accountant recomputing all past transactions each time you ask, they keep a ledger. First compute, write to ledger; later queries just read. Fast — but the ledger grows. KV cache is the same: speed at the cost of memory.
Llama 3.1 8B (FP16) uses 16 GB of VRAM. You start chatting.
Token 1 produced → cache: 131 KB
Token 100 → cache: 13 MB
Token 1,000 → cache: 131 MB
Token 10,000 (writing a long essay) → cache: 1.31 GB
Token 32,000 (max num_ctx) → cache: 4.2 GB
Total VRAM: 16 GB (weights) + 4.2 GB (cache) = 20.2 GB. That no longer fits a 16 GB card, and on a 24 GB RTX 4090 it leaves little headroom for activations and overhead. Fixes:
1. Quantize the model (Q4 → ~5 GB weights; plus 4.2 GB cache ≈ 9 GB) ✓
2. Lower num_ctx (8K → cache ~1 GB)
3. Quantize the KV cache (FP8 → ~2.1 GB)
4. Pick a GQA model. Llama 3.1 already uses GQA; older models like Llama 2 7B did not.
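The same fixes as a quick budget check (toy numbers taken from this walkthrough; a real deployment also needs headroom for activations, the CUDA context, and framework overhead):

```python
VRAM_RTX_4090 = 24.0   # GB

setups = {
    "FP16 weights + 32K FP16 cache": 16.0 + 4.2,
    "Q4 weights   + 32K FP16 cache":  5.0 + 4.2,
    "FP16 weights +  8K FP16 cache": 16.0 + 1.0,
    "FP16 weights + 32K FP8  cache": 16.0 + 2.1,
}
for name, gb in setups.items():
    print(f"{name}: {gb:4.1f} GB total, {VRAM_RTX_4090 - gb:4.1f} GB headroom")
```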
vLLM PagedAttention fits ~25% more concurrent users in the same VRAM — no fragmentation waste.
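A hedged sketch of what this looks like with vLLM's offline Python API (assumes vLLM is installed; argument names follow vLLM's `LLM` engine as I understand it, and FP8 KV cache support depends on your GPU and vLLM version):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=8192,             # caps num_ctx, and with it the per-sequence KV cache
    gpu_memory_utilization=0.90,    # share of VRAM vLLM may claim for weights + paged KV blocks
    kv_cache_dtype="fp8",           # quantize the cache itself
)
outputs = llm.generate(
    ["Explain the KV cache in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```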
When it matters
- Inference engineering — you can't plan VRAM without accounting for the KV cache
- Running long-context models — the cache is the bottleneck
- Production serving with PagedAttention (vLLM) or KV cache quantization
- Multi-tenant serving — cache sharing and isolation are critical

When it doesn't
- Training — the KV cache is inference-only; it plays no role during training
- A single short request — the cache stays tiny and you won't notice it
- Hosted APIs that hide the details (OpenAI, Anthropic) — they manage it for you
Forgetting to count cache
Saying '8B fits in 16 GB VRAM' then opening 32K context → OOM. Cache can be 25-100% of model size. Include it from day one.
Multi-user cache isolation
Mixing cache entries across users on a shared server can leak data between them. Production frameworks (vLLM, TGI) handle isolation; if you build your own serving stack, be careful.
Skipping prefix caching
If you send the same system prompt every request, don't recompute it. Anthropic and OpenAI offer prompt-caching APIs; vLLM/SGLang do prefix caching automatically.
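For hosted APIs, a hedged sketch of Anthropic-style prompt caching (the `cache_control` block reflects Anthropic's documented API as I recall it; the model name is a placeholder, the prompt text is illustrative, and OpenAI applies its equivalent automatically to long repeated prefixes):

```python
import anthropic

LONG_SYSTEM_PROMPT = "…your long, unchanging system prompt…"   # the part repeated every request

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-latest",             # placeholder model name
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},   # ask the API to cache this prefix
    }],
    messages=[{"role": "user", "content": "First question about the cached context"}],
)
print(response.content[0].text)
```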