KV Cache
Key-Value Cache
An optimization that stores attention's key/value tensors so the model doesn't recompute them for every new token. The trade-off: the cache can consume a large share of inference VRAM, at long contexts even rivaling the weights themselves.
Transformers compute K (key) and V (value) matrices for attention at every token. A naive implementation recomputes K and V for all N-1 prior tokens at step N, which adds up to O(N²) work over the sequence.
The KV cache: store K and V in memory after they are first computed; each new token only computes its own K and V and appends them to the cache. Per-token K/V computation drops to O(1) and the total to O(N), a dramatic improvement. Non-negotiable for modern inference.
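A minimal single-head sketch of the idea (NumPy, toy dimensions, random weights, no batching or positional encoding): each step computes K/V only for the newest token and appends them to the cache, while attention reads the whole cache.

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for one query vector over the cached K/V.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

d = 64                                    # toy head_dim
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

K_cache = np.empty((0, d))                # grows by one row per generated token
V_cache = np.empty((0, d))

for x in rng.standard_normal((16, d)):    # 16 toy token embeddings
    # Compute K/V for the new token only; everything already in the cache is reused.
    K_cache = np.vstack([K_cache, (x @ Wk)[None, :]])
    V_cache = np.vstack([V_cache, (x @ Wv)[None, :]])
    out = attend(x @ Wq, K_cache, V_cache)   # O(cache length) per step, not O(N²) recompute
```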
Cost: VRAM. Cache size = 2 (K and V) × num_layers × num_kv_heads × head_dim × num_ctx × bytes_per_value. That's ~131 KB per token for Llama 3.1 8B at FP16. 32K context → ~4.3 GB just for the cache. 128K → ~17 GB. The weights alone might fit in VRAM; the cache can blow the budget.
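The same arithmetic as a quick sanity check, using Llama 3.1 8B's config (32 layers, 8 KV heads, head_dim 128) and an FP16 cache; the helper name is mine:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_ctx, bytes_per_value=2):
    # The leading 2 is because both K and V are stored for every layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * num_ctx * bytes_per_value

llama31_8b = dict(num_layers=32, num_kv_heads=8, head_dim=128)

print(kv_cache_bytes(**llama31_8b, num_ctx=1))               # 131_072 bytes ≈ 131 KB per token
print(kv_cache_bytes(**llama31_8b, num_ctx=32_768) / 1e9)    # ≈ 4.3 GB at 32K context
print(kv_cache_bytes(**llama31_8b, num_ctx=131_072) / 1e9)   # ≈ 17.2 GB (16 GiB) at 128K context
```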
Optimizations:
- PagedAttention (vLLM): paged cache management, ~25% less fragmentation.
- GQA (Grouped-Query Attention, Llama 3): several query heads share one KV head, shrinking the cache 4-8×.
- Sliding window (Mistral): only cache the last N tokens.
- KV cache quantization (FP8, INT4): roughly another 50% saving, with a small accuracy hit.
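Rough numbers for those savings on the same 32-layer, head_dim-128 model at 32K context (a back-of-envelope sketch; the helper is hypothetical, and the 4K window is Mistral 7B's published value applied here purely for illustration):

```python
def cache_gb(num_kv_heads, num_ctx, bytes_per_value=2, num_layers=32, head_dim=128):
    return 2 * num_layers * num_kv_heads * head_dim * num_ctx * bytes_per_value / 1e9

print(cache_gb(num_kv_heads=32, num_ctx=32_768))                      # no GQA, 32 KV heads: ~17.2 GB
print(cache_gb(num_kv_heads=8,  num_ctx=32_768))                      # GQA, 8 KV heads:      ~4.3 GB (4× smaller)
print(cache_gb(num_kv_heads=8,  num_ctx=4_096))                       # + 4K sliding window:  ~0.5 GB
print(cache_gb(num_kv_heads=8,  num_ctx=32_768, bytes_per_value=1))   # + FP8 cache:          ~2.1 GB
```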
Instead of an accountant recomputing all past transactions each time you ask, they keep a ledger. First compute, write to ledger; later queries just read. Fast — but the ledger grows. KV cache is the same: speed at the cost of memory.
Llama 3.1 8B (FP16) uses 16 GB of VRAM. You start chatting.
Token 1 produced → cache: 131 KB
Token 100 → cache: 13 MB
Token 1,000 → cache: 131 MB
Token 10,000 (writing a long essay) → cache: 1.31 GB
Token 32,000 (max num_ctx) → cache: 4.2 GB
Total VRAM: 16 GB (weights) + 4.2 GB (cache) = 20.2 GB. That no longer fits a 16 GB card, and on a 24 GB RTX 4090 it leaves little headroom for activations and overhead. Fixes:
1. Quantize the model (Q4 → ~5 GB weights; plus 4.2 GB cache ≈ 9 GB) ✓
2. Lower num_ctx (8K → cache ~1 GB)
3. Quantize the KV cache (FP8 → ~2.1 GB)
4. Pick a GQA model. Llama 3.1 already uses GQA; older models like Llama 2 7B did not.
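The same fixes as a quick budget check (toy numbers taken from this walkthrough; a real deployment also needs headroom for activations, the CUDA context, and framework overhead):

```python
VRAM_RTX_4090 = 24.0   # GB

setups = {
    "FP16 weights + 32K FP16 cache": 16.0 + 4.2,
    "Q4 weights   + 32K FP16 cache":  5.0 + 4.2,
    "FP16 weights +  8K FP16 cache": 16.0 + 1.0,
    "FP16 weights + 32K FP8  cache": 16.0 + 2.1,
}
for name, gb in setups.items():
    print(f"{name}: {gb:4.1f} GB total, {VRAM_RTX_4090 - gb:4.1f} GB headroom")
```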
vLLM PagedAttention fits ~25% more concurrent users in the same VRAM — no fragmentation waste.
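A hedged sketch of what this looks like with vLLM's offline Python API (assumes vLLM is installed; argument names follow vLLM's `LLM` engine as I understand it, and FP8 KV cache support depends on your GPU and vLLM version):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=8192,             # caps num_ctx, and with it the per-sequence KV cache
    gpu_memory_utilization=0.90,    # share of VRAM vLLM may claim for weights + paged KV blocks
    kv_cache_dtype="fp8",           # quantize the cache itself
)
outputs = llm.generate(
    ["Explain the KV cache in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```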
When it matters
- Inference engineering — you can't plan VRAM without accounting for the KV cache
- Running long-context models — the cache is the bottleneck
- Production serving with PagedAttention (vLLM) or KV cache quantization
- Multi-tenant serving — cache sharing and isolation are critical

When it doesn't
- Training — the KV cache is inference-only; it plays no role during training
- A single short request — the cache stays tiny and you won't notice it
- Hosted APIs that hide the details (OpenAI, Anthropic) — they manage it for you
Forgetting to count cache
Saying '8B fits in 16 GB VRAM' then opening 32K context → OOM. Cache can be 25-100% of model size. Include it from day one.
Multi-user cache isolation
Mixing cache entries across users on a shared server can leak data between them. Production frameworks (vLLM, TGI) handle isolation; if you build your own serving stack, be careful.
Skipping prefix caching
If you send the same system prompt every request, don't recompute it. Anthropic and OpenAI offer prompt-caching APIs; vLLM/SGLang do prefix caching automatically.
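For hosted APIs, a hedged sketch of Anthropic-style prompt caching (the `cache_control` block reflects Anthropic's documented API as I recall it; the model name is a placeholder, the prompt text is illustrative, and OpenAI applies its equivalent automatically to long repeated prefixes):

```python
import anthropic

LONG_SYSTEM_PROMPT = "…your long, unchanging system prompt…"   # the part repeated every request

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-latest",             # placeholder model name
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},   # ask the API to cache this prefix
    }],
    messages=[{"role": "user", "content": "First question about the cached context"}],
)
print(response.content[0].text)
```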