Top-k
Top-k Sampling
A sampling method where the model only considers the K most likely tokens when picking the next one.
At each decoding step, an LLM assigns a probability to every token in its vocabulary (typically 32K-256K entries). Most of those probabilities are near zero, but the tokens can still be sampled; that's why long-tail nonsense occasionally sneaks in. Top-k cuts the tail: only the K most likely tokens are kept as candidates.
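The filtering step itself is small. A minimal sketch in Python over a toy logits vector (made-up values, not output from any real model):

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Keep the k highest logits, renormalize, sample one token id."""
    top_idx = np.argpartition(logits, -k)[-k:]   # indices of the k largest logits
    kept = logits[top_idx]
    probs = np.exp(kept - kept.max())            # softmax over survivors only
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))     # everything else has probability 0

rng = np.random.default_rng(0)
logits = rng.normal(size=32_000)                 # stand-in for a 32K-token vocab
token_id = top_k_sample(logits, k=40, rng=rng)
```

Real inference stacks run this on the accelerator, but the cutoff logic is the same.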
Practical values:
- k = 1: greedy decoding; always pick the top token. Fully deterministic.
- k = 5-10: very narrow; clean grammar but repetitive and cliché.
- k = 40 ★: common default. Good balance of variety and quality.
- k = 100+: opens the long tail; more creative, more risk.
Difference from top-p (nucleus): top-k keeps a fixed number of candidates; top-p keeps a variable number based on a probability threshold. Top-p adapts better — narrows in low-entropy spots, widens in high-entropy ones. Production usually picks top-p (0.9), but some runtimes (llama.cpp, Ollama) keep both on by default.
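A toy contrast of the two cutoffs, using made-up probability vectors rather than real model output:

```python
import numpy as np

def top_k_candidates(probs: np.ndarray, k: int) -> np.ndarray:
    """Always exactly k candidates, whatever the distribution looks like."""
    return np.argsort(probs)[::-1][:k]

def top_p_candidates(probs: np.ndarray, p: float) -> np.ndarray:
    """As many candidates as it takes to cover probability mass p."""
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    return order[:cutoff]

peaked = np.array([0.90, 0.05, 0.03, 0.01, 0.01])  # low entropy: model is sure
flat = np.array([0.22, 0.21, 0.20, 0.19, 0.18])    # high entropy: model is unsure

print(len(top_k_candidates(peaked, 3)), len(top_k_candidates(flat, 3)))      # 3 3
print(len(top_p_candidates(peaked, 0.9)), len(top_p_candidates(flat, 0.9)))  # 1 5
```

Top-k keeps three candidates in both cases; top-p keeps one when the model is confident and all five when it isn't.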
Picking from a restaurant menu. Whole menu open (k=∞): you might be served a weird leftover. Top 5 dishes (k=5): safe, but always the same 3-4 picks. Top 40 (k=40): enough variety, with the weird outliers excluded. Top-k caps how much of the menu the model even considers.
Prompt: "The capital of Turkey is …"
k = 1 (greedy): "Ankara." (always; there is only one candidate)
k = 40, T = 0.7: usually "Ankara.", sometimes "Ankara, the capital city.", or "Ankara, located in central Anatolia." — paraphrases.
k = 1000, T = 1.2: "Ankara, though historically Istanbul..." — the long tail can derail the answer.
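A quick way to reproduce this locally with the llama-cpp-python bindings; the model path is a placeholder and the exact completions depend on whichever model you load:

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", verbose=False)  # placeholder path
prompt = "The capital of Turkey is"

greedy = llm.create_completion(prompt, max_tokens=16, top_k=1)
varied = llm.create_completion(prompt, max_tokens=16, top_k=40, temperature=0.7)

print(greedy["choices"][0]["text"])  # identical on every run
print(varied["choices"][0]["text"])  # paraphrases of the same answer
```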
llama.cpp defaults to --top-k 40 --top-p 0.9 --temp 0.8. This trio is a solid starting point for chat.
Reach for it:
- Low-level sampling control (llama.cpp, Ollama)
- Combined with top-p: one caps the tail, the other adapts to entropy
- Deterministic tests that need greedy decoding (k=1; see the sketch after this list)
- Extra guarantee when also using very low temperature
Skip it:
- Reasoning models (o1, Claude extended thinking): no manual sampling tuning
- OpenAI's API: it exposes top-p but not top-k (Anthropic does expose a top_k parameter, though it's flagged for advanced use)
- When top-p alone is producing good output: keep it simple
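A determinism check along those lines, again with llama-cpp-python and a placeholder model path:

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", verbose=False)  # placeholder path

def generate(prompt: str) -> str:
    out = llm.create_completion(prompt, max_tokens=32, top_k=1, temperature=0.0)
    return out["choices"][0]["text"]

# k=1 always picks the single top token, so two runs must produce the same text.
assert generate("The capital of Turkey is") == generate("The capital of Turkey is")
```

Greedy decoding removes sampling randomness; floating-point nondeterminism on some GPU backends can still cause rare divergence, so pin the backend as well if the test has to be airtight.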
Very low k = repetition
k=5 traps the model in the same five-token candidate pool; long outputs collapse into restating the same thing. Pair a low k with a repetition penalty, as sketched below.
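One way to do that with llama-cpp-python, which exposes the penalty as repeat_penalty (the 1.15 value here is illustrative, and the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", verbose=False)  # placeholder path
out = llm.create_completion(
    "Write a short story about a lighthouse keeper.",
    max_tokens=256,
    top_k=5,              # deliberately narrow: prone to loops on long outputs
    repeat_penalty=1.15,  # down-weights recently emitted tokens
)
print(out["choices"][0]["text"])
```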
Aggressive top-k + top-p combination
k=10 + p=0.5 + T=1.5 = chaos: three aggressive settings fighting each other. Most guides agree: tune one knob, leave the others at their defaults.
Provider API differences
OpenAI's API doesn't expose top-k at all; Anthropic exposes a top_k parameter but recommends it only for advanced use; llama.cpp, Ollama, and vLLM all expose it. If you have a multi-provider abstraction, this knob doesn't normalize cleanly across backends.
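A small sketch of how an abstraction layer might handle that; the provider names and the supports-set are illustrative, not any real library's API:

```python
# Providers whose APIs accept a top_k sampling parameter (illustrative list).
SUPPORTS_TOP_K = {"llama.cpp", "ollama", "vllm", "anthropic"}

def normalize_sampling(provider: str, params: dict) -> dict:
    """Return only the sampling params this provider actually accepts."""
    out = dict(params)
    if provider not in SUPPORTS_TOP_K:
        # No equivalent knob exists on the other side, so the honest move is
        # to drop top_k (ideally with a warning), not to remap it onto top_p.
        out.pop("top_k", None)
    return out

print(normalize_sampling("openai", {"top_k": 40, "top_p": 0.9, "temperature": 0.8}))
# -> {'top_p': 0.9, 'temperature': 0.8}
```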