

Top-k Sampling

A sampling method where the model only considers the K most likely tokens when picking the next one.

[Figure: top-k candidate pool with top_k = 5 — the first 5 tokens stay in the pool. Ankara 0.62, İstanbul 0.14, İzmir 0.07, Bursa 0.04, Adana 0.03 | cutoff | Konya 0.03, Antalya 0.02, Trabzon 0.01, Mersin 0.01. Caption: the k=40 default drops the long-tail nonsense.]
Definition

At every decoding step, an LLM assigns a probability to every token in its vocabulary (usually 32K-256K entries). Most sit near zero but can still be sampled — that's how long-tail nonsense words occasionally sneak in. Top-k cuts the tail: only the K most likely tokens are kept as candidates, and everything else is dropped before sampling.
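
In code, the whole trick is a filter applied to the logits before softmax. Below is a minimal NumPy sketch; the 9-token toy vocabulary mirrors the figure above, while a real model's logits come from its final layer over the full vocabulary.

```python
import numpy as np

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    """Keep exactly the k largest logits; set the rest to -inf so
    softmax assigns them zero probability (ties broken arbitrarily)."""
    if k >= logits.size:
        return logits                            # k covers the whole vocab
    keep = np.argpartition(logits, -k)[-k:]      # indices of the k largest
    filtered = np.full_like(logits, -np.inf)
    filtered[keep] = logits[keep]
    return filtered

def sample_next_token(logits: np.ndarray, k: int, temperature: float = 1.0) -> int:
    """Temperature-scale, apply top-k, renormalize, sample one token id."""
    scaled = top_k_filter(logits / temperature, k)
    probs = np.exp(scaled - scaled.max())        # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Toy 9-token "vocabulary" mirroring the figure above
logits = np.log([0.62, 0.14, 0.07, 0.04, 0.03, 0.03, 0.02, 0.01, 0.01])
print(sample_next_token(logits, k=5))  # only 5 token ids can ever come out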

Practical values:
  • k = 1: greedy decoding — always pick the top token. Fully deterministic.
  • k = 5-10: very narrow — clean grammar but repetitive and cliché.
  • k = 40 ★: common default. Good balance of variety and quality.
  • k = 100+: opens the long tail — more creative, more risk.

Difference from top-p (nucleus): top-k keeps a fixed number of candidates; top-p keeps a variable number based on a probability threshold. Top-p adapts better — narrows in low-entropy spots, widens in high-entropy ones. Production usually picks top-p (0.9), but some runtimes (llama.cpp, Ollama) keep both on by default.
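
To make the adaptivity difference concrete, here's a small sketch with two made-up distributions: a fixed k keeps the same pool size no matter how confident the model is, while top-p grows the pool as entropy rises.

```python
import numpy as np

def top_p_pool_size(probs: np.ndarray, p: float) -> int:
    """Size of the smallest high-probability set whose mass reaches p."""
    sorted_desc = np.sort(probs)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_desc), p) + 1)

peaked = np.array([0.70, 0.15, 0.08, 0.04, 0.02, 0.01])  # low entropy: model is sure
flat = np.full(6, 1 / 6)                                  # high entropy: model is unsure

for name, dist in (("peaked", peaked), ("flat", flat)):
    print(f"{name}: top-k(5) keeps 5 | top-p(0.9) keeps {top_p_pool_size(dist, 0.9)}")
# peaked: top-p keeps 3 of 6; flat: top-p keeps all 6. top-k keeps 5 either way.
```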

Analogy

Picking from a restaurant menu. With the whole menu open (k=∞) you might be served a weird leftover. Top 5 dishes (k=5) — safe, but always the same 3-4 picks. Top 40 (k=40) — enough variety, weird outliers excluded. Top-k caps how much of the menu the model is allowed to consider.

Real-world example

Prompt: "The capital of Turkey is …"

k = 1 (greedy): "Ankara." (always the same output; only one candidate survives)

k = 40, T = 0.7: usually "Ankara.", sometimes "Ankara, the capital city.", or "Ankara, located in central Anatolia." — paraphrases.

k = 1000, T = 1.2: "Ankara, though historically Istanbul..." — the long tail can derail the answer.

llama.cpp defaults to --top-k 40 --top-p 0.9 --temp 0.8. This trio is a solid starting point for chat.
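
The same trio is available from Python. Here's a hedged sketch using the llama-cpp-python bindings (the model path is a placeholder; point it at any GGUF file you have):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: swap in whatever GGUF model you have locally.
llm = Llama(model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf")

out = llm(
    "The capital of Turkey is",
    max_tokens=32,
    top_k=40,        # cap the candidate pool at the 40 most likely tokens
    top_p=0.9,       # then keep only the nucleus covering 90% of the mass
    temperature=0.8, # mild randomness within whatever survives both filters
)
print(out["choices"][0]["text"])
```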

When to use
  • Low-level sampling control (llama.cpp, Ollama)
  • Combined with top-p — one caps the tail, the other adapts to entropy
  • Deterministic tests that need greedy decoding (k=1)
  • Extra guarantee when also using very low temperature
When not to use
  • Reasoning models (o1, Claude extended thinking) — they ignore or reject manual sampling knobs
  • OpenAI's API — it doesn't expose top-k, only top-p (Anthropic's API does accept top_k, but documents it as an advanced knob)
  • When top-p alone is producing good output — keep it simple
Common pitfalls

Very low k = repetition

k=5 traps the model in the same five-token pool; long outputs collapse into saying the same thing over and over. Pair it with a repetition penalty, as sketched below.
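
One widely used scheme is the divide/multiply rule from the CTRL paper, which is also what HF transformers' repetition_penalty and llama.cpp's --repeat-penalty implement. A minimal sketch:

```python
import numpy as np

def apply_repetition_penalty(logits: np.ndarray, generated: list[int],
                             penalty: float = 1.1) -> np.ndarray:
    """CTRL-style penalty: divide positive logits of already-seen tokens
    by `penalty`, multiply negative ones. Repeats lose probability either way."""
    out = logits.copy()
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Applied each step, before the top-k filter from the earlier sketch:
#   logits = apply_repetition_penalty(logits, generated_so_far)
#   next_id = sample_next_token(logits, k=5)
```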

Aggressive top-k + top-p combination

k=10 + p=0.5 + T=1.5 = chaos. Tune one knob and leave the others at their defaults; most guides agree on this: one knob, not three.

Provider API differences

OpenAI's API doesn't expose top-k at all; Anthropic's accepts top_k but discourages routine tuning. llama.cpp, Ollama, and vLLM all expose it. In a multi-provider abstraction, this knob doesn't normalize cleanly across backends.
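
A sketch of the two calls side by side (model ids are illustrative; check current availability before copying):

```python
from openai import OpenAI
from anthropic import Anthropic

prompt = "The capital of Turkey is"

# OpenAI: no top_k parameter exists; top_p is the only truncation knob.
openai_out = OpenAI().chat.completions.create(
    model="gpt-4o-mini",  # illustrative model id
    messages=[{"role": "user", "content": prompt}],
    top_p=0.9,
)

# Anthropic: top_k is accepted, but documented as an advanced-use knob.
anthropic_out = Anthropic().messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model id
    max_tokens=64,
    messages=[{"role": "user", "content": prompt}],
    top_k=40,
)
```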