Top-p
Nucleus Sampling
When picking the next token, sample only from the smallest set of tokens whose cumulative probability exceeds a threshold p.
For every token an LLM produces, it computes a probability distribution over the entire vocabulary (e.g. 50K tokens, each with a probability). Top-p (a.k.a. nucleus sampling) limits sampling to only the most likely tokens.
How it works: sort tokens by probability. Walk down the sorted list,
summing probabilities, until the cumulative probability exceeds p.
What's left is the "nucleus" — a small subset. Only sample from it.
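The sort-and-accumulate procedure above can be sketched in plain Python (a minimal illustration over a toy token distribution, not any library's actual implementation):

```python
def top_p_filter(probs, p=0.9):
    """probs: dict of token -> probability. Returns the renormalized nucleus."""
    # Sort tokens by probability, most likely first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = {}, 0.0
    for token, prob in ranked:
        nucleus[token] = prob
        cumulative += prob
        if cumulative >= p:      # stop once the nucleus covers p
            break
    # Renormalize so the kept probabilities sum to 1.
    total = sum(nucleus.values())
    return {tok: prob / total for tok, prob in nucleus.items()}

probs = {"the": 0.5, "a": 0.3, "an": 0.1, "this": 0.05, "that": 0.05}
print(top_p_filter(probs, p=0.85))  # only the top 3 tokens survive
```

Note that the token that pushes the cumulative sum past p is included, so the nucleus always contains at least one token.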
Typical values:
- top_p = 1.0: all tokens included (unbounded variety, risky)
- top_p = 0.9: standard "slightly varied" (good default)
- top_p = 0.5: narrow, more conservative
- top_p = 0.1: very narrow, near-deterministic
Used with temperature — top_p shapes which tokens are even eligible, temperature shapes their selection probabilities.
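That division of labor can be made concrete in a small stdlib-only sketch (an illustration of the usual ordering, temperature before the top-p cutoff; real implementations differ in details):

```python
import math
import random

def sample_token(logits, temperature=0.7, top_p=0.9):
    # Temperature rescales logits before softmax: lower T sharpens the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p then decides which tokens are even eligible.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample among the surviving tokens, weighted by their probabilities.
    weights = [probs[i] for i in nucleus]
    return random.choices(nucleus, weights=weights)[0]

# With a very narrow nucleus, only the top token is ever eligible:
print(sample_token([5.0, 1.0, 0.0], temperature=1.0, top_p=0.1))  # always 0
```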
Like restricting a restaurant menu. Leave the whole menu open (top_p=1) and the waiter might bring you an obscure aged dish (risky). Limit it to the 90% most-ordered favorites (top_p=0.9) and the weird picks are gone, but there's still variety. Limit it to only the top few dishes (top_p=0.1) and you always get one of the same 3-4 meals.
Generating a creative story opener with GPT-4.
top_p = 1.0: "Once upon a time" / "It was a dark and stormy night" / "Wherein the silver moths of dawn..." → sometimes cliché, sometimes too weird.
top_p = 0.9: "Once upon a time" / "Long ago in a forgotten kingdom" / "She had always known..." → varied yet sensible.
top_p = 0.3: "Once upon a time" / "There was once" / "Long ago" → safe, cliché, predictable.
A good production starting point is temperature=0.7, top_p=0.9.
OpenAI's docs recommend tuning one or the other, not both at once.
- When you want creative/varied output — top_p=0.9-0.95
- Preventing rare weird tokens (use top_p=0.95 over 1.0)
- Classification/structured output — top_p=0.1 for determinism
- Long-form text — top_p stabilizes quality across length
- Aggressively combining with temperature (e.g. temperature at 1.5+ while top_p stays at 1.0 → chaos; note top_p itself is capped at 1.0)
- Treating it alone as a 'quality knob' — prompt + examples matter more
- JSON/structured output — use low temperature + low top_p
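The guidance above can be collapsed into a small settings table (the preset names and exact temperature values are illustrative choices, not an official recommendation):

```python
# Hypothetical presets derived from the guidelines above; tune per task.
SAMPLING_PRESETS = {
    "creative":   {"temperature": 0.7, "top_p": 0.9},   # varied yet sensible
    "long_form":  {"temperature": 0.7, "top_p": 0.95},  # suppress rare weird tokens
    "structured": {"temperature": 0.0, "top_p": 0.1},   # JSON/classification, near-deterministic
}

print(SAMPLING_PRESETS["structured"])
```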
Confusing top_p with temperature
Most guides say 'tune one, leave the other default.' Tweaking both aggressively yields unpredictable results. Even OpenAI's own docs warn about this.
top_p = 0 ≠ temperature = 0
top_p=0 doesn't reliably mean the model picks only the top token; it's an edge case whose handling varies by implementation. For full determinism, set temperature=0 instead.
Manual tuning on reasoning models
o1, Claude reasoning, etc. don't expose temperature/top_p — the model self-optimizes. Manual interference usually backfires.