Temperature
The randomness dial
A numeric parameter (typically 0–2) that controls how consistent (deterministic) or varied (creative) the LLM's output is.
When generating each token, an LLM assigns probabilities to all possible next tokens. After "Once upon a" → "time" (~70%), "morning" (~5%), "fish" (~0.1%). Temperature reshapes that distribution.
0 = always pick the most likely token (deterministic). Same prompt → same output. 0.7 = mix probabilities moderately (balanced). 1.5+ = boldly sample even low-probability tokens (creative, unpredictable).
Mathematically, temperature divides the logits before the softmax: p_i = softmax(z_i / T). Equivalently, each probability gets rescaled as p^(1/T) and the distribution is renormalized. Temperature is often combined with top-p (nucleus) and top-k sampling.
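The softmax formula above can be sketched in a few lines of stdlib Python. The logit values are made up for illustration; only the rescaling behavior matters:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then softmax. Lower T sharpens the
    distribution toward the top token; higher T flattens it."""
    if temperature <= 0:
        raise ValueError("T = 0 means greedy argmax, not a softmax")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for "time", "morning", "fish" after "Once upon a"
logits = [5.0, 2.0, -1.0]
baseline = softmax_with_temperature(logits, 1.0)  # unmodified softmax
sharp = softmax_with_temperature(logits, 0.5)     # top token dominates even more
flat = softmax_with_temperature(logits, 2.0)      # rare tokens gain probability mass
```

Comparing the three outputs makes the "dial" concrete: the top token's probability rises as T drops and falls as T rises, while the distribution always sums to 1.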
Think of a DJ's mixer. 0 = single track, same song forever. 0.7 = multiple tracks, smooth mix. 1.5 = random samples from 30 channels, experimental but sometimes chaotic. Same library, different output.
You're building a classification API: tag the user message as "spam | normal | urgent". Use T = 0 — the same input must always produce the same label, otherwise you can't write tests.
Creative writing: suggest blog intros. Use T = 0.8 — each call yields a fresh opening with different tone. With T = 0 you'd get the same cliché every time.
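Both use cases come down to the same sampling loop. Here is a minimal toy sampler (not a real API call; the token probabilities are invented) showing why T = 0 is testable and T = 0.8 is not:

```python
import random

def sample_token(token_probs, temperature, rng):
    """token_probs maps token -> probability. T = 0 means greedy
    argmax; otherwise reweight each p as p**(1/T), renormalize,
    and draw one token at random."""
    if temperature == 0:
        return max(token_probs, key=token_probs.get)
    weights = {t: p ** (1.0 / temperature) for t, p in token_probs.items()}
    r = rng.random() * sum(weights.values())
    cumulative = 0.0
    for token, weight in weights.items():
        cumulative += weight
        if r <= cumulative:
            return token

probs = {"time": 0.70, "morning": 0.05, "fish": 0.001}
rng = random.Random()
labels = [sample_token(probs, 0, rng) for _ in range(3)]    # always "time": assertable in tests
drafts = [sample_token(probs, 0.8, rng) for _ in range(3)]  # varies from call to call
```

The classification case corresponds to `labels` (three identical results, so an exact-match test passes every run); the blog-intro case corresponds to `drafts`, where variation is the feature.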
- Deterministic output required: classification, extraction, structured output → T ≈ 0
- Factual Q&A → T ≈ 0.2
- General chat, help, explanation → T ≈ 0.7
- Creative writing, brainstorm, variety → T ≈ 0.9–1.2
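The defaults above can live in code as a lookup table. The task names and the helper below are hypothetical, just one way to encode the starting points before tuning per use case:

```python
# Hypothetical presets mirroring the defaults listed above.
TEMPERATURE_PRESETS = {
    "classification": 0.0,  # deterministic output required
    "extraction": 0.0,
    "factual_qa": 0.2,
    "chat": 0.7,            # general help and explanation
    "creative": 1.0,        # brainstorming, variety
}

def pick_temperature(task: str) -> float:
    """Return a starting temperature; unknown tasks fall back to balanced."""
    return TEMPERATURE_PRESETS.get(task, 0.7)
```

Treat these as starting points, not rules: measure output quality on your own task and adjust.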
- Setting T = 0 and complaining 'why is it always the same answer' (that's the point)
- Setting T = 2 and complaining 'the answer is gibberish' (too high = nonsense token stream)
- Treating it as a 'creativity' knob: genuinely creative ideas come from prompts and examples; T only fine-tunes how varied the output is
Temperature ≠ creativity
Higher T gives more varied output, not more creative output. For creative results, use a strong prompt with examples first; tweak T only after.
Forgetting seed for reproducibility
T = 0 alone isn't enough on some models: for reproducible output you may also need a seed parameter (OpenAI supports one, Anthropic doesn't), and even then results can shift when the provider updates the backend.
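What the seed buys you is easiest to see locally. This sketch uses a seeded random generator with a made-up distribution; hosted APIs expose the same idea as a request parameter:

```python
import random

def sample_sequence(seed, temperature=0.8, steps=5):
    """Seeded toy sampler: the same seed and T always yield the
    same token sequence, even though sampling is random."""
    rng = random.Random(seed)
    probs = {"time": 0.70, "morning": 0.05, "fish": 0.001}
    out = []
    for _ in range(steps):
        weights = {t: p ** (1.0 / temperature) for t, p in probs.items()}
        r = rng.random() * sum(weights.values())
        cumulative = 0.0
        for token, weight in weights.items():
            cumulative += weight
            if r <= cumulative:
                out.append(token)
                break
    return out

# Same seed -> identical sequence; a fresh unseeded rng would not give this.
run_a = sample_sequence(42)
run_b = sample_sequence(42)
```

`run_a == run_b` holds every time, which is exactly the reproducibility guarantee a seed parameter is meant to provide on top of a fixed temperature.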
Hallucinations spike at high T
T > 1.0 deliberately samples low-probability tokens, which are more likely to be factually wrong, so hallucinations become noticeably more frequent.