max_tokens
Output length cap
The maximum number of tokens the model can produce in a single response. Directly impacts cost and latency.
Every LLM call gets an output ceiling: max_tokens (OpenAI, Anthropic),
max_output_tokens (Gemini), num_predict (Ollama), -n (llama.cpp).
The model stops at that count, even mid-sentence, which makes billing
predictable and lowers timeout risk.
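A minimal sketch of setting the cap with the OpenAI Python SDK (model name and prompt are placeholders; assumes OPENAI_API_KEY is set in the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Summarize this contract: ..."}],
    max_tokens=200,       # hard output ceiling: generation stops once 200 tokens are produced
)

print(resp.choices[0].message.content)
```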
max_tokens counts only generated tokens. Input tokens don't
count toward it, but they do count toward the context window. So:
input + max_tokens ≤ context_window. Enforcing that is your job;
most APIs error if you violate it.
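A sketch of enforcing that budget client-side, using tiktoken to count input tokens (the context-window size and encoding name are assumptions you'd look up for your model):

```python
import tiktoken

def clamp_max_tokens(prompt: str, desired_max_tokens: int,
                     context_window: int = 128_000) -> int:
    """Shrink max_tokens so input + output still fits in the context window."""
    enc = tiktoken.get_encoding("o200k_base")  # encoding is model-dependent
    input_tokens = len(enc.encode(prompt))
    available = context_window - input_tokens
    if available <= 0:
        raise ValueError(f"Prompt alone ({input_tokens} tokens) exceeds the context window")
    return min(desired_max_tokens, available)

# e.g. a 100K-token prompt in a 128K window caps output at 28K, whatever you asked for
```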
On reasoning models (OpenAI o1, Claude with extended thinking), the cap also
covers the invisible "thinking" tokens (OpenAI exposes it as
max_completion_tokens for those models). Setting 1000 for a query that
wants 8000 thinking tokens will get you cut off early.
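A hedged sketch for a reasoning model via the OpenAI Chat Completions API, where the cap is named max_completion_tokens and has to cover hidden reasoning plus the visible answer (the model name and budget are illustrative):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="o1",  # reasoning model; substitute whatever reasoning model you actually use
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    max_completion_tokens=8000,  # must cover hidden thinking tokens AND the final answer
)

print(resp.choices[0].message.content)
```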
"Answer in at most 200 words" on an exam. If the student hits 200, they stop — even mid-sentence. The cap makes the teacher's grading time predictable, but if the cap is too low you get incomplete answers. max_tokens is the same: cost/latency safety, but it cuts hard.
A summarization service. You set max_tokens=200 everywhere.
Problem: long docs produce summaries that end mid-sentence ("The
contract terms can be grouped into 5 main clauses:"). Fixes:
1. max_tokens=400 (safety margin).
2. Tell the prompt "max 5 bullets, one sentence each" — let the model
self-limit.
3. Check finish_reason: "length" → show user "continue?";
"stop" → fine (sketch below).
Cost math: 100K req/day × 200 max_tokens × $0.60 per 1M output tokens = $12/day = $360/mo. Bumping to 400 doubles that to $720/mo. Whether to do it depends on user value.
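The same arithmetic as a quick sanity check (the $0.60-per-million output price is the figure used above; substitute your model's real rate):

```python
requests_per_day = 100_000
price_per_million_output_tokens = 0.60  # USD; use your model's actual output price

def monthly_cost(max_tokens: int) -> float:
    # Worst case: every response runs all the way to the cap.
    tokens_per_day = requests_per_day * max_tokens
    daily = tokens_per_day / 1_000_000 * price_per_million_output_tokens
    return daily * 30

print(monthly_cost(200))  # 360.0
print(monthly_cost(400))  # 720.0
```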
- Every production API call — billing and timeout safety
- Streaming UIs — knowing how long to render
- Structured output (JSON, yes/no) — short caps suffice
- Reasoning models — bump it up to fit hidden thinking tokens
- Setting it too small without thought — answers cut off
- Setting it absurdly large (100K) — eats your context budget
- Relying on max_tokens but not telling the prompt to be brief — models love length
Forgetting finish_reason
If the model hits max_tokens, finish_reason comes back as "length". Skip the check and users see truncated content and assume it's complete. Always check, and always warn in the UI.
Underbudgeting reasoning models
o1/Claude-reasoning include hidden thinking tokens in max_tokens. Don't set 500 and complain the answer is short. Use 4000-8000.
Filling the context window
128K context with 100K input leaves only 28K for output. Set max_tokens=50K and the API will reject the request. Always: input + max_tokens ≤ context.