max_tokens
Output length cap
The maximum number of tokens the model can produce in a single response. Directly impacts cost and latency.
Every LLM call gets an output ceiling: max_tokens (OpenAI, Anthropic),
max_output_tokens (Gemini), num_predict (Ollama), -n (llama.cpp).
The model stops at that count, even mid-sentence, which makes billing
predictable and lowers timeout risk.
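A minimal sketch of setting the cap with the OpenAI Python SDK (model name and prompt are placeholders; assumes OPENAI_API_KEY is set in the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Summarize this contract: ..."}],
    max_tokens=200,       # hard output ceiling: generation stops once 200 tokens are produced
)

print(resp.choices[0].message.content)
```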
max_tokens counts only generated tokens. Input tokens don't
count toward it, but they do count toward the context window. So:
input + max_tokens ≤ context_window. Enforcing that is your job;
most APIs error if you violate it.
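A sketch of enforcing that budget client-side, using tiktoken to count input tokens (the context-window size and encoding name are assumptions you'd look up for your model):

```python
import tiktoken

def clamp_max_tokens(prompt: str, desired_max_tokens: int,
                     context_window: int = 128_000) -> int:
    """Shrink max_tokens so input + output still fits in the context window."""
    enc = tiktoken.get_encoding("o200k_base")  # encoding is model-dependent
    input_tokens = len(enc.encode(prompt))
    available = context_window - input_tokens
    if available <= 0:
        raise ValueError(f"Prompt alone ({input_tokens} tokens) exceeds the context window")
    return min(desired_max_tokens, available)

# e.g. a 100K-token prompt in a 128K window caps output at 28K, whatever you asked for
```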
On reasoning models (OpenAI o1, Claude with extended thinking), the cap also
covers the invisible "thinking" tokens (OpenAI exposes it as
max_completion_tokens for those models). Setting 1000 for a query that
wants 8000 thinking tokens will get you cut off early.
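A hedged sketch for a reasoning model via the OpenAI Chat Completions API, where the cap is named max_completion_tokens and has to cover hidden reasoning plus the visible answer (the model name and budget are illustrative):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="o1",  # reasoning model; substitute whatever reasoning model you actually use
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    max_completion_tokens=8000,  # must cover hidden thinking tokens AND the final answer
)

print(resp.choices[0].message.content)
```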
"Answer in at most 200 words" on an exam. If the student hits 200, they stop — even mid-sentence. The cap makes the teacher's grading time predictable, but if the cap is too low you get incomplete answers. max_tokens is the same: cost/latency safety, but it cuts hard.
A summarization service. You set max_tokens=200 everywhere.
Problem: long docs produce summaries that end mid-sentence ("The
contract terms can be grouped into 5 main clauses:"). Fixes:
1. max_tokens=400 (safety margin).
2. Tell the prompt "max 5 bullets, one sentence each" — let the model
self-limit.
3. Check finish_reason: "length" → show user "continue?";
"stop" → fine (sketch below).
Cost math: 100K req/day × 200 max_tokens × $0.60 per 1M output tokens = $12/day = $360/mo. Bumping to 400 doubles that to $720/mo. Whether to do it depends on user value.
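The same arithmetic as a quick sanity check (the $0.60-per-million output price is the figure used above; substitute your model's real rate):

```python
requests_per_day = 100_000
price_per_million_output_tokens = 0.60  # USD; use your model's actual output price

def monthly_cost(max_tokens: int) -> float:
    # Worst case: every response runs all the way to the cap.
    tokens_per_day = requests_per_day * max_tokens
    daily = tokens_per_day / 1_000_000 * price_per_million_output_tokens
    return daily * 30

print(monthly_cost(200))  # 360.0
print(monthly_cost(400))  # 720.0
```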
- Every production API call — billing and timeout safety
- Streaming UIs — knowing how long to render
- Structured output (JSON, yes/no) — short caps suffice
- Reasoning models — bump it up to fit hidden thinking tokens
- Setting it too small without thought — answers cut off
- Setting it absurdly large (100K) — eats your context budget
- Relying on max_tokens but not telling the prompt to be brief — models love length
Forgetting finish_reason
If the model hits max_tokens, finish_reason comes back as "length". Skip the check and users see truncated content and assume it's complete. Always check, and always warn in the UI.
Underbudgeting reasoning models
o1/Claude-reasoning include hidden thinking tokens in max_tokens. Don't set 500 and complain the answer is short. Use 4000-8000.
Filling the context window
128K context with 100K input leaves only 28K for output. Set max_tokens=50K and the API will reject the request. Always: input + max_tokens ≤ context.