Reasoning Model
Models that think first
A new generation of LLMs that spend extra compute 'thinking' before answering — dramatically better at math, code, and logic than classic LLMs.
Classic LLMs start answering immediately, token by token. Reasoning models go through an internal "thinking" stage before responding: they explore possibilities, try alternative paths, and critique their own answers. This process is invisible but real (it is literally extra token generation), making them slower and more expensive, and also dramatically more accurate.
First major example: OpenAI's o1 (late 2024). Then DeepSeek R1, OpenAI o3, Claude 3.7 Sonnet and Claude Sonnet 4 with extended thinking, Google Gemini 2.5 Pro. The recipe is similar across labs: train with reinforcement learning (RL) to produce long chains of thought.
The term "test-time compute" comes from this: answer quality now depends not only on training, but on how much compute the model spends "thinking" while answering.
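One concrete way to see test-time compute in action is self-consistency sampling: ask the model the same question several times and keep the majority answer, trading extra generation for accuracy. A minimal sketch, where `sample_answer` is a stand-in stub for a real stochastic model call:

```python
import random
from collections import Counter

def sample_answer(rng: random.Random) -> str:
    # Stand-in for one stochastic model call: correct ("42") 70% of the time,
    # otherwise a scattered wrong answer.
    return "42" if rng.random() < 0.7 else str(rng.randint(0, 99))

def majority_vote(n_samples: int, seed: int = 0) -> str:
    """Spend more test-time compute (n_samples calls) to boost accuracy."""
    rng = random.Random(seed)
    answers = [sample_answer(rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(1))   # a single sample may be wrong
print(majority_vote(25))  # with 25 samples the majority answer is "42"
```

Same model, same weights; the only thing that changed is how many tokens were generated before committing to an answer.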
Standard LLM: a fast chess player making moves in 5 seconds, usually good but sometimes blunders. Reasoning model: the same player at a tournament — 5 minutes per move, calculates alternatives, then plays. Slower, pricier, far fewer mistakes.
On the AIME (American Invitational Mathematics Examination), GPT-4o scores ~13%. The same company's reasoning model o1 scores ~83%. The gap comes largely from "thinking time": per question, o1 generates an internal reasoning chain of 30+ steps before committing to an answer.
The user never sees that — only the final answer. But behind that answer are thousands of intermediate tokens.
Good fit:
- Hard math, logic puzzles (olympiad-level)
- Algorithmic code: sorting, optimization, edge-case analysis
- Scientific reasoning: hypothesis generation, inference
- Bug hunting: 'why does this code fail?' style investigation
Poor fit:
- Chat, simple Q&A, summarization — overkill
- Latency-critical flows — 30s+ response time kills UX
- Cost-sensitive paths — reasoning tokens are 5–10× pricier
- Creative writing — they sometimes 'over-think' into flat, cliché prose
Adding CoT backfires
Telling a reasoning model to 'think step by step' on top of its built-in reasoning can actually hurt. Trust its native strategy.
Reasoning tokens are invisible but billed
Providers typically report reasoning_tokens as a separate field in API usage, billed at output-token rates. A 100-word answer can hide 5,000 reasoning tokens behind it.
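A back-of-envelope calculation makes the pitfall concrete. In the sketch below, reasoning tokens are billed at the output rate even though they never appear in the reply; the per-million prices are placeholders, not any provider's real rates:

```python
def estimate_cost(prompt_tokens: int,
                  answer_tokens: int,
                  reasoning_tokens: int,
                  usd_per_m_input: float = 2.0,    # placeholder price
                  usd_per_m_output: float = 8.0) -> float:  # placeholder price
    """Reasoning tokens are invisible in the reply but billed as output."""
    input_cost = prompt_tokens / 1_000_000 * usd_per_m_input
    output_cost = (answer_tokens + reasoning_tokens) / 1_000_000 * usd_per_m_output
    return round(input_cost + output_cost, 6)

# A ~100-word answer (~130 tokens) hiding 5,000 reasoning tokens:
print(estimate_cost(500, 130, 5000))  # the hidden reasoning dominates the bill
```

At these placeholder rates, the hidden reasoning accounts for roughly 40x the cost of the visible answer, which is why per-request budgets matter for reasoning models.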
Routing everything to reasoning
Hybrid is better: standard model for fast queries, reasoning model for hard ones. Model selection is itself a design decision.
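A hybrid setup can start as simply as a router in front of two models. The sketch below uses a naive keyword heuristic and placeholder model names; real routers usually use a cheap classifier model instead:

```python
# Illustrative router: send obviously hard queries to the reasoning model,
# everything else to the fast standard model. Model names are placeholders.
HARD_SIGNALS = ("prove", "debug", "optimize", "edge case",
                "why does this fail", "step by step")

def pick_model(query: str) -> str:
    """Route a query to a model tier based on crude difficulty signals."""
    q = query.lower()
    if any(signal in q for signal in HARD_SIGNALS):
        return "reasoning-model"  # slower, pricier, fewer mistakes
    return "standard-model"       # fast, cheap path for chat and summaries

print(pick_model("Summarize this article in two sentences"))
print(pick_model("Why does this fail on empty input? Please debug it"))
```

The design point stands regardless of the heuristic: routing is a product decision, and defaulting everything to the expensive tier wastes both money and latency.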