Reasoning Model
Models that think first
A new generation of LLMs that spend extra compute 'thinking' before answering — dramatically better at math, code, and logic than classic LLMs.
Classic LLMs start answering immediately, token by token. Reasoning models go through an internal "thinking" stage before responding: they explore possibilities, try alternative paths, and critique their own answers. This process is invisible but real (it is literally extra token generation), making them slower and more expensive, and also dramatically more accurate.
First major example: OpenAI's o1 (late 2024). Then DeepSeek R1, OpenAI o3, Claude 3.7 Sonnet and Claude Sonnet 4 with extended thinking, Google Gemini 2.5 Pro. The recipe is similar across labs: train with reinforcement learning (RL) to produce long chains of thought.
The term "test-time compute" comes from this: answer quality now depends not only on training, but on how much compute the model spends "thinking" while answering.
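One concrete way to see test-time compute in action is self-consistency sampling: ask the model the same question several times and keep the majority answer, trading extra generation for accuracy. A minimal sketch, where `sample_answer` is a stand-in stub for a real stochastic model call:

```python
import random
from collections import Counter

def sample_answer(rng: random.Random) -> str:
    # Stand-in for one stochastic model call: correct ("42") 70% of the time,
    # otherwise a scattered wrong answer.
    return "42" if rng.random() < 0.7 else str(rng.randint(0, 99))

def majority_vote(n_samples: int, seed: int = 0) -> str:
    """Spend more test-time compute (n_samples calls) to boost accuracy."""
    rng = random.Random(seed)
    answers = [sample_answer(rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(1))   # a single sample may be wrong
print(majority_vote(25))  # with 25 samples the majority answer is "42"
```

Same model, same weights; the only thing that changed is how many tokens were generated before committing to an answer.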
Standard LLM: a fast chess player making moves in 5 seconds, usually good but sometimes blunders. Reasoning model: the same player at a tournament — 5 minutes per move, calculates alternatives, then plays. Slower, pricier, far fewer mistakes.
On the AIME (American Invitational Mathematics Examination), GPT-4o scores ~13%. The same company's reasoning model o1 scores ~83%. The gap comes largely from "thinking time": per question, o1 generates an internal reasoning chain of 30+ steps before committing to an answer.
The user never sees that — only the final answer. But behind that answer are thousands of intermediate tokens.
Good fit:
- Hard math, logic puzzles (olympiad-level)
- Algorithmic code: sorting, optimization, edge-case analysis
- Scientific reasoning: hypothesis generation, inference
- Bug hunting: 'why does this code fail?' style investigation
Poor fit:
- Chat, simple Q&A, summarization — overkill
- Latency-critical flows — 30s+ response time kills UX
- Cost-sensitive paths — reasoning tokens are 5–10× pricier
- Creative writing — they sometimes 'over-think' into flat, cliché prose
Adding CoT backfires
Telling a reasoning model to 'think step by step' on top of its built-in reasoning can actually hurt. Trust its native strategy.
Reasoning tokens are invisible but billed
Providers typically report reasoning_tokens as a separate field in API usage, billed at output-token rates. A 100-word answer can hide 5,000 reasoning tokens behind it.
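A back-of-envelope calculation makes the pitfall concrete. In the sketch below, reasoning tokens are billed at the output rate even though they never appear in the reply; the per-million prices are placeholders, not any provider's real rates:

```python
def estimate_cost(prompt_tokens: int,
                  answer_tokens: int,
                  reasoning_tokens: int,
                  usd_per_m_input: float = 2.0,    # placeholder price
                  usd_per_m_output: float = 8.0) -> float:  # placeholder price
    """Reasoning tokens are invisible in the reply but billed as output."""
    input_cost = prompt_tokens / 1_000_000 * usd_per_m_input
    output_cost = (answer_tokens + reasoning_tokens) / 1_000_000 * usd_per_m_output
    return round(input_cost + output_cost, 6)

# A ~100-word answer (~130 tokens) hiding 5,000 reasoning tokens:
print(estimate_cost(500, 130, 5000))  # the hidden reasoning dominates the bill
```

At these placeholder rates, the hidden reasoning accounts for roughly 40x the cost of the visible answer, which is why per-request budgets matter for reasoning models.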
Routing everything to reasoning
Hybrid is better: standard model for fast queries, reasoning model for hard ones. Model selection is itself a design decision.
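A hybrid setup can start as simply as a router in front of two models. The sketch below uses a naive keyword heuristic and placeholder model names; real routers usually use a cheap classifier model instead:

```python
# Illustrative router: send obviously hard queries to the reasoning model,
# everything else to the fast standard model. Model names are placeholders.
HARD_SIGNALS = ("prove", "debug", "optimize", "edge case",
                "why does this fail", "step by step")

def pick_model(query: str) -> str:
    """Route a query to a model tier based on crude difficulty signals."""
    q = query.lower()
    if any(signal in q for signal in HARD_SIGNALS):
        return "reasoning-model"  # slower, pricier, fewer mistakes
    return "standard-model"       # fast, cheap path for chat and summaries

print(pick_model("Summarize this article in two sentences"))
print(pick_model("Why does this fail on empty input? Please debug it"))
```

The design point stands regardless of the heuristic: routing is a product decision, and defaulting everything to the expensive tier wastes both money and latency.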