AI Dictionary
Intermediate · ~2 min read · #cot #reasoning #prompting

Chain-of-Thought

CoT — Step-by-Step Reasoning

A prompting technique that encourages the model to write out intermediate reasoning steps before producing the final answer.

[Interactive demo: "Show your work". Question: "Roger has 5 balls. He buys 2 cans of 3 balls. How many now?" Thinking: 5 balls to start, plus 2 cans × 3 balls = 6 new balls, so 5 + 6 = 11. Answer: 11 balls. Caption: adding "let's think step by step" alone boosts accuracy significantly.]
Definition

Ask an LLM a complex math question and demand a direct answer — it'll usually get it wrong. Tell the same model to "think step by step" and accuracy jumps dramatically. This is Chain-of-Thought (CoT) prompting, discovered by Google researchers in 2022.

The mechanism: the model emits "thought" steps as tokens. Each step becomes context that shapes the next. Instead of doing arithmetic in its head, it's like writing on scratch paper.

Two flavors: few-shot CoT (include an example reasoning chain in the prompt) and zero-shot CoT (just append "Let's think step by step"). Zero-shot works surprisingly well.
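
Both flavors are plain string assembly. Here is a minimal sketch in Python; the few-shot worked example (the baker question) is invented for illustration and the prompts can be sent to any chat or completion model:

```python
# Two flavors of CoT prompting, built as plain strings.
question = "Roger has 5 balls. He buys 2 cans of 3 balls. How many now?"

# Zero-shot CoT: just append the magic phrase.
zero_shot_prompt = f"{question}\nLet's think step by step."

# Few-shot CoT: show one worked reasoning chain, then ask the real question.
# The baker example is a made-up demonstration chain.
few_shot_prompt = (
    "Q: A baker has 3 trays of 4 rolls each. How many rolls?\n"
    "A: 3 trays x 4 rolls = 12 rolls. The answer is 12.\n"
    "\n"
    f"Q: {question}\n"
    "A:"
)

print(zero_shot_prompt)
print(few_shot_prompt)
```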

Analogy

Math test: "do it in your head" → ~40% accuracy. "Show your work on scratch paper" → ~85% accuracy. Writing out the steps lets you catch mistakes, and the same mechanism helps the model.

Real-world example

Question: "Roger has 5 balls. He buys 2 cans of 3 balls. How many now?"

Direct answer: "10" (wrong)

CoT answer:
  • Roger has 5 balls.
  • 2 cans × 3 balls = 6 new balls.
  • Total: 5 + 6 = 11.
  • Answer: 11

Same model, same question. The only change: adding "let's think step by step." On the GSM8K math benchmark, this trick alone moves accuracy from ~18% to ~58%.

When to use
  • Multi-step math or logic problems
  • Code debugging, when the model needs to walk through execution (see the sketch after this list)
  • Decision trees, conditional logic (if-then inferences)
  • Breaking down complex instructions (first X, then Y, then Z)
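
For the debugging case, the prompt just needs to force an execution trace. A hedged sketch; the buggy function and the instruction wording are made up for illustration:

```python
# A CoT-style debugging prompt: ask the model to trace execution line by
# line. The function below is an invented example (it divides by zero
# for an empty list).
buggy_snippet = '''
def average(xs):
    total = 0
    for x in xs:
        total += x
    return total / len(xs)
'''

debug_prompt = (
    "Walk through this function line by line for the input [] and "
    "explain where and why it fails:\n" + buggy_snippet
)
print(debug_prompt)
```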
When not to use
  • Simple factual questions ('capital of England?') — wasteful
  • When you need a very short output (classification, single word)
  • Latency-critical paths — CoT means 3–10× longer output = more cost + slower
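
The cost point is easy to quantify. A back-of-envelope sketch using the 3–10× figure above; the per-token price is a placeholder, not any provider's real rate:

```python
# Back-of-envelope extra cost of CoT, using the 3-10x output growth
# mentioned above. PRICE_PER_OUTPUT_TOKEN is a hypothetical rate.
PRICE_PER_OUTPUT_TOKEN = 0.00001  # placeholder $/token

direct_tokens = 20               # a short direct answer
cot_tokens = direct_tokens * 10  # worst case of the 3-10x range

extra_cost = (cot_tokens - direct_tokens) * PRICE_PER_OUTPUT_TOKEN
print(f"extra cost per call: ${extra_cost:.5f}")  # -> $0.00180
```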
Common pitfalls

Redundant on reasoning models

Reasoning models like OpenAI's o1 or Claude's extended-thinking modes already do CoT internally. Telling them to 'think step by step' on top is redundant and sometimes backfires.

Steps can be wrong while the answer is right

The model can confabulate steps and still happen to land on the right answer. Don't treat CoT as a guarantee — verification still matters.
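
One common verification strategy (a separate technique called self-consistency, not part of basic CoT) is to sample several chains at nonzero temperature and majority-vote the final answers. A runnable sketch, with a faked sampler standing in for the real model call:

```python
import random
from collections import Counter

def sample_chain(question: str) -> str:
    # Placeholder for an LLM call at temperature > 0. The canned chains
    # below (one of them wrong) just make this sketch runnable.
    return random.choice([
        "5 + 2*3 = 11. Answer: 11",
        "5 balls plus 6 new balls = 11. Answer: 11",
        "2 cans of 3 is 5... Answer: 10",
    ])

def extract_answer(chain: str) -> str:
    # Take whatever follows the last "Answer:" marker.
    return chain.rsplit("Answer:", 1)[-1].strip()

def majority_answer(question: str, n: int = 5) -> str:
    # Self-consistency: majority vote over n sampled reasoning chains.
    answers = [extract_answer(sample_chain(question)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

print(majority_answer("Roger has 5 balls. He buys 2 cans of 3 balls. How many now?"))
```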

Token budget explosion

Each CoT answer adds 200–1000 tokens. In production, hide reasoning steps and surface only the final answer if UX matters.
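
A minimal way to do that: prompt the model to end with a fixed marker like "Answer: ...", then show the user only what follows it. The marker convention and regex are assumptions, not a standard API:

```python
import re

def final_answer(cot_output: str) -> str:
    # Surface only the text after the "Answer:" marker; fall back to the
    # full output if the model didn't follow the convention.
    match = re.search(r"Answer:\s*(.+)", cot_output)
    return match.group(1).strip() if match else cot_output.strip()

raw = "1. Start with 5.\n2. 2 x 3 = 6 new.\n3. 5 + 6 = 11.\nAnswer: 11 balls."
print(final_answer(raw))  # -> "11 balls."
```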