AI Dictionary
Advanced · ~2 min read · #rlhf #alignment #training

RLHF

Reinforcement Learning from Human Feedback

A 3-stage technique that aligns a model to human preferences — the recipe that turned GPT-3 into ChatGPT.

[Diagram: Human feedback in 3 stages. 1 · SFT: supervised fine-tune on labeled data. 2 · Reward: humans rank outputs. 3 · RL: tune to maximize reward. Caption: the secret sauce that turned GPT-3 into ChatGPT.]
Definition

A base model (like GPT-3) only does "next-token prediction," and that alone doesn't make it answer questions, stay polite, or follow instructions. RLHF closes that gap: it teaches the model what kinds of answers humans actually like.

Three-stage pipeline:
  1. SFT (Supervised Fine-Tuning): pre-tune the model on human-written, high-quality examples.
  2. Reward Model: generate several answers per prompt (4 to 9 in InstructGPT), have humans rank them, and train a separate model to predict that ranking (a loss sketch follows this list).
  3. RL (PPO): fine-tune the main model with reinforcement learning to maximize the reward model's score.
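
Stage 2 boils down to a pairwise (Bradley-Terry) ranking loss: score the answer humans preferred above the one they rejected. A minimal PyTorch sketch, where `RewardModel` and the random tensors are hypothetical stand-ins for a real language-model head and real embeddings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a pooled prompt+answer embedding to a single scalar score."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.score(pooled).squeeze(-1)  # shape: (batch,)

reward_model = RewardModel()

# In a real pipeline these come from the language model's last hidden state for
# "prompt + chosen answer" and "prompt + rejected answer"; here they are random.
chosen = torch.randn(8, 768)    # answers the labelers ranked higher
rejected = torch.randn(8, 768)  # answers the labelers ranked lower

# Pairwise ranking loss: push the chosen score above the rejected one.
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
```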

Result: the model produces not "correct tokens" but "answers humans would approve of." ChatGPT's helpful, polite tone comes from RLHF.
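
Stage 3 doesn't chase that score blindly: standard RLHF (e.g. InstructGPT) subtracts a KL penalty that keeps the tuned model close to the SFT model, so it keeps writing fluent text instead of gaming the reward model. A sketch of that per-answer objective, with toy tensors standing in for real log-probabilities:

```python
import torch

beta = 0.1  # KL penalty weight: higher = stay closer to the SFT model

def rl_objective(rm_score: torch.Tensor,
                 policy_logp: torch.Tensor,
                 ref_logp: torch.Tensor) -> torch.Tensor:
    """rm_score: (batch,) reward-model score per generated answer.
    policy_logp / ref_logp: (batch, seq_len) per-token log-probs of the answer
    under the tuned policy and under a frozen copy of the SFT model."""
    kl = (policy_logp - ref_logp).sum(dim=-1)  # approximate KL per answer
    return rm_score - beta * kl                # what PPO tries to maximize

# Toy example: 4 sampled answers, 16 tokens each.
print(rl_objective(torch.randn(4), -torch.rand(4, 16), -torch.rand(4, 16)))
```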

Analogy

You opened a new restaurant. The chef cooks well technically (base model) but doesn't know what flavor your customers like. Stage 1: you set up an example menu. Stage 2: you collect "this dish or that one?" feedback from diners. Stage 3: you tell the chef to adjust recipes based on that feedback.

Real-world example

Ask GPT-3 "write a fizz buzz program" — the base model might produce a random code snippet, missing explanation, or something completely off-topic. ChatGPT (GPT-3 + RLHF): short intro → clean code → how to run it. Same base model, RLHF-aligned to what people prefer.

Anthropic's Constitutional AI (which swaps human labelers for AI feedback) and DPO (Direct Preference Optimization) are variants of the same idea.
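
DPO in particular skips the separate reward model and the RL loop: it trains directly on the same preference pairs with one supervised loss. A sketch, assuming each argument is an answer's log-probability summed over its tokens under the tuned policy or a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """pi_* : (batch,) answer log-probs under the policy being tuned.
    ref_*: (batch,) answer log-probs under the frozen reference model."""
    # Implicit rewards: how far the policy moved from the reference on each answer.
    chosen_margin = pi_chosen - ref_chosen
    rejected_margin = pi_rejected - ref_rejected
    # Same pairwise form as the reward-model loss, applied directly to the policy.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy example: 8 preference pairs.
print(dpo_loss(-torch.rand(8), -torch.rand(8), -torch.rand(8), -torch.rand(8)))
```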

When to use
  • Turning a base model into an assistant (chat product)
  • Tuning model output to user preferences (length, tone, format)
  • Reducing harmful content generation (safety alignment)
  • Polish on top of fine-tuning
When not to use
  • Small projects — costs millions of dollars and thousands of human labelers
  • When an already-aligned model exists (GPT, Claude, Llama-Instruct) — just use it
  • To improve raw performance (accuracy) — RLHF is for style/behavior, not capability
Common pitfalls

Reward hacking

The model learns to optimize for looking good rather than being good: it writes longer answers (humans equate length with thoroughness), adds emojis, and over-validates the user.
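
One quick way to spot the most common hack (length bias) is to check whether the reward model's scores simply track answer length. A small diagnostic sketch; the arrays below are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical answer lengths (in tokens) and the reward model's scores for them.
lengths = np.array([120, 340, 80, 510, 260, 45, 400, 190], dtype=float)
rewards = np.array([0.2, 0.9, 0.1, 1.3, 0.7, -0.2, 1.1, 0.5])

# A strong positive correlation suggests the model rewards "longer", not "better":
# a signal to rebalance the preference data or add a length penalty.
print(f"length-reward correlation: {np.corrcoef(lengths, rewards)[0, 1]:.2f}")
```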

Sycophancy

RLHF'd models start agreeing with the user even when the user is wrong, because labelers rewarded polite, agreeable answers.

Mode collapse

After RLHF the model locks into a specific tone and format. Same question → same structure every time. The base model's diversity is lost.