RLHF
Reinforcement Learning from Human Feedback
A 3-stage technique that aligns a model to human preferences — the recipe that turned GPT-3 into ChatGPT.
A base model (like GPT-3) only does "next-token prediction" — that doesn't make it answer, be polite, or follow instructions. RLHF closes that gap: it teaches the model what kinds of answers humans actually like.
Three-stage pipeline:
1. SFT (Supervised Fine-Tuning): first fine-tune the model on high-quality, human-written example responses.
2. Reward Model: generate several answers per prompt, have humans rank them, and train a separate model to predict that ranking (a minimal loss sketch follows below).
3. RL (typically PPO): fine-tune the main model with reinforcement learning to maximize the reward model's score. DPO is a simpler alternative that skips the separate reward model and optimizes directly on the preference pairs.
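To make stage 2 concrete, here is a minimal sketch of the pairwise ranking loss a reward model is typically trained with (Bradley-Terry style, as in InstructGPT). The function and variable names are illustrative, not from any particular library; all the loss has to do is push the score of the labeler-preferred answer above the rejected one's.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(score_chosen: torch.Tensor,
                        score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for a reward model.

    score_chosen / score_rejected: shape (batch,) scalar scores the reward
    model assigned to the preferred answer and the rejected answer.
    """
    # -log sigmoid(r_chosen - r_rejected): the loss shrinks as the margin
    # between preferred and rejected scores grows.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage with scores for three preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, -1.0])
print(reward_ranking_loss(chosen, rejected))  # single scalar loss
```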
Result: the model produces not "correct tokens" but "answers humans would approve of." ChatGPT's helpful, polite tone comes from RLHF.
You opened a new restaurant. The chef cooks well technically (base model) but doesn't know what flavor your customers like. Stage 1: you set up an example menu. Stage 2: you collect "this dish or that one?" feedback from diners. Stage 3: you tell the chef to adjust recipes based on that feedback.
Ask GPT-3 to "write a FizzBuzz program" — the base model might produce a random code snippet with no explanation, or wander off-topic entirely. ChatGPT (GPT-3 + RLHF): short intro → clean code → how to run it. Same base model, RLHF-aligned to what people prefer.
Anthropic's Constitutional AI (which swaps much of the human feedback for AI feedback) and DPO (Direct Preference Optimization, which drops the separate reward model) are variants of the same idea.
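DPO in particular makes the idea easy to see in code: instead of training a reward model and then running PPO, it optimizes the policy directly on the preference pairs, using a frozen copy of the SFT model as a reference. A hedged sketch of the core loss, with illustrative names; each argument is the log-probability of an answer summed over its tokens:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """DPO loss for one batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probabilities of the
    chosen / rejected answer under the policy being trained or under the
    frozen SFT reference model.
    """
    # Implicit "reward" of each answer: beta * log(pi_policy / pi_ref).
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Same pairwise form as the reward-model loss, computed directly from
    # the language model's own probabilities, with no separate reward model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Here beta plays the same role as the KL penalty in PPO: it limits how far the new policy may drift from the SFT reference.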
- Turning a base model into an assistant (chat product)
- Tuning model output to user preferences (length, tone, format)
- Reducing harmful content generation (safety alignment)
- Polish on top of fine-tuning
- Small projects — a full pipeline can cost millions of dollars and needs a large pool of human labelers
- When an already-aligned model exists (GPT, Claude, Llama-Instruct) — just use it
- To improve raw performance (accuracy) — RLHF is for style/behavior, not capability
Reward hacking
The model learns to optimize 'looking good,' not actually being good: it writes longer answers (labelers tend to equate length with thoroughness), adds emojis, and over-validates the user.
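The standard partial mitigation, used in InstructGPT-style PPO stages, is to subtract a KL penalty that punishes the policy for drifting far from the frozen SFT reference, so gaming the reward model gets progressively more expensive. A minimal sketch, assuming illustrative names and an approximate per-sample KL:

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  beta: float = 0.02) -> torch.Tensor:
    """KL-penalized reward used during the PPO stage.

    rm_score:        (batch,) reward-model score for each generated answer
    policy_logprobs: (batch, seq) per-token log-probs under the current policy
    ref_logprobs:    (batch, seq) per-token log-probs under the frozen SFT model
    """
    # Monte Carlo estimate of KL(policy || reference) from the sampled tokens.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # The further the policy drifts from the reference, the less reward it keeps.
    return rm_score - beta * kl
```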
Sycophancy
RLHF'd models start agreeing with the user even when the user is wrong, because labelers rewarded polite, agreeable answers.
Mode collapse
After RLHF the model locks into a specific tone and format. Same question → same structure every time. The base model's diversity is lost.