Alignment
Tuning models to human values
The discipline of making models not just powerful, but behaving as we want — helpful, harmless, honest.
Train a model on massive data and it becomes powerful — but not necessarily good. Being useful, harmless, and honest is a separate problem. Alignment is the umbrella term for techniques that align model output with human preferences and values.
Three practical dimensions: 1. Outer alignment: the metric you maximize is actually what you want (reward design problem). 2. Inner alignment: the model internally pursues the goal (no proxy-chasing). 3. Capability alignment: the model is capable enough (refusing a harmful suggestion requires recognizing it).
Methods: RLHF (human feedback), DPO (direct preference optimization), Constitutional AI (self-critique against written principles), inference-time alignment (output filtering).
Raising a talented but undisciplined apprentice. Technical skills exist, but "be polite," "don't lie to customers," "keep secrets" haven't been taught. Alignment is that upper-layer training. Skill + values = reliable product.
GPT-3 (2020): capable, but would answer "how do I make harmful content?" ChatGPT (2022 = GPT-3 + RLHF) refuses. Same base, totally different product thanks to alignment.
Anthropic's Constitutional AI went further: instead of human labelers, the model self-critiques against a written "constitution." More scalable, more consistent.
DeepSeek R1 took another path: train pure capability with RL first, then align — treating capability and alignment as separate phases.
- Building a production AI product — alignment isn't optional
- High-stakes domains (health, finance, law) — small misbehavior, big harm
- Brand tone/consistency matters — the model should always sound like 'you'
- Safety testing (red-teaming) — what's slipping through?
- Doing alignment from scratch isn't practical for small teams — use managed models
- Believing 'fully aligned' is a thing — every method has side effects
- Reducing alignment to RLHF alone — multiple layers are needed
Reward hacking
The model learns to maximize 'looking good,' not actually being good. Writes longer answers (humans equate length with thoroughness), drops emojis, over-validates. You get what you measure.
Sycophancy
RLHF'd models start agreeing with the user even when wrong. Labelers reward politeness, this side-effect emerges. Use eval suites that prioritize correctness.
Capabilities/alignment gap
Alignment techniques lag model capabilities. More powerful model = more sophisticated misbehavior. Alignment isn't a one-time job; it co-evolves with the model.