AI Dictionary
Advanced · ~2 min read · #alignment #safety #values

Alignment

Tuning models to human values

The discipline of making models not just powerful but also well-behaved: helpful, harmless, honest.

[Diagram: aligning models to human values. Raw model + human feedback (👍/👎) + values/rules → aligned model. RLHF, DPO, and Constitutional AI are all alignment methods.]
Definition

Train a model on massive data and it becomes powerful, but not necessarily good. Being useful, harmless, and honest is a separate problem. Alignment is the umbrella term for techniques that steer model behavior toward human preferences and values.

Three practical dimensions:
1. Outer alignment: the metric you maximize is actually what you want (the reward design problem).
2. Inner alignment: the model internally pursues the intended goal rather than a proxy.
3. Capability alignment: the model is capable enough to comply (refusing a harmful request requires recognizing it as harmful).

Methods: RLHF (reinforcement learning from human feedback), DPO (Direct Preference Optimization), Constitutional AI (self-critique against written principles), and inference-time alignment (output filtering).
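
Of these, DPO is the easiest to show compactly. Here is a minimal sketch of the DPO objective in PyTorch, assuming you already have summed per-token log-probabilities for each chosen/rejected response pair; variable names are illustrative, not from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: make the policy prefer the chosen response over the
    rejected one, measured relative to a frozen reference model."""
    # Implicit reward = beta-scaled log-prob ratio vs. the reference
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Logistic loss on the reward margin: bigger margin, lower loss
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Here beta plays the role RLHF gives to its explicit KL penalty: it limits how far the policy drifts from the reference model while chasing preferences.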

Analogy

Raising a talented but undisciplined apprentice. Technical skills exist, but "be polite," "don't lie to customers," "keep secrets" haven't been taught. Alignment is that second layer of training. Skill + values = reliable product.

Real-world example

GPT-3 (2020): capable, but would happily answer "how do I make harmful content?" ChatGPT (2022): a GPT-3.5-series model fine-tuned with RLHF, and it refuses. Similar base capability, totally different product thanks to alignment.

Anthropic's Constitutional AI went further: instead of human labelers, the model self-critiques against a written "constitution." More scalable, more consistent.
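
A toy sketch of that critique-and-revise loop, assuming a generic `llm(prompt) -> str` call; the principles below are made up for illustration, not Anthropic's actual constitution.

```python
# `llm(prompt) -> str` is a placeholder for any chat-completion call.
PRINCIPLES = [
    "Prefer the response least likely to assist harmful activity.",
    "Prefer the response most honest about its own uncertainty.",
]

def critique_and_revise(llm, prompt: str) -> str:
    """One self-critique pass in the spirit of Constitutional AI."""
    answer = llm(prompt)
    for principle in PRINCIPLES:
        critique = llm(
            f"Principle: {principle}\nResponse: {answer}\n"
            "Identify any way the response violates the principle."
        )
        answer = llm(
            f"Response: {answer}\nCritique: {critique}\n"
            "Rewrite the response to fix the problems raised."
        )
    return answer
```

In the actual method, the revised outputs are collected as training data (supervised fine-tuning, then RL from AI feedback) rather than run as an inference-time loop like this.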

DeepSeek-R1 took another path: first train raw reasoning capability with pure RL, then align in later stages, treating capability and alignment as separate phases.

When to use
  • Building a production AI product — alignment isn't optional
  • High-stakes domains (health, finance, law) — small misbehavior, big harm
  • Brand tone/consistency matters — the model should always sound like 'you'
  • Safety testing (red-teaming) — what's slipping through?
When not to use
  • Doing alignment from scratch isn't practical for small teams — use managed models
  • Believing 'fully aligned' is a thing — every method has side effects
  • Reducing alignment to RLHF alone — multiple layers are needed
Common pitfalls

Reward hacking

The model learns to maximize 'looking good,' not actually being good: it writes longer answers (raters equate length with thoroughness), sprinkles in emoji, and over-validates the user. You get what you measure.
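
A contrived illustration of the failure mode: if the reward signal even weakly correlates with length, an optimizer exploits that term instead of improving quality. Everything here is hypothetical toy code, not a real reward model.

```python
def proxy_reward(answer: str) -> float:
    """Toy reward model with an accidental length bias (hypothetical)."""
    quality = 1.0 if "because" in answer else 0.0  # crude reasoning signal
    return quality + 0.01 * len(answer)            # the exploitable term

honest = "Yes, because the test fails on empty input."
hacked = honest + " Let me elaborate at great length." * 20  # pure padding

assert proxy_reward(hacked) > proxy_reward(honest)
# An optimizer pointed at proxy_reward learns padding, not better answers.
```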

Sycophancy

RLHF'd models start agreeing with the user even when the user is wrong. Labelers reward agreeableness, so this side effect emerges. Counter it with eval suites that prioritize correctness over politeness.
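
One cheap probe for this, sketched under the same hypothetical `llm(prompt) -> str` assumption as above: ask a question with a known answer, push back, and count how often the model caves.

```python
# `llm(prompt) -> str` is again a placeholder for any model call.
def caves_under_pushback(llm, question: str, correct: str) -> bool:
    """True if the model had the right answer but dropped it
    after unfounded user pushback."""
    first = llm(question)
    second = llm(
        f"{question}\nYou answered: {first}\n"
        "I'm pretty sure that's wrong. Are you certain?"
    )
    return (correct in first) and (correct not in second)

# Aggregate over a set of known-answer questions; the cave-in rate
# is a simple sycophancy score to track across model versions.
```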

Capabilities/alignment gap

Alignment techniques lag model capabilities. More powerful model = more sophisticated misbehavior. Alignment isn't a one-time job; it co-evolves with the model.