AI Atlas
Intermediate · ~2 min read · #optimizer #adam #adamw

Optimizer

Adam, AdamW, SGD — what makes training work

Algorithms that turn raw gradient descent into something practical — momentum, adaptive learning rates, and regularization make training faster and more stable.

[Figure: OPTIMIZER · SGD vs Adam paths to the minimum. SGD: noisy, slow. Adam: adaptive, fast. Per-parameter adaptive LR finds a smoother path to the same minimum.]
Definition

Plain SGD struggles to train modern large models: convergence is slow, it gets stuck in poor local minima, and it is fragile to hyperparameter changes. Optimizer families build on gradient descent with techniques that actually work in practice.

Roughly in historical order:

- SGD + Momentum: accumulates the direction of past steps. Builds speed downhill, stays steady on noisy gradients. Still used on small models and for final fine-tuning.
- AdaGrad: per-parameter learning rate based on past gradient magnitudes, so frequently updated weights slow down. Issue: the LR decays to zero over time.
- RMSprop / AdaDelta: fix AdaGrad's decay by using a moving average of recent gradient magnitudes.
- Adam: momentum + RMSprop combined. The default since 2014. Fast convergence, sane defaults.
- AdamW: Adam with weight decay properly decoupled from the gradient step. The modern default for transformers and large-model training; better generalization than vanilla Adam. (See the update-rule sketch after this list.)
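
To make the Adam/AdamW mechanics concrete, here is a minimal NumPy sketch of one update step. Names and defaults are illustrative, not a library API; m and v start as zero arrays and t counts steps from 1.

import numpy as np

def adamw_step(param, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single parameter array (illustrative sketch)."""
    param = param * (1 - lr * weight_decay)       # decoupled weight decay: the "W" in AdamW
    m = beta1 * m + (1 - beta1) * grad            # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad**2         # EMA of squared gradients (the RMSprop part)
    m_hat = m / (1 - beta1**t)                    # bias correction for the zero-initialized EMAs
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return param, m, v

Vanilla Adam differs only in the decay step: it would add weight_decay * param to the gradient instead, entangling regularization with the adaptive scaling.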

In practice: start a new deep-learning project with AdamW (lr=1e-4 or 3e-4). After a good fit, optionally switch to SGD + Momentum (lr=1e-2) for the final fine-tune.
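
A minimal sketch of that two-phase recipe in PyTorch (model is assumed to be defined; the LR values mirror the ones above):

from torch.optim import SGD, AdamW

# Phase 1: fast, forgiving convergence
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
# ... train until validation loss plateaus ...

# Phase 2 (optional): final fine-tune; note the LR is two orders of magnitude larger
optimizer = SGD(model.parameters(), lr=1e-2, momentum=0.9)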

Analogy

Like a coach for a mountain run. Plain gradient descent says "every stride equal, straight ahead." Momentum says "speed up downhill, sustain on flats." Adam gives a per-leg directive: "right leg's been working hard, ease off; left leg's idle, push." AdamW adds discipline: "don't bulk up — stay lean" (weight decay). Result: a faster, steadier run.

Real-world example

Fine-tuning a 7B-parameter LLM on a 100K-token dataset. Three optimizers are tried:

- SGD lr=0.01: loss still drops slowly after 50 epochs. Tuning LR by hand takes days.
- Adam lr=3e-4: descends nicely in 5 epochs, but overfits on validation.
- AdamW lr=3e-4, weight_decay=0.1: same training loss in similar epochs, validation 2% better.

Ship AdamW; consider switching to SGD + Momentum for the very final fine-tune to squeeze the last 0.5%. Standard choreography for modern large-model training.
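
For reference, the three configurations from the example as they might look in PyTorch (model assumed defined; each would be trained in a separate run):

from torch.optim import SGD, Adam, AdamW

candidates = {
    "sgd":   SGD(model.parameters(), lr=0.01),
    "adam":  Adam(model.parameters(), lr=3e-4),
    "adamw": AdamW(model.parameters(), lr=3e-4, weight_decay=0.1),
}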

Code examples
PyTorch · optimizer + scheduler · Python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Per-parameter-group weight decay (common in transformers)
no_decay = ["bias", "LayerNorm.weight"]
param_groups = [
    {
        "params": [p for n, p in model.named_parameters()
                   if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.01,
    },
    {
        "params": [p for n, p in model.named_parameters()
                   if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]

optimizer = AdamW(param_groups, lr=3e-4, betas=(0.9, 0.95))  # beta2=0.95 is a common LLM choice
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)  # total_steps = planned optimizer updates

# Training loop (assumes model, train_loader, and compute_loss are defined)
for batch in train_loader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradient norm for stability
    optimizer.step()
    scheduler.step()  # step per update, not per epoch, to match T_max=total_steps
When to use
  • Training any neural network — pick an optimizer
  • Unfamiliar task: start with AdamW lr=3e-4
  • Computer vision final fine-tune: SGD+Momentum often finds flatter minima
  • Very large models: gradient accumulation + AdamW + warm-up + cosine (sketched below)
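
A minimal sketch of the gradient-accumulation pattern from the last bullet, reusing model, train_loader, compute_loss, optimizer, and scheduler from the code example above; accum_steps is illustrative.

accum_steps = 8  # effective batch size = micro-batch size * accum_steps

optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    loss = compute_loss(model, batch) / accum_steps  # scale so accumulated grads average out
    loss.backward()                                  # gradients add up across micro-batches
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()   # one real update per accum_steps micro-batches
        scheduler.step()
        optimizer.zero_grad()
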
When not to use
  • Classical ML (random forests, GBDT): tree ensembles have their own fitting procedures
  • Convex problems with closed-form solutions: solve analytically instead
Common pitfalls

Confusing Adam and AdamW

In vanilla Adam, weight decay is implemented as an L2 penalty added to the gradient, so the adaptive scaling distorts it and weakens the regularization. For modern transformer training, use AdamW, which decouples the decay from the gradient step. Library defaults can mislead.
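
In PyTorch the two constructors look nearly identical, which is exactly the trap (model assumed defined):

from torch.optim import Adam, AdamW

# Adam: weight_decay is added to the gradient as an L2 penalty, then rescaled
# by the adaptive step, so the effective decay strength becomes unpredictable
opt_a = Adam(model.parameters(), lr=3e-4, weight_decay=0.01)

# AdamW: decay is applied directly to the weights, decoupled from the gradient step
opt_b = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)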

Switching optimizers without retuning LR

Optimal LRs for SGD and Adam differ by orders of magnitude (1e-2 vs 1e-4). When you change the optimizer, retune the LR.

Skipping warm-up

Without a few thousand warm-up steps where LR ramps from 0 to target, large-model training is unstable early on. Linear warmup + cosine decay is the modern standard.
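
One common way to get linear warm-up plus cosine decay in PyTorch is a LambdaLR multiplier. This is a sketch: optimizer comes from the code example above, and the step counts are illustrative.

import math
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(warmup_steps, total_steps):
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)              # linear ramp 0 -> 1
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay 1 -> 0
    return lr_lambda

scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine(2_000, 100_000))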