Gradient Descent
The optimizer underneath everything
An iterative optimization algorithm that updates a model's parameters in the direction of steepest decrease of the loss — the foundation of nearly every ML training run.
Gradient descent finds the minimum of a function iteratively. At each step it computes the gradient at the current point and takes a step in the opposite direction — that's the path of fastest descent. Step size is the learning rate.
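In code, that is a one-line update rule, w ← w − lr · ∇L(w). A minimal sketch on the toy loss L(w) = (w − 3)², with the starting point and learning rate chosen only for illustration:
w = 0.0                     # arbitrary starting point
lr = 0.1                    # learning rate (step size)
for step in range(50):
    grad = 2 * (w - 3)      # dL/dw for the toy loss L(w) = (w - 3)^2
    w -= lr * grad          # step opposite the gradient
print(w)                    # converges to ~3.0, the minimum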
Three main flavors. Batch GD uses the entire dataset per update — slow but steady. Stochastic GD (SGD) uses a single sample — noisy but fast, can escape local minima. Mini-batch GD is the middle ground and the modern standard, with batch sizes typically 32–512.
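The flavors differ only in how many samples feed each gradient estimate; the update rule itself is unchanged. A sketch on a synthetic dataset (sizes and learning rate are placeholders):
import torch
X, y = torch.randn(1024, 10), torch.randn(1024, 1)   # toy dataset
w = torch.zeros(10, 1, requires_grad=True)
batch_idx = torch.arange(1024)                        # batch GD: all 1,024 samples per update
sgd_idx   = torch.randint(0, 1024, (1,))              # SGD: one random sample per update
mini_idx  = torch.randperm(1024)[:128]                # mini-batch GD: a random slice of 128
loss = ((X[mini_idx] @ w - y[mini_idx]) ** 2).mean()  # swap in the other index sets to compare
loss.backward()
with torch.no_grad():
    w -= 0.01 * w.grad                                # same update rule in every flavor
    w.grad.zero_()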
Plain SGD is rarely used as-is. Momentum averages past gradients — it accelerates through flat regions. AdaGrad/RMSprop/Adam maintain per-parameter learning rates — parameters with frequent, large gradients get smaller steps, rarely updated ones get larger steps. Adam or AdamW is the default in modern deep learning.
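As a rough sketch of what momentum adds (hand-rolled, not PyTorch's internal implementation, reusing the toy loss from above):
w, velocity = 0.0, 0.0
mu, lr = 0.9, 0.01                   # momentum coefficient and learning rate
for step in range(200):
    grad = 2 * (w - 3)               # gradient of L(w) = (w - 3)^2
    velocity = mu * velocity + grad  # exponential trail of past gradients
    w -= lr * velocity               # velocity keeps building through flat stretches
# Adaptive methods (AdaGrad/RMSprop/Adam) go further: they divide each step by a running
# RMS of that parameter's past gradients, giving every parameter its own effective learning rate.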
Descending a foggy mountain: you can't see, but you can feel the slope under your feet. You step opposite the slope and repeat. Big step → you overshoot or fall off; tiny step → you'll never get there. The learning rate is the size of that step. Tuning it is the craft.
A CNN trains on ImageNet. Each image yields a prediction; the loss is cross-entropy against the true label. Backprop computes gradients for every weight, and gradient descent updates each weight against its gradient by learning_rate × gradient.
Batch size 256 → 5,000 steps per epoch on 1.28M images. Ninety epochs → 450,000 updates. Adam adapts each parameter's step based on past gradients. End result: in days, millions of parameters become a vision model with 75%+ accuracy.
import torch
import torch.nn as nn
model = nn.Linear(100, 10)
loss_fn = nn.CrossEntropyLoss()
optimizers = {
    "SGD": torch.optim.SGD(model.parameters(), lr=0.01),
    "SGD+Momentum": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    "Adam": torch.optim.Adam(model.parameters(), lr=0.001),
    "AdamW": torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01),
}
# AdamW is the modern default for transformer/large-model training.
# Vanilla SGD still wins on small models and final fine-tuning —
# it can find flatter, more generalizing minima.

from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR
optimizer = optimizers["AdamW"]    # pick one of the optimizers defined above
scheduler = CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(100):
    for batch in train_loader:     # train_loader / compute_loss stand in for your data pipeline and loss
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        optimizer.step()
    scheduler.step()               # epoch-level scheduler: step once per epoch (T_max=100 epochs)
    print(f"Epoch {epoch}, LR={scheduler.get_last_lr()[0]:.5f}")
- Lots of parameters and gradients are available
- Neural networks, logistic regression, gradient-boosting internals
- Large datasets without closed-form solutions
- Closed-form-solvable problems (small linear regression) — analytic is faster (see the sketch after this list)
- Non-differentiable operations — try RL or gradient-free methods
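For the closed-form case, a small least-squares problem solves in a single call with no iteration (synthetic data; assumes a PyTorch version that ships torch.linalg.lstsq):
import torch
X, y = torch.randn(200, 5), torch.randn(200, 1)   # small synthetic regression problem
theta = torch.linalg.lstsq(X, y).solution         # closed-form least squares: one solve, no gradient steps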
Wrong learning rate
Too high → the loss oscillates or blows up to NaN. Too low → training never finishes. An LR finder plus warm-up and a cosine schedule are modern best practice.
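A sketch of warm-up followed by cosine decay using PyTorch's built-in schedulers (assumes a version that includes SequentialLR; the 5-epoch warm-up and 100-epoch budget are placeholders):
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)   # ramp the LR up over the first 5 epochs
cosine = CosineAnnealingLR(optimizer, T_max=95)                  # then decay over the remaining 95
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])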
Local minima / saddle points
Gradient descent doesn't guarantee a global minimum and can stall on saddles. Momentum and Adam mitigate, but don't fully eliminate, the issue.
Ignoring batch size
Tiny batches → noisy and slow; huge batches → poor generalization. Batch size and learning rate generally scale together.
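A common heuristic is the linear scaling rule: keep the ratio of learning rate to batch size roughly constant as the batch grows (reference values here are illustrative):
base_lr, base_batch = 0.1, 256            # a reference recipe that is known to work
batch_size = 1024                         # scaled-up batch
lr = base_lr * batch_size / base_batch    # 0.4: learning rate scaled linearly with the batch
Large-batch recipes usually pair this with a warm-up phase so the first steps at the scaled-up learning rate stay stable.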