Gradient Descent
The optimizer underneath everything
An iterative optimization algorithm that updates a model's parameters in the direction of steepest decrease of the loss — the foundation of nearly every ML training run.
Gradient descent finds the minimum of a function iteratively. At each step it computes the gradient at the current point and takes a step in the opposite direction — that's the path of fastest descent. Step size is the learning rate.
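In code, that is a one-line update rule, w ← w − lr · ∇L(w). A minimal sketch on the toy loss L(w) = (w − 3)², with the starting point and learning rate chosen only for illustration:
w = 0.0                     # arbitrary starting point
lr = 0.1                    # learning rate (step size)
for step in range(50):
    grad = 2 * (w - 3)      # dL/dw for the toy loss L(w) = (w - 3)^2
    w -= lr * grad          # step opposite the gradient
print(w)                    # converges to ~3.0, the minimum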
Three main flavors. Batch GD uses the entire dataset per update — slow but steady. Stochastic GD (SGD) uses a single sample — noisy but fast, can escape local minima. Mini-batch GD is the middle ground and the modern standard, with batch sizes typically 32–512.
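The flavors differ only in how many samples feed each gradient estimate; the update rule itself is unchanged. A sketch on a synthetic dataset (sizes and learning rate are placeholders):
import torch
X, y = torch.randn(1024, 10), torch.randn(1024, 1)   # toy dataset
w = torch.zeros(10, 1, requires_grad=True)
batch_idx = torch.arange(1024)                        # batch GD: all 1,024 samples per update
sgd_idx   = torch.randint(0, 1024, (1,))              # SGD: one random sample per update
mini_idx  = torch.randperm(1024)[:128]                # mini-batch GD: a random slice of 128
loss = ((X[mini_idx] @ w - y[mini_idx]) ** 2).mean()  # swap in the other index sets to compare
loss.backward()
with torch.no_grad():
    w -= 0.01 * w.grad                                # same update rule in every flavor
    w.grad.zero_()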
Plain SGD is rarely used as-is. Momentum averages past gradients — it accelerates through flat regions. AdaGrad/RMSprop/Adam maintain per-parameter learning rates — parameters with frequent, large gradients get smaller steps, rarely updated ones get larger steps. Adam or AdamW is the default in modern deep learning.
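As a rough sketch of what momentum adds (hand-rolled, not PyTorch's internal implementation, reusing the toy loss from above):
w, velocity = 0.0, 0.0
mu, lr = 0.9, 0.01                   # momentum coefficient and learning rate
for step in range(200):
    grad = 2 * (w - 3)               # gradient of L(w) = (w - 3)^2
    velocity = mu * velocity + grad  # exponential trail of past gradients
    w -= lr * velocity               # velocity keeps building through flat stretches
# Adaptive methods (AdaGrad/RMSprop/Adam) go further: they divide each step by a running
# RMS of that parameter's past gradients, giving every parameter its own effective learning rate.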
Descending a foggy mountain: you can't see, but you can feel the slope under your feet. You step opposite the slope and repeat. Big step → you overshoot or fall off; tiny step → you'll never get there. The learning rate is the size of that step. Tuning it is the craft.
A CNN trains on ImageNet. Each image yields a prediction; the loss is cross-entropy against the true label. Backprop computes gradients for every weight, and gradient descent updates each weight against its gradient by learning_rate × gradient.
Batch size 256 → 5,000 steps per epoch on 1.28M images. Ninety epochs → 450,000 updates. Adam adapts each parameter's step based on past gradients. End result: in days, millions of parameters become a vision model with 75%+ accuracy.
import torch
import torch.nn as nn
model = nn.Linear(100, 10)
loss_fn = nn.CrossEntropyLoss()
optimizers = {
    "SGD": torch.optim.SGD(model.parameters(), lr=0.01),
    "SGD+Momentum": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    "Adam": torch.optim.Adam(model.parameters(), lr=0.001),
    "AdamW": torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01),
}
# AdamW is the modern default for transformer/large-model training.
# Vanilla SGD still wins on small models and final fine-tuning —
# it can find flatter, more generalizing minima.

from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR
optimizer = optimizers["AdamW"]    # pick one of the optimizers defined above
scheduler = CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(100):
    for batch in train_loader:     # train_loader / compute_loss stand in for your data pipeline and loss
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        optimizer.step()
    scheduler.step()               # epoch-level scheduler: step once per epoch (T_max=100 epochs)
    print(f"Epoch {epoch}, LR={scheduler.get_last_lr()[0]:.5f}")
- Lots of parameters and gradients are available
- Neural networks, logistic regression, gradient-boosting internals
- Large datasets without closed-form solutions
- Closed-form-solvable problems (small linear regression) — analytic is faster (see the sketch after this list)
- Non-differentiable operations — try RL or gradient-free methods
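For the closed-form case, a small least-squares problem solves in a single call with no iteration (synthetic data; assumes a PyTorch version that ships torch.linalg.lstsq):
import torch
X, y = torch.randn(200, 5), torch.randn(200, 1)   # small synthetic regression problem
theta = torch.linalg.lstsq(X, y).solution         # closed-form least squares: one solve, no gradient steps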
Wrong learning rate
Too high → the loss oscillates or blows up to NaN. Too low → training never finishes. An LR finder plus warm-up and a cosine schedule are modern best practice.
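A sketch of warm-up followed by cosine decay using PyTorch's built-in schedulers (assumes a version that includes SequentialLR; the 5-epoch warm-up and 100-epoch budget are placeholders):
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)   # ramp the LR up over the first 5 epochs
cosine = CosineAnnealingLR(optimizer, T_max=95)                  # then decay over the remaining 95
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])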
Local minima / saddle points
Gradient descent doesn't guarantee a global minimum and can stall on saddles. Momentum and Adam mitigate, but don't fully eliminate, the issue.
Ignoring batch size
Tiny batches → noisy and slow; huge batches → poor generalization. Batch size and learning rate generally scale together.
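A common heuristic is the linear scaling rule: keep the ratio of learning rate to batch size roughly constant as the batch grows (reference values here are illustrative):
base_lr, base_batch = 0.1, 256            # a reference recipe that is known to work
batch_size = 1024                         # scaled-up batch
lr = base_lr * batch_size / base_batch    # 0.4: learning rate scaled linearly with the batch
Large-batch recipes usually pair this with a warm-up phase so the first steps at the scaled-up learning rate stay stable.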