Backpropagation
Backprop · Computing gradients across layers
The algorithm that computes how much each parameter contributes to the loss in a neural network — the mathematical heart of modern deep learning.
A neural network passes information forward (input → output). During training, the error signal flows the other way, from output back to input. Backpropagation uses the chain rule to compute, layer by layer, how much each weight affects the loss. Once those partial derivatives are known, gradient descent updates each weight in proportion to its gradient.
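A minimal numeric sketch of that chain rule on a single weight (the names w, x, and target are purely illustrative):

import torch

# A tiny single-weight example: y = w * x, loss = (y - target)^2.
x = torch.tensor(3.0)
w = torch.tensor(2.0, requires_grad=True)
target = torch.tensor(10.0)

y = w * x
loss = (y - target) ** 2

# Chain rule by hand: dL/dw = dL/dy * dy/dw = 2 * (y - target) * x = -24
manual = (2 * (y - target) * x).item()

loss.backward()                  # autograd applies the same chain rule
print(manual, w.grad.item())     # -24.0 -24.0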
Backprop dates back to the 1980s; its real moment came in 2012, when GPUs and large labeled datasets (ImageNet) made deep networks trained with it practical at scale. Modern frameworks (PyTorch, TensorFlow, JAX) handle backprop transparently via autograd: you write the forward pass and the gradients come for free.
A crucial property: without backprop's efficiency, training billion-parameter models would be impossible. Naive numerical differentiation needs at least one extra forward pass per parameter, which is computationally intractable at GPT scale. Backprop computes all gradients in a single backward pass.
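A rough sketch of the cost argument, assuming the same small model used in the example further below:

import torch
import torch.nn as nn

# Count the parameters of the small model from the main example.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
n_params = sum(p.numel() for p in model.parameters())
print(n_params)   # 385: naive finite differences would need roughly that many extra forward passes

loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()   # one backward pass fills .grad for all 385 parameters at once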
Think of a restaurant fixing a salty dish. The customer complains: who oversalted? The chef? The salad station? The sauce reduction? The fix walks the kitchen pipeline backward to the source ("starter fine, mains under-salted, sauce too salty"), and each station gets its share of the correction. Backprop does the same: it spreads responsibility for the loss back along the chain.
Training a 3-layer network in PyTorch, per mini-batch:
1. Forward: input → layer 1 → layer 2 → output → loss.
2. Backward: start from the loss; compute dL/d(output), then dL/d(input of the last layer), then dL/d(weights). The chain rule multiplies these together to give a single gradient per weight.
3. Update: each weight W -= learning_rate × dL/dW.
You don't write these steps explicitly; loss.backward() derives them from the autograd graph. One line, billions of parameters. The cornerstone of modern deep learning.
import torch
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
X = torch.randn(64, 10)
y = torch.randn(64, 1)
optimizer.zero_grad()        # clear any gradients left from the previous step
preds = model(X)             # 1. forward pass
loss = loss_fn(preds, y)     #    scalar loss
loss.backward()              # 2. backward pass: autograd fills p.grad for every parameter
optimizer.step()             # 3. update: W -= lr * dL/dW
for name, p in model.named_parameters():
    print(name, p.grad.shape if p.grad is not None else None)
- Training any neural network — fundamental
- Writing custom losses or layers — gradient flow understanding required
- Diagnosing exploding/vanishing gradient issues
- Classical ML (random forest, logistic regression, etc.) — different optimizers
- Non-differentiable operations — need alternatives like surrogate gradients or RL
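For that last case, one common workaround is a surrogate gradient. A minimal straight-through sketch for a rounding op (the rounding example itself is illustrative, not from this entry):

import torch

# Straight-through estimator: the forward pass uses the non-differentiable round(),
# the backward pass treats it as the identity so a gradient can still reach x.
x = torch.randn(4, requires_grad=True)
x_rounded = x + (torch.round(x) - x).detach()   # value of round(x), gradient of identity

loss = (x_rounded ** 2).sum()
loss.backward()
print(x.grad)   # equals 2 * round(x), as if rounding were differentiable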
Exploding / vanishing gradients
In deep networks, gradients can blow up (overflowing to NaN) or shrink toward zero as they pass backward through many layers. Mitigations: ReLU-style activations, batch norm, residual connections (ResNet), gradient clipping.
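A minimal clipping sketch, reusing the model and optimizer pattern from the main example (max_norm=1.0 is an arbitrary illustrative value):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

optimizer.zero_grad()
loss = model(torch.randn(64, 10)).pow(2).mean()
loss.backward()
# Rescale all gradients so their combined L2 norm is at most 1.0 before the update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()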
Forgetting to zero gradients
PyTorch accumulates gradients in each parameter's .grad field. Without optimizer.zero_grad() at each step, old gradients pile up and corrupt the updates.
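A tiny demonstration of that accumulation (the scalar w is just illustrative):

import torch

w = torch.tensor(1.0, requires_grad=True)

(w * 3).backward()
print(w.grad)       # tensor(3.)

(w * 3).backward()  # second backward without zeroing in between
print(w.grad)       # tensor(6.): the new gradient was added to the old one

w.grad.zero_()      # what optimizer.zero_grad() does for each parameter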
Wrong train/eval mode
Dropout and batch norm behave differently under model.train() and model.eval(). Forgetting to switch produces inconsistent results at inference time.
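A minimal sketch of the switch using dropout (batch norm behaves analogously):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5))
x = torch.randn(1, 10)

model.train()
print(model(x))      # dropout active: roughly half the activations are zeroed, the rest scaled up

model.eval()
with torch.no_grad():
    print(model(x))  # dropout disabled: deterministic output for inference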