Backpropagation
Backprop · Computing gradients across layers
The algorithm that computes how much each parameter contributes to the loss in a neural network — the mathematical heart of modern deep learning.
A neural network passes information forward (input → output). During training, the error signal flows the other way, from output back to input. Backpropagation uses the chain rule to compute, layer by layer, how much each weight affects the loss. Once those partial derivatives are known, gradient descent updates each weight in proportion to its gradient.
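A minimal numeric sketch of that chain rule on a single weight (the names w, x, and target are purely illustrative):

import torch

# A tiny single-weight example: y = w * x, loss = (y - target)^2.
x = torch.tensor(3.0)
w = torch.tensor(2.0, requires_grad=True)
target = torch.tensor(10.0)

y = w * x
loss = (y - target) ** 2

# Chain rule by hand: dL/dw = dL/dy * dy/dw = 2 * (y - target) * x = -24
manual = (2 * (y - target) * x).item()

loss.backward()                  # autograd applies the same chain rule
print(manual, w.grad.item())     # -24.0 -24.0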
Backprop dates back to the 1980s; its real moment came in 2012, when GPUs and large labeled datasets (ImageNet) made deep networks trained with it practical at scale. Modern frameworks (PyTorch, TensorFlow, JAX) handle backprop transparently via autograd: you write the forward pass and the gradients come for free.
A crucial property: without backprop's efficiency, training billion-parameter models would be impossible. Naive numerical differentiation needs at least one extra forward pass per parameter, which is computationally intractable at GPT scale. Backprop computes all gradients in a single backward pass.
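A rough sketch of the cost argument, assuming the same small model used in the example further below:

import torch
import torch.nn as nn

# Count the parameters of the small model from the main example.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
n_params = sum(p.numel() for p in model.parameters())
print(n_params)   # 385: naive finite differences would need roughly that many extra forward passes

loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()   # one backward pass fills .grad for all 385 parameters at once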
Think of a restaurant fixing a salty dish. The customer complains: who oversalted? The chef? The salad station? The sauce reduction? The fix walks the kitchen pipeline backward to the source ("starter fine, mains under-salted, sauce too salty"), and each station gets its share of the correction. Backprop does the same: it spreads responsibility for the loss back along the chain.
Training a 3-layer network in PyTorch, per mini-batch:
1. Forward: input → layer 1 → layer 2 → output → loss.
2. Backward: start from the loss; compute dL/d(output), then dL/d(input of the last layer), then dL/d(weights). The chain rule multiplies these together to give a single gradient per weight.
3. Update: each weight W -= learning_rate × dL/dW.
You don't write these steps explicitly; loss.backward() derives them from the autograd graph. One line, billions of parameters. The cornerstone of modern deep learning.
import torch
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
X = torch.randn(64, 10)
y = torch.randn(64, 1)
optimizer.zero_grad()        # clear any gradients left from the previous step
preds = model(X)             # 1. forward pass
loss = loss_fn(preds, y)     #    scalar loss
loss.backward()              # 2. backward pass: autograd fills p.grad for every parameter
optimizer.step()             # 3. update: W -= lr * dL/dW
for name, p in model.named_parameters():
    print(name, p.grad.shape if p.grad is not None else None)
- Training any neural network — fundamental
- Writing custom losses or layers — gradient flow understanding required
- Diagnosing exploding/vanishing gradient issues
- Classical ML (random forest, logistic regression, etc.) — different optimizers
- Non-differentiable operations — need alternatives like surrogate gradients or RL
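For that last case, one common workaround is a surrogate gradient. A minimal straight-through sketch for a rounding op (the rounding example itself is illustrative, not from this entry):

import torch

# Straight-through estimator: the forward pass uses the non-differentiable round(),
# the backward pass treats it as the identity so a gradient can still reach x.
x = torch.randn(4, requires_grad=True)
x_rounded = x + (torch.round(x) - x).detach()   # value of round(x), gradient of identity

loss = (x_rounded ** 2).sum()
loss.backward()
print(x.grad)   # equals 2 * round(x), as if rounding were differentiable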
Exploding / vanishing gradients
In deep networks, gradients can blow up (overflowing to NaN) or shrink toward zero as they pass backward through many layers. Mitigations: ReLU-style activations, batch norm, residual connections (ResNet), gradient clipping.
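A minimal clipping sketch, reusing the model and optimizer pattern from the main example (max_norm=1.0 is an arbitrary illustrative value):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

optimizer.zero_grad()
loss = model(torch.randn(64, 10)).pow(2).mean()
loss.backward()
# Rescale all gradients so their combined L2 norm is at most 1.0 before the update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()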
Forgetting to zero gradients
PyTorch accumulates gradients in each parameter's .grad field. Without optimizer.zero_grad() at each step, old gradients pile up and corrupt the updates.
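A tiny demonstration of that accumulation (the scalar w is just illustrative):

import torch

w = torch.tensor(1.0, requires_grad=True)

(w * 3).backward()
print(w.grad)       # tensor(3.)

(w * 3).backward()  # second backward without zeroing in between
print(w.grad)       # tensor(6.): the new gradient was added to the old one

w.grad.zero_()      # what optimizer.zero_grad() does for each parameter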
Wrong train/eval mode
Dropout and batch norm behave differently under model.train() and model.eval(). Forgetting to switch produces inconsistent results at inference time.
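A minimal sketch of the switch using dropout (batch norm behaves analogously):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5))
x = torch.randn(1, 10)

model.train()
print(model(x))      # dropout active: roughly half the activations are zeroed, the rest scaled up

model.eval()
with torch.no_grad():
    print(model(x))  # dropout disabled: deterministic output for inference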