Activation Function
Where neurons get their nonlinearity
A nonlinear function applied to a neuron's weighted input — it's what gives a neural network its ability to model complex patterns.
A network of pure linear layers collapses, mathematically, into one linear transformation; depth would buy you nothing. Activation functions inject nonlinearity at each neuron's output, giving the network its expressive power.
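The collapse claim can be checked directly: a minimal sketch (layer sizes are arbitrary) showing that two stacked `nn.Linear` layers without an activation equal one combined linear map.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between
f = nn.Linear(4, 8, bias=False)
g = nn.Linear(8, 2, bias=False)

x = torch.randn(3, 4)
stacked = g(f(x))

# The same map as a single linear layer with weight W = W_g @ W_f
combined = x @ (g.weight @ f.weight).T

print(torch.allclose(stacked, combined, atol=1e-5))  # True: depth bought nothing
```

Inserting any nonlinearity between `f` and `g` breaks this equivalence, which is exactly the point of an activation function.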
The common ones:
- ReLU (Rectified Linear Unit): max(0, x). The modern default. Negatives clipped to 0, positives pass through. Fast, and the gradient rarely dies, but the "dead ReLU" problem exists (neurons stuck at 0).
- Leaky ReLU / GELU / SiLU: ReLU variants that keep a small, nonzero response for negative inputs. GELU is the standard in transformers.
- Sigmoid: 1 / (1 + e^(-x)). Output in 0–1; classic for binary classification output, but causes vanishing gradients in hidden layers.
- Tanh: sigmoid-like with output in [-1, 1], centered at 0.
- Softmax: for multi-class classification output; turns N logits into a probability distribution over N classes.
Activation choice meaningfully affects training speed and final performance. Hidden layers in CNNs and transformers usually use ReLU/GELU; output layers vary by problem (sigmoid/softmax/identity).
A door latch. Push lightly — nothing happens (zero output). Push past a threshold — the latch moves proportional to your push. ReLU is exactly that: zero below, linear above. Sigmoid is a softer door — small touches give a small response, big pushes saturate and stop opening further.
Designing an image-classification CNN with three convolutional blocks (conv + activation + pool). Three activations are compared on an ImageNet subset:
| Activation | Accuracy | Train speed | Notes |
|------------|----------|-------------|-------|
| Sigmoid    | 71%      | slow        | vanishing gradients in deep layers |
| ReLU       | 78%      | fast        | a few neurons go dead |
| GELU       | 79%      | fast        | smooth, transformer favorite |
Activation choice is a real swing. The output layer uses softmax over 1000 classes; cross-entropy compares the resulting probability vector to the true label.
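The softmax-to-cross-entropy step described above can be sketched with a toy 3-class head (a stand-in for the 1000-class output; the logit values are made up):

```python
import torch
import torch.nn.functional as F

# Toy 3-class logits and the true class index
logits = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])

probs = F.softmax(logits, dim=-1)
# cross_entropy takes raw logits (it applies log-softmax internally)
loss = F.cross_entropy(logits, target)

# Equivalent by hand: negative log-probability of the true class
manual = -torch.log(probs[0, target[0]])
print(loss.item(), manual.item())  # the two match
```

This is why PyTorch models typically emit raw logits and let the loss function handle the softmax: it is numerically more stable than applying softmax yourself.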
```python
import torch
import torch.nn as nn

# Apply each activation to the same 7 evenly spaced inputs
x = torch.linspace(-3, 3, 7)
print("Input:", x.tolist())

activations = {
    "ReLU": nn.ReLU(),
    "LeakyReLU": nn.LeakyReLU(0.01),
    "GELU": nn.GELU(),
    "Sigmoid": nn.Sigmoid(),
    "Tanh": nn.Tanh(),
}
for name, fn in activations.items():
    y = fn(x).tolist()
    print(f"{name:10s}: {[round(v, 3) for v in y]}")

# Softmax turns logits into a probability distribution
logits = torch.tensor([[2.0, 1.0, 0.1]])
probs = nn.functional.softmax(logits, dim=-1)
print(probs)  # tensor([[0.6590, 0.2424, 0.0986]])
# Three probabilities sum to 1.0; standard multi-class output
```

- Designing any neural network: every hidden layer needs one
- Hidden layers: ReLU, LeakyReLU, GELU, SiLU
- Output: sigmoid for binary, softmax for multi-class, identity for regression
- Classical ML (random forest, SVM, etc.) — concept doesn't apply
- Wrong choice on the output layer — e.g. sigmoid for multi-class
Dead ReLU
If inputs to a ReLU neuron stay negative, its output is always 0 and its gradient is 0, so the neuron stops learning. LeakyReLU/GELU largely solve this.
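A minimal sketch of the problem (the pre-activation values are made up; what matters is that they are all negative):

```python
import torch
import torch.nn as nn

# A pre-activation that is negative for every input in the batch
z = torch.tensor([-2.0, -0.5, -3.1], requires_grad=True)
nn.ReLU()(z).sum().backward()
print(z.grad)   # tensor([0., 0., 0.]): no learning signal at all

# LeakyReLU keeps a small gradient alive on the negative side
z2 = torch.tensor([-2.0, -0.5, -3.1], requires_grad=True)
nn.LeakyReLU(0.01)(z2).sum().backward()
print(z2.grad)  # tensor([0.0100, 0.0100, 0.0100])
```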
Sigmoid in deep hidden layers
Sigmoid's derivative tops out at 0.25; multiplied across many layers, gradients shrink geometrically and vanish. Use ReLU/GELU in hidden layers.
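The 0.25 ceiling and its compounding effect can be verified in a couple of lines (the 10-layer depth is an illustrative assumption):

```python
import torch

x = torch.tensor(0.0)
s = torch.sigmoid(x)

# Sigmoid's derivative is s(x) * (1 - s(x)), which peaks at x = 0
d = s * (1 - s)
print(d.item())    # 0.25

# Best case through 10 sigmoid hidden layers: a factor of 0.25**10
print(0.25 ** 10)  # ~9.5e-07: the gradient has effectively vanished
```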
Wrong output activation
Sigmoid in regression squashes outputs to 0–1. Sigmoid in multi-class doesn't sum to 1. Match the activation to the problem.
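The multi-class mismatch is easy to demonstrate; a quick check with arbitrary 3-class logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])

# Element-wise sigmoid over multi-class logits does NOT sum to 1
print(torch.sigmoid(logits).sum().item())      # ~2.14, not a distribution

# Softmax produces a proper probability distribution
print(F.softmax(logits, dim=-1).sum().item())  # 1.0
```

(Element-wise sigmoid is still the right tool for multi-*label* problems, where classes are independent yes/no decisions rather than mutually exclusive.)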