Activation Function
Where neurons get their nonlinearity
A nonlinear function applied to a neuron's weighted input — it's what gives a neural network its ability to model complex patterns.
A network of pure linear layers collapses, mathematically, into one linear transformation; depth would buy you nothing. Activation functions inject nonlinearity at each neuron's output, giving the network its expressive power.
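The collapse claim can be checked directly: a minimal sketch (layer sizes are arbitrary) showing that two stacked `nn.Linear` layers without an activation equal one combined linear map.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between
f = nn.Linear(4, 8, bias=False)
g = nn.Linear(8, 2, bias=False)

x = torch.randn(3, 4)
stacked = g(f(x))

# The same map as a single linear layer with weight W = W_g @ W_f
combined = x @ (g.weight @ f.weight).T

print(torch.allclose(stacked, combined, atol=1e-5))  # True: depth bought nothing
```

Inserting any nonlinearity between `f` and `g` breaks this equivalence, which is exactly the point of an activation function.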
The common ones:
- ReLU (Rectified Linear Unit): max(0, x). The modern default. Negatives clipped to 0, positives pass through. Fast, and the gradient rarely dies, but the "dead ReLU" problem exists (neurons stuck at 0).
- Leaky ReLU / GELU / SiLU: ReLU variants that keep a small, nonzero response for negative inputs. GELU is the standard in transformers.
- Sigmoid: 1 / (1 + e^(-x)). Output in 0–1; classic for binary classification output, but causes vanishing gradients in hidden layers.
- Tanh: sigmoid-like with output in [-1, 1], centered at 0.
- Softmax: for multi-class classification output; turns N logits into a probability distribution over N classes.
Activation choice meaningfully affects training speed and final performance. Hidden layers in CNNs and transformers usually use ReLU/GELU; output layers vary by problem (sigmoid/softmax/identity).
A door latch. Push lightly — nothing happens (zero output). Push past a threshold — the latch moves proportional to your push. ReLU is exactly that: zero below, linear above. Sigmoid is a softer door — small touches give a small response, big pushes saturate and stop opening further.
Designing an image-classification CNN with three convolutional blocks (conv + activation + pool). Three activations are compared on an ImageNet subset:
| Activation | Accuracy | Train speed | Notes |
|------------|----------|-------------|-------|
| Sigmoid    | 71%      | slow        | vanishing gradients in deep layers |
| ReLU       | 78%      | fast        | a few neurons go dead |
| GELU       | 79%      | fast        | smooth, transformer favorite |
Activation choice is a real swing. The output layer uses softmax over 1000 classes; cross-entropy compares the resulting probability vector to the true label.
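The softmax-to-cross-entropy step described above can be sketched with a toy 3-class head (a stand-in for the 1000-class output; the logit values are made up):

```python
import torch
import torch.nn.functional as F

# Toy 3-class logits and the true class index
logits = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])

probs = F.softmax(logits, dim=-1)
# cross_entropy takes raw logits (it applies log-softmax internally)
loss = F.cross_entropy(logits, target)

# Equivalent by hand: negative log-probability of the true class
manual = -torch.log(probs[0, target[0]])
print(loss.item(), manual.item())  # the two match
```

This is why PyTorch models typically emit raw logits and let the loss function handle the softmax: it is numerically more stable than applying softmax yourself.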
```python
import torch
import torch.nn as nn

# Apply each activation to the same 7 evenly spaced inputs
x = torch.linspace(-3, 3, 7)
print("Input:", x.tolist())

activations = {
    "ReLU": nn.ReLU(),
    "LeakyReLU": nn.LeakyReLU(0.01),
    "GELU": nn.GELU(),
    "Sigmoid": nn.Sigmoid(),
    "Tanh": nn.Tanh(),
}
for name, fn in activations.items():
    y = fn(x).tolist()
    print(f"{name:10s}: {[round(v, 3) for v in y]}")

# Softmax turns logits into a probability distribution
logits = torch.tensor([[2.0, 1.0, 0.1]])
probs = nn.functional.softmax(logits, dim=-1)
print(probs)  # tensor([[0.6590, 0.2424, 0.0986]])
# Three probabilities sum to 1.0; standard multi-class output
```

- Designing any neural network: every hidden layer needs one
- Hidden layers: ReLU, LeakyReLU, GELU, SiLU
- Output: sigmoid for binary, softmax for multi-class, identity for regression
- Classical ML (random forest, SVM, etc.) — concept doesn't apply
- Wrong choice on the output layer — e.g. sigmoid for multi-class
Dead ReLU
If inputs to a ReLU neuron stay negative, its output is always 0 and its gradient is 0, so the neuron stops learning. LeakyReLU/GELU largely solve this.
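A minimal sketch of the problem (the pre-activation values are made up; what matters is that they are all negative):

```python
import torch
import torch.nn as nn

# A pre-activation that is negative for every input in the batch
z = torch.tensor([-2.0, -0.5, -3.1], requires_grad=True)
nn.ReLU()(z).sum().backward()
print(z.grad)   # tensor([0., 0., 0.]): no learning signal at all

# LeakyReLU keeps a small gradient alive on the negative side
z2 = torch.tensor([-2.0, -0.5, -3.1], requires_grad=True)
nn.LeakyReLU(0.01)(z2).sum().backward()
print(z2.grad)  # tensor([0.0100, 0.0100, 0.0100])
```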
Sigmoid in deep hidden layers
Sigmoid's derivative tops out at 0.25; multiplied across many layers, gradients shrink geometrically and vanish. Use ReLU/GELU in hidden layers.
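The 0.25 ceiling and its compounding effect can be verified in a couple of lines (the 10-layer depth is an illustrative assumption):

```python
import torch

x = torch.tensor(0.0)
s = torch.sigmoid(x)

# Sigmoid's derivative is s(x) * (1 - s(x)), which peaks at x = 0
d = s * (1 - s)
print(d.item())    # 0.25

# Best case through 10 sigmoid hidden layers: a factor of 0.25**10
print(0.25 ** 10)  # ~9.5e-07: the gradient has effectively vanished
```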
Wrong output activation
Sigmoid in regression squashes outputs to 0–1. Sigmoid in multi-class doesn't sum to 1. Match the activation to the problem.
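The multi-class mismatch is easy to demonstrate; a quick check with arbitrary 3-class logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])

# Element-wise sigmoid over multi-class logits does NOT sum to 1
print(torch.sigmoid(logits).sum().item())      # ~2.14, not a distribution

# Softmax produces a proper probability distribution
print(F.softmax(logits, dim=-1).sum().item())  # 1.0
```

(Element-wise sigmoid is still the right tool for multi-*label* problems, where classes are independent yes/no decisions rather than mutually exclusive.)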