Dropout
Random neuron deactivation
Randomly zeros a fraction of neurons during each training forward pass — a regularizer that fights overfitting in neural nets.
Dropout randomly zeros a fraction of neurons in a hidden layer at each training forward pass. Typical rates are 0.2–0.5. The model can't lean on any particular neuron; every neuron has to work assuming its peers might not be there, leading to more robust, distributed representations.
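A minimal sketch of the mechanism using PyTorch's nn.Dropout (the tensor and rate here are illustrative, not taken from the example later in this entry): each training-mode forward pass zeros a different random subset of activations.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.3)   # illustrative rate from the typical 0.2-0.5 range
drop.train()               # dropout only acts in training mode

h = torch.ones(8)          # stand-in for a hidden-layer activation
print(drop(h))             # roughly 30% of entries are zeroed
print(drop(h))             # a different subset is zeroed on the next pass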
Behavior differs between training and inference. In training, a fraction of neurons is dropped and the survivors are scaled by 1/(1-p) to preserve the expected activation. At inference nothing is dropped and all neurons stay active. That's why model.train() vs model.eval() matters in PyTorch.
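A quick check of that scaling (a small sketch; PyTorch's nn.Dropout implements this "inverted dropout" behavior by default):

import torch
import torch.nn as nn

p = 0.5
drop = nn.Dropout(p)
x = torch.ones(6)

drop.train()
print(drop(x))   # zeros and values of 1/(1-p) = 2.0, so the expected activation stays 1.0

drop.eval()
print(drop(x))   # identical to x: nothing is dropped at inference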
Dropout became mainstream with AlexNet (2012). Modern transformers still use it, though normalization layers and weight decay handle most of the regularization. Dropout remains helpful in small FCNs and LSTMs, and in Monte Carlo Dropout, where it is kept active at inference to estimate model uncertainty.
A team that fields different starters every match. Each player learns to play assuming any teammate might be missing — no one becomes a single point of failure. The team learns to perform with many lineups. Dropout teaches a network the same — every neuron must be useful in many possible sub-networks.
A small word-embedding + 2-hidden-layer classifier on 50K text examples. Without dropout: 98% training accuracy but only 72% validation accuracy, a heavy overfit. Adding Dropout(0.5) after each hidden layer drops training accuracy to 88% but lifts validation to 81%. The gap shrinks; the model actually generalizes.
On a modern transformer the story differs. GPT-style attention layers are huge; only a small dropout (~0.1) is used; weight decay and LR scheduling do most of the regularizing work. As model size and data diversity grow, dropout matters less.
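As an illustration (not part of the example above), PyTorch's built-in encoder block exposes exactly this small rate; dropout defaults to 0.1 in nn.TransformerEncoderLayer:

import torch.nn as nn

# Illustrative transformer block with the small dropout rate typical of GPT-style models
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dropout=0.1)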
import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, p=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(p),          # dropout after each hidden activation
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)
model = Classifier(100, 64, 10)

# Training mode: dropout layers are active
model.train()
out = model(torch.randn(32, 100))

# Evaluation mode: dropout layers are disabled
model.eval()
with torch.no_grad():
    preds = model(torch.randn(32, 100))
# Monte Carlo Dropout: keep dropout on at inference and average multiple predictions
model.train()                      # dropout stays active
x = torch.randn(32, 100)           # inputs to score
n_samples = 50
with torch.no_grad():
    preds = torch.stack([model(x) for _ in range(n_samples)])
mean_pred = preds.mean(dim=0)      # averaged prediction
uncertainty = preds.std(dim=0)     # spread across samples as an uncertainty estimate

When to use:
- Small-medium FCN, LSTM, basic CNN training
- Visible overfit (large train/val gap)
- Uncertainty estimation via MC Dropout
- Small dose (0.1) inside transformers
When to skip:
- Very large models + lots of data — weight decay and data augmentation are higher priority
- Already using batch norm — combining can destabilize training
- Default at inference time — keep it off unless deliberately doing MC Dropout
Common mistakes:

Forgetting eval mode
Without model.eval(), dropout stays on at inference; predictions become non-deterministic. Classic bug that leaks to production.
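A self-contained sketch of the symptom (the small network here is illustrative): without eval(), two forward passes on the same input disagree.

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 10))
x = torch.randn(1, 100)

net.train()                              # eval() forgotten
print(torch.allclose(net(x), net(x)))    # False: dropout randomizes the predictions

net.eval()                               # correct inference mode
print(torch.allclose(net(x), net(x)))    # True: deterministic predictions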
Dropout rate too high
0.7–0.8 usually pushes the model into underfitting. 0.2–0.5 is standard; 0.1 is enough in transformers.
Plain dropout on conv layers
Standard Dropout after conv layers isn't very effective. Use SpatialDropout2D (channel-level) or DropBlock instead.
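In PyTorch, the channel-level variant is nn.Dropout2d (the analogue of Keras's SpatialDropout2D); a minimal sketch of placing it after a conv layer:

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
spatial_drop = nn.Dropout2d(p=0.2)       # zeros whole feature maps, not individual activations

spatial_drop.train()
feat = conv(torch.randn(4, 3, 32, 32))
out = spatial_drop(feat)                 # ~20% of the 16 channels are zeroed entirely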