AI Atlas
Intermediate · ~2 min read · #dropout #regularization #neural-network

Dropout

Random neuron deactivation

Randomly zeros a fraction of neurons during each training forward pass — a regularizer that fights overfitting in neural nets.

[Diagram: random neurons drop on each forward pass; overfitting drops with them.]
Definition

Dropout randomly zeros a fraction of neurons in a hidden layer at each training forward pass. Typical rates are 0.2–0.5. The model can't lean on any particular neuron; every neuron has to work assuming its peers might not be there, leading to more robust, distributed representations.

Behavior differs between training and inference. During training, a fraction p of neurons is dropped and the survivors are scaled by 1/(1-p) to preserve the expected activation (so-called inverted dropout). At inference, nothing is dropped — all neurons stay active and the layer is a no-op. That's why switching between model.train() and model.eval() matters in PyTorch.
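The train/eval asymmetry and the 1/(1-p) scaling can be checked directly. A minimal sketch with nn.Dropout at p = 0.5, where every surviving value comes out scaled by 2:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

# Training mode: roughly half the values are zeroed; survivors are
# scaled by 1/(1-p) = 2 so the expected activation stays the same.
drop.train()
y_train = drop(x)
print(y_train)  # a mix of 0.0 and 2.0

# Eval mode: dropout is a no-op; everything passes through unscaled.
drop.eval()
y_eval = drop(x)
print(y_eval)   # all 1.0
```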

Dropout became mainstream with AlexNet (2012). Modern transformers still use it, though weight decay and normalization layers now handle most of the regularization. Dropout remains useful in small fully connected networks and LSTMs, and in Monte Carlo Dropout — keeping it active at inference to estimate model uncertainty.

Analogy

A team that fields different starters every match. Each player learns to play assuming any teammate might be missing — no one becomes a single point of failure. The team learns to perform with many lineups. Dropout teaches a network the same — every neuron must be useful in many possible sub-networks.

Real-world example

A small word-embedding + 2-hidden-layer classifier on 50K text examples. Without dropout: training 98%, validation 72% — heavy overfit. Adding Dropout(0.5) after each hidden layer drops training to 88% but lifts validation to 81%. The gap shrank; the model actually generalizes.

On a modern transformer the story differs. GPT-style attention layers are huge; only a small dropout (~0.1) is used; weight decay and LR scheduling do most of the regularizing work. As model size and data diversity grow, dropout matters less.

Code examples
PyTorch · dropout in a model (Python)
import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, p=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(p),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

model = Classifier(100, 64, 10)

model.train()
out = model(torch.randn(32, 100))

model.eval()
with torch.no_grad():
    preds = model(torch.randn(32, 100))
Monte Carlo Dropout for uncertainty (Python)
# Keep dropout on at inference and average multiple stochastic predictions
model.train()  # leaves dropout active (fine here: this model has no batch norm)
n_samples = 50
x = torch.randn(32, 100)  # a batch of inputs

with torch.no_grad():
    preds = torch.stack([model(x) for _ in range(n_samples)])

mean_pred = preds.mean(dim=0)
uncertainty = preds.std(dim=0)
When to use
  • Small-medium FCN, LSTM, basic CNN training
  • Visible overfit (large train/val gap)
  • Uncertainty estimation via MC Dropout
  • A small dose (~0.1) inside transformers
When not to use
  • Very large models + lots of data — weight decay and data augmentation are higher priority
  • Already using batch norm — combining can destabilize training
  • Default at inference time — keep it off unless deliberately doing MC Dropout
Common pitfalls

Forgetting eval mode

Without model.eval(), dropout stays on at inference; predictions become non-deterministic. Classic bug that leaks to production.
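A minimal sketch of the symptom — the same input gives different outputs until eval mode is set (the small nn.Sequential model here is a hypothetical stand-in):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Dropout(0.5))
x = torch.randn(4, 8)

net.train()               # dropout is active
a, b = net(x), net(x)
print(torch.equal(a, b))  # almost surely False: same input, different random masks

net.eval()                # dropout is a no-op
a, b = net(x), net(x)
print(torch.equal(a, b))  # True: inference is deterministic
```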

Dropout rate too high

0.7–0.8 usually pushes the model into underfitting. 0.2–0.5 is standard; 0.1 is enough in transformers.

Plain dropout on conv layers

Standard element-wise Dropout after conv layers isn't very effective — neighboring activations in a feature map are strongly correlated, so information leaks around the dropped units. Use SpatialDropout2D (channel-level dropout) or DropBlock instead.
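In PyTorch, the channel-level variant is nn.Dropout2d. A minimal sketch showing that it drops whole feature-map channels rather than individual activations:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# nn.Dropout2d is PyTorch's channel-level dropout (Keras calls it SpatialDropout2D)
drop2d = nn.Dropout2d(p=0.5)
drop2d.train()

fmap = torch.ones(1, 8, 4, 4)    # (batch, channels, height, width)
out = drop2d(fmap)

# Each channel is either zeroed entirely or kept entirely, scaled by 1/(1-p) = 2
for c in out.view(8, -1):
    print(c.unique().tolist())   # [0.0] or [2.0]
```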