Dropout
Random neuron deactivation
Randomly zeros a fraction of neurons during each training forward pass — a regularizer that fights overfitting in neural nets.
Dropout randomly zeros a fraction of neurons in a hidden layer at each training forward pass. Typical rates are 0.2–0.5. The model can't lean on any particular neuron; every neuron has to work assuming its peers might not be there, leading to more robust, distributed representations.
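A minimal sketch of the mechanism using PyTorch's nn.Dropout (the tensor and rate here are illustrative, not taken from the example later in this entry): each training-mode forward pass zeros a different random subset of activations.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.3)   # illustrative rate from the typical 0.2-0.5 range
drop.train()               # dropout only acts in training mode

h = torch.ones(8)          # stand-in for a hidden-layer activation
print(drop(h))             # roughly 30% of entries are zeroed
print(drop(h))             # a different subset is zeroed on the next pass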
Behavior differs between training and inference. In training, a fraction of neurons is dropped and the survivors are scaled by 1/(1-p) to preserve the expected activation. At inference nothing is dropped and all neurons stay active. That's why model.train() vs model.eval() matters in PyTorch.
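A quick check of that scaling (a small sketch; PyTorch's nn.Dropout implements this "inverted dropout" behavior by default):

import torch
import torch.nn as nn

p = 0.5
drop = nn.Dropout(p)
x = torch.ones(6)

drop.train()
print(drop(x))   # zeros and values of 1/(1-p) = 2.0, so the expected activation stays 1.0

drop.eval()
print(drop(x))   # identical to x: nothing is dropped at inference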
Dropout became mainstream with AlexNet (2012). Modern transformers still use it, though normalization layers and weight decay handle most of the regularization. Dropout remains helpful in small FCNs and LSTMs, and in Monte Carlo Dropout, where it is kept active at inference to estimate model uncertainty.
A team that fields different starters every match. Each player learns to play assuming any teammate might be missing — no one becomes a single point of failure. The team learns to perform with many lineups. Dropout teaches a network the same — every neuron must be useful in many possible sub-networks.
A small word-embedding + 2-hidden-layer classifier on 50K text examples. Without dropout: 98% training accuracy but only 72% validation accuracy, a heavy overfit. Adding Dropout(0.5) after each hidden layer drops training accuracy to 88% but lifts validation to 81%. The gap shrinks; the model actually generalizes.
On a modern transformer the story differs. GPT-style attention layers are huge; only a small dropout (~0.1) is used; weight decay and LR scheduling do most of the regularizing work. As model size and data diversity grow, dropout matters less.
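As an illustration (not part of the example above), PyTorch's built-in encoder block exposes exactly this small rate; dropout defaults to 0.1 in nn.TransformerEncoderLayer:

import torch.nn as nn

# Illustrative transformer block with the small dropout rate typical of GPT-style models
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dropout=0.1)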
import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, p=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(p),          # dropout after each hidden activation
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)
model = Classifier(100, 64, 10)

# Training mode: dropout layers are active
model.train()
out = model(torch.randn(32, 100))

# Evaluation mode: dropout layers are disabled
model.eval()
with torch.no_grad():
    preds = model(torch.randn(32, 100))
# Monte Carlo Dropout: keep dropout on at inference and average multiple predictions
model.train()                      # dropout stays active
x = torch.randn(32, 100)           # inputs to score
n_samples = 50
with torch.no_grad():
    preds = torch.stack([model(x) for _ in range(n_samples)])
mean_pred = preds.mean(dim=0)      # averaged prediction
uncertainty = preds.std(dim=0)     # spread across samples as an uncertainty estimate

When to use:
- Small-medium FCN, LSTM, basic CNN training
- Visible overfit (large train/val gap)
- Uncertainty estimation via MC Dropout
- Small dose (0.1) inside transformers
When to skip:
- Very large models + lots of data — weight decay and data augmentation are higher priority
- Already using batch norm — combining can destabilize training
- Default at inference time — keep it off unless deliberately doing MC Dropout
Common mistakes:

Forgetting eval mode
Without model.eval(), dropout stays on at inference; predictions become non-deterministic. Classic bug that leaks to production.
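A self-contained sketch of the symptom (the small network here is illustrative): without eval(), two forward passes on the same input disagree.

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 10))
x = torch.randn(1, 100)

net.train()                              # eval() forgotten
print(torch.allclose(net(x), net(x)))    # False: dropout randomizes the predictions

net.eval()                               # correct inference mode
print(torch.allclose(net(x), net(x)))    # True: deterministic predictions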
Dropout rate too high
0.7–0.8 usually pushes the model into underfitting. 0.2–0.5 is standard; 0.1 is enough in transformers.
Plain dropout on conv layers
Standard Dropout after conv layers isn't very effective. Use SpatialDropout2D (channel-level) or DropBlock instead.
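In PyTorch, the channel-level variant is nn.Dropout2d (the analogue of Keras's SpatialDropout2D); a minimal sketch of placing it after a conv layer:

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
spatial_drop = nn.Dropout2d(p=0.2)       # zeros whole feature maps, not individual activations

spatial_drop.train()
feat = conv(torch.randn(4, 3, 32, 32))
out = spatial_drop(feat)                 # ~20% of the 16 channels are zeroed entirely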