Interactive architecture builder. Drag to pan, scroll to zoom, right-click to edit.

Dropout in Neural Networks: Complete Guide

The elegant regularization trick that makes networks learn redundant, robust features.

Quick Context

Dropout (Srivastava et al., 2014) randomly sets a fraction of neuron activations to zero during each training forward pass. This prevents neurons from co-adapting — each neuron must learn to be useful independently, leading to more robust features and reduced overfitting.

1) How Dropout Works

During training, for each forward pass, every neuron in a dropout layer has a probability p of being "dropped" (set to zero). The remaining neurons are scaled by 1/(1-p) to maintain the expected output magnitude.

Training:
  # keep each unit with probability 1 - p (NumPy; assumes import numpy as np)
  mask = (np.random.rand(*activation.shape) < (1 - p)).astype(activation.dtype)
  output = activation * mask / (1 - p)   # inverted dropout: rescale the survivors

Inference:
  output = activation                    # no dropout, no scaling

The 1/(1-p) scaling during training (called inverted dropout) ensures that at inference time, you can simply use the network as-is without any modification.
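
Most deep learning frameworks implement exactly this inverted scheme. A minimal PyTorch sketch (the rate and tensor shape here are arbitrary placeholders):

  import torch
  import torch.nn as nn

  drop = nn.Dropout(p=0.5)     # inverted dropout, as described above
  x = torch.ones(1, 8)

  drop.train()                 # training mode: mask and rescale
  print(drop(x))               # roughly half the entries are 0, the rest are 2.0

  drop.eval()                  # inference mode: identity
  print(drop(x))               # all ones, untouched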

2) Why Dropout Reduces Overfitting

  • Breaks co-adaptation — neurons can't rely on specific other neurons being present, so each must independently extract useful features.
  • Implicit ensemble — each training step uses a different random sub-network. Dropout is approximately equivalent to averaging predictions from 2^n different networks (where n is the number of droppable neurons); a numeric sketch of this averaging follows the list.
  • Noise injection — the randomness acts as a form of data augmentation at the feature level.
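
The averaging claim in the ensemble bullet can be checked numerically for a single layer: in expectation, the rescaled sub-network activations equal the unmodified inference-time activations. A small NumPy sketch (the vector and sample count are arbitrary):

  import numpy as np

  rng = np.random.default_rng(0)
  p = 0.5                                  # drop probability
  activation = np.array([1.0, 2.0, 3.0])

  # average many training-mode sub-network outputs for this layer
  samples = [
      activation * (rng.random(activation.shape) < 1 - p) / (1 - p)
      for _ in range(10_000)
  ]
  print(np.mean(samples, axis=0))          # ≈ [1. 2. 3.], the inference output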

3) Choosing the Dropout Rate

  • Input layer: p = 0.1–0.2 (dropping too many inputs loses information).
  • Hidden layers: p = 0.2–0.5 (0.5 was the original paper's default).
  • Output layer: typically no dropout.
  • Higher dropout = stronger regularization. If the model still overfits, increase p. If it underfits, decrease p.
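
Taken together, these guidelines translate into a model definition along the following lines (a sketch only; the layer widths and the 784-input / 10-class shapes are placeholders, not part of any specific recipe):

  import torch.nn as nn

  model = nn.Sequential(
      nn.Dropout(0.1),                 # light dropout on the inputs
      nn.Linear(784, 512), nn.ReLU(),
      nn.Dropout(0.5),                 # original-paper default for hidden layers
      nn.Linear(512, 256), nn.ReLU(),
      nn.Dropout(0.5),
      nn.Linear(256, 10),              # no dropout on the output
  )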

4) Guided Experiments

  1. Train without dropout — notice the gap between training loss (low) and validation loss (higher).
  2. Enable dropout at 0.2 — the gap should narrow.
  3. Increase to 0.5 — strong regularization; training loss may be higher but validation loss improves.
  4. Try 0.8 — too aggressive; model underfits.
  5. Watch the visualization to see which neurons are active vs. dropped in each forward pass.
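
Experiments 1–4 can also be reproduced outside the builder with an ordinary training loop that sweeps the dropout rate. A minimal sketch (the data below is random noise, used only to show the mechanics; substitute a real, small dataset to see the train/validation gap behave as described):

  import torch
  import torch.nn as nn

  torch.manual_seed(0)
  x_train, y_train = torch.randn(64, 20), torch.randint(0, 2, (64,))
  x_val, y_val = torch.randn(256, 20), torch.randint(0, 2, (256,))

  loss_fn = nn.CrossEntropyLoss()
  for rate in [0.0, 0.2, 0.5, 0.8]:
      model = nn.Sequential(nn.Linear(20, 256), nn.ReLU(),
                            nn.Dropout(rate), nn.Linear(256, 2))
      opt = torch.optim.Adam(model.parameters(), lr=1e-3)
      for _ in range(200):                  # deliberately overfit the tiny train set
          model.train()
          opt.zero_grad()
          loss = loss_fn(model(x_train), y_train)
          loss.backward()
          opt.step()
      model.eval()
      with torch.no_grad():
          val_loss = loss_fn(model(x_val), y_val).item()
      print(f"p={rate:.1f}  train_loss={loss.item():.3f}  val_loss={val_loss:.3f}")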

5) Common Mistakes

  • Leaving dropout on during inference — predictions become stochastic and noisy, changing from call to call. Always switch to eval mode (e.g. model.eval() in PyTorch).
  • Using dropout with BatchNorm naively — dropout shifts activation statistics between training and inference, which can conflict with BatchNorm's running statistics; the combination can hurt. Test carefully.
  • Applying dropout to every layer including convolutions — for CNNs, Spatial Dropout (dropping entire feature maps) works better than standard element-wise dropout; see the sketch after this list.
  • Same rate everywhere — different layers may benefit from different rates.
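
For the convolutional case, PyTorch exposes the spatial variant as nn.Dropout2d, which zeroes entire feature maps rather than individual activations. A quick illustration (shapes and rate are arbitrary):

  import torch
  import torch.nn as nn

  spatial_drop = nn.Dropout2d(p=0.2)
  spatial_drop.train()                  # spatial dropout is also training-only

  x = torch.randn(1, 16, 8, 8)          # (batch, channels, height, width)
  y = spatial_drop(x)
  # each channel is either entirely zero or scaled by 1/(1 - p)
  print([bool((y[0, c] == 0).all()) for c in range(16)])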

6) Key Takeaways

  • Dropout randomly zeros neuron activations during training, preventing co-adaptation.
  • It acts as an implicit ensemble of exponentially many sub-networks.
  • Typical rates: 0.2–0.5 for hidden layers; always turn off at inference.
  • Inverted dropout scales activations during training so inference needs no modification.