Regularization


Regularization #

Machine Learning | Difficulty: ★★★★☆ | Depth: 11 | Unlocks: 4

Preventing overfitting. L1, L2 penalties. Dropout.


Core Concepts #

Key Symbols & Notation #

• λ (lambda): regularization strength, a scalar multiplier on the penalty
• ‖w‖₂²: squared L2 norm (||w||_2^2 in ASCII)
• ‖w‖₁: L1 norm (||w||_1 in ASCII)

Essential Relationships #

Prerequisites (2) #

• Neural Networks (6 atoms)
• Convex Optimization (5 atoms)

Unlocks (1) #

• Deep Learning (lvl 5)


All Concepts (16) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Training loss going down while validation loss goes up is one of the most common “surprises” in machine learning. Regularization is the toolkit for preventing that surprise: you deliberately constrain your model, or inject noise into its training, so it can’t memorize quirks of the training set and is forced to learn patterns that generalize.

TL;DR:

Regularization modifies learning to prefer simpler, more robust solutions. The most common form is adding a penalty to the objective: minimize L(θ) + λ·Ω(θ). L2 (‖w‖₂²) shrinks weights smoothly (“weight decay”), L1 (‖w‖₁) promotes sparsity (many weights become exactly 0), and dropout randomly masks units during training to reduce co-adaptation and behaves like implicit ensembling.

What Is Regularization? #

The problem: overfitting is an optimization success but a modeling failure #

In supervised learning, you pick parameters θ to minimize an empirical (training) loss:

J_train(θ) = (1/n) ∑ᵢ ℓ(f(xᵢ; θ), yᵢ)

If the model is flexible enough (especially deep nets), J_train can often be pushed very low. But “low training loss” is not the goal. The goal is low generalization error on unseen data.

Overfitting happens when the model uses its capacity to fit idiosyncrasies: noise, rare coincidences, spurious correlations. The optimization succeeds (training loss drops), but the representation learned is brittle.

The idea: constrain complexity so learning can’t memorize #

Regularization is any technique that changes the learning problem so that the solution is biased toward simpler / smoother / more stable models.

The most canonical framing is: augment the loss with a penalty term.

J_reg(θ) = J_train(θ) + λ·Ω(θ)

This is the atomic concept to keep returning to:

Regularization = minimize loss + penalty.

It’s not merely a trick; it’s a deliberate statement of preference: among many parameter settings that fit the data similarly well, prefer the one with smaller norm, fewer nonzero parameters, or better robustness.
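The "loss + penalty" form can be sketched in a few lines. This is a minimal illustration, not a training routine: the function names and the example numbers are assumptions made up for the sketch, and `train_loss` stands in for whatever value J_train(θ) takes.

```python
from typing import Callable, Sequence


def l2_penalty(w: Sequence[float]) -> float:
    """Squared L2 norm: sum of squared weights."""
    return sum(wj * wj for wj in w)


def l1_penalty(w: Sequence[float]) -> float:
    """L1 norm: sum of absolute weights."""
    return sum(abs(wj) for wj in w)


def regularized_loss(
    train_loss: float,
    w: Sequence[float],
    lam: float,
    penalty: Callable[[Sequence[float]], float],
) -> float:
    """J_reg = J_train + lam * Omega(w)."""
    return train_loss + lam * penalty(w)


# Hypothetical values: same data fit, two different penalties.
w = [0.5, -2.0, 0.0]
print(regularized_loss(1.0, w, 0.1, l2_penalty))  # about 1.0 + 0.1 * 4.25
print(regularized_loss(1.0, w, 0.1, l1_penalty))  # about 1.0 + 0.1 * 2.5
```

Swapping the `penalty` argument changes which solutions are preferred without touching the data-fit term.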

Why regularization helps generalization #

There are several complementary lenses:

  1. Bias–variance tradeoff (classical)
  2. Stability / robustness (modern intuition)
  3. Constrained optimization equivalence

Minimizing loss + penalty is often equivalent to minimizing loss subject to a constraint.

For L2:

minimize J_train(w) + λ‖w‖₂²

is closely related to

minimize J_train(w) subject to ‖w‖₂² ≤ c

The penalty formulation is convenient for gradient methods; the constraint formulation is useful for geometric intuition.

Regularization in deep learning #

Deep networks complicate the story because “complexity” is not perfectly captured by parameter count alone. Still, regularization remains essential.

Three cornerstone tools you’ll use constantly:

| Technique | What it modifies | Primary effect |
|---|---|---|
| L2 penalty (weight decay) | Objective (adds λ‖w‖₂²) | Shrinks weights smoothly; improves stability |
| L1 penalty | Objective (adds λ‖w‖₁) | Drives many weights to 0 (sparsity) |
| Dropout | Training procedure (random masking) | Reduces co-adaptation; implicit model averaging |

In the next sections, we’ll build each one from motivation → math → behavior → practical use.

Core Mechanic 1: L2 Regularization (‖w‖₂²) and Weight Decay #

Why L2 is the default #

If you want a regularizer that:

…then L2 is usually the first choice.

In neural nets, you’ll often see it called weight decay, because the update rule literally decays weights a bit each step.

The objective #

Let w denote the vector of weights you want to penalize (often all weights, sometimes excluding biases and normalization parameters).

J(w) = J_train(w) + λ‖w‖₂²

Recall:

‖w‖₂² = ∑ⱼ wⱼ²

So the penalty grows quadratically with magnitude.

Show the work: gradient of the L2 penalty #

We’ll derive the gradient term you add during backprop.

Ω(w) = ‖w‖₂² = ∑ⱼ wⱼ²

Take partial derivative for coordinate j:

∂Ω/∂wⱼ = ∂/∂wⱼ (∑ₖ wₖ²)

= 2wⱼ

So

∇Ω(w) = 2w

Therefore the gradient of the regularized objective is:

∇J(w) = ∇J_train(w) + λ∇Ω(w)

= ∇J_train(w) + 2λw
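One way to sanity-check the result ∇Ω(w) = 2w is a central finite-difference comparison. The sketch below is illustrative only; `numerical_grad` and the step size `eps` are assumptions introduced for this check.

```python
from typing import Callable, List, Sequence


def l2_penalty(w: Sequence[float]) -> float:
    """Omega(w) = sum_j w_j^2."""
    return sum(wj * wj for wj in w)


def l2_penalty_grad(w: Sequence[float]) -> List[float]:
    """Analytic gradient from the derivation: 2w."""
    return [2.0 * wj for wj in w]


def numerical_grad(
    f: Callable[[Sequence[float]], float], w: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Central finite differences, one coordinate at a time."""
    grad = []
    for j in range(len(w)):
        up = list(w)
        down = list(w)
        up[j] += eps
        down[j] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


w = [1.0, -2.0, 0.25]
analytic = l2_penalty_grad(w)
numeric = numerical_grad(l2_penalty, w)
# The two gradients should agree up to finite-difference error.
print(max(abs(a - n) for a, n in zip(analytic, numeric)))
```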

Weight decay update (SGD) #

With learning rate η, SGD updates are:

w ← w − η(∇J_train(w) + 2λw)

Rearrange:

w ← w − η∇J_train(w) − 2ηλw

Factor w:

w ← (1 − 2ηλ)w − η∇J_train(w)

That factor (1 − 2ηλ) is the “decay”: every step pulls weights toward 0.
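The two ways of writing the update (penalty gradient added, or weights multiplied by the decay factor) can be checked against each other numerically. A minimal sketch for plain SGD, with hypothetical function names and example values:

```python
from typing import List, Sequence


def sgd_step_l2(
    w: Sequence[float], grad_train: Sequence[float], lr: float, lam: float
) -> List[float]:
    """w <- w - lr * (grad_train + 2 * lam * w)."""
    return [wj - lr * (gj + 2.0 * lam * wj) for wj, gj in zip(w, grad_train)]


def sgd_step_decay(
    w: Sequence[float], grad_train: Sequence[float], lr: float, lam: float
) -> List[float]:
    """w <- (1 - 2 * lr * lam) * w - lr * grad_train."""
    decay = 1.0 - 2.0 * lr * lam
    return [decay * wj - lr * gj for wj, gj in zip(w, grad_train)]


w = [1.0, -3.0, 0.5]
g = [0.2, -0.1, 0.0]  # stand-in for the data gradient
a = sgd_step_l2(w, g, lr=0.1, lam=0.01)
b = sgd_step_decay(w, g, lr=0.1, lam=0.01)
# Both forms give the same new weights up to floating-point rounding.
print(max(abs(x - y) for x, y in zip(a, b)))
```

With the data gradient set to zero, each step simply multiplies every weight by 1 − 2ηλ.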

Geometric intuition: why L2 gives smooth shrinkage #

Consider the constrained form:

minimize J_train(w) subject to ‖w‖₂² ≤ c

The L2 ball is a disk (in 2D) or a solid sphere (in higher dimensions). Because its boundary is round, with no corners, the constrained optimum typically lies at a boundary point where no coordinate is exactly zero. This leads to many small weights rather than a few exactly zero weights.

Bayesian view (useful intuition, not required) #

L2 regularization corresponds to a Gaussian prior on weights:

p(w) ∝ exp(−λ‖w‖₂²)

When J_train is a negative log-likelihood, minimizing J_train + λ‖w‖₂² is MAP estimation under this prior: fit the data while preferring weights near 0.

Practical notes in deep nets #

  1. What to regularize
  2. Choosing λ
  3. Weight decay vs L2 penalty in adaptive optimizers

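In plain SGD, “L2 regularization” and “weight decay” produce the same update (the decay factor derived above). In Adam/RMSProp, naïvely adding λ‖w‖₂² to the loss is not identical to decoupled weight decay, because the penalty gradient 2λw is rescaled by the optimizer’s per-parameter adaptive step size along with the data gradient.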

In practice, for Adam-family optimizers, AdamW is often preferred because the regularization effect is more predictable.
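The difference shows up even in a single scalar step. The sketch below is a hand-written illustration of the standard Adam update with an optional decoupled decay term, not any library's implementation; `adam_step`, `decoupled_wd`, and the hyperparameter defaults are assumptions chosen for the example.

```python
import math
from typing import Tuple


def adam_step(
    w: float, g: float, m: float, v: float, t: int,
    lr: float = 1e-3, b1: float = 0.9, b2: float = 0.999,
    eps: float = 1e-8, decoupled_wd: float = 0.0,
) -> Tuple[float, float, float]:
    """One scalar Adam step; decoupled_wd > 0 adds AdamW-style decay."""
    m = b1 * m + (1.0 - b1) * g
    v = b2 * v + (1.0 - b2) * g * g
    m_hat = m / (1.0 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - b2 ** t)  # bias-corrected second moment
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * decoupled_wd * w
    return w_new, m, v


w0 = 1.0
data_grad = 0.0  # isolate the effect of the regularizer

# Coupled: the L2 gradient 2*lam*w passes through Adam's normalization,
# so the first step is about lr in size regardless of lam.
for lam in (0.01, 0.1):
    w1, _, _ = adam_step(w0, data_grad + 2.0 * lam * w0, 0.0, 0.0, t=1)
    print("coupled L2, lam =", lam, "->", w1)

# Decoupled: the decay is applied directly and scales with the decay rate.
w1, _, _ = adam_step(w0, data_grad, 0.0, 0.0, t=1, decoupled_wd=0.01)
print("decoupled wd = 0.01 ->", w1)
```

In the coupled case the normalization largely cancels the scale of λ on the first step, which is one way to see why the effect of an L2 penalty is harder to predict under adaptive optimizers.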

What L2 does to the learned function #

Even though L2 acts on parameters, its functional effect is:

Keep one mental picture: L2 spreads influence across many features with small coefficients, rather than betting everything on a few huge weights.

Core Mechanic 2: L1 Regularization (‖w‖₁) and Sparsity #

Why L1 is different from L2 #

L1 regularization is used when you want:

or when you suspect many true effects are actually irrelevant.

Where L2 shrinks everything smoothly, L1 tends to create exact zeros.

The objective #

J(w) = J_train(w) + λ‖w‖₁

where

‖w‖₁ = ∑ⱼ |wⱼ|

Why L1 induces sparsity (geometric intuition) #

Consider the constrained form:

minimize J_train(w) subject to ‖w‖₁ ≤ c

In 2D, the L1 ball is a diamond (a rotated square). Its corners lie on the coordinate axes.

When a smooth loss contour first touches this diamond, it often touches at a corner.

Touching at a corner means one coordinate is exactly 0.

This “corners encourage zeros” intuition is the key:

Show the work: (sub)gradient of L1 #

The absolute value is not differentiable at 0, so we use a subgradient.

For a single coordinate:

d|w|/dw =

+1 if w > 0

−1 if w < 0

any value in [−1, +1] if w = 0

So a subgradient of ‖w‖₁ is:

∂‖w‖₁/∂wⱼ = sign(wⱼ) (with sign(0) ∈ [−1, +1])

Thus a subgradient update looks like:

w ← w − η(∇J_train(w) + λ·sign(w))
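A minimal sketch of this subgradient step follows. It picks sign(0) = 0, which is one valid choice from [−1, +1]; the function names and numbers are assumptions for illustration.

```python
from typing import List, Sequence


def sign(x: float) -> int:
    """Returns +1, -1, or 0; choosing 0 at x == 0 is a valid subgradient."""
    return (x > 0) - (x < 0)


def l1_subgradient_step(
    w: Sequence[float], grad_train: Sequence[float], lr: float, lam: float
) -> List[float]:
    """w <- w - lr * (grad_train + lam * sign(w))."""
    return [wj - lr * (gj + lam * sign(wj)) for wj, gj in zip(w, grad_train)]


# With a zero data gradient, the pull has fixed size lr * lam per step.
w = [0.05, -2.0, 0.0]
g = [0.0, 0.0, 0.0]
w = l1_subgradient_step(w, g, lr=0.1, lam=1.0)
print(w)  # the small weight steps across zero instead of stopping at it
w = l1_subgradient_step(w, g, lr=0.1, lam=1.0)
print(w)  # and steps back on the next update
```

Note that a plain subgradient step with a fixed learning rate can step across zero rather than land on it, so in practice exact zeros usually come from thresholding-style updates rather than from this raw rule.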

Soft-thresholding (important behavior) #

For some losses (notably squared loss in linear regression), L1 leads to a closed-form coordinate update called soft-thresholding.

Even if you don’t use the closed form in deep learning, the behavior is worth understanding: L1 applies a constant pull toward 0, not proportional to w.

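Compare pulls:

• L2: the pull is 2λw, proportional to the weight, so it fades as w → 0
• L1: the pull is λ·sign(w), with constant magnitude λ until the weight hits 0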

That’s why small weights get “snapped” to zero under L1.
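The scalar soft-thresholding operator makes the "snap to zero" behavior concrete. The sketch assumes the standard one-dimensional problems: minimize ½(w − z)² + t·|w| for L1, and ½(w − z)² + λw² for L2. The names `soft_threshold` and `ridge_shrink`, and the symbols z and t, are introduced here for illustration.

```python
def soft_threshold(z: float, t: float) -> float:
    """Minimizer of 0.5 * (w - z)**2 + t * abs(w) over w."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0  # anything within [-t, t] is set exactly to zero


def ridge_shrink(z: float, lam: float) -> float:
    """Minimizer of 0.5 * (w - z)**2 + lam * w**2 over w."""
    return z / (1.0 + 2.0 * lam)


for z in (0.05, 0.5, -0.5):
    print(z, "L1 ->", soft_threshold(z, 0.1), "| L2 ->", ridge_shrink(z, 0.1))
```

Under L1 the small input is mapped to exactly zero and large inputs are shifted by a constant t; under L2 every input is scaled by the same factor and never reaches zero unless it started there.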

L1 in modern deep learning practice #

Pure L1 on all network weights is less common than L2/weight decay, because:

But L1 still matters a lot in:

Elastic net (bridge between L1 and L2) #

Sometimes you want sparsity and stability:

J(w) = J_train(w) + λ₁‖w‖₁ + λ₂‖w‖₂²

L1 alone can be unstable when features are strongly correlated; L2 helps group correlated features and improves conditioning.

Summary comparison: L1 vs L2 #

| Property | L1 (‖w‖₁) | L2 (‖w‖₂²) |
|---|---|---|
| Pull toward 0 | constant magnitude (λ) | proportional to size (2λw) |
| Exact zeros? | yes, often | rarely |
| Optimization smoothness | nonsmooth at 0 | smooth |
| Typical effect | sparsity / feature selection | shrinkage / stability |
| Common in deep nets | less common (unstructured) | very common |

If L2 is “make everything smaller,” L1 is “make many things vanish.”

Core Mechanic 3: Dropout as Stochastic Regularization and Implicit Ensembling #

Why dropout exists #

Neural networks can overfit by forming co-adaptations: a hidden unit becomes useful only because some other unit reliably provides a complementary signal. This can produce fragile internal representations.

Dropout regularizes by injecting structured noise during training:

The motivation is simple: if any unit might disappear, the network must distribute information and learn redundant, robust features.

The mechanism (inverted dropout) #

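Let h be a vector of activations at some layer. Sample a mask m with independent Bernoulli entries, where p is the keep probability (each unit is kept with probability p and dropped with probability 1 − p):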

mⱼ ∼ Bernoulli(p)

Apply the mask elementwise (⊙) and rescale by 1/p:

h̃ = (m ⊙ h) / p

Show the work:

For one coordinate:

h̃ⱼ = (mⱼ hⱼ)/p

E[h̃ⱼ] = E[mⱼ]·hⱼ/p

= p·hⱼ/p

= hⱼ

So at test time you typically do nothing special (no dropout, no scaling), because the scaling was already handled during training.
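A minimal pure-Python sketch of inverted dropout, with a Monte Carlo check of the expectation. The function name, the `training` flag, and the sample count are assumptions for illustration; real frameworks provide their own dropout layers.

```python
import random
from typing import List, Sequence


def inverted_dropout(
    h: Sequence[float], keep_prob: float, rng: random.Random, training: bool = True
) -> List[float]:
    """Keep each unit with probability keep_prob and rescale kept units by 1/keep_prob."""
    if not training:
        return list(h)  # inference: no masking, no scaling
    return [hj / keep_prob if rng.random() < keep_prob else 0.0 for hj in h]


rng = random.Random(0)
h = [3.0]
samples = [inverted_dropout(h, 0.6, rng)[0] for _ in range(20000)]
print(sum(samples) / len(samples))         # sample mean, should be close to 3
print(inverted_dropout(h, 0.6, rng, training=False))  # unchanged at inference
```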

Dropout as implicit model averaging #

Each dropout mask corresponds to a sub-network. Training with dropout is like training a huge ensemble of thinned networks that share weights.

You do not explicitly average predictions over all masks at test time (that would be expensive). Instead, using the full network without dropout approximates that ensemble average.
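For a single linear unit the ensemble average can be computed exactly by enumerating every mask, which makes the approximation claim easy to inspect. The sketch below assumes a toy layer with three inputs; the weights, activations, and function names are made up for the example. With nonlinearities after the dropped layer, the match is only approximate.

```python
from itertools import product
from typing import Sequence


def linear_output(w: Sequence[float], h: Sequence[float]) -> float:
    """A single linear unit: sum_j w_j * h_j."""
    return sum(wj * hj for wj, hj in zip(w, h))


def ensemble_average(w: Sequence[float], h: Sequence[float], keep_prob: float) -> float:
    """Probability-weighted average of the unit's output over all 2^n dropout masks."""
    total = 0.0
    for mask in product([0, 1], repeat=len(h)):
        prob = 1.0
        for mj in mask:
            prob *= keep_prob if mj else (1.0 - keep_prob)
        thinned = [mj * hj / keep_prob for mj, hj in zip(mask, h)]
        total += prob * linear_output(w, thinned)
    return total


w = [0.5, -1.0, 2.0]
h = [1.0, 2.0, 3.0]
print(linear_output(w, h))          # full network, no dropout
print(ensemble_average(w, h, 0.8))  # average over all 8 thinned networks
```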

This is why dropout often improves generalization even when it increases training loss: it optimizes performance across many perturbations, not one fixed computation graph.

Where dropout works well (and where it doesn’t) #

Dropout is most effective when:

It can be less helpful (or need careful tuning) when:

Modern practice often uses:

Dropout vs L1/L2: what’s being “penalized”? #

Dropout usually isn’t written as loss + λ·Ω(θ) explicitly. It’s a procedural regularizer.

Still, conceptually it:

A useful comparison:

| Method | How it regularizes | Typical symptom it fixes |
|---|---|---|
| L2 | discourages large weights | overly sharp decision boundaries |
| L1 | enforces sparsity | too many irrelevant features |
| Dropout | disrupts co-adaptation with stochastic masks | brittle internal representations |

Practical tuning parameters #

A common approach is to start with a small drop probability (e.g., q = 1 − p = 0.1–0.3 in dense layers, where p is the keep probability) and increase it only if overfitting persists after using weight decay and data augmentation.

Application/Connection: Using Regularization in Real Training Loops #

Regularization is a design choice, not an afterthought #

A practical workflow is:

  1. Pick an architecture capable of fitting the task.

  2. Use monitoring: training vs validation curves.

  3. Add regularization to address observed gaps.

If training and validation losses are both high → underfitting: reduce regularization or increase capacity.

If training loss is low but validation loss is high → overfitting: increase regularization.
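The two rules above can be written as a small helper for reading loss curves. What counts as "high" or as a "gap" is task-dependent, so `acceptable_loss` and `gap_tolerance` are hypothetical thresholds you would have to choose yourself; the function is a sketch, not a standard diagnostic.

```python
def diagnose(
    train_loss: float, val_loss: float, acceptable_loss: float, gap_tolerance: float
) -> str:
    """Classify a (train, validation) loss pair using two user-chosen thresholds."""
    if train_loss > acceptable_loss and val_loss > acceptable_loss:
        # Both losses are high: reduce regularization or increase capacity.
        return "underfitting"
    if val_loss - train_loss > gap_tolerance:
        # Training fits well but validation lags: increase regularization.
        return "overfitting"
    return "ok"


print(diagnose(0.9, 1.0, acceptable_loss=0.5, gap_tolerance=0.2))
print(diagnose(0.1, 0.8, acceptable_loss=0.5, gap_tolerance=0.2))
print(diagnose(0.3, 0.35, acceptable_loss=0.5, gap_tolerance=0.2))
```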

How to choose among L2, L1, and dropout #

Think in terms of what kind of “simplicity” you want:

Often you combine them:

Interactions with optimization #

Regularization changes gradients and thus training dynamics.

Interactions with data augmentation and early stopping #

Regularization is part of a larger family of generalization controls:

You’ll often see these combined:

| Tool | What it controls | Notes |
|---|---|---|
| Weight decay | parameter magnitude | cheap, widely applicable |
| Dropout | internal co-adaptation | helps more in dense parts |
| Data augmentation | invariance & sample diversity | often the biggest win in vision |
| Early stopping | effective capacity via training time | requires validation monitoring |

A concrete mental model for λ #

λ is not “a little extra term.” It is a knob that sets the relative scale between fitting the data and shrinking complexity.

If the loss term is scaled (e.g., average vs sum over batch), the same numeric λ can behave differently.
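A one-parameter ridge regression shows the scaling effect in closed form. The sketch assumes the model y ≈ w·x with squared error and the penalty λw²; the data and function names are made up for illustration.

```python
from typing import Sequence


def ridge_1d_sum(xs: Sequence[float], ys: Sequence[float], lam: float) -> float:
    """argmin_w  sum_i (y_i - w * x_i)**2 + lam * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def ridge_1d_mean(xs: Sequence[float], ys: Sequence[float], lam: float) -> float:
    """argmin_w  (1/n) * sum_i (y_i - w * x_i)**2 + lam * w**2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with w = 2 and no noise
lam = 1.0
print(ridge_1d_sum(xs, ys, lam))   # summed loss: mild shrinkage
print(ridge_1d_mean(xs, ys, lam))  # averaged loss: same lam shrinks more
```

With the averaged loss, a given λ acts like n·λ would under the summed loss, so a value tuned under one convention does not transfer directly to the other.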

So when you tune λ, do it in context:

Regularization and deep learning readiness #

Deep learning systems are powerful partly because they can fit almost anything—so they will happily fit the wrong thing unless you apply constraints.

Understanding regularization prepares you for:

This is why regularization is a core “unlock” for the broader Deep Learning node: it turns raw capacity into reliable performance.

Worked Examples (3) #

L2 Regularization Changes the Gradient (and Causes Weight Decay) #

Assume a model with parameter vector w and training objective J_train(w). We define the regularized objective:

J(w) = J_train(w) + λ‖w‖₂²

We want to derive the SGD update and interpret it as weight decay.

  1. Start with the penalty term:

    Ω(w) = ‖w‖₂² = ∑ⱼ wⱼ²

  2. Differentiate coordinate-wise:

    ∂Ω/∂wⱼ = 2wⱼ

    So ∇Ω(w) = 2w

  3. Differentiate the full objective:

    ∇J(w) = ∇J_train(w) + λ∇Ω(w)

    = ∇J_train(w) + 2λw

  4. Write the SGD step with learning rate η:

    w ← w − η(∇J_train(w) + 2λw)

  5. Rearrange to expose decay:

    w ← w − η∇J_train(w) − 2ηλw

    w ← (1 − 2ηλ)w − η∇J_train(w)

Insight: The factor (1 − 2ηλ) multiplies the current weights every step, shrinking them toward 0 even if the data gradient were zero. This is why L2 is called weight decay: it continuously damps parameter magnitude, which tends to reduce variance and improve generalization.

Why L1 Produces Exact Zeros (Constant Pull Toward 0) #

Consider the regularized objective:

J(w) = J_train(w) + λ‖w‖₁

We’ll examine the (sub)gradient contributed by the L1 term and compare it to L2.

  1. Write the L1 norm:

    ‖w‖₁ = ∑ⱼ |wⱼ|

  2. For one coordinate wⱼ, the derivative of |wⱼ| is not defined at 0, so use a subgradient:

    ∂|wⱼ|/∂wⱼ =

    +1 if wⱼ > 0

    −1 if wⱼ < 0

    any value in [−1, +1] if wⱼ = 0

  3. Thus a valid subgradient of the full L1 norm is:

    ∂‖w‖₁/∂wⱼ = sign(wⱼ), where sign(0) ∈ [−1, +1]

  4. Gradient-style update (conceptually):

    w ← w − η(∇J_train(w) + λ·sign(w))

  5. Compare with L2’s contribution (2λw):

    • L2 pull shrinks as wⱼ → 0
    • L1 pull stays at constant magnitude λ until the weight hits 0

Insight: Because L1 applies a constant-magnitude force toward 0, small weights don’t get “protected” the way they do under L2. They are driven to exactly 0, yielding sparsity (feature selection).

Dropout Keeps the Expected Activation the Same (Inverted Dropout) #

Let h be a layer’s activation vector during training. We apply inverted dropout with keep probability p. We want to show E[h̃] = h.

  1. Sample a Bernoulli mask m with independent entries:

    mⱼ ∼ Bernoulli(p)

  2. Apply inverted dropout:

    h̃ = (m ⊙ h) / p

  3. Take expectation coordinate-wise:

    h̃ⱼ = (mⱼ hⱼ)/p

    E[h̃ⱼ] = E[mⱼ]·hⱼ/p

  4. Use E[mⱼ] = p for a Bernoulli(p):

    E[h̃ⱼ] = p·hⱼ/p = hⱼ

  5. Therefore E[h̃] = h

Insight: Inverted dropout preserves the mean activation during training, so at inference you can turn dropout off without additional scaling. The regularization comes from the randomness (variance), not from a shifted mean.

Key Takeaways #

Common Mistakes #

Practice #

easy

You are training with SGD and L2 regularization. Suppose η = 0.1 and λ = 0.01. Ignoring the data gradient (assume ∇J_train(w) = 0 for this step), what happens to w after one update? Write the multiplicative factor applied to w.

Hint: Use w ← (1 − 2ηλ)w when the data gradient is zero.

Show solution

With L2: w ← w − η(2λw) = (1 − 2ηλ)w.

Compute 2ηλ = 2·0.1·0.01 = 0.002.

So w is multiplied by (1 − 0.002) = 0.998 after one step.

medium

Explain, using the constrained-optimization geometry, why L1 regularization tends to produce sparse solutions while L2 does not. Focus on the shape of the constraint sets in 2D and where smooth loss contours typically touch them.

Hint: Compare the L1 ball (diamond) vs the L2 ball (circle) and think about corners vs smooth boundaries.

Show solution

In 2D, the constraint ‖w‖₁ ≤ c is a diamond with sharp corners on the axes, while ‖w‖₂² ≤ c is a circle with a smooth boundary. A smooth loss contour (e.g., an ellipse) expanded outward from its minimum will typically first touch the feasible set at a point of tangency. Because the L1 feasible set has corners, tangency often occurs at a corner, which lies on an axis, implying one coordinate is exactly 0 (sparsity). The L2 feasible set is round, so tangency usually happens at a point with both coordinates nonzero, producing shrinkage but not exact zeros.

medium

You apply inverted dropout to an activation h with keep probability p. Let h = 3 (a scalar activation). Compute the distribution of h̃ and verify E[h̃] = 3 for p = 0.6.

Hint: With probability p you keep the unit and scale by 1/p; otherwise it becomes 0.

Show solution

Mask m ∼ Bernoulli(p) with p = 0.6. Inverted dropout gives h̃ = (m·h)/p.

So:

• h̃ = 3/0.6 = 5 with probability 0.6 (unit kept)
• h̃ = 0 with probability 0.4 (unit dropped)

Expectation:

E[h̃] = 0.6·5 + 0.4·0 = 3

So E[h̃] = 3, matching the original activation.

Connections #

Next steps and related nodes:

Related concepts you may want nearby in the tech tree:

Quality: A (4.2/5)
