←Back to Tech Tree
inventorycoverage
Numerical Stability and Conditioning #
AlgorithmsDifficulty: ★★★★☆Depth: 1Unlocks: 4
Principles for avoiding floating-point issues (overflow/underflow, catastrophic cancellation) and understanding problem conditioning; includes common fixes like log-sum-exp, normalization, and careful initialization. These concerns are crucial for reliable training of deep models.
Interactive Visualization #
⏮◀◀▶▶STEP0.25x1xZOOM
t=0s
Core Concepts #
- -Finite-precision arithmetic: rounding error and limited dynamic range causing overflow/underflow
Key Symbols & Notation #
epsilon_machine (machine epsilon, unit roundoff)
Essential Relationships #
- -Ill-conditioned problems (high condition number) amplify rounding errors and catastrophic cancellation into large relative errors
- -Algebraic reformulation and scaling (e.g., log-domain, normalization, careful scaling/initialization) mitigate overflow/underflow and reduce cancellation effects
Prerequisites (2) #
Activation Functions6 atomsComputational Graphs6 atoms
Unlocks (1) #
Deep Learninglvl 5
Advanced Learning Details
Graph Position #
16
Depth Cost
4
Fan-Out (ROI)
2
Bottleneck Score
1
Chain Length
Cognitive Load #
4
Atomic Elements
43
Total Elements
L3
Percentile Level
L2
Atomic Level
All Concepts (21) #
- Floating-point representation and finite precision: numbers are represented with limited mantissa and exponent leading to quantization of real values
- Machine epsilon (unit roundoff): the basic bound on relative rounding error for a floating-point operation
- Rounding error: error introduced by replacing real arithmetic with finite-precision arithmetic
- Absolute error vs relative error: two ways to quantify discrepancy between true and computed values
- Overflow: computed magnitude too large to represent in the floating-point format
- Underflow: computed magnitude too small to represent in the normalized floating-point format
- Subnormal (denormal) numbers and flush-to-zero behaviors: representation/handling of tiny magnitudes and their performance/accuracy implications
- Catastrophic cancellation (loss of significance): large relative error produced when subtracting nearly equal floating-point numbers
- Condition number of a problem (κ or cond): a scalar quantifying how sensitive the output is to small changes in input
- Well-conditioned vs ill-conditioned problems: qualitative labels based on condition number magnitude
- Numerical stability of algorithms: whether an algorithm amplifies or controls rounding errors during computation
- Forward error vs backward error: two ways to measure algorithmic error (difference in outputs vs input perturbation that explains output)
- Backward/forward stability definitions: backward-stable algorithm yields exact solution to a slightly perturbed problem; forward-stable implies small forward error
- Loss of associativity in floating-point arithmetic and its consequences (order of operations matters)
- Log-domain computations (e.g., log-sum-exp) as a general technique to avoid overflow/underflow for exponentials/products
- Numerically stable implementations of common primitives (e.g., stable softmax/log-sum-exp)
- Compensated summation (e.g., Kahan summation): an algorithmic technique to reduce accumulated rounding error in sums
- Normalization and scaling as numerical fixes: rescaling inputs/weights/activations to keep numeric ranges safe
- Careful initialization to control numeric ranges of activations/gradients (beyond activation gradient properties)
- Gradient clipping as a practical fix to prevent numeric explosion of gradients in training
- Algorithmic vs problem-level remedies: choosing numerically stable algorithms vs reformulating the problem to improve conditioning
Teaching Strategy #
Self-serve tutorial - low prerequisites, straightforward concepts.
Training deep models is often less about “the math is wrong” and more about “the math was computed in a fragile way.” Numerical stability is the craft of making mathematically correct formulas behave reliably under finite-precision floating-point arithmetic—so your softmax doesn’t overflow, your variance doesn’t go negative, and your gradients don’t silently drift due to reduction order.
TL;DR:
Finite-precision arithmetic introduces rounding and a limited dynamic range. Numerical stability is about choosing algorithms that control rounding error and avoid overflow/underflow; conditioning is about how sensitive the problem itself is to small input perturbations. In deep learning, common fixes include log-sum-exp for softmax/log-likelihoods, stable variance/normalization formulas, careful reduction (pairwise/Kahan), and scale-aware initialization/normalization—especially in float16/float32 training.
What Is Numerical Stability and Conditioning? #
You can write down a correct mathematical expression and still get nonsense on a real computer.
- •A softmax that should produce probabilities returns NaN because e1000e^{1000}e1000 overflowed.
- •A variance that should be nonnegative becomes slightly negative, and then a square root yields NaN.
- •The same sum of numbers gives different answers depending on the order you add them.
These aren’t “bugs” in deep learning libraries—they’re consequences of finite-precision arithmetic.
Numerical analysis separates two ideas:
- 1)Conditioning (problem sensitivity): if we slightly perturb the input, how much can the true answer change?
- 2)Stability (algorithm robustness): given rounding errors and finite precision, does our computed answer stay close to the true answer of a nearby input?
A well-conditioned problem can be solved poorly by an unstable algorithm. A well-designed stable algorithm cannot fully rescue an ill-conditioned problem—but it can avoid making things worse.
Floating-point reality: rounding + limited dynamic range #
Computers represent real numbers in floating-point format (float16/float32/float64). Two consequences matter constantly:
- •Rounding error: most real numbers cannot be represented exactly, so operations are rounded.
- •Limited dynamic range: numbers that are too large overflow to ∞; too small underflow toward 0 (sometimes with subnormals).
A standard model of floating-point arithmetic is:
fl(a∘b)=(a∘b)(1+δ),∣δ∣≤u\mathrm{fl}(a \circ b) = (a \circ b)(1 + \delta), \quad |\delta| \le ufl(a∘b)=(a∘b)(1+δ),∣δ∣≤u
where:
- •∘\circ∘ is +++, −-−, ×\times×, or ///,
- •uuu is the unit roundoff, often roughly half of machine epsilon.
We’ll use machine epsilon (often denoted εmachine\varepsilon_{\text{machine}}εmachine) as the “scale” of relative rounding error.
Typical values (approximate):
| type | significand bits | εmachine\varepsilon_{\text{machine}}εmachine (≈) | dynamic range (rough) |
|---|
| float16 | 11 (incl. hidden bit) | 9.77e-4 | ~1e−5 to ~6e4 |
| bfloat16 | 8 (incl. hidden bit) | 7.81e-3 | ~1e−38 to ~3e38 |
| float32 | 24 | 1.19e-7 | ~1e−38 to ~3e38 |
| float64 | 53 | 2.22e-16 | ~1e−308 to ~1e308 |
Two important takeaways:
- •float16 has much worse precision and a much smaller max value than float32.
- •bfloat16 has low precision like float16 but a large exponent range like float32.
Conditioning: “how sensitive is the math problem?” #
Consider computing f(x)f(x)f(x). If a tiny perturbation Δx\Delta xΔx can cause a large change in f(x)f(x)f(x), the problem is ill-conditioned near xxx.
A classic measure is the relative condition number:
κf(x)≈∣xf′(x)f(x)∣\kappa_f(x) \approx \left|\frac{x f'(x)}{f(x)}\right|κf(x)≈f(x)xf′(x)
Interpretation:
- •If κf(x)\kappa_f(x)κf(x) is large, then a small relative error in xxx can produce a large relative error in f(x)f(x)f(x).
Example intuition:
- •f(x)=xf(x) = xf(x)=x has κ≈1\kappa \approx 1κ≈1 (well-conditioned).
- •f(x)=1/xf(x) = 1/xf(x)=1/x near x=0x=0x=0 is ill-conditioned (tiny changes in xxx blow up the output).
In deep learning, conditioning shows up as:
- •nearly singular matrices (e.g., Hessians, Jacobians)
- •poorly scaled features/activations
- •exploding/vanishing gradients
Stability: “does the algorithm amplify rounding errors?” #
An algorithm is (backward) stable if the computed result equals the exact result for a slightly perturbed input:
y^=f(x+Δx),with ∥Δx∥ small (on the order of rounding).\hat{y} = f(x + \Delta x), \quad \text{with } |\Delta x| \text{ small (on the order of rounding).}y^=f(x+Δx),with ∥Δx∥ small (on the order of rounding).
This is the gold standard: rounding happens anyway; stability means your algorithm behaves like it just saw a tiny bit of noise.
Key mental model:
- •Conditioning is about the landscape you’re standing on.
- •Stability is about whether you’re walking carefully or running with untied shoelaces.
Supporting static diagram (conceptual): float spacing vs magnitude #
A core stability intuition is that floats are denser near 0 and sparser at large magnitudes.
value magnitude: 0 ---- 1 ---- 10 ---- 1e3 ---- 1e6 ---- 1e9
float spacing: tiny small bigger larger huge enormous
At large magnitudes, the next representable number can be far away, so adding a small number can do nothing (it rounds away). This drives reduction-order issues and “small updates disappear” problems.
Core Mechanic 1: Failure Modes in Finite-Precision Arithmetic #
This section builds a concrete catalog of failure modes you will repeatedly see in ML code. The goal is to recognize them quickly, then reach for the right stable pattern.
1) Overflow and underflow #
What it is #
- •Overflow: a value exceeds the largest representable finite float → becomes ∞.
- •Underflow: a value is too small in magnitude → becomes 0 (or subnormal, then 0).
Why ML triggers it #
Exponentials, products over many terms, and repeated multiplications are common.
- •Softmax uses ezie^{z_i}ezi.
- •Likelihoods multiply probabilities.
- •Deep nets multiply Jacobians across layers.
Classic example: softmax overflow #
Naive softmax:
softmax(z)i=ezi∑jezj\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}softmax(z)i=∑jezjezi
If some ziz_izi is large (say 1000 in float32), ezie^{z_i}ezi overflows to ∞, and you get ∞/∞ → NaN.
2) Catastrophic cancellation #
What it is #
Subtracting two nearly equal numbers destroys significant digits.
If a≈ba \approx ba≈b, then a−ba - ba−b is small, but the floating-point representation of aaa and bbb each has rounding error. The subtraction exposes that error.
Why ML triggers it #
- •variance computations
- •normalization
- •loss differences
- •some distance computations
Example: variance via “E[x²] − (E[x])²” #
The identity is correct:
Var(x)=E[x2]−(E[x])2\mathrm{Var}(x) = \mathbb{E}[x^2] - (\mathbb{E}[x])^2Var(x)=E[x2]−(E[x])2
But numerically, if xxx has large mean and small variance, both terms are large and close, so subtraction cancels digits. You may get a small negative result.
3) Reduction-order (summation) error #
What it is #
Floating-point addition is not associative:
(a+b)+c≠a+(b+c)(a + b) + c \ne a + (b + c)(a+b)+c=a+(b+c)
because each intermediate sum is rounded.
Why ML triggers it #
Training is full of reductions:
- •summing losses over a batch
- •summing gradients
- •dot products and matrix multiplies
On GPU/TPU, parallel reductions change order depending on hardware and scheduling. That can change results—usually slightly, sometimes significantly.
A helpful intuition #
If you add a tiny number to a huge number, the tiny number may not change the sum at all (it gets rounded away). So order matters: adding small numbers together first can preserve more information.
4) Division by tiny numbers / unstable normalization #
Normalization often has the form:
x^=x−μσ\hat{x} = \frac{x - \mu}{\sigma}x^=σx−μ
If σ\sigmaσ is near 0, this blows up. Many algorithms add an ε\varepsilonε term:
x^=x−μσ2+ε\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}}x^=σ2+εx−μ
The ε\varepsilonε here is a stability constant, not machine epsilon (though it’s motivated by it). In float16, ε\varepsilonε often must be larger.
5) Gradients that vanish due to underflow or saturation #
Even if your forward pass is stable, gradients can underflow or become 0 in low precision.
Examples:
- •sigmoid/tanh saturation: for large |x|, derivatives are near 0.
- •multiplying many small factors across layers.
This interacts with floating-point: values smaller than ~1e−8 may become subnormals or 0 in float16.
6) Infs/NaNs that propagate through a computational graph #
Given prerequisites about computational graphs, here’s the key operational point:
- •One NaN in forward → many NaNs in backward.
- •One overflow in an activation (e.g., exp) → NaNs in loss → NaN gradients.
This is why stable local formulas (softmax, log-softmax, normalization) matter so much: they sit early in the graph and affect everything.
Interactive canvas specification (explicitly maps to 3 failure modes) #
To address these issues systematically, your interactive canvas should have three mini-labs, each with:
- •a toggle for precision: float16 / float32 / float64
- •side-by-side traces: naive vs stable
- •a plot of intermediate values (including detection of inf/nan)
Mini-lab A: Softmax overflow #
Controls:
- •slider: max logit magnitude (e.g., from 0 to 2000)
- •toggle: stable shift (subtract max) on/off
- •precision selector
Display:
- •naive exp(logits), sum exp, softmax output
- •stable exp(logits − max), sum exp, softmax output
- •highlight when overflow happens
Mini-lab B: Cancellation in variance #
Controls:
- •slider: mean magnitude μ (e.g., 0 to 1e6)
- •slider: true std σ (e.g., 1e−2 to 10)
- •toggle: one-pass (E[x²]−E[x]²) vs Welford
- •precision selector
Display:
- •computed variance (naive vs stable)
- •difference from ground truth
- •highlight negative variance events
Mini-lab C: Reduction-order error #
Controls:
- •choose data distribution: (many small + few large), alternating signs, random
- •toggle: left-to-right sum vs pairwise sum vs Kahan
- •precision selector
Display:
- •final sum
- •absolute/relative error vs high-precision reference
- •step-by-step accumulation trace to show where small terms get lost
This canvas ties directly to: overflow in softmax, cancellation in variance, and reduction-order error—the three most common “why did my model blow up?” numerics in practice.
Core Mechanic 2: Stable Patterns and Derivations You’ll Reuse Constantly #
Now we turn the failure modes into stable, reusable techniques.
1) Log-sum-exp (LSE): the workhorse for stable exponentials #
Motivation #
You want to compute:
log∑i=1nezi\log \sum_{i=1}^n e^{z_i}logi=1∑nezi
Directly computing ezie^{z_i}ezi is risky if any ziz_izi is large.
Trick: factor out the maximum #
Let m=maxizim = \max_i z_im=maxizi. Then:
∑iezi=∑iemezi−m=em∑iezi−m\sum_i e^{z_i} = \sum_i e^{m} e^{z_i - m} = e^{m} \sum_i e^{z_i - m}i∑ezi=i∑emezi−m=emi∑ezi−m
Taking log:
log∑iezi=m+log∑iezi−m\log \sum_i e^{z_i} = m + \log \sum_i e^{z_i - m}logi∑ezi=m+logi∑ezi−m
Key benefits:
- •all exponents zi−m≤0z_i - m \le 0zi−m≤0, so ezi−m∈(0,1]e^{z_i - m} \in (0, 1]ezi−m∈(0,1]
- •prevents overflow
- •reduces risk of underflow dominating the sum (though very negative values may still underflow safely to 0)
Connection to softmax and log-softmax #
Stable softmax:
softmax(z)i=ezi−m∑jezj−m\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}}softmax(z)i=∑jezj−mezi−m
Stable log-softmax:
logsoftmax(z)i=zi−(m+log∑jezj−m)\log \mathrm{softmax}(\mathbf{z})_i = z_i - \left(m + \log \sum_j e^{z_j - m}\right)logsoftmax(z)i=zi−(m+logj∑ezj−m)
In practice you compute log-softmax directly (stable) and then NLL loss, rather than softmax then log.
2) Stable variance: Welford’s algorithm #
Motivation #
Variance is everywhere: batch norm, layer norm, feature scaling, monitoring.
The naive formula is prone to cancellation:
Var=E[x2]−(E[x])2\mathrm{Var} = \mathbb{E}[x^2] - (\mathbb{E}[x])^2Var=E[x2]−(E[x])2
Welford’s one-pass stable update #
Given samples x1,x2,…,xnx_1, x_2, \dots, x_nx1,x2,…,xn, maintain:
- •running mean μk\mu_kμk
- •running sum of squares of deviations M2kM2_kM2k
Initialize:
- •μ1=x1\mu_1 = x_1μ1=x1
- •M21=0M2_1 = 0M21=0
For k≥2k \ge 2k≥2:
δ=xk−μk−1\delta = x_k - \mu_{k-1}δ=xk−μk−1
μk=μk−1+δk\mu_k = \mu_{k-1} + \frac{\delta}{k}μk=μk−1+kδ
δ2=xk−μk\delta_2 = x_k - \mu_kδ2=xk−μk
M2k=M2k−1+δ⋅δ2M2_k = M2_{k-1} + \delta \cdot \delta_2M2k=M2k−1+δ⋅δ2
Then:
Var=M2nn(population)orM2nn−1(sample)\mathrm{Var} = \frac{M2_n}{n} \quad (\text{population}) \qquad \text{or} \qquad \frac{M2_n}{n-1} \quad (\text{sample})Var=nM2n(population)orn−1M2n(sample)
Why it helps #
It avoids subtracting two nearly equal large numbers. It tracks deviations relative to an evolving mean, keeping intermediate quantities closer to the scale of the variance.
3) Stable summation: pairwise and Kahan #
Motivation #
Summing many numbers is ubiquitous: losses, gradient accumulation, dot products.
Pairwise summation #
Instead of summing linearly, sum in a tree (balanced) structure. This tends to reduce error growth from O(nu)O(nu)O(nu) to closer to O(logn⋅u)O(\log n \cdot u)O(logn⋅u) in many practical settings.
Kahan compensated summation #
Maintain a compensation term ccc for lost low-order bits.
Algorithm (for summing values xix_ixi):
- •s=0s = 0s=0, c=0c = 0c=0
- •For each xxx:
- •y=x−cy = x - cy=x−c
- •t=s+yt = s + yt=s+y
- •c=(t−s)−yc = (t - s) - yc=(t−s)−y
- •s=ts = ts=t
Interpretation: ccc estimates what was lost to rounding.
Practical note for ML #
- •Many BLAS/GEMM kernels already use strategies to reduce error, but you still see reduction issues in custom kernels, metrics, and loss aggregation.
- •Mixed precision often accumulates in float32 even if operands are float16 (a major stability improvement).
Normalization isn’t only about optimization speed; it’s also about preventing overflow/underflow and controlling conditioning.
Feature scaling #
If one feature has magnitude ~1e6 and another ~1e−3, dot products and gradients become ill-scaled. Scaling features can reduce condition numbers and improve numerical behavior.
Activation scaling and initialization #
Careful initialization (e.g., Xavier/He) aims to keep variance of activations/gradients roughly stable across depth. That’s partly a conditioning story (keeping Jacobian singular values reasonable) and partly a numerical range story (avoid overflow/underflow).
5) Conditioning vs stability: a quadrant you should internalize #
A useful mental diagram:
| Stable algorithm | Unstable algorithm |
|---|
| Well-conditioned problem | Great: accurate results | Often fixable by changing algorithm |
| Ill-conditioned problem | Best you can do: results sensitive to noise | Worst case: meaningless results |
Deep learning often has locally ill-conditioned regions (e.g., during early training or near sharp minima). Your job is to avoid compounding that with unstable computations.
6) Mixed precision stability patterns #
Modern training often uses float16/bfloat16 for speed, with float32 master weights or accumulators.
Common stability patterns:
- •Loss scaling: multiply loss by a scale factor to prevent gradient underflow in float16; then unscale before the optimizer step.
- •FP32 accumulation: accumulate sums/products in float32 (especially reductions).
- •Keep sensitive ops in FP32: softmax/log-softmax, normalization stats, some reductions.
These are not “cheats”—they’re explicit numerics engineering.
Applications/Connections: Reliable Deep Learning Training in Practice #
This section connects the numerics toolbox to everyday deep learning components.
1) Softmax + cross-entropy: do not compute them separately (naively) #
In classification, the loss is often:
L=−logsoftmax(z)y\mathcal{L} = -\log \mathrm{softmax}(\mathbf{z})_yL=−logsoftmax(z)y
Naively:
compute softmax
take log
This is numerically fragile because:
- •softmax can overflow
- •probabilities can underflow to 0, then log(0) → −∞
Stable approach:
- •compute log-softmax via log-sum-exp
- •index the correct class
This avoids both overflow and log(0) underflow.
2) BatchNorm/LayerNorm: stable mean/variance and safe eps #
Normalization layers compute statistics over batches/features.
Key stability decisions:
- •Use stable variance computation (often a two-pass or Welford-like approach internally).
- •Choose ε\varepsilonε appropriate to dtype and scale.
Important nuance:
- •ε\varepsilonε is not a universal constant; it depends on typical activation magnitudes and precision.
- •In float16, using too-small ε\varepsilonε can fail to prevent division blow-ups.
Attention uses logits:
zij=qi⋅kjdz_{ij} = \frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d}}zij=dqi⋅kj
Then softmax over jjj.
The $1/\sqrt{d}$ scaling is partly about gradients/optimization, but also about numerics:
- •without scaling, dot products can grow with ddd, producing large logits
- •large logits make softmax overflow more likely and cause saturation (one-hot attention)
FlashAttention and similar kernels implement careful stabilization (blockwise softmax with max-subtraction) explicitly to keep values in safe ranges.
4) Log-likelihoods, HMMs, CRFs: stay in log space #
Whenever you multiply many probabilities, underflow is inevitable:
∏tpt→0\prod_t p_t \to 0t∏pt→0
Use logs:
log∏tpt=∑tlogpt\log \prod_t p_t = \sum_t \log p_tlogt∏pt=t∑logpt
And whenever you need to sum probabilities in log space, use log-sum-exp.
This pattern generalizes far beyond softmax.
5) Gradient clipping and normalization as overflow control #
Exploding gradients can cause:
- •weight updates that overflow
- •activations that overflow next step
Gradient clipping (by norm) can be viewed as enforcing a bound on magnitude to keep values in range.
6) Debugging numerics systematically (a checklist) #
When training becomes NaN/Inf:
- 1)Locate first NaN/Inf in the forward pass (hooks / anomaly detection).
- 2)Check common culprits:
- •softmax/log-softmax
- •normalization variance / division by small numbers
- •exp, log, sqrt
- •mixed precision overflow/underflow
- 3)Try interventions:
- •replace softmax+log with log-softmax
- •increase normalization eps
- •enable FP32 accumulation
- •use loss scaling
- •clamp logits or activations (carefully)
A key mindset: you’re not only debugging code—you’re debugging a numeric process under rounding.
7) Connecting conditioning to optimization behavior #
Conditioning appears as “how hard the optimizer must work.”
- •If the Hessian has a wide spread of eigenvalues, gradient descent zig-zags.
- •Normalization, adaptive optimizers, and preconditioning methods try to reduce effective condition number.
While this lesson focuses on floating-point issues, it’s important to see that:
- •better conditioning → often fewer extreme values → fewer numeric failures
- •stable numerics → fewer NaNs → more reliable optimization signals
Worked Examples (3) #
Worked Example 1: Stable Softmax and Log-Softmax via Log-Sum-Exp #
Let logits be z = [1000, 1001, 999]. Compute softmax and log-softmax stably (in exact math), and explain what goes wrong with the naive computation in float32.
Naive softmax uses exp(zᵢ). But exp(1000) and exp(1001) overflow in float32, producing ∞.
So the naive numerator/denominator become ∞/∞, which is undefined → NaN.
Use the stability shift m = max(z) = 1001.
Compute shifted logits: z' = z − m = [−1, 0, −2].
Compute exponentials of shifted logits:
exp(z') = [e^{−1}, e^{0}, e^{−2}] = [0.367879..., 1, 0.135335...].
These are all in (0, 1], safe in float16/32/64.
Compute denominator:
S = 0.367879... + 1 + 0.135335... = 1.503214...
Compute softmax:
softmax(z) = exp(z') / S
= [0.367879/1.503214, 1/1.503214, 0.135335/1.503214]
≈ [0.244728, 0.665241, 0.090031].
Compute log-sum-exp:
LSE(z) = m + log(∑ exp(zᵢ − m))
= 1001 + log(1.503214...)
≈ 1001 + 0.407606 = 1001.407606.
Compute log-softmax:
log softmax(z)ᵢ = zᵢ − LSE(z)
So:
log softmax = [1000 − 1001.407606,
1001 − 1001.407606,
999 − 1001.407606]
= [−1.407606, −0.407606, −2.407606].
Sanity check: exponentiating log-softmax returns the softmax probabilities:
exp(−0.407606) ≈ 0.665241, etc.
Insight: Subtracting max(logits) is not a heuristic; it’s an algebraically exact transformation that keeps exponentials in a safe numeric range. Log-sum-exp is the reusable pattern: whenever you see log(sum(exp(·))), you should think “shift by max.”
Worked Example 2: Catastrophic Cancellation in Variance (Naive vs Welford) #
Suppose x values are near 1e8 with tiny spread: x = [100000000.0, 100000001.0, 99999999.0, 100000000.0]. Show why Var = E[x²] − (E[x])² is fragile, and outline Welford’s stable computation.
Compute the mean (exact math):
μ = (100000000 + 100000001 + 99999999 + 100000000) / 4
= 400000000 / 4 = 100000000.
Compute deviations from the mean:
(x − μ) = [0, 1, −1, 0].
So the true population variance is:
Var = (0² + 1² + (−1)² + 0²)/4 = 2/4 = 0.5.
Now look at the naive formula components:
E[x²] is about (1e8)² = 1e16 scale.
(μ)² is also (1e8)² = 1e16 scale.
Their difference should be 0.5 — extremely tiny relative to 1e16.
In float32, you only have ~7 decimal digits of precision.
At magnitude 1e16, the spacing between representable floats is huge (much larger than 1).
So x² and μ² are rounded so coarsely that subtracting them cannot reliably yield 0.5.
This is catastrophic cancellation: the small result is dominated by rounding error in the large terms.
Welford’s algorithm avoids subtracting huge nearly-equal numbers.
Initialize:
μ₁ = x₁ = 100000000
M2₁ = 0
k=2, x₂=100000001:
δ = x₂ − μ₁ = 1
μ₂ = μ₁ + δ/2 = 100000000 + 0.5 = 100000000.5
δ₂ = x₂ − μ₂ = 0.5
M2₂ = M2₁ + δ·δ₂ = 0 + 1·0.5 = 0.5
k=3, x₃=99999999:
δ = x₃ − μ₂ = 99999999 − 100000000.5 = −1.5
μ₃ = μ₂ + δ/3 = 100000000.5 − 0.5 = 100000000.0
δ₂ = x₃ − μ₃ = −1
M2₃ = 0.5 + (−1.5)(−1) = 2.0
k=4, x₄=100000000:
δ = x₄ − μ₃ = 0
μ₄ = μ₃ + 0/4 = 100000000
δ₂ = 0
M2₄ = 2.0
Population variance:
Var = M2₄ / 4 = 2/4 = 0.5 (correct).
Insight: The naive identity is mathematically correct but numerically dangerous when mean ≫ std. Welford keeps intermediate values near the scale of deviations, so rounding error doesn’t get amplified by subtracting huge nearly-equal numbers.
Worked Example 3: Reduction-Order Error in Summation #
Sum the numbers: [1e8, 1, −1e8] in float32-like arithmetic to illustrate non-associativity. Compare (a+b)+c vs a+(b+c).
Let a = 1e8, b = 1, c = −1e8.
Compute (a + b) + c:
First a + b = 1e8 + 1.
At magnitude 1e8 in float32, the spacing between representable floats is larger than 1, so 1e8 + 1 rounds to 1e8.
So (a+b) ≈ 1e8.
Then (a+b)+c ≈ 1e8 + (−1e8) = 0.
Compute a + (b + c):
First b + c = 1 + (−1e8) = −99999999.
This is representable near −1e8 and keeps the +1 effect.
Then a + (b+c) = 1e8 + (−99999999) = 1.
So:
(a+b)+c ≈ 0, but a+(b+c) ≈ 1.
Same real-number expression, different floating-point result due to rounding in intermediate steps.
Insight: Summation order matters because adding a tiny number to a huge one can lose the tiny number entirely. Pairwise summation and compensated summation reduce this loss, which matters for large reductions (loss totals, gradient sums, dot products).
Key Takeaways #
✓
Finite-precision arithmetic means every operation is rounded; machine epsilon (and unit roundoff) sets the scale of relative rounding error.
✓
Overflow/underflow are about limited dynamic range; catastrophic cancellation and reduction-order error are about limited precision.
✓
Conditioning describes sensitivity of the true solution to input perturbations; stability describes whether an algorithm controls rounding error (often via backward stability).
✓
Log-sum-exp (shift by max) is the standard fix for softmax/log-softmax and log-probability computations.
✓
Variance is a classic cancellation trap; prefer Welford or stable two-pass methods over E[x²] − (E[x])² when mean ≫ std.
✓
Floating-point addition is not associative; for large reductions, consider pairwise summation, FP32 accumulation, or compensated summation.
✓
Normalization and scale-aware initialization help both optimization conditioning and numerical range control.
✓
In mixed precision, keep sensitive ops and accumulations in FP32 and use loss scaling to prevent gradient underflow.
Common Mistakes #
✗
Computing softmax then log for cross-entropy instead of using a stable log-softmax (log-sum-exp).
✗
Using Var = E[x²] − (E[x])² in low precision when data have a large mean and small variance, leading to negative variance and NaNs.
✗
Assuming sums/reductions are deterministic and associative across hardware; changing parallel reduction order can change results.
✗
Confusing the stability epsilon added in normalizations (a modeling/engineering constant) with machine epsilon (a property of the dtype).
Practice #
easy
Compute stably: L = log( exp(100) + exp(101) + exp(99) ). Give the exact stable expression and a numerical approximation (to ~4 decimals).
Hint: Use m = max = 101 and write L = m + log(∑ exp(zᵢ − m)).
Show solution
Let m = 101.
Then:
L = 101 + log( exp(100−101) + exp(101−101) + exp(99−101) )
= 101 + log( e^{−1} + 1 + e^{−2} ).
Compute: e^{−1}≈0.3679, e^{−2}≈0.1353, sum≈1.5032.
So L ≈ 101 + log(1.5032) ≈ 101 + 0.4076 = 101.4076.
medium
You are summing 10 million float16 values to compute a loss total. Name two strategies to reduce summation error and briefly explain why each helps.
Hint: Think about changing the summation order and using higher precision for the accumulator.
Show solution
Two good strategies:
Accumulate in float32 (FP32 accumulation): float32 has much smaller ε_machine than float16, so each addition rounds less, reducing total error.
Pairwise/tree reduction (or compensated summation like Kahan): summing numbers of similar magnitude earlier reduces the “tiny added to huge gets lost” effect, lowering rounding error compared to naive left-to-right summation.
hard
Explain (no code required) why computing cross-entropy as −log(softmax(z)[y]) can produce −∞ or NaN even when the true value is finite. Then state the stable alternative formula.
Hint: Consider underflow in softmax probabilities and overflow in exp(z). Use log-sum-exp.
Show solution
Naively computing softmax requires exp(z). Large logits can overflow exp(z) to ∞, producing ∞/∞ → NaN. Small probabilities can underflow to 0, and then log(0) → −∞, even if the mathematically correct probability is merely very small but nonzero. The stable approach is to compute log-softmax directly:
log softmax(z)ᵢ = zᵢ − (m + log ∑ⱼ exp(zⱼ − m)), where m = maxⱼ zⱼ.
Then cross-entropy is −log softmax(z)_y, avoiding overflow and log(0) underflow.
Connections #
Related next-step ideas you’ll likely meet in the Deep Learning node:
- •Mixed precision training (loss scaling, FP32 master weights)
- •Stable attention implementations (blockwise softmax / FlashAttention)
- •Gradient clipping and normalization layers as numerics + conditioning tools
Quality: A (4.2/5)
← back to treebrowse all →