←Back to Tech Tree
inventorycoverage
Concentration Inequalities #
Probability & StatisticsDifficulty: ★★★★★Depth: 7Unlocks: 0
Chernoff, Hoeffding bounds. Tail probabilities.
Interactive Visualization #
⏮◀◀▶▶STEP0.25x1xZOOM
t=0s
Core Concepts #
- -Exponential Markov method: apply Markov's inequality to exp(lambda X) to convert tail probabilities into moment-generating bounds.
- -Cumulant-generating function (log mgf) as the quantity that controls exponential tail decay: its growth rate determines concentration.
- -Bounded-summand (bounded-difference) principle: uniformly bounded independent summands yield sub-Gaussian tails (Hoeffding-type quadratic CGF upper bound).
Key Symbols & Notation #
psi_X(lambda) = log E[exp(lambda X)] (cumulant-generating function, i.e., log mgf)
Essential Relationships #
- -Exponential-Markov tail bound: for any lambda>0, P(X >= t) <= exp(-lambda t + psi_X(lambda)) (then optimize over lambda).
- -Independence additivity: for independent X_i, psi_{sum X_i}(lambda) = sum_i psi_{X_i}(lambda), allowing tail bounds for sums.
Prerequisites (2) #
Central Limit Theorem6 atomsMarkov Chains6 atoms
Referenced by (1) #
Where this concept shows up in the operating-finance and personal-finance graphs.
From Business (1) #
[Tail RiskBusiness
Chernoff and Hoeffding bounds are the mathematical formalization of tail risk - they bound the probability mass in distribution tails, which is exactly what characterizing tail risk requires beyond variance alone](/business/tail-risk/)
Advanced Learning Details
Graph Position #
120
Depth Cost
0
Fan-Out (ROI)
0
Bottleneck Score
7
Chain Length
Cognitive Load #
6
Atomic Elements
41
Total Elements
L3
Percentile Level
L4
Atomic Level
All Concepts (18) #
- Tail probability: the probability mass in the tails P(X >= t), P(X <= -t), or two-sided P(|X - E[X]| >= t)
- Concentration inequality: a non-asymptotic bound that upper-bounds tail probabilities (typically exponentially small in t or n)
- Markov's inequality (as used in tail bounds): P(Y >= a) <= E[Y]/a for nonnegative Y, used on exponentials in the Chernoff method
- Chernoff method (exponential/tilting method): applying Markov's inequality to e^{λX} and optimizing over λ to obtain exponential tail bounds
- Moment-generating function (MGF): M_X(λ) = E[e^{λX}] and its use as the basic object controlling tails
- Cumulant-generating function / log-MGF: ψ_X(λ) = log E[e^{λX}] and using its convexity to derive bounds
- Optimization over the tilt parameter λ: choosing the λ that minimizes the Chernoff bound to tighten the tail estimate
- Hoeffding's lemma: a specific bound on the MGF of a zero-mean bounded random variable (gives a quadratic bound in λ)
- Hoeffding's inequality: a concrete exponential tail bound for sums of independent bounded random variables
- Multiplicity (multiplicative) Chernoff bounds for Bernoulli/binomial sums: relative-error tail bounds of the form P(S >= (1+δ)μ) that decay exponentially in μ
- Sub-Gaussian random variable (definition via MGF or tail): variables whose tails are bounded like a Gaussian (equivalent forms: tail decay, moment growth, MGF bound)
- Sub-exponential random variable (parameterized tail/ MGFs): heavier-tailed class characterized by MGF behavior and linear-exponential tail bounds
- Bounded differences / McDiarmid-type condition: control of function deviation when each input coordinate is changed (used to obtain concentration for functions of independent variables)
- One-sided vs two-sided concentration: distinguishing upper-tail, lower-tail, and absolute-deviation bounds
- Exponential decay rate / rate function idea: tail probabilities often behave like exp(-I(t)) where I(t) is (often) quadratic near the mean and linear farther out
- Using union bound with concentration inequalities to get uniform bounds over finite collections
- Dependence vs independence assumptions: many concentration inequalities require independence or bounded-difference/martingale-type dependence controls
- Additivity/composition for sums: how MGFs / sub-Gaussian parameters combine when summing independent variables (e.g., variances/squared-parameters add)
Teaching Strategy #
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
When you average noisy data, you feel like it should stabilize. Concentration inequalities make that feeling precise: they turn “rare big deviations” into concrete, exponentially small tail probabilities—with constants you can compute and optimize.
TL;DR:
Concentration inequalities bound tail probabilities like P(S − E[S] ≥ t). The core move is the exponential Markov method: apply Markov to exp(λ(S−E[S])) to get P(S−E[S] ≥ t) ≤ exp(ψ_S(λ) − λt). Optimizing over λ gives a Legendre/duality picture: the best exponent is sup_λ (λt − ψ_S(λ)) (a convex conjugate). For sums of independent bounded variables, Hoeffding’s lemma yields a quadratic upper bound on ψ, giving sub-Gaussian tails (Hoeffding bounds). For Bernoulli sums, sharper ψ yields Chernoff bounds with KL-divergence exponents.
What Is a Concentration Inequality? #
A concentration inequality is a guarantee that a random quantity stays close to its typical value (usually its expectation) with high probability.
Concretely, if S is a random variable (often a sum or average), we want bounds of the form
- •Upper tail: P(S − E[S] ≥ t)
- •Lower tail: P(S − E[S] ≤ −t)
- •Two-sided: P(|S − E[S]| ≥ t)
The defining feature is that these probabilities decay fast in t (often like exp(−c t²) or exp(−c t)). This is the mathematical version of “averaging reduces noise.”
Why concentration is not the CLT #
You already know the Central Limit Theorem (CLT): sums of many independent variables become approximately normal under mild conditions.
But CLT answers a different question:
- •CLT is asymptotic (n → ∞) and gives an approximate distribution near the center.
- •Concentration inequalities are finite-sample (fixed n) and give non-asymptotic upper bounds on tail probabilities.
In practice, concentration is what you use when you want statements like:
“With probability at least 1 − δ, the sample mean differs from the true mean by at most ε.”
This is a tail probability question.
The common structure #
Most concentration results for sums of independent variables share a template:
- 1)Convert a tail event into an inequality about an exponential moment.
- 2)Bound that exponential moment using some structure (independence, boundedness, sub-Gaussianity, etc.).
- 3)Optimize a parameter (λ) to get the best exponent.
This node focuses on the classic pair:
- •Hoeffding bounds (bounded variables)
- •Chernoff bounds (especially for Bernoulli sums / binomials)
And the unifying engine behind both:
- •Exponential Markov method using the cumulant-generating function (CGF)
The key symbol: the cumulant-generating function #
For a random variable X, define its moment-generating function (mgf) when it exists:
MX(λ)=E[eλX]M_X(\lambda) = \mathbb{E}[e^{\lambda X}]MX(λ)=E[eλX]
and its cumulant-generating function (CGF)
ψX(λ)=logE[eλX]\psi_X(\lambda) = \log \mathbb{E}[e^{\lambda X}]ψX(λ)=logE[eλX]
Why the log? Because for independent sums it turns products into sums:
If X and Y are independent, then
ψX+Y(λ)=logE[eλ(X+Y)]=log(E[eλX]E[eλY])=ψX(λ)+ψY(λ).\psi_{X+Y}(\lambda) = \log \mathbb{E}[e^{\lambda(X+Y)}] = \log(\mathbb{E}[e^{\lambda X}]\mathbb{E}[e^{\lambda Y}]) = \psi_X(\lambda) + \psi_Y(\lambda).ψX+Y(λ)=logE[eλ(X+Y)]=log(E[eλX]E[eλY])=ψX(λ)+ψY(λ).
This additivity is the backbone of concentration for sums.
Two mental pictures to keep in mind #
- Tail bound as an optimization problem:
We will repeatedly get bounds that look like
P(S−ES≥t)≤exp(ψS(λ)−λt)\mathbb{P}(S - \mathbb{E}S \ge t) \le \exp(\psi_S(\lambda) - \lambda t)P(S−ES≥t)≤exp(ψS(λ)−λt)
and then choose the best λ ≥ 0.
- Geometry / duality:
The exponent comes from comparing the curve ψ(λ) against the line λt.
- •ψ(λ) is convex.
- •λt is a line through the origin with slope t.
- •The best λ is where the line is “most above” ψ(λ) in the sense of maximizing λt − ψ(λ).
We will make that picture explicit later, because it is the most reliable way to remember what’s going on.
Core Mechanic 1: Exponential Markov Method and the ψ(λ) vs λt Optimization #
The starting point is humble: Markov’s inequality.
For a nonnegative random variable Y and a > 0,
P(Y≥a)≤E[Y]a.\mathbb{P}(Y \ge a) \le \frac{\mathbb{E}[Y]}{a}.P(Y≥a)≤aE[Y].
This is often too weak directly. The trick is to manufacture a useful nonnegative variable by exponentiating.
Step 1: Convert a tail event into an exponential event #
Suppose we want an upper tail for X:
P(X≥t).\mathbb{P}(X \ge t).P(X≥t).
For any λ > 0,
X≥t ⟺ eλX≥eλt.X \ge t \iff e^{\lambda X} \ge e^{\lambda t}.X≥t⟺eλX≥eλt.
Since e^{\lambda X} ≥ 0, we can apply Markov:
P(X≥t)=P(eλX≥eλt)≤E[eλX]eλt.\mathbb{P}(X \ge t) = \mathbb{P}(e^{\lambda X} \ge e^{\lambda t}) \le \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda t}}.P(X≥t)=P(eλX≥eλt)≤eλtE[eλX].
That is
P(X≥t)≤exp(ψX(λ)−λt).\mathbb{P}(X \ge t) \le \exp(\psi_X(\lambda) - \lambda t).P(X≥t)≤exp(ψX(λ)−λt).
This is the exponential Markov method.
Centering matters #
Most concentration statements are about deviations from the mean. Let Z = S − E[S]. Then
P(S−ES≥t)=P(Z≥t)≤exp(ψZ(λ)−λt).\mathbb{P}(S - \mathbb{E}S \ge t) = \mathbb{P}(Z \ge t) \le \exp(\psi_Z(\lambda) - \lambda t).P(S−ES≥t)=P(Z≥t)≤exp(ψZ(λ)−λt).
Note that
ψZ(λ)=logE[eλ(S−ES)]=log(e−λES E[eλS])=ψS(λ)−λES.\psi_Z(\lambda) = \log \mathbb{E}[e^{\lambda(S-\mathbb{E}S)}] = \log \left(e^{-\lambda\mathbb{E}S},\mathbb{E}[e^{\lambda S}]\right) = \psi_S(\lambda) - \lambda \mathbb{E}S.ψZ(λ)=logE[eλ(S−ES)]=log(e−λESE[eλS])=ψS(λ)−λES.
So you can either work with ψ_S and subtract λE[S], or directly define ψ for the centered variable.
Step 2: Optimize over λ #
The inequality holds for every λ > 0. So take the best one:
P(Z≥t)≤infλ>0exp(ψZ(λ)−λt).\mathbb{P}(Z \ge t) \le \inf_{\lambda > 0} \exp(\psi_Z(\lambda) - \lambda t).P(Z≥t)≤λ>0infexp(ψZ(λ)−λt).
Equivalently,
P(Z≥t)≤exp(−supλ>0(λt−ψZ(λ))).\mathbb{P}(Z \ge t) \le \exp\Big( - \sup_{\lambda > 0} (\lambda t - \psi_Z(\lambda)) \Big).P(Z≥t)≤exp(−λ>0sup(λt−ψZ(λ))).
Define the convex conjugate (Legendre transform) of ψ_Z:
ψZ∗(t)=supλ∈R(λt−ψZ(λ)).\psi_Z^*(t) = \sup_{\lambda \in \mathbb{R}} (\lambda t - \psi_Z(\lambda)).ψZ∗(t)=λ∈Rsup(λt−ψZ(λ)).
Then the clean conceptual form is:
P(Z≥t)≤e−ψZ∗(t).\mathbb{P}(Z \ge t) \le e^{-\psi_Z^*(t)}.P(Z≥t)≤e−ψZ∗(t).
This “duality” is not just fancy notation: it is exactly the geometry of “supporting lines” to a convex function.
The geometry: ψ(λ) and its supporting line #
Because ψ_Z(λ) is convex in λ (log mgf is convex), the function
g(λ)=ψZ(λ)−λtg(\lambda) = \psi_Z(\lambda) - \lambda tg(λ)=ψZ(λ)−λt
is convex too (convex minus linear). Minimizing it has a crisp first-order condition when differentiable:
g′(λ)=ψZ′(λ)−t=0⇒ψZ′(λ∗)=t.g'(\lambda) = \psi_Z'(\lambda) - t = 0 \quad\Rightarrow\quad \psi_Z'(\lambda^*) = t.g′(λ)=ψZ′(λ)−t=0⇒ψZ′(λ∗)=t.
Interpretation:
- •ψ_Z'(λ) is the slope of ψ_Z at λ.
- •We choose λ* so that the slope of ψ matches the target deviation t.
This directly matches the visualization suggestion:
- •Plot ψ_Z(λ) vs λ.
- •Plot the line ℓ(λ) = λt.
- •The quantity λt − ψ_Z(λ) is the vertical gap between the line and the curve.
- •The best λ* maximizes that gap.
Interactive canvas idea (explicit) #
If your interactive environment supports it, show two linked panels:
Panel A (duality / optimization):
- •A convex curve ψ(λ).
- •A movable line λt (you drag t; line slope changes).
- •A highlighted λ* where the vertical gap is maximized.
- •Display: exponent = ψ*(t) = λ*t − ψ(λ*).
Panel B (resulting tail):
- •Plot the bound exp(−ψ*(t)) as a function of t.
- •As you move t in Panel A, show the corresponding tail point in Panel B.
This makes “optimize over λ” feel like reading off a best supporting line.
Step 3: Use structure to bound ψ #
So far, everything is exact but useless unless we can bound ψ_Z(λ).
The entire field of concentration inequalities is essentially a library of ways to upper bound ψ for different kinds of random variables.
For this node, we focus on:
- •Independent bounded summands → quadratic upper bound on ψ (Hoeffding-type)
- •Bernoulli/Binomial → exact-ish ψ → KL divergence exponents (Chernoff)
Before going there, one more crucial simplification:
ψ for sums of independent terms #
Let S = ∑ᵢ Xᵢ with independent Xᵢ. For centered terms (or after centering),
ψS(λ)=∑iψXi(λ).\psi_S(\lambda) = \sum_i \psi_{X_i}(\lambda).ψS(λ)=i∑ψXi(λ).
So if you can bound each ψ_{X_i}(λ), you can bound ψ_S(λ) by summing.
That is why bounded-difference / bounded-summand principles are so powerful: they give a uniform bound per term.
Core Mechanic 2: Bounded Summands → Hoeffding’s Lemma → Hoeffding Bounds #
Hoeffding’s inequality is the canonical “bounded independent noise averages out” theorem.
We’ll build it in two stages:
Hoeffding’s lemma: a single bounded random variable has a sub-Gaussian mgf.
Additivity of ψ for independent sums + λ-optimization yields Hoeffding’s inequality.
Stage 1: Hoeffding’s lemma (mgf bound for a bounded variable) #
Let X be a random variable with support in an interval [a, b]. Define the centered variable:
Y=X−E[X].Y = X - \mathbb{E}[X].Y=X−E[X].
Hoeffding’s lemma states:
E[eλY]≤exp(λ2(b−a)28)∀λ∈R.\mathbb{E}[e^{\lambda Y}] \le \exp\left(\frac{\lambda^2 (b-a)^2}{8}\right) \quad \forall \lambda \in \mathbb{R}.E[eλY]≤exp(8λ2(b−a)2)∀λ∈R.
Equivalently in CGF form:
ψY(λ)≤λ2(b−a)28.\psi_Y(\lambda) \le \frac{\lambda^2 (b-a)^2}{8}.ψY(λ)≤8λ2(b−a)2.
Why this is the key (motivation) #
This is the precise bridge from boundedness to quadratic ψ, and quadratic ψ is what produces Gaussian-like tails exp(−t²).
A useful mental model:
- •A truly Gaussian variable G ∼ N(0, σ²) has
ψG(λ)=logE[eλG]=σ2λ22.\psi_G(\lambda) = \log \mathbb{E}[e^{\lambda G}] = \frac{\sigma^2 \lambda^2}{2}.ψG(λ)=logE[eλG]=2σ2λ2.
So Hoeffding’s lemma says: “a bounded centered variable behaves like a Gaussian in terms of its mgf, with effective variance proxy (b−a)²/4.” (Up to constants.)
A brief sketch of why it’s true #
A full proof uses convexity (or a comparison to a two-point distribution at the endpoints). The intuition:
- •For fixed bounds [a, b] and fixed mean, the mgf is maximized by a distribution that places all mass at the endpoints.
- •That reduces the worst case to a simple two-point calculation.
For this lesson, the important takeaway is the usable inequality:
ψX−EX(λ)≤λ2(b−a)28.\psi_{X-\mathbb{E}X}(\lambda) \le \frac{\lambda^2 (b-a)^2}{8}.ψX−EX(λ)≤8λ2(b−a)2.
Stage 2: Hoeffding inequality for sums #
Let X₁, …, X_n be independent, with Xᵢ ∈ [aᵢ, bᵢ] almost surely. Let
S=∑i=1nXi,μ=E[S]=∑i=1nE[Xi].S = \sum_{i=1}^n X_i, \quad \mu = \mathbb{E}[S] = \sum_{i=1}^n \mathbb{E}[X_i].S=i=1∑nXi,μ=E[S]=i=1∑nE[Xi].
Define the centered sum Z = S − μ = ∑ᵢ (Xᵢ − E[Xᵢ]).
Step A: Apply exponential Markov #
For λ > 0,
P(S−μ≥t)=P(Z≥t)≤exp(ψZ(λ)−λt).\mathbb{P}(S-\mu \ge t) = \mathbb{P}(Z \ge t) \le \exp(\psi_Z(\lambda) - \lambda t).P(S−μ≥t)=P(Z≥t)≤exp(ψZ(λ)−λt).
Step B: Use independence to sum ψ #
Because the centered terms are independent,
ψZ(λ)=∑i=1nψXi−EXi(λ).\psi_Z(\lambda) = \sum_{i=1}^n \psi_{X_i-\mathbb{E}X_i}(\lambda).ψZ(λ)=i=1∑nψXi−EXi(λ).
Step C: Bound each ψ via Hoeffding’s lemma #
Each Xᵢ − E[Xᵢ] is still bounded in an interval of width (bᵢ − aᵢ). So
ψXi−EXi(λ)≤λ2(bi−ai)28.\psi_{X_i-\mathbb{E}X_i}(\lambda) \le \frac{\lambda^2 (b_i-a_i)^2}{8}.ψXi−EXi(λ)≤8λ2(bi−ai)2.
Summing gives
ψZ(λ)≤λ28∑i=1n(bi−ai)2.\psi_Z(\lambda) \le \frac{\lambda^2}{8} \sum_{i=1}^n (b_i-a_i)^2.ψZ(λ)≤8λ2i=1∑n(bi−ai)2.
Define the “range-sum” parameter
V=∑i=1n(bi−ai)2.V = \sum_{i=1}^n (b_i-a_i)^2.V=i=1∑n(bi−ai)2.
Then
P(Z≥t)≤exp(λ2V8−λt).\mathbb{P}(Z \ge t) \le \exp\left(\frac{\lambda^2 V}{8} - \lambda t\right).P(Z≥t)≤exp(8λ2V−λt).
Step D: Optimize over λ #
We minimize the exponent in λ:
h(λ)=λ2V8−λt.h(\lambda) = \frac{\lambda^2 V}{8} - \lambda t.h(λ)=8λ2V−λt.
Differentiate:
h′(λ)=λV4−t.h'(\lambda) = \frac{\lambda V}{4} - t.h′(λ)=4λV−t.
Set h'(λ)=0:
λ∗V4=t⇒λ∗=4tV.\frac{\lambda^* V}{4} = t \quad\Rightarrow\quad \lambda^* = \frac{4t}{V}.4λ∗V=t⇒λ∗=V4t.
Plug back in:
\n1) Compute the quadratic term:
(λ∗)2V8=(16t2/V2)V8=16t28V=2t2V.\frac{(\lambda^*)^2 V}{8} = \frac{(16 t^2 / V^2) V}{8} = \frac{16 t^2}{8V} = \frac{2t^2}{V}.8(λ∗)2V=8(16t2/V2)V=8V16t2=V2t2.
- Compute the linear term:
λ∗t=4tV⋅t=4t2V.\lambda^* t = \frac{4t}{V} \cdot t = \frac{4t^2}{V}.λ∗t=V4t⋅t=V4t2.
So
h(λ∗)=2t2V−4t2V=−2t2V.h(\lambda^*) = \frac{2t^2}{V} - \frac{4t^2}{V} = -\frac{2t^2}{V}.h(λ∗)=V2t2−V4t2=−V2t2.
Therefore
P(S−μ≥t)≤exp(−2t2∑i=1n(bi−ai)2).\mathbb{P}(S-\mu \ge t) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).P(S−μ≥t)≤exp(−∑i=1n(bi−ai)22t2).
That is Hoeffding’s inequality.
Two-sided version #
Similarly,
P(S−μ≤−t)≤exp(−2t2∑i=1n(bi−ai)2).\mathbb{P}(S-\mu \le -t) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).P(S−μ≤−t)≤exp(−∑i=1n(bi−ai)22t2).
By a union bound,
P(∣S−μ∣≥t)≤2exp(−2t2∑i=1n(bi−ai)2).\mathbb{P}(|S-\mu| \ge t) \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).P(∣S−μ∣≥t)≤2exp(−∑i=1n(bi−ai)22t2).
If Xᵢ ∈ [0,1], then (bᵢ − aᵢ)=1 and V = n.
Let \bar{X} = (1/n)∑ Xᵢ and E[\bar{X}] = μ̄.
Then
\mathbb{P}(\bar{X} - \mū \ge \varepsilon) = \mathbb{P}(S - \mu \ge n\varepsilon) \le \exp\left(-2n\varepsilon^2\right).
And two-sided:
\mathbb{P}(|\bar{X} - \mū| \ge \varepsilon) \le 2\exp(-2n\varepsilon^2).
Visualization reinforcement: quadratic ψ bound → sub-Gaussian tail curves #
This is the second visualization the node asked for.
In Hoeffding’s proof, we replaced the true ψ_Z(λ) by an upper quadratic:
ψZ(λ)≤λ2V8.\psi_Z(\lambda) \le \frac{\lambda^2 V}{8}.ψZ(λ)≤8λ2V.
Geometrically, you can plot:
- •The true ψ_Z(λ) (unknown generally)
- •The quadratic upper envelope (λ²V/8)
Then the optimized exponent becomes the conjugate of the quadratic:
supλ≥0(λt−λ2V/8)=2t2V.\sup_{\lambda \ge 0}(\lambda t - \lambda^2 V/8) = \frac{2t^2}{V}.λ≥0sup(λt−λ2V/8)=V2t2.
So the tail is bounded by exp(−2t²/V).
Interactive canvas idea (explicit):
- •Let the user pick n, δ.
- •Show ε(δ) such that 2exp(−2nε²)=δ, i.e.
ε(δ)=log(2/δ)2n.\varepsilon(\delta) = \sqrt{\frac{\log(2/\delta)}{2n}}.ε(δ)=2nlog(2/δ).
- •Plot the family of curves exp(−2nε²) as n varies.
- •Show how changing n shifts the curve down (stronger concentration).
This connects directly to “how many samples do I need for error ≤ ε with probability ≥ 1−δ?”
Application/Connection: Chernoff Bounds, KL Exponents, and How ψ Controls Tail Decay #
Hoeffding is powerful and simple, but it is sometimes loose because it uses only boundedness.
Chernoff bounds refine the story when the distribution is known (especially Bernoulli) by using a sharper ψ.
Bernoulli sums and a sharper ψ #
Let Xᵢ ∼ Bernoulli(p) independent. Then S = ∑ Xᵢ ∼ Binomial(n, p), and μ = E[S] = np.
For a single Bernoulli X:
E[eλX]=(1−p) e0+p eλ=1−p+peλ.\mathbb{E}[e^{\lambda X}] = (1-p),e^{0} + p,e^{\lambda} = 1-p + pe^{\lambda}.E[eλX]=(1−p)e0+peλ=1−p+peλ.
So
ψX(λ)=log(1−p+peλ).\psi_X(\lambda) = \log(1-p + pe^{\lambda}).ψX(λ)=log(1−p+peλ).
For the sum S, independence gives
ψS(λ)=nlog(1−p+peλ).\psi_S(\lambda) = n\log(1-p + pe^{\lambda}).ψS(λ)=nlog(1−p+peλ).
For the centered sum Z = S − μ,
ψZ(λ)=ψS(λ)−λμ=nlog(1−p+peλ)−λnp.\psi_Z(\lambda) = \psi_S(\lambda) - \lambda\mu = n\log(1-p + pe^{\lambda}) - \lambda np.ψZ(λ)=ψS(λ)−λμ=nlog(1−p+peλ)−λnp.
Now apply exponential Markov:
P(S−μ≥t)≤exp(nlog(1−p+peλ)−λnp−λt).\mathbb{P}(S-\mu \ge t) \le \exp\left(n\log(1-p + pe^{\lambda}) - \lambda np - \lambda t\right).P(S−μ≥t)≤exp(nlog(1−p+peλ)−λnp−λt).
Let’s express the event in relative terms. Set t = n\varepsilon, i.e. S ≥ n(p+ε). Then
P(Sn≥p+ε)≤exp(n[log(1−p+peλ)−λ(p+ε)]).\mathbb{P}\left(\frac{S}{n} \ge p+\varepsilon\right) \le \exp\left(n\big[\log(1-p + pe^{\lambda}) - \lambda(p+\varepsilon)\big]\right).P(nS≥p+ε)≤exp(n[log(1−p+peλ)−λ(p+ε)]).
We optimize over λ ≥ 0.
Let q = p+ε (so q > p). The optimal exponent is
infλ>0(log(1−p+peλ)−λq)=−D(q∥p),\inf_{\lambda>0} \Big(\log(1-p+pe^{\lambda}) - \lambda q\Big) = -D(q|p),λ>0inf(log(1−p+peλ)−λq)=−D(q∥p),
where D(q|p) is the Bernoulli KL divergence:
D(q∥p)=qlogqp+(1−q)log1−q1−p.D(q|p) = q\log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}.D(q∥p)=qlogpq+(1−q)log1−p1−q.
Thus the Chernoff bound becomes
P(Sn≥q)≤exp(−n D(q∥p)).\mathbb{P}\left(\frac{S}{n} \ge q\right) \le \exp\big(-n,D(q|p)\big).P(nS≥q)≤exp(−nD(q∥p)).
Similarly for the lower tail (q < p):
P(Sn≤q)≤exp(−n D(q∥p)).\mathbb{P}\left(\frac{S}{n} \le q\right) \le \exp\big(-n,D(q|p)\big).P(nS≤q)≤exp(−nD(q∥p)).
This is a major conceptual upgrade:
- •Hoeffding: exponent is quadratic in (q−p), distribution-free.
- •Chernoff: exponent is KL divergence, distribution-aware and often tighter.
A popular parameterization is S ≥ (1+δ)μ with δ>0.
Since μ=np, the event is S ≥ (1+δ)np.
A standard Chernoff upper tail bound is:
P(S≥(1+δ)μ)≤(eδ(1+δ)1+δ)μ.\mathbb{P}(S \ge (1+\delta)\mu) \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}.P(S≥(1+δ)μ)≤((1+δ)1+δeδ)μ.
And a simpler (looser) but easy-to-remember form:
P(S≥(1+δ)μ)≤exp(−μδ22+δ).\mathbb{P}(S \ge (1+\delta)\mu) \le \exp\left(-\frac{\mu\delta^2}{2+\delta}\right).P(S≥(1+δ)μ)≤exp(−2+δμδ2).
Lower tail for δ∈(0,1):
P(S≤(1−δ)μ)≤exp(−μδ22).\mathbb{P}(S \le (1-\delta)\mu) \le \exp\left(-\frac{\mu\delta^2}{2}\right).P(S≤(1−δ)μ)≤exp(−2μδ2).
These are derived by selecting a convenient λ rather than solving exactly, or by bounding KL by a quadratic.
Connection back to ψ(λ) vs λt geometry #
For Bernoulli sums,
- •ψ_Z(λ) is known exactly.
- •The best exponent is the convex conjugate ψ_Z*(t).
- •That conjugate turns out to be n·D(q||p) when t corresponds to q.
This is the same duality picture as before, but now the optimized supporting line corresponds to a KL divergence.
Visualization suggestion (Chernoff / KL) #
Extend the earlier Panel A:
- •Plot ψ_Z(λ) for a chosen p and n (or per-sample ψ and then multiply by n).
- •Fix a target deviation q>p.
- •Plot the line λ·n(q−p) (or equivalently incorporate centering as above).
- •Highlight λ* and show the exponent nD(q||p).
Then in Panel B, plot two bounds versus q:
- •Chernoff: exp(−nD(q||p))
- •Hoeffding: exp(−2n(q−p)²)
This comparison makes the “distribution-aware” advantage visible.
Where Markov chains connect (high level) #
Even though this node centers on independent sums, the exponential Markov method generalizes.
In Markov chain settings, you often want concentration for empirical averages along a trajectory:
1n∑t=1nf(Xt).\frac{1}{n}\sum_{t=1}^n f(X_t).n1t=1∑nf(Xt).
Independence fails, but one can still bound exponential moments using spectral gaps / mixing and obtain Hoeffding-like or Bernstein-like inequalities for dependent data.
The important connection is conceptual:
- •Still use exp(λ·sum) + Markov.
- •Still bound ψ(λ) (now harder).
- •Still optimize over λ.
So learning the ψ(λ) vs λt optimization here pays off later in dependent concentration.
Summary table: Hoeffding vs Chernoff #
| Inequality | Setting | Tail behavior | What ψ bound uses | Typical exponent |
|---|
| Hoeffding | Independent, bounded Xᵢ ∈ [aᵢ,bᵢ] | Sub-Gaussian | Quadratic upper bound on ψ (Hoeffding lemma) | 2t² / ∑(bᵢ−aᵢ)² |
| Chernoff | Bernoulli/Binomial (and related) | Often tighter; KL form | Exact ψ, then optimize | n·D(q |
Both are the same engine + different ψ information.
Worked Examples (3) #
Hoeffding bound for a sample mean in [0,1] and solving for n given (ε, δ) #
Let X₁,…,X_n be independent with Xᵢ ∈ [0,1]. Let \bar{X} = (1/n)∑Xᵢ and μ̄ = E[\bar{X}]. We want n such that P(|\bar{X} − μ̄| ≥ ε) ≤ δ.
Start from two-sided Hoeffding for bounded [0,1] variables:
P(|\bar{X} − μ̄| ≥ ε) ≤ 2 exp(−2nε²).
Set the bound ≤ δ:
2 exp(−2nε²) ≤ δ.
Divide by 2:
exp(−2nε²) ≤ δ/2.
Take logs (note log is increasing):
−2nε² ≤ log(δ/2).
Multiply by −1 (flips inequality):
2nε² ≥ log(2/δ).
Solve for n:
n≥log(2/δ)2ε2.n \ge \frac{\log(2/\delta)}{2\varepsilon^2}.n≥2ε2log(2/δ).
Interpretation: to cut ε in half, you need ~4× more samples; to make δ 10× smaller, you need only an additive increase of (log 10)/(2ε²).
Insight: This is the classic PAC-style sample complexity tradeoff. The square dependence on ε comes from the quadratic ψ upper bound (sub-Gaussianity), while the log dependence on 1/δ comes from the exponential tail.
Deriving Hoeffding’s inequality via ψ(λ) and optimizing λ explicitly #
Let X₁,…,X_n independent with Xᵢ ∈ [aᵢ,bᵢ]. Let S=∑Xᵢ and μ=E[S]. Define V=∑(bᵢ−aᵢ)². Show P(S−μ ≥ t) ≤ exp(−2t²/V).
Let Z = S − μ = ∑(Xᵢ − E[Xᵢ]). We apply exponential Markov:
For any λ>0,
P(Z ≥ t) = P(e^{λZ} ≥ e^{λt}) ≤ E[e^{λZ}] / e^{λt}.
Take logs via ψ:
P(Z ≥ t) ≤ exp( ψ_Z(λ) − λt ), where ψ_Z(λ)=log E[e^{λZ}].
Use independence to add CGFs:
ψ_Z(λ) = ∑ ψ_{Xᵢ−E[Xᵢ]}(λ).
Apply Hoeffding’s lemma to each term:
ψ_{Xᵢ−E[Xᵢ]}(λ) ≤ λ²(bᵢ−aᵢ)²/8.
Sum the bounds:
ψ_Z(λ) ≤ (λ²/8)∑(bᵢ−aᵢ)² = λ²V/8.
So P(Z ≥ t) ≤ exp( λ²V/8 − λt ).
Optimize h(λ)=λ²V/8 − λt:
h'(λ)=λV/4 − t = 0 ⇒ λ* = 4t/V.
Plug in:
h(λ*) = ( (16t²/V²)·V / 8 ) − (4t/V)·t = (2t²/V) − (4t²/V) = −2t²/V.
Conclude:
P(S−μ≥t)≤exp(−2t2V).P(S−\mu \ge t) \le \exp\left(−\frac{2t^2}{V}\right).P(S−μ≥t)≤exp(−V2t2).
Insight: This example makes the optimization/duality tangible: once ψ is upper-bounded by a quadratic, the entire tail bound becomes a one-line quadratic minimization in λ.
Let S ∼ Binomial(n,p). For a target fraction q>p, bound P(S/n ≥ q).
Use exponential Markov on S:
For λ>0,
P(S ≥ nq) = P(e^{λS} ≥ e^{λnq}) ≤ E[e^{λS}] / e^{λnq}.
Compute the mgf of a binomial via independence:
E[e^{λS}] = (E[e^{λX₁}])^n with X₁∼Bern(p).
E[e^{λX₁}] = 1−p + pe^{λ}.
So:
P(S/n ≥ q) ≤ exp( n log(1−p+pe^{λ}) − λnq ).
Optimize over λ>0:
Take derivative w.r.t. λ of the exponent per-sample:
φ(λ)=log(1−p+pe^{λ}) − λq.
φ'(λ) = (pe^{λ})/(1−p+pe^{λ}) − q.
Set φ'(λ*)=0:
(pe^{λ*})/(1−p+pe^{λ*}) = q.
Solve for e^{λ*}:
q(1−p+pe^{λ*}) = pe^{λ*}
q(1−p) = pe^{λ*}(1−q)
⇒ e^{λ*} = (q(1−p))/(p(1−q)).
Plugging λ* back yields the KL form:
min_λ [log(1−p+pe^{λ}) − λq] = −D(q||p),
so
P(S/n≥q)≤exp(−nD(q∥p)).P(S/n \ge q) \le \exp\big(−n D(q|p)\big).P(S/n≥q)≤exp(−nD(q∥p)).
Insight: Chernoff is the same exponential Markov method, but with an exact ψ. The optimization condition ψ'(λ*)=t becomes an explicit equation whose solution encodes the tilted (exponentially reweighted) Bernoulli distribution, and the exponent becomes a KL divergence.
Key Takeaways #
✓
The exponential Markov method is the universal engine: apply Markov to e^{λX} to turn P(X≥t) into an mgf/CGF bound.
✓
The cumulant-generating function ψ_X(λ)=log E[e^{λX}] controls tail decay; for sums of independent variables, ψ adds: ψ_{∑Xᵢ}(λ)=∑ψ_{Xᵢ}(λ).
✓
Tail bounds are optimization problems: P(Z≥t) ≤ exp( inf_{λ>0}(ψ_Z(λ)−λt) ) = exp(−sup_{λ>0}(λt−ψ_Z(λ))).
✓
The ψ(λ) vs λt picture is geometric duality: maximizing the gap λt−ψ(λ) gives the best exponent; the optimizer satisfies ψ'(λ*)=t when differentiable.
✓
Hoeffding’s lemma gives a quadratic upper bound on ψ for bounded variables, implying sub-Gaussian tails; Hoeffding’s inequality follows by summing ψ and optimizing λ.
✓
For Bernoulli/Binomial sums, Chernoff bounds use a sharper ψ and yield KL-divergence exponents exp(−nD(q||p)), often tighter than Hoeffding.
✓
In high-probability estimation, concentration turns (ε,δ) requirements into explicit sample complexity like n ≥ log(2/δ)/(2ε²) for [0,1] averages.
Common Mistakes #
✗
Forgetting to center: bounding P(S≥t) when you really need P(S−E[S]≥t) changes ψ and often worsens constants or breaks symmetry.
✗
Optimizing λ with the wrong sign (e.g., using λ>0 for a lower tail); for P(Z≤−t) you typically apply the method to −Z (or λ<0).
✗
Assuming Hoeffding always beats everything: it ignores variance and distribution shape; Bernstein/Chernoff can be much tighter when additional information exists.
✗
Mixing up t for sums vs averages: if Z = \bar{X}−E\bar{X}, then the corresponding sum deviation is n·t, which changes exponents by n.
Practice #
easy
Let X₁,…,X_n be independent with Xᵢ ∈ [−1,2]. Give a bound on P(|\bar{X}−E\bar{X}| ≥ ε).
Hint: Use two-sided Hoeffding on the sum with (bᵢ−aᵢ)=3, then translate from S to \bar{X}.
Show solution
Here (bᵢ−aᵢ)=3 for each i, so V=∑(bᵢ−aᵢ)² = n·9 = 9n.
For S=∑Xᵢ and μ=E[S], Hoeffding gives:
P(|S−μ| ≥ t) ≤ 2 exp(−2t²/V) = 2 exp(−2t²/(9n)).
Now |\bar{X}−E\bar{X}| ≥ ε is equivalent to |S−μ| ≥ nε, so t=nε:
P(|\bar{X}−E\bar{X}| ≥ ε) ≤ 2 exp(−2(nε)²/(9n)) = 2 exp(−2nε²/9).
medium
Let S ∼ Binomial(n,p). Using the exponential Markov method, show that for any λ>0:
P(S/n ≥ q) ≤ exp(n[log(1−p+pe^{λ}) − λq]). Then write the condition that defines the optimal λ*.
Hint: Compute E[e^{λS}] via independence; then differentiate the exponent per-sample and set to zero.
Show solution
Exponential Markov:
P(S≥nq)=P(e^{λS}≥e^{λnq})≤E[e^{λS}]/e^{λnq}.
For Bernoulli X, E[e^{λX}]=1−p+pe^{λ}. For S=∑Xᵢ, E[e^{λS}]=(1−p+pe^{λ})^n.
Thus
P(S/n ≥ q) ≤ (1−p+pe^{λ})^n / e^{λnq} = exp(n[log(1−p+pe^{λ}) − λq]).
For optimal λ*, minimize f(λ)=log(1−p+pe^{λ}) − λq.
Set derivative to zero:
f'(λ)=(pe^{λ})/(1−p+pe^{λ}) − q = 0,
so λ satisfies (pe^{λ})/(1−p+pe^{λ*}) = q.
hard
Suppose Xᵢ ∈ [0,1] independent, and you want P(\bar{X} − E\bar{X} ≥ ε) ≤ δ (one-sided). Solve for the smallest n that Hoeffding guarantees.
Hint: Use the one-sided Hoeffding bound exp(−2nε²) and solve exp(−2nε²) ≤ δ.
Show solution
For Xᵢ∈[0,1], Hoeffding one-sided gives:
P(\bar{X} − E\bar{X} ≥ ε) ≤ exp(−2nε²).
Require exp(−2nε²) ≤ δ.
Take logs:
−2nε² ≤ log δ.
Multiply by −1 (flip):
2nε² ≥ log(1/δ).
Thus
n≥log(1/δ)2ε2.n \ge \frac{\log(1/\delta)}{2\varepsilon^2}.n≥2ε2log(1/δ).
Connections #
Next nodes you might connect:
Quality: A (4.3/5)
← back to treebrowse all →