Concentration Inequalities #

Probability & StatisticsDifficulty: ★★★★★Depth: 7Unlocks: 0

Chernoff, Hoeffding bounds. Tail probabilities.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

-Exponential Markov method: apply Markov's inequality to exp(lambda X) to convert tail probabilities into moment-generating bounds.
-Cumulant-generating function (log mgf) as the quantity that controls exponential tail decay: its growth rate determines concentration.
-Bounded-summand (bounded-difference) principle: uniformly bounded independent summands yield sub-Gaussian tails (Hoeffding-type quadratic CGF upper bound).

Key Symbols & Notation #

psi_X(lambda) = log E[exp(lambda X)] (cumulant-generating function, i.e., log mgf)

Essential Relationships #

-Exponential-Markov tail bound: for any lambda>0, P(X >= t) <= exp(-lambda t + psi_X(lambda)) (then optimize over lambda).
-Independence additivity: for independent X_i, psi_{sum X_i}(lambda) = sum_i psi_{X_i}(lambda), allowing tail bounds for sums.

Prerequisites (2) #

Central Limit Theorem6 atoms Markov Chains6 atoms

Referenced by (1) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (1) #

[Tail RiskBusiness

Chernoff and Hoeffding bounds are the mathematical formalization of tail risk - they bound the probability mass in distribution tails, which is exactly what characterizing tail risk requires beyond variance alone](/business/tail-risk/)

Advanced Learning Details

Graph Position #

120

Depth Cost

Fan-Out (ROI)

Bottleneck Score

Chain Length

Cognitive Load #

Atomic Elements

Total Elements

Percentile Level

Atomic Level

All Concepts (18) #

- Tail probability: the probability mass in the tails P(X >= t), P(X <= -t), or two-sided P(|X - E[X]| >= t)
- Concentration inequality: a non-asymptotic bound that upper-bounds tail probabilities (typically exponentially small in t or n)
- Markov's inequality (as used in tail bounds): P(Y >= a) <= E[Y]/a for nonnegative Y, used on exponentials in the Chernoff method
- Chernoff method (exponential/tilting method): applying Markov's inequality to e^{λX} and optimizing over λ to obtain exponential tail bounds
- Moment-generating function (MGF): M_X(λ) = E[e^{λX}] and its use as the basic object controlling tails
- Cumulant-generating function / log-MGF: ψ_X(λ) = log E[e^{λX}] and using its convexity to derive bounds
- Optimization over the tilt parameter λ: choosing the λ that minimizes the Chernoff bound to tighten the tail estimate
- Hoeffding's lemma: a specific bound on the MGF of a zero-mean bounded random variable (gives a quadratic bound in λ)
- Hoeffding's inequality: a concrete exponential tail bound for sums of independent bounded random variables
- Multiplicity (multiplicative) Chernoff bounds for Bernoulli/binomial sums: relative-error tail bounds of the form P(S >= (1+δ)μ) that decay exponentially in μ
- Sub-Gaussian random variable (definition via MGF or tail): variables whose tails are bounded like a Gaussian (equivalent forms: tail decay, moment growth, MGF bound)
- Sub-exponential random variable (parameterized tail/ MGFs): heavier-tailed class characterized by MGF behavior and linear-exponential tail bounds
- Bounded differences / McDiarmid-type condition: control of function deviation when each input coordinate is changed (used to obtain concentration for functions of independent variables)
- One-sided vs two-sided concentration: distinguishing upper-tail, lower-tail, and absolute-deviation bounds
- Exponential decay rate / rate function idea: tail probabilities often behave like exp(-I(t)) where I(t) is (often) quadratic near the mean and linear farther out
- Using union bound with concentration inequalities to get uniform bounds over finite collections
- Dependence vs independence assumptions: many concentration inequalities require independence or bounded-difference/martingale-type dependence controls
- Additivity/composition for sums: how MGFs / sub-Gaussian parameters combine when summing independent variables (e.g., variances/squared-parameters add)

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

When you average noisy data, you feel like it should stabilize. Concentration inequalities make that feeling precise: they turn “rare big deviations” into concrete, exponentially small tail probabilities—with constants you can compute and optimize.

TL;DR:

Concentration inequalities bound tail probabilities like P(S − E[S] ≥ t). The core move is the exponential Markov method: apply Markov to exp(λ(S−E[S])) to get P(S−E[S] ≥ t) ≤ exp(ψ_S(λ) − λt). Optimizing over λ gives a Legendre/duality picture: the best exponent is sup_λ (λt − ψ_S(λ)) (a convex conjugate). For sums of independent bounded variables, Hoeffding’s lemma yields a quadratic upper bound on ψ, giving sub-Gaussian tails (Hoeffding bounds). For Bernoulli sums, sharper ψ yields Chernoff bounds with KL-divergence exponents.

What Is a Concentration Inequality? #

A concentration inequality is a guarantee that a random quantity stays close to its typical value (usually its expectation) with high probability.

Concretely, if S is a random variable (often a sum or average), we want bounds of the form

•Upper tail: P(S − E[S] ≥ t)
•Lower tail: P(S − E[S] ≤ −t)
•Two-sided: P(|S − E[S]| ≥ t)

The defining feature is that these probabilities decay fast in t (often like exp(−c t²) or exp(−c t)). This is the mathematical version of “averaging reduces noise.”

Why concentration is not the CLT #

You already know the Central Limit Theorem (CLT): sums of many independent variables become approximately normal under mild conditions.

But CLT answers a different question:

•CLT is asymptotic (n → ∞) and gives an approximate distribution near the center.
•Concentration inequalities are finite-sample (fixed n) and give non-asymptotic upper bounds on tail probabilities.

In practice, concentration is what you use when you want statements like:

“With probability at least 1 − δ, the sample mean differs from the true mean by at most ε.”

This is a tail probability question.

The common structure #

Most concentration results for sums of independent variables share a template:

1)Convert a tail event into an inequality about an exponential moment.
2)Bound that exponential moment using some structure (independence, boundedness, sub-Gaussianity, etc.).
3)Optimize a parameter (λ) to get the best exponent.

This node focuses on the classic pair:

•Hoeffding bounds (bounded variables)
•Chernoff bounds (especially for Bernoulli sums / binomials)

And the unifying engine behind both:

•Exponential Markov method using the cumulant-generating function (CGF)

The key symbol: the cumulant-generating function #

For a random variable X, define its moment-generating function (mgf) when it exists:

MX(λ)=E[eλX]M_X(\lambda) = \mathbb{E}[e^{\lambda X}]MX(λ)=E[eλX]

and its cumulant-generating function (CGF)

ψX(λ)=log⁡E[eλX]\psi_X(\lambda) = \log \mathbb{E}[e^{\lambda X}]ψX(λ)=logE[eλX]

Why the log? Because for independent sums it turns products into sums:

If X and Y are independent, then

ψX+Y(λ)=log⁡E[eλ(X+Y)]=log⁡(E[eλX]E[eλY])=ψX(λ)+ψY(λ).\psi_{X+Y}(\lambda) = \log \mathbb{E}[e^{\lambda(X+Y)}] = \log(\mathbb{E}[e^{\lambda X}]\mathbb{E}[e^{\lambda Y}]) = \psi_X(\lambda) + \psi_Y(\lambda).ψX+Y(λ)=logE[eλ(X+Y)]=log(E[eλX]E[eλY])=ψX(λ)+ψY(λ).

This additivity is the backbone of concentration for sums.

Two mental pictures to keep in mind #

Tail bound as an optimization problem:

We will repeatedly get bounds that look like

P(S−ES≥t)≤exp⁡(ψS(λ)−λt)\mathbb{P}(S - \mathbb{E}S \ge t) \le \exp(\psi_S(\lambda) - \lambda t)P(S−ES≥t)≤exp(ψS(λ)−λt)

and then choose the best λ ≥ 0.

Geometry / duality:

The exponent comes from comparing the curve ψ(λ) against the line λt.

•ψ(λ) is convex.
•λt is a line through the origin with slope t.
•The best λ is where the line is “most above” ψ(λ) in the sense of maximizing λt − ψ(λ).

We will make that picture explicit later, because it is the most reliable way to remember what’s going on.

Core Mechanic 1: Exponential Markov Method and the ψ(λ) vs λt Optimization #

The starting point is humble: Markov’s inequality.

For a nonnegative random variable Y and a > 0,

P(Y≥a)≤E[Y]a.\mathbb{P}(Y \ge a) \le \frac{\mathbb{E}[Y]}{a}.P(Y≥a)≤aE[Y].

This is often too weak directly. The trick is to manufacture a useful nonnegative variable by exponentiating.

Step 1: Convert a tail event into an exponential event #

Suppose we want an upper tail for X:

P(X≥t).\mathbb{P}(X \ge t).P(X≥t).

For any λ > 0,

X≥t ⟺ eλX≥eλt.X \ge t \iff e^{\lambda X} \ge e^{\lambda t}.X≥t⟺eλX≥eλt.

Since e^{\lambda X} ≥ 0, we can apply Markov:

P(X≥t)=P(eλX≥eλt)≤E[eλX]eλt.\mathbb{P}(X \ge t) = \mathbb{P}(e^{\lambda X} \ge e^{\lambda t}) \le \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda t}}.P(X≥t)=P(eλX≥eλt)≤eλtE[eλX].

That is

P(X≥t)≤exp⁡(ψX(λ)−λt).\mathbb{P}(X \ge t) \le \exp(\psi_X(\lambda) - \lambda t).P(X≥t)≤exp(ψX(λ)−λt).

This is the exponential Markov method.

Centering matters #

Most concentration statements are about deviations from the mean. Let Z = S − E[S]. Then

P(S−ES≥t)=P(Z≥t)≤exp⁡(ψZ(λ)−λt).\mathbb{P}(S - \mathbb{E}S \ge t) = \mathbb{P}(Z \ge t) \le \exp(\psi_Z(\lambda) - \lambda t).P(S−ES≥t)=P(Z≥t)≤exp(ψZ(λ)−λt).

Note that

ψZ(λ)=log⁡E[eλ(S−ES)]=log⁡(e−λES E[eλS])=ψS(λ)−λES.\psi_Z(\lambda) = \log \mathbb{E}[e^{\lambda(S-\mathbb{E}S)}] = \log \left(e^{-\lambda\mathbb{E}S},\mathbb{E}[e^{\lambda S}]\right) = \psi_S(\lambda) - \lambda \mathbb{E}S.ψZ(λ)=logE[eλ(S−ES)]=log(e−λESE[eλS])=ψS(λ)−λES.

So you can either work with ψ_S and subtract λE[S], or directly define ψ for the centered variable.

Step 2: Optimize over λ #

The inequality holds for every λ > 0. So take the best one:

P(Z≥t)≤inf⁡λ>0exp⁡(ψZ(λ)−λt).\mathbb{P}(Z \ge t) \le \inf_{\lambda > 0} \exp(\psi_Z(\lambda) - \lambda t).P(Z≥t)≤λ>0infexp(ψZ(λ)−λt).

Equivalently,

P(Z≥t)≤exp⁡(−sup⁡λ>0(λt−ψZ(λ))).\mathbb{P}(Z \ge t) \le \exp\Big( - \sup_{\lambda > 0} (\lambda t - \psi_Z(\lambda)) \Big).P(Z≥t)≤exp(−λ>0sup(λt−ψZ(λ))).

Define the convex conjugate (Legendre transform) of ψ_Z:

ψZ∗(t)=sup⁡λ∈R(λt−ψZ(λ)).\psi_Z^*(t) = \sup_{\lambda \in \mathbb{R}} (\lambda t - \psi_Z(\lambda)).ψZ∗(t)=λ∈Rsup(λt−ψZ(λ)).

Then the clean conceptual form is:

P(Z≥t)≤e−ψZ∗(t).\mathbb{P}(Z \ge t) \le e^{-\psi_Z^*(t)}.P(Z≥t)≤e−ψZ∗(t).

This “duality” is not just fancy notation: it is exactly the geometry of “supporting lines” to a convex function.

The geometry: ψ(λ) and its supporting line #

Because ψ_Z(λ) is convex in λ (log mgf is convex), the function

g(λ)=ψZ(λ)−λtg(\lambda) = \psi_Z(\lambda) - \lambda tg(λ)=ψZ(λ)−λt

is convex too (convex minus linear). Minimizing it has a crisp first-order condition when differentiable:

g′(λ)=ψZ′(λ)−t=0⇒ψZ′(λ∗)=t.g'(\lambda) = \psi_Z'(\lambda) - t = 0 \quad\Rightarrow\quad \psi_Z'(\lambda^*) = t.g′(λ)=ψZ′(λ)−t=0⇒ψZ′(λ∗)=t.

Interpretation:

•ψ_Z'(λ) is the slope of ψ_Z at λ.
•We choose λ* so that the slope of ψ matches the target deviation t.

This directly matches the visualization suggestion:

•Plot ψ_Z(λ) vs λ.
•Plot the line ℓ(λ) = λt.
•The quantity λt − ψ_Z(λ) is the vertical gap between the line and the curve.
•The best λ* maximizes that gap.

Interactive canvas idea (explicit) #

If your interactive environment supports it, show two linked panels:

Panel A (duality / optimization):

•A convex curve ψ(λ).
•A movable line λt (you drag t; line slope changes).
•A highlighted λ* where the vertical gap is maximized.
•Display: exponent = ψ*(t) = λ*t − ψ(λ*).

Panel B (resulting tail):

•Plot the bound exp(−ψ*(t)) as a function of t.
•As you move t in Panel A, show the corresponding tail point in Panel B.

This makes “optimize over λ” feel like reading off a best supporting line.

Step 3: Use structure to bound ψ #

So far, everything is exact but useless unless we can bound ψ_Z(λ).

The entire field of concentration inequalities is essentially a library of ways to upper bound ψ for different kinds of random variables.

For this node, we focus on:

•Independent bounded summands → quadratic upper bound on ψ (Hoeffding-type)
•Bernoulli/Binomial → exact-ish ψ → KL divergence exponents (Chernoff)

Before going there, one more crucial simplification:

ψ for sums of independent terms #

Let S = ∑ᵢ Xᵢ with independent Xᵢ. For centered terms (or after centering),

ψS(λ)=∑iψXi(λ).\psi_S(\lambda) = \sum_i \psi_{X_i}(\lambda).ψS(λ)=i∑ψXi(λ).

So if you can bound each ψ_{X_i}(λ), you can bound ψ_S(λ) by summing.

That is why bounded-difference / bounded-summand principles are so powerful: they give a uniform bound per term.

Core Mechanic 2: Bounded Summands → Hoeffding’s Lemma → Hoeffding Bounds #

Hoeffding’s inequality is the canonical “bounded independent noise averages out” theorem.

We’ll build it in two stages:

Hoeffding’s lemma: a single bounded random variable has a sub-Gaussian mgf.
Additivity of ψ for independent sums + λ-optimization yields Hoeffding’s inequality.

Stage 1: Hoeffding’s lemma (mgf bound for a bounded variable) #

Let X be a random variable with support in an interval [a, b]. Define the centered variable:

Y=X−E[X].Y = X - \mathbb{E}[X].Y=X−E[X].

Hoeffding’s lemma states:

E[eλY]≤exp⁡(λ2(b−a)28)∀λ∈R.\mathbb{E}[e^{\lambda Y}] \le \exp\left(\frac{\lambda^2 (b-a)^2}{8}\right) \quad \forall \lambda \in \mathbb{R}.E[eλY]≤exp(8λ2(b−a)2)∀λ∈R.

Equivalently in CGF form:

ψY(λ)≤λ2(b−a)28.\psi_Y(\lambda) \le \frac{\lambda^2 (b-a)^2}{8}.ψY(λ)≤8λ2(b−a)2.

Why this is the key (motivation) #

This is the precise bridge from boundedness to quadratic ψ, and quadratic ψ is what produces Gaussian-like tails exp(−t²).

A useful mental model:

•A truly Gaussian variable G ∼ N(0, σ²) has

ψG(λ)=log⁡E[eλG]=σ2λ22.\psi_G(\lambda) = \log \mathbb{E}[e^{\lambda G}] = \frac{\sigma^2 \lambda^2}{2}.ψG(λ)=logE[eλG]=2σ2λ2.

So Hoeffding’s lemma says: “a bounded centered variable behaves like a Gaussian in terms of its mgf, with effective variance proxy (b−a)²/4.” (Up to constants.)

A brief sketch of why it’s true #

A full proof uses convexity (or a comparison to a two-point distribution at the endpoints). The intuition:

•For fixed bounds [a, b] and fixed mean, the mgf is maximized by a distribution that places all mass at the endpoints.
•That reduces the worst case to a simple two-point calculation.

For this lesson, the important takeaway is the usable inequality:

ψX−EX(λ)≤λ2(b−a)28.\psi_{X-\mathbb{E}X}(\lambda) \le \frac{\lambda^2 (b-a)^2}{8}.ψX−EX(λ)≤8λ2(b−a)2.

Stage 2: Hoeffding inequality for sums #

Let X₁, …, X_n be independent, with Xᵢ ∈ [aᵢ, bᵢ] almost surely. Let

S=∑i=1nXi,μ=E[S]=∑i=1nE[Xi].S = \sum_{i=1}^n X_i, \quad \mu = \mathbb{E}[S] = \sum_{i=1}^n \mathbb{E}[X_i].S=i=1∑nXi,μ=E[S]=i=1∑nE[Xi].

Define the centered sum Z = S − μ = ∑ᵢ (Xᵢ − E[Xᵢ]).

Step A: Apply exponential Markov #

For λ > 0,

P(S−μ≥t)=P(Z≥t)≤exp⁡(ψZ(λ)−λt).\mathbb{P}(S-\mu \ge t) = \mathbb{P}(Z \ge t) \le \exp(\psi_Z(\lambda) - \lambda t).P(S−μ≥t)=P(Z≥t)≤exp(ψZ(λ)−λt).

Step B: Use independence to sum ψ #

Because the centered terms are independent,

ψZ(λ)=∑i=1nψXi−EXi(λ).\psi_Z(\lambda) = \sum_{i=1}^n \psi_{X_i-\mathbb{E}X_i}(\lambda).ψZ(λ)=i=1∑nψXi−EXi(λ).

Step C: Bound each ψ via Hoeffding’s lemma #

Each Xᵢ − E[Xᵢ] is still bounded in an interval of width (bᵢ − aᵢ). So

ψXi−EXi(λ)≤λ2(bi−ai)28.\psi_{X_i-\mathbb{E}X_i}(\lambda) \le \frac{\lambda^2 (b_i-a_i)^2}{8}.ψXi−EXi(λ)≤8λ2(bi−ai)2.

Summing gives

ψZ(λ)≤λ28∑i=1n(bi−ai)2.\psi_Z(\lambda) \le \frac{\lambda^2}{8} \sum_{i=1}^n (b_i-a_i)^2.ψZ(λ)≤8λ2i=1∑n(bi−ai)2.

Define the “range-sum” parameter

V=∑i=1n(bi−ai)2.V = \sum_{i=1}^n (b_i-a_i)^2.V=i=1∑n(bi−ai)2.

Then

P(Z≥t)≤exp⁡(λ2V8−λt).\mathbb{P}(Z \ge t) \le \exp\left(\frac{\lambda^2 V}{8} - \lambda t\right).P(Z≥t)≤exp(8λ2V−λt).

Step D: Optimize over λ #

We minimize the exponent in λ:

h(λ)=λ2V8−λt.h(\lambda) = \frac{\lambda^2 V}{8} - \lambda t.h(λ)=8λ2V−λt.

Differentiate:

h′(λ)=λV4−t.h'(\lambda) = \frac{\lambda V}{4} - t.h′(λ)=4λV−t.

Set h'(λ)=0:

λ∗V4=t⇒λ∗=4tV.\frac{\lambda^* V}{4} = t \quad\Rightarrow\quad \lambda^* = \frac{4t}{V}.4λ∗V=t⇒λ∗=V4t.

Plug back in:

\n1) Compute the quadratic term:

(λ∗)2V8=(16t2/V2)V8=16t28V=2t2V.\frac{(\lambda^*)^2 V}{8} = \frac{(16 t^2 / V^2) V}{8} = \frac{16 t^2}{8V} = \frac{2t^2}{V}.8(λ∗)2V=8(16t2/V2)V=8V16t2=V2t2.

Compute the linear term:

λ∗t=4tV⋅t=4t2V.\lambda^* t = \frac{4t}{V} \cdot t = \frac{4t^2}{V}.λ∗t=V4t⋅t=V4t2.

h(λ∗)=2t2V−4t2V=−2t2V.h(\lambda^*) = \frac{2t^2}{V} - \frac{4t^2}{V} = -\frac{2t^2}{V}.h(λ∗)=V2t2−V4t2=−V2t2.

Therefore

P(S−μ≥t)≤exp⁡(−2t2∑i=1n(bi−ai)2).\mathbb{P}(S-\mu \ge t) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).P(S−μ≥t)≤exp(−∑i=1n(bi−ai)22t2).

That is Hoeffding’s inequality.

Two-sided version #

Similarly,

P(S−μ≤−t)≤exp⁡(−2t2∑i=1n(bi−ai)2).\mathbb{P}(S-\mu \le -t) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).P(S−μ≤−t)≤exp(−∑i=1n(bi−ai)22t2).

By a union bound,

P(∣S−μ∣≥t)≤2exp⁡(−2t2∑i=1n(bi−ai)2).\mathbb{P}(|S-\mu| \ge t) \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).P(∣S−μ∣≥t)≤2exp(−∑i=1n(bi−ai)22t2).

The sample mean form #

If Xᵢ ∈ [0,1], then (bᵢ − aᵢ)=1 and V = n.

Let \bar{X} = (1/n)∑ Xᵢ and E[\bar{X}] = μ̄.

Then

\mathbb{P}(\bar{X} - \mū \ge \varepsilon) = \mathbb{P}(S - \mu \ge n\varepsilon) \le \exp\left(-2n\varepsilon^2\right).

And two-sided:

\mathbb{P}(|\bar{X} - \mū| \ge \varepsilon) \le 2\exp(-2n\varepsilon^2).

Visualization reinforcement: quadratic ψ bound → sub-Gaussian tail curves #

This is the second visualization the node asked for.

In Hoeffding’s proof, we replaced the true ψ_Z(λ) by an upper quadratic:

ψZ(λ)≤λ2V8.\psi_Z(\lambda) \le \frac{\lambda^2 V}{8}.ψZ(λ)≤8λ2V.

Geometrically, you can plot:

•The true ψ_Z(λ) (unknown generally)
•The quadratic upper envelope (λ²V/8)

Then the optimized exponent becomes the conjugate of the quadratic:

sup⁡λ≥0(λt−λ2V/8)=2t2V.\sup_{\lambda \ge 0}(\lambda t - \lambda^2 V/8) = \frac{2t^2}{V}.λ≥0sup(λt−λ2V/8)=V2t2.

So the tail is bounded by exp(−2t²/V).

Interactive canvas idea (explicit):

•Let the user pick n, δ.
•Show ε(δ) such that 2exp(−2nε²)=δ, i.e.

ε(δ)=log⁡(2/δ)2n.\varepsilon(\delta) = \sqrt{\frac{\log(2/\delta)}{2n}}.ε(δ)=2nlog(2/δ).

•Plot the family of curves exp(−2nε²) as n varies.
•Show how changing n shifts the curve down (stronger concentration).

This connects directly to “how many samples do I need for error ≤ ε with probability ≥ 1−δ?”

Application/Connection: Chernoff Bounds, KL Exponents, and How ψ Controls Tail Decay #

Hoeffding is powerful and simple, but it is sometimes loose because it uses only boundedness.

Chernoff bounds refine the story when the distribution is known (especially Bernoulli) by using a sharper ψ.

Bernoulli sums and a sharper ψ #

Let Xᵢ ∼ Bernoulli(p) independent. Then S = ∑ Xᵢ ∼ Binomial(n, p), and μ = E[S] = np.

For a single Bernoulli X:

E[eλX]=(1−p) e0+p eλ=1−p+peλ.\mathbb{E}[e^{\lambda X}] = (1-p),e^{0} + p,e^{\lambda} = 1-p + pe^{\lambda}.E[eλX]=(1−p)e0+peλ=1−p+peλ.

ψX(λ)=log⁡(1−p+peλ).\psi_X(\lambda) = \log(1-p + pe^{\lambda}).ψX(λ)=log(1−p+peλ).

For the sum S, independence gives

ψS(λ)=nlog⁡(1−p+peλ).\psi_S(\lambda) = n\log(1-p + pe^{\lambda}).ψS(λ)=nlog(1−p+peλ).

For the centered sum Z = S − μ,

ψZ(λ)=ψS(λ)−λμ=nlog⁡(1−p+peλ)−λnp.\psi_Z(\lambda) = \psi_S(\lambda) - \lambda\mu = n\log(1-p + pe^{\lambda}) - \lambda np.ψZ(λ)=ψS(λ)−λμ=nlog(1−p+peλ)−λnp.

Now apply exponential Markov:

P(S−μ≥t)≤exp⁡(nlog⁡(1−p+peλ)−λnp−λt).\mathbb{P}(S-\mu \ge t) \le \exp\left(n\log(1-p + pe^{\lambda}) - \lambda np - \lambda t\right).P(S−μ≥t)≤exp(nlog(1−p+peλ)−λnp−λt).

Let’s express the event in relative terms. Set t = n\varepsilon, i.e. S ≥ n(p+ε). Then

P(Sn≥p+ε)≤exp⁡(n[log⁡(1−p+peλ)−λ(p+ε)]).\mathbb{P}\left(\frac{S}{n} \ge p+\varepsilon\right) \le \exp\left(n\big[\log(1-p + pe^{\lambda}) - \lambda(p+\varepsilon)\big]\right).P(nS≥p+ε)≤exp(n[log(1−p+peλ)−λ(p+ε)]).

We optimize over λ ≥ 0.

The KL-divergence form (the cleanest statement) #

Let q = p+ε (so q > p). The optimal exponent is

inf⁡λ>0(log⁡(1−p+peλ)−λq)=−D(q∥p),\inf_{\lambda>0} \Big(\log(1-p+pe^{\lambda}) - \lambda q\Big) = -D(q|p),λ>0inf(log(1−p+peλ)−λq)=−D(q∥p),

where D(q|p) is the Bernoulli KL divergence:

D(q∥p)=qlog⁡qp+(1−q)log⁡1−q1−p.D(q|p) = q\log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}.D(q∥p)=qlogpq+(1−q)log1−p1−q.

Thus the Chernoff bound becomes

P(Sn≥q)≤exp⁡(−n D(q∥p)).\mathbb{P}\left(\frac{S}{n} \ge q\right) \le \exp\big(-n,D(q|p)\big).P(nS≥q)≤exp(−nD(q∥p)).

Similarly for the lower tail (q < p):

P(Sn≤q)≤exp⁡(−n D(q∥p)).\mathbb{P}\left(\frac{S}{n} \le q\right) \le \exp\big(-n,D(q|p)\big).P(nS≤q)≤exp(−nD(q∥p)).

This is a major conceptual upgrade:

•Hoeffding: exponent is quadratic in (q−p), distribution-free.
•Chernoff: exponent is KL divergence, distribution-aware and often tighter.

Recovering the “classic” multiplicative Chernoff forms #

A popular parameterization is S ≥ (1+δ)μ with δ>0.

Since μ=np, the event is S ≥ (1+δ)np.

A standard Chernoff upper tail bound is:

P(S≥(1+δ)μ)≤(eδ(1+δ)1+δ)μ.\mathbb{P}(S \ge (1+\delta)\mu) \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}.P(S≥(1+δ)μ)≤((1+δ)1+δeδ)μ.

And a simpler (looser) but easy-to-remember form:

P(S≥(1+δ)μ)≤exp⁡(−μδ22+δ).\mathbb{P}(S \ge (1+\delta)\mu) \le \exp\left(-\frac{\mu\delta^2}{2+\delta}\right).P(S≥(1+δ)μ)≤exp(−2+δμδ2).

Lower tail for δ∈(0,1):

P(S≤(1−δ)μ)≤exp⁡(−μδ22).\mathbb{P}(S \le (1-\delta)\mu) \le \exp\left(-\frac{\mu\delta^2}{2}\right).P(S≤(1−δ)μ)≤exp(−2μδ2).

These are derived by selecting a convenient λ rather than solving exactly, or by bounding KL by a quadratic.

Connection back to ψ(λ) vs λt geometry #

For Bernoulli sums,

•ψ_Z(λ) is known exactly.
•The best exponent is the convex conjugate ψ_Z*(t).
•That conjugate turns out to be n·D(q||p) when t corresponds to q.

This is the same duality picture as before, but now the optimized supporting line corresponds to a KL divergence.

Visualization suggestion (Chernoff / KL) #

Extend the earlier Panel A:

•Plot ψ_Z(λ) for a chosen p and n (or per-sample ψ and then multiply by n).
•Fix a target deviation q>p.
•Plot the line λ·n(q−p) (or equivalently incorporate centering as above).
•Highlight λ* and show the exponent nD(q||p).

Then in Panel B, plot two bounds versus q:

•Chernoff: exp(−nD(q||p))
•Hoeffding: exp(−2n(q−p)²)

This comparison makes the “distribution-aware” advantage visible.

Where Markov chains connect (high level) #

Even though this node centers on independent sums, the exponential Markov method generalizes.

In Markov chain settings, you often want concentration for empirical averages along a trajectory:

1n∑t=1nf(Xt).\frac{1}{n}\sum_{t=1}^n f(X_t).n1t=1∑nf(Xt).

Independence fails, but one can still bound exponential moments using spectral gaps / mixing and obtain Hoeffding-like or Bernstein-like inequalities for dependent data.

The important connection is conceptual:

•Still use exp(λ·sum) + Markov.
•Still bound ψ(λ) (now harder).
•Still optimize over λ.

So learning the ψ(λ) vs λt optimization here pays off later in dependent concentration.

Summary table: Hoeffding vs Chernoff #

Inequality	Setting	Tail behavior	What ψ bound uses	Typical exponent
Hoeffding	Independent, bounded Xᵢ ∈ [aᵢ,bᵢ]	Sub-Gaussian	Quadratic upper bound on ψ (Hoeffding lemma)	2t² / ∑(bᵢ−aᵢ)²
Chernoff	Bernoulli/Binomial (and related)	Often tighter; KL form	Exact ψ, then optimize	n·D(q

Both are the same engine + different ψ information.

Worked Examples (3) #

Hoeffding bound for a sample mean in [0,1] and solving for n given (ε, δ) #

Let X₁,…,X_n be independent with Xᵢ ∈ [0,1]. Let \bar{X} = (1/n)∑Xᵢ and μ̄ = E[\bar{X}]. We want n such that P(|\bar{X} − μ̄| ≥ ε) ≤ δ.

Start from two-sided Hoeffding for bounded [0,1] variables:
P(|\bar{X} − μ̄| ≥ ε) ≤ 2 exp(−2nε²).
Set the bound ≤ δ:
2 exp(−2nε²) ≤ δ.
Divide by 2:
exp(−2nε²) ≤ δ/2.
Take logs (note log is increasing):
−2nε² ≤ log(δ/2).
Multiply by −1 (flips inequality):
2nε² ≥ log(2/δ).
Solve for n:
n≥log⁡(2/δ)2ε2.n \ge \frac{\log(2/\delta)}{2\varepsilon^2}.n≥2ε2log(2/δ).
Interpretation: to cut ε in half, you need ~4× more samples; to make δ 10× smaller, you need only an additive increase of (log 10)/(2ε²).

Insight: This is the classic PAC-style sample complexity tradeoff. The square dependence on ε comes from the quadratic ψ upper bound (sub-Gaussianity), while the log dependence on 1/δ comes from the exponential tail.

Deriving Hoeffding’s inequality via ψ(λ) and optimizing λ explicitly #

Let X₁,…,X_n independent with Xᵢ ∈ [aᵢ,bᵢ]. Let S=∑Xᵢ and μ=E[S]. Define V=∑(bᵢ−aᵢ)². Show P(S−μ ≥ t) ≤ exp(−2t²/V).

Let Z = S − μ = ∑(Xᵢ − E[Xᵢ]). We apply exponential Markov:
For any λ>0,
P(Z ≥ t) = P(e^{λZ} ≥ e^{λt}) ≤ E[e^{λZ}] / e^{λt}.
Take logs via ψ:
P(Z ≥ t) ≤ exp( ψ_Z(λ) − λt ), where ψ_Z(λ)=log E[e^{λZ}].
Use independence to add CGFs:
ψ_Z(λ) = ∑ ψ_{Xᵢ−E[Xᵢ]}(λ).
Apply Hoeffding’s lemma to each term:
ψ_{Xᵢ−E[Xᵢ]}(λ) ≤ λ²(bᵢ−aᵢ)²/8.
Sum the bounds:
ψ_Z(λ) ≤ (λ²/8)∑(bᵢ−aᵢ)² = λ²V/8.
So P(Z ≥ t) ≤ exp( λ²V/8 − λt ).
Optimize h(λ)=λ²V/8 − λt:
h'(λ)=λV/4 − t = 0 ⇒ λ* = 4t/V.
Plug in:
h(λ*) = ( (16t²/V²)·V / 8 ) − (4t/V)·t = (2t²/V) − (4t²/V) = −2t²/V.
Conclude:
P(S−μ≥t)≤exp⁡(−2t2V).P(S−\mu \ge t) \le \exp\left(−\frac{2t^2}{V}\right).P(S−μ≥t)≤exp(−V2t2).

Insight: This example makes the optimization/duality tangible: once ψ is upper-bounded by a quadratic, the entire tail bound becomes a one-line quadratic minimization in λ.

Chernoff bound for a Binomial upper tail in KL form #

Let S ∼ Binomial(n,p). For a target fraction q>p, bound P(S/n ≥ q).

Use exponential Markov on S:
For λ>0,
P(S ≥ nq) = P(e^{λS} ≥ e^{λnq}) ≤ E[e^{λS}] / e^{λnq}.
Compute the mgf of a binomial via independence:
E[e^{λS}] = (E[e^{λX₁}])^n with X₁∼Bern(p).
E[e^{λX₁}] = 1−p + pe^{λ}.
So:
P(S/n ≥ q) ≤ exp( n log(1−p+pe^{λ}) − λnq ).
Optimize over λ>0:
Take derivative w.r.t. λ of the exponent per-sample:
φ(λ)=log(1−p+pe^{λ}) − λq.
φ'(λ) = (pe^{λ})/(1−p+pe^{λ}) − q.
Set φ'(λ*)=0:
(pe^{λ*})/(1−p+pe^{λ*}) = q.
Solve for e^{λ*}:
q(1−p+pe^{λ*}) = pe^{λ*}
q(1−p) = pe^{λ*}(1−q)
⇒ e^{λ*} = (q(1−p))/(p(1−q)).
Plugging λ* back yields the KL form:
min_λ [log(1−p+pe^{λ}) − λq] = −D(q||p),
so
P(S/n≥q)≤exp⁡(−nD(q∥p)).P(S/n \ge q) \le \exp\big(−n D(q|p)\big).P(S/n≥q)≤exp(−nD(q∥p)).

Insight: Chernoff is the same exponential Markov method, but with an exact ψ. The optimization condition ψ'(λ*)=t becomes an explicit equation whose solution encodes the tilted (exponentially reweighted) Bernoulli distribution, and the exponent becomes a KL divergence.

Key Takeaways #

✓
The exponential Markov method is the universal engine: apply Markov to e^{λX} to turn P(X≥t) into an mgf/CGF bound.
✓
The cumulant-generating function ψ_X(λ)=log E[e^{λX}] controls tail decay; for sums of independent variables, ψ adds: ψ_{∑Xᵢ}(λ)=∑ψ_{Xᵢ}(λ).
✓
Tail bounds are optimization problems: P(Z≥t) ≤ exp( inf_{λ>0}(ψ_Z(λ)−λt) ) = exp(−sup_{λ>0}(λt−ψ_Z(λ))).
✓
The ψ(λ) vs λt picture is geometric duality: maximizing the gap λt−ψ(λ) gives the best exponent; the optimizer satisfies ψ'(λ*)=t when differentiable.
✓
Hoeffding’s lemma gives a quadratic upper bound on ψ for bounded variables, implying sub-Gaussian tails; Hoeffding’s inequality follows by summing ψ and optimizing λ.
✓
For Bernoulli/Binomial sums, Chernoff bounds use a sharper ψ and yield KL-divergence exponents exp(−nD(q||p)), often tighter than Hoeffding.
✓
In high-probability estimation, concentration turns (ε,δ) requirements into explicit sample complexity like n ≥ log(2/δ)/(2ε²) for [0,1] averages.

Common Mistakes #

✗
Forgetting to center: bounding P(S≥t) when you really need P(S−E[S]≥t) changes ψ and often worsens constants or breaks symmetry.
✗
Optimizing λ with the wrong sign (e.g., using λ>0 for a lower tail); for P(Z≤−t) you typically apply the method to −Z (or λ<0).
✗
Assuming Hoeffding always beats everything: it ignores variance and distribution shape; Bernstein/Chernoff can be much tighter when additional information exists.
✗
Mixing up t for sums vs averages: if Z = \bar{X}−E\bar{X}, then the corresponding sum deviation is n·t, which changes exponents by n.

Practice #

easy

Let X₁,…,X_n be independent with Xᵢ ∈ [−1,2]. Give a bound on P(|\bar{X}−E\bar{X}| ≥ ε).

Hint: Use two-sided Hoeffding on the sum with (bᵢ−aᵢ)=3, then translate from S to \bar{X}.

Show solution

Here (bᵢ−aᵢ)=3 for each i, so V=∑(bᵢ−aᵢ)² = n·9 = 9n.

For S=∑Xᵢ and μ=E[S], Hoeffding gives:

P(|S−μ| ≥ t) ≤ 2 exp(−2t²/V) = 2 exp(−2t²/(9n)).

Now |\bar{X}−E\bar{X}| ≥ ε is equivalent to |S−μ| ≥ nε, so t=nε:

P(|\bar{X}−E\bar{X}| ≥ ε) ≤ 2 exp(−2(nε)²/(9n)) = 2 exp(−2nε²/9).

medium

Let S ∼ Binomial(n,p). Using the exponential Markov method, show that for any λ>0:

P(S/n ≥ q) ≤ exp(n[log(1−p+pe^{λ}) − λq]). Then write the condition that defines the optimal λ*.

Hint: Compute E[e^{λS}] via independence; then differentiate the exponent per-sample and set to zero.

Show solution

Exponential Markov:

P(S≥nq)=P(e^{λS}≥e^{λnq})≤E[e^{λS}]/e^{λnq}.

For Bernoulli X, E[e^{λX}]=1−p+pe^{λ}. For S=∑Xᵢ, E[e^{λS}]=(1−p+pe^{λ})^n.

Thus

P(S/n ≥ q) ≤ (1−p+pe^{λ})^n / e^{λnq} = exp(n[log(1−p+pe^{λ}) − λq]).

For optimal λ*, minimize f(λ)=log(1−p+pe^{λ}) − λq.

Set derivative to zero:

f'(λ)=(pe^{λ})/(1−p+pe^{λ}) − q = 0,

so λ satisfies (pe^{λ})/(1−p+pe^{λ*}) = q.

hard

Suppose Xᵢ ∈ [0,1] independent, and you want P(\bar{X} − E\bar{X} ≥ ε) ≤ δ (one-sided). Solve for the smallest n that Hoeffding guarantees.

Hint: Use the one-sided Hoeffding bound exp(−2nε²) and solve exp(−2nε²) ≤ δ.

Show solution

For Xᵢ∈[0,1], Hoeffding one-sided gives:

P(\bar{X} − E\bar{X} ≥ ε) ≤ exp(−2nε²).

Require exp(−2nε²) ≤ δ.

Take logs:

−2nε² ≤ log δ.

Multiply by −1 (flip):

2nε² ≥ log(1/δ).

Thus

n≥log⁡(1/δ)2ε2.n \ge \frac{\log(1/\delta)}{2\varepsilon^2}.n≥2ε2log(1/δ).

Connections #

Next nodes you might connect:

•Markov's Inequality and Tail Bounds — the starting tool behind the exponential Markov method.
•Moment Generating Functions — ψ(λ) = log mgf and why convexity/duality appears.
•Sub-Gaussian Random Variables — abstraction of “quadratic ψ” beyond boundedness.
•Bernstein Inequalities — variance-sensitive concentration (often tighter than Hoeffding).
•Large Deviations and Cramér’s Theorem — the general theory where ψ* governs exponential rates.
•Concentration for Markov Chains — extending the same exp(λ·sum) idea to dependent samples.

Quality: A (4.3/5)

← back to tree browse all →