Concentration Inequalities

←Back to Tech Tree

inventorycoverage

Concentration Inequalities #

Probability & StatisticsDifficulty: ★★★★★Depth: 7Unlocks: 0

Chernoff, Hoeffding bounds. Tail probabilities.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

psi_X(lambda) = log E[exp(lambda X)] (cumulant-generating function, i.e., log mgf)

Essential Relationships #

Prerequisites (2) #

Central Limit Theorem6 atomsMarkov Chains6 atoms

Referenced by (1) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (1) #

[Tail RiskBusiness

Chernoff and Hoeffding bounds are the mathematical formalization of tail risk - they bound the probability mass in distribution tails, which is exactly what characterizing tail risk requires beyond variance alone](/business/tail-risk/)

Advanced Learning Details

Graph Position #

120

Depth Cost

0

Fan-Out (ROI)

0

Bottleneck Score

7

Chain Length

Cognitive Load #

6

Atomic Elements

41

Total Elements

L3

Percentile Level

L4

Atomic Level

All Concepts (18) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

When you average noisy data, you feel like it should stabilize. Concentration inequalities make that feeling precise: they turn “rare big deviations” into concrete, exponentially small tail probabilities—with constants you can compute and optimize.

TL;DR:

Concentration inequalities bound tail probabilities like P(S − E[S] ≥ t). The core move is the exponential Markov method: apply Markov to exp(λ(S−E[S])) to get P(S−E[S] ≥ t) ≤ exp(ψ_S(λ) − λt). Optimizing over λ gives a Legendre/duality picture: the best exponent is sup_λ (λt − ψ_S(λ)) (a convex conjugate). For sums of independent bounded variables, Hoeffding’s lemma yields a quadratic upper bound on ψ, giving sub-Gaussian tails (Hoeffding bounds). For Bernoulli sums, sharper ψ yields Chernoff bounds with KL-divergence exponents.

What Is a Concentration Inequality? #

A concentration inequality is a guarantee that a random quantity stays close to its typical value (usually its expectation) with high probability.

Concretely, if S is a random variable (often a sum or average), we want bounds of the form

The defining feature is that these probabilities decay fast in t (often like exp(−c t²) or exp(−c t)). This is the mathematical version of “averaging reduces noise.”

Why concentration is not the CLT #

You already know the Central Limit Theorem (CLT): sums of many independent variables become approximately normal under mild conditions.

But CLT answers a different question:

In practice, concentration is what you use when you want statements like:

“With probability at least 1 − δ, the sample mean differs from the true mean by at most ε.”

This is a tail probability question.

The common structure #

Most concentration results for sums of independent variables share a template:

  1. 1)Convert a tail event into an inequality about an exponential moment.
  2. 2)Bound that exponential moment using some structure (independence, boundedness, sub-Gaussianity, etc.).
  3. 3)Optimize a parameter (λ) to get the best exponent.

This node focuses on the classic pair:

And the unifying engine behind both:

The key symbol: the cumulant-generating function #

For a random variable X, define its moment-generating function (mgf) when it exists:

MX(λ)=E[eλX]M_X(\lambda) = \mathbb{E}[e^{\lambda X}]MX​(λ)=E[eλX]

and its cumulant-generating function (CGF)

ψX(λ)=log⁡E[eλX]\psi_X(\lambda) = \log \mathbb{E}[e^{\lambda X}]ψX​(λ)=logE[eλX]

Why the log? Because for independent sums it turns products into sums:

If X and Y are independent, then

ψX+Y(λ)=log⁡E[eλ(X+Y)]=log⁡(E[eλX]E[eλY])=ψX(λ)+ψY(λ).\psi_{X+Y}(\lambda) = \log \mathbb{E}[e^{\lambda(X+Y)}] = \log(\mathbb{E}[e^{\lambda X}]\mathbb{E}[e^{\lambda Y}]) = \psi_X(\lambda) + \psi_Y(\lambda).ψX+Y​(λ)=logE[eλ(X+Y)]=log(E[eλX]E[eλY])=ψX​(λ)+ψY​(λ).

This additivity is the backbone of concentration for sums.

Two mental pictures to keep in mind #

  1. Tail bound as an optimization problem:

We will repeatedly get bounds that look like

P(S−ES≥t)≤exp⁡(ψS(λ)−λt)\mathbb{P}(S - \mathbb{E}S \ge t) \le \exp(\psi_S(\lambda) - \lambda t)P(S−ES≥t)≤exp(ψS​(λ)−λt)

and then choose the best λ ≥ 0.

  1. Geometry / duality:

The exponent comes from comparing the curve ψ(λ) against the line λt.

We will make that picture explicit later, because it is the most reliable way to remember what’s going on.

Core Mechanic 1: Exponential Markov Method and the ψ(λ) vs λt Optimization #

The starting point is humble: Markov’s inequality.

For a nonnegative random variable Y and a > 0,

P(Y≥a)≤E[Y]a.\mathbb{P}(Y \ge a) \le \frac{\mathbb{E}[Y]}{a}.P(Y≥a)≤aE[Y]​.

This is often too weak directly. The trick is to manufacture a useful nonnegative variable by exponentiating.

Step 1: Convert a tail event into an exponential event #

Suppose we want an upper tail for X:

P(X≥t).\mathbb{P}(X \ge t).P(X≥t).

For any λ > 0,

X≥t  ⟺  eλX≥eλt.X \ge t \iff e^{\lambda X} \ge e^{\lambda t}.X≥t⟺eλX≥eλt.

Since e^{\lambda X} ≥ 0, we can apply Markov:

P(X≥t)=P(eλX≥eλt)≤E[eλX]eλt.\mathbb{P}(X \ge t) = \mathbb{P}(e^{\lambda X} \ge e^{\lambda t}) \le \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda t}}.P(X≥t)=P(eλX≥eλt)≤eλtE[eλX]​.

That is

P(X≥t)≤exp⁡(ψX(λ)−λt).\mathbb{P}(X \ge t) \le \exp(\psi_X(\lambda) - \lambda t).P(X≥t)≤exp(ψX​(λ)−λt).

This is the exponential Markov method.

Centering matters #

Most concentration statements are about deviations from the mean. Let Z = S − E[S]. Then

P(S−ES≥t)=P(Z≥t)≤exp⁡(ψZ(λ)−λt).\mathbb{P}(S - \mathbb{E}S \ge t) = \mathbb{P}(Z \ge t) \le \exp(\psi_Z(\lambda) - \lambda t).P(S−ES≥t)=P(Z≥t)≤exp(ψZ​(λ)−λt).

Note that

ψZ(λ)=log⁡E[eλ(S−ES)]=log⁡(e−λES E[eλS])=ψS(λ)−λES.\psi_Z(\lambda) = \log \mathbb{E}[e^{\lambda(S-\mathbb{E}S)}] = \log \left(e^{-\lambda\mathbb{E}S},\mathbb{E}[e^{\lambda S}]\right) = \psi_S(\lambda) - \lambda \mathbb{E}S.ψZ​(λ)=logE[eλ(S−ES)]=log(e−λESE[eλS])=ψS​(λ)−λES.

So you can either work with ψ_S and subtract λE[S], or directly define ψ for the centered variable.

Step 2: Optimize over λ #

The inequality holds for every λ > 0. So take the best one:

P(Z≥t)≤inf⁡λ>0exp⁡(ψZ(λ)−λt).\mathbb{P}(Z \ge t) \le \inf_{\lambda > 0} \exp(\psi_Z(\lambda) - \lambda t).P(Z≥t)≤λ>0inf​exp(ψZ​(λ)−λt).

Equivalently,

P(Z≥t)≤exp⁡(−sup⁡λ>0(λt−ψZ(λ))).\mathbb{P}(Z \ge t) \le \exp\Big( - \sup_{\lambda > 0} (\lambda t - \psi_Z(\lambda)) \Big).P(Z≥t)≤exp(−λ>0sup​(λt−ψZ​(λ))).

Define the convex conjugate (Legendre transform) of ψ_Z:

ψZ∗(t)=sup⁡λ∈R(λt−ψZ(λ)).\psi_Z^*(t) = \sup_{\lambda \in \mathbb{R}} (\lambda t - \psi_Z(\lambda)).ψZ∗​(t)=λ∈Rsup​(λt−ψZ​(λ)).

Then the clean conceptual form is:

P(Z≥t)≤e−ψZ∗(t).\mathbb{P}(Z \ge t) \le e^{-\psi_Z^*(t)}.P(Z≥t)≤e−ψZ∗​(t).

This “duality” is not just fancy notation: it is exactly the geometry of “supporting lines” to a convex function.

The geometry: ψ(λ) and its supporting line #

Because ψ_Z(λ) is convex in λ (log mgf is convex), the function

g(λ)=ψZ(λ)−λtg(\lambda) = \psi_Z(\lambda) - \lambda tg(λ)=ψZ​(λ)−λt

is convex too (convex minus linear). Minimizing it has a crisp first-order condition when differentiable:

g′(λ)=ψZ′(λ)−t=0⇒ψZ′(λ∗)=t.g'(\lambda) = \psi_Z'(\lambda) - t = 0 \quad\Rightarrow\quad \psi_Z'(\lambda^*) = t.g′(λ)=ψZ′​(λ)−t=0⇒ψZ′​(λ∗)=t.

Interpretation:

This directly matches the visualization suggestion:

Interactive canvas idea (explicit) #

If your interactive environment supports it, show two linked panels:

Panel A (duality / optimization):

Panel B (resulting tail):

This makes “optimize over λ” feel like reading off a best supporting line.

Step 3: Use structure to bound ψ #

So far, everything is exact but useless unless we can bound ψ_Z(λ).

The entire field of concentration inequalities is essentially a library of ways to upper bound ψ for different kinds of random variables.

For this node, we focus on:

Before going there, one more crucial simplification:

ψ for sums of independent terms #

Let S = ∑ᵢ Xᵢ with independent Xᵢ. For centered terms (or after centering),

ψS(λ)=∑iψXi(λ).\psi_S(\lambda) = \sum_i \psi_{X_i}(\lambda).ψS​(λ)=i∑​ψXi​​(λ).

So if you can bound each ψ_{X_i}(λ), you can bound ψ_S(λ) by summing.

That is why bounded-difference / bounded-summand principles are so powerful: they give a uniform bound per term.

Core Mechanic 2: Bounded Summands → Hoeffding’s Lemma → Hoeffding Bounds #

Hoeffding’s inequality is the canonical “bounded independent noise averages out” theorem.

We’ll build it in two stages:

  1. Hoeffding’s lemma: a single bounded random variable has a sub-Gaussian mgf.

  2. Additivity of ψ for independent sums + λ-optimization yields Hoeffding’s inequality.

Stage 1: Hoeffding’s lemma (mgf bound for a bounded variable) #

Let X be a random variable with support in an interval [a, b]. Define the centered variable:

Y=X−E[X].Y = X - \mathbb{E}[X].Y=X−E[X].

Hoeffding’s lemma states:

E[eλY]≤exp⁡(λ2(b−a)28)∀λ∈R.\mathbb{E}[e^{\lambda Y}] \le \exp\left(\frac{\lambda^2 (b-a)^2}{8}\right) \quad \forall \lambda \in \mathbb{R}.E[eλY]≤exp(8λ2(b−a)2​)∀λ∈R.

Equivalently in CGF form:

ψY(λ)≤λ2(b−a)28.\psi_Y(\lambda) \le \frac{\lambda^2 (b-a)^2}{8}.ψY​(λ)≤8λ2(b−a)2​.

Why this is the key (motivation) #

This is the precise bridge from boundedness to quadratic ψ, and quadratic ψ is what produces Gaussian-like tails exp(−t²).

A useful mental model:

ψG(λ)=log⁡E[eλG]=σ2λ22.\psi_G(\lambda) = \log \mathbb{E}[e^{\lambda G}] = \frac{\sigma^2 \lambda^2}{2}.ψG​(λ)=logE[eλG]=2σ2λ2​.

So Hoeffding’s lemma says: “a bounded centered variable behaves like a Gaussian in terms of its mgf, with effective variance proxy (b−a)²/4.” (Up to constants.)

A brief sketch of why it’s true #

A full proof uses convexity (or a comparison to a two-point distribution at the endpoints). The intuition:

For this lesson, the important takeaway is the usable inequality:

ψX−EX(λ)≤λ2(b−a)28.\psi_{X-\mathbb{E}X}(\lambda) \le \frac{\lambda^2 (b-a)^2}{8}.ψX−EX​(λ)≤8λ2(b−a)2​.

Stage 2: Hoeffding inequality for sums #

Let X₁, …, X_n be independent, with Xᵢ ∈ [aᵢ, bᵢ] almost surely. Let

S=∑i=1nXi,μ=E[S]=∑i=1nE[Xi].S = \sum_{i=1}^n X_i, \quad \mu = \mathbb{E}[S] = \sum_{i=1}^n \mathbb{E}[X_i].S=i=1∑n​Xi​,μ=E[S]=i=1∑n​E[Xi​].

Define the centered sum Z = S − μ = ∑ᵢ (Xᵢ − E[Xᵢ]).

Step A: Apply exponential Markov #

For λ > 0,

P(S−μ≥t)=P(Z≥t)≤exp⁡(ψZ(λ)−λt).\mathbb{P}(S-\mu \ge t) = \mathbb{P}(Z \ge t) \le \exp(\psi_Z(\lambda) - \lambda t).P(S−μ≥t)=P(Z≥t)≤exp(ψZ​(λ)−λt).

Step B: Use independence to sum ψ #

Because the centered terms are independent,

ψZ(λ)=∑i=1nψXi−EXi(λ).\psi_Z(\lambda) = \sum_{i=1}^n \psi_{X_i-\mathbb{E}X_i}(\lambda).ψZ​(λ)=i=1∑n​ψXi​−EXi​​(λ).

Step C: Bound each ψ via Hoeffding’s lemma #

Each Xᵢ − E[Xᵢ] is still bounded in an interval of width (bᵢ − aᵢ). So

ψXi−EXi(λ)≤λ2(bi−ai)28.\psi_{X_i-\mathbb{E}X_i}(\lambda) \le \frac{\lambda^2 (b_i-a_i)^2}{8}.ψXi​−EXi​​(λ)≤8λ2(bi​−ai​)2​.

Summing gives

ψZ(λ)≤λ28∑i=1n(bi−ai)2.\psi_Z(\lambda) \le \frac{\lambda^2}{8} \sum_{i=1}^n (b_i-a_i)^2.ψZ​(λ)≤8λ2​i=1∑n​(bi​−ai​)2.

Define the “range-sum” parameter

V=∑i=1n(bi−ai)2.V = \sum_{i=1}^n (b_i-a_i)^2.V=i=1∑n​(bi​−ai​)2.

Then

P(Z≥t)≤exp⁡(λ2V8−λt).\mathbb{P}(Z \ge t) \le \exp\left(\frac{\lambda^2 V}{8} - \lambda t\right).P(Z≥t)≤exp(8λ2V​−λt).

Step D: Optimize over λ #

We minimize the exponent in λ:

h(λ)=λ2V8−λt.h(\lambda) = \frac{\lambda^2 V}{8} - \lambda t.h(λ)=8λ2V​−λt.

Differentiate:

h′(λ)=λV4−t.h'(\lambda) = \frac{\lambda V}{4} - t.h′(λ)=4λV​−t.

Set h'(λ)=0:

λ∗V4=t⇒λ∗=4tV.\frac{\lambda^* V}{4} = t \quad\Rightarrow\quad \lambda^* = \frac{4t}{V}.4λ∗V​=t⇒λ∗=V4t​.

Plug back in:

\n1) Compute the quadratic term:

(λ∗)2V8=(16t2/V2)V8=16t28V=2t2V.\frac{(\lambda^*)^2 V}{8} = \frac{(16 t^2 / V^2) V}{8} = \frac{16 t^2}{8V} = \frac{2t^2}{V}.8(λ∗)2V​=8(16t2/V2)V​=8V16t2​=V2t2​.

  1. Compute the linear term:

λ∗t=4tV⋅t=4t2V.\lambda^* t = \frac{4t}{V} \cdot t = \frac{4t^2}{V}.λ∗t=V4t​⋅t=V4t2​.

So

h(λ∗)=2t2V−4t2V=−2t2V.h(\lambda^*) = \frac{2t^2}{V} - \frac{4t^2}{V} = -\frac{2t^2}{V}.h(λ∗)=V2t2​−V4t2​=−V2t2​.

Therefore

P(S−μ≥t)≤exp⁡(−2t2∑i=1n(bi−ai)2).\mathbb{P}(S-\mu \ge t) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).P(S−μ≥t)≤exp(−∑i=1n​(bi​−ai​)22t2​).

That is Hoeffding’s inequality.

Two-sided version #

Similarly,

P(S−μ≤−t)≤exp⁡(−2t2∑i=1n(bi−ai)2).\mathbb{P}(S-\mu \le -t) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).P(S−μ≤−t)≤exp(−∑i=1n​(bi​−ai​)22t2​).

By a union bound,

P(∣S−μ∣≥t)≤2exp⁡(−2t2∑i=1n(bi−ai)2).\mathbb{P}(|S-\mu| \ge t) \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).P(∣S−μ∣≥t)≤2exp(−∑i=1n​(bi​−ai​)22t2​).

The sample mean form #

If Xᵢ ∈ [0,1], then (bᵢ − aᵢ)=1 and V = n.

Let \bar{X} = (1/n)∑ Xᵢ and E[\bar{X}] = μ̄.

Then

\mathbb{P}(\bar{X} - \mū \ge \varepsilon) = \mathbb{P}(S - \mu \ge n\varepsilon) \le \exp\left(-2n\varepsilon^2\right).

And two-sided:

\mathbb{P}(|\bar{X} - \mū| \ge \varepsilon) \le 2\exp(-2n\varepsilon^2).

Visualization reinforcement: quadratic ψ bound → sub-Gaussian tail curves #

This is the second visualization the node asked for.

In Hoeffding’s proof, we replaced the true ψ_Z(λ) by an upper quadratic:

ψZ(λ)≤λ2V8.\psi_Z(\lambda) \le \frac{\lambda^2 V}{8}.ψZ​(λ)≤8λ2V​.

Geometrically, you can plot:

Then the optimized exponent becomes the conjugate of the quadratic:

sup⁡λ≥0(λt−λ2V/8)=2t2V.\sup_{\lambda \ge 0}(\lambda t - \lambda^2 V/8) = \frac{2t^2}{V}.λ≥0sup​(λt−λ2V/8)=V2t2​.

So the tail is bounded by exp(−2t²/V).

Interactive canvas idea (explicit):

ε(δ)=log⁡(2/δ)2n.\varepsilon(\delta) = \sqrt{\frac{\log(2/\delta)}{2n}}.ε(δ)=2nlog(2/δ)​​.

This connects directly to “how many samples do I need for error ≤ ε with probability ≥ 1−δ?”

Application/Connection: Chernoff Bounds, KL Exponents, and How ψ Controls Tail Decay #

Hoeffding is powerful and simple, but it is sometimes loose because it uses only boundedness.

Chernoff bounds refine the story when the distribution is known (especially Bernoulli) by using a sharper ψ.

Bernoulli sums and a sharper ψ #

Let Xᵢ ∼ Bernoulli(p) independent. Then S = ∑ Xᵢ ∼ Binomial(n, p), and μ = E[S] = np.

For a single Bernoulli X:

E[eλX]=(1−p) e0+p eλ=1−p+peλ.\mathbb{E}[e^{\lambda X}] = (1-p),e^{0} + p,e^{\lambda} = 1-p + pe^{\lambda}.E[eλX]=(1−p)e0+peλ=1−p+peλ.

So

ψX(λ)=log⁡(1−p+peλ).\psi_X(\lambda) = \log(1-p + pe^{\lambda}).ψX​(λ)=log(1−p+peλ).

For the sum S, independence gives

ψS(λ)=nlog⁡(1−p+peλ).\psi_S(\lambda) = n\log(1-p + pe^{\lambda}).ψS​(λ)=nlog(1−p+peλ).

For the centered sum Z = S − μ,

ψZ(λ)=ψS(λ)−λμ=nlog⁡(1−p+peλ)−λnp.\psi_Z(\lambda) = \psi_S(\lambda) - \lambda\mu = n\log(1-p + pe^{\lambda}) - \lambda np.ψZ​(λ)=ψS​(λ)−λμ=nlog(1−p+peλ)−λnp.

Now apply exponential Markov:

P(S−μ≥t)≤exp⁡(nlog⁡(1−p+peλ)−λnp−λt).\mathbb{P}(S-\mu \ge t) \le \exp\left(n\log(1-p + pe^{\lambda}) - \lambda np - \lambda t\right).P(S−μ≥t)≤exp(nlog(1−p+peλ)−λnp−λt).

Let’s express the event in relative terms. Set t = n\varepsilon, i.e. S ≥ n(p+ε). Then

P(Sn≥p+ε)≤exp⁡(n[log⁡(1−p+peλ)−λ(p+ε)]).\mathbb{P}\left(\frac{S}{n} \ge p+\varepsilon\right) \le \exp\left(n\big[\log(1-p + pe^{\lambda}) - \lambda(p+\varepsilon)\big]\right).P(nS​≥p+ε)≤exp(n[log(1−p+peλ)−λ(p+ε)]).

We optimize over λ ≥ 0.

The KL-divergence form (the cleanest statement) #

Let q = p+ε (so q > p). The optimal exponent is

inf⁡λ>0(log⁡(1−p+peλ)−λq)=−D(q∥p),\inf_{\lambda>0} \Big(\log(1-p+pe^{\lambda}) - \lambda q\Big) = -D(q|p),λ>0inf​(log(1−p+peλ)−λq)=−D(q∥p),

where D(q|p) is the Bernoulli KL divergence:

D(q∥p)=qlog⁡qp+(1−q)log⁡1−q1−p.D(q|p) = q\log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}.D(q∥p)=qlogpq​+(1−q)log1−p1−q​.

Thus the Chernoff bound becomes

P(Sn≥q)≤exp⁡(−n D(q∥p)).\mathbb{P}\left(\frac{S}{n} \ge q\right) \le \exp\big(-n,D(q|p)\big).P(nS​≥q)≤exp(−nD(q∥p)).

Similarly for the lower tail (q < p):

P(Sn≤q)≤exp⁡(−n D(q∥p)).\mathbb{P}\left(\frac{S}{n} \le q\right) \le \exp\big(-n,D(q|p)\big).P(nS​≤q)≤exp(−nD(q∥p)).

This is a major conceptual upgrade:

Recovering the “classic” multiplicative Chernoff forms #

A popular parameterization is S ≥ (1+δ)μ with δ>0.

Since μ=np, the event is S ≥ (1+δ)np.

A standard Chernoff upper tail bound is:

P(S≥(1+δ)μ)≤(eδ(1+δ)1+δ)μ.\mathbb{P}(S \ge (1+\delta)\mu) \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}.P(S≥(1+δ)μ)≤((1+δ)1+δeδ​)μ.

And a simpler (looser) but easy-to-remember form:

P(S≥(1+δ)μ)≤exp⁡(−μδ22+δ).\mathbb{P}(S \ge (1+\delta)\mu) \le \exp\left(-\frac{\mu\delta^2}{2+\delta}\right).P(S≥(1+δ)μ)≤exp(−2+δμδ2​).

Lower tail for δ∈(0,1):

P(S≤(1−δ)μ)≤exp⁡(−μδ22).\mathbb{P}(S \le (1-\delta)\mu) \le \exp\left(-\frac{\mu\delta^2}{2}\right).P(S≤(1−δ)μ)≤exp(−2μδ2​).

These are derived by selecting a convenient λ rather than solving exactly, or by bounding KL by a quadratic.

Connection back to ψ(λ) vs λt geometry #

For Bernoulli sums,

This is the same duality picture as before, but now the optimized supporting line corresponds to a KL divergence.

Visualization suggestion (Chernoff / KL) #

Extend the earlier Panel A:

Then in Panel B, plot two bounds versus q:

This comparison makes the “distribution-aware” advantage visible.

Where Markov chains connect (high level) #

Even though this node centers on independent sums, the exponential Markov method generalizes.

In Markov chain settings, you often want concentration for empirical averages along a trajectory:

1n∑t=1nf(Xt).\frac{1}{n}\sum_{t=1}^n f(X_t).n1​t=1∑n​f(Xt​).

Independence fails, but one can still bound exponential moments using spectral gaps / mixing and obtain Hoeffding-like or Bernstein-like inequalities for dependent data.

The important connection is conceptual:

So learning the ψ(λ) vs λt optimization here pays off later in dependent concentration.

Summary table: Hoeffding vs Chernoff #

InequalitySettingTail behaviorWhat ψ bound usesTypical exponent
HoeffdingIndependent, bounded Xᵢ ∈ [aᵢ,bᵢ]Sub-GaussianQuadratic upper bound on ψ (Hoeffding lemma)2t² / ∑(bᵢ−aᵢ)²
ChernoffBernoulli/Binomial (and related)Often tighter; KL formExact ψ, then optimizen·D(q

Both are the same engine + different ψ information.

Worked Examples (3) #

Hoeffding bound for a sample mean in [0,1] and solving for n given (ε, δ) #

Let X₁,…,X_n be independent with Xᵢ ∈ [0,1]. Let \bar{X} = (1/n)∑Xᵢ and μ̄ = E[\bar{X}]. We want n such that P(|\bar{X} − μ̄| ≥ ε) ≤ δ.

  1. Start from two-sided Hoeffding for bounded [0,1] variables:

    P(|\bar{X} − μ̄| ≥ ε) ≤ 2 exp(−2nε²).

  2. Set the bound ≤ δ:

    2 exp(−2nε²) ≤ δ.

  3. Divide by 2:

    exp(−2nε²) ≤ δ/2.

  4. Take logs (note log is increasing):

    −2nε² ≤ log(δ/2).

  5. Multiply by −1 (flips inequality):

    2nε² ≥ log(2/δ).

  6. Solve for n:

    n≥log⁡(2/δ)2ε2.n \ge \frac{\log(2/\delta)}{2\varepsilon^2}.n≥2ε2log(2/δ)​.

  7. Interpretation: to cut ε in half, you need ~4× more samples; to make δ 10× smaller, you need only an additive increase of (log 10)/(2ε²).

Insight: This is the classic PAC-style sample complexity tradeoff. The square dependence on ε comes from the quadratic ψ upper bound (sub-Gaussianity), while the log dependence on 1/δ comes from the exponential tail.

Deriving Hoeffding’s inequality via ψ(λ) and optimizing λ explicitly #

Let X₁,…,X_n independent with Xᵢ ∈ [aᵢ,bᵢ]. Let S=∑Xᵢ and μ=E[S]. Define V=∑(bᵢ−aᵢ)². Show P(S−μ ≥ t) ≤ exp(−2t²/V).

  1. Let Z = S − μ = ∑(Xᵢ − E[Xᵢ]). We apply exponential Markov:

    For any λ>0,

    P(Z ≥ t) = P(e^{λZ} ≥ e^{λt}) ≤ E[e^{λZ}] / e^{λt}.

  2. Take logs via ψ:

    P(Z ≥ t) ≤ exp( ψ_Z(λ) − λt ), where ψ_Z(λ)=log E[e^{λZ}].

  3. Use independence to add CGFs:

    ψ_Z(λ) = ∑ ψ_{Xᵢ−E[Xᵢ]}(λ).

  4. Apply Hoeffding’s lemma to each term:

    ψ_{Xᵢ−E[Xᵢ]}(λ) ≤ λ²(bᵢ−aᵢ)²/8.

  5. Sum the bounds:

    ψ_Z(λ) ≤ (λ²/8)∑(bᵢ−aᵢ)² = λ²V/8.

  6. So P(Z ≥ t) ≤ exp( λ²V/8 − λt ).

  7. Optimize h(λ)=λ²V/8 − λt:

    h'(λ)=λV/4 − t = 0 ⇒ λ* = 4t/V.

  8. Plug in:

    h(λ*) = ( (16t²/V²)·V / 8 ) − (4t/V)·t = (2t²/V) − (4t²/V) = −2t²/V.

  9. Conclude:

    P(S−μ≥t)≤exp⁡(−2t2V).P(S−\mu \ge t) \le \exp\left(−\frac{2t^2}{V}\right).P(S−μ≥t)≤exp(−V2t2​).

Insight: This example makes the optimization/duality tangible: once ψ is upper-bounded by a quadratic, the entire tail bound becomes a one-line quadratic minimization in λ.

Chernoff bound for a Binomial upper tail in KL form #

Let S ∼ Binomial(n,p). For a target fraction q>p, bound P(S/n ≥ q).

  1. Use exponential Markov on S:

    For λ>0,

    P(S ≥ nq) = P(e^{λS} ≥ e^{λnq}) ≤ E[e^{λS}] / e^{λnq}.

  2. Compute the mgf of a binomial via independence:

    E[e^{λS}] = (E[e^{λX₁}])^n with X₁∼Bern(p).

    E[e^{λX₁}] = 1−p + pe^{λ}.

  3. So:

    P(S/n ≥ q) ≤ exp( n log(1−p+pe^{λ}) − λnq ).

  4. Optimize over λ>0:

    Take derivative w.r.t. λ of the exponent per-sample:

    φ(λ)=log(1−p+pe^{λ}) − λq.

    φ'(λ) = (pe^{λ})/(1−p+pe^{λ}) − q.

  5. Set φ'(λ*)=0:

    (pe^{λ*})/(1−p+pe^{λ*}) = q.

    Solve for e^{λ*}:

    q(1−p+pe^{λ*}) = pe^{λ*}

    q(1−p) = pe^{λ*}(1−q)

    ⇒ e^{λ*} = (q(1−p))/(p(1−q)).

  6. Plugging λ* back yields the KL form:

    min_λ [log(1−p+pe^{λ}) − λq] = −D(q||p),

    so

    P(S/n≥q)≤exp⁡(−nD(q∥p)).P(S/n \ge q) \le \exp\big(−n D(q|p)\big).P(S/n≥q)≤exp(−nD(q∥p)).

Insight: Chernoff is the same exponential Markov method, but with an exact ψ. The optimization condition ψ'(λ*)=t becomes an explicit equation whose solution encodes the tilted (exponentially reweighted) Bernoulli distribution, and the exponent becomes a KL divergence.

Key Takeaways #

Common Mistakes #

Practice #

easy

Let X₁,…,X_n be independent with Xᵢ ∈ [−1,2]. Give a bound on P(|\bar{X}−E\bar{X}| ≥ ε).

Hint: Use two-sided Hoeffding on the sum with (bᵢ−aᵢ)=3, then translate from S to \bar{X}.

Show solution

Here (bᵢ−aᵢ)=3 for each i, so V=∑(bᵢ−aᵢ)² = n·9 = 9n.

For S=∑Xᵢ and μ=E[S], Hoeffding gives:

P(|S−μ| ≥ t) ≤ 2 exp(−2t²/V) = 2 exp(−2t²/(9n)).

Now |\bar{X}−E\bar{X}| ≥ ε is equivalent to |S−μ| ≥ nε, so t=nε:

P(|\bar{X}−E\bar{X}| ≥ ε) ≤ 2 exp(−2(nε)²/(9n)) = 2 exp(−2nε²/9).

medium

Let S ∼ Binomial(n,p). Using the exponential Markov method, show that for any λ>0:

P(S/n ≥ q) ≤ exp(n[log(1−p+pe^{λ}) − λq]). Then write the condition that defines the optimal λ*.

Hint: Compute E[e^{λS}] via independence; then differentiate the exponent per-sample and set to zero.

Show solution

Exponential Markov:

P(S≥nq)=P(e^{λS}≥e^{λnq})≤E[e^{λS}]/e^{λnq}.

For Bernoulli X, E[e^{λX}]=1−p+pe^{λ}. For S=∑Xᵢ, E[e^{λS}]=(1−p+pe^{λ})^n.

Thus

P(S/n ≥ q) ≤ (1−p+pe^{λ})^n / e^{λnq} = exp(n[log(1−p+pe^{λ}) − λq]).

For optimal λ*, minimize f(λ)=log(1−p+pe^{λ}) − λq.

Set derivative to zero:

f'(λ)=(pe^{λ})/(1−p+pe^{λ}) − q = 0,

so λ satisfies (pe^{λ})/(1−p+pe^{λ*}) = q.

hard

Suppose Xᵢ ∈ [0,1] independent, and you want P(\bar{X} − E\bar{X} ≥ ε) ≤ δ (one-sided). Solve for the smallest n that Hoeffding guarantees.

Hint: Use the one-sided Hoeffding bound exp(−2nε²) and solve exp(−2nε²) ≤ δ.

Show solution

For Xᵢ∈[0,1], Hoeffding one-sided gives:

P(\bar{X} − E\bar{X} ≥ ε) ≤ exp(−2nε²).

Require exp(−2nε²) ≤ δ.

Take logs:

−2nε² ≤ log δ.

Multiply by −1 (flip):

2nε² ≥ log(1/δ).

Thus

n≥log⁡(1/δ)2ε2.n \ge \frac{\log(1/\delta)}{2\varepsilon^2}.n≥2ε2log(1/δ)​.

Connections #

Next nodes you might connect:

Quality: A (4.3/5)

← back to treebrowse all →