←Back to Tech Tree
inventorycoverage
Rate-Distortion Theory #
Information TheoryDifficulty: ★★★★★Depth: 8Unlocks: 0
Optimal lossy compression. Tradeoff between bits and fidelity.
Interactive Visualization #
⏮◀◀▶▶STEP0.25x1xZOOM
t=0s
Core Concepts #
- -Single-letter distortion measure: a nonnegative function d(x,x_hat) and the expected distortion constraint E[d(X,X_hat)] \u2264 D (per-symbol average).
- -Rate-distortion function: R(D) as the minimal information cost (bits per symbol) required for lossy compression at distortion level D.
Key Symbols & Notation #
R(D) (rate-distortion function)p(x_hat|x) (test-channel / conditional reconstruction distribution)
Essential Relationships #
- -R(D) = inf_{p(x_hat|x): E[d(X,X_hat)] \u2264 D} I(X;X_hat) (variational definition linking distortion, test channel, and mutual information)
Prerequisites (2) #
Entropy5 atomsMutual Information5 atoms
Advanced Learning Details
Graph Position #
100
Depth Cost
0
Fan-Out (ROI)
0
Bottleneck Score
8
Chain Length
Cognitive Load #
5
Atomic Elements
38
Total Elements
L2
Percentile Level
L3
Atomic Level
All Concepts (16) #
- Distortion measure d(x,x̂): a nonnegative function that quantifies fidelity loss between a source symbol x and a reproduction symbol x̂
- Reproduction alphabet X̂: the set of allowed reconstruction symbols (distinct from the source alphabet)
- Expected (average) distortion E[d(X,X̂)] = sum_{x,x̂} p(x,x̂) d(x,x̂): the constraint used in lossy compression
- Rate–Distortion function R(D): the minimum information rate (bits/symbol) required to achieve average distortion ≤ D
- Distortion–Rate function D(R): minimal achievable distortion given a rate R (inverse viewpoint)
- Test channel / conditional reproduction law p(x̂|x): the randomized mapping from source symbol x to reconstruction x̂ used in the single-letter formulation
- Operational coding primitives for RD: encoder, decoder, codebook, block length n, rate R=(1/n) log M, and average distortion over a block
- Single-letter characterization: the asymptotic (per-symbol) reduction of the coding problem to a per-letter optimization over p(x̂|x)
- Lagrangian (parametric) formulation: introducing a multiplier β (or λ) to convert the constrained minimization to an unconstrained one
- Parametric representation R(β) (or R–D curve via β): computing R(D) by sweeping the Lagrange multiplier
- Optimal test channel structure (exponential family form) at optimum: p*(x̂|x) proportional to p(x̂) exp(-β d(x,x̂))
- Convexity and monotonicity properties: R(D) is a nonincreasing, convex function of D
- Achievability and converse (rate–distortion coding theorem): R(D) is both an achievable rate and a lower bound asymptotically
- Blahut–Arimoto algorithm for Rate–Distortion: iterative numerical method to compute R(D) and the optimal p(x̂|x)
- Important canonical examples/closed forms (as concepts): e.g., Gaussian source with MSE where R(D) = (1/2) log(σ^2/D)
- Shannon lower bound and other analytic bounds useful for approximating R(D)
Teaching Strategy #
Quick unlock - significant prerequisite investment but simple final step. Verify prerequisites first.
Lossless compression asks: “How many bits do I need to reproduce the data exactly?” Rate–distortion theory asks the harder—and more practical—question: “If I’m allowed to be slightly wrong, what is the absolute minimum number of bits I need?”
TL;DR:
Rate–distortion theory characterizes the optimal tradeoff between compression rate (bits/symbol) and allowed average distortion. For a source X with distortion measure d(x,x̂), the rate–distortion function is
R(D)=minp(x^∣x): E[d(X,X^)]≤DI(X;X^).R(D)=\min_{p(\hat x|x):;\mathbb E[d(X,\hat X)]\le D} I(X;\hat X).R(D)=p(x^∣x):E[d(X,X^)]≤DminI(X;X^).
It is decreasing and convex in D, with R(0)=H(X) under a “zero-distortion implies exact reconstruction” condition. Operationally, R(D) is the best achievable asymptotic rate for block lossy coding, and the minimizing conditional distribution p(x̂|x) is the “test channel” that describes the optimal stochastic reconstruction mechanism.
What Is Rate–Distortion Theory? #
Rate–distortion theory is the part of information theory that formalizes optimal lossy compression. “Lossy” means the decoder’s reconstruction X^\hat XX^ need not equal the original source symbol XXX; instead we control a fidelity criterion.
The central objects #
A source XXX with distribution p(x)p(x)p(x). Typically we assume an i.i.d. source sequence Xn=(X1,…,Xn)X^n=(X₁,\dots,X_n)Xn=(X1,…,Xn) where XiX_iXi are independent and identically distributed like XXX.
A distortion measure (single-letter) d(x,x^)d(x,\hat x)d(x,x^) satisfying
- •d(x,x^)≥0d(x,\hat x)\ge 0d(x,x^)≥0 for all (x,x^)(x,\hat x)(x,x^),
- •usually d(x,x)=0d(x,x)=0d(x,x)=0 (not strictly required, but common).
From this we define the per-symbol average distortion constraint
E [d(X,X^)]≤D.\mathbb E,[d(X,\hat X)] \le D.E[d(X,X^)]≤D.
For blocks, a standard extension is
dn(xn,x^n)=1n∑i=1nd(xi,x^i),d_n(x^n,\hat x^n)=\frac{1}{n}\sum_{i=1}^n d(x_i,\hat x_i),dn(xn,x^n)=n1i=1∑nd(xi,x^i),
and we ask that E[dn(Xn,X^n)]≤D\mathbb E[d_n(X^n,\hat X^n)]\le DE[dn(Xn,X^n)]≤D.
- A rate RRR in bits/symbol. If a block encoder maps XnX^nXn to an index M∈{1,…,2nR}M\in{1,\dots,2^{nR}}M∈{1,…,2nR}, then RRR measures the number of bits available per source symbol.
The question RD theory answers #
Given an allowed distortion level DDD, what is the smallest achievable rate RRR?
This minimum is the rate–distortion function R(D)R(D)R(D).
The clean single-letter characterization #
For a discrete memoryless source (DMS) with single-letter distortion, Shannon’s rate–distortion theorem states
R(D)=minp(x^∣x): E[d(X,X^)]≤DI(X;X^).R(D)=\min_{p(\hat x|x):;\mathbb E[d(X,\hat X)]\le D} I(X;\hat X).R(D)=p(x^∣x):E[d(X,X^)]≤DminI(X;X^).
The minimizer p(x^∣x)p(\hat x|x)p(x^∣x) is called a test channel. It is not a physical channel; it is a mathematical object describing the optimal way (in an information-theoretic sense) to correlate the reconstruction with the source under the distortion constraint.
An RD curve: the picture you should have in mind #
Even without interactivity, it helps to visualize R(D)R(D)R(D) as a curve:
Rate R(D) (bits/symbol)
^
|\
| \
| \
| \
| \
| \
| \
+--------------------> Distortion D
D_max
- •Left side (small D): high fidelity → high rate.
- •Right side (large D): low fidelity → low rate.
- •Typically, R(D)R(D)R(D) is nonincreasing and convex in DDD.
A useful intuition: “bits buy you correlation” #
R(D)R(D)R(D) is a minimum over I(X;X^)I(X;\hat X)I(X;X^). Mutual information measures how many bits per symbol the reconstruction “knows” about the source. Allowing more distortion lets X^\hat XX^ be less correlated with XXX, reducing I(X;X^)I(X;\hat X)I(X;X^).
Caveats (so you don’t overgeneralize) #
- •Continuous alphabets: one must be careful with differential entropy, existence of minimizers, and distortion constraints. The formula still has an analog, but the operational meaning needs measure-theoretic care.
- •When does $R(0)=H(X)$ hold? A common statement is R(0)=H(X)R(0)=H(X)R(0)=H(X). This holds under the condition that zero distortion forces exact reconstruction, i.e. d(x,x^)=0d(x,\hat x)=0d(x,x^)=0 iff x^=x\hat x=xx^=x (or at least uniquely identifies xxx). If your distortion measure has d(x,x^)=0d(x,\hat x)=0d(x,x^)=0 for multiple x^≠x\hat x\ne xx^=x, then R(0)R(0)R(0) can be smaller.
- •RD is asymptotic: the theorem is about large blocklength nnn with vanishing probability of exceeding distortion (or average distortion), not necessarily about small finite files.
Prerequisites (Accuracy Box): What You Must Already Be Comfortable With #
This node builds on several ideas. You said you already know entropy and mutual information; that’s necessary but not sufficient for a smooth ride at difficulty 5.
Required prerequisites #
| Topic | What you need here | Quick reminder |
|---|
| Entropy | Interpret H(X)H(X)H(X) as optimal lossless rate | H(X)=−∑xp(x)logp(x)H(X)=-\sum_x p(x)\log p(x)H(X)=−∑xp(x)logp(x) |
| Mutual information | Understand I(X;Y)I(X;Y)I(X;Y) as “bits of dependence” | I(X;Y)=∑x,yp(x,y)logp(x,y)p(x)p(y)I(X;Y)=\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}I(X;Y)=∑x,yp(x,y)logp(x)p(y)p(x,y) |
| i.i.d. block coding / typicality | Why coding theorems are asymptotic and why “per-symbol” rates emerge | Typical sequences occupy about $2^{nH(X)}$ outcomes |
| Conditional distributions | Comfort manipulating $p(\hat x | x)andinduced and induced andinducedp(\hat x)$ |
| Basic convex optimization | Recognize convexity, constraints, and Lagrange multipliers/KKT | Minimize objective subject to inequality constraints |
Helpful (but not strictly mandatory) background #
- •Jensen’s inequality and convex functions.
- •Log-sum inequality (often used to prove convexity and information inequalities).
- •A bit of familiarity with channel capacity is helpful because RD is a “dual sibling” of capacity (minimize III subject to distortion vs maximize III subject to cost).
Key caveats to keep in mind as you learn #
Discrete vs continuous: for continuous sources, H(X)H(X)H(X) is replaced by differential entropy h(X)h(X)h(X), which is not directly a coding rate. Operational statements often involve mutual information directly and require additional regularity.
Distortion definition matters: average distortion can be defined in expectation, with high probability, or as excess-distortion probability. Standard Shannon RD uses expected per-letter distortion; variants exist.
$R(0)=H(X)$ is conditional: as noted above, it’s true in the common case where d(x,x^)=0d(x,\hat x)=0d(x,x^)=0 only when x^=x\hat x=xx^=x.
If any of these prerequisites feel shaky, it’s worth revisiting them before you attempt the proofs or algorithms (e.g., Blahut–Arimoto) in the later sections.
Core Mechanic 1: Distortion Measures and What “Fidelity” Really Means #
Before optimizing anything, we must decide what “good reconstruction” means. Rate–distortion theory forces you to be explicit: you provide a distortion measure d(x,x^)d(x,\hat x)d(x,x^) and a target distortion level DDD.
Single-letter distortion and per-symbol averaging #
A single-letter distortion assigns a penalty to each symbol pair (x,x^)(x,\hat x)(x,x^). For sequences, we typically average:
dn(xn,x^n)=1n∑i=1nd(xi,x^i).d_n(x^n,\hat x^n)=\frac{1}{n}\sum_{i=1}^n d(x_i,\hat x_i).dn(xn,x^n)=n1i=1∑nd(xi,x^i).
The average matters because it makes the constraint scale nicely with nnn and matches i.i.d. assumptions.
We then require (one common formulation)
E [dn(Xn,X^n)]≤D.\mathbb E,[d_n(X^n,\hat X^n)] \le D.E[dn(Xn,X^n)]≤D.
A subtle point: This is an expected distortion constraint; it allows rare large-distortion events as long as they don’t contribute too much to the average. Another common alternative is an excess-distortion probability constraint Pr(dn(Xn,X^n)>D)≤ϵ\Pr(d_n(X^n,\hat X^n)>D)\le \epsilonPr(dn(Xn,X^n)>D)≤ϵ.
Examples of distortion measures #
- Hamming distortion for discrete alphabets:
d(x,x^)=1{x≠x^}.d(x,\hat x)=\mathbf{1}{x\ne \hat x}.d(x,x^)=1{x=x^}.
Then E[d(X,X^)]=Pr(X≠X^)\mathbb E[d(X,\hat X)] = \Pr(X\ne \hat X)E[d(X,X^)]=Pr(X=X^), i.e. average distortion equals symbol error rate.
- Squared error for real-valued sources:
d(x,x^)=(x−x^)2.d(x,\hat x)=(x-\hat x)^2.d(x,x^)=(x−x^)2.
Then DDD is a mean-squared error (MSE) constraint.
- Weighted distortions: sometimes different symbols are more costly to distort.
Feasible distortion range: DminD_{\min}Dmin to DmaxD_{\max}Dmax #
Two distortion levels are especially important.
•Best possible distortion (often 0):
•If X^=X\hat X=XX^=X is allowed, then Dmin=minE[d(X,X^)]D_{\min}=\min \mathbb E[d(X,\hat X)]Dmin=minE[d(X,X^)] is typically 0.
•Distortion with zero rate: what if the decoder gets no bits?
•Then X^\hat XX^ cannot depend on XXX at all. The best it can do is choose a fixed reproduction distribution p(x^)p(\hat x)p(x^) (or a constant x^\hat xx^).
•Define
Dmax=minp(x^)E[d(X,X^)]subject to X^⊥X.D_{\max}=\min_{p(\hat x)} \mathbb E[d(X,\hat X)] \quad \text{subject to } \hat X \perp X.Dmax=p(x^)minE[d(X,X^)]subject to X^⊥X.
For Hamming distortion, a zero-rate decoder should output the most probable symbol, so Dmax=1−maxxp(x)D_{\max}=1-\max_x p(x)Dmax=1−maxxp(x).
This gives a boundary condition you should remember:
- •For D≥DmaxD\ge D_{\max}D≥Dmax, you can achieve distortion with zero bits, so R(D)=0R(D)=0R(D)=0.
Why distortion is not “error” #
In lossy compression, the reconstruction might be “wrong” but perceptually fine (audio/image), or wrong but still useful (downstream ML tasks). Rate–distortion cleanly separates:
- •Semantics: you pick d(x,x^)d(x,\hat x)d(x,x^) to encode what you care about.
- •Limits: R(D)R(D)R(D) tells you the best possible compression given that notion of fidelity.
That separation is powerful: once ddd is fixed, the rest is mathematics.
A sanity check that prevents confusion #
If your distortion is Hamming, the RD problem resembles “coding with errors.” But it is not channel coding: in RD, the encoder sees the entire xnx^nxn and chooses a description; the distortion constraint is on reconstruction quality, not on a noisy channel outcome.
This perspective will matter when we introduce the test channel: the “noise” in p(x^∣x)p(\hat x|x)p(x^∣x) is controlled by us to meet distortion, not imposed by nature.
Now we connect the operational question (“minimum bits”) to an information quantity.
The definition #
For a fixed source distribution p(x)p(x)p(x) and distortion measure d(x,x^)d(x,\hat x)d(x,x^), define
R(D)=minp(x^∣x): E[d(X,X^)]≤DI(X;X^).R(D)=\min_{p(\hat x|x):;\mathbb E[d(X,\hat X)]\le D} I(X;\hat X).R(D)=p(x^∣x):E[d(X,X^)]≤DminI(X;X^).
This is sometimes called the Shannon lower bound form (not to be confused with the separate Shannon lower bound approximation used in continuous cases). The key is: we are minimizing mutual information over all conditional distributions p(x^∣x)p(\hat x|x)p(x^∣x) that satisfy the distortion constraint.
Expand the pieces (so it feels concrete) #
Given p(x)p(x)p(x) and a candidate test channel p(x^∣x)p(\hat x|x)p(x^∣x):
- Induced joint:
p(x,x^)=p(x)p(x^∣x).p(x,\hat x)=p(x)p(\hat x|x).p(x,x^)=p(x)p(x^∣x).
- Induced marginal:
p(x^)=∑xp(x)p(x^∣x).p(\hat x)=\sum_x p(x)p(\hat x|x).p(x^)=x∑p(x)p(x^∣x).
- Mutual information:
I(X;X^)=∑x,x^p(x)p(x^∣x)logp(x^∣x)p(x^).I(X;\hat X)=\sum_{x,\hat x} p(x)p(\hat x|x)\log\frac{p(\hat x|x)}{p(\hat x)}.I(X;X^)=x,x^∑p(x)p(x^∣x)logp(x^)p(x^∣x).
- Expected distortion:
E[d(X,X^)]=∑x,x^p(x)p(x^∣x)d(x,x^).\mathbb E[d(X,\hat X)] = \sum_{x,\hat x} p(x)p(\hat x|x)d(x,\hat x).E[d(X,X^)]=x,x^∑p(x)p(x^∣x)d(x,x^).
We choose p(x^∣x)p(\hat x|x)p(x^∣x) to minimize the first, while keeping the second ≤ D.
A rough operational interpretation:
- •If you want the decoder to output X^n\hat X^nX^n correlated with XnX^nXn, the encoder must convey information about XnX^nXn.
- •The best achievable per-symbol information transfer is captured by I(X;X^)I(X;\hat X)I(X;X^) for an optimal randomized reconstruction rule.
RD theory makes this precise: in the limit of large blocklength, the minimum description rate equals the minimum mutual information.
Fundamental properties of R(D) #
- Monotonicity: if you allow more distortion, you never need more rate.
If D2≥D1D₂\ge D₁D2≥D1, then the feasible set at D2D₂D2 contains the feasible set at D1D₁D1, so
R(D2)≤R(D1).R(D₂)\le R(D₁).R(D2)≤R(D1).
- Convexity: R(D)R(D)R(D) is convex in DDD.
Intuition: you can time-share between two codes (or two test channels) achieving distortions D1D₁D1 and D2D₂D2. Using them a fraction θ\thetaθ and $1-\thetayieldsaveragedistortion yields average distortion yieldsaveragedistortion\theta D₁+(1-\theta)D₂andaveragerate and average rate andaveragerate\theta R(D₁)+(1-\theta)R(D₂)$. Optimality then implies
R(θD1+(1−θ)D2)≤θR(D1)+(1−θ)R(D2).R(\theta D₁+(1-\theta)D₂) \le \theta R(D₁)+(1-\theta)R(D₂).R(θD1+(1−θ)D2)≤θR(D1)+(1−θ)R(D2).
That inequality is exactly convexity.
- Boundary values:
- •For large enough DDD (≥ DmaxD_{\max}Dmax), R(D)=0R(D)=0R(D)=0.
- •Under the “zero distortion implies exact reconstruction” condition, R(0)=H(X)R(0)=H(X)R(0)=H(X).
Why R(0)=H(X)R(0)=H(X)R(0)=H(X) under that condition:
- •If E[d(X,X^)]=0\mathbb E[d(X,\hat X)]=0E[d(X,X^)]=0, then d(X,X^)=0d(X,\hat X)=0d(X,X^)=0 almost surely.
- •If d(x,x^)=0d(x,\hat x)=0d(x,x^)=0 only when x^=x\hat x=xx^=x, then X^=X\hat X=XX^=X a.s.
- •Then I(X;X^)=I(X;X)=H(X)I(X;\hat X)=I(X;X)=H(X)I(X;X^)=I(X;X)=H(X).
- •Since R(0)R(0)R(0) is the minimum such III, it equals H(X)H(X)H(X).
If instead multiple reconstructions have zero distortion for a given xxx, then X^\hat XX^ might be a function that preserves less information than XXX, and I(X;X^)I(X;\hat X)I(X;X^) (hence R(0)R(0)R(0)) can be smaller.
The Lagrangian viewpoint (a workhorse) #
We often solve constrained problems by introducing a multiplier β≥0\beta\ge 0β≥0:
minp(x^∣x) I(X;X^)+β E[d(X,X^)].\min_{p(\hat x|x)}; I(X;\hat X) + \beta,\mathbb E[d(X,\hat X)].p(x^∣x)minI(X;X^)+βE[d(X,X^)].
Then, for each β\betaβ, the optimizer corresponds to a point on the RD curve. (Formally, β\betaβ parameterizes supporting lines to the convex function R(D)R(D)R(D).)
This is where convex optimization enters: I(X;X^)I(X;\hat X)I(X;X^) is convex in p(x^∣x)p(\hat x|x)p(x^∣x) for fixed p(x)p(x)p(x), and the distortion constraint is linear in p(x^∣x)p(\hat x|x)p(x^∣x), making the problem a convex program.
At optimum, the conditional often has an exponential tilt form (for discrete alphabets):
p(x^∣x)∝p(x^) e−βd(x,x^).p(\hat x|x) \propto p(\hat x),e^{-\beta d(x,\hat x)}.p(x^∣x)∝p(x^)e−βd(x,x^).
More explicitly,
p(x^∣x)=p(x^)e−βd(x,x^)∑x^′p(x^′)e−βd(x,x^′).p(\hat x|x)=\frac{p(\hat x)e^{-\beta d(x,\hat x)}}{\sum_{\hat x'} p(\hat x')e^{-\beta d(x,\hat x')}}.p(x^∣x)=∑x^′p(x^′)e−βd(x,x^′)p(x^)e−βd(x,x^).
This is not magic; it is the KKT optimality condition for minimizing mutual information subject to expected distortion. It is also the basis for the Blahut–Arimoto algorithm (iterative computation of R(D)R(D)R(D)).
Pause here and internalize the meaning: to compress optimally, you behave as if the reconstruction is drawn from a Gibbs distribution where reconstructions with smaller distortion are exponentially preferred, but also modulated by a global prior p(x^)p(\hat x)p(x^) that must self-consistently match the induced marginal.
Application/Connection: Operational Meaning, Achievability vs Converse, and How RD Is Used #
The single-letter formula is elegant, but why does it equal the true minimum code rate? Rate–distortion theory has an operational meaning: it predicts what block codes can and cannot do as n→∞n→∞n→∞.
Fix a DMS XXX and single-letter distortion ddd. For any distortion level DDD:
- •Achievability: For any rate R>R(D)R>R(D)R>R(D), there exists a sequence of (n,R)(n,R)(n,R) lossy codes whose expected distortion is ≤ D for sufficiently large nnn.
- •Converse: For any rate R<R(D)R<R(D)R<R(D), no sequence of codes can achieve expected distortion ≤ D for all large nnn.
So R(D)R(D)R(D) is the sharp threshold.
Why block coding matters #
Lossy compression gains power from coding across many symbols:
- •With n=1n=1n=1, you are designing a scalar quantizer.
- •With large nnn, you can use a large codebook of $2^{nR}$ reproduction sequences and map each source block to the nearest (lowest distortion) reproduction.
This “vector quantization” viewpoint is one way to understand why R(D)R(D)R(D) is often lower than what naive symbol-by-symbol quantization would suggest.
A typical RD-code construction (high level) #
Given an optimal test channel p(x^∣x)p(\hat x|x)p(x^∣x) achieving R(D)R(D)R(D):
Compute the induced marginal p(x^)p(\hat x)p(x^).
Randomly generate a codebook of $2^{nR}reproductionsequences reproduction sequences reproductionsequences\hat x^n(m)i.i.d.from i.i.d. from i.i.d.from\prod_i p(\hat x_i)$.
Given a source block xnx^nxn, pick an index mmm such that (xn,x^n(m))(x^n,\hat x^n(m))(xn,x^n(m)) is “jointly typical” with respect to p(x,x^)=p(x)p(x^∣x)p(x,\hat x)=p(x)p(\hat x|x)p(x,x^)=p(x)p(x^∣x) (or approximately minimizes distortion).
Send mmm; decoder outputs x^n(m)\hat x^n(m)x^n(m).
The probability that no suitable codeword exists drops when R>I(X;X^)R>I(X;\hat X)R>I(X;X^), which is the core reason mutual information appears.
The converse idea (why you can’t beat R(D)) #
For any code:
- •The message MMM is at most nRnRnR bits, so H(M)≤nRH(M)\le nRH(M)≤nR.
- •The reconstruction X^n\hat X^nX^n is a function of MMM, so information about XnX^nXn that reaches the decoder is bottlenecked.
- •One shows that
nR≥I(Xn;X^n).nR \ge I(X^n;\hat X^n).nR≥I(Xn;X^n).
Then, using memorylessness and single-letterization arguments (plus convexity), you get
1nI(Xn;X^n)≥R(E[d(X,X^)]).\frac{1}{n}I(X^n;\hat X^n) \ge R\big(\mathbb E[d(X,\hat X)]\big).n1I(Xn;X^n)≥R(E[d(X,X^)]).
Thus RRR must be at least R(D)R(D)R(D).
Where RD theory shows up in practice #
Benchmarking codecs: Real codecs (JPEG, AAC, H.264) can be compared to the theoretical RD bound for simplified source models.
Learning and VAEs: The objective in a variational autoencoder resembles an RD Lagrangian:
loss≈E[distortion]+λ I(input;latent).\text{loss} \approx \mathbb E[\text{distortion}] + \lambda, I(\text{input};\text{latent}).loss≈E[distortion]+λI(input;latent).
(Details depend on approximations; but the conceptual link is strong.)
Information bottleneck: Minimizing I(X;T)I(X;T)I(X;T) subject to preserving information about a target YYY echoes RD structure, with “distortion” defined via prediction loss.
Control and estimation: RD-like bounds appear in quantized control and remote estimation (minimum communication to achieve an MSE).
One more static RD diagram (annotated) #
R
| • (low D, high rate)
| •
| •
| •
| •
| •
|•________________________ D
0 D* D_max
At D=0: often R(0)=H(X).
At D>=D_max: R(D)=0.
The curve is decreasing and convex.
A slope (supporting line) corresponds to a Lagrange multiplier β.
When you later compute RD points numerically, you are essentially moving along this curve by varying β\betaβ.
Connecting back to prerequisites #
- •Entropy explains the lossless endpoint (under conditions).
- •Mutual information explains the “cost of correlation.”
- •Typical sequences explain why random codebooks work.
- •Convex optimization explains why the test channel has an exponential form and why the curve is convex.
With these connections in place, the rate–distortion formula should feel less like a definition and more like a deep theorem tying together coding, probability, and optimization.
Worked Examples (3) #
Binary symmetric source with Hamming distortion: compute R(D) #
Let X ~ Bernoulli(1/2) (a fair bit). Use Hamming distortion d(x,x̂)=1{x≠x̂}. Find the rate–distortion function R(D) for 0 ≤ D ≤ 1/2.
Step 1: Write the definition
R(D)=min_{p(x̂|x): E[d(X,X̂)]≤D} I(X;X̂).
Step 2: Parameterize the test channel
For binary X and binary X̂, a natural family is a binary symmetric channel (BSC) with crossover probability q:
P(X̂≠X)=q.
Then E[d(X,X̂)]=q.
So the distortion constraint is q ≤ D.
Step 3: Compute mutual information for a BSC with uniform input
If X ~ Bern(1/2) and X̂ is X flipped with prob q, then X̂ is also uniform.
Thus
I(X;X̂)=H(X̂)−H(X̂|X)=1 − H₂(q),
where H₂(q)=−q log₂ q − (1−q)log₂(1−q) is the binary entropy.
Step 4: Minimize over feasible q
We need to minimize 1 − H₂(q) subject to q ≤ D.
Since H₂(q) increases on [0,1/2], 1−H₂(q) decreases on [0,1/2].
So the minimum occurs at the largest feasible q, i.e. q=D.
Step 5: State R(D)
For 0 ≤ D ≤ 1/2,
R(D)=1 − H₂(D).
For D ≥ 1/2, you can output random bits independent of X to get distortion 1/2 at rate 0, so R(D)=0.
Insight: For a uniform bit, the optimal lossy code behaves like deliberately adding Bernoulli(D) noise: the reconstruction is a noisy version of the source. The “price” in bits is exactly the mutual information left after that noise, 1 − H₂(D).
Let X ~ Bernoulli(p) with p=0.9. Use Hamming distortion. (i) Compute D_max (minimum distortion achievable at zero rate). (ii) Explain what this implies about R(D) for large D.
Step 1: Zero rate means X̂ is independent of X
With R=0, the decoder gets no information, so it must output a constant symbol (or random symbol) not depending on X.
Step 2: Minimize expected Hamming distortion over constant reconstructions
If decoder always outputs 1, distortion is P(X=0)=1−p.
If decoder always outputs 0, distortion is P(X=1)=p.
Pick the smaller:
D_max = min(p, 1−p).
Step 3: Plug in p=0.9
D_max = min(0.9, 0.1)=0.1.
Step 4: Interpret for the RD curve
If you allow distortion D ≥ 0.1, then the decoder can simply always output 1 and achieve expected distortion 0.1 with zero bits.
Therefore R(D)=0 for all D ≥ 0.1.
Insight: The right endpoint of the RD curve depends strongly on the source distribution and distortion. For skewed sources, you can get surprisingly low distortion without sending any bits—just guess the mode.
Show the key KKT/variational step that leads to p(x̂|x) ∝ p(x̂)e^{−β d(x,x̂)} for the optimization min I(X;X̂) subject to E[d] ≤ D and ∑_{x̂} p(x̂|x)=1 for each x.
Step 1: Write the objective in a convenient form
I(X;X̂)=∑_{x,x̂} p(x)p(x̂|x) log( p(x̂|x) / p(x̂) ),
where p(x̂)=∑_x p(x)p(x̂|x).
Step 2: Form the Lagrangian
Introduce β ≥ 0 for the distortion constraint and λ(x) for normalization:
L = I(X;X̂) + β( ∑_{x,x̂} p(x)p(x̂|x)d(x,x̂) − D ) + ∑_x λ(x)( ∑_{x̂} p(x̂|x) − 1 ).
Step 3: Take a directional derivative w.r.t. p(x̂|x)
Holding p(x) fixed, set derivative to zero at optimum.
A standard calculation yields the stationarity condition:
log p(x̂|x) − log p(x̂) + 1 + β d(x,x̂) + λ(x)/p(x) = 0.
Step 4: Solve for p(x̂|x)
Rearrange:
log p(x̂|x) = log p(x̂) − β d(x,x̂) − c(x),
where c(x) collects constants for each x.
Exponentiate:
p(x̂|x) = p(x̂) e^{−β d(x,x̂)} e^{−c(x)}.
Step 5: Enforce normalization to find e^{−c(x)}
We need ∑_{x̂} p(x̂|x)=1:
1 = e^{−c(x)} ∑_{x̂} p(x̂)e^{−β d(x,x̂)}
⇒ e^{−c(x)} = 1 / ∑_{x̂} p(x̂)e^{−β d(x,x̂)}.
Step 6: Final form
Therefore
p(x̂|x)= \frac{p(x̂)e^{−β d(x,x̂)}}{\sum_{x̂'} p(x̂')e^{−β d(x,x̂')}}.
Insight: This is the mathematical heart of RD computation: the optimal encoder–decoder behavior can be represented by a self-consistent “Gibbs” conditional. Varying β traces the RD curve, and iterating the consistency between p(x̂|x) and p(x̂) leads to Blahut–Arimoto.
Key Takeaways #
✓
A distortion measure d(x,x̂) and constraint E[d(X,X̂)]≤D define what “fidelity” means; everything else follows from that choice.
✓
The rate–distortion function is the constrained minimum mutual information: $R(D)=minp(x^∣x): E[d]≤DI(X;X^).R(D)=\min_{p(\hat x|x):;\mathbb E[d]\le D} I(X;\hat X).R(D)=minp(x^∣x):E[d]≤DI(X;X^).$
✓
R(D) is nonincreasing and convex in D; for D ≥ D_max, R(D)=0.
✓
The endpoint R(0)=H(X) holds when zero distortion forces exact reconstruction (d(x,x̂)=0 ⇒ x̂=x).
✓
The optimizer p(x̂|x) is called the test channel and often has the exponential tilt form p(x̂|x) ∝ p(x̂)e^{−β d(x,x̂)}.
✓
Operationally, R(D) is the asymptotically optimal bits/symbol for block lossy coding; it is not merely a definition.
✓
Time-sharing (mixing codes) explains convexity and provides a constructive intuition for intermediate operating points.
✓
Many modern ML objectives (e.g., bottlenecked representations) resemble RD Lagrangians trading distortion against information.
Common Mistakes #
✗
Assuming R(0)=H(X) always, without checking the distortion measure’s “zero distortion implies equality” property.
✗
Confusing the test channel p(x̂|x) with a real physical channel; it is an optimization variable describing an optimal reconstruction law.
✗
Forgetting that RD results are asymptotic in blocklength n; scalar quantizers can be far from the RD bound.
✗
Mixing discrete and continuous intuitions: replacing H with differential entropy h and treating it like a code rate leads to incorrect conclusions.
Practice #
easy
Let X ~ Bernoulli(1/2) with Hamming distortion. Compute R(D) at D=0, D=0.1, and D=0.5.
Hint: Use R(D)=1−H₂(D) for 0≤D≤1/2, and R(D)=0 for D≥1/2. Recall H₂(0)=0 and H₂(1/2)=1.
Show solution
D=0: R(0)=1−H₂(0)=1−0=1 bit/symbol.
D=0.1: R(0.1)=1−H₂(0.1) ≈ 1−0.468995 ≈ 0.5310 bits/symbol.
D=0.5: R(0.5)=0 (since 1−H₂(0.5)=0 and also D≥1/2 implies zero rate).
medium
For a general discrete source X with distribution p(x) and Hamming distortion d(x,x̂)=1{x≠x̂}, compute D_max (the distortion achievable at zero rate).
Hint: At zero rate, X̂ cannot depend on X. The best constant guess is the most probable symbol.
Show solution
Zero rate means choose a fixed reconstruction x̂ minimizing P(X≠x̂). The minimizer is any mode x̂* ∈ argmax_x p(x). Then D_max = 1 − max_x p(x). Therefore R(D)=0 for all D ≥ 1−max_x p(x).
hard
Show (using time-sharing) that R(D) is convex: for any D₁, D₂ and θ∈[0,1], prove R(θD₁+(1−θ)D₂) ≤ θR(D₁)+(1−θ)R(D₂).
Hint: Take two test channels p₁(x̂|x) and p₂(x̂|x) that achieve (or nearly achieve) R(D₁), R(D₂). Introduce an independent Bernoulli(θ) variable Q selecting which channel to use, and define X̂ based on Q.
Show solution
Let Q~Bernoulli(θ) independent of X. Define conditional p(x̂|x,Q=q)=p_q(x̂|x) for q∈{1,2}. Then expected distortion is
E[d(X,X̂)] = θE[d(X,X̂)|Q=1] + (1−θ)E[d(X,X̂)|Q=2] ≤ θD₁+(1−θ)D₂.
Now bound mutual information:
I(X;X̂) ≤ I(X;X̂,Q) = I(X;Q)+I(X;X̂|Q) = 0 + θI₁(X;X̂)+(1−θ)I₂(X;X̂),
where I_q is computed under p_q.
Thus there exists a feasible channel at distortion θD₁+(1−θ)D₂ with mutual information ≤ θR(D₁)+(1−θ)R(D₂) (up to approximation if p_q are near-optimal). Taking the minimum over all feasible channels gives
R(θD₁+(1−θ)D₂) ≤ θR(D₁)+(1−θ)R(D₂), proving convexity.
Connections #
Next nodes you may want:
Quality: A (4.2/5)
← back to treebrowse all →