Rate-Distortion Theory

←Back to Tech Tree

inventorycoverage

Rate-Distortion Theory #

Information TheoryDifficulty: ★★★★★Depth: 8Unlocks: 0

Optimal lossy compression. Tradeoff between bits and fidelity.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

R(D) (rate-distortion function)p(x_hat|x) (test-channel / conditional reconstruction distribution)

Essential Relationships #

Prerequisites (2) #

Entropy5 atomsMutual Information5 atoms

Advanced Learning Details

Graph Position #

100

Depth Cost

0

Fan-Out (ROI)

0

Bottleneck Score

8

Chain Length

Cognitive Load #

5

Atomic Elements

38

Total Elements

L2

Percentile Level

L3

Atomic Level

All Concepts (16) #

Teaching Strategy #

Quick unlock - significant prerequisite investment but simple final step. Verify prerequisites first.

Lossless compression asks: “How many bits do I need to reproduce the data exactly?” Rate–distortion theory asks the harder—and more practical—question: “If I’m allowed to be slightly wrong, what is the absolute minimum number of bits I need?”

TL;DR:

Rate–distortion theory characterizes the optimal tradeoff between compression rate (bits/symbol) and allowed average distortion. For a source X with distortion measure d(x,x̂), the rate–distortion function is

R(D)=min⁡p(x^∣x):  E[d(X,X^)]≤DI(X;X^).R(D)=\min_{p(\hat x|x):;\mathbb E[d(X,\hat X)]\le D} I(X;\hat X).R(D)=p(x^∣x):E[d(X,X^)]≤Dmin​I(X;X^).

It is decreasing and convex in D, with R(0)=H(X) under a “zero-distortion implies exact reconstruction” condition. Operationally, R(D) is the best achievable asymptotic rate for block lossy coding, and the minimizing conditional distribution p(x̂|x) is the “test channel” that describes the optimal stochastic reconstruction mechanism.

What Is Rate–Distortion Theory? #

Rate–distortion theory is the part of information theory that formalizes optimal lossy compression. “Lossy” means the decoder’s reconstruction X^\hat XX^ need not equal the original source symbol XXX; instead we control a fidelity criterion.

The central objects #

  1. A source XXX with distribution p(x)p(x)p(x). Typically we assume an i.i.d. source sequence Xn=(X1,…,Xn)X^n=(X₁,\dots,X_n)Xn=(X1​,…,Xn​) where XiX_iXi​ are independent and identically distributed like XXX.

  2. A distortion measure (single-letter) d(x,x^)d(x,\hat x)d(x,x^) satisfying

From this we define the per-symbol average distortion constraint

E [d(X,X^)]≤D.\mathbb E,[d(X,\hat X)] \le D.E[d(X,X^)]≤D.

For blocks, a standard extension is

dn(xn,x^n)=1n∑i=1nd(xi,x^i),d_n(x^n,\hat x^n)=\frac{1}{n}\sum_{i=1}^n d(x_i,\hat x_i),dn​(xn,x^n)=n1​i=1∑n​d(xi​,x^i​),

and we ask that E[dn(Xn,X^n)]≤D\mathbb E[d_n(X^n,\hat X^n)]\le DE[dn​(Xn,X^n)]≤D.

  1. A rate RRR in bits/symbol. If a block encoder maps XnX^nXn to an index M∈{1,…,2nR}M\in{1,\dots,2^{nR}}M∈{1,…,2nR}, then RRR measures the number of bits available per source symbol.

The question RD theory answers #

Given an allowed distortion level DDD, what is the smallest achievable rate RRR?

This minimum is the rate–distortion function R(D)R(D)R(D).

The clean single-letter characterization #

For a discrete memoryless source (DMS) with single-letter distortion, Shannon’s rate–distortion theorem states

R(D)=min⁡p(x^∣x):  E[d(X,X^)]≤DI(X;X^).R(D)=\min_{p(\hat x|x):;\mathbb E[d(X,\hat X)]\le D} I(X;\hat X).R(D)=p(x^∣x):E[d(X,X^)]≤Dmin​I(X;X^).

The minimizer p(x^∣x)p(\hat x|x)p(x^∣x) is called a test channel. It is not a physical channel; it is a mathematical object describing the optimal way (in an information-theoretic sense) to correlate the reconstruction with the source under the distortion constraint.

An RD curve: the picture you should have in mind #

Even without interactivity, it helps to visualize R(D)R(D)R(D) as a curve:

Rate R(D) (bits/symbol)
^
|\
| \
|  \
|   \
|    \
|     \
|      \
+--------------------> Distortion D
       D_max

A useful intuition: “bits buy you correlation” #

R(D)R(D)R(D) is a minimum over I(X;X^)I(X;\hat X)I(X;X^). Mutual information measures how many bits per symbol the reconstruction “knows” about the source. Allowing more distortion lets X^\hat XX^ be less correlated with XXX, reducing I(X;X^)I(X;\hat X)I(X;X^).

Caveats (so you don’t overgeneralize) #

Prerequisites (Accuracy Box): What You Must Already Be Comfortable With #

This node builds on several ideas. You said you already know entropy and mutual information; that’s necessary but not sufficient for a smooth ride at difficulty 5.

Required prerequisites #

TopicWhat you need hereQuick reminder
EntropyInterpret H(X)H(X)H(X) as optimal lossless rateH(X)=−∑xp(x)log⁡p(x)H(X)=-\sum_x p(x)\log p(x)H(X)=−∑x​p(x)logp(x)
Mutual informationUnderstand I(X;Y)I(X;Y)I(X;Y) as “bits of dependence”I(X;Y)=∑x,yp(x,y)log⁡p(x,y)p(x)p(y)I(X;Y)=\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}I(X;Y)=∑x,y​p(x,y)logp(x)p(y)p(x,y)​
i.i.d. block coding / typicalityWhy coding theorems are asymptotic and why “per-symbol” rates emergeTypical sequences occupy about $2^{nH(X)}$ outcomes
Conditional distributionsComfort manipulating $p(\hat xx)andinduced and induced andinducedp(\hat x)$
Basic convex optimizationRecognize convexity, constraints, and Lagrange multipliers/KKTMinimize objective subject to inequality constraints

Helpful (but not strictly mandatory) background #

Key caveats to keep in mind as you learn #

  1. Discrete vs continuous: for continuous sources, H(X)H(X)H(X) is replaced by differential entropy h(X)h(X)h(X), which is not directly a coding rate. Operational statements often involve mutual information directly and require additional regularity.

  2. Distortion definition matters: average distortion can be defined in expectation, with high probability, or as excess-distortion probability. Standard Shannon RD uses expected per-letter distortion; variants exist.

  3. $R(0)=H(X)$ is conditional: as noted above, it’s true in the common case where d(x,x^)=0d(x,\hat x)=0d(x,x^)=0 only when x^=x\hat x=xx^=x.

If any of these prerequisites feel shaky, it’s worth revisiting them before you attempt the proofs or algorithms (e.g., Blahut–Arimoto) in the later sections.

Core Mechanic 1: Distortion Measures and What “Fidelity” Really Means #

Before optimizing anything, we must decide what “good reconstruction” means. Rate–distortion theory forces you to be explicit: you provide a distortion measure d(x,x^)d(x,\hat x)d(x,x^) and a target distortion level DDD.

Single-letter distortion and per-symbol averaging #

A single-letter distortion assigns a penalty to each symbol pair (x,x^)(x,\hat x)(x,x^). For sequences, we typically average:

dn(xn,x^n)=1n∑i=1nd(xi,x^i).d_n(x^n,\hat x^n)=\frac{1}{n}\sum_{i=1}^n d(x_i,\hat x_i).dn​(xn,x^n)=n1​i=1∑n​d(xi​,x^i​).

The average matters because it makes the constraint scale nicely with nnn and matches i.i.d. assumptions.

We then require (one common formulation)

E [dn(Xn,X^n)]≤D.\mathbb E,[d_n(X^n,\hat X^n)] \le D.E[dn​(Xn,X^n)]≤D.

A subtle point: This is an expected distortion constraint; it allows rare large-distortion events as long as they don’t contribute too much to the average. Another common alternative is an excess-distortion probability constraint Pr⁡(dn(Xn,X^n)>D)≤ϵ\Pr(d_n(X^n,\hat X^n)>D)\le \epsilonPr(dn​(Xn,X^n)>D)≤ϵ.

Examples of distortion measures #

  1. Hamming distortion for discrete alphabets:

d(x,x^)=1{x≠x^}.d(x,\hat x)=\mathbf{1}{x\ne \hat x}.d(x,x^)=1{x=x^}.

Then E[d(X,X^)]=Pr⁡(X≠X^)\mathbb E[d(X,\hat X)] = \Pr(X\ne \hat X)E[d(X,X^)]=Pr(X=X^), i.e. average distortion equals symbol error rate.

  1. Squared error for real-valued sources:

d(x,x^)=(x−x^)2.d(x,\hat x)=(x-\hat x)^2.d(x,x^)=(x−x^)2.

Then DDD is a mean-squared error (MSE) constraint.

  1. Weighted distortions: sometimes different symbols are more costly to distort.

Feasible distortion range: Dmin⁡D_{\min}Dmin​ to Dmax⁡D_{\max}Dmax​ #

Two distortion levels are especially important.

Dmax⁡=min⁡p(x^)E[d(X,X^)]subject to X^⊥X.D_{\max}=\min_{p(\hat x)} \mathbb E[d(X,\hat X)] \quad \text{subject to } \hat X \perp X.Dmax​=p(x^)min​E[d(X,X^)]subject to X^⊥X.

For Hamming distortion, a zero-rate decoder should output the most probable symbol, so Dmax⁡=1−max⁡xp(x)D_{\max}=1-\max_x p(x)Dmax​=1−maxx​p(x).

This gives a boundary condition you should remember:

Why distortion is not “error” #

In lossy compression, the reconstruction might be “wrong” but perceptually fine (audio/image), or wrong but still useful (downstream ML tasks). Rate–distortion cleanly separates:

That separation is powerful: once ddd is fixed, the rest is mathematics.

A sanity check that prevents confusion #

If your distortion is Hamming, the RD problem resembles “coding with errors.” But it is not channel coding: in RD, the encoder sees the entire xnx^nxn and chooses a description; the distortion constraint is on reconstruction quality, not on a noisy channel outcome.

This perspective will matter when we introduce the test channel: the “noise” in p(x^∣x)p(\hat x|x)p(x^∣x) is controlled by us to meet distortion, not imposed by nature.

Core Mechanic 2: The Rate–Distortion Function as a Constrained Mutual Information Minimum #

Now we connect the operational question (“minimum bits”) to an information quantity.

The definition #

For a fixed source distribution p(x)p(x)p(x) and distortion measure d(x,x^)d(x,\hat x)d(x,x^), define

R(D)=min⁡p(x^∣x):  E[d(X,X^)]≤DI(X;X^).R(D)=\min_{p(\hat x|x):;\mathbb E[d(X,\hat X)]\le D} I(X;\hat X).R(D)=p(x^∣x):E[d(X,X^)]≤Dmin​I(X;X^).

This is sometimes called the Shannon lower bound form (not to be confused with the separate Shannon lower bound approximation used in continuous cases). The key is: we are minimizing mutual information over all conditional distributions p(x^∣x)p(\hat x|x)p(x^∣x) that satisfy the distortion constraint.

Expand the pieces (so it feels concrete) #

Given p(x)p(x)p(x) and a candidate test channel p(x^∣x)p(\hat x|x)p(x^∣x):

  1. Induced joint:

p(x,x^)=p(x)p(x^∣x).p(x,\hat x)=p(x)p(\hat x|x).p(x,x^)=p(x)p(x^∣x).

  1. Induced marginal:

p(x^)=∑xp(x)p(x^∣x).p(\hat x)=\sum_x p(x)p(\hat x|x).p(x^)=x∑​p(x)p(x^∣x).

  1. Mutual information:

I(X;X^)=∑x,x^p(x)p(x^∣x)log⁡p(x^∣x)p(x^).I(X;\hat X)=\sum_{x,\hat x} p(x)p(\hat x|x)\log\frac{p(\hat x|x)}{p(\hat x)}.I(X;X^)=x,x^∑​p(x)p(x^∣x)logp(x^)p(x^∣x)​.

  1. Expected distortion:

E[d(X,X^)]=∑x,x^p(x)p(x^∣x)d(x,x^).\mathbb E[d(X,\hat X)] = \sum_{x,\hat x} p(x)p(\hat x|x)d(x,\hat x).E[d(X,X^)]=x,x^∑​p(x)p(x^∣x)d(x,x^).

We choose p(x^∣x)p(\hat x|x)p(x^∣x) to minimize the first, while keeping the second ≤ D.

Why mutual information is the right “cost” #

A rough operational interpretation:

RD theory makes this precise: in the limit of large blocklength, the minimum description rate equals the minimum mutual information.

Fundamental properties of R(D) #

  1. Monotonicity: if you allow more distortion, you never need more rate.

If D2≥D1D₂\ge D₁D2​≥D1​, then the feasible set at D2D₂D2​ contains the feasible set at D1D₁D1​, so

R(D2)≤R(D1).R(D₂)\le R(D₁).R(D2​)≤R(D1​).

  1. Convexity: R(D)R(D)R(D) is convex in DDD.

Intuition: you can time-share between two codes (or two test channels) achieving distortions D1D₁D1​ and D2D₂D2​. Using them a fraction θ\thetaθ and $1-\thetayieldsaveragedistortion yields average distortion yieldsaveragedistortion\theta D₁+(1-\theta)D₂andaveragerate and average rate andaveragerate\theta R(D₁)+(1-\theta)R(D₂)$. Optimality then implies

R(θD1+(1−θ)D2)≤θR(D1)+(1−θ)R(D2).R(\theta D₁+(1-\theta)D₂) \le \theta R(D₁)+(1-\theta)R(D₂).R(θD1​+(1−θ)D2​)≤θR(D1​)+(1−θ)R(D2​).

That inequality is exactly convexity.

  1. Boundary values:

Why R(0)=H(X)R(0)=H(X)R(0)=H(X) under that condition:

If instead multiple reconstructions have zero distortion for a given xxx, then X^\hat XX^ might be a function that preserves less information than XXX, and I(X;X^)I(X;\hat X)I(X;X^) (hence R(0)R(0)R(0)) can be smaller.

The Lagrangian viewpoint (a workhorse) #

We often solve constrained problems by introducing a multiplier β≥0\beta\ge 0β≥0:

min⁡p(x^∣x)  I(X;X^)+β E[d(X,X^)].\min_{p(\hat x|x)}; I(X;\hat X) + \beta,\mathbb E[d(X,\hat X)].p(x^∣x)min​I(X;X^)+βE[d(X,X^)].

Then, for each β\betaβ, the optimizer corresponds to a point on the RD curve. (Formally, β\betaβ parameterizes supporting lines to the convex function R(D)R(D)R(D).)

This is where convex optimization enters: I(X;X^)I(X;\hat X)I(X;X^) is convex in p(x^∣x)p(\hat x|x)p(x^∣x) for fixed p(x)p(x)p(x), and the distortion constraint is linear in p(x^∣x)p(\hat x|x)p(x^∣x), making the problem a convex program.

The “test channel” form you should remember #

At optimum, the conditional often has an exponential tilt form (for discrete alphabets):

p(x^∣x)∝p(x^) e−βd(x,x^).p(\hat x|x) \propto p(\hat x),e^{-\beta d(x,\hat x)}.p(x^∣x)∝p(x^)e−βd(x,x^).

More explicitly,

p(x^∣x)=p(x^)e−βd(x,x^)∑x^′p(x^′)e−βd(x,x^′).p(\hat x|x)=\frac{p(\hat x)e^{-\beta d(x,\hat x)}}{\sum_{\hat x'} p(\hat x')e^{-\beta d(x,\hat x')}}.p(x^∣x)=∑x^′​p(x^′)e−βd(x,x^′)p(x^)e−βd(x,x^)​.

This is not magic; it is the KKT optimality condition for minimizing mutual information subject to expected distortion. It is also the basis for the Blahut–Arimoto algorithm (iterative computation of R(D)R(D)R(D)).

Pause here and internalize the meaning: to compress optimally, you behave as if the reconstruction is drawn from a Gibbs distribution where reconstructions with smaller distortion are exponentially preferred, but also modulated by a global prior p(x^)p(\hat x)p(x^) that must self-consistently match the induced marginal.

Application/Connection: Operational Meaning, Achievability vs Converse, and How RD Is Used #

The single-letter formula is elegant, but why does it equal the true minimum code rate? Rate–distortion theory has an operational meaning: it predicts what block codes can and cannot do as n→∞n→∞n→∞.

The operational RD theorem (informal but accurate) #

Fix a DMS XXX and single-letter distortion ddd. For any distortion level DDD:

So R(D)R(D)R(D) is the sharp threshold.

Why block coding matters #

Lossy compression gains power from coding across many symbols:

This “vector quantization” viewpoint is one way to understand why R(D)R(D)R(D) is often lower than what naive symbol-by-symbol quantization would suggest.

A typical RD-code construction (high level) #

Given an optimal test channel p(x^∣x)p(\hat x|x)p(x^∣x) achieving R(D)R(D)R(D):

  1. Compute the induced marginal p(x^)p(\hat x)p(x^).

  2. Randomly generate a codebook of $2^{nR}reproductionsequences reproduction sequences reproductionsequences\hat x^n(m)i.i.d.from i.i.d. from i.i.d.from\prod_i p(\hat x_i)$.

  3. Given a source block xnx^nxn, pick an index mmm such that (xn,x^n(m))(x^n,\hat x^n(m))(xn,x^n(m)) is “jointly typical” with respect to p(x,x^)=p(x)p(x^∣x)p(x,\hat x)=p(x)p(\hat x|x)p(x,x^)=p(x)p(x^∣x) (or approximately minimizes distortion).

  4. Send mmm; decoder outputs x^n(m)\hat x^n(m)x^n(m).

The probability that no suitable codeword exists drops when R>I(X;X^)R>I(X;\hat X)R>I(X;X^), which is the core reason mutual information appears.

The converse idea (why you can’t beat R(D)) #

For any code:

nR≥I(Xn;X^n).nR \ge I(X^n;\hat X^n).nR≥I(Xn;X^n).

Then, using memorylessness and single-letterization arguments (plus convexity), you get

1nI(Xn;X^n)≥R(E[d(X,X^)]).\frac{1}{n}I(X^n;\hat X^n) \ge R\big(\mathbb E[d(X,\hat X)]\big).n1​I(Xn;X^n)≥R(E[d(X,X^)]).

Thus RRR must be at least R(D)R(D)R(D).

Where RD theory shows up in practice #

  1. Benchmarking codecs: Real codecs (JPEG, AAC, H.264) can be compared to the theoretical RD bound for simplified source models.

  2. Learning and VAEs: The objective in a variational autoencoder resembles an RD Lagrangian:

loss≈E[distortion]+λ I(input;latent).\text{loss} \approx \mathbb E[\text{distortion}] + \lambda, I(\text{input};\text{latent}).loss≈E[distortion]+λI(input;latent).

(Details depend on approximations; but the conceptual link is strong.)

  1. Information bottleneck: Minimizing I(X;T)I(X;T)I(X;T) subject to preserving information about a target YYY echoes RD structure, with “distortion” defined via prediction loss.

  2. Control and estimation: RD-like bounds appear in quantized control and remote estimation (minimum communication to achieve an MSE).

One more static RD diagram (annotated) #

R
|            • (low D, high rate)
|          •
|        •
|      •
|    •
|  •
|•________________________ D
  0        D*         D_max

At D=0: often R(0)=H(X).
At D>=D_max: R(D)=0.
The curve is decreasing and convex.
A slope (supporting line) corresponds to a Lagrange multiplier β.

When you later compute RD points numerically, you are essentially moving along this curve by varying β\betaβ.

Connecting back to prerequisites #

With these connections in place, the rate–distortion formula should feel less like a definition and more like a deep theorem tying together coding, probability, and optimization.

Worked Examples (3) #

Binary symmetric source with Hamming distortion: compute R(D) #

Let X ~ Bernoulli(1/2) (a fair bit). Use Hamming distortion d(x,x̂)=1{x≠x̂}. Find the rate–distortion function R(D) for 0 ≤ D ≤ 1/2.

  1. Step 1: Write the definition

    R(D)=min_{p(x̂|x): E[d(X,X̂)]≤D} I(X;X̂).

  2. Step 2: Parameterize the test channel

    For binary X and binary X̂, a natural family is a binary symmetric channel (BSC) with crossover probability q:

    P(X̂≠X)=q.

    Then E[d(X,X̂)]=q.

    So the distortion constraint is q ≤ D.

  3. Step 3: Compute mutual information for a BSC with uniform input

    If X ~ Bern(1/2) and X̂ is X flipped with prob q, then X̂ is also uniform.

    Thus

    I(X;X̂)=H(X̂)−H(X̂|X)=1 − H₂(q),

    where H₂(q)=−q log₂ q − (1−q)log₂(1−q) is the binary entropy.

  4. Step 4: Minimize over feasible q

    We need to minimize 1 − H₂(q) subject to q ≤ D.

    Since H₂(q) increases on [0,1/2], 1−H₂(q) decreases on [0,1/2].

    So the minimum occurs at the largest feasible q, i.e. q=D.

  5. Step 5: State R(D)

    For 0 ≤ D ≤ 1/2,

    R(D)=1 − H₂(D).

    For D ≥ 1/2, you can output random bits independent of X to get distortion 1/2 at rate 0, so R(D)=0.

Insight: For a uniform bit, the optimal lossy code behaves like deliberately adding Bernoulli(D) noise: the reconstruction is a noisy version of the source. The “price” in bits is exactly the mutual information left after that noise, 1 − H₂(D).

Non-uniform binary source: compute D_max and interpret the zero-rate point #

Let X ~ Bernoulli(p) with p=0.9. Use Hamming distortion. (i) Compute D_max (minimum distortion achievable at zero rate). (ii) Explain what this implies about R(D) for large D.

  1. Step 1: Zero rate means X̂ is independent of X

    With R=0, the decoder gets no information, so it must output a constant symbol (or random symbol) not depending on X.

  2. Step 2: Minimize expected Hamming distortion over constant reconstructions

    If decoder always outputs 1, distortion is P(X=0)=1−p.

    If decoder always outputs 0, distortion is P(X=1)=p.

    Pick the smaller:

    D_max = min(p, 1−p).

  3. Step 3: Plug in p=0.9

    D_max = min(0.9, 0.1)=0.1.

  4. Step 4: Interpret for the RD curve

    If you allow distortion D ≥ 0.1, then the decoder can simply always output 1 and achieve expected distortion 0.1 with zero bits.

    Therefore R(D)=0 for all D ≥ 0.1.

Insight: The right endpoint of the RD curve depends strongly on the source distribution and distortion. For skewed sources, you can get surprisingly low distortion without sending any bits—just guess the mode.

Derive the exponential-form optimality condition for the test channel (discrete case) #

Show the key KKT/variational step that leads to p(x̂|x) ∝ p(x̂)e^{−β d(x,x̂)} for the optimization min I(X;X̂) subject to E[d] ≤ D and ∑_{x̂} p(x̂|x)=1 for each x.

  1. Step 1: Write the objective in a convenient form

    I(X;X̂)=∑_{x,x̂} p(x)p(x̂|x) log( p(x̂|x) / p(x̂) ),

    where p(x̂)=∑_x p(x)p(x̂|x).

  2. Step 2: Form the Lagrangian

    Introduce β ≥ 0 for the distortion constraint and λ(x) for normalization:

    L = I(X;X̂) + β( ∑_{x,x̂} p(x)p(x̂|x)d(x,x̂) − D ) + ∑_x λ(x)( ∑_{x̂} p(x̂|x) − 1 ).

  3. Step 3: Take a directional derivative w.r.t. p(x̂|x)

    Holding p(x) fixed, set derivative to zero at optimum.

    A standard calculation yields the stationarity condition:

    log p(x̂|x) − log p(x̂) + 1 + β d(x,x̂) + λ(x)/p(x) = 0.

  4. Step 4: Solve for p(x̂|x)

    Rearrange:

    log p(x̂|x) = log p(x̂) − β d(x,x̂) − c(x),

    where c(x) collects constants for each x.

    Exponentiate:

    p(x̂|x) = p(x̂) e^{−β d(x,x̂)} e^{−c(x)}.

  5. Step 5: Enforce normalization to find e^{−c(x)}

    We need ∑_{x̂} p(x̂|x)=1:

    1 = e^{−c(x)} ∑_{x̂} p(x̂)e^{−β d(x,x̂)}

    ⇒ e^{−c(x)} = 1 / ∑_{x̂} p(x̂)e^{−β d(x,x̂)}.

  6. Step 6: Final form

    Therefore

    p(x̂|x)= \frac{p(x̂)e^{−β d(x,x̂)}}{\sum_{x̂'} p(x̂')e^{−β d(x,x̂')}}.

Insight: This is the mathematical heart of RD computation: the optimal encoder–decoder behavior can be represented by a self-consistent “Gibbs” conditional. Varying β traces the RD curve, and iterating the consistency between p(x̂|x) and p(x̂) leads to Blahut–Arimoto.

Key Takeaways #

Common Mistakes #

Practice #

easy

Let X ~ Bernoulli(1/2) with Hamming distortion. Compute R(D) at D=0, D=0.1, and D=0.5.

Hint: Use R(D)=1−H₂(D) for 0≤D≤1/2, and R(D)=0 for D≥1/2. Recall H₂(0)=0 and H₂(1/2)=1.

Show solution

D=0: R(0)=1−H₂(0)=1−0=1 bit/symbol.

D=0.1: R(0.1)=1−H₂(0.1) ≈ 1−0.468995 ≈ 0.5310 bits/symbol.

D=0.5: R(0.5)=0 (since 1−H₂(0.5)=0 and also D≥1/2 implies zero rate).

medium

For a general discrete source X with distribution p(x) and Hamming distortion d(x,x̂)=1{x≠x̂}, compute D_max (the distortion achievable at zero rate).

Hint: At zero rate, X̂ cannot depend on X. The best constant guess is the most probable symbol.

Show solution

Zero rate means choose a fixed reconstruction x̂ minimizing P(X≠x̂). The minimizer is any mode x̂* ∈ argmax_x p(x). Then D_max = 1 − max_x p(x). Therefore R(D)=0 for all D ≥ 1−max_x p(x).

hard

Show (using time-sharing) that R(D) is convex: for any D₁, D₂ and θ∈[0,1], prove R(θD₁+(1−θ)D₂) ≤ θR(D₁)+(1−θ)R(D₂).

Hint: Take two test channels p₁(x̂|x) and p₂(x̂|x) that achieve (or nearly achieve) R(D₁), R(D₂). Introduce an independent Bernoulli(θ) variable Q selecting which channel to use, and define X̂ based on Q.

Show solution

Let Q~Bernoulli(θ) independent of X. Define conditional p(x̂|x,Q=q)=p_q(x̂|x) for q∈{1,2}. Then expected distortion is

E[d(X,X̂)] = θE[d(X,X̂)|Q=1] + (1−θ)E[d(X,X̂)|Q=2] ≤ θD₁+(1−θ)D₂.

Now bound mutual information:

I(X;X̂) ≤ I(X;X̂,Q) = I(X;Q)+I(X;X̂|Q) = 0 + θI₁(X;X̂)+(1−θ)I₂(X;X̂),

where I_q is computed under p_q.

Thus there exists a feasible channel at distortion θD₁+(1−θ)D₂ with mutual information ≤ θR(D₁)+(1−θ)R(D₂) (up to approximation if p_q are near-optimal). Taking the minimum over all feasible channels gives

R(θD₁+(1−θ)D₂) ≤ θR(D₁)+(1−θ)R(D₂), proving convexity.

Connections #

Next nodes you may want:

Quality: A (4.2/5)

← back to treebrowse all →