Markov Decision Processes

←Back to Tech Tree

inventorycoverage

Markov Decision Processes #

Machine LearningDifficulty: ★★★★★Depth: 7Unlocks: 3

Formal framework for sequential decision-making. Bellman equations.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

gamma (discount factor: scalar 0<=gamma<=1 that weights future rewards)

Essential Relationships #

Prerequisites (3) #

Reinforcement Learning Introduction6 atomsMarkov Chains6 atomsDynamic Programming6 atoms

Unlocks (2) #

Policy Gradient Methodslvl 5Task Discretizationlvl 5

Referenced by (2) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (2) #

[Time HorizonBusiness

MDPs formalize time horizon mathematically via finite vs infinite horizon planning, discount factors encoding time preference, and Bellman equations for sequential decision-making - the exact mathematical framework underlying business time horizon analysis.](/business/time-horizon/)[Discount FactorBusiness

Bellman equations formally define the discount factor gamma that weights future rewards, making MDPs the mathematical framework where this concept is rigorously grounded and applied.](/business/discount-factor/)

Advanced Learning Details

Graph Position #

146

Depth Cost

3

Fan-Out (ROI)

2

Bottleneck Score

7

Chain Length

Cognitive Load #

6

Atomic Elements

42

Total Elements

L3

Percentile Level

L4

Atomic Level

All Concepts (15) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Markov Decision Processes (MDPs) are the “assembly language” of reinforcement learning: once you can express a problem as an MDP, you can derive Bellman equations, prove what an optimal policy means, and justify algorithms like value iteration, policy iteration, and actor-critic methods.

TL;DR:

An MDP formalizes sequential decision-making with (S, A, P, R, γ). A policy π(a|s) induces a Markov reward process, which defines value functions V^π(s) and Q^π(s,a) as expected discounted return. Bellman expectation equations describe V^π and Q^π recursively. Bellman optimality equations define V*(s) and Q*(s,a) via a max over actions. Dynamic programming uses these equations to compute optimal policies when P and R are known.

What Is a Markov Decision Process? #

Why we need a formalism #

In reinforcement learning, you interact with an environment over time. The key difficulty is credit assignment across time: an action now may cause a good or bad outcome many steps later. To reason about this cleanly, we want a model that:

  1. 1)Separates what the agent controls (actions, policy) from what the environment does (stochastic transitions).
  2. 2)Encodes time and uncertainty.
  3. 3)Is simple enough to yield recursive equations (so dynamic programming applies).

An MDP is the standard model that accomplishes this.

Definition #

A (finite) Markov Decision Process is a tuple:

P(s′ | s, a) = Pr(Sₜ₊₁ = s′ | Sₜ = s, Aₜ = a)

The Markov decision epoch (Markov property) #

The “atomic step” of an MDP is a decision epoch:

The Markov property means the future depends on the past only through the current state (and chosen action):

Pr(Sₜ₊₁ = s′ | S₀, A₀, …, Sₜ = s, Aₜ = a) = Pr(Sₜ₊₁ = s′ | Sₜ = s, Aₜ = a)

This assumption is what lets us write Bellman equations: it provides the “optimal substructure” for sequential decisions.

Trajectories and return #

A trajectory is a sequence (S₀, A₀, R₁, S₁, A₁, R₂, …). The (discounted) return from time t is:

Gₜ = ∑ₖ₌₀^∞ γᵏ Rₜ₊ₖ₊₁

What γ really does (motivation before math) #

Discounting is not just a mathematical trick. It does three practical jobs:

  1. 1)Preference for sooner rewards (time preference).
  2. 2)Convergence: if rewards are bounded, discounting ensures ∑ γᵏ R converges.
  3. 3)Effective horizon: roughly, rewards more than ~1/(1−γ) steps away contribute little.

Relationship to Markov chains and dynamic programming #

An MDP generalizes a Markov chain:

An MDP also sets up dynamic programming:

A quick taxonomy (to orient you) #

ObjectWhat it specifiesWhat you can do with it
Markov chainP(s′s)
Markov reward processP(s′s), R(s)
MDPP(s′s,a), R, γ

The rest of this lesson builds the two central ingredients: policies and value functions, and then shows how Bellman equations connect them.

Core Mechanic 1: Policies Induce Value Functions #

Why policies come first #

An MDP by itself is just a world model. To talk about “expected return,” we must specify how actions are selected. That’s the role of a policy.

Policies (deterministic and stochastic) #

A policy π is a decision rule mapping states to actions.

Stochastic policies matter even when the environment is fully known:

From an MDP + policy to a Markov reward process #

Once you fix π, the agent is no longer “choosing” actions freely; actions are sampled from π. The resulting state process becomes a Markov chain with transition matrix:

P^π(s′|s) = ∑ₐ π(a|s) P(s′|s,a)

If rewards depend on (s,a) (or (s,a,s′)), you also get an induced reward model, e.g.:

R^π(s) = ∑ₐ π(a|s) R(s,a)

This reduction is conceptually important: policy evaluation is “just” analyzing the Markov reward process induced by π.

Value function: expected discounted return #

Now we define the central predictive quantities.

State-value function:

V^π(s) = 𝔼_π[ Gₜ | Sₜ = s ] = 𝔼_π[ ∑ₖ₌₀^∞ γᵏ Rₜ₊ₖ₊₁ | Sₜ = s ]

Action-value function:

Q^π(s,a) = 𝔼_π[ Gₜ | Sₜ = s, Aₜ = a ]

Intuition:

Advantage function (useful later) #

A closely related quantity is the advantage:

A^π(s,a) = Q^π(s,a) − V^π(s)

It measures how much better/worse an action is compared to the policy’s average behavior at s.

Bellman expectation equations (the key recursion) #

Why a recursion exists #

Because of the Markov property, the future after the next step depends only on the next state (and the policy). So the return decomposes into:

Gₜ = Rₜ₊₁ + γ Gₜ₊₁

Take conditional expectation under π.

Deriving the Bellman expectation equation for V^π #

Start from the definition:

V^π(s) = 𝔼_π[ Gₜ | Sₜ = s ]

Substitute Gₜ = Rₜ₊₁ + γ Gₜ₊₁:

V^π(s)

= 𝔼_π[ Rₜ₊₁ + γ Gₜ₊₁ | Sₜ = s ]

= 𝔼_π[ Rₜ₊₁ | Sₜ = s ] + γ 𝔼_π[ Gₜ₊₁ | Sₜ = s ]

Now expand the next-step expectation by conditioning on Aₜ and Sₜ₊₁:

𝔼_π[ Rₜ₊₁ + γ V^π(Sₜ₊₁) | Sₜ = s ]

= ∑ₐ π(a|s) ∑_{s′} P(s′|s,a) \big( R(s,a,s′) + γ V^π(s′) \big)

So:

V^π(s) = ∑ₐ π(a|s) ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V^π(s′) )

This is the Bellman expectation equation for V^π.

Bellman expectation equation for Q^π #

Similarly:

Q^π(s,a) = ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ 𝔼_{a′∼π(·|s′)}[ Q^π(s′,a′) ] )

Or equivalently:

Q^π(s,a) = ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ ∑_{a′} π(a′|s′) Q^π(s′,a′) )

Vector/matrix form (finite state MRP view) #

When π is fixed, you can write the state-value Bellman equation in linear algebra form.

Let V^π ∈ ℝ^{|S|} be a vector with entries V^π(s). Define:

Then:

V^π = r^π + γ PV

Rearrange:

( I − γ P^π ) V^π = r

If ( I − γ P^π ) is invertible (it is under standard discounted assumptions), then:

V^π = ( I − γ P^π )⁻¹ r

This “closed form” is conceptually useful: policy evaluation becomes solving a linear system. Practically, DP and iterative methods are often preferred.

A note on terminal states and episodic tasks #

Many tasks end (episode termination). A common modeling choice:

Either way, Bellman equations remain valid with appropriate boundary conditions.

Core Mechanic 2: Optimality and the Bellman Optimality Equations #

Why “optimality” is subtle #

Once you define V^π and Q^π, it’s tempting to say “pick the action with highest value.” But value depends on the policy itself: if you change actions now, future actions may also change.

So we define optimality in a fixed point sense: an optimal policy is one whose value dominates all others.

Optimal value functions #

Define the optimal state-value function:

V*(s) = max_π V^π(s)

and optimal action-value function:

Q*(s,a) = max_π Q^π(s,a)

These exist for finite discounted MDPs.

Bellman optimality equation for V* #

Start from: if you are in s, and you act optimally, you choose the best action a, then the environment transitions, and you continue optimally from s′.

That gives:

V*(s) = maxₐ ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V*(s′) )

This is the Bellman optimality equation.

Bellman optimality equation for Q* #

Similarly, for Q* you commit to the first action a, then behave optimally afterward:

Q*(s,a) = ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ max_{a′} Q*(s′,a′) )

Greedy policies and extracting an optimal policy #

Once you have Q*(s,a), you can define a greedy policy:

π*(s) ∈ argmaxₐ Q*(s,a)

This π* is optimal (ties may yield multiple optimal policies).

Policy improvement theorem (why greedy helps) #

A key piece of theory is that being greedy with respect to a value function improves (or at least does not worsen) the policy.

Let π be any policy, and define a new policy π′ such that for all s:

π′(s) ∈ argmaxₐ Q^π(s,a)

Then:

V^{π′}(s) ≥ V^π(s) for all s

This theorem justifies policy iteration: evaluate π to get V^π (or Q^π), improve greedily to get π′, repeat.

Contraction view (why the Bellman operator converges) #

Define the Bellman optimality operator T acting on a value function V:

(TV)(s) = maxₐ ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V(s′) )

Then V* is the unique fixed point:

V = T V

Under the max norm ‖·‖∞, T is a γ-contraction:

‖T V − T W‖∞ ≤ γ ‖V − W‖∞

This matters because:

Even if you don’t prove the contraction fully, the intuition is strong: discounting shrinks the influence of future differences by γ each step.

Comparing the “three Bellman equations” #

Equation typeRecursionUnknownUse
Expectation (policy evaluation)V^π = r^π + γ P^π V^πV^πEvaluate a fixed π
Optimality (planning)V = T VV*Find optimal values
Q-optimalityQ = T_Q QQ*Extract greedy optimal π*

Where the reward function fits #

A common confusion: “Is reward tied to states or transitions?” In general, reward can depend on (s,a,s′). For many derivations, we use its expectation:

R̄(s,a) = 𝔼[ Rₜ₊₁ | Sₜ=s, Aₜ=a ] = ∑_{s′} P(s′|s,a) R(s,a,s′)

Then Bellman optimality for V* can be written:

V*(s) = maxₐ \big( R̄(s,a) + γ ∑_{s′} P(s′|s,a) V*(s′) \big)

It’s the same idea—just less notation.

Deterministic vs stochastic optimal policies #

In finite discounted MDPs, there always exists an optimal deterministic stationary policy (depends only on s, not time). That’s a powerful simplification:

This fact is part of why MDPs are such a clean framework.

Application/Connection: Solving MDPs with Dynamic Programming (Planning) and How This Enables Modern RL #

Why dynamic programming applies #

Dynamic programming works when:

The Bellman optimality equation is precisely the DP recursion.

Value iteration #

Value iteration applies the optimality operator repeatedly.

Initialize V₀(s) arbitrarily (often 0). Then iterate:

V_{k+1}(s) = maxₐ ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V_k(s′) )

After convergence to V*, extract:

π*(s) ∈ argmaxₐ ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V*(s′) )

What’s happening conceptually:

Policy iteration #

Policy iteration alternates between evaluation and improvement.

  1. Policy evaluation: solve for V^π via:

V^π(s) = ∑ₐ π(a|s) ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V^π(s′) )

In matrix form:

( I − γ P^π ) V^π = r

  1. Policy improvement:

π_new(s) ∈ argmaxₐ ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V^π(s′) )

Repeat until the policy stops changing.

Value iteration vs policy iteration #

MethodMain ideaProsCons
Value iterationUpdate V toward V* directlySimple, memory-lightMany iterations when γ≈1
Policy iterationEvaluate π exactly/approximately, then improveOften fewer iterationsEvaluation can be expensive
Modified policy iterationPartial evaluation + improvementFlexible tradeoffMore moving parts

Where “learning” begins (model-free RL) #

DP assumes you know P and R. In many RL problems you don’t; you only observe samples.

But the MDP framing still matters because:

This is the bridge from MDP theory to modern RL.

Connection to policy gradients (unlock) #

Policy gradient methods optimize π_θ directly. The objective is typically:

J(θ) = 𝔼_{τ∼π_θ}[ ∑ₜ γᵗ Rₜ₊₁ ]

Even though policy gradients do not require an explicit model P, they still rely on the MDP structure:

So understanding Bellman equations makes actor-critic updates feel motivated rather than magical.

Connection to task discretization (unlock) #

When you discretize a continuous or ill-defined task (common in LLM agent systems), you often create:

If the resulting system approximately satisfies a Markov property (state summarizes relevant history), you can treat it as an MDP and apply planning/evaluation ideas:

Even if the environment is only approximately Markov, the MDP lens provides a disciplined way to debug and improve the design.

Beyond the basics (what to keep in mind at difficulty 5) #

MDPs are clean, but real RL systems introduce complications:

Still, the MDP + Bellman equations remain the conceptual backbone.

Worked Examples (3) #

Compute V^π for a tiny MDP by solving Bellman expectation equations #

States S = {A, B}. Actions are irrelevant because the policy is fixed and the transition is policy-induced.

Dynamics under π:

Discount γ = 0.9

Find V^π(A) and V^π(B).

  1. Write Bellman equations.

    V^π(A) = 𝔼[R|A] + γ 𝔼[V^π(S′)|A]

    V^π(B) = 𝔼[R|B] + γ 𝔼[V^π(S′)|B]

  2. Substitute the deterministic transitions and rewards.

    From A: reward 2, next state B.

    V^π(A) = 2 + 0.9 V^π(B)

    From B: reward 1, next state B.

    V^π(B) = 1 + 0.9 V^π(B)

  3. Solve for V^π(B) first.

    V^π(B) = 1 + 0.9 V^π(B)

    V^π(B) − 0.9 V^π(B) = 1

    0.1 V^π(B) = 1

    V^π(B) = 10

  4. Plug into V^π(A).

    V^π(A) = 2 + 0.9 · 10

    V^π(A) = 2 + 9

    V^π(A) = 11

Insight: Bellman expectation equations often let you solve values with simple algebra. Notice how a self-loop with constant reward creates a geometric series; the Bellman equation encodes that series compactly.

Derive V* and an optimal policy in a 2-action MDP (Bellman optimality) #

Single nonterminal state s, plus terminal state T with V(T)=0.

At state s you can choose:

Discount γ = 0.8

Compute V*(s), Q*(s,a₁), Q*(s,a₂), and the optimal action.

  1. Write Q* for each action using the Q Bellman optimality equation.

    Q*(s,a) = ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ max_{a′} Q*(s′,a′) )

  2. Compute Q*(s,a₁).

    Taking a₁ leads to terminal T with reward 5.

    Q*(s,a₁) = 5 + 0.8 · V*(T)

    V*(T) = 0

    So Q*(s,a₁) = 5

  3. Compute Q*(s,a₂).

    Taking a₂ gives reward 1 and returns to s.

    Q*(s,a₂) = 1 + 0.8 · max_{a′} Q*(s,a′)

    But max_{a′} Q*(s,a′) is exactly V*(s). So:

    Q*(s,a₂) = 1 + 0.8 V*(s)

  4. Relate V*(s) to Q*.

    V*(s) = max( Q*(s,a₁), Q*(s,a₂) )

    = max( 5, 1 + 0.8 V*(s) )

  5. Check which action is optimal by solving the fixed point.

    If a₁ is optimal, then V*(s)=5.

    Then Q*(s,a₂)=1+0.8·5=1+4=5.

    So a₂ ties with a₁; both give value 5.

  6. Confirm the tie is consistent.

    If V*(s)=5, then max(5, 1+0.8·5)=max(5,5)=5, consistent.

    Therefore V*(s)=5 and both actions are optimal.

    Final:

    Q*(s,a₁)=5

    Q*(s,a₂)=5

    V*(s)=5

Insight: Bellman optimality can produce ties and multiple optimal policies. Here, a₂ is a “delayed gratification loop” whose discounted infinite sum matches the immediate terminal payoff.

Policy improvement step using Q^π (one-step lookahead) #

MDP with states S={s₁,s₂}, actions A={L,R}. Discount γ=0.5.

Transitions and rewards:

Current policy π chooses L everywhere.

Compute Q^π(s₁,L), Q^π(s₁,R) and improve π at s₁.

  1. Evaluate V^π(s₂) under π (which picks L at s₂).

    Bellman:

    V^π(s₂) = 1 + 0.5 V^π(s₂)

    V^π(s₂) − 0.5 V^π(s₂) = 1

    0.5 V^π(s₂) = 1

    V^π(s₂) = 2

  2. Compute Q^π(s₁,L).

    Taking L in s₁ gives reward 0 then goes to s₂.

    Q^π(s₁,L) = 0 + 0.5 V^π(s₂)

    = 0.5 · 2

    = 1

  3. Compute Q^π(s₁,R).

    Taking R in s₁ gives reward 2 then goes to s₂.

    Q^π(s₁,R) = 2 + 0.5 V^π(s₂)

    = 2 + 1

    = 3

  4. Improve policy greedily at s₁.

    argmax_a Q^π(s₁,a) = R because 3 > 1.

    So set π′(s₁)=R (keeping π′(s₂)=L if only improving at s₁).

Insight: Policy improvement is local but principled: compare actions by their Q^π values (immediate reward plus discounted value of the next state). This is the core move behind policy iteration and many RL control methods.

Key Takeaways #

Common Mistakes #

Practice #

easy

Consider an MDP with states {0,1}. From state 0, action a keeps you in 0 with prob 1 and reward 1 each step. From state 1, the episode terminates with reward 0. Suppose the policy keeps taking a in state 0 and you never reach state 1. With γ=0.9, compute V^π(0).

Hint: Write V^π(0) = 1 + γ V^π(0) and solve.

Show solution

V^π(0) = 1 + 0.9 V^π(0)

V^π(0) − 0.9 V^π(0) = 1

0.1 V^π(0) = 1

V^π(0) = 10

medium

Let an MDP have two states s₁,s₂ and one action per state (so it’s a Markov reward process). Transitions: s₁→s₂ with prob 1 and reward 0; s₂→s₂ with prob 1 and reward 2. With γ=0.5, compute V(s₁), V(s₂) using Bellman equations.

Hint: Solve V(s₂)=2+γV(s₂) first, then plug into V(s₁)=0+γV(s₂).

Show solution

V(s₂)=2+0.5V(s₂)

V(s₂)−0.5V(s₂)=2

0.5V(s₂)=2

V(s₂)=4

V(s₁)=0+0.5V(s₂)=0.5·4=2

hard

In a finite discounted MDP, define the Bellman optimality operator (TV)(s)=maxₐ ∑_{s′} P(s′|s,a)(R(s,a,s′)+γV(s′)). Show that if you have two value functions V and W, then |(TV)(s) − (TW)(s)| ≤ γ max_{s′}|V(s′)−W(s′)| for every state s.

Hint: Use that maxₐ fₐ − maxₐ gₐ ≤ maxₐ (fₐ−gₐ), then apply triangle inequality and the fact that probabilities sum to 1.

Show solution

Let Δ = max_{x} |V(x)−W(x)|.

For a fixed s and action a define:

F_a = ∑_{s′} P(s′|s,a)(R(s,a,s′)+γV(s′))

G_a = ∑_{s′} P(s′|s,a)(R(s,a,s′)+γW(s′))

Then F_a−G_a = γ ∑_{s′} P(s′|s,a)(V(s′)−W(s′)).

So |F_a−G_a| ≤ γ ∑_{s′} P(s′|s,a) |V(s′)−W(s′)| ≤ γ ∑_{s′} P(s′|s,a) Δ = γΔ.

Now,

(TV)(s)=max_a F_a, (TW)(s)=max_a G_a.

Use the inequality |max_a F_a − max_a G_a| ≤ max_a |F_a−G_a|.

Therefore |(TV)(s)−(TW)(s)| ≤ max_a γΔ = γΔ = γ max_{s′}|V(s′)−W(s′)|.

Connections #

Quality: A (4.4/5)

← back to treebrowse all →