policy-gradient-methods

Visualizes a parameterized stochastic policy πθ(a|s) as action-probability bars for two states, then repeatedly samples a short trajectory under πθ. A moving cursor marks the current timestep and shows its score-function factor ∇θ log πθ(a|s) multiplied by the advantage, that is, the return minus a baseline V(s). This illustrates the policy gradient theorem as a sample-based estimator: each sampled trajectory yields a gradient estimate that drives a parameter update and improves the running objective J(θ).
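The per-timestep product described above can be written as a single-trajectory estimator. This is a sketch in standard notation, not taken from the visualization's source: the horizon T, discount γ, rewards r_k, and step size α are symbols assumed here for illustration, and the γ^t weighting that appears in the strict discounted form of the theorem is dropped, as is common in practice.

```latex
\hat{g} \;=\; \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - V(s_t)\bigr),
\qquad
G_t \;=\; \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k,
\qquad
\theta \;\leftarrow\; \theta + \alpha\,\hat{g}
```

Subtracting V(s_t) leaves the expectation of the estimator unchanged, because the expected score-function factor is zero in every state; it only reduces variance.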

practical uses #

technical notes #

Implements a tiny 2-state/2-action MDP and a toy softmax policy with persistent parameters (logits W) plus a baseline V(s). Roughly every 520 ms it samples a horizon-6 trajectory, computes discounted returns with γ = 0.95, forms advantages (G − V), and applies a REINFORCE-style update to W and an MSE update to V. Rendering uses pixel-snapped rectangles on a black background, drawn in the GREEN and GREEN_DIM colors, with eased cursor motion for a smooth, educational animation.
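The update loop described above might look roughly like the following Python sketch. It is not the visualization's actual code. The horizon (6), discount (0.95), tabular logits W, and baseline V come from the notes; the reward and transition rule in `env_step`, the step sizes `ALPHA_W` and `ALPHA_V`, and all function names are hypothetical choices made only so the example runs.

```python
import math
import random

GAMMA = 0.95      # discount factor, from the notes
HORIZON = 6       # trajectory length, from the notes
N_STATES, N_ACTIONS = 2, 2
ALPHA_W = 0.1     # assumed policy step size (not stated in the source)
ALPHA_V = 0.1     # assumed baseline step size (not stated in the source)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def env_step(state, action, rng):
    """Hypothetical dynamics: reward 1 when action != state; next state uniform."""
    reward = 1.0 if action != state else 0.0
    return rng.randrange(N_STATES), reward


def sample_trajectory(W, rng):
    """Roll out HORIZON steps under the softmax policy defined by W."""
    state = rng.randrange(N_STATES)
    traj = []
    for _ in range(HORIZON):
        probs = softmax(W[state])
        action = rng.choices(range(N_ACTIONS), weights=probs)[0]
        next_state, reward = env_step(state, action, rng)
        traj.append((state, action, reward))
        state = next_state
    return traj


def discounted_returns(rewards, gamma=GAMMA):
    """G_t = r_t + gamma * G_{t+1}, computed backwards from the last step."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]


def update(W, V, traj):
    """One REINFORCE-with-baseline step on W and one MSE step on V, in place."""
    returns = discounted_returns([r for _, _, r in traj])
    grad_W = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    grad_V = [0.0] * N_STATES
    for (s, a, _), G in zip(traj, returns):
        adv = G - V[s]                      # advantage G - V
        probs = softmax(W[s])
        for b in range(N_ACTIONS):
            # d/dW[s][b] of log softmax(W[s])[a] is 1[a == b] - probs[b]
            grad_W[s][b] += adv * ((1.0 if b == a else 0.0) - probs[b])
        grad_V[s] += adv                    # negative gradient of 0.5 * (G - V[s])**2
    for s in range(N_STATES):
        for b in range(N_ACTIONS):
            W[s][b] += ALPHA_W * grad_W[s][b]
        V[s] += ALPHA_V * grad_V[s]


def train(iterations=500, seed=0):
    """Repeat sample-then-update; returns the learned logits and baseline."""
    rng = random.Random(seed)
    W = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    V = [0.0] * N_STATES
    for _ in range(iterations):
        update(W, V, sample_trajectory(W, rng))
    return W, V
```

Calling `train()` and then `softmax(W[s])` for each state gives the action probabilities that the bars would display. Gradients are accumulated over the whole trajectory before being applied, so every timestep's advantage is computed against the same W and V; applying them step by step would also be a reasonable design.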
