policy-gradient-methods #
Visualizes a parameterized stochastic policy πθ(a|s) as action-probability bars for two states, then repeatedly samples a short trajectory under πθ. A moving cursor shows the current timestep's score-function factor ∇θ log πθ(a|s) multiplied by the advantage, that is, the return G minus a baseline V(s). This illustrates the policy gradient theorem as a sample-based estimator: each sampled trajectory yields a noisy gradient, and the resulting updates tend to raise the running objective J(θ).
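Written out, the quantity the cursor steps through can be sketched as below. The symbols G_t, r_k, T, and γ are notation introduced here for illustration. The sum uses discounted returns and drops the extra γ^t weighting on each term, as is common in practice, so it is an approximation of the gradient of the discounted objective rather than an exact identity.

```latex
\nabla_\theta J(\theta) \approx
\mathbb{E}_{\tau \sim \pi_\theta}\!\left[
  \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
  \bigl(G_t - V(s_t)\bigr)
\right],
\qquad
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k
```

Subtracting V(s_t) leaves the expectation unchanged because the baseline does not depend on the action, but it can reduce the variance of each sampled term.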
practical uses #
- 01. Training stochastic policies with REINFORCE when you only have sampled episodes
- 02. Actor-critic methods (using a learned baseline/value to reduce variance)
- 03. Optimizing continuous or discrete action policies directly without a differentiable environment model
technical notes #
Implements a tiny 2-state/2-action MDP and a toy softmax policy with persistent parameters (logits W) plus a baseline V(s). Roughly every 520 ms it samples a horizon-6 trajectory, computes discounted returns G with γ=0.95, forms advantages (G−V), and applies a REINFORCE-style update to W and an MSE update to V. Rendering uses pixel-snapped rectangles on a black background in the GREEN and GREEN_DIM colors, with eased cursor motion for a smooth, educational animation.
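The update loop described above can be sketched in plain Python. This is a minimal illustration, not the visualization's actual source: the horizon of 6 and γ=0.95 come from the notes, while the transition and reward rule in `step`, the step sizes `LR_W` and `LR_V`, and all function names are assumptions made for the example.

```python
import math
import random

GAMMA = 0.95   # discount factor, from the notes
HORIZON = 6    # trajectory length, from the notes
LR_W = 0.1     # assumed step size for the policy logits W
LR_V = 0.1     # assumed step size for the baseline V


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def step(state, action, rng):
    """Hypothetical dynamics: reward 1 when action != state, uniform next state."""
    reward = 1.0 if action != state else 0.0
    return rng.randrange(2), reward


def sample_trajectory(W, rng):
    """Roll out HORIZON steps under the softmax policy with logits W[state]."""
    state = rng.randrange(2)
    traj = []
    for _ in range(HORIZON):
        probs = softmax(W[state])
        action = 0 if rng.random() < probs[0] else 1
        next_state, reward = step(state, action, rng)
        traj.append((state, action, reward))
        state = next_state
    return traj


def discounted_returns(rewards, gamma=GAMMA):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over the episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    out.reverse()
    return out


def update(W, V, traj):
    """One REINFORCE-style step on W and one MSE gradient step on V, in place."""
    returns = discounted_returns([r for _, _, r in traj])
    dW = [[0.0, 0.0], [0.0, 0.0]]
    dV = [0.0, 0.0]
    for (s, a, _), G in zip(traj, returns):
        advantage = G - V[s]
        probs = softmax(W[s])
        for b in range(2):
            # Softmax score function: d log pi(a|s) / d W[s][b] = 1[a == b] - pi(b|s)
            dW[s][b] += advantage * ((1.0 if b == a else 0.0) - probs[b])
        dV[s] += advantage  # negative gradient of 0.5 * (G - V[s])**2 w.r.t. V[s]
    for s in range(2):
        for b in range(2):
            W[s][b] += LR_W * dW[s][b]
        V[s] += LR_V * dV[s]


def train(episodes=2000, seed=0):
    """Run repeated sample-then-update cycles from zero-initialized W and V."""
    rng = random.Random(seed)
    W = [[0.0, 0.0], [0.0, 0.0]]
    V = [0.0, 0.0]
    for _ in range(episodes):
        update(W, V, sample_trajectory(W, rng))
    return W, V
```

Calling `train()` and then `softmax(W[0])` and `softmax(W[1])` gives the learned action probabilities for the two states. Gradients are accumulated over the whole trajectory before being applied, so every term is evaluated at the same parameters that generated the samples.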