RLHF


RLHF #

Machine Learning · Difficulty: ★★★★★ · Depth: 10 · Unlocks: 0

Reinforcement Learning from Human Feedback. Reward modeling, PPO.


Core Concepts #

Key Symbols & Notation #

r_ϕ - scalar reward function learned from human feedback (ϕ = reward-model parameters)

π_ref - reference (pretrained) policy used in the KL penalty/constraint

Essential Relationships #

Prerequisites (2) #

Policy Gradient Methods (6 atoms)

Loss Functions (7 atoms)

Referenced by (2) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (2) #

GARP (Business): RLHF is revealed preference applied to alignment: human pairwise comparisons reveal a latent reward function, exactly as consumer choices reveal utility under GARP. Afriat-style consistency checks could audit whether a reward model is rationalizable. (/business/garp/)

redline (Business): Optimizing redline suggestions to achieve a ≥0.8 human accept rate is directly the RLHF loop: generate candidate edits, collect human accept/reject signals, and update the policy to align suggestions with human judgment. (/business/redline/)

Advanced Learning Details

Graph Position #

Depth Cost: 250

Fan-Out (ROI): 0

Bottleneck Score: 0

Chain Length: 10

Cognitive Load #

Atomic Elements: 6

Total Elements: 39

Percentile Level: L2

Atomic Level: L4

All Concepts (15) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

RLHF is the engineering bridge between “a powerful language model” and “a model that reliably behaves the way humans want.” It does this by (1) learning a reward function from human preference comparisons, then (2) fine-tuning the policy to maximize that learned reward while staying close to a trusted reference policy.

TL;DR:

RLHF (Reinforcement Learning from Human Feedback) typically has two big stages: (1) train a reward model r_ϕ from human preference data over model outputs, and (2) optimize a policy π to maximize expected reward r_ϕ while applying a KL penalty to keep π close to a reference policy π_ref. In practice, the policy step is often done with PPO on a token-level sequence model, using r_ϕ as the terminal reward (plus optional shaping) and adding −β·KL(π‖π_ref) for stability and alignment.

What Is RLHF? #

RLHF (Reinforcement Learning from Human Feedback) is a training recipe for turning a pretrained generative model (the “policy”) into one that better matches human preferences.

Why this exists: supervised fine-tuning (SFT) teaches a model to imitate demonstrations. But in many real tasks—helpfulness, harmlessness, style, honesty, instruction-following—there isn’t a single “correct” output. Instead, there are outputs humans prefer over others. Preferences are comparative and subjective, and they can be inconsistent across people and contexts.

RLHF treats “what humans like” as a reward signal. The catch is that humans don’t provide a numeric reward for every output; they can more reliably answer comparative questions such as “Which of these two responses to the same prompt is better?”

So RLHF usually proceeds in two phases:

  1. Reward modeling: learn a scalar reward function r_ϕ that predicts human preference.

  2. Policy optimization: fine-tune the policy π_θ to maximize expected reward under r_ϕ.

A third ingredient is crucial in large language models: keep the optimized policy close to a reference policy π_ref (typically the pretrained or SFT model). If you optimize purely for the learned reward, the policy can drift out of distribution, exploit weaknesses in r_ϕ (reward hacking), or collapse into repetitive high-reward patterns. The KL term is the stabilizer.

A useful mental picture is:

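Key symbols you’ll see throughout:

  • r_ϕ: the learned scalar reward function (ϕ = reward-model parameters).
  • π_θ: the policy being fine-tuned (θ = policy parameters).
  • π_ref: the reference policy used in the KL penalty/constraint.
  • β: the weight on the KL term.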

Although RLHF is often described with “reinforcement learning,” in language modeling it’s sequence-level RL: you sample tokens step-by-step, but the reward might be computed at the end of the sequence from r_ϕ.

RLHF is not magic. It is a pragmatic approach that works when:

When those fail, RLHF can produce confident misalignment: behavior that looks good to r_ϕ but not to humans.

Core Mechanic 1: Reward Modeling from Human Preferences #

Why reward modeling first: humans can’t score every output on a consistent numeric scale. But pairwise comparisons are easier, faster, and often more reliable. Reward modeling turns those comparisons into a scalar function you can optimize.

Preference data format #

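A typical dataset contains tuples like (x, y⁺, y⁻): a prompt x, the response y⁺ that the labeler preferred, and the response y⁻ that the labeler rejected.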

You can think of this as teaching a model to assign higher reward to preferred responses.

A standard probabilistic model (Bradley–Terry / logistic preference) #

A common approach assumes the probability that y₁ is preferred over y₂ is a logistic function of the reward difference:

P(y₁ ≻ y₂ | x) = σ(r_ϕ(x, y₁) − r_ϕ(x, y₂))

where σ(t) = 1 / (1 + e^(−t)).

This has a nice interpretation: only relative differences matter. If you add a constant c to all rewards, preferences don’t change.
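The shift-invariance claim is easy to check numerically. The snippet below is a minimal sketch using only the standard library; the function names and the example reward values are illustrative assumptions, not part of any particular RLHF codebase.

```python
import math

def sigmoid(t: float) -> float:
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def preference_prob(r1: float, r2: float) -> float:
    """P(y1 preferred over y2 | x) under the Bradley-Terry model."""
    return sigmoid(r1 - r2)

# Hypothetical rewards for two responses to the same prompt.
p = preference_prob(2.0, 1.0)
# Adding the same constant c = 10 to both rewards leaves the difference unchanged.
p_shifted = preference_prob(2.0 + 10.0, 1.0 + 10.0)
print(p, p_shifted)
```

Both calls see the same reward difference (1.0), so they return the same probability, and swapping the arguments gives the complementary probability.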

The reward model loss #

Given a labeled pair (x, y⁺, y⁻), maximize log-likelihood:

ℓ(ϕ) = log σ(r_ϕ(x, y⁺) − r_ϕ(x, y⁻))

Equivalently, minimize the negative log-likelihood:

L_RM(ϕ) = − E[(x,y⁺,y⁻)] [ log σ(r_ϕ(x, y⁺) − r_ϕ(x, y⁻)) ]

Let Δ = r_ϕ(x, y⁺) − r_ϕ(x, y⁻). Then:

L_RM = −log σ(Δ)

and the gradient pushes Δ upward.

Architecture: how r_ϕ is implemented #

In LLM RLHF, r_ϕ is usually a copy of the base transformer with an added scalar “reward head” on top of the final hidden state (often the end-of-sequence token). Conceptually:

Vectors are bold: h, w.
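One common way to realize such a reward head is a single linear readout of the final hidden state, r = w · h + b. The sketch below assumes that form; the text does not spell out the exact layer, and the vectors here are made-up toy values rather than real model activations.

```python
def reward_head(h: list[float], w: list[float], b: float = 0.0) -> float:
    """Scalar reward as a linear readout of the final hidden state: r = w . h + b."""
    if len(h) != len(w):
        raise ValueError("h and w must have the same dimension")
    return sum(wi * hi for wi, hi in zip(w, h)) + b

# Hypothetical 3-dimensional hidden state at the last token, and head weights.
h_final = [0.5, -1.0, 2.0]
w = [1.0, 0.5, 0.25]
r = reward_head(h_final, w)
print(r)
```

In a real system h would come from the transformer and w, b would be trained jointly with (or on top of) the backbone using the preference loss.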

Calibration and identifiability #

Because only differences matter, reward values are only defined up to an additive constant. In practice:

This isn’t just cosmetic: PPO stability depends heavily on reward scale.

Why reward models fail (and why that matters) #

Reward modeling is supervised learning under distribution shift.

This mismatch creates “reward hacking”: the policy finds outputs that exploit blind spots in r_ϕ.

Common failure modes:

This is why the next mechanic (KL to π_ref) is not optional: it’s a containment strategy.

A short derivation: from pairwise labels to a margin-like objective #

Start with the loss for one example:

L = −log σ(r⁺ − r⁻)

Use σ(t) = 1/(1+e^(−t)):

L = −log( 1/(1+e^(−(r⁺−r⁻))) )

= log(1 + e^(−(r⁺−r⁻)))

= softplus(−(r⁺−r⁻))

So it behaves like a smooth hinge: if r⁺ is already much larger than r⁻, the loss is small; otherwise it pushes them apart.
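The softplus form is also the numerically convenient way to compute the loss. The following is a small sketch, assuming plain Python floats; the helper names are illustrative.

```python
import math

def softplus(t: float) -> float:
    """log(1 + e^t), computed without overflow for large |t|."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def rm_pair_loss(r_pos: float, r_neg: float) -> float:
    """Reward-model loss for one pair: -log sigma(r_pos - r_neg)."""
    return softplus(-(r_pos - r_neg))

# Loss is large when the ordering is wrong, log(2) at a tie, near 0 when well separated.
for delta in (-5.0, 0.0, 5.0):
    print(delta, rm_pair_loss(delta, 0.0))
```

Writing the loss as softplus(−Δ) avoids evaluating log σ(Δ) directly, which can underflow when Δ is very negative.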

Data collection notes (practical) #

Human preference datasets are typically built by:

Ranking can be reduced to pairwise comparisons (all pairs, or tournament-style). Pairwise is easiest to train with and scales well.
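The "all pairs" reduction can be sketched in a few lines. This assumes each ranking is a list ordered best to worst with no ties; the function name and the placeholder response strings are hypothetical.

```python
from itertools import combinations

def ranking_to_pairs(prompt: str, ranked: list[str]) -> list[tuple[str, str, str]]:
    """Expand a best-to-worst ranking into (prompt, preferred, rejected) tuples."""
    # combinations() preserves input order, so the first item of each pair ranks higher.
    return [(prompt, better, worse) for better, worse in combinations(ranked, 2)]

pairs = ranking_to_pairs("Explain KL divergence.", ["resp_A", "resp_B", "resp_C"])
for pair in pairs:
    print(pair)
```

A ranking of K responses yields K·(K−1)/2 pairs, so pairs from one ranking are correlated and are often batched together during training.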

Core Mechanic 2: Policy Optimization with PPO + KL Regularization #

Why RL at all: once you have r_ϕ, you want to change the policy π_θ so that it produces higher-reward outputs. This is not just supervised learning because the “label” depends on what the policy generates.

But vanilla policy gradients are high variance and can take destabilizingly large steps—especially with giant language models and imperfect rewards. PPO (Proximal Policy Optimization) is widely used because it constrains updates to be conservative.

The objective you actually want (regularized) #

A common RLHF objective is a KL-regularized expected reward:

J(θ) = E_{x∼D, y∼π_θ(·|x)} [ r_ϕ(x, y) − β · KL(π_θ(·|x) ‖ π_ref(·|x)) ]

Interpretation:

This can be implemented either as:

Penalty vs constraint (why you see β) #

In constrained form:

maximize E[r_ϕ]

subject to E[KL(π_θ‖π_ref)] ≤ δ

The penalty form is the Lagrangian relaxation with β as a Lagrange multiplier. Many systems adapt β online to hit a target KL.
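As an illustration of adapting β online, here is a minimal sketch of a proportional controller. The specific update rule, the clipping range, and the step size are assumptions chosen for illustration; real systems differ in the details.

```python
def update_beta(beta: float, observed_kl: float, target_kl: float,
                step_size: float = 0.1, max_rel_error: float = 0.2) -> float:
    """Raise beta when the measured KL is above target, lower it when below."""
    rel_error = (observed_kl - target_kl) / target_kl
    # Clip the error so a single noisy batch cannot swing beta too far.
    rel_error = max(-max_rel_error, min(max_rel_error, rel_error))
    return beta * (1.0 + step_size * rel_error)

beta = 0.1
for kl in (12.0, 9.0, 5.0, 6.0):  # hypothetical per-batch KL measurements
    beta = update_beta(beta, observed_kl=kl, target_kl=6.0)
    print(kl, beta)
```

The multiplicative form keeps β positive, and the controller leaves β unchanged when the measured KL equals the target.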

Sequence models: where does the KL come from? #

For an autoregressive policy:

π_θ(y|x) = ∏_{t=1}^T π_θ(y_t | x, y_{<t})

Then the log-prob decomposes:

log π_θ(y|x) = ∑_{t=1}^T log π_θ(y_t | x, y_{<t})

Token-level KL between two autoregressive policies over a sampled trajectory y is often estimated by:

KL(π_θ‖π_ref) ≈ ∑_{t=1}^T ( log π_θ(y_t|s_t) − log π_ref(y_t|s_t) )

where s_t = (x, y_{<t}). This is an on-policy sample estimate, not an exact KL over all y, but it’s practical.
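The sample estimate above amounts to summing per-token log-probability differences along one sampled response. A minimal sketch, assuming you already have the log-probabilities each model assigned to the sampled tokens (the numbers below are invented):

```python
import math

def sampled_kl(logp_policy: list[float], logp_ref: list[float]) -> float:
    """Single-sample KL estimate: sum over tokens of log pi_theta - log pi_ref."""
    if len(logp_policy) != len(logp_ref):
        raise ValueError("need one log-prob per sampled token from each model")
    return sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))

# Hypothetical probabilities assigned to three sampled tokens by each model.
logp_policy = [math.log(0.5), math.log(0.4), math.log(0.9)]
logp_ref = [math.log(0.25), math.log(0.4), math.log(0.6)]
kl_hat = sampled_kl(logp_policy, logp_ref)
print(kl_hat)
```

Because this uses a single trajectory, an individual estimate can be negative even though the true KL is never negative; only the average over many samples from π_θ approximates the KL.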

Shaped reward used in practice #

Often you define a per-trajectory shaped reward:

R(x, y) = r_ϕ(x, y) − β · ∑_{t=1}^T ( log π_θ(y_t|s_t) − log π_ref(y_t|s_t) )

Then you treat R as the return for policy gradient.

Notice something subtle: the penalty contains log π_θ, which depends on θ. This means you must be careful about what is treated as “reward” vs what is part of the optimization objective. PPO implementations handle this by incorporating the KL term explicitly or by adding it as an extra loss.

Why PPO (and what it does) #

Vanilla policy gradient would update θ proportional to:

∇_θ E[ R ] = E[ ∇_θ log π_θ(y|x) · (R − b(x)) ]

where b(x) is a baseline (value function) to reduce variance.

But if R is large or noisy, the update can be too big, changing π drastically, which:

PPO addresses this with a clipped surrogate objective.

PPO clipped objective (single-step view) #

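Let r_t(θ) = π_θ(y_t|s_t) / π_old(y_t|s_t) be the probability ratio between the current policy and the policy that generated the samples, and let Â_t be the advantage estimate at step t.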

Then PPO maximizes:

L_PPO(θ) = E_t [ min( r_t(θ) · Â_t , clip(r_t(θ), 1−ε, 1+ε) · Â_t ) ]

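Clipping does not hard-limit the probability ratio r_t(θ); it removes the incentive to move it outside [1−ε, 1+ε]. Once the ratio passes the bound in the direction the advantage favors, the objective stops improving and the gradient for that token vanishes. ε is typically ~0.1–0.2.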

In language modeling, “actions” are tokens y_t, and trajectories are sequences.
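The clipped objective for a single token can be sketched directly from the formula. This is a scalar illustration under assumed inputs (token probabilities under the new and old policies, and an advantage); real implementations work on batched log-probabilities.

```python
def ppo_clip_term(p_new: float, p_old: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = p_new / p_old
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 3.0 with positive advantage: the gain is capped at (1 + eps) * A.
capped = ppo_clip_term(p_new=0.06, p_old=0.02, advantage=4.0)
# Ratio 1.05 is inside the clip range, so the term is just r * A.
within = ppo_clip_term(p_new=0.021, p_old=0.02, advantage=4.0)
print(capped, within)
```

The same min is used for negative advantages. In that case the clipped branch is selected only when the ratio falls below 1 − ε; if the ratio is above 1 + ε the unclipped term is smaller, so the gradient still pushes the probability of the bad action back down.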

Advantage estimation in RLHF #

A standard actor-critic setup learns a value function V_ψ(s_t). You compute:

Â_t = G_t − V_ψ(s_t)

where G_t is a return estimate.

If the reward is terminal only (reward model evaluated at the end), every intermediate reward is zero: r_t = 0 for t < T, and r_T = r_ϕ(x, y).

Then a simple return is:

G_t = r_ϕ(x, y)

for all t along the trajectory (or discounted versions).

In practice, you often include per-token KL penalties, making rewards dense:

r_t = −β (logπ_θ(y_t|s_t) − logπ_ref(y_t|s_t))

and terminal r_T += r_ϕ(x, y)

Then you can compute G_t via discounted sums:

G_t = ∑_{k=t}^T γ^(k−t) r_k

Often γ is near 1; sometimes γ = 1 is used for episodic tasks.
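The dense-reward construction and the return computation can be sketched as follows. All names and numbers are illustrative assumptions: the log-probabilities, the reward-model score, and the value estimates are invented, and the sequence is assumed to be non-empty.

```python
def token_rewards(rm_score: float, logp_policy: list[float], logp_ref: list[float],
                  beta: float) -> list[float]:
    """Per-token KL penalty, with the reward-model score added at the last token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score  # terminal reward from r_phi(x, y)
    return rewards

def discounted_returns(rewards: list[float], gamma: float = 1.0) -> list[float]:
    """G_t = sum over k >= t of gamma^(k - t) * r_k, accumulated right to left."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = token_rewards(rm_score=2.0, logp_policy=[-1.0, -0.5, -2.0],
                        logp_ref=[-1.5, -0.5, -1.0], beta=0.1)
returns = discounted_returns(rewards, gamma=1.0)
values = [1.8, 2.0, 2.2]  # hypothetical critic outputs V(s_t)
advantages = [g - v for g, v in zip(returns, values)]
print(rewards, returns, advantages)
```

Note that a token the policy finds less likely than the reference does (the last one here) receives a positive KL term, since the per-token penalty is signed.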

Putting it together: the total loss #

A typical implementation minimizes a weighted sum:

L_total = L_actor + c_v · L_value + c_e · L_entropy + L_KL(optional)

Where:

Why KL to π_ref is specifically important in RLHF #

You might ask: “Isn’t PPO already conservative?” It is, relative to π_old. But RLHF needs conservatism relative to a trusted reference distribution π_ref for two reasons:

  1. Reward model validity: r_ϕ is trained on samples near π_ref/SFT. Staying close reduces out-of-distribution exploitation.

  2. Capability retention: π_ref encodes broad language competence. Without KL, optimizing narrow preferences can degrade general performance.

A helpful lens is: KL is a regularizer on the function space of policies, not just step size.

A brief derivation: KL-regularized RL as “reward + log prior” #

Consider maximizing:

E_{y∼π} [ r_ϕ(x,y) ] − β KL(π‖π_ref)

Expand KL:

KL(π‖π_ref) = E_{y∼π} [ log π(y|x) − log π_ref(y|x) ]

So the objective becomes:

J = E_{y∼π} [ r_ϕ(x,y) ] − β E_{y∼π}[ log π(y|x) − log π_ref(y|x) ]

= E_{y∼π} [ r_ϕ(x,y) − β log π(y|x) + β log π_ref(y|x) ]

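This shows two things:

  • The −β log π(y|x) term is an entropy bonus: it discourages an overly peaked policy.
  • The +β log π_ref(y|x) term acts like a log prior: it rewards outputs that π_ref already considers likely.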

This is one reason RLHF often behaves like: “Prefer what humans like, but among those, choose something plausible under the base model.”
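For a small finite set of outputs, the KL-regularized objective has a well-known closed-form maximizer, π*(y) ∝ π_ref(y) · exp(r(y)/β), which makes the "reward plus log prior" reading concrete. That closed form is a standard result added here for illustration rather than something derived in the text above, and the three-outcome distribution and rewards below are invented toy values.

```python
import math

def objective(pi: list[float], pi_ref: list[float], r: list[float], beta: float) -> float:
    """J(pi) = E_pi[r] - beta * KL(pi || pi_ref) over a finite set of outputs."""
    expected_reward = sum(p * ri for p, ri in zip(pi, r))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

def tilted_policy(pi_ref: list[float], r: list[float], beta: float) -> list[float]:
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [q * math.exp(ri / beta) for q, ri in zip(pi_ref, r)]
    z = sum(weights)
    return [w / z for w in weights]

pi_ref = [0.5, 0.3, 0.2]   # toy reference policy over three outputs
r = [0.0, 1.0, 2.0]        # toy rewards
beta = 1.0

pi_star = tilted_policy(pi_ref, r, beta)
j_star = objective(pi_star, pi_ref, r, beta)
j_ref = objective(pi_ref, pi_ref, r, beta)                        # KL = 0, reward only
j_greedy = objective([1e-9, 1e-9, 1.0 - 2e-9], pi_ref, r, beta)   # nearly all mass on the best output
print(pi_star, j_star, j_ref, j_greedy)
```

In this toy case the tilted policy beats both staying at π_ref and putting almost all mass on the highest-reward output, which is the trade-off β controls.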

Application/Connection: The Full RLHF Pipeline, Design Choices, and Failure Modes #

This section connects the mechanics into an end-to-end pipeline and highlights the engineering decisions that matter at scale.

End-to-end pipeline (typical) #

A common three-stage setup is:

  1. Pretrain a language model on next-token prediction (not RLHF itself).

  2. Supervised fine-tuning (SFT) on instruction-following demonstrations.

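  3. RLHF: collect human preference comparisons over model outputs, train the reward model r_ϕ on them, then optimize the policy against r_ϕ with PPO and a KL penalty toward π_ref.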

π_ref is often the SFT model (or a snapshot of the current policy before RL). The choice affects how conservative you are.

Key design axes (with tradeoffs) #

| Decision | Options | Why it matters | Typical choice |
| --- | --- | --- | --- |
| Preference labels | pairwise, ranking, scalar ratings | label noise + ease for humans | pairwise, or ranking → pairwise |
| Reward model | separate model, shared backbone, ensemble | generalization + reward hacking | separate RM, sometimes ensemble |
| Policy optimizer | PPO, A2C, DPO/IPO-style direct methods | stability + complexity | PPO for classic RLHF |
| KL control | fixed β, adaptive β, hard constraint | prevents drift | adaptive β targeting KL |
| Reward shaping | terminal only, + per-token KL, + length penalties | variance + behavior | terminal RM + token KL |

What PPO is optimizing in language model RLHF (concrete) #

Each iteration:

Even if you conceptually think “sequence reward,” implementations are token-based because:

Monitoring: what you watch during RLHF #

You typically track:

A healthy run often shows:

Failure modes (and what causes them) #

1) Reward hacking #

The policy finds outputs that exploit r_ϕ.

Root cause: r_ϕ is a learned proxy; it generalizes imperfectly. The optimization is strong and adversarial.

Mitigations:

2) Over-optimization and mode collapse #

Symptoms: repetitive outputs, generic safe responses, refusal everywhere.

Root cause: the learned reward landscape may have narrow peaks; PPO can push into them.

Mitigations:

3) Capability regression #

Symptoms: worse factuality, worse reasoning, lower performance on standard NLP benchmarks.

Root cause: RL step optimizes a narrow preference signal; without enough KL or with biased data, the model drifts away from broadly useful behaviors.

Mitigations:

4) Miscalibration / “reward likes confidence” #

Symptoms: confident wrong answers; refusal patterns.

Root cause: labelers prefer confident style, or comparisons don’t penalize confident errors enough.

Mitigations:

RLHF vs direct preference optimization (connection) #

RLHF (RM + PPO) is not the only way. There are “direct” methods (e.g., DPO-style) that skip explicit RL and use preference data to directly update π.

Why you still learn RLHF:

When KL regularization is not just “nice,” but necessary #

If r_ϕ is even slightly misspecified, the unconstrained optimum can be far from human intent. KL provides a trust region around π_ref where r_ϕ is more reliable.

Think of it like this: r_ϕ is a map of a city drawn from a small neighborhood. KL keeps you from driving into areas where the map is wrong.

Worked Examples (3) #

Training a reward model from one preference pair (logistic loss + gradient direction) #

You have a prompt x and two responses y⁺ (preferred) and y⁻ (rejected). The reward model outputs r⁺ = r_ϕ(x,y⁺) and r⁻ = r_ϕ(x,y⁻). Use the standard Bradley–Terry likelihood: P(y⁺ ≻ y⁻|x) = σ(r⁺ − r⁻).

  1. Define Δ = r⁺ − r⁻.

  2. Negative log-likelihood loss:

    L = −log σ(Δ).

  3. Rewrite using σ(Δ) = 1/(1+e^(−Δ)):

    L = −log(1/(1+e^(−Δ)))

    = log(1 + e^(−Δ))

    = softplus(−Δ).

  4. Differentiate w.r.t. Δ:

    ∂L/∂Δ = ∂/∂Δ log(1+e^(−Δ))

    = (1/(1+e^(−Δ))) · (−e^(−Δ))

    = − 1/(1+e^(Δ))

    = −σ(−Δ)

    = σ(Δ) − 1.

  5. Chain rule to see direction on r⁺ and r⁻:

    ∂L/∂r⁺ = ∂L/∂Δ · ∂Δ/∂r⁺ = (σ(Δ) − 1) · 1

    ∂L/∂r⁻ = ∂L/∂Δ · ∂Δ/∂r⁻ = (σ(Δ) − 1) · (−1) = 1 − σ(Δ).

  6. Interpretation:

    • If Δ is small (model unsure), σ(Δ) ≈ 0.5, so ∂L/∂r⁺ ≈ −0.5 (push r⁺ up) and ∂L/∂r⁻ ≈ +0.5 (push r⁻ down).
    • If Δ is already large, σ(Δ) ≈ 1, gradients go to 0 (pair is already separated).

Insight: The preference loss only cares about relative reward. Training pushes the preferred output’s reward above the rejected output’s reward, with diminishing pressure once the ordering is confidently correct.
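The gradient formula from step 4 can be sanity-checked against a finite difference. A minimal sketch, with illustrative function names and an arbitrary choice of test points:

```python
import math

def loss(delta: float) -> float:
    """L = log(1 + e^(-delta))."""
    return math.log1p(math.exp(-delta))

def grad_analytic(delta: float) -> float:
    """dL/d(delta) = sigma(delta) - 1."""
    return 1.0 / (1.0 + math.exp(-delta)) - 1.0

def grad_numeric(delta: float, h: float = 1e-6) -> float:
    """Central finite-difference approximation of dL/d(delta)."""
    return (loss(delta + h) - loss(delta - h)) / (2.0 * h)

for delta in (0.0, 2.0, 6.0):
    print(delta, grad_analytic(delta), grad_numeric(delta))
```

The two columns agree closely, and the gradient magnitude shrinks toward 0 as Δ grows, matching the "diminishing pressure" interpretation.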

KL-regularized objective expanded into an equivalent “shaped reward” form #

You want to optimize: J(θ) = E_{y∼π_θ(·|x)}[ r_ϕ(x,y) − β KL(π_θ(·|x)‖π_ref(·|x)) ]. Expand KL to see what signal the policy gradient is getting.

  1. Start with the definition:

    KL(π_θ‖π_ref) = E_{y∼π_θ}[ log π_θ(y|x) − log π_ref(y|x) ].

  2. Substitute into J:

    J = E_{y∼π_θ}[ r_ϕ(x,y) ] − β E_{y∼π_θ}[ log π_θ(y|x) − log π_ref(y|x) ].

  3. Combine expectations (same sampling distribution π_θ):

    J = E_{y∼π_θ}[ r_ϕ(x,y) − β log π_θ(y|x) + β log π_ref(y|x) ].

  4. Interpret each term:

    • r_ϕ(x,y): learn to produce outputs humans prefer.
    • −β log π_θ(y|x): encourages higher entropy (avoids overly peaky policy).
    • +β log π_ref(y|x): pulls probability mass toward what π_ref already considers likely.

  5. Autoregressive decomposition:

    log π_θ(y|x) = ∑_{t=1}^T log π_θ(y_t|s_t)

    log π_ref(y|x) = ∑_{t=1}^T log π_ref(y_t|s_t)

    So the KL-related terms naturally become token-level sums.

Insight: The KL term is not just a ‘penalty’; it turns π_ref into a probabilistic prior and adds an entropy-like term. This is why RLHF often improves preference without completely destroying fluency—when β is tuned correctly.

A tiny PPO ratio calculation on one token (why clipping prevents big jumps) #

At some token position t with state s_t, the old policy assigns probability 0.02 to token a, and the new policy assigns probability 0.06. Suppose the advantage estimate is Â_t = +4 and PPO ε = 0.2.

  1. Compute the probability ratio:

    r_t = π_new(a|s_t) / π_old(a|s_t) = 0.06 / 0.02 = 3.

  2. Unclipped surrogate contribution:

    r_t · Â_t = 3 · 4 = 12.

  3. Compute clipped ratio:

    clip(r_t, 1−ε, 1+ε) = clip(3, 0.8, 1.2) = 1.2.

  4. Clipped surrogate contribution:

    clip(r_t,...) · Â_t = 1.2 · 4 = 4.8.

  5. PPO takes the min for positive advantage:

    min(12, 4.8) = 4.8.

    So the objective gain for this token is capped.

Insight: Even if the optimizer tries to triple a token probability in one update, PPO’s clipping limits how much that move can improve the objective—reducing destructive, reward-chasing jumps.

Key Takeaways #

Common Mistakes #

Practice #

easy

You observe preference data where (x, y₁) is preferred to (x, y₂). Your reward model currently predicts r_ϕ(x,y₁)=1.0 and r_ϕ(x,y₂)=1.5. Compute the probability P(y₁ ≻ y₂|x)=σ(r₁−r₂) and the loss L=−log σ(r₁−r₂).

Hint: Compute Δ = r₁ − r₂ and apply σ(Δ)=1/(1+e^(−Δ)).

Show solution

Δ = 1.0 − 1.5 = −0.5.

P = σ(−0.5) = 1/(1+e^(0.5)) ≈ 1/(1+1.6487) ≈ 0.3775.

L = −log(0.3775) ≈ 0.9741.

medium

Show that maximizing E[r_ϕ(x,y) − β·KL(π_θ(·|x)‖π_ref(·|x))] is equivalent to maximizing E[r_ϕ(x,y) + β log π_ref(y|x) − β log π_θ(y|x)] under y∼π_θ. What conceptual role does β log π_ref play?

Hint: Expand KL(π‖π_ref) as an expectation of log ratios under π.

Show solution

KL(π_θ‖π_ref) = E_{y∼π_θ}[log π_θ(y|x) − log π_ref(y|x)].

So:

E[r_ϕ − β KL]

= E[r_ϕ] − β E[log π_θ − log π_ref]

= E[r_ϕ − β log π_θ + β log π_ref].

Conceptually, β log π_ref acts like a prior term that biases the optimized policy toward outputs that the reference policy already considers likely (helping preserve fluency/capabilities and reducing out-of-distribution drift).

hard

In a PPO step for one token, suppose π_old(a|s)=0.10, π_new(a|s)=0.13, ε=0.2, and advantage Â=−3. Compute the unclipped term r·Â and the clipped term clip(r,1−ε,1+ε)·Â, then the PPO contribution min(r·Â, clip(r,1−ε,1+ε)·Â). Which term is selected, and why?

Hint: The PPO objective always takes the min of the two terms, whatever the sign of Â. Work out which term is smaller when Â is negative and r > 1+ε.

Show solution

r = 0.13/0.10 = 1.3.

Unclipped: r·Â = 1.3·(−3) = −3.9.

Clipped ratio: clip(1.3, 0.8, 1.2) = 1.2.

Clipped: 1.2·(−3) = −3.6.

PPO takes the min of the two: min(−3.9, −3.6) = −3.9, so the unclipped term is used.

Clipping does not apply here because the new policy has raised the probability of an action with negative advantage, which is a move in the wrong direction. The objective keeps its full gradient so the optimizer can push r back down. With Â < 0, clipping only takes effect when r < 1−ε, where it removes the incentive to decrease the action's probability any further.

Connections #

Policy Gradient Methods

Actor-Critic

Proximal Policy Optimization (PPO)

KL Divergence

Preference Learning / Bradley–Terry Models

Direct Preference Optimization (DPO)

Reward Hacking & Specification Gaming

Quality: A (4.4/5)
