Chapter 23 · Full excerpt

Reward models and RLHF.

RLHF does two things: it trains a reward model on human preference data, and then it optimizes a language model against that reward while a KL penalty keeps it from sprinting away from the reference policy. The reward model turns sparse human judgments into a dense scoring function. The policy optimizer — PPO in the original recipe, GRPO in the version this chapter builds — uses that scoring function to shift probability mass toward responses the reward model prefers.

This chapter builds both pieces from scratch on top of a small SFT checkpoint, then breaks the reward model on purpose. You watch the reward curve climb while task accuracy quietly falls. That is reward hacking, not as a slogan, but as a plot.

This is Chapter 23 of Under The Hood — Build Every Layer of a Large Language Model from Scratch. The full 35-project book is on Leanpub. Code companion at github.com/mechramc/Under-the-hood.

The concept

Supervised fine-tuning teaches a model to imitate examples. If your dataset says "user: solve this math problem, assistant: here is the correct answer," then the model learns: when I see something like this, continue with something that looks like that. That gets you behavior. It does not get you judgment.

After SFT, you still need a way to say "response A is better than response B." That is the reward model: a second model whose job is not to answer the question, but to grade the answer. The language model writes the answer, the reward model grades it, and RLHF (Reinforcement Learning from Human Feedback) trains the writer to satisfy the grader. That phrasing should make you suspicious immediately. If the grader rewards confidence, length, or cheer instead of correctness, the model learns those habits. If the grader has a flaw, optimization will find it. The whole chapter fits in one sentence: RLHF does not directly optimize truth, usefulness, or safety; it optimizes a score, and everything depends on what that score actually measures.

The mental image I keep coming back to is a student being graded by a tired teacher. Once the student figures out that the teacher only skims the first paragraph and looks for confident-sounding sentences, the student stops working on the rest of the answer. The student is not malicious. The grading is just easy to game.

From SFT to RLHF

Instruction tuning trains on demonstrations; RLHF trains on preferences. Those are not the same task. A supervised model asks "what would a good answer look like, based on this dataset?" An RLHF-trained model asks "what answer gets the highest score from this reward model?" Sometimes those line up. Sometimes they do not. When they stop lining up, you get reward hacking. RLHF is not an upgrade from SFT. It is a different tool with different failure modes. Treating it as "SFT but more" is exactly the trap that makes the reward-hacking outcome surprising when you first see it on your own dashboard.

The plain-English loop

The loop without math is short. Start with a language model that can already produce answers. Give it a prompt, generate several candidate responses, ask the reward model to score each one, treat the higher-scoring responses as better, and then update the language model so it becomes more likely to produce those better-scoring responses in the future. If you can explain that loop clearly in an interview, you already understand more than most people who only know the acronym.

Why a separate reward model exists

Why not skip the reward model and train directly on thumbs-up and thumbs-down? Because the training loop needs a fast, automated signal it can query for every sampled response. The signal does not need to be differentiable, since the policy gradient never backpropagates through the reward, but it does need to be cheap. A thumbs-up is sparse. A reward model turns sparse human judgments into a dense scoring function, which means you can score lots of outputs automatically, cheaply, and repeatedly. Humans cannot sit inside every training loop step and grade thousands of outputs per minute, so we collect preference data once, train a reward model to imitate those preferences, and let that reward model act as the automated judge. That is efficient. It is also dangerous, because the automated judge is now part of the system, and the policy will learn its blind spots.

Earn the equation

Let x be the prompt, y be a candidate response, and r(x, y) be the reward model score. RLHF tries to change the policy parameters θ so the model produces responses with higher reward:

maximize over θ:  E[ r(x, y) ]  where y ~ π_θ(·|x)

Read that in plain English: change the model so that, on average, the responses it generates get higher reward scores. That is all. The whole problem is visible from there. If r measures the wrong thing, optimization drives the model in the wrong direction. The optimizer is not confused. It is obedient.

Why it matters

Without reward modeling, we can only imitate labeled examples. That is enough to make a model answer in the right format, but it is not enough to shape behavior along axes like helpfulness, harmlessness, honesty, refusal style, brevity versus detail, or safety preferences. Those are preferences, not next-token continuations. RLHF gives us a knob for preferences.

The downside is that we now have a proxy objective sitting between us and the behavior we actually care about. This is not a minor implementation detail. It is the central engineering fact of alignment. We rarely optimize the true objective directly. We optimize a measurable substitute, and the gap between the substitute and the real goal is where things break.

With a reward model, we can express things that training data alone may not capture cleanly: prefer correct math over confident wrong math, prefer harmless refusals over harmful compliance, prefer concise answers over rambling ones. We can collect pairwise preference labels — "A is better than B," "B is safer than A" — train a reward model on those judgments, and make that judgment reusable at scale. SFT says: imitate this answer. RLHF says: among these answers, become more like the ones that score better. That second loop is how most assistant behaviors get sharpened.

Strong opinion: a lot of discussion about AI alignment stays abstract. What is useful about this project is that it makes the failure mode concrete and reproducible. We build a flawed judge, optimize against it, and watch the score diverge from actual quality.

The build

We will build this in three layers: train a reward model, use it to score generated responses, and run GRPO so the language model learns to prefer higher-scoring outputs. The starting point is the SFT checkpoint from Chapter 21. The shape of the pipeline does not change with model size: one script to load a base or SFT policy, one module for reward scoring, one script for the policy update, and one task source — GSM8K is convenient because correctness is checkable.

Keep the structure in mind first:

policy generates  →  reward model grades  →  GRPO updates policy

Step 1 — Build the reward dataset

A reward model needs labeled comparisons. For a small chapter-sized experiment, the simplest path is a pairwise comparison format with three fields: prompt, chosen response, and rejected response. We train the reward model so it assigns a higher score to the chosen response.

For this project, use GSM8K prompts, create correct and incorrect candidate answers, and label the correct ones higher. That gives the reward model a narrow job: score math responses by likely correctness and useful reasoning structure.

The negatives should not be cartoonishly bad. If all the bad answers are nonsense, the reward model learns a fake task. Bad answers should often be plausible — "3 plus 4 gives 7 apples" (wrong but structured) or "We combine the quantities carefully and get 14 apples" (wrong but polished). That polished wrong answer is exactly the kind of response that causes trouble later.
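
One comparison record might look like the following. This is a minimal sketch that assumes the pairs are stored as JSON Lines with the three fields named above. The class name PreferencePair, the helpers make_pair and to_jsonl, and the sample problem are illustrative assumptions, not the book's actual data format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the label says is better
    rejected: str  # response the label says is worse


def make_pair(question: str, correct: str, wrong: str) -> PreferencePair:
    """Build one comparison; refuse degenerate pairs with identical text."""
    if correct.strip() == wrong.strip():
        raise ValueError("chosen and rejected must differ")
    return PreferencePair(prompt=question, chosen=correct, rejected=wrong)


def to_jsonl(pairs: list[PreferencePair]) -> str:
    """Serialize pairs as JSON Lines, one comparison per line."""
    return "\n".join(json.dumps(asdict(p)) for p in pairs)


# Hypothetical example: the rejected answer is polished but adds instead of multiplying.
pairs = [
    make_pair(
        "A box holds 6 pencils. How many pencils are in 4 boxes?",
        "6 * 4 = 24. Answer: 24",
        "We combine the quantities carefully and get 10 pencils. Answer: 10",
    )
]
jsonl = to_jsonl(pairs)
print(jsonl)
```

Keeping the rejected response fluent and well-formatted is the point of the exercise: the only thing separating the two fields should be the property you want the reward model to learn.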

Step 2 — Reward model architecture

A reward model is usually just a language model with a scalar head on top. Feed the prompt and response through a transformer, take the final hidden representation, pass it through a linear layer, and output one number. You are reusing the language model's representation ability, but instead of predicting the next token, you are predicting "how good is this whole response?"

The scalar head is tiny. Most of the work is still done by the transformer body:

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, backbone, d_model):
        super().__init__()
        self.backbone = backbone          # transformer, same as policy
        self.value_head = nn.Linear(d_model, 1, bias=True)

    def forward(self, input_ids, attention_mask):
        # hidden: (B, T, d_model)
        hidden = self.backbone(input_ids, attention_mask=attention_mask)

        # take the last non-pad token's hidden state per row
        # (assumes right-padded batches, so real tokens come first)
        last_idx = attention_mask.sum(dim=1) - 1            # (B,)
        rows = torch.arange(hidden.size(0), device=hidden.device)
        h_last = hidden[rows, last_idx]

        score = self.value_head(h_last).squeeze(-1)         # (B,)
        return score

Two things worth noting. First, you read the hidden state at the last real token, not at a fixed position; the prompt+response lengths vary, and you do not want the model to score a pad token. The index attention_mask.sum(dim=1) - 1 is only correct for right-padded batches; with left padding the last real token always sits at position T - 1. Second, the value head is just a single linear projection. It does not need to be more than that. The transformer below it already knows how to summarize text.
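
The last-token lookup is easy to get subtly wrong, so here is a small sketch on a toy tensor. It assumes PyTorch is installed; the helper name last_token_state and the toy values are made up for illustration. It shows the index landing on the right position for a right-padded batch, and on a pad position when the same formula is applied to a left-padded batch.

```python
import torch

# Toy batch: 2 rows, 4 positions, hidden size 1, so values are easy to read.
hidden = torch.tensor([[[10.0], [11.0], [12.0], [13.0]],
                       [[20.0], [21.0], [22.0], [23.0]]])   # (B=2, T=4, d=1)

# Right-padded mask: row 0 has 4 real tokens, row 1 has 2 real tokens.
right_mask = torch.tensor([[1, 1, 1, 1],
                           [1, 1, 0, 0]])


def last_token_state(hidden, attention_mask):
    """Pick the hidden state at index (number of real tokens - 1) per row."""
    last_idx = attention_mask.sum(dim=1) - 1                 # (B,)
    rows = torch.arange(hidden.size(0), device=hidden.device)
    return hidden[rows, last_idx]                            # (B, d)


# Row 0 reads position 3, row 1 reads position 1: both are real tokens.
picked = last_token_state(hidden, right_mask).squeeze(-1)

# Left-padded mask with the same lengths: row 1's real tokens are at positions 2 and 3.
left_mask = torch.tensor([[1, 1, 1, 1],
                         [0, 0, 1, 1]])

# The formula still returns index 1 for row 1, which is now a pad position.
picked_left = last_token_state(hidden, left_mask).squeeze(-1)
print(picked.tolist(), picked_left.tolist())
```

If your tokenizer pads on the left, the simplest adjustment is to read position T - 1 for every row instead of computing an index from the mask.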

Step 3 — The Bradley-Terry pairwise loss

Given pairwise chosen-vs-rejected comparisons, we want the reward model to assign a higher score to the chosen response. Let s_chosen be the score for the chosen response and s_rejected for the rejected one. The standard objective is the Bradley-Terry loss:

L = -log(sigmoid(s_chosen - s_rejected))

Read it in plain English: if the chosen response already scores much higher than the rejected one, the loss is small. If the rejected response scores too high, the loss gets large. That teaches ranking, not absolute calibration. In code:

def reward_loss(model, batch):
    s_chosen   = model(batch["chosen_ids"],   batch["chosen_mask"])
    s_rejected = model(batch["rejected_ids"], batch["rejected_mask"])

    # Bradley-Terry: model the prob that chosen beats rejected
    margin = s_chosen - s_rejected
    loss = -torch.nn.functional.logsigmoid(margin).mean()

    with torch.no_grad():
        pair_acc = (margin > 0).float().mean()
    return loss, pair_acc

The validation metric you actually trust is pair accuracy: on held-out pairs, how often does the model rank the chosen one higher than the rejected one? Reward loss going down is necessary but not sufficient. The mean loss can keep falling simply because the margins on pairs the model already ranks correctly keep widening, while the pairs it gets wrong stay wrong. Pair accuracy keeps you honest.
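
The gap between loss and pair accuracy can be made concrete with a few hand-picked margins. The sketch below uses only the standard library; the margin values are invented for illustration and are not measurements from the chapter's reward model.

```python
import math


def bt_loss(margin: float) -> float:
    """Bradley-Terry loss for one pair: -log(sigmoid(margin))."""
    # log1p(exp(-m)) equals -log(sigmoid(m)); fine for the small margins used here.
    return math.log1p(math.exp(-margin))


def summarize(margins: list[float]) -> tuple[float, float]:
    """Return (mean loss, pair accuracy) for chosen-minus-rejected margins."""
    mean_loss = sum(bt_loss(m) for m in margins) / len(margins)
    pair_acc = sum(m > 0 for m in margins) / len(margins)
    return mean_loss, pair_acc


# Hypothetical margins on four held-out pairs at two points in training.
early = [1.0, 1.0, 1.0, -0.5]   # three pairs ranked correctly, one wrong
late = [4.0, 4.0, 4.0, -0.5]    # same ranking, larger margins on the easy pairs

early_loss, early_acc = summarize(early)
late_loss, late_acc = summarize(late)
print(f"early: loss={early_loss:.3f} acc={early_acc:.2f}")
print(f"late:  loss={late_loss:.3f} acc={late_acc:.2f}")
```

The mean loss drops by almost half between the two snapshots, yet the model ranks exactly the same three of four pairs correctly. That is why the loss curve alone cannot tell you whether the reward model has learned anything new.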

Step 4 — Sanity-check before RL

This is where many people get burned. Do not jump into RL just because the reward loss went down. Before touching GRPO, sample a dozen prompt-response pairs and inspect reward scores by hand. What you are looking to verify: correct answers score higher than incorrect ones, concise correct answers are not unfairly punished, long wrong answers do not dominate, empty refusals do not get weirdly high scores, and sentiment is not overpowering correctness.

I skipped this step exactly once. The reward loss looked clean; the validation pair accuracy was a respectable 86%. The reward model still gave higher scores to longer wrong answers than to short correct ones, because the training data had been silently length-biased. I lost three days of RL runs to a check that takes twenty minutes.

The point is not whether your reward model is perfect. It will not be. The point is that you know its leaks before you start optimizing against it. RL does not fix a bad reward model. It weaponizes it.
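
Part of the hand inspection can be scripted. The sketch below checks two of the leaks listed above: whether reward correlates with response length, and how often a longer wrong answer outscores a shorter correct one. The function names, the sample records, and the reward values are all hypothetical; in practice the rewards would come from your trained reward model.

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def n_words(sample: dict) -> int:
    return len(sample["response"].split())


def length_bias_report(samples: list[dict]) -> dict:
    """samples: dicts with 'response' (str), 'reward' (float), 'correct' (bool)."""
    lengths = [float(n_words(s)) for s in samples]
    rewards = [s["reward"] for s in samples]
    # Count (wrong, correct) pairs where the wrong answer is longer AND scores higher.
    inversions = sum(
        1
        for w in samples if not w["correct"]
        for c in samples if c["correct"]
        if n_words(w) > n_words(c) and w["reward"] > c["reward"]
    )
    return {
        "length_reward_corr": pearson(lengths, rewards),
        "long_wrong_beats_short_correct": inversions,
    }


# Hypothetical scores for three responses to the same prompt.
samples = [
    {"response": "5 * 2 = 10. Answer: 10", "reward": 0.9, "correct": True},
    {"response": "Let us verify each step carefully and systematically. "
                 "The total is 12. Answer: 12", "reward": 1.8, "correct": False},
    {"response": "Answer: 10", "reward": 0.4, "correct": True},
]
report = length_bias_report(samples)
print(report)
```

A strongly positive correlation together with a nonzero inversion count is the pattern to worry about. Neither number proves a length bias on its own, but both are cheap to compute before committing to an RL run.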

Step 5 — The GRPO loop

Now move to the policy side. For each prompt, sample several candidate responses from the current policy, score each one with the reward model, and compare them within the group. GRPO — Group Relative Policy Optimization — treats the above-average scorers as "better" within this group: not universally perfect, just winning the local comparison. Raw reward values are noisy on an absolute scale. Local ranking is more reliable.

Turn scores into advantages by subtracting the per-prompt mean. If a group of 4 responses scores [2.7, 2.3, 0.8, 0.5], the mean is 1.575 and the advantages are roughly [+1.125, +0.725, -0.775, -1.075]. Positive means "make this more likely." Negative means "make this less likely." You are not claiming a response with score 2.7 has that value in some cosmic sense. You are saying it did better than average inside this group.
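
The arithmetic above can be checked in a few lines. The sketch uses only the standard library, and the function name group_advantages is an assumption. The optional normalize flag divides by the group standard deviation as well, which is how the published GRPO formulation defines the advantage; the chapter's loop uses mean-centering only, so the flag defaults to off.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], normalize: bool = False) -> list[float]:
    """Subtract the group mean; optionally divide by the group standard deviation."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize:
        return centered
    sigma = pstdev(rewards)
    # Small epsilon avoids division by zero when all rewards in a group are equal.
    return [c / (sigma + 1e-8) for c in centered]


scores = [2.7, 2.3, 0.8, 0.5]
adv = group_advantages(scores)
adv_norm = group_advantages(scores, normalize=True)
print([round(a, 3) for a in adv])
print([round(a, 3) for a in adv_norm])
```

Mean-centered advantages always sum to zero within a group, so every update pushes some responses up and others down. If all responses in a group score the same, every advantage is zero and that prompt contributes no gradient.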

Once you have advantages, the policy update is a single line. For a sampled response y with advantage A, the per-sample loss is:

L_policy = - A * log π_θ(y | x)

If A is positive, minimizing this loss increases the probability of y. If A is negative, it decreases it. That is the whole trick. The full loop in working form:

for prompts in dataloader:
    # 1. Generate G candidate responses per prompt.
    #    Assumes generate() pads every output to the same length T so they stack.
    candidates = [policy.generate(prompts) for _ in range(G)]
    candidates = torch.stack(candidates, dim=1)              # (B, G, T)

    # 2. Score each (prompt, response) with the reward model.
    with torch.no_grad():
        rewards = reward_model.score(prompts, candidates)    # (B, G)

    # 3. Per-prompt advantages = rewards minus the group mean.
    advantages = rewards - rewards.mean(dim=1, keepdim=True) # (B, G)

    # 4. Recompute log-probs under the current policy.
    logps = policy.logprob(prompts, candidates)              # (B, G)

    # 5. Policy gradient: push probability toward positive advantage.
    pg_loss = -(advantages.detach() * logps).mean()

    # 6. KL penalty: keep the policy near the reference (frozen SFT).
    kl = kl_to_reference(policy, ref_policy, prompts, candidates)
    loss = pg_loss + beta * kl

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
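
The loop calls policy.logprob(prompts, candidates) without showing what is inside. One plausible shape for such a helper is sketched below, assuming PyTorch and assuming the caller has already aligned logits with the tokens they predict. The function name sequence_logprob and the response_mask argument are assumptions for illustration, not the book's API. For a (B, G, T) batch you would flatten the first two dimensions into N = B * G before calling it.

```python
import torch
import torch.nn.functional as F


def sequence_logprob(logits, token_ids, response_mask):
    """Sum of log-probs of the sampled tokens, counted over response positions only.

    logits:        (N, T, V) scores, where logits[:, t] predicts token_ids[:, t]
    token_ids:     (N, T)    sampled token ids
    response_mask: (N, T)    1.0 for response tokens, 0.0 for prompt and padding
    """
    logp_all = F.log_softmax(logits, dim=-1)                  # (N, T, V)
    # Pick out the log-prob assigned to the token that was actually sampled.
    logp_tok = logp_all.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)  # (N, T)
    return (logp_tok * response_mask).sum(dim=-1)             # (N,)


# Toy check: uniform logits over V=4 give log(1/4) per counted token.
logits = torch.zeros(2, 3, 4)
token_ids = torch.tensor([[0, 1, 2], [3, 3, 3]])
response_mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]])
seq_lp = sequence_logprob(logits, token_ids, response_mask)
print(seq_lp)
```

The mask matters for the same reason it matters in the reward model: without it, prompt tokens and padding would contribute to the log-probability and the gradient would push on positions the policy never chose.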

Step 6 — Why the KL penalty exists

Real RLHF systems add a KL penalty — Kullback-Leibler divergence — that sounds technical but is just a leash. It punishes the policy for drifting too far from a reference model, usually the frozen SFT model we started from. The KL term is computed token by token along the sampled response, and added to the policy loss with a coefficient beta:

def kl_to_reference(policy, ref_policy, prompts, responses):
    # Token-level log-probs from both models on the same sampled tokens.
    logp_policy = policy.token_logprobs(prompts, responses)   # (B, G, T)
    with torch.no_grad():
        logp_ref = ref_policy.token_logprobs(prompts, responses)

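    # Per-token log-ratio on the sampled tokens: log π(a|s) - log π_ref(a|s).
    # Summed over the response and averaged over samples, this is a Monte Carlo
    # estimate of KL(π || π_ref). Assumes token_logprobs returns 0 at pad positions.
    kl_per_token = logp_policy - logp_ref                     # (B, G, T)
    return kl_per_token.sum(dim=-1).mean()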

The reason we need that leash is that reward optimization will otherwise push the model into weird corners of behavior very quickly. If the reward model has any loopholes, the policy will find them faster when it is allowed to drift freely. So the RL loss contains two opposing forces: reward pressure pushing toward higher scores, and KL pressure pulling back toward the original distribution. Think of it as an accelerator and a brake. Without the brake, reward hacking arrives sooner.
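
One detail of the token-level KL estimate is worth seeing with numbers: padded positions must not contribute. The sketch below uses plain Python lists instead of tensors, and the per-token probabilities are invented for illustration. The function name sampled_kl and the explicit mask argument are assumptions, not part of the chapter's code.

```python
import math


def sampled_kl(logp_policy, logp_ref, mask):
    """Sum of (log pi - log pi_ref) over real response tokens for one sequence."""
    return sum(m * (lp - lr) for lp, lr, m in zip(logp_policy, logp_ref, mask))


# Hypothetical per-token probabilities: 4 response tokens followed by 2 pad positions.
p_policy = [0.5, 0.4, 0.9, 0.6, 0.1, 0.1]
p_ref = [0.4, 0.4, 0.5, 0.7, 0.8, 0.8]
logp_policy = [math.log(p) for p in p_policy]
logp_ref = [math.log(p) for p in p_ref]
mask = [1, 1, 1, 1, 0, 0]

kl_masked = sampled_kl(logp_policy, logp_ref, mask)
kl_unmasked = sampled_kl(logp_policy, logp_ref, [1] * len(mask))
print(f"masked: {kl_masked:.3f}  unmasked: {kl_unmasked:.3f}")
```

With the mask, the estimate is positive, which matches the intuition that the policy has moved away from the reference. Without it, the two pad positions dominate and drag the estimate negative. A single sampled estimate can also be legitimately negative on real tokens; it is only the expectation over samples that is guaranteed to be non-negative.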

Step 7 — What to log

On every RL step, log at least four things: mean reward on the batch, mean KL divergence to the reference policy, task accuracy on a held-out GSM8K subset, and sample outputs from a fixed eval set. The reason you need both reward and accuracy is that reward alone will lie to you. A healthy run looks like this: reward steadily increasing, accuracy improving early then flattening, KL increasing slowly. An unhealthy run shows the same rising reward, but accuracy flat or falling, KL increasing sharply, and outputs becoming more verbose, more formulaic, and less correct. That unhealthy picture is the entire lesson of the next section.
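
Task accuracy needs an answer checker that is independent of the reward model. A minimal sketch follows, assuming the final number in a response is the model's answer. That convention, the regular expression, and the function names are all assumptions; adjust them to whatever answer format your prompts ask for.

```python
import re

# Matches integers and decimals, allowing thousands separators such as 1,250.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str):
    """Return the last number in the text as a float, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def exact_match_accuracy(responses: list[str], gold_answers: list[float]) -> float:
    """Fraction of responses whose final number equals the gold answer."""
    hits = 0
    for resp, gold in zip(responses, gold_answers):
        pred = extract_final_number(resp)
        if pred is not None and abs(pred - gold) < 1e-6:
            hits += 1
    return hits / len(gold_answers)


responses = [
    "5 notebooks at $2 each gives $10. Answer: 10.",
    "Let us solve carefully. The correct total is 12 dollars. "
    "Therefore the answer is 12.",
]
acc = exact_match_accuracy(responses, [10.0, 10.0])
print(acc)
```

Because the checker never sees the reward score, it cannot be gamed by the same loophole. That independence is what makes the reward-versus-accuracy comparison meaningful.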

BREAK IT

Building the loop is useful. Breaking it proves why every piece exists. There are two failures to force on purpose: run RL much longer than feels safe, and then sharply weaken — or remove — the KL leash. The first failure teaches you what reward hacking looks like in your own logs. The second teaches you what the KL term was actually buying.

Break 1

Run RLHF long enough to watch the reward model lie

Take the short GRPO run from the build and increase the RL horizon. If the script has max_steps = 10, change it to max_steps = 1000. Keep the prompt set, the reward model, the policy family, and the sampler fixed. Do not silently change multiple things at once. You want one clean causal story: same task, same judge, much longer optimization pressure.

Then watch four signals together: reward score, task accuracy on the held-out GSM8K subset, average response length, and sample outputs at fixed eval prompts. A typical pattern emerges. Steps 0–50: reward rises and accuracy rises a bit. Steps 50–150: reward keeps rising and accuracy peaks or plateaus. Steps 150–1000: reward keeps climbing and accuracy drifts downward, while response length creeps up and outputs become more formulaic.

The exact step numbers vary by setup. The shape is the point. The reward keeps saying "better." Reality says "worse." That tells you the reward model is only a proxy, and you have just run the optimizer long enough to find the gap between the proxy and the real objective.
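
Once the run has finished, the point where reward and accuracy part ways can be located from the logs. The sketch below is one simple heuristic, not a standard method: it finds the accuracy peak, then reports the first later step where accuracy has fallen by more than a tolerance while reward is still above its value at the peak. The function name, the tolerance, and the logged numbers are all invented for illustration and are not results from the chapter.

```python
def divergence_step(steps, reward, accuracy, tolerance=0.02):
    """First logged step after the accuracy peak where accuracy has dropped by more
    than `tolerance` below the peak while reward exceeds its value at the peak."""
    peak_i = max(range(len(accuracy)), key=lambda i: accuracy[i])
    for i in range(peak_i + 1, len(steps)):
        acc_dropped = accuracy[i] < accuracy[peak_i] - tolerance
        reward_rose = reward[i] > reward[peak_i]
        if acc_dropped and reward_rose:
            return steps[i]
    return None  # no divergence found in the logged range


# Hypothetical log: reward climbs throughout, accuracy peaks and then decays.
steps = [0, 50, 100, 150, 300, 600, 1000]
reward = [0.2, 0.6, 0.9, 1.1, 1.5, 1.9, 2.3]
accuracy = [0.31, 0.36, 0.38, 0.38, 0.35, 0.30, 0.24]

knee = divergence_step(steps, reward, accuracy)
print(knee)
```

The tolerance keeps ordinary evaluation noise from being reported as divergence. With a small held-out subset, accuracy can move by a point or two between checkpoints for no meaningful reason.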

The outputs themselves usually drift in a recognizable direction. An early policy might answer "5 notebooks at $2 each gives $10. Answer: 10." A later hacked policy might answer "Let us solve carefully and verify each step. Since the store offers 5 notebooks priced at $2 per notebook, the correct total after systematic multiplication is 12 dollars. Therefore the answer is 12." That looks more polished. If the reward model mistakenly likes structure and confidence too much, it will score that higher than it should. You are not watching the model become smarter. You are watching it become better at appearing the way your judge measures.

To see this in code, swap the reward model out for one trained on the wrong axis. The cleanest version is a sentiment reward model — train it to prefer positive-toned text, then run RL on GSM8K with that as the only reward:

# Same GRPO loop, but with a reward model whose preference axis
# has nothing to do with math correctness.
reward_model = SentimentRewardModel.from_pretrained("rm_sentiment.pt")

for prompts in gsm8k_loader:
    candidates = [policy.generate(prompts) for _ in range(G)]
    candidates = torch.stack(candidates, dim=1)

    with torch.no_grad():
        rewards = reward_model.score(prompts, candidates)  # rewards cheer
    advantages = rewards - rewards.mean(dim=1, keepdim=True)
    logps      = policy.logprob(prompts, candidates)
    # KL term left out of this snippet; the loss is reward pressure only.
    loss       = -(advantages.detach() * logps).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

You are telling the policy: "On math problems, produce the answer that sounds most positive." The result is a language model that learns to answer math questions in a way that sounds upbeat, not correct. Outputs drift toward "Great question! Let's work through this together. You are doing amazing. The answer is 14." The tone improves. The math does not. This proves something stronger than "reward models can be flawed." It proves that the reward model defines the task more than the prompt does. The prompt says "solve math." The reward says "be cheerful." RL follows the reward.
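
You can see the mismatch between prompt and reward without training anything. The sketch below stands in for a learned sentiment reward model with a toy word-counting scorer. The word list and the function name are hypothetical, and a real sentiment model would be a trained network, but the ranking it produces on these two responses is the behavior described above.

```python
# Hypothetical lexicon; a learned sentiment model would replace this.
CHEER_WORDS = {"great", "amazing", "wonderful", "together", "awesome"}


def toy_sentiment_reward(response: str) -> float:
    """Fraction of words that appear in the cheer lexicon."""
    words = [w.strip(".,!?'\"").lower() for w in response.split()]
    return sum(w in CHEER_WORDS for w in words) / max(len(words), 1)


plain_correct = "5 notebooks at $2 each gives $10. Answer: 10."
cheerful_wrong = ("Great question! Let's work through this together. "
                  "You are doing amazing. The answer is 14.")

r_plain = toy_sentiment_reward(plain_correct)
r_cheer = toy_sentiment_reward(cheerful_wrong)
print(f"plain correct: {r_plain:.2f}  cheerful wrong: {r_cheer:.2f}")
```

The scorer never looks at the arithmetic, so the wrong answer wins. Any policy trained against it has no reason to keep the math right.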

Break 2

Cut the KL leash and watch the policy collapse faster

The first break shows reward hacking with the safety rail in place. The second shows what that rail was actually doing. Run the same long-horizon experiment three times, sweeping only the KL coefficient:

python train_grpo.py ... --beta 0.05   # baseline leash
python train_grpo.py ... --beta 0.005  # weak leash
python train_grpo.py ... --beta 0.0    # no leash at all

Hold everything else identical: the policy, the reward model, the prompt set, the sampler, the optimizer, the number of steps. The only thing that changes between runs is how strongly the policy is anchored to the reference SFT model. With beta = 0.05, you typically see reward rise steadily, KL drift up slowly, and accuracy peak somewhere in the middle of the run before drifting down. With beta = 0.005, the same shape arrives sooner and the KL trace climbs faster. With beta = 0.0, the policy is unmoored: KL increases sharply, outputs become noticeably weird within a few hundred steps, repetition and stylization grow, and accuracy can fall off a cliff while the reward model is still cheerfully reporting "this is great."

Without the reference policy as an anchor, the model is free to sprint into degenerate modes that the reward model still likes. The KL term is not preventing reward hacking — that is what Break 1 already showed — it is slowing down the rate at which a flawed judge gets weaponized. The KL term is a damage limiter, not a cure. If you ever read an RLHF result that does not report KL alongside reward, you are reading half of the story. The missing half is almost always the half you would have argued with.

The deeper lesson under both breaks is the same. Optimizing a proxy objective and optimizing the real objective are different problems. The reward score approximates quality, but the approximation has edges. More optimization pressure does not smooth those edges; it tends to expose them faster. The failure mode does not require anything exotic. An imperfect reward function plus an optimizer plus enough steps is a combination that shows up in ordinary engineering work, not only in research settings. Once you have watched it once on your own dashboard, you will recognize it the next time it appears in a production system that someone else trained.

Questions to answer

During the 10-step GRPO run, what changed first: reward score, answer style, or actual GSM8K accuracy — and what does that ordering tell you about which metric is leading and which is lagging?
In the 1000-step run, at what step did reward and accuracy stop moving together? Is the divergence smooth, or does it have a visible knee?
What did the model learn to do that the reward model liked but the task did not actually require? Be specific — point to a phrase, a structure, or a tone.
When you swapped in the sentiment reward model, what behavior changed most visibly: tone, structure, or answer correctness? Which axis did the policy follow, the prompt's or the reward's?
What tradeoff is the KL penalty protecting? What got better and what got worse when you reduced beta from 0.05 to 0.005 to 0.0?

Go further

Training Language Models to Follow Instructions with Human Feedback (Ouyang et al., 2022). The original large-scale RLHF framing: supervised fine-tuning first, then preference modeling and policy optimization. Read this once you have your own GRPO loop running so the design decisions feel earned.
Direct Preference Optimization (Rafailov et al., 2023). A clean alternative that drops the explicit RL rollout loop and trains directly from preference pairs. After this chapter you can read the paper as a direct comparison: same data, different optimizer, different failure surface.
Chapter 21 — Fine-Tuning and Instruction Tuning. The behavior-shaping contract starts there. This chapter sharpens it: the judge is part of the contract too. Once you add RLHF, you are no longer training a model to answer. You are training a model to satisfy a grader.
Verifier-based training and process reward models. For narrow tasks like math or code, a stricter external checker often beats a learned reward model. Search "MATH benchmark," "GSM8K verifier," and "reward bench" for the current state of practice.

What you now know

Supervised fine-tuning and RLHF should feel like different tools now, not variations of the same recipe. A reward model is no longer hand-wavy: it takes a prompt and a response and returns one scalar score through a single linear head sitting on top of a transformer body. The Bradley-Terry loss should also feel concrete — it teaches ranking on pairs, not absolute calibration. GRPO is concrete too: sample several answers, compare them within the group, reinforce the ones the reward model prefers. The KL penalty reads correctly now — it is a leash that slows destructive drift toward reward-model loopholes, not a cure for them. When you see a rising reward curve, the real question is "rising according to whom?" Reward hacking is not a metaphor. It is an engineering failure with a visible mechanism: a flawed proxy objective, repeated optimization, policy drift, apparent metric improvement, and real task degradation. That lets you say something stronger than "alignment is hard." It gets hard at the point where your measurable judge stops matching what you actually care about.

Continue the build

This was Chapter 23 of 35.

The full book is 934 pages and 256,587 words. 35 hands-on projects from autograd to fused specialists. PDF and EPUB on Leanpub, lifetime free updates.
