RLHF and preference optimization from scratch.
Build it. Break it. Measure it.
Pretraining gives you a model that knows the distribution of text. Post-training gives you a model that does what you ask. This page is the entry point into the stretch of Under The Hood where reward models, PPO, GRPO, and DPO get written from the loss function up — and where the reward-hacking failure mode gets reproduced on purpose so you can recognise it later.
What post-training actually does.
Pretraining produces a model that has read most of the internet. It is good at one task: continue the text in a way that is statistically likely given the corpus. That is not the same as answering a question, refusing a harmful request, or staying on task. Those are preferences, not next-token continuations.
Post-training is the stack of techniques that turn a pretrained base into an assistant. Three layers sit on top of each other. Supervised fine-tuning teaches the model to imitate examples — given an instruction, produce something that looks like a good response. That gets you behavior in the right format. It does not get you judgment.
The second layer is the reward model: a separate model whose job is not to answer the question but to grade the answer. You collect pairwise preference labels ("response A is better than response B") and train a small scalar-head transformer to imitate those judgments. The reward model turns sparse human ratings into a dense score that can be computed for any response the policy produces, so the policy always has something to chase. That is efficient. It is also where most of the trouble starts, because the policy will learn the reward model's blind spots faster than it will learn the underlying task.
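The pairwise training signal for a reward model can be sketched in a few lines. This assumes the common Bradley-Terry formulation, where the loss is the negative log-sigmoid of the score gap. The function name `pairwise_reward_loss` and the plain-float inputs are illustrative choices, not the book's code.

```python
import math

def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    Pushes the scalar score of the chosen response above the rejected one:
    loss = -log(sigmoid(score_chosen - score_rejected)).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A reward model that cannot tell the pair apart pays log(2) per pair.
undecided = pairwise_reward_loss(0.0, 0.0)
# Scoring the chosen response higher lowers the loss; the reverse raises it.
agrees = pairwise_reward_loss(2.0, 0.0)
disagrees = pairwise_reward_loss(0.0, 2.0)
```

The loss only sees the gap between the two scores, so the absolute scale of the reward is never pinned down by the data.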
The third layer is the policy update. Classical RLHF uses PPO, where the policy generates candidates, the reward model scores them, and an actor-critic loop pushes log-probabilities toward higher-scoring outputs. GRPO simplifies this by replacing the value function with a group-relative baseline: sample several responses per prompt, treat above-average scorers as winners inside that group. A KL penalty against the reference SFT model acts as a leash so the policy cannot sprint into degenerate corners that the reward model happens to like.
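The group-relative baseline and the KL leash are both small pieces of arithmetic. The sketch below assumes the commonly used variant that normalizes by the group's standard deviation, and it folds the KL penalty into the reward, which is one of several placements in practice. Function names and the `eps` parameter are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response relative to its own group.

    rewards: reward-model scores for several responses to ONE prompt.
    Above-average responses get a positive advantage, below-average a
    negative one; no learned value function is involved.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def kl_penalized_reward(reward, policy_logp, ref_logp, beta=0.1):
    """Reward minus a per-sample KL estimate against the reference model.

    policy_logp - ref_logp is a single-sample estimate of the KL term;
    beta sets how tight the leash is.
    """
    return reward - beta * (policy_logp - ref_logp)

# Four sampled answers to one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

If every response in a group gets the same reward, every advantage is zero and that prompt contributes no gradient.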
DPO collapses the whole thing into one step. The closed-form solution of the KL-constrained reward maximization, substituted back into the Bradley-Terry preference model, makes the reward model algebraically vanish. What is left is a single classification-style loss on preference pairs, written entirely in terms of the policy and a frozen reference. No reward model, no rollouts, no value function. The cluster builds both — PPO/GRPO and DPO — because the choice between them is not a one-line answer, and the only way to make it well is to have run both and seen what each does to your data.
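The algebra behind that collapse can be written out in four steps. This is the standard derivation as usually presented; the symbols (policy \pi_\theta, reference \pi_{\mathrm{ref}}, reward r, partition function Z, chosen y_w, rejected y_l) are notation assumed here, not taken from the book.

```latex
\begin{align}
% 1. KL-constrained reward maximization
\max_{\pi}\;& \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \\
% 2. Closed-form optimum, then solved for the reward
\pi^{*}(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)
\;\;\Longrightarrow\;\;
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
% 3. Bradley-Terry preference model
p(y_w \succ y_l \mid x) &= \sigma\big( r(x, y_w) - r(x, y_l) \big) \\
% 4. Substitute: beta log Z(x) cancels in the difference
\mathcal{L}_{\mathrm{DPO}}(\theta) &= -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\end{align}
```

The partition function Z(x) depends only on the prompt, so it drops out of the chosen-minus-rejected difference. That cancellation is what removes the need to ever compute it.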
The projects this cluster covers.
The cluster covers Part VI of Under The Hood — the post-training stack from instruction following through preference optimization and test-time reasoning, with tool use at the end. Each project lands a working pipeline first and then breaks it on purpose.
The mini example.
The whole DPO loss fits in about fifteen lines. Two log-probability sums for the chosen completion (policy and reference), two for the rejected, one sigmoid, one mean. No reward model, no rollouts, no value function.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # chosen-minus-rejected log-probability gap, under the policy and under the reference
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # implicit reward difference: chosen should beat rejected
    logits = beta * (pi_logratios - ref_logratios)
    # classification-style loss: -log(sigmoid(gap)), averaged over the batch
    return -F.logsigmoid(logits).mean()

# each *_logps is a tensor holding, per example, the sum of token-level
# log-probabilities over the completion (not the prompt) under the named model.
# the reference model is frozen; only the policy gets gradients.
That is the entire algorithm. The reward model is hiding inside the log-ratio. Train against this loss on a preference dataset and you have, in effect, run RLHF — without ever instantiating a separate reward model, value function, or on-policy rollout loop. The whole Project 24 build is about earning the right to read those fifteen lines and know exactly what they are doing and why.
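One way to see the hidden reward model is to compute the implicit rewards for a single pair by hand. The sketch below is a plain-Python, single-example version of the same arithmetic, written for illustration only. The name `dpo_loss_single` and all log-probability values are made up.

```python
import math

def dpo_loss_single(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair, plus the implicit rewards it is built from."""
    # implicit reward: beta * log(pi / pi_ref) for each completion
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)); fine for the small margins used here,
    # not hardened against overflow for very negative margins
    loss = math.log1p(math.exp(-margin))
    return loss, chosen_reward, rejected_reward

# Step 0: the policy is a copy of the reference, so both implicit rewards
# are zero and the loss is log(2), about 0.6931.
loss_init, _, _ = dpo_loss_single(-12.0, -15.0, -12.0, -15.0)

# Later: the policy has moved toward the chosen completion and away from
# the rejected one (hypothetical numbers), so the loss drops below log(2).
loss_later, r_chosen, r_rejected = dpo_loss_single(-12.0, -15.0, -13.0, -14.0)
```

A starting loss near 0.693 is a quick sanity check that the policy and reference really are the same model at step 0.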
Why BREAK IT matters here.
Post-training has the most photogenic failure modes in the whole book. The reward curve climbs and the model gets worse. The loss converges cleanly on inverted labels. The verifier hits 90 percent held-out accuracy and the system still collapses when you plug it in. Each of these is reproduced on purpose in the cluster.
"Take the short GRPO run and increase max_steps from 10 to 1000. Keep everything else fixed. Reward keeps climbing. GSM8K accuracy rises for a bit, peaks, then drifts down. Outputs become longer, more confident, more formulaic. The optimizer is not confused. It is obedient. The reward model is only a proxy, and a thousand steps of optimization pressure is enough to find every place the proxy disagrees with the task."
Three failure modes get isolated. Reward hacking in Project 23: a flawed proxy plus optimization pressure plus enough steps produces a model that looks like it is improving and is not. The fix is not a better optimizer. The fix is end-to-end accuracy evaluation, a KL leash that you do not silently weaken, and reading the actual outputs at every checkpoint.
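The "reward up, accuracy down" signature is easy to check for mechanically once both numbers are logged per checkpoint. The following is a minimal sketch of such a check; the function name, the `tolerance` parameter, and the logged values are all hypothetical, and the numbers are invented for illustration rather than taken from a real run.

```python
def find_reward_hacking_onset(checkpoints, tolerance=0.0):
    """Return the first step where proxy reward rose but held-out accuracy
    fell more than `tolerance` below its best value so far, else None.

    checkpoints: iterable of (step, mean_reward, heldout_accuracy).
    """
    best_accuracy = None
    previous_reward = None
    for step, reward, accuracy in checkpoints:
        if best_accuracy is not None:
            reward_rose = reward > previous_reward
            accuracy_fell = accuracy < best_accuracy - tolerance
            if reward_rose and accuracy_fell:
                return step
        best_accuracy = accuracy if best_accuracy is None else max(best_accuracy, accuracy)
        previous_reward = reward
    return None

# Made-up log: reward climbs throughout, accuracy peaks and then drifts down.
run_log = [
    (10, 0.20, 0.31),
    (100, 0.45, 0.38),
    (300, 0.60, 0.41),
    (600, 0.78, 0.37),
    (1000, 0.91, 0.29),
]
onset = find_reward_hacking_onset(run_log, tolerance=0.02)
```

A check like this only flags the divergence. It does not replace reading the outputs at each checkpoint.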
Label flipping in Project 24: swap chosen and rejected on every preference pair, retrain DPO, and the loss curve is indistinguishable from the correctly-labeled run. The gradients are healthy. The implicit-reward gap grows at the same rate. The only signal that something is wrong lives outside the training loop, in an external eval the model never sees. Label quality is the single most important variable in DPO, and the training procedure has no instrument to second-guess it.
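The symmetry behind this break can be checked on a single pair. In the sketch below, one policy learns the correct preference and a mirror-image policy learns the inverted one by the same amount. All names and log-probability values are hypothetical.

```python
import math

def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((policy_chosen_logp - policy_rejected_logp)
                     - (ref_chosen_logp - ref_rejected_logp))
    return math.log1p(math.exp(-margin))

# Sequence log-probabilities for response A (actually better) and B.
ref = {"A": -13.0, "B": -14.0}
good_policy = {"A": -11.0, "B": -16.0}  # moved 2 nats toward A, 2 away from B
bad_policy = {"A": -15.0, "B": -12.0}   # mirror image: 2 toward B, 2 away from A

# Clean labels: chosen = A, rejected = B, scored against the good policy.
clean_loss = dpo_pair_loss(good_policy["A"], good_policy["B"], ref["A"], ref["B"])

# Flipped labels: chosen = B, rejected = A, scored against the bad policy.
flipped_loss = dpo_pair_loss(bad_policy["B"], bad_policy["A"], ref["B"], ref["A"])

# The two losses are identical: nothing inside the loss records which
# response was actually better.
same = math.isclose(clean_loss, flipped_loss)
```

The loss only measures agreement with the labels it was handed, which is why the evidence has to come from an external eval.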
Reference-policy divergence in the KL ablation: drop the beta coefficient sharply or remove it entirely, and the policy sprints into degenerate modes the reward model still likes. KL is not decoration. It is a damage limiter. Not a cure — a limiter. Each of these three breaks teaches the same lesson at a different layer: optimization is faithful to whatever signal it is given, and the engineering work is in making sure the signal is the right one.
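The effect of weakening the leash shows up in the closed-form optimum of the KL-constrained objective, where the optimal policy is proportional to the reference probability times exp(reward / beta). The toy below uses three candidate responses with made-up reference probabilities and rewards. The third response stands in for a degenerate output that the reward model overrates. The function name is hypothetical.

```python
import math

def kl_constrained_optimum(ref_probs, rewards, beta):
    """Optimal policy for: maximize E[reward] - beta * KL(policy || reference).

    Closed form: policy(y) is proportional to ref(y) * exp(reward(y) / beta).
    beta must be > 0; removing the penalty entirely is the limit beta -> 0.
    """
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)  # subtract the max before exponentiating, for stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

ref_probs = [0.6, 0.3, 0.1]   # reference model: the degenerate output is rare
rewards = [0.0, 0.5, 1.0]     # reward model: the degenerate output scores best

tight_leash = kl_constrained_optimum(ref_probs, rewards, beta=100.0)
loose_leash = kl_constrained_optimum(ref_probs, rewards, beta=0.05)
# tight_leash stays close to ref_probs; loose_leash puts almost all of its
# mass on the third response.
```

With a large beta the policy barely moves from the reference. With a small one it collapses onto whatever the reward model scores highest, regardless of how unlikely the reference found it.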
FAQ
What is the difference between RLHF and DPO?
RLHF trains a separate reward model on preference pairs, then optimizes the language model against that reward with PPO or GRPO using on-policy sampling and a KL leash to a reference policy. DPO skips the reward model. It uses the closed-form solution of the KL-constrained RLHF objective to write a single classification-style loss directly on preference pairs. Same data in, similar behavior out, with fewer moving parts and no rollouts.
Why is reward hacking hard to avoid?
Because the reward model is a proxy, not the real objective. Optimization will find any pattern that scores high, including patterns the model trainer never intended — verbosity, confident tone, formulaic structure, cheerful sentiment. The reward number keeps rising while the task quality drifts down. The defense is end-to-end accuracy evaluation on a held-out set, a KL penalty against the reference policy, and reading sample outputs by hand at every checkpoint. None of that prevents reward hacking. It just lets you see it before you ship.
Do I need an actual human label dataset?
Not to learn the method. The cluster uses two public datasets (Anthropic's HH-RLHF and OpenBMB's UltraFeedback) and a small synthetic preference set built from GSM8K correct-versus-wrong answers for the reward model project. UltraFeedback is scored by GPT-4 rather than labeled by humans, which ties directly to the failure mode in Project 24's BREAK IT: a bad label source produces a confidently misaligned model with the same training stability as a clean one.
Is DPO replacing PPO?
DPO is replacing PPO as the default starting point, not across the board. PPO and GRPO are still in the toolbox for cases where you need multi-objective reward shaping, online preference collection, or reward signals that do not fit pairwise comparisons. For most assistant-style alignment work since 2023, DPO and its variants (KTO, ORPO, SimPO) are the path of least resistance. The book builds both so the choice is informed, not vibes-based.
Can I run this without a GPU?
Most of it, yes, at small scale. A 125M model with a 1000-example subset of UltraFeedback runs DPO for 100 steps on CPU and lets you verify the loss math, the log-ratio gap, and the inverted-label break. The GRPO and PPO projects benefit from a consumer GPU with 12 to 24 GB of VRAM for the actual reward-hacking demonstration over 1000 steps. The math is the same at every scale; only the wall-clock and the visible degradation change.
Open Chapter 23 tonight.
Six projects get you from a working SFT checkpoint through PPO, DPO, test-time reasoning, and tool use, with deliberate failure experiments at every layer. The book is on Leanpub with lifetime updates.