Cluster · Post-training

RLHF and preference optimization from scratch.

Build it. Break it. Measure it.

Pretraining gives you a model that knows the distribution of text. Post-training gives you a model that does what you ask. This page is the entry point into the stretch of Under The Hood where reward models, PPO, GRPO, and DPO get written from the loss function up — and where the reward-hacking failure mode gets reproduced on purpose so you can recognise it later.

Buy on Leanpub — $15.99 ~~$19.99~~ Read Chapter 23 free

What post-training actually does.

Pretraining produces a model that has read most of the internet. It is good at one task: continue the text in a way that is statistically likely given the corpus. That is not the same as answering a question, refusing a harmful request, or staying on task. Those are preferences, not next-token continuations.

Post-training is the stack of techniques that turn a pretrained base into an assistant. Three layers sit on top of each other. Supervised fine-tuning teaches the model to imitate examples — given an instruction, produce something that looks like a good response. That gets you behavior in the right format. It does not get you judgment.

The second layer is the reward model: a separate model whose job is not to answer the question but to grade the answer. You collect pairwise preference labels ("response A is better than response B") and train a small scalar-head transformer to imitate those judgments. The reward model turns a finite set of human ratings into a scalar score that can be computed for any response the policy produces. That is efficient. It is also where most of the trouble starts, because the policy will learn the reward model's blind spots faster than it will learn the underlying task.
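
Training on pairwise labels is usually framed as a Bradley-Terry objective: the loss is low when the chosen response scores above the rejected one. The following is a minimal sketch in plain Python, assuming the scalar scores have already been produced by the reward head; the function names and the toy scores are illustrative and not taken from the book.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)), computed as -softplus(-x).
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def pairwise_reward_loss(chosen_scores, rejected_scores):
    """Mean Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    losses = [-log_sigmoid(c - r) for c, r in zip(chosen_scores, rejected_scores)]
    return sum(losses) / len(losses)

# Toy scalar scores standing in for reward-head outputs (made up for illustration).
untrained = pairwise_reward_loss([0.0, 0.0], [0.0, 0.0])  # no preference yet: log(2)
trained = pairwise_reward_loss([2.0, 1.5], [-1.0, 0.0])   # chosen scored higher: lower loss
```

Only the difference between the two scores enters the loss, so the reward model's absolute scale is unconstrained. That is one reason raw reward values are hard to compare across separately trained reward models.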

The third layer is the policy update. Classical RLHF uses PPO, where the policy generates candidates, the reward model scores them, and an actor-critic loop pushes log-probabilities toward higher-scoring outputs. GRPO simplifies this by replacing the learned value function with a group-relative baseline: sample several responses per prompt, normalize each reward against the group's mean and standard deviation, and reinforce the responses that score above the group average. A KL penalty against the reference SFT model acts as a leash so the policy cannot sprint into degenerate corners that the reward model happens to like.
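
The group-relative baseline and the KL leash can each be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the book's implementation: the function names, `eps`, and the default `beta` are hypothetical, and the KL term uses the simple per-sample log-ratio estimate that PPO-style RLHF commonly folds into the reward (GRPO implementations often apply a KL term in the loss instead).

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of several responses sampled from ONE prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # Above-average responses get positive advantages, below-average negative.
    return [(r - mean) / (std + eps) for r in rewards]

def kl_penalized_reward(reward, policy_logp, ref_logp, beta=0.05):
    """Subtract beta * (log pi - log pi_ref), a per-sample estimate of the KL term."""
    return reward - beta * (policy_logp - ref_logp)

rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. 1.0 when the final answer is correct
advantages = group_relative_advantages(rewards)
```

When every response in a group gets the same reward, all advantages are zero and that prompt contributes no gradient, which is why group size and reward variance matter in practice.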

DPO collapses the whole thing into one step. The closed-form solution of the KL-constrained reward maximization, substituted back into the Bradley-Terry preference model, makes the reward model algebraically vanish. What is left is a single classification-style loss on preference pairs, written entirely in terms of the policy and a frozen reference. No reward model, no rollouts, no value function. The cluster builds both — PPO/GRPO and DPO — because the choice between them is not a one-line answer, and the only way to make it well is to have run both and seen what each does to your data.
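
The algebra behind that paragraph can be written out as follows. This is a sketch of the standard derivation; the symbols $r$, $Z(x)$, $y_w$ (chosen), $y_l$ (rejected), and $\sigma$ (the logistic function) are introduced here for illustration and do not appear in the text above.

```latex
% 1. KL-constrained reward maximization
\[
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\]

% 2. Closed-form optimum, with partition function Z(x)
\[
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)
\]

% 3. Solve for the reward in terms of the optimal policy
\[
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
\]

% 4. Bradley-Terry preference model
\[
p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
\]

% 5. Substitute step 3 into step 4: the beta log Z(x) terms cancel
\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \log \sigma\!\left(
  \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]
```

The cancellation in step 5 is what removes the reward model: $Z(x)$ depends only on the prompt, so it drops out of any difference between two responses to the same prompt.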

The projects this cluster covers.

The cluster covers Part VI of Under The Hood — the post-training stack from instruction following through preference optimization and test-time reasoning, with tool use at the end. Each project lands a working pipeline first and then breaks it on purpose.

Project 21

Fine-tuning and instruction tuning

Take a pretrained base, build an instruction dataset, run SFT with LoRA on a single consumer GPU. The chapter that everything in this cluster builds on. BREAK IT: train on demonstrations that confuse style and correctness and watch the model adopt the cadence instead of the content.

Project 22

Evaluation methodology for aligned models

Win-rate evals with length control, judge-model bias, held-out preference splits. Why a rising number is not the same as a better model, and what to log so the difference shows up before you ship.
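
One simple way to see whether a win rate is driven by response length is to split it by which response was longer. The sketch below is a basic diagnostic under assumed field names (`candidate_won`, `candidate_len`, `baseline_len` are hypothetical), with invented toy verdicts; it is not a regression-based length-controlled estimator and is not taken from the book.

```python
def win_rate(records):
    """Fraction of comparisons the candidate model won."""
    return sum(r["candidate_won"] for r in records) / len(records)

def win_rate_by_length(records):
    """Split the win rate by whether the candidate's response was the longer one."""
    longer = [r for r in records if r["candidate_len"] > r["baseline_len"]]
    not_longer = [r for r in records if r["candidate_len"] <= r["baseline_len"]]
    return {
        "overall": win_rate(records),
        "when_longer": win_rate(longer) if longer else None,
        "when_not_longer": win_rate(not_longer) if not_longer else None,
    }

# Toy judge verdicts, invented for illustration only.
records = [
    {"candidate_won": True,  "candidate_len": 420, "baseline_len": 180},
    {"candidate_won": True,  "candidate_len": 390, "baseline_len": 200},
    {"candidate_won": False, "candidate_len": 150, "baseline_len": 210},
    {"candidate_won": False, "candidate_len": 160, "baseline_len": 160},
]
report = win_rate_by_length(records)
```

A large gap between `when_longer` and `when_not_longer` suggests the judge may be rewarding length rather than quality, which is worth checking before trusting the overall number.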

Project 23

Reward models and RLHF

Build a pairwise reward model with a scalar head, sanity-check it by hand, run GRPO on GSM8K, then extend the run to 1000 steps and watch reward and accuracy decouple. The reward-hacking experiment in this chapter is the cluster's defining lesson. Read the full chapter →

Project 24

DPO and preference optimization

Derive the DPO loss from the RLHF objective by hand. Train on UltraFeedback. Sweep beta. Implement KTO, ORPO, and SimPO with the same scaffolding. BREAK IT: invert the labels and watch a confidently misaligned model train with the exact same loss curve as the aligned one.

Project 25

Test-time reasoning: CoT, self-consistency, best-of-N

Six inference strategies on GSM8K: direct, chain-of-thought, self-consistency, best-of-N with an outcome reward model, step-level search with a process reward model, and a tiny MCTS. Plot accuracy against compute, then deliberately train a broken PRM and watch the verifier confidently reward wrong patterns.
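
Two of these strategies reduce to very small selection rules once the samples exist. The following sketch assumes the final answers have already been extracted from the sampled generations; `toy_scores` is a hypothetical stand-in for an outcome reward model, and all names are illustrative.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers extracted from sampled chains of thought."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score_fn):
    """Return the candidate that the scoring function rates highest."""
    return max(candidates, key=score_fn)

sampled_answers = ["18", "18", "20", "18", "17"]  # toy final answers from 5 samples
voted = self_consistency(sampled_answers)

# Hypothetical stand-in scorer; a real one would be a trained outcome reward model.
toy_scores = {"answer A": 0.2, "answer B": 0.9, "answer C": 0.5}
picked = best_of_n(list(toy_scores), toy_scores.get)
```

Self-consistency needs no verifier but only works when answers can be compared for equality. Best-of-N works on free-form output but inherits every blind spot of the scorer.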

Project 26

Tool use and function calling

Teach the model to call a calculator, a retriever, and a code sandbox. Format the tool-call protocol, train the model to use it, and measure how much accuracy comes from the tools versus from the policy. BREAK IT: corrupt one tool's outputs and watch the model trust it anyway.

The mini example.

The whole DPO loss fits in about fifteen lines. Two log-probability sums for the chosen completion (policy and reference), two for the rejected, one sigmoid, one mean. No reward model, no rollouts, no value function.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # chosen-vs-rejected log-ratio, under the policy and under the reference
    pi_logratios  = policy_chosen_logps  - policy_rejected_logps
    ref_logratios = ref_chosen_logps     - ref_rejected_logps

    # implicit reward difference: chosen should beat rejected
    logits = beta * (pi_logratios - ref_logratios)

    # classification-style loss: -log sigmoid of the gap, averaged over the batch
    return -F.logsigmoid(logits).mean()

# each *_logps is the sum of token-level log-probabilities
# over the completion (not the prompt) under the named model.
# the reference model is frozen; only the policy gets gradients.

That is the entire algorithm. The reward model is hiding inside the log-ratio. Train against this loss on a preference dataset and you have, in effect, run RLHF — without ever instantiating a separate reward model, value function, or on-policy rollout loop. The whole Project 24 build is about earning the right to read those fifteen lines and know exactly what they are doing and why.
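
The inputs to that loss are sums of token-level log-probabilities over the completion only. A plain-Python sketch of that step follows, assuming the logits are already shifted so that `step_logits[t]` is the distribution that predicts `token_ids[t]`; the names `step_logits` and `completion_mask` are hypothetical. In a tensor library the same computation is typically a log-softmax followed by a gather and a masked sum.

```python
import math

def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def completion_logp(step_logits, token_ids, completion_mask):
    """Sum log p(token) over positions where completion_mask is 1 (prompt positions are 0)."""
    total = 0.0
    for logits, tok, keep in zip(step_logits, token_ids, completion_mask):
        if keep:
            total += log_softmax(logits)[tok]
    return total

# Toy case: vocabulary of 4, uniform logits, one prompt token then two completion tokens.
step_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
logp = completion_logp(step_logits, token_ids=[1, 2, 3], completion_mask=[0, 1, 1])
```

Masking out the prompt matters: without it, the loss would also depend on how likely each model finds the prompt itself, which is shared by the chosen and rejected sequences and carries no preference signal.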

Why BREAK IT matters here.

Post-training has the most photogenic failure modes in the whole book. The reward curve climbs and the model gets worse. The loss converges cleanly on inverted labels. The verifier hits 90 percent held-out accuracy and the system still collapses when you plug it in. Each of these is reproduced on purpose in the cluster.

From Project 23 — Reward Models and RLHF

"Take the short GRPO run and increase max_steps from 10 to 1000. Keep everything else fixed. Reward keeps climbing. GSM8K accuracy rises for a bit, peaks, then drifts down. Outputs become longer, more confident, more formulaic. The optimizer is not confused. It is obedient. The reward model is only a proxy, and a thousand steps of optimization pressure is enough to find every place the proxy disagrees with the task."

Three failure modes get isolated. Reward hacking in Project 23: a flawed proxy plus optimization pressure plus enough steps produces a model that looks like it is improving and is not. The fix is not a better optimizer. The fix is end-to-end accuracy evaluation, a KL leash that you do not silently weaken, and reading the actual outputs at every checkpoint.
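
End-to-end evaluation only helps if reward and accuracy are logged side by side and compared. Below is a small sketch of one such check, assuming each checkpoint records a step, a mean reward, and a held-out accuracy. The function name, the `tolerance` default, and the numbers are hypothetical; the history is toy data shaped like the failure described above, not measurements from the book.

```python
def find_decoupling(history, tolerance=0.02):
    """Return the first step where reward hits a new high while accuracy
    has fallen more than `tolerance` below its best value so far."""
    best_reward = float("-inf")
    best_accuracy = float("-inf")
    for step, reward, accuracy in history:
        reward_at_new_high = reward > best_reward
        best_reward = max(best_reward, reward)
        best_accuracy = max(best_accuracy, accuracy)
        if reward_at_new_high and accuracy < best_accuracy - tolerance:
            return step
    return None

# (step, mean reward, held-out accuracy): illustrative toy values only.
history = [
    (10, 0.20, 0.31), (100, 0.45, 0.38), (300, 0.60, 0.41),
    (600, 0.74, 0.37), (1000, 0.85, 0.33),
]
flagged_step = find_decoupling(history)
```

A check like this flags a checkpoint for inspection. It does not replace reading the outputs at that checkpoint.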

Label flipping in Project 24: swap chosen and rejected on every preference pair, retrain DPO, and the loss curve is indistinguishable from the correctly-labeled run. The gradients are healthy. The implicit-reward gap grows at the same rate. The only signal that something is wrong lives outside the training loop, in an external eval the model never sees. Label quality is the single most important variable in DPO, and the training procedure has no instrument to second-guess it.
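
The reason the loss cannot tell the two runs apart is a symmetry in the objective: swapping chosen and rejected simply negates the quantity the policy is asked to increase. A plain-Python sketch with made-up log-probabilities for two responses A and B follows; all values and names are illustrative assumptions.

```python
import math

def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss on summed completion log-probabilities."""
    logit = beta * ((policy_chosen - policy_rejected) - (ref_chosen - ref_rejected))
    # Stable form of -log(sigmoid(logit)).
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))

# Reference log-probabilities for responses A and B (toy numbers).
ref_a, ref_b = -40.0, -42.0

# Before training the policy equals the reference, so the loss is log(2) either way.
at_init = dpo_pair_loss(ref_a, ref_b, ref_a, ref_b)

# Run 1: labels say A is chosen; the policy's preference for A exceeds the reference's by 5 nats.
clean = dpo_pair_loss(policy_chosen=-38.0, policy_rejected=-45.0,
                      ref_chosen=ref_a, ref_rejected=ref_b)

# Run 2: labels flipped, so B is "chosen"; the policy's preference for B
# exceeds the reference's by the same 5 nats.
flipped = dpo_pair_loss(policy_chosen=-40.0, policy_rejected=-43.0,
                        ref_chosen=ref_b, ref_rejected=ref_a)
```

Both runs report the same loss while encoding opposite preferences, so only an evaluation that knows which response is actually better can separate them.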

Reference-policy divergence in the KL ablation: drop the beta coefficient sharply or remove it entirely, and the policy sprints into degenerate modes the reward model still likes. KL is not decoration. It is a damage limiter. Not a cure — a limiter. Each of these three breaks teaches the same lesson at a different layer: optimization is faithful to whatever signal it is given, and the engineering work is in making sure the signal is the right one.

FAQ

What is the difference between RLHF and DPO?

RLHF trains a separate reward model on preference pairs, then optimizes the language model against that reward with PPO or GRPO using on-policy sampling and a KL leash to a reference policy. DPO skips the reward model. It uses the closed-form solution of the KL-constrained RLHF objective to write a single classification-style loss directly on preference pairs. Same data in, similar behavior out, with fewer moving parts and no rollouts.

Why is reward hacking hard to avoid?

Because the reward model is a proxy, not the real objective. Optimization will find any pattern that scores high, including patterns the model trainer never intended — verbosity, confident tone, formulaic structure, cheerful sentiment. The reward number keeps rising while the task quality drifts down. The defense is end-to-end accuracy evaluation on a held-out set, a KL penalty against the reference policy, and reading sample outputs by hand at every checkpoint. None of that prevents reward hacking. It just lets you see it before you ship.

Do I need an actual human label dataset?

Not to learn the method. The cluster uses two public datasets, Anthropic's HH-RLHF and OpenBMB's UltraFeedback, plus a small synthetic preference set built from GSM8K correct-versus-wrong answers for the reward model project. UltraFeedback is scored by GPT-4 rather than labeled by humans, which makes it a fitting setup for the failure mode in Project 24's BREAK IT: a bad label source produces a confidently misaligned model with the same training stability as a clean one.

Is DPO replacing PPO?

It is replacing it as the default starting point. PPO and GRPO are still in the toolbox for cases where you need multi-objective reward shaping, online preference collection, or reward signals that do not fit pairwise comparisons. For most assistant-style alignment work since 2023, DPO and its variants (KTO, ORPO, SimPO) are the path of least resistance. The book builds both so the choice is informed, not vibes-based.

Can I run this without a GPU?

Most of it, yes, at small scale. A 125M model with a 1000-example subset of UltraFeedback runs DPO for 100 steps on CPU and lets you verify the loss math, the log-ratio gap, and the inverted-label break. The GRPO and PPO projects benefit from a consumer GPU with 12 to 24 GB of VRAM for the actual reward-hacking demonstration over 1000 steps. The math is the same at every scale; only the wall-clock and the visible degradation change.

Start the post-training stack

Open Chapter 23 tonight.

Six projects get you from a working SFT checkpoint through PPO, DPO, test-time reasoning, and tool use, with deliberate failure experiments at every layer. The book is on Leanpub with lifetime updates.

Buy on Leanpub — $15.99 ~~$19.99~~ Back to the pillar

Related clusters and excerpts.

Mixture of Experts

Build an LLM from Scratch

Reward Models and RLHF

Fusing Independently Trained Specialists
