Chapter 5 · Full excerpt
Your GPT from a blank file.
This chapter assembles a working GPT decoder from nothing. Token embeddings, positional embeddings, a stack of transformer blocks with attention and a feed-forward network, residual connections, LayerNorm, a final language-model head — all in one readable file you can hold in your head.
Then the chapter trains it on TinyShakespeare with AdamW, a warmup-plus-cosine learning rate schedule, gradient clipping, and checkpoints. Samples are printed as training runs. The chapter closes by deliberately breaking each piece — removing weight tying, removing the schedule, removing clipping, zeroing the initialization, removing checkpointing — so you can see how the system fails when any one part is missing.
This is Chapter 5 of Under The Hood — Build Every Layer of a Large Language Model from Scratch. The full 35-project book is on Leanpub. Code companion at github.com/mechramc/Under-the-hood.
The concept
By the time you reach this chapter, you already have the parts: gradients from Project 1, a tiny language model from Project 2, a tokenizer from Project 3, and attention from Project 4. So why does opening a blank file still feel like stepping off a curb you cannot see? Because the hard part is no longer any one component. The hard part is the glue. A GPT is not "attention plus some layers." It is a whole system where token embeddings, positional information, residual paths, normalization, initialization, batching, optimization, clipping, checkpointing, and generation all have to cooperate. Miss one line and the whole thing does not become slightly worse. It becomes nonsense, NaNs, or wasted hours.
What we are assembling
A tokenizer turns raw text into tokens. The model never sees "language" directly. It sees token IDs, and everything else is built on top of that integer sequence. An embedding table is a lookup that turns each token ID into a list of numbers — that list is the token's current internal representation. If token ID 42 means "king", the embedding table maps 42 to a vector in d_model dimensions that will shift throughout training as the model figures out what "king" means in context.
Position matters too. "dog bites man" and "man bites dog" use exactly the same tokens in a different order. So the model adds position information alongside token identity: if token embeddings say what a token is, positional embeddings say where it sits in the sequence.
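A tiny framework-free sketch can make that sum concrete. Everything here is made up for illustration: the three-word vocabulary, the two-dimensional vectors, and the helper name `embed`. A real model learns both tables during training.

```python
# Toy illustration only: hand-picked vectors, not learned embeddings.
vocab = {"dog": 0, "bites": 1, "man": 2}

# One row per token ID (what the token is).
token_table = [
    [1.0, 0.0],   # dog
    [0.0, 1.0],   # bites
    [1.0, 1.0],   # man
]
# One row per position (where the token sits).
position_table = [
    [0.1, 0.0],   # position 0
    [0.0, 0.1],   # position 1
    [0.1, 0.1],   # position 2
]

def embed(words):
    """Return token embedding + positional embedding for each word."""
    out = []
    for pos, word in enumerate(words):
        tok = token_table[vocab[word]]
        p = position_table[pos]
        out.append([t + q for t, q in zip(tok, p)])
    return out

a = embed(["dog", "bites", "man"])
b = embed(["man", "bites", "dog"])
print(a)
print(b)
```

The two sentences use the same tokens, yet the model receives different inputs, because "dog" at position 0 and "dog" at position 2 no longer share a vector.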
Then the transformer blocks begin. Each block does two jobs. First, attention lets each token decide which earlier tokens matter right now. Second, the feed-forward network does local thinking on each position after attention has already mixed in context from elsewhere. These two sublayers are wrapped in residual connections: the x + F(x) skip lane that carries the old signal forward by simple addition. If the new sublayer is not helping yet, the old information still passes through untouched. LayerNorm keeps the scale of activations under control at each step, much as a voltage regulator keeps a supply within the range the next component expects. Without it, one layer can push values into a range the next layer cannot sensibly consume.
After N blocks, the model has a final hidden state for every position. It then projects that hidden state into vocabulary-sized logits: one score per possible next token. Softmax turns those scores into probabilities. Training is next-token prediction. If the input is "To be, or not to b", the target is "o be, or not to be". The model predicts the next token at every position, compares those predictions to the true next tokens, and computes a loss. Backpropagation assigns blame through the whole network, the optimizer nudges the parameters, and you do it again, ten thousand times or more.
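Two small checks fall out of that description, sketched here with the standard library only. The first confirms the one-position shift on the example string. The second computes the loss you would expect from a model that spreads probability evenly over the vocabulary, which is ln(V). The vocabulary size of 65 is an assumption for a character-level TinyShakespeare setup; substitute your own.

```python
import math

text = "To be, or not to be"
x = text[:-1]   # what the model sees
y = text[1:]    # what it must predict, shifted by one position
assert all(y[t] == text[t + 1] for t in range(len(x)))

def uniform_cross_entropy(vocab_size):
    """Cross-entropy when every token gets probability 1 / vocab_size."""
    return -math.log(1.0 / vocab_size)

baseline = uniform_cross_entropy(65)   # 65 is an assumed character vocabulary size
print(repr(x), "->", repr(y))
print(round(baseline, 3))
```

If your first logged training loss is far above that baseline, suspect initialization or the loss wiring before you suspect the architecture.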
The first time I watched this loop run on my own code, I expected the loss to fall in a clean diagonal. It does not. It bounces. Sometimes for two hundred steps before it commits to a direction. That bouncing is a feature, not a bug — but you do not learn that from a textbook curve. Until you write the whole thing yourself, "GPT" is still a bag of named parts in your head. After you write it, you know where the batch comes from, why targets are shifted by exactly one position, where weight tying saves parameters, why learning rate schedules are not decoration, why clipping exists, why initialization is part of the model itself, and why checkpointing is not optional.
Why it matters
Without building this yourself, the next reference implementation will lie to you by accident. Not because the code is wrong, but because production code hides pain. A good reference implementation makes dozens of choices look inevitable. If you meet those choices only inside polished code, they look like style. They are not style. They are survival.
Strong opinion: most "from scratch" tutorials I have read fail right here. They give you the transformer block and skip the wiring around it, and the wiring is where every painful weekend of my life has been spent.
This project matters because it turns the transformer from architecture into system. When training loss drops, you know that is not one knob. It is the result of many agreements being kept at once: the tokenizer gives sensible units, the batch sampler does not feed garbage windows, initialization does not trap the model in symmetry, the learning rate is high enough to learn but not high enough to explode, gradient clipping catches rare spikes before they wreck the run.
If you can write a GPT from a blank file, you stop treating language models as sealed machines. You can now ask: Why did validation loss stop improving? Is this instability in optimizer state or in the model code? Did I really resume training, or only reload the weights?
The build
Start with one file: my_gpt.py. Do not read Karpathy's microgpt first. That restraint matters. If you read the reference first, your brain will follow it. You will type something that works, but you will miss the moment where you had to decide what the model needs. This chapter is about those decisions. Build the file in layers.
Step 1 — Decide the smallest complete system
You need exactly these parts: imports, config, tokenizer, data loading, batch sampler, model components, a GPT class with forward() and generate(), optimizer, learning rate schedule, training loop, validation, sample printing, and checkpointing. That list is your map. Anything outside it is optional. Anything missing from it is a hole. A minimal config:
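import math
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class Config:
    # Type annotations make these real dataclass fields, so Config(d_model=128) works.
    batch_size: int = 64
    block_size: int = 128      # sequence length
    d_model: int = 256         # width
    n_heads: int = 4
    n_layers: int = 6          # depth
    dropout: float = 0.2
    learning_rate: float = 3e-4
    min_lr: float = 3e-5
    warmup_steps: int = 200
    max_steps: int = 10000
    grad_clip: float = 1.0
    eval_interval: int = 500   # assumed value; the training loop reads cfg.eval_interval
    device: str = "cuda"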
These are not magic numbers. They are a budget. block_size is sequence length. d_model is width. n_layers is depth. If you run on a small GPU, these numbers control whether the run fits in memory.
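To see the budget as a number, here is a rough parameter count for the architecture this chapter builds. It is a sketch under assumptions: `count_params` is a hypothetical helper, the layer shapes follow the code shown later in the chapter, and the vocabulary size of 65 assumes a character-level tokenizer.

```python
def count_params(vocab_size, block_size, d_model, n_layers):
    """Rough parameter count for a pre-norm GPT with a tied output head."""
    # Attention: fused QKV projection plus output projection, both with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: d_model -> 4*d_model -> d_model, both with bias.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    norms = 2 * (2 * d_model)          # two LayerNorms per block: weight + bias
    per_block = attn + mlp + norms
    embeddings = vocab_size * d_model + block_size * d_model
    final_norm = 2 * d_model
    # The output head reuses the token embedding table, so it adds nothing.
    return n_layers * per_block + embeddings + final_norm

n = count_params(vocab_size=65, block_size=128, d_model=256, n_layers=6)
print(f"{n:,} parameters, about {n * 4 / 1e6:.1f} MB in float32")
```

That figure covers weights only. Gradients, AdamW's two moment tensors per parameter, and activations (which grow with batch_size and block_size) all sit on top of it.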
Step 2 — Write the batch sampler honestly
This looks small. It is not small. The model trains on fixed-length windows. Sample a random start index i, take x = data[i:i+T], and set the target y = data[i+1:i+T+1]. That shift by one token is the whole training objective — the model sees tokens 1 through T and predicts tokens 2 through T+1.
def get_batch(split, cfg):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - cfg.block_size - 1, (cfg.batch_size,))
    x = torch.stack([data[i:i+cfg.block_size] for i in ix])
    y = torch.stack([data[i+1:i+cfg.block_size+1] for i in ix])
    return x.to(cfg.device), y.to(cfg.device)
get_batch() quietly defines the learning problem. It chooses what counts as context, what counts as the next-token target, and whether training windows reflect the real corpus or mix together things that should have stayed separate. In larger systems, data bugs often begin here rather than inside the model. I have hit this exact bug — packing three unrelated sources back to back with no separator, with validation loss looking fine while the model was quietly writing answers to one source's questions using vocabulary from another. Write your rule in code, in a comment. Future you needs to know whether the training distribution is honest.
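One way to write the rule down is to make the separator explicit and then measure how many windows touch a boundary. This is a sketch, not the book's data pipeline: `pack_documents`, `windows_touching_boundary`, and the separator ID of 0 are all hypothetical names and values chosen for illustration.

```python
def pack_documents(docs, sep_id):
    """Concatenate tokenized documents, appending a separator ID after each one."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(sep_id)
    return stream

def windows_touching_boundary(stream, sep_id, block_size):
    """Count fixed-length windows that contain at least one separator."""
    starts = range(len(stream) - block_size)
    return sum(sep_id in stream[i:i + block_size] for i in starts)

docs = [[1, 2, 3, 4], [5, 6, 7], [8, 9]]   # three tiny made-up "sources"
SEP = 0                                     # hypothetical separator token ID
stream = pack_documents(docs, SEP)
touching = windows_touching_boundary(stream, SEP, block_size=3)
print(stream, touching)
```

With a separator in the stream, the model at least gets a signal that the context has changed. Whether you also drop or mask boundary windows is a decision to record next to get_batch().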
Step 3 — Build causal self-attention
You built attention already. Now you put it inside a full model. Each token creates three projections — Query, Key, Value — and for each head, attention computes scaled dot products, applies a causal mask so tokens cannot see the future, and takes a softmax-weighted sum of values.
class CausalSelfAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        assert cfg.d_model % cfg.n_heads == 0
        self.n_heads = cfg.n_heads
        self.head_dim = cfg.d_model // cfg.n_heads
        self.qkv = nn.Linear(cfg.d_model, 3 * cfg.d_model)
        self.proj = nn.Linear(cfg.d_model, cfg.d_model)
        self.dropout = nn.Dropout(cfg.dropout)
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(cfg.block_size, cfg.block_size))
            .view(1, 1, cfg.block_size, cfg.block_size)
        )

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.qkv(x)
        q, k, v = qkv.split(C, dim=2)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        att = self.dropout(att)
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.dropout(self.proj(y))
Nothing here is decorative. The mask prevents the model from cheating by looking at future tokens. The scaling by 1/sqrt(head_dim) prevents softmax from saturating. The final projection lets the heads recombine their outputs. A scar from drafting this section: my first hand-written attention had the mask broadcasting along the wrong axis. The model still trained. Loss still fell. Samples were almost coherent. Then I noticed the same word kept showing up in positions where it had no business being there, and I traced it back to a 1-vs-2 axis bug in the mask.
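A cheap guard against that kind of mask bug is to check the attention weights directly: every entry above the diagonal should be exactly zero, and every row should sum to one. The sketch below does this in plain Python on a hand-written 3 x 3 score matrix. It mirrors the mask-then-softmax step only and is not a substitute for testing the module itself.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask and a row-wise softmax to a T x T score matrix."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Position t may look at positions 0..t only; later ones get -inf.
        row = [scores[t][s] if s <= t else float("-inf") for s in range(T)]
        m = max(row)
        exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.5, 2.0, 1.0],
          [1.0, 0.0, 3.0],
          [0.2, 0.2, 0.2]]
w = causal_attention_weights(scores)
for row in w:
    print([round(v, 3) for v in row])
```

The large scores at (0, 1) and (1, 2) sit in the future, so they must not influence the result. If they do, the mask is on the wrong axis.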
Step 4 — Wrap attention and MLP into a block
The feed-forward network does per-position computation after attention has already mixed information across positions. A common shape is d_model -> 4*d_model -> d_model; that expansion gives the model room to transform features nonlinearly before projecting back down. Combine attention and the MLP with LayerNorm and residual connections using a pre-norm layout:
class MLP(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cfg.d_model, 4 * cfg.d_model),
            nn.GELU(),
            nn.Linear(4 * cfg.d_model, cfg.d_model),
            nn.Dropout(cfg.dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.ln1 = nn.LayerNorm(cfg.d_model)
        self.attn = CausalSelfAttention(cfg)
        self.ln2 = nn.LayerNorm(cfg.d_model)
        self.mlp = MLP(cfg)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
A transformer block is easier than it first appears. ln1 -> attn -> residual add is one correction to the running representation. ln2 -> mlp -> residual add is a second correction. If you can identify those two residual updates while skimming, you can navigate most GPT implementations without getting lost in the class scaffolding. Pre-norm trains more reliably than post-norm in small GPT setups because normalization happens before each sublayer rather than after, keeping inputs to attention and the MLP in a controlled range regardless of what previous blocks have done.
Step 5 — Assemble the GPT class and tie the embeddings
Your GPT class holds embeddings, transformer blocks, final norm, output projection, initialization, weight tying, the forward pass, and generation all in one place. It is where the architectural contracts are enforced.
class GPT(nn.Module):
    def __init__(self, cfg, vocab_size):
        super().__init__()
        self.cfg = cfg
        self.token_embedding = nn.Embedding(vocab_size, cfg.d_model)
        self.position_embedding = nn.Embedding(cfg.block_size, cfg.d_model)
        self.blocks = nn.ModuleList([Block(cfg) for _ in range(cfg.n_layers)])
        self.ln_f = nn.LayerNorm(cfg.d_model)
        self.lm_head = nn.Linear(cfg.d_model, vocab_size, bias=False)
        # weight tying: both modules now share one parameter tensor
        self.lm_head.weight = self.token_embedding.weight
        # apply() visits lm_head after token_embedding, so the shared matrix
        # ends up with the Linear (Xavier) init, not the std=0.02 embedding init.
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.cfg.block_size
        pos = torch.arange(0, T, device=idx.device)
        # (B, T, C) token vectors plus (T, C) position vectors, broadcast over the batch
        x = self.token_embedding(idx) + self.position_embedding(pos)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            # flatten batch and time so every position is one classification example
            B, T, V = logits.shape
            loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
        return logits, loss
That one line of weight tying matters a lot. The same matrix used to map token IDs into vectors is also used, transposed by the linear layer, to map hidden states back into token logits. The dictionary the model uses to read tokens is also the dictionary it uses to write them. Suppose vocab_size = 50000 and d_model = 768: an untied output projection costs 38.4M extra weights. Tie them and that table disappears. Not a small saving.
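The arithmetic behind that number is short enough to check by hand. The helper name `untied_head_cost` and the float32 assumption of 4 bytes per weight are mine, used only to put the saving in memory terms.

```python
def untied_head_cost(vocab_size, d_model, bytes_per_param=4):
    """Extra weights (and bytes) for a separate, bias-free output projection."""
    extra = vocab_size * d_model
    return extra, extra * bytes_per_param

extra, nbytes = untied_head_cost(50_000, 768)
print(f"{extra:,} extra weights, {nbytes / 1e6:.1f} MB in float32")
```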
Initialization is part of the model, not paperwork. If weights start too large, activations and gradients explode. If they start too small, updates become weak. If they all start at zero, many parameters remain perfect twins forever: they receive identical gradients and never diverge. When I ran fusion experiments at scale, the recurring surprise was how often a "bad start" looked like a "bad architecture" two thousand steps in. You can fix initialization in one line. You cannot recover those two thousand steps.
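The class above stops at forward(), so here is a torch-free sketch of the generation loop the GPT class also needs. It shows the logic only: crop the context to block_size, score the next token, sample, append, repeat. The `next_logits` callable and `toy_next_logits` are hypothetical stand-ins for the model. In the real file this would more likely be a method that calls the model under torch.no_grad() and samples with torch.multinomial.

```python
import math
import random

def sample_next(logits, temperature=1.0, rng=random):
    """Sample one token ID from a list of logits via a temperature softmax."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(next_logits, idx, max_new_tokens, block_size, temperature=1.0, rng=random):
    """Autoregressive loop: crop context, score, sample, append, repeat."""
    idx = list(idx)
    for _ in range(max_new_tokens):
        context = idx[-block_size:]      # never feed more than block_size tokens
        logits = next_logits(context)    # scores for the token after the context
        idx.append(sample_next(logits, temperature, rng))
    return idx

def toy_next_logits(context, vocab_size=5):
    """Hypothetical stand-in for the model: prefers (last token + 1) mod vocab_size."""
    target = (context[-1] + 1) % vocab_size
    return [10.0 if i == target else 0.0 for i in range(vocab_size)]

# A very low temperature makes sampling effectively greedy.
out = generate(toy_next_logits, [0], max_new_tokens=8, block_size=4,
               temperature=0.01, rng=random.Random(0))
print(out)
```

The crop to block_size is the line people forget. Without it, generation runs past the positional embedding table as soon as the sequence outgrows the training window.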
Step 6 — Schedule the learning rate and write the training loop
Use AdamW. But do not call a scheduler helper and move on. Write the schedule yourself. Warmup, where you start small and ramp to peak learning rate, protects the fragile early phase when the network is uncalibrated. Cosine decay smoothly lowers the rate so the late phase is fine adjustment rather than broad movement.
def get_lr(step, cfg):
    if step < cfg.warmup_steps:
        return cfg.learning_rate * (step + 1) / cfg.warmup_steps
    if step > cfg.max_steps:
        return cfg.min_lr
    decay_ratio = (step - cfg.warmup_steps) / (cfg.max_steps - cfg.warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return cfg.min_lr + coeff * (cfg.learning_rate - cfg.min_lr)
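Before trusting a schedule, print it at the steps where its behavior changes. The snippet below repeats get_lr so it runs on its own and evaluates it with the chapter's config values. The SimpleNamespace is only a stand-in for Config.

```python
import math
from types import SimpleNamespace

# Same values as the chapter's Config; SimpleNamespace is a stand-in for it.
cfg = SimpleNamespace(learning_rate=3e-4, min_lr=3e-5, warmup_steps=200, max_steps=10000)

def get_lr(step, cfg):
    """Linear warmup to the peak, then cosine decay down to min_lr."""
    if step < cfg.warmup_steps:
        return cfg.learning_rate * (step + 1) / cfg.warmup_steps
    if step > cfg.max_steps:
        return cfg.min_lr
    decay_ratio = (step - cfg.warmup_steps) / (cfg.max_steps - cfg.warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return cfg.min_lr + coeff * (cfg.learning_rate - cfg.min_lr)

# First step, last warmup step, first decay step, decay midpoint, final step.
for step in (0, 199, 200, 5100, 10000):
    print(step, f"{get_lr(step, cfg):.2e}")
```

The rate starts at 1/200 of the peak, reaches the peak at the end of warmup, passes the midpoint between peak and floor halfway through the decay, and lands on min_lr at max_steps.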
Then the training loop itself. The loop fetches a batch, sets the current step size, computes predictions and loss, backpropagates, clips extreme gradients, updates weights, then periodically evaluates and checkpoints:
model = GPT(cfg, vocab_size).to(cfg.device)
optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate)
for step in range(cfg.max_steps):
    lr = get_lr(step, cfg)
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr
    xb, yb = get_batch("train", cfg)
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), cfg.grad_clip)
    optimizer.step()
    if step % cfg.eval_interval == 0:
        train_loss = estimate_loss("train")
        val_loss = estimate_loss("val")
        sample_text(model)
        save_checkpoint(...)
What caught me off guard with gradient clipping was the asymmetry. It is the kind of safety code I assumed I would tune carefully, and instead it spent nearly the whole run doing literally nothing and then, on one step out of forty thousand, saved the entire training run. That asymmetry is the actual reason it is in the loop.
Checkpointing deserves the same care. Save model state, optimizer state, current step, and config every few hundred steps. If you reload only weights and not optimizer state, AdamW's running moment estimates are erased: the run resumes with the weights preserved but the optimizer's memory wiped. That is not a clean continuation.
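A minimal sketch of what a full checkpoint could look like follows. The helper names `save_checkpoint` and `load_checkpoint`, their signatures, and the dictionary keys are assumptions for illustration, not the book's implementation. A tiny nn.Linear stands in for the GPT so the example runs on a CPU.

```python
import os
import tempfile

import torch
import torch.nn as nn

def save_checkpoint(path, model, optimizer, step, config):
    """Hypothetical helper: store everything needed to resume, not just weights."""
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),   # AdamW moment estimates live here
            "step": step,
            "config": config,                      # a plain dict in this sketch
        },
        path,
    )

def load_checkpoint(path, model, optimizer):
    """Restore weights and optimizer state; return the saved step and config."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"], ckpt["config"]

# Tiny stand-in model so the sketch runs anywhere.
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model(torch.randn(8, 4)).sum().backward()
optimizer.step()                                   # creates the moment buffers

path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
save_checkpoint(path, model, optimizer, step=1, config={"d_model": 4})

fresh_model = nn.Linear(4, 2)
fresh_opt = torch.optim.AdamW(fresh_model.parameters(), lr=3e-4)
step, config = load_checkpoint(path, fresh_model, fresh_opt)
print(step, config, len(fresh_opt.state_dict()["state"]))
```

After loading, the fresh optimizer holds per-parameter state again. If you skip the optimizer.load_state_dict call, that state stays empty, which is exactly the "only reloaded the weights" failure.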
BREAK IT
A system that only works when left untouched has not taught you much. Run every ablation against the same baseline, with the same seed, and save each run in its own directory. The point is not to get a healthy model from each ablation. The point is to watch which specific failure arrives first when a specific piece is removed.
Remove gradient clipping and raise the learning rate
Exact change — comment out:
torch.nn.utils.clip_grad_norm_(model.parameters(), cfg.grad_clip)
Then raise the peak learning rate. If you trained at 3e-4, try 6e-4 or 1e-3.
What changes first: training may look fine for a while, then one step suddenly spikes, loss jumps hard, and soon after you may see nan. Generated text becomes garbage or sampling crashes. This is the important part: clipping often appears to do nothing right until the step where it saves the run. If you log gradient norm, you may see many steps around 0.8 to 2.5, then a rare step at 40 or 200, then collapse.
What this proves: gradient clipping protects the optimizer from rare extreme updates. It is there for tail events, not average events. Without it, one spike can erase thousands of good updates. Clipping is not a fix for a hot learning rate. It is a stability net that prevents catastrophic damage while you find the actual fix. Train with clipping always on, threshold set well above your typical norm, so it only fires during anomalies.
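The arithmetic of norm clipping is small enough to write out. This plain-Python sketch follows the same idea as torch.nn.utils.clip_grad_norm_ as I understand it: compute one L2 norm over all gradients together, then scale everything by max_norm / (norm + eps), capped at 1. The function name and the flat list of gradient values are simplifications for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradient values so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    # Return the pre-clip norm too: that is the number worth logging.
    return [g * scale for g in grads], total_norm

typical, n1 = clip_by_global_norm([0.3, -0.4], max_norm=1.0)      # norm 0.5, untouched
spike, n2 = clip_by_global_norm([120.0, -160.0], max_norm=1.0)    # norm 200, scaled down
print(n1, typical)
print(n2, spike)
```

The typical step passes through unchanged. The spike keeps its direction but loses its size. Logging the returned pre-clip norm on every step is what lets you see the rare 40 or 200 in the first place.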
Initialize all parameters to zeros
Exact change — replace the initialization with zeros for every weight:
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        nn.init.zeros_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.zeros_(module.weight)
What changes first: loss barely improves, if at all. Different neurons inside a layer behave identically. Logits often start uniform or near-uniform. Generated samples remain repetitive or meaningless. Training may appear "stable" while learning almost nothing.
Why: the network starts in perfect symmetry. Many weights receive identical gradients, so they stay identical. Multiple heads do not specialize. Multiple neurons do not specialize. The model acts like a much smaller model trapped inside a larger shell. What this proves: initialization is part of the learning system. Randomness at the start is not noise you tolerate. It is how the model gets permission to become different from itself.
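You can watch the symmetry argument on a toy network far smaller than the GPT. The sketch below uses two hidden tanh units and a squared-error loss, with gradients written out by hand. It is an illustration of the mechanism under those assumptions, not a model of what happens inside every GPT layer.

```python
import math
import random

def grads(w, v, x=1.0, target=1.0):
    """Gradients of (y - target)^2 for y = sum_j v[j] * tanh(w[j] * x)."""
    h = [math.tanh(wj * x) for wj in w]
    y = sum(vj * hj for vj, hj in zip(v, h))
    dy = 2.0 * (y - target)
    grad_v = [dy * hj for hj in h]
    grad_w = [dy * vj * (1.0 - hj * hj) * x for vj, hj in zip(v, h)]
    return grad_w, grad_v

zero_gw, zero_gv = grads([0.0, 0.0], [0.0, 0.0])      # all-zero start
twin_gw, twin_gv = grads([0.5, 0.5], [0.5, 0.5])      # identical non-zero start
rng = random.Random(0)
w = [rng.uniform(-1, 1) for _ in range(2)]
v = [rng.uniform(-1, 1) for _ in range(2)]
rand_gw, rand_gv = grads(w, v)                        # random start

print(zero_gw, zero_gv)
print(twin_gw, twin_gv)
print(rand_gw, rand_gv)
```

In this toy, the all-zero start gets no gradient at all, the identical start gives both units the same gradient so they stay twins, and only the random start lets the two units move apart.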
The deeper lesson
Three more ablations are worth running on your own. Remove weight tying and watch parameter count jump by roughly V * d_model while validation loss often trails the tied version. Replace the warmup-plus-cosine schedule with a constant learning rate and watch the same compromise force itself on two different phases of training — early instability or late stagnation. Delete checkpointing entirely, train for thirty minutes, kill the process, and discover that nothing inside the model changed but thirty minutes of compute vanished anyway. None of these failures look the same. Tying removal shifts parameter geometry. Schedule removal shifts the optimizer's working range. Missing checkpoints fail operationally rather than mathematically. That is the real lesson of this chapter — a GPT is not one decision. It is a stack of decisions, and each one is attached to a specific failure mode you can watch arrive on purpose.
Questions to answer
- When you removed weight tying, what changed more noticeably: parameter count, loss, or sample quality? Why do you think the biggest change showed up there first?
- In the constant learning rate run, what failed first: early stability or late improvement? What compromise was the fixed LR forcing?
- When you removed gradient clipping and raised the learning rate, did the run fail gradually or suddenly? What does that tell you about why clipping exists?
- In the zero-initialization run, what evidence told you the model was stuck in symmetry rather than merely learning slowly?
- If you had to keep only one safeguard in a tiny local training setup, which would you keep first: good initialization, LR schedule, clipping, or checkpointing? What failure is that choice protecting against?
Go further
- Project 4: Attention From Scratch. If any line of the CausalSelfAttention module above looks opaque, this is where the Q/K/V split, scaling factor, causal mask, and multi-head reshape get derived from first principles.
- Project 6: A Better Tokenizer. The character-level fallback here is good enough to learn from, but real GPTs run on byte-pair encoding. Project 6 builds that tokenizer end to end and shows what changes when token boundaries are designed rather than given.
- Karpathy, nanoGPT. Read this only after your own file runs. The progression matters: your blank-file version should already answer most of the questions the reference raises, instead of the reference answering them for you.
- Code companion on GitHub. The full Project 5 directory, including training logs and BREAK IT artifacts, lives at github.com/mechramc/Under-the-hood/tree/main/projects/05-gpt.
What you now know
You can now write a complete GPT training file from nothing. The whole machine, not only the transformer blocks. Every piece has a reason: shifted targets define the training objective, the batch sampler shapes the distribution the model sees, weight tying aligns the geometry of reading and writing tokens, initialization determines whether the model can break symmetry at all, warmup and decay solve different instability problems at different moments in training, clipping protects against rare destructive spikes, and checkpointing is the difference between a training system and a training gamble. When you open a larger reference implementation next, you will not see magic. You will see choices.
This was Chapter 5 of 35.
The full book is 934 pages and 256,587 words. 35 hands-on projects from autograd to fused specialists. PDF and EPUB on Leanpub, lifetime free updates.