Chapter 4 · Full excerpt

Attention from scratch.

Attention is the first place in a language model where one token gets to look at the others and decide which ones matter. Every token computes three views of itself — a query, a key, and a value. The query compares against every other token's key, those comparisons become softmax weights, and the token then forms its new representation by mixing the values of the tokens it attended to.

This chapter builds attention from a blank file: scaled dot-product attention first, then causal masking, then multi-head, then the per-head heatmaps that let you see what each head learned. The chapter ends by deliberately breaking the scaling factor and the causal mask so you can watch the two specific ways attention fails when those guardrails are missing.

This is Chapter 4 of Under The Hood — Build Every Layer of a Large Language Model from Scratch. The full 35-project book is on Leanpub. Code companion at github.com/mechramc/Under-the-hood.

The concept

Up to Project 3 you have tokens and embeddings. Useful, but still not enough for context. If the token "bank" appears, the model has a problem. Does it mean a financial institution? A river bank? A verb, like banking a plane? The token by itself cannot settle that. Attention is the mechanism that lets a token ask the rest of the sequence for help.

Imagine every token in a sentence sitting around a table with an index card. On the card, each token writes three things:

a query — what I am looking for
a key — what I contain
a value — what I can contribute if someone picks me

If you are the token "it" in the sentence "The server returned an error because it timed out," you need to figure out what "it" refers to. So "it" sends out its query: "I am looking for something singular, nearby, and probably a thing that can time out." Other tokens present their keys. "server" has a key that says, roughly, "infrastructure noun, singular, plausible agent of timing out." Attention compares the query from "it" against the keys from all tokens. The better the match, the more weight that token gets. Then "it" forms its new representation by taking a weighted mixture of the values from the tokens it attended to.

That weighted mixture is the whole trick. It does not copy one token exactly. It blends information from the places it thinks matter.

The cleanest mental model I have for the Q/K/V split comes from a library. What you write on the request slip is the query, the title on each book's spine is the key, and the contents you actually take home are the value. Those are obviously three different things. The librarian does not match request slips to take-home contents directly; the slip is matched against titles, and the matching title then delivers the contents. Attention performs the same routing trick at every layer.

The math

For one head of attention, where d is the dimension of each query and key vector (d_head in the build), the formula is:

Attention(Q, K, V) = softmax((QK^T) / sqrt(d)) V

It looks dense until you say it out loud: take every token's query, compare it with every token's key, turn those comparisons into weights, use those weights to mix together the value vectors. That is attention.

The attention matrix itself is worth staring at. If your sequence length is T, then QK^T is a T × T matrix, and so is the weight matrix you get after softmax. Row i of the weights tells you what token i attends to. Column j tells you how much the other tokens attend to token j. That is why attention heatmaps are so useful. They let you watch context routing happen.
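To make the formula concrete, here is a minimal single-head sketch in PyTorch. The function name `attention` and the toy sizes (T = 4, d = 8) are assumptions made for illustration, not code from the book's repository. It only checks the two properties described above: the weight matrix is T × T, and every row sums to 1.

```python
import math
import torch

def attention(Q, K, V):
    """Single-head scaled dot-product attention, no mask (illustrative helper)."""
    d = Q.shape[-1]
    scores = (Q @ K.T) / math.sqrt(d)        # (T, T): every query against every key
    weights = torch.softmax(scores, dim=-1)  # each row becomes a probability distribution
    return weights @ V, weights              # mixed values (T, d) and the attention matrix

torch.manual_seed(0)
T, d = 4, 8                                  # toy sizes, chosen only for this demo
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out, weights = attention(Q, K, V)
print(weights.shape, out.shape)
print(weights.sum(dim=-1))                   # each row should sum to 1
```

Random Q, K, and V are enough here because the shape and normalization properties do not depend on what the vectors mean.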

Why it matters

Without attention, tokens do not get to dynamically ask "who should matter to me right now?" That removes the main mechanism that makes transformers good at context. Think about prose: "Maria handed the package to Jordan because she was leaving." Who is "she"? You do not solve that by looking at "she" alone. You solve it by comparing it against earlier nouns and the surrounding verb structure. A model without attention cannot make that comparison at all.

Attention also matters because it is where several critical design choices live: the 1/√d scaling factor, causal masking, multi-head splitting, and softmax over scores. If you break any of these, the model stops behaving well in specific, diagnosable ways. That is exactly why this chapter asks you to inspect the attention matrix, see the weights, and break it on purpose. Not because the formula is complicated, but because each piece of the formula exists to prevent a concrete failure.

Strong opinion: most published explanations of attention overweight the math and underweight the breakages. The math fits on a napkin. The breakages are where the engineering lives.

The build

Build attention from the inside out. Start with a small sentence, then one head, then causal masking, then multi-head, then inspect what each head is doing. Do not rush to a full transformer block. This chapter is about attention itself.

Step 1 — Produce queries, keys, and values

Take the embedding matrix x of shape (T, d_model). Create three learned weight matrices W_Q, W_K, and W_V, each of shape (d_model, d_head), and project:

Q = x @ W_Q   # (T, d_head)
K = x @ W_K   # (T, d_head)
V = x @ W_V   # (T, d_head)

Every token embedding gets re-expressed three ways. Same token, different job. Do not hide this in a class yet if classes make the mechanism harder to see. Print the shapes. Check them.
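If you want something you can run immediately, the following is one possible setup. The sizes (T = 5, d_model = 16, d_head = 8) and the random initialization are assumptions for the demo; in a real model x comes from your embedding layer and the three weight matrices are learned.

```python
import torch

torch.manual_seed(0)
T, d_model, d_head = 5, 16, 8            # toy sizes (assumptions for this demo)

x = torch.randn(T, d_model)              # stand-in for real token embeddings
# Random placeholders for the learned projection matrices.
W_Q = torch.randn(d_model, d_head) / d_model ** 0.5
W_K = torch.randn(d_model, d_head) / d_model ** 0.5
W_V = torch.randn(d_model, d_head) / d_model ** 0.5

Q, K, V = x @ W_Q, x @ W_K, x @ W_V      # same tokens, three different jobs
for name, t in [("x", x), ("Q", Q), ("K", K), ("V", V)]:
    print(name, tuple(t.shape))
```

The printout should show x at (T, d_model) and each of Q, K, V at (T, d_head). If any shape surprises you, fix that before moving on.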

Step 2 — Compute raw scores, then scale

Compare every query against every key, then divide by √d_head:

import math
scores = (Q @ K.T) / math.sqrt(d_head)   # (T, T)

Why divide? Because dot products grow with dimension. A dot product sums d_head pairwise multiplications, so if the components are roughly independent with unit variance, its typical magnitude grows like √d_head. Large scores make softmax saturate: one entry gets almost all the probability and the rest get almost none. That sounds decisive, but it is actually a training problem: once the distribution becomes too sharp too early, gradients through softmax get tiny for most positions, and the model stops learning cleanly. Dividing by √d_head brings the scores back to roughly unit scale. If you later see 1/√d_head buried inside a framework, read it as a stability term, not a formatting choice.
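You can check the growth claim numerically without any framework. The sketch below is plain Python and rests on one loud assumption: query and key components are independent draws from a standard normal distribution, which real trained projections only approximate. Under that assumption the standard deviation of q·k should land near √d_head, and dividing by √d_head should bring it back near 1. The helper name `dot_std` is made up for this example.

```python
import math
import random
import statistics

random.seed(0)

def dot_std(d, samples=2000):
    """Empirical standard deviation of q·k for random unit-variance q and k."""
    dots = []
    for _ in range(samples):
        q = [random.gauss(0, 1) for _ in range(d)]
        k = [random.gauss(0, 1) for _ in range(d)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

for d in (16, 64, 256):
    raw = dot_std(d)
    print(f"d_head={d:4d}  raw std={raw:6.2f}  "
          f"sqrt(d)={math.sqrt(d):6.2f}  scaled std={raw / math.sqrt(d):.2f}")
```

The raw column should climb with d_head while the scaled column stays roughly flat, which is the whole argument for the √d_head divisor.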

Step 3 — Apply the causal mask, then softmax

In an autoregressive model, position i can only attend to positions ≤ i. Create a lower-triangular mask and overwrite the upper triangle with negative infinity before softmax:

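import torch

mask = torch.tril(torch.ones(T, T))                    # 1 on and below the diagonal, 0 above
scores = scores.masked_fill(mask == 0, float('-inf'))  # future positions get -inf
weights = torch.softmax(scores, dim=-1)                # each row sums to 1 over allowed positions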

The order matters: compute scores, scale, mask, then softmax. Do not softmax first and then zero things out. That breaks the probability normalization.
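The ordering claim is easy to verify. This sketch uses random numbers as a stand-in for real scaled scores (an assumption for the demo) and compares the two orders side by side.

```python
import torch

torch.manual_seed(0)
T = 4
scores = torch.randn(T, T)                       # stand-in for scaled Q @ K.T
mask = torch.tril(torch.ones(T, T))

# Correct order: mask with -inf, then softmax.
right = torch.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=-1)

# Wrong order: softmax over everything, then zero out the future.
wrong = torch.softmax(scores, dim=-1) * mask

print(right.sum(dim=-1))   # every row sums to 1
print(wrong.sum(dim=-1))   # rows that lost future weight sum to less than 1
```

In the wrong order, the probability mass that softmax assigned to future positions is simply thrown away, so early rows end up with less total weight than late rows. Masking first lets softmax renormalize over the positions that are actually allowed.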

Step 4 — Mix the values

Use the weights to combine value vectors:

out = weights @ V   # (T, d_head)

This is the new representation for each token after attention. Notice what happened: the token did not just pick the most relevant other token. It built a mixture. Context is often distributed — a token may need syntax from one place, referent information from another, and punctuation structure from a third. Weighted mixtures let that happen.
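A hand-sized example makes the mixture visible. The weights and value vectors below are invented numbers, and the code is plain Python so every multiplication is on the page. It computes one row of weights @ V.

```python
weights_row = [0.2, 0.3, 0.5]          # one token's attention weights (sum to 1)
values = [
    [1.0, 0.0],                        # value vector of token 0
    [0.0, 1.0],                        # value vector of token 1
    [1.0, 1.0],                        # value vector of token 2
]

# out[c] = sum over tokens j of weights_row[j] * values[j][c]
out_row = [
    sum(w * v[c] for w, v in zip(weights_row, values))
    for c in range(len(values[0]))
]
print(out_row)   # approximately [0.7, 0.8]
```

The result matches none of the three value vectors. It sits between them, pulled hardest toward token 2 because that token carries the largest weight.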

Step 5 — Scale to multi-head

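Single-head attention proves the mechanism. Now make it practical. Let d_model be the full embedding size, choose H heads so that H divides d_model evenly, and let d_head = d_model / H. The projections W_Q, W_K, and W_V are now each (d_model, d_model), so one matrix multiply produces all heads at once.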

# x: (T, d_model)
Q = x @ W_Q
K = x @ W_K
V = x @ W_V

# reshape to (H, T, d_head)
Q = Q.view(T, H, d_head).transpose(0, 1)
K = K.view(T, H, d_head).transpose(0, 1)
V = V.view(T, H, d_head).transpose(0, 1)

scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_head)   # (H, T, T)
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)
head_out = weights @ V                                    # (H, T, d_head)

# back to (T, d_model)
out = head_out.transpose(0, 1).contiguous().view(T, d_model)

The reshapes and transposes can look intimidating, but the conceptual job is simple: split the model width into several smaller attention systems, run them independently, then stitch their outputs back together. Each head gets its own T × T attention matrix. Inspect head 0 separately from head 1. Do that. The whole educational point of multi-head attention is that the heads differ.
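One way to convince yourself that the reshapes really do run independent heads is to recompute a single head by hand and compare. The sketch below assumes toy sizes (T = 6, d_model = 16, H = 4) and random weights; `split_heads` is a helper name introduced here, not something from the book's code.

```python
import math
import torch

torch.manual_seed(0)
T, d_model, H = 6, 16, 4                 # toy sizes (assumptions for this demo)
d_head = d_model // H                    # integer division; H must divide d_model

x = torch.randn(T, d_model)
W_Q, W_K, W_V = (torch.randn(d_model, d_model) / math.sqrt(d_model) for _ in range(3))
mask = torch.tril(torch.ones(T, T))

def split_heads(t):
    """(T, d_model) -> (H, T, d_head)"""
    return t.view(T, H, d_head).transpose(0, 1)

Q, K, V = split_heads(x @ W_Q), split_heads(x @ W_K), split_heads(x @ W_V)
scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_head)          # (H, T, T)
weights = torch.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=-1)
out = (weights @ V).transpose(0, 1).contiguous().view(T, d_model)

# Reference: run head h alone on its own column slice of the projections.
h = 1
cols = slice(h * d_head, (h + 1) * d_head)
q_h, k_h, v_h = (x @ W_Q)[:, cols], (x @ W_K)[:, cols], (x @ W_V)[:, cols]
s_h = (q_h @ k_h.T) / math.sqrt(d_head)
w_h = torch.softmax(s_h.masked_fill(mask == 0, float("-inf")), dim=-1)
print(torch.allclose(out[:, cols], w_h @ v_h, atol=1e-5))   # should print True
```

Head h owns columns h·d_head through (h+1)·d_head − 1 of the projected matrices, and the same columns of the stitched output. Nothing crosses between heads until a later layer mixes them.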

Step 6 — Measure per-head attention entropy

Heatmaps are useful, but entropy catches something your eyes may miss. For a row of attention weights p, entropy is H(p) = -Σ p_i log p_i (this H is the entropy function, not the head count H from Step 5). Low entropy means concentrated attention; high entropy means spread-out attention.

entropy = -(weights * (weights.clamp_min(1e-9).log())).sum(dim=-1)  # (H, T)
mean_entropy_per_head = entropy.mean(dim=-1)                        # (H,)

Why does this metric matter? Because raw heatmaps can mislead. A head may look mostly diagonal, but entropy tells you whether that diagonal is soft or nearly one-hot. When you remove scaling, this metric often falls hard. That gives you a quantitative handle on the failure.
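Before reading entropy numbers off a real model, it helps to know the two extremes. This plain-Python sketch (the helper name `row_entropy` is an assumption of the example) evaluates a one-hot row and a uniform row, then lists the ceiling that the causal mask imposes on each row.

```python
import math

def row_entropy(p, eps=1e-9):
    """H(p) = -sum p_i log p_i, clamping p_i before the log."""
    return sum(-pi * math.log(max(pi, eps)) for pi in p)

one_hot = [0.0, 0.0, 1.0, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]
print(row_entropy(one_hot))                # zero for a one-hot row
print(row_entropy(uniform), math.log(4))   # uniform over n entries gives log(n)

# Under a causal mask, row i can spread over at most i + 1 positions,
# so its entropy is capped at log(i + 1): early rows look sharp by construction.
for i in range(4):
    print(i, round(math.log(i + 1), 3))
```

That cap matters when you average entropy over rows. Row 0 always contributes zero, so compare heads over the same set of positions rather than across sequences of different lengths.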

BREAK IT

Building attention is useful. Breaking it proves why the pieces exist. There are two failures to force on purpose: remove the 1/√d scaling factor, and remove causal masking. Each failure teaches a different lesson. The first teaches you why score magnitude matters. The second teaches you why the training setup must prevent cheating.

Break 1

Remove the scaling factor

Replace this:

scores = (Q @ K.T) / math.sqrt(d_head)

With this:

scores = Q @ K.T

Do not change anything else. Then print the raw dot products before softmax, the attention weights after softmax, and the per-head attention entropy.

What happens? The raw dot products get large. A dot product sums over d_head terms. If d_head is 64 or more, you are adding a lot of numbers together. Softmax then sees large differences and becomes brutally decisive. A row that should read [0.05, 0.12, 0.51, 0.20, 0.08, 0.04] becomes [0.00, 0.00, 1.00, 0.00, 0.00, 0.00].

At first that may seem good. Isn't decisive attention a sign that the model knows what matters? No — not when it happens because your score scale is broken. Softmax has a useful operating range. When inputs are moderate, changes in the scores still change the output probabilities smoothly. When inputs are huge, softmax saturates: one position dominates, and small changes to the raw scores barely affect the probabilities. That means gradients through softmax get tiny for most positions and learning becomes brittle.

This is exactly the kind of failure that vanishes if you only stare at the final formula and never inspect internals. You need to print the raw scores and see that the problem begins before softmax. The scaling factor is not style. It is a guardrail that keeps dot products in a range where softmax and gradient descent can still behave well.
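Here is a small way to watch the saturation on a single row. It is plain Python and assumes random standard-normal query and key vectors with d_head = 64 and six keys, which is a stand-in for real projections, not a measurement from a trained model. The helper names `softmax` and `entropy` are local to the example.

```python
import math
import random

random.seed(0)
d_head, n_keys = 64, 6

def softmax(scores):
    m = max(scores)                                   # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p, eps=1e-9):
    return sum(-pi * math.log(max(pi, eps)) for pi in p)

q = [random.gauss(0, 1) for _ in range(d_head)]
keys = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(n_keys)]
raw = [sum(a * b for a, b in zip(q, k)) for k in keys]   # unscaled dot products
scaled = [s / math.sqrt(d_head) for s in raw]

for name, scores in [("unscaled", raw), ("scaled", scaled)]:
    p = softmax(scores)
    print(name, [round(x, 2) for x in p], "entropy:", round(entropy(p), 3))
```

Both rows rank the keys in the same order, since dividing by a positive constant never changes which score is largest. What changes is how much probability the runners-up keep, and the entropy number captures exactly that.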

Break 2

Remove causal masking

Delete this line:

scores = scores.masked_fill(mask == 0, float('-inf'))

Then inspect the attention matrix on a training example. Tokens will attend to future tokens, and you will literally see non-zero weights above the diagonal. That means token position i is using information from positions i+1, i+2, and beyond.

During training, that makes the task easier in a fake way. Suppose the training text is "The capital of France is Paris." At the position before "Paris", the model should learn to predict "Paris" from the previous context. Without masking, it can simply attend directly to the token "Paris" itself. That lowers training loss for the wrong reason. You are no longer training an autoregressive model. You are training a sequence model with future leakage.

At inference time, the cheating path disappears. Future tokens do not exist yet. The model cannot use the same shortcut. That mismatch between training and inference is the real damage: the model learned under easier conditions than it will face when deployed. Causal masking is what makes GPT-style training honest.
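A direct test for leakage is to change only the last token and check whether any earlier position's output moves. The sketch below assumes a single head with random weights and toy sizes; `attend` is a helper written for this example.

```python
import math
import torch

torch.manual_seed(0)
T, d = 5, 8                                   # toy sizes (assumptions for this demo)
W_Q, W_K, W_V = (torch.randn(d, d) / math.sqrt(d) for _ in range(3))

def attend(x, causal):
    """Single-head attention; `causal` toggles the mask (illustrative helper)."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = (Q @ K.T) / math.sqrt(d)
    if causal:
        mask = torch.tril(torch.ones(T, T))
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

x = torch.randn(T, d)
x_future = x.clone()
x_future[-1] = torch.randn(d)                 # change only the last token

for causal in (True, False):
    delta = (attend(x, causal) - attend(x_future, causal))[:-1].abs().max().item()
    print(f"causal={causal}: max change at earlier positions = {delta:.6f}")
```

With the mask on, earlier positions should not move at all. With the mask off, they do, which is the future leakage described above. The same check works as a regression test for any attention implementation you write later.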

The deeper lesson

Attention looks simple in formula form. That simplicity is deceptive. Every piece in the formula exists because something breaks without it: no Q/K/V split means less expressive matching and retrieval; no scaling means softmax saturation; no masking means future-token cheating; no multi-head means fewer parallel relational patterns. This is what engineering looks like inside models — not a bag of arbitrary tweaks, but a trail of fixes attached to specific failure modes. The formula is small because the engineers were solving concrete problems, not decorating the math.

Questions to answer

If you remove the softmax and use raw scores directly to weight the values, what specifically breaks during training, and which row property no longer holds?
If d_head doubles from 64 to 128, by roughly what factor do raw dot products grow on average, and what does that imply about the right scaling factor?
What is the difference between a head with very low entropy that is doing something useful and a head with very low entropy that has collapsed?
If you swap the causal mask for a "bidirectional" mask (no masking), what kind of task is this model now solving — and why is that not the same task as next-token prediction?
For a transformer with 12 heads and d_model = 768, what is d_head? If you doubled the heads to 24, what would change in compute and in representational capacity?

Go further

Project 8 — Flash Attention and Tiled Kernels. The same attention operation, now reimplemented so it does not materialise the full T × T matrix. Online softmax with tile decomposition.
Project 13 — Fast Inference: The KV Cache. At inference time, you do not need to recompute attention for tokens that have already been processed. Build the cache that turns O(n²) generation into O(n) per token.
The Annotated Transformer (Harvard NLP). A line-by-line annotated PyTorch re-implementation of the original transformer paper. Good companion reading after this chapter.
Karpathy, Let's build GPT. The two-hour video walkthrough. Good to watch after you have already written attention yourself.

What you now know

You can write attention from a blank file. You can name what each piece in the formula prevents: scaling prevents softmax saturation, masking prevents future-token leakage, multi-head gives you parallel relational patterns, the output projection re-mixes the heads. You have measured per-head entropy and watched it drop when scaling is removed. You have seen the upper triangle fill with weight when masking is removed. The next time you read a paper that adds something to attention — sliding windows, rotary embeddings, ALiBi — you will recognise it as another fix attached to another failure mode, not arbitrary decoration.

Chapter 5 — Next

Your GPT From a Blank File

Take attention, wrap it in a transformer block, stack it, and train a working GPT.

Continue the build

This was Chapter 4 of 35.

The full book is 934 pages and 256,587 words. 35 hands-on projects from autograd to fused specialists. PDF and EPUB on Leanpub, lifetime free updates.
