Attention from scratch.
Build it. Break it. Measure it.
Attention is the first place in a transformer where one token gets to look at the others and decide, in real time, which ones matter. The mental model fits in one paragraph. The code that implements it involves enough reshape, transpose, and broadcast that a careless line will land you somewhere you cannot debug from the error alone. This cluster of Under The Hood is where you write that code and watch the math behave.
What attention actually is.
Attention is a routing operation. For every position in a sequence, the model produces three vectors from the token's hidden state: a query, a key, and a value. The query is what that token is looking for. The key is what each token advertises about itself. The value is what gets carried forward if the token is selected. The mechanism compares every query against every key, turns those comparisons into weights with a softmax, and uses those weights to mix the value vectors into a new representation for each position.
The formula is one line: Attention(Q, K, V) = softmax((QK^T) / sqrt(d_k)) V. Read out loud, it says: take every token's query, dot it against every token's key, scale the result so the magnitudes do not blow up, turn the scaled scores into a row-wise probability distribution, then use those probabilities to take a weighted sum of the value vectors. The output is the same shape as the input. The model has not changed its dimensions. It has only updated each position to carry context from the positions that matter to it.
Four pieces of the formula carry their weight for specific reasons. The Q, K, V split exists because what a token seeks, what it offers, and what it contributes are not the same job; a single projection would force all three jobs through one bottleneck. The 1 / sqrt(d_k) scaling exists because dot products grow with dimension, and unscaled scores push softmax into saturation where gradients die. The causal mask exists because, in an autoregressive model, position i must not see any later position during training; every entry strictly above the diagonal of the score matrix gets set to negative infinity before softmax so future weights become exactly zero. Multi-head splitting exists because language has many relations stacked on top of each other, and a single head produces only one attention pattern per position. Eight or sixteen smaller heads, each working in a slice of the embedding, can specialize.
Every one of those design choices exists because something specific breaks without it. That is the part most explanations skip, and it is the part this cluster is built around.
The projects this cluster covers.
Two projects make up this cluster. The first builds attention itself, from a blank file, in pure PyTorch. The second takes the working attention and rewrites it so that peak memory stops scaling with the square of the sequence length.
Project 4: [...] sqrt(d_head), apply the causal mask, softmax row-wise, and mix the value vectors. Then add multi-head splitting, plot the per-head attention heatmaps, and measure per-head entropy. Then BREAK IT: remove the scaling, then remove the mask, and watch the failure modes appear in the data before they appear in the loss. Read the full chapter →
Project 8: [...] (batch, heads, seq, seq). At sequence length 4096 with 8 heads, that tensor is roughly 256 MB per sequence per layer in fp16, which reaches roughly 2 GB per layer at batch size 8, and it kills the run on a consumer GPU. This project rewrites the forward pass as a tiled loop with an online softmax, so the full N-by-N score matrix never gets materialized. Peak memory drops from quadratic in sequence length to linear, in code you can still read line by line. Read the full chapter →
The mini example.
Here is what one head of scaled dot-product attention looks like by the time you finish Project 4. This is the version with causal masking but without the multi-head reshape; that gets added one step later in the chapter.
import math
import torch

def attention(x, W_Q, W_K, W_V):
    T, d_model = x.shape
    d_head = W_Q.shape[1]
    Q = x @ W_Q  # (T, d_head)
    K = x @ W_K  # (T, d_head)
    V = x @ W_V  # (T, d_head)

    scores = (Q @ K.T) / math.sqrt(d_head)  # (T, T)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float('-inf'))
    weights = torch.softmax(scores, dim=-1)  # (T, T), rows sum to 1
    return weights @ V  # (T, d_head)
Fifteen lines. By the end of Project 4 you can read every line and say what would happen if you deleted it. Delete the divide by sqrt(d_head) and softmax saturates. Delete the masked_fill and the model starts attending to the future. Delete the row-wise softmax and the output is no longer a weighted mixture. None of those failures is hypothetical. The chapter has you run each one and watch what changes.
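The weighted-mixture claim is easy to check on a toy case. The sketch below is not the chapter's code. It uses made-up one-dimensional values and scores for a single query, in plain Python, to show that softmax weights keep the output between the smallest and largest value, while raw scores used directly as weights do not.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical numbers for one query attending over three positions.
values = [1.0, 2.0, 4.0]    # one-dimensional stand-ins for value vectors
scores = [3.0, -1.0, 2.0]   # already-scaled attention scores

weights = softmax(scores)
mixed = sum(w * v for w, v in zip(weights, values))   # softmax kept
raw = sum(s * v for s, v in zip(scores, values))      # softmax deleted

print("softmax weights:", [round(w, 3) for w in weights])
print("with softmax:   ", round(mixed, 3))
print("without softmax:", raw)
```

With softmax the result stays inside the range of the values it mixes. Without it, the result is just a linear combination and can land anywhere.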
Why BREAK IT matters here.
Building attention is useful. Breaking it is what proves the pieces exist for a reason. Project 4 ends with two deliberate failures, and they teach two completely different lessons.
"Remove the 1/sqrt(d_head) scaling factor and print the raw dot products. They will be much larger than you expect. Softmax then becomes brutally decisive: a row that should read like [0.20, 0.15, 0.30, 0.35] instead reads [0.00, 0.00, 0.97, 0.03]. That is not the model getting more confident. That is softmax saturating. The gradient through the suppressed positions is now nearly zero, and the model loses the ability to make careful credit assignments across candidate tokens. Per-head entropy collapses. Training plateaus earlier, at a worse loss."
The second failure is removing the causal mask. The attention matrix then has non-zero weights above the diagonal, which means position i is using information from position i+1. During training, the model can just attend directly to the next token it is supposed to predict. Loss looks suspiciously good. At inference time the cheating path is gone and the model collapses. The mismatch between training and inference is the actual damage: the model learned under easier conditions than it will face in production.
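The leak can be seen directly in the weights. The following is a minimal sketch in plain Python, assuming random queries and keys rather than a trained model; the helper names attention_weights and future_mass are illustrative, not from the book. It measures how much weight each row places on later positions, with and without the mask.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to exactly 0.0."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K, causal):
    """Return the (T, T) attention weights for lists of query and key vectors."""
    T, d_head = len(Q), len(Q[0])
    weights = []
    for i in range(T):
        row = []
        for j in range(T):
            if causal and j > i:
                row.append(float("-inf"))  # later position: masked out
            else:
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                row.append(dot / math.sqrt(d_head))
        weights.append(softmax(row))
    return weights

def future_mass(weights):
    """Total weight each row places strictly above the diagonal."""
    return [sum(row[i + 1:]) for i, row in enumerate(weights)]

random.seed(0)
T, d_head = 6, 8
Q = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(T)]
K = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(T)]

masked = future_mass(attention_weights(Q, K, causal=True))
unmasked = future_mass(attention_weights(Q, K, causal=False))
print("with mask:   ", [round(m, 3) for m in masked])
print("without mask:", [round(m, 3) for m in unmasked])
```

With the mask, every row reports zero future weight. Without it, every row except the last (which has no future to look at) puts some weight on positions it should not see.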
Both failures are invisible if you only stare at the final formula. You catch them by printing the raw scores, plotting the attention heatmap, and computing per-head entropy. Once you have seen each one happen, the formula stops feeling like notation and starts feeling like engineering: a small set of fixes attached to specific failure modes that someone solved on the way here.
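Printing raw scores and computing entropy takes only a few lines. The sketch below assumes d_head = 64 and four hypothetical raw dot products for a single query; the numbers are chosen for illustration and do not come from a real model. It compares the softmax row and its entropy with and without the 1/sqrt(d_head) scaling.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

d_head = 64
# Hypothetical raw dot products of one query against four keys.
raw_scores = [18.0, 16.0, 28.0, 24.5]

scaled = softmax([s / math.sqrt(d_head) for s in raw_scores])
unscaled = softmax(raw_scores)

print("scaled:  ", [round(p, 2) for p in scaled], "entropy", round(entropy(scaled), 2))
print("unscaled:", [round(p, 2) for p in unscaled], "entropy", round(entropy(unscaled), 2))
```

The scaled row spreads its weight across all four positions. The unscaled row hands almost everything to one position, and its entropy drops sharply, which is the same signature the per-head entropy measurement is meant to surface.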
FAQ
Do the Q, K, and V roles actually mean anything?
Yes. Query, key, and value are three learned linear projections of the same token, but they are doing three different jobs. The query is what a token is looking for. The key is what a token advertises about itself. The value is what gets carried forward if the token is selected. The cleanest mental model is a library: the request slip is the query, the title on the shelf is the key, and the actual book you take home is the value. The librarian matches slips against titles, not against contents.
Why divide by the square root of d_k?
Because dot products grow with dimension. A dot product of two vectors of size d sums d term-by-term products, and when the components are roughly independent with zero mean and unit variance, the typical magnitude of that sum scales like sqrt(d). Without the scaling, softmax sees very large scores, picks one winner, and the gradient through the other positions becomes nearly zero. The model still trains, but it plateaus earlier and at a worse loss. The 1 / sqrt(d_k) factor keeps the score distribution in a range where softmax is still informative and gradients still flow.
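The sqrt(d) growth can be checked empirically. This is a small simulation in plain Python, assuming query and key components drawn independently from a standard normal distribution, which is an idealization of real activations. The function name dot_std and the sample count are illustrative choices.

```python
import math
import random

def dot_std(d, n_samples=500, seed=0):
    """Empirical standard deviation of q . k for random d-dimensional q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0, 1) for _ in range(d)]
        k = [rng.gauss(0, 1) for _ in range(d)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / n_samples
    var = sum((x - mean) ** 2 for x in dots) / n_samples
    return math.sqrt(var)

# Ratio of the measured spread to sqrt(d); it should stay close to 1.
ratios = {}
for d in (16, 64, 256):
    raw = dot_std(d)
    ratios[d] = raw / math.sqrt(d)
    print(f"d={d:3d}  std of raw dots={raw:6.2f}  after dividing by sqrt(d)={ratios[d]:.2f}")
```

The raw spread grows as d grows, while the spread after dividing by sqrt(d) stays near 1 regardless of dimension. That is the sense in which the scaling keeps scores in a range softmax can work with.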
Why softmax and not something cheaper?
Softmax gives you a probability distribution: positive weights that sum to 1. That property is what makes the output of attention a true weighted mixture of value vectors rather than an arbitrary linear combination. There is research on linear attention and other softmax replacements, and some of it works well in narrow settings, but exact softmax remains the default because the probability semantics carry over cleanly into the rest of the architecture, including masking and gradient behavior. The book covers a memory-efficient way to compute exact softmax attention in Project 8 on Flash Attention.
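The tiled rewrite in Project 8 depends on an online softmax, which produces the exact softmax result while visiting the scores one block at a time. What follows is a simplified sketch of that idea, not the book's implementation: it handles a single query, treats each value as a scalar, and uses illustrative function names and a hypothetical block_size parameter.

```python
import math

def softmax_weighted_sum(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))

def online_softmax_weighted_sum(scores, values, block_size=2):
    """Same result, but only one block of scores is looked at per step."""
    running_max = float("-inf")
    running_sum = 0.0   # sum of exp(score - running_max) seen so far
    running_out = 0.0   # unnormalized weighted sum of values seen so far
    for start in range(0, len(scores), block_size):
        block_s = scores[start:start + block_size]
        block_v = values[start:start + block_size]
        new_max = max(running_max, max(block_s))
        # Rescale everything accumulated under the old max to the new max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(block_s, block_v):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum

scores = [1.0, 3.0, -2.0, 0.5, 4.0]
values = [10.0, 20.0, 30.0, 40.0, 50.0]
print(softmax_weighted_sum(scores, values))
print(online_softmax_weighted_sum(scores, values, block_size=2))
```

The two printed numbers agree up to floating-point rounding. The running maximum and the correction factor are what let the loop discard each block after using it, so the full row of scores never has to exist at once.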
Is multi-head attention just attention done several times?
Not quite. Each head sees a smaller slice of the embedding, not the full thing. If d_model is 512 and you have 8 heads, each head works in 64 dimensions. The heads compute attention independently and in parallel, then their outputs are concatenated and run through a final output projection. The reason for the split is that heads can learn different relational patterns: one might track local context, another might match brackets, another might watch punctuation. One giant head would have to learn all those patterns at once. Many small heads can specialize.
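The split, the parallel attention, and the concatenation can be sketched in PyTorch as follows. This is a minimal version under stated assumptions, not the chapter's code: it handles one unbatched sequence, the function name multi_head_attention and the output matrix W_O are illustrative, and the weights are random rather than trained.

```python
import math
import torch

def multi_head_attention(x, W_Q, W_K, W_V, W_O, n_heads):
    """Causal multi-head self-attention for one sequence x of shape (T, d_model)."""
    T, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (T, d_model) -> (n_heads, T, d_head): each head gets its own slice.
        return t.view(T, n_heads, d_head).transpose(0, 1)

    Q, K, V = split(x @ W_Q), split(x @ W_K), split(x @ W_V)
    scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_head)  # (n_heads, T, T)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))       # mask broadcasts over heads
    weights = torch.softmax(scores, dim=-1)                 # rows sum to 1 per head
    heads = weights @ V                                     # (n_heads, T, d_head)
    merged = heads.transpose(0, 1).reshape(T, d_model)      # concatenate the heads
    return merged @ W_O, weights

torch.manual_seed(0)
T, d_model, n_heads = 10, 512, 8
x = torch.randn(T, d_model)
W_Q, W_K, W_V, W_O = (torch.randn(d_model, d_model) / math.sqrt(d_model) for _ in range(4))

out, weights = multi_head_attention(x, W_Q, W_K, W_V, W_O, n_heads)
print(out.shape)      # one d_model-sized vector per position
print(weights.shape)  # one (T, T) attention matrix per head
```

Returning the per-head weights alongside the output is a convenience for inspection: each of the 8 matrices can be plotted as its own heatmap.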
Do I need a GPU to do this project?
No. Project 4 runs comfortably on a laptop CPU. The whole point of the chapter is to inspect the attention matrix on small inputs, and toy sequences of 8 to 20 tokens are enough to see everything that matters. Project 8 on Flash Attention is where a GPU starts to help, because the lessons there are about peak memory at long context. For Project 4 alone, CPU is fine.
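The peak-memory concern is plain arithmetic on the shape of the score tensor. Here is a back-of-the-envelope sketch, assuming 2 bytes per element (fp16) and counting only the (batch, heads, seq, seq) scores, not activations, weights, or gradients. The helper name score_tensor_bytes is illustrative.

```python
def score_tensor_bytes(batch, heads, seq_len, bytes_per_element=2):
    """Bytes needed to hold one (batch, heads, seq, seq) score tensor."""
    return batch * heads * seq_len * seq_len * bytes_per_element

for seq_len in (1024, 2048, 4096, 8192):
    mib = score_tensor_bytes(batch=1, heads=8, seq_len=seq_len) / 2**20
    print(f"seq_len={seq_len:5d}: {mib:6.0f} MiB per layer at batch size 1")
```

Doubling the sequence length quadruples the figure. At 8 heads and sequence length 4096 the scores take 256 MiB per sequence, so a batch of 8 reaches 2 GiB for a single layer.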
Write attention tonight.
One chapter to a working single-head attention. Two chapters to multi-head with per-head heatmaps. Two more to a memory-efficient tiled version that survives long context. The book is on Leanpub with lifetime updates.