Cluster · Foundations

Build an LLM from scratch.

Q: Is this the same book as Sebastian Raschka's?

Different book. Raschka's Build a Large Language Model (From Scratch) is excellent and tightly focused on getting to a working GPT. Under The Hood covers 35 projects spanning the whole modern stack — post-training, KV cache, MoE, quantization, multimodal, specialist fusion — and adds a deliberate BREAK IT pass to every project.

Build it. Break it. Measure it.

An LLM is not a single artifact. It is six or seven layers of work that have to fit together, and each layer is small enough to write yourself in an afternoon. This page is the entry point into the stretch of Under The Hood where those layers get written.

Buy on Leanpub — $15.99 ~~$19.99~~ Read Chapter 5 free

What "build an LLM from scratch" actually means.

The phrase gets used loosely. Sometimes it means importing PyTorch and writing a training loop. Sometimes it means cloning nanoGPT and changing the hyperparameters. Under The Hood takes the phrase literally. You start with a scalar autograd engine — fifty lines of Python, no NumPy, no Torch — and you stop only when you have a routed mixture of experts that fuses two independently trained specialists into one model.

Building from scratch is not nostalgia. The reason you do it is that the things people get wrong about LLMs in production almost always trace back to one of the layers below them. Tokenizer misconfiguration. Position-encoding edge cases. KV cache that drifted across a checkpoint restart. Reward model trained on the wrong distribution. When you have written each layer yourself, those failures stop being mysterious — they become familiar shapes you recognise.

The build, in order.

The book moves through the stack in the order you would actually learn it. The cluster below names the projects from Part I through Part II that get you from a blank file to a working decoder.

Project 1

A scalar autograd engine

Backprop in 50 lines. No tensors. You write the Value class, the operator overloads, the topological sort. By the end you know what an autograd graph is at the same depth a compiler writer knows what an AST is.
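To make the three pieces named above concrete, here is a minimal sketch of a scalar autograd engine in pure Python. It is an illustration, not the book's code: the attribute names `_parents` and `_backward` are assumptions, and only addition and multiplication are implemented.

```python
class Value:
    """A scalar that remembers how it was computed (illustrative sketch)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort: a node's gradient must be complete before it
        # pushes gradient into its parents.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


a, b = Value(2.0), Value(-3.0)
c = a * b + a          # a is used twice, so its gradient accumulates
c.backward()
print(c.data, a.grad, b.grad)  # -4.0 -2.0 2.0
```

The `+=` in each `_backward` is the detail worth noticing: `a` feeds two operations, and its gradient is the sum of both contributions (`b + 1`).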

Project 2

Tensors, broadcasting, and a tiny PyTorch-shaped library

Lift scalars to tensors. Implement broadcasting from the shape rules. Re-derive matmul as the operation that makes the autograd graph small.
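The shape rules mentioned above fit in a few lines. This is a sketch assuming the NumPy/PyTorch broadcasting convention (align shapes from the trailing dimension, treat missing dimensions as 1, allow a mismatch only when one side is 1). The function name `broadcast_shape` is hypothetical.

```python
def broadcast_shape(a, b):
    """Return the shape two tensors of shapes a and b broadcast to."""
    result = []
    # Walk both shapes from the last dimension backwards.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1   # missing leading dims act like 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"cannot broadcast {a} with {b}")
        # A size-1 dimension stretches to match the other side.
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


print(broadcast_shape((4, 1, 8), (3, 1)))  # (4, 3, 8)
```

Shapes such as `(2, 3)` and `(4, 3)` raise a `ValueError`, because 2 and 4 differ and neither is 1.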

Project 3

BPE tokenization from scratch

Pair counts, merge rules, the actual minbpe algorithm. Tokenize a corpus, look at what the merges decide to glue together, and understand why GPT-2's tokenizer treats whitespace the way it does.
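As a rough sketch of the pair-count and merge loop described above, here is a byte-level BPE trainer. It assumes the common convention of starting from UTF-8 bytes and numbering new tokens from 256. The function names are illustrative and are not taken from the book or from minbpe.

```python
from collections import Counter


def pair_counts(ids):
    """Count every adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    ids = list(text.encode("utf-8"))       # start from raw bytes (0..255)
    merges = {}
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break                          # nothing left to merge
        pair = max(counts, key=counts.get) # most frequent pair wins
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", 3)
print(ids)     # [258, 100, 258, 97, 99]
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```

Eleven bytes become five tokens after three merges. Ties between equally frequent pairs are broken here by first occurrence, which is a choice of this sketch rather than part of the algorithm.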

Project 4

Attention from scratch

Scaled dot-product attention. The query / key / value formulation, the mask, the softmax. Then BREAK IT: turn off scaling, remove the mask, kill the softmax, and watch the model fail in three different specific ways. Read the full chapter →
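The three ingredients above (scaling, mask, softmax) can be seen together in a small single-head sketch written with plain Python lists. It is an illustration under simplifying assumptions: no batching, no learned projections, and the causal mask is applied by never scoring future positions, which has the same effect as setting those scores to negative infinity before the softmax.

```python
import math


def softmax(xs):
    m = max(xs)                              # subtract the max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of row vectors."""
    d = len(Q[0])
    out = []
    for t, q in enumerate(Q):
        # Causal mask: position t only scores positions 0..t.
        scores = [sum(qi * ki for qi, ki in zip(q, K[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # Output row is the weighted average of the visible value rows.
        out.append([sum(w * V[s][j] for s, w in enumerate(weights))
                    for j in range(len(V[0]))])
    return out


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = causal_attention(Q, K, V)
print(out[0])  # [1.0, 2.0]  (the first position can only see itself)
print(out[1])  # a blend of both value rows, weighted toward the second
```

Each of the three sabotage steps corresponds to one line here: drop `/ math.sqrt(d)`, change `range(t + 1)` to `range(len(K))`, or replace `softmax(scores)` with the raw scores.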

Project 5

Your GPT from a blank file

Token embeddings, position embeddings, the residual stream, the language-model head. Stack four transformer blocks. Train on TinyShakespeare. Sample. Read what comes out. Read the full chapter →

Project 6

Training loops that survive contact

Learning-rate schedules, gradient clipping, checkpointing, the watch-the-loss-curve reflexes. The project's BREAK IT pass intentionally tanks the run four different ways so you know what a broken run looks like before one happens to you accidentally.
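Two of the tools listed above are short enough to sketch directly. The schedule below is one common choice (linear warmup followed by cosine decay), and the clipping function rescales gradients by their global L2 norm. Every default value is a placeholder assumption, not a setting from the book.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=100, total=1000):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    if step >= total:
        return min_lr
    progress = (step - warmup) / (total - warmup)   # 0.0 -> 1.0
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a flat list of gradient values so their L2 norm is <= max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


for step in (0, 99, 550, 1000):
    print(step, lr_at(step))

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Logging the pre-clip norm alongside the loss is a cheap habit: a norm that suddenly spikes is often visible a few steps before the loss curve shows trouble.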

The mini example.

The smallest unit of progress in the book is a working forward pass. Here is what Project 5 leaves you with after three hours of typing:

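import torch
import torch.nn as nn

# Block is the transformer block class written earlier in the project (not shown here).
class GPT(nn.Module):
    def __init__(self, vocab, n_embd, n_head, n_layer, block_size):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, n_embd)        # token id -> vector
        self.pos_emb = nn.Embedding(block_size, n_embd)   # position -> vector (learned)
        self.blocks  = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f    = nn.LayerNorm(n_embd)               # final layer norm
        self.head    = nn.Linear(n_embd, vocab, bias=False)  # language-model head

    def forward(self, idx):
        B, T = idx.shape                                  # batch, sequence length (T <= block_size)
        pos  = torch.arange(T, device=idx.device)
        x    = self.tok_emb(idx) + self.pos_emb(pos)      # (B, T, n_embd); pos broadcasts over B
        x    = self.blocks(x)
        x    = self.ln_f(x)
        return self.head(x)                               # logits, shape (B, T, vocab)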

Twenty lines. Every line corresponds to something the book has already explained. The point is not the code — it is that by Project 5 you can read every line and say what would happen if you deleted it.

Why BREAK IT is the wedge.

Most from-scratch books stop when the thing works. The deliberate sabotage at the end of every project is what separates code you typed from code you understand.

From Project 4 — Attention

"Remove the /√d scaling and watch the softmax saturate. The model still trains — it just plateaus 30 percent earlier and at a worse loss. You learn that the scaling is not a numerical nicety. It is the difference between a useful gradient and a dead one."
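The saturation described in the excerpt can be observed without training anything. The sketch below assumes random unit-variance queries and keys with `d = 64` (both illustrative choices) and compares the largest attention weight with and without the scaling. It only illustrates the mechanism and does not reproduce the training result quoted above.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
d, n_keys = 64, 16
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_keys)]

# Unscaled dot products have a spread that grows like sqrt(d).
raw = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
# Dividing by sqrt(d) brings the spread back to roughly 1.
scaled = [s / math.sqrt(d) for s in raw]

p_raw, p_scaled = softmax(raw), softmax(scaled)
print("largest weight without scaling:", max(p_raw))
print("largest weight with scaling:   ", max(p_scaled))
```

When one weight approaches 1, the terms `p * (1 - p)` that appear in the softmax gradient approach 0 for every position, which is the sense in which the unscaled version leaves little gradient to learn from.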

BREAK IT is also why the excerpts on this site read the way they do. The deliberate-failure sections are the highest-signal content in any technical book, because failures have specific shapes and specific causes. They are quotable. They are searchable. They are how you start to recognise problems in production.

FAQ

Do I need a GPU to build an LLM from scratch?

No. The early projects run comfortably on a laptop CPU. A small GPT trains in minutes on CPU at the sizes used in the book. A consumer GPU helps from the training-at-scale projects onward, but is not required to follow the book.

How long does it take to work through the whole book?

Working part-time, plan for fourteen weeks at roughly one part of the book per fortnight. The first two parts go fast because the projects are short. Training-at-scale and post-training projects take longer because they have measurement steps that need real runs.

Is the code in PyTorch?

PyTorch for tensor projects from Project 2 onward. Project 1 uses a hand-written scalar autograd engine in pure Python — the point is to see how a tensor library works before importing one. CUDA-style kernels in later projects use Triton where a working tile-based kernel is the lesson.

Where does the code live?

The public code companion is at github.com/mechramc/Under-the-hood. Each project maps to a directory you can clone and run.

Start the build

Open Chapter 1 tonight.

Six projects get you from a blank file to a working GPT decoder. The book is on Leanpub with lifetime updates.

Buy on Leanpub — $15.99 ~~$19.99~~ Back to the pillar

Related clusters and excerpts.

Attention from Scratch

KV Cache and Fast Inference

Your GPT From a Blank File

Quantization and Deployment
