Cluster · MoE

Mixture of experts from scratch.

Build it. Break it. Measure it.

A mixture-of-experts layer is a transformer feedforward block with a traffic problem. You add a small router, you put several feedforward networks behind it, and you ask the router to decide which one or two of them should handle each token. This page is the entry point into the stretch of Under The Hood where that routed layer gets written, trained, instrumented, and deliberately broken.

Buy on Leanpub — $15.99 ~~$19.99~~ Read Chapter 18 free

What MoE actually is.

A mixture-of-experts layer is a feedforward block with a router in front of it. Where a dense transformer sends every token through the same FFN, an MoE layer holds N feedforward networks, the experts, and asks a small linear layer to pick the top k of them for each token. That is the whole structural change. The router scores each expert with a single linear projection, the top k are kept, their outputs are mixed by the renormalized routing probabilities, and the rest of the experts sit out the forward pass.

The reason this is interesting is that it separates two quantities that dense models tie together. Total parameters set how much knowledge the model can store. Active parameters per token set how much compute each token pays for. In a dense model, you scale them together — wider FFN means more storage and more compute, every token, every step. In an MoE layer, you can add experts without raising active compute, as long as k stays fixed. That is the conditional-compute argument. More total parameters, roughly similar work per token.
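The split between total and active parameters is easy to check with arithmetic. The sketch below is illustrative only: it assumes each expert is a two-matrix FFN with biases ignored, and the sizes (`d_model=512`, `d_ff=2048`) and the helper names `ffn_params` and `moe_params` are made up for the example, not taken from the book.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a two-matrix FFN (up- and down-projection), biases ignored."""
    return 2 * d_model * d_ff


def moe_params(d_model: int, d_ff: int, num_experts: int, k: int):
    """Return (total, active-per-token) weight counts for one MoE layer."""
    router = d_model * num_experts           # single bias-free linear projection
    expert = ffn_params(d_model, d_ff)
    total = router + num_experts * expert    # what the checkpoint stores
    active = router + k * expert             # what one token actually runs
    return total, active


# Illustrative sizes only; k stays fixed while the expert count grows.
for n in (4, 8, 16):
    total, active = moe_params(d_model=512, d_ff=2048, num_experts=n, k=2)
    print(f"N={n:2d}  total={total:,}  active={active:,}")
```

Total grows almost linearly with the expert count. Active grows only by the extra router columns, which is why the work per token is "roughly similar" rather than strictly identical.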

This is also why "How big is the model?" stops being a single number once MoE shows up. You need at least three: total parameters in the checkpoint, active parameters per token, and how many experts are actually doing useful work. A sparse model with 47B total and 12B active does not behave like a dense 12B. It stores more, and all 47B still have to sit in memory, but the active slice that runs for any one token is what sets the per-token compute at inference time. The architecture diagram suggests a lot of capacity. The forward pass uses a small fraction of it.

The catch is that the router is now part of the learning problem. It is a linear layer trained by the same gradient that trains everything else, but the only signal it gets is the downstream loss — there is no manager telling it which expert was the right call. Most beginner explanations of MoE describe the experts in detail and treat the router as a footnote. The book takes the opposite position. The router is the project. Get it wrong and the experts may as well not exist.

The projects this cluster covers.

The book introduces MoE late on purpose. By Project 18 you have a working decoder, a training loop you trust, and the instruments to read it. The cluster moves from a single routed block, to the scaling laws that make sparse architectures economically interesting, to the autonomous sweeps that let you measure them.

Project 18

Mixture of Experts

Replace one block's FFN with a 4-expert MoE layer. Build the router as a single linear projection. Implement top-1 and top-2 routing. Add the load-balancing auxiliary loss. Sweep expert counts. Then BREAK IT: freeze two experts and amputate one. Read the full chapter →

Project 19

Scaling Laws

The reason MoE matters at all is that it changes the scaling game. This project fits Chinchilla-style curves to your own runs, then asks what those curves look like when total and active parameters can move independently. The answer reshapes how you read every published MoE paper after it.

Project 20

Autonomous Experimentation

MoE has too many knobs to sweep by hand — number of experts, top-k, balancing weight, capacity factor, which blocks to MoE-ify. This project wires a small autonomous experiment runner that proposes configurations, runs them, logs the four vital signs, and feeds results back into the next round.

The mini example.

The first working MoE forward pass is short. The teaching version runs all experts on all tokens and masks out the ones the router did not pick. It wastes compute, but the mechanics are obvious — and obvious is what you want when you are debugging a router for the first time.

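import torch
import torch.nn as nn

# Expert is the per-expert feedforward module, defined elsewhere.
# It must map (B, T, d) to (B, T, d).
class MoE(nn.Module):
    def __init__(self, d_model, d_ff, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.router  = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [Expert(d_model, d_ff) for _ in range(num_experts)])

    def forward(self, x):                               # (B, T, d)
        # Router probabilities over all N experts.
        router_probs = torch.softmax(self.router(x), dim=-1)    # (B, T, N)
        # Keep the k most probable experts; renormalize their weights to sum to 1.
        topk_probs, topk_idx = torch.topk(router_probs, self.k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        # Scatter the kept weights into a mask; unpicked experts stay at 0.
        mask = torch.zeros_like(router_probs)
        mask.scatter_(-1, topk_idx, topk_probs)         # (B, T, N)

        # Teaching version: run every expert on every token, then take the masked sum.
        expert_out = torch.stack(
            [e(x) for e in self.experts], dim=2)        # (B, T, N, d)
        return (expert_out * mask.unsqueeze(-1)).sum(dim=2)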

About twenty lines. The router is one nn.Linear. The "sparse computation" is a softmax, a top-k, a renormalization, and a masked sum. Everything beyond this (capacity factors, expert parallelism, scatter-gather dispatch) is a systems optimization on top of these four steps. Start here, then make it faster once you trust it.
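One way to "make it faster" is to run each expert only on the tokens routed to it. The sketch below is an assumption about how that could look, not the book's implementation. `Expert` here is a hypothetical two-layer MLP, and `route`, `dense_moe`, and `sparse_moe` are illustrative names. The point is the final check: the dispatch version should reproduce the masked version before you trust it with a training run.

```python
import torch
import torch.nn as nn


class Expert(nn.Module):
    """Hypothetical expert: a plain two-layer MLP (assumed for this sketch)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)


def route(router, x, k):
    """Softmax over experts, keep top-k, renormalize the kept weights."""
    probs = torch.softmax(router(x), dim=-1)
    topk_probs, topk_idx = torch.topk(probs, k, dim=-1)
    return topk_probs / topk_probs.sum(dim=-1, keepdim=True), topk_idx


def dense_moe(router, experts, x, k):
    """Teaching version: every expert runs on every token, then mask."""
    w, idx = route(router, x, k)
    mask = torch.zeros(*x.shape[:-1], len(experts), dtype=x.dtype, device=x.device)
    mask.scatter_(-1, idx, w)
    out = torch.stack([e(x) for e in experts], dim=-2)      # (..., N, d)
    return (out * mask.unsqueeze(-1)).sum(dim=-2)


def sparse_moe(router, experts, x, k):
    """Dispatch version: each expert runs only on the tokens routed to it."""
    flat = x.reshape(-1, x.shape[-1])                       # (B*T, d)
    w, idx = route(router, flat, k)                         # (B*T, k) each
    out = torch.zeros_like(flat)
    for e_id, expert in enumerate(experts):
        # Rows are token positions, columns are which top-k slot picked this expert.
        token_ids, slot = (idx == e_id).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue                                        # expert sits out
        weighted = expert(flat[token_ids]) * w[token_ids, slot].unsqueeze(-1)
        out.index_add_(0, token_ids, weighted)
    return out.reshape(x.shape)


torch.manual_seed(0)
d_model, d_ff, N, k = 16, 32, 4, 2
router = nn.Linear(d_model, N, bias=False)
experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(N)])
x = torch.randn(2, 5, d_model)

dense_out = dense_moe(router, experts, x, k)
sparse_out = sparse_moe(router, experts, x, k)
print("versions agree:", torch.allclose(dense_out, sparse_out, atol=1e-5))
```

The Python loop over experts is still not fast. It only removes the wasted expert calls, and the gathers and scatters it introduces are exactly what the systems work listed above is there to optimize.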

Why BREAK IT matters here.

Most introductions to MoE stop when the loss curve looks plausible. That is the moment the book starts paying attention. A working training run is the weakest possible evidence that an MoE layer is healthy, because the loss can keep going down while the router quietly funnels every token to one or two experts and the rest of the pool turns into dead weight.
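Two cheap readings make that visible. The sketch below assumes you can capture the router's softmax output for a batch of tokens. The `utilization` and `mean_entropy` helpers are illustrative names, entropy is one possible proxy for commitment rather than the book's stated instrument, and the three routers are synthetic stand-ins, not training results.

```python
import torch


def utilization(router_probs, k):
    """Fraction of top-k routing slots that each expert receives."""
    n = router_probs.shape[-1]
    topk_idx = router_probs.topk(k, dim=-1).indices
    counts = torch.bincount(topk_idx.reshape(-1), minlength=n).float()
    return counts / counts.sum()


def mean_entropy(router_probs):
    """Average per-token entropy of the routing distribution, in nats."""
    p = router_probs.clamp_min(1e-9)
    return -(p * p.log()).sum(dim=-1).mean()


N = 4
eye = torch.eye(N).repeat(8, 1)                       # 32 synthetic tokens
# Collapsed: every token strongly prefers expert 0.
collapsed = torch.tensor([0.97, 0.01, 0.01, 0.01]).repeat(32, 1)
# Committed: each token strongly prefers one expert, tokens spread evenly.
committed = 0.96 * eye + 0.01
# Hedged: same winners, but every token is almost indifferent between experts.
hedged = 0.04 * eye + 0.24

for name, probs in [("collapsed", collapsed), ("committed", committed), ("hedged", hedged)]:
    util = utilization(probs, k=1).tolist()
    print(f"{name:9s} utilization={util} entropy={mean_entropy(probs).item():.3f}")
```

The histogram catches the collapsed router. It cannot tell the committed router from the hedged one, which have identical traffic. Entropy can, and that second case is the one the excerpt that follows describes.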

From Project 18 — Mixture of Experts

"The failure I had braced for was router collapse to a single expert. The failure I actually got first was the opposite. The router fanned traffic out almost perfectly evenly across all experts, which sounds healthy, except none of them were learning anything useful because no token had committed to any of them. Balanced is not the same as alive."

The chapter's deliberate failures pick at exactly that point. Freeze two experts and force all routing onto the remaining two — quality degrades, but the model does not collapse, which tells you the surviving experts hold overlapping capacity and the router has room to adapt. Remove one expert entirely and the degradation is smaller, but the utilization histogram is where the real story lives: if one remaining expert jumps to 80% of traffic after the cut, the load-balancing pressure was not a training nicety. It was protecting the system against traffic concentration whenever the expert pool shifted. MoE is not many FFNs. It is a routing problem under load, and the breaks are how you see it.
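A cheap way to rehearse the removal experiment is to mask the expert out at the router and recompute the histogram. This is a sketch under assumptions: setting a logit to negative infinity stands in for deleting the expert, the `route` and `utilization` helpers are illustrative, and the random logits are a placeholder for a trained router, so the printed numbers say nothing about what a real run will show.

```python
import torch


def route(logits, k, removed=()):
    """Top-k routing over expert logits, with optional experts masked out."""
    logits = logits.clone()
    for e in removed:
        logits[..., e] = float("-inf")       # a removed expert can never be picked
    probs = torch.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)
    return topk_probs / topk_probs.sum(dim=-1, keepdim=True), topk_idx


def utilization(topk_idx, num_experts):
    """Fraction of routing slots that each expert receives."""
    counts = torch.bincount(topk_idx.reshape(-1), minlength=num_experts).float()
    return counts / counts.sum()


torch.manual_seed(0)
num_experts, k = 4, 2
logits = torch.randn(1000, num_experts)      # placeholder for router(x) on 1000 tokens

_, idx_before = route(logits, k)
_, idx_after = route(logits, k, removed=(3,))
before = utilization(idx_before, num_experts)
after = utilization(idx_after, num_experts)
print("before:", [round(v, 3) for v in before.tolist()])
print("after: ", [round(v, 3) for v in after.tolist()])
```

On a trained model, the reading to watch is where the removed expert's traffic lands. An even spread across the survivors and a pile-up on one of them are very different outcomes.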


FAQ

Why does MoE save compute?

Because each token only activates a subset of the experts. A dense feedforward network charges every token the full bill for every parameter, whether the token needs that knowledge or not. MoE separates total parameters (how much the model can store) from active parameters per token (what each token actually pays for). Adding more experts grows storage; keeping k fixed keeps the active compute roughly flat.

Why top-2 routing instead of top-1?

Top-1 is cheaper and simpler — one expert wins per token. Top-2 sends the token through two experts and mixes their outputs by routing probability. The book builds both and measures the gap. Top-2 is usually more stable in training and handles missing experts more gracefully, because every token already has a backup path. The quality gap is often smaller than people expect once routing is trained well, which is exactly why you measure it instead of arguing about it.
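One detail is worth checking if you reuse the mini example as printed for top-1. With k=1, renormalizing the kept probability divides it by itself, so the mixing weight is always 1 and the router receives essentially no gradient through the output. How the book's top-1 variant weights its output is not shown on this page. The sketch below only demonstrates the arithmetic, with random stand-ins for the router logits and for the expert outputs. A common alternative for top-1 is to scale by the raw, un-renormalized probability.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(6, 4, requires_grad=True)   # stand-in for router(x): 6 tokens, 4 experts
expert_score = torch.randn(6, 4)                 # stand-in: one scalar per (token, expert) output


def routed_output(k, renormalize):
    """Weighted sum of the selected experts' scores, mimicking the masked sum."""
    probs = torch.softmax(logits, dim=-1)
    w, idx = probs.topk(k, dim=-1)
    if renormalize:
        w = w / w.sum(dim=-1, keepdim=True)
    return (w * expert_score.gather(-1, idx)).sum()


def router_grad(k, renormalize):
    """Largest absolute gradient reaching the router logits."""
    logits.grad = None
    routed_output(k, renormalize).backward()
    return logits.grad.abs().max().item()


g_top1_renorm = router_grad(k=1, renormalize=True)
g_top1_raw = router_grad(k=1, renormalize=False)
g_top2_renorm = router_grad(k=2, renormalize=True)
print(f"top-1, renormalized: {g_top1_renorm:.2e}")
print(f"top-1, raw prob:     {g_top1_raw:.2e}")
print(f"top-2, renormalized: {g_top2_renorm:.2e}")
```

In the renormalized top-1 case, the router can still learn from an auxiliary loss on its probabilities, but not from the task loss. That is one more reason to measure the top-1 versus top-2 gap rather than assume it.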

What is routing collapse?

Routing collapse is the failure mode where the router sends most tokens to one or two experts. Those experts get more gradient updates, improve faster, and the router prefers them even more. Positive feedback locks in. The neglected experts become dead weight. The whole sparse-computation argument falls apart because the model is, in practice, a dense FFN with extra parameters that never train. The fix is a load-balancing auxiliary loss that taxes uneven traffic.
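A common form of that tax multiplies, per expert, the fraction of traffic it receives by the mean probability the router assigns it, then sums and scales by the number of experts. The sketch below assumes that formulation. The book's exact loss may differ, and `load_balancing_loss` and the synthetic routers are illustrative.

```python
import torch


def load_balancing_loss(router_probs, topk_idx):
    """N * sum_i f_i * P_i; equals 1.0 when traffic and probability are both uniform."""
    num_experts = router_probs.shape[-1]
    probs = router_probs.reshape(-1, num_experts)           # (tokens, N)
    # f_i: fraction of routing slots assigned to expert i (a count, so no gradient).
    counts = torch.bincount(topk_idx.reshape(-1), minlength=num_experts).float()
    f = counts / counts.sum()
    # P_i: mean router probability for expert i (this term carries the gradient).
    P = probs.mean(dim=0)
    return num_experts * (f * P).sum()


N = 4
# Balanced: 32 synthetic tokens, winners spread evenly across experts.
balanced = 0.04 * torch.eye(N).repeat(8, 1) + 0.24
# Collapsed: every token strongly prefers expert 0.
collapsed = torch.tensor([0.97, 0.01, 0.01, 0.01]).repeat(32, 1)

balanced_loss = load_balancing_loss(balanced, balanced.topk(1, dim=-1).indices)
collapsed_loss = load_balancing_loss(collapsed, collapsed.topk(1, dim=-1).indices)
print(f"balanced:  {balanced_loss.item():.2f}")
print(f"collapsed: {collapsed_loss.item():.2f}")
```

In training this term would be added to the language-modeling loss with a small coefficient, the "balancing weight" knob. Too small and collapse wins. Too large and the router is pushed toward even traffic regardless of which expert is actually the better fit.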

Is this the same architecture as Mixtral?

Mixtral 8x7B is a production-scale instance of the same family — sparse mixture of experts with top-k routing per token. The book builds a smaller version of the same mechanism inside nanochat: a routed FFN with four experts and a learned linear router, with top-1 and top-2 variants. The systems engineering around Mixtral (expert parallelism, capacity factors, distributed dispatch) is its own problem, but the routing contract is the same one you implement in Project 18.

Do I need a GPU for the MoE project?

A consumer GPU with 12 to 24 GB of VRAM is enough for the proxy version. You replace one block's FFN with a 4-expert MoE layer, train a small nanochat run, and sweep top-1 versus top-2 and a few expert counts. The full sweep across more experts and longer runs benefits from stronger hardware, but the central lessons — routing imbalance, the load-balancing effect, behaviour under expert removal — show up on a single GPU in a few hours.

Open the routed layer

Read Chapter 18 tonight.

One block's FFN replaced with a router and four experts. Top-k routing, load balancing, and two deliberate breaks. The book is on Leanpub with lifetime updates.
