Chapter 32 · Full excerpt

Fusing independently trained specialists.

Specialist fusion means taking two or more large language models that were trained separately on different data and combining them into one routed system at inference time. A small router reads the input, decides which specialist should handle each token or query, and either picks one expert outright or mixes their outputs with continuous weights. No joint retraining of the experts. No shared optimizer step. The composition happens after the experts already exist.

This chapter builds the fusion end to end, evaluates it against each individual specialist, and then breaks it on purpose to show the silent failure mode: fusion that degrades to "average of two mediocre answers" instead of best-of-both, and routers that quietly default to one specialist while looking like they are working. The break is the point. The success case says this can work. The failure case says here is the exact condition that has to hold for it to work at all.

This is Chapter 32 of Under The Hood — Build Every Layer of a Large Language Model from Scratch. The full 35-project book is on Leanpub. Code companion at github.com/mechramc/Under-the-hood.

The concept

Picture three translators at one desk. One handles Python code, one handles medical notes, one handles legal contracts. A clerk skims the incoming page and decides who gets it. The setup works if all three translators learned language from the same textbook and then specialized later. The clerk is not reading their minds, but the shared training leaves enough common structure that routing stays legible. Now wreck that assumption. Each translator learned from a different textbook, with different notation habits, different grammar instincts, different shortcuts. The desk still looks tidy from the outside. Inside, the clerk has no stable basis for comparison. "This feels medical" only makes sense relative to some internal frame, and the specialists no longer share one.

That is the whole project. You take one pretrained model, freeze the early layers so every specialist keeps the same trunk, then train three specialists on three domains: code, medical text, and legal text. Each specialist keeps the same early shared representation and adapts later layers to its domain. Then you build a router. A router is a small model that looks at the shared hidden state and decides how much weight to assign to each specialist. If the input smells like code, push weight toward the code expert. If it smells like a discharge summary, push toward the medical expert. This is not standard Mixture of Experts, where one large model grows experts and routing together during training. It is closer to after-market composition: you train specialists separately, keep the interface controlled, and ask whether they still compose afterward.

The newer picture of "building a model" is modular: keep the shared backbone steady, train specialists independently, connect them with routing. That only works if the modules agree on the signals passing between them. If two specialists start from the same base and preserve the same early layers, their hidden states still live in roughly the same internal coordinate system. If they come from unrelated pretrained models, that shared coordinate system disappears. The router then needs more than three competent specialists. It needs specialists that speak a compatible internal language.

I learned this the hard way running the KALAVAI cooperative-fusion experiments — long training sweeps of LoRA-fused contributors spanning several model families. The pattern was sharp. Runs where contributors started from the same pretrained base and only their LoRA adapters varied composed cleanly. Runs where contributors started from different bases needed a lot more scaffolding to produce a coherent fused output, and they often still underperformed the single best specialist. The shared-foundation requirement was not a paper detail. It was the difference between a system that worked and a system that did not.

There are three plausible compositions you can run on top of a set of trained specialists, and they are not interchangeable. Routed composition reads each input, picks weights for each expert, and combines their predictions per example. Output averaging ignores the input and applies a fixed blend to every example. Weight merging averages the parameter tensors of the specialists directly and runs one fused model at inference. Routed composition asks "can these experts each answer well on different inputs?" Weight merging asks "can these models live in the same parameter space?" Those are different questions, and they fail in different ways. Weight merging is the most fragile, because the same parameter tensor can mean slightly different things after each specialist's fine-tuning has nudged its internal geometry. Fusing predictions, as routed composition and output averaging both do, is less brittle. But output averaging without a router is just a confident way to dilute every expert at once.
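To make the difference concrete, here is a minimal sketch of the three compositions on toy models. Everything in it is an assumption for illustration: `make_expert` builds a tiny MLP as a stand-in for a specialist, the routing weights are written by hand instead of coming from a learned router, and the two stand-ins are initialized independently, which is the worst case for weight merging.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_expert(d_in: int = 8, d_hidden: int = 16, n_out: int = 5) -> nn.Module:
    # Hypothetical stand-in for a specialist: a tiny MLP that emits logits.
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, n_out))

def routed_composition(experts, x, weights):
    # weights: [B, n_experts], one row of mixing weights per example.
    logits = torch.stack([e(x) for e in experts], dim=1)  # [B, n_experts, n_out]
    return (weights.unsqueeze(-1) * logits).sum(dim=1)    # [B, n_out]

def output_averaging(experts, x):
    # Fixed blend: every example gets the same uniform weights.
    return torch.stack([e(x) for e in experts], dim=0).mean(dim=0)

def weight_merging(experts):
    # Average the parameter tensors directly and return one merged model.
    merged = copy.deepcopy(experts[0])
    states = [e.state_dict() for e in experts]
    merged.load_state_dict(
        {k: torch.stack([s[k] for s in states]).mean(dim=0) for k in states[0]}
    )
    return merged

experts = [make_expert(), make_expert()]
x = torch.randn(4, 8)
# Hand-written routing weights: all-in on expert 0, all-in on expert 1,
# an even split, and a 90/10 split.
weights = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.1]])

with torch.no_grad():
    routed = routed_composition(experts, x, weights)
    averaged = output_averaging(experts, x)
    merged = weight_merging(experts)(x)
```

Row 0 of `routed` reproduces expert 0 exactly, and row 2 matches `averaged` because an even split is the same fixed blend. `merged` generally matches neither, because averaging weights and averaging outputs stop being the same operation as soon as there is a nonlinearity between the layers.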

Why it matters

This is the chapter that ties the production patterns most readers see in deployed systems — Mixtral-style routed experts, MoE-style mixtures, adapter routing — back to the messier experimental literature on post-hoc cooperative training. Classic MoE papers like Switch, GLaM, and Mixtral learn experts and routing together end-to-end. That is one world. The world this chapter sits in is stricter: can specialists that were trained independently, by different teams, on different data, at different times, still compose into something better than any one of them alone? That distinction is the entire point.

The payoff, when it works, is modularity you can actually test. You can add a specialist without retraining the system. You can isolate failures by domain. You can ask whether composition beats any single expert on mixed data. But there is a hard constraint hiding underneath. Composition depends on shared initialization. Honestly, I think this is the single most under-discussed constraint in the entire modular-AI conversation. People will compare seven router architectures and skip the question of whether the experts they are routing to share a basis at all. The router design barely matters if the inputs to it do not live in the same space.

Project 31 (Layer Freezing and Transfer) asked where specialization starts when you freeze layers. This chapter asks a sharper question: once specialists exist, can they combine into something better than any one of them alone? And then the question that matters more: what exact condition makes that combination possible? The answer is not "have a good router." The answer is "have a shared foundation first."

Modularity is not a property of the diagram on the slide. It is a property of the geometry inside the models. Modular composition is plausible only when the experts inherit a shared coordinate system. The router cannot manufacture that compatibility from outside.

The build

Do not think of this as "train three models and add a classifier." That framing is too loose. The point is to keep the interfaces controlled enough that success or failure actually means something. Build it in stages.

Step 1 — One pretrained base, one tokenizer, one architecture

All three specialists must begin from the same base model. That is non-negotiable for the main experiment. One checkpoint, one tokenizer, one architecture, one hidden size, one positional encoding scheme, one layer-norm style, one vocabulary. If even one of those differs, you are no longer testing composition of specialists; you are testing a mess of representation mismatch, token mismatch, or architecture mismatch. The clean setup is same base checkpoint, same frozen early layers, same later-layer shapes, different domain data. Only one thing changes: what each specialist trains on. That keeps the experiment honest.

One of my early KALAVAI cooperative-fusion sweeps had to be thrown away because two contributors had silently used slightly different tokenizer vocabularies. The architectures matched. The hidden sizes matched. The token ID for the same surface form did not. Routing was incoherent and I spent half a day blaming the router code before checking the tokenizer hash. Pin every part of the interface, including the parts you assume cannot drift.
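One cheap guard is to fingerprint the interface before any fusion run. The sketch below is an assumption about how you might do that, not the check used in those sweeps: `interface_fingerprint` and `assert_same_interface` are hypothetical helpers, and the vocabulary and config dictionaries are made-up placeholders.

```python
import hashlib
import json

def interface_fingerprint(vocab: dict[str, int], config: dict) -> str:
    # Serialize deterministically so equal interfaces always hash the same.
    payload = json.dumps(
        {"vocab": sorted(vocab.items()), "config": config}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def assert_same_interface(fingerprints: dict[str, str]) -> None:
    # fingerprints: contributor name -> fingerprint string
    if len(set(fingerprints.values())) != 1:
        raise ValueError(f"interface mismatch across contributors: {fingerprints}")

# Made-up example: same surface forms, same sizes, different token IDs.
config = {"d_model": 64, "n_layers": 12, "vocab_size": 3}
vocab_a = {"def": 0, "patient": 1, "plaintiff": 2}
vocab_b = {"def": 0, "plaintiff": 1, "patient": 2}

fp_a = interface_fingerprint(vocab_a, config)
fp_b = interface_fingerprint(vocab_b, config)
```

Here `fp_a` and `fp_b` differ even though every shape matches, which is exactly the drift that is invisible from the architecture alone. Calling `assert_same_interface({"a": fp_a, "b": fp_b})` raises before any router training starts.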

Step 2 — Freeze the trunk, specialize the crown

Early layers in a transformer tend to track broad language structure. Later layers can bend harder toward a task or domain. Freeze the early layers because you want all specialists to keep the same foundation. Leave the later layers trainable because that is where domain adaptation happens. In a small GPT-like model that might mean freezing the first 6 of 12 layers. The exact split is an experiment, but do not make it arbitrary. If you freeze too little, the specialists drift apart and the shared interface weakens. If you freeze too much, the specialists may not adapt enough to their domains.

shared base checkpoint
        |
   copy three times
   /      |       \
code   medical   legal
train    train    train
later    later    later
layers   layers   layers
only     only     only

The clean mental model is not "three independent full models." It is "one shared trunk, three separately trained crowns." That distinction matters because the router will read from the shared trunk.
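Freezing the trunk comes down to switching off gradients for the early parameters. This is a minimal sketch under stated assumptions: `TinyGPT` is a hypothetical stand-in whose blocks are plain linear layers, the embedding is treated as part of the trunk, and the 6-of-12 split and the learning rate are illustrative values only.

```python
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    # Hypothetical stand-in: embedding, a stack of blocks, and an output head.
    def __init__(self, vocab: int = 100, d_model: int = 32, n_layers: int = 12) -> None:
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab)

def freeze_trunk(model: TinyGPT, n_frozen: int) -> None:
    # Freeze the embedding and the first n_frozen blocks.
    # Later blocks and the head stay trainable.
    for p in model.embed.parameters():
        p.requires_grad = False
    for block in model.blocks[:n_frozen]:
        for p in block.parameters():
            p.requires_grad = False

model = TinyGPT()
freeze_trunk(model, n_frozen=6)

# Hand only the trainable parameters to the optimizer.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```

Run `freeze_trunk` on each of the three copies before specialization so that all of them keep identical trunk weights.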

Step 3 — A linear router over the shared hidden state

The router should not look at the final output logits of each specialist and pick one after the fact. That would work, but it hides the interface question. The point is to route based on the shared representation, so the router should read the frozen layers' output: the last hidden state before the specialists diverge. If the code, medical, and legal specialists all inherit the same frozen prefix, then the frozen prefix output means roughly the same thing across all of them. That makes it a good control point for routing. Think of it as the last common hallway before the building splits into three wings. If you stand in the hallway, you can still make a routing decision using a shared map.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearRouter(nn.Module):
    """One linear layer plus softmax over experts."""

    def __init__(self, d_model: int, n_experts: int = 3) -> None:
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts)

    def forward(self, shared_h: torch.Tensor) -> torch.Tensor:
        # shared_h: [B, T, d_model], output of the frozen shared layers
        # Use the last token as the routing summary. With right-padded
        # batches, index the last non-pad token instead.
        x = shared_h[:, -1, :]                  # [B, d_model]
        return F.softmax(self.proj(x), dim=-1)  # [B, n_experts]

A linear router means the routing model is simple: one matrix multiply plus softmax. If a simple router works, the interface is genuinely usable. If you need a huge router to paper over representation mismatch, that is a tell. The router is not the place to absorb interface failures.

Step 4 — Soft fusion of the specialists' logits

Once you have routing weights, you have to decide what to do with them. There are two straightforward options. Hard routing picks the specialist with the highest routing score and uses only that specialist's output. Soft routing takes all three specialists' output logits and computes a weighted average using the router's weights. For this chapter, soft routing is cleaner because it gives you a continuous signal and makes the oracle-routing analysis later much easier.

import torch

def fuse_logits(
    router_weights: torch.Tensor, logits_list: list[torch.Tensor]
) -> torch.Tensor:
    # router_weights: [B, n_experts], one row of mixing weights per example
    # logits_list: n_experts tensors of shape [B, T, V]
    fused = torch.zeros_like(logits_list[0])
    for i, logits in enumerate(logits_list):
        w = router_weights[:, i].view(-1, 1, 1)  # broadcast over T and V
        fused = fused + w * logits
    return fused

This is worth pausing on. You are not blending the specialists' weights. You are blending their predictions, per example. Weight merging is a tempting baseline that often loses because the same parameter tensor can mean slightly different things after domain specialization. Prediction fusion is less brittle. The code above looks almost insultingly simple. Good. If composition requires two thousand lines of machinery before it shows any sign of life, it is harder to learn what is doing the work. The interesting difficulty in this project is not code complexity. It is representation compatibility.
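For comparison, the hard-routing option can be sketched in a few lines. `hard_route` is a hypothetical helper name, and the sketch assumes the same shapes as `fuse_logits`: routing weights of shape `[B, n_experts]` and one `[B, T, V]` logits tensor per specialist.

```python
import torch

def hard_route(
    router_weights: torch.Tensor, logits_list: list[torch.Tensor]
) -> torch.Tensor:
    # router_weights: [B, n_experts]; logits_list: n_experts tensors of [B, T, V]
    stacked = torch.stack(logits_list, dim=1)  # [B, n_experts, T, V]
    choice = router_weights.argmax(dim=-1)     # [B], winning expert per example
    batch_idx = torch.arange(stacked.size(0), device=stacked.device)
    return stacked[batch_idx, choice]          # [B, T, V]

# Toy check: expert i emits logits filled with the constant i.
w = torch.tensor([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
outs = [torch.full((2, 4, 5), float(i)) for i in range(3)]
picked = hard_route(w, outs)
```

The first example comes back filled with 0.0 and the second with 2.0, so each row is exactly one specialist's output. This version still runs every specialist and then selects. A deployed hard router would normally call only the chosen specialist, which is where the compute saving comes from.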

Step 5 — Performance-based router training

The router needs supervision. There are two ways to train it. With domain labels, you treat the router as a three-way classifier over example tags. Easy, but a little crude — it teaches the router to recognize domain, not necessarily performance. With performance-based targets, you evaluate all three specialists on each example, build soft targets from their losses, and train the router to predict those targets from the shared hidden state. This is better for the scientific question because it says: route to the specialist that really performs best on this input, not just the one that matches the human-assigned bucket.

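temp = 1.0  # target temperature; an illustrative value, tune it per run

with torch.no_grad():
    # loss_code, loss_med, loss_legal: per-example losses, each of shape [B]
    losses = torch.stack([loss_code, loss_med, loss_legal], dim=-1)  # [B, 3]
    targets = F.softmax(-losses / temp, dim=-1)  # soft targets, rows sum to 1

weights = router(shared_h)  # [B, 3], already softmax-normalized
# Cross-entropy against soft targets; the clamp guards against log(0).
router_loss = -(targets * weights.clamp_min(1e-9).log()).sum(dim=-1).mean()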

We negate the losses because lower loss means better performance, but softmax expects higher score to mean better. Temperature controls how sharp the targets are. Low temperature concentrates the target on the single best specialist; high temperature lets the router express uncertainty across multiple specialists.
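A quick worked example shows how much the temperature changes the targets. The three loss values are made up for illustration, and the two temperatures are picked only to show the extremes.

```python
import torch
import torch.nn.functional as F

# One example; made-up losses for the code, medical, and legal specialists.
losses = torch.tensor([[2.0, 2.5, 4.0]])

sharp = F.softmax(-losses / 0.1, dim=-1)  # low temperature
soft = F.softmax(-losses / 5.0, dim=-1)   # high temperature

# sharp is roughly [0.99, 0.01, 0.00]: nearly one-hot on the best specialist.
# soft is roughly [0.39, 0.35, 0.26]: the ranking survives, the margin is small.
```

The ordering is identical in both cases. Only the confidence the router is asked to imitate changes.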

Step 6 — Evaluation harness against the right baselines

The single most common failure of "we built a fused system" papers is bad baselines. If you skip them, the result means almost nothing. You need at least these six rows in your evaluation:

  1. Base model — the original pretrained checkpoint before specialization.
  2. Code specialist alone.
  3. Medical specialist alone.
  4. Legal specialist alone.
  5. Weight-averaged model — average the trainable layer weights of the three specialists and evaluate the result.
  6. Fused routed system — the router plus soft combination of specialist outputs.

Weight averaging belongs in the list because it is the naive "maybe these can just merge" idea that sounds reasonable to people who have not yet felt how fragile parameter alignment is. The base model belongs in the list because specialization is not automatically a win on mixed data — domain specialists can lose outside their area, and the fused system should beat the base on mixed held-out evaluation if composition is doing real work.
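A harness for these rows can stay small. The sketch below is one possible shape for it, not the book's harness: `evaluate` and `next_token_loss` are hypothetical names, each system is assumed to be a callable from token IDs `[B, T]` to logits `[B, T, V]`, and the only system wired in is a dummy that emits uniform logits so the expected loss is known in advance.

```python
import math
from typing import Callable

import torch
import torch.nn.functional as F

LogitFn = Callable[[torch.Tensor], torch.Tensor]  # tokens [B, T] -> logits [B, T, V]

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> float:
    # Predict token t+1 from position t; mean cross-entropy over the batch.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target).item()

def evaluate(
    systems: dict[str, LogitFn], eval_sets: dict[str, torch.Tensor]
) -> dict[str, dict[str, float]]:
    # One row per system, one column per evaluation set.
    table = {}
    with torch.no_grad():
        for name, fn in systems.items():
            table[name] = {
                dom: next_token_loss(fn(toks), toks) for dom, toks in eval_sets.items()
            }
    return table

V = 10
torch.manual_seed(0)
eval_sets = {
    "code": torch.randint(0, V, (2, 6)),
    "mixed": torch.randint(0, V, (4, 6)),
}
# Dummy system: uniform logits, so its loss must equal log(V) on any data.
systems = {"uniform": lambda toks: torch.zeros(*toks.shape, V)}
table = evaluate(systems, eval_sets)
expected_uniform_loss = math.log(V)
```

In a real run, `systems` would hold the six rows from the list above and `eval_sets` would hold one held-out set per domain plus the mixed set. Checking a dummy row against a known value first is a cheap way to confirm the harness itself is not the bug.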

Step 7 — The oracle routing gap

The most diagnostic number in this whole project is the oracle routing gap. The oracle is a fake perfect router: for each evaluation example, you run all three specialists, pick whichever one would have given the best loss, and pretend you had routed there. That gives you the best loss any hard router, one that sends each example to exactly one specialist, could achieve with these specific specialists. It is not a deployable system, because at inference you do not get to peek at the loss before deciding, but it is the cleanest available ceiling for routing decisions. A soft blend of logits can occasionally beat the best single specialist on an individual example, so read the oracle as a ceiling on routing, not on fusion as a whole.

Compare the learned router's performance against the oracle. The difference is the oracle routing gap. If the gap is small, your router is already extracting most of the available benefit. If the gap is large, the specialists are useful but the router is weak. If the oracle itself barely beats the base model, the specialists are the problem and no router will fix that. This matters because it separates two failure modes, bad specialists versus bad routing, and without the oracle gap you cannot tell which one you have. Headline mixed-loss numbers tell you something improved. The oracle gap tells you whether the improvement came from the router doing more of what was already possible, or from the experts becoming more separable. Those are different mechanisms and the oracle gap is the cheapest way to distinguish them.
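Computing the gap takes only per-example losses. This is a minimal sketch with made-up numbers. `oracle_routing_gap` is a hypothetical helper, and it assumes you have already collected each specialist's per-example loss and the fused system's per-example loss on the same evaluation set.

```python
import torch

def oracle_routing_gap(
    expert_losses: torch.Tensor, fused_losses: torch.Tensor
) -> dict[str, float]:
    # expert_losses: [N, n_experts], per-example loss of each specialist
    # fused_losses:  [N], per-example loss of the learned routed system
    oracle = expert_losses.min(dim=-1).values.mean().item()  # best expert per example
    best_single = expert_losses.mean(dim=0).min().item()     # best expert overall
    learned = fused_losses.mean().item()
    return {
        "oracle": oracle,
        "learned": learned,
        "best_single": best_single,
        "gap": learned - oracle,
    }

# Made-up losses: three examples with a clear winner each, one with a tie.
expert_losses = torch.tensor([
    [1.0, 3.0, 3.0],
    [3.0, 1.0, 3.0],
    [3.0, 3.0, 1.0],
    [2.0, 2.0, 2.0],
])
fused_losses = torch.tensor([1.2, 1.4, 1.1, 2.1])
result = oracle_routing_gap(expert_losses, fused_losses)
# oracle = 1.25, learned = 1.45, best_single = 2.25, gap = 0.20
```

In this toy case the learned system sits much closer to the oracle than to the best single specialist, which is the pattern you want. A learned loss near `best_single` with a low oracle would point at the router instead.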

Step 8 — Report a table, not a single number

"Validation loss improved" is not a result. You want a table that shows all of the following:

  1. Each specialist wins on its own domain.
  2. Each specialist loses on at least one other domain.
  3. The base is decent everywhere but best nowhere.
  4. Weight averaging is mediocre or actively bad.
  5. The routed fusion beats the base on mixed data.
  6. The routed fusion beats each single specialist on mixed data.
  7. The oracle is a bit better than the learned router, but not by much.

That last point is the litmus test. If the oracle is wildly better than the learned router, your router is leaving value on the table. If the learned router nearly matches the oracle, you have evidence that composition is working and the interface is usable.

BREAK IT

The main experiment works because the specialists share a frozen foundation. There are two ways the system silently degrades, and you should produce both with your own hands before trusting fusion in any system that matters.

Break 1

Different pretrained bases — the silent degradation

Instead of copying one base checkpoint three times, take pretrained checkpoint X and fine-tune a code specialist from X. Take pretrained checkpoint Y and fine-tune a medical specialist from Y. Take pretrained checkpoint Z (or Y again) and fine-tune a legal specialist from there. Same architecture if you want. Same tokenizer if you want. Same domain datasets as the main experiment if you want. But not the same pretrained initialization. Now try to fuse them.

The first problem hits before you train anything: which frozen prefix should the router even read? If the specialists do not share a frozen prefix, there is no single common intermediate state to route from. You have to cheat somehow — pick one model's prefix as the router input, concatenate states across models, route from raw token embeddings, or fall back to cheap metadata like a domain classifier. None of these restores the missing shared coordinate system. They just make the mismatch less obvious.

The first cooperative-fusion experiment in the KALAVAI line that actually tested this hypothesis used two contributors who had each fine-tuned a 1.3B model with strong domain numbers. On their own benchmarks, both looked great. Composed together with a shared router, the combined system underperformed either one alone on mixed data. I tried three different router redesigns before accepting that the bottleneck was upstream of the router entirely.

The easiest way to picture the failure: imagine two teams independently build GPS systems. Both output latitude and longitude. But one secretly measures angles relative to a different meridian and flips one axis. A router sitting above them sees numbers with the same shape and the same data type. It assumes the outputs live in the same world. They do not. That is what happens with hidden states from independently pretrained models. A hidden vector is meaningful because of the learned geometry around it, not because of its shape. Dimension 417 is not "the law dimension." Dimension 417 only means something in relation to the rest of the model that learned it.
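The GPS picture can be reproduced numerically. The sketch below is a toy, not a model of real pretrained networks: hidden states in "model A" cluster around one axis per expert, and "model B" is assumed to carry the same information with its coordinates cyclically shifted by one position, which is an orthogonal change of basis. The router weights are set by hand to be perfect in A's frame.

```python
import torch

torch.manual_seed(0)
d_model, n_experts, per_class = 16, 3, 50

# Model A's frame: inputs for expert i cluster around 3 * e_i.
prototypes = 3.0 * torch.eye(n_experts, d_model)
labels = torch.arange(n_experts).repeat_interleave(per_class)  # [150]
h_a = prototypes[labels] + 0.1 * torch.randn(len(labels), d_model)

# A linear router that is perfect in A's frame: score = dot product with prototype.
router_w = prototypes.clone()                                  # [n_experts, d_model]
acc_a = ((h_a @ router_w.T).argmax(dim=-1) == labels).float().mean().item()

# Model B's frame: same vectors, coordinates shifted by one (a permutation matrix).
q = torch.roll(torch.eye(d_model), shifts=1, dims=1)
h_b = h_a @ q
acc_b = ((h_b @ router_w.T).argmax(dim=-1) == labels).float().mean().item()
```

`h_a` and `h_b` have the same shape, the same dtype, and the same norms, so nothing about the tensors themselves signals a problem. The router is perfect on `h_a` and at or below chance on `h_b`. A router trained only on `h_b` would be fine. The failure is asking one router to read two frames as if they were one.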

The fused system becomes worse than the clean shared-initialization version, often worse than the best specialist, and sometimes worse than almost every specialist on mixed data. Routing confidence becomes unstable; the router overcommits to one specialist on many unrelated inputs; per-domain routing accuracy drops sharply; learned routing falls far behind oracle routing; soft fusion of logits becomes noisy because the specialists disagree in ways that do not combine cleanly; weight averaging becomes useless or catastrophic. The oracle routing gap usually grows. The specialists are still individually good. The oracle can still pick whichever one happens to work best. The learned router has a much harder job, because its input no longer lives in a shared representational space tied cleanly to all experts.

This is what "silent degradation" means: every individual specialist looks fine, the router runs without raising any alarms, the fused output is a confident logit distribution that looks plausible, and the only thing that says the system is broken is the mixed-data loss compared to a strong baseline. There is no exception. There is no crash. Just an average of two mediocre answers where there should have been best-of-both.

Break 2

The router that defaults to one specialist

A second failure mode looks like the first but happens for a different reason. Even with shared initialization, you can train a router that collapses onto one specialist for most inputs and still looks like it is working. In practice it is returning the dominant expert's predictions with a small constant tail from the other two. Two situations cause this: domain imbalance in the routing training set, or a routing target temperature so low that soft targets collapse to nearly one-hot.

Diagnose it with per-domain routing accuracy and routing-weight histograms. If the router gives the code specialist 0.92 weight on legal examples, the headline mixed loss can still look reasonable — because the code specialist happens to be a strong generalist — but the system is not really composing anything. It is a code model with a noisy second voice.
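A per-domain routing table makes the collapse visible at a glance. This is a minimal sketch with hypothetical helper names. It assumes integer domain labels whose index lines up with the expert index (0 = code, 1 = medical, 2 = legal) and at least one example per domain.

```python
import torch

def routing_by_domain(
    router_weights: torch.Tensor, domains: torch.Tensor, n_experts: int
) -> torch.Tensor:
    # router_weights: [N, n_experts]; domains: [N] integer domain id per example.
    # Row d is the mean routing weight vector over examples from domain d.
    rows = [router_weights[domains == d].mean(dim=0) for d in range(n_experts)]
    return torch.stack(rows)  # [n_domains, n_experts]

def routing_accuracy(router_weights: torch.Tensor, domains: torch.Tensor) -> float:
    # Fraction of examples whose top-weighted expert matches the domain label.
    return (router_weights.argmax(dim=-1) == domains).float().mean().item()

# Made-up collapsed router: the same weights for every input.
collapsed = torch.tensor([[0.92, 0.05, 0.03]]).repeat(6, 1)
domains = torch.tensor([0, 0, 1, 1, 2, 2])

matrix = routing_by_domain(collapsed, domains, n_experts=3)
acc = routing_accuracy(collapsed, domains)
```

A healthy router produces a matrix with a heavy diagonal. Here every row is identical and the accuracy is one in three, which is what a router that ignores its input looks like.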

There is also a deeper failure worth naming. Even if the router picks the right expert more often than chance, the fused predictions can still underperform because the outputs are not calibrated the same way. Two specialists can both be right on different examples while producing logits at different scales or with different uncertainty habits. Shared initialization helps keep those habits more compatible; independent pretraining lets them drift. So the failure is more than misclassification by the router. The whole interface between modules has gone soft.
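One way to spot mismatched "uncertainty habits" is to compare the average entropy of each specialist's predictive distribution on the same inputs. The sketch below uses random logits as stand-ins, with one stand-in deliberately scaled by 4 to mimic an overconfident specialist. `mean_entropy` is a hypothetical helper, and entropy is only one of several calibration signals you could use.

```python
import math

import torch
import torch.nn.functional as F

def mean_entropy(logits: torch.Tensor) -> float:
    # logits: [..., V]; average entropy in nats of the softmax distribution.
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean().item()

torch.manual_seed(0)
V = 50
logits_a = torch.randn(8, V)        # stand-in specialist with modest logit scale
logits_b = 4.0 * torch.randn(8, V)  # stand-in specialist with 4x the logit scale

entropy_a = mean_entropy(logits_a)
entropy_b = mean_entropy(logits_b)
max_entropy = math.log(V)           # entropy of a uniform distribution over V tokens

# An even blend of logits: the larger-scale specialist contributes most of the
# variation, so it tends to dominate the fused prediction despite equal weights.
fused = 0.5 * logits_a + 0.5 * logits_b
```

If two specialists show very different entropies on the same mixed inputs, equal routing weights do not mean equal influence. That is a sign to look at logit scale before touching the router.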

What both breaks prove

Composition is not only about expertise. It is about compatible internal structure. People often hear "modular AI systems" and imagine Lego bricks. But models are not Lego bricks. They are more like organs. Transplantation requires matching blood type, tissue compatibility, and connection points. Shape alone is not enough. Shared initialization acts like that compatibility layer. It gives the specialists a common ancestry so their internal features may specialize but still inherit a shared geometry. Remove shared initialization, and the router loses the one thing that made late composition plausible in the first place. That is why this failure matters more than the success case. Success says modularity is possible. Failure says modularity has a price of entry. That price is a shared foundation.

Questions to answer

  1. On the mixed evaluation set, which baseline does the fused model beat, and which baseline still beats it? If weight averaging beats the routed fusion, what does that tell you about the router rather than the specialists?
  2. How large is the oracle routing gap? Does that gap tell you the bottleneck is the router or the specialists — and what intervention would you try for each diagnosis?
  3. When you break shared initialization, what fails first: routing accuracy, fused loss, calibration of output scores, or weight averaging? Which of those would you have caught if you only looked at the headline mixed loss?
  4. Does freezing more early layers make fusion easier because the shared interface is stronger, or harder because specialists cannot adapt enough? Where is the boundary in your run?
  5. When the router makes a bad choice, is it choosing the wrong domain entirely, or is it confused between two specialists that both partially fit? Those failure modes need different fixes.

Go further

What you now know

At this point the magic story is gone. A fused system can beat each specialist alone because the specialists split the domain, the router reads a shared hidden interface, and the final prediction mixes strengths case by case. The failure mode should feel sharper now too. When specialists come from different pretrained origins, the trouble is not "router too small." The trouble is geometric. Their competence may survive. Their composability does not. You also have a clean way to test that claim: train specialists from one frozen base, route from the shared prefix representation, compare fusion against strong baselines, compute the oracle routing gap, then break shared initialization and watch composition silently fail. "Modularity sounds nice" is no longer the bar. You can say what modularity requires.

Chapter 18

Mixture of Experts

The joint-training cousin to this chapter — experts and routing learned end-to-end, the way Mixtral does it.

Chapter 23

Reward Models and RLHF

Where router-style composition meets preference learning. Reward heads, KL constraints, and how RLHF avoids its own version of silent degradation.
