Four steps. Zero communication during training.
The entire KALAVAI protocol fits in a paragraph. A coordinator distributes a shared base checkpoint. Each contributor fine-tunes their copy on their own domain — independently, asynchronously, on whatever hardware they have. Nobody shares data, gradients, or activations. When everyone is done, they submit their checkpoints. A lightweight router (a single linear layer, trained for 500 steps on mixed data) learns which expert is best for which token. At inference, all specialists run in parallel and the router combines their outputs.
That's it. The mechanism is the protocol, not the infrastructure. Standard PyTorch. Standard HuggingFace. No custom CUDA kernels, no distributed training framework, no LoRA, no adapters.
# Everyone starts from the same model
base = load("pythia-410m", revision="step10000")
# Each person trains on their domain (independently, no communication)
specialist_code = train(copy(base), code_data, steps=2000)
specialist_science = train(copy(base), science_data, steps=2000)
specialist_fiction = train(copy(base), fiction_data, steps=2000)
# A router learns who's good at what (1000 steps, one linear layer)
router = nn.Linear(hidden_size, 3, bias=False)
fused = MoE(specialists=[code, science, fiction], router=router)
train_router(fused, mixed_data, steps=500)
# Result: fused model outperforms every individual specialist
# +7.70% over best specialist (corrected per-domain eval)
Core Results
Consistent gains at every scale tested.
The fused model beats the best individual specialist at every tested scale. All results use the corrected per-domain equal-weight evaluation protocol — each domain evaluated separately at batch size 4, then averaged. The 6.9B result uses the step-budget sweep (2,000 steps, k=4 frozen layers), which recovers the full improvement seen at smaller scales.
Beats equal-compute monolithic training
The natural objection: just train one model on all the data for the same total compute. We tested this directly. A single model fine-tuned on mixed data for 6,000 steps (equal to 3 specialists × 2,000 steps) achieves +6.7% over base on mixed held-out loss. Under corrected per-domain evaluation, the MoE beats the monolithic by +0.47% on equal-weight aggregate and wins on every domain individually. The mechanism is not primarily about training efficiency — it's about cooperative specialisation. Note: the table below uses the original mixed-batch evaluation protocol (absolute losses are from that schema); the corrected evaluation yields lower absolute improvements but the ranking is unchanged.
| Method | Loss (orig. eval) | vs. Base | vs. Monolithic |
|---|---|---|---|
| Base model | 2.248 | — | — |
| Monolithic (6,000 steps mixed) | 2.098 | +6.7% | — |
| Best specialist (code) | 2.089 | +7.1% | +0.4% |
| Weight averaging | 2.158 | +4.0% | — |
| Wider model (3.5× params) | 2.120 | +5.9% | — |
| KALAVAI MoE | 1.793 | +20.2% | +14.5% (orig. eval) |
Corrected per-domain equal-weight eval (seed 42): Base EW 2.320 · Monolithic EW 2.229 (+16.0% vs. base) · Weight avg EW 2.230 · KALAVAI MoE EW 2.218 (+0.47% over monolithic, +7.70% over best specialist). The ranking is unchanged; absolute improvements differ between evaluation protocols.
1B results: replication holds
The 1B scale replicates the 410M result. Under the corrected per-domain equal-weight evaluation, the improvement is +7.49% over the best specialist (seed 42). The monolithic baseline at 1B also confirms: KALAVAI beats equal-compute monolithic training on per-domain evaluation.
6.9B: step-budget sweep recovers the improvement
Initial 6.9B results with 500 training steps showed reduced gains. The step-budget sweep (B1) confirmed the hypothesis: at 2,000 steps with k=4 frozen layers, the 6.9B improvement is +6.53% ± 0.024% (3 seeds, corrected equal-weight eval). The mechanism holds at 6.9B. Freeze depth is largely insensitive at 6.9B (<0.1pp difference across k=0 to k=4).
The Predictive Model
Before you train a single specialist, you can predict whether the cooperative is worth it.
The headline result of the paper is not that fusion works — it's that you can predict how well it will work from a single measurement taken before any training. Across all experimental conditions, fusion gain scales linearly with the mean divergence of the specialists from the base model:
gain ≈ 0.82 × divergence − 2.72 (R² = 0.857, n = 6)
Measure how much your specialists diverge from the base model. The formula tells you what gain to expect before committing to the cooperative.
Divergence is the mean per-token KL divergence of each specialist from the base model, averaged across all specialists. It is computable in minutes on a small validation set after training — no full evaluation required. If divergence is 15%, expect roughly +10% gain. If divergence is 25% (cross-lingual), expect roughly +18% — and likely more, since high-divergence settings exceed the linear prediction.
| Condition | Mean Divergence | Gain vs. Best Spec. | Conv. Rate | Predicted | Residual |
|---|---|---|---|---|---|
| Qwen-1.5B | 3.16% | +1.06% | 0.34× | ≈0% | — |
| Pythia-6.9B | 8.29% | +6.53% | 0.79× | +4.17% | +2.36pp |
| Pythia-1B | 15.28% | +7.49% | 0.49× | +9.81% | −2.32pp |
| Pythia-410M | 15.65% | +7.72% | 0.49× | +10.11% | −2.39pp |
| Exp 2: Private-domain | 18.52% | +10.17% | 0.55× | +12.43% | −2.26pp |
| Exp 1: Cross-lingual | 25.65% | +21.76% | 0.85× | +18.18% | +3.58pp |
| Exp 3: 20-contributor (OOS) | 15.71% | +16.79% | 1.07× | +10.16% | +6.63pp |
The four English-domain conditions cluster within ±2.4pp of the line. The cross-lingual condition exceeds the prediction because the base model is near-random on Yoruba and Welsh — when the base achieves near-random perplexity, specialists correct from a high-loss floor and the router routes with near-certainty. The 20-contributor out-of-sample point (+6.63pp residual) follows the same pattern: its heterogeneous mix of high-divergence language specialists pulls the cooperative above what English-domain regression would predict.
The formula also sets a divergence floor: below approximately 3.3% mean divergence, the predicted gain approaches zero. A cooperative of specialists that are too similar to the base model is unlikely to produce positive returns.
Phase 2 — High-Divergence Domains
Cross-lingual cooperation: Yoruba perplexity 41.9 → 7.7.
Phase 2 tests the predictive model in the high-divergence regime — settings where mean specialist divergence exceeds 18%, far beyond the English-domain Phase 1 conditions. Three experiments, all returning GO verdicts.
What this means for endangered languages. Yoruba base perplexity: 41.9 → 7.7 after fusion (5.4× reduction). Welsh: 102.7 → 22.1 (4.6× reduction). Each language contributor spent ~$5–10 in electricity training on whatever digitised text they had. The cooperative built a model none of them could build alone — not hypothetically, but measured.
The 20-contributor result is out-of-sample for the regression (the formula was fit on 6 conditions). It lands +6.63pp above the linear prediction — consistent with the cross-lingual pattern: a cooperative with high-divergence language specialists converts more efficiently than the English-domain regression predicts. Medical and chemistry specialists share routing (60/40 split), reflecting genuine semantic overlap rather than routing failure. Two domains (dialogue, instructions) show degradation due to data scarcity (<300 training chunks), not a mechanism failure — the router routes correctly (97–89% to own specialist), but the specialists themselves are undertrained.
Three Governing Conditions
When fusion works — and when it doesn't
The paper's core contribution isn't "MoE is good." It's an empirical characterisation of the conditions under which post-hoc fusion of independently trained specialists succeeds or fails. We identify three.
1. Shared initialisation is necessary
Specialists must start from the same checkpoint. The shared starting point preserves representational compatibility — specialists diverge in what they learn, but the geometry of their representations stays aligned enough for a router to combine them. Initialising from different checkpoints breaks this: their representational spaces are no longer aligned, and the router cannot learn coherent dispatch.
Practical implication: The cooperative coordinator must distribute a single canonical checkpoint. All specialists must start from exactly the same revision — same weights, same tokenizer, same architecture. This is the one non-negotiable constraint of the protocol.
2. Frozen layers become necessary beyond ~5,000 steps
At short training horizons (≤2,000 steps), freezing layers is optional — more plastic representations allow specialists to diverge more freely. freeze=0 peaks at 2,000 steps (+8.12%). But beyond approximately 5,000 steps, unfrozen specialists over-specialise and become harder to fuse. Freezing the first K layers provides a structural anchor that preserves routing compatibility. The crossover from "freezing hurts" to "freezing helps" occurs at ~5,000 steps.
| Steps | Freeze=0 | Freeze=4 | Winner |
|---|---|---|---|
| 500 | +5.88% | +5.31% | Freeze=0 |
| 1,000 | +5.94% | +6.48% | Freeze=4 (marginal) |
| 2,000 | +8.12% | +7.56% | Freeze=0 ← freeze=0 peak |
| 5,000 | +7.79% | +8.07% | Freeze=4 ← crossover |
| 10,000 | +5.83% | +7.33% | Freeze=4 |
| 20,000 | +3.38% | +6.30% | Freeze=4 |
3. All specialists must run at inference
This is the paper's most surprising result. A domain classifier with 99.3% accuracy, routing each input to a single specialist, produces −21.1% degradation relative to base. The MoE running all three specialists and combining outputs produces +14.1% improvement. Same routing accuracy — opposite results. The 35 percentage point gap is the difference between a system that works and one that's worse than doing nothing.
Why does single-expert dispatch fail? Each specialist forgets what it wasn't trained on. The code specialist's loss on science data is worse than the base model. When only one specialist runs, out-of-domain tokens have no fallback. Joint inference restores coverage by letting the router suppress out-of-domain specialists token by token.
The mechanism in one image
The cross-domain evaluation matrix shows why fusion works. Each specialist is best on its own domain (the diagonal) and worst on the others. The MoE router dispatches each token to the right diagonal entry, recovering all specialist gains simultaneously.
Ablations
What doesn't matter (and what does)
Router architecture doesn't matter
Router ablation numbers use the original mixed-batch evaluation protocol (as noted in the paper). A uniform router (fixed 1/N weights, no training) achieves +6.7%. A trained linear router achieves +14.2%. A 2-layer MLP achieves +14.2%. The relative ordering holds under corrected evaluation: learned routing > uniform, linear ≈ MLP. The minimum bar is learnable suppression, not the function class.
Freeze depth sweep
We swept freeze depth from 0 (no frozen layers) to 12 (half the model). At 2,000 steps, freeze=0 wins. At 5,000+ steps, freeze=4 and freeze=8 both outperform freeze=0. Practical guideline: training under 5,000 steps — skip freezing. Over 5,000 steps — freeze 4–8 layers.
Specialist count scales gracefully
Three, four, and five specialists all achieve approximately +14.1% with near-zero variance. The mechanism doesn't degrade as you add more contributors. Two-specialist is slightly higher because the evaluation problem is narrower.
The mechanism survives base model maturity
Fusion improvement is consistent across Pythia checkpoints from 3.5% to 100% of pre-training at 410M (+7.03% to +8.81%) and 1B (+0.40% to +8.75%). Qwen-1.5B at full training shows a reduced gain of +1.06% ± 0.01% (corrected per-domain evaluation, 3 seeds) — near the formula's predicted floor (~3.3% divergence). Small divergence produces small gain, consistent with the predictive model.
Extra parameters don't explain the gains
Two baselines rule out "more parameters" as the explanation. A wider single model with 3.5× the parameters achieves only +5.9%. A multi-head baseline with identical parameter count to the MoE but hard single-expert routing achieves −21.1%. The gain comes from cooperative specialisation plus joint inference, not raw capacity.
The Router in Action
Token-level routing, not document classification
On hybrid-domain prompts, the router switches experts mid-sentence. The prompt "Derive the equation for protein folding using Python pandas" forces a domain switch within a single sentence: science tokens ("derive," "equation," "protein," "folding") should activate the science specialist; code tokens ("Python," "pandas") should activate the code specialist. The router discovers this structure from the training signal alone — no supervision, no domain labels, no explicit boundaries.
The pattern is robust across multiple hybrid prompt types, including narrative/science, technical/narrative, and multi-domain sentences with three or more domain switches:
Router confidence distribution
In practice, the router operates as a near-hard switch: the highest-weight expert receives over 95% of the routing weight in more than 99.7% of tokens. Crucially, hard routing (argmax dispatch) and soft routing (learned softmax weights) produce identical perplexity — confirming that the value is selection and suppression, not the specific weighting scheme.
Beyond Perplexity
Downstream benchmarks
Perplexity improvements are clear and consistent. Downstream benchmark accuracy is more modest — less than 1pp on standard tasks at 1B. This is expected at these scales: perplexity and benchmark accuracy don't reliably track below 7B parameters. The KALAVAI paper is explicit about this gap.
On the benchmark gap. Standard commonsense benchmarks (ARC, HellaSwag) are factual Q&A, while the training domains are code, science, and fiction — a partial mismatch. A cooperative aligned to benchmark-relevant domains (e.g., world knowledge, reasoning, factual recall) would present a stronger evaluation of downstream gain potential.
Training Dynamics
How specialists diverge during training
Each specialist rapidly improves on its own domain while degrading on out-of-domain content. This divergence is exactly what makes the MoE valuable — if specialists didn't degrade on other domains, the router would have nothing useful to select between. The code specialist's catastrophic forgetting of science is a feature, not a bug.
Boundary Conditions
What we don't claim
The paper is explicit about five things it does not claim.
No inference efficiency. The fused model runs all N specialists in parallel. For N=3, inference overhead is approximately 2.5×. This is a training-time democratisation that trades inference cost for training accessibility.
Limited architecture generality so far. All primary results use Pythia. Qwen-1.5B shows +1.06% at 3.16% mean divergence — near the formula's predicted floor (~3.1%). Small divergence produces small gain, consistent with the predictive model. The mechanism works; divergence magnitude governs gain magnitude.
No guaranteed downstream gains. Perplexity improvements are clear; benchmark accuracy improvements are modest (<1pp at 1B). Perplexity and downstream accuracy don't reliably track at these scales.
No real cooperative demonstrated. All experiments are simulated cooperatives on single machines. Heterogeneous hardware, asynchronous submission, and contributor reliability are open engineering problems.
No frontier-scale evidence. Results reach 6.9B parameters (+6.53% ±0.024%, 3 seeds, step-budget sweep). Evidence above 7B is absent.
Camera-Ready Roadmap
What's next if NeurIPS 2026 accepts.
The paper is complete for submission. These experiments are planned for the camera-ready version to preempt likely reviewer objections and extend the predictive model.
What This Makes Possible
Five problems cooperative training solves
The protocol's zero-communication property — contributors share only a starting checkpoint and a final trained checkpoint, never data — opens applications that are structurally impossible with synchronous training or federated learning.
A hospital network that can't share patient data
Five hospitals, each with thousands of records in different specialties — cardiology, oncology, pediatrics, radiology, neurology. Privacy laws prevent pooling. Today, each can fine-tune a model on their own data, but it only knows their specialty.
With KALAVAI, each hospital trains a specialist on their private data that never leaves their servers. They share only the trained checkpoint — not a single patient record. The fused model understands all five specialties. No data was shared. No privacy was violated.
This was impossible before. Federated learning requires gradient sharing during training, which leaks information. KALAVAI requires zero communication during training — only the final checkpoint is shared.
Endangered languages get a real language model
This is not hypothetical. Exp 1 measured it directly. A university in Nigeria trained a Yoruba specialist. A team in Wales trained a Welsh specialist. A lab in Chennai trained a Tamil specialist. A code contributor trained on Python. No single institution had enough data or compute to build a multilingual model.
The cooperative did. Yoruba perplexity: 41.9 → 7.7 (5.4× reduction). Welsh: 102.7 → 22.1 (4.6× reduction). Tamil gained a specialist trained on its own literary corpus. The fused model handles all four languages simultaneously. Cost per contributor: approximately $5–10 in electricity.
KALAVAI changes the economics from "one organisation needs all the data and all the compute" to "each community contributes what they have." The Yoruba number is measured, not projected.
A legal AI built across jurisdictions
Indian contract law, UK common law, US constitutional law, EU regulatory law, Brazilian civil law — each a specialised domain with its own corpus and reasoning patterns. No single firm has expertise across all five. Today you either build a generic legal model that's mediocre everywhere, or a narrow one that knows one jurisdiction deeply.
With KALAVAI, a firm in Mumbai trains the Indian law specialist. A firm in London trains the UK specialist. A firm in São Paulo trains the Brazilian specialist. The fused model can analyse a cross-border contract touching Indian, UK, and EU law — routing the relevant clauses to the relevant specialist. Each firm contributed domain expertise without sharing proprietary case databases with competitors.
Scientific research across fields that don't talk to each other
A climate science lab trains on atmospheric modelling. A marine biology lab trains on ocean ecosystems. A geology department trains on seismology. An economics department trains on resource economics. Individually, each model knows its field. Fused, the model can reason about questions at the intersection — "how do seismic events in the Pacific affect marine ecosystems and what are the economic implications for coastal fisheries?"
A new kind of interdisciplinary tool that emerges from collaboration without anyone needing to be interdisciplinary themselves.
A country builds its own sovereign AI without a hyperscaler
A small country — Sri Lanka, Estonia, Rwanda — wants a national language model that understands their language, laws, culture, geography, educational curriculum. They can't afford to train from scratch. They can't rely on OpenAI or Google to prioritise Sinhala or Kinyarwanda.
With KALAVAI, the country's university trains a language specialist. The ministry of justice trains a legal specialist on national law. The education department trains on school textbooks. The health ministry trains on local medical guidelines. Each institution uses the GPUs they already have.
The fused model is a national AI that no foreign company built and no foreign company controls. Digital sovereignty through cooperative intelligence. It was not feasible before because no single institution in a small country has the compute or data. KALAVAI turns that constraint from a blocker into an irrelevance.
Reproduce It
30 minutes. One GPU. The whole protocol.
git clone https://github.com/mechramc/Kalavai.git
cd Kalavai
pip install transformers datasets torch accelerate
python experiments/kalavai_pythia_experiment.py
Requires any GPU with 24GB+ VRAM (RTX 3090, 4090, 5090, A100, or equivalent). Produces trained specialists, fused MoE, all evaluation numbers, and figures. All results use the corrected per-domain equal-weight evaluation protocol.
| Script | Scale | Hardware | Time | Expected output |
|---|---|---|---|---|
kalavai_pythia_experiment.py | 410M | Any 24GB GPU | ~30 min | +7.72% ± 0.02% |
kalavai_pythia_1b_experiment.py | 1B | Any 24GB GPU | ~2 hours | +7.49% |
kalavai_pythia_6b_experiment.py | 6.9B | A100 80GB | ~8 hours | +6.53% ± 0.024% |
kalavai_private_domain_experiment.py | 410M | Any 24GB GPU | ~45 min | +10.17% ± 0.15pp |
kalavai_crosslingual_experiment.py | 410M | Any 24GB GPU | ~45 min | +21.76% ± 0.005pp |
kalavai_training_duration_crossover.py | 410M | Any 24GB GPU | ~4 hours | Crossover at ~5k steps |
kalavai_domain_classifier_baseline.py | 410M | Any 24GB GPU | ~45 min | −21.1% (classifier) |
Every experiment is a self-contained Python file. No config files, no YAML. Read the script, understand the experiment, run it. 322 automated audit checks verify every result before any paper-ready number is reported.