Chapter 27 · Full excerpt
Quantization and deployment.
Quantization converts a model's weights and activations from FP16 or FP32 down to lower-precision integers — int8, int4, sometimes lower. The model then fits in a fraction of the memory and usually runs faster, because fewer bytes need to move from RAM into the compute units. The cost is accuracy. Done well, the cost is small. Done badly, it is catastrophic — the model still produces fluent sentences but stops saying anything true.
This chapter quantizes a trained checkpoint to int8 and int4, measures the change in validation loss, benchmark score, and tokens per second, and then deliberately breaks the calibration to see what bad quantization looks like. A tiny, unrepresentative calibration set or a single outlier weight is enough to blow up the scale and turn the rest of the tensor into noise. That failure mode is the lesson.
This is Chapter 27 of Under The Hood — Build Every Layer of a Large Language Model from Scratch. The full 35-project book is on Leanpub. Code companion at github.com/mechramc/Under-the-hood.
The concept
Think of a trained model as a giant wall of tiny knobs, where each weight is one knob. Training spends a huge amount of compute nudging those knobs until the model predicts the next token well. In FP32, each knob stores its setting using 32 bits — 4 bytes. A weight might be 0.0187342 instead of just 0.02. That precision is what training paid for. The question quantization asks is: how much of that precision do you actually need at inference time?
Now imagine moving this wall of knobs into a smaller room. One option is to build a smaller wall from scratch — retraining or designing a smaller model. The other option is to keep the same wall and repaint every knob with fewer allowed positions. Instead of letting a knob land on nearly any decimal value, you force it into a small set of buckets. That is quantization. FP32 says "this knob can sit almost anywhere." INT8 says "pick one of 256 positions." INT4 says "pick one of 16." INT2 says "pick one of 4."
The trick works because models are more tolerant of roughness than people first expect. A trained model does not need every weight exact to many decimal places. Many weights can be rounded and the overall computation still points in roughly the same direction. The logits change a bit, but not enough to break the answer. Until they do. That is the heart of the chapter: quantization is controlled damage. You compress the model by forcing many precise values into fewer buckets, then measure whether the model still behaves acceptably for your job.
What a quantized weight actually is
A block of FP32 weights contains real numbers like -0.91, 0.03, 1.27, -0.44. To store them as INT8, you choose a mapping from real values to integers. For example, with a scale of 0.01, -1.28 maps to -128, 0.0 maps to 0, and 1.27 maps to 127. Each original number gets rounded to the nearest integer bucket. Later, during inference, you approximately reconstruct the original value by multiplying the integer by the same scale. So quantization has two parts: compress the number into a low-bit code, and decompress it just enough at runtime to do the math. The low-bit value is not the real weight. It is a compact code plus a rule for turning that code back into an approximate weight.
The math, earned
The common formula for linear quantization is two lines. Quantize: q = round(x / s). Dequantize: x_hat = q · s. Here x is the original FP32 weight, q is the integer bucket index, s is the scale, and x_hat is the reconstructed weight used during inference. Divide the real weight by the scale, round to the nearest integer bucket (clamping to the representable range, such as [-127, 127] for symmetric int8), then multiply by the scale to get an approximate weight back. The error is x − x_hat, and for any value inside the representable range it is at most s / 2. If buckets are fine enough, error stays small. If buckets are too coarse, the model forgets distinctions it needs.
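To make the two formulas concrete, here is a minimal sketch in plain Python rather than torch, so it runs without dependencies. The scale of 0.01 and the helper names `quantize` and `dequantize` are illustrative assumptions, not part of the chapter's code.

```python
def quantize(x: float, s: float, qmin: int = -128, qmax: int = 127) -> int:
    """q = round(x / s), clamped to the integer range."""
    return max(qmin, min(qmax, round(x / s)))


def dequantize(q: int, s: float) -> float:
    """x_hat = q * s."""
    return q * s


s = 0.01  # assumed scale: one bucket per 0.01 of real value
weights = [-0.91, 0.03, 1.27, -0.44, 0.0187342]

codes = [quantize(x, s) for x in weights]
recon = [dequantize(q, s) for q in codes]
errors = [x - x_hat for x, x_hat in zip(weights, recon)]

for x, q, x_hat, e in zip(weights, codes, recon, errors):
    print(f"x={x:+.7f}  q={q:+4d}  x_hat={x_hat:+.2f}  error={e:+.7f}")
```

The last weight is the 0.0187342 knob from earlier: it lands in bucket 2 and comes back as 0.02. No error here exceeds half a bucket, s / 2 = 0.005.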
Symmetric vs asymmetric, per-tensor vs per-channel
If a weight tensor is roughly centered around zero, symmetric quantization (one scale, zero stays at zero, equal range positive and negative) is enough. If a tensor is skewed, an asymmetric scheme adds a zero-point offset so the bucket range fits the actual distribution. Activations are often skewed; weights usually are not. That is why weight-only int8 dominates while activations need more care.
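A small sketch of the asymmetric case may help. It uses one common formulation of scale plus zero-point for unsigned 8-bit codes; the function names and the sample activations are hypothetical, and real runtimes differ in rounding and range conventions.

```python
def asym_params(xs, qmin=0, qmax=255):
    """Scale and zero-point that map [min(xs), max(xs)] onto [qmin, qmax]."""
    lo = min(min(xs), 0.0)  # widen the range to include 0.0 so zero stays exact
    hi = max(max(xs), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def asym_quantize(x, scale, zero_point, qmin=0, qmax=255):
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def asym_dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


# Skewed, mostly positive values (made-up stand-ins for activations).
acts = [-0.2, 0.0, 0.5, 2.0, 6.0]

scale, zp = asym_params(acts)
recon = [asym_dequantize(asym_quantize(x, scale, zp), scale, zp) for x in acts]
asym_max_err = max(abs(x - r) for x, r in zip(acts, recon))

# Symmetric int8 on the same data spends half its codes on a negative range
# that is almost empty, so its buckets are about twice as wide.
sym_scale = max(abs(x) for x in acts) / 127.0

print(f"asymmetric: scale={scale:.5f} zero_point={zp} max_err={asym_max_err:.5f}")
print(f"symmetric:  scale={sym_scale:.5f}")
```

The zero-point shifts the bucket grid so that the codes cover the range the data actually occupies, while 0.0 still maps to an exact code.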
The other axis is granularity. Per-tensor uses one scale for the whole tensor — cheapest, crudest. Per-channel uses one scale per output row, so a row of small weights keeps fine resolution while a row of large weights gets a wider range. Group-wise goes finer still, one scale per block of 64 or 128 weights inside a row. Smaller groups preserve quality and cost more bookkeeping. Quantization in one sentence: tradeoff after tradeoff.
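The granularity tradeoff is easy to see on a toy matrix. The sketch below uses plain Python and a made-up 2x4 weight matrix whose first row is about 100 times smaller than its second. All names are illustrative.

```python
def sym_scale(values, qmax=127):
    m = max(abs(v) for v in values)
    return m / qmax if m > 0 else 1.0


def roundtrip(values, scale, qmax=127):
    """Quantize to [-qmax, qmax] and immediately dequantize."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def max_err(values, recon):
    return max(abs(v - r) for v, r in zip(values, recon))


# Hypothetical 2x4 weight matrix: row 0 is tiny, row 1 is about 100x larger.
W = [
    [0.010, -0.020, 0.015, -0.005],
    [1.000, -0.800, 0.600, -1.200],
]

# Per-tensor: one scale from the global max.
tensor_scale = sym_scale([v for row in W for v in row])
per_tensor_err = [max_err(row, roundtrip(row, tensor_scale)) for row in W]

# Per-channel: one scale per output row.
row_scales = [sym_scale(row) for row in W]
per_channel_err = [max_err(row, roundtrip(row, s)) for row, s in zip(W, row_scales)]

print("per-tensor  max error per row:", per_tensor_err)
print("per-channel max error per row:", per_channel_err)
```

Row 1 holds the global maximum, so it looks the same under both schemes. Row 0 is where per-channel scaling pays off, at the cost of storing one extra scale.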
Calibration
Calibration is how you pick the scale. For weight-only quantization, calibration is trivial — the weights are in front of you, you measure their max magnitude and compute a scale. For activation quantization, you run sample data through the model and watch what the activation ranges look like. That sample data is the calibration set. Most teams under-invest in it, which is what the BREAK IT section will sabotage on purpose. A calibration set that does not match real inputs gives you a scale that is wrong for real inputs.
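For activations, the simplest calibration rule is to track the running min and max over the calibration batches and derive the scale from them. The sketch below is a toy version of that idea. `RangeTracker` is a hypothetical helper, not a library class, and the batches are made-up numbers standing in for activations captured at one layer.

```python
class RangeTracker:
    """Tracks the running min and max of every activation value it is shown."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def update(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def symmetric_scale(self, qmax=127):
        max_abs = max(abs(self.lo), abs(self.hi))
        return max_abs / qmax if max_abs > 0 else 1.0


# Stand-in calibration batches: in a real run these would be activations
# captured at one layer while calibration prompts flow through the model.
calibration_batches = [
    [0.1, -0.4, 0.9],
    [1.8, -0.2, 0.3],
    [-2.54, 0.7, 1.1],
]

tracker = RangeTracker()
for batch in calibration_batches:
    tracker.update(batch)

print(tracker.lo, tracker.hi, tracker.symmetric_scale())
```

The scale depends only on what the tracker was shown. Drop the third batch and the scale shrinks, and any later input near -2.54 no longer fits.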
Why it matters
Without quantization, deployment hits a wall fast. A model with 7 billion parameters at FP32 needs 7e9 × 4 = 28 GB. That is just the weights — not the optimizer states from training, not activations during a forward pass, not the KV cache that grows with sequence length at inference. Just the weights. FP32 blocks many local deployments outright. FP16 cuts memory in half but still leaves large models expensive. INT8 cuts memory to one quarter of FP32. INT4 cuts it to one eighth. INT2 goes smaller still, but often pushes quality below the floor.
The second thing quantization buys you is bandwidth. People often assume inference is mostly arithmetic. For large language models, memory movement is usually the real tax. You keep pulling huge weight matrices from memory into the compute units, and lower-bit weights mean fewer bytes moved. Fewer bytes moved means less waiting. So quantized models are often smaller and faster — not always, but often enough that this is part of the default deployment toolkit.
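A back-of-envelope sketch shows why bytes moved matter. It assumes single-stream decoding in which every weight is read from memory once per generated token, and it ignores compute, KV-cache traffic, and scale overhead. The 100 GB/s bandwidth figure is a hypothetical placeholder. Treat the result as a rough ceiling, not a prediction.

```python
def decode_tokens_per_sec_ceiling(n_params: float, bits_per_weight: int,
                                  bandwidth_gb_s: float) -> float:
    """Rough upper bound for batch-1 decoding when weight reads dominate.

    Assumes every weight is read from memory once per generated token and
    ignores compute, KV-cache traffic, and scale/metadata overhead.
    """
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


BANDWIDTH_GB_S = 100.0  # hypothetical memory bandwidth; substitute your hardware's figure

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    ceiling = decode_tokens_per_sec_ceiling(7e9, bits, BANDWIDTH_GB_S)
    print(f"{name}: at most ~{ceiling:.1f} tokens/sec")
```

Under these assumptions the ceiling scales inversely with bits per weight, which is the whole bandwidth argument in one line of arithmetic.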
The third thing quantization changes is the conversation. The right question is not whether quantization is good or bad. It is: how much quality loss buys how much deployment gain? For one product, a 1-point benchmark drop is unacceptable. For another, an 8x size reduction with a small reasoning hit is the whole business model. Once you measure that tradeoff, you stop speaking in slogans and start speaking in numbers.
The right question is never "can we afford quantization." It is "how much can we afford to lose."
The build
Assume you have a trained checkpoint from earlier work. The sequence is: establish a clean FP32 baseline, implement int8 post-training quantization on a single layer, expand to the whole model, measure quality loss, repeat for int4 with group-wise scales, then profile inference speed honestly.
Step 1 — Establish the FP32 baseline
Do not quantize first and benchmark later. That is how people lie to themselves without meaning to. Start with the original checkpoint and record three things: model size on disk, validation loss on a held-out set, and inference speed in tokens per second. Your baseline is the only reason later numbers mean anything. A typical baseline row might read FP32, 28.0 GB, val loss 2.41, 18 tokens/sec. Those numbers are placeholders. Your actual values depend on model size and hardware. Record them once. Do not move on until they are written down.
Step 2 — Implement symmetric INT8 quantization
Post-training quantization means you quantize after training is complete. No retraining, no extra gradient steps. That is what makes it attractive. The simplest useful version is symmetric int8 weight quantization. Symmetric means zero stays at zero, so positive and negative values get equal range. For a weight tensor W, find the maximum absolute weight, compute a scale so that max magnitude maps to 127, round each weight to the nearest integer in [-127, 127]:
import torch


def quantize_int8_symmetric(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    max_abs = w.abs().max()
    scale = max_abs / 127.0 if max_abs > 0 else torch.tensor(1.0, device=w.device)
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale


def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
This is not yet a production kernel. It is enough to make the mechanics visible. Now pause and ask the right beginner question: if the runtime has to turn int8 back into float, are we really saving anything? Yes — for two reasons. Storage still shrinks. And good runtimes do not dequantize the whole model all at once into FP32 and throw away the savings. They fuse quantized storage with matrix multiplication kernels so the movement and arithmetic stay efficient. Your toy version may not be faster yet. The point here is to understand the representation before chasing optimized kernels.
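The algebra behind "do not dequantize everything first" fits in a few lines. Because one scale is shared across a row, it can be factored out of the dot product: accumulate on the integer codes, then multiply by the scale once. The sketch below is plain Python with made-up numbers. It shows only the identity, not how any particular kernel is written.

```python
def quantize_row(row, qmax=127):
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) for v in row], scale


w_row = [-0.91, 0.03, 1.27, -0.44]   # one row of weights
x = [0.5, -1.0, 2.0, 0.25]           # one input vector

q_row, scale = quantize_row(w_row)

# Path A: dequantize every weight to float, then take the dot product.
y_dequant_first = sum((q * scale) * xi for q, xi in zip(q_row, x))

# Path B: dot product on the integer codes, then apply the scale once.
y_scale_last = scale * sum(q * xi for q, xi in zip(q_row, x))

y_fp = sum(w * xi for w, xi in zip(w_row, x))
print(y_fp, y_dequant_first, y_scale_last)
```

Both paths give the same answer up to floating-point rounding. Path B never materializes a float copy of the weights, which is the property optimized runtimes exploit.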
Step 3 — Quantize one layer first, not the whole model
Do not quantize everything at once on your first pass. Pick one linear layer — an MLP projection or an attention projection works well. Run it on a sample input in FP32, then run the quantized-dequantized version, and compare outputs:
def layer_error(linear: torch.nn.Linear, x: torch.Tensor) -> tuple[float, float]:
    with torch.no_grad():
        y_fp = linear(x)
        q_w, scale = quantize_int8_symmetric(linear.weight.data)
        w_hat = dequantize_int8(q_w, scale)
        y_q = x @ w_hat.t()
        if linear.bias is not None:
            y_q = y_q + linear.bias
        mse = torch.mean((y_fp - y_q) ** 2).item()
        max_err = torch.max((y_fp - y_q).abs()).item()
    return mse, max_err
Quantization errors are easier to understand locally. If one layer barely changes, good. If one layer blows up, you inspect that layer before blaming the whole pipeline. This is the same discipline you used earlier when debugging training: isolate the component.
Step 4 — Move from one layer to the full model
Once the local check looks sane, quantize all linear weights. In a transformer-style language model, that usually means the attention projections (Wq, Wk, Wv, Wo), the MLP weights, and the final projection layer. Embeddings can be quantized too, but many deployment formats handle them with their own rules. For a plain educational implementation, start by quantizing every nn.Linear.weight:
def quantize_model_int8(model: torch.nn.Module) -> dict:
    qstate = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            q_w, scale = quantize_int8_symmetric(module.weight.data)
            qstate[name] = {
                "q_weight": q_w.cpu(),
                "scale": scale.cpu(),
                "bias": None if module.bias is None else module.bias.data.cpu(),
            }
    return qstate
The first time I quantized everything in one pass, the model produced output that looked normal for two lines and then unraveled into syllable salad. I spent an afternoon assuming I had a bug somewhere in the dequantization path. The actual problem was that one specific projection layer was carrying a small number of outlier weights, and a per-tensor scale across the whole layer crushed everything else. Checking quantization error layer by layer matters. So does looking at which layers have outliers.
Step 5 — Measure the INT8 quality drop
Run your validation set again with quantized weights. You care about three numbers: change in validation loss, change in benchmark accuracy, and change in sample outputs. A small loss increase is expected. For many models, weight-only int8 quantization causes very little degradation, which is why int8 is often described as near-lossless for inference. Do not repeat that phrase like a slogan. Measure it on your model, especially if the model is small or trained noisily. A good result might look like FP32 at val loss 2.41, GSM8K 41.3%, vs INT8 at val loss 2.45, GSM8K 40.8%, with size dropping from 28.0 GB to 7.0 GB. Structure matters more than exact numbers.
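Keeping the comparison as data rather than as prose makes it harder to fool yourself. The sketch below computes the deltas from the placeholder numbers in the paragraph above. The `Row` and `compare` names are hypothetical, and the values are the illustrative ones from the text, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Row:
    fmt: str
    size_gb: float
    val_loss: float
    gsm8k_pct: float


def compare(baseline: Row, candidate: Row) -> dict:
    return {
        "format": candidate.fmt,
        "size_ratio": baseline.size_gb / candidate.size_gb,
        "val_loss_delta": candidate.val_loss - baseline.val_loss,
        "gsm8k_delta_pts": candidate.gsm8k_pct - baseline.gsm8k_pct,
    }


# Placeholder numbers from the text above, not measurements.
fp32 = Row("FP32", 28.0, 2.41, 41.3)
int8 = Row("INT8", 7.0, 2.45, 40.8)

report = compare(fp32, int8)
print(f"{report['format']}: {report['size_ratio']:.1f}x smaller, "
      f"val loss {report['val_loss_delta']:+.2f}, "
      f"GSM8K {report['gsm8k_delta_pts']:+.1f} pts")
```

Add one `Row` per format as you go and the same `compare` call gives you a consistent table for int4 later.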
Step 6 — Implement group-wise INT4
Now the buckets get much coarser. INT4 means each weight gets 16 possible values total. That sounds absurdly small, and yet 4-bit inference is common because the quality can remain surprisingly usable. At 4 bits, how you group weights matters more. A naive whole-tensor scale often hurts too much, so group-wise quantization is the right educational target. Flatten the weight tensor into chunks of 64 or 128 values. Each chunk gets its own scale, which lets a block with small numbers keep fine resolution while a block with larger numbers gets wider range:
def quantize_int4_groupwise(
    w: torch.Tensor, group_size: int = 64
) -> tuple[torch.Tensor, torch.Tensor]:
    flat = w.flatten()
    n = flat.numel()
    q_groups = []
    scales = []
    for i in range(0, n, group_size):
        chunk = flat[i:i+group_size]
        max_abs = chunk.abs().max()
        scale = max_abs / 7.0 if max_abs > 0 else torch.tensor(1.0, device=w.device)
        q = torch.clamp(torch.round(chunk / scale), -8, 7).to(torch.int8)
        q_groups.append(q)
        scales.append(scale)
    q = torch.cat(q_groups).view_as(w)
    scales = torch.stack(scales)
    return q, scales
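Why -8 to 7? Because 4 bits give 16 values in total, and a signed 4-bit integer covers -8 through 7. With the scale set to max_abs / 7, the rounded values actually land in [-7, 7], so the -8 bucket goes unused and the clamp is only a safety net. Note that the function still stores each code in a full int8, so it does not save space yet. The exact packing format differs across runtimes, and many real implementations pack two int4 values into a single byte to actually save the storage. The logic above captures the bucket math without the bit-packing detail. Dequantization reverses it group by group: multiply each code by the scale of its own group.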
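The bit-packing step is small enough to sketch. The version below puts the first code in the low nibble and the second in the high nibble. That ordering is an assumption for illustration, since runtimes choose their own layouts.

```python
def pack_int4_pair(lo: int, hi: int) -> int:
    """Pack two signed 4-bit values (-8..7) into one byte (0..255)."""
    assert -8 <= lo <= 7 and -8 <= hi <= 7
    return (lo & 0x0F) | ((hi & 0x0F) << 4)


def unpack_int4_pair(byte: int) -> tuple[int, int]:
    """Recover the two signed 4-bit values from one packed byte."""
    def sign_extend(nibble: int) -> int:
        return nibble - 16 if nibble >= 8 else nibble
    return sign_extend(byte & 0x0F), sign_extend((byte >> 4) & 0x0F)


codes = [-7, 3, 0, 7, -1, -8]          # an even number of int4 codes
packed = bytes(pack_int4_pair(codes[i], codes[i + 1]) for i in range(0, len(codes), 2))
unpacked = [v for b in packed for v in unpack_int4_pair(b)]

print(len(codes), "codes ->", len(packed), "bytes")
```

Six codes fit in three bytes and unpack to the same values. Only after this step does int4 actually occupy half the space of int8.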
Step 7 — Why INT4 is still plausible
A reasonable reader asks: how can 16 values per weight be enough? Because the model does not need each individual weight to be perfect. Neural networks are distributed systems. Meaning is not stored in one magic neuron or one exact decimal place. It is spread across many interacting parameters, so the network often tolerates a lot of local roughness before the global behavior breaks. INT4 often keeps enough of the broad geometry of the learned function. INT2 often does not.
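You can check the "broad geometry survives" claim numerically on synthetic data. The sketch below quantizes a random weight row to group-wise int4 in plain Python and measures how far it rotates away from the original. The Gaussian weights are a stand-in, and real weights with outliers will usually behave worse, so read this as intuition rather than evidence about your model.

```python
import math
import random


def int4_groupwise_roundtrip(values, group_size=64):
    """Symmetric int4 quantize-then-dequantize, one scale per group."""
    out = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        max_abs = max(abs(v) for v in chunk)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        out.extend(max(-8, min(7, round(v / scale))) * scale for v in chunk)
    return out


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(0)
w = [rng.gauss(0.0, 0.02) for _ in range(4096)]   # synthetic stand-in for one weight row
x = [rng.gauss(0.0, 1.0) for _ in range(4096)]    # synthetic stand-in for one input vector

w_hat = int4_groupwise_roundtrip(w)

weight_cos = cosine(w, w_hat)
y = sum(wi * xi for wi, xi in zip(w, x))
y_hat = sum(wi * xi for wi, xi in zip(w_hat, x))

print(f"cosine(w, w_hat) = {weight_cos:.4f}")
print(f"y = {y:.4f}, y_hat = {y_hat:.4f}")
```

Every individual weight is wrong by up to half a bucket, yet the row as a whole still points in nearly the same direction. That is the sense in which local roughness is tolerated.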
Step 8 — Profile inference speed honestly
Measure speed, but keep your experimental honesty intact. If you are testing in Python with naive dequantization, your low-bit model may look slower than FP32. That does not mean quantization failed. It means your runtime is not doing quantized inference efficiently. Collect two kinds of speed numbers: a naive Python path (useful for debugging correctness) and an optimized runtime path (useful for deployment). For each format, record prompt length, generation length, hardware, tokens per second, and peak RAM or VRAM. Keep prompt and sampling settings fixed across runs. The runtime is part of the product, not just the weights.
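A minimal timing harness along these lines keeps the measurement itself fixed across formats. `generate` here is a hypothetical callable that runs one generation and returns the number of tokens produced. `fake_generate` exists only so the sketch runs on its own.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[str, int], int], prompt: str,
                      max_new_tokens: int, warmup: int = 1, runs: int = 3) -> float:
    """Average tokens/sec over `runs` timed calls, after `warmup` untimed calls.

    `generate` is a hypothetical callable that runs one generation and returns
    the number of tokens it actually produced.
    """
    for _ in range(warmup):
        generate(prompt, max_new_tokens)
    total_tokens, total_seconds = 0, 0.0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += generate(prompt, max_new_tokens)
        total_seconds += time.perf_counter() - start
    return total_tokens / total_seconds


def fake_generate(prompt: str, max_new_tokens: int) -> int:
    """Stand-in for a real model call so the sketch runs on its own."""
    time.sleep(0.01)
    return max_new_tokens


tps = tokens_per_second(fake_generate, "same prompt every run", max_new_tokens=32)
print(f"{tps:.0f} tokens/sec")
```

The warmup call keeps one-time costs such as allocation and caching out of the timed runs. On a GPU, kernels launch asynchronously, so the real `generate` should synchronize (for example with `torch.cuda.synchronize()`) before it returns, or the clock stops too early.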
Step 9 — The storage math
If P is parameter count and b is bits per parameter, raw weight storage is roughly (P × b) / 8 bytes. For a 7B parameter model: FP32 is 28 GB, INT8 is 7 GB, INT4 is 3.5 GB, INT2 is 1.75 GB. Real file sizes differ because of metadata, scales, and grouping overhead, but the rough picture holds. The savings are not cosmetic — they are architectural. Lower-bit models change which devices can host the model at all.
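The formula is short enough to keep as a helper. This sketch reproduces the 7B figures above in decimal gigabytes and, like the text, ignores scales and metadata. The function name is illustrative.

```python
def weight_storage_gb(n_params: float, bits_per_param: float) -> float:
    """Raw weight storage in decimal gigabytes: (P * b) / 8 bytes."""
    return n_params * bits_per_param / 8 / 1e9


P = 7e9  # 7B parameters
table = {name: weight_storage_gb(P, bits)
         for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]}

for name, gb in table.items():
    print(f"{name}: {gb:.2f} GB")
```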
BREAK IT
Now do the most useful thing in the chapter. Break it on purpose. There are two failures worth forcing: a calibration set that does not represent real inputs, and a single outlier weight that blows up the scale for everything else in its group. Each failure teaches a different lesson. The first teaches you why data hygiene matters for quantization. The second teaches you why granularity matters.
Calibrate on the wrong distribution
Pick a tiny, unrepresentative calibration set. Three or four short prompts of the same shape. If the model will see code, calibrate on poetry. If the model will see Tamil, calibrate on English. Or, even simpler: use a calibration batch that contains only one kind of input — all lowercase, no punctuation, short sequences — and quantize activations using the ranges observed on that batch.
Then run real inputs and measure. The activation ranges you measured during calibration will be too narrow for actual traffic. Activations that fell inside the calibration range will quantize fine. Activations that fall outside the calibration range will saturate at the bucket boundary — every value above the observed max becomes the same code, every value below the observed min becomes the same code. The dynamic range you needed to represent has been thrown away. The model still runs. The logits still come out. But the activations no longer carry the same information they did in FP32.
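Saturation is easy to reproduce by hand. In the sketch below, the calibration range of [-1.0, 1.0] and the "real" activation values are made up. The point is only to show what a scale fixed at calibration time does to inputs it never saw.

```python
def quantize_with_fixed_scale(x: float, scale: float, qmax: int = 127) -> int:
    return max(-qmax, min(qmax, round(x / scale)))


# Calibration only ever saw activations in [-1.0, 1.0] (hypothetical tidy prompts).
calibration_max_abs = 1.0
scale = calibration_max_abs / 127.0

# Real traffic produces larger activations at the same layer.
real_activations = [0.25, 0.9, 2.0, 3.0, 4.0]

codes = [quantize_with_fixed_scale(a, scale) for a in real_activations]
recon = [q * scale for q in codes]

for a, q, r in zip(real_activations, codes, recon):
    print(f"activation={a:.2f}  code={q:4d}  reconstructed={r:.3f}")
```

The values 2.0, 3.0, and 4.0 all become code 127 and all reconstruct to about 1.0. Nothing crashes, but the layer can no longer tell them apart.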
The signature failure is interesting because the metrics often look fine on the same distribution you calibrated on. Validation loss does not spike if your validation set looks like your calibration set. The hit shows up on real users, where inputs are longer, weirder, in a different language, or just rougher than the tidy calibration prompts. A calibration set is a tiny piece of training data for the quantizer. If it is biased, the quantizer learns the wrong scales.
The fix is unglamorous. Calibrate on a sample that genuinely resembles deployment traffic — different lengths, different languages, different content classes. Look at the activation histograms during calibration. If one channel has a tail extending ten times further than the others, the scale for that tensor will be dominated by that channel. Which brings us to the second failure.
Let one outlier weight set the scale
In your int4 group-wise quantizer, find a weight tensor in a transformer projection layer and inject one extreme outlier — multiply a single weight by 20. Re-run the quantizer with per-tensor (not group-wise) scaling, so all weights share one scale:
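# sabotage one weight to be a giant outlier
# (assumes `linear` is an nn.Linear from your model; if W[0, 0] happens to be
# near zero, pick an index whose weight has a typical magnitude instead)
W = linear.weight.data.clone()
W[0, 0] = W[0, 0] * 20.0
# per-tensor scale: max_abs is now dominated by the outlier
max_abs = W.abs().max()
scale = max_abs / 7.0  # int4 range
q = torch.clamp(torch.round(W / scale), -8, 7)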
What happens? The single outlier weight now defines the scale for the entire tensor. Every other weight in the tensor, millions of them, gets divided by a scale that is far too large, rounds to a bucket at or next to zero, and collapses into a tiny range of buckets near the middle. Print the histogram of quantized values. You will see almost everything pile up at 0 and ±1, with a single cell at the extreme. The tensor has been crushed.
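If you want to see the histogram without loading a model, the same experiment runs on synthetic weights in plain Python. The Gaussian tensor, its standard deviation of 0.02, and the choice of which weight to sabotage are all assumptions made for illustration.

```python
import random
from collections import Counter

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]  # synthetic stand-in for one tensor

weights[0] = 0.02          # a weight of typical size...
weights[0] *= 20.0         # ...sabotaged into an outlier at 0.4


def int4_per_tensor_codes(values):
    """One scale for the whole tensor, codes clamped to [-8, 7]."""
    scale = max(abs(v) for v in values) / 7.0
    return [max(-8, min(7, round(v / scale))) for v in values]


hist = Counter(int4_per_tensor_codes(weights))
for code in sorted(hist):
    print(f"{code:+d}: {hist[code]}")

near_zero_share = (hist[-1] + hist[0] + hist[1]) / len(weights)
```

Sixteen buckets were available. Nearly every weight ends up in three of them, and the outlier sits alone at +7.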
This is the outlier-driven scale blow-up. It is not hypothetical: it is one of the most common reasons production quantization runs degrade unexpectedly. Real transformer weights are not uniformly distributed. A small number of channels can carry magnitudes an order of magnitude larger than the median. If your scheme uses one scale for everything, those outliers set the scale and crush the rest.
Two things fix this. First is granularity: per-channel or group-wise scaling lets the outlier sit in its own group with its own wide scale, while the rest of the tensor keeps fine resolution. That is why most practical 4-bit schemes use small groups, not whole-tensor scales. Second is outlier-aware methods. AWQ identifies the weight channels that matter most for activations and protects them. GPTQ adjusts remaining weights to compensate for rounding error already introduced. SmoothQuant pushes activation outlier magnitudes into the weights, where per-channel scales can absorb them. The common thread is the same: do not let one number ruin a million others.
This experiment proves the deeper point. Quantization quality is not really a story about bit width. It is a story about where the error lands. Two int4 quantizers can produce wildly different models depending on how they choose scales, how they group weights, and how they handle outliers. The bit width sets the ceiling. The calibration and granularity choices decide how close you get to it.
The deeper lesson
Quantization looks like a compression algorithm and behaves like a learning problem. The quantizer makes a small number of decisions — which scale, which granularity, which calibration data — and those decisions cascade through every layer of the model. A bad decision early shows up as fluent-sounding gibberish at the output. A good decision early gives you a model that fits on a phone and still answers correctly. When you read the next paper on quantization, read it for the decisions it makes, not the percentages it reports. The percentages depend on the decisions.
Questions to answer
- At which bit width did the model first become unacceptable for your use case — not in theory, but in actual outputs? What failed first: arithmetic, factual accuracy, long-range coherence, exact formatting, or basic fluency?
- Did validation loss rise smoothly while user-visible quality dropped suddenly, or did both degrade together? Which one would you trust more if you only had one to look at?
- How much speed did you gain from quantization itself, and how much came from switching to a runtime built for inference rather than training? If you removed the runtime change and kept only the bit-width change, where do the tokens-per-second numbers actually land?
- If you replace a per-tensor scale with a group-wise scale of 64 weights per group, what is the storage overhead from the additional scales, and at what group size does the bookkeeping start to dominate the savings?
- Which quantization choice gave the best size-speed-quality tradeoff on your hardware, and would the answer change on a different device — a laptop CPU versus a server GPU versus an edge accelerator?
Go further
- GPTQ. The classic post-training anchor for aggressive low-bit weight quantization. Quantizes weights one column at a time and updates the remaining weights to compensate for the rounding error already introduced. Search for "GPTQ post training quantization".
- AWQ (activation-aware weight quantization). Uses activation statistics to identify the small set of weight channels that matter most, then protects them. Its authors report better quality than GPTQ at the same bit width on their benchmarks, which fits the idea of spending precision where the model actually needs it.
- llama.cpp and GGUF. Not a paper but a runtime worldview. GGUF is a deployment container that stores weights, scales, tokenizer metadata, and architecture metadata together. llama.cpp treats low-bit inference as first-class. The Q4_K_M, Q5_K_M, and IQ2 quantization formats are worth learning by name.
- bitsandbytes. The library that made 8-bit and 4-bit workflows accessible enough to become routine. LLM.int8() introduced mixed-precision decomposition for transformer inference; NF4 (normal float 4) is the 4-bit format used by QLoRA fine-tuning. If you want quantization to feel like a flag rather than a research project, this is the library that made it so.
What you now know
You can now explain why a model that needs tens of gigabytes in FP32 can still work at 4 bits, and why that is not magic but controlled rounding plus a runtime designed to exploit it. You can quantize a tensor symmetrically to int8, switch to group-wise int4, and measure the change in validation loss and benchmark accuracy. You can describe the outlier-driven failure mode precisely: one large weight sets the per-tensor scale, every other weight gets crushed, the layer stops carrying useful information. You can describe the calibration failure mode precisely: a biased calibration set picks scales that fit one distribution and fail on another. And you can answer the deployment question that matters in the real world — not "Can I quantize this model?" but "How far can I quantize this model before it stops doing my job?"
This was Chapter 27 of 35.
The full book is 934 pages and 256,587 words. 35 hands-on projects from autograd to fused specialists. PDF and EPUB on Leanpub, lifetime free updates.