Distillation: How a Small Model Learns From a Big One

A model with roughly 8 billion parameters, small enough to run on a decent laptop, scores within a few points of a flagship model forty or fifty times its size on a range of standard benchmarks. Nobody rewrote the laws of scaling overnight. What happened is more specific than that: the small model was trained, in part, on the outputs of the large one. It didn’t discover how to answer well on its own. It was shown, over and over, how a much bigger system already answers, and it learned to reproduce that.

Learning by watching

This process is called distillation, and the two models involved get named for the roles they play. The large, expensive model is the teacher. The smaller, cheaper model being trained is the student. In ordinary training, a model learns from raw data: text scraped from the web, books, code, whatever the training set contains, with the model gradually adjusting its own internal weights to predict that data well. Distillation adds a different kind of training signal. Instead of, or in addition to, learning from raw data, the student is trained on the teacher’s responses: the answers it gives, sometimes the full probability distribution it assigns across possible next words, sometimes just the finished text. The student is not rediscovering the underlying task from scratch. It’s copying the teacher’s behavior and, in doing so, absorbing a compressed version of the judgment that behavior reflects.

The comparison that fits best is an apprentice standing next to a master craftsman for years. The apprentice doesn’t live through every failed joint the master ever cut, every material he ever misjudged, every technique he discarded before landing on the one he now uses without thinking. The apprentice just watches the finished motions: the angle of the chisel, the pace of the work, the small corrections made along the way. Given enough hours of watching and imitating, the apprentice starts producing work that holds up next to the master’s, without having personally earned it through decades of raw trial and error. Something similar happens with the student model. It doesn’t need the teacher’s full training history or its full scale. It needs enough exposure to the teacher’s outputs to start reproducing the patterns of judgment behind them.

Why this matters

The result is a model that punches above what its parameter count would normally predict. Size still matters in general, a bigger model still tends to have more raw capacity, but distillation shows that a meaningful chunk of a large model’s usefulness can be transferred rather than regrown. That transfer is what makes it possible to ship models that run cheaply on modest hardware, respond faster, and cost a fraction as much to serve, while holding onto most of the competence of a system that would otherwise require a data center to operate. It’s also why benchmark leaderboards can look confusing at a glance: two models with wildly different sizes sometimes land close together, not because they were built the same way, but because one of them learned to sound like the other. None of this replaces the harder problem of actually running these models efficiently at scale once they’re built, which is its own separate story worth telling on its own terms.

The moat that wasn’t

The uncomfortable part, if you’re the one who spent the money building the large model, is what distillation implies about competitive advantage. If scale were the whole moat, whoever trains the biggest model would keep the lead simply by staying biggest. But a rival doesn’t need your training data, your compute budget, or your research process to close the gap. They need access to your model’s outputs, which a hosted API hands over by design every time someone queries it. Query it enough, train a smaller model on what comes back, and a good chunk of your model’s hard-won judgment shows up in a competitor’s product at a fraction of the cost it took you to produce it. The advantage that looked structural, built on years of investment and mountains of compute, turns out to be more portable than it appeared from the outside. Distillation doesn’t erase the value of building a great teacher model. It just means the moat around it is shallower than it looks from a distance, and it drains faster than its builders would like.

For the fuller story of what that size gap actually represents in the first place, see Parameters and Scale, on what “70 billion parameters” is really describing.