Parameters and Scale: What '70 Billion Parameters' Actually Means
“Introducing Atlas-2: 70 billion parameters, our most capable model yet.” Headlines like this show up every few weeks, and the number always sits front and center, right next to the release date, as if it were a spec you could compare the way you’d compare storage on a phone. Most readers nod past it. Bigger number, presumably better model, move on. But almost nobody stops to ask what a parameter actually is, and that gap is worth closing, because the number is doing real work. It’s just not the work most people assume.
What a parameter actually is
A parameter is a single adjustable number inside the model, one of many knobs that get nudged, very slightly, every time the model is shown a piece of text during training. Before training starts, these numbers are essentially random. As training proceeds, the model predicts the next word in a sentence, checks how wrong it was, and adjusts its parameters a tiny bit in the direction that would have made the prediction better. Do that billions of times across enormous amounts of text and the parameters settle into values that, collectively, encode patterns: grammar, facts, style, the rough shape of how ideas tend to follow each other in language. A “70 billion parameter” model has 70 billion of these adjustable numbers, no more mysterious than that, and no less.
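The predict, check, adjust loop described above can be sketched in a few lines. This is a toy under stated assumptions, not how a language model is actually built: the "model" here has just two parameters (the hypothetical names `w` and `b`), it predicts a number rather than a word, and the data comes from a made-up hidden rule, y = 3x + 1. The mechanics are the same in spirit, though: start from random values, measure the error, nudge each parameter slightly in the direction that reduces it.

```python
import random


def train(steps=2000, learning_rate=0.05, seed=0):
    """Fit a two-parameter toy model by repeated small nudges."""
    rng = random.Random(seed)

    # Before training, the parameters are essentially random.
    params = {"w": rng.uniform(-1, 1), "b": rng.uniform(-1, 1)}

    # Toy training data from a hidden rule (an assumption for illustration).
    data = [(k / 10, 3 * (k / 10) + 1) for k in range(-10, 11)]

    for _ in range(steps):
        x, target = rng.choice(data)

        # 1. Predict.
        prediction = params["w"] * x + params["b"]

        # 2. Check how wrong the prediction was.
        error = prediction - target

        # 3. Nudge each parameter against the gradient of the squared error.
        params["w"] -= learning_rate * 2 * error * x
        params["b"] -= learning_rate * 2 * error

    return params


params = train()
print(f"{len(params)} parameters:", params)
```

After enough steps the two values settle close to 3 and 1, the rule hidden in the data. A 70 billion parameter model does the same kind of thing with 70 billion numbers instead of two.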
More parameters generally means more room to store distinctions. Think of it the way you’d think about camera resolution. Jumping from a 2-megapixel camera to a 12-megapixel one is an obvious, visible upgrade: fine detail that was simply lost before now shows up clearly. That’s roughly what happened going from million-parameter language models to models with tens of billions of parameters: capabilities that were mushy or absent became sharp and reliable. But jump from a 100-megapixel camera to a 110-megapixel one and almost nobody could tell the difference by eye, even though the second camera is harder and more expensive to build. The extra capacity is still there, but the returns on human-perceptible quality have mostly flattened out.
Why the number gets treated as a headline
Researchers noticed a pattern early on: performance on a wide range of tasks improves in a fairly predictable way as you scale up three things together: parameter count, training data size, and the computing power spent on training. These relationships are called scaling laws, and for years they held up well enough that “bigger model” was a reasonably safe bet for “better model.” That’s why parameter count became the headline figure in the first place. It was a decent, if crude, proxy for capability, and it was easier to put in a press release than something like “trained on 4 trillion tokens with this much compute.” The number isn’t meaningless. It’s just one ingredient being mistaken for the whole recipe.
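To see how the three ingredients fit together, here is a rough back-of-the-envelope sketch. It assumes a commonly cited rule of thumb for dense transformer models, that training takes about 6 floating-point operations per parameter per token. That approximation is not from this article and real training runs vary. The function name `approx_training_flops` is hypothetical, and the inputs are the 70 billion parameters and 4 trillion tokens used as examples above.

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token.

    This is an approximation assumed for illustration, not an exact figure.
    """
    return 6.0 * n_params * n_tokens


# Example values taken from the text: 70 billion parameters, 4 trillion tokens.
flops = approx_training_flops(70e9, 4e12)
print(f"approx. {flops:.2e} floating-point operations")
```

The point of the sketch is that parameter count alone does not fix the cost or the result. The same 70 billion parameters trained on half as much text would take roughly half the compute and would be a different model.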
The part the industry already knows
What rarely makes the headline is that scaling laws come with well-documented diminishing returns. Each further jump in size tends to buy a smaller improvement in measured performance, while the compute required for that jump grows much faster than the improvement it buys. Doubling a model’s parameters once might move a benchmark score from 40 percent to 60 percent. Doubling it again, at many times the cost, might add only a fraction of that gain. The curve doesn’t collapse; it just bends, and it’s been bending for a while now, visibly enough that the labs building these systems have seen the same data everyone else has.
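The bending curve can be illustrated with a toy calculation. Everything numeric here is an assumption chosen for illustration, not a measured result: the sketch supposes that a model’s error follows a power law in parameter count (the hypothetical `toy_loss`, with made-up `scale` and `exponent` values), and that training cost is proportional to parameters times training text, with the text scaled up in step with the model.

```python
def toy_loss(n_params: float, scale: float = 10.0, exponent: float = 0.1) -> float:
    """Hypothetical power-law curve: error falls as n_params ** -exponent."""
    return scale * n_params ** -exponent


def doubling_table(start_params: float, doublings: int):
    """For each doubling, report the error reduction and the relative cost."""
    rows = []
    n = start_params
    for _ in range(doublings):
        bigger = 2 * n
        gain = toy_loss(n) - toy_loss(bigger)
        # Assumed cost model: cost ~ parameters * tokens, with tokens scaled
        # in proportion to parameters, so cost grows with the square of size.
        relative_cost = (bigger / start_params) ** 2
        rows.append((bigger, gain, relative_cost))
        n = bigger
    return rows


rows = doubling_table(1e9, 4)
for size, gain, relative_cost in rows:
    print(f"{size:.0e} params  error drops by {gain:.4f}  cost x{relative_cost:.0f}")
```

Under these assumptions each doubling buys a slightly smaller drop in error than the last, while the cost of getting there quadruples every time. The specific numbers mean nothing; the shape is what matters.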
That’s the real story behind the recent shift toward smaller, more efficient models rather than a straight march toward ever-larger ones. It isn’t a change of philosophy or a newfound appreciation for elegance. It’s a rational response to a curve that everyone building these systems can already see flattening in front of them.
If you want the other half of the scaling story, emergent abilities covers the surprising, sudden capabilities that show up at certain size thresholds rather than gradually. And once scale stops paying for itself, the next question is what labs do with a large model they’ve already trained, which is exactly what distillation is about.