Emergent Abilities: Skills That Show Up Out of Nowhere

Give a small language model a three-step arithmetic word problem and it fails the way a blind guess fails: right only by accident, no matter how you phrase the question. Keep the training setup identical and just make the model bigger, and for a long stretch nothing changes. Accuracy sits near 10 percent, roughly what guessing would get, across models of noticeably different sizes. Then, somewhere past a certain size, it doesn’t inch upward. It jumps, to 50 percent, 70 percent, sometimes higher, over a range of scale that looks narrow next to how far the field has scaled overall. The model wasn’t slowly getting better at the task the whole time. It was getting nowhere, and then, abruptly, somewhere.

The heating-water problem

A clean way to picture this is water on a stove. Add heat at a steady, unremarkable rate and for a long time the only visible effect is the temperature ticking up: 60 degrees Celsius, 70, 90, still just warm water behaving like warm water. Nothing about watching it from 60 to 99 degrees would tell you anything dramatic is coming. Then at 100 degrees (at sea-level pressure) the behavior of the substance changes completely: it boils, turns to vapor, stops being liquid at all. The input, heat, was added continuously and smoothly the entire time. The output was not continuous or smooth. It was flat, then a cliff.

Language model training scales up in a comparably steady way: more parameters, more data, more computing power, dialed up in increments that look unremarkable from one step to the next. Most capabilities do improve gradually alongside that, in roughly the smooth way you’d expect from a smooth input. But a specific subset (certain kinds of multi-step reasoning, some forms of following a chain of instructions, particular benchmarks requiring several chained inferences) behaves like water crossing 100 degrees rather than like a dial turning. Flat, flat, flat, then a jump to something that looks like real competence. Researchers call these emergent abilities, and the threshold where the jump happens was not something anyone designed in advance. It’s something people noticed only by testing many model sizes and comparing the results afterward.
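For readers who like to see the bookkeeping, here is a minimal sketch in Python of what "testing many model sizes and comparing the results afterward" can look like. The model sizes and accuracy figures are invented for illustration, loosely echoing the 10-percent-then-jump pattern described earlier; they are not measurements from any real model, and the function name largest_jump is likewise made up for this example.

```python
# Hypothetical (model size in billions of parameters, accuracy) pairs.
# These numbers are invented for illustration; they are not real results.
results = [
    (1, 0.09),
    (3, 0.10),
    (10, 0.11),
    (30, 0.10),
    (100, 0.52),
    (300, 0.71),
]


def largest_jump(results):
    """Return (size_before, size_after, gain) for the biggest
    accuracy gain between two adjacent model sizes."""
    ordered = sorted(results)  # sort by model size, smallest first
    best = None
    # Walk through neighbouring pairs of sizes and keep the largest gain.
    for (size_a, acc_a), (size_b, acc_b) in zip(ordered, ordered[1:]):
        gain = acc_b - acc_a
        if best is None or gain > best[2]:
            best = (size_a, size_b, gain)
    return best


before, after, gain = largest_jump(results)
print(f"Largest jump: between {before}B and {after}B parameters (+{gain:.0%})")
```

Note what the sketch can and cannot do: it finds the jump only after every size has been built and scored. Nothing in it predicts where the jump will fall for sizes that have not been tested.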

Why this matters when you read AI news

This is worth holding onto because it explains a specific kind of announcement: a new model release claiming it can suddenly do something smaller versions of the same family could not, not as a modest improvement but as a near-absence turning into a solid capability. That’s not usually marketing exaggeration, and it’s also not evidence of some qualitative leap in how the system works internally. It’s what a threshold effect looks like from the outside. It also cuts the other way: a lab cannot simply promise that a future, larger model will gain a specific new skill on a specific date, because nobody has a reliable way to compute in advance where a given capability’s threshold sits, or whether it has one at all. Some abilities seem to emerge sharply. Others have shown no such jump at any size tested so far, and only improve gradually. Telling those two cases apart ahead of time remains largely a matter of testing and finding out.

What “emergent” is actually admitting

It’s worth being blunt about what the word is doing here. Calling an ability “emergent” sounds like a description of something the system is doing. Mostly it’s a description of something researchers can’t do: predict, before training and testing a given size, which abilities will still be flat and which will have already crossed their threshold. That’s a real limitation, not a minor caveat attached to an otherwise fully understood process. Unpredictability isn’t some rough edge that better tooling will smooth away next year. Right now, it’s part of what scaling these systems up actually gives you: steady, well-understood inputs producing a handful of outputs that only reveal their shape after you’ve already built the thing and run the test.

A later piece in this series looks at parameters and scale in more depth, including what a number like “70 billion parameters” actually refers to, and why making a model bigger doesn’t always keep paying off.