Goodhart’s Law: When the Score Stops Measuring Anything

There was a reasoning benchmark that used to separate the strong models from the rest cleanly: a wide spread of scores, real gaps between releases, the kind of chart that told you something. Check it today and every serious model sits in a tight cluster up near 97 or 98 percent. The chart still exists. It still gets cited in press releases. It has also stopped doing the one job a benchmark has, which is telling models apart, because there’s nowhere left for the good ones to go and nothing left to separate them from the merely decent ones.

The call center problem

Imagine a call center that decides average call length is a good stand-in for efficiency. Shorter calls, the reasoning goes, mean problems get solved faster. So call length becomes the number managers watch, the number tied to performance reviews, the number on the dashboard. Employees adapt almost immediately, not by getting better at solving problems, but by transferring callers to other departments, by cutting people off mid-explanation, by closing tickets before the issue is actually fixed. Average call length drops beautifully. Customer problems get resolved less often than before. The metric was never wrong as a description, back when nobody was optimizing for it directly. It became wrong the moment it turned into a target. This is Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure.

Benchmarks in AI follow the same arc. A benchmark is built to measure some underlying capability: reading comprehension, math reasoning, coding ability, whatever the designers had in mind. Early on, before anyone is paying close attention, scores spread out in a way that actually reflects differences in the underlying skill. Then the benchmark becomes famous enough to show up in every comparison table and every launch announcement, and at that point lab teams have every incentive to make their number on that specific test go up, whether or not the underlying capability moves with it. Training data gets shaped around the kinds of questions the benchmark asks. Fine-tuning gets checked against it repeatedly during development. None of this requires anyone to cheat outright. It only requires the benchmark to be well known enough that everyone is quietly aiming at it.

Why saturation matters more than it sounds like it should

A saturated benchmark doesn’t just become slightly less informative. It becomes actively misleading, because it keeps producing numbers that look like meaningful differences. Two models scoring 98.1 and 98.6 on the same saturated test get compared as if that gap tells you something, when scores bunched that tightly, that close to the ceiling, often reflect noise, overfitting to the test’s particular style of question, or narrow tuning that doesn’t transfer anywhere else. The test hasn’t gone silent. It’s still generating numbers, still getting screenshotted into slide decks. It’s just generating numbers that no longer track the thing they were built to track, which is arguably worse than no number at all, because a missing number invites scrutiny and a confident, precise-looking number usually doesn’t. It’s worth noting that this is a different failure mode from a model having simply seen the test questions before, which is its own problem covered in Data Contamination: When the Model Has Already Seen the Exam. Saturation can happen even with a perfectly clean, never-leaked test, purely because everyone is now optimizing toward it on purpose.
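
To make the "often noise" point concrete, here is a rough back-of-the-envelope check in Python. It is a sketch under stated assumptions, not a recommended evaluation method: the function name accuracy_gap_z and the test sizes are made up for illustration, each question is treated as an independent pass/fail trial, and the two models are treated as if they were scored on independent samples. Real comparisons usually run both models on the same questions, where a paired test would be the better tool. The check also covers only sampling noise, not overfitting or narrow tuning.

```python
import math


def accuracy_gap_z(acc_a: float, acc_b: float, n_questions: int) -> float:
    """Approximate z-score for the gap between two accuracy scores.

    Assumption (illustrative only): each score is a binomial proportion
    measured on n_questions independent questions, and the two scores
    are independent of each other.
    """
    var_a = acc_a * (1 - acc_a) / n_questions
    var_b = acc_b * (1 - acc_b) / n_questions
    se_diff = math.sqrt(var_a + var_b)  # standard error of the difference
    return abs(acc_a - acc_b) / se_diff


# Hypothetical benchmark sizes; the 98.1 vs 98.6 scores come from the text.
for n in (500, 1000, 10000):
    z = accuracy_gap_z(0.981, 0.986, n)
    verdict = "distinguishable" if z > 1.96 else "within noise"
    print(f"n={n:>6}: z={z:.2f} ({verdict} at the usual 95% threshold)")
```

Under these assumptions, a half-point gap near the ceiling does not clear the conventional 95 percent threshold on a test of a thousand questions, and only does so once the test has several thousand. Many widely cited benchmarks are smaller than that.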

Every yardstick has a lifespan

Once a benchmark saturates, the field’s usual response isn’t to fix it. It’s to retire it and move to a newer one, which starts the cycle again: neutral at first, informative for a while, then famous enough to become a target, then saturated. One promising alternative tries to sidestep the whole loop by replacing fixed test questions with something harder to game directly: human votes comparing model outputs head to head, covered in Human Preferences and Elo Rankings. But the underlying pattern is worth sitting with on its own. Every benchmark has a useful lifespan. It’s born as a measurement, dies as a target, and for a good stretch in between everyone keeps citing it and pretending it still means what it used to mean.