Reproducibility and Variance: Same Model, Different Scores

Two models sit on a leaderboard about a percentage point apart. One scores 84.2 percent on a reasoning benchmark, the other 83.1 percent, and the gap is enough to spawn headlines about which lab is “ahead” this quarter. What almost never gets asked is a simpler question: if you ran either model through that same benchmark again tomorrow, would it produce the same number? Often, no. Run it a third time and the score might move again, sometimes enough to erase the gap entirely, sometimes enough to reverse it.

Where the wobble comes from

Part of the answer is a mechanism covered in an earlier piece on temperature and sampling: most models don’t compute one deterministic answer and hand it over. They sample from a probability distribution over possible next words (strictly, tokens). Unless temperature is set to zero and every other source of randomness is pinned down, the same question can yield a different answer each time it’s asked, and a different answer can mean a different grade on a benchmark question that expects one correct format or one correct final value.
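To make the sampling step concrete, here is a minimal sketch of temperature sampling over a handful of candidate tokens. The function name `sample_token` and the three scores are hypothetical, chosen only for illustration; real inference stacks are more involved and have other sources of randomness that this toy ignores.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick one token index from raw scores (logits)."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Divide scores by the temperature, then turn them into softmax weights.
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.5, 0.3]
rng = random.Random(0)

greedy = [sample_token(logits, 0, rng) for _ in range(10)]
sampled = [sample_token(logits, 1.0, rng) for _ in range(10)]
print("temperature 0:  ", greedy)
print("temperature 1.0:", sampled)
```

At temperature zero the toy picks the same token every time. At temperature 1.0 the lower-scoring tokens get chosen some of the time, which is where answer-to-answer differences start.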

Sampling randomness is only one contributor. The exact wording of a prompt matters more than it should. Asking a math question with “Solve for x” versus “Find the value of x” can nudge a model toward a slightly different approach, and across a few hundred benchmark questions those small nudges add up to a measurable shift in the total score. Then there are the quieter differences between what two labs actually ran: a slightly different prompt template, a different number of few-shot examples, a different version of the model behind the same name, a different cutoff for how many tokens the model was allowed to think before answering. None of these show up in the final table. All of them can move the number.
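One way to make those quieter differences visible is to record the evaluation settings next to the score and compare them before comparing numbers. The sketch below is only an illustration: `EvalSettings`, its fields, and the example values are hypothetical, not any lab's actual reporting format.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvalSettings:
    """Hypothetical record of the choices that rarely appear in a results table."""
    model_version: str
    prompt_template: str
    num_few_shot: int
    max_output_tokens: int
    temperature: float
    seed: int

def differing_settings(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    da, db = asdict(a), asdict(b)
    return {name: (da[name], db[name]) for name in da if da[name] != db[name]}

# Made-up settings for two labs evaluating a model with the same public name.
lab_a = EvalSettings("model-x-v1", "Solve for x: {question}", 5, 1024, 0.0, 0)
lab_b = EvalSettings("model-x-v2", "Find the value of x: {question}", 3, 1024, 0.0, 0)

for name, (a_value, b_value) in differing_settings(lab_a, lab_b).items():
    print(f"{name}: {a_value!r} vs {b_value!r}")
```

If the comparison comes back non-empty, the two scores were not produced under the same conditions, whatever the table implies.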

This is the same situation as two runners with nearly identical personal best times. Race them once on a windy afternoon and one crosses the line half a second ahead, and it’s tempting to call that runner definitively faster. Race the same two runners five more times and their finishing order would likely flip more than once, because half a second sits well within the normal variation of how either of them runs on any given day. A single race tells you about that race. It doesn’t settle the question of who is faster in general. A single benchmark run tells you about that run.

Why this matters when reading a leaderboard

A reported benchmark score is not a fixed property of a model the way its parameter count is. It’s one sample drawn from a wider distribution of possible scores that model would produce if tested repeatedly under the same conditions. Run the same model on the same benchmark a dozen times, varying nothing but the random seed, and the results will cluster somewhere, often within a range of a couple of percentage points, rather than landing on one exact value every time. The number in the report is a point picked out of that cluster, not the cluster itself.
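A toy simulation gives a feel for that cluster. It assumes, purely for illustration, a 500-question benchmark and a model that answers every question correctly with the same independent 84 percent chance. Real questions are not equally hard and real runs are not this simple, so the actual spread will differ.

```python
import random
import statistics

def run_benchmark(true_accuracy, n_questions, seed):
    """Simulate one benchmark run; returns the score in percent."""
    rng = random.Random(seed)
    correct = sum(rng.random() < true_accuracy for _ in range(n_questions))
    return 100.0 * correct / n_questions

# Assumed values: 84% underlying accuracy, 500 questions, 12 seeds.
scores = [run_benchmark(0.84, 500, seed) for seed in range(12)]

print("scores:", [f"{s:.1f}" for s in scores])
print(f"lowest {min(scores):.1f}, highest {max(scores):.1f}")
print(f"mean {statistics.mean(scores):.1f}, "
      f"standard deviation {statistics.stdev(scores):.1f}")
```

Each entry in `scores` is the kind of number that could have ended up in a report. Only the seed changed between them.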

That distinction changes how a one-point difference between two models should be read. It also pairs with a related trap covered in the piece on accuracy and its traps: even a perfectly repeatable score can mislead, for reasons that have nothing to do with variance.

What the leaderboard isn’t telling you

A meaningful share of the gaps reported between rival models, the one percentage point ahead, the one spot higher on a chart, falls inside the ordinary noise of running a benchmark more than once. The leaderboard looks like a clean, stable ranking because it shows one run per model, frozen into a table. What it doesn’t show is the spread: the range of scores each model would actually produce if someone ran the test a dozen times and reported the whole distribution instead of a single favorable draw. Until that spread is visible, the table is more confident than the underlying reality, and the ranking it implies is less settled than it looks.
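For a rough sense of scale, the sketch below compares the 84.2 versus 83.1 gap from the opening with the standard error of the difference between two proportions. It rests on assumptions the leaderboard does not supply: a benchmark of 500 questions, and scores treated as independent samples. A paired comparison on the same questions, or actual repeated runs, would give a better answer.

```python
import math

def gap_in_standard_errors(score_a, score_b, n_questions):
    """Return (gap, standard error of the gap, gap / standard error).

    Scores are in percent; gap and standard error are in percentage points.
    Treats each score as an independent binomial proportion, a simplification.
    """
    pa, pb = score_a / 100.0, score_b / 100.0
    variance = pa * (1 - pa) / n_questions + pb * (1 - pb) / n_questions
    se = 100.0 * math.sqrt(variance)
    gap = score_a - score_b
    return gap, se, gap / se

# 500 questions is an assumed benchmark size, not a figure from any real test.
gap, se, ratio = gap_in_standard_errors(84.2, 83.1, 500)
print(f"gap {gap:.1f} points, standard error {se:.1f} points, ratio {ratio:.2f}")
```

Under these assumptions the gap is smaller than one standard error of the difference, which is another way of saying a single run cannot separate the two models.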