A compact, inexpensive model climbs to within a few points of the very top of a widely followed public leaderboard, ahead of several systems that cost more to run and that score higher on tests of graduate-level science questions or multi-step math problems. Nothing about the compact model’s underlying capability changed overnight. What changed is which kind of leaderboard is being read. This one wasn’t built from exam questions with correct answers. It was built from people, thousands of them, voting on which of two responses they liked better.

How a vote becomes a ranking

The setup behind these “arena-style” leaderboards is simple to describe. A person types a question or prompt, and two different models answer it, labeled only “A” and “B.” The person doesn’t know which model produced which response. They read both and pick the one they prefer. That single choice gets logged, and the same thing happens thousands of times a day, across thousands of different people and prompts, with the model pairings shuffled each time.

Individual votes like that don’t automatically add up to a ranking, though, since model A might beat model B while losing to model C, and every model has faced a different mix of opponents. So these votes get run through the Elo rating system, the same one built decades ago to rank chess players. Every model starts with some baseline score, and each pairwise vote nudges both scores up or down depending on the outcome and on how strong the opponent was rated going in. Beating a model that was already rated well earns more points than beating a model rated poorly, and losing to a weak model costs more than losing to a strong one. Run enough comparisons and the scores settle into a stable ranking, the way a chess player’s rating eventually reflects real strength across hundreds of games rather than any single result.
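
The update rule described above can be sketched in a few lines of Python. This is a minimal illustration of the textbook Elo formula, not the code any particular leaderboard runs. The step size of 32, the 400-point scale, and the sample ratings are conventional chess-style values assumed here for illustration, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Chance that A is preferred over B, according to the Elo model.

    The 400-point scale is the conventional chess value (an assumption here).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one vote between A and B.

    k is the step size; 32 is a common default, assumed for illustration.
    """
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    # The winner gains exactly what the loser gives up, and the size of the
    # transfer depends on how surprising the result was.
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Evenly rated models: the winner takes half of k.
    print(update(1000.0, 1000.0, a_won=True))
    # An upset: a 1000-rated model beats a 1200-rated one and gains more.
    print(update(1000.0, 1200.0, a_won=True))
    # The favorite wins as expected and gains less.
    print(update(1200.0, 1000.0, a_won=True))
```

The only input to an update is who won and what the two ratings were beforehand. That is why beating a highly rated model moves the score further than beating a poorly rated one.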

It works something like a blind taste test between two dishes, where tasters know nothing about the ingredients or the kitchen and simply pick whichever plate they enjoyed more in that moment. Run that test across enough tasters and enough pairings, and you get a clear, stable ranking of which dishes people preferred. What you don’t get is any information about which dish was more nutritious. Those can be, and often are, two completely different rankings.

Where this kind of ranking misleads

That gap matters because a preference vote and a correctness check are not measuring the same thing, even when they’re dressed up in the same leaderboard format. A response can be wrong and still win the vote, if it’s phrased more confidently, formatted more cleanly, or simply longer and more thorough-looking. A response can be right and still lose, if it reads as blunt, hedged, or unglamorous. People voting quickly through a queue of prompts are not fact-checking each answer against a reference source. They are reacting to how the answer feels to read, in a few seconds, without follow-up questions and often without domain expertise in whatever the prompt happened to be about.

That creates a real incentive for whoever is building these models. A model can gain rating points by becoming genuinely more helpful and accurate, but it can gain the same points by becoming better at the surface qualities voters reward: warmer tone, tidier structure, more agreeable phrasing, more confident delivery regardless of whether that confidence is earned. Nothing in the voting process tells the two apart. The score goes up either way.
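
A small sketch makes the point concrete. It assumes the same textbook Elo update as a chess rating (step size 32, baseline 1000, both illustrative), and it attaches a hypothetical "reason" field to each vote purely for demonstration. The arena described here records only the choice, so the rating calculation never sees why a response won. The model names and reasons are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    winner: str
    loser: str
    reason: str  # hypothetical annotation; the rating math never reads it


def elo_ratings(votes, k: float = 32.0, baseline: float = 1000.0):
    """Replay a list of votes and return the resulting ratings.

    k and baseline are illustrative defaults, not values from any real arena.
    """
    ratings = {}
    for vote in votes:
        r_win = ratings.setdefault(vote.winner, baseline)
        r_lose = ratings.setdefault(vote.loser, baseline)
        expected_win = 1.0 / (1.0 + 10 ** ((r_lose - r_win) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[vote.winner] = r_win + delta
        ratings[vote.loser] = r_lose - delta
    return ratings


# Two made-up vote logs with identical outcomes but different causes.
accuracy_wins = [Vote("model_x", "model_y", "answer was correct")] * 5
style_wins = [Vote("model_x", "model_y", "sounded more confident")] * 5

# The ratings come out identical, because only the outcome is an input.
assert elo_ratings(accuracy_wins) == elo_ratings(style_wins)
```

Whatever drove the five wins, the two logs produce the same numbers.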

What the number is actually telling you

None of that makes arena-style rankings worthless. Knowing which responses real people tend to prefer is genuinely useful, and it captures things that fixed, exam-style benchmarks miss entirely: tone, clarity, and how an answer feels to actually receive. But the ranking answers a narrower question than it appears to. It tells you which response people preferred in the moment they read it, not which one was correct. A model can rise on this leaderboard by getting better at sounding right, which is a different skill from being right, and the ranking has no way to charge it for the difference.

Fixed benchmarks have their own failure mode, which is worth understanding alongside this one; it is covered in Goodhart’s Law: When the Score Stops Measuring Anything. A related approach, which keeps the scale of an arena while trying to check for correctness rather than just preference, is the subject of the next piece in this series, LLM-as-Judge: Using AI to Grade AI.