LLM-as-Judge: Using AI to Grade AI

Picture a research team testing a new chatbot. They need to know which of its answers are actually good, across ten thousand test questions. Hiring human raters to read and score ten thousand responses would take weeks and cost a small fortune. Instead, the team feeds every response, one at a time, to a strong existing model and asks it to act as the grader: read the answer, compare it to a rubric or a reference answer, and assign a score or a verdict. By morning, all ten thousand are scored. No panel of humans could have kept that pace. This approach, called LLM-as-judge, is now one of the most common ways AI systems get evaluated.

How the grading actually works

The setup is simple enough. A “judge” model is given a question, the answer a different model produced, and sometimes a reference answer or a short rubric describing what a good response should contain. The judge reads all of it and returns something usable: a score from one to ten, a label like “correct” or “incorrect,” or a preference between two candidate answers. It’s fast, it’s consistent in the sense that it never gets tired or distracted, and it scales to any volume you can afford in compute.
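
The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular team's pipeline: the prompt wording, the "Score: N" reply format, and the names build_judge_prompt, parse_score, and grade are all assumptions made for the example, and the judge is passed in as a plain function so that no specific model API is implied.

```python
import re
from typing import Callable, Optional


def build_judge_prompt(question: str, answer: str,
                       reference: Optional[str] = None,
                       rubric: Optional[str] = None) -> str:
    """Assemble the text shown to the judge (wording is hypothetical)."""
    parts = [
        "You are grading an answer to a question.",
        f"Question:\n{question}",
        f"Answer to grade:\n{answer}",
    ]
    # The reference answer and rubric are optional, as in the text above.
    if reference is not None:
        parts.append(f"Reference answer:\n{reference}")
    if rubric is not None:
        parts.append(f"Rubric:\n{rubric}")
    parts.append(
        "Reply with a line of the form 'Score: N', "
        "where N is an integer from 1 to 10."
    )
    return "\n\n".join(parts)


def parse_score(reply: str) -> Optional[int]:
    """Pull the score out of the judge's reply; None if it is unusable."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


def grade(judge: Callable[[str], str], question: str, answer: str,
          reference: Optional[str] = None,
          rubric: Optional[str] = None) -> Optional[int]:
    """Run one grading call. `judge` is any function from prompt to reply."""
    prompt = build_judge_prompt(question, answer, reference, rubric)
    return parse_score(judge(prompt))


# Stand-in judge for demonstration only; a real one would call a model.
def fake_judge(prompt: str) -> str:
    return "The answer matches the reference.\nScore: 7"


print(grade(fake_judge, "What is 2 + 2?", "4", reference="4"))
```

Returning None for a reply that cannot be parsed keeps malformed verdicts from being silently counted as scores, which matters when the same loop runs over thousands of responses.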

The trouble is that a judge model isn’t a neutral instrument. It’s a model with its own habits of phrasing, its own sense of what a well-formed answer looks like, learned from its own training. Think of a massive writing contest with thousands of entries, but instead of a diverse panel of judges, every single entry is graded by one teacher. That teacher happens to love elaborate, flowery prose and has a mild distaste for short, plain sentences. Within a few rounds, that teacher’s personal taste quietly becomes the de facto definition of “good writing” for the entire contest. Entrants who write to flatter that specific teacher’s style gain an advantage that has nothing to do with how well they actually write. A plain, correct, concise entry can lose to a padded, ornate, less correct one simply because it read more like what the teacher already liked.

Where the bias shows up

Two patterns show up again and again when researchers test judge models directly. First, length: judges tend to rate longer answers higher, even when a shorter answer contains the same correct information with less padding. An answer that restates the question, adds a few extra qualifiers, and closes with a tidy summary often scores better than a terse answer that simply gets it right. Second, style: a judge model tends to score answers more favorably when they’re structured the way it would have written them itself, with the headers, hedges, and phrasing its own training nudges it toward. When the model being graded happens to share a training lineage or a similar house style with the judge, that resemblance alone can nudge scores upward, independent of accuracy.
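
One way to probe the length tilt is a paired comparison: take answers with the same correct content, write a terse version and a padded version of each, score both with the same judge, and look at the gap. The sketch below assumes that setup. The function name length_tilt and the example scores are invented for illustration and are not results from any study.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def length_tilt(paired_scores: List[Tuple[float, float]]) -> Dict[str, Optional[float]]:
    """Summarize how a judge treats padded answers.

    Each pair is (score_for_terse_version, score_for_padded_version),
    where both versions carry the same correct information.
    """
    gaps = [padded - terse for terse, padded in paired_scores]
    padded_wins = sum(gap > 0 for gap in gaps)
    ties = sum(gap == 0 for gap in gaps)
    decided = len(gaps) - ties
    return {
        # Average score advantage of the padded version.
        "mean_gap": mean(gaps),
        # Share of non-tied pairs where the padded version scored higher.
        "padded_win_rate": padded_wins / decided if decided else None,
    }


# Made-up scores, purely to show the shape of the input.
example_pairs = [(6, 8), (7, 7), (5, 8), (9, 8)]
print(length_tilt(example_pairs))
```

A mean gap near zero and a win rate near one half would suggest no length preference on that sample; a consistently positive gap is the directional tilt described above.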

Neither of these preferences is random noise. They’re consistent, directional tilts, the same kind of systematic pattern discussed in “Bias: How Training Data Becomes Prejudice.” A model doesn’t grade from some neutral vantage point; it grades from inside the statistical habits it absorbed during training. A judge model has preferences the same way any model has preferences, and grading is just another task where those preferences get to act.

What the score is actually telling you

None of this makes LLM-as-judge useless. It makes it something narrower than it’s usually presented as. A score from a judge model isn’t a measurement of quality floating free of any particular perspective, the way a ruler measures length regardless of who’s holding it. It’s a measurement of how closely an answer matches the taste of one specific model, on one specific day, shaped by whatever text that model happened to be trained on. Swap the judge for a different model and some of the rankings shift, because the standard being applied shifts with it. What looks like an objective grade is really a report on how well something conforms to one model’s idea of what a good answer sounds like, not a verdict on whether it actually is one.
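
The claim that rankings shift when the judge is swapped can be checked directly by scoring the same candidates with two judges and comparing the orderings. The sketch below uses Kendall's tau, one common rank-agreement measure, chosen here only as an example. The names ranking and kendall_tau, the candidate labels, and the scores are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def ranking(scores: Dict[str, float]) -> List[str]:
    """Candidates ordered from highest to lowest score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Agreement between two judges' orderings, from -1 to 1.

    Simple tau-a: tied pairs count as neither agreement nor disagreement.
    Both dicts must score the same candidates.
    """
    names = list(scores_a)
    if len(names) < 2:
        raise ValueError("need at least two candidates to compare")
    concordant = discordant = 0
    for x, y in combinations(names, 2):
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1   # both judges order this pair the same way
        elif product < 0:
            discordant += 1   # the judges disagree on this pair
    total_pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up average scores from two different judge models.
judge_a = {"model_x": 8.1, "model_y": 7.4, "model_z": 6.9}
judge_b = {"model_x": 7.0, "model_y": 7.8, "model_z": 6.5}

print(ranking(judge_a))  # ['model_x', 'model_y', 'model_z']
print(ranking(judge_b))  # ['model_y', 'model_x', 'model_z']
print(kendall_tau(judge_a, judge_b))
```

A value of 1 means the two judges rank every pair the same way. Anything lower means the winner depends partly on which judge was chosen.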

This is the automated cousin of a simpler idea covered in “Human Preferences and Elo Rankings,” where human voters, not a model, decide which answer wins. The mechanism is the same, with only the judge swapped out: whoever or whatever gets to define “good” ends up deciding the outcome, so it pays to ask who that is before trusting the score.