Evaluating Safety: What Red Teaming Actually Measures
A company announces a new model with a familiar list of numbers: it scores higher than its predecessor on a reasoning test, edges out a rival on a coding benchmark, sets a new mark on a math evaluation. Further down the announcement, in a shorter paragraph, there’s a mention that the model underwent several weeks of safety testing involving outside specialists before release. No score accompanies that sentence. No percentage, no ranking, no chart. The capability claims arrive with receipts. The safety claim arrives as a statement to be taken on trust. That gap isn’t an oversight. It reflects something real about how these two kinds of evaluation work.
What red teaming actually involves
Red teaming means deliberately trying to break a model before the public ever gets the chance. A dedicated group (sometimes employees of the company building the model, sometimes outside experts brought in specifically because they don’t share the builder’s blind spots) spends time trying to manipulate the system into doing things it shouldn’t: producing dangerous instructions, generating content that should have been refused, leaking information it was supposed to keep confidential, or behaving in ways that would embarrass or harm someone if it happened in public instead of in a controlled test. A serious exercise might involve dozens of testers working over several weeks, each probing from a different angle, before the model is released. The point isn’t to prove the model is unbreakable. It’s to find as many of the breakable spots as possible while the consequences are still contained to a report instead of a headline.
The difference between a capability benchmark and a red-teaming exercise is close to the relationship between a car’s horsepower and its crash safety rating. Horsepower is one number, printed on the spec sheet, that any two cars can be compared on at a glance. Crash safety isn’t like that. It’s a collection of separate tests (front impact, side impact, rollover risk), each producing its own result, assembled by different evaluators, and rarely reducible to a single figure a buyer casually quotes at a dinner party. That doesn’t make crash safety less important than horsepower. It just doesn’t compress into a headline the same way. Red teaming sits on the crash-safety side of that divide: real, rigorous, consequential, and structurally resistant to being flattened into one comparable score. Some of what red teaming looks for, like whether a model can be steered off its instructions by content it was only asked to read, is detailed in the companion piece on prompt injection, one specific failure mode among the many a red team tries to surface.
Why this asymmetry shapes what people worry about
A capability benchmark produces exactly the kind of artifact a press release wants: a clean percentage, comparable across model versions and across competitors, easy to put in a headline and easier still to repeat in casual conversation. Safety evaluation resists that treatment. The useful output of a red-teaming exercise isn’t a score. It’s a list of specific failure modes and whatever mitigations were built in response, often described in general terms rather than published in full, since detailing exactly how a model was broken would hand a manual to anyone trying to break the next one. That’s a reasonable tradeoff for the company running the test. But it means the public conversation about a new model ends up lopsided almost by construction. One side of the ledger produces a number people can screenshot and compare. The other produces a paragraph of reassurance that has to be taken partly on trust, because full transparency about vulnerabilities would itself create risk. The benchmark side of this comparison has its own limits worth understanding, but at least it hands the public something concrete to argue about.
The number that isn’t there
None of this means safety testing is weaker or less rigorous than capability testing. It means safety doesn’t produce a public, comparable score the way capability does. There’s no widely cited single figure a company can drop into a press release the way it cites a benchmark percentage, no “92% safe” that would even mean anything if printed. And that absence is doing more work than it looks like. It’s a large part of why hype about what these models can do keeps racing ahead of public concern about what they might do wrong: one side of the story comes with a number everyone can point to and repeat, and the other, by its nature, doesn’t.