Ask a language model to write a short story about a nurse and a doctor, and don’t specify genders. A large share of the time, the nurse comes out female and the doctor male. Ask it to describe a “brilliant” person versus a “diligent” person, and certain names or backgrounds will show up more often attached to one word than the other. Nobody wrote a rule saying doctors are men. Nobody typed “associate diligence with this group and brilliance with that one.” The model produced these patterns anyway, because it learned them from somewhere, and that somewhere is us.

Sediment in the water

A large language model is trained on an enormous volume of ordinary text: news articles, forum posts, product reviews, novels, code comments, wikis, and social media threads, scraped from across the internet by the billions of words. That text was written by people, and people write from inside a world that already has imbalances baked into it. In that writing, doctors have historically been referred to as men more often than as women. Certain job titles cluster with certain names. Certain adjectives cluster with certain groups. None of this needs a conspiracy or even conscious intent behind it. It just needs enough ordinary writing, repeated often enough, for a statistical association to take hold.

That’s the part worth picturing clearly. Think of the training data as water carrying dissolved sediment. Filter out the twigs and leaves floating on the surface (the slurs, the openly hateful sentences, the content anyone would flag on sight) and the water looks clear. But the minerals that give the water its actual character aren’t floating on top. They’re dissolved all the way through the volume, invisible, and they pass straight through a filter built to catch visible debris. A model trained on that water doesn’t absorb a rule that says “women are nurses.” It absorbs a statistical tilt, built from millions of small, unremarkable sentences each nudging the association a fraction of a percent, until the tilt shows up reliably in what the model produces. Nobody programmed the association. It accumulated.
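To make the accumulation concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the toy corpus, the 70/30 split, and the helper names `pronoun_counts` and `tilt` are assumptions, not measurements of any real dataset or a description of how any real model is trained. It only shows that a tilt can be read off a pile of sentences even though no single sentence states a rule.

```python
from collections import Counter

# Hypothetical toy corpus: every sentence is unremarkable on its own.
# The 70/30 split is an invented number for illustration, not a measurement.
corpus = (
    ["the nurse said she would check the chart"] * 70
    + ["the nurse said he would check the chart"] * 30
    + ["the doctor said he would review the scan"] * 70
    + ["the doctor said she would review the scan"] * 30
)

def pronoun_counts(sentences, profession):
    """Count which pronoun appears in sentences that mention a profession."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if profession in words:
            for pronoun in ("she", "he"):
                if pronoun in words:
                    counts[pronoun] += 1
    return counts

def tilt(sentences, profession):
    """Share of pronoun mentions alongside `profession` that are 'she'."""
    counts = pronoun_counts(sentences, profession)
    total = counts["she"] + counts["he"]
    return counts["she"] / total if total else 0.0

print(tilt(corpus, "nurse"))   # 0.7
print(tilt(corpus, "doctor"))  # 0.3
```

No line in that corpus says “nurses are women.” The 0.7 only exists in the aggregate, which is the sense in which the association accumulates rather than being written down anywhere.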

Why “we cleaned the dataset” should make you skeptical

This is why claims that a company “cleaned” or “de-biased” its training data deserve a raised eyebrow rather than automatic relief. Cleaning a dataset usually means running filters that catch explicit slurs, obviously hateful text, and a list of flagged terms. That’s real work and it does remove some genuinely harmful material. But it mostly operates on the floating debris, the visible offenders. It does very little to the dissolved layer: the ten thousand unremarkable sentences where a profession quietly skews toward one gender, or a name quietly correlates with a sentiment, or a nationality quietly correlates with a topic. Removing, say, the most flagrant 0.01 percent of the text doesn’t touch the correlation carried by the other 99.99 percent. So when a product announcement says the training data has been scrubbed of bias, a reasonable response is to ask which layer they mean: the surface, or the sediment. Usually it’s the surface.
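The arithmetic behind that 0.01 percent point can be sketched in a few lines of Python. This is a toy under stated assumptions: the corpus, its proportions, the `FLAGGED_TERM` placeholder, and the helpers `she_share` and `surface_filter` are all hypothetical, and real cleaning pipelines are more elaborate than a single blocklist check.

```python
# Hypothetical corpus of 10,000 sentences; all proportions are invented.
FLAGGED = "FLAGGED_TERM"  # stand-in for whatever a blocklist would catch

corpus = (
    [f"the nurse is a {FLAGGED} and she knows it"] * 1   # 0.01% flagrant text
    + ["the nurse said she would check the chart"] * 6999
    + ["the nurse said he would check the chart"] * 3000
)

def she_share(sentences):
    """Share of pronoun-bearing sentences that use 'she' rather than 'he'."""
    she = sum("she" in s.split() for s in sentences)
    he = sum("he" in s.split() for s in sentences)
    return she / (she + he)

def surface_filter(sentences):
    """Drop only the sentences containing the flagged placeholder term."""
    return [s for s in sentences if FLAGGED not in s]

cleaned = surface_filter(corpus)
before = she_share(corpus)
after = she_share(cleaned)

print(len(corpus) - len(cleaned))  # 1 sentence removed
print(before)                      # 0.7
print(round(after, 5))             # about 0.69997
```

The filter did its job: the one flagrant sentence is gone. The skew in everything that remains has barely moved, because it was never carried by that sentence in the first place.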

Bias doesn’t disappear; it relocates

The deeper issue is that bias in these systems isn’t really a content problem you can spot-check word by word. It’s a distributed statistical pattern, spread thin across text that individually looks completely unremarkable. Deleting the worst offenders is no different from removing a few grains of one mineral from a lake and calling the water pure. Filter aggressively enough and you can shift where the correlation shows up, pushing it from one phrasing into another, or from one context into a subtler one, without ever eliminating the underlying tilt the model learned. That’s the mechanism to keep in view: not “did they remove the bad words,” but “did they change the statistics,” because the statistics are what the model actually learned from, and they’re what filtering rarely touches.
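One way to picture relocation is a toy case where the same tilt is carried by two surface forms at once. The Python sketch below is an illustration only: the corpus, the names, the marker tables, and the helpers `female_share` and `drop_pronoun_sentences` are all hypothetical, and treating a name as a gender marker is a deliberate oversimplification. It describes the data, not what any particular model does with it.

```python
# Hypothetical toy corpus; names, proportions, and marker lists are invented.
PRONOUNS = {"she": "f", "he": "m"}
NAMES = {"maria": "f", "james": "m"}  # crude stand-ins for gendered names

corpus = (
    ["the nurse said she was busy"] * 70
    + ["the nurse said he was busy"] * 30
    + ["nurse maria checked the chart"] * 70
    + ["nurse james checked the chart"] * 30
)

def female_share(sentences, markers):
    """Share of marker hits tagged 'f' under the given marker table."""
    hits = [markers[w] for s in sentences for w in s.split() if w in markers]
    return hits.count("f") / len(hits) if hits else None

def drop_pronoun_sentences(sentences):
    """An 'aggressive' filter: remove every sentence with a gendered pronoun."""
    return [s for s in sentences if not any(w in PRONOUNS for w in s.split())]

filtered = drop_pronoun_sentences(corpus)

print(female_share(filtered, PRONOUNS))  # None: the pronoun channel is gone
print(female_share(filtered, NAMES))     # 0.7: the same tilt, via names
```

Checked by the pronoun measure, the filtered data looks clean. Checked by a different measure, the 70/30 split is still sitting there, which is why “did they remove the bad words” and “did they change the statistics” can have different answers.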

A later piece in this series digs into the difference between training and inference, which explains why a bias baked in at one stage can behave so differently once the model is actually running and generating text.