Training Data: Where All That Text Actually Comes From

Ask how a chatbot knows so much, and the usual answer is “it was trained on the internet.” Said like that, it sounds simple, almost tidy, as if there’s one big archive somewhere labeled “the internet” and a company just pointed a model at it and pressed go. In practice there is no such archive. What actually gets fed into a model is a painstakingly assembled mixture, put together from several very different sources, each with its own quirks, gaps, and legal exposure. Understanding that mixture explains a lot about why some models are better than others, and why lawyers are now as busy in this industry as engineers.

What actually goes into the mix

The largest single ingredient is usually a broad web crawl: billions of pages pulled from news sites, blogs, forums, product listings, and personal sites, collected by automated crawlers that have been indexing the open web for years. On top of that sits licensed material: digitized books, academic journals, news archives, and other text a company has paid for or struck a deal to use, which tends to be cleaner and more carefully written than the average web page. Code repositories contribute a separate stream entirely, teaching a model the structure of programming languages rather than prose. And some datasets fold in user-contributed material such as forum answers, product reviews, and chat transcripts, sometimes gathered with consent, sometimes scraped alongside everything else.
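To make the idea of a mixture concrete, here is a minimal Python sketch of deciding which source each training document is drawn from. The four source names follow the paragraph above; the weights and the `sample_sources` helper are hypothetical, invented purely for illustration, and do not describe any real model's recipe.

```python
import random
from collections import Counter

# Hypothetical mixture weights, invented for illustration only.
# Real proportions vary by lab and are often not published.
MIXTURE = {
    "web_crawl": 0.60,
    "licensed_text": 0.20,
    "code": 0.15,
    "user_contributed": 0.05,
}


def sample_sources(mixture, n, seed=0):
    """Pick which source each of n training documents is drawn from."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    total = 10_000
    counts = Counter(sample_sources(MIXTURE, total))
    for name, count in counts.most_common():
        print(f"{name}: {count / total:.1%}")
```

Changing the numbers in `MIXTURE` is, in miniature, the kind of decision the rest of this piece is about: the same raw sources can be blended in very different proportions.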

Think of it the way a restaurant thinks about its ingredients. Any kitchen can buy the same generic flour, sugar, and cooking oil from the same wholesale suppliers everyone else uses. That’s the open web, sitting there for anyone with a crawler to take. What separates a forgettable kitchen from an exceptional one isn’t the size of the flour sack. It’s access to specific, carefully sourced ingredients that competitors can’t just order from the same catalog: a particular olive oil, a relationship with one farm, a spice blend nobody else has. A training dataset works the same way. The bulk web scrape gets everyone to roughly the same starting point. What’s licensed, curated, or specially prepared on top of that is where the real difference in the finished dish shows up.

Why this has become a legal minefield

That web-scraped layer is also where most of the controversy lives. A huge share of it is copyrighted material (articles, books, images, song lyrics) that was written or published without anyone imagining it would end up training a commercial model. Whether using that material to train a model counts as fair use, a transformative act that doesn’t require the original author’s permission, or whether it requires licensing like any other commercial use of copyrighted work, is genuinely unresolved. Multiple lawsuits are working their way through courts right now, brought by authors, news organizations, and other rights holders against companies that build these models, and none of them has produced a settled, industry-wide answer. This isn’t a footnote. It touches how every major model in use today was built, and it’s one of the reasons some companies have started signing licensing deals with publishers rather than waiting to find out how the litigation resolves.

The real edge isn’t the pile, it’s the sourcing

The open web is not an infinite resource. It’s finite, already indexed many times over, and every serious lab has by now scraped most of what’s reachable and usable. Piling up more raw text from the same wholesale suppliers stops being the advantage it once was, because everyone is buying from the same catalog. What’s left to compete on is the quality of what a model is actually fed: text that’s been curated for accuracy and relevance, licensed from sources nobody else has access to, or generated synthetically and filtered to teach specific skills a raw scrape never covers well. The next phase of this competition won’t be won by whoever has the biggest pile of scraped text. It will be won by whoever controls the quality of the ingredients going into the pot.
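As a rough illustration of what “curated” can mean in practice, the sketch below drops duplicate documents and applies a deliberately crude quality check. Everything here is an assumption made for the example: the helper names (`normalize`, `looks_usable`, `curate`), the thresholds, and the heuristic itself. Real curation pipelines are far more elaborate.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def looks_usable(text, min_words=5, max_repeat_ratio=0.5):
    """Toy quality check: long enough, and not dominated by one repeated word.

    The thresholds are hypothetical defaults, not tuned values.
    """
    words = normalize(text).split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio


def curate(documents):
    """Keep the first copy of each usable document.

    Documents that are identical after normalization count as duplicates.
    """
    seen = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen or not looks_usable(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Even a filter this simple changes what survives from a raw scrape, which is why control over the filtering and sourcing steps matters more than the raw size of the pile.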

If you’re curious how the patterns sitting inside that training mix end up shaping what a model says, “Bias: How Training Data Becomes Prejudice” looks at that process directly. And since assembling a dataset is only half the story, the next piece, “Compute and GPUs: Why Hardware Decides Who Gets to Play,” looks at the physical infrastructure required to actually train on all of it.