Training vs Inference: Two Completely Different Phases

A woman spends twenty minutes teaching a chatbot the difference between her two business partners, correcting it twice when it mixes up their names. By the end of the conversation it’s using the right name every time. The next morning she opens a fresh conversation, asks a related question, and the assistant mixes up the names again, as if the correction never happened. She isn’t imagining things. It didn’t happen, not in the way she thinks it did.

Two different machines wearing one name

What looked like the model “learning” her partners’ names during that conversation was something else: inference. Inference is what runs every time you send a prompt and get a reply. The model reads everything currently in front of it, your correction included, and generates its response token by token based on that immediate context, the way you might glance back at notes mid-conversation. Nothing about the model itself changes. Its parameters, the billions of internal numbers that encode everything it “knows,” don’t shift by even a fraction. Close the window and that context, the only place her correction ever lived, is gone.

Training is the other machine entirely: the months-long, extraordinarily expensive process where a model’s parameters actually get adjusted. Engineers feed it enormous quantities of text, the model predicts the next word, gets scored against what actually came next, and nudges its internal numbers a tiny bit toward being less wrong. Repeat that across trillions of words of text, on racks of specialized hardware, and eventually you get a model that generalizes well enough to hold a conversation at all. That process runs once, or gets rerun periodically for a new version, then stops. The resulting parameters are frozen and shipped. Everything after that point, every chat, every prompt, every API call from every user on earth, is inference: using those frozen parameters, without touching them.
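The predict, score, nudge loop and the frozen-parameter phase that follows it can be sketched in a few lines. This is a toy under heavy assumptions, not how any real model is built: a single number stands in for billions of parameters, a squared-error score stands in for next-word scoring, and every name in it (predict, train_step, frozen_w) is made up for illustration.

```python
# Toy stand-in for a language model: one parameter instead of billions.
# All names and numbers here are hypothetical, for illustration only.

def predict(w, x):
    """Inference: read the parameter, never modify it."""
    return w * x

def train_step(w, x, target, lr=0.1):
    """Training: score the prediction, then nudge the parameter."""
    error = predict(w, x) - target   # how wrong was the guess?
    gradient = 2 * error * x         # slope of the squared error w.r.t. w
    return w - lr * gradient         # small step toward being less wrong

# Training phase: repeat the nudge many times, then stop.
w = 0.0
for _ in range(200):
    w = train_step(w, x=1.0, target=3.0)

frozen_w = w  # the "shipped" parameter

# Inference phase: any number of calls, parameter untouched.
answers = [predict(frozen_w, x) for x in (1.0, 2.0, 5.0)]
```

Only train_step ever returns a new value for the parameter. Once the loop ends, predict can be called as often as you like and frozen_w stays exactly where training left it.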

Picture training as writing and typesetting a textbook, an enormous, one-time undertaking of research, drafting, editing, and printing plates. Once it’s done, the book doesn’t change no matter how many students read it. Each student’s individual read, the equivalent of inference, is cheap and leaves no mark on the book itself. The book doesn’t rewrite itself because one reader scribbled a correction in the margin. The next student gets the same unchanged text. That’s why a chatbot doesn’t learn from talking to you. Talking to it only ever runs inference. Your words shape the replies in that one conversation, then get discarded, the same way marginalia never travel back to the printing plates.

Why the confusion is so persistent

The interface makes this hard to believe. A conversation feels continuous because the product quietly re-feeds your chat history back into the model with every new message, so it looks like memory when it’s really a longer piece of context reread from scratch each time. Features marketed as giving an assistant “memory” across sessions look different from the outside but work the same way underneath: they store a summary of past chats somewhere and paste it back into the prompt later. Still inference, just with a longer prompt. None of it moves a single parameter. The model shipped on day one is, mechanically, the exact same model still running today, no matter how many millions of conversations have passed through it.
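A minimal sketch of that re-feeding, assuming a made-up frozen_model function that can see only the prompt it is handed on each call. The ChatSession class, the partner names Dana and Lee, and the “Saved note” format are all hypothetical; real products differ in the details, but the shape is the same: the history lives in the product layer, not in the model.

```python
# Hypothetical stand-in for a frozen model: it sees only the prompt it is
# handed on this call. Nothing persists inside it between calls.
def frozen_model(prompt: str) -> str:
    if "Dana handles finance" in prompt:
        return "Dana handles finance."
    return "I'm not sure which partner handles finance."

class ChatSession:
    """The product layer, not the model, keeps the history."""

    def __init__(self, memory: str = ""):
        # A "memory" feature is just saved text pasted in ahead of the chat.
        self.history = [f"Saved note: {memory}"] if memory else []

    def send(self, message: str) -> str:
        self.history.append(f"User: {message}")
        prompt = "\n".join(self.history)  # whole history re-fed every turn
        reply = frozen_model(prompt)
        self.history.append(f"Assistant: {reply}")
        return reply

monday = ChatSession()
monday.send("Correction: Dana handles finance, not Lee.")
same_chat = monday.send("Who handles finance?")    # correction is in the prompt

tuesday = ChatSession()                            # fresh window, empty history
new_chat = tuesday.send("Who handles finance?")    # correction is gone

remembered = ChatSession(memory="Dana handles finance.")
with_memory = remembered.send("Who handles finance?")  # note pasted back in
```

In all three sessions frozen_model is the same unchanged function. The only thing that differs is the text placed in front of it.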

The cost nobody’s tracking

Training gets almost all the public attention because it produces a dramatic, countable number: months of compute, a headline figure compared release over release. That cost is paid once per model version, then it’s over. Inference is the opposite kind of expense: trivial per call, a fraction of a cent for a single reply, easy to wave off. Except it isn’t one call, it’s billions of calls a day, across every user, every app built on these models, every automated system calling one in the background. Multiply a tiny number by that many calls, repeated daily, and it can quietly outgrow the training budget that made headlines. Very little public conversation about AI costs reflects this. It’s still framed around what it took to build the model, when what increasingly drives the real spending is what it costs, over and over, just to keep answering.
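The arithmetic is easy to run for yourself. Every figure below is a placeholder invented for illustration, not a reported cost for any real model; swap in your own estimates to see how quickly a per-call cost catches up with a one-time bill.

```python
# All three inputs are made-up placeholders, NOT real figures.
training_cost = 100_000_000      # one-time cost per model version, in dollars
cost_per_call = 0.002            # a fraction of a cent per reply, in dollars
calls_per_day = 1_000_000_000    # total calls across all users and apps

# Inference is paid again every day the model stays in service.
daily_inference_cost = cost_per_call * calls_per_day

# Days of serving until cumulative inference spend equals the training bill.
days_to_match_training = training_cost / daily_inference_cost

print(f"Inference per day: ${daily_inference_cost:,.0f}")
print(f"Days until inference spend matches training: {days_to_match_training:.0f}")
```

The specific result matters less than the structure: one side of the comparison is a fixed sum, while the other keeps growing for as long as the model is answering.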

That shift from training as the dominant cost to inference as the dominant cost is happening mostly out of view, and it’s reshaping decisions about model size, pricing, and hardware more than any single training run does. A later piece in this series looks at how a raw, freshly trained model turns into a helpful assistant at all; if you want that next step, “From Raw Model to Assistant: Fine-Tuning and RLHF” picks up right where this leaves off. Further down the line, this series will also cover how these models reach outside themselves to call external tools, a different problem again.