Models That ‘Reason’: Chain-of-Thought and Inference-Time Compute

Ask a newer chatbot a tricky word problem, something like splitting a restaurant bill three ways with an uneven tip and one person who already paid a deposit, and instead of a number appearing instantly, you see a block of text unfold first: “First, let’s total the bill. Then subtract the deposit. Then figure out the tip on the remaining amount.” Only after several lines of this does the actual answer show up. Older models would have just produced a number, sometimes the right one, sometimes not, with no visible work in between. Something in how these systems operate has changed, and it isn’t that they got smarter overnight.

Paying for steps instead of guesses

What you’re watching is called chain-of-thought: the model generates a sequence of intermediate steps, working through the problem piece by piece, before committing to a final answer. Each of those steps is still just text the model produces one token at a time, the same mechanism it always used. What’s different is that it now produces a lot more of that text before stopping, and it uses those extra tokens to break a hard problem into smaller pieces it can check as it goes. This is what people mean by inference-time compute: spending more computation at the moment you ask the question, rather than only during the original training run described in the earlier piece on training versus inference. The model’s trained weights stay the same from one question to the next. What changes is how much work it does per question.
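As a rough illustration of the “same mechanism, more tokens” point, the toy loop below treats a model as a function that returns one next token at a time. The `scripted_model` helper, the `END` marker, and the two canned outputs (with made-up bill figures) are stand-ins invented for this sketch, not a real language model. The only thing it is meant to show is that a direct answer and a reasoning trace come out of the same loop and differ in how many times it runs.

```python
from typing import Callable, List

END = "<end>"  # hypothetical stop marker for this toy


def scripted_model(script: List[str]) -> Callable[[List[str]], str]:
    """Build a stand-in 'model' that replays a fixed script one token at a time."""
    def next_token(tokens_so_far: List[str]) -> str:
        position = len(tokens_so_far)
        return script[position] if position < len(script) else END
    return next_token


def generate(next_token: Callable[[List[str]], str], max_tokens: int = 1000) -> List[str]:
    """The same loop for every answer: ask for one token, append it, repeat."""
    tokens: List[str] = []
    while len(tokens) < max_tokens:
        token = next_token(tokens)
        if token == END:
            break
        tokens.append(token)
    return tokens


# Direct answer: the loop runs only a handful of times.
direct_answer = generate(scripted_model("Each person owes $24 .".split()))

# Reasoning trace: identical loop, many more iterations before the answer.
reasoned_answer = generate(scripted_model(
    ("Remaining bill is 90 minus 30 , so 60 . "
     "Tip is 20% of 60 , so 12 . "
     "Total is 60 plus 12 , so 72 . "
     "Split three ways , 72 / 3 is 24 . "
     "Each person owes $24 .").split()
))

print(len(direct_answer), "tokens vs", len(reasoned_answer), "tokens")
```

Both calls end on the same final sentence; the second simply spends more loop iterations getting there.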

The closest everyday version of this is a student solving a math problem on scratch paper versus one who blurts out the first number that comes to mind. The student showing work writes out each step, catches an arithmetic slip halfway through, crosses it out, and corrects course before turning in an answer. It takes longer and burns through more paper, but a mistake gets caught while there’s still a chance to fix it. The student who blurts out a guess commits to one answer with nothing to check it against. A direct answer from a model works the same way: one shot, no visible checkpoint, nothing to catch a slip before it becomes the final output.

Where the extra tokens actually help

A short, direct answer to a simple factual question might run to a few dozen tokens. A reasoning trace on a multi-step math or logic problem can run to several thousand before the model states its conclusion, all of it spent working through sub-steps, checking intermediate results, occasionally backtracking. On problems with several dependent parts, a wrong turn early on used to spread silently through everything that came after it, because there was no intermediate output to catch it in. Breaking the problem into visible steps gives the model, and sometimes the interface itself, a chance to notice an inconsistency before it reaches the final line.
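The way an early slip spreads through dependent steps can be sketched with the restaurant bill from the opening. Everything here is an assumption made for illustration: the figures (a $90 bill, a $30 deposit, a 20% tip, three people), the deliberately injected $10 error, and the explicit `check_steps` flag. A real model does not run a checkpoint like this; the sketch only shows why a visible intermediate value is something that can be checked at all.

```python
from typing import List, Tuple


def split_bill(bill_cents: int, deposit_cents: int, tip_percent: int,
               people: int, check_steps: bool) -> Tuple[int, List[Tuple[str, int]]]:
    """Work through four dependent steps, with a simulated slip in step 1."""
    trace: List[Tuple[str, int]] = []

    # Step 1: subtract the deposit. The extra 1000 cents is a deliberate error.
    remaining = bill_cents - deposit_cents + 1000
    trace.append(("remaining", remaining))

    # Optional checkpoint: remaining plus deposit should add back up to the bill.
    if check_steps and remaining + deposit_cents != bill_cents:
        remaining = bill_cents - deposit_cents  # redo the step correctly
        trace.append(("remaining (corrected)", remaining))

    # Step 2: tip on the remaining amount (depends on step 1).
    tip = remaining * tip_percent // 100
    trace.append(("tip", tip))

    # Step 3: total still owed (depends on steps 1 and 2).
    total = remaining + tip
    trace.append(("total", total))

    # Step 4: equal share per person (depends on step 3).
    share = total // people
    trace.append(("share", share))
    return share, trace


unchecked_share, _ = split_bill(9000, 3000, 20, 3, check_steps=False)
checked_share, checked_trace = split_bill(9000, 3000, 20, 3, check_steps=True)

print("no checkpoint:  ", unchecked_share, "cents each")
print("with checkpoint:", checked_share, "cents each")
```

Without the checkpoint the $10 slip inflates the tip, the total, and the final share; with it, the error is caught before any later step builds on it.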

None of this is free. Generating thousands of extra tokens per answer takes more time to produce and more compute to run, and on most current products that shows up as slower replies or a higher cost per query for the more involved “reasoning” modes. That tradeoff, more accuracy on hard multi-step problems in exchange for slower and pricier answers, is now a setting users and developers choose deliberately, problem by problem, rather than a fixed property of a model. A later piece in this series looks at what happens when a model is handed outside documents to work from instead of relying purely on what it generates internally, a related but separate shift in where the work happens.
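A back-of-the-envelope version of that tradeoff, using the rough token counts mentioned above (about fifty tokens for a direct answer, about five thousand for a reasoning trace). The price of $10 per million output tokens and the speed of 50 tokens per second are hypothetical round numbers chosen for easy arithmetic, not figures from any particular product, and the sketch ignores input tokens entirely.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    cost_usd: float
    seconds: float


def estimate(output_tokens: int, usd_per_million_tokens: float,
             tokens_per_second: float) -> Estimate:
    """Estimate cost and wait time from output length alone (a simplification)."""
    return Estimate(
        cost_usd=output_tokens * usd_per_million_tokens / 1_000_000,
        seconds=output_tokens / tokens_per_second,
    )


# Hypothetical pricing and generation speed, identical for both modes.
PRICE = 10.0   # USD per million output tokens (assumed)
SPEED = 50.0   # output tokens per second (assumed)

direct = estimate(50, PRICE, SPEED)
reasoning = estimate(5_000, PRICE, SPEED)

print(f"direct:    ${direct.cost_usd:.4f}, {direct.seconds:.0f} s")
print(f"reasoning: ${reasoning.cost_usd:.4f}, {reasoning.seconds:.0f} s")
```

Under these assumptions both cost and waiting time scale directly with the number of tokens generated, so a trace one hundred times longer is roughly one hundred times slower and pricier.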

What’s actually being purchased

It’s tempting to describe this as the model “reasoning” the way a person does, pausing to think something through. That framing gives away too much. There’s no understanding accumulating behind the scenes, no moment of insight. There is only more computation, more tokens generated per answer, spent because the extra steps tend to catch more mistakes. Answer quality is increasingly something you buy at the moment of asking, priced in tokens and seconds, rather than something fixed once and for all when the model was trained. The meter that used to run once per model version is now running again, per question, every time you ask.