Inference in Production: Serving a Model to Millions

A startup demos its new assistant on stage: one presenter, one question, an answer that streams back in under a second, word by word, flawless. Three weeks later the same model is live behind their product, and users are complaining. The first word of the reply now takes four or five seconds to show up. Sometimes it times out entirely. Nobody touched the model. What changed is that one person asking one question became ten thousand people asking questions at the same moment, and that turns out to be a completely different problem.

The chef and the restaurant

Think of a brilliant chef who can cook one exquisite meal at home, timed perfectly, plated with total attention, for a guest sitting right at the counter. Now open a restaurant that serves that same quality of food to five hundred covers in a single night, with nobody waiting an hour and nobody getting a lukewarm plate. The recipe barely changed. Almost everything hard about running the restaurant has nothing to do with the recipe and everything to do with kitchen logistics: how many orders the kitchen can work on at once, how tickets get grouped so the grill isn’t idle between each one, and what happens when a bus of two hundred people walks in unannounced during the dinner rush.

Serving a model in production is the kitchen, not the recipe. Latency is how long a single diner waits for their plate, and products generally need the first bit of a response back within a second or two to feel responsive rather than sluggish. Batching is the kitchen grouping several tickets so one expensive piece of equipment (a bank of GPUs, in this case) is doing useful work for many people at once instead of sitting half idle between orders. And handling a spike, when everyone opens the app right after a product launch or a viral post, is the equivalent of that unannounced bus: without planning for it, every order slows down together, and at some point the kitchen just stops taking new ones. None of this touches what the model actually knows or how well it reasons. It’s entirely about how many requests can move through the same hardware, in what order, without anyone waiting too long.
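To put rough numbers on the kitchen picture, here is a minimal sketch in Python. It assumes a toy cost model in which every pass over the hardware pays a fixed overhead plus a small cost per request. The constants FIXED_MS and PER_REQUEST_MS and all function names are made up for illustration; they are not measurements of any real system.

```python
import math

# Assumed toy cost model (illustrative numbers, not measurements):
# each pass over the hardware pays a fixed overhead plus a small
# per-request cost, so a larger batch spreads the overhead thinner.
FIXED_MS = 40.0
PER_REQUEST_MS = 2.0


def batch_time_ms(batch_size: int) -> float:
    """Time for one pass over a batch of the given size."""
    return FIXED_MS + PER_REQUEST_MS * batch_size


def throughput_per_second(batch_size: int) -> float:
    """Requests finished per second when every pass runs a full batch."""
    return batch_size / batch_time_ms(batch_size) * 1000.0


def time_to_clear_spike_ms(spike_size: int, batch_size: int) -> float:
    """Time to work through a burst that arrives all at once (the 'bus')."""
    batches = math.ceil(spike_size / batch_size)
    return batches * batch_time_ms(batch_size)


if __name__ == "__main__":
    for size in (1, 16):
        print(
            f"batch={size:>2}  "
            f"throughput={throughput_per_second(size):6.1f} req/s  "
            f"200-request spike clears in {time_to_clear_spike_ms(200, size):.0f} ms"
        )
```

Under these assumed numbers, a burst of two hundred requests takes about 8.4 seconds to clear one request at a time and just under a second at a batch size of 16. Real hardware does not follow such a tidy straight line, but the shape of the effect is the point: the same equipment, fed grouped tickets, serves far more people.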

Why this is a separate skill

A team can train or fine-tune an excellent model and still ship a product that feels slow, because getting a demo to answer well once and getting a service to answer well for everyone simultaneously call on different expertise. The first is a modeling problem. The second is a systems problem: queueing theory, load balancing, deciding how many requests to group into a batch before the wait to fill that batch starts costing more than the batch saves, and deciding when to spin up more hardware and how fast that hardware can come online once demand spikes. A model that answers brilliantly for one user can still buckle at ten thousand concurrent users if none of that scaffolding exists, and by the five hundredth cover of the night, a well-run kitchen behind a mediocre recipe will out-serve a genius chef with no system at all.
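The batch-size decision can be sketched the same way. The Python below is a simplified illustration, not a production policy: it assumes requests arrive at a steady rate, that the first request in a batch waits for the rest of the batch to arrive, and that a pass over the hardware costs a fixed overhead plus a small per-request cost. Every name and constant here is hypothetical.

```python
from typing import Optional

# Assumed toy cost model (illustrative numbers, not measurements).
FIXED_MS = 40.0
PER_REQUEST_MS = 2.0


def batch_time_ms(batch_size: int) -> float:
    """Time for one pass over a batch of the given size."""
    return FIXED_MS + PER_REQUEST_MS * batch_size


def capacity_per_second(batch_size: int) -> float:
    """Requests per second the hardware can finish at this batch size."""
    return batch_size / batch_time_ms(batch_size) * 1000.0


def first_request_latency_ms(batch_size: int, arrival_rate_per_s: float) -> float:
    """Wait for the batch to fill, then wait for the batch to run.

    Assumes evenly spaced arrivals, so the first request in a batch
    waits for (batch_size - 1) more requests to show up.
    """
    fill_wait_ms = (batch_size - 1) / arrival_rate_per_s * 1000.0
    return fill_wait_ms + batch_time_ms(batch_size)


def smallest_sustainable_batch(
    arrival_rate_per_s: float, max_batch: int = 64
) -> Optional[int]:
    """Smallest batch size whose capacity keeps up with demand.

    Anything smaller lets the queue grow without bound; anything larger
    only adds fill wait. Returns None if no batch size up to max_batch
    keeps up, which is the signal that more hardware is needed.
    """
    for batch_size in range(1, max_batch + 1):
        if capacity_per_second(batch_size) >= arrival_rate_per_s:
            return batch_size
    return None


if __name__ == "__main__":
    rate = 120.0  # assumed steady demand, requests per second
    best = smallest_sustainable_batch(rate)
    print("batch size:", best)
    if best is not None:
        print("first-request latency (ms):", first_request_latency_ms(best, rate))
```

In this toy model a steady 120 requests per second needs a batch of at least 7 to keep up, and going beyond that only makes early arrivals wait longer for the batch to fill. When the function returns None, no batch size is enough, and the question shifts from how to group requests to how much hardware to add.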

Where the real advantage hides

This is why two products built on the exact same underlying model can feel completely different to use: one instant and steady, the other laggy and prone to falling over on a busy afternoon. A large share of a product’s real cost, and a large share of its actual competitive advantage, lives in this invisible serving layer, not in the model that gets all the attention in headlines and reviews. A mediocre model served efficiently and reliably at scale can beat a genuinely better model that’s slow or falls over under load, because the people waiting on an answer don’t experience benchmark scores. They experience how long the wait was, and whether the tenth wait was as short as the first.

If you want the more basic distinction this piece builds on, Training vs Inference: Two Completely Different Phases covers what inference even is before it gets asked to run at this scale, and the next piece in this series, Quantization and On-Device Models, looks at one concrete way engineers shrink a model to make exactly this serving problem easier.