Diffusion: How AI Images Are Actually Made

Watch an AI image tool work in real time and it looks less like drawing and more like tuning an old television. The first frame is a field of colored static: no shapes, no edges, nothing recognizable. A second later, vague blobs of light and dark start to separate. A few seconds after that, a face or a landscape or a coffee cup pulls itself out of the noise, sharpening with each pass, until what’s left on screen is a finished image. Nothing was added to that static. Something was removed from it, over and over, until a picture was the only thing left.

Cleaning an image into existence

A language model writes forward. It commits to a first word, then a second, each one built on the ones before it, extending a sequence that keeps growing until the answer is done. An image diffusion model does close to the opposite. It starts with a full-size canvas of pure random noise, the visual equivalent of static, and never adds a single new pixel. Instead it runs the whole canvas through dozens of refining passes (a typical count is somewhere between twenty and fifty), and at each pass it asks the same question: given everything on this canvas right now, and given the text description it’s been handed, which parts look like noise, which parts look like signal, and how should the image be nudged a little further toward signal? Repeat that enough times and the noise resolves into a coherent image, guided at every step by the prompt.
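The loop described above can be sketched in a few lines of Python. This is a toy under loud assumptions, not how any real image model is implemented: the canvas is a short list of numbers instead of pixels, `toy_denoiser` is a hypothetical stand-in that cheats by comparing against a known target, and the step schedule is invented for illustration. A real diffusion model replaces that function with a trained neural network that estimates the noise from the canvas and the prompt alone.

```python
import random


def toy_denoiser(canvas, target):
    """Hypothetical stand-in for a trained network.

    Returns an estimate of the noise sitting on the canvas. A real model
    has no access to the target; this toy cheats to keep the sketch short.
    """
    return [c - t for c, t in zip(canvas, target)]


def generate(target, steps=30, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise. The canvas size never changes.
    canvas = [rng.gauss(0.0, 1.0) for _ in target]
    history = []
    for step in range(steps):
        predicted_noise = toy_denoiser(canvas, target)
        # Invented schedule: remove a growing share of the estimated noise,
        # so the final pass removes whatever is left.
        fraction = 1.0 / (steps - step)
        canvas = [c - fraction * n for c, n in zip(canvas, predicted_noise)]
        # Track the average distance from the target after this pass.
        error = sum(abs(c - t) for c, t in zip(canvas, target)) / len(target)
        history.append(error)
    return canvas, history
```

Calling `generate([0.0, 0.5, 1.0, 0.5])` returns the finished canvas together with the error recorded after each pass, which shrinks steadily. The canvas never grows; it only gets cleaner.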

The better analogy here isn’t construction, it’s carving. A sculptor working on a rough, uncut block isn’t adding material to build a figure up from nothing. They’re taking a shapeless mass and, stroke by stroke, removing what doesn’t belong, until the block is gone and a specific, intended form is what’s left. A diffusion model treats noise the same way a sculptor treats stone: not as an empty starting point to build on, but as undifferentiated material to be worked down into something specific, one pass at a time.

Why the direction of the process matters

This distinction is worth holding onto because it explains behavior that otherwise looks strange. Give an image model the same prompt twice and you’ll often get two different, equally valid images, because the process started from two different random fields of noise and refined each one down a different path toward the same description. It also explains why these tools can produce a rough composition almost instantly and then spend the remaining time sharpening detail: the big shapes get decided in the early, noisiest passes, and the fine texture gets carved out near the end, much like a sculptor blocking out the overall form before returning to shape individual features. None of that maps onto how a language model works, where every word is final the moment it’s written and nothing gets refined after the fact. Two different mechanisms produce two different sets of quirks, and the noise-to-order one is the reason images behave the way they do.
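A toy sketch of why the starting noise matters, written in Python under stated assumptions: here the description is satisfied by any canvas whose values are all -1 or +1, and the hypothetical `nearest_valid` function plays the part of the model by pulling each value toward whichever valid level is closer. None of these names come from a real library, and a real model makes this judgment with a trained network instead of a fixed rule.

```python
import random


def nearest_valid(value, valid_levels=(-1.0, 1.0)):
    """Hypothetical stand-in for the model: pick the closest acceptable value."""
    return min(valid_levels, key=lambda level: abs(level - value))


def refine(seed, size=16, steps=30):
    rng = random.Random(seed)
    # Each seed produces a different starting field of noise.
    canvas = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for step in range(steps):
        # Invented schedule: the last pass removes all remaining noise.
        fraction = 1.0 / (steps - step)
        canvas = [c - fraction * (c - nearest_valid(c)) for c in canvas]
    # Every value has settled on -1 or +1; round to tidy up float error.
    return [round(c) for c in canvas]


if __name__ == "__main__":
    for seed in (1, 2):
        print(seed, refine(seed))
```

Both runs end in a pattern that fits the same description, but the patterns generally differ because the noise they started from differed. In this toy, which side each value lands on is fixed by the starting noise, and the later passes only sharpen it, a loose parallel to big shapes being decided early and detail arriving late.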

A second paradigm, not a side note

It would be convenient to treat this reverse process as a fact specific to pictures, a quirk of how one type of model happens to work. That undersells what’s happening. The logic of starting from disorder and refining toward a target, rather than building forward token by token, is already showing up in video generation, and researchers are actively adapting versions of it for text as well, refining a whole rough draft toward coherence rather than writing it left to right. Next-token prediction and noise-to-order refinement are turning into two genuinely separate ways of getting a model to produce something, not a main technique and a footnote attached to it. Understanding both is quickly becoming a basic requirement for understanding where generative AI is headed, not just how one corner of it happens to draw pictures.

For the broader question of how an image ends up sharing any common ground with text and audio well before this refining process even begins, see Multimodality: Text, Images, and Audio in the Same Model.