Attention: What the Model Looks At

Feed a chatbot a forty-page contract and ask whether clause 12 contradicts the payment terms mentioned in the introduction. It answers correctly, linking two passages separated by thousands of words, within seconds. That should feel more surprising than it does. For decades, software that processed text did so strictly in order, one word after another, carrying forward only a running summary of what it had already seen. By the time it reached word 2,000, whatever happened in the first paragraph was faint at best. Something changed in how models handle text, and it’s worth being precise about what.

Deciding what matters, one word at a time

The mechanism is called attention, and the name is literal. When a model generates the next word, it doesn’t just consult the word right before it. It looks across every token currently in front of it (its context; a token is a word or a fragment of one) and assigns each a weight that answers one question: how relevant is this, right now, to producing the next word correctly? A pronoun like “it” might draw most of its weight from a noun mentioned 300 words earlier, while almost entirely ignoring nearly everything in between. The weights are recalculated from scratch at every step of generation, which is why the model can check “clause 12” against a definition planted far earlier in the document.
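For readers who want to see the weighing as arithmetic, here is a minimal sketch in Python. It assumes the common formulation in which each word is scored by a scaled dot product and the scores are turned into weights with a softmax. The two-number vectors are hypothetical values picked by hand for illustration; a real model learns much longer vectors during training, and the function name is made up for this example.

```python
import math

def attention_weights(query, keys):
    """Return one weight per key: a softmax over scaled dot products."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Subtract the largest score before exponentiating, for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical hand-picked vectors, one per word in the context.
context = ["the", "contract", "was", "signed", "because", "it"]
keys = [
    [0.1, 0.0],  # the
    [3.0, 1.0],  # contract
    [0.0, 0.2],  # was
    [0.5, 0.1],  # signed
    [0.0, 0.1],  # because
    [0.3, 0.3],  # it
]
query = [2.0, 1.0]  # assumed stand-in for what "it" is looking for

weights = attention_weights(query, keys)
for word, weight in zip(context, weights):
    print(f"{word:>10s}  {weight:.3f}")
```

With these made-up numbers the weights sum to 1 and most of the weight lands on “contract,” even though other words sit closer to “it.” Position plays no part in the score, only how well the two vectors match.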

The older approach, built around recurrent networks, worked more like someone reading a book with a notepad, updating one running summary as they go and never rereading a page once they’ve turned it. Detail from early pages had to survive being compressed into that summary, over and over, across every page in between, and a lot of it didn’t survive. Attention works more like a person in a crowded room full of overlapping conversations, who doesn’t listen to everyone equally and doesn’t only listen to whoever spoke last. They tune in to the two or three voices most relevant to what’s being discussed right now, and as the conversation shifts topic, they retune, sometimes snapping back to something said minutes ago because it just became relevant again. The model does the same thing across every token in its context, deciding and redeciding, at each step, who’s worth listening to.
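The contrast can be put in numbers with a toy comparison. This sketch assumes a deliberately simplified notepad rule, where each new word overwrites a fixed 10% of the summary; real recurrent networks use learned updates that are more sophisticated than this. The attention side assumes the first word has been given a high relevance score for the current step. Both function names and all the numbers are illustrative assumptions.

```python
import math

def recurrent_share(position, length, keep=0.9):
    """Share of the final summary still owed to the word at `position`.

    Toy rule (an assumption): summary = keep * summary + (1 - keep) * word.
    """
    steps_since = length - 1 - position
    return (1 - keep) * keep ** steps_since

def attention_share(position, scores):
    """Softmax weight given to the word at `position`."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    return exps[position] / sum(exps)

length = 2000
scores = [0.0] * length
scores[0] = 10.0  # assumed: the very first word is highly relevant right now

early_recurrent = recurrent_share(0, length)
early_attention = attention_share(0, scores)
print(f"running summary: {early_recurrent:.3e}")
print(f"attention:       {early_attention:.3f}")
```

Under the toy rule, the first word’s share of the summary shrinks by a constant factor with every later word, so by word 2,000 it is vanishingly small. Under attention, the same word can still take most of the weight, because distance never enters the calculation.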

Why this matters when you read AI news

This is why “context window” numbers matter more than they first appear to. A model advertised as handling 200,000 tokens of context isn’t just claiming a bigger notepad. It’s claiming it can weigh any one of those tokens directly against any other, regardless of distance, without funneling everything through a single compressed summary first. That’s a different kind of claim than “it remembers more.” It explains why a model can sometimes catch an inconsistency between the first and last page of a long document, something a system reading strictly left to right, one compressed step at a time, would have smoothed over by the time it got there. When a product announcement leans on context length, this mechanism is the reason that number is more than a marketing figure.
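A rough count shows how much that claim covers. The sketch below assumes the usual left-to-right setup, in which each token is weighed against itself and every token before it. It counts pairs only and says nothing about how any particular product implements this; the function name is made up for the example.

```python
def pair_count(tokens):
    """Number of (token, same-or-earlier token) pairs in a context."""
    return tokens * (tokens + 1) // 2

for n in (2_000, 200_000):
    print(f"{n:>7,} tokens -> {pair_count(n):,} pairs")
```

Making the context 100 times longer multiplies the number of pairs by roughly 10,000, which is part of why a long context window is a substantive engineering claim, not just a bigger number.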

The part that gets credited to the wrong thing

The architecture behind this, the transformer, came from a 2017 paper, “Attention Is All You Need,” and nearly every language model since has been built on it. Attention itself was not new in 2017; it had been used for a few years as an add-on to recurrent networks. The paper’s move was to drop the recurrence and build the model around attention alone. It’s tempting to file that paper under “researchers found a way to make models bigger,” because bigger models arrived right after it and dominate the conversation now. That gets the order of events backwards. The 2017 contribution wasn’t scale. It was letting the model decide for itself, at every step, which pieces of its input deserved attention, rather than forcing information through a fixed sequential bottleneck one word at a time. Scale only became worth pursuing once that mechanism existed to make good use of it: more parameters and more data are only valuable if the model has a way to reach across all of it and pick out what matters. Scale gets the credit because scale is what shows up in headlines and benchmark charts. The mechanism is what made scale a good bet in the first place.

Before any of this, a model needs a way to represent words as something it can actually weigh and compare, which is the subject of Embeddings: Concepts Turned Into Numbers. How a model is actually trained to assign these weights well, as opposed to just having the mechanism available, is its own story, coming later in this series.