Context and Memory: Why the Model ‘Forgets’

Forty minutes into a conversation, a man asks a chatbot to revisit a constraint he mentioned right at the start: he needs the plan to work without a car. The assistant cheerfully suggests a rental. He points out he already said that. It apologizes and does it again two exchanges later. Days after that, he opens a different product and it welcomes him back by name, mentioning a preference he only ever stated once, weeks ago, in an unrelated chat. He assumes this second product is simply better at listening. It isn’t. Both moments come from the same mechanical fact about how these systems work, just from opposite ends of it.

The window that only holds so much

Every time a model generates a reply, it isn’t drawing on some accumulated understanding of your conversation. It is reading, fresh, whatever text currently sits in front of it: your messages, its own previous replies, any instructions the product tacked on, all concatenated into one block called the context window. That window has a fixed limit, measured in tokens (small chunks of text). For many current models the limit falls somewhere between 128,000 and 200,000 tokens, though some are smaller and some considerably larger. That sounds enormous, and for most conversations it is. But once the running transcript exceeds it, the oldest material has to be dropped or squeezed to make room for the new. The constraint mentioned at minute one isn’t being ignored. In a sufficiently long exchange, it is simply no longer in the block of text the model sees.
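A minimal sketch of that dropping behavior, under loud assumptions: the helper names count_tokens and fit_to_window are hypothetical, and counting whitespace-separated words is only a crude stand-in for a real, model-specific tokenizer. Actual products may also summarize old material rather than drop it outright.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_window(messages: list[str], limit: int) -> list[str]:
    """Keep the most recent messages whose combined size fits within `limit`."""
    kept: list[str] = []
    used = 0
    # Walk backwards from the newest message; stop at the first that won't fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > limit:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


transcript = [
    "User: The plan has to work without a car.",
    "Assistant: Understood, no car.",
    "User: What about the second day?",
    "Assistant: You could take the train to the coast.",
    "User: And how do I get to the trailhead?",
]

# A deliberately tiny limit so the effect shows up in five messages.
window = fit_to_window(transcript, limit=20)
for line in window:
    print(line)
```

With this toy limit, only the last two messages survive. The no-car constraint from the first message is not in the text the model would read, so nothing in its reply can depend on it.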

This is easiest to picture through someone who has lost the ability to form new long-term memories, and who is handed a fresh briefing note about you just before each meeting. Once he has read that note, the conversation that follows feels genuinely warm and informed. He knows your name, your job, the thing you mentioned last time. But none of it is retained inside him between meetings. The instant you leave, it’s gone. The illusion of memory is produced entirely by whoever keeps writing that note and handing it over at just the right moment. The model is this person. The context window is the note. Nothing persists in the model itself from one conversation to the next.

Why the “memory” announcement is a different feature

So when a product announces that its assistant “now has memory” and can recall your preferences across sessions, it is not describing a change to the model. The model’s parameters, the internal numbers that make it generate one word rather than another, aren’t being edited by your conversations. What’s actually happening lives outside the model entirely: a database, a running summary, or a list of saved facts about you, stored separately and quietly re-inserted into the context window at the start of your next session. The model reads that inserted material the same way it reads anything else in its window, and responds as though it recalls you, because from its narrow point of view it is simply reading text that happens to be about you. It is the note being handed over again, just written by a system instead of a person this time.
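One way to picture that arrangement is the sketch below. It is an illustration under stated assumptions, not any product's actual design: the JSON file, the function names load_facts, save_fact and build_context, and the wording of the inserted header are all hypothetical. The point is only that the facts live in storage beside the model and reach it as ordinary text.

```python
import json
import tempfile
from pathlib import Path


def load_facts(path: Path) -> list[str]:
    """Read saved facts from storage; an empty list if nothing was saved."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def save_fact(path: Path, fact: str) -> None:
    """Append a fact to the store, skipping exact duplicates."""
    facts = load_facts(path)
    if fact not in facts:
        facts.append(fact)
    path.write_text(json.dumps(facts), encoding="utf-8")


def build_context(facts: list[str], user_message: str) -> str:
    """Assemble the text the model would read at the start of a session."""
    lines: list[str] = []
    if facts:
        lines.append("Known facts about the user:")
        lines.extend(f"- {fact}" for fact in facts)
    lines.append(f"User: {user_message}")
    return "\n".join(lines)


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "memory.json"  # hypothetical storage location
    # Session one: the product saves something the user said.
    save_fact(store, "Does not have a car.")
    # Session two, weeks later: the saved fact is prepended to the new message.
    context = build_context(load_facts(store), "Plan my weekend.")

print(context)
```

Nothing in this sketch touches the model. Delete or edit the file and the "memory" changes with it, which is exactly why it behaves like a database rather than like recall.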

What that distinction is actually worth

This matters because a bolted-on database behaves nothing like a memory would. It can be wrong: a preference you mentioned once in passing gets saved as permanent. It can be stale: something you corrected last month never got updated in storage, so it keeps resurfacing. It can be edited, selectively, by the company running the product, in ways you can’t fully see. A real persistent memory inside a person doesn’t work that way, and neither does a model’s frozen set of parameters, but a database sitting beside the model absolutely does. Knowing that “memory” almost always means retrieval from outside storage, not recall from inside the model, is what lets you ask the only question that actually matters: not whether it remembers, but where that memory is kept, and how much you’d trust what’s written there.

This is the same mechanism a companion piece in this series looks at from the other direction: “RAG: Giving a Model Your Documents Without Retraining It” covers how outside material gets fed into a model’s context window, which is often the very technique used to fake persistent memory. And since that stored material has to live somewhere searchable, “Vector Databases: The Infrastructure Behind RAG” is a natural next stop for seeing what that storage usually looks like underneath.