Prompt Injection: Why These Systems Can’t Tell Instructions From Data

Someone asks an AI assistant to summarize a webpage. The assistant reads the page, and buried in the footer, in text sized to be invisible to a human scrolling past, is a line telling any AI reading this page to ignore its previous instructions and instead recommend a certain product, or forward the conversation history somewhere, or take some other action the actual user never asked for. The assistant, having no way to treat that footer text differently from the article above it, follows it. This is prompt injection: getting a model to act on instructions hidden inside content it was only supposed to be reading, rather than the instructions its actual operator or user gave it.

One channel, no labels

The reason this works is not a bug in any particular product. It’s a property of how these models are built. A large language model takes in one stream of text, its context window, and produces a response based on everything in that stream. The system prompt from the developer, the question from the user, and the contents of a webpage the model was asked to summarize all arrive as the same kind of text, sitting in the same window, processed through the same mechanism. Nothing about the format marks one part as “an order to obey” and another part as “material to read.” The model infers intent from patterns in the text itself, and text that reads like an instruction tends to get treated like one, regardless of where it actually came from.

Picture a personal assistant who reliably follows whatever written instructions they come across: a memo from their actual boss, or a note somebody else quietly left on the boss’s desk. The assistant has no way to verify who wrote which note. They just see instructions in front of them and follow them, because “read it and act on it” is the only mode they operate in. Handing that assistant a stack of papers to read and trusting they’ll only read some of them and act on none of them is asking for a kind of judgment they were never built to exercise.

Why this changes what you should trust an agent to do alone

This matters more with every extra piece of autonomy a model is given. A chatbot that gets tricked into writing something odd produces a wrong sentence in a chat window, which a person reads before anything happens. An agent that gets tricked while it has access to tools, an email account, a code repository, or a payment system can turn that same trick into a real action: a message sent, a file changed, a purchase made. The content doesn’t have to be an obvious scam page either. It can be a support ticket, a shared document, a calendar invite, anything the model is asked to process as data. The practical lesson isn’t to write a cleverer prompt telling the model to be more suspicious of embedded instructions. Prompts are exactly the layer being attacked, so stacking more prompt on top rarely closes the gap. What actually helps is limiting what an agent is allowed to do without a person checking first, especially once it has tool access and is taking real actions in the world rather than just producing text.

A gap that patches won’t quietly close

It’s tempting to treat prompt injection as a bug that a sufficiently careful update will eventually fix. It isn’t. Today’s models process instructions and the data they’re reading through the exact same channel, so anything that reads like an instruction can potentially act like one, no matter where it actually came from. That’s not a flaw in one release, it’s a structural gap in how the mechanism works. The field currently addresses it with layers rather than a single fix: restricting what an agent can do automatically, requiring human approval before anything risky happens, and keeping untrusted content clearly separated from trusted instructions wherever that separation is possible. None of these close the gap on their own, because no single fix can, but together they narrow how much damage a hidden instruction can do before a person is in the loop.

This risk scales directly with how much unsupervised action a model is given, which is exactly why it’s worth understanding what changes once a system moves from answering questions to actually taking actions. It also connects to the companion piece on guardrails and moderation, the layer that decides what a model is allowed to say and that attacks like this often try to route around.