Tokens: How a Model ‘Sees’ Text

Ask a chatbot how many times the letter r appears in strawberry and there’s a real chance it says two. Ask it to spell a word backward and the same kind of oddly specific error shows up. It’s strange to watch: a system that can write working code or explain a legal clause in plain English, stumbling over what looks like the simplest task in the room. The mistake isn’t a sign the model can’t count. It’s a sign the model never saw the letters to begin with.

The stamp set the model actually reads with

Before a language model reads anything, the text you type gets run through a step called tokenization, which breaks your sentence into chunks called tokens. A token isn’t a letter, and it isn’t reliably a whole word either. It’s whatever chunk the model’s vocabulary has a ready-made slot for.

Picture the model working not with an alphabet but with a fixed set of rubber stamps, typically tens of thousands of them and in some recent models a couple hundred thousand, each carved with a small chunk of text. A few hold single letters, but most hold whole common words (“the”, “and”, “dog”) or common fragments (“ing”, “tion”, “pre”). The model doesn’t trace a sentence letter by letter. A tokenizer covers the text with stamps from the set, following fixed rules that tend to favor longer, more common chunks, and the model receives those stamps one after another.
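The stamp picture can be sketched in a few lines of Python. This is a toy under loud assumptions: the vocabulary `TOY_VOCAB` is made up and tiny, and the `tokenize` function uses a simple greedy longest-match rule. Real tokenizers generally rely on learned merge rules (byte-pair encoding is a common one), so treat this as an illustration of the idea, not as how any particular model does it.

```python
# Toy "stamp set": a tiny, invented vocabulary for illustration only.
# Real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"straw", "berry", "the", "dog", "ing", "tion", "pre"}


def tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split text into chunks by greedy longest match against vocab.

    Single characters are always allowed as a fallback stamp,
    so any text can be covered.
    """
    longest = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate chunk first, shrinking until one matches.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if size == 1 or chunk in vocab:
                tokens.append(chunk)
                i += size
                break
    return tokens


print(tokenize("strawberry"))  # ['straw', 'berry']
print(tokenize("predicting"))  # ['pre', 'd', 'i', 'c', 't', 'ing']
```

With this toy vocabulary, a word that happens to match the stamps comes out as two chunks, while a word that doesn’t falls apart into six.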

Common words usually get their own stamp. “Strawberry” is common enough in English that it might have a single dedicated stamp, or it might split into two pieces, something like “straw” and “berry” (the exact split depends on the vocabulary a given model was trained with). A rarer or technical word has no dedicated stamp at all, so the tokenizer breaks it into three, four, sometimes six smaller pieces to cover the same ground. Either way, once “strawberry” becomes one or two stamped chunks, the letters s-t-r-a-w-b-e-r-r-y are gone from what the model is actually looking at. It receives one or two symbols, and those behave like opaque tags, not spelled-out sequences.

That’s why counting the r’s is hard for a model in a way it isn’t for a person answering the same question. A person looks at ten letters and counts three r’s directly. The model has to infer the spelling of a token it treats as a single unit, from patterns absorbed during training, without seeing the letters lined up in front of it. Sometimes that inference lands correctly. Often, on exactly this task, it doesn’t.
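A small sketch makes the gap between the two views concrete. Everything here is assumed for illustration: the split into “straw” and “berry”, the ID numbers in `TOY_IDS`, and the helper `count_letter_from_ids` are all invented, and real models use different splits and IDs.

```python
# Hypothetical vocabulary: each chunk maps to an arbitrary integer ID.
# These IDs are invented; real ones differ from model to model.
TOY_IDS = {"straw": 4312, "berry": 977}

word = "strawberry"
pieces = ["straw", "berry"]  # one possible split, assumed here
token_ids = [TOY_IDS[p] for p in pieces]

# What a person works from: the characters themselves.
human_count = word.count("r")


def count_letter_from_ids(ids, letter, id_to_text):
    """Count a letter by first looking up the spelling behind each ID.

    The lookup table stands in for spelling knowledge a model would
    have to absorb during training; it is not part of the model's input.
    """
    return sum(id_to_text[i].count(letter) for i in ids)


id_to_text = {token_id: text for text, token_id in TOY_IDS.items()}

print(token_ids)    # [4312, 977]
print(human_count)  # 3
print(count_letter_from_ids(token_ids, "r", id_to_text))  # 3
```

The list of two integers is all the model is handed. Getting to the right count from there depends on a reliable spelling lookup for every ID, which the code has as an explicit table and the model has only as whatever it picked up in training.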

Why this isn’t a reasoning failure

It’s tempting to read a wrong letter count as proof the model is shallow, or that its competence elsewhere must be fake. That reading mixes up two different things: what the model can reason about, and what it can perceive in the first place. A model that gets strawberry wrong can still solve a multi-step word problem or catch a contradiction in a paragraph, because those tasks operate on the token-level representation it actually has access to. Counting individual letters inside a word requires information that representation doesn’t directly expose.

This distinction matters for how you read AI capability claims and failures in general. Plenty of reported “weird bugs” (struggling with arithmetic, garbling reversed text, mishandling acronyms and unusual spellings) trace back, at least in part, to the same cause: tokenization decides what unit of text the model operates on, and that unit is frequently not the one a human would assume. When a system fails at something a child could do, the useful question isn’t “is it smart” but “what unit of information was it working with when it failed.” Those two questions often have different answers.

A limit of perception, not of intelligence

The letter-counting example is small, almost a party trick, but it points at something worth sitting with: a lot of what looks like a lack of intelligence in these systems is actually a limit of perception. The model isn’t failing to figure something out. It’s answering a question about data it was never given access to, the same way someone asked to count the stripes on a photo they saw only as a blurred thumbnail would guess wrong, not for lack of counting skill, but because the detail was never in front of them. Intelligence and perception get treated as one bundle when we watch a system perform, but they’re separate layers, and a surprising number of the strangest failures live in the lower one, quietly shaping what the upper one is even able to work on.

This sits next to two other pieces here: Next-Word Prediction: The One Trick Behind Everything explains why the model produces text one token at a time in the first place, and Embeddings: Concepts Turned Into Numbers picks up where this one leaves off, showing how a token becomes the numbers the model actually computes with.