Multimodality: Text, Images, and Audio in the Same Model

You’re mid-conversation with a chatbot, typing out a question about a recipe, when you paste in a photo of your half-empty fridge and ask what you could make with what’s there. A few messages later you attach a voice memo, thirty seconds of you thinking out loud about substituting an ingredient, and ask it to weigh in. The model answers all three in the same thread, without missing a beat or asking you to switch modes. It’s easy to assume that under the hood, one part of the system reads, another part looks, and a third part listens, and something stitches their answers together. That’s not what’s happening. There’s one model, and it isn’t doing three different kinds of work.

One representation, not three machines

An image is, at the pixel level, a grid of numbers describing color and brightness at each point. An audio clip is a waveform, air pressure sampled thousands of times a second. Written text is a sequence of characters. These look like three unrelated kinds of data, and if a model actually had to reason directly over pixels, sound waves, and letters as three separate raw formats, you’d need three separate systems glued together at the edges, each with its own logic, and a translator sitting between them.

That’s not the design. Instead, each format gets broken down and converted into the same kind of underlying representation before the model ever starts reasoning: a sequence of tokens, each one mapped to a vector, which is a list of numbers marking a position in a shared numerical space. A patch of an image becomes a token. A short slice of a sound wave becomes a token. A fragment of a word becomes a token. None of these tokens carries a label saying “I came from a picture” or “I came from speech” in any way the core model treats specially. They arrive as the same kind of object, sitting in the same kind of space, and from that point on the model is doing one job: predicting what comes next in a sequence of vectors, regardless of what those vectors originally represented.
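To make the conversion step concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not how any particular model does it: the width D_MODEL, the patch and frame sizes, the random weights standing in for learned ones, and every function name are invented for illustration. What it shows is only the shape of the idea: three different raw formats go in, and one list of same-sized vectors comes out.

```python
import random

D_MODEL = 8  # hypothetical width of the shared space; real models use far more
rng = random.Random(0)

def random_matrix(rows, cols):
    """Stand-in for learned weights: rows x cols small random numbers."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def project(features, weights):
    """Linear map from len(features) raw numbers to one D_MODEL-long vector."""
    return [sum(f * w for f, w in zip(features, col)) for col in zip(*weights)]

def image_to_tokens(image, patch, weights):
    """Cut a 2D pixel grid into patch x patch squares, one vector per square."""
    tokens = []
    for top in range(0, len(image), patch):
        for left in range(0, len(image[0]), patch):
            flat = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            tokens.append(project(flat, weights))
    return tokens

def audio_to_tokens(samples, frame, weights):
    """Cut a waveform into fixed-length frames, one vector per frame."""
    return [project(samples[i:i + frame], weights)
            for i in range(0, len(samples) - frame + 1, frame)]

def text_to_tokens(pieces, vocab, table):
    """Look up one vector per word fragment in an embedding table."""
    return [table[vocab[p]] for p in pieces]

# Toy inputs: a 4x4 grayscale image, a 32-sample waveform, four word pieces.
image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]
samples = [((i % 8) - 4) / 4 for i in range(32)]
pieces = ["what", "can", "i", "cook"]
vocab = {p: i for i, p in enumerate(pieces)}

image_tokens = image_to_tokens(image, 2, random_matrix(4, D_MODEL))    # 4 patches
audio_tokens = audio_to_tokens(samples, 8, random_matrix(8, D_MODEL))  # 4 frames
text_tokens = text_to_tokens(pieces, vocab, random_matrix(len(vocab), D_MODEL))

# One sequence, one kind of object: nothing downstream needs to know the source.
sequence = image_tokens + audio_tokens + text_tokens
print(len(sequence), {len(v) for v in sequence})  # prints: 12 {8}
```

Each format needs its own small converter, but everything after the final concatenation sees twelve vectors of length eight and nothing else.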

The comparison worth holding onto is a hummed melody, a spoken sentence, and that same sentence written down. As raw material they call for different skills: hearing pitch, hearing speech, reading print. But once the hummed notes are transcribed into notation and the spoken syllables into printed words, all three are marks on a page, and someone who reads that page can work with all of them. The point isn’t that music, speech, and text are secretly the same thing. It’s that once they’re converted into one common notation, one reader handles all three without switching methods. A model does the same conversion, pixels and waveforms and letters all reduced to tokens in a shared space, and from there one architecture, not three, does the work.

Why the plumbing matters more than the label

This is why a single model can be handed a photo, a paragraph, and a voice clip in the same request and produce one coherent answer instead of three disconnected ones. It’s also why building a system that handles multiple formats isn’t mostly about writing separate modules for vision and hearing and language. Most of the engineering effort goes into the conversion step, getting pixels and waveforms turned into tokens that land in useful positions in that shared space, so that a photo of a fridge and the words “what can I cook” end up close enough in the same representation to be reasoned about together. Once that conversion is solid, the reasoning machinery on top of it doesn’t need to know or care which format anything started as.
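“Close enough in the same representation” can be given a rough number. One common way to measure closeness between two vectors is cosine similarity, sketched below. The three vectors are made up by hand purely for illustration (no real model produced them), and the names fridge_photo, cook_question, and traffic_clip are hypothetical; the only point is that related inputs should score higher than unrelated ones once the conversion is doing its job.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made, hypothetical positions in a 4-dimensional shared space.
fridge_photo = [0.9, 0.8, 0.1, 0.0]   # pretend output of an image converter
cook_question = [0.8, 0.9, 0.2, 0.1]  # pretend vector for "what can I cook"
traffic_clip = [0.0, 0.1, 0.9, 0.8]   # pretend vector for an unrelated audio clip

related = cosine_similarity(fridge_photo, cook_question)
unrelated = cosine_similarity(fridge_photo, traffic_clip)

# The photo sits much closer to the cooking question than to the unrelated clip.
print(related > unrelated)  # prints: True
```

In a real system those positions come out of training rather than being typed in, which is where the engineering effort described above actually goes.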

The label was never the architecture

Once everything is reduced to the same kind of token, the boundary between “a text model” and “an image model” stops being a real architectural distinction and becomes mostly a labeling convenience. It’s a useful shorthand for what a product is set up to accept or produce, not a description of some separate machine underneath. The categories people still reach for when they talk about these systems, a chatbot over here, an image generator over there, describe an outdated picture. What’s actually happening is one shared underlying representation, doing several jobs at once, wearing whatever name the interface in front of it happens to have.

If you want the mechanism behind that shared representation, “Embeddings: Concepts Turned Into Numbers” is the piece to read first. Once you’re comfortable with tokens living in one space, “Diffusion: How AI Images Are Actually Made” covers the very different process used to turn that representation back into a picture.