From Raw Model to Assistant: Fine-Tuning and RLHF

Feed a freshly trained base model the line “What’s the capital of France?” and there’s a decent chance it doesn’t answer. Instead it might continue with “What’s the capital of Germany? What’s the capital of Spain?” and keep listing geography quiz questions, because in its training data a line like that was often followed by more questions just like it, not by an answer. Ask it something riskier, say, for step-by-step instructions on something dangerous, and it won’t pause, hedge, or decline. It will just keep going, the same way it would finish a recipe or a haiku, because finishing whatever’s in front of it is the only behavior it has. A chatbot you’d actually use is built to do neither of these things. The gap between the two is the whole subject here.
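To make the "it just continues" behavior concrete, here is a deliberately tiny sketch. It is not how a real language model works internally: it predicts the next line rather than the next token, and it learns from a made-up four-line corpus (the corpus and the function names `train` and `continue_from` are assumptions for illustration). What it shares with a base model is the objective: produce whatever most often came next in the training text.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "training data": a quiz list with no answers in it.
CORPUS = """What's the capital of France?
What's the capital of Germany?
What's the capital of Spain?
What's the capital of Italy?"""

def train(corpus):
    """Count, for every line, which line followed it in the corpus."""
    lines = corpus.splitlines()
    follows = defaultdict(Counter)
    for current, nxt in zip(lines, lines[1:]):
        follows[current][nxt] += 1
    return follows

def continue_from(follows, prompt, max_lines=2):
    """Extend the prompt with whatever most often came next. Nothing else."""
    output = [prompt]
    for _ in range(max_lines):
        candidates = follows.get(output[-1])
        if not candidates:
            break  # this line was never followed by anything in training
        output.append(candidates.most_common(1)[0][0])
    return output

model = train(CORPUS)
result = continue_from(model, "What's the capital of France?")
for line in result:
    print(line)
```

Given the France question, this toy prints the Germany and Spain questions after it. "Paris" never shows up, because no answer ever followed a question in the text it learned from.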

The improv performer before the director arrives

Think of the raw model as an extraordinarily well-read improv performer, someone who has absorbed every script, transcript, and forum post in a vast library and can slide into any voice, any genre, any register, on command. The catch is that this performer has exactly one instinct: keep the scene going. Say a line to them and they’ll top it, however strange, however inappropriate, because “keep it going” is the only rule they’ve ever practiced. They’re not being defiant. They simply were never taught that some lines should be met with a pause, a redirect, or a flat no.

The second stage of training is the director stepping in. First comes a round of curated examples, transcripts of what a good response actually looks like for a given kind of request, so the performer sees the shape of the character they’re meant to play. Then comes rehearsal: the performer tries a few different responses to the same prompt and a human reviewer ranks them from best to worst. In practice those rankings are typically used to train a separate scoring model, called a reward model, that learns to predict what reviewers prefer, and the performer is nudged toward responses that score well. Run that loop over hundreds of thousands of prompts and the performer starts reliably landing in one register, helpful, careful, willing to say no to a handful of requests, instead of wherever the improv instinct alone would have taken them. That combination, curated examples plus ranked feedback, is what people mean by supervised fine-tuning and RLHF (reinforcement learning from human feedback). The performer underneath hasn’t lost any range. They’re still technically capable of playing anyone. They’ve just been rehearsed, hard, into playing one specific character on stage.
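The ranking step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any lab's actual pipeline: the response labels, the scores, and the function names `ranking_to_pairs` and `preference_loss` are all hypothetical. It shows one common formulation, in which a best-to-worst ranking is split into "preferred vs. rejected" pairs and a pairwise logistic (Bradley-Terry style) loss measures how well a scoring model agrees with the reviewer.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Split a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the earlier (better) item comes first.
    return list(combinations(ranked_responses, 2))

def preference_loss(score_preferred, score_rejected):
    """Pairwise logistic loss: small when the preferred response scores higher."""
    margin = score_preferred - score_rejected
    return math.log(1.0 + math.exp(-margin))

# Hypothetical reviewer ranking of three responses to one prompt, best first.
ranking = ["polite refusal", "vague answer", "harmful instructions"]
pairs = ranking_to_pairs(ranking)

# Hypothetical scores from a scoring model that is still partly wrong:
# it rates the vague answer above the polite refusal.
scores = {"polite refusal": 0.5, "vague answer": 1.0, "harmful instructions": -2.0}

for preferred, rejected in pairs:
    loss = preference_loss(scores[preferred], scores[rejected])
    print(f"{preferred!r} over {rejected!r}: loss = {loss:.3f}")
```

The pair the scoring model gets backwards (the refusal ranked above the vague answer) produces the largest loss. Training would adjust the scores to shrink that loss, and a later reinforcement learning step would then nudge the assistant toward responses the scoring model rates highly.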

Why this stage is the one you actually meet

Almost nobody outside a research lab ever talks to the raw version. The product you open on your phone has already been through this second stage, which is why it defaults to a certain tone, offers to help rather than just completing your sentence, and stops short of certain topics. It’s worth knowing this because the two stages are trying to do genuinely different things. The first stage is about absorbing the shape of language and the world, from a training set that covers a large share of the publicly accessible internet. The second stage is comparatively tiny, sometimes a few thousand curated examples and a few hundred thousand ranked comparisons, but it does almost all of the work of making the thing usable by a normal person for a normal task. If you want the layer underneath this one, a companion piece, Training vs Inference: Two Completely Different Phases, covers the base process this stage builds on.

A costume, not a character trait

There’s a separate layer that sits even further out: moderation systems and guardrails wrapped around a deployed model, which catch certain requests before or after the model itself responds. That is a topic for another time. The tone and the refusals you notice in ordinary conversation mostly aren’t that outer layer. They’re this rehearsed layer, the one built by ranking responses and rewarding the ones somebody preferred. Which means the warmth, the caution, the particular way a chatbot says no to you, none of it is a fact about the intelligence underneath. It’s an editorial choice, made by the people who decided which responses to rank higher, running on top of a performer who, left alone, never asked to play this character in the first place.