A Primer on LLM Post-Training
Large Language Models (LLMs) have revolutionized how we write and consume documents. In the past year or so, we have started to see them do a lot more than just rephrase docs: LLMs can now think before they act, they can plan, they can call tools like a browser, they can write code and check that it works, and a lot more. Indeed, the list is growing quickly!

What do all these skills have in common? They are all developed in what we call the post-training phase of LLM training. Despite post-training unlocking capabilities that would have looked magical to us a few years ago, it gets surprisingly little coverage compared to the basics of Transformer architectures and pre-training.

This tutorial was originally written for the Meta infrastructure team, with a target audience of infra engineers without expertise in LLM modeling who want to learn enough about post-training to contribute. I believe that this describes a large group of engineers: with Reinforcement Learning becoming mainstream, we need new infrastructure to be productive, so bridging this gap is critical! I am now sharing this broadly in the hope that many more folks across the PyTorch Foundation share a similar background and interest, and that they will find it as helpful as our team did.

Primer on post-training

Post-training (sometimes referred to as “alignment”) is a key component of modern LLMs, and the way to “teach” models how to answer in a way that humans like, and how to reason. Why is post-training different from pre-training, you ask?
Post-training primes the model to have a conversation with a user, which follows a set of basic rules such as:

- In a conversation, there’s more than one speaker, and they all take turns talking
- You should listen before you talk, so that you say something relevant

We find these obvious, but pre-training only does next-word prediction to teach the model about the world. The data there is completely unstructured, so the model never learns these basic rules. Indeed, a model coming out of pre-training is often bad at understanding that it should stop talking after a while and will blabber on forever, kind of like a Google autocomplete box.

Furthermore, it’s also useful to impose some ground rules on the model that take absolute precedence over everything else. This is done in post-training through a system prompt (and/or through Supervised Fine-Tuning (SFT) or reward shaping, see later).

Post-training data format

Chatting with these models is possible via some plumbing that happens behind the scenes. Every time you talk to a service like ChatGPT, you do it through a chat window. What actually happens is that the post-training structure is plumbed for you, and the model will see something like this (using the data format for Llama 3):

    <|begin_of_text|>
    <|start_header_id|>system<|end_header_id|>
    …<|eot_id|>
    <|start_header_id|>user<|end_header_id|>
    What is the capital of France?<|eot_id|>
    <|start_header_id|>assistant<|end_header_id|>
    The capital of France is Paris<|eot_id|>
    <|start_header_id|>user<|end_header_id|>
    How many people live there? Tell me just the number<|eot_id|>
    <|start_header_id|>assistant<|end_header_id|>
    START FILLING FROM HERE

Note that the basic interface of the LLM is unchanged: you provide some text, and it will continue it to infinity and beyond. What this clever plumbing does is to make sure that the model receives all the metadata to know that the previous speakers have…
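To make the plumbing concrete, here is a minimal Python sketch of how a chat frontend might turn a list of messages into the single string the model actually sees. The special tokens are the Llama 3 ones quoted above. The function names (`render_turn`, `render_prompt`), the message dictionary layout, the two newlines after each header, and the system prompt text are all assumptions made for illustration, not the implementation used by any particular service.

```python
from typing import Dict, List

# Special tokens from the Llama 3 data format shown above.
BEGIN_OF_TEXT = "<|begin_of_text|>"
START_HEADER = "<|start_header_id|>"
END_HEADER = "<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def render_turn(role: str, content: str) -> str:
    """Render one finished turn: speaker header, text, end-of-turn marker."""
    # Assumption: two newlines separate the header from the message text.
    return f"{START_HEADER}{role}{END_HEADER}\n\n{content}{END_OF_TURN}"


def render_prompt(messages: List[Dict[str, str]]) -> str:
    """Flatten a conversation into one string for the model to continue."""
    parts = [BEGIN_OF_TEXT]
    parts.extend(render_turn(m["role"], m["content"]) for m in messages)
    # Open an assistant turn without closing it: the model starts filling
    # from here and signals that it is done by emitting the end-of-turn token.
    parts.append(f"{START_HEADER}assistant{END_HEADER}\n\n")
    return "".join(parts)


if __name__ == "__main__":
    conversation = [
        # Hypothetical system prompt; the original elides its content.
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris"},
        {"role": "user", "content": "How many people live there? Tell me just the number"},
    ]
    print(render_prompt(conversation))
```

The design point is that the model itself stays a plain text continuer. The turn structure lives entirely in the string: every finished turn is closed with `<|eot_id|>`, and the final assistant header is left open so that the most natural continuation is the assistant's next reply.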
This excerpt is published under fair use for community discussion. Read the full article at PyTorch.