Reprinted from AI Evolution-Peanut
Introduction
At the Microsoft Build developer conference a little over a month ago, OpenAI scientist Andrej Karpathy gave a talk titled "State of GPT" about how Large Language Models (LLMs) are trained, what their characteristics are, and how to get better results from them.
I've previously shared an important point Andrej made in this talk: LLMs don't want to succeed, but you can demand success from them.
The first 20-plus minutes of his talk, about how ChatGPT was trained, were quite technical, and many people found them hard to follow. For ordinary users who don't work on the underlying technology, though, I don't think we need to dig into every detail. A general understanding of how the model is trained goes a long way toward understanding its characteristics (both strengths and flaws) and gives us insights for how to use it. So today I'll try to give you an introduction based on Andrej's talk and my understanding of other learning materials. I hope you find it enlightening.
Of course, you can also watch Andrej's speech directly: video link here.
Four Stages of ChatGPT Training
Stage One: Pretraining
In the pretraining stage, the first issue is corpus selection. Both the quantity and quality of the training corpus matter. GPT-3 used about 300 billion tokens of training data. OpenAI hasn't disclosed the specific composition of its corpus, but we can use the corpus Meta used to train LLaMA as a reference:
67.0% is Common Crawl, a general web-crawled dataset. It covers a wide range of content types, but because anyone could have written the content, quality can be low and it may include a lot of noise and irrelevant material, such as advertisements, navigation menus, and copyright notices.
15.0% is C4 data (Colossal Clean Crawled Corpus), which contains a large amount of web page text that has been cleaned to remove advertisements, duplicate content, non-English text, and other elements unsuitable for training. The goal of this dataset is to provide a large-scale, high-quality, diverse English text dataset to support various natural language processing tasks. Although C4 has been cleaned, it still contains various texts from the internet, so it may include some low-quality or misleading information.
The remaining 18% of the training corpus is of relatively higher quality, drawn mainly from sources such as GitHub, Wikipedia, books, arXiv papers, and Stack Exchange Q&A content.
Note: Considering the scale and composition of the training corpus, we need to understand that LLMs like ChatGPT have learned almost all the knowledge from all disciplines and fields that humans have published on the internet, so their "common sense" is very rich. However, at the same time, because the proportion of "mediocre" knowledge in the training corpus is too high, and the main goal of the model is to predict the next word, you are likely to get mediocre, average content. You need certain prompt skills to force out higher-level, higher-quality output.
Returning to the training process, after obtaining a large-scale training corpus, OpenAI doesn't directly train on the corpus. Instead, they first decompose the text content into smaller subword units, which we often hear referred to as tokens, for training. You might be confused, as I was, about why they go to this trouble. Why not train directly with complete words? Why tokenize? The logic behind this is as follows:
Handling unknown words: During the training process, the model may encounter words it has never seen before. If tokenization is done at the word level, the model will not be able to handle these unknown words. However, if tokenization is done at the subword or character level, even when encountering unseen words, the model can break them down into known subwords or characters, thus being able to handle unknown words.
Reducing vocabulary size: If tokenization is done at the word level, the size of the vocabulary will be very large, which will increase the complexity and computational burden of the model. Tokenization at the subword or character level can significantly reduce the size of the vocabulary.
Capturing root and affix information: Many English words are composed of roots and affixes (prefixes and suffixes). Tokenization through subword units can help the model capture this root and affix information, which is helpful for understanding and generating text.
Therefore, GPT uses subword units as tokens for training. This lets it handle unknown words, keep the vocabulary small, and capture some of the language's inherent patterns. Most models do this during training, though they may use slightly different tokenization rules: GPT-3's vocabulary contains 50,257 subwords, while LLaMA's contains 32,000.
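To make this concrete, here is a small illustration using OpenAI's open-source tiktoken library, which implements this kind of subword tokenization with the same 50,257-token GPT-2/GPT-3 vocabulary mentioned above. The specific example word is my own choice, not something from the talk:

```python
# A minimal illustration of subword tokenization with the tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)  # 50257, the vocabulary size mentioned above

# A rare word the model has likely never seen as a whole still gets broken
# into known subword pieces, so it can be represented and processed.
token_ids = enc.encode("pseudopseudohypoparathyroidism")
print(token_ids)
print([enc.decode([t]) for t in token_ids])
```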
After obtaining the corpus and tokenizing it, the actual pretraining process begins. Pretraining is actually a process of constantly covering the model's eyes and letting it guess what the next word is. The goal of training is to iterate continuously, making the model's guess of the next word the same as the actual next word in the text content. This involves an indicator called training loss, which is a measure of the prediction errors of machine learning models on training data. Simply put, the loss function is a way to measure the gap between the model's predictions and the actual targets. During training, the model's goal is to minimize this loss.
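As a rough sketch (my own illustration in PyTorch, not OpenAI's code), the objective looks like this: shift the token sequence by one position and use a cross-entropy loss to measure how far the model's predictions are from the actual next tokens.

```python
# A toy sketch of the next-token prediction objective used in pretraining.
import torch
import torch.nn.functional as F

vocab_size = 50257                                    # GPT-3's vocabulary size
tokens = torch.tensor([[464, 3290, 318, 845, 922]])   # hypothetical token ids

# In a real model these logits come from the Transformer; random numbers
# stand in for them here just to show the shape of the computation.
logits = torch.randn(1, tokens.shape[1] - 1, vocab_size)

targets = tokens[:, 1:]                               # the "actual next tokens"
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())  # training repeatedly adjusts the model to push this down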
The Base Model obtained at this stage, as mentioned earlier, aims to predict the next word rather than act as a chatbot or assistant. For example, if you give this model the prompt:
What is the permanent resident population of Beijing?
The result might be:
What is the permanent resident population of Shanghai? What is the permanent resident population of Guangzhou? What is the permanent resident population of Shenzhen?
This is because in GPT's training corpus, these kinds of questions often appear close together. However, as early as GPT-2, many people discovered that they could use prompt techniques to make the model play the role of an assistant or answer questions. This is achieved by writing prompts similar to the following:
Q: What is the area of Beijing?
A: 16,000 square kilometers
Q: How many administrative districts does Beijing have?
A: 16
Q: What is the permanent resident population of Beijing?
A:
At this point, the result you might get is:
21.84 million
The essence of this approach is to disguise your question, or whatever you need the base model to do, as a gap in a continuous document and let the model try to complete it. However, this process is unstable and unpredictable, and the results are often unsatisfactory. It also demands a lot from users, so such models are usually not offered to ordinary users; instead they are used by developers, which requires certain development capabilities and prompt skills.
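In code, that "disguise the question as a document to complete" trick might look like the sketch below. It uses the legacy OpenAI completions endpoint against a base (non-chat) model; the model name and parameters are illustrative assumptions, not something specified in the talk.

```python
# A sketch of few-shot prompting against a base completion model
# (legacy openai-python < 1.0 interface; the model name is illustrative).
import openai

few_shot_prompt = """Q: What is the area of Beijing?
A: 16,000 square kilometers
Q: How many administrative districts does Beijing have?
A: 16
Q: What is the permanent resident population of Beijing?
A:"""

response = openai.Completion.create(
    model="davinci",        # a base model that only predicts the next token
    prompt=few_shot_prompt,
    max_tokens=16,
    temperature=0,
    stop=["\n"],            # stop before the model writes the next "Q:"
)
print(response["choices"][0]["text"].strip())
```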
Stage Two: Supervised Finetuning
To address the issue of the base model only predicting the next word without understanding human instructions, supervised finetuning was added to the training process. Essentially, this means showing the model a large set of example "prompt" and "response" pairs, so that it learns to follow instructions and interpret the meaning of human input.
After this processing layer, the model effectively understands the relationship between a prompt instruction and the content it should generate. You can also view this as baking the few-shot prompting technique directly into the model through fine-tuning. After this step, the model has transformed from simply predicting the next word into playing the role of an assistant.
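To picture what such an example might look like, here is a hypothetical supervised finetuning record in the prompt/response style the talk describes; OpenAI's actual datasets and formats are not public.

```python
# A hypothetical SFT training example; the real data OpenAI uses is not public.
sft_example = {
    "prompt": "Write a short, polite apology email for a delayed shipment.",
    "response": (
        "Dear customer,\n\n"
        "We are sorry that your order arrived later than promised. "
        "We have upgraded shipping on your next order free of charge.\n\n"
        "Best regards,\nThe Support Team"
    ),
}

# During finetuning the prompt and response are concatenated and the model is
# trained with the same next-token objective as in pretraining; typically the
# loss is computed only on the response tokens, so the model learns to answer
# rather than to continue the prompt.
```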
Stage Three and Four: Reward Modeling and Reinforcement Learning
I've combined these two stages because stage three essentially serves stage four and has no independent significance. Most models are only trained through stages one and two; the latter two stages may involve proprietary techniques used by OpenAI. Currently, in the English-speaking world, only ChatGPT and Claude (made by Anthropic, a company founded by former OpenAI employees) have undergone this training process. GPT-3 was trained in 2020, and its API was quickly made available, but it didn't generate widespread attention at the time. It was ChatGPT, built by fine-tuning a GPT-3.5 model, that sparked large-scale interest and public awareness when it was released in late 2022. This roughly indicates how challenging these latter two training stages are and how significant their impact is.
In the Reward Modeling stage, OpenAI has the SFT model from the previous stage generate multiple responses to hundreds of thousands of prompts. Human contractors then compare and rate the different responses to the same prompt. Based on the data collected in this stage, OpenAI trains a reward model that predicts the rating a human would give a response.
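The usual way to turn those human comparisons into a training signal (as described in the InstructGPT paper; I am assuming ChatGPT follows a similar recipe) is a pairwise loss that pushes the score of the preferred response above the score of the rejected one:

```python
# A minimal sketch of the pairwise ranking loss commonly used to train
# reward models; OpenAI's exact implementation is not public.
import torch
import torch.nn.functional as F

# Scalar scores the reward model assigns to two responses to the same prompt.
score_chosen = torch.tensor(1.3)    # response the human contractor preferred
score_rejected = torch.tensor(0.4)  # response the contractor ranked lower

# The loss shrinks as the preferred response's score rises above the other's.
loss = -F.logsigmoid(score_chosen - score_rejected)
print(loss.item())
```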
In the Reinforcement Learning stage, thanks to everything built up in the previous three stages, the model can both generate responses to prompts and predict the rating its own responses are likely to receive. The task in this stage is therefore to keep iterating, steering the model toward generating responses that are likely to receive higher scores.
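A toy sketch of the reward signal used in this stage is shown below: the reward model's score minus a penalty for drifting too far from the SFT model. This recipe (including the KL penalty and the PPO update it feeds) comes from the InstructGPT paper; OpenAI's exact setup for ChatGPT is not public, and the numbers here are made up for illustration.

```python
# A toy sketch of the per-response reward used in RLHF-style training.
import torch

reward_model_score = torch.tensor(0.9)           # predicted human rating

# Per-token log-probabilities of the generated response under the current
# policy and under the frozen SFT model (made-up numbers for illustration).
logprobs_policy = torch.tensor([-1.2, -0.7, -2.1])
logprobs_sft = torch.tensor([-1.5, -0.9, -1.8])

beta = 0.02                                      # strength of the KL penalty
kl_penalty = beta * (logprobs_policy - logprobs_sft).sum()

total_reward = reward_model_score - kl_penalty
print(total_reward.item())  # the policy is updated (e.g. with PPO) to raise this
```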
You can essentially think of these two stages as aligning the model with human preferences, so that it generates responses more likely to receive high scores from humans, specifically the human contractors involved in the reward modeling stage. Fundamentally, if the people involved in the training process are mediocre or biased, the resulting model will inherit those characteristics.
On the model leaderboard Andrej showed in his talk, the top three ranked models are all Reinforcement Learning from Human Feedback (RLHF) models.
This concludes the introduction to how ChatGPT was trained. Based on these model characteristics, the next issue will detail which prompt strategies we should adopt to maximize the model's advantages while minimizing its flaws and biases.