Andrej Karpathy made a 3.5-hour-long video about how LLMs like ChatGPT work. It’s a great video for understanding the inner workings of LLMs.

It has a treasure trove of information, but it’s a very long video, so I wanted to summarize the key points here so that I don’t have to watch the whole thing again.

Section 1: Training Data

LLMs are trained on a large corpus of text data. This data is usually proprietary and not released to the public, but there are some open-source corpora available.

He recommends FineWeb as a good, freely available source of training data. It’s built from Common Crawl data that has been cleaned and preprocessed. Read this article about what went into building the dataset.

Some stats:

- Raw data: 2.7 billion web pages, totaling 386 TiB of uncompressed HTML text content
- Preprocessed data: 15 trillion tokens, taking up 44 TB of disk space
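
If you want to poke at FineWeb yourself, here’s a minimal sketch using the Hugging Face datasets library. The dataset ID and the sample-10BT config name are my assumptions based on the FineWeb release; double-check them on the Hugging Face hub before running.

```python
from datasets import load_dataset

# Stream a small sample instead of downloading all ~44 TB to disk.
fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

for i, row in enumerate(fw):
    print(row["text"][:200])  # each row is one cleaned-up web page
    if i == 2:
        break
```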

Section 2: Tokenization

Next, you take the raw text data and tokenize it. He used this website (mirror) to demonstrate how tokenization works.

Basically, you take the text and break it into tokens. Each token is assigned a unique integer ID. Tokens are not characters, nor words, but chunks of characters that often show up in text.
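
To see this in practice, here’s a small sketch using OpenAI’s tiktoken library (my substitution; the video uses a web demo, but the idea is the same):

```python
import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Tokenization breaks text into chunks.")
print(ids)                              # a list of integer token ids
print([enc.decode([i]) for i in ids])   # the chunk of characters behind each id
```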

Section 3: Training

Then he spoke about how LLMs are trained using a neural network, mainly a transformer model, which is visualized here. Basically, it’s a token predictor: it is trained to take a sequence of tokens and predict the next token. In a way, you could say it’s a document completer.
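
As a rough sketch of what “predict the next token” means as a training objective, here is a toy stand-in model in PyTorch (my own illustration, not the actual transformer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 50257
model = nn.Sequential(                 # toy stand-in for the transformer
    nn.Embedding(vocab_size, 64),
    nn.Linear(64, vocab_size),
)

tokens = torch.randint(0, vocab_size, (1, 128))   # one training sequence of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # each position predicts the NEXT token

logits = model(inputs)                            # shape: (1, 127, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # gradients nudge the parameters toward better predictions
```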

Because it’s a very large neural network with billions of parameters, and because the training data is so huge, training the model requires a lot of compute. Since the work can be parallelized, GPUs are used to train the model.

He also covered how advances over the last few years have brought down the cost of training these models. For example, training GPT-2 is estimated to have cost around $40K in 2019, but today he could train a similar model for about $672, without even trying to optimize the cost. He believes this cost could be brought down to around $100, which is about 400x cheaper than it was in 2019.

He credits this efficiency boost to advances in hardware, developer tools, and algorithmic/computational optimizations.

Here is the GPT-2 Research Paper from OpenAI: Language Models are Unsupervised Multitask Learners

After training, you get a Base Model. A base model is a document completer: very powerful, but essentially just a very expensive auto-completer. To turn it into an AI assistant, you need post-training, which is covered in the next section.

Building the base model is very expensive, but once you have it, post-training is relatively cheap. So it’s a good idea to use an existing base model and then post-train it for a specific use case.

There are several open-source base models. He used Llama 3.1 405B BF16 (Meta’s Llama 3.1 Announcement, Meta’s Llama 3 Paper) as the base model for this tutorial, and he used Hyperbolic to demo the base model and its capabilities.
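
If you don’t have access to a 405B model, you can get the same qualitative feel locally with a much smaller open base model. Here’s a sketch using GPT-2 via the transformers library (my substitution, not what the video uses):

```python
from transformers import pipeline

# A base model simply continues whatever text you give it.
generator = pipeline("text-generation", model="gpt2")
out = generator("The first three US presidents were",
                max_new_tokens=30, do_sample=True)
print(out[0]["generated_text"])   # the model keeps writing the "document"
```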

Section 4: Document Completers to AI Assistants

Taking the base model and turning it into an AI assistant requires post-training. Post-training needs a dataset of input/output pairs that tells the model what kind of input to expect and how it should respond.
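
For concreteness, a single post-training record might look something like this (a hypothetical example; the exact schema varies between datasets):

```python
# One hypothetical supervised fine-tuning record: a conversation the
# assistant should learn to imitate.
sft_example = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}
```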

This is the phase where you can give the model a certain personality and train it to act in a certain way.

Then he showed a technique for training the model to always respond in a certain format, using special delimiter tokens such as <|im_start|>, <|im_end|>, and <|im_sep|>. The model does not see these tokens during pre-training; they are added during post-training to mark who is speaking and where the model should respond.
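
Here’s a rough sketch of how a conversation might be flattened into one token stream with those delimiters (the exact template differs from model to model; this only illustrates the idea):

```python
def render_chat(messages):
    # Wrap each turn in the special delimiter tokens, then leave an open
    # assistant turn for the model to complete.
    parts = [f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>"
             for m in messages]
    parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)

print(render_chat([{"role": "user", "content": "What is the capital of France?"}]))
# <|im_start|>user<|im_sep|>What is the capital of France?<|im_end|><|im_start|>assistant<|im_sep|>
```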

The special-token approach comes from OpenAI’s InstructGPT paper, Training language models to follow instructions with human feedback.

OpenAI didn’t release the post-training data for InstructGPT, but there are some open datasets available. One such dataset is OpenAssistant. Another is UltraChat, which uses generative AI to create the post-training dataset.
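
Both are on the Hugging Face hub, so you can peek at them with the datasets library. The dataset ID below is my best guess at the current name; check the hub before running.

```python
from datasets import load_dataset

# Stream one record from the OpenAssistant conversations dataset.
oasst = load_dataset("OpenAssistant/oasst1", split="train", streaming=True)
print(next(iter(oasst)))   # one crowd-sourced conversation turn
```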

– WORK IN PROGRESS –

An SFT dataset (Supervised Fine-Tuning dataset) is a dataset of input/output pairs that is used to train the model to act a certain way.

Knowledge in the parameters == vague recollection (e.g., of something you read a month ago)

Knowledge in the tokens of the context window == working memory