Lesson 6.2: Training LLMs
The pretraining stage is the initial phase where a Large Language Model (LLM) learns general language understanding from vast amounts of unlabeled text data. This stage is compute-intensive and forms the backbone of models like GPT, LLaMA, and BERT.
LLM Pretraining Step 1: Process the Internet Data
The performance of a large language model (LLM) is deeply influenced by the quality and scale of its pretraining dataset: a clean, well-structured, and carefully filtered corpus translates directly into a better model. The key filtering steps are summarized below, followed by a sketch of how such a pipeline might be chained together.
Key Filtering Steps
- URL Filtering: Blocks spam/adult content domains.
- Text Extraction: Removes HTML/JS, keeps clean text.
- Language Filtering: Keeps English (fastText score ≥ 0.65).
- Gopher Filtering: Removes low-quality/repetitive text.
- MinHash Deduplication: Eliminates near-duplicate content.
- C4 Filters: Removes boilerplate/noise.
- Custom Filters: Applies domain-specific rules.
- PII Removal: Scrubs personal data (emails, phone numbers).
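As a rough illustration, the sketch below chains simplified stand-ins for some of these stages in Python. The blocklist, the 50-word quality threshold, the exact-hash deduplication, and the e-mail regex are all illustrative assumptions; production pipelines implement each stage with far more sophisticated tooling (fastText language ID, Gopher heuristics, MinHash near-duplicate detection) at web scale.

```python
import re
import hashlib

BLOCKED_DOMAINS = {"spam.example.com"}        # assumption: tiny illustrative blocklist
TAG_RE = re.compile(r"<[^>]+>")               # crude HTML tag stripper
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

seen_hashes = set()                           # stands in for MinHash deduplication

def clean_document(url: str, raw_html: str):
    """Simplified stand-in for the filtering stages; returns clean text or None."""
    domain = url.split("/")[2] if "//" in url else url
    if domain in BLOCKED_DOMAINS:             # URL filtering: drop blocked domains
        return None
    text = TAG_RE.sub(" ", raw_html)          # text extraction: strip HTML/JS markup
    text = re.sub(r"\s+", " ", text).strip()
    # (language filtering omitted here; real pipelines score text with fastText)
    if len(text.split()) < 50:                # crude quality filter (Gopher uses richer heuristics)
        return None
    digest = hashlib.md5(text.lower().encode()).hexdigest()
    if digest in seen_hashes:                 # exact-duplicate removal (real pipelines catch near-dupes)
        return None
    seen_hashes.add(digest)
    text = EMAIL_RE.sub("[EMAIL]", text)      # PII scrubbing: mask e-mail addresses
    return text
```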
LLM Pretraining Step 2: Tokenization
Tokenization is the process of breaking down raw text into smaller, machine-readable units called tokens—the basic "vocabulary" that a neural network understands. This step is critical because LLMs don’t process raw text; they process numbers representing tokens.
Why Tokenization Matters
- Neural Networks Need Numbers
  - Models like GPT or LLaMA operate on numerical inputs, not text.
  - Tokenization maps text → integers (IDs) → embeddings (vectors).
- Efficiency & Generalization
  - Splits text into meaningful chunks (e.g., words, subwords, characters).
  - Balances vocabulary size (too small → long sequences; too large → sparse learning).
But how exactly do we turn a massive text corpus into tokens that a machine can understand and learn from?
- From Raw Text to One-Dimensional Sequence: Neural networks cannot process raw text directly; they require input as a finite sequence of symbols. Text must first be converted into a structured, one-dimensional sequence of discrete units (tokens) that the model can interpret numerically. This transformation bridges human-readable language and machine-processable data.
- Binary Representation – Bits and Bytes: At the lowest level, computers represent text as binary (0s and 1s). Each character is encoded into 8 bits (1 byte), allowing 256 possible values (0–255). This byte-level encoding forms the foundational vocabulary for text processing, where every symbol (e.g., letters, spaces) maps to a unique byte ID. UTF-8 extends this to support multilingual characters using 1–4 bytes per symbol.
- Reducing Sequence Length – Beyond Bytes: While byte-level encoding is universal, it creates long sequences (e.g., 1 byte per character). To optimize, methods like Byte Pair Encoding (BPE) merge frequent byte pairs into single tokens. For example, recurring sequences like "th" or "ing" become unique tokens, shortening the sequence while expanding the vocabulary dynamically.
- Vocabulary Size – Trade-off Between Length and Granularity: LLMs strike a balance between sequence length and token granularity by capping vocabulary size (e.g., GPT-4 uses 100,277 tokens). Larger vocabularies capture more linguistic features (like common phrases) but require more memory, while smaller ones increase sequence length. BPE iteratively merges tokens until this balance is achieved.
- Tokenizing Text – Practical Insights: Modern tokenizers (e.g., GPT-4’s cl100k_base) split text into tokens based on learned patterns. For example, "hello world" becomes two tokens ("hello" and " world"), while subtle changes (like extra spaces) alter tokenization. This impacts model efficiency: shorter sequences speed up training, but overly aggressive merging may lose meaning. Tokenizers are trained on massive corpora to optimize these splits.
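The sketch below reproduces this behaviour with the tiktoken library (assumed installed via `pip install tiktoken`), which provides the cl100k_base encoding used by GPT-4.

```python
# Sketch: tokenizing text with tiktoken's cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # ~100k-token vocabulary

text = "hello world"
token_ids = enc.encode(text)                  # text -> integer token IDs
print(token_ids)                              # two IDs: one for "hello", one for " world"
print([enc.decode([t]) for t in token_ids])   # inspect the text of each token

# Small changes in the input change the tokenization:
print(len(enc.encode("hello  world")))        # an extra space yields a different split
```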
LLM Pretraining Step 3: Neural Network
Neural Network I/O
The input to the neural network consists of sequences of tokens derived from a dataset through tokenization. Tokenization breaks down the text into discrete units, which are assigned unique numerical IDs. In this example, consider the text "If you are done with step1", which tokenizes into seven tokens:

| Token ID | Token |
|---|---|
| 2746 | "If" |
| 499 | "you" |
| 527 | "are" |
| 2884 | "done" |
| 449 | "with" |
| 3094 | "step" |
| 16 | "1" |
These tokens are fed into the neural network as context; the network's task is to predict the next token in the sequence.
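The following PyTorch sketch shows how those token IDs become the numerical input the network actually consumes. The embedding dimension (768) and the randomly initialized embedding table are illustrative assumptions, not GPT-4's actual configuration.

```python
# Sketch: token IDs -> embedding vectors, the numerical input to the network.
import torch
import torch.nn as nn

vocab_size = 100_277            # GPT-4's cl100k_base vocabulary size
d_model = 768                   # assumed embedding width for illustration

token_ids = torch.tensor([[2746, 499, 527, 2884, 449, 3094, 16]])  # "If you are done with step1"

embedding = nn.Embedding(vocab_size, d_model)   # randomly initialized lookup table
x = embedding(token_ids)        # shape: (batch=1, seq_len=7, d_model)
print(x.shape)                  # torch.Size([1, 7, 768])
```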
Processing: Probability Distribution Prediction
Once the token sequence is passed through the neural network, it generates a probability distribution over a vocabulary of possible next tokens. In this case, the vocabulary size of GPT-4 is 100,277 unique tokens. The output is a probability score assigned to each possible token, representing the likelihood of its occurrence as the next token.
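As a small illustration of this step, the sketch below uses random logits as a stand-in for the network's output and converts them into a probability distribution over a 100,277-token vocabulary with softmax.

```python
# Sketch: raw scores (logits) -> probability distribution over the vocabulary.
# The random logits stand in for what a trained transformer would produce.
import torch

vocab_size = 100_277
logits = torch.randn(vocab_size)          # one score per vocabulary token
probs = torch.softmax(logits, dim=-1)     # probabilities summing to 1

print(probs.sum())                        # ~1.0
top_p, top_id = probs.max(dim=-1)
print(f"most likely next token id: {top_id.item()} (p = {top_p.item():.4%})")
```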
Backpropagation and Adjustment
To correct its predictions, the neural network goes through a mathematical update process (a minimal sketch follows the list):
- Calculate Loss – A loss function (like cross-entropy loss) measures how far the predicted probabilities are from the correct probabilities. A lower probability for the correct token results in a higher loss.
- Compute Gradients – Backpropagation computes the gradient of the loss with respect to every weight, indicating how each weight should change to reduce the loss.
- Update Weights – The model’s internal parameters (weights) are nudged slightly via gradient descent, so that the next time it sees the same context it assigns a higher probability to the correct next token and lower probabilities to incorrect options.
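A minimal PyTorch sketch of a single update step is shown below. The tiny model, the learning rate, and the target token ID (311) are illustrative assumptions; a real LLM is a transformer trained at vastly larger scale, but the loss → backpropagation → weight-update mechanics are the same.

```python
# Sketch of one training update: cross-entropy loss, backpropagation, gradient step.
import torch
import torch.nn as nn

vocab_size, d_model = 100_277, 32                 # d_model kept small for illustration
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),            # token IDs -> vectors
    nn.Flatten(start_dim=1),                      # concatenate the 7 context vectors
    nn.Linear(7 * d_model, vocab_size),           # -> one logit per vocabulary token
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

context = torch.tensor([[2746, 499, 527, 2884, 449, 3094, 16]])  # "If you are done with step1"
target = torch.tensor([311])                      # hypothetical ID of the true next token

logits = model(context)                           # shape: (1, vocab_size)
loss = nn.functional.cross_entropy(logits, target)  # high if correct token gets low probability
loss.backward()                                   # backpropagation: compute gradients
optimizer.step()                                  # gradient descent: nudge the weights
optimizer.zero_grad()
```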
Training and Refinement
The neural network updates its parameters using a mathematical optimization process. Given the correct token, the training algorithm adjusts the network weights such that:
- The probability of the correct token increases.
- The probabilities of incorrect tokens decrease.
For instance, after an update, the probability of a token may increase from 4% to 6%, while the probabilities of other tokens adjust accordingly. This iterative process occurs across large batches of training data, refining the network’s ability to model the statistical relationships between tokens.
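To see the probabilities shift the way this paragraph describes, the sketch below (reusing the hypothetical model, optimizer, context, and target from the previous sketch) checks the probability of the correct token before and after one update; the exact numbers will differ from the 4% → 6% illustration.

```python
# Sketch: verify that one update raises the probability of the correct token.
# Reuses model, optimizer, context, and target from the previous sketch.
import torch
import torch.nn.functional as F

def prob_of_target():
    with torch.no_grad():
        return F.softmax(model(context), dim=-1)[0, target.item()].item()

before = prob_of_target()
loss = F.cross_entropy(model(context), target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
after = prob_of_target()
print(f"p(correct token): {before:.4%} -> {after:.4%}")   # should increase after the step
```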
Through continuous exposure to data and iterative updates, the neural network improves its predictive capability. By analyzing context windows of tokens and refining probability distributions, it learns to generate text sequences that align with real-world linguistic patterns.