Pretraining — AI Learning Course

§ 01

1 — Pretraining

Stage 1 of the LLM pipeline

Pretraining — teaching a model everything from raw text

The most expensive and foundational step. The model is trained on trillions of tokens using one objective: predict the next token. No human labels are needed — the text itself is the supervision signal. This phase creates the "base model" — a powerful text predictor that has implicitly learned grammar, facts, reasoning, and world knowledge.

Raw text corpusFilter + dedupeTokenise (BPE)Train on GPUsBase model

Pipeline — steps light up in order

Data curation — most critical step

Garbage in, garbage out

Data quality determines model quality more than almost any other factor. Modern pipelines apply: deduplication (same text appearing 100× should not dominate), quality classifiers (filtering low-quality web pages), domain balancing (mixing code, books, Wikipedia, conversations), and toxicity filters. Llama 3 used 15 trillion tokens. GPT-3 was trained on ~300B tokens sampled from a ~500B-token curated corpus.

Training objective

One simple loss, trillion-scale consequences

The model sees a sequence of tokens and must predict each next token from all previous ones. The loss is cross-entropy between predicted probabilities and the actual next token. This single objective — applied at massive scale — causes the model to implicitly learn grammar, facts, syntax, reasoning chains, code patterns, and more. No human labels ever needed.

Chinchilla scaling laws — Hoffmann et al. 2022

How much compute, how many tokens, how big a model?

The Chinchilla paper showed that most LLMs before 2022 were overtrained on model size and undertrained on data. The optimal ratio is roughly 20 tokens per parameter — a 7B model should see ~140B tokens for compute-optimal training. Llama 3 deliberately violated this by training on 15T tokens — over-training for inference efficiency. A smaller model that trains longer can be served more cheaply at deployment scale even if it costs more to train.

tokens in Llama 3 pretraining

~$100M+

estimated GPT-4 training cost

0 tok/param

Chinchilla optimal ratio

Continued pretraining

Domain adaptation without full retraining

After initial pretraining on general text, a model can be continued-pretrained on domain-specific data — medical literature, legal documents, code, a new language — using the same next-token objective. Much cheaper than full pretraining. Key findings from Unsloth research: train on ALL linear layers including embed_tokens and lm_head, use Rank-Stabilized LoRA (rsLoRA) at rank 256, and use different learning rates for embeddings to stabilise training.

2026

What modern pretraining actually looks like

The recipe has moved well past "scrape the web and train once." Frontier runs now lean on heavily filtered and synthetic data — FineWeb-Edu/DCLM-style quality classifiers score every document, and model-generated text fills targeted gaps. Models train far past Chinchilla-optimal token counts for inference efficiency. A mid-training/annealing phase finishes the run on the highest-quality data (textbooks, math, code) at a decayed learning rate. Architecturally, Mixture-of-Experts dominates at the frontier, and FP8 precision is standard for frontier-scale training.

§ 02

2 — Optimizations

Training at scale requires engineering solutions

Why naive LLM training fails — and how each optimization fixes it

Training GPT-3 naively in FP32 with a fixed learning rate and no distributed strategy would take decades on a single GPU and crash from memory overflow on the first step. Every modern training run combines a dozen engineering solutions simultaneously.

GPU memory breakdown — training a 7B model (full fine-tuning)

Mixed-precision training costs roughly 16 bytes per parameter:
Weights 14 GB (BF16)
Gradients 14 GB (BF16)
Master weights 28 GB (FP32 copy kept by the optimizer)
Optimizer states 56 GB (AdamW: 2 moment buffers in FP32)
Total ≈ 112 GB — before activations. Far beyond any single consumer GPU. Solutions: mixed precision, gradient checkpointing, ZeRO, LoRA/QLoRA.

§ 03

3 — Supervised fine-tuning

Stage 2 — from base model to assistant

Supervised Fine-Tuning (SFT) — teaching the model to follow instructions

A base model trained only on next-token prediction will complete your text, not answer your question. If you ask "What is the capital of France?", it might respond "What is the capital of Germany?" — because that is the kind of text it has seen. SFT teaches the model that a question expects a specific type of answer, using thousands to hundreds of thousands of (prompt, completion) pairs.

Base modelInstruction datasetSFT trainingInstruction tuned modelRLHF / DPO

Pipeline — steps light up in order

Dataset formats

What instruction data looks like

Data comes as (prompt, completion) pairs. Common formats: Alpaca (instruction + input + output fields), ShareGPT (multi-turn conversations), ChatML (system + user + assistant turns). Quality matters far more than quantity — 1,000 carefully curated examples often outperform 100,000 noisy ones.

Train on completions only

Don't compute loss on the prompt

Compute the cross-entropy loss ONLY on the assistant's response tokens, not on the user's prompt. This teaches the model to generate good answers rather than to reproduce questions. The QLoRA paper showed this significantly improves accuracy for multi-turn conversations. In Unsloth: train_on_responses_only.

Full fine-tuning vs LoRA vs QLoRA

The three SFT strategies — memory comparison for a 7B model

Full fine-tuning updates all weights — highest quality but risks catastrophic forgetting and requires enormous compute. LoRA freezes base weights and trains only small adapter matrices (~1% of params). QLoRA further quantizes the frozen base to 4-bit NF4, fitting a 70B model in roughly 40–48 GB — a single 48 GB GPU.

Instruction tuning

Fewer examples, better generalization

Instruction tuning uses (prompt, completion) pairs that demonstrate behaviour — not task-specific answers. Models learn to generalize the behaviour pattern. InstructGPT showed a 1.3B model — instruction-tuned and then RLHF-trained, not SFT alone — outperformed a raw 175B GPT-3 on helpfulness ratings from human evaluators.

Catastrophic forgetting

The danger of full fine-tuning

When all weights are updated during SFT, the model can "forget" general abilities learned during pretraining. Fine-tuning a coding model might make it lose conversational ability. PEFT methods (LoRA, adapters) mitigate this by keeping base weights frozen and only training small additions.

§ 04

4 — PEFT

Parameter-Efficient Fine-Tuning

Fine-tune a 70B model on your laptop — by training only 1% of its weights

PEFT is a family of techniques that adapt a large pretrained model to a new task by training only a tiny fraction of its parameters, while keeping the base model frozen. The result: most of the benefit of full fine-tuning at 1–10% of the compute and memory cost. This has democratised LLM adaptation — researchers can now fine-tune state-of-the-art models on consumer hardware.

LoRA — how the low-rank decomposition works

The original weight matrix W (d×d) is frozen — never updated. Two small matrices B (d×r) and A (r×d) are added in parallel, where r is the rank (typically 8–64, much smaller than d). Their product ΔW = B·A is d×d, matching W. Only A and B are trained. Forward pass output: Wx + (BA)x × (α/r) B is initialised to zero so the adapter starts with no effect on the model. A is random Gaussian. At inference, BA can be merged directly into W with zero overhead. Input x frozen W trained A (r×d) ↓ B (d×r) Wx + BAx

Target modules

Where LoRA adapters are inserted

LoRA adapters are injected into selected linear weight matrices. Common targets: Q, K, V, O (attention projections), up_proj, down_proj, gate_proj (feed-forward layers). Training ALL linear layers including embed_tokens and lm_head produces better results, especially for continued pretraining.

Key hyperparameters

The knobs to tune

r (rank) — capacity vs memory. r=8 minimal, r=16 standard, r=64–256 for continued pretraining. alpha — scaling factor, typically r or 2r. dropout — 0.05–0.1 to prevent overfitting. learning rate — 2e-4 for standard SFT. If training loss falls below 0.2, the model is likely overfitting.

§ 05

The pretraining
pipeline.