Phase 5 • EduArtha

Large Language Models (LLMs)

This is the core of how modern AI works — Transformer architecture, pre-training on text, and alignment techniques. Every chatbot, code assistant, and AI agent is built on these foundations.

⏱ 6–12 months | 14 Chapters | 50+ Exercises | Industry Problems

Part I

Transformer Architecture

The architecture that changed everything

Chapter 1

Self-Attention & Multi-Head Attention

Learning Objectives

Implement scaled dot-product attention from scratch
Understand queries, keys, values — the information retrieval analogy
Build multi-head attention and understand why multiple heads help
Compute attention complexity and memory requirements

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

Python
import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project and reshape to [B, n_heads, T, d_k]
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)

        # Scaled dot-product attention
        scores = (Q @ K.transpose(-2,-1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = self.dropout(torch.softmax(scores, dim=-1))
        out = (attn @ V).transpose(1,2).contiguous().view(B, T, C)
        return self.W_o(out)

# Causal mask for autoregressive (GPT-style)
def causal_mask(T):
    return torch.tril(torch.ones(T, T)).unsqueeze(0).unsqueeze(0)

Industry Problem: Quadratic Memory in Long Documents

Problem: Self-attention is O(n²) in sequence length. Processing a 100K-token legal contract requires 100K × 100K = 10 billion attention scores per layer per head (about 20 GB per head in FP16), which does not fit in GPU memory if the full matrix is materialized.

Solutions: (1) Flash Attention: fuses operations and computes attention in tiles, reducing memory from O(n²) to O(n) (compute stays O(n²)). (2) Sliding window attention (Mistral): attend to local windows. (3) Ring attention: distributes the sequence across GPUs. (4) Sparse attention (BigBird): attend to a subset of positions (local, global, and random).

Exercises

Exercise 1.1: Why scale by √dₖ and what happens without it?

Without scaling, dot products grow with dimension dₖ (variance ≈ dₖ for random vectors). Large values push softmax into saturation — one position gets ~100% attention, gradients vanish. Scaling by √dₖ keeps variance at ~1, ensuring softmax outputs are smooth and informative. For d_k=64: scores ÷ 8.
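A quick numerical check of the variance claim, as a minimal sketch. It assumes query and key components are independent standard-normal values (an idealization of real activations), and `dot_product_variance` is a hypothetical helper name, not a library function.

```python
import math
import random

def dot_product_variance(d_k, n_samples=2000, scale=False, seed=0):
    """Empirical variance of q·k for random q, k with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        dots.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(dots) / n_samples
    return sum((x - mean) ** 2 for x in dots) / n_samples

raw_var = dot_product_variance(64)                 # close to d_k = 64
scaled_var = dot_product_variance(64, scale=True)  # close to 1
print(f"unscaled: {raw_var:.1f}, scaled: {scaled_var:.2f}")
```

With d_k = 64 the unscaled scores have a standard deviation near 8, which is enough to push softmax toward a one-hot output; dividing by √dₖ brings the spread back to about 1.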

Exercise 1.2: Compute memory for MHA with d_model=4096, n_heads=32, seq_len=8192

Attention matrix per head: 8192 × 8192 × 4 bytes (FP32) = 256 MB. With 32 heads: 8 GB for one layer! A 32-layer model would need 256 GB for attention matrices alone if every layer's matrices are stored for the backward pass during training. This is why Flash Attention (which never materializes the full matrix) is essential for long contexts.
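The same arithmetic as a small calculator. This is a sketch for batch size 1 that counts only the materialized score matrices; `attention_matrix_bytes` is an illustrative name, and sizes are reported in binary units (MiB, GiB).

```python
def attention_matrix_bytes(seq_len, n_heads, n_layers=1, bytes_per_score=4):
    """Bytes needed to materialize the full [seq_len, seq_len] score matrices."""
    return seq_len * seq_len * bytes_per_score * n_heads * n_layers

MiB, GiB = 2**20, 2**30
per_head = attention_matrix_bytes(8192, n_heads=1)
per_layer = attention_matrix_bytes(8192, n_heads=32)
all_layers = attention_matrix_bytes(8192, n_heads=32, n_layers=32)
print(per_head / MiB, per_layer / GiB, all_layers / GiB)  # 256.0 8.0 256.0
```

Note that d_model does not appear: the score matrix depends only on sequence length, head count, and precision.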

Exercise 1.3: Why use multiple heads instead of one large attention?

Different heads learn different relationship types: head 1 might attend to syntactic neighbors, head 2 to semantic relationships, head 3 to positional patterns. This is like having multiple "perspectives" on the same data. Empirically, 8-64 heads consistently outperform single-head attention of the same total dimension.

Chapter Summary

Self-attention computes relevance between all position pairs — O(n²) but powerful
Multi-head attention learns diverse relationship types in parallel subspaces
Causal masking enables autoregressive generation (GPT-style LLMs)
Industry challenge: quadratic scaling → solved by Flash Attention and sparse methods

Chapter 2

Positional Encodings: Sinusoidal, RoPE & ALiBi

Learning Objectives

Understand why transformers need position information
Implement sinusoidal, RoPE, and ALiBi encodings
Know which encoding enables long-context extrapolation

Python
import torch, math

# 1. Sinusoidal (Original Transformer)
def sinusoidal_pe(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# 2. RoPE — Rotary Position Embeddings (LLaMA, GPT-NeoX)
def apply_rope(q, k, positions):
    """Rotate query/key vectors by position-dependent angles"""
    d = q.shape[-1]
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))
    angles = positions.unsqueeze(-1) * freqs
    cos_a, sin_a = torch.cos(angles), torch.sin(angles)

    # Rotate pairs of dimensions
    q_rot = torch.stack([q[..., 0::2]*cos_a - q[..., 1::2]*sin_a,
                         q[..., 0::2]*sin_a + q[..., 1::2]*cos_a], dim=-1).flatten(-2)
    k_rot = torch.stack([k[..., 0::2]*cos_a - k[..., 1::2]*sin_a,
                         k[..., 0::2]*sin_a + k[..., 1::2]*cos_a], dim=-1).flatten(-2)
    return q_rot, k_rot

# 3. ALiBi — Attention with Linear Biases (no embeddings!)
# Simply adds a linear bias to attention scores based on distance:
# score(i,j) = q_i · k_j - m · |i - j|
# where m is a head-specific slope. No learned parameters!
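The ALiBi description above can be sketched in plain Python. The slope schedule 2^(-8h/n) is the geometric sequence proposed for power-of-two head counts; the function names `alibi_slopes` and `alibi_bias` are illustrative assumptions, and a real implementation would build the bias as a tensor and combine it with the causal mask.

```python
def alibi_slopes(n_heads):
    """Head-specific slopes m_h = 2^(-8h / n_heads) for h = 1..n_heads."""
    return [2 ** (-8 * h / n_heads) for h in range(1, n_heads + 1)]

def alibi_bias(n_heads, seq_len):
    """bias[h][i][j] = -m_h * |i - j|, added to q_i · k_j before softmax."""
    return [[[-abs(i - j) * m for j in range(seq_len)]
             for i in range(seq_len)]
            for m in alibi_slopes(n_heads)]

bias = alibi_bias(n_heads=8, seq_len=4)
print(alibi_slopes(8)[:3])   # [0.5, 0.25, 0.125]
print(bias[0][3])            # head 0, query 3: [-1.5, -1.0, -0.5, 0.0]
```

Heads with large slopes penalize distance heavily and act locally; heads with small slopes can still attend far back.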

Encoding	Type	Extrapolation	Used In
Sinusoidal	Additive, fixed	Poor beyond training length	Original Transformer
Learned	Additive, trained	Cannot extrapolate	BERT, GPT-2
RoPE	Multiplicative (rotation)	Good with NTK scaling	LLaMA, Mistral, Qwen
ALiBi	Attention bias	Excellent (zero-shot)	BLOOM, MPT

Industry Problem: Context Window Extension

Problem: A model trained on 4K tokens can't handle 128K-token documents. Legal, medical, and enterprise use cases demand long contexts.

Solutions: (1) RoPE + NTK scaling — adjust frequency base to extend context (LLaMA → 100K). (2) YaRN — combines NTK + dynamic scaling. (3) ALiBi — extrapolates to any length zero-shot. (4) Retrieval-Augmented Generation (RAG) — retrieve relevant chunks instead of stuffing everything into context.

Exercises

Exercise 2.1: Why can't transformers understand position without positional encoding?

Self-attention is permutation equivariant: permuting the input tokens permutes the outputs in the same way and changes nothing else. Without position info, "The cat sat on the mat" and "mat the on sat cat The" give each word exactly the same output representation, so the model cannot tell the two sentences apart. Position encodings break this symmetry by injecting order information.
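A small check of this property, as a sketch under simplifying assumptions: one head, no learned projections (Q = K = V = X), no positional signal, and plain Python lists instead of tensors. `self_attention` here is a toy helper, not the module from Chapter 1.

```python
import math

def self_attention(X):
    """Single-head self-attention with Q = K = V = X and no positional signal."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[c] for w, v in zip(weights, X)) for c in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # three "token" vectors
perm = [2, 0, 1]                            # reorder the tokens
Y = self_attention(X)
Y_perm = self_attention([X[i] for i in perm])

# Each token gets the same output vector wherever it sits in the sequence
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
           for row_p, i in zip(Y_perm, perm)
           for a, b in zip(row_p, Y[i]))
print(same)  # True
```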

Exercise 2.2: Why has RoPE become the dominant choice for LLMs?

RoPE encodes relative positions through rotation, so attention between positions i and j depends only on (i-j). It's parameter-free, works with linear attention, and can be extended to longer contexts via frequency scaling (NTK/YaRN). LLaMA, Mistral, Qwen, and most open-source LLMs use RoPE.

Exercise 2.3: How does ALiBi achieve zero-shot context extrapolation?

ALiBi adds a penalty proportional to distance: closer tokens get higher attention regardless of training length. Since it's a linear bias (not a learned position embedding), the model naturally handles any distance — no retraining needed. The penalty slopes vary per head, letting some heads focus locally and others globally.

Chapter Summary

Transformers need explicit position information — attention is permutation-equivariant
RoPE (rotary) dominates modern LLMs with good extrapolation via frequency scaling
ALiBi provides zero-shot length generalization with no learned parameters
Context extension is a major industry challenge solved by RoPE scaling + RAG

Chapter 3

Layer Normalization, FFN & KV Cache

Learning Objectives

Understand Pre-LN vs Post-LN and why Pre-LN won
Master the feed-forward network (FFN) in transformers
Implement KV cache for efficient autoregressive inference

Python
class TransformerBlock(nn.Module):
    """Pre-LN Transformer Block (standard in GPT, LLaMA)"""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)  # Pre-LN: normalize BEFORE attention
        self.attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),                     # SwiGLU in LLaMA
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x, mask=None):
        x = x + self.attn(self.ln1(x), mask)   # Residual + Pre-LN Attention
        x = x + self.ffn(self.ln2(x))           # Residual + Pre-LN FFN
        return x

KV Cache — Critical for Fast Inference

Python
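class CachedAttention(nn.Module):
    """Single-head attention with KV cache for autoregressive generation"""
    def __init__(self, d_model=512):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

    def forward(self, x, kv_cache=None):
        # x: [B, T_new, d_model]; T_new = 1 for each decoding step after the prompt
        Q = self.W_q(x)  # Only compute Q for new tokens
        K_new = self.W_k(x)
        V_new = self.W_v(x)

        if kv_cache is not None:
            K = torch.cat([kv_cache[0], K_new], dim=1)  # Append to cache
            V = torch.cat([kv_cache[1], V_new], dim=1)
        else:
            K, V = K_new, V_new

        # Attention with full K,V but only new Q
        # (no causal mask: fine for one new token per step, a multi-token prompt needs one)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
        out = torch.softmax(scores, dim=-1) @ V
        return out, (K, V)  # Return updated cache
    # Without cache: every step re-projects K,V for all n previous tokens
    # With cache: each step projects only the new token (O(1) instead of O(n));
    # the attention itself is still O(n) per step, since Q attends to all cached keys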

Industry Problem: KV Cache Memory Explosion

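Problem: For a LLaMA-70B-class model with 128K context: KV cache = 2 (K and V) × 80 layers × 8 KV heads × 128 dim × 128K tokens × 2 bytes (FP16) = ~40 GB per request, and that figure already assumes grouped-query attention. With full multi-head attention (64 KV heads) it would be ~320 GB. Serving 100 concurrent users at 40 GB each needs 4 TB of GPU memory!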

Solutions: (1) Grouped Query Attention (GQA): share K,V across head groups (LLaMA-2 70B uses 8 KV heads for 64 query heads → 8x reduction). (2) Multi-Query Attention (MQA): all heads share one K,V (Falcon). (3) PagedAttention (vLLM): allocate KV cache in pages like virtual memory, eliminating waste from fragmentation. (4) KV cache quantization: store the cache in INT8.
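A minimal calculator for the KV cache formula, comparing full multi-head attention with GQA and MQA. The layer count, head dimension, and context length are the ones used in the problem statement; `kv_cache_bytes` is an illustrative helper name, and sizes are in GiB.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """KV cache size for one sequence: K and V for every layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

GiB = 2**30
tokens = 128 * 1024
mha = kv_cache_bytes(80, 64, 128, tokens)  # one K,V pair per query head
gqa = kv_cache_bytes(80, 8, 128, tokens)   # 8 shared KV heads
mqa = kv_cache_bytes(80, 1, 128, tokens)   # a single shared KV head
print(mha / GiB, gqa / GiB, mqa / GiB)     # 320.0 40.0 5.0
```

The cache scales linearly with the number of KV heads, which is why reducing them is the first lever for long-context serving.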

Exercises

Exercise 3.1: Why did Pre-LN replace Post-LN in modern LLMs?

Post-LN (original): Sublayer → Add → LayerNorm. The LayerNorm sits on the residual path, so gradients must pass through it at every layer, which can cause instability in very deep networks and requires careful learning-rate warmup. Pre-LN: LayerNorm → Sublayer → Add. The residual path is clean (identity), enabling stable training of very deep models with far less sensitivity to warmup. Most modern LLMs use pre-normalization (LLaMA uses RMSNorm in place of LayerNorm).

Exercise 3.2: What is SwiGLU and why does LLaMA use it instead of ReLU?

SwiGLU(x) = (Swish(x·W₁) ⊙ (x·W₂))·W₃: a gated linear unit with Swish activation, followed by the output projection W₃. It uses three weight matrices instead of two and empirically gives better quality at equal parameter count. LLaMA, PaLM, and Mistral all use SwiGLU. To keep parameter count similar, d_ff is reduced from 4×d_model to (8/3)×d_model ≈ 2.67×d_model.
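A short check that the reduced hidden size keeps the parameter count roughly unchanged. This is a sketch that ignores biases; `ffn_params` is an illustrative helper, and real models round d_ff to a hardware-friendly multiple rather than truncating.

```python
def ffn_params(d_model, d_ff, gated=False):
    """Weight count of a transformer FFN (biases ignored).

    Standard: W_in [d_model, d_ff] + W_out [d_ff, d_model].
    Gated (SwiGLU): two input projections plus one output projection.
    """
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff

d_model = 4096
standard = ffn_params(d_model, 4 * d_model)                      # d_ff = 4d
swiglu = ffn_params(d_model, int(8 * d_model / 3), gated=True)   # d_ff ≈ 2.67d
print(standard, swiglu)  # 134217728 134209536
```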

Exercise 3.3: Calculate the speedup from KV cache for generating 512 tokens

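Without cache: step t re-runs the model over all t tokens, so token 1 = 1 token pass, token 2 = 2, ... token 512 = 512. Total = 512×513/2 = 131,328 token passes. With cache: each step processes only the new token and reuses the cached K,V. Total = 512 token passes. Speedup: ~256x in projection and FFN work. (The new token's attention still reads every cached K,V entry, so that part remains O(n) per step.)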
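The count above, written out. This sketch measures work in per-token forward passes and ignores the cost of attention over the cache itself; `token_passes` is an illustrative name.

```python
def token_passes(n_tokens, use_cache):
    """Number of per-token forward passes needed to generate n_tokens."""
    if use_cache:
        return n_tokens                      # one new token per step
    return sum(range(1, n_tokens + 1))       # step t reprocesses all t tokens

no_cache = token_passes(512, use_cache=False)
cached = token_passes(512, use_cache=True)
print(no_cache, cached, no_cache / cached)   # 131328 512 256.5
```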

Chapter Summary

Pre-LN: normalize before attention/FFN for stable deep training
SwiGLU FFN outperforms ReLU/GELU in modern LLMs
KV cache eliminates redundant computation — essential for fast generation
GQA/MQA reduce KV cache memory by sharing keys/values across heads

Chapter 4

Encoder, Decoder & Encoder-Decoder Variants

Learning Objectives

Distinguish encoder-only, decoder-only, and encoder-decoder architectures
Know which architecture suits which task
Understand why decoder-only won for generative AI

Architecture	Attention	Best For	Examples
Encoder-only	Bidirectional (sees all tokens)	Understanding (classification, NER)	BERT, RoBERTa, DeBERTa
Decoder-only	Causal (sees only past)	Generation (chat, code, reasoning)	GPT-4, LLaMA, Mistral, Claude
Encoder-decoder	Cross-attention	Translation, summarization	T5, BART, Flan-T5

Why Decoder-Only Won

Decoder-only models (GPT-style) dominate because: (1) They unify understanding and generation in one architecture. (2) They scale better with more parameters and data. (3) Next-token prediction is a universal objective — it teaches reasoning, factual knowledge, and code. (4) In-context learning (few-shot prompting) emerged as a surprise capability of large decoder-only models.

Exercises

Exercise 4.1: Why is BERT better than GPT for classification tasks?

BERT sees all tokens bidirectionally — when classifying "The bank by the river was steep," BERT uses "river" to disambiguate "bank." GPT only sees left context. However, large GPT models close this gap through scale, and instruction-tuned LLMs can match BERT on most NLU tasks via prompting.

Exercise 4.2: How does cross-attention work in encoder-decoder models?

The decoder's queries attend to the encoder's keys and values (not self-attention). This lets the decoder "look at" the input while generating output. In T5: encoder processes the input, decoder generates output token-by-token, using cross-attention to focus on relevant input parts at each step.

Exercise 4.3: Could you use an encoder-only model for generation?

Not directly — encoder-only models (BERT) see all positions simultaneously, so there's no autoregressive generation. You could use masked token prediction iteratively (like in diffusion models for text), but it's much slower and lower quality than causal generation. BERT is designed for understanding, not generation.

Chapter Summary

Encoder-only (BERT): bidirectional, best for classification/understanding
Decoder-only (GPT): causal, dominates generative AI and reasoning
Encoder-decoder (T5): cross-attention bridges input and output
Decoder-only won because next-token prediction scales universally

Part II

Pre-training

Teaching LLMs to understand language

Chapter 5

Tokenization: BPE & SentencePiece

Learning Objectives

Understand why we tokenize (not use characters or words)
Implement Byte Pair Encoding (BPE) from scratch
Use SentencePiece and tiktoken for real tokenization

Python
# tiktoken — OpenAI's tokenizer (used in GPT-4)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding
text = "Large Language Models are transforming AI"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")

# BPE from scratch (simplified)
def get_pairs(tokens):
    return {(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)}

def bpe_train(text, num_merges):
    tokens = list(text.encode("utf-8"))
    merges = {}
    for i in range(num_merges):
        pairs = {}
        for j in range(len(tokens)-1):
            pair = (tokens[j], tokens[j+1])
            pairs[pair] = pairs.get(pair, 0) + 1
        if not pairs: break
        best = max(pairs, key=pairs.get)
        new_token = 256 + i
        merges[best] = new_token
        # Replace all occurrences of best pair with new token
        new_tokens = []
        j = 0
        while j < len(tokens):
            if j < len(tokens)-1 and (tokens[j], tokens[j+1]) == best:
                new_tokens.append(new_token); j += 2
            else:
                new_tokens.append(tokens[j]); j += 1
        tokens = new_tokens
    return merges, tokens
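Training produces a merge table; encoding new text replays those merges in the order they were learned. The sketch below assumes the same `{(left, right): new_token}` format that `bpe_train` returns, with new token ids starting at 256. `bpe_encode` and `bpe_decode` are illustrative names, and the merge table is hand-written for the example rather than learned.

```python
def bpe_encode(text, merges):
    """Apply learned merges to new text, in the order they were learned."""
    tokens = list(text.encode("utf-8"))
    for pair, new_token in sorted(merges.items(), key=lambda kv: kv[1]):
        merged, j = [], 0
        while j < len(tokens):
            if j < len(tokens) - 1 and (tokens[j], tokens[j + 1]) == pair:
                merged.append(new_token); j += 2
            else:
                merged.append(tokens[j]); j += 1
        tokens = merged
    return tokens

def bpe_decode(tokens, merges):
    """Expand merged tokens back to bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_token in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_token] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")

# Hand-written merges for illustration: "l"+"o" -> 256, then 256+"w" -> 257
merges = {(108, 111): 256, (256, 119): 257}
ids = bpe_encode("low lower", merges)
print(ids)                      # [257, 32, 257, 101, 114]
print(bpe_decode(ids, merges))  # low lower
```

Nine bytes become five tokens, and decoding is lossless because every token expands back to a fixed byte string.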

Industry Problem: Tokenization of Non-English and Code

Problem: BPE trained on English text creates long token sequences for Chinese/Japanese (each character = multiple tokens) and code (variable names split into sub-tokens). This wastes context window and increases cost.

Solutions: (1) Larger vocabulary — LLaMA-3 uses 128K tokens (vs GPT-2's 50K), reducing non-English token count by ~2x. (2) Byte-level BPE — handles any UTF-8 byte sequence. (3) Language-aware training data — balance corpus to better represent non-English text in merges.

Exercises

Exercise 5.1: Why not use characters or words directly?

Characters: Vocabulary is small (~256 for bytes) but sequences become much longer (roughly 4-5x for English text), making O(n²) attention prohibitive. Words: Vocabulary is huge (500K+), most words are rare, and unknown words can't be handled. BPE: the sweet spot, with 32K-128K tokens; it handles any text and balances sequence length against vocabulary size.

Exercise 5.2: How does vocab size affect model quality and efficiency?

Larger vocab = shorter sequences (more efficient inference, more context fits) but a larger embedding matrix. GPT-2: 50K tokens. LLaMA-3: 128K tokens. An embedding matrix of 128,256 × 4096 dims ≈ 525M parameters (and as many again for an untied output layer), which is significant but worthwhile for the compression benefit. Optimal vocab size depends on training data size and languages supported.

Chapter Summary

BPE iteratively merges frequent byte pairs to build a sub-word vocabulary
Tokenization is the first step in any LLM pipeline — garbage in, garbage out
Larger vocabularies reduce token count but increase embedding size
tiktoken (GPT-4) and SentencePiece (LLaMA) are industry standards

Chapter 6

Language Modeling Objectives

Learning Objectives

Master causal LM (next-token prediction) — the GPT objective
Understand masked LM (BERT-style) and its limitations
Know prefix LM and span corruption (T5) objectives

Python
# Causal Language Modeling (GPT-style)
# Given: "The cat sat on the"
# Predict: "cat" "sat" "on" "the" "mat"

import torch.nn.functional as F

def causal_lm_loss(logits, targets):
    # logits: [B, T, vocab_size], targets: [B, T]
    # Shift: predict token t+1 from position t
    shift_logits = logits[:, :-1].contiguous()
    shift_labels = targets[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1))

# Every position provides a training signal!
# A 4096-token document gives 4095 prediction tasks
# This efficiency is why causal LM scales so well

Loss = -1/T × Σₜ log P(xₜ | x₁, x₂, ..., xₜ₋₁)

The Unreasonable Effectiveness of Next-Token Prediction

Next-token prediction seems trivially simple, yet it teaches: factual knowledge ("Paris is the capital of..."), reasoning ("If A implies B and B implies C, then..."), code ("def fibonacci(n):\n if n < 2:..."), math, translation, and even theory of mind. Ilya Sutskever has argued that predicting the next token well requires modeling the underlying reality that produced the text; in information-theoretic terms, better prediction is better compression.

Exercises

Exercise 6.1: Why is causal LM more efficient than masked LM for pre-training?

Causal LM: every token is a prediction target → T-1 training signals per document. Masked LM (BERT): only ~15% of tokens are masked → 0.15T signals. For the same compute, causal LM extracts 6-7x more learning signal. This is why GPT-style models scale better than BERT-style models.

Exercise 6.2: What is perplexity and how is it related to loss?

Perplexity = e^(cross-entropy loss), with the loss measured in nats per token. A perplexity of 10 means the model is "as confused as if it had to choose between 10 equally likely tokens." Lower = better. Perplexity depends on the tokenizer and the evaluation text, so compare values only between models that share both.
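A worked example of the relationship, as a minimal sketch. The input is the probability the model assigned to each correct token; `perplexity` is an illustrative helper, and the probabilities are made up for the demonstration.

```python
import math

def perplexity(correct_token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in correct_token_probs]
    return math.exp(sum(nll) / len(nll))

uniform = perplexity([0.1] * 20)           # always 1-in-10 odds
mixed = perplexity([0.9, 0.5, 0.05, 0.4])  # confident on some tokens, lost on others
print(round(uniform, 6), round(mixed, 3))
```

The uniform case recovers the "choose between 10 equally likely tokens" reading; the mixed case shows that a single badly predicted token (p = 0.05) dominates the average.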

Chapter Summary

Causal LM (next-token prediction) is the dominant pre-training objective
Every token provides a training signal — extremely data-efficient
Perplexity = e^loss — the standard metric for language modeling quality
Next-token prediction is surprisingly powerful — it teaches understanding, not just completion

Chapter 7

Data Collection, Cleaning & Deduplication

Learning Objectives

Build a pre-training data pipeline
Understand quality filtering, deduplication, and toxicity removal
Know the data composition of major LLMs

Model	Training Tokens	Data Sources
GPT-3	300B	CommonCrawl, WebText, Books, Wikipedia
LLaMA-2	2T	Publicly available online data (exact mix not disclosed)
LLaMA-3	15T	Multi-source, heavily filtered and deduplicated
GPT-4	~13T (estimated)	Undisclosed (web + proprietary + synthetic)

Python
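# Data pipeline stages (pseudocode: detect_language, quality_score, minhash_dedup,
# is_toxic, remove_pii, tokenizer and pack_sequences are placeholders to implement)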
def data_pipeline(raw_docs):
    # 1. Language filtering
    docs = [d for d in raw_docs if detect_language(d) == "en"]

    # 2. Quality filtering (perplexity-based)
    docs = [d for d in docs if quality_score(d) > 0.5]

    # 3. Deduplication (MinHash + LSH)
    docs = minhash_dedup(docs, threshold=0.8)

    # 4. Toxicity/PII removal
    docs = [remove_pii(d) for d in docs if not is_toxic(d)]

    # 5. Tokenize and pack into sequences
    tokens = tokenizer.encode_batch(docs)
    return pack_sequences(tokens, max_len=4096)

Industry Problem: Data Quality vs. Quantity

Problem: CommonCrawl has ~250 billion pages, but ~90% is low-quality (duplicates, spam, SEO content, machine-generated text). Training on raw data produces incoherent models.

Solutions: (1) Classifier-based filtering: train a quality classifier on Wikipedia/books, then filter web data with it. (2) Exact + fuzzy dedup: MinHash for near-duplicate detection (LLaMA removed 86% of CommonCrawl). (3) Domain mixing: oversample high-quality sources (code, textbooks, Wikipedia). (4) Synthetic data: use existing LLMs to generate training data (the Phi models suggest that small models trained on curated, textbook-quality data can match much larger models trained on raw web data).
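Fuzzy deduplication with MinHash can be sketched with the standard library alone. This is an illustration under stated assumptions: character 5-gram shingles, 128 hash functions simulated by seeding SHA-256, and pairwise comparison instead of the LSH banding a web-scale pipeline would need. All function names and the example documents are hypothetical.

```python
import hashlib

def shingles(text, k=5):
    """Set of character k-grams after lowercasing and whitespace normalization."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(shingle_set, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions: an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "The quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "The quick brown fox jumps over the lazy dog near the river bank tonight"
doc_c = "Scaling laws relate model size, data and compute for language models"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
near_dup = estimated_jaccard(sig_a, sig_b)    # high: candidates for removal
unrelated = estimated_jaccard(sig_a, sig_c)   # close to zero
print(near_dup, unrelated)
```

Signatures are fixed-size, so documents of any length can be compared in constant time; LSH then groups signatures into bands so that only likely duplicates are compared at all.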

Exercises

Exercise 7.1: Why does deduplication improve model quality?

Duplicated data causes the model to memorize specific sequences instead of learning generalizable patterns. It can also contribute to training instabilities (loss spikes on repeated content). LLaMA's dedup removed 86% of CommonCrawl but improved quality significantly. Published deduplication studies report less memorized output and equal or better accuracy in fewer training steps, so dedup effectively stretches the compute budget.

Exercise 7.2: What is the data mixture problem and how do you solve it?

Different domains have different value: 1 token of Wikipedia > 1 token of a random blog. The optimal mixture allocates more training time to high-quality sources. LLaMA-1 used roughly: 82% web (CommonCrawl + C4), 4.5% code (GitHub), 4.5% Wikipedia, 4.5% books, 2.5% arXiv, 2% StackExchange. Finding the right mixture requires expensive ablation experiments (DoReMi, data mixing laws).

Chapter Summary

Data quality > quantity: the Phi models are strong evidence for this
Pipeline: filter → deduplicate → detoxify → tokenize → pack
MinHash LSH enables efficient near-duplicate detection at web scale
Domain mixing ratio significantly affects model capabilities

Chapter 8

Compute Scaling & Scaling Laws

Learning Objectives

Estimate FLOPs for training an LLM
Understand Chinchilla scaling laws — the optimal model-data tradeoff
Apply scaling laws to plan training runs

FLOPs ≈ 6 × N × D (N = parameters, D = tokens)

Model	Parameters	Tokens	FLOPs	GPUs	Time
GPT-3	175B	300B	3.1×10²³	V100 cluster (actual); 1024 A100s (estimate)	~34 days (estimate for the A100 setup)
LLaMA-2 70B	70B	2T	8.4×10²³	2048 A100s	~35 days (1.72M A100-hours reported)
LLaMA-3 405B	405B	15T	3.6×10²⁵	16384 H100s	~54 days

Chinchilla Scaling Law

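For a compute-optimal model, parameters and tokens should scale roughly equally: D ≈ 20N. A 10B model should train on 200B tokens. By this rule GPT-3 (175B params, 300B tokens) was undertrained. Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens) at the same compute budget with 4x fewer parameters, and outperformed GPT-3 as well. LLaMA went further, training small models on far more tokens than the Chinchilla-optimal amount because a smaller model is cheaper to serve.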

Industry Problem: Training Cost

Problem: Training a frontier LLM costs $10M-$100M+ in compute. A single training run of LLaMA-3 405B on 16K H100s costs ~$30M+ in GPU hours. Failures are catastrophic.

Solutions: (1) Scaling laws: extrapolate from small experiments to predict large model performance before spending millions. (2) Efficient architectures: mixture of experts (Mixtral routes each token to 2 of 8 experts, so only about 13B of its ~47B parameters are active per token). (3) Curriculum learning: start with easy data, increase complexity. (4) Infrastructure: checkpoint frequently and use fault-tolerant training frameworks (Megatron, DeepSpeed).

Exercises

Exercise 8.1: How many FLOPs to train a 7B model on 1T tokens?

FLOPs ≈ 6 × 7×10⁹ × 10¹² = 4.2×10²² FLOPs. An H100 does ~1×10¹⁵ FLOPs/sec. At 50% MFU: 5×10¹⁴ effective FLOPs/sec. Time with 1 H100: 4.2×10²²/(5×10¹⁴) = 84M seconds ≈ 2.7 years. With 256 H100s: ~4 days.
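The same estimate as reusable helpers, a minimal sketch of the 6ND rule. The peak throughput (10¹⁵ FLOPs/sec) and 50% MFU are the assumptions used in the exercise, and the function names are illustrative.

```python
def training_flops(n_params, n_tokens):
    """FLOPs ≈ 6 × N × D (forward plus backward pass)."""
    return 6 * n_params * n_tokens

def training_days(flops, n_gpus, peak_flops_per_gpu=1e15, mfu=0.5):
    """Wall-clock days at a given model FLOPs utilization (MFU)."""
    seconds = flops / (n_gpus * peak_flops_per_gpu * mfu)
    return seconds / 86_400

def chinchilla_tokens(n_params):
    """Compute-optimal token budget under the D ≈ 20N rule of thumb."""
    return 20 * n_params

flops = training_flops(7e9, 1e12)
print(f"{flops:.1e} FLOPs")                                      # 4.2e+22 FLOPs
print(f"{training_days(flops, 1) / 365:.1f} years on 1 GPU")     # 2.7 years on 1 GPU
print(f"{training_days(flops, 256):.1f} days on 256 GPUs")       # 3.8 days on 256 GPUs
print(f"{chinchilla_tokens(10e9):.0e} tokens for a 10B model")   # 2e+11 tokens for a 10B model
```

Real runs rarely scale perfectly linearly with GPU count, so treat the multi-GPU figure as a lower bound.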

Exercise 8.2: What is Mixture of Experts and why is it efficient?

MoE replaces the FFN with N expert FFNs + a router. The router selects the top-2 experts per token. With 8 experts, the FFN holds 8x the parameters of a single expert but spends only 2x the compute, because only 2 experts are active. Mixtral-8x7B has ~47B total params but only ~13B active per token, matching LLaMA-2 70B quality at much lower inference cost. GPT-4 is rumored to use MoE.
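Top-2 routing can be sketched without any framework. The assumptions here are loud ones: inputs are scalars, the experts are toy functions standing in for FFNs, the gate logits are given rather than computed by a learned router, and weights come from a softmax over the two selected logits. `top2_moe` and `make_expert` are hypothetical names.

```python
import math

def top2_moe(x, gate_logits, experts):
    """Route input x to the 2 highest-scoring experts and mix their outputs.

    gate_logits: one router score per expert (in a real layer, a linear map of x).
    experts: list of callables standing in for expert FFNs.
    """
    top2 = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:2]
    # Softmax over the two selected logits only, so the weights sum to 1
    m = max(gate_logits[i] for i in top2)
    exps = {i: math.exp(gate_logits[i] - m) for i in top2}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    # Only the selected experts are evaluated: 2 of N expert FFNs per token
    output = sum(weights[i] * experts[i](x) for i in top2)
    return output, weights

calls = []
def make_expert(idx, scale):
    def expert(x):
        calls.append(idx)        # record which experts actually ran
        return scale * x
    return expert

experts = [make_expert(i, scale=i + 1) for i in range(8)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
y, weights = top2_moe(3.0, gate_logits, experts)
print(sorted(calls), weights, y)   # [1, 4] {1: 0.5, 4: 0.5} 10.5
```

Six of the eight experts are never called for this token, which is where the compute saving comes from; all eight still have to be held in memory.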

Chapter Summary

FLOPs ≈ 6ND is the fundamental compute formula for transformer training
Chinchilla law: optimal D ≈ 20N — train longer, not bigger
MoE provides parameter scaling without proportional compute increase
Scaling laws enable predicting large model performance from small experiments

Part III

Fine-tuning & Alignment

Making LLMs helpful, harmless, and honest

Chapter 9

Supervised Fine-Tuning (SFT)

Learning Objectives

Fine-tune a pre-trained LLM on instruction-response pairs
Build training datasets for SFT
Understand the instruction-following pipeline

Python
# SFT training data format
sft_data = [
    {
        "instruction": "Explain quantum computing in simple terms",
        "response": "Quantum computing uses quantum bits (qubits) that can be both 0 and 1 simultaneously (superposition). This allows quantum computers to explore many solutions at once, making them exponentially faster for certain problems like cryptography and drug discovery..."
    },
    {
        "instruction": "Write a Python function to find prime numbers",
        "response": "def is_prime(n):\n    if n < 2: return False\n    for i in range(2, int(n**0.5)+1):\n        if n % i == 0: return False\n    return True"
    }
]

# Training: only compute loss on the response tokens!
def sft_loss(logits, labels, instruction_mask):
    # Mask out instruction tokens — don't train on the question
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), reduction='none')
    loss = loss * (~instruction_mask).float().view(-1)
    return loss.sum() / (~instruction_mask).sum()

Project: Fine-tune LLaMA with Hugging Face

Python
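import torch
from transformers import (AutoTokenizer, AutoModelForCausalLM, TrainingArguments,
                          Trainer, DataCollatorForLanguageModeling)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model

# Load model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Add LoRA for efficient fine-tuning (Chapter 11)
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj","v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the counts itself (returns None)

# Load instruction dataset and tokenize its pre-formatted "text" column
dataset = load_dataset("tatsu-lab/alpaca")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
tokenized = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=dataset["train"].column_names)

# Training (for simplicity this trains on the full text, prompt included;
# apply the response-only masking from sft_loss above for proper SFT)
training_args = TrainingArguments(
    output_dir="./sft_output", num_train_epochs=3,
    per_device_train_batch_size=4, gradient_accumulation_steps=8,
    learning_rate=2e-5, fp16=True, warmup_ratio=0.1)

trainer = Trainer(model=model, args=training_args, train_dataset=tokenized,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()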

Exercises

Exercise 9.1: Why only compute loss on response tokens, not instruction tokens?

The model should learn to generate responses, not memorize instructions. Computing loss on instructions wastes gradient signal on content we don't want the model to "generate." It also prevents the model from learning to parrot back instructions. The instruction provides context; the response is the training target.

Exercise 9.2: How many SFT examples are typically needed?

Surprisingly few! LIMA showed 1,000 high-quality examples can outperform models trained on 50K+ low-quality ones. The quality > quantity principle applies strongly to SFT. Key: diverse, well-written, covering different tasks and formats. 1K-50K examples with 1-3 epochs is typical.

Chapter Summary

SFT teaches pre-trained LLMs to follow instructions using (instruction, response) pairs
Only compute loss on response tokens — instructions provide context only
Quality > quantity: 1K excellent examples can outperform 50K mediocre ones (LIMA paper)
SFT is the first step of alignment: pre-train → SFT → RLHF/DPO

Chapter 10

RLHF & DPO — Aligning with Human Preferences

Learning Objectives

Understand RLHF: reward model + PPO optimization
Master DPO — the simpler alternative that skips reward modeling
Know the alignment pipeline used by ChatGPT and Claude

RLHF Pipeline

Pre-train → SFT → Train Reward Model → PPO (optimize policy against reward)

Python
# DPO — Direct Preference Optimization (simpler than RLHF)
# Given pairs: (prompt, chosen_response, rejected_response)

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """
    policy_chosen_logps: log P_θ(chosen | prompt)
    ref_chosen_logps: log P_ref(chosen | prompt) — frozen reference model
    beta: strength of the implicit KL penalty toward the reference model
          (higher = policy stays closer to the reference)
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # The model should assign higher reward to chosen vs rejected
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss

# DPO advantage: no separate reward model, no PPO complexity!
# Trains directly on preference pairs: (prompt, winner, loser)
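A numeric walk-through of the loss for a single preference pair, using plain floats so the arithmetic is visible. The log-probabilities are made-up values for illustration, and `dpo_loss_single` is a hypothetical scalar counterpart of `dpo_loss`.

```python
import math

def dpo_loss_single(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, using plain floats."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

# Policy identical to the reference: margin 0, loss = log 2
start = dpo_loss_single(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one
better = dpo_loss_single(-10.0, -18.0, -12.0, -15.0)
print(round(start, 4), round(better, 4))   # 0.6931 0.4741
```

Training starts at log 2 because the policy is initialized from the reference; the loss falls only as the policy widens the gap between chosen and rejected relative to the reference.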

Method	Complexity	Requires	Used By
RLHF + PPO	High (4 models in memory)	Reward model + PPO training	ChatGPT, Claude (early)
DPO	Low (2 models)	Preference pairs only	LLaMA-3, Zephyr, many open models
RLAIF	Medium	AI-generated feedback	Constitutional AI (Anthropic)
KTO	Lowest	Only thumbs up/down per response	Emerging research

Industry Problem: Alignment Tax and Reward Hacking

Problem: RLHF can reduce model capability ("alignment tax") — the model becomes safer but less knowledgeable. Reward hacking: the model learns to exploit the reward model (producing text that sounds confident but is wrong).

Solutions: (1) DPO — avoids reward model entirely, reducing hacking risk. (2) Iterative DPO — alternate between generating responses and collecting preferences. (3) Process reward models — reward each reasoning step, not just the final answer (OpenAI). (4) Constitutional AI — use principles to self-critique (Chapter 12).

Exercises

Exercise 10.1: Why is DPO preferred over RLHF in most open-source LLMs?

RLHF requires: (1) a trained reward model, (2) PPO optimization with 4 models in memory (policy, reference, reward, value), (3) careful hyperparameter tuning. DPO needs only preference pairs and two models (the policy and a frozen reference), trained with a simple classification-style loss. Its objective has the same optimum as the KL-constrained RLHF objective under the Bradley-Terry preference model, while being much simpler to implement and cheaper to train.

Exercise 10.2: What is the "reference model" in DPO and why is it needed?

The reference model (usually the SFT model, frozen) prevents the policy from drifting too far from sensible language. Without it, the model could learn to produce degenerate text that maximally satisfies preferences but is incoherent. The KL divergence penalty (built into DPO's loss) keeps the policy close to the reference.

Chapter Summary

RLHF uses a reward model + PPO to optimize for human preferences
DPO simplifies alignment to direct preference optimization — no reward model needed
The alignment pipeline: Pre-train → SFT → DPO/RLHF produces helpful, harmless models
Reward hacking is a real industry risk; process rewards and iterative training help

Chapter 11

LoRA, QLoRA & PEFT Methods

Learning Objectives

Master LoRA — the dominant parameter-efficient fine-tuning method
Understand QLoRA for fine-tuning on consumer GPUs
Compare PEFT methods and know when to use each

Python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Low-Rank Adaptation: W_new = W_frozen + A·B (rank r)"""
    def __init__(self, original_layer, r=16, alpha=32):
        super().__init__()
        self.original = original_layer
        for p in self.original.parameters():
            p.requires_grad = False  # Freeze original weight and bias
        # nn.Linear stores its weight as [out_features, in_features]
        d_out, d_in = original_layer.weight.shape
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)   # Down-project
        self.B = nn.Parameter(torch.zeros(r, d_out))          # Up-project (zero init: starts as a no-op)
        self.scaling = alpha / r

    def forward(self, x):
        # Original output + low-rank adaptation
        return self.original(x) + (x @ self.A @ self.B) * self.scaling

# For LLaMA-7B (approximate figures):
# Full fine-tuning: 7B trainable params, typically a multi-GPU node (e.g. 8× A100 80GB)
# LoRA (r=16 on q_proj and v_proj): ~8M trainable params (~0.12%), fits on 1× A100
# QLoRA (4-bit base): comparable quality, fits on 1× RTX 3090 24GB

Method	Trainable Params	Memory for 7B	Quality
Full Fine-tuning	100%	~120 GB	Best
LoRA (r=16)	~0.1%	~16 GB	~98% of full
QLoRA (4-bit)	~0.1%	~6 GB	~97% of full
Prefix Tuning	~0.1%	~16 GB	~90% of full

Industry Problem: Fine-tuning Cost for Enterprise

Problem: Enterprises need domain-specific LLMs (legal, medical, finance) but can't afford $100K+ for full fine-tuning on A100 clusters. They have limited GPU resources (a few consumer GPUs).

Solutions: (1) QLoRA — fine-tune 70B models on a single 48GB GPU using 4-bit quantization. (2) LoRA adapters — swap task-specific adapters at inference time (one base model, many adapters). (3) RAG — augment with retrieval instead of fine-tuning. (4) Distillation — train a smaller model on the larger model's outputs.

Exercises

Exercise 11.1: Why does LoRA work — how can 0.1% of parameters capture task-specific knowledge?

Pre-training learns general knowledge in the full-rank weight matrix. Task-specific adaptation only needs to make small adjustments — these adjustments live in a low-dimensional subspace. Research shows the "intrinsic dimensionality" of fine-tuning is very low (~hundreds, not billions). LoRA captures this low-rank structure explicitly.
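
The ~0.1% figure can be checked by counting parameters directly. The sketch below assumes LLaMA-7B-like shapes (hidden size 4096, 32 layers, about 6.74B base parameters) and assumes LoRA is applied only to the query and value projections; other target-module choices change the total.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters for one adapted matrix: A is d_in×r, B is r×d_out."""
    return r * (d_in + d_out)

# Assumed LLaMA-7B-like configuration
hidden = 4096
n_layers = 32
adapted_per_layer = 2      # assumption: q_proj and v_proj only
r = 16
base_params = 6.74e9

trainable = n_layers * adapted_per_layer * lora_params(hidden, hidden, r)
fraction = trainable / base_params
print(f"{trainable:,} trainable LoRA params ({fraction:.2%} of base)")
```

Each adapted 4096×4096 matrix holds about 16.8M frozen weights but only 131,072 trainable ones at r=16, which is where the savings come from.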

Exercise 11.2: What is QLoRA and how does it fit 70B models on consumer GPUs?

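QLoRA: (1) Quantizes the frozen base model to 4-bit (NormalFloat4 format): 70B × 0.5 bytes = 35 GB. (2) Adds LoRA adapters in 16-bit precision (BF16 in the original paper). (3) Uses paged optimizers to handle memory spikes. (4) Trains only the small LoRA matrices, dequantizing base weights on the fly for each matrix multiply. Result: a 70B model becomes fine-tunable on a 48GB A6000, or across dual 24GB RTX 3090s with short sequences and small batches.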
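
The memory arithmetic can be sketched as a small budget calculation. This is a rough estimate under stated assumptions: the adapter size of 100M parameters is hypothetical, Adam is assumed to keep two FP32 moments per trainable parameter, and activations, quantization constants, and framework overhead are ignored.

```python
def qlora_memory_gb(n_params, n_lora_params, base_bits=4, adapter_bytes=2):
    """Rough memory budget in GB (1 GB = 1e9 bytes); excludes activations."""
    budget = {
        "base_weights": n_params * base_bits / 8,     # frozen 4-bit base model
        "adapters": n_lora_params * adapter_bytes,    # 16-bit LoRA weights
        "gradients": n_lora_params * adapter_bytes,   # gradients for adapters only
        "adam_states": n_lora_params * 4 * 2,         # two FP32 moments per param
    }
    return {name: size / 1e9 for name, size in budget.items()}

budget = qlora_memory_gb(n_params=70_000_000_000, n_lora_params=100_000_000)
total_gb = sum(budget.values())
for name, gb in budget.items():
    print(f"{name:>12}: {gb:6.2f} GB")
print(f"{'total':>12}: {total_gb:6.2f} GB")
```

The frozen base weights dominate the budget; everything that scales with trainable parameters stays small because only the adapters receive gradients and optimizer state.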

Exercise 11.3: When should you use full fine-tuning vs LoRA vs RAG?

Full FT: When you need maximum quality and have compute (production LLMs). LoRA: Domain adaptation with limited budget — sweet spot for most enterprises. RAG: When knowledge changes frequently (news, docs) or you need citations. Many deployments combine LoRA (style/behavior) + RAG (knowledge).

Chapter Summary

LoRA adds low-rank matrices (A·B) to frozen weights — 0.1% trainable params, ~98% quality
QLoRA quantizes the base model to 4-bit, enabling fine-tuning on consumer GPUs
PEFT methods democratize LLM customization for enterprises and researchers
Combine LoRA (behavior) + RAG (knowledge) for best enterprise results

Chapter 12

Constitutional AI & Instruction Following

Learning Objectives

Understand Constitutional AI (RLAIF) — AI-generated alignment data
Build instruction-following datasets with self-critique
Know the techniques for making LLMs safe and controllable

Python
# Constitutional AI pipeline (Anthropic)
principles = [
    "Choose the response that is most helpful while being harmless.",
    "Choose the response that is most honest and factual.",
    "Avoid responses that are discriminatory or biased.",
]

def constitutional_critique(model, prompt, response, principle=principles[0]):
    """Use the model itself to critique and revise a response against one principle"""
    critique_prompt = f"""
Critique this response based on the principle: {principle}

Prompt: {prompt}
Response: {response}

Identify problems and suggest a revised response:"""
    critique = model.generate(critique_prompt)
    return critique

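# Process:
# 1. Generate response to harmful prompt
# 2. Self-critique against principles
# 3. Revise response based on critique
# 4. Original recipe: fine-tune on the revised responses (supervised stage), then have
#    the AI label response pairs to train a preference model for RL (the RLAIF stage)
#    Common simplification: use revised = chosen, original = rejected as a DPO pair
# Either way, the preference signal comes from AI feedback, not human annotators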
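
The critique-and-revise loop can be sketched end to end with a stand-in model. Everything named here is hypothetical: `EchoModel` only echoes its prompt so the example runs without an LLM, and the prompt templates and the output dictionary keys are illustrative choices rather than a fixed format.

```python
import random

PRINCIPLES = [
    "Choose the response that is most helpful while being harmless.",
    "Choose the response that is most honest and factual.",
    "Avoid responses that are discriminatory or biased.",
]

class EchoModel:
    """Hypothetical stand-in for an LLM client; swap generate() for a real call."""
    def generate(self, prompt: str) -> str:
        return f"[model output for: {prompt[:40]}]"

def critique_and_revise(model, prompt, principles, rng):
    """Run one critique/revision round and package the result as a preference pair."""
    original = model.generate(prompt)
    principle = rng.choice(principles)   # sample one principle per round
    critique = model.generate(
        f"Principle: {principle}\nPrompt: {prompt}\nResponse: {original}\n"
        "Identify problems with the response:"
    )
    revised = model.generate(
        f"Critique: {critique}\nPrompt: {prompt}\nResponse: {original}\n"
        "Rewrite the response to address the critique:"
    )
    return {"prompt": prompt, "chosen": revised, "rejected": original,
            "principle": principle}

pair = critique_and_revise(EchoModel(), "How do I pick a lock?", PRINCIPLES,
                           random.Random(0))
print(pair["principle"])
```

Sampling a principle per round, instead of always using the first one, spreads the critiques across the whole constitution.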

Industry Problem: Safety Without Losing Capability

Problem: Over-alignment makes models refuse legitimate requests ("I can't help with that"). Under-alignment allows harmful outputs. Finding the right balance is a major industry challenge.

Solutions: (1) Nuanced refusal — refuse harmful content but explain why and offer alternatives. (2) System prompts — define behavior boundaries per deployment. (3) Layered safety — input filters + model alignment + output filters. (4) Red teaming — systematically test for vulnerabilities before deployment.

Exercises

Exercise 12.1: How does RLAIF differ from RLHF?

RLHF: Human annotators compare response pairs → reward model → PPO. Expensive (on the order of $2M+ for quality annotations at scale). RLAIF: An AI model compares responses against constitutional principles → AI-generated preference data → preference model + RL, or DPO directly on the pairs. Much cheaper and more scalable, with more consistent labels. Trade-off: AI feedback may miss nuances humans catch and can inherit the labeling model's biases.

Exercise 12.2: What makes a good instruction-following dataset?

Diversity (many tasks), quality (well-written responses), coverage (edge cases), formatting (consistent structure), and difficulty gradient (simple → complex). Alpaca (52K examples) used text-davinci-003, a GPT-3.5-series model, to generate data. OpenOrca used GPT-4 and GPT-3.5 completions. Key insight: the teacher model's quality largely determines the student's ceiling.

Chapter Summary

Constitutional AI uses self-critique against principles for scalable alignment
RLAIF replaces expensive human feedback with AI-generated preferences
Instruction-following requires diverse, high-quality training data
Safety is a spectrum — balance helpfulness with harmlessness

Part IV

Efficiency & Deployment

Serving LLMs at scale in production

Chapter 13

Quantization, Pruning & Distillation

Learning Objectives

Quantize LLMs to INT8/INT4 for efficient inference
Understand knowledge distillation — training small models from large ones
Know pruning techniques for model compression

Python
# Quantization with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization (the scheme QLoRA uses; GPTQ and AWQ are separate methods)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True       # Quantize the quantization constants!
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=quant_config,
    device_map="auto"
)
# 70B model: FP16 = 140 GB → 4-bit = ~35 GB of weights → fits across 2× RTX 3090 (48 GB total),
# leaving limited room for the KV cache

Precision	Bits	Size (7B model)	Quality vs FP16
FP32	32	28 GB	Baseline (100%)
FP16/BF16	16	14 GB	~100%
INT8	8	7 GB	~99%
INT4 (GPTQ/AWQ)	4	3.5 GB	~97%
GGUF Q4_K_M	~4.5	~4 GB	~96%

Industry Problem: Serving Cost Per Token

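Problem: Running LLaMA-70B in FP16 requires 2× A100 80GB (about $2/hour each, so $4/hour). A single stream at 50 tokens/sec produces 180K tokens/hour, which works out to ~$0.022 per 1K tokens. Reaching ~$0.003 per 1K tokens takes batching to roughly 370 tokens/sec of aggregate throughput. Even at that rate, 1B tokens/day (medium startup) costs $3K/day = $90K/month.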

Solutions: (1) INT4 quantization: near-FP16 quality (~97%), 4× less memory, runs on cheaper GPUs. (2) Distillation: train a 7B model to mimic the 70B for roughly 10x cheaper inference. (3) Speculative decoding: use a small draft model plus the large model as verifier for 2-3x speedup. (4) Batching: serve multiple requests simultaneously with continuous batching.
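
Cost per token follows directly from GPU price and aggregate throughput, so it is worth computing rather than guessing. The sketch below uses the $2/hour A100 price from the problem statement; the 370 tokens/sec batched throughput is an assumed figure chosen to show what batching would have to deliver to reach about $0.003 per 1K tokens.

```python
def cost_per_1k_tokens(gpu_count, usd_per_gpu_hour, tokens_per_second):
    """Serving cost in USD per 1,000 generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_count * usd_per_gpu_hour / tokens_per_hour * 1000

single_stream = cost_per_1k_tokens(2, 2.0, 50)    # one request at a time
batched = cost_per_1k_tokens(2, 2.0, 370)         # assumed aggregate throughput
daily_cost = batched * 1_000_000_000 / 1000       # 1B tokens per day

print(f"single stream: ${single_stream:.4f} per 1K tokens")
print(f"batched:       ${batched:.4f} per 1K tokens")
print(f"1B tokens/day: ${daily_cost:,.0f} per day")
```

The GPU bill is fixed per hour, so every gain in aggregate tokens per second lowers the per-token cost proportionally. That is why batching and quantization matter more than raw single-stream speed.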

Exercises

Exercise 13.1: Why does INT4 quantization only lose ~3% quality?

Neural network weights are approximately normally distributed and highly redundant. Most weights are near zero and don't need 16-bit precision. NormalFloat4 (NF4) is designed to be information-theoretically optimal for normally distributed data: its 16 levels sit at quantiles of a normal distribution, so more levels fall where weights are dense. The small quality loss arises because extreme weight values (important but rare) lose some precision, and the exact amount varies with model size and task.
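
The round trip through 4 bits can be illustrated with plain blockwise absmax quantization. Note the swap: this sketch uses a uniform integer grid, not NF4 itself (NF4 replaces the uniform grid with normal quantiles), and the block size of 64 and the weight standard deviation of 0.02 are assumptions.

```python
import random

def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(64)]   # one block of fake weights
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f}  max abs error={max_err:.5f}")
```

Each block stores 64 four-bit codes plus one scale, and the reconstruction error of any weight is bounded by half the scale. Keeping blocks small limits how far a single outlier can stretch the grid for its neighbors.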

Exercise 13.2: How does knowledge distillation work?

Train a small "student" model to match the large "teacher" model's output distribution (soft labels), not just the hard labels. The teacher's probability distribution contains "dark knowledge" — e.g., P("cat")=0.7, P("kitten")=0.2 teaches that cats and kittens are related. Student learns faster and better than training from scratch on hard labels.
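
The soft-label objective can be written out without any framework. This is a minimal sketch for a single token position: the temperature of 2.0 and the example logits are assumptions, and the `temperature ** 2` factor is the usual convention for keeping gradient magnitudes comparable across temperatures.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return kl * temperature ** 2

teacher = [4.0, 2.5, -1.0]        # e.g. logits for "cat", "kitten", "car"
far_student = [1.0, 1.0, 1.0]     # uniform guess
near_student = [3.8, 2.4, -0.9]   # close to the teacher

loss_far = distillation_loss(far_student, teacher)
loss_near = distillation_loss(near_student, teacher)
print(f"far: {loss_far:.4f}  near: {loss_near:.4f}")
```

In practice this term is usually mixed with the ordinary cross-entropy on hard labels; the mixing weight is another hyperparameter not shown here.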

Exercise 13.3: What is the difference between GPTQ, AWQ, and GGUF quantization?

GPTQ: Post-training quantization using calibration data, minimizes layer output error. GPU-optimized. AWQ: Activation-aware: protects the weight channels that matter most for activations, often giving slightly better quality. GGUF: llama.cpp's file format, built for CPU inference with optional GPU offload, offering multiple quantization levels (Q4_K_M, Q5_K_S, etc.). Rule of thumb: GPTQ/AWQ for GPU serving, GGUF for CPU/edge.

Chapter Summary

INT4 quantization reduces model size 4x with ~3% quality loss
Distillation trains smaller models to mimic larger ones — 10x cheaper inference
GPTQ/AWQ for GPU deployment; GGUF/llama.cpp for CPU/edge
Quantization is the #1 way to reduce serving costs in production

Chapter 14

Serving: Flash Attention, Parallelism & vLLM

Learning Objectives

Understand Flash Attention — the algorithm that enabled long contexts
Master tensor and pipeline parallelism for multi-GPU serving
Deploy LLMs with vLLM and llama.cpp
Design prompt engineering strategies and manage context windows

Flash Attention

Python
# Flash Attention — fused kernel, O(n) memory instead of O(n²)
# Standard attention materializes the full n×n attention matrix
# Flash Attention computes attention in tiles, never storing the full matrix

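# In PyTorch 2.0+:
import torch
from torch.nn.functional import scaled_dot_product_attention

# Example inputs shaped [batch, n_heads, seq_len, d_k]
Q, K, V = (torch.randn(1, 8, 1024, 64) for _ in range(3))

# Dispatches to a Flash Attention kernel when one is available for the device and dtype
output = scaled_dot_product_attention(Q, K, V, is_causal=True)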

# For LLaMA with 128K context:
# Standard attention memory: 128K × 128K × 2 bytes = 32 GB per head (× n_heads per layer)
# Flash Attention memory: O(n) extra memory, on the order of megabytes per layer
# Speedup: 2-4x on A100, enables 128K+ context windows
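
The tiling trick rests on an online softmax: a running maximum and running sum let each tile be folded in without ever holding all scores at once. The sketch below shows that idea for a single query in plain Python. It illustrates the arithmetic only, not the fused GPU kernel, and the tile size, dimensions, and random inputs are assumptions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax and weight the values."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, tile=2):
    """Same result, processing keys/values one tile at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) * scale for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (exp(-inf) is 0.0 on tile 1)
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(4)]
keys = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(7)]
values = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(7)]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile=2)
print(max(abs(a - b) for a, b in zip(reference, tiled)))
```

Only one tile of scores plus a few running values exist at any time, which is why memory grows linearly with sequence length instead of quadratically.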

Parallelism Strategies

Strategy	Splits	Use Case
Tensor Parallelism	Each layer across GPUs	Single-node multi-GPU (fast interconnect)
Pipeline Parallelism	Different layers on different GPUs	Multi-node (high latency OK)
Data Parallelism	Same model, different batches	Training (not useful for single request)
Sequence Parallelism	Long sequences across GPUs	Very long contexts (Ring Attention)
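
Tensor parallelism can be illustrated with a column-split matrix multiply: each shard computes a slice of the output and the slices are concatenated. This is a single-process sketch in plain Python. The shards stand in for GPUs, the concatenation stands in for an all-gather, and all function names are hypothetical.

```python
def matvec(x, W):
    """x has length d_in; W is a d_in × d_out matrix stored as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def split_columns(W, n_shards):
    """Give each shard an equal, contiguous slice of the output columns."""
    d_out = len(W[0])
    assert d_out % n_shards == 0, "output dim must divide evenly across shards"
    step = d_out // n_shards
    return [[row[s * step:(s + 1) * step] for row in W] for s in range(n_shards)]

def tensor_parallel_matvec(x, W, n_shards):
    shards = split_columns(W, n_shards)
    partials = [matvec(x, shard) for shard in shards]   # one per GPU in practice
    return [y for part in partials for y in part]       # stands in for all-gather

x = [1, 2, 3]
W = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 2, 0, 1]]

full = matvec(x, W)
sharded = tensor_parallel_matvec(x, W, n_shards=2)
print(full, sharded)
```

Every shard needs the full input `x` and must exchange results after each layer, which is why this strategy depends on a fast interconnect.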

Serving with vLLM

Python
# vLLM — high-throughput LLM serving with PagedAttention
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
           tensor_parallel_size=2,     # 2 GPUs
           gpu_memory_utilization=0.9)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# vLLM uses PagedAttention — KV cache is managed like virtual memory
# This eliminates memory waste from pre-allocated KV caches
# Reported throughput: 2-4x over prior serving systems (FasterTransformer, Orca) in the
# vLLM paper, and up to 24x over plain HuggingFace Transformers generation

prompts = [
    "Explain machine learning in simple terms",
    "Write a Python quicksort function",
    "What causes climate change?",
]
outputs = llm.generate(prompts, params)  # Batched! Continuous batching
for out in outputs:
    print(out.outputs[0].text)

Prompt Engineering & Context Windows

Python
# System prompt engineering
system_prompt = """You are a senior financial analyst. Follow these rules:
1. Always cite your sources with specific data points
2. When uncertain, explicitly state your confidence level
3. Present both bullish and bearish perspectives
4. Use tables for comparisons"""

# RAG: Retrieve relevant context to fit in window
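def rag_prompt(query, retrieved_docs, max_context=4000):
    # max_context is a character budget (a rough proxy for tokens); slicing the
    # list itself would cap the number of documents, not the context length
    context = "\n---\n".join(retrieved_docs)[:max_context]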
    return f"""Based on the following documents, answer the question.

Context:
{context}

Question: {query}

Answer based ONLY on the provided context. If the answer is not in the context, say so."""

Industry Problem: Latency vs. Throughput

Problem: Users expect <50ms time-to-first-token (TTFT) and 30+ tokens/second streaming. But batch processing (high throughput) increases latency. How do you serve thousands of concurrent users with good latency?

Solutions: (1) Continuous batching (vLLM) — dynamically add/remove requests from the batch without waiting. (2) Speculative decoding — small model drafts tokens, large model verifies (2-3x speedup). (3) Prefix caching — cache KV for common system prompts. (4) Quantized serving — INT4 models fit on fewer GPUs with lower latency. (5) Edge deployment — llama.cpp on consumer hardware for local inference.

Project: Deploy an LLM API with vLLM

Bash
# Start vLLM server (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --port 8000

# Client code (works with any OpenAI SDK!)

Python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Explain transformers in 100 words"}],
    max_tokens=200, temperature=0.7
)
print(response.choices[0].message.content)

Exercises

Exercise 14.1: How does PagedAttention (vLLM) improve memory efficiency?

Traditional serving pre-allocates KV cache for max_seq_len per request — most of it is wasted (average response is much shorter). PagedAttention allocates KV cache in small pages (like OS virtual memory), only using what's needed. When a request ends, pages are freed. This reduces memory waste from ~60-80% to ~4%, enabling 2-4x more concurrent requests.
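
The paging idea can be sketched as a toy block allocator. This is an illustration of the bookkeeping only, not vLLM's implementation: the class name, the block size of 16 tokens, and the request ids are all hypothetical, and no actual key/value tensors are stored.

```python
class PagedKVCache:
    """Toy allocator: requests get fixed-size blocks on demand, not up front."""

    def __init__(self, total_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(total_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> tokens stored so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:   # no block yet, or last one full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

cache = PagedKVCache(total_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("req-a")    # 20 tokens -> 2 blocks
for _ in range(5):
    cache.append_token("req-b")    # 5 tokens -> 1 block
free_while_running = len(cache.free_blocks)
cache.release("req-a")
free_after_release = len(cache.free_blocks)
print(free_while_running, free_after_release)
```

Waste is limited to the unfilled tail of each request's last block, so a 20-token request occupies 32 slots instead of a full `max_seq_len` reservation.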

Exercise 14.2: When would you use llama.cpp vs vLLM?

vLLM: GPU server deployment, high throughput, many concurrent users, production APIs. llama.cpp: Local/edge deployment on CPU/Mac/consumer GPU, single-user, privacy-sensitive applications, offline use. llama.cpp supports GGUF quantized models that run on any hardware with no GPU required.

Exercise 14.3: What is speculative decoding and why does it speed up generation?

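A small "draft" model (e.g., 1B) generates K tokens quickly. The large "target" model (e.g., 70B) then scores all K positions in a single forward pass, which costs about the same as one ordinary decoding step because decoding is memory-bandwidth bound. Draft tokens are accepted up to the first disagreement; at that position the target model's own token is used and the remaining draft tokens are discarded. Acceptance rates of 70-90% are typical when the draft model is well matched, giving ~2-3x speedup. There is no quality loss because the acceptance rule preserves the target model's output distribution.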
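
The draft-then-verify loop can be sketched with toy deterministic models. This shows the greedy variant only (a sampling version would use a probabilistic accept/reject rule), and both "models" are hypothetical functions over token lists so that the example runs without any LLM.

```python
def target_next(context):
    """Toy target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10

def draft_next(context):
    """Toy draft model: agrees with the target except right after a 5."""
    return 0 if context[-1] == 5 else (context[-1] + 1) % 10

def greedy_decode(next_token, prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens

def speculative_decode(target, draft, prompt, n_new, k=4):
    tokens, target_passes = list(prompt), 0
    while len(tokens) - len(prompt) < n_new:
        # 1. Draft k tokens autoregressively with the cheap model
        proposal = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. Verify every position; a real system does this in one batched pass
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)   # target's token replaces the mismatch
                break
        else:
            accepted.append(target(tokens + proposal))  # bonus token, all accepted
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new], target_passes

baseline = greedy_decode(target_next, [0], 12)
speculative, passes = speculative_decode(target_next, draft_next, [0], 12, k=4)
print(speculative == baseline, passes)
```

Every round yields at least one token chosen by the target model, so the output matches ordinary greedy decoding exactly while the large model runs far fewer sequential passes.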

Chapter Summary

Flash Attention enables 128K+ contexts by reducing memory from O(n²) to O(n)
Tensor parallelism splits each layer's weight matrices across GPUs; pipeline parallelism places different layers on different GPUs
vLLM with PagedAttention provides 2-4x throughput improvement for production serving
Speculative decoding, continuous batching, and prefix caching optimize latency
llama.cpp enables local/edge deployment on consumer hardware

🎓 Congratulations!

You've completed Large Language Models. You now understand how to build, train, align, and deploy the AI systems that are transforming the world — from Transformer architecture to production serving at scale.