Phase 5 • EduArtha

Large Language Models (LLMs)

This is the core of how modern AI works: Transformer architecture, pre-training on text, and alignment techniques. Every chatbot, code assistant, and AI agent is built on these foundations.

โฑ 6โ€“12 months  |  14 Chapters  |  50+ Exercises  |  Industry Problems

Part I

Transformer Architecture

The architecture that changed everything

Chapter 1

Self-Attention & Multi-Head Attention

Learning Objectives

  • Implement scaled dot-product attention from scratch
  • Understand queries, keys, values: the information retrieval analogy
  • Build multi-head attention and understand why multiple heads help
  • Compute attention complexity and memory requirements
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
Python
import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project and reshape to [B, n_heads, T, d_k]
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)

        # Scaled dot-product attention
        scores = (Q @ K.transpose(-2,-1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = self.dropout(torch.softmax(scores, dim=-1))
        out = (attn @ V).transpose(1,2).contiguous().view(B, T, C)
        return self.W_o(out)

# Causal mask for autoregressive (GPT-style)
def causal_mask(T):
    return torch.tril(torch.ones(T, T)).unsqueeze(0).unsqueeze(0)

Industry Problem: Quadratic Memory in Long Documents

Problem: Self-attention is O(n²) in sequence length. Processing a 100K-token legal contract requires 100K × 100K = 10 billion attention scores per layer per head, which is impossible to fit in memory.

Solutions: (1) Flash Attention: fuses operations, reduces memory from O(n²) to O(n). (2) Sliding window attention (Mistral): attend to local windows. (3) Ring attention: distributes across GPUs. (4) Sparse attention (BigBird): attend to only important positions.

Exercises

Exercise 1.1: Why scale by √dₖ and what happens without it?

Without scaling, dot products grow with dimension dₖ (variance ≈ dₖ for random vectors with unit-variance components). Large values push softmax into saturation: one position gets ~100% attention and gradients vanish. Scaling by √dₖ keeps variance at ~1, ensuring softmax outputs are smooth and informative. For d_k=64: scores ÷ 8.
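The variance claim can be checked numerically. The sketch below is a minimal simulation using only the standard library. It assumes query and key components are independent standard normal draws, and the function name `dot_product_variance` is made up for illustration.

```python
import math
import random

def dot_product_variance(d_k, n_samples=2000, scale=False, seed=0):
    """Empirical variance of q.k for random q, k with N(0, 1) components."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        if scale:
            dot /= math.sqrt(d_k)  # the 1/sqrt(d_k) factor from the attention formula
        values.append(dot)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

raw_var = dot_product_variance(64)                 # expected to land near d_k = 64
scaled_var = dot_product_variance(64, scale=True)  # expected to land near 1
print(f"unscaled variance: {raw_var:.1f}, scaled variance: {scaled_var:.2f}")
```

With d_k = 64 the unscaled scores have a standard deviation of about 8, which is already enough to make a softmax over a handful of positions nearly one-hot.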

Exercise 1.2: Compute memory for MHA with d_model=4096, n_heads=32, seq_len=8192

Attention matrix per head: 8192 × 8192 × 4 bytes (FP32) = 256 MB. With 32 heads: 8 GB. For one layer! A 32-layer model would need 256 GB if every layer's attention matrices were kept in memory at once. This is why Flash Attention (which never materializes the full matrix) is essential for long contexts.
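The arithmetic in this exercise is easy to reproduce. The helper below is an illustrative sketch (the name `attention_matrix_bytes` is not from any library). It counts only the seq_len × seq_len score matrices, so d_model does not enter the result.

```python
def attention_matrix_bytes(seq_len, n_heads=1, n_layers=1, bytes_per_score=4):
    """Bytes needed to materialize full seq_len x seq_len score matrices."""
    return seq_len * seq_len * bytes_per_score * n_heads * n_layers

MiB, GiB = 2**20, 2**30
per_head = attention_matrix_bytes(8192)
per_layer = attention_matrix_bytes(8192, n_heads=32)
whole_model = attention_matrix_bytes(8192, n_heads=32, n_layers=32)
print(per_head / MiB, per_layer / GiB, whole_model / GiB)
```

The MB and GB figures in the answer are binary units (MiB, GiB). Halving `bytes_per_score` to 2 for FP16/BF16 halves every number but leaves the quadratic growth in seq_len unchanged.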

Exercise 1.3: Why use multiple heads instead of one large attention?

Different heads learn different relationship types: head 1 might attend to syntactic neighbors, head 2 to semantic relationships, head 3 to positional patterns. This is like having multiple "perspectives" on the same data. Empirically, 8-64 heads consistently outperform single-head attention of the same total dimension.

Chapter Summary

  • Self-attention computes relevance between all position pairs: O(n²) but powerful
  • Multi-head attention learns diverse relationship types in parallel subspaces
  • Causal masking enables autoregressive generation (GPT-style LLMs)
  • Industry challenge: quadratic scaling → addressed by Flash Attention and sparse methods
Chapter 2

Positional Encodings: Sinusoidal, RoPE & ALiBi

Learning Objectives

  • Understand why transformers need position information
  • Implement sinusoidal, RoPE, and ALiBi encodings
  • Know which encoding enables long-context extrapolation
Python
import torch, math

# 1. Sinusoidal (Original Transformer)
def sinusoidal_pe(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# 2. RoPE: Rotary Position Embeddings (LLaMA, GPT-NeoX)
def apply_rope(q, k, positions):
    """Rotate query/key vectors by position-dependent angles"""
    d = q.shape[-1]
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))
    angles = positions.unsqueeze(-1) * freqs
    cos_a, sin_a = torch.cos(angles), torch.sin(angles)

    # Rotate pairs of dimensions
    q_rot = torch.stack([q[..., 0::2]*cos_a - q[..., 1::2]*sin_a,
                         q[..., 0::2]*sin_a + q[..., 1::2]*cos_a], dim=-1).flatten(-2)
    k_rot = torch.stack([k[..., 0::2]*cos_a - k[..., 1::2]*sin_a,
                         k[..., 0::2]*sin_a + k[..., 1::2]*cos_a], dim=-1).flatten(-2)
    return q_rot, k_rot

# 3. ALiBi: Attention with Linear Biases (no embeddings!)
# Simply adds a linear bias to attention scores based on distance:
# score(i,j) = q_i · k_j - m · |i - j|
# where m is a head-specific slope. No learned parameters!
Encoding | Type | Extrapolation | Used In
Sinusoidal | Additive, fixed | Poor beyond training length | Original Transformer
Learned | Additive, trained | Cannot extrapolate | BERT, GPT-2
RoPE | Multiplicative (rotation) | Good with NTK scaling | LLaMA, Mistral, Qwen
ALiBi | Attention bias | Excellent (zero-shot) | BLOOM, MPT

Industry Problem: Context Window Extension

Problem: A model trained on 4K tokens can't handle 128K-token documents. Legal, medical, and enterprise use cases demand long contexts.

Solutions: (1) RoPE + NTK scaling: adjust frequency base to extend context (LLaMA → 100K). (2) YaRN: combines NTK + dynamic scaling. (3) ALiBi: extrapolates to any length zero-shot. (4) Retrieval-Augmented Generation (RAG): retrieve relevant chunks instead of stuffing everything into context.

Exercises

Exercise 2.1: Why can't transformers understand position without positional encoding?

Self-attention is permutation equivariant: permuting the input tokens permutes the outputs in the same way and changes nothing else. Without position info, "The cat sat on the mat" and "mat the on sat cat The" produce the same attention patterns up to that reordering. Position encodings break this symmetry, encoding order information.

Exercise 2.2: Why has RoPE become the dominant choice for LLMs?

RoPE encodes relative positions through rotation, so attention between positions i and j depends only on (i-j). It's parameter-free, works with linear attention, and can be extended to longer contexts via frequency scaling (NTK/YaRN). LLaMA, Mistral, Qwen, and most open-source LLMs use RoPE.
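The "depends only on (i-j)" property can be seen with a single pair of dimensions. The sketch below is a toy check, not a full RoPE implementation. It uses one 2-D pair with one assumed frequency `theta`, and the names `rotate_pair` and `dot` are illustrative.

```python
import math

def rotate_pair(vec, position, theta=1.0):
    """Rotate a 2-D vector by angle position * theta (one RoPE frequency)."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (0.3, -1.2), (0.7, 0.5)
# Same offset (i - j = 2) at two different absolute positions
score_near = dot(rotate_pair(q, 5), rotate_pair(k, 3))
score_far = dot(rotate_pair(q, 105), rotate_pair(k, 103))
# Different offset (i - j = 4)
score_other = dot(rotate_pair(q, 7), rotate_pair(k, 3))
print(score_near, score_far, score_other)
```

The first two scores agree up to floating-point error because rotating both vectors by the same extra angle leaves their dot product unchanged. The third differs because the relative angle changed.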

Exercise 2.3: How does ALiBi achieve zero-shot context extrapolation?

ALiBi adds a penalty proportional to distance: closer tokens get higher attention regardless of training length. Since it's a linear bias (not a learned position embedding), the model naturally handles any distance, with no retraining needed. The penalty slopes vary per head, letting some heads focus locally and others globally.
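A minimal sketch of the bias itself, in plain Python lists. The slope schedule here is the geometric sequence used for ALiBi when the head count is a power of two. The function names are illustrative, and the symmetric |i - j| form follows the formula in this chapter (under a causal mask only j ≤ i is ever used).

```python
def alibi_slopes(n_heads):
    """Head-specific slopes: 2^(-8/n), 2^(-16/n), ... (n_heads a power of two)."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * |i - j|, added to q_i . k_j before softmax."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)
steep = alibi_bias(4, slopes[0])     # slope 1/2: strongly local head
shallow = alibi_bias(4, slopes[-1])  # slope 1/256: nearly global head
```

Because the bias is a fixed function of distance, the same two functions produce a valid bias for any `seq_len`, including lengths never seen in training.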

Chapter Summary

  • Transformers need explicit position information because attention is permutation-equivariant
  • RoPE (rotary) dominates modern LLMs with good extrapolation via frequency scaling
  • ALiBi provides zero-shot length generalization with no learned parameters
  • Context extension is a major industry challenge addressed by RoPE scaling + RAG
Chapter 3

Layer Normalization, FFN & KV Cache

Learning Objectives

  • Understand Pre-LN vs Post-LN and why Pre-LN won
  • Master the feed-forward network (FFN) in transformers
  • Implement KV cache for efficient autoregressive inference
Python
class TransformerBlock(nn.Module):
    """Pre-LN Transformer Block (standard in GPT, LLaMA)"""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)  # Pre-LN: normalize BEFORE attention
        self.attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),                     # SwiGLU in LLaMA
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x, mask=None):
        x = x + self.attn(self.ln1(x), mask)   # Residual + Pre-LN Attention
        x = x + self.ffn(self.ln2(x))           # Residual + Pre-LN FFN
        return x

KV Cache: Critical for Fast Inference

Python
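class CachedAttention(nn.Module):
    """Single-head attention with KV cache for autoregressive generation"""
    def __init__(self, d_model=512):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

    def forward(self, x, kv_cache=None):
        # x: [B, T_new, d_model]; T_new is 1 during step-by-step decoding
        Q = self.W_q(x)  # Only compute Q for new tokens
        K_new = self.W_k(x)
        V_new = self.W_v(x)

        if kv_cache is not None:
            K = torch.cat([kv_cache[0], K_new], dim=1)  # Append along the time axis
            V = torch.cat([kv_cache[1], V_new], dim=1)
        else:
            K, V = K_new, V_new

        # Attention with full K,V but only new Q
        # (no causal mask here: correct when x holds a single new token)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
        out = torch.softmax(scores, dim=-1) @ V
        return out, (K, V)  # Return updated cache
    # Without cache: every step re-projects K,V for the whole prefix and re-runs attention over it
    # With cache: each step projects one token and attends once over n cached keys,
    # so per-step attention cost is O(n) instead of O(n^2)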

Industry Problem: KV Cache Memory Explosion

Problem: For LLaMA-2 70B with 128K context: KV cache = 2 × 80 layers × 8 KV heads × 128 dim × 128K tokens × 2 bytes = ~40 GB per request. (With a separate K,V for each of the 64 query heads it would be 8x larger, ~320 GB.) Serving 100 concurrent users needs 4 TB of GPU memory!

Solutions: (1) Grouped Query Attention (GQA): share K,V across head groups (LLaMA-2 70B uses 8 KV heads for 64 Q heads → 8x reduction). (2) Multi-Query Attention (MQA): all heads share one K,V (Falcon). (3) PagedAttention (vLLM): allocate KV cache in pages like virtual memory, eliminating waste. (4) KV cache quantization: store cache in INT8.
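The cache size follows directly from the model shape, so the effect of GQA and MQA can be worked out in a few lines. This is an illustrative calculator (the name `kv_cache_bytes` is hypothetical). The shape used, 80 layers with head dimension 128, is the 70B-class configuration from the problem statement.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes for keys + values of one sequence (the leading 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

GiB = 2**30
seq_len = 128 * 1024
full_mha = kv_cache_bytes(80, n_kv_heads=64, head_dim=128, seq_len=seq_len)
gqa = kv_cache_bytes(80, n_kv_heads=8, head_dim=128, seq_len=seq_len)
mqa = kv_cache_bytes(80, n_kv_heads=1, head_dim=128, seq_len=seq_len)
print(full_mha / GiB, gqa / GiB, mqa / GiB)
```

The three results are 320, 40, and 5 GiB per request: cache size scales linearly with the number of KV heads, which is why sharing K,V across query heads is such an effective lever.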

Exercises

Exercise 3.1: Why did Pre-LN replace Post-LN in modern LLMs?

Post-LN (original): Sublayer → Add → LayerNorm. Gradients must pass through LayerNorm on the residual path, which can cause instability in very deep networks and requires careful learning-rate warmup. Pre-LN: LayerNorm → Sublayer → Add. The residual path is clean (identity), which makes very deep models much easier to train stably and less sensitive to warmup. Most modern LLMs use Pre-LN, often with RMSNorm in place of LayerNorm.

Exercise 3.2: What is SwiGLU and why does LLaMA use it instead of ReLU?

SwiGLU = Swish(x·W₁) ⊙ (x·W₂), followed by an output projection W₃. It is a gated linear unit with Swish activation. It uses more parameters (3 projections vs 2) but produces better representations. LLaMA, PaLM, and Mistral all use SwiGLU. To keep parameter count similar, d_ff is reduced from 4×d_model to (8/3)×d_model ≈ 2.67×d_model.

Exercise 3.3: Calculate the speedup from KV cache for generating 512 tokens

Without cache: token 1 = 1 attn, token 2 = 2 attn, ... token 512 = 512 attn. Total = 512×513/2 = 131,328 attention computations. With cache: each token does 1 attention (against cached K,V). Total = 512 computations. Speedup: ~256x!

Chapter Summary

  • Pre-LN: normalize before attention/FFN for stable deep training
  • SwiGLU FFN outperforms ReLU/GELU in modern LLMs
  • KV cache eliminates redundant computation and is essential for fast generation
  • GQA/MQA reduce KV cache memory by sharing keys/values across heads
Chapter 4

Encoder, Decoder & Encoder-Decoder Variants

Learning Objectives

  • Distinguish encoder-only, decoder-only, and encoder-decoder architectures
  • Know which architecture suits which task
  • Understand why decoder-only won for generative AI
Architecture | Attention | Best For | Examples
Encoder-only | Bidirectional (sees all tokens) | Understanding (classification, NER) | BERT, RoBERTa, DeBERTa
Decoder-only | Causal (sees only past) | Generation (chat, code, reasoning) | GPT-4, LLaMA, Mistral, Claude
Encoder-decoder | Cross-attention | Translation, summarization | T5, BART, Flan-T5

Why Decoder-Only Won

Decoder-only models (GPT-style) dominate because: (1) They unify understanding and generation in one architecture. (2) They scale better with more parameters and data. (3) Next-token prediction is a universal objective: it teaches reasoning, factual knowledge, and code. (4) In-context learning (few-shot prompting) emerged as a surprise capability of large decoder-only models.

Exercises

Exercise 4.1: Why is BERT better than GPT for classification tasks?

BERT sees all tokens bidirectionally. When classifying "The bank by the river was steep," BERT uses "river" to disambiguate "bank." GPT only sees left context. However, large GPT models close this gap through scale, and instruction-tuned LLMs can match BERT on most NLU tasks via prompting.

Exercise 4.2: How does cross-attention work in encoder-decoder models?

The decoder's queries attend to the encoder's keys and values (not self-attention). This lets the decoder "look at" the input while generating output. In T5: encoder processes the input, decoder generates output token-by-token, using cross-attention to focus on relevant input parts at each step.

Exercise 4.3: Could you use an encoder-only model for generation?

Not directly: encoder-only models (BERT) see all positions simultaneously, so there's no autoregressive generation. You could use masked token prediction iteratively (like in diffusion models for text), but it's much slower and lower quality than causal generation. BERT is designed for understanding, not generation.

Chapter Summary

  • Encoder-only (BERT): bidirectional, best for classification/understanding
  • Decoder-only (GPT): causal, dominates generative AI and reasoning
  • Encoder-decoder (T5): cross-attention bridges input and output
  • Decoder-only won because next-token prediction scales universally
Part II

Pre-training

Teaching LLMs to understand language

Chapter 5

Tokenization: BPE & SentencePiece

Learning Objectives

  • Understand why we tokenize (not use characters or words)
  • Implement Byte Pair Encoding (BPE) from scratch
  • Use SentencePiece and tiktoken for real tokenization
Python
# tiktoken: OpenAI's tokenizer (used in GPT-4)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding
text = "Large Language Models are transforming AI"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")

# BPE from scratch (simplified)
def get_pairs(tokens):
    return {(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)}

def bpe_train(text, num_merges):
    tokens = list(text.encode("utf-8"))
    merges = {}
    for i in range(num_merges):
        pairs = {}
        for j in range(len(tokens)-1):
            pair = (tokens[j], tokens[j+1])
            pairs[pair] = pairs.get(pair, 0) + 1
        if not pairs: break
        best = max(pairs, key=pairs.get)
        new_token = 256 + i
        merges[best] = new_token
        # Replace all occurrences of best pair with new token
        new_tokens = []
        j = 0
        while j < len(tokens):
            if j < len(tokens)-1 and (tokens[j], tokens[j+1]) == best:
                new_tokens.append(new_token); j += 2
            else:
                new_tokens.append(tokens[j]); j += 1
        tokens = new_tokens
    return merges, tokens
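Training only produces the merge table; a tokenizer also needs to apply those merges to unseen text and to map token IDs back to bytes. The sketch below is a self-contained, simplified version of that round trip. It re-implements a compact trainer so the block runs on its own, and the function names (`train_bpe`, `merge_pair`, `encode`, `decode`) are illustrative rather than taken from tiktoken or SentencePiece.

```python
def merge_pair(tokens, pair, new_token):
    """Replace every non-overlapping occurrence of pair with new_token."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn byte-pair merges; returns {pair: new_token_id} in learned order."""
    tokens = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = {}
        for pair in zip(tokens, tokens[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges[best] = 256 + step
        tokens = merge_pair(tokens, best, merges[best])
    return merges

def encode(text, merges):
    """Apply the learned merges to new text in the order they were learned."""
    tokens = list(text.encode("utf-8"))
    for pair, new_token in merges.items():  # dicts preserve insertion order
        tokens = merge_pair(tokens, pair, new_token)
    return tokens

def decode(tokens, merges):
    """Expand merged tokens back into bytes, then into a string."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_token in merges.items():
        vocab[new_token] = vocab[left] + vocab[right]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")

merges = train_bpe("low lower lowest low low", num_merges=5)
ids = encode("lowest low", merges)
print(ids, decode(ids, merges))
```

Because the base vocabulary is all 256 byte values, any UTF-8 string can be encoded even if none of its bytes appeared in the training text. It simply falls back to one token per byte.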

Industry Problem: Tokenization of Non-English and Code

Problem: BPE trained on English text creates long token sequences for Chinese/Japanese (each character = multiple tokens) and code (variable names split into sub-tokens). This wastes context window and increases cost.

Solutions: (1) Larger vocabulary: LLaMA-3 uses 128K tokens (vs GPT-2's 50K), reducing non-English token count by ~2x. (2) Byte-level BPE: handles any UTF-8 byte sequence. (3) Language-aware training data: balance corpus to better represent non-English text in merges.

Exercises

Exercise 5.1: Why not use characters or words directly?

Characters: Vocabulary is small (~256 for bytes) but sequences become very long (roughly 4-5x longer than BPE for English), making O(n²) attention prohibitive. Words: Vocabulary is huge (500K+), most words are rare, and unknown words can't be handled. BPE: the sweet spot, with 32K-128K tokens; it handles any text and balances sequence length against vocabulary size.

Exercise 5.2: How does vocab size affect model quality and efficiency?

Larger vocab = shorter sequences (more efficient inference, more context fits) but larger embedding matrix. GPT-2: 50K tokens. LLaMA-3: 128K tokens. The embedding matrix for 128K × 4096 dim = 512M parameters, which is significant but worthwhile for the compression benefit. Optimal vocab size depends on training data size and languages supported.

Chapter Summary

  • BPE iteratively merges frequent byte pairs to build a sub-word vocabulary
  • Tokenization is the first step in any LLM pipeline: garbage in, garbage out
  • Larger vocabularies reduce token count but increase embedding size
  • tiktoken (GPT-4) and SentencePiece (LLaMA) are industry standards
Chapter 6

Language Modeling Objectives

Learning Objectives

  • Master causal LM (next-token prediction), the GPT objective
  • Understand masked LM (BERT-style) and its limitations
  • Know prefix LM and span corruption (T5) objectives
Python
# Causal Language Modeling (GPT-style)
# Given: "The cat sat on the"
# Predict: "cat" "sat" "on" "the" "mat"

import torch.nn.functional as F

def causal_lm_loss(logits, targets):
    # logits: [B, T, vocab_size], targets: [B, T]
    # Shift: predict token t+1 from position t
    shift_logits = logits[:, :-1].contiguous()
    shift_labels = targets[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1))

# Every position provides a training signal!
# A 4096-token document gives 4095 prediction tasks
# This efficiency is why causal LM scales so well
Loss = -(1/T) × Σₜ log P(xₜ | x₁, x₂, ..., xₜ₋₁)

The Unreasonable Effectiveness of Next-Token Prediction

Next-token prediction seems trivially simple, yet it teaches: factual knowledge ("Paris is the capital of..."), reasoning ("If A implies B and B implies C, then..."), code ("def fibonacci(n):\n if n < 2:..."), math, translation, and even theory of mind. The argument often attributed to Ilya Sutskever is that predicting text well requires compressing it, and compressing it well requires understanding it.

Exercises

Exercise 6.1: Why is causal LM more efficient than masked LM for pre-training?

Causal LM: every token is a prediction target → T-1 training signals per document. Masked LM (BERT): only ~15% of tokens are masked → 0.15T signals. For the same compute, causal LM extracts 6-7x more learning signal. This is why GPT-style models scale better than BERT-style models.

Exercise 6.2: What is perplexity and how is it related to loss?

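Perplexity = e^(cross-entropy loss), with the loss measured in nats per token. A perplexity of 10 means the model is "as confused as if it had to choose between 10 equally likely tokens." Lower = better. For calibration, a perplexity of 4 corresponds to a loss of ln 4 ≈ 1.39 nats per token. Perplexity figures for GPT-4 have not been published, and comparisons are only meaningful between models evaluated with the same tokenizer and test set.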
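The "10 equally likely tokens" reading can be verified directly from the definition. The helper below is a small sketch (the name `perplexity` and the probability lists are illustrative) that takes the probability the model assigned to each true token.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

uniform_ppl = perplexity([0.1] * 20)          # always 1-in-10 odds
confident_ppl = perplexity([0.9, 0.8, 0.95])  # mostly right, so close to 1
print(uniform_ppl, confident_ppl)
```

A model that always assigns probability 0.1 to the correct token has perplexity 10 regardless of sequence length. Perplexity approaches 1 as the model becomes certain and correct.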

Chapter Summary

  • Causal LM (next-token prediction) is the dominant pre-training objective
  • Every token provides a training signal, so training is extremely data-efficient
  • Perplexity = e^loss: the standard metric for language modeling quality
  • Next-token prediction is surprisingly powerful: it teaches understanding, not just completion
Chapter 7

Data Collection, Cleaning & Deduplication

Learning Objectives

  • Build a pre-training data pipeline
  • Understand quality filtering, deduplication, and toxicity removal
  • Know the data composition of major LLMs
Model | Training Tokens | Data Sources
GPT-3 | 300B | CommonCrawl, WebText, Books, Wikipedia
LLaMA-2 | 2T | Publicly available sources (exact mixture not disclosed)
LLaMA-3 | 15T | Multi-source, heavily filtered and deduplicated
GPT-4 | ~13T (estimated) | Undisclosed (web + proprietary + synthetic)
Python
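# Data pipeline stages (pseudocode: detect_language, quality_score, minhash_dedup,
# remove_pii, is_toxic, tokenizer and pack_sequences are placeholders to implement)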
def data_pipeline(raw_docs):
    # 1. Language filtering
    docs = [d for d in raw_docs if detect_language(d) == "en"]

    # 2. Quality filtering (perplexity-based)
    docs = [d for d in docs if quality_score(d) > 0.5]

    # 3. Deduplication (MinHash + LSH)
    docs = minhash_dedup(docs, threshold=0.8)

    # 4. Toxicity/PII removal
    docs = [remove_pii(d) for d in docs if not is_toxic(d)]

    # 5. Tokenize and pack into sequences
    tokens = tokenizer.encode_batch(docs)
    return pack_sequences(tokens, max_len=4096)

Industry Problem: Data Quality vs. Quantity

Problem: CommonCrawl has ~250 billion pages, but ~90% is low-quality (duplicates, spam, SEO content, machine-generated text). Training on raw data produces incoherent models.

Solutions: (1) Classifier-based filtering: train a quality classifier on Wikipedia/books, filter web data. (2) Exact + fuzzy dedup: MinHash for near-duplicate detection (LLaMA removed 86% of CommonCrawl). (3) Domain mixing: oversample high-quality sources (code, textbooks, Wikipedia). (4) Synthetic data: use existing LLMs to generate training data (Phi-2 showed that a small model trained on curated data can compete with much larger models trained on raw web data).
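The near-duplicate step can be sketched with the standard library alone. This is a simplified MinHash under stated assumptions: 5-character shingles, 128 salted BLAKE2b hashes standing in for random permutations, and a direct pairwise comparison instead of the LSH banding a web-scale pipeline would need. All function names and example documents are illustrative.

```python
import hashlib

def shingles(text, k=5):
    """Set of overlapping k-character shingles after whitespace normalization."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def _hash(item, seed):
    """64-bit hash of item under a seed-specific salt."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(s, seed) for s in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "The quick brown fox jumps over the lazy dog near the river bank."
doc_b = "The quick brown fox jumps over the lazy dog near the river bank!"
doc_c = "Scaling laws relate model size, data, and compute budgets."
sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```

With a threshold such as 0.8, `doc_b` would be dropped as a near-duplicate of `doc_a` while `doc_c` is kept. Signatures are fixed-size, so comparing them is cheap even when the documents themselves are long.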

Exercises

Exercise 7.1: Why does deduplication improve model quality?

Duplicated data causes the model to memorize specific sequences instead of learning generalizable patterns. It also creates training instabilities (loss spikes on repeated content). LLaMA's dedup removed 86% of CommonCrawl but improved quality significantly. Research shows 3-5x deduplication can be equivalent to doubling compute budget.

Exercise 7.2: What is the data mixture problem and how do you solve it?

Different domains have different value: 1 token of Wikipedia > 1 token of a random blog. The optimal mixture allocates more training time to high-quality sources. The original LLaMA used roughly: 82% web (CommonCrawl + C4), 4.5% code, 4.5% Wikipedia, 4.5% books, 2.5% arXiv, 2% StackExchange. Finding the right mixture requires expensive ablation experiments or dedicated methods (DoReMi, data mixing laws).

Chapter Summary

  • Data quality > quantity: Phi-2 is a strong data point for this
  • Pipeline: filter → deduplicate → detoxify → tokenize → pack
  • MinHash LSH enables efficient near-duplicate detection at web scale
  • Domain mixing ratio significantly affects model capabilities
Chapter 8

Compute Scaling & Scaling Laws

Learning Objectives

  • Estimate FLOPs for training an LLM
  • Understand Chinchilla scaling laws: the optimal model-data tradeoff
  • Apply scaling laws to plan training runs
FLOPs ≈ 6 × N × D   (N = parameters, D = tokens)
Model | Parameters | Tokens | FLOPs | GPUs | Time
GPT-3 | 175B | 300B | 3.1×10²³ | 1024 A100s | ~34 days
LLaMA-2 70B | 70B | 2T | 8.4×10²³ | 2048 A100s | ~25 days
LLaMA-3 405B | 405B | 15T | 3.6×10²⁵ | 16384 H100s | ~54 days

Chinchilla Scaling Law

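For a compute-optimal model, parameters and tokens should scale roughly equally: D ≈ 20N. A 10B model should train on 200B tokens. GPT-3 (175B params, 300B tokens, under 2 tokens per parameter) was undertrained by this rule. Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens), a model 4x its size trained with roughly the same compute, and also outperformed GPT-3. LLaMA pushed further in the same direction, training smaller models on far more data than the compute-optimal point to make inference cheaper.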
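Combining FLOPs ≈ 6ND with D ≈ 20N gives a closed form for splitting a compute budget: C = 120N², so N = √(C/120). The sketch below applies it. The function names are illustrative and the factor 20 is the rule of thumb quoted above, not an exact constant.

```python
import math

def training_flops(n_params, n_tokens):
    """Approximate training compute: 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget C = 6*N*D under the rule of thumb D = 20*N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

budget = training_flops(10e9, 200e9)   # the 10B / 200B example above
n_opt, d_opt = chinchilla_optimal(budget)
gpt3_ratio = 300e9 / 175e9             # tokens per parameter for GPT-3
print(f"{n_opt:.3g} params, {d_opt:.3g} tokens, GPT-3 ratio {gpt3_ratio:.2f}")
```

Because N grows with the square root of compute, quadrupling the budget only doubles the compute-optimal model size. The other factor of two goes into data.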

Industry Problem: Training Cost

Problem: Training a frontier LLM costs $10M-$100M+ in compute. A single training run of LLaMA-3 405B on 16K H100s costs ~$30M+ in GPU hours. Failures are catastrophic.

Solutions: (1) Scaling laws: extrapolate from small experiments to predict large model performance before spending millions. (2) Efficient architectures: mixture of experts (Mixtral activates only 2 of 8 experts per token, so per-token compute is far below that of a dense model with the same parameter count). (3) Curriculum learning: start with easy data, increase complexity. (4) Infrastructure: checkpoint frequently and use fault-tolerant training frameworks (Megatron, DeepSpeed).

Exercises

Exercise 8.1: How many FLOPs to train a 7B model on 1T tokens?

FLOPs ≈ 6 × 7×10⁹ × 10¹² = 4.2×10²² FLOPs. An H100 does ~1×10¹⁵ FLOPs/sec. At 50% MFU: 5×10¹⁴ effective FLOPs/sec. Time with 1 H100: 4.2×10²²/(5×10¹⁴) = 84M seconds ≈ 2.7 years. With 256 H100s: ~4 days.
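The same estimate as a reusable function. This is a back-of-the-envelope sketch: the peak throughput and MFU defaults are the round numbers from this exercise, the name `training_days` is illustrative, and perfect scaling across GPUs is assumed.

```python
def training_days(n_params, n_tokens, n_gpus, peak_flops_per_gpu=1e15, mfu=0.5):
    """Wall-clock days under FLOPs = 6*N*D at a given model FLOPs utilization."""
    total_flops = 6.0 * n_params * n_tokens
    effective_rate = n_gpus * peak_flops_per_gpu * mfu  # FLOPs per second
    return total_flops / effective_rate / 86_400

one_gpu_days = training_days(7e9, 1e12, n_gpus=1)
cluster_days = training_days(7e9, 1e12, n_gpus=256)
print(f"1 GPU: {one_gpu_days / 365:.2f} years, 256 GPUs: {cluster_days:.1f} days")
```

In practice MFU tends to drop as the cluster grows because of communication overhead, so the multi-GPU figure is best read as a lower bound.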

Exercise 8.2: What is Mixture of Experts and why is it efficient?

MoE replaces the FFN with N expert FFNs + a router. The router selects the top-2 experts per token. Result: with 8 experts the FFN has 8x the parameters but only about 2x the FFN compute per token (only 2 experts are active). Mixtral-8x7B has ~47B total params but only ~13B active, and is reported to match or exceed LLaMA-2 70B on most benchmarks at much lower inference cost. GPT-4 is rumored to use MoE.
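A toy version of top-2 routing for a single token, in plain Python. It assumes one common design choice, a softmax over just the two selected router logits, and uses hypothetical scalar "experts" so the arithmetic is easy to follow. Real experts are full FFNs and the router is a learned linear layer.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and softmax over just those two."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:2]
    exps = [math.exp(router_logits[e] - router_logits[chosen[0]]) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]

def moe_forward(x, experts, router_logits):
    """Weighted sum of the chosen experts' outputs; the other experts never run."""
    return sum(weight * experts[e](x) for e, weight in top2_route(router_logits))

# Hypothetical toy experts: expert e multiplies its input by (e + 1)
experts = [lambda x, scale=e + 1: x * scale for e in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
routes = top2_route(logits)
y = moe_forward(1.0, experts, logits)
print(routes, y)
```

Only two of the eight expert functions are called per token, which is where the compute saving comes from. All eight still have to be stored, so memory scales with total parameters rather than active ones.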

Chapter Summary

  • FLOPs ≈ 6ND is the fundamental compute formula for transformer training
  • Chinchilla law: optimal D ≈ 20N, so train longer, not bigger
  • MoE provides parameter scaling without proportional compute increase
  • Scaling laws enable predicting large model performance from small experiments
Part III

Fine-tuning & Alignment

Making LLMs helpful, harmless, and honest

Chapter 9

Supervised Fine-Tuning (SFT)

Learning Objectives

  • Fine-tune a pre-trained LLM on instruction-response pairs
  • Build training datasets for SFT
  • Understand the instruction-following pipeline
Python
# SFT training data format
sft_data = [
    {
        "instruction": "Explain quantum computing in simple terms",
        "response": "Quantum computing uses quantum bits (qubits) that can be both 0 and 1 simultaneously (superposition). This allows quantum computers to explore many solutions at once, making them exponentially faster for certain problems like cryptography and drug discovery..."
    },
    {
        "instruction": "Write a Python function to find prime numbers",
        "response": "def is_prime(n):\n    if n < 2: return False\n    for i in range(2, int(n**0.5)+1):\n        if n % i == 0: return False\n    return True"
    }
]

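# Training: only compute loss on the response tokens!
def sft_loss(logits, labels, instruction_mask):
    # logits/labels are assumed to be already shifted for next-token prediction;
    # instruction_mask is a bool tensor, True on instruction tokens.
    # Mask out instruction tokens so we don't train on the question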
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), reduction='none')
    loss = loss * (~instruction_mask).float().view(-1)
    return loss.sum() / (~instruction_mask).sum()

Project: Fine-tune LLaMA with Hugging Face

Python
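import torch
from transformers import (AutoTokenizer, AutoModelForCausalLM, TrainingArguments,
                          Trainer, DataCollatorForLanguageModeling)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model

# Load model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA defines no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Add LoRA for efficient fine-tuning (Chapter 11)
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj","v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the counts itself and returns None

# Load instruction dataset and tokenize its pre-formatted "text" column
dataset = load_dataset("tatsu-lab/alpaca")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
train_data = dataset["train"].map(tokenize, batched=True,
                                  remove_columns=dataset["train"].column_names)

# Training
training_args = TrainingArguments(
    output_dir="./sft_output", num_train_epochs=3,
    per_device_train_batch_size=4, gradient_accumulation_steps=8,
    learning_rate=2e-5, fp16=True, warmup_ratio=0.1)

# mlm=False builds causal-LM labels from input_ids (loss on all tokens;
# mask the prompt tokens yourself for response-only loss)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
trainer = Trainer(model=model, args=training_args, train_dataset=train_data,
                  data_collator=collator)
trainer.train()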

Exercises

Exercise 9.1: Why only compute loss on response tokens, not instruction tokens?

The model should learn to generate responses, not memorize instructions. Computing loss on instructions wastes gradient signal on content we don't want the model to "generate." It also prevents the model from learning to parrot back instructions. The instruction provides context; the response is the training target.
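In practice the masking is usually done in the labels rather than with a separate mask tensor. The sketch below shows the idea with plain lists. It relies on -100 being the default `ignore_index` of PyTorch's cross-entropy loss; the token IDs, per-token losses, and function names are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross_entropy

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def masked_mean(per_token_losses, labels):
    """Average the loss over positions that are not masked."""
    kept = [loss for loss, y in zip(per_token_losses, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)

input_ids, labels = build_labels([11, 12, 13], [21, 22])
loss = masked_mean([5.0, 5.0, 5.0, 1.0, 3.0], labels)
print(labels, loss)
```

The three prompt positions contribute nothing to the average, so the large losses on instruction tokens never produce gradients. Only the two response positions do.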

Exercise 9.2: How many SFT examples are typically needed?

Surprisingly few! LIMA showed 1,000 high-quality examples can outperform models trained on 50K+ low-quality ones. The quality > quantity principle applies strongly to SFT. Key: diverse, well-written, covering different tasks and formats. 1K-50K examples with 1-3 epochs is typical.

Chapter Summary

  • SFT teaches pre-trained LLMs to follow instructions using (instruction, response) pairs
  • Only compute loss on response tokens; instructions provide context only
  • Quality > quantity: 1K excellent examples can outperform 50K mediocre ones (LIMA paper)
  • SFT is the first step of alignment: pre-train → SFT → RLHF/DPO
Chapter 10

RLHF & DPO: Aligning with Human Preferences

Learning Objectives

  • Understand RLHF: reward model + PPO optimization
  • Master DPO, the simpler alternative that skips reward modeling
  • Know the alignment pipeline used by ChatGPT and Claude

RLHF Pipeline

Pre-train → SFT → Train Reward Model → PPO (optimize policy against reward)
Python
# DPO: Direct Preference Optimization (simpler than RLHF)
# Given pairs: (prompt, chosen_response, rejected_response)

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """
    policy_chosen_logps: log P_θ(chosen | prompt)
    ref_chosen_logps: log P_ref(chosen | prompt), from the frozen reference model
    beta: temperature parameter
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # The model should assign higher reward to chosen vs rejected
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss

# DPO advantage: no separate reward model, no PPO complexity!
# Trains directly on preference pairs: (prompt, winner, loser)
Method | Complexity | Requires | Used By
RLHF + PPO | High (4 models in memory) | Reward model + PPO training | ChatGPT, Claude (early)
DPO | Low (2 models) | Preference pairs only | LLaMA-3, Zephyr, many open models
RLAIF | Medium | AI-generated feedback | Constitutional AI (Anthropic)
KTO | Lowest | Only thumbs up/down per response | Emerging research

Industry Problem: Alignment Tax and Reward Hacking

Problem: RLHF can reduce model capability ("alignment tax"): the model becomes safer but less knowledgeable. Reward hacking: the model learns to exploit the reward model (producing text that sounds confident but is wrong).

Solutions: (1) DPO: avoids reward model entirely, reducing hacking risk. (2) Iterative DPO: alternate between generating responses and collecting preferences. (3) Process reward models: reward each reasoning step, not just the final answer (OpenAI). (4) Constitutional AI: use principles to self-critique (Chapter 12).

Exercises

Exercise 10.1: Why is DPO preferred over RLHF in most open-source LLMs?

RLHF requires: (1) a trained reward model, (2) PPO optimization with 4 models in memory (policy, reference, reward, value), (3) careful hyperparameter tuning. DPO needs only preference pairs and two models (the policy and a frozen reference). It optimizes the same KL-regularized objective as RLHF under certain conditions, but is much simpler to implement and cheaper to train.

Exercise 10.2: What is the "reference model" in DPO and why is it needed?

The reference model (usually the SFT model, frozen) prevents the policy from drifting too far from sensible language. Without it, the model could learn to produce degenerate text that maximally satisfies preferences but is incoherent. The KL divergence penalty (built into DPO's loss) keeps the policy close to the reference.
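A scalar version of the DPO loss makes the role of the reference model concrete. The sketch below evaluates the loss for one preference pair. The log-probability values are made-up numbers and `dpo_loss_single` is an illustrative name.

```python
import math

def dpo_loss_single(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]) for one pair."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

at_reference = dpo_loss_single(-12.0, -15.0, -12.0, -15.0)  # policy == reference
improved = dpo_loss_single(-10.0, -18.0, -12.0, -15.0)      # chosen up, rejected down
worse = dpo_loss_single(-14.0, -13.0, -12.0, -15.0)         # moved the wrong way
print(at_reference, improved, worse)
```

When the policy equals the reference the loss is exactly ln 2, whatever the absolute log-probabilities are. The loss only rewards changes relative to the reference, which is how the reference anchors the policy.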

Chapter Summary

  • RLHF uses a reward model + PPO to optimize for human preferences
  • DPO simplifies alignment to direct preference optimization, with no reward model needed
  • The alignment pipeline: Pre-train → SFT → DPO/RLHF produces helpful, harmless models
  • Reward hacking is a real industry risk; process rewards and iterative training help
Chapter 11

LoRA, QLoRA & PEFT Methods

Learning Objectives

  • Master LoRA, the dominant parameter-efficient fine-tuning method
  • Understand QLoRA for fine-tuning on consumer GPUs
  • Compare PEFT methods and know when to use each
Python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Low-Rank Adaptation: W_new = W_frozen + A·B (rank r)"""
    def __init__(self, original_layer, r=16, alpha=32):
        super().__init__()
        self.original = original_layer
        # Freeze every parameter of the wrapped layer (weight and bias)
        for p in self.original.parameters():
            p.requires_grad = False
        # nn.Linear stores its weight as [out_features, in_features]
        d_out, d_in = original_layer.weight.shape
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)   # Down-project
        self.B = nn.Parameter(torch.zeros(r, d_out))          # Up-project (zero init: no change at start)
        self.scaling = alpha / r

    def forward(self, x):
        # Original output + low-rank adaptation
        return self.original(x) + (x @ self.A @ self.B) * self.scaling

# For LLaMA-7B (approximate figures):
# Full fine-tuning: 7B trainable params, typically run on a multi-GPU node (e.g. 8× A100 80GB)
# LoRA (r=16) on the query/value projections: ~8M trainable params (~0.1%), fits on 1× A100
# QLoRA (4-bit): similar quality, fits on 1× RTX 3090 24GB
Method | Trainable Params | Memory for 7B | Quality
Full Fine-tuning | 100% | ~120 GB | Best
LoRA (r=16) | ~0.1% | ~16 GB | ~98% of full
QLoRA (4-bit) | ~0.1% | ~6 GB | ~97% of full
Prefix Tuning | ~0.1% | ~16 GB | ~90% of full

Industry Problem: Fine-tuning Cost for Enterprise

Problem: Enterprises need domain-specific LLMs (legal, medical, finance) but can't afford $100K+ for full fine-tuning on A100 clusters. They have limited GPU resources (a few consumer GPUs).

Solutions: (1) QLoRA: fine-tune 70B models on a single 48GB GPU using 4-bit quantization. (2) LoRA adapters: swap task-specific adapters at inference time (one base model, many adapters). (3) RAG: augment with retrieval instead of fine-tuning. (4) Distillation: train a smaller model on the larger model's outputs.

Exercises

Exercise 11.1: Why does LoRA work? How can 0.1% of parameters capture task-specific knowledge?

Pre-training learns general knowledge in the full-rank weight matrix. Task-specific adaptation only needs to make small adjustments, and these adjustments live in a low-dimensional subspace. Research shows the "intrinsic dimensionality" of fine-tuning is very low (~hundreds, not billions). LoRA captures this low-rank structure explicitly.
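
The parameter savings follow directly from the shapes of A and B. The sketch below counts trainable parameters for one weight matrix; the 4096 × 4096 projection and rank 16 are assumed example values, and `lora_param_counts` is a hypothetical helper name.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out            # every entry of W is trainable
    lora = d_in * r + r * d_out    # A is [d_in, r], B is [r, d_out]
    return full, lora

# Assumed example: a 4096 x 4096 attention projection adapted with rank 16
full, lora = lora_param_counts(4096, 4096, 16)
ratio = lora / full
print(f"full={full:,}  lora={lora:,}  ratio={ratio:.2%}")
```

For a single adapted matrix the ratio is just under 1%. The whole-model fraction is lower still because adapters are usually attached to only some of the matrices (for example, the attention projections), while everything else stays frozen.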

Exercise 11.2: What is QLoRA and how does it fit 70B models on consumer GPUs?

QLoRA: (1) Quantizes the base model to 4-bit (NormalFloat4 format): 70B × 0.5 bytes = 35 GB. (2) Adds LoRA adapters in FP16. (3) Uses paged optimizers to handle memory spikes. (4) Trains only the small LoRA matrices. Result: 70B model fine-tunable on a 48GB A6000 or even dual 24GB RTX 3090s.
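
The memory arithmetic above can be checked with a few lines. This is a rough sketch that counts weights only (no activations, optimizer state, or quantization constants); the 0.1% adapter fraction is an assumption borrowed from the LoRA figures earlier in the chapter.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 70e9

fp16_gb = weight_memory_gb(n_params, 16)    # full-precision base model
nf4_gb = weight_memory_gb(n_params, 4)      # 4-bit quantized base model

# Assumed: adapters hold about 0.1% of the parameter count, stored in 16-bit
adapter_gb = weight_memory_gb(0.001 * n_params, 16)

print(f"FP16 base: {fp16_gb:.0f} GB")
print(f"4-bit base: {nf4_gb:.0f} GB")
print(f"LoRA adapters: {adapter_gb:.2f} GB")
```

The adapters are negligible next to the base model, so nearly all of the saving comes from storing the frozen weights in 4 bits.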

Exercise 11.3: When should you use full fine-tuning vs LoRA vs RAG?

Full FT: When you need maximum quality and have compute (production LLMs). LoRA: Domain adaptation with limited budget, the sweet spot for most enterprises. RAG: When knowledge changes frequently (news, docs) or you need citations. Many deployments combine LoRA (style/behavior) + RAG (knowledge).

Chapter Summary

  • LoRA adds low-rank matrices (A·B) to frozen weights: ~0.1% trainable params, ~98% quality
  • QLoRA quantizes the base model to 4-bit, enabling fine-tuning on consumer GPUs
  • PEFT methods democratize LLM customization for enterprises and researchers
  • Combine LoRA (behavior) + RAG (knowledge) for best enterprise results
Chapter 12

Constitutional AI & Instruction Following

Learning Objectives

  • Understand Constitutional AI (RLAIF): AI-generated alignment data
  • Build instruction-following datasets with self-critique
  • Know the techniques for making LLMs safe and controllable
Python
# Constitutional AI pipeline (Anthropic)
principles = [
    "Choose the response that is most helpful while being harmless.",
    "Choose the response that is most honest and factual.",
    "Avoid responses that are discriminatory or biased.",
]

def constitutional_critique(model, prompt, response, principle=principles[0]):
    """Use the model itself to critique and revise responses"""
    # `model.generate` is assumed to take a prompt string and return the generated text
    critique_prompt = f"""
Critique this response based on the principle: {principle}

Prompt: {prompt}
Response: {response}

Identify problems and suggest a revised response:"""
    critique = model.generate(critique_prompt)
    return critique

# Process:
# 1. Generate response to harmful prompt
# 2. Self-critique against principles
# 3. Revise response based on critique
# 4. Use (revised, original) as the (chosen, rejected) DPO preference pair
# = RLAIF: no human annotators needed!
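
The four-step process can be sketched end to end as follows. This is an illustration under stated assumptions, not Anthropic's implementation: `generate` is any callable that maps a prompt string to text, and `stub_generate` is a hypothetical stand-in that returns canned strings so the example runs without a model.

```python
def build_preference_pair(generate, prompt, principle):
    """Run one critique-and-revise round and return a preference record."""
    original = generate(prompt)
    critique = generate(
        f"Critique this response using the principle: {principle}\n"
        f"Prompt: {prompt}\nResponse: {original}"
    )
    revised = generate(
        f"Rewrite the response to address the critique.\n"
        f"Prompt: {prompt}\nResponse: {original}\nCritique: {critique}"
    )
    # The revision is treated as preferred over the first draft
    return {"prompt": prompt, "chosen": revised, "rejected": original}

def stub_generate(text):
    """Hypothetical stand-in for a real LLM call; returns canned strings."""
    if text.startswith("Critique"):
        return "The response gives unsafe instructions."
    if text.startswith("Rewrite"):
        return "I can't help with that, but here is a safe alternative."
    return "Sure, here is how to do it."

pair = build_preference_pair(
    stub_generate,
    "How do I pick a lock?",
    "Choose the response that is most helpful while being harmless.",
)
print(pair["chosen"])
print(pair["rejected"])
```

Records in this `prompt` / `chosen` / `rejected` shape are what preference-optimization methods such as DPO consume.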

Industry Problem: Safety Without Losing Capability

Problem: Over-alignment makes models refuse legitimate requests ("I can't help with that"). Under-alignment allows harmful outputs. Finding the right balance is a major industry challenge.

Solutions: (1) Nuanced refusal: refuse harmful content but explain why and offer alternatives. (2) System prompts: define behavior boundaries per deployment. (3) Layered safety: input filters + model alignment + output filters. (4) Red teaming: systematically test for vulnerabilities before deployment.

Exercises

Exercise 12.1: How does RLAIF differ from RLHF?

RLHF: Human annotators compare response pairs → reward model → PPO. Expensive ($2M+ for quality annotations). RLAIF: AI compares responses against constitutional principles → self-generated preference data → DPO. Much cheaper, scalable, and consistent. Trade-off: AI feedback may miss nuances humans catch.

Exercise 12.2: What makes a good instruction-following dataset?

Diversity (many tasks), quality (well-written responses), coverage (edge cases), formatting (consistent structure), and difficulty gradient (simple → complex). Alpaca (52K) used GPT-3.5 to generate data. OpenOrca used GPT-4. Key insight: the teacher model's quality directly determines the student's ceiling.

Chapter Summary

  • Constitutional AI uses self-critique against principles for scalable alignment
  • RLAIF replaces expensive human feedback with AI-generated preferences
  • Instruction-following requires diverse, high-quality training data
  • Safety is a spectrum: balance helpfulness with harmlessness
Part IV

Efficiency & Deployment

Serving LLMs at scale in production

Chapter 13

Quantization, Pruning & Distillation

Learning Objectives

  • Quantize LLMs to INT8/INT4 for efficient inference
  • Understand knowledge distillation: training small models from large ones
  • Know pruning techniques for model compression
Python
# Quantization with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization (the scheme QLoRA uses; GPTQ and AWQ are separate methods)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True       # Quantize the quantization constants!
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=quant_config,
    device_map="auto"
)
# 70B model: FP16 = 140 GB → 4-bit = ~35 GB → can fit across 2× RTX 3090 (24 GB each)
Precision | Bits | Size (7B model) | Quality vs FP16
FP32 | 32 | 28 GB | Baseline (100%)
FP16/BF16 | 16 | 14 GB | ~100%
INT8 | 8 | 7 GB | ~99%
INT4 (GPTQ/AWQ) | 4 | 3.5 GB | ~97%
GGUF Q4_K_M | ~4.5 | ~4 GB | ~96%

Industry Problem: Serving Cost Per Token

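Problem: Running LLaMA-70B in FP16 requires 2× A100 80GB ($2/hour each, $4/hour total). A single stream at 50 tokens/sec costs ~$0.022 per 1K tokens; reaching ~$0.003 per 1K tokens requires batching to roughly 370 tokens/sec aggregate. Even at $0.003 per 1K tokens, 1B tokens/day (medium startup) costs $3K/day = $90K/month.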
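
The relationship between hourly GPU price, throughput, and per-token cost is simple enough to sketch. The helper names below are illustrative; the inputs are the figures quoted in the problem statement, and the throughput is treated as the aggregate across all concurrent requests on the node.

```python
def cost_per_1k_tokens(gpu_cost_per_hour, tokens_per_second):
    """Dollar cost of generating 1,000 tokens at a given aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1000

def required_throughput(gpu_cost_per_hour, target_cost_per_1k):
    """Aggregate tokens/sec needed to hit a target cost per 1K tokens."""
    return gpu_cost_per_hour * 1000 / (target_cost_per_1k * 3600)

node_cost = 2 * 2.0  # two A100s at $2/hour each

single_stream = cost_per_1k_tokens(node_cost, 50)   # one stream at 50 tokens/sec
needed = required_throughput(node_cost, 0.003)      # throughput that yields $0.003 per 1K
monthly = 1e9 / 1000 * 0.003 * 30                   # 1B tokens/day at $0.003 per 1K

print(f"50 tok/s: ${single_stream:.4f} per 1K tokens")
print(f"$0.003 per 1K needs about {needed:.0f} tok/s aggregate")
print(f"1B tokens/day: ${monthly:,.0f} per month")
```

The takeaway is that per-token cost is driven by aggregate throughput, which is why batching and quantization matter so much for serving economics.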

Solutions: (1) INT4 quantization: near-identical quality, 4× less memory, run on cheaper GPUs. (2) Distillation: train a 7B model to mimic the 70B, for roughly 10x cheaper inference. (3) Speculative decoding: use a small draft model + large verifier for 2-3x speedup. (4) Batching: serve multiple requests simultaneously with continuous batching.

Exercises

Exercise 13.1: Why does INT4 quantization only lose ~3% quality?

Neural network weights are normally distributed and redundant. Most weights are near zero and don't need 16-bit precision. NormalFloat4 (NF4) is information-theoretically optimal for normally distributed data: it allocates more precision to common values. The small quality loss is because extreme weight values (important but rare) lose some precision.
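
To see where the precision loss comes from, here is a simplified round trip through 4-bit quantization. Note that this sketch uses plain symmetric absmax quantization with evenly spaced levels, not NF4 (which spaces its levels according to a normal distribution); the block of weights is made up for illustration.

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        scale = 1.0                             # all-zero block: any scale works
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

# Made-up block of weights: mostly small values plus one larger outlier
block = [0.02, -0.15, 0.07, 0.31, -0.01, 0.00, -0.22, 0.11]

codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(codes)
print(f"scale={scale:.4f}  max_error={max_error:.4f}")
```

Each weight lands within half a quantization step of its original value. Because the step size is set by the largest weight in the block, one outlier makes the grid coarser for every small weight around it, which is one reason practical schemes quantize in small blocks.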

Exercise 13.2: How does knowledge distillation work?

Train a small "student" model to match the large "teacher" model's output distribution (soft labels), not just the hard labels. The teacher's probability distribution contains "dark knowledge": e.g., P("cat")=0.7, P("kitten")=0.2 teaches that cats and kittens are related. The student typically learns faster and better than training from scratch on hard labels.
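
A minimal sketch of the soft-label objective, assuming the common formulation: a KL divergence between temperature-softened teacher and student distributions, scaled by T². The logits and the temperature of 2.0 are made-up example values.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures
    return kl * temperature ** 2

teacher = [4.0, 2.5, -1.0]          # e.g. logits for "cat", "kitten", "car"
far_student = [0.0, 0.0, 0.0]       # untrained: uniform prediction
close_student = [3.8, 2.4, -0.9]    # nearly matches the teacher

loss_far = distillation_loss(far_student, teacher)
loss_close = distillation_loss(close_student, teacher)
print(f"far={loss_far:.4f}  close={loss_close:.6f}")
```

In practice this term is usually mixed with the ordinary cross-entropy on hard labels, with a weighting coefficient chosen per task.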

Exercise 13.3: What is the difference between GPTQ, AWQ, and GGUF quantization?

GPTQ: Post-training quantization using calibration data, minimizes output error. GPU-optimized. AWQ: Activation-aware quantization that preserves channels important for activations, often slightly better quality. GGUF: llama.cpp format for CPU inference, multiple quantization levels (Q4_K_M, Q5_K_S, etc.). GPTQ/AWQ for GPU, GGUF for CPU/edge.

Chapter Summary

  • INT4 quantization reduces model size 4x with ~3% quality loss
  • Distillation trains smaller models to mimic larger ones, for roughly 10x cheaper inference
  • GPTQ/AWQ for GPU deployment; GGUF/llama.cpp for CPU/edge
  • Quantization is the #1 way to reduce serving costs in production
Chapter 14

Serving: Flash Attention, Parallelism & vLLM

Learning Objectives

  • Understand Flash Attention, the algorithm that enabled long contexts
  • Master tensor and pipeline parallelism for multi-GPU serving
  • Deploy LLMs with vLLM and llama.cpp
  • Design prompt engineering strategies and manage context windows

Flash Attention

Python
# Flash Attention: fused kernel, O(n) memory instead of O(n²)
# Standard attention materializes the full n×n attention matrix
# Flash Attention computes attention in tiles, never storing the full matrix

# In PyTorch 2.0+:
import torch
from torch.nn.functional import scaled_dot_product_attention

# Example inputs shaped [batch, n_heads, seq_len, d_k]
Q = torch.randn(1, 8, 1024, 64)
K = torch.randn(1, 8, 1024, 64)
V = torch.randn(1, 8, 1024, 64)

# Dispatches to a Flash Attention kernel when the hardware and dtype support it
output = scaled_dot_product_attention(Q, K, V, is_causal=True)

# For LLaMA with 128K context:
# Standard attention memory: 128K × 128K × 2 bytes = 32 GB per head, per layer
# Flash Attention memory: O(n), on the order of megabytes per head
# Speedup: 2-4x on A100, enables 128K+ context windows
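
The tiling idea rests on the "online softmax" trick: keep a running maximum, a running normalizer, and an unnormalized output, and rescale them whenever a new tile raises the maximum. The sketch below shows this for a single query vector in pure Python. It illustrates the numerical idea only; the real kernel also tiles over queries and runs fused on the GPU. All names and the example vectors are made up.

```python
import math

def naive_attention(q, keys, values):
    """Single-query attention that materializes every score at once."""
    scale = 1 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def tiled_attention(q, keys, values, tile=2):
    """Same result, processing keys/values one tile at a time."""
    scale = 1 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim                      # unnormalized output accumulator
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.3, 0.2]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile=2)
print(full)
print(tiled)
```

Only one tile of scores exists at any moment, which is why memory grows with the sequence length and not with its square.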

Parallelism Strategies

Strategy | Splits | Use Case
Tensor Parallelism | Each layer across GPUs | Single-node multi-GPU (fast interconnect)
Pipeline Parallelism | Different layers on different GPUs | Multi-node (high latency OK)
Data Parallelism | Same model, different batches | Training (not useful for single request)
Sequence Parallelism | Long sequences across GPUs | Very long contexts (Ring Attention)
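
As a toy illustration of tensor parallelism, the sketch below splits the output columns of one weight matrix across two simulated "devices", lets each compute its slice of the result, and concatenates the slices. This shows only the column-parallel case with plain Python lists; real frameworks also use row-parallel splits and collective communication, which are not modeled here.

```python
def matmul(x, w):
    """x: [d_in], w: [d_in][d_out] -> [d_out]"""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def split_columns(w, n_shards):
    """Split the output columns of w into n_shards equal slices."""
    d_out = len(w[0])
    step = d_out // n_shards
    return [[row[s * step:(s + 1) * step] for row in w]
            for s in range(n_shards)]

x = [1.0, 2.0, 3.0]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]

shards = split_columns(w, 2)                          # one shard per simulated device
partial_outputs = [matmul(x, shard) for shard in shards]
gathered = [y for part in partial_outputs for y in part]  # the "all-gather" step

print(gathered)
print(matmul(x, w))
```

Each device holds only half of the weight matrix, yet the gathered output matches the single-device result. The gather happens at every layer, which is why this strategy depends on a fast interconnect.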

Serving with vLLM

Python
# vLLM: high-throughput LLM serving with PagedAttention
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
           tensor_parallel_size=2,     # 2 GPUs
           gpu_memory_utilization=0.9)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# vLLM uses PagedAttention: the KV cache is managed like virtual memory
# This eliminates memory waste from pre-allocated KV caches
# Reported throughput: 2-4x over earlier serving systems, up to 24x over Hugging Face Transformers

prompts = [
    "Explain machine learning in simple terms",
    "Write a Python quicksort function",
    "What causes climate change?",
]
outputs = llm.generate(prompts, params)  # Batched! Continuous batching
for out in outputs:
    print(out.outputs[0].text)

Prompt Engineering & Context Windows

Python
# System prompt engineering
system_prompt = """You are a senior financial analyst. Follow these rules:
1. Always cite your sources with specific data points
2. When uncertain, explicitly state your confidence level
3. Present both bullish and bearish perspectives
4. Use tables for comparisons"""

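# RAG: Retrieve relevant context to fit in window
def rag_prompt(query, retrieved_docs, max_context=4000):
    # Add documents in ranked order until the character budget is used up
    # (characters stand in for tokens here; a tokenizer gives an exact count)
    selected, used = [], 0
    for doc in retrieved_docs:
        if used + len(doc) > max_context:
            break
        selected.append(doc)
        used += len(doc)
    context = "\n---\n".join(selected)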
    return f"""Based on the following documents, answer the question.

Context:
{context}

Question: {query}

Answer based ONLY on the provided context. If the answer is not in the context, say so."""

Industry Problem: Latency vs. Throughput

Problem: Users expect <50ms time-to-first-token (TTFT) and 30+ tokens/second streaming. But batch processing (high throughput) increases latency. How do you serve thousands of concurrent users with good latency?

Solutions: (1) Continuous batching (vLLM): dynamically add/remove requests from the batch without waiting. (2) Speculative decoding: a small model drafts tokens, the large model verifies them (2-3x speedup). (3) Prefix caching: cache KV for common system prompts. (4) Quantized serving: INT4 models fit on fewer GPUs with lower latency. (5) Edge deployment: llama.cpp on consumer hardware for local inference.

Project: Deploy an LLM API with vLLM

Bash
# Start vLLM server (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --port 8000

# Client code (works with any OpenAI SDK!)
Python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Explain transformers in 100 words"}],
    max_tokens=200, temperature=0.7
)
print(response.choices[0].message.content)

Exercises

Exercise 14.1: How does PagedAttention (vLLM) improve memory efficiency?

Traditional serving pre-allocates KV cache for max_seq_len per request, and most of it is wasted (the average response is much shorter). PagedAttention allocates KV cache in small pages (like OS virtual memory), only using what's needed. When a request ends, pages are freed. This reduces memory waste from ~60-80% to ~4%, enabling 2-4x more concurrent requests.
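
The waste figures can be reproduced with a back-of-the-envelope calculation. This sketch counts KV-cache token slots only; the four request lengths, the 2048-token maximum, and the 16-token block size are assumed example values, and the function names are illustrative.

```python
import math

def preallocated_slots(requests, max_seq_len):
    """Slots reserved when every request gets max_seq_len up front."""
    return len(requests) * max_seq_len

def paged_slots(requests, block_size=16):
    """Slots reserved when each request holds only the blocks it has filled."""
    return sum(math.ceil(n / block_size) * block_size for n in requests)

# Assumed workload: actual token counts (prompt + response) for four requests
requests = [180, 512, 95, 1300]
used = sum(requests)

static = preallocated_slots(requests, max_seq_len=2048)
paged = paged_slots(requests)

static_waste = 1 - used / static
paged_waste = 1 - used / paged

print(f"pre-allocated: {static} slots, {static_waste:.0%} wasted")
print(f"paged:         {paged} slots, {paged_waste:.1%} wasted")
```

With paging, the only waste is the partially filled last block of each request, so it stays small no matter how far requests fall short of the maximum length.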

Exercise 14.2: When would you use llama.cpp vs vLLM?

vLLM: GPU server deployment, high throughput, many concurrent users, production APIs. llama.cpp: Local/edge deployment on CPU/Mac/consumer GPU, single-user, privacy-sensitive applications, offline use. llama.cpp supports GGUF quantized models that run on any hardware with no GPU required.

Exercise 14.3: What is speculative decoding and why does it speed up generation?

A small "draft" model (e.g., 1B) generates K tokens quickly. The large "target" model (e.g., 70B) verifies all K tokens in one forward pass (parallel verification is as fast as one step). Accepted tokens are free; rejected tokens are re-generated. Acceptance rate is typically 70-90%, giving ~2-3x speedup with no quality loss.
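
Here is a toy sketch of the draft-then-verify loop, using greedy decoding and two made-up deterministic "models" over integer tokens. It is a simplification: real systems verify all drafted positions in a single batched forward pass (simulated here with a loop), and sampling-based variants use a probabilistic acceptance rule that this sketch does not implement.

```python
def target_next(tokens):
    """Toy 'large' model: next token is a fixed function of the context."""
    return (sum(tokens) * 7 + 3) % 10

def draft_next(tokens):
    """Toy 'small' model: agrees with the target unless the last token is 4."""
    guess = target_next(tokens)
    return (guess + 1) % 10 if tokens[-1] == 4 else guess

def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt, n_new, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. Draft k tokens cheaply with the small model
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. One target pass checks every drafted position
        target_passes += 1
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                tokens.extend(draft[:i])    # keep the accepted prefix
                tokens.append(expected)     # the target's token replaces the miss
                break
        else:
            tokens.extend(draft)            # all k drafts accepted
    return tokens[:len(prompt) + n_new], target_passes

prompt = [1, 2]
baseline = greedy_decode(prompt, 12)
output, passes = speculative_decode(prompt, 12, k=4)
print(output == baseline, passes)
```

The output is identical to decoding with the target alone, but the target runs fewer times than the number of tokens produced. That gap is the source of the speedup, and it grows with the draft model's acceptance rate.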

Chapter Summary

  • Flash Attention enables 128K+ contexts by reducing memory from O(n²) to O(n)
  • Tensor parallelism splits layers across GPUs; pipeline parallelism splits the model
  • vLLM with PagedAttention provides 2-4x throughput improvement for production serving
  • Speculative decoding, continuous batching, and prefix caching optimize latency
  • llama.cpp enables local/edge deployment on consumer hardware

🎓 Congratulations!

You've completed Large Language Models. You now understand how to build, train, align, and deploy the AI systems that are transforming the world, from Transformer architecture to production serving at scale.

© 2025 EduArtha: Large Language Models Complete Guide