Phase 4 • EduArtha

Deep Learning

The engine behind modern AI. Neural networks, backpropagation, and deep architectures are built here. Every breakthrough from GPT to Stable Diffusion to AlphaFold relies on these foundations.

⏱ 4–8 months | 14 Chapters | 50+ Exercises

Part I

Neural Network Fundamentals

The building blocks of every deep learning model

Chapter 1

Perceptrons & Multilayer Networks

Learning Objectives

Understand the perceptron — the simplest neural unit
Build multilayer perceptrons (MLPs) from scratch
Grasp universal approximation and why depth still matters
Connect neurons to modern AI: every LLM is built on these

The Perceptron

A perceptron computes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. It's a single artificial neuron, the atom of deep learning.

output = activation(w₁x₁ + w₂x₂ + ... + wₙxₙ + b) = activation(W·X + b)

Python
import numpy as np

class Perceptron:
    def __init__(self, n_inputs, lr=0.01):
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0.0
        self.lr = lr

    def forward(self, x):
        return 1.0 if np.dot(self.weights, x) + self.bias > 0 else 0.0

    def train(self, X, y, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = self.forward(xi)
                error = yi - pred
                self.weights += self.lr * error * xi
                self.bias += self.lr * error

# AND gate — linearly separable
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0, 0, 0, 1])
p = Perceptron(2)
p.train(X, y)
print([p.forward(xi) for xi in X])  # once training has converged: [0.0, 0.0, 0.0, 1.0]

Multilayer Perceptron (MLP)

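Stacking layers of neurons creates an MLP. By the Universal Approximation Theorem, an MLP with enough hidden units can approximate any continuous function on a bounded input domain to arbitrary accuracy; the theorem guarantees that suitable weights exist, not that training will find them. The key insight: non-linear activations between layers allow the network to model complex, non-linear relationships.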

Python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)

model = MLP(784, 256, 10)  # MNIST: 784 pixels → 10 digits
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

Why This Matters for AI

Every modern AI model — GPT-4, Gemini, DALL-E, AlphaFold — is built from layers of neurons. The MLP is the fundamental building block. Transformer feed-forward layers? MLPs. Classification heads? MLPs. Understanding how neurons combine to learn representations is the foundation for everything that follows.

Project: MNIST Digit Classifier from Scratch

Python
import torch, torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,),(0.3081,))])
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

# Model
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 512), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(256, 10)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Train
for epoch in range(5):
    model.train()
    for X, y in train_loader:
        loss = criterion(model(X), y)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

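    # Evaluate (no gradients are needed, so skip building the autograd graph)
    model.eval()
    with torch.no_grad():
        correct = sum(
            (model(X).argmax(1) == y).sum().item()
            for X, y in test_loader)
    print(f"Epoch {epoch+1}: Accuracy = {correct/len(test_data):.2%}")
# Typically reaches roughly 98% test accuracy after 5 epochs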

Exercises

Exercise 1.1: Why can't a single perceptron solve XOR?

XOR is not linearly separable — no single line can divide the four points into correct classes. A perceptron draws one linear boundary. You need at least 2 layers (hidden + output) to create the two boundaries needed for XOR. This limitation motivated the development of MLPs.
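
The two-boundary argument above can be made concrete with a hand-wired network. This is a minimal sketch, not a trained model: the weights and thresholds are chosen by hand for illustration, and the names `step` and `xor_net` are hypothetical.

```python
def step(z):
    """Heaviside step activation, as used by the classic perceptron."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden unit 1 draws the first boundary: fires for OR (x1 + x2 > 0.5)
    h_or = step(x1 + x2 - 0.5)
    # Hidden unit 2 draws the second boundary: fires for AND (x1 + x2 > 1.5)
    h_and = step(x1 + x2 - 1.5)
    # Output unit fires when OR is on and AND is off, which is exactly XOR
    return step(h_or - h_and - 0.5)

xor_outputs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0, 1, 1, 0]
```

Each hidden unit is itself a perceptron with one linear boundary; only the combination of the two separates the XOR classes.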

Exercise 1.2: How many parameters does a network with layers [784, 512, 256, 10] have?

Layer 1: 784×512 + 512 = 401,920. Layer 2: 512×256 + 256 = 131,328. Layer 3: 256×10 + 10 = 2,570. Total: 535,818. Each layer has weights (input×output) plus biases (output). Modern LLMs have billions — but the math is the same.
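
The same arithmetic can be checked in a few lines. A small sketch, assuming fully connected layers with biases; the helper name `count_mlp_params` is made up for this example.

```python
def count_mlp_params(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix
        total += fan_out           # bias vector
    return total

print(count_mlp_params([784, 512, 256, 10]))  # 535818
print(count_mlp_params([784, 256, 256, 10]))  # 269322, the MLP(784, 256, 10) defined earlier
```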

Exercise 1.3: What is the Universal Approximation Theorem?

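A feedforward network with a single hidden layer of sufficient width and a non-polynomial activation can approximate any continuous function on a compact (closed and bounded) domain to any desired accuracy. The theorem says nothing about how wide the layer must be or whether gradient descent will find the weights. In practice, deep networks often represent the same functions with far fewer parameters than wide, shallow ones. Depth enables hierarchical feature learning: edges → shapes → objects.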

Chapter Summary

Perceptrons compute weighted sums with activation — the basic neural unit
MLPs stack layers with non-linear activations to learn complex functions
Every modern AI model is built from these fundamental building blocks
Depth enables hierarchical feature learning — the key insight of deep learning

Chapter 2

Activation Functions

Learning Objectives

Understand why non-linearity is essential
Master ReLU, sigmoid, tanh, GELU, and Swish
Know which activation to use where and why

Python
import torch
import torch.nn.functional as F
import numpy as np

x = torch.linspace(-5, 5, 100)

# Key activation functions
sigmoid = torch.sigmoid(x)            # σ(x) = 1/(1+e⁻ˣ) — range (0,1)
tanh = torch.tanh(x)                  # range (-1,1)
relu = F.relu(x)                       # max(0,x) — most popular
leaky_relu = F.leaky_relu(x, 0.01)    # allows small negative gradients
gelu = F.gelu(x)                       # Used in Transformers (GPT, BERT)
silu = F.silu(x)                       # x·σ(x) — "Swish", used in EfficientNet

Function	Formula	Range	Used In	Pros/Cons
Sigmoid	1/(1+e⁻ˣ)	(0,1)	Output layers (binary)	Vanishing gradient, not zero-centered
Tanh	(eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)	(-1,1)	RNNs, LSTMs	Zero-centered but still saturates
ReLU	max(0,x)	[0,∞)	CNNs, default choice	Fast, no saturation for x > 0; "dying ReLU"
GELU	x·Φ(x)	(-0.17,∞)	Transformers (GPT, BERT)	Smooth, probabilistic gating
SiLU/Swish	x·σ(x)	(-0.28,∞)	EfficientNet, modern CNNs	Smooth, self-gated, unbounded

Future of AI: Why GELU Dominates Transformers

GELU (Gaussian Error Linear Unit) is used in GPT, BERT, ViT, and many other transformers. Unlike ReLU's hard cutoff at 0, GELU weights each input x by Φ(x), the probability that a standard normal variable falls below x, which acts as a smooth, probabilistic gate. This smoothness is thought to help gradient flow in very deep models (100+ layers). Some newer LLMs, such as LLaMA, use gated variants like SwiGLU instead.

Exercises

Exercise 2.1: Why does removing all activations collapse a deep network to a single linear layer?

Without activations: Layer1 = W₁x + b₁, Layer2 = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂) = W'x + b'. The composition of linear functions is linear. No matter how many layers, it's equivalent to one matrix multiplication. Non-linearity is what makes depth useful.
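
The algebra above can be verified numerically. This is an illustrative NumPy sketch with arbitrary small shapes and a fixed seed; the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer: W' = W2 W1, b' = W2 b1 + b2
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged

collapsed = np.allclose(two_layers, one_layer)
print(collapsed)  # True

# With a ReLU in between, the merged layer generally no longer matches
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(with_relu, one_layer)
```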

Exercise 2.2: What is the "dying ReLU" problem and how do you fix it?

A ReLU neuron whose pre-activation is negative for every training input (for example, after a large gradient step pushes its bias far below zero) always outputs 0, so its gradient is always 0 and its weights never update again. It's "dead." Fixes: (1) Leaky ReLU (small negative slope), (2) PReLU (learned slope), (3) ELU (smooth negative), (4) GELU/SiLU (smooth everywhere), (5) careful weight initialization and a lower learning rate.

Exercise 2.3: Which activation would you use for a model outputting probabilities for 10 classes?

Use Softmax for the final output layer — it converts logits to probabilities that sum to 1. For hidden layers, use ReLU (CNNs) or GELU (transformers). Never use softmax in hidden layers — it constrains the representation and makes training harder.
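
To see what softmax does to a vector of logits, here is a small standard-library sketch. The logit values are made up, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than part of the mathematical definition.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Ten hypothetical class logits
probs = softmax([2.0, 1.0, 0.1, -1.0, 0.0, 0.5, 3.0, -2.0, 1.5, 0.2])
print(sum(probs))                # approximately 1.0
print(probs.index(max(probs)))   # 6, the class with the largest logit
```

Note that PyTorch's `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally, so a model trained with it should not end in an explicit softmax layer.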

Chapter Summary

Non-linear activations are what make deep networks powerful — without them, depth is useless
ReLU is the default for most architectures; GELU for transformers
Sigmoid/softmax for output layers (probabilities); tanh for RNNs
Modern activations (GELU, SiLU) are smooth, avoiding dead neuron issues

Chapter 3

Forward Pass & Backpropagation

Learning Objectives

Trace the forward pass computation step by step
Understand backpropagation via the chain rule
Implement backprop from scratch in NumPy
Know why autograd replaces manual gradients

The Forward Pass

Input → [W₁·x + b₁] → ReLU → [W₂·h + b₂] → Softmax → Loss(ŷ, y)

Backpropagation: Chain Rule

Backpropagation computes ∂Loss/∂wᵢ for every weight by applying the chain rule backward through the network. This gradient tells each weight how much it contributed to the error — and which direction to adjust.

Python
import numpy as np

# 2-layer neural network from scratch
np.random.seed(42)
X = np.random.randn(100, 2)
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)

# Initialize weights
W1 = np.random.randn(2, 16) * 0.5
b1 = np.zeros((1, 16))
W2 = np.random.randn(16, 1) * 0.5
b2 = np.zeros((1, 1))
lr = 0.1

def sigmoid(z): return 1 / (1 + np.exp(-z))

for epoch in range(1000):
    # FORWARD
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)    # ReLU
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)           # Output probability

    # Loss (binary cross-entropy)
    loss = -np.mean(y * np.log(a2 + 1e-8) + (1-y) * np.log(1-a2 + 1e-8))

    # BACKWARD (chain rule)
    m = len(X)
    dz2 = a2 - y                         # ∂L/∂z2
    dW2 = (1/m) * a1.T @ dz2             # ∂L/∂W2
    db2 = (1/m) * np.sum(dz2, axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)       # ReLU gradient
    dW1 = (1/m) * X.T @ dz1
    db1 = (1/m) * np.sum(dz1, axis=0)

    # UPDATE
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

    if epoch % 200 == 0:
        acc = ((a2 > 0.5) == y).mean()
        print(f"Epoch {epoch}: loss={loss:.4f}, acc={acc:.2%}")

Why This Matters for AI

Backpropagation is the algorithm that makes deep learning possible. Training a large language model means computing gradients for hundreds of billions of parameters (GPT-3 has 175 billion; OpenAI has not published GPT-4's size) across billions of tokens, all via the chain rule. Understanding backprop deeply is essential for debugging training failures, implementing custom layers, and pushing the frontier of AI research.

Exercises

Exercise 3.1: Why is the chain rule necessary — can't we compute gradients directly?

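A neural network is a deeply nested function: f(g(h(x))). Expanding the whole composition symbolically, or perturbing each weight separately and rerunning the network, becomes prohibitively expensive as the network grows. The chain rule decomposes the derivative: ∂f/∂x = (∂f/∂g)(∂g/∂h)(∂h/∂x). Each factor is simple and local. Backprop reuses these factors to compute all n gradients in one backward pass, at roughly the cost of one forward pass, whereas perturbing the n weights one at a time needs n forward passes: O(n) work instead of O(n²).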
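
The factor-by-factor decomposition can be checked against a numerical derivative. This is a sketch on a tiny scalar network chosen for illustration (tanh hidden unit, sigmoid output, squared error); the function names and the input values are assumptions, not part of the chapter's main example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    h = math.tanh(w1 * x)   # inner function
    p = sigmoid(w2 * h)     # outer function
    return h, p

def loss(w1, w2, x, y):
    _, p = forward(w1, w2, x)
    return (p - y) ** 2

def grad_w1_chain_rule(w1, w2, x, y):
    h, p = forward(w1, w2, x)
    dL_dp = 2.0 * (p - y)      # derivative of squared error
    dp_dz2 = p * (1.0 - p)     # sigmoid'
    dz2_dh = w2                # z2 = w2 * h
    dh_dz1 = 1.0 - h ** 2      # tanh'
    dz1_dw1 = x                # z1 = w1 * x
    return dL_dp * dp_dz2 * dz2_dh * dh_dz1 * dz1_dw1

def grad_w1_numeric(w1, w2, x, y, eps=1e-6):
    # Central finite difference: needs two extra forward passes per weight
    return (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)

analytic = grad_w1_chain_rule(0.5, -1.2, 0.8, 1.0)
numeric = grad_w1_numeric(0.5, -1.2, 0.8, 1.0)
print(analytic, numeric)  # the two values should agree to several decimal places
```

A check like this (a "gradient check") is also a practical way to debug hand-written backward passes.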

Exercise 3.2: What is the vanishing gradient problem?

In deep networks with sigmoid/tanh, gradients shrink exponentially as they flow backward: each sigmoid layer multiplies by σ'(x) ≤ 0.25, and tanh'(x) ≤ 1 with equality only at x = 0. By the time gradients reach early layers, they're ~0, so those layers barely learn. Mitigations: ReLU (gradient = 1 for positive inputs), skip connections (ResNet), careful initialization, batch normalization.
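
A back-of-the-envelope calculation shows how quickly the sigmoid factor alone shrinks the gradient. This sketch deliberately ignores the weight matrices, which also multiply the gradient, so treat it as an upper bound on the activation-derivative factor under that simplifying assumption.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)  # the sigmoid derivative peaks at z = 0
print(max_grad)               # 0.25

# Best-case product of sigmoid derivatives across several layers
for depth in (5, 10, 20):
    print(depth, max_grad ** depth)

bound_20 = max_grad ** 20     # about 9e-13 after 20 layers
```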

Exercise 3.3: How does PyTorch's autograd relate to backpropagation?

PyTorch builds a computational graph during the forward pass, recording every operation. Calling loss.backward() traverses this graph in reverse, applying the chain rule automatically. This is reverse-mode automatic differentiation — equivalent to backprop but computed automatically, eliminating manual gradient derivation.

Chapter Summary

Forward pass computes predictions; backward pass computes gradients
Backpropagation = chain rule applied backward through the network
Vanishing gradients in deep networks are mitigated by ReLU, skip connections, and normalization
Autograd (PyTorch/TF) automates gradient computation via computational graphs

Chapter 4

Weight Initialization, Batch Normalization & Dropout

Learning Objectives

Initialize weights properly to enable training (Xavier/Glorot, He/Kaiming)
Normalize activations with Batch Normalization for stable training
Regularize with Dropout to prevent overfitting

Weight Initialization

Method	Formula (Variance)	Best With
Xavier/Glorot	Var = 2/(fan_in + fan_out)	Sigmoid, Tanh
He/Kaiming	Var = 2/fan_in	ReLU, Leaky ReLU
LeCun	Var = 1/fan_in	SELU
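
The variance formulas in the table translate directly into a standard deviation for a normal initializer. A small sketch, assuming zero-mean Gaussian weights; the helper `init_std` and its method names are invented for this example.

```python
import math

def init_std(fan_in, fan_out, method):
    """Standard deviation for zero-mean Gaussian weight initialization."""
    if method == "xavier":
        var = 2.0 / (fan_in + fan_out)
    elif method == "he":
        var = 2.0 / fan_in
    elif method == "lecun":
        var = 1.0 / fan_in
    else:
        raise ValueError(f"unknown method: {method}")
    return math.sqrt(var)

# A 512 -> 256 layer, matching the nn.Linear(512, 256) used in this chapter
for method in ("xavier", "he", "lecun"):
    print(method, round(init_std(512, 256, method), 4))
```

He initialization yields the largest spread of the three because ReLU zeroes roughly half of the activations, and the extra factor of 2 compensates for that.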

Python
import torch.nn as nn

# Proper initialization matters!
layer = nn.Linear(512, 256)

# He initialization for ReLU networks
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)

# Xavier for sigmoid/tanh (an alternative: calling it on the same layer overwrites the He init above)
nn.init.xavier_uniform_(layer.weight)

Batch Normalization

BatchNorm normalizes each layer's activations to zero mean and unit variance, then applies learnable scale (γ) and shift (β). This stabilizes training, allows higher learning rates, and acts as mild regularization.

Python
model = nn.Sequential(
    nn.Linear(784, 512),
    nn.BatchNorm1d(512),   # Normalize activations
    nn.ReLU(),
    nn.Dropout(0.3),       # Randomly zero 30% of neurons
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 10)
)
# BatchNorm + Dropout = stable training + regularization

Layer Norm vs Batch Norm

BatchNorm: Normalizes across the batch dimension — each feature has mean=0, std=1 across samples. Standard for CNNs. LayerNorm: Normalizes across the feature dimension — each sample is normalized independently. Standard for Transformers (works with variable-length sequences and doesn't depend on batch size).
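
The difference comes down to which axis the statistics are computed over. The NumPy sketch below shows only the normalization step; it leaves out the learnable scale and shift (γ, β) and BatchNorm's running statistics, and the input shape is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))  # (batch, features)
eps = 1e-5

# BatchNorm: statistics per feature, computed across the batch (axis 0)
batch_norm = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LayerNorm: statistics per sample, computed across the features (axis 1)
layer_norm = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(batch_norm.mean(axis=0))  # close to 0 for every feature
print(layer_norm.mean(axis=1))  # close to 0 for every sample
```

Because LayerNorm never looks across the batch, it behaves identically for a batch of 1 and a batch of 1,000.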

Future: Layer Norm in Every Transformer

GPT, BERT, and most other transformers use LayerNorm rather than BatchNorm; LLaMA and several newer models use RMSNorm, a simplified variant. Most recent designs apply the normalization before the attention and feed-forward sublayers (pre-norm), whereas the original Transformer and BERT apply it after (post-norm). Understanding the normalization landscape (BatchNorm, LayerNorm, RMSNorm, GroupNorm) is essential for architecture design.

Exercises

Exercise 4.1: What happens with all-zeros initialization?

All neurons in a layer receive identical gradients → they update identically → they remain identical forever. The network has effectively one neuron per layer. This is called the symmetry problem. Random initialization breaks symmetry, ensuring neurons learn different features.

Exercise 4.2: Why must Dropout be disabled during evaluation?

During training, Dropout randomly zeros neurons — the network learns redundant representations. At evaluation, we want the full network's prediction (not a random subset). model.eval() disables Dropout (uses all neurons) and uses running statistics for BatchNorm instead of batch statistics.

Exercise 4.3: How does Batch Normalization help with the vanishing gradient problem?

BatchNorm keeps activations in the linear (non-saturating) region of activation functions by normalizing to mean=0, std=1. This prevents inputs to sigmoid/tanh from being extremely large/small (where gradients vanish). It also smooths the loss landscape, allowing larger learning rates.

Chapter Summary

He initialization for ReLU; Xavier for sigmoid/tanh — always match init to activation
BatchNorm stabilizes training and enables higher learning rates
Dropout prevents overfitting by forcing redundant representations
LayerNorm (not BatchNorm) is standard in Transformers

Part II

Optimizers & Training

Making neural networks learn efficiently

Chapter 5

Optimizers: SGD, Adam, AdamW & RMSProp

Learning Objectives

Understand gradient descent variants: batch, mini-batch, stochastic
Master momentum, RMSProp, Adam, and AdamW
Know which optimizer to use for which architecture

Python
import torch.optim as optim

model = MLP(784, 256, 10)

# SGD with momentum — classic, good generalization
opt_sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Adam — adaptive learning rate per parameter
opt_adam = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# AdamW — Adam with decoupled weight decay (standard for Transformers)
opt_adamw = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# RMSProp — good for RNNs
opt_rms = optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)

Optimizer	Key Idea	Best For	LR
SGD+Momentum	Accelerates in consistent gradient direction	CNNs (ResNet, EfficientNet)	0.01-0.1
Adam	Adaptive LR per parameter (momentum + RMS)	Quick prototyping, GANs	1e-3 to 3e-4
AdamW	Adam + proper weight decay	Transformers (GPT, BERT, ViT)	1e-4 to 5e-5
RMSProp	Adapts LR by dividing by the square root of a running avg of gradients²	RNNs, reinforcement learning	1e-3

Future: AdamW is the Standard for LLMs

Most openly documented large language models, including LLaMA, are trained with AdamW; full training details for GPT-4, Gemini, and Claude have not been published. AdamW's decoupled weight decay typically regularizes better than an L2 penalty added to Adam's gradient. For your future AI models, AdamW with a warmup + cosine schedule is the go-to recipe. Newer optimizers like LAMB, Lion, and Sophia are emerging, but AdamW remains dominant.

Exercises

Exercise 5.1: Why does Adam converge faster than SGD but sometimes generalize worse?

Adam adapts learning rates per-parameter, finding optima quickly. But it can converge to sharp minima (small basins) that generalize poorly. SGD with momentum tends to find flat minima (wide basins) that generalize better. For best results: start with Adam for fast experimentation, then fine-tune with SGD for final performance.

Exercise 5.2: What does "decoupled weight decay" mean in AdamW?

In Adam with L2 regularization, weight decay is entangled with the adaptive learning rate — heavily updated parameters get less decay. AdamW decouples them: weight decay is applied directly to weights regardless of gradient history. This gives more uniform regularization across parameters, which is especially important for large models.
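
The entanglement is easiest to see on a single scalar weight. The sketch below is a simplified, from-scratch update step written for illustration (it follows the usual Adam formulas with bias correction, but it is not PyTorch's implementation); the function name, hyperparameters, and gradient values are assumptions.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    g = grad + l2 * w                      # L2 penalty: folded into the gradient
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (adaptive scale)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    adaptive_update = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay = lr * decoupled_wd * w          # AdamW: decay bypasses the adaptive scaling
    return w - adaptive_update - decay, m, v

def one_step(grad, **kwargs):
    w, _, _ = adam_step(1.0, grad, 0.0, 0.0, 1, **kwargs)
    return w

shrink = {}
for grad in (10.0, 0.1):
    plain = one_step(grad)
    shrink[grad] = {
        "l2": plain - one_step(grad, l2=0.01),
        "adamw": plain - one_step(grad, decoupled_wd=0.01),
    }
    print(grad, shrink[grad])
```

In this first step the L2 term is almost entirely normalized away by the division by √v̂, while the decoupled decay shrinks both weights by the same lr × wd × w regardless of gradient size.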

Exercise 5.3: What are β₁ and β₂ in Adam, and why do the defaults work well?

β₁=0.9: exponential decay for first moment (mean of gradients) — momentum. β₂=0.999: decay for second moment (mean of squared gradients) — adaptive scaling. These defaults work because β₁ averages over ~10 steps (fast), β₂ over ~1000 steps (slow, stable). For transformers, β₂=0.95 sometimes works better.

Chapter Summary

SGD+momentum: best generalization for CNNs; requires careful LR tuning
Adam: fast convergence, adaptive LR — great for prototyping
AdamW: decoupled weight decay — standard for all transformers and LLMs
Choose optimizer based on architecture and whether speed or generalization matters more

Chapter 6

Learning Rate Schedules

Learning Objectives

Master warmup, step decay, cosine annealing, and one-cycle policies
Implement custom LR schedules in PyTorch
Find optimal learning rates with the LR range test

Python
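import numpy as np
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# `model` is the MLP from Chapter 1. In practice, attach only ONE scheduler per optimizer;
# several are created here side by side purely for comparison.
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Step Decay: multiply LR by 0.1 every 10 epochs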
step_sched = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Cosine Annealing: smoothly decays to near-zero
cosine_sched = lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# Warmup + Cosine (used in most modern transformers)
def warmup_cosine(epoch, warmup_epochs=5, total_epochs=100):
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs  # Linear warmup; +1 so the first epoch does not run at LR = 0
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1 + np.cos(np.pi * progress))  # Cosine decay

warmup_sched = lr_scheduler.LambdaLR(optimizer, warmup_cosine)

# One Cycle — aggressive: ramp up then decay
onecycle = lr_scheduler.OneCycleLR(optimizer, max_lr=1e-2, total_steps=1000)

Future: Warmup + Cosine is the LLM Recipe

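GPT-3, LLaMA, and most LLMs use warmup over a short initial stretch of training, typically no more than the first few percent of steps (this prevents early divergence when random weights produce wild gradients), followed by cosine decay to a small fraction of the peak learning rate (10% in both the GPT-3 and LLaMA papers) rather than all the way to zero. This schedule is so effective it's become a standard recipe. When you train your own models, start here.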

Exercises

Exercise 6.1: Why is warmup necessary for large learning rates?

At the start of training, weights are random and gradients can be very large/noisy. A high learning rate amplifies this noise, causing divergence. Warmup starts with a tiny LR and linearly increases it, allowing the model to stabilize its gradient magnitudes before using the full learning rate.

Exercise 6.2: What advantage does cosine annealing have over step decay?

Cosine annealing provides smooth, gradual LR reduction rather than abrupt drops. Smooth transitions help the optimizer settle into better local minima without the "shock" of sudden LR changes. It also naturally approaches zero at the end, giving fine-grained final optimization.

Exercise 6.3: How does the LR Range Test (Smith, 2015) work?

Train for one epoch while linearly increasing LR from tiny (1e-7) to huge (10). Plot loss vs LR. The optimal LR is where loss decreases fastest (steepest descent region). The max LR is where loss starts increasing. Use this range for OneCycleLR or as your base learning rate.

Chapter Summary

Warmup prevents divergence; cosine decay enables fine-grained optimization
Warmup + cosine is the standard recipe for transformer training
OneCycleLR provides aggressive but effective super-convergence
LR Range Test empirically finds the optimal learning rate

Chapter 7

Gradient Clipping, Mixed Precision & Early Stopping

Learning Objectives

Prevent exploding gradients with gradient clipping
Train up to ~2x faster with mixed precision (FP16/BF16)
Stop training at the right time with early stopping

Python
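import torch
from torch.cuda.amp import autocast, GradScaler  # newer PyTorch releases also expose these under torch.amp

# Gradient clipping: rescales all gradients when their global norm exceeds max_norm
# (fragment: assumes model, loss, and optimizer from an existing training loop)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()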

# Mixed Precision Training (FP16/BF16)
scaler = GradScaler()
for X, y in train_loader:
    optimizer.zero_grad()
    with autocast():           # Forward pass in FP16
        output = model(X)
        loss = criterion(output, y)
    scaler.scale(loss).backward()  # Scaled backward pass
    scaler.step(optimizer)
    scaler.update()
# Often up to ~2x faster with noticeably lower GPU memory use on tensor-core GPUs

# Early Stopping
class EarlyStopping:
    def __init__(self, patience=7, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None

    def __call__(self, val_loss):
        if self.best_loss is None or val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

early_stop = EarlyStopping(patience=10)
# In training loop: if early_stop(val_loss): break

Future: BF16 is Replacing FP16

BF16 (Brain Float 16) has the same exponent range as FP32 but fewer mantissa bits, so gradients rarely underflow and loss scaling is usually unnecessary. The tensor cores on A100/H100 GPUs support BF16 natively. Most recent LLM training runs use BF16 mixed precision. When you scale to production AI, BF16 is the format to start with.

Exercises

Exercise 7.1: Why do gradients explode in deep networks and RNNs?

During backprop, gradients are multiplied through layers. If the weight matrices have singular values > 1, this product can grow exponentially. RNNs are especially vulnerable because the same weight matrix is applied at every timestep, which is equivalent to multiplying it by itself T times. Gradient clipping caps the gradient norm, so a single oversized gradient cannot destabilize training.
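
Clipping by global norm amounts to one rescaling. The sketch below works on a flat list of gradient values for illustration; the helper name is hypothetical, and the small constant in the denominator mirrors the common practice of guarding against division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0, 12.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before)  # 13.0
print(norm_after)   # just under 1.0

# Gradients already inside the limit are left untouched
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)        # [0.3, 0.4]
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length changes.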

Exercise 7.2: How does mixed precision training work without losing accuracy?

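Forward pass and gradient computation use FP16 (fast, small). A master copy of weights stays in FP32 (full precision). Loss scaling multiplies the loss by a large constant before backprop so that small FP16 gradients do not underflow to zero; the gradients are divided by the same constant before the update. The weight update happens in FP32. Result: comparable convergence, up to ~2x faster on tensor-core GPUs, and substantially less activation memory.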
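
The underflow that loss scaling guards against can be reproduced with NumPy's `float16` type. The gradient value and the scale factor below are arbitrary choices for illustration; real training frameworks adjust the scale dynamically.

```python
import numpy as np

tiny_grad = 1e-8                      # a gradient too small for FP16
unscaled = np.float16(tiny_grad)      # rounds to exactly 0 in FP16
print(unscaled)                       # 0.0

loss_scale = 1024.0
scaled = np.float16(tiny_grad * loss_scale)   # now representable in FP16
recovered = np.float32(scaled) / np.float32(loss_scale)  # unscale in FP32
print(scaled, recovered)              # recovered is close to 1e-8
```

Scaling the loss scales every gradient by the same factor (by linearity of the chain rule), which is why a single multiplication before `backward()` is enough.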

Exercise 7.3: How do you choose the patience for early stopping?

Patience = how many epochs of no improvement before stopping. Too low (2-3): premature stopping, may miss later improvements. Too high (50+): wastes compute on a plateaued model. Rule of thumb: patience = 5-15 for most tasks. Monitor validation loss, not training loss. Save the best model checkpoint.

Chapter Summary

Gradient clipping (max_norm=1.0) prevents exploding gradients in deep/recurrent networks
Mixed precision (FP16/BF16) gives up to ~2x speedup with negligible accuracy loss
Early stopping limits overfitting by halting when validation loss plateaus
These three techniques are standard in most production training pipelines

Part III

Core Architectures

The deep learning architectures that power modern AI

Chapter 8

Convolutional Neural Networks (CNNs)

Learning Objectives

Understand convolution, pooling, and feature hierarchies
Build CNNs for image classification
Know key architectures: LeNet, VGG, ResNet

Python
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Conv block 1: 1→32 channels
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),      # 28×28 → 14×14

            # Conv block 2: 32→64
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),      # 14×14 → 7×7

            # Conv block 3: 64→128
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1)  # Global average pooling → 1×1
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        return self.classifier(x)

model = CNN()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

How Convolution Learns Features

Layer 1 tends to learn edges and textures. Layer 2 combines edges into simple shapes and motifs. Layer 3 responds to object parts (eyes, wheels, windows). Deeper layers recognize whole objects. This hierarchical feature learning is the key insight of CNNs, and it loosely mirrors processing in the visual cortex.

Project: CIFAR-10 Image Classifier

Python
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.4914,0.4822,0.4465), (0.2470,0.2435,0.2616))
])
train_ds = datasets.CIFAR10('./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)

# Change first conv to accept 3 channels (RGB)
cnn = CNN(num_classes=10)
cnn.features[0] = nn.Conv2d(3, 32, 3, padding=1)
optimizer = torch.optim.AdamW(cnn.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    cnn.train()
    total_loss = 0
    for X, y in train_loader:
        loss = criterion(cnn(X), y)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}: loss={total_loss/len(train_loader):.4f}")

Exercises

Exercise 8.1: How many parameters in Conv2d(3, 64, kernel_size=3)?

Each filter: 3 (in_channels) × 3 × 3 (kernel) = 27 weights + 1 bias = 28. With 64 filters: 64 × 28 = 1,792 parameters. Compare to a fully connected layer on a 224×224×3 image: 150,528 × 64 = 9.6M! Convolutions share weights spatially, making them parameter-efficient.
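
These counts, along with the spatial sizes noted in the CNN code comments earlier in the chapter, follow from two short formulas. A sketch assuming square kernels and inputs; the helper names are invented for this example.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Parameters of a 2D convolution with a square kernel."""
    per_filter = in_channels * kernel_size * kernel_size + (1 if bias else 0)
    return out_channels * per_filter

def conv2d_out_size(size, kernel_size, stride=1, padding=0):
    """Output height/width: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv2d_params(3, 64, 3))               # 1792
print(224 * 224 * 3 * 64)                    # 9633792 weights for the dense alternative
print(conv2d_out_size(28, 3, padding=1))     # 28: a 3x3 conv with padding=1 keeps the size
print(conv2d_out_size(28, 2, stride=2))      # 14: a 2x2 pooling window with stride 2 halves it
```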

Exercise 8.2: Why use pooling layers instead of larger strides?

Pooling provides translation invariance — a feature is detected regardless of exact position. Max pooling selects the strongest activation. However, modern architectures (like ResNet) increasingly use strided convolutions instead of pooling, as they learn the downsampling. Both work; strided convolutions are more flexible.

Exercise 8.3: What is data augmentation and why is it critical for CNNs?

Augmentation creates new training images by applying random transformations (flip, rotate, crop, color jitter). This reduces overfitting by exposing the model to more variation. On small image datasets the accuracy gain can be several percentage points, though the exact benefit depends on the dataset and model. It's essentially free training data.

Chapter Summary

Convolutions share weights spatially — parameter-efficient feature extraction
CNNs learn hierarchical features: edges → shapes → objects
BatchNorm + data augmentation are essential for training deep CNNs
Global average pooling replaces fully connected layers for modern architectures

Chapter 9

RNNs, LSTMs & GRUs

Learning Objectives

Process sequential data with recurrent neural networks
Solve long-range dependencies with LSTMs and GRUs
Understand why Transformers replaced RNNs for most tasks

Python
import torch
import torch.nn as nn

# LSTM for sequence modeling
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                           batch_first=True, dropout=0.3, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, 1)  # *2 for bidirectional

    def forward(self, x):
        embed = self.embedding(x)
        out, (hidden, cell) = self.lstm(embed)
        # Use last hidden state from both directions
        hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return torch.sigmoid(self.fc(hidden))

# GRU — simpler, faster, often comparable to LSTM
gru = nn.GRU(input_size=128, hidden_size=256, num_layers=2, batch_first=True)

Architecture	Gates	Memory	Best For
Vanilla RNN	None	Short-term only	Simple sequences (mostly obsolete)
LSTM	Forget, Input, Output	Long-term	Time series, speech, text
GRU	Reset, Update	Long-term (simpler)	Same as LSTM, fewer params
Transformer	Attention	Full context	Most sequence tasks (largely replaced RNNs)

Future: RNNs in the Age of Transformers

Transformers have largely replaced RNNs for NLP (GPT, BERT). However, RNNs/LSTMs remain valuable for: (1) real-time streaming data (IoT, finance), (2) on-device inference where memory is limited, (3) state-space models (Mamba, S4) which are RNN-like but parallelizable. Understanding RNNs helps you grasp why attention was invented — to solve their limitations.

Exercises

Exercise 9.1: How does the LSTM forget gate work?

The forget gate fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) outputs values between 0 and 1 for each element of the cell state. 0 = completely forget, 1 = completely keep. This allows the LSTM to selectively remember or forget information over long sequences, which greatly reduces (but does not fully eliminate) the vanishing gradient problem.

Exercise 9.2: Why can't vanilla RNNs handle long sequences?

Vanilla RNNs apply the same weight matrix at every timestep. Gradients are multiplied through T steps during backprop. If the recurrent matrix's largest eigenvalue magnitude is < 1: gradients vanish (early inputs are forgotten). If > 1: gradients can explode. LSTMs mitigate this with additive cell state updates: gradients flow almost unchanged through the cell whenever the forget gate stays close to 1.
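
A one-dimensional toy case makes the exponential behavior visible. This sketch assumes a linear scalar RNN, where the gradient through T steps is simply the recurrent weight raised to the power T; real RNNs add activation derivatives and matrix structure on top of this.

```python
def gradient_scale(recurrent_weight, timesteps):
    """Magnitude of the gradient factor after backpropagating through T steps."""
    return abs(recurrent_weight) ** timesteps

for w in (0.9, 1.0, 1.1):
    print(w, gradient_scale(w, 100))
# 0.9 shrinks to roughly 2.7e-05 (vanishing); 1.1 grows to roughly 1.4e+04 (exploding)
```

Only a weight of exactly 1 keeps the signal intact, which is the intuition behind the LSTM's additive cell state.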

Exercise 9.3: When would you still use an LSTM over a Transformer?

(1) Streaming/real-time data where you process one token at a time. (2) Very long sequences where Transformer attention is O(n²). (3) Edge devices with limited memory. (4) Tasks with strict latency requirements. However, efficient transformers (linear attention, Mamba) are increasingly closing this gap.

Chapter Summary

RNNs process sequences by maintaining hidden state across timesteps
LSTMs add forget/input/output gates to handle long-range dependencies
GRUs are simpler (2 gates vs 3) with comparable performance
Transformers have largely replaced RNNs but understanding them illuminates attention

Chapter 10

Attention Mechanisms

Learning Objectives

Understand self-attention: queries, keys, values
Grasp multi-head attention and positional encoding
See why attention is the foundation of modern AI

Self-Attention

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

Python
import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape

        # Project to queries, keys, values
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)

        # Scaled dot-product attention
        scores = (Q @ K.transpose(-2,-1)) / math.sqrt(self.d_k)
        attn = torch.softmax(scores, dim=-1)
        out = attn @ V

        # Combine heads
        out = out.transpose(1,2).contiguous().view(B, T, C)
        return self.W_o(out)

# Test
attn = SelfAttention(d_model=512, n_heads=8)
x = torch.randn(2, 100, 512)  # batch=2, seq_len=100, dim=512
out = attn(x)
print(out.shape)  # torch.Size([2, 100, 512])

Why Attention is the Foundation of AI's Future

Attention is the mechanism behind many major AI breakthroughs: GPT-4 (text), DALL-E 3 (images), Whisper (speech), AlphaFold (protein folding), Gemini (multimodal). It allows models to learn which parts of the input are relevant to each part of the output without hard-coding those relationships. Multi-head attention lets different heads focus on different relationship types (syntax, semantics, position). Understanding attention deeply is one of the most valuable skills for future AI work.

Exercises

Exercise 10.1: Why scale by √dₖ in the attention formula?

Without scaling, dot products grow with dimension d_k. Large values push softmax into saturation (one element near 1, rest near 0), making gradients nearly zero. Dividing by √d_k keeps the variance of dot products at ~1 regardless of dimension, ensuring softmax produces meaningful, distributed attention weights.
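
The variance argument can be checked empirically. This is a minimal NumPy sketch, assuming query and key entries are independent standard-normal values (an idealisation of real activations); the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

variances = {}
for d_k in (16, 64, 512):
    q = rng.standard_normal((2000, d_k))
    k = rng.standard_normal((2000, d_k))
    raw = (q * k).sum(axis=1)          # 2000 independent dot products
    scaled = raw / np.sqrt(d_k)
    variances[d_k] = (raw.var(), scaled.var())
    print(f"d_k={d_k:4d}  var(raw)={raw.var():8.1f}  var(scaled)={scaled.var():.2f}")

# One query against 8 keys at d_k = 512: compare the largest attention weight
query = rng.standard_normal(512)
keys = rng.standard_normal((8, 512))
scores = keys @ query
peak_raw = softmax(scores).max()
peak_scaled = softmax(scores / np.sqrt(512)).max()
print(f"largest weight without scaling: {peak_raw:.3f}, with scaling: {peak_scaled:.3f}")
```

The raw variance tracks d_k while the scaled variance stays near 1, and the unscaled softmax puts at least as much weight on its top key as the scaled one does.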

Exercise 10.2: What is the purpose of multiple attention heads?

Different heads can learn different types of relationships: one head might attend to adjacent tokens (syntax), another to long-range dependencies (semantics), another to positional patterns. With 8 heads and d_model=512, each head operates in a 64-dim subspace, then results are concatenated and projected back to 512.

Exercise 10.3: What is causal (masked) attention and why does GPT use it?

Causal attention prevents each position from attending to future positions, so the model can only "see" the current and past tokens. This is done by setting the entries above the diagonal of the score matrix to -∞ before softmax, which gives those positions exactly zero weight. This is essential for autoregressive generation: when predicting token t, the model should only use tokens 1...t-1.
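
A minimal sketch of the masking step on its own, using a hypothetical 4×4 score matrix in place of real QKᵀ scores:

```python
import torch

torch.manual_seed(0)
T = 4
scores = torch.randn(T, T)   # stand-in for scaled QK^T scores

# True strictly above the diagonal marks the "future" positions to hide
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
masked = scores.masked_fill(future, float('-inf'))
weights = torch.softmax(masked, dim=-1)
print(weights)
```

Each row still sums to 1, the upper triangle is exactly zero, and the first row puts all its weight on position 0 because that is the only position it may see. In a multi-head module the same boolean mask would typically be broadcast over the batch and head dimensions.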

Chapter Summary

Self-attention computes relevance between all pairs of positions — O(n²) but powerful
Multi-head attention learns diverse relationship types in parallel
Causal masking enables autoregressive generation (GPT-style)
Attention is the foundation of most modern AI architectures

Chapter 11

Residual Networks (ResNet) & Skip Connections

Learning Objectives

Understand the degradation problem in very deep networks
Master skip connections and residual learning
See how skip connections enable 100+ layer networks

Output = F(x) + x (residual connection — learn the "difference")

Python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels)
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.block(x) + x)  # Skip connection!

# Stack residual blocks to create ResNet
class SimpleResNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)
        self.layer1 = nn.Sequential(*[ResidualBlock(64) for _ in range(3)])
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, 10)

    def forward(self, x):
        x = self.pool(nn.functional.relu(self.bn1(self.conv1(x))))
        x = self.layer1(x)
        x = self.avgpool(x).flatten(1)
        return self.fc(x)

Future: Skip Connections are Everywhere

Skip connections in the style of ResNet are used in: Transformers (residual around attention and FFN), U-Net (medical imaging, where encoder features are concatenated rather than added), DenseNet, diffusion models (Stable Diffusion), and gated variants such as highway networks. It's one of the most important architectural innovations in deep learning history. Any future architecture you build will likely use residual connections.

Exercises

Exercise 11.1: Why does simply making networks deeper NOT improve performance?

The degradation problem: beyond a certain depth, adding layers actually increases both training AND test error — not just test error (which would be overfitting). This happens because optimizing very deep networks is fundamentally difficult: gradients degrade, and the loss landscape becomes chaotic. Skip connections provide gradient highways that bypass this problem.

Exercise 11.2: How do skip connections help gradient flow?

During backprop, write y = F(x) + x. Then ∂L/∂x = ∂L/∂y · (∂F/∂x + I) = ∂L/∂y · ∂F/∂x + ∂L/∂y, where I is the identity. The "+ ∂L/∂y" term is a direct path (the identity shortcut) that doesn't decay through layers. Even if ∂F/∂x vanishes, the gradient from the identity path survives. This enables training 152+ layer networks.
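
Autograd can confirm the identity term. In this sketch the residual branch is a hypothetical linear layer whose weights are shrunk so that ∂F/∂x is close to zero, and the loss is simply the sum of the outputs (so the upstream gradient is all ones). Both choices are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, requires_grad=True)

# Hypothetical residual branch F with tiny weights, so dF/dx is near zero
branch = nn.Linear(8, 8)
with torch.no_grad():
    branch.weight.mul_(1e-4)

# Plain layer: y = F(x)
branch(x).sum().backward()
plain_grad = x.grad.clone()
x.grad.zero_()

# Residual layer: y = F(x) + x
(branch(x) + x).sum().backward()
residual_grad = x.grad.clone()

print(plain_grad.abs().max())     # close to zero: the gradient has vanished
print(residual_grad.abs().max())  # close to one: the identity path survives
```

The residual gradient equals the plain gradient plus one in every component, which is the extra term contributed by the shortcut.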

Exercise 11.3: What is the difference between pre-activation and post-activation ResNet?

Post-activation (original): Conv → BN → ReLU → Conv → BN → Add → ReLU. Pre-activation (ResNet v2): BN → ReLU → Conv → BN → ReLU → Conv → Add. Pre-activation gives cleaner gradient paths and slightly better performance because the identity mapping is completely unobstructed.
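
For comparison with the post-activation ResidualBlock shown earlier in this chapter, here is a minimal sketch of the pre-activation ordering. The class name `PreActResidualBlock` is made up for this example, and it assumes the input and output channel counts match so no projection is needed on the shortcut.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation ordering: BN -> ReLU -> Conv -> BN -> ReLU -> Conv -> Add."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        # No activation after the addition: the identity path stays untouched
        return self.block(x) + x

block = PreActResidualBlock(16)
out = block(torch.randn(2, 16, 8, 8))
print(out.shape)  # torch.Size([2, 16, 8, 8])
```

The only structural difference from the post-activation block is where BN and ReLU sit relative to the convolutions and the absence of a ReLU after the addition.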

Chapter Summary

Skip connections solve the degradation problem — enabling 100+ layer training
The network learns residual F(x) = H(x) - x instead of full mapping H(x)
Gradient highways through identity shortcuts prevent vanishing gradients
Skip connections appear in virtually every modern architecture

Chapter 12

U-Net & Encoder-Decoder Designs

Learning Objectives

Understand encoder-decoder architecture for dense prediction
Master U-Net's skip connections for precise localization
Apply to segmentation, image-to-image, and generation tasks

Python
import torch
import torch.nn as nn

class UNet(nn.Module):
    def __init__(self, in_ch=1, out_ch=1):
        super().__init__()
        # Encoder (downsampling)
        self.enc1 = self._block(in_ch, 64)
        self.enc2 = self._block(64, 128)
        self.pool = nn.MaxPool2d(2)

        # Bottleneck
        self.bottleneck = self._block(128, 256)

        # Decoder (upsampling)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = self._block(256, 128)  # 256 because of skip concat
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = self._block(128, 64)

        self.final = nn.Conv2d(64, out_ch, 1)

    def _block(self, in_c, out_c):
        return nn.Sequential(
            nn.Conv2d(in_c, out_c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_c, out_c, 3, padding=1), nn.ReLU())

    def forward(self, x):
        # Encode
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        # Bottleneck
        b = self.bottleneck(self.pool(e2))
        # Decode with skip connections
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.final(d1)

unet = UNet(3, 1)  # RGB input → binary mask output

Future: Encoder-Decoder Powers Generative AI

The encoder-decoder pattern is fundamental to: Stable Diffusion (U-Net denoises latent images), Seq2Seq (machine translation), Variational Autoencoders (generative models), SAM (Meta's Segment Anything), and medical imaging (tumor segmentation). The U-Net architecture specifically has been the backbone of many diffusion models generating AI art, including Stable Diffusion, although some newer models use transformer backbones instead.

Exercises

Exercise 12.1: Why does U-Net concatenate encoder features with decoder features?

The encoder captures "what" (high-level semantic features) but loses "where" (precise spatial information through pooling). The decoder upsamples but lacks fine detail. Skip connections concatenate high-resolution encoder features with upsampled decoder features, combining precise localization with semantic understanding — essential for pixel-level tasks.

Exercise 12.2: How is U-Net used in Stable Diffusion?

Stable Diffusion adds noise to image latents (images compressed by a VAE encoder) and trains a U-Net to predict that noise. The U-Net receives a noisy latent, the timestep, and a text embedding (via cross-attention), and outputs the predicted noise. After many denoising steps, a clean latent emerges and the VAE decoder turns it into an image. The U-Net's skip connections preserve spatial structure during this process.
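
The noise-prediction objective can be sketched in a few lines. This is a heavily simplified illustration, not Stable Diffusion's actual training code: a two-layer conv net stands in for the U-Net, timestep and text conditioning are omitted, and the latents and per-sample signal levels (`alpha_bar`) are random placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the U-Net: maps a noisy latent to a same-shaped noise estimate
denoiser = nn.Sequential(
    nn.Conv2d(4, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 4, 3, padding=1))

latents = torch.randn(8, 4, 16, 16)    # hypothetical clean latents
alpha_bar = torch.rand(8, 1, 1, 1)     # hypothetical signal level in (0, 1) per sample
noise = torch.randn_like(latents)

# Forward (noising) process: blend the clean latent with Gaussian noise
noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise

# Training objective: predict the noise that was added
loss = nn.functional.mse_loss(denoiser(noisy), noise)
loss.backward()
print(loss.item())
```

In a real pipeline the signal level would come from a noise schedule indexed by a sampled timestep, and the network would also receive that timestep and the text embedding.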

Exercise 12.3: Compare encoder-decoder vs. purely convolutional approaches for segmentation

Purely convolutional (dilated convolutions) maintains resolution but is computationally expensive. Encoder-decoder reduces resolution (efficient) then upsamples, using skip connections to recover detail. Encoder-decoder is standard because it balances efficiency and precision. Modern approaches like SegFormer use transformer encoders with lightweight decoders.

Chapter Summary

Encoder-decoder architecture: compress → bottleneck → expand for dense prediction
U-Net adds skip connections between encoder and decoder for precise localization
Critical for segmentation, diffusion models (Stable Diffusion), and medical imaging
ConvTranspose2d (transposed convolution) handles learned upsampling

Part IV

Deep Learning Frameworks

Tools that power production AI

Chapter 13

PyTorch — Preferred for Research

Learning Objectives

Master PyTorch's tensor operations, autograd, and nn.Module
Build complete training loops with DataLoaders
Use GPU acceleration and model saving/loading

Python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Model
class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(20, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 2)
        )
    def forward(self, x): return self.net(x)

model = Classifier().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Data
X = torch.randn(1000, 20)
y = (X[:, 0] > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# Training Loop
for epoch in range(20):
    model.train()
    total_loss = 0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        loss = criterion(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    if (epoch+1) % 5 == 0:
        print(f"Epoch {epoch+1}: loss={total_loss/len(loader):.4f}")

# Save & load
torch.save(model.state_dict(), 'model.pth')
model.load_state_dict(torch.load('model.pth', map_location=device))  # map_location avoids device mismatch

Project: Complete PyTorch Training Pipeline

Python
import torch, torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Config
EPOCHS, BATCH_SIZE, LR = 10, 128, 1e-3
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Data with augmentation
train_tf = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.ToTensor(), transforms.Normalize((0.1307,),(0.3081,))])
test_tf = transforms.Compose([
    transforms.ToTensor(), transforms.Normalize((0.1307,),(0.3081,))])

train_loader = DataLoader(datasets.MNIST('data', train=True, download=True,
    transform=train_tf), batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(datasets.MNIST('data', train=False,
    transform=test_tf), batch_size=1000)

# Model with best practices
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784,512), nn.BatchNorm1d(512), nn.GELU(), nn.Dropout(0.2),
    nn.Linear(512,256), nn.BatchNorm1d(256), nn.GELU(), nn.Dropout(0.2),
    nn.Linear(256,10)
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=LR*10, total_steps=EPOCHS*len(train_loader))
criterion = nn.CrossEntropyLoss()

best_acc = 0
for epoch in range(EPOCHS):
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        loss = criterion(model(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR is stepped once per batch

    model.eval()
    correct = 0
    with torch.no_grad():
        for X, y in test_loader:
            X, y = X.to(device), y.to(device)
            correct += (model(X).argmax(1) == y).sum().item()
    acc = correct / len(test_loader.dataset)
    is_best = acc > best_acc  # strict improvement only, so ties are not reported as best
    if is_best:
        best_acc = acc
        torch.save(model.state_dict(), 'best_model.pth')
    print(f"Epoch {epoch+1}: acc={acc:.2%} {'⭐ best!' if is_best else ''}")

Exercises

Exercise 13.1: Why call model.eval() and torch.no_grad() during evaluation?

model.eval(): Switches BatchNorm to use running stats (not batch stats) and disables Dropout. torch.no_grad(): Disables gradient tracking, which saves memory and avoids the overhead of building the computational graph during evaluation. The two are independent: eval() does not turn off gradient tracking, and no_grad() does not change layer behavior.
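
Both effects are easy to observe directly. The sketch below uses a standalone Dropout layer and a small Linear layer as stand-ins for a full model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

layer.train()
train_out = layer(x)   # about half the entries zeroed, the rest scaled by 1/(1-p) = 2
layer.eval()
eval_out = layer(x)    # Dropout is the identity in eval mode

model = nn.Linear(4, 2)
with torch.no_grad():
    y = model(torch.randn(3, 4))

zero_fraction = (train_out == 0).float().mean().item()
print(zero_fraction)
print(torch.equal(eval_out, x))   # True
print(y.requires_grad)            # False
```

Note that the Dropout output in eval mode is unchanged even without no_grad(), and the Linear output under no_grad() carries no graph even though the layer is still in training mode.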

Exercise 13.2: What is the difference between model.parameters() and model.state_dict()?

parameters(): Returns an iterator over learnable parameters (weights, biases). Used for optimizers. state_dict(): Returns a dictionary of ALL model state including parameters AND non-learnable buffers (BatchNorm running mean/var). Always save state_dict() for checkpointing.
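
The difference shows up as soon as a model contains BatchNorm. A small sketch with a hypothetical two-layer model:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8))

param_names = [name for name, _ in model.named_parameters()]
state_names = list(model.state_dict().keys())
buffer_only = [n for n in state_names if n not in param_names]

print(param_names)  # ['0.weight', '0.bias', '1.weight', '1.bias']
print(buffer_only)  # ['1.running_mean', '1.running_var', '1.num_batches_tracked']
```

The optimizer only ever sees the four learnable tensors, while a checkpoint built from state_dict() also restores the BatchNorm running statistics.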

Exercise 13.3: How do you move a model and data to GPU?

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)      # Move model
X = X.to(device)              # Move data
# Both must be on same device!

Chapter Summary

PyTorch: define-by-run, Pythonic, dominant in research
Standard loop: forward → loss → backward → step (with zero_grad)
Always: model.eval() + torch.no_grad() for inference
Save state_dict() for checkpoints; use to(device) for GPU

Chapter 14

TensorFlow/Keras, JAX & Custom Training

Learning Objectives

Build models with TensorFlow/Keras high-level API
Understand JAX for high-performance, functional ML
Master automatic differentiation concepts
Write custom training loops for full control

TensorFlow/Keras

Python
import tensorflow as tf

# Keras Sequential — high-level API
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),  # explicit input spec instead of input_shape on the first layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Load MNIST and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Train in one line!
model.fit(x_train, y_train, epochs=5, batch_size=128,
          validation_split=0.1,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)])

JAX — Functional, XLA-Compiled

Python
import jax
import jax.numpy as jnp
from jax import grad, jit, vmap

# Automatic differentiation — the core of all DL
def loss_fn(w, x, y):
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

# grad() returns a function that computes gradients
grad_fn = jit(grad(loss_fn))  # JIT-compiled gradient function

# vmap — automatic vectorization (batch processing)
def predict_single(w, x):
    return jnp.dot(x, w)

predict_batch = vmap(predict_single, in_axes=(None, 0))
# Automatically handles batching without writing loops!
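
The snippet above defines the transformed functions without calling them. A usage sketch follows, assuming a hypothetical noiseless linear-regression dataset and a fixed learning rate of 0.1; it repeats the definitions so that it runs on its own.

```python
import jax
import jax.numpy as jnp
from jax import grad, jit, vmap

def loss_fn(w, x, y):
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

grad_fn = jit(grad(loss_fn))  # differentiates w.r.t. the first argument, w

def predict_single(w, x):
    return jnp.dot(x, w)

predict_batch = vmap(predict_single, in_axes=(None, 0))

# Hypothetical toy data: y = x @ true_w, no noise
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (64, 3))
true_w = jnp.array([1.0, -2.0, 0.5])
y = x @ true_w

w = jnp.zeros(3)
initial_loss = loss_fn(w, x, y)
for _ in range(200):
    w = w - 0.1 * grad_fn(w, x, y)   # JAX arrays are immutable, so rebind w
final_loss = loss_fn(w, x, y)

batch_preds = predict_batch(w, x)    # one prediction per row of x
print(initial_loss, final_loss)
print(batch_preds.shape)  # (64,)
```

Because there is no optimizer object holding state, the update is an explicit reassignment of `w`, which is the functional style the table below refers to.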

Framework	Style	Best For	Used By
PyTorch	Imperative (define-by-run)	Research, prototyping	Meta, most academia
TensorFlow/Keras	Declarative (high-level)	Production, deployment	Google, mobile (TF Lite)
JAX	Functional (transformations)	High-performance research	Google DeepMind (Gemini)

Custom Training Loop (PyTorch)

Python
import torch
import torch.nn as nn

# Full control training loop with all best practices
def train_one_epoch(model, loader, optimizer, scheduler, scaler, device):
    model.train()
    total_loss = 0
    for batch_idx, (X, y) in enumerate(loader):
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()

        # Mixed precision forward (assumes a CUDA device)
        with torch.autocast(device_type='cuda'):
            output = model(X)
            loss = nn.functional.cross_entropy(output, y)

        # Scaled backward + gradient clipping
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()

        total_loss += loss.item()
    return total_loss / len(loader)

Future: The Framework Landscape

PyTorch dominates research and is now standard in industry. JAX powers Google's Gemini and is growing for high-performance training (TPU-optimized). TensorFlow remains strong for production deployment (TF Serving, TF Lite for mobile). The trend: PyTorch for development, ONNX for cross-framework deployment. Understanding all three ensures you can work anywhere in the AI ecosystem.

Exercises

Exercise 14.1: What is automatic differentiation and how does it differ from numerical differentiation?

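Numerical: Approximates gradients with (f(x+h)-f(x))/h. It is slow (at least one extra function evaluation per parameter, two for central differences) and sensitive to the choice of h because of truncation and round-off error. Automatic: Records the operations of the forward pass in a graph, then applies the chain rule to each operation during the backward pass. Reverse mode computes the gradient with respect to all parameters in one backward pass, at a cost that is a small constant multiple of the forward pass, and the result is exact up to floating-point precision. All major DL frameworks support reverse-mode autodiff (equivalent to backprop).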
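
The two approaches can be compared on a small function. This sketch uses PyTorch autograd for the automatic side and central differences for the numerical side; the function `f` and the step size `h` are arbitrary choices for illustration.

```python
import torch

def f(w):
    return (w ** 3).sum() + w[0] * w[1]

w = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64, requires_grad=True)

# Reverse-mode autodiff: one backward pass yields every partial derivative
f(w).backward()
auto_grad = w.grad.clone()

# Central finite differences: two evaluations of f per parameter
h = 1e-5
num_grad = torch.zeros_like(auto_grad)
with torch.no_grad():
    for i in range(w.numel()):
        step = torch.zeros_like(w)
        step[i] = h
        num_grad[i] = (f(w + step) - f(w - step)) / (2 * h)

print(auto_grad)  # analytic gradient is 3w² + [w1, w0, 0] = [5, 13, 27]
print(num_grad)   # agrees with auto_grad up to a small approximation error
```

With three parameters the numerical loop already needs six function evaluations; for a model with millions of parameters the same loop would need millions, while autodiff still needs a single backward pass.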

Exercise 14.2: When would you write a custom training loop instead of using model.fit()?

(1) GAN training (alternating generator/discriminator). (2) Mixed precision with gradient scaling. (3) Gradient accumulation for large batch sizes. (4) Multi-task learning with multiple losses. (5) Custom learning rate scheduling. (6) Research with novel training procedures. model.fit() is convenient but inflexible.
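
As one example from this list, gradient accumulation takes only a few extra lines in a custom loop. The sketch below assumes equally sized micro-batches and a mean-reduced loss, in which case dividing each micro-batch loss by the number of accumulation steps reproduces the large-batch gradient; the model and data are random placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
X, y = torch.randn(32, 10), torch.randn(32, 1)
criterion = nn.MSELoss()

# Reference: gradient from one batch of 32
model.zero_grad()
criterion(model(X), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 8, each loss divided by the number of steps
accum_steps = 4
model.zero_grad()
for xb, yb in zip(X.chunk(accum_steps), y.chunk(accum_steps)):
    loss = criterion(model(xb), yb) / accum_steps
    loss.backward()          # gradients add up in .grad across calls
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-5))  # True
```

In a real loop, optimizer.step() and zero_grad() would run once after every `accum_steps` micro-batches rather than after each one.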

Exercise 14.3: What makes JAX different from PyTorch?

JAX is functional: arrays are immutable, functions are expected to be pure, and they are transformed via grad(), jit(), vmap(). This enables: (1) Composable transformations (grad of grad, vmap of grad). (2) Just-in-time XLA compilation via jit() for speed. (3) Parallelization across TPU/GPU devices through transformations such as pmap. Trade-off: steeper learning curve, less ecosystem support than PyTorch.
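
Composability is easiest to see on a scalar function. A minimal sketch, using a cubic as an arbitrary example:

```python
import jax
import jax.numpy as jnp
from jax import grad, jit, vmap

def f(x):
    return x ** 3            # scalar in, scalar out

df = grad(f)                 # first derivative: 3x^2
d2f = grad(grad(f))          # grad of grad: 6x
batched_df = jit(vmap(df))   # vmap of grad, then JIT-compiled

xs = jnp.array([1.0, 2.0, 3.0])
first = df(2.0)              # 3 * 2^2 = 12
second = d2f(2.0)            # 6 * 2 = 12
batched = batched_df(xs)     # derivative at 1, 2, 3 is 3, 12, 27
print(first, second, batched)
```

Each transformation takes a function and returns a new function, so they can be nested in any order without changing `f` itself.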

Chapter Summary

TensorFlow/Keras: fastest prototyping with model.fit(); best for deployment
JAX: functional transformations + XLA compilation for maximum performance
Automatic differentiation is the engine behind all DL — exact gradients for free
Custom training loops give full control for advanced training recipes

🎓 Congratulations!

You've completed Deep Learning. You now understand neural networks from individual neurons to production-scale architectures. These foundations power every AI breakthrough — from GPT to Stable Diffusion to AlphaFold. You're ready to build the future of AI.