Phase 4 • EduArtha

Deep Learning

The engine behind modern AI. Neural networks, backpropagation, and deep architectures are built here. Every breakthrough from GPT to Stable Diffusion to AlphaFold relies on these foundations.

⏱ 4–8 months  |  14 Chapters  |  50+ Exercises

Part I

Neural Network Fundamentals

The building blocks of every deep learning model

Chapter 1

Perceptrons & Multilayer Networks

Learning Objectives

  • Understand the perceptron, the simplest neural unit
  • Build multilayer perceptrons (MLPs) from scratch
  • Grasp universal approximation and why depth matters
  • Connect neurons to modern AI: every LLM is built on these

The Perceptron

A perceptron computes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. It is a single artificial neuron: the atom of deep learning.

output = activation(w₁x₁ + w₂x₂ + ... + wₙxₙ + b) = activation(W·X + b)
Python
import numpy as np

class Perceptron:
    def __init__(self, n_inputs, lr=0.01):
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0.0
        self.lr = lr

    def forward(self, x):
        return 1.0 if np.dot(self.weights, x) + self.bias > 0 else 0.0

    def train(self, X, y, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = self.forward(xi)
                error = yi - pred
                self.weights += self.lr * error * xi
                self.bias += self.lr * error

# AND gate: linearly separable, so a single perceptron can learn it
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0, 0, 0, 1])
p = Perceptron(2)
p.train(X, y)
print([p.forward(xi) for xi in X])  # expected once training has converged: [0.0, 0.0, 0.0, 1.0]

Multilayer Perceptron (MLP)

Stacking layers of neurons creates an MLP. With enough hidden units, an MLP can approximate any continuous function on a bounded input region to arbitrary accuracy (the Universal Approximation Theorem). The key insight: non-linear activations between layers allow the network to model complex, non-linear relationships.

Python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)

model = MLP(784, 256, 10)  # MNIST: 784 pixels → 10 digits
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

Why This Matters for AI

Every modern AI model (GPT-4, Gemini, DALL-E, AlphaFold) is built from layers of neurons. The MLP is the fundamental building block. Transformer feed-forward layers? MLPs. Classification heads? MLPs. Understanding how neurons combine to learn representations is the foundation for everything that follows.

Project: MNIST Digit Classifier from Scratch

Python
import torch, torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,),(0.3081,))])
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

# Model
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 512), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(256, 10)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Train
for epoch in range(5):
    model.train()
    for X, y in train_loader:
        loss = criterion(model(X), y)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

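    # Evaluate (no_grad skips building the autograd graph, saving memory)
    model.eval()
    with torch.no_grad():
        correct = sum(
            (model(X).argmax(1) == y).sum().item()
            for X, y in test_loader)
    print(f"Epoch {epoch+1}: Accuracy = {correct/len(test_data):.2%}")
# Typically reaches roughly 98% test accuracy after 5 epochs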

Exercises

Exercise 1.1: Why can't a single perceptron solve XOR?

XOR is not linearly separable: no single line can divide the four points into the correct classes. A perceptron draws one linear boundary. You need at least 2 layers (hidden + output) to create the two boundaries needed for XOR. This limitation motivated the development of MLPs.
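
The two-boundary argument can be made concrete with a tiny hand-wired network. This is only a sketch: the weights and thresholds below are chosen by hand for illustration (they are not learned), and `xor_net` is a hypothetical helper name.

```python
def step(z):
    """Heaviside step activation used by the classic perceptron."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden unit 1 draws the first boundary (OR): fires when x1 + x2 > 0.5
    h_or = step(x1 + x2 - 0.5)
    # Hidden unit 2 draws the second boundary (AND): fires when x1 + x2 > 1.5
    h_and = step(x1 + x2 - 1.5)
    # Output fires when OR is on but AND is off, which is exactly XOR
    return step(h_or - h_and - 0.5)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_outputs = [xor_net(a, b) for a, b in inputs]
print(xor_outputs)  # [0, 1, 1, 0]
```

Each hidden unit is itself a perceptron; the output unit combines their two linear boundaries, which a single unit cannot do.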

Exercise 1.2: How many parameters does a network with layers [784, 512, 256, 10] have?

Layer 1: 784×512 + 512 = 401,920. Layer 2: 512×256 + 256 = 131,328. Layer 3: 256×10 + 10 = 2,570. Total: 535,818. Each layer has weights (input×output) plus biases (output). Modern LLMs have billions, but the math is the same.
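
The same arithmetic can be checked in a few lines. A minimal sketch, assuming fully connected layers with biases; `count_mlp_parameters` is a hypothetical helper, not a library function.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    # Pair each layer width with the next one: (784, 512), (512, 256), ...
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix
        total += fan_out           # bias vector
    return total

print(count_mlp_parameters([784, 512, 256, 10]))  # 535818
```

For a PyTorch model, `sum(p.numel() for p in model.parameters())` should report the same number.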

Exercise 1.3: What is the Universal Approximation Theorem?

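A feedforward network with a single hidden layer of sufficient width and a non-linear activation can approximate any continuous function on a compact domain to any desired accuracy. The theorem guarantees that such a network exists; it says nothing about how to find its weights. In practice, deep networks (many layers) often represent the same function with far fewer parameters than wide, shallow ones. Depth enables hierarchical feature learning: edges → shapes → objects.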

Chapter Summary

  • Perceptrons compute weighted sums followed by an activation: the basic neural unit
  • MLPs stack layers with non-linear activations to learn complex functions
  • Every modern AI model is built from these fundamental building blocks
  • Depth enables hierarchical feature learning, the key insight of deep learning
Chapter 2

Activation Functions

Learning Objectives

  • Understand why non-linearity is essential
  • Master ReLU, sigmoid, tanh, GELU, and Swish
  • Know which activation to use where and why
Python
import torch
import torch.nn.functional as F
import numpy as np

x = torch.linspace(-5, 5, 100)

# Key activation functions
sigmoid = torch.sigmoid(x)            # σ(x) = 1/(1+e⁻ˣ), range (0,1)
tanh = torch.tanh(x)                  # range (-1,1)
relu = F.relu(x)                       # max(0,x), the most common default
leaky_relu = F.leaky_relu(x, 0.01)    # small slope for negative inputs keeps gradients alive
gelu = F.gelu(x)                       # Used in Transformers (GPT, BERT)
silu = F.silu(x)                       # x·σ(x), also called "Swish"; used in EfficientNet

Function | Formula | Range | Used In | Pros/Cons
Sigmoid | 1/(1+e⁻ˣ) | (0,1) | Output layers (binary) | Vanishing gradient, not zero-centered
Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1,1) | RNNs, LSTMs | Zero-centered but still saturates
ReLU | max(0,x) | [0,∞) | CNNs, default choice | Fast, no saturation for positive inputs; "dying ReLU"
GELU | x·Φ(x) | (-0.17,∞) | Transformers (GPT, BERT) | Smooth, probabilistic gating
SiLU/Swish | x·σ(x) | (-0.28,∞) | EfficientNet, modern CNNs | Smooth, self-gated, unbounded above

Future of AI: Why GELU Dominates Transformers

GELU (Gaussian Error Linear Unit) is used in GPT, BERT, ViT, and many other transformers. Unlike ReLU's hard cutoff at 0, GELU multiplies each input x by Φ(x), the standard normal CDF, so the gate opens smoothly as x grows: a probabilistic gate. This smoothness helps gradient flow in very deep models (100+ layers), which matters when training large models.

Exercises

Exercise 2.1: Why does removing all activations collapse a deep network to a single linear layer?

Without activations: Layer1 = W₁x + b₁, Layer2 = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂) = W'x + b'. The composition of linear functions is linear. No matter how many layers, it's equivalent to one matrix multiplication (plus a bias). Non-linearity is what makes depth useful.
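
The algebra above can be verified numerically. A small sketch with arbitrary illustrative shapes (3 inputs, 4 hidden units, 2 outputs) and random values:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2: 4 -> 2
x = rng.normal(size=3)

# Two stacked linear layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The single equivalent layer: W' = W2 W1, b' = W2 b1 + b2
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged

collapse_ok = np.allclose(two_layers, one_layer)
print(collapse_ok)  # True
```

Inserting any non-linear activation between the two layers breaks this identity, which is why the merged layer can no longer stand in for the stack.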

Exercise 2.2: What is the "dying ReLU" problem and how do you fix it?

If a ReLU neuron's pre-activation is negative for every training input (for example after a large update pushes its bias far below zero), its output is always 0 and so is its gradient, so its weights stop updating. The neuron is "dead." Fixes: (1) Leaky ReLU (small negative slope), (2) PReLU (learned slope), (3) ELU (smooth negative), (4) GELU/SiLU (smooth everywhere), (5) careful weight initialization.
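
A scalar example makes the failure mode visible. The neuron below is hypothetical: its weight and bias are picked so that every pre-activation is negative, which is the situation described above.

```python
def relu_grad(z):
    """Derivative of ReLU with respect to its input."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, negative_slope=0.01):
    """Derivative of Leaky ReLU: a small slope survives for negative inputs."""
    return 1.0 if z > 0 else negative_slope

# Illustrative neuron whose large negative bias keeps it switched off
weight, bias = 0.5, -10.0
inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]
pre_activations = [weight * x + bias for x in inputs]

relu_grads = [relu_grad(z) for z in pre_activations]
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]
print(relu_grads)   # [0.0, 0.0, 0.0, 0.0, 0.0]
print(leaky_grads)  # [0.01, 0.01, 0.01, 0.01, 0.01]
```

With ReLU no gradient reaches the weight or bias, so nothing can pull the neuron back; with Leaky ReLU a small signal still flows.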

Exercise 2.3: Which activation would you use for a model outputting probabilities for 10 classes?

Use Softmax for the final output layer: it converts logits to probabilities that sum to 1. For hidden layers, use ReLU (CNNs) or GELU (transformers). Never use softmax as a hidden-layer activation: it constrains the representation and makes training harder.
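
For reference, softmax itself is only a few lines. This is a plain-Python sketch with made-up logits for 10 classes; subtracting the maximum logit is a standard trick to avoid overflow and does not change the result.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for a 10-class problem
logits = [2.0, 1.0, 0.1, -1.0, 0.5, 0.0, 3.0, -0.5, 1.5, 0.2]
probs = softmax(logits)
predicted_class = probs.index(max(probs))
print(predicted_class)  # 6
```

In PyTorch, `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally, so the model's last layer is usually left linear during training and softmax is applied only when probabilities are needed.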

Chapter Summary

  • Non-linear activations are what make deep networks powerful; without them, depth is useless
  • ReLU is the default for most architectures; GELU for transformers
  • Sigmoid/softmax for output layers (probabilities); tanh for RNNs
  • Modern activations (GELU, SiLU) are smooth, avoiding dead neuron issues
Chapter 3

Forward Pass & Backpropagation

Learning Objectives

  • Trace the forward pass computation step by step
  • Understand backpropagation via the chain rule
  • Implement backprop from scratch in NumPy
  • Know why autograd replaces manual gradients

The Forward Pass

Input → [W₁·x + b₁] → ReLU → [W₂·h + b₂] → Softmax → Loss(ŷ, y)

Backpropagation: Chain Rule

Backpropagation computes ∂Loss/∂wᵢ for every weight by applying the chain rule backward through the network. This gradient tells each weight how much it contributed to the error and in which direction to adjust.

Python
import numpy as np

# 2-layer neural network from scratch
np.random.seed(42)
X = np.random.randn(100, 2)
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)

# Initialize weights
W1 = np.random.randn(2, 16) * 0.5
b1 = np.zeros((1, 16))
W2 = np.random.randn(16, 1) * 0.5
b2 = np.zeros((1, 1))
lr = 0.1

def sigmoid(z): return 1 / (1 + np.exp(-z))

for epoch in range(1000):
    # FORWARD
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)    # ReLU
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)           # Output probability

    # Loss (binary cross-entropy)
    loss = -np.mean(y * np.log(a2 + 1e-8) + (1-y) * np.log(1-a2 + 1e-8))

    # BACKWARD (chain rule)
    m = len(X)
    dz2 = a2 - y                         # ∂L/∂z2 per sample (sigmoid + BCE simplify to this)
    dW2 = (1/m) * a1.T @ dz2             # ∂L/∂W2, averaged over the batch
    db2 = (1/m) * np.sum(dz2, axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)       # ReLU gradient: 1 where z1 > 0, else 0
    dW1 = (1/m) * X.T @ dz1
    db1 = (1/m) * np.sum(dz1, axis=0)

    # UPDATE
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

    if epoch % 200 == 0:
        acc = ((a2 > 0.5) == y).mean()
        print(f"Epoch {epoch}: loss={loss:.4f}, acc={acc:.2%}")

Why This Matters for AI

Backpropagation is the algorithm that makes deep learning possible. Training a frontier model such as GPT-4 means computing gradients for every parameter (OpenAI has not published the count; unofficial estimates put it around 1.7 trillion) across a very large token corpus, all via the chain rule. Understanding backprop deeply is essential for debugging training failures, implementing custom layers, and pushing the frontier of AI research.

Exercises

Exercise 3.1: Why is the chain rule necessary? Can't we compute gradients directly?

A neural network is a deeply nested function: f(g(h(x))). Differentiating the fully expanded composition by hand quickly becomes unmanageable. The chain rule decomposes it: ∂f/∂x = (∂f/∂g)(∂g/∂h)(∂h/∂x). Each factor is simple and local. Backprop reuses these factors to compute the gradient for all n parameters in one backward pass, roughly O(n) work, whereas estimating each parameter's gradient separately (for example with finite differences) needs one forward pass per parameter, roughly O(n²) in total.
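
A two-weight toy model shows the local factors being multiplied together, and a finite-difference check confirms the result. The function and the numbers are illustrative assumptions, not taken from the chapter's network.

```python
import math

def forward(w1, w2, x, target):
    h = math.tanh(w1 * x)        # inner function
    y = w2 * h                   # outer function
    loss = (y - target) ** 2     # squared error
    return h, y, loss

def backward(w1, w2, x, target):
    """Apply the chain rule from the loss back to each weight."""
    h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2 * (y - target)                  # d(loss)/dy
    dloss_dw2 = dloss_dy * h                     # dy/dw2 = h
    dloss_dh = dloss_dy * w2                     # dy/dh = w2
    dloss_dw1 = dloss_dh * (1 - h ** 2) * x      # dh/dw1 = (1 - tanh^2) * x
    return dloss_dw1, dloss_dw2

def numerical_grad(f, value, eps=1e-6):
    """Central finite difference: needs two forward passes per parameter."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)

w1, w2, x, target = 0.7, -1.3, 0.5, 0.2
g1, g2 = backward(w1, w2, x, target)
n1 = numerical_grad(lambda w: forward(w, w2, x, target)[2], w1)
n2 = numerical_grad(lambda w: forward(w1, w, x, target)[2], w2)
print(abs(g1 - n1) < 1e-6, abs(g2 - n2) < 1e-6)  # True True
```

Note that `backward` produces both gradients from one pass, while the numerical check repeats the forward pass for every parameter.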

Exercise 3.2: What is the vanishing gradient problem?

In deep networks with sigmoid/tanh, gradients shrink exponentially as they flow backward (each sigmoid layer multiplies by σ'(x) ≤ 0.25). By the time gradients reach early layers, they are close to 0, so those layers barely learn. Solutions: ReLU (gradient = 1 for positive inputs), skip connections (ResNet), careful initialization, batch normalization.
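
The 0.25 bound compounds quickly. The sketch below looks only at the activation-derivative factor and ignores the weight matrices, so it is an upper bound on that factor rather than a full gradient computation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)   # 0.25, the best case
# Best-case product of activation derivatives after `depth` sigmoid layers
bounds = {depth: max_grad ** depth for depth in (5, 10, 20)}
for depth, bound in bounds.items():
    print(f"{depth} layers: at most {bound:.2e}")
```

Even in the best case the factor after 10 sigmoid layers is below one in a million, which is why early layers in such networks learn so slowly.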

Exercise 3.3: How does PyTorch's autograd relate to backpropagation?

PyTorch builds a computational graph during the forward pass, recording every operation. Calling loss.backward() traverses this graph in reverse, applying the chain rule automatically. This is reverse-mode automatic differentiation: equivalent to backprop but computed automatically, eliminating manual gradient derivation.

Chapter Summary

  • Forward pass computes predictions; backward pass computes gradients
  • Backpropagation = chain rule applied backward through the network
  • Vanishing gradients in deep networks are mitigated by ReLU and skip connections
  • Autograd (PyTorch/TF) automates gradient computation via computational graphs
Chapter 4

Weight Initialization, Batch Normalization & Dropout

Learning Objectives

  • Initialize weights properly to enable training (Xavier/Glorot, He/Kaiming, LeCun)
  • Normalize activations with Batch Normalization for stable training
  • Regularize with Dropout to prevent overfitting

Weight Initialization

Method | Formula (Variance) | Best With
Xavier/Glorot | Var = 2/(fan_in + fan_out) | Sigmoid, Tanh
He/Kaiming | Var = 2/fan_in | ReLU, Leaky ReLU
LeCun | Var = 1/fan_in | SELU

Python
import torch.nn as nn

# Proper initialization matters!
layer = nn.Linear(512, 256)

# He initialization for ReLU networks
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)

# Xavier for sigmoid/tanh (an alternative; calling it here overwrites the He init above)
nn.init.xavier_uniform_(layer.weight)
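
To see what the variance formulas in the table mean in numbers, the sketch below converts them to standard deviations for a 512 → 256 layer. The helper names are hypothetical; they simply restate the table's formulas for a normal distribution.

```python
import math

def xavier_std(fan_in, fan_out):
    """Xavier/Glorot: Var = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """He/Kaiming: Var = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def lecun_std(fan_in):
    """LeCun: Var = 1 / fan_in."""
    return math.sqrt(1.0 / fan_in)

fan_in, fan_out = 512, 256   # matches nn.Linear(512, 256)
print(f"Xavier: {xavier_std(fan_in, fan_out):.4f}")
print(f"He:     {he_std(fan_in):.4f}")
print(f"LeCun:  {lecun_std(fan_in):.4f}")
```

He initialization uses the largest spread of the three because ReLU zeroes roughly half of the activations, and the extra variance compensates for that loss.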

Batch Normalization

BatchNorm normalizes each layer's activations to zero mean and unit variance, then applies a learnable scale (γ) and shift (β). This stabilizes training, allows higher learning rates, and acts as mild regularization.

Python
model = nn.Sequential(
    nn.Linear(784, 512),
    nn.BatchNorm1d(512),   # Normalize activations
    nn.ReLU(),
    nn.Dropout(0.3),       # Randomly zero 30% of neurons
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 10)
)
# BatchNorm + Dropout = stable training + regularization

Layer Norm vs Batch Norm

BatchNorm: Normalizes across the batch dimension, so each feature has mean=0, std=1 across samples. Standard for CNNs. LayerNorm: Normalizes across the feature dimension, so each sample is normalized independently. Standard for Transformers (works with variable-length sequences and doesn't depend on batch size).
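
The difference comes down to which axis the statistics are taken over. A minimal NumPy sketch on a made-up (batch=4, features=3) array, omitting the learnable scale and shift:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0],
              [7.0, 10.0, 13.0],
              [10.0, 14.0, 18.0]])   # shape (batch=4, features=3)
eps = 1e-5

# BatchNorm-style: statistics per feature, computed down the batch axis
batch_normed = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LayerNorm-style: statistics per sample, computed across the feature axis
layer_normed = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(
    x.var(axis=1, keepdims=True) + eps)

print(batch_normed.mean(axis=0))  # each column is centered
print(layer_normed.mean(axis=1))  # each row is centered
```

Because the LayerNorm statistics use only one row, the result for a sample does not change when the batch size or the other samples change.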

Future: Layer Norm in Every Transformer

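GPT and BERT use LayerNorm rather than BatchNorm, and LLaMA uses RMSNorm, a simplified variant of LayerNorm. When you build a transformer, you will typically place the normalization before the attention and feed-forward sublayers (the pre-norm layout used by most recent models; the original Transformer and BERT normalize after each sublayer). Understanding the normalization landscape (BatchNorm, LayerNorm, RMSNorm, GroupNorm) is essential for architecture design.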

Exercises

Exercise 4.1: What happens with all-zeros initialization?

All neurons in a layer receive identical gradients → they update identically → they remain identical forever. The network has effectively one neuron per layer. This is called the symmetry problem. Random initialization breaks symmetry, ensuring neurons learn different features.

Exercise 4.2: Why must Dropout be disabled during evaluation?

During training, Dropout randomly zeros neurons, so the network learns redundant representations. At evaluation, we want the full network's prediction (not a random subset). model.eval() disables Dropout (uses all neurons) and uses running statistics for BatchNorm instead of batch statistics.
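
The train/eval difference can be sketched in plain Python. This follows the "inverted dropout" convention, where surviving activations are scaled by 1/(1-p) during training so that nothing needs rescaling at evaluation; the `dropout` function here is an illustrative stand-in, not PyTorch's implementation.

```python
import random

def dropout(values, p=0.3, training=True, rng=random):
    """Inverted dropout: zero each value with probability p while training."""
    if not training or p == 0.0:
        return list(values)              # evaluation: pass everything through
    keep_scale = 1.0 / (1.0 - p)         # rescale survivors to keep the expected value
    return [v * keep_scale if rng.random() >= p else 0.0 for v in values]

rng = random.Random(0)                   # seeded for repeatability
activations = [1.0] * 10000
train_out = dropout(activations, p=0.3, training=True, rng=rng)
eval_out = dropout(activations, p=0.3, training=False)

dropped_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)
print(f"dropped: {dropped_fraction:.2%}, mean while training: {train_mean:.3f}")
```

The mean stays close to 1.0 in training mode despite roughly 30% of the values being zeroed, which is why the evaluation path can simply return its input.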

Exercise 4.3: How does Batch Normalization help with the vanishing gradient problem?

BatchNorm keeps activations in the linear (non-saturating) region of activation functions by normalizing to mean=0, std=1. This prevents inputs to sigmoid/tanh from being extremely large/small (where gradients vanish). It also smooths the loss landscape, allowing larger learning rates.

Chapter Summary

  • He initialization for ReLU; Xavier for sigmoid/tanh. Always match init to activation
  • BatchNorm stabilizes training and enables higher learning rates
  • Dropout prevents overfitting by forcing redundant representations
  • LayerNorm (not BatchNorm) is standard in Transformers
Part II

Optimizers & Training

Making neural networks learn efficiently

Chapter 5

Optimizers: SGD, Adam, AdamW & RMSProp

Learning Objectives

  • Understand gradient descent variants: batch, mini-batch, stochastic
  • Master momentum, RMSProp, Adam, and AdamW
  • Know which optimizer to use for which architecture
Python
import torch.optim as optim

model = MLP(784, 256, 10)

# SGD with momentum: classic, good generalization
opt_sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Adam: adaptive learning rate per parameter
opt_adam = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# AdamW: Adam with decoupled weight decay (standard for Transformers)
opt_adamw = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# RMSProp: often used for RNNs
opt_rms = optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)

Optimizer | Key Idea | Best For | Typical LR
SGD+Momentum | Accelerates in consistent gradient direction | CNNs (ResNet, EfficientNet) | 0.01 to 0.1
Adam | Adaptive LR per parameter (momentum + RMS scaling) | Quick prototyping, GANs | 3e-4 to 1e-3
AdamW | Adam + decoupled weight decay | Transformers (GPT, BERT, ViT) | 5e-5 to 1e-4
RMSProp | Divides the LR by the root of a running average of squared gradients | RNNs, reinforcement learning | 1e-3

Future: AdamW is the Standard for LLMs

AdamW is the optimizer most often reported for large language model training; LLaMA, for example, documents its use. (Training details for closed models such as GPT-4, Gemini, and Claude are not public.) Its decoupled weight decay provides better regularization than Adam's L2 penalty. For your future AI models, AdamW with a warmup + cosine schedule is the go-to recipe. Newer optimizers such as LAMB, Lion, and Sophia are emerging, but AdamW remains dominant.

Exercises

Exercise 5.1: Why does Adam converge faster than SGD but sometimes generalize worse?

Adam adapts learning rates per-parameter, finding optima quickly. But it can converge to sharp minima (small basins) that generalize poorly. SGD with momentum tends to find flat minima (wide basins) that generalize better. For best results: start with Adam for fast experimentation, then fine-tune with SGD for final performance.

Exercise 5.2: What does "decoupled weight decay" mean in AdamW?

In Adam with L2 regularization, weight decay is entangled with the adaptive learning rate: parameters with large gradient histories get less decay. AdamW decouples them: weight decay is applied directly to weights regardless of gradient history. This gives more uniform regularization across parameters, which is especially important for large models.
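
A back-of-the-envelope comparison shows the entanglement. This is a deliberately simplified sketch: it assumes `grad_rms` stands in for Adam's √v̂ term and looks only at the part of one update that comes from weight decay. Both function names are hypothetical.

```python
def l2_decay_in_adam(w, grad_rms, lr=1e-3, wd=0.01, eps=1e-8):
    """Adam + L2: the penalty wd*w is added to the gradient,
    so it is divided by the running gradient scale like everything else."""
    return lr * wd * w / (grad_rms + eps)

def decoupled_decay_in_adamw(w, lr=1e-3, wd=0.01):
    """AdamW: the decay lr*wd*w is subtracted from the weight directly."""
    return lr * wd * w

w = 1.0
quiet_rms, busy_rms = 0.01, 10.0   # illustrative small vs large gradient history

adam_quiet = l2_decay_in_adam(w, quiet_rms)
adam_busy = l2_decay_in_adam(w, busy_rms)
adamw_any = decoupled_decay_in_adamw(w)

print(f"Adam+L2 decay, quiet parameter: {adam_quiet:.2e}")
print(f"Adam+L2 decay, busy parameter:  {adam_busy:.2e}")
print(f"AdamW decay, either parameter:  {adamw_any:.2e}")
```

Under these assumptions the two parameters hold the same weight value, yet Adam with L2 shrinks the "busy" one about a thousand times less, while AdamW shrinks both equally.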

Exercise 5.3: What are β₁ and β₂ in Adam, and why do the defaults work well?

β₁=0.9: exponential decay rate for the first moment (running mean of gradients), i.e. momentum. β₂=0.999: decay rate for the second moment (running mean of squared gradients), i.e. adaptive scaling. These defaults work because β₁ averages over ~10 steps (fast) and β₂ over ~1000 steps (slow, stable); the effective window is roughly 1/(1-β). For transformers, β₂=0.95 sometimes works better.

Chapter Summary

  • SGD+momentum: best generalization for CNNs; requires careful LR tuning
  • Adam: fast convergence, adaptive LR; great for prototyping
  • AdamW: decoupled weight decay; the standard choice for transformers and LLMs
  • Choose optimizer based on architecture and whether speed or generalization matters more
Chapter 6

Learning Rate Schedules

Learning Objectives

  • Master warmup, step decay, cosine annealing, and one-cycle policies
  • Implement custom LR schedules in PyTorch
  • Find optimal learning rates with the LR range test
Python
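import numpy as np
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# `model` is the MLP instance created in Chapter 5.
# The schedulers below are alternatives; attach only one to an optimizer in practice.
optimizer = optim.AdamW(model.parameters(), lr=1e-3)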

# Step Decay: reduce LR by 0.1 every 10 epochs
step_sched = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Cosine Annealing: smoothly decays to near-zero
cosine_sched = lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# Warmup + Cosine (used in all modern transformers)
def warmup_cosine(epoch, warmup_epochs=5, total_epochs=100):
    if epoch < warmup_epochs:
        return epoch / warmup_epochs  # Linear warmup
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1 + np.cos(np.pi * progress))  # Cosine decay

warmup_sched = lr_scheduler.LambdaLR(optimizer, warmup_cosine)

# One Cycle: aggressive schedule that ramps the LR up, then decays it
onecycle = lr_scheduler.OneCycleLR(optimizer, max_lr=1e-2, total_steps=1000)

Future: Warmup + Cosine is the LLM Recipe

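GPT-3, LLaMA, and most LLMs use warmup for a short initial phase (commonly cited as the first 1-5% of steps), which prevents early divergence when random weights produce wild gradients, followed by cosine decay to a small fraction of the peak learning rate (GPT-3 and LLaMA both report decaying to 10% of the peak). This schedule is so effective it has become a standard recipe. When you train your own models, start here.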

Exercises

Exercise 6.1: Why is warmup necessary for large learning rates?

At the start of training, weights are random and gradients can be very large/noisy. A high learning rate amplifies this noise, causing divergence. Warmup starts with a tiny LR and linearly increases it, allowing the model to stabilize its gradient magnitudes before using the full learning rate.

Exercise 6.2: What advantage does cosine annealing have over step decay?

Cosine annealing provides smooth, gradual LR reduction rather than abrupt drops. Smooth transitions help the optimizer settle into better local minima without the "shock" of sudden LR changes. It also naturally approaches zero at the end, giving fine-grained final optimization.

Exercise 6.3: How does the LR Range Test (Smith, 2015) work?

Train for one epoch while linearly increasing LR from tiny (1e-7) to huge (10). Plot loss vs LR. The optimal LR is where loss decreases fastest (steepest descent region). The max LR is where loss starts increasing. Use this range for OneCycleLR or as your base learning rate.

Chapter Summary

  • Warmup prevents divergence; cosine decay enables fine-grained optimization
  • Warmup + cosine is the standard recipe for transformer training
  • OneCycleLR provides aggressive but effective super-convergence
  • LR Range Test empirically finds the optimal learning rate
Chapter 7

Gradient Clipping, Mixed Precision & Early Stopping

Learning Objectives

  • Prevent exploding gradients with gradient clipping
  • Train up to about 2x faster with mixed precision (FP16/BF16)
  • Stop training at the right time with early stopping
Python
import torch
from torch.cuda.amp import autocast, GradScaler

# Gradient Clipping: prevents exploding gradients (assumes `loss`, `model`, and `optimizer` already exist)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# Mixed Precision Training (FP16/BF16)
scaler = GradScaler()
for X, y in train_loader:
    optimizer.zero_grad()
    with autocast():           # Forward pass in FP16
        output = model(X)
        loss = criterion(output, y)
    scaler.scale(loss).backward()  # Scaled backward pass
    scaler.step(optimizer)
    scaler.update()
# Often up to ~2x faster with lower GPU memory use (depends on hardware and model)

# Early Stopping
class EarlyStopping:
    def __init__(self, patience=7, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None

    def __call__(self, val_loss):
        if self.best_loss is None or val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

early_stop = EarlyStopping(patience=10)
# In training loop: if early_stop(val_loss): break

Future: BF16 is Replacing FP16

BF16 (Brain Float 16) has the same exponent range as FP32 but fewer mantissa bits, so gradients rarely underflow and loss scaling is usually unnecessary. A100/H100 GPUs support BF16 on their tensor cores. Most recent LLM training runs use BF16. When you scale to production AI, BF16 is the format to reach for first.

Exercises

Exercise 7.1: Why do gradients explode in deep networks and RNNs?

During backprop, gradients are multiplied through layers. If the weight matrices have singular values greater than 1, this product can grow exponentially with depth. RNNs are especially vulnerable because the same weight matrix is applied at every timestep, which is equivalent to multiplying by it T times. Gradient clipping caps the gradient norm, which limits the damage of any single oversized update.
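
Norm-based clipping is short enough to write out. The sketch below mirrors the idea behind `torch.nn.utils.clip_grad_norm_` on a flat list of gradient values; it is an illustration of the technique, not the library's code, and `clip_grad_norm` is a hypothetical name.

```python
import math

def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale up
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]                                    # L2 norm is 5
clipped, norm_before = clip_grad_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

small, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5: left untouched
print(norm_before, round(norm_after, 4), small)
```

Every component is multiplied by the same factor, so the update keeps its direction and only its length is capped.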

Exercise 7.2: How does mixed precision training work without losing accuracy?

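The forward pass and gradient computation use FP16 (fast, small). A master copy of the weights stays in FP32 (full precision). Loss scaling multiplies the loss by a large constant before the backward pass so that small FP16 gradients do not underflow to zero; the gradients are divided by the same constant before the update. The weight update happens in FP32. Result: comparable convergence, often up to about 2x faster, with substantially lower activation memory.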
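
The underflow that loss scaling guards against can be reproduced with the standard library alone, since `struct` supports IEEE half precision through the `"e"` format. The gradient value and scale factor below are illustrative assumptions.

```python
import struct

def to_fp16(value):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", value))[0]

tiny_grad = 1e-8          # smaller than anything FP16 can represent
loss_scale = 65536.0      # 2**16, an illustrative scale factor

unscaled = to_fp16(tiny_grad)               # underflows to zero
scaled = to_fp16(tiny_grad * loss_scale)    # now inside the FP16 range
recovered = scaled / loss_scale             # unscale in higher precision

print(unscaled)    # 0.0
print(recovered)   # close to 1e-08
```

Without scaling the gradient is lost entirely; with scaling it survives the FP16 round trip with only a small rounding error, which is the reason the scaled value is divided back out in higher precision before the weight update.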

Exercise 7.3: How do you choose the patience for early stopping?

Patience = how many epochs of no improvement before stopping. Too low (2-3): premature stopping, may miss later improvements. Too high (50+): wastes compute on a plateaued model. Rule of thumb: patience = 5-15 for most tasks. Monitor validation loss, not training loss. Save the best model checkpoint.

Chapter Summary

  • Gradient clipping (max_norm=1.0) prevents exploding gradients in deep/recurrent networks
  • Mixed precision (FP16/BF16) often gives up to about 2x speedup with negligible accuracy loss
  • Early stopping prevents overfitting by halting when validation loss plateaus
  • These three techniques are standard in production training pipelines
Part III

Core Architectures

The deep learning architectures that power modern AI

Chapter 8

Convolutional Neural Networks (CNNs)

Learning Objectives

  • Understand convolution, pooling, and feature hierarchies
  • Build CNNs for image classification
  • Know key architectures: LeNet, VGG, ResNet
Python
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Conv block 1: 1→32 channels
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),      # 28×28 → 14×14

            # Conv block 2: 32→64
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),      # 14×14 → 7×7

            # Conv block 3: 64→128
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1)  # Global average pooling → 1×1
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        return self.classifier(x)

model = CNN()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

How Convolution Learns Features

Layer 1 learns edges and textures. Layer 2 combines edges into shapes (eyes, wheels). Layer 3 recognizes parts (faces, windows). Deep layers recognize objects. This hierarchical feature learning is the key insight of CNNs, and it loosely mirrors how the visual cortex processes images.

Project: CIFAR-10 Image Classifier

Python
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.4914,0.4822,0.4465), (0.2470,0.2435,0.2616))
])
train_ds = datasets.CIFAR10('./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)

# Change first conv to accept 3 channels (RGB)
cnn = CNN(num_classes=10)
cnn.features[0] = nn.Conv2d(3, 32, 3, padding=1)
optimizer = torch.optim.AdamW(cnn.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    cnn.train()
    total_loss = 0
    for X, y in train_loader:
        loss = criterion(cnn(X), y)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}: loss={total_loss/len(train_loader):.4f}")

Exercises

Exercise 8.1: How many parameters in Conv2d(3, 64, kernel_size=3)?

Each filter: 3 (in_channels) × 3 × 3 (kernel) = 27 weights + 1 bias = 28. With 64 filters: 64 × 28 = 1,792 parameters. Compare to a fully connected layer on a 224×224×3 image: 150,528 × 64 ≈ 9.6M weights! Convolutions share weights spatially, making them parameter-efficient.
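
The comparison can be reproduced with two small counting helpers. These are hypothetical functions written for this exercise, assuming square kernels.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Parameters of a 2D convolution with a square kernel."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)

def linear_params(in_features, out_features, bias=True):
    """Parameters of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

conv = conv2d_params(3, 64, kernel_size=3)
dense_weights = linear_params(224 * 224 * 3, 64, bias=False)

print(conv)           # 1792
print(dense_weights)  # 9633792
```

The convolution's count does not depend on the image size at all, whereas the dense layer's count grows with every added pixel.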

Exercise 8.2: Why use pooling layers instead of larger strides?

Pooling provides a degree of translation invariance: a feature is detected regardless of its exact position. Max pooling selects the strongest activation. However, modern architectures (like ResNet) increasingly use strided convolutions instead of pooling, as they learn the downsampling. Both work; strided convolutions are more flexible.

Exercise 8.3: What is data augmentation and why is it critical for CNNs?

Augmentation creates new training images by applying random transformations (flip, rotate, crop, color jitter). This prevents overfitting by exposing the model to more variation. For image tasks, augmentation alone can noticeably improve accuracy (gains of 5-15% are sometimes quoted, but the effect depends heavily on dataset size and model). It's essentially free training data.

Chapter Summary

  • Convolutions share weights spatially, giving parameter-efficient feature extraction
  • CNNs learn hierarchical features: edges → shapes → objects
  • BatchNorm + data augmentation are essential for training deep CNNs
  • Global average pooling replaces fully connected layers for modern architectures
Chapter 9

RNNs, LSTMs & GRUs

Learning Objectives

  • Process sequential data with recurrent neural networks
  • Solve long-range dependencies with LSTMs and GRUs
  • Understand why Transformers replaced RNNs for most tasks
Python
import torch
import torch.nn as nn

# LSTM for sequence modeling
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                           batch_first=True, dropout=0.3, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, 1)  # *2 for bidirectional

    def forward(self, x):
        embed = self.embedding(x)
        out, (hidden, cell) = self.lstm(embed)
        # Use last hidden state from both directions
        hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return torch.sigmoid(self.fc(hidden))

# GRU: simpler, faster, often comparable to LSTM
gru = nn.GRU(input_size=128, hidden_size=256, num_layers=2, batch_first=True)

Architecture | Gates | Memory | Best For
Vanilla RNN | None | Short-term only | Simple sequences (mostly obsolete)
LSTM | Forget, Input, Output | Long-term | Time series, speech, text
GRU | Reset, Update | Long-term (simpler) | Same as LSTM, fewer params
Transformer | Attention | Full context | Most sequence tasks (largely replaced RNNs)

Future: RNNs in the Age of Transformers

Transformers have largely replaced RNNs for NLP (GPT, BERT). However, RNNs/LSTMs remain valuable for: (1) real-time streaming data (IoT, finance), (2) on-device inference where memory is limited, (3) state-space models (Mamba, S4) which are RNN-like but parallelizable. Understanding RNNs helps you grasp why attention was invented: to solve their limitations.

Exercises

Exercise 9.1: How does the LSTM forget gate work?

The forget gate fₜ = σ(W_f·[hₜ₋₁, xₜ] + b_f) outputs a value between 0 and 1 for each element of the cell state: 0 means forget completely, 1 means keep completely. This lets the LSTM selectively remember or forget information over long sequences, which greatly reduces the vanishing gradient problem.
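
The effect of the gate value on long-range memory can be seen with a single scalar cell. This sketch holds the forget gate fixed and feeds no new input, which is a simplification of the full update cₜ = fₜ·cₜ₋₁ + iₜ·gₜ; `run_cell` is a hypothetical helper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def run_cell(forget_logit, steps=100, c0=1.0):
    """Carry one cell-state value forward with a fixed forget gate."""
    f = sigmoid(forget_logit)
    c = c0
    for _ in range(steps):
        c = f * c          # the i_t * g_t term is zero in this sketch
    return f, c

f_keep, c_keep = run_cell(forget_logit=8.0)   # gate close to 1
f_drop, c_drop = run_cell(forget_logit=0.0)   # gate exactly 0.5

print(f"f={f_keep:.4f}: cell after 100 steps = {c_keep:.3f}")
print(f"f={f_drop:.4f}: cell after 100 steps = {c_drop:.2e}")
```

Since the derivative of cₜ with respect to cₜ₋₁ is fₜ, the same product governs how much gradient reaches early timesteps: a gate near 1 preserves it, a gate near 0.5 wipes it out.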

Exercise 9.2: Why can't vanilla RNNs handle long sequences?

Vanilla RNNs apply the same weight matrix at every timestep, so gradients are multiplied through T steps during backprop. If the matrix's eigenvalues have magnitude below 1, gradients vanish (early inputs are forgotten); if above 1, gradients explode. LSTMs mitigate this with additive cell state updates: when the forget gate is close to 1, gradients flow through the cell nearly unchanged.

Exercise 9.3: When would you still use an LSTM over a Transformer?

(1) Streaming/real-time data where you process one token at a time. (2) Very long sequences where Transformer attention is O(n²). (3) Edge devices with limited memory. (4) Tasks with strict latency requirements. However, efficient transformers (linear attention, Mamba) are increasingly closing this gap.

Chapter Summary

  • RNNs process sequences by maintaining hidden state across timesteps
  • LSTMs add forget/input/output gates to handle long-range dependencies
  • GRUs are simpler (2 gates vs 3) with comparable performance
  • Transformers have largely replaced RNNs but understanding them illuminates attention
Chapter 10

Attention Mechanisms

Learning Objectives

  • Understand self-attention: queries, keys, values
  • Grasp multi-head attention and positional encoding
  • See why attention is the foundation of modern AI

Self-Attention

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
Python
import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape

        # Project to queries, keys, values
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)

        # Scaled dot-product attention
        scores = (Q @ K.transpose(-2,-1)) / math.sqrt(self.d_k)
        attn = torch.softmax(scores, dim=-1)
        out = attn @ V

        # Combine heads
        out = out.transpose(1,2).contiguous().view(B, T, C)
        return self.W_o(out)

# Test
attn = SelfAttention(d_model=512, n_heads=8)
x = torch.randn(2, 100, 512)  # batch=2, seq_len=100, dim=512
out = attn(x)
print(out.shape)  # torch.Size([2, 100, 512])

Why Attention is the Foundation of AI's Future

Attention is a key mechanism behind many major AI breakthroughs: GPT-4 (text), DALL-E 3 (images), Whisper (speech), AlphaFold (protein folding), Gemini (multimodal). It allows models to learn which parts of the input are relevant to each part of the output, without hard-coding those relationships. Multi-head attention lets different heads focus on different relationship types (syntax, semantics, position). A deep understanding of attention is among the most valuable skills for future AI work.

Exercises

Exercise 10.1: Why scale by √dₖ in the attention formula?

Without scaling, the magnitude of the dot products grows with the dimension d_k: if the components of q and k are independent with zero mean and unit variance, q·k has variance d_k. Large values push softmax into saturation (one element near 1, the rest near 0), making gradients nearly zero. Dividing by √d_k keeps the variance of the dot products at about 1 regardless of dimension, so softmax produces meaningful, distributed attention weights.
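The variance argument can be checked empirically. The sketch below assumes query and key entries drawn independently from a standard normal distribution, which is an idealization of real activations; `dot_product_variance` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_variance(d_k, n_samples=10_000, scaled=False):
    """Empirical variance of q.k for q, k with independent unit-variance entries."""
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    dots = (q * k).sum(axis=1)
    if scaled:
        dots = dots / np.sqrt(d_k)        # the 1/sqrt(d_k) factor from the formula
    return dots.var()

for d_k in (16, 64, 256):
    raw = dot_product_variance(d_k)
    scaled = dot_product_variance(d_k, scaled=True)
    print(f"d_k={d_k:>3}: raw variance ~ {raw:6.1f}, scaled variance ~ {scaled:.2f}")
```

The raw variance tracks d_k, while the scaled variance stays close to 1 for every dimension.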

Exercise 10.2: What is the purpose of multiple attention heads?

Different heads can learn different types of relationships: one head might attend to adjacent tokens (syntax), another to long-range dependencies (semantics), another to positional patterns. With 8 heads and d_model=512, each head operates in a 64-dim subspace, then results are concatenated and projected back to 512.

Exercise 10.3: What is causal (masked) attention and why does GPT use it?

Causal attention prevents each position from attending to future positions: the model can only "see" the current and earlier tokens. This is done by setting the entries above the diagonal of the score matrix to -∞ before softmax, so they receive zero weight. This is essential for autoregressive generation: when predicting token t, the model should only use tokens 1...t-1.
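The masking step can be sketched with NumPy on a single (T, T) score matrix. The helper name `causal_attention_weights` is hypothetical, and the all-zero scores are chosen only so that the effect of the mask is easy to read off.

```python
import numpy as np

def causal_attention_weights(scores):
    """Row-wise softmax over a (T, T) score matrix with future positions masked out."""
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)     # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(masked)                                # exp(-inf) is exactly 0
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))          # equal raw scores, so the mask alone shapes the result
weights = causal_attention_weights(scores)
print(weights)
```

Row i spreads its weight evenly over positions 0 through i and assigns exactly zero to everything after it.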

Chapter Summary

  • Self-attention computes relevance between all pairs of positions: O(n²) cost, but powerful
  • Multi-head attention learns diverse relationship types in parallel
  • Causal masking enables autoregressive generation (GPT-style)
  • Attention is the foundation of most modern AI architectures
Chapter 11

Residual Networks (ResNet) & Skip Connections

Learning Objectives

  • Understand the degradation problem in very deep networks
  • Master skip connections and residual learning
  • See how skip connections enable 100+ layer networks
Output = F(x) + x   (residual connection: learn the "difference")
Python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels)
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.block(x) + x)  # Skip connection!

# Stack residual blocks to create ResNet
class SimpleResNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)
        self.layer1 = nn.Sequential(*[ResidualBlock(64) for _ in range(3)])
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, 10)

    def forward(self, x):
        x = self.pool(nn.functional.relu(self.bn1(self.conv1(x))))
        x = self.layer1(x)
        x = self.avgpool(x).flatten(1)
        return self.fc(x)

Future: Skip Connections are Everywhere

Skip connections in the style of ResNet appear in: Transformers (residual paths around attention and FFN), U-Net (medical imaging, using concatenation rather than addition), DenseNet, and diffusion models (Stable Diffusion). Highway networks use a closely related gated shortcut. The skip connection is one of the most important architectural innovations in deep learning history, and any future architecture you build will likely use residual connections.

Exercises

Exercise 11.1: Why does simply making networks deeper NOT improve performance?

The degradation problem: beyond a certain depth, adding layers increases both training AND test error, not just test error (which would indicate overfitting). A deeper network could in principle match a shallower one by making its extra layers identity mappings, but in practice optimizers struggle to find such solutions in very deep plain networks. Skip connections make the identity easy to represent and give gradients a direct path, which sidesteps this problem.

Exercise 11.2: How do skip connections help gradient flow?

During backprop, write y = F(x) + x. Then ∂L/∂x = ∂L/∂y · (∂F/∂x + 1) = ∂L/∂y · ∂F/∂x + ∂L/∂y. The "+ 1" term gives the gradient a direct path (the identity shortcut) that does not decay through the layers. Even if ∂F/∂x vanishes, the gradient ∂L/∂y from the identity path survives. This enables training networks with 152 or more layers.
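A scalar toy model shows the effect of the identity path. The sketch assumes each layer is F(x) = w · tanh(x) with w = 0.5, a made-up choice that keeps every local derivative at or below 0.5; `input_gradient` is a hypothetical helper that applies the chain rule layer by layer.

```python
import math

def input_gradient(depth, w=0.5, x0=0.3, residual=False):
    """d(output)/d(input) for a chain of scalar layers F(x) = w * tanh(x)."""
    x, grad = x0, 1.0
    for _ in range(depth):
        local = w * (1.0 - math.tanh(x) ** 2)      # dF/dx at the current activation
        if residual:
            # y = F(x) + x, so dy/dx = dF/dx + 1
            x, grad = x + w * math.tanh(x), grad * (1.0 + local)
        else:
            # y = F(x), so dy/dx = dF/dx
            x, grad = w * math.tanh(x), grad * local
    return grad

plain = input_gradient(depth=30)
skip = input_gradient(depth=30, residual=True)
print(f"plain: {plain:.3e}   residual: {skip:.3e}")
```

Without the shortcut, thirty factors of at most 0.5 drive the gradient toward zero; with it, every factor is at least 1, so the signal from the output still reaches the input.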

Exercise 11.3: What is the difference between pre-activation and post-activation ResNet?

Post-activation (original): Conv → BN → ReLU → Conv → BN → Add → ReLU. Pre-activation (ResNet v2): BN → ReLU → Conv → BN → ReLU → Conv → Add. Pre-activation gives cleaner gradient paths and slightly better performance because the identity mapping is completely unobstructed.
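For comparison with the post-activation ResidualBlock shown earlier in this chapter, a pre-activation block might look like the following. This is a minimal sketch with a hypothetical class name, assuming equal input and output channels and no downsampling.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """BN -> ReLU -> Conv -> BN -> ReLU -> Conv, then add the untouched input."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.block(x)   # no ReLU after the addition

block = PreActResidualBlock(16)
x = torch.randn(2, 16, 8, 8)
out = block(x)
print(out.shape)
```

Because nothing is applied after the addition, the input passes through the shortcut unchanged from block to block.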

Chapter Summary

  • Skip connections address the degradation problem, enabling training of 100+ layer networks
  • The network learns residual F(x) = H(x) - x instead of full mapping H(x)
  • Gradient highways through identity shortcuts prevent vanishing gradients
  • Skip connections appear in virtually every modern architecture
Chapter 12

U-Net & Encoder-Decoder Designs

Learning Objectives

  • Understand encoder-decoder architecture for dense prediction
  • Master U-Net's skip connections for precise localization
  • Apply to segmentation, image-to-image, and generation tasks
Python
import torch
import torch.nn as nn

class UNet(nn.Module):
    def __init__(self, in_ch=1, out_ch=1):
        super().__init__()
        # Encoder (downsampling)
        self.enc1 = self._block(in_ch, 64)
        self.enc2 = self._block(64, 128)
        self.pool = nn.MaxPool2d(2)

        # Bottleneck
        self.bottleneck = self._block(128, 256)

        # Decoder (upsampling)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = self._block(256, 128)  # 256 because of skip concat
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = self._block(128, 64)

        self.final = nn.Conv2d(64, out_ch, 1)

    def _block(self, in_c, out_c):
        return nn.Sequential(
            nn.Conv2d(in_c, out_c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_c, out_c, 3, padding=1), nn.ReLU())

    def forward(self, x):
        # Encode
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        # Bottleneck
        b = self.bottleneck(self.pool(e2))
        # Decode with skip connections
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.final(d1)

unet = UNet(3, 1)  # RGB input → binary mask output (input height and width must be divisible by 4)

Future: Encoder-Decoder Powers Generative AI

The encoder-decoder pattern is fundamental to: Stable Diffusion (a U-Net denoises latent images), Seq2Seq (machine translation), Variational Autoencoders (generative models), SAM (Meta's Segment Anything), and medical imaging (tumor segmentation). The U-Net architecture specifically has been the backbone of many diffusion models generating AI art, although some newer systems replace it with transformer-based denoisers.

Exercises

Exercise 12.1: Why does U-Net concatenate encoder features with decoder features?

The encoder captures "what" (high-level semantic features) but loses "where" (precise spatial information through pooling). The decoder upsamples but lacks fine detail. Skip connections concatenate high-resolution encoder features with upsampled decoder features, combining precise localization with semantic understanding, which is essential for pixel-level tasks.
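The channel arithmetic behind the concatenation can be traced with placeholder arrays. The sketch assumes the two-level UNet from this chapter and a 64x64 input; the arrays are zeros standing in for real feature maps.

```python
import numpy as np

# Layout is (batch, channels, height, width) for a 64x64 input image
e2 = np.zeros((1, 128, 32, 32))         # encoder features saved before the second pooling
upsampled = np.zeros((1, 128, 32, 32))  # bottleneck output after up2: 256 -> 128 channels, 16 -> 32 pixels

merged = np.concatenate([upsampled, e2], axis=1)   # stack along the channel axis
print(merged.shape)
```

The merged tensor has 128 + 128 = 256 channels, which is why `dec2` is built with 256 input channels.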

Exercise 12.2: How is U-Net used in Stable Diffusion?

During training, Stable Diffusion adds noise to image latents (images compressed by an autoencoder) and trains a U-Net to predict that noise. The U-Net receives a noisy latent, the timestep, and a text embedding (via cross-attention), and outputs the predicted noise. At generation time, many denoising steps starting from pure noise yield a clean latent, which is decoded into an image. The U-Net's skip connections preserve spatial structure during this process.
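The noising step and its training target can be sketched in NumPy. This follows the standard DDPM-style formulation; the linear beta schedule, the step count, and the array standing in for a latent are illustrative assumptions, not the settings of any particular Stable Diffusion release.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule; the endpoints and step count are illustrative assumptions
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)          # remaining signal fraction at each step

def add_noise(x0, t):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps                          # eps is the regression target for the network

x0 = rng.normal(size=(4, 8, 8))              # stand-in for a latent, not a real image
x_t, eps = add_noise(x0, t=500)

# With a perfect noise prediction, the clean sample is recovered in closed form
x0_hat = (x_t - np.sqrt(1.0 - alpha_bar[500]) * eps) / np.sqrt(alpha_bar[500])
print(np.abs(x0_hat - x0).max())
```

A trained network only approximates `eps`, which is one reason generation proceeds in many small denoising steps rather than a single jump.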

Exercise 12.3: Compare encoder-decoder vs. purely convolutional approaches for segmentation

Purely convolutional (dilated convolutions) maintains resolution but is computationally expensive. Encoder-decoder reduces resolution (efficient) then upsamples, using skip connections to recover detail. Encoder-decoder is standard because it balances efficiency and precision. Modern approaches like SegFormer use transformer encoders with lightweight decoders.

Chapter Summary

  • Encoder-decoder architecture: compress → bottleneck → expand for dense prediction
  • U-Net adds skip connections between encoder and decoder for precise localization
  • Critical for segmentation, diffusion models (Stable Diffusion), and medical imaging
  • ConvTranspose2d (transposed convolution) handles learned upsampling
Part IV

Deep Learning Frameworks

Tools that power production AI

Chapter 13

PyTorch: Preferred for Research

Learning Objectives

  • Master PyTorch's tensor operations, autograd, and nn.Module
  • Build complete training loops with DataLoaders
  • Use GPU acceleration and model saving/loading
Python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Model
class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(20, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 2)
        )
    def forward(self, x): return self.net(x)

model = Classifier().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Data
X = torch.randn(1000, 20)
y = (X[:, 0] > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# Training Loop
for epoch in range(20):
    model.train()
    total_loss = 0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        loss = criterion(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    if (epoch+1) % 5 == 0:
        print(f"Epoch {epoch+1}: loss={total_loss/len(loader):.4f}")

# Save & load
torch.save(model.state_dict(), 'model.pth')
model.load_state_dict(torch.load('model.pth'))

Project: Complete PyTorch Training Pipeline

Python
import torch, torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Config
EPOCHS, BATCH_SIZE, LR = 10, 128, 1e-3
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Data with augmentation
train_tf = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.ToTensor(), transforms.Normalize((0.1307,),(0.3081,))])
test_tf = transforms.Compose([
    transforms.ToTensor(), transforms.Normalize((0.1307,),(0.3081,))])

train_loader = DataLoader(datasets.MNIST('data', train=True, download=True,
    transform=train_tf), batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(datasets.MNIST('data', train=False,
    transform=test_tf), batch_size=1000)

# Model with best practices
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784,512), nn.BatchNorm1d(512), nn.GELU(), nn.Dropout(0.2),
    nn.Linear(512,256), nn.BatchNorm1d(256), nn.GELU(), nn.Dropout(0.2),
    nn.Linear(256,10)
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=LR*10, total_steps=EPOCHS*len(train_loader))
criterion = nn.CrossEntropyLoss()

best_acc = 0
for epoch in range(EPOCHS):
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        loss = criterion(model(X), y)
        optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()

    model.eval()
    correct = 0
    with torch.no_grad():
        for X, y in test_loader:
            X, y = X.to(device), y.to(device)
            correct += (model(X).argmax(1) == y).sum().item()
    acc = correct / len(test_loader.dataset)
    is_best = acc > best_acc  # decide before updating, so ties are not reported as new bests
    if is_best:
        best_acc = acc
        torch.save(model.state_dict(), 'best_model.pth')
    print(f"Epoch {epoch+1}: acc={acc:.2%} {'⭐ best!' if is_best else ''}")

Exercises

Exercise 13.1: Why call model.eval() and torch.no_grad() during evaluation?

model.eval(): Switches BatchNorm to use running statistics (not batch statistics) and disables Dropout. torch.no_grad(): Disables gradient tracking, which saves memory and speeds up inference. Without it, PyTorch builds the computational graph unnecessarily during evaluation.
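Both effects are easy to observe on a single layer. The sketch below uses a standalone Dropout layer and a made-up tensor `w` purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

layer.train()
train_out = layer(x)          # roughly half the entries zeroed, the rest scaled by 1/(1-p) = 2

layer.eval()
eval_out = layer(x)           # Dropout is the identity in eval mode

w = torch.ones(3, requires_grad=True)
with torch.no_grad():
    y = (w * 2).sum()         # no graph is recorded inside this block

print((train_out == 0).float().mean().item(), torch.equal(eval_out, x), y.requires_grad)
```

Note that the two switches are independent: `model.eval()` changes layer behavior, while `torch.no_grad()` only turns off gradient tracking.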

Exercise 13.2: What is the difference between model.parameters() and model.state_dict()?

parameters(): Returns an iterator over learnable parameters (weights, biases). Used for optimizers. state_dict(): Returns a dictionary of ALL model state including parameters AND non-learnable buffers (BatchNorm running mean/var). Always save state_dict() for checkpointing.
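The difference shows up clearly on a single BatchNorm layer; the layer size of 4 is arbitrary.

```python
import torch.nn as nn

bn = nn.BatchNorm1d(4)

param_names = [name for name, _ in bn.named_parameters()]
state_names = list(bn.state_dict().keys())

print(param_names)   # learnable tensors only
print(state_names)   # parameters plus buffers such as the running statistics
```

Saving only the parameters would drop the running mean and variance, and the restored model would then normalize incorrectly in eval mode.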

Exercise 13.3: How do you move a model and data to GPU?
device = torch.device('cuda')
model = model.to(device)      # Move model
X = X.to(device)              # Move data
# Both must be on same device!

Chapter Summary

  • PyTorch: define-by-run, Pythonic, dominant in research
  • Standard loop: forward → loss → backward → step (with zero_grad)
  • Always: model.eval() + torch.no_grad() for inference
  • Save state_dict() for checkpoints; use to(device) for GPU
Chapter 14

TensorFlow/Keras, JAX & Custom Training

Learning Objectives

  • Build models with TensorFlow/Keras high-level API
  • Understand JAX for high-performance, functional ML
  • Master automatic differentiation concepts
  • Write custom training loops for full control

TensorFlow/Keras

Python
import tensorflow as tf

# Keras Sequential: high-level API
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train in one line!
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model.fit(x_train, y_train, epochs=5, batch_size=128,
          validation_split=0.1,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)])

JAX: Functional, XLA-Compiled

Python
import jax
import jax.numpy as jnp
from jax import grad, jit, vmap

# Automatic differentiation: the core of all DL
def loss_fn(w, x, y):
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

# grad() returns a function that computes gradients
grad_fn = jit(grad(loss_fn))  # JIT-compiled gradient function

# vmap: automatic vectorization (batch processing)
def predict_single(w, x):
    return jnp.dot(x, w)

predict_batch = vmap(predict_single, in_axes=(None, 0))
# Automatically handles batching without writing loops!
Framework | Style | Best For | Used By
PyTorch | Imperative (define-by-run) | Research, prototyping | Meta, most academia
TensorFlow/Keras | Declarative (high-level) | Production, deployment | Google, mobile (TF Lite)
JAX | Functional (transformations) | High-performance research | Google DeepMind (Gemini)

Custom Training Loop (PyTorch)

Python
import torch
import torch.nn as nn

# Full-control training loop: mixed precision, gradient clipping, per-step LR scheduling
def train_one_epoch(model, loader, optimizer, scheduler, scaler, device):
    model.train()
    total_loss = 0
    for batch_idx, (X, y) in enumerate(loader):
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()

        # Mixed precision forward
        with torch.cuda.amp.autocast():
            output = model(X)
            loss = nn.functional.cross_entropy(output, y)

        # Scaled backward + gradient clipping
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()

        total_loss += loss.item()
    return total_loss / len(loader)

Future: The Framework Landscape

PyTorch dominates research and is now standard in industry. JAX powers Google's Gemini and is growing for high-performance training (TPU-optimized). TensorFlow remains strong for production deployment (TF Serving, TF Lite for mobile). The trend: PyTorch for development, ONNX for cross-framework deployment. Understanding all three ensures you can work anywhere in the AI ecosystem.

Exercises

Exercise 14.1: What is automatic differentiation and how does it differ from numerical differentiation?

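Numerical: Approximates gradients with (f(x+h)-f(x))/h. This is slow (at least one extra function evaluation per parameter) and sensitive to the choice of h, which trades truncation error against round-off error. Automatic: Records each operation in a graph during the forward pass, then applies the chain rule through that graph in a backward pass. This yields gradients that are exact up to floating-point precision, and reverse mode computes the gradient for all parameters at a cost of a small constant multiple of one forward pass. All major DL frameworks use reverse-mode autodiff (equivalent to backprop) for training.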
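The contrast can be made concrete with a tiny reverse-mode implementation in plain Python. This is a teaching sketch, not how any framework is implemented internally: the `Value` class is hypothetical and supports only addition and multiplication of scalars.

```python
class Value:
    """A scalar that records how it was computed so gradients can flow backward."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data, self.grad = data, 0.0
        self.parents, self.local_grads = parents, local_grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topological order so each node's gradient is complete before it is propagated
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node.parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad   # chain rule, accumulated over all uses

def f(a, b):
    return a * b + a * a          # df/da = b + 2a, df/db = a

a, b = Value(3.0), Value(4.0)
f(a, b).backward()                # one backward pass yields both partial derivatives

h = 1e-6                          # forward difference: one extra evaluation per input
num_da = (f(Value(3.0 + h), Value(4.0)).data - f(Value(3.0), Value(4.0)).data) / h
print(a.grad, b.grad, num_da)
```

The backward pass returns both partial derivatives from one traversal, while the finite-difference estimate needs a separate evaluation per input and carries an error that depends on h.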

Exercise 14.2: When would you write a custom training loop instead of using model.fit()?

(1) GAN training (alternating generator/discriminator). (2) Mixed precision with gradient scaling. (3) Gradient accumulation for large batch sizes. (4) Multi-task learning with multiple losses. (5) Custom learning rate scheduling. (6) Research with novel training procedures. model.fit() is convenient but inflexible.
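Gradient accumulation, item (3) above, rests on a simple identity that can be checked numerically. The sketch uses a linear model with mean squared error in NumPy, with made-up data and equal-sized micro-batches; `mse_grad` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = rng.normal(size=64)
w = rng.normal(size=5)

def mse_grad(Xb, yb, w):
    """Gradient of mean squared error for a linear model on one batch."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full = mse_grad(X, y, w)                       # one pass over the whole batch of 64

accum_steps = 4                                # four micro-batches of 16
accumulated = np.zeros_like(w)
for Xb, yb in zip(np.split(X, accum_steps), np.split(y, accum_steps)):
    accumulated += mse_grad(Xb, yb, w) / accum_steps   # scale each micro-batch contribution

print(np.allclose(full, accumulated))
```

In a PyTorch loop the same idea amounts to dividing each micro-batch loss by `accum_steps`, calling `backward()` on every micro-batch, and calling `optimizer.step()` and `optimizer.zero_grad()` only once per `accum_steps` micro-batches.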

Exercise 14.3: What makes JAX different from PyTorch?

JAX is functional: model code is written as pure functions without hidden mutable state, and those functions are transformed via grad(), jit(), vmap(). This enables: (1) Composable transformations (grad of grad, vmap of grad). (2) Just-in-time compilation with XLA for high speed. (3) Parallelization across TPU/GPU devices through dedicated transformations and sharding. Trade-off: steeper learning curve, less ecosystem support than PyTorch.

Chapter Summary

  • TensorFlow/Keras: fastest prototyping with model.fit(); best for deployment
  • JAX: functional transformations + XLA compilation for maximum performance
  • Automatic differentiation is the engine behind all DL: exact gradients for free
  • Custom training loops give full control for advanced training recipes

🎓 Congratulations!

You've completed Deep Learning. You now understand neural networks from individual neurons to production-scale architectures. These foundations power every AI breakthrough, from GPT to Stable Diffusion to AlphaFold. You're ready to build the future of AI.

© 2025 EduArtha - Deep Learning Complete Guide