EduArtha Learning Series

Mathematics for AI

Everything in AI is math. Without this, you cannot understand why any algorithm works. This book covers the four pillars — Linear Algebra, Calculus, Probability, and Optimization — with rigorous discussion, Python code, and AI applications.

⏱ Estimated study time: 3–6 months | 13 Chapters | 50+ Exercises

Part I

Linear Algebra

The language of data and transformations

Chapter 1

Vectors & Vector Spaces

Learning Objectives

Understand vectors as ordered arrays of numbers and their geometric meaning
Master vector operations: addition, scalar multiplication, dot product, cross product
Learn about vector spaces, subspaces, basis, and dimension
Connect vector concepts to real AI applications

What is a Vector?

A vector is an ordered list of numbers representing a point or direction in space. In AI, every data point is a vector. An image is a vector of pixel values. A sentence is a vector of word embeddings. A customer profile is a vector of features.

v⃗ = [v₁, v₂, ..., vₙ] ∈ ℝⁿ

A vector in ℝ² has 2 components (2D plane), ℝ³ has 3 (3D space), and ℝⁿ has n components (n-dimensional hyperspace). In ML, n can be thousands or millions — a 224×224 color image is a vector in ℝ¹⁵⁰,⁵²⁸.

Vector Operations

Addition and Scalar Multiplication

u⃗ + v⃗ = [u₁+v₁, u₂+v₂, ..., uₙ+vₙ] c·v⃗ = [c·v₁, c·v₂, ..., c·vₙ]

Python
import numpy as np

u = np.array([3, 4, 1])
v = np.array([1, -2, 5])

print("Addition:", u + v)          # [4, 2, 6]
print("Scalar mult:", 3 * u)       # [9, 12, 3]
print("Magnitude:", np.linalg.norm(u))  # √(9+16+1) = √26 ≈ 5.1
print("Unit vector:", u / np.linalg.norm(u))  # Normalize to length 1

Dot Product

The dot product measures how aligned two vectors are. It is the single most important operation in ML — used in every neural network layer, every linear model, every attention mechanism.

u⃗ · v⃗ = u₁v₁ + u₂v₂ + ... + uₙvₙ = ||u⃗|| · ||v⃗|| · cos(θ)

If the dot product is positive, vectors point in similar directions. If zero, they are perpendicular (orthogonal). If negative, they point in opposite directions.

Python
u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

dot = np.dot(u, v)  # 1*4 + 2*5 + 3*6 = 32

# Cosine similarity — measures angle between vectors
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(f"Cosine similarity: {cos_sim:.4f}")  # 0.9746 — very similar

# This is how search engines find similar documents!

Why This Matters in AI

Neural network layers compute y = Wx + b — that's a dot product of weights W with input x for every neuron. Word embeddings use cosine similarity to find that "king" is to "queen" as "man" is to "woman". Recommendation engines compute dot products between user and item vectors to predict ratings.

Cross Product (3D only)

The cross product produces a vector perpendicular to both input vectors. Used in computer graphics, 3D physics, and robotics.

u⃗ × v⃗ = [u₂v₃ - u₃v₂, u₃v₁ - u₁v₃, u₁v₂ - u₂v₁]

Python
u = np.array([1, 0, 0])
v = np.array([0, 1, 0])
cross = np.cross(u, v)
print("Cross product:", cross)  # [0, 0, 1] — perpendicular to both

Vector Spaces, Basis & Dimension

A vector space is a collection of vectors that is closed under addition and scalar multiplication. The basis is the smallest set of vectors that can represent every vector in the space through linear combinations. The dimension is the number of basis vectors.

In ℝ³, the standard basis is {[1,0,0], [0,1,0], [0,0,1]}. Any 3D vector can be written as a combination of these three. In ML, finding a good basis (like PCA does) means finding the most informative directions in your data.

Linear Independence

Vectors are linearly independent if no vector can be written as a combination of the others. If vectors are dependent, they carry redundant information. In ML, redundant features waste computation and can cause numerical instability.

Python
# Check linear independence via matrix rank
vectors = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0]  # This = vec1 + vec2 → dependent!
])
print("Rank:", np.linalg.matrix_rank(vectors))  # 2, not 3 → dependent

Exercises

Exercise 1.1: Compute the cosine similarity between u=[1,2,3] and v=[3,2,1]. Are they similar?

u·v = 3+4+3 = 10, ||u|| = √14, ||v|| = √14

cos(θ) = 10/14 ≈ 0.714 — moderately similar (angle ≈ 44°)

Exercise 1.2: Why is cosine similarity preferred over Euclidean distance for text documents?

Documents of different lengths have very different magnitudes. A 1000-word document has much larger vector magnitude than a 100-word document, even if they discuss the same topic. Cosine similarity only measures the angle (direction), ignoring magnitude, so it compares content regardless of length.

Exercise 1.3: Given vectors a=[2,1], b=[1,3], c=[5,5]. Is c a linear combination of a and b?

We need scalars α, β such that α[2,1] + β[1,3] = [5,5].

2α + β = 5 and α + 3β = 5. From equation 1: β = 5-2α. Substituting: α + 3(5-2α) = 5 → α + 15 - 6α = 5 → -5α = -10 → α = 2, β = 1.

Check: 2[2,1] + 1[1,3] = [4,2] + [1,3] = [5,5] ✓. Yes, c = 2a + b.

Chapter Summary

Vectors represent data points in n-dimensional space; every ML input is a vector
Dot product measures similarity and is the core operation in neural networks
Cosine similarity compares direction (meaning) independent of magnitude
Linear independence ensures features carry non-redundant information

Chapter 2

Matrices & Matrix Operations

Learning Objectives

Understand matrices as linear transformations and data containers
Master matrix multiplication, transpose, inverse, and determinant
Learn special matrices: identity, diagonal, symmetric, orthogonal
See how matrix operations power neural networks

What is a Matrix?

A matrix is a 2D array of numbers with m rows and n columns (m × n). In AI, matrices represent: datasets (rows = samples, columns = features), weight matrices in neural networks, images (pixel grids), and transformation operations.

A ∈ ℝᵐˣⁿ where Aᵢⱼ = element at row i, column j

Python
import numpy as np

# Dataset: 4 samples, 3 features
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [7.0, 3.2, 4.7],
    [6.3, 3.3, 6.0]
])
print(f"Shape: {X.shape}")  # (4, 3) — 4×3 matrix

Matrix Multiplication

Matrix multiplication is not element-wise. For A (m×n) and B (n×p), the result C = AB is (m×p) where each element is a dot product of a row of A with a column of B.

Cᵢⱼ = Σₖ Aᵢₖ · Bₖⱼ (row i of A · column j of B)

The Dimension Rule

A (m×n) × B (n×p) = C (m×p). The inner dimensions must match. This is why neural network layer dimensions must be compatible: a layer with 128 outputs feeding into a layer expecting 128 inputs.

Python
# Neural network forward pass: y = Wx + b
W = np.random.randn(64, 128)    # 64 neurons, 128 inputs
x = np.random.randn(128, 1)     # input vector
b = np.random.randn(64, 1)      # bias vector

y = W @ x + b                    # (64×128) @ (128×1) = (64×1)
print(f"Output shape: {y.shape}")  # (64, 1) — 64 neuron outputs

Transpose, Inverse & Determinant

Python
A = np.array([[2, 1], [5, 3]])

# Transpose: flip rows and columns
print("Aᵀ =\n", A.T)

# Determinant: scalar that measures how A scales area/volume
det = np.linalg.det(A)
print(f"det(A) = {det:.1f}")  # 2*3 - 1*5 = 1

# Inverse: A⁻¹ such that A·A⁻¹ = I (only if det ≠ 0)
A_inv = np.linalg.inv(A)
print("A⁻¹ =\n", A_inv)
print("A·A⁻¹ =\n", (A @ A_inv).round(2))  # Identity matrix

Special Matrices

Type	Property	AI Use
Identity (I)	Iᵢᵢ=1, rest 0	Skip connections in ResNets
Diagonal	Non-zero only on diagonal	Scaling features independently
Symmetric	A = Aᵀ	Covariance matrices, kernels
Orthogonal	AᵀA = I	Rotation, preserves distances

Why This Matters in AI

Every neural network layer is a matrix multiplication: output = activation(W·input + b). Training a neural network = finding the right values in W. The entire forward pass of a transformer with 175 billion parameters is a sequence of matrix multiplications. GPUs are fast because they are optimized for matrix math.

Exercises

Exercise 2.1: Multiply A = [[1,2],[3,4]] by B = [[5,6],[7,8]] by hand and verify with NumPy

C₁₁ = 1·5 + 2·7 = 19, C₁₂ = 1·6 + 2·8 = 22, C₂₁ = 3·5 + 4·7 = 43, C₂₂ = 3·6 + 4·8 = 48

A = np.array([[1,2],[3,4]])
B = np.array([[5,6],[7,8]])
print(A @ B)  # [[19,22],[43,48]]

Exercise 2.2: Why is matrix multiplication not commutative (AB ≠ BA)?

Dimensions may not even allow it: if A is 3×2 and B is 2×5, AB is 3×5 but BA is undefined. Even when both are square, the row-column dot products differ. This is important in neural networks — the order of transformations matters.

Exercise 2.3: What does a determinant of 0 mean geometrically and computationally?

Geometrically: The matrix collapses space into a lower dimension (squishes a 2D plane into a line, or 3D volume into a plane). Information is lost.

Computationally: The matrix is singular — it has no inverse, the system Ax=b has either no solution or infinitely many. In ML, this causes numerical instability (e.g., covariance matrix inversion in Gaussian models).

Chapter Summary

Matrices represent data, weights, and transformations in AI
Matrix multiplication is the core computation in neural networks
The determinant measures volume scaling; zero = singular (no inverse)
Special matrices (identity, orthogonal) have important roles in network architectures

Chapter 3

Eigenvalues, Eigenvectors & Matrix Decompositions

Learning Objectives

Understand eigenvalues and eigenvectors intuitively and mathematically
Master Singular Value Decomposition (SVD)
Apply PCA for dimensionality reduction using eigendecomposition
Connect decompositions to recommendation systems and data compression

Eigenvalues & Eigenvectors

An eigenvector of matrix A is a special vector that, when transformed by A, only gets scaled (not rotated). The scaling factor is the eigenvalue.

A · v⃗ = λ · v⃗ where v⃗ = eigenvector, λ = eigenvalue

Think of it this way: most vectors change direction when multiplied by A. Eigenvectors are the stubborn ones that refuse to change direction — they only stretch or shrink. The eigenvalue tells you by how much.

Python
import numpy as np

A = np.array([[4, 2],
              [1, 3]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)    # [5, 2]
print("Eigenvectors:\n", eigenvectors)

# Verify: A·v = λ·v
v = eigenvectors[:, 0]   # First eigenvector
lam = eigenvalues[0]     # First eigenvalue
print("A·v =", A @ v)
print("λ·v =", lam * v)  # Same! ✓

Why This Matters in AI

PCA finds eigenvectors of the covariance matrix — they are the principal directions of maximum variance in your data. Google's PageRank is the dominant eigenvector of the web link matrix. Spectral clustering uses eigenvectors of graph Laplacians. Eigenvalues indicate feature importance.

Singular Value Decomposition (SVD)

SVD decomposes any matrix (even non-square) into three matrices:

A = U · Σ · Vᵀ

Where U contains left singular vectors, Σ is diagonal with singular values (importance scores), and Vᵀ contains right singular vectors. The singular values are sorted from largest to smallest.

Python
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

U, sigma, Vt = np.linalg.svd(A)
print("Singular values:", sigma.round(3))

# Low-rank approximation: keep only top-k singular values
# This is how image compression and recommendation engines work!
k = 2
A_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
print("Rank-2 approximation:\n", A_approx.round(2))

PCA via Eigendecomposition

Python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data  # 150 samples, 4 features

# Reduce 4D → 2D using PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(f"Original: {X.shape}")            # (150, 4)
print(f"Reduced: {X_2d.shape}")            # (150, 2)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")

# Under the hood, PCA computes eigenvectors of the covariance matrix
cov_matrix = np.cov(X.T)
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print("Top 2 eigenvalues:", sorted(eigenvalues, reverse=True)[:2])

Exercises

Exercise 3.1: If a 2×2 matrix has eigenvalues 5 and -1, what can you say about it?

det(A) = 5 × (-1) = -5 (non-zero, so A is invertible). trace(A) = 5 + (-1) = 4. The matrix stretches along one eigenvector by 5× and flips+shrinks along the other. The negative eigenvalue means it reverses direction along that eigenvector.

Exercise 3.2: How does Netflix use SVD for recommendations?

Netflix represents users×movies as a sparse rating matrix. SVD decomposes it into User-factor × Importance × Movie-factor matrices. Each "factor" captures a latent concept (like "action-loving" or "comedy-preference"). To predict a rating, compute the dot product of the user's factor vector with the movie's factor vector. Missing entries are predicted by the low-rank approximation.

Exercise 3.3: You have 1000 features. PCA shows the first 50 components retain 95% variance. What do you do?

Use 50 components! You reduce dimensionality by 20× while keeping 95% of the information. Benefits: faster training, less overfitting, better visualization, reduced storage. Tradeoff: the 50 components are linear combinations of original features, so interpretability is reduced.

Chapter Summary

Eigenvectors are directions preserved by a transformation; eigenvalues are scaling factors
SVD decomposes any matrix into U·Σ·Vᵀ — basis of compression and recommendations
PCA uses eigendecomposition of the covariance matrix to find principal directions
Low-rank approximation keeps the most important information with fewer dimensions

Chapter 4

Tensors (Multi-dimensional Arrays)

Learning Objectives

Understand tensors as generalizations of scalars, vectors, and matrices
Work with tensors in NumPy, PyTorch, and TensorFlow
Master tensor operations: reshaping, broadcasting, einsum
See how tensors represent real AI data (images, video, text batches)

The Tensor Hierarchy

Rank	Name	Shape Example	AI Example
0	Scalar	()	Loss value: 0.342
1	Vector	(128,)	Word embedding
2	Matrix	(64, 128)	Weight matrix, grayscale image
3	3D Tensor	(32, 28, 28)	Batch of grayscale images
4	4D Tensor	(32, 3, 224, 224)	Batch of color images
5	5D Tensor	(8, 16, 3, 224, 224)	Batch of video clips

Python
import numpy as np

# Batch of 32 color images, 224×224 pixels, 3 channels (RGB)
batch = np.random.randn(32, 3, 224, 224)
print(f"Shape: {batch.shape}")  # (32, 3, 224, 224)
print(f"Rank: {batch.ndim}")    # 4
print(f"Total elements: {batch.size:,}")  # 4,816,896

# Reshaping — critical for connecting conv layers to dense layers
flat = batch.reshape(32, -1)   # (32, 150528)
print(f"Flattened: {flat.shape}")

# Broadcasting — NumPy auto-expands dimensions
image = np.random.randn(3, 224, 224)
mean = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
normalized = image - mean  # mean broadcasts across H×W

Einstein Summation (einsum)

Einsum is a powerful notation that expresses any tensor operation in a compact string:

Python
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

# Matrix multiplication
C = np.einsum('ik,kj->ij', A, B)  # Same as A @ B

# Batch matrix multiplication (used in attention)
Q = np.random.randn(8, 10, 64)  # 8 heads, 10 tokens, 64 dim
K = np.random.randn(8, 10, 64)
attn = np.einsum('bij,bkj->bik', Q, K)  # (8, 10, 10) attention scores

Why This Matters in AI

Deep learning IS tensor computation. TensorFlow is literally named after tensors. Every layer transforms tensors: Conv2D reshapes 4D tensors, attention mechanisms use batch matrix multiplication on 3D tensors, and gradients flow through tensors during backpropagation. Understanding tensor shapes is essential for debugging model architectures.

Exercises

Exercise 4.1: A video dataset has 100 clips, each 30 frames, 3 channels, 128×128 pixels. What is the tensor shape?

(100, 30, 3, 128, 128) — a rank-5 tensor with 100 × 30 × 3 × 128 × 128 = 147,456,000 elements.

Exercise 4.2: Why does reshaping/flattening not lose information?

Reshaping only changes how elements are arranged in memory, not the elements themselves. A (2,3,4) tensor with 24 elements can be reshaped to (6,4) or (24,) — same 24 numbers, different interpretation. However, spatial relationships (pixel neighborhoods in images) are lost when flattening, which is why CNNs are preferred over dense layers for images.

Exercise 4.3: Explain broadcasting: what happens when you add a (3,1) tensor to a (3,4) tensor?

NumPy "broadcasts" the (3,1) tensor by repeating it 4 times along the second dimension, effectively making it (3,4). Then element-wise addition proceeds. Result is (3,4). Rule: dimensions are compatible if they are equal or one of them is 1.

Chapter Summary

Tensors generalize scalars (0D), vectors (1D), matrices (2D) to n-dimensions
AI data is naturally multi-dimensional: image batches are 4D, video is 5D
Reshaping, broadcasting, and einsum are essential tensor manipulation tools
Understanding tensor shapes is crucial for debugging neural network architectures

Part II

Calculus

The engine of learning — how models improve

Chapter 5

Derivatives & Partial Derivatives

Learning Objectives

Understand derivatives as rates of change and slopes of functions
Master differentiation rules: power, product, quotient, chain
Extend to partial derivatives for multi-variable functions
See how derivatives drive the learning process in ML

The Derivative: Rate of Change

The derivative f'(x) tells you how fast f(x) changes as x changes. In ML, the loss function L(w) measures error, and ∂L/∂w tells you how the error changes as you adjust each weight — this is how neural networks learn.

f'(x) = lim[h→0] (f(x+h) - f(x)) / h

Essential Differentiation Rules

Rule	Formula	Example
Power	d/dx[xⁿ] = n·xⁿ⁻¹	d/dx[x³] = 3x²
Sum	d/dx[f+g] = f'+g'	d/dx[x²+3x] = 2x+3
Product	d/dx[f·g] = f'g + fg'	d/dx[x·eˣ] = eˣ + x·eˣ
Chain	d/dx[f(g(x))] = f'(g(x))·g'(x)	d/dx[sin(x²)] = cos(x²)·2x
Exponential	d/dx[eˣ] = eˣ	d/dx[e²ˣ] = 2e²ˣ
Logarithm	d/dx[ln(x)] = 1/x	d/dx[ln(2x)] = 1/x

Partial Derivatives

When a function has multiple variables, a partial derivative measures the rate of change with respect to one variable while holding all others constant. In a neural network with millions of weights, we compute partial derivatives with respect to each weight separately.

f(x, y) = x²y + 3xy² → ∂f/∂x = 2xy + 3y² (treat y as constant)
∂f/∂y = x² + 6xy (treat x as constant)

Python
import numpy as np

# Numerical derivative approximation
def numerical_derivative(f, x, h=1e-7):
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: f(x) = x³ + 2x, f'(x) = 3x² + 2
f = lambda x: x**3 + 2*x

x = 3.0
numerical = numerical_derivative(f, x)
analytical = 3 * x**2 + 2
print(f"Numerical: {numerical:.6f}")   # 29.000000
print(f"Analytical: {analytical:.6f}")  # 29.000000 ✓

# Partial derivatives of f(x,y) = x²y + 3xy²
def f_xy(x, y):
    return x**2 * y + 3 * x * y**2

# ∂f/∂x at (2, 3)
df_dx = (f_xy(2 + 1e-7, 3) - f_xy(2 - 1e-7, 3)) / (2e-7)
print(f"∂f/∂x at (2,3) = {df_dx:.2f}")  # 2*2*3 + 3*9 = 39

Why This Matters in AI

Training = minimizing the loss function by adjusting weights. The partial derivative ∂L/∂wᵢ tells us: "if I increase weight wᵢ by a tiny amount, how does the loss change?" If the derivative is positive, decreasing wᵢ reduces the loss. If negative, increasing wᵢ helps. This is gradient descent.

Exercises

Exercise 5.1: Find ∂f/∂x and ∂f/∂y for f(x,y) = 3x²y³ - 2xy + 5

∂f/∂x = 6xy³ - 2y (differentiate w.r.t. x, treat y as constant)

∂f/∂y = 9x²y² - 2x (differentiate w.r.t. y, treat x as constant)

Exercise 5.2: Derive the derivative of the sigmoid function σ(x) = 1/(1+e⁻ˣ)

Let u = 1 + e⁻ˣ, so σ = u⁻¹.

σ' = -u⁻² · (-e⁻ˣ) = e⁻ˣ / (1+e⁻ˣ)²

Note: e⁻ˣ/(1+e⁻ˣ) = 1 - σ(x), so:

σ'(x) = σ(x) · (1 - σ(x)) — this elegant form makes backpropagation efficient!

Exercise 5.3: Why is the ReLU derivative problematic at x=0? How is this handled?

ReLU = max(0, x). Its derivative is 1 for x > 0 and 0 for x < 0. At x = 0, it's technically undefined (not differentiable). In practice, frameworks define it as 0 at x=0 (subgradient), which works fine because the probability of any neuron being exactly 0 is effectively zero with floating-point arithmetic.

Chapter Summary

Derivatives measure instantaneous rate of change — slopes of tangent lines
Partial derivatives extend this to multi-variable functions (one variable at a time)
The sigmoid derivative σ'(x) = σ(x)(1-σ(x)) is central to backpropagation
Every weight update in ML uses partial derivatives of the loss function

Chapter 6

Chain Rule, Gradients & Multivariate Calculus

Learning Objectives

Master the chain rule — the mathematical backbone of backpropagation
Understand gradient vectors and their geometric meaning
Learn multivariate calculus concepts essential for optimization
See how gradients flow through computational graphs

The Chain Rule: Core of Backpropagation

If y = f(g(x)), then dy/dx = f'(g(x)) · g'(x). This lets us differentiate compositions of functions, which is exactly what a neural network is — layers of composed functions.

dy/dx = dy/du · du/dx (chain of derivatives)

A 3-layer neural network computes: y = f₃(f₂(f₁(x))). The chain rule gives us:

∂L/∂w₁ = ∂L/∂y · ∂y/∂f₂ · ∂f₂/∂f₁ · ∂f₁/∂w₁

This is backpropagation — computing gradients layer by layer from output back to input.

Python
import numpy as np

# Example: y = sin(x²)
# dy/dx = cos(x²) · 2x  (chain rule: outer' × inner')
x = 2.0
analytical = np.cos(x**2) * 2 * x

# Numerical verification
h = 1e-7
numerical = (np.sin((x+h)**2) - np.sin((x-h)**2)) / (2*h)
print(f"Analytical: {analytical:.6f}")
print(f"Numerical:  {numerical:.6f}")  # Match! ✓

# Backprop through a simple 2-layer network
def forward_and_backward(x, w1, w2):
    # Forward pass
    z1 = w1 * x           # Layer 1
    a1 = np.maximum(0, z1) # ReLU
    z2 = w2 * a1           # Layer 2
    loss = (z2 - 1.0)**2   # MSE loss (target = 1.0)

    # Backward pass (chain rule!)
    dL_dz2 = 2 * (z2 - 1.0)       # ∂L/∂z2
    dL_dw2 = dL_dz2 * a1          # ∂L/∂w2 = ∂L/∂z2 · ∂z2/∂w2
    dL_da1 = dL_dz2 * w2          # ∂L/∂a1 = ∂L/∂z2 · ∂z2/∂a1
    dL_dz1 = dL_da1 * (z1 > 0)   # ∂L/∂z1 (ReLU derivative)
    dL_dw1 = dL_dz1 * x           # ∂L/∂w1

    return loss, dL_dw1, dL_dw2

loss, grad_w1, grad_w2 = forward_and_backward(2.0, 0.5, 0.3)
print(f"Loss: {loss:.4f}, ∂L/∂w1: {grad_w1:.4f}, ∂L/∂w2: {grad_w2:.4f}")

The Gradient Vector

The gradient ∇f collects all partial derivatives into a vector. It points in the direction of steepest ascent. Moving in the negative gradient direction = steepest descent = fastest way to reduce the loss.

∇f(x,y,z) = [∂f/∂x, ∂f/∂y, ∂f/∂z] — points toward steepest increase

Python
# Gradient of f(x,y) = x² + y² (a bowl)
def f(xy):
    return xy[0]**2 + xy[1]**2

def gradient(xy):
    return np.array([2*xy[0], 2*xy[1]])

# Gradient descent: start at (5, 3), walk downhill
pos = np.array([5.0, 3.0])
lr = 0.1

for i in range(20):
    grad = gradient(pos)
    pos = pos - lr * grad  # Step opposite to gradient
    if i % 5 == 0:
        print(f"Step {i:2d}: pos=({pos[0]:.3f}, {pos[1]:.3f}), f={f(pos):.4f}")
# Converges to (0, 0) — the minimum!

Exercises

Exercise 6.1: Apply the chain rule: find d/dx of e^(sin(x²))

Let u = sin(x²), outer = eᵘ. Then:

d/dx[e^(sin(x²))] = e^(sin(x²)) · cos(x²) · 2x

Chain: (derivative of eᵘ) × (derivative of sin) × (derivative of x²)

Exercise 6.2: Why does the vanishing gradient problem occur in deep networks?

In backprop, gradients are multiplied through layers (chain rule). If each layer's gradient is < 1 (as with sigmoid whose max derivative is 0.25), multiplying many small numbers gives an exponentially small gradient: 0.25¹⁰ = 0.0000009. Early layers barely learn. Solutions: ReLU (gradient = 1 for positive), skip connections (ResNets), batch normalization.

Exercise 6.3: Compute ∇f for f(x,y,z) = x²yz + 2y²z at point (1,2,3)

∂f/∂x = 2xyz = 2(1)(2)(3) = 12

∂f/∂y = x²z + 4yz = (1)(3) + 4(2)(3) = 27

∂f/∂z = x²y + 2y² = (1)(2) + 2(4) = 10

∇f(1,2,3) = [12, 27, 10]

Chapter Summary

The chain rule lets us differentiate composed functions — this IS backpropagation
The gradient vector ∇f points toward steepest increase; -∇f toward steepest decrease
Gradients flow backward through the computational graph, layer by layer
Vanishing gradients occur when many small derivatives multiply — solved by ReLU and skip connections

Chapter 7

Jacobian, Hessian Matrices & Optimization via Calculus

Learning Objectives

Understand the Jacobian matrix for vector-valued functions
Learn the Hessian matrix and second-order optimization
Apply calculus concepts to find optima (minima and maxima)
Connect to Newton's method and advanced optimizers

The Jacobian Matrix

When a function maps multiple inputs to multiple outputs, the Jacobian is the matrix of all first-order partial derivatives. It generalizes the gradient to vector-valued functions.

J = [∂fᵢ/∂xⱼ] — row i = partials of function i, column j = with respect to variable j

Python
import numpy as np

# f(x,y) = [x²+y, xy+y²]
# Jacobian at (1, 2):
# J = [[∂f₁/∂x, ∂f₁/∂y],   = [[2x, 1 ],   = [[2, 1],
#      [∂f₂/∂x, ∂f₂/∂y]]     [ y, x+2y]]     [2, 5]]

def jacobian_numerical(f, x, h=1e-7):
    n = len(x)
    f0 = f(x)
    m = len(f0)
    J = np.zeros((m, n))
    for j in range(n):
        x_plus = x.copy(); x_plus[j] += h
        x_minus = x.copy(); x_minus[j] -= h
        J[:, j] = (f(x_plus) - f(x_minus)) / (2 * h)
    return J

f = lambda x: np.array([x[0]**2 + x[1], x[0]*x[1] + x[1]**2])
J = jacobian_numerical(f, np.array([1.0, 2.0]))
print("Jacobian:\n", J.round(4))

The Hessian Matrix

The Hessian is the matrix of second-order partial derivatives. It tells you about the curvature of the loss surface — is it a bowl (convex, minimum), a saddle, or a ridge?

H = [∂²f/∂xᵢ∂xⱼ]

If all eigenvalues of H are positive → local minimum (bowl). If all negative → local maximum. If mixed → saddle point. This information is used by second-order optimizers like Newton's method.

Python
# Hessian of f(x,y) = x⁴ + y⁴ - 2x²y² at (1,1)
# ∂²f/∂x² = 12x² - 4y² = 8
# ∂²f/∂y² = 12y² - 4x² = 8
# ∂²f/∂x∂y = -4xy = -4 (mixed partial, symmetric by Schwarz)

H = np.array([[8, -4],
              [-4, 8]])
eigenvalues = np.linalg.eigvalsh(H)
print("Eigenvalues:", eigenvalues)  # [4, 12] — both positive → minimum

Why This Matters in AI

The Jacobian appears when computing gradients through multi-output layers (like softmax). The Hessian captures curvature information — second-order optimizers like L-BFGS use it to take smarter steps than vanilla gradient descent. However, for large networks, the full Hessian is prohibitively expensive (n² elements for n parameters), so approximations like Adam's adaptive learning rates are used instead.

Exercises

Exercise 7.1: For f(x,y) = x² + 4y² - 2xy, find the Hessian and classify the critical point at (0,0)

∂f/∂x = 2x - 2y, ∂f/∂y = 8y - 2x. Setting both to zero: x=0, y=0.

H = [[2, -2], [-2, 8]]. Eigenvalues: det(H-λI) = (2-λ)(8-λ) - 4 = λ² - 10λ + 12 = 0 → λ = 5±√13.

Both eigenvalues positive → (0,0) is a local minimum.

Exercise 7.2: Why is Newton's method faster than gradient descent but rarely used in deep learning?

Faster: Newton's method uses curvature (Hessian) to take optimal-sized steps: Δw = -H⁻¹∇L. It converges quadratically vs. linearly for GD.

Impractical: For a network with 100M parameters, the Hessian is a 100M × 100M matrix (10¹⁶ entries) — impossible to compute or store. Even approximations are expensive. That's why Adam (which approximates curvature cheaply) is preferred.

Chapter Summary

The Jacobian generalizes the gradient to vector-valued functions
The Hessian captures second-order curvature of the loss surface
Eigenvalues of the Hessian classify critical points (min, max, saddle)
Second-order methods are theoretically faster but computationally prohibitive for large models

Part III

Probability & Statistics

Reasoning under uncertainty

Chapter 8

Probability Distributions & Bayes' Theorem

Learning Objectives

Understand discrete and continuous probability distributions
Master Gaussian, Bernoulli, Uniform, and Poisson distributions
Deeply understand Bayes' theorem and its AI applications
Apply Bayesian reasoning to classification and decision-making

Probability Distributions

A probability distribution describes the likelihood of each possible outcome. In ML, data is assumed to come from some underlying distribution, and the model tries to learn it.

Key Discrete Distributions

Distribution	Use	Example
Bernoulli(p)	Single binary trial	Is this email spam? (yes/no)
Binomial(n, p)	Number of successes in n trials	How many of 100 emails are spam?
Poisson(λ)	Count of rare events in interval	Website visits per hour
Categorical(p₁..pₖ)	One of k categories	Softmax output probabilities

The Gaussian (Normal) Distribution

The most important continuous distribution. Data in nature often follows a bell curve. Neural network weights are initialized from Gaussians.

f(x) = (1/√(2πσ²)) · exp(-(x-μ)² / (2σ²))

Python
import numpy as np
import matplotlib.pyplot as plt

# Generate samples from different distributions
normal_data = np.random.normal(loc=0, scale=1, size=10000)
uniform_data = np.random.uniform(low=-3, high=3, size=10000)

# Neural network weight initialization
# Xavier/Glorot: Normal with std = √(2 / (fan_in + fan_out))
fan_in, fan_out = 256, 128
weights = np.random.normal(0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))
print(f"Weight mean: {weights.mean():.4f}, std: {weights.std():.4f}")

Bayes' Theorem

Bayes' theorem lets us update our beliefs when we observe new evidence. It reverses conditional probabilities — going from P(evidence|hypothesis) to P(hypothesis|evidence).

P(H|E) = P(E|H) · P(H) / P(E)

Posterior = (Likelihood × Prior) / Evidence

Python
# Medical diagnosis example
# Disease prevalence: 1% of population has disease
# Test sensitivity: 99% (positive if sick)
# Test specificity: 95% (negative if healthy)
# Question: You test positive. What's the probability you're actually sick?

p_disease = 0.01         # Prior: P(D)
p_pos_given_disease = 0.99 # Likelihood: P(+|D)
p_pos_given_healthy = 0.05 # False positive: P(+|¬D)

# P(+) = P(+|D)P(D) + P(+|¬D)P(¬D)
p_positive = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# P(D|+) = P(+|D)P(D) / P(+)
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive

print(f"P(disease | positive test) = {p_disease_given_pos:.2%}")
# Only ~16.7%! Despite 99% sensitive test!
# This is because the disease is rare (low prior)

Why This Matters in AI

Naive Bayes classifier directly applies Bayes' theorem for text classification. Bayesian neural networks place distributions over weights instead of point estimates. Bayesian optimization uses priors to efficiently search hyperparameter spaces. Understanding priors and posteriors is fundamental to probabilistic ML.

Exercises

Exercise 8.1: A spam filter sees the word "free" in 80% of spam emails but only 5% of legitimate ones. If 30% of emails are spam, what's the probability an email with "free" is spam?

P(spam|free) = P(free|spam)·P(spam) / P(free)

P(free) = 0.80·0.30 + 0.05·0.70 = 0.24 + 0.035 = 0.275

P(spam|free) = (0.80 × 0.30) / 0.275 = 0.24/0.275 = 87.3%

Exercise 8.2: Why does Naive Bayes assume feature independence? Is this realistic?

Naive Bayes assumes P(x₁,x₂|y) = P(x₁|y)·P(x₂|y) — features are conditionally independent given the class. This is rarely true (word "free" and "money" in spam are correlated), but the simplification makes computation tractable and often works surprisingly well in practice despite violated assumptions.

Exercise 8.3: Why is Xavier initialization better than random initialization?

Random large values → activations explode → gradients explode. Random small values → activations shrink → gradients vanish. Xavier initializes weights from N(0, 2/(fan_in+fan_out)), keeping the variance of activations roughly constant across layers. This ensures healthy gradient flow during backpropagation.

Chapter Summary

Probability distributions model the uncertainty inherent in data
The Gaussian distribution is central to initialization, normalization, and generative models
Bayes' theorem updates beliefs with evidence: Posterior ∝ Likelihood × Prior
Counter-intuitive results (like the medical test) show why rigorous probability matters

Chapter 9

Expectation, Variance & Maximum Likelihood Estimation

Learning Objectives

Understand expectation (mean), variance, and covariance deeply
Master Maximum Likelihood Estimation (MLE) — the foundation of model training
See how MLE connects to loss functions in ML
Apply these concepts to real model fitting scenarios

Expectation & Variance

E[X] = Σ xᵢ · P(xᵢ) (discrete) E[X] = ∫ x · f(x) dx (continuous)

Var(X) = E[(X - μ)²] = E[X²] - (E[X])²

Expectation is the weighted average — the "center" of the distribution. Variance measures spread — how far values typically deviate from the mean. Standard deviation σ = √Var is in the same units as the data.

Python
import numpy as np

data = np.array([72, 85, 90, 78, 95, 88, 76, 92])

print(f"Mean (E[X]):     {np.mean(data):.2f}")
print(f"Variance:        {np.var(data):.2f}")
print(f"Std deviation:   {np.std(data):.2f}")

# Covariance — how two variables move together
height = np.array([165, 170, 175, 180, 185])
weight = np.array([55, 62, 70, 78, 85])
cov_matrix = np.cov(height, weight)
print(f"\nCovariance matrix:\n{cov_matrix}")
print(f"Correlation: {np.corrcoef(height, weight)[0,1]:.4f}")

Maximum Likelihood Estimation (MLE)

MLE finds the parameter values that make the observed data most probable. It asks: "Given this data, which parameters θ maximize the probability of seeing this data?"

θ̂_MLE = argmax_θ P(data | θ) = argmax_θ Π P(xᵢ | θ)

For computational convenience (products → sums), we maximize the log-likelihood:

θ̂_MLE = argmax_θ Σ log P(xᵢ | θ)

Python
import numpy as np

# MLE for Gaussian: find μ and σ that maximize P(data)
data = np.array([2.1, 1.8, 2.5, 2.3, 1.9, 2.7, 2.0, 2.4])

# MLE solutions (derivable analytically):
mu_mle = np.mean(data)     # MLE for μ = sample mean
sigma_mle = np.std(data)   # MLE for σ = sample std (biased)

print(f"MLE μ = {mu_mle:.3f}")
print(f"MLE σ = {sigma_mle:.3f}")

# Connection to ML: minimizing cross-entropy loss
# = maximizing log-likelihood of correct labels!
# So training a classifier with cross-entropy IS doing MLE.

The Big Insight

Minimizing cross-entropy loss = Maximizing log-likelihood. When you train a classifier with cross-entropy loss, you are performing MLE — finding the model parameters that make the training labels most probable. When you use MSE loss for regression, you are doing MLE under a Gaussian noise assumption.

Exercises

Exercise 9.1: For a fair coin flipped 10 times with 7 heads, what is the MLE estimate of P(heads)?

For Bernoulli trials, MLE of p = (number of successes) / (total trials) = 7/10 = 0.7.

Note: we know the true p is 0.5, but MLE gives us the value that makes the observed data most likely. With more flips, MLE converges to the true value.

Exercise 9.2: Why do we maximize log-likelihood instead of likelihood?

(1) Numerical stability: Likelihood is a product of many probabilities (e.g., 0.3 × 0.7 × 0.2 × ... = extremely small number). Log converts products to sums, avoiding underflow.

(2) Convenience: Derivatives of sums are much simpler than derivatives of products.

(3) Equivalence: log is monotonically increasing, so maximizing log(L) gives the same answer as maximizing L.

Exercise 9.3: What is the difference between MLE and MAP (Maximum A Posteriori)?

MLE: θ̂ = argmax P(data|θ) — only uses likelihood, no prior assumptions about θ.

MAP: θ̂ = argmax P(θ|data) = argmax P(data|θ)·P(θ) — includes a prior P(θ). MAP with a Gaussian prior on weights = L2 regularization (weight decay). MAP is more robust with small datasets because the prior prevents extreme parameter values.

Chapter Summary

Expectation, variance, and covariance quantify the center, spread, and relationships of data
MLE finds parameters that maximize the probability of observed data
Cross-entropy loss minimization IS MLE for classification
MAP extends MLE with prior beliefs, corresponding to regularization

Chapter 10

Information Theory & Sampling Methods (MCMC)

Learning Objectives

Understand entropy as a measure of uncertainty/information
Master KL divergence for comparing distributions
Learn cross-entropy and its role as a loss function
Understand Monte Carlo and MCMC sampling methods

Entropy: Measuring Uncertainty

Entropy measures the average "surprise" or uncertainty in a probability distribution. A fair coin (p=0.5) has maximum entropy — you can't predict the outcome. A biased coin (p=0.99) has low entropy — almost always heads.

H(X) = -Σ p(xᵢ) · log₂ p(xᵢ)

Python
import numpy as np

def entropy(probs):
    probs = np.array(probs)
    probs = probs[probs > 0]  # Avoid log(0)
    return -np.sum(probs * np.log2(probs))

print(f"Fair coin:   {entropy([0.5, 0.5]):.3f} bits")    # 1.000
print(f"Biased coin: {entropy([0.99, 0.01]):.3f} bits")  # 0.081
print(f"Fair die:    {entropy([1/6]*6):.3f} bits")       # 2.585
print(f"Certain:     {entropy([1.0, 0.0]):.3f} bits")    # 0.000

KL Divergence

KL divergence measures how different two distributions P and Q are. It's the "extra bits" needed to encode data from P using a code optimized for Q. Used in VAEs, GANs, knowledge distillation.

D_KL(P || Q) = Σ P(x) · log(P(x) / Q(x)) ≥ 0

Python
def kl_divergence(p, q):
    p, q = np.array(p), np.array(q)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P = [0.4, 0.3, 0.3]   # True distribution
Q = [0.33, 0.33, 0.34] # Approximate distribution

print(f"KL(P||Q) = {kl_divergence(P, Q):.4f}")
print(f"KL(Q||P) = {kl_divergence(Q, P):.4f}")  # Different! KL is asymmetric

Cross-Entropy Loss

Cross-entropy = Entropy + KL divergence. Minimizing cross-entropy between true labels and predictions is equivalent to minimizing KL divergence (since entropy of true labels is constant).

H(P, Q) = -Σ P(x) · log Q(x) = H(P) + D_KL(P || Q)

Python
# Binary cross-entropy — the standard classification loss
def binary_cross_entropy(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1-y_true) * np.log(1-y_pred))

y_true = np.array([1, 0, 1, 1, 0])
y_good = np.array([0.9, 0.1, 0.8, 0.95, 0.05])   # Good predictions
y_bad  = np.array([0.5, 0.5, 0.5, 0.5, 0.5])    # Random predictions

print(f"Good model loss: {binary_cross_entropy(y_true, y_good):.4f}")
print(f"Bad model loss:  {binary_cross_entropy(y_true, y_bad):.4f}")

Sampling Methods & MCMC

When we can't compute probabilities analytically (which is common in complex Bayesian models), we sample from the distribution instead. Monte Carlo methods use random sampling to approximate expectations.

Markov Chain Monte Carlo (MCMC) constructs a Markov chain whose stationary distribution equals the target distribution. The Metropolis-Hastings algorithm is the classic approach:

Python
import numpy as np

# Metropolis-Hastings to sample from a target distribution
# Target: mixture of two Gaussians
def target_pdf(x):
    return 0.3 * np.exp(-0.5 * (x + 2)**2) + 0.7 * np.exp(-0.5 * (x - 3)**2)

def metropolis_hastings(target, n_samples=10000, step_size=1.0):
    samples = []
    x = 0.0  # Starting point

    for _ in range(n_samples):
        # Propose new point
        x_new = x + np.random.normal(0, step_size)
        # Accept/reject based on probability ratio
        acceptance = target(x_new) / target(x)
        if np.random.random() < acceptance:
            x = x_new
        samples.append(x)

    return np.array(samples)

samples = metropolis_hastings(target_pdf, n_samples=50000)
print(f"Sample mean: {samples[1000:].mean():.3f}")  # Discard burn-in
print(f"Sample std:  {samples[1000:].std():.3f}")

Why This Matters in AI

Cross-entropy is THE loss function for classification in deep learning. KL divergence is the loss for VAE latent spaces. MCMC powers Bayesian inference, probabilistic programming (PyMC, Stan), and some generative models. Entropy measures feature informativeness in decision trees (information gain).

Exercises

Exercise 10.1: Why is KL divergence asymmetric? What does D_KL(P||Q) vs D_KL(Q||P) mean?

D_KL(P||Q): "How many extra bits do you need if you use Q to encode data from P?" Penalizes Q heavily where P has mass but Q doesn't (mode-covering).

D_KL(Q||P): Reverse — penalizes where Q has mass but P doesn't (mode-seeking). In VAEs, we minimize D_KL(q(z|x) || p(z)) to keep the learned posterior close to the Gaussian prior.

Exercise 10.2: When would you use MCMC over direct computation?

When the posterior P(θ|data) has no closed-form solution — which is the case for most realistic Bayesian models with non-conjugate priors or complex likelihood functions. MCMC is also used when the normalizing constant is intractable (you know P(x) ∝ f(x) but can't compute ∫f(x)dx). Direct computation is only feasible for simple, low-dimensional models.

Chapter Summary

Entropy quantifies uncertainty; higher entropy = more unpredictable
KL divergence measures distance between distributions — used in VAEs and distillation
Cross-entropy loss = entropy + KL divergence — the standard classification loss
MCMC samples from complex distributions when direct computation is intractable

Part IV

Optimization Theory

Finding the best parameters — the heart of training

Chapter 11

Convex vs Non-Convex Optimization

Learning Objectives

Understand convexity and why it guarantees global optima
Learn why deep learning loss surfaces are non-convex
Visualize loss landscapes and understand their structure
Know which ML problems are convex and which are not

What is Convexity?

A function f is convex if the line segment between any two points on the curve lies above the curve. Geometrically, it's a "bowl" shape. The critical property: any local minimum is also the global minimum. No getting stuck in suboptimal solutions.

f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y) for all λ ∈ [0,1]

Python
import numpy as np
import matplotlib.pyplot as plt

# Convex function: f(x) = x² — single global minimum
x = np.linspace(-5, 5, 200)
convex = x**2

# Non-convex function: multiple local minima
non_convex = x**4 - 8*x**2 + 5*np.sin(3*x)

# In deep learning, loss surfaces look more like the non-convex one
# with millions of dimensions — many local minima and saddle points

Convex vs Non-Convex in ML

Convex (guaranteed optimal)	Non-Convex (local optima risk)
Linear Regression (MSE loss)	Neural Networks (any architecture)
Logistic Regression	Matrix Factorization
SVM (dual formulation)	k-Means Clustering
Ridge/Lasso Regression	GANs, VAEs

The Surprising Truth About Non-Convexity in Deep Learning

Research shows that in high-dimensional spaces (millions of parameters), most local minima have loss values very close to the global minimum. The real problem is saddle points (where the gradient is zero but it's neither a min nor max) — they are exponentially more common than local minima in high dimensions. Modern optimizers like Adam handle this well.

Exercises

Exercise 11.1: Prove that MSE loss for linear regression is convex

L(w) = (1/n)||Xw - y||² = (1/n)(wᵀXᵀXw - 2yᵀXw + yᵀy).

The Hessian is H = (2/n)XᵀX. For any vector v, vᵀHv = (2/n)||Xv||² ≥ 0. So H is positive semi-definite, which means L(w) is convex. Any local minimum is global — that's why linear regression always finds the optimal solution.

Exercise 11.2: If neural network loss is non-convex, how do we still train them successfully?

(1) Saddle points are more common than bad local minima — SGD's noise helps escape them. (2) Most local minima in high dimensions have similar loss to the global minimum. (3) Over-parameterization (more parameters than data) creates many equivalent global minima. (4) Techniques like learning rate schedules, batch normalization, and skip connections smooth the landscape.

Chapter Summary

Convex functions have a single global minimum — guaranteed optimal solution
Linear regression, logistic regression, and SVM are convex optimization problems
Neural network training is non-convex but works well in practice
Saddle points are the real challenge in high-dimensional optimization, not local minima

Chapter 12

Gradient Descent Variants & Loss Functions

Learning Objectives

Master all gradient descent variants: Batch, Mini-batch, SGD
Understand advanced optimizers: Momentum, RMSProp, Adam
Learn common loss functions and when to use each
Implement and compare optimizers from scratch

Gradient Descent Variants

Variant	Batch Size	Speed	Stability
Batch GD	Entire dataset	Slow per step	Very stable, smooth
Stochastic GD	1 sample	Fast per step	Very noisy
Mini-batch GD	32-256 samples	Balanced	Good balance ✓

Python
import numpy as np

# Mini-batch gradient descent
def mini_batch_gd(X, y, lr=0.01, epochs=100, batch_size=32):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0

    for epoch in range(epochs):
        # Shuffle data each epoch
        indices = np.random.permutation(n_samples)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        for i in range(0, n_samples, batch_size):
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            y_pred = X_batch @ w + b
            error = y_pred - y_batch

            # Gradients averaged over batch
            dw = (2 / len(y_batch)) * X_batch.T @ error
            db = (2 / len(y_batch)) * np.sum(error)

            w -= lr * dw
            b -= lr * db

    return w, b

Advanced Optimizers

Momentum

Adds a "velocity" to gradient descent, accumulating past gradients like a ball rolling downhill. Helps escape shallow local minima and smooths noisy gradients.

vₜ = β · vₜ₋₁ + (1-β) · ∇L wₜ = wₜ₋₁ - α · vₜ

Adam (Adaptive Moment Estimation)

Combines momentum (1st moment) with RMSProp (2nd moment) — tracks both the average gradient direction AND the average gradient magnitude. The most widely used optimizer in deep learning.

Python
class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = None   # 1st moment (mean of gradients)
        self.v = None   # 2nd moment (mean of squared gradients)
        self.t = 0

    def step(self, w, grad):
        self.t += 1
        if self.m is None:
            self.m = np.zeros_like(w)
            self.v = np.zeros_like(w)

        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2

        # Bias correction
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)

        w -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return w

Loss Functions

Loss Function	Formula	Use Case
MSE	(1/n) Σ(y-ŷ)²	Regression
MAE	(1/n) Σ\|y-ŷ\|	Regression (robust to outliers)
Binary Cross-Entropy	-Σ[y·log(ŷ)+(1-y)·log(1-ŷ)]	Binary classification
Categorical Cross-Entropy	-Σ yᵢ·log(ŷᵢ)	Multi-class classification
Huber	MSE if \|e\|<δ, MAE otherwise	Regression (best of both)
Hinge	max(0, 1-y·ŷ)	SVM classification

Exercises

Exercise 12.1: Why does Adam usually outperform vanilla SGD?

(1) Adaptive learning rates: Each parameter gets its own learning rate based on its gradient history. Frequently updated parameters get smaller steps; rare ones get larger steps. (2) Momentum: Smooths gradients and accelerates convergence. (3) Bias correction: Prevents early steps from being too small. Result: less tuning needed, faster convergence, works across diverse architectures.

Exercise 12.2: When would you prefer SGD with momentum over Adam?

For large-scale image classification (ResNets, EfficientNet), SGD with momentum + learning rate scheduling often achieves better final accuracy than Adam. Adam can converge to sharper minima that generalize worse. SGD finds flatter minima. Also, SGD uses less memory (no m and v states per parameter). Many state-of-the-art vision models use SGD.

Exercise 12.3: Why is MAE more robust to outliers than MSE?

MSE squares the error: an outlier with error 100 contributes 10,000 to the loss. MAE only contributes 100. The squared term makes MSE disproportionately sensitive to large errors, pulling the model toward outliers. MAE treats all errors linearly, giving outliers less influence. Huber loss combines both — MSE for small errors (smooth optimization), MAE for large errors (robustness).

Chapter Summary

Mini-batch SGD balances speed and stability; batch size 32-256 is standard
Adam combines momentum and adaptive learning rates — default choice for most tasks
Loss function choice depends on the task: MSE for regression, cross-entropy for classification
Understanding optimizer internals helps diagnose training issues

Chapter 13

Lagrangian Optimization, Saddle Points & Local Minima

Learning Objectives

Master constrained optimization using Lagrange multipliers
Understand saddle points and why they dominate high-dimensional landscapes
Learn the KKT conditions for inequality constraints
Connect these concepts to SVM derivation and neural network training

Lagrangian Optimization

When you need to optimize a function subject to constraints, you can't just use gradient descent. The Lagrangian method converts a constrained problem into an unconstrained one by introducing multiplier variables.

Minimize f(x) subject to g(x) = 0

L(x, λ) = f(x) + λ · g(x) — the Lagrangian

At the optimum, ∇ₓL = 0 and ∇λL = 0 (the constraint is satisfied).

Python
import numpy as np
from scipy.optimize import minimize

# Example: Minimize f(x,y) = x² + y² subject to x + y = 4
# Lagrangian: L = x² + y² + λ(x + y - 4)
# ∂L/∂x = 2x + λ = 0 → x = -λ/2
# ∂L/∂y = 2y + λ = 0 → y = -λ/2
# Constraint: x + y = 4 → -λ/2 - λ/2 = 4 → λ = -4
# Solution: x = 2, y = 2, f(2,2) = 8

# Numerical verification with scipy
result = minimize(
    lambda xy: xy[0]**2 + xy[1]**2,
    x0=[0, 0],
    constraints={'type': 'eq', 'fun': lambda xy: xy[0] + xy[1] - 4}
)
print(f"Optimal: x={result.x[0]:.2f}, y={result.x[1]:.2f}")
print(f"Minimum value: {result.fun:.2f}")

Why This Matters in AI

The SVM dual formulation is derived using Lagrange multipliers with KKT conditions. The constraint is that the margin must be ≥ 1 for all training points. The resulting optimization problem is a quadratic program that can be solved efficiently. Lagrangian optimization also appears in constrained reinforcement learning and fairness-constrained ML.

Saddle Points & Local Minima

A saddle point is where the gradient is zero but it's a minimum along some directions and a maximum along others — like a mountain pass. In a function of n variables, a saddle point has some positive and some negative Hessian eigenvalues.

Python
# Classic saddle point: f(x,y) = x² - y²
# ∇f = [2x, -2y] = [0,0] at origin
# Hessian = [[2, 0], [0, -2]]
# Eigenvalues: 2 and -2 → mixed signs → SADDLE POINT

# At the origin: minimum along x-axis, maximum along y-axis

H_saddle = np.array([[2, 0], [0, -2]])
eigenvalues = np.linalg.eigvalsh(H_saddle)
print(f"Eigenvalues: {eigenvalues}")
print("Mixed signs → Saddle point!")

Why Saddle Points Dominate in High Dimensions

At a critical point (∇f = 0), each Hessian eigenvalue is randomly positive or negative. For a saddle point, you need at least one of each sign. For a local minimum, ALL must be positive. In n dimensions:

P(local minimum) = P(all n eigenvalues > 0) ≈ (1/2)ⁿ

For n = 1000 parameters: P ≈ 10⁻³⁰¹ — essentially zero! Saddle points are overwhelmingly more likely than true local minima in high-dimensional spaces. This is why neural networks can still find good solutions despite non-convexity.

Escaping Saddle Points

Several techniques help optimizers navigate past saddle points:

SGD noise: Stochastic gradients provide natural perturbation that pushes away from saddle points
Momentum: Accumulated velocity carries the optimizer through flat saddle regions
Learning rate warmup: Starts with small steps to avoid early saddle-point traps
Batch normalization: Smooths the loss landscape, reducing saddle point severity

Exercises

Exercise 13.1: Use Lagrange multipliers to find the maximum of f(x,y) = xy subject to x + y = 10

L = xy + λ(x + y - 10)

∂L/∂x = y + λ = 0 → λ = -y

∂L/∂y = x + λ = 0 → λ = -x

So x = y. Constraint: x + x = 10 → x = 5, y = 5.

Maximum: f(5,5) = 25.

Exercise 13.2: Classify the critical point of f(x,y) = x³ - 3xy² at (0,0)

∇f = [3x² - 3y², -6xy]. At (0,0): ∇f = [0,0] ✓ (critical point).

Hessian: H = [[6x, -6y], [-6y, -6x]]. At (0,0): H = [[0,0],[0,0]].

The Hessian is all zeros — the second derivative test is inconclusive. This is actually a monkey saddle — a degenerate saddle point with three directions of descent instead of two.

Exercise 13.3: How do KKT conditions extend Lagrange multipliers to inequality constraints?

For minimize f(x) subject to g(x) ≤ 0:

KKT conditions: (1) ∇f + λ∇g = 0 (stationarity), (2) g(x) ≤ 0 (primal feasibility), (3) λ ≥ 0 (dual feasibility), (4) λ·g(x) = 0 (complementary slackness — either the constraint is active OR λ = 0).

In SVM: the constraints are yᵢ(w·xᵢ + b) ≥ 1. Points where the constraint is tight (=1) have λ > 0 — these are the support vectors.

Chapter Summary

Lagrange multipliers convert constrained optimization into unconstrained by adding penalty terms
SVM is derived using Lagrangian optimization with KKT conditions
Saddle points vastly outnumber local minima in high-dimensional spaces
SGD noise, momentum, and batch normalization help escape saddle points

🎓 Congratulations!

You've completed Mathematics for AI. You now have the mathematical foundation to understand why every AI algorithm works, not just how to use it. This understanding separates practitioners from true AI engineers.