Chapter 2 · Part I: Foundations

Mathematical Toolkit for Deep Learning

โฑ 3 hours reading ๐Ÿ“„ ~12,000 words ๐Ÿงฎ Full derivations

Every neural network is a composition of mathematical functions. Before you write a single line of PyTorch, you must own the math that powers it: linear algebra for data flow, calculus for learning, probability for uncertainty, and information theory for loss functions.

Remember: Notation & Definitions   |   Understand: Derivations   |   Apply: NumPy Computation   |   Analyze: Why Cross-Entropy?   |   Evaluate: Compare Loss Functions   |   Create: Implement from Scratch

Prerequisites

  • Class 12 Mathematics (CBSE/ISC): matrices, determinants, basic calculus
  • Chapter 1 of this textbook (Python & NumPy basics)
  • Comfort with Σ (summation) and basic algebraic manipulation

Learning Objectives

By the end of this chapter, you will be able to:

  • Represent datasets as matrices/tensors and perform core linear algebra operations in NumPy
  • Compute dot products, transposes, inverses, and Hadamard products by hand and in code
  • Derive derivatives from first principles, apply the chain rule, and compute gradients
  • Derive the sigmoid derivative step-by-step and explain why it matters for backpropagation
  • Define Bernoulli and Gaussian distributions and compute Maximum Likelihood Estimates
  • Derive cross-entropy loss from negative log-likelihood (MLE), the "why" behind the loss function
  • Explain entropy, KL divergence, and cross-entropy and their roles in neural network training
  • Implement all key operations from scratch in NumPy

The Hook: Why Math Matters

The Challenge

Paytm has 350 million registered users. Their data science team needs to predict which users will stop using Paytm Wallet next month. They have 50 features per user: transaction frequency, average spend (₹), last login days, UPI usage ratio, cashback redeemed, and more.

Here's the secret: every prediction is a math equation. The neural network takes a 50-dimensional vector (one user's features), multiplies it by weight matrices, applies nonlinear functions, and outputs a probability between 0 and 1. That probability is the churn score.

To understand how the network learns the right weights, you need linear algebra (matrix multiplication), calculus (gradient descent), probability (loss function), and information theory (cross-entropy). This chapter gives you every tool.

India Connect

India's UPI processed 14.04 billion transactions worth ₹20.64 lakh crore in a single month (March 2024). Behind every fraud detection model, recommendation engine, and credit scoring system at PhonePe, Google Pay, and Paytm, the same mathematical toolkit powers the neural networks. Master this chapter and you speak the language of Indian fintech AI.

2.1 Linear Algebra for Deep Learning

Linear algebra is the language of data in deep learning. Every input, every weight, every output is a multi-dimensional array. Let's build the vocabulary.

2.1.1 Scalars, Vectors, Matrices, and Tensors

Scalar (0-D Tensor)

A single number. We denote scalars with lowercase italic letters: x, y, α, λ.

Scalar
x = 42   (a single number, e.g., a user's age)

Vector (1-D Tensor)

An ordered list of numbers. We denote vectors with bold lowercase: x. Each element is x₁, x₂, ..., xₙ.

Column Vector
x = [x₁, x₂, ..., xₙ]ᵀ   ∈ ℝⁿ
Example: A Paytm user's feature vector with 5 features:
x = [25, 12500, 3, 0.85, 450]ᵀ
(age, avg_spend_₹, days_since_login, upi_ratio, cashback_₹)

Matrix (2-D Tensor)

A 2-D array of numbers. Denoted with bold uppercase: A. Shape is m × n (m rows, n columns).

Matrix
A ∈ ℝᵐˣⁿ   where A_{ij} is the element at row i, column j

Example: 100 IRCTC passengers × 4 features = X ∈ ℝ¹⁰⁰ˣ⁴

Tensor (n-D Array)

An array with more than 2 axes. A color image is a 3-D tensor (height × width × channels). A batch of images is a 4-D tensor.

Tensor Shapes
Scalar: ()   |   Vector: (n,)   |   Matrix: (m, n)
Color Image: (H, W, 3)   |   Batch of Images: (B, H, W, 3)

Example: 32 images of size 224×224 RGB → shape (32, 224, 224, 3)

Shape Notation is Everything

In deep learning, shape errors are among the most common bugs, so always track tensor shapes. A weight matrix connecting a layer of 784 neurons to 256 neurons has shape (784, 256). An input batch of shape (64, 784) multiplied by it gives an output of shape (64, 256). Always verify that the inner dimensions match before a matrix multiplication.
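The shape bookkeeping described here can be checked directly in NumPy. This is a minimal sketch: the names `batch` and `W1` and the random values are illustrative assumptions, not part of any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

batch = rng.standard_normal((64, 784))   # 64 samples, 784 input features
W1 = rng.standard_normal((784, 256))     # hypothetical layer: 784 -> 256 neurons

# Inner dimensions (784 and 784) match, so the product is defined.
out = batch @ W1
print(out.shape)                         # (64, 256)

# Swapping the operands gives (784, 256) @ (64, 784): inner dims 256 and 64 differ.
try:
    W1 @ batch
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)                 # True
```

Printing `.shape` after every layer is a cheap habit that catches most of these errors early.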

2.1.2 Matrix Multiplication: The Dot Product

Matrix multiplication is the single most important operation in neural networks. Every forward pass is a series of matrix multiplications.

Matrix Multiplication
Given A ∈ ℝᵐˣⁿ and B ∈ ℝⁿˣᵖ:

C = A · B   where C_{ij} = Σₖ A_{ik} · B_{kj}   (k = 1 to n)

Result shape: C ∈ ℝᵐˣᵖ
Rule: Inner dimensions must match → (m×n) · (n×p) = (m×p)

Intuition: Each element C_{ij} is the dot product of row i of A with column j of B. Think of it as: "How much does row i of A align with column j of B?"

In neural networks: If x is a single input vector (1×n) and W is a weight matrix (n×h), then x·W gives the output (1×h), one number per hidden neuron. Each output neuron computes a weighted sum of the inputs.

2.1.3 Transpose

Transpose
(Aᵀ)_{ij} = A_{ji}

If A ∈ ℝᵐˣⁿ, then Aᵀ ∈ ℝⁿˣᵐ
Rows become columns, columns become rows.

Properties:
(Aᵀ)ᵀ = A   |   (A + B)ᵀ = Aᵀ + Bᵀ   |   (AB)ᵀ = BᵀAᵀ

2.1.4 Matrix Inverse

Matrix Inverse
A⁻¹ exists only for square, non-singular matrices.

A · A⁻¹ = A⁻¹ · A = I   (Identity matrix)

Used in: Normal equation for linear regression: w = (XᵀX)⁻¹ Xᵀy

Pro Tip

In practice, we rarely compute matrix inverses explicitly in deep learning. Inversion is computationally expensive (O(n³)) and can be numerically unstable, so we use gradient descent instead. But understanding inverses helps you read research papers and derive closed-form solutions.
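To make the point about explicit inverses concrete, here is a small sketch that solves the normal equation three ways on synthetic data. The data, `w_true`, and the noise level are made up for illustration; `np.linalg.inv`, `np.linalg.solve`, and `np.linalg.lstsq` are standard NumPy routines.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative): y = X @ w_true + small noise
X = rng.standard_normal((200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(200)

# Textbook form: w = (X^T X)^-1 X^T y, with an explicit inverse
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Usually preferred: solve the linear system (X^T X) w = X^T y directly
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares routine, which avoids forming X^T X at all
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_inv, w_solve), np.allclose(w_solve, w_lstsq))
print(np.round(w_solve, 3))
```

On a small, well-conditioned problem like this all three agree; the difference shows up on large or nearly singular systems, where solving is generally the safer choice.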

2.1.5 Element-wise (Hadamard) Product

Hadamard Product
C = A ⊙ B   where C_{ij} = A_{ij} × B_{ij}

Both matrices must have the same shape.
Used in: Gating mechanisms (LSTM, GRU), attention masks, dropout

Don't confuse this with matrix multiplication! The Hadamard product multiplies corresponding elements, while matrix multiplication computes dot products of rows and columns.

Worked Example: IRCTC Passenger Data as a Matrix

Consider 5 IRCTC passengers with 4 features each: Age, Fare (₹), Distance (km), Class (1/2/3).

Data
Passenger Matrix X ∈ ℝ⁵ˣ⁴:

         Age    Fare(₹)  Distance(km)  Class
P1  [    28,    1450,      850,          3   ]
P2  [    45,    3200,     1200,          2   ]
P3  [    32,    5800,     2100,          1   ]
P4  [    56,    1800,      950,          3   ]
P5  [    23,    4500,     1800,          1   ]

Shape: (5, 4)
X[2, 1] = 5800   (3rd passenger, 2nd feature = Fare; indices start at 0)
X[:, 0] = [28, 45, 32, 56, 23]  (all ages: the first column)

Weight vector for predicting "will book again" (4 features → 1 output):

w = [0.01, 0.0002, 0.0003, -0.5]ᵀ

Score for P1 = 28×0.01 + 1450×0.0002 + 850×0.0003 + 3×(-0.5) = 0.28 + 0.29 + 0.255 - 1.5 = -0.675

Score for P3 = 32×0.01 + 5800×0.0002 + 2100×0.0003 + 1×(-0.5) = 0.32 + 1.16 + 0.63 - 0.5 = 1.61

P3 (1AC traveler, long distance) has a much higher rebooking score. The entire batch computation is simply X · w.
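The batch computation for this worked example can be reproduced in a few lines. This is a sketch using the passenger matrix and weight vector exactly as given; the variable name `scores` is an illustrative choice.

```python
import numpy as np

# Rows: P1..P5; columns: Age, Fare, Distance, Class (values from the worked example)
X = np.array([
    [28, 1450,  850, 3],
    [45, 3200, 1200, 2],
    [32, 5800, 2100, 1],
    [56, 1800,  950, 3],
    [23, 4500, 1800, 1],
], dtype=float)
w = np.array([0.01, 0.0002, 0.0003, -0.5])

print(X[2, 1])      # 5800.0 (3rd passenger, fare)
print(X[:, 0])      # all ages

scores = X @ w      # (5, 4) @ (4,) -> (5,): one score per passenger
print(np.round(scores, 3))
```

The first and third entries match the hand-computed scores for P1 and P3.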

2.2 Calculus for Deep Learning

Calculus tells us how to learn. A neural network improves by computing how much each weight contributed to the error, then adjusting it. That computation is a derivative.

2.2.1 Derivatives from First Principles

Definition of a Derivative
f'(x) = lim(h→0) [f(x + h) - f(x)] / h

The derivative tells you: if x changes by a tiny amount,
how much does f(x) change?

It is the slope of the tangent line at point x.

Example: Deriving d/dx [x²] from first principles

1. f(x) = x², so f(x+h) = (x+h)² = x² + 2xh + h²
2. f(x+h) - f(x) = x² + 2xh + h² - x² = 2xh + h²
3. [f(x+h) - f(x)] / h = (2xh + h²) / h = 2x + h
4. lim(h→0) [2x + h] = 2x

So d/dx [x²] = 2x. At x = 3, the slope is 6, meaning a tiny increase in x increases x² by approximately 6 times that increase.
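The limit in this derivation can also be watched numerically. The sketch below evaluates the difference quotient at x = 3 for a few shrinking step sizes; the particular values of `h` are arbitrary illustrative choices.

```python
def f(x):
    return x ** 2

x = 3.0
steps = [1.0, 0.1, 0.01, 0.001]

# Difference quotient [f(x + h) - f(x)] / h, which algebraically equals 2x + h.
slopes = [(f(x + h) - f(x)) / h for h in steps]

for h, slope in zip(steps, slopes):
    print(f"h = {h:<6} slope = {slope:.4f}")   # approaches 2x = 6 as h shrinks
```

Each quotient is 6 + h up to floating-point rounding, which is step 3 of the derivation evaluated at x = 3.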

Key Derivative Rules

Rule               | Formula                  | Example
Power Rule         | d/dx [xⁿ] = nxⁿ⁻¹        | d/dx [x³] = 3x²
Constant Multiple  | d/dx [cf(x)] = c·f'(x)   | d/dx [5x²] = 10x
Sum Rule           | d/dx [f+g] = f'+g'       | d/dx [x²+3x] = 2x+3
Product Rule       | d/dx [fg] = f'g + fg'    | d/dx [x·eˣ] = eˣ + xeˣ
Exponential        | d/dx [eˣ] = eˣ           | d/dx [e³ˣ] = 3e³ˣ
Logarithm          | d/dx [ln x] = 1/x        | d/dx [ln(2x)] = 1/x

2.2.2 The Chain Rule: Heart of Backpropagation

Chain Rule
If y = f(g(x)), then:

dy/dx = (dy/du) · (du/dx)   where u = g(x)

"The derivative of the outer function × the derivative of the inner function"

Why this matters: A neural network is a composition of functions: output = f₃(f₂(f₁(x))). To compute how the loss changes w.r.t. weight w₁ in the first layer, we chain derivatives through every subsequent layer. This is backpropagation.

Example: Chain Rule

Let y = (3x + 2)⁵. Find dy/dx.

1. Let u = 3x + 2, so y = u⁵
2. dy/du = 5u⁴,   du/dx = 3
3. dy/dx = 5u⁴ · 3 = 15(3x + 2)⁴
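A quick way to sanity-check a chain-rule result is to compare it against a central finite difference. The sketch below does that for this example; the evaluation point `x0 = 1.0` and step `h` are arbitrary illustrative choices.

```python
def y(x):
    return (3 * x + 2) ** 5

def dy_dx(x):
    # Chain-rule result from the worked example: 15(3x + 2)^4
    return 15 * (3 * x + 2) ** 4

x0 = 1.0
h = 1e-6

numeric = (y(x0 + h) - y(x0 - h)) / (2 * h)   # central difference approximation
analytic = dy_dx(x0)                          # 15 * 5^4 = 9375

print(f"numeric  = {numeric:.4f}")
print(f"analytic = {analytic:.4f}")
```

The same comparison, applied to every weight, is the idea behind "gradient checking" a hand-written backpropagation routine.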

2.2.3 Partial Derivatives and Gradients

When a function has multiple inputs (like a loss function with many weights), we take partial derivatives: differentiate w.r.t. one variable while holding the others constant.

Partial Derivative
f(x, y) = 3x²y + 2xy³

∂f/∂x = 6xy + 2y³   (treat y as constant)
∂f/∂y = 3x² + 6xy²   (treat x as constant)

The Gradient Vector

Gradient
∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]ᵀ

The gradient is a vector of all partial derivatives.
It points in the direction of steepest ascent.
To minimize loss: move in the opposite direction → w = w - α∇L

Gradient Descent Intuition

Imagine you're blindfolded on a hilly terrain (the loss surface). The gradient tells you which direction is uphill. You take a step in the opposite direction (downhill). The learning rate α controls how big each step is. Too big → you overshoot. Too small → you crawl forever. This is gradient descent: the fundamental learning algorithm of deep learning.
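The "too big" and "too small" cases can be seen on a one-dimensional toy problem. This sketch assumes a quadratic loss L(w) = (w − 3)², chosen only for illustration, and the helper `run_gd` is a hypothetical name.

```python
def run_gd(lr, steps=25, w0=0.0):
    """Gradient descent on L(w) = (w - 3)^2, whose gradient is 2(w - 3)."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * (w - 3)   # update rule: w = w - lr * gradient
    return w

w_small = run_gd(lr=0.001)   # too small: after 25 steps w has barely left 0
w_good = run_gd(lr=0.1)      # reasonable: w ends close to the minimum at 3
w_big = run_gd(lr=1.1)       # too big: every step overshoots and w diverges

print(f"lr=0.001 -> w = {w_small:.4f}")
print(f"lr=0.1   -> w = {w_good:.4f}")
print(f"lr=1.1   -> w = {w_big:.4f}")
```

For this loss each step multiplies the distance to the minimum by (1 − 2·lr), so any learning rate above 1 makes that factor larger than 1 in magnitude and the iterates move away from the minimum.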

2.2.4 Jacobian and Hessian (Brief Introduction)

Jacobian Matrix
For a vector-valued function f: ℝⁿ → ℝᵐ:

J_{ij} = ∂fᵢ/∂xⱼ   (shape: m × n)

The Jacobian generalizes the gradient to vector-valued functions.
Used in: Backpropagation through layers with vector outputs.

Hessian Matrix
H_{ij} = ∂²f / (∂xᵢ ∂xⱼ)   (shape: n × n)

Matrix of second-order partial derivatives.
Tells you about curvature of the loss surface.
Used in: Second-order optimization (Newton's method, L-BFGS).

Pro Tip

You won't compute Jacobians or Hessians by hand in practice; frameworks like PyTorch handle this with automatic differentiation. But understanding them conceptually helps you debug training issues and read advanced papers on optimization.

2.2.5 Worked Example: Derivative of the Sigmoid Function

Deriving σ'(z) Step by Step

The sigmoid function is one of the most important activation functions:

Sigmoid Function
σ(z) = 1 / (1 + e⁻ᶻ)

Let's derive its derivative from scratch:

1. Rewrite: σ(z) = (1 + e⁻ᶻ)⁻¹
2. Apply chain rule: Let u = 1 + e⁻ᶻ, so σ = u⁻¹
    dσ/dz = dσ/du · du/dz
3. Compute dσ/du: d/du [u⁻¹] = -u⁻² = -1/(1 + e⁻ᶻ)²
4. Compute du/dz: d/dz [1 + e⁻ᶻ] = -e⁻ᶻ
5. Multiply: dσ/dz = [-1/(1 + e⁻ᶻ)²] · [-e⁻ᶻ] = e⁻ᶻ / (1 + e⁻ᶻ)²
6. Simplify:
    = [1/(1 + e⁻ᶻ)] · [e⁻ᶻ/(1 + e⁻ᶻ)]
    = [1/(1 + e⁻ᶻ)] · [(1 + e⁻ᶻ - 1)/(1 + e⁻ᶻ)]
    = [1/(1 + e⁻ᶻ)] · [1 - 1/(1 + e⁻ᶻ)]
    = σ(z) · (1 - σ(z))

Sigmoid Derivative: Key Result
σ'(z) = σ(z) · (1 − σ(z))

Maximum value: σ'(0) = 0.5 × 0.5 = 0.25
As |z| → ∞, σ'(z) → 0 (vanishing gradient!)

Why this matters: The sigmoid derivative is at most 0.25. When you chain many sigmoid layers, these derivative factors multiply: 0.25 × 0.25 × 0.25 ≈ 0.016. After 10 layers the factor is at most 0.25¹⁰ ≈ 0.000001. The gradient vanishes and the early layers stop learning. This is the vanishing gradient problem, and it is a key reason ReLU replaced sigmoid in hidden layers.
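The shrinking product of sigmoid derivatives can be computed directly. The sketch below is deliberately simplified: it ignores the weight matrices, which also scale the backpropagated gradient, and the pre-activation value z = 2 in the second part is an arbitrary assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Best case: every layer sits exactly at z = 0, where the derivative peaks at 0.25.
best_case = sigmoid_derivative(0.0)
for depth in [1, 3, 10]:
    print(f"{depth:>2} layers: factor <= {best_case ** depth:.2e}")

# Away from zero the shrinkage is faster still (hypothetical: z = 2 in all 10 layers).
z = np.full(10, 2.0)
typical = np.prod(sigmoid_derivative(z))
print(f"10 layers at z = 2: factor = {typical:.2e}")
```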

2.3 Probability & Statistics for Deep Learning

Neural networks don't output certainties; they output probabilities. Understanding probability is essential for designing loss functions, interpreting outputs, and reasoning about uncertainty.

2.3.1 Bernoulli Distribution

Bernoulli Distribution
X ~ Bernoulli(p)

P(X = 1) = p     P(X = 0) = 1 - p

Compact form: P(X = x) = pˣ(1-p)¹⁻ˣ   for x ∈ {0, 1}

Mean: E[X] = p   |   Variance: Var(X) = p(1-p)

Deep learning connection: Binary classification IS a Bernoulli distribution. When your model outputs P(churn) = 0.82 for a Paytm user, it's saying: "This user's churn follows Bernoulli(0.82)."

2.3.2 Gaussian (Normal) Distribution

Gaussian Distribution
X ~ N(μ, σ²)

p(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

Mean: μ   |   Variance: σ²   |   68-95-99.7 rule

Deep learning connection: Weight initialization (Xavier, He) draws from Gaussians. Noise in VAEs is Gaussian. Regression targets are often modeled as Gaussian with learned mean and variance.

2.3.3 Conditional Probability & Bayes' Theorem

Conditional Probability
P(A|B) = P(A ∩ B) / P(B)    (probability of A given B)

Bayes' Theorem
P(A|B) = P(B|A) · P(A) / P(B)

posterior = (likelihood × prior) / evidence

India Connect: Bayes in Action

Flipkart's search engine uses Bayesian reasoning: P(user wants "iPhone" | typed "i phone") is high because P(typed "i phone" | wants "iPhone") is very high (likelihood), and iPhones are frequently searched (prior). This is how spelling correction and query understanding work.

2.3.4 Maximum Likelihood Estimation (MLE)

MLE answers: "Given the data we observed, what parameter values make the data most probable?"

MLE Principle
θ̂_MLE = argmax_θ P(Data | θ)

= argmax_θ Π P(xᵢ | θ)   (assuming i.i.d. samples)

= argmax_θ Σ log P(xᵢ | θ)   (log-likelihood, easier to optimize)

2.3.5 MLE for Bernoulli: Full Derivation

Worked Example: MLE for PhonePe Fraud Probability

PhonePe observes 1000 UPI transactions. 47 are fraudulent (y=1), 953 are legitimate (y=0). What is the MLE estimate of the fraud probability p?

1. Model: Each transaction yᵢ ~ Bernoulli(p). We observe y₁, y₂, ..., y₁₀₀₀.
2. Likelihood:
    L(p) = Π_{i=1}^{1000} p^{yᵢ} (1-p)^{1-yᵢ}
    = p^{Σyᵢ} (1-p)^{n - Σyᵢ}
    = p⁴⁷ (1-p)⁹⁵³
3. Log-likelihood:
    ℓ(p) = log L(p) = 47·log(p) + 953·log(1-p)
4. Differentiate and set to zero:
    dℓ/dp = 47/p − 953/(1−p) = 0
    47(1−p) = 953p
    47 − 47p = 953p
    47 = 1000p
    p̂ = 47/1000 = 0.047
5. Verify it's a maximum:
    d²ℓ/dp² = −47/p² − 953/(1−p)² < 0   ✓ (concave → maximum)

Result: The MLE estimate of fraud probability is 4.7%, exactly the observed proportion! For Bernoulli, MLE always gives p̂ = (number of successes) / (total trials).
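The closed-form answer can be cross-checked by brute force: evaluate the log-likelihood on a grid of candidate values and pick the largest. This is only a sketch; the grid spacing of 0.001 is an arbitrary choice, and a grid search is not how MLE is normally computed.

```python
import numpy as np

k, n = 47, 1000                        # fraud count and total transactions

def log_likelihood(p):
    # l(p) = k log p + (n - k) log(1 - p), as in step 3 of the derivation
    return k * np.log(p) + (n - k) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 999)  # candidates 0.001, 0.002, ..., 0.999
p_best = grid[np.argmax(log_likelihood(grid))]

print(f"Grid-search MLE: {p_best:.3f}")   # 0.047
print(f"Closed form k/n: {k / n:.3f}")    # 0.047
```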

2.3.6 From MLE to Cross-Entropy Loss

Here's the most important derivation in this chapter: why we use cross-entropy as a loss function.

The Bridge: Negative Log-Likelihood = Cross-Entropy

When our neural network outputs probability ŷᵢ for each sample, and the true label is yᵢ ∈ {0,1}, we want to maximize the likelihood. Equivalently, we minimize the negative log-likelihood:

From MLE to Binary Cross-Entropy
Likelihood: L = Π ŷᵢ^{yᵢ} (1-ŷᵢ)^{1-yᵢ}

Log-likelihood: ℓ = Σ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]

Negative log-likelihood, averaged over the n samples (loss to minimize):
L_BCE = −(1/n) Σ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]

This IS Binary Cross-Entropy Loss!

Cross-entropy is not an arbitrary choice: it emerges naturally from maximum likelihood estimation under a Bernoulli model. When someone says "we use cross-entropy loss for classification," they're saying "we're doing MLE."

Fun Fact

MSE (Mean Squared Error) is also an MLE result, but for a Gaussian model. If you assume your regression targets follow y ~ N(ŷ, σ²) with a fixed variance σ², then maximizing the log-likelihood is equivalent to minimizing the MSE loss. Many standard loss functions have a probabilistic interpretation like this!
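The link between the Gaussian model and MSE can be verified numerically. The sketch below assumes a fixed, known noise level `sigma` and uses random numbers as stand-ins for targets and predictions; none of these values come from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(50)        # illustrative targets
y_hat = rng.standard_normal(50)    # illustrative predictions
sigma = 1.5                        # assumed fixed noise standard deviation

# Average negative log-likelihood under y_i ~ N(y_hat_i, sigma^2)
constant = 0.5 * np.log(2 * np.pi * sigma**2)
nll = np.mean(constant + (y - y_hat) ** 2 / (2 * sigma**2))

mse = np.mean((y - y_hat) ** 2)

# NLL = MSE / (2 sigma^2) + constant, so the same predictions minimize both.
print(np.isclose(nll, mse / (2 * sigma**2) + constant))   # True
```

Because the scale factor 1/(2σ²) and the additive constant do not depend on the predictions, minimizing the negative log-likelihood and minimizing MSE pick out the same model.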

2.4 Information Theory for Deep Learning

Information theory, pioneered by Claude Shannon (1948), gives us a mathematical framework for quantifying uncertainty and information. It provides the theoretical foundation for why cross-entropy is the natural loss function for classification.

2.4.1 Entropy: Measuring Uncertainty

Shannon Entropy
H(p) = −Σ p(x) log₂ p(x)

(In deep learning, we use natural log: H(p) = −Σ p(x) ln p(x))

Intuition: Entropy measures how "surprised" you are on average. A fair coin (p=0.5) has maximum entropy: you're maximally uncertain. A biased coin (p=0.99) has low entropy: you are almost sure it'll be heads.

Examples

Distribution  | p(heads) | H (bits) | Interpretation
Fair coin     | 0.5      | 1.0      | Maximum uncertainty
Biased coin   | 0.9      | 0.47     | Fairly predictable
Certain coin  | 1.0      | 0.0      | No uncertainty
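The three coin entropies can be reproduced with a short helper. This is a sketch: `binary_entropy_bits` is a hypothetical name, and it uses the equivalent form Σ p log₂(1/p) with the usual convention that outcomes of probability 0 contribute nothing.

```python
import numpy as np

def binary_entropy_bits(p):
    """Entropy in bits of a coin with P(heads) = p."""
    probs = np.array([p, 1.0 - p])
    probs = probs[probs > 0]                 # drop zero-probability outcomes
    return float(np.sum(probs * np.log2(1.0 / probs)))

for p in [0.5, 0.9, 1.0]:
    print(f"p = {p}: H = {binary_entropy_bits(p):.2f} bits")
```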

2.4.2 KL Divergence: Measuring Distribution Difference

Kullback-Leibler Divergence
D_KL(p ‖ q) = Σ p(x) log [p(x) / q(x)]

= Σ p(x) log p(x) − Σ p(x) log q(x)

= −H(p) + H(p, q)

Properties:
• D_KL ≥ 0 (always non-negative: Gibbs' inequality)
• D_KL = 0 iff p = q
• NOT symmetric: D_KL(p‖q) ≠ D_KL(q‖p) in general

Intuition: KL divergence measures how much "extra information" you need when you use distribution q to approximate distribution p. It's the "penalty" for using the wrong distribution.

2.4.3 Cross-Entropy: The Natural Loss

Cross-Entropy
H(p, q) = −Σ p(x) log q(x)

= H(p) + D_KL(p ‖ q)

Since H(p) is constant w.r.t. model parameters:
Minimizing Cross-Entropy = Minimizing KL Divergence

Why Cross-Entropy is the Natural Classification Loss

Three perspectives converge to the same answer:

  1. MLE perspective: Cross-entropy = negative log-likelihood of a Bernoulli/Categorical model
  2. Information theory perspective: Minimizing cross-entropy = minimizing KL divergence between true and predicted distributions
  3. Practical perspective: Cross-entropy produces stronger gradients than MSE for wrong predictions (no gradient saturation)

All three say: cross-entropy is the right loss for classification.

Numerical Example: Why Cross-Entropy > MSE for Classification

True label: y = 1. Model predicts ŷ = 0.01 (confidently wrong!).

Loss Function              | Value                  | Gradient w.r.t. ŷ
MSE: (y − ŷ)²              | (1 − 0.01)² = 0.98     | −2(1 − 0.01) = −1.98
Cross-Entropy: −y log(ŷ)   | −log(0.01) = 4.61      | −1/0.01 = −100

Cross-entropy gives a gradient of −100 vs MSE's −1.98. When the model is confidently wrong, cross-entropy's gradient is about 50× larger and pushes stronger corrections. That's why training typically converges faster with cross-entropy.
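The comparison becomes sharper when the gradient is taken with respect to the pre-activation z rather than ŷ. The sketch below assumes the prediction comes from a sigmoid output, ŷ = σ(z), and picks z so that ŷ = 0.01; the variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0
z = np.log(0.01 / 0.99)          # logit chosen so that sigmoid(z) = 0.01
y_hat = sigmoid(z)

# Gradients w.r.t. the prediction y_hat (the values in the table)
dmse_dyhat = -2 * (y - y_hat)    # about -1.98
dce_dyhat = -y / y_hat           # about -100

# Chain through the sigmoid: d y_hat / dz = y_hat * (1 - y_hat)
dyhat_dz = y_hat * (1 - y_hat)
dmse_dz = dmse_dyhat * dyhat_dz  # about -0.0196: the sigmoid factor shrinks it
dce_dz = dce_dyhat * dyhat_dz    # equals y_hat - y = -0.99: the factor cancels

print(f"MSE gradient w.r.t. z: {dmse_dz:.4f}")
print(f"CE  gradient w.r.t. z: {dce_dz:.4f}")
```

With a sigmoid output, the cross-entropy gradient simplifies to ŷ − y, so it stays large exactly when the prediction is badly wrong, whereas the MSE gradient is damped by the saturated sigmoid.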

Common Mistake

Students often confuse entropy H(p), cross-entropy H(p,q), and KL divergence D_KL(p‖q). Remember: H(p,q) = H(p) + D_KL(p‖q). Cross-entropy decomposes into "irreducible uncertainty" (entropy) plus "extra cost from using the wrong model" (KL divergence).

NumPy Code Lab: Math from Scratch

Let's implement every key operation in NumPy. This is where formulas become executable code.

4.1 Linear Algebra Operations

Python
import numpy as np

# ------------------ SCALARS, VECTORS, MATRICES, TENSORS ------------------

# Scalar
learning_rate = 0.01

# Vector: 5 features of a Paytm user
user = np.array([25, 12500, 3, 0.85, 450])
print(f"User vector shape: {user.shape}")   # (5,)

# Matrix: 100 users × 5 features
np.random.seed(42)
X = np.random.randn(100, 5)
print(f"Data matrix shape: {X.shape}")    # (100, 5)

# Tensor: batch of 32 color images (28×28)
images = np.random.randn(32, 28, 28, 3)
print(f"Image batch shape: {images.shape}")  # (32, 28, 28, 3)

# ------------------ DOT PRODUCT ------------------

# Vector dot product: weighted sum of features
weights = np.array([0.3, 0.0001, -0.5, 2.0, 0.002])
score = np.dot(user, weights)
print(f"Churn score: {score:.4f}")

# Matrix multiplication: all users at once
W = np.random.randn(5, 3)    # 5 features → 3 hidden neurons
H = X @ W                      # (100,5) @ (5,3) = (100,3)
print(f"Hidden layer shape: {H.shape}")  # (100, 3)

# ------------------ TRANSPOSE ------------------
A = np.array([[1, 2, 3],
              [4, 5, 6]])
print(f"A shape: {A.shape}")          # (2, 3)
print(f"A^T shape: {A.T.shape}")      # (3, 2)

# Verify (AB)^T = B^T A^T
B = np.random.randn(3, 4)
lhs = (A @ B).T
rhs = B.T @ A.T
print(f"(AB)^T == B^T A^T: {np.allclose(lhs, rhs)}")  # True

# ------------------ INVERSE ------------------
M = np.array([[4, 7], [2, 6]])
M_inv = np.linalg.inv(M)
print(f"M · M⁻¹ =\n{M @ M_inv}")     # Identity matrix (up to floating-point rounding)

# ------------------ HADAMARD (ELEMENT-WISE) PRODUCT ------------------
gate = np.array([[1, 0, 1],
                 [0, 1, 0]])   # Binary mask (like dropout)
data = np.array([[5, 3, 8],
                 [2, 7, 4]])
masked = gate * data            # Hadamard product: * in NumPy
print(f"Hadamard:\n{masked}")     # [[5, 0, 8], [0, 7, 0]]

4.2 Calculus: Numerical Gradient & Sigmoid

Python
import numpy as np

# ------------------ SIGMOID AND ITS DERIVATIVE ------------------

def sigmoid(z):
    """σ(z) = 1 / (1 + e^(-z))"""
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """σ'(z) = σ(z)(1 - σ(z))"""
    s = sigmoid(z)
    return s * (1 - s)

# Verify: maximum of sigmoid derivative is at z=0
# (1001 points so that z = 0 lies exactly on the grid)
z_values = np.linspace(-6, 6, 1001)
derivs = sigmoid_derivative(z_values)
print(f"Max σ'(z) = {derivs.max():.4f} at z = {z_values[derivs.argmax()]:.2f}")
# Max σ'(z) = 0.2500 at z = 0.00

# ------------------ NUMERICAL GRADIENT ------------------

def numerical_gradient(f, x, h=1e-7):
    """Compute gradient numerically (central difference); x must be a float array"""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_plus[i] += h
        x_minus = x.copy()
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# Test: f(x,y) = x² + 3xy → ∂f/∂x = 2x+3y, ∂f/∂y = 3x
def test_func(v):
    x, y = v[0], v[1]
    return x**2 + 3*x*y

point = np.array([2.0, 5.0])
num_grad = numerical_gradient(test_func, point)
analytical_grad = np.array([2*2 + 3*5, 3*2])  # [19, 6]

print(f"Numerical gradient:  {num_grad}")
print(f"Analytical gradient: {analytical_grad}")
print(f"Match: {np.allclose(num_grad, analytical_grad)}")  # True

# ------------------ GRADIENT DESCENT DEMO ------------------

def loss(w):
    """Simple quadratic loss: L = (w - 3)²"""
    return (w - 3)**2

w = 0.0           # Start at w=0
lr = 0.1          # Learning rate
print(f"{'Step':>4} {'w':>8} {'Loss':>10}")
for step in range(20):
    grad = 2 * (w - 3)   # dL/dw = 2(w-3)
    w = w - lr * grad      # Gradient descent update
    print(f"{step+1:>4} {w:>8.4f} {loss(w):>10.6f}")
# w converges to 3.0 (the minimum)

4.3 Probability: Distributions, MLE, Cross-Entropy

Python
import numpy as np

# ------------------ BERNOULLI DISTRIBUTION ------------------

def bernoulli_pmf(x, p):
    """P(X=x) = p^x * (1-p)^(1-x)"""
    return (p ** x) * ((1 - p) ** (1 - x))

# PhonePe fraud example: p = 0.047
p_fraud = 0.047
print(f"P(fraud)     = {bernoulli_pmf(1, p_fraud):.4f}")   # 0.047
print(f"P(not fraud) = {bernoulli_pmf(0, p_fraud):.4f}")   # 0.953

# ------------------ GAUSSIAN DISTRIBUTION ------------------

def gaussian_pdf(x, mu, sigma):
    """p(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))"""
    coeff = 1 / np.sqrt(2 * np.pi * sigma**2)
    exponent = -((x - mu)**2) / (2 * sigma**2)
    return coeff * np.exp(exponent)

# Zomato delivery time: mean 35 min, std 8 min
# Note: these are probability densities, not probabilities
x = np.array([25, 30, 35, 40, 50])
densities = gaussian_pdf(x, mu=35, sigma=8)
for xi, density in zip(x, densities):
    print(f"density at delivery={xi} min: {density:.4f}")

# ------------------ MLE FOR BERNOULLI ------------------

# Simulate 1000 transactions, 47 fraudulent
np.random.seed(42)
transactions = np.zeros(1000)
transactions[:47] = 1
np.random.shuffle(transactions)

# MLE estimate: p_hat = sum(x) / n
p_hat = transactions.mean()
print(f"\nMLE estimate of fraud probability: {p_hat:.4f}")  # 0.047

# Log-likelihood at p_hat
log_lik = np.sum(transactions * np.log(p_hat) +
                 (1 - transactions) * np.log(1 - p_hat))
print(f"Log-likelihood at p_hat: {log_lik:.2f}")

# ------------------ CROSS-ENTROPY LOSS ------------------

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    """BCE = -(1/n) Σ [y log(ŷ) + (1-y) log(1-ŷ)]"""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # Numerical stability
    return -np.mean(
        y_true * np.log(y_pred) +
        (1 - y_true) * np.log(1 - y_pred)
    )

# Test: true labels vs model predictions
y_true = np.array([1, 0, 1, 1, 0])
y_good = np.array([0.9, 0.1, 0.85, 0.95, 0.05])  # Good model
y_bad  = np.array([0.3, 0.8, 0.4, 0.2, 0.7])   # Bad model

print(f"\nGood model BCE: {binary_cross_entropy(y_true, y_good):.4f}")
print(f"Bad model  BCE: {binary_cross_entropy(y_true, y_bad):.4f}")
# Good model has LOWER loss → cross-entropy works!

# ------------------ ENTROPY & KL DIVERGENCE ------------------

def entropy(p):
    """H(p) = -Σ p(x) log p(x)"""
    p = p[p > 0]  # Avoid log(0)
    return -np.sum(p * np.log(p))

def kl_divergence(p, q, eps=1e-15):
    """D_KL(p || q) = Σ p(x) log(p(x)/q(x))"""
    q = np.clip(q, eps, 1.0)
    mask = p > 0  # Terms with p(x) = 0 contribute 0; skipping them avoids 0 * log(0) = NaN
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def cross_entropy(p, q, eps=1e-15):
    """H(p, q) = -Σ p(x) log q(x)"""
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

# True distribution vs two models
p_true = np.array([0.7, 0.2, 0.1])   # 3-class problem
q_good = np.array([0.65, 0.25, 0.1])
q_bad  = np.array([0.2, 0.3, 0.5])

print(f"\nEntropy H(p):        {entropy(p_true):.4f}")
print(f"KL(p||q_good):       {kl_divergence(p_true, q_good):.4f}")
print(f"KL(p||q_bad):        {kl_divergence(p_true, q_bad):.4f}")
print(f"CrossEnt(p, q_good): {cross_entropy(p_true, q_good):.4f}")
print(f"CrossEnt(p, q_bad):  {cross_entropy(p_true, q_bad):.4f}")
print(f"Verify: H(p) + KL(p||q_good) = {entropy(p_true) + kl_divergence(p_true, q_good):.4f}")
# Should equal cross_entropy(p_true, q_good)

Visual Walkthrough: Matrix Multiplication

Let's trace through a (3×2) × (2×3) matrix multiplication by hand, step by step. The result is a 3×3 matrix.

MATRIX MULTIPLICATION: C = A × B (3×3 result)

   A (3×2)          B (2×3)              C (3×3)
  ┌       ┐      ┌            ┐      ┌               ┐
  │ 1   2 │      │  7   8   9 │      │ C₁₁  C₁₂  C₁₃ │
  │ 3   4 │  ×   │ 10  11  12 │  =   │ C₂₁  C₂₂  C₂₃ │
  │ 5   6 │      └            ┘      │ C₃₁  C₃₂  C₃₃ │
  └       ┘                          └               ┘

Step-by-step computation:

C₁₁ = Row1(A) · Col1(B) = (1×7) + (2×10) = 7 + 20 = 27
C₁₂ = Row1(A) · Col2(B) = (1×8) + (2×11) = 8 + 22 = 30
C₁₃ = Row1(A) · Col3(B) = (1×9) + (2×12) = 9 + 24 = 33
C₂₁ = Row2(A) · Col1(B) = (3×7) + (4×10) = 21 + 40 = 61
C₂₂ = Row2(A) · Col2(B) = (3×8) + (4×11) = 24 + 44 = 68
C₂₃ = Row2(A) · Col3(B) = (3×9) + (4×12) = 27 + 48 = 75
C₃₁ = Row3(A) · Col1(B) = (5×7) + (6×10) = 35 + 60 = 95
C₃₂ = Row3(A) · Col2(B) = (5×8) + (6×11) = 40 + 66 = 106
C₃₃ = Row3(A) · Col3(B) = (5×9) + (6×12) = 45 + 72 = 117

Result:
  ┌               ┐
  │ 27    30    33 │    Shape check: (3×2) × (2×3) = (3×3) ✓
  │ 61    68    75 │    Inner dims match: 2 = 2 ✓
  │ 95   106   117 │    Total multiplications: 3×3×2 = 18
  └               ┘

NEURAL NETWORK FORWARD PASS: z = Wx + b

  Input x (3×1)     Weights W (2×3)        Bias b      Output z (2×1)
    ┌     ┐       ┌                ┐      ┌      ┐       ┌    ┐
    │ 0.5 │       │  0.2  0.8 -0.1 │      │  0.1 │       │ z₁ │
    │ 1.0 │   →   │ -0.5  0.3  0.6 │  +   │ -0.2 │   =   │ z₂ │
    │ 0.3 │       └                ┘      └      ┘       └    ┘
    └     ┘

z₁ = (0.2×0.5) + (0.8×1.0) + (-0.1×0.3) + 0.1 = 0.10 + 0.80 - 0.03 + 0.10 = 0.97
z₂ = (-0.5×0.5) + (0.3×1.0) + (0.6×0.3) + (-0.2) = -0.25 + 0.30 + 0.18 - 0.20 = 0.03

After sigmoid: σ(0.97) = 0.725   σ(0.03) = 0.507

Each neuron computes a weighted sum + bias, then passes through an activation function.
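Both hand traces in this walkthrough can be confirmed with NumPy. The sketch below uses the matrices exactly as given; the activation vector name `a` is an illustrative choice.

```python
import numpy as np

# Matrix multiplication trace: (3, 2) @ (2, 3) -> (3, 3)
A = np.array([[1, 2], [3, 4], [5, 6]])
B = np.array([[7, 8, 9], [10, 11, 12]])
C = A @ B
print(C)

# Forward pass trace: z = Wx + b, followed by a sigmoid
W = np.array([[ 0.2, 0.8, -0.1],
              [-0.5, 0.3,  0.6]])   # shape (2, 3)
x = np.array([0.5, 1.0, 0.3])       # shape (3,)
b = np.array([0.1, -0.2])           # shape (2,)

z = W @ x + b                       # weighted sums plus bias
a = 1.0 / (1.0 + np.exp(-z))        # sigmoid activation

print(np.round(z, 2))               # [0.97 0.03]
print(np.round(a, 3))               # [0.725 0.507]
```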

Worked Examples: Hand Calculations

Example 1: Matrix Operations for a Jio User Dataset

A Jio dataset has 3 users × 3 features: [data_usage_GB, recharge_₹, calls_min]

Calculation
X = ┌             ┐       W = ┌       ┐
    │ 12   399  45│           │  0.1  │
    │  5   199  80│           │  0.005│    (churn weight vector)
    │ 25   599  20│           │ -0.02 │
    └             ┘           └       ┘

Compute X·W (churn scores):

1. User 1: (12×0.1) + (399×0.005) + (45×(−0.02)) = 1.2 + 1.995 − 0.9 = 2.295
2. User 2: (5×0.1) + (199×0.005) + (80×(−0.02)) = 0.5 + 0.995 − 1.6 = −0.105
3. User 3: (25×0.1) + (599×0.005) + (20×(−0.02)) = 2.5 + 2.995 − 0.4 = 5.095

After sigmoid: σ(2.295) = 0.908, σ(−0.105) = 0.474, σ(5.095) = 0.994

Interpretation: User 3 (high data, high recharge, few calls) has a 99.4% predicted churn probability, perhaps a heavy user switching to another carrier. User 2, at 47.4%, is borderline.

Example 2: Cross-Entropy Loss Calculation

3 Flipkart customers. True labels (will return product): y = [1, 0, 1]

Model predictions: ŷ = [0.8, 0.3, 0.6]

1. Per-sample loss:
    L₁ = −[1·log(0.8) + 0·log(0.2)] = −log(0.8) = −(−0.2231) = 0.2231
    L₂ = −[0·log(0.3) + 1·log(0.7)] = −log(0.7) = −(−0.3567) = 0.3567
    L₃ = −[1·log(0.6) + 0·log(0.4)] = −log(0.6) = −(−0.5108) = 0.5108
2. Average loss:
    BCE = (0.2231 + 0.3567 + 0.5108) / 3 = 0.3635

Interpretation: Sample 3 (y=1, ŷ=0.6) contributes the most loss because the model assigns the lowest probability to the correct answer. The penalty is −log of the probability given to the true class, so it grows slowly near 1 and sharply as that probability approaches 0.

Example 3: Computing Entropy and KL Divergence

Swiggy food category distribution in Bangalore:

True: p = [Biryani: 0.4, Pizza: 0.3, Dosa: 0.2, Other: 0.1]

Model A: q_A = [0.35, 0.30, 0.25, 0.10]

Model B: q_B = [0.10, 0.10, 0.10, 0.70]

1. Entropy H(p):
    H = −[0.4 ln(0.4) + 0.3 ln(0.3) + 0.2 ln(0.2) + 0.1 ln(0.1)]
    = −[0.4(−0.916) + 0.3(−1.204) + 0.2(−1.609) + 0.1(−2.303)]
    = −[−0.367 − 0.361 − 0.322 − 0.230]
    = 1.280 nats
2. KL(p ‖ q_A):
    = 0.4 ln(0.4/0.35) + 0.3 ln(0.3/0.30) + 0.2 ln(0.2/0.25) + 0.1 ln(0.1/0.10)
    = 0.4(0.134) + 0.3(0) + 0.2(−0.223) + 0.1(0)
    = 0.054 − 0.045 = 0.009 nats (very close!)
3. KL(p ‖ q_B):
    = 0.4 ln(0.4/0.10) + 0.3 ln(0.3/0.10) + 0.2 ln(0.2/0.10) + 0.1 ln(0.1/0.70)
    = 0.4(1.386) + 0.3(1.099) + 0.2(0.693) + 0.1(−1.946)
    = 0.554 + 0.330 + 0.139 − 0.195 = 0.828 nats (very far!)

Conclusion: Model A (KL ≈ 0.009) has a KL divergence roughly 90× smaller than Model B (KL ≈ 0.828), so it approximates the true distribution far better. KL divergence correctly captures that Model B's prediction of 70% "Other" is badly wrong for this order distribution.
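The hand calculations in this example can be re-run in NumPy, which also makes it easy to check the identity H(p, q) = H(p) + D_KL(p‖q) and the asymmetry of KL divergence. This is a sketch using the distributions as given; the variable names are illustrative.

```python
import numpy as np

p   = np.array([0.40, 0.30, 0.20, 0.10])   # true: Biryani, Pizza, Dosa, Other
q_a = np.array([0.35, 0.30, 0.25, 0.10])   # Model A
q_b = np.array([0.10, 0.10, 0.10, 0.70])   # Model B

entropy_p = -np.sum(p * np.log(p))          # H(p) in nats
kl_a = np.sum(p * np.log(p / q_a))          # D_KL(p || q_A)
kl_b = np.sum(p * np.log(p / q_b))          # D_KL(p || q_B)
cross_a = -np.sum(p * np.log(q_a))          # H(p, q_A)

# Reversing the arguments gives a different number: KL is not symmetric.
kl_a_reversed = np.sum(q_a * np.log(q_a / p))

print(f"H(p)          = {entropy_p:.3f} nats")
print(f"KL(p || q_A)  = {kl_a:.4f} nats")
print(f"KL(p || q_B)  = {kl_b:.4f} nats")
print(f"KL(q_A || p)  = {kl_a_reversed:.4f} nats")
print(f"H(p, q_A) == H(p) + KL(p || q_A): {np.isclose(cross_a, entropy_p + kl_a)}")
```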

Common Mistakes & Pitfalls

Mistake 1: Confusing Matrix Multiply and Element-wise Multiply

Wrong: Using A * B (Hadamard) when you mean A @ B (matrix multiply) in NumPy. These are completely different operations! A * B requires same shapes; A @ B requires inner dimensions to match.

Fix: Always use @ or np.dot() for matrix multiplication. Reserve * for element-wise operations.

Mistake 2: Shape Mismatches in Matrix Multiplication

Wrong: Trying to multiply (100, 5) × (100, 3). Inner dimensions don't match (5 ≠ 100).

Fix: Always write shapes side by side: (m×n) × (n×p). The two inner dimensions (both n) must be equal. If they are not, check whether transposing one matrix gives the product you actually intend.
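The mismatch from this example can be reproduced directly. In this sketch X and Y are placeholder arrays of ones with the shapes from the text; whether the transposed product X.T @ Y is the quantity you want depends on your data, so treat it as one possible repair rather than a rule.

```python
import numpy as np

X = np.ones((100, 5))   # e.g. 100 samples, 5 features
Y = np.ones((100, 3))   # e.g. 100 samples, 3 targets

try:
    X @ Y               # inner dimensions are 5 and 100, so this fails
    failed = False
except ValueError as err:
    failed = True
    print("matmul failed:", err)

# Transposing X lines up the inner dimensions: (5, 100) @ (100, 3)
G = X.T @ Y
print(G.shape)          # (5, 3)
```

Printing shapes before every product is a cheap habit that catches most of these errors early.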

Mistake 3: Using MSE for Classification

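Wrong: Using MSE as the loss for binary classification. With a sigmoid output, the MSE gradient contains the factor σ'(z), which shrinks toward zero when the output is near 0 or 1, even if the prediction is badly wrong.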

Fix: Always use cross-entropy for classification. It's the MLE-optimal loss and produces stronger gradients for wrong predictions.
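The gap between the two gradients can be quantified for a single sigmoid output. The sketch below assumes the squared error is written as (ŷ − y)² without a factor of 1/2, and compares gradients with respect to the pre-activation z; the function names are illustrative.

```python
def grad_bce_wrt_z(y_hat, y):
    # BCE with a sigmoid output: dL/dz = y_hat - y
    return y_hat - y

def grad_mse_wrt_z(y_hat, y):
    # Squared error (y_hat - y)**2 with a sigmoid output:
    # dL/dz = 2 * (y_hat - y) * sigma'(z), with sigma'(z) = y_hat * (1 - y_hat)
    return 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)

y = 1.0        # true label
ratios = {}
for y_hat in (0.5, 0.1, 0.01):   # increasingly wrong predictions
    g_bce = grad_bce_wrt_z(y_hat, y)
    g_mse = grad_mse_wrt_z(y_hat, y)
    ratios[y_hat] = g_bce / g_mse
    print(f"y_hat={y_hat:<5} BCE={g_bce:+.4f}  MSE={g_mse:+.4f}  ratio={ratios[y_hat]:.1f}")
# ratios: 2.0, 5.6, 50.5
```

The ratio is 1 / (2·ŷ·(1 − ŷ)), so the more confidently wrong the prediction, the larger cross-entropy's advantage.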

Mistake 4: Forgetting log(0) is Undefined

Wrong: Computing np.log(y_pred) when y_pred contains 0. Result: -inf or NaN.

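Fix: Always clip predictions: y_pred = np.clip(y_pred, 1e-15, 1-1e-15) before computing log. With float32 arrays use a larger epsilon such as 1e-7, because 1-1e-15 rounds to exactly 1.0 in float32.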
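The effect of clipping is easy to see on predictions that hit exactly 0 or 1. The arrays below are made-up values chosen to trigger the problem, and np.errstate is used only to silence the warnings NumPy would otherwise print for the unclipped version.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([1.0, 0.0, 0.0])   # last prediction is confidently wrong

# Unclipped: 0 * log(0) gives nan, 1 * log(0) gives -inf
with np.errstate(divide="ignore", invalid="ignore"):
    raw = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(raw)    # contains nan and inf

# Clipped: every term stays finite
eps = 1e-15
clipped = np.clip(y_pred, eps, 1 - eps)
safe = -(y_true * np.log(clipped) + (1 - y_true) * np.log(1 - clipped))
print(safe)   # the wrong prediction now costs about 34.54 instead of inf
```

Note that even the two correct predictions produce nan without clipping, because the term that should vanish is computed as 0 × (−inf).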

Mistake 5: Thinking KL Divergence is Symmetric

Wrong: Assuming D_KL(p‖q) = D_KL(q‖p). It's NOT: KL divergence is not a true "distance."

Fix: Always specify direction. In training, we minimize D_KL(p_true ‖ q_model), which is equivalent to minimizing cross-entropy H(p, q) because H(p) does not depend on the model.
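Swapping the arguments changes the answer, as a quick check with p and q_B from Example 3 shows. This sketch assumes all probabilities are strictly positive; kl_divergence is an illustrative helper name.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes every entry of p and q is positive."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p   = [0.40, 0.30, 0.20, 0.10]   # true distribution
q_b = [0.10, 0.10, 0.10, 0.70]   # Model B

forward = kl_divergence(p, q_b)
reverse = kl_divergence(q_b, p)
print(f"D_KL(p || q_B) = {forward:.3f}")   # 0.828
print(f"D_KL(q_B || p) = {reverse:.3f}")   # 1.044
```

The two directions weight the log-ratio by different distributions, which is why they disagree.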

Mistake 6: Ignoring the Vanishing Gradient of Sigmoid

Wrong: Stacking many sigmoid layers and wondering why the network doesn't learn.

Fix: Use ReLU for hidden layers. Reserve sigmoid only for the final output layer of binary classification. We derived that σ'(z) ≤ 0.25, so chaining many of these factors shrinks the gradient toward zero.
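The bound σ'(z) ≤ 0.25 and its compounding effect can be checked numerically. The sketch below deliberately ignores the weight matrices, which also multiply into the backpropagated gradient, so 0.25 raised to the depth is only the best-case contribution of the sigmoid factors; the depths chosen are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 with value 0.25
z = np.linspace(-10, 10, 2001)
max_slope = sigmoid_prime(z).max()
print(max_slope)

# Best-case product of sigmoid derivatives across several layers
bounds = {depth: 0.25 ** depth for depth in (1, 5, 10, 20)}
for depth, bound in bounds.items():
    print(f"{depth:>2} layers: at most {bound:.2e}")
```

After ten sigmoid layers the factor is already below one in a million, which is why early layers barely update.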

Exercises

Section A: Multiple Choice Questions

  1. What is the shape of the result when you multiply a matrix of shape (64, 128) with a matrix of shape (128, 10)?
    (a) (128, 128)   (b) (64, 10)   (c) (10, 64)   (d) (64, 128, 10)
    Answer: (b). (64×128) × (128×10) = (64×10)
  2. Which operation is used in LSTM gating mechanisms?
    (a) Matrix multiplication   (b) Hadamard (element-wise) product   (c) Matrix inverse   (d) Eigenvalue decomposition
    Answer: (b). Gates multiply element-wise with the cell state
  3. What is the derivative of σ(z) = 1/(1+e⁻ᶻ)?
    (a) σ(z)²   (b) σ(z)(1−σ(z))   (c) 1−σ(z)²   (d) e⁻ᶻ/(1+e⁻ᶻ)
    Answer: (b). σ'(z) = σ(z)(1 − σ(z)), derived in Section 2.2.5
  4. The maximum value of σ'(z) is:
    (a) 1.0   (b) 0.5   (c) 0.25   (d) 0.1
    Answer: (c). At z=0: σ(0)=0.5, σ'(0)=0.5×0.5=0.25
  5. Which of the following is the correct chain rule?
    (a) d/dx[f(g(x))] = f'(x)·g'(x)   (b) d/dx[f(g(x))] = f'(g(x))·g'(x)   (c) d/dx[f(g(x))] = f(g'(x))   (d) d/dx[f(g(x))] = f'(g(x))+g'(x)
    Answer: (b). Derivative of outer evaluated at inner × derivative of inner
  6. Cross-entropy loss for binary classification is derived from:
    (a) Mean Squared Error   (b) Maximum A Posteriori   (c) Maximum Likelihood Estimation   (d) Least Absolute Deviation
    Answer: (c). BCE = negative log-likelihood of Bernoulli MLE
  7. If P(fraud) = 0.03 and the model predicts P̂(fraud) = 0.01, the cross-entropy loss −[y·log(ŷ)] for a fraud sample (y=1) is:
    (a) −log(0.03)   (b) −log(0.01) = 4.61   (c) −log(0.99) = 0.01   (d) −log(0.97)
    Answer: (b). For y=1, loss = −log(ŷ) = −log(0.01) = 4.61
  8. KL divergence D_KL(p‖q) is always:
    (a) Negative   (b) Zero   (c) Non-negative (≥ 0)   (d) Symmetric
    Answer: (c). Gibbs' inequality guarantees D_KL ≥ 0, with equality iff p=q
  9. The gradient vector ∇f points in the direction of:
    (a) Steepest descent   (b) Steepest ascent   (c) Zero change   (d) Random direction
    Answer: (b). Gradient points to steepest ascent; we move opposite for descent
  10. Which is the correct relationship between entropy, cross-entropy, and KL divergence?
    (a) H(p,q) = H(p) − D_KL(p‖q)   (b) H(p,q) = H(p) + D_KL(p‖q)   (c) H(p,q) = D_KL(p‖q) − H(p)   (d) H(p,q) = H(p) × D_KL(p‖q)
    Answer: (b). Cross-entropy = entropy + KL divergence

Section B: Hand Calculation Problems

  1. Matrix Multiplication. Compute C = A × B by hand:
    A = ┌       ┐     B = ┌    ┐
        │ 2   3 │         │ 1  │
        │ 1  -1 │         │ 4  │
        └       ┘         └    ┘
    Solution:
    C = A × B = [(2×1)+(3×4), (1×1)+(−1×4)]ᵀ = [14, −3]ᵀ

    This is a (2×2) × (2×1) = (2×1) result. Each element is the dot product of a row of A with the single column of B.

  2. Chain Rule. Find dy/dx for y = ln(sin(3x²)).
    Solution:
    1. Let u = 3x², v = sin(u), y = ln(v)
    2. dy/dv = 1/v = 1/sin(3x²)
    3. dv/du = cos(u) = cos(3x²)
    4. du/dx = 6x
    5. dy/dx = (1/sin(3x²)) · cos(3x²) · 6x = 6x · cot(3x²)
  3. Gradient Computation. For L(w₁, w₂) = (2w₁ + 3w₂ − 7)², compute ∇L at w₁=1, w₂=1.
    Solution:
    1. At (1,1): L = (2+3−7)² = (−2)² = 4
    2. ∂L/∂w₁ = 2(2w₁+3w₂−7)·2 = 4(2+3−7) = 4(−2) = −8
    3. ∂L/∂w₂ = 2(2w₁+3w₂−7)·3 = 6(2+3−7) = 6(−2) = −12
    4. ∇L = [−8, −12]ᵀ → Move in opposite direction: [+8, +12]
  4. Cross-Entropy. True labels: y = [1, 0, 1, 0]. Predictions: ŷ = [0.9, 0.2, 0.7, 0.1]. Compute the binary cross-entropy loss.
    Solution:
    1. L₁ = −log(0.9) = 0.1054
    2. L₂ = −log(1−0.2) = −log(0.8) = 0.2231
    3. L₃ = −log(0.7) = 0.3567
    4. L₄ = −log(1−0.1) = −log(0.9) = 0.1054
    5. BCE = (0.1054+0.2231+0.3567+0.1054)/4 = 0.7906/4 ≈ 0.1976
  5. MLE. A Zomato delivery model observes 200 orders. 160 arrive on time (y=1), 40 are late (y=0). Find the MLE estimate of on-time probability. Then compute the log-likelihood at that estimate.
    Solution:
    1. p̂ = 160/200 = 0.80
    2. ℓ(p̂) = 160·ln(0.8) + 40·ln(0.2)
    3. = 160(−0.2231) + 40(−1.6094)
    4. = −35.70 + (−64.38) = −100.08

    The negative value is normal: the log of a probability below 1 is always negative, so the log-likelihood is too.
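The Section B answers can be cross-checked numerically once you have worked them by hand. This is a sketch rather than a reference solution: the test point x0 = 0.5 for Problem 2 is an arbitrary choice where sin(3x²) is positive, and last-digit differences from the hand calculations come from rounding intermediate terms.

```python
import numpy as np

# Problem 1: (2x2) @ (2x1)
A = np.array([[2, 3],
              [1, -1]])
B = np.array([[1],
              [4]])
C = A @ B
print(C.ravel())                       # [14 -3]

# Problem 2: analytic derivative vs. central difference at x0
def y_fn(x):
    return np.log(np.sin(3 * x**2))

def dy_dx(x):
    return 6 * x / np.tan(3 * x**2)    # 6x * cot(3x^2)

x0, h = 0.5, 1e-6
numeric = (y_fn(x0 + h) - y_fn(x0 - h)) / (2 * h)
print(np.isclose(numeric, dy_dx(x0)))  # True

# Problem 3: gradient of (2*w1 + 3*w2 - 7)^2 at (1, 1)
w1, w2 = 1.0, 1.0
residual = 2 * w1 + 3 * w2 - 7
grad = np.array([2 * residual * 2, 2 * residual * 3])
print(grad)                            # components -8 and -12

# Problem 4: binary cross-entropy
y = np.array([1, 0, 1, 0])
y_hat = np.array([0.9, 0.2, 0.7, 0.1])
bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(round(bce, 4))                   # 0.1976

# Problem 5: Bernoulli MLE and log-likelihood
p_hat = 160 / 200
log_lik = 160 * np.log(p_hat) + 40 * np.log(1 - p_hat)
print(p_hat, round(log_lik, 2))        # 0.8 -100.08
```

Comparing an analytic derivative against a central difference, as in Problem 2, is the same gradient-checking idea requested in the programming problems.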

Section D: Programming Problems

  1. Implement a complete gradient descent optimizer for f(x) = x⁴ − 3x³ + 2.
    Start from x=6. Use learning rate 0.01. Run 1000 steps. Print x and f(x) every 100 steps. Verify that x converges near the global minimum. Plot the loss curve.
    Starter code:
    Python
    import numpy as np
    import matplotlib.pyplot as plt
    
    def f(x):
        return x**4 - 3*x**3 + 2
    
    def df(x):
        # TODO: compute the derivative
        pass
    
    x = 6.0
    lr = 0.01
    history = []
    
    for step in range(1000):
        # TODO: gradient descent update
        # TODO: record history
        pass
    
    # TODO: plot history
    
  2. Build a softmax + cross-entropy loss function from scratch.
    Given logits z = [2.0, 1.0, 0.1] and true class y = 0, implement:
    (a) Softmax: p_i = exp(z_i) / Σ_j exp(z_j)
    (b) Cross-entropy: L = −log(p_y)
    (c) Gradient: ∂L/∂z_i = p_i − 1{i=y}
    Verify your gradient numerically.
    Starter code:
    Python
    import numpy as np
    
    def softmax(z):
        # TODO: implement (use max trick for stability)
        pass
    
    def cross_entropy_loss(probs, y_true):
        # TODO: -log(probs[y_true])
        pass
    
    def softmax_ce_gradient(probs, y_true):
        # TODO: p_i - 1{i=y}
        pass
    
    z = np.array([2.0, 1.0, 0.1])
    y = 0
    
    # TODO: compute and print softmax, loss, gradient
    # TODO: verify gradient numerically
    
  3. Implement a complete 2-layer neural network forward pass using only NumPy.
    Architecture: 4 inputs → 3 hidden (ReLU) → 1 output (sigmoid).
    Initialize random weights (use seed 42). Process a batch of 5 samples. Print shapes at every step. Compute binary cross-entropy loss.
    Starter code:
    Python
    import numpy as np
    np.random.seed(42)
    
    # Data: 5 samples × 4 features
    X = np.random.randn(5, 4)
    y = np.array([[1], [0], [1], [0], [1]])  # shape (5,1)
    
    # TODO: Initialize W1 (4×3), b1 (1×3), W2 (3×1), b2 (1×1)
    # TODO: Forward pass: Z1 = X@W1+b1, A1 = relu(Z1), Z2 = A1@W2+b2, A2 = sigmoid(Z2)
    # TODO: Compute BCE loss
    # TODO: Print all intermediate shapes
    

Chapter Summary

Key Takeaways

  • Linear Algebra: Data lives in tensors. Matrix multiplication (built from dot products) is the core computation in every neural network layer: z = Wx + b.
  • Shape tracking is non-negotiable: (m×n) × (n×p) = (m×p). Inner dimensions must match.
  • Transpose flips rows ↔ columns. The Hadamard product (⊙) multiplies element-wise. Matrix inverse exists only for square, non-singular matrices.
  • Calculus: The derivative measures sensitivity, that is, how much the output changes per unit change in the input. The chain rule decomposes derivatives through compositions of functions.
  • The sigmoid derivative σ'(z) = σ(z)(1−σ(z)) has a maximum of 0.25, so chaining many sigmoid layers causes vanishing gradients.
  • The gradient ∇L points uphill; gradient descent moves downhill: w ← w − α∇L.
  • Probability: Binary classification = Bernoulli distribution. MLE for Bernoulli gives p̂ = successes/trials.
  • Cross-entropy loss = negative log-likelihood. It emerges naturally from MLE, not from arbitrary choice.
  • Information Theory: Entropy measures uncertainty. KL divergence measures how wrong your model distribution is. Cross-entropy = Entropy + KL Divergence.
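  • Cross-entropy produces much stronger gradients than MSE for confidently wrong predictions: with a sigmoid output of ŷ = 0.01 on a y = 1 sample, the gradient with respect to z is about 50× larger. That is why it trains faster for classification.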

Cheat Sheet: Formulas You Must Remember

Name | Formula | Where Used
Matrix Multiply | C_{ij} = Σ_k A_{ik}B_{kj} | Every forward pass
Sigmoid | σ(z) = 1/(1+e⁻ᶻ) | Output layer (binary)
Sigmoid Derivative | σ'(z) = σ(z)(1−σ(z)) | Backpropagation
Chain Rule | dy/dx = (dy/du)(du/dx) | Backpropagation
Gradient Descent | w ← w − α∇L | All training
Bernoulli MLE | p̂ = Σyᵢ / n | Parameter estimation
Binary Cross-Entropy | −(1/n)Σ[y log ŷ + (1−y)log(1−ŷ)] | Classification loss
Entropy | H(p) = −Σ p(x) log p(x) | Measuring uncertainty
KL Divergence | D_KL(p‖q) = Σ p log(p/q) | Comparing distributions
Cross-Entropy | H(p,q) = H(p) + D_KL(p‖q) | Classification loss

References & Further Reading

Primary Textbooks

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 2 (Linear Algebra), Chapter 3 (Probability), Chapter 4 (Numerical Computation). MIT Press. deeplearningbook.org
  2. Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 1 (Introduction: Probability). Springer.
  3. Strang, G. (2019). Linear Algebra and Learning from Data. Wellesley-Cambridge Press.

Online Resources

  1. 3Blue1Brown: Essence of Linear Algebra (YouTube Series). Exceptional visual intuition for vectors, transformations, and eigenvalues.
  2. 3Blue1Brown: Essence of Calculus (YouTube Series). Derivatives and integrals explained visually.
  3. Khan Academy: Multivariable Calculus. Gradients, partial derivatives, Jacobians.
  4. Stanford CS229 Notes: Linear Algebra Review. Concise reference for ML-relevant linear algebra.
  5. colah's blog: Visual Information Theory (2015). Beautiful explanation of entropy, cross-entropy, and KL divergence. colah.github.io

Indian Context

  1. NPTEL: Mathematics for Machine Learning by IIT Madras. Free video lectures covering all topics in this chapter.
  2. NPTEL: Deep Learning by Prof. Mitesh Khapra, IIT Madras. Mathematical foundations in Weeks 1-3.
  3. UPI Transaction Statistics, NPCI. npci.org.in

What's Next?

In Chapter 3: The Perceptron & Neuron Model, we'll put this math to work. You'll see how a single neuron computes z = wᵀx + b (linear algebra), applies σ(z) (calculus), and learns by minimizing cross-entropy (probability + information theory). Every formula from this chapter will come alive.