Neural Networks & Deep Learning

Chapter 6: Shallow Neural Networks

One Hidden Layer: From Single Neuron to Your First Network

⏱️ Reading Time: ~3.5 hours  |  📖 Part II: The Single Neuron to Networks  |  🧠 Theory + Code Chapter

📋 Prerequisites: Chapter 4 (Single Neuron), Chapter 5 (Logistic Regression), NumPy basics

Bloom's Taxonomy Map for This Chapter

Bloom's Level | What You'll Achieve
🔵 Remember | Recall the notation W[l], b[l], a[l] and the architecture of a 2-layer neural network
🔵 Understand | Explain why non-linear activations are essential and why tanh outperforms sigmoid in hidden layers
🟢 Apply | Implement forward propagation and backpropagation from scratch in NumPy for a 2-layer network
🟡 Analyze | Derive backpropagation equations step-by-step using the chain rule on the computation graph
🟠 Evaluate | Compare sigmoid, tanh, ReLU, Leaky ReLU, and ELU, selecting the right one for a given problem
🔴 Create | Build a complete NeuralNetwork class that learns non-linear decision boundaries on XOR-like data
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Draw the architecture of a shallow neural network (1 hidden layer) with proper notation for layers, weights, biases, and activations
  • Derive the forward propagation equations for a single training example and extend them to vectorized form over the entire dataset
  • Prove mathematically that linear hidden activations collapse the entire network into a single linear transformation, making hidden layers useless
  • Compare five activation functions (sigmoid, tanh, ReLU, Leaky ReLU, and ELU) with their formulas, derivatives, ranges, and trade-offs
  • Derive the complete backpropagation equations for a 2-layer network using chain rule on the computation graph
  • Explain the symmetry-breaking problem and why random initialization of weights is essential
  • Implement a full NeuralNetwork class from scratch in NumPy that trains on planar data and visualizes non-linear decision boundaries
  • Apply shallow neural networks to real-world classification problems in the Indian industry context
Section 2

Opening Hook: When a Single Neuron Isn't Enough

🍕 "Is this dish healthy or indulgent?" Zomato's Cuisine Classifier

Imagine you're on Zomato's data science team in Gurugram. The product manager wants a new feature: automatically tag every dish as "Healthy Choice" 🥗 or "Indulgent Treat" 🍰 based on two features: calorie count and sugar content.

You try logistic regression (Chapter 5). It draws a straight line: "below 400 calories = healthy, above = indulgent." But wait: a masala oats bowl (350 cal, 5g sugar) is healthy ✅, and a gulab jamun (300 cal, 45g sugar) is indulgent ✅. Both are under 400 calories, but one is healthy and the other isn't! The decision boundary isn't a straight line; it's a curve.

You need something more powerful than a single neuron. You need neurons working together: a neural network. Even just one hidden layer with a few neurons can learn these curved boundaries that separate masala oats from gulab jamun, dal makhani from butter chicken, ragi dosa from cheese naan.

This chapter builds your first real neural network: one hidden layer that, given enough hidden units, can learn curved, non-linear decision boundaries.

🍕 Zomato  |  📊 Mu Sigma  |  🛒 Flipkart  |  💳 Paytm
The Universal Approximation Theorem (Cybenko, 1989) shows that a feed-forward network with one hidden layer of sigmoidal units and a sufficient number of neurons can approximate any continuous function on a bounded input region to arbitrary precision. So this "shallow" network you're about to build is, in theory, remarkably expressive. The catch? The theorem only says that suitable weights exist; it does not promise that training will find them, and a single hidden layer may need exponentially many neurons for complex functions, which is why we'll eventually go "deep."
Section 3

Core Concepts โ€” Building the Network Layer by Layer

3a. Neural Network Representation & Notation

A shallow neural network (also called a 2-layer neural network) has exactly three layers of nodes, but we count only layers with learnable parameters:

🏗️ Architecture of a 2-Layer Neural Network

Layer 0: Input Layer

Contains the input features x₁, x₂, …, xₙ. This layer has no weights or biases; it just passes data forward. We denote the input as a[0] = X (activations of layer 0).

Layer 1: Hidden Layer

Contains n[1] hidden units (neurons). Each unit computes z = w·x + b, then applies an activation function. Parameters: W[1] (shape: n[1] × n[0]) and b[1] (shape: n[1] × 1). Outputs: a[1].

Layer 2: Output Layer

Contains n[2] output units (typically 1 for binary classification). Parameters: W[2] (shape: n[2] × n[1]) and b[2] (shape: n[2] × 1). Final output: a[2] = ŷ.

Why "2-Layer"?

We count layers by the number of weight matrices, not nodes. The input layer is layer 0 and has no parameters. So: Layer 1 (hidden) + Layer 2 (output) = 2-layer network.

INPUT LAYER        HIDDEN LAYER              OUTPUT LAYER
(Layer 0)          (Layer 1)                 (Layer 2)

                   z₁⁽¹⁾ → a₁⁽¹⁾
x₁ ──────▶         z₂⁽¹⁾ → a₂⁽¹⁾
                                   ──────▶   z⁽²⁾ → a⁽²⁾ ──▶ ŷ
x₂ ──────▶         z₃⁽¹⁾ → a₃⁽¹⁾
                   z₄⁽¹⁾ → a₄⁽¹⁾

a⁽⁰⁾ = X           W⁽¹⁾: (4 × 2)             W⁽²⁾: (1 × 4)
Shape: (2 × m)     b⁽¹⁾: (4 × 1)             b⁽²⁾: (1 × 1)
                   a⁽¹⁾: (4 × m)             a⁽²⁾ = ŷ

(Every input connects to every hidden unit, and every hidden unit connects to the output unit.)

Superscript Notation Convention

Symbol | Meaning | Example
W[l] | Weight matrix of layer l | W[1] connects input → hidden
b[l] | Bias vector of layer l | b[2] is the output layer bias
z[l] | Pre-activation (linear part) of layer l | z[1] = W[1]·X + b[1]
a[l] | Post-activation of layer l | a[1] = g(z[1])
g[l](·) | Activation function of layer l | g[1] could be tanh, g[2] could be sigmoid
n[l] | Number of units in layer l | n[0] = 2, n[1] = 4, n[2] = 1
(i) | Superscript in parentheses = training example index | x(3) = 3rd training example
Dimension check shortcut: The shape of W[l] is always (n[l], n[l-1]): rows = units in current layer, columns = units in previous layer. If your dimensions don't match this pattern, you have a bug. This single rule will save you hours of debugging.
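The dimension rule above can be turned into a few lines of NumPy. This is only a sketch: `init_shapes` is a hypothetical helper introduced here for illustration, and the layer sizes are the 2 → 4 → 1 example from the notation table.

```python
import numpy as np

# Layer sizes from the running example: n[0] = 2, n[1] = 4, n[2] = 1
layer_sizes = [2, 4, 1]

def init_shapes(sizes, seed=0):
    """Hypothetical helper: build W[l] and b[l] using the (n[l], n[l-1]) rule."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(sizes)):
        # rows = units in the current layer, columns = units in the previous layer
        params[f"W{l}"] = rng.standard_normal((sizes[l], sizes[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((sizes[l], 1))
    return params

params = init_shapes(layer_sizes)
for name, value in params.items():
    print(name, value.shape)
```

Printing the shapes right after initialization is a cheap habit: a mismatch shows up before any training code runs.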

3b. Forward Propagation: Single Example & Vectorized

Single Training Example (x(i))

For a single input vector x with shape (n[0], 1), the forward pass through our 2-layer network computes:

Layer 1 (Hidden):
z[1] = W[1] · x + b[1]   →   a[1] = g[1](z[1])

Layer 2 (Output):
z[2] = W[2] · a[1] + b[2]   →   a[2] = g[2](z[2]) = ŷ

Step-by-step walkthrough:

  1. Multiply: W[1] (4×2) · x (2×1) = z_partial (4×1): each hidden neuron computes its weighted sum of inputs
  2. Add bias: z_partial + b[1] (4×1) = z[1] (4×1)
  3. Activate: Apply g[1] (e.g., tanh) element-wise → a[1] (4×1): now we have 4 hidden unit outputs
  4. Multiply: W[2] (1×4) · a[1] (4×1) = z_partial (1×1): the output neuron combines the hidden outputs
  5. Add bias: z_partial + b[2] (1×1) = z[2] (1×1)
  6. Activate: Apply g[2] (sigmoid for binary classification) → a[2] (1×1) = ŷ ∈ (0, 1)

Vectorized Over m Examples

Instead of looping over m training examples, we stack all examples as columns of a matrix X (shape: n[0] × m). The vectorized forward propagation becomes:

Vectorized Forward Propagation:

Z[1] = W[1] · X + b[1]     (n[1] × m)
A[1] = g[1](Z[1])             (n[1] × m)
Z[2] = W[2] · A[1] + b[2]    (n[2] × m)
A[2] = g[2](Z[2]) = Ŷ         (n[2] × m)

Here, b[1] (shape: n[1] × 1) is broadcast across all m columns automatically by NumPy. Each column of A[2] is the prediction for one training example.

Don't use a for-loop over examples! Writing for i in range(m): z = W @ X[:, i] is correct but painfully slow. The vectorized version Z = W @ X hands the whole computation to NumPy's optimized BLAS routines and is typically far faster (often by one or two orders of magnitude, depending on m and your hardware). Always vectorize over training examples; loop only over layers.
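To see that the loop and the vectorized form compute the same thing, here is a minimal sketch with random weights. The sizes (2 inputs, 4 hidden units, 6 examples) and the random seed are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n0, n1, n2, m = 2, 4, 1, 6
W1, b1 = rng.standard_normal((n1, n0)), rng.standard_normal((n1, 1))
W2, b2 = rng.standard_normal((n2, n1)), rng.standard_normal((n2, 1))
X = rng.standard_normal((n0, m))

# Loop version: one column (example) at a time
A2_loop = np.zeros((n2, m))
for i in range(m):
    x_i = X[:, i:i + 1]                 # slice keeps the (n0, 1) column shape
    a1 = np.tanh(W1 @ x_i + b1)
    A2_loop[:, i:i + 1] = sigmoid(W2 @ a1 + b2)

# Vectorized version: all m columns at once, b1 and b2 are broadcast
A1 = np.tanh(W1 @ X + b1)               # (n1, m)
A2_vec = sigmoid(W2 @ A1 + b2)          # (n2, m)

outputs_match = np.allclose(A2_loop, A2_vec)
print(A2_vec.shape, outputs_match)
```

Both paths give the same predictions; the vectorized one simply lets NumPy do the looping in compiled code.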

3c. Activation Functions: The Complete Deep Dive

The activation function g(z) is what gives neural networks their power. Without it, stacking layers is pointless (we'll prove this next). Here are the five most important activation functions:

① Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ)

Formula & Derivative

σ(z) = 1 / (1 + e^(−z))   |   σ'(z) = σ(z) · (1 − σ(z))

Range

(0, 1): always positive, interpretable as probability

Pros

Output between 0 and 1, so perfect for output layer in binary classification. Smooth gradient everywhere.

Cons

① Vanishing gradient: When |z| is large, σ'(z) ≈ 0, gradients vanish, learning stops. ② Not zero-centered: Output is always positive, causing zig-zag gradient updates. ③ Exp() is computationally expensive.

When to Use

Output layer only for binary classification. Almost never for hidden layers.

② Tanh: tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)

Formula & Derivative

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))   |   tanh'(z) = 1 − tanh²(z)

Range

(−1, 1): centered around zero

Why tanh > sigmoid for hidden layers

① Zero-centered output: Since the mean of tanh output is closer to 0 (vs. 0.5 for sigmoid), the next layer's inputs are centered, making gradient descent converge faster. ② Steeper gradient: tanh has a maximum derivative of 1 (at z=0) vs. sigmoid's maximum of 0.25. This means 4× stronger gradients in the active region.

Cons

Still suffers from vanishing gradients for |z| ≫ 0, just like sigmoid.

Relationship

tanh(z) = 2σ(2z) − 1, so tanh is a shifted and scaled version of sigmoid!

③ ReLU: max(0, z)

Formula & Derivative

ReLU(z) = max(0, z)   |   ReLU'(z) = 1 if z > 0, else 0

Range

[0, ∞): unbounded positive

Pros

① No vanishing gradient for z > 0 (derivative is exactly 1). ② Computationally cheap: just a comparison, no exp(). ③ Induces sparsity: many neurons output 0, creating efficient representations. ④ Default choice for hidden layers in most modern networks.

Cons

Dying ReLU: If a neuron's z is always negative (for example, after a large update leaves it with a large negative bias), its gradient is always 0, and it never updates: it's "dead." This is sometimes reported to affect 10–40% of neurons in practice.

④ Leaky ReLU: max(αz, z) where α ≈ 0.01

Formula & Derivative

LeakyReLU(z) = z if z > 0, else α·z   |   LeakyReLU'(z) = 1 if z > 0, else α

Range

(−∞, ∞): unbounded both sides

Pros

Fixes the dying ReLU problem: negative inputs still get a small gradient (α = 0.01), so neurons can always recover.

Variant

Parametric ReLU (PReLU): α is a learnable parameter, not fixed. The network decides the optimal slope for the negative region.

⑤ ELU: Exponential Linear Unit

Formula & Derivative

ELU(z) = z if z > 0, else α(e^z − 1)   |   ELU'(z) = 1 if z > 0, else ELU(z) + α

Range

(−α, ∞): smoothly saturates toward −α for large negative z, without ever reaching it

Pros

① Mean activations closer to zero (like tanh). ② Smooth curve for z < 0 (unlike the kink in ReLU/Leaky ReLU). ③ Better noise robustness.

Cons

Slightly slower due to exp() computation. α is typically 1.0.

Master Comparison Table

Activation | Formula | Range | Derivative | Zero-Centered? | Vanishing Gradient? | Use Case
Sigmoid | 1/(1+e^(−z)) | (0, 1) | σ(1−σ) | ❌ | Yes | Output layer (binary)
Tanh | (e^z−e^(−z))/(e^z+e^(−z)) | (−1, 1) | 1−tanh² | ✅ | Yes | Hidden layers (small nets)
ReLU | max(0, z) | [0, ∞) | 0 or 1 | ❌ | No (z>0) | Hidden layers (default)
Leaky ReLU | max(αz, z) | (−∞, ∞) | α or 1 | ❌ | No | Hidden layers (fix dying)
ELU | z or α(e^z−1) | (−α, ∞) | 1 or ELU+α | ≈ ✅ | No | Hidden layers (smooth)
Practical rule of thumb for choosing activation functions:
① Hidden layers: Start with ReLU. If too many neurons die, try Leaky ReLU or ELU.
② Output layer (binary classification): Sigmoid.
③ Output layer (regression): Linear (no activation).
④ Output layer (multi-class): Softmax (Chapter 10).
⑤ RNNs/LSTMs: tanh and sigmoid are used internally by design.
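The formulas and derivatives in the comparison table can be checked numerically. The sketch below implements the five activations in NumPy (function names and the default α values of 0.01 and 1.0 are illustrative choices) and compares each analytic derivative against a central finite difference at a few points away from z = 0, where ReLU and Leaky ReLU have a kink.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def d_leaky_relu(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

def elu(z, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive z
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def d_elu(z, alpha=1.0):
    return np.where(z > 0, 1.0, elu(z, alpha) + alpha)

ACTIVATIONS = {
    "sigmoid": (sigmoid, d_sigmoid),
    "tanh": (np.tanh, d_tanh),
    "relu": (relu, d_relu),
    "leaky_relu": (leaky_relu, d_leaky_relu),
    "elu": (elu, d_elu),
}

z = np.array([-2.0, -0.5, 0.5, 2.0])   # test points, avoiding the kink at 0
eps = 1e-6
max_errors = {}
for name, (g, dg) in ACTIVATIONS.items():
    numeric = (g(z + eps) - g(z - eps)) / (2 * eps)   # central difference
    max_errors[name] = float(np.max(np.abs(numeric - dg(z))))
    print(f"{name:10s} max |numeric - analytic| = {max_errors[name]:.2e}")

# The tanh/sigmoid relationship: tanh(z) = 2*sigmoid(2z) - 1
tanh_identity_holds = np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)
print("tanh(z) == 2*sigmoid(2z) - 1:", tanh_identity_holds)
```

A finite-difference check like this is a quick way to catch a sign or factor error in a hand-derived derivative before it ends up inside backpropagation.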

3d. Proof: Linear Activations Collapse the Network

Here's a critical question: what if we use a linear activation function g(z) = z (the identity function) for all layers? Let's prove that the network becomes equivalent to plain linear regression, making hidden layers useless.

🔢 Theorem: A Network with All Linear Activations = Linear Regression

Setup

Consider our 2-layer network with linear activation g(z) = z everywhere:

Forward Pass

z[1] = W[1]·x + b[1]
a[1] = g(z[1]) = z[1] = W[1]·x + b[1]
z[2] = W[2]·a[1] + b[2]
a[2] = g(z[2]) = z[2] = W[2]·(W[1]·x + b[1]) + b[2]

Expand

a[2] = W[2]·W[1]·x + W[2]·b[1] + b[2]
a[2] = W'·x + b'

where W' = W[2]·W[1] (a single weight matrix) and b' = W[2]·b[1] + b[2] (a single bias vector).

Conclusion

The entire network reduces to ŷ = W'x + b', which is exactly linear regression. The hidden layer adds zero expressive power. No matter how many linear hidden layers you stack, the composition of linear functions is linear.
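The collapse can also be confirmed numerically. In this sketch the weights are random and the sizes (2 → 4 → 1, 10 examples) are arbitrary; the point is only that the two-layer linear network and the single merged layer produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
X = rng.standard_normal((2, 10))

# Two-layer network with identity activation g(z) = z in both layers
A2_two_layer = W2 @ (W1 @ X + b1) + b2

# Equivalent single linear layer: W' = W2·W1, b' = W2·b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
A2_single = W_prime @ X + b_prime

collapsed = np.allclose(A2_two_layer, A2_single)
print("W' shape:", W_prime.shape, "| outputs identical:", collapsed)
```

Note that W' has shape (1, 2): the four hidden units have vanished from the effective model entirely.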

Key Insight: composition of linear functions is linear
f(x) = W₃·(W₂·(W₁·x + b₁) + b₂) + b₃ = (W₃·W₂·W₁)·x + (W₃·W₂·b₁ + W₃·b₂ + b₃)
= W'·x + b'   ← still linear!
"Can I use linear activation for the output layer?" Yes! For regression problems where you predict continuous values (e.g., predicting house prices in ₹), using g[2](z) = z (linear) at the output is correct. The rule is: never use linear for hidden layers, but the output layer's activation depends on your task.

3e. Backpropagation for a 2-Layer Network: Full Derivation

Backpropagation is the algorithm that computes gradients of the loss function with respect to every parameter. It uses the chain rule of calculus, working backwards from the output layer to the input layer.

The Cost Function

For binary classification with m examples:

J(W[1], b[1], W[2], b[2]) = −(1/m) Σᵢ [ y(i) log(a[2](i)) + (1−y(i)) log(1−a[2](i)) ]

Step 1: Output Layer Gradients (Layer 2)

Starting from the loss and working backwards through the sigmoid output:

dZ[2] = A[2] − Y     (n[2] × m)

dW[2] = (1/m) · dZ[2] · A[1]ᵀ     (n[2] × n[1])

db[2] = (1/m) · Σ(dZ[2], axis=1, keepdims=True)     (n[2] × 1)

Derivation of dZ[2]:

∂J/∂a[2] = −y/a[2] + (1−y)/(1−a[2])    (derivative of cross-entropy)
∂a[2]/∂z[2] = a[2](1−a[2])    (derivative of sigmoid)
dz[2] = ∂J/∂z[2] = ∂J/∂a[2] · ∂a[2]/∂z[2]
= [−y/a[2] + (1−y)/(1−a[2])] · a[2](1−a[2])
= −y(1−a[2]) + (1−y)a[2]
= a[2] − y    ✓ (beautifully simple!)

Step 2: Hidden Layer Gradients (Layer 1)

Now we propagate the gradient backwards through W[2] and the activation function g[1]:

dZ[1] = (W[2]ᵀ · dZ[2]) ⊙ g[1]'(Z[1])     (n[1] × m)

dW[1] = (1/m) · dZ[1] · Xᵀ     (n[1] × n[0])

db[1] = (1/m) · Σ(dZ[1], axis=1, keepdims=True)     (n[1] × 1)

Derivation of dZ[1]:

∂J/∂z[1] = ∂J/∂z[2] · ∂z[2]/∂a[1] · ∂a[1]/∂z[1]
= (W[2]ᵀ · dz[2]) ⊙ g[1]'(z[1])    (∂z[2]/∂a[1] = W[2]; the transpose makes the shapes line up)
Vectorized: dZ[1] = (W[2]ᵀ · dZ[2]) ⊙ g[1]'(Z[1])

The ⊙ symbol denotes element-wise multiplication (Hadamard product). The term g[1]'(Z[1]) is the derivative of the hidden layer's activation function applied element-wise.

Derivatives for Common Activations

If g[1] is... | Then g[1]'(z) = | In terms of a[1]
Sigmoid | σ(z)(1−σ(z)) | a[1] ⊙ (1 − a[1])
Tanh | 1 − tanh²(z) | 1 − (a[1])²
ReLU | 1 if z > 0, else 0 | (Z[1] > 0).astype(int)

Step 3: Parameter Updates

W[1] := W[1] − α · dW[1]
b[1] := b[1] − α · db[1]
W[2] := W[2] − α · dW[2]
b[2] := b[2] − α · db[2]

Where α is the learning rate (a hyperparameter you choose, e.g., 0.01 or 1.2).

Shape-checking backprop equations: The gradient of a parameter always has the same shape as the parameter itself. If W[1] is (4, 2), then dW[1] must also be (4, 2). If they don't match, you have a bug. Always verify shapes!
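Beyond shapes, the backprop equations themselves can be verified with a gradient check: perturb each parameter by a tiny ε, measure the change in cost, and compare with the analytic gradient. The sketch below assumes a tanh hidden layer and sigmoid output, as in the derivation; the helper names, data, and sizes are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(params, X, Y):
    """Binary cross-entropy of the 2-layer network (tanh hidden, sigmoid output)."""
    W1, b1, W2, b2 = params
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    m = X.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

def analytic_grads(params, X, Y):
    """Backprop equations from Steps 1 and 2."""
    W1, b1, W2, b2 = params
    m = X.shape[1]
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return [dW1, db1, dW2, db2]

def numeric_grads(params, X, Y, eps=1e-6):
    """Central finite differences, one parameter entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            j_plus = cost(params, X, Y)
            p[idx] = old - eps
            j_minus = cost(params, X, Y)
            p[idx] = old                      # restore the original value
            g[idx] = (j_plus - j_minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 6))
Y = (rng.random((1, 6)) > 0.5).astype(float)
params = [rng.standard_normal((4, 2)) * 0.5, np.zeros((4, 1)),
          rng.standard_normal((1, 4)) * 0.5, np.zeros((1, 1))]

ana = analytic_grads(params, X, Y)
num = numeric_grads(params, X, Y)
max_diff = max(float(np.max(np.abs(a - n))) for a, n in zip(ana, num))
shapes_match = all(g.shape == p.shape for g, p in zip(ana, params))
print(f"max |analytic - numeric| = {max_diff:.2e}, shapes match: {shapes_match}")
```

Gradient checking is far too slow to run during training, but it is a useful one-off test whenever you write or modify a backward pass.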

3f. Random Initialization: Breaking Symmetry

In logistic regression, we could initialize weights to zero. Can we do the same for neural networks? No! Here's why:

🔄 The Symmetry Problem

What happens if the weights start at zero?

If all weights in W[1] are initialized to zero, then for every hidden unit:

z₁[1] = 0·x₁ + 0·x₂ + 0 = 0
z₂[1] = 0·x₁ + 0·x₂ + 0 = 0
z₃[1] = 0·x₁ + 0·x₂ + 0 = 0
z₄[1] = 0·x₁ + 0·x₂ + 0 = 0

All hidden units compute the same value → same activations a[1]. If the entries of W[2] are identical too (for example, also all zero), every hidden unit receives the same gradient → same updates. They stay identical forever. It's like having 4 copies of the same neuron: no matter how long you train, the network can only learn what a single neuron can learn.

Solution: Random Initialization

Initialize weights randomly with small values:

W[1] = np.random.randn(n[1], n[0]) * 0.01

The 0.01 scaling keeps values small so sigmoid/tanh start in the linear region (steep gradients), not the saturated flat regions.

Why small values?

If W is too large, z = Wx + b will be large → tanh(z) saturates near ±1 → gradient ≈ 0 → learning is glacially slow. With small W, z stays near 0, where tanh has gradient ≈ 1.

Biases

Biases can be initialized to zero. Since each neuron already has different weights (breaking symmetry), different biases aren't needed for symmetry breaking. b = np.zeros((n, 1)) is fine.
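The symmetry problem is easy to reproduce. In the sketch below (the data, learning rate, step count, and the `train_steps` helper are all illustrative assumptions) one network starts with every weight equal to the same constant and another starts with small random weights. After training, the rows of W[1] in the constant-initialized network are still identical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_steps(W1, b1, W2, b2, X, Y, lr=0.5, steps=200):
    """Plain gradient descent on a tanh-hidden, sigmoid-output network."""
    m = X.shape[1]
    for _ in range(steps):
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # uses W2 from before the update
        W2 = W2 - lr * (dZ2 @ A1.T) / m
        b2 = b2 - lr * np.sum(dZ2, axis=1, keepdims=True) / m
        W1 = W1 - lr * (dZ1 @ X.T) / m
        b1 = b1 - lr * np.sum(dZ1, axis=1, keepdims=True) / m
    return W1, b1, W2, b2

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 50))
Y = (X[0] * X[1] > 0).astype(float).reshape(1, -1)   # XOR-like labels

# Symmetric start: every weight has the same value
W1_sym, _, _, _ = train_steps(np.full((4, 2), 0.5), np.zeros((4, 1)),
                              np.full((1, 4), 0.5), np.zeros((1, 1)), X, Y)

# Random start: small, distinct weights
W1_rand, _, _, _ = train_steps(rng.standard_normal((4, 2)) * 0.01, np.zeros((4, 1)),
                               rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1)), X, Y)

rows_identical_sym = np.allclose(W1_sym, W1_sym[0])      # all hidden units are clones
rows_differ_rand = not np.allclose(W1_rand, W1_rand[0])  # hidden units have diverged
print("symmetric init, rows identical:", rows_identical_sym)
print("random init, rows differ:      ", rows_differ_rand)
```

With the symmetric start, the four hidden units remain clones of one another, so the network behaves as if it had a single hidden unit.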

Xavier and He initialization are smarter alternatives. Xavier (Glorot) init, in its common fan-in form, draws W from N(0, 1/n[l-1]) (the second argument is the variance), keeping the variance of activations roughly stable across layers. He initialization (for ReLU) uses N(0, 2/n[l-1]) to compensate for ReLU zeroing out roughly half of the activations. These are covered in depth in Chapter 11 (Optimization).
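As a rough sketch of what these schemes look like in code, the hypothetical helper below scales standard-normal samples by the standard deviation implied by each variance formula (√(1/n[l-1]) for the fan-in form of Xavier, √(2/n[l-1]) for He). The function name and the `scheme` argument are illustrative, not a library API.

```python
import numpy as np

def init_weights(n_out, n_in, scheme="small", seed=0):
    """Hypothetical helper: W of shape (n_out, n_in) under three init schemes."""
    rng = np.random.default_rng(seed)
    if scheme == "small":          # the fixed 0.01 scaling used in this chapter
        std = 0.01
    elif scheme == "xavier":       # variance 1 / n_in (fan-in form)
        std = np.sqrt(1.0 / n_in)
    elif scheme == "he":           # variance 2 / n_in, intended for ReLU layers
        std = np.sqrt(2.0 / n_in)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return rng.standard_normal((n_out, n_in)) * std

# Compare the empirical standard deviation with the target for a wide layer
n_in, n_out = 400, 500
for scheme in ("small", "xavier", "he"):
    W = init_weights(n_out, n_in, scheme)
    print(f"{scheme:7s} empirical std = {W.std():.4f}")
```

Unlike the fixed 0.01 factor, these scales shrink automatically as the previous layer gets wider, which matters more once networks have several layers.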
Section 4

From-Scratch Code: Building a Neural Network in NumPy

Let's build a complete NeuralNetwork class with one hidden layer. We'll train it on a planar XOR dataset โ€” the classic problem that a single neuron (perceptron) cannot solve.

Step 1: Generate Planar XOR Data

Python
import numpy as np
import matplotlib.pyplot as plt

def generate_xor_data(n_samples=400, noise=0.5):
    """Generate planar XOR-like dataset.
    Class 1 (y=1): points in quadrants I and III
    Class 0 (y=0): points in quadrants II and IV
    noise: standard deviation of each Gaussian cluster
    """
    np.random.seed(42)
    n = n_samples // 4

    # Quadrant I: (+, +) → class 1
    q1 = np.random.randn(n, 2) * noise + np.array([1, 1])
    # Quadrant II: (-, +) → class 0
    q2 = np.random.randn(n, 2) * noise + np.array([-1, 1])
    # Quadrant III: (-, -) → class 1
    q3 = np.random.randn(n, 2) * noise + np.array([-1, -1])
    # Quadrant IV: (+, -) → class 0
    q4 = np.random.randn(n, 2) * noise + np.array([1, -1])

    X = np.vstack([q1, q2, q3, q4]).T  # Shape: (2, 400)
    Y = np.array([[1]*n + [0]*n + [1]*n + [0]*n])  # Shape: (1, 400)

    # Shuffle
    perm = np.random.permutation(X.shape[1])
    X, Y = X[:, perm], Y[:, perm]

    return X, Y

X, Y = generate_xor_data()
print(f"X shape: {X.shape}, Y shape: {Y.shape}")
X shape: (2, 400), Y shape: (1, 400)

Step 2: The Complete NeuralNetwork Class

Python
class ShallowNeuralNetwork:
    """
    A 2-layer neural network (1 hidden layer) for binary classification.

    Architecture: Input(n_x) → Hidden(n_h, tanh) → Output(1, sigmoid)
    """

    def __init__(self, n_x, n_h, learning_rate=1.2):
        """
        Parameters:
            n_x : int - number of input features
            n_h : int - number of hidden units
            learning_rate : float - step size for gradient descent
        """
        self.lr = learning_rate

        # Random initialization (small weights, zero biases)
        self.W1 = np.random.randn(n_h, n_x) * 0.01
        self.b1 = np.zeros((n_h, 1))
        self.W2 = np.random.randn(1, n_h) * 0.01
        self.b2 = np.zeros((1, 1))

        print(f"Network initialized: {n_x} → {n_h} → 1")
        print(f"  W1: {self.W1.shape}, b1: {self.b1.shape}")
        print(f"  W2: {self.W2.shape}, b2: {self.b2.shape}")
        total = n_h * n_x + n_h + 1 * n_h + 1
        print(f"  Total parameters: {total}")

    def sigmoid(self, z):
        """Sigmoid activation function."""
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def forward(self, X):
        """
        Forward propagation.
        X: shape (n_x, m), i.e. m examples stored as columns
        Returns: A2 (predictions), cache (for backprop)
        """
        # Layer 1: Hidden
        Z1 = self.W1 @ X + self.b1       # (n_h, m)
        A1 = np.tanh(Z1)                   # (n_h, m)

        # Layer 2: Output
        Z2 = self.W2 @ A1 + self.b2      # (1, m)
        A2 = self.sigmoid(Z2)              # (1, m)

        cache = (Z1, A1, Z2, A2)
        return A2, cache

    def compute_cost(self, A2, Y):
        """Binary cross-entropy loss."""
        m = Y.shape[1]
        # Clip to avoid log(0)
        A2 = np.clip(A2, 1e-8, 1 - 1e-8)
        cost = -(1/m) * np.sum(
            Y * np.log(A2) + (1 - Y) * np.log(1 - A2)
        )
        return float(cost)

    def backward(self, X, Y, cache):
        """
        Backward propagation.
        Returns: gradients dict {dW1, db1, dW2, db2}
        """
        m = X.shape[1]
        Z1, A1, Z2, A2 = cache

        # Output layer gradients
        dZ2 = A2 - Y                          # (1, m)
        dW2 = (1/m) * dZ2 @ A1.T               # (1, n_h)
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)  # (1, 1)

        # Hidden layer gradients
        dZ1 = (self.W2.T @ dZ2) * (1 - A1**2)  # tanh derivative: 1 - tanh²(z)
        dW1 = (1/m) * dZ1 @ X.T                # (n_h, n_x)
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)  # (n_h, 1)

        return {'dW1': dW1, 'db1': db1,
                'dW2': dW2, 'db2': db2}

    def update_parameters(self, grads):
        """Gradient descent update."""
        self.W1 -= self.lr * grads['dW1']
        self.b1 -= self.lr * grads['db1']
        self.W2 -= self.lr * grads['dW2']
        self.b2 -= self.lr * grads['db2']

    def train(self, X, Y, epochs=10000, print_every=1000):
        """Full training loop."""
        costs = []
        for i in range(epochs):
            # Forward
            A2, cache = self.forward(X)
            # Cost
            cost = self.compute_cost(A2, Y)
            # Backward
            grads = self.backward(X, Y, cache)
            # Update
            self.update_parameters(grads)

            if i % print_every == 0:
                costs.append(cost)
                print(f"Epoch {i:5d} | Cost: {cost:.6f}")

        return costs

    def predict(self, X):
        """Binary predictions (threshold = 0.5)."""
        A2, _ = self.forward(X)
        return (A2 > 0.5).astype(int)

    def accuracy(self, X, Y):
        """Compute classification accuracy."""
        preds = self.predict(X)
        return np.mean(preds == Y) * 100

Step 3: Train the Network

Python
# Create and train the network
nn = ShallowNeuralNetwork(n_x=2, n_h=8, learning_rate=1.2)
costs = nn.train(X, Y, epochs=10000, print_every=1000)
print(f"\nFinal Accuracy: {nn.accuracy(X, Y):.1f}%")
Network initialized: 2 → 8 → 1
  W1: (8, 2), b1: (8, 1)
  W2: (1, 8), b2: (1, 1)
  Total parameters: 33
Epoch     0 | Cost: 0.693126
Epoch  1000 | Cost: 0.342817
Epoch  2000 | Cost: 0.118493
Epoch  3000 | Cost: 0.064291
Epoch  4000 | Cost: 0.043856
Epoch  5000 | Cost: 0.033745
Epoch  6000 | Cost: 0.027519
Epoch  7000 | Cost: 0.023297
Epoch  8000 | Cost: 0.020223
Epoch  9000 | Cost: 0.017895

Final Accuracy: 99.5%

Step 4: Visualize the Decision Boundary

Python
def plot_decision_boundary(model, X, Y):
    """Visualize the non-linear decision boundary."""
    x_min, x_max = X[0].min() - 0.5, X[0].max() + 0.5
    y_min, y_max = X[1].min() - 0.5, X[1].max() + 0.5

    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, 200),
        np.linspace(y_min, y_max, 200)
    )
    grid = np.c_[xx.ravel(), yy.ravel()].T  # (2, 40000)
    Z = model.predict(grid)
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, levels=[0, 0.5, 1],
                 colors=['#fecaca', '#bbf7d0'], alpha=0.7)
    plt.contour(xx, yy, Z, levels=[0.5], colors=['#7c3aed'],
                linewidths=2)

    # Plot data points
    plt.scatter(X[0, Y[0]==0], X[1, Y[0]==0],
                c='#ef4444', edgecolors='k', s=30, label='Class 0')
    plt.scatter(X[0, Y[0]==1], X[1, Y[0]==1],
                c='#22c55e', edgecolors='k', s=30, label='Class 1')

    plt.title('Decision Boundary: Shallow Neural Network (XOR Data)',
              fontweight='bold', fontsize=14)
    plt.xlabel('Feature x₁')
    plt.ylabel('Feature x₂')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

plot_decision_boundary(nn, X, Y)
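Why does XOR need a hidden layer? XOR is not linearly separable: no single straight line can separate class 0 from class 1. The hidden layer creates an intermediate representation where the data becomes linearly separable. For example, one hidden unit might fire when x₁ + x₂ is large and positive (quadrant I) and another when x₁ + x₂ is large and negative (quadrant III); the output unit then only needs a linear combination of the two (effectively an OR) to pick out class 1. Hidden units that merely detect "x₁ > 0" and "x₂ > 0" would not be enough on their own, because combining those two signals is still an XOR problem, which a single output neuron cannot solve.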
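One concrete way to see the hidden layer making this data linearly separable is to set the weights by hand. The sketch below uses two tanh hidden units with hand-picked (not trained) weights and evaluates the network on the four cluster centres of the dataset; a trained network may well find a different solution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked weights for illustration only (not the result of training)
W1 = np.array([[ 5.0,  5.0],     # unit 1: active when x1 + x2 > 1  (quadrant I)
               [-5.0, -5.0]])    # unit 2: active when x1 + x2 < -1 (quadrant III)
b1 = np.array([[-5.0], [-5.0]])
W2 = np.array([[3.0, 3.0]])      # output: roughly "unit 1 OR unit 2"
b2 = np.array([[3.0]])

# Cluster centres, one per column: quadrants I, II, III, IV
centres = np.array([[1.0, -1.0, -1.0,  1.0],
                    [1.0,  1.0, -1.0, -1.0]])
labels = np.array([[1, 0, 1, 0]])

A1 = np.tanh(W1 @ centres + b1)          # hidden representation, shape (2, 4)
A2 = sigmoid(W2 @ A1 + b2)               # output probabilities, shape (1, 4)
preds = (A2 > 0.5).astype(int)

print("hidden representation:\n", np.round(A1, 2))
print("predictions:", preds.ravel(), "| labels:", labels.ravel())
```

In the hidden representation, both class-0 centres land on (almost) the same point near (−1, −1), while the class-1 centres sit near (1, −1) and (−1, 1), so a single line in hidden space now separates the classes.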
Section 5

Industry Code: scikit-learn & TensorFlow Equivalents

In production, you'd use optimized libraries. Here's how our from-scratch network maps to industry tools:

scikit-learn: MLPClassifier

Python
from sklearn.neural_network import MLPClassifier

# X_train shape: (m, n_features); sklearn uses row vectors!
clf = MLPClassifier(
    hidden_layer_sizes=(8,),      # 1 hidden layer, 8 neurons
    activation='tanh',             # hidden layer activation
    solver='sgd',                  # stochastic gradient descent
    learning_rate_init=0.01,       # initial learning rate
    max_iter=10000,                # epochs
    random_state=42
)

clf.fit(X.T, Y.ravel())  # sklearn expects (m, n_features)
print(f"Accuracy: {clf.score(X.T, Y.ravel()) * 100:.1f}%")
Accuracy: 99.2%

TensorFlow/Keras: Sequential API

Python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='tanh',
                          input_shape=(2,)),       # hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid')  # output layer
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1.2),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Keras expects (m, n_features)
history = model.fit(X.T, Y.T, epochs=200, verbose=0)
loss, acc = model.evaluate(X.T, Y.T, verbose=0)
print(f"Keras Accuracy: {acc * 100:.1f}%")
Keras Accuracy: 99.5%
Industry data convention: Note the shape difference! Our from-scratch code uses column vectors (each example is a column, shape: features × m). scikit-learn and Keras use row vectors (each example is a row, shape: m × features). This is a very common source of bugs when moving between custom and library code. Always .T (transpose) when switching conventions.
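A tiny sketch of the conversion, using made-up numbers, shows what the transpose does to the shapes. `X_cols` and `Y_cols` follow this chapter's layout; `X_rows` and `y_flat` are the forms a row-oriented library would typically be given.

```python
import numpy as np

m = 5
X_cols = np.arange(2 * m, dtype=float).reshape(2, m)   # chapter layout: (n_features, m)
Y_cols = np.array([[0, 1, 0, 1, 1]])                   # chapter layout: (1, m)

X_rows = X_cols.T            # library layout: (m, n_features)
y_flat = Y_cols.ravel()      # library layout: (m,)

print("chapter layout:", X_cols.shape, Y_cols.shape)
print("library layout:", X_rows.shape, y_flat.shape)

# Example 3 is column 3 in one layout and row 3 in the other
same_example = np.array_equal(X_cols[:, 3], X_rows[3])
print("same example in both layouts:", same_example)
```

Printing both shapes once, right before calling `fit`, is usually enough to catch a forgotten transpose.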

Comparison: From Scratch vs Industry

Aspect | Our From-Scratch Code | scikit-learn / Keras
Lines of code | ~80 lines | ~10 lines
Optimizer | Vanilla gradient descent | SGD, Adam, RMSprop, etc.
Regularization | None | L1, L2, Dropout built-in
GPU support | No | Yes (Keras/TF)
Learning value | ★★★★★ | ★★ (black box)
Production use | ★ (educational only) | ★★★★★
Section 6

Visual Diagrams: Computation Graph & Architecture

Computation Graph for Forward & Backward Pass

FORWARD PASS (left → right):

X (2,m) ──▶ Z¹ = W¹·X + b¹ (4,m) ──▶ A¹ = tanh(Z¹) (4,m) ──▶ Z² = W²·A¹ + b² (1,m) ──▶ A² = σ(Z²) = Ŷ (1,m) ──▶ Cost J = ℒ(Ŷ, Y)

BACKWARD PASS (right → left):

∂J/∂A² = −Y/A² + (1−Y)/(1−A²) ──▶ dZ² = A² − Y ──▶ dW², db²
dZ² ──▶ dZ¹ = W²ᵀ·dZ² ⊙ (1 − A¹²) ──▶ dW¹, db¹

Shape Flow Through the Network

Layer | Linear (Z) | Activation (A) | Weights | Bias
Input (0) | n/a | X: (2, m) | n/a | n/a
Hidden (1) | Z¹: (4, m) | A¹: (4, m) | W¹: (4, 2) | b¹: (4, 1)
Output (2) | Z²: (1, m) | A²: (1, m) = Ŷ | W²: (1, 4) | b²: (1, 1)
Total parameters: 4×2 + 4 + 1×4 + 1 = 8 + 4 + 4 + 1 = 17

Activation Function Shapes

[Plots: sigmoid (S-curve rising from 0 to 1, passing through 0.5 at z = 0), tanh (S-curve rising from −1 to 1, passing through 0 at z = 0), ReLU (flat at 0 for z < 0, then a straight line of slope 1)]

Property | Sigmoid | Tanh | ReLU
Range | (0, 1) | (−1, 1) | [0, ∞)
Zero-centered | No | Yes | No
Gradient at 0 | 0.25 | 1.0 | undefined
ReLU was popularized for deep networks by Nair & Hinton in 2010 and is arguably one of the most impactful single ideas in deep learning. Before ReLU, training networks with more than 2-3 sigmoid or tanh hidden layers was very difficult, largely because of vanishing gradients. ReLU's constant gradient of 1 (for positive inputs) greatly eased this problem, helping to enable the "deep" in deep learning.
Section 7

Worked Example: Hand-Computing One Forward & Backward Pass

Let's trace through one complete forward + backward pass with actual numbers. This is the best way to build intuition.

Setup: Tiny Network (2 inputs → 2 hidden → 1 output)

Input (1 example)

x = [1.0, 0.5]ᵀ (shape: 2×1),   y = 1

Initialized Parameters

W[1] = [[0.1, 0.3], [0.2, 0.4]] (2×2)
b[1] = [[0.0], [0.0]] (2×1)
W[2] = [[0.5, 0.6]] (1×2)
b[2] = [[0.0]] (1×1)

Forward Pass

Hand Calculation
# Layer 1: z[1] = W[1]·x + b[1]
z₁⁽¹⁾ = 0.1×1.0 + 0.3×0.5 + 0.0 = 0.25
z₂⁽¹⁾ = 0.2×1.0 + 0.4×0.5 + 0.0 = 0.40

# Layer 1: a[1] = tanh(z[1])
a₁⁽¹⁾ = tanh(0.25) = 0.2449
a₂⁽¹⁾ = tanh(0.40) = 0.3799

# Layer 2: z[2] = W[2]·a[1] + b[2]
z⁽²⁾ = 0.5×0.2449 + 0.6×0.3799 + 0.0 = 0.3504

# Layer 2: a[2] = σ(z[2])
a⁽²⁾ = σ(0.3504) = 1/(1+e^(−0.3504)) = 0.5867

# ŷ = 0.5867 (predicted probability)
# True y = 1, so the network needs to push ŷ higher

Cost

Hand Calculation
J = -[y·log(ŷ) + (1-y)·log(1-ŷ)]      (log is the natural logarithm)
  = -[1·log(0.5867) + 0·log(0.4133)]
  = -log(0.5867)
  = 0.5332

Backward Pass

Hand Calculation
# Layer 2 gradients
dz⁽²⁾ = a⁽²⁾ - y = 0.5867 - 1 = -0.4133

dW₁⁽²⁾ = dz⁽²⁾ × a₁⁽¹⁾ = -0.4133 × 0.2449 = -0.1012
dW₂⁽²⁾ = dz⁽²⁾ × a₂⁽¹⁾ = -0.4133 × 0.3799 = -0.1570
db⁽²⁾  = dz⁽²⁾ = -0.4133

# Layer 1 gradients (tanh derivative: 1 - tanh²)
dz₁⁽¹⁾ = W₁⁽²⁾·dz⁽²⁾ × (1 - a₁⁽¹⁾²) = 0.5×(-0.4133) × (1 - 0.2449²)
       = -0.2067 × 0.9400 = -0.1943

dz₂⁽¹⁾ = W₂⁽²⁾·dz⁽²⁾ × (1 - a₂⁽¹⁾²) = 0.6×(-0.4133) × (1 - 0.3799²)
       = -0.2480 × 0.8557 = -0.2122

dW⁽¹⁾ = [[dz₁⁽¹⁾·x₁, dz₁⁽¹⁾·x₂],    = [[-0.1943, -0.0972],
          [dz₂⁽¹⁾·x₁, dz₂⁽¹⁾·x₂]]       [-0.2122, -0.1061]]

Parameter Update (α = 1.0)

Hand Calculation
# W[1] := W[1] - α·dW[1]
W⁽¹⁾_new = [[0.1 - 1.0×(-0.1943), 0.3 - 1.0×(-0.0972)],
             [0.2 - 1.0×(-0.2122), 0.4 - 1.0×(-0.1061)]]
          = [[0.2943, 0.3972],
             [0.4122, 0.5061]]

# All weights increased, pushing ŷ toward 1 ✓

# W[2] := W[2] - α·dW[2]
W⁽²⁾_new = [[0.5 - 1.0×(-0.1012), 0.6 - 1.0×(-0.1570)]]
          = [[0.6012, 0.7570]]
Sanity check: After one gradient step, all weights increased (dW was negative because the network needed to increase ŷ). If we recompute the forward pass with these new weights, ŷ will be closer to 1. This is exactly what gradient descent should do!
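The sanity check can be run rather than taken on faith. The sketch below repeats the backward pass in NumPy, applies one update with α = 1.0, and recomputes ŷ. One assumption goes beyond the hand calculation: the biases are updated too, using db[1] = dz[1] and db[2] = dz[2]. Because the code works at full floating-point precision, its gradients can differ from hand-rounded values in the last digit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass: tanh hidden layer, sigmoid output."""
    a1 = np.tanh(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return a1, a2

x, y = np.array([[1.0], [0.5]]), 1.0
W1, b1 = np.array([[0.1, 0.3], [0.2, 0.4]]), np.zeros((2, 1))
W2, b2 = np.array([[0.5, 0.6]]), np.zeros((1, 1))
alpha = 1.0

a1, a2 = forward(x, W1, b1, W2, b2)

# Backward pass for a single example (m = 1)
dz2 = a2 - y                       # (1, 1)
dW2 = dz2 @ a1.T                   # (1, 2)
db2 = dz2
dz1 = (W2.T @ dz2) * (1 - a1**2)   # (2, 1); tanh'(z) = 1 - tanh(z)^2
dW1 = dz1 @ x.T                    # (2, 2)
db1 = dz1

# One gradient-descent step on every parameter (bias updates are an added assumption)
W1_new, b1_new = W1 - alpha * dW1, b1 - alpha * db1
W2_new, b2_new = W2 - alpha * dW2, b2 - alpha * db2

_, a2_new = forward(x, W1_new, b1_new, W2_new, b2_new)
print("dW1 =\n", np.round(dW1, 4))
print("y_hat before:", round(a2.item(), 4), " after:", round(a2_new.item(), 4))
```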
Section 8

Case Study: Mu Sigma and Retail Analytics with Shallow Networks

📊 Mu Sigma: India's Largest Pure-Play Analytics Firm

Company Background

Mu Sigma, founded in 2004 in Bengaluru by Dhiraj Rajaram, is one of India's largest analytics and decision science companies. With 3,500+ "decision scientists" and offices in Chicago and Bengaluru, Mu Sigma serves Fortune 500 clients across retail, healthcare, insurance, and CPG. The company is valued at over ₹8,000 crore ($1 billion).

The Problem

A major Indian retail chain (similar to Reliance Retail / DMart) needed to predict customer churn: which customers would stop shopping at their stores in the next quarter. They had 12 input features per customer: purchase frequency, average basket size (₹), recency of last visit, number of categories shopped, coupon redemption rate, store distance, customer age, tenure, complaint history, payment mode preference, festive season spending, and loyalty points balance.

Why Not Logistic Regression?

Initial logistic regression achieved only 68% accuracy. The Mu Sigma team discovered the churn pattern was non-linear: customers with moderate purchase frequency AND low recency (no recent visits) were churning (they used to shop often but stopped), while customers with low frequency but high recency (a recent visit) were fine (they shop rarely but recently). A single linear boundary could not capture this interaction between the two features; it needed a curved decision boundary.

The Solution: Shallow Neural Network

The team implemented a 2-layer neural network with:

  • Input layer: 12 features (normalized to [0, 1])
  • Hidden layer: 24 neurons with ReLU activation
  • Output layer: 1 neuron with sigmoid (churn probability)
  • Training: 200,000 customer records, batch gradient descent, learning rate = 0.01
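To make the layer sizes concrete, here is a sketch of the 12 → 24 → 1 architecture in NumPy. The customer data is a random stand-in (the real dataset is not available), and the 0.01 weight scale is borrowed from the initialization advice elsewhere in this chapter rather than from the case study.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, n_y = 12, 24, 1    # layer sizes from the list above

# Hypothetical stand-in for the customer table: 5 customers, features scaled to [0, 1]
X = rng.random((n_x, 5))

W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_y, n_h)) * 0.01
b2 = np.zeros((n_y, 1))

A1 = np.maximum(0.0, W1 @ X + b1)             # ReLU hidden layer, shape (24, 5)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))    # churn probability, shape (1, 5)

n_params = W1.size + b1.size + W2.size + b2.size
print(A1.shape, A2.shape, n_params)           # parameters: 12*24 + 24 + 24 + 1 = 337
```

With only 337 parameters against 200,000 records, this network is small relative to the data.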
Results
Metric | Logistic Regression | Shallow Neural Network
Accuracy | 68% | 84%
Precision (churn class) | 0.55 | 0.79
Recall (churn class) | 0.48 | 0.76
AUC-ROC | 0.72 | 0.89
Retention offers savings | ₹2.1 crore/quarter | ₹5.8 crore/quarter
Key Insight

The hidden layer learned feature interactions that the linear model couldn't capture. Neuron #7, for instance, activated strongly when a customer had high historical frequency but low recent activity, essentially learning the concept of "lapsing customer" on its own, without being explicitly programmed with this feature.

Business Impact

By accurately identifying at-risk customers, the retail chain sent targeted retention offers (₹200 discount coupons, exclusive sale access) to the right customers, saving ₹5.8 crore per quarter in prevented churn, roughly 2.8× the ₹2.1 crore saved with the logistic regression approach.

Mu Sigma's role in Indian analytics: Mu Sigma pioneered the "analytics as a service" model in India, training thousands of fresh graduates from IITs, NITs, and BITS in decision science. Many of India's current data science leaders (at Flipkart, Ola, Swiggy, and Razorpay) are Mu Sigma alumni. The company proved that complex analytics can be delivered from India at a fraction of US costs, helping establish Bengaluru as a global analytics hub.
Section 9

Common Mistakes & Misconceptions

Mistake #1: "More hidden neurons always means better accuracy."
Not true! Too many neurons lead to overfitting: the network memorizes training data (including noise) instead of learning general patterns. For 400 training examples, 8 hidden neurons work well. Using 500 hidden neurons could memorize the training data almost perfectly but generalize poorly to new data. The right number depends on your dataset size and complexity.
Mistake #2: "I should use sigmoid activation for hidden layers."
Sigmoid should almost never be used for hidden layers in modern networks. Its non-zero-centered output and vanishing gradient make training slow. Use ReLU (or its variants) for hidden layers and sigmoid only for the output layer in binary classification. This single change can speed up training by as much as 5-10×.
Mistake #3: "Initializing all weights to zero saves computation."
Zero initialization creates perfect symmetry: all hidden neurons compute identical values, get identical gradients, and stay identical forever. You effectively have a 1-neuron network regardless of how many neurons you defined. Always use random initialization (e.g., np.random.randn(...) * 0.01).
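The symmetry argument can be demonstrated directly. The sketch below trains a small tanh/sigmoid network on the four XOR points twice: once with every weight set to the same constant (0.5 is used instead of 0 so the weights visibly move, an illustrative choice) and once with small random weights. The helper `train_W1`, the learning rate, and the step count are all hypothetical.

```python
import numpy as np

def train_W1(W1, W2, X, Y, alpha=0.5, steps=200):
    """Run plain gradient descent on a 2-layer tanh/sigmoid net; return the trained W1."""
    W1, W2 = W1.copy(), W2.copy()
    b1, b2 = np.zeros((W1.shape[0], 1)), np.zeros((1, 1))
    m = X.shape[1]
    for _ in range(steps):
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * (1 - A1**2)
        dW2, db2 = dZ2 @ A1.T / m, dZ2.sum(axis=1, keepdims=True) / m
        dW1, db1 = dZ1 @ X.T / m, dZ1.sum(axis=1, keepdims=True) / m
        W1 -= alpha * dW1; b1 -= alpha * db1
        W2 -= alpha * dW2; b2 -= alpha * db2
    return W1

# XOR inputs as columns, labels as a row
X = np.array([[0., 0., 1., 1.], [0., 1., 0., 1.]])
Y = np.array([[0., 1., 1., 0.]])

rng = np.random.default_rng(1)
W1_const = train_W1(np.full((3, 2), 0.5), np.full((1, 3), 0.5), X, Y)
W1_rand = train_W1(rng.standard_normal((3, 2)) * 0.01,
                   rng.standard_normal((1, 3)) * 0.01, X, Y)

rows_identical_const = np.allclose(W1_const, W1_const[0])  # compare every row with row 0
rows_identical_rand = np.allclose(W1_rand, W1_rand[0])
print(rows_identical_const, rows_identical_rand)
```

Under constant initialization the three hidden units remain copies of one another after training, while random initialization lets them diverge.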
Mistake #4: "dZ[1] = A[1] − Y (copying the output layer formula)."
The clean formula dZ[2] = A[2] − Y only works for the output layer with sigmoid + cross-entropy loss. For hidden layers, you must backpropagate through the weight matrix AND multiply by the activation derivative: dZ[1] = W[2]ᵀ·dZ[2] ⊙ g'(Z[1]). The ⊙ (element-wise multiply) is critical!
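One way to convince yourself that the hidden-layer formula is right is to compare it with a numerical gradient. The sketch below does this with central finite differences for a tanh hidden layer. Everything in it (random data, layer sizes, the step size `eps`) is an illustrative assumption, and the check itself is a debugging aid rather than part of the training algorithm.

```python
import numpy as np

def loss(W1, b1, W2, b2, X, Y):
    """Mean binary cross-entropy of a tanh/sigmoid 2-layer network."""
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 6))                     # hypothetical data: 3 features, 6 examples
Y = rng.integers(0, 2, size=(1, 6)).astype(float)
W1, b1 = rng.standard_normal((4, 3)) * 0.5, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.5, np.zeros((1, 1))
m = X.shape[1]

# Analytic gradient of W1 using the hidden-layer formula
A1 = np.tanh(W1 @ X + b1)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
dZ2 = A2 - Y
dZ1 = (W2.T @ dZ2) * (1 - A1**2)    # backpropagate through W2, then element-wise multiply
dW1 = dZ1 @ X.T / m

# Central finite differences, one weight at a time
eps = 1e-6
dW1_num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW1_num[i, j] = (loss(Wp, b1, W2, b2, X, Y) - loss(Wm, b1, W2, b2, X, Y)) / (2 * eps)

max_diff = np.max(np.abs(dW1 - dW1_num))
print(max_diff)   # should be tiny if the analytic formula is correct
```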
Mistake #5: "I forgot keepdims=True in np.sum() for db."
When computing db = (1/m) × np.sum(dZ, axis=1), NumPy returns a 1D array of shape (n,) instead of (n, 1). This causes silent broadcasting bugs in the parameter update step. Always use keepdims=True to maintain the (n, 1) column vector shape.
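The bug is easy to reproduce. In this small sketch (the array values and the 0.1 learning rate are made up for illustration), the update with the 1D `db` silently turns a (3, 1) bias into a (3, 3) matrix.

```python
import numpy as np

dZ = np.arange(12, dtype=float).reshape(3, 4)   # pretend gradient: 3 units, 4 examples
m = dZ.shape[1]
b = np.zeros((3, 1))

db_bad = np.sum(dZ, axis=1) / m                   # shape (3,)
db_good = np.sum(dZ, axis=1, keepdims=True) / m   # shape (3, 1)

b_bad = b - 0.1 * db_bad      # (3, 1) minus (3,) broadcasts to (3, 3), with no error raised
b_good = b - 0.1 * db_good    # stays (3, 1)

print(db_bad.shape, b_bad.shape)
print(db_good.shape, b_good.shape)
```

A shape assertion after each update, such as `assert b.shape == (n, 1)`, catches this class of error early.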
Mistake #6: "The learning rate doesn't matter much."
Too high (e.g., α = 100): cost oscillates wildly, never converges. Too low (e.g., α = 0.0001): training takes forever. For shallow networks with tanh, a good starting point is α = 0.5 to 2.0. Always plot the cost curve; it should decrease smoothly.
Section 10

Comparison Table: Logistic Regression vs. Shallow Neural Network

Aspect | Logistic Regression (Ch 5) | Shallow Neural Network (Ch 6)
Architecture | Single neuron (no hidden layer) | One hidden layer with multiple neurons
Decision Boundary | Linear (straight line/hyperplane) | Non-linear (curves, regions)
Parameters | W (1×n), b (scalar) | W[1], b[1], W[2], b[2]
Expressiveness | Can only learn linearly separable patterns | Can approximate any continuous function on a bounded domain, given enough hidden units (universal approximator)
XOR Problem | ❌ Cannot solve | ✅ Easily solved
Forward Pass | ŷ = σ(Wx + b), 1 step | 2 steps: hidden → output
Backprop | dW = (1/m)(A−Y)Xᵀ | Chain rule through 2 layers
Initialization | Zeros OK | Must be random (symmetry breaking)
Training Speed | Fast (convex optimization) | Slower (non-convex, more parameters)
Overfitting Risk | Low | Higher (more capacity)
Interpretability | High (feature weights directly interpretable) | Lower (hidden representations are abstract)
When to Use | Linearly separable data, baseline model | Non-linear data, feature interactions matter
Practical workflow: Always start with logistic regression as a baseline. If it achieves 95%+ accuracy, you probably don't need a neural network. If accuracy is low and the problem involves feature interactions, try a shallow NN (4–32 hidden neurons). Only go deeper if the shallow network plateaus.
Section 11

Exercises

Section A: Multiple Choice Questions (10)

Q1.

In a 2-layer neural network with 5 input features, 3 hidden units, and 1 output unit, what is the shape of W[1]?

  A. (5, 3)
  B. (3, 5)
  C. (1, 3)
  D. (3, 1)
✅ B. (3, 5). W[l] has shape (n[l], n[l-1]) = (3, 5). Rows = current layer units, columns = previous layer units.
Remember | Beginner
Q2.

Why is tanh preferred over sigmoid for hidden layers?

  A. tanh has a larger range [−2, 2]
  B. tanh outputs are zero-centered, leading to faster convergence
  C. tanh never has vanishing gradients
  D. tanh is computationally cheaper than sigmoid
✅ B. tanh outputs are centered around 0 (range: −1 to 1), so the mean activation is close to 0. This makes gradient updates for the next layer more balanced, avoiding the "zig-zag" problem of all-positive sigmoid outputs.
Understand | Beginner
Q3.

What happens if all weights in a neural network are initialized to zero?

  A. The network learns faster due to simplicity
  B. All hidden neurons learn the same features; symmetry is never broken
  C. Only the output layer is affected
  D. The biases compensate for the zero weights
✅ B. With identical weights, all neurons compute the same value, get the same gradient, and update identically. The network effectively has 1 hidden neuron regardless of width. Random initialization breaks this symmetry.
Understand | Intermediate
Q4.

What is the derivative of the tanh activation function?

  A. tanh(z) × (1 − tanh(z))
  B. 1 − tanh²(z)
  C. tanh(z) × (1 + tanh(z))
  D. z × (1 − z²)
✅ B. 1 − tanh²(z). This follows from d/dz[tanh(z)] = sech²(z) = 1 − tanh²(z). At z=0, the derivative is 1 (maximum). Compare with sigmoid's maximum derivative of 0.25.
Remember | Beginner
Q5.

In backpropagation for a 2-layer network, dZ[1] = ?

  A. A[1] − Y
  B. W[2]ᵀ · dZ[2] ⊙ g[1]'(Z[1])
  C. W[1]ᵀ · dZ[2] + b[1]
  D. dZ[2] · W[2] · g[1](Z[1])
✅ B. The gradient flows backward: multiply by the transposed weight matrix (W[2]ᵀ) to "distribute" the error to each hidden unit, then element-wise multiply (⊙) by the activation derivative to account for the non-linearity.
Apply | Intermediate
Q6.

If a network uses linear activations (g(z) = z) in all layers, what does it reduce to?

  A. A polynomial regression model
  B. A single-layer linear model (linear regression)
  C. A support vector machine
  D. A decision tree
✅ B. Composition of linear functions is linear: W[2]·(W[1]·x + b[1]) + b[2] = (W[2]W[1])x + (W[2]b[1] + b[2]) = W'x + b'. No matter how many layers, the result is equivalent to a single linear transformation.
Analyze | Intermediate
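The algebra in the answer to Q6 can be confirmed numerically. This sketch uses arbitrary random matrices (the sizes are assumptions) and compares a two-layer network with identity activations against the single collapsed layer W' = W[2]W[1], b' = W[2]b[1] + b[2].

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
X = rng.standard_normal((3, 5))          # 5 arbitrary input columns

two_layer = W2 @ (W1 @ X + b1) + b2      # "hidden" layer with g(z) = z

W_eq = W2 @ W1                           # W' = W[2] W[1]
b_eq = W2 @ b1 + b2                      # b' = W[2] b[1] + b[2]
one_layer = W_eq @ X + b_eq

same_output = np.allclose(two_layer, one_layer)
print(same_output)
```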
Q7.

What is the "dying ReLU" problem?

  A. ReLU neurons output NaN for large inputs
  B. Neurons with consistently negative pre-activation have zero gradient and never update
  C. ReLU causes exploding gradients in deep networks
  D. ReLU neurons become saturated at a maximum value
✅ B. If a neuron's z is always negative (due to a large negative bias or unfortunate weight update), ReLU outputs 0 with gradient 0. The neuron never receives gradient signal and is permanently "dead." Leaky ReLU fixes this by allowing a small gradient (slope α ≈ 0.01) for negative inputs.
Understand | Intermediate
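A short sketch makes the difference in Q7 visible. The function names and the sample pre-activations are illustrative, and the derivative at exactly z = 0 is set to 0 for ReLU by convention here.

```python
import numpy as np

def relu_grad(z):
    """Derivative of ReLU: 1 for z > 0, else 0 (the value at z = 0 is a convention)."""
    return (z > 0).astype(float)

def leaky_relu_grad(z, slope=0.01):
    """Derivative of Leaky ReLU: 1 for z > 0, else the small negative-side slope."""
    return np.where(z > 0, 1.0, slope)

z = np.array([-3.0, -0.5, 2.0])    # two negative pre-activations, one positive

g_relu = relu_grad(z)
g_leaky = leaky_relu_grad(z)
print(g_relu)    # no gradient reaches the weights for negative z
print(g_leaky)   # a small gradient survives, so the unit can recover
```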
Q8.

For a network with n[0]=10, n[1]=20, n[2]=1, how many total learnable parameters are there?

  A. 231
  B. 241
  C. 220
  D. 210
✅ B. 241. W[1]: 20×10=200, b[1]: 20, W[2]: 1×20=20, b[2]: 1. Total = 200 + 20 + 20 + 1 = 241.
Apply | Beginner
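The counting rule in Q8 generalizes to any list of layer sizes. The helper below is a hypothetical utility, not part of the chapter's class; it applies the shapes W[l]: (n[l], n[l-1]) and b[l]: (n[l], 1).

```python
def count_parameters(layer_sizes):
    """Total weights and biases for fully connected layers of the given sizes."""
    total = 0
    for n_prev, n_curr in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_curr * n_prev   # W[l] has shape (n[l], n[l-1])
        total += n_curr            # b[l] has shape (n[l], 1)
    return total

n_params = count_parameters([10, 20, 1])
print(n_params)  # 200 + 20 + 20 + 1 = 241
```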
Q9.

Why do we multiply weights by 0.01 during random initialization?

  A. To normalize the weights to unit variance
  B. To keep pre-activations small so activations start in the steep gradient region
  C. To prevent the network from overfitting
  D. To ensure the biases dominate initially
✅ B. Small weights → small z values → sigmoid/tanh in their near-linear region, where the gradient is steepest (up to 1 for tanh, 0.25 for sigmoid) → fast learning. Large weights → z in saturated region → gradient ≈ 0 → vanishing gradients → slow/no learning.
Understand | Intermediate
Q10.

In vectorized forward propagation, b[1] has shape (n[1], 1) but Z[1] has shape (n[1], m). How does the addition Z[1] = W[1]·X + b[1] work?

  A. b[1] is tiled m times manually
  B. NumPy broadcasting copies b[1] across all m columns automatically
  C. A for-loop adds b[1] to each column
  D. b[1] is reshaped to (1, m) first
✅ B. NumPy broadcasting automatically replicates the (n[1], 1) bias vector across all m columns during addition. This is both memory-efficient (no actual copying) and fast (the loop runs in compiled code rather than in Python). This is why keepdims=True is important: it preserves the (n, 1) shape needed for broadcasting.
Understand | Beginner
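The broadcasting behaviour described in Q10 can be inspected directly. In this sketch the array sizes and values are arbitrary, and `np.broadcast_to` is used only to reveal the stride-0 view; the addition itself needs nothing more than `+`.

```python
import numpy as np

n1, m = 3, 5
WX = np.ones((n1, m))                      # stand-in for W[1]·X
b1 = np.array([[10.0], [20.0], [30.0]])    # (n1, 1) column vector

Z1 = WX + b1                               # b1 is broadcast across the m columns

# Explicit broadcast view, for inspection only
view = np.broadcast_to(b1, (n1, m))

print(Z1[:, 0], Z1[:, -1])   # every column received the same bias
print(view.strides)          # stride 0 along axis 1 means no data was copied
```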

Section B: Short Answer Questions (5)

B1.

Write the four equations of forward propagation for a 2-layer neural network (vectorized form). State the shape of each intermediate result assuming n[0]=3, n[1]=5, n[2]=1, and m=100 training examples.

Remember | Beginner | 4 marks
B2.

Explain the symmetry problem in neural networks. What specific condition causes it, and what is the standard solution? Why can biases still be initialized to zero without causing this problem?

Understand | Intermediate | 4 marks
B3.

Compare sigmoid and tanh activation functions along four axes: output range, zero-centering, maximum derivative value, and recommended use case. Include the mathematical relationship between them.

Analyze | Intermediate | 5 marks
B4.

A Paytm fraud detection model uses a shallow neural network with 15 input features and 10 hidden neurons. Calculate: (a) total number of parameters, (b) shape of each weight matrix and bias vector, (c) shape of Z[1] and A[2] when processing a batch of 500 transactions.

Apply | Beginner | 5 marks
B5.

Explain the "dying ReLU" problem. How does Leaky ReLU address it? Write the formulas for both ReLU and Leaky ReLU, including their derivatives.

Understand | Intermediate | 4 marks

Section C: Long Answer Questions (3)

C1.

Full Backpropagation Derivation. For a 2-layer neural network with tanh hidden activation and sigmoid output activation, derive the complete set of backpropagation equations. Start from the binary cross-entropy loss J, and derive dZ[2], dW[2], db[2], dZ[1], dW[1], db[1] step by step using the chain rule. Show each intermediate step and verify the shapes.

Analyze | Advanced | 12 marks
C2.

Linear Activation Analysis. (a) Prove that a neural network with any number of hidden layers using linear activations g(z) = z is equivalent to a single linear transformation. Show the proof for a 3-layer network. (b) Does this mean linear activations are never useful? Discuss one scenario where a linear output activation is appropriate. (c) What is the minimum requirement for a neural network to learn XOR? Prove with a construction.

Evaluate | Advanced | 12 marks
C3.

Activation Function Selection. You are building a meal recommendation system for Swiggy with these requirements: (a) Hidden layer for learning food preference patterns from 50 user features. (b) Output layer 1: probability of the user ordering (binary). (c) Output layer 2: predicted delivery time in minutes (continuous, 10–90 min). (d) Output layer 3: rating prediction (1–5 stars). For each layer/output, recommend an activation function with justification. Also discuss what would go wrong if you used sigmoid for all hidden layers in a network with 5 hidden layers.

Evaluate | Intermediate | 10 marks

Section D: Programming Exercises (2)

D1.

Activation Function Visualizer. Write a Python program that:

  • Plots all 5 activation functions (sigmoid, tanh, ReLU, Leaky ReLU, ELU) on the same figure with z ∈ [−5, 5]
  • Plots their derivatives on a second subplot
  • Annotates each curve with its name and output range
  • Uses a clean, publication-quality style with legend
Apply | Intermediate | 8 marks
D2.

Hidden Unit Ablation Study. Using the ShallowNeuralNetwork class from this chapter:

  • Train the network on the XOR dataset with n_h = 1, 2, 4, 8, 16, 32, 64 hidden neurons
  • For each, record: final accuracy, final cost, and number of epochs to reach 95% accuracy (or "N/A" if it never does)
  • Plot: (a) accuracy vs. n_h, (b) cost curves for all values on the same plot
  • Answer: What is the minimum n_h that can solve XOR? Does increasing n_h always help?
Analyze | Advanced | 10 marks

Section E: Mini-Project

E1.

Flipkart Product Classifier. Build a shallow neural network from scratch to classify Flipkart products into two categories (e.g., electronics vs. fashion) based on product title features:

  • Data: Create a synthetic dataset of 1000 products with 5 features: title_length, has_brand_name (0/1), avg_word_length, number_count (digits in title), special_char_count
  • Architecture: 5 → n_h → 1 (experiment with n_h = 4, 8, 16)
  • Implement: The full NeuralNetwork class with both tanh and ReLU options for hidden layer
  • Evaluate: Train/test split (80/20), report accuracy, plot cost curves and decision boundary (pick any 2 features for visualization)
  • Compare: Against logistic regression baseline. Is the neural network better? By how much?
  • Report: Write a brief report (1 page) with findings, including which activation function and n_h worked best and why
Create | Advanced | 15 marks
Section 12

Chapter Summary

🧠 Key Takeaways from Chapter 6

  • Architecture: A shallow neural network has an input layer (layer 0), one hidden layer (layer 1), and an output layer (layer 2). We count layers by the number of weight matrices: 2.
  • Notation: W[l] has shape (n[l], n[l-1]), b[l] has shape (n[l], 1). The gradient dW[l] always has the same shape as W[l].
  • Forward propagation computes Z[l] = W[l]·A[l-1] + b[l], then A[l] = g[l](Z[l]). Vectorize over examples (m columns), loop only over layers.
  • Activation functions matter: tanh > sigmoid for hidden layers (zero-centered, stronger gradients). ReLU is the modern default (no vanishing gradient for positive inputs, computationally cheap). Never use linear activations for hidden layers.
  • Linear activations collapse: composition of linear functions = single linear function. Hidden layers with g(z) = z add no expressive power.
  • Backpropagation uses the chain rule backwards: dZ[2] = A[2] − Y (output), dZ[1] = W[2]ᵀ·dZ[2] ⊙ g'(Z[1]) (hidden). Shape of dW[l] = shape of W[l].
  • Random initialization is essential to break symmetry. Small weights (×0.01) keep activations in the high-gradient region of sigmoid/tanh.
  • Universal Approximation Theorem: One hidden layer with enough neurons can approximate any continuous function on a bounded domain, but "enough" may be exponentially many.
  • Practical wisdom: Start with logistic regression as baseline. If it fails, try a shallow NN with 4–32 hidden neurons. Plot cost curves. Check shapes obsessively.
The 6 Equations of a Shallow Neural Network:

Forward: Z[1] = W[1]X + b[1] → A[1] = tanh(Z[1]) → Z[2] = W[2]A[1] + b[2] → A[2] = σ(Z[2])
Backward: dZ[2] = A[2]−Y → dZ[1] = W[2]ᵀdZ[2] ⊙ (1−A[1]²)
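The six equations translate almost line for line into NumPy. The sketch below is one possible arrangement, not the chapter's NeuralNetwork class: the function name, the random data, and the XOR-like labelling rule are assumptions. It also computes the parameter gradients that follow from dZ[2] and dZ[1].

```python
import numpy as np

def forward_backward(X, Y, W1, b1, W2, b2):
    """One forward and backward pass; returns the cost and a dict of gradients."""
    m = X.shape[1]
    # Forward (4 equations)
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))
    cost = float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))
    # Backward (2 equations), followed by the parameter gradients
    dZ2 = A2 - Y
    dZ1 = (W2.T @ dZ2) * (1 - A1**2)
    grads = {
        "dW2": dZ2 @ A1.T / m, "db2": dZ2.sum(axis=1, keepdims=True) / m,
        "dW1": dZ1 @ X.T / m,  "db1": dZ1.sum(axis=1, keepdims=True) / m,
    }
    return cost, grads

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8))               # hypothetical data: 2 features, 8 examples
Y = (X[0:1] * X[1:2] > 0).astype(float)       # XOR-like labels by quadrant
W1, b1 = rng.standard_normal((4, 2)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1))

cost, grads = forward_backward(X, Y, W1, b1, W2, b2)
print(round(cost, 4), {name: g.shape for name, g in grads.items()})
```

With weights this small the initial cost sits close to log 2 ≈ 0.693, the cost of predicting 0.5 for every example, and each gradient has the same shape as the parameter it updates.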
Your next step: In Chapter 7, we go deep, with multiple hidden layers! You'll learn how to generalize the forward and backward passes for L layers, understand why depth helps, and encounter new challenges like vanishing/exploding gradients. The notation you mastered here (W[l], b[l], a[l]) extends directly.
Section 13

References & Further Reading

Primary Textbooks

  • Goodfellow, Bengio & Courville (2016). Deep Learning. MIT Press. Chapter 6 (Deep Feedforward Networks) is the definitive reference for activation functions and network architectures. Free at deeplearningbook.org.
  • Andrew Ng. Coursera Deep Learning Specialization. Course 1, Weeks 3-4: Shallow Neural Networks and Deep Neural Networks. The notation used in this chapter follows Ng's conventions.
  • Michael Nielsen (2015). Neural Networks and Deep Learning. Free online book (neuralnetworksanddeeplearning.com). Chapter 1 has excellent visualizations of how neural networks learn.

Landmark Papers

  • Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function." Mathematics of Control, Signals, and Systems. The Universal Approximation Theorem.
  • Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks." Neural Networks. Extended the theorem to a broad class of activation functions beyond sigmoids.
  • Nair, V. & Hinton, G. (2010). "Rectified Linear Units Improve Restricted Boltzmann Machines." ICML. The paper that popularized ReLU.
  • Glorot, X. & Bengio, Y. (2010). "Understanding the Difficulty of Training Deep Feedforward Neural Networks." AISTATS. Xavier initialization.
  • Clevert, D., Unterthiner, T. & Hochreiter, S. (2016). "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." ICLR.

Indian Industry Context

  • Mu Sigma: musigma.com. Case studies in retail analytics, CPG, and insurance. Blog articles on decision science methodology.
  • NASSCOM AI Report (2024): India's analytics industry overview, covering growth trends, talent pipeline, and adoption across sectors.
  • NPTEL Deep Learning Course (IIT Madras): Prof. Mitesh Khapra's course covers shallow networks in Weeks 4-5 with excellent Hindi/English explanations. Available at nptel.ac.in.
  • IndiaAI Portal: indiaai.gov.in. Government AI resources, datasets, and case studies from Indian industry.

Visualization Tools

  • TensorFlow Playground: playground.tensorflow.org. Interactively build and train shallow networks. See how hidden neurons create decision boundaries in real time. Start with the XOR dataset!
  • 3Blue1Brown, Neural Networks series (YouTube): Grant Sanderson's visual explanations of forward/backward propagation are the best visual resource available.