Neural Networks & Deep Learning

Chapter 6: Shallow Neural Networks

One Hidden Layer — From Single Neuron to Your First Network

⏱️ Reading Time: ~3.5 hours | 📖 Part II: The Single Neuron to Networks | 🧠 Theory + Code Chapter

📋 Prerequisites: Chapter 4 (Single Neuron), Chapter 5 (Logistic Regression), NumPy basics

Bloom's Taxonomy Map for This Chapter

Bloom's Level	What You'll Achieve
🔵 Remember	Recall the notation W^[l], b^[l], a^[l] and the architecture of a 2-layer neural network
🔵 Understand	Explain why non-linear activations are essential and why tanh outperforms sigmoid in hidden layers
🟢 Apply	Implement forward propagation and backpropagation from scratch in NumPy for a 2-layer network
🟡 Analyze	Derive backpropagation equations step-by-step using the chain rule on the computation graph
🟠 Evaluate	Compare sigmoid, tanh, ReLU, Leaky ReLU, and ELU — selecting the right one for a given problem
🔴 Create	Build a complete NeuralNetwork class that learns non-linear decision boundaries on XOR-like data

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

Draw the architecture of a shallow neural network (1 hidden layer) with proper notation for layers, weights, biases, and activations
Derive the forward propagation equations for a single training example and extend them to vectorized form over the entire dataset
Prove mathematically that linear hidden activations collapse the entire network into a single linear transformation, making hidden layers useless
Compare five activation functions — sigmoid, tanh, ReLU, Leaky ReLU, and ELU — with their formulas, derivatives, ranges, and trade-offs
Derive the complete backpropagation equations for a 2-layer network using chain rule on the computation graph
Explain the symmetry-breaking problem and why random initialization of weights is essential
Implement a full NeuralNetwork class from scratch in NumPy that trains on planar data and visualizes non-linear decision boundaries
Apply shallow neural networks to real-world classification problems in the Indian industry context

Section 2

Opening Hook — When a Single Neuron Isn't Enough

🍕 "Is this dish healthy or indulgent?" — Zomato's Cuisine Classifier

Imagine you're on Zomato's data science team in Gurugram. The product manager wants a new feature: automatically tag every dish as "Healthy Choice" 🥗 or "Indulgent Treat" 🍰 based on two features — calorie count and sugar content.

You try logistic regression (Chapter 5). It draws a straight line, and your first attempt effectively learns "below 400 calories = healthy, above = indulgent." But wait: a masala oats bowl (350 cal, 5g sugar) is healthy ✅, and a gulab jamun (300 cal, 45g sugar) is indulgent ✅. Both are under 400 calories, but one is healthy and the other isn't! Once you plot hundreds of dishes on the calorie-sugar plane, the two classes interleave in ways that no single straight line can separate. The decision boundary you need is a curve.

You need something more powerful than a single neuron. You need neurons working together — a neural network. Even just one hidden layer with a few neurons can learn these curved boundaries that separate masala oats from gulab jamun, dal makhani from butter chicken, ragi dosa from cheese naan.

This chapter builds your first real neural network: one hidden layer that can learn curved, non-linear decision boundaries.


The Universal Approximation Theorem (Cybenko, 1989) shows that a neural network with just one hidden layer of sigmoid-type units, given a sufficient number of neurons, can approximate any continuous function on a closed, bounded input region to arbitrary precision. So this "shallow" network you're about to build is, in theory, remarkably expressive! The catch? The theorem only says that suitable weights exist; it does not promise that training will find them, and a single hidden layer may need exponentially many neurons for complex functions, which is why we'll eventually go "deep."

Section 3

Core Concepts — Building the Network Layer by Layer

3a. Neural Network Representation & Notation

A shallow neural network (also called a 2-layer neural network) has exactly three layers of nodes, but we count only layers with learnable parameters:

🏗️ Architecture of a 2-Layer Neural Network

Layer 0 — Input Layer

Contains the input features x₁, x₂, …, xₙ. This layer has no weights or biases — it just passes data forward. We denote the input as a^[0] = X (activations of layer 0).

Layer 1 — Hidden Layer

Contains n^[1] hidden units (neurons). Each unit computes z = w·x + b, then applies an activation function. Parameters: W^[1] (shape: n^[1] × n^[0]) and b^[1] (shape: n^[1] × 1). Outputs: a^[1].

Layer 2 — Output Layer

Contains n^[2] output units (typically 1 for binary classification). Parameters: W^[2] (shape: n^[2] × n^[1]) and b^[2] (shape: n^[2] × 1). Final output: a^[2] = ŷ.

Why "2-Layer"?

We count layers by the number of weight matrices, not nodes. The input layer is layer 0 and has no parameters. So: Layer 1 (hidden) + Layer 2 (output) = 2-layer network.

INPUT LAYER          HIDDEN LAYER              OUTPUT LAYER
 (Layer 0)            (Layer 1)                 (Layer 2)

                   ┌───────────────┐
 ┌─────┐           │ z₁⁽¹⁾ → a₁⁽¹⁾ │
 │ x₁  │─────────▶ │               │
 └─────┘    ╲      │ z₂⁽¹⁾ → a₂⁽¹⁾ │           ┌──────────────┐
             ╲───▶ │               │─────────▶ │              │
             ╱───▶ │ z₃⁽¹⁾ → a₃⁽¹⁾ │─────────▶ │ z⁽²⁾ → a⁽²⁾  │──▶ ŷ
 ┌─────┐    ╱      │               │─────────▶ │              │
 │ x₂  │─────────▶ │ z₄⁽¹⁾ → a₄⁽¹⁾ │           └──────────────┘
 └─────┘           └───────────────┘

 a⁽⁰⁾ = X           W⁽¹⁾: (4 × 2)              W⁽²⁾: (1 × 4)
 Shape: (2 × m)     b⁽¹⁾: (4 × 1)              b⁽²⁾: (1 × 1)
                    a⁽¹⁾: (4 × m)              a⁽²⁾ = ŷ

Superscript Notation Convention

Symbol	Meaning	Example
`W^[l]`	Weight matrix of layer l	W^[1] connects input → hidden
`b^[l]`	Bias vector of layer l	b^[2] is the output layer bias
`z^[l]`	Pre-activation (linear part) of layer l	z^[1] = W^[1]·X + b^[1]
`a^[l]`	Post-activation of layer l	a^[1] = g(z^[1])
`g^[l](·)`	Activation function of layer l	g^[1] could be tanh, g^[2] could be sigmoid
`n^[l]`	Number of units in layer l	n^[0] = 2, n^[1] = 4, n^[2] = 1
`(i)`	Superscript in parentheses = training example index	x⁽³⁾ = 3rd training example

Dimension check shortcut: The shape of W^[l] is always (n^[l], n^[l-1]) — rows = units in current layer, columns = units in previous layer. If your dimensions don't match this pattern, you have a bug. This single rule will save you hours of debugging.
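The shape rule is easy to turn into an automatic check. The sketch below is only an illustration: `init_params` and `layer_sizes` are hypothetical names introduced here, not part of the chapter's class.

```python
import numpy as np

def init_params(layer_sizes, seed=0):
    """Create W^[l] and b^[l] for sizes [n0, n1, ..., nL] (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        n_prev, n_curr = layer_sizes[l - 1], layer_sizes[l]
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * 0.01
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

layer_sizes = [2, 4, 1]          # n^[0] = 2, n^[1] = 4, n^[2] = 1
params = init_params(layer_sizes)

for l in range(1, len(layer_sizes)):
    # Rule: W^[l] is (n^[l], n^[l-1]) and b^[l] is (n^[l], 1)
    assert params[f"W{l}"].shape == (layer_sizes[l], layer_sizes[l - 1])
    assert params[f"b{l}"].shape == (layer_sizes[l], 1)
    print(f"W{l}: {params[f'W{l}'].shape}, b{l}: {params[f'b{l}'].shape}")
```

Running the same loop after every change to the architecture is a cheap way to catch a transposed weight matrix early.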

3b. Forward Propagation — Single Example & Vectorized

Single Training Example (x⁽ⁱ⁾)

For a single input vector x with shape (n^[0], 1), the forward pass through our 2-layer network computes:

Layer 1 (Hidden):
z^[1] = W^[1] · x + b^[1] → a^[1] = g^[1](z^[1])

Layer 2 (Output):
z^[2] = W^[2] · a^[1] + b^[2] → a^[2] = g^[2](z^[2]) = ŷ

Step-by-step walkthrough:

Multiply: W^[1] (4×2) · x (2×1) = z_partial (4×1) — each hidden neuron computes its weighted sum of inputs
Add bias: z_partial + b^[1] (4×1) = z^[1] (4×1)
Activate: Apply g^[1] (e.g., tanh) element-wise → a^[1] (4×1) — now we have 4 hidden unit outputs
Multiply: W^[2] (1×4) · a^[1] (4×1) = z_partial (1×1) — output neuron combines hidden outputs
Add bias: z_partial + b^[2] (1×1) = z^[2] (1×1)
Activate: Apply g^[2] (sigmoid for binary classification) → a^[2] (1×1) = ŷ ∈ (0, 1)

Vectorized Over m Examples

Instead of looping over m training examples, we stack all examples as columns of a matrix X (shape: n^[0] × m). The vectorized forward propagation becomes:

Vectorized Forward Propagation:

Z^[1] = W^[1] · X + b^[1]     (n^[1] × m)
A^[1] = g^[1](Z^[1])             (n^[1] × m)
Z^[2] = W^[2] · A^[1] + b^[2]    (n^[2] × m)
A^[2] = g^[2](Z^[2]) = Ŷ         (n^[2] × m)

Here, b^[1] (shape: n^[1] × 1) is broadcast across all m columns automatically by NumPy. Each column of A^[2] is the prediction for one training example.

Don't use a for-loop over examples! Writing for i in range(m): z = W @ X[:, i] is correct but painfully slow. The vectorized version Z = W @ X hands the whole computation to NumPy's optimized BLAS routines and is typically orders of magnitude faster for large m (the exact speedup depends on your hardware and matrix sizes). Always vectorize over training examples; loop only over layers.
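To see that the loop and the vectorized form compute the same thing, here is a minimal sketch for the hidden layer's linear step. The sizes and the random data are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 2, 4, 1000
W1 = rng.standard_normal((n_h, n_x))
b1 = rng.standard_normal((n_h, 1))
X = rng.standard_normal((n_x, m))

# Loop version: one column (one training example) at a time
Z_loop = np.zeros((n_h, m))
for i in range(m):
    x_i = X[:, i].reshape(n_x, 1)          # keep the column shape (n_x, 1)
    Z_loop[:, i] = (W1 @ x_i + b1).ravel()

# Vectorized version: b1 with shape (n_h, 1) is broadcast across all m columns
Z_vec = W1 @ X + b1

print(Z_vec.shape)                 # (4, 1000)
print(np.allclose(Z_loop, Z_vec))  # True
```

Both versions produce the same matrix; only the vectorized one avoids a Python-level loop over examples.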

3c. Activation Functions — The Complete Deep Dive

The activation function g(z) is what gives neural networks their power. Without it, stacking layers is pointless (we'll prove this next). Here are the five most important activation functions:

① Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ)

Formula & Derivative

σ(z) = 1 / (1 + e^−z) | σ'(z) = σ(z) · (1 − σ(z))

Range

(0, 1) — always positive, interpretable as probability

Pros

Output between 0 and 1, so perfect for output layer in binary classification. Smooth gradient everywhere.

Cons

① Vanishing gradient: When |z| is large, σ'(z) ≈ 0, gradients vanish, learning stops. ② Not zero-centered: Output is always positive, causing zig-zag gradient updates. ③ Exp() is computationally expensive.

When to Use

Output layer only for binary classification. Almost never for hidden layers.

② Tanh: tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)

Formula & Derivative

tanh(z) = (e^z − e^−z) / (e^z + e^−z) | tanh'(z) = 1 − tanh²(z)

Range

(−1, 1) — centered around zero

Why tanh > sigmoid for hidden layers

① Zero-centered output: Since the mean of tanh output is closer to 0 (vs. 0.5 for sigmoid), the next layer's inputs are centered, making gradient descent converge faster. ② Steeper gradient: tanh has a maximum derivative of 1 (at z=0) vs. sigmoid's maximum of 0.25. This means 4× stronger gradients in the active region.

Cons

Still suffers from vanishing gradients for |z| ≫ 0, just like sigmoid.

Relationship

tanh(z) = 2σ(2z) − 1 — tanh is a shifted and scaled version of sigmoid!
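Both the identity and the peak slopes quoted above can be checked numerically. This is a quick sketch on an arbitrary grid of z values, not a proof.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 1001)

# Identity: tanh(z) = 2*sigmoid(2z) - 1
identity_holds = np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)

# Peak slopes, both reached at z = 0
max_sigmoid_grad = np.max(sigmoid(z) * (1 - sigmoid(z)))   # about 0.25
max_tanh_grad = np.max(1 - np.tanh(z) ** 2)                # about 1.0

print(identity_holds, max_sigmoid_grad, max_tanh_grad)
```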

③ ReLU: max(0, z)

Formula & Derivative

ReLU(z) = max(0, z) | ReLU'(z) = 1 if z > 0, else 0

Range

[0, ∞) — unbounded positive

Pros

① No vanishing gradient for z > 0 (derivative is exactly 1). ② Computationally cheap — just a comparison, no exp(). ③ Induces sparsity — many neurons output 0, creating efficient representations. ④ Default choice for hidden layers in most modern networks.

Cons

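Dying ReLU: If a neuron's z is negative for every training example (for instance, after a large gradient step leaves it with a large negative bias), its gradient is always 0, and it never updates: it's "dead." With a learning rate that is set too high, a large fraction of a network's neurons can end up dead (figures as high as 40% are sometimes reported).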
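One way to detect dead units is to count, per hidden unit, how often the ReLU gradient is non-zero over a batch. The sketch below forces one unit to die by giving it an extreme negative bias; the value -100 and the layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 500))            # 500 examples, 2 features
W = rng.standard_normal((6, 2))              # 6 hidden ReLU units
b = np.zeros((6, 1))
b[0] = -100.0                                # unit 0: extreme negative bias (illustrative)

Z = W @ X + b
relu_grad = (Z > 0).astype(float)            # ReLU'(z): 1 where z > 0, else 0

# A unit is "dead" on this data if its gradient is 0 for every example
dead = relu_grad.sum(axis=1) == 0
print("dead units:", np.flatnonzero(dead))
print("fraction dead:", dead.mean())
```

Tracking this fraction during training is a simple signal for when to try Leaky ReLU or a smaller learning rate.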

④ Leaky ReLU: max(αz, z) where α ≈ 0.01

Formula & Derivative

LeakyReLU(z) = z if z > 0, else α·z | LeakyReLU'(z) = 1 if z > 0, else α

Range

(−∞, ∞) — unbounded both sides

Pros

Fixes the dying ReLU problem — negative inputs still get a small gradient (α = 0.01), so neurons can always recover.

Variant

Parametric ReLU (PReLU): α is a learnable parameter, not fixed. The network decides the optimal slope for the negative region.

⑤ ELU: Exponential Linear Unit

Formula & Derivative

ELU(z) = z if z > 0, else α(e^z − 1) | ELU'(z) = 1 if z > 0, else ELU(z) + α

Range

(−α, ∞): smoothly saturates toward −α for large negative z (the bound is approached but never reached)

Pros

① Mean activations closer to zero (like tanh). ② Smooth curve for z < 0 (unlike the kink in ReLU/Leaky ReLU). ③ Better noise robustness.

Cons

Slightly slower due to exp() computation. α is typically 1.0.

Master Comparison Table

Activation	Formula	Range	Derivative	Zero-Centered?	Vanishing Gradient?	Use Case
Sigmoid	1/(1+e^−z)	(0, 1)	σ(1−σ)	❌	Yes	Output layer (binary)
Tanh	(e^z−e^−z)/(e^z+e^−z)	(−1, 1)	1−tanh²	✅	Yes	Hidden layers (small nets)
ReLU	max(0, z)	[0, ∞)	0 or 1	❌	No (z>0)	Hidden layers (default)
Leaky ReLU	max(αz, z)	(−∞, ∞)	α or 1	❌	No	Hidden layers (fix dying)
ELU	z or α(e^z−1)	(−α, ∞)	1 or ELU+α	≈ ✅	No (z>0)	Hidden layers (smooth)
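The formulas and derivatives in the table can be written down and cross-checked against finite differences. This is a sketch under stated assumptions: the `ACTIVATIONS` dictionary is a hypothetical structure, α = 0.01 for Leaky ReLU and α = 1.0 for ELU follow the defaults mentioned in the text, and the check deliberately avoids z = 0, where the ReLU family has a kink.

```python
import numpy as np

ALPHA_LEAKY = 0.01   # Leaky ReLU slope for z <= 0
ALPHA_ELU = 1.0      # ELU saturation level

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# name -> (g, g') pairs, following the comparison table
ACTIVATIONS = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z))),
    "tanh": (np.tanh, lambda z: 1 - np.tanh(z) ** 2),
    "relu": (lambda z: np.maximum(0, z), lambda z: (z > 0).astype(float)),
    "leaky_relu": (lambda z: np.where(z > 0, z, ALPHA_LEAKY * z),
                   lambda z: np.where(z > 0, 1.0, ALPHA_LEAKY)),
    "elu": (lambda z: np.where(z > 0, z, ALPHA_ELU * (np.exp(z) - 1)),
            lambda z: np.where(z > 0, 1.0, ALPHA_ELU * np.exp(z))),
}

# Central finite differences at points away from z = 0
z = np.array([-3.0, -1.0, -0.5, 0.5, 1.0, 3.0])
eps = 1e-6
max_errors = {}
for name, (g, g_prime) in ACTIVATIONS.items():
    numeric = (g(z + eps) - g(z - eps)) / (2 * eps)
    max_errors[name] = float(np.max(np.abs(numeric - g_prime(z))))
    print(f"{name:10s} max |numeric - analytic| = {max_errors[name]:.2e}")
```

Note that the ELU derivative for z ≤ 0 is written as α·e^z, which equals ELU(z) + α as stated in the table.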

Practical rule of thumb for choosing activation functions:
① Hidden layers: Start with ReLU. If too many neurons die, try Leaky ReLU or ELU.
② Output layer (binary classification): Sigmoid.
③ Output layer (regression): Linear (no activation).
④ Output layer (multi-class): Softmax (Chapter 10).
⑤ RNNs/LSTMs: tanh and sigmoid are used internally by design.

3d. Proof: Linear Activations Collapse the Network

Here's a critical question: what if we use a linear activation function g(z) = z (the identity function) for all layers? Let's prove that the network becomes equivalent to plain linear regression, making hidden layers useless.

🔢 Theorem: A Network with All Linear Activations = Linear Regression

Setup

Consider our 2-layer network with linear activation g(z) = z everywhere:

Forward Pass

z^[1] = W^[1]·x + b^[1]
a^[1] = g(z^[1]) = z^[1] = W^[1]·x + b^[1]
z^[2] = W^[2]·a^[1] + b^[2]
a^[2] = g(z^[2]) = z^[2] = W^[2]·(W^[1]·x + b^[1]) + b^[2]

Expand

a^[2] = W^[2]·W^[1]·x + W^[2]·b^[1] + b^[2]
a^[2] = W'·x + b'

where W' = W^[2]·W^[1] (a single weight matrix) and b' = W^[2]·b^[1] + b^[2] (a single bias vector).

Conclusion

The entire network reduces to ŷ = W'x + b', which is exactly linear regression. The hidden layer adds zero expressive power. No matter how many linear hidden layers you stack, the composition of linear functions is linear.

Key Insight: composition of linear functions is linear
f(x) = W₃·(W₂·(W₁·x + b₁) + b₂) + b₃ = (W₃·W₂·W₁)·x + (W₃·W₂·b₁ + W₃·b₂ + b₃)
= W'·x + b' ← still linear!
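The collapse can also be confirmed numerically. In this sketch the weights are random and the activation is the identity everywhere; `W_prime` and `b_prime` are names chosen here for W' and b'.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
X = rng.standard_normal((2, 5))               # 5 examples as columns

# Two-layer network with identity activations everywhere
A1 = W1 @ X + b1
A2 = W2 @ A1 + b2

# Collapsed single-layer equivalent: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1                              # shape (1, 2)
b_prime = W2 @ b1 + b2                         # shape (1, 1)
A2_collapsed = W_prime @ X + b_prime

print(np.allclose(A2, A2_collapsed))           # True
```

Four hidden units and 17 parameters end up doing the work of a model with 3 parameters.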

"Can I use linear activation for the output layer?" — Yes! For regression problems where you predict continuous values (e.g., predicting house prices in ₹), using g^[2](z) = z (linear) at the output is correct. The rule is: never use linear for hidden layers, but the output layer's activation depends on your task.

3e. Backpropagation for a 2-Layer Network — Full Derivation

Backpropagation is the algorithm that computes gradients of the loss function with respect to every parameter. It uses the chain rule of calculus, working backwards from the output layer to the input layer.

The Cost Function

For binary classification with m examples:

J(W^[1], b^[1], W^[2], b^[2]) = −(1/m) Σᵢ [ y⁽ⁱ⁾ log(a^[2]⁽ⁱ⁾) + (1−y⁽ⁱ⁾) log(1−a^[2]⁽ⁱ⁾) ]

Step 1: Output Layer Gradients (Layer 2)

Starting from the loss and working backwards through the sigmoid output:

dZ^[2] = A^[2] − Y     (n^[2] × m)

dW^[2] = (1/m) · dZ^[2] · (A^[1])^T     (n^[2] × n^[1])

db^[2] = (1/m) · Σ(dZ^[2], axis=1, keepdims=True)     (n^[2] × 1)

Derivation of dZ^[2]:

∂J/∂a^[2] = −y/a^[2] + (1−y)/(1−a^[2])    (derivative of cross-entropy)
∂a^[2]/∂z^[2] = a^[2](1−a^[2])    (derivative of sigmoid)
dz^[2] = ∂J/∂z^[2] = ∂J/∂a^[2] · ∂a^[2]/∂z^[2]
= [−y/a^[2] + (1−y)/(1−a^[2])] · a^[2](1−a^[2])
= −y(1−a^[2]) + (1−y)a^[2]
= a^[2] − y    ✓ (beautifully simple!)
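A finite-difference check is a useful way to confirm a derivation like this one. The sketch below treats the per-example loss as a function of the pre-activation z and compares its numerical slope with a − y; the test points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    """Cross-entropy of a sigmoid output, as a function of the pre-activation z."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z = np.array([-2.0, -0.3, 0.0, 0.7, 2.5])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

analytic = sigmoid(z) - y                                    # dz = a - y
eps = 1e-6
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)  # central difference

max_diff = float(np.max(np.abs(analytic - numeric)))
print(max_diff)
```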

Step 2: Hidden Layer Gradients (Layer 1)

Now we propagate the gradient backwards through W^[2] and the activation function g^[1]:

dZ^[1] = (W^[2])^T · dZ^[2] ⊙ g^[1]'(Z^[1])     (n^[1] × m)

dW^[1] = (1/m) · dZ^[1] · X^T     (n^[1] × n^[0])

db^[1] = (1/m) · Σ(dZ^[1], axis=1, keepdims=True)     (n^[1] × 1)

Derivation of dZ^[1]:

∂J/∂z^[1] = ∂J/∂z^[2] · ∂z^[2]/∂a^[1] · ∂a^[1]/∂z^[1]
= (W^[2])^T · dz^[2] ⊙ g^[1]'(z^[1])    (W^[2] is transposed so the shapes line up: (n^[1] × n^[2]) · (n^[2] × 1) = (n^[1] × 1))
Vectorized: dZ^[1] = (W^[2])^T · dZ^[2] ⊙ g^[1]'(Z^[1])

The ⊙ symbol denotes element-wise multiplication (Hadamard product). The term g^[1]'(Z^[1]) is the derivative of the hidden layer's activation function applied element-wise.
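The hidden-layer equations can be validated with gradient checking: perturb each entry of W^[1], recompute the cost, and compare the numerical slope with the analytic dW^[1]. This is a minimal sketch assuming a tanh hidden layer and sigmoid output; the `cost` helper and the random data are introduced only for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W1, b1, W2, b2, X, Y):
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 20))
Y = (rng.random((1, 20)) > 0.5).astype(float)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal((3, 1))
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal((1, 1))
m = X.shape[1]

# Analytic gradient from the backprop equations
A1 = np.tanh(W1 @ X + b1)
A2 = sigmoid(W2 @ A1 + b2)
dZ2 = A2 - Y
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
dW1 = dZ1 @ X.T / m

# Numerical gradient of J with respect to each entry of W1
eps = 1e-5
dW1_num = np.zeros_like(W1)
for idx in np.ndindex(*W1.shape):
    W_plus, W_minus = W1.copy(), W1.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    dW1_num[idx] = (cost(W_plus, b1, W2, b2, X, Y)
                    - cost(W_minus, b1, W2, b2, X, Y)) / (2 * eps)

rel_error = np.linalg.norm(dW1 - dW1_num) / (np.linalg.norm(dW1) + np.linalg.norm(dW1_num))
print(f"relative error: {rel_error:.2e}")
```

A tiny relative error suggests the analytic gradient is implemented correctly; a large one usually points to a missing transpose or a wrong activation derivative.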

Derivatives for Common Activations

If g^[1] is...	Then g^[1]'(z) =	In terms of a^[1]
Sigmoid	σ(z)(1−σ(z))	a^[1] ⊙ (1 − a^[1])
Tanh	1 − tanh²(z)	1 − (a^[1])²
ReLU	1 if z > 0, else 0	(A^[1] > 0).astype(float), the same mask as Z^[1] > 0

Step 3: Parameter Updates

W^[1] := W^[1] − α · dW^[1]
b^[1] := b^[1] − α · db^[1]
W^[2] := W^[2] − α · dW^[2]
b^[2] := b^[2] − α · db^[2]

Where α is the learning rate (a hyperparameter you choose, e.g., 0.01 or 1.2).

Shape-checking backprop equations: The gradient of a parameter always has the same shape as the parameter itself. If W^[1] is (4, 2), then dW^[1] must also be (4, 2). If they don't match, you have a bug. Always verify shapes!
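The shape rule can be enforced in code. The helper below is a hypothetical sketch (`check_grad_shapes` is not part of the chapter's class) that also demonstrates a common slip: forgetting keepdims=True when summing for the bias gradient.

```python
import numpy as np

def check_grad_shapes(params, grads):
    """Raise if any gradient's shape differs from its parameter (hypothetical helper)."""
    for name, value in params.items():
        grad = grads["d" + name]
        if grad.shape != value.shape:
            raise ValueError(f"d{name} has shape {grad.shape}, expected {value.shape}")
    return True

params = {"W1": np.zeros((4, 2)), "b1": np.zeros((4, 1)),
          "W2": np.zeros((1, 4)), "b2": np.zeros((1, 1))}

m = 10
dZ1, dZ2 = np.ones((4, m)), np.ones((1, m))   # placeholder values, shapes are what matter
X, A1 = np.ones((2, m)), np.ones((4, m))

grads = {"dW1": dZ1 @ X.T / m, "db1": dZ1.sum(axis=1, keepdims=True) / m,
         "dW2": dZ2 @ A1.T / m, "db2": dZ2.sum(axis=1, keepdims=True) / m}
shapes_ok = check_grad_shapes(params, grads)

# Without keepdims=True, db1 becomes (4,) instead of (4, 1)
bad = dict(grads, db1=dZ1.sum(axis=1) / m)
try:
    check_grad_shapes(params, bad)
    caught = False
except ValueError as err:
    caught = True
    print(err)
```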

3f. Random Initialization — Breaking Symmetry

In logistic regression, we could initialize weights to zero. Can we do the same for neural networks? No! Here's why:

🔄 The Symmetry Problem

What happens if all weights start at zero?

If all weights in W^[1] and W^[2] are initialized to zero, then for every hidden unit:

z₁^[1] = 0·x₁ + 0·x₂ + 0 = 0
z₂^[1] = 0·x₁ + 0·x₂ + 0 = 0
z₃^[1] = 0·x₁ + 0·x₂ + 0 = 0
z₄^[1] = 0·x₁ + 0·x₂ + 0 = 0

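All hidden units compute the same value → same activations a^[1] → same gradients (every row of dW^[1] is identical) → same updates. They stay identical forever, and the same argument applies to any initialization in which all units share the same weights, not just zero. It's like having 4 copies of the same neuron: no matter how long you train, the network can only learn what a single neuron can learn.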
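The symmetry argument can be observed directly. The sketch below trains the same tiny tanh/sigmoid network from a constant start and from a random start and then compares the rows of W^[1]. It uses the constant 0.5 rather than exact zeros as an illustrative choice, because with tanh hidden units an all-zero start makes the hidden-layer gradients exactly zero and nothing in W^[1] moves at all. The `train` function, data, and hyperparameters are assumptions made for this demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(W1, b1, W2, b2, X, Y, lr=0.5, steps=200):
    """Plain batch gradient descent on a tanh/sigmoid 2-layer network."""
    W1, b1, W2, b2 = W1.copy(), b1.copy(), W2.copy(), b2.copy()
    m = X.shape[1]
    for _ in range(steps):
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)    # computed before W2 is updated
        W2 -= lr * dZ2 @ A1.T / m
        b2 -= lr * dZ2.sum(axis=1, keepdims=True) / m
        W1 -= lr * dZ1 @ X.T / m
        b1 -= lr * dZ1.sum(axis=1, keepdims=True) / m
    return W1

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 100))
Y = (X[0:1] * X[1:2] > 0).astype(float)       # XOR-like labels, shape (1, 100)

# Symmetric start: every hidden unit gets the same constant weights
W1_same = train(np.full((4, 2), 0.5), np.zeros((4, 1)),
                np.full((1, 4), 0.5), np.zeros((1, 1)), X, Y)
# Random start: the units differ from step 0
W1_rand = train(rng.standard_normal((4, 2)) * 0.01, np.zeros((4, 1)),
                rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1)), X, Y)

rows_identical_same = np.allclose(W1_same, W1_same[0])
rows_identical_rand = np.allclose(W1_rand, W1_rand[0])
print(rows_identical_same, rows_identical_rand)
```

After 200 steps the constant-start network still has four identical rows in W^[1], while the randomly initialized one does not.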

Solution: Random Initialization

Initialize weights randomly with small values:

W^[1] = np.random.randn(n^[1], n^[0]) * 0.01

The 0.01 scaling keeps values small so sigmoid/tanh start in the linear region (steep gradients), not the saturated flat regions.

Why small values?

If W is too large, z = Wx + b will be large → tanh(z) saturates near ±1 → gradient ≈ 0 → learning is glacially slow. With small W, z stays near 0, where tanh has gradient ≈ 1.
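The effect of the weight scale on the tanh gradient is easy to measure. This sketch draws random weights at three scales (0.01, 1, and 100, chosen for illustration) and reports the average of tanh'(z) = 1 − tanh²(z) over random inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 1000))

mean_grad = {}
for scale in (0.01, 1.0, 100.0):
    W = rng.standard_normal((4, 2)) * scale
    Z = W @ X
    mean_grad[scale] = float(np.mean(1 - np.tanh(Z) ** 2))   # average tanh'(z)
    print(f"scale={scale:<6} mean tanh'(z) = {mean_grad[scale]:.4f}")
```

With the 0.01 scale the average gradient stays very close to 1, while at scale 100 almost every unit sits in a saturated region.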

Biases

Biases can be initialized to zero. Since each neuron already has different weights (breaking symmetry), different biases aren't needed for symmetry breaking. b = np.zeros((n, 1)) is fine.

Xavier and He initialization are smarter alternatives. Xavier (Glorot) init draws W from a zero-mean normal distribution with variance 1/n^[l-1] (a common simplification; the original formulation uses 2/(n^[l-1] + n^[l])), keeping the variance of activations stable across layers. He initialization (for ReLU) uses variance 2/n^[l-1] to account for ReLU zeroing out roughly half the activations. These are covered in depth in Chapter 11 (Optimization).
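As a rough preview, both schemes amount to rescaling a standard normal draw. The sketch below implements the 1/n^[l-1] and 2/n^[l-1] variants quoted above; `xavier_init` and `he_init` are hypothetical helper names, and the layer width of 512 is arbitrary.

```python
import numpy as np

def xavier_init(n_out, n_in, rng):
    """W ~ N(0, 1/n_in): the Xavier variant quoted in the text."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

def he_init(n_out, n_in, rng):
    """W ~ N(0, 2/n_in): intended for ReLU layers."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
n_in, n_out = 512, 512
W_xavier = xavier_init(n_out, n_in, rng)
W_he = he_init(n_out, n_in, rng)

print(f"Xavier variance: {W_xavier.var():.5f} (target {1 / n_in:.5f})")
print(f"He variance:     {W_he.var():.5f} (target {2 / n_in:.5f})")
```

Note that the standard deviation passed to the multiplication is the square root of the target variance.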

Section 4

From-Scratch Code — Building a Neural Network in NumPy

Let's build a complete NeuralNetwork class with one hidden layer. We'll train it on a planar XOR dataset — the classic problem that a single neuron (perceptron) cannot solve.

Step 1: Generate Planar XOR Data

Python
import numpy as np
import matplotlib.pyplot as plt

def generate_xor_data(n_samples=400, noise=0.5):
    """Generate planar XOR-like dataset.
    Class 1 (y=1): points in quadrants I and III
    Class 0 (y=0): points in quadrants II and IV
    noise: standard deviation of each Gaussian cluster
    """
    np.random.seed(42)
    n = n_samples // 4

    # Quadrant I: (+, +) → class 1
    q1 = np.random.randn(n, 2) * noise + np.array([1, 1])
    # Quadrant II: (-, +) → class 0
    q2 = np.random.randn(n, 2) * noise + np.array([-1, 1])
    # Quadrant III: (-, -) → class 1
    q3 = np.random.randn(n, 2) * noise + np.array([-1, -1])
    # Quadrant IV: (+, -) → class 0
    q4 = np.random.randn(n, 2) * noise + np.array([1, -1])

    X = np.vstack([q1, q2, q3, q4]).T  # Shape: (2, 400)
    Y = np.array([[1]*n + [0]*n + [1]*n + [0]*n])  # Shape: (1, 400)

    # Shuffle
    perm = np.random.permutation(X.shape[1])
    X, Y = X[:, perm], Y[:, perm]

    return X, Y

X, Y = generate_xor_data()
print(f"X shape: {X.shape}, Y shape: {Y.shape}")

X shape: (2, 400), Y shape: (1, 400)

Step 2: The Complete NeuralNetwork Class

Python
class ShallowNeuralNetwork:
    """
    A 2-layer neural network (1 hidden layer) for binary classification.

    Architecture: Input(n_x) → Hidden(n_h, tanh) → Output(1, sigmoid)
    """

    def __init__(self, n_x, n_h, learning_rate=1.2):
        """
        Parameters:
            n_x : int — number of input features
            n_h : int — number of hidden units
            learning_rate : float — step size for gradient descent
        """
        self.lr = learning_rate

        # Random initialization (small weights, zero biases)
        self.W1 = np.random.randn(n_h, n_x) * 0.01
        self.b1 = np.zeros((n_h, 1))
        self.W2 = np.random.randn(1, n_h) * 0.01
        self.b2 = np.zeros((1, 1))

        print(f"Network initialized: {n_x} → {n_h} → 1")
        print(f"  W1: {self.W1.shape}, b1: {self.b1.shape}")
        print(f"  W2: {self.W2.shape}, b2: {self.b2.shape}")
        total = n_h * n_x + n_h + 1 * n_h + 1
        print(f"  Total parameters: {total}")

    def sigmoid(self, z):
        """Sigmoid activation function."""
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def forward(self, X):
        """
        Forward propagation.
        X: shape (n_x, m) — m examples
        Returns: A2 (predictions), cache (for backprop)
        """
        # Layer 1: Hidden
        Z1 = self.W1 @ X + self.b1       # (n_h, m)
        A1 = np.tanh(Z1)                   # (n_h, m)

        # Layer 2: Output
        Z2 = self.W2 @ A1 + self.b2      # (1, m)
        A2 = self.sigmoid(Z2)              # (1, m)

        cache = (Z1, A1, Z2, A2)
        return A2, cache

    def compute_cost(self, A2, Y):
        """Binary cross-entropy loss."""
        m = Y.shape[1]
        # Clip to avoid log(0)
        A2 = np.clip(A2, 1e-8, 1 - 1e-8)
        cost = -(1/m) * np.sum(
            Y * np.log(A2) + (1 - Y) * np.log(1 - A2)
        )
        return float(cost)

    def backward(self, X, Y, cache):
        """
        Backward propagation.
        Returns: gradients dict {dW1, db1, dW2, db2}
        """
        m = X.shape[1]
        Z1, A1, Z2, A2 = cache

        # Output layer gradients
        dZ2 = A2 - Y                          # (1, m)
        dW2 = (1/m) * dZ2 @ A1.T               # (1, n_h)
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)  # (1, 1)

        # Hidden layer gradients
        dZ1 = (self.W2.T @ dZ2) * (1 - A1**2)  # tanh derivative: 1 - tanh²(z)
        dW1 = (1/m) * dZ1 @ X.T                # (n_h, n_x)
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)  # (n_h, 1)

        return {'dW1': dW1, 'db1': db1,
                'dW2': dW2, 'db2': db2}

    def update_parameters(self, grads):
        """Gradient descent update."""
        self.W1 -= self.lr * grads['dW1']
        self.b1 -= self.lr * grads['db1']
        self.W2 -= self.lr * grads['dW2']
        self.b2 -= self.lr * grads['db2']

    def train(self, X, Y, epochs=10000, print_every=1000):
        """Full training loop."""
        costs = []
        for i in range(epochs):
            # Forward
            A2, cache = self.forward(X)
            # Cost
            cost = self.compute_cost(A2, Y)
            # Backward
            grads = self.backward(X, Y, cache)
            # Update
            self.update_parameters(grads)

            if i % print_every == 0:
                costs.append(cost)
                print(f"Epoch {i:5d} | Cost: {cost:.6f}")

        return costs

    def predict(self, X):
        """Binary predictions (threshold = 0.5)."""
        A2, _ = self.forward(X)
        return (A2 > 0.5).astype(int)

    def accuracy(self, X, Y):
        """Compute classification accuracy."""
        preds = self.predict(X)
        return np.mean(preds == Y) * 100

Step 3: Train the Network

Python
# Create and train the network
nn = ShallowNeuralNetwork(n_x=2, n_h=8, learning_rate=1.2)
costs = nn.train(X, Y, epochs=10000, print_every=1000)
print(f"\nFinal Accuracy: {nn.accuracy(X, Y):.1f}%")

Step 4: Visualize the Decision Boundary

Python
def plot_decision_boundary(model, X, Y):
    """Visualize the non-linear decision boundary."""
    x_min, x_max = X[0].min() - 0.5, X[0].max() + 0.5
    y_min, y_max = X[1].min() - 0.5, X[1].max() + 0.5

    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, 200),
        np.linspace(y_min, y_max, 200)
    )
    grid = np.c_[xx.ravel(), yy.ravel()].T  # (2, 40000)
    Z = model.predict(grid)
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, levels=[0, 0.5, 1],
                 colors=['#fecaca', '#bbf7d0'], alpha=0.7)
    plt.contour(xx, yy, Z, levels=[0.5], colors=['#7c3aed'],
                linewidths=2)

    # Plot data points
    plt.scatter(X[0, Y[0]==0], X[1, Y[0]==0],
                c='#ef4444', edgecolors='k', s=30, label='Class 0')
    plt.scatter(X[0, Y[0]==1], X[1, Y[0]==1],
                c='#22c55e', edgecolors='k', s=30, label='Class 1')

    plt.title('Decision Boundary — Shallow Neural Network (XOR Data)',
              fontweight='bold', fontsize=14)
    plt.xlabel('Feature x₁')
    plt.ylabel('Feature x₂')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

plot_decision_boundary(nn, X, Y)

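Why does XOR need a hidden layer? XOR is not linearly separable: no single straight line can separate class 0 from class 1. The hidden layer creates an intermediate representation where the data becomes linearly separable. For classic XOR, hidden unit 1 might learn "x₁ OR x₂" and hidden unit 2 might learn "x₁ AND x₂"; the output layer then only needs "unit1 AND NOT unit2", which a single linear threshold can compute. Our planar dataset is the mirror image (class 1 is where x₁ and x₂ share the same sign, quadrants I and III), and it is just as impossible for a single straight line.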
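To make "the hidden layer creates a separable representation" concrete, here is a hand-built network for the quadrant labelling used in this chapter. It is a sketch under clear assumptions: the weights are picked by hand rather than learned, and it uses four ReLU hidden units instead of the tanh units trained above, because |x₁ + x₂| − |x₁ − x₂| is positive exactly when x₁ and x₂ share a sign.

```python
import numpy as np

# Hand-picked weights (not learned): four ReLU hidden units
W1 = np.array([[ 1.0,  1.0],    # relu(x1 + x2)
               [-1.0, -1.0],    # relu(-(x1 + x2))
               [ 1.0, -1.0],    # relu(x1 - x2)
               [-1.0,  1.0]])   # relu(-(x1 - x2))
W2 = np.array([[1.0, 1.0, -1.0, -1.0]])   # |x1 + x2| - |x1 - x2|

def predict(X):
    A1 = np.maximum(0, W1 @ X)             # hidden layer, shape (4, m)
    Z2 = W2 @ A1                           # output pre-activation, shape (1, m)
    return (Z2 > 0).astype(int)            # same decision as sigmoid(Z2) > 0.5

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 1000))
Y = (X[0:1] * X[1:2] > 0).astype(int)      # class 1 in quadrants I and III

accuracy = float(np.mean(predict(X) == Y))
print(accuracy)
```

The output layer is a single linear threshold on the four hidden activations, yet the resulting boundary in the original (x₁, x₂) plane follows both axes.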

Section 5

Industry Code — scikit-learn & TensorFlow Equivalents

In production, you'd use optimized libraries. Here's how our from-scratch network maps to industry tools:

scikit-learn: MLPClassifier

Python
from sklearn.neural_network import MLPClassifier

# X_train shape: (m, n_features) — sklearn uses row vectors!
clf = MLPClassifier(
    hidden_layer_sizes=(8,),      # 1 hidden layer, 8 neurons
    activation='tanh',             # hidden layer activation
    solver='sgd',                  # stochastic gradient descent
    learning_rate_init=0.01,       # initial learning rate
    max_iter=10000,                # epochs
    random_state=42
)

clf.fit(X.T, Y.ravel())  # sklearn expects (m, n_features)
print(f"Accuracy: {clf.score(X.T, Y.ravel()) * 100:.1f}%")

Accuracy: 99.2%

TensorFlow/Keras: Sequential API

Python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='tanh',
                          input_shape=(2,)),       # hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid')  # output layer
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1.2),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Keras expects (m, n_features)
history = model.fit(X.T, Y.T, epochs=200, verbose=0)
loss, acc = model.evaluate(X.T, Y.T, verbose=0)
print(f"Keras Accuracy: {acc * 100:.1f}%")

Keras Accuracy: 99.5%

Industry data convention: Note the shape difference! Our from-scratch code uses column vectors (each example is a column, shape: features × m). scikit-learn and Keras use row vectors (each example is a row, shape: m × features). This is the #1 source of bugs when moving between custom and library code. Always .T (transpose) when switching conventions.

Comparison: From Scratch vs Industry

Aspect	Our From-Scratch Code	scikit-learn / Keras
Lines of code	~80 lines	~10 lines
Optimizer	Vanilla gradient descent	SGD, Adam, RMSprop, etc.
Regularization	None	L1, L2, Dropout built-in
GPU support	No	Yes (Keras/TF)
Learning value	★★★★★	★★ (black box)
Production use	★ (educational only)	★★★★★

Section 6

Visual Diagrams — Computation Graph & Architecture

Computation Graph for Forward & Backward Pass

FORWARD PASS (left → right):
═══════════════════════════════════════════════════════════════
X (2,m)
  │
  ▼
Z¹ = W¹·X + b¹ (4,m) ──▶ A¹ = tanh(Z¹) (4,m)
                              │
                              ▼
                         Z² = W²·A¹ + b² (1,m) ──▶ A² = σ(Z²) = Ŷ (1,m)
                                                        │
                                                        ▼
                                                   ℒ(Ŷ, Y) = Cost J

BACKWARD PASS (right → left):
═══════════════════════════════════════════════════════════════
∂J/∂A² = −Y/A² + (1−Y)/(1−A²)
  │
  ▼
dZ² = A² − Y ──▶ dW², db²
  │
  ▼
dZ¹ = W²ᵀ·dZ² ⊙ (1 − (A¹)²) ──▶ dW¹, db¹

Shape Flow Through the Network

Layer      │ Linear (Z)  │ Activation (A)  │ Weights     │ Bias
───────────┼─────────────┼─────────────────┼─────────────┼────────────
Input (0)  │ -           │ X: (2, m)       │ -           │ -
Hidden (1) │ Z¹: (4, m)  │ A¹: (4, m)      │ W¹: (4, 2)  │ b¹: (4, 1)
Output (2) │ Z²: (1, m)  │ A²: (1, m) = Ŷ  │ W²: (1, 4)  │ b²: (1, 1)
───────────┴─────────────┴─────────────────┴─────────────┴────────────
Total parameters: 4×2 + 4 + 1×4 + 1 = 8 + 4 + 4 + 1 = 17

Activation Function Shapes

Property	Sigmoid	Tanh	ReLU
Shape	S-curve rising from 0 to 1, crossing 0.5 at z = 0	S-curve rising from −1 to 1, crossing 0 at z = 0	Flat at 0 for z < 0, then a straight line of slope 1
Range	(0, 1)	(−1, 1)	[0, ∞)
Zero-centered	No	Yes	No
Gradient at 0	0.25	1.0	undefined

ReLU was popularized for deep learning by Nair & Hinton in 2010 and is arguably one of the most impactful single ideas in the field. Before ReLU, training networks with more than 2-3 hidden layers was very difficult, largely because of vanishing gradients. ReLU's constant gradient of 1 (for positive inputs) greatly eased this problem, helping enable the "deep" in deep learning.

Section 7

Worked Example — Hand-Computing One Forward & Backward Pass

Let's trace through one complete forward + backward pass with actual numbers. This is the best way to build intuition.

📝 Setup: Tiny Network (2 inputs → 2 hidden → 1 output)

Input (1 example)

x = [1.0, 0.5]^T (shape: 2×1), y = 1

Initialized Parameters

W^[1] = [[0.1, 0.3], [0.2, 0.4]] (2×2)
b^[1] = [[0.0], [0.0]] (2×1)
W^[2] = [[0.5, 0.6]] (1×2)
b^[2] = [[0.0]] (1×1)

Forward Pass

Hand Calculation
# Layer 1: z[1] = W[1]·x + b[1]
z₁⁽¹⁾ = 0.1×1.0 + 0.3×0.5 + 0.0 = 0.25
z₂⁽¹⁾ = 0.2×1.0 + 0.4×0.5 + 0.0 = 0.40

# Layer 1: a[1] = tanh(z[1])
a₁⁽¹⁾ = tanh(0.25) = 0.2449
a₂⁽¹⁾ = tanh(0.40) = 0.3799

# Layer 2: z[2] = W[2]·a[1] + b[2]
z⁽²⁾ = 0.5×0.2449 + 0.6×0.3799 + 0.0 = 0.3504

# Layer 2: a[2] = σ(z[2])
a⁽²⁾ = σ(0.3504) = 1/(1+e⁻⁰·³⁵⁰⁴) = 0.5867

# ŷ = 0.5867 (predicted probability)
# True y = 1, so the network needs to push ŷ higher

Cost

Hand Calculation
J = -[y·log(ŷ) + (1-y)·log(1-ŷ)]
  = -[1·log(0.5867) + 0·log(0.4133)]
  = -log(0.5867)
  = 0.5332

Backward Pass

Hand Calculation
# Layer 2 gradients
dz⁽²⁾ = a⁽²⁾ - y = 0.5867 - 1 = -0.4133

dW₁⁽²⁾ = dz⁽²⁾ × a₁⁽¹⁾ = -0.4133 × 0.2449 = -0.1012
dW₂⁽²⁾ = dz⁽²⁾ × a₂⁽¹⁾ = -0.4133 × 0.3799 = -0.1570
db⁽²⁾  = dz⁽²⁾ = -0.4133

# Layer 1 gradients (tanh derivative: 1 - tanh²)
dz₁⁽¹⁾ = W₁⁽²⁾·dz⁽²⁾ × (1 - a₁⁽¹⁾²) = 0.5×(-0.4133) × (1 - 0.2449²)
       = -0.2067 × 0.9400 = -0.1943

dz₂⁽¹⁾ = W₂⁽²⁾·dz⁽²⁾ × (1 - a₂⁽¹⁾²) = 0.6×(-0.4133) × (1 - 0.3799²)
       = -0.2480 × 0.8557 = -0.2122

dW⁽¹⁾ = [[dz₁⁽¹⁾·x₁, dz₁⁽¹⁾·x₂],    = [[-0.1943, -0.0972],
          [dz₂⁽¹⁾·x₁, dz₂⁽¹⁾·x₂]]       [-0.2122, -0.1061]]

db⁽¹⁾ = [[dz₁⁽¹⁾], [dz₂⁽¹⁾]] = [[-0.1943], [-0.2122]]

Parameter Update (α = 1.0)

Hand Calculation
# W[1] := W[1] - α·dW[1]
W⁽¹⁾_new = [[0.1 - 1.0×(-0.1943), 0.3 - 1.0×(-0.0972)],
             [0.2 - 1.0×(-0.2122), 0.4 - 1.0×(-0.1061)]]
          = [[0.2943, 0.3972],
             [0.4122, 0.5061]]

# All weights increased, pushing ŷ toward 1 ✓

# W[2] := W[2] - α·dW[2]
W⁽²⁾_new = [[0.5 - 1.0×(-0.1012), 0.6 - 1.0×(-0.1570)]]
          = [[0.6012, 0.7570]]

Sanity check: After one gradient step, all weights increased (since dW was negative — the network needed to increase ŷ). If we recompute the forward pass with these new weights, ŷ will be closer to 1. This is exactly what gradient descent should do!
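The whole worked example fits in a few lines of NumPy, which makes it easy to check each hand-computed number and to confirm that ŷ really moves toward 1. This is a sketch that mirrors the setup above (same x, y, weights, and α = 1.0); the `forward` helper is introduced only for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[1.0], [0.5]])
y = 1.0
W1 = np.array([[0.1, 0.3], [0.2, 0.4]])
b1 = np.zeros((2, 1))
W2 = np.array([[0.5, 0.6]])
b2 = np.zeros((1, 1))

def forward(W1, b1, W2, b2):
    a1 = np.tanh(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return a1, a2

a1, a2 = forward(W1, b1, W2, b2)
cost = float(-np.log(a2[0, 0]))   # y = 1, so only the first term survives

# Backward pass (single example, so m = 1)
dz2 = a2 - y
dW2, db2 = dz2 @ a1.T, dz2
dz1 = (W2.T @ dz2) * (1 - a1 ** 2)
dW1, db1 = dz1 @ x.T, dz1

# One gradient descent step, then a second forward pass
alpha = 1.0
_, a2_new = forward(W1 - alpha * dW1, b1 - alpha * db1,
                    W2 - alpha * dW2, b2 - alpha * db2)

print(f"y_hat = {a2[0, 0]:.4f}, cost = {cost:.4f}")
print("dW2 =", np.round(dW2, 4))
print("dW1 =\n", np.round(dW1, 4))
print(f"y_hat after one step = {a2_new[0, 0]:.4f}")
```

Expect small differences in the last printed digit, because the hand calculation rounds every intermediate value to four decimals while NumPy carries full precision.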

Section 8

Case Study — Mu Sigma: Retail Analytics with Shallow Networks

📊 Mu Sigma — India's Largest Pure-Play Analytics Firm

Company Background

Mu Sigma, founded in 2004 in Bengaluru by Dhiraj Rajaram, is one of India's largest analytics and decision science companies. With 3,500+ "decision scientists" and offices in Chicago and Bengaluru, Mu Sigma serves Fortune 500 clients across retail, healthcare, insurance, and CPG. The company is valued at over ₹8,000 crore ($1 billion).

The Problem

A major Indian retail chain (similar to Reliance Retail / DMart) needed to predict customer churn — which customers would stop shopping at their stores in the next quarter. They had 12 input features per customer: purchase frequency, average basket size (₹), recency of last visit, number of categories shopped, coupon redemption rate, store distance, customer age, tenure, complaint history, payment mode preference, festive season spending, and loyalty points balance.

Why Not Logistic Regression?

Initial logistic regression achieved only 68% accuracy. The Mu Sigma team discovered the churn pattern was non-linear: customers with moderate purchase frequency AND low recency were churning (they used to shop often but stopped), while customers with low frequency but high recency were fine (they shop rarely but recently). This interaction pattern needed a curved decision boundary.

The Solution: Shallow Neural Network

The team implemented a 2-layer neural network with:

Input layer: 12 features (normalized to [0, 1])
Hidden layer: 24 neurons with ReLU activation
Output layer: 1 neuron with sigmoid (churn probability)
Training: 200,000 customer records, batch gradient descent, learning rate = 0.01

Results

Metric	Logistic Regression	Shallow Neural Network
Accuracy	68%	84%
Precision (churn class)	0.55	0.79
Recall (churn class)	0.48	0.76
AUC-ROC	0.72	0.89
Retention offers savings	₹2.1 crore/quarter	₹5.8 crore/quarter

Key Insight

The hidden layer learned feature interactions that the linear model couldn't capture. Neuron #7, for instance, activated strongly when a customer had high historical frequency but low recent activity — essentially learning the concept of "lapsing customer" on its own, without being explicitly programmed with this feature.
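To make the "lapsing customer" idea concrete, here is a hand-built sketch of what such a hidden unit could look like. The weights, bias, and the two input features (historical frequency and recent activity, both scaled to [0, 1]) are hypothetical values chosen for illustration; a trained network would arrive at its own weights, and nothing guarantees they are this clean.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical weights: reward historical frequency, penalize recent activity
w = np.array([2.0, -2.0])
b = -0.5

# Each customer: [historical_frequency, recent_activity], both in [0, 1]
customers = {
    "lapsing":    np.array([0.9, 0.1]),  # shopped often before, rarely now
    "loyal":      np.array([0.9, 0.9]),  # shopped often before and still does
    "occasional": np.array([0.2, 0.6]),  # shops rarely but recently
}

activations = {name: float(relu(w @ x + b)) for name, x in customers.items()}
for name, a in activations.items():
    print(f"{name:>10}: activation = {a:.2f}")
```

Only the lapsing profile produces a positive activation; the other two are clipped to zero by ReLU, so the output layer can use this unit as a ready-made "lapsing" feature.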

Business Impact

By accurately identifying at-risk customers, the retail chain sent targeted retention offers (₹200 discount coupons, exclusive sale access) to the right customers, saving ₹5.8 crore per quarter in prevented churn, roughly 2.8× the ₹2.1 crore saved with the logistic regression approach.

Mu Sigma's role in Indian analytics: Mu Sigma pioneered the "analytics as a service" model in India, training thousands of fresh graduates from IITs, NITs, and BITS in decision science. Many of India's current data science leaders — at Flipkart, Ola, Swiggy, and Razorpay — are Mu Sigma alumni. The company proved that complex analytics can be delivered from India at a fraction of US costs, helping establish Bengaluru as a global analytics hub.

Section 9

Common Mistakes & Misconceptions

Mistake #1: "More hidden neurons always means better accuracy."
Not true! Too many neurons lead to overfitting — the network memorizes training data (including noise) instead of learning general patterns. For 400 training examples, 8 hidden neurons work well. Using 500 hidden neurons would memorize the data perfectly but fail on new data. The right number depends on your dataset size and complexity.
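A quick way to see why 500 hidden neurons is excessive for 400 examples is to count parameters. The sketch below assumes 2 input features (planar data, as in this chapter's from-scratch example) and a single output; `count_parameters` is a helper name introduced here.

```python
def count_parameters(n_x, n_h, n_y=1):
    """Total learnable parameters in a 2-layer network: W1, b1, W2, b2."""
    return n_h * n_x + n_h + n_y * n_h + n_y

m = 400  # training examples
for n_h in (8, 500):
    p = count_parameters(n_x=2, n_h=n_h)
    print(f"n_h={n_h:>3}: {p:>4} parameters, {m / p:.1f} examples per parameter")
# n_h=  8:   33 parameters, 12.1 examples per parameter
# n_h=500: 2001 parameters, 0.2 examples per parameter
```

With more parameters than examples, the network has enough freedom to fit noise. The ratio is only a rough warning sign, not a hard rule.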

Mistake #2: "I should use sigmoid activation for hidden layers."
Sigmoid should almost never be used for hidden layers in modern networks. Its non-zero-centered output and vanishing gradient make training slow. Use tanh or ReLU (or its variants) for hidden layers and sigmoid only for the output layer in binary classification. This single change often speeds up training severalfold, although the exact factor depends on the problem.

Mistake #3: "Initializing all weights to zero saves computation."
Zero initialization creates perfect symmetry — all hidden neurons compute identical values, get identical gradients, and stay identical forever. You effectively have a 1-neuron network regardless of how many neurons you defined. Always use random initialization (e.g., np.random.randn(...) * 0.01).
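The symmetry argument can be checked numerically. The sketch below is an illustration on synthetic XOR-like data, not the chapter's NeuralNetwork class; `train_W1` is a helper written for this demo. It compares three starts: all zeros, a constant value, and small random weights.

```python
import numpy as np

def train_W1(W1, W2, X, Y, steps=50, lr=0.5):
    """Batch gradient descent (tanh hidden, sigmoid output); returns the trained W1."""
    W1, W2 = W1.copy(), W2.copy()
    b1, b2 = np.zeros((W1.shape[0], 1)), np.zeros((1, 1))
    m = X.shape[1]
    for _ in range(steps):
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
        W2 -= lr * (dZ2 @ A1.T) / m
        b2 -= lr * dZ2.sum(axis=1, keepdims=True) / m
        W1 -= lr * (dZ1 @ X.T) / m
        b1 -= lr * dZ1.sum(axis=1, keepdims=True) / m
    return W1

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 200))
Y = ((X[0] * X[1]) > 0).astype(float).reshape(1, -1)  # XOR-like labels

W1_zero = train_W1(np.zeros((4, 2)), np.zeros((1, 4)), X, Y)
W1_const = train_W1(np.full((4, 2), 0.5), np.full((1, 4), 0.5), X, Y)
W1_rand = train_W1(rng.standard_normal((4, 2)) * 0.01,
                   rng.standard_normal((1, 4)) * 0.01, X, Y)

weights_stuck_at_zero = bool(np.all(W1_zero == 0))        # zero start: W1 never moves
rows_identical_const = np.allclose(W1_const, W1_const[0])  # constant start: 4 copies of one neuron
rows_identical_rand = np.allclose(W1_rand, W1_rand[0])     # random start: neurons differ
print(weights_stuck_at_zero, rows_identical_const, rows_identical_rand)
```

With tanh hidden units the all-zero start is even worse than "identical neurons": A^[1] is zero and W^[2] is zero, so both weight gradients vanish and only b^[2] ever changes.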

Mistake #4: "dZ^[1] = A^[1] − Y (copying the output layer formula)."
The clean formula dZ^[2] = A^[2] − Y only works for the output layer with sigmoid + cross-entropy loss. For hidden layers, you must backpropagate through the weight matrix AND multiply by the activation derivative: dZ^[1] = W^[2]T·dZ^[2] ⊙ g'(Z^[1]). The ⊙ (element-wise multiply) is critical!
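One way to convince yourself of the hidden-layer formula is a numerical gradient check. The sketch below uses a tiny random network (tanh hidden layer, sigmoid output, binary cross-entropy); sizes, seed, and variable names are arbitrary choices for the demo. It compares dW^[1] from the correct dZ^[1], and from the mistaken "A^[1] − Y" shortcut, against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5
X = rng.standard_normal((n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W1 = rng.standard_normal((n_h, n_x)) * 0.5
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.5
b2 = np.zeros((1, 1))

def cost(W1_candidate):
    """Binary cross-entropy as a function of W1 only (other parameters fixed)."""
    A1 = np.tanh(W1_candidate @ X + b1)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

# Analytic gradients
A1 = np.tanh(W1 @ X + b1)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
dZ2 = A2 - Y
dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)   # correct hidden-layer formula
dW1 = dZ1 @ X.T / m
dW1_wrong = (A1 - Y) @ X.T / m          # Mistake #4: output formula copied to layer 1

# Central finite differences, one entry of W1 at a time
eps = 1e-6
dW1_num = np.zeros_like(W1)
for i in range(n_h):
    for j in range(n_x):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW1_num[i, j] = (cost(Wp) - cost(Wm)) / (2 * eps)

def rel_err(a, b):
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b))

err_correct = rel_err(dW1, dW1_num)
err_wrong = rel_err(dW1_wrong, dW1_num)
print(f"correct formula: {err_correct:.2e}, copied formula: {err_wrong:.2e}")
```

The correct formula should agree with the numerical gradient to many decimal places, while the copied formula does not.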

Mistake #5: "I forgot keepdims=True in np.sum() for db."
When computing db = (1/m) × np.sum(dZ, axis=1), NumPy returns a 1D array of shape (n,) instead of (n, 1). This causes silent broadcasting bugs in the parameter update step. Always use keepdims=True to maintain the (n, 1) column vector shape.
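The shape problem is easy to reproduce in a few lines. The values below are arbitrary; only the shapes matter.

```python
import numpy as np

n, m = 3, 4
dZ = np.arange(12, dtype=float).reshape(n, m)
b = np.zeros((n, 1))

db_bad = np.sum(dZ, axis=1) / m                   # shape (3,)
db_good = np.sum(dZ, axis=1, keepdims=True) / m   # shape (3, 1)

b_bad = b - 0.1 * db_bad     # (3, 1) - (3,) silently broadcasts to (3, 3)
b_good = b - 0.1 * db_good   # stays (3, 1)
print(db_bad.shape, db_good.shape, b_bad.shape, b_good.shape)
# (3,) (3, 1) (3, 3) (3, 1)

# The in-place form cannot grow b to (3, 3), so NumPy raises instead of staying silent
try:
    b_inplace = b.copy()
    b_inplace -= 0.1 * db_bad
    inplace_failed = False
except ValueError:
    inplace_failed = True
print(inplace_failed)  # True
```

The out-of-place update is the dangerous one: it runs without error and quietly turns the bias into a matrix.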

Mistake #6: "The learning rate doesn't matter much."
Too high (e.g., α = 100): cost oscillates wildly, never converges. Too low (e.g., α = 0.0001): training takes forever. For shallow networks with tanh, a good starting point is α = 0.5 to 2.0. Always plot the cost curve — it should decrease smoothly.
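The three regimes can be seen on a toy problem. The sketch below runs gradient descent on the one-dimensional cost J(w) = w², chosen only because its behaviour is easy to reason about. The learning rates are specific to this toy cost and do not transfer to a real network.

```python
def run_gd(lr, w0=1.0, steps=20):
    """Gradient descent on J(w) = w**2; returns the cost after every step."""
    w = w0
    history = [w ** 2]
    for _ in range(steps):
        w -= lr * 2 * w        # dJ/dw = 2w, so w is multiplied by (1 - 2*lr)
        history.append(w ** 2)
    return history

too_high = run_gd(1.5)       # multiplier -2: w flips sign and doubles every step
too_low = run_gd(0.0001)     # multiplier 0.9998: barely moves
reasonable = run_gd(0.3)     # multiplier 0.4: converges quickly

for name, h in [("too high", too_high), ("too low", too_low), ("reasonable", reasonable)]:
    print(f"{name:>10}: cost {h[0]:.3g} -> {h[-1]:.3g}")
```

Plotting these three `history` lists gives the diverging, flat, and smoothly decreasing cost curves described above.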

Section 10

Comparison Table — Logistic Regression vs. Shallow Neural Network

Aspect	Logistic Regression (Ch 5)	Shallow Neural Network (Ch 6)
Architecture	Single neuron (no hidden layer)	1+ hidden layer with multiple neurons
Decision Boundary	Linear (straight line/hyperplane)	Non-linear (curves, regions)
Parameters	W (1×n), b (scalar)	W^[1], b^[1], W^[2], b^[2]
Expressiveness	Can only learn linearly separable patterns	Can approximate any continuous function on a bounded domain, given enough hidden units (universal approximator)
XOR Problem	❌ Cannot solve	✅ Easily solved
Forward Pass	ŷ = σ(Wx + b) — 1 step	2 steps: hidden → output
Backprop	dW = (1/m)(A−Y)X^T	Chain rule through 2 layers
Initialization	Zeros OK	Must be random (symmetry breaking)
Training Speed	Fast (convex optimization)	Slower (non-convex, more parameters)
Overfitting Risk	Low	Higher (more capacity)
Interpretability	High (feature weights directly interpretable)	Lower (hidden representations are abstract)
When to Use	Linearly separable data, baseline model	Non-linear data, feature interactions matter

Practical workflow: Always start with logistic regression as a baseline. If it achieves 95%+ accuracy, you probably don't need a neural network. If accuracy is low and the problem involves feature interactions, try a shallow NN (4–32 hidden neurons). Only go deeper if the shallow network plateaus.

Section 11

Exercises

Section A — Multiple Choice Questions (10)

Q1.

In a 2-layer neural network with 5 input features, 3 hidden units, and 1 output unit, what is the shape of W^[1]?

A. (5, 3)
B. (3, 5)
C. (1, 3)
D. (3, 1)

✅ B. (3, 5) — W^[l] has shape (n^[l], n^[l-1]) = (3, 5). Rows = current layer units, columns = previous layer units.

Remember | Beginner

Q2.

Why is tanh preferred over sigmoid for hidden layers?

A. tanh has a larger range [−2, 2]
B. tanh outputs are zero-centered, leading to faster convergence
C. tanh never has vanishing gradients
D. tanh is computationally cheaper than sigmoid

✅ B. — tanh outputs are centered around 0 (range: −1 to 1), so the mean activation is close to 0. This makes gradient updates for the next layer more balanced, avoiding the "zig-zag" problem of all-positive sigmoid outputs.

Understand | Beginner

Q3.

What happens if all weights in a neural network are initialized to zero?

A. The network learns faster due to simplicity
B. All hidden neurons learn the same features — symmetry is never broken
C. Only the output layer is affected
D. The biases compensate for the zero weights

✅ B. — With identical weights, all neurons compute the same value, get the same gradient, and update identically. The network effectively has 1 hidden neuron regardless of width. Random initialization breaks this symmetry.

Understand | Intermediate

Q4.

What is the derivative of the tanh activation function?

A. tanh(z) × (1 − tanh(z))
B. 1 − tanh²(z)
C. tanh(z) × (1 + tanh(z))
D. z × (1 − z²)

✅ B. 1 − tanh²(z) — This is derived from d/dz[tanh(z)] = sech²(z) = 1 − tanh²(z). At z=0, the derivative is 1 (maximum). Compare with sigmoid's maximum derivative of 0.25.

Remember | Beginner
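The identity in the answer to Q4 can be checked numerically. This short sketch compares a central finite-difference estimate of the tanh derivative with 1 − tanh²(z), and also confirms the two peak values quoted above (1 for tanh, 0.25 for sigmoid). The grid and step size are arbitrary choices.

```python
import numpy as np

z = np.linspace(-3, 3, 61)   # grid that includes z = 0
h = 1e-5

numeric = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)   # finite-difference slope
analytic = 1 - np.tanh(z) ** 2
max_abs_err = float(np.max(np.abs(numeric - analytic)))

sig = 1.0 / (1.0 + np.exp(-z))
sig_grad = sig * (1 - sig)

print(f"max |numeric - analytic| = {max_abs_err:.1e}")
print(f"peak tanh' = {analytic.max():.3f}, peak sigmoid' = {sig_grad.max():.3f}")
```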

Q5.

In backpropagation for a 2-layer network, dZ^[1] = ?

A. A^[1] − Y
B. W^[2]T · dZ^[2] ⊙ g^[1]'(Z^[1])
C. W^[1]T · dZ^[2] + b^[1]
D. dZ^[2] · W^[2] · g^[1](Z^[1])

✅ B. — The gradient flows backward: multiply by the transposed weight matrix (W^[2]T) to "distribute" the error to each hidden unit, then element-wise multiply (⊙) by the activation derivative to account for the non-linearity.

Apply | Intermediate

Q6.

If a network uses linear activations (g(z) = z) in all layers, what does it reduce to?

A. A polynomial regression model
B. A single-layer linear model (linear regression)
C. A support vector machine
D. A decision tree

✅ B. — Composition of linear functions is linear: W^[2]·(W^[1]·x + b^[1]) + b^[2] = (W^[2]W^[1])x + (W^[2]b^[1] + b^[2]) = W'x + b'. No matter how many layers, the result is equivalent to a single linear transformation.

Analyze | Intermediate
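The algebra in the answer to Q6 can be confirmed with random matrices. In this sketch the layer sizes and seed are arbitrary; the point is only that the two-layer linear network and the collapsed single layer give the same outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
X = rng.standard_normal((3, 10))

# Two layers with linear activation g(z) = z
two_layer = W2 @ (W1 @ X + b1) + b2

# Equivalent single layer: W' = W2 W1, b' = W2 b1 + b2
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ X + b_eq

outputs_match = np.allclose(two_layer, one_layer)
print(W_eq.shape, b_eq.shape, outputs_match)  # (1, 3) (1, 1) True
```

Replacing the identity activation with `np.tanh` in the first line of the two-layer computation breaks the equivalence, which is exactly why the non-linearity matters.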

Q7.

What is the "dying ReLU" problem?

A. ReLU neurons output NaN for large inputs
B. Neurons with consistently negative pre-activation have zero gradient and never update
C. ReLU causes exploding gradients in deep networks
D. ReLU neurons become saturated at a maximum value

✅ B. — If a neuron's z is always negative (due to a large negative bias or unfortunate weight update), ReLU outputs 0 with gradient 0. The neuron never receives gradient signal and is permanently "dead." Leaky ReLU fixes this by allowing a small gradient (α ≈ 0.01) for negative inputs.

Understand | Intermediate
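The difference described in the answer to Q7 shows up directly in the derivatives. In this sketch the pre-activations are made-up negative values standing in for a "dead" neuron, and the derivative at exactly z = 0 is taken to be the negative-side value, which is a common convention rather than the only one.

```python
import numpy as np

def relu_grad(z):
    """Derivative of max(0, z): 1 for z > 0, else 0."""
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of max(alpha*z, z): 1 for z > 0, else alpha."""
    return np.where(z > 0, 1.0, alpha)

z_dead = np.array([-3.0, -0.5, -0.1])   # pre-activations that are always negative

print(relu_grad(z_dead))        # [0. 0. 0.]
print(leaky_relu_grad(z_dead))  # [0.01 0.01 0.01]
```

With ReLU, every upstream gradient is multiplied by zero, so the neuron's weights never change. Leaky ReLU passes 1% of the gradient through, which leaves a path for the neuron to recover.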

Q8.

For a network with n^[0]=10, n^[1]=20, n^[2]=1, how many total learnable parameters are there?

✅ B. 241 — W^[1]: 20×10=200, b^[1]: 20, W^[2]: 1×20=20, b^[2]: 1. Total = 200 + 20 + 20 + 1 = 241.

Apply | Beginner

Q9.

Why do we multiply weights by 0.01 during random initialization?

A. To normalize the weights to unit variance
B. To keep pre-activations small so activations start in the steep gradient region
C. To prevent the network from overfitting
D. To ensure the biases dominate initially

✅ B. Small weights → small z values → sigmoid/tanh stay in their near-linear region, where the gradient is close to its maximum (1 for tanh, 0.25 for sigmoid) → fast learning. Large weights → z in the saturated region → gradient ≈ 0 → vanishing gradients → slow/no learning.

Understand | Intermediate
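The effect of the weight scale in Q9 can be measured by averaging the tanh derivative over a batch. This sketch uses standard-normal inputs, a 4-unit hidden layer, and the two scales 0.01 and 10; all of these are illustrative choices, and `mean_tanh_grad` is a helper defined here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 1000))   # synthetic inputs, 2 features

def mean_tanh_grad(scale):
    """Average of g'(Z1) = 1 - tanh(Z1)^2 for randomly initialized W1 * scale."""
    W1 = rng.standard_normal((4, 2)) * scale
    Z1 = W1 @ X
    return float(np.mean(1 - np.tanh(Z1) ** 2))

small = mean_tanh_grad(0.01)   # pre-activations near 0
large = mean_tanh_grad(10.0)   # pre-activations mostly deep in saturation
print(f"scale 0.01: mean g' = {small:.4f}   scale 10: mean g' = {large:.4f}")
```

With the small scale almost all of the gradient reaches layer 1; with the large scale most of it is multiplied by a number near zero.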

Q10.

In vectorized forward propagation, b^[1] has shape (n^[1], 1) but Z^[1] has shape (n^[1], m). How does the addition Z^[1] = W^[1]·X + b^[1] work?

A. b^[1] is tiled m times manually
B. NumPy broadcasting copies b^[1] across all m columns automatically
C. A for-loop adds b^[1] to each column
D. b^[1] is reshaped to (1, m) first

✅ B. NumPy broadcasting automatically stretches the (n^[1], 1) bias vector across all m columns during addition. This is memory-efficient (no tiled copy of b^[1] is materialized) and fast (the loop runs in compiled code rather than in Python). This is why keepdims=True is important: it preserves the (n, 1) shape needed for broadcasting.

Understand | Beginner

Section B — Short Answer Questions (5)

B1.

Write the four equations of forward propagation for a 2-layer neural network (vectorized form). State the shape of each intermediate result assuming n^[0]=3, n^[1]=5, n^[2]=1, and m=100 training examples.

Remember | Beginner | 4 marks

B2.

Explain the symmetry problem in neural networks. What specific condition causes it, and what is the standard solution? Why can biases still be initialized to zero without causing this problem?

Understand | Intermediate | 4 marks

B3.

Compare sigmoid and tanh activation functions along four axes: output range, zero-centering, maximum derivative value, and recommended use case. Include the mathematical relationship between them.

Analyze | Intermediate | 5 marks

B4.

A Paytm fraud detection model uses a shallow neural network with 15 input features and 10 hidden neurons. Calculate: (a) total number of parameters, (b) shape of each weight matrix and bias vector, (c) shape of Z^[1] and A^[2] when processing a batch of 500 transactions.

Apply | Beginner | 5 marks

B5.

Explain the "dying ReLU" problem. How does Leaky ReLU address it? Write the formulas for both ReLU and Leaky ReLU, including their derivatives.

Understand | Intermediate | 4 marks

Section C — Long Answer Questions (3)

C1.

Full Backpropagation Derivation. For a 2-layer neural network with tanh hidden activation and sigmoid output activation, derive the complete set of backpropagation equations. Start from the binary cross-entropy loss J, and derive dZ^[2], dW^[2], db^[2], dZ^[1], dW^[1], db^[1] step by step using the chain rule. Show each intermediate step and verify the shapes.

Analyze | Advanced | 12 marks

C2.

Linear Activation Analysis. (a) Prove that a neural network with any number of hidden layers using linear activations g(z) = z is equivalent to a single linear transformation. Show the proof for a 3-layer network. (b) Does this mean linear activations are never useful? Discuss one scenario where a linear output activation is appropriate. (c) What is the minimum requirement for a neural network to learn XOR? Prove with a construction.

Evaluate | Advanced | 12 marks

C3.

Activation Function Selection. You are building a meal recommendation system for Swiggy with these requirements: (a) Hidden layer for learning food preference patterns from 50 user features. (b) Output layer 1: probability of the user ordering (binary). (c) Output layer 2: predicted delivery time in minutes (continuous, 10–90 min). (d) Output layer 3: rating prediction (1–5 stars). For each layer/output, recommend an activation function with justification. Also discuss what would go wrong if you used sigmoid for all hidden layers in a network with 5 hidden layers.

Evaluate | Intermediate | 10 marks

Section D — Programming Exercises (2)

D1.

Activation Function Visualizer. Write a Python program that:

Plots all 5 activation functions (sigmoid, tanh, ReLU, Leaky ReLU, ELU) on the same figure with z ∈ [−5, 5]
Plots their derivatives on a second subplot
Annotates each curve with its name and output range
Uses a clean, publication-quality style with legend

Apply | Intermediate | 8 marks

D2.

Hidden Unit Ablation Study. Using the NeuralNetwork class from this chapter:

Train the network on the XOR dataset with n_h = 1, 2, 4, 8, 16, 32, 64 hidden neurons
For each, record: final accuracy, final cost, and number of epochs to reach 95% accuracy (or "N/A" if it never does)
Plot: (a) accuracy vs. n_h, (b) cost curves for all values on the same plot
Answer: What is the minimum n_h that can solve XOR? Does increasing n_h always help?

Analyze | Advanced | 10 marks

Section E — Mini-Project

E1.

Flipkart Product Classifier. Build a shallow neural network from scratch to classify Flipkart products into two categories (e.g., electronics vs. fashion) based on product title features:

Data: Create a synthetic dataset of 1000 products with 5 features: title_length, has_brand_name (0/1), avg_word_length, number_count (digits in title), special_char_count
Architecture: 5 → n_h → 1 (experiment with n_h = 4, 8, 16)
Implement: The full NeuralNetwork class with both tanh and ReLU options for hidden layer
Evaluate: Train/test split (80/20), report accuracy, plot cost curves and decision boundary (pick any 2 features for visualization)
Compare: Against logistic regression baseline — is the neural network better? By how much?
Report: Write a brief report (1 page) with findings, including which activation function and n_h worked best and why

Create | Advanced | 15 marks

Section 12

Chapter Summary

🧠 Key Takeaways from Chapter 6

Architecture: A shallow neural network has an input layer (layer 0), one hidden layer (layer 1), and an output layer (layer 2). We count layers by the number of weight matrices: 2.
Notation: W^[l] has shape (n^[l], n^[l-1]), b^[l] has shape (n^[l], 1). The gradient dW^[l] always has the same shape as W^[l].
Forward propagation computes Z^[l] = W^[l]·A^[l-1] + b^[l], then A^[l] = g^[l](Z^[l]). Vectorize over examples (m columns), loop only over layers.
Activation functions matter: tanh > sigmoid for hidden layers (zero-centered, stronger gradients). ReLU is the modern default (does not saturate for positive inputs, computationally cheap), though its units can "die" when z stays negative. Never use linear activations for hidden layers.
Linear activations collapse: composition of linear functions = single linear function. Hidden layers with g(z) = z add no expressive power.
Backpropagation uses the chain rule backwards: dZ^[2] = A^[2] − Y (output), dZ^[1] = W^[2]T·dZ^[2] ⊙ g'(Z^[1]) (hidden). Shape of dW^[l] = shape of W^[l].
Random initialization is essential to break symmetry. Small weights (×0.01) keep activations in the high-gradient region of sigmoid/tanh.
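Universal Approximation Theorem: One hidden layer with enough neurons can approximate any continuous function on a bounded input domain to any desired accuracy, but "enough" may be exponentially many, and the theorem says nothing about whether gradient descent will find those weights.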
Practical wisdom: Start with logistic regression as baseline. If it fails, try a shallow NN with 4–32 hidden neurons. Plot cost curves. Check shapes obsessively.

The 6 Equations of a Shallow Neural Network:

Forward: Z^[1] = W^[1]X + b^[1] → A^[1] = tanh(Z^[1]) → Z^[2] = W^[2]A^[1] + b^[2] → A^[2] = σ(Z^[2])
Backward: dZ^[2] = A^[2]−Y → dZ^[1] = W^[2]TdZ^[2] ⊙ (1−A^[1]²)
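The six equations above map almost line for line onto NumPy. The sketch below is a compact illustration with tanh hidden units and a sigmoid output, not the chapter's full NeuralNetwork class. The function name `forward_backward`, the layer sizes, and the random data are assumptions made for the example, and the parameter gradients are added so the shapes can be checked.

```python
import numpy as np

def forward_backward(X, Y, W1, b1, W2, b2):
    """One forward and backward pass for a 2-layer network."""
    m = X.shape[1]
    # Forward: four equations
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))
    # Backward: two equations
    dZ2 = A2 - Y
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
    # Parameter gradients that follow from dZ2 and dZ1
    grads = {
        "dW2": dZ2 @ A1.T / m,
        "db2": np.sum(dZ2, axis=1, keepdims=True) / m,
        "dW1": dZ1 @ X.T / m,
        "db1": np.sum(dZ1, axis=1, keepdims=True) / m,
    }
    return A2, grads

rng = np.random.default_rng(0)
n_x, n_h, m = 2, 4, 6
X = rng.standard_normal((n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W1, b1 = rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((1, n_h)) * 0.01, np.zeros((1, 1))

A2, grads = forward_backward(X, Y, W1, b1, W2, b2)
for name, g in grads.items():
    print(name, g.shape)   # each gradient has the shape of its parameter
```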

Your next step: In Chapter 7, we go deep — multiple hidden layers! You'll learn how to generalize the forward and backward passes for L layers, understand why depth helps, and encounter new challenges like vanishing/exploding gradients. The notation you mastered here (W^[l], b^[l], a^[l]) extends directly.

Section 13

References & Further Reading

Primary Textbooks

Goodfellow, Bengio & Courville (2016). Deep Learning. MIT Press. Chapter 6 (Deep Feedforward Networks) is the definitive reference for activation functions and network architectures. Free at deeplearningbook.org.
Andrew Ng — Coursera Deep Learning Specialization. Course 1, Weeks 3-4 — Shallow Neural Networks and Deep Neural Networks. The notation used in this chapter follows Ng's conventions.
Michael Nielsen (2015). Neural Networks and Deep Learning. Free online book (neuralnetworksanddeeplearning.com). Chapter 1 has excellent visualizations of how neural networks learn.

Landmark Papers

Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function." Mathematics of Control, Signals, and Systems. — The Universal Approximation Theorem.
Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks." Neural Networks. Extended the theorem beyond sigmoidal units to any bounded, non-constant activation function.
Nair, V. & Hinton, G. (2010). "Rectified Linear Units Improve Restricted Boltzmann Machines." ICML. — The paper that popularized ReLU.
Glorot, X. & Bengio, Y. (2010). "Understanding the Difficulty of Training Deep Feedforward Neural Networks." AISTATS. — Xavier initialization.
Clevert, D., Unterthiner, T. & Hochreiter, S. (2016). "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." ICLR.

Indian Industry Context

Mu Sigma: musigma.com — Case studies in retail analytics, CPG, and insurance. Blog articles on decision science methodology.
NASSCOM AI Report (2024): India's analytics industry overview — growth trends, talent pipeline, and adoption across sectors.
NPTEL Deep Learning Course (IIT Madras): Prof. Mitesh Khapra's course covers shallow networks in Weeks 4-5 with excellent Hindi/English explanations — nptel.ac.in.
IndiaAI Portal: indiaai.gov.in — Government AI resources, datasets, and case studies from Indian industry.

Visualization Tools

TensorFlow Playground: playground.tensorflow.org — Interactively build and train shallow networks. See how hidden neurons create decision boundaries in real time. Start with the XOR dataset!
3Blue1Brown — Neural Networks series (YouTube): Grant Sanderson's visual explanations of forward/backward propagation are the best visual resource available.
