Neural Networks & Deep Learning
Chapter 6: Shallow Neural Networks
One Hidden Layer: From Single Neuron to Your First Network
Reading Time: ~3.5 hours | Part II: The Single Neuron to Networks | Theory + Code Chapter
Prerequisites: Chapter 4 (Single Neuron), Chapter 5 (Logistic Regression), NumPy basics
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| Remember | Recall the notation W[l], b[l], a[l] and the architecture of a 2-layer neural network |
| Understand | Explain why non-linear activations are essential and why tanh outperforms sigmoid in hidden layers |
| Apply | Implement forward propagation and backpropagation from scratch in NumPy for a 2-layer network |
| Analyze | Derive backpropagation equations step-by-step using the chain rule on the computation graph |
| Evaluate | Compare sigmoid, tanh, ReLU, Leaky ReLU, and ELU, selecting the right one for a given problem |
| Create | Build a complete NeuralNetwork class that learns non-linear decision boundaries on XOR-like data |
Learning Objectives
By the end of this chapter, you will be able to:
- Draw the architecture of a shallow neural network (1 hidden layer) with proper notation for layers, weights, biases, and activations
- Derive the forward propagation equations for a single training example and extend them to vectorized form over the entire dataset
- Prove mathematically that linear hidden activations collapse the entire network into a single linear transformation, making hidden layers useless
- Compare five activation functions (sigmoid, tanh, ReLU, Leaky ReLU, and ELU) with their formulas, derivatives, ranges, and trade-offs
- Derive the complete backpropagation equations for a 2-layer network using chain rule on the computation graph
- Explain the symmetry-breaking problem and why random initialization of weights is essential
- Implement a full NeuralNetwork class from scratch in NumPy that trains on planar data and visualizes non-linear decision boundaries
- Apply shallow neural networks to real-world classification problems in the Indian industry context
Opening Hook: When a Single Neuron Isn't Enough
"Is this dish healthy or indulgent?" Zomato's Cuisine Classifier
Imagine you're on Zomato's data science team in Gurugram. The product manager wants a new feature: automatically tag every dish as "Healthy Choice" or "Indulgent Treat" based on two features, calorie count and sugar content.
You try logistic regression (Chapter 5). It draws a straight line: "below 400 calories = healthy, above = indulgent." But wait: a masala oats bowl (350 cal, 5g sugar) is healthy, and a gulab jamun (300 cal, 45g sugar) is indulgent. Both are under 400 calories, but one is healthy and the other isn't! The decision boundary isn't a straight line; it's a curve.
You need something more powerful than a single neuron. You need neurons working together: a neural network. Even just one hidden layer with a few neurons can learn the curved boundaries that separate masala oats from gulab jamun, dal makhani from butter chicken, ragi dosa from cheese naan.
This chapter builds your first real neural network: one hidden layer that, given enough hidden units, can learn non-linear boundaries like this one.
Core Concepts: Building the Network Layer by Layer
3a. Neural Network Representation & Notation
A shallow neural network (also called a 2-layer neural network) has exactly three layers of nodes, but we count only layers with learnable parameters:
Architecture of a 2-Layer Neural Network
Layer 0 (Input): Contains the input features x₁, x₂, …, xₙ. This layer has no weights or biases; it just passes data forward. We denote the input as a[0] = X (activations of layer 0).
Layer 1 (Hidden): Contains n[1] hidden units (neurons). Each unit computes z = w·x + b, then applies an activation function. Parameters: W[1] (shape: n[1] × n[0]) and b[1] (shape: n[1] × 1). Outputs: a[1].
Layer 2 (Output): Contains n[2] output units (typically 1 for binary classification). Parameters: W[2] (shape: n[2] × n[1]) and b[2] (shape: n[2] × 1). Final output: a[2] = ŷ.
We count layers by the number of weight matrices, not nodes. The input layer is layer 0 and has no parameters. So: Layer 1 (hidden) + Layer 2 (output) = 2-layer network.
Superscript Notation Convention
| Symbol | Meaning | Example |
|---|---|---|
| W[l] | Weight matrix of layer l | W[1] connects input → hidden |
| b[l] | Bias vector of layer l | b[2] is the output layer bias |
| z[l] | Pre-activation (linear part) of layer l | z[1] = W[1]·X + b[1] |
| a[l] | Post-activation of layer l | a[1] = g(z[1]) |
| g[l](·) | Activation function of layer l | g[1] could be tanh, g[2] could be sigmoid |
| n[l] | Number of units in layer l | n[0] = 2, n[1] = 4, n[2] = 1 |
| (i) | Superscript in parentheses = training example index | x(3) = 3rd training example |
Shape rule: W[l] always has shape (n[l], n[l-1]): rows = units in the current layer, columns = units in the previous layer. If your dimensions don't match this pattern, you have a bug. This single rule will save you hours of debugging.
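As a quick illustration of the shape rule, the sketch below builds parameters for the example sizes n[0] = 2, n[1] = 4, n[2] = 1 and prints their shapes. The helper name `init_shapes` and the use of `default_rng` are assumptions made for this example, not part of the chapter's own code.

```python
import numpy as np

def init_shapes(layer_sizes, seed=0):
    """Hypothetical helper: create W[l], b[l] for sizes like [n0, n1, n2]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        n_curr, n_prev = layer_sizes[l], layer_sizes[l - 1]
        # W[l] has shape (n[l], n[l-1]); b[l] has shape (n[l], 1)
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * 0.01
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

params = init_shapes([2, 4, 1])
for name, value in params.items():
    print(name, value.shape)
```

If a matrix product fails with a shape error, comparing each parameter against this (n[l], n[l-1]) pattern is usually the fastest way to find the offending layer.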
3b. Forward Propagation: Single Example & Vectorized
Single Training Example (x(i))
For a single input vector x with shape (n[0], 1), the forward pass through our 2-layer network computes:
Layer 1 (Hidden):
z[1] = W[1] · x + b[1] → a[1] = g[1](z[1])
Layer 2 (Output):
z[2] = W[2] · a[1] + b[2] → a[2] = g[2](z[2]) = ŷ
Step-by-step walkthrough:
- Multiply: W[1] (4×2) · x (2×1) = z_partial (4×1); each hidden neuron computes its weighted sum of inputs
- Add bias: z_partial + b[1] (4×1) = z[1] (4×1)
- Activate: Apply g[1] (e.g., tanh) element-wise → a[1] (4×1); now we have 4 hidden unit outputs
- Multiply: W[2] (1×4) · a[1] (4×1) = z_partial (1×1); the output neuron combines hidden outputs
- Add bias: z_partial + b[2] (1×1) = z[2] (1×1)
- Activate: Apply g[2] (sigmoid for binary classification) → a[2] (1×1) = ŷ ∈ [0, 1]
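The six steps above can be sketched in NumPy for a single example. The input values, the seed, and the variable names are assumptions chosen only for illustration; the layer sizes (2 → 4 → 1) match the walkthrough.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)      # arbitrary seed for reproducibility
n0, n1, n2 = 2, 4, 1                # sizes used in the walkthrough

W1 = rng.standard_normal((n1, n0)) * 0.01   # (4, 2)
b1 = np.zeros((n1, 1))                      # (4, 1)
W2 = rng.standard_normal((n2, n1)) * 0.01   # (1, 4)
b2 = np.zeros((n2, 1))                      # (1, 1)

x = np.array([[1.0], [0.5]])        # one example as a column vector, (2, 1)

z1 = W1 @ x + b1                    # multiply + add bias -> (4, 1)
a1 = np.tanh(z1)                    # activate            -> (4, 1)
z2 = W2 @ a1 + b2                   # multiply + add bias -> (1, 1)
a2 = sigmoid(z2)                    # activate            -> (1, 1), this is y_hat

print(z1.shape, a1.shape, z2.shape, a2.shape)
```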
Vectorized Over m Examples
Instead of looping over m training examples, we stack all examples as columns of a matrix X (shape: n[0] × m). The vectorized forward propagation becomes:
Z[1] = W[1] · X + b[1] (n[1] × m)
A[1] = g[1](Z[1]) (n[1] × m)
Z[2] = W[2] · A[1] + b[2] (n[2] × m)
A[2] = g[2](Z[2]) = Ŷ (n[2] × m)
Here, b[1] (shape: n[1] × 1) is broadcast across all m columns automatically by NumPy. Each column of A[2] is the prediction for one training example.
A Python loop such as for i in range(m): z = W @ x[:,i] is correct but painfully slow. The vectorized version Z = W @ X uses NumPy's optimized BLAS routines and can run roughly 50-300× faster. Always vectorize over training examples; loop only over layers.
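A small sketch can confirm that the looped and vectorized versions compute the same Z, including the broadcast of the bias column. The sizes and random data here are arbitrary assumptions, and the snippet deliberately checks only equivalence, not speed.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, m = 3, 5, 1000              # illustrative sizes
W = rng.standard_normal((n1, n0))
b = rng.standard_normal((n1, 1))
X = rng.standard_normal((n0, m))    # examples stacked as columns

# Explicit loop: one column (training example) at a time
Z_loop = np.zeros((n1, m))
for i in range(m):
    Z_loop[:, i] = W @ X[:, i] + b[:, 0]

# Vectorized: b of shape (n1, 1) is broadcast across all m columns
Z_vec = W @ X + b

print(Z_vec.shape, np.allclose(Z_loop, Z_vec))
```

To compare speed on your own machine, you could wrap both versions in `time.perf_counter()` calls; the ratio depends on the sizes and the BLAS build.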
3c. Activation Functions: The Complete Deep Dive
The activation function g(z) is what gives neural networks their power. Without it, stacking layers is pointless (we'll prove this next). Here are the five most important activation functions:
① Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ)
σ(z) = 1 / (1 + e^(−z)) | σ'(z) = σ(z) · (1 − σ(z))
Range: (0, 1), always positive, interpretable as probability
Pros: Output between 0 and 1, so well suited to the output layer in binary classification. Smooth gradient everywhere.
Cons: ① Vanishing gradient: when |z| is large, σ'(z) ≈ 0, gradients vanish, and learning stalls. ② Not zero-centered: output is always positive, causing zig-zag gradient updates. ③ exp() is computationally expensive.
When to Use: Output layer only, for binary classification. Almost never for hidden layers.
② Tanh: tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)
tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z)) | tanh'(z) = 1 − tanh²(z)
Range: (−1, 1), centered around zero
Why tanh > sigmoid for hidden layers: ① Zero-centered output: since the mean of tanh output is closer to 0 (vs. 0.5 for sigmoid), the next layer's inputs are centered, making gradient descent converge faster. ② Steeper gradient: tanh has a maximum derivative of 1 (at z=0) vs. sigmoid's maximum of 0.25. This means 4× stronger gradients in the active region.
Cons: Still suffers from vanishing gradients for |z| ≫ 0, just like sigmoid.
Relationship: tanh(z) = 2σ(2z) − 1, so tanh is a shifted and scaled version of sigmoid!
③ ReLU: max(0, z)
ReLU(z) = max(0, z) | ReLU'(z) = 1 if z > 0, else 0
Range: [0, ∞), unbounded positive
Pros: ① No vanishing gradient for z > 0 (derivative is exactly 1). ② Computationally cheap: just a comparison, no exp(). ③ Induces sparsity: many neurons output 0, creating efficient representations. ④ Default choice for hidden layers in most modern networks.
Cons: Dying ReLU: if a neuron's z is always negative (for example, due to a large negative bias), its gradient is always 0 and it never updates; it's "dead." This can happen to 10-40% of neurons in practice.
④ Leaky ReLU: max(αz, z) where α ≈ 0.01
LeakyReLU(z) = z if z > 0, else α·z | LeakyReLU'(z) = 1 if z > 0, else α
Range: (−∞, ∞), unbounded on both sides
Pros: Fixes the dying ReLU problem. Negative inputs still get a small gradient (α = 0.01), so neurons can always recover.
Variant: Parametric ReLU (PReLU): α is a learnable parameter, not fixed. The network decides the optimal slope for the negative region.
⑤ ELU: Exponential Linear Unit
ELU(z) = z if z > 0, else α(e^z − 1) | ELU'(z) = 1 if z > 0, else ELU(z) + α
Range: (−α, ∞), smoothly saturating toward −α for large negative z
Pros: ① Mean activations closer to zero (like tanh). ② Smooth curve for z < 0 (unlike the kink in ReLU/Leaky ReLU). ③ Better noise robustness.
Cons: Slightly slower due to the exp() computation. α is typically 1.0.
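The five activations and their derivatives described above can be written compactly in NumPy. This is a minimal sketch: the function names and default α values are assumptions for illustration, and the ELU branch clamps its exponent only to avoid overflow warnings for large positive z.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def d_leaky_relu(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

def elu(z, alpha=1.0):
    # np.minimum keeps exp() from overflowing on the unused positive branch
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def d_elu(z, alpha=1.0):
    return np.where(z > 0, 1.0, elu(z, alpha) + alpha)

z = np.linspace(-5, 5, 11)
# Relationship stated above: tanh(z) = 2*sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))
# Maximum derivatives at z = 0: 0.25 for sigmoid, 1.0 for tanh
print(d_sigmoid(np.array([0.0])), d_tanh(np.array([0.0])))
```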
Master Comparison Table
| Activation | Formula | Range | Derivative | Zero-Centered? | Vanishing Gradient? | Use Case |
|---|---|---|---|---|---|---|
| Sigmoid | 1/(1+e^(−z)) | (0, 1) | σ(1−σ) | No | Yes | Output layer (binary) |
| Tanh | (e^z−e^(−z))/(e^z+e^(−z)) | (−1, 1) | 1−tanh² | Yes | Yes | Hidden layers (small nets) |
| ReLU | max(0, z) | [0, ∞) | 0 or 1 | No | No (z>0) | Hidden layers (default) |
| Leaky ReLU | max(αz, z) | (−∞, ∞) | α or 1 | No | No | Hidden layers (fix dying) |
| ELU | z or α(e^z−1) | (−α, ∞) | 1 or ELU+α | Approximately | No | Hidden layers (smooth) |
① Hidden layers: Start with ReLU. If too many neurons die, try Leaky ReLU or ELU.
② Output layer (binary classification): Sigmoid.
③ Output layer (regression): Linear (no activation).
④ Output layer (multi-class): Softmax (Chapter 10).
⑤ RNNs/LSTMs: tanh and sigmoid are used internally by design.
3d. Proof: Linear Activations Collapse the Network
Here's a critical question: what if we use a linear activation function g(z) = z (the identity function) for all layers? Let's prove that the network becomes equivalent to plain linear regression, making hidden layers useless.
Theorem: A Network with All Linear Activations = Linear Regression
Consider our 2-layer network with linear activation g(z) = z everywhere:
Forward Pass
z[1] = W[1]·x + b[1]
a[1] = g(z[1]) = z[1] = W[1]·x + b[1]
z[2] = W[2]·a[1] + b[2]
a[2] = g(z[2]) = z[2] = W[2]·(W[1]·x + b[1]) + b[2]
a[2] = W[2]·W[1]·x + W[2]·b[1] + b[2]
a[2] = W'·x + b'
where W' = W[2]·W[1] (a single weight matrix) and b' = W[2]·b[1] + b[2] (a single bias vector).
Conclusion: The entire network reduces to ŷ = W'x + b', which is exactly linear regression. The hidden layer adds zero expressive power. No matter how many linear hidden layers you stack, the composition of linear functions is linear. For example, with three linear layers:
f(x) = W₃·(W₂·(W₁·x + b₁) + b₂) + b₃ = (W₃·W₂·W₁)·x + (W₃·W₂·b₁ + W₃·b₂ + b₃)
= W'·x + b' (still linear!)
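The collapse can also be checked numerically. The sketch below, with arbitrary random parameters chosen for illustration, compares the output of a two-layer network with identity activations against the single linear map W'·X + b'.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 2))
b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((1, 4))
b2 = rng.standard_normal((1, 1))
X = rng.standard_normal((2, 10))     # 10 example inputs as columns

# Two layers with identity activation g(z) = z
two_layer = W2 @ (W1 @ X + b1) + b2

# Equivalent single linear layer from the proof
W_prime = W2 @ W1                    # shape (1, 2)
b_prime = W2 @ b1 + b2               # shape (1, 1)
one_layer = W_prime @ X + b_prime

print(np.allclose(two_layer, one_layer))
```

Replacing the identity with `np.tanh` in the first layer breaks this equality, which is exactly what gives the hidden layer its extra expressive power.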
3e. Backpropagation for a 2-Layer Network: Full Derivation
Backpropagation is the algorithm that computes gradients of the loss function with respect to every parameter. It uses the chain rule of calculus, working backwards from the output layer to the input layer.
The Cost Function
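For binary classification with m examples, the cost is the average binary cross-entropy loss:
J = −(1/m) · Σᵢ [ y(i) · log(a[2](i)) + (1 − y(i)) · log(1 − a[2](i)) ]
Step 1: Output Layer Gradients (Layer 2)
Starting from the loss and working backwards through the sigmoid output:
dZ[2] = A[2] − Y (n[2] × m)
dW[2] = (1/m) · dZ[2] · A[1]ᵀ (n[2] × n[1])
db[2] = (1/m) · Σ(dZ[2], axis=1, keepdims=True) (n[2] × 1)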
Derivation of dZ[2]:
∂J/∂a[2] = −y/a[2] + (1−y)/(1−a[2]) (derivative of cross-entropy)
∂a[2]/∂z[2] = a[2](1−a[2]) (derivative of sigmoid)
dz[2] = ∂J/∂z[2] = ∂J/∂a[2] · ∂a[2]/∂z[2]
= [−y/a[2] + (1−y)/(1−a[2])] · a[2](1−a[2])
= −y(1−a[2]) + (1−y)a[2]
= a[2] − y (beautifully simple!)
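One way to sanity-check this simplification is a finite-difference test on a single example. The values of z and y below are arbitrary assumptions, and `loss_from_z` is a hypothetical helper that composes the sigmoid with the cross-entropy loss.

```python
import numpy as np

def loss_from_z(z, y):
    """Cross-entropy loss of one example, as a function of the pre-activation z."""
    a = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

z, y, eps = 0.7, 1.0, 1e-6

# Central finite difference approximates dJ/dz
numeric = (loss_from_z(z + eps, y) - loss_from_z(z - eps, y)) / (2 * eps)

# Closed form from the derivation: dz = a - y
analytic = 1.0 / (1.0 + np.exp(-z)) - y

print(numeric, analytic)
```

The two numbers should agree to many decimal places; trying y = 0.0 exercises the other branch of the loss.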
Step 2: Hidden Layer Gradients (Layer 1)
Now we propagate the gradient backwards through W[2] and the activation function g[1]:
dZ[1] = W[2]ᵀ · dZ[2] ⊙ g[1]'(Z[1]) (n[1] × m)
dW[1] = (1/m) · dZ[1] · Xᵀ (n[1] × n[0])
db[1] = (1/m) · Σ(dZ[1], axis=1, keepdims=True) (n[1] × 1)
Derivation of dZ[1]:
∂J/∂z[1] = ∂J/∂z[2] · ∂z[2]/∂a[1] · ∂a[1]/∂z[1]
= W[2]ᵀ · dz[2] ⊙ g[1]'(z[1])
Vectorized: dZ[1] = W[2]ᵀ · dZ[2] ⊙ g[1]'(Z[1])
The ⊙ symbol denotes element-wise multiplication (Hadamard product). The term g[1]'(Z[1]) is the derivative of the hidden layer's activation function applied element-wise.
Derivatives for Common Activations
| If g[1] is... | Then g[1]'(z) = | In terms of a[1] |
|---|---|---|
| Sigmoid | σ(z)(1−σ(z)) | a[1] ⊙ (1 − a[1]) |
| Tanh | 1 − tanh²(z) | 1 − (a[1])² |
| ReLU | 1 if z > 0, else 0 | (Z[1] > 0).astype(int) |
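A common way to gain confidence in derived gradients is a numerical gradient check. The sketch below assumes a tanh hidden layer and sigmoid output, as in the derivation above; the function names (`cost`, `backprop`, `numerical_grads`), the tiny network sizes, and the random data are all hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(params, X, Y):
    W1, b1, W2, b2 = params
    A2 = sigmoid(W2 @ np.tanh(W1 @ X + b1) + b2)
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

def backprop(params, X, Y):
    """Analytic gradients from the equations derived above."""
    W1, b1, W2, b2 = params
    m = X.shape[1]
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)       # tanh derivative in terms of A1
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return [dW1, db1, dW2, db2]

def numerical_grads(params, X, Y, eps=1e-5):
    """Central finite differences, one parameter entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            c_plus = cost(params, X, Y)
            p[idx] = old - eps
            c_minus = cost(params, X, Y)
            p[idx] = old                      # restore the original value
            g[idx] = (c_plus - c_minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))
Y = (rng.random((1, 5)) > 0.5).astype(float)
params = [rng.standard_normal((3, 2)) * 0.5, np.zeros((3, 1)),
          rng.standard_normal((1, 3)) * 0.5, np.zeros((1, 1))]

analytic = backprop(params, X, Y)
numeric = numerical_grads(params, X, Y)
max_diff = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

A large discrepancy typically points to a missing transpose, a missing 1/m factor, or a forgotten activation derivative.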
Step 3: Parameter Updates
W[1] := W[1] − α · dW[1]
b[1] := b[1] − α · db[1]
W[2] := W[2] − α · dW[2]
b[2] := b[2] − α · db[2]
Where α is the learning rate (a hyperparameter you choose, e.g., 0.01 or 1.2).
3f. Random Initialization: Breaking Symmetry
In logistic regression, we could initialize weights to zero. Can we do the same for neural networks? No! Here's why:
The Symmetry Problem
If all weights in W[1] are initialized to zero, then for every hidden unit:
z₁[1] = 0·x₁ + 0·x₂ + 0 = 0
z₂[1] = 0·x₁ + 0·x₂ + 0 = 0
z₃[1] = 0·x₁ + 0·x₂ + 0 = 0
z₄[1] = 0·x₁ + 0·x₂ + 0 = 0
All hidden units compute the same value → same activations a[1] → same gradients dW → same updates. They stay identical forever. It's like having 4 copies of the same neuron: no matter how long you train, the network can only learn what a single neuron can learn.
Solution: Random Initialization. Initialize weights randomly with small values:
W[1] = np.random.randn(n[1], n[0]) * 0.01
The 0.01 scaling keeps values small so sigmoid/tanh start in the linear region (steep gradients), not the saturated flat regions.
Why small values? If W is too large, z = Wx + b will be large → tanh(z) saturates near ±1 → gradient ≈ 0 → learning is glacially slow. With small W, z stays near 0, where tanh has gradient ≈ 1.
Biases: Biases can be initialized to zero. Since each neuron already has different weights (breaking symmetry), different biases aren't needed for symmetry breaking. b = np.zeros((n, 1)) is fine.
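The symmetry argument can be observed directly. The sketch below trains a tiny tanh/sigmoid network twice on the same made-up data: once from identical weights (all-zeros is a special case of this) and once from small random weights. The helper `train_steps`, the constant 0.1, and the data are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_steps(W1, b1, W2, b2, X, Y, lr=0.5, steps=100):
    """Hypothetical helper: plain gradient descent, returns the final W1 and W2."""
    W1, b1, W2, b2 = W1.copy(), b1.copy(), W2.copy(), b2.copy()
    m = X.shape[1]
    for _ in range(steps):
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)    # computed before W2 is updated
        W2 -= lr * (dZ2 @ A1.T) / m
        b2 -= lr * np.sum(dZ2, axis=1, keepdims=True) / m
        W1 -= lr * (dZ1 @ X.T) / m
        b1 -= lr * np.sum(dZ1, axis=1, keepdims=True) / m
    return W1, W2

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 20))
Y = (rng.random((1, 20)) > 0.5).astype(float)
n_h = 4

# Identical initialization: every hidden unit starts with the same weights
W1_same, _ = train_steps(np.full((n_h, 2), 0.1), np.zeros((n_h, 1)),
                         np.full((1, n_h), 0.1), np.zeros((1, 1)), X, Y)

# Small random initialization breaks the symmetry
W1_rand, _ = train_steps(rng.standard_normal((n_h, 2)) * 0.01, np.zeros((n_h, 1)),
                         rng.standard_normal((1, n_h)) * 0.01, np.zeros((1, 1)), X, Y)

print("identical init, rows still equal:", np.allclose(W1_same, W1_same[0]))
print("random init, rows still equal:   ", np.allclose(W1_rand, W1_rand[0]))
```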
From-Scratch Code: Building a Neural Network in NumPy
Let's build a complete NeuralNetwork class with one hidden layer. We'll train it on a planar XOR dataset, the classic problem that a single neuron (perceptron) cannot solve.
Step 1: Generate Planar XOR Data
Python
import numpy as np
import matplotlib.pyplot as plt

def generate_xor_data(n_samples=400, noise=0.15):
    """Generate planar XOR-like dataset.
    Class 1 (y=1): points in quadrants I and III
    Class 0 (y=0): points in quadrants II and IV
    """
    np.random.seed(42)
    n = n_samples // 4
    # Quadrant I: (+, +) -> class 1
    q1 = np.random.randn(n, 2) * 0.5 + np.array([1, 1])
    # Quadrant II: (-, +) -> class 0
    q2 = np.random.randn(n, 2) * 0.5 + np.array([-1, 1])
    # Quadrant III: (-, -) -> class 1
    q3 = np.random.randn(n, 2) * 0.5 + np.array([-1, -1])
    # Quadrant IV: (+, -) -> class 0
    q4 = np.random.randn(n, 2) * 0.5 + np.array([1, -1])
    X = np.vstack([q1, q2, q3, q4]).T  # Shape: (2, 400)
    Y = np.array([[1]*n + [0]*n + [1]*n + [0]*n])  # Shape: (1, 400)
    # Shuffle
    perm = np.random.permutation(X.shape[1])
    X, Y = X[:, perm], Y[:, perm]
    return X, Y

X, Y = generate_xor_data()
print(f"X shape: {X.shape}, Y shape: {Y.shape}")
Step 2: The Complete NeuralNetwork Class
Python
class ShallowNeuralNetwork:
    """
    A 2-layer neural network (1 hidden layer) for binary classification.
    Architecture: Input(n_x) -> Hidden(n_h, tanh) -> Output(1, sigmoid)
    """

    def __init__(self, n_x, n_h, learning_rate=1.2):
        """
        Parameters:
            n_x : int, number of input features
            n_h : int, number of hidden units
            learning_rate : float, step size for gradient descent
        """
        self.lr = learning_rate
        # Random initialization (small weights, zero biases)
        self.W1 = np.random.randn(n_h, n_x) * 0.01
        self.b1 = np.zeros((n_h, 1))
        self.W2 = np.random.randn(1, n_h) * 0.01
        self.b2 = np.zeros((1, 1))
        print(f"Network initialized: {n_x} -> {n_h} -> 1")
        print(f" W1: {self.W1.shape}, b1: {self.b1.shape}")
        print(f" W2: {self.W2.shape}, b2: {self.b2.shape}")
        total = n_h * n_x + n_h + 1 * n_h + 1
        print(f" Total parameters: {total}")

    def sigmoid(self, z):
        """Sigmoid activation function."""
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def forward(self, X):
        """
        Forward propagation.
        X: shape (n_x, m), m examples
        Returns: A2 (predictions), cache (for backprop)
        """
        # Layer 1: Hidden
        Z1 = self.W1 @ X + self.b1  # (n_h, m)
        A1 = np.tanh(Z1)  # (n_h, m)
        # Layer 2: Output
        Z2 = self.W2 @ A1 + self.b2  # (1, m)
        A2 = self.sigmoid(Z2)  # (1, m)
        cache = (Z1, A1, Z2, A2)
        return A2, cache

    def compute_cost(self, A2, Y):
        """Binary cross-entropy loss."""
        m = Y.shape[1]
        # Clip to avoid log(0)
        A2 = np.clip(A2, 1e-8, 1 - 1e-8)
        cost = -(1/m) * np.sum(
            Y * np.log(A2) + (1 - Y) * np.log(1 - A2)
        )
        return float(cost)

    def backward(self, X, Y, cache):
        """
        Backward propagation.
        Returns: gradients dict {dW1, db1, dW2, db2}
        """
        m = X.shape[1]
        Z1, A1, Z2, A2 = cache
        # Output layer gradients
        dZ2 = A2 - Y  # (1, m)
        dW2 = (1/m) * dZ2 @ A1.T  # (1, n_h)
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)  # (1, 1)
        # Hidden layer gradients
        dZ1 = (self.W2.T @ dZ2) * (1 - A1**2)  # tanh derivative: 1 - tanh^2(z)
        dW1 = (1/m) * dZ1 @ X.T  # (n_h, n_x)
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)  # (n_h, 1)
        return {'dW1': dW1, 'db1': db1,
                'dW2': dW2, 'db2': db2}

    def update_parameters(self, grads):
        """Gradient descent update."""
        self.W1 -= self.lr * grads['dW1']
        self.b1 -= self.lr * grads['db1']
        self.W2 -= self.lr * grads['dW2']
        self.b2 -= self.lr * grads['db2']

    def train(self, X, Y, epochs=10000, print_every=1000):
        """Full training loop."""
        costs = []
        for i in range(epochs):
            # Forward
            A2, cache = self.forward(X)
            # Cost
            cost = self.compute_cost(A2, Y)
            # Backward
            grads = self.backward(X, Y, cache)
            # Update
            self.update_parameters(grads)
            if i % print_every == 0:
                costs.append(cost)
                print(f"Epoch {i:5d} | Cost: {cost:.6f}")
        return costs

    def predict(self, X):
        """Binary predictions (threshold = 0.5)."""
        A2, _ = self.forward(X)
        return (A2 > 0.5).astype(int)

    def accuracy(self, X, Y):
        """Compute classification accuracy."""
        preds = self.predict(X)
        return np.mean(preds == Y) * 100
Step 3: Train the Network
Python
# Create and train the network
nn = ShallowNeuralNetwork(n_x=2, n_h=8, learning_rate=1.2)
costs = nn.train(X, Y, epochs=10000, print_every=1000)
print(f"\nFinal Accuracy: {nn.accuracy(X, Y):.1f}%")
Step 4: Visualize the Decision Boundary
Python
def plot_decision_boundary(model, X, Y):
    """Visualize the non-linear decision boundary."""
    x_min, x_max = X[0].min() - 0.5, X[0].max() + 0.5
    y_min, y_max = X[1].min() - 0.5, X[1].max() + 0.5
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, 200),
        np.linspace(y_min, y_max, 200)
    )
    grid = np.c_[xx.ravel(), yy.ravel()].T  # (2, 40000)
    Z = model.predict(grid)
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, levels=[0, 0.5, 1],
                 colors=['#fecaca', '#bbf7d0'], alpha=0.7)
    plt.contour(xx, yy, Z, levels=[0.5], colors=['#7c3aed'],
                linewidths=2)
    # Plot data points
    plt.scatter(X[0, Y[0]==0], X[1, Y[0]==0],
                c='#ef4444', edgecolors='k', s=30, label='Class 0')
    plt.scatter(X[0, Y[0]==1], X[1, Y[0]==1],
                c='#22c55e', edgecolors='k', s=30, label='Class 1')
    plt.title('Decision Boundary: Shallow Neural Network (XOR Data)',
              fontweight='bold', fontsize=14)
    plt.xlabel('Feature x₁')
    plt.ylabel('Feature x₂')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

plot_decision_boundary(nn, X, Y)
Industry Code: scikit-learn & TensorFlow Equivalents
In production, you'd use optimized libraries. Here's how our from-scratch network maps to industry tools:
scikit-learn: MLPClassifier
Python
from sklearn.neural_network import MLPClassifier

# X_train shape: (m, n_features); sklearn uses row vectors!
clf = MLPClassifier(
    hidden_layer_sizes=(8,),   # 1 hidden layer, 8 neurons
    activation='tanh',         # hidden layer activation
    solver='sgd',              # stochastic gradient descent
    learning_rate_init=0.01,   # initial learning rate
    max_iter=10000,            # epochs
    random_state=42
)
clf.fit(X.T, Y.ravel())  # sklearn expects (m, n_features)
print(f"Accuracy: {clf.score(X.T, Y.ravel()) * 100:.1f}%")
TensorFlow/Keras: Sequential API
Python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='tanh',
                          input_shape=(2,)),        # hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid')  # output layer
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1.2),
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Keras expects (m, n_features)
history = model.fit(X.T, Y.T, epochs=200, verbose=0)
loss, acc = model.evaluate(X.T, Y.T, verbose=0)
print(f"Keras Accuracy: {acc * 100:.1f}%")
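Our from-scratch code stores examples as columns, with X of shape (n_features, m), while scikit-learn and Keras expect examples as rows, with shape (m, n_features). Use .T (transpose) when switching conventions.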
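The shape conversion itself needs only NumPy. The sketch below uses a made-up six-example array to show what the transposes and `ravel()` in the snippets above produce; the variable names are assumptions for illustration.

```python
import numpy as np

m = 6
# Column convention used by the from-scratch code: (n_features, m)
X_cols = np.arange(2 * m, dtype=float).reshape(2, m)
Y_row = np.array([[1, 0, 1, 0, 1, 0]])          # (1, m)

# Row convention expected by scikit-learn and Keras: (m, n_features)
X_rows = X_cols.T                               # (6, 2)
y_flat = Y_row.ravel()                          # (6,)   1-D labels, as passed to clf.fit
Y_col = Y_row.T                                 # (6, 1) column labels, as passed to model.fit

print(X_rows.shape, y_flat.shape, Y_col.shape)
# Example 3 is column 3 in one layout and row 3 in the other
print(np.array_equal(X_rows[3], X_cols[:, 3]))
```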
Comparison: From Scratch vs Industry
| Aspect | Our From-Scratch Code | scikit-learn / Keras |
|---|---|---|
| Lines of code | ~80 lines | ~10 lines |
| Optimizer | Vanilla gradient descent | SGD, Adam, RMSprop, etc. |
| Regularization | None | L1, L2, Dropout built-in |
| GPU support | No | Yes (Keras/TF) |
| Learning value | ★★★★★ | ★★ (black box) |
| Production use | ✗ (educational only) | ★★★★★ |
Visual Diagrams: Computation Graph & Architecture
[Figure: Computation Graph for Forward & Backward Pass]
[Figure: Shape Flow Through the Network]
[Figure: Activation Function Shapes]
Worked Example: Hand-Computing One Forward & Backward Pass
Let's trace through one complete forward + backward pass with actual numbers. This is the best way to build intuition.
Setup: Tiny Network (2 inputs → 2 hidden → 1 output)
x = [1.0, 0.5]ᵀ (shape: 2×1), y = 1
Initialized Parameters
W[1] = [[0.1, 0.3], [0.2, 0.4]] (2×2)
b[1] = [[0.0], [0.0]] (2×1)
W[2] = [[0.5, 0.6]] (1×2)
b[2] = [[0.0]] (1×1)
Forward Pass
Hand Calculation
# Layer 1: z[1] = W[1]·x + b[1]
z₁⁽¹⁾ = 0.1×1.0 + 0.3×0.5 + 0.0 = 0.25
z₂⁽¹⁾ = 0.2×1.0 + 0.4×0.5 + 0.0 = 0.40
# Layer 1: a[1] = tanh(z[1])
a₁⁽¹⁾ = tanh(0.25) = 0.2449
a₂⁽¹⁾ = tanh(0.40) = 0.3799
# Layer 2: z[2] = W[2]·a[1] + b[2]
z⁽²⁾ = 0.5×0.2449 + 0.6×0.3799 + 0.0 = 0.3504
# Layer 2: a[2] = σ(z[2])
a⁽²⁾ = σ(0.3504) = 1/(1+e^(−0.3504)) = 0.5867
# ŷ = 0.5867 (predicted probability)
# True y = 1, so the network needs to push ŷ higher
Cost
Hand Calculation
J = -[y·log(ŷ) + (1-y)·log(1-ŷ)]
= -[1·log(0.5867) + 0·log(0.4133)]
= -log(0.5867)
= 0.5332
Backward Pass
Hand Calculation
# Layer 2 gradients
dz⁽²⁾ = a⁽²⁾ - y = 0.5867 - 1 = -0.4133
dW₁⁽²⁾ = dz⁽²⁾ × a₁⁽¹⁾ = -0.4133 × 0.2449 = -0.1012
dW₂⁽²⁾ = dz⁽²⁾ × a₂⁽¹⁾ = -0.4133 × 0.3799 = -0.1570
db⁽²⁾ = dz⁽²⁾ = -0.4133
# Layer 1 gradients (tanh derivative: 1 - tanh²)
dz₁⁽¹⁾ = W₁⁽²⁾·dz⁽²⁾ × (1 - a₁⁽¹⁾²) = 0.5×(-0.4133) × (1 - 0.2449²)
= -0.2067 × 0.9400 = -0.1943
dz₂⁽¹⁾ = W₂⁽²⁾·dz⁽²⁾ × (1 - a₂⁽¹⁾²) = 0.6×(-0.4133) × (1 - 0.3799²)
= -0.2480 × 0.8557 = -0.2122
dW⁽¹⁾ = [[dz₁⁽¹⁾·x₁, dz₁⁽¹⁾·x₂], [dz₂⁽¹⁾·x₁, dz₂⁽¹⁾·x₂]]
= [[-0.1943, -0.0972], [-0.2122, -0.1061]]
Parameter Update (α = 1.0)
Hand Calculation
# W[1] := W[1] - α·dW[1]
W⁽¹⁾_new = [[0.1 - 1.0×(-0.1943), 0.3 - 1.0×(-0.0972)],
[0.2 - 1.0×(-0.2122), 0.4 - 1.0×(-0.1061)]]
= [[0.2943, 0.3972],
[0.4122, 0.5061]]
# All weights increased, pushing ŷ toward 1
# W[2] := W[2] - α·dW[2]
W⁽²⁾_new = [[0.5 - 1.0×(-0.1012), 0.6 - 1.0×(-0.1570)]]
= [[0.6012, 0.7570]]
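Hand calculations like this are easy to cross-check in NumPy. The sketch below recomputes the same forward pass, cost, gradients, and update from the setup values; because the hand calculation rounds intermediate results to four decimals, small differences in the last digit are to be expected.

```python
import numpy as np

# Setup values from the worked example
x = np.array([[1.0], [0.5]])
y = 1.0
W1 = np.array([[0.1, 0.3], [0.2, 0.4]])
b1 = np.zeros((2, 1))
W2 = np.array([[0.5, 0.6]])
b2 = np.zeros((1, 1))
alpha = 1.0

# Forward pass
z1 = W1 @ x + b1
a1 = np.tanh(z1)
z2 = W2 @ a1 + b2
a2 = 1.0 / (1.0 + np.exp(-z2))
J = float(-(y * np.log(a2) + (1 - y) * np.log(1 - a2))[0, 0])

# Backward pass (single example, so m = 1)
dz2 = a2 - y
dW2 = dz2 @ a1.T
dz1 = (W2.T @ dz2) * (1 - a1 ** 2)
dW1 = dz1 @ x.T

# Gradient descent update
W1_new = W1 - alpha * dW1
W2_new = W2 - alpha * dW2

np.set_printoptions(precision=4, suppress=True)
print("a2 =", a2.ravel(), " J =", round(J, 4))
print("dz1 =", dz1.ravel())
print("W1_new =\n", W1_new)
print("W2_new =", W2_new)
```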
Case Study - Mu Sigma: Retail Analytics with Shallow Networks
Mu Sigma: India's Largest Pure-Play Analytics Firm
Mu Sigma, founded in 2004 in Bengaluru by Dhiraj Rajaram, is one of India's largest analytics and decision science companies. With 3,500+ "decision scientists" and offices in Chicago and Bengaluru, Mu Sigma serves Fortune 500 clients across retail, healthcare, insurance, and CPG. The company is valued at over ₹8,000 crore ($1 billion).
The Problem: A major Indian retail chain (similar to Reliance Retail / DMart) needed to predict customer churn: which customers would stop shopping at their stores in the next quarter. They had 12 input features per customer: purchase frequency, average basket size (₹), recency of last visit, number of categories shopped, coupon redemption rate, store distance, customer age, tenure, complaint history, payment mode preference, festive season spending, and loyalty points balance.
Why Not Logistic Regression? Initial logistic regression achieved only 68% accuracy. The Mu Sigma team discovered the churn pattern was non-linear: customers with moderate purchase frequency AND low recency were churning (they used to shop often but stopped), while customers with low frequency but high recency were fine (they shop rarely but recently). This interaction pattern needed a curved decision boundary.
The Solution: Shallow Neural Network. The team implemented a 2-layer neural network with:
- Input layer: 12 features (normalized to [0, 1])
- Hidden layer: 24 neurons with ReLU activation
- Output layer: 1 neuron with sigmoid (churn probability)
- Training: 200,000 customer records, batch gradient descent, learning rate = 0.01
| Metric | Logistic Regression | Shallow Neural Network |
|---|---|---|
| Accuracy | 68% | 84% |
| Precision (churn class) | 0.55 | 0.79 |
| Recall (churn class) | 0.48 | 0.76 |
| AUC-ROC | 0.72 | 0.89 |
| Retention offers savings | ₹2.1 crore/quarter | ₹5.8 crore/quarter |
The hidden layer learned feature interactions that the linear model couldn't capture. Neuron #7, for instance, activated strongly when a customer had high historical frequency but low recent activity, essentially learning the concept of "lapsing customer" on its own, without being explicitly programmed with this feature.
Business Impact: By accurately identifying at-risk customers, the retail chain sent targeted retention offers (₹200 discount coupons, exclusive sale access) to the right customers, saving ₹5.8 crore per quarter in prevented churn, roughly a 2.8× improvement over the ₹2.1 crore achieved with the logistic regression approach.
Common Mistakes & Misconceptions
More hidden neurons are not always better. Too many neurons lead to overfitting: the network memorizes training data (including noise) instead of learning general patterns. For 400 training examples, 8 hidden neurons work well. Using 500 hidden neurons could memorize the data almost perfectly but generalize poorly to new data. The right number depends on your dataset size and complexity.
Sigmoid should almost never be used for hidden layers in modern networks. Its non-zero-centered output and vanishing gradient make training slow. Use ReLU (or its variants) for hidden layers and sigmoid only for the output layer in binary classification. This single change can speed up training 5-10×.
Zero initialization creates perfect symmetry: all hidden neurons compute identical values, get identical gradients, and stay identical forever. You effectively have a 1-neuron network regardless of how many neurons you defined. Always use random initialization (e.g., np.random.randn(...) * 0.01).
The clean formula dZ[2] = A[2] − Y only works for the output layer with sigmoid + cross-entropy loss. For hidden layers, you must backpropagate through the weight matrix AND multiply by the activation derivative: dZ[1] = W[2]ᵀ·dZ[2] ⊙ g'(Z[1]). The ⊙ (element-wise multiply) is critical!
When computing db = (1/m) × np.sum(dZ, axis=1), NumPy returns a 1D array of shape (n,) instead of (n, 1). This causes silent broadcasting bugs in the parameter update step. Always use keepdims=True to maintain the (n, 1) column vector shape.
A learning rate that is too high (e.g., α = 100) makes the cost oscillate wildly and never converge. One that is too low (e.g., α = 0.0001) makes training take forever. For shallow networks with tanh, a good starting point is α = 0.5 to 2.0. Always plot the cost curve; it should decrease smoothly.
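The keepdims pitfall described above is easy to reproduce. In this sketch the array values and the 0.1 step size are arbitrary assumptions; the point is only the resulting shapes.

```python
import numpy as np

dZ = np.arange(6, dtype=float).reshape(3, 2)   # pretend gradients: n = 3 units, m = 2 examples
b = np.zeros((3, 1))
m = dZ.shape[1]

db_bad = np.sum(dZ, axis=1) / m                    # shape (3,)   1-D array
db_good = np.sum(dZ, axis=1, keepdims=True) / m    # shape (3, 1) column vector

# (3, 1) minus (3,) broadcasts to (3, 3): no error, but b is no longer a column
b_bad = b - 0.1 * db_bad
b_good = b - 0.1 * db_good

print(db_bad.shape, db_good.shape)
print(b_bad.shape, b_good.shape)
```

Note that an in-place update such as `b -= 0.1 * db_bad` would instead raise a broadcasting error, since the (3, 3) result cannot be stored back into a (3, 1) array; the silent variant appears when the result is assigned to a new name.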
Comparison Table: Logistic Regression vs. Shallow Neural Network
| Aspect | Logistic Regression (Ch 5) | Shallow Neural Network (Ch 6) |
|---|---|---|
| Architecture | Single neuron (no hidden layer) | 1+ hidden layer with multiple neurons |
| Decision Boundary | Linear (straight line/hyperplane) | Non-linear (curves, regions) |
| Parameters | W (n×1), b (scalar) | W[1], b[1], W[2], b[2] |
| Expressiveness | Can only learn linearly separable patterns | Can approximate any continuous function given enough hidden units (universal approximator) |
| XOR Problem | ✗ Cannot solve | ✓ Easily solved |
| Forward Pass | ŷ = σ(Wx + b), 1 step | 2 steps: hidden → output |
| Backprop | dW = (1/m)(A−Y)Xᵀ | Chain rule through 2 layers |
| Initialization | Zeros OK | Must be random (symmetry breaking) |
| Training Speed | Fast (convex optimization) | Slower (non-convex, more parameters) |
| Overfitting Risk | Low | Higher (more capacity) |
| Interpretability | High (feature weights directly interpretable) | Lower (hidden representations are abstract) |
| When to Use | Linearly separable data, baseline model | Non-linear data, feature interactions matter |
Exercises
Section A โ Multiple Choice Questions (10)
In a 2-layer neural network with 5 input features, 3 hidden units, and 1 output unit, what is the shape of W[1]?
- (5, 3)
- (3, 5)
- (1, 3)
- (3, 1)
Why is tanh preferred over sigmoid for hidden layers?
- tanh has a larger range [−2, 2]
- tanh outputs are zero-centered, leading to faster convergence
- tanh never has vanishing gradients
- tanh is computationally cheaper than sigmoid
What happens if all weights in a neural network are initialized to zero?
- The network learns faster due to simplicity
- All hidden neurons learn the same features, so symmetry is never broken
- Only the output layer is affected
- The biases compensate for the zero weights
What is the derivative of the tanh activation function?
- tanh(z) × (1 − tanh(z))
- 1 − tanh²(z)
- tanh(z) × (1 + tanh(z))
- z × (1 − z²)
In backpropagation for a 2-layer network, dZ[1] = ?
- A[1] − Y
- W[2]ᵀ · dZ[2] ⊙ g[1]'(Z[1])
- W[1]ᵀ · dZ[2] + b[1]
- dZ[2] · W[2] · g[1](Z[1])
If a network uses linear activations (g(z) = z) in all layers, what does it reduce to?
- A polynomial regression model
- A single-layer linear model (linear regression)
- A support vector machine
- A decision tree
What is the "dying ReLU" problem?
- ReLU neurons output NaN for large inputs
- Neurons with consistently negative pre-activation have zero gradient and never update
- ReLU causes exploding gradients in deep networks
- ReLU neurons become saturated at a maximum value
For a network with n[0]=10, n[1]=20, n[2]=1, how many total learnable parameters are there?
- 231
- 241
- 220
- 210
Why do we multiply weights by 0.01 during random initialization?
- To normalize the weights to unit variance
- To keep pre-activations small so activations start in the steep gradient region
- To prevent the network from overfitting
- To ensure the biases dominate initially
In vectorized forward propagation, b[1] has shape (n[1], 1) but Z[1] has shape (n[1], m). How does the addition Z[1] = W[1]·X + b[1] work?
- b[1] is tiled m times manually
- NumPy broadcasting copies b[1] across all m columns automatically
- A for-loop adds b[1] to each column
- b[1] is reshaped to (1, m) first
Section B โ Short Answer Questions (5)
Write the four equations of forward propagation for a 2-layer neural network (vectorized form). State the shape of each intermediate result assuming n[0]=3, n[1]=5, n[2]=1, and m=100 training examples.
Explain the symmetry problem in neural networks. What specific condition causes it, and what is the standard solution? Why can biases still be initialized to zero without causing this problem?
Compare sigmoid and tanh activation functions along four axes: output range, zero-centering, maximum derivative value, and recommended use case. Include the mathematical relationship between them.
A Paytm fraud detection model uses a shallow neural network with 15 input features and 10 hidden neurons. Calculate: (a) total number of parameters, (b) shape of each weight matrix and bias vector, (c) shape of Z[1] and A[2] when processing a batch of 500 transactions.
Explain the "dying ReLU" problem. How does Leaky ReLU address it? Write the formulas for both ReLU and Leaky ReLU, including their derivatives.
Section C: Long Answer Questions (3)
Full Backpropagation Derivation. For a 2-layer neural network with tanh hidden activation and sigmoid output activation, derive the complete set of backpropagation equations. Start from the binary cross-entropy loss J, and derive dZ[2], dW[2], db[2], dZ[1], dW[1], db[1] step by step using the chain rule. Show each intermediate step and verify the shapes.
Linear Activation Analysis. (a) Prove that a neural network with any number of hidden layers using linear activations g(z) = z is equivalent to a single linear transformation. Show the proof for a 3-layer network. (b) Does this mean linear activations are never useful? Discuss one scenario where a linear output activation is appropriate. (c) What is the minimum requirement for a neural network to learn XOR? Prove with a construction.
Activation Function Selection. You are building a meal recommendation system for Swiggy with these requirements: (a) Hidden layer for learning food preference patterns from 50 user features. (b) Output layer 1: probability of the user ordering (binary). (c) Output layer 2: predicted delivery time in minutes (continuous, 10-90 min). (d) Output layer 3: rating prediction (1-5 stars). For each layer/output, recommend an activation function with justification. Also discuss what would go wrong if you used sigmoid for all hidden layers in a network with 5 hidden layers.
Section D: Programming Exercises (2)
Activation Function Visualizer. Write a Python program that:
- Plots all 5 activation functions (sigmoid, tanh, ReLU, Leaky ReLU, ELU) on the same figure with z ∈ [-5, 5]
- Plots their derivatives on a second subplot
- Annotates each curve with its name and output range
- Uses a clean, publication-quality style with legend
Hidden Unit Ablation Study. Using the ShallowNeuralNetwork class from this chapter:
- Train the network on the XOR dataset with n_h = 1, 2, 4, 8, 16, 32, 64 hidden neurons
- For each, record: final accuracy, final cost, and number of epochs to reach 95% accuracy (or "N/A" if it never does)
- Plot: (a) accuracy vs. n_h, (b) cost curves for all values on the same plot
- Answer: What is the minimum n_h that can solve XOR? Does increasing n_h always help?
Section E: Mini-Project
Flipkart Product Classifier. Build a shallow neural network from scratch to classify Flipkart products into two categories (e.g., electronics vs. fashion) based on product title features:
- Data: Create a synthetic dataset of 1000 products with 5 features: title_length, has_brand_name (0/1), avg_word_length, number_count (digits in title), special_char_count
- Architecture: 5 → n_h → 1 (experiment with n_h = 4, 8, 16)
- Implement: The full NeuralNetwork class with both tanh and ReLU options for hidden layer
- Evaluate: Train/test split (80/20), report accuracy, plot cost curves and decision boundary (pick any 2 features for visualization)
- Compare: Against a logistic regression baseline. Is the neural network better? By how much?
- Report: Write a brief report (1 page) with findings, including which activation function and n_h worked best and why
Chapter Summary
🧠 Key Takeaways from Chapter 6
- Architecture: A shallow neural network has an input layer (layer 0), one hidden layer (layer 1), and an output layer (layer 2). We count layers by the number of weight matrices: 2.
- Notation: W[l] has shape (n[l], n[l-1]), b[l] has shape (n[l], 1). The gradient dW[l] always has the same shape as W[l].
- Forward propagation computes Z[l] = W[l]·A[l-1] + b[l], then A[l] = g[l](Z[l]). Vectorize over examples (m columns), loop only over layers.
- Activation functions matter: tanh > sigmoid for hidden layers (zero-centered, stronger gradients). ReLU is the modern default (its gradient does not saturate for positive inputs, and it is computationally cheap). Never use linear activations for hidden layers.
- Linear activations collapse: composition of linear functions = single linear function. Hidden layers with g(z) = z add no expressive power.
- Backpropagation uses the chain rule backwards: dZ[2] = A[2] - Y (output), dZ[1] = W[2]ᵀ·dZ[2] ⊙ g'(Z[1]) (hidden), where ⊙ is the element-wise product. Shape of dW[l] = shape of W[l].
- Random initialization is essential to break symmetry. Small weights (×0.01) keep pre-activations near zero, in the high-gradient region of sigmoid/tanh.
- Universal Approximation Theorem: One hidden layer with enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy, but "enough" may be exponentially many.
- Practical wisdom: Start with logistic regression as baseline. If it fails, try a shallow NN with 4-32 hidden neurons. Plot cost curves. Check shapes obsessively.
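The "linear activations collapse" takeaway can be checked numerically. The sketch below is only an illustration under assumed sizes (n[0]=3, n[1]=5, n[2]=1, m=100) and random parameters; the helper name `collapse_linear` is hypothetical and not part of the chapter's code.

```python
import numpy as np

def collapse_linear(W1, b1, W2, b2):
    """Fold two layers with hidden activation g(z) = z into one linear layer."""
    W_eq = W2 @ W1            # shape (n2, n0)
    b_eq = W2 @ b1 + b2       # shape (n2, 1)
    return W_eq, b_eq

rng = np.random.default_rng(0)
n0, n1, n2, m = 3, 5, 1, 100   # illustrative sizes, chosen for this sketch only
X = rng.standard_normal((n0, m))
W1, b1 = rng.standard_normal((n1, n0)), rng.standard_normal((n1, 1))
W2, b2 = rng.standard_normal((n2, n1)), rng.standard_normal((n2, 1))

# Network with a linear hidden layer: W2 (W1 X + b1) + b2
two_layer_linear = W2 @ (W1 @ X + b1) + b2

# The same outputs from a single layer with folded parameters
W_eq, b_eq = collapse_linear(W1, b1, W2, b2)
single_layer = W_eq @ X + b_eq
print(np.allclose(two_layer_linear, single_layer))   # True

# With tanh in the hidden layer, the folded layer no longer matches
two_layer_tanh = W2 @ np.tanh(W1 @ X + b1) + b2
print(np.allclose(two_layer_tanh, single_layer))
```

The first comparison holds for any parameter values because W[2](W[1]X + b[1]) + b[2] = (W[2]W[1])X + (W[2]b[1] + b[2]); the second shows that a non-linear hidden activation breaks this equivalence.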
Forward: Z[1] = W[1]X + b[1] → A[1] = tanh(Z[1]) → Z[2] = W[2]A[1] + b[2] → A[2] = σ(Z[2])
Backward: dZ[2] = A[2] - Y → dZ[1] = W[2]ᵀdZ[2] ⊙ (1 - A[1]²)
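The forward and backward summary above can be written as a minimal NumPy sketch. It assumes a tanh hidden layer, a sigmoid output, and binary cross-entropy cost. The dW and db lines use the usual 1/m averaging, which this summary does not spell out, and the function names (`init_params`, `forward`, `backward`, `cost`) are illustrative rather than the chapter's class API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n0, n1, n2, rng):
    # Small random weights break symmetry; zero biases are fine.
    W1 = rng.standard_normal((n1, n0)) * 0.01
    b1 = np.zeros((n1, 1))
    W2 = rng.standard_normal((n2, n1)) * 0.01
    b2 = np.zeros((n2, 1))
    return W1, b1, W2, b2

def forward(X, params):
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1          # (n1, m); b1 broadcasts across the m columns
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2         # (n2, m)
    A2 = sigmoid(Z2)
    return A1, A2

def cost(A2, Y):
    # Binary cross-entropy averaged over m examples (eps guards log(0)).
    m = Y.shape[1]
    eps = 1e-12
    return float(-np.sum(Y * np.log(A2 + eps) + (1 - Y) * np.log(1 - A2 + eps)) / m)

def backward(X, Y, params, A1, A2):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    dZ2 = A2 - Y                                  # output layer
    dW2 = dZ2 @ A1.T / m                          # assumed 1/m averaging
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)            # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

# Usage on a tiny XOR-like batch (synthetic data for illustration only)
rng = np.random.default_rng(1)
X = rng.standard_normal((2, 8))
Y = (X[0:1] * X[1:2] > 0).astype(float)
params = init_params(2, 4, 1, rng)
A1, A2 = forward(X, params)
grads = backward(X, Y, params, A1, A2)
for g, p in zip(grads, params):
    assert g.shape == p.shape   # each gradient matches its parameter's shape
```

Returning the gradients in the same order as the parameters makes the "shape of dW[l] = shape of W[l]" check a one-line loop, which is a cheap guard to keep in place while debugging.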
References & Further Reading
Primary Textbooks
- Goodfellow, Bengio & Courville (2016). Deep Learning. MIT Press. Chapter 6 (Deep Feedforward Networks): the definitive reference for activation functions and network architectures. Free at deeplearningbook.org.
- Andrew Ng, Coursera Deep Learning Specialization. Course 1, Weeks 3-4: Shallow Neural Networks and Deep Neural Networks. The notation used in this chapter follows Ng's conventions.
- Michael Nielsen (2015). Neural Networks and Deep Learning. Free online book (neuralnetworksanddeeplearning.com). Chapter 1 has excellent visualizations of how neural networks learn.
Landmark Papers
- Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function." Mathematics of Control, Signals, and Systems. The Universal Approximation Theorem.
- Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks." Neural Networks. Extended the theorem to arbitrary bounded, nonconstant activation functions.
- Nair, V. & Hinton, G. (2010). "Rectified Linear Units Improve Restricted Boltzmann Machines." ICML. The paper that popularized ReLU.
- Glorot, X. & Bengio, Y. (2010). "Understanding the Difficulty of Training Deep Feedforward Neural Networks." AISTATS. Xavier initialization.
- Clevert, D., Unterthiner, T. & Hochreiter, S. (2016). "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." ICLR.
Indian Industry Context
- Mu Sigma: musigma.com. Case studies in retail analytics, CPG, and insurance. Blog articles on decision science methodology.
- NASSCOM AI Report (2024): India's analytics industry overview, covering growth trends, talent pipeline, and adoption across sectors.
- NPTEL Deep Learning Course (IIT Madras): Prof. Mitesh Khapra's course covers shallow networks in Weeks 4-5 with excellent Hindi/English explanations (nptel.ac.in).
- IndiaAI Portal: indiaai.gov.in. Government AI resources, datasets, and case studies from Indian industry.
Visualization Tools
- TensorFlow Playground: playground.tensorflow.org. Interactively build and train shallow networks. See how hidden neurons create decision boundaries in real time. Start with the XOR dataset!
- 3Blue1Brown, Neural Networks series (YouTube): Grant Sanderson's visual explanations of forward/backward propagation are the best visual resource available.