Neural Networks & Deep Learning
Chapter 13: Recurrent Neural Networks — Memory in Networks
Teaching Neural Networks to Remember the Past, Understand Sequences & Predict the Future
⏱️ Reading Time: ~4 hours | 📖 Part IV: Architectures | 🧠 Theory + Code Chapter
📋 Prerequisites: Chapters 7–8 (Deep Networks, Backpropagation, Optimization)
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| 🔵 Remember | Recall the RNN cell equations, BPTT algorithm, and the definitions of vanishing/exploding gradients |
| 🔵 Understand | Explain why feedforward networks fail on sequential data, how hidden states carry memory, and why gradients vanish over long sequences |
| 🟢 Apply | Implement an RNN cell from scratch in NumPy, train a character-level language model, and apply gradient clipping |
| 🟡 Analyze | Trace gradient flow through an unrolled RNN, compute BPTT for a 3-step sequence, and identify vanishing gradient signatures |
| 🟠 Evaluate | Choose the right RNN architecture (one-to-many, many-to-one, many-to-many) for a given task, and assess when vanilla RNNs are sufficient vs. when LSTMs/GRUs are needed |
| 🔴 Create | Design an end-to-end time-series forecasting pipeline for grocery demand prediction and build a sentiment classifier using RNNs |
Learning Objectives
By the end of this chapter, you will be able to:
- Define sequential data and explain why standard feedforward networks cannot capture temporal dependencies
- Derive the RNN forward equations a⟨t⟩ = tanh(W_aa · a⟨t-1⟩ + W_ax · x⟨t⟩ + b_a) and ŷ⟨t⟩ = softmax(W_ya · a⟨t⟩ + b_y)
- Classify RNN architectures (one-to-one, one-to-many, many-to-one, many-to-many) and map each to real tasks
- Derive Backpropagation Through Time (BPTT) with full gradient chain-rule expansions
- Explain the vanishing gradient problem mathematically: why ∏ diag(tanh'(·)) · W_aa shrinks exponentially
- Apply gradient clipping to prevent exploding gradients during training
- Formulate language modeling as P(sentence) = ∏ P(word_t | word_1, …, word_{t-1}) and train a character-level model
- Implement an RNN cell from scratch using NumPy: forward pass, BPTT, and text generation
- Build a time-series forecasting model for real-world demand prediction using PyTorch
- Evaluate when vanilla RNNs are insufficient and motivate the need for LSTM/GRU (covered in Chapter 14)
Opening Hook — The IRCTC Durga Puja Rush
🚂 How Does IRCTC Know 14 Lakh Tickets Will Be Booked Tomorrow?
Every year during Durga Puja, IRCTC faces a tsunami of ticket bookings. Kolkata-bound trains from Delhi, Mumbai, Bangalore, and Chennai see demand spike by 300–500% in the weeks before the festival. In 2024, IRCTC handled over 25 lakh bookings in a single day during peak season.
Here's the key insight: today's ticket demand depends on what happened yesterday, last week, and last month. If 8 lakh tickets were booked on Monday, 10 lakh on Tuesday, and 12 lakh on Wednesday — you can feel the crescendo building toward the weekend rush. A standard feedforward network sees each day in isolation. It has no memory.
What if the network could remember? What if it could carry forward information from Monday → Tuesday → Wednesday and use that accumulated context to predict Thursday's demand?
That is exactly what a Recurrent Neural Network (RNN) does. It adds a hidden state — a memory vector — that gets updated at every time step. The RNN doesn't just see "12 lakh tickets on Wednesday." It sees "8 → 10 → 12 lakh, with increasing momentum, during pre-Puja season, with weather turning pleasant" — a sequence.
Core Concepts
13.1 — Why Sequential Data Needs Memory
A feedforward network processes a fixed-size input and produces a fixed-size output. It has no notion of order. Consider these two sentences:
- "Sachin hit the ball" → Positive (cricket commentary)
- "The ball hit Sachin" → Negative (injury report)
Both sentences contain the exact same words. A bag-of-words feedforward model cannot distinguish them. Order matters.
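A minimal sketch of why word counts alone lose this distinction. It assumes a naive lowercase, whitespace-based tokenizer, and the `bag_of_words` helper is a hypothetical name used only for illustration:

```python
from collections import Counter

def bag_of_words(sentence):
    """Count words, ignoring case and order (illustrative tokenizer)."""
    return Counter(sentence.lower().split())

s1 = "Sachin hit the ball"
s2 = "The ball hit Sachin"

same_bag = bag_of_words(s1) == bag_of_words(s2)            # order is discarded
same_sequence = s1.lower().split() == s2.lower().split()   # order is kept

print(same_bag, same_sequence)  # → True False
```

Any model that only receives the counts sees identical inputs for both sentences, so it cannot assign them different labels.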
Types of Sequential Data
| Data Type | Sequence Dimension | Indian Example |
|---|---|---|
| Time Series | Values over time steps | Sensex daily closing prices, IRCTC daily bookings |
| Text / NLP | Words/characters in order | Hindi poetry, Amazon India product reviews |
| Audio / Speech | Sound amplitude samples | Alexa Hindi voice commands, Bhashini translations |
| Video | Frames over time | CCTV footage analysis for Smart Cities Mission |
| Sensor Data | IoT readings over time | ISRO satellite telemetry, AgriStack soil moisture |
| User Behavior | Click/action sequences | Flipkart browsing history, Hotstar watch sequences |
The Fundamental Limitation of Feedforward Networks
A feedforward network maps input x to output y: y = f(x). For sequential data at time t, we need:
ŷ⟨t⟩ = f(x⟨1⟩, x⟨2⟩, …, x⟨t⟩)
This means the output at time t depends on the current input and on all previous inputs. We could concatenate all past inputs into one giant vector, but this approach has three fatal problems:
- Variable-length sequences: Sentences have different lengths; we'd need a different network for each length
- No parameter sharing: The weight connecting "word at position 3" to the output is different from "word at position 7" — the network can't generalize across positions
- Computational explosion: For a 500-word review, the input vector is 500 × embedding_dim — enormous!
13.2 — The RNN Cell: A Neuron with Memory
The core idea of an RNN is breathtakingly simple: the output of the hidden layer at time step t is fed back as an additional input at time step t+1.
🧠 The RNN Cell Equations
Hidden state update:
a⟨t⟩ = tanh(W_aa · a⟨t-1⟩ + W_ax · x⟨t⟩ + b_a)
Where:
- a⟨t⟩: hidden state (activation) at time step t, shape (n_a, 1)
- a⟨t-1⟩: hidden state from the previous time step, shape (n_a, 1)
- x⟨t⟩: input at current time step, shape (n_x, 1)
- W_aa: weight matrix for hidden-to-hidden connection, shape (n_a, n_a)
- W_ax: weight matrix for input-to-hidden connection, shape (n_a, n_x)
- b_a: bias vector, shape (n_a, 1)
Output:
ŷ⟨t⟩ = softmax(W_ya · a⟨t⟩ + b_y)
Where:
- ŷ⟨t⟩: predicted output at time step t, shape (n_y, 1)
- W_ya: weight matrix for hidden-to-output, shape (n_y, n_a)
- b_y: output bias vector, shape (n_y, 1)
Compact form:
a⟨t⟩ = tanh(W_a · [a⟨t-1⟩; x⟨t⟩] + b_a)
Here W_a = [W_aa | W_ax] is a concatenated weight matrix of shape (n_a, n_a + n_x), and [a⟨t-1⟩; x⟨t⟩] is the vertical concatenation of the previous hidden state and current input.
At t = 0, we initialize a⟨0⟩ as a zero vector: a⟨0⟩ = 0⃗. This represents "no memory before the sequence starts."
Why tanh and Not ReLU?
The hidden state is computed by repeatedly multiplying by W_aa and applying an activation. With ReLU (unbounded), the hidden state can grow without bound over long sequences. The tanh function squashes values into (-1, +1), keeping the hidden state bounded. However, tanh saturates for large inputs, where its derivative is close to zero, and this contributes to the vanishing gradient problem (Section 13.5).
The same weights W_aa, W_ax, W_ya are used at every time step. This means:
- The RNN can handle sequences of any length — same parameters, just more steps
- Patterns learned at position 5 transfer to position 500
- Total parameters are independent of sequence length
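The last point can be checked with a short count. This is a sketch based on the shapes given in the cell equations; `rnn_param_count` is a hypothetical helper and the sizes are arbitrary example values:

```python
def rnn_param_count(n_x, n_a, n_y):
    """Parameters of one vanilla RNN cell: W_aa, W_ax, b_a, W_ya, b_y."""
    return n_a * n_a + n_a * n_x + n_a + n_y * n_a + n_y

# The sequence length T never appears in the formula, so the count is the
# same whether the sequence has 5 steps or 500.
count = rnn_param_count(n_x=50, n_a=128, n_y=50)
print(count)  # → 29362
```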
The Hidden State as a "Summary"
Think of a⟨t⟩ as a compressed summary of everything the network has seen from time step 1 through time step t. It's like reading a 500-page novel and trying to carry the plot summary in your head — at each new page, you update your mental summary based on the new information and your existing understanding.
W_aa, W_ax, W_ya, b_a, b_y are reused at every time step. This is called weight tying or parameter sharing across time. It's what makes RNNs generalizable to variable-length sequences.
13.3 — RNN Architectures: One-to-Many, Many-to-One, Many-to-Many
The flexibility of RNNs comes from how we configure inputs and outputs across time steps. There are five fundamental architectures:
📐 RNN Architecture Taxonomy
1. One-to-One: T_x = T_y = 1. This is just a regular feedforward network with no recurrence. Included for completeness.
Example: Image classification (single image → single label).
2. One-to-Many: T_x = 1, T_y > 1. A single input generates a sequence of outputs. The input is fed only at t=1; subsequent steps receive no external input (or the previous output is fed as input).
Example: Music generation (seed note → melody), image captioning (image → "A man riding a bicycle on Marine Drive").
3. Many-to-One: T_x > 1, T_y = 1. A sequence of inputs produces a single output. The output is taken only from the last time step.
Example: Sentiment analysis ("Flipkart delivery was amazing!" → ⭐⭐⭐⭐⭐), spam detection on Hindi SMS.
4. Many-to-Many (Same Length): T_x = T_y. Each input time step produces a corresponding output at the same time step.
Example: Named Entity Recognition ("Narendra Modi visited Varanasi" → [PERSON, PERSON, O, LOCATION]), POS tagging in Hindi.
5. Many-to-Many (Different Length), Encoder-Decoder: T_x ≠ T_y. An encoder RNN reads the input sequence and compresses it into a context vector. A decoder RNN generates the output sequence from this context.
Example: Machine translation (Hindi → English on Google Translate), text summarization of news articles on Inshorts.
| Architecture | T_x | T_y | Indian Industry Use Case |
|---|---|---|---|
| One-to-One | 1 | 1 | Aadhaar fingerprint → identity match |
| One-to-Many | 1 | T | Album artwork → AI-generated Bollywood song lyrics |
| Many-to-One | T | 1 | Amazon India review text → star rating prediction |
| Many-to-Many (=) | T | T | Hindi sentence → POS tags (noun, verb, adjective…) |
| Many-to-Many (≠) | T_x | T_y | Bhashini: Hindi speech → Tamil text translation |
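As a rough illustration, the many-to-one and same-length many-to-many configurations run the same recurrence and differ only in which hidden states are read out. The sketch below uses a scalar tanh RNN with arbitrary weights chosen for illustration; `run_rnn` is a hypothetical helper:

```python
import math

def run_rnn(xs, w_aa=0.5, w_ax=1.0, a0=0.0):
    """Scalar tanh RNN; returns the hidden state after every input step."""
    states, a = [], a0
    for x in xs:
        a = math.tanh(w_aa * a + w_ax * x)
        states.append(a)
    return states

xs = [0.2, -0.1, 0.4, 0.3]
states = run_rnn(xs)

many_to_many = states      # one readout per input step (T_y = T_x)
many_to_one = states[-1]   # only the final state feeds the output layer
```

In a full model each selected state would still pass through the output layer (W_ya and softmax); only the selection step is shown here.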
13.4 — Backpropagation Through Time (BPTT)
Training an RNN uses a variant of backpropagation called Backpropagation Through Time (BPTT). The idea: "unroll" the RNN across all time steps to create a deep feedforward network, then apply standard backpropagation.
Step 1: Forward Pass (Unrolled)
For a sequence of length T:
# At each time step t = 1, 2, ..., T:
a⟨t⟩ = tanh(W_aa · a⟨t-1⟩ + W_ax · x⟨t⟩ + b_a)
ŷ⟨t⟩ = softmax(W_ya · a⟨t⟩ + b_y)
L⟨t⟩ = -Σ_i y_i⟨t⟩ · log(ŷ_i⟨t⟩) # Cross-entropy at step t
Step 2: Total Loss
The total loss is the sum of the per-step losses:
L = Σ_{t=1}^{T} L⟨t⟩
Step 3: Backward Pass — Gradient Derivation
The key challenge: W_aa affects every time step (because it's shared). So its gradient accumulates contributions from all time steps.
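Gradient w.r.t. Output Weights (Simple)
For softmax with cross-entropy, the gradient with respect to the output pre-activation is ŷ⟨t⟩ − y⟨t⟩, so:
∂L/∂W_ya = Σ_{t=1}^{T} (ŷ⟨t⟩ − y⟨t⟩) · a⟨t⟩ᵀ
∂L/∂b_y = Σ_{t=1}^{T} (ŷ⟨t⟩ − y⟨t⟩)
Gradient w.r.t. Hidden State at Final Step
∂L/∂a⟨T⟩ = W_yaᵀ · (ŷ⟨T⟩ − y⟨T⟩)
For an earlier step t < T, the hidden state also affects the loss through a⟨t+1⟩:
∂L/∂a⟨t⟩ = W_yaᵀ · (ŷ⟨t⟩ − y⟨t⟩) + W_aaᵀ · [(1 − a⟨t+1⟩²) ⊙ ∂L/∂a⟨t+1⟩]
Notice the recursive structure: the gradient at time t depends on the gradient at time t+1. This is why it's called "backpropagation through time": we propagate gradients backward from T to 1.
Gradient w.r.t. Shared Recurrent Weights
∂L/∂W_aa = Σ_{t=1}^{T} δ⟨t⟩ · a⟨t-1⟩ᵀ
where δ⟨t⟩ is the backpropagated error signal at time step t, computed recursively from the hidden-state gradient above:
δ⟨t⟩ = (1 − a⟨t⟩²) ⊙ ∂L/∂a⟨t⟩
Gradient w.r.t. Input Weights
∂L/∂W_ax = Σ_{t=1}^{T} δ⟨t⟩ · x⟨t⟩ᵀ
∂L/∂b_a = Σ_{t=1}^{T} δ⟨t⟩
Because W_aa, W_ax, W_ya are shared across time steps, their gradients are the sum of contributions from all time steps.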
Truncated BPTT
For very long sequences (e.g., a 10,000-word document), unrolling the full graph is computationally prohibitive. Truncated BPTT limits the backward pass to a fixed window of k time steps (typically k = 20–50). This sacrifices gradient accuracy for long-range dependencies but makes training tractable.
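One way to picture the windowing is sketched below. The helper name `tbptt_windows` is hypothetical; the forward and backward passes that would run inside each window are omitted:

```python
def tbptt_windows(seq_len, k):
    """Yield (start, end) index pairs covering the sequence in windows of at most k steps."""
    for start in range(0, seq_len, k):
        yield start, min(start + k, seq_len)

windows = list(tbptt_windows(seq_len=10, k=4))
print(windows)  # → [(0, 4), (4, 8), (8, 10)]
```

In a typical setup the hidden state value is carried from one window to the next, while gradients are not propagated across the window boundary.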
13.5 — The Vanishing & Exploding Gradient Problem
This is the Achilles' heel of vanilla RNNs and the reason LSTMs/GRUs were invented.
The Mathematical Root Cause
Consider the gradient of the loss at time step T with respect to the hidden state at an earlier time step k:
∂L⟨T⟩/∂a⟨k⟩ = ∂L⟨T⟩/∂a⟨T⟩ · ∏_{t=k+1}^{T} ∂a⟨t⟩/∂a⟨t-1⟩, where ∂a⟨t⟩/∂a⟨t-1⟩ = diag(tanh'(z_a⟨t⟩)) · W_aa
This is a product of (T - k) matrices. Let's analyze what happens:
Case 1: Vanishing Gradient
The tanh' derivative satisfies 0 < tanh'(z) ≤ 1. If the largest singular value of W_aa is less than 1 (i.e., σ_max(W_aa) < 1), then:
‖∏_{t=k+1}^{T} diag(tanh'(z_a⟨t⟩)) · W_aa‖ ≤ σ_max(W_aa)^(T−k) → 0 as (T − k) grows
The gradient decays exponentially with the distance between time steps. Information from 100 steps ago has virtually zero gradient signal, so the network cannot learn long-range dependencies.
Practical Illustration
Consider the sentence: "The students, who came from Varanasi and had been studying Sanskrit for many years under the guidance of Professor Sharma at the Banaras Hindu University, were brilliant."
The verb "were" (plural) must agree with "students" (plural), but they are separated by more than 20 words. A vanilla RNN's gradient from "were" to "students" passes through more than 20 matrix multiplications and typically shrinks to near-zero, so the network struggles to learn this long-distance agreement.
Case 2: Exploding Gradient
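If σ_max(W_aa) > 1, the gradient can grow exponentially, with the product of Jacobians scaling as fast as σ_max(W_aa)^(T−k). This condition is necessary rather than sufficient, because saturated tanh units shrink the product.
When it happens, it causes NaN values in weights, loss spikes to infinity, and training crashes.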
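Both regimes can be seen numerically in a scalar RNN, where the product of Jacobians reduces to a product of numbers. This is a sketch under simplifying assumptions (one hidden unit, constant input, illustrative weights); `recurrent_gradient_factor` is a hypothetical helper:

```python
import math

def recurrent_gradient_factor(w_aa, xs, a0=0.0):
    """d a<T> / d a<0> for the scalar RNN a<t> = tanh(w_aa * a<t-1> + x<t>)."""
    a, factor = a0, 1.0
    for x in xs:
        a = math.tanh(w_aa * a + x)
        factor *= (1.0 - a * a) * w_aa   # tanh'(z<t>) * w_aa
    return factor

xs = [0.5] * 50
vanishing = recurrent_gradient_factor(0.9, xs)   # actual factor after 50 steps
upper_bound_small = 0.9 ** 50                    # bound when |w_aa| < 1, about 5e-3
upper_bound_large = 1.1 ** 50                    # bound when |w_aa| > 1, above 100
```

The actual factor is far below the bound because tanh' is well under 1 once the unit starts to saturate. The bound for |w_aa| > 1 shows only that growth becomes possible, not that it must occur.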
Gradient Clipping: Solving Exploding Gradients
The fix for exploding gradients is surprisingly simple: clip the gradient norm.
If ‖g‖ > threshold, then g ← g · (threshold / ‖g‖)
This rescales the gradient to have maximum norm = threshold (typically 1.0 or 5.0), preserving direction but limiting magnitude.
# Gradient clipping in NumPy
import numpy as np

def clip_gradients(gradients, max_norm=5.0):
    """Clip gradients to prevent exploding gradient problem."""
    # Global L2 norm across every gradient array in the dict
    total_norm = np.sqrt(sum(np.sum(g**2) for g in gradients.values()))
    if total_norm > max_norm:
        clip_coef = max_norm / (total_norm + 1e-6)
        for key in gradients:
            gradients[key] *= clip_coef
    return gradients
13.6 — Language Modeling with RNNs
A language model assigns a probability to a sequence of words (or characters). It answers the question: "How likely is this sentence?"
📖 The Language Model Formulation
P(w_1, w_2, …, w_T) = ∏_{t=1}^{T} P(w_t | w_1, …, w_{t-1})
Example:
P("नमस्ते आप कैसे हैं") = P(नमस्ते) × P(आप | नमस्ते) × P(कैसे | नमस्ते, आप) × P(हैं | नमस्ते, आप, कैसे)
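In code, the product is usually accumulated as a sum of logs to avoid underflow on long sentences. A small sketch follows; the conditional probabilities are made-up numbers, not outputs of a trained model, and `sentence_log_prob` is a hypothetical helper:

```python
import math

def sentence_log_prob(conditional_probs):
    """Sum of log P(w_t | w_1..w_{t-1}); the probabilities come from a model."""
    return sum(math.log(p) for p in conditional_probs)

# Hypothetical conditional probabilities for a 4-token sentence.
probs = [0.1, 0.5, 0.4, 0.8]
log_p = sentence_log_prob(probs)
p_sentence = math.exp(log_p)   # equals 0.1 * 0.5 * 0.4 * 0.8
```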
At each time step t:
- Input: x⟨t⟩ = w_t (current word/character, one-hot or embedded)
- Target: y⟨t⟩ = w_{t+1} (next word/character)
- Output: ŷ⟨t⟩ = softmax(W_ya · a⟨t⟩ + b_y) gives a probability distribution over the vocabulary for the next token
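Language models are commonly evaluated by perplexity, the exponential of the average per-token cross-entropy: Perplexity = exp(−(1/T) · Σ_{t=1}^{T} log P(w_t | w_1, …, w_{t-1})). Lower perplexity = better model. A perplexity of k means the model is as uncertain as choosing uniformly among k options at each step.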
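The "uniform among k options" reading can be checked directly. The sketch takes perplexity to be the exponential of the average negative log-probability of the correct tokens (the standard definition, stated here as an assumption); the probabilities are made up:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the correct tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

uniform_over_4 = perplexity([0.25] * 10)   # as uncertain as a 4-way uniform choice
confident = perplexity([0.9] * 10)         # close to 1: the model is rarely surprised
```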
Character-Level vs. Word-Level Models
| Aspect | Character-Level | Word-Level |
|---|---|---|
| Vocabulary Size | ~70–200 characters | 30,000–100,000 words |
| Sequence Length | Very long (each char is a step) | Shorter (each word is a step) |
| Handles Typos / New Words | Yes (sees individual characters) | No (unknown words = UNK) |
| Hindi / Devanagari | Works naturally with Unicode chars | Requires word tokenizer |
| Training Speed | Slower (longer sequences) | Faster per sequence |
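A quick way to see the vocabulary and sequence-length trade-off on a Hindi string is sketched below. It assumes whitespace splitting for words and one token per Unicode code point for characters, both of which are simplifications:

```python
text = "नमस्ते आप कैसे हैं"

word_tokens = text.split()   # whitespace word tokenizer (a simplification)
char_tokens = list(text)     # one token per Unicode code point

print(len(word_tokens), len(char_tokens))
print(char_tokens[:6])       # vowel signs and the virama show up as separate tokens
```

Devanagari vowel signs and the virama are separate code points, so a character-level model sees more steps than the number of visible syllables.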
Sampling / Text Generation
Once trained, we generate text by repeatedly sampling from the predicted distribution:
- Feed a seed character (e.g., "क") as x⟨1⟩
- Compute ŷ⟨1⟩ = softmax(…), a probability distribution over all characters
- Sample the next character from this distribution: x⟨2⟩ ~ ŷ⟨1⟩
- Feed x⟨2⟩ back, compute ŷ⟨2⟩, sample x⟨3⟩, and repeat
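Sampling can be controlled with a temperature τ, which divides the logits before the softmax: P(i) = exp(z_i / τ) / Σ_j exp(z_j / τ).
- τ → 0: output becomes deterministic (always picks the most likely character, i.e. "greedy")
- τ = 1: standard softmax (samples according to learned probabilities)
- τ → ∞: output becomes uniform random (completely random characters)
A common choice is τ ≈ 0.7–0.8 for creative text generation with reasonable coherence.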
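The effect of τ is easy to see on a fixed set of logits. This is a sketch using only the standard library; the logits are arbitrary example values and `softmax_with_temperature` is a hypothetical helper:

```python
import math

def softmax_with_temperature(logits, tau):
    """Softmax over logits divided by temperature tau (tau > 0)."""
    scaled = [z / tau for z in logits]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.1)     # close to greedy
standard = softmax_with_temperature(logits, 1.0)  # ordinary softmax
flat = softmax_with_temperature(logits, 100.0)    # close to uniform
```

The ordering of the probabilities never changes with τ; only how strongly the top choice dominates.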
From-Scratch Code — RNN in NumPy
We'll implement a complete RNN from scratch: forward pass, BPTT, gradient clipping, and character-level text generation — trained on Hindi poetry.
4.1 — RNN Cell: Forward Step
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e_z = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e_z / np.sum(e_z, axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, parameters):
    """
    Single RNN cell forward step.
    Args:
        xt: input at time step t, shape (n_x, 1)
        a_prev: hidden state from previous step, shape (n_a, 1)
        parameters: dict with W_aa, W_ax, W_ya, b_a, b_y
    Returns:
        a_next: new hidden state, shape (n_a, 1)
        yt_hat: predicted output (softmax), shape (n_y, 1)
        cache: values needed for backpropagation
    """
    W_aa = parameters['W_aa']
    W_ax = parameters['W_ax']
    W_ya = parameters['W_ya']
    b_a = parameters['b_a']
    b_y = parameters['b_y']
    # Hidden state update: a⟨t⟩ = tanh(W_aa · a⟨t-1⟩ + W_ax · x⟨t⟩ + b_a)
    z_a = np.dot(W_aa, a_prev) + np.dot(W_ax, xt) + b_a
    a_next = np.tanh(z_a)
    # Output prediction: ŷ⟨t⟩ = softmax(W_ya · a⟨t⟩ + b_y)
    z_y = np.dot(W_ya, a_next) + b_y
    yt_hat = softmax(z_y)
    cache = (a_next, a_prev, xt, parameters)
    return a_next, yt_hat, cache
4.2 — Full Forward Pass Over a Sequence
def rnn_forward(X, a0, parameters):
    """
    Full forward pass through the RNN for a sequence.
    Args:
        X: list of one-hot input vectors, length T
        a0: initial hidden state, shape (n_a, 1)
        parameters: dict with W_aa, W_ax, W_ya, b_a, b_y
    Returns:
        y_hats: list of output probabilities at each step
        caches: list of caches for BPTT
        a_states: list of hidden states (for analysis)
    """
    caches = []
    y_hats = []
    a_states = [a0]
    a_prev = a0
    for t in range(len(X)):
        a_next, yt_hat, cache = rnn_cell_forward(X[t], a_prev, parameters)
        y_hats.append(yt_hat)
        caches.append(cache)
        a_states.append(a_next)
        a_prev = a_next
    return y_hats, caches, a_states
4.3 — BPTT: Backward Pass
def rnn_cell_backward(dy, da_next_from_future, cache):
    """
    Single RNN cell backward step.
    Args:
        dy: gradient from loss at time step t, shape (n_y, 1)
        da_next_from_future: gradient flowing from future time steps
        cache: from forward pass
    Returns:
        gradients: dict with dW_aa, dW_ax, dW_ya, db_a, db_y
        da_prev: gradient to pass to time step t-1
    """
    a_next, a_prev, xt, parameters = cache
    W_aa = parameters['W_aa']
    W_ya = parameters['W_ya']
    # Gradient of loss w.r.t. output layer
    dW_ya = np.dot(dy, a_next.T)
    db_y = dy.copy()
    # Gradient flowing into hidden state from output + future
    da = np.dot(W_ya.T, dy) + da_next_from_future
    # Gradient through tanh: dtanh = (1 - a^2) * da
    dz = (1 - a_next ** 2) * da
    # Gradient w.r.t. parameters
    dW_aa = np.dot(dz, a_prev.T)
    dW_ax = np.dot(dz, xt.T)
    db_a = dz.copy()
    # Gradient to pass to previous time step
    da_prev = np.dot(W_aa.T, dz)
    gradients = {
        'dW_aa': dW_aa, 'dW_ax': dW_ax, 'dW_ya': dW_ya,
        'db_a': db_a, 'db_y': db_y
    }
    return gradients, da_prev

def rnn_backward(X, Y, y_hats, caches, parameters):
    """
    Full BPTT backward pass.
    Args:
        X: list of one-hot inputs
        Y: list of one-hot targets
        y_hats: predicted outputs from forward pass
        caches: caches from forward pass
        parameters: model parameters
    Returns:
        gradients: accumulated gradients for all parameters
        loss: total cross-entropy loss
    """
    T = len(X)
    n_a = parameters['W_aa'].shape[0]
    # Initialize accumulated gradients ('dW_aa' -> 'W_aa', 'db_a' -> 'b_a', ...)
    grads = {k: np.zeros_like(parameters[k.replace('d', '', 1)])
             for k in ['dW_aa', 'dW_ax', 'dW_ya', 'db_a', 'db_y']}
    da_next = np.zeros((n_a, 1))
    loss = 0.0
    # Backward through time: from T-1 down to 0
    for t in reversed(range(T)):
        # Cross-entropy gradient: dy = ŷ - y
        dy = y_hats[t] - Y[t]
        # Loss at this step
        loss -= np.sum(Y[t] * np.log(y_hats[t] + 1e-8))
        # Backward through RNN cell
        cell_grads, da_next = rnn_cell_backward(dy, da_next, caches[t])
        # Accumulate gradients (sum over time)
        for key in grads:
            grads[key] += cell_grads[key]
    return grads, loss
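A standard sanity check for a hand-written backward pass is to compare it against finite differences. The sketch below does this for a scalar RNN with a squared-error loss on the final state, which is a simplification of the chapter's softmax model chosen to keep the example short. All names and values are illustrative:

```python
import math

def forward(w, u, xs, a0=0.0):
    """Scalar RNN a<t> = tanh(w * a<t-1> + u * x<t>); returns [a<0>, ..., a<T>]."""
    states = [a0]
    for x in xs:
        states.append(math.tanh(w * states[-1] + u * x))
    return states

def loss(w, u, xs, target):
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt_grad_w(w, u, xs, target):
    """dL/dw by backpropagation through time."""
    states = forward(w, u, xs)
    da = states[-1] - target                  # dL/da<T>
    dw = 0.0
    for t in range(len(xs), 0, -1):
        dz = (1.0 - states[t] ** 2) * da      # back through tanh
        dw += dz * states[t - 1]              # contribution of step t
        da = w * dz                           # gradient passed to a<t-1>
    return dw

def numeric_grad_w(w, u, xs, target, eps=1e-6):
    """Central finite-difference estimate of dL/dw."""
    return (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)

xs, target = [1.0, -0.5, 0.25], 0.2
analytic = bptt_grad_w(0.5, 0.3, xs, target)
numeric = numeric_grad_w(0.5, 0.3, xs, target)
```

If the two values disagree by more than a small tolerance, the backward pass usually has an indexing or sign error. The same idea applies to each entry of W_aa in the NumPy implementation.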
4.4 — Gradient Clipping
def clip_gradients(gradients, max_norm=5.0):
    """Clip gradient norms to prevent exploding gradients."""
    total_norm = np.sqrt(
        sum(np.sum(gradients[k] ** 2) for k in gradients)
    )
    if total_norm > max_norm:
        clip_coef = max_norm / (total_norm + 1e-6)
        for k in gradients:
            gradients[k] *= clip_coef
    return gradients
4.5 — Character-Level Language Model on Hindi Poetry
# ─── Data Preparation ───
# Hindi poetry corpus (sample)
corpus = """
मधुशाला
मृदु भावों के अंगूरों की आज बना लाया हाला
प्रियतम के लिए कुसुम-कलियों का प्याला
भर लाया बन-बन में फिर-फिर हाथ फेरकर
अपने ही हाथों आज बना लाया मधुशाला
हाथ में मानिक-मधु का प्याला अमृत-मय
मैं पीने को आतुर देखकर इसे अपनाय
कामिनी ने अधर-पट खोले मुस्काय
और मुझको पिला रही है मधुशाला
"""

# Build character vocabulary
chars = sorted(set(corpus))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
print(f"Vocabulary size: {vocab_size} characters")
print(f"Sample chars: {chars[:10]}")

# One-hot encoding helper
def one_hot(idx, size):
    vec = np.zeros((size, 1))
    vec[idx] = 1.0
    return vec

# ─── Initialize Parameters ───
np.random.seed(42)
n_a = 128  # hidden state size
n_x = vocab_size
n_y = vocab_size
parameters = {
    'W_aa': np.random.randn(n_a, n_a) * 0.01,
    'W_ax': np.random.randn(n_a, n_x) * 0.01,
    'W_ya': np.random.randn(n_y, n_a) * 0.01,
    'b_a': np.zeros((n_a, 1)),
    'b_y': np.zeros((n_y, 1))
}

# ─── Training Loop ───
learning_rate = 0.01
seq_length = 25  # truncated BPTT window
num_epochs = 500
for epoch in range(num_epochs):
    total_loss = 0.0
    num_chunks = 0
    a_prev = np.zeros((n_a, 1))
    # Slide through corpus in chunks of seq_length
    for start in range(0, len(corpus) - seq_length - 1, seq_length):
        # Prepare input-target pairs (target is the next character)
        X_seq = [one_hot(char_to_idx[corpus[start + t]], n_x)
                 for t in range(seq_length)]
        Y_seq = [one_hot(char_to_idx[corpus[start + t + 1]], n_y)
                 for t in range(seq_length)]
        # Forward pass
        y_hats, caches, a_states = rnn_forward(X_seq, a_prev, parameters)
        # Backward pass (BPTT)
        gradients, loss = rnn_backward(X_seq, Y_seq, y_hats, caches, parameters)
        total_loss += loss
        num_chunks += 1
        # Clip gradients
        gradients = clip_gradients(gradients, max_norm=5.0)
        # Update parameters (SGD)
        for key in ['W_aa', 'W_ax', 'W_ya', 'b_a', 'b_y']:
            parameters[key] -= learning_rate * gradients['d' + key]
        # Carry forward hidden state (detached)
        a_prev = a_states[-1].copy()
    if epoch % 50 == 0:
        # Average loss per chunk, using the number of chunks actually processed
        avg_loss = total_loss / max(num_chunks, 1)
        print(f"Epoch {epoch} | Loss: {avg_loss:.4f}")
4.6 — Text Generation (Sampling)
def generate_text(parameters, seed_char, length=200, temperature=0.8):
    """Generate text character by character."""
    a = np.zeros((n_a, 1))
    x = one_hot(char_to_idx[seed_char], n_x)
    generated = seed_char
    for _ in range(length):
        # Forward step
        z_a = np.dot(parameters['W_aa'], a) + np.dot(parameters['W_ax'], x) + parameters['b_a']
        a = np.tanh(z_a)
        z_y = np.dot(parameters['W_ya'], a) + parameters['b_y']
        # Temperature scaling
        z_y = z_y / temperature
        probs = softmax(z_y).ravel()
        # Sample from the distribution
        idx = np.random.choice(range(n_y), p=probs)
        generated += idx_to_char[idx]
        # Feed sampled character as next input
        x = one_hot(idx, n_x)
    return generated

# Generate Hindi poetry!
print("─── Generated Text ───")
print(generate_text(parameters, seed_char="म", length=200, temperature=0.7))
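The hidden state a⟨t⟩ at each step encodes what the model "remembers." On a corpus this small, a trained model tends to memorize frequent continuations: after seeing "मधुशा" it should assign high probability to "ला", and after "प्या" it should predict "ला" again. In that sense the network picks up recurring Hindi syllable patterns from the training text.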
Industry Code — PyTorch RNN
5.1 — Time-Series Forecasting with PyTorch RNN
import torch
import torch.nn as nn
import numpy as np

class DemandRNN(nn.Module):
    """
    RNN for time-series demand forecasting.
    Predicts next-hour order volume for quick-commerce.
    """
    def __init__(self, input_size, hidden_size, output_size, num_layers=2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # PyTorch's built-in RNN
        self.rnn = nn.RNN(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.2
        )
        # Output projection
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, h0=None):
        """
        Args:
            x: (batch, seq_len, input_size)
            h0: (num_layers, batch, hidden_size), optional initial state
        Returns:
            out: (batch, output_size), prediction for next step
            h_n: (num_layers, batch, hidden_size), final hidden state
        """
        if h0 is None:
            h0 = torch.zeros(self.num_layers, x.size(0),
                             self.hidden_size, device=x.device)
        # rnn_out: (batch, seq_len, hidden_size)
        # h_n: (num_layers, batch, hidden_size)
        rnn_out, h_n = self.rnn(x, h0)
        # Take the last time step's output
        last_hidden = rnn_out[:, -1, :]  # (batch, hidden_size)
        out = self.fc(last_hidden)       # (batch, output_size)
        return out, h_n

# ─── Model Configuration ───
INPUT_FEATURES = 5  # [orders, temperature, rain, holiday_flag, hour_sin]
HIDDEN_SIZE = 64
OUTPUT_SIZE = 1     # predicted orders for next hour
model = DemandRNN(INPUT_FEATURES, HIDDEN_SIZE, OUTPUT_SIZE)
print(model)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
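The printed total can be cross-checked by hand. The sketch below assumes PyTorch's `nn.RNN` layout of one input weight matrix, one recurrent weight matrix and two bias vectors per layer, which differs from the single b_a in the chapter's equations. The helper names are hypothetical:

```python
def rnn_layer_params(input_size, hidden_size):
    """One nn.RNN layer: weight_ih, weight_hh, bias_ih, bias_hh."""
    return hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size

def demand_rnn_params(input_size, hidden_size, output_size, num_layers=2):
    total = rnn_layer_params(input_size, hidden_size)
    # Layers above the first take the previous layer's hidden state as input.
    total += (num_layers - 1) * rnn_layer_params(hidden_size, hidden_size)
    total += hidden_size * output_size + output_size   # nn.Linear weight and bias
    return total

expected = demand_rnn_params(5, 64, 1)
print(expected)  # → 12929
```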
5.2 — Training Loop with Gradient Clipping
# ─── Training Setup ───
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()
MAX_GRAD_NORM = 1.0

# ─── Training Loop ───
def train_epoch(model, dataloader, optimizer, criterion):
    model.train()
    total_loss = 0.0
    for batch_x, batch_y in dataloader:
        # batch_x: (batch, seq_len=24, features=5)
        # batch_y: (batch, 1), next hour's orders
        optimizer.zero_grad()
        predictions, _ = model(batch_x)
        loss = criterion(predictions, batch_y)
        loss.backward()
        # ⚡ Gradient clipping: prevents exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(dataloader)

# ─── Training ───
# train_loader is assumed to be a torch DataLoader yielding (batch_x, batch_y)
# float tensors with the shapes noted above; it is not defined in this snippet.
for epoch in range(100):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)
    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d} | Train Loss: {train_loss:.4f}")
5.3 — Sentiment Analysis: Many-to-One RNN
class SentimentRNN(nn.Module):
    """Many-to-One RNN for Amazon India review sentiment."""
    def __init__(self, vocab_size, embed_dim, hidden_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, seq_len), token indices
        embedded = self.embedding(x)       # (batch, seq_len, embed_dim)
        rnn_out, h_n = self.rnn(embedded)  # h_n: (1, batch, hidden_size)
        # Many-to-One: use only the LAST hidden state.
        # Note: h_n is the state after the final position of each row, padding included.
        # For right-padded batches, torch.nn.utils.rnn.pack_padded_sequence lets the
        # RNN stop at each sequence's true length.
        last_hidden = h_n.squeeze(0)       # (batch, hidden_size)
        logits = self.fc(last_hidden)      # (batch, num_classes)
        return logits

# Example usage
model = SentimentRNN(
    vocab_size=30000,  # Amazon India Hindi+English vocab
    embed_dim=100,
    hidden_size=128,
    num_classes=5      # 1-star to 5-star
)
print(f"Sentiment model params: {sum(p.numel() for p in model.parameters()):,}")
🏭 Industry Note: Why Companies Use LSTM/GRU, Not Vanilla RNN
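In production, almost no one uses vanilla nn.RNN. Companies like Flipkart, Zepto, and Paytm use nn.LSTM or nn.GRU because vanilla RNNs struggle with sequences longer than ~20 steps. We cover LSTMs in Chapter 14. In PyTorch the switch is small but not always a pure rename: nn.GRU is a drop-in replacement for nn.RNN, while nn.LSTM returns and accepts a tuple (h_n, c_n) of hidden and cell state, so code that initializes or unpacks the hidden state needs a matching change.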
Visual Diagrams
6.1 — The RNN Cell (Folded View)
6.2 — The RNN Unrolled Through Time
6.3 — RNN Architecture Types
6.4 — Gradient Flow in BPTT (Vanishing Gradient)
6.5 — Gradient Clipping Visualization
Worked Example — Hand-Computing a 3-Step RNN
Let's trace through a tiny RNN by hand. We'll predict the next character in the sequence "बा" → "ल" (from "बाल" = child).
Setup
Vocabulary: {बा=0, ल=1, क=2} → n_x = n_y = 3, hidden size n_a = 2
Parameters (Simplified)
W_ax = [[0.1, 0.3, -0.1], # shape (2, 3)
[0.2, -0.2, 0.4]]
W_aa = [[0.5, 0.1], # shape (2, 2)
[-0.3, 0.6]]
W_ya = [[0.2, -0.1], # shape (3, 2)
[0.4, 0.3],
[-0.2, 0.1]]
b_a = [[0], [0]] # shape (2, 1)
b_y = [[0], [0], [0]] # shape (3, 1)
a⟨0⟩ = [[0], [0]] # initial hidden state
Step 1: t=1, Input = "बा" (x⟨1⟩ = [1, 0, 0]ᵀ)
# Compute z_a⟨1⟩ = W_aa · a⟨0⟩ + W_ax · x⟨1⟩ + b_a
z_a⟨1⟩ = [[0.5, 0.1], · [[0], + [[0.1, 0.3, -0.1], · [[1], + [[0],
[-0.3, 0.6]] [0]] [0.2, -0.2, 0.4]] [0], [0]]
[0]]
z_a⟨1⟩ = [[0], + [[0.1], + [[0], = [[0.1],
[0]] [0.2]] [0]] [0.2]]
# Apply tanh
a⟨1⟩ = tanh([[0.1], [0.2]]) = [[0.0997], [0.1974]]
# Compute output
z_y⟨1⟩ = W_ya · a⟨1⟩ + b_y
z_y⟨1⟩ = [[0.2·0.0997 + (-0.1)·0.1974], = [[0.0002],
[0.4·0.0997 + 0.3·0.1974], [0.0991],
[-0.2·0.0997 + 0.1·0.1974]] [-0.0002]]
ŷ⟨1⟩ = softmax([[0.0002], [0.0991], [-0.0002]])
≈ [[0.3222], [0.3557], [0.3221]]
# Network predicts "ल" (index 1) with probability 0.3557 ← highest!
Step 2: t=2, Input = "ल" (x⟨2⟩ = [0, 1, 0]ᵀ), using a⟨1⟩ from above
# Hidden state update
z_a⟨2⟩ = W_aa · a⟨1⟩ + W_ax · x⟨2⟩ + b_a
z_a⟨2⟩ = [[0.5·0.0997 + 0.1·0.1974], + [[0.3], = [[0.3696],
[-0.3·0.0997 + 0.6·0.1974]] [-0.2]] [-0.1115]]
a⟨2⟩ = tanh([[0.3696], [-0.1115]]) = [[0.3536], [-0.1110]]
# The hidden state NOW carries information about BOTH "बा" AND "ल"!
# a⟨1⟩ encoded "बा", and a⟨2⟩ blends that with the new input "ल"
a⟨2⟩ = [0.354, -0.111] is different from what we'd get if we only fed "ल" without the context of "बा". The value 0.354 in the first dimension is influenced by the history — this IS the memory mechanism! Without the recurrent connection (W_aa · a⟨1⟩), we'd get [0.291, -0.197] — a measurably different state.
BPTT Backward (Sketch for Step 2)
# Suppose target at t=2 is "क" → y⟨2⟩ = [0, 0, 1]ᵀ
# dy⟨2⟩ = ŷ⟨2⟩ - y⟨2⟩ (softmax cross-entropy gradient)
# ∂L/∂W_ya += dy⟨2⟩ · a⟨2⟩ᵀ ← gradient accumulates at each step
# ∂L/∂a⟨2⟩ = W_yaᵀ · dy⟨2⟩ ← flows backward to hidden state
# dz⟨2⟩ = (1 - a⟨2⟩²) · ∂L/∂a⟨2⟩ ← through tanh
# ∂L/∂W_aa += dz⟨2⟩ · a⟨1⟩ᵀ ← gradient uses PREVIOUS hidden state
# ∂L/∂a⟨1⟩ = W_aaᵀ · dz⟨2⟩ ← CONTINUES backward to t=1!
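Hand arithmetic like this is easy to get slightly wrong, so it helps to recompute the forward steps with a few lines of code. The sketch below uses the parameters from the Setup section and only the standard library; `matvec`, `softmax` and `step` are local helper names, not part of the chapter's NumPy implementation:

```python
import math

def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

W_ax = [[0.1, 0.3, -0.1], [0.2, -0.2, 0.4]]
W_aa = [[0.5, 0.1], [-0.3, 0.6]]
W_ya = [[0.2, -0.1], [0.4, 0.3], [-0.2, 0.1]]

def step(x, a_prev):
    """One RNN step; both biases are zero in this example."""
    z_a = [p + q for p, q in zip(matvec(W_aa, a_prev), matvec(W_ax, x))]
    a = [math.tanh(v) for v in z_a]
    return a, softmax(matvec(W_ya, a))

a1, y1 = step([1, 0, 0], [0.0, 0.0])            # t=1, input index 0
a2, y2 = step([0, 1, 0], a1)                    # t=2, input index 1, with memory
a2_no_memory, _ = step([0, 1, 0], [0.0, 0.0])   # same input, memory wiped

for name, vec in [("a1", a1), ("y1", y1), ("a2", a2), ("a2 without memory", a2_no_memory)]:
    print(name, [round(v, 4) for v in vec])
```

Comparing `a2` with `a2_no_memory` isolates the contribution of the recurrent term W_aa · a⟨1⟩.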
Case Study — Zepto: 10-Minute Grocery Demand Forecasting
🛒 Zepto — Predicting Hourly Orders Across 300+ Dark Stores
The Business Problem
Zepto promises 10-minute delivery in 10+ Indian cities. This is only possible if the right products are already stocked in the dark store closest to the customer. But dark stores are small (2,000–3,000 sq ft) — they can't stock everything. They must predict exactly what will be ordered, when, and how much.
If Zepto overstocks: products expire (especially dairy, fruits, vegetables — ₹10–50 crore annual wastage). If Zepto understocks: orders get cancelled, customers churn to Blinkit/Swiggy Instamart.
Why This Is a Sequential Problem
Demand at 7 PM depends on:
- Demand at 6 PM, 5 PM, 4 PM (hourly momentum)
- Day-of-week pattern (weekends vs. weekdays)
- Weather (rain → more orders → more delivery delays)
- Festivals (Holi → colors & sweets spike, Diwali → dry fruits & decorations)
- IPL matches (7–10 PM → snacks, chips, cold drinks surge)
- Payday effect (month-end → higher grocery spending)
The RNN Architecture
| Component | Configuration |
|---|---|
| Architecture | Many-to-One RNN (24h → predict next hour) |
| Input Features (per hour) | orders_count, avg_basket_size (₹), temperature_°C, rain_mm, is_holiday, hour_sin, hour_cos, day_of_week_encoded |
| Sequence Length | T = 24 (look-back of 24 hours) |
| Hidden Size | n_a = 64 |
| Output | Predicted order count for the next hour |
| Loss Function | MSE (Mean Squared Error) |
| Gradient Clipping | max_norm = 1.0 |
Input Feature Engineering
import pandas as pd
import numpy as np

# ─── Feature Engineering for Zepto Dark Store ───
def engineer_features(df):
    """
    Transform raw hourly data into RNN-ready features.
    df columns: timestamp, orders, avg_basket_inr, temp_c, rain_mm
    """
    df = df.copy()  # avoid mutating the caller's DataFrame
    df['hour'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.dayofweek
    # Cyclical encoding for hour (so 23→0 is smooth)
    df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
    df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
    # Holidays: Diwali, Holi, Independence Day, etc.
    indian_holidays = ['2024-11-01', '2024-03-25', '2024-08-15']
    df['is_holiday'] = df['timestamp'].dt.date.astype(str).isin(indian_holidays).astype(int)
    # Normalize numerical features.
    # Note: statistics here come from the whole frame; to avoid leakage, compute
    # them on the training split only and reuse them for validation/test data.
    for col in ['orders', 'avg_basket_inr', 'temp_c', 'rain_mm']:
        df[col] = (df[col] - df[col].mean()) / (df[col].std() + 1e-8)
    # day_of_week is computed above but not included here; add an encoded
    # version to feature_cols if the model should see it.
    feature_cols = ['orders', 'avg_basket_inr', 'temp_c', 'rain_mm',
                    'is_holiday', 'hour_sin', 'hour_cos']
    return df[feature_cols].values

# Create sliding window sequences
def create_sequences(data, lookback=24):
    X, y = [], []
    for i in range(lookback, len(data)):
        X.append(data[i - lookback:i])  # past 24 hours
        y.append(data[i, 0])            # next hour's (normalized) orders
    return np.array(X), np.array(y).reshape(-1, 1)
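The reason for the sine/cosine hour encoding in `engineer_features` can be made concrete by measuring distances between encoded hours. This is a standard-library sketch; `encode_hour` and `distance` are hypothetical helpers that mirror the formula above:

```python
import math

def encode_hour(hour):
    """Map an hour of the day to a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def distance(h1, h2):
    (s1, c1), (s2, c2) = encode_hour(h1), encode_hour(h2)
    return math.hypot(s1 - s2, c1 - c2)

d_23_0 = distance(23, 0)    # across midnight
d_0_1 = distance(0, 1)      # an ordinary one-hour gap
d_0_12 = distance(0, 12)    # opposite sides of the clock
```

With the raw hour as a feature, 23 and 0 are 23 units apart. In the encoded space they are exactly as close as any other pair of adjacent hours.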
Results
| Metric | Baseline (Yesterday's Orders) | Vanilla RNN | LSTM (Ch.14) |
|---|---|---|---|
| MAE (orders/hour) | 45.2 | 28.7 | 18.3 |
| MAPE | 22.1% | 14.5% | 9.2% |
| Wastage Reduction | — | 18% ↓ | 31% ↓ |
| Stockout Reduction | — | 15% ↓ | 27% ↓ |
Key Findings
- Vanilla RNN beat the baseline by 37% on MAE — the temporal patterns (morning milk, evening snacks, weekend brunch) were clearly captured
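- Limitation: Vanilla RNN struggled with weekly patterns (7-day cycles = 168 hourly time steps). A 24-hour look-back never shows the model last Saturday at all, and even with a 168-step window the gradient from "last Saturday's peak" would have to survive 168 recurrent steps to reach "this Saturday's prediction", which is the classic vanishing gradient regime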
- Solution preview: LSTM (Chapter 14) solved this by maintaining a separate cell state that can carry information across 168+ steps
- Business impact: Even the vanilla RNN saved Zepto an estimated ₹8–12 crore annually in reduced perishable wastage across their dark store network
Common Mistakes & Misconceptions
RNNs do not learn a separate set of weights for each time step. They use parameter sharing: the same W_aa, W_ax, W_ya are applied at every time step. This is what enables them to handle variable-length sequences. With different weights per step, you would need a different model for every sequence length.
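A minimal NumPy sketch of parameter sharing, assuming small illustrative sizes and random weights: one set of matrices is applied to a 3-step and a 7-step sequence without any change to the model.

```python
import numpy as np


def rnn_forward(xs, W_aa, W_ax, b_a):
    """Run a vanilla RNN over xs of shape (T, n_x); return all hidden states."""
    a = np.zeros(W_aa.shape[0])           # a<0> = 0
    states = []
    for x_t in xs:                        # same W_aa, W_ax, b_a at every step
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
        states.append(a)
    return np.stack(states)               # shape (T, n_a)


rng = np.random.default_rng(0)
n_a, n_x = 4, 3                           # illustrative sizes
W_aa = rng.normal(scale=0.5, size=(n_a, n_a))
W_ax = rng.normal(scale=0.5, size=(n_a, n_x))
b_a = np.zeros(n_a)

short_seq = rng.normal(size=(3, n_x))     # T = 3
long_seq = rng.normal(size=(7, n_x))      # T = 7

print(rnn_forward(short_seq, W_aa, W_ax, b_a).shape)  # (3, 4)
print(rnn_forward(long_seq, W_aa, W_ax, b_a).shape)   # (7, 4)
```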
The hidden state a⟨t⟩ is not the same thing as the output ŷ⟨t⟩. a⟨t⟩ is the internal memory that is passed to the next time step. ŷ⟨t⟩ is the prediction at time step t, computed by projecting a⟨t⟩ through W_ya and applying softmax. In many-to-one architectures, ŷ is computed only at the final step, but a⟨t⟩ exists at every step.
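The distinction can be made concrete with a small many-to-one sketch. The sizes and random weights are assumptions for illustration: a hidden state is produced at every step, while a single prediction is computed from the last one.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


rng = np.random.default_rng(1)
n_a, n_x, n_y, T = 5, 3, 2, 4             # illustrative sizes
W_aa = rng.normal(scale=0.5, size=(n_a, n_a))
W_ax = rng.normal(scale=0.5, size=(n_a, n_x))
W_ya = rng.normal(scale=0.5, size=(n_y, n_a))
b_a, b_y = np.zeros(n_a), np.zeros(n_y)

xs = rng.normal(size=(T, n_x))
a = np.zeros(n_a)
hidden_states = []
for x_t in xs:
    a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)   # memory, exists at every step
    hidden_states.append(a)

# Many-to-one: project only the final hidden state into a prediction
y_hat = softmax(W_ya @ hidden_states[-1] + b_y)

print(len(hidden_states), "hidden states, each of size", hidden_states[0].shape[0])
print("one prediction of size", y_hat.shape[0])
```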
RNNs cannot process the time steps of a sequence in parallel. They are inherently sequential: you must compute a⟨1⟩ before you can compute a⟨2⟩, because a⟨2⟩ depends on a⟨1⟩. This sequential dependency is the main reason Transformers (with parallel self-attention) have largely replaced RNNs for many tasks.
Gradient clipping only prevents exploding gradients (by capping the norm). The vanishing gradient problem is the opposite — gradients are too small. You can't clip something that's already near zero. Vanishing gradients require architectural changes: LSTM, GRU, skip connections, or attention.
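A short sketch of norm-based clipping shows the asymmetry. The helper name `clip_by_norm` and the example vectors are assumptions for illustration: a large gradient is rescaled to the threshold, while a tiny one passes through unchanged.

```python
import numpy as np


def clip_by_norm(grad, tau):
    """Rescale grad to norm tau if its norm exceeds tau; otherwise return it unchanged."""
    norm = np.linalg.norm(grad)
    if norm > tau:
        return grad * (tau / norm)
    return grad


exploding = np.array([300.0, -400.0])     # norm 500
vanishing = np.array([3e-9, -4e-9])       # norm 5e-9

print(np.linalg.norm(clip_by_norm(exploding, tau=5.0)))   # rescaled down to about 5
print(np.linalg.norm(clip_by_norm(vanishing, tau=5.0)))   # still about 5e-9
```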
Due to the vanishing gradient problem, vanilla RNNs effectively have a "memory horizon" of roughly 10–20 steps. Feeding a 500-step sequence does not mean the model uses all 500 steps; in practice it "forgets" most of what lies beyond the most recent ~20 steps. Truncated BPTT makes this limit explicit by stopping gradient flow after a fixed number of steps.
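The shrinking horizon can be observed numerically. The sketch below accumulates the product of per-step Jacobians diag(1 - a²)·W_aa and prints its norm. It assumes random inputs and a W_aa rescaled so that its largest singular value is 0.9, a choice made only so that the decay is guaranteed in this toy setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x, T = 16, 4, 50                   # illustrative sizes

# Assumption for this sketch: W_aa is rescaled so its largest singular value is 0.9
W_aa = rng.normal(size=(n_a, n_a))
W_aa *= 0.9 / np.linalg.norm(W_aa, 2)
W_ax = rng.normal(scale=0.5, size=(n_a, n_x))

a = np.zeros(n_a)
jacobian_product = np.eye(n_a)            # accumulates d a<t> / d a<0>
norms = []
for t in range(T):
    a = np.tanh(W_aa @ a + W_ax @ rng.normal(size=n_x))
    step_jacobian = np.diag(1.0 - a**2) @ W_aa        # d a<t> / d a<t-1>
    jacobian_product = step_jacobian @ jacobian_product
    norms.append(np.linalg.norm(jacobian_product, 2))

for k in (1, 10, 20, 50):
    print(f"after {k:2d} steps: norm of d a<t>/d a<0> = {norms[k - 1]:.2e}")
```

Because every factor has spectral norm at most 0.9 here, the product after k steps is bounded by 0.9^k, so the influence of a⟨0⟩ on later states shrinks at least geometrically.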
ReLU is unbounded (output can be 0 to ∞). In an RNN, the hidden state is multiplied by W_aa repeatedly. With ReLU, the hidden state can grow unboundedly, causing numerical overflow. tanh bounds the state to [-1, +1]. (LSTMs use a combination of sigmoid gates and tanh for this reason.)
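A deliberately simple sketch of this effect is given below. It assumes a recurrent matrix that amplifies the state (1.5 times the identity) and omits inputs and biases, so it illustrates the mechanism rather than a realistic trained network.

```python
import numpy as np

n_a, T = 4, 60
W_aa = 1.5 * np.eye(n_a)                  # assumption: recurrent weights that amplify the state
a_relu = np.full(n_a, 0.1)
a_tanh = np.full(n_a, 0.1)

for t in range(T):
    a_relu = np.maximum(0.0, W_aa @ a_relu)   # ReLU: nothing caps the growth
    a_tanh = np.tanh(W_aa @ a_tanh)           # tanh: squashed back into (-1, 1)

print(f"max |a| with ReLU after {T} steps: {np.abs(a_relu).max():.3e}")
print(f"max |a| with tanh after {T} steps: {np.abs(a_tanh).max():.3f}")
```

In this setup the ReLU state grows by a factor of 1.5 per step, while the tanh state settles at a value below 1.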
If you train on sequence A and then sequence B (unrelated), you must reset a⟨0⟩ = 0 before starting B. Otherwise, the hidden state from the end of A "leaks" into B, causing noisy gradients. In PyTorch, this means creating a fresh h0 = torch.zeros(...) for each new batch of unrelated sequences.
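The leak is easy to reproduce in a few lines of NumPy. In this sketch, the helper `run_rnn`, the sizes, and the random sequences are assumptions for illustration: sequence B is processed once from a zero state and once from the state left over after sequence A.

```python
import numpy as np


def run_rnn(xs, a0, W_aa, W_ax):
    """Return the final hidden state after running xs (T, n_x) from initial state a0."""
    a = a0
    for x_t in xs:
        a = np.tanh(W_aa @ a + W_ax @ x_t)
    return a


rng = np.random.default_rng(2)
n_a, n_x = 4, 3                            # illustrative sizes
W_aa = rng.normal(scale=0.8, size=(n_a, n_a))
W_ax = rng.normal(scale=0.8, size=(n_a, n_x))

seq_a = rng.normal(size=(5, n_x))          # unrelated sequence A
seq_b = rng.normal(size=(2, n_x))          # unrelated sequence B

end_of_a = run_rnn(seq_a, np.zeros(n_a), W_aa, W_ax)
b_fresh = run_rnn(seq_b, np.zeros(n_a), W_aa, W_ax)   # state reset before B
b_leaky = run_rnn(seq_b, end_of_a, W_aa, W_ax)        # state carried over from A

print("B from zero state:   ", np.round(b_fresh, 3))
print("B from A's end state:", np.round(b_leaky, 3))
```

The two results differ even though the inputs for B are identical, which is the contamination described above. In PyTorch's nn.RNN, omitting the initial hidden state argument has the same effect as passing zeros.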
Comparison Table
10.1 — RNN vs. Feedforward Networks
| Aspect | Feedforward Network | Recurrent Neural Network |
|---|---|---|
| Input | Fixed-size vector | Variable-length sequence |
| Memory | None — each input independent | Hidden state carries context forward |
| Parameter Sharing | None across layers (each layer has its own W) | Across time steps (same W at every step) |
| Training | Standard backpropagation | BPTT (backpropagation through time) |
| Depth | Number of layers | Number of layers × sequence length (unrolled) |
| Parallelism | Each sample in batch parallelizable | Sequential within a sample (time steps depend on each other) |
| Key Problem | Can't handle sequences natively | Vanishing gradient over long sequences |
| Use Case | Image classification, tabular data | Text, time series, audio, video |
10.2 — RNN Architecture Selection Guide
| Task | Architecture | Why? | Indian Example |
|---|---|---|---|
| Sentiment Analysis | Many-to-One | Entire review → single sentiment score | Amazon India review → ⭐ rating |
| Language Modeling | Many-to-Many (=) | Predict next token at each step | Hindi text completion on Koo app |
| Machine Translation | Many-to-Many (≠) | Input/output lengths differ | Bhashini: Hindi → Tamil |
| Music Generation | One-to-Many | One seed → sequence of notes | AI-generated raga compositions |
| Named Entity Recognition | Many-to-Many (=) | Each token → entity label | Extracting names from Aadhaar forms |
| Demand Forecasting | Many-to-One | 24h history → next hour prediction | Zepto dark store inventory |
| Video Classification | Many-to-One | Frame sequence → single label | Cricket shot classification for Hotstar |
10.3 — Vanilla RNN vs. LSTM vs. GRU (Preview)
| Feature | Vanilla RNN | LSTM (Ch.14) | GRU (Ch.14) |
|---|---|---|---|
| Learned Transforms per Cell | 1 (hidden update) | 4 (forget, input, and output gates, plus cell candidate) | 3 (reset and update gates, plus candidate) |
| Memory Mechanism | Hidden state only | Separate cell state (highway) | Hidden state with gating |
| Long-term Dependencies | Fails (~10–20 steps) | Handles 100+ steps | Handles 100+ steps |
| Vanishing Gradient | Severe | Mitigated (additive gradient flow) | Mitigated |
| Parameters per Cell | n_a² + n_a·n_x + n_a | 4×(n_a² + n_a·n_x + n_a) | 3×(n_a² + n_a·n_x + n_a) |
| Training Speed | Fastest | Slowest (4 learned transforms per step) | Middle |
| When to Use | Short sequences, simple tasks | Long sequences, complex patterns | Long sequences, fewer params needed |
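The "Parameters per Cell" row can be checked with a few lines of arithmetic. The helper name `recurrent_params` and the sizes n_a = 64, n_x = 100 are chosen here only for illustration, and the count covers the recurrent cell alone (no output layer W_ya).

```python
def recurrent_params(n_a, n_x, n_transforms):
    """Each transform has a recurrent matrix (n_a x n_a), an input matrix (n_a x n_x), and a bias (n_a)."""
    return n_transforms * (n_a * n_a + n_a * n_x + n_a)


n_a, n_x = 64, 100                        # illustrative sizes
for name, k in [("Vanilla RNN", 1), ("GRU", 3), ("LSTM", 4)]:
    print(f"{name:12s} {recurrent_params(n_a, n_x, k):>7,d}")
# Vanilla RNN   10,560
# GRU           31,680
# LSTM          42,240
```

Library implementations may report slightly different totals. PyTorch, for example, keeps two bias vectors per transform.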
Exercises
Section A — Multiple Choice Questions (10)
In the RNN hidden state equation a⟨t⟩ = tanh(W_aa · a⟨t-1⟩ + W_ax · x⟨t⟩ + b_a), what is the shape of W_aa if the hidden state has 128 units?
- (128, 1)
- (128, 128)
- (1, 128)
- (128, n_x)
Which RNN architecture would you use for classifying an Amazon India product review as positive or negative?
- One-to-One
- One-to-Many
- Many-to-One
- Many-to-Many (same length)
The key advantage of parameter sharing in RNNs is:
- It makes training faster by reducing the number of gradient computations
- It enables the network to handle variable-length input sequences with the same model
- It prevents the vanishing gradient problem
- It ensures the output is always the same dimension as the input
During BPTT, the gradient of the loss with respect to W_aa is computed as:
- The gradient at the last time step only
- The average of gradients across all time steps
- The sum of gradient contributions from all time steps
- The maximum gradient across all time steps
The vanishing gradient problem in RNNs occurs because:
- The learning rate is too small
- The repeated multiplication of matrices with spectral radius < 1 causes gradients to decay exponentially
- The softmax function saturates at extreme values
- The bias terms are initialized to zero
Gradient clipping with threshold τ performs the following operation when ‖g‖ > τ:
- g ← 0 (set gradient to zero)
- g ← τ · sign(g) (element-wise clipping)
- g ← (τ / ‖g‖) · g (rescale to norm τ, preserve direction)
- g ← g - τ (subtract threshold)
In a character-level language model, the perplexity is 15. This means:
- The model makes 15 errors per sentence
- The model is, on average, as uncertain as choosing uniformly among 15 characters at each step
- The model achieves 15% accuracy
- The vocabulary must have exactly 15 characters
Why is tanh preferred over ReLU for the hidden state activation in vanilla RNNs?
- tanh is computationally cheaper than ReLU
- tanh outputs are bounded in [-1, +1], preventing hidden state explosion over many time steps
- tanh always has a non-zero gradient, completely solving vanishing gradients
- tanh is differentiable everywhere, while ReLU is not
An RNN with hidden size n_a=64 and input size n_x=100 is used for a many-to-many task with vocabulary size n_y=100. How many trainable parameters does it have (excluding biases)?
- 64 × 64 + 64 × 100 + 100 × 64 = 4,096 + 6,400 + 6,400 = 16,896
- 64 × 100 + 100 × 64 = 12,800
- 64 × 64 × 100 = 409,600
- 3 × 64 × 64 = 12,288
In truncated BPTT with window size k=20, applied to a sequence of length T=100:
- Gradients are computed for only the last 20 time steps and ignored for the first 80
- The gradient computation at any time step t only propagates backward through the previous 20 steps (t to t-20), not all the way to t=1
- The sequence is split into 5 independent sub-sequences of length 20 with no state carried over
- The network architecture is modified to only have 20 recurrent layers
Section B — Short Answer Questions (5)
Explain in 3–4 sentences why a feedforward network cannot properly model the task: "Given the past 7 days of Sensex closing prices, predict tomorrow's price."
What is the difference between the "folded" and "unrolled" views of an RNN? Why is unrolling necessary for training?
Why is the hidden state initialized to the zero vector a⟨0⟩ = 0? Could we initialize it randomly?
A Zepto data scientist notices that their RNN performs well for predicting orders 1–2 hours ahead but poorly for predicting 24 hours ahead. Explain why, using the concept of vanishing gradients.
Explain temperature scaling in text generation. What happens if temperature → 0 and temperature → ∞?
Section C — Long Answer Questions (3)
Derive the BPTT gradients for a 3-step RNN. Given a sequence of length T=3 with inputs x⟨1⟩, x⟨2⟩, x⟨3⟩ and targets y⟨1⟩, y⟨2⟩, y⟨3⟩, starting from the cross-entropy loss L = L⟨1⟩ + L⟨2⟩ + L⟨3⟩:
- Write down the complete forward pass equations for all 3 steps
- Derive ∂L/∂W_ya by computing the contribution from each time step
- Derive ∂L/∂W_aa — showing how the gradient at step 3 flows backward through steps 2 and 1
- Show that the gradient ∂L⟨3⟩/∂a⟨1⟩ involves the product diag(1-a⟨3⟩²)·W_aa·diag(1-a⟨2⟩²)·W_aa, which contains two multiplications by W_aa for a distance of two steps, and in general k such multiplications for distance k
- Explain why this leads to vanishing gradients when ‖W_aa‖ < 1
[Expected length: 2–3 pages with equations]
Compare and contrast the five RNN architectures (one-to-one, one-to-many, many-to-one, many-to-many same length, many-to-many different length). For each:
- Draw the architecture diagram showing input/output at each time step
- Write the mathematical formulation (which steps have inputs, which produce outputs)
- Give two Indian industry examples with justification
- Specify what the loss function looks like (at which steps is loss computed)
[Expected length: 2 pages with diagrams]
Language Modeling Deep Dive.
- Formulate the language model probability P(w₁, w₂, …, w_T) using the chain rule
- Explain how an RNN implements each conditional P(w_t | w₁, …, w_{t-1}) using the hidden state
- Derive the training objective (negative log-likelihood) and show it equals the cross-entropy loss summed over time steps
- Define perplexity mathematically and explain its intuitive interpretation
- For a Hindi character-level model with vocabulary size 75: What is the perplexity of a random model (uniform distribution)? What would be a "good" perplexity, and why?
- Explain the sampling procedure for text generation, including the role of temperature
[Expected length: 2–3 pages]
Section D — Programming Questions (2)
Sentiment Analysis on Amazon India Reviews. Build a complete many-to-one RNN for sentiment classification:
- Load the Amazon India product review dataset (or create a synthetic one with 5,000 Hindi+English reviews)
- Tokenize the text, build a vocabulary, and convert reviews to padded integer sequences
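- Implement a SentimentRNN class in PyTorch with: Embedding layer → RNN → Linear → Softmax (if you train with nn.CrossEntropyLoss, pass it the raw Linear outputs and apply softmax only at inference, since that loss applies log-softmax internally)
- Train with cross-entropy loss, Adam optimizer, and gradient clipping (max_norm=1.0)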
- Report accuracy, confusion matrix, and show 5 example predictions with the model's internal hidden state evolution (plot ‖a⟨t⟩‖ over time for a sample review)
- Bonus: Compare vanilla RNN vs. LSTM performance and discuss the difference
Character-Level Text Generator for Indian Languages. Build a from-scratch character-level RNN:
- Collect a corpus of Hindi poetry (Harivansh Rai Bachchan's Madhushala or any publicly available Hindi text, ~10,000 characters minimum)
- Implement the RNN cell, forward pass, BPTT backward pass, and gradient clipping entirely in NumPy (no PyTorch)
- Train the model and plot the loss curve over 1,000+ epochs
- Generate text at three temperature values (0.3, 0.7, 1.5) and qualitatively compare the outputs
- Visualize the hidden state activation patterns: for a generated sequence of 50 characters, plot a heatmap of a⟨t⟩ (t × n_a matrix) to show how different hidden units activate for different characters
- Experiment with hidden sizes (32, 64, 128, 256) and report which produces the best text quality
Section E — Mini-Project
🛒 Zepto-Style Demand Forecasting System
Build an end-to-end demand forecasting pipeline for a simulated dark store in Mumbai:
- Data Generation: Simulate 90 days of hourly order data (2,160 data points) with realistic patterns:
- Morning peak (8–10 AM: milk, bread, eggs)
- Lunch lull (1–3 PM)
- Evening spike (6–9 PM: dinner ingredients, snacks)
- Weekend 2x multiplier
- Rain → 1.5x orders (people don't go out)
- IPL match evenings → snack orders spike
- Diwali week → 3x orders on sweets & dry fruits
- Feature Engineering: Create features: hour_sin, hour_cos, day_of_week, is_weekend, temperature, rain_flag, is_holiday, lag_1h, lag_24h, rolling_mean_6h
- Model: Implement vanilla RNN + LSTM in PyTorch with 24-hour lookback window
- Evaluation: MAE, MAPE, plot predicted vs. actual for a full week
- Business Analysis: If Zepto stocks 10% above predicted demand as safety buffer, calculate the expected wastage cost (assume avg order = ₹350, wastage cost = ₹50 per unnecessary unit) and stockout cost (assume ₹200 lost revenue per missed order)
- Report: Write a 2-page report comparing vanilla RNN vs. LSTM, with insights on which time-of-day patterns each model captures well
Deliverables: Jupyter notebook + PDF report
Chapter Summary
📝 Key Takeaways — Chapter 13
- Sequential data is everywhere — text, time series, audio, video. Feedforward networks fail because they have no notion of order or memory.
- The RNN cell adds a hidden state a⟨t⟩ that carries information forward through time: a⟨t⟩ = tanh(W_aa · a⟨t-1⟩ + W_ax · x⟨t⟩ + b_a).
- Parameter sharing means the same W_aa, W_ax, W_ya are used at every time step, enabling variable-length input handling.
- Five architectures — one-to-one, one-to-many, many-to-one, many-to-many (same/different length) — cover all sequence tasks from sentiment analysis to machine translation.
- BPTT trains RNNs by unrolling the computation graph across time and applying standard backpropagation. Gradients for shared weights are summed across all time steps.
- Vanishing gradient — the product ∏ diag(1-a²)·W_aa shrinks exponentially, preventing learning of long-range dependencies (>20 steps).
- Exploding gradient — the same product can grow exponentially, causing NaN values. Solved by gradient clipping: rescaling g to have max norm τ.
- Language modeling — RNNs learn P(word_t | history) at each step. The model generates text by sampling from the predicted distribution.
- Perplexity = exp(average NLL) — measures how "surprised" the model is. Lower = better.
- Temperature scaling — controls the creativity/randomness trade-off in text generation.
- Vanilla RNNs are limited — they work for short sequences but fail for long-term dependencies. This motivates LSTM and GRU (Chapter 14).
- Industry impact — from IRCTC booking forecasts to Zepto dark store inventory, RNNs (and their LSTM/GRU variants) power critical time-series and NLP systems across Indian tech.
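The perplexity and temperature takeaways above can be sketched in a few lines. The function names and the example logits are assumptions for illustration, not the chapter's implementation.

```python
import numpy as np


def perplexity(probs_of_correct):
    """exp of the average negative log-likelihood of the correct tokens."""
    nll = -np.log(np.asarray(probs_of_correct, float))
    return float(np.exp(nll.mean()))


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    z = np.asarray(logits, float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()


# A model that always assigns probability 1/15 to the correct character
print(perplexity([1 / 15] * 10))          # approximately 15

logits = np.array([2.0, 1.0, 0.1])        # made-up scores for three characters
for temp in (0.3, 1.0, 1.5):
    print(temp, np.round(softmax_with_temperature(logits, temp), 3))
```

Lower temperatures concentrate probability on the top-scoring character, while higher temperatures flatten the distribution and make sampling more random.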
✅ Self-Assessment Checklist
- Can I write the RNN cell equations from memory?
- Can I explain why feedforward networks fail on sequential data?
- Can I draw all 5 RNN architecture types and give examples for each?
- Can I derive the BPTT gradient for W_aa for a 3-step sequence?
- Can I explain the vanishing gradient mathematically (product of matrices)?
- Can I implement gradient clipping from scratch?
- Can I implement an RNN cell in NumPy (forward + backward)?
- Can I explain language modeling (chain rule → conditional probability → RNN)?
- Can I compute perplexity and interpret what it means?
- Can I use PyTorch's nn.RNN for time-series forecasting?
- Can I articulate why vanilla RNNs need to be replaced by LSTMs for long sequences?
References & Further Reading
Foundational Papers
- Elman, J. L. (1990). "Finding Structure in Time." Cognitive Science, 14(2), 179–211. — The original "simple recurrent network" (Elman network) paper.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning Representations by Back-Propagating Errors." Nature, 323, 533–536. — Introduced backpropagation, the foundation for BPTT.
- Werbos, P. J. (1990). "Backpropagation Through Time: What It Does and How to Do It." Proceedings of the IEEE, 78(10), 1550–1560. — The definitive BPTT paper.
- Hochreiter, S. (1991). "Untersuchungen zu dynamischen neuronalen Netzen" [Diploma Thesis]. — First formal analysis of the vanishing gradient problem in RNNs.
- Bengio, Y., Simard, P., & Frasconi, P. (1994). "Learning Long-Term Dependencies with Gradient Descent is Difficult." IEEE Transactions on Neural Networks, 5(2), 157–166. — Comprehensive analysis of why RNNs fail on long sequences.
- Pascanu, R., Mikolov, T., & Bengio, Y. (2013). "On the Difficulty of Training Recurrent Neural Networks." ICML. — Gradient clipping and analysis of exploding/vanishing gradients.
Textbooks
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 10: Sequence Modeling — Recurrent and Recursive Nets.
- Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd ed. draft). Chapter 9: RNNs and LSTMs. Available at web.stanford.edu/~jurafsky/slp3/
- Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). Dive into Deep Learning. Chapter 9: Recurrent Neural Networks. Available at d2l.ai
Online Resources & Tutorials
- Karpathy, A. (2015). "The Unreasonable Effectiveness of Recurrent Neural Networks." Classic blog post with character-level RNN experiments. karpathy.github.io/2015/05/21/rnn-effectiveness/
- Olah, C. (2015). "Understanding LSTM Networks." Beautifully illustrated guide to RNN/LSTM internals. colah.github.io/posts/2015-08-Understanding-LSTMs/
- PyTorch Documentation. torch.nn.RNN. pytorch.org/docs/stable/generated/torch.nn.RNN.html
- Andrew Ng. Deep Learning Specialization, Course 5: Sequence Models. Coursera. Excellent video lectures on RNNs, BPTT, and language modeling.
Indian Industry Context
- IRCTC Annual Report (2023–24). — Statistics on daily ticket bookings and seasonal demand patterns.
- RedSeer Consulting (2024). "Quick Commerce in India." — Market analysis of Zepto, Blinkit, and Swiggy Instamart.
- NPCI (2024). "UPI Ecosystem Statistics." — Monthly UPI transaction volumes used as sequential data examples.