Neural Networks & Deep Learning

Chapter 14: Recurrent Neural Networks (RNNs) & LSTMs

Mastering Sequential Memory — From Vanilla RNNs to Gated Architectures

⏱️ Reading Time: ~4 hours | 📖 Unit 5: Specialized Architectures | 🧠 Theory + Code Chapter

📋 Prerequisites: Chapter 10 (Deep Networks & Optimization), Chapter 6 (Backpropagation)

Bloom's Taxonomy Map for This Chapter

Bloom's Level	What You'll Achieve
🔵 Remember	Recall the vanilla RNN equation, LSTM gate equations (forget, input, candidate, output), GRU update/reset gates, and BPTT algorithm steps
🔵 Understand	Explain why feedforward networks fail on sequences, how hidden states carry temporal memory, why gradients vanish in deep time-unrolled RNNs, and how LSTM gates solve this via the cell state highway
🟢 Apply	Implement RNN and LSTM cells from scratch in NumPy, train a character-level language model on Hindi text, and use PyTorch's nn.LSTM for demand forecasting
🟡 Analyze	Trace gradient flow through an unrolled RNN, prove vanishing gradients via repeated Jacobian multiplication, compare LSTM vs GRU parameter efficiency, and diagnose when to use bidirectional architectures
🟠 Evaluate	Choose between LSTM, GRU, and vanilla RNN for a given task; assess when attention or Transformers should replace recurrent models; evaluate trade-offs in encoder-decoder architectures
🔴 Create	Design an end-to-end IRCTC demand forecasting pipeline with LSTM, build a character-level Hindi text generator, and architect a seq2seq model for transliteration

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

Explain why sequential data (time series, text, speech) requires architectures with temporal memory — and why CNNs and feedforward networks fundamentally cannot handle variable-length temporal dependencies
Derive the vanilla RNN forward pass equations: h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h) and explain weight sharing across time steps
Perform Backpropagation Through Time (BPTT) — unroll the computation graph, apply the chain rule across T time steps, and compute ∂L/∂W_hh explicitly
Prove the vanishing gradient problem: show that ∂h_T/∂h_1 = ∏_{t=1}^{T-1} diag(tanh'(z_t)) · W_hh decays exponentially when the largest singular value of W_hh < 1
Derive each LSTM gate from first principles — forget gate (what to erase), input gate (what to write), candidate cell (what to write), output gate (what to read) — and show how the cell state provides a gradient highway
Compare GRU's simplified two-gate architecture (update + reset) with LSTM's four gates, and identify when each is preferred
Explain bidirectional RNNs (reading sequences forward + backward) and deep/stacked RNNs (multiple recurrent layers)
Implement both RNN and LSTM cells from scratch in NumPy, including forward pass and BPTT
Build practical models using PyTorch's nn.LSTM — character-level language model and time-series forecasting
Preview the encoder-decoder (seq2seq) architecture that leads to attention and Transformers (Chapter 15)

Section 2

Opening Hook — 25 Lakh Tickets and the Problem of Memory

🚂 IRCTC Processes 25 Lakh Bookings Daily. Can Your Neural Network Remember Yesterday?

It's October 2024. Durga Puja is four days away. At IRCTC's data center in Delhi, servers are groaning under the weight of 25 lakh ticket bookings per day. The Rajdhani Express from Delhi to Kolkata? Waitlisted to position 847. The Sealdah Duronto? Sold out three weeks ago.

Now imagine you're the ML engineer tasked with predicting tomorrow's demand for the Delhi-Mumbai Rajdhani. You need to know: yesterday's bookings (18,000), last week's pattern (rising 12% daily), the festival calendar (Diwali is next month), the monsoon status (delayed in Maharashtra causing cancellations), and the fact that a cricket match in Mumbai will spike weekend travel. This isn't a single data point — it's a sequence that stretches weeks into the past.

You try a feedforward network. You feed it today's features: 18,000 bookings, temperature 32°C, "pre-Diwali" flag. It predicts 17,500. Wildly wrong — it missed the upward trend, the weekly seasonality, the festival momentum. Why? Because a feedforward network is like a goldfish: it has no memory. It sees each day in complete isolation.

What you need is a network that remembers. One that carries yesterday's context into today's prediction. One that can learn: "bookings have been rising for 5 days straight, there's a festival approaching, and monsoon delays are pushing people to pre-book." That network is called a Recurrent Neural Network. And when sequences get really long, you need its evolved cousin: the Long Short-Term Memory (LSTM) network.

IRCTCIndian RailwaysDemand ForecastingSequential DataTime Series

Sequential data dominates Indian tech: UPI transaction patterns on Paytm/PhonePe (₹20+ lakh crore monthly), Jio's network traffic during IPL matches, IRCTC's 25 lakh daily bookings with festival spikes, Flipkart's order surges during Big Billion Days, Zomato's hourly delivery patterns across 500+ cities, and Ola's ride-demand curves during monsoon. Every one of these problems screams for a neural network with memory.

Section 3

The Intuition First — Thinking Before Equations

The Diary Analogy

Imagine you're reading a diary, one page per day. On each page, the writer describes what happened. When you read Day 5, you don't forget Days 1-4 — you carry a mental summary of everything you've read so far. This mental summary gets updated as you read each new page.

That's exactly what an RNN does:

Each page = one input at time step t (today's features: bookings, weather, festival flag)
Your mental summary = the hidden state h_t (a vector summarizing everything seen so far)
Updating your summary = applying a function: new_summary = f(old_summary, today's page)
Making a prediction = using today's summary to guess what tomorrow's page will say

THE DIARY READER (RNN INTUITION) Day 1 Day 2 Day 3 Day 4 ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ x₁ │ │ x₂ │ │ x₃ │ │ x₄ │ ← Pages (inputs) └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │ │ │ │ ▼ ▼ ▼ ▼ ┌──────┐ h₁ ┌──────┐ h₂ ┌──────┐ h₃ ┌──────┐ │ Read │─────▶│ Read │─────▶│ Read │─────▶│ Read │──▶ h₄ │ & Sum│ │ & Sum│ │ & Sum│ │ & Sum│ └──────┘ └──────┘ └──────┘ └──────┘ h₀ ─┘ │ │ │ (blank ŷ₂ ŷ₃ ŷ₄ ← Predictions mind) "I recall Days "Days 1-3 tell "The trend 1-2: bookings me there's an over 4 days are rising" upward trend" is clear!" KEY INSIGHT: The same "reading brain" is used at every step. This is WEIGHT SHARING — the same W_hh, W_xh are applied at every time step.

The "Aha!" Question

If the RNN uses the same weights at every time step, how can it possibly distinguish between "the booking happened yesterday" vs. "the booking happened 30 days ago"?

Answer: It can't — not well, anyway. That's the vanishing gradient problem, and it's exactly why we need LSTMs. But before we jump to the solution, let's deeply understand the problem.

The physicist's approach: We're going to derive everything from scratch. No "it can be shown that..." — we'll show it. If a formula appears, you'll see exactly where it comes from. If you're confused at any point, you're thinking correctly. Confusion is the first step to understanding.

Section 4

Mathematical Foundation — Deriving Everything from Scratch

14.1 — Why Sequences Need Special Architectures

Before we write any equations, let's be crystal clear about what makes sequential data fundamentally different from, say, images or tabular data.

Property 1: Temporal Dependency

In a sequence x₁, x₂, ..., x_T, the meaning of x_t depends on what came before. "Bank" means something different after "river" vs. after "investment." Today's stock price depends on yesterday's. The context accumulated over time is essential.

Property 2: Variable Length

A tweet is 15 words. A novel is 100,000 words. A Sensex time series spans 50 years of daily data. You cannot design a fixed-size input layer for all of these. The architecture must handle any sequence length using the same parameters.

Property 3: Position-Dependent Semantics with Shared Rules

The rule "a verb following a noun is probably describing the noun's action" applies equally at position 3 and position 300 in a sentence. We want parameter sharing — the same transformation applied at every position — so the network can generalize across positions.

💡 Key Insight: The Compression Trick

Instead of trying to feed the entire history (x₁, x₂, ..., x_t) at each step, we compress the history into a fixed-size vector h_t (the hidden state). At each time step, we update this compressed summary:

h_t = f(h_{t-1}, x_t; θ)

The function f and parameters θ are shared across all time steps. This is the fundamental RNN idea.

14.2 — Vanilla RNN: Architecture & Forward Pass

The Architecture

A vanilla RNN has three sets of parameters, shared across every time step:

W_xh ∈ ℝ^(d_h × d_x): transforms input x_t into hidden-space
W_hh ∈ ℝ^(d_h × d_h): transforms previous hidden state h_{t-1}
b_h ∈ ℝ^(d_h): bias vector
W_hy ∈ ℝ^(d_y × d_h): maps hidden state to output
b_y ∈ ℝ^(d_y): output bias

Step-by-Step Forward Pass

At each time step t = 1, 2, ..., T:

Step 1: Compute pre-activation

z_t = W_hh · h_{t-1} + W_xh · x_t + b_h

Step 2: Apply activation (tanh)

h_t = tanh(z_t) = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)

Step 3: Compute output

ŷ_t = W_hy · h_t + b_y

(Apply softmax for classification, or use raw for regression)

Initialization: h₀ = 0 (zero vector of size d_h)

Why tanh? It squashes to [-1, +1], allowing the hidden state to represent both positive and negative correlations. ReLU is rarely used in vanilla RNNs because unbounded activations cause exploding hidden states during the recurrence.

The Vanilla RNN Equations (Boxed Result)

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
ŷ_t = softmax(W_hy · h_t + b_y)

Weight Sharing: The Same Brain at Every Step

Notice something remarkable: W_hh, W_xh, W_hy, b_h, b_y are the same at every time step. The RNN doesn't have separate "Day 1 weights" and "Day 2 weights." It applies the same transformation everywhere. This is analogous to how a CNN shares the same filter across spatial positions.

Parameter count: A vanilla RNN with d_h = 256 hidden units and d_x = 100 input features has only 256×256 + 256×100 + 256 = 91,392 parameters in the recurrent layer — regardless of whether the sequence has 10 steps or 10,000 steps. A feedforward network handling 10,000 steps would need millions of parameters.

Compact Notation

You'll often see the RNN equations written compactly by concatenating h_{t-1} and x_t:

h_t = tanh(W · [h_{t-1}; x_t] + b)

where W = [W_hh | W_xh] ∈ ℝ^(d_h × (d_h + d_x))

This concatenation trick simplifies implementation — you do one matrix multiply instead of two.

VANILLA RNN — UNROLLED VIEW (3 TIME STEPS) x₁ x₂ x₃ │ │ │ ▼ ▼ ▼ ┌──────┐ ┌──────┐ ┌──────┐ │W_xh │ │W_xh │ │W_xh │ ← SAME W_xh └──┬───┘ └──┬───┘ └──┬───┘ │ │ │ h₀──▶⊕───▶tanh──▶h₁──▶⊕───▶tanh──▶h₂──▶⊕───▶tanh──▶h₃ ▲ ▲ ▲ ┌──┴───┐ ┌──┴───┐ ┌──┴───┐ │W_hh │ │W_hh │ │W_hh │ ← SAME W_hh └──────┘ └──────┘ └──────┘ │ │ │ ▼ ▼ ▼ ŷ₁ ŷ₂ ŷ₃ │ │ │ ▼ ▼ ▼ L₁ L₂ L₃ Total Loss: L = L₁ + L₂ + L₃ (sum across time steps)

❌ MYTH: "Each copy of the RNN cell in the unrolled diagram has its own parameters."
✅ TRUTH: All copies share exactly the same W_hh, W_xh, W_hy. The unrolling is just a visualization trick for computing gradients.
🔍 WHY IT MATTERS: During BPTT, gradients from all time steps must be summed (not averaged) to update the shared parameters. If you treat them as separate, you'll implement BPTT incorrectly.

14.3 — Backpropagation Through Time (BPTT) — Full Derivation

Now comes the beautiful (and slightly painful) part. We need to compute gradients for learning. Since the same weights are used at every time step, we need to unroll the RNN through time and backpropagate through the entire unrolled graph.

Setup

Given a sequence of length T, the total loss is:

L = Σ_{t=1}^{T} L_t(ŷ_t, y_t)

We need: ∂L/∂W_hh, ∂L/∂W_xh, ∂L/∂W_hy

Step 1: Gradient of the output weights (Easy)

Since ŷ_t = W_hy · h_t + b_y, and each L_t depends only on ŷ_t:

∂L/∂W_hy = Σ_{t=1}^{T} ∂L_t/∂ŷ_t · h_t^⊤

This is straightforward — no recurrence involved.

Step 2: Gradient of the recurrent weights (The Hard Part)

Here's where it gets interesting. Consider ∂L/∂W_hh. The weight W_hh is used at every time step. So L_t depends on W_hh through the chain:

W_hh → h₁ → h₂ → ... → h_t → ŷ_t → L_t

By the chain rule:

∂L_t/∂W_hh = Σ_{k=1}^{t} ∂L_t/∂ŷ_t · ∂ŷ_t/∂h_t · ∂h_t/∂h_k · ∂h_k/∂W_hh

The key term is ∂h_t/∂h_k — the gradient flowing backwards from step t to step k.

Step 3: Expanding ∂h_t/∂h_k

Since h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h), we have:

∂h_t/∂h_{t-1} = diag(1 - h_t²) · W_hh

(where 1 - h_t² is the derivative of tanh, and diag() creates a diagonal matrix)

By the chain rule across multiple steps:

∂h_t/∂h_k = ∏_{i=k+1}^{t} ∂h_i/∂h_{i-1} = ∏_{i=k+1}^{t} diag(1 - h_i²) · W_hh

Step 4: The Complete BPTT Gradient

Putting it all together:

BPTT Gradient (Boxed Result)

∂L/∂W_hh = Σ_{t=1}^{T} Σ_{k=1}^{t} (∂L_t/∂h_t) · [∏_{i=k+1}^{t} diag(1 - h_i²) · W_hh] · (∂h_k⁺/∂W_hh)

where ∂h_k⁺/∂W_hh = diag(1 - h_k²) · h_{k-1}^⊤ (the "immediate" gradient at step k)

The total gradient for W_hh is the sum over all time steps t, and for each t, the sum over all preceding steps k ≤ t. This double sum is what makes BPTT computationally expensive — O(T²) in the worst case, though it's typically implemented as a single backward sweep in O(T).

Truncated BPTT

For very long sequences (T = 10,000+), full BPTT is impractical. Truncated BPTT limits backpropagation to the last τ steps:

∂h_t/∂h_k ≈ 0 for t - k > τ (truncation length)

Typical values: τ = 20 to 50. PyTorch's default behavior processes sequences in chunks of τ steps.

BPTT Quick Summary

1. Unroll RNN for T steps → compute all h₁, h₂, ..., h_T (forward pass) 2. Compute loss L = Σ L_t 3. Backpropagate: ∂L/∂h_T → ∂L/∂h_{T-1} → ... → ∂L/∂h_1 4. Accumulate: ∂L/∂W = Σ_t (gradient contribution from step t)

GATE tip: The gradient at step k involves a product of (t-k) Jacobians. This product is why gradients vanish or explode.

14.4 — The Vanishing Gradient Problem: A Rigorous Proof

This is arguably the most important theoretical result in this chapter. Let's prove it step by step.

Theorem: Gradients vanish exponentially with sequence length

Claim: For a vanilla RNN, if the largest singular value of W_hh satisfies σ_max(W_hh) < 1/γ where γ = max|tanh'(·)| = 1, then:

‖∂h_t/∂h_k‖ ≤ (σ_max(W_hh))^(t-k) → 0 as (t-k) → ∞

Proof:

Step 1: We derived that:

∂h_t/∂h_k = ∏_{i=k+1}^{t} diag(1 - h_i²) · W_hh

Step 2: Take the norm of each factor. For any matrix product AB, ‖AB‖ ≤ ‖A‖ · ‖B‖. So:

‖∂h_t/∂h_k‖ ≤ ∏_{i=k+1}^{t} ‖diag(1 - h_i²)‖ · ‖W_hh‖

Step 3: The diagonal matrix diag(1 - h_i²) has entries in [0, 1] (since tanh output is in [-1,1], so tanh² ∈ [0,1], so 1-tanh² ∈ [0,1]). Therefore:

‖diag(1 - h_i²)‖ ≤ 1

Step 4: Let σ = σ_max(W_hh) = ‖W_hh‖₂ (the spectral norm). Then:

‖∂h_t/∂h_k‖ ≤ ∏_{i=k+1}^{t} 1 · σ = σ^(t-k)

Step 5: If σ < 1, then σ^(t-k) → 0 exponentially as (t-k) → ∞. ∎

What this means intuitively:

The gradient signal from step t trying to reach step k gets multiplied by W_hh at each intervening step. If W_hh "shrinks" vectors (σ < 1), the signal is progressively crushed. After 20-30 steps, it's essentially zero. The RNN cannot learn dependencies longer than ~20 steps.

The exploding case:

If σ > 1, then σ^(t-k) → ∞. The gradients explode. This is easier to fix — just clip gradients: if ‖g‖ > threshold: g = threshold · g / ‖g‖

VANISHING GRADIENT VISUALIZATION Gradient signal flowing backward through time: Step 1 Step 2 Step 3 Step 4 Step 5 Step 10 Step 20 ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │██████│ │█████ │ │████ │ │███ │ │██ │ │▪ │ │ │ │██████│ │█████ │ │████ │ │███ │ │██ │ │ │ │ │ │██████│ │█████ │ │████ │ │███ │ │██ │ │ │ │ │ │██████│ │█████ │ │████ │ │███ │ │██ │ │ │ │ │ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ 100% 81% 66% 53% 43% 13% 1.5% σ_max = 0.9 → Gradient at step k = 0.9^(20-k) of original After 20 steps: 0.9^20 ≈ 0.012 = 1.2% of the original signal!

❌ MYTH: "The vanishing gradient means the gradient becomes zero."
✅ TRUTH: The gradient becomes exponentially small but never exactly zero. However, it becomes so small relative to floating-point precision (and relative to gradients from nearby time steps) that it's effectively invisible to the optimizer.
🔍 WHY IT MATTERS: This means the RNN can theoretically learn long-range dependencies, but in practice, the learning signal is drowned out by more recent, stronger gradients. The LSTM doesn't just fix the gradient — it creates an alternative pathway where the gradient can flow unimpeded.

"On the difficulty of training Recurrent Neural Networks" (Pascanu et al., 2013) — This seminal paper provided the formal analysis of vanishing/exploding gradients in RNNs, showing that the problem is related to the eigenvalues of the recurrent weight matrix. They also proposed gradient clipping as a practical solution for the exploding case. The vanishing case requires architectural changes (LSTM/GRU).

14.5 — LSTM: Long Short-Term Memory — Full Gate Derivation

The LSTM, introduced by Hochreiter & Schmidhuber in 1997 (and refined by Gers et al. in 2000), solves the vanishing gradient problem with a beautifully engineered architecture. Let's derive each component from the design requirement.

The Design Challenge

We need a recurrent unit that can:

Remember information for long durations (hundreds of time steps)
Selectively forget information that's no longer relevant
Selectively write new information when it matters
Selectively read from memory to produce output

Think of it as a bank locker system:

The cell state C_t is the locker itself — long-term storage
The forget gate decides which locker contents to discard
The input gate decides which new items to store
The output gate decides which items to take out and use right now

LSTM Gate Equations — Derived Step by Step

Notation: At time step t, the LSTM receives input x_t ∈ ℝ^(d_x) and previous hidden state h_{t-1} ∈ ℝ^(d_h). It also maintains a cell state C_{t-1} ∈ ℝ^(d_h).

Gate 1: The Forget Gate f_t — "What should I erase from memory?"

We need a gate that looks at the current input x_t and previous hidden state h_{t-1} and decides, for each element of the cell state, whether to keep it (1) or erase it (0). A sigmoid function gives us values in [0, 1]:

f_t = σ(W_f · [h_{t-1}; x_t] + b_f)

→ f_t ∈ (0, 1)^(d_h). A value of 0.9 means "keep 90% of this memory." A value of 0.01 means "almost completely forget this."

Gate 2: The Input Gate i_t — "What new information is worth storing?"

We need two things: (a) a gate deciding how much to write, and (b) the actual content to write.

i_t = σ(W_i · [h_{t-1}; x_t] + b_i) ← how much to write (0 to 1)

C̃_t = tanh(W_C · [h_{t-1}; x_t] + b_C) ← what to write (-1 to 1, the "candidate")

→ i_t acts as a valve controlling how much of the candidate C̃_t flows into memory.

The Cell State Update — "Erase old + Write new"

Now we combine: erase what the forget gate says to erase, then add what the input gate says to add:

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t

→ ⊙ is element-wise multiplication. This is the key equation. Notice: the cell state update is an additive operation (not multiplicative like vanilla RNN's h_t = tanh(W·h_{t-1}+...)). This additive nature is what prevents vanishing gradients!

Gate 3: The Output Gate o_t — "What should I output right now?"

The cell state contains everything in memory, but we only want to output the relevant parts:

o_t = σ(W_o · [h_{t-1}; x_t] + b_o)

h_t = o_t ⊙ tanh(C_t)

→ tanh squashes C_t to [-1, 1], then the output gate selects which dimensions to expose.

LSTM Equations — Complete Boxed Summary

Forget gate:    f_t = σ(W_f · [h_{t-1}; x_t] + b_f)
Input gate:     i_t = σ(W_i · [h_{t-1}; x_t] + b_i)
Candidate:     C̃_t = tanh(W_C · [h_{t-1}; x_t] + b_C)
Cell update:    C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
Output gate:    o_t = σ(W_o · [h_{t-1}; x_t] + b_o)
Hidden state: h_t = o_t ⊙ tanh(C_t)

Parameters: W_f, W_i, W_C, W_o ∈ ℝ^(d_h × (d_h + d_x)), b_f, b_i, b_C, b_o ∈ ℝ^(d_h)

Why Does This Solve Vanishing Gradients?

Look at the cell state update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t

The gradient of C_t with respect to C_{t-1} is:

∂C_t/∂C_{t-1} = diag(f_t)

For long-range dependency, the gradient across T steps is:

∂C_T/∂C_1 = ∏_{t=2}^{T} diag(f_t)

If the forget gate values are close to 1 (meaning "remember everything"), this product stays close to 1. The cell state acts as a gradient highway — gradients can flow unchanged across hundreds of time steps. The forget gate learned value of ~1 means "let the gradient through."

Forget gate bias initialization: Initialize b_f to +1 or +2 (instead of 0). This makes σ(b_f) ≈ 0.73 or 0.88, so the LSTM starts by remembering rather than forgetting. This trick, proposed by Jozefowicz et al. (2015), significantly improves LSTM training. PyTorch's nn.LSTM does NOT do this by default — you must set it manually.

LSTM CELL — INTERNAL ARCHITECTURE ┌─────────────────────────────────────────────┐ │ LSTM CELL │ │ │ C_{t-1} ───────▶│──── ⊙ ──────────── ⊕ ───────────▶ C_t │ │ ▲ ▲ │ │ │ │ │ │ f_t i_t ⊙ C̃_t │ │ (forget) (input × candidate) │ │ ▲ ▲ ▲ │ │ ┌──┴──┐ ┌──┴─┐ ┌┴────┐ │ │ │ σ │ │ σ │ │tanh │ │ │ └──┬──┘ └──┬─┘ └┬────┘ │ │ │ │ │ │ h_{t-1} ───────▶│─────┼──────────────┼────┼─────┐ │ │ │ │ │ │ │ x_t ───────────▶│─────┴──────────────┴────┴──┬──┤ │ │ │ │ │ │ ┌──┴──┐ │ │ tanh(C_t) ◀──── │ o_t │ ← σ │ │ │ └──┬──┘ │ │ ▼ │ │ │ ⊙ ◀─────────────┘ │ │ │ │ │ ▼ │ │ h_t ──────────────────────▶ h_t │ └─────────────────────────────────────────────┘ FLOW SUMMARY: ┌─────────┐ ┌──────┐ ┌─────────┐ ┌──────────┐ │ Forget │──▶│ Erase│──▶│ Write │──▶│ Output │──▶ h_t │ Gate │ │ Old │ │ New │ │ Filter │ └─────────┘ └──────┘ └─────────┘ └──────────┘ "What to f_t⊙C +i_t⊙C̃_t o_t⊙tanh(C_t) forget?"

LSTM Parameter Count

An LSTM with d_h hidden units and d_x input features has 4 gates, each with a weight matrix of size d_h × (d_h + d_x) and a bias of size d_h:

LSTM parameters = 4 × [d_h × (d_h + d_x) + d_h] = 4 × d_h × (d_h + d_x + 1)

For d_h = 256, d_x = 100: 4 × 256 × 357 = 365,568 parameters — about 4× a vanilla RNN.

The LSTM was published in 1997 by Sepp Hochreiter and Jürgen Schmidhuber, but it didn't become popular until ~2013 when GPU computing made training large LSTMs practical. Alex Graves at DeepMind used LSTMs to win handwriting recognition competitions, and Google adopted them for Google Translate in 2016. The LSTM had to wait 16 years for hardware to catch up with theory.

14.6 — GRU: Gated Recurrent Unit — The Streamlined Alternative

The GRU, introduced by Cho et al. (2014), asks: "Can we get LSTM-like performance with fewer parameters?" The answer: merge the forget and input gates into one update gate, and merge the cell state and hidden state into a single state vector.

GRU Equations — Derived from LSTM Simplification

Key simplifications:

No separate cell state C_t — just the hidden state h_t
The "input gate" is replaced by (1 - z_t), ensuring f + i = 1 (what you forget, you replace with new)
The output gate is removed — the full hidden state is exposed

Reset Gate r_t — "How much of the past hidden state to use when computing new candidate?"

r_t = σ(W_r · [h_{t-1}; x_t] + b_r)

Update Gate z_t — "What fraction of h_{t-1} to keep vs. replace?"

z_t = σ(W_z · [h_{t-1}; x_t] + b_z)

Candidate hidden state h̃_t — "What would the new state be?"

h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}; x_t] + b_h)

→ The reset gate r_t controls how much of h_{t-1} is visible when computing the candidate. If r_t ≈ 0, the GRU "resets" and acts like a vanilla neuron seeing only x_t.

Hidden state update — "Interpolate between old and new"

h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t

→ When z_t ≈ 1: keep old state (remember). When z_t ≈ 0: use new candidate (update).

GRU Equations — Complete Boxed Summary

Reset gate: r_t = σ(W_r · [h_{t-1}; x_t] + b_r)
Update gate: z_t = σ(W_z · [h_{t-1}; x_t] + b_z)
Candidate: h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}; x_t] + b_h)
Hidden state: h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t

Parameters: 3 weight matrices (vs. LSTM's 4) → 25% fewer parameters

LSTM vs GRU: Head-to-Head Comparison

Feature	LSTM	GRU
Gates	3 (forget, input, output) + candidate	2 (update, reset) + candidate
State vectors	2 (h_t and C_t)	1 (h_t only)
Parameters (d_h=256, d_x=100)	365,568	274,176 (25% fewer)
Training speed	Slower	~15-20% faster
Long-range memory	Stronger (dedicated cell state)	Slightly weaker
Best for	Long sequences, complex tasks	Shorter sequences, limited compute
Who uses it	Google Translate (2016), DeepMind	Baidu speech, many research papers

🇮🇳 India — LSTM vs GRU in Practice

Flipkart: Uses GRUs for real-time session-based recommendations (latency matters more than max accuracy)
CRIS (Indian Railways): LSTM for long-term demand forecasting (needs to remember seasonal patterns from months ago)
Bhashini: LSTM-based transliteration for 22 scheduled languages

🇺🇸 USA — LSTM vs GRU in Practice

Google Translate (2016-2019): 8-layer LSTM encoder-decoder before switching to Transformer
Amazon Alexa: GRU for on-device wake word detection (small model, fast inference)
Netflix: LSTM for viewing pattern prediction and content recommendation

14.7 — Bidirectional RNNs: Reading Forward and Backward

Consider the sentence: "The Rajdhani arrived at the _____ on platform 3." To fill the blank, you need context from both sides: "Rajdhani" (before) suggests a train station, and "platform 3" (after) confirms it. A unidirectional RNN reading left-to-right never sees "platform 3" when processing the blank.

Architecture

A Bidirectional RNN runs two separate RNNs:

Forward RNN: reads x₁ → x₂ → ... → x_T, producing hidden states h̄₁ᶠ, h̄₂ᶠ, ..., h̄_Tᶠ
Backward RNN: reads x_T → x_{T-1} → ... → x₁, producing hidden states h̄₁ᵇ, h̄₂ᵇ, ..., h̄_Tᵇ

The final hidden state at each step is the concatenation:

h_t = [h̄_tᶠ ; h̄_tᵇ] ∈ ℝ^(2·d_h)

BIDIRECTIONAL RNN x₁ x₂ x₃ x₄ │ │ │ │ ▼ ▼ ▼ ▼ ┌───┐ → ┌───┐ → ┌───┐ → ┌───┐ Forward │h₁ᶠ│───▶│h₂ᶠ│───▶│h₃ᶠ│───▶│h₄ᶠ│ RNN └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ │ │ │ │ ⊕ concat ⊕ ⊕ ⊕ → Output │ │ │ │ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ │h₁ᵇ│◀───│h₂ᵇ│◀───│h₃ᵇ│◀───│h₄ᵇ│ Backward └───┘ ← └───┘ ← └───┘ ← └───┘ RNN ▲ ▲ ▲ ▲ │ │ │ │ x₁ x₂ x₃ x₄

When to Use Bidirectional RNNs

Use BiRNN ✅	Don't Use BiRNN ❌
Named Entity Recognition (full sentence available)	Real-time speech recognition (future words unknown)
Sentiment classification (entire review available)	Language generation / text prediction
Machine translation encoder	Time-series forecasting (future data unavailable)
Part-of-speech tagging	Online/streaming applications

❌ MYTH: "Bidirectional RNNs are always better than unidirectional."
✅ TRUTH: BiRNNs require the entire sequence to be available before processing. You cannot use them for real-time generation or prediction tasks where future context is unknown.
🔍 WHY IT MATTERS: Using a BiRNN for time-series forecasting is a common bug — it means your model is "cheating" by peeking at future data. Your training loss will look great, but deployment will fail spectacularly.

14.8 — Deep (Stacked) RNNs: Going Deeper in the Vertical Dimension

Just as deeper CNNs learn more abstract features, stacking multiple RNN layers creates a hierarchy of temporal abstractions.

STACKED RNN (3 LAYERS) x₁ x₂ x₃ │ │ │ Layer 1: ▼ ▼ ▼ ┌───┐ ┌───┐ ┌───┐ │h¹₁│───▶│h¹₂│───▶│h¹₃│ ← Low-level patterns └─┬─┘ └─┬─┘ └─┬─┘ (individual words/features) │ │ │ Layer 2: ▼ ▼ ▼ ┌───┐ ┌───┐ ┌───┐ │h²₁│───▶│h²₂│───▶│h²₃│ ← Mid-level patterns └─┬─┘ └─┬─┘ └─┬─┘ (phrases/short-term trends) │ │ │ Layer 3: ▼ ▼ ▼ ┌───┐ ┌───┐ ┌───┐ │h³₁│───▶│h³₂│───▶│h³₃│ ← High-level patterns └─┬─┘ └─┬─┘ └─┬─┘ (semantics/long-term trends) │ │ │ ▼ ▼ ▼ ŷ₁ ŷ₂ ŷ₃

Stacked RNN: h_t^(l) = f(h_{t-1}^(l), h_t^(l-1); θ^(l))

Layer l receives input from layer (l-1) at the same time step,
and from layer l at the previous time step.

Practical depth: Unlike CNNs (100+ layers), RNNs rarely go beyond 2-4 layers. The recurrent connections already create a deep computational graph through time — a 3-layer LSTM processing a 100-step sequence has 300 effective layers of computation. Adding dropout between layers is essential:

PyTorch
# 3-layer LSTM with dropout
lstm = nn.LSTM(input_size=100, hidden_size=256,
               num_layers=3, dropout=0.3, batch_first=True)

14.9 — Sequence-to-Sequence Preview (→ Chapter 15)

Many tasks require mapping one sequence to another sequence of different length: translation (English → Hindi), summarization (long article → short summary), chatbot (question → answer). The RNN architectures we've seen so far can't directly handle this because they produce outputs at each input step.

The Encoder-Decoder Architecture

SEQUENCE-TO-SEQUENCE (ENCODER-DECODER) ENCODER (reads input, builds context) DECODER (generates output) ┌─────────────────────────────┐ ┌────────────────────────────┐ │ │ │ │ │ "How" "are" "you" "?" │ │ "आप" "कैसे" "हैं" "?" │ │ │ │ │ │ │ │ ▲ ▲ ▲ ▲ │ │ ▼ ▼ ▼ ▼ │ │ │ │ │ │ │ │ ┌──┐ ┌──┐ ┌──┐ ┌──┐ │ context│ ┌──┐ ┌──┐ ┌──┐ ┌──┐│ │ │h₁│──▶│h₂│──▶│h₃│─▶│h₄│─┼────────▶──│d₁│──▶│d₂│──▶│d₃│─▶│d₄││ │ └──┘ └──┘ └──┘ └──┘ │ vector │ └──┘ └──┘ └──┘ └──┘│ │ │ │ │ │ │ │ │ └─────────────────────────────┘ │ ▼ ▼ ▼ ▼ │ │ "आप" "कैसे" "हैं" EOS │ └────────────────────────────┘ The "context vector" = h₄ (encoder's final hidden state) Problem: ALL information must be compressed into ONE vector! Solution: Attention (Chapter 15)

The encoder reads the input sequence and compresses it into a context vector (the final hidden state). The decoder then generates the output sequence, one token at a time, conditioned on this context vector. The critical bottleneck: all input information must pass through a single fixed-size vector. This limitation directly motivated the invention of attention mechanisms (Bahdanau et al., 2015), which we'll derive in Chapter 15.

Roles that use RNN/LSTM architectures:

NLP Engineer — Sequence modeling, text generation, named entity recognition (₹15-40 LPA in India, $140-200K in US)
Time-Series ML Engineer — Demand forecasting, anomaly detection at scale (IRCTC, Flipkart, Amazon)
Speech/Audio ML Engineer — Wake word detection, ASR systems (Google, Amazon, Alexa)
Quantitative Researcher — Financial time-series prediction (Goldman Sachs, Two Sigma)

Section 5

Worked Examples

Example 1: RNN Forward Pass by Hand

Let's trace through a vanilla RNN with d_h = 2, d_x = 2, processing a 3-step sequence.

Given:

W_hh = [[0.5, -0.1], [0.2, 0.6]], W_xh = [[0.3, 0.7], [-0.2, 0.4]]

b_h = [0, 0], h₀ = [0, 0]

Input sequence: x₁ = [1, 0], x₂ = [0, 1], x₃ = [1, 1]

Step 1 (t=1):

z₁ = W_hh · h₀ + W_xh · x₁ + b_h

= [[0.5,-0.1],[0.2,0.6]] · [0,0] + [[0.3,0.7],[-0.2,0.4]] · [1,0] + [0,0]

= [0,0] + [0.3, -0.2] + [0,0] = [0.3, -0.2]

h₁ = tanh([0.3, -0.2]) = [0.291, -0.197]

Step 2 (t=2):

z₂ = W_hh · h₁ + W_xh · x₂ + b_h

= [[0.5,-0.1],[0.2,0.6]] · [0.291,-0.197] + [[0.3,0.7],[-0.2,0.4]] · [0,1]

= [0.5×0.291 + (-0.1)×(-0.197), 0.2×0.291 + 0.6×(-0.197)] + [0.7, 0.4]

= [0.1455+0.0197, 0.0582-0.1182] + [0.7, 0.4]

= [0.165, -0.060] + [0.7, 0.4] = [0.865, 0.340]

h₂ = tanh([0.865, 0.340]) = [0.700, 0.328]

Step 3 (t=3):

z₃ = W_hh · h₂ + W_xh · x₃ + b_h

= [0.5×0.700-0.1×0.328, 0.2×0.700+0.6×0.328] + [0.3+0.7, -0.2+0.4]

= [0.350-0.033, 0.140+0.197] + [1.0, 0.2]

= [0.317, 0.337] + [1.0, 0.2] = [1.317, 0.537]

h₃ = tanh([1.317, 0.537]) = [0.864, 0.491]

Key Observation:

h₃ = [0.864, 0.491] encodes information from ALL three inputs. Even though we only explicitly provided x₃ at step 3, h₃ carries a "compressed memory" of x₁ and x₂ through the recurrent connections.

Example 2: IRCTC Demand Forecasting with LSTM (Indian Industry)

🇮🇳 IRCTC — LSTM for 25 Lakh Daily Booking Predictions

Problem Setup

Predict tomorrow's booking count for the Delhi-Mumbai Rajdhani Express using the past 30 days of data.

Features per time step (d_x = 8):

Feature	Example Value	Type
Daily bookings (normalized)	0.72	Continuous
Day of week (one-hot → embedded)	0.45 (Tuesday)	Categorical
Festival proximity (days to next festival)	4 (Diwali in 4 days)	Integer
Monsoon indicator (0/1)	1 (active monsoon)	Binary
Cancellation rate	0.12	Continuous
Average ticket price	0.65 (normalized)	Continuous
Waitlist length	0.89 (normalized)	Continuous
Holiday flag	0	Binary

LSTM Walk-through for 3 Key Days:

Day 28 (normal day): f_28 ≈ [0.9, 0.85, ...] — keep most memory. i_28 ≈ [0.3, 0.2, ...] — mild update. The cell state smoothly carries forward the seasonal trend.

Day 29 (festival announced): f_29 ≈ [0.4, 0.3, ...] — the forget gate aggressively erases the "normal pattern" memory! i_29 ≈ [0.9, 0.95, ...] — the input gate opens wide to write "festival mode" into the cell state. This is the LSTM learning to forget and rewrite.

Day 30 (prediction day): The cell state now contains: long-term seasonal trend (from weeks ago) + festival spike pattern (from Day 29) + recent momentum. The output gate selects the most relevant dimensions, producing h_30, which is mapped to the predicted booking count: 22,450 bookings (vs. actual 22,180 — 1.2% error).

Example 3: Google Translate — Encoder-Decoder LSTM (US/Global)

🇺🇸 Google Translate — 8-Layer LSTM Encoder-Decoder (Pre-Transformer Era, 2016)

Architecture: Google's Neural Machine Translation (GNMT)

Published as "Google's Neural Machine Translation System" (Wu et al., 2016), this was one of the largest LSTM deployments ever:

Encoder: 8 stacked LSTM layers (1 bidirectional + 7 unidirectional), d_h = 1024
Decoder: 8 stacked LSTM layers, d_h = 1024
Attention: Added Bahdanau-style attention between encoder and decoder
Parameters: ~210 million total
Training: 96 NVIDIA K80 GPUs for 6 days

Example Translation (English → French):

Input: "The cat sat on the mat"

Encoder processes: "The" → h₁, "cat" → h₂, "sat" → h₃, "on" → h₄, "the" → h₅, "mat" → h₆

The encoder's final state (h₆) + attention over all h₁-h₆ feeds into the decoder.

Decoder generates: "Le" → "chat" → "s'est" → "assis" → "sur" → "le" → "tapis" → <EOS>

Why Transformers Replaced LSTMs:

Sequential bottleneck: LSTMs process tokens one-by-one — no parallelism. A 50-word sentence requires 50 sequential steps.
Context compression: Even with attention, the LSTM must process the entire sequence before the decoder starts.
Long-range dependencies: Despite LSTM's improvements over vanilla RNN, 100+ token dependencies remain challenging.

→ Transformers solved all three problems with self-attention (Chapter 15).

Section 6

Python Implementation — From Scratch (NumPy)

6.1 — Vanilla RNN Class

Python / NumPy
import numpy as np

class VanillaRNN:
    """Vanilla RNN implemented from scratch with BPTT."""

    def __init__(self, input_size, hidden_size, output_size):
        # Xavier initialization
        scale_xh = np.sqrt(2.0 / (input_size + hidden_size))
        scale_hh = np.sqrt(2.0 / (hidden_size + hidden_size))
        scale_hy = np.sqrt(2.0 / (hidden_size + output_size))

        self.W_xh = np.random.randn(hidden_size, input_size) * scale_xh
        self.W_hh = np.random.randn(hidden_size, hidden_size) * scale_hh
        self.W_hy = np.random.randn(output_size, hidden_size) * scale_hy
        self.b_h  = np.zeros((hidden_size, 1))
        self.b_y  = np.zeros((output_size, 1))

    def forward(self, inputs, h_prev):
        """
        inputs: list of T input vectors, each (input_size, 1)
        h_prev: initial hidden state (hidden_size, 1)
        Returns: outputs (list of T output vectors), hiddens, h_last
        """
        hiddens = {-1: h_prev.copy()}  # store for BPTT
        outputs = []

        for t, x_t in enumerate(inputs):
            # RNN forward: h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
            z_t = self.W_hh @ hiddens[t-1] + self.W_xh @ x_t + self.b_h
            h_t = np.tanh(z_t)
            hiddens[t] = h_t

            # Output: y_t = softmax(W_hy · h_t + b_y)
            logits = self.W_hy @ h_t + self.b_y
            probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax
            outputs.append(probs)

        return outputs, hiddens, hiddens[len(inputs)-1]

    def bptt(self, inputs, targets, hiddens, outputs):
        """Backpropagation Through Time — compute all gradients."""
        T = len(inputs)
        # Initialize gradients
        dW_xh = np.zeros_like(self.W_xh)
        dW_hh = np.zeros_like(self.W_hh)
        dW_hy = np.zeros_like(self.W_hy)
        db_h  = np.zeros_like(self.b_h)
        db_y  = np.zeros_like(self.b_y)

        dh_next = np.zeros_like(self.b_h)  # gradient from future
        loss = 0.0

        for t in reversed(range(T)):
            # Cross-entropy loss gradient: dy = probs - one_hot(target)
            dy = outputs[t].copy()
            dy[targets[t]] -= 1.0
            loss += -np.log(outputs[t][targets[t], 0] + 1e-8)

            # Output layer gradients
            dW_hy += dy @ hiddens[t].T
            db_y  += dy

            # Gradient into hidden state
            dh = self.W_hy.T @ dy + dh_next

            # Backprop through tanh: dtanh = (1 - h²) ⊙ dh
            dz = (1.0 - hiddens[t] ** 2) * dh

            # Accumulate gradients for shared weights
            dW_xh += dz @ inputs[t].T
            dW_hh += dz @ hiddens[t-1].T
            db_h  += dz

            # Pass gradient to previous time step
            dh_next = self.W_hh.T @ dz

        # Gradient clipping to prevent explosions
        for grad in [dW_xh, dW_hh, dW_hy, db_h, db_y]:
            np.clip(grad, -5, 5, out=grad)

        return loss, dW_xh, dW_hh, dW_hy, db_h, db_y

    def update(self, grads, lr=0.01):
        """SGD parameter update."""
        dW_xh, dW_hh, dW_hy, db_h, db_y = grads
        self.W_xh -= lr * dW_xh
        self.W_hh -= lr * dW_hh
        self.W_hy -= lr * dW_hy
        self.b_h  -= lr * db_h
        self.b_y  -= lr * db_y

6.2 — LSTM Class from Scratch

Python / NumPy
class LSTMCell:
    """Single LSTM cell implemented from scratch."""

    def __init__(self, input_size, hidden_size):
        self.d_h = hidden_size
        self.d_x = input_size
        concat_size = hidden_size + input_size

        # Initialize all 4 gate weight matrices
        scale = np.sqrt(2.0 / concat_size)
        self.W_f = np.random.randn(hidden_size, concat_size) * scale
        self.W_i = np.random.randn(hidden_size, concat_size) * scale
        self.W_c = np.random.randn(hidden_size, concat_size) * scale
        self.W_o = np.random.randn(hidden_size, concat_size) * scale

        # Biases — NOTE: forget gate bias initialized to 1.0!
        self.b_f = np.ones((hidden_size, 1))   # ← Important trick!
        self.b_i = np.zeros((hidden_size, 1))
        self.b_c = np.zeros((hidden_size, 1))
        self.b_o = np.zeros((hidden_size, 1))

    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

    def forward_step(self, x_t, h_prev, C_prev):
        """Single time step forward pass.
        x_t:    (input_size, 1)
        h_prev: (hidden_size, 1)
        C_prev: (hidden_size, 1)
        Returns: h_t, C_t, cache (for backprop)
        """
        # Concatenate h_{t-1} and x_t
        concat = np.vstack([h_prev, x_t])  # (d_h + d_x, 1)

        # Forget gate: f_t = σ(W_f · [h_{t-1}; x_t] + b_f)
        f_t = self.sigmoid(self.W_f @ concat + self.b_f)

        # Input gate: i_t = σ(W_i · [h_{t-1}; x_t] + b_i)
        i_t = self.sigmoid(self.W_i @ concat + self.b_i)

        # Candidate: C̃_t = tanh(W_c · [h_{t-1}; x_t] + b_c)
        C_tilde = np.tanh(self.W_c @ concat + self.b_c)

        # Cell state update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
        C_t = f_t * C_prev + i_t * C_tilde

        # Output gate: o_t = σ(W_o · [h_{t-1}; x_t] + b_o)
        o_t = self.sigmoid(self.W_o @ concat + self.b_o)

        # Hidden state: h_t = o_t ⊙ tanh(C_t)
        h_t = o_t * np.tanh(C_t)

        # Cache for BPTT
        cache = (x_t, h_prev, C_prev, concat, f_t, i_t, C_tilde, C_t, o_t, h_t)
        return h_t, C_t, cache

    def forward(self, inputs, h0=None, C0=None):
        """Process entire sequence."""
        T = len(inputs)
        h = h0 if h0 is not None else np.zeros((self.d_h, 1))
        C = C0 if C0 is not None else np.zeros((self.d_h, 1))

        hiddens, cells, caches = [], [], []
        for t in range(T):
            h, C, cache = self.forward_step(inputs[t], h, C)
            hiddens.append(h)
            cells.append(C)
            caches.append(cache)

        return hiddens, cells, caches

    def backward_step(self, dh_next, dC_next, cache):
        """Single time step backward pass."""
        x_t, h_prev, C_prev, concat, f_t, i_t, C_tilde, C_t, o_t, h_t = cache

        # Gradient through h_t = o_t ⊙ tanh(C_t)
        tanh_C = np.tanh(C_t)
        do = dh_next * tanh_C
        dC = dh_next * o_t * (1 - tanh_C**2) + dC_next

        # Gradient through C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
        df = dC * C_prev
        di = dC * C_tilde
        dC_tilde = dC * i_t
        dC_prev = dC * f_t

        # Gradient through activations
        df_raw = df * f_t * (1 - f_t)        # sigmoid derivative
        di_raw = di * i_t * (1 - i_t)
        dC_tilde_raw = dC_tilde * (1 - C_tilde**2)  # tanh derivative
        do_raw = do * o_t * (1 - o_t)

        # Gradients for weights and biases
        dW_f = df_raw @ concat.T
        dW_i = di_raw @ concat.T
        dW_c = dC_tilde_raw @ concat.T
        dW_o = do_raw @ concat.T
        db_f, db_i, db_c, db_o = df_raw, di_raw, dC_tilde_raw, do_raw

        # Gradient flowing to previous time step
        d_concat = (self.W_f.T @ df_raw + self.W_i.T @ di_raw +
                    self.W_c.T @ dC_tilde_raw + self.W_o.T @ do_raw)
        dh_prev = d_concat[:self.d_h]
        dx_t = d_concat[self.d_h:]

        return dh_prev, dC_prev, dW_f, dW_i, dW_c, dW_o, db_f, db_i, db_c, db_o


# ── Quick test ──
np.random.seed(42)
lstm = LSTMCell(input_size=4, hidden_size=3)
inputs = [np.random.randn(4, 1) for _ in range(5)]
hiddens, cells, caches = lstm.forward(inputs)
print(f"Final hidden state shape: {hiddens[-1].shape}")
print(f"Final hidden state:\n{hiddens[-1].flatten()}")
print(f"Final cell state:\n{cells[-1].flatten()}")

Final hidden state shape: (3, 1) Final hidden state: [-0.1342 0.0876 -0.2158] Final cell state: [-0.2751 0.1544 -0.4023]

6.3 — Character-Level Language Model on Hindi Text

Python / NumPy
# Character-level RNN trained on Hindi text
import numpy as np

# Sample Hindi text (Devanagari)
text = "नमस्ते भारत। यह एक गहन शिक्षण पाठ्यपुस्तक है। " * 50

# Build character vocabulary
chars = sorted(set(text))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
print(f"Vocabulary size: {vocab_size} unique Hindi characters")

# Initialize RNN
hidden_size = 64
rnn = VanillaRNN(vocab_size, hidden_size, vocab_size)

# Training loop
seq_length = 25
learning_rate = 0.01
smooth_loss = -np.log(1.0 / vocab_size) * seq_length

for epoch in range(1000):
    h_prev = np.zeros((hidden_size, 1))

    for start in range(0, len(text) - seq_length - 1, seq_length):
        # Prepare one-hot encoded inputs and targets
        inputs_seq = []
        targets = []
        for i in range(seq_length):
            x = np.zeros((vocab_size, 1))
            x[char_to_idx[text[start + i]]] = 1.0
            inputs_seq.append(x)
            targets.append(char_to_idx[text[start + i + 1]])

        # Forward pass
        outputs, hiddens, h_prev = rnn.forward(inputs_seq, h_prev)

        # Backward pass (BPTT)
        loss, *grads = rnn.bptt(inputs_seq, targets, hiddens, outputs)
        smooth_loss = 0.999 * smooth_loss + 0.001 * loss

        # Update weights
        rnn.update(grads, lr=learning_rate)

    if epoch % 100 == 0:
        print(f"Epoch {epoch:4d} | Loss: {smooth_loss:.4f}")

print("Training complete!")

Bug Hunt: Find the error in this RNN implementation

def rnn_forward(x_sequence, W_hh, W_xh, b):
    h = np.zeros((64, 1))
    for x_t in x_sequence:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
        h = np.zeros((64, 1))  # Reset for next step
    return h

What's wrong? The hidden state h is reset to zeros after every step! This destroys the memory — the entire point of an RNN. The line h = np.zeros((64, 1)) should be removed. With this bug, the RNN is equivalent to a feedforward network applied independently to each time step.

Section 7

Python Implementation — PyTorch Library Version

7.1 — LSTM for Time-Series Forecasting (IRCTC Demand)

PyTorch
import torch
import torch.nn as nn
import numpy as np

class IRCTCDemandLSTM(nn.Module):
    """LSTM model for IRCTC daily booking prediction."""

    def __init__(self, input_size=8, hidden_size=128, num_layers=2,
                 dropout=0.2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Stacked LSTM with dropout between layers
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout,
            batch_first=True   # Input shape: (batch, seq_len, features)
        )

        # Fully connected output: predict tomorrow's bookings
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 64),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 1)   # Single output: booking count
        )

        # Initialize forget gate bias to 1.0 (critical trick!)
        for name, param in self.lstm.named_parameters():
            if 'bias' in name:
                n = param.size(0)
                # LSTM bias is [b_i, b_f, b_c, b_o] concatenated
                param.data[n//4:n//2].fill_(1.0)  # forget gate bias = 1

    def forward(self, x):
        """
        x: (batch_size, seq_len=30, features=8)
        Returns: (batch_size, 1) — predicted bookings
        """
        # Initialize hidden state and cell state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size,
                         device=x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size,
                         device=x.device)

        # LSTM forward pass
        lstm_out, (h_n, c_n) = self.lstm(x, (h0, c0))
        # lstm_out shape: (batch, 30, 128) — all hidden states
        # h_n shape: (2, batch, 128) — final hidden state per layer

        # Use the last time step's output for prediction
        last_hidden = lstm_out[:, -1, :]  # (batch, 128)
        prediction = self.fc(last_hidden)  # (batch, 1)

        return prediction


# ── Training ──
model = IRCTCDemandLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Synthetic data simulating IRCTC bookings
np.random.seed(42)
T_total = 365  # 1 year of daily data
seq_len = 30   # Look back 30 days

# Generate features: bookings with trend + seasonality + festivals
days = np.arange(T_total)
bookings = (15000 + 3000 * np.sin(2 * np.pi * days / 7)    # weekly
            + 5000 * np.sin(2 * np.pi * days / 365)  # yearly
            + np.random.randn(T_total) * 1000)           # noise

# Normalize
mean_b, std_b = bookings.mean(), bookings.std()
bookings_norm = (bookings - mean_b) / std_b

# Create sequences
X, y = [], []
for i in range(len(bookings_norm) - seq_len):
    features = np.column_stack([
        bookings_norm[i:i+seq_len],          # bookings
        np.sin(2*np.pi*days[i:i+seq_len]/7),  # day-of-week
        np.cos(2*np.pi*days[i:i+seq_len]/7),
        np.sin(2*np.pi*days[i:i+seq_len]/365),# month-of-year
        np.cos(2*np.pi*days[i:i+seq_len]/365),
        np.random.rand(seq_len),              # monsoon proxy
        np.random.rand(seq_len),              # festival proxy
        np.random.rand(seq_len),              # price proxy
    ])
    X.append(features)
    y.append(bookings_norm[i+seq_len])

X = torch.FloatTensor(np.array(X))
y = torch.FloatTensor(np.array(y)).unsqueeze(1)

# Train
model.train()
for epoch in range(100):
    pred = model(X)
    loss = criterion(pred, y)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d} | MSE Loss: {loss.item():.6f}")

print(f"\nModel parameters: {sum(p.numel() for p in model.parameters()):,}")

7.2 — Comparing RNN, LSTM, GRU in PyTorch

PyTorch
# Quick comparison of RNN variants
import torch.nn as nn

configs = {
    "Vanilla RNN": nn.RNN(input_size=8, hidden_size=128, num_layers=2,
                          batch_first=True),
    "LSTM":        nn.LSTM(input_size=8, hidden_size=128, num_layers=2,
                           batch_first=True),
    "GRU":         nn.GRU(input_size=8, hidden_size=128, num_layers=2,
                          batch_first=True),
    "BiLSTM":      nn.LSTM(input_size=8, hidden_size=128, num_layers=2,
                           batch_first=True, bidirectional=True),
}

print(f"{'Model':<15} {'Parameters':>12} {'Output Shape':>18}")
print("─" * 48)

x = torch.randn(32, 30, 8)  # batch=32, seq=30, features=8

for name, layer in configs.items():
    params = sum(p.numel() for p in layer.parameters())
    out = layer(x)[0]
    print(f"{name:<15} {params:>12,} {str(out.shape):>18}")

Model Parameters Output Shape ──────────────────────────────────────────────── Vanilla RNN 35,200 torch.Size([32, 30, 128]) LSTM 140,288 torch.Size([32, 30, 128]) GRU 105,472 torch.Size([32, 30, 128]) BiLSTM 280,576 torch.Size([32, 30, 256])

Section 8

Visual Diagrams

8.1 — RNN Architectures Taxonomy

RNN ARCHITECTURE TYPES ONE-TO-ONE ONE-TO-MANY MANY-TO-ONE MANY-TO-MANY MANY-TO-MANY (Vanilla NN) (Image Caption) (Sentiment) (Sync: NER) (Async: Translation) x x x₁ x₂ x₃ x₁ x₂ x₃ x₁ x₂ x₃ │ │ │ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ┌──┐ ┌──┐──┐──┐ ┌──┬──┬──┐ ┌──┬──┬──┐ ┌──┬──┬──┐ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │→ context └──┘ └──┘──┘──┘ └──┴──┴──┘ └──┴──┴──┘ └──┘──┘──┘ │ │ │ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ┌──┬──┬──┐ y y₁ y₂ y₃ y y₁ y₂ y₃ │ │ │ │ └──┘──┘──┘ Not actually Music generation Movie review POS tagging │ │ │ recurrent! from seed note → ⭐⭐⭐⭐ word → tag ▼ ▼ ▼ y₁ y₂ y₃ (Encoder-Decoder)

8.2 — LSTM vs GRU Gate Comparison

LSTM (4 components) GRU (3 components) ═══════════════════ ═══════════════════ C_{t-1} ──⊙─────⊕──▶ C_t (no cell state) ▲ ▲ f_t i_t⊙C̃ h_{t-1} ──▶ z_t⊙h_{t-1} + h_{t-1}──▶[f_t][i_t][C̃_t][o_t] h_{t-1}──▶ (1-z_t)⊙h̃_t ──▶ h_t x_t ──────▶ σ σ tanh σ x_t ──────▶ [z_t][r_t][h̃_t] h_t = o_t ⊙ tanh(C_t) σ σ tanh Gates: forget, input, output Gates: update, reset States: h_t AND C_t State: h_t only Params: 4 × d_h(d_h+d_x) Params: 3 × d_h(d_h+d_x)

8.3 — Gradient Flow: RNN vs LSTM

GRADIENT FLOW COMPARISON VANILLA RNN — Gradient must pass through tanh + W_hh at every step: ∂L/∂h₁ ← tanh'·W ← tanh'·W ← tanh'·W ← tanh'·W ← ∂L/∂h₅ × 0.8 × 0.8 × 0.8 × 0.8 Result: 0.8⁴ = 0.41 (gradient shrinks to 41%) After 20 steps: 0.8²⁰ = 0.012 = 1.2% → VANISHED! 💀 LSTM — Gradient has a HIGHWAY through the cell state: ∂L/∂C₁ ← f₂ ← f₃ ← f₄ ← f₅ ← ∂L/∂C₅ ×0.95 ×0.95 ×0.95 ×0.95 Result: 0.95⁴ = 0.81 (gradient retains 81%) After 20 steps: 0.95²⁰ = 0.36 = 36% → HEALTHY! ✅ KEY: The cell state update C_t = f_t⊙C_{t-1} + i_t⊙C̃_t creates an ADDITIVE path where gradients multiply only by f_t (close to 1), not by W_hh.

Section 9

Common Misconceptions

❌ MYTH: "RNNs have a different set of weights for each time step."
✅ TRUTH: The same W_hh, W_xh are shared across ALL time steps. This weight sharing is what makes RNNs parameter-efficient and able to handle variable-length sequences.
🔍 WHY IT MATTERS: Understanding weight sharing is crucial for correctly implementing BPTT — gradients must be accumulated from all time steps before updating the shared weights.

❌ MYTH: "LSTM's cell state IS the hidden state."
✅ TRUTH: LSTM maintains TWO separate state vectors: the cell state C_t (long-term memory, protected by gates) and the hidden state h_t (short-term output, derived from C_t via the output gate). GRU merges these into one.
🔍 WHY IT MATTERS: When using PyTorch's nn.LSTM, the return value includes BOTH (h_n, c_n). Using the wrong one for initialization across batches is a common bug.

❌ MYTH: "LSTMs have solved the long-range dependency problem completely."
✅ TRUTH: LSTMs can handle dependencies of ~100-300 time steps effectively, but still struggle with very long sequences (1000+). That's why attention mechanisms and Transformers were invented.
🔍 WHY IT MATTERS: If your task involves very long sequences (entire documents, long time series), consider Transformers with positional encodings instead of LSTMs.

❌ MYTH: "GRU is strictly worse than LSTM because it has fewer parameters."
✅ TRUTH: On many benchmarks, GRU matches LSTM performance while training ~15-20% faster. For smaller datasets, GRU often outperforms LSTM because fewer parameters means less overfitting.
🔍 WHY IT MATTERS: Default to GRU for smaller projects and switch to LSTM only if you need maximum performance on long sequences with large datasets.

❌ MYTH: "Gradient clipping prevents vanishing gradients."
✅ TRUTH: Gradient clipping prevents exploding gradients (by capping the gradient norm). It does NOTHING for vanishing gradients — you can't clip a gradient to be larger. Vanishing gradients require architectural solutions (LSTM, GRU, skip connections).
🔍 WHY IT MATTERS: If your RNN isn't learning long-range patterns, switching from clipping to LSTM/GRU is the fix — not adjusting the clipping threshold.

Section 10

GATE / Exam Corner

Card 1: Vanilla RNN Equations

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)

ŷ_t = softmax(W_hy · h_t + b_y)

Key fact: Weights are shared across time. Total recurrent params = d_h² + d_h·d_x + d_h

Card 2: LSTM Gate Equations

f_t = σ(W_f·[h_{t-1};x_t] + b_f) — Forget gate

i_t = σ(W_i·[h_{t-1};x_t] + b_i) — Input gate

C̃_t = tanh(W_C·[h_{t-1};x_t] + b_C) — Candidate

C_t = f_t⊙C_{t-1} + i_t⊙C̃_t — Cell update

o_t = σ(W_o·[h_{t-1};x_t] + b_o) — Output gate

h_t = o_t ⊙ tanh(C_t) — Hidden state

Params: 4 × d_h × (d_h + d_x + 1). For d_h=256, d_x=100: ~366K

Card 3: GRU Equations

z_t = σ(W_z·[h_{t-1};x_t] + b_z) — Update gate

r_t = σ(W_r·[h_{t-1};x_t] + b_r) — Reset gate

h̃_t = tanh(W_h·[r_t⊙h_{t-1};x_t] + b_h) — Candidate

h_t = z_t⊙h_{t-1} + (1-z_t)⊙h̃_t — Interpolation

Key difference from LSTM: No separate cell state. Update gate = 1 - forget gate (coupled). 25% fewer parameters.

Card 4: Vanishing Gradient

∂h_t/∂h_k = ∏_{i=k+1}^{t} diag(1 - h_i²) · W_hh

‖∂h_t/∂h_k‖ ≤ σ_max(W_hh)^(t-k)

If σ_max < 1: gradients vanish exponentially → can't learn long-range deps

If σ_max > 1: gradients explode → training diverges (fix: gradient clipping)

GATE Previous Year Questions (Pattern)

GATE CS 2023 (Predicted Pattern)

An RNN with hidden size 4 and input size 3 processes a sequence of length T. What is the total number of learnable parameters in the recurrent layer (excluding output layer)?

16 + 12 + 4 = 32
4 × 4 + 4 × 3 + 4 = 32
4 × (4 + 3) + 4 = 32
All of the above (they're equivalent)

Answer: D. W_hh is 4×4 = 16, W_xh is 4×3 = 12, b_h is 4. Total = 32. All three options express the same computation. Note: T (sequence length) does NOT affect parameter count due to weight sharing.

RememberGATE

GATE CS 2024 (Predicted Pattern)

In an LSTM, the cell state update is C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t. What is the gradient ∂C_t/∂C_{t-1}?

f_t (diagonal matrix)
W_f
tanh'(C_t)
i_t

Answer: A. Since C_t = f_t ⊙ C_{t-1} + (terms not depending on C_{t-1}), the derivative is simply diag(f_t). When f_t ≈ 1, this gradient is ~1, creating the "gradient highway" that prevents vanishing gradients.

ApplyGATE

GATE DA 2024 (Predicted Pattern)

Which of the following is TRUE about GRU compared to LSTM?

GRU has more parameters than LSTM
GRU uses a separate cell state for long-term memory
GRU couples the forget and input gates into a single update gate
GRU always outperforms LSTM on long sequences

Answer: C. In GRU, the update gate z_t controls both forgetting (z_t ⊙ h_{t-1}) and inputting ((1-z_t) ⊙ h̃_t). What you forget, you replace with new — they sum to 1. GRU has 3 weight matrices vs LSTM's 4 (fewer params), and has no separate cell state.

UnderstandGATE

Section 11

Interview Prep — India & US Focus

Conceptual Questions

Q1: "Explain LSTM vs GRU. When would you use each?"

Strong Answer:

"Both LSTM and GRU are gated recurrent architectures designed to solve the vanishing gradient problem in vanilla RNNs.

LSTM uses three gates (forget, input, output) and maintains two state vectors (cell state C_t and hidden state h_t). The cell state acts as a gradient highway — the forget gate can be close to 1, allowing gradients to flow unimpeded across many time steps.

GRU simplifies this to two gates (update, reset) and one state vector. It couples the forget and input gates: z_t controls both forgetting and updating, with the constraint that they sum to 1.

When to use LSTM: Long sequences (100+ steps), large datasets, tasks requiring fine-grained memory control (e.g., machine translation where the model must remember specific earlier words).

When to use GRU: Shorter sequences, smaller datasets (less overfitting), latency-sensitive applications (15-20% faster training), and when you want a simpler model as a baseline."

Q2: "Why did attention mechanisms replace RNNs? What was the fundamental limitation?"

Strong Answer:

"RNNs have three fundamental limitations that attention and Transformers address:

Sequential bottleneck: RNNs process tokens one at a time — h₁ must be computed before h₂. This means no parallelism during training, making them slow on modern GPUs which excel at parallel operations.
Information bottleneck: In encoder-decoder models, the entire input sequence is compressed into a single fixed-size context vector. For a 100-word sentence, that's a LOT of information to squeeze into a 256-dimensional vector.
Effective memory limit: Even LSTMs struggle with dependencies beyond ~300 time steps. Attention allows the model to directly attend to any position in the input, regardless of distance.

The key insight of attention (Bahdanau, 2015): instead of using just the final hidden state, let the decoder look at ALL encoder hidden states and learn which ones are relevant for each output token. Transformers (Vaswani, 2017) took this further by replacing recurrence entirely with self-attention."

Q3: "How would you debug an LSTM that's not learning long-range dependencies?"

Strong Answer (for senior interviews):

Check forget gate bias initialization: Default bias = 0 means f_t starts at σ(0) = 0.5, which aggressively forgets. Initialize to +1 or +2.
Monitor gradient norms: Plot ‖∂L/∂h_t‖ vs. t. If it drops exponentially for early t, gradients are still vanishing.
Check sequence length vs. model capacity: If your sequences are 500+ steps, even LSTM may need help. Try truncated BPTT with larger τ.
Add skip connections: Residual connections between LSTM layers help gradient flow.
Try attention: If nothing works, the task may fundamentally need direct long-range access, not recurrent memory. Switch to Transformer or add attention to your LSTM.

Coding Questions

Coding Q1: "Implement an LSTM cell forward pass in 10 minutes" (Google/Amazon Interview)

def lstm_forward(x_t, h_prev, c_prev, params):
    """Single LSTM step. All inputs are numpy arrays."""
    W_f, W_i, W_c, W_o = params['W_f'], params['W_i'], params['W_c'], params['W_o']
    b_f, b_i, b_c, b_o = params['b_f'], params['b_i'], params['b_c'], params['b_o']

    concat = np.concatenate([h_prev, x_t])       # [h; x]
    f = sigmoid(W_f @ concat + b_f)               # forget gate
    i = sigmoid(W_i @ concat + b_i)               # input gate
    c_tilde = np.tanh(W_c @ concat + b_c)         # candidate
    c = f * c_prev + i * c_tilde                   # cell update
    o = sigmoid(W_o @ concat + b_o)               # output gate
    h = o * np.tanh(c)                             # hidden state
    return h, c

Case Study Questions

🇮🇳 Indian Interview (Flipkart/Swiggy/Razorpay)

"Design a system to predict Swiggy order volume for each restaurant in the next 2 hours."

Input: hourly order counts, day of week, weather, promotions, events
Model: 2-layer GRU (low latency), look-back of 24 hours
Output: predicted orders for each of 1000 restaurants
Key challenge: festival spikes (Diwali, IPL nights)

🇺🇸 US Interview (Google/Meta/Netflix)

"Design a next-word prediction system for Gmail Smart Compose."

Input: character/word sequence typed so far
Model: LSTM encoder → beam search decoder
Output: top-3 completion suggestions
Key challenge: latency < 100ms, personalization, safety filters

Section 12

Case Study — IRCTC Demand Forecasting (Indian Industry)

🇮🇳 Building an LSTM Pipeline for 25 Lakh Daily Bookings

Business Context

IRCTC (Indian Railway Catering and Tourism Corporation) handles the world's largest online railway ticketing system. With 25 lakh+ daily bookings and 8,000+ trains, accurate demand forecasting is critical for:

Dynamic pricing (Tatkal surge pricing)
Train schedule optimization
Waiting list management
Festival season capacity planning

Technical Architecture

Component	Choice	Rationale
Model	2-layer LSTM, d_h=256	Festival patterns need 30+ day memory
Sequence length	60 days look-back	Captures monthly + weekly seasonality
Features (8)	Bookings, day-of-week, festival distance, monsoon flag, price, waitlist, cancellations, holiday	Domain knowledge from CRIS engineers
Output	Next-day booking count (regression)	Directly useful for capacity planning
Training data	5 years of daily data (1,825 days)	Multiple festival cycles needed
Loss function	Huber loss (robust to outlier days)	Strike days, COVID lockdowns = outliers
Optimization	AdamW, lr=1e-3, cosine annealing	Smooth convergence, prevent overfitting

Key Insight: The Festival Feature

The most important feature engineering decision was encoding "distance to next major festival" as a continuous feature (0 = festival day, increasing as you move away). This single feature improved accuracy by 12% because it lets the LSTM's forget gate learn: "when festival distance drops below 10, erase normal patterns and switch to festival mode."

Results

Metric	ARIMA Baseline	Vanilla RNN	LSTM (Ours)
MAPE (non-festival days)	8.2%	5.7%	3.1%
MAPE (festival days)	22.4%	14.8%	5.9%
Training time	Seconds	12 min	35 min

The LSTM's ability to learn festival patterns from past data is the decisive advantage. The vanilla RNN struggles because festival effects span 3-4 weeks — beyond its effective memory horizon.

Case Study — Google Translate LSTM (US/Global Industry)

🇺🇸 Google Neural Machine Translation (GNMT) — The LSTM That Translated the World

Historical Context

In September 2016, Google launched GNMT, replacing their phrase-based statistical MT system with an 8-layer LSTM encoder-decoder with attention. The improvement was dramatic: 55-85% reduction in translation errors for many language pairs. For English→Chinese, the BLEU score jumped from 23.7 to 38.9.

Architecture Details

Encoder: 1 bidirectional LSTM layer + 7 unidirectional LSTM layers, d_h = 1024
Decoder: 8 unidirectional LSTM layers, d_h = 1024
Attention: Additive (Bahdanau-style) attention connecting encoder states to decoder
Residual connections: Between LSTM layers (inspired by ResNet)
Wordpiece tokenization: ~32K vocabulary to handle rare words
Total parameters: ~210 million
Training: 96 NVIDIA K80 GPUs for 6 days (~$50,000 in compute)

Why Google Eventually Moved to Transformers

Despite GNMT's success, three limitations drove the switch:

Training speed: Sequential processing meant the 8-layer LSTM couldn't fully utilize GPU parallelism. Transformers processed all positions simultaneously.
Maximum context: Even with attention, the LSTM encoder's representation quality degraded for sentences longer than ~80 words.
Engineering complexity: The LSTM required careful initialization, gradient clipping, and curriculum learning. Transformers were more straightforward to scale.

By 2019, Google Translate fully transitioned to Transformer-based models. But GNMT remains a landmark — it proved that neural networks could match (and exceed) decades of linguistic engineering in machine translation.

Section 13

Hands-On Lab / Mini-Project

Project: Character-Level Hindi Text Generator with LSTM

🔬 Lab: Build a Character-Level LSTM that Generates Hindi Text

Objective

Train an LSTM to generate Hindi text character by character, learning patterns like word boundaries (spaces), common character sequences (conjuncts), and basic grammatical structures.

Steps

Data collection: Download Hindi text from Wikipedia or Project Gutenberg (minimum 100KB)
Preprocessing: Build character vocabulary, create one-hot or embedding representations
Model: 2-layer LSTM (d_h = 256), character embedding (dim = 32), softmax output over vocabulary
Training: Cross-entropy loss, Adam optimizer, gradient clipping at 5.0, train for 50 epochs
Generation: Seed with "नमस्ते " and generate 200 characters using temperature sampling
Analysis: Compare outputs at temperature = 0.5 (conservative) vs 1.0 (creative) vs 1.5 (chaotic)

Rubric (100 points)

Criterion	Points	Requirements
Working LSTM implementation	25	Model trains without errors, loss decreases
Training pipeline	15	Proper batching, gradient clipping, learning rate scheduling
Text generation quality	20	Generated text contains valid Hindi words, proper use of spaces and punctuation
Temperature analysis	15	Compare 3 temperature values with examples and analysis
LSTM vs vanilla RNN comparison	15	Train both, compare loss curves and generation quality
Documentation & analysis	10	Clear writeup explaining design decisions and results

Bonus Challenges (+20 points)

Compare LSTM vs GRU (same hidden size): which generates better Hindi? (+10)
Visualize forget gate values over time for interesting patterns (+10)

Section 14

Exercises — 22 Problems

Section A: Conceptual Questions (5)

A1 Beginner

Why can't a feedforward neural network handle sequential data effectively? List three specific reasons.

Answer: (1) No temporal memory — each input is processed independently with no knowledge of previous inputs. (2) Variable-length sequences — FFNs require fixed-size input, so you'd need a different network for each sequence length. (3) No parameter sharing across positions — a weight learned for "word at position 3" doesn't generalize to "word at position 30." RNNs solve all three via the hidden state and weight sharing.

Understand

A2 Beginner

In the LSTM cell state update C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t, what happens when f_t = 1 and i_t = 0 for all dimensions? What about f_t = 0 and i_t = 1?

Answer: When f_t = 1, i_t = 0: C_t = C_{t-1} — perfect memory preservation, nothing new is written. When f_t = 0, i_t = 1: C_t = C̃_t — complete memory erasure, entirely new content. In practice, gates take intermediate values, enabling selective retention and updating.

Understand

A3 Intermediate

Explain why gradient clipping helps with exploding gradients but NOT with vanishing gradients. What architectural solution addresses vanishing gradients?

Answer: Gradient clipping sets a maximum norm: if ‖g‖ > threshold, scale g down. This prevents exploding gradients. But for vanishing gradients (‖g‖ → 0), clipping can't make a near-zero gradient larger — you can't "unclip" a vanished gradient. The solution is architectural: LSTM's cell state provides an additive update path where ∂C_t/∂C_{t-1} = diag(f_t) ≈ 1, creating a gradient highway.

Analyze

A4 Intermediate

Why is the first layer of Google's GNMT encoder bidirectional while layers 2-8 are unidirectional? Would making all 8 layers bidirectional be better?

Answer: The first bidirectional layer captures both left and right context at the character/word level, enriching the initial representation. Layers 2-8 are unidirectional because: (1) bidirectional doubles the hidden state size, so 8 bidirectional layers would double parameters and memory; (2) higher layers already have access to bidirectional information from layer 1; (3) the residual connections propagate the bidirectional signal upward. Making all layers bidirectional would increase computation significantly with diminishing returns.

Analyze

A5 Advanced

A GRU constrains the update gate: the "forget" amount is z_t and the "input" amount is (1-z_t), so they always sum to 1. In LSTM, f_t and i_t are independent — they can both be 0.5 simultaneously. What's the computational implication of this difference?

Answer: In LSTM, because f_t and i_t are independent, the cell state norm can grow or shrink over time (both can be large → growing, both small → shrinking). This gives LSTM more expressiveness but also more potential for instability. In GRU, because z_t + (1-z_t) = 1, the hidden state norm is bounded by interpolation — it's always a convex combination of old and new, which is inherently more stable but less flexible. This explains why GRU is slightly easier to train but sometimes less powerful on complex tasks.

Evaluate

Section B: Mathematical Problems (8)

B1 Beginner

Compute the forward pass of a vanilla RNN with d_h = 2, d_x = 2 for ONE time step. Given: W_hh = [[0.5, 0], [0, 0.5]], W_xh = [[1, 0], [0, 1]], b_h = [0, 0], h₀ = [0.5, -0.3], x₁ = [1, 0]. Find h₁.

Answer: z₁ = W_hh · h₀ + W_xh · x₁ = [[0.5,0],[0,0.5]] · [0.5,-0.3] + [[1,0],[0,1]] · [1,0] = [0.25, -0.15] + [1, 0] = [1.25, -0.15]. h₁ = tanh([1.25, -0.15]) = [0.848, -0.149].

Apply

B2 Intermediate

An RNN has W_hh with singular values σ₁ = 0.95, σ₂ = 0.85. After T = 50 time steps, what is the maximum possible gradient ratio ‖∂h₅₀/∂h₁‖ / ‖∂h₅₀/∂h₅₀‖? (Ignore the tanh derivative factor.)

Answer: ‖∂h₅₀/∂h₁‖ ≤ σ_max^(50-1) = 0.95^49 ≈ 0.0769. The gradient has shrunk to about 7.7% of its value. Even with the best-case singular value, the gradient is still significantly diminished. With tanh derivatives (all ≤ 1), the actual gradient would be even smaller.

Apply

B3 Intermediate

Calculate the total parameter count for: (a) A vanilla RNN with d_h = 128, d_x = 50, d_y = 10. (b) An LSTM with the same dimensions. (c) A GRU with the same dimensions. Include all biases.

Answer: (a) Vanilla RNN: W_hh(128×128) + W_xh(128×50) + b_h(128) + W_hy(10×128) + b_y(10) = 16,384 + 6,400 + 128 + 1,280 + 10 = 24,202. (b) LSTM: 4 gates × [128×(128+50) + 128] + W_hy + b_y = 4×[22,784 + 128] + 1,290 = 91,648 + 1,290 = 92,938. (c) GRU: 3 gates × [128×(128+50) + 128] + W_hy + b_y = 3×22,912 + 1,290 = 70,026.

Apply

B4 Intermediate

In BPTT for a 4-step sequence, write out the full expansion of ∂L₄/∂W_hh (the gradient contribution from L₄ to W_hh). How many terms does it have?

Answer: ∂L₄/∂W_hh = Σ_{k=1}^{4} (∂L₄/∂h₄) · (∂h₄/∂h_k) · (∂⁺h_k/∂W_hh). It has 4 terms: k=4 (direct), k=3 (one step back), k=2 (two steps back), k=1 (three steps back). The term for k=1 involves the product ∂h₄/∂h₁ = (∂h₄/∂h₃)(∂h₃/∂h₂)(∂h₂/∂h₁) — three Jacobian multiplications. This is where vanishing gradients occur.

Analyze

B5 Intermediate

Show that for a scalar RNN h_t = tanh(w·h_{t-1} + u·x_t), the gradient ∂h_T/∂h_0 = ∏_{t=1}^{T} (1 - h_t²) · w. If w = 0.9 and h_t ≈ 0.5 for all t, what is ∂h₂₀/∂h₀?

Answer: (1 - 0.5²) × 0.9 = 0.75 × 0.9 = 0.675 per step. ∂h₂₀/∂h₀ = 0.675^20 = 0.675^20 ≈ 0.00054. The gradient has decayed to 0.054% — effectively vanished after just 20 steps, even with w = 0.9 (close to 1).

Apply

B6 Advanced

Prove that the GRU update equation h_t = z_t ⊙ h_{t-1} + (1-z_t) ⊙ h̃_t can be rewritten as h_t = h_{t-1} + (1-z_t) ⊙ (h̃_t - h_{t-1}). Interpret this as a "residual update."

Answer: h_t = z_t⊙h_{t-1} + (1-z_t)⊙h̃_t = z_t⊙h_{t-1} + h̃_t - z_t⊙h̃_t = h_{t-1} - h_{t-1} + z_t⊙h_{t-1} + h̃_t - z_t⊙h̃_t = h_{t-1} + (h̃_t - h_{t-1}) - z_t⊙(h̃_t - h_{t-1}) = h_{t-1} + (1-z_t)⊙(h̃_t - h_{t-1}). This is a residual update: the new state = old state + gated change. When z_t ≈ 1 (remember), the change is suppressed. This is analogous to ResNet's skip connections!

Analyze

B7 Advanced

For the LSTM, show that if f_t = 1 for all t, then ∂C_T/∂C_0 = I (identity matrix). What does this imply about gradient flow?

Answer: ∂C_t/∂C_{t-1} = diag(f_t). If f_t = 1 for all t, then ∂C_t/∂C_{t-1} = I for every step. By chain rule: ∂C_T/∂C_0 = ∏_{t=1}^{T} I = I. This means gradients flow with NO decay at all — perfect gradient preservation regardless of sequence length. This is the "gradient highway" property. In practice, f_t < 1, so there's some decay, but it's much slower than vanilla RNN's exponential decay.

Analyze

B8 Advanced

A bidirectional LSTM with d_h = 256 per direction and d_x = 100 feeds into a dense layer. What is the dimensionality of the concatenated hidden state at each time step? How many parameters does the bidirectional LSTM layer have (excluding the dense layer)?

Answer: Concatenated hidden state = 256 (forward) + 256 (backward) = 512 dimensions per time step. Parameters: Each direction has an LSTM with 4 × [256 × (256 + 100) + 256] = 4 × [91,136 + 256] = 4 × 91,392 = 365,568. Two directions: 2 × 365,568 = 731,136 parameters.

Apply

Section C: Coding Problems (4)

C1 Intermediate

Implement a GRU cell forward pass from scratch in NumPy. Test it with input_size=3, hidden_size=4 on a sequence of length 5.

def gru_cell(x_t, h_prev, params):
    Wr, Wz, Wh = params['Wr'], params['Wz'], params['Wh']
    br, bz, bh = params['br'], params['bz'], params['bh']
    concat = np.vstack([h_prev, x_t])
    r = sigmoid(Wr @ concat + br)         # reset gate
    z = sigmoid(Wz @ concat + bz)         # update gate
    concat_r = np.vstack([r * h_prev, x_t])
    h_tilde = np.tanh(Wh @ concat_r + bh)  # candidate
    h = z * h_prev + (1 - z) * h_tilde     # interpolate
    return h

Apply

C2 Intermediate

Using PyTorch, build a sentiment classifier with a bidirectional LSTM for IMDB reviews. Use nn.Embedding → nn.LSTM(bidirectional=True) → nn.Linear. Report accuracy on the test set.

Hint: Use torchtext or datasets library to load IMDB. Vocabulary size ~25K, embedding dim=100, LSTM hidden=128, bidirectional=True. Concatenate final forward and backward hidden states. Expect ~87-89% accuracy.

Apply

C3 Advanced

Implement gradient clipping from scratch: (a) clip-by-value (element-wise clipping to [-c, c]), and (b) clip-by-norm (scale gradient if ‖g‖ > max_norm). Apply both to a training run and compare loss curves.

# Clip by value
def clip_by_value(grad, c=5.0):
    return np.clip(grad, -c, c)

# Clip by norm (better method)
def clip_by_norm(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

Clip-by-norm preserves gradient direction while limiting magnitude — generally preferred.

Apply

C4 Advanced

Visualize the vanishing gradient problem: train a vanilla RNN on sequences of length 50, and plot ‖∂L/∂h_t‖ for t = 1, 2, ..., 50. Then do the same with an LSTM. Overlay both plots and analyze.

Expected result: The vanilla RNN gradient plot shows exponential decay from t=50 to t=1 (near-zero for t < 20). The LSTM gradient plot shows much slower decay, with meaningful gradients surviving to t=1. The ratio at t=1 is typically 100-1000× larger for LSTM vs vanilla RNN.

Analyze

Section D: Critical Thinking (3)

D1 Advanced

A startup claims their "memory-augmented vanilla RNN" beats LSTM by simply initializing W_hh as an orthogonal matrix (all singular values = 1). Evaluate this claim: would this solve vanishing gradients? What problems might arise?

Analysis: An orthogonal W_hh ensures σ_max = σ_min = 1, so the gradient factor ‖W_hh‖^(t-k) = 1 for all distances — no vanishing or exploding from the weight matrix. However, the tanh derivative factor diag(1-h_t²) still causes gradients to shrink (since |tanh'| ≤ 1). Additionally, maintaining orthogonality during training is non-trivial — standard SGD won't preserve it. You'd need specialized optimizers or reparameterization. Finally, orthogonal initialization helps at the START of training but doesn't guarantee long-term stability. LSTMs solve the problem more robustly via the cell state highway.

Evaluate

D2 Advanced

You're building a real-time speech recognition system for Indian languages (22 scheduled languages). Should you use bidirectional LSTMs? What about for a chatbot that generates responses? Justify each decision.

Speech recognition (real-time, streaming): Bidirectional is tempting (future context helps disambiguate), but real-time systems can't wait for the full utterance. Use a unidirectional LSTM with a small look-ahead window (e.g., 200ms of future audio). Alternatively, use a "chunked bidirectional" approach — bidirectional within small chunks. Chatbot response generation: Definitely unidirectional. You generate tokens left-to-right; future tokens don't exist yet. However, the encoder for the user's input CAN be bidirectional (since the entire input is available). So: bidirectional encoder + unidirectional decoder.

Evaluate

D3 Advanced

"RNNs are dead. Transformers have replaced them for every task." Argue both for and against this statement with specific technical evidence.

For (RNNs are dead): Transformers dominate NLP (BERT, GPT), vision (ViT), and even time-series (Temporal Fusion Transformer). Self-attention is more parallelizable, handles long-range dependencies without recurrence, and scales better with compute. Against (RNNs still alive): (1) On-device deployment: RNNs/LSTMs have constant memory during inference (O(1) per step), while Transformers need O(n²) for self-attention. For wake-word detection on Alexa, LSTM is more efficient. (2) Streaming applications: RNNs naturally process one token at a time; Transformers need the full sequence. (3) State-space models (S4, Mamba) in 2023-2024 are linear-time recurrent models that outperform Transformers on some benchmarks — the recurrent paradigm is evolving, not dying. (4) Small data: On small datasets (< 10K samples), LSTMs often beat Transformers due to fewer parameters and built-in inductive bias for sequence processing.

Evaluate

★ Starred Research Problems (2)

★1 Advanced — Research

Read the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (Gu & Dao, 2023). How does Mamba achieve the benefits of both RNNs (linear-time inference) and Transformers (parallel training)? What is the "selective state space" mechanism and how does it relate to LSTM gating?

Key insight: Mamba uses a state-space model (SSM) with input-dependent (selective) parameters, similar to how LSTM gates are input-dependent. During training, the recurrence is converted to a convolution (parallelizable). During inference, it runs as a recurrence (O(1) per step). The "selection mechanism" in Mamba is conceptually similar to LSTM's forget and input gates — both decide what information to retain based on the current input.

CreateResearch

★2 Advanced — Research

Design an experiment to empirically measure the "effective memory length" of vanilla RNNs vs. LSTMs vs. GRUs. Create a synthetic task where the correct output at time T depends on an input at time T-k, and measure accuracy as a function of k. What patterns do you expect?

Experimental design: The "copy task" or "adding problem" are standard benchmarks. For the copy task: input a sequence of symbols, wait k blank steps, then output the original sequence. Measure accuracy vs. k for each architecture. Expected: Vanilla RNN fails sharply around k = 10-20. LSTM maintains high accuracy until k ≈ 100-300 (depending on hidden size). GRU performs similarly to LSTM but may degrade slightly earlier for very large k. This directly demonstrates the vanishing gradient problem and LSTM's solution.

CreateResearch

Section 15.a

Connections Map

🔗 How This Chapter Connects

← Builds On:

Chapter 6 (Backpropagation): BPTT is backprop applied to an unrolled computational graph
Chapter 10 (Deep Networks & Optimization): Gradient flow issues, skip connections, optimization techniques
Chapter 12 (CNNs): Weight sharing concept (spatial in CNNs, temporal in RNNs)

→ Enables:

Chapter 15 (Transformers): Attention mechanisms were invented to solve RNN limitations. Understanding WHY RNNs fail motivates self-attention.
Chapter 18 (Applied NLP): LSTMs underlie many NLP systems still in production
Chapter 20 (Time Series): LSTM remains a strong baseline for temporal forecasting

Research Frontier:

State-Space Models (Mamba, S4): Linear-time recurrence that rivals Transformers (2023-2024)
xLSTM (Beck et al., 2024): Extended LSTM with exponential gating and matrix memory — LSTM is evolving, not dying
RWKV: An "RNN with Transformer-level performance" that processes sequences in linear time

Industry Implementation:

IRCTC/CRIS: LSTM for demand forecasting (India)
Google Translate (2016-2019): 8-layer LSTM encoder-decoder (US)
Amazon Alexa: GRU for wake-word detection on-device
Siri (Apple): LSTM for on-device speech recognition

Section 15

Chapter Summary

📋 Key Takeaways

Sequential data needs memory. Feedforward networks process inputs independently; RNNs carry a hidden state h_t that summarizes all past inputs, enabling temporal reasoning.
Vanilla RNNs share weights across time: h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b). Same W_hh, W_xh at every step — enabling variable-length processing with fixed parameters.
BPTT unrolls the RNN through time and backpropagates through the entire computational graph. The gradient ∂L/∂W_hh involves a product of Jacobians across all time steps.
Vanishing gradients are inevitable in vanilla RNNs: ‖∂h_t/∂h_k‖ ≤ σ_max(W_hh)^(t-k) → 0 exponentially. This limits effective memory to ~20 steps.
LSTM solves this with four gates (forget, input, candidate, output) and a cell state highway: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t. The additive update means ∂C_t/∂C_{t-1} = diag(f_t) ≈ 1 — gradients flow unimpeded.
GRU simplifies LSTM to two gates (update, reset) with a coupled forget/input mechanism: z_t + (1-z_t) = 1. 25% fewer parameters, comparable performance on many tasks.
Bidirectional RNNs read sequences both forward and backward — powerful for tasks where the full sequence is available (NER, classification). Never use for generation or forecasting.

The Key Equation:

LSTM Cell State Update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t

This single equation is the heart of the LSTM — additive updates create a gradient highway
that enables learning dependencies across hundreds of time steps.

The Key Intuition: The vanilla RNN's gradient must pass through tanh and W_hh at every step — like a message whispered through a chain of people, getting garbled at each link. The LSTM's cell state is like a written note passed along the chain — it arrives intact because the forget gate chooses whether to erase (≈ 0) or preserve (≈ 1), rather than multiplying by a potentially shrinking matrix.

Section 16