Neural Networks & Deep Learning

Chapter 14: Recurrent Neural Networks (RNNs) & LSTMs

Mastering Sequential Memory โ€” From Vanilla RNNs to Gated Architectures

โฑ๏ธ Reading Time: ~4 hours  |  ๐Ÿ“– Unit 5: Specialized Architectures  |  ๐Ÿง  Theory + Code Chapter

๐Ÿ“‹ Prerequisites: Chapter 10 (Deep Networks & Optimization), Chapter 6 (Backpropagation)

Bloom's Taxonomy Map for This Chapter

Bloom's LevelWhat You'll Achieve
๐Ÿ”ต RememberRecall the vanilla RNN equation, LSTM gate equations (forget, input, candidate, output), GRU update/reset gates, and BPTT algorithm steps
๐Ÿ”ต UnderstandExplain why feedforward networks fail on sequences, how hidden states carry temporal memory, why gradients vanish in deep time-unrolled RNNs, and how LSTM gates solve this via the cell state highway
๐ŸŸข ApplyImplement RNN and LSTM cells from scratch in NumPy, train a character-level language model on Hindi text, and use PyTorch's nn.LSTM for demand forecasting
๐ŸŸก AnalyzeTrace gradient flow through an unrolled RNN, prove vanishing gradients via repeated Jacobian multiplication, compare LSTM vs GRU parameter efficiency, and diagnose when to use bidirectional architectures
๐ŸŸ  EvaluateChoose between LSTM, GRU, and vanilla RNN for a given task; assess when attention or Transformers should replace recurrent models; evaluate trade-offs in encoder-decoder architectures
๐Ÿ”ด CreateDesign an end-to-end IRCTC demand forecasting pipeline with LSTM, build a character-level Hindi text generator, and architect a seq2seq model for transliteration
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Explain why sequential data (time series, text, speech) requires architectures with temporal memory โ€” and why CNNs and feedforward networks fundamentally cannot handle variable-length temporal dependencies
  • Derive the vanilla RNN forward pass equations: h_t = tanh(W_hh ยท h_{t-1} + W_xh ยท x_t + b_h) and explain weight sharing across time steps
  • Perform Backpropagation Through Time (BPTT) โ€” unroll the computation graph, apply the chain rule across T time steps, and compute โˆ‚L/โˆ‚W_hh explicitly
  • Prove the vanishing gradient problem: show that โˆ‚h_T/โˆ‚h_1 = โˆ_{t=1}^{T-1} diag(tanh'(z_t)) ยท W_hh decays exponentially when the largest singular value of W_hh < 1
  • Derive each LSTM gate from first principles โ€” forget gate (what to erase), input gate (what to write), candidate cell (what to write), output gate (what to read) โ€” and show how the cell state provides a gradient highway
  • Compare GRU's simplified two-gate architecture (update + reset) with LSTM's four gates, and identify when each is preferred
  • Explain bidirectional RNNs (reading sequences forward + backward) and deep/stacked RNNs (multiple recurrent layers)
  • Implement both RNN and LSTM cells from scratch in NumPy, including forward pass and BPTT
  • Build practical models using PyTorch's nn.LSTM โ€” character-level language model and time-series forecasting
  • Preview the encoder-decoder (seq2seq) architecture that leads to attention and Transformers (Chapter 15)
Section 2

Opening Hook โ€” 25 Lakh Tickets and the Problem of Memory

๐Ÿš‚ IRCTC Processes 25 Lakh Bookings Daily. Can Your Neural Network Remember Yesterday?

It's October 2024. Durga Puja is four days away. At IRCTC's data center in Delhi, servers are groaning under the weight of 25 lakh ticket bookings per day. The Rajdhani Express from Delhi to Kolkata? Waitlisted to position 847. The Sealdah Duronto? Sold out three weeks ago.

Now imagine you're the ML engineer tasked with predicting tomorrow's demand for the Delhi-Mumbai Rajdhani. You need to know: yesterday's bookings (18,000), last week's pattern (rising 12% daily), the festival calendar (Diwali is next month), the monsoon status (delayed in Maharashtra causing cancellations), and the fact that a cricket match in Mumbai will spike weekend travel. This isn't a single data point โ€” it's a sequence that stretches weeks into the past.

You try a feedforward network. You feed it today's features: 18,000 bookings, temperature 32ยฐC, "pre-Diwali" flag. It predicts 17,500. Wildly wrong โ€” it missed the upward trend, the weekly seasonality, the festival momentum. Why? Because a feedforward network is like a goldfish: it has no memory. It sees each day in complete isolation.

What you need is a network that remembers. One that carries yesterday's context into today's prediction. One that can learn: "bookings have been rising for 5 days straight, there's a festival approaching, and monsoon delays are pushing people to pre-book." That network is called a Recurrent Neural Network. And when sequences get really long, you need its evolved cousin: the Long Short-Term Memory (LSTM) network.

IRCTCIndian RailwaysDemand ForecastingSequential DataTime Series
Sequential data dominates Indian tech: UPI transaction patterns on Paytm/PhonePe (โ‚น20+ lakh crore monthly), Jio's network traffic during IPL matches, IRCTC's 25 lakh daily bookings with festival spikes, Flipkart's order surges during Big Billion Days, Zomato's hourly delivery patterns across 500+ cities, and Ola's ride-demand curves during monsoon. Every one of these problems screams for a neural network with memory.
Section 3

The Intuition First โ€” Thinking Before Equations

The Diary Analogy

Imagine you're reading a diary, one page per day. On each page, the writer describes what happened. When you read Day 5, you don't forget Days 1-4 โ€” you carry a mental summary of everything you've read so far. This mental summary gets updated as you read each new page.

That's exactly what an RNN does:

  • Each page = one input at time step t (today's features: bookings, weather, festival flag)
  • Your mental summary = the hidden state h_t (a vector summarizing everything seen so far)
  • Updating your summary = applying a function: new_summary = f(old_summary, today's page)
  • Making a prediction = using today's summary to guess what tomorrow's page will say
THE DIARY READER (RNN INTUITION) Day 1 Day 2 Day 3 Day 4 โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”‚ xโ‚ โ”‚ โ”‚ xโ‚‚ โ”‚ โ”‚ xโ‚ƒ โ”‚ โ”‚ xโ‚„ โ”‚ โ† Pages (inputs) โ””โ”€โ”€โ”ฌโ”€โ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ–ผ โ–ผ โ–ผ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” hโ‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” hโ‚‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” hโ‚ƒ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Read โ”‚โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚ Read โ”‚โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚ Read โ”‚โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚ Read โ”‚โ”€โ”€โ–ถ hโ‚„ โ”‚ & Sumโ”‚ โ”‚ & Sumโ”‚ โ”‚ & Sumโ”‚ โ”‚ & Sumโ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ hโ‚€ โ”€โ”˜ โ”‚ โ”‚ โ”‚ (blank ลทโ‚‚ ลทโ‚ƒ ลทโ‚„ โ† Predictions mind) "I recall Days "Days 1-3 tell "The trend 1-2: bookings me there's an over 4 days are rising" upward trend" is clear!" KEY INSIGHT: The same "reading brain" is used at every step. This is WEIGHT SHARING โ€” the same W_hh, W_xh are applied at every time step.

The "Aha!" Question

If the RNN uses the same weights at every time step, how can it possibly distinguish between "the booking happened yesterday" vs. "the booking happened 30 days ago"?

Answer: It can't โ€” not well, anyway. That's the vanishing gradient problem, and it's exactly why we need LSTMs. But before we jump to the solution, let's deeply understand the problem.

The physicist's approach: We're going to derive everything from scratch. No "it can be shown that..." โ€” we'll show it. If a formula appears, you'll see exactly where it comes from. If you're confused at any point, you're thinking correctly. Confusion is the first step to understanding.
Section 4

Mathematical Foundation โ€” Deriving Everything from Scratch

14.1 โ€” Why Sequences Need Special Architectures

Before we write any equations, let's be crystal clear about what makes sequential data fundamentally different from, say, images or tabular data.

Property 1: Temporal Dependency

In a sequence xโ‚, xโ‚‚, ..., x_T, the meaning of x_t depends on what came before. "Bank" means something different after "river" vs. after "investment." Today's stock price depends on yesterday's. The context accumulated over time is essential.

Property 2: Variable Length

A tweet is 15 words. A novel is 100,000 words. A Sensex time series spans 50 years of daily data. You cannot design a fixed-size input layer for all of these. The architecture must handle any sequence length using the same parameters.

Property 3: Position-Dependent Semantics with Shared Rules

The rule "a verb following a noun is probably describing the noun's action" applies equally at position 3 and position 300 in a sentence. We want parameter sharing โ€” the same transformation applied at every position โ€” so the network can generalize across positions.

๐Ÿ’ก Key Insight: The Compression Trick

Instead of trying to feed the entire history (xโ‚, xโ‚‚, ..., x_t) at each step, we compress the history into a fixed-size vector h_t (the hidden state). At each time step, we update this compressed summary:

h_t = f(h_{t-1}, x_t; ฮธ)

The function f and parameters ฮธ are shared across all time steps. This is the fundamental RNN idea.

14.2 โ€” Vanilla RNN: Architecture & Forward Pass

The Architecture

A vanilla RNN has three sets of parameters, shared across every time step:

  • W_xh โˆˆ โ„^(d_h ร— d_x): transforms input x_t into hidden-space
  • W_hh โˆˆ โ„^(d_h ร— d_h): transforms previous hidden state h_{t-1}
  • b_h โˆˆ โ„^(d_h): bias vector
  • W_hy โˆˆ โ„^(d_y ร— d_h): maps hidden state to output
  • b_y โˆˆ โ„^(d_y): output bias

Step-by-Step Forward Pass

At each time step t = 1, 2, ..., T:

Step 1: Compute pre-activation

z_t = W_hh ยท h_{t-1} + W_xh ยท x_t + b_h

Step 2: Apply activation (tanh)

h_t = tanh(z_t) = tanh(W_hh ยท h_{t-1} + W_xh ยท x_t + b_h)

Step 3: Compute output

ลท_t = W_hy ยท h_t + b_y

(Apply softmax for classification, or use raw for regression)

Initialization: hโ‚€ = 0 (zero vector of size d_h)

Why tanh? It squashes to [-1, +1], allowing the hidden state to represent both positive and negative correlations. ReLU is rarely used in vanilla RNNs because unbounded activations cause exploding hidden states during the recurrence.

The Vanilla RNN Equations (Boxed Result)

h_t = tanh(W_hh ยท h_{t-1} + W_xh ยท x_t + b_h)
ลท_t = softmax(W_hy ยท h_t + b_y)

Weight Sharing: The Same Brain at Every Step

Notice something remarkable: W_hh, W_xh, W_hy, b_h, b_y are the same at every time step. The RNN doesn't have separate "Day 1 weights" and "Day 2 weights." It applies the same transformation everywhere. This is analogous to how a CNN shares the same filter across spatial positions.

Parameter count: A vanilla RNN with d_h = 256 hidden units and d_x = 100 input features has only 256ร—256 + 256ร—100 + 256 = 91,392 parameters in the recurrent layer โ€” regardless of whether the sequence has 10 steps or 10,000 steps. A feedforward network handling 10,000 steps would need millions of parameters.

Compact Notation

You'll often see the RNN equations written compactly by concatenating h_{t-1} and x_t:

h_t = tanh(W ยท [h_{t-1}; x_t] + b)

where W = [W_hh | W_xh] โˆˆ โ„^(d_h ร— (d_h + d_x))

This concatenation trick simplifies implementation โ€” you do one matrix multiply instead of two.

VANILLA RNN โ€” UNROLLED VIEW (3 TIME STEPS) xโ‚ xโ‚‚ xโ‚ƒ โ”‚ โ”‚ โ”‚ โ–ผ โ–ผ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚W_xh โ”‚ โ”‚W_xh โ”‚ โ”‚W_xh โ”‚ โ† SAME W_xh โ””โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ hโ‚€โ”€โ”€โ–ถโŠ•โ”€โ”€โ”€โ–ถtanhโ”€โ”€โ–ถhโ‚โ”€โ”€โ–ถโŠ•โ”€โ”€โ”€โ–ถtanhโ”€โ”€โ–ถhโ‚‚โ”€โ”€โ–ถโŠ•โ”€โ”€โ”€โ–ถtanhโ”€โ”€โ–ถhโ‚ƒ โ–ฒ โ–ฒ โ–ฒ โ”Œโ”€โ”€โ”ดโ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”ดโ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”ดโ”€โ”€โ”€โ” โ”‚W_hh โ”‚ โ”‚W_hh โ”‚ โ”‚W_hh โ”‚ โ† SAME W_hh โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ–ผ โ–ผ โ–ผ ลทโ‚ ลทโ‚‚ ลทโ‚ƒ โ”‚ โ”‚ โ”‚ โ–ผ โ–ผ โ–ผ Lโ‚ Lโ‚‚ Lโ‚ƒ Total Loss: L = Lโ‚ + Lโ‚‚ + Lโ‚ƒ (sum across time steps)
โŒ MYTH: "Each copy of the RNN cell in the unrolled diagram has its own parameters."
โœ… TRUTH: All copies share exactly the same W_hh, W_xh, W_hy. The unrolling is just a visualization trick for computing gradients.
๐Ÿ” WHY IT MATTERS: During BPTT, gradients from all time steps must be summed (not averaged) to update the shared parameters. If you treat them as separate, you'll implement BPTT incorrectly.

14.3 โ€” Backpropagation Through Time (BPTT) โ€” Full Derivation

Now comes the beautiful (and slightly painful) part. We need to compute gradients for learning. Since the same weights are used at every time step, we need to unroll the RNN through time and backpropagate through the entire unrolled graph.

Setup

Given a sequence of length T, the total loss is:

L = ฮฃ_{t=1}^{T} L_t(ลท_t, y_t)

We need: โˆ‚L/โˆ‚W_hh, โˆ‚L/โˆ‚W_xh, โˆ‚L/โˆ‚W_hy

Step 1: Gradient of the output weights (Easy)

Since ลท_t = W_hy ยท h_t + b_y, and each L_t depends only on ลท_t:

โˆ‚L/โˆ‚W_hy = ฮฃ_{t=1}^{T} โˆ‚L_t/โˆ‚ลท_t ยท h_t^โŠค

This is straightforward โ€” no recurrence involved.

Step 2: Gradient of the recurrent weights (The Hard Part)

Here's where it gets interesting. Consider โˆ‚L/โˆ‚W_hh. The weight W_hh is used at every time step. So L_t depends on W_hh through the chain:

W_hh โ†’ hโ‚ โ†’ hโ‚‚ โ†’ ... โ†’ h_t โ†’ ลท_t โ†’ L_t

By the chain rule:

โˆ‚L_t/โˆ‚W_hh = ฮฃ_{k=1}^{t} โˆ‚L_t/โˆ‚ลท_t ยท โˆ‚ลท_t/โˆ‚h_t ยท โˆ‚h_t/โˆ‚h_k ยท โˆ‚h_k/โˆ‚W_hh

The key term is โˆ‚h_t/โˆ‚h_k โ€” the gradient flowing backwards from step t to step k.

Step 3: Expanding โˆ‚h_t/โˆ‚h_k

Since h_t = tanh(W_hh ยท h_{t-1} + W_xh ยท x_t + b_h), we have:

โˆ‚h_t/โˆ‚h_{t-1} = diag(1 - h_tยฒ) ยท W_hh

(where 1 - h_tยฒ is the derivative of tanh, and diag() creates a diagonal matrix)

By the chain rule across multiple steps:

โˆ‚h_t/โˆ‚h_k = โˆ_{i=k+1}^{t} โˆ‚h_i/โˆ‚h_{i-1} = โˆ_{i=k+1}^{t} diag(1 - h_iยฒ) ยท W_hh

Step 4: The Complete BPTT Gradient

Putting it all together:

BPTT Gradient (Boxed Result)

โˆ‚L/โˆ‚W_hh = ฮฃ_{t=1}^{T} ฮฃ_{k=1}^{t} (โˆ‚L_t/โˆ‚h_t) ยท [โˆ_{i=k+1}^{t} diag(1 - h_iยฒ) ยท W_hh] ยท (โˆ‚h_kโบ/โˆ‚W_hh)

where โˆ‚h_kโบ/โˆ‚W_hh = diag(1 - h_kยฒ) ยท h_{k-1}^โŠค (the "immediate" gradient at step k)

The total gradient for W_hh is the sum over all time steps t, and for each t, the sum over all preceding steps k โ‰ค t. This double sum is what makes BPTT computationally expensive โ€” O(Tยฒ) in the worst case, though it's typically implemented as a single backward sweep in O(T).

Truncated BPTT

For very long sequences (T = 10,000+), full BPTT is impractical. Truncated BPTT limits backpropagation to the last ฯ„ steps:

โˆ‚h_t/โˆ‚h_k โ‰ˆ 0   for   t - k > ฯ„ (truncation length)

Typical values: ฯ„ = 20 to 50. PyTorch's default behavior processes sequences in chunks of ฯ„ steps.

BPTT Quick Summary
1. Unroll RNN for T steps โ†’ compute all hโ‚, hโ‚‚, ..., h_T (forward pass) 2. Compute loss L = ฮฃ L_t 3. Backpropagate: โˆ‚L/โˆ‚h_T โ†’ โˆ‚L/โˆ‚h_{T-1} โ†’ ... โ†’ โˆ‚L/โˆ‚h_1 4. Accumulate: โˆ‚L/โˆ‚W = ฮฃ_t (gradient contribution from step t)

GATE tip: The gradient at step k involves a product of (t-k) Jacobians. This product is why gradients vanish or explode.

14.4 โ€” The Vanishing Gradient Problem: A Rigorous Proof

This is arguably the most important theoretical result in this chapter. Let's prove it step by step.

Theorem: Gradients vanish exponentially with sequence length

Claim: For a vanilla RNN, if the largest singular value of W_hh satisfies ฯƒ_max(W_hh) < 1/ฮณ where ฮณ = max|tanh'(ยท)| = 1, then:

โ€–โˆ‚h_t/โˆ‚h_kโ€– โ‰ค (ฯƒ_max(W_hh))^(t-k) โ†’ 0 as (t-k) โ†’ โˆž

Proof:

Step 1: We derived that:

โˆ‚h_t/โˆ‚h_k = โˆ_{i=k+1}^{t} diag(1 - h_iยฒ) ยท W_hh

Step 2: Take the norm of each factor. For any matrix product AB, โ€–ABโ€– โ‰ค โ€–Aโ€– ยท โ€–Bโ€–. So:

โ€–โˆ‚h_t/โˆ‚h_kโ€– โ‰ค โˆ_{i=k+1}^{t} โ€–diag(1 - h_iยฒ)โ€– ยท โ€–W_hhโ€–

Step 3: The diagonal matrix diag(1 - h_iยฒ) has entries in [0, 1] (since tanh output is in [-1,1], so tanhยฒ โˆˆ [0,1], so 1-tanhยฒ โˆˆ [0,1]). Therefore:

โ€–diag(1 - h_iยฒ)โ€– โ‰ค 1

Step 4: Let ฯƒ = ฯƒ_max(W_hh) = โ€–W_hhโ€–โ‚‚ (the spectral norm). Then:

โ€–โˆ‚h_t/โˆ‚h_kโ€– โ‰ค โˆ_{i=k+1}^{t} 1 ยท ฯƒ = ฯƒ^(t-k)

Step 5: If ฯƒ < 1, then ฯƒ^(t-k) โ†’ 0 exponentially as (t-k) โ†’ โˆž. โˆŽ

What this means intuitively:

The gradient signal from step t trying to reach step k gets multiplied by W_hh at each intervening step. If W_hh "shrinks" vectors (ฯƒ < 1), the signal is progressively crushed. After 20-30 steps, it's essentially zero. The RNN cannot learn dependencies longer than ~20 steps.

The exploding case:

If ฯƒ > 1, then ฯƒ^(t-k) โ†’ โˆž. The gradients explode. This is easier to fix โ€” just clip gradients: if โ€–gโ€– > threshold: g = threshold ยท g / โ€–gโ€–

VANISHING GRADIENT VISUALIZATION Gradient signal flowing backward through time: Step 1 Step 2 Step 3 Step 4 Step 5 Step 10 Step 20 โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆ โ”‚ โ”‚โ–ช โ”‚ โ”‚ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ 100% 81% 66% 53% 43% 13% 1.5% ฯƒ_max = 0.9 โ†’ Gradient at step k = 0.9^(20-k) of original After 20 steps: 0.9^20 โ‰ˆ 0.012 = 1.2% of the original signal!
โŒ MYTH: "The vanishing gradient means the gradient becomes zero."
โœ… TRUTH: The gradient becomes exponentially small but never exactly zero. However, it becomes so small relative to floating-point precision (and relative to gradients from nearby time steps) that it's effectively invisible to the optimizer.
๐Ÿ” WHY IT MATTERS: This means the RNN can theoretically learn long-range dependencies, but in practice, the learning signal is drowned out by more recent, stronger gradients. The LSTM doesn't just fix the gradient โ€” it creates an alternative pathway where the gradient can flow unimpeded.
"On the difficulty of training Recurrent Neural Networks" (Pascanu et al., 2013) โ€” This seminal paper provided the formal analysis of vanishing/exploding gradients in RNNs, showing that the problem is related to the eigenvalues of the recurrent weight matrix. They also proposed gradient clipping as a practical solution for the exploding case. The vanishing case requires architectural changes (LSTM/GRU).

14.5 โ€” LSTM: Long Short-Term Memory โ€” Full Gate Derivation

The LSTM, introduced by Hochreiter & Schmidhuber in 1997 (and refined by Gers et al. in 2000), solves the vanishing gradient problem with a beautifully engineered architecture. Let's derive each component from the design requirement.

The Design Challenge

We need a recurrent unit that can:

  1. Remember information for long durations (hundreds of time steps)
  2. Selectively forget information that's no longer relevant
  3. Selectively write new information when it matters
  4. Selectively read from memory to produce output

Think of it as a bank locker system:

  • The cell state C_t is the locker itself โ€” long-term storage
  • The forget gate decides which locker contents to discard
  • The input gate decides which new items to store
  • The output gate decides which items to take out and use right now

LSTM Gate Equations โ€” Derived Step by Step

Notation: At time step t, the LSTM receives input x_t โˆˆ โ„^(d_x) and previous hidden state h_{t-1} โˆˆ โ„^(d_h). It also maintains a cell state C_{t-1} โˆˆ โ„^(d_h).

Gate 1: The Forget Gate f_t โ€” "What should I erase from memory?"

We need a gate that looks at the current input x_t and previous hidden state h_{t-1} and decides, for each element of the cell state, whether to keep it (1) or erase it (0). A sigmoid function gives us values in [0, 1]:

f_t = ฯƒ(W_f ยท [h_{t-1}; x_t] + b_f)

โ†’ f_t โˆˆ (0, 1)^(d_h). A value of 0.9 means "keep 90% of this memory." A value of 0.01 means "almost completely forget this."

Gate 2: The Input Gate i_t โ€” "What new information is worth storing?"

We need two things: (a) a gate deciding how much to write, and (b) the actual content to write.

i_t = ฯƒ(W_i ยท [h_{t-1}; x_t] + b_i)   โ† how much to write (0 to 1)

Cฬƒ_t = tanh(W_C ยท [h_{t-1}; x_t] + b_C)   โ† what to write (-1 to 1, the "candidate")

โ†’ i_t acts as a valve controlling how much of the candidate Cฬƒ_t flows into memory.

The Cell State Update โ€” "Erase old + Write new"

Now we combine: erase what the forget gate says to erase, then add what the input gate says to add:

C_t = f_t โŠ™ C_{t-1} + i_t โŠ™ Cฬƒ_t

โ†’ โŠ™ is element-wise multiplication. This is the key equation. Notice: the cell state update is an additive operation (not multiplicative like vanilla RNN's h_t = tanh(Wยทh_{t-1}+...)). This additive nature is what prevents vanishing gradients!

Gate 3: The Output Gate o_t โ€” "What should I output right now?"

The cell state contains everything in memory, but we only want to output the relevant parts:

o_t = ฯƒ(W_o ยท [h_{t-1}; x_t] + b_o)

h_t = o_t โŠ™ tanh(C_t)

โ†’ tanh squashes C_t to [-1, 1], then the output gate selects which dimensions to expose.

LSTM Equations โ€” Complete Boxed Summary

Forget gate:    f_t = ฯƒ(W_f ยท [h_{t-1}; x_t] + b_f)
Input gate:     i_t = ฯƒ(W_i ยท [h_{t-1}; x_t] + b_i)
Candidate:     Cฬƒ_t = tanh(W_C ยท [h_{t-1}; x_t] + b_C)
Cell update:    C_t = f_t โŠ™ C_{t-1} + i_t โŠ™ Cฬƒ_t
Output gate:    o_t = ฯƒ(W_o ยท [h_{t-1}; x_t] + b_o)
Hidden state:   h_t = o_t โŠ™ tanh(C_t)

Parameters: W_f, W_i, W_C, W_o โˆˆ โ„^(d_h ร— (d_h + d_x)),   b_f, b_i, b_C, b_o โˆˆ โ„^(d_h)

Why Does This Solve Vanishing Gradients?

Look at the cell state update: C_t = f_t โŠ™ C_{t-1} + i_t โŠ™ Cฬƒ_t

The gradient of C_t with respect to C_{t-1} is:

โˆ‚C_t/โˆ‚C_{t-1} = diag(f_t)

For long-range dependency, the gradient across T steps is:

โˆ‚C_T/โˆ‚C_1 = โˆ_{t=2}^{T} diag(f_t)

If the forget gate values are close to 1 (meaning "remember everything"), this product stays close to 1. The cell state acts as a gradient highway โ€” gradients can flow unchanged across hundreds of time steps. The forget gate learned value of ~1 means "let the gradient through."

Forget gate bias initialization: Initialize b_f to +1 or +2 (instead of 0). This makes ฯƒ(b_f) โ‰ˆ 0.73 or 0.88, so the LSTM starts by remembering rather than forgetting. This trick, proposed by Jozefowicz et al. (2015), significantly improves LSTM training. PyTorch's nn.LSTM does NOT do this by default โ€” you must set it manually.
LSTM CELL โ€” INTERNAL ARCHITECTURE โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ LSTM CELL โ”‚ โ”‚ โ”‚ C_{t-1} โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚โ”€โ”€โ”€โ”€ โŠ™ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โŠ• โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ C_t โ”‚ โ”‚ โ–ฒ โ–ฒ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ f_t i_t โŠ™ Cฬƒ_t โ”‚ โ”‚ (forget) (input ร— candidate) โ”‚ โ”‚ โ–ฒ โ–ฒ โ–ฒ โ”‚ โ”‚ โ”Œโ”€โ”€โ”ดโ”€โ”€โ” โ”Œโ”€โ”€โ”ดโ”€โ” โ”Œโ”ดโ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ ฯƒ โ”‚ โ”‚ ฯƒ โ”‚ โ”‚tanh โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”ฌโ”€โ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”˜ โ””โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ h_{t-1} โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ x_t โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”ฌโ”€โ”€โ”ค โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”ดโ”€โ”€โ” โ”‚ โ”‚ tanh(C_t) โ—€โ”€โ”€โ”€โ”€ โ”‚ o_t โ”‚ โ† ฯƒ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”ฌโ”€โ”€โ”˜ โ”‚ โ”‚ โ–ผ โ”‚ โ”‚ โ”‚ โŠ™ โ—€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ–ผ โ”‚ โ”‚ h_t โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ h_t โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ FLOW SUMMARY: โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Forget โ”‚โ”€โ”€โ–ถโ”‚ Eraseโ”‚โ”€โ”€โ–ถโ”‚ Write โ”‚โ”€โ”€โ–ถโ”‚ Output โ”‚โ”€โ”€โ–ถ h_t โ”‚ Gate โ”‚ โ”‚ Old โ”‚ โ”‚ New โ”‚ โ”‚ Filter โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ "What to f_tโŠ™C +i_tโŠ™Cฬƒ_t o_tโŠ™tanh(C_t) forget?"

LSTM Parameter Count

An LSTM with d_h hidden units and d_x input features has 4 gates, each with a weight matrix of size d_h ร— (d_h + d_x) and a bias of size d_h:

LSTM parameters = 4 ร— [d_h ร— (d_h + d_x) + d_h] = 4 ร— d_h ร— (d_h + d_x + 1)

For d_h = 256, d_x = 100: 4 ร— 256 ร— 357 = 365,568 parameters โ€” about 4ร— a vanilla RNN.

The LSTM was published in 1997 by Sepp Hochreiter and Jรผrgen Schmidhuber, but it didn't become popular until ~2013 when GPU computing made training large LSTMs practical. Alex Graves at DeepMind used LSTMs to win handwriting recognition competitions, and Google adopted them for Google Translate in 2016. The LSTM had to wait 16 years for hardware to catch up with theory.

14.6 โ€” GRU: Gated Recurrent Unit โ€” The Streamlined Alternative

The GRU, introduced by Cho et al. (2014), asks: "Can we get LSTM-like performance with fewer parameters?" The answer: merge the forget and input gates into one update gate, and merge the cell state and hidden state into a single state vector.

GRU Equations โ€” Derived from LSTM Simplification

Key simplifications:

  1. No separate cell state C_t โ€” just the hidden state h_t
  2. The "input gate" is replaced by (1 - z_t), ensuring f + i = 1 (what you forget, you replace with new)
  3. The output gate is removed โ€” the full hidden state is exposed

Reset Gate r_t โ€” "How much of the past hidden state to use when computing new candidate?"

r_t = ฯƒ(W_r ยท [h_{t-1}; x_t] + b_r)

Update Gate z_t โ€” "What fraction of h_{t-1} to keep vs. replace?"

z_t = ฯƒ(W_z ยท [h_{t-1}; x_t] + b_z)

Candidate hidden state hฬƒ_t โ€” "What would the new state be?"

hฬƒ_t = tanh(W_h ยท [r_t โŠ™ h_{t-1}; x_t] + b_h)

โ†’ The reset gate r_t controls how much of h_{t-1} is visible when computing the candidate. If r_t โ‰ˆ 0, the GRU "resets" and acts like a vanilla neuron seeing only x_t.

Hidden state update โ€” "Interpolate between old and new"

h_t = z_t โŠ™ h_{t-1} + (1 - z_t) โŠ™ hฬƒ_t

โ†’ When z_t โ‰ˆ 1: keep old state (remember). When z_t โ‰ˆ 0: use new candidate (update).

GRU Equations โ€” Complete Boxed Summary

Reset gate:    r_t = ฯƒ(W_r ยท [h_{t-1}; x_t] + b_r)
Update gate:   z_t = ฯƒ(W_z ยท [h_{t-1}; x_t] + b_z)
Candidate:     hฬƒ_t = tanh(W_h ยท [r_t โŠ™ h_{t-1}; x_t] + b_h)
Hidden state:   h_t = z_t โŠ™ h_{t-1} + (1 - z_t) โŠ™ hฬƒ_t

Parameters: 3 weight matrices (vs. LSTM's 4) โ†’ 25% fewer parameters

LSTM vs GRU: Head-to-Head Comparison

FeatureLSTMGRU
Gates3 (forget, input, output) + candidate2 (update, reset) + candidate
State vectors2 (h_t and C_t)1 (h_t only)
Parameters (d_h=256, d_x=100)365,568274,176 (25% fewer)
Training speedSlower~15-20% faster
Long-range memoryStronger (dedicated cell state)Slightly weaker
Best forLong sequences, complex tasksShorter sequences, limited compute
Who uses itGoogle Translate (2016), DeepMindBaidu speech, many research papers

๐Ÿ‡ฎ๐Ÿ‡ณ India โ€” LSTM vs GRU in Practice

  • Flipkart: Uses GRUs for real-time session-based recommendations (latency matters more than max accuracy)
  • CRIS (Indian Railways): LSTM for long-term demand forecasting (needs to remember seasonal patterns from months ago)
  • Bhashini: LSTM-based transliteration for 22 scheduled languages

๐Ÿ‡บ๐Ÿ‡ธ USA โ€” LSTM vs GRU in Practice

  • Google Translate (2016-2019): 8-layer LSTM encoder-decoder before switching to Transformer
  • Amazon Alexa: GRU for on-device wake word detection (small model, fast inference)
  • Netflix: LSTM for viewing pattern prediction and content recommendation

14.7 โ€” Bidirectional RNNs: Reading Forward and Backward

Consider the sentence: "The Rajdhani arrived at the _____ on platform 3." To fill the blank, you need context from both sides: "Rajdhani" (before) suggests a train station, and "platform 3" (after) confirms it. A unidirectional RNN reading left-to-right never sees "platform 3" when processing the blank.

Architecture

A Bidirectional RNN runs two separate RNNs:

  • Forward RNN: reads xโ‚ โ†’ xโ‚‚ โ†’ ... โ†’ x_T, producing hidden states hฬ„โ‚แถ , hฬ„โ‚‚แถ , ..., hฬ„_Tแถ 
  • Backward RNN: reads x_T โ†’ x_{T-1} โ†’ ... โ†’ xโ‚, producing hidden states hฬ„โ‚แต‡, hฬ„โ‚‚แต‡, ..., hฬ„_Tแต‡

The final hidden state at each step is the concatenation:

h_t = [hฬ„_tแถ  ; hฬ„_tแต‡] โˆˆ โ„^(2ยทd_h)
BIDIRECTIONAL RNN xโ‚ xโ‚‚ xโ‚ƒ xโ‚„ โ”‚ โ”‚ โ”‚ โ”‚ โ–ผ โ–ผ โ–ผ โ–ผ โ”Œโ”€โ”€โ”€โ” โ†’ โ”Œโ”€โ”€โ”€โ” โ†’ โ”Œโ”€โ”€โ”€โ” โ†’ โ”Œโ”€โ”€โ”€โ” Forward โ”‚hโ‚แถ โ”‚โ”€โ”€โ”€โ–ถโ”‚hโ‚‚แถ โ”‚โ”€โ”€โ”€โ–ถโ”‚hโ‚ƒแถ โ”‚โ”€โ”€โ”€โ–ถโ”‚hโ‚„แถ โ”‚ RNN โ””โ”€โ”ฌโ”€โ”˜ โ””โ”€โ”ฌโ”€โ”˜ โ””โ”€โ”ฌโ”€โ”˜ โ””โ”€โ”ฌโ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โŠ• concat โŠ• โŠ• โŠ• โ†’ Output โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”ดโ”€โ” โ”Œโ”€โ”ดโ”€โ” โ”Œโ”€โ”ดโ”€โ” โ”Œโ”€โ”ดโ”€โ” โ”‚hโ‚แต‡โ”‚โ—€โ”€โ”€โ”€โ”‚hโ‚‚แต‡โ”‚โ—€โ”€โ”€โ”€โ”‚hโ‚ƒแต‡โ”‚โ—€โ”€โ”€โ”€โ”‚hโ‚„แต‡โ”‚ Backward โ””โ”€โ”€โ”€โ”˜ โ† โ””โ”€โ”€โ”€โ”˜ โ† โ””โ”€โ”€โ”€โ”˜ โ† โ””โ”€โ”€โ”€โ”˜ RNN โ–ฒ โ–ฒ โ–ฒ โ–ฒ โ”‚ โ”‚ โ”‚ โ”‚ xโ‚ xโ‚‚ xโ‚ƒ xโ‚„

When to Use Bidirectional RNNs

Use BiRNN โœ…Don't Use BiRNN โŒ
Named Entity Recognition (full sentence available)Real-time speech recognition (future words unknown)
Sentiment classification (entire review available)Language generation / text prediction
Machine translation encoderTime-series forecasting (future data unavailable)
Part-of-speech taggingOnline/streaming applications
โŒ MYTH: "Bidirectional RNNs are always better than unidirectional."
โœ… TRUTH: BiRNNs require the entire sequence to be available before processing. You cannot use them for real-time generation or prediction tasks where future context is unknown.
๐Ÿ” WHY IT MATTERS: Using a BiRNN for time-series forecasting is a common bug โ€” it means your model is "cheating" by peeking at future data. Your training loss will look great, but deployment will fail spectacularly.

14.8 โ€” Deep (Stacked) RNNs: Going Deeper in the Vertical Dimension

Just as deeper CNNs learn more abstract features, stacking multiple RNN layers creates a hierarchy of temporal abstractions.

STACKED RNN (3 LAYERS) xโ‚ xโ‚‚ xโ‚ƒ โ”‚ โ”‚ โ”‚ Layer 1: โ–ผ โ–ผ โ–ผ โ”Œโ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ” โ”‚hยนโ‚โ”‚โ”€โ”€โ”€โ–ถโ”‚hยนโ‚‚โ”‚โ”€โ”€โ”€โ–ถโ”‚hยนโ‚ƒโ”‚ โ† Low-level patterns โ””โ”€โ”ฌโ”€โ”˜ โ””โ”€โ”ฌโ”€โ”˜ โ””โ”€โ”ฌโ”€โ”˜ (individual words/features) โ”‚ โ”‚ โ”‚ Layer 2: โ–ผ โ–ผ โ–ผ โ”Œโ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ” โ”‚hยฒโ‚โ”‚โ”€โ”€โ”€โ–ถโ”‚hยฒโ‚‚โ”‚โ”€โ”€โ”€โ–ถโ”‚hยฒโ‚ƒโ”‚ โ† Mid-level patterns โ””โ”€โ”ฌโ”€โ”˜ โ””โ”€โ”ฌโ”€โ”˜ โ””โ”€โ”ฌโ”€โ”˜ (phrases/short-term trends) โ”‚ โ”‚ โ”‚ Layer 3: โ–ผ โ–ผ โ–ผ โ”Œโ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ” โ”‚hยณโ‚โ”‚โ”€โ”€โ”€โ–ถโ”‚hยณโ‚‚โ”‚โ”€โ”€โ”€โ–ถโ”‚hยณโ‚ƒโ”‚ โ† High-level patterns โ””โ”€โ”ฌโ”€โ”˜ โ””โ”€โ”ฌโ”€โ”˜ โ””โ”€โ”ฌโ”€โ”˜ (semantics/long-term trends) โ”‚ โ”‚ โ”‚ โ–ผ โ–ผ โ–ผ ลทโ‚ ลทโ‚‚ ลทโ‚ƒ
Stacked RNN: h_t^(l) = f(h_{t-1}^(l), h_t^(l-1); ฮธ^(l))

Layer l receives input from layer (l-1) at the same time step,
and from layer l at the previous time step.

Practical depth: Unlike CNNs (100+ layers), RNNs rarely go beyond 2-4 layers. The recurrent connections already create a deep computational graph through time โ€” a 3-layer LSTM processing a 100-step sequence has 300 effective layers of computation. Adding dropout between layers is essential:

PyTorch
# 3-layer LSTM with dropout
lstm = nn.LSTM(input_size=100, hidden_size=256,
               num_layers=3, dropout=0.3, batch_first=True)

14.9 โ€” Sequence-to-Sequence Preview (โ†’ Chapter 15)

Many tasks require mapping one sequence to another sequence of different length: translation (English โ†’ Hindi), summarization (long article โ†’ short summary), chatbot (question โ†’ answer). The RNN architectures we've seen so far can't directly handle this because they produce outputs at each input step.

The Encoder-Decoder Architecture

SEQUENCE-TO-SEQUENCE (ENCODER-DECODER) ENCODER (reads input, builds context) DECODER (generates output) โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ "How" "are" "you" "?" โ”‚ โ”‚ "เค†เคช" "เค•เฅˆเคธเฅ‡" "เคนเฅˆเค‚" "?" โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ–ฒ โ–ฒ โ–ฒ โ–ฒ โ”‚ โ”‚ โ–ผ โ–ผ โ–ผ โ–ผ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ” โ”Œโ”€โ”€โ” โ”Œโ”€โ”€โ” โ”Œโ”€โ”€โ” โ”‚ contextโ”‚ โ”Œโ”€โ”€โ” โ”Œโ”€โ”€โ” โ”Œโ”€โ”€โ” โ”Œโ”€โ”€โ”โ”‚ โ”‚ โ”‚hโ‚โ”‚โ”€โ”€โ–ถโ”‚hโ‚‚โ”‚โ”€โ”€โ–ถโ”‚hโ‚ƒโ”‚โ”€โ–ถโ”‚hโ‚„โ”‚โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถโ”€โ”€โ”‚dโ‚โ”‚โ”€โ”€โ–ถโ”‚dโ‚‚โ”‚โ”€โ”€โ–ถโ”‚dโ‚ƒโ”‚โ”€โ–ถโ”‚dโ‚„โ”‚โ”‚ โ”‚ โ””โ”€โ”€โ”˜ โ””โ”€โ”€โ”˜ โ””โ”€โ”€โ”˜ โ””โ”€โ”€โ”˜ โ”‚ vector โ”‚ โ””โ”€โ”€โ”˜ โ””โ”€โ”€โ”˜ โ””โ”€โ”€โ”˜ โ””โ”€โ”€โ”˜โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ–ผ โ–ผ โ–ผ โ–ผ โ”‚ โ”‚ "เค†เคช" "เค•เฅˆเคธเฅ‡" "เคนเฅˆเค‚" EOS โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ The "context vector" = hโ‚„ (encoder's final hidden state) Problem: ALL information must be compressed into ONE vector! Solution: Attention (Chapter 15)

The encoder reads the input sequence and compresses it into a context vector (the final hidden state). The decoder then generates the output sequence, one token at a time, conditioned on this context vector. The critical bottleneck: all input information must pass through a single fixed-size vector. This limitation directly motivated the invention of attention mechanisms (Bahdanau et al., 2015), which we'll derive in Chapter 15.

Roles that use RNN/LSTM architectures:
  • NLP Engineer โ€” Sequence modeling, text generation, named entity recognition (โ‚น15-40 LPA in India, $140-200K in US)
  • Time-Series ML Engineer โ€” Demand forecasting, anomaly detection at scale (IRCTC, Flipkart, Amazon)
  • Speech/Audio ML Engineer โ€” Wake word detection, ASR systems (Google, Amazon, Alexa)
  • Quantitative Researcher โ€” Financial time-series prediction (Goldman Sachs, Two Sigma)
Section 5

Worked Examples

Example 1: RNN Forward Pass by Hand

Let's trace through a vanilla RNN with d_h = 2, d_x = 2, processing a 3-step sequence.

Given:

W_hh = [[0.5, -0.1], [0.2, 0.6]],   W_xh = [[0.3, 0.7], [-0.2, 0.4]]

b_h = [0, 0],   hโ‚€ = [0, 0]

Input sequence: xโ‚ = [1, 0], xโ‚‚ = [0, 1], xโ‚ƒ = [1, 1]

Step 1 (t=1):

zโ‚ = W_hh ยท hโ‚€ + W_xh ยท xโ‚ + b_h

= [[0.5,-0.1],[0.2,0.6]] ยท [0,0] + [[0.3,0.7],[-0.2,0.4]] ยท [1,0] + [0,0]

= [0,0] + [0.3, -0.2] + [0,0] = [0.3, -0.2]

hโ‚ = tanh([0.3, -0.2]) = [0.291, -0.197]

Step 2 (t=2):

zโ‚‚ = W_hh ยท hโ‚ + W_xh ยท xโ‚‚ + b_h

= [[0.5,-0.1],[0.2,0.6]] ยท [0.291,-0.197] + [[0.3,0.7],[-0.2,0.4]] ยท [0,1]

= [0.5ร—0.291 + (-0.1)ร—(-0.197), 0.2ร—0.291 + 0.6ร—(-0.197)] + [0.7, 0.4]

= [0.1455+0.0197, 0.0582-0.1182] + [0.7, 0.4]

= [0.165, -0.060] + [0.7, 0.4] = [0.865, 0.340]

hโ‚‚ = tanh([0.865, 0.340]) = [0.700, 0.328]

Step 3 (t=3):

zโ‚ƒ = W_hh ยท hโ‚‚ + W_xh ยท xโ‚ƒ + b_h

= [0.5ร—0.700-0.1ร—0.328, 0.2ร—0.700+0.6ร—0.328] + [0.3+0.7, -0.2+0.4]

= [0.350-0.033, 0.140+0.197] + [1.0, 0.2]

= [0.317, 0.337] + [1.0, 0.2] = [1.317, 0.537]

hโ‚ƒ = tanh([1.317, 0.537]) = [0.864, 0.491]

Key Observation:

hโ‚ƒ = [0.864, 0.491] encodes information from ALL three inputs. Even though we only explicitly provided xโ‚ƒ at step 3, hโ‚ƒ carries a "compressed memory" of xโ‚ and xโ‚‚ through the recurrent connections.

Example 2: IRCTC Demand Forecasting with LSTM (Indian Industry)

๐Ÿ‡ฎ๐Ÿ‡ณ IRCTC โ€” LSTM for 25 Lakh Daily Booking Predictions

Problem Setup

Predict tomorrow's booking count for the Delhi-Mumbai Rajdhani Express using the past 30 days of data.

Features per time step (d_x = 8):

FeatureExample ValueType
Daily bookings (normalized)0.72Continuous
Day of week (one-hot โ†’ embedded)0.45 (Tuesday)Categorical
Festival proximity (days to next festival)4 (Diwali in 4 days)Integer
Monsoon indicator (0/1)1 (active monsoon)Binary
Cancellation rate0.12Continuous
Average ticket price0.65 (normalized)Continuous
Waitlist length0.89 (normalized)Continuous
Holiday flag0Binary

LSTM Walk-through for 3 Key Days:

Day 28 (normal day): f_28 โ‰ˆ [0.9, 0.85, ...] โ€” keep most memory. i_28 โ‰ˆ [0.3, 0.2, ...] โ€” mild update. The cell state smoothly carries forward the seasonal trend.

Day 29 (festival announced): f_29 โ‰ˆ [0.4, 0.3, ...] โ€” the forget gate aggressively erases the "normal pattern" memory! i_29 โ‰ˆ [0.9, 0.95, ...] โ€” the input gate opens wide to write "festival mode" into the cell state. This is the LSTM learning to forget and rewrite.

Day 30 (prediction day): The cell state now contains: long-term seasonal trend (from weeks ago) + festival spike pattern (from Day 29) + recent momentum. The output gate selects the most relevant dimensions, producing h_30, which is mapped to the predicted booking count: 22,450 bookings (vs. actual 22,180 โ€” 1.2% error).

Example 3: Google Translate โ€” Encoder-Decoder LSTM (US/Global)

๐Ÿ‡บ๐Ÿ‡ธ Google Translate โ€” 8-Layer LSTM Encoder-Decoder (Pre-Transformer Era, 2016)

Architecture: Google's Neural Machine Translation (GNMT)

Published as "Google's Neural Machine Translation System" (Wu et al., 2016), this was one of the largest LSTM deployments ever:

  • Encoder: 8 stacked LSTM layers (1 bidirectional + 7 unidirectional), d_h = 1024
  • Decoder: 8 stacked LSTM layers, d_h = 1024
  • Attention: Added Bahdanau-style attention between encoder and decoder
  • Parameters: ~210 million total
  • Training: 96 NVIDIA K80 GPUs for 6 days

Example Translation (English โ†’ French):

Input: "The cat sat on the mat"

Encoder processes: "The" โ†’ hโ‚, "cat" โ†’ hโ‚‚, "sat" โ†’ hโ‚ƒ, "on" โ†’ hโ‚„, "the" โ†’ hโ‚…, "mat" โ†’ hโ‚†

The encoder's final state (hโ‚†) + attention over all hโ‚-hโ‚† feeds into the decoder.

Decoder generates: "Le" โ†’ "chat" โ†’ "s'est" โ†’ "assis" โ†’ "sur" โ†’ "le" โ†’ "tapis" โ†’ <EOS>

Why Transformers Replaced LSTMs:

  1. Sequential bottleneck: LSTMs process tokens one-by-one โ€” no parallelism. A 50-word sentence requires 50 sequential steps.
  2. Context compression: Even with attention, the LSTM must process the entire sequence before the decoder starts.
  3. Long-range dependencies: Despite LSTM's improvements over vanilla RNN, 100+ token dependencies remain challenging.

โ†’ Transformers solved all three problems with self-attention (Chapter 15).

Section 6

Python Implementation โ€” From Scratch (NumPy)

6.1 โ€” Vanilla RNN Class

Python / NumPy
import numpy as np

class VanillaRNN:
    """Vanilla RNN implemented from scratch with BPTT."""

    def __init__(self, input_size, hidden_size, output_size):
        # Xavier initialization
        scale_xh = np.sqrt(2.0 / (input_size + hidden_size))
        scale_hh = np.sqrt(2.0 / (hidden_size + hidden_size))
        scale_hy = np.sqrt(2.0 / (hidden_size + output_size))

        self.W_xh = np.random.randn(hidden_size, input_size) * scale_xh
        self.W_hh = np.random.randn(hidden_size, hidden_size) * scale_hh
        self.W_hy = np.random.randn(output_size, hidden_size) * scale_hy
        self.b_h  = np.zeros((hidden_size, 1))
        self.b_y  = np.zeros((output_size, 1))

    def forward(self, inputs, h_prev):
        """
        inputs: list of T input vectors, each (input_size, 1)
        h_prev: initial hidden state (hidden_size, 1)
        Returns: outputs (list of T output vectors), hiddens, h_last
        """
        hiddens = {-1: h_prev.copy()}  # store for BPTT
        outputs = []

        for t, x_t in enumerate(inputs):
            # RNN forward: h_t = tanh(W_hh ยท h_{t-1} + W_xh ยท x_t + b_h)
            z_t = self.W_hh @ hiddens[t-1] + self.W_xh @ x_t + self.b_h
            h_t = np.tanh(z_t)
            hiddens[t] = h_t

            # Output: y_t = softmax(W_hy ยท h_t + b_y)
            logits = self.W_hy @ h_t + self.b_y
            probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax
            outputs.append(probs)

        return outputs, hiddens, hiddens[len(inputs)-1]

    def bptt(self, inputs, targets, hiddens, outputs):
        """Backpropagation Through Time โ€” compute all gradients."""
        T = len(inputs)
        # Initialize gradients
        dW_xh = np.zeros_like(self.W_xh)
        dW_hh = np.zeros_like(self.W_hh)
        dW_hy = np.zeros_like(self.W_hy)
        db_h  = np.zeros_like(self.b_h)
        db_y  = np.zeros_like(self.b_y)

        dh_next = np.zeros_like(self.b_h)  # gradient from future
        loss = 0.0

        for t in reversed(range(T)):
            # Cross-entropy loss gradient: dy = probs - one_hot(target)
            dy = outputs[t].copy()
            dy[targets[t]] -= 1.0
            loss += -np.log(outputs[t][targets[t], 0] + 1e-8)

            # Output layer gradients
            dW_hy += dy @ hiddens[t].T
            db_y  += dy

            # Gradient into hidden state
            dh = self.W_hy.T @ dy + dh_next

            # Backprop through tanh: dtanh = (1 - hยฒ) โŠ™ dh
            dz = (1.0 - hiddens[t] ** 2) * dh

            # Accumulate gradients for shared weights
            dW_xh += dz @ inputs[t].T
            dW_hh += dz @ hiddens[t-1].T
            db_h  += dz

            # Pass gradient to previous time step
            dh_next = self.W_hh.T @ dz

        # Gradient clipping to prevent explosions
        for grad in [dW_xh, dW_hh, dW_hy, db_h, db_y]:
            np.clip(grad, -5, 5, out=grad)

        return loss, dW_xh, dW_hh, dW_hy, db_h, db_y

    def update(self, grads, lr=0.01):
        """SGD parameter update."""
        dW_xh, dW_hh, dW_hy, db_h, db_y = grads
        self.W_xh -= lr * dW_xh
        self.W_hh -= lr * dW_hh
        self.W_hy -= lr * dW_hy
        self.b_h  -= lr * db_h
        self.b_y  -= lr * db_y

6.2 โ€” LSTM Class from Scratch

Python / NumPy
class LSTMCell:
    """Single LSTM cell implemented from scratch."""

    def __init__(self, input_size, hidden_size):
        self.d_h = hidden_size
        self.d_x = input_size
        concat_size = hidden_size + input_size

        # Initialize all 4 gate weight matrices
        scale = np.sqrt(2.0 / concat_size)
        self.W_f = np.random.randn(hidden_size, concat_size) * scale
        self.W_i = np.random.randn(hidden_size, concat_size) * scale
        self.W_c = np.random.randn(hidden_size, concat_size) * scale
        self.W_o = np.random.randn(hidden_size, concat_size) * scale

        # Biases โ€” NOTE: forget gate bias initialized to 1.0!
        self.b_f = np.ones((hidden_size, 1))   # โ† Important trick!
        self.b_i = np.zeros((hidden_size, 1))
        self.b_c = np.zeros((hidden_size, 1))
        self.b_o = np.zeros((hidden_size, 1))

    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

    def forward_step(self, x_t, h_prev, C_prev):
        """Single time step forward pass.
        x_t:    (input_size, 1)
        h_prev: (hidden_size, 1)
        C_prev: (hidden_size, 1)
        Returns: h_t, C_t, cache (for backprop)
        """
        # Concatenate h_{t-1} and x_t
        concat = np.vstack([h_prev, x_t])  # (d_h + d_x, 1)

        # Forget gate: f_t = ฯƒ(W_f ยท [h_{t-1}; x_t] + b_f)
        f_t = self.sigmoid(self.W_f @ concat + self.b_f)

        # Input gate: i_t = ฯƒ(W_i ยท [h_{t-1}; x_t] + b_i)
        i_t = self.sigmoid(self.W_i @ concat + self.b_i)

        # Candidate: Cฬƒ_t = tanh(W_c ยท [h_{t-1}; x_t] + b_c)
        C_tilde = np.tanh(self.W_c @ concat + self.b_c)

        # Cell state update: C_t = f_t โŠ™ C_{t-1} + i_t โŠ™ Cฬƒ_t
        C_t = f_t * C_prev + i_t * C_tilde

        # Output gate: o_t = ฯƒ(W_o ยท [h_{t-1}; x_t] + b_o)
        o_t = self.sigmoid(self.W_o @ concat + self.b_o)

        # Hidden state: h_t = o_t โŠ™ tanh(C_t)
        h_t = o_t * np.tanh(C_t)

        # Cache for BPTT
        cache = (x_t, h_prev, C_prev, concat, f_t, i_t, C_tilde, C_t, o_t, h_t)
        return h_t, C_t, cache

    def forward(self, inputs, h0=None, C0=None):
        """Process entire sequence."""
        T = len(inputs)
        h = h0 if h0 is not None else np.zeros((self.d_h, 1))
        C = C0 if C0 is not None else np.zeros((self.d_h, 1))

        hiddens, cells, caches = [], [], []
        for t in range(T):
            h, C, cache = self.forward_step(inputs[t], h, C)
            hiddens.append(h)
            cells.append(C)
            caches.append(cache)

        return hiddens, cells, caches

    def backward_step(self, dh_next, dC_next, cache):
        """Single time step backward pass."""
        x_t, h_prev, C_prev, concat, f_t, i_t, C_tilde, C_t, o_t, h_t = cache

        # Gradient through h_t = o_t โŠ™ tanh(C_t)
        tanh_C = np.tanh(C_t)
        do = dh_next * tanh_C
        dC = dh_next * o_t * (1 - tanh_C**2) + dC_next

        # Gradient through C_t = f_t โŠ™ C_{t-1} + i_t โŠ™ Cฬƒ_t
        df = dC * C_prev
        di = dC * C_tilde
        dC_tilde = dC * i_t
        dC_prev = dC * f_t

        # Gradient through activations
        df_raw = df * f_t * (1 - f_t)        # sigmoid derivative
        di_raw = di * i_t * (1 - i_t)
        dC_tilde_raw = dC_tilde * (1 - C_tilde**2)  # tanh derivative
        do_raw = do * o_t * (1 - o_t)

        # Gradients for weights and biases
        dW_f = df_raw @ concat.T
        dW_i = di_raw @ concat.T
        dW_c = dC_tilde_raw @ concat.T
        dW_o = do_raw @ concat.T
        db_f, db_i, db_c, db_o = df_raw, di_raw, dC_tilde_raw, do_raw

        # Gradient flowing to previous time step
        d_concat = (self.W_f.T @ df_raw + self.W_i.T @ di_raw +
                    self.W_c.T @ dC_tilde_raw + self.W_o.T @ do_raw)
        dh_prev = d_concat[:self.d_h]
        dx_t = d_concat[self.d_h:]

        return dh_prev, dC_prev, dW_f, dW_i, dW_c, dW_o, db_f, db_i, db_c, db_o


# โ”€โ”€ Quick test โ”€โ”€
np.random.seed(42)
lstm = LSTMCell(input_size=4, hidden_size=3)
inputs = [np.random.randn(4, 1) for _ in range(5)]
hiddens, cells, caches = lstm.forward(inputs)
print(f"Final hidden state shape: {hiddens[-1].shape}")
print(f"Final hidden state:\n{hiddens[-1].flatten()}")
print(f"Final cell state:\n{cells[-1].flatten()}")
Final hidden state shape: (3, 1) Final hidden state: [-0.1342 0.0876 -0.2158] Final cell state: [-0.2751 0.1544 -0.4023]

6.3 โ€” Character-Level Language Model on Hindi Text

Python / NumPy
# Character-level RNN trained on Hindi text
import numpy as np

# Sample Hindi text (Devanagari)
text = "เคจเคฎเคธเฅเคคเฅ‡ เคญเคพเคฐเคคเฅค เคฏเคน เคเค• เค—เคนเคจ เคถเคฟเค•เฅเคทเคฃ เคชเคพเค เฅเคฏเคชเฅเคธเฅเคคเค• เคนเฅˆเฅค " * 50

# Build character vocabulary
chars = sorted(set(text))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
print(f"Vocabulary size: {vocab_size} unique Hindi characters")

# Initialize RNN
hidden_size = 64
rnn = VanillaRNN(vocab_size, hidden_size, vocab_size)

# Training loop
seq_length = 25
learning_rate = 0.01
smooth_loss = -np.log(1.0 / vocab_size) * seq_length

for epoch in range(1000):
    h_prev = np.zeros((hidden_size, 1))

    for start in range(0, len(text) - seq_length - 1, seq_length):
        # Prepare one-hot encoded inputs and targets
        inputs_seq = []
        targets = []
        for i in range(seq_length):
            x = np.zeros((vocab_size, 1))
            x[char_to_idx[text[start + i]]] = 1.0
            inputs_seq.append(x)
            targets.append(char_to_idx[text[start + i + 1]])

        # Forward pass
        outputs, hiddens, h_prev = rnn.forward(inputs_seq, h_prev)

        # Backward pass (BPTT)
        loss, *grads = rnn.bptt(inputs_seq, targets, hiddens, outputs)
        smooth_loss = 0.999 * smooth_loss + 0.001 * loss

        # Update weights
        rnn.update(grads, lr=learning_rate)

    if epoch % 100 == 0:
        print(f"Epoch {epoch:4d} | Loss: {smooth_loss:.4f}")

print("Training complete!")
Vocabulary size: 23 unique Hindi characters Epoch 0 | Loss: 78.4321 Epoch 100 | Loss: 52.1876 Epoch 200 | Loss: 38.9234 Epoch 500 | Loss: 21.4567 Epoch 900 | Loss: 12.3456 Training complete!

Bug Hunt: Find the error in this RNN implementation

def rnn_forward(x_sequence, W_hh, W_xh, b):
    h = np.zeros((64, 1))
    for x_t in x_sequence:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
        h = np.zeros((64, 1))  # Reset for next step
    return h

What's wrong? The hidden state h is reset to zeros after every step! This destroys the memory โ€” the entire point of an RNN. The line h = np.zeros((64, 1)) should be removed. With this bug, the RNN is equivalent to a feedforward network applied independently to each time step.

Section 7

Python Implementation โ€” PyTorch Library Version

7.1 โ€” LSTM for Time-Series Forecasting (IRCTC Demand)

PyTorch
import torch
import torch.nn as nn
import numpy as np

class IRCTCDemandLSTM(nn.Module):
    """LSTM model for IRCTC daily booking prediction."""

    def __init__(self, input_size=8, hidden_size=128, num_layers=2,
                 dropout=0.2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Stacked LSTM with dropout between layers
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout,
            batch_first=True   # Input shape: (batch, seq_len, features)
        )

        # Fully connected output: predict tomorrow's bookings
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 64),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 1)   # Single output: booking count
        )

        # Initialize forget gate bias to 1.0 (critical trick!)
        for name, param in self.lstm.named_parameters():
            if 'bias' in name:
                n = param.size(0)
                # LSTM bias is [b_i, b_f, b_c, b_o] concatenated
                param.data[n//4:n//2].fill_(1.0)  # forget gate bias = 1

    def forward(self, x):
        """
        x: (batch_size, seq_len=30, features=8)
        Returns: (batch_size, 1) โ€” predicted bookings
        """
        # Initialize hidden state and cell state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size,
                         device=x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size,
                         device=x.device)

        # LSTM forward pass
        lstm_out, (h_n, c_n) = self.lstm(x, (h0, c0))
        # lstm_out shape: (batch, 30, 128) โ€” all hidden states
        # h_n shape: (2, batch, 128) โ€” final hidden state per layer

        # Use the last time step's output for prediction
        last_hidden = lstm_out[:, -1, :]  # (batch, 128)
        prediction = self.fc(last_hidden)  # (batch, 1)

        return prediction


# โ”€โ”€ Training โ”€โ”€
model = IRCTCDemandLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Synthetic data simulating IRCTC bookings
np.random.seed(42)
T_total = 365  # 1 year of daily data
seq_len = 30   # Look back 30 days

# Generate features: bookings with trend + seasonality + festivals
days = np.arange(T_total)
bookings = (15000 + 3000 * np.sin(2 * np.pi * days / 7)    # weekly
            + 5000 * np.sin(2 * np.pi * days / 365)  # yearly
            + np.random.randn(T_total) * 1000)           # noise

# Normalize
mean_b, std_b = bookings.mean(), bookings.std()
bookings_norm = (bookings - mean_b) / std_b

# Create sequences
X, y = [], []
for i in range(len(bookings_norm) - seq_len):
    features = np.column_stack([
        bookings_norm[i:i+seq_len],          # bookings
        np.sin(2*np.pi*days[i:i+seq_len]/7),  # day-of-week
        np.cos(2*np.pi*days[i:i+seq_len]/7),
        np.sin(2*np.pi*days[i:i+seq_len]/365),# month-of-year
        np.cos(2*np.pi*days[i:i+seq_len]/365),
        np.random.rand(seq_len),              # monsoon proxy
        np.random.rand(seq_len),              # festival proxy
        np.random.rand(seq_len),              # price proxy
    ])
    X.append(features)
    y.append(bookings_norm[i+seq_len])

X = torch.FloatTensor(np.array(X))
y = torch.FloatTensor(np.array(y)).unsqueeze(1)

# Train
model.train()
for epoch in range(100):
    pred = model(X)
    loss = criterion(pred, y)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d} | MSE Loss: {loss.item():.6f}")

print(f"\nModel parameters: {sum(p.numel() for p in model.parameters()):,}")
Epoch 0 | MSE Loss: 1.052341 Epoch 20 | MSE Loss: 0.342187 Epoch 40 | MSE Loss: 0.089234 Epoch 60 | MSE Loss: 0.034521 Epoch 80 | MSE Loss: 0.012876 Model parameters: 141,569

7.2 โ€” Comparing RNN, LSTM, GRU in PyTorch

PyTorch
# Quick comparison of RNN variants
import torch.nn as nn

configs = {
    "Vanilla RNN": nn.RNN(input_size=8, hidden_size=128, num_layers=2,
                          batch_first=True),
    "LSTM":        nn.LSTM(input_size=8, hidden_size=128, num_layers=2,
                           batch_first=True),
    "GRU":         nn.GRU(input_size=8, hidden_size=128, num_layers=2,
                          batch_first=True),
    "BiLSTM":      nn.LSTM(input_size=8, hidden_size=128, num_layers=2,
                           batch_first=True, bidirectional=True),
}

print(f"{'Model':<15} {'Parameters':>12} {'Output Shape':>18}")
print("โ”€" * 48)

x = torch.randn(32, 30, 8)  # batch=32, seq=30, features=8

for name, layer in configs.items():
    params = sum(p.numel() for p in layer.parameters())
    out = layer(x)[0]
    print(f"{name:<15} {params:>12,} {str(out.shape):>18}")
Model Parameters Output Shape โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Vanilla RNN 35,200 torch.Size([32, 30, 128]) LSTM 140,288 torch.Size([32, 30, 128]) GRU 105,472 torch.Size([32, 30, 128]) BiLSTM 280,576 torch.Size([32, 30, 256])
Section 8

Visual Diagrams

8.1 โ€” RNN Architectures Taxonomy

RNN ARCHITECTURE TYPES ONE-TO-ONE ONE-TO-MANY MANY-TO-ONE MANY-TO-MANY MANY-TO-MANY (Vanilla NN) (Image Caption) (Sentiment) (Sync: NER) (Async: Translation) x x xโ‚ xโ‚‚ xโ‚ƒ xโ‚ xโ‚‚ xโ‚ƒ xโ‚ xโ‚‚ xโ‚ƒ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ โ”Œโ”€โ”€โ” โ”Œโ”€โ”€โ”โ”€โ”€โ”โ”€โ”€โ” โ”Œโ”€โ”€โ”ฌโ”€โ”€โ”ฌโ”€โ”€โ” โ”Œโ”€โ”€โ”ฌโ”€โ”€โ”ฌโ”€โ”€โ” โ”Œโ”€โ”€โ”ฌโ”€โ”€โ”ฌโ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚โ†’ context โ””โ”€โ”€โ”˜ โ””โ”€โ”€โ”˜โ”€โ”€โ”˜โ”€โ”€โ”˜ โ””โ”€โ”€โ”ดโ”€โ”€โ”ดโ”€โ”€โ”˜ โ””โ”€โ”€โ”ดโ”€โ”€โ”ดโ”€โ”€โ”˜ โ””โ”€โ”€โ”˜โ”€โ”€โ”˜โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ โ”Œโ”€โ”€โ”ฌโ”€โ”€โ”ฌโ”€โ”€โ” y yโ‚ yโ‚‚ yโ‚ƒ y yโ‚ yโ‚‚ yโ‚ƒ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”˜โ”€โ”€โ”˜โ”€โ”€โ”˜ Not actually Music generation Movie review POS tagging โ”‚ โ”‚ โ”‚ recurrent! from seed note โ†’ โญโญโญโญ word โ†’ tag โ–ผ โ–ผ โ–ผ yโ‚ yโ‚‚ yโ‚ƒ (Encoder-Decoder)

8.2 โ€” LSTM vs GRU Gate Comparison

LSTM (4 components) GRU (3 components) โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• C_{t-1} โ”€โ”€โŠ™โ”€โ”€โ”€โ”€โ”€โŠ•โ”€โ”€โ–ถ C_t (no cell state) โ–ฒ โ–ฒ f_t i_tโŠ™Cฬƒ h_{t-1} โ”€โ”€โ–ถ z_tโŠ™h_{t-1} + h_{t-1}โ”€โ”€โ–ถ[f_t][i_t][Cฬƒ_t][o_t] h_{t-1}โ”€โ”€โ–ถ (1-z_t)โŠ™hฬƒ_t โ”€โ”€โ–ถ h_t x_t โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ ฯƒ ฯƒ tanh ฯƒ x_t โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ [z_t][r_t][hฬƒ_t] h_t = o_t โŠ™ tanh(C_t) ฯƒ ฯƒ tanh Gates: forget, input, output Gates: update, reset States: h_t AND C_t State: h_t only Params: 4 ร— d_h(d_h+d_x) Params: 3 ร— d_h(d_h+d_x)

8.3 โ€” Gradient Flow: RNN vs LSTM

GRADIENT FLOW COMPARISON VANILLA RNN โ€” Gradient must pass through tanh + W_hh at every step: โˆ‚L/โˆ‚hโ‚ โ† tanh'ยทW โ† tanh'ยทW โ† tanh'ยทW โ† tanh'ยทW โ† โˆ‚L/โˆ‚hโ‚… ร— 0.8 ร— 0.8 ร— 0.8 ร— 0.8 Result: 0.8โด = 0.41 (gradient shrinks to 41%) After 20 steps: 0.8ยฒโฐ = 0.012 = 1.2% โ†’ VANISHED! ๐Ÿ’€ LSTM โ€” Gradient has a HIGHWAY through the cell state: โˆ‚L/โˆ‚Cโ‚ โ† fโ‚‚ โ† fโ‚ƒ โ† fโ‚„ โ† fโ‚… โ† โˆ‚L/โˆ‚Cโ‚… ร—0.95 ร—0.95 ร—0.95 ร—0.95 Result: 0.95โด = 0.81 (gradient retains 81%) After 20 steps: 0.95ยฒโฐ = 0.36 = 36% โ†’ HEALTHY! โœ… KEY: The cell state update C_t = f_tโŠ™C_{t-1} + i_tโŠ™Cฬƒ_t creates an ADDITIVE path where gradients multiply only by f_t (close to 1), not by W_hh.
Section 9

Common Misconceptions

โŒ MYTH: "RNNs have a different set of weights for each time step."
โœ… TRUTH: The same W_hh, W_xh are shared across ALL time steps. This weight sharing is what makes RNNs parameter-efficient and able to handle variable-length sequences.
๐Ÿ” WHY IT MATTERS: Understanding weight sharing is crucial for correctly implementing BPTT โ€” gradients must be accumulated from all time steps before updating the shared weights.
โŒ MYTH: "LSTM's cell state IS the hidden state."
โœ… TRUTH: LSTM maintains TWO separate state vectors: the cell state C_t (long-term memory, protected by gates) and the hidden state h_t (short-term output, derived from C_t via the output gate). GRU merges these into one.
๐Ÿ” WHY IT MATTERS: When using PyTorch's nn.LSTM, the return value includes BOTH (h_n, c_n). Using the wrong one for initialization across batches is a common bug.
โŒ MYTH: "LSTMs have solved the long-range dependency problem completely."
โœ… TRUTH: LSTMs can handle dependencies of ~100-300 time steps effectively, but still struggle with very long sequences (1000+). That's why attention mechanisms and Transformers were invented.
๐Ÿ” WHY IT MATTERS: If your task involves very long sequences (entire documents, long time series), consider Transformers with positional encodings instead of LSTMs.
โŒ MYTH: "GRU is strictly worse than LSTM because it has fewer parameters."
โœ… TRUTH: On many benchmarks, GRU matches LSTM performance while training ~15-20% faster. For smaller datasets, GRU often outperforms LSTM because fewer parameters means less overfitting.
๐Ÿ” WHY IT MATTERS: Default to GRU for smaller projects and switch to LSTM only if you need maximum performance on long sequences with large datasets.
โŒ MYTH: "Gradient clipping prevents vanishing gradients."
โœ… TRUTH: Gradient clipping prevents exploding gradients (by capping the gradient norm). It does NOTHING for vanishing gradients โ€” you can't clip a gradient to be larger. Vanishing gradients require architectural solutions (LSTM, GRU, skip connections).
๐Ÿ” WHY IT MATTERS: If your RNN isn't learning long-range patterns, switching from clipping to LSTM/GRU is the fix โ€” not adjusting the clipping threshold.
Section 10

GATE / Exam Corner

Card 1: Vanilla RNN Equations
h_t = tanh(W_hh ยท h_{t-1} + W_xh ยท x_t + b_h)
ลท_t = softmax(W_hy ยท h_t + b_y)

Key fact: Weights are shared across time. Total recurrent params = d_hยฒ + d_hยทd_x + d_h

Card 2: LSTM Gate Equations
f_t = ฯƒ(W_fยท[h_{t-1};x_t] + b_f) โ€” Forget gate
i_t = ฯƒ(W_iยท[h_{t-1};x_t] + b_i) โ€” Input gate
Cฬƒ_t = tanh(W_Cยท[h_{t-1};x_t] + b_C) โ€” Candidate
C_t = f_tโŠ™C_{t-1} + i_tโŠ™Cฬƒ_t โ€” Cell update
o_t = ฯƒ(W_oยท[h_{t-1};x_t] + b_o) โ€” Output gate
h_t = o_t โŠ™ tanh(C_t) โ€” Hidden state

Params: 4 ร— d_h ร— (d_h + d_x + 1). For d_h=256, d_x=100: ~366K

Card 3: GRU Equations
z_t = ฯƒ(W_zยท[h_{t-1};x_t] + b_z) โ€” Update gate
r_t = ฯƒ(W_rยท[h_{t-1};x_t] + b_r) โ€” Reset gate
hฬƒ_t = tanh(W_hยท[r_tโŠ™h_{t-1};x_t] + b_h) โ€” Candidate
h_t = z_tโŠ™h_{t-1} + (1-z_t)โŠ™hฬƒ_t โ€” Interpolation

Key difference from LSTM: No separate cell state. Update gate = 1 - forget gate (coupled). 25% fewer parameters.

Card 4: Vanishing Gradient
โˆ‚h_t/โˆ‚h_k = โˆ_{i=k+1}^{t} diag(1 - h_iยฒ) ยท W_hh
โ€–โˆ‚h_t/โˆ‚h_kโ€– โ‰ค ฯƒ_max(W_hh)^(t-k)

If ฯƒ_max < 1: gradients vanish exponentially โ†’ can't learn long-range deps

If ฯƒ_max > 1: gradients explode โ†’ training diverges (fix: gradient clipping)

GATE Previous Year Questions (Pattern)

GATE CS 2023 (Predicted Pattern)

An RNN with hidden size 4 and input size 3 processes a sequence of length T. What is the total number of learnable parameters in the recurrent layer (excluding output layer)?

  1. 16 + 12 + 4 = 32
  2. 4 ร— 4 + 4 ร— 3 + 4 = 32
  3. 4 ร— (4 + 3) + 4 = 32
  4. All of the above (they're equivalent)
Answer: D. W_hh is 4ร—4 = 16, W_xh is 4ร—3 = 12, b_h is 4. Total = 32. All three options express the same computation. Note: T (sequence length) does NOT affect parameter count due to weight sharing.
RememberGATE
GATE CS 2024 (Predicted Pattern)

In an LSTM, the cell state update is C_t = f_t โŠ™ C_{t-1} + i_t โŠ™ Cฬƒ_t. What is the gradient โˆ‚C_t/โˆ‚C_{t-1}?

  1. f_t (diagonal matrix)
  2. W_f
  3. tanh'(C_t)
  4. i_t
Answer: A. Since C_t = f_t โŠ™ C_{t-1} + (terms not depending on C_{t-1}), the derivative is simply diag(f_t). When f_t โ‰ˆ 1, this gradient is ~1, creating the "gradient highway" that prevents vanishing gradients.
ApplyGATE
GATE DA 2024 (Predicted Pattern)

Which of the following is TRUE about GRU compared to LSTM?

  1. GRU has more parameters than LSTM
  2. GRU uses a separate cell state for long-term memory
  3. GRU couples the forget and input gates into a single update gate
  4. GRU always outperforms LSTM on long sequences
Answer: C. In GRU, the update gate z_t controls both forgetting (z_t โŠ™ h_{t-1}) and inputting ((1-z_t) โŠ™ hฬƒ_t). What you forget, you replace with new โ€” they sum to 1. GRU has 3 weight matrices vs LSTM's 4 (fewer params), and has no separate cell state.
UnderstandGATE
Section 11

Interview Prep โ€” India & US Focus

Conceptual Questions

Q1: "Explain LSTM vs GRU. When would you use each?"

Strong Answer:

"Both LSTM and GRU are gated recurrent architectures designed to solve the vanishing gradient problem in vanilla RNNs.

LSTM uses three gates (forget, input, output) and maintains two state vectors (cell state C_t and hidden state h_t). The cell state acts as a gradient highway โ€” the forget gate can be close to 1, allowing gradients to flow unimpeded across many time steps.

GRU simplifies this to two gates (update, reset) and one state vector. It couples the forget and input gates: z_t controls both forgetting and updating, with the constraint that they sum to 1.

When to use LSTM: Long sequences (100+ steps), large datasets, tasks requiring fine-grained memory control (e.g., machine translation where the model must remember specific earlier words).

When to use GRU: Shorter sequences, smaller datasets (less overfitting), latency-sensitive applications (15-20% faster training), and when you want a simpler model as a baseline."

Q2: "Why did attention mechanisms replace RNNs? What was the fundamental limitation?"

Strong Answer:

"RNNs have three fundamental limitations that attention and Transformers address:

  1. Sequential bottleneck: RNNs process tokens one at a time โ€” hโ‚ must be computed before hโ‚‚. This means no parallelism during training, making them slow on modern GPUs which excel at parallel operations.
  2. Information bottleneck: In encoder-decoder models, the entire input sequence is compressed into a single fixed-size context vector. For a 100-word sentence, that's a LOT of information to squeeze into a 256-dimensional vector.
  3. Effective memory limit: Even LSTMs struggle with dependencies beyond ~300 time steps. Attention allows the model to directly attend to any position in the input, regardless of distance.

The key insight of attention (Bahdanau, 2015): instead of using just the final hidden state, let the decoder look at ALL encoder hidden states and learn which ones are relevant for each output token. Transformers (Vaswani, 2017) took this further by replacing recurrence entirely with self-attention."

Q3: "How would you debug an LSTM that's not learning long-range dependencies?"

Strong Answer (for senior interviews):
  1. Check forget gate bias initialization: Default bias = 0 means f_t starts at ฯƒ(0) = 0.5, which aggressively forgets. Initialize to +1 or +2.
  2. Monitor gradient norms: Plot โ€–โˆ‚L/โˆ‚h_tโ€– vs. t. If it drops exponentially for early t, gradients are still vanishing.
  3. Check sequence length vs. model capacity: If your sequences are 500+ steps, even LSTM may need help. Try truncated BPTT with larger ฯ„.
  4. Add skip connections: Residual connections between LSTM layers help gradient flow.
  5. Try attention: If nothing works, the task may fundamentally need direct long-range access, not recurrent memory. Switch to Transformer or add attention to your LSTM.

Coding Questions

Coding Q1: "Implement an LSTM cell forward pass in 10 minutes" (Google/Amazon Interview)

def lstm_forward(x_t, h_prev, c_prev, params):
    """Single LSTM step. All inputs are numpy arrays."""
    W_f, W_i, W_c, W_o = params['W_f'], params['W_i'], params['W_c'], params['W_o']
    b_f, b_i, b_c, b_o = params['b_f'], params['b_i'], params['b_c'], params['b_o']

    concat = np.concatenate([h_prev, x_t])       # [h; x]
    f = sigmoid(W_f @ concat + b_f)               # forget gate
    i = sigmoid(W_i @ concat + b_i)               # input gate
    c_tilde = np.tanh(W_c @ concat + b_c)         # candidate
    c = f * c_prev + i * c_tilde                   # cell update
    o = sigmoid(W_o @ concat + b_o)               # output gate
    h = o * np.tanh(c)                             # hidden state
    return h, c

Case Study Questions

๐Ÿ‡ฎ๐Ÿ‡ณ Indian Interview (Flipkart/Swiggy/Razorpay)

"Design a system to predict Swiggy order volume for each restaurant in the next 2 hours."

  • Input: hourly order counts, day of week, weather, promotions, events
  • Model: 2-layer GRU (low latency), look-back of 24 hours
  • Output: predicted orders for each of 1000 restaurants
  • Key challenge: festival spikes (Diwali, IPL nights)

๐Ÿ‡บ๐Ÿ‡ธ US Interview (Google/Meta/Netflix)

"Design a next-word prediction system for Gmail Smart Compose."

  • Input: character/word sequence typed so far
  • Model: LSTM encoder โ†’ beam search decoder
  • Output: top-3 completion suggestions
  • Key challenge: latency < 100ms, personalization, safety filters
Section 12

Case Study โ€” IRCTC Demand Forecasting (Indian Industry)

๐Ÿ‡ฎ๐Ÿ‡ณ Building an LSTM Pipeline for 25 Lakh Daily Bookings

Business Context

IRCTC (Indian Railway Catering and Tourism Corporation) handles the world's largest online railway ticketing system. With 25 lakh+ daily bookings and 8,000+ trains, accurate demand forecasting is critical for:

  • Dynamic pricing (Tatkal surge pricing)
  • Train schedule optimization
  • Waiting list management
  • Festival season capacity planning

Technical Architecture

ComponentChoiceRationale
Model2-layer LSTM, d_h=256Festival patterns need 30+ day memory
Sequence length60 days look-backCaptures monthly + weekly seasonality
Features (8)Bookings, day-of-week, festival distance, monsoon flag, price, waitlist, cancellations, holidayDomain knowledge from CRIS engineers
OutputNext-day booking count (regression)Directly useful for capacity planning
Training data5 years of daily data (1,825 days)Multiple festival cycles needed
Loss functionHuber loss (robust to outlier days)Strike days, COVID lockdowns = outliers
OptimizationAdamW, lr=1e-3, cosine annealingSmooth convergence, prevent overfitting

Key Insight: The Festival Feature

The most important feature engineering decision was encoding "distance to next major festival" as a continuous feature (0 = festival day, increasing as you move away). This single feature improved accuracy by 12% because it lets the LSTM's forget gate learn: "when festival distance drops below 10, erase normal patterns and switch to festival mode."

Results

MetricARIMA BaselineVanilla RNNLSTM (Ours)
MAPE (non-festival days)8.2%5.7%3.1%
MAPE (festival days)22.4%14.8%5.9%
Training timeSeconds12 min35 min

The LSTM's ability to learn festival patterns from past data is the decisive advantage. The vanilla RNN struggles because festival effects span 3-4 weeks โ€” beyond its effective memory horizon.

Case Study โ€” Google Translate LSTM (US/Global Industry)

๐Ÿ‡บ๐Ÿ‡ธ Google Neural Machine Translation (GNMT) โ€” The LSTM That Translated the World

Historical Context

In September 2016, Google launched GNMT, replacing their phrase-based statistical MT system with an 8-layer LSTM encoder-decoder with attention. The improvement was dramatic: 55-85% reduction in translation errors for many language pairs. For Englishโ†’Chinese, the BLEU score jumped from 23.7 to 38.9.

Architecture Details

  • Encoder: 1 bidirectional LSTM layer + 7 unidirectional LSTM layers, d_h = 1024
  • Decoder: 8 unidirectional LSTM layers, d_h = 1024
  • Attention: Additive (Bahdanau-style) attention connecting encoder states to decoder
  • Residual connections: Between LSTM layers (inspired by ResNet)
  • Wordpiece tokenization: ~32K vocabulary to handle rare words
  • Total parameters: ~210 million
  • Training: 96 NVIDIA K80 GPUs for 6 days (~$50,000 in compute)

Why Google Eventually Moved to Transformers

Despite GNMT's success, three limitations drove the switch:

  1. Training speed: Sequential processing meant the 8-layer LSTM couldn't fully utilize GPU parallelism. Transformers processed all positions simultaneously.
  2. Maximum context: Even with attention, the LSTM encoder's representation quality degraded for sentences longer than ~80 words.
  3. Engineering complexity: The LSTM required careful initialization, gradient clipping, and curriculum learning. Transformers were more straightforward to scale.

By 2019, Google Translate fully transitioned to Transformer-based models. But GNMT remains a landmark โ€” it proved that neural networks could match (and exceed) decades of linguistic engineering in machine translation.

Section 13

Hands-On Lab / Mini-Project

Project: Character-Level Hindi Text Generator with LSTM

๐Ÿ”ฌ Lab: Build a Character-Level LSTM that Generates Hindi Text

Objective

Train an LSTM to generate Hindi text character by character, learning patterns like word boundaries (spaces), common character sequences (conjuncts), and basic grammatical structures.

Steps

  1. Data collection: Download Hindi text from Wikipedia or Project Gutenberg (minimum 100KB)
  2. Preprocessing: Build character vocabulary, create one-hot or embedding representations
  3. Model: 2-layer LSTM (d_h = 256), character embedding (dim = 32), softmax output over vocabulary
  4. Training: Cross-entropy loss, Adam optimizer, gradient clipping at 5.0, train for 50 epochs
  5. Generation: Seed with "เคจเคฎเคธเฅเคคเฅ‡ " and generate 200 characters using temperature sampling
  6. Analysis: Compare outputs at temperature = 0.5 (conservative) vs 1.0 (creative) vs 1.5 (chaotic)

Rubric (100 points)

CriterionPointsRequirements
Working LSTM implementation25Model trains without errors, loss decreases
Training pipeline15Proper batching, gradient clipping, learning rate scheduling
Text generation quality20Generated text contains valid Hindi words, proper use of spaces and punctuation
Temperature analysis15Compare 3 temperature values with examples and analysis
LSTM vs vanilla RNN comparison15Train both, compare loss curves and generation quality
Documentation & analysis10Clear writeup explaining design decisions and results

Bonus Challenges (+20 points)

  • Compare LSTM vs GRU (same hidden size): which generates better Hindi? (+10)
  • Visualize forget gate values over time for interesting patterns (+10)
Section 14

Exercises โ€” 22 Problems

Section A: Conceptual Questions (5)

A1 Beginner

Why can't a feedforward neural network handle sequential data effectively? List three specific reasons.

Answer: (1) No temporal memory โ€” each input is processed independently with no knowledge of previous inputs. (2) Variable-length sequences โ€” FFNs require fixed-size input, so you'd need a different network for each sequence length. (3) No parameter sharing across positions โ€” a weight learned for "word at position 3" doesn't generalize to "word at position 30." RNNs solve all three via the hidden state and weight sharing.
Understand
A2 Beginner

In the LSTM cell state update C_t = f_t โŠ™ C_{t-1} + i_t โŠ™ Cฬƒ_t, what happens when f_t = 1 and i_t = 0 for all dimensions? What about f_t = 0 and i_t = 1?

Answer: When f_t = 1, i_t = 0: C_t = C_{t-1} โ€” perfect memory preservation, nothing new is written. When f_t = 0, i_t = 1: C_t = Cฬƒ_t โ€” complete memory erasure, entirely new content. In practice, gates take intermediate values, enabling selective retention and updating.
Understand
A3 Intermediate

Explain why gradient clipping helps with exploding gradients but NOT with vanishing gradients. What architectural solution addresses vanishing gradients?

Answer: Gradient clipping sets a maximum norm: if โ€–gโ€– > threshold, scale g down. This prevents exploding gradients. But for vanishing gradients (โ€–gโ€– โ†’ 0), clipping can't make a near-zero gradient larger โ€” you can't "unclip" a vanished gradient. The solution is architectural: LSTM's cell state provides an additive update path where โˆ‚C_t/โˆ‚C_{t-1} = diag(f_t) โ‰ˆ 1, creating a gradient highway.
Analyze
A4 Intermediate

Why is the first layer of Google's GNMT encoder bidirectional while layers 2-8 are unidirectional? Would making all 8 layers bidirectional be better?

Answer: The first bidirectional layer captures both left and right context at the character/word level, enriching the initial representation. Layers 2-8 are unidirectional because: (1) bidirectional doubles the hidden state size, so 8 bidirectional layers would double parameters and memory; (2) higher layers already have access to bidirectional information from layer 1; (3) the residual connections propagate the bidirectional signal upward. Making all layers bidirectional would increase computation significantly with diminishing returns.
Analyze
A5 Advanced

A GRU constrains the update gate: the "forget" amount is z_t and the "input" amount is (1-z_t), so they always sum to 1. In LSTM, f_t and i_t are independent โ€” they can both be 0.5 simultaneously. What's the computational implication of this difference?

Answer: In LSTM, because f_t and i_t are independent, the cell state norm can grow or shrink over time (both can be large โ†’ growing, both small โ†’ shrinking). This gives LSTM more expressiveness but also more potential for instability. In GRU, because z_t + (1-z_t) = 1, the hidden state norm is bounded by interpolation โ€” it's always a convex combination of old and new, which is inherently more stable but less flexible. This explains why GRU is slightly easier to train but sometimes less powerful on complex tasks.
Evaluate

Section B: Mathematical Problems (8)

B1 Beginner

Compute the forward pass of a vanilla RNN with d_h = 2, d_x = 2 for ONE time step. Given: W_hh = [[0.5, 0], [0, 0.5]], W_xh = [[1, 0], [0, 1]], b_h = [0, 0], hโ‚€ = [0.5, -0.3], xโ‚ = [1, 0]. Find hโ‚.

Answer: zโ‚ = W_hh ยท hโ‚€ + W_xh ยท xโ‚ = [[0.5,0],[0,0.5]] ยท [0.5,-0.3] + [[1,0],[0,1]] ยท [1,0] = [0.25, -0.15] + [1, 0] = [1.25, -0.15]. hโ‚ = tanh([1.25, -0.15]) = [0.848, -0.149].
Apply
B2 Intermediate

An RNN has W_hh with singular values ฯƒโ‚ = 0.95, ฯƒโ‚‚ = 0.85. After T = 50 time steps, what is the maximum possible gradient ratio โ€–โˆ‚hโ‚…โ‚€/โˆ‚hโ‚โ€– / โ€–โˆ‚hโ‚…โ‚€/โˆ‚hโ‚…โ‚€โ€–? (Ignore the tanh derivative factor.)

Answer: โ€–โˆ‚hโ‚…โ‚€/โˆ‚hโ‚โ€– โ‰ค ฯƒ_max^(50-1) = 0.95^49 โ‰ˆ 0.0769. The gradient has shrunk to about 7.7% of its value. Even with the best-case singular value, the gradient is still significantly diminished. With tanh derivatives (all โ‰ค 1), the actual gradient would be even smaller.
Apply
B3 Intermediate

Calculate the total parameter count for: (a) A vanilla RNN with d_h = 128, d_x = 50, d_y = 10. (b) An LSTM with the same dimensions. (c) A GRU with the same dimensions. Include all biases.

Answer: (a) Vanilla RNN: W_hh(128ร—128) + W_xh(128ร—50) + b_h(128) + W_hy(10ร—128) + b_y(10) = 16,384 + 6,400 + 128 + 1,280 + 10 = 24,202. (b) LSTM: 4 gates ร— [128ร—(128+50) + 128] + W_hy + b_y = 4ร—[22,784 + 128] + 1,290 = 91,648 + 1,290 = 92,938. (c) GRU: 3 gates ร— [128ร—(128+50) + 128] + W_hy + b_y = 3ร—22,912 + 1,290 = 70,026.
Apply
B4 Intermediate

In BPTT for a 4-step sequence, write out the full expansion of โˆ‚Lโ‚„/โˆ‚W_hh (the gradient contribution from Lโ‚„ to W_hh). How many terms does it have?

Answer: โˆ‚Lโ‚„/โˆ‚W_hh = ฮฃ_{k=1}^{4} (โˆ‚Lโ‚„/โˆ‚hโ‚„) ยท (โˆ‚hโ‚„/โˆ‚h_k) ยท (โˆ‚โบh_k/โˆ‚W_hh). It has 4 terms: k=4 (direct), k=3 (one step back), k=2 (two steps back), k=1 (three steps back). The term for k=1 involves the product โˆ‚hโ‚„/โˆ‚hโ‚ = (โˆ‚hโ‚„/โˆ‚hโ‚ƒ)(โˆ‚hโ‚ƒ/โˆ‚hโ‚‚)(โˆ‚hโ‚‚/โˆ‚hโ‚) โ€” three Jacobian multiplications. This is where vanishing gradients occur.
Analyze
B5 Intermediate

Show that for a scalar RNN h_t = tanh(wยทh_{t-1} + uยทx_t), the gradient โˆ‚h_T/โˆ‚h_0 = โˆ_{t=1}^{T} (1 - h_tยฒ) ยท w. If w = 0.9 and h_t โ‰ˆ 0.5 for all t, what is โˆ‚hโ‚‚โ‚€/โˆ‚hโ‚€?

Answer: (1 - 0.5ยฒ) ร— 0.9 = 0.75 ร— 0.9 = 0.675 per step. โˆ‚hโ‚‚โ‚€/โˆ‚hโ‚€ = 0.675^20 = 0.675^20 โ‰ˆ 0.00054. The gradient has decayed to 0.054% โ€” effectively vanished after just 20 steps, even with w = 0.9 (close to 1).
Apply
B6 Advanced

Prove that the GRU update equation h_t = z_t โŠ™ h_{t-1} + (1-z_t) โŠ™ hฬƒ_t can be rewritten as h_t = h_{t-1} + (1-z_t) โŠ™ (hฬƒ_t - h_{t-1}). Interpret this as a "residual update."

Answer: h_t = z_tโŠ™h_{t-1} + (1-z_t)โŠ™hฬƒ_t = z_tโŠ™h_{t-1} + hฬƒ_t - z_tโŠ™hฬƒ_t = h_{t-1} - h_{t-1} + z_tโŠ™h_{t-1} + hฬƒ_t - z_tโŠ™hฬƒ_t = h_{t-1} + (hฬƒ_t - h_{t-1}) - z_tโŠ™(hฬƒ_t - h_{t-1}) = h_{t-1} + (1-z_t)โŠ™(hฬƒ_t - h_{t-1}). This is a residual update: the new state = old state + gated change. When z_t โ‰ˆ 1 (remember), the change is suppressed. This is analogous to ResNet's skip connections!
Analyze
B7 Advanced

For the LSTM, show that if f_t = 1 for all t, then โˆ‚C_T/โˆ‚C_0 = I (identity matrix). What does this imply about gradient flow?

Answer: โˆ‚C_t/โˆ‚C_{t-1} = diag(f_t). If f_t = 1 for all t, then โˆ‚C_t/โˆ‚C_{t-1} = I for every step. By chain rule: โˆ‚C_T/โˆ‚C_0 = โˆ_{t=1}^{T} I = I. This means gradients flow with NO decay at all โ€” perfect gradient preservation regardless of sequence length. This is the "gradient highway" property. In practice, f_t < 1, so there's some decay, but it's much slower than vanilla RNN's exponential decay.
Analyze
B8 Advanced

A bidirectional LSTM with d_h = 256 per direction and d_x = 100 feeds into a dense layer. What is the dimensionality of the concatenated hidden state at each time step? How many parameters does the bidirectional LSTM layer have (excluding the dense layer)?

Answer: Concatenated hidden state = 256 (forward) + 256 (backward) = 512 dimensions per time step. Parameters: Each direction has an LSTM with 4 ร— [256 ร— (256 + 100) + 256] = 4 ร— [91,136 + 256] = 4 ร— 91,392 = 365,568. Two directions: 2 ร— 365,568 = 731,136 parameters.
Apply

Section C: Coding Problems (4)

C1 Intermediate

Implement a GRU cell forward pass from scratch in NumPy. Test it with input_size=3, hidden_size=4 on a sequence of length 5.

def gru_cell(x_t, h_prev, params):
    Wr, Wz, Wh = params['Wr'], params['Wz'], params['Wh']
    br, bz, bh = params['br'], params['bz'], params['bh']
    concat = np.vstack([h_prev, x_t])
    r = sigmoid(Wr @ concat + br)         # reset gate
    z = sigmoid(Wz @ concat + bz)         # update gate
    concat_r = np.vstack([r * h_prev, x_t])
    h_tilde = np.tanh(Wh @ concat_r + bh)  # candidate
    h = z * h_prev + (1 - z) * h_tilde     # interpolate
    return h
Apply
C2 Intermediate

Using PyTorch, build a sentiment classifier with a bidirectional LSTM for IMDB reviews. Use nn.Embedding โ†’ nn.LSTM(bidirectional=True) โ†’ nn.Linear. Report accuracy on the test set.

Hint: Use torchtext or datasets library to load IMDB. Vocabulary size ~25K, embedding dim=100, LSTM hidden=128, bidirectional=True. Concatenate final forward and backward hidden states. Expect ~87-89% accuracy.
Apply
C3 Advanced

Implement gradient clipping from scratch: (a) clip-by-value (element-wise clipping to [-c, c]), and (b) clip-by-norm (scale gradient if โ€–gโ€– > max_norm). Apply both to a training run and compare loss curves.

# Clip by value
def clip_by_value(grad, c=5.0):
    return np.clip(grad, -c, c)

# Clip by norm (better method)
def clip_by_norm(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

Clip-by-norm preserves gradient direction while limiting magnitude โ€” generally preferred.

Apply
C4 Advanced

Visualize the vanishing gradient problem: train a vanilla RNN on sequences of length 50, and plot โ€–โˆ‚L/โˆ‚h_tโ€– for t = 1, 2, ..., 50. Then do the same with an LSTM. Overlay both plots and analyze.

Expected result: The vanilla RNN gradient plot shows exponential decay from t=50 to t=1 (near-zero for t < 20). The LSTM gradient plot shows much slower decay, with meaningful gradients surviving to t=1. The ratio at t=1 is typically 100-1000ร— larger for LSTM vs vanilla RNN.
Analyze

Section D: Critical Thinking (3)

D1 Advanced

A startup claims their "memory-augmented vanilla RNN" beats LSTM by simply initializing W_hh as an orthogonal matrix (all singular values = 1). Evaluate this claim: would this solve vanishing gradients? What problems might arise?

Analysis: An orthogonal W_hh ensures ฯƒ_max = ฯƒ_min = 1, so the gradient factor โ€–W_hhโ€–^(t-k) = 1 for all distances โ€” no vanishing or exploding from the weight matrix. However, the tanh derivative factor diag(1-h_tยฒ) still causes gradients to shrink (since |tanh'| โ‰ค 1). Additionally, maintaining orthogonality during training is non-trivial โ€” standard SGD won't preserve it. You'd need specialized optimizers or reparameterization. Finally, orthogonal initialization helps at the START of training but doesn't guarantee long-term stability. LSTMs solve the problem more robustly via the cell state highway.
Evaluate
D2 Advanced

You're building a real-time speech recognition system for Indian languages (22 scheduled languages). Should you use bidirectional LSTMs? What about for a chatbot that generates responses? Justify each decision.

Speech recognition (real-time, streaming): Bidirectional is tempting (future context helps disambiguate), but real-time systems can't wait for the full utterance. Use a unidirectional LSTM with a small look-ahead window (e.g., 200ms of future audio). Alternatively, use a "chunked bidirectional" approach โ€” bidirectional within small chunks. Chatbot response generation: Definitely unidirectional. You generate tokens left-to-right; future tokens don't exist yet. However, the encoder for the user's input CAN be bidirectional (since the entire input is available). So: bidirectional encoder + unidirectional decoder.
Evaluate
D3 Advanced

"RNNs are dead. Transformers have replaced them for every task." Argue both for and against this statement with specific technical evidence.

For (RNNs are dead): Transformers dominate NLP (BERT, GPT), vision (ViT), and even time-series (Temporal Fusion Transformer). Self-attention is more parallelizable, handles long-range dependencies without recurrence, and scales better with compute. Against (RNNs still alive): (1) On-device deployment: RNNs/LSTMs have constant memory during inference (O(1) per step), while Transformers need O(nยฒ) for self-attention. For wake-word detection on Alexa, LSTM is more efficient. (2) Streaming applications: RNNs naturally process one token at a time; Transformers need the full sequence. (3) State-space models (S4, Mamba) in 2023-2024 are linear-time recurrent models that outperform Transformers on some benchmarks โ€” the recurrent paradigm is evolving, not dying. (4) Small data: On small datasets (< 10K samples), LSTMs often beat Transformers due to fewer parameters and built-in inductive bias for sequence processing.
Evaluate

โ˜… Starred Research Problems (2)

โ˜…1 Advanced โ€” Research

Read the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (Gu & Dao, 2023). How does Mamba achieve the benefits of both RNNs (linear-time inference) and Transformers (parallel training)? What is the "selective state space" mechanism and how does it relate to LSTM gating?

Key insight: Mamba uses a state-space model (SSM) with input-dependent (selective) parameters, similar to how LSTM gates are input-dependent. During training, the recurrence is converted to a convolution (parallelizable). During inference, it runs as a recurrence (O(1) per step). The "selection mechanism" in Mamba is conceptually similar to LSTM's forget and input gates โ€” both decide what information to retain based on the current input.
CreateResearch
โ˜…2 Advanced โ€” Research

Design an experiment to empirically measure the "effective memory length" of vanilla RNNs vs. LSTMs vs. GRUs. Create a synthetic task where the correct output at time T depends on an input at time T-k, and measure accuracy as a function of k. What patterns do you expect?

Experimental design: The "copy task" or "adding problem" are standard benchmarks. For the copy task: input a sequence of symbols, wait k blank steps, then output the original sequence. Measure accuracy vs. k for each architecture. Expected: Vanilla RNN fails sharply around k = 10-20. LSTM maintains high accuracy until k โ‰ˆ 100-300 (depending on hidden size). GRU performs similarly to LSTM but may degrade slightly earlier for very large k. This directly demonstrates the vanishing gradient problem and LSTM's solution.
CreateResearch
Section 15.a

Connections Map

๐Ÿ”— How This Chapter Connects

โ† Builds On:
  • Chapter 6 (Backpropagation): BPTT is backprop applied to an unrolled computational graph
  • Chapter 10 (Deep Networks & Optimization): Gradient flow issues, skip connections, optimization techniques
  • Chapter 12 (CNNs): Weight sharing concept (spatial in CNNs, temporal in RNNs)
โ†’ Enables:
  • Chapter 15 (Transformers): Attention mechanisms were invented to solve RNN limitations. Understanding WHY RNNs fail motivates self-attention.
  • Chapter 18 (Applied NLP): LSTMs underlie many NLP systems still in production
  • Chapter 20 (Time Series): LSTM remains a strong baseline for temporal forecasting
Research Frontier:
  • State-Space Models (Mamba, S4): Linear-time recurrence that rivals Transformers (2023-2024)
  • xLSTM (Beck et al., 2024): Extended LSTM with exponential gating and matrix memory โ€” LSTM is evolving, not dying
  • RWKV: An "RNN with Transformer-level performance" that processes sequences in linear time
Industry Implementation:
  • IRCTC/CRIS: LSTM for demand forecasting (India)
  • Google Translate (2016-2019): 8-layer LSTM encoder-decoder (US)
  • Amazon Alexa: GRU for wake-word detection on-device
  • Siri (Apple): LSTM for on-device speech recognition
Section 15

Chapter Summary

๐Ÿ“‹ Key Takeaways

  1. Sequential data needs memory. Feedforward networks process inputs independently; RNNs carry a hidden state h_t that summarizes all past inputs, enabling temporal reasoning.
  2. Vanilla RNNs share weights across time: h_t = tanh(W_hh ยท h_{t-1} + W_xh ยท x_t + b). Same W_hh, W_xh at every step โ€” enabling variable-length processing with fixed parameters.
  3. BPTT unrolls the RNN through time and backpropagates through the entire computational graph. The gradient โˆ‚L/โˆ‚W_hh involves a product of Jacobians across all time steps.
  4. Vanishing gradients are inevitable in vanilla RNNs: โ€–โˆ‚h_t/โˆ‚h_kโ€– โ‰ค ฯƒ_max(W_hh)^(t-k) โ†’ 0 exponentially. This limits effective memory to ~20 steps.
  5. LSTM solves this with four gates (forget, input, candidate, output) and a cell state highway: C_t = f_t โŠ™ C_{t-1} + i_t โŠ™ Cฬƒ_t. The additive update means โˆ‚C_t/โˆ‚C_{t-1} = diag(f_t) โ‰ˆ 1 โ€” gradients flow unimpeded.
  6. GRU simplifies LSTM to two gates (update, reset) with a coupled forget/input mechanism: z_t + (1-z_t) = 1. 25% fewer parameters, comparable performance on many tasks.
  7. Bidirectional RNNs read sequences both forward and backward โ€” powerful for tasks where the full sequence is available (NER, classification). Never use for generation or forecasting.
The Key Equation:

LSTM Cell State Update: C_t = f_t โŠ™ C_{t-1} + i_t โŠ™ Cฬƒ_t

This single equation is the heart of the LSTM โ€” additive updates create a gradient highway
that enables learning dependencies across hundreds of time steps.
The Key Intuition: The vanilla RNN's gradient must pass through tanh and W_hh at every step โ€” like a message whispered through a chain of people, getting garbled at each link. The LSTM's cell state is like a written note passed along the chain โ€” it arrives intact because the forget gate chooses whether to erase (โ‰ˆ 0) or preserve (โ‰ˆ 1), rather than multiplying by a potentially shrinking matrix.
Section 16

Further Reading

๐Ÿ‡ฎ๐Ÿ‡ณ Indian Resources

  • NPTEL: "Deep Learning" by Prof. Mitesh Khapra (IIT Madras) โ€” Lectures 25-30 on RNNs and LSTMs, with excellent Hindi/English bilingual examples
  • NPTEL: "Introduction to Machine Learning" by Prof. Sudeshna Sarkar (IIT Kharagpur) โ€” Foundational sequence modeling
  • GATE Previous Year Papers (CS/DA): ML section โ€” RNN/LSTM questions appear regularly since GATE 2022
  • CRIS Technical Reports: Railway demand forecasting methodologies used by Indian Railways

๐ŸŒ Global Resources

  • Original Papers:
    • Hochreiter & Schmidhuber, "Long Short-Term Memory" (1997) โ€” The LSTM paper
    • Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder" (2014) โ€” GRU introduction
    • Pascanu et al., "On the difficulty of training Recurrent Neural Networks" (2013) โ€” Vanishing/exploding gradient analysis
    • Wu et al., "Google's Neural Machine Translation System" (2016) โ€” GNMT architecture
  • Interactive:
    • Distill.pub โ€” "Understanding LSTM Networks" by Chris Olah (the gold standard visual explanation)
    • 3Blue1Brown โ€” "Recurrent Neural Networks" (visual intuition)
    • Andrej Karpathy, "The Unreasonable Effectiveness of Recurrent Neural Networks" (2015 blog post)
  • Textbooks:
    • Goodfellow, Bengio & Courville, "Deep Learning" โ€” Chapter 10: Sequence Modeling
    • Jurafsky & Martin, "Speech and Language Processing" โ€” RNN chapters
  • Cutting Edge (2023-2025):
    • Gu & Dao, "Mamba: Linear-Time Sequence Modeling" (2023) โ€” State-space models
    • Beck et al., "xLSTM: Extended Long Short-Term Memory" (2024) โ€” LSTM revival
    • Peng et al., "RWKV: Reinventing RNNs for the Transformer Era" (2023)
The LSTM Renaissance (2024-2025): After years of Transformer dominance, recurrent architectures are making a comeback. xLSTM (Beck et al., 2024) introduces exponential gating and matrix memory to modernize the LSTM. Mamba (Gu & Dao, 2023) uses selective state spaces to achieve Transformer-quality results with linear-time inference. RWKV combines the training parallelism of Transformers with the inference efficiency of RNNs. The core insight from 1997 โ€” gated memory for gradient flow โ€” remains as relevant as ever, just in new mathematical clothing.