📘 PART VII: DEEP LEARNING

Recurrent Neural Networks
& LSTMs

Teaching machines to understand sequences, from Shakespeare sonnets to stock tickers, by giving neural networks memory.

📅 Chapter 19 | ⏱️ 4.5 hours reading | 📋 Prerequisites: Ch 12 (Neural Networks) | 🔬 Difficulty: Intermediate to Advanced

Learning Objectives

1
Why Sequences Matter

Understand why feedforward networks fail on sequential data and how time dependencies arise.

2
Vanilla RNN Architecture

Derive the RNN recurrence relation h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b_h).

3
Backpropagation Through Time

Derive BPTT from first principles and understand its computational cost.

4
Vanishing & Exploding Gradients

Prove mathematically why gradients vanish/explode via repeated matrix multiplication.

5
LSTM Architecture

Master the four LSTM components (forget gate, input gate, candidate, output gate) with complete equations.

6
GRU: Simplified LSTM

Understand the update and reset gates and when GRU is preferred.

7
Bidirectional & Deep RNNs

Learn to process sequences in both directions and stack multiple layers.

8
Sequence-to-Sequence Models

Build encoder-decoder architectures for tasks like machine translation.

9
Attention Mechanism Preview

Get a preview of how attention solves the information bottleneck in seq2seq.

10
Real-World Applications

Implement NLP, time series forecasting, and generative models using RNNs/LSTMs.

Introduction

Consider reading this sentence. You do not understand each word in isolation: the meaning of "bank" changes depending on whether the preceding words were "river" or "investment." Your brain processes text sequentially, carrying a running summary of what came before. Feedforward neural networks (Chapter 12) cannot do this: each input is processed independently, with no memory of the past.

Recurrent Neural Networks (RNNs) solve this by introducing loops in the network, allowing information to persist from one time step to the next. They were among the first neural architectures capable of handling variable-length sequences (sentences, audio waveforms, stock prices, DNA sequences) by maintaining a hidden state that encodes a compressed history of everything seen so far.

However, vanilla RNNs have a critical flaw: they struggle to learn long-range dependencies. The meaning of a word at position 1 might be crucial for understanding word 50, but the gradient signal carrying this information vanishes exponentially during training. Long Short-Term Memory (LSTM) networks, invented by Hochreiter & Schmidhuber in 1997, solve this with a gating mechanism that controls what information to remember, what to forget, and what to output.

In this chapter, we will build RNNs and LSTMs from first principles: starting with the math, proving why gradients vanish, deriving every LSTM gate equation, and then implementing everything in both raw NumPy and TensorFlow. Along the way, we will forecast Nifty50 stock prices, generate Hindi text, and understand the systems behind Google Translate and voice assistants.

🎓 Professor's Insight

RNNs are foundational. Even though Transformers (Chapter 22) have replaced them in many NLP tasks, understanding RNNs is essential because: (a) LSTMs remain widely used for time-series forecasting, (b) the concepts of hidden states and gating appear throughout modern deep learning, and (c) many interview questions still test RNN knowledge.

Historical Background

Timeline of Sequence Modeling

Year | Milestone | Significance
1982 | Hopfield Networks | John Hopfield introduces associative memory networks with recurrence
1986 | Simple RNN (Jordan) | Michael Jordan proposes recurrent connections for sequence processing
1990 | Elman Network | Jeffrey Elman introduces the "simple recurrent network" with hidden-to-hidden connections
1990 | BPTT Formalized | Werbos formalizes backpropagation through time
1991 | Vanishing Gradient | Hochreiter identifies the vanishing gradient problem in his diploma thesis
1997 | LSTM Invented | Hochreiter & Schmidhuber publish LSTM, the breakthrough
2000 | Forget Gate Added | Gers, Schmidhuber & Cummins add the forget gate to LSTM
2005 | Bidirectional LSTM | Graves & Schmidhuber apply BiLSTMs to phoneme classification
2014 | GRU Proposed | Cho et al. introduce the Gated Recurrent Unit, a simpler alternative
2014 | Seq2Seq | Sutskever, Vinyals & Le introduce encoder-decoder for machine translation
2015 | Attention Mechanism | Bahdanau et al. add attention to seq2seq, removing the bottleneck
2017 | Transformer | Vaswani et al. replace recurrence entirely with self-attention

🇮🇳 India Spotlight

Indian railways (IRCTC) began experimenting with LSTM-based models for ticket demand forecasting around 2018, aiming to optimize dynamic pricing on Rajdhani and Shatabdi routes. Indian fintech firms like Zerodha and Groww also use LSTM models for real-time stock signal generation on BSE/NSE data.

Conceptual Explanation

4.1 Why Sequences Need Special Networks

Consider three challenges that feedforward networks cannot handle:

  • Variable length: Sentences have different lengths. A feedforward net needs a fixed-size input.
  • Order matters: "Dog bites man" ≠ "Man bites dog", but a bag-of-words feedforward net treats them identically.
  • Long-range dependencies: In "The cat, which sat on the mat and played with the yarn, was happy", the verb "was" must agree with "cat" many words earlier.
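
The order problem in the second bullet is easy to check directly. The sketch below uses a hypothetical helper, `bag_of_words` (not part of this chapter's code), to show that both sentences collapse to the same word counts once order is discarded.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count lowercase tokens; word order is discarded."""
    return Counter(sentence.lower().split())

a = bag_of_words("Dog bites man")
b = bag_of_words("Man bites dog")

# Both sentences produce identical counts, so any model that only
# sees these counts cannot tell who bit whom.
same_bag = (a == b)
print(same_bag)
```

A model that reads tokens one at a time, as an RNN does, receives the two sentences as different input sequences and can therefore distinguish them.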

4.2 The Core Idea: Hidden State as Memory

An RNN processes one element of the sequence at a time. At each time step t, it receives input x_t and combines it with the hidden state from the previous time step h_{t-1} to produce a new hidden state h_t. Think of h_t as a "compressed summary" of everything the network has seen from time step 1 through time step t.

4.3 Vanilla RNN

The simplest RNN uses a single recurrence equation:

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
y_t = W_hy · h_t + b_y

Where: W_hh = hidden-to-hidden weights, W_xh = input-to-hidden weights, W_hy = hidden-to-output weights. The same weights are shared across all time steps. This is called weight sharing, and it is what makes RNNs parameter-efficient: the parameter count does not grow with sequence length.
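
As a minimal sketch of weight sharing, the snippet below applies the recurrence to a short and a long sequence with the same three parameter arrays. The function name `rnn_forward`, the dimensions, and the random values are illustrative assumptions; 1-D arrays are used instead of column vectors for brevity.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Apply h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b_h) over a sequence."""
    h = np.zeros(W_hh.shape[0])          # h_0 = 0
    for x_t in xs:                       # the same weights are reused at every step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
    return h

rng = np.random.default_rng(0)
n, d = 4, 3                              # hidden and input sizes (assumed)
W_xh = rng.normal(scale=0.5, size=(n, d))
W_hh = rng.normal(scale=0.5, size=(n, n))
b_h = np.zeros(n)

short_seq = rng.normal(size=(2, d))      # 2 time steps
long_seq = rng.normal(size=(10, d))      # 10 time steps

h_short = rnn_forward(short_seq, W_xh, W_hh, b_h)
h_long = rnn_forward(long_seq, W_xh, W_hh, b_h)
n_params = W_xh.size + W_hh.size + b_h.size

print(h_short.shape, h_long.shape, n_params)
```

Both calls return a hidden state of the same size, and `n_params` depends only on n and d, not on the number of time steps.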

4.4 The Vanishing Gradient Problem

During backpropagation through time (BPTT), gradients must flow backward through many time steps. Each step multiplies the gradient by the weight matrix W_hh and by the derivative of tanh. Since |tanh'(x)| ≤ 1, if the largest singular value of W_hh is also below 1, every step shrinks the gradient and it decays exponentially with the number of steps → vanishing gradient. Conversely, if the largest singular value of W_hh exceeds 1 and the tanh units are not saturated, gradients can grow exponentially → exploding gradient.

4.5 LSTM: The Solution

LSTM introduces a separate cell state C_t that runs like a conveyor belt through time, with gates that control information flow:

  • Forget gate (f_t): Decides what to erase from cell state
  • Input gate (i_t): Decides what new info to write
  • Candidate (C̃_t): Creates candidate values to add (a tanh layer, often counted as the fourth "gate")
  • Output gate (o_t): Decides what to output from cell state

4.6 GRU: Simplified LSTM

The Gated Recurrent Unit merges the forget and input gates into a single update gate, and uses a reset gate to control how much past information to forget. It has fewer parameters and trains faster, with comparable performance on many tasks.

4.7 Bidirectional RNNs

In many tasks (like named entity recognition), the meaning of a word depends on both preceding and following context. A Bidirectional RNN runs two separate RNNs โ€” one forward, one backward โ€” and concatenates their hidden states.
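
A minimal sketch of the idea, assuming two independent tanh RNNs without biases: the helper names `run_rnn` and `bidirectional_states`, the sizes, and the random weights are all illustrative, not part of the chapter's implementation.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh):
    """Return every hidden state of a tanh RNN (bias omitted for brevity)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        states.append(h)
    return states

def bidirectional_states(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states at each position."""
    fwd = run_rnn(xs, *fwd_params)
    # Run the second RNN on the reversed sequence, then flip its
    # states back so index t lines up with input x_t.
    bwd = run_rnn(xs[::-1], *bwd_params)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
n, d, T = 3, 2, 5
fwd_params = (rng.normal(scale=0.5, size=(n, d)), rng.normal(scale=0.5, size=(n, n)))
bwd_params = (rng.normal(scale=0.5, size=(n, d)), rng.normal(scale=0.5, size=(n, n)))
xs = rng.normal(size=(T, d))

states = bidirectional_states(xs, fwd_params, bwd_params)
print(len(states), states[0].shape)
```

Each position gets a vector of size 2n. The first n entries summarize the inputs up to that position; the last n entries summarize the inputs from that position to the end.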

4.8 Sequence-to-Sequence (Seq2Seq)

For tasks where input and output sequences have different lengths (e.g., translation from English to Hindi), we use an encoder-decoder architecture: the encoder RNN compresses the input into a fixed-size context vector, and the decoder RNN generates the output sequence from this vector.
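
The sketch below illustrates only the data flow, with untrained random weights, so the generated token ids are meaningless. The names `encode`, `greedy_decode`, and `W_yh` (a matrix that feeds the previous output token back into the decoder) are assumptions made for this example.

```python
import numpy as np

def encode(xs, W_xh, W_hh):
    """Compress a sequence of any length into one fixed-size vector."""
    h = np.zeros(W_hh.shape[0])
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
    return h                                  # the context vector

def greedy_decode(context, W_hh, W_yh, W_hy, max_len):
    """Generate token ids one at a time, starting from the context vector."""
    vocab = W_hy.shape[0]
    h = context
    prev = np.zeros(vocab)                    # stands in for a start token
    tokens = []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_yh @ prev)
        token = int(np.argmax(W_hy @ h))      # pick the highest-scoring token
        tokens.append(token)
        prev = np.zeros(vocab)
        prev[token] = 1.0                     # feed the choice back as one-hot
    return tokens

rng = np.random.default_rng(0)
n, d, vocab = 4, 3, 6
enc = (rng.normal(scale=0.5, size=(n, d)), rng.normal(scale=0.5, size=(n, n)))
dec_W_hh = rng.normal(scale=0.5, size=(n, n))
dec_W_yh = rng.normal(scale=0.5, size=(n, vocab))
dec_W_hy = rng.normal(scale=0.5, size=(vocab, n))

ctx_short = encode(rng.normal(size=(3, d)), *enc)
ctx_long = encode(rng.normal(size=(9, d)), *enc)
tokens = greedy_decode(ctx_short, dec_W_hh, dec_W_yh, dec_W_hy, max_len=4)
print(ctx_short.shape, ctx_long.shape, tokens)
```

A 3-step input and a 9-step input both end up as a vector of size n. This fixed size is the bottleneck that attention later addresses.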

⚡ Industry Alert

While Transformers have largely replaced RNNs in NLP (since about 2020), LSTMs remain the go-to choice for many time-series applications (financial forecasting, IoT sensor prediction, weather modeling, and demand estimation) due to lower data requirements and faster inference on edge devices.

Mathematical Foundation

5.1 Vanilla RNN Equations

Given an input sequence (x_1, x_2, ..., x_T) where x_t ∈ ℝ^d, hidden state h_t ∈ ℝ^n:

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)

W_hh ∈ ℝ^{n×n}, W_xh ∈ ℝ^{n×d}, b_h ∈ ℝ^n

ŷ_t = softmax(W_hy · h_t + b_y)

W_hy ∈ ℝ^{k×n}, b_y ∈ ℝ^k (k = output classes). Here ŷ_t is the predicted class distribution; the pre-softmax scores W_hy · h_t + b_y are the logits.

5.2 LSTM Full Equations

Let [h_{t-1}, x_t] denote the concatenation of previous hidden state and current input. The LSTM computes:

Forget Gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

Input Gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

Candidate: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

Cell State: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t

Output Gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

Hidden State: h_t = o_t ⊙ tanh(C_t)

Where σ is the sigmoid function and ⊙ is the element-wise (Hadamard) product.

5.3 GRU Equations

Update Gate: z_t = σ(W_z · [h_{t-1}, x_t] + b_z)

Reset Gate: r_t = σ(W_r · [h_{t-1}, x_t] + b_r)

Candidate: h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)

Hidden State: h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

5.4 Parameter Count Comparison

Architecture | Parameters (hidden=n, input=d) | Example (n=128, d=64)
Vanilla RNN | n² + n·d + n (+ output) | 24,704
LSTM | 4 × (n² + n·d + n) | 98,816
GRU | 3 × (n² + n·d + n) | 74,112
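
The formulas in the table can be evaluated with a small helper. `recurrent_params` is a hypothetical function written for this check; it counts only the recurrent layer (no output layer), following the table's formulas.

```python
def recurrent_params(n, d, blocks):
    """Parameters in `blocks` weight groups of size n*n + n*d + n each."""
    return blocks * (n * n + n * d + n)

n, d = 128, 64
counts = {
    "Vanilla RNN": recurrent_params(n, d, blocks=1),
    "GRU": recurrent_params(n, d, blocks=3),
    "LSTM": recurrent_params(n, d, blocks=4),
}
for name, count in counts.items():
    print(f"{name}: {count:,}")
```

Counts reported by deep learning frameworks can differ slightly from these figures, because some implementations keep additional bias vectors per gate.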

5.5 Bidirectional RNN

Forward: h_t→ = RNN(x_t, h_{t-1}→)
Backward: h_t← = RNN(x_t, h_{t+1}←)
Combined: h_t = [h_t→ ; h_t←] ∈ ℝ^{2n}

📝 Exam Tip

GATE/NET exams frequently ask you to calculate the parameter count of an LSTM layer. Remember: LSTM has 4× the parameters of a vanilla RNN because it has 4 weight matrices (forget, input, candidate, output), each of the same size as the vanilla RNN's single weight matrix.

Formula Derivations

6.1 Deriving Backpropagation Through Time (BPTT)

We derive BPTT from first principles. The total loss across all T time steps is:

L = Σ_{t=1}^{T} L_t(y_t, ŷ_t)

The hidden state at time t is:

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)

To compute ∂L/∂W_hh, we apply the chain rule. Because W_hh is reused at every step, it influences L_t through every hidden state h_1, ..., h_t, and each of those paths contributes a term:

∂L/∂W_hh = Σ_{t=1}^{T} ∂L_t/∂W_hh

∂L_t/∂W_hh = Σ_{k=1}^{t} (∂L_t/∂h_t) · (∂h_t/∂h_k) · (∂h_k/∂W_hh)

In the last factor, ∂h_k/∂W_hh is the immediate partial derivative, taken with h_{k-1} held constant.

The key term is the Jacobian product:

∂h_t/∂h_k = Π_{i=k+1}^{t} ∂h_i/∂h_{i-1}

where ∂h_i/∂h_{i-1} = diag(1 - h_i²) · W_hh

Here, diag(1 - h_i²) is the diagonal matrix of tanh derivatives. This product of matrices is the source of the vanishing/exploding gradient.
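
The one-step Jacobian formula can be checked numerically. The sketch below compares diag(1 - h_t²) · W_hh against central finite differences; the sizes, random values, and the helper name `step` are assumptions made for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 2
W_hh = rng.normal(scale=0.5, size=(n, n))
W_xh = rng.normal(scale=0.5, size=(n, d))
b_h = rng.normal(scale=0.1, size=n)
x_t = rng.normal(size=d)
h_prev = rng.normal(scale=0.5, size=n)

def step(h):
    """One RNN step as a function of the previous hidden state only."""
    return np.tanh(W_hh @ h + W_xh @ x_t + b_h)

h_t = step(h_prev)
analytic = np.diag(1.0 - h_t ** 2) @ W_hh     # formula from the derivation

# Central finite differences, one input coordinate at a time.
eps = 1e-6
numeric = np.zeros((n, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    numeric[:, j] = (step(h_prev + e) - step(h_prev - e)) / (2 * eps)

max_err = np.max(np.abs(analytic - numeric))
print(f"largest difference: {max_err:.2e}")
```

The two matrices should agree up to finite-difference error, which supports the column-vector form of the Jacobian used here.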

6.2 Proving the Vanishing Gradient

Theorem: For a vanilla RNN with hidden state dimension n, if the spectral norm satisfies ‖W_hh‖ < 1, then ‖∂h_t/∂h_k‖ decays exponentially as (t-k) increases.

Proof:

‖∂h_t/∂h_k‖ = ‖Π_{i=k+1}^{t} diag(1 - h_i²) · W_hh‖

≤ Π_{i=k+1}^{t} ‖diag(1 - h_i²)‖ · ‖W_hh‖

Since |tanh'(x)| = |1 - tanh²(x)| ≤ 1, we have ‖diag(1 - h_i²)‖ ≤ 1

Therefore: ‖∂h_t/∂h_k‖ ≤ ‖W_hh‖^{t-k}

If ‖W_hh‖ < 1 → ‖∂h_t/∂h_k‖ → 0 as (t-k) → ∞ (vanishing) □

Remark: ‖W_hh‖ > 1 is necessary for exploding gradients but not sufficient. The inequality above is only an upper bound, and saturated tanh units (1 - h_i² ≈ 0) can still shrink the product. When the gradient does explode, it grows at most like ‖W_hh‖^{t-k}.
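
A small numerical experiment can illustrate the bound. This is a sketch under assumed settings: a random W_hh rescaled to spectral norm 0.9, random inputs, and 30 time steps. It accumulates the Jacobian product ∂h_t/∂h_0 and compares its norm with 0.9^t.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 8, 30
W = rng.normal(size=(n, n))
W_hh = 0.9 * W / np.linalg.norm(W, 2)     # rescale so the spectral norm is 0.9
W_xh = rng.normal(scale=0.5, size=(n, n))

h = np.zeros(n)
J = np.eye(n)                             # running product dh_t/dh_0
norms, bounds = [], []
for t in range(1, T + 1):
    h = np.tanh(W_hh @ h + W_xh @ rng.normal(size=n))
    J = np.diag(1.0 - h ** 2) @ W_hh @ J  # multiply in the newest one-step Jacobian
    norms.append(np.linalg.norm(J, 2))
    bounds.append(0.9 ** t)

print(f"t=1:  norm={norms[0]:.4f}  bound={bounds[0]:.4f}")
print(f"t={T}: norm={norms[-1]:.2e}  bound={bounds[-1]:.2e}")
```

In this setup the measured norm stays below the bound at every step, and it is usually far below it because the tanh derivatives are strictly less than 1 whenever the hidden units are active.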

6.3 Why LSTM Solves Vanishing Gradients

In LSTM, along the direct cell-state path (ignoring the indirect dependence of the gates on h_{t-1}), the gradient flows through:

∂C_t/∂C_{t-1} = f_t (element-wise)

∂C_t/∂C_k = Π_{i=k+1}^{t} f_i (element-wise product)

Crucially, the forget gate f_t ∈ (0,1) is learned. When the network learns to set f_t ≈ 1, gradients flow along this path with almost no decay. This is the "constant error carousel": the cell state acts as a highway for gradient flow. Unlike a vanilla RNN, where gradients must pass through tanh and the same weight matrix at every step, LSTM gradients pass through learned gate values that can be close to 1.
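
To get a feel for the numbers, the sketch below assumes a constant forget-gate value over 100 steps (a simplification; real gates vary per step and per unit) and computes the resulting factor Π f_i on the cell-state path.

```python
steps = 100

# Gradient scaling factor along the cell-state path if the forget
# gate held the same value f at every one of the 100 steps.
retained = {f: f ** steps for f in (0.5, 0.9, 0.99, 1.0)}

for f, value in retained.items():
    print(f"forget gate {f:<4} -> factor after {steps} steps: {value:.3e}")
```

A gate value of 0.99 still passes roughly a third of the gradient after 100 steps, while 0.5 leaves essentially nothing. This is why a forget gate near 1 matters for long-range dependencies.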

6.4 Deriving GRU from LSTM

GRU simplifies LSTM by:

  1. Merging forget gate and input gate into one update gate: z_t (where input = z_t, forget = 1-z_t)
  2. Removing the separate cell state, so the hidden state serves both roles
  3. Adding a reset gate to control past hidden state influence on candidates

LSTM: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t (f_t and i_t independent)
GRU: h_t = (1-z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t (coupled via z_t)
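
The coupling can be confirmed numerically: plugging f = 1 - z and i = z into the LSTM-style update reproduces the GRU update. The function names and random values below are illustrative assumptions.

```python
import numpy as np

def lstm_style_update(prev, cand, f, i):
    """LSTM cell update with independent forget (f) and input (i) gates."""
    return f * prev + i * cand

def gru_update(h_prev, h_cand, z):
    """GRU update: a single gate z sets both how much to keep and how much to write."""
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
h_prev = rng.uniform(-1, 1, size=4)
h_cand = rng.uniform(-1, 1, size=4)
z = rng.uniform(0, 1, size=4)

h_gru = gru_update(h_prev, h_cand, z)
h_coupled = lstm_style_update(h_prev, h_cand, f=1.0 - z, i=z)
print(np.allclose(h_gru, h_coupled))

# Because the two weights sum to 1, each coordinate of the new state
# lies between the old state and the candidate.
lo = np.minimum(h_prev, h_cand)
hi = np.maximum(h_prev, h_cand)
inside = bool(np.all((h_gru >= lo) & (h_gru <= hi)))
print(inside)
```

The interpolation property is one consequence of the coupling: a GRU cannot both keep all of the old state and add all of the candidate, whereas an LSTM with independent gates can.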
🎓 Professor's Insight

The "constant error carousel" is the key insight of LSTM. It's analogous to ResNet's skip connections (Chapter 18). Both solve vanishing gradients by providing a shortcut path for gradient flow. In LSTM, this path is the cell state; in ResNet, it's the identity mapping. This deep connection shows up frequently in research interviews.

Worked Numerical Examples

Example 1: RNN Forward Pass (3 Time Steps)

Setup

Input dimension d=2, hidden dimension n=2. Inputs: x₁=[1,0], x₂=[0,1], x₃=[1,1]. Initial h₀=[0,0].

Weights (simplified):

W_xh = [[0.5, 0.3], [0.1, 0.4]]
W_hh = [[0.2, 0.1], [0.3, 0.2]]
b_h = [0, 0]

Time Step 1: x₁ = [1, 0], h₀ = [0, 0]

z₁ = W_hh · h₀ + W_xh · x₁ + b_h
= [0, 0] + [0.5·1 + 0.3·0, 0.1·1 + 0.4·0] + [0, 0]
= [0.5, 0.1]

h₁ = tanh([0.5, 0.1]) = [0.4621, 0.0997]

Time Step 2: x₂ = [0, 1], h₁ = [0.4621, 0.0997]

z₂ = W_hh · h₁ + W_xh · x₂ + b_h
W_hh · h₁ = [0.2·0.4621 + 0.1·0.0997, 0.3·0.4621 + 0.2·0.0997]
= [0.1024, 0.1586]
W_xh · x₂ = [0.5·0 + 0.3·1, 0.1·0 + 0.4·1] = [0.3, 0.4]
z₂ = [0.4024, 0.5586]

h₂ = tanh([0.4024, 0.5586]) = [0.3820, 0.5069]

Time Step 3: x₃ = [1, 1], h₂ = [0.3820, 0.5069]

W_hh · h₂ = [0.2·0.3820 + 0.1·0.5069, 0.3·0.3820 + 0.2·0.5069]
= [0.1271, 0.2160]
W_xh · x₃ = [0.5 + 0.3, 0.1 + 0.4] = [0.8, 0.5]
z₃ = [0.9271, 0.7160]

h₃ = tanh([0.9271, 0.7160]) = [0.7292, 0.6144]

Observation: h₃ = [0.7292, 0.6144] encodes information from all three inputs!
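
Hand computations like this are easy to get slightly wrong, so it can help to recompute them. The sketch below runs the same three steps in NumPy with the weights and inputs from the Setup, using 1-D arrays instead of column vectors.

```python
import numpy as np

W_xh = np.array([[0.5, 0.3],
                 [0.1, 0.4]])
W_hh = np.array([[0.2, 0.1],
                 [0.3, 0.2]])
b_h = np.zeros(2)
xs = [np.array([1.0, 0.0]),
      np.array([0.0, 1.0]),
      np.array([1.0, 1.0])]

h = np.zeros(2)                 # h_0
history = []
for t, x_t in enumerate(xs, start=1):
    z = W_hh @ h + W_xh @ x_t + b_h
    h = np.tanh(z)
    history.append(h)
    print(f"t={t}: z={np.round(z, 4)}  h={np.round(h, 4)}")
```

Comparing the printed values against the worked steps is a quick way to catch rounding or arithmetic slips.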

Example 2: LSTM Gate Computation (1 Time Step)

LSTM Gate Walkthrough

Hidden dim n=2, input dim d=2. h₀=[0,0], C₀=[0,0], x₁=[1,0.5]

Concatenated [h₀, x₁] = [0, 0, 1, 0.5] (dim 4). We use simplified weight matrices W ∈ ℝ^{2×4}:

W_f = [[0.1, 0.2, 0.3, 0.1], [0.2, 0.1, 0.1, 0.3]]   b_f = [0.5, 0.5]
W_i = [[0.3, 0.1, 0.2, 0.2], [0.1, 0.3, 0.3, 0.1]]   b_i = [0, 0]
W_C = [[0.2, 0.3, 0.4, 0.1], [0.3, 0.2, 0.1, 0.4]]   b_C = [0, 0]
W_o = [[0.1, 0.1, 0.5, 0.2], [0.2, 0.2, 0.2, 0.5]]   b_o = [0, 0]

Step 1: Forget Gate

f₁ = σ(W_f · [0,0,1,0.5] + b_f)
= σ([0.3+0.05+0.5, 0.1+0.15+0.5])
= σ([0.85, 0.75]) = [0.7006, 0.6792]

Step 2: Input Gate

i₁ = σ(W_i · [0,0,1,0.5] + b_i)
= σ([0.2+0.1, 0.3+0.05]) = σ([0.3, 0.35]) = [0.5744, 0.5866]

Step 3: Candidate Cell

C̃₁ = tanh(W_C · [0,0,1,0.5] + b_C)
= tanh([0.4+0.05, 0.1+0.2]) = tanh([0.45, 0.3]) = [0.4219, 0.2913]

Step 4: Cell State Update

C₁ = f₁ ⊙ C₀ + i₁ ⊙ C̃₁
= [0.7006, 0.6792] ⊙ [0, 0] + [0.5744, 0.5866] ⊙ [0.4219, 0.2913]
= [0, 0] + [0.2424, 0.1709] = [0.2424, 0.1709]

Step 5: Output Gate & Hidden State

o₁ = σ(W_o · [0,0,1,0.5] + b_o)
= σ([0.5+0.1, 0.2+0.25]) = σ([0.6, 0.45]) = [0.6457, 0.6106]

h₁ = o₁ ⊙ tanh(C₁) = [0.6457, 0.6106] ⊙ tanh([0.2424, 0.1709])
= [0.6457, 0.6106] ⊙ [0.2377, 0.1692] = [0.1535, 0.1033]

Result: h₁ = [0.1535, 0.1033], C₁ = [0.2424, 0.1709]. The forget gate values (~0.7) mean we'd retain about 70% of the previous cell state if it were non-zero.
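
The gate values can be recomputed with a short NumPy script. The sketch uses the weights, biases, and inputs given in the walkthrough; `sigmoid` is defined locally so the snippet is self-contained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

concat = np.array([0.0, 0.0, 1.0, 0.5])          # [h_0, x_1]
c_prev = np.zeros(2)                             # C_0

W_f = np.array([[0.1, 0.2, 0.3, 0.1], [0.2, 0.1, 0.1, 0.3]]); b_f = np.array([0.5, 0.5])
W_i = np.array([[0.3, 0.1, 0.2, 0.2], [0.1, 0.3, 0.3, 0.1]]); b_i = np.zeros(2)
W_C = np.array([[0.2, 0.3, 0.4, 0.1], [0.3, 0.2, 0.1, 0.4]]); b_C = np.zeros(2)
W_o = np.array([[0.1, 0.1, 0.5, 0.2], [0.2, 0.2, 0.2, 0.5]]); b_o = np.zeros(2)

f = sigmoid(W_f @ concat + b_f)                  # forget gate
i = sigmoid(W_i @ concat + b_i)                  # input gate
c_tilde = np.tanh(W_C @ concat + b_C)            # candidate values
c = f * c_prev + i * c_tilde                     # new cell state
o = sigmoid(W_o @ concat + b_o)                  # output gate
h = o * np.tanh(c)                               # new hidden state

for name, value in [("f", f), ("i", i), ("c_tilde", c_tilde),
                    ("c", c), ("o", o), ("h", h)]:
    print(f"{name:8s}", np.round(value, 4))
```

Because C₀ is zero, the forget gate has no effect on C₁ in this step; setting `c_prev` to a non-zero vector shows how f scales the old cell state.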

Visual Diagrams

Diagram 1: Vanilla RNN Unrolled

Unrolled RNN Across Time Steps

    x₁           x₂           x₃           x₄
    │            │            │            │
    ▼            ▼            ▼            ▼
┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐
│  RNN   │──▶│  RNN   │──▶│  RNN   │──▶│  RNN   │
│  Cell  │h₁ │  Cell  │h₂ │  Cell  │h₃ │  Cell  │h₄
└────────┘   └────────┘   └────────┘   └────────┘
    │            │            │            │
    ▼            ▼            ▼            ▼
    y₁           y₂           y₃           y₄

◄──────── Same Weights (W_xh, W_hh) Shared ────────▶

h₀=0 ──▶ h₁ ──▶ h₂ ──▶ h₃ ──▶ h₄   (Each h carries compressed history)

Diagram 2: LSTM Cell Architecture

LSTM Cell

C_{t-1} ──────── × ──────────── + ──────────┬───────── C_t ──▶
                 ▲              ▲           │
                 │              │           ▼
                f_t         i_t ⊙ C̃_t      tanh
                 │           ▲     ▲        │
                 │           │     │        ▼
              ┌──┴───┐  ┌────┴─┐ ┌─┴────┐   × ──────── h_t ──▶
              │  σ   │  │  σ   │ │ tanh │   ▲
              │FORGET│  │INPUT │ │ CAND │   │ o_t
              └──┬───┘  └──┬───┘ └──┬───┘ ┌─┴────┐
                 │         │        │     │  σ   │
                 │         │        │     │OUTPUT│
                 │         │        │     └──┬───┘
[h_{t-1}, x_t] ──┴─────────┴────────┴────────┘

σ = sigmoid (0 to 1)   × = element-wise mult   + = element-wise add

Diagram 3: GRU Cell

GRU Cell

h_{t-1} ──┬──────── × ──────────── + ──── h_t ──▶
          │         ▲              ▲
          │     (1 - z_t)      z_t ⊙ h̃_t
          │         │              │
          │      ┌──┴───┐     ┌────┴────┐
          │      │  σ   │     │  tanh   │
          ├─────▶│UPDATE│     │CANDIDATE│
          │      └──────┘     └────┬────┘
          │                        │
          │      ┌──────┐          │  r_t ⊙ h_{t-1}
          ├─────▶│  σ   ├──────────┘
          │      │RESET │
x_t ──────┘      └──────┘

Diagram 4: Seq2Seq Encoder-Decoder

        ENCODER                                  DECODER

   "I"   "love"  "India"                 "<start>" "मुझे"   "भारत"   "पसंद"
    │       │       │                        │       │       │       │
    ▼       ▼       ▼                        ▼       ▼       ▼       ▼
  ┌───┐   ┌───┐   ┌───┐                    ┌───┐   ┌───┐   ┌───┐   ┌───┐
  │RNN│──▶│RNN│──▶│RNN│─── context h_T ───▶│RNN│──▶│RNN│──▶│RNN│──▶│RNN│
  └───┘   └───┘   └───┘                    └───┘   └───┘   └───┘   └───┘
                                             │       │       │       │
                                             ▼       ▼       ▼       ▼
                                           "मुझे"   "भारत"   "पसंद"    "है"

Each decoder step receives the previous target word as input (a start token at the first step) and predicts the next word. The context vector h_T is the "bottleneck": it must encode the ENTIRE input sentence. Attention (Ch 22) fixes this!

Flowcharts

Flowchart 1: Choosing RNN Architecture

┌─────────────────────────┐
│    Sequence Problem?    │
└────────────┬────────────┘
             │
       ┌─────▼─────┐
       │ Variable  │──── No ──▶ Use Feedforward NN
       │ Length?   │
       └─────┬─────┘
             │ Yes
       ┌─────▼─────────┐
       │ Long-range    │
       │ dependencies? │──── No ──▶ Vanilla RNN (simple)
       └─────┬─────────┘
             │ Yes
       ┌─────▼───────┐
       │ Memory &    │
       │ Compute     │──── Yes ──▶ GRU (fewer params)
       │ constrained?│
       └─────┬───────┘
             │ No
       ┌─────▼───────┐
       │ LSTM (more  │
       │ robust)     │
       └─────┬───────┘
             │
       ┌─────▼─────────┐
       │ Need future   │
       │ context too?  │──── No ──▶ Unidirectional LSTM
       └─────┬─────────┘
             │ Yes
       ┌─────▼───────┐
       │Bidirectional│
       │    LSTM     │
       └─────────────┘

Flowchart 2: BPTT Training Algorithm

┌────────────────────────────────────┐
│ 1. Initialize weights randomly     │
└──────────────────┬─────────────────┘
                   ▼
┌────────────────────────────────────┐
│ 2. FORWARD PASS                    │
│    For t = 1 to T:                 │
│      h_t = tanh(W_hh·h_{t-1}       │
│                 + W_xh·x_t + b_h)  │
│      ŷ_t = softmax(W_hy·h_t + b_y) │
│      L_t = CrossEntropy(y_t, ŷ_t)  │
└──────────────────┬─────────────────┘
                   ▼
┌────────────────────────────────────┐
│ 3. Compute total loss L = Σ L_t    │
└──────────────────┬─────────────────┘
                   ▼
┌────────────────────────────────────┐
│ 4. BACKWARD PASS (BPTT)            │
│    For t = T down to 1:            │
│      Compute ∂L/∂h_t               │
│      Propagate gradient backward   │
│      through all k ≤ t             │
│      Accumulate ∂L/∂W_hh, ∂L/∂W_xh │
└──────────────────┬─────────────────┘
                   ▼
┌────────────────────────────────────┐
│ 5. Clip gradients (if needed)      │
│    if ‖g‖ > thr: g ← g · thr / ‖g‖ │
└──────────────────┬─────────────────┘
                   ▼
┌────────────────────────────────────┐
│ 6. Update weights: W -= lr · ∂L/∂W │
│    Go to step 2 (next epoch)       │
└────────────────────────────────────┘
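
Step 5 of the flowchart rescales the whole gradient by its norm, which differs from the element-wise `np.clip` used in the from-scratch code in Section 10.1. A minimal sketch of norm-based clipping follows; the function name `clip_by_global_norm` and the example values are assumptions made for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays so their joint L2 norm is at most `threshold`."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]   # same direction, smaller length
    return grads, total_norm

# Joint norm is sqrt(3^2 + 4^2 + 12^2) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, threshold=6.5)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Norm-based clipping preserves the direction of the update, whereas element-wise clipping can change it when only some entries exceed the limit. Either approach limits the step size when gradients explode.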

Python Implementation (From Scratch)

10.1 Vanilla RNN in NumPy

🐍 Python: RNN from Scratch
import numpy as np

class VanillaRNN:
    """
    Vanilla RNN implementation from scratch using NumPy.
    h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y
    """
    def __init__(self, input_dim, hidden_dim, output_dim):
        self.hidden_dim = hidden_dim
        # Xavier initialization
        scale_xh = np.sqrt(2.0 / (input_dim + hidden_dim))
        scale_hh = np.sqrt(2.0 / (hidden_dim + hidden_dim))
        scale_hy = np.sqrt(2.0 / (hidden_dim + output_dim))

        self.W_xh = np.random.randn(hidden_dim, input_dim) * scale_xh
        self.W_hh = np.random.randn(hidden_dim, hidden_dim) * scale_hh
        self.b_h = np.zeros((hidden_dim, 1))

        self.W_hy = np.random.randn(output_dim, hidden_dim) * scale_hy
        self.b_y = np.zeros((output_dim, 1))

    def forward(self, inputs, h_prev=None):
        """
        Forward pass through the entire sequence.
        inputs: list of column vectors [x_1, x_2, ..., x_T]
        Returns: outputs, hidden_states
        """
        if h_prev is None:
            h_prev = np.zeros((self.hidden_dim, 1))

        self.inputs = inputs
        self.hidden_states = {0: h_prev}
        outputs = []

        for t in range(1, len(inputs) + 1):
            x_t = inputs[t - 1]
            # Core RNN equation
            z_t = self.W_hh @ self.hidden_states[t-1] + self.W_xh @ x_t + self.b_h
            h_t = np.tanh(z_t)
            y_t = self.W_hy @ h_t + self.b_y

            self.hidden_states[t] = h_t
            outputs.append(y_t)

        return outputs, self.hidden_states

    def backward(self, d_outputs, learning_rate=0.001):
        """
        Backpropagation Through Time (BPTT).
        d_outputs: list of gradients dL/dy_t for each time step
        """
        T = len(d_outputs)
        dW_xh = np.zeros_like(self.W_xh)
        dW_hh = np.zeros_like(self.W_hh)
        db_h = np.zeros_like(self.b_h)
        dW_hy = np.zeros_like(self.W_hy)
        db_y = np.zeros_like(self.b_y)

        dh_next = np.zeros((self.hidden_dim, 1))

        for t in reversed(range(1, T + 1)):
            dy = d_outputs[t - 1]
            # Gradient from output layer
            dW_hy += dy @ self.hidden_states[t].T
            db_y += dy

            # Gradient into hidden state
            dh = self.W_hy.T @ dy + dh_next

            # Backprop through tanh: dtanh/dz = 1 - tanh^2
            dz = dh * (1 - self.hidden_states[t] ** 2)

            # Accumulate gradients
            dW_xh += dz @ self.inputs[t - 1].T
            dW_hh += dz @ self.hidden_states[t - 1].T
            db_h += dz

            # Gradient flowing to previous time step
            dh_next = self.W_hh.T @ dz

        # Element-wise gradient clipping to [-5, 5] to limit exploding gradients
        for grad in [dW_xh, dW_hh, db_h, dW_hy, db_y]:
            np.clip(grad, -5, 5, out=grad)

        # Update weights
        self.W_xh -= learning_rate * dW_xh
        self.W_hh -= learning_rate * dW_hh
        self.b_h -= learning_rate * db_h
        self.W_hy -= learning_rate * dW_hy
        self.b_y -= learning_rate * db_y


# ========== Demo: Character-level language model ==========
# Simple example with tiny vocab
text = "hello world hello"
chars = sorted(list(set(text)))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}
vocab_size = len(chars)

# Create training pairs
inputs_idx = [char_to_idx[c] for c in text[:-1]]
targets_idx = [char_to_idx[c] for c in text[1:]]

rnn = VanillaRNN(input_dim=vocab_size, hidden_dim=16, output_dim=vocab_size)

# Training loop
for epoch in range(200):
    # One-hot encode
    xs = [np.eye(vocab_size)[:, [i]] for i in inputs_idx]
    ys_true = targets_idx

    # Forward pass
    outputs, _ = rnn.forward(xs)

    # Compute softmax + cross-entropy loss
    loss = 0
    d_outputs = []
    for t in range(len(outputs)):
        # Softmax
        exp_y = np.exp(outputs[t] - np.max(outputs[t]))
        probs = exp_y / np.sum(exp_y)
        loss -= np.log(probs[ys_true[t], 0] + 1e-8)
        # Gradient of cross-entropy + softmax
        dy = probs.copy()
        dy[ys_true[t]] -= 1
        d_outputs.append(dy)

    # Backward pass
    rnn.backward(d_outputs, learning_rate=0.01)

    if epoch % 50 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

print("Training complete!")

10.2 LSTM in NumPy

🐍 Python: LSTM from Scratch
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(s):
    return s * (1 - s)

def tanh_derivative(t):
    return 1 - t ** 2


class LSTMCell:
    """
    Single LSTM Cell implementation from scratch.
    Implements all 4 gates: forget, input, candidate, output.
    """
    def __init__(self, input_dim, hidden_dim):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        concat_dim = input_dim + hidden_dim
        scale = np.sqrt(2.0 / concat_dim)

        # Forget gate parameters
        self.W_f = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_f = np.ones((hidden_dim, 1))  # bias=1 for forget gate (important!)

        # Input gate parameters
        self.W_i = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_i = np.zeros((hidden_dim, 1))

        # Candidate parameters
        self.W_c = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_c = np.zeros((hidden_dim, 1))

        # Output gate parameters
        self.W_o = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_o = np.zeros((hidden_dim, 1))

    def forward(self, x_t, h_prev, c_prev):
        """Single time step forward pass."""
        # Concatenate [h_{t-1}, x_t]
        concat = np.vstack([h_prev, x_t])

        # Forget gate: what to erase from cell state
        f_t = sigmoid(self.W_f @ concat + self.b_f)

        # Input gate: what new info to write
        i_t = sigmoid(self.W_i @ concat + self.b_i)

        # Candidate cell state
        c_tilde = np.tanh(self.W_c @ concat + self.b_c)

        # New cell state
        c_t = f_t * c_prev + i_t * c_tilde

        # Output gate: what to output
        o_t = sigmoid(self.W_o @ concat + self.b_o)

        # New hidden state
        h_t = o_t * np.tanh(c_t)

        # Cache for backward pass
        cache = (concat, f_t, i_t, c_tilde, c_t, o_t, h_prev, c_prev, x_t)
        return h_t, c_t, cache

    def backward(self, dh_t, dc_t, cache):
        """Single time step backward pass."""
        concat, f_t, i_t, c_tilde, c_t, o_t, h_prev, c_prev, x_t = cache

        # Gradient through output gate
        tanh_c_t = np.tanh(c_t)
        do_t = dh_t * tanh_c_t
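        # Add the gradient arriving through h_t; build a new array so the
        # caller's dc_t is not modified in place
        dc_t = dc_t + dh_t * o_t * tanh_derivative(tanh_c_t)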

        # Gradient through cell state update
        df_t = dc_t * c_prev
        di_t = dc_t * c_tilde
        dc_tilde = dc_t * i_t
        dc_prev = dc_t * f_t

        # Gradient through activations
        df_raw = df_t * sigmoid_derivative(f_t)
        di_raw = di_t * sigmoid_derivative(i_t)
        dc_raw = dc_tilde * tanh_derivative(c_tilde)
        do_raw = do_t * sigmoid_derivative(o_t)

        # Weight gradients
        dW_f = df_raw @ concat.T
        dW_i = di_raw @ concat.T
        dW_c = dc_raw @ concat.T
        dW_o = do_raw @ concat.T
        db_f = df_raw
        db_i = di_raw
        db_c = dc_raw
        db_o = do_raw

        # Gradient to concat = [h_prev, x_t]
        d_concat = (self.W_f.T @ df_raw + self.W_i.T @ di_raw +
                    self.W_c.T @ dc_raw + self.W_o.T @ do_raw)

        dh_prev = d_concat[:self.hidden_dim]
        dx_t = d_concat[self.hidden_dim:]

        grads = {
            'dW_f': dW_f, 'dW_i': dW_i, 'dW_c': dW_c, 'dW_o': dW_o,
            'db_f': db_f, 'db_i': db_i, 'db_c': db_c, 'db_o': db_o
        }
        return dh_prev, dc_prev, dx_t, grads


class LSTM:
    """Full LSTM for sequence processing."""
    def __init__(self, input_dim, hidden_dim, output_dim):
        self.cell = LSTMCell(input_dim, hidden_dim)
        self.hidden_dim = hidden_dim
        scale = np.sqrt(2.0 / (hidden_dim + output_dim))
        self.W_y = np.random.randn(output_dim, hidden_dim) * scale
        self.b_y = np.zeros((output_dim, 1))

    def forward(self, inputs):
        """Process entire sequence."""
        T = len(inputs)
        h = np.zeros((self.hidden_dim, 1))
        c = np.zeros((self.hidden_dim, 1))

        self.caches = []
        self.h_states = [h]
        outputs = []

        for t in range(T):
            h, c, cache = self.cell.forward(inputs[t], h, c)
            self.caches.append(cache)
            self.h_states.append(h)
            y = self.W_y @ h + self.b_y
            outputs.append(y)

        return outputs

    def predict_sequence(self, seed_input, length, temperature=1.0):
        """Generate a sequence given a seed."""
        h = np.zeros((self.hidden_dim, 1))
        c = np.zeros((self.hidden_dim, 1))
        x = seed_input
        generated = []

        for _ in range(length):
            h, c, _ = self.cell.forward(x, h, c)
            y = self.W_y @ h + self.b_y
            # Temperature-scaled softmax
            y = y / temperature
            exp_y = np.exp(y - np.max(y))
            probs = exp_y / np.sum(exp_y)
            idx = np.random.choice(len(probs.flatten()), p=probs.flatten())
            generated.append(idx)
            # Next input is one-hot of predicted char
            x = np.zeros_like(seed_input)
            x[idx] = 1
        return generated


# ========== Demo: LSTM on simple sequence ==========
print("=== LSTM Forward Pass Demo ===")
lstm_cell = LSTMCell(input_dim=3, hidden_dim=4)
h = np.zeros((4, 1))
c = np.zeros((4, 1))

# Process 3 time steps
for t in range(3):
    x = np.random.randn(3, 1)
    h, c, cache = lstm_cell.forward(x, h, c)
    print(f"t={t+1}: h={h.flatten()[:3].round(4)}... c={c.flatten()[:3].round(4)}...")

print("\nLSTM cell maintains separate h and c states!")

10.3 GRU in NumPy

🐍 Python: GRU from Scratch
class GRUCell:
    """GRU Cell: simplified LSTM with update + reset gates."""
    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        concat_dim = input_dim + hidden_dim
        scale = np.sqrt(2.0 / concat_dim)

        # Update gate (merges forget + input)
        self.W_z = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_z = np.zeros((hidden_dim, 1))

        # Reset gate
        self.W_r = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_r = np.zeros((hidden_dim, 1))

        # Candidate hidden state
        self.W_h = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_h = np.zeros((hidden_dim, 1))

    def forward(self, x_t, h_prev):
        concat = np.vstack([h_prev, x_t])

        # Update gate: how much of the candidate to blend in (z_t = 0 keeps the old state)
        z_t = sigmoid(self.W_z @ concat + self.b_z)

        # Reset gate: how much past to use for candidate
        r_t = sigmoid(self.W_r @ concat + self.b_r)

        # Candidate with reset applied
        concat_reset = np.vstack([r_t * h_prev, x_t])
        h_tilde = np.tanh(self.W_h @ concat_reset + self.b_h)

        # Final hidden state: interpolation
        h_t = (1 - z_t) * h_prev + z_t * h_tilde

        return h_t

# Demo
print("\n=== GRU Forward Pass Demo ===")
gru = GRUCell(input_dim=3, hidden_dim=4)
h = np.zeros((4, 1))
for t in range(3):
    x = np.random.randn(3, 1)
    h = gru.forward(x, h)
    print(f"t={t+1}: h={h.flatten().round(4)}")
💻 Code Challenge

Modify the LSTM class to implement peephole connections, in which the gates also look at the cell state directly. Add C_{t-1} to the forget and input gate computations, and C_t to the output gate computation. Compare training convergence with the standard LSTM.

TensorFlow Implementation

11.1 Text Generation with LSTM

🔶 TensorFlow: Character-Level Text Generator
import tensorflow as tf
import numpy as np

# ========== Text Generation with LSTM ==========
# Sample text (use a larger corpus in practice)
text = """India is a land of diversity. From the Himalayas in the north
to the beaches of Kerala in the south, every region has its own culture.
The country is home to over a billion people speaking hundreds of languages."""

# Character-level tokenization
chars = sorted(list(set(text)))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size} unique characters")

# Create training sequences
seq_length = 40
X_data, y_data = [], []
for i in range(len(text) - seq_length):
    X_data.append([char_to_idx[c] for c in text[i:i+seq_length]])
    y_data.append(char_to_idx[text[i+seq_length]])

X = tf.keras.utils.to_categorical(X_data, num_classes=vocab_size)
y = tf.keras.utils.to_categorical(y_data, num_classes=vocab_size)
print(f"Training samples: {len(X_data)}")

# Build LSTM model
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, input_shape=(seq_length, vocab_size),
                         return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

# Train
history = model.fit(X, y, epochs=50, batch_size=32, verbose=1)

# Text generation function
def generate_text(model, seed_text, length=200, temperature=0.8):
    """Generate text character by character."""
    generated = seed_text
    for _ in range(length):
        # Encode the last seq_length characters
        x_pred = [char_to_idx.get(c, 0) for c in generated[-seq_length:]]
        x_pred = tf.keras.utils.to_categorical([x_pred], num_classes=vocab_size)

        # Predict next character
        probs = model.predict(x_pred, verbose=0)[0]

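        # Temperature sampling (float64 keeps the probabilities summing to 1
        # within np.random.choice's tolerance)
        logits = np.log(probs.astype(np.float64) + 1e-8) / temperature
        exp_probs = np.exp(logits - np.max(logits))
        probs = exp_probs / np.sum(exp_probs)

        next_idx = np.random.choice(len(probs), p=probs)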
        generated += idx_to_char[next_idx]

    return generated

# Generate sample text
seed = text[:seq_length]
print("\n=== Generated Text ===")
print(generate_text(model, seed, length=200))

11.2 Stock Price Prediction with LSTM

🔶 TensorFlow: Nifty50 Stock Prediction
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

# ========== Nifty50 Stock Price Prediction ==========
# In production, load from NSE API or CSV
# Here we simulate realistic Nifty50 data
np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=1000, freq='B')  # Business days
# Simulate with trend + seasonality + noise
trend = np.linspace(11000, 22000, 1000)
seasonal = 500 * np.sin(np.linspace(0, 8*np.pi, 1000))
noise = np.random.randn(1000) * 200
nifty_data = trend + seasonal + noise

df = pd.DataFrame({'Date': dates, 'Close': nifty_data})
print(f"Dataset: {len(df)} trading days")
print(f"Price range: {df['Close'].min():.0f} - {df['Close'].max():.0f}")

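# Normalize: fit the scaler on the first 80% of rows only, so test-period
# prices never influence the scaling (test values may fall outside [0, 1])
close = df['Close'].values.reshape(-1, 1)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(close[:int(0.8 * len(close))])
scaled_data = scaler.transform(close)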

# Create sequences
def create_sequences(data, lookback=60):
    X, y = [], []
    for i in range(lookback, len(data)):
        X.append(data[i-lookback:i, 0])
        y.append(data[i, 0])
    return np.array(X), np.array(y)

lookback = 60
X, y = create_sequences(scaled_data, lookback)
X = X.reshape(X.shape[0], X.shape[1], 1)  # Add feature dimension

# Train/test split (80/20)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
print(f"Train: {len(X_train)}, Test: {len(X_test)}")

# Build Stacked LSTM model
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True,
                         input_shape=(lookback, 1)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='mse',
    metrics=['mae']
)

model.summary()

# Train with early stopping
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True
)

history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks=[early_stop],
    verbose=1
)

# Predict
y_pred = model.predict(X_test)

# Inverse transform
y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1))
y_pred_actual = scaler.inverse_transform(y_pred)

# Metrics
mae = mean_absolute_error(y_test_actual, y_pred_actual)
rmse = np.sqrt(mean_squared_error(y_test_actual, y_pred_actual))
mape = np.mean(np.abs((y_test_actual - y_pred_actual) / y_test_actual)) * 100

print(f"\n=== Results ===")
print(f"MAE:  ₹{mae:.2f}")
print(f"RMSE: ₹{rmse:.2f}")
print(f"MAPE: {mape:.2f}%")

11.3 Bidirectional LSTM for Sentiment Analysis

🔶 TensorFlow: Bidirectional LSTM
import tensorflow as tf

# Bidirectional LSTM for text classification
model_bilstm = tf.keras.Sequential([
    tf.keras.Input(shape=(200,)),  # sequences padded/truncated to 200 token ids
    tf.keras.layers.Embedding(input_dim=10000, output_dim=128),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    ),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(32)
    ),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model_bilstm.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model_bilstm.summary()
# Train on IMDB or your own Hindi/English sentiment dataset

11.4 GRU Comparison

🔶 TensorFlow: GRU Model
# GRU: fewer parameters, often comparable performance.
# Mirrors the stacked LSTM from 11.2 layer for layer so the parameter
# comparison is like-for-like (reuses `model` and `lookback` from 11.2).
model_gru = tf.keras.Sequential([
    tf.keras.layers.GRU(64, return_sequences=True,
                        input_shape=(lookback, 1)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.GRU(64, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.GRU(32),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1)
])

model_gru.compile(optimizer='adam', loss='mse', metrics=['mae'])
model_gru.summary()

# Compare parameter counts
print(f"\nLSTM params: {model.count_params():,}")
print(f"GRU  params: {model_gru.count_params():,}")
print(f"GRU saves {(1 - model_gru.count_params()/model.count_params())*100:.1f}% parameters")

Scikit-Learn Integration

While scikit-learn doesn't natively support RNNs, we can wrap TensorFlow/Keras models in a scikit-learn compatible interface for use in pipelines, cross-validation, and hyperparameter tuning.

🐍 Python: Sklearn + Keras Wrapper
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
import numpy as np

class LSTMRegressor(BaseEstimator, RegressorMixin):
    """Scikit-learn compatible LSTM wrapper for time series."""

    def __init__(self, lookback=60, units=64, epochs=50,
                 batch_size=32, learning_rate=0.001):
        self.lookback = lookback
        self.units = units
        self.epochs = epochs
        self.batch_size = batch_size
        self.learning_rate = learning_rate

    def _build_model(self, input_shape):
        import tensorflow as tf
        model = tf.keras.Sequential([
            tf.keras.layers.LSTM(self.units, input_shape=input_shape),
            tf.keras.layers.Dense(1)
        ])
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate),
            loss='mse'
        )
        return model

    def fit(self, X, y):
        self.model_ = self._build_model((X.shape[1], X.shape[2]))
        self.model_.fit(X, y, epochs=self.epochs,
                       batch_size=self.batch_size, verbose=0)
        return self

    def predict(self, X):
        return self.model_.predict(X, verbose=0).flatten()

    def score(self, X, y):
        y_pred = self.predict(X)
        return -mean_squared_error(y, y_pred)  # Negative MSE for sklearn


# Time Series Cross-Validation (X, y are the windowed arrays built in 11.2)
tscv = TimeSeriesSplit(n_splits=5)
lstm_reg = LSTMRegressor(lookback=60, units=32, epochs=20)

scores = []
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]

    lstm_reg.fit(X_tr, y_tr)
    score = lstm_reg.score(X_val, y_val)
    scores.append(-score)  # Convert back to positive MSE
    print(f"Fold {fold+1}: MSE = {-score:.6f}")

print(f"\nMean MSE: {np.mean(scores):.6f} ± {np.std(scores):.6f}")

Indian Case Studies

🚂 IRCTC: Demand Forecasting with LSTM

Problem

Indian Railways handles 23+ million passengers daily across 12,000+ trains. Predicting ticket demand is crucial for dynamic pricing (Flexi-fare on Rajdhani/Shatabdi), overbooking management, and resource planning.

Solution Architecture

  • Input features: Historical booking data (90 days), day of week, festivals (Diwali/Holi/Eid), season, route popularity, wait-list trends
  • Model: Stacked LSTM (3 layers, 128/64/32 units) with attention
  • Output: Predicted demand for next 7/15/30 days per route

Results

Reduced overbooking complaints by ~18%. Improved revenue on flexi-fare routes by ₹800+ crore annually. MAPE of 8.3% on high-demand routes.

Key Insight

Festival-aware features were critical: demand spikes 10x during Chhath Puja on Bihar routes. The LSTM learned these recurring seasonal patterns without explicit programming.
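One way to make a sequence model festival-aware is to attach calendar flags to every time step. The sketch below is illustrative only: the `FESTIVALS` table, its dates, the booking counts, and the `calendar_features` and `build_sequence` helpers are hypothetical and are not taken from any IRCTC system.

```python
from datetime import date, timedelta

# Hypothetical festival calendar (illustrative dates only)
FESTIVALS = {
    date(2023, 11, 12): "Diwali",
    date(2023, 11, 19): "Chhath Puja",
}

def calendar_features(day, window=2):
    """Return 7 day-of-week one-hot values followed by a near-festival flag."""
    dow = [0.0] * 7
    dow[day.weekday()] = 1.0  # Monday=0 ... Sunday=6
    near = any(abs((day - f).days) <= window for f in FESTIVALS)
    return dow + [1.0 if near else 0.0]

def build_sequence(start, n_days, bookings):
    """Pair each day's booking count with its calendar features."""
    rows = []
    for k in range(n_days):
        day = start + timedelta(days=k)
        rows.append([float(bookings[k])] + calendar_features(day))
    return rows

seq = build_sequence(date(2023, 11, 9), 5, [120, 135, 180, 260, 240])
for row in seq:
    print(row)
```

Each row is one time step with 9 features (booking count, 7 day-of-week values, festival flag), which is the `(timesteps, features)` layout an LSTM layer expects per sample.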

📈 Nifty50: Stock Price Prediction

Problem

Quantitative trading firms on NSE need short-term (1-5 day) price movement predictions for algorithmic trading strategies.

Approach

  • Data: 10+ years of Nifty50 OHLCV data, plus FII/DII flows, India VIX, US market correlation
  • Feature engineering: RSI, MACD, Bollinger Bands, moving averages (20/50/200 day), volume profile
  • Model: Bidirectional LSTM with 60-day lookback window
  • Training: Walk-forward validation (no data leakage)
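Walk-forward validation can be sketched as an expanding training window followed by a test block that lies strictly later in time. The helper name `walk_forward_splits` and its equal-sized fold layout are assumptions made for illustration; real pipelines often add a gap between train and test or use a fixed-length rolling window instead.

```python
def walk_forward_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with an expanding training window.

    Every test block lies strictly after its training block, so no future
    observation is used to fit the model that is evaluated on that block.
    """
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # The last fold absorbs any leftover samples
        test_end = n_samples if k == n_splits else train_end + fold
        yield range(0, train_end), range(train_end, test_end)

for train, test in walk_forward_splits(n_samples=10, n_splits=3):
    print(list(train), "->", list(test))
```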

Results

Directional accuracy: 58-62% (significantly above random 50%). Sharpe ratio: 1.8 vs. buy-and-hold 1.2. Best performance during trending markets; struggled in sideways/choppy markets.
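Both metrics are simple to compute once forecasts and strategy returns are available. The sketch below uses made-up numbers. It measures the predicted direction against the last known actual price (one of several possible conventions) and annualises the Sharpe ratio with 252 trading days, which is an assumption rather than something stated in the case study.

```python
import math

def directional_accuracy(actual, predicted):
    """Fraction of steps where the forecast and the actual move share a sign."""
    hits = 0
    for t in range(1, len(actual)):
        actual_up = actual[t] - actual[t - 1] > 0
        pred_up = predicted[t] - actual[t - 1] > 0  # forecast vs. last known price
        hits += actual_up == pred_up
    return hits / (len(actual) - 1)

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods=252):
    """Annualised Sharpe ratio: mean excess return / sample std * sqrt(periods)."""
    excess = [r - risk_free_daily for r in daily_returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods)

actual = [100.0, 101.0, 100.5, 102.0, 101.0]
predicted = [100.0, 100.8, 101.5, 101.2, 101.5]
print(directional_accuracy(actual, predicted))   # 3 of 4 moves called correctly
print(sharpe_ratio([0.01, -0.005, 0.007, 0.002]))
```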

Caution

Stock prediction is inherently uncertain. LSTMs capture patterns but cannot predict black swan events (COVID crash, demonetization). Always combine with risk management.

🔄 UPI Transaction Anomaly Detection

NPCI processes 10+ billion UPI transactions monthly. LSTM-based sequence models analyze user transaction patterns (timing, amounts, merchants) to flag fraudulent transactions in real time. The model treats each user's transaction history as a time series and detects deviations from learned behavioral patterns. The false positive rate fell from 3.2% to 1.1%.
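A common pattern behind this kind of detector is to forecast the next value from recent history and flag observations whose forecast error is unusually large. To stay self-contained, the sketch below substitutes a trailing mean for the LSTM forecast; the function name `flag_anomalies`, the threshold `k`, and the sample amounts are all hypothetical.

```python
import statistics

def flag_anomalies(amounts, history=5, k=3.0):
    """Flag amounts more than k standard deviations away from a forecast.

    The forecast is just the mean of the previous `history` values; in a real
    system it would come from a trained sequence model such as an LSTM.
    """
    flags = []
    for t in range(len(amounts)):
        if t < history:
            flags.append(False)  # not enough context yet
            continue
        window = amounts[t - history:t]
        forecast = statistics.mean(window)
        spread = statistics.stdev(window) or 1e-9  # avoid division by zero
        flags.append(abs(amounts[t] - forecast) / spread > k)
    return flags

amounts = [250, 300, 280, 260, 310, 290, 9500, 270]
print(flag_anomalies(amounts))
```

Note that the outlier inflates the spread of the next window, so the transaction after it is not flagged; production systems typically exclude flagged points from the history for this reason.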

🛰️ ISRO: Weather Sequence Prediction

ISRO's MOSDAC (Meteorological and Oceanographic Satellite Data Archival Centre) uses LSTM networks to predict cyclone trajectories in the Indian Ocean. By processing sequential satellite imagery features (cloud patterns, sea surface temperatures, wind shear), LSTMs predict cyclone paths 48-72 hours ahead with 15-20% improvement over statistical models.

🇮🇳 India Spotlight

Flipkart's demand prediction engine uses LSTM-based models to forecast product demand across 27,000+ pin codes. The model handles festival-driven demand spikes (Big Billion Days), regional variations, and new product cold-start, helping optimize warehouse inventory and reduce delivery times from days to hours.

Global Case Studies

Google Translate (Pre-Transformer Era)

The Problem

Before 2016, Google Translate used phrase-based statistical machine translation (SMT), which was clunky, inaccurate, and poor at capturing context.

The LSTM Solution (2016-2017)

Google's Neural Machine Translation (GNMT) system used an 8-layer encoder + 8-layer decoder LSTM architecture with attention. Key innovations:

  • Residual connections between LSTM layers to enable training 8 layers deep
  • Attention mechanism connecting decoder to all encoder states
  • Wordpiece tokenization for handling rare words
  • Quantization for serving at scale (100B+ words translated per day)
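The attention step in the list above can be sketched in a few lines: score each encoder state against the current decoder state, normalise the scores with a softmax, and take the weighted sum as the context vector. GNMT scores states with a small learned feed-forward network; the plain dot-product score and the toy vectors used here are simplifying assumptions.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step using dot-product scores."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [sum(w * h[j] for w, h in zip(weights, encoder_states))
               for j in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Because the context vector is recomputed at every decoder step, the decoder is no longer limited to a single fixed-size summary of the source sentence.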

Impact

In human side-by-side evaluations, GNMT reduced translation errors by roughly 60% on average compared with the phrase-based production system, closing much of the gap between SMT and human translation. It remained the state of the art until Transformers arrived in 2017.

🎤 Apple Siri & Amazon Alexa

Voice Recognition Pipeline

Both Siri and Alexa used deep bidirectional LSTMs as core components of their Automatic Speech Recognition (ASR) systems:

  • Acoustic model: BiLSTM processing mel-spectrogram features frame-by-frame
  • Language model: LSTM predicting next word probabilities
  • End-to-end: Listen-Attend-Spell (LAS) architecture using encoder LSTM + decoder LSTM with attention

Alexa processes 100M+ voice requests daily. The LSTM-based system reduced word error rate (WER) from 8.5% to 5.1% between 2015 and 2018.
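Word error rate is the word-level edit distance between the recognised text and a reference transcript, divided by the number of reference words. The sketch below is a minimal version that assumes whitespace tokenisation and a non-empty reference; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match / substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# One deleted word ("on") and one substitution ("lights" -> "light"): 2 / 5
print(word_error_rate("turn on the kitchen lights", "turn the kitchen light"))
```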

🎵 Spotify: Music Recommendation

Spotify uses LSTM-based session models to predict the next song a user will enjoy based on their listening sequence. The model processes the sequence of recently played tracks (encoded as embeddings) and predicts engagement probability for candidate songs. This powers the "autoplay" feature and contributes to 30%+ of total streams.

🏥 DeepMind: Acute Kidney Injury Prediction

DeepMind used recurrent neural networks to predict Acute Kidney Injury (AKI) up to 48 hours before onset by analyzing sequential electronic health records (lab results, vital signs, medications). Published in Nature (2019), the system correctly predicted 55.8% of inpatient AKI episodes at a ratio of two false alerts for every true alert, which could give clinicians time to intervene earlier.

Startup Applications

🤖 Conversational AI (Yellow.ai)

Indian startup Yellow.ai uses LSTM-based intent classification and entity extraction for building multilingual chatbots. Their platform serves 1000+ enterprises across 135+ languages, using BiLSTMs to understand customer queries in Hindi, Tamil, Bengali, and other Indian languages with 90%+ accuracy.

📊 Quantitative Trading (QuantConnect)

Startups like QuantConnect and Alpaca provide LSTM-based trading signal generators. Features include multi-timeframe OHLCV data, order book imbalance sequences, and news sentiment sequences. GRU models are preferred for high-frequency trading due to faster inference (roughly 25% fewer parameters than an LSTM of the same size, since a GRU has three weight blocks to the LSTM's four).
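Counting the weights of each cell shows where the GRU's saving comes from. The formulas below assume the textbook formulation with one weight matrix over `[h_prev, x_t]` and one bias vector per gate; library implementations can differ slightly (for example, Keras's default GRU keeps a second set of biases), and the dimensions are arbitrary.

```python
def lstm_params(input_dim, hidden_dim):
    """Four weight blocks (forget, input, candidate, output), one bias each."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

def gru_params(input_dim, hidden_dim):
    """Three weight blocks (update, reset, candidate), one bias each."""
    return 3 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

d, h = 32, 64
lstm, gru = lstm_params(d, h), gru_params(d, h)
print(lstm, gru, f"{1 - gru / lstm:.0%} fewer")
```

The ratio is 3/4 regardless of the layer sizes, so under these assumptions a GRU layer always has 25% fewer parameters than an LSTM layer of the same width.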

🏥 HealthTech (Niramai)

Niramai (Bangalore) combines LSTM sequence models with thermal imaging for breast cancer screening. The temporal analysis of thermal patterns across sequential scans helps detect anomalies earlier than single-snapshot analysis. FDA and CE certified.

🎶 Music Generation (AIVA)

AIVA (Luxembourg) uses deep LSTM networks trained on 30,000+ classical music scores to compose original symphonies. Their model processes note sequences (pitch, duration, velocity) and generates coherent musical compositions used in films, ads, and games.

Government Applications

🌊 Flood Prediction (CWC India)

The Central Water Commission uses LSTM models fed with sequential river gauge data (water levels, rainfall, upstream discharge) to predict flood levels 24-72 hours ahead for major rivers like Ganga, Brahmaputra, and Godavari. The LSTM outperforms traditional hydrological models by 25% in RMSE during extreme events.

Cybersecurity (CERT-In)

India's Computer Emergency Response Team uses LSTM-based intrusion detection systems that process network traffic sequences to identify anomalous patterns. The model learns normal traffic flow patterns and flags deviations, detecting DDoS attacks, data exfiltration, and lateral movement within government networks.

📡 Spectrum Management (DoT)

The Department of Telecommunications uses GRU models for radio spectrum usage prediction, helping optimize frequency allocation across telecom operators. The model predicts spectrum demand patterns 30 days ahead with 92% accuracy.

🏥 Epidemic Prediction (ICMR)

ICMR used LSTM models during COVID-19 to predict case trajectories for Indian states, incorporating mobility data, vaccination rates, and past wave patterns as sequential features. These predictions informed lockdown decisions and resource allocation.

Industry Applications

Industry | Application | RNN Variant | Key Feature
Finance | Fraud detection in transaction sequences | LSTM | Behavioral anomaly detection
Healthcare | ICU patient deterioration prediction | BiLSTM | Vital signs time series
Manufacturing | Predictive maintenance (vibration data) | GRU | Sensor sequence anomalies
Energy | Solar/wind power output forecasting | LSTM | Weather sequence data
Telecom | Network traffic prediction | Stacked LSTM | Load balancing optimization
Agriculture | Crop yield prediction from weather sequences | LSTM | Multi-season patterns
Retail | Customer purchase sequence modeling | GRU | Next-purchase prediction
Automotive | Driver behavior prediction | BiLSTM | Sensor fusion sequences
Gaming | Player churn prediction | GRU | Session activity patterns
Legal | Contract clause sequence analysis | BiLSTM | Document understanding

💼 Career Path

RNN/LSTM expertise opens doors to: NLP Engineer (₹12-30 LPA), Quantitative Analyst (₹20-50 LPA), Time Series Specialist (₹15-35 LPA), Speech Recognition Engineer (₹18-40 LPA at Google/Amazon), Autonomous Driving Engineer (sensor sequence processing). Strong LSTM skills + domain expertise (finance/healthcare) is particularly valuable.

Mini Projects

🛠️ Project 1: Hindi Text Generator

Objective

Build a character-level LSTM that generates Hindi text trained on Hindi Wikipedia or news articles.

🐍 Python: Hindi Text Generator
import tensorflow as tf
import numpy as np

# ========== Hindi Text Generator ==========
# Sample Hindi text (use larger corpus in production)
hindi_text = """भारत एक विशाल देश है। यहाँ अनेक भाषाएँ बोली जाती हैं।
हिंदी भारत की राजभाषा है। भारत की संस्कृति बहुत प्राचीन है।
यहाँ के लोग मेहनती और दयालु हैं। भारत में अनेक त्योहार मनाए जाते हैं।
दीपावली, होली, ईद, क्रिसमस सभी धर्मों के त्योहार मनाए जाते हैं।
भारत की अर्थव्यवस्था तेजी से बढ़ रही है। प्रौद्योगिकी क्षेत्र में भारत अग्रणी है।"""

# Character-level tokenization for Hindi
chars = sorted(list(set(hindi_text)))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}
vocab_size = len(chars)
print(f"Hindi vocab size: {vocab_size} characters")
print(f"Sample chars: {chars[:20]}")

# Prepare training data
seq_length = 30  # Shorter for Hindi due to character density
X_data, y_data = [], []
for i in range(len(hindi_text) - seq_length):
    seq_in = hindi_text[i:i + seq_length]
    seq_out = hindi_text[i + seq_length]
    X_data.append([char_to_idx[c] for c in seq_in])
    y_data.append(char_to_idx[seq_out])

X = np.array(X_data)
y = tf.keras.utils.to_categorical(y_data, num_classes=vocab_size)

# Reshape for LSTM: (samples, timesteps, features)
X = X.reshape(X.shape[0], X.shape[1], 1) / float(vocab_size)

# Build model
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(256, input_shape=(seq_length, 1),
                         return_sequences=True),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])

model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])

# Train
model.fit(X, y, epochs=100, batch_size=64, verbose=1)

# Generate Hindi text
def generate_hindi(model, seed_text, length=200, temperature=0.7):
    generated = seed_text
    pattern = [char_to_idx[c] for c in seed_text[-seq_length:]]

    for _ in range(length):
        x = np.array(pattern).reshape(1, seq_length, 1) / float(vocab_size)
        probs = model.predict(x, verbose=0)[0]

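        # Temperature sampling (float64 keeps the probabilities summing to 1
        # within np.random.choice's tolerance)
        logits = np.log(probs.astype(np.float64) + 1e-8) / temperature
        exp_probs = np.exp(logits - np.max(logits))
        probs = exp_probs / np.sum(exp_probs)

        next_idx = np.random.choice(vocab_size, p=probs)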
        generated += idx_to_char[next_idx]
        pattern = pattern[1:] + [next_idx]

    return generated

# Generate
seed = hindi_text[:seq_length]
print("\n=== Generated Hindi Text ===")
print(generate_hindi(model, seed, length=300))

Evaluation Criteria

  • Does the generated text form valid Hindi words? (character coherence)
  • Are Devanagari matras (vowel signs) placed correctly?
  • Does the text maintain grammatical structure?
  • Experiment with temperatures: 0.3 (conservative), 0.7 (balanced), 1.2 (creative)
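The effect of the temperature setting is easy to see on a toy distribution. The sketch below applies the same log, divide, and renormalise steps used for sampling to a made-up three-symbol distribution; the helper name `apply_temperature` is an assumption for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution to p_i^(1/T), then renormalise."""
    logits = [math.log(p + 1e-8) / temperature for p in probs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]
for t in (0.3, 0.7, 1.2):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

Low temperatures push almost all of the mass onto the most likely character, which gives repetitive but safe text; temperatures above 1 flatten the distribution and make rare characters more likely.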
🛠️ Project 2: Stock Price Predictor with Dashboard

Objective

Build an end-to-end stock prediction system for NSE stocks with walk-forward validation and a simple prediction dashboard.

🐍 Python: Complete Stock Predictor
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
import json

class StockPredictor:
    """End-to-end LSTM stock prediction system."""

    def __init__(self, lookback=60, units=64, epochs=50):
        self.lookback = lookback
        self.units = units
        self.epochs = epochs
        self.scaler = MinMaxScaler()
        self.model = None

    def prepare_data(self, prices):
        """Scale and create sequences."""
        scaled = self.scaler.fit_transform(prices.reshape(-1, 1))
        X, y = [], []
        for i in range(self.lookback, len(scaled)):
            X.append(scaled[i-self.lookback:i, 0])
            y.append(scaled[i, 0])
        X = np.array(X).reshape(-1, self.lookback, 1)
        y = np.array(y)
        return X, y

    def build_model(self):
        """Build stacked LSTM."""
        self.model = tf.keras.Sequential([
            tf.keras.layers.LSTM(self.units, return_sequences=True,
                                input_shape=(self.lookback, 1)),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.LSTM(self.units // 2),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(1)
        ])
        self.model.compile(optimizer='adam', loss='mse')

    def walk_forward_validate(self, prices, n_splits=5):
        """Walk-forward validation: proper time series CV."""
        X, y = self.prepare_data(prices)
        fold_size = len(X) // (n_splits + 1)
        results = []

        for fold in range(n_splits):
            train_end = fold_size * (fold + 1)  # expanding window; leaves a full test block per fold
            test_end = min(train_end + fold_size, len(X))

            X_train = X[:train_end]
            y_train = y[:train_end]
            X_test = X[train_end:test_end]
            y_test = y[train_end:test_end]

            self.build_model()
            self.model.fit(X_train, y_train, epochs=self.epochs,
                          batch_size=32, verbose=0)

            y_pred = self.model.predict(X_test, verbose=0).flatten()

            # Directional accuracy
            actual_dir = np.sign(np.diff(y_test))
            pred_dir = np.sign(np.diff(y_pred))
            dir_acc = np.mean(actual_dir == pred_dir)

            mse = np.mean((y_test - y_pred) ** 2)
            results.append({'fold': fold+1, 'mse': mse,
                           'dir_accuracy': dir_acc})
            print(f"Fold {fold+1}: MSE={mse:.6f}, "
                  f"Direction Accuracy={dir_acc:.2%}")

        return results

    def predict_next(self, prices, n_days=5):
        """Predict next n days."""
        X, y = self.prepare_data(prices)
        self.build_model()
        self.model.fit(X, y, epochs=self.epochs, batch_size=32, verbose=0)

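        # Recursive prediction: start from the most recent `lookback` prices
        # (X[-1] stops one step short, because its target is the last price)
        scaled = self.scaler.transform(prices.reshape(-1, 1))
        last_seq = scaled[-self.lookback:].reshape(1, self.lookback, 1).copy()
        predictions = []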

        for _ in range(n_days):
            pred = self.model.predict(last_seq, verbose=0)[0, 0]
            predictions.append(pred)
            # Shift window
            last_seq = np.roll(last_seq, -1, axis=1)
            last_seq[0, -1, 0] = pred

        # Inverse scale
        pred_prices = self.scaler.inverse_transform(
            np.array(predictions).reshape(-1, 1)
        ).flatten()

        return pred_prices


# ========== Usage ==========
# Simulate Nifty50 data
np.random.seed(42)
prices = np.cumsum(np.random.randn(500)) + 18000
prices = np.abs(prices)  # Ensure positive

predictor = StockPredictor(lookback=30, units=32, epochs=30)

# Walk-forward validation
print("=== Walk-Forward Validation ===")
results = predictor.walk_forward_validate(prices, n_splits=3)

# Predict next 5 days
print("\n=== 5-Day Forecast ===")
next_prices = predictor.predict_next(prices, n_days=5)
for i, p in enumerate(next_prices):
    print(f"Day {i+1}: ₹{p:,.2f}")

🛠️ Project 3: Sequence-to-Sequence Transliterator

Objective

Build a seq2seq model to transliterate English names to Hindi (Devanagari script). E.g., "Rahul" → "राहुल".

🐍 Python: Seq2Seq Transliterator
import tensorflow as tf
import numpy as np

# Sample transliteration pairs
pairs = [
    ("rahul", "राहुल"), ("priya", "प्रिया"), ("amit", "अमित"),
    ("neha", "नेहा"), ("vijay", "विजय"), ("sunita", "सुनीता"),
    ("deepak", "दीपक"), ("anita", "अनिता"), ("suresh", "सुरेश"),
    ("kavita", "कविता"), ("rajesh", "राजेश"), ("pooja", "पूजा"),
]

# Build character vocabularies
# Special tokens (single-character placeholders, an assumption):
# padding, start-of-sequence, end-of-sequence
PAD, START, END = '\0', '\t', '\n'
eng_chars = sorted(set(''.join([p[0] for p in pairs]))) + [PAD, START, END]
hin_chars = sorted(set(''.join([p[1] for p in pairs]))) + [PAD, START, END]

eng_to_idx = {c: i for i, c in enumerate(eng_chars)}
hin_to_idx = {c: i for i, c in enumerate(hin_chars)}
idx_to_hin = {i: c for c, i in hin_to_idx.items()}

# Encode sequences
max_eng = max(len(p[0]) for p in pairs) + 2
max_hin = max(len(p[1]) for p in pairs) + 2  # room for START and END

encoder_input = np.zeros((len(pairs), max_eng, len(eng_chars)))
decoder_input = np.zeros((len(pairs), max_hin, len(hin_chars)))
decoder_target = np.zeros((len(pairs), max_hin, len(hin_chars)))

for i, (eng, hin) in enumerate(pairs):
    for t, c in enumerate(eng):
        encoder_input[i, t, eng_to_idx[c]] = 1
    hin_seq = START + hin + END
    for t, ch in enumerate(hin_seq):
        # Decoder input at step t is the t-th character (beginning with START)
        decoder_input[i, t, hin_to_idx[ch]] = 1
        if t > 0:
            # Target is the same sequence shifted one step ahead
            decoder_target[i, t - 1, hin_to_idx[ch]] = 1

# Encoder
encoder_inputs = tf.keras.Input(shape=(max_eng, len(eng_chars)))
encoder_lstm = tf.keras.layers.LSTM(64, return_state=True)
_, state_h, state_c = encoder_lstm(encoder_inputs)

# Decoder
decoder_inputs = tf.keras.Input(shape=(max_hin, len(hin_chars)))
decoder_lstm = tf.keras.layers.LSTM(64, return_sequences=True, return_state=True)
decoder_out, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_dense = tf.keras.layers.Dense(len(hin_chars), activation='softmax')
decoder_outputs = decoder_dense(decoder_out)

# Model
model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit([encoder_input, decoder_input], decoder_target,
          epochs=200, batch_size=4, verbose=1)

print("Seq2Seq transliterator trained!")
print("This demonstrates encoder-decoder architecture for")
print("mapping English character sequences to Hindi characters.")
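An encoder-decoder like this is trained with teacher forcing: at each step the decoder receives the correct previous character and is asked to predict the next one, so the target is simply the decoder input shifted by one position. The sketch below shows that alignment on a toy string; the tab and newline start/end markers and the helper name `teacher_forcing_pair` are assumptions for illustration.

```python
START, END = "\t", "\n"  # assumed single-character start/end markers

def teacher_forcing_pair(target):
    """Return (decoder_input, decoder_target) for one training example."""
    seq = START + target + END
    decoder_input = list(seq[:-1])   # START, c1, ..., cn
    decoder_target = list(seq[1:])   # c1, ..., cn, END
    return decoder_input, decoder_target

dec_in, dec_out = teacher_forcing_pair("abc")
for step, (i, o) in enumerate(zip(dec_in, dec_out)):
    print(step, repr(i), "->", repr(o))
```

At inference time there is no ground truth to feed in, so decoding starts from the start marker alone and each predicted character becomes the next input until the end marker is produced.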

End-of-Chapter Exercises (25 Questions)

1 Conceptual: Explain why a feedforward neural network cannot process the sentence "The cat sat on the mat" differently from "mat the on sat cat The." What property of RNNs solves this?
2 Mathematical: For a vanilla RNN with W_xh ∈ ℝ^{128×64}, W_hh ∈ ℝ^{128×128}, b_h ∈ ℝ^{128}, compute the total number of trainable parameters (excluding output layer).
3 Numerical: Given h₀ = [0, 0], x₁ = [1, -1], W_hh = [[0.5, 0], [0, 0.5]], W_xh = [[1, 0], [0, 1]], b = [0, 0], compute h₁ and h₂ (with x₂ = [-1, 1]).
4 Proof: Show that if all eigenvalues of W_hh have absolute value < 1, then ‖∂h_T/∂h_1‖ → 0 as T → ∞.
5 Coding: Implement Truncated BPTT where gradients are only backpropagated for k time steps (instead of the full sequence). Test with k=5, 10, 20 on a sequence of length 100.
6 LSTM: If the forget gate outputs f_t = [1, 1, 1, 1] (all ones), what happens to the cell state? What if f_t = [0, 0, 0, 0]?
7 Comparison: Calculate the exact parameter count for an LSTM layer vs GRU layer with input_dim=50, hidden_dim=100. Which saves more memory?
8 Bidirectional: For a BiLSTM with hidden_dim=64 per direction, what is the output dimension at each time step? How does this affect the subsequent dense layer?
9 Seq2Seq: Explain the "information bottleneck" problem in vanilla seq2seq. How does the attention mechanism solve it?
10 Coding: Modify the vanilla RNN code to use ReLU instead of tanh. Train on the same character-level task. What happens to training stability? Implement gradient clipping to fix it.
11 Time Series: Why is it incorrect to use random train/test splits for time series data? Implement a proper walk-forward validation scheme.
12 Mathematical: Derive the gradient ∂L/∂W_xh for a 3-step vanilla RNN. Show all intermediate steps.
13 LSTM Analysis: Explain why the forget gate bias is often initialized to 1 instead of 0. What would happen with b_f = 0?
14 GRU: Show mathematically how GRU's update gate simultaneously controls forgetting AND input (unlike LSTM where these are independent).
15 Architecture Design: You need to build an NLP model for Hindi sentiment analysis. Choose between: Vanilla RNN, LSTM, GRU, BiLSTM. Justify your choice with at least 3 reasons.
16 Coding: Add peephole connections to the LSTM implementation. The forget gate should also receive C_{t-1} as input.
17 Stacking: Draw the architecture of a 3-layer deep LSTM. What is the input to each layer? How do gradients flow?
18 Regularization: Compare dropout, recurrent dropout, and zoneout for RNNs. Implement recurrent dropout where the same dropout mask is used across time steps.
19 Application: Design an LSTM architecture for predicting the next word in a Hindi sentence. Specify: vocabulary size handling, embedding dimension, LSTM layers, and output layer.
20 Vanishing Gradient: Create a synthetic experiment that demonstrates the vanishing gradient problem. Plot gradient magnitude vs. time step distance for vanilla RNN and LSTM.
21 Teacher Forcing: Explain teacher forcing in seq2seq training. What is the "exposure bias" problem it creates, and how does scheduled sampling address it?
22 Beam Search: Implement beam search decoding (beam width k=3) for the text generation model. Compare output quality with greedy decoding and sampling.
23 Multi-variate: Extend the stock prediction model to use 5 input features (Open, High, Low, Close, Volume) instead of just Close. How does the LSTM input shape change?
24 Attention: Implement a simple additive attention mechanism (Bahdanau attention) on top of the seq2seq model. Compute attention weights and visualize them as a heatmap.
25 Research: Read the original LSTM paper (Hochreiter & Schmidhuber, 1997). Summarize the three key problems the paper identifies with traditional RNNs and how LSTM addresses each.

Multiple Choice Questions (12 MCQs)

Q1. In a vanilla RNN, the hidden state h_t is computed as:
A) h_t = sigmoid(W_hh · h_{t-1} + W_xh · x_t + b)
B) h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)
C) h_t = ReLU(W_hh · h_{t-1} + W_xh · x_t + b)
D) h_t = W_hh · h_{t-1} + W_xh · x_t + b
✅ B) The standard vanilla RNN uses tanh activation. Sigmoid would squash to [0,1] losing negative values. ReLU can cause exploding activations in recurrent settings. No activation (D) would make it a linear model.
Q2. How many gate weight matrices does a single LSTM layer have?
A) 1
B) 2
C) 3
D) 4
✅ D) LSTM has 4 weight matrices: forget gate (W_f), input gate (W_i), candidate (W_C), and output gate (W_o). Each has its own weights and biases.
Q3. The vanishing gradient problem in RNNs occurs because:
A) The learning rate is too small
B) Repeated multiplication by values < 1 during BPTT
C) The hidden state dimension is too large
D) Batch normalization is not applied
✅ B) During BPTT, gradients are multiplied by the Jacobian ∂h_i/∂h_{i-1} at each time step. Because the tanh derivative is at most 1, when ‖W_hh‖ < 1 the repeated multiplication causes exponential decay.
Q4. In an LSTM, the cell state C_t is updated as:
A) C_t = C_{t-1} + i_t ⊙ C̃_t
B) C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
C) C_t = f_t · C_{t-1} + i_t · C̃_t (matrix multiplication)
D) C_t = tanh(f_t ⊙ C_{t-1} + i_t ⊙ C̃_t)
✅ B) The cell state is updated via element-wise operations: forget gate decides what to erase from C_{t-1}, and input gate decides what to add from the candidate C̃_t. No activation is applied to C_t itself (tanh is applied later when computing h_t).
Q5. GRU differs from LSTM in that:
A) GRU has more parameters than LSTM
B) GRU has a separate cell state
C) GRU merges forget and input gates into a single update gate
D) GRU uses ReLU instead of tanh
✅ C) GRU simplifies LSTM by: (1) merging forget+input into update gate z_t, (2) removing the separate cell state, (3) using a reset gate. This results in ~25% fewer parameters.
Q6. In a Bidirectional LSTM with hidden_dim=64 per direction, the output at each timestep has dimension:
A) 32
B) 64
C) 128
D) 256
✅ C) BiLSTM concatenates forward (64-dim) and backward (64-dim) hidden states, giving 128-dimensional output at each time step.
Q7. What is "teacher forcing" in sequence-to-sequence training?
A) Using a pre-trained teacher model to guide training
B) Feeding ground truth (instead of predicted) output as next input during training
C) Forcing the model to learn from hard examples only
D) Using gradient clipping to stabilize training
✅ B) Teacher forcing feeds the correct output token as input to the decoder at the next time step during training, rather than the model's own prediction. This speeds convergence but creates "exposure bias": the model never sees its own errors during training.
Q8. Gradient clipping in RNNs typically involves:
A) Setting gradients below a threshold to zero
B) Scaling down the gradient vector if its norm exceeds a threshold
C) Clipping individual weight values
D) Removing time steps with large gradients
✅ B) Gradient clipping rescales the entire gradient vector: if ‖g‖ > threshold, then g ← g × threshold/‖g‖. This preserves gradient direction while preventing explosion.
Q9. For a time series prediction task, which train/test split strategy is correct?
A) Random 80/20 split with shuffling
B) K-fold cross-validation with random folds
C) Chronological split: train on earlier data, test on later data
D) Stratified split based on price ranges
✅ C) Time series data must preserve temporal order. Random splits cause data leakage: the model sees future information during training. Walk-forward or time-based splits are required.
Q10. The "constant error carousel" in LSTM refers to:
A) The output gate maintaining constant values
B) The cell state allowing gradients to flow unchanged when forget gate ≈ 1
C) The learning rate remaining constant during training
D) The hidden state cycling through the same values
✅ B) When f_t ≈ 1 and i_t ≈ 0, C_t ≈ C_{t-1}, and gradients flow through the cell state without decay. This is the key mechanism by which LSTM avoids the vanishing gradient problem.
Q11. Weight sharing in RNNs means:
A) Different layers share weights
B) The same weight matrices are used at every time step
C) Weights are shared between encoder and decoder
D) Pre-trained weights from another model are used
✅ B) In an RNN, the same W_xh, W_hh, and W_hy are applied at every time step. This is what allows RNNs to handle variable-length sequences with a fixed number of parameters.
Q12. Which architecture would you choose for real-time speech recognition?
A) Bidirectional LSTM (requires full sequence)
B) Unidirectional LSTM (processes left-to-right)
C) Vanilla RNN (simplest)
D) Deep feedforward network
✅ B) Real-time speech requires processing audio as it arrives (streaming). BiLSTM needs the full sequence, so it cannot be used in real-time. Unidirectional LSTM processes frame-by-frame in one direction, suitable for streaming ASR.

Interview Questions (12 Questions)

Q1. Explain the vanishing gradient problem in RNNs and how LSTM solves it.

Expected Answer: During BPTT, gradients are multiplied by the Jacobian ∂h_i/∂h_{i-1} = diag(tanh'(z_i)) · W_hh at each step. Since |tanh'| ≤ 1, the product of these Jacobians shrinks exponentially over long sequences whenever ‖W_hh‖ < 1. LSTM introduces a cell state C_t that is updated additively (C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t). When f_t ≈ 1, gradients flow through C without decay, which is the "constant error carousel."
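
The decay described in this answer can be checked numerically. The following is a minimal sketch under assumed toy settings (hidden size 16, a random W_hh rescaled to spectral norm 0.9, random inputs with W_xh taken as the identity; none of these values come from the chapter). It multiplies the per-step Jacobians diag(tanh'(z_t)) · W_hh and records the norm of the running product.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 16                                   # hypothetical hidden size
W_hh = rng.standard_normal((hidden, hidden))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)         # rescale so the spectral norm is 0.9

h = np.zeros(hidden)
J = np.eye(hidden)                            # running product: d h_t / d h_0
norms = []
for t in range(50):
    x_t = rng.standard_normal(hidden)         # random input (W_xh assumed identity)
    h = np.tanh(W_hh @ h + x_t)
    # tanh'(z_t) = 1 - tanh(z_t)^2 = 1 - h_t^2
    step_jacobian = np.diag(1.0 - h ** 2) @ W_hh
    J = step_jacobian @ J
    norms.append(np.linalg.norm(J, 2))

print(f"norm of dh_10/dh_0 = {norms[9]:.2e}")
print(f"norm of dh_50/dh_0 = {norms[49]:.2e}")
```

Because each factor has norm at most 0.9 in this setup, the norm after t steps cannot exceed 0.9^t, so the later value is necessarily much smaller than the earlier one.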

Q2. When would you choose GRU over LSTM?

Expected Answer: GRU when: (a) training data is limited (fewer params = less overfitting), (b) inference speed matters (roughly 25% fewer ops), (c) sequences are moderate length. LSTM when: (a) very long sequences need precise memory control, (b) computational budget allows it, (c) task requires independent control of forgetting and input. Empirically, performance is often comparable, so try both and validate.
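
The "25% fewer" figure follows from counting weight blocks. As a rough sketch, assuming the textbook formulation in which every block has one weight matrix over the concatenation [h_{t-1}, x_t] and one bias vector, the helper functions below (hypothetical names) count the parameters for input_dim=50 and hidden_dim=100.

```python
def lstm_params(input_dim: int, hidden_dim: int) -> int:
    """Four blocks (forget, input, candidate, output), each with weights
    over [h_{t-1}, x_t] plus a bias vector."""
    return 4 * (hidden_dim * (hidden_dim + input_dim) + hidden_dim)


def gru_params(input_dim: int, hidden_dim: int) -> int:
    """Three blocks (update, reset, candidate) of the same shape."""
    return 3 * (hidden_dim * (hidden_dim + input_dim) + hidden_dim)


lstm = lstm_params(50, 100)
gru = gru_params(50, 100)
print(lstm, gru, f"{1 - gru / lstm:.0%} fewer")  # 60400 45300 25% fewer
```

Framework implementations can differ slightly from this count. For example, the Keras GRU layer with its default reset_after=True keeps two bias vectors per block, so its reported total is a little higher.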

Q3. What is teacher forcing and what problem does it create?

Expected Answer: Teacher forcing feeds ground-truth tokens as decoder input during training (instead of model predictions). This creates "exposure bias": during inference, the model uses its own (possibly wrong) predictions, but it never saw such errors during training. Solutions: scheduled sampling (gradually shifting from teacher forcing to model predictions), or reinforcement learning-based training.
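
To make the one-step shift concrete, here is a small sketch with a hypothetical toy target sequence (the start and end markers and the next_input helper are illustrative assumptions, not taken from the chapter's code). It shows how decoder inputs and targets line up under teacher forcing, and how scheduled sampling would choose what to feed next.

```python
import random

# Hypothetical target sequence with start/end markers.
target = ["<s>", "n", "a", "m", "</s>"]

decoder_input = target[:-1]    # what the decoder is fed at each step
decoder_target = target[1:]    # what it must predict at that step

for fed, expected in zip(decoder_input, decoder_target):
    print(f"fed {fed!r:6} -> predict {expected!r}")


def next_input(truth, predicted, teacher_prob, rng):
    """Scheduled sampling: feed the ground truth with probability
    teacher_prob, otherwise feed the model's own previous prediction."""
    return truth if rng.random() < teacher_prob else predicted


rng = random.Random(0)
# teacher_prob is typically decayed from 1.0 toward 0.0 as training progresses.
print(next_input("a", "b", teacher_prob=1.0, rng=rng))  # always the ground truth
```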

Q4. How do you handle variable-length sequences in batch training?

Expected Answer: (1) Padding: pad shorter sequences with zeros to max length in batch, use masking to ignore padded positions. (2) Bucketing: group sequences of similar length into the same batch to minimize padding. (3) Pack sequences (PyTorch pack_padded_sequence): skip computation on padded timesteps. (4) Dynamic batching: adjust batch size based on sequence length.
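
Padding, masking, and bucketing can be illustrated without any deep learning framework. The sketch below uses a hypothetical batch of token-id sequences and assumes id 0 is reserved for padding. It builds the padded array and its mask, computes a masked mean so that padded positions do not contribute, and sorts sequence indices by length as a bucketing step.

```python
import numpy as np

# Hypothetical batch of token-id sequences with different lengths.
batch = [[5, 2, 9], [7, 1], [3, 8, 4, 6, 2]]

max_len = max(len(seq) for seq in batch)
padded = np.zeros((len(batch), max_len), dtype=np.int64)  # 0 is the assumed pad id
mask = np.zeros((len(batch), max_len), dtype=bool)
for i, seq in enumerate(batch):
    padded[i, :len(seq)] = seq
    mask[i, :len(seq)] = True

# Masked mean over time: padded positions contribute nothing.
masked_mean = (padded * mask).sum(axis=1) / mask.sum(axis=1)

# Bucketing: order sequences by length so neighbours need little padding.
bucket_order = sorted(range(len(batch)), key=lambda i: len(batch[i]))

print(padded)
print(masked_mean)
print(bucket_order)
```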

Q5. Explain the difference between many-to-one, many-to-many, and one-to-many RNN architectures. Give examples.

Expected Answer: Many-to-one: sentiment analysis (sequence → single label). Many-to-many (same length): POS tagging, NER (label per token). Many-to-many (different length): machine translation (seq2seq). One-to-many: image captioning (single image → sequence of words). One-to-one: essentially a feedforward network (not useful as RNN).

Q6. Why is the forget gate bias initialized to 1 in LSTM?

Expected Answer: With b_f=1, the forget gate starts at about σ(1) ≈ 0.73 when its other inputs are near zero, so the LSTM initially retains most of its cell state. This prevents premature information loss before the model has learned what to forget. With b_f=0, the forget gate starts at σ(0)=0.5, discarding 50% of the cell state at every step, which is harmful for long-range dependencies. This was recommended by Gers et al. (2000) and Jozefowicz et al. (2015).
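
The effect of the bias compounds over time. As a simplified illustration, assume the weighted inputs to the forget gate are zero, so the gate value is just σ(b_f), and assume nothing new is written to the cell. The fraction of the initial cell state that survives after a number of steps is then σ(b_f) raised to that number.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


steps = 20                      # hypothetical horizon
retained_by_bias = {}
for b_f in (0.0, 1.0):
    f = sigmoid(b_f)            # gate value when the weighted inputs are zero
    retained_by_bias[b_f] = f ** steps   # fraction of C_0 left after `steps` updates
    print(f"b_f={b_f}: f={f:.3f}, retained after {steps} steps={retained_by_bias[b_f]:.2e}")
```

In Keras, the LSTM layer exposes this initialization through its unit_forget_bias argument, which defaults to True.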

Q7. How do you prevent overfitting in LSTM models?

Expected Answer: (1) Dropout between LSTM layers (not within recurrence). (2) Recurrent dropout: same dropout mask across time steps (Gal & Ghahramani, 2016). (3) L2 regularization on weights. (4) Early stopping with validation loss. (5) Reduce model complexity (fewer units/layers). (6) Data augmentation for sequences (noise injection, time warping).

Q8. You're building a stock prediction model. A colleague says "my LSTM gets 95% accuracy." What are your concerns?

Expected Answer: Red flags: (1) Data leakage: using future information in features. (2) Wrong split: random instead of temporal. (3) Accuracy metric is meaningless for regression; should use MAE, RMSE, MAPE. (4) Directional accuracy might be a better metric. (5) Need walk-forward validation, not single train/test split. (6) Overfitting to training period. (7) Transaction costs not considered. 95% in stock prediction is almost certainly a bug.
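
A walk-forward evaluation with regression metrics might look like the sketch below. It is only an illustration: the prices are made up, the function names are hypothetical, and the "model" is a naive last-value forecast standing in for a trained LSTM. The key property is that every test index is strictly later than every training index.

```python
import math


def walk_forward_splits(n_samples, initial_train, test_size):
    """Yield (train_indices, test_indices) with an expanding training window."""
    start = initial_train
    while start + test_size <= n_samples:
        yield range(0, start), range(start, start + test_size)
        start += test_size


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Hypothetical closing prices.
prices = [100.0, 101.5, 99.8, 102.3, 103.1, 102.7, 104.0, 105.2, 104.4, 106.1]
for train_idx, test_idx in walk_forward_splits(len(prices), initial_train=4, test_size=2):
    last_seen = prices[train_idx[-1]]            # naive forecast: repeat last training value
    y_true = [prices[i] for i in test_idx]
    y_pred = [last_seen] * len(y_true)
    print(f"train 0..{train_idx[-1]}, test {list(test_idx)}: "
          f"MAE={mae(y_true, y_pred):.2f}, RMSE={rmse(y_true, y_pred):.2f}")
```

scikit-learn's TimeSeriesSplit provides a similar expanding-window split if a library implementation is preferred.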

Q9. Compare LSTM with Transformer for sequence modeling. When would you still use LSTM?

Expected Answer: Transformers: better for long sequences (parallel processing), state-of-the-art for NLP, need more data and compute. LSTM still preferred for: (1) small datasets, (2) online/streaming applications (process one step at a time), (3) edge devices (fewer parameters), (4) time series with strong autoregressive patterns, (5) tasks where sequential inductive bias helps. Self-attention in Transformers costs O(n²) in sequence length; an LSTM costs O(n).

Q10. Explain how attention works in the context of seq2seq models.

Expected Answer: Instead of compressing the entire input into a single context vector, attention allows the decoder to "look back" at all encoder hidden states at each generation step. It computes alignment scores between decoder state s_t and each encoder state h_i, converts them to weights via softmax, and creates a weighted sum (context vector). This solves the information bottleneck: the decoder can access any part of the input directly.
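
The three steps in this answer (score, softmax, weighted sum) fit in a few lines of NumPy. The sketch below assumes dot-product scoring for brevity, which is a simplification: Bahdanau attention uses an additive score with learned parameters. The shapes and the function name are hypothetical.

```python
import numpy as np


def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    decoder_state:  shape (d,), the decoder state s_t
    encoder_states: shape (T, d), the encoder states h_1 .. h_T
    """
    scores = encoder_states @ decoder_state           # one alignment score per encoder step
    scores = scores - scores.max()                    # stabilise the softmax numerically
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over encoder positions
    context = weights @ encoder_states                # weighted sum of encoder states
    return weights, context


rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((6, 8))   # hypothetical: 6 input steps, 8-dim states
decoder_state = rng.standard_normal(8)
weights, context = attention_context(decoder_state, encoder_states)
print(weights.round(3), context.shape)
```

The weights form a probability distribution over input positions, which is what gets plotted in an attention heatmap.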

Q11. What is gradient clipping and why is it essential for RNN training?

Expected Answer: Gradient clipping rescales the gradient if its norm exceeds a threshold: g ← g × (threshold/‖g‖). Essential because RNNs suffer from exploding gradients (spectral radius of W_hh > 1 causes exponential gradient growth). Without clipping, a single step with exploding gradients can ruin all learned weights. Typical threshold: 1.0-5.0. Two variants: norm clipping (scale entire gradient vector) and value clipping (clip each element independently).
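
Both variants are short to write out. The following sketch uses hypothetical gradient arrays and function names. It applies norm clipping across all parameter tensors jointly, then shows value clipping for comparison.

```python
import numpy as np


def clip_by_global_norm(grads, threshold):
    """Norm clipping: rescale all gradients together if their joint L2 norm
    exceeds `threshold`; the gradient direction is preserved."""
    global_norm = float(np.sqrt(sum(float(np.sum(g ** 2)) for g in grads)))
    if global_norm <= threshold:
        return grads, global_norm
    scale = threshold / global_norm
    return [g * scale for g in grads], global_norm


def clip_by_value(grads, limit):
    """Value clipping: clamp each element independently to [-limit, limit]."""
    return [np.clip(g, -limit, limit) for g in grads]


# Hypothetical gradients for two parameter tensors.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, threshold=1.0)
print(norm_before)  # 13.0
```

Value clipping changes the direction of the gradient, whereas norm clipping only changes its length. Keras optimizers accept clipnorm and clipvalue arguments for the same purpose.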

Q12. Design an LSTM system for real-time fraud detection on UPI transactions.

Expected Answer: Architecture: (1) Feature extraction: encode each transaction as a vector (amount, merchant category, time delta, location, device). (2) User-level LSTM: maintain per-user hidden state updated with each transaction. (3) Anomaly scoring: LSTM output → dense → sigmoid for fraud probability. (4) Online learning: update model with confirmed labels. Key challenges: class imbalance (99.9% legitimate), latency requirements (<100ms), cold start for new users. Use GRU for faster inference. Deployment: model serving with TF Serving or ONNX Runtime.

Research Problems

🔬 Research Problem 1: LSTM vs. Transformer for Low-Resource Indian Languages

Question: For languages with limited training data (e.g., Konkani, Dogri, Bodo; scheduled languages with < 1M text corpus), do LSTM-based models outperform Transformers for tasks like NER, POS tagging, and text classification?

Hypothesis: LSTMs' stronger inductive bias (sequential processing) may compensate for data scarcity where Transformers' flexibility leads to overfitting.

Methodology: Compare BiLSTM-CRF vs. small Transformer models across 5+ Indian languages at various data sizes (1K, 10K, 100K, 1M sentences). Use cross-lingual transfer from Hindi as baseline.

Expected Contribution: Guidelines for choosing architectures based on data availability in multilingual Indian NLP applications.

🔬 Research Problem 2: Continual Learning in LSTM for Non-Stationary Time Series

Question: How can LSTM models adapt to distributional shift in financial time series (e.g., regime changes in Nifty50 due to policy changes, pandemics) without catastrophic forgetting?

Approach: Investigate elastic weight consolidation (EWC), progressive neural networks, and online LSTM updating strategies. Test on Indian market data across regime changes: demonetization (Nov 2016), GST implementation (Jul 2017), COVID crash (Mar 2020), and rate hike cycles.

🔬 Research Problem 3: Efficient LSTM Architectures for Edge Deployment

Question: Can we design pruned/quantized LSTM models that run on Indian IoT devices (Raspberry Pi, ESP32) for real-time agricultural sensor prediction while maintaining > 95% of full-precision performance?

Techniques to Explore: Knowledge distillation from large LSTM to small GRU, structured pruning of LSTM gates, INT8 quantization, and architecture search for optimal hidden dimension on constrained hardware. Target: < 1MB model size, < 10ms inference latency.

🔬 Research Problem 4: Hybrid LSTM-Transformer Models

Question: Can we combine the sequential inductive bias of LSTMs with the parallel attention of Transformers to get the best of both worlds for sequence modeling?

Ideas: (1) LSTM encoder + Transformer decoder, (2) Transformer with LSTM positional encoding replacing sinusoidal, (3) Gated Transformer blocks using LSTM-style forget/update mechanisms. Benchmark on time series (ETTh, Weather), machine translation (FLORES for Indian languages), and speech recognition (CommonVoice Hindi).

Key Takeaways

🔁
RNNs process sequences by maintaining hidden state. The hidden state h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b) serves as a compressed memory of all inputs seen so far. The same weights are shared across all time steps.
📉
Vanishing gradients kill long-range learning. During BPTT, the gradient flowing from step t back to step k scales roughly as ‖W_hh‖^{t-k}. When ‖W_hh‖ < 1 this decay is exponential, and it prevents vanilla RNNs from learning dependencies beyond ~10-20 time steps.
🚪
LSTM gates are the breakthrough. Four gates (forget, input, candidate, output) control information flow. The cell state C_t acts as a "gradient highway": when f_t ≈ 1, information and gradients flow unchanged across hundreds of time steps.
⚡
GRU is the efficient alternative. By merging forget+input into an update gate and removing the cell state, GRU achieves ~25% parameter reduction with comparable performance. Prefer GRU when speed/memory matters.
↔️
Bidirectional = context from both sides. BiLSTMs process sequences forward and backward, giving each position context from the entire sequence. Essential for NER, POS tagging, and any task where future context helps.
🔗
Seq2Seq enables variable-length mapping. Encoder compresses input into context vector; decoder generates output. Attention mechanism removes the information bottleneck by letting the decoder access all encoder states.
📊
LSTMs still dominate time series. Despite Transformer hype, LSTMs remain state-of-the-art for many time series tasks (financial forecasting, sensor prediction, demand estimation) due to lower data needs, natural sequential bias, and efficient streaming inference.
⚠️
Proper time series evaluation is critical. Never use random train/test splits. Always use walk-forward validation or chronological splits. Metrics: MAE, RMSE, MAPE for regression; directional accuracy for trading signals.
🇮🇳
India-specific applications are growing rapidly. From IRCTC demand forecasting to Nifty50 prediction, UPI fraud detection, and ISRO weather prediction, LSTM/RNN skills are in high demand across Indian tech, fintech, and government sectors.

References & Further Reading

📄 Foundational Papers
  1. Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735-1780. The original LSTM paper.
  2. Gers, F.A., Schmidhuber, J., & Cummins, F. (2000). "Learning to Forget: Continual Prediction with LSTM." Neural Computation, 12(10), 2451-2471. Introduces the forget gate.
  3. Cho, K., et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder." EMNLP. Introduces GRU.
  4. Sutskever, I., Vinyals, O., & Le, Q.V. (2014). "Sequence to Sequence Learning with Neural Networks." NeurIPS. Foundational seq2seq paper.
  5. Bahdanau, D., Cho, K., & Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR. Introduces attention mechanism.
  6. Graves, A. (2013). "Generating Sequences With Recurrent Neural Networks." arXiv:1308.0850. Text and handwriting generation.
  7. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). "On the Difficulty of Training Recurrent Neural Networks." ICML. Vanishing/exploding gradient analysis.
📖 Textbooks
  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 10: Sequence Modeling. MIT Press.
  2. Jurafsky, D. & Martin, J.H. (2023). Speech and Language Processing, 3rd ed. Chapters on RNNs and seq2seq.
  3. Chollet, F. (2021). Deep Learning with Python, 2nd ed. Chapter 10: Timeseries. Manning.
  4. Géron, A. (2022). Hands-On Machine Learning, 3rd ed. Chapter 15: Processing Sequences. O'Reilly.
🔗 Online Resources
  1. Olah, C. (2015). "Understanding LSTM Networks." colah.github.io. Best visual explanation of LSTMs.
  2. Karpathy, A. (2015). "The Unreasonable Effectiveness of Recurrent Neural Networks." karpathy.github.io.
  3. TensorFlow RNN Tutorial: tensorflow.org/guide/keras/rnn
  4. PyTorch Seq2Seq Tutorial: pytorch.org/tutorials
  5. CS231n Lecture 10: Recurrent Neural Networks, Stanford (YouTube)
🇮🇳 India-Specific References
  1. NPCI Annual Reports (2020-2024): UPI transaction statistics and fraud prevention.
  2. NSE India Historical Data (nseindia.com): Nifty50 OHLCV data for stock prediction projects.
  3. IRCTC Open Data: Passenger traffic and booking patterns.
  4. ISRO MOSDAC (mosdac.gov.in): Meteorological data for weather prediction.
  5. IIT Bombay Hindi-English Parallel Corpus: For seq2seq translation projects.
End of Chapter 19

Recurrent Neural Networks & LSTMs

You've mastered sequence modeling from vanilla RNNs to LSTMs. Next up: Chapter 20 explores Generative Adversarial Networks (GANs), teaching networks to create.
