Neural Networks & Deep Learning
Chapter 14: LSTMs and GRUs - Solving Long-Term Memory
How Gated Architectures Let Recurrent Networks Remember What Matters and Forget What Doesn't
Reading Time: ~3.5 hours | Part IV: Sequence Models | Theory + Code Chapter
Prerequisites: Chapter 13 (Recurrent Neural Networks), Chapter 8 (Optimization), Chapter 7 (Backpropagation)
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| 🔵 Remember | Recall the LSTM gate equations (forget, input, output), GRU equations (update, reset), and the role of the cell state as a "conveyor belt" for gradients |
| 🔵 Understand | Explain why vanilla RNNs suffer from vanishing gradients, how the cell state solves this, and why GRUs use fewer parameters than LSTMs |
| 🟢 Apply | Implement an LSTM cell from scratch in NumPy, build a Nifty 50 stock predictor in TensorFlow, and apply Bidirectional LSTMs to NER tasks |
| 🟡 Analyze | Trace gradients through the LSTM cell, compare LSTM vs GRU training dynamics, and analyze gate activations for interpretability |
| 🟠 Evaluate | Choose between LSTM, GRU, and Bidirectional variants for specific applications; justify architecture choices for Indian industry problems |
| 🔴 Create | Design a complete fraud detection pipeline using stacked Bidirectional LSTMs on sequential transaction data |
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why vanilla RNNs fail on long sequences by deriving the vanishing gradient problem through repeated Jacobian multiplication
- Derive the complete LSTM cell equations, namely forget gate (f), input/update gate (i), cell candidate (c̃), cell state (c), output gate (o), and hidden state (a), with full mathematical notation
- Derive the GRU equations, namely update gate (z), reset gate (r), candidate hidden state (h̃), and final hidden state (h), and explain how GRU merges the forget and input gates
- Compare LSTM vs GRU on parameter count, training speed, and performance across different sequence lengths
- Explain Bidirectional RNNs: why reading a sequence both forwards and backwards helps tasks like Named Entity Recognition
- Implement an LSTM cell forward pass from scratch using only NumPy
- Build a TensorFlow LSTM model for NSE Nifty 50 stock price prediction using real-world financial time-series data
- Build a Bidirectional LSTM for Named Entity Recognition on Indian news articles
- Analyze the HDFC Bank case study: how LSTMs on transaction sequences reduced false positives in fraud detection by 40%
- Design deep/stacked RNN architectures and know when to add depth vs. width
Opening Hook: The Sentence That Broke the RNN
When Memory Fails: A Hindi Sentence Challenge
Consider this everyday Hindi sentence:
"Kya aap mujhe Bangalore mein best biryani restaurant suggest kar sakte hain?"
To answer this question correctly, a model needs to connect "Kya" (the question word at position 1) to "hain" (the verb at position 12). The subject "aap" appears 10 words before its verb. A vanilla RNN, processing word by word, must carry the memory of "Kya" and "aap" across 10 or more time steps.
Here's the brutal truth: by the time a vanilla RNN reaches "hain", the gradient signal from "Kya" has been multiplied through 11 recurrent Jacobians. If each multiplication shrinks the gradient by just 0.5×, the signal arriving back is 0.5¹¹ ≈ 0.0005, roughly two thousand times weaker. The RNN effectively forgets whether this was a question or a statement.
Machine translation systems long struggled with long Hindi→English sentences for exactly this reason. When Google Translate moved from its phrase-based system to the LSTM-based GNMT system in 2016, Google reported that translation errors fell by roughly 60% on several major language pairs in human evaluations. Gated cells can still remember "Kya" when they finally reach "hain", even 50 words later.
This chapter is about one of the most important architectural innovations in sequence modeling: the gated recurrent cell. We'll study two variants, the LSTM (Long Short-Term Memory, 1997) and the GRU (Gated Recurrent Unit, 2014), that tamed the vanishing gradient problem and powered everything from Google Translate to Alexa to financial fraud detection at HDFC Bank.
India runs on sequential data. Flipkart processes 1.5 billion event sequences per day (search → browse → add-to-cart → purchase). PhonePe analyzes transaction sequences for ₹12 lakh crore in annual UPI volume. IRCTC handles booking sequences from 25 million daily queries. Every one of these systems benefits from architectures that can remember long-range patterns, and that's exactly what LSTMs and GRUs do.
Core Concepts
We begin by understanding why vanilla RNNs fail, then build the LSTM and GRU architectures gate by gate.
14.1 The Vanishing Gradient Problem in RNNs: Why Memory Fades
Recall from Chapter 13 that a vanilla RNN computes:
a⟨t⟩ = tanh(W_aa · a⟨t−1⟩ + W_ax · x⟨t⟩ + b_a)
ŷ⟨t⟩ = softmax(W_ya · a⟨t⟩ + b_y)
During backpropagation through time (BPTT), the gradient of the loss at time step T with respect to the hidden state at time step t requires computing:
∂L⟨T⟩/∂a⟨t⟩ = ∂L⟨T⟩/∂a⟨T⟩ · ∏ₖ ∂a⟨k⟩/∂a⟨k−1⟩ for k = t+1, ..., T, where ∂a⟨k⟩/∂a⟨k−1⟩ = diag(1 − (a⟨k⟩)²) · W_aa
This is a product of (T − t) matrices. Let's analyze what happens:
Why the Product of Matrices Destroys Gradients
If the largest eigenvalue magnitude of W_aa is λ_max, then after (T − t) multiplications, the gradient scales roughly as λ_max^(T−t).
- If λ_max < 1: Gradient → 0 exponentially (vanishing gradient). For λ_max = 0.9 and T − t = 100: 0.9¹⁰⁰ ≈ 2.66 × 10⁻⁵
- If λ_max > 1: Gradient → ∞ exponentially (exploding gradient). For λ_max = 1.1 and T − t = 100: 1.1¹⁰⁰ ≈ 13,780
- If λ_max = 1: Gradient stays stable, but this is a razor's edge that is practically impossible to maintain during training
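The scaling argument in these bullets can be checked with a few lines of arithmetic. The sketch below makes a simplifying assumption: it treats the recurrent Jacobian as a single scalar factor λ, whereas real Jacobians are matrices whose effect depends on direction. The helper names `gradient_scale` and `steps_until_below` are illustrative and not part of any library.

```python
def gradient_scale(lam: float, steps: int) -> float:
    """Factor applied to a gradient after `steps` multiplications by lam."""
    return lam ** steps


def steps_until_below(lam: float, threshold: float) -> int:
    """Smallest step count at which lam**steps drops below threshold (needs 0 < lam < 1)."""
    steps, scale = 0, 1.0
    while scale >= threshold:
        scale *= lam
        steps += 1
    return steps


for lam in (0.9, 1.0, 1.1):
    scale = gradient_scale(lam, 100)
    print(f"lambda={lam}: after 100 steps the gradient is scaled by {scale:.3e}")

print("Steps for lambda=0.9 to shrink the gradient below 1e-3:",
      steps_until_below(0.9, 1e-3))
```

Even a mild contraction factor of 0.9 pushes the signal below one thousandth of its original size within a few dozen steps, which is why the ~10-20 step horizon for vanilla RNNs is a realistic estimate.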
For a vanilla RNN, long-range dependencies (more than ~10-20 time steps) are effectively invisible during training. The model can learn that "biryani" relates to "restaurant" (adjacent words) but struggles to learn that "Kya" relates to "hain" (11 steps apart).
The tanh Saturation Factor
The derivative of tanh is (1 − tanh²(x)), which is always ≤ 1. When activations saturate (|x| large), this derivative approaches 0. Each multiplication by diag(1 − (a⟨k⟩)²) further shrinks the gradient multiplicatively.
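To make the saturation effect concrete, here is a minimal sketch that evaluates the tanh derivative at a few pre-activation values and then compounds it over ten steps. The choice of sample points and the assumption that every step sits at the same pre-activation are purely illustrative.

```python
import math


def tanh_grad(x: float) -> float:
    """Derivative of tanh at x: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.0, 1.0, 2.0, 5.0):
    print(f"x={x:>4}: tanh'(x) = {tanh_grad(x):.6f}")

# Ten time steps in a row, each assumed to have pre-activation 2.0
print("Product over 10 steps at x=2:", tanh_grad(2.0) ** 10)
```

The derivative equals 1 only at x = 0; anywhere else it is strictly smaller, so this factor can only shrink the gradient, never grow it.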
Sepp Hochreiter identified the vanishing gradient problem in his 1991 diploma thesis (in German!). His advisor, Jürgen Schmidhuber, encouraged him to solve it, leading to the LSTM paper in 1997. It took nearly 20 years (until ~2014) for compute hardware to catch up and make LSTMs practical for industry use.
The key insight that leads to LSTM: we need a path where gradients can flow unchanged across many time steps. Instead of multiplying through W_aa repeatedly, we need an additive connection that acts as a highway for gradients. This is the cell state.
14.2 LSTM: Long Short-Term Memory (Full Derivation)
The LSTM, introduced by Hochreiter & Schmidhuber (1997) and refined by Gers et al. (2000) with the forget gate, replaces the simple RNN hidden state with a carefully engineered memory cell controlled by three gates.
The Key Idea: Two Separate State Vectors
Unlike a vanilla RNN (which has only one hidden state a⟨t⟩), an LSTM maintains two vectors at each time step:
- Cell state c⟨t⟩: the "long-term memory" that flows along a conveyor belt with minimal modification
- Hidden state a⟨t⟩ (sometimes written h⟨t⟩): the "working memory" exposed to the outside world
Step-by-Step Gate Derivation
At each time step t, the LSTM receives three inputs: the previous hidden state a⟨t−1⟩, the previous cell state c⟨t−1⟩, and the current input x⟨t⟩. It produces updated c⟨t⟩ and a⟨t⟩.
Gate 1: Forget Gate (f⟨t⟩): "What to erase from memory"
f⟨t⟩ = σ(W_f · [a⟨t−1⟩, x⟨t⟩] + b_f)
The forget gate outputs a vector of values between 0 and 1 (sigmoid output). Each element decides how much of the corresponding cell state dimension to retain:
- f = 1: Keep this memory completely (e.g., "remember that this is a question")
- f = 0: Erase this memory completely (e.g., "forget the previous subject, new subject introduced")
- f = 0.7: Keep 70% of this memory (gradual decay)
The notation [a⟨t−1⟩, x⟨t⟩] means concatenation. If a⟨t−1⟩ ∈ ℝⁿ and x⟨t⟩ ∈ ℝᵐ, then [a⟨t−1⟩, x⟨t⟩] ∈ ℝⁿ⁺ᵐ, and W_f ∈ ℝⁿˣ⁽ⁿ⁺ᵐ⁾. This is the same for all four weight matrices in the LSTM.
Gate 2: Input/Update Gate (i⟨t⟩): "What new information to store"
i⟨t⟩ = σ(W_i · [a⟨t−1⟩, x⟨t⟩] + b_i)
The input gate decides which dimensions of the cell state will receive new information. Like the forget gate, it outputs values in [0, 1].
Cell Candidate (c̃⟨t⟩): "What new information to potentially store"
c̃⟨t⟩ = tanh(W_c · [a⟨t−1⟩, x⟨t⟩] + b_c)
The cell candidate is the proposed new memory content. It uses tanh (output in [−1, 1]) because cell state values can be positive or negative. Think of this as the "raw new information"; the input gate decides how much of it to actually write.
Cell State Update (c⟨t⟩): "The actual memory update"
c⟨t⟩ = f⟨t⟩ ⊙ c⟨t−1⟩ + i⟨t⟩ ⊙ c̃⟨t⟩
This is the most important equation in the LSTM. The ⊙ symbol denotes element-wise (Hadamard) multiplication. Notice:
- The first term f⟨t⟩ ⊙ c⟨t−1⟩ selectively forgets parts of the old memory
- The second term i⟨t⟩ ⊙ c̃⟨t⟩ selectively writes new information
- The cell state update is additive (not multiplicative!), which is why gradients flow easily through time
Why the Additive Update Solves Vanishing Gradients
In a vanilla RNN: a⟨t⟩ = tanh(W · a⟨t−1⟩ + ...), so the previous state passes through a weight matrix and a squashing nonlinearity at every step. The gradient is a product of many such Jacobians, so it vanishes or explodes.
In an LSTM: c⟨t⟩ = f⟨t⟩ ⊙ c⟨t−1⟩ + ..., so the cell state is an additive function of the previous cell state. Along this direct path, the gradient of c⟨t⟩ with respect to c⟨t−1⟩ is simply f⟨t⟩ (element-wise). If the forget gate is close to 1, the gradient passes through almost unchanged. No repeated matrix multiplication!
This is analogous to skip connections in ResNets (Chapter 11). Just as ResNets add the identity mapping to let gradients skip layers, LSTMs add the cell state to let gradients skip time steps.
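The claim that the cell-state gradient is just a product of forget-gate values can be illustrated numerically. The sketch below looks only at the direct path c⟨t−1⟩ → c⟨t⟩ and holds the gate values fixed, which is an assumption: in a full LSTM the gates also depend on earlier states through a⟨t−1⟩, adding further gradient paths. The function names `cell_path_gradient` and `cell_update` are illustrative.

```python
import math


def cell_path_gradient(forget_gates):
    """Gradient of c<T> w.r.t. c<t> along the direct cell-state path: the product of f values."""
    return math.prod(forget_gates)


print("forget gate 0.99 for 100 steps:", cell_path_gradient([0.99] * 100))
print("forget gate 0.50 for 100 steps:", cell_path_gradient([0.50] * 100))

# Finite-difference check of d c_t / d c_prev = f for a single scalar unit,
# with the gate values treated as constants (assumption of this sketch).
f, i, c_tilde, c_prev, eps = 0.9, 0.4, 0.3, 1.2, 1e-6


def cell_update(c):
    return f * c + i * c_tilde


numeric = (cell_update(c_prev + eps) - cell_update(c_prev - eps)) / (2 * eps)
print("numeric gradient:", round(numeric, 6), "| forget gate:", f)
```

With forget gates near 1 roughly a third of the gradient survives 100 steps, whereas a factor of 0.5 per step leaves essentially nothing.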
Gate 3: Output Gate (o⟨t⟩): "What to reveal from memory"
o⟨t⟩ = σ(W_o · [a⟨t−1⟩, x⟨t⟩] + b_o)
Hidden State (a⟨t⟩): "The visible output"
a⟨t⟩ = o⟨t⟩ ⊙ tanh(c⟨t⟩)
The hidden state is a filtered version of the cell state. The cell state might store "this is a question sentence" and "the subject is aap", but at the current time step, only the relevant information is revealed through the output gate.
Complete LSTM Equations: Summary
Forget gate: f⟨t⟩ = σ(W_f · [a⟨t−1⟩, x⟨t⟩] + b_f)
Input gate: i⟨t⟩ = σ(W_i · [a⟨t−1⟩, x⟨t⟩] + b_i)
Cell candidate: c̃⟨t⟩ = tanh(W_c · [a⟨t−1⟩, x⟨t⟩] + b_c)
Cell update: c⟨t⟩ = f⟨t⟩ ⊙ c⟨t−1⟩ + i⟨t⟩ ⊙ c̃⟨t⟩
Output gate: o⟨t⟩ = σ(W_o · [a⟨t−1⟩, x⟨t⟩] + b_o)
Hidden state: a⟨t⟩ = o⟨t⟩ ⊙ tanh(c⟨t⟩)
LSTM Parameter Count
Let n = hidden size and m = input size. Each gate has a weight matrix of shape (n, n+m) and a bias of shape (n,). With 4 sets of parameters (forget, input, cell candidate, output):
Total = 4 · (n(n + m) + n) = 4n² + 4nm + 4n
For n = 256, m = 100: Total = 4(256²) + 4(256)(100) + 4(256) = 262,144 + 102,400 + 1,024 = 365,568 parameters per LSTM layer.
The forget gate is NOT about forgetting! Counterintuitively, a forget gate value of 1 means "remember everything" and 0 means "forget everything". It should really be called the "remember gate". This naming confusion trips up students constantly. Tip: Initialize the forget gate bias to a positive value (e.g., 1.0 or 2.0) so that training starts with "remember by default". Jozefowicz et al. (2015) showed that this significantly improves LSTM training.
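A quick calculation shows why a positive forget-gate bias matters at the start of training. The sketch below assumes the weighted input W_f · [a, x] is close to zero at initialization, so the gate value is determined by the bias alone; the 20-step horizon is an arbitrary illustrative choice.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


for bias in (0.0, 1.0, 2.0):
    f = sigmoid(bias)    # forget gate value when the weighted input is ~0 (assumption)
    kept = f ** 20       # fraction of a stored value that survives 20 time steps
    print(f"b_f={bias}: f={f:.3f}, retained after 20 steps={kept:.6f}")
```

With a zero bias the gate starts at 0.5 and a memory is almost entirely gone after 20 steps; with a bias of 1 or 2 a measurable fraction survives, which gives the optimizer a usable gradient signal from the first updates.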
14.3 GRU: Gated Recurrent Unit (Simplified Gating)
The GRU was proposed by Cho et al. (2014) as a simpler alternative to the LSTM. It achieves similar performance with fewer parameters by making two key simplifications:
- Merge the cell state and hidden state into a single state vector h⟨t⟩
- Merge the forget and input gates into a single update gate z⟨t⟩ (if you update, you automatically forget the old value)
GRU Equations: Step by Step
Update Gate (z⟨t⟩): "How much of the old state to keep"
z⟨t⟩ = σ(W_z · [h⟨t−1⟩, x⟨t⟩] + b_z)
The update gate serves the roles of both the LSTM's forget gate and input gate. A value of z = 1 means "keep the old hidden state completely" (copy through), while z = 0 means "replace entirely with the new candidate".
Reset Gate (r⟨t⟩): "How much of the old state to use for the candidate"
r⟨t⟩ = σ(W_r · [h⟨t−1⟩, x⟨t⟩] + b_r)
The reset gate controls how much of the previous hidden state is used to compute the new candidate. When r = 0, the model "resets" and acts as if reading the first word of a new sentence.
Candidate Hidden State (h̃⟨t⟩)
h̃⟨t⟩ = tanh(W_h · [r⟨t⟩ ⊙ h⟨t−1⟩, x⟨t⟩] + b_h)
Notice the reset gate is applied inside the tanh: it selectively zeros out parts of the previous hidden state before computing the candidate.
Hidden State Update (h⟨t⟩)
h⟨t⟩ = z⟨t⟩ ⊙ h⟨t−1⟩ + (1 − z⟨t⟩) ⊙ h̃⟨t⟩
This is a convex combination of the old state and the new candidate. The elegance: if z = 1, h⟨t⟩ = h⟨t−1⟩ (perfect copy, gradient flows through unchanged). If z = 0, h⟨t⟩ = h̃⟨t⟩ (complete reset).
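The two extreme cases of the update gate are easy to verify directly. This is a minimal sketch using plain Python lists; `gru_interpolate` is a hypothetical helper name and the vectors are made-up values.

```python
def gru_interpolate(z, h_prev, h_tilde):
    """Element-wise h = z * h_prev + (1 - z) * h_tilde for plain Python lists."""
    return [zi * hp + (1.0 - zi) * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


h_prev = [0.8, -0.5, 0.1]     # old hidden state (made-up values)
h_tilde = [-0.2, 0.9, 0.4]    # candidate state (made-up values)

print(gru_interpolate([1.0, 1.0, 1.0], h_prev, h_tilde))  # z = 1 everywhere: copies h_prev
print(gru_interpolate([0.0, 0.0, 0.0], h_prev, h_tilde))  # z = 0 everywhere: takes h_tilde
print(gru_interpolate([1.0, 0.0, 0.5], h_prev, h_tilde))  # a different choice per dimension
```

Because z is a vector, each hidden dimension makes its own keep-or-replace decision, so some units can hold long-term information while others track the most recent input.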
Complete GRU Equations: Summary
Update gate: z⟨t⟩ = σ(W_z · [h⟨t−1⟩, x⟨t⟩] + b_z)
Reset gate: r⟨t⟩ = σ(W_r · [h⟨t−1⟩, x⟨t⟩] + b_r)
Candidate: h̃⟨t⟩ = tanh(W_h · [r⟨t⟩ ⊙ h⟨t−1⟩, x⟨t⟩] + b_h)
Hidden update: h⟨t⟩ = z⟨t⟩ ⊙ h⟨t−1⟩ + (1 − z⟨t⟩) ⊙ h̃⟨t⟩
GRU Parameter Count
With 3 sets of parameters (update, reset, candidate) instead of LSTM's 4:
Total = 3 · (n(n + m) + n) = 3n² + 3nm + 3n
For n = 256, m = 100: Total = 3(256²) + 3(256)(100) + 3(256) = 196,608 + 76,800 + 768 = 274,176 parameters, 25% fewer than LSTM.
GRU ↔ LSTM Correspondence
| GRU Component | LSTM Equivalent | Key Difference |
|---|---|---|
| Update gate z | Forget gate f (and, via 1−z, input gate i) | z controls both forgetting AND updating; 1−z replaces the input gate |
| Reset gate r | Partially like output gate o | Applied before candidate computation, not after |
| Single h⟨t⟩ | Separate c⟨t⟩ and a⟨t⟩ | GRU has no protected "cell state"; the hidden state IS the memory |
The GRU was introduced by Kyunghyun Cho (now at NYU) and colleagues in the same 2014 paper that proposed the RNN Encoder-Decoder architecture for machine translation. Both ideas became foundational for neural machine translation.
14.4 LSTM vs GRU: When to Use Each
| Criterion | LSTM | GRU |
|---|---|---|
| Parameters | 4n² + 4nm + 4n | 3n² + 3nm + 3n (25% fewer) |
| Training Speed | Slower per epoch | ~20-30% faster per epoch |
| Long Sequences (>500 steps) | ✅ Better: separate cell state provides stronger gradient highway | ⚠️ Can struggle: single state must balance memory and output |
| Small Datasets | ⚠️ May overfit: more parameters | ✅ Better generalization |
| Interpretability | ✅ Can inspect cell state and gate activations separately | ⚠️ Harder: single state mixes memory and output |
| Industry Default | ✅ More common in production (proven track record) | ✅ Growing adoption, especially in mobile/edge |
| Music/Audio Generation | ✅ Preferred: needs very long context | ⚠️ Often needs larger hidden size to match |
| Text Classification | Similar performance | Similar performance, but faster |
The practitioner's rule of thumb: Start with GRU (faster to experiment). If GRU's performance plateaus and you suspect the model needs longer memory, switch to LSTM. If you're working on resource-constrained devices (mobile, IoT), prefer GRU. For production systems where accuracy is paramount and compute is available, LSTM is the safer choice.
14.5 Bidirectional RNNs: Reading Forward AND Backward
Consider the NER (Named Entity Recognition) task on this Indian news headline:
"Sachin scored a century at Wankhede while Tendulkar Foundation donated ₹5 crore."
A forward-only RNN reading "Sachin" doesn't yet know what follows. Is "Sachin" a person's first name, a place, or a brand? Only when the model reads "scored a century" does it become clear this is a cricketer. And in "Tendulkar Foundation", is "Tendulkar" a person or an organization? Only the following word "Foundation" disambiguates it.
Architecture: Two RNNs, One Sequence
A Bidirectional RNN runs two separate RNNs on the same input:
- Forward RNN (→): Processes x⟨1⟩, x⟨2⟩, ..., x⟨T⟩ left-to-right, producing hidden states →a⟨1⟩, →a⟨2⟩, ..., →a⟨T⟩
- Backward RNN (←): Processes x⟨T⟩, x⟨T−1⟩, ..., x⟨1⟩ right-to-left, producing hidden states ←a⟨1⟩, ←a⟨2⟩, ..., ←a⟨T⟩
The prediction at each time step uses both past context (from forward RNN) and future context (from backward RNN):
ŷ⟨t⟩ = g(W_y · [→a⟨t⟩, ←a⟨t⟩] + b_y)
Bidirectional RNNs CANNOT be used for real-time prediction! Since the backward RNN needs the full sequence, you must have the complete input before making predictions. This means BiLSTMs are great for NER, sentiment analysis, and machine translation (where you have the full input), but NOT for speech recognition in real-time, next-word prediction, or stock price forecasting where you predict while receiving input.
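The dependence of the backward pass on the full sequence can be seen with a toy example. The sketch below uses a scalar "RNN" with hand-picked weights (all values are assumptions chosen for illustration, not trained parameters) and checks which states change when only the last input is altered.

```python
import math


def run_rnn(xs, w, u):
    """Toy scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t). Returns all hidden states."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states


def bidirectional(xs):
    """Pair each forward state with the backward state for the same position."""
    forward = run_rnn(xs, w=0.5, u=1.0)
    # Run right-to-left with separate weights, then flip back so index t lines up
    backward = run_rnn(xs[::-1], w=0.3, u=0.8)[::-1]
    return list(zip(forward, backward))


seq = [0.2, -0.1, 0.4, 0.9]
changed = [0.2, -0.1, 0.4, -0.9]   # only the LAST input differs

a, b = bidirectional(seq), bidirectional(changed)
print("forward state at t=0 unchanged:", a[0][0] == b[0][0])
print("backward state at t=0 changed: ", a[0][1] != b[0][1])
```

The forward state at the first position never sees the final input, while the backward state at that same position does. That is the property that helps NER, and also the reason the full sequence must be available before any prediction can be made.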
Bidirectional LSTMs at Indian tech companies: Flipkart uses BiLSTMs for product review sentiment analysis in Hindi-English code-mixed text. The backward pass catches patterns like "...but battery life is terrible" where the negation comes after the subject. Jio's speech team uses BiLSTMs for named entity extraction from Hindi call transcripts to auto-tag customer complaints.
14.6 Deep (Stacked) RNNs: Adding Depth to Sequence Models
Just as we stack convolutional layers in CNNs, we can stack multiple LSTM/GRU layers. The hidden state output of layer l becomes the input to layer l+1:
a⟨t⟩_l = LSTM_l(a⟨t⟩_{l−1}, a⟨t−1⟩_l, c⟨t−1⟩_l)
where a⟨t⟩_0 = x⟨t⟩ (the input embedding).
Practical guidelines for stacking:
- 2-3 layers is the sweet spot for most NLP tasks
- 4+ layers is rarely beneficial: diminishing returns and much slower training
- Google's Neural Machine Translation (GNMT) used 8 stacked LSTM layers with residual connections, but this was before Transformers took over
- Add dropout between layers (not within recurrent connections!), typically 0.2-0.5
When using deep stacked LSTMs, add residual connections between layers (just like ResNets). This means: output_l = LSTM_l(input_l) + input_l. Google's GNMT paper showed this was essential for training 8-layer LSTMs.
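The residual rule output_l = LSTM_l(input_l) + input_l can be sketched without any deep learning library. In the example below, `toy_layer` is a hypothetical stand-in for a recurrent layer (a scalar tanh recurrence, not a real LSTM), used only to show where the skip connection is added.

```python
import math


def toy_layer(seq, weight):
    """Stand-in for one recurrent layer: maps a list of floats to a list of the same length."""
    h, out = 0.0, []
    for x in seq:
        h = math.tanh(weight * h + x)
        out.append(h)
    return out


def stacked_with_residuals(seq, weights):
    """Apply the layers in order; each layer's output is added to that layer's own input."""
    current = seq
    for w in weights:
        layer_out = toy_layer(current, w)
        # Residual connection: output_l = layer_l(input_l) + input_l
        current = [o + x for o, x in zip(layer_out, current)]
    return current


inputs = [0.1, -0.3, 0.5]
print(stacked_with_residuals(inputs, weights=[0.5, 0.4, 0.3]))
```

The addition only works when a layer's output has the same dimensionality as its input, so residual stacks typically keep the hidden size constant across layers.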
From-Scratch Code: LSTM Cell in NumPy
Let's implement a single LSTM cell forward pass using only NumPy. This computes one time step of the LSTM equations.
Python
import numpy as np


def sigmoid(x):
    """Numerically stable sigmoid (only ever exponentiates non-positive values)."""
    e = np.exp(-np.abs(x))
    return np.where(x >= 0, 1 / (1 + e), e / (1 + e))


def lstm_cell_forward(x_t, a_prev, c_prev, parameters):
    """
    Single LSTM cell forward pass.

    Arguments:
    x_t -- input at time step t, shape (m, 1)
    a_prev -- hidden state from previous step, shape (n, 1)
    c_prev -- cell state from previous step, shape (n, 1)
    parameters -- dict containing:
        Wf, bf -- forget gate weights & bias
        Wi, bi -- input gate weights & bias
        Wc, bc -- cell candidate weights & bias
        Wo, bo -- output gate weights & bias

    Returns:
    a_next -- next hidden state, shape (n, 1)
    c_next -- next cell state, shape (n, 1)
    cache -- values needed for backprop
    """
    # Extract parameters
    Wf = parameters["Wf"]  # shape (n, n+m)
    bf = parameters["bf"]  # shape (n, 1)
    Wi = parameters["Wi"]
    bi = parameters["bi"]
    Wc = parameters["Wc"]
    bc = parameters["bc"]
    Wo = parameters["Wo"]
    bo = parameters["bo"]

    # Step 1: Concatenate a_prev and x_t
    concat = np.vstack((a_prev, x_t))  # shape (n+m, 1)

    # Step 2: Forget gate: what to erase from cell state
    ft = sigmoid(Wf @ concat + bf)  # shape (n, 1)

    # Step 3: Input (update) gate: what new info to write
    it = sigmoid(Wi @ concat + bi)  # shape (n, 1)

    # Step 4: Cell candidate: proposed new memory
    c_tilde = np.tanh(Wc @ concat + bc)  # shape (n, 1)

    # Step 5: Cell state update (THE KEY EQUATION)
    c_next = ft * c_prev + it * c_tilde  # element-wise

    # Step 6: Output gate: what to reveal
    ot = sigmoid(Wo @ concat + bo)  # shape (n, 1)

    # Step 7: Hidden state: filtered cell state
    a_next = ot * np.tanh(c_next)  # shape (n, 1)

    # Cache for backpropagation
    cache = (a_next, c_next, a_prev, c_prev, ft, it,
             c_tilde, ot, x_t, parameters)
    return a_next, c_next, cache


def lstm_forward(x, a0, parameters):
    """
    Full LSTM forward pass over T time steps.

    Arguments:
    x -- input sequence, shape (m, T)
    a0 -- initial hidden state, shape (n, 1)
    parameters -- dict of LSTM weights

    Returns:
    a_all -- all hidden states, shape (n, T)
    caches -- list of caches for backprop
    """
    n = a0.shape[0]
    m, T = x.shape

    # Initialize
    a_all = np.zeros((n, T))
    c_prev = np.zeros((n, 1))
    a_prev = a0
    caches = []

    for t in range(T):
        x_t = x[:, t].reshape(-1, 1)
        a_prev, c_prev, cache = lstm_cell_forward(
            x_t, a_prev, c_prev, parameters
        )
        a_all[:, t] = a_prev.flatten()
        caches.append(cache)
    return a_all, caches


# --- Demo: Run LSTM on a toy sequence ---
np.random.seed(42)
n_hidden = 4  # hidden state size
n_input = 3   # input feature size
T_steps = 5   # sequence length

# Initialize parameters (Xavier-like)
scale = np.sqrt(2.0 / (n_hidden + n_input))
params = {}
for name in ["Wf", "Wi", "Wc", "Wo"]:
    params[name] = np.random.randn(n_hidden, n_hidden + n_input) * scale
for name in ["bf", "bi", "bc", "bo"]:
    params[name] = np.zeros((n_hidden, 1))
params["bf"] += 1.0  # Forget gate bias init = 1 (remember by default)

# Create toy input and initial hidden state
x_seq = np.random.randn(n_input, T_steps)
a_init = np.zeros((n_hidden, 1))

# Forward pass
hidden_states, caches = lstm_forward(x_seq, a_init, params)
print("Input shape:", x_seq.shape)
print("Hidden states shape:", hidden_states.shape)
print("\nHidden state at t=0:")
print(np.round(hidden_states[:, 0], 4))
print("\nHidden state at t=4:")
print(np.round(hidden_states[:, 4], 4))
print(f"\nTotal parameters: {sum(p.size for p in params.values()):,}")
Understanding the parameter count: For n=4, m=3, each weight matrix is 4×7 = 28 elements and each bias has 4 elements. With 4 gates that gives 4×28 = 112 weights plus 4×4 = 16 biases, for a total of 128. This matches both the formula 4n(n+m) + 4n = 4(4)(7) + 16 = 128 and the value printed above. Setting b_f to 1.0 changes the bias values, not the number of parameters.
Now let's also implement a GRU cell from scratch for comparison:
Python
def gru_cell_forward(x_t, h_prev, parameters):
    """
    Single GRU cell forward pass.

    Arguments:
    x_t -- input at time step t, shape (m, 1)
    h_prev -- hidden state from previous step, shape (n, 1)
    parameters -- dict containing:
        Wz, bz -- update gate weights & bias
        Wr, br -- reset gate weights & bias
        Wh, bh -- candidate weights & bias

    Returns:
    h_next -- next hidden state, shape (n, 1)
    """
    Wz, bz = parameters["Wz"], parameters["bz"]
    Wr, br = parameters["Wr"], parameters["br"]
    Wh, bh = parameters["Wh"], parameters["bh"]

    # Concatenate h_prev and x_t
    concat = np.vstack((h_prev, x_t))  # (n+m, 1)

    # Update gate: how much to keep old state
    zt = sigmoid(Wz @ concat + bz)  # (n, 1)

    # Reset gate: how much old state to use in candidate
    rt = sigmoid(Wr @ concat + br)  # (n, 1)

    # Candidate hidden state (note: reset applied to h_prev)
    concat_reset = np.vstack((rt * h_prev, x_t))
    h_tilde = np.tanh(Wh @ concat_reset + bh)  # (n, 1)

    # Final hidden state: convex combination
    h_next = zt * h_prev + (1 - zt) * h_tilde
    return h_next


# Compare parameter counts
n, m = 256, 100
lstm_params = 4 * (n * (n + m) + n)
gru_params = 3 * (n * (n + m) + n)
print(f"LSTM params (n={n}, m={m}): {lstm_params:,}")
print(f"GRU params (n={n}, m={m}): {gru_params:,}")
print(f"GRU saves: {lstm_params - gru_params:,} params ({(lstm_params-gru_params)/lstm_params*100:.1f}%)")
Industry Code: TensorFlow / Keras
5A. NSE Nifty 50 Stock Price Prediction with LSTM
We build a model to predict the next-day closing price of the NSE Nifty 50 index using the past 60 days of prices. This is a classic many-to-one sequence problem.
Real-World Context
Indian quant firms like Quadeye, Tower Research Capital India, and Edelweiss use LSTM-based models as one component in their trading strategies. While no model can "beat the market" consistently, LSTMs capture temporal patterns (momentum, mean-reversion, seasonality) that simpler models miss. Zerodha processes ~15 million orders/day on NSE, the scale of data that makes deep learning viable.
Python
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

# --- 1. Load and Prepare Nifty 50 Data ---
# Download from: https://www.nseindia.com/market-data/live-equity-market
# Or use yfinance: pip install yfinance
import yfinance as yf

nifty = yf.download("^NSEI", start="2015-01-01", end="2024-12-31")
prices = nifty["Close"].values.reshape(-1, 1)
print(f"Dataset: {len(prices)} trading days")
print(f"Price range: ₹{prices.min():.0f} to ₹{prices.max():.0f}")

# --- 2. Normalize prices to [0, 1] ---
# Fit the scaler on the training portion only, so that test-period
# minimum/maximum prices do not leak into training.
train_end = int(len(prices) * 0.8)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(prices[:train_end])
prices_scaled = scaler.transform(prices)

# --- 3. Create sequences: 60 days -> predict day 61 ---
LOOKBACK = 60  # Use 60 days of history


def create_sequences(data, lookback):
    X, y = [], []
    for i in range(lookback, len(data)):
        X.append(data[i - lookback:i, 0])
        y.append(data[i, 0])
    return np.array(X), np.array(y)


X, y = create_sequences(prices_scaled, LOOKBACK)
X = X.reshape(X.shape[0], X.shape[1], 1)  # (samples, timesteps, features)

# Train-test split (80-20, chronological: NEVER shuffle time series!)
# Sample j targets day j + LOOKBACK, so training targets stay before train_end.
split = train_end - LOOKBACK
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

# --- 4. Build Stacked LSTM Model ---
model = Sequential([
    Input(shape=(LOOKBACK, 1)),
    # Layer 1: LSTM with 128 units, return sequences for stacking
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    # Layer 2: LSTM with 64 units
    LSTM(64, return_sequences=False),
    Dropout(0.2),
    # Dense head: predict a single price value
    Dense(32, activation="relu"),
    Dense(1)  # Linear activation for regression
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="mse",
    metrics=["mae"]
)
model.summary()

# --- 5. Train with Callbacks ---
callbacks = [
    EarlyStopping(monitor="val_loss", patience=10,
                  restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                      patience=5, min_lr=1e-6)
]
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,  # Keras takes the LAST 10% of the training samples
    callbacks=callbacks,
    verbose=1
)

# --- 6. Evaluate and Visualize ---
predictions_scaled = model.predict(X_test)
predictions = scaler.inverse_transform(predictions_scaled)
actual = scaler.inverse_transform(y_test.reshape(-1, 1))

# Metrics
mae = np.mean(np.abs(predictions - actual))
mape = np.mean(np.abs((actual - predictions) / actual)) * 100
print(f"\nTest MAE: ₹{mae:.2f}")
print(f"Test MAPE: {mape:.2f}%")

# Plot
plt.figure(figsize=(14, 5))
plt.plot(actual, label="Actual Nifty 50", color="#0f172a")
plt.plot(predictions, label="LSTM Prediction", color="#7c3aed", alpha=0.8)
plt.title("Nifty 50 Stock Price Prediction (LSTM)")
plt.xlabel("Trading Days")
plt.ylabel("Price (₹)")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("nifty50_lstm_prediction.png", dpi=150)
plt.show()
NEVER shuffle time-series data for train/test split! Always use chronological splitting. Shuffling creates "data leakage": the model sees future prices during training, giving unrealistically good results that don't generalize. Also, stock prediction models have limited real-world utility for trading, because past patterns don't guarantee future performance. Use these models for learning, not for investing your savings.
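The same leakage concern applies to preprocessing, not just to the split itself. The sketch below shows a chronological split followed by min-max scaling whose statistics come from the training portion only. The helper names and the price list are made up for illustration; they are not part of scikit-learn or of the pipeline above.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered list into (train, test) without shuffling."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def fit_minmax(train):
    """Return a scaling function whose min/max come from the training data only."""
    lo, hi = min(train), max(train)
    return lambda value: (value - lo) / (hi - lo)


# Made-up closing prices in time order
prices = [100.0, 102.0, 101.0, 105.0, 107.0, 110.0, 108.0, 112.0, 115.0, 120.0]

train, test = chronological_split(prices)
scale = fit_minmax(train)

print("train:", [round(scale(p), 3) for p in train])
# Test values can fall outside [0, 1] because their range was never seen
print("test: ", [round(scale(p), 3) for p in test])
```

Scaled test values above 1.0 are expected here: they signal that the test period contains prices outside the training range, which is exactly the situation a deployed model would face.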
5B. Bidirectional LSTM for Named Entity Recognition (Indian News)
We build a BiLSTM model to identify named entities (Person, Organization, Location) in Indian news text.
Python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Input, Embedding, Bidirectional, LSTM, Dense, TimeDistributed, Dropout
)
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# --- 1. Sample Indian News NER Dataset ---
# NER Tags: O=Outside, B-PER=Begin Person, I-PER=Inside Person,
#           B-ORG=Begin Org, I-ORG=Inside Org, B-LOC=Begin Location
sentences = [
    ["Narendra", "Modi", "visited", "Varanasi", "yesterday"],
    ["TCS", "reported", "strong", "quarterly", "results"],
    ["Flipkart", "CEO", "Kalyan", "Krishnamurthy", "spoke",
     "at", "Bangalore", "tech", "summit"],
    ["HDFC", "Bank", "launched", "UPI", "services",
     "in", "Mumbai"],
    ["Sundar", "Pichai", "announced", "Google", "investment",
     "in", "India"],
]
labels = [
    ["B-PER", "I-PER", "O", "B-LOC", "O"],
    ["B-ORG", "O", "O", "O", "O"],
    ["B-ORG", "O", "B-PER", "I-PER", "O",
     "O", "B-LOC", "O", "O"],
    ["B-ORG", "I-ORG", "O", "B-ORG", "O",
     "O", "B-LOC"],
    ["B-PER", "I-PER", "O", "B-ORG", "O",
     "O", "B-LOC"],
]

# --- 2. Build Vocabulary and Tag Mappings ---
words = sorted(set(w for s in sentences for w in s))
tags = sorted(set(t for l in labels for t in l))
word2idx = {w: i + 2 for i, w in enumerate(words)}  # 0=PAD, 1=UNK
word2idx["PAD"] = 0
word2idx["UNK"] = 1
tag2idx = {t: i for i, t in enumerate(tags)}
idx2tag = {i: t for t, i in tag2idx.items()}
n_words = len(word2idx)
n_tags = len(tag2idx)
MAX_LEN = 15

# --- 3. Encode and Pad Sequences ---
X = [[word2idx.get(w, 1) for w in s] for s in sentences]
y = [[tag2idx[t] for t in l] for l in labels]
X_pad = pad_sequences(X, maxlen=MAX_LEN, padding="post")
y_pad = pad_sequences(y, maxlen=MAX_LEN, padding="post",
                      value=tag2idx["O"])
y_cat = to_categorical(y_pad, num_classes=n_tags)

# --- 4. Build BiLSTM Model ---
EMBED_DIM = 64
LSTM_UNITS = 128
model = Sequential([
    Input(shape=(MAX_LEN,)),
    # mask_zero=True makes downstream layers skip the PAD positions
    Embedding(input_dim=n_words, output_dim=EMBED_DIM, mask_zero=True),
    # Bidirectional LSTM: forward + backward = 256 dims
    Bidirectional(LSTM(LSTM_UNITS, return_sequences=True)),
    Dropout(0.3),
    # Second BiLSTM layer
    Bidirectional(LSTM(64, return_sequences=True)),
    Dropout(0.3),
    # TimeDistributed: apply Dense to each time step
    TimeDistributed(Dense(n_tags, activation="softmax"))
])
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)
model.summary()

# --- 5. Train (in production, use thousands of sentences) ---
model.fit(X_pad, y_cat, epochs=50, batch_size=2, verbose=0)

# --- 6. Predict on a New Sentence ---
test_sentence = ["Ratan", "Tata", "founded", "Tata",
                 "Digital", "in", "Pune"]
test_encoded = [word2idx.get(w, 1) for w in test_sentence]
test_padded = pad_sequences([test_encoded], maxlen=MAX_LEN,
                            padding="post")
pred = model.predict(test_padded, verbose=0)
pred_tags = [idx2tag[np.argmax(p)] for p in pred[0][:len(test_sentence)]]

print("\nNER Predictions:")
print("-" * 40)
for word, tag in zip(test_sentence, pred_tags):
    marker = " <--" if tag != "O" else ""  # flag tokens tagged as entities
    print(f" {word:20s} -> {tag}{marker}")
Why BiLSTM matters for Indian NER: Indian languages have rich morphology and code-mixing. The same word "Tata" can be a person (Ratan Tata) or an organization (Tata Group), so the bidirectional context is essential for disambiguation. Companies like Reverie Language Technologies (Bangalore) and Vernacular.ai (now Skit.ai) build NER systems for 12+ Indian languages using BiLSTM architectures.
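A tagger like the one above emits one BIO tag per token, and downstream systems usually want whole entities instead. The sketch below shows one simple way to group tags into entity spans. `bio_to_spans` is a hypothetical helper, and its handling of stray I- tags (treating them as outside any entity) is an assumption, since other conventions exist.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_text, entity_type) pairs."""
    spans, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close any entity that is still open
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Narendra", "Modi", "visited", "Varanasi", "yesterday"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
```

Applied to correctly tagged output for "Ratan Tata founded Tata Digital", the same function would yield one PER span and one ORG span, which is where the two senses of "Tata" become visible.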
Visual Diagrams
6A. LSTM Cell: Complete Architecture
6B. GRU Cell: Simplified Architecture
6C. LSTM vs GRU: Side by Side
6D. Unrolled Bidirectional LSTM
Worked Example โ Tracing Through an LSTM Cell by Hand
Let's trace through one time step of an LSTM cell with concrete numbers. We'll use tiny dimensions (n=2, m=2) to make the math tractable.
Setup
Hidden size n = 2, Input size m = 2. The concatenated vector [aโจtโ1โฉ, xโจtโฉ] has size 4.
Given Values
aโจtโ1โฉ = [0.5, โ0.3]แต cโจtโ1โฉ = [0.8, 1.2]แต
xโจtโฉ = [1.0, 0.5]แต
[aโจtโ1โฉ, xโจtโฉ] = [0.5, โ0.3, 1.0, 0.5]แต
W_f = [[0.2, 0.1, 0.3, โ0.1], [0.1, 0.4, โ0.2, 0.2]]
W_i = [[0.3, โ0.1, 0.2, 0.1], [โ0.2, 0.3, 0.1, 0.4]]
W_c = [[0.1, 0.2, 0.5, โ0.3], [0.4, โ0.1, 0.3, 0.2]]
W_o = [[โ0.1, 0.3, 0.2, 0.1], [0.2, 0.1, โ0.3, 0.5]]
b_f = [1.0, 1.0]แต (initialized to 1 for "remember by default")
b_i = b_c = b_o = [0, 0]แต
Step 1: Forget Gate
W_f · [a, x] = [0.2(0.5) + 0.1(−0.3) + 0.3(1.0) + (−0.1)(0.5), 0.1(0.5) + 0.4(−0.3) + (−0.2)(1.0) + 0.2(0.5)]
= [0.1 − 0.03 + 0.3 − 0.05, 0.05 − 0.12 − 0.2 + 0.1] = [0.32, −0.17]
f⟨t⟩ = σ([0.32, −0.17] + [1.0, 1.0]) = σ([1.32, 0.83]) ≈ [0.789, 0.696]
→ Both values are well above 0.5, so the LSTM keeps most of the old cell state (a direct effect of initializing b_f = 1).
Step 2: Input Gate
W_i · [a, x] = [0.3(0.5) + (−0.1)(−0.3) + 0.2(1.0) + 0.1(0.5), (−0.2)(0.5) + 0.3(−0.3) + 0.1(1.0) + 0.4(0.5)]
= [0.15 + 0.03 + 0.2 + 0.05, −0.1 − 0.09 + 0.1 + 0.2] = [0.43, 0.11]
i⟨t⟩ = σ([0.43, 0.11] + [0, 0]) ≈ [0.606, 0.527]
Step 3: Cell Candidate
W_c · [a, x] = [0.1(0.5) + 0.2(−0.3) + 0.5(1.0) + (−0.3)(0.5), 0.4(0.5) + (−0.1)(−0.3) + 0.3(1.0) + 0.2(0.5)]
= [0.05 − 0.06 + 0.5 − 0.15, 0.2 + 0.03 + 0.3 + 0.1] = [0.34, 0.63]
c̃⟨t⟩ = tanh([0.34, 0.63] + [0, 0]) ≈ [0.328, 0.558]
Step 4: Cell State Update (THE KEY STEP)
c⟨t⟩ = f⟨t⟩ ⊙ c⟨t−1⟩ + i⟨t⟩ ⊙ c̃⟨t⟩
= [0.789, 0.696] ⊙ [0.8, 1.2] + [0.606, 0.527] ⊙ [0.328, 0.558]
= [0.631, 0.835] + [0.199, 0.294]
= [0.830, 1.129]
Analysis: Dimension 1 moved from 0.8 to 0.830. The forget gate kept about 79% of the old value (0.631) and the input path added 0.199, a slight net increase. Dimension 2 moved from 1.2 to 1.129. About 30% of the old value was forgotten (leaving 0.835) and 0.294 of new information was added, a slight net decrease.
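
The claim that the additive update gives a clean gradient path can be checked numerically. The sketch below is an illustration rather than part of the derivation: it holds the gate values from Steps 1-3 fixed, nudges one component of the previous cell state by a small step (`eps`, an arbitrary choice), and compares the resulting slope with the forget gate. The helper names `cell_update` and `numerical_grad` are hypothetical.

```python
# Gate values copied from Steps 1-3 of the hand trace (rounded to 3 decimals).
f = [0.789, 0.696]        # forget gate
i = [0.606, 0.527]        # input gate
c_tilde = [0.328, 0.558]  # cell candidate
c_prev = [0.8, 1.2]       # previous cell state


def cell_update(c_prev, f, i, c_tilde):
    """Element-wise c_t = f * c_prev + i * c_tilde."""
    return [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]


def numerical_grad(c_prev, k, eps=1e-6):
    """Central finite difference of c_t[k] with respect to c_prev[k]."""
    up, down = list(c_prev), list(c_prev)
    up[k] += eps
    down[k] -= eps
    c_up = cell_update(up, f, i, c_tilde)
    c_down = cell_update(down, f, i, c_tilde)
    return (c_up[k] - c_down[k]) / (2 * eps)


grads = [numerical_grad(c_prev, k) for k in range(len(c_prev))]
for k, g in enumerate(grads):
    print(f"d c_t[{k}] / d c_prev[{k}] = {g:.3f}   (forget gate = {f[k]:.3f})")
```

Because the gates in this formulation are computed from the previous hidden state and the current input, not from the previous cell state, holding them fixed is exact for this direct path: the slope equals the forget gate, with no weight matrix or tanh derivative in between.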
Step 5: Output Gate
W_o · [a, x] = [−0.1(0.5) + 0.3(−0.3) + 0.2(1.0) + 0.1(0.5), 0.2(0.5) + 0.1(−0.3) + (−0.3)(1.0) + 0.5(0.5)]
= [−0.05 − 0.09 + 0.2 + 0.05, 0.1 − 0.03 − 0.3 + 0.25] = [0.11, 0.02]
o⟨t⟩ = σ([0.11, 0.02] + [0, 0]) ≈ [0.527, 0.505]
Step 6: Hidden State
a⟨t⟩ = o⟨t⟩ ⊙ tanh(c⟨t⟩)
= [0.527, 0.505] ⊙ tanh([0.830, 1.129])
= [0.527, 0.505] ⊙ [0.681, 0.811]
= [0.359, 0.410]
What Did the LSTM Cell Do?
- Forget gate ≈ 0.74 average: retained ~74% of old cell memory (biased toward remembering)
- Input gate ≈ 0.57 average: moderately accepted new information
- Cell state: each component changed by only about 4-6% (0.8 → 0.830 and 1.2 → 1.129), so the memory stayed stable
- Output gate ≈ 0.52 average: revealed about half of the cell state information
- Hidden state: moved from [0.5, −0.3] to [0.359, 0.410]. Unlike the cell state, it is recomputed from o⟨t⟩ ⊙ tanh(c⟨t⟩) at every step, so it can shift sharply (here the second component changed sign)
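
The whole hand trace can be reproduced in a few lines. The following is a minimal sketch in plain Python (no NumPy, so it runs anywhere). The helper names `affine`, `hadamard`, and `lstm_step` are chosen here for illustration and do not come from the chapter's earlier code. Because the hand trace rounds intermediate values, the printed numbers agree with it only to within rounding in the third decimal place.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def tanh(v):
    return [math.tanh(x) for x in v]


def affine(W, v, b):
    """Return W @ v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bk for row, bk in zip(W, b)]


def hadamard(u, v):
    """Element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(u, v)]


def lstm_step(a_prev, c_prev, x, p):
    """One LSTM time step using the chapter's gate equations."""
    concat = a_prev + x  # list concatenation builds [a_prev, x]
    f = sigmoid(affine(p["W_f"], concat, p["b_f"]))     # forget gate
    i = sigmoid(affine(p["W_i"], concat, p["b_i"]))     # input gate
    c_tilde = tanh(affine(p["W_c"], concat, p["b_c"]))  # cell candidate
    o = sigmoid(affine(p["W_o"], concat, p["b_o"]))     # output gate
    c = [u + v for u, v in zip(hadamard(f, c_prev), hadamard(i, c_tilde))]
    a = hadamard(o, tanh(c))                            # new hidden state
    return {"f": f, "i": i, "c_tilde": c_tilde, "o": o, "c": c, "a": a}


# Weights and biases exactly as listed under "Given Values".
params = {
    "W_f": [[0.2, 0.1, 0.3, -0.1], [0.1, 0.4, -0.2, 0.2]],
    "W_i": [[0.3, -0.1, 0.2, 0.1], [-0.2, 0.3, 0.1, 0.4]],
    "W_c": [[0.1, 0.2, 0.5, -0.3], [0.4, -0.1, 0.3, 0.2]],
    "W_o": [[-0.1, 0.3, 0.2, 0.1], [0.2, 0.1, -0.3, 0.5]],
    "b_f": [1.0, 1.0],
    "b_i": [0.0, 0.0],
    "b_c": [0.0, 0.0],
    "b_o": [0.0, 0.0],
}

step = lstm_step(a_prev=[0.5, -0.3], c_prev=[0.8, 1.2], x=[1.0, 0.5], p=params)
for name in ("f", "i", "c_tilde", "c", "o", "a"):
    print(f"{name:8s}", [round(v, 3) for v in step[name]])
```

Setting `b_f` to `[0.0, 0.0]` and rerunning is a quick way to see how much more of the old cell state is discarded without the positive forget bias.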
Case Study: HDFC Bank, LSTM-Powered Fraud Detection
🏦 HDFC Bank: Detecting Fraud in Transaction Sequences with LSTMs
The Problem
HDFC Bank, India's largest private bank (₹18+ lakh crore in assets, 80+ million customers), processes over 3 crore transactions daily through debit cards, credit cards, UPI, and net banking. Their legacy fraud detection system used rule-based thresholds:
- Flag if transaction > ₹50,000
- Flag if transaction from a new merchant category
- Flag if transaction from a new geography
This system had a 65% false positive rate: for every 100 flagged transactions, 65 were legitimate. Each false positive required manual review (₹150-200 per investigation), and worse, it froze customers' accounts, leading to 12,000+ customer complaints per month.
The Insight: Fraud is a Sequence Problem
The key insight was that fraud is not about individual transactions. It is about transaction patterns over time. A legitimate customer might:
- Morning: ₹250 chai + breakfast at regular shop
- Afternoon: ₹1,200 lunch at office canteen
- Evening: ₹45,000 Amazon purchase (birthday gift)
A fraudster using a stolen card might:
- 12:03 AM: ₹99 at online store (testing if card works)
- 12:05 AM: ₹15,000 electronics purchase
- 12:07 AM: ₹25,000 electronics purchase
- 12:08 AM: ₹50,000 jewelry store
The pattern (rapid escalation, unusual timing, category hopping) is far more informative than any single transaction.
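
To make "pattern over time" concrete, the stolen-card sequence above can be turned into per-transaction features that an LSTM could consume one step at a time. This is a hedged sketch, not the bank's pipeline: the function name `sequence_features`, the feature names, and the placeholder date are all assumptions made for illustration.

```python
import math
from datetime import datetime

# The stolen-card sequence from the text; the calendar date is an arbitrary placeholder.
transactions = [
    ("2024-01-01 00:03", 99),
    ("2024-01-01 00:05", 15_000),
    ("2024-01-01 00:07", 25_000),
    ("2024-01-01 00:08", 50_000),
]


def sequence_features(txns):
    """Return one feature dict per transaction, in time order."""
    features = []
    prev_time = None
    prev_amount = None
    for stamp, amount in txns:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        # Minutes elapsed since the previous transaction (0 for the first one)
        minutes_since_prev = (
            0.0 if prev_time is None else (when - prev_time).total_seconds() / 60
        )
        # Ratio of this amount to the previous one captures rapid escalation
        growth = 1.0 if prev_amount is None else amount / prev_amount
        features.append({
            "log_amount": math.log1p(amount),  # log scaling tames large amounts
            "minutes_since_prev": minutes_since_prev,
            "amount_growth": growth,
            "hour": when.hour,
        })
        prev_time, prev_amount = when, amount
    return features


feats = sequence_features(transactions)
for row in feats:
    print({key: round(value, 2) for key, value in row.items()})
```

No single row is alarming on its own. The signal is in the combination across rows: gaps of one or two minutes, amounts growing at every step, and all of it happening just after midnight.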
The LSTM Architecture
HDFC Bank's data science team (in collaboration with a Bangalore-based AI startup) built a 2-layer stacked LSTM:
| Component | Details |
|---|---|
| Input Features (per txn) | Transaction amount (log-scaled), merchant category code, time delta from previous txn, geographical distance from previous txn, day-of-week, hour-of-day; 12 features total |
| Sequence Length | Last 30 transactions per customer |
| Architecture | LSTM(128) → Dropout(0.3) → LSTM(64) → Dense(32, ReLU) → Dense(1, Sigmoid) |
| Training Data | 2.4 crore transaction sequences (18 months), ~0.1% fraud rate (class-imbalanced) |
| Class Balancing | SMOTE oversampling + focal loss (α=0.25, γ=2) |
| Training Infrastructure | 4× NVIDIA A100 GPUs on AWS Mumbai (ap-south-1), ~8 hours training |
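
The focal loss entry in the table is easier to appreciate with numbers. The sketch below uses the standard binary focal loss form, FL = −α_t (1 − p_t)^γ log(p_t), with the α and γ values from the table. It is a per-example illustration in plain Python, not the team's training code, and the function names are chosen here.

```python
import math


def cross_entropy(y_true, p_pred, eps=1e-7):
    """Binary cross-entropy for one example."""
    p = min(max(p_pred, eps), 1.0 - eps)
    p_t = p if y_true == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p = min(max(p_pred, eps), 1.0 - eps)
    p_t = p if y_true == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y_true == 1 else 1.0 - alpha  # class weighting term
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


examples = [
    ("easy legitimate txn", 0, 0.05),  # confidently and correctly scored as non-fraud
    ("missed fraud", 1, 0.10),         # fraud that the model scored as unlikely
]
for label, y, p in examples:
    print(f"{label:22s} CE = {cross_entropy(y, p):.4f}   focal = {focal_loss(y, p):.6f}")
```

The (1 − p_t)^γ factor shrinks the loss of well-classified examples far more than that of hard ones, so the many easy legitimate transactions no longer dominate the gradient when fraud is only about 0.1% of the data.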
Results (After 6-Month Production Deployment)
| Metric | Rule-Based System | LSTM System | Improvement |
|---|---|---|---|
| False Positive Rate | 65% | 25% | ↓ 40 percentage points |
| Fraud Detection Rate (Recall) | 72% | 91% | ↑ 19 percentage points |
| Customer Complaints (monthly) | 12,000 | 4,200 | ↓ 65% |
| Manual Review Cost (monthly) | ₹2.1 crore | ₹72 lakh | ↓ ₹1.38 crore/month |
| Fraud Losses Prevented (annual) | ₹340 crore | ₹580 crore | ↑ ₹240 crore |
| Avg Inference Latency | 2ms (rule lookup) | 15ms (LSTM) | Still within real-time SLA |
Why LSTM, Not GRU or Transformer?
- vs GRU: LSTM slightly outperformed GRU (91% vs 88% recall) on this task because transaction sequences needed long-range memory. Tracking a customer's spending pattern across the full 30-transaction window benefited from the separate cell state
- vs Transformer: At the time of deployment, Transformers required more compute for inference (critical for real-time fraud detection where the latency SLA is 50 ms). The team is now piloting a Transformer-based model for the next version
- vs CNN: 1D CNNs were tested but missed temporal ordering patterns (a ₹50K purchase after 5 small purchases is suspicious; 5 small purchases after ₹50K is normal payback)
Lessons Learned
- Feature engineering still matters: Log-scaling transaction amounts and computing time-deltas between transactions improved accuracy by 8%
- Forget gate bias = 1.0 was critical: Without it, the model "forgot" early transactions in the 30-step window
- Inference latency is non-negotiable: The model runs on TensorFlow Serving with ONNX optimization, averaging 15 ms per prediction
- Explainability: RBI compliance requires explaining why a transaction was flagged. The team visualizes gate activations to show which past transactions contributed to the fraud score
RBI Mandate: The Reserve Bank of India's 2022 circular on "Digital Payment Security" mandates that banks implement AI/ML-based fraud detection systems for all digital transactions above ₹2,000. This has accelerated LSTM adoption across Indian banking: ICICI, SBI, and Axis Bank have all deployed similar architectures.
Common Mistakes & Misconceptions
Mistake 1: "LSTMs completely solve the vanishing gradient problem."
LSTMs mitigate but don't eliminate vanishing gradients. For extremely long sequences (1000+ steps), even LSTMs struggle. The cell state can accumulate noise over many steps. For truly long-range dependencies, attention mechanisms (Chapter 16) or Transformers are needed.
Mistake 2: "More LSTM layers = better performance."
Stacking beyond 3 layers rarely helps and often hurts. Unlike CNNs (which benefit from 50+ layers with ResNets), RNNs already have "depth" through time. Adding layers adds depth per time step, which gives diminishing returns. Google's GNMT used 8 layers, but it needed residual connections and large-scale multi-GPU training to make that depth work.
Mistake 3: "GRU is always worse than LSTM because it has fewer parameters."
On many benchmarks (text classification, sentiment analysis, short-sequence tasks), GRU performs on par with or even better than LSTM. Fewer parameters means less overfitting on small datasets. The empirical evidence (Chung et al., 2014) shows no consistent winner; it depends on the task.
Mistake 4: "Bidirectional LSTMs can be used for all sequence tasks."
BiLSTMs require the complete input sequence. They CANNOT be used for:
- Real-time speech recognition (processing while user is speaking)
- Language model next-word prediction
- Online/streaming stock prediction
- Chatbot response generation
They CAN be used when you have the full input: NER, sentiment analysis, machine translation (encoder side), question answering.
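
Why the backward direction rules out streaming use can be seen with a toy model. The sketch below is an assumption-laden illustration: it uses a scalar tanh RNN (not an LSTM) with made-up weights, runs it left-to-right and right-to-left, and checks which states change when only the last input changes.

```python
import math


def run_rnn(xs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t). Returns every h_t."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states


def bidirectional_states(xs):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(xs)
    backward = run_rnn(xs[::-1])[::-1]  # process right-to-left, then realign
    return list(zip(forward, backward))


seq = [0.2, -0.4, 0.9, 0.1]
changed = seq[:-1] + [-1.5]  # identical prefix, different final input

before = bidirectional_states(seq)
after = bidirectional_states(changed)

fwd_same = before[0][0] == after[0][0]
bwd_shift = abs(before[0][1] - after[0][1])
print(f"forward state at t=0 unchanged: {fwd_same}")
print(f"backward state at t=0 shifted by: {bwd_shift:.3f}")
```

The forward state at the first position depends only on inputs seen so far, while the backward state at that same position already depends on the final input. A bidirectional output for step t therefore cannot be produced until the whole sequence has arrived.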
Mistake 5: "Initializing all biases to 0 is fine for LSTMs."
The forget gate bias should be initialized to 1.0 or 2.0, not 0. With b_f = 0, the sigmoid output starts at 0.5, which means the LSTM forgets 50% of its memory at every step from the start. With b_f = 1, σ(1) ≈ 0.73, so the LSTM starts by remembering most information and learns what to forget over training.
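
The effect compounds quickly over a sequence. The sketch below makes a simplifying assumption that is not true of a trained network: the forget gate sits at σ(b_f) on every step, as it roughly would at initialization with small weights. Under that assumption it computes how much of a cell-state value survives 30 steps (the window length used in the case study). The function name `retained_fraction` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def retained_fraction(bias, steps):
    """Fraction of a cell-state value left after `steps` updates
    if the forget gate stays at sigmoid(bias) the whole time."""
    return sigmoid(bias) ** steps


STEPS = 30
for bias in (0.0, 1.0, 2.0):
    gate = sigmoid(bias)
    kept = retained_fraction(bias, STEPS)
    print(f"b_f = {bias:.1f}  gate = {gate:.3f}  kept after {STEPS} steps = {kept:.2e}")
```

With a zero bias almost nothing from the first step reaches the last one, which also means almost no gradient flows back to it. A bias of 1 or 2 raises the surviving fraction by several orders of magnitude.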
Mistake 6: "Standard dropout should be applied to recurrent connections."
Standard dropout between time steps (on the recurrent connection a⟨t⟩ → a⟨t+1⟩), with a fresh random mask at every step, disrupts the temporal gradient flow. Instead, use:
- Dropout between LSTM layers (on the vertical connection)
- Variational dropout / recurrent dropout (same mask across time steps, as in Gal & Ghahramani 2016)
In Keras: LSTM(128, dropout=0.2, recurrent_dropout=0.2) implements this correctly.
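
The difference between the two schemes comes down to when the mask is sampled. The sketch below illustrates only that idea in plain Python. It is not how any framework implements dropout internally, and the function names are chosen here for illustration.

```python
import random


def sample_mask(n_units, rate, rng):
    """Inverted-dropout mask: 0 for dropped units, 1 / (1 - rate) for kept units."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else keep_scale for _ in range(n_units)]


def standard_masks(n_units, n_steps, rate, rng):
    """A fresh mask at every time step."""
    return [sample_mask(n_units, rate, rng) for _ in range(n_steps)]


def variational_masks(n_units, n_steps, rate, rng):
    """One mask sampled per sequence and reused at every time step."""
    mask = sample_mask(n_units, rate, rng)
    return [list(mask) for _ in range(n_steps)]


rng = random.Random(0)
std = standard_masks(n_units=16, n_steps=6, rate=0.5, rng=rng)
var = variational_masks(n_units=16, n_steps=6, rate=0.5, rng=rng)

print("standard masks identical across steps:   ", all(m == std[0] for m in std))
print("variational masks identical across steps:", all(m == var[0] for m in var))
```

With a fresh mask per step, a given hidden unit is very likely to be zeroed somewhere along a long sequence, which cuts the path that information and gradients travel through time. Reusing one mask drops the same units throughout, so the surviving units keep an uninterrupted path.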
Comparison Table: RNN Architectures
| Feature | Vanilla RNN | LSTM | GRU | Bidirectional LSTM |
|---|---|---|---|---|
| Year | 1986 | 1997 | 2014 | 1997 (concept) |
| States | 1 (hidden) | 2 (cell + hidden) | 1 (hidden) | 2 per direction |
| Gates | None | 3 (forget, input, output) | 2 (update, reset) | 3 per direction |
| Params per layer | n(n+m) + n | 4[n(n+m) + n] | 3[n(n+m) + n] | 8[n(n+m) + n] |
| Long-range deps | ~10-20 steps | ~200-500 steps | ~100-300 steps | ~200-500 steps |
| Training speed | Fastest | Slow | Medium | Slowest (2× LSTM) |
| Gradient flow | Poor (vanishing) | Good (cell highway) | Good (z-gate) | Good (both directions) |
| Real-time capable? | Yes | Yes | Yes | No (needs full sequence) |
| Best for | Very short sequences | Long sequences, production | Medium sequences, mobile | NER, classification, MT |
| Indian use case | Basic time-series | HDFC fraud detection | Jio voice assistant | Flipkart review NER |
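
The "Params per layer" row can be evaluated directly. The sketch below encodes the table's formulas, with n as the hidden size and m as the input size, and applies them to n = 128 and m = 12 (the first layer of the case-study architecture). The function name `layer_params` is hypothetical, and framework implementations may differ slightly (some GRU variants, for example, keep a second bias vector).

```python
# Number of weight-plus-bias blocks per architecture, taken from the table row.
BLOCKS = {"rnn": 1, "gru": 3, "lstm": 4, "bilstm": 8}


def layer_params(kind, n, m):
    """Parameters in one layer: blocks * (n * (n + m) weights + n biases)."""
    return BLOCKS[kind] * (n * (n + m) + n)


n, m = 128, 12
counts = {kind: layer_params(kind, n, m) for kind in BLOCKS}
for kind, count in counts.items():
    print(f"{kind:7s} {count:>8,d}")

gru_saving = 1 - counts["gru"] / counts["lstm"]
print(f"GRU uses {gru_saving:.0%} fewer parameters than LSTM at the same n and m")
```

Because every block has the same shape, the GRU-to-LSTM ratio is always 3:4 regardless of n and m, and a bidirectional LSTM is exactly twice a unidirectional one.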
When to Choose What: Decision Flowchart
Exercises
Section A: Multiple Choice Questions (10)
What is the primary purpose of the forget gate (f⟨t⟩) in an LSTM?
- To forget the input at the current time step
- To decide what information to erase from the cell state
- To forget the output of the previous time step
- To reset the hidden state to zero
The cell state update in an LSTM is: c⟨t⟩ = f⟨t⟩ ⊙ c⟨t−1⟩ + i⟨t⟩ ⊙ c̃⟨t⟩. Why does this additive form help with vanishing gradients?
- Because addition is faster than multiplication on GPUs
- Because the gradient of c⟨t⟩ w.r.t. c⟨t−1⟩ is simply f⟨t⟩, avoiding repeated matrix multiplications
- Because the cell state values are always positive
- Because sigmoid outputs are always between 0 and 1
How many parameter matrices (weights + biases) does a single GRU cell have?
- 2 weight matrices + 2 bias vectors
- 3 weight matrices + 3 bias vectors
- 4 weight matrices + 4 bias vectors
- 6 weight matrices + 6 bias vectors
In the GRU update equation h⟨t⟩ = z⟨t⟩ ⊙ h⟨t−1⟩ + (1 − z⟨t⟩) ⊙ h̃⟨t⟩, what happens when z⟨t⟩ = 1 for all dimensions?
- The hidden state is completely replaced by the candidate
- The hidden state becomes zero
- The hidden state is copied from the previous time step unchanged
- The GRU behaves like a vanilla RNN
Which of the following tasks CANNOT use a Bidirectional LSTM?
- Named Entity Recognition on completed documents
- Sentiment analysis of movie reviews
- Real-time next-word prediction in a keyboard app
- Part-of-speech tagging of sentences
For an LSTM with hidden size n=256 and input size m=100, approximately how many parameters does a single layer have?
- ~91,000
- ~182,000
- ~274,000
- ~366,000
Why should the forget gate bias in an LSTM be initialized to a positive value (e.g., 1.0)?
- To make the sigmoid output start near 0, encouraging forgetting
- To make the sigmoid output start near 1, encouraging remembering at the beginning of training
- To prevent the cell state from becoming negative
- To match the output gate initialization
In a stacked 3-layer LSTM, what is the input to the second LSTM layer at time step t?
- The original input x⟨t⟩
- The hidden state of the first layer at time step t−1
- The hidden state of the first layer at time step t
- The cell state of the first layer at time step t
The GRU's reset gate r⟨t⟩ is applied to h⟨t−1⟩ before computing the candidate h̃⟨t⟩. What is the effect when r⟨t⟩ ≈ 0?
- The candidate depends only on the current input x⟨t⟩, ignoring history
- The candidate is identical to the previous hidden state
- The update gate is forced to 0
- The GRU output becomes zero
In the HDFC Bank fraud detection case study, why did LSTM outperform 1D-CNN on the transaction sequence task?
- LSTMs have more parameters and are always more powerful
- CNNs cannot process sequential data at all
- LSTMs preserve temporal ordering: the order of transactions matters for fraud patterns, which CNNs with fixed receptive fields may miss
- LSTMs are faster at inference than CNNs
Section B: Short Answer Questions (5)
B1. Explain the "conveyor belt" analogy for the LSTM cell state
Think of a factory conveyor belt carrying items. The forget gate removes items, the input gate adds items, and the belt moves forward through time. How does this relate to gradient flow?
Expected Length: 4-5 sentences
B2. Why does the GRU use (1 − z) for the candidate weight instead of a separate input gate?
Consider the constraint: if you're keeping 70% of the old state, you can only add 30% of new information. This is a design choice that reduces parameters.
Expected Length: 3-4 sentences
B3. Why is return_sequences=True necessary in stacked LSTMs but not in the final LSTM layer (for classification)?
Consider what the next LSTM layer needs as input at each time step. For classification, what does the Dense layer need?
Expected Length: 3-4 sentences
B4. In the HDFC Bank case study, why was the false positive reduction more valuable than the fraud detection improvement?
Consider the volume: if 3 crore transactions/day are processed and 99.9% are legitimate, reducing false positives on 2.997 crore transactions has a bigger impact than catching more fraud in 30,000 fraudulent ones.
Expected Length: 4-5 sentences
B5. What is "recurrent dropout" and why is it different from standard dropout in LSTMs?
Standard dropout applies a different random mask at each time step. Recurrent dropout applies the same mask across all time steps. Why does this distinction matter for gradient flow?
Expected Length: 4-5 sentences
Section C: Long Answer Questions (3)
C1. Draw the LSTM Cell and Label All Gates (15 marks)
Draw a complete LSTM cell diagram showing:
- The cell state "conveyor belt" (c⟨t−1⟩ → c⟨t⟩), showing the flow from left to right
- The forget gate (f) with its σ activation, connected to the cell state via element-wise multiplication
- The input gate (i) with its σ activation
- The cell candidate (c̃) with its tanh activation, showing how i and c̃ combine via element-wise multiplication
- The additive junction where f ⊙ c⟨t−1⟩ and i ⊙ c̃ combine
- The output gate (o) with its σ activation
- The hidden state output a⟨t⟩ = o ⊙ tanh(c⟨t⟩)
- All inputs: a⟨t−1⟩, x⟨t⟩, and the concatenation [a⟨t−1⟩, x⟨t⟩]
Write the complete equation next to each gate. Explain why the additive cell state update solves vanishing gradients (5 marks).
C2. Compare LSTM and GRU Architectures (15 marks)
Write a detailed comparison covering:
- Architecture: Draw both cells side by side. Map GRU gates to LSTM gates and explain the correspondence (5 marks)
- Mathematics: Write all equations for both architectures. Show how the GRU's update equation h⟨t⟩ = z ⊙ h⟨t−1⟩ + (1 − z) ⊙ h̃ is analogous to but simpler than the LSTM's cell state update (4 marks)
- Parameter analysis: For n=512, m=300, compute the exact parameter count for both architectures. What's the percentage reduction? (3 marks)
- Practical guidance: Provide 3 scenarios where LSTM is preferred and 3 where GRU is preferred, with justification (3 marks)
C3. Design a Fraud Detection System Using LSTMs (15 marks)
You are the lead ML engineer at a UPI payment company (like PhonePe or Google Pay) processing 800 crore UPI transactions per month. Design a complete LSTM-based fraud detection system:
- Data representation: What features would you extract from each transaction? How would you form sequences? Justify your sequence length choice (4 marks)
- Architecture: Propose a specific model architecture (number of layers, hidden sizes, bidirectional or not, output structure). Explain each design choice (4 marks)
- Training strategy: Address class imbalance (fraud is ~0.01% of transactions), choice of loss function, and validation strategy (3 marks)
- Deployment: Address inference latency requirements (UPI mandate: 30-second transaction timeout), model serving architecture, and how to handle cold-start (new users with no history) (4 marks)
Section D: Programming Assignments (2)
D1. Nifty 50 Price Prediction: LSTM vs GRU Comparison
Build two models, one using LSTM layers and one using GRU layers, to predict the next-day closing price of the Nifty 50 index. Compare them on:
- Test MAE and MAPE
- Training time per epoch
- Total trainable parameters
- Loss convergence curves (plot both on the same graph)
- Use yfinance to download Nifty 50 data (^NSEI) from 2015-2024
- Lookback window: 60 days
- Both models: 2 stacked layers (128 → 64 units) with Dropout(0.2)
- Train for 100 epochs with EarlyStopping
- Normalize prices using MinMaxScaler
- Use 80-20 chronological split (NO shuffling!)
A Jupyter notebook with both models, training curves overlay, test predictions overlay on actual prices, and a 200-word analysis of which model performed better and why.
D2. Bidirectional LSTM for Hindi-English NER
Build a BiLSTM-based Named Entity Recognition system for code-mixed Hindi-English text common in Indian social media. Use the provided dataset format or generate your own.
Example Data
Sentence: "Modi ji ne Varanasi mein rally ki"
Tags: B-PER O O B-LOC O O O
Sentence: "Flipkart ka Big Billion Days sale start"
Tags: B-ORG O O O O O O
Requirements
- Create at least 50 training sentences with Indian entities
- Model: Embedding(64) → BiLSTM(128) → Dropout(0.3) → BiLSTM(64) → TimeDistributed(Dense)
- Entity types: PER, ORG, LOC, EVENT, PRODUCT (with B- and I- prefixes)
- Evaluate using entity-level F1 score (use the seqeval library)
- Show 10 example predictions on unseen sentences
Section E: Mini-Project
Project: Indian Stock Market Multi-Feature LSTM Predictor
Build a comprehensive stock prediction system for Indian markets using multiple features and LSTM variants. This goes beyond simple price prediction to incorporate volume, technical indicators, and market sentiment.
Phase 1: Data Collection (Week 1)
- Download daily OHLCV data for 5 stocks (Reliance, TCS, HDFC Bank, Infosys, and ICICI Bank) from NSE using yfinance
- Compute technical indicators: 20-day SMA, 50-day EMA, RSI (14-day), MACD, Bollinger Bands
- Add market-wide features: Nifty 50 daily return, India VIX
- Build 3 model variants: (a) Simple LSTM, (b) Stacked LSTM, (c) Bidirectional LSTM (train on completed windows)
- Input: 30-day window of 10+ features
- Output: Next-day direction (up/down), a classification task
- Handle class balance and use proper time-series cross-validation
- Compare all 3 architectures on accuracy, precision, recall, and F1
- Analyze: Which features contribute most? (Use ablation study)
- Visualize LSTM gate activations for interesting sequences (e.g., market crash days, budget day)
- Write a 500-word report with investment-context analysis
| Component | Marks |
|---|---|
| Data pipeline + feature engineering | 20 |
| Model implementation (3 variants) | 30 |
| Evaluation and comparison | 20 |
| Visualization and interpretability | 15 |
| Report and code quality | 15 |
| Total | 100 |
Chapter Summary
Key Takeaways from Chapter 14
- The Vanishing Gradient Problem: Vanilla RNNs fail on sequences longer than ~10-20 steps because gradients are products of many recurrent Jacobians. If the dominant eigenvalues are below 1, gradients vanish; if they are above 1, gradients explode.
- LSTM Architecture: Introduces a separate cell state (c⟨t⟩) that flows through time with additive updates. Three gates control information flow:
- Forget gate (f): what to erase from memory, f⟨t⟩ = σ(W_f · [a⟨t−1⟩, x⟨t⟩] + b_f)
- Input gate (i): what new info to write, i⟨t⟩ = σ(W_i · [a⟨t−1⟩, x⟨t⟩] + b_i)
- Output gate (o): what to reveal, o⟨t⟩ = σ(W_o · [a⟨t−1⟩, x⟨t⟩] + b_o)
- GRU Architecture: Simplifies LSTM by merging cell and hidden states, and using 2 gates instead of 3:
- Update gate (z): Controls forgetting AND updating simultaneously
- Reset gate (r): Controls how much history to use for the candidate
- LSTM vs GRU: LSTM is better for very long sequences (>500 steps) and interpretability; GRU is better for small datasets, mobile deployment, and faster experimentation. There is no universal winner, so try both.
- Bidirectional RNNs: Run two RNNs (forward + backward) on the same sequence. Essential for NER, sentiment analysis, and tasks where future context helps. Cannot be used for real-time/streaming predictions.
- Stacked/Deep RNNs: 2-3 layers is the sweet spot. Add standard dropout between layers, not on the recurrent connections. Use residual connections for 4+ layers.
- Practical Tips:
- Initialize forget gate bias to 1.0 (remember by default)
- Use recurrent dropout (same mask across time), not standard dropout
- Never shuffle time-series data for train/test split
- LSTM parameters = 4n(n+m) + 4n; GRU = 3n(n+m) + 3n
- Industry Impact: HDFC Bank's LSTM-based fraud detection cut the false positive rate by 40 percentage points (from 65% to 25%) and prevented an additional ₹240 crore in annual fraud losses. LSTMs power translation, NER, speech, and financial systems across India.
References
Foundational Papers
- Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735–1780. The original LSTM paper introducing the cell state and gating mechanism.
- Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). "Learning to Forget: Continual Prediction with LSTM." Neural Computation, 12(10), 2451–2471. Added the forget gate (not in the original 1997 paper!).
- Cho, K. et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." EMNLP 2014. The paper introducing GRU.
- Chung, J. et al. (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." NIPS 2014 Workshop. Comprehensive LSTM vs GRU comparison.
- Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). "An Empirical Exploration of Recurrent Network Architectures." ICML 2015. Showed forget gate bias initialization to 1.0 is critical.
Architecture Variants
- Schuster, M. & Paliwal, K. K. (1997). "Bidirectional Recurrent Neural Networks." IEEE Transactions on Signal Processing, 45(11). The original bidirectional RNN paper.
- Graves, A. (2013). "Generating Sequences With Recurrent Neural Networks." arXiv:1308.0850. Handwriting generation with stacked LSTMs.
- Wu, Y. et al. (2016). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." arXiv:1609.08144. GNMT with 8-layer stacked LSTMs and residual connections.
- Gal, Y. & Ghahramani, Z. (2016). "A Theoretically Grounded Application of Dropout to Recurrent Neural Networks." NIPS 2016. Variational/recurrent dropout for LSTMs.
Textbooks
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 10: Sequence Modeling. Detailed treatment of LSTM and GRU architectures.
- Chollet, F. (2021). Deep Learning with Python, 2nd Edition. Manning. Chapter 10. Practical LSTM/GRU implementations in Keras.
- Jurafsky, D. & Martin, J. H. (2023). Speech and Language Processing, 3rd Edition (Draft). Chapter 9. LSTMs in NLP context.
Indian Industry Context
- HDFC Bank Annual Report (2023-24). Sections on digital banking technology, AI/ML fraud detection deployment statistics.
- RBI Circular on Digital Payment Security Controls (2022). Mandate for AI/ML-based fraud detection in Indian banking, driving LSTM adoption.
- NASSCOM AI in BFSI Report (2023). Survey of AI/ML adoption in Indian banking and financial services, including sequence modeling for fraud detection and credit scoring.