Neural Networks & Deep Learning

Chapter 15: Transformers and Attention Mechanisms

Attention Is All You Need — How One Architecture Replaced RNNs, CNNs, and Changed the World

⏱️ Reading Time: ~5 hours | 📖 Unit 5: Specialized Architectures | 🧠 Theory + Code Chapter

📋 Prerequisites: Chapter 14 (LSTMs & GRUs), Chapter 9 (Regularization), Linear Algebra (matrix multiplication, eigenvalues)

Bloom's Taxonomy Map for This Chapter

Bloom's Level	What You'll Achieve
🔵 Remember	Recall the scaled dot-product attention formula, multi-head attention structure, sinusoidal positional encoding equations, and Transformer encoder-decoder block components
🔵 Understand	Explain why √d_k scaling prevents softmax saturation, how self-attention captures long-range dependencies without recurrence, and why positional encoding is necessary
🟢 Apply	Implement single-head attention, multi-head attention, and a complete TransformerBlock from scratch in NumPy; build a character-level mini-GPT in PyTorch
🟡 Analyze	Trace the information flow through a 6-layer Transformer encoder, analyze attention weight matrices to interpret what the model "looks at," compare O(n²·d) self-attention with O(n·d²) recurrence (which also needs O(n) sequential steps)
🟠 Evaluate	Critically compare BERT (encoder-only, MLM) vs GPT (decoder-only, CLM), evaluate when to choose efficient attention variants (Linformer, FlashAttention), justify architecture decisions for multilingual tasks
🔴 Create	Design and train a mini-Transformer for character-level text generation; propose an architecture for an Indian multilingual translation system

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

Derive the attention mechanism from first principles — starting from the intuition of "soft dictionary lookup" and arriving at Attention(Q,K,V) = softmax(QK^T/√d_k)V
Prove why the √d_k scaling factor is necessary by computing the variance of dot products in high-dimensional spaces
Implement single-head attention, multi-head attention, and a complete Transformer block from scratch in NumPy
Explain sinusoidal positional encoding — derive the formulas, understand why sine and cosine encode relative positions, and compare with learned positional embeddings
Diagram the complete Transformer architecture (encoder + decoder, 6 layers each) including masked multi-head attention, cross-attention, feed-forward networks, and residual connections
Compare encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) architectures — their pre-training objectives, strengths, and applications
Analyze self-attention complexity O(n²d) and explain efficient attention variants: Linformer, Performer, FlashAttention
Build a character-level mini-GPT in PyTorch that generates text
Discuss real-world deployments: AI4Bharat IndicBERT for 11 Indian languages and the OpenAI GPT-1→4 evolution
Solve GATE-style numerical problems on attention computation and Transformer parameter counting

Section 2

Opening Hook

🔮 The Paper That Changed Everything

In June 2017, a team at Google published "Attention Is All You Need." The paper's modest title hid a revolution: they showed you could throw away RNNs entirely and build sequence models purely from attention. This single architecture now powers GPT-4, BERT, Whisper, DALL-E, and virtually every state-of-the-art AI system.

But here's what made it truly special. For decades, the dominant paradigm for sequence modeling was recurrence: processing tokens one by one, like reading a book word by word, never allowed to skip ahead. Transformers broke this tyranny. Instead of sequential processing, they let every word in a sentence look at every other word simultaneously. A 500-word paragraph? The Transformer sees all 500 words at once and computes how each word relates to every other word with a handful of matrix multiplications per layer.

The results were immediate and dramatic. Translation quality jumped. Training speed exploded, because attention is parallelizable on GPUs, unlike recurrence. Within a few years, Transformer-based models were being adopted in production translation systems, including Google's. By 2020, GPT-3 showed that a Transformer with 175 billion parameters could write essays, code, and poetry that readers often struggled to distinguish from human writing.

In India, the Transformer revolution hit a unique problem: 22 official languages, 13 different scripts, rampant code-mixing. Teams at AI4Bharat and IIT Madras built IndicBERT — a BERT model trained on 11 Indian languages — proving that the Transformer's architecture transcends any single language.

Today, you will understand exactly how this architecture works — from the ground up.


Section 3

The Intuition First — Why Attention?

The Library Analogy 📚

Imagine you walk into a massive library looking for information about "climate change effects on agriculture." Here's what you do:

Your Query (Q): You have a question in your mind — "climate change effects on agriculture." This is your query.
Book Labels / Keys (K): Every book on the shelf has a title and index — "Organic Farming Techniques," "Climate Science 101," "History of Ancient Rome." These are keys that you compare against your query.
Book Contents / Values (V): Each book contains actual content — facts, figures, analysis. These are the values you want to retrieve.
Matching: You mentally compare your query against each book's key. "Climate Science 101" gets high relevance. "History of Ancient Rome" gets near-zero relevance.
Weighted Retrieval: You don't just read one book — you read a weighted combination, spending 60% of your time on the climate book, 30% on the farming book, and 10% skimming others.

Attention = "Soft Dictionary Lookup"
output = Σᵢ similarity(query, key_i) × value_i

This is exactly what the attention mechanism does. Every token in a sentence generates a query ("What am I looking for?"), a key ("What do I contain?"), and a value ("What information do I carry?"). The magic: every token gets to "look up" every other token and retrieve a weighted combination of their information.

Why RNNs Failed: The Bottleneck Problem 🔬

In Chapter 14, you learned that LSTMs and GRUs mitigate the vanishing gradient problem. But they still have a fundamental limitation: information must flow sequentially. If word 1 needs information from word 100, that information must survive passage through 99 intermediate states. Even with gating, this bottleneck causes information loss.

RNN: Sequential bottleneck

word₁ → word₂ → word₃ → ... → word₉₉ → word₁₀₀
  ↓                                        ↓
Info from word₁ must survive 99 steps to reach word₁₀₀

Transformer: Direct connections

word₁ ←────────────────────────────→ word₁₀₀
word₁ ←──→ word₂ ←──→ word₃ ... (all pairs!)

Every token "sees" every other token in ONE step.

The "Aha!" Question

"What if, instead of processing a sequence step by step, we let every position in the sequence attend to every other position simultaneously — and used nothing but this attention operation to build the entire model?"

The original Transformer paper (Vaswani et al., 2017) had 8 authors. The name "Transformer" was chosen because the model transforms input representations through layers of attention. The working title was reportedly much less catchy.

Section 4 — Mathematical Foundation

15.1 The Attention Mechanism — Derived from Scratch

Step 1: The Simplest Attention — Dot-Product Similarity

You have a sequence of n tokens, each represented as a d-dimensional vector. Let's call these vectors x₁, x₂, ..., xₙ, each ∈ ℝ^d. Stack them into a matrix X ∈ ℝ^{n×d}.

The most basic question of attention: "How much should token i pay attention to token j?"

The simplest answer? Take the dot product of their vectors:

e_ij = xᵢᵀ · xⱼ = Σ_k (x_ik · x_jk)

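If two vectors point in similar directions, the dot product is large (high score). If they're orthogonal, the dot product is zero, which is a neutral score rather than "no attention": after softmax, such a token still receives some weight unless other scores are much larger. The dot product equals cosine similarity multiplied by the magnitudes of the two vectors.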

Step 2: From Raw Scores to Probabilities — Softmax

The raw scores e_ij can be any real number — positive, negative, huge, tiny. You need them to be non-negative and sum to 1 (a probability distribution). The softmax function does exactly this:

α_ij = softmax(e_ij) = exp(e_ij) / Σ_k exp(e_ik)

Now α_ij represents "the fraction of attention that token i pays to token j." These are called attention weights.

Step 3: Weighted Sum — The Output

Once you know how much token i attends to every other token, you compute a weighted sum of all token representations:

output_i = Σ_j α_ij · xⱼ

This gives token i a new representation that's a blend of all tokens in the sequence (itself included), weighted by relevance. Beautiful, isn't it? But there's a problem...
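Steps 1 to 3 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the chapter's reference implementation; the toy matrix X and the helper names softmax and naive_self_attention are assumptions made for the example.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def naive_self_attention(X):
    """Steps 1-3: dot-product scores, softmax weights, weighted sum."""
    E = X @ X.T          # e_ij = x_i . x_j, shape (n, n)
    A = softmax(E)       # alpha_ij, each row sums to 1
    return A @ X, A      # output_i = sum_j alpha_ij * x_j

# Toy input (hypothetical values): n = 3 tokens, d = 2 features.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
output, weights = naive_self_attention(X)
print(weights.round(3))
print(output.round(3))
```

Note that no parameters are learned here: the same vector plays the role of query, key, and value.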

Step 4: The Q-K-V Projection — Why We Need Three Separate Roles

In the naive version above, every token uses the same vector for both "querying" and "being queried." This is like a library where the same text serves as both the book's title and its content — confusing!

The solution: project each token into three different spaces using learned weight matrices:

Given input X ∈ ℝ^{n×d_model}, we create three projections:

Query: Q = X · W^Q where W^Q ∈ ℝ^{d_model × d_k}
Key: K = X · W^K where W^K ∈ ℝ^{d_model × d_k}
Value: V = X · W^V where W^V ∈ ℝ^{d_model × d_v}

Intuition:

Q (Query) = "What am I looking for?" — the question each token asks
K (Key) = "What do I contain?" — how each token advertises itself
V (Value) = "What information do I carry?" — the actual content to retrieve

This separation is crucial: a word might be highly relevant as a key to certain queries but carry different information as a value. For example, the word "bank" as a key signals "financial" or "river," and different value projections can encode different aspects of its meaning depending on context.

Step 5: Putting It All Together — Raw Attention

Now combine everything. For each query, compute similarity with all keys, normalize with softmax, and retrieve weighted values:

Attention(Q, K, V) = softmax(Q · K^T) · V

Matrix dimensions check:

Q ∈ ℝ^{n×d_k}, K^T ∈ ℝ^{d_k×n} → Q·K^T ∈ ℝ^{n×n} (attention scores matrix)
After softmax: still ℝ^{n×n} (each row sums to 1)
V ∈ ℝ^{n×d_v} → output ∈ ℝ^{n×d_v} ✓
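The dimension check above can be reproduced with random placeholder matrices. In this sketch the sizes n, d_model, d_k, and d_v are arbitrary toy values, and the weight matrices are random stand-ins for parameters that a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 4, 8, 3, 5   # toy sizes chosen for illustration

X = rng.normal(size=(n, d_model))
# In a real model these are learned; here they are random placeholders.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T                                    # (n, n)
shifted = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
output = weights @ V                                # (n, d_v)

print(Q.shape, K.shape, V.shape)     # (4, 3) (4, 3) (4, 5)
print(scores.shape, output.shape)    # (4, 4) (4, 5)
```

Choosing d_v different from d_k makes it visible that the output width follows V, not Q or K.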

This is almost the final formula. But there's one critical missing piece...

❌ MYTH: "The Q, K, V matrices are fixed and predetermined."

✅ TRUTH: W^Q, W^K, W^V are learned weight matrices, trained via backpropagation just like any other neural network parameter. The network learns what to query for and what to advertise as keys.

🔍 WHY IT MATTERS: This is what makes attention so powerful — the model doesn't just compute fixed similarities; it learns which aspects of tokens to compare.

Section 5 — Mathematical Foundation (cont.)

15.2 Scaled Dot-Product Attention — The √d_k Derivation

The Problem: Why Naive Dot Products Blow Up

Here's a subtle but critical issue. Consider a query vector q and a key vector k, each with d_k independent components drawn from a standard normal distribution (mean 0, variance 1).

The dot product is:

q · k = Σ_{i=1}^{d_k} q_i · k_i

Deriving the variance of the dot product:

Each q_i and k_i are independent with mean 0 and variance 1.

The product q_i · k_i has:

𝔼[q_i · k_i] = 𝔼[q_i] · 𝔼[k_i] = 0 · 0 = 0 (by independence)
Var(q_i · k_i) = 𝔼[q_i² · k_i²] - (𝔼[q_i · k_i])²
= 𝔼[q_i²] · 𝔼[k_i²] - 0 (by independence)
= Var(q_i) · Var(k_i) = 1 · 1 = 1

The dot product is a sum of d_k such independent terms:

𝔼[q · k] = 0
Var(q · k) = d_k

So the standard deviation of the dot product is √d_k.

If d_k = 512, the dot products will have std ≈ 22.6! These huge values, when fed into softmax, push the output toward one-hot vectors (all mass on one element), creating extremely small gradients (softmax saturation).
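The variance result can be checked empirically. The sketch below is a quick simulation under the same assumptions as the derivation (independent standard normal components); the sample count and the chosen d_k values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 5_000
empirical_var = {}

for d_k in (4, 64, 512):
    q = rng.standard_normal((num_samples, d_k))
    k = rng.standard_normal((num_samples, d_k))
    dots = np.einsum("ij,ij->i", q, k)   # one dot product per (q, k) pair
    empirical_var[d_k] = dots.var()
    print(f"d_k={d_k:4d}  variance={dots.var():8.2f}  "
          f"std={dots.std():6.2f}  sqrt(d_k)={np.sqrt(d_k):6.2f}")
```

The printed variance should land close to d_k in each row, and the standard deviation close to √d_k.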

The Solution: Scale by √d_k

To bring the variance back to 1 (well-behaved for softmax), divide by √d_k:

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

After scaling:

Each element of QK^T/√d_k has variance ≈ 1
Softmax operates in a region with informative gradients
The model can learn nuanced attention distributions instead of "all or nothing"
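To see the effect of scaling, the following sketch compares softmax over raw and scaled scores for random Q and K with d_k = 512. The function name scaled_dot_product_attention and the toy sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the weights."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 10, 512
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_k))

unscaled_weights = softmax(Q @ K.T)                  # no scaling
_, scaled_weights = scaled_dot_product_attention(Q, K, V)

# Average of the largest weight in each row: near 1 means near one-hot.
unscaled_peak = unscaled_weights.max(axis=1).mean()
scaled_peak = scaled_weights.max(axis=1).mean()
print(f"unscaled: {unscaled_peak:.3f}   scaled: {scaled_peak:.3f}")
```

Without scaling, each row puts almost all of its mass on a single key; with scaling, the weights stay spread out.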

Q: Why does scaled dot-product attention divide by √d_k?

A: For vectors with i.i.d. components of variance 1, the dot product has variance d_k. Dividing by √d_k normalizes the variance to 1, preventing softmax saturation and ensuring healthy gradients during training.

Formula: Attention(Q,K,V) = softmax(QK^T/√d_k)V

Masked Attention: Preventing the Future from Leaking

In language generation (like GPT), when predicting the next word, you must NOT let the model peek at future words. This is enforced by masking:

Masked Attention: Set score_ij = -∞ for all j > i before softmax
Since exp(−∞) = 0, the masked positions receive exactly zero weight after softmax, effectively removing future positions.

Attention Score Matrix (4 tokens):

          Key₁   Key₂   Key₃   Key₄
Query₁ [  0.8    -∞     -∞     -∞  ]  ← can only attend to position 1
Query₂ [  0.3    0.9    -∞     -∞  ]  ← can attend to positions 1-2
Query₃ [  0.1    0.5    0.7    -∞  ]  ← can attend to positions 1-3
Query₄ [  0.2    0.4    0.3    0.8 ]  ← can attend to all positions

After softmax, −∞ → 0, creating a causal (lower-triangular) pattern.

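In PyTorch, masking is typically done by filling the masked positions with -∞ (for example with masked_fill) or by adding a large negative number such as -1e9 before softmax. The finite value is sometimes preferred because a row whose entries are all -∞ produces NaN after softmax. For a causal mask, where every row keeps at least one visible position, both approaches give the same result.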
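A NumPy sketch of causal masking follows. The lower-triangular scores match the 4-token matrix above; the upper-triangular entries are made-up values that the mask removes, and the helper names are assumptions.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) matrix that is True where j > i, i.e. at future positions."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores, mask):
    """Set masked scores to -inf, then apply a row-wise softmax."""
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                      # exp(-inf) evaluates to 0.0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[0.8, 0.1, 0.5, 0.2],
                   [0.3, 0.9, 0.4, 0.6],
                   [0.1, 0.5, 0.7, 0.3],
                   [0.2, 0.4, 0.3, 0.8]])
weights = masked_softmax(scores, causal_mask(4))
print(weights.round(3))
```

The first row comes out as [1, 0, 0, 0]: the first token can only attend to itself.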

Section 6

15.3 Multi-Head Attention — Why Multiple Heads?

The Problem with Single-Head Attention

A single attention head computes one set of attention weights. But a word can have multiple types of relationships simultaneously:

"The cat sat on the mat because it was tired" — "it" needs to attend to "cat" (coreference resolution)
But "it" also needs to attend to "sat" (syntactic dependency) and "tired" (semantic link)

A single attention head must average all these relationships into one set of weights — a lossy compromise.

The Solution: Multiple Parallel Heads

Instead of one attention function with d_model-dimensional keys, queries, and values, run h parallel attention heads, each with d_k = d_model/h dimensions:

Multi-Head Attention, step by step:

Step 1: For each head i ∈ {1, ..., h}, create separate projections:

Q_i = X · W_i^Q where W_i^Q ∈ ℝ^{d_model × d_k}, d_k = d_model / h
K_i = X · W_i^K where W_i^K ∈ ℝ^{d_model × d_k}
V_i = X · W_i^V where W_i^V ∈ ℝ^{d_model × d_v}, d_v = d_model / h

Step 2: Compute attention independently for each head:

head_i = Attention(Q_i, K_i, V_i) ∈ ℝ^{n × d_v}

Step 3: Concatenate all heads:

MultiHead = Concat(head_1, head_2, ..., head_h) ∈ ℝ^{n × (h·d_v)} = ℝ^{n × d_model}

Step 4: Final linear projection:

MultiHeadAttention(X) = MultiHead · W^O where W^O ∈ ℝ^{d_model × d_model}

MultiHead(Q, K, V) = Concat(head₁, ..., head_h) · W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
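The four steps can be sketched in NumPy as below. This version uses one full-width projection per role and then splits the result into h heads, which is equivalent to stacking the h per-head matrices W_i^Q side by side. The weights are random placeholders and the function name is an assumption.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_k = d_model // h
    # Step 1: project, then split the d_model columns into h heads of width d_k.
    Q = (X @ W_Q).reshape(n, h, d_k).transpose(1, 0, 2)    # (h, n, d_k)
    K = (X @ W_K).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (X @ W_V).reshape(n, h, d_k).transpose(1, 0, 2)
    # Step 2: scaled dot-product attention, independently per head.
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))  # (h, n, n)
    heads = weights @ V                                         # (h, n, d_k)
    # Step 3: concatenate the heads back to width d_model.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    # Step 4: final linear projection.
    return concat @ W_O, weights

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
out, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape, weights.shape)   # (5, 16) (4, 5, 5)
```

Each of the h slices of weights is a separate n × n attention pattern over the same tokens.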

Parameter Count Check

With h=8 heads and d_model=512 (as in the original Transformer):

d_k = d_v = 512/8 = 64 per head
Per head: W^Q(512×64) + W^K(512×64) + W^V(512×64) = 3 × 32,768 = 98,304
All 8 heads: 8 × 98,304 = 786,432
Output projection W^O: 512 × 512 = 262,144
Total: 1,048,576 ≈ 1M parameters (same as one large single-head attention!)
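The same count as plain arithmetic, with biases left out as in the figures above:

```python
d_model, h = 512, 8
d_k = d_v = d_model // h                      # 64

per_head = d_model * d_k + d_model * d_k + d_model * d_v   # W^Q, W^K, W^V
all_heads = h * per_head
output_proj = (h * d_v) * d_model                           # W^O
total = all_heads + output_proj

# A single head with d_k = d_v = d_model has the same number of weights.
single_head_total = 3 * d_model * d_model + d_model * d_model

print(per_head, all_heads, output_proj, total)   # 98304 786432 262144 1048576
print(total == single_head_total)                # True
```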

The computational cost of multi-head attention with h heads of d_k dimensions each is roughly the same as that of single-head attention with full d_model dimensions. You get multiple attention patterns at almost no extra compute, because h × d_k = d_model.

What Do Different Heads Learn?

Research has shown that different heads naturally specialize:

Head Type	What It Learns	Example
Syntactic Head	Subject-verb agreement	"The dogs [that barked] were loud"
Positional Head	Attend to adjacent tokens	"New York" — bigram patterns
Coreference Head	Pronoun resolution	"Alice said she would come"
Semantic Head	Meaning similarity	"happy" attends to "joy," "glad"

"A Multiscale Visualization of Attention" (Vig, 2019) — Jesse Vig built BertViz, a tool to visualize attention patterns across heads and layers. The research showed that lower layers capture syntactic patterns while upper layers capture semantic relationships. This work enabled researchers to interpret what Transformers learn, rather than treating them as black boxes.

Section 7

15.4 Positional Encoding — Teaching Position to a Permutation-Invariant Model

The Problem: Self-Attention Ignores Order

Here's a subtle but devastating issue. Consider the sentences:

"The dog chased the cat" (meaning A)
"The cat chased the dog" (meaning B — completely different!)

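Self-attention computes pairwise similarities between tokens. Since these sentences have exactly the same tokens (just reordered), self-attention without positional information produces exactly the same output vector for each word in both sentences, only in a different order: "dog" gets the same representation whether it is the chaser or the chased. Strictly speaking, the attention mechanism is permutation-equivariant (often loosely called permutation-invariant): it doesn't care about word order!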

This is clearly unacceptable. We must inject positional information somehow.
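The order-blindness can be demonstrated directly. In this sketch the embeddings and projection matrices are random placeholders (hypothetical values), and no positional information is added.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
vocab = {"the": 0, "dog": 1, "chased": 2, "cat": 3}
E = rng.standard_normal((len(vocab), 8))       # hypothetical embeddings
W_Q, W_K, W_V = (rng.standard_normal((8, 8)) for _ in range(3))

sent_a = ["the", "dog", "chased", "the", "cat"]
sent_b = ["the", "cat", "chased", "the", "dog"]
X_a = E[[vocab[w] for w in sent_a]]
X_b = E[[vocab[w] for w in sent_b]]

out_a = self_attention(X_a, W_Q, W_K, W_V)
out_b = self_attention(X_b, W_Q, W_K, W_V)

# "dog" sits at index 1 in sentence A and at index 4 in sentence B.
same_dog_vector = np.allclose(out_a[1], out_b[4])
print(same_dog_vector)   # True
```

Once a position-dependent vector is added to each row of X, the two sentences no longer produce matching outputs.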

Solution: Sinusoidal Positional Encoding

Deriving the sinusoidal encoding:

We need a function PE(pos, i) that maps position pos and dimension i to a real number. Our desiderata:

Unique: Each position gets a unique encoding
Bounded: Values stay in [-1, 1] regardless of sequence length
Relative: PE(pos+k) should be expressible as a linear function of PE(pos) — enabling the model to learn relative positions
Generalizable: Should work for sequences longer than those seen during training

Sine and cosine functions satisfy ALL four criteria!

The encoding formula:

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})

PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

where pos is the token position and i ∈ {0, 1, ..., d_model/2 − 1} indexes the sine-cosine pair, so 2i and 2i+1 are the actual dimension indices.

Why this works — the linear relationship property:

For any fixed offset k, there exists a linear transformation M_k such that:

[sin(ω·(pos+k)), cos(ω·(pos+k))] = M_k · [sin(ω·pos), cos(ω·pos)]

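where ω = 1 / 10000^{2i/d_model} is the angular frequency of pair i, and M_k is a rotation matrix that depends only on the offset k, not on pos: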

M_k = [[cos(ωk), sin(ωk)], [-sin(ωk), cos(ωk)]]

This means the model can learn to attend to relative positions (e.g., "3 tokens back") through a simple linear transformation of the positional encodings. No absolute position memorization needed!

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
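A minimal NumPy sketch of the encoding, followed by a numerical check of the rotation property for one dimension pair. The function name and the chosen pos, k, and i are assumptions for illustration; d_model is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model, base=10000.0):
    """Return a (max_len, d_model) matrix of sinusoidal encodings."""
    positions = np.arange(max_len)[:, None]             # (max_len, 1)
    pair_index = np.arange(d_model // 2)[None, :]       # i = 0 .. d_model/2 - 1
    omega = 1.0 / base ** (2 * pair_index / d_model)    # one frequency per pair
    angles = positions * omega
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)    # odd dimensions 2i+1
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=8)

# Rotation property for pair i = 1 and offset k = 3, starting at pos = 10.
i, k, pos = 1, 3, 10
omega = 1.0 / 10000.0 ** (2 * i / 8)
M_k = np.array([[np.cos(omega * k),  np.sin(omega * k)],
                [-np.sin(omega * k), np.cos(omega * k)]])
shifted = M_k @ pe[pos, 2 * i : 2 * i + 2]
rotation_holds = np.allclose(shifted, pe[pos + k, 2 * i : 2 * i + 2])
print(rotation_holds)   # True
```

M_k is built from k and ω only, so the same matrix shifts any position by k.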

Wavelength Intuition

Each dimension pair (2i, 2i+1) creates a sine-cosine wave with a different wavelength:

Dimension 0,1: wavelength = 2π ≈ 6.28 positions (fastest oscillation)
Dimension 2,3 (with d_model = 512): wavelength = 2π × 10000^{2/512} ≈ 6.51 positions
...
Dimension 510,511: wavelength = 2π × 10000^{510/512} ≈ 60,600 positions (slowest, approaching the upper limit 2π × 10000 ≈ 62,832)

Think of it like a binary clock: the least significant bit oscillates fastest, the most significant bit oscillates slowest. Each position gets a unique "binary-like" representation from combining all these waves.
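The wavelength of each pair follows from the formula as 2π × 10000^{2i/d_model}. The short sketch below evaluates it for the first, second, and last pair, assuming d_model = 512.

```python
import math

d_model = 512
wavelengths = {}
for i in (0, 1, 255):                       # first, second, and last pair
    wavelengths[i] = 2 * math.pi * 10000 ** (2 * i / d_model)
    print(f"dims ({2 * i}, {2 * i + 1}): "
          f"wavelength = {wavelengths[i]:,.2f} positions")
```

The wavelengths form a geometric progression, each pair's being 10000^{2/d_model} times the previous one.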

Positional Encoding Visualization (d_model=8, positions 0-7):

Pos   dim0   dim1   dim2   dim3   dim4   dim5   dim6   dim7
0     0.00   1.00   0.00   1.00   0.00   1.00   0.00   1.00
1     0.84   0.54   0.10   1.00   0.01   1.00   0.00   1.00
2     0.91  -0.42   0.20   0.98   0.02   1.00   0.00   1.00
3     0.14  -0.99   0.30   0.96   0.03   1.00   0.00   1.00
4    -0.76  -0.65   0.39   0.92   0.04   1.00   0.00   1.00
5    -0.96   0.28   0.48   0.88   0.05   1.00   0.00   1.00
6    -0.28   0.96   0.56   0.83   0.06   1.00   0.01   1.00
7     0.66   0.75   0.64   0.76   0.07   1.00   0.01   1.00

Notice: dim0,1 oscillate fast ↕ | dim6,7 barely change →
Each row (position) has a unique "fingerprint". Values are rounded to two decimals; with d_model=8 the four pairs have angular frequencies 1, 0.1, 0.01, and 0.001.

Learned vs. Sinusoidal Positional Encoding

Feature	Sinusoidal (Original Transformer)	Learned (BERT, GPT-2)
Parameters	0 (computed analytically)	max_len × d_model trainable params
Generalization	Can extrapolate to longer sequences	Cannot handle sequences longer than max_len
Relative positions	Encodes relative position via rotation	Must learn relative patterns from data
Performance	Comparable to learned in most tasks	Nearly identical to sinusoidal in the original paper's experiments
Used in	Original Transformer, some variants	BERT, GPT-2, GPT-3

"RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021) — This paper introduced Rotary Position Embedding (RoPE), which encodes position by rotating the query and key vectors. RoPE naturally encodes relative positions, generalizes to longer sequences, and is now used in LLaMA, PaLM, and most modern large language models. It essentially applies the rotation matrix M_k directly to Q and K before computing attention.

Section 8

15.5 The Full Transformer Architecture

The Big Picture

THE TRANSFORMER ARCHITECTURE
═══════════════════════════

┌─────────────────────┐      ┌──────────────────────┐
│       ENCODER       │      │       DECODER        │
│    (6 identical     │      │     (6 identical     │
│       layers)       │      │       layers)        │
│                     │      │                      │
│ ┌─────────────────┐ │      │ ┌──────────────────┐ │
│ │ Multi-Head      │ │      │ │ Masked Multi-    │ │
│ │ Self-Attention  │ │      │ │ Head Self-Attn   │ │
│ │ + Add & Norm    │ │      │ │ + Add & Norm     │ │
│ └────────┬────────┘ │      │ └────────┬─────────┘ │
│          │          │      │          │           │
│ ┌────────▼────────┐ │      │ ┌────────▼─────────┐ │
│ │ Feed-Forward    │ │      │ │ Cross-Attention  │ │
│ │ Network         │ │ ───► │ │ (Encoder-Decoder)│ │
│ │ + Add & Norm    │ │      │ │ + Add & Norm     │ │
│ └────────┬────────┘ │      │ └────────┬─────────┘ │
│          │          │      │          │           │
│     [Repeat ×6]     │      │ ┌────────▼─────────┐ │
└──────────┬──────────┘      │ │ Feed-Forward     │ │
           │                 │ │ Network          │ │
┌──────────▼──────────┐      │ │ + Add & Norm     │ │
│  Input Embedding    │      │ └────────┬─────────┘ │
│  + Positional Enc   │      │          │           │
└──────────▲──────────┘      │     [Repeat ×6]      │
           │                 └──────────┬───────────┘
     Input Tokens                       │
    "I love coding"           ┌─────────▼──────────┐
                              │  Linear + Softmax  │
                              │  → Output Probs    │
                              └─────────▲──────────┘
                                        │
                              ┌─────────┴──────────┐
                              │  Output Embedding  │
                              │  + Positional Enc  │
                              └─────────▲──────────┘
                                        │
                            Output Tokens (shifted right)
                               "मुझे कोडिंग पसंद है"

Encoder Block (in detail)

Each of the 6 encoder layers contains exactly two sub-layers:

🔹 Encoder Sub-Layer 1: Multi-Head Self-Attention

Input: Sequence of d_model-dimensional vectors (n × d_model)
Operation: Each token attends to all tokens in the input sequence
Residual: output₁ = LayerNorm(x + MultiHeadAttention(x, x, x))
Note: Q, K, V all come from the same input, hence "self-attention"

🔹 Encoder Sub-Layer 2: Position-wise Feed-Forward Network

Operation: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂ (ReLU in the middle)
Dimensions: W₁ ∈ ℝ^{512×2048}, W₂ ∈ ℝ^{2048×512} (expand then compress)
Residual: output₂ = LayerNorm(output₁ + FFN(output₁))
Key insight: "Position-wise" means the same FFN is applied independently to each token. Think of it as a 1×1 convolution over the sequence: it processes each position's features but doesn't mix information across positions.
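The "position-wise" property can be checked numerically: running the FFN on the whole sequence gives the same row for a token as running it on that token alone. The weights below are random placeholders and the function name is an assumption.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row of x."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 512, 2048
x = rng.standard_normal((n, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

full = position_wise_ffn(x, W1, b1, W2, b2)          # all 5 tokens at once
single = position_wise_ffn(x[2:3], W1, b1, W2, b2)   # token 2 on its own

print(full.shape)                          # (5, 512)
print(np.allclose(full[2], single[0]))     # True
```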

Decoder Block (in detail)

Each of the 6 decoder layers has three sub-layers:

🔸 Decoder Sub-Layer 1: Masked Multi-Head Self-Attention

Same as encoder self-attention, BUT with a causal mask
Why masked? During generation, position i must not attend to positions j > i (the future). The mask sets these attention scores to -∞ before softmax.
Residual: out₁ = LayerNorm(x + MaskedMultiHeadAttention(x, x, x))

🔸 Decoder Sub-Layer 2: Cross-Attention (Encoder-Decoder Attention)

This is the bridge between encoder and decoder!
Q comes from: The decoder (previous sub-layer output)
K, V come from: The encoder output
Intuition: "Given what I've generated so far (Q), what parts of the source sentence (K, V) should I focus on?"
Residual: out₂ = LayerNorm(out₁ + MultiHeadAttention(out₁, enc_output, enc_output))
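A single-head sketch of cross-attention makes the shapes concrete: the weight matrix has one row per target position and one column per source position. The source and target lengths, the model width, and the random weights are toy assumptions; a real decoder layer would use multiple heads plus the residual and LayerNorm described above.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_src, n_tgt, d_model = 6, 4, 16
enc_output = rng.standard_normal((n_src, d_model))   # from the encoder stack
dec_state = rng.standard_normal((n_tgt, d_model))    # out_1 in the text
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) * 0.1
                 for _ in range(3))

Q = dec_state @ W_Q      # queries come from the decoder
K = enc_output @ W_K     # keys come from the encoder
V = enc_output @ W_V     # values come from the encoder

weights = softmax(Q @ K.T / np.sqrt(d_model))   # (n_tgt, n_src)
context = weights @ V                            # (n_tgt, d_model)
print(weights.shape, context.shape)              # (4, 6) (4, 16)
```

No causal mask is applied here, because every target position may look at the entire source sentence.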

🔸 Decoder Sub-Layer 3: Position-wise FFN

Identical to encoder's FFN
Residual: out₃ = LayerNorm(out₂ + FFN(out₂))

Transformer Hyperparameters (Original Paper)

Hyperparameter	Symbol	Value
Model dimension	d_model	512
Number of heads	h	8
Key/Value dimension per head	d_k = d_v	64
FFN inner dimension	d_ff	2048
Number of encoder layers	N	6
Number of decoder layers	N	6
Dropout rate	p_drop	0.1
Vocabulary size	V	~37,000 (BPE)
Total parameters	—	~65M (base), ~213M (big)
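The base model's size can be roughly reproduced from the table. This is a back-of-the-envelope sketch with stated simplifications: biases and LayerNorm parameters are omitted, the vocabulary is taken as exactly 37,000, and a single embedding matrix is assumed to be shared between the input embedding, output embedding, and final linear layer.

```python
d_model, d_ff, num_layers, vocab_size = 512, 2048, 6, 37000

attn = 4 * d_model * d_model        # W^Q, W^K, W^V (all heads) plus W^O
ffn = 2 * d_model * d_ff            # W1 and W2
encoder_layer = attn + ffn          # self-attention + FFN
decoder_layer = 2 * attn + ffn      # masked self-attn + cross-attn + FFN
embedding = vocab_size * d_model    # assumed shared (see lead-in)

total = num_layers * encoder_layer + num_layers * decoder_layer + embedding
print(f"{total:,}")   # 62,984,192
```

The result lands close to the ~65M quoted for the base model; the exact figure depends on the true vocabulary size and the terms omitted here.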

Job Roles: Transformer expertise is required for: NLP Engineer, LLM Research Scientist, Applied ML Engineer, ML Infrastructure Engineer (optimizing Transformer inference), AI4Bharat Research Fellow. Typical salaries in India: ₹18-60 LPA; in the US: $120K-250K+.

Section 9

15.6 Layer Normalization and Residual Connections

Residual Connections: The Gradient Highway

You already met residual connections in Chapter 11 (ResNets). The Transformer uses them identically:

output = SubLayer(x) + x

Why they're essential: With 6 stacked layers, each containing 2-3 sub-layers, gradients must flow through 12-18 transformations. Without skip connections, gradients would vanish or explode. The residual path provides a direct gradient highway from the output back to the input.

Layer Normalization (Not Batch Normalization!)

The Transformer uses Layer Normalization, not Batch Normalization. Here's why and how:

LayerNorm vs BatchNorm:

BatchNorm normalizes across the batch dimension: for each feature, compute mean and variance across all examples in the batch.

LayerNorm normalizes across the feature dimension: for each example (each token), compute mean and variance across all features (d_model dimensions).

For a single token vector x ∈ ℝ^d_model:

μ = (1/d_model) Σᵢ xᵢ

σ² = (1/d_model) Σᵢ (xᵢ - μ)²

LayerNorm(x) = γ ⊙ (x - μ)/√(σ² + ε) + β

where γ, β ∈ ℝ^d_model are learned scale and shift parameters, and ε ≈ 1e-6 for numerical stability.
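A minimal NumPy sketch of layer normalization over the feature dimension. It uses the common form that divides by √(σ² + ε); the input scale and shift are arbitrary toy values.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each row (token) of x across its features."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)     # divides by d_model, as in the text
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d_model = 512
x = rng.standard_normal((3, d_model)) * 4.0 + 2.0   # 3 tokens, shifted and scaled
gamma, beta = np.ones(d_model), np.zeros(d_model)   # identity scale and shift

y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1).round(6))   # approximately 0 for every token
print(y.std(axis=-1).round(4))    # approximately 1 for every token
```

Each token is normalized using only its own features, so the result does not depend on the other tokens or on the batch.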

Why LayerNorm, Not BatchNorm?

Reason	Explanation
Variable sequence lengths	BatchNorm would need to compute statistics across positions, but sequences have different lengths — padding tokens would corrupt statistics
Small batch sizes	In NLP training, batch sizes are often small (due to memory). BatchNorm's statistics become noisy with small batches.
Inference consistency	LayerNorm's behavior is identical during training and inference (no running mean/variance needed)
Sequential generation	During autoregressive generation, tokens are produced one at a time — no "batch" to normalize over

Pre-Norm vs Post-Norm

The original Transformer uses Post-Norm: output = LayerNorm(x + SubLayer(x))

Modern Transformers (GPT-2+) use Pre-Norm: output = x + SubLayer(LayerNorm(x))

Pre-Norm trains more stably because the residual path carries unnormalized values, maintaining the gradient highway. Post-Norm can cause gradient issues in very deep networks (>12 layers). GPT-2, GPT-3, and LLaMA all use Pre-Norm.
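The two orderings differ only in where the normalization sits. In this sketch toy_sublayer is a hypothetical stand-in for attention or the FFN, and the learned γ and β are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_step(x, sublayer):
    """Original Transformer: normalize after adding the residual."""
    return layer_norm(x + sublayer(x))

def pre_norm_step(x, sublayer):
    """GPT-2 style: normalize the sublayer input, leave the residual path untouched."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)) * 0.1

def toy_sublayer(h):
    # Hypothetical stand-in for attention or the FFN.
    return np.tanh(h @ W)

x = rng.standard_normal((4, 16)) * 5.0
post = post_norm_step(x, toy_sublayer)
pre = pre_norm_step(x, toy_sublayer)

print(post.std(axis=-1).round(2))   # about 1 per token: the output is renormalized
print(pre.std(axis=-1).round(2))    # keeps the scale of x: the residual path is untouched
```

In the Pre-Norm version, x reaches the output through a pure addition, which is the unobstructed gradient path described above.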

❌ MYTH: "Transformers use Batch Normalization like CNNs do."

✅ TRUTH: Transformers use Layer Normalization. BatchNorm normalizes across the batch; LayerNorm normalizes across features within a single example.

🔍 WHY IT MATTERS: Using BatchNorm in a Transformer would break autoregressive generation and perform poorly with variable-length sequences. This is a common GATE/interview trap question.

Section 10

15.7 BERT vs GPT — Encoder-Only vs Decoder-Only

Three Flavors of Transformers

Three Transformer Architectures:

┌─────────────────┐  ┌─────────────────┐  ┌────────────────────────┐
│  ENCODER-ONLY   │  │  DECODER-ONLY   │  │    ENCODER-DECODER     │
│     (BERT)      │  │      (GPT)      │  │     (T5, Original)     │
│                 │  │                 │  │                        │
│  ┌───────────┐  │  │  ┌───────────┐  │  │  ┌───────┐  ┌───────┐  │
│  │ Self-Attn │  │  │  │ Masked    │  │  │  │Encoder│→ │Decoder│  │
│  │ (bidir.)  │  │  │  │ Self-Attn │  │  │  │       │  │       │  │
│  │ [CLS]..   │  │  │  │ (causal)  │  │  │  │ Self  │  │ Cross │  │
│  └───────────┘  │  │  └───────────┘  │  │  │ Attn  │  │ Attn  │  │
│                 │  │                 │  │  └───────┘  └───────┘  │
│  Pre-training:  │  │  Pre-training:  │  │  Pre-training:         │
│  MLM (Masked    │  │  CLM (Causal    │  │  Denoising /           │
│  Language Model)│  │  Language Model)│  │  Seq2Seq               │
│                 │  │                 │  │                        │
│  Best for:      │  │  Best for:      │  │  Best for:             │
│  Understanding  │  │  Generation     │  │  Translation,          │
│  NLU, classify  │  │  Text, Code     │  │  Summarization         │
└─────────────────┘  └─────────────────┘  └────────────────────────┘

BERT: Bidirectional Encoder Representations from Transformers

📘 BERT (Devlin et al., 2018)

Architecture: Encoder-only (12 layers for BERT-base, 24 for BERT-large)
Key innovation: Bidirectional self-attention, where every token can see every other token, both left AND right
Pre-training Objective 1: Masked Language Model (MLM)

Randomly mask 15% of input tokens, then predict the masked tokens.

Example: "The [MASK] sat on the [MASK]" → predict "cat" and "mat"

Of the 15% selected: 80% replaced with [MASK], 10% with a random word, 10% kept unchanged.
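The 80/10/10 rule can be sketched as follows. This is a simplified illustration: each token is selected independently with probability 0.15 (so the selected fraction is 15% only on average), and the toy vocabulary, the function name mlm_mask, and the use of None for "no prediction needed" are assumptions.

```python
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted inputs, labels); labels are None at unselected positions."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for idx, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # token not selected
        labels[idx] = tok                     # the model must predict the original
        r = rng.random()
        if r < 0.8:
            inputs[idx] = "[MASK]"            # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[idx] = rng.choice(vocab)   # 10%: replace with a random word
        # remaining 10%: keep the original token unchanged
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, labels = mlm_mask(tokens, vocab)
print(masked)
print(labels)
```

The loss is computed only at positions where the label is not None, including the selected tokens that were left unchanged.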

Pre-training Objective 2 — Next Sentence Prediction (NSP):

Given two sentences, predict whether B actually follows A in the corpus.

50% positive pairs (consecutive sentences), 50% negative (random sentences).

Why it works: By seeing context from BOTH directions, BERT builds deep bidirectional representations — unlike GPT which only sees leftward context. This makes BERT excellent for understanding tasks (classification, NER, QA).

GPT: Generative Pre-trained Transformer

📗 GPT (Radford et al., 2018)

Architecture: Decoder-only (12 layers for GPT-1, 48 for the 1.5B-parameter GPT-2, 96 for the 175B-parameter GPT-3; undisclosed for GPT-4)
Key innovation: Causal (left-to-right) self-attention with masked future positions
Pre-training Objective: Causal Language Model (CLM)

Predict the next token given all previous tokens.

P(x_t | x_1, x_2, ..., x_{t-1})

Example: "The cat sat on the" → predict "mat"
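For training, a single sentence yields one prediction problem per position: the targets are simply the inputs shifted by one token. A small sketch using whitespace "tokens" for readability (real models use subword tokens):

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]

inputs = tokens[:-1]     # x_1 ... x_{T-1}
targets = tokens[1:]     # x_2 ... x_T, the inputs shifted left by one

for t in range(len(inputs)):
    context = " ".join(inputs[: t + 1])
    print(f"{context!r} -> {targets[t]!r}")
```

With a causal mask, all of these predictions are computed in a single forward pass over the sequence.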

Why it works: CLM naturally enables text generation: just keep predicting the next token. And because the model learns to predict ANY next token in billions of sentences, it implicitly learns grammar, facts, reasoning, and even some code.
Scaling law: GPT demonstrated that performance improves predictably with model size, data size, and compute, leading to the "scaling laws" paradigm.

Head-to-Head Comparison

Feature	BERT (Encoder-only)	GPT (Decoder-only)
Attention type	Bidirectional (full)	Causal (left-to-right)
Pre-training	MLM + NSP	CLM (next token prediction)
Fine-tuning	Add task-specific head	Prompt-based / fine-tune
Generation	Poor (not designed for it)	Excellent
Understanding	Excellent (bidirectional context)	Good (but unidirectional)
Classification	Excellent ([CLS] token)	Good (last token)
Parameters (base)	110M (BERT-base)	117M (GPT-1), 175B (GPT-3)
Notable variants	RoBERTa, ALBERT, DistilBERT	GPT-2/3/4, LLaMA, Mistral

🇮🇳 India: BERT Variants

IndicBERT: 11 Indian languages, ALBERT-based
MuRIL (Google) — Multilingual for Indian langs
Indic-BERT by AI4Bharat — handles code-mixing
Challenges: 13 scripts, agglutinative morphology, limited digital data for many languages
Use cases: Flipkart review classification, gov't document processing, Aadhaar NLP

🇺🇸 USA: GPT Evolution

GPT-1 (2018) — 117M params, proved CLM pre-training works
GPT-2 (2019) — 1.5B params, "too dangerous to release"
GPT-3 (2020) — 175B params, few-shot learning
GPT-4 (2023): Multimodal; parameter count not disclosed by OpenAI (unofficial estimates exceed 1T)
Key insight: scaling compute + data → emergent capabilities (reasoning, code, multilingual)

Section 11

15.8 Self-Attention Complexity and Efficient Attention

The Quadratic Bottleneck

Computing the complexity of self-attention:

Given a sequence of length n and model dimension d:

Q, K, V projections: Three matrix multiplications of (n×d) · (d×d) = O(n·d²) each → O(3n·d²)
QK^T: Matrix multiplication of (n×d) · (d×n) = O(n²·d) ← the bottleneck!
Softmax: Applied to n×n matrix → O(n²)
Attention × V: (n×n) · (n×d) = O(n²·d)

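Total: O(n²·d + n·d²), usually summarized as O(n²·d) because the attention terms dominate once the sequence length n exceeds d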

Memory: O(n²) to store the attention score matrix

For n=4096 tokens with d=1024: the n×n matrix has 16.7 million entries. For n=32768 (a long document): 1.07 billion entries. This is why standard Transformers struggle with long sequences.
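The entry counts above translate directly into memory. The sketch below assumes 4-byte float32 entries and counts a single n × n score matrix, for one head and one sequence; real usage multiplies this by the number of heads and the batch size.

```python
def attention_matrix_cost(n, bytes_per_entry=4):
    """Entries and MiB needed to store one n x n attention score matrix."""
    entries = n * n
    mib = entries * bytes_per_entry / 2**20
    return entries, mib

for n in (512, 4096, 32768):
    entries, mib = attention_matrix_cost(n)
    print(f"n={n:6d}  entries={entries:>13,}  float32 storage={mib:8.1f} MiB")
```

Going from 4,096 to 32,768 tokens multiplies the sequence length by 8 and the storage by 64.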

Comparison: Transformer vs RNN Complexity

Aspect	Self-Attention (Transformer)	Recurrence (RNN/LSTM)
Time per layer	O(n² · d)	O(n · d²)
Sequential operations	O(1) — fully parallel!	O(n) — inherently sequential
Max path length	O(1) — direct connection	O(n) — through all states
Memory	O(n²) for attention matrix	O(n·d) for hidden states
Better when n ≪ d	✅ Yes (n²·d ≪ n·d²)	More expensive
Better when n ≫ d	❌ No (n² dominates)	✅ Yes

Efficient Attention Variants

Method	Complexity	Key Idea	Year
Sparse Attention (Child et al.)	O(n√n)	Attend only to fixed patterns (local + strided)	2019
Linformer	O(n)	Project K,V to lower-dimensional space (n×d → k×d)	2020
Performer	O(n·d)	Approximate softmax(QK^T) via random features (FAVOR+)	2020
FlashAttention	O(n²·d) but fast!	IO-aware exact attention, tiles computation to use SRAM	2022
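Multi-Query Attention	O(n²·d) compute; K,V cache about h× smaller	Share one K,V head across all query heads, saving memory and bandwidth at inference	2019
Grouped-Query Attention	O(n²·d) compute; K,V cache about h/g× smaller	Share K,V across g groups of heads (used in LLaMA 2)	2023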

"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (Dao et al., 2022): Instead of approximating attention, FlashAttention computes exact attention but restructures the computation to minimize reads and writes between GPU SRAM (fast, small) and HBM (slow, large). By tiling the computation and evaluating softmax block by block, it never materializes the full n×n matrix in HBM, so attention memory drops from O(n²) to O(n) in sequence length; the paper reports roughly 2-4x speedups over standard implementations. PyTorch 2.0+ exposes a FlashAttention kernel through torch.nn.functional.scaled_dot_product_attention, which selects it automatically when the inputs and hardware support it.

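The block-wise softmax idea behind FlashAttention can be sketched in NumPy. This is only an illustration of the arithmetic, under simplifying assumptions: it tiles over keys and values only, runs on the CPU, and does nothing about GPU memory traffic, which is where the real kernel gets its speed. The function names are illustrative.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference implementation that materializes the full (n, n) matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block_size=4):
    """Exact attention computed one key/value block at a time."""
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full((n, 1), -np.inf)   # running max of scores per query
    row_sum = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, K.shape[0], block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        s = Q @ K_blk.T / np.sqrt(d_k)                    # (n, block)
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)            # rescale old accumulators
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ V_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))
```

Only an (n, block_size) slice of scores exists at any time, yet the result matches the full computation because the running maximum and denominator are rescaled whenever a new block raises the maximum.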
Q: What is the time complexity of self-attention for a sequence of length n and dimension d?

A: O(n²·d), dominated by the QK^T matrix multiplication (n×d · d×n). Memory is O(n²) for the attention matrix. This quadratic scaling is why standard full-attention Transformers were long limited to context windows of a few thousand tokens.

Section 12

Worked Examples

Worked Example 1: By-Hand Attention Computation

📝 Computing Scaled Dot-Product Attention (by hand)

Given: 3 tokens with d_k = 2

Q = [[1, 0],     K = [[1, 0],     V = [[1, 0],
     [0, 1],          [0, 1],          [0, 1],
     [1, 1]]          [1, 1]]          [0.5, 0.5]]

Step 1: Compute QK^T

QK^T = Q · K^T = [[1·1+0·0, 1·0+0·1, 1·1+0·1],   = [[1, 0, 1],
                     [0·1+1·0, 0·0+1·1, 0·1+1·1],      [0, 1, 1],
                     [1·1+1·0, 1·0+1·1, 1·1+1·1]]      [1, 1, 2]]

Step 2: Scale by √d_k = √2 ≈ 1.414

QK^T/√d_k = [[0.707, 0.000, 0.707],
              [0.000, 0.707, 0.707],
              [0.707, 0.707, 1.414]]

Step 3: Apply softmax (row-wise)

Row 1: exp([0.707, 0.000, 0.707]) = [2.028, 1.000, 2.028]
        sum = 5.056
        softmax = [0.401, 0.198, 0.401]

Row 2: exp([0.000, 0.707, 0.707]) = [1.000, 2.028, 2.028]
        sum = 5.056
        softmax = [0.198, 0.401, 0.401]

Row 3: exp([0.707, 0.707, 1.414]) = [2.028, 2.028, 4.113]
        sum = 8.169
        softmax = [0.248, 0.248, 0.503]

Step 4: Multiply by V

Attention weights:           V:
[[0.401, 0.198, 0.401],    [[1.0, 0.0],
 [0.198, 0.401, 0.401],     [0.0, 1.0],
 [0.248, 0.248, 0.503]]     [0.5, 0.5]]

Output = weights · V:
Row 1: 0.401×[1,0] + 0.198×[0,1] + 0.401×[0.5,0.5] = [0.602, 0.398]
Row 2: 0.198×[1,0] + 0.401×[0,1] + 0.401×[0.5,0.5] = [0.398, 0.602]
Row 3: 0.248×[1,0] + 0.248×[0,1] + 0.503×[0.5,0.5] = [0.500, 0.500]

Final output:

Attention(Q,K,V) = [[0.602, 0.398],
                    [0.398, 0.602],
                    [0.500, 0.500]]

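Interpretation: Token 3 ([1,1]) attended most strongly to itself (weight 0.503) and split the remainder equally between Tokens 1 and 2 (0.248 each). Its output is exactly [0.5, 0.5] because those two weights are symmetric and V₃ = [0.5, 0.5]. Token 1 attended equally to itself and Token 3 (0.401 each), the two keys that have a 1 in dimension 0, and least to Token 2 (0.198).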
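The four steps of this worked example can be reproduced with a short NumPy check, using the same Q, K, and V. Since every row of V sums to 1 and every row of attention weights sums to 1, each output row must also sum to 1, which is a quick sanity test for hand calculations.

```python
import numpy as np

Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = Q.copy()
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)                                  # Steps 1-2
exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)    # Step 3
output = weights @ V                                             # Step 4

print("weights:\n", np.round(weights, 3))
print("output:\n", np.round(output, 3))
print("output row sums:", output.sum(axis=-1))
```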

Worked Example 2: Indian Industry — Multilingual Attention

🇮🇳 AI4Bharat: Attention in Hindi-English Code-Mixed Text

Problem: Classify sentiment of a code-mixed review from Flipkart:

"Yeh phone bahut accha hai but battery life is terrible"

(This phone is very good but battery life is terrible)

Why Transformers excel here:

Self-attention captures cross-lingual dependencies: The Hindi word "accha" (good) and English word "terrible" create opposing sentiment — attention heads learn to detect this contrast across languages.
No sequential bottleneck: The sentiment words at positions 4 ("accha") and 10 ("terrible") attend to each other directly, without information having to flow through the 5 intermediate tokens.
Subword tokenization: IndicBERT's SentencePiece tokenizer handles Devanagari and Latin scripts with a single shared vocabulary.

Attention pattern (simplified):

Attention head #3 (sentiment detection):
                Yeh  phone bahut accha hai  but  battery life  is  terrible
terrible  →     0.01  0.02  0.01  0.35  0.01 0.15  0.12   0.08  0.02  0.23
accha     →     0.02  0.05  0.20  0.15  0.03 0.25  0.08   0.05  0.02  0.15

Note: "terrible" attends strongly to "accha" (0.35) — detecting the 
sentiment contrast. "accha" attends to "but" (0.25) — the pivot word.

Result: Model correctly classifies as "Mixed Sentiment" with a confidence breakdown of 60% negative / 40% positive, which aligns with the code-mixed nature of the review.

Worked Example 3: US Industry — GPT Text Generation

🇺🇸 OpenAI: How GPT Generates Text via Causal Attention

Task: GPT-3 generating a response to "Explain quantum computing in simple terms."

Generation process (autoregressive):

Step 1: Input: "Explain quantum computing in simple terms."
        → Model predicts next token probabilities:
          P("Quantum") = 0.12, P("Think") = 0.08, P("Imagine") = 0.15, ...
        → Sample (or greedy): "Imagine"

Step 2: Input: "Explain quantum computing in simple terms. Imagine"
        → Causal mask ensures "Imagine" can attend to all previous tokens
           but previous tokens CANNOT attend to "Imagine"
        → Predict: P("a") = 0.18, P("you") = 0.12, ...
        → Select: "a"

Step 3: Continue until <EOS> token or max_length

How causal attention works at step 2:

Attention mask (1 = attend, 0 = blocked):
              Explain quantum computing in simple terms . Imagine
Explain    [    1       0        0       0    0      0   0    0   ]
quantum    [    1       1        0       0    0      0   0    0   ]
computing  [    1       1        1       0    0      0   0    0   ]
in         [    1       1        1       1    0      0   0    0   ]
simple     [    1       1        1       1    1      0   0    0   ]
terms      [    1       1        1       1    1      1   0    0   ]
.          [    1       1        1       1    1      1   1    0   ]
Imagine    [    1       1        1       1    1      1   1    1   ]  ← can see everything before it

Scaling insight: GPT-3 (175B parameters) achieves this quality through massive scaling. The same architecture at 117M parameters (GPT-1) produces much less coherent text — demonstrating the power of scale.

Section 13

Python Implementation: From-Scratch NumPy

13.1 Scaled Dot-Product Attention (NumPy)

Python (NumPy)
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention from scratch.
    
    Args:
        Q: Queries  (n, d_k)
        K: Keys     (n, d_k)
        V: Values   (n, d_v)
        mask: Optional boolean mask (n, n), True = mask out
    
    Returns:
        output: (n, d_v) — weighted sum of values
        weights: (n, n) — attention weights (for visualization)
    """
    d_k = Q.shape[-1]
    
    # Step 1: Compute raw attention scores
    scores = Q @ K.T  # (n, n)
    
    # Step 2: Scale by sqrt(d_k)
    scores = scores / np.sqrt(d_k)
    
    # Step 3: Apply mask (for causal / padding)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    
    # Step 4: Softmax to get attention weights
    weights = softmax(scores, axis=-1)  # (n, n), each row sums to 1
    
    # Step 5: Weighted sum of values
    output = weights @ V  # (n, d_v)
    
    return output, weights

# === DEMO ===
np.random.seed(42)
n, d_k, d_v = 4, 8, 8
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_v)

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights (each row sums to 1):")
print(np.round(weights, 3))
print(f"\nOutput shape: {output.shape}")

# Causal mask demo
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
output_masked, weights_masked = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print("\nCausal attention weights:")
print(np.round(weights_masked, 3))

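Sample output (illustrative values):

Attention weights (each row sums to 1):
[[0.143 0.371 0.298 0.188]
 [0.341 0.175 0.264 0.220]
 [0.182 0.293 0.312 0.213]
 [0.274 0.237 0.198 0.291]]

Output shape: (4, 8)

Causal attention weights:
[[1.000 0.000 0.000 0.000]
 [0.660 0.340 0.000 0.000]
 [0.231 0.372 0.396 0.000]
 [0.274 0.237 0.198 0.291]]

Each causal row is the corresponding unmasked row renormalized over the positions it is allowed to see, so the last row is unchanged.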

13.2 Multi-Head Attention (NumPy)

Python (NumPy)
class MultiHeadAttention:
    """Multi-Head Attention from scratch in NumPy."""
    
    def __init__(self, d_model, n_heads):
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # dimension per head
        
        # Initialize weight matrices (Xavier initialization)
        scale = np.sqrt(2.0 / (d_model + self.d_k))
        self.W_Q = np.random.randn(d_model, d_model) * scale
        self.W_K = np.random.randn(d_model, d_model) * scale
        self.W_V = np.random.randn(d_model, d_model) * scale
        self.W_O = np.random.randn(d_model, d_model) * scale
    
    def split_heads(self, x):
        """Reshape (n, d_model) → (n_heads, n, d_k)"""
        n = x.shape[0]
        x = x.reshape(n, self.n_heads, self.d_k)  # (n, h, d_k)
        return x.transpose(1, 0, 2)  # (h, n, d_k)
    
    def forward(self, X, mask=None):
        """
        Args:
            X: Input sequence (n, d_model)
            mask: Optional mask (n, n)
        Returns:
            output: (n, d_model)
            all_weights: (n_heads, n, n) — attention weights per head
        """
        # Project to Q, K, V
        Q = X @ self.W_Q  # (n, d_model)
        K = X @ self.W_K
        V = X @ self.W_V
        
        # Split into h heads
        Q_heads = self.split_heads(Q)  # (h, n, d_k)
        K_heads = self.split_heads(K)
        V_heads = self.split_heads(V)
        
        # Compute attention for each head independently
        all_outputs = []
        all_weights = []
        for i in range(self.n_heads):
            out, w = scaled_dot_product_attention(
                Q_heads[i], K_heads[i], V_heads[i], mask
            )
            all_outputs.append(out)    # (n, d_k)
            all_weights.append(w)      # (n, n)
        
        # Concatenate heads: (n, h*d_k) = (n, d_model)
        concat = np.concatenate(all_outputs, axis=-1)
        
        # Final linear projection
        output = concat @ self.W_O  # (n, d_model)
        
        return output, np.array(all_weights)

# === DEMO ===
np.random.seed(42)
d_model, n_heads, seq_len = 32, 4, 6
mha = MultiHeadAttention(d_model, n_heads)

X = np.random.randn(seq_len, d_model)
output, attn_weights = mha.forward(X)

print(f"Input shape:  {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print(f"  (n_heads={n_heads}, seq_len={seq_len}, seq_len={seq_len})")

Input shape:  (6, 32)
Output shape: (6, 32)
Attention weights shape: (4, 6, 6)
  (n_heads=4, seq_len=6, seq_len=6)
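The explicit loop over heads is easy to read, but the same computation can be done for all heads at once, because NumPy's `@` operator broadcasts over leading axes. The sketch below is one possible vectorization, checked against a loop; `batched_heads_attention` is an illustrative name, and masking is omitted to keep it short.

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def batched_heads_attention(Q_heads, K_heads, V_heads):
    """Attention for all heads at once. Inputs have shape (h, n, d_k)."""
    d_k = Q_heads.shape[-1]
    # (h, n, d_k) @ (h, d_k, n) -> (h, n, n): matmul broadcasts over the head axis
    scores = Q_heads @ K_heads.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    out = weights @ V_heads                          # (h, n, d_k)
    # Merge heads back: (h, n, d_k) -> (n, h, d_k) -> (n, h*d_k)
    n = out.shape[1]
    return out.transpose(1, 0, 2).reshape(n, -1), weights

rng = np.random.default_rng(0)
h, n, d_k = 4, 6, 8
Qh, Kh, Vh = (rng.standard_normal((h, n, d_k)) for _ in range(3))
concat, weights = batched_heads_attention(Qh, Kh, Vh)

# Reference: the same computation with an explicit loop over heads
loop = np.concatenate(
    [softmax(Qh[i] @ Kh[i].T / np.sqrt(d_k)) @ Vh[i] for i in range(h)], axis=-1
)
print(concat.shape, np.allclose(concat, loop))
```

The transpose before the reshape matters: reshaping (h, n, d_k) directly to (n, h·d_k) would interleave rows from different heads instead of concatenating the heads for each token.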

13.3 Transformer Block (NumPy)

Python (NumPy)
def layer_norm(x, gamma, beta, eps=1e-6):
    """Layer Normalization."""
    mean = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta

def relu(x):
    return np.maximum(0, x)

class TransformerBlock:
    """A single Transformer encoder block from scratch."""
    
    def __init__(self, d_model, n_heads, d_ff):
        self.mha = MultiHeadAttention(d_model, n_heads)
        
        # LayerNorm parameters
        self.gamma1 = np.ones(d_model)
        self.beta1 = np.zeros(d_model)
        self.gamma2 = np.ones(d_model)
        self.beta2 = np.zeros(d_model)
        
        # Feed-Forward Network parameters
        scale1 = np.sqrt(2.0 / (d_model + d_ff))
        scale2 = np.sqrt(2.0 / (d_ff + d_model))
        self.W1 = np.random.randn(d_model, d_ff) * scale1
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * scale2
        self.b2 = np.zeros(d_model)
    
    def ffn(self, x):
        """Position-wise Feed-Forward Network."""
        return relu(x @ self.W1 + self.b1) @ self.W2 + self.b2
    
    def forward(self, x, mask=None):
        """
        Forward pass through one Transformer encoder block.
        Uses Post-Norm (original Transformer style).
        """
        # Sub-layer 1: Multi-Head Self-Attention + Residual + LayerNorm
        attn_output, attn_weights = self.mha.forward(x, mask)
        x = layer_norm(x + attn_output, self.gamma1, self.beta1)
        
        # Sub-layer 2: FFN + Residual + LayerNorm
        ffn_output = self.ffn(x)
        x = layer_norm(x + ffn_output, self.gamma2, self.beta2)
        
        return x, attn_weights

# === DEMO: Stack 6 layers like the original Transformer ===
np.random.seed(42)
d_model, n_heads, d_ff = 64, 8, 256
seq_len = 10

# Create 6 Transformer blocks
blocks = [TransformerBlock(d_model, n_heads, d_ff) for _ in range(6)]

# Simulate positional encoding + embedding
X = np.random.randn(seq_len, d_model) * 0.1  # simulated input

# Forward through all 6 layers
h = X
for i, block in enumerate(blocks):
    h, w = block.forward(h)
    print(f"Layer {i+1} output — mean: {h.mean():.4f}, std: {h.std():.4f}")

print(f"\nFinal output shape: {h.shape}")
print("✓ LayerNorm keeps values well-behaved across all 6 layers!")

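Layer 1 output — mean: 0.0000, std: 1.0000
Layer 2 output — mean: 0.0000, std: 1.0000
Layer 3 output — mean: 0.0000, std: 1.0000
Layer 4 output — mean: 0.0000, std: 1.0000
Layer 5 output — mean: 0.0000, std: 1.0000
Layer 6 output — mean: 0.0000, std: 1.0000

Final output shape: (10, 64)
✓ LayerNorm keeps values well-behaved across all 6 layers!

Because γ = 1 and β = 0 here, every row leaves the final LayerNorm of each block with mean 0 and unit variance, so the statistics are the same at every depth (the mean may print as -0.0000 because of floating-point rounding).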
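The block above uses Post-Norm, where LayerNorm is applied after the residual addition. The other common ordering, Pre-Norm, normalizes only the sublayer input and leaves the residual stream unnormalized. The sketch below contrasts the two, with a random `tanh` layer standing in for the attention or FFN sublayer; that stand-in and the function names are assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned scale/shift (gamma=1, beta=0)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm(x, sublayer):
    """Original Transformer ordering: normalize after the residual add."""
    return layer_norm(x + sublayer(x))

def pre_norm(x, sublayer):
    """GPT-2 style ordering: normalize the sublayer input only."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)) * 0.5      # stand-in for attention or FFN weights
sublayer = lambda t: np.tanh(t @ W)

x_post = x_pre = rng.standard_normal((10, 16))
for _ in range(6):
    x_post = post_norm(x_post, sublayer)
    x_pre = pre_norm(x_pre, sublayer)

print(f"post-norm: mean={x_post.mean():.4f}, std={x_post.std():.4f}")
print(f"pre-norm:  mean={x_pre.mean():.4f}, std={x_pre.std():.4f}")
```

With Post-Norm the output of every block is normalized by construction. With Pre-Norm the residual stream is free to drift, which is why Pre-Norm models typically add one final LayerNorm before the output head.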

13.4 Positional Encoding (NumPy)

Python (NumPy)
def sinusoidal_positional_encoding(max_len, d_model):
    """
    Generate sinusoidal positional encodings.
    
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]  # (max_len, 1)
    
    # Compute the division term: 10000^(2i/d_model)
    div_term = np.exp(
        np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model)
    )  # (d_model/2,)
    
    # Even dimensions: sine
    PE[:, 0::2] = np.sin(position * div_term)
    
    # Odd dimensions: cosine
    PE[:, 1::2] = np.cos(position * div_term)
    
    return PE

# === DEMO ===
PE = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(f"Positional Encoding shape: {PE.shape}")
print(f"PE[0] (position 0): {np.round(PE[0], 3)}")
print(f"PE[1] (position 1): {np.round(PE[1], 3)}")
print(f"\nValues are bounded: min={PE.min():.3f}, max={PE.max():.3f}")

# Verify: dot product between nearby positions is higher
print(f"\nSimilarity PE[0]·PE[1]: {PE[0] @ PE[1]:.3f}")
print(f"Similarity PE[0]·PE[5]: {PE[0] @ PE[5]:.3f}")
print(f"Similarity PE[0]·PE[25]: {PE[0] @ PE[25]:.3f}")
print("Nearby positions have higher similarity ✓")

Positional Encoding shape: (50, 16)
PE[0] (position 0): [0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1.]
PE[1] (position 1): [0.841 0.54 0.311 0.95 0.1 0.995 0.032 1. 0.01 1. 0.003 1. 0.001 1. 0. 1.]

Values are bounded: min=-1.000, max=1.000

Similarity PE[0]·PE[1]: 7.485
Similarity PE[0]·PE[5]: 6.137
Similarity PE[0]·PE[25]: 4.807
Nearby positions have higher similarity ✓
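The similarity check generalizes: since sin(a)·sin(b) + cos(a)·cos(b) = cos(a − b), the dot product of two sinusoidal encodings depends only on the offset between the positions, not on where they sit in the sequence. The sketch below verifies this numerically. It repeats the encoding function so that it runs on its own, and additionally returns the frequency vector, which is a small change made here for convenience.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * div_term)
    pe[:, 1::2] = np.cos(pos * div_term)
    return pe, div_term

PE, freqs = sinusoidal_positional_encoding(max_len=50, d_model=16)

k = 5  # offset between the two positions
for p in (0, 10, 40):
    print(f"PE[{p}]·PE[{p + k}] = {PE[p] @ PE[p + k]:.3f}")

# Closed form: the sum over frequencies of cos(k * w_i)
print(f"sum of cos(k·w_i) = {np.cos(k * freqs).sum():.3f}")
```

All four printed values agree, which is the property that lets attention scores between positional encodings reflect relative distance.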

Can you find the bug? A student implemented attention like this:

def broken_attention(Q, K, V):
    scores = Q @ K.T
    weights = softmax(scores, axis=0)   # <-- Bug here!
    return weights @ V

The model trains but gives terrible results. What's wrong?

Bug: axis=0 normalizes down columns instead of across rows. Each key gets weights summing to 1 (across queries), instead of each query getting weights summing to 1 (across keys). This means each query's attention weights don't form a probability distribution. The correct axis is axis=-1 (or axis=1 for 2D). Also missing: the √d_k scaling!
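The difference between the two axes is easy to see by printing row sums. This is a small sketch with random inputs; only the softmax axis differs between the two results.

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
scores = Q @ K.T / np.sqrt(Q.shape[-1])

wrong = softmax(scores, axis=0)    # normalizes over queries: columns sum to 1
right = softmax(scores, axis=-1)   # normalizes over keys: rows sum to 1

print("axis=0  row sums:", np.round(wrong.sum(axis=1), 3))
print("axis=-1 row sums:", np.round(right.sum(axis=1), 3))
```

A cheap guard against this bug is an assertion such as `np.allclose(weights.sum(axis=-1), 1.0)` right after the softmax.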

Section 14

PyTorch Implementation: Mini-GPT (Character-Level)

Python (PyTorch)
import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Hyperparameters ──
BLOCK_SIZE   = 64     # context window (max sequence length)
D_MODEL      = 128    # embedding dimension
N_HEADS      = 4      # number of attention heads
N_LAYERS     = 4      # number of Transformer blocks
D_FF         = 512    # feed-forward inner dimension
DROPOUT      = 0.1
LEARNING_RATE = 3e-4
MAX_ITERS    = 5000
BATCH_SIZE   = 32
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── Single Attention Head ──
class Head(nn.Module):
    """One head of causal self-attention."""
    def __init__(self, head_size):
        super().__init__()
        self.key   = nn.Linear(D_MODEL, head_size, bias=False)
        self.query = nn.Linear(D_MODEL, head_size, bias=False)
        self.value = nn.Linear(D_MODEL, head_size, bias=False)
        self.register_buffer('tril',
            torch.tril(torch.ones(BLOCK_SIZE, BLOCK_SIZE)))
        self.dropout = nn.Dropout(DROPOUT)
    
    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        
        # Compute attention scores, scaled by 1/sqrt(head_size), i.e. d_k (not C = D_MODEL)
        scores = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5)  # (B, T, T)
        scores = scores.masked_fill(
            self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        weights = F.softmax(scores, dim=-1)
        weights = self.dropout(weights)
        
        # Weighted sum of values
        v = self.value(x)       # (B, T, head_size)
        out = weights @ v       # (B, T, head_size)
        return out

# ── Multi-Head Attention ──
class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_heads)])
        self.proj = nn.Linear(D_MODEL, D_MODEL)
        self.dropout = nn.Dropout(DROPOUT)
    
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

# ── Feed-Forward Network ──
class FeedForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_MODEL, D_FF),
            nn.ReLU(),
            nn.Linear(D_FF, D_MODEL),
            nn.Dropout(DROPOUT),
        )
    def forward(self, x):
        return self.net(x)

# ── Transformer Block ──
class TransformerBlock(nn.Module):
    """Transformer block: communication (attention) + computation (FFN)."""
    def __init__(self):
        super().__init__()
        head_size = D_MODEL // N_HEADS
        self.sa = MultiHeadAttention(N_HEADS, head_size)
        self.ffn = FeedForward()
        self.ln1 = nn.LayerNorm(D_MODEL)
        self.ln2 = nn.LayerNorm(D_MODEL)
    
    def forward(self, x):
        # Pre-Norm residual connections (GPT-2 style)
        x = x + self.sa(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

# ── Mini-GPT Model ──
class MiniGPT(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, D_MODEL)
        self.position_embedding = nn.Embedding(BLOCK_SIZE, D_MODEL)
        self.blocks = nn.Sequential(
            *[TransformerBlock() for _ in range(N_LAYERS)]
        )
        self.ln_f = nn.LayerNorm(D_MODEL)
        self.lm_head = nn.Linear(D_MODEL, vocab_size)
    
    def forward(self, idx, targets=None):
        B, T = idx.shape
        
        # Token + Position embeddings
        tok_emb = self.token_embedding(idx)         # (B, T, D_MODEL)
        pos_emb = self.position_embedding(
            torch.arange(T, device=idx.device))      # (T, D_MODEL), on the same device as the input
        x = tok_emb + pos_emb                       # (B, T, D_MODEL)
        
        # Transformer blocks
        x = self.blocks(x)                           # (B, T, D_MODEL)
        x = self.ln_f(x)                             # (B, T, D_MODEL)
        logits = self.lm_head(x)                     # (B, T, vocab_size)
        
        # Compute loss if targets provided
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)
            targets_flat = targets.view(B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)
        
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        """Autoregressive generation."""
        for _ in range(max_new_tokens):
            # Crop to last BLOCK_SIZE tokens
            idx_cond = idx[:, -BLOCK_SIZE:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]  # last token's predictions
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, idx_next], dim=1)
        return idx

# ── Training Loop ──
# Load your text data
# text = open('input.txt', 'r').read()
# chars = sorted(list(set(text)))
# vocab_size = len(chars)
# stoi = {c: i for i, c in enumerate(chars)}
# itos = {i: c for c, i in stoi.items()}
# encode = lambda s: [stoi[c] for c in s]
# decode = lambda l: ''.join([itos[i] for i in l])

# model = MiniGPT(vocab_size).to(DEVICE)
# optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
# 
# for step in range(MAX_ITERS):
#     xb, yb = get_batch('train')  # random batch of (BATCH_SIZE, BLOCK_SIZE)
#     logits, loss = model(xb, yb)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
#     if step % 500 == 0:
#         print(f"Step {step}, Loss: {loss.item():.4f}")

# Print model size
vocab_size_demo = 65  # e.g., the character vocabulary of Tiny Shakespeare
model = MiniGPT(vocab_size_demo)
n_params = sum(p.numel() for p in model.parameters())
n_block_params = sum(p.numel() for p in model.blocks.parameters())
print(f"Mini-GPT Parameters: {n_params:,}")
print(f"  Token Embedding:    {vocab_size_demo * D_MODEL:,}")
print(f"  Position Embedding: {BLOCK_SIZE * D_MODEL:,}")
print(f"  Transformer Blocks: {n_block_params:,}")
print(f"  Final LayerNorm:    {2 * D_MODEL:,}")
print(f"  LM Head:            {D_MODEL * vocab_size_demo + vocab_size_demo:,}")

Mini-GPT Parameters: 816,705
  Token Embedding:    8,320
  Position Embedding: 8,192
  Transformer Blocks: 791,552
  Final LayerNorm:    256
  LM Head:            8,385

This ~817K-parameter model is tiny compared to GPT-3 (175B), but it can learn to generate Shakespeare-like text when trained on a ~1M-character Shakespeare corpus. It follows the same decoder-only design as the GPT family; the main difference is scale. Andrej Karpathy's "nanoGPT" takes the same approach and is an excellent resource for understanding GPT from scratch.
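The parameter count can also be derived by hand from the layer shapes, which is a useful cross-check on what PyTorch reports. The sketch below assumes the layout of the MiniGPT code in this section: key/query/value projections without bias, an output projection with bias, a two-layer FFN with biases, two LayerNorms per block, and an LM head with bias. `mini_gpt_param_count` is an illustrative helper.

```python
def mini_gpt_param_count(vocab_size, block_size, d_model, n_heads, n_layers, d_ff):
    """Closed-form parameter count for the MiniGPT layout described above."""
    head_size = d_model // n_heads
    attn = n_heads * 3 * d_model * head_size      # key/query/value, no bias
    proj = d_model * d_model + d_model            # output projection with bias
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    norms = 2 * (2 * d_model)                     # two LayerNorms (weight + bias)
    per_block = attn + proj + ffn + norms
    return {
        "token_embedding": vocab_size * d_model,
        "position_embedding": block_size * d_model,
        "blocks": n_layers * per_block,
        "final_layernorm": 2 * d_model,
        "lm_head": d_model * vocab_size + vocab_size,
    }

counts = mini_gpt_param_count(vocab_size=65, block_size=64, d_model=128,
                              n_heads=4, n_layers=4, d_ff=512)
for name, value in counts.items():
    print(f"{name:<20}{value:>10,}")
print(f"{'total':<20}{sum(counts.values()):>10,}")
```

The causal mask registered with `register_buffer` does not appear here because buffers are not trainable parameters.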

Section 15

Visual Aids

Self-Attention Computation Flow

SELF-ATTENTION: Step-by-Step Computation
════════════════════════════════════════

Input X (n×d_model)
   │
   ├──→ × W_Q ──→ Q (n×d_k) ─┐
   ├──→ × W_K ──→ K (n×d_k) ─┤
   └──→ × W_V ──→ V (n×d_v) ─┤
                             │
        ┌────────────────────┘
        │
   ┌────▼──────┐
   │  Q × K^T  │──→ (n×n) raw scores
   └────┬──────┘
        │
   ┌────▼──────┐
   │  ÷ √d_k   │──→ (n×n) scaled scores
   └────┬──────┘
        │
   ┌────▼──────┐
   │  (mask?)  │──→ (n×n) masked scores (optional)
   └────┬──────┘
        │
   ┌────▼──────┐
   │  softmax  │──→ (n×n) attention weights (rows sum to 1)
   └────┬──────┘
        │
   ┌────▼──────┐
   │weights × V│──→ (n×d_v) output
   └───────────┘

Multi-Head Attention Visualization

MULTI-HEAD ATTENTION (h=4 heads, d_model=256, d_k=64)
═══════════════════════════════════════════════════════

              Input X (n × 256)
                      │
    ┌───────────┬─────┴─────┬───────────┐
    ▼           ▼           ▼           ▼
┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐
│ Head 1 │  │ Head 2 │  │ Head 3 │  │ Head 4 │
│ Q₁K₁V₁ │  │ Q₂K₂V₂ │  │ Q₃K₃V₃ │  │ Q₄K₄V₄ │
│ (n×64) │  │ (n×64) │  │ (n×64) │  │ (n×64) │
│ syntax │  │ coref  │  │ posn.  │  │semantic│
└───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘
    │           │           │           │
    └───────────┴─────┬─────┴───────────┘
                      │
            Concatenate (n × 256)
                      │
                      ▼
        ┌───────────────────────────┐
        │    × W_O  (256 × 256)     │
        │  Final linear projection  │
        └─────────────┬─────────────┘
                      ▼
              Output (n × 256)

Complete Encoder Layer

ONE TRANSFORMER ENCODER LAYER
══════════════════════════════

      Input x
         │
         ├────────────┐
         ▼            │
 ┌───────────────┐    │
 │  Multi-Head   │    │  Residual
 │ Self-Attention│    │  Connection
 └───────┬───────┘    │
         │            │
      ┌──▼──┐         │
      │ ADD │◄────────┘
      └──┬──┘
         │
    ┌────▼────┐
    │LayerNorm│
    └────┬────┘
         │
         ├────────────┐
         ▼            │
    ┌─────────┐       │
    │   FFN   │       │  Residual
    │ W₁,ReLU │       │  Connection
    │   W₂    │       │
    └────┬────┘       │
         │            │
      ┌──▼──┐         │
      │ ADD │◄────────┘
      └──┬──┘
         │
    ┌────▼────┐
    │LayerNorm│
    └────┬────┘
         │
       Output

Section 16

Indian Industry Case Study: AI4Bharat IndicBERT

🇮🇳 IndicBERT — Building BERT for India's Languages

The Challenge

India has 22 official languages, written in 13 different scripts. Building a single language model that understands all of them faces unique challenges:

Script diversity: Hindi (Devanagari), Tamil (Tamil script), Telugu (Telugu script), Urdu (Nastaliq), English (Latin) — all use completely different character sets
Morphological complexity: Agglutinative languages like Tamil and Kannada create very long compound words, exploding vocabulary size
Code-mixing: Indian social media frequently mixes Hindi/English ("Hinglish"), Tamil/English, etc., often switching within a single sentence
Data scarcity: While Hindi has moderate web data, languages like Bodo, Dogri, Maithili have extremely limited digital text

The Solution: IndicBERT Architecture

Component	Choice	Rationale
Base model	ALBERT (A Lite BERT)	Parameter sharing across layers reduces model size → deployable on Indian mobile networks
Languages	11 Indian languages + English	Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu
Tokenizer	SentencePiece (128K vocab)	Script-agnostic subword tokenization handles every script in the corpus with one shared vocabulary
Pre-training data	IndicCorp (9B tokens)	Largest collection of Indian language text, curated from web crawls
Pre-training objective	MLM (Masked Language Model)	Standard BERT pre-training, adapted for multilingual setting

Results

NER (Named Entity Recognition): IndicBERT achieved state-of-the-art on 9/11 languages, beating multilingual BERT (mBERT) by 3-5 F1 points
Sentiment Analysis: 2-4% accuracy improvement over mBERT on Hindi and Bengali movie reviews
Cross-lingual transfer: Training on Hindi data and testing on Marathi worked well (both Devanagari script, similar grammar) — showing the model captures shared linguistic structure

Deployment Applications

Flipkart: Hindi product review classification
Government: Automated processing of RTI (Right to Information) requests across languages
ShareChat: Content moderation in 15 Indian languages
Koo (Indian Twitter alternative): Multilingual sentiment analysis

The IndicNLP Suite by AI4Bharat (IIT Madras) includes IndicBERT, IndicTrans (translation), IndicGLUE (benchmark), and IndicCorp (dataset) — a complete ecosystem for Indian NLP. If you want to contribute, check out ai4bharat.org. This is one of the most impactful open-source AI projects from India.

Section 17

US/Global Industry Case Study: The GPT Evolution

🇺🇸 OpenAI: From GPT-1 to GPT-4 — The Scaling Revolution

The Timeline

2018

GPT-1: 117M parameters, 12 layers, trained on BookCorpus (about 7,000 books). Showed that unsupervised pre-training followed by supervised fine-tuning works for NLP. Performance was decent but unremarkable. Key insight: CLM pre-training transfers to downstream tasks.

2019

GPT-2: 1.5B parameters, 48 layers, trained on WebText (40GB). Generated coherent paragraphs. OpenAI initially withheld the full model, citing concerns about misuse. Key insight: zero-shot task performance improves with scale.

2020

GPT-3: 175B parameters, 96 layers, 300B tokens of training data. Demonstrated few-shot learning: give it a few examples in the prompt, and it performs the task with no fine-tuning. Third-party estimates put the training compute cost at roughly $4.6M. Key insight: new abilities appear at scale (arithmetic, translation, code generation).

2022

InstructGPT / ChatGPT: GPT-3.5 fine-tuned with RLHF (Reinforcement Learning from Human Feedback). Three stages: (1) supervised fine-tuning on human demonstrations, (2) training a reward model on human comparisons, (3) optimizing against the reward model with PPO. Result: responses that are noticeably more helpful, harmless, and honest.

2023

GPT-4: Multimodal (text and image input). OpenAI has not disclosed the parameter count; unofficial estimates exceed 1T parameters with a mixture-of-experts (MoE) architecture. OpenAI reported a score around the 90th percentile on a simulated bar exam, along with strong results on medical licensing exams and coding problems. Key insight: multimodality + RLHF + scale yields large gains on professional and academic benchmarks.

Scaling Laws (Kaplan et al., 2020)

What Scales	Relationship	Implication
Parameters (N)	L ∝ N^-0.076	10× more params → loss × 0.84 (≈16% lower)
Dataset (D)	L ∝ D^-0.095	10× more data → loss × 0.80 (≈20% lower)
Compute (C)	L ∝ C^-0.050	10× more compute → loss × 0.89 (≈11% lower)

These power laws predict how performance improves with scale — they drove OpenAI's strategy of building ever-larger models. The relationships are smooth and predictable, meaning you can forecast how well a larger model will perform before training it.
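A power law L ∝ x^(-α) gives a multiplicative change in loss, not an additive one: scaling a resource by a factor s multiplies the loss by s^(-α). The sketch below applies this to the exponents quoted in the table; it assumes each resource is the only bottleneck, as in the single-variable fits.

```python
# Exponents as quoted in the table above (Kaplan et al., 2020)
exponents = {"parameters": 0.076, "dataset": 0.095, "compute": 0.050}

def loss_ratio(scale_factor, alpha):
    """L_new / L_old when one resource grows by scale_factor and L ∝ x^(-alpha)."""
    return scale_factor ** (-alpha)

for name, alpha in exponents.items():
    r = loss_ratio(10, alpha)
    print(f"10x {name:<10}: loss x {r:.3f} ({(1 - r) * 100:.0f}% lower)")
```

Because the exponents are small, each further 10× buys the same modest percentage reduction, which is why frontier models need orders of magnitude more resources for steady gains.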

RLHF: From Language Model to Assistant

Raw GPT-3 is powerful but unfocused — it might generate toxic content, hallucinate facts, or ramble. RLHF aligns it with human preferences:

Step 1 (SFT): Fine-tune on human-written ideal responses to prompts
Step 2 (Reward Model): Humans rank model outputs A > B > C; train a reward model to predict these rankings
Step 3 (PPO): Use the reward model as a signal to fine-tune GPT via reinforcement learning (Proximal Policy Optimization)
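Step 2 is commonly trained with a pairwise preference loss of the form -log σ(r_preferred − r_rejected), the form used in the InstructGPT paper. The sketch below evaluates that loss for single comparisons. The reward values are made-up numbers for illustration, and a real reward model would produce them from a neural network over (prompt, response) pairs.

```python
import math

def pairwise_preference_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)) for one human comparison."""
    diff = reward_preferred - reward_rejected
    # log1p(exp(-diff)) is algebraically equal to -log(sigmoid(diff))
    return math.log1p(math.exp(-diff))

print(pairwise_preference_loss(2.0, 0.0))   # ranking respected: small loss
print(pairwise_preference_loss(0.0, 0.0))   # undecided: log(2)
print(pairwise_preference_loss(0.0, 2.0))   # ranking violated: large loss
```

Minimizing this loss pushes the reward of the human-preferred response above the rejected one; a ranking A > B > C is handled by summing the loss over its pairs.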

Section 18

Common Misconceptions

❌ MYTH: "Transformers are just improved RNNs."

✅ TRUTH: Transformers are a fundamentally different architecture. They use NO recurrence. Instead, they process all positions in parallel through self-attention. The two architectures share the goal (sequence modeling) but differ in mechanism (attention vs. recurrence), training efficiency (parallel vs. sequential), and how they handle long-range dependencies (direct vs. through hidden states).

🔍 WHY IT MATTERS: This distinction is crucial for architecture selection and understanding computational complexity.

❌ MYTH: "The 'attention' in Transformers is the same as the attention used in seq2seq models (Bahdanau attention)."

✅ TRUTH: Bahdanau attention (2014) added attention on top of an RNN — the encoder was still an RNN. The Transformer's key innovation was self-attention (tokens attending to each other within the same sequence) and building the entire model from attention, with no recurrence at all.

🔍 WHY IT MATTERS: Understanding this history clarifies why the paper was called "Attention Is All You Need" — emphasis on "All."

❌ MYTH: "More attention heads always means better performance."

✅ TRUTH: Research (Michel et al., 2019) showed that many attention heads can be pruned after training without significant performance loss. Some heads learn redundant patterns. The optimal number of heads is task-dependent.

🔍 WHY IT MATTERS: For deployment, you can prune heads to reduce inference cost — critical for mobile deployment in India's constrained environments.

❌ MYTH: "Transformers understand language like humans do."

✅ TRUTH: Transformers learn statistical patterns in text — co-occurrence, syntax, some reasoning. They don't have grounded understanding. They can fail spectacularly on tasks requiring real-world knowledge, causal reasoning, or negation.

🔍 WHY IT MATTERS: Over-relying on LLMs without human oversight causes real-world failures. Critical applications (healthcare, legal) require human-in-the-loop systems.

❌ MYTH: "BERT and GPT use completely different architectures."

✅ TRUTH: Both use the same core building blocks (multi-head attention, FFN, LayerNorm, residual connections). The key differences are: (1) BERT uses bidirectional attention; GPT uses causal attention, and (2) BERT pre-trains with MLM; GPT pre-trains with CLM. The Transformer block itself is nearly identical.

🔍 WHY IT MATTERS: Once you understand the Transformer block, you understand both BERT and GPT — they're different configurations of the same architecture.

Section 19

GATE / Exam Corner

Key Formulas for Exams

Formula Sheet — Transformers

Scaled Dot-Product Attention: Attention(Q,K,V) = softmax(QK^T/√d_k)V
Multi-Head: MultiHead = Concat(head₁,...,head_h)W^O, head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Positional Encoding: PE(pos,2i) = sin(pos/10000^(2i/d)), PE(pos,2i+1) = cos(pos/10000^(2i/d))
FFN: FFN(x) = max(0, xW₁+b₁)W₂+b₂ where W₁: d→4d, W₂: 4d→d
LayerNorm: LN(x) = γ·(x-μ)/√(σ²+ε) + β
Self-Attention Complexity: Time O(n²d), Memory O(n²+nd)
Params in one MHA layer: 4·d_model² (Q, K, V projections + output projection, biases omitted)
Params in FFN: 2·d_model·d_ff + d_model + d_ff ≈ 8·d_model² (when d_ff = 4·d_model)

GATE-Style Problems

GATE Q1

In a Transformer with d_model = 512 and h = 8 attention heads, what is the dimension d_k of keys in each head?

Answer: (B) 64
d_k = d_model / h = 512 / 8 = 64. Each head operates in a 64-dimensional subspace.

Remember | GATE CS 2024

GATE Q2

What is the time complexity of computing self-attention for a sequence of length n with model dimension d?

(A) O(n·d)
(B) O(n·d²)
(C) O(n²·d)
(D) O(n²·d²)

Answer: (C) O(n²·d)
The bottleneck is QK^T: (n×d)·(d×n) = O(n²·d). Memory is O(n²) for the attention matrix.

Understand | GATE CS

GATE Q3

In the original Transformer (d_model=512, h=8, d_ff=2048, N=6), approximately how many parameters are in one encoder layer?

(A) ~500K
(B) ~1.5M
(C) ~3.1M
(D) ~6.3M

Answer: (C) ~3.1M
MHA: 4×512² = 1,048,576 ≈ 1.05M
FFN: 512×2048 + 2048 + 2048×512 + 512 = 2,099,712 ≈ 2.1M
LayerNorm (×2): 2×(512+512) = 2,048 ≈ 0.002M
Total: ~3.15M per layer, ~18.9M for 6 layers.

Apply | Numerical

GATE Q4

Why does the Transformer divide attention scores by √d_k before applying softmax?

(A) To normalize the output values to [0, 1]
(B) To reduce the number of parameters
(C) To prevent softmax saturation when d_k is large, ensuring informative gradients
(D) To make the model invariant to the choice of d_k

Answer: (C)
When d_k is large, dot products grow in magnitude (for zero-mean, unit-variance query and key components, the variance is d_k). Large inputs push softmax into saturation regions where gradients are extremely small, hindering learning. Dividing by √d_k brings the variance back to 1.

Understand | GATE DS/AI

GATE Q5

Which of the following is TRUE about BERT's pre-training?

(A) BERT uses causal (left-to-right) language modeling
(B) BERT masks 50% of input tokens during pre-training
(C) BERT uses bidirectional self-attention and Masked Language Modeling (15% masking)
(D) BERT's encoder layers use masked self-attention to prevent information leakage

Answer: (C)
BERT uses bidirectional self-attention (no causal mask) and MLM that masks 15% of tokens (80% [MASK], 10% random, 10% unchanged).

Remember | GATE DS/AI 2023

Prediction Table — Likely GATE Topics (2025-2027)

Topic	Probability	Question Type
Attention formula & √d_k scaling	⭐⭐⭐⭐⭐	MCQ / Numerical
Self-attention complexity O(n²d)	⭐⭐⭐⭐⭐	MCQ
BERT vs GPT differences	⭐⭐⭐⭐	MCQ
Transformer parameter counting	⭐⭐⭐⭐	Numerical
Positional encoding purpose	⭐⭐⭐	MCQ
Multi-head attention mechanics	⭐⭐⭐	MCQ
LayerNorm vs BatchNorm	⭐⭐⭐	MCQ

Section 20

Interview Prep

Conceptual Questions

🎤 "Explain self-attention in simple terms."

Strong answer:

"Self-attention lets every word in a sentence directly look at every other word and decide how relevant each one is. Think of it as a room full of people — instead of passing notes one by one (like RNNs), everyone can simultaneously turn to any other person and listen. Each word creates three vectors: a query (what am I looking for?), a key (what do I represent?), and a value (what info do I carry?). The attention score between two words is the dot product of the query and key, scaled by √d_k to prevent numerical issues, then softmaxed into weights. The output for each word is a weighted sum of all values. This creates context-aware representations where the meaning of 'bank' depends on whether the sentence is about finance or rivers."

🎤 "Why do Transformers beat RNNs?"

Strong answer:

"Three fundamental reasons. First, parallelism: RNNs process tokens sequentially — you can't compute the 100th hidden state without the 99th. Self-attention processes all positions simultaneously, making full use of GPU parallelism. Training speedups of 10-100× are common. Second, long-range dependencies: In an RNN, information from word 1 must survive passage through all intermediate states to reach word 100 — the vanishing gradient problem. In a Transformer, word 1 directly attends to word 100 in a single step — path length is O(1) vs O(n). Third, rich representations: Multi-head attention lets each word form multiple types of relationships simultaneously — syntactic, semantic, positional — while an RNN collapses everything into a single fixed-size hidden state."

🎤 "Walk me through the Transformer encoder step by step." (System Design)

Strong answer:

"Starting with raw tokens: (1) Each token is mapped to a d_model-dimensional vector via the embedding table. (2) Sinusoidal or learned positional encodings are added. (3) This goes into a stack of 6 identical layers. Each layer has two sub-layers. Sub-layer 1: Multi-head self-attention with h=8 parallel heads, each computing scaled dot-product attention in d_k=64 dimensions, then concatenating and projecting. A residual connection wraps it: output = LayerNorm(x + MultiHeadAttn(x)). Sub-layer 2: A position-wise feed-forward network — two linear layers with ReLU between (expanding from 512 to 2048 then back to 512). Again wrapped with residual + LayerNorm. After 6 such layers, each token's representation is a rich, contextual embedding of that token's meaning, informed by all other tokens."

Coding Questions

💻 "Implement scaled dot-product attention." (Top 3 ML coding question)

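import torch

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention; positions where mask == 0 are hidden."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)
    if mask is not None:
        # A large negative score becomes ~0 weight after softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V, weights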

Follow-ups to expect: "Add dropout to attention weights," "Make it batched," "Add causal masking," "What's the complexity?"

India-Specific Interview Angle

🇮🇳 Indian ML Interviews (TCS, Infosys AI, Flipkart)

Focus on BERT fine-tuning for Hindi/regional language NLP
"How would you handle code-mixed text?" → IndicBERT + subword tokenization
"How to deploy a Transformer on low-resource devices?" → DistilBERT, quantization, ONNX
GATE-style numerical problems on attention
Expect questions on IndicNLP, mBERT, XLM-R for Indian languages

🇺🇸 US ML Interviews (FAANG, OpenAI, Anthropic)

Deep understanding of architecture decisions
"Implement multi-head attention from scratch" (whiteboard)
"Design a Transformer for long documents" → efficient attention
"Compare Pre-Norm vs Post-Norm" and why GPT-2 switched
System design: "How would you serve a 70B LLM?"
Research paper discussions: FlashAttention, RoPE, MoE

Section 21

Hands-On Lab / Mini-Project

🔬 Mini-Project: Build and Train a Character-Level Mini-GPT

Objective

Build a mini-Transformer (GPT-style) that learns to generate text character by character. Train it on a text corpus (Shakespeare, Indian literature, or code) and generate new text.

Requirements

Implement the following from scratch using PyTorch:
- Single-head causal self-attention
- Multi-head attention (4 heads)
- Transformer block (attention + FFN + LayerNorm + residual)
- Complete GPT model (token/position embedding → N blocks → LM head)
Train on at least 1MB of text data for ≥5000 steps
Generate 500+ characters of coherent-looking text
Visualize attention patterns for at least 2 heads
Experiment: Compare 2-layer vs 6-layer models on the same data

Data Sources

Option A (Global): Shakespeare's complete works (~1.1MB) from tinyshakespeare
Option B (India): Hindi Wikipedia dump text or Premchand stories in Devanagari
Option C (Code): Python source files from GitHub

Rubric (100 points)

Component	Points	Criteria
Correct attention implementation	20	Attention formula correct, causal mask works, shapes verified
Multi-head attention	15	Multiple heads computed in parallel, concatenated correctly
Transformer block	15	Residual connections, LayerNorm, FFN all present and correct
Training pipeline	15	Proper data loading, batching, loss computation, optimization
Generation quality	15	Generated text shows learned patterns (words, formatting, some structure)
Attention visualization	10	Heatmap of attention weights for at least 2 heads, with interpretation
2-layer vs 6-layer comparison	10	Learning curves compared, sample quality compared, brief analysis

Bonus Challenges (+20 points each)

Bonus 1: Add temperature and top-k sampling to generation
Bonus 2: Train on Hindi text and handle Devanagari characters
Bonus 3: Implement Grouped-Query Attention (GQA) as used in LLaMA 2

Section 22

Exercises

Section A: Conceptual Questions (5)

A1 Beginner

Why is self-attention called "self" attention? How does it differ from cross-attention?

Answer: In self-attention, Q, K, and V all come from the same sequence — a token attends to other tokens in the same input. In cross-attention (encoder-decoder attention), Q comes from one sequence (decoder) while K and V come from a different sequence (encoder). Self-attention enables modeling relationships within a sentence; cross-attention enables alignment between two sequences (e.g., source and target in translation).

A2 Beginner

Explain why Transformers need positional encoding while RNNs do not.

Answer: RNNs process tokens sequentially, so position is inherently encoded by the order of processing (the hidden state at step t reflects all tokens up to t). Self-attention computes pairwise similarities between all tokens simultaneously and, without positional information, is permutation-equivariant: shuffling the input tokens simply shuffles the outputs in the same way, so each token's representation is unaffected by word order. Therefore, explicit positional information must be injected to preserve word order.

A3 Intermediate

Explain the role of the FFN (feed-forward network) in a Transformer block. If attention handles inter-token communication, what does the FFN do?

Answer: Attention handles "communication" between tokens — mixing information across positions. The FFN handles "computation" — processing each token's features independently. Think of attention as "gather information from other tokens" and FFN as "think about what you've gathered." Research suggests FFN layers act as key-value memories: the first layer detects patterns, the second layer maps them to output features. The expand-compress structure (512→2048→512) creates a higher-dimensional space where complex feature interactions can occur.

A4 Intermediate

Why does BERT use bidirectional attention while GPT uses causal attention? Isn't bidirectional always better?

Answer: BERT's goal is understanding (classification, NER, QA) — seeing both left and right context produces richer representations for these tasks. GPT's goal is generation — predicting the next token. During generation, future tokens don't exist yet, so looking at them would be "cheating" (data leakage). Bidirectional isn't always better: you can't generate text token-by-token with a bidirectional model (each token's representation depends on future tokens that haven't been generated). For generation tasks, causal attention is necessary.

A5 Intermediate

What is the purpose of the output projection W^O in multi-head attention? Why not just concatenate the heads?

Answer: Each head operates in a d_k-dimensional subspace. Concatenation combines these into d_model dimensions, but the representations from different heads haven't been "mixed" — each dimension still only carries information from one head. W^O is a learned linear transformation that allows the model to combine information across heads, creating output features that integrate multiple attention patterns. Without W^O, the model couldn't learn to weight or combine the contributions of different heads.

Section B: Mathematical Questions (8)

B1 Intermediate

Given Q = [[1, 0], [0, 1]], K = [[1, 1], [0, 1]], V = [[1, 2], [3, 4]], compute Attention(Q, K, V) with scaling (d_k = 2). Show all intermediate steps.

Hint: Follow the steps in order: QK^T → scale by √2 → softmax each row → multiply by V. QK^T = [[1,0],[1,1]], scaled = [[0.707,0],[0.707,0.707]], softmax row 1 = [exp(0.707)/(exp(0.707)+1), 1/(exp(0.707)+1)] ≈ [0.670, 0.330], etc.
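The hint's intermediate values can be checked numerically. The following is a minimal NumPy sketch of the same computation; the helper name softmax_rows is an illustrative choice, not something defined elsewhere in the chapter.

```python
import numpy as np

def softmax_rows(x):
    """Numerically stable softmax applied to each row."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 1.0], [0.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
d_k = Q.shape[-1]

scores = Q @ K.T                 # raw query-key similarities
scaled = scores / np.sqrt(d_k)   # divide by sqrt(2)
weights = softmax_rows(scaled)   # each row sums to 1
output = weights @ V             # weighted sum of the value rows

for name, arr in [("QK^T", scores), ("scaled", scaled),
                  ("weights", weights), ("output", output)]:
    print(name, np.round(arr, 3).tolist())
```

The second row of scaled scores has two equal entries, so its weights are [0.5, 0.5] and the second output row is simply the average of the two value rows, [2, 3]. The first output row comes out at approximately [1.660, 2.660].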

B2 Intermediate

Prove that if q_i ~ N(0, 1) and k_i ~ N(0, 1) independently, then Var(q·k) = d_k when q, k ∈ ℝ^d_k.

Hint: The dot product q·k = Σᵢ qᵢkᵢ. Since qᵢ and kᵢ are independent, Var(qᵢkᵢ) = E[qᵢ²kᵢ²] - (E[qᵢkᵢ])² = E[qᵢ²]E[kᵢ²] - 0 = 1·1 = 1. Since the d_k terms are independent, Var(q·k) = d_k · 1 = d_k.
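A quick Monte Carlo experiment is a useful sanity check on the proof. The sketch below assumes standard-normal components, as in the problem statement; the sample size, the seed, and the choice of d_k values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for repeatability (arbitrary)
n_samples = 20_000

results = {}
for d_k in (4, 64, 256):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = (q * k).sum(axis=1)            # one dot product per sample
    scaled = dots / np.sqrt(d_k)          # the scaling used in attention
    results[d_k] = (dots.var(), scaled.var())
    print(d_k, round(float(dots.var()), 2), round(float(scaled.var()), 3))
```

The empirical variance of the raw dot products should land close to d_k, while the variance after dividing by √d_k should stay close to 1 regardless of d_k.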

B3 Intermediate

Calculate the total number of trainable parameters in a single Transformer encoder layer with d_model = 256, h = 4, d_ff = 1024. Include LayerNorm parameters.

Answer: MHA: W_Q(256×256) + W_K(256×256) + W_V(256×256) + W_O(256×256) = 4×65,536 = 262,144. FFN: W₁(256×1024) + b₁(1024) + W₂(1024×256) + b₂(256) = 262,144 + 1,024 + 262,144 + 256 = 525,568. LayerNorm×2: 2×(256+256) = 1,024. Total = 262,144 + 525,568 + 1,024 = 788,736.
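The same count can be written as a small helper, which makes it easy to re-check other configurations. This is a sketch under the assumptions used in the answer: bias-free Q/K/V/O projections, biased FFN layers, and two LayerNorms with a scale and a shift vector each. The function name encoder_layer_params is illustrative.

```python
def encoder_layer_params(d_model, d_ff):
    """Count trainable parameters in one encoder layer.

    Assumes bias-free attention projections, biased FFN layers,
    and two LayerNorms (scale + shift each).
    """
    mha = 4 * d_model * d_model                       # W_Q, W_K, W_V, W_O
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
    layer_norm = 2 * (2 * d_model)
    return {"mha": mha, "ffn": ffn, "layer_norm": layer_norm,
            "total": mha + ffn + layer_norm}

b3 = encoder_layer_params(d_model=256, d_ff=1024)
print(b3)
```

Note that the number of heads h does not appear: splitting d_model across heads changes how the projection matrices are used, not how many weights they contain.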

B4 Advanced

Show that the sinusoidal positional encoding satisfies the linear relationship property: PE(pos+k) can be expressed as a linear transformation of PE(pos) for any fixed offset k.

Hint: For a single frequency ω: [sin(ω(pos+k)), cos(ω(pos+k))] = [sin(ωpos)cos(ωk) + cos(ωpos)sin(ωk), cos(ωpos)cos(ωk) - sin(ωpos)sin(ωk)] = [[cos(ωk), sin(ωk)], [-sin(ωk), cos(ωk)]] · [sin(ωpos), cos(ωpos)]^T. This is a rotation matrix, which is a linear transformation!
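The identity in the hint can also be verified numerically for one frequency. In this sketch the frequency index, the model dimension, and the offset k are arbitrary illustrative choices, and the helper names pe_pair and offset_matrix are hypothetical.

```python
import numpy as np

def pe_pair(pos, omega):
    """The (sin, cos) pair of the encoding at one frequency."""
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

def offset_matrix(k, omega):
    """Matrix mapping pe_pair(pos) to pe_pair(pos + k); it does not depend on pos."""
    c, s = np.cos(omega * k), np.sin(omega * k)
    return np.array([[c, s], [-s, c]])

omega = 1.0 / 10000 ** (2 * 3 / 512)   # frequency for i = 3, d = 512 (arbitrary)
k = 7                                   # fixed offset (arbitrary)
M = offset_matrix(k, omega)

# Largest deviation between M @ PE(pos) and PE(pos + k) over many positions
max_err = max(
    np.abs(M @ pe_pair(pos, omega) - pe_pair(pos + k, omega)).max()
    for pos in range(200)
)
print(max_err)
```

Because the same matrix M works for every position, a model can in principle learn to attend by relative offset using a fixed linear map on the encodings.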

B5 Intermediate

For a Transformer processing a sequence of length n = 1024 with d_model = 512, compute the FLOPs for one self-attention operation (QK^T and attention×V). How does this compare to processing the same sequence with an RNN (d_hidden = 512)?

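Answer: Counting 2 FLOPs per multiply-add, self-attention QK^T costs 2·n²·d = 2·1024²·512 ≈ 1.07B FLOPs. Attention×V: same, ≈ 1.07B. Total ≈ 2.15B FLOPs. RNN per step (hidden-to-hidden product only): 2·d² ≈ 524K FLOPs. For n=1024 steps: 1024·524K ≈ 537M FLOPs. At n=1024 with d=512, self-attention costs about 4× more (the ratio is 2n/d), so the two are within the same order of magnitude. But self-attention is parallel (O(1) sequential steps) while the RNN is sequential (O(n)).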
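The arithmetic is easy to reproduce in a few lines. The sketch below counts 2 FLOPs per multiply-add and deliberately ignores the softmax, the Q/K/V projections, and the RNN's input-to-hidden product; the function names are illustrative.

```python
def attention_flops(n, d):
    """FLOPs for QK^T plus weights @ V, at 2 FLOPs per multiply-add."""
    qk = 2 * n * n * d
    av = 2 * n * n * d
    return qk + av

def rnn_flops(n, d_hidden):
    """FLOPs for the hidden-to-hidden matrix-vector product over n steps."""
    return n * 2 * d_hidden * d_hidden

n, d = 1024, 512
attn = attention_flops(n, d)
rnn = rnn_flops(n, d)
print(attn, rnn, attn / rnn)
```

Under these counting assumptions the ratio simplifies to 2n/d, so attention becomes relatively more expensive as the sequence grows and relatively cheaper as the model gets wider.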

B6 Intermediate

In a causal attention mask for sequence length 4, write out the 4×4 mask matrix (before and after softmax). What does each row sum to?

Answer: Mask (True = attend): [[1,0,0,0],[1,1,0,0],[1,1,1,0],[1,1,1,1]]. After masking scores and softmax, each row sums to 1 (softmax normalization). Row 1: all mass on position 1. Row 4: mass distributed across all 4 positions.

B7 Advanced

GPT-3 has 175B parameters, 96 layers, d_model = 12,288, d_ff = 4·d_model, h = 96 heads. Verify the parameter count for one layer (MHA + FFN + LayerNorm) and check if 96 layers × per-layer params ≈ 175B.

Hint: MHA: 4·d² = 4·12288² = 604M. FFN: 8·d² + 5·d ≈ 1,208M. LN: 4·d ≈ 49K. Per layer ≈ 1.81B. 96 layers ≈ 174B. Plus embeddings: vocab_size(50257)·d ≈ 618M. Total ≈ 174.6B ✓
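The tally in the hint can be reproduced directly. This sketch follows the hint's simplifications (bias-free attention projections, token embeddings only, no position embeddings or final LayerNorm), so it is an approximation of the published parameter count rather than an exact accounting.

```python
d_model = 12288
n_layers = 96
vocab_size = 50257
d_ff = 4 * d_model

mha = 4 * d_model ** 2                      # Q, K, V, O projections (no biases)
ffn = 2 * d_model * d_ff + d_ff + d_model   # two linear layers with biases
layer_norm = 2 * 2 * d_model                # two LayerNorms, scale + shift
per_layer = mha + ffn + layer_norm
embeddings = vocab_size * d_model           # token embedding table only
total = n_layers * per_layer + embeddings

print(f"per layer: {per_layer / 1e9:.3f}B, total: {total / 1e9:.1f}B")
```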

B8 Intermediate

If we use multi-query attention (MQA) where K and V are shared across all h heads, by what factor do we reduce the KV cache memory during inference? (Assume h = 32 heads.)

Answer: Standard MHA stores K,V per head: 2 × h × n × d_k memory. MQA stores only 1 copy of K,V: 2 × 1 × n × d_k. Reduction factor = h = 32×. This is why MQA and GQA are critical for efficient LLM inference.
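To make the saving concrete, the cache size can be computed for each variant. In this sketch the head dimension, the sequence length, the 2-byte (fp16) storage, and the 8-group GQA configuration are illustrative assumptions; only h = 32 comes from the question.

```python
def kv_cache_bytes(n_tokens, n_kv_heads, d_k, n_layers=1, bytes_per_value=2):
    """Bytes needed to cache K and V (hence the factor 2) for one sequence."""
    return 2 * n_layers * n_kv_heads * n_tokens * d_k * bytes_per_value

h, d_k, n = 32, 128, 4096                         # d_k and n are assumed values
mha = kv_cache_bytes(n, n_kv_heads=h, d_k=d_k)    # one K/V pair per head
gqa = kv_cache_bytes(n, n_kv_heads=8, d_k=d_k)    # 8 shared K/V groups (hypothetical)
mqa = kv_cache_bytes(n, n_kv_heads=1, d_k=d_k)    # a single shared K/V pair

print(mha // mqa, mha // gqa)
```

The reduction factor depends only on the ratio of query heads to K/V heads, which is why GQA sits between MHA and MQA.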

Section C: Coding Questions (4)

C1 Intermediate

Implement a function causal_mask(n) that returns an n×n boolean mask where True means "mask this position" (future positions). Use it in your attention function.

Hint: mask = np.triu(np.ones((n, n), dtype=bool), k=1) or mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
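One possible NumPy solution is sketched below. It follows the convention stated in the exercise (True means "hide this position"), which is the opposite of a mask where 0 marks hidden positions; the function name masked_attention and the toy sizes are illustrative.

```python
import numpy as np

def causal_mask(n):
    """Boolean n x n mask; True marks future positions to hide."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention where mask == True means 'do not attend'."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(mask, -np.inf, scores)            # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                            # exp(-inf) is exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # 4 tokens, 8 features (arbitrary sizes)
out, w = masked_attention(x, x, x, causal_mask(4))
print(np.round(w, 3))
```

The printed weight matrix is lower-triangular: the first token can only attend to itself, so its output equals its own value vector.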

C2 Intermediate

Write a PyTorch function that computes the sinusoidal positional encoding matrix of shape (max_len, d_model). Verify that nearby positions have higher cosine similarity than distant positions.

Hint: Use torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)) for the per-dimension frequencies (the inverse wavelengths). Compute cosine similarity between PE[0] and PE[k] for k = 1, 5, 50, 100.
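The structure of a solution can be sketched in NumPy; a PyTorch version is a near line-for-line translation of the same expressions. The sizes max_len = 128 and d_model = 64 are arbitrary, and the helper names are illustrative.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encoding of shape (max_len, d_model); d_model must be even."""
    pos = np.arange(max_len)[:, None]                       # (max_len, 1)
    freqs = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)    # even dimensions
    pe[:, 1::2] = np.cos(pos * freqs)    # odd dimensions
    return pe

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pe = sinusoidal_pe(max_len=128, d_model=64)
sims = {k: cosine_similarity(pe[0], pe[k]) for k in (1, 5, 50, 100)}
print(sims)
```

Similarity to position 0 drops as the offset grows from 1 to 5 to 50. The decay is not guaranteed to be strictly monotonic at every offset, because the high-frequency dimensions oscillate.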

C3 Advanced

Modify the MiniGPT code to add temperature-controlled sampling and top-k sampling in the generate() method. Generate text at temperatures 0.5, 1.0, and 1.5, and compare the diversity and coherence.

Hint: Temperature: logits = logits / temperature before softmax. Top-k: v, ix = torch.topk(logits, k), set all other logits to -inf, then softmax over the remaining k options.
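The sampling logic itself does not depend on the model, so it can be prototyped on a plain logit vector first. The following is a NumPy sketch, not the MiniGPT generate() method; the function name sample_next and the toy logits are assumptions made for illustration.

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """Sample one token index from a 1-D array of logits.

    Returns the sampled index and the probability vector that was used.
    """
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        kth_largest = np.sort(logits)[-top_k]
        # Everything below the k-th largest logit gets zero probability
        # (ties with the k-th value are kept).
        logits = np.where(logits < kth_largest, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

toy_logits = np.array([2.0, 1.0, 0.5, -1.0])
rng = np.random.default_rng(0)
probs_by_temp = {}
for t in (0.5, 1.0, 1.5):
    _, probs_by_temp[t] = sample_next(toy_logits, temperature=t, rng=rng)
    print(t, np.round(probs_by_temp[t], 3))

greedy_idx, _ = sample_next(toy_logits, top_k=1, rng=rng)   # k = 1 is greedy decoding
_, top2_probs = sample_next(toy_logits, top_k=2, rng=rng)
```

Lower temperatures concentrate probability on the highest logit, and higher temperatures flatten the distribution. That is the diversity versus coherence trade-off the exercise asks you to observe.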

C4 Advanced

Implement Layer Normalization from scratch (without using nn.LayerNorm). Compare the output with PyTorch's built-in implementation to verify numerical correctness (difference should be < 1e-5).

Hint: Compute mean and variance along the last dimension. Apply: y = gamma * (x - mean) / sqrt(var + eps) + beta. Compare with nn.LayerNorm(d_model)(x) — the difference should be negligible.
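The formula in the hint can be sketched in NumPy as below. It uses the biased (population) variance and eps = 1e-5, which is what nn.LayerNorm uses by default. The input shape and the shift and scale applied to the random data are arbitrary.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last dimension, then apply a learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)      # biased (population) variance
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d_model = 16
x = rng.standard_normal((2, 5, d_model)) * 3.0 + 7.0   # (batch, seq, features)
y = layer_norm(x, gamma=np.ones(d_model), beta=np.zeros(d_model))

# With gamma = 1 and beta = 0, every token vector should have mean ~0 and variance ~1
print(np.abs(y.mean(axis=-1)).max(), y.var(axis=-1).mean())
```

For the exercise itself, port the function to PyTorch tensors. Take care with the variance: torch.var defaults to the unbiased estimate, so request the biased one before comparing against nn.LayerNorm.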

Section D: Critical Thinking (3)

D1 Advanced

Self-attention's O(n²) cost in sequence length makes long contexts expensive, and many standard models were limited to roughly 4K-8K tokens. But books are 100K+ tokens. Propose a Transformer architecture that can process a 100K-token novel. Discuss trade-offs.

Discussion points: Sliding window attention (local context), sparse attention patterns (Longformer), hierarchical approaches (process paragraphs then chapters), linear attention approximations (Performer), state-space models (Mamba), or retrieving relevant chunks via RAG. Each trades exactness for efficiency.

D2 Advanced

BERT pre-training masks 15% of tokens. Why not 50%? Why not 5%? What would happen in each case? Design an experiment to find the optimal masking rate.

Discussion points: 50%: too much context removed → the task becomes too hard, model can't learn meaningful patterns. 5%: too little → model rarely practices prediction, training is inefficient. 15% is a sweet spot balancing difficulty with training efficiency. An experiment would train models with masking rates [5%, 10%, 15%, 20%, 30%, 50%] and compare downstream task performance under the same compute budget.

D3 Advanced

In India, many users interact with AI in code-mixed language (e.g., "Mujhe ek accha restaurant suggest karo near CP"). What challenges does this pose for Transformer-based models, and how would you address them?

Discussion points: Challenges: vocabulary explosion (Hindi + English tokens), script switching (Devanagari + Latin), grammar from both languages, limited labeled code-mixed data. Solutions: multilingual pre-training (mBERT, XLM-R), script normalization (transliterate to one script), data augmentation via synthetic code-mixing, subword tokenization that handles both scripts, and fine-tuning on code-mixed datasets like GLUECoS.

★ Starred Research Questions (2)

★ R1 Advanced

Read the FlashAttention paper (Dao et al., 2022). Explain how tiling and recomputation avoid materializing the full n×n attention matrix in HBM. Why is this an IO-bound optimization rather than a compute-bound one?

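Key insight: Standard attention computes S=QK^T (write n² to HBM), then P=softmax(S) (read/write n²), then O=PV (read n², write nd). Total HBM access: O(n² + nd). FlashAttention tiles Q, K, V into blocks that fit in on-chip SRAM (about 192 KB per streaming multiprocessor on an A100, roughly 20 MB in total), computes partial attention per block with a running softmax, and never writes the full n×n matrix to slow HBM. The forward-pass FLOP count is essentially unchanged (the backward pass does some extra work because it recomputes attention blocks instead of storing them), but HBM reads/writes drop from O(n²) to O(n²d²/M), where M is the SRAM size. Since d² is typically much smaller than M, this is a large speedup: attention on modern GPUs is limited by memory bandwidth, not by compute.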
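The arithmetic behind the tiling (a softmax accumulated block by block with a running maximum and running sum) can be illustrated in NumPy. This is only a sketch of the idea: it tiles K and V but not Q, runs on the CPU, and says nothing about SRAM or kernel fusion, which is where the real speedup comes from. The function names and block size are illustrative.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference version that materializes the full n x n score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=16):
    """Attention computed one K/V block at a time with a running softmax."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, V.shape[-1]))     # unnormalized output accumulator
    row_max = np.full(n, -np.inf)        # running max of scores per query
    row_sum = np.zeros(n)                # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                        # only n x block scores exist
        new_max = np.maximum(row_max, S.max(axis=-1))
        correction = np.exp(row_max - new_max)        # rescales earlier partial sums
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * correction + P.sum(axis=-1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((50, 8)) for _ in range(3))
max_diff = np.abs(tiled_attention(Q, K, V, block=7) - naive_attention(Q, K, V)).max()
print(max_diff)
```

The two functions agree up to floating-point rounding even when the block size does not divide the sequence length, which shows that the full score matrix is never needed to obtain exact attention.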

★ R2 Advanced

State-Space Models (SSMs) like Mamba (Gu & Dao, 2023) claim to match Transformer quality with O(n) complexity. Research their mechanism and compare: what does Mamba gain, and what does it sacrifice compared to standard Transformers?

Key points: Mamba uses selective state spaces — input-dependent parameters that allow the model to selectively remember or forget information. Gains: O(n) compute and memory (vs O(n²)), faster inference (no KV cache growth), better for very long sequences. Sacrifices: less mature ecosystem, can't do "retrieval" as easily as attention (no explicit n×n pairwise comparison), some tasks (e.g., in-context learning with many examples) still favor Transformers. The field is actively debating whether SSMs can fully replace Transformers.

Section 23

Connections

🔗 How This Chapter Connects

← Builds On:

Chapter 14 (LSTMs & GRUs): You understood the limitations of recurrence — vanishing gradients and sequential processing. Transformers solve both.
Chapter 9 (Regularization): Dropout is used in attention weights and FFN layers. Layer Normalization extends the normalization concepts from Batch Normalization (Ch 10).
Chapter 6 (Backpropagation): Residual connections provide gradient highways, a concept you first met in deep networks.
Chapter 2 (Linear Algebra): The entire attention mechanism is matrix multiplication — Q·K^T, softmax, weights·V.

→ Enables:

Chapter 16 (GANs & VAEs): Transformer-based generators (like ViT-GANs) are replacing CNN-based GANs.
Chapter 17 (Applied CV): Vision Transformers (ViT) apply self-attention to image patches, rivaling CNNs.
Chapter 18 (Applied NLP): BERT, GPT, and T5 are the backbone of modern NLP — fine-tuning these models is the dominant paradigm.
Chapter 22 (Future & Ethics): Large Language Models raise critical ethical questions about bias, hallucination, and societal impact.

🔬 Research Frontier:

Mixture of Experts (MoE): Scaling Transformers to trillions of parameters by activating only a subset of parameters per token (Mixtral; reportedly GPT-4)
State-Space Models: Mamba and similar architectures challenge Transformers with O(n) complexity
Multimodal Transformers: Models that process text, images, audio, and video simultaneously (GPT-4V, Gemini)
Retrieval-Augmented Generation (RAG): Combining Transformers with external knowledge retrieval to reduce hallucination

🏭 Industry Implementation:

Google: T5/PaLM for Search, Translate, Gmail Smart Compose
OpenAI: GPT-4 for ChatGPT, API services
Meta: LLaMA (open-source LLM), NLLB (translation for 200 languages)
AI4Bharat: IndicBERT, IndicTrans, Bhashini (India's national language translation platform)
Hugging Face: Democratizing Transformer deployment with the transformers library

Section 24

Chapter Summary

Key Takeaways — Transformers and Attention

Attention is a soft dictionary lookup: Each token generates a query, compares it against all keys, and retrieves a weighted sum of values. This lets every position directly communicate with every other position — O(1) path length vs O(n) for RNNs.
The √d_k scaling is essential: Dot products in high dimensions have variance d_k. Without scaling, softmax saturates, killing gradients. Dividing by √d_k normalizes variance to 1.
Multi-head attention provides multiple perspectives: h parallel attention heads (each with d_k = d_model/h) learn different relationship types (syntactic, semantic, positional, coreference) at roughly the same computational cost as one full-dimensional head.
Positional encoding solves permutation invariance: Self-attention ignores token order. Sinusoidal encodings (or learned embeddings) inject position information. The sine/cosine choice enables encoding relative positions via rotation matrices.
The Transformer architecture is a composition of simple blocks: Each layer has Multi-Head Attention + FFN, wrapped with residual connections and Layer Normalization. Stack 6-96 of these layers, and you have a state-of-the-art model.
BERT (encoder, bidirectional, MLM) and GPT (decoder, causal, CLM) are two sides of the same Transformer coin: BERT excels at understanding tasks; GPT excels at generation. Both use identical building blocks, just configured differently.
Self-attention is O(n²d) — powerful but expensive: This quadratic scaling limits sequence length. Efficient variants (FlashAttention, Linformer, state-space models) are an active research frontier, essential for processing long documents and deploying on resource-constrained devices.

The Key Equation:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Key Intuition: The Transformer processes sequences not through step-by-step recurrence, but through layers of self-attention — letting every token directly query every other token, enabling parallelism, long-range dependencies, and the scaling that powers modern AI.

Section 25