Neural Networks & Deep Learning
Chapter 15: Transformers and Attention Mechanisms
Attention Is All You Need โ How One Architecture Replaced RNNs, CNNs, and Changed the World
โฑ๏ธ Reading Time: ~5 hours | ๐ Unit 5: Specialized Architectures | ๐ง Theory + Code Chapter
๐ Prerequisites: Chapter 14 (LSTMs & GRUs), Chapter 9 (Regularization), Linear Algebra (matrix multiplication, eigenvalues)
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| ๐ต Remember | Recall the scaled dot-product attention formula, multi-head attention structure, sinusoidal positional encoding equations, and Transformer encoder-decoder block components |
| ๐ต Understand | Explain why โd_k scaling prevents softmax saturation, how self-attention captures long-range dependencies without recurrence, and why positional encoding is necessary |
| ๐ข Apply | Implement single-head attention, multi-head attention, and a complete TransformerBlock from scratch in NumPy; build a character-level mini-GPT in PyTorch |
| ๐ก Analyze | Trace the information flow through a 6-layer Transformer encoder, analyze attention weight matrices to interpret what the model "looks at," compare O(nยฒd) self-attention vs O(n) recurrence |
| ๐ Evaluate | Critically compare BERT (encoder-only, MLM) vs GPT (decoder-only, CLM), evaluate when to choose efficient attention variants (Linformer, FlashAttention), justify architecture decisions for multilingual tasks |
| ๐ด Create | Design and train a mini-Transformer for character-level text generation; propose an architecture for an Indian multilingual translation system |
Learning Objectives
By the end of this chapter, you will be able to:
- Derive the attention mechanism from first principles โ starting from the intuition of "soft dictionary lookup" and arriving at Attention(Q,K,V) = softmax(QKT/โd_k)V
- Prove why the โd_k scaling factor is necessary by computing the variance of dot products in high-dimensional spaces
- Implement single-head attention, multi-head attention, and a complete Transformer block from scratch in NumPy
- Explain sinusoidal positional encoding โ derive the formulas, understand why sine and cosine encode relative positions, and compare with learned positional embeddings
- Diagram the complete Transformer architecture (encoder + decoder, 6 layers each) including masked multi-head attention, cross-attention, feed-forward networks, and residual connections
- Compare encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) architectures โ their pre-training objectives, strengths, and applications
- Analyze self-attention complexity O(nยฒd) and explain efficient attention variants: Linformer, Performer, FlashAttention
- Build a character-level mini-GPT in PyTorch that generates text
- Discuss real-world deployments: AI4Bharat IndicBERT for 11 Indian languages and the OpenAI GPT-1โ4 evolution
- Solve GATE-style numerical problems on attention computation and Transformer parameter counting
Opening Hook
๐ฎ The Paper That Changed Everything
In June 2017, a team at Google published "Attention Is All You Need." The paper's modest title hid a revolution: they showed you could throw away RNNs entirely and build sequence models purely from attention. This single architecture now powers GPT-4, BERT, Whisper, DALL-E, and virtually every state-of-the-art AI system.
But here's what made it truly special. For decades, the dominant paradigm for sequence modeling was recurrence โ processing tokens one by one, like reading a book word by word, never allowed to skip ahead. Transformers broke this tyranny. Instead of sequential processing, they let every word in a sentence look at every other word simultaneously. A 500-word paragraph? The Transformer sees all 500 words at once, computes how each word relates to every other word, and does it all in a single matrix multiplication.
The results were immediate and dramatic. Translation quality jumped. Training speed exploded โ because attention is parallelizable on GPUs, unlike recurrence. Within two years, Google replaced its entire translation pipeline with Transformers. Within five years, GPT-3 showed that a Transformer with 175 billion parameters could write essays, code, and poetry that fooled humans.
In India, the Transformer revolution hit a unique problem: 22 official languages, 13 different scripts, rampant code-mixing. Teams at AI4Bharat and IIT Madras built IndicBERT โ a BERT model trained on 11 Indian languages โ proving that the Transformer's architecture transcends any single language.
Today, you will understand exactly how this architecture works โ from the ground up.
The Intuition First โ Why Attention?
The Library Analogy ๐
Imagine you walk into a massive library looking for information about "climate change effects on agriculture." Here's what you do:
- Your Query (Q): You have a question in your mind โ "climate change effects on agriculture." This is your query.
- Book Labels / Keys (K): Every book on the shelf has a title and index โ "Organic Farming Techniques," "Climate Science 101," "History of Ancient Rome." These are keys that you compare against your query.
- Book Contents / Values (V): Each book contains actual content โ facts, figures, analysis. These are the values you want to retrieve.
- Matching: You mentally compare your query against each book's key. "Climate Science 101" gets high relevance. "History of Ancient Rome" gets near-zero relevance.
- Weighted Retrieval: You don't just read one book โ you read a weighted combination, spending 60% of your time on the climate book, 30% on the farming book, and 10% skimming others.
output = ฮฃแตข similarity(query, key_i) ร value_i
This is exactly what the attention mechanism does. Every token in a sentence generates a query ("What am I looking for?"), a key ("What do I contain?"), and a value ("What information do I carry?"). The magic: every token gets to "look up" every other token and retrieve a weighted combination of their information.
Why RNNs Failed: The Bottleneck Problem ๐ฌ
In Chapter 14, you learned that LSTMs and GRUs mitigate the vanishing gradient problem. But they still have a fundamental limitation: information must flow sequentially. If word 1 needs information from word 100, that information must survive passage through 99 intermediate states. Even with gating, this bottleneck causes information loss.
The "Aha!" Question
"What if, instead of processing a sequence step by step, we let every position in the sequence attend to every other position simultaneously โ and used nothing but this attention operation to build the entire model?"
The original Transformer paper (Vaswani et al., 2017) had 8 authors. The name "Transformer" was chosen because the model transforms input representations through layers of attention. The working title was reportedly much less catchy.
15.1 The Attention Mechanism โ Derived from Scratch
Step 1: The Simplest Attention โ Dot-Product Similarity
You have a sequence of n tokens, each represented as a d-dimensional vector. Let's call these vectors xโ, xโ, ..., xโ, each โ โd. Stack them into a matrix X โ โnรd.
The most basic question of attention: "How much should token i pay attention to token j?"
The simplest answer? Take the dot product of their vectors:
If two vectors point in similar directions, the dot product is large (high attention). If they're orthogonal, the dot product is zero (no attention). This is just cosine similarity scaled by magnitude.
Step 2: From Raw Scores to Probabilities โ Softmax
The raw scores e_ij can be any real number โ positive, negative, huge, tiny. You need them to be non-negative and sum to 1 (a probability distribution). The softmax function does exactly this:
Now ฮฑ_ij represents "the fraction of attention that token i pays to token j." These are called attention weights.
Step 3: Weighted Sum โ The Output
Once you know how much token i attends to every other token, you compute a weighted sum of all token representations:
This gives token i a new representation that's a blend of all other tokens, weighted by relevance. Beautiful, isn't it? But there's a problem...
Step 4: The Q-K-V Projection โ Why We Need Three Separate Roles
In the naive version above, every token uses the same vector for both "querying" and "being queried." This is like a library where the same text serves as both the book's title and its content โ confusing!
The solution: project each token into three different spaces using learned weight matrices:
Given input X โ โnรd_model, we create three projections:
- Query: Q = X ยท WQ where WQ โ โd_model ร d_k
- Key: K = X ยท WK where WK โ โd_model ร d_k
- Value: V = X ยท WV where WV โ โd_model ร d_v
Intuition:
- Q (Query) = "What am I looking for?" โ the question each token asks
- K (Key) = "What do I contain?" โ how each token advertises itself
- V (Value) = "What information do I carry?" โ the actual content to retrieve
This separation is crucial: a word might be highly relevant as a key to certain queries but carry different information as a value. For example, the word "bank" as a key signals "financial" or "river," and different value projections can encode different aspects of its meaning depending on context.
Step 5: Putting It All Together โ Raw Attention
Now combine everything. For each query, compute similarity with all keys, normalize with softmax, and retrieve weighted values:
Matrix dimensions check:
- Q โ โnรd_k, KT โ โd_kรn โ QยทKT โ โnรn (attention scores matrix)
- After softmax: still โnรn (each row sums to 1)
- V โ โnรd_v โ output โ โnรd_v โ
This is almost the final formula. But there's one critical missing piece...
โ MYTH: "The Q, K, V matrices are fixed and predetermined."
โ TRUTH: WQ, WK, WV are learned weight matrices, trained via backpropagation just like any other neural network parameter. The network learns what to query for and what to advertise as keys.
๐ WHY IT MATTERS: This is what makes attention so powerful โ the model doesn't just compute fixed similarities; it learns which aspects of tokens to compare.
15.2 Scaled Dot-Product Attention โ The โd_k Derivation
The Problem: Why Naive Dot Products Blow Up
Here's a subtle but critical issue. Consider a query vector q and a key vector k, each with d_k independent components drawn from a standard normal distribution (mean 0, variance 1).
The dot product is:
Deriving the variance of the dot product:
Each q_i and k_i are independent with mean 0 and variance 1.
The product q_i ยท k_i has:
- ๐ผ[q_i ยท k_i] = ๐ผ[q_i] ยท ๐ผ[k_i] = 0 ยท 0 = 0 (by independence)
- Var(q_i ยท k_i) = ๐ผ[q_iยฒ ยท k_iยฒ] - (๐ผ[q_i ยท k_i])ยฒ
- = ๐ผ[q_iยฒ] ยท ๐ผ[k_iยฒ] - 0 (by independence)
- = Var(q_i) ยท Var(k_i) = 1 ยท 1 = 1
The dot product is a sum of d_k such independent terms:
- ๐ผ[q ยท k] = 0
- Var(q ยท k) = d_k
So the standard deviation of the dot product is โd_k.
If d_k = 512, the dot products will have std โ 22.6! These huge values, when fed into softmax, push the output toward one-hot vectors (all mass on one element), creating extremely small gradients (softmax saturation).
The Solution: Scale by โd_k
To bring the variance back to 1 (well-behaved for softmax), divide by โd_k:
After scaling:
- Each element of QKT/โd_k has variance โ 1
- Softmax operates in a region with informative gradients
- The model can learn nuanced attention distributions instead of "all or nothing"
Q: Why does scaled dot-product attention divide by โd_k?
A: For vectors with i.i.d. components of variance 1, the dot product has variance d_k. Dividing by โd_k normalizes the variance to 1, preventing softmax saturation and ensuring healthy gradients during training.
Formula: Attention(Q,K,V) = softmax(QKT/โd_k)V
Masked Attention: Preventing the Future from Leaking
In language generation (like GPT), when predicting the next word, you must NOT let the model peek at future words. This is enforced by masking:
This makes softmax(โโ) = 0, effectively zeroing out future positions.
In PyTorch, masking is typically done by adding a large negative number (like -1e9) to the positions you want to mask, rather than actual -โ. This avoids NaN issues while achieving the same effect after softmax.
15.3 Multi-Head Attention โ Why Multiple Heads?
The Problem with Single-Head Attention
A single attention head computes one set of attention weights. But a word can have multiple types of relationships simultaneously:
- "The cat sat on the mat because it was tired" โ "it" needs to attend to "cat" (coreference resolution)
- But "it" also needs to attend to "sat" (syntactic dependency) and "tired" (semantic link)
A single attention head must average all these relationships into one set of weights โ a lossy compromise.
The Solution: Multiple Parallel Heads
Instead of one attention function with d_model-dimensional keys, queries, and values, run h parallel attention heads, each with d_k = d_model/h dimensions:
Multi-Head Attention, step by step:
Step 1: For each head i โ {1, ..., h}, create separate projections:
- Q_i = X ยท W_iQ where W_iQ โ โd_model ร d_k, d_k = d_model / h
- K_i = X ยท W_iK where W_iK โ โd_model ร d_k
- V_i = X ยท W_iV where W_iV โ โd_model ร d_v, d_v = d_model / h
Step 2: Compute attention independently for each head:
head_i = Attention(Q_i, K_i, V_i) โ โn ร d_v
Step 3: Concatenate all heads:
MultiHead = Concat(head_1, head_2, ..., head_h) โ โn ร (hยทd_v) = โn ร d_model
Step 4: Final linear projection:
MultiHeadAttention(X) = MultiHead ยท WO where WO โ โd_model ร d_model
where head_i = Attention(QW_iQ, KW_iK, VW_iV)
Parameter Count Check
With h=8 heads and d_model=512 (as in the original Transformer):
- d_k = d_v = 512/8 = 64 per head
- Per head: WQ(512ร64) + WK(512ร64) + WV(512ร64) = 3 ร 32,768 = 98,304
- All 8 heads: 8 ร 98,304 = 786,432
- Output projection WO: 512 ร 512 = 262,144
- Total: 1,048,576 โ 1M parameters (same as one large single-head attention!)
The computational cost of multi-head attention with h heads of d_k dimensions each is identical to single-head attention with full d_model dimensions! You get multiple attention patterns for free โ no additional compute. This is because h ร d_k = d_model.
What Do Different Heads Learn?
Research has shown that different heads naturally specialize:
| Head Type | What It Learns | Example |
|---|---|---|
| Syntactic Head | Subject-verb agreement | "The dogs [that barked] were loud" |
| Positional Head | Attend to adjacent tokens | "New York" โ bigram patterns |
| Coreference Head | Pronoun resolution | "Alice said she would come" |
| Semantic Head | Meaning similarity | "happy" attends to "joy," "glad" |
"A Multiscale Visualization of Attention" (Vig, 2019) โ Jesse Vig built BertViz, a tool to visualize attention patterns across heads and layers. The research showed that lower layers capture syntactic patterns while upper layers capture semantic relationships. This work enabled researchers to interpret what Transformers learn, rather than treating them as black boxes.
15.4 Positional Encoding โ Teaching Position to a Permutation-Invariant Model
The Problem: Self-Attention Ignores Order
Here's a subtle but devastating issue. Consider the sentences:
- "The dog chased the cat" (meaning A)
- "The cat chased the dog" (meaning B โ completely different!)
Self-attention computes pairwise similarities between tokens. Since these sentences have exactly the same tokens (just reordered), self-attention produces identical outputs. The attention mechanism is permutation-invariant โ it doesn't care about word order!
This is clearly unacceptable. We must inject positional information somehow.
Solution: Sinusoidal Positional Encoding
Deriving the sinusoidal encoding:
We need a function PE(pos, i) that maps position pos and dimension i to a real number. Our desiderata:
- Unique: Each position gets a unique encoding
- Bounded: Values stay in [-1, 1] regardless of sequence length
- Relative: PE(pos+k) should be expressible as a linear function of PE(pos) โ enabling the model to learn relative positions
- Generalizable: Should work for sequences longer than those seen during training
Sine and cosine functions satisfy ALL four criteria!
The encoding formula:
PE(pos, 2i) = sin(pos / 100002i/d_model)
PE(pos, 2i+1) = cos(pos / 100002i/d_model)
where pos is the token position and i is the dimension index.
Why this works โ the linear relationship property:
For any fixed offset k, there exists a linear transformation M_k such that:
[sin(ฯยท(pos+k)), cos(ฯยท(pos+k))] = M_k ยท [sin(ฯยทpos), cos(ฯยทpos)]
where M_k is a rotation matrix:
M_k = [[cos(ฯk), sin(ฯk)], [-sin(ฯk), cos(ฯk)]]
This means the model can learn to attend to relative positions (e.g., "3 tokens back") through a simple linear transformation of the positional encodings. No absolute position memorization needed!
PE(pos, 2i+1) = cos(pos / 100002i/d_model)
Wavelength Intuition
Each dimension pair (2i, 2i+1) creates a sine-cosine wave with a different wavelength:
- Dimension 0,1: wavelength = 2ฯ โ 6.28 positions (fastest oscillation)
- Dimension 2,3: wavelength = 2ฯ ร 100002/512 โ 7.4 positions
- ...
- Dimension 510,511: wavelength โ 2ฯ ร 10000 โ 62,832 positions (slowest)
Think of it like a binary clock: the least significant bit oscillates fastest, the most significant bit oscillates slowest. Each position gets a unique "binary-like" representation from combining all these waves.
Learned vs. Sinusoidal Positional Encoding
| Feature | Sinusoidal (Original Transformer) | Learned (BERT, GPT-2) |
|---|---|---|
| Parameters | 0 (computed analytically) | max_len ร d_model trainable params |
| Generalization | Can extrapolate to longer sequences | Cannot handle sequences longer than max_len |
| Relative positions | Encodes relative position via rotation | Must learn relative patterns from data |
| Performance | Comparable to learned in most tasks | Slightly better when max_len is known |
| Used in | Original Transformer, some variants | BERT, GPT-2/3, most modern models |
"RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021) โ This paper introduced Rotary Position Embedding (RoPE), which encodes position by rotating the query and key vectors. RoPE naturally encodes relative positions, generalizes to longer sequences, and is now used in LLaMA, PaLM, and most modern large language models. It essentially applies the rotation matrix M_k directly to Q and K before computing attention.
15.5 The Full Transformer Architecture
The Big Picture
Encoder Block (in detail)
Each of the 6 encoder layers contains exactly two sub-layers:
๐น Encoder Sub-Layer 1: Multi-Head Self-Attention
๐น Encoder Sub-Layer 2: Position-wise Feed-Forward Network
Decoder Block (in detail)
Each of the 6 decoder layers has three sub-layers:
๐ธ Decoder Sub-Layer 1: Masked Multi-Head Self-Attention
๐ธ Decoder Sub-Layer 2: Cross-Attention (Encoder-Decoder Attention)
๐ธ Decoder Sub-Layer 3: Position-wise FFN
Transformer Hyperparameters (Original Paper)
| Hyperparameter | Symbol | Value |
|---|---|---|
| Model dimension | d_model | 512 |
| Number of heads | h | 8 |
| Key/Value dimension per head | d_k = d_v | 64 |
| FFN inner dimension | d_ff | 2048 |
| Number of encoder layers | N | 6 |
| Number of decoder layers | N | 6 |
| Dropout rate | p_drop | 0.1 |
| Vocabulary size | V | ~37,000 (BPE) |
| Total parameters | โ | ~65M (base), ~213M (big) |
Job Roles: Transformer expertise is required for: NLP Engineer, LLM Research Scientist, Applied ML Engineer, ML Infrastructure Engineer (optimizing Transformer inference), AI4Bharat Research Fellow. Typical salaries in India: โน18-60 LPA; in the US: $120K-250K+.
15.6 Layer Normalization and Residual Connections
Residual Connections: The Gradient Highway
You already met residual connections in Chapter 11 (ResNets). The Transformer uses them identically:
Why they're essential: With 6 stacked layers, each containing 2-3 sub-layers, gradients must flow through 12-18 transformations. Without skip connections, gradients would vanish or explode. The residual path provides a direct gradient highway from the output back to the input.
Layer Normalization (Not Batch Normalization!)
The Transformer uses Layer Normalization, not Batch Normalization. Here's why and how:
LayerNorm vs BatchNorm:
BatchNorm normalizes across the batch dimension: for each feature, compute mean and variance across all examples in the batch.
LayerNorm normalizes across the feature dimension: for each example (each token), compute mean and variance across all features (d_model dimensions).
For a single token vector x โ โd_model:
ฮผ = (1/d_model) ฮฃแตข xแตข
ฯยฒ = (1/d_model) ฮฃแตข (xแตข - ฮผ)ยฒ
LayerNorm(x) = ฮณ โ (x - ฮผ)/(ฯ + ฮต) + ฮฒ
where ฮณ, ฮฒ โ โd_model are learned scale and shift parameters, and ฮต โ 1e-6 for numerical stability.
Why LayerNorm, Not BatchNorm?
| Reason | Explanation |
|---|---|
| Variable sequence lengths | BatchNorm would need to compute statistics across positions, but sequences have different lengths โ padding tokens would corrupt statistics |
| Small batch sizes | In NLP training, batch sizes are often small (due to memory). BatchNorm's statistics become noisy with small batches. |
| Inference consistency | LayerNorm's behavior is identical during training and inference (no running mean/variance needed) |
| Sequential generation | During autoregressive generation, tokens are produced one at a time โ no "batch" to normalize over |
Pre-Norm vs Post-Norm
The original Transformer uses Post-Norm: output = LayerNorm(x + SubLayer(x))
Modern Transformers (GPT-2+) use Pre-Norm: output = x + SubLayer(LayerNorm(x))
Pre-Norm trains more stably because the residual path carries unnormalized values, maintaining the gradient highway. Post-Norm can cause gradient issues in very deep networks (>12 layers). GPT-2, GPT-3, and LLaMA all use Pre-Norm.
โ MYTH: "Transformers use Batch Normalization like CNNs do."
โ TRUTH: Transformers use Layer Normalization. BatchNorm normalizes across the batch; LayerNorm normalizes across features within a single example.
๐ WHY IT MATTERS: Using BatchNorm in a Transformer would break autoregressive generation and perform poorly with variable-length sequences. This is a common GATE/interview trap question.
15.7 BERT vs GPT โ Encoder-Only vs Decoder-Only
Three Flavors of Transformers
BERT: Bidirectional Encoder Representations from Transformers
๐ BERT (Devlin et al., 2018)
Randomly mask 15% of input tokens, then predict the masked tokens.
Example: "The [MASK] sat on the [MASK]" โ predict "cat" and "mat"
Of the 15% selected: 80% replaced with [MASK], 10% with a random word, 10% kept unchanged.
Pre-training Objective 2 โ Next Sentence Prediction (NSP):Given two sentences, predict whether B actually follows A in the corpus.
50% positive pairs (consecutive sentences), 50% negative (random sentences).
Why it works: By seeing context from BOTH directions, BERT builds deep bidirectional representations โ unlike GPT which only sees leftward context. This makes BERT excellent for understanding tasks (classification, NER, QA).GPT: Generative Pre-trained Transformer
๐ GPT (Radford et al., 2018)
Predict the next token given all previous tokens.
P(x_t | x_1, x_2, ..., x_{t-1})
Example: "The cat sat on the" โ predict "mat"
Why it works: CLM naturally enables text generation โ just keep predicting the next token. And because the model learns to predict ANY next token in billions of sentences, it implicitly learns grammar, facts, reasoning, and even some code. Scaling law: GPT demonstrated that performance improves predictably with model size, data size, and compute โ leading to the "scaling laws" paradigm.Head-to-Head Comparison
| Feature | BERT (Encoder-only) | GPT (Decoder-only) |
|---|---|---|
| Attention type | Bidirectional (full) | Causal (left-to-right) |
| Pre-training | MLM + NSP | CLM (next token prediction) |
| Fine-tuning | Add task-specific head | Prompt-based / fine-tune |
| Generation | Poor (not designed for it) | Excellent |
| Understanding | Excellent (bidirectional context) | Good (but unidirectional) |
| Classification | Excellent ([CLS] token) | Good (last token) |
| Parameters (base) | 110M (BERT-base) | 117M (GPT-1), 175B (GPT-3) |
| Notable variants | RoBERTa, ALBERT, DistilBERT | GPT-2/3/4, LLaMA, Mistral |
๐ฎ๐ณ India: BERT Variants
- IndicBERT โ 11 Indian languages, Albert-based
- MuRIL (Google) โ Multilingual for Indian langs
- Indic-BERT by AI4Bharat โ handles code-mixing
- Challenges: 13 scripts, agglutinative morphology, limited digital data for many languages
- Use cases: Flipkart review classification, gov't document processing, Aadhaar NLP
๐บ๐ธ USA: GPT Evolution
- GPT-1 (2018) โ 117M params, proved CLM pre-training works
- GPT-2 (2019) โ 1.5B params, "too dangerous to release"
- GPT-3 (2020) โ 175B params, few-shot learning
- GPT-4 (2023) โ Multimodal, >1T params (est.)
- Key insight: scaling compute + data โ emergent capabilities (reasoning, code, multilingual)
15.8 Self-Attention Complexity and Efficient Attention
The Quadratic Bottleneck
Computing the complexity of self-attention:
Given a sequence of length n and model dimension d:
- Q, K, V projections: Three matrix multiplications of (nรd) ยท (dรd) = O(nยทdยฒ) each โ O(3nยทdยฒ)
- QKT: Matrix multiplication of (nรd) ยท (dรn) = O(nยฒยทd) โ the bottleneck!
- Softmax: Applied to nรn matrix โ O(nยฒ)
- Attention ร V: (nรn) ยท (nรd) = O(nยฒยทd)
Total: O(nยฒยทd)
Memory: O(nยฒ) to store the attention score matrix
For n=4096 tokens with d=1024: the nรn matrix has 16.7 million entries. For n=32768 (a long document): 1.07 billion entries. This is why standard Transformers struggle with long sequences.
Comparison: Transformer vs RNN Complexity
| Aspect | Self-Attention (Transformer) | Recurrence (RNN/LSTM) |
|---|---|---|
| Time per layer | O(nยฒ ยท d) | O(n ยท dยฒ) |
| Sequential operations | O(1) โ fully parallel! | O(n) โ inherently sequential |
| Max path length | O(1) โ direct connection | O(n) โ through all states |
| Memory | O(nยฒ) for attention matrix | O(nยทd) for hidden states |
| Better when n โช d | โ Yes (nยฒ โช dยฒ) | More expensive |
| Better when n โซ d | โ No (nยฒ dominates) | โ Yes |
Efficient Attention Variants
| Method | Complexity | Key Idea | Year |
|---|---|---|---|
| Sparse Attention (Child et al.) | O(nโn) | Attend only to fixed patterns (local + strided) | 2019 |
| Linformer | O(n) | Project K,V to lower-dimensional space (nรd โ kรd) | 2020 |
| Performer | O(nยทd) | Approximate softmax(QKT) via random features (FAVOR+) | 2020 |
| FlashAttention | O(nยฒยทd) but fast! | IO-aware exact attention, tiles computation to use SRAM | 2022 |
| Multi-Query Attention | O(nยฒยทd/h) | Share K,V across heads, save memory | 2019 |
| Grouped-Query Attention | O(nยฒยทd/g) | Share K,V across groups of heads (used in LLaMA 2) | 2023 |
"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (Dao et al., 2022) โ Instead of approximating attention, FlashAttention computes exact attention but restructures the computation to minimize reads/writes between GPU SRAM (fast, small) and HBM (slow, large). By tiling the nรn attention matrix and computing softmax in blocks, FlashAttention achieves 2-4x speedup and 5-20x memory reduction. It's now the default attention implementation in PyTorch 2.0+.
Q: What is the time complexity of self-attention for a sequence of length n and dimension d?
A: O(nยฒยทd) โ dominated by the QKT matrix multiplication (nรd ยท dรn). Memory is O(nยฒ) for the attention matrix. This quadratic scaling limits standard Transformers to sequences of ~4K-8K tokens.
Worked Examples
Worked Example 1: By-Hand Attention Computation
๐ Computing Scaled Dot-Product Attention (by hand)
Given: 3 tokens with d_k = 2
Q = [[1, 0], K = [[1, 0], V = [[1, 0],
[0, 1], [0, 1], [0, 1],
[1, 1]] [1, 1]] [0.5, 0.5]]
Step 1: Compute QKT
QKT = Q ยท KT = [[1ยท1+0ยท0, 1ยท0+0ยท1, 1ยท1+0ยท1], = [[1, 0, 1],
[0ยท1+1ยท0, 0ยท0+1ยท1, 0ยท1+1ยท1], [0, 1, 1],
[1ยท1+1ยท0, 1ยท0+1ยท1, 1ยท1+1ยท1]] [1, 1, 2]]
Step 2: Scale by โd_k = โ2 โ 1.414
QKT/โd_k = [[0.707, 0.000, 0.707],
[0.000, 0.707, 0.707],
[0.707, 0.707, 1.414]]
Step 3: Apply softmax (row-wise)
Row 1: exp([0.707, 0.000, 0.707]) = [2.028, 1.000, 2.028]
sum = 5.056
softmax = [0.401, 0.198, 0.401]
Row 2: exp([0.000, 0.707, 0.707]) = [1.000, 2.028, 2.028]
sum = 5.056
softmax = [0.198, 0.401, 0.401]
Row 3: exp([0.707, 0.707, 1.414]) = [2.028, 2.028, 4.113]
sum = 8.169
softmax = [0.248, 0.248, 0.503]
Step 4: Multiply by V
Attention weights: V:
[[0.401, 0.198, 0.401], [[1.0, 0.0],
[0.198, 0.401, 0.401], [0.0, 1.0],
[0.248, 0.248, 0.503]] [0.5, 0.5]]
Output = weights ยท V:
Row 1: 0.401ร[1,0] + 0.198ร[0,1] + 0.401ร[0.5,0.5] = [0.602, 0.399]
Row 2: 0.198ร[1,0] + 0.401ร[0,1] + 0.401ร[0.5,0.5] = [0.399, 0.602]
Row 3: 0.248ร[1,0] + 0.248ร[0,1] + 0.503ร[0.5,0.5] = [0.500, 0.500]
Final output:
Attention(Q,K,V) = [[0.602, 0.399],
[0.399, 0.602],
[0.500, 0.500]]
Interpretation: Token 3 ([1,1]) attended roughly equally to all tokens (note the 0.5, 0.5 output), while Token 1 attended more to itself and Token 3 (which share the "1" in dimension 0).
Worked Example 2: Indian Industry โ Multilingual Attention
๐ฎ๐ณ AI4Bharat: Attention in Hindi-English Code-Mixed Text
Problem: Classify sentiment of a code-mixed review from Flipkart:
"Yeh phone bahut accha hai but battery life is terrible"
(This phone is very good but battery life is terrible)
Why Transformers excel here:
- Self-attention captures cross-lingual dependencies: The Hindi word "accha" (good) and English word "terrible" create opposing sentiment โ attention heads learn to detect this contrast across languages.
- No sequential bottleneck: The sentiment words at positions 4 ("accha") and 12 ("terrible") directly attend to each other, without information flowing through 8 intermediate tokens.
- Subword tokenization (BPE): IndicBERT's tokenizer handles Devanagari and Latin scripts simultaneously.
Attention pattern (simplified):
Attention head #3 (sentiment detection):
Yeh phone bahut accha hai but battery life is terrible
terrible โ 0.01 0.02 0.01 0.35 0.01 0.15 0.12 0.08 0.02 0.23
accha โ 0.02 0.05 0.20 0.15 0.03 0.25 0.08 0.05 0.02 0.15
Note: "terrible" attends strongly to "accha" (0.35) โ detecting the
sentiment contrast. "accha" attends to "but" (0.25) โ the pivot word.
Result: Model correctly classifies as "Mixed Sentiment" with a confidence breakdown of 60% negative / 40% positive, which aligns with the code-mixed nature of the review.
Worked Example 3: US Industry โ GPT Text Generation
๐บ๐ธ OpenAI: How GPT Generates Text via Causal Attention
Task: GPT-3 generating a response to "Explain quantum computing in simple terms."
Generation process (autoregressive):
Step 1: Input: "Explain quantum computing in simple terms."
โ Model predicts next token probabilities:
P("Quantum") = 0.12, P("Think") = 0.08, P("Imagine") = 0.15, ...
โ Sample (or greedy): "Imagine"
Step 2: Input: "Explain quantum computing in simple terms. Imagine"
โ Causal mask ensures "Imagine" can attend to all previous tokens
but previous tokens CANNOT attend to "Imagine"
โ Predict: P("a") = 0.18, P("you") = 0.12, ...
โ Select: "a"
Step 3: Continue until <EOS> token or max_length
How causal attention works at step 2:
Attention mask (1 = attend, 0 = blocked):
Explain quantum computing in simple terms . Imagine
Explain [ 1 0 0 0 0 0 0 0 ]
quantum [ 1 1 0 0 0 0 0 0 ]
computing [ 1 1 1 0 0 0 0 0 ]
in [ 1 1 1 1 0 0 0 0 ]
simple [ 1 1 1 1 1 0 0 0 ]
terms [ 1 1 1 1 1 1 0 0 ]
. [ 1 1 1 1 1 1 1 0 ]
Imagine [ 1 1 1 1 1 1 1 1 ] โ can see everything before it
Scaling insight: GPT-3 (175B parameters) achieves this quality through massive scaling. The same architecture at 117M parameters (GPT-1) produces much less coherent text โ demonstrating the power of scale.
Python Implementation: From-Scratch NumPy
13.1 Scaled Dot-Product Attention (NumPy)
Python (NumPy)
import numpy as np
def softmax(x, axis=-1):
"""Numerically stable softmax."""
e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
return e_x / np.sum(e_x, axis=axis, keepdims=True)
def scaled_dot_product_attention(Q, K, V, mask=None):
"""
Scaled dot-product attention from scratch.
Args:
Q: Queries (n, d_k)
K: Keys (n, d_k)
V: Values (n, d_v)
mask: Optional boolean mask (n, n), True = mask out
Returns:
output: (n, d_v) โ weighted sum of values
weights: (n, n) โ attention weights (for visualization)
"""
d_k = Q.shape[-1]
# Step 1: Compute raw attention scores
scores = Q @ K.T # (n, n)
# Step 2: Scale by sqrt(d_k)
scores = scores / np.sqrt(d_k)
# Step 3: Apply mask (for causal / padding)
if mask is not None:
scores = np.where(mask, -1e9, scores)
# Step 4: Softmax to get attention weights
weights = softmax(scores, axis=-1) # (n, n), each row sums to 1
# Step 5: Weighted sum of values
output = weights @ V # (n, d_v)
return output, weights
# === DEMO ===
np.random.seed(42)
n, d_k, d_v = 4, 8, 8
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_v)
output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights (each row sums to 1):")
print(np.round(weights, 3))
print(f"\nOutput shape: {output.shape}")
# Causal mask demo
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
output_masked, weights_masked = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print("\nCausal attention weights:")
print(np.round(weights_masked, 3))
13.2 Multi-Head Attention (NumPy)
Python (NumPy)
class MultiHeadAttention:
"""Multi-Head Attention from scratch in NumPy."""
def __init__(self, d_model, n_heads):
assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads # dimension per head
# Initialize weight matrices (Xavier initialization)
scale = np.sqrt(2.0 / (d_model + self.d_k))
self.W_Q = np.random.randn(d_model, d_model) * scale
self.W_K = np.random.randn(d_model, d_model) * scale
self.W_V = np.random.randn(d_model, d_model) * scale
self.W_O = np.random.randn(d_model, d_model) * scale
def split_heads(self, x):
"""Reshape (n, d_model) โ (n_heads, n, d_k)"""
n = x.shape[0]
x = x.reshape(n, self.n_heads, self.d_k) # (n, h, d_k)
return x.transpose(1, 0, 2) # (h, n, d_k)
def forward(self, X, mask=None):
"""
Args:
X: Input sequence (n, d_model)
mask: Optional mask (n, n)
Returns:
output: (n, d_model)
all_weights: (n_heads, n, n) โ attention weights per head
"""
# Project to Q, K, V
Q = X @ self.W_Q # (n, d_model)
K = X @ self.W_K
V = X @ self.W_V
# Split into h heads
Q_heads = self.split_heads(Q) # (h, n, d_k)
K_heads = self.split_heads(K)
V_heads = self.split_heads(V)
# Compute attention for each head independently
all_outputs = []
all_weights = []
for i in range(self.n_heads):
out, w = scaled_dot_product_attention(
Q_heads[i], K_heads[i], V_heads[i], mask
)
all_outputs.append(out) # (n, d_k)
all_weights.append(w) # (n, n)
# Concatenate heads: (n, h*d_k) = (n, d_model)
concat = np.concatenate(all_outputs, axis=-1)
# Final linear projection
output = concat @ self.W_O # (n, d_model)
return output, np.array(all_weights)
# === DEMO ===
np.random.seed(42)
d_model, n_heads, seq_len = 32, 4, 6
mha = MultiHeadAttention(d_model, n_heads)
X = np.random.randn(seq_len, d_model)
output, attn_weights = mha.forward(X)
print(f"Input shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print(f" (n_heads={n_heads}, seq_len={seq_len}, seq_len={seq_len})")
13.3 Transformer Block (NumPy)
Python (NumPy)
def layer_norm(x, gamma, beta, eps=1e-6):
"""Layer Normalization."""
mean = np.mean(x, axis=-1, keepdims=True)
var = np.var(x, axis=-1, keepdims=True)
x_norm = (x - mean) / np.sqrt(var + eps)
return gamma * x_norm + beta
def relu(x):
return np.maximum(0, x)
class TransformerBlock:
"""A single Transformer encoder block from scratch."""
def __init__(self, d_model, n_heads, d_ff):
self.mha = MultiHeadAttention(d_model, n_heads)
# LayerNorm parameters
self.gamma1 = np.ones(d_model)
self.beta1 = np.zeros(d_model)
self.gamma2 = np.ones(d_model)
self.beta2 = np.zeros(d_model)
# Feed-Forward Network parameters
scale1 = np.sqrt(2.0 / (d_model + d_ff))
scale2 = np.sqrt(2.0 / (d_ff + d_model))
self.W1 = np.random.randn(d_model, d_ff) * scale1
self.b1 = np.zeros(d_ff)
self.W2 = np.random.randn(d_ff, d_model) * scale2
self.b2 = np.zeros(d_model)
def ffn(self, x):
"""Position-wise Feed-Forward Network."""
return relu(x @ self.W1 + self.b1) @ self.W2 + self.b2
def forward(self, x, mask=None):
"""
Forward pass through one Transformer encoder block.
Uses Post-Norm (original Transformer style).
"""
# Sub-layer 1: Multi-Head Self-Attention + Residual + LayerNorm
attn_output, attn_weights = self.mha.forward(x, mask)
x = layer_norm(x + attn_output, self.gamma1, self.beta1)
# Sub-layer 2: FFN + Residual + LayerNorm
ffn_output = self.ffn(x)
x = layer_norm(x + ffn_output, self.gamma2, self.beta2)
return x, attn_weights
# === DEMO: Stack 6 layers like the original Transformer ===
np.random.seed(42)
d_model, n_heads, d_ff = 64, 8, 256
seq_len = 10
# Create 6 Transformer blocks
blocks = [TransformerBlock(d_model, n_heads, d_ff) for _ in range(6)]
# Simulate positional encoding + embedding
X = np.random.randn(seq_len, d_model) * 0.1 # simulated input
# Forward through all 6 layers
h = X
for i, block in enumerate(blocks):
h, w = block.forward(h)
print(f"Layer {i+1} output โ mean: {h.mean():.4f}, std: {h.std():.4f}")
print(f"\nFinal output shape: {h.shape}")
print("โ LayerNorm keeps values well-behaved across all 6 layers!")
13.4 Positional Encoding (NumPy)
Python (NumPy)
def sinusoidal_positional_encoding(max_len, d_model):
"""
Generate sinusoidal positional encodings.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
"""
PE = np.zeros((max_len, d_model))
position = np.arange(max_len)[:, np.newaxis] # (max_len, 1)
# Compute the division term: 10000^(2i/d_model)
div_term = np.exp(
np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model)
) # (d_model/2,)
# Even dimensions: sine
PE[:, 0::2] = np.sin(position * div_term)
# Odd dimensions: cosine
PE[:, 1::2] = np.cos(position * div_term)
return PE
# === DEMO ===
PE = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(f"Positional Encoding shape: {PE.shape}")
print(f"PE[0] (position 0): {np.round(PE[0], 3)}")
print(f"PE[1] (position 1): {np.round(PE[1], 3)}")
print(f"\nValues are bounded: min={PE.min():.3f}, max={PE.max():.3f}")
# Verify: dot product between nearby positions is higher
print(f"\nSimilarity PE[0]ยทPE[1]: {PE[0] @ PE[1]:.3f}")
print(f"Similarity PE[0]ยทPE[5]: {PE[0] @ PE[5]:.3f}")
print(f"Similarity PE[0]ยทPE[25]: {PE[0] @ PE[25]:.3f}")
print("Nearby positions have higher similarity โ")
Can you find the bug? A student implemented attention like this:
def broken_attention(Q, K, V):
scores = Q @ K.T
weights = softmax(scores, axis=0) # <-- Bug here!
return weights @ V
The model trains but gives terrible results. What's wrong?
axis=0 normalizes down columns instead of across rows. Each key gets weights summing to 1 (across queries), instead of each query getting weights summing to 1 (across keys). This means each query's attention weights don't form a probability distribution. The correct axis is axis=-1 (or axis=1 for 2D). Also missing: the โd_k scaling!
PyTorch Implementation: Mini-GPT (Character-Level)
Python (PyTorch)
import torch
import torch.nn as nn
import torch.nn.functional as F
# โโ Hyperparameters โโ
BLOCK_SIZE = 64 # context window (max sequence length)
D_MODEL = 128 # embedding dimension
N_HEADS = 4 # number of attention heads
N_LAYERS = 4 # number of Transformer blocks
D_FF = 512 # feed-forward inner dimension
DROPOUT = 0.1
LEARNING_RATE = 3e-4
MAX_ITERS = 5000
BATCH_SIZE = 32
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
# โโ Single Attention Head โโ
class Head(nn.Module):
"""One head of causal self-attention."""
def __init__(self, head_size):
super().__init__()
self.key = nn.Linear(D_MODEL, head_size, bias=False)
self.query = nn.Linear(D_MODEL, head_size, bias=False)
self.value = nn.Linear(D_MODEL, head_size, bias=False)
self.register_buffer('tril',
torch.tril(torch.ones(BLOCK_SIZE, BLOCK_SIZE)))
self.dropout = nn.Dropout(DROPOUT)
def forward(self, x):
B, T, C = x.shape
k = self.key(x) # (B, T, head_size)
q = self.query(x) # (B, T, head_size)
# Compute attention scores
scores = q @ k.transpose(-2, -1) * (C ** -0.5) # (B, T, T)
scores = scores.masked_fill(
self.tril[:T, :T] == 0, float('-inf')) # causal mask
weights = F.softmax(scores, dim=-1)
weights = self.dropout(weights)
# Weighted sum of values
v = self.value(x) # (B, T, head_size)
out = weights @ v # (B, T, head_size)
return out
# โโ Multi-Head Attention โโ
class MultiHeadAttention(nn.Module):
def __init__(self, n_heads, head_size):
super().__init__()
self.heads = nn.ModuleList([Head(head_size) for _ in range(n_heads)])
self.proj = nn.Linear(D_MODEL, D_MODEL)
self.dropout = nn.Dropout(DROPOUT)
def forward(self, x):
out = torch.cat([h(x) for h in self.heads], dim=-1)
out = self.dropout(self.proj(out))
return out
# โโ Feed-Forward Network โโ
class FeedForward(nn.Module):
def __init__(self):
super().__init__()
self.net = nn.Sequential(
nn.Linear(D_MODEL, D_FF),
nn.ReLU(),
nn.Linear(D_FF, D_MODEL),
nn.Dropout(DROPOUT),
)
def forward(self, x):
return self.net(x)
# โโ Transformer Block โโ
class TransformerBlock(nn.Module):
"""Transformer block: communication (attention) + computation (FFN)."""
def __init__(self):
super().__init__()
head_size = D_MODEL // N_HEADS
self.sa = MultiHeadAttention(N_HEADS, head_size)
self.ffn = FeedForward()
self.ln1 = nn.LayerNorm(D_MODEL)
self.ln2 = nn.LayerNorm(D_MODEL)
def forward(self, x):
# Pre-Norm residual connections (GPT-2 style)
x = x + self.sa(self.ln1(x))
x = x + self.ffn(self.ln2(x))
return x
# โโ Mini-GPT Model โโ
class MiniGPT(nn.Module):
def __init__(self, vocab_size):
super().__init__()
self.token_embedding = nn.Embedding(vocab_size, D_MODEL)
self.position_embedding = nn.Embedding(BLOCK_SIZE, D_MODEL)
self.blocks = nn.Sequential(
*[TransformerBlock() for _ in range(N_LAYERS)]
)
self.ln_f = nn.LayerNorm(D_MODEL)
self.lm_head = nn.Linear(D_MODEL, vocab_size)
def forward(self, idx, targets=None):
B, T = idx.shape
# Token + Position embeddings
tok_emb = self.token_embedding(idx) # (B, T, D_MODEL)
pos_emb = self.position_embedding(
torch.arange(T, device=DEVICE)) # (T, D_MODEL)
x = tok_emb + pos_emb # (B, T, D_MODEL)
# Transformer blocks
x = self.blocks(x) # (B, T, D_MODEL)
x = self.ln_f(x) # (B, T, D_MODEL)
logits = self.lm_head(x) # (B, T, vocab_size)
# Compute loss if targets provided
loss = None
if targets is not None:
B, T, C = logits.shape
logits_flat = logits.view(B*T, C)
targets_flat = targets.view(B*T)
loss = F.cross_entropy(logits_flat, targets_flat)
return logits, loss
def generate(self, idx, max_new_tokens):
"""Autoregressive generation."""
for _ in range(max_new_tokens):
# Crop to last BLOCK_SIZE tokens
idx_cond = idx[:, -BLOCK_SIZE:]
logits, _ = self(idx_cond)
logits = logits[:, -1, :] # last token's predictions
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
idx = torch.cat([idx, idx_next], dim=1)
return idx
# โโ Training Loop โโ
# Load your text data
# text = open('input.txt', 'r').read()
# chars = sorted(list(set(text)))
# vocab_size = len(chars)
# stoi = {c: i for i, c in enumerate(chars)}
# itos = {i: c for c, i in stoi.items()}
# encode = lambda s: [stoi[c] for c in s]
# decode = lambda l: ''.join([itos[i] for i in l])
# model = MiniGPT(vocab_size).to(DEVICE)
# optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
#
# for step in range(MAX_ITERS):
# xb, yb = get_batch('train') # random batch of (BATCH_SIZE, BLOCK_SIZE)
# logits, loss = model(xb, yb)
# optimizer.zero_grad()
# loss.backward()
# optimizer.step()
# if step % 500 == 0:
# print(f"Step {step}, Loss: {loss.item():.4f}")
# Print model size
vocab_size_demo = 65 # ASCII printable characters
model = MiniGPT(vocab_size_demo)
n_params = sum(p.numel() for p in model.parameters())
print(f"Mini-GPT Parameters: {n_params:,}")
print(f" Token Embedding: {vocab_size_demo * D_MODEL:,}")
print(f" Position Embedding: {BLOCK_SIZE * D_MODEL:,}")
print(f" Transformer Blocks: {n_params - vocab_size_demo*D_MODEL - BLOCK_SIZE*D_MODEL - D_MODEL*vocab_size_demo - D_MODEL - D_MODEL:,}")
print(f" LM Head: {D_MODEL * vocab_size_demo + vocab_size_demo:,}")
This ~411K parameter model is tiny compared to GPT-3 (175B), but it can learn to generate Shakespeare-like text when trained on the complete works of Shakespeare (~1M characters). The architecture is identical โ only the scale differs. Andrej Karpathy's "nanoGPT" is this exact approach, and is the best resource for understanding GPT from scratch.
Visual Aids
Self-Attention Computation Flow
Multi-Head Attention Visualization
Complete Encoder Layer
Indian Industry Case Study: AI4Bharat IndicBERT
๐ฎ๐ณ IndicBERT โ Building BERT for India's Languages
The Challenge
India has 22 official languages, written in 13 different scripts. Building a single language model that understands all of them faces unique challenges:
- Script diversity: Hindi (Devanagari), Tamil (Tamil script), Telugu (Telugu script), Urdu (Nastaliq), English (Latin) โ all use completely different character sets
- Morphological complexity: Agglutinative languages like Tamil and Kannada create very long compound words, exploding vocabulary size
- Code-mixing: Indian social media frequently mixes Hindi/English ("Hinglish"), Tamil/English, etc., often switching within a single sentence
- Data scarcity: While Hindi has moderate web data, languages like Bodo, Dogri, Maithili have extremely limited digital text
The Solution: IndicBERT Architecture
| Component | Choice | Rationale |
|---|---|---|
| Base model | ALBERT (A Lite BERT) | Parameter sharing across layers reduces model size โ deployable on Indian mobile networks |
| Languages | 11 Indian languages + English | Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu, Urdu |
| Tokenizer | SentencePiece (128K vocab) | Script-agnostic subword tokenization handles all 13 scripts |
| Pre-training data | IndicCorp (9B tokens) | Largest collection of Indian language text, curated from web crawls |
| Pre-training objective | MLM (Masked Language Model) | Standard BERT pre-training, adapted for multilingual setting |
Results
- NER (Named Entity Recognition): IndicBERT achieved state-of-the-art on 9/11 languages, beating multilingual BERT (mBERT) by 3-5 F1 points
- Sentiment Analysis: 2-4% accuracy improvement over mBERT on Hindi and Bengali movie reviews
- Cross-lingual transfer: Training on Hindi data and testing on Marathi worked well (both Devanagari script, similar grammar) โ showing the model captures shared linguistic structure
Deployment Applications
- Flipkart: Hindi product review classification
- Government: Automated processing of RTI (Right to Information) requests across languages
- ShareChat: Content moderation in 15 Indian languages
- Koo (Indian Twitter alternative): Multilingual sentiment analysis
The IndicNLP Suite by AI4Bharat (IIT Madras) includes IndicBERT, IndicTrans (translation), IndicGLUE (benchmark), and IndicCorp (dataset) โ a complete ecosystem for Indian NLP. If you want to contribute, check out ai4bharat.org. This is one of the most impactful open-source AI projects from India.
US/Global Industry Case Study: The GPT Evolution
๐บ๐ธ OpenAI: From GPT-1 to GPT-4 โ The Scaling Revolution
The Timeline
Scaling Laws (Kaplan et al., 2020)
| What Scales | Relationship | Implication |
|---|---|---|
| Parameters (N) | L โ N-0.076 | 10ร more params โ ~0.5 nats lower loss |
| Dataset (D) | L โ D-0.095 | 10ร more data โ ~0.6 nats lower loss |
| Compute (C) | L โ C-0.050 | 10ร more compute โ ~0.4 nats lower loss |
These power laws predict how performance improves with scale โ they drove OpenAI's strategy of building ever-larger models. The relationships are smooth and predictable, meaning you can forecast how well a larger model will perform before training it.
RLHF: From Language Model to Assistant
Raw GPT-3 is powerful but unfocused โ it might generate toxic content, hallucinate facts, or ramble. RLHF aligns it with human preferences:
- Step 1 (SFT): Fine-tune on human-written ideal responses to prompts
- Step 2 (Reward Model): Humans rank model outputs A > B > C; train a reward model to predict these rankings
- Step 3 (PPO): Use the reward model as a signal to fine-tune GPT via reinforcement learning (Proximal Policy Optimization)
Common Misconceptions
โ MYTH: "Transformers are just improved RNNs."
โ TRUTH: Transformers are a fundamentally different architecture. They use NO recurrence. Instead, they process all positions in parallel through self-attention. The two architectures share the goal (sequence modeling) but differ in mechanism (attention vs. recurrence), training efficiency (parallel vs. sequential), and how they handle long-range dependencies (direct vs. through hidden states).
๐ WHY IT MATTERS: This distinction is crucial for architecture selection and understanding computational complexity.
โ MYTH: "The 'attention' in Transformers is the same as the attention used in seq2seq models (Bahdanau attention)."
โ TRUTH: Bahdanau attention (2014) added attention on top of an RNN โ the encoder was still an RNN. The Transformer's key innovation was self-attention (tokens attending to each other within the same sequence) and building the entire model from attention, with no recurrence at all.
๐ WHY IT MATTERS: Understanding this history clarifies why the paper was called "Attention Is All You Need" โ emphasis on "All."
โ MYTH: "More attention heads always means better performance."
โ TRUTH: Research (Michel et al., 2019) showed that many attention heads can be pruned after training without significant performance loss. Some heads learn redundant patterns. The optimal number of heads is task-dependent.
๐ WHY IT MATTERS: For deployment, you can prune heads to reduce inference cost โ critical for mobile deployment in India's constrained environments.
โ MYTH: "Transformers understand language like humans do."
โ TRUTH: Transformers learn statistical patterns in text โ co-occurrence, syntax, some reasoning. They don't have grounded understanding. They can fail spectacularly on tasks requiring real-world knowledge, causal reasoning, or negation.
๐ WHY IT MATTERS: Over-relying on LLMs without human oversight causes real-world failures. Critical applications (healthcare, legal) require human-in-the-loop systems.
โ MYTH: "BERT and GPT use completely different architectures."
โ TRUTH: Both use the same core building blocks (multi-head attention, FFN, LayerNorm, residual connections). The key differences are: (1) BERT uses bidirectional attention; GPT uses causal attention, and (2) BERT pre-trains with MLM; GPT pre-trains with CLM. The Transformer block itself is nearly identical.
๐ WHY IT MATTERS: Once you understand the Transformer block, you understand both BERT and GPT โ they're different configurations of the same architecture.
GATE / Exam Corner
Key Formulas for Exams
Formula Sheet โ Transformers
- Scaled Dot-Product Attention: Attention(Q,K,V) = softmax(QKT/โd_k)V
- Multi-Head: MultiHead = Concat(headโ,...,head_h)WO, head_i = Attention(QW_iQ, KW_iK, VW_iV)
- Positional Encoding: PE(pos,2i) = sin(pos/100002i/d), PE(pos,2i+1) = cos(pos/100002i/d)
- FFN: FFN(x) = max(0, xWโ+bโ)Wโ+bโ where Wโ: dโ4d, Wโ: 4dโd
- LayerNorm: LN(x) = ฮณยท(x-ฮผ)/โ(ฯยฒ+ฮต) + ฮฒ
- Self-Attention Complexity: Time O(nยฒd), Memory O(nยฒ+nd)
- Params in one MHA layer: 4ยทd_modelยฒ (Q,K,V projections + output projection)
- Params in FFN: 2ยทd_modelยทd_ff + d_model + d_ff โ 8ยทd_modelยฒ
GATE-Style Problems
In a Transformer with d_model = 512 and h = 8 attention heads, what is the dimension d_k of keys in each head?
- 8
- 64
- 128
- 512
d_k = d_model / h = 512 / 8 = 64. Each head operates in a 64-dimensional subspace.
What is the time complexity of computing self-attention for a sequence of length n with model dimension d?
- O(nยทd)
- O(nยทdยฒ)
- O(nยฒยทd)
- O(nยฒยทdยฒ)
The bottleneck is QKT: (nรd)ยท(dรn) = O(nยฒยทd). Memory is O(nยฒ) for the attention matrix.
In the original Transformer (d_model=512, h=8, d_ff=2048, N=6), approximately how many parameters are in one encoder layer?
- ~500K
- ~1.5M
- ~3.1M
- ~6.3M
MHA: 4ร512ยฒ = 1,048,576 โ 1.05M
FFN: 512ร2048 + 2048 + 2048ร512 + 512 = 2,099,712 โ 2.1M
LayerNorm (ร2): 2ร(512+512) = 2,048 โ 0.002M
Total: ~3.15M per layer, ~18.9M for 6 layers.
Why does the Transformer divide attention scores by โd_k before applying softmax?
- To normalize the output values to [0, 1]
- To reduce the number of parameters
- To prevent softmax saturation when d_k is large, ensuring informative gradients
- To make the model invariant to the choice of d_k
When d_k is large, dot products grow in magnitude (variance = d_k). Large inputs to softmax push it into saturation regions where gradients are extremely small, hindering learning. Scaling by โd_k brings variance back to 1.
Which of the following is TRUE about BERT's pre-training?
- BERT uses causal (left-to-right) language modeling
- BERT masks 50% of input tokens during pre-training
- BERT uses bidirectional self-attention and Masked Language Modeling (15% masking)
- BERT's encoder layers use masked self-attention to prevent information leakage
BERT uses bidirectional self-attention (no causal mask) and MLM that masks 15% of tokens (80% [MASK], 10% random, 10% unchanged).
Prediction Table โ Likely GATE Topics (2025-2027)
| Topic | Probability | Question Type |
|---|---|---|
| Attention formula & โd_k scaling | โญโญโญโญโญ | MCQ / Numerical |
| Self-attention complexity O(nยฒd) | โญโญโญโญโญ | MCQ |
| BERT vs GPT differences | โญโญโญโญ | MCQ |
| Transformer parameter counting | โญโญโญโญ | Numerical |
| Positional encoding purpose | โญโญโญ | MCQ |
| Multi-head attention mechanics | โญโญโญ | MCQ |
| LayerNorm vs BatchNorm | โญโญโญ | MCQ |
Interview Prep
Conceptual Questions
๐ค "Explain self-attention in simple terms."
"Self-attention lets every word in a sentence directly look at every other word and decide how relevant each one is. Think of it as a room full of people โ instead of passing notes one by one (like RNNs), everyone can simultaneously turn to any other person and listen. Each word creates three vectors: a query (what am I looking for?), a key (what do I represent?), and a value (what info do I carry?). The attention score between two words is the dot product of the query and key, scaled by โd_k to prevent numerical issues, then softmaxed into weights. The output for each word is a weighted sum of all values. This creates context-aware representations where the meaning of 'bank' depends on whether the sentence is about finance or rivers."
๐ค "Why do Transformers beat RNNs?"
"Three fundamental reasons. First, parallelism: RNNs process tokens sequentially โ you can't compute the 100th hidden state without the 99th. Self-attention processes all positions simultaneously, making full use of GPU parallelism. Training speedups of 10-100ร are common. Second, long-range dependencies: In an RNN, information from word 1 must survive passage through all intermediate states to reach word 100 โ the vanishing gradient problem. In a Transformer, word 1 directly attends to word 100 in a single step โ path length is O(1) vs O(n). Third, rich representations: Multi-head attention lets each word form multiple types of relationships simultaneously โ syntactic, semantic, positional โ while an RNN collapses everything into a single fixed-size hidden state."
๐ค "Walk me through the Transformer encoder step by step." (System Design)
"Starting with raw tokens: (1) Each token is mapped to a d_model-dimensional vector via the embedding table. (2) Sinusoidal or learned positional encodings are added. (3) This goes into a stack of 6 identical layers. Each layer has two sub-layers. Sub-layer 1: Multi-head self-attention with h=8 parallel heads, each computing scaled dot-product attention in d_k=64 dimensions, then concatenating and projecting. A residual connection wraps it: output = LayerNorm(x + MultiHeadAttn(x)). Sub-layer 2: A position-wise feed-forward network โ two linear layers with ReLU between (expanding from 512 to 2048 then back to 512). Again wrapped with residual + LayerNorm. After 6 such layers, each token's representation is a rich, contextual embedding of that token's meaning, informed by all other tokens."
Coding Questions
๐ป "Implement scaled dot-product attention." (Top 3 ML coding question)
def attention(Q, K, V, mask=None):
d_k = Q.shape[-1]
scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
weights = torch.softmax(scores, dim=-1)
return weights @ V, weights
Follow-ups to expect: "Add dropout to attention weights," "Make it batched," "Add causal masking," "What's the complexity?"
India-Specific Interview Angle
๐ฎ๐ณ Indian ML Interviews (TCS, Infosys AI, Flipkart)
- Focus on BERT fine-tuning for Hindi/regional language NLP
- "How would you handle code-mixed text?" โ IndicBERT + subword tokenization
- "How to deploy a Transformer on low-resource devices?" โ DistilBERT, quantization, ONNX
- GATE-style numerical problems on attention
- Expect questions on IndicNLP, mBERT, XLM-R for Indian languages
๐บ๐ธ US ML Interviews (FAANG, OpenAI, Anthropic)
- Deep understanding of architecture decisions
- "Implement multi-head attention from scratch" (whiteboard)
- "Design a Transformer for long documents" โ efficient attention
- "Compare Pre-Norm vs Post-Norm" and why GPT-2 switched
- System design: "How would you serve a 70B LLM?"
- Research paper discussions: FlashAttention, RoPE, MoE
Hands-On Lab / Mini-Project
๐ฌ Mini-Project: Build and Train a Character-Level Mini-GPT
Objective
Build a mini-Transformer (GPT-style) that learns to generate text character by character. Train it on a text corpus (Shakespeare, Indian literature, or code) and generate new text.
Requirements
- Implement the following from scratch using PyTorch:
- Single-head causal self-attention
- Multi-head attention (4 heads)
- Transformer block (attention + FFN + LayerNorm + residual)
- Complete GPT model (token/position embedding โ N blocks โ LM head)
- Train on at least 1MB of text data for โฅ5000 steps
- Generate 500+ characters of coherent-looking text
- Visualize attention patterns for at least 2 heads
- Experiment: Compare 2-layer vs 6-layer models on the same data
Data Sources
- Option A (Global): Shakespeare's complete works (~1.1MB) from
tinyshakespeare - Option B (India): Hindi Wikidump text or Premchand stories in Devanagari
- Option C (Code): Python source files from GitHub
Rubric (100 points)
| Component | Points | Criteria |
|---|---|---|
| Correct attention implementation | 20 | Attention formula correct, causal mask works, shapes verified |
| Multi-head attention | 15 | Multiple heads computed in parallel, concatenated correctly |
| Transformer block | 15 | Residual connections, LayerNorm, FFN all present and correct |
| Training pipeline | 15 | Proper data loading, batching, loss computation, optimization |
| Generation quality | 15 | Generated text shows learned patterns (words, formatting, some structure) |
| Attention visualization | 10 | Heatmap of attention weights for at least 2 heads, with interpretation |
| 2-layer vs 6-layer comparison | 10 | Learning curves compared, sample quality compared, brief analysis |
Bonus Challenges (+20 points each)
- Bonus 1: Add temperature and top-k sampling to generation
- Bonus 2: Train on Hindi text and handle Devanagari characters
- Bonus 3: Implement Grouped-Query Attention (GQA) as used in LLaMA 2
Exercises
Section A: Conceptual Questions (5)
Why is self-attention called "self" attention? How does it differ from cross-attention?
Explain why Transformers need positional encoding while RNNs do not.
Explain the role of the FFN (feed-forward network) in a Transformer block. If attention handles inter-token communication, what does the FFN do?
Why does BERT use bidirectional attention while GPT uses causal attention? Isn't bidirectional always better?
What is the purpose of the output projection WO in multi-head attention? Why not just concatenate the heads?
Section B: Mathematical Questions (8)
Given Q = [[1, 0], [0, 1]], K = [[1, 1], [0, 1]], V = [[1, 2], [3, 4]], compute Attention(Q, K, V) with scaling (d_k = 2). Show all intermediate steps.
Prove that if q_i ~ N(0, 1) and k_i ~ N(0, 1) independently, then Var(qยทk) = d_k when q, k โ โd_k.
Calculate the total number of trainable parameters in a single Transformer encoder layer with d_model = 256, h = 4, d_ff = 1024. Include LayerNorm parameters.
Show that the sinusoidal positional encoding satisfies the linear relationship property: PE(pos+k) can be expressed as a linear transformation of PE(pos) for any fixed offset k.
For a Transformer processing a sequence of length n = 1024 with d_model = 512, compute the FLOPs for one self-attention operation (QKT and attentionรV). How does this compare to processing the same sequence with an RNN (d_hidden = 512)?
In a causal attention mask for sequence length 4, write out the 4ร4 mask matrix (before and after softmax). What does each row sum to?
GPT-3 has 175B parameters, 96 layers, d_model = 12,288, d_ff = 4ยทd_model, h = 96 heads. Verify the parameter count for one layer (MHA + FFN + LayerNorm) and check if 96 layers ร per-layer params โ 175B.
If we use multi-query attention (MQA) where K and V are shared across all h heads, by what factor do we reduce the KV cache memory during inference? (Assume h = 32 heads.)
Section C: Coding Questions (4)
Implement a function causal_mask(n) that returns an nรn boolean mask where True means "mask this position" (future positions). Use it in your attention function.
mask = np.triu(np.ones((n, n), dtype=bool), k=1) or mask = torch.triu(torch.ones(n, n), diagonal=1).bool()Write a PyTorch function that computes the sinusoidal positional encoding matrix of shape (max_len, d_model). Verify that nearby positions have higher cosine similarity than distant positions.
torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)) for the wavelengths. Compute cosine similarity between PE[0] and PE[k] for k = 1, 5, 50, 100.Modify the MiniGPT code to add temperature-controlled sampling and top-k sampling in the generate() method. Generate text at temperatures 0.5, 1.0, and 1.5, and compare the diversity and coherence.
logits = logits / temperature before softmax. Top-k: v, ix = torch.topk(logits, k), set all other logits to -inf, then softmax over the remaining k options.Implement Layer Normalization from scratch (no using nn.LayerNorm). Compare the output with PyTorch's built-in implementation to verify numerical correctness (difference should be < 1e-5).
y = gamma * (x - mean) / sqrt(var + eps) + beta. Compare with nn.LayerNorm(d_model)(x) โ the difference should be negligible.Section D: Critical Thinking (3)
Self-attention's O(nยฒ) complexity limits it to ~4K-8K tokens. But books are 100K+ tokens. Propose a Transformer architecture that can process a 100K-token novel. Discuss trade-offs.
BERT pre-training masks 15% of tokens. Why not 50%? Why not 5%? What would happen in each case? Design an experiment to find the optimal masking rate.
In India, many users interact with AI in code-mixed language (e.g., "Mujhe ek accha restaurant suggest karo near CP"). What challenges does this pose for Transformer-based models, and how would you address them?
โ Starred Research Questions (2)
Read the FlashAttention paper (Dao et al., 2022). Explain how tiling and recomputation avoid materializing the full nรn attention matrix in HBM. Why is this an IO-bound optimization rather than a compute-bound one?
State-Space Models (SSMs) like Mamba (Gu & Dao, 2023) claim to match Transformer quality with O(n) complexity. Research their mechanism and compare: what does Mamba gain, and what does it sacrifice compared to standard Transformers?
Connections
๐ How This Chapter Connects
- Chapter 14 (LSTMs & GRUs): You understood the limitations of recurrence โ vanishing gradients and sequential processing. Transformers solve both.
- Chapter 9 (Regularization): Dropout is used in attention weights and FFN layers. Layer Normalization extends the normalization concepts from Batch Normalization (Ch 10).
- Chapter 6 (Backpropagation): Residual connections provide gradient highways, a concept you first met in deep networks.
- Chapter 2 (Linear Algebra): The entire attention mechanism is matrix multiplication โ QยทKT, softmax, weightsยทV.
- Chapter 16 (GANs & VAEs): Transformer-based generators (like ViT-GANs) are replacing CNN-based GANs.
- Chapter 17 (Applied CV): Vision Transformers (ViT) apply self-attention to image patches, rivaling CNNs.
- Chapter 18 (Applied NLP): BERT, GPT, and T5 are the backbone of modern NLP โ fine-tuning these models is the dominant paradigm.
- Chapter 22 (Future & Ethics): Large Language Models raise critical ethical questions about bias, hallucination, and societal impact.
- Mixture of Experts (MoE): Scaling Transformers to trillions of parameters by only activating a subset of parameters per token (GPT-4, Mixtral)
- State-Space Models: Mamba and similar architectures challenge Transformers with O(n) complexity
- Multimodal Transformers: Models that process text, images, audio, and video simultaneously (GPT-4V, Gemini)
- Retrieval-Augmented Generation (RAG): Combining Transformers with external knowledge retrieval to reduce hallucination
- Google: T5/PaLM for Search, Translate, Gmail Smart Compose
- OpenAI: GPT-4 for ChatGPT, API services
- Meta: LLaMA (open-source LLM), NLLB (translation for 200 languages)
- AI4Bharat: IndicBERT, IndicTrans, Bhashini (India's national language translation platform)
- Hugging Face: Democratizing Transformer deployment with the transformers library
Chapter Summary
Key Takeaways โ Transformers and Attention
- Attention is a soft dictionary lookup: Each token generates a query, compares it against all keys, and retrieves a weighted sum of values. This lets every position directly communicate with every other position โ O(1) path length vs O(n) for RNNs.
- The โd_k scaling is essential: Dot products in high dimensions have variance d_k. Without scaling, softmax saturates, killing gradients. Dividing by โd_k normalizes variance to 1.
- Multi-head attention provides multiple perspectives: h parallel attention heads (each with d_k = d_model/h) learn different relationship types (syntactic, semantic, positional, coreference) at no additional computational cost.
- Positional encoding solves permutation invariance: Self-attention ignores token order. Sinusoidal encodings (or learned embeddings) inject position information. The sine/cosine choice enables encoding relative positions via rotation matrices.
- The Transformer architecture is a composition of simple blocks: Each layer has Multi-Head Attention + FFN, wrapped with residual connections and Layer Normalization. Stack 6-96 of these layers, and you have a state-of-the-art model.
- BERT (encoder, bidirectional, MLM) and GPT (decoder, causal, CLM) are two sides of the same Transformer coin: BERT excels at understanding tasks; GPT excels at generation. Both use identical building blocks, just configured differently.
- Self-attention is O(nยฒd) โ powerful but expensive: This quadratic scaling limits sequence length. Efficient variants (FlashAttention, Linformer, state-space models) are an active research frontier, essential for processing long documents and deploying on resource-constrained devices.
Attention(Q, K, V) = softmax(QKT / โd_k) ยท V
Key Intuition: The Transformer processes sequences not through step-by-step recurrence, but through layers of self-attention โ letting every token directly query every other token, enabling parallelism, long-range dependencies, and the scaling that powers modern AI.
Further Reading
๐ฎ๐ณ Indian Resources
- NPTEL Course: "Deep Learning for NLP" by Prof. Mitesh Khapra (IIT Madras) โ covers Transformers with Indian language examples, available free on Swayam
- AI4Bharat:
ai4bharat.iitm.ac.inโ IndicBERT, IndicTrans, IndicCorp resources and pre-trained models - GATE CS/DS Prep: "Deep Learning" by GATE Wallah / Unacademy โ covers attention complexity and Transformer architecture for GATE exams
- Book: "Speech and Language Processing" by Jurafsky & Martin (draft chapters free online) โ Chapter 10 covers Transformers extensively
๐ Global Resources
- Original Paper: "Attention Is All You Need" (Vaswani et al., 2017) โ the foundational paper, dense but essential
- The Illustrated Transformer: Jay Alammar's visual walkthrough at
jalammar.github.io/illustrated-transformerโ the best visual explanation - 3Blue1Brown: "But what is a GPT?" โ visual intuition for attention, embedding, and generation
- Andrej Karpathy: "Let's build GPT from scratch" (YouTube) โ builds nanoGPT step by step, the best hands-on tutorial
- Harvard NLP: "The Annotated Transformer" โ line-by-line PyTorch implementation with explanations
- Lilian Weng's Blog: "The Transformer Family" โ comprehensive survey of all Transformer variants (2020-2023)
- Distill.pub: "A Mathematical Framework for Transformer Circuits" (Elhage et al., 2021) โ deep mechanistic interpretability of attention
๐ Key Papers (Chronological)
- Bahdanau et al. (2014) โ "Neural Machine Translation by Jointly Learning to Align and Translate" (additive attention)
- Vaswani et al. (2017) โ "Attention Is All You Need" (the Transformer)
- Devlin et al. (2018) โ "BERT: Pre-training of Deep Bidirectional Transformers" (encoder-only)
- Radford et al. (2018, 2019) โ GPT-1, GPT-2 papers (decoder-only)
- Brown et al. (2020) โ "Language Models are Few-Shot Learners" (GPT-3, scaling)
- Dosovitskiy et al. (2020) โ "An Image is Worth 16ร16 Words" (Vision Transformer)
- Su et al. (2021) โ "RoFormer: Rotary Position Embedding" (RoPE)
- Dao et al. (2022) โ "FlashAttention: Fast and Memory-Efficient Exact Attention"
- Touvron et al. (2023) โ "LLaMA: Open and Efficient Foundation Language Models"
- Gu & Dao (2023) โ "Mamba: Linear-Time Sequence Modeling with Selective State Spaces"