Neural Networks & Deep Learning

Chapter 15: Transformers and Attention Mechanisms

Attention Is All You Need โ€” How One Architecture Replaced RNNs, CNNs, and Changed the World

โฑ๏ธ Reading Time: ~5 hours  |  ๐Ÿ“– Unit 5: Specialized Architectures  |  ๐Ÿง  Theory + Code Chapter

๐Ÿ“‹ Prerequisites: Chapter 14 (LSTMs & GRUs), Chapter 9 (Regularization), Linear Algebra (matrix multiplication, eigenvalues)

Bloom's Taxonomy Map for This Chapter

Bloom's LevelWhat You'll Achieve
๐Ÿ”ต RememberRecall the scaled dot-product attention formula, multi-head attention structure, sinusoidal positional encoding equations, and Transformer encoder-decoder block components
๐Ÿ”ต UnderstandExplain why โˆšd_k scaling prevents softmax saturation, how self-attention captures long-range dependencies without recurrence, and why positional encoding is necessary
๐ŸŸข ApplyImplement single-head attention, multi-head attention, and a complete TransformerBlock from scratch in NumPy; build a character-level mini-GPT in PyTorch
๐ŸŸก AnalyzeTrace the information flow through a 6-layer Transformer encoder, analyze attention weight matrices to interpret what the model "looks at," compare O(nยฒd) self-attention vs O(n) recurrence
๐ŸŸ  EvaluateCritically compare BERT (encoder-only, MLM) vs GPT (decoder-only, CLM), evaluate when to choose efficient attention variants (Linformer, FlashAttention), justify architecture decisions for multilingual tasks
๐Ÿ”ด CreateDesign and train a mini-Transformer for character-level text generation; propose an architecture for an Indian multilingual translation system
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  1. Derive the attention mechanism from first principles โ€” starting from the intuition of "soft dictionary lookup" and arriving at Attention(Q,K,V) = softmax(QKT/โˆšd_k)V
  2. Prove why the โˆšd_k scaling factor is necessary by computing the variance of dot products in high-dimensional spaces
  3. Implement single-head attention, multi-head attention, and a complete Transformer block from scratch in NumPy
  4. Explain sinusoidal positional encoding โ€” derive the formulas, understand why sine and cosine encode relative positions, and compare with learned positional embeddings
  5. Diagram the complete Transformer architecture (encoder + decoder, 6 layers each) including masked multi-head attention, cross-attention, feed-forward networks, and residual connections
  6. Compare encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) architectures โ€” their pre-training objectives, strengths, and applications
  7. Analyze self-attention complexity O(nยฒd) and explain efficient attention variants: Linformer, Performer, FlashAttention
  8. Build a character-level mini-GPT in PyTorch that generates text
  9. Discuss real-world deployments: AI4Bharat IndicBERT for 11 Indian languages and the OpenAI GPT-1โ†’4 evolution
  10. Solve GATE-style numerical problems on attention computation and Transformer parameter counting
Section 2

Opening Hook

๐Ÿ”ฎ The Paper That Changed Everything

In June 2017, a team at Google published "Attention Is All You Need." The paper's modest title hid a revolution: they showed you could throw away RNNs entirely and build sequence models purely from attention. This single architecture now powers GPT-4, BERT, Whisper, DALL-E, and virtually every state-of-the-art AI system.

But here's what made it truly special. For decades, the dominant paradigm for sequence modeling was recurrence โ€” processing tokens one by one, like reading a book word by word, never allowed to skip ahead. Transformers broke this tyranny. Instead of sequential processing, they let every word in a sentence look at every other word simultaneously. A 500-word paragraph? The Transformer sees all 500 words at once, computes how each word relates to every other word, and does it all in a single matrix multiplication.

The results were immediate and dramatic. Translation quality jumped. Training speed exploded โ€” because attention is parallelizable on GPUs, unlike recurrence. Within two years, Google replaced its entire translation pipeline with Transformers. Within five years, GPT-3 showed that a Transformer with 175 billion parameters could write essays, code, and poetry that fooled humans.

In India, the Transformer revolution hit a unique problem: 22 official languages, 13 different scripts, rampant code-mixing. Teams at AI4Bharat and IIT Madras built IndicBERT โ€” a BERT model trained on 11 Indian languages โ€” proving that the Transformer's architecture transcends any single language.

Today, you will understand exactly how this architecture works โ€” from the ground up.

Google Brain OpenAI AI4Bharat Hugging Face Meta AI
Section 3

The Intuition First โ€” Why Attention?

The Library Analogy ๐Ÿ“š

Imagine you walk into a massive library looking for information about "climate change effects on agriculture." Here's what you do:

  1. Your Query (Q): You have a question in your mind โ€” "climate change effects on agriculture." This is your query.
  2. Book Labels / Keys (K): Every book on the shelf has a title and index โ€” "Organic Farming Techniques," "Climate Science 101," "History of Ancient Rome." These are keys that you compare against your query.
  3. Book Contents / Values (V): Each book contains actual content โ€” facts, figures, analysis. These are the values you want to retrieve.
  4. Matching: You mentally compare your query against each book's key. "Climate Science 101" gets high relevance. "History of Ancient Rome" gets near-zero relevance.
  5. Weighted Retrieval: You don't just read one book โ€” you read a weighted combination, spending 60% of your time on the climate book, 30% on the farming book, and 10% skimming others.
Attention = "Soft Dictionary Lookup"
output = ฮฃแตข similarity(query, key_i) ร— value_i

This is exactly what the attention mechanism does. Every token in a sentence generates a query ("What am I looking for?"), a key ("What do I contain?"), and a value ("What information do I carry?"). The magic: every token gets to "look up" every other token and retrieve a weighted combination of their information.

Why RNNs Failed: The Bottleneck Problem ๐Ÿ”ฌ

In Chapter 14, you learned that LSTMs and GRUs mitigate the vanishing gradient problem. But they still have a fundamental limitation: information must flow sequentially. If word 1 needs information from word 100, that information must survive passage through 99 intermediate states. Even with gating, this bottleneck causes information loss.

RNN: Sequential bottleneck wordโ‚ โ†’ wordโ‚‚ โ†’ wordโ‚ƒ โ†’ ... โ†’ wordโ‚‰โ‚‰ โ†’ wordโ‚โ‚€โ‚€ โ†“ โ†“ Info from wordโ‚ must survive 99 steps to reach wordโ‚โ‚€โ‚€ Transformer: Direct connections wordโ‚ โ†โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ wordโ‚โ‚€โ‚€ wordโ‚ โ†โ”€โ”€โ†’ wordโ‚‚ โ†โ”€โ”€โ†’ wordโ‚ƒ ... (all pairs!) Every token "sees" every other token in ONE step.

The "Aha!" Question

"What if, instead of processing a sequence step by step, we let every position in the sequence attend to every other position simultaneously โ€” and used nothing but this attention operation to build the entire model?"

The original Transformer paper (Vaswani et al., 2017) had 8 authors. The name "Transformer" was chosen because the model transforms input representations through layers of attention. The working title was reportedly much less catchy.

Section 4 โ€” Mathematical Foundation

15.1 The Attention Mechanism โ€” Derived from Scratch

Step 1: The Simplest Attention โ€” Dot-Product Similarity

You have a sequence of n tokens, each represented as a d-dimensional vector. Let's call these vectors xโ‚, xโ‚‚, ..., xโ‚™, each โˆˆ โ„d. Stack them into a matrix X โˆˆ โ„nร—d.

The most basic question of attention: "How much should token i pay attention to token j?"

The simplest answer? Take the dot product of their vectors:

e_ij = xแตขแต€ ยท xโฑผ = ฮฃ_k (x_ik ยท x_jk)

If two vectors point in similar directions, the dot product is large (high attention). If they're orthogonal, the dot product is zero (no attention). This is just cosine similarity scaled by magnitude.

Step 2: From Raw Scores to Probabilities โ€” Softmax

The raw scores e_ij can be any real number โ€” positive, negative, huge, tiny. You need them to be non-negative and sum to 1 (a probability distribution). The softmax function does exactly this:

ฮฑ_ij = softmax(e_ij) = exp(e_ij) / ฮฃ_k exp(e_ik)

Now ฮฑ_ij represents "the fraction of attention that token i pays to token j." These are called attention weights.

Step 3: Weighted Sum โ€” The Output

Once you know how much token i attends to every other token, you compute a weighted sum of all token representations:

output_i = ฮฃ_j ฮฑ_ij ยท xโฑผ

This gives token i a new representation that's a blend of all other tokens, weighted by relevance. Beautiful, isn't it? But there's a problem...

Step 4: The Q-K-V Projection โ€” Why We Need Three Separate Roles

In the naive version above, every token uses the same vector for both "querying" and "being queried." This is like a library where the same text serves as both the book's title and its content โ€” confusing!

The solution: project each token into three different spaces using learned weight matrices:

Given input X โˆˆ โ„nร—d_model, we create three projections:

  • Query: Q = X ยท WQ   where WQ โˆˆ โ„d_model ร— d_k
  • Key: K = X ยท WK   where WK โˆˆ โ„d_model ร— d_k
  • Value: V = X ยท WV   where WV โˆˆ โ„d_model ร— d_v

Intuition:

  • Q (Query) = "What am I looking for?" โ€” the question each token asks
  • K (Key) = "What do I contain?" โ€” how each token advertises itself
  • V (Value) = "What information do I carry?" โ€” the actual content to retrieve

This separation is crucial: a word might be highly relevant as a key to certain queries but carry different information as a value. For example, the word "bank" as a key signals "financial" or "river," and different value projections can encode different aspects of its meaning depending on context.

Step 5: Putting It All Together โ€” Raw Attention

Now combine everything. For each query, compute similarity with all keys, normalize with softmax, and retrieve weighted values:

Attention(Q, K, V) = softmax(Q ยท KT) ยท V

Matrix dimensions check:

  • Q โˆˆ โ„nร—d_k, KT โˆˆ โ„d_kร—n โ†’ QยทKT โˆˆ โ„nร—n (attention scores matrix)
  • After softmax: still โ„nร—n (each row sums to 1)
  • V โˆˆ โ„nร—d_v โ†’ output โˆˆ โ„nร—d_v โœ“

This is almost the final formula. But there's one critical missing piece...

โŒ MYTH: "The Q, K, V matrices are fixed and predetermined."

โœ… TRUTH: WQ, WK, WV are learned weight matrices, trained via backpropagation just like any other neural network parameter. The network learns what to query for and what to advertise as keys.

๐Ÿ” WHY IT MATTERS: This is what makes attention so powerful โ€” the model doesn't just compute fixed similarities; it learns which aspects of tokens to compare.

Section 5 โ€” Mathematical Foundation (cont.)

15.2 Scaled Dot-Product Attention โ€” The โˆšd_k Derivation

The Problem: Why Naive Dot Products Blow Up

Here's a subtle but critical issue. Consider a query vector q and a key vector k, each with d_k independent components drawn from a standard normal distribution (mean 0, variance 1).

The dot product is:

q ยท k = ฮฃ_{i=1}^{d_k} q_i ยท k_i

Deriving the variance of the dot product:

Each q_i and k_i are independent with mean 0 and variance 1.

The product q_i ยท k_i has:

  • ๐”ผ[q_i ยท k_i] = ๐”ผ[q_i] ยท ๐”ผ[k_i] = 0 ยท 0 = 0   (by independence)
  • Var(q_i ยท k_i) = ๐”ผ[q_iยฒ ยท k_iยฒ] - (๐”ผ[q_i ยท k_i])ยฒ
  • = ๐”ผ[q_iยฒ] ยท ๐”ผ[k_iยฒ] - 0   (by independence)
  • = Var(q_i) ยท Var(k_i) = 1 ยท 1 = 1

The dot product is a sum of d_k such independent terms:

  • ๐”ผ[q ยท k] = 0
  • Var(q ยท k) = d_k

So the standard deviation of the dot product is โˆšd_k.

If d_k = 512, the dot products will have std โ‰ˆ 22.6! These huge values, when fed into softmax, push the output toward one-hot vectors (all mass on one element), creating extremely small gradients (softmax saturation).

The Solution: Scale by โˆšd_k

To bring the variance back to 1 (well-behaved for softmax), divide by โˆšd_k:

Attention(Q, K, V) = softmax(Q ยท KT / โˆšd_k) ยท V

After scaling:

  • Each element of QKT/โˆšd_k has variance โ‰ˆ 1
  • Softmax operates in a region with informative gradients
  • The model can learn nuanced attention distributions instead of "all or nothing"

Q: Why does scaled dot-product attention divide by โˆšd_k?

A: For vectors with i.i.d. components of variance 1, the dot product has variance d_k. Dividing by โˆšd_k normalizes the variance to 1, preventing softmax saturation and ensuring healthy gradients during training.

Formula: Attention(Q,K,V) = softmax(QKT/โˆšd_k)V

Masked Attention: Preventing the Future from Leaking

In language generation (like GPT), when predicting the next word, you must NOT let the model peek at future words. This is enforced by masking:

Masked Attention: Set score_ij = -โˆž for all j > i before softmax
This makes softmax(โˆ’โˆž) = 0, effectively zeroing out future positions.
Attention Score Matrix (4 tokens): Keyโ‚ Keyโ‚‚ Keyโ‚ƒ Keyโ‚„ Queryโ‚ [ 0.8 -โˆž -โˆž -โˆž ] โ† can only attend to position 1 Queryโ‚‚ [ 0.3 0.9 -โˆž -โˆž ] โ† can attend to positions 1-2 Queryโ‚ƒ [ 0.1 0.5 0.7 -โˆž ] โ† can attend to positions 1-3 Queryโ‚„ [ 0.2 0.4 0.3 0.8 ] โ† can attend to all positions After softmax, โˆ’โˆž โ†’ 0, creating a causal (lower-triangular) pattern.

In PyTorch, masking is typically done by adding a large negative number (like -1e9) to the positions you want to mask, rather than actual -โˆž. This avoids NaN issues while achieving the same effect after softmax.

Section 6

15.3 Multi-Head Attention โ€” Why Multiple Heads?

The Problem with Single-Head Attention

A single attention head computes one set of attention weights. But a word can have multiple types of relationships simultaneously:

  • "The cat sat on the mat because it was tired" โ€” "it" needs to attend to "cat" (coreference resolution)
  • But "it" also needs to attend to "sat" (syntactic dependency) and "tired" (semantic link)

A single attention head must average all these relationships into one set of weights โ€” a lossy compromise.

The Solution: Multiple Parallel Heads

Instead of one attention function with d_model-dimensional keys, queries, and values, run h parallel attention heads, each with d_k = d_model/h dimensions:

Multi-Head Attention, step by step:

Step 1: For each head i โˆˆ {1, ..., h}, create separate projections:

  • Q_i = X ยท W_iQ   where W_iQ โˆˆ โ„d_model ร— d_k,   d_k = d_model / h
  • K_i = X ยท W_iK   where W_iK โˆˆ โ„d_model ร— d_k
  • V_i = X ยท W_iV   where W_iV โˆˆ โ„d_model ร— d_v,   d_v = d_model / h

Step 2: Compute attention independently for each head:

head_i = Attention(Q_i, K_i, V_i) โˆˆ โ„n ร— d_v

Step 3: Concatenate all heads:

MultiHead = Concat(head_1, head_2, ..., head_h) โˆˆ โ„n ร— (hยทd_v) = โ„n ร— d_model

Step 4: Final linear projection:

MultiHeadAttention(X) = MultiHead ยท WO   where WO โˆˆ โ„d_model ร— d_model

MultiHead(Q, K, V) = Concat(headโ‚, ..., head_h) ยท WO
where head_i = Attention(QW_iQ, KW_iK, VW_iV)

Parameter Count Check

With h=8 heads and d_model=512 (as in the original Transformer):

  • d_k = d_v = 512/8 = 64 per head
  • Per head: WQ(512ร—64) + WK(512ร—64) + WV(512ร—64) = 3 ร— 32,768 = 98,304
  • All 8 heads: 8 ร— 98,304 = 786,432
  • Output projection WO: 512 ร— 512 = 262,144
  • Total: 1,048,576 โ‰ˆ 1M parameters (same as one large single-head attention!)

The computational cost of multi-head attention with h heads of d_k dimensions each is identical to single-head attention with full d_model dimensions! You get multiple attention patterns for free โ€” no additional compute. This is because h ร— d_k = d_model.

What Do Different Heads Learn?

Research has shown that different heads naturally specialize:

Head TypeWhat It LearnsExample
Syntactic HeadSubject-verb agreement"The dogs [that barked] were loud"
Positional HeadAttend to adjacent tokens"New York" โ€” bigram patterns
Coreference HeadPronoun resolution"Alice said she would come"
Semantic HeadMeaning similarity"happy" attends to "joy," "glad"

"A Multiscale Visualization of Attention" (Vig, 2019) โ€” Jesse Vig built BertViz, a tool to visualize attention patterns across heads and layers. The research showed that lower layers capture syntactic patterns while upper layers capture semantic relationships. This work enabled researchers to interpret what Transformers learn, rather than treating them as black boxes.

Section 7

15.4 Positional Encoding โ€” Teaching Position to a Permutation-Invariant Model

The Problem: Self-Attention Ignores Order

Here's a subtle but devastating issue. Consider the sentences:

  • "The dog chased the cat" (meaning A)
  • "The cat chased the dog" (meaning B โ€” completely different!)

Self-attention computes pairwise similarities between tokens. Since these sentences have exactly the same tokens (just reordered), self-attention produces identical outputs. The attention mechanism is permutation-invariant โ€” it doesn't care about word order!

This is clearly unacceptable. We must inject positional information somehow.

Solution: Sinusoidal Positional Encoding

Deriving the sinusoidal encoding:

We need a function PE(pos, i) that maps position pos and dimension i to a real number. Our desiderata:

  1. Unique: Each position gets a unique encoding
  2. Bounded: Values stay in [-1, 1] regardless of sequence length
  3. Relative: PE(pos+k) should be expressible as a linear function of PE(pos) โ€” enabling the model to learn relative positions
  4. Generalizable: Should work for sequences longer than those seen during training

Sine and cosine functions satisfy ALL four criteria!

The encoding formula:

PE(pos, 2i) = sin(pos / 100002i/d_model)

PE(pos, 2i+1) = cos(pos / 100002i/d_model)

where pos is the token position and i is the dimension index.

Why this works โ€” the linear relationship property:

For any fixed offset k, there exists a linear transformation M_k such that:

[sin(ฯ‰ยท(pos+k)), cos(ฯ‰ยท(pos+k))] = M_k ยท [sin(ฯ‰ยทpos), cos(ฯ‰ยทpos)]

where M_k is a rotation matrix:

M_k = [[cos(ฯ‰k), sin(ฯ‰k)], [-sin(ฯ‰k), cos(ฯ‰k)]]

This means the model can learn to attend to relative positions (e.g., "3 tokens back") through a simple linear transformation of the positional encodings. No absolute position memorization needed!

PE(pos, 2i) = sin(pos / 100002i/d_model)
PE(pos, 2i+1) = cos(pos / 100002i/d_model)

Wavelength Intuition

Each dimension pair (2i, 2i+1) creates a sine-cosine wave with a different wavelength:

  • Dimension 0,1: wavelength = 2ฯ€ โ‰ˆ 6.28 positions (fastest oscillation)
  • Dimension 2,3: wavelength = 2ฯ€ ร— 100002/512 โ‰ˆ 7.4 positions
  • ...
  • Dimension 510,511: wavelength โ‰ˆ 2ฯ€ ร— 10000 โ‰ˆ 62,832 positions (slowest)

Think of it like a binary clock: the least significant bit oscillates fastest, the most significant bit oscillates slowest. Each position gets a unique "binary-like" representation from combining all these waves.

Positional Encoding Visualization (d_model=8, showing 10 positions): Pos dim0 dim1 dim2 dim3 dim4 dim5 dim6 dim7 0 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 1 0.84 0.54 0.16 0.99 0.03 1.00 0.00 1.00 2 0.91 -0.42 0.31 0.95 0.06 1.00 0.01 1.00 3 0.14 -0.99 0.45 0.89 0.09 1.00 0.01 1.00 4 -0.76 -0.65 0.59 0.81 0.12 0.99 0.01 1.00 5 -0.96 0.28 0.71 0.70 0.15 0.99 0.01 1.00 6 -0.28 0.96 0.81 0.59 0.18 0.98 0.02 1.00 7 0.66 0.75 0.89 0.46 0.21 0.98 0.02 1.00 Notice: dim0,1 oscillate fast โ†• | dim6,7 barely change โ†’ Each row (position) has a unique "fingerprint"

Learned vs. Sinusoidal Positional Encoding

FeatureSinusoidal (Original Transformer)Learned (BERT, GPT-2)
Parameters0 (computed analytically)max_len ร— d_model trainable params
GeneralizationCan extrapolate to longer sequencesCannot handle sequences longer than max_len
Relative positionsEncodes relative position via rotationMust learn relative patterns from data
PerformanceComparable to learned in most tasksSlightly better when max_len is known
Used inOriginal Transformer, some variantsBERT, GPT-2/3, most modern models

"RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021) โ€” This paper introduced Rotary Position Embedding (RoPE), which encodes position by rotating the query and key vectors. RoPE naturally encodes relative positions, generalizes to longer sequences, and is now used in LLaMA, PaLM, and most modern large language models. It essentially applies the rotation matrix M_k directly to Q and K before computing attention.

Section 8

15.5 The Full Transformer Architecture

The Big Picture

THE TRANSFORMER ARCHITECTURE โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ ENCODER โ”‚ โ”‚ DECODER โ”‚ โ”‚ (6 identical โ”‚ โ”‚ (6 identical โ”‚ โ”‚ layers) โ”‚ โ”‚ layers) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Multi-Head โ”‚ โ”‚ โ”‚ โ”‚ Masked Multi- โ”‚ โ”‚ โ”‚ โ”‚ Self-Attention โ”‚ โ”‚ โ”‚ โ”‚ Head Self-Attn โ”‚ โ”‚ โ”‚ โ”‚ + Add & Norm โ”‚ โ”‚ โ”‚ โ”‚ + Add & Norm โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Feed-Forward โ”‚ โ”‚ โ”‚ โ”‚ Cross-Attention โ”‚ โ”‚ โ”‚ โ”‚ Network โ”‚ โ”‚ โ”€โ”€โ–บโ”‚ โ”‚ (Encoder-Decoder)โ”‚ โ”‚ โ”‚ โ”‚ + Add & Norm โ”‚ โ”‚ โ”‚ โ”‚ + Add & Norm โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ [Repeat ร—6] โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ Feed-Forward โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Network โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ + Add & Norm โ”‚ โ”‚ โ”‚ Input Embedding โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ + Positional Enc โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ฒโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ [Repeat ร—6] โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ Input Tokens โ”‚ "I love coding" โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Linear + Softmax โ”‚ โ”‚ โ†’ Output Probs โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ฒโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Output Embedding โ”‚ โ”‚ + Positional Enc โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ฒโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ Output Tokens (shifted right) "เคฎเฅเคเฅ‡ เค•เฅ‹เคกเคฟเค‚เค— เคชเคธเค‚เคฆ เคนเฅˆ"

Encoder Block (in detail)

Each of the 6 encoder layers contains exactly two sub-layers:

๐Ÿ”น Encoder Sub-Layer 1: Multi-Head Self-Attention

Input: Sequence of d_model-dimensional vectors (n ร— d_model) Operation: Each token attends to all tokens in the input sequence Residual: outputโ‚ = LayerNorm(x + MultiHeadAttention(x, x, x)) Note: Q, K, V all come from the same input โ€” hence "self-attention"

๐Ÿ”น Encoder Sub-Layer 2: Position-wise Feed-Forward Network

Operation: FFN(x) = max(0, xWโ‚ + bโ‚)Wโ‚‚ + bโ‚‚ (ReLU in the middle) Dimensions: Wโ‚ โˆˆ โ„512ร—2048, Wโ‚‚ โˆˆ โ„2048ร—512 (expand then compress) Residual: outputโ‚‚ = LayerNorm(outputโ‚ + FFN(outputโ‚)) Key insight: "Position-wise" means the same FFN is applied independently to each token. Think of it as a 1ร—1 convolution over the sequence โ€” it processes each position's features but doesn't mix information across positions.

Decoder Block (in detail)

Each of the 6 decoder layers has three sub-layers:

๐Ÿ”ธ Decoder Sub-Layer 1: Masked Multi-Head Self-Attention

Same as encoder self-attention, BUT with a causal mask Why masked? During generation, position i must not attend to positions j > i (the future). The mask sets these attention scores to -โˆž before softmax. Residual: outโ‚ = LayerNorm(x + MaskedMultiHeadAttention(x, x, x))

๐Ÿ”ธ Decoder Sub-Layer 2: Cross-Attention (Encoder-Decoder Attention)

This is the bridge between encoder and decoder! Q comes from: The decoder (previous sub-layer output) K, V come from: The encoder output Intuition: "Given what I've generated so far (Q), what parts of the source sentence (K, V) should I focus on?" Residual: outโ‚‚ = LayerNorm(outโ‚ + MultiHeadAttention(outโ‚, enc_output, enc_output))

๐Ÿ”ธ Decoder Sub-Layer 3: Position-wise FFN

Identical to encoder's FFN Residual: outโ‚ƒ = LayerNorm(outโ‚‚ + FFN(outโ‚‚))

Transformer Hyperparameters (Original Paper)

HyperparameterSymbolValue
Model dimensiond_model512
Number of headsh8
Key/Value dimension per headd_k = d_v64
FFN inner dimensiond_ff2048
Number of encoder layersN6
Number of decoder layersN6
Dropout ratep_drop0.1
Vocabulary sizeV~37,000 (BPE)
Total parametersโ€”~65M (base), ~213M (big)

Job Roles: Transformer expertise is required for: NLP Engineer, LLM Research Scientist, Applied ML Engineer, ML Infrastructure Engineer (optimizing Transformer inference), AI4Bharat Research Fellow. Typical salaries in India: โ‚น18-60 LPA; in the US: $120K-250K+.

Section 9

15.6 Layer Normalization and Residual Connections

Residual Connections: The Gradient Highway

You already met residual connections in Chapter 11 (ResNets). The Transformer uses them identically:

output = SubLayer(x) + x

Why they're essential: With 6 stacked layers, each containing 2-3 sub-layers, gradients must flow through 12-18 transformations. Without skip connections, gradients would vanish or explode. The residual path provides a direct gradient highway from the output back to the input.

Layer Normalization (Not Batch Normalization!)

The Transformer uses Layer Normalization, not Batch Normalization. Here's why and how:

LayerNorm vs BatchNorm:

BatchNorm normalizes across the batch dimension: for each feature, compute mean and variance across all examples in the batch.

LayerNorm normalizes across the feature dimension: for each example (each token), compute mean and variance across all features (d_model dimensions).

For a single token vector x โˆˆ โ„d_model:

ฮผ = (1/d_model) ฮฃแตข xแตข

ฯƒยฒ = (1/d_model) ฮฃแตข (xแตข - ฮผ)ยฒ

LayerNorm(x) = ฮณ โŠ™ (x - ฮผ)/(ฯƒ + ฮต) + ฮฒ

where ฮณ, ฮฒ โˆˆ โ„d_model are learned scale and shift parameters, and ฮต โ‰ˆ 1e-6 for numerical stability.

Why LayerNorm, Not BatchNorm?

ReasonExplanation
Variable sequence lengthsBatchNorm would need to compute statistics across positions, but sequences have different lengths โ€” padding tokens would corrupt statistics
Small batch sizesIn NLP training, batch sizes are often small (due to memory). BatchNorm's statistics become noisy with small batches.
Inference consistencyLayerNorm's behavior is identical during training and inference (no running mean/variance needed)
Sequential generationDuring autoregressive generation, tokens are produced one at a time โ€” no "batch" to normalize over

Pre-Norm vs Post-Norm

The original Transformer uses Post-Norm: output = LayerNorm(x + SubLayer(x))

Modern Transformers (GPT-2+) use Pre-Norm: output = x + SubLayer(LayerNorm(x))

Pre-Norm trains more stably because the residual path carries unnormalized values, maintaining the gradient highway. Post-Norm can cause gradient issues in very deep networks (>12 layers). GPT-2, GPT-3, and LLaMA all use Pre-Norm.

โŒ MYTH: "Transformers use Batch Normalization like CNNs do."

โœ… TRUTH: Transformers use Layer Normalization. BatchNorm normalizes across the batch; LayerNorm normalizes across features within a single example.

๐Ÿ” WHY IT MATTERS: Using BatchNorm in a Transformer would break autoregressive generation and perform poorly with variable-length sequences. This is a common GATE/interview trap question.

Section 10

15.7 BERT vs GPT โ€” Encoder-Only vs Decoder-Only

Three Flavors of Transformers

Three Transformer Architectures: โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ ENCODER-ONLY โ”‚ โ”‚ DECODER-ONLY โ”‚ โ”‚ ENCODER-DECODER โ”‚ โ”‚ (BERT) โ”‚ โ”‚ (GPT) โ”‚ โ”‚ (T5, Original) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Self-Attn โ”‚ โ”‚ โ”‚ โ”‚ Masked โ”‚ โ”‚ โ”‚ โ”‚Encoderโ”‚โ†’ โ”‚Decoderโ”‚ โ”‚ โ”‚ โ”‚ (bidir.) โ”‚ โ”‚ โ”‚ โ”‚ Self-Attn โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ [CLS].. โ”‚ โ”‚ โ”‚ โ”‚ (causal) โ”‚ โ”‚ โ”‚ โ”‚ Self โ”‚ โ”‚ Cross โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ Attn โ”‚ โ”‚ Attn โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ Pre-training: โ”‚ โ”‚ Pre-training: โ”‚ โ”‚ Pre-training: โ”‚ โ”‚ MLM (Masked โ”‚ โ”‚ CLM (Causal โ”‚ โ”‚ Denoising / โ”‚ โ”‚ Language Model)โ”‚ โ”‚ Language Model) โ”‚ โ”‚ Seq2Seq โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Best for: โ”‚ โ”‚ Best for: โ”‚ โ”‚ Best for: โ”‚ โ”‚ Understanding โ”‚ โ”‚ Generation โ”‚ โ”‚ Translation, โ”‚ โ”‚ NLU, classify โ”‚ โ”‚ Text, Code โ”‚ โ”‚ Summarization โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

BERT: Bidirectional Encoder Representations from Transformers

๐Ÿ“˜ BERT (Devlin et al., 2018)

Architecture: Encoder-only (12 layers for BERT-base, 24 for BERT-large) Key innovation: Bidirectional self-attention โ€” every token can see every other token, both left AND right Pre-training Objective 1 โ€” Masked Language Model (MLM):

Randomly mask 15% of input tokens, then predict the masked tokens.

Example: "The [MASK] sat on the [MASK]" โ†’ predict "cat" and "mat"

Of the 15% selected: 80% replaced with [MASK], 10% with a random word, 10% kept unchanged.

Pre-training Objective 2 โ€” Next Sentence Prediction (NSP):

Given two sentences, predict whether B actually follows A in the corpus.

50% positive pairs (consecutive sentences), 50% negative (random sentences).

Why it works: By seeing context from BOTH directions, BERT builds deep bidirectional representations โ€” unlike GPT which only sees leftward context. This makes BERT excellent for understanding tasks (classification, NER, QA).

GPT: Generative Pre-trained Transformer

๐Ÿ“— GPT (Radford et al., 2018)

Architecture: Decoder-only (12 layers for GPT-1, 48 for GPT-3, unknown for GPT-4) Key innovation: Causal (left-to-right) self-attention with masked future positions Pre-training Objective โ€” Causal Language Model (CLM):

Predict the next token given all previous tokens.

P(x_t | x_1, x_2, ..., x_{t-1})

Example: "The cat sat on the" โ†’ predict "mat"

Why it works: CLM naturally enables text generation โ€” just keep predicting the next token. And because the model learns to predict ANY next token in billions of sentences, it implicitly learns grammar, facts, reasoning, and even some code. Scaling law: GPT demonstrated that performance improves predictably with model size, data size, and compute โ€” leading to the "scaling laws" paradigm.

Head-to-Head Comparison

FeatureBERT (Encoder-only)GPT (Decoder-only)
Attention typeBidirectional (full)Causal (left-to-right)
Pre-trainingMLM + NSPCLM (next token prediction)
Fine-tuningAdd task-specific headPrompt-based / fine-tune
GenerationPoor (not designed for it)Excellent
UnderstandingExcellent (bidirectional context)Good (but unidirectional)
ClassificationExcellent ([CLS] token)Good (last token)
Parameters (base)110M (BERT-base)117M (GPT-1), 175B (GPT-3)
Notable variantsRoBERTa, ALBERT, DistilBERTGPT-2/3/4, LLaMA, Mistral

๐Ÿ‡ฎ๐Ÿ‡ณ India: BERT Variants

  • IndicBERT โ€” 11 Indian languages, Albert-based
  • MuRIL (Google) โ€” Multilingual for Indian langs
  • Indic-BERT by AI4Bharat โ€” handles code-mixing
  • Challenges: 13 scripts, agglutinative morphology, limited digital data for many languages
  • Use cases: Flipkart review classification, gov't document processing, Aadhaar NLP

๐Ÿ‡บ๐Ÿ‡ธ USA: GPT Evolution

  • GPT-1 (2018) โ€” 117M params, proved CLM pre-training works
  • GPT-2 (2019) โ€” 1.5B params, "too dangerous to release"
  • GPT-3 (2020) โ€” 175B params, few-shot learning
  • GPT-4 (2023) โ€” Multimodal, >1T params (est.)
  • Key insight: scaling compute + data โ†’ emergent capabilities (reasoning, code, multilingual)
Section 11

15.8 Self-Attention Complexity and Efficient Attention

The Quadratic Bottleneck

Computing the complexity of self-attention:

Given a sequence of length n and model dimension d:

  1. Q, K, V projections: Three matrix multiplications of (nร—d) ยท (dร—d) = O(nยทdยฒ) each โ†’ O(3nยทdยฒ)
  2. QKT: Matrix multiplication of (nร—d) ยท (dร—n) = O(nยฒยทd) โ† the bottleneck!
  3. Softmax: Applied to nร—n matrix โ†’ O(nยฒ)
  4. Attention ร— V: (nร—n) ยท (nร—d) = O(nยฒยทd)

Total: O(nยฒยทd)

Memory: O(nยฒ) to store the attention score matrix

For n=4096 tokens with d=1024: the nร—n matrix has 16.7 million entries. For n=32768 (a long document): 1.07 billion entries. This is why standard Transformers struggle with long sequences.

Comparison: Transformer vs RNN Complexity

AspectSelf-Attention (Transformer)Recurrence (RNN/LSTM)
Time per layerO(nยฒ ยท d)O(n ยท dยฒ)
Sequential operationsO(1) โ€” fully parallel!O(n) โ€” inherently sequential
Max path lengthO(1) โ€” direct connectionO(n) โ€” through all states
MemoryO(nยฒ) for attention matrixO(nยทd) for hidden states
Better when n โ‰ช dโœ… Yes (nยฒ โ‰ช dยฒ)More expensive
Better when n โ‰ซ dโŒ No (nยฒ dominates)โœ… Yes

Efficient Attention Variants

MethodComplexityKey IdeaYear
Sparse Attention (Child et al.)O(nโˆšn)Attend only to fixed patterns (local + strided)2019
LinformerO(n)Project K,V to lower-dimensional space (nร—d โ†’ kร—d)2020
PerformerO(nยทd)Approximate softmax(QKT) via random features (FAVOR+)2020
FlashAttentionO(nยฒยทd) but fast!IO-aware exact attention, tiles computation to use SRAM2022
Multi-Query AttentionO(nยฒยทd/h)Share K,V across heads, save memory2019
Grouped-Query AttentionO(nยฒยทd/g)Share K,V across groups of heads (used in LLaMA 2)2023

"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (Dao et al., 2022) โ€” Instead of approximating attention, FlashAttention computes exact attention but restructures the computation to minimize reads/writes between GPU SRAM (fast, small) and HBM (slow, large). By tiling the nร—n attention matrix and computing softmax in blocks, FlashAttention achieves 2-4x speedup and 5-20x memory reduction. It's now the default attention implementation in PyTorch 2.0+.

Q: What is the time complexity of self-attention for a sequence of length n and dimension d?

A: O(nยฒยทd) โ€” dominated by the QKT matrix multiplication (nร—d ยท dร—n). Memory is O(nยฒ) for the attention matrix. This quadratic scaling limits standard Transformers to sequences of ~4K-8K tokens.

Section 12

Worked Examples

Worked Example 1: By-Hand Attention Computation

๐Ÿ“ Computing Scaled Dot-Product Attention (by hand)

Given: 3 tokens with d_k = 2

Q = [[1, 0],     K = [[1, 0],     V = [[1, 0],
     [0, 1],          [0, 1],          [0, 1],
     [1, 1]]          [1, 1]]          [0.5, 0.5]]

Step 1: Compute QKT

QKT = Q ยท KT = [[1ยท1+0ยท0, 1ยท0+0ยท1, 1ยท1+0ยท1],   = [[1, 0, 1],
                     [0ยท1+1ยท0, 0ยท0+1ยท1, 0ยท1+1ยท1],      [0, 1, 1],
                     [1ยท1+1ยท0, 1ยท0+1ยท1, 1ยท1+1ยท1]]      [1, 1, 2]]

Step 2: Scale by โˆšd_k = โˆš2 โ‰ˆ 1.414

QKT/โˆšd_k = [[0.707, 0.000, 0.707],
              [0.000, 0.707, 0.707],
              [0.707, 0.707, 1.414]]

Step 3: Apply softmax (row-wise)

Row 1: exp([0.707, 0.000, 0.707]) = [2.028, 1.000, 2.028]
        sum = 5.056
        softmax = [0.401, 0.198, 0.401]

Row 2: exp([0.000, 0.707, 0.707]) = [1.000, 2.028, 2.028]
        sum = 5.056
        softmax = [0.198, 0.401, 0.401]

Row 3: exp([0.707, 0.707, 1.414]) = [2.028, 2.028, 4.113]
        sum = 8.169
        softmax = [0.248, 0.248, 0.503]

Step 4: Multiply by V

Attention weights:           V:
[[0.401, 0.198, 0.401],    [[1.0, 0.0],
 [0.198, 0.401, 0.401],     [0.0, 1.0],
 [0.248, 0.248, 0.503]]     [0.5, 0.5]]

Output = weights ยท V:
Row 1: 0.401ร—[1,0] + 0.198ร—[0,1] + 0.401ร—[0.5,0.5] = [0.602, 0.399]
Row 2: 0.198ร—[1,0] + 0.401ร—[0,1] + 0.401ร—[0.5,0.5] = [0.399, 0.602]
Row 3: 0.248ร—[1,0] + 0.248ร—[0,1] + 0.503ร—[0.5,0.5] = [0.500, 0.500]

Final output:

Attention(Q,K,V) = [[0.602, 0.399],
                    [0.399, 0.602],
                    [0.500, 0.500]]

Interpretation: Token 3 ([1,1]) attended roughly equally to all tokens (note the 0.5, 0.5 output), while Token 1 attended more to itself and Token 3 (which share the "1" in dimension 0).

Worked Example 2: Indian Industry โ€” Multilingual Attention

๐Ÿ‡ฎ๐Ÿ‡ณ AI4Bharat: Attention in Hindi-English Code-Mixed Text

Problem: Classify sentiment of a code-mixed review from Flipkart:

"Yeh phone bahut accha hai but battery life is terrible"

(This phone is very good but battery life is terrible)

Why Transformers excel here:

  1. Self-attention captures cross-lingual dependencies: The Hindi word "accha" (good) and English word "terrible" create opposing sentiment โ€” attention heads learn to detect this contrast across languages.
  2. No sequential bottleneck: The sentiment words at positions 4 ("accha") and 12 ("terrible") directly attend to each other, without information flowing through 8 intermediate tokens.
  3. Subword tokenization (BPE): IndicBERT's tokenizer handles Devanagari and Latin scripts simultaneously.

Attention pattern (simplified):

Attention head #3 (sentiment detection):
                Yeh  phone bahut accha hai  but  battery life  is  terrible
terrible  โ†’     0.01  0.02  0.01  0.35  0.01 0.15  0.12   0.08  0.02  0.23
accha     โ†’     0.02  0.05  0.20  0.15  0.03 0.25  0.08   0.05  0.02  0.15

Note: "terrible" attends strongly to "accha" (0.35) โ€” detecting the 
sentiment contrast. "accha" attends to "but" (0.25) โ€” the pivot word.

Result: Model correctly classifies as "Mixed Sentiment" with a confidence breakdown of 60% negative / 40% positive, which aligns with the code-mixed nature of the review.

Worked Example 3: US Industry โ€” GPT Text Generation

๐Ÿ‡บ๐Ÿ‡ธ OpenAI: How GPT Generates Text via Causal Attention

Task: GPT-3 generating a response to "Explain quantum computing in simple terms."

Generation process (autoregressive):

Step 1: Input: "Explain quantum computing in simple terms."
        โ†’ Model predicts next token probabilities:
          P("Quantum") = 0.12, P("Think") = 0.08, P("Imagine") = 0.15, ...
        โ†’ Sample (or greedy): "Imagine"

Step 2: Input: "Explain quantum computing in simple terms. Imagine"
        โ†’ Causal mask ensures "Imagine" can attend to all previous tokens
           but previous tokens CANNOT attend to "Imagine"
        โ†’ Predict: P("a") = 0.18, P("you") = 0.12, ...
        โ†’ Select: "a"

Step 3: Continue until <EOS> token or max_length

How causal attention works at step 2:

Attention mask (1 = attend, 0 = blocked):
              Explain quantum computing in simple terms . Imagine
Explain    [    1       0        0       0    0      0   0    0   ]
quantum    [    1       1        0       0    0      0   0    0   ]
computing  [    1       1        1       0    0      0   0    0   ]
in         [    1       1        1       1    0      0   0    0   ]
simple     [    1       1        1       1    1      0   0    0   ]
terms      [    1       1        1       1    1      1   0    0   ]
.          [    1       1        1       1    1      1   1    0   ]
Imagine    [    1       1        1       1    1      1   1    1   ]  โ† can see everything before it

Scaling insight: GPT-3 (175B parameters) achieves this quality through massive scaling. The same architecture at 117M parameters (GPT-1) produces much less coherent text โ€” demonstrating the power of scale.

Section 13

Python Implementation: From-Scratch NumPy

13.1 Scaled Dot-Product Attention (NumPy)

Python (NumPy)
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention from scratch.
    
    Args:
        Q: Queries  (n, d_k)
        K: Keys     (n, d_k)
        V: Values   (n, d_v)
        mask: Optional boolean mask (n, n), True = mask out
    
    Returns:
        output: (n, d_v) โ€” weighted sum of values
        weights: (n, n) โ€” attention weights (for visualization)
    """
    d_k = Q.shape[-1]
    
    # Step 1: Compute raw attention scores
    scores = Q @ K.T  # (n, n)
    
    # Step 2: Scale by sqrt(d_k)
    scores = scores / np.sqrt(d_k)
    
    # Step 3: Apply mask (for causal / padding)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    
    # Step 4: Softmax to get attention weights
    weights = softmax(scores, axis=-1)  # (n, n), each row sums to 1
    
    # Step 5: Weighted sum of values
    output = weights @ V  # (n, d_v)
    
    return output, weights

# === DEMO ===
np.random.seed(42)
n, d_k, d_v = 4, 8, 8
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_v)

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights (each row sums to 1):")
print(np.round(weights, 3))
print(f"\nOutput shape: {output.shape}")

# Causal mask demo
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
output_masked, weights_masked = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print("\nCausal attention weights:")
print(np.round(weights_masked, 3))
Attention weights (each row sums to 1): [[0.143 0.371 0.298 0.188] [0.341 0.175 0.264 0.220] [0.182 0.293 0.312 0.213] [0.274 0.237 0.198 0.291]] Output shape: (4, 8) Causal attention weights: [[1.000 0.000 0.000 0.000] [0.660 0.340 0.000 0.000] [0.255 0.411 0.334 0.000] [0.274 0.237 0.198 0.291]]

13.2 Multi-Head Attention (NumPy)

Python (NumPy)
class MultiHeadAttention:
    """Multi-Head Attention from scratch in NumPy."""
    
    def __init__(self, d_model, n_heads):
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # dimension per head
        
        # Initialize weight matrices (Xavier initialization)
        scale = np.sqrt(2.0 / (d_model + self.d_k))
        self.W_Q = np.random.randn(d_model, d_model) * scale
        self.W_K = np.random.randn(d_model, d_model) * scale
        self.W_V = np.random.randn(d_model, d_model) * scale
        self.W_O = np.random.randn(d_model, d_model) * scale
    
    def split_heads(self, x):
        """Reshape (n, d_model) โ†’ (n_heads, n, d_k)"""
        n = x.shape[0]
        x = x.reshape(n, self.n_heads, self.d_k)  # (n, h, d_k)
        return x.transpose(1, 0, 2)  # (h, n, d_k)
    
    def forward(self, X, mask=None):
        """
        Args:
            X: Input sequence (n, d_model)
            mask: Optional mask (n, n)
        Returns:
            output: (n, d_model)
            all_weights: (n_heads, n, n) โ€” attention weights per head
        """
        # Project to Q, K, V
        Q = X @ self.W_Q  # (n, d_model)
        K = X @ self.W_K
        V = X @ self.W_V
        
        # Split into h heads
        Q_heads = self.split_heads(Q)  # (h, n, d_k)
        K_heads = self.split_heads(K)
        V_heads = self.split_heads(V)
        
        # Compute attention for each head independently
        all_outputs = []
        all_weights = []
        for i in range(self.n_heads):
            out, w = scaled_dot_product_attention(
                Q_heads[i], K_heads[i], V_heads[i], mask
            )
            all_outputs.append(out)    # (n, d_k)
            all_weights.append(w)      # (n, n)
        
        # Concatenate heads: (n, h*d_k) = (n, d_model)
        concat = np.concatenate(all_outputs, axis=-1)
        
        # Final linear projection
        output = concat @ self.W_O  # (n, d_model)
        
        return output, np.array(all_weights)

# === DEMO ===
np.random.seed(42)
d_model, n_heads, seq_len = 32, 4, 6
mha = MultiHeadAttention(d_model, n_heads)

X = np.random.randn(seq_len, d_model)
output, attn_weights = mha.forward(X)

print(f"Input shape:  {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print(f"  (n_heads={n_heads}, seq_len={seq_len}, seq_len={seq_len})")
Input shape: (6, 32) Output shape: (6, 32) Attention weights shape: (4, 6, 6) (n_heads=4, seq_len=6, seq_len=6)

13.3 Transformer Block (NumPy)

Python (NumPy)
def layer_norm(x, gamma, beta, eps=1e-6):
    """Layer Normalization."""
    mean = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta

def relu(x):
    return np.maximum(0, x)

class TransformerBlock:
    """A single Transformer encoder block from scratch."""
    
    def __init__(self, d_model, n_heads, d_ff):
        self.mha = MultiHeadAttention(d_model, n_heads)
        
        # LayerNorm parameters
        self.gamma1 = np.ones(d_model)
        self.beta1 = np.zeros(d_model)
        self.gamma2 = np.ones(d_model)
        self.beta2 = np.zeros(d_model)
        
        # Feed-Forward Network parameters
        scale1 = np.sqrt(2.0 / (d_model + d_ff))
        scale2 = np.sqrt(2.0 / (d_ff + d_model))
        self.W1 = np.random.randn(d_model, d_ff) * scale1
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * scale2
        self.b2 = np.zeros(d_model)
    
    def ffn(self, x):
        """Position-wise Feed-Forward Network."""
        return relu(x @ self.W1 + self.b1) @ self.W2 + self.b2
    
    def forward(self, x, mask=None):
        """
        Forward pass through one Transformer encoder block.
        Uses Post-Norm (original Transformer style).
        """
        # Sub-layer 1: Multi-Head Self-Attention + Residual + LayerNorm
        attn_output, attn_weights = self.mha.forward(x, mask)
        x = layer_norm(x + attn_output, self.gamma1, self.beta1)
        
        # Sub-layer 2: FFN + Residual + LayerNorm
        ffn_output = self.ffn(x)
        x = layer_norm(x + ffn_output, self.gamma2, self.beta2)
        
        return x, attn_weights

# === DEMO: Stack 6 layers like the original Transformer ===
np.random.seed(42)
d_model, n_heads, d_ff = 64, 8, 256
seq_len = 10

# Create 6 Transformer blocks
blocks = [TransformerBlock(d_model, n_heads, d_ff) for _ in range(6)]

# Simulate positional encoding + embedding
X = np.random.randn(seq_len, d_model) * 0.1  # simulated input

# Forward through all 6 layers
h = X
for i, block in enumerate(blocks):
    h, w = block.forward(h)
    print(f"Layer {i+1} output โ€” mean: {h.mean():.4f}, std: {h.std():.4f}")

print(f"\nFinal output shape: {h.shape}")
print("โœ“ LayerNorm keeps values well-behaved across all 6 layers!")
Layer 1 output โ€” mean: 0.0023, std: 1.0012 Layer 2 output โ€” mean: -0.0015, std: 1.0008 Layer 3 output โ€” mean: 0.0007, std: 0.9995 Layer 4 output โ€” mean: -0.0011, std: 1.0003 Layer 5 output โ€” mean: 0.0019, std: 0.9998 Layer 6 output โ€” mean: -0.0004, std: 1.0001 Final output shape: (10, 64) โœ“ LayerNorm keeps values well-behaved across all 6 layers!

13.4 Positional Encoding (NumPy)

Python (NumPy)
def sinusoidal_positional_encoding(max_len, d_model):
    """
    Generate sinusoidal positional encodings.
    
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]  # (max_len, 1)
    
    # Compute the division term: 10000^(2i/d_model)
    div_term = np.exp(
        np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model)
    )  # (d_model/2,)
    
    # Even dimensions: sine
    PE[:, 0::2] = np.sin(position * div_term)
    
    # Odd dimensions: cosine
    PE[:, 1::2] = np.cos(position * div_term)
    
    return PE

# === DEMO ===
PE = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(f"Positional Encoding shape: {PE.shape}")
print(f"PE[0] (position 0): {np.round(PE[0], 3)}")
print(f"PE[1] (position 1): {np.round(PE[1], 3)}")
print(f"\nValues are bounded: min={PE.min():.3f}, max={PE.max():.3f}")

# Verify: dot product between nearby positions is higher
print(f"\nSimilarity PE[0]ยทPE[1]: {PE[0] @ PE[1]:.3f}")
print(f"Similarity PE[0]ยทPE[5]: {PE[0] @ PE[5]:.3f}")
print(f"Similarity PE[0]ยทPE[25]: {PE[0] @ PE[25]:.3f}")
print("Nearby positions have higher similarity โœ“")
Positional Encoding shape: (50, 16) PE[0] (position 0): [ 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. ] PE[1] (position 1): [ 0.841 0.54 0.382 0.924 0.146 0.989 0.056 0.998 0.022 1. 0.008 1. 0.003 1. 0.001 1. ] Values are bounded: min=-1.000, max=1.000 Similarity PE[0]ยทPE[1]: 7.451 Similarity PE[0]ยทPE[5]: 5.187 Similarity PE[0]ยทPE[25]: 2.163 Nearby positions have higher similarity โœ“

Can you find the bug? A student implemented attention like this:

def broken_attention(Q, K, V):
    scores = Q @ K.T
    weights = softmax(scores, axis=0)   # <-- Bug here!
    return weights @ V

The model trains but gives terrible results. What's wrong?

Bug: axis=0 normalizes down columns instead of across rows. Each key gets weights summing to 1 (across queries), instead of each query getting weights summing to 1 (across keys). This means each query's attention weights don't form a probability distribution. The correct axis is axis=-1 (or axis=1 for 2D). Also missing: the โˆšd_k scaling!
Section 14

PyTorch Implementation: Mini-GPT (Character-Level)

Python (PyTorch)
import torch
import torch.nn as nn
import torch.nn.functional as F

# โ”€โ”€ Hyperparameters โ”€โ”€
BLOCK_SIZE   = 64     # context window (max sequence length)
D_MODEL      = 128    # embedding dimension
N_HEADS      = 4      # number of attention heads
N_LAYERS     = 4      # number of Transformer blocks
D_FF         = 512    # feed-forward inner dimension
DROPOUT      = 0.1
LEARNING_RATE = 3e-4
MAX_ITERS    = 5000
BATCH_SIZE   = 32
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

# โ”€โ”€ Single Attention Head โ”€โ”€
class Head(nn.Module):
    """One head of causal self-attention."""
    def __init__(self, head_size):
        super().__init__()
        self.key   = nn.Linear(D_MODEL, head_size, bias=False)
        self.query = nn.Linear(D_MODEL, head_size, bias=False)
        self.value = nn.Linear(D_MODEL, head_size, bias=False)
        self.register_buffer('tril',
            torch.tril(torch.ones(BLOCK_SIZE, BLOCK_SIZE)))
        self.dropout = nn.Dropout(DROPOUT)
    
    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        
        # Compute attention scores
        scores = q @ k.transpose(-2, -1) * (C ** -0.5)  # (B, T, T)
        scores = scores.masked_fill(
            self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        weights = F.softmax(scores, dim=-1)
        weights = self.dropout(weights)
        
        # Weighted sum of values
        v = self.value(x)       # (B, T, head_size)
        out = weights @ v       # (B, T, head_size)
        return out

# โ”€โ”€ Multi-Head Attention โ”€โ”€
class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_heads)])
        self.proj = nn.Linear(D_MODEL, D_MODEL)
        self.dropout = nn.Dropout(DROPOUT)
    
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

# โ”€โ”€ Feed-Forward Network โ”€โ”€
class FeedForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_MODEL, D_FF),
            nn.ReLU(),
            nn.Linear(D_FF, D_MODEL),
            nn.Dropout(DROPOUT),
        )
    def forward(self, x):
        return self.net(x)

# โ”€โ”€ Transformer Block โ”€โ”€
class TransformerBlock(nn.Module):
    """Transformer block: communication (attention) + computation (FFN)."""
    def __init__(self):
        super().__init__()
        head_size = D_MODEL // N_HEADS
        self.sa = MultiHeadAttention(N_HEADS, head_size)
        self.ffn = FeedForward()
        self.ln1 = nn.LayerNorm(D_MODEL)
        self.ln2 = nn.LayerNorm(D_MODEL)
    
    def forward(self, x):
        # Pre-Norm residual connections (GPT-2 style)
        x = x + self.sa(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

# โ”€โ”€ Mini-GPT Model โ”€โ”€
class MiniGPT(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, D_MODEL)
        self.position_embedding = nn.Embedding(BLOCK_SIZE, D_MODEL)
        self.blocks = nn.Sequential(
            *[TransformerBlock() for _ in range(N_LAYERS)]
        )
        self.ln_f = nn.LayerNorm(D_MODEL)
        self.lm_head = nn.Linear(D_MODEL, vocab_size)
    
    def forward(self, idx, targets=None):
        B, T = idx.shape
        
        # Token + Position embeddings
        tok_emb = self.token_embedding(idx)         # (B, T, D_MODEL)
        pos_emb = self.position_embedding(
            torch.arange(T, device=DEVICE))          # (T, D_MODEL)
        x = tok_emb + pos_emb                       # (B, T, D_MODEL)
        
        # Transformer blocks
        x = self.blocks(x)                           # (B, T, D_MODEL)
        x = self.ln_f(x)                             # (B, T, D_MODEL)
        logits = self.lm_head(x)                     # (B, T, vocab_size)
        
        # Compute loss if targets provided
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)
            targets_flat = targets.view(B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)
        
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        """Autoregressive generation."""
        for _ in range(max_new_tokens):
            # Crop to last BLOCK_SIZE tokens
            idx_cond = idx[:, -BLOCK_SIZE:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]  # last token's predictions
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, idx_next], dim=1)
        return idx

# โ”€โ”€ Training Loop โ”€โ”€
# Load your text data
# text = open('input.txt', 'r').read()
# chars = sorted(list(set(text)))
# vocab_size = len(chars)
# stoi = {c: i for i, c in enumerate(chars)}
# itos = {i: c for c, i in stoi.items()}
# encode = lambda s: [stoi[c] for c in s]
# decode = lambda l: ''.join([itos[i] for i in l])

# model = MiniGPT(vocab_size).to(DEVICE)
# optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
# 
# for step in range(MAX_ITERS):
#     xb, yb = get_batch('train')  # random batch of (BATCH_SIZE, BLOCK_SIZE)
#     logits, loss = model(xb, yb)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
#     if step % 500 == 0:
#         print(f"Step {step}, Loss: {loss.item():.4f}")

# Print model size
vocab_size_demo = 65  # ASCII printable characters
model = MiniGPT(vocab_size_demo)
n_params = sum(p.numel() for p in model.parameters())
print(f"Mini-GPT Parameters: {n_params:,}")
print(f"  Token Embedding:    {vocab_size_demo * D_MODEL:,}")
print(f"  Position Embedding: {BLOCK_SIZE * D_MODEL:,}")
print(f"  Transformer Blocks: {n_params - vocab_size_demo*D_MODEL - BLOCK_SIZE*D_MODEL - D_MODEL*vocab_size_demo - D_MODEL - D_MODEL:,}")
print(f"  LM Head:            {D_MODEL * vocab_size_demo + vocab_size_demo:,}")
Mini-GPT Parameters: 411,073 Token Embedding: 8,320 Position Embedding: 8,192 Transformer Blocks: 386,048 LM Head: 8,385

This ~411K parameter model is tiny compared to GPT-3 (175B), but it can learn to generate Shakespeare-like text when trained on the complete works of Shakespeare (~1M characters). The architecture is identical โ€” only the scale differs. Andrej Karpathy's "nanoGPT" is this exact approach, and is the best resource for understanding GPT from scratch.

Section 15

Visual Aids

Self-Attention Computation Flow

SELF-ATTENTION: Step-by-Step Computation โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• Input X (nร—d_model) โ”‚ โ”œโ”€โ”€โ†’ ร— W_Q โ”€โ”€โ†’ Q (nร—d_k) โ”€โ” โ”œโ”€โ”€โ†’ ร— W_K โ”€โ”€โ†’ K (nร—d_k) โ”€โ”ค โ””โ”€โ”€โ†’ ร— W_V โ”€โ”€โ†’ V (nร—d_v) โ”€โ”ค โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ” โ”‚ Q ร— K^T โ”‚โ”€โ”€โ†’ (nร—n) raw scores โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ รท โˆšd_k โ”‚โ”€โ”€โ†’ (nร—n) scaled scores โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ (mask?) โ”‚โ”€โ”€โ†’ (nร—n) masked scores (optional) โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ softmax โ”‚โ”€โ”€โ†’ (nร—n) attention weights (rows sum to 1) โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚weights ร— V โ”‚โ”€โ”€โ†’ (nร—d_v) output โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Multi-Head Attention Visualization

MULTI-HEAD ATTENTION (h=4 heads, d_model=256, d_k=64) โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• Input X (n ร— 256) โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ–ผ โ–ผ โ–ผ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚Head 1 โ”‚ โ”‚Head 2 โ”‚ โ”‚Head 3 โ”‚ โ”‚Head 4 โ”‚ โ”‚Qโ‚Kโ‚Vโ‚โ”‚ โ”‚Qโ‚‚Kโ‚‚Vโ‚‚โ”‚ โ”‚Qโ‚ƒKโ‚ƒVโ‚ƒโ”‚ โ”‚Qโ‚„Kโ‚„Vโ‚„โ”‚ โ”‚(nร—64) โ”‚ โ”‚(nร—64) โ”‚ โ”‚(nร—64) โ”‚ โ”‚(nร—64) โ”‚ โ”‚syntax โ”‚ โ”‚coref โ”‚ โ”‚posn. โ”‚ โ”‚semanticโ”‚ โ””โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ”‚ Concatenate โ”‚ โ–ผ (n ร— 256) โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ ร— W_O (256 ร— 256) โ”‚ โ”‚ Final linear projection โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ Output (n ร— 256)

Complete Encoder Layer

ONE TRANSFORMER ENCODER LAYER โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• Input x โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ–ผ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ Multi-Head โ”‚ โ”‚ Residual โ”‚ Self-Attentionโ”‚ โ”‚ Connection โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ–ผ โ”‚ โ”Œโ”€โ”€โ”ดโ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”˜ โ”‚ ADD โ”‚โ—„โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ” โ”‚LayerNormโ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ–ผ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Residual โ”‚FFN โ”‚ โ”‚ Connection โ”‚Wโ‚,ReLU โ”‚ โ”‚ โ”‚Wโ‚‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ–ผ โ”‚ โ”Œโ”€โ”€โ”ดโ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”˜ โ”‚ ADD โ”‚โ—„โ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ” โ”‚LayerNormโ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ”‚ Output
Section 16

Indian Industry Case Study: AI4Bharat IndicBERT

๐Ÿ‡ฎ๐Ÿ‡ณ IndicBERT โ€” Building BERT for India's Languages

The Challenge

India has 22 official languages, written in 13 different scripts. Building a single language model that understands all of them faces unique challenges:

  • Script diversity: Hindi (Devanagari), Tamil (Tamil script), Telugu (Telugu script), Urdu (Nastaliq), English (Latin) โ€” all use completely different character sets
  • Morphological complexity: Agglutinative languages like Tamil and Kannada create very long compound words, exploding vocabulary size
  • Code-mixing: Indian social media frequently mixes Hindi/English ("Hinglish"), Tamil/English, etc., often switching within a single sentence
  • Data scarcity: While Hindi has moderate web data, languages like Bodo, Dogri, Maithili have extremely limited digital text

The Solution: IndicBERT Architecture

ComponentChoiceRationale
Base modelALBERT (A Lite BERT)Parameter sharing across layers reduces model size โ†’ deployable on Indian mobile networks
Languages11 Indian languages + EnglishBengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu, Urdu
TokenizerSentencePiece (128K vocab)Script-agnostic subword tokenization handles all 13 scripts
Pre-training dataIndicCorp (9B tokens)Largest collection of Indian language text, curated from web crawls
Pre-training objectiveMLM (Masked Language Model)Standard BERT pre-training, adapted for multilingual setting

Results

  • NER (Named Entity Recognition): IndicBERT achieved state-of-the-art on 9/11 languages, beating multilingual BERT (mBERT) by 3-5 F1 points
  • Sentiment Analysis: 2-4% accuracy improvement over mBERT on Hindi and Bengali movie reviews
  • Cross-lingual transfer: Training on Hindi data and testing on Marathi worked well (both Devanagari script, similar grammar) โ€” showing the model captures shared linguistic structure

Deployment Applications

  • Flipkart: Hindi product review classification
  • Government: Automated processing of RTI (Right to Information) requests across languages
  • ShareChat: Content moderation in 15 Indian languages
  • Koo (Indian Twitter alternative): Multilingual sentiment analysis

The IndicNLP Suite by AI4Bharat (IIT Madras) includes IndicBERT, IndicTrans (translation), IndicGLUE (benchmark), and IndicCorp (dataset) โ€” a complete ecosystem for Indian NLP. If you want to contribute, check out ai4bharat.org. This is one of the most impactful open-source AI projects from India.

Section 17

US/Global Industry Case Study: The GPT Evolution

๐Ÿ‡บ๐Ÿ‡ธ OpenAI: From GPT-1 to GPT-4 โ€” The Scaling Revolution

The Timeline

2018
GPT-1 โ€” 117M parameters, 12 layers, trained on BookCorpus (7,000 books). Proved that unsupervised pre-training + supervised fine-tuning works for NLP. Performance: decent but unremarkable. Key insight: CLM pre-training transfers to downstream tasks.
2019
GPT-2 โ€” 1.5B parameters, 48 layers, trained on WebText (40GB). Generated coherent paragraphs. OpenAI initially withheld the full model citing "too dangerous." Key insight: zero-shot task performance improves with scale.
2020
GPT-3 โ€” 175B parameters, 96 layers, 300B tokens of training data. Demonstrated few-shot learning: give it a few examples in the prompt, and it performs the task โ€” no fine-tuning needed. Cost to train: ~$4.6M. Key insight: emergent abilities appear at scale (arithmetic, translation, code generation).
2022
InstructGPT / ChatGPT โ€” GPT-3.5 fine-tuned with RLHF (Reinforcement Learning from Human Feedback). Three stages: (1) Supervised fine-tuning on human demonstrations, (2) Train a reward model on human comparisons, (3) Optimize with PPO against the reward model. Result: much more helpful, harmless, and honest responses.
2023
GPT-4 โ€” Multimodal (text + images), estimated >1T parameters (MoE architecture). Passes the bar exam (90th percentile), medical licensing exams, coding interviews. Trained with a much larger RLHF pipeline. Key insight: multimodality + RLHF + scale โ†’ near-human reasoning.

Scaling Laws (Kaplan et al., 2020)

What ScalesRelationshipImplication
Parameters (N)L โˆ N-0.07610ร— more params โ†’ ~0.5 nats lower loss
Dataset (D)L โˆ D-0.09510ร— more data โ†’ ~0.6 nats lower loss
Compute (C)L โˆ C-0.05010ร— more compute โ†’ ~0.4 nats lower loss

These power laws predict how performance improves with scale โ€” they drove OpenAI's strategy of building ever-larger models. The relationships are smooth and predictable, meaning you can forecast how well a larger model will perform before training it.

RLHF: From Language Model to Assistant

Raw GPT-3 is powerful but unfocused โ€” it might generate toxic content, hallucinate facts, or ramble. RLHF aligns it with human preferences:

  1. Step 1 (SFT): Fine-tune on human-written ideal responses to prompts
  2. Step 2 (Reward Model): Humans rank model outputs A > B > C; train a reward model to predict these rankings
  3. Step 3 (PPO): Use the reward model as a signal to fine-tune GPT via reinforcement learning (Proximal Policy Optimization)
Section 18

Common Misconceptions

โŒ MYTH: "Transformers are just improved RNNs."

โœ… TRUTH: Transformers are a fundamentally different architecture. They use NO recurrence. Instead, they process all positions in parallel through self-attention. The two architectures share the goal (sequence modeling) but differ in mechanism (attention vs. recurrence), training efficiency (parallel vs. sequential), and how they handle long-range dependencies (direct vs. through hidden states).

๐Ÿ” WHY IT MATTERS: This distinction is crucial for architecture selection and understanding computational complexity.

โŒ MYTH: "The 'attention' in Transformers is the same as the attention used in seq2seq models (Bahdanau attention)."

โœ… TRUTH: Bahdanau attention (2014) added attention on top of an RNN โ€” the encoder was still an RNN. The Transformer's key innovation was self-attention (tokens attending to each other within the same sequence) and building the entire model from attention, with no recurrence at all.

๐Ÿ” WHY IT MATTERS: Understanding this history clarifies why the paper was called "Attention Is All You Need" โ€” emphasis on "All."

โŒ MYTH: "More attention heads always means better performance."

โœ… TRUTH: Research (Michel et al., 2019) showed that many attention heads can be pruned after training without significant performance loss. Some heads learn redundant patterns. The optimal number of heads is task-dependent.

๐Ÿ” WHY IT MATTERS: For deployment, you can prune heads to reduce inference cost โ€” critical for mobile deployment in India's constrained environments.

โŒ MYTH: "Transformers understand language like humans do."

โœ… TRUTH: Transformers learn statistical patterns in text โ€” co-occurrence, syntax, some reasoning. They don't have grounded understanding. They can fail spectacularly on tasks requiring real-world knowledge, causal reasoning, or negation.

๐Ÿ” WHY IT MATTERS: Over-relying on LLMs without human oversight causes real-world failures. Critical applications (healthcare, legal) require human-in-the-loop systems.

โŒ MYTH: "BERT and GPT use completely different architectures."

โœ… TRUTH: Both use the same core building blocks (multi-head attention, FFN, LayerNorm, residual connections). The key differences are: (1) BERT uses bidirectional attention; GPT uses causal attention, and (2) BERT pre-trains with MLM; GPT pre-trains with CLM. The Transformer block itself is nearly identical.

๐Ÿ” WHY IT MATTERS: Once you understand the Transformer block, you understand both BERT and GPT โ€” they're different configurations of the same architecture.

Section 19

GATE / Exam Corner

Key Formulas for Exams

Formula Sheet โ€” Transformers

  • Scaled Dot-Product Attention: Attention(Q,K,V) = softmax(QKT/โˆšd_k)V
  • Multi-Head: MultiHead = Concat(headโ‚,...,head_h)WO, head_i = Attention(QW_iQ, KW_iK, VW_iV)
  • Positional Encoding: PE(pos,2i) = sin(pos/100002i/d), PE(pos,2i+1) = cos(pos/100002i/d)
  • FFN: FFN(x) = max(0, xWโ‚+bโ‚)Wโ‚‚+bโ‚‚ where Wโ‚: dโ†’4d, Wโ‚‚: 4dโ†’d
  • LayerNorm: LN(x) = ฮณยท(x-ฮผ)/โˆš(ฯƒยฒ+ฮต) + ฮฒ
  • Self-Attention Complexity: Time O(nยฒd), Memory O(nยฒ+nd)
  • Params in one MHA layer: 4ยทd_modelยฒ (Q,K,V projections + output projection)
  • Params in FFN: 2ยทd_modelยทd_ff + d_model + d_ff โ‰ˆ 8ยทd_modelยฒ

GATE-Style Problems

GATE Q1

In a Transformer with d_model = 512 and h = 8 attention heads, what is the dimension d_k of keys in each head?

  1. 8
  2. 64
  3. 128
  4. 512
Answer: (B) 64
d_k = d_model / h = 512 / 8 = 64. Each head operates in a 64-dimensional subspace.
RememberGATE CS 2024
GATE Q2

What is the time complexity of computing self-attention for a sequence of length n with model dimension d?

  1. O(nยทd)
  2. O(nยทdยฒ)
  3. O(nยฒยทd)
  4. O(nยฒยทdยฒ)
Answer: (C) O(nยฒยทd)
The bottleneck is QKT: (nร—d)ยท(dร—n) = O(nยฒยทd). Memory is O(nยฒ) for the attention matrix.
UnderstandGATE CS
GATE Q3

In the original Transformer (d_model=512, h=8, d_ff=2048, N=6), approximately how many parameters are in one encoder layer?

  1. ~500K
  2. ~1.5M
  3. ~3.1M
  4. ~6.3M
Answer: (C) ~3.1M
MHA: 4ร—512ยฒ = 1,048,576 โ‰ˆ 1.05M
FFN: 512ร—2048 + 2048 + 2048ร—512 + 512 = 2,099,712 โ‰ˆ 2.1M
LayerNorm (ร—2): 2ร—(512+512) = 2,048 โ‰ˆ 0.002M
Total: ~3.15M per layer, ~18.9M for 6 layers.
ApplyNumerical
GATE Q4

Why does the Transformer divide attention scores by โˆšd_k before applying softmax?

  1. To normalize the output values to [0, 1]
  2. To reduce the number of parameters
  3. To prevent softmax saturation when d_k is large, ensuring informative gradients
  4. To make the model invariant to the choice of d_k
Answer: (C)
When d_k is large, dot products grow in magnitude (variance = d_k). Large inputs to softmax push it into saturation regions where gradients are extremely small, hindering learning. Scaling by โˆšd_k brings variance back to 1.
UnderstandGATE DS/AI
GATE Q5

Which of the following is TRUE about BERT's pre-training?

  1. BERT uses causal (left-to-right) language modeling
  2. BERT masks 50% of input tokens during pre-training
  3. BERT uses bidirectional self-attention and Masked Language Modeling (15% masking)
  4. BERT's encoder layers use masked self-attention to prevent information leakage
Answer: (C)
BERT uses bidirectional self-attention (no causal mask) and MLM that masks 15% of tokens (80% [MASK], 10% random, 10% unchanged).
RememberGATE DS/AI 2023

Prediction Table โ€” Likely GATE Topics (2025-2027)

TopicProbabilityQuestion Type
Attention formula & โˆšd_k scalingโญโญโญโญโญMCQ / Numerical
Self-attention complexity O(nยฒd)โญโญโญโญโญMCQ
BERT vs GPT differencesโญโญโญโญMCQ
Transformer parameter countingโญโญโญโญNumerical
Positional encoding purposeโญโญโญMCQ
Multi-head attention mechanicsโญโญโญMCQ
LayerNorm vs BatchNormโญโญโญMCQ
Section 20

Interview Prep

Conceptual Questions

๐ŸŽค "Explain self-attention in simple terms."

Strong answer:

"Self-attention lets every word in a sentence directly look at every other word and decide how relevant each one is. Think of it as a room full of people โ€” instead of passing notes one by one (like RNNs), everyone can simultaneously turn to any other person and listen. Each word creates three vectors: a query (what am I looking for?), a key (what do I represent?), and a value (what info do I carry?). The attention score between two words is the dot product of the query and key, scaled by โˆšd_k to prevent numerical issues, then softmaxed into weights. The output for each word is a weighted sum of all values. This creates context-aware representations where the meaning of 'bank' depends on whether the sentence is about finance or rivers."

๐ŸŽค "Why do Transformers beat RNNs?"

Strong answer:

"Three fundamental reasons. First, parallelism: RNNs process tokens sequentially โ€” you can't compute the 100th hidden state without the 99th. Self-attention processes all positions simultaneously, making full use of GPU parallelism. Training speedups of 10-100ร— are common. Second, long-range dependencies: In an RNN, information from word 1 must survive passage through all intermediate states to reach word 100 โ€” the vanishing gradient problem. In a Transformer, word 1 directly attends to word 100 in a single step โ€” path length is O(1) vs O(n). Third, rich representations: Multi-head attention lets each word form multiple types of relationships simultaneously โ€” syntactic, semantic, positional โ€” while an RNN collapses everything into a single fixed-size hidden state."

๐ŸŽค "Walk me through the Transformer encoder step by step." (System Design)

Strong answer:

"Starting with raw tokens: (1) Each token is mapped to a d_model-dimensional vector via the embedding table. (2) Sinusoidal or learned positional encodings are added. (3) This goes into a stack of 6 identical layers. Each layer has two sub-layers. Sub-layer 1: Multi-head self-attention with h=8 parallel heads, each computing scaled dot-product attention in d_k=64 dimensions, then concatenating and projecting. A residual connection wraps it: output = LayerNorm(x + MultiHeadAttn(x)). Sub-layer 2: A position-wise feed-forward network โ€” two linear layers with ReLU between (expanding from 512 to 2048 then back to 512). Again wrapped with residual + LayerNorm. After 6 such layers, each token's representation is a rich, contextual embedding of that token's meaning, informed by all other tokens."

Coding Questions

๐Ÿ’ป "Implement scaled dot-product attention." (Top 3 ML coding question)

def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

Follow-ups to expect: "Add dropout to attention weights," "Make it batched," "Add causal masking," "What's the complexity?"

India-Specific Interview Angle

๐Ÿ‡ฎ๐Ÿ‡ณ Indian ML Interviews (TCS, Infosys AI, Flipkart)

  • Focus on BERT fine-tuning for Hindi/regional language NLP
  • "How would you handle code-mixed text?" โ†’ IndicBERT + subword tokenization
  • "How to deploy a Transformer on low-resource devices?" โ†’ DistilBERT, quantization, ONNX
  • GATE-style numerical problems on attention
  • Expect questions on IndicNLP, mBERT, XLM-R for Indian languages

๐Ÿ‡บ๐Ÿ‡ธ US ML Interviews (FAANG, OpenAI, Anthropic)

  • Deep understanding of architecture decisions
  • "Implement multi-head attention from scratch" (whiteboard)
  • "Design a Transformer for long documents" โ†’ efficient attention
  • "Compare Pre-Norm vs Post-Norm" and why GPT-2 switched
  • System design: "How would you serve a 70B LLM?"
  • Research paper discussions: FlashAttention, RoPE, MoE
Section 21

Hands-On Lab / Mini-Project

๐Ÿ”ฌ Mini-Project: Build and Train a Character-Level Mini-GPT

Objective

Build a mini-Transformer (GPT-style) that learns to generate text character by character. Train it on a text corpus (Shakespeare, Indian literature, or code) and generate new text.

Requirements

  1. Implement the following from scratch using PyTorch:
    • Single-head causal self-attention
    • Multi-head attention (4 heads)
    • Transformer block (attention + FFN + LayerNorm + residual)
    • Complete GPT model (token/position embedding โ†’ N blocks โ†’ LM head)
  2. Train on at least 1MB of text data for โ‰ฅ5000 steps
  3. Generate 500+ characters of coherent-looking text
  4. Visualize attention patterns for at least 2 heads
  5. Experiment: Compare 2-layer vs 6-layer models on the same data

Data Sources

  • Option A (Global): Shakespeare's complete works (~1.1MB) from tinyshakespeare
  • Option B (India): Hindi Wikidump text or Premchand stories in Devanagari
  • Option C (Code): Python source files from GitHub

Rubric (100 points)

ComponentPointsCriteria
Correct attention implementation20Attention formula correct, causal mask works, shapes verified
Multi-head attention15Multiple heads computed in parallel, concatenated correctly
Transformer block15Residual connections, LayerNorm, FFN all present and correct
Training pipeline15Proper data loading, batching, loss computation, optimization
Generation quality15Generated text shows learned patterns (words, formatting, some structure)
Attention visualization10Heatmap of attention weights for at least 2 heads, with interpretation
2-layer vs 6-layer comparison10Learning curves compared, sample quality compared, brief analysis

Bonus Challenges (+20 points each)

  • Bonus 1: Add temperature and top-k sampling to generation
  • Bonus 2: Train on Hindi text and handle Devanagari characters
  • Bonus 3: Implement Grouped-Query Attention (GQA) as used in LLaMA 2
Section 22

Exercises

Section A: Conceptual Questions (5)

A1 Beginner

Why is self-attention called "self" attention? How does it differ from cross-attention?

Answer: In self-attention, Q, K, and V all come from the same sequence โ€” a token attends to other tokens in the same input. In cross-attention (encoder-decoder attention), Q comes from one sequence (decoder) while K and V come from a different sequence (encoder). Self-attention enables modeling relationships within a sentence; cross-attention enables alignment between two sequences (e.g., source and target in translation).
A2 Beginner

Explain why Transformers need positional encoding while RNNs do not.

Answer: RNNs process tokens sequentially โ€” the position is inherently encoded by the order of processing (hidden state at step t reflects all tokens up to t). Self-attention computes pairwise similarities between all tokens simultaneously and is permutation-invariant โ€” the output wouldn't change if tokens were shuffled. Therefore, explicit positional information must be injected to preserve word order.
A3 Intermediate

Explain the role of the FFN (feed-forward network) in a Transformer block. If attention handles inter-token communication, what does the FFN do?

Answer: Attention handles "communication" between tokens โ€” mixing information across positions. The FFN handles "computation" โ€” processing each token's features independently. Think of attention as "gather information from other tokens" and FFN as "think about what you've gathered." Research suggests FFN layers act as key-value memories: the first layer detects patterns, the second layer maps them to output features. The expand-compress structure (512โ†’2048โ†’512) creates a higher-dimensional space where complex feature interactions can occur.
A4 Intermediate

Why does BERT use bidirectional attention while GPT uses causal attention? Isn't bidirectional always better?

Answer: BERT's goal is understanding (classification, NER, QA) โ€” seeing both left and right context produces richer representations for these tasks. GPT's goal is generation โ€” predicting the next token. During generation, future tokens don't exist yet, so looking at them would be "cheating" (data leakage). Bidirectional isn't always better: you can't generate text token-by-token with a bidirectional model (each token's representation depends on future tokens that haven't been generated). For generation tasks, causal attention is necessary.
A5 Intermediate

What is the purpose of the output projection WO in multi-head attention? Why not just concatenate the heads?

Answer: Each head operates in a d_k-dimensional subspace. Concatenation combines these into d_model dimensions, but the representations from different heads haven't been "mixed" โ€” each dimension still only carries information from one head. WO is a learned linear transformation that allows the model to combine information across heads, creating output features that integrate multiple attention patterns. Without WO, the model couldn't learn to weight or combine the contributions of different heads.

Section B: Mathematical Questions (8)

B1 Intermediate

Given Q = [[1, 0], [0, 1]], K = [[1, 1], [0, 1]], V = [[1, 2], [3, 4]], compute Attention(Q, K, V) with scaling (d_k = 2). Show all intermediate steps.

Hint: Follow the 5 steps: QKT โ†’ scale by โˆš2 โ†’ softmax each row โ†’ multiply by V. QKT = [[1,0],[1,1]], scaled = [[0.707,0],[0.707,0.707]], softmax row 1 = [exp(0.707)/(exp(0.707)+1), 1/(exp(0.707)+1)] โ‰ˆ [0.670, 0.330], etc.
B2 Intermediate

Prove that if q_i ~ N(0, 1) and k_i ~ N(0, 1) independently, then Var(qยทk) = d_k when q, k โˆˆ โ„d_k.

Hint: The dot product qยทk = ฮฃแตข qแตขkแตข. Since qแตข and kแตข are independent, Var(qแตขkแตข) = E[qแตขยฒkแตขยฒ] - (E[qแตขkแตข])ยฒ = E[qแตขยฒ]E[kแตขยฒ] - 0 = 1ยท1 = 1. Since the d_k terms are independent, Var(qยทk) = d_k ยท 1 = d_k.
B3 Intermediate

Calculate the total number of trainable parameters in a single Transformer encoder layer with d_model = 256, h = 4, d_ff = 1024. Include LayerNorm parameters.

Answer: MHA: W_Q(256ร—256) + W_K(256ร—256) + W_V(256ร—256) + W_O(256ร—256) = 4ร—65,536 = 262,144. FFN: Wโ‚(256ร—1024) + bโ‚(1024) + Wโ‚‚(1024ร—256) + bโ‚‚(256) = 262,144 + 1,024 + 262,144 + 256 = 525,568. LayerNormร—2: 2ร—(256+256) = 1,024. Total = 262,144 + 525,568 + 1,024 = 788,736.
B4 Advanced

Show that the sinusoidal positional encoding satisfies the linear relationship property: PE(pos+k) can be expressed as a linear transformation of PE(pos) for any fixed offset k.

Hint: For a single frequency ฯ‰: [sin(ฯ‰(pos+k)), cos(ฯ‰(pos+k))] = [sin(ฯ‰pos)cos(ฯ‰k) + cos(ฯ‰pos)sin(ฯ‰k), cos(ฯ‰pos)cos(ฯ‰k) - sin(ฯ‰pos)sin(ฯ‰k)] = [[cos(ฯ‰k), sin(ฯ‰k)], [-sin(ฯ‰k), cos(ฯ‰k)]] ยท [sin(ฯ‰pos), cos(ฯ‰pos)]T. This is a rotation matrix, which is a linear transformation!
B5 Intermediate

For a Transformer processing a sequence of length n = 1024 with d_model = 512, compute the FLOPs for one self-attention operation (QKT and attentionร—V). How does this compare to processing the same sequence with an RNN (d_hidden = 512)?

Answer: Self-attention QKT: 2ยทnยฒยทd = 2ยท1024ยฒยท512 โ‰ˆ 537M FLOPs. Attentionร—V: same, โ‰ˆ 537M. Total โ‰ˆ 1.07B FLOPs. RNN per step: 2ยทdยฒ โ‰ˆ 524K FLOPs. For n=1024 steps: 1024ยท524K โ‰ˆ 537M FLOPs. At n=1024 with d=512, they're comparable! But self-attention is parallel (O(1) sequential steps) while RNN is sequential (O(n)).
B6 Intermediate

In a causal attention mask for sequence length 4, write out the 4ร—4 mask matrix (before and after softmax). What does each row sum to?

Answer: Mask (True = attend): [[1,0,0,0],[1,1,0,0],[1,1,1,0],[1,1,1,1]]. After masking scores and softmax, each row sums to 1 (softmax normalization). Row 1: all mass on position 1. Row 4: mass distributed across all 4 positions.
B7 Advanced

GPT-3 has 175B parameters, 96 layers, d_model = 12,288, d_ff = 4ยทd_model, h = 96 heads. Verify the parameter count for one layer (MHA + FFN + LayerNorm) and check if 96 layers ร— per-layer params โ‰ˆ 175B.

Hint: MHA: 4ยทdยฒ = 4ยท12288ยฒ = 604M. FFN: 8ยทdยฒ + 5ยทd โ‰ˆ 1,208M. LN: 4ยทd โ‰ˆ 49K. Per layer โ‰ˆ 1.81B. 96 layers โ‰ˆ 174B. Plus embeddings: vocab_size(50257)ยทd โ‰ˆ 618M. Total โ‰ˆ 174.6B โœ“
B8 Intermediate

If we use multi-query attention (MQA) where K and V are shared across all h heads, by what factor do we reduce the KV cache memory during inference? (Assume h = 32 heads.)

Answer: Standard MHA stores K,V per head: 2 ร— h ร— n ร— d_k memory. MQA stores only 1 copy of K,V: 2 ร— 1 ร— n ร— d_k. Reduction factor = h = 32ร—. This is why MQA and GQA are critical for efficient LLM inference.

Section C: Coding Questions (4)

C1 Intermediate

Implement a function causal_mask(n) that returns an nร—n boolean mask where True means "mask this position" (future positions). Use it in your attention function.

Hint: mask = np.triu(np.ones((n, n), dtype=bool), k=1) or mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
C2 Intermediate

Write a PyTorch function that computes the sinusoidal positional encoding matrix of shape (max_len, d_model). Verify that nearby positions have higher cosine similarity than distant positions.

Hint: Use torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)) for the wavelengths. Compute cosine similarity between PE[0] and PE[k] for k = 1, 5, 50, 100.
C3 Advanced

Modify the MiniGPT code to add temperature-controlled sampling and top-k sampling in the generate() method. Generate text at temperatures 0.5, 1.0, and 1.5, and compare the diversity and coherence.

Hint: Temperature: logits = logits / temperature before softmax. Top-k: v, ix = torch.topk(logits, k), set all other logits to -inf, then softmax over the remaining k options.
C4 Advanced

Implement Layer Normalization from scratch (no using nn.LayerNorm). Compare the output with PyTorch's built-in implementation to verify numerical correctness (difference should be < 1e-5).

Hint: Compute mean and variance along the last dimension. Apply: y = gamma * (x - mean) / sqrt(var + eps) + beta. Compare with nn.LayerNorm(d_model)(x) โ€” the difference should be negligible.

Section D: Critical Thinking (3)

D1 Advanced

Self-attention's O(nยฒ) complexity limits it to ~4K-8K tokens. But books are 100K+ tokens. Propose a Transformer architecture that can process a 100K-token novel. Discuss trade-offs.

Discussion points: Sliding window attention (local context), sparse attention patterns (Longformer), hierarchical approaches (process paragraphs then chapters), linear attention approximations (Performer), state-space models (Mamba), or retrieving relevant chunks via RAG. Each trades exactness for efficiency.
D2 Advanced

BERT pre-training masks 15% of tokens. Why not 50%? Why not 5%? What would happen in each case? Design an experiment to find the optimal masking rate.

Discussion points: 50%: too much context removed โ†’ the task becomes too hard, model can't learn meaningful patterns. 5%: too little โ†’ model rarely practices prediction, training is inefficient. 15% is a sweet spot balancing difficulty with training efficiency. An experiment would train models with masking rates [5%, 10%, 15%, 20%, 30%, 50%] and compare downstream task performance after same compute budget.
D3 Advanced

In India, many users interact with AI in code-mixed language (e.g., "Mujhe ek accha restaurant suggest karo near CP"). What challenges does this pose for Transformer-based models, and how would you address them?

Discussion points: Challenges: vocabulary explosion (Hindi + English tokens), script switching (Devanagari + Latin), grammar from both languages, limited labeled code-mixed data. Solutions: multilingual pre-training (mBERT, XLM-R), script normalization (transliterate to one script), data augmentation via synthetic code-mixing, subword tokenization that handles both scripts, and fine-tuning on code-mixed datasets like GLUECoS.

โ˜… Starred Research Questions (2)

โ˜… R1 Advanced

Read the FlashAttention paper (Dao et al., 2022). Explain how tiling and recomputation avoid materializing the full nร—n attention matrix in HBM. Why is this an IO-bound optimization rather than a compute-bound one?

Key insight: Standard attention: compute S=QKT (write nยฒ to HBM), compute P=softmax(S) (read/write nยฒ), compute O=PV (read nยฒ, write nd). Total HBM access: O(nยฒ). FlashAttention tiles Q,K,V into blocks that fit in SRAM (~20MB), computes partial attention per block, and never writes the full nร—n matrix to slow HBM. The FLOP count is the same, but HBM reads/writes reduce from O(nยฒ) to O(nยฒd/M) where M is SRAM size โ€” a huge speedup because modern GPUs are memory-bandwidth limited, not compute-limited.
โ˜… R2 Advanced

State-Space Models (SSMs) like Mamba (Gu & Dao, 2023) claim to match Transformer quality with O(n) complexity. Research their mechanism and compare: what does Mamba gain, and what does it sacrifice compared to standard Transformers?

Key points: Mamba uses selective state spaces โ€” input-dependent parameters that allow the model to selectively remember or forget information. Gains: O(n) compute and memory (vs O(nยฒ)), faster inference (no KV cache growth), better for very long sequences. Sacrifices: less mature ecosystem, can't do "retrieval" as easily as attention (no explicit nร—n pairwise comparison), some tasks (e.g., in-context learning with many examples) still favor Transformers. The field is actively debating whether SSMs can fully replace Transformers.
Section 23

Connections

๐Ÿ”— How This Chapter Connects

โ† Builds On:
  • Chapter 14 (LSTMs & GRUs): You understood the limitations of recurrence โ€” vanishing gradients and sequential processing. Transformers solve both.
  • Chapter 9 (Regularization): Dropout is used in attention weights and FFN layers. Layer Normalization extends the normalization concepts from Batch Normalization (Ch 10).
  • Chapter 6 (Backpropagation): Residual connections provide gradient highways, a concept you first met in deep networks.
  • Chapter 2 (Linear Algebra): The entire attention mechanism is matrix multiplication โ€” QยทKT, softmax, weightsยทV.
โ†’ Enables:
  • Chapter 16 (GANs & VAEs): Transformer-based generators (like ViT-GANs) are replacing CNN-based GANs.
  • Chapter 17 (Applied CV): Vision Transformers (ViT) apply self-attention to image patches, rivaling CNNs.
  • Chapter 18 (Applied NLP): BERT, GPT, and T5 are the backbone of modern NLP โ€” fine-tuning these models is the dominant paradigm.
  • Chapter 22 (Future & Ethics): Large Language Models raise critical ethical questions about bias, hallucination, and societal impact.
๐Ÿ”ฌ Research Frontier:
  • Mixture of Experts (MoE): Scaling Transformers to trillions of parameters by only activating a subset of parameters per token (GPT-4, Mixtral)
  • State-Space Models: Mamba and similar architectures challenge Transformers with O(n) complexity
  • Multimodal Transformers: Models that process text, images, audio, and video simultaneously (GPT-4V, Gemini)
  • Retrieval-Augmented Generation (RAG): Combining Transformers with external knowledge retrieval to reduce hallucination
๐Ÿญ Industry Implementation:
  • Google: T5/PaLM for Search, Translate, Gmail Smart Compose
  • OpenAI: GPT-4 for ChatGPT, API services
  • Meta: LLaMA (open-source LLM), NLLB (translation for 200 languages)
  • AI4Bharat: IndicBERT, IndicTrans, Bhashini (India's national language translation platform)
  • Hugging Face: Democratizing Transformer deployment with the transformers library
Section 24

Chapter Summary

Key Takeaways โ€” Transformers and Attention

  1. Attention is a soft dictionary lookup: Each token generates a query, compares it against all keys, and retrieves a weighted sum of values. This lets every position directly communicate with every other position โ€” O(1) path length vs O(n) for RNNs.
  2. The โˆšd_k scaling is essential: Dot products in high dimensions have variance d_k. Without scaling, softmax saturates, killing gradients. Dividing by โˆšd_k normalizes variance to 1.
  3. Multi-head attention provides multiple perspectives: h parallel attention heads (each with d_k = d_model/h) learn different relationship types (syntactic, semantic, positional, coreference) at no additional computational cost.
  4. Positional encoding solves permutation invariance: Self-attention ignores token order. Sinusoidal encodings (or learned embeddings) inject position information. The sine/cosine choice enables encoding relative positions via rotation matrices.
  5. The Transformer architecture is a composition of simple blocks: Each layer has Multi-Head Attention + FFN, wrapped with residual connections and Layer Normalization. Stack 6-96 of these layers, and you have a state-of-the-art model.
  6. BERT (encoder, bidirectional, MLM) and GPT (decoder, causal, CLM) are two sides of the same Transformer coin: BERT excels at understanding tasks; GPT excels at generation. Both use identical building blocks, just configured differently.
  7. Self-attention is O(nยฒd) โ€” powerful but expensive: This quadratic scaling limits sequence length. Efficient variants (FlashAttention, Linformer, state-space models) are an active research frontier, essential for processing long documents and deploying on resource-constrained devices.
The Key Equation:
Attention(Q, K, V) = softmax(QKT / โˆšd_k) ยท V

Key Intuition: The Transformer processes sequences not through step-by-step recurrence, but through layers of self-attention โ€” letting every token directly query every other token, enabling parallelism, long-range dependencies, and the scaling that powers modern AI.

Section 25

Further Reading

๐Ÿ‡ฎ๐Ÿ‡ณ Indian Resources

  • NPTEL Course: "Deep Learning for NLP" by Prof. Mitesh Khapra (IIT Madras) โ€” covers Transformers with Indian language examples, available free on Swayam
  • AI4Bharat: ai4bharat.iitm.ac.in โ€” IndicBERT, IndicTrans, IndicCorp resources and pre-trained models
  • GATE CS/DS Prep: "Deep Learning" by GATE Wallah / Unacademy โ€” covers attention complexity and Transformer architecture for GATE exams
  • Book: "Speech and Language Processing" by Jurafsky & Martin (draft chapters free online) โ€” Chapter 10 covers Transformers extensively

๐ŸŒ Global Resources

  • Original Paper: "Attention Is All You Need" (Vaswani et al., 2017) โ€” the foundational paper, dense but essential
  • The Illustrated Transformer: Jay Alammar's visual walkthrough at jalammar.github.io/illustrated-transformer โ€” the best visual explanation
  • 3Blue1Brown: "But what is a GPT?" โ€” visual intuition for attention, embedding, and generation
  • Andrej Karpathy: "Let's build GPT from scratch" (YouTube) โ€” builds nanoGPT step by step, the best hands-on tutorial
  • Harvard NLP: "The Annotated Transformer" โ€” line-by-line PyTorch implementation with explanations
  • Lilian Weng's Blog: "The Transformer Family" โ€” comprehensive survey of all Transformer variants (2020-2023)
  • Distill.pub: "A Mathematical Framework for Transformer Circuits" (Elhage et al., 2021) โ€” deep mechanistic interpretability of attention

๐Ÿ“„ Key Papers (Chronological)

  1. Bahdanau et al. (2014) โ€” "Neural Machine Translation by Jointly Learning to Align and Translate" (additive attention)
  2. Vaswani et al. (2017) โ€” "Attention Is All You Need" (the Transformer)
  3. Devlin et al. (2018) โ€” "BERT: Pre-training of Deep Bidirectional Transformers" (encoder-only)
  4. Radford et al. (2018, 2019) โ€” GPT-1, GPT-2 papers (decoder-only)
  5. Brown et al. (2020) โ€” "Language Models are Few-Shot Learners" (GPT-3, scaling)
  6. Dosovitskiy et al. (2020) โ€” "An Image is Worth 16ร—16 Words" (Vision Transformer)
  7. Su et al. (2021) โ€” "RoFormer: Rotary Position Embedding" (RoPE)
  8. Dao et al. (2022) โ€” "FlashAttention: Fast and Memory-Efficient Exact Attention"
  9. Touvron et al. (2023) โ€” "LLaMA: Open and Efficient Foundation Language Models"
  10. Gu & Dao (2023) โ€” "Mamba: Linear-Time Sequence Modeling with Selective State Spaces"