Chapter 20: Transformers & Attention — The AI Revolution

🎯 Learning Objectives

After completing this chapter, you will be able to:

1. Explain the attention mechanism (Query, Key, Value) and why it replaced recurrence
2. Derive and compute Scaled Dot-Product Attention: softmax(QK^T/√d_k)V
3. Implement Multi-Head Attention from scratch and explain why multiple heads help
4. Derive the variance argument for √d_k scaling and understand numerical stability
5. Compute and implement sinusoidal positional encodings from first principles
6. Draw and explain the full Transformer architecture: 6-layer encoder + 6-layer decoder
7. Differentiate BERT (encoder-only, MLM+NSP) and GPT (decoder-only, CLM)
8. Understand Vision Transformer (ViT) and how images become token sequences
9. Explain LLM scaling laws, RLHF, tokenization, and emergent abilities
10. Analyze efficient attention variants: Flash Attention, Sparse, and Linear
11. Apply Transformers to Indian language NLP using AI4Bharat IndicBERT and Krutrim
12. Build mini projects: Hindi sentiment analysis and a mini language model

📖 Introduction

In June 2017, a paper titled "Attention Is All You Need" by Vaswani et al. at Google Brain introduced the Transformer — a neural network architecture that replaced recurrence (LSTMs, GRUs) and convolutions entirely with self-attention mechanisms. It was arguably the most consequential machine learning paper of the decade.

Before Transformers, sequence models like RNNs and LSTMs processed tokens one by one, creating a computational bottleneck. The Transformer broke this paradigm by computing relationships between all tokens simultaneously. This enables massive parallelization on GPUs and gives every pair of tokens a direct connection, so long-range dependencies no longer have to survive many recurrent steps.

🌟 Why This Chapter Matters

Most modern AI systems you interact with (ChatGPT, code completion, machine translation, and many search, image-generation, and voice systems) are built on or incorporate the Transformer architecture. Understanding Transformers is no longer optional; it is arguably the single most important concept in modern AI.

This chapter takes you from the fundamental intuition of attention (think of it as a "database lookup") all the way to understanding GPT-4, BERT fine-tuning, Vision Transformers, and efficient attention. We derive every formula from first principles, implement core components in Python and TensorFlow, and apply them to Indian language processing with AI4Bharat and Krutrim.

📜 Historical Background

The journey to Transformers spans decades of research in sequence modeling, attention mechanisms, and neural architecture design.

1997 — Long Short-Term Memory (LSTM)

Hochreiter & Schmidhuber introduced LSTM to solve vanishing gradients in RNNs, becoming the dominant sequence model for 20 years.

2014 — Seq2Seq with Attention

Bahdanau, Cho, Bengio introduced additive attention for neural machine translation, allowing the decoder to "look back" at relevant encoder states.

2015 — Luong Attention

Luong et al. proposed multiplicative (dot-product) attention — simpler, faster, and became the foundation for the Transformer.

2017 — "Attention Is All You Need"

Vaswani et al. proposed the Transformer: self-attention replacing recurrence entirely. Achieved SOTA on WMT translation. The paper's title became a rallying cry.

2018 — GPT-1 & BERT

OpenAI's GPT (decoder-only, 117M params) and Google's BERT (encoder-only, 340M) demonstrated that pre-training + fine-tuning works spectacularly.

2019 — GPT-2 & T5

GPT-2 (1.5B params) showed emergent text generation; T5 unified NLP tasks as text-to-text. "Language models are unsupervised multitask learners."

2020 — GPT-3 & ViT

GPT-3 (175B params) showed in-context learning. ViT proved Transformers work for images. Scaling laws established (Kaplan et al.).

2022 — ChatGPT & InstructGPT

RLHF alignment produced ChatGPT, reaching 100M users in 2 months. AI4Bharat released IndicBERT for Indian languages.

2023 — GPT-4, Llama, Flash Attention

GPT-4 (multimodal), Meta's Llama (open-source), Flash Attention v2 (IO-aware), and Google's Gemini redefined the frontier.

2024 — Krutrim, Llama 3, Gemini Ultra

India's Krutrim LLM for 22 Indian languages. Llama 3 (405B). Mixture-of-Experts (MoE) becomes mainstream. State-space models challenge Transformers.

💡 Conceptual Explanation

4.1 The Core Intuition: Attention as a Soft Database Lookup

Imagine you have a database of key-value pairs. Given a query, you want to retrieve the most relevant value. In a traditional database, this is a hard lookup — you find the exact matching key. Attention is a soft lookup: you compute a similarity score between your query and every key, then return a weighted average of all values.

The Database Analogy

Query (Q): "What information do I need?" — the current position asking a question
Key (K): "What information do I contain?" — labels for all available positions
Value (V): "Here's my actual content" — the data each position carries

The attention score between a query and a key tells us "how relevant is this key to my query?" The output is a weighted combination of values, where weights come from query-key similarities.
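The difference between a hard and a soft lookup can be seen in a few lines of NumPy. This is only an illustrative sketch: the keys, values, and query below are made-up numbers, and the dot product is assumed as the similarity measure.

```python
import numpy as np

# Toy "database": three keys, each with one value vector (made-up numbers).
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.5, 0.5]])
values = np.array([[10.0, 0.0],
                   [0.0, 10.0],
                   [5.0, 5.0]])
query = np.array([1.0, 0.0])

# Similarity of the query to every key (dot product).
scores = keys @ query

# Hard lookup: return only the value of the best-matching key.
hard_result = values[np.argmax(scores)]

# Soft lookup: softmax turns scores into weights, then average all values.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
soft_result = weights @ values

print("scores: ", scores)
print("weights:", np.round(weights, 3))
print("hard:   ", hard_result)
print("soft:   ", np.round(soft_result, 3))
```

The soft result leans toward the best-matching value but still blends in the others, and because every step is differentiable the weights can be learned by gradient descent.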

4.2 Self-Attention: Every Token Talks to Every Token

In self-attention, the queries, keys, and values all come from the same sequence. Each word in a sentence creates its own Q, K, and V vectors by multiplying with learned weight matrices. Then each word uses its query to attend to all other words' keys, retrieving a weighted mix of their values.

Consider: "The cat sat on the mat because it was tired." When processing "it", self-attention assigns high attention weight to "cat" (not "mat"), because the model learns that "tired" relates to a living entity — an impressive feat that RNNs struggle with over distance.

4.3 Why Not Recurrence?

❌ Problems with RNNs/LSTMs

Sequential processing: token-by-token, no parallelism
Information bottleneck: everything compressed into hidden state
Long-range forgetting: even LSTM struggles past ~200 tokens
Training time: O(n) sequential steps, GPU underutilized

✅ Transformer Advantages

Parallel processing: all tokens processed simultaneously
Direct connections: any token can attend to any other
Scalable: massively parallel on modern GPUs/TPUs
Constant path length: O(1) between any two positions

4.4 Multi-Head Attention: Multiple Perspectives

A single attention function learns one type of relationship. Multi-head attention runs multiple attention functions in parallel, each with different learned projections. Think of it as having 8 different "reading comprehensions" — one head might attend to syntactic relationships, another to semantic similarity, another to positional proximity.

4.5 Positional Encoding: Injecting Order

Self-attention on its own has no notion of order: it is permutation-equivariant, meaning that shuffling the input tokens only shuffles the outputs in the same way. Without extra information, "dog bites man" and "man bites dog" therefore yield the same per-token representations. Since word order matters, we add positional encodings to the input embeddings. The original Transformer uses sinusoidal functions with different frequencies, creating a unique "fingerprint" for each position that the model can use to reason about relative positions.
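The order-blindness described above can be checked numerically. The sketch below assumes identity projections (Q = K = V = X) and random embeddings purely for illustration; the helper name `self_attention` is not from the chapter.

```python
import numpy as np

def self_attention(X):
    """Self-attention with identity projections (Q = K = V = X), no positions."""
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)
    return A @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))     # 3 token embeddings, e.g. "dog", "bites", "man"
perm = np.array([2, 1, 0])      # reversed order: "man", "bites", "dog"

out = self_attention(X)
out_perm = self_attention(X[perm])

# Each token receives the same output vector wherever it sits in the sequence.
print(np.allclose(out[perm], out_perm))
```

Adding a position-dependent vector to each row of X before attention breaks this symmetry, which is exactly the job of the positional encoding.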

4.6 The Full Architecture: Encoder-Decoder

The complete Transformer has an encoder (6 layers) that reads the input and produces contextual representations, and a decoder (6 layers) that generates output token-by-token. The decoder uses masked self-attention (can only see past tokens) plus cross-attention (attending to encoder outputs).
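The "can only see past tokens" rule is usually implemented with a causal mask added to the attention scores. A minimal sketch follows, assuming the additive convention (0 = allowed, -inf = blocked); the function name `causal_mask` is illustrative.

```python
import numpy as np

def causal_mask(n):
    """Additive mask: 0 on and below the diagonal, -inf above it."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

n = 4
scores = np.zeros((n, n))                    # placeholder scores, all equal
weights = softmax(scores + causal_mask(n))   # masked positions get weight 0

print(np.round(weights, 2))
```

Row t of `weights` is non-zero only in columns 0..t, so position t mixes information from itself and earlier positions only.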

4.7 Layer Normalization & Residual Connections

Each sub-layer (attention, feed-forward) in the Transformer uses a residual connection followed by Layer Normalization: output = LayerNorm(x + SubLayer(x)). The residual connection ensures gradients flow smoothly through deep networks (similar to ResNets), while LayerNorm stabilizes training by normalizing across the feature dimension.
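The "Add & Norm" step can be sketched directly from the formula output = LayerNorm(x + SubLayer(x)). The sub-layer here is a hypothetical stand-in (a small random linear map), and `eps = 1e-5` is an assumed default.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (one token) across its feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    """Residual connection followed by LayerNorm (post-norm arrangement)."""
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(size=(4, d_model))                 # 4 tokens
gamma, beta = np.ones(d_model), np.zeros(d_model)
W = rng.normal(size=(d_model, d_model)) * 0.1     # stand-in for attention / FFN

y = add_and_norm(x, lambda h: h @ W, gamma, beta)
print("per-token mean:", np.round(y.mean(axis=-1), 6))
print("per-token std: ", np.round(y.std(axis=-1), 3))
```

With γ = 1 and β = 0, every token's output has mean close to 0 and standard deviation close to 1, whatever the scale of the sub-layer's output.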

📐 Mathematical Foundation

5.1 Scaled Dot-Product Attention

The core equation of modern AI:

Definition 20.1: Scaled Dot-Product Attention

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$$

where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$; $n$ = number of query positions, $m$ = number of key-value positions; $d_k$ = dimension of keys/queries, $d_v$ = dimension of values.

Step-by-step breakdown:

QK^T ∈ ℝ^n×m: Compute dot products between all query-key pairs → raw attention scores
/ √d_k: Scale down to prevent softmax saturation (explained in derivations)
softmax(·): Convert scores to probabilities (each row sums to 1)
· V: Weighted combination of values using attention weights

5.2 Multi-Head Attention

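Definition 20.2: Multi-Head Attention

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})$$

where $W_i^{Q} \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{\text{model}} \times d_v}$, $W^{O} \in \mathbb{R}^{h d_v \times d_{\text{model}}}$. Typically $h = 8$ and $d_k = d_v = d_{\text{model}}/h = 64$ (for $d_{\text{model}} = 512$).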

5.3 Positional Encoding (Sinusoidal)

Definition 20.3: Sinusoidal Positional Encoding

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ = position in the sequence (0, 1, 2, ...) and $i$ = dimension index (0, 1, ..., $d_{\text{model}}/2 - 1$). Each dimension pair corresponds to a sinusoidal wave, with wavelengths ranging from $2\pi$ to $10000 \cdot 2\pi$.

5.4 Layer Normalization

Definition 20.4: Layer Normalization

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta, \qquad \mu = \frac{1}{d} \sum_{i} x_i, \quad \sigma^{2} = \frac{1}{d} \sum_{i} (x_i - \mu)^{2}$$

Normalization is across the feature dimension (not the batch dimension like BatchNorm). $\gamma$, $\beta$ are learnable scale and shift parameters.

5.5 Feed-Forward Network (Per Position)

Definition 20.5: Position-wise Feed-Forward Network

$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2$$

where $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$. Typically $d_{ff} = 4 \times d_{\text{model}} = 2048$ (for $d_{\text{model}} = 512$).
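"Position-wise" means the same two-layer network is applied to every token independently. The sketch below uses small made-up sizes (d_model = 8, d_ff = 32) and random weights to show that processing all positions at once gives the same result as processing them one at a time.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: ReLU between two linear maps."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 32, 5          # toy sizes, keeping d_ff = 4 * d_model
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
x = rng.normal(size=(n, d_model))    # n token representations

full = ffn(x, W1, b1, W2, b2)
row_by_row = np.vstack([ffn(x[t:t + 1], W1, b1, W2, b2) for t in range(n)])

print(full.shape, np.allclose(full, row_by_row))
```

Tokens exchange information only in the attention sub-layer; the FFN then transforms each token's vector on its own.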

5.6 Softmax Function

Definition 20.6: Softmax

$$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j} \exp(z_j)}$$

5.7 Complexity Analysis

Operation	Time Complexity	Space Complexity	Notes
Self-Attention	O(n² · d)	O(n² + n·d)	Quadratic in sequence length
Feed-Forward	O(n · d²)	O(n · d)	Linear in sequence length
RNN / LSTM	O(n · d²)	O(d²)	Linear but sequential
1D Convolution	O(k · n · d²)	O(n · d)	k = kernel size

🔬 Formula Derivations

6.1 Why √d_k Scaling? The Variance Argument

This is one of the most commonly asked interview questions about Transformers. Let's derive it rigorously from first principles.

Derivation: Variance of Dot Products

Setup: Let q, k ∈ ℝ^d_k be query and key vectors, where each component q_i, k_i is drawn independently from a distribution with mean 0 and variance 1.

Goal: Find Var(q · k) = Var(Σ_{i=1}^{d_k} q_i k_i)

Step 1: For a single product term z_i = q_i k_i:

E[z_i] = E[q_i]·E[k_i] = 0·0 = 0 (by independence)
E[z_i²] = E[q_i²]·E[k_i²] = Var(q_i)·Var(k_i) = 1·1 = 1
Var(z_i) = E[z_i²] − E[z_i]² = 1 − 0 = 1

Step 2: The dot product is the sum: q·k = Σ_i z_i

Since z_i are independent: Var(q·k) = Σ_i Var(z_i) = d_k

Step 3: If we scale by √d_k:

Var(q·k / √d_k) = Var(q·k) / d_k = d_k / d_k = 1 ✓

Conclusion: Without scaling, the dot products have variance d_k, i.e. standard deviation √d_k. For d_k=64, typical scores are therefore about 8× larger than in the unit-variance case, pushing softmax into regions where gradients are extremely small (saturation). Dividing by √d_k normalizes the variance to 1, keeping softmax in a healthy gradient region.
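The variance result can be checked with a quick simulation. This is a sketch under the derivation's own assumptions (independent components with mean 0 and variance 1, here standard normal); the sample size and the values of d_k are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000
results = {}

for d_k in (4, 64, 256):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)               # one dot product per sample
    raw_var = dots.var()
    scaled_var = (dots / np.sqrt(d_k)).var()
    results[d_k] = (raw_var, scaled_var)
    print(f"d_k={d_k:4d}  Var(q·k)={raw_var:8.2f}  Var(q·k/√d_k)={scaled_var:.3f}")
```

The unscaled variance should track d_k, while the scaled variance should stay near 1 for every d_k.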

6.2 Derivation: Sinusoidal Positional Encoding

Why Sinusoids? The Relative Position Property

Key insight: We want PE(pos+k) to be expressible as a linear function of PE(pos), so the model can easily learn to attend to relative positions.

Proof: For any fixed offset k, the angle-addition identities give:

sin(ω(pos + k)) = sin(ωpos)cos(ωk) + cos(ωpos)sin(ωk)
cos(ω(pos + k)) = cos(ωpos)cos(ωk) − sin(ωpos)sin(ωk)

In matrix form:

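$$\begin{pmatrix} PE(pos+k,\, 2i) \\ PE(pos+k,\, 2i+1) \end{pmatrix} = \begin{pmatrix} \cos(\omega k) & \sin(\omega k) \\ -\sin(\omega k) & \cos(\omega k) \end{pmatrix} \begin{pmatrix} PE(pos,\, 2i) \\ PE(pos,\, 2i+1) \end{pmatrix}$$

where $\omega = 1 / 10000^{2i/d_{\text{model}}}$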

This is a rotation matrix! The positional encoding at position pos+k is a rotation of the encoding at position pos. Since the rotation matrix depends only on the offset k (not the absolute position), the model can learn relative position information through linear projections.

Multi-frequency design: Different dimensions i use different frequencies (ω), ranging from high-frequency (i=0, wavelength=2π) to low-frequency (i=d/2−1, wavelength≈10000·2π). This is analogous to a Fourier basis — low dimensions capture fine-grained position, high dimensions capture coarse-grained position.
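The rotation property is easy to verify numerically for one (sin, cos) pair. The values d_model = 512, i = 10, and k = 7 below are arbitrary illustration choices.

```python
import numpy as np

d_model, i = 512, 10
omega = 1.0 / 10000 ** (2 * i / d_model)

def pe_pair(pos):
    """The (sin, cos) pair of the positional encoding at dimensions 2i, 2i+1."""
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

def rotation(k):
    """Matrix that maps PE(pos) to PE(pos + k); depends only on the offset k."""
    return np.array([[ np.cos(omega * k), np.sin(omega * k)],
                     [-np.sin(omega * k), np.cos(omega * k)]])

k = 7
M_k = rotation(k)
for pos in (0, 5, 123):
    assert np.allclose(M_k @ pe_pair(pos), pe_pair(pos + k))

print("The same matrix M_k shifts every tested position by k =", k)
```

Because `M_k` is built once and reused for every `pos`, the check confirms that the shift depends on the offset alone.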

6.3 Derivation: Softmax Gradients

Why Softmax Saturation Matters

The Jacobian of softmax y = softmax(z) is:

∂y_i/∂z_j = y_i(δ_ij − y_j)

When the gaps between logits are very large (one z_i far above the rest), softmax produces near-one-hot outputs where one y_i ≈ 1 and the rest ≈ 0. In this regime:

∂y_i/∂z_i = y_i(1 − y_i) ≈ 1·0 = 0 (for the dominant class)
All gradient terms ≈ 0 → vanishing gradients

This is why scaling by √d_k is essential: it keeps logits moderate, maintaining healthy gradients during backpropagation.
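The saturation effect can be made concrete by evaluating the Jacobian at increasingly large logits. The logit vector and scale factors below are made-up values for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = y_i * (delta_ij - y_j), the formula given above."""
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

z = np.array([2.0, 1.0, 0.0, -1.0])
max_entries = {}
for scale in (1, 8, 64):
    J = softmax_jacobian(scale * z)
    max_entries[scale] = np.abs(J).max()
    print(f"scale={scale:3d}  largest |dy_i/dz_j| = {max_entries[scale]:.2e}")
```

As the logits grow, the largest Jacobian entry collapses toward zero, so almost no gradient reaches the scores.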

6.4 Derivation: Parameter Count

How Many Parameters in a Transformer Layer?

Multi-Head Attention:

Params_MHA = 4 × d_model² (for W^Q, W^K, W^V, W^O)
= 4 × 512² = 1,048,576 ≈ 1M

Feed-Forward Network:

Params_FFN = 2 × d_model × d_ff = 2 × 512 × 2048 = 2,097,152 ≈ 2M

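Layer Norms: 2 × 2 × d_model = 2,048 (one γ and one β vector for each of the two sub-layers)

Total per encoder layer: ≈ 3.15M. Each decoder layer has an additional cross-attention block (another 4 × d_model² ≈ 1.05M) and a third LayerNorm, so ≈ 4.20M per decoder layer.

For 6 encoder + 6 decoder layers: 6 × 3.15M + 6 × 4.20M ≈ 44.1M

Add embeddings: vocab_size × d_model = 37000 × 512 ≈ 19M (the original model shares this matrix between the two embedding layers and the pre-softmax projection)

Total Transformer (base): ≈ 63M by this count, in line with the ≈ 65M reported for the base model. The count omits bias terms and uses a rounded vocabulary size.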
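The bookkeeping can be scripted so it is easy to rerun for other model sizes. This sketch counts weight matrices and LayerNorm parameters only (no biases). It assumes a decoder layer holds two attention blocks (masked self-attention plus cross-attention) and three LayerNorms, so its total is higher than a count that treats all 12 layers as identical. The function names are illustrative.

```python
def mha_params(d_model):
    """W^Q, W^K, W^V, W^O: four d_model x d_model matrices."""
    return 4 * d_model * d_model

def ffn_params(d_model, d_ff):
    """W_1 and W_2, biases ignored."""
    return 2 * d_model * d_ff

def layer_norm_params(d_model):
    """One gamma and one beta vector."""
    return 2 * d_model

def encoder_layer_params(d_model, d_ff):
    return mha_params(d_model) + ffn_params(d_model, d_ff) + 2 * layer_norm_params(d_model)

def decoder_layer_params(d_model, d_ff):
    # Masked self-attention + cross-attention + FFN, each followed by a LayerNorm.
    return 2 * mha_params(d_model) + ffn_params(d_model, d_ff) + 3 * layer_norm_params(d_model)

d_model, d_ff, n_layers, vocab_size = 512, 2048, 6, 37000

enc = encoder_layer_params(d_model, d_ff)
dec = decoder_layer_params(d_model, d_ff)
embeddings = vocab_size * d_model          # assumed shared input/output embedding
total = n_layers * (enc + dec) + embeddings

print(f"encoder layer: {enc:,}")
print(f"decoder layer: {dec:,}")
print(f"embeddings:    {embeddings:,}")
print(f"total:         {total:,}")
```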

✏️ Worked Numerical Examples

📝 Example 20.1: Self-Attention Computation (4 Tokens, d_k=3)

Let's compute self-attention for the sentence fragment with 4 tokens, using d_k = d_v = 3 for simplicity.

Step 1: Input Representations

Suppose after embedding + positional encoding, our 4 tokens have representations X ∈ ℝ^4×3:

    Token 1 ("The"):   [1.0, 0.0, 1.0]
    Token 2 ("cat"):   [0.0, 1.0, 0.0]
    Token 3 ("sat"):   [1.0, 1.0, 0.0]
    Token 4 ("down"):  [0.0, 0.0, 1.0]

Step 2: Compute Q, K, V (using identity weights for simplicity)

With W^Q = W^K = W^V = I_3×3, we get Q = K = V = X:

    Q = K = V =  ┌ 1  0  1 ┐
                 │ 0  1  0 │
                 │ 1  1  0 │
                 └ 0  0  1 ┘

Step 3: Compute QK^T (raw attention scores)

    QK^T = ┌ 1·1+0·0+1·1  1·0+0·1+1·0  1·1+0·1+1·0  1·0+0·0+1·1 ┐
           │ 0·1+1·0+0·1  0·0+1·1+0·0  0·1+1·1+0·0  0·0+1·0+0·1 │
           │ 1·1+1·0+0·1  1·0+1·1+0·0  1·1+1·1+0·0  1·0+1·0+0·1 │
           └ 0·1+0·0+1·1  0·0+0·1+1·0  0·1+0·1+1·0  0·0+0·0+1·1 ┘

         = ┌ 2  0  1  1 ┐
           │ 0  1  1  0 │
           │ 1  1  2  0 │
           └ 1  0  0  1 ┘

Step 4: Scale by √d_k = √3 ≈ 1.732

    QK^T/√3 = ┌ 1.155  0.000  0.577  0.577 ┐
               │ 0.000  0.577  0.577  0.000 │
               │ 0.577  0.577  1.155  0.000 │
               └ 0.577  0.000  0.000  0.577 ┘

Step 5: Apply Softmax (row-wise)

For row 1: softmax([1.155, 0.000, 0.577, 0.577])

    exp values: [3.174, 1.000, 1.781, 1.781] → sum = 7.736
    softmax:    [0.410, 0.129, 0.230, 0.230]

    Full attention weights A:
    ┌ 0.410  0.129  0.230  0.230 ┐   ← "The" attends mostly to itself
    │ 0.180  0.320  0.320  0.180 │   ← "cat" attends equally to "cat" & "sat"
    │ 0.230  0.230  0.410  0.129 │   ← "sat" attends mostly to itself
    └ 0.320  0.180  0.180  0.320 ┘   ← "down" attends to "The" & itself

Step 6: Compute Output = A · V

    Output[0] = 0.410·[1,0,1] + 0.129·[0,1,0] + 0.230·[1,1,0] + 0.230·[0,0,1]
              = [0.410,0,0.410] + [0,0.129,0] + [0.230,0.230,0] + [0,0,0.230]
              ≈ [0.640, 0.360, 0.640]   (rounded after using unrounded weights)

    Full output matrix:
    ┌ 0.640  0.360  0.640 ┐   ← "The" enriched with context
    │ 0.500  0.640  0.360 │   ← "cat" enriched with context
    │ 0.640  0.640  0.360 │   ← "sat" enriched with context
    └ 0.500  0.360  0.640 ┘   ← "down" enriched with context

Key Observation: Each output vector is no longer just the token's own embedding; it is a weighted mixture of all tokens' value vectors. "The" (output [0.640, 0.360, 0.640]) has absorbed information from all four tokens, with the strongest influence from itself (weight 0.410).
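The whole example can be re-computed in a few lines of NumPy, which is a useful habit for checking hand calculations. The sketch uses the same X and identity projections as above.

```python
import numpy as np

X = np.array([[1.0, 0.0, 1.0],    # "The"
              [0.0, 1.0, 0.0],    # "cat"
              [1.0, 1.0, 0.0],    # "sat"
              [0.0, 0.0, 1.0]])   # "down"

Q = K = V = X                     # identity projection weights
d_k = Q.shape[-1]

scores = Q @ K.T / np.sqrt(d_k)                            # Steps 3-4
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = e / e.sum(axis=-1, keepdims=True)                      # Step 5
output = A @ V                                             # Step 6

print("Attention weights A:")
print(np.round(A, 3))
print("Output:")
print(np.round(output, 3))
```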

📝 Example 20.2: Multi-Head Attention (h=2 heads)

Using d_model=4, h=2, so d_k=d_v=d_model/h=2. Input X ∈ ℝ^2×4 (2 tokens).

Step 1: Input and Projection Matrices

    X = ┌ 1  0  1  0 ┐    (Token 1)
        └ 0  1  0  1 ┘    (Token 2)

    Head 1: W₁Q = ┌ 1  0 ┐   W₁K = ┌ 0  1 ┐   W₁V = ┌ 1  0 ┐
                   │ 0  1 │         │ 1  0 │         │ 0  1 │
                   │ 0  0 │         │ 0  0 │         │ 0  0 │
                   └ 0  0 ┘         └ 0  0 ┘         └ 0  0 ┘

    Head 2: W₂Q = ┌ 0  0 ┐   W₂K = ┌ 0  0 ┐   W₂V = ┌ 0  0 ┐
                   │ 0  0 │         │ 0  0 │         │ 0  0 │
                   │ 1  0 │         │ 0  1 │         │ 1  0 │
                   └ 0  1 ┘         └ 1  0 ┘         └ 0  1 ┘

Step 2: Compute Q, K, V for Each Head

    Head 1: Q₁ = X·W₁Q = ┌ 1  0 ┐   K₁ = X·W₁K = ┌ 0  1 ┐   V₁ = ┌ 1  0 ┐
                          └ 0  1 ┘                  └ 1  0 ┘         └ 0  1 ┘

    Head 2: Q₂ = X·W₂Q = ┌ 1  0 ┐   K₂ = X·W₂K = ┌ 0  1 ┐   V₂ = ┌ 1  0 ┐
                          └ 0  1 ┘                  └ 1  0 ┘         └ 0  1 ┘

Step 3: Attention per Head (d_k=2, √d_k=√2≈1.414)

    Head 1: Q₁K₁ᵀ = ┌ 0  1 ┐ / √2 = ┌ 0.000  0.707 ┐
                     └ 1  0 ┘         └ 0.707  0.000 ┘

    softmax: ┌ 0.330  0.670 ┐    Output₁ = A₁·V₁ = ┌ 0.330  0.670 ┐
             └ 0.670  0.330 ┘                       └ 0.670  0.330 ┘

    Head 2: Q₂, K₂, V₂ equal Q₁, K₁, V₁ in this example, so Output₂ = Output₁.

Step 4: Concatenate and Project

    Concat = ┌ 0.330  0.670  0.330  0.670 ┐   (head₁ | head₂)
             └ 0.670  0.330  0.670  0.330 ┘

    Output = Concat · Wᴼ    (Wᴼ ∈ ℝ⁴ˣ⁴, combines multi-head information)
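The two heads can be reproduced numerically as a check. The sketch below builds the projection matrices from Step 1 and stops at the concatenation, since the example does not specify values for Wᴼ. The helper name `attention` is illustrative.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

X = np.array([[1.0, 0.0, 1.0, 0.0],    # Token 1
              [0.0, 1.0, 0.0, 1.0]])   # Token 2

I2 = np.eye(2)
swap = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.zeros((2, 2))

# Head 1 reads the first two input dimensions, head 2 the last two.
W1_Q, W1_K, W1_V = np.vstack([I2, Z]), np.vstack([swap, Z]), np.vstack([I2, Z])
W2_Q, W2_K, W2_V = np.vstack([Z, I2]), np.vstack([Z, swap]), np.vstack([Z, I2])

head1 = attention(X @ W1_Q, X @ W1_K, X @ W1_V)
head2 = attention(X @ W2_Q, X @ W2_K, X @ W2_V)
concat = np.hstack([head1, head2])     # shape (2, 4) = (tokens, h * d_v)

print(np.round(concat, 3))
```

In a trained model the heads would have different projections and therefore different attention patterns; here they coincide only because the toy matrices were chosen symmetrically.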

📝 Example 20.3: Positional Encoding Computation

Compute PE for position 3, d_model=4 (i = 0, 1):

Computation

    i=0: ω₀ = 1/10000^(0/4) = 1/1 = 1
         PE(3, 0) = sin(3 × 1)   = sin(3)   =  0.141
         PE(3, 1) = cos(3 × 1)   = cos(3)   = -0.990

    i=1: ω₁ = 1/10000^(2/4) = 1/100 = 0.01
         PE(3, 2) = sin(3 × 0.01) = sin(0.03) =  0.030
         PE(3, 3) = cos(3 × 0.01) = cos(0.03) =  1.000

    PE(pos=3) = [0.141, -0.990, 0.030, 1.000]

    Compare with PE(pos=0) = [0.000, 1.000, 0.000, 1.000]
    Compare with PE(pos=1) = [0.841, 0.540, 0.010, 1.000]

    → Each position has a unique encoding!
    → Low dims (i=0) change rapidly, high dims (i=1) change slowly.

📊 Visual Diagrams

Diagram 20.1: Scaled Dot-Product Attention

                    ┌─────────────┐
                    │   Output    │
                    │  (n × dv)   │
                    └──────┬──────┘
                           │
                    ┌──────┴──────┐
                    │   MatMul    │ ← Weighted sum of V
                    └──┬───────┬──┘
                       │       │
              ┌────────┘       └────────┐
              │                         │
       ┌──────┴──────┐          ┌───────┴───────┐
       │   Softmax   │          │       V       │
       │  (n × m)    │          │   (m × dv)    │
       └──────┬──────┘          └───────────────┘
              │
       ┌──────┴──────┐
       │    Scale    │ ← Divide by √dk
       │  (÷ √dk)    │
       └──────┬──────┘
              │
       ┌──────┴──────┐
       │   MatMul    │ ← QKᵀ dot products
       └──┬───────┬──┘
          │       │
   ┌──────┴──┐ ┌──┴──────┐
   │    Q    │ │    K    │
   │(n × dk) │ │(m × dk) │
   └─────────┘ └─────────┘

Diagram 20.2: Multi-Head Attention

          ┌──────────────────────────────────┐
          │           Linear (W^O)           │
          │         (h·dv → dmodel)          │
          └────────────────┬─────────────────┘
                           │
          ┌────────────────┴─────────────────┐
          │            Concat                │
          │    [head₁ | head₂ | ... | headₕ] │
          └──┬─────┬──────┬────────────┬─────┘
             │     │      │            │
          ┌──┴──┐┌─┴──┐┌──┴──┐    ┌───┴───┐
          │head₁││head₂││head₃│ ···│headₕ  │
          │Attn ││Attn ││Attn │    │ Attn  │
          └──┬──┘└──┬──┘└──┬──┘    └───┬───┘
             │      │      │           │
          ┌──┴──┐┌──┴──┐┌──┴──┐    ┌───┴───┐
          │Lin  ││Lin  ││Lin  │ ···│Lin    │
          │Q,K,V││Q,K,V││Q,K,V│    │Q,K,V  │
          └──┬──┘└──┬──┘└──┬──┘    └───┬───┘
             │      │      │           │
             └──────┴──────┴─────┬─────┘
                                 │
                    ┌────────────┴────────────┐
                    │    Input: Q, K, V       │
                    │    (n × dmodel)         │
                    └─────────────────────────┘

Diagram 20.3: Full Transformer Architecture

  ┌─── ENCODER (×6) ──────────┐     ┌─── DECODER (×6) ──────────┐
  │                            │     │                            │
  │  ┌──────────────────────┐  │     │  ┌──────────────────────┐  │
  │  │ Add & Layer Norm     │  │     │  │ Add & Layer Norm     │  │
  │  └──────────┬───────────┘  │     │  └──────────┬───────────┘  │
  │             │              │     │             │              │
  │  ┌──────────┴───────────┐  │     │  ┌──────────┴───────────┐  │
  │  │  Feed Forward (FFN)  │  │     │  │  Feed Forward (FFN)  │  │
  │  │  512 → 2048 → 512    │  │     │  │  512 → 2048 → 512    │  │
  │  └──────────┬───────────┘  │     │  └──────────┬───────────┘  │
  │             │              │     │             │              │
  │  ┌──────────┴───────────┐  │     │  ┌──────────┴───────────┐  │
  │  │ Add & Layer Norm     │  │     │  │ Add & Layer Norm     │  │
  │  └──────────┬───────────┘  │     │  └──────────┬───────────┘  │
  │             │              │     │             │              │
  │  ┌──────────┴───────────┐  │     │  ┌──────────┴───────────┐  │
  │  │  Multi-Head          │  │────▶│  │  Cross-Attention     │  │
  │  │  Self-Attention      │  │  K,V│  │  (Encoder-Decoder)   │  │
  │  └──────────┬───────────┘  │     │  └──────────┬───────────┘  │
  │             │              │     │             │              │
  └─────────────┼──────────────┘     │  ┌──────────┴───────────┐  │
                │                    │  │ Add & Layer Norm     │  │
  ┌─────────────┴──────────────┐     │  └──────────┬───────────┘  │
  │    Input Embedding         │     │             │              │
  │  + Positional Encoding     │     │  ┌──────────┴───────────┐  │
  └────────────────────────────┘     │  │  Masked Multi-Head   │  │
                                     │  │  Self-Attention      │  │
      "The cat sat on the mat"       │  └──────────┬───────────┘  │
                                     │             │              │
                                     └─────────────┼──────────────┘
                                                   │
                                     ┌─────────────┴──────────────┐
                                     │    Output Embedding        │
                                     │  + Positional Encoding     │
                                     └────────────────────────────┘

                                       "Le chat est assis sur ..."

Diagram 20.4: BERT vs GPT vs T5 Architecture Comparison

  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────────┐
  │     BERT        │  │      GPT        │  │         T5              │
  │  (Encoder-only) │  │ (Decoder-only)  │  │   (Encoder-Decoder)     │
  ├─────────────────┤  ├─────────────────┤  ├─────────────────────────┤
  │                 │  │                 │  │                         │
  │  ┌───────────┐  │  │  ┌───────────┐  │  │  ┌─────┐    ┌───────┐  │
  │  │ Encoder   │  │  │  │ Decoder   │  │  │  │Enc  │───▶│ Dec   │  │
  │  │ Block ×12 │  │  │  │ Block ×12 │  │  │  │ ×12 │    │  ×12  │  │
  │  └───────────┘  │  │  └───────────┘  │  │  └─────┘    └───────┘  │
  │                 │  │                 │  │                         │
  │  Bidirectional  │  │  Left-to-Right  │  │  Enc: bidir, Dec: L→R  │
  │  ◄────────────▶ │  │  ─────────────▶ │  │  ◄──▶          ──▶    │
  │                 │  │                 │  │                         │
  │  Tasks:         │  │  Tasks:         │  │  Tasks:                 │
  │  • NER          │  │  • Generation   │  │  • Translation          │
  │  • QA           │  │  • Completion   │  │  • Summarization        │
  │  • Classify     │  │  • Chat         │  │  • Any text-to-text     │
  └─────────────────┘  └─────────────────┘  └─────────────────────────┘

Diagram 20.5: Vision Transformer (ViT)

     Input Image (224×224×3)
              │
     ┌────────┴────────┐
     │  Split into      │
     │  16×16 patches   │
     │  = 196 patches   │
     └────────┬────────┘
              │
     ┌────────┴────────┐       ┌─────────────┐
     │  Linear Embed   │──────▶│ [CLS] token │
     │  (196 × 768)    │       │ prepended   │
     └────────┬────────┘       └──────┬──────┘
              │                       │
              └───────────┬───────────┘
                          │
              ┌───────────┴───────────┐
              │   + Position Embeds   │
              │   (197 × 768)         │
              └───────────┬───────────┘
                          │
              ┌───────────┴───────────┐
              │   Transformer Encoder │
              │   (12 layers, 768)    │
              └───────────┬───────────┘
                          │
              ┌───────────┴───────────┐
              │   [CLS] output        │
              │   → MLP Head          │
              │   → Classification    │
              └───────────────────────┘
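The first three boxes of the diagram (patch splitting, linear embedding, [CLS] token and position embeddings) can be sketched in NumPy. The random image and weights are placeholders; in a real model `W_embed`, `cls_token`, and `pos_embed` are learned parameters, and these names are illustrative.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    gh, gw = H // patch_size, W // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)        # (gh, gw, p, p, C)
    return patches.reshape(gh * gw, patch_size * patch_size * C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                     # placeholder image

patches = patchify(image, 16)                         # 14 * 14 = 196 patches
d_model = 768
W_embed = rng.normal(size=(patches.shape[1], d_model)) * 0.02   # placeholder weights
cls_token = np.zeros((1, d_model))                    # learned in a real model
pos_embed = np.zeros((patches.shape[0] + 1, d_model)) # learned in a real model

tokens = np.vstack([cls_token, patches @ W_embed]) + pos_embed

print("patches:", patches.shape)   # one row per patch
print("tokens: ", tokens.shape)    # [CLS] + patches, ready for the encoder
```

Each 16×16×3 patch flattens to 768 numbers, which happens to equal the embedding width used in the diagram; the linear projection is still applied so the model can learn its own patch features.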

🔀 Flowcharts

Flowchart 20.1: Choosing the Right Transformer Architecture

                    ┌──────────────┐
                    │  NLP Task?   │
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
       ┌──────────┐ ┌──────────┐ ┌──────────────┐
       │Understand│ │ Generate │ │ Transform    │
       │  text?   │ │  text?   │ │ text→text?   │
       └────┬─────┘ └────┬─────┘ └──────┬───────┘
            │             │              │
            ▼             ▼              ▼
    ┌───────────┐ ┌───────────┐ ┌───────────────┐
    │ BERT-type │ │ GPT-type  │ │  T5 / BART    │
    │ Encoder   │ │ Decoder   │ │  Enc-Dec      │
    └─────┬─────┘ └─────┬─────┘ └───────┬───────┘
          │             │               │
          ▼             ▼               ▼
    ┌──────────┐ ┌──────────┐ ┌──────────────────┐
    │• Classify│ │• ChatBot │ │• Translation     │
    │• NER     │ │• Story   │ │• Summarization   │
    │• QA      │ │• Code    │ │• Question Answer │
    │• Embed   │ │• Reason  │ │• Style Transfer  │
    └──────────┘ └──────────┘ └──────────────────┘

Flowchart 20.2: BERT Fine-Tuning Pipeline

    ┌─────────────────┐
    │ Pre-trained BERT│
    │ (Hugging Face)  │
    └────────┬────────┘
             │
    ┌────────┴────────┐
    │ Task-specific   │
    │ data loading    │
    │ + tokenization  │
    └────────┬────────┘
             │
    ┌────────┴────────┐    ┌──────────────────┐
    │ Add task head:  │    │ Classification:  │
    │ ─────────────── │───▶│ [CLS] → Linear   │
    │ Freeze/Unfreeze │    │ → Softmax → pred │
    │ BERT layers     │    └──────────────────┘
    └────────┬────────┘
             │
    ┌────────┴────────┐
    │ Fine-tune with  │
    │ small lr (2e-5) │
    │ 3-5 epochs      │
    └────────┬────────┘
             │
    ┌────────┴────────┐
    │ Evaluate on     │
    │ validation set  │
    └────────┬────────┘
             │
        ┌────┴────┐
        │ Deploy! │
        └─────────┘

Flowchart 20.3: LLM Training Pipeline (GPT-style)

    ┌──────────────────┐
    │ 1. DATA CURATION │
    │ Web crawl, books │
    │ code, Wikipedia  │
    │ ~1-10 TB text    │
    └────────┬─────────┘
             │
    ┌────────┴─────────┐
    │ 2. TOKENIZATION  │
    │ BPE / SentPiece  │
    │ 32K-100K tokens  │
    └────────┬─────────┘
             │
    ┌────────┴─────────┐
    │ 3. PRETRAINING   │
    │ CLM: predict     │
    │ next token       │
    │ 1000s of GPUs    │
    │ Weeks-months     │
    └────────┬─────────┘
             │
    ┌────────┴─────────┐
    │ 4. SFT           │
    │ Supervised Fine- │
    │ Tuning on human- │
    │ curated prompts  │
    └────────┬─────────┘
             │
    ┌────────┴─────────┐
    │ 5. RLHF          │
    │ Reward Model     │
    │ + PPO optimizer  │
    │ → Aligned model  │
    └────────┬─────────┘
             │
    ┌────────┴─────────┐
    │ 6. DEPLOYMENT    │
    │ API, guardrails  │
    │ monitoring       │
    └──────────────────┘
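Step 3 of the flowchart (CLM: predict next token) comes down to a cross-entropy loss in which the targets are the inputs shifted by one position. The sketch below is a minimal illustration with a made-up vocabulary size and token IDs; the function name `next_token_loss` is hypothetical, and real training code would operate on batches.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits:    (n, vocab_size) scores produced at each position
    token_ids: (n,) integer IDs of the input sequence
    """
    preds = logits[:-1]            # positions 0..n-2 make the predictions
    targets = token_ids[1:]        # the tokens that actually came next
    shifted = preds - preds.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab_size = 5
token_ids = np.array([2, 0, 3, 1])

# An untrained model that is equally unsure about every token.
uniform_logits = np.zeros((len(token_ids), vocab_size))
loss = next_token_loss(uniform_logits, token_ids)

print(f"loss = {loss:.4f}, log(vocab_size) = {np.log(vocab_size):.4f}")
```

A model that guesses uniformly scores exactly log(vocab_size); pretraining drives the loss below that baseline by learning which tokens tend to follow which contexts.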

🐍 Python Implementation (From Scratch)

10.1 Scaled Dot-Product Attention

attention_from_scratch.py Python

import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled Dot-Product Attention from 'Attention Is All You Need'.
    
    Args:
        Q: Queries  (batch, n, d_k) or (n, d_k)
        K: Keys     (batch, m, d_k) or (m, d_k)
        V: Values   (batch, m, d_v) or (m, d_v)
        mask: Optional mask (n, m) — 0 for positions to attend, -inf for masked
    
    Returns:
        output: Weighted values (batch, n, d_v) or (n, d_v)
        attention_weights: (batch, n, m) or (n, m)
    """
    d_k = Q.shape[-1]
    
    # Step 1: QK^T dot products
    scores = np.matmul(Q, K.swapaxes(-2, -1))  # (n, m)
    
    # Step 2: Scale by sqrt(d_k) — the variance argument!
    scores = scores / np.sqrt(d_k)
    
    # Step 3: Apply mask (for decoder / padding)
    if mask is not None:
        scores = scores + mask  # mask has -inf for blocked positions
    
    # Step 4: Softmax to get attention weights
    attention_weights = softmax(scores, axis=-1)
    
    # Step 5: Weighted sum of values
    output = np.matmul(attention_weights, V)
    
    return output, attention_weights

# ─── DEMO: Self-Attention on 4 tokens ───
np.random.seed(42)
seq_len, d_k, d_v = 4, 8, 8

# Random input embeddings
X = np.random.randn(seq_len, d_k)

# Learnable projection matrices (random for demo)
W_Q = np.random.randn(d_k, d_k) * 0.1
W_K = np.random.randn(d_k, d_k) * 0.1
W_V = np.random.randn(d_k, d_v) * 0.1

# Project to Q, K, V
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Compute attention
output, weights = scaled_dot_product_attention(Q, K, V)

print("Input shape:", X.shape)
print("Output shape:", output.shape)
print("\nAttention weights (each row sums to 1):")
print(np.round(weights, 3))
print("\nRow sums:", np.round(weights.sum(axis=-1), 6))

10.2 Multi-Head Attention

multi_head_attention.py Python

import numpy as np

class MultiHeadAttention:
    """
    Multi-Head Attention from scratch.
    Splits input into h heads, runs attention in parallel, concatenates.
    """
    def __init__(self, d_model, num_heads):
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Initialize projection matrices (Xavier/Glorot normal:
        # std = sqrt(2 / (fan_in + fan_out)), with fan_in = fan_out = d_model)
        scale = np.sqrt(2.0 / (d_model + d_model))
        self.W_Q = np.random.randn(d_model, d_model) * scale
        self.W_K = np.random.randn(d_model, d_model) * scale
        self.W_V = np.random.randn(d_model, d_model) * scale
        self.W_O = np.random.randn(d_model, d_model) * scale
    
    def split_heads(self, x):
        """Reshape (batch, seq, d_model) → (batch, heads, seq, d_k)"""
        batch_size = x.shape[0]
        seq_len = x.shape[1]
        # Reshape: (batch, seq, d_model) → (batch, seq, heads, d_k)
        x = x.reshape(batch_size, seq_len, self.num_heads, self.d_k)
        # Transpose: → (batch, heads, seq, d_k)
        return x.transpose(0, 2, 1, 3)
    
    def forward(self, Q, K, V, mask=None):
        """
        Args:
            Q, K, V: (batch, seq_len, d_model)
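            mask: optional additive mask (seq_len, seq_len); 0 where attention
                  is allowed, a large negative value (e.g. -1e9) where it is blocked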
        Returns:
            output: (batch, seq_len, d_model)
        """
        batch_size = Q.shape[0]
        
        # Step 1: Linear projections
        Q_proj = Q @ self.W_Q  # (batch, n, d_model)
        K_proj = K @ self.W_K
        V_proj = V @ self.W_V
        
        # Step 2: Split into heads
        Q_heads = self.split_heads(Q_proj)  # (batch, h, n, d_k)
        K_heads = self.split_heads(K_proj)
        V_heads = self.split_heads(V_proj)
        
        # Step 3: Scaled dot-product attention per head
        d_k = self.d_k
        scores = np.matmul(Q_heads, K_heads.transpose(0, 1, 3, 2))
        scores = scores / np.sqrt(d_k)
        
        if mask is not None:
            scores = scores + mask
        
        attn_weights = self._softmax(scores)
        head_outputs = np.matmul(attn_weights, V_heads)  # (batch, h, n, d_k)
        
        # Step 4: Concatenate heads
        # (batch, h, n, d_k) → (batch, n, h, d_k) → (batch, n, d_model)
        concat = head_outputs.transpose(0, 2, 1, 3)
        concat = concat.reshape(batch_size, -1, self.d_model)
        
        # Step 5: Final linear projection
        output = concat @ self.W_O
        
        return output, attn_weights
    
    def _softmax(self, x, axis=-1):
        e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return e_x / np.sum(e_x, axis=axis, keepdims=True)

# ─── DEMO ───
np.random.seed(42)
batch_size, seq_len, d_model, num_heads = 2, 6, 64, 8

mha = MultiHeadAttention(d_model, num_heads)
X = np.random.randn(batch_size, seq_len, d_model)

output, weights = mha.forward(X, X, X)  # Self-attention

print(f"Input shape:  {X.shape}")        # (2, 6, 64)
print(f"Output shape: {output.shape}")    # (2, 6, 64)
print(f"Weight shape: {weights.shape}")   # (2, 8, 6, 6)
print(f"\nHead 0, Batch 0 attention (6×6):")
print(np.round(weights[0, 0], 3))
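
The `mask` argument of `forward` above is added to the scores before the softmax, so a causal mask has to hold 0 where attention is allowed and a large negative number where it is blocked. The following is a minimal, self-contained sketch of that idea. The helper names `causal_additive_mask` and `softmax` are illustrative and are not part of the listing above, and the random `scores` matrix merely stands in for QK^T/√d_k.

```python
import numpy as np

def causal_additive_mask(seq_len, neg=-1e9):
    """Additive mask: 0 on and below the diagonal, `neg` strictly above it."""
    future = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1 where j > i
    return future * neg

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((5, 5))   # stand-in for QK^T / sqrt(d_k)
mask = causal_additive_mask(5)
weights = softmax(scores + mask)

# Row i has non-zero weights only for columns j <= i
print(np.round(weights, 3))
```

Because the mask has shape (seq_len, seq_len), NumPy broadcasting would apply it to every batch element and every head if it were passed as the fourth argument of `mha.forward`.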

10.3 Positional Encoding

positional_encoding.py Python

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """
    Compute sinusoidal positional encoding as in 'Attention Is All You Need'.
    
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    PE = np.zeros((max_len, d_model))
    
    positions = np.arange(max_len)[:, np.newaxis]      # (max_len, 1)
    dim_indices = np.arange(0, d_model, 2)              # (d_model/2,)
    
    # Compute the inverse denominator: 1 / 10000^(2i/d_model)
    # = exp(-2i * ln(10000) / d_model); dim_indices already holds the values 2i
    div_term = np.exp(dim_indices * (-np.log(10000.0) / d_model))
    
    # Apply sin to even indices, cos to odd indices
    PE[:, 0::2] = np.sin(positions * div_term)
    PE[:, 1::2] = np.cos(positions * div_term)
    
    return PE

# ─── DEMO ───
PE = sinusoidal_positional_encoding(max_len=50, d_model=16)
print("PE shape:", PE.shape)
print("\nPosition 0:", np.round(PE[0], 3))
print("Position 1:", np.round(PE[1], 3))
print("Position 2:", np.round(PE[2], 3))

# Verify the rotation property: PE(pos+k) is a linear transform of PE(pos)
pos, k, dim = 5, 3, 0
omega = 1.0 / (10000 ** (2 * dim / PE.shape[1]))  # PE.shape[1] is d_model (16 here)
# PE(pos+k, 2*dim) should equal sin(omega*(pos+k))
expected = np.sin(omega * (pos + k))
# Using rotation: sin(w*pos)*cos(w*k) + cos(w*pos)*sin(w*k)
from_rotation = PE[pos, 2*dim] * np.cos(omega*k) + PE[pos, 2*dim+1] * np.sin(omega*k)
print(f"\nRotation property check:")
print(f"  PE({pos+k}, {2*dim}) = {PE[pos+k, 2*dim]:.6f}")
print(f"  Via rotation:        = {from_rotation:.6f}")
print(f"  Match: {np.isclose(PE[pos+k, 2*dim], from_rotation)}")
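
A related consequence of the sin/cos pairing is that the dot product between two positional encodings depends only on their offset k, not on the absolute position. Expanding the product pair by pair gives a sum of cos(ω_i·k) terms. The sketch below checks this numerically. `sinusoidal_pe` is a compact re-statement of the function above, written so the block runs on its own, and the positions tested are arbitrary.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Same formula as above; also returns the per-pair frequencies."""
    pos = np.arange(max_len)[:, None]
    inv_freq = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * inv_freq)
    pe[:, 1::2] = np.cos(pos * inv_freq)
    return pe, inv_freq

pe, inv_freq = sinusoidal_pe(max_len=50, d_model=16)
k = 3

# Dot products PE(p) . PE(p + k) at several absolute positions p
dots = [pe[p] @ pe[p + k] for p in (0, 5, 20, 40)]

# Closed form: sum over frequency pairs of cos(omega_i * k)
closed_form = np.sum(np.cos(inv_freq * k))

print("Dot products:", np.round(dots, 6))
print("Closed form: ", round(float(closed_form), 6))
```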

10.4 Complete Transformer Block

transformer_block.py Python

import numpy as np
# Reuses MultiHeadAttention (10.2) and sinusoidal_positional_encoding (10.3) defined above.

class LayerNorm:
    """Layer Normalization."""
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)
        self.beta = np.zeros(d_model)
        self.eps = eps
    
    def forward(self, x):
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        x_norm = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

class FeedForward:
    """Position-wise Feed-Forward Network: FFN(x) = ReLU(xW1+b1)W2+b2"""
    def __init__(self, d_model, d_ff):
        # He initialization: scale each matrix by its own fan-in
        self.W1 = np.random.randn(d_model, d_ff) * np.sqrt(2.0 / d_model)
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * np.sqrt(2.0 / d_ff)
        self.b2 = np.zeros(d_model)
    
    def forward(self, x):
        hidden = np.maximum(0, x @ self.W1 + self.b1)  # ReLU
        return hidden @ self.W2 + self.b2

class TransformerBlock:
    """
    One complete Transformer encoder block:
    1. Multi-Head Self-Attention + Residual + LayerNorm
    2. Feed-Forward Network + Residual + LayerNorm
    """
    def __init__(self, d_model, num_heads, d_ff):
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = LayerNorm(d_model)
    
    def forward(self, x, mask=None):
        # Sub-layer 1: Multi-Head Attention
        attn_output, attn_weights = self.attention.forward(x, x, x, mask)
        x = self.norm1.forward(x + attn_output)  # Residual + LayerNorm
        
        # Sub-layer 2: Feed-Forward
        ffn_output = self.ffn.forward(x)
        x = self.norm2.forward(x + ffn_output)    # Residual + LayerNorm
        
        return x, attn_weights

class TransformerEncoder:
    """Stack of N Transformer encoder blocks."""
    def __init__(self, num_layers, d_model, num_heads, d_ff, max_len=512):
        self.layers = [
            TransformerBlock(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ]
        self.PE = sinusoidal_positional_encoding(max_len, d_model)
    
    def forward(self, x):
        """x: (batch, seq_len, d_model)"""
        seq_len = x.shape[1]
        x = x + self.PE[:seq_len]  # Add positional encoding
        
        all_weights = []
        for layer in self.layers:
            x, weights = layer.forward(x)
            all_weights.append(weights)
        
        return x, all_weights

# ─── DEMO: 6-layer Transformer Encoder ───
np.random.seed(42)
batch, seq_len, d_model = 2, 10, 64
num_heads, d_ff, num_layers = 8, 256, 6

encoder = TransformerEncoder(num_layers, d_model, num_heads, d_ff)
X = np.random.randn(batch, seq_len, d_model)

output, all_weights = encoder.forward(X)

print(f"Input:  {X.shape}")          # (2, 10, 64)
print(f"Output: {output.shape}")      # (2, 10, 64)
print(f"Layers: {len(all_weights)}")  # 6
print(f"Attn weights per layer: {all_weights[0].shape}")  # (2, 8, 10, 10)
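
To get a feel for where the weights of this encoder live, the count can be worked out by hand from the classes above: four d_model × d_model attention matrices without biases, two feed-forward matrices with biases, and a gamma and beta vector for each of the two LayerNorms. The sketch below does that arithmetic. `encoder_param_count` is a hypothetical helper, and it assumes exactly the NumPy classes in this listing, where the sinusoidal positional encoding is fixed and contributes no parameters.

```python
def encoder_param_count(num_layers, d_model, d_ff):
    """Count weights in the NumPy TransformerEncoder defined above."""
    attn = 4 * d_model * d_model                              # W_Q, W_K, W_V, W_O (no biases)
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model    # W1, b1, W2, b2
    norms = 2 * (2 * d_model)                                 # gamma, beta for two LayerNorms
    return num_layers * (attn + ffn + norms)

per_layer = encoder_param_count(num_layers=1, d_model=64, d_ff=256)
total = encoder_param_count(num_layers=6, d_model=64, d_ff=256)

print(f"Per layer: {per_layer:,}")
print(f"6 layers:  {total:,}")
```

For the demo configuration, roughly two thirds of each layer sits in the feed-forward network, because d_ff is four times d_model.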

🔶 TensorFlow Implementation

11.1 BERT Fine-Tuning for Sentiment Analysis

bert_sentiment_finetuning.py TensorFlow

import tensorflow as tf
from transformers import TFBertModel, BertTokenizer
import numpy as np

# ─── 1. Load Pre-trained Tokenizer ───
# (The BERT weights are loaded once, inside the classifier below.)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# ─── 2. Build Sentiment Classifier ───
class BERTSentimentClassifier(tf.keras.Model):
    def __init__(self, num_classes=3, dropout_rate=0.3):
        super().__init__()
        self.bert = TFBertModel.from_pretrained('bert-base-uncased')
        self.dropout = tf.keras.layers.Dropout(dropout_rate)
        self.classifier = tf.keras.layers.Dense(
            num_classes, activation='softmax'
        )
    
    def call(self, inputs, training=False):
        # `inputs` is the dict produced by tokenize_data:
        # {'input_ids': ..., 'attention_mask': ...}
        outputs = self.bert(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            training=training
        )
        # Use [CLS] token representation (first token)
        cls_output = outputs.last_hidden_state[:, 0, :]  # (batch, 768)
        cls_output = self.dropout(cls_output, training=training)
        # The Dense layer applies softmax, so these are probabilities, not logits
        probs = self.classifier(cls_output)
        return probs

# ─── 3. Tokenize Dataset ───
def tokenize_data(texts, labels, max_length=128):
    """Tokenize texts for BERT input."""
    encodings = tokenizer(
        texts, 
        max_length=max_length,
        truncation=True,
        padding='max_length',
        return_tensors='tf'
    )
    dataset = tf.data.Dataset.from_tensor_slices((
        {
            'input_ids': encodings['input_ids'],
            'attention_mask': encodings['attention_mask']
        },
        labels
    ))
    return dataset

# ─── 4. Sample Data & Training ───
texts = [
    "This movie was absolutely wonderful!",
    "Terrible experience, worst film ever.",
    "It was okay, nothing special.",
    "I loved every minute of this masterpiece!",
    "Complete waste of time and money.",
    "Average movie with some good moments.",
]
labels = [2, 0, 1, 2, 0, 1]  # 0=negative, 1=neutral, 2=positive

dataset = tokenize_data(texts, labels)
dataset = dataset.batch(2).prefetch(tf.data.AUTOTUNE)

# ─── 5. Compile and Train ───
model = BERTSentimentClassifier(num_classes=3)

# Key: Use very small learning rate for BERT fine-tuning!
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
model.compile(
    optimizer=optimizer,
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Fine-tune for 3 epochs (in practice, 3-5 is sufficient)
# model.fit(dataset, epochs=3)
print("Model built. Ready for fine-tuning!")
print("BERT parameters: ~110M")
print(f"Classifier head: 768 * 3 + 3 = {768*3+3} parameters")

11.2 Mini-GPT Text Generation

mini_gpt.py TensorFlow

import tensorflow as tf
import numpy as np

class MiniGPT(tf.keras.Model):
    """
    A minimal GPT-style decoder-only Transformer for text generation.
    Implements causal (masked) self-attention.
    """
    def __init__(self, vocab_size, d_model=128, num_heads=4,
                 d_ff=512, num_layers=4, max_len=256):
        super().__init__()
        self.d_model = d_model
        self.max_len = max_len
        
        # Token + Position Embeddings
        self.token_embed = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_embed = tf.keras.layers.Embedding(max_len, d_model)
        
        # Transformer Decoder Blocks
        self.blocks = [
            self._decoder_block(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ]
        
        # Output projection
        self.ln_f = tf.keras.layers.LayerNormalization()
        self.head = tf.keras.layers.Dense(vocab_size)
    
    def _decoder_block(self, d_model, num_heads, d_ff):
        """Single decoder block with causal attention."""
        return {
            'attn': tf.keras.layers.MultiHeadAttention(
                num_heads=num_heads, key_dim=d_model // num_heads
            ),
            'ln1': tf.keras.layers.LayerNormalization(),
            'ffn': tf.keras.Sequential([
                tf.keras.layers.Dense(d_ff, activation='gelu'),
                tf.keras.layers.Dense(d_model),
            ]),
            'ln2': tf.keras.layers.LayerNormalization(),
        }
    
    def _causal_mask(self, seq_len):
        """Create causal mask: prevent attending to future tokens."""
        mask = tf.linalg.band_part(
            tf.ones((seq_len, seq_len)), -1, 0
        )
        return mask  # Lower triangular
    
    def call(self, x, training=False):
        batch_size, seq_len = tf.shape(x)[0], tf.shape(x)[1]
        
        # Embeddings
        positions = tf.range(seq_len)
        tok_emb = self.token_embed(x)           # (batch, seq, d_model)
        pos_emb = self.pos_embed(positions)      # (seq, d_model)
        h = tok_emb + pos_emb
        
        # Causal mask
        causal_mask = self._causal_mask(seq_len)
        
        # Pass through decoder blocks
        for block in self.blocks:
            # Pre-norm architecture (GPT-2 style)
            h_norm = block['ln1'](h)
            attn_out = block['attn'](
                query=h_norm, key=h_norm, value=h_norm,
                attention_mask=causal_mask, training=training
            )
            h = h + attn_out  # Residual
            
            h_norm = block['ln2'](h)
            ffn_out = block['ffn'](h_norm, training=training)
            h = h + ffn_out   # Residual
        
        h = self.ln_f(h)
        logits = self.head(h)  # (batch, seq, vocab_size)
        return logits
    
    def generate(self, start_tokens, max_new_tokens=50, temperature=0.8):
        """Autoregressive text generation."""
        tokens = tf.constant([start_tokens])
        
        for _ in range(max_new_tokens):
            # Crop to max_len
            crop = tokens[:, -self.max_len:]
            logits = self(crop, training=False)
            
            # Get logits for last position
            next_logits = logits[:, -1, :] / temperature
            
            # Sample from distribution (int32 to match the dtype of `tokens`)
            next_token = tf.random.categorical(next_logits, 1, dtype=tf.int32)
            tokens = tf.concat([tokens, next_token], axis=1)
        
        return tokens.numpy()[0]

# ─── DEMO ───
vocab_size = 5000
model = MiniGPT(vocab_size=vocab_size, d_model=128, num_heads=4,
                d_ff=512, num_layers=4, max_len=256)

# Test forward pass
dummy_input = tf.constant([[1, 42, 100, 7, 88]])
logits = model(dummy_input)
print(f"Input shape:  {dummy_input.shape}")
print(f"Output shape: {logits.shape}")   # (1, 5, 5000)
print(f"Parameters:   {model.count_params():,}")

# Generate text (random tokens since untrained)
generated = model.generate([1, 42, 100], max_new_tokens=10)
print(f"Generated token IDs: {generated}")
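
The `temperature` argument in `generate` divides the logits before sampling. Values below 1 sharpen the distribution toward the most likely token, and values above 1 flatten it. The sketch below shows the effect on a small, made-up logit vector. The logits and the set of temperatures are illustrative assumptions, not values produced by the model above.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

logits = np.array([2.0, 1.0, 0.5, 0.0])   # illustrative next-token logits

# Same logits, different temperatures
probs_by_temp = {t: softmax(logits / t) for t in (0.5, 0.8, 1.0, 2.0)}

for t, p in probs_by_temp.items():
    print(f"T={t}: {np.round(p, 3)}  (max prob {p.max():.3f})")
```

A lower temperature makes generation more repetitive but more coherent; a higher one trades coherence for variety.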

11.3 Custom Transformer Layer in TF

tf_transformer_layer.py TensorFlow

import tensorflow as tf

class TransformerEncoderLayer(tf.keras.layers.Layer):
    """Post-norm Transformer encoder layer built from Keras layers."""
    
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=d_model // num_heads,
            dropout=dropout
        )
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation='relu'),
            tf.keras.layers.Dropout(dropout),
            tf.keras.layers.Dense(d_model),
            tf.keras.layers.Dropout(dropout),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(dropout)
    
    def call(self, x, training=False, mask=None):
        # Multi-Head Self-Attention + Residual + Norm
        attn_output = self.mha(x, x, x, attention_mask=mask, training=training)
        attn_output = self.dropout1(attn_output, training=training)
        x = self.norm1(x + attn_output)
        
        # Feed-Forward + Residual + Norm
        ffn_output = self.ffn(x, training=training)
        x = self.norm2(x + ffn_output)
        
        return x

# Build 6-layer encoder
d_model, num_heads, d_ff = 512, 8, 2048
encoder_layers = [
    TransformerEncoderLayer(d_model, num_heads, d_ff)
    for _ in range(6)
]

# Test
x = tf.random.normal((2, 20, 512))  # batch=2, seq=20
for layer in encoder_layers:
    x = layer(x, training=True)
print(f"6-layer encoder output: {x.shape}")  # (2, 20, 512)

🧪 Scikit-Learn Integration

Scikit-learn has no native Transformer models, but its classifiers work well on features produced by one. Here we extract BERT [CLS] embeddings and use them as input features for standard sklearn classifiers.

bert_sklearn_pipeline.py Python

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
import numpy as np

# ─── BERT as Feature Extractor for sklearn ───

class BERTFeatureExtractor:
    """Extract [CLS] embeddings from BERT for use with sklearn."""
    
    def __init__(self, model_name='bert-base-uncased', max_length=128):
        from transformers import BertTokenizer, TFBertModel
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = TFBertModel.from_pretrained(model_name)
        self.max_length = max_length
    
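    def fit(self, texts, y=None):
        """No-op: BERT is used as a frozen feature extractor, so nothing is fitted."""
        return self
    
    def transform(self, texts):
        """Convert texts to BERT [CLS] embeddings (768-dim vectors)."""
        texts = list(texts)  # the tokenizer expects a list of strings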
        encodings = self.tokenizer(
            texts, max_length=self.max_length,
            truncation=True, padding='max_length',
            return_tensors='tf'
        )
        outputs = self.model(encodings, training=False)
        # Extract [CLS] token embedding
        cls_embeddings = outputs.last_hidden_state[:, 0, :].numpy()
        return cls_embeddings

# ─── Use with sklearn ───
# (Pseudocode — requires transformers & tensorflow installed)
"""
# Extract features
extractor = BERTFeatureExtractor()
X_train = extractor.transform(train_texts)  # (n, 768)
X_test = extractor.transform(test_texts)

# Train any sklearn classifier on BERT features!
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

models = {
    'SVM': SVC(kernel='rbf', C=1.0),
    'LogReg': LogisticRegression(max_iter=1000),
    'RF': RandomForestClassifier(n_estimators=100),
}

for name, clf in models.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")
"""

# ─── Simulated Demo (no GPU needed) ───
np.random.seed(42)
n_samples = 200
X_simulated = np.random.randn(n_samples, 768)  # Simulated BERT features
y_simulated = (X_simulated[:, 0] + X_simulated[:, 1] > 0).astype(int)

# SVC and cross_val_score are already imported at the top of this file
svm = SVC(kernel='rbf', C=1.0)
scores = cross_val_score(svm, X_simulated, y_simulated, cv=5)
print(f"SVM on simulated features: {scores.mean():.4f} ± {scores.std():.4f}")

🇮🇳 Indian Case Studies

🏗️ Case Study 1: AI4Bharat IndicBERT — Transformers for 11 Indian Languages

Challenge: The original BERT was trained on English text only. India has 22 constitutionally scheduled languages and more than 100 spoken languages, and most Indian-language NLP was severely under-resourced.

Solution: AI4Bharat (IIT Madras) created IndicBERT, a multilingual BERT model trained on the IndicCorp dataset covering 11 major Indian languages: Hindi, Bengali, Telugu, Tamil, Marathi, Gujarati, Kannada, Malayalam, Odia, Punjabi, and Assamese.

Technical Details:

Based on ALBERT architecture (parameter-sharing for efficiency)
Trained on 9 billion tokens across 11 languages
Uses SentencePiece tokenizer trained on Indian language data
Outperforms multilingual BERT (mBERT) on IndicGLUE benchmark

Impact: Enabled sentiment analysis in Hindi, NER in Tamil, question answering in Bengali, and more. Used by Indian startups for vernacular content moderation, e-commerce search, and government document processing.

🤖 Case Study 2: Krutrim LLM — India's First Multilingual Foundation Model

Challenge: Global LLMs like GPT-4 often underperform on Indian languages because of limited training data and tokenization inefficiency: tokenizers trained mostly on English can split Hindi text into 3-4× as many tokens as English text of similar length.

Solution: Ola's AI lab developed Krutrim (Sanskrit for "artificial"), billed as India's first homegrown LLM. It is stated to understand 22 Indian languages and to generate text in 10.

Technical Innovation:

Custom tokenizer optimized for Indian scripts (Devanagari, Dravidian scripts, etc.)
Training data curated from Indian web sources, books, and government documents
Efficient inference using quantization for deployment on Indian infrastructure
Krutrim Pro: Larger model with 100B+ parameters for enterprise applications

Impact: Demonstrated that India can build sovereign AI infrastructure. Applications in healthcare (patient communication in local languages), education (tutoring in regional languages), and e-governance.
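
One contributing factor to the token inflation mentioned in the Challenge above is easy to see without any tokenizer at all. Devanagari code points take 3 bytes each in UTF-8, while ASCII letters take 1, so a byte-level tokenizer with few learned Hindi merges starts from about three times as many base symbols. The sketch below only measures UTF-8 bytes per character. It does not run any real tokenizer, `bytes_per_char` is a hypothetical helper, and the sample sentences are illustrative.

```python
def bytes_per_char(text):
    """Average UTF-8 bytes per Unicode code point, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return sum(len(c.encode("utf-8")) for c in chars) / len(chars)

english = "This film was very good"
hindi = "यह फिल्म बहुत अच्छी थी"

ratio_en = bytes_per_char(english)
ratio_hi = bytes_per_char(hindi)

print(f"English: {ratio_en:.1f} bytes per character")
print(f"Hindi:   {ratio_hi:.1f} bytes per character")
```

A tokenizer trained on a corpus with substantial Indian-language text can learn merges that undo most of this overhead, which is the motivation for the custom tokenizer listed above.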

🏛️ Case Study 3: Bhashini — Government's AI Translation Platform

Challenge: Government services need to be accessible in all 22 scheduled languages of India.

Solution: MeitY's Bhashini platform uses Transformer-based translation models to provide real-time translation across Indian languages.

Uses encoder-decoder Transformers fine-tuned on Samanantar parallel corpus
Integrated with Aadhaar and DigiLocker for document translation
Open API for developers to build multilingual applications
Handles 100+ language pairs with a single multilingual model

🌍 Global Case Studies

🧠 Case Study 4: OpenAI GPT Evolution — The Scaling Frontier

Model	Year	Parameters	Training Data	Key Innovation
GPT-1	2018	117M	BookCorpus (7K books)	Pre-train + fine-tune paradigm
GPT-2	2019	1.5B	WebText (40GB)	Zero-shot via prompting
GPT-3	2020	175B	570GB mixed	In-context learning, few-shot
InstructGPT	2022	~175B	+ RLHF data	RLHF alignment
GPT-4	2023	Undisclosed (unofficial estimates: ~1.8T, MoE)	Undisclosed (unofficial estimates: ~13T tokens)	Multimodal, reasoning
GPT-4o	2024	Undisclosed	Undisclosed	Omnimodal (text+image+audio)

Key Lessons: (1) Scale is predictable: the Kaplan et al. scaling laws show that loss falls as a power law in compute, data, and parameters. (2) New abilities appear at scale, including chain-of-thought reasoning, code generation, and multilingual transfer. (3) RLHF turns raw capability into useful, aligned behavior.
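
The first lesson can be made concrete with the parameter-count form of the power law, L(N) = (N_c / N)^α. The sketch below plugs in the parameter counts from the table. The constants `n_c` and `alpha` are placeholders of roughly the magnitude reported by Kaplan et al. (2020) and should be checked against the paper before being relied on. The original law is also stated for non-embedding parameters, so the numbers here are illustrative, not predictions for the actual GPT models.

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """L(N) = (N_c / N) ** alpha; n_c and alpha are illustrative assumptions."""
    return (n_c / n_params) ** alpha

sizes = [117e6, 1.5e9, 175e9]   # GPT-1, GPT-2, GPT-3 parameter counts from the table
losses = [power_law_loss(n) for n in sizes]

for n, loss in zip(sizes, losses):
    print(f"N = {n:.3g}: predicted loss = {loss:.3f}")

# Under a power law, every doubling of N multiplies the loss by the same factor
doubling_factor = power_law_loss(2e9) / power_law_loss(1e9)
print(f"Loss multiplier per doubling of N: {doubling_factor:.4f}")
```

The small exponent is the practical point: each doubling of model size buys only a few percent of loss, which is why the jumps between GPT generations are measured in orders of magnitude.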

🔍 Case Study 5: Google Gemini — Multimodal from the Ground Up

Architecture: Google describes Gemini as natively multimodal: rather than attaching vision to a text-only model, it was trained from the start on text, images, audio, video, and code together.

Gemini Ultra: Reported 90.0% on MMLU, above the human-expert baseline of 89.8%
Gemini Pro: Powers Google Search, Gmail, Docs integration
Gemini Nano: On-device model for Pixel phones (1.8B & 3.25B variants)
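Training: Gemini 1.0 was trained on TPU v4 and v5e pods with a 32K context window; Gemini 1.5 adopted a mixture-of-experts architecture with a standard 128K context window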

🦙 Case Study 6: Meta Llama — Open-Source LLM Revolution

Impact: Meta's release of the Llama models (7B to 405B parameters) with openly downloadable weights widened access to LLMs and enabled thousands of fine-tuned variants.

Llama 2 (2023): 7B/13B/70B, commercially licensed, trained on 2T tokens
Llama 3 (2024): 8B/70B at launch, with 405B added in Llama 3.1; state-of-the-art among open-weight models, trained on 15T tokens
Architecture choices: RMSNorm (instead of LayerNorm), SwiGLU activation, Rotary Position Embeddings (RoPE), Grouped Query Attention (GQA)
Community: Over 30,000 derived models on Hugging Face within months
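
Of the architecture choices listed above, RMSNorm is the simplest to compare with the LayerNorm used earlier in this chapter: it skips the mean subtraction and rescales each vector by its root mean square only. The sketch below is a minimal NumPy version under that definition. The function names and the `eps` value are assumptions for illustration and are not taken from Meta's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Subtract the mean, then divide by the standard deviation."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, gamma=None, eps=1e-6):
    """Divide by the root mean square only; no mean subtraction, no beta."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    out = x / rms
    return out if gamma is None else out * gamma

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8)) + 3.0   # shifted so the mean is clearly non-zero

y_ln = layer_norm(x)
y_rms = rms_norm(x)

print("LayerNorm feature means:", np.round(y_ln.mean(axis=-1)[0], 3))
print("RMSNorm feature means:  ", np.round(y_rms.mean(axis=-1)[0], 3))
print("RMSNorm mean square:    ", np.round(np.mean(y_rms ** 2, axis=-1)[0], 3))
```

RMSNorm output keeps a non-zero mean but has unit mean square, and it needs one fewer reduction and one fewer learned vector per layer than LayerNorm.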

🚀 Startup Applications

Sarvam AI (India)

Building India-first LLMs with focus on voice + text in Indian languages. Their models handle code-switching (Hinglish) natively, a critical requirement for Indian consumers.

Hugging Face (Global)

The "GitHub of ML": hosts 500K+ models and 100K+ datasets. Its Transformers library is the de facto standard. A $4.5B valuation (2023) suggests that open-source AI can be a viable business.

Cohere (Canada)

Enterprise-focused LLMs with Retrieval-Augmented Generation (RAG). Its Command model powers enterprise search, and its Embed model provides text embeddings for retrieval.

Anthropic (USA)

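Founded by ex-OpenAI researchers to build "safer" LLMs. Claude is trained with Constitutional AI (CAI), in which the model critiques and revises its own outputs against a written constitution, and AI-generated preference feedback (RLAIF) replaces much of the human feedback used in standard RLHF.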

🏛️ Government Applications

🇮🇳 IndiaAI Mission

The Government of India approved about ₹10,372 crore for the IndiaAI Mission in March 2024. Key Transformer applications include Bhashini (translation), document intelligence (tax forms, legal documents), and agricultural advisory chatbots in local languages.

🇺🇸 US Intelligence

US intelligence agencies such as the CIA and NSA reportedly use Transformer-based models for signals intelligence, analyzing intercepted communications in 100+ languages, with custom fine-tuned models running on air-gapped classified networks.

🇪🇺 EU AI Act

The world's first comprehensive AI regulation explicitly covers general-purpose AI models such as GPT-4 and Gemini. Their providers must document training data and compute use, and providers of the most capable models must also conduct safety evaluations.

🇮🇳 Digital Courts

The Indian judiciary is exploring Transformer models for case summarization, legal document translation, and precedent search across the 25 High Courts. The Supreme Court's SUVAS system uses neural machine translation (NMT) to translate judgments.

🏭 Industry Applications

Industry	Application	Transformer Type	Impact
Healthcare	Medical report generation, protein structure prediction (AlphaFold)	Encoder-Decoder, Specialized	10× faster literature review
Finance	Fraud detection, sentiment from earnings calls	BERT, FinBERT	95%+ fraud detection accuracy
E-Commerce	Product search, recommendation, review analysis	BERT, Cross-encoders	Flipkart: 15% search improvement
Manufacturing	Predictive maintenance from sensor logs (time-series Transformers)	Encoder-only	30% reduction in downtime
Education	Personalized tutoring, automated grading	GPT-type	Byju's, Vedantu AI tutors
Legal	Contract analysis, case prediction	BERT, LegalBERT	80% faster contract review
Media	Content generation, translation, dubbing	GPT, Whisper	Netflix: 40+ language dubbing
Agriculture	Crop advisory chatbots, pest identification	Multilingual LLMs	KissanAI: 500K+ farmers served

🛠️ Mini Projects

🛠️ Mini Project 1: Hindi Sentiment Analysis with BERT

Objective: Fine-tune a multilingual BERT model for sentiment classification on Hindi movie reviews.

Dataset: Hindi Movie Reviews dataset from AI4Bharat or IIIT-H

hindi_sentiment_bert.py Python

"""
Mini Project 1: Hindi Sentiment Analysis using Multilingual BERT
"""
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf
import numpy as np

# ─── 1. Load Hindi-capable Model ───
MODEL_NAME = "ai4bharat/indic-bert"  # or "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = TFAutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3  # positive, negative, neutral
)

# ─── 2. Sample Hindi Data ───
hindi_reviews = [
    {"text": "यह फिल्म बहुत अच्छी थी, मुझे बहुत पसंद आई", "label": 2},
    {"text": "बेकार फिल्म, समय की बर्बादी", "label": 0},
    {"text": "कहानी ठीक थी लेकिन अभिनय कमजोर था", "label": 1},
    {"text": "शानदार अभिनय और बेहतरीन संगीत", "label": 2},
    {"text": "इतनी खराब फिल्म मैंने कभी नहीं देखी", "label": 0},
    {"text": "औसत फिल्म, एक बार देख सकते हैं", "label": 1},
    {"text": "दिल को छू लेने वाली कहानी", "label": 2},
    {"text": "बोरिंग और लंबी फिल्म", "label": 0},
]

# ─── 3. Tokenize ───
texts = [r["text"] for r in hindi_reviews]
labels = [r["label"] for r in hindi_reviews]

encodings = tokenizer(
    texts, max_length=128, truncation=True,
    padding="max_length", return_tensors="tf"
)

dataset = tf.data.Dataset.from_tensor_slices((
    dict(encodings), labels
)).batch(4)

# ─── 4. Fine-tune ───
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

print("Model ready for Hindi sentiment analysis!")
print(f"Vocab size: {tokenizer.vocab_size}")
print(f"Model parameters: {model.count_params():,}")

# ─── 5. Inference Function ───
def predict_sentiment(text):
    """Predict sentiment of Hindi text."""
    inputs = tokenizer(text, return_tensors="tf", 
                       max_length=128, truncation=True, padding="max_length")
    outputs = model(inputs)
    probs = tf.nn.softmax(outputs.logits, axis=-1)
    label_map = {0: "नकारात्मक (Negative)", 
                 1: "तटस्थ (Neutral)", 
                 2: "सकारात्मक (Positive)"}
    pred = tf.argmax(probs, axis=-1).numpy()[0]
    return label_map[pred], probs.numpy()[0]

# Test (before training — predictions will be random)
test = "यह फिल्म बहुत शानदार है"
label, probs = predict_sentiment(test)
print(f"\nInput: {test}")
print(f"Prediction: {label}")
print(f"Probabilities: {probs}")

🛠️ Mini Project 2: Mini Language Model (Character-Level GPT)

Objective: Build and train a small character-level language model using the Transformer decoder architecture.

Dataset: Any text file (Shakespeare, Hindi stories, etc.)

mini_language_model.py Python

"""
Mini Project 2: Character-Level Language Model using Transformer Decoder
Inspired by Andrej Karpathy's nanoGPT
"""
import tensorflow as tf
import numpy as np

# ─── 1. Data Preparation ───
text = """
India is a vast country with diverse cultures, languages, and traditions.
The Indian constitution recognizes 22 official languages.
Artificial intelligence is transforming India's technology landscape.
From Bengaluru to Mumbai, startups are building innovative AI solutions.
The future of AI in India is bright, with millions of developers.
"""

# Character-level tokenization
chars = sorted(list(set(text)))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

def encode(s): return [char_to_idx[c] for c in s]
def decode(l): return ''.join([idx_to_char[i] for i in l])

data = np.array(encode(text))
print(f"Vocabulary size: {vocab_size}")
print(f"Text length: {len(data)} characters")
print(f"Sample encoding: '{text[:10]}' → {encode(text[:10])}")

# ─── 2. Create Training Sequences ───
block_size = 32  # Context window
batch_size = 8

def create_dataset(data, block_size):
    X, Y = [], []
    for i in range(len(data) - block_size):
        X.append(data[i:i+block_size])
        Y.append(data[i+1:i+block_size+1])
    return np.array(X), np.array(Y)

X, Y = create_dataset(data, block_size)
dataset = tf.data.Dataset.from_tensor_slices((X, Y))
dataset = dataset.shuffle(1000).batch(batch_size).prefetch(tf.data.AUTOTUNE)

# ─── 3. Build Mini Transformer LM ───
class CharTransformerLM(tf.keras.Model):
    def __init__(self, vocab_size, d_model=64, num_heads=4,
                 num_layers=3, d_ff=128, max_len=128):
        super().__init__()
        self.token_emb = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_emb = tf.keras.layers.Embedding(max_len, d_model)
        
        self.blocks = []
        for _ in range(num_layers):
            self.blocks.append({
                'attn': tf.keras.layers.MultiHeadAttention(
                    num_heads=num_heads, key_dim=d_model//num_heads
                ),
                'ln1': tf.keras.layers.LayerNormalization(),
                'ffn': tf.keras.Sequential([
                    tf.keras.layers.Dense(d_ff, activation='gelu'),
                    tf.keras.layers.Dense(d_model)
                ]),
                'ln2': tf.keras.layers.LayerNormalization(),
            })
        
        self.ln_f = tf.keras.layers.LayerNormalization()
        self.head = tf.keras.layers.Dense(vocab_size)
    
    def call(self, x, training=False):
        B, T = tf.shape(x)[0], tf.shape(x)[1]
        
        tok = self.token_emb(x)
        pos = self.pos_emb(tf.range(T))
        h = tok + pos
        
        # Causal mask
        mask = tf.linalg.band_part(tf.ones((T, T)), -1, 0)
        
        for block in self.blocks:
            h_n = block['ln1'](h)
            attn = block['attn'](h_n, h_n, h_n,
                                  attention_mask=mask, training=training)
            h = h + attn
            h_n = block['ln2'](h)
            h = h + block['ffn'](h_n)
        
        h = self.ln_f(h)
        return self.head(h)
    
    def generate(self, start_text, max_tokens=100, temperature=0.8):
        tokens = encode(start_text)
        for _ in range(max_tokens):
            x = tf.constant([tokens[-block_size:]])
            logits = self(x, training=False)
            next_logits = logits[0, -1, :] / temperature
            next_token = tf.random.categorical(
                next_logits[tf.newaxis, :], 1
            )[0, 0].numpy()
            tokens.append(int(next_token))  # keep `tokens` a list of Python ints
        return decode(tokens)

# ─── 4. Train ───
model = CharTransformerLM(vocab_size)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(1e-3)

model.compile(optimizer=optimizer, loss=loss_fn)
print(f"\nModel parameters: {model.count_params():,}")

# Train for a few epochs
# model.fit(dataset, epochs=50, verbose=1)

# Generate text
# generated = model.generate("India is", max_tokens=50)
# print(f"Generated: {generated}")
print("\nModel built successfully! Ready for training.")
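
Before training, a useful sanity check is the loss you should expect from a model that knows nothing. If every token is given probability 1/V, the cross-entropy is ln(V) nats, and an untrained model should start roughly there. The sketch below computes that baseline. `uniform_baseline_loss` is a hypothetical helper, and the vocabulary size of 30 is only a stand-in for whatever character vocabulary size the listing above prints.

```python
import math

def uniform_baseline_loss(vocab_size):
    """Cross-entropy (in nats) of a model assigning probability 1/V to every token."""
    return math.log(vocab_size)

# 30: hypothetical character-level vocabulary; 5000: the MiniGPT demo vocabulary
baselines = {v: uniform_baseline_loss(v) for v in (30, 5000)}

for v, loss in baselines.items():
    print(f"vocab_size={v}: baseline loss = {loss:.3f} nats, "
          f"perplexity = {math.exp(loss):.0f}")
```

If the first reported training loss is far above this value, the initialization or the loss setup (for example a mismatch between `from_logits` and the output layer) is worth checking before training further.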

📝 End-of-Chapter Exercises (20+)

Conceptual Questions

1. Explain why the Transformer replaced RNNs. List at least 4 advantages of self-attention over recurrence.

2. What are Query, Key, and Value in the attention mechanism? Explain the database lookup analogy in detail.

3. Why does self-attention need positional encoding? What would happen without it?

4. Explain the difference between self-attention in the encoder vs. the decoder. Why does the decoder need masking?

5. What is the purpose of residual connections in the Transformer? How do they help with training deep networks?

Mathematical Problems

6. Given Q = [[1,0], [0,1]], K = [[1,1], [0,1]], V = [[1,0], [0,1]], compute the full scaled dot-product attention output step by step (d_k=2).

7. Prove that if q_i, k_i ~ N(0,1) independently, then Var(q·k) = d_k. Show all steps.

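8. Compute the sinusoidal positional encoding for positions 0, 1, 2 with d_model = 4. Verify that the encoding at position 2 can be obtained from the encoding at position 0 by rotating each (sin, cos) pair through an angle that depends only on the offset and that pair's frequency.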

9. Calculate the total number of parameters in a Transformer base model (d_model=512, h=8, d_ff=2048, 6 layers, vocab=32000). Show the breakdown for each component.

10. For a sequence of length n=1024, d_model=768, h=12: (a) What is the size of the attention matrix? (b) How many FLOPs for one self-attention computation? (c) What is the memory footprint in FP16?

Programming Exercises

11. Implement a causal (autoregressive) mask and verify that position i cannot attend to positions j > i. Test with a 5×5 attention matrix.

12. Implement the complete positional encoding function and visualize it as a heatmap for max_len=100, d_model=128. What patterns do you observe?

13. Build a Transformer decoder block from scratch (with masked self-attention + cross-attention + FFN). Test it by feeding encoder outputs and partial decoder inputs.

14. Implement Byte-Pair Encoding (BPE) tokenization from scratch. Start with character-level tokens and iteratively merge the most frequent pair. Test on a Hindi paragraph.

15. Fine-tune a pre-trained BERT model on the IMDB sentiment dataset. Plot the training curve and report accuracy. Compare with a simple LSTM baseline.

Analysis & Research

16. Compare Layer Normalization vs. Batch Normalization. Why is LayerNorm preferred in Transformers? When would BatchNorm be better?

17. Explain the difference between BERT's Masked Language Model (MLM) and GPT's Causal Language Model (CLM). What tasks is each better suited for?

18. What are "emergent abilities" in LLMs? Give 3 examples and explain why they appear only at scale.

19. Explain RLHF (Reinforcement Learning from Human Feedback) step by step. Why is it necessary? What happens without it?

20. Compare Vision Transformer (ViT) with traditional CNNs. At what dataset sizes does ViT outperform ResNets? Why?

21. Analyze the O(n²) bottleneck. For context lengths of 512, 4096, and 32768 tokens, compute the attention matrix size and compare memory requirements.

22. Explain Flash Attention. How does it achieve O(n²) computation with O(n) memory? What is the key insight about GPU memory hierarchy?

23. Design a Transformer-based system for Indian language code-switching detection (Hinglish, Tanglish, etc.). What tokenizer, model, and training strategy would you use?

✅ Multiple Choice Questions (12)

1. In the scaled dot-product attention formula Attention(Q,K,V) = softmax(QK^T/√d_k)V, why do we divide by √d_k?

A) To reduce computation time
B) To make the output dimensionality match the input
C) To prevent softmax saturation by normalizing the variance of dot products
D) To implement dropout regularization

✅ C — Without scaling, the dot products have variance d_k, growing large for high-dimensional keys. This pushes softmax into regions with tiny gradients (saturation). Scaling by √d_k normalizes variance to 1.
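
The variance claim in this answer can be checked numerically. The following is a minimal sketch, assuming query and key components are independent draws from N(0, 1); the function name `dot_product_variance` and the trial count are illustrative choices, not part of the chapter's code.

```python
import random
import statistics

def dot_product_variance(d_k, trials=5000, seed=0):
    """Empirical variance of q.k when q and k have i.i.d. N(0, 1) components."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        dot = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        samples.append(dot)
    return statistics.pvariance(samples)

for d_k in (4, 16, 64):
    raw = dot_product_variance(d_k)
    scaled = raw / d_k  # Var(x / sqrt(d_k)) = Var(x) / d_k
    print(f"d_k={d_k:3d}  Var(q.k)={raw:7.2f}  Var(q.k / sqrt(d_k))={scaled:5.2f}")
```

The raw variance should track d_k, while the scaled variance should stay close to 1 regardless of d_k.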

2. Which type of Transformer architecture is BERT based on?

A) Decoder-only
B) Encoder-only
C) Full Encoder-Decoder
D) Autoregressive Encoder

✅ B — BERT uses only the encoder stack with bidirectional self-attention. It uses Masked Language Model (MLM) and Next Sentence Prediction (NSP) for pre-training.

3. In Multi-Head Attention with d_model=512 and h=8 heads, what is d_k per head?

A) 512
B) 256
C) 128
D) 64

✅ D — d_k = d_model / h = 512 / 8 = 64. Each head operates on a 64-dimensional subspace, ensuring the total computation is comparable to single-head attention with full dimensionality.

4. The time complexity of self-attention with respect to sequence length n is:

A) O(n)
B) O(n log n)
C) O(n²)
D) O(n³)

✅ C — The attention matrix QK^T has dimensions n×n, requiring O(n²·d) operations. This quadratic scaling is the main bottleneck for long sequences.

5. What pre-training objective does GPT use?

A) Masked Language Modeling (MLM)
B) Causal Language Modeling (CLM) — predict the next token
C) Next Sentence Prediction (NSP)
D) Contrastive Learning

✅ B — GPT uses autoregressive/causal language modeling: given the previous tokens, predict the next token. This naturally enables text generation, unlike BERT's bidirectional MLM.

6. In Vision Transformer (ViT), a 224×224 image with 16×16 patches produces how many patch tokens?

A) 16
B) 64
C) 196
D) 784

✅ C — (224/16) × (224/16) = 14 × 14 = 196 patches. Including the [CLS] token, the total sequence length is 197.

7. What does RLHF stand for in the context of LLM alignment?

A) Recursive Learning for High Fidelity
B) Regularized Learning with Human Filters
C) Reinforcement Learning from Human Feedback
D) Residual Learning for Hierarchical Features

✅ C — RLHF trains a reward model on human preference data, then uses PPO (Proximal Policy Optimization) to fine-tune the LLM to maximize the reward. This is how ChatGPT was aligned.

8. Which innovation does Flash Attention primarily exploit?

A) Sparse attention patterns
B) Lower precision arithmetic (INT4)
C) GPU memory hierarchy (tiling for SRAM vs HBM)
D) Knowledge distillation

✅ C — Flash Attention is IO-aware: it tiles the attention computation to fit in GPU SRAM (fast, small) rather than repeatedly reading/writing to HBM (slow, large). This achieves 2-4× speedup with O(n) memory.
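
The tiling idea rests on a softmax that can be computed block by block. The sketch below is a simplified illustration for a single query row in pure Python, not the actual FlashAttention kernel: the function names, the block size, and the toy vectors are all assumptions made for the example. It shows that keeping only a running maximum, a running denominator, and a running weighted sum reproduces the result of materialising every score.

```python
import math

def naive_attention_row(q, keys, values):
    """Reference: compute all scores for one query, then softmax and mix values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def tiled_attention_row(q, keys, values, block_size=2):
    """Same output, visiting keys/values one block at a time.

    Only a running max, a running denominator and a running weighted sum
    are kept, so the full row of scores is never stored at once.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data (hypothetical): one query, five keys/values of dimension 2.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -0.3], [0.5, 0.5], [-0.7, 0.9], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.1, 0.2]]

print(naive_attention_row(q, keys, values))
print(tiled_attention_row(q, keys, values, block_size=2))
```

The two printed rows should agree up to floating-point rounding for any block size, which is why the tiled version can be exact rather than approximate.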

9. In a Transformer, the Feed-Forward Network typically expands the dimension by what factor?

A) 2×
B) 4×
C) 8×
D) 16×

✅ B — The standard FFN uses d_ff = 4 × d_model. For d_model=512, d_ff=2048. This expansion-contraction allows the model to learn complex non-linear mappings.

10. Which of the following is NOT a property of sinusoidal positional encoding?

A) Each position has a unique encoding
B) Relative position can be expressed as a linear function
C) It can generalize to unseen sequence lengths
D) It requires learned parameters during training

✅ D — Sinusoidal positional encodings are fixed (not learned). They use deterministic sin/cos functions. Learned positional embeddings are an alternative approach used in GPT and BERT.
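
Properties A, B and D can be seen in a few lines of code. This is a minimal sketch assuming the standard formulation with base 10000 and an even d_model; `sinusoidal_encoding` and `shift_encoding` are illustrative names. Note that neither function holds any trainable state, and that the shift is computed from the offset k alone.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos of the same angle."""
    pe = []
    for i in range(0, d_model, 2):  # i runs over the even indices 0, 2, 4, ...
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift_encoding(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) without knowing pos: one rotation per (sin, cos) pair."""
    out = []
    for i in range(0, d_model, 2):
        omega = 1.0 / (10000 ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        out.append(s * math.cos(k * omega) + c * math.sin(k * omega))  # sin(a + b)
        out.append(c * math.cos(k * omega) - s * math.sin(k * omega))  # cos(a + b)
    return out

d_model = 4
print(sinusoidal_encoding(0, d_model))
print(sinusoidal_encoding(5, d_model))
print(shift_encoding(sinusoidal_encoding(3, d_model), 2, d_model))
```

The last two printed vectors should match up to rounding, since shifting position 3 by an offset of 2 lands on position 5.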

11. What is AI4Bharat's IndicBERT primarily designed for?

A) English-only NLP tasks
B) Image classification for Indian datasets
C) Multilingual NLP across 11+ Indian languages
D) Speech recognition for Hindi only

✅ C — IndicBERT is trained on the IndicCorp dataset covering 11 major Indian languages (plus English), and reported strong results at release on Indian language NLP tasks like NER, sentiment analysis, and question answering.

12. GPT-3 demonstrated which surprising capability that was absent in GPT-2?

A) Text generation
B) In-context learning (few-shot prompting without fine-tuning)
C) Masked language modeling
D) Image understanding

✅ B — GPT-3's key breakthrough was in-context learning: the ability to perform tasks just by providing a few examples in the prompt, without any gradient updates. Few-shot performance improved steadily with model size and was most striking at the 175B parameter scale.

💼 Interview Questions (12)

Q1: Walk me through the Transformer architecture. How does it differ from an LSTM?

Answer: The Transformer uses self-attention instead of recurrence. Key differences: (1) Parallelism: all positions are computed simultaneously, whereas an LSTM processes them sequentially; (2) Constant path length: O(1) between any two positions vs. O(n) in an LSTM; (3) Attention mechanism: a soft lookup using Q, K, V vs. a gated memory cell; (4) Architecture: encoder-decoder with multi-head attention, FFN, LayerNorm, and residual connections. Because self-attention has no built-in notion of order (permuting the input tokens simply permutes the outputs), the Transformer adds positional encodings, while LSTMs capture order inherently through sequential processing.

Q2: Why do we scale by √d_k in attention? What happens without it?

Answer: The dot product q·k has variance d_k when the components of q and k are independent with zero mean and unit variance. For d_k=64, unscaled dot products therefore have standard deviation ~8, pushing softmax into saturated regions where gradients ≈ 0 (vanishing gradients). Dividing by √d_k normalizes the variance to 1, keeping softmax in a "healthy" gradient region. The original paper motivates the scaling with an empirical observation it cites from earlier work: additive attention (which doesn't have this scaling issue) and dot-product attention perform comparably for small d_k, but unscaled dot-product attention falls behind for larger d_k.
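
The saturation effect described in this answer can be illustrated with a handful of numbers. The sketch below uses hypothetical raw scores whose spread is on the order of √64 = 8; the scores themselves are made up for the example. It compares the softmax output, and the gradient term p(1 - p) for the largest logit, before and after scaling.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [8.0, 0.0, -8.0, 16.0, -4.0]  # hypothetical unscaled q.k values
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

p_raw = softmax(raw_scores)
p_scaled = softmax(scaled_scores)

# d softmax_i / d logit_i = p_i * (1 - p_i); it vanishes as p_i approaches 1.
top_raw, top_scaled = max(p_raw), max(p_scaled)
print(f"unscaled: top prob={top_raw:.5f}  gradient term={top_raw * (1 - top_raw):.5f}")
print(f"scaled:   top prob={top_scaled:.5f}  gradient term={top_scaled * (1 - top_scaled):.5f}")
```

Without scaling, almost all probability mass lands on one position and its gradient term is close to zero; with scaling, the distribution stays spread out and the gradient term stays usable.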

Q3: Explain Multi-Head Attention. Why not just use one big attention head?

Answer: Multiple heads allow the model to attend to different representation subspaces simultaneously. One head might learn syntactic relationships ("subject-verb"), another might learn semantic similarity, another positional proximity. With a single head, these different types of relationships would need to be averaged together. Empirically, multi-head attention outperforms single-head even when total computation is matched (h heads of dimension d_k=d_model/h vs. one head of d_k=d_model).

Q4: Explain the difference between BERT and GPT. When would you use each?

Answer: BERT: encoder-only, bidirectional, pre-trained with MLM (mask 15% tokens, predict them) + NSP. Best for understanding tasks: classification, NER, QA, semantic search. GPT: decoder-only, unidirectional (left-to-right), pre-trained with CLM (predict next token). Best for generation tasks: text generation, dialogue, code completion. Key insight: BERT sees full context but can't generate; GPT generates autoregressively but only sees past context during pre-training.
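
The two pre-training objectives differ mainly in how training examples are built from the same text. The following is a simplified sketch: `clm_example` and `mlm_example` are illustrative helpers, and the MLM version always substitutes `[MASK]`, whereas BERT's actual recipe also sometimes keeps the token or swaps in a random one.

```python
import random

def clm_example(tokens):
    """Causal LM: the target at each position is simply the next token."""
    return tokens[:-1], tokens[1:]

def mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Masked LM: hide a random subset of tokens; only those positions are predicted."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)  # position ignored by the loss
    return inputs, targets

sentence = "the train to delhi leaves from platform four at nine".split()

clm_in, clm_out = clm_example(sentence)
print("CLM input :", clm_in)
print("CLM target:", clm_out)

mlm_in, mlm_out = mlm_example(sentence)
print("MLM input :", mlm_in)
print("MLM target:", mlm_out)
```

In the CLM case every position contributes to the loss but sees only the left context; in the MLM case only masked positions contribute, but each one sees context on both sides.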

Q5: What is the computational bottleneck of Transformers, and how do Flash Attention / Sparse Attention address it?

Answer: Self-attention is O(n²) in sequence length because the attention matrix QK^T is n×n. For n=32K (modern context windows), this is ~1 billion entries per head per layer. Flash Attention addresses the memory bottleneck by tiling the computation to fit in SRAM, avoiding materializing the full n×n matrix in HBM. Sparse Attention (like BigBird, Longformer) addresses the compute bottleneck by letting each token attend to only a small subset of positions (a local window plus a few global and random tokens), which reduces the total cost from O(n²) to roughly O(n) for a fixed window size; other sparse patterns attend to O(√n) positions per token, for O(n√n) overall.
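
The "~1 billion entries" figure follows directly from n². The short sketch below works it out for n = 32768 and compares it with a purely local window; the helper names and the window size of 512 are assumptions for illustration, and the windowed count ignores any global or random tokens.

```python
def attention_entries(n):
    """Entries in one dense n x n attention matrix (per head, per layer)."""
    return n * n

def windowed_entries(n, window):
    """Entries when each token attends to a fixed local window only."""
    return n * window

def fp16_gib(entries):
    """Memory in GiB at 2 bytes per entry."""
    return entries * 2 / 2**30

n = 32768
dense = attention_entries(n)
local = windowed_entries(n, window=512)
print(f"dense:    {dense:,} entries, {fp16_gib(dense):.2f} GiB in FP16")
print(f"windowed: {local:,} entries, {fp16_gib(local):.4f} GiB in FP16")
print(f"reduction: {dense // local}x")
```

Under these assumptions the dense matrix needs 2 GiB per head per layer in FP16, which is why it is never worth writing out in full at long context lengths.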

Q6: What is RLHF and why is it critical for LLMs like ChatGPT?

Answer: RLHF has 3 stages: (1) SFT: Fine-tune the pre-trained model on human-written demonstrations of desired behavior. (2) Reward Model: Collect human rankings of model outputs ("output A is better than B"), train a reward model to predict human preferences. (3) PPO: Use the reward model as a signal to further train the LLM via reinforcement learning (PPO algorithm). Without RLHF, models generate plausible but often unhelpful, harmful, or hallucinated text. RLHF aligns the model with human intent.
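
Stage (2) is usually trained with a pairwise ranking loss on the reward model's scores. The sketch below shows that loss in isolation, assuming scalar rewards are already available for a preferred and a rejected output; `preference_loss` is an illustrative name and the reward values are made up.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected))."""
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))

# Hypothetical reward-model scores for "output A is better than B".
print(preference_loss(2.0, 0.0))  # reward model agrees with the human ranking
print(preference_loss(0.0, 0.0))  # reward model is indifferent
print(preference_loss(0.0, 2.0))  # reward model disagrees with the human ranking
```

The loss shrinks as the reward model scores the human-preferred output higher than the rejected one, and grows when the ordering is reversed.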

Q7: How does Vision Transformer (ViT) process images?

Answer: ViT treats an image as a sequence of patches: (1) Split image into fixed-size patches (e.g., 16×16). (2) Flatten each patch and project to d_model dimensions (linear embedding). (3) Prepend a learnable [CLS] token. (4) Add learnable position embeddings. (5) Process through a standard Transformer encoder. (6) Use [CLS] output for classification. Key insight: ViT needs large datasets (ImageNet-21K, JFT-300M) to match CNNs; on smaller datasets, CNNs' inductive biases (locality, translation equivariance) give them an advantage.
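
Steps (1) and (2) amount to cutting the image into a grid and flattening each cell. This is a minimal sketch on a single-channel 4×4 toy image with 2×2 patches; `patchify` is an illustrative helper, and a real ViT would flatten 16×16×3 values per patch and then apply a learned linear projection, which is omitted here.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch tokens, row-major."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

# Toy image: pixel value = row * 4 + col.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
for t in tokens:
    print(t)

# Sequence length for the standard configuration, including the [CLS] token.
num_patches = (224 // 16) * (224 // 16)
print(num_patches, num_patches + 1)
```

Each printed row is one patch token; the final line gives the patch count and the full sequence length for a 224×224 image with 16×16 patches.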

Q8: What are "emergent abilities" in LLMs? Give examples.

Answer: Emergent abilities are capabilities that appear in large models but are absent in smaller ones — they emerge unpredictably at certain scale thresholds. Examples: (1) Chain-of-thought reasoning — models with >100B params can solve multi-step math when prompted "let's think step by step." (2) Few-shot learning — GPT-3 (175B) can learn tasks from 3-5 examples in the prompt. (3) Code generation — Codex/GPT-4 generate working code from natural language. These abilities are not explicitly trained; they emerge from scale in data, parameters, and compute.

Q9: Explain tokenization in LLMs. Why do BPE/SentencePiece matter?

Answer: Tokenization converts text into subword tokens. BPE (Byte-Pair Encoding) starts with characters, iteratively merges the most frequent pair. SentencePiece is language-agnostic (works on raw text, no pre-tokenization). Why it matters: (1) vocabulary size affects model size and efficiency; (2) subwords handle rare/new words gracefully ("unhappiness" → "un" + "happi" + "ness"); (3) For Indian languages, poor tokenization means 3-4× more tokens per sentence, wasting context window and increasing compute. This is why Krutrim and IndicBERT use custom tokenizers trained on Indian language data.
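
The merge loop at the heart of BPE is short enough to show directly. The sketch below runs two merge steps on a tiny made-up word-frequency table; the helper names and the corpus are assumptions for illustration, and a production tokenizer would add details such as end-of-word markers and byte-level fallback.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: each word starts as a sequence of characters.
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
          tuple("bun"): 4, tuple("hugs"): 5}

for step in range(2):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair} -> {list(corpus)}")
```

Each merge adds one entry to the vocabulary, so the number of merges is what sets the final vocabulary size.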

Q10: How would you fine-tune a Transformer model for a low-resource Indian language?

Answer: Strategy: (1) Start with a multilingual model (mBERT, XLM-R, or IndicBERT); (2) Use transfer learning — the model's cross-lingual representations transfer knowledge from high-resource to low-resource languages; (3) Data augmentation: back-translation, code-switching augmentation; (4) Parameter-efficient fine-tuning: LoRA or adapters (only fine-tune ~1% of parameters); (5) Few-shot prompting with LLMs as an alternative to fine-tuning; (6) Active learning to maximize the value of limited labeled data. Key: use IndicBERT/IndicTrans2 over mBERT for Indian languages — they have better tokenization and more Indian language pre-training data.
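
The parameter saving in point (4) comes from replacing a full weight update with two thin matrices. The sketch below counts parameters for a single square projection, assuming d = 768 and rank 4; both numbers and the helper name are illustrative, and the overall fraction for a whole model depends on which matrices receive adapters.

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare a full d_out x d_in update with a rank-r update B @ A."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora, lora / full

full, lora, fraction = lora_param_counts(d_in=768, d_out=768, rank=4)
print(f"full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({fraction:.2%} of full)")
```

Because the adapter size grows linearly in the rank, the rank is the main knob for trading capacity against the amount of labeled data available.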

Q11: What is Layer Normalization and why does the Transformer use it instead of Batch Normalization?

Answer: LayerNorm normalizes across the feature dimension (for each token independently), while BatchNorm normalizes across the batch dimension. LayerNorm is preferred because: (1) It's independent of batch size — works with batch size 1 during inference; (2) Variable-length sequences make batch statistics unreliable; (3) In NLP, features at the same position across batches don't have consistent semantics (unlike pixels in images); (4) More stable training dynamics for Transformers.
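
The difference between the two is only the axis over which statistics are taken. The sketch below applies both to a tiny batch of two token vectors; it is a simplified illustration without the learnable scale and shift, and `batch_norm` here uses the current batch's statistics as in training mode.

```python
import statistics

def normalize(values, eps=1e-5):
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

def layer_norm(batch):
    """Normalise each token's feature vector on its own (one row at a time)."""
    return [normalize(row) for row in batch]

def batch_norm(batch):
    """Normalise each feature across the batch (one column at a time)."""
    cols = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*cols)]

batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]

print("LayerNorm, batch of 2:", layer_norm(batch)[0])
print("LayerNorm, batch of 1:", layer_norm([batch[0]])[0])
print("BatchNorm, batch of 2:", batch_norm(batch)[0])
print("BatchNorm, batch of 1:", batch_norm([batch[0]])[0])
```

The LayerNorm output for the first token is identical whether or not the second token is present, while the BatchNorm output changes with the batch and collapses to zeros when the batch has a single example.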

Q12: Design a system to build a chatbot for Indian Railways in Hindi using Transformers.

Answer: Architecture: (1) Retrieval: Use IndicBERT bi-encoder to embed FAQ/knowledge base; retrieve relevant documents for a query using cosine similarity. (2) Generation: Fine-tune a Hindi-capable LLM (Krutrim or Llama-2-Hindi) on railway domain data (FAQs, PNR status responses, complaint templates). (3) RAG Pipeline: Combine retrieval + generation — the LLM generates responses grounded in retrieved railway documents. (4) Guardrails: Filter harmful outputs, enforce railway-specific terminology. (5) Evaluation: BLEU/ROUGE for generation quality, human evaluation for helpfulness. Deploy on IRCTC with voice input (Whisper for Hindi ASR) + text.
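
The retrieval step in (1) reduces to ranking documents by cosine similarity to the query embedding. The sketch below uses made-up 3-dimensional vectors in place of real bi-encoder outputs; the document ids, the vectors, and the `retrieve` helper are all hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, doc_vecs, top_k=2):
    """Return the ids of the top_k documents most similar to the query."""
    scored = sorted(((cosine(query_vec, vec), doc_id)
                     for doc_id, vec in doc_vecs.items()), reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

# Hypothetical embeddings standing in for encoder outputs.
doc_vecs = {
    "pnr_status": [0.9, 0.1, 0.0],
    "refund_policy": [0.1, 0.9, 0.2],
    "tatkal_booking": [0.2, 0.3, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # e.g. an embedded question about PNR status

print(retrieve(query_vec, doc_vecs))
```

In a full pipeline the retrieved document texts would be placed in the LLM's prompt so that the generated answer is grounded in them.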

🔬 Research Problems

Research Problem 1: Efficient Attention for Indian Language Documents

Problem: Indian language text produces 3-4× more tokens than English (due to suboptimal tokenization). This makes the O(n²) attention cost even more prohibitive. Design and evaluate an attention mechanism that combines (1) a custom Indian-language-optimized tokenizer, (2) sparse attention patterns (local + global), and (3) Flash Attention tiling for efficient processing of long Hindi/Tamil documents.

Evaluation: Compare against mBERT on IndicGLUE benchmarks while measuring wall-clock time, memory usage, and maximum processable sequence length.

Research Problem 2: Cross-Lingual Transfer Without Parallel Data

Problem: Can we train a Transformer that transfers NLP capabilities from English (high-resource) to languages like Gondi or Bodo (extremely low-resource, <10K sentences) without any parallel data? Investigate unsupervised cross-lingual representation learning using shared subword vocabularies, transliteration bridges (Devanagari → Latin), and self-supervised alignment objectives.

Research Problem 3: Mixture-of-Experts for Multilingual Efficiency

Problem: India has 22 scheduled languages (listed in the Eighth Schedule of the Constitution) with very different structures (Indo-Aryan vs. Dravidian). A single dense Transformer wastes capacity by activating all parameters for every language. Design a Mixture-of-Experts (MoE) Transformer where different experts specialize in different language families. Investigate routing strategies, load balancing, and whether language-family-aware routing outperforms learned routing on IndicGLUE.

Research Problem 4: Interpretability of Attention in Medical NLP

Problem: Transformers are increasingly used for medical text analysis (radiology reports, clinical notes) in Indian hospitals. However, attention weights do not always correlate with feature importance. Develop methods to explain Transformer predictions on Indian medical records, comparing attention visualization, gradient-based attribution, and SHAP. Validate explanations with domain expert doctors.

🔑 Key Takeaways

Attention = Soft Database Lookup: The core mechanism computes Query-Key similarity to produce weighted combinations of Values. The formula Attention(Q,K,V) = softmax(QK^T/√d_k)V is the foundation of every Transformer-based model.
√d_k Scaling is Critical: Without it, dot product variance grows as d_k, causing softmax saturation and vanishing gradients. This is a first-principles variance normalization.
Multi-Head = Multiple Perspectives: Running h parallel attention heads (each with d_k=d_model/h) captures different types of relationships. Concatenation + projection combines them.
Positional Encoding Injects Order: Self-attention on its own ignores token order (permuting the inputs simply permutes the outputs). Sinusoidal encodings (or learned embeddings) give the model position information. The rotation property enables relative position reasoning.
Encoder understands, Decoder generates: BERT (encoder-only) excels at classification/NER/QA. GPT (decoder-only) excels at text generation. T5 (encoder-decoder) excels at transformation tasks.
Scale Unlocks Emergent Abilities: LLMs show capabilities (chain-of-thought, few-shot learning) that only emerge at scale — 100B+ parameters. This is why the race to scale continues.
RLHF Aligns Capability with Intent: Pre-training gives capability; RLHF aligns it with human preferences. Without alignment, powerful models produce harmful or unhelpful outputs.
O(n²) is the Achilles Heel: Self-attention's quadratic cost limits context length. Flash Attention (IO-aware), Sparse Attention, and Linear Attention are active research areas to address this.
Indian Languages Need Custom Solutions: Standard tokenizers waste 3-4× tokens on Indian scripts. Projects like AI4Bharat IndicBERT, Krutrim, and Bhashini build India-specific solutions with custom tokenizers and training data.
Transformers Have Won (For Now): From NLP to vision (ViT) to speech (Whisper) to protein folding (AlphaFold), Transformers dominate. State-space models (Mamba) are the primary challenger.

📚 References

Foundational Papers

Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS. The paper that started the revolution.
Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL. Encoder-only pre-training for NLP understanding.
Radford, A. et al. (2018). "Improving Language Understanding by Generative Pre-Training" (GPT-1). OpenAI. Decoder-only pre-training for generation.
Radford, A. et al. (2019). "Language Models are Unsupervised Multitask Learners" (GPT-2). OpenAI. Zero-shot task transfer from a larger decoder-only model.
Brown, T. et al. (2020). "Language Models are Few-Shot Learners" (GPT-3). NeurIPS. In-context learning at scale.
Dosovitskiy, A. et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR. Vision Transformer (ViT).

Scaling & Alignment

Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." OpenAI. Power-law relationships between scale and performance.
Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback" (InstructGPT). OpenAI. RLHF methodology.
Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." Meta AI. Open-weight LLMs; Llama 3 followed in 2024.
OpenAI (2023). "GPT-4 Technical Report." Multimodal foundation model.

Indian AI & Efficient Attention

Kakwani, D. et al. (2020). "IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages." Findings of EMNLP. AI4Bharat/IIT Madras. IndicBERT and IndicCorp.
Ramesh, G. et al. (2022). "Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages." Foundation for Indian NMT.
Dao, T. et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS. IO-aware attention.
Beltagy, I. et al. (2020). "Longformer: The Long-Document Transformer." Sparse attention for long documents.

Textbooks & Resources

Jurafsky, D. & Martin, J. (2024). "Speech and Language Processing." 3rd Ed. Ch. 9-10: Transformers.
Tunstall, L. et al. (2022). "Natural Language Processing with Transformers." O'Reilly. Practical guide with HuggingFace.
The Illustrated Transformer — Jay Alammar's blog. Outstanding visual explanations.
Andrej Karpathy, "Let's build GPT from scratch" — YouTube lecture. Best hands-on introduction.
