Transformers & Attention: The AI Revolution
From "Attention Is All You Need" to GPT-4 and beyond: master the architecture that redefined AI, powering language models, vision systems, and the entire LLM revolution.
Learning Objectives
After completing this chapter, you will be able to:
Introduction
In June 2017, a paper titled "Attention Is All You Need" by Vaswani et al. at Google introduced the Transformer: a neural network architecture that replaced recurrence (LSTMs, GRUs) and convolutions entirely with self-attention mechanisms. It was arguably the most consequential machine learning paper of the decade.
Before Transformers, sequence models like RNNs and LSTMs processed tokens one by one, creating a computational bottleneck. The Transformer broke this paradigm by computing relationships between all tokens simultaneously. This enables massive parallelization on GPUs and gives every token a direct path to every other token, so long-range dependencies no longer have to survive many recurrent steps.
Many of the AI systems you interact with (ChatGPT, search ranking, image generation, code completion, translation, voice assistants) are built on the Transformer architecture or use it as a core component. Understanding Transformers is no longer optional; it is arguably the single most important concept in modern AI.
This chapter takes you from the fundamental intuition of attention (think of it as a "database lookup") all the way to understanding GPT-4, BERT fine-tuning, Vision Transformers, and efficient attention. We derive every formula from first principles, implement core components in Python and TensorFlow, and apply them to Indian language processing with AI4Bharat and Krutrim.
Historical Background
The journey to Transformers spans decades of research in sequence modeling, attention mechanisms, and neural architecture design.
Conceptual Explanation
4.1 The Core Intuition: Attention as a Soft Database Lookup
Imagine you have a database of key-value pairs. Given a query, you want to retrieve the most relevant value. In a traditional database, this is a hard lookup: you find the exact matching key. Attention is a soft lookup: you compute a similarity score between your query and every key, then return a weighted average of all values.
The Database Analogy
- Query (Q): "What information do I need?" The current position asking a question
- Key (K): "What information do I contain?" Labels for all available positions
- Value (V): "Here's my actual content." The data each position carries
The attention score between a query and a key tells us "how relevant is this key to my query?" The output is a weighted combination of values, where weights come from query-key similarities.
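To make the hard-versus-soft distinction concrete, here is a minimal NumPy sketch. The function names `hard_lookup` and `soft_lookup` and the toy keys and values are illustrative assumptions, not part of any library; the soft version uses a plain dot product as the similarity score and a softmax to turn scores into weights.

```python
import numpy as np

def hard_lookup(query, keys, values):
    """Exact-match lookup: return the value whose key equals the query."""
    for key, value in zip(keys, values):
        if np.array_equal(key, query):
            return value
    return None  # no exact match, nothing is returned

def soft_lookup(query, keys, values):
    """Soft lookup: softmax-weighted average of all values."""
    scores = keys @ query                    # similarity of the query to every key
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()                 # weights are positive and sum to 1
    return weights @ values, weights

# Toy database (hypothetical data chosen for illustration)
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([0.9, 0.1])  # close to the first key, but not identical

print("hard:", hard_lookup(query, keys, values))
blended, weights = soft_lookup(query, keys, values)
print("soft weights:", np.round(weights, 3))
print("soft result: ", np.round(blended, 2))
```

The hard lookup fails because no key matches exactly, while the soft lookup still returns an answer dominated by the value of the most similar key.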
4.2 Self-Attention: Every Token Talks to Every Token
In self-attention, the queries, keys, and values all come from the same sequence. Each word in a sentence creates its own Q, K, and V vectors by multiplying with learned weight matrices. Then each word uses its query to attend to all other words' keys, retrieving a weighted mix of their values.
Consider: "The cat sat on the mat because it was tired." When processing "it", a trained model typically assigns high attention weight to "cat" (not "mat"), because it has learned that "tired" relates to a living entity. RNNs struggle to make this kind of connection as the distance between the two words grows.
4.3 Why Not Recurrence?
Problems with RNNs/LSTMs
- Sequential processing: token-by-token, no parallelism
- Information bottleneck: everything compressed into hidden state
- Long-range forgetting: even LSTMs tend to lose information over long spans (in practice, beyond a few hundred tokens)
- Training time: O(n) sequential steps, GPU underutilized
Transformer Advantages
- Parallel processing: all tokens processed simultaneously
- Direct connections: any token can attend to any other
- Scalable: massively parallel on modern GPUs/TPUs
- Constant path length: O(1) between any two positions
4.4 Multi-Head Attention: Multiple Perspectives
A single attention function learns one type of relationship. Multi-head attention runs multiple attention functions in parallel, each with different learned projections. Think of it as having 8 different "reading comprehensions": one head might attend to syntactic relationships, another to semantic similarity, another to positional proximity.
4.5 Positional Encoding: Injecting Order
Self-attention on its own is permutation-equivariant: if you shuffle the input tokens, the outputs are shuffled in the same way but are otherwise unchanged. Without extra information, the model therefore cannot distinguish "dog bites man" from "man bites dog". Since word order matters, we add positional encodings to the input embeddings. The original Transformer uses sinusoidal functions with different frequencies, creating a unique "fingerprint" for each position that the model can use to reason about relative positions.
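The order-blindness of plain self-attention can be checked numerically. The sketch below is an illustration under simplifying assumptions: a single head, identity projections (Q = K = V = X), random vectors standing in for the embeddings of "dog", "bites", "man", and a random matrix `P` standing in for positional encodings.

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))   # stand-in embeddings for "dog", "bites", "man"
perm = np.array([2, 1, 0])    # reorder to "man", "bites", "dog"

out = self_attention(X)
out_perm = self_attention(X[perm])

# Without positional information, permuting the input only permutes the output rows
same_up_to_order = np.allclose(out_perm, out[perm])
print("Outputs identical up to reordering:", same_up_to_order)

# Adding a position-dependent vector to each slot breaks this symmetry
P = rng.normal(size=(3, 4))   # stand-in for positional encodings
still_same = np.allclose(self_attention(X[perm] + P), self_attention(X + P)[perm])
print("Still identical after adding positions:", still_same)
```

Each token's output is the same regardless of where it sits in the sequence until a position-dependent signal is added to the embeddings.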
4.6 The Full Architecture: Encoder-Decoder
The complete Transformer has an encoder (6 layers) that reads the input and produces contextual representations, and a decoder (6 layers) that generates output token-by-token. The decoder uses masked self-attention (can only see past tokens) plus cross-attention (attending to encoder outputs).
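The "can only see past tokens" constraint is usually implemented as an additive mask applied to the attention scores before softmax. The following sketch assumes that convention (0 where attention is allowed, negative infinity where it is blocked); the helper names `causal_mask` and `masked_softmax` are illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    """Additive mask: 0 on and below the diagonal, -inf above it."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_softmax(scores, mask):
    """Row-wise softmax after adding the mask to the raw scores."""
    scores = scores + mask
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Uniform raw scores make the effect of the mask easy to read
scores = np.zeros((4, 4))
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 3))
```

Row t spreads its weight only over positions 0..t, so position 0 attends only to itself and the last position attends to all four tokens equally.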
4.7 Layer Normalization & Residual Connections
Each sub-layer (attention, feed-forward) in the Transformer uses a residual connection followed by Layer Normalization: output = LayerNorm(x + SubLayer(x)). The residual connection ensures gradients flow smoothly through deep networks (similar to ResNets), while LayerNorm stabilizes training by normalizing across the feature dimension.
Mathematical Foundation
5.1 Scaled Dot-Product Attention
The core equation of modern AI:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$
$n$ = number of query positions, $m$ = number of key-value positions
$d_k$ = dimension of keys/queries, $d_v$ = dimension of values
Step-by-step breakdown:
- $QK^\top \in \mathbb{R}^{n \times m}$: Compute dot products between all query-key pairs, giving raw attention scores
- $/\sqrt{d_k}$: Scale down to prevent softmax saturation (explained in derivations)
- $\mathrm{softmax}(\cdot)$: Convert scores to probabilities (each row sums to 1)
- $\cdot\, V$: Weighted combination of values using attention weights
5.2 Multi-Head Attention
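$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$
$W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$,
$W_i^K \in \mathbb{R}^{d_{model} \times d_k}$,
$W_i^V \in \mathbb{R}^{d_{model} \times d_v}$,
$W^O \in \mathbb{R}^{h d_v \times d_{model}}$
Typically $h = 8$, $d_k = d_v = d_{model}/h = 64$ (for $d_{model} = 512$)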
5.3 Positional Encoding (Sinusoidal)
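$$PE(pos, 2i) = \sin\!\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE(pos, 2i+1) = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$
$pos$ = position in the sequence (0, 1, 2, ...)
$i$ = dimension index (0, 1, ..., $d_{model}/2 - 1$)
Each pair of dimensions corresponds to a sinusoid; the wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$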
5.4 Layer Normalization
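$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
where $\mu = \frac{1}{d} \sum_i x_i$, $\sigma^2 = \frac{1}{d} \sum_i (x_i - \mu)^2$, and $\epsilon$ is a small constant for numerical stability
Normalization is across the feature dimension (not the batch dimension like BatchNorm).
$\gamma$, $\beta$ are learnable scale and shift parameters.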
5.5 Feed-Forward Network (Per Position)
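$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2$$
$W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$
Typically $d_{ff} = 4 \times d_{model} = 2048$ (for $d_{model} = 512$)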
5.6 Softmax Function
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
5.7 Complexity Analysis
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Self-Attention | O(n² · d) | O(n² + n·d) | Quadratic in sequence length |
| Feed-Forward | O(n · d²) | O(n · d) | Linear in sequence length |
| RNN / LSTM | O(n · d²) | O(d²) | Linear but sequential |
| 1D Convolution | O(k · n · d²) | O(n · d) | k = kernel size |
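To get a feel for when the quadratic term matters, the sketch below compares rough multiply-add counts for self-attention and the feed-forward network. The cost formulas are simplifying assumptions (constants are approximate, projections and softmax are ignored), and the function names are illustrative.

```python
def attention_cost(n, d):
    """Rough multiply-adds for QK^T plus the weighted sum over V: 2 * n^2 * d."""
    return 2 * n * n * d

def feed_forward_cost(n, d, d_ff=None):
    """Rough multiply-adds for the two FFN matrix products: 2 * n * d * d_ff."""
    d_ff = 4 * d if d_ff is None else d_ff
    return 2 * n * d * d_ff

d_model = 512
for n in (128, 512, 2048, 8192):
    a = attention_cost(n, d_model)
    f = feed_forward_cost(n, d_model)
    print(f"n={n:5d}  attention={a:.2e}  ffn={f:.2e}  ratio={a / f:.2f}")
```

Under these assumptions the ratio is n / (4 · d_model), so for d_model = 512 attention only becomes the dominant cost once sequences exceed roughly 2,000 tokens.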
Formula Derivations
6.1 Why $\sqrt{d_k}$ Scaling? The Variance Argument
This is one of the most commonly asked interview questions about Transformers. Let's derive it rigorously from first principles.
Derivation: Variance of Dot Products
Setup: Let $q, k \in \mathbb{R}^{d_k}$ be query and key vectors, where each component $q_i$, $k_i$ is drawn independently from a distribution with mean 0 and variance 1.
Goal: Find $\mathrm{Var}(q \cdot k) = \mathrm{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right)$
Step 1: For a single product term $z_i = q_i k_i$:
- $E[z_i] = E[q_i] \cdot E[k_i] = 0 \cdot 0 = 0$ (by independence)
- $E[z_i^2] = E[q_i^2] \cdot E[k_i^2] = \mathrm{Var}(q_i) \cdot \mathrm{Var}(k_i) = 1 \cdot 1 = 1$ (the means are zero, so $E[q_i^2] = \mathrm{Var}(q_i)$)
- $\mathrm{Var}(z_i) = E[z_i^2] - E[z_i]^2 = 1 - 0 = 1$
Step 2: The dot product is the sum: $q \cdot k = \sum_i z_i$
- Since the $z_i$ are independent: $\mathrm{Var}(q \cdot k) = \sum_i \mathrm{Var}(z_i) = d_k$
Step 3: If we scale by $\sqrt{d_k}$:
- $\mathrm{Var}(q \cdot k / \sqrt{d_k}) = \mathrm{Var}(q \cdot k) / d_k = d_k / d_k = 1$
Conclusion: Without scaling, the dot products have variance $d_k$, so their typical magnitude (standard deviation) is $\sqrt{d_k}$. For $d_k = 64$, scores are on the order of ±8 rather than ±1, which pushes softmax into regions where gradients are extremely small (saturation). Dividing by $\sqrt{d_k}$ normalizes the variance to 1, keeping softmax in a healthy gradient region.
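The variance argument is easy to confirm empirically. This sketch assumes standard-normal components, which is one distribution satisfying the mean-0, variance-1 setup; the sample size and the values of `d_k` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 10_000
results = {}

for d_k in (4, 64, 256):
    q = rng.standard_normal((num_samples, d_k))
    k = rng.standard_normal((num_samples, d_k))
    dots = np.sum(q * k, axis=1)        # one dot product per sample
    scaled = dots / np.sqrt(d_k)        # the scaling used in attention
    results[d_k] = (dots.var(), scaled.var())
    print(f"d_k={d_k:3d}  Var(q.k)={dots.var():8.2f}  "
          f"Var(q.k/sqrt(d_k))={scaled.var():.3f}")
```

The unscaled variance should track `d_k` closely, while the scaled variance should stay near 1 for every `d_k`.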
6.2 Derivation: Sinusoidal Positional Encoding
Why Sinusoids? The Relative Position Property
Key insight: We want $PE(pos+k)$ to be expressible as a linear function of $PE(pos)$, so the model can easily learn to attend to relative positions.
Proof: For any fixed offset $k$, the angle-addition identities give:
$$\sin(\omega(pos + k)) = \sin(\omega\, pos)\cos(\omega k) + \cos(\omega\, pos)\sin(\omega k)$$
$$\cos(\omega(pos + k)) = \cos(\omega\, pos)\cos(\omega k) - \sin(\omega\, pos)\sin(\omega k)$$
In matrix form:
$$\begin{bmatrix} PE(pos+k, 2i) \\ PE(pos+k, 2i+1) \end{bmatrix} = \begin{bmatrix} \cos(\omega k) & \sin(\omega k) \\ -\sin(\omega k) & \cos(\omega k) \end{bmatrix} \begin{bmatrix} PE(pos, 2i) \\ PE(pos, 2i+1) \end{bmatrix}$$
where $\omega = 1/10000^{2i/d_{model}}$
This is a rotation matrix! The positional encoding at position $pos+k$ is a rotation of the encoding at position $pos$. Since the rotation matrix depends only on the offset $k$ (not the absolute position), the model can learn relative position information through linear projections.
Multi-frequency design: Different dimensions $i$ use different frequencies ($\omega$), ranging from high-frequency ($i = 0$, wavelength $= 2\pi$) to low-frequency ($i = d/2 - 1$, wavelength $\approx 10000 \cdot 2\pi$). This is analogous to a Fourier basis: low dimensions capture fine-grained position, high dimensions capture coarse-grained position.
6.3 Derivation: Softmax Gradients
Why Softmax Saturation Matters
The Jacobian of softmax $y = \mathrm{softmax}(z)$ is:
$$\frac{\partial y_i}{\partial z_j} = y_i(\delta_{ij} - y_j)$$
When one logit is much larger than the rest, softmax produces near-one-hot outputs where one $y_i \approx 1$ and the rest $\approx 0$. In this regime:
- $\partial y_i / \partial z_i = y_i(1 - y_i) \approx 1 \cdot 0 = 0$ (for the dominant class)
- Every other entry contains a factor $y_j \approx 0$, so all gradient terms $\approx 0$: vanishing gradients
This is why scaling by $\sqrt{d_k}$ is essential: it keeps logits moderate, maintaining healthy gradients during backpropagation.
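The saturation effect can be seen directly by evaluating the Jacobian formula on small and large logits. This is a minimal sketch; the logit vector and the scale factors are arbitrary illustrative choices, with the factor 8 loosely mimicking unscaled scores at d_k = 64.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = y_i * (delta_ij - y_j), the formula from the derivation."""
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

z = np.array([2.0, 1.0, 0.0, -1.0])
largest = {}
for scale in (1.0, 8.0):
    J = softmax_jacobian(scale * z)
    largest[scale] = np.abs(J).max()
    print(f"scale={scale:3.0f}  softmax={np.round(softmax(scale * z), 4)}  "
          f"max |dy/dz|={largest[scale]:.2e}")
```

With moderate logits the largest Jacobian entry is of order 0.1; multiplying the same logits by 8 makes the output nearly one-hot and shrinks every entry by several orders of magnitude.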
6.4 Derivation: Parameter Count
How Many Parameters in a Transformer Layer?
Multi-Head Attention:
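Params_MHA = 4 × dmodel² (for WQ, WK, WV, WO; biases ignored)
= 4 × 512² = 1,048,576 ≈ 1M
Feed-Forward Network:
Params_FFN = 2 × dmodel × dff = 2 × 512 × 2048 = 2,097,152 ≈ 2M (biases ignored)
Layer Norms: 2 × 2 × dmodel = 2,048 per encoder layer (a decoder layer has three sub-layers, so 3 × 2 × dmodel = 3,072)
Total per encoder layer: ≈ 3.15M. Each decoder layer also contains a cross-attention block (another ≈ 1.05M), giving ≈ 4.20M per decoder layer.
For 6 encoder + 6 decoder layers: 6 × 3.15M + 6 × 4.20M ≈ 44.1M
Add embeddings: vocab_size × dmodel = 37,000 × 512 ≈ 19M (the original paper shares one matrix between the input embedding, output embedding, and pre-softmax projection)
Total Transformer (base): ≈ 63M parameters, in line with the ≈ 65M reported for the base model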
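The arithmetic can be scripted so that other configurations are easy to explore. The sketch below makes several assumptions that should be read as simplifications: biases are ignored, every decoder layer is counted with a cross-attention block and a third LayerNorm in addition to self-attention, and a single embedding matrix is counted once. All function names are illustrative.

```python
def attention_params(d_model):
    """W_Q, W_K, W_V, W_O, each d_model x d_model (biases ignored)."""
    return 4 * d_model * d_model

def ffn_params(d_model, d_ff):
    """W_1 (d_model x d_ff) and W_2 (d_ff x d_model), biases ignored."""
    return 2 * d_model * d_ff

def layer_norm_params(d_model):
    """One gamma and one beta vector."""
    return 2 * d_model

def transformer_params(d_model=512, d_ff=2048, n_enc=6, n_dec=6, vocab=37000):
    enc_layer = (attention_params(d_model) + ffn_params(d_model, d_ff)
                 + 2 * layer_norm_params(d_model))
    # Assumption: a decoder layer has self-attention AND cross-attention,
    # hence two attention blocks and three LayerNorms
    dec_layer = (2 * attention_params(d_model) + ffn_params(d_model, d_ff)
                 + 3 * layer_norm_params(d_model))
    embeddings = vocab * d_model  # assumption: one shared embedding matrix
    return {
        "encoder_layer": enc_layer,
        "decoder_layer": dec_layer,
        "all_layers": n_enc * enc_layer + n_dec * dec_layer,
        "embeddings": embeddings,
        "total": n_enc * enc_layer + n_dec * dec_layer + embeddings,
    }

counts = transformer_params()
for name, value in counts.items():
    print(f"{name:>14}: {value:,}")
```

With the default arguments this gives a total of about 63M parameters; the remaining gap to the commonly quoted 65M comes from the terms the sketch leaves out, such as biases.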
Worked Numerical Examples
Let's compute self-attention for a 4-token sentence fragment ("The cat sat down"), using dk = dv = 3 for simplicity.
Suppose after embedding + positional encoding, our 4 tokens have representations X ∈ ℝ^(4×3):
Token 1 ("The"): [1.0, 0.0, 1.0]
Token 2 ("cat"): [0.0, 1.0, 0.0]
Token 3 ("sat"): [1.0, 1.0, 0.0]
Token 4 ("down"): [0.0, 0.0, 1.0]
With WQ = WK = WV = I (the 3×3 identity), we get Q = K = V = X:
Q = K = V = [ 1 0 1 ]
            [ 0 1 0 ]
            [ 1 1 0 ]
            [ 0 0 1 ]
QKᵀ = [ 1·1+0·0+1·1   1·0+0·1+1·0   1·1+0·1+1·0   1·0+0·0+1·1 ]
      [ 0·1+1·0+0·1   0·0+1·1+0·0   0·1+1·1+0·0   0·0+1·0+0·1 ]
      [ 1·1+1·0+0·1   1·0+1·1+0·0   1·1+1·1+0·0   1·0+1·0+0·1 ]
      [ 0·1+0·0+1·1   0·0+0·1+1·0   0·1+0·1+1·0   0·0+0·0+1·1 ]
    = [ 2 0 1 1 ]
      [ 0 1 1 0 ]
      [ 1 1 2 0 ]
      [ 1 0 0 1 ]
QKᵀ/√3 = [ 1.155 0.000 0.577 0.577 ]
         [ 0.000 0.577 0.577 0.000 ]
         [ 0.577 0.577 1.155 0.000 ]
         [ 0.577 0.000 0.000 0.577 ]
For row 1: softmax([1.155, 0.000, 0.577, 0.577])
exp values: [3.173, 1.000, 1.781, 1.781] → sum = 7.736
softmax: [0.410, 0.129, 0.230, 0.230]
Full attention weights A (each row sums to 1):
[ 0.410 0.129 0.230 0.230 ] ← "The" attends mostly to itself
[ 0.180 0.320 0.320 0.180 ] ← "cat" attends equally to "cat" & "sat"
[ 0.230 0.230 0.410 0.129 ] ← "sat" attends mostly to itself
[ 0.320 0.180 0.180 0.320 ] ← "down" attends to "The" & itself
Output[0] = 0.410·[1,0,1] + 0.129·[0,1,0] + 0.230·[1,1,0] + 0.230·[0,0,1]
= [0.410,0,0.410] + [0,0.129,0] + [0.230,0.230,0] + [0,0,0.230]
≈ [0.640, 0.360, 0.640] (computed with unrounded weights)
Full output matrix:
[ 0.640 0.360 0.640 ] ← "The" enriched with context
[ 0.500 0.640 0.360 ] ← "cat" enriched with context
[ 0.640 0.640 0.360 ] ← "sat" enriched with context
[ 0.500 0.360 0.640 ] ← "down" enriched with context
Key Observation: Each output vector is no longer just the token's own embedding; it is a weighted mixture of all tokens' value vectors. "The" (output [0.640, 0.360, 0.640]) has absorbed information from all four tokens, with the strongest influence from itself (weight 0.410).
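The hand computation can be reproduced in a few lines of NumPy, which is a useful habit for catching arithmetic slips in worked examples. This sketch assumes the same setup as the example: four tokens, identity projections, and dk = 3.

```python
import numpy as np

# Token representations for "The", "cat", "sat", "down"
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
Q = K = V = X  # identity projection matrices

scores = Q @ K.T / np.sqrt(Q.shape[-1])                   # scaled dot products
e = np.exp(scores - scores.max(axis=-1, keepdims=True))   # stable softmax
A = e / e.sum(axis=-1, keepdims=True)                     # attention weights
output = A @ V                                            # weighted sum of values

print("Scaled scores:\n", np.round(scores, 3))
print("Attention weights:\n", np.round(A, 3))
print("Row sums:", np.round(A.sum(axis=-1), 6))
print("Output:\n", np.round(output, 3))
```

Checking that every row of the attention matrix sums to 1 is a quick sanity test for any hand-computed softmax.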
Using dmodel=4, h=2, so dk=dv=dmodel/h=2. Input X ∈ ℝ^(2×4) (2 tokens).
X = [ 1 0 1 0 ] (Token 1)
    [ 0 1 0 1 ] (Token 2)
Head 1 projections (each 4×2, rows listed top to bottom):
W₁Q = [1 0], [0 1], [0 0], [0 0]
W₁K = [0 1], [1 0], [0 0], [0 0]
W₁V = [1 0], [0 1], [0 0], [0 0]
Head 2 projections (each 4×2, rows listed top to bottom):
W₂Q = [0 0], [0 0], [1 0], [0 1]
W₂K = [0 0], [0 0], [0 1], [1 0]
W₂V = [0 0], [0 0], [1 0], [0 1]
Head 1: Q₁ = X·W₁Q = [ 1 0 ]   K₁ = X·W₁K = [ 0 1 ]   V₁ = X·W₁V = [ 1 0 ]
                     [ 0 1 ]                [ 1 0 ]                [ 0 1 ]
Head 2: Q₂ = X·W₂Q = [ 1 0 ]   K₂ = X·W₂K = [ 0 1 ]   V₂ = X·W₂V = [ 1 0 ]
                     [ 0 1 ]                [ 1 0 ]                [ 0 1 ]
Head 1: Q₁K₁ᵀ / √2 = [ 0 1 ] / √2 = [ 0.000 0.707 ]
                     [ 1 0 ]        [ 0.707 0.000 ]
softmax: A₁ = [ 0.330 0.670 ]   Output₁ = A₁·V₁ = [ 0.330 0.670 ]
              [ 0.670 0.330 ]                     [ 0.670 0.330 ]
Head 2 has the same Q, K, and V in this example, so Output₂ = Output₁.
Concat = [ 0.330 0.670 0.330 0.670 ] (head₁ | head₂)
         [ 0.670 0.330 0.670 0.330 ]
Output = Concat · W^O (W^O ∈ ℝ^(4×4), combines multi-head information)
Compute PE for position 3, dmodel=4 (i = 0, 1):
i=0: ω₀ = 1/10000^(0/4) = 1/1 = 1
PE(3, 0) = sin(3 × 1) = sin(3) = 0.141
PE(3, 1) = cos(3 × 1) = cos(3) = -0.990
i=1: ω₁ = 1/10000^(2/4) = 1/100 = 0.01
PE(3, 2) = sin(3 × 0.01) = sin(0.03) = 0.030
PE(3, 3) = cos(3 × 0.01) = cos(0.03) = 1.000
PE(pos=3) = [0.141, -0.990, 0.030, 1.000]
Compare with PE(pos=0) = [0.000, 1.000, 0.000, 1.000]
Compare with PE(pos=1) = [0.841, 0.540, 0.010, 1.000]
→ Each position has a unique encoding!
→ Low dims (i=0) change rapidly, high dims (i=1) change slowly.
Visual Diagrams
Figure: Scaled dot-product attention (data flow from inputs to output)
- Inputs: Q (n × dk) and K (m × dk)
- MatMul: QKᵀ dot products
- Scale: divide by √dk
- Softmax: attention weights (n × m)
- MatMul with V (m × dv): weighted sum of V
- Output (n × dv)
Figure: Multi-head attention (data flow from inputs to output)
- Input: Q, K, V (n × dmodel)
- Per head (head₁ ... head_h): linear projections of Q, K, V
- Per head: scaled dot-product attention
- Concat: [head₁ | head₂ | ... | head_h]
- Linear (W^O): h·dv → dmodel
Figure: Transformer encoder-decoder architecture (each stack listed from input upward)
Encoder (×6), input "The cat sat on the mat":
- Input Embedding + Positional Encoding
- Multi-Head Self-Attention, then Add & Layer Norm
- Feed Forward (FFN, 512 → 2048 → 512), then Add & Layer Norm
Decoder (×6), output "Le chat est assis sur ...":
- Output Embedding + Positional Encoding
- Masked Multi-Head Self-Attention, then Add & Layer Norm
- Cross-Attention (Encoder-Decoder), with K and V taken from the encoder, then Add & Layer Norm
- Feed Forward (FFN, 512 → 2048 → 512), then Add & Layer Norm
| Model | Architecture | Blocks | Attention direction | Typical tasks |
|---|---|---|---|---|
| BERT | Encoder-only | Encoder block ×12 | Bidirectional | NER, QA, classification |
| GPT | Decoder-only | Decoder block ×12 | Left-to-right | Generation, completion, chat |
| T5 | Encoder-decoder | Encoder ×12 → Decoder ×12 | Encoder bidirectional, decoder left-to-right | Translation, summarization, any text-to-text task |
Figure: Vision Transformer (ViT) pipeline
- Input image (224×224×3)
- Split into 16×16 patches = 196 patches
- Linear embedding of patches (196 × 768), with a [CLS] token prepended
- Add position embeddings (197 × 768)
- Transformer encoder (12 layers, width 768)
- [CLS] output → MLP head → classification
Flowcharts
Flowchart: Which architecture for which NLP task?
- Understand text? → BERT-type encoder → classification, NER, QA, embeddings
- Generate text? → GPT-type decoder → chatbots, stories, code, reasoning
- Transform text → text? → T5 / BART encoder-decoder → translation, summarization, question answering, style transfer
Flowchart: BERT fine-tuning workflow
- Start from a pre-trained BERT (from HuggingFace)
- Task-specific data loading + tokenization
- Add a task head and choose which BERT layers to freeze or unfreeze (for classification: [CLS] → Linear → Softmax → prediction)
- Fine-tune with a small learning rate (2e-5) for 3-5 epochs
- Evaluate on the validation set
- Deploy
Flowchart: LLM training pipeline
- 1. DATA CURATION: web crawl, books, code, Wikipedia (~1-10 TB of text)
- 2. TOKENIZATION: BPE / SentencePiece, 32K-100K tokens
- 3. PRETRAINING: causal language modeling (predict the next token), 1000s of GPUs, weeks to months
- 4. SFT: supervised fine-tuning on human-curated prompts
- 5. RLHF: reward model + PPO optimizer → aligned model
- 6. DEPLOYMENT: API, guardrails, monitoring
Python Implementation (From Scratch)
10.1 Scaled Dot-Product Attention
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled Dot-Product Attention from 'Attention Is All You Need'.

    Args:
        Q: Queries (batch, n, d_k) or (n, d_k)
        K: Keys (batch, m, d_k) or (m, d_k)
        V: Values (batch, m, d_v) or (m, d_v)
        mask: Optional additive mask (n, m): 0 where attention is allowed,
              -inf where it is blocked
    Returns:
        output: Weighted values (batch, n, d_v) or (n, d_v)
        attention_weights: (batch, n, m) or (n, m)
    """
    d_k = Q.shape[-1]
    # Step 1: QK^T dot products
    scores = np.matmul(Q, K.swapaxes(-2, -1))  # (n, m)
    # Step 2: Scale by sqrt(d_k) (the variance argument)
    scores = scores / np.sqrt(d_k)
    # Step 3: Apply mask (for decoder / padding)
    if mask is not None:
        scores = scores + mask  # mask has -inf for blocked positions
    # Step 4: Softmax to get attention weights
    attention_weights = softmax(scores, axis=-1)
    # Step 5: Weighted sum of values
    output = np.matmul(attention_weights, V)
    return output, attention_weights


# --- DEMO: Self-Attention on 4 tokens ---
np.random.seed(42)
seq_len, d_k, d_v = 4, 8, 8
# Random input embeddings
X = np.random.randn(seq_len, d_k)
# Learnable projection matrices (random for demo)
W_Q = np.random.randn(d_k, d_k) * 0.1
W_K = np.random.randn(d_k, d_k) * 0.1
W_V = np.random.randn(d_k, d_v) * 0.1
# Project to Q, K, V
Q = X @ W_Q
K = X @ W_K
V = X @ W_V
# Compute attention
output, weights = scaled_dot_product_attention(Q, K, V)
print("Input shape:", X.shape)
print("Output shape:", output.shape)
print("\nAttention weights (each row sums to 1):")
print(np.round(weights, 3))
print("\nRow sums:", np.round(weights.sum(axis=-1), 6))
10.2 Multi-Head Attention
import numpy as np


class MultiHeadAttention:
    """
    Multi-Head Attention from scratch.
    Splits input into h heads, runs attention in parallel, concatenates.
    """

    def __init__(self, d_model, num_heads):
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Initialize projection matrices (Xavier initialization:
        # std = sqrt(2 / (fan_in + fan_out)), both equal to d_model here)
        scale = np.sqrt(2.0 / (d_model + d_model))
        self.W_Q = np.random.randn(d_model, d_model) * scale
        self.W_K = np.random.randn(d_model, d_model) * scale
        self.W_V = np.random.randn(d_model, d_model) * scale
        self.W_O = np.random.randn(d_model, d_model) * scale

    def split_heads(self, x):
        """Reshape (batch, seq, d_model) -> (batch, heads, seq, d_k)"""
        batch_size = x.shape[0]
        seq_len = x.shape[1]
        # Reshape: (batch, seq, d_model) -> (batch, seq, heads, d_k)
        x = x.reshape(batch_size, seq_len, self.num_heads, self.d_k)
        # Transpose: -> (batch, heads, seq, d_k)
        return x.transpose(0, 2, 1, 3)

    def forward(self, Q, K, V, mask=None):
        """
        Args:
            Q, K, V: (batch, seq_len, d_model)
            mask: optional additive mask (seq_len, seq_len)
        Returns:
            output: (batch, seq_len, d_model)
            attn_weights: (batch, heads, seq_len, seq_len)
        """
        batch_size = Q.shape[0]
        # Step 1: Linear projections
        Q_proj = Q @ self.W_Q  # (batch, n, d_model)
        K_proj = K @ self.W_K
        V_proj = V @ self.W_V
        # Step 2: Split into heads
        Q_heads = self.split_heads(Q_proj)  # (batch, h, n, d_k)
        K_heads = self.split_heads(K_proj)
        V_heads = self.split_heads(V_proj)
        # Step 3: Scaled dot-product attention per head
        d_k = self.d_k
        scores = np.matmul(Q_heads, K_heads.transpose(0, 1, 3, 2))
        scores = scores / np.sqrt(d_k)
        if mask is not None:
            scores = scores + mask
        attn_weights = self._softmax(scores)
        head_outputs = np.matmul(attn_weights, V_heads)  # (batch, h, n, d_k)
        # Step 4: Concatenate heads
        # (batch, h, n, d_k) -> (batch, n, h, d_k) -> (batch, n, d_model)
        concat = head_outputs.transpose(0, 2, 1, 3)
        concat = concat.reshape(batch_size, -1, self.d_model)
        # Step 5: Final linear projection
        output = concat @ self.W_O
        return output, attn_weights

    def _softmax(self, x, axis=-1):
        e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return e_x / np.sum(e_x, axis=axis, keepdims=True)


# --- DEMO ---
np.random.seed(42)
batch_size, seq_len, d_model, num_heads = 2, 6, 64, 8
mha = MultiHeadAttention(d_model, num_heads)
X = np.random.randn(batch_size, seq_len, d_model)
output, weights = mha.forward(X, X, X)  # Self-attention
print(f"Input shape: {X.shape}")  # (2, 6, 64)
print(f"Output shape: {output.shape}")  # (2, 6, 64)
print(f"Weight shape: {weights.shape}")  # (2, 8, 6, 6)
print("\nHead 0, Batch 0 attention (6x6):")
print(np.round(weights[0, 0], 3))
10.3 Positional Encoding
import numpy as np


def sinusoidal_positional_encoding(max_len, d_model):
    """
    Compute sinusoidal positional encoding as in 'Attention Is All You Need'.
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    Assumes d_model is even.
    """
    PE = np.zeros((max_len, d_model))
    positions = np.arange(max_len)[:, np.newaxis]  # (max_len, 1)
    dim_indices = np.arange(0, d_model, 2)  # even indices 2i, shape (d_model/2,)
    # div_term = 1 / 10000^(2i/d_model), computed in log space:
    # exp(-2i * ln(10000) / d_model)
    div_term = np.exp(dim_indices * (-np.log(10000.0) / d_model))
    # Apply sin to even indices, cos to odd indices
    PE[:, 0::2] = np.sin(positions * div_term)
    PE[:, 1::2] = np.cos(positions * div_term)
    return PE


# --- DEMO ---
PE = sinusoidal_positional_encoding(max_len=50, d_model=16)
print("PE shape:", PE.shape)
print("\nPosition 0:", np.round(PE[0], 3))
print("Position 1:", np.round(PE[1], 3))
print("Position 2:", np.round(PE[2], 3))
# Verify the rotation property: PE(pos+k) is a linear transform of PE(pos)
pos, k, dim = 5, 3, 0
omega = 1.0 / (10000 ** (2 * dim / 16))
# PE(pos+k, 2*dim) should equal sin(omega*(pos+k)).
# Using rotation: sin(w*pos)*cos(w*k) + cos(w*pos)*sin(w*k)
from_rotation = PE[pos, 2*dim] * np.cos(omega*k) + PE[pos, 2*dim+1] * np.sin(omega*k)
print("\nRotation property check:")
print(f"  PE({pos+k}, {2*dim}) = {PE[pos+k, 2*dim]:.6f}")
print(f"  Via rotation: = {from_rotation:.6f}")
print(f"  Match: {np.isclose(PE[pos+k, 2*dim], from_rotation)}")
10.4 Complete Transformer Block
import numpy as np

# Requires MultiHeadAttention (section 10.2) and
# sinusoidal_positional_encoding (section 10.3) to be defined already.


class LayerNorm:
    """Layer Normalization."""

    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)
        self.beta = np.zeros(d_model)
        self.eps = eps

    def forward(self, x):
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        x_norm = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta


class FeedForward:
    """Position-wise Feed-Forward Network: FFN(x) = ReLU(xW1+b1)W2+b2"""

    def __init__(self, d_model, d_ff):
        # He-style initialization: std = sqrt(2 / fan_in) for each matrix
        self.W1 = np.random.randn(d_model, d_ff) * np.sqrt(2.0 / d_model)
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * np.sqrt(2.0 / d_ff)
        self.b2 = np.zeros(d_model)

    def forward(self, x):
        hidden = np.maximum(0, x @ self.W1 + self.b1)  # ReLU
        return hidden @ self.W2 + self.b2


class TransformerBlock:
    """
    One complete Transformer encoder block:
    1. Multi-Head Self-Attention + Residual + LayerNorm
    2. Feed-Forward Network + Residual + LayerNorm
    """

    def __init__(self, d_model, num_heads, d_ff):
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Sub-layer 1: Multi-Head Attention
        attn_output, attn_weights = self.attention.forward(x, x, x, mask)
        x = self.norm1.forward(x + attn_output)  # Residual + LayerNorm
        # Sub-layer 2: Feed-Forward
        ffn_output = self.ffn.forward(x)
        x = self.norm2.forward(x + ffn_output)  # Residual + LayerNorm
        return x, attn_weights


class TransformerEncoder:
    """Stack of N Transformer encoder blocks."""

    def __init__(self, num_layers, d_model, num_heads, d_ff, max_len=512):
        self.layers = [
            TransformerBlock(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ]
        self.PE = sinusoidal_positional_encoding(max_len, d_model)

    def forward(self, x):
        """x: (batch, seq_len, d_model)"""
        seq_len = x.shape[1]
        x = x + self.PE[:seq_len]  # Add positional encoding
        all_weights = []
        for layer in self.layers:
            x, weights = layer.forward(x)
            all_weights.append(weights)
        return x, all_weights


# --- DEMO: 6-layer Transformer Encoder ---
np.random.seed(42)
batch, seq_len, d_model = 2, 10, 64
num_heads, d_ff, num_layers = 8, 256, 6
encoder = TransformerEncoder(num_layers, d_model, num_heads, d_ff)
X = np.random.randn(batch, seq_len, d_model)
output, all_weights = encoder.forward(X)
print(f"Input: {X.shape}")  # (2, 10, 64)
print(f"Output: {output.shape}")  # (2, 10, 64)
print(f"Layers: {len(all_weights)}")  # 6
print(f"Attn weights per layer: {all_weights[0].shape}")  # (2, 8, 10, 10)
TensorFlow Implementation
11.1 BERT Fine-Tuning for Sentiment Analysis
import tensorflow as tf
from transformers import TFBertModel, BertTokenizer
import numpy as np

# Assumes a transformers release that still ships TensorFlow model classes.

# --- 1. Load Pre-trained Tokenizer (the model is loaded inside the classifier) ---
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


# --- 2. Build Sentiment Classifier ---
class BERTSentimentClassifier(tf.keras.Model):
    def __init__(self, num_classes=3, dropout_rate=0.3):
        super().__init__()
        self.bert = TFBertModel.from_pretrained('bert-base-uncased')
        self.dropout = tf.keras.layers.Dropout(dropout_rate)
        self.classifier = tf.keras.layers.Dense(
            num_classes, activation='softmax'
        )

    def call(self, inputs, training=False):
        # `inputs` is the dict built in tokenize_data; model.fit passes it
        # to call() as a single argument
        outputs = self.bert(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            training=training
        )
        # Use [CLS] token representation (first token)
        cls_output = outputs.last_hidden_state[:, 0, :]  # (batch, 768)
        cls_output = self.dropout(cls_output, training=training)
        # The Dense layer applies softmax, so these are class probabilities
        probs = self.classifier(cls_output)
        return probs


# --- 3. Tokenize Dataset ---
def tokenize_data(texts, labels, max_length=128):
    """Tokenize texts for BERT input."""
    encodings = tokenizer(
        texts,
        max_length=max_length,
        truncation=True,
        padding='max_length',
        return_tensors='tf'
    )
    dataset = tf.data.Dataset.from_tensor_slices((
        {
            'input_ids': encodings['input_ids'],
            'attention_mask': encodings['attention_mask']
        },
        labels
    ))
    return dataset


# --- 4. Sample Data & Training ---
texts = [
    "This movie was absolutely wonderful!",
    "Terrible experience, worst film ever.",
    "It was okay, nothing special.",
    "I loved every minute of this masterpiece!",
    "Complete waste of time and money.",
    "Average movie with some good moments.",
]
labels = [2, 0, 1, 2, 0, 1]  # 0=negative, 1=neutral, 2=positive
dataset = tokenize_data(texts, labels)
dataset = dataset.batch(2).prefetch(tf.data.AUTOTUNE)

# --- 5. Compile and Train ---
model = BERTSentimentClassifier(num_classes=3)
# Key: Use very small learning rate for BERT fine-tuning!
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
model.compile(
    optimizer=optimizer,
    loss='sparse_categorical_crossentropy',  # expects probabilities, not logits
    metrics=['accuracy']
)
# Fine-tune for 3 epochs (in practice, 3-5 is usually sufficient)
# model.fit(dataset, epochs=3)
print("Model built. Ready for fine-tuning!")
print("BERT parameters: ~110M")
print(f"Classifier head: 768 * 3 + 3 = {768 * 3 + 3} parameters")
11.2 Mini-GPT Text Generation
import tensorflow as tf
import numpy as np


class MiniGPT(tf.keras.Model):
    """
    A minimal GPT-style decoder-only Transformer for text generation.
    Implements causal (masked) self-attention.
    """

    def __init__(self, vocab_size, d_model=128, num_heads=4,
                 d_ff=512, num_layers=4, max_len=256):
        super().__init__()
        self.d_model = d_model
        self.max_len = max_len
        # Token + Position Embeddings
        self.token_embed = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_embed = tf.keras.layers.Embedding(max_len, d_model)
        # Transformer Decoder Blocks
        self.blocks = [
            self._decoder_block(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ]
        # Output projection
        self.ln_f = tf.keras.layers.LayerNormalization()
        self.head = tf.keras.layers.Dense(vocab_size)

    def _decoder_block(self, d_model, num_heads, d_ff):
        """Single decoder block with causal attention."""
        return {
            'attn': tf.keras.layers.MultiHeadAttention(
                num_heads=num_heads, key_dim=d_model // num_heads
            ),
            'ln1': tf.keras.layers.LayerNormalization(),
            'ffn': tf.keras.Sequential([
                tf.keras.layers.Dense(d_ff, activation='gelu'),
                tf.keras.layers.Dense(d_model),
            ]),
            'ln2': tf.keras.layers.LayerNormalization(),
        }

    def _causal_mask(self, seq_len):
        """Create causal mask: prevent attending to future tokens."""
        mask = tf.linalg.band_part(
            tf.ones((seq_len, seq_len)), -1, 0
        )
        return mask  # Lower triangular: 1 = may attend, 0 = blocked

    def call(self, x, training=False):
        seq_len = tf.shape(x)[1]
        # Embeddings
        positions = tf.range(seq_len)
        tok_emb = self.token_embed(x)  # (batch, seq, d_model)
        pos_emb = self.pos_embed(positions)  # (seq, d_model)
        h = tok_emb + pos_emb
        # Causal mask
        causal_mask = self._causal_mask(seq_len)
        # Pass through decoder blocks
        for block in self.blocks:
            # Pre-norm architecture (GPT-2 style)
            h_norm = block['ln1'](h)
            attn_out = block['attn'](
                query=h_norm, key=h_norm, value=h_norm,
                attention_mask=causal_mask, training=training
            )
            h = h + attn_out  # Residual
            h_norm = block['ln2'](h)
            ffn_out = block['ffn'](h_norm, training=training)
            h = h + ffn_out  # Residual
        h = self.ln_f(h)
        logits = self.head(h)  # (batch, seq, vocab_size)
        return logits

    def generate(self, start_tokens, max_new_tokens=50, temperature=0.8):
        """Autoregressive text generation."""
        tokens = tf.constant([start_tokens], dtype=tf.int32)
        for _ in range(max_new_tokens):
            # Crop to max_len
            crop = tokens[:, -self.max_len:]
            logits = self(crop, training=False)
            # Get logits for last position
            next_logits = logits[:, -1, :] / temperature
            # Sample from distribution; request int32 so the result can be
            # concatenated with `tokens` (the default output dtype is int64)
            next_token = tf.random.categorical(next_logits, 1, dtype=tf.int32)
            tokens = tf.concat([tokens, next_token], axis=1)
        return tokens.numpy()[0]


# --- DEMO ---
vocab_size = 5000
model = MiniGPT(vocab_size=vocab_size, d_model=128, num_heads=4,
                d_ff=512, num_layers=4, max_len=256)
# Test forward pass
dummy_input = tf.constant([[1, 42, 100, 7, 88]])
logits = model(dummy_input)
print(f"Input shape: {dummy_input.shape}")
print(f"Output shape: {logits.shape}")  # (1, 5, 5000)
print(f"Parameters: {model.count_params():,}")
# Generate text (random tokens since untrained)
generated = model.generate([1, 42, 100], max_new_tokens=10)
print(f"Generated token IDs: {generated}")
11.3 Custom Transformer Layer in TF
import tensorflow as tf
class TransformerEncoderLayer(tf.keras.layers.Layer):
    """Production-quality Transformer encoder layer."""
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=d_model // num_heads,
            dropout=dropout
        )
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation='relu'),
            tf.keras.layers.Dropout(dropout),
            tf.keras.layers.Dense(d_model),
            tf.keras.layers.Dropout(dropout),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(dropout)
    def call(self, x, training=False, mask=None):
        # Multi-Head Self-Attention + Residual + Norm
        attn_output = self.mha(x, x, x, attention_mask=mask, training=training)
        attn_output = self.dropout1(attn_output, training=training)
        x = self.norm1(x + attn_output)
        # Feed-Forward + Residual + Norm
        ffn_output = self.ffn(x, training=training)
        x = self.norm2(x + ffn_output)
        return x
# Build 6-layer encoder
d_model, num_heads, d_ff = 512, 8, 2048
encoder_layers = [
    TransformerEncoderLayer(d_model, num_heads, d_ff)
    for _ in range(6)
]
# Test
x = tf.random.normal((2, 20, 512))  # batch=2, seq=20
for layer in encoder_layers:
    x = layer(x, training=True)
print(f"6-layer encoder output: {x.shape}") # (2, 20, 512)
Scikit-Learn Integration
While Scikit-Learn doesn't have native Transformer models, it integrates beautifully with Transformer-based feature extractors. Here we show how to use BERT embeddings as features in sklearn pipelines.
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
import numpy as np
# --- BERT as Feature Extractor for sklearn ---
class BERTFeatureExtractor:
    """Extract [CLS] embeddings from BERT for use with sklearn."""
    def __init__(self, model_name='bert-base-uncased', max_length=128):
        from transformers import BertTokenizer, TFBertModel
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = TFBertModel.from_pretrained(model_name)
        self.max_length = max_length
    def transform(self, texts):
        """Convert texts to BERT [CLS] embeddings (768-dim vectors)."""
        # list(...) so that arrays or other sequences of strings are accepted
        encodings = self.tokenizer(
            list(texts), max_length=self.max_length,
            truncation=True, padding='max_length',
            return_tensors='tf'
        )
        outputs = self.model(encodings, training=False)
        # Extract [CLS] token embedding (position 0 of the last layer)
        cls_embeddings = outputs.last_hidden_state[:, 0, :].numpy()
        return cls_embeddings
# --- Use with sklearn ---
# (Pseudocode: requires transformers & tensorflow installed)
"""
# Extract features
extractor = BERTFeatureExtractor()
X_train = extractor.transform(train_texts)  # (n, 768)
X_test = extractor.transform(test_texts)
# Train any sklearn classifier on BERT features!
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
models = {
    'SVM': SVC(kernel='rbf', C=1.0),
    'LogReg': LogisticRegression(max_iter=1000),
    'RF': RandomForestClassifier(n_estimators=100),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")
"""
# --- Simulated Demo (no GPU needed) ---
np.random.seed(42)
n_samples = 200
X_simulated = np.random.randn(n_samples, 768)  # Random stand-in for BERT features
y_simulated = (X_simulated[:, 0] + X_simulated[:, 1] > 0).astype(int)
svm = SVC(kernel='rbf', C=1.0)
scores = cross_val_score(svm, X_simulated, y_simulated, cv=5)
print(f"SVM on simulated BERT features: {scores.mean():.4f} ± {scores.std():.4f}")
Indian Case Studies
Case Study 1: AI4Bharat IndicBERT - Transformers for 11 Indian Languages
Challenge: BERT was trained primarily on English. India has 22 official languages and 100+ spoken languages. Most Indian language NLP was severely under-resourced.
Solution: AI4Bharat (IIT Madras) created IndicBERT, a multilingual BERT model trained on the IndicCorp dataset covering 11 major Indian languages: Hindi, Bengali, Telugu, Tamil, Marathi, Gujarati, Kannada, Malayalam, Odia, Punjabi, and Assamese.
Technical Details:
- Based on ALBERT architecture (parameter-sharing for efficiency)
- Trained on 9 billion tokens across 11 languages
- Uses SentencePiece tokenizer trained on Indian language data
- Outperforms multilingual BERT (mBERT) on IndicGLUE benchmark
Impact: Enabled sentiment analysis in Hindi, NER in Tamil, question answering in Bengali, and more. Used by Indian startups for vernacular content moderation, e-commerce search, and government document processing.
Case Study 2: Krutrim LLM - India's First Multilingual Foundation Model
Challenge: Global LLMs like GPT-4 perform poorly on Indian languages due to limited training data and tokenization issues (Hindi text typically needs 3-4× as many tokens as equivalent English text).
Solution: Ola's AI lab developed Krutrim (Sanskrit for "artificial"), India's first homegrown LLM supporting 22 Indian languages, with text generation in 10 languages.
Technical Innovation:
- Custom tokenizer optimized for Indian scripts (Devanagari, Dravidian scripts, etc.)
- Training data curated from Indian web sources, books, and government documents
- Efficient inference using quantization for deployment on Indian infrastructure
- Krutrim Pro: Larger model with 100B+ parameters for enterprise applications
Impact: Demonstrated that India can build sovereign AI infrastructure. Applications in healthcare (patient communication in local languages), education (tutoring in regional languages), and e-governance.
Case Study 3: Bhashini - Government's AI Translation Platform
Challenge: Government services need to be accessible in all 22 scheduled languages of India.
Solution: MeitY's Bhashini platform uses Transformer-based translation models to provide real-time translation across Indian languages.
- Uses encoder-decoder Transformers fine-tuned on Samanantar parallel corpus
- Integrated with Aadhaar and DigiLocker for document translation
- Open API for developers to build multilingual applications
- Handles 100+ language pairs with a single multilingual model
Global Case Studies
Case Study 4: OpenAI GPT Evolution - The Scaling Frontier
| Model | Year | Parameters | Training Data | Key Innovation |
|---|---|---|---|---|
| GPT-1 | 2018 | 117M | BookCorpus (7K books) | Pre-train + fine-tune paradigm |
| GPT-2 | 2019 | 1.5B | WebText (40GB) | Zero-shot via prompting |
| GPT-3 | 2020 | 175B | 570GB mixed | In-context learning, few-shot |
| InstructGPT | 2022 | ~175B | + RLHF data | RLHF alignment |
| GPT-4 | 2023 | Undisclosed (unofficial estimates: ~1.8T, MoE) | Undisclosed (unofficial estimates: ~13T tokens) | Multimodal, reasoning |
| GPT-4o | 2024 | Undisclosed | Undisclosed | Omnimodal (text+image+audio) |
Key Lessons: (1) Scale is predictable: Kaplan scaling laws show that loss decreases as a power law in compute, data, and parameters. (2) Emergent abilities appear at scale: chain-of-thought reasoning, code generation, multilingual transfer. (3) RLHF transforms raw capability into useful, aligned behavior.
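To make lesson (1) concrete: a power law L = c · N^(-alpha) is a straight line in log-log space, so its exponent can be recovered with an ordinary linear fit. The sketch below uses synthetic losses generated from made-up constants (`alpha_true` and `c_true` are illustrative assumptions, not values taken from Kaplan et al.):

```python
import numpy as np

# Made-up constants for illustration only (not the published scaling-law values)
alpha_true, c_true = 0.08, 12.0

# Synthetic "model sizes" from 1M to 100B parameters and their noiseless losses
N = np.logspace(6, 11, 20)
loss = c_true * N ** (-alpha_true)

# log(loss) = log(c) - alpha * log(N), so fit a line in log-log space
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
alpha_fit, c_fit = -slope, np.exp(intercept)

print(f"fitted exponent: {alpha_fit:.4f}, fitted constant: {c_fit:.2f}")
```

With real training runs the points are noisy, which is why such fits are usually made over several orders of magnitude of scale.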
Case Study 5: Google Gemini - Multimodal from the Ground Up
Architecture: Unlike GPT-4 (which added vision to a text model), Gemini was natively multimodal: trained from scratch on text, images, audio, video, and code simultaneously.
- Gemini Ultra: Exceeds human performance on MMLU (90.0%)
- Gemini Pro: Powers Google Search, Gmail, Docs integration
- Gemini Nano: On-device model for Pixel phones (1.8B & 3.25B variants)
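- Training: Gemini 1.0 was trained on TPU v4 and v5e pods with a 32K context window; Gemini 1.5 moved to a mixture-of-experts architecture with a 128K standard context window
Case Study 6: Meta Llama - Open-Source LLM Revolution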
Impact: Meta's release of Llama models (7B-405B) under open licenses democratized LLM access, enabling thousands of fine-tuned variants.
- Llama 2 (2023): 7B/13B/70B, commercially licensed, trained on 2T tokens
- Llama 3 (2024): 8B/70B/405B, state-of-the-art open-source models at release, 15T tokens
- Architecture choices: RMSNorm (instead of LayerNorm), SwiGLU activation, Rotary Position Embeddings (RoPE), Grouped Query Attention (GQA)
- Community: Over 30,000 derived models on HuggingFace within months
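As a rough illustration of the first architecture choice in the list above, RMSNorm rescales each token vector by its root-mean-square rather than subtracting the mean and dividing by the standard deviation as LayerNorm does. This NumPy sketch is not Meta's implementation; the function name, `eps` value, and gain initialization are assumptions:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Scale each vector along the last axis to (approximately) unit root-mean-square."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 16)) * 4.0 + 3.0  # (batch, seq, features)
gain = np.ones(16)                               # learnable in a real model
y = rms_norm(x, gain)

# Every token vector now has mean-square close to 1; the mean is NOT removed
mean_square = np.mean(y ** 2, axis=-1)
print(mean_square.round(4))
```

Skipping the mean subtraction saves a reduction per token, which is one reason this variant is attractive in large models.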
Startup Applications
Sarvam AI (India)
Building India-first LLMs with focus on voice + text in Indian languages. Their models handle code-switching (Hinglish) natively, a critical requirement for Indian consumers.
Hugging Face (Global)
The "GitHub of ML": hosts 500K+ models and 100K+ datasets. Their Transformers library is the de facto standard. Valued at $4.5B, proving open-source AI is a viable business.
Cohere (Canada)
Enterprise-focused LLMs with Retrieval-Augmented Generation (RAG). Their Command model powers enterprise search, and Embed model provides best-in-class embeddings.
Anthropic (USA)
Founded by ex-OpenAI researchers, building "safer" LLMs. Claude uses Constitutional AI (CAI), an RLHF variant in which the model critiques and revises its own outputs against a written constitution, and AI-generated preference labels replace most human harmlessness labels.
Government Applications
IndiaAI Mission
Government of India allocated over ₹10,300 crore for AI development. Key Transformer applications: Bhashini (translation), Document Intelligence (tax forms, legal docs), and agricultural advisory chatbots in local languages.
US Intelligence
CIA and NSA use Transformer-based models for signals intelligence, analyzing intercepted communications in 100+ languages. Custom fine-tuned models run on air-gapped classified networks.
EU AI Act
The world's first comprehensive AI regulation specifically addresses "General-Purpose AI" (GPT-4, Gemini). Foundation model providers must document training data, compute costs, and conduct safety evaluations.
Digital Courts
Indian judiciary exploring Transformer models for case summarization, legal document translation, and precedent search across 25 High Courts. SUVAS system uses NMT for judgment translation.
Industry Applications
| Industry | Application | Transformer Type | Impact |
|---|---|---|---|
| Healthcare | Medical report generation, drug discovery (AlphaFold) | Encoder-Decoder, Specialized | 10× faster literature review |
| Finance | Fraud detection, sentiment from earnings calls | BERT, FinBERT | 95%+ fraud detection accuracy |
| E-Commerce | Product search, recommendation, review analysis | BERT, Cross-encoders | Flipkart: 15% search improvement |
| Manufacturing | Predictive maintenance from sensor logs (time-series Transformers) | Encoder-only | 30% reduction in downtime |
| Education | Personalized tutoring, automated grading | GPT-type | Byju's, Vedantu AI tutors |
| Legal | Contract analysis, case prediction | BERT, LegalBERT | 80% faster contract review |
| Media | Content generation, translation, dubbing | GPT, Whisper | Netflix: 40+ language dubbing |
| Agriculture | Crop advisory chatbots, pest identification | Multilingual LLMs | KissanAI: 500K+ farmers served |
Mini Projects
Mini Project 1: Hindi Sentiment Analysis with BERT
Objective: Fine-tune a multilingual BERT model for sentiment classification on Hindi movie reviews.
Dataset: Hindi Movie Reviews dataset from AI4Bharat or IIIT-H
"""
Mini Project 1: Hindi Sentiment Analysis using Multilingual BERT
"""
from transformers import (
    AutoTokenizer, TFAutoModelForSequenceClassification,
    DataCollatorWithPadding
)
import tensorflow as tf
import numpy as np
# --- 1. Load Hindi-capable Model ---
MODEL_NAME = "ai4bharat/indic-bert"  # or "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = TFAutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3  # positive, negative, neutral
)
# --- 2. Sample Hindi Data (English gloss in comments) ---
hindi_reviews = [
    {"text": "यह फिल्म बहुत अच्छी थी, मुझे बहुत पसंद आई", "label": 2},  # This film was very good, I liked it a lot
    {"text": "बेकार फिल्म, समय की बर्बादी", "label": 0},  # Useless film, a waste of time
    {"text": "कहानी ठीक थी लेकिन अभिनय कमजोर था", "label": 1},  # The story was fine but the acting was weak
    {"text": "शानदार अभिनय और बेहतरीन संगीत", "label": 2},  # Superb acting and excellent music
    {"text": "इतनी खराब फिल्म मैंने कभी नहीं देखी", "label": 0},  # I have never seen such a bad film
    {"text": "औसत फिल्म, एक बार देख सकते हैं", "label": 1},  # Average film, worth watching once
    {"text": "दिल को छू लेने वाली कहानी", "label": 2},  # A heart-touching story
    {"text": "बोरिंग और लंबी फिल्म", "label": 0},  # Boring and long film
]
# --- 3. Tokenize ---
texts = [r["text"] for r in hindi_reviews]
labels = [r["label"] for r in hindi_reviews]
encodings = tokenizer(
    texts, max_length=128, truncation=True,
    padding="max_length", return_tensors="tf"
)
dataset = tf.data.Dataset.from_tensor_slices((
    dict(encodings), labels
)).batch(4)
# --- 4. Fine-tune ---
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
print("Model ready for Hindi sentiment analysis!")
print(f"Vocab size: {tokenizer.vocab_size}")
print(f"Model parameters: {model.count_params():,}")
# --- 5. Inference Function ---
def predict_sentiment(text):
    """Predict sentiment of Hindi text."""
    inputs = tokenizer(text, return_tensors="tf",
                       max_length=128, truncation=True, padding="max_length")
    outputs = model(inputs)
    probs = tf.nn.softmax(outputs.logits, axis=-1)
    label_map = {0: "नकारात्मक (Negative)",
                 1: "तटस्थ (Neutral)",
                 2: "सकारात्मक (Positive)"}
    pred = tf.argmax(probs, axis=-1).numpy()[0]
    return label_map[pred], probs.numpy()[0]
# Test (before training, so predictions will be random)
test = "यह फिल्म बहुत शानदार है"  # This film is really superb
label, probs = predict_sentiment(test)
print(f"\nInput: {test}")
print(f"Prediction: {label}")
print(f"Probabilities: {probs}")
Mini Project 2: Mini Language Model (Character-Level GPT)
Objective: Build and train a small character-level language model using the Transformer decoder architecture.
Dataset: Any text file (Shakespeare, Hindi stories, etc.)
"""
Mini Project 2: Character-Level Language Model using Transformer Decoder
Inspired by Andrej Karpathy's nanoGPT
"""
import tensorflow as tf
import numpy as np
# --- 1. Data Preparation ---
text = """
India is a vast country with diverse cultures, languages, and traditions.
The Indian constitution recognizes 22 official languages.
Artificial intelligence is transforming India's technology landscape.
From Bengaluru to Mumbai, startups are building innovative AI solutions.
The future of AI in India is bright, with millions of developers.
"""
# Character-level tokenization
chars = sorted(list(set(text)))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}
def encode(s): return [char_to_idx[c] for c in s]
def decode(l): return ''.join([idx_to_char[i] for i in l])
data = np.array(encode(text))
print(f"Vocabulary size: {vocab_size}")
print(f"Text length: {len(data)} characters")
print(f"Sample encoding: '{text[:10]}' -> {encode(text[:10])}")
# --- 2. Create Training Sequences ---
block_size = 32  # Context window
batch_size = 8
def create_dataset(data, block_size):
    # Each target sequence is the input sequence shifted one character ahead
    X, Y = [], []
    for i in range(len(data) - block_size):
        X.append(data[i:i+block_size])
        Y.append(data[i+1:i+block_size+1])
    return np.array(X), np.array(Y)
X, Y = create_dataset(data, block_size)
dataset = tf.data.Dataset.from_tensor_slices((X, Y))
dataset = dataset.shuffle(1000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
# --- 3. Build Mini Transformer LM ---
class CharTransformerLM(tf.keras.Model):
    def __init__(self, vocab_size, d_model=64, num_heads=4,
                 num_layers=3, d_ff=128, max_len=128):
        super().__init__()
        self.token_emb = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_emb = tf.keras.layers.Embedding(max_len, d_model)
        self.blocks = []
        for _ in range(num_layers):
            self.blocks.append({
                'attn': tf.keras.layers.MultiHeadAttention(
                    num_heads=num_heads, key_dim=d_model//num_heads
                ),
                'ln1': tf.keras.layers.LayerNormalization(),
                'ffn': tf.keras.Sequential([
                    tf.keras.layers.Dense(d_ff, activation='gelu'),
                    tf.keras.layers.Dense(d_model)
                ]),
                'ln2': tf.keras.layers.LayerNormalization(),
            })
        self.ln_f = tf.keras.layers.LayerNormalization()
        self.head = tf.keras.layers.Dense(vocab_size)
    def call(self, x, training=False):
        B, T = tf.shape(x)[0], tf.shape(x)[1]
        tok = self.token_emb(x)
        pos = self.pos_emb(tf.range(T))
        h = tok + pos
        # Causal mask (lower triangular: 1 = may attend)
        mask = tf.linalg.band_part(tf.ones((T, T)), -1, 0)
        for block in self.blocks:
            h_n = block['ln1'](h)
            attn = block['attn'](h_n, h_n, h_n,
                                 attention_mask=mask, training=training)
            h = h + attn
            h_n = block['ln2'](h)
            h = h + block['ffn'](h_n)
        h = self.ln_f(h)
        return self.head(h)
    def generate(self, start_text, max_tokens=100, temperature=0.8):
        tokens = encode(start_text)
        for _ in range(max_tokens):
            # Feed at most the last block_size characters as context
            x = tf.constant([tokens[-block_size:]])
            logits = self(x, training=False)
            next_logits = logits[0, -1, :] / temperature
            next_token = tf.random.categorical(
                next_logits[tf.newaxis, :], 1
            )[0, 0].numpy()
            tokens.append(next_token)
        return decode(tokens)
# --- 4. Train ---
model = CharTransformerLM(vocab_size)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(1e-3)
model.compile(optimizer=optimizer, loss=loss_fn)
# Run one dummy batch so Keras creates the weights before counting them
_ = model(tf.zeros((1, block_size), dtype=tf.int32))
print(f"\nModel parameters: {model.count_params():,}")
# Train for a few epochs
# model.fit(dataset, epochs=50, verbose=1)
# Generate text
# generated = model.generate("India is", max_tokens=50)
# print(f"Generated: {generated}")
print("\nModel built successfully! Ready for training.")
End-of-Chapter Exercises (20+)
Conceptual Questions
Mathematical Problems
Programming Exercises
Analysis & Research
Multiple Choice Questions (12)
Interview Questions (12)
Q1: Walk me through the Transformer architecture. How does it differ from an LSTM?
Answer: The Transformer uses self-attention instead of recurrence. Key differences: (1) Parallelism: all positions are computed simultaneously rather than sequentially as in an LSTM; (2) Constant path length: O(1) between any two positions vs. O(n) in an LSTM; (3) Attention mechanism: soft lookup using QKV vs. a gated memory cell; (4) Architecture: encoder-decoder with multi-head attention, FFN, LayerNorm, and residual connections. The Transformer needs positional encoding because self-attention by itself ignores token order (it is permutation-equivariant), while LSTMs inherently capture order through sequential processing.
Q2: Why do we scale by √d_k in attention? What happens without it?
Answer: The dot product q·k has variance d_k when the components of q and k are independent with zero mean and unit variance. For d_k=64, unscaled dot products therefore have a standard deviation of about 8, which pushes softmax into saturated regions where gradients ≈ 0 (vanishing gradients). Dividing by √d_k normalizes the variance to 1, keeping softmax in a "healthy" gradient region. The original paper notes that additive attention (which doesn't have this scaling issue) and dot-product attention perform comparably for small d_k, but unscaled dot-product attention falls behind for larger d_k.
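The variance argument can be checked numerically. The following is a minimal sketch, assuming standard-normal query and key components; the sample size, seed, and helper name `softmax` are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# 10,000 random query/key pairs with zero-mean, unit-variance components
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = np.sum(q * k, axis=1)       # unscaled dot products
scaled = raw / np.sqrt(d_k)       # scaled as in the attention formula
print(f"std unscaled: {raw.std():.2f}, std scaled: {scaled.std():.2f}")

def softmax(x):
    e = np.exp(x - x.max())       # subtract max for numerical stability
    return e / e.sum()

# One query scored against 8 keys, with scores at the unscaled magnitude
scores = rng.standard_normal(8) * np.sqrt(d_k)
p_raw = softmax(scores)
p_scaled = softmax(scores / np.sqrt(d_k))
print("max weight unscaled:", p_raw.max().round(3))
print("max weight scaled:  ", p_scaled.max().round(3))
```

The unscaled softmax concentrates far more of its mass on a single key, which is the saturation the answer describes.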
Q3: Explain Multi-Head Attention. Why not just use one big attention head?
Answer: Multiple heads allow the model to attend to different representation subspaces simultaneously. One head might learn syntactic relationships ("subject-verb"), another might learn semantic similarity, another positional proximity. With a single head, these different types of relationships would need to be averaged together. Empirically, multi-head attention outperforms single-head even when total computation is matched (h heads of dimension d_k = d_model/h vs. one head of d_k = d_model).
Q4: Explain the difference between BERT and GPT. When would you use each?
Answer: BERT: encoder-only, bidirectional, pre-trained with MLM (mask 15% tokens, predict them) + NSP. Best for understanding tasks: classification, NER, QA, semantic search. GPT: decoder-only, unidirectional (left-to-right), pre-trained with CLM (predict next token). Best for generation tasks: text generation, dialogue, code completion. Key insight: BERT sees full context but can't generate; GPT generates autoregressively but only sees past context during pre-training.
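The two pre-training objectives differ mainly in how inputs and targets are built from the same token sequence. The sketch below is a simplification under stated assumptions: the token IDs and `MASK_ID` are made up, every selected position is replaced by the mask token (BERT's actual recipe also keeps or randomizes some of them), and `-100` is used as an "ignore this position" label by convention.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 0                      # hypothetical ID of the [MASK] token
tokens = np.arange(1, 21)        # 20 made-up token IDs

# Causal LM (GPT-style): predict token t+1 from tokens up to t
clm_inputs, clm_targets = tokens[:-1], tokens[1:]

# Masked LM (BERT-style): hide ~15% of positions, predict only those
n_mask = int(round(0.15 * len(tokens)))
masked_pos = rng.choice(len(tokens), size=n_mask, replace=False)
mlm_inputs = tokens.copy()
mlm_inputs[masked_pos] = MASK_ID
mlm_labels = np.full_like(tokens, -100)   # -100 = position not scored
mlm_labels[masked_pos] = tokens[masked_pos]

print("CLM input :", clm_inputs[:5], "-> target:", clm_targets[:5])
print("MLM input :", mlm_inputs)
print("MLM labels:", mlm_labels)
```

Every position contributes to the causal loss, while only the masked positions contribute to the MLM loss.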
Q5: What is the computational bottleneck of Transformers, and how do Flash Attention / Sparse Attention address it?
Answer: Self-attention is O(n²) in sequence length: the attention matrix QK^T is n×n. For n=32K (modern context windows), this is ~1 billion entries per head per layer. Flash Attention addresses the memory bottleneck by tiling the computation to fit in SRAM, avoiding materializing the full n×n matrix in HBM. Sparse Attention (like BigBird, Longformer) addresses the compute bottleneck by having each token attend to only a limited set of other tokens through local windows + global tokens + random attention, which brings the total cost down from O(n²) to roughly O(n); earlier designs such as the Sparse Transformer use O(√n) tokens per query, for O(n√n) total.
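A quick back-of-the-envelope check of these numbers, as a sketch: the helper names are hypothetical, 2 bytes per entry assumes fp16 storage, and the sliding-window mask shows only the local-window part of a sparse pattern.

```python
import numpy as np

def attention_matrix_gib(n, bytes_per_entry=2):
    """Memory for one full n x n attention matrix (one head, one layer)."""
    return n * n * bytes_per_entry / 2**30

def sliding_window_mask(n, window):
    """True where token i may attend to token j, i.e. |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n = 32_768
print(f"entries: {n * n:,}")                            # about 1.07 billion
print(f"fp16 memory: {attention_matrix_gib(n)} GiB")    # per head, per layer

mask = sliding_window_mask(8, window=1)
print(mask.astype(int))
print("allowed pairs:", mask.sum(), "of", mask.size)
```

The number of allowed pairs in the windowed mask grows linearly with n for a fixed window, while the full matrix grows quadratically.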
Q6: What is RLHF and why is it critical for LLMs like ChatGPT?
Answer: RLHF has 3 stages: (1) SFT: Fine-tune the pre-trained model on human-written demonstrations of desired behavior. (2) Reward Model: Collect human rankings of model outputs ("output A is better than B"), train a reward model to predict human preferences. (3) PPO: Use the reward model as a signal to further train the LLM via reinforcement learning (PPO algorithm). Without RLHF, models generate plausible but often unhelpful, harmful, or hallucinated text. RLHF aligns the model with human intent.
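Stage (2) is commonly trained with a pairwise loss: the reward of the preferred output should exceed the reward of the rejected one. The sketch below shows that loss on made-up scalar rewards; the function name is an assumption, and a real reward model would produce these scores from a Transformer.

```python
import numpy as np

def pairwise_preference_loss(r_chosen, r_rejected):
    """Mean of -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log(sigmoid(d)) == log(1 + exp(-d)) == logaddexp(0, -d)
    return float(np.mean(np.logaddexp(0.0, -diff)))

# Made-up rewards for three comparison pairs
agrees = pairwise_preference_loss([2.0, 1.5, 3.0], [0.0, 0.5, 1.0])
disagrees = pairwise_preference_loss([0.0, 0.5, 1.0], [2.0, 1.5, 3.0])
undecided = pairwise_preference_loss([1.0], [1.0])

print(f"ranking matches humans:     {agrees:.3f}")
print(f"ranking contradicts humans: {disagrees:.3f}")
print(f"equal rewards:              {undecided:.3f}")   # log(2)
```

Minimizing this loss pushes the reward gap between preferred and rejected outputs to be positive and large.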
Q7: How does Vision Transformer (ViT) process images?
Answer: ViT treats an image as a sequence of patches: (1) Split image into fixed-size patches (e.g., 16×16). (2) Flatten each patch and project to d_model dimensions (linear embedding). (3) Prepend a learnable [CLS] token. (4) Add learnable position embeddings. (5) Process through a standard Transformer encoder. (6) Use [CLS] output for classification. Key insight: ViT needs large datasets (ImageNet-21K, JFT-300M) to match CNNs; on smaller datasets, CNNs' inductive biases (locality, translation equivariance) give them an advantage.
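Steps (1) to (4) amount to a reshape followed by a matrix multiply. Here is a minimal NumPy sketch, assuming a 224×224 RGB image, 16×16 patches, and a small `d_model` of 64; the random projection, [CLS] vector, and position embeddings stand in for parameters that would be learned.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (grid_h, grid_w, patch, patch, C)
    return x.reshape(-1, patch * patch * c)   # (num_patches, patch*patch*C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3)).astype(np.float32)
d_model = 64

patches = patchify(image, 16)                           # steps (1)-(2a): (196, 768)
W = rng.standard_normal((patches.shape[1], d_model))    # stand-in for learned projection
cls = rng.standard_normal((1, d_model))                 # stand-in for learned [CLS]
tokens = np.concatenate([cls, patches @ W], axis=0)     # steps (2b)-(3): (197, 64)
pos = rng.standard_normal(tokens.shape)                 # stand-in for learned positions
encoder_input = tokens + pos                            # step (4)

print(patches.shape, encoder_input.shape)
```

From here on the encoder treats the 197 vectors exactly like a sentence of 197 token embeddings.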
Q8: What are "emergent abilities" in LLMs? Give examples.
Answer: Emergent abilities are capabilities that appear in large models but are absent in smaller ones; they emerge unpredictably at certain scale thresholds. Examples: (1) Chain-of-thought reasoning: models with >100B params can solve multi-step math when prompted "let's think step by step." (2) Few-shot learning: GPT-3 (175B) can learn tasks from 3-5 examples in the prompt. (3) Code generation: Codex/GPT-4 generate working code from natural language. These abilities are not explicitly trained; they emerge from scale in data, parameters, and compute.
Q9: Explain tokenization in LLMs. Why do BPE/SentencePiece matter?
Answer: Tokenization converts text into subword tokens. BPE (Byte-Pair Encoding) starts with characters and iteratively merges the most frequent adjacent pair. SentencePiece is language-agnostic (works on raw text, no pre-tokenization). Why it matters: (1) vocabulary size affects model size and efficiency; (2) subwords handle rare/new words gracefully ("unhappiness" → "un" + "happi" + "ness"); (3) For Indian languages, poor tokenization means 3-4× more tokens per sentence, wasting context window and increasing compute. This is why Krutrim and IndicBERT use custom tokenizers trained on Indian language data.
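The BPE training loop described above fits in a few lines. This is a toy sketch on a made-up four-word corpus, not the tokenizer of any particular model; the helper names are assumptions, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs over a {symbol-tuple: frequency} vocabulary."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus (made up): word -> frequency, each word starts as characters
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(3):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print("learned merges:", merges)
print("segmented vocabulary:", list(words))
```

A tokenizer trained mostly on English text learns few merges for Devanagari or Tamil sequences, so those scripts stay split into many small pieces, which is the token inflation mentioned in point (3).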
Q10: How would you fine-tune a Transformer model for a low-resource Indian language?
Answer: Strategy: (1) Start with a multilingual model (mBERT, XLM-R, or IndicBERT); (2) Use transfer learning: the model's cross-lingual representations transfer knowledge from high-resource to low-resource languages; (3) Data augmentation: back-translation, code-switching augmentation; (4) Parameter-efficient fine-tuning: LoRA or adapters (only fine-tune ~1% of parameters); (5) Few-shot prompting with LLMs as an alternative to fine-tuning; (6) Active learning to maximize the value of limited labeled data. Key: use IndicBERT/IndicTrans2 over mBERT for Indian languages, since they have better tokenization and more Indian language pre-training data.
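To see where the "~1% of parameters" in point (4) comes from, consider LoRA on a single 768×768 weight matrix. This is a sketch under assumptions: rank 4, random stand-in weights, and no scaling factor; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 4                             # hidden size and LoRA rank (assumed)

W = rng.standard_normal((d, d))           # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection, zero-initialized

x = rng.standard_normal(d)
y = W @ x + B @ (A @ x)                   # adapted layer: W x + B A x

trainable = A.size + B.size
print(f"frozen: {W.size:,}  trainable: {trainable:,}  "
      f"ratio: {trainable / W.size:.2%}")
```

Because `B` starts at zero, the adapted layer initially computes exactly the pre-trained output, and fine-tuning only has to learn the low-rank correction.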
Q11: What is Layer Normalization and why does the Transformer use it instead of Batch Normalization?
Answer: LayerNorm normalizes across the feature dimension (for each token independently), while BatchNorm normalizes across the batch dimension. LayerNorm is preferred because: (1) It's independent of batch size, so it works with batch size 1 during inference; (2) Variable-length sequences make batch statistics unreliable; (3) In NLP, features at the same position across batches don't have consistent semantics (unlike pixels in images); (4) More stable training dynamics for Transformers.
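The difference comes down to which axes the statistics are computed over. A minimal NumPy sketch, omitting the learnable scale and shift and using training-style batch statistics for the BatchNorm side (the function names are assumptions):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector over its own features (last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-6):
    """Normalize each feature over all examples and positions in the batch."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10, 16)) * 3.0 + 5.0   # (batch, seq, features)

# Does example 0 get the same output alone as inside the batch of 4?
ln_same = np.allclose(layer_norm(x)[0], layer_norm(x[:1])[0])
bn_same = np.allclose(batch_norm(x)[0], batch_norm(x[:1])[0])
print("LayerNorm independent of batch:", ln_same)
print("BatchNorm independent of batch:", bn_same)
```

LayerNorm's output for a sequence does not depend on what else is in the batch, which is point (1) in the answer.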
Q12: Design a system to build a chatbot for Indian Railways in Hindi using Transformers.
Answer: Architecture: (1) Retrieval: Use IndicBERT bi-encoder to embed FAQ/knowledge base; retrieve relevant documents for a query using cosine similarity. (2) Generation: Fine-tune a Hindi-capable LLM (Krutrim or Llama-2-Hindi) on railway domain data (FAQs, PNR status responses, complaint templates). (3) RAG Pipeline: Combine retrieval + generation so that the LLM generates responses grounded in retrieved railway documents. (4) Guardrails: Filter harmful outputs, enforce railway-specific terminology. (5) Evaluation: BLEU/ROUGE for generation quality, human evaluation for helpfulness. Deploy on IRCTC with voice input (Whisper for Hindi ASR) + text.
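The retrieval step in (1) reduces to a nearest-neighbour search under cosine similarity. The sketch below uses tiny hand-written 3-dimensional vectors in place of real sentence embeddings; the FAQ titles, vectors, and function name are all made up for illustration.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=2):
    """Return indices and scores of the k documents most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Hypothetical FAQ entries with stand-in embeddings
faq_titles = ["PNR status", "Ticket refund rules", "Tatkal booking", "Lost luggage"]
doc_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([0.8, 0.2, 0.0])   # stand-in embedding of a user question

idx, sims = top_k_cosine(query_vec, doc_vecs, k=2)
for i, s in zip(idx, sims):
    print(f"{faq_titles[i]}: {s:.3f}")
```

In the full pipeline, the retrieved passages would be placed in the LLM prompt so that the generated answer is grounded in them.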
Research Problems
Research Problem 1: Efficient Attention for Indian Language Documents
Problem: Indian language text produces 3-4× more tokens than English (due to suboptimal tokenization). This makes the O(n²) attention cost even more prohibitive. Design and evaluate an attention mechanism that combines (1) a custom Indian-language-optimized tokenizer, (2) sparse attention patterns (local + global), and (3) Flash Attention tiling for efficient processing of long Hindi/Tamil documents.
Evaluation: Compare against mBERT on IndicGLUE benchmarks while measuring wall-clock time, memory usage, and maximum processable sequence length.
Research Problem 2: Cross-Lingual Transfer Without Parallel Data
Problem: Can we train a Transformer that transfers NLP capabilities from English (high-resource) to languages like Gondi or Bodo (extremely low-resource, <10K sentences) without any parallel data? Investigate unsupervised cross-lingual representation learning using shared subword vocabularies, transliteration bridges (Devanagari-Latin), and self-supervised alignment objectives.
Research Problem 3: Mixture-of-Experts for Multilingual Efficiency
Problem: India has 22 official languages with very different structures (Indo-Aryan vs. Dravidian). A single dense Transformer wastes capacity by activating all parameters for every language. Design a Mixture-of-Experts (MoE) Transformer where different experts specialize in different language families. Investigate routing strategies, load balancing, and whether language-family-aware routing outperforms learned routing on IndicGLUE.
Research Problem 4: Interpretability of Attention in Medical NLP
Problem: Transformers are increasingly used for medical text analysis (radiology reports, clinical notes) in Indian hospitals. However, attention weights do not always correlate with feature importance. Develop methods to explain Transformer predictions on Indian medical records, comparing attention visualization, gradient-based attribution, and SHAP. Validate explanations with domain expert doctors.
Key Takeaways
- Attention = Soft Database Lookup: The core mechanism computes Query-Key similarity to produce weighted combinations of Values. The formula Attention(Q,K,V) = softmax(QK^T/√d_k)V is the foundation of all modern AI.
- √d_k Scaling is Critical: Without it, dot product variance grows as d_k, causing softmax saturation and vanishing gradients. This is a first-principles variance normalization.
- Multi-Head = Multiple Perspectives: Running h parallel attention heads (each with d_k = d_model/h) captures different types of relationships. Concatenation + projection combines them.
- Positional Encoding Injects Order: Self-attention has no built-in notion of order (it is permutation-equivariant). Sinusoidal encodings (or learned embeddings) give the model position information. The rotation property enables relative position reasoning.
- Encoder understands, Decoder generates: BERT (encoder-only) excels at classification/NER/QA. GPT (decoder-only) excels at text generation. T5 (encoder-decoder) excels at transformation tasks.
- Scale Unlocks Emergent Abilities: LLMs show capabilities (chain-of-thought, few-shot learning) that only emerge at scale, typically 100B+ parameters. This is why the race to scale continues.
- RLHF Aligns Capability with Intent: Pre-training gives capability; RLHF aligns it with human preferences. Without alignment, powerful models produce harmful or unhelpful outputs.
- O(n²) is the Achilles Heel: Self-attention's quadratic cost limits context length. Flash Attention (IO-aware), Sparse Attention, and Linear Attention are active research areas to address this.
- Indian Languages Need Custom Solutions: Standard tokenizers spend 3-4× as many tokens on Indian scripts. Projects like AI4Bharat IndicBERT, Krutrim, and Bhashini build India-specific solutions with custom tokenizers and training data.
- Transformers Have Won (For Now): From NLP to vision (ViT) to speech (Whisper) to protein folding (AlphaFold), Transformers dominate. State-space models (Mamba) are the primary challenger.
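The first two takeaways can be condensed into a few lines of NumPy. This is a single-head sketch without learned projections; the function name, shapes, and random inputs are assumptions for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V for one head; returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Block attention to future positions (strictly above the diagonal)
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 tokens, d_k = 8
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))

out, w = scaled_dot_product_attention(Q, K, V)
out_c, w_c = scaled_dot_product_attention(Q, K, V, causal=True)
print("bidirectional weights:\n", w.round(2))
print("causal weights:\n", w_c.round(2))
```

Each row of the weight matrix is the "soft lookup": a probability distribution over keys that decides how the values are averaged. With the causal mask, the first token can only look at itself.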
References
Foundational Papers
- Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS. The paper that started the revolution.
- Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL. Encoder-only pre-training for NLP understanding.
- Radford, A. et al. (2018/2019). "Improving Language Understanding by Generative Pre-Training" (GPT-1/2). OpenAI. Decoder-only pre-training for generation.
- Brown, T. et al. (2020). "Language Models are Few-Shot Learners" (GPT-3). NeurIPS. In-context learning at scale.
- Dosovitskiy, A. et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR. Vision Transformer (ViT).
Scaling & Alignment
- Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." OpenAI. Power-law relationships between scale and performance.
- Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback" (InstructGPT). OpenAI. RLHF methodology.
- Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." Meta AI; Llama Team (2024). "The Llama 3 Herd of Models." Meta AI. Open-source LLMs.
- OpenAI (2023). "GPT-4 Technical Report." Multimodal foundation model.
Indian AI & Efficient Attention
- Kakwani, D. et al. (2020). "IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages." Findings of EMNLP. AI4Bharat/IIT Madras. IndicBERT and IndicCorp.
- Ramesh, G. et al. (2022). "Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages." Foundation for Indian NMT.
- Dao, T. et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS. IO-aware attention.
- Beltagy, I. et al. (2020). "Longformer: The Long-Document Transformer." Sparse attention for long documents.
Textbooks & Resources
- Jurafsky, D. & Martin, J. (2024). "Speech and Language Processing." 3rd Ed. Ch. 9-10: Transformers.
- Tunstall, L. et al. (2022). "Natural Language Processing with Transformers." O'Reilly. Practical guide with HuggingFace.
- The Illustrated Transformer: Jay Alammar's blog. Outstanding visual explanations.
- Andrej Karpathy, "Let's build GPT: from scratch, in code, spelled out." YouTube lecture. Best hands-on introduction.