EduArtha Interactive Books
Secrets of AI Models
Build models from scratch, fine-tune on custom data & master transfer learning across Image, Video, Audio, Reasoning, Coding, Language, Cyber Security & Biological models.
Welcome to the Secrets of AI Models
Learning Objectives
- Understand the foundational technology and mathematics behind modern AI models.
- Learn how to create an AI model completely from scratch.
- Master knowledge transfer methods to utilize and fine-tune existing models on your own data.
- Explore deployment strategies with practical examples and code for various domains.
Why This Book?
The field of Artificial Intelligence is evolving at a breathtaking pace. However, many resources either skim the surface with high-level APIs or get bogged down in heavy mathematical theory without providing practical implementation details. This book bridges that gap.
Our purpose is simple: even a complete beginner will be able to understand the core technology and mathematics behind AI models, learn how to transfer knowledge between models, and deploy them successfully with real code examples.
What You Will Learn & The Optimal Learning Path
We will journey through the entire lifecycle of an AI model. However, the order in which you learn these domains drastically impacts how easily you grasp complex concepts. Here is our recommended learning path:
Comprehensive Coverage of AI Domains
Following this path, we'll dive deep into specific architectures tailored for diverse tasks:
- 1. Foundations: You must understand backpropagation and PyTorch first.
- 2. Image Models: The first domain to tackle once the foundations are in place. Images are highly visual, making it intuitive to learn how neural networks extract features (CNNs, Vision Transformers, Diffusion).
- 3 & 4. Sequence & Language Models: Add the dimension of "time/order" (RNNs) leading into revolutionary Attention and Transformers (BERT, GPT).
- 5. Audio Models: Bridges the gap by converting sound into images (Spectrograms) and processing them as sequences (Speech-to-Text).
- 6. Video Models: Combines spatial Image knowledge with temporal Sequence knowledge (3D CNNs).
- 7 & 8. Advanced/Specialized: Fine-tuning Language Models for complex logic (Reasoning, Coding) or domain-specific tasks (Cyber Security, Biology).
Creating from Scratch vs. Transfer Learning
Sometimes you need a bespoke model built entirely from scratch to handle highly unique constraints. Other times, it is far more efficient to take an existing powerhouse model (like ResNet, BERT, or LLaMA) and fine-tune it on your own dataset. We will cover both approaches:
The Power of Transfer Learning
Transfer learning is the magic that allows a model trained on millions of data points to become an expert at identifying specific patterns (like dog breeds or medical anomalies) with only a few hundred examples. We will demystify the math behind freezing layers, updating weights, and fine-tuning.
A Sneak Peek into Code and Mathematics
We don't just talk about theory. Every concept is backed by the underlying mathematics and a concrete code example. For instance, here is how simply you can start a transfer learning process using PyTorch:
Python
import torch
import torchvision.models as models
import torch.nn as nn
# 1. Load a pre-trained ResNet model
# (the weights= argument replaces the deprecated pretrained=True)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# 2. Freeze all base layers (stop gradients from updating them)
for param in model.parameters():
    param.requires_grad = False
# 3. Replace the final layer to match our custom dataset (e.g., 5 classes)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 5)
# Now the model is ready to be trained on your own data!
This is just the beginning. Prepare to unlock the secrets behind the most powerful algorithms of our time.
PyTorch & Torchvision: The Engine of AI
Learning Objectives
- Understand the core definitions of PyTorch and Torchvision.
- Discover the critical need for these frameworks in modern AI models.
- Write practical code demonstrating tensors, autograd, and dataset loading.
- Explore essential references for further reading.
What is PyTorch?
PyTorch is an open-source machine learning framework originally developed by Meta's AI Research lab (FAIR) and now governed by the PyTorch Foundation. At its core, it provides two high-level features:
- Tensor computing (like NumPy) with strong acceleration via Graphics Processing Units (GPUs).
- Deep neural networks built on a tape-based autograd system (Automatic Differentiation).
What is Torchvision?
Torchvision is a companion library to PyTorch specifically designed for Computer Vision. It provides access to popular datasets (like ImageNet, MNIST), model architectures (like ResNet, VGG, Vision Transformers), and common image transformations for data augmentation.
The Need for PyTorch in AI Models
You might wonder: Why can't I just build AI models using standard Python or C++? The answer lies in the immense computational complexity of modern AI. Here is why PyTorch is an absolute necessity:
1. Automatic Differentiation (Autograd) & The Tape-Based System
Training a model involves calculus (the Chain Rule) to calculate how much each weight should change to reduce errors (Backpropagation). If a model has billions of parameters, calculating these derivatives manually is impossible. PyTorch solves this using a tape-based autograd system.
Imagine a tape recorder running while you do math. As you perform operations on tensors, PyTorch "records" each operation and its inputs on a tape (a dynamic graph). When you finish the forward pass and call .backward(), PyTorch plays the tape in reverse. It applies the chain rule backward from the output to the inputs, computing gradients for every parameter instantly. Once played, the tape is discarded and a new one is recorded for the next pass, allowing for highly dynamic, on-the-fly model architectures.
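To make the tape idea concrete, here is a minimal pure-Python sketch. It is a teaching toy, not how PyTorch is implemented internally: the names Var, tape, mul, sub, square, and run_backward are all hypothetical. Each operation appends a small "how to push gradients back" function to a list, and the backward pass simply replays that list in reverse.

```python
# Toy scalar "tape" autograd. A teaching sketch only; PyTorch's real engine differs.
class Var:
    """A scalar value plus a slot for its accumulated gradient."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

tape = []  # each entry pushes the gradient of one output back to its inputs

def mul(a, b):
    out = Var(a.value * b.value)
    def step():
        a.grad += b.value * out.grad   # d(a*b)/da = b
        b.grad += a.value * out.grad   # d(a*b)/db = a
    tape.append(step)
    return out

def sub(a, b):
    out = Var(a.value - b.value)
    def step():
        a.grad += out.grad             # d(a-b)/da = 1
        b.grad -= out.grad             # d(a-b)/db = -1
    tape.append(step)
    return out

def square(a):
    out = Var(a.value ** 2)
    def step():
        a.grad += 2 * a.value * out.grad   # d(a^2)/da = 2a
    tape.append(step)
    return out

def run_backward(loss):
    """Play the tape in reverse, then discard it."""
    loss.grad = 1.0
    for step in reversed(tape):
        step()
    tape.clear()

# Forward pass records three operations: loss = (w * x - y)^2
w, x, y = Var(1.0), Var(2.0), Var(10.0)
loss = square(sub(mul(w, x), y))
run_backward(loss)
print(w.grad)  # -32.0, i.e. 2 * (w*x - y) * x
```

Note that the tape is empty again after run_backward, mirroring the "record, replay, discard" cycle described above.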
2. GPU Acceleration
Training a model requires an enormous number of matrix multiplications, and CPUs process them slowly. Calling .to("cuda") on a tensor or model moves it to the GPU, where thousands of cores run the math in parallel, which can turn days of training into hours.
3. Dynamic Computational Graphs
Unlike older frameworks that used static graphs (where you had to define the entire network before running data through it), PyTorch builds the graph dynamically. This allows you to use standard Python if statements and for loops inside your neural network, making debugging extremely intuitive.
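As a small illustration of a dynamic graph, the sketch below uses an ordinary Python loop and an if statement inside forward. DynamicNet and its repeats argument are hypothetical names invented for this example; the point is only that the graph is rebuilt on every call, so the same model can run a different number of steps each time.

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    """Hypothetical toy model: reuses one layer a caller-chosen number of times."""
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)

    def forward(self, x, repeats):
        for _ in range(repeats):      # ordinary Python loop
            x = torch.relu(self.layer(x))
        if x.sum() > 0:               # ordinary Python branch on a tensor value
            x = x * 2
        return x

torch.manual_seed(0)
net = DynamicNet()
inp = torch.randn(2, 4)

out1 = net(inp, repeats=1)   # graph with one layer application
out3 = net(inp, repeats=3)   # a different, deeper graph from the same code

# Autograd follows whichever path was actually executed
out3.sum().backward()
print(out1.shape, out3.shape)
print(net.layer.weight.grad.shape)
```

Because the control flow is plain Python, you can set a breakpoint or add a print statement inside forward exactly as you would in any other script.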
Understanding Backpropagation & The Chain Rule
To truly understand how AI learns, we must peek under the hood of backpropagation. At its core, backpropagation is just the repeated application of the Chain Rule from calculus.
Let's define a very simple model with one input \( x \), one weight \( w \), and a target value \( y \). The model's prediction is \( \hat{y} = w \times x \). We measure the error using the squared error, which is what Mean Squared Error (MSE) reduces to for a single example: \( L = (\hat{y} - y)^2 \).
The Manual Math
Our goal is to find out how a tiny change in our weight \( w \) affects our loss \( L \). In calculus, this is the derivative \( \frac{\partial L}{\partial w} \). Using the Chain Rule, we break this down from the output backward to the weight: \( \frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial w} \).
Let's calculate the two parts:
- The derivative of the loss with respect to the prediction: \( \frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y) \)
- The derivative of the prediction with respect to the weight: \( \frac{\partial \hat{y}}{\partial w} = x \)
Multiplying them together gives our final gradient: \( \frac{\partial L}{\partial w} = 2(\hat{y} - y) \times x \). We then update our weight by taking a small step in the opposite direction of this gradient: \( w_{new} = w_{old} - (\text{learning\_rate} \times \text{gradient}) \).
Achieving this in Code: Manual vs. PyTorch
Below, we implement this exact math manually in pure Python, and then show how PyTorch's Autograd engine achieves the exact same result automatically, scaling to billions of parameters without breaking a sweat.
Python
# --- 1. The Manual Way (Hardcoding the Calculus) ---
x = 2.0 # Input
y = 10.0 # Target output
w = 1.0 # Initial weight
lr = 0.01 # Learning rate
# Forward pass
y_hat = w * x
loss = (y_hat - y) ** 2
# Backward pass (Manual Calculus)
dL_dyhat = 2 * (y_hat - y)
dyhat_dw = x
gradient = dL_dyhat * dyhat_dw
# Update weight
w_new = w - lr * gradient
print(f"Manual Gradient: {gradient}, New Weight: {w_new}")
# --- 2. The PyTorch Way (Automatic Differentiation) ---
import torch
x_t = torch.tensor(2.0)
y_t = torch.tensor(10.0)
w_t = torch.tensor(1.0, requires_grad=True) # Track this!
# Forward pass
y_hat_t = w_t * x_t
loss_t = (y_hat_t - y_t) ** 2
# Backward pass (PyTorch does the calculus!)
loss_t.backward()
gradient_t = w_t.grad
# Update weight
with torch.no_grad():
    w_new_t = w_t - lr * gradient_t
print(f"PyTorch Gradient: {gradient_t.item()}, New Weight: {w_new_t.item()}")
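Two quick sanity checks can build confidence in the math above. The sketch below, in plain Python, first compares the analytic gradient with a finite-difference estimate, then repeats the update rule many times to show the weight settling at the value that makes the prediction match the target. The helper names loss_fn and grad_fn, the step count, and the epsilon are choices made for this illustration.

```python
def loss_fn(w, x=2.0, y=10.0):
    """Squared error for the one-weight model y_hat = w * x."""
    return (w * x - y) ** 2

def grad_fn(w, x=2.0, y=10.0):
    """Analytic gradient from the chain rule: 2 * (y_hat - y) * x."""
    return 2 * (w * x - y) * x

# 1. Finite-difference check: nudge w slightly and measure how the loss changes
w, eps = 1.0, 1e-6
numeric = (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)
analytic = grad_fn(w)
print(f"analytic={analytic:.4f}, numeric={numeric:.4f}")

# 2. Repeat the update rule: w_new = w_old - learning_rate * gradient
lr = 0.01
for step in range(200):
    w = w - lr * grad_fn(w)
print(f"w after 200 steps: {w:.4f}")  # approaches 5.0, since 5.0 * 2.0 == 10.0
```

A single update only nudges the weight from 1.0 to 1.32; it is the repetition of the same step that drives the loss toward zero.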
Code Example: Tensors, Autograd & Torchvision
Let's see Torchvision and GPU capabilities in action. The following code demonstrates moving a tensor to the GPU and using Torchvision to load a pre-trained model.
Python
import torch
from torchvision import datasets, transforms, models
# --- 1. Autograd & GPU Example ---
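# Pick the GPU if available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
# Create the tensor directly on that device and tell PyTorch to track its gradients.
# (Calling .to('cuda') on a tracked tensor returns a new non-leaf tensor, so x.grad would stay None.)
x = torch.tensor([2.0, 3.0], device=device, requires_grad=True)
# Perform a mathematical operation
y = x ** 2 + 5
output = y.mean()
# Calculate gradients automatically!
output.backward()
print(f"Gradients of x: {x.grad}") # d(mean(x^2 + 5))/dx = x, so the values are [2., 3.]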
# --- 2. Torchvision Example ---
# Load a pre-trained ResNet18 model for image classification
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Define image transformations for incoming data
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Ready to process images!
print("ResNet18 loaded successfully.")
References & Further Reading
- Official PyTorch Documentation
- Torchvision Documentation
- Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019.
Image Models: Teaching Computers to See
Learning Objectives
- Understand Convolutions and how they extract spatial features.
- Build a Convolutional Neural Network (CNN) from scratch in PyTorch.
- Write a complete training loop to classify images.
- Gain the ability to create your own custom image classifier.
How to Run the Code in This Book
Before we dive into the math and architecture, you need to know how to actually execute the Python code provided in these chapters. You have two main options:
- The Easy Way (Google Colab - Recommended): Go to Google Colab, create a new notebook, and simply copy/paste the code blocks into the cells. Colab gives you a free coding environment in your browser, including free (usage-limited) access to GPUs that can train your models far faster than a typical laptop CPU. To use one, select a GPU under Runtime > Change runtime type.
- The Local Way (Your Machine): Install Python on your computer, then install PyTorch by running pip install torch torchvision in your terminal. You can then save the full code blocks into a file (e.g., model.py) and run it via your terminal using python model.py.
Tip: For Image and Video models, always try to use Google Colab or a machine with a dedicated GPU. Training deep neural networks on a standard laptop CPU can take hours instead of minutes!
How Do We "See" an Image?
To a computer, an image is just a grid of numbers (pixels). If it's a grayscale image, it's a 2D grid. If it's colored, it's 3D (Red, Green, Blue channels). To recognize a cat, the network needs to find edges, which form shapes, which form ears, which form a cat.
We achieve this using Convolutions. A convolution is a mathematical operation where a small matrix (a filter or kernel) slides over the image. At each position, it multiplies its values with the pixels underneath it and sums the results into a single number. A large number means the patch of pixels matches the pattern the filter encodes, such as a vertical line or a curve.
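The sliding-window idea can be sketched in a few lines of plain Python. This is an illustration only: the conv2d helper, the 5x5 toy image, and the hand-written vertical-edge kernel are assumptions made for the example, whereas in a real CNN the kernel values are learned. Like deep learning libraries, the sketch slides the kernel without flipping it.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Multiply the kernel with the pixels underneath it and sum the products
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# 5x5 grayscale image: dark (0) on the left, bright (1) on the right
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-written kernel that responds to dark-to-bright vertical edges
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)  # each row is [3, 3, 0]: strong response at the edge, none in the flat region
```

The output is 3x3 rather than 5x5 because no padding is used; the padding=1 argument in the CNN code is what keeps the feature map the same size as the input.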
Building a CNN from Scratch
Let's build a real Convolutional Neural Network (CNN) from scratch. We will define a network that can classify 32x32 color images (like the famous CIFAR-10 dataset) into 10 categories (cars, cats, dogs, etc.).
In PyTorch, we build models by creating a class that inherits from nn.Module. We define our layers in the __init__ function, and we define how data flows through them in the forward function.
Python
import torch
import torch.nn as nn
import torch.nn.functional as F
class MyImageClassifier(nn.Module):
    def __init__(self):
        super(MyImageClassifier, self).__init__()
        # 1. Feature Extraction Layers (Convolutions)
        # input channels (3 for RGB), output channels (16 filters), kernel_size (3x3)
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        # Max Pooling layer to reduce the image size by half
        self.pool = nn.MaxPool2d(2, 2)
        # 2. Classification Layers (Fully Connected)
        # After two max pools, a 32x32 image becomes 8x8.
        # 32 channels * 8 * 8 = 2048
        self.fc1 = nn.Linear(32 * 8 * 8, 512)
        self.fc2 = nn.Linear(512, 10) # 10 output classes

    def forward(self, x):
        # Pass through conv1 -> ReLU activation -> Max Pool
        x = self.pool(F.relu(self.conv1(x)))
        # Pass through conv2 -> ReLU activation -> Max Pool
        x = self.pool(F.relu(self.conv2(x)))
        # Flatten the 3D tensor into a 1D vector for the linear layers
        x = x.view(-1, 32 * 8 * 8)
        # Pass through the fully connected layers
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Instantiate the model
model = MyImageClassifier()
print(model)
What is ReLU and Pooling?
ReLU (Rectified Linear Unit): Converts all negative numbers to zero. This introduces non-linearity, allowing the network to learn complex patterns instead of just straight lines.
Max Pooling: Downsamples the image by sliding a window and keeping only the maximum value. This reduces computational load and makes the network more tolerant of slight shifts in the image.
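Here is a tiny numeric illustration of both operations, using a hand-written 4x4 input so the effect is easy to follow by eye. The input values are arbitrary and chosen only for this sketch.

```python
import torch
import torch.nn.functional as F

# One image, one channel, 4x4 pixels: shape (batch=1, channels=1, height=4, width=4)
x = torch.tensor([[[[-1.0,  2.0,  0.5, -3.0],
                    [ 4.0, -2.0,  1.0,  0.0],
                    [-5.0,  6.0, -1.0,  2.0],
                    [ 0.0,  1.0,  3.0, -4.0]]]])

# ReLU: every negative value becomes 0, positives pass through unchanged
activated = F.relu(x)

# Max pooling: keep the largest value in each non-overlapping 2x2 window
pooled = F.max_pool2d(activated, kernel_size=2, stride=2)

print(activated.squeeze())
print(pooled.squeeze())   # tensor([[4., 1.], [6., 3.]])
print(pooled.shape)       # torch.Size([1, 1, 2, 2]), half the height and width
```

Moving the 4.0 or the 6.0 to a neighbouring pixel inside the same 2x2 window would leave the pooled output unchanged, which is where the tolerance to small shifts comes from.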
Training Your Model
Now that you have built the model, you need to train it! Here is the standard PyTorch training loop that you will use for almost every project.
Python
import torch.optim as optim
# 1. Define Loss function and Optimizer
criterion = nn.CrossEntropyLoss() # Standard for classification
optimizer = optim.Adam(model.parameters(), lr=0.001)
epochs = 5
# 2. The Training Loop
for epoch in range(epochs):
    running_loss = 0.0
    for images, labels in trainloader: # (Assuming trainloader is defined)
        # Step A: Zero the gradients!
        optimizer.zero_grad()
        # Step B: Forward pass (Predict)
        outputs = model(images)
        # Step C: Calculate the loss (Error)
        loss = criterion(outputs, labels)
        # Step D: Backward pass (Calculate gradients)
        loss.backward()
        # Step E: Optimize (Update weights)
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1} completed. Loss: {running_loss/len(trainloader)}")
🚀 You are now an AI Developer!
Congratulations! You have just learned the underlying theory of convolutions, built an entire deep learning architecture from scratch using object-oriented PyTorch, and written the training loop required to teach it how to see.
You can now swap out the dataset in the DataLoader, and this exact code will train a custom AI model to distinguish between your own custom images (e.g., classifying X-rays or detecting defective parts on an assembly line).
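As a rough sketch of what "swapping the dataset" looks like, the snippet below wraps stand-in tensors in a DataLoader. The random images, the three made-up classes, and the names custom_set and custom_loader are assumptions for illustration; with real photos stored in one folder per class, torchvision's datasets.ImageFolder plays the same role.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (assumption): 100 random 32x32 RGB "images" from 3 made-up classes
images = torch.randn(100, 3, 32, 32)
labels = torch.randint(0, 3, (100,))

# Any Dataset can be handed to a DataLoader; the training loop itself does not change
custom_set = TensorDataset(images, labels)
custom_loader = DataLoader(custom_set, batch_size=16, shuffle=True)

batch_images, batch_labels = next(iter(custom_loader))
print(batch_images.shape)   # torch.Size([16, 3, 32, 32])
print(batch_labels.shape)   # torch.Size([16])
print(len(custom_loader))   # 7 batches: six of 16 images and a final one of 4
```

One thing does need to change with the data: the output size of the model's last Linear layer must equal the number of classes in your dataset (3 here instead of 10).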
Putting It All Together: The Full Runnable Code
Here is the complete, self-contained Python script. You can copy and paste this into any Python file (like train_image_model.py) or Jupyter Notebook and run it. It will automatically download the CIFAR-10 dataset, build your custom CNN, and train it for 5 epochs!
Python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# 1. Prepare Data (with Data Augmentation!)
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(), # Prevent overfitting
    transforms.RandomRotation(10), # Prevent overfitting
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=64, shuffle=True)
# 2. Define the CNN Architecture (with BatchNorm & Dropout!)
class MyImageClassifier(nn.Module):
    def __init__(self):
        super(MyImageClassifier, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(16) # Stabilizes learning
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(32) # Stabilizes learning
        self.pool = nn.MaxPool2d(2, 2)
        self.dropout = nn.Dropout(0.25) # 25% chance to drop neurons
        self.fc1 = nn.Linear(32 * 8 * 8, 512)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        # Add BatchNorm before the ReLU activation
        x = self.pool(F.relu(self.bn1(self.conv1(x))))
        x = self.pool(F.relu(self.bn2(self.conv2(x))))
        x = x.view(-1, 32 * 8 * 8)
        x = self.dropout(x) # Apply dropout before fully connected layers
        x = F.relu(self.fc1(x))
        x = self.dropout(x) # Apply dropout again
        x = self.fc2(x)
        return x

# Use the GPU when one is available; otherwise the script still runs on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MyImageClassifier().to(device)
# 3. Define Loss & Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# 4. Train the Model
print("Starting training...")
for epoch in range(5):
    running_loss = 0.0
    for i, (images, labels) in enumerate(trainloader):
        # Each batch must live on the same device as the model
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if (i+1) % 200 == 0: # Print every 200 mini-batches
            print(f"Epoch [{epoch+1}/5], Step [{i+1}/{len(trainloader)}], Loss: {running_loss/200:.4f}")
            running_loss = 0.0
print("Finished Training! You've built an AI!")
Future Improvements for Your Model
While the CNN we built is a fantastic starting point, real-world AI models use a few extra techniques to achieve state-of-the-art accuracy. Once you have the basic code running, try implementing these improvements to push your model further:
- Data Augmentation: A model learns better if it sees variations of the data. You can add transforms.RandomHorizontalFlip() or transforms.RandomRotation(10) to your dataset preparation. This artificially expands your dataset and prevents the model from just memorizing the images (overfitting).
- Dropout & Batch Normalization: Adding nn.Dropout() randomly turns off some neurons during training, forcing the network to learn robust, generalized features. Adding nn.BatchNorm2d() stabilizes the learning process and drastically speeds up convergence.
- Going Deeper: Our model only has two convolutional layers. You can add a third or fourth layer (e.g., extracting 64 or 128 filters) to allow the network to learn much more complex spatial hierarchies (like specific textures and full object shapes).
- Transfer Learning: Instead of building a CNN entirely from scratch, import a pre-trained powerhouse like ResNet-50 or EfficientNet directly from Torchvision. Because it has already learned to "see" by looking at millions of images, you just need to replace the final Linear layer and fine-tune it on your custom dataset!
Going Deeper: A 4-Layer CNN Architecture
To demonstrate what a deeper model looks like in practice, here is a separate code block showing an expanded architecture. This model uses 4 convolutional layers (progressing from 16 to 128 filters) to learn much more complex spatial hierarchies. We also adjust the fully connected layers to handle the heavily pooled image dimensions.
Python
import torch
import torch.nn as nn
import torch.nn.functional as F
class DeepImageClassifier(nn.Module):
    def __init__(self):
        super(DeepImageClassifier, self).__init__()
        # Layer 1: 3 channels -> 16 filters
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(16)
        # Layer 2: 16 filters -> 32 filters
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(32)
        # Layer 3: 32 filters -> 64 filters
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        # Layer 4: 64 filters -> 128 filters
        self.conv4 = nn.Conv2d(64, 128, 3, padding=1)
        self.bn4 = nn.BatchNorm2d(128)
        self.pool = nn.MaxPool2d(2, 2)
        self.dropout = nn.Dropout(0.3)
        # After 4 max pools, a 32x32 image becomes 2x2.
        # 128 channels * 2 * 2 = 512 flattened features
        self.fc1 = nn.Linear(128 * 2 * 2, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        # Forward pass through all 4 layers with Pooling
        x = self.pool(F.relu(self.bn1(self.conv1(x))))
        x = self.pool(F.relu(self.bn2(self.conv2(x))))
        x = self.pool(F.relu(self.bn3(self.conv3(x))))
        x = self.pool(F.relu(self.bn4(self.conv4(x))))
        # Flatten for linear layers
        x = x.view(-1, 128 * 2 * 2)
        x = self.dropout(x)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = DeepImageClassifier()
print("Deep CNN Built Successfully!")
What happens as we progress from 16 to 128 filters?
You might wonder why we increase the number of filters deeper in the network. This is the secret to deep learning's power! Here is exactly what each layer is doing:
- Layer 1 (16 filters): Looks at raw pixels and learns very simple, low-level features like horizontal lines, vertical lines, and color blobs.
- Layer 2 (32 filters): Combines the lines from Layer 1 to learn intermediate shapes, like circles, corners, and sharp edges.
- Layer 3 (64 filters): Combines shapes from Layer 2 to learn complex textures and object parts (like a car tire or a dog's ear).
- Layer 4 (128 filters): Combines parts from Layer 3 to recognize highly complex, high-level concepts (like a full car or an entire dog face).
By making the network deeper and systematically increasing the filter count, we give the AI the mathematical "brain capacity" to understand incredibly complex spatial hierarchies!
How Many Layers Should a Model Have?
A very common question when building neural networks is: "How do I decide how many layers to use?"
The truth is, there is no single mathematical formula to calculate the perfect number of layers. It is an empirical science (trial and error) guided by a few core principles:
- The Complexity of the Task: If you are classifying simple black-and-white handwritten digits (like the MNIST dataset), a 2-layer CNN is plenty. If you are trying to differentiate between 1,000 different breeds of dogs in high-resolution images, you will need a much deeper network (like a 50-layer ResNet) to learn all the complex fur textures and facial structures.
- The Size of Your Dataset: Deep networks have millions of parameters (weights). If you have a tiny dataset (e.g., 500 images) and you use a massive 100-layer network, the model will just instantly memorize the exact pixels of the training data and fail completely in the real world (this is called overfitting). Deep networks require massive amounts of data.
- Computational Resources: Every time you add a layer, the model takes longer to train and requires more GPU memory (VRAM). You must balance the depth of your model with the hardware you have available.
- The Golden Rule (Transfer Learning First): In modern AI development, you rarely guess how many layers to build from scratch. The industry standard is to start with a pre-trained model whose depth has been validated empirically on large benchmarks (like ResNet-18 or ResNet-50). You only build from scratch if you are dealing with a completely novel type of data, or if you need a tiny, highly-optimized custom model for a mobile device.
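To get a feel for how depth translates into size, the arithmetic for the 4-layer CNN shown earlier can be worked through in plain Python. This is a back-of-the-envelope sketch: the helper names are made up for the example, and the small BatchNorm parameter counts are deliberately left out.

```python
def conv_params(in_ch, out_ch, k=3):
    """Weights (k * k * in_ch per filter) plus one bias per filter."""
    return (k * k * in_ch + 1) * out_ch

def linear_params(in_features, out_features):
    """One weight per input-output pair plus one bias per output."""
    return (in_features + 1) * out_features

channels = [3, 16, 32, 64, 128]   # filter progression of the 4-layer CNN
size = 32                          # CIFAR-10 images are 32x32
conv_total = 0

for in_ch, out_ch in zip(channels, channels[1:]):
    params = conv_params(in_ch, out_ch)
    size //= 2                     # each 2x2 max pool halves height and width
    conv_total += params
    print(f"conv {in_ch:>3} -> {out_ch:>3}: {params:>6} params, feature map {size}x{size}")

flattened = channels[-1] * size * size                  # 128 * 2 * 2 = 512
fc_total = linear_params(flattened, 256) + linear_params(256, 10)
total = conv_total + fc_total

print(f"flattened features: {flattened}")
print(f"conv params: {conv_total}, fully connected params: {fc_total}, total: {total}")
```

Each added convolutional layer costs several times more parameters than the one before it, which is why depth, dataset size, and available GPU memory have to be weighed together.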
The Golden Rule in Action: Transfer Learning Code
As mentioned above, the true industry standard is Transfer Learning. Why spend days training a model from scratch when giants have already trained massive models on supercomputers?
Here is a complete, runnable script showing how to take a pre-trained ResNet-50 (which already knows how to see shapes, textures, and objects) and fine-tune it for our custom 10-class dataset in just a few lines of code!
Python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
# 1. Prepare Data (ResNet-50 was pre-trained on 224x224 images, so we upscale CIFAR-10's 32x32 images to match)
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=32, shuffle=True)
# 2. Load Pre-Trained ResNet-50
print("Downloading ResNet-50 weights...")
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
# 3. Freeze the Base Layers (So we don't destroy its pre-trained knowledge)
for param in model.parameters():
    param.requires_grad = False
# 4. Replace the Final Layer (The "Head")
# ResNet-50 was trained on 1000 classes. We only have 10 classes in CIFAR-10.
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 10) # New layer! requires_grad is True by default.
# Move the model to the GPU when one is available (ResNet-50 is very slow on a CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# 5. Define Loss & Optimizer (Only optimizing the new final layer!)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
# 6. Train the Model
print("Starting Transfer Learning...")
for epoch in range(2): # Needs far fewer epochs because it's already smart!
    running_loss = 0.0
    for i, (images, labels) in enumerate(trainloader):
        # Each batch must live on the same device as the model
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if (i+1) % 50 == 0:
            print(f"Epoch [{epoch+1}/2], Step [{i+1}/{len(trainloader)}], Loss: {running_loss/50:.4f}")
            running_loss = 0.0
print("Transfer Learning Complete!")
Vision Transformers (ViT): The Convolution Killer
For almost a decade, CNNs completely dominated Computer Vision. But in 2020, researchers at Google Brain asked a radical question: "What if we treat an image exactly like a sentence of text?"
This led to the creation of the Vision Transformer (ViT). Instead of sliding a mathematical convolution filter over an image to look for edges, a ViT does the following:
- Chop it up: It slices the image into a grid of non-overlapping "patches" (e.g., 16x16 pixel squares).
- Flatten it: It flattens each patch into a single 1D vector (just like turning a word into a text token).
- Add Position: It adds a positional number to each patch so the AI remembers where the patch came from (top-left vs bottom-right).
- Apply Attention: It feeds all these patches into a standard Transformer Encoder (the same building block behind language models like BERT, and a close relative of the decoder-style Transformers that power LLMs like ChatGPT). The Transformer uses "Self-Attention" to look at all patches simultaneously and figure out how they relate to one another.
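The four steps above can be sketched with plain tensor operations. This is a simplified illustration, not a full ViT: it skips the learned linear projection and the class token, uses all-zero position embeddings as a placeholder for learned ones, and runs a single randomly initialized encoder layer. All variable names are assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
image = torch.randn(1, 3, 224, 224)   # one random stand-in RGB image
patch = 16

# 1. Chop it up: cut height and width into non-overlapping 16x16 windows
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)
print(patches.shape)   # torch.Size([1, 3, 14, 14, 16, 16]): a 14x14 grid of patches

# 2. Flatten it: one 1D vector per patch (3 channels * 16 * 16 = 768 values)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * patch * patch)
print(patches.shape)   # torch.Size([1, 196, 768])

# 3. Add position: one embedding per patch slot
# (placeholder zeros here; a real ViT learns these values during training)
pos_embed = torch.zeros(1, 14 * 14, 3 * patch * patch)
tokens = patches + pos_embed

# 4. Apply attention: every patch can attend to every other patch
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
with torch.no_grad():
    encoded = encoder_layer(tokens)
print(encoded.shape)   # torch.Size([1, 196, 768])
```

The sequence length of 196 comes from the 14x14 grid of patches; a real ViT stacks many such encoder layers rather than one.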
Why are ViTs taking over?
CNNs are restricted by their "receptive field": a single kernel only looks at a tiny area (e.g., 3x3 pixels) at a time. It takes many stacked layers before a CNN can relate a tire in the bottom left to a steering wheel in the top right. A Vision Transformer, however, uses Self-Attention to relate every patch to every other patch from the very first layer. When pre-trained on massive datasets, ViTs can match or outperform even very deep ResNets!
Applying a Vision Transformer in PyTorch
Just like we did with ResNet, we don't usually train ViTs from scratch because they require an absurd amount of data (often hundreds of millions of images) to learn the concept of "vision" via attention. Instead, we use Transfer Learning!
Here is how you load a pre-trained ViT in PyTorch, freeze its massive attention blocks, and swap the final Classification Head for your own dataset:
Python
import torch
import torch.nn as nn
from torchvision import models
# 1. Load a pre-trained Vision Transformer (ViT-Base with 16x16 patches)
print("Downloading ViT weights...")
vit_model = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
# 2. Freeze the Transformer Encoder Blocks
for param in vit_model.parameters():
    param.requires_grad = False
# 3. Replace the Classification Head
# In a ViT, the final classification layer is stored in 'model.heads'
num_features = vit_model.heads.head.in_features
# Let's say we are classifying 10 types of animals
vit_model.heads.head = nn.Linear(num_features, 10)
# 4. Test the model with dummy data (Batch Size=1, Channels=3, Height=224, Width=224)
# Note: ViT-B_16 MUST receive 224x224 images by default!
dummy_image = torch.randn(1, 3, 224, 224)
prediction = vit_model(dummy_image)
print("ViT Output Shape:", prediction.shape) # Expected: [1, 10]
Expert Path: Mastering Vision Transformers
To truly become an expert in ViTs and pass advanced machine learning interviews, you must understand the deep architectural secrets that PyTorch abstracts away. Here is your roadmap to mastery:
- 1. The [CLS] Token: Did you know a ViT doesn't actually average all the patch vectors to make a prediction? It prepends an extra, learnable "Class Token" at the very beginning of the sequence. As this token passes through the Attention layers, it learns to "look" at all the other patches and aggregate their information. It is this single [CLS] token that gets fed into the final Classification Head!
- 2. The Lack of Inductive Bias (Data Hunger): CNNs have an "inductive bias": they naturally assume pixels physically close to each other are related. ViTs do not assume this. Because they have to learn from scratch that neighboring patches are related, ViTs require massive amounts of data. If you have a small dataset, a CNN will often beat a ViT trained from scratch!
- 3. Swin Transformers (The Hybrid): Pure ViTs are computationally expensive on high-resolution images because Self-Attention compares every patch to every other patch (\( O(N^2) \) complexity in the number of patches). Swin Transformers address this by computing attention only within local "windows" that shift between layers, borrowing the locality of a CNN while keeping the attention mechanism of a ViT.
- 4. Masked Autoencoders (MAE): How are these massive models pre-trained without labels? Self-Supervised Learning. The Masked Autoencoder approach (introduced by researchers at Meta AI) takes an image, hides (masks) about 75% of its patches, and trains the ViT to reconstruct the missing pixels. This forces the model to deeply understand object geometry without needing millions of expensive human labels!
- 5. Positional Embeddings (1D vs 2D): Unlike Convolutions, Self-Attention has no built-in concept of order. Without extra information, shuffling the patches would not change what the ViT sees! To fix this, we add "Positional Embeddings" to the patches. The original ViT uses learned 1D embeddings (1, 2, 3...), and its authors reported no significant gain from 2D-aware ones; some later variants use 2D (x, y) or relative position schemes to encode the spatial grid more explicitly.
- 6. Resolution Fine-Tuning (Interpolation): If you train a ViT on 224x224 images and want to fine-tune it on 384x384 images, the number of patches drastically increases. You can't just pass more patches because the Positional Embeddings won't match! Experts use Bicubic Interpolation to mathematically stretch the pre-trained positional embeddings to fit the new sequence length.
- 7. Knowledge Distillation (DeiT): Because ViTs are so data-hungry, researchers created Data-efficient Image Transformers (DeiT). They use a "Distillation Token" that allows the ViT to learn directly from a CNN (acting as a teacher). The ViT literally mimics the CNN's predictions, bypassing the need for a 300-million image dataset!
- 8. Attention Maps (Interpretability): It is notoriously difficult to understand what a deep CNN is looking at. But with a ViT, you can directly visualize the Attention Weights of the [CLS] token! You can mathematically plot exactly which patches the model focused on when it decided the image was a dog.
- 9. Shared Image-Text Embeddings (CLIP): Because ViTs turn images into sequences of token-like vectors, they pair naturally with Language Models. OpenAI's CLIP trains an image encoder and a text encoder together with a contrastive objective, so that matching images and captions land close together in the same mathematical space, allowing you to search images using text! Other multimodal models go a step further and use "Cross-Attention" to let text tokens attend directly to image tokens.
- 10. Layer Normalization (Pre-Norm): CNNs heavily rely on Batch Normalization to stabilize training. ViTs, inheriting from NLP Transformers, use Layer Normalization (LayerNorm) applied before the attention blocks (Pre-Norm architecture) to handle independent patch sequences much more effectively.
- 11. GELU Activation: Instead of the classic ReLU used in CNNs, ViTs use GELU (Gaussian Error Linear Unit). It provides a smoother curve for gradients and is the default in ViT, BERT, and GPT-style Transformers (some newer architectures use gated variants instead).
- 12. Patch Size Trade-offs (ViT-B/16 vs ViT-B/8): The `/16` means 16x16 pixel patches. Using a smaller patch size like `/8` creates 4x as many patches, which typically improves accuracy on fine details, but attention cost grows quadratically ($O(N^2)$) with the number of patches, so the attention computation becomes roughly 16x more expensive.
- 13. Hierarchical Vision Transformers (Pyramid ViT): Pure ViTs maintain a constant resolution (e.g., 196 patches) throughout all layers. Advanced variations like PVT (Pyramid Vision Transformer) progressively reduce the number of tokens to create spatial hierarchies, playing a role similar to a CNN's pooling layers!
- 14. DropPath (Stochastic Depth): Deep ViTs are prone to overfitting. To combat this, experts use DropPath (randomly dropping entire residual branches per sample during training), often in place of standard Dropout. This regularizer becomes increasingly important as ViTs get deeper.
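Item 14 is the only technique in this list without code later in the chapter, so here is a minimal sketch of DropPath. The class name, the `drop_prob` value, and the tensor shapes are illustrative assumptions, not taken from a specific library:

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly zero a residual branch per sample (training only)."""
    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        # At inference time (or with drop_prob = 0) the branch passes through untouched
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over tokens and features
        mask_shape = (x.shape[0],) + (1,) * (x.dim() - 1)
        mask = torch.bernoulli(torch.full(mask_shape, keep_prob, dtype=x.dtype, device=x.device))
        # Rescale the surviving samples so the expected value is unchanged
        return x * mask / keep_prob

# Illustrative usage on the output of a residual branch (hypothetical shapes)
drop_path = DropPath(drop_prob=0.5)
branch_out = torch.ones(8, 197, 768)

drop_path.train()
train_out = drop_path(branch_out)   # each sample is either fully dropped or rescaled
drop_path.eval()
eval_out = drop_path(branch_out)    # identical to the input

print("Samples kept during training:", int((train_out[:, 0, 0] > 0).sum()), "of 8")
```

Inside a Transformer block it would wrap the branch, not the skip connection: `x = x + drop_path(attention_output)`.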
Under the Hood: Building the ViT Patch Embedding Layer
In advanced technical interviews, engineers are often asked: "How does a ViT actually split an image into patches without using slow Python loops?"
The secret is surprisingly elegant. It uses a standard 2D Convolution, but sets the kernel size and stride to be exactly equal to the patch size! This mathematical trick extracts every patch simultaneously, with zero overlap, and linearly projects them into vectors instantly.
Here is the exact PyTorch code to build this critical Patch Embedding layer from scratch. If you understand this, you truly understand Vision Transformers:
Python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        # The Secret: A Convolution where kernel_size == stride == patch_size!
        # This grabs non-overlapping 16x16 blocks and turns them into 768-D vectors.
        self.proj = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size
        )

    def forward(self, x):
        # Input x shape: [Batch, Channels=3, Height=224, Width=224]
        x = self.proj(x)
        # Output x shape: [Batch, EmbedDim=768, Grid_H=14, Grid_W=14]
        # (Since 224 / 16 = 14)
        # Flatten the 14x14 grid into 196 sequential patches
        x = x.flatten(2)  # Shape: [Batch, 768, 196]
        # Transpose so patches are the sequence: [Batch, 196, 768]
        # Now it looks EXACTLY like a sentence of 196 words with 768 dimensions!
        x = x.transpose(1, 2)
        return x

# Test our advanced layer!
embedder = PatchEmbedding()
dummy_img = torch.randn(1, 3, 224, 224)
patches = embedder(dummy_img)
print(f"Original Image: {dummy_img.shape}")
print(f"Patch Sequence: {patches.shape} -> (Batch, Num_Patches, Embed_Dim)")
Under the Hood: The [CLS] Token in PyTorch
As mentioned in the Expert Path, we don't just feed the 196 patches into the Transformer. We prepend a special [CLS] (Classification) Token. Why? Because as the sequence passes through the Self-Attention layers, this specific token is trained to act as an aggregator, looking at all the other patches to collect the global "concept" of the image.
Here is how a typical PyTorch implementation creates the [CLS] token and attaches it to the patch sequence:
Python
import torch
import torch.nn as nn

class ViT_Prep(nn.Module):
    def __init__(self, embed_dim=768, num_patches=196):
        super().__init__()
        # 1. Create the [CLS] Token as a learnable parameter
        # Shape: [1, 1, 768] - It is a single "word" with 768 dimensions
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # 2. Create Positional Embeddings for all patches PLUS the [CLS] token
        # Shape: [1, 196 + 1, 768]
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patches):
        # 'patches' shape from previous step: [Batch, 196, 768]
        batch_size = patches.shape[0]
        # Expand the [CLS] token so there is one for every image in the batch
        # Shape: [Batch, 1, 768]
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        # Concatenate the [CLS] token to the FRONT of the patches
        # New Shape: [Batch, 197, 768]
        x = torch.cat((cls_tokens, patches), dim=1)
        # Add Positional Embeddings so the Transformer knows the order
        x = x + self.pos_embed
        return x

# Test our prep layer on the patches from the previous block!
prep_layer = ViT_Prep()
final_sequence = prep_layer(patches)  # 'patches' from previous block
print(f"Sequence before CLS: {patches.shape} (196 patches)")
print(f"Sequence AFTER CLS: {final_sequence.shape} (197 tokens!)")
The Final Secret: Once this [Batch, 197, 768] sequence goes through all the Transformer blocks, the model simply ignores the outputs for patches 1 through 196! It isolates the 0th token (the [CLS] token) and feeds only that 768-D vector into the final nn.Linear Classification Head.
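The [CLS] read-out described above can be sketched in a few lines. The random `encoder_output` tensor stands in for the output of the last Transformer block, and `num_classes = 1000` and the final LayerNorm are assumptions chosen to mirror a typical ImageNet-style ViT:

```python
import torch
import torch.nn as nn

num_classes = 1000  # assumption: ImageNet-style label count
encoder_output = torch.randn(4, 197, 768)  # stand-in for the final Transformer block's output

final_norm = nn.LayerNorm(768)       # many ViT implementations normalize before the head
head = nn.Linear(768, num_classes)   # the classification head

# Keep only token 0 (the [CLS] token); the 196 patch outputs are not used
cls_vector = final_norm(encoder_output)[:, 0]   # Shape: [4, 768]
logits = head(cls_vector)                        # Shape: [4, 1000]

print(f"CLS vector: {cls_vector.shape}")
print(f"Class logits: {logits.shape}")
```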
Under the Hood: Inductive Bias & Data Hunger
To understand why ViTs are so data-hungry, you have to understand the concept of Inductive Bias. Inductive bias is a set of assumptions a mathematical model inherently makes about the data before it even begins training.
- CNNs have STRONG Inductive Bias: A Convolutional layer physically forces the math to look at a local 3x3 pixel grid. It strictly assumes that pixels physically touching each other are highly related (e.g., forming a continuous edge). Because the architecture hardcodes this assumption, CNNs learn fast and perform well on relatively little data!
- ViTs have very WEAK Inductive Bias: A Transformer's Self-Attention mechanism lets every patch attend to every other patch, with no built-in preference for nearby ones. Apart from the patch-splitting step and the positional embeddings, it doesn't inherently know that Patch #1 is right next to Patch #2. It treats a patch in the top-left exactly the same as a patch in the bottom-right.
Because ViTs lack these built-in architectural shortcuts, they have to learn the basic concept of "2D space" and "local geometry" purely through raw data. A ViT essentially has to look at millions of images before the mathematical weights finally realize, "Oh, patches that are next to each other usually contain continuous shapes!"
The Golden Rule of Dataset Size
If you are training a model from scratch on 10,000 images, a ResNet will completely destroy a Vision Transformer. But if you have Google's JFT-300M dataset (300 million images), the ViT eventually surpasses the CNN. Why? Because the ViT's lack of inductive bias allows it to learn more complex, non-local, global relationships that the rigid CNN kernel physically cannot see!
Under the Hood: Swin Transformers & The O(N²) Problem
If Vision Transformers are so powerful, why don't we use them on massive 4K resolution medical images or satellite photos? The answer is quadratic computational complexity ($O(N^2)$).
In a standard ViT, Self-Attention requires every patch to mathematically compare itself to every other patch. If you double the image resolution (both height and width), the number of patches quadruples, and the attention computation increases by a factor of 16! At very high resolutions this quickly exhausts the memory of even the most powerful GPUs.
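The arithmetic behind that factor of 16 can be checked directly. This is a back-of-the-envelope sketch that counts patch pairs only; the helper name `attention_cost` is made up for illustration, and it ignores the [CLS] token, attention heads, and the embedding dimension:

```python
def attention_cost(image_size, patch_size=16):
    """Return (number of patches, number of patch-to-patch comparisons)."""
    num_patches = (image_size // patch_size) ** 2
    pairwise_comparisons = num_patches ** 2
    return num_patches, pairwise_comparisons

base_patches, base_pairs = attention_cost(224)
for size in (224, 448, 896):
    patches_n, pairs = attention_cost(size)
    print(f"{size}x{size}: {patches_n} patches, {pairs} comparisons "
          f"({pairs // base_pairs}x the 224x224 cost)")
```

The same helper shows the patch-size trade-off: `attention_cost(224, patch_size=8)` has 4x the patches of the default `/16` setting and therefore 16x the comparisons.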
To solve this, Microsoft researchers invented the Swin Transformer (Shifted Window Transformer)—a brilliant architecture that combines the local efficiency of a CNN with the global representation power of a ViT.
- Localized Windows: Instead of comparing a patch to all patches in the image, the image is divided into larger "Windows" (e.g., 7x7 patches). Attention is computed only inside that specific window. Because the window size is fixed, the cost grows linearly ($O(N)$) with the number of patches, much like a CNN!
- Shifted Windows: If we only compute attention inside isolated windows, the model can never see the global picture. So, in the next layer, Swin shifts the window boundaries by half a window size. This allows patches to leak information to neighboring windows, eventually giving the network a global view.
- Hierarchical Merging: Just like a CNN's MaxPooling layers, Swin progressively merges 2x2 neighboring patches together as you go deeper, creating a spatial hierarchy.
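The first two ideas in the list above can be sketched with plain tensor reshaping. The grid size (56x56), channel count (96), and window size (7) follow the Swin-Tiny first stage; the function name is illustrative. Real Swin implementations also add an attention mask so that patches wrapped around by the cyclic shift do not attend to each other, which this sketch omits:

```python
import torch

def window_partition(x, window_size):
    """Split a [B, H, W, C] grid of patch embeddings into non-overlapping windows.

    Returns [B * num_windows, window_size * window_size, C].
    """
    B, H, W, C = x.shape
    x = x.reshape(B, H // window_size, window_size, W // window_size, window_size, C)
    x = x.permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, window_size * window_size, C)

grid = torch.randn(1, 56, 56, 96)                 # 224 / 4 = 56 patches per side
windows = window_partition(grid, window_size=7)   # 8 x 8 = 64 windows of 49 patches

# Shifted windows: cyclically roll the grid by half a window before partitioning
shifted_grid = torch.roll(grid, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted_grid, window_size=7)

# Compare the number of patch-to-patch comparisons
num_tokens = 56 * 56
global_pairs = num_tokens ** 2                  # every patch vs. every patch
window_pairs = windows.shape[0] * (7 * 7) ** 2  # only within each window

print(f"Windows: {windows.shape}")
print(f"Global attention pairs: {global_pairs}")
print(f"Windowed attention pairs: {window_pairs} ({global_pairs // window_pairs}x fewer)")
```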
Thanks to torchvision, using this advanced hybrid model is as easy as loading a ResNet:
Python
import torch
from torchvision import models
# Load a pre-trained Swin Transformer (Swin-Tiny)
print("Downloading Swin Transformer...")
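swin_model = models.swin_t(weights=models.Swin_T_Weights.DEFAULT)
swin_model.eval()  # inference mode: disables dropout and stochastic depth
# Test it on a high-resolution image! (e.g., 800x800)
# A standard ViT with 16x16 patches would need global attention across 2,500 patches
# (a 2,500 x 2,500 attention matrix per head, in every layer).
# Swin only computes attention inside local windows, so the cost grows far more gently!
high_res_image = torch.randn(1, 3, 800, 800)
with torch.no_grad():  # no gradients needed for a forward-pass test
    output = swin_model(high_res_image)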
print("Swin Output Shape:", output.shape) # Expected: [1, 1000] (ImageNet classes)
Under the Hood: Masked Autoencoders (MAE)
Because ViTs are so data-hungry (due to their weak inductive bias), relying purely on human-labeled datasets quickly becomes impractical. Labeling hundreds of millions of images by hand is far too expensive for most teams. The solution is Self-Supervised Learning.
Researchers at Meta developed the Masked Autoencoder (MAE). The concept is beautifully simple: take an image, chop it into patches, and then randomly hide (mask) 75% of them! You feed only the remaining 25% of visible patches into the ViT and mathematically force it to reconstruct the missing pixels.
By constantly trying to guess the missing pieces of millions of images, the ViT develops a profound, global understanding of object geometry, lighting, and textures—all without a single human label!
Here is a clever PyTorch snippet demonstrating how experts mathematically drop 75% of the patches before feeding them to the network:
Python
import torch

# Imagine our image was split into 196 patches (Batch=1, Patches=196, Embed_Dim=768)
patches = torch.randn(1, 196, 768)

def random_masking(patches, mask_ratio=0.75):
    batch_size, num_patches, embed_dim = patches.shape
    # Calculate how many patches to KEEP (25%)
    len_keep = int(num_patches * (1 - mask_ratio))  # 196 * 0.25 = 49 patches
    # Generate random noise for every patch, then sort it to get random indices!
    noise = torch.rand(batch_size, num_patches)  # random values between 0 and 1
    # 'ids_shuffle' tells us the index order to randomly shuffle the patches
    ids_shuffle = torch.argsort(noise, dim=1)
    # We ONLY keep the first 'len_keep' indices from our randomly shuffled list!
    ids_keep = ids_shuffle[:, :len_keep]
    # Use gather to mathematically extract only the 49 random patches
    # We expand ids_keep to match the embed_dim (768)
    ids_keep_expanded = ids_keep.unsqueeze(-1).expand(-1, -1, embed_dim)
    patches_kept = torch.gather(patches, dim=1, index=ids_keep_expanded)
    return patches_kept

# Apply the 75% mask!
visible_patches = random_masking(patches, mask_ratio=0.75)
print(f"Original Sequence: {patches.shape} (196 patches)")
print(f"Visible Sequence sent to ViT: {visible_patches.shape} (Only 49 patches!)")
The Efficiency Bonus: Because we dropped 75% of the patches, the heavy ViT Encoder only processes a quarter of the tokens, which makes pre-training several times faster! It learns the deep structures of the image from the remaining 25%, and a lightweight "Decoder" network at the very end attempts to reconstruct the missing 75%. This is a key trick for scaling ViT pre-training efficiently!
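Before the decoder can reconstruct anything, the sequence has to be put back in its original order, with a placeholder where each hidden patch used to be. The sketch below shows one way to do that, following the same argsort trick as the masking code. The encoder itself is skipped, and `mask_token` is a plain zero tensor here, whereas a real model would use a learnable parameter:

```python
import torch

torch.manual_seed(0)
B, N, D = 1, 196, 768
patches = torch.randn(B, N, D)
len_keep = 49

noise = torch.rand(B, N)
ids_shuffle = torch.argsort(noise, dim=1)        # random permutation of patch indices
ids_restore = torch.argsort(ids_shuffle, dim=1)  # its inverse permutation

ids_keep = ids_shuffle[:, :len_keep]
visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

# (The encoder would process 'visible' here; this sketch passes it through unchanged.)

# Append one mask token per hidden patch, still in shuffled order
mask_token = torch.zeros(1, 1, D)  # assumption: learnable nn.Parameter in a real model
shuffled_full = torch.cat([visible, mask_token.expand(B, N - len_keep, D)], dim=1)

# Un-shuffle so every token sits at its original grid position again
restored = torch.gather(shuffled_full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))

# Binary mask in original order: 0 = visible, 1 = hidden (the loss is computed on the 1s)
mask = torch.ones(B, N)
mask[:, :len_keep] = 0
mask = torch.gather(mask, 1, ids_restore)

print(f"Decoder input: {restored.shape}, hidden patches: {int(mask.sum())}")
```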
Under the Hood: Positional Embeddings (1D vs 2D)
Unlike CNNs, Transformers have no built-in concept of order. Without positional information, the Self-Attention mechanism is permutation equivariant: if you completely shuffle the 196 image patches like a deck of cards, each patch's output is simply shuffled the same way, so the [CLS] token (which aggregates over all patches) produces the exact same prediction! To fix this fatal flaw, we must inject spatial awareness into the patches using Positional Embeddings before they enter the network.
- 1D Embeddings: The original ViT simply numbers the patches from 1 to 196 (like words in a linear sentence). It learns a unique 768-D vector for position 1, position 2, etc.
- 2D Embeddings: Some models (like MAE) treat images as what they are: 2D grids, not linear sentences! Instead of numbering 1 to 196, they assign an (X, Y) coordinate to each patch. They build one embedding for the X-axis and one for the Y-axis, each using half of the channels, and concatenate them. This helps the model represent that the patch at (0, 0) sits directly above the patch at (0, 1).
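The shuffling claim is easy to verify on a small attention layer with no positional embeddings. This is a minimal sketch with arbitrary sizes (10 tokens, 64 dimensions, 4 heads), using a mean over tokens as a stand-in for an order-independent read-out:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
attn.eval()

tokens = torch.randn(1, 10, 64)   # 10 "patches", no positional embeddings added
perm = torch.randperm(10)         # a random shuffle of the patch order
shuffled = tokens[:, perm]

with torch.no_grad():
    out, _ = attn(tokens, tokens, tokens)
    out_shuffled, _ = attn(shuffled, shuffled, shuffled)

# Each token's output is unchanged; it just moved to its new position
same_up_to_order = torch.allclose(out[:, perm], out_shuffled, atol=1e-5)
# Any order-independent summary (here: the mean over tokens) is therefore identical
pooled_same = torch.allclose(out.mean(dim=1), out_shuffled.mean(dim=1), atol=1e-5)

print("Outputs identical up to reordering:", same_up_to_order)
print("Pooled output identical:", pooled_same)
```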
Here is the brilliant PyTorch code demonstrating how experts build a 2D Sine-Cosine Positional Embedding grid from scratch (this requires no training and uses pure math to represent space!):
Python
import torch

def get_2d_sincos_pos_embed(embed_dim, grid_size):
    # embed_dim: 768, grid_size: 14 (for a 14x14 grid of patches = 196 total patches)
    # We split the embedding dimension in half: 384 dimensions for Y, 384 for X
    half_dim = embed_dim // 2
    # Create a grid of X and Y coordinates (from 0 to 13)
    grid_h = torch.arange(grid_size, dtype=torch.float32)
    grid_w = torch.arange(grid_size, dtype=torch.float32)
    grid_y, grid_x = torch.meshgrid(grid_h, grid_w, indexing='ij')
    # Flatten the grids to shape (196,)
    grid_y = grid_y.flatten()
    grid_x = grid_x.flatten()
    # The Magic Math: Compute Sine and Cosine frequencies
    # The frequency formula follows the original "Attention is All You Need" paper
    omega = torch.arange(half_dim // 2, dtype=torch.float32)
    omega = 1.0 / (10000 ** (omega / (half_dim / 2)))
    # Outer product of coordinates and frequencies
    out_y = torch.einsum('m,d->md', grid_y, omega)
    out_x = torch.einsum('m,d->md', grid_x, omega)
    # Create the Sine and Cosine waves for Y and X
    emb_y = torch.cat([torch.sin(out_y), torch.cos(out_y)], dim=1)  # Shape: (196, 384)
    emb_x = torch.cat([torch.sin(out_x), torch.cos(out_x)], dim=1)  # Shape: (196, 384)
    # Concatenate Y and X embeddings to get the full 768-D vector for each of the 196 patches!
    pos_embed = torch.cat([emb_y, emb_x], dim=1)  # Shape: (196, 768)
    return pos_embed

# Generate the 2D embeddings!
pos_embeddings_2d = get_2d_sincos_pos_embed(embed_dim=768, grid_size=14)
print(f"2D Positional Embeddings Shape: {pos_embeddings_2d.shape}")
print("These vectors will be ADDED to our patch vectors before entering the Transformer!")
Under the Hood: Resolution Fine-Tuning (Interpolation)
Imagine you download a ViT pre-trained on 224x224 images. This model mathematically expects exactly 196 patches (plus the CLS token). But your custom medical dataset uses massive 384x384 images! If you use the standard 16x16 patch size, your image will now be chopped into 576 patches.
You can't just feed 576 patches into the pre-trained model because the pre-trained Positional Embedding matrix only has 196 rows! It will instantly crash with a shape mismatch error.
The expert solution is Bicubic Interpolation. We take the 1D list of 196 embeddings, reshape it back into a 2D 14x14 spatial grid, mathematically stretch that grid to 24x24 using PyTorch's image interpolation functions, and then flatten it back to 576! This preserves most of the learned spatial knowledge, and fine-tuning then adapts it to the new resolution.
Python
import torch
import torch.nn.functional as F
# 1. We have pre-trained embeddings for a 14x14 grid (196 patches)
# Shape: [1, 196, 768] (Ignoring the CLS token for this mathematical example)
pretrained_pos_embed = torch.randn(1, 196, 768)
# 2. Reshape into a 2D grid so PyTorch can treat it like an "image" of embeddings
# Shape becomes: [Batch=1, Channels=768, Height=14, Width=14]
grid_embed = pretrained_pos_embed.reshape(1, 14, 14, 768).permute(0, 3, 1, 2)
# 3. Mathematically STRETCH the grid from 14x14 to 24x24 (576 patches for 384x384 image)
# We use 'bicubic' interpolation, which creates smooth algorithmic transitions between values
stretched_grid = F.interpolate(grid_embed, size=(24, 24), mode='bicubic', align_corners=False)
# 4. Flatten it back to a 1D sequence!
# Shape becomes: [1, 576, 768]
new_pos_embed = stretched_grid.flatten(2).transpose(1, 2)
print(f"Old Embeddings: {pretrained_pos_embed.shape}")
print(f"New Embeddings: {new_pos_embed.shape} -> Ready for 384x384 images!")
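The example above sets the [CLS] token aside. In a real checkpoint the positional embedding usually has 197 rows, so the [CLS] row has to be split off, left untouched, and re-attached after stretching. A minimal sketch, assuming the [CLS] position is stored first (the function name and signature are illustrative):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Resize [1, 1 + old_grid**2, D] positional embeddings to [1, 1 + new_grid**2, D]."""
    # Assumption: row 0 belongs to the [CLS] token and has no spatial position
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # [1, old*old, D] -> [1, D, old, old] so it can be interpolated like an image
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode='bicubic', align_corners=False)
    # [1, D, new, new] -> [1, new*new, D]
    patch_pos = patch_pos.flatten(2).transpose(1, 2)
    return torch.cat([cls_pos, patch_pos], dim=1)

old_pos_embed = torch.randn(1, 197, 768)   # stand-in for pre-trained weights
new_pos_embed_with_cls = resize_pos_embed(old_pos_embed, old_grid=14, new_grid=24)
print(f"With CLS: {old_pos_embed.shape} -> {new_pos_embed_with_cls.shape}")
```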
Under the Hood: Knowledge Distillation (DeiT)
Because ViTs have such weak inductive bias, the original ViT needed pre-training on roughly 300 million images to outperform CNNs. To fix this data hunger, researchers created DeiT (Data-efficient Image Transformers).
DeiT introduces a Teacher-Student dynamic. A powerful CNN (the Teacher) is trained first and then frozen, and a ViT (the Student) is trained under its guidance. The ViT is given a second special token called the Distillation Token. While the standard [CLS] token tries to predict the true label (e.g., "Dog"), the Distillation Token tries to mathematically mimic whatever the CNN predicts!
By mimicking the CNN, the ViT essentially "absorbs" the CNN's inductive bias, allowing it to learn highly accurate spatial representations using only standard 1.2 million image datasets (ImageNet) instead of 300 million!
Python
import torch
import torch.nn.functional as F
# Imagine a training loop for DeiT
# Output of the CNN Teacher (already trained, weights frozen)
teacher_logits = torch.randn(1, 1000)
# Output of the ViT Student (has TWO heads: one for CLS token, one for Distillation token)
student_cls_logits = torch.randn(1, 1000)
student_distill_logits = torch.randn(1, 1000)
# The True Label (e.g. Class 5: 'Dog')
true_labels = torch.tensor([5])
# 1. Standard Loss: How far is the CLS token from the true label?
cls_loss = F.cross_entropy(student_cls_logits, true_labels)
# 2. Distillation Loss: How far is the Distillation token from the CNN's prediction?
# "Hard" distillation: cross-entropy against the teacher's top-1 class
# (a "soft" variant instead matches the full distribution with KL Divergence)
distill_loss = F.cross_entropy(student_distill_logits, teacher_logits.argmax(dim=1))
# 3. The Total Loss is simply an average of both!
total_loss = (cls_loss + distill_loss) / 2
print(f"CLS Loss: {cls_loss.item():.4f}")
print(f"Distillation Loss: {distill_loss.item():.4f}")
print(f"Total Backprop Loss: {total_loss.item():.4f}")
# The ViT learns the true label AND absorbs the CNN's bias simultaneously!
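For comparison, here is a sketch of the "soft" variant, where the Distillation head matches the teacher's full probability distribution via KL Divergence. The `temperature` and `alpha` values are illustrative assumptions, not tuned settings from the DeiT paper:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
teacher_logits = torch.randn(4, 1000)          # frozen CNN teacher
student_cls_logits = torch.randn(4, 1000)      # student head on the [CLS] token
student_distill_logits = torch.randn(4, 1000)  # student head on the Distillation token
true_labels = torch.tensor([5, 17, 3, 999])

temperature = 3.0  # assumption: softens both distributions
alpha = 0.5        # assumption: weight of the distillation term

# 1. Standard loss on the true labels
cls_loss = F.cross_entropy(student_cls_logits, true_labels)

# 2. Soft distillation: KL(teacher || student) on temperature-softened distributions.
#    F.kl_div expects log-probabilities as input and probabilities as target.
soft_loss = F.kl_div(
    F.log_softmax(student_distill_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction='batchmean',
) * temperature ** 2  # conventional rescaling so gradients keep a similar magnitude

total_loss = (1 - alpha) * cls_loss + alpha * soft_loss
print(f"CLS Loss: {cls_loss.item():.4f}")
print(f"Soft Distillation Loss: {soft_loss.item():.4f}")
print(f"Total Loss: {total_loss.item():.4f}")
```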
Under the Hood: Attention Maps (Interpretability)
One of the biggest problems with deep CNNs is that they act as "Black Boxes". If a CNN misclassifies a dog as a cat, it is notoriously difficult to understand why. Vision Transformers offer a much more direct window into the model.
Because the Self-Attention mechanism explicitly computes a mathematical "weight" representing how much the [CLS] token cares about every single patch, we can extract these weights and plot them directly over the image as a heatmap!
Python
import torch
# Imagine we extracted the raw attention weights from the final Transformer block
# Shape: [Batch=1, Num_Heads=12, Tokens=197, Tokens=197]
attention_weights = torch.rand(1, 12, 197, 197)
# 1. Average the attention weights across all 12 attention heads
# Shape becomes: [1, 197, 197]
avg_attention = torch.mean(attention_weights, dim=1)
# 2. Isolate the [CLS] token (index 0) and how much it attended to the 196 image patches (indices 1 to 196)
# Shape becomes: [196]
cls_attention = avg_attention[0, 0, 1:]
# 3. Reshape the 196 weights back into the 14x14 image grid!
attention_heatmap = cls_attention.reshape(14, 14)
print("Attention Heatmap Shape:", attention_heatmap.shape)
print("You can now overlay this 14x14 grid directly onto your image using plt.imshow() to see exactly what the AI was looking at!")
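To overlay the 14x14 grid on a 224x224 image, it first has to be upsampled to the image resolution. A minimal sketch, using a softmax over random scores as stand-in attention weights (so each row sums to 1, as real attention does) and bilinear upsampling as one reasonable choice:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in attention weights: softmax makes every row a valid probability distribution
raw_scores = torch.randn(1, 12, 197, 197)
attention_weights = raw_scores.softmax(dim=-1)

# Average over heads, take the [CLS] row, drop its attention to itself
cls_attention = attention_weights.mean(dim=1)[0, 0, 1:]   # Shape: [196]

# Upsample the 14x14 grid to the 224x224 image resolution
heatmap = cls_attention.reshape(1, 1, 14, 14)
heatmap = F.interpolate(heatmap, size=(224, 224), mode='bilinear', align_corners=False)[0, 0]

# Rescale to [0, 1] so it can be used directly as an overlay opacity
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())
print("Upsampled heatmap shape:", heatmap.shape)
```

With matplotlib, the result could then be drawn on top of the image with something like `plt.imshow(heatmap, alpha=0.5, cmap='jet')`.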
Under the Hood: Multimodality (OpenAI's CLIP)
Vision Transformers are a bridge toward Artificial General Intelligence (AGI). Because a ViT flattens an image into a 1D sequence of tokens (exactly how a text Transformer processes words), it becomes natural to connect Vision and Language.
OpenAI's CLIP (Contrastive Language-Image Pre-training) does exactly this. It passes an image through a ViT, and a caption through a Text Transformer. A contrastive loss then pulls the two output vectors together when the image and text match, and pushes them apart when they don't! This unlocks Zero-Shot Classification: you can classify images using text without ever training a custom classification head!
Python
import torch
import torch.nn.functional as F
# 1. The ViT processes a picture of a dog into a single 512-D vector (from its CLS token)
image_features = torch.randn(1, 512)
# 2. A Text Transformer processes 3 text prompts into three 512-D vectors
text_prompts = ["A photo of a cat", "A photo of a dog", "A photo of a car"]
text_features = torch.randn(3, 512)
# 3. Normalize the vectors so they act as pure directions in mathematical space
image_features = F.normalize(image_features, p=2, dim=-1)
text_features = F.normalize(text_features, p=2, dim=-1)
# 4. Compute Cosine Similarity (Dot Product) between the Image and ALL Text Prompts
# Shape: [1, 512] @ [512, 3] = [1, 3]
similarities = image_features @ text_features.T
# 5. Convert similarities to probabilities using Softmax
# Multiply by 100 for temperature scaling, making the highest probability stand out
probabilities = F.softmax(similarities * 100, dim=-1)
print(f"Probabilities for [Cat, Dog, Car]: {probabilities[0].tolist()}")
print("The highest probability is our Zero-Shot prediction, driven entirely by text!")
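The snippet above shows inference only. The training objective that makes matching pairs line up can be sketched as a symmetric cross-entropy over a batch of image-caption pairs. Random features stand in for the two encoders, and the fixed `logit_scale` of 100 is an assumption (in CLIP it is a learned parameter):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, dim = 8, 512
# Stand-ins for encoder outputs: row i of each tensor belongs to the same image-caption pair
image_features = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_features = F.normalize(torch.randn(batch_size, dim), dim=-1)

logit_scale = 100.0  # assumption: fixed here, learned during real training

# Similarity of every image with every caption in the batch: [8, 8]
logits_per_image = logit_scale * image_features @ text_features.T
logits_per_text = logits_per_image.T

# The correct caption for image i is caption i, so the targets are the diagonal
targets = torch.arange(batch_size)
loss_images = F.cross_entropy(logits_per_image, targets)  # pick the right caption per image
loss_texts = F.cross_entropy(logits_per_text, targets)    # pick the right image per caption
contrastive_loss = (loss_images + loss_texts) / 2

print(f"Contrastive loss on random features: {contrastive_loss.item():.4f}")
```

Minimizing this loss raises the diagonal similarities (matching pairs) and lowers the off-diagonal ones (mismatched pairs).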
Under the Hood: Layer Normalization (Pre-Norm)
CNNs famously rely on Batch Normalization (BatchNorm) to train smoothly. BatchNorm computes statistics across an entire batch of images. However, Transformers inherited their architecture from NLP, where sentences vary wildly in length. Therefore, Transformers use Layer Normalization (LayerNorm), which normalizes the features of each individual token/patch independently of the rest of the batch.
Furthermore, the original 2017 text Transformer used "Post-Norm" (normalizing after the residual addition that follows each Attention block). ViTs almost exclusively use Pre-Norm (normalizing the input of the Attention block, inside the residual branch, so the skip connection itself stays un-normalized). This small architectural tweak makes very deep Transformers much easier to train stably!
Python
import torch
import torch.nn as nn

class ViT_Block(nn.Module):
    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        # LayerNorm normalizes exactly 768 dimensions per individual patch!
        self.norm1 = nn.LayerNorm(embed_dim)
        # The standard Self-Attention mechanism
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x):
        # PRE-NORM ARCHITECTURE:
        # 1. We normalize 'x' BEFORE it goes into attention.
        # 2. We add the original un-normalized 'x' back via a Residual (+) Connection.
        normalized_x = self.norm1(x)
        attention_output, _ = self.attn(normalized_x, normalized_x, normalized_x)
        x = x + attention_output
        return x
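To make "normalizes each token independently" concrete, the sketch below recomputes LayerNorm by hand and compares it with `nn.LayerNorm`. The input scale and shift are arbitrary, and the comparison relies on a freshly created LayerNorm having weight 1 and bias 0:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tokens = torch.randn(2, 197, 768) * 5 + 3   # deliberately un-normalized input

layer_norm = nn.LayerNorm(768)
ln_out = layer_norm(tokens)

# Manual LayerNorm: statistics over the 768 features of EACH token, nothing shared across the batch
mean = tokens.mean(dim=-1, keepdim=True)
var = tokens.var(dim=-1, unbiased=False, keepdim=True)
manual_out = (tokens - mean) / torch.sqrt(var + layer_norm.eps)

print("Matches nn.LayerNorm:", torch.allclose(ln_out, manual_out, atol=1e-4))
print("Largest per-token mean after LayerNorm:", ln_out.mean(dim=-1).abs().max().item())
```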
Under the Hood: GELU Activation
In the CNN chapter, we used ReLU (Rectified Linear Unit). ReLU is mathematically brutal: if a value is negative, it instantly snaps to exactly 0. This creates a sharp, mathematical "corner" in the graph at 0.
Transformers are notoriously fragile to train. ReLU's gradient is exactly zero for every negative input, which can produce "dead neurons" where gradients completely stop flowing. ViTs, like BERT and GPT before them, instead use GELU (Gaussian Error Linear Unit), defined as $x \cdot \Phi(x)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. The result is a smooth, curved version of ReLU: small negative inputs are not snapped to 0 but produce a small negative output that curves back toward 0 as the input becomes more negative. In practice, GELU has become the default activation inside ViT MLP blocks.
Python
import torch
import torch.nn as nn

class ViT_MLP(nn.Module):
    def __init__(self, embed_dim=768, hidden_dim=3072):
        super().__init__()
        # This is the "Feed Forward" network applied after Attention
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        # The magic activation function replacing ReLU!
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, embed_dim)

    def forward(self, x):
        # Expand from 768 to 3072 (standard ViT expansion ratio is 4x)
        x = self.fc1(x)
        # Apply the smooth Gaussian curve
        x = self.act(x)
        # Compress back to 768
        x = self.fc2(x)
        return x

# Test the MLP block
mlp = ViT_MLP()
dummy_sequence = torch.randn(1, 197, 768)
output = mlp(dummy_sequence)
print(f"MLP Output Shape: {output.shape} (Shape is perfectly preserved!)")
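The GELU curve itself is a one-liner: the input multiplied by the standard normal CDF of the input. The sketch below writes it out with `torch.erf` and compares it against PyTorch's built-in version and against ReLU (the helper name `gelu_exact` is illustrative):

```python
import math
import torch
import torch.nn.functional as F

def gelu_exact(x):
    # GELU(x) = x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.linspace(-3, 3, 7)   # -3, -2, ..., 3
print("x:    ", x.tolist())
print("ReLU: ", F.relu(x).tolist())
print("GELU: ", gelu_exact(x).tolist())
print("Matches F.gelu:", torch.allclose(gelu_exact(x), F.gelu(x), atol=1e-6))
```

Note how the negative inputs give small negative GELU outputs instead of a hard 0, while large positive inputs pass through almost unchanged.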
What's Next: The Bleeding Edge of Generative AI
Everything we discussed so far is Discriminative (the AI predicts what an image is). The absolute frontier is Generative models (like Midjourney, DALL-E, or Stable Diffusion). These models learn to reverse the process of adding noise to a canvas, allowing them to literally create brand new, photorealistic images from scratch based on your text prompts!
Glossary of Famous Datasets & Architectures
Throughout this chapter, we referenced several famous datasets and models. Here is a quick reference guide so you know exactly what they are:
- MNIST: The "Hello World" dataset of Computer Vision. It contains 70,000 tiny (28x28 pixel) black-and-white images of handwritten digits (0-9). It is used universally to test and teach basic CNNs.
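- ImageNet: A massive dataset containing over 14 million annotated images across roughly 20,000 categories. Most benchmarks, including the "ImageNet Challenge" (ILSVRC), use a 1,000-class subset of about 1.2 million training images. AlexNet's win in that challenge in 2012 is widely credited with sparking the Deep Learning revolution.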
- ResNet-18 vs ResNet-50: "ResNet" stands for Residual Network. The number simply indicates how many layers deep it is. ResNet-18 is fast and lightweight (great for mobile apps). ResNet-50 is much deeper, significantly more accurate, and is the absolute industry standard for general image classification.
- EfficientNet: Created by Google in 2019, this architecture uses a mathematical formula to scale the depth, width, and resolution of a CNN perfectly. It achieves better accuracy than ResNet while being much faster and smaller!
The Evolution of Image Models (1998 - Present)
To truly appreciate how far we have come, here is a timeline of the major breakthroughs that shaped modern Computer Vision. You can click on the links below the chart to read the original research papers!
Original Research Papers (Click to Read):
- 1998 (LeNet-5): LeCun et al. - Gradient-Based Learning Applied to Document Recognition
- 2012 (AlexNet): Krizhevsky et al. - ImageNet Classification with Deep CNNs
- 2014 (GANs): Goodfellow et al. - Generative Adversarial Nets
- 2015 (ResNet): He et al. - Deep Residual Learning for Image Recognition
- 2020 (ViT): Dosovitskiy et al. - An Image is Worth 16x16 Words
- 2021 (CLIP): Radford et al. - Learning Transferable Visual Models From Natural Language Supervision
- 2022 (Stable Diffusion): Rombach et al. - High-Resolution Image Synthesis with Latent Diffusion Models
- 2023 (SAM): Kirillov et al. - Segment Anything
- 2024 (World Models): OpenAI Research - Video Generation Models as World Simulators
- 2025 (VLA Models): Brohan et al. - Vision-Language-Action Models (Bridging to Robotics)
- 2026 (Omnimodal Vision): State-of-the-art Tracking - Recent Breakthroughs in Real-Time Spatiotemporal Reasoning
Chapter Summary & The Road Ahead
In this chapter, we conquered the mathematical foundations of Computer Vision. We transitioned from the rigid, pixel-perfect calculations of Convolutional Neural Networks (CNNs)—which rely on strong inductive biases to learn efficiently—to the massive, data-hungry, global reasoning capabilities of Vision Transformers (ViTs).
You now understand the deepest secrets of AI architecture: how images are patched, how self-attention replaces kernels, how positional embeddings inject space into pure mathematics, and how self-supervised learning (MAE) and multimodality (CLIP) act as the primary stepping stones toward Artificial General Intelligence.
Future Work: Next in the Book
- Chapter 2: Sequence Models & NLP: How do machines read text or predict the stock market? We will dive into Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTMs), and time-series forecasting.
- Chapter 3: Generative AI (Diffusion & GANs): We will pivot from understanding images to creating them, building the mathematics behind tools like Midjourney and Stable Diffusion.
- Chapter 4: Large Language Models (LLMs): The architecture behind ChatGPT, covering InstructGPT, RLHF (Reinforcement Learning from Human Feedback), and Mixture of Experts (MoE).
- Chapter 5: Audio, Video, and Robotics: Moving beyond text and images to true spatiotemporal reasoning and Vision-Language-Action (VLA) models.
Chapter 2: Sequence Models & NLP
Welcome to Chapter 2. In Chapter 1, we dealt with static images. But what happens when data has order? A sentence ("The dog chased the cat") loses its meaning if you scramble the words. A stock market price chart is mathematically useless if you randomly shuffle the days. This is Sequential Data.
The Problem with Traditional Neural Networks
Why can't we just use a Convolutional Neural Network (CNN) or a standard Feed-Forward Network to predict the next word in a sentence?
- Fixed Input Size: Standard feed-forward networks (and CNNs with fully connected heads) expect a fixed input size (e.g., exactly 224x224 pixels). Sentences, however, can be 3 words long or 300 words long!
- No Memory (State): If you feed a standard network the word "The", it processes it. If you then feed it the word "dog", it processes it entirely independently. It has absolutely no memory that it just saw the word "The" a millisecond ago!
Enter the Recurrent Neural Network (RNN)
To solve the memory problem, researchers invented the Recurrent Neural Network (RNN). An RNN features an internal "Hidden State" (a memory vector). When it reads a word, it mathematically updates this memory vector. When it reads the next word, it looks at the new word AND its own memory vector simultaneously!
Figure: "Unrolling" an RNN through time. The Hidden State acts as a bridge, passing context from the past into the future.
Building an RNN Cell from Scratch
The math inside an RNN cell is actually incredibly elegant. It takes the current input ($x_t$), multiplies it by a weight matrix ($W_{ih}$), takes the previous hidden state ($h_{t-1}$), multiplies it by a different weight matrix ($W_{hh}$), adds them together, and applies a Tanh activation function to keep the numbers stably between -1 and 1.
Python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Weights for the incoming data (e.g., the current word vector)
        self.W_ih = nn.Linear(input_size, hidden_size)
        # Weights for the previous memory (the hidden state)
        self.W_hh = nn.Linear(hidden_size, hidden_size)
        # Tanh activation squashes the memory between -1 and 1 to prevent exploding numbers
        self.tanh = nn.Tanh()

    def forward(self, x_t, h_prev):
        # 1. Process the current input
        input_transformed = self.W_ih(x_t)
        # 2. Process the previous memory
        hidden_transformed = self.W_hh(h_prev)
        # 3. Combine them and apply activation to create the NEW memory!
        h_current = self.tanh(input_transformed + hidden_transformed)
        return h_current

# Let's test it!
rnn = SimpleRNNCell(input_size=10, hidden_size=20)
x = torch.randn(1, 10)        # Current word vector (size 10)
h_prev = torch.zeros(1, 20)   # Initial blank memory (size 20)
h_new = rnn(x, h_prev)
print(f"New Memory State Shape: {h_new.shape}")
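A single cell call handles one word. Reading a whole sentence means calling the cell in a loop and feeding each new hidden state back in. The sketch below uses PyTorch's built-in `nn.RNNCell`, which applies the same tanh update rule as the cell above, on a made-up 5-step sequence:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.RNNCell(input_size=10, hidden_size=20)   # built-in tanh RNN cell

sentence = torch.randn(5, 1, 10)   # 5 time steps, batch of 1, 10 features per "word"
h = torch.zeros(1, 20)             # blank memory before the first word

hidden_states = []
for x_t in sentence:               # "unrolling" the RNN through time
    h = cell(x_t, h)               # new memory depends on the word AND the old memory
    hidden_states.append(h)

hidden_states = torch.stack(hidden_states)   # Shape: [5, 1, 20]
print(f"All hidden states: {hidden_states.shape}")
print(f"Final memory (summary of the whole sentence): {h.shape}")
```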
The Fatal Flaw: The Vanishing Gradient Problem
Standard RNNs sound perfect, but they struggle badly on long sequences. Because the hidden state is constantly multiplied by weight matrices and squashed by Tanh over and over again, the gradients (the learning signals) tend to shrink exponentially as they travel back in time during backpropagation. If you have a sentence of 50 words, by the time the AI gets to word 50, the math has effectively "forgotten" word 1! This is the infamous Vanishing Gradient Problem.
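The shrinking can be observed in a toy setting. The sketch below is a deliberately simplified scalar "RNN" with no inputs and a single recurrent weight of 0.5 (both are assumptions for illustration); it measures how strongly the final state still depends on the initial one:

```python
import torch

def gradient_through_time(num_steps, w=0.5):
    """Gradient of the final hidden state with respect to the initial hidden state."""
    h0 = torch.tensor(1.0, requires_grad=True)
    h = h0
    for _ in range(num_steps):
        h = torch.tanh(w * h)   # scalar version of the RNN update, without inputs
    h.backward()
    return h0.grad.item()

for steps in (1, 10, 50):
    print(f"{steps:>2} steps back in time -> gradient = {gradient_through_time(steps):.3e}")
```

Each step multiplies the gradient by `w * (1 - tanh^2)`, which is at most 0.5 here, so after 50 steps almost no learning signal reaches the first step.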
The Savior: Long Short-Term Memory (LSTM)
In 1997, Sepp Hochreiter and Jürgen Schmidhuber tackled the vanishing gradient problem by inventing the LSTM. Instead of just blindly overwriting the memory at every single step, an LSTM introduces a "Cell State" (like a conveyor belt that runs straight down the entire sequence, largely undisturbed) and mathematical Gates to carefully control what information is added or removed. The modern LSTM uses three gates (the Forget Gate was added in follow-up work by Gers et al.):
- Forget Gate: Looks at the current word and the previous memory, and decides what to delete from the Cell State (using a Sigmoid function to output a number between 0 and 1: 0 = erase entirely, 1 = keep entirely).
- Input Gate: Decides what new information from the current word is important enough to be added to the Cell State.
- Output Gate: Decides what specific part of the Cell State should be exposed as the final output for this particular time step.
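The three gates above map onto code almost line by line. This is a from-scratch sketch in the style of the RNN cell earlier in the chapter, not PyTorch's internal implementation. Computing all gates with a single `nn.Linear` is a common convenience, and the class name is illustrative:

```python
import torch
import torch.nn as nn

class SimpleLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear layer produces the pre-activations of all gates at once
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        combined = torch.cat([x_t, h_prev], dim=1)
        f, i, g, o = self.gates(combined).chunk(4, dim=1)
        f = torch.sigmoid(f)   # Forget Gate: how much of the old Cell State to keep
        i = torch.sigmoid(i)   # Input Gate: how much new information to write
        g = torch.tanh(g)      # Candidate values that could be written
        o = torch.sigmoid(o)   # Output Gate: how much of the Cell State to expose
        c_new = f * c_prev + i * g          # update the "conveyor belt"
        h_new = o * torch.tanh(c_new)       # the output / hidden state for this step
        return h_new, c_new

# Test it with one time step
lstm_cell = SimpleLSTMCell(input_size=10, hidden_size=20)
x_t = torch.randn(1, 10)
h_state, c_state = torch.zeros(1, 20), torch.zeros(1, 20)
h_state, c_state = lstm_cell(x_t, h_state, c_state)
print(f"Hidden State: {h_state.shape}, Cell State: {c_state.shape}")
```

Because the Cell State is updated by addition (`f * c_prev + i * g`) rather than by repeated squashing, gradients have a much more direct path back through time.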
Applied Project: Time-Series Forecasting (Stock Market)
Let's use an LSTM for a classic sequence task: predicting the next value from a window of past values. Below is a PyTorch script that defines an LSTM model for Time-Series Forecasting (e.g., predicting tomorrow's stock price based on the last 7 days of historical data). The model in this script is untrained, so the number it prints is meaningless until you fit it on real (and normalized) data.
Python
import torch
import torch.nn as nn


class StockPredictorLSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        # PyTorch provides a highly optimized, pre-built LSTM module!
        # batch_first=True means our data shape will be (Batch, SequenceLength, Features)
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True
        )
        # A simple Linear layer to map the LSTM's memory to a single price prediction
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x shape: [Batch, Sequence_Length, Features]
        # For example: [32, 7, 1] (a batch of 32 sequences, 7 days of history, 1 price feature)
        # Pass the sequence through the LSTM
        # 'out' contains the outputs for EVERY time step.
        # (hn, cn) are the final hidden and cell states.
        out, (hn, cn) = self.lstm(x)
        # We only care about the LSTM's output at the VERY LAST time step (day 7) to predict day 8
        last_time_step_out = out[:, -1, :]
        # Predict the next price!
        prediction = self.fc(last_time_step_out)
        return prediction


# Let's test the model!
model = StockPredictorLSTM()
# Dummy data: a batch of 1 sequence, 7 days of historical prices, 1 feature per day
history = torch.tensor([[[150.0], [151.2], [152.1], [149.8], [150.5], [153.0], [155.1]]])
# Predict tomorrow's price! (The weights are random, so this only checks the shapes)
predicted_price = model(history)
print(f"Input History Shape: {history.shape}")
print(f"Predicted Next Price (Raw Output): {predicted_price.item():.4f}")
Applied Project 2: Sentiment Analysis (Bi-LSTM + Embeddings)
A classic NLP problem is Sentiment Analysis: classifying a movie review as Positive or Negative. Here, we combine an Embedding Layer (to turn words into math vectors) with a Bidirectional LSTM (which reads the sentence both forwards and backwards, so every word is interpreted with context from both sides!).
Python
import torch
import torch.nn as nn


class SentimentBiLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100, hidden_size=64):
        super().__init__()
        # 1. Embedding Layer: Turns integer token IDs into dense 100-D math vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # 2. Bidirectional LSTM: Reads the sentence left-to-right AND right-to-left
        # This makes the output hidden state 2x the hidden_size (64 * 2 = 128)
        self.lstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True
        )
        # 3. Classifier: Maps the 128-D memory to a single sentiment score
        self.fc = nn.Linear(hidden_size * 2, 1)
        # Sigmoid ensures output is between 0 (Negative) and 1 (Positive)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x shape: [Batch, Sequence_Length] (e.g., [1, 5] for a 5-word sentence)
        embedded = self.embedding(x)
        # Pass embeddings through Bi-LSTM
        _, (hn, cn) = self.lstm(embedded)
        # 'hn' contains the final memory of both the forward and backward LSTMs
        # (hn[-2] is the forward direction, hn[-1] is the backward direction)
        # We concatenate them together to form a complete understanding of the sentence
        hidden_cat = torch.cat((hn[-2, :, :], hn[-1, :, :]), dim=1)
        out = self.fc(hidden_cat)
        return self.sigmoid(out)


# Let's test the model!
vocab_size = 10000  # Our dictionary has 10,000 known words
model = SentimentBiLSTM(vocab_size)
# Dummy sentence: "The movie was absolutely fantastic" (represented by 5 integer IDs)
sentence_tokens = torch.tensor([[4, 892, 12, 405, 77]])
sentiment_score = model(sentence_tokens)
print(f"Input Tokens: {sentence_tokens.tolist()}")
print(f"Sentiment Score (0=Bad, 1=Good): {sentiment_score.item():.4f}")
Applied Project 3: Language Modeling (AI Text Generation)
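How does an AI write text? It uses a Language Model architecture. The model is trained to look at a sequence of words and predict, for every word in the dictionary, the probability that it comes next. By taking the highest-probability word (this is called greedy decoding; real systems often sample from the distribution instead), appending it to the input, and feeding it back into the model, the AI can "write" sentences one word at a time for as long as you like!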
Python
import torch
import torch.nn as nn


class TextGeneratorLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Standard LSTM (Not Bidirectional, because we can't look at the future if we are writing it!)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        # Maps the memory to a raw score (logit) for EVERY word in the dictionary
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, prev_memory=None):
        embedded = self.embedding(x)
        out, memory = self.lstm(embedded, prev_memory)
        # We only want to predict the next word based on the LAST time step's memory
        next_word_logits = self.fc(out[:, -1, :])
        return next_word_logits, memory


# Test a single generation step!
vocab_size = 5000  # Dictionary size
model = TextGeneratorLSTM(vocab_size)
# The prompt: "Once upon a" (3 tokens)
prompt = torch.tensor([[150, 82, 5]])
# Predict the scores for the 4th word
logits, _ = model(prompt)
# Apply Softmax to convert raw scores to probabilities, then find the highest one (argmax)
probabilities = torch.nn.functional.softmax(logits, dim=-1)
predicted_word_id = torch.argmax(probabilities, dim=-1)
print(f"Prompt tokens: {prompt.tolist()}")
print(f"The AI predicts the next word ID is: {predicted_word_id.item()}")
print("(Repeating this predict-and-append step is how models like ChatGPT write, although they use Transformers instead of LSTMs!)")
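The "feed it back" step described above becomes a loop once you want more than one word. Here is a minimal sketch of greedy generation. `TinyLM` is a deliberately small, hypothetical stand-in with the same interface as the text generator (it returns next-word logits plus the LSTM memory), and the prompt IDs are arbitrary.

```python
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Hypothetical miniature language model used only to demonstrate the loop."""

    def __init__(self, vocab_size=50, embedding_dim=16, hidden_size=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, memory=None):
        out, memory = self.lstm(self.embedding(x), memory)
        return self.fc(out[:, -1, :]), memory


@torch.no_grad()
def generate_greedy(model, prompt_ids, max_new_tokens=5):
    model.eval()
    generated = prompt_ids.clone()
    # Read the whole prompt once to build up the memory
    logits, memory = model(generated)
    for _ in range(max_new_tokens):
        next_id = logits.argmax(dim=-1, keepdim=True)       # [Batch, 1]
        generated = torch.cat([generated, next_id], dim=1)  # Append the new word
        # Feed ONLY the new word back in; the memory already summarizes the past
        logits, memory = model(next_id, memory)
    return generated


torch.manual_seed(0)
model = TinyLM()
prompt = torch.tensor([[3, 7, 11]])
out = generate_greedy(model, prompt, max_new_tokens=5)
print(f"Prompt + generated token IDs: {out.tolist()}")
```

Because the model is untrained, the generated IDs are meaningless. The point is the control flow: predict, append, feed back, repeat. Swapping `argmax` for sampling (for example with `torch.multinomial`) gives more varied text.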
Expert Path: Advanced Sequence Concepts
- 1. GRU (Gated Recurrent Unit): An LSTM has 3 gates, which makes every step relatively expensive. In 2014, Cho and colleagues introduced the GRU, which merges the Cell State and Hidden State into one and uses only 2 gates (Update and Reset). It has fewer parameters than an LSTM of the same size, often performs about as well, and usually trains somewhat faster!
- 2. Bidirectional RNNs/LSTMs: If you are predicting stock prices, you can only look at the past. But if you are translating a sentence, looking at the future words helps you understand the context of the current word! A Bidirectional LSTM runs two LSTMs over the same sequence, one reading left-to-right and one reading right-to-left, and concatenates their hidden states.
- 3. Word Embeddings (Word2Vec): AI cannot read raw strings like "dog". Before text goes into an LSTM, every word is tokenized and mapped to a dense mathematical vector (e.g., 300 dimensions). These vectors geometrically capture semantic meaning, famously allowing vector arithmetic like:
Vector("King") - Vector("Man") + Vector("Woman") ≈ Vector("Queen").
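The vector arithmetic above can be tried out on a toy example. The 2-D vectors below are hand-made for illustration (one axis for "royalty", one for "gender"); real Word2Vec embeddings are learned from text and have hundreds of dimensions, so treat this as a sketch of the idea rather than of a trained model.

```python
import math

# Hand-made toy vectors: (royalty, gender). These are assumptions, not learned embeddings.
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(a, b):
    """Cosine similarity: 1 means same direction, -1 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))  # prints: queen
```

Excluding the input words from the candidates is standard practice for this test, because the result vector often stays closest to one of the words you started with.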
Chapter 3: Generative AI (Diffusion & GANs)
Welcome to Chapter 3. Up until now, most of the AI models we have built have been Discriminative. They take data and predict a label (e.g., "Is this a dog?" or "Is this movie review positive?"). We are now entering the frontier of Generative AI. Instead of predicting labels, a generative model learns the underlying mathematical distribution of the data so it can create brand new data from scratch.
1. Generative Adversarial Networks (GANs)
In 2014, Ian Goodfellow and his colleagues introduced a brilliant concept that dominated image generation for years: the GAN. The idea comes straight from Game Theory. You build two entirely separate neural networks and train them against each other:
- The Generator (The Counterfeiter): Takes random mathematical static (noise vectors) and tries to transform it into a realistic image (e.g., a fake Mona Lisa).
- The Discriminator (The Police): Looks at both real images and the Generator's fake images, and tries to guess which is which.
As they fight, they both get better. In the ideal outcome, the Generator's fakes become indistinguishable from real data, so the Discriminator can do no better than guessing 50/50. At that point, you have an AI that can generate realistic images!
Applied Project: A Simple GAN in PyTorch
Let's build the architecture for a GAN that generates handwritten digits (like the famous MNIST dataset). We will create a Generator to create fakes, and a Discriminator to catch them.
Python
import torch
import torch.nn as nn


# 1. THE GENERATOR (The Counterfeiter)
class Generator(nn.Module):
    def __init__(self, noise_dim=100, img_dim=784):  # 784 is a flat 28x28 image
        super().__init__()
        self.gen = nn.Sequential(
            nn.Linear(noise_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, img_dim),
            nn.Tanh()  # Output pixel values in (-1, 1), matching images normalized to [-1, 1]
        )

    def forward(self, x):
        return self.gen(x)


# 2. THE DISCRIMINATOR (The Police)
class Discriminator(nn.Module):
    def __init__(self, img_dim=784):
        super().__init__()
        self.disc = nn.Sequential(
            nn.Linear(img_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()  # Outputs a probability between 0 (Fake) and 1 (Real)
        )

    def forward(self, x):
        return self.disc(x)


# Testing the architectures (both networks are untrained here)
noise = torch.randn(1, 100)  # Random static (100 dimensions)
generator = Generator()
discriminator = Discriminator()
fake_image = generator(noise)  # Generator makes a fake, flattened 28x28 image
verdict = discriminator(fake_image)  # Discriminator inspects the fake image
print(f"Fake Image Shape: {fake_image.shape}")
print(f"Discriminator Verdict (Probability it is Real): {verdict.item():.4f}")
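The code above only defines the two networks. The "fight" itself happens in the training loop, and a single round of it can be sketched as follows. The random `real_images` tensor is a placeholder for a real MNIST batch, and the optimizer settings (Adam with a learning rate of 2e-4 and beta1 of 0.5) are common GAN defaults used here as assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim, img_dim, batch = 100, 784, 16

generator = nn.Sequential(
    nn.Linear(noise_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, img_dim), nn.Tanh()
)
discriminator = nn.Sequential(
    nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid()
)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCELoss()

real_images = torch.rand(batch, img_dim) * 2 - 1  # Placeholder for real data in [-1, 1]
real_labels = torch.ones(batch, 1)
fake_labels = torch.zeros(batch, 1)

# Step 1: train the Discriminator to say "real" for real images and "fake" for fakes
fake_images = generator(torch.randn(batch, noise_dim))
loss_d = bce(discriminator(real_images), real_labels) + bce(
    discriminator(fake_images.detach()), fake_labels  # detach: do not update the Generator here
)
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Step 2: train the Generator to make the Discriminator say "real" for its fakes
loss_g = bce(discriminator(fake_images), real_labels)
opt_g.zero_grad()
loss_g.backward()
opt_g.step()

print(f"Discriminator loss: {loss_d.item():.4f}, Generator loss: {loss_g.item():.4f}")
```

In a full training run these two steps alternate for every batch. Note the trick in Step 2: the Generator is rewarded when its fakes are labeled "real", which is what makes the two networks adversaries.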
The Flaw of GANs: Mode Collapse
If GANs are so great, why do modern image generators such as Stable Diffusion use Diffusion instead? GANs suffer from notorious training instability. The most famous problem is Mode Collapse: the Generator discovers a handful of outputs (in the extreme case, ONE single image, e.g., a perfect picture of a dog) that consistently fool the Discriminator. Instead of learning to draw cats, birds, and cars, the Generator gets lazy and just keeps printing that same dog! Keeping the two networks balanced so that neither overpowers the other takes careful tuning.
2. The Diffusion Revolution (Thermodynamics in AI)
Inspired by non-equilibrium thermodynamics in physics, Diffusion Models have largely displaced GANs for image generation. Instead of playing a delicate adversarial game between two networks, Diffusion works in two mathematically solid steps:
- The Forward Process (Destruction): Take a crisp, perfect image and slowly add Gaussian noise over 1,000 steps until the image is nothing but TV static. This requires no AI; it is pure, trackable math (a Markov Chain).
- The Reverse Process (Creation): Train a massive Neural Network (specifically, a U-Net) to take a slightly noisy image and predict the specific noise that was added. Subtract the predicted noise. Repeat this 1,000 times on pure static, and a brand new image emerges from the void!
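The Forward Process above has a convenient property: you do not have to add noise one step at a time, because the noisy image at any step t can be computed directly from the original. The sketch below assumes the linear noise schedule from the original DDPM paper (betas from 1e-4 to 0.02 over 1,000 steps); the random tensor `x0` stands in for a real image.

```python
import torch

torch.manual_seed(0)
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)       # Amount of noise added at each step
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # Fraction of the original signal left after t steps


def add_noise(x0, t):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    noise = torch.randn_like(x0)
    abar = alphas_cumprod[t]
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise
    return x_t, noise


x0 = torch.rand(1, 1, 28, 28) * 2 - 1  # Placeholder "image" scaled to [-1, 1]
for t in (0, 499, 999):
    x_t, noise = add_noise(x0, t)
    print(f"step {t:>3}: signal kept = {alphas_cumprod[t].item():.5f}")
```

During training, the network receives `x_t` and `t` and is asked to predict `noise`; the loss is simply the mean squared error between its prediction and the noise that was actually added. By the last step almost none of the original signal is left, which is why generation can start from pure static.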
3. Latent Diffusion (Stable Diffusion)
There was one major problem with standard Diffusion: running a massive U-Net hundreds or thousands of times directly on high-resolution pixels (e.g., a 1024x1024 image) is extremely slow and memory-hungry. To solve this, researchers invented Latent Diffusion.
Instead of diffusing pixels, we use a Variational Autoencoder (VAE) to compress the image into a much smaller mathematical "Latent Space". In Stable Diffusion v1, for example, a 512x512 RGB image becomes a 64x64 latent with 4 channels (8x smaller along each side). We add noise and run the U-Net entirely in this compressed latent space. Once the U-Net produces a clean latent tensor, the VAE decoder turns it back into a full-resolution image. This mathematical shortcut is the secret to generating images on a consumer GPU in seconds!
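To get a feel for how much work the latent shortcut saves, it helps to count the numbers the U-Net has to process. The shapes below are the ones commonly cited for Stable Diffusion v1 (a 512x512 RGB image and a 64x64 latent with 4 channels); they are used here as an illustrative assumption.

```python
def num_values(shape):
    """Multiply the dimensions of a tensor shape together."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixel_shape = (3, 512, 512)  # Channels, Height, Width of the image
latent_shape = (4, 64, 64)   # Channels, Height, Width of the compressed latent

pixel_values = num_values(pixel_shape)
latent_values = num_values(latent_shape)
ratio = pixel_values / latent_values

print(f"Pixel space:  {pixel_values:,} values")
print(f"Latent space: {latent_values:,} values")
print(f"The U-Net sees {ratio:.0f}x fewer values per denoising step")
```

Every one of the denoising steps benefits from this reduction, which is why the saving is so dramatic in practice.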
Applied Project: The Denoising U-Net Block
At the absolute core of Stable Diffusion is the Denoising U-Net. The U-Net must take a noisy image, BUT it also needs to know what step of the 1,000 steps it is currently on, and it needs to be conditioned on Text (so you can prompt it to draw a dog!). Here is how that architecture conceptually works in PyTorch using Cross-Attention.
Python
import torch
import torch.nn as nn


class DiffusionUNetBlock(nn.Module):
    def __init__(self, channels=256, time_embed_dim=768, text_embed_dim=768):
        super().__init__()
        # 1. Standard Convolution to process the noisy image
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # 2. Linear layer to inject the Time-Step Embeddings (e.g., Step 500/1000)
        self.time_mlp = nn.Linear(time_embed_dim, channels)
        # 3. Cross-Attention: This is how Text (CLIP embeddings) steers the Image generation!
        # The Image queries the Text to figure out what it should draw next.
        # kdim/vdim let the text vectors have a different width than the image features.
        self.cross_attention = nn.MultiheadAttention(
            embed_dim=channels,
            num_heads=8,
            kdim=text_embed_dim,
            vdim=text_embed_dim,
            batch_first=True
        )

    def forward(self, noisy_image, time_embedding, text_embedding):
        # 1. Process the image
        x = self.conv(noisy_image)
        # 2. Inject Time (telling the network how noisy the image currently is)
        # We reshape time to broadcast across the height/width of the image
        time_emb = self.time_mlp(time_embedding).unsqueeze(-1).unsqueeze(-1)
        x = x + time_emb
        # 3. Inject Text via Cross Attention (Conditioning)
        # Flatten image from [Batch, Channels, H, W] to [Batch, H*W, Channels] for attention
        b, c, h, w = x.shape
        x_flat = x.flatten(2).permute(0, 2, 1)
        # Cross Attention: Image (Q) looks at Text (K, V)
        attention_out, _ = self.cross_attention(
            query=x_flat, key=text_embedding, value=text_embedding, need_weights=False
        )
        # Residual connection: keep the image features and ADD what the text contributed
        x_flat = x_flat + attention_out
        # Reshape back to image format
        x = x_flat.permute(0, 2, 1).reshape(b, c, h, w)
        return x  # Later layers use this tensor to predict the noise to subtract!


# Dummy execution to show tensor shapes
unet_block = DiffusionUNetBlock()
noisy_img = torch.randn(1, 256, 64, 64)  # 64x64 latent feature map with 256 channels
time_emb = torch.randn(1, 768)  # Embedded time step (e.g., step 500)
text_emb = torch.randn(1, 77, 768)  # "A dog in space" (77 tokens, 768 dimensions, CLIP-style)
output_features = unet_block(noisy_img, time_emb, text_emb)
print(f"Output Features Shape: {output_features.shape}")
print("(Blocks like this repeat many times inside Stable Diffusion's U-Net!)")
Chapter 4: Large Language Models (LLMs) - Complete Guide
We now arrive at the technology that shocked the world: the Large Language Model. You already know how a text-generator works (we built an LSTM language model in Chapter 2). But how did we go from an AI that merely continues text to a highly capable, conversational assistant like ChatGPT? This chapter focuses on two breakthroughs: RLHF, which turns a text predictor into an assistant, and Mixture of Experts (MoE), which makes enormous models affordable to run.
1. The Base Model vs. The Assistant
If you trained a massive Transformer on the entire internet, you would get a "Base Model" (like the original GPT-3). Base models are incredibly smart, but they have one fatal flaw: they are document continuers, not assistants.
Example of Base Model behavior:
You prompt: "Write a poem about a cat."
Base Model responds: "Write a poem about a dog. Write a poem about a bird. Write a poem about a fish."
The model thinks you're just listing prompts to complete in a document, not asking for a specific task.
To fix this, OpenAI published the InstructGPT paper, introducing a brilliant pipeline to teach the AI to follow instructions and act like a helpful human.
2. RLHF (Reinforcement Learning from Human Feedback)
To turn the Base Model into an Assistant, we must align it with human preferences. This is done in three rigorous steps.
Step 1: Supervised Fine-Tuning (SFT)
OpenAI hired human labelers to write thousands of high-quality prompt/response demonstrations. The Base Model is fine-tuned on this data to learn the format of being an assistant.
Complete SFT Implementation
Python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class AssistantDataset(Dataset):
    def __init__(self, prompts, responses, tokenizer, max_length=512):
        self.prompts = prompts
        self.responses = responses
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        # Format: [PROMPT] + [SEP] + [RESPONSE]
        # (GPT-2 treats these markers as ordinary text, not as special tokens)
        full_text = f"[PROMPT] {self.prompts[idx]} [SEP] {self.responses[idx]}"
        # Tokenize
        encoded = self.tokenizer(
            full_text,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt'
        )
        return {
            'input_ids': encoded['input_ids'].squeeze(0),
            'attention_mask': encoded['attention_mask'].squeeze(0)
        }


def supervised_fine_tuning():
    # Load base model
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    # GPT-2 has no padding token, so we reuse the end-of-sequence token
    tokenizer.pad_token = tokenizer.eos_token
    # Example training data (in practice, you'd have thousands of examples)
    prompts = [
        "Write a poem about a cat.",
        "Explain quantum physics simply.",
        "Write a recipe for chocolate cake."
    ]
    responses = [
        "Whiskers twitch, eyes aglow, \nA tiny hunter steps on snow...",
        "Quantum physics is the study of how really tiny things behave...",
        "Ingredients: flour, sugar, eggs, cocoa powder...",
    ]
    # Create dataset and dataloader
    dataset = AssistantDataset(prompts, responses, tokenizer)
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    # Training setup
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
    model.train()
    # Training loop (simplified)
    for epoch in range(3):
        for batch in dataloader:
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask']
            # The labels are the input IDs themselves; the model shifts them
            # internally so that each position predicts the NEXT token.
            labels = input_ids.clone()
            # Padding positions get the label -100, which the loss ignores
            labels[attention_mask == 0] = -100
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Report the loss of the last batch in this epoch
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
    return model, tokenizer


# Note: In practice, you'd train on many thousands of examples
# and use much larger models.
Step 2: The Reward Model
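We can't use humans to grade every single thing the AI ever writes; it's too expensive. Instead, we train a second AI (the Reward Model) to act as a human judge. It reads a prompt together with an answer and outputs a single number that predicts how much a human would like that answer. In InstructGPT, labelers ranked several answers to the same prompt and the Reward Model was trained on those comparisons. The simplified implementation below instead learns to output a score out of 10 directly, which is easier to follow.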
Complete Reward Model Implementation
Python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

PAD_ID = 0  # Token ID reserved for padding


class RewardDataset(Dataset):
    def __init__(self, prompts, responses, scores, max_len=512):
        self.prompts = prompts
        self.responses = responses
        self.scores = scores
        self.max_len = max_len

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        # Combine prompt and response with marker strings
        combined = f"[PROMPT] {self.prompts[idx]} [SEP] {self.responses[idx]}"
        # Convert to token IDs (simplified for demonstration)
        # In reality, you'd use a proper tokenizer
        # IDs start at 1 so that 0 stays free for padding
        tokens = [ord(c) % 1000 + 1 for c in combined][:self.max_len]
        # Pad every sequence to the same length so the DataLoader can stack them
        tokens = tokens + [PAD_ID] * (self.max_len - len(tokens))
        return torch.tensor(tokens), torch.tensor(self.scores[idx], dtype=torch.float)


class RewardModel(nn.Module):
    def __init__(self, vocab_size, d_model=256, max_seq_len=512):
        super().__init__()
        # 1. Embed the text (Prompt + Answer) into math vectors
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=PAD_ID)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        # 2. A standard Transformer Encoder to understand the context
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=8,
            dim_feedforward=512,
            batch_first=True,
            dropout=0.1
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # 3. The magic: compress the sequence's meaning into a SINGLE score!
        self.score_head = nn.Linear(d_model, 1)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        batch_size, seq_len = x.shape
        padding_mask = x == PAD_ID  # True wherever the token is padding
        embedded = self.embedding(x)
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
        embedded = embedded + self.position_embedding(positions)
        # The mask stops real tokens from attending to padding
        transformer_out = self.transformer(embedded, src_key_padding_mask=padding_mask)
        transformer_out = self.layer_norm(transformer_out)
        # Take the representation of the LAST REAL (non-padding) token
        last_idx = (~padding_mask).sum(dim=1) - 1
        final_token_representation = transformer_out[torch.arange(batch_size, device=x.device), last_idx]
        # Output the score, squashed into the range 0 to 10
        reward_score = self.score_head(final_token_representation)
        reward_score = torch.sigmoid(reward_score) * 10
        return reward_score


def train_reward_model():
    model = RewardModel(10000)
    prompts = [
        "Write a poem about a cat.",
        "Write a poem about a cat.",
        "Explain quantum physics simply.",
        "Explain quantum physics simply."
    ]
    responses = [
        "Whiskers twitch, eyes aglow...",
        "cat cat cat cat cat cat",
        "Quantum physics is the study...",
        "I don't know what you're talking about."
    ]
    scores = [9.0, 1.0, 8.5, 0.5]
    dataset = RewardDataset(prompts, responses, scores)
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.MSELoss()
    model.train()
    for epoch in range(10):
        for tokens, target_scores in dataloader:
            # squeeze(-1) turns [Batch, 1] into [Batch] to match the targets
            predicted_scores = model(tokens).squeeze(-1)
            loss = criterion(predicted_scores, target_scores)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
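The implementation above learns from absolute scores. Reward models in the InstructGPT style are usually trained on comparisons instead: a labeler says which of two answers is better, and the model is pushed to score the preferred answer higher. A minimal sketch of that pairwise loss follows; the function name and the example scores are made up for illustration.

```python
import torch
import torch.nn.functional as F


def pairwise_reward_loss(chosen_scores, rejected_scores):
    """-log(sigmoid(r_chosen - r_rejected)), averaged over the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()


# Reward model already agrees with the human: preferred answers score higher
agree = pairwise_reward_loss(torch.tensor([2.0, 1.5]), torch.tensor([-1.0, 0.0]))
# Reward model disagrees with the human: preferred answers score lower
disagree = pairwise_reward_loss(torch.tensor([-1.0, 0.0]), torch.tensor([2.0, 1.5]))
# Reward model cannot tell the answers apart
tie = pairwise_reward_loss(torch.tensor([0.0]), torch.tensor([0.0]))

print(f"agree: {agree.item():.4f}, tie: {tie.item():.4f}, disagree: {disagree.item():.4f}")
```

The loss only depends on the difference between the two scores, so the model never needs humans to agree on what "7 out of 10" means. People are generally more consistent at saying which answer is better than at assigning absolute grades.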
Step 3: PPO (Proximal Policy Optimization)
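Finally, we let the main AI generate a large number of answers. The Reward Model scores them. Using Reinforcement Learning (PPO), the main AI updates its own weights to maximize its score! InstructGPT also adds a KL penalty that keeps the updated model close to the SFT model, so it cannot drift into strange text that merely fools the Reward Model.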
Complete PPO Implementation for LLMs
Python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical


class PPOLLM:
    """Simplified PPO trainer: a "state" is a token sequence, an "action" is the next token."""

    def __init__(self, model, reward_model, tokenizer, lr=1e-5,
                 clip_epsilon=0.2, value_coef=0.5, entropy_coef=0.01):
        self.model = model  # The LLM we're training (an LLMWithValueHead, defined below)
        self.reward_model = reward_model  # The judge
        self.tokenizer = tokenizer
        self.clip_epsilon = clip_epsilon
        self.value_coef = value_coef
        self.entropy_coef = entropy_coef
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        self.clear_memory()

    def clear_memory(self):
        # Memory for experiences
        self.states, self.actions, self.log_probs = [], [], []
        self.values, self.rewards, self.dones = [], [], []

    def get_action(self, state):
        # state: [1, Sequence_Length] token IDs
        with torch.no_grad():
            logits, value = self.model(state)  # Next-token logits [1, Vocab], value [1]
        dist = Categorical(logits=logits)
        action = dist.sample()  # The sampled next token
        return action, dist.log_prob(action), value

    def store(self, state, action, log_prob, value, reward, done):
        # Simplification: all stored states must have the same length so they can be batched.
        # The reward is typically 0 for every token except the last one of a response,
        # where it is the Reward Model's score.
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.values.append(float(value))
        self.rewards.append(float(reward))
        self.dones.append(float(done))

    def compute_advantages(self, gamma=0.99, gae_lambda=0.95):
        rewards = torch.tensor(self.rewards, dtype=torch.float32)
        values = torch.tensor(self.values, dtype=torch.float32)
        dones = torch.tensor(self.dones, dtype=torch.float32)
        advantages = torch.zeros_like(rewards)
        gae = 0.0
        for t in reversed(range(len(rewards))):
            next_value = 0.0 if t == len(rewards) - 1 else values[t + 1]
            delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
            gae = delta + gamma * gae_lambda * (1 - dones[t]) * gae
            advantages[t] = gae
        # Value targets use the RAW advantages; only the policy loss uses normalized ones
        returns = advantages + values
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        return advantages, returns

    def update(self, epochs=4, batch_size=64):
        advantages, returns = self.compute_advantages()
        states = torch.cat(self.states)  # [N, Sequence_Length]
        actions = torch.cat(self.actions)  # [N]
        old_log_probs = torch.cat(self.log_probs)  # [N]
        last_loss = 0.0
        for _ in range(epochs):
            indices = torch.randperm(len(states))
            for start in range(0, len(states), batch_size):
                batch_idx = indices[start:start + batch_size]
                # This forward pass keeps gradients so BOTH the policy and the value head learn
                logits, values = self.model(states[batch_idx])
                dist = Categorical(logits=logits)
                new_log_probs = dist.log_prob(actions[batch_idx])
                entropy = dist.entropy().mean()
                ratio = (new_log_probs - old_log_probs[batch_idx]).exp()
                # Clipped surrogate objective
                surr1 = ratio * advantages[batch_idx]
                surr2 = torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon) * advantages[batch_idx]
                actor_loss = -torch.min(surr1, surr2).mean()
                value_loss = F.mse_loss(values, returns[batch_idx])
                loss = actor_loss + self.value_coef * value_loss - self.entropy_coef * entropy
                self.optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 0.5)
                self.optimizer.step()
                last_loss = loss.item()
        # Clear memory
        self.clear_memory()
        return last_loss


class LLMWithValueHead(nn.Module):
    """Base LLM with an additional value head for RL"""

    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.value_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, x):
        # Hidden states are only returned when we explicitly ask for them
        outputs = self.base_model(x, output_hidden_states=True)
        last_hidden = outputs.hidden_states[-1][:, -1, :]
        next_token_logits = outputs.logits[:, -1, :]  # [Batch, Vocab]
        value = self.value_head(last_hidden).squeeze(-1)  # [Batch]
        return next_token_logits, value

    def get_value(self, x):
        # Convenience wrapper for rollouts, where no gradients are needed
        with torch.no_grad():
            _, value = self.forward(x)
        return value
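One ingredient that RLHF pipelines in the InstructGPT style add on top of the Reward Model's score is a KL penalty: the policy is punished for assigning very different probabilities to its tokens than the original SFT model would. A minimal sketch of that reward shaping follows. The function name, the coefficient `beta=0.1`, and the example log-probabilities are assumptions for illustration.

```python
import torch


def kl_shaped_reward(reward_score, policy_log_probs, ref_log_probs, beta=0.1):
    """Reward Model score minus beta * sum_t (log pi(token_t) - log pi_ref(token_t))."""
    kl_estimate = (policy_log_probs - ref_log_probs).sum()
    return reward_score - beta * kl_estimate


# Log-probabilities that each model assigned to the 3 tokens of one generated answer
policy_log_probs = torch.tensor([-1.0, -0.5, -2.0])  # The model being trained
ref_log_probs = torch.tensor([-1.2, -0.9, -2.0])     # The frozen SFT model

shaped = kl_shaped_reward(torch.tensor(8.0), policy_log_probs, ref_log_probs)
print(f"Reward Model score: 8.00 -> reward used by PPO: {shaped.item():.2f}")
```

If the policy behaves exactly like the SFT model, the penalty is zero and PPO sees the raw score. The further it drifts, the more reward it gives up, which discourages the model from "hacking" the Reward Model with unnatural text.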
3. The Scaling Wall: Why we can't just build bigger models
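GPT-3 had 175 Billion parameters. What if we want a 1-Trillion parameter model? The problem is VRAM (Video RAM) and Compute. In a dense model, every single parameter takes part in the computation for every single word, so all of the weights must sit in GPU memory and be read at every step. At 2 bytes per parameter, a 1-Trillion parameter model needs about 2 TB just to hold its weights, which means spreading it across dozens of high-end GPUs before it can generate a single word! Scaling dense models this way quickly becomes prohibitively expensive.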
The Math Behind the Scaling Wall
Python
def calculate_memory_requirements():
    def format_bytes(num_bytes):
        for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
            if num_bytes < 1024:
                return f"{num_bytes:.2f} {unit}"
            num_bytes /= 1024
        return f"{num_bytes:.2f} PB"

    model_sizes = {
        'GPT-2 Small': 124e6,
        'GPT-3': 175e9,
        'Hypothetical 1T': 1e12,
        'Hypothetical 10T': 10e12
    }
    print("Memory Requirements:")
    for name, params in model_sizes.items():
        memory_fp16 = params * 2  # 2 bytes per parameter
        print(f"{name:<16}: {params/1e9:>8.2f}B params -> {format_bytes(memory_fp16)}")


calculate_memory_requirements()
# Output for 1T: 2e12 bytes (printed as 1.82 TB because the units are powers of 1024)
# just to HOLD the model. That is at least 25 H100 GPUs with 80 GB each!
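The numbers above only cover holding the weights for inference. Training is far hungrier, because every parameter also needs a gradient and optimizer state. The sketch below uses a commonly cited rule of thumb for mixed-precision training with Adam (about 16 bytes per parameter); it ignores activations and framework overhead, so treat the result as a lower bound rather than a precise figure.

```python
def training_memory_bytes(num_params):
    """Rough memory for mixed-precision Adam training, excluding activations."""
    weights_fp16 = 2 * num_params
    grads_fp16 = 2 * num_params
    # fp32 master copy of the weights + Adam momentum + Adam variance
    optimizer_fp32 = (4 + 4 + 4) * num_params
    return weights_fp16 + grads_fp16 + optimizer_fp32


gpt3_params = 175_000_000_000
inference_bytes = 2 * gpt3_params
training_bytes = training_memory_bytes(gpt3_params)

print(f"GPT-3 inference (fp16 weights): {inference_bytes / 1e12:.2f} TB")
print(f"GPT-3 training (weights + grads + Adam): {training_bytes / 1e12:.2f} TB")
print(f"Training needs {training_bytes // inference_bytes}x the memory of inference")
```

This is one reason training large models requires sharding the optimizer state across many GPUs even when the weights alone would fit on far fewer.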
4. Scaling to Trillions: Mixture of Experts (MoE)
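Mixture of Experts (MoE) is the architecture behind models such as Mistral's Mixtral 8x7B, and it is widely reported (though not officially confirmed) to power GPT-4. Instead of building one massive brain where every neuron fires for every word, you replace each feed-forward layer with several smaller networks (Experts). For example, you might have 8 Experts in every layer.
For every token, a Router (Gating Network) scores the Experts and sends the token ONLY to the top few, for example Expert #3 and Expert #7. The other 6 Experts remain completely asleep for that token! It is tempting to picture a "math expert" and a "logic expert", but the specialization is learned during training and is usually much less tidy than that. Because only a fraction of the parameters is active per token, a 1-Trillion parameter model can run with roughly the compute of a much smaller dense model (although all of its parameters must still fit in memory).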
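The saving can be made concrete by counting parameters. The sketch below assumes each Expert is a two-layer feed-forward network with biases and that the Router is a single linear layer; the sizes (`d_model=256`, `d_ff=1024`, 8 Experts, top 2 active) are small illustrative values, not those of any production model.

```python
def moe_param_counts(d_model, d_ff, num_experts, top_k):
    """Return (total, active-per-token) parameter counts for one MoE layer."""
    # Expert: Linear(d_model -> d_ff) + Linear(d_ff -> d_model), both with biases
    per_expert = 2 * d_model * d_ff + d_ff + d_model
    # Router: Linear(d_model -> num_experts) with bias
    router = d_model * num_experts + num_experts
    total = num_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active


total, active = moe_param_counts(d_model=256, d_ff=1024, num_experts=8, top_k=2)
print(f"Total parameters in the layer: {total:,}")
print(f"Parameters used per token:     {active:,}")
print(f"Active fraction:               {active / total:.1%}")
```

With 2 of 8 Experts active, each token touches about a quarter of the layer's parameters. The memory footprint, however, is set by the total count, because any Expert might be needed for the next token.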
Complete MoE Implementation
Python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoERouter(nn.Module):
    def __init__(self, d_model=256, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)
        self.noise_std = 0.1

    def forward(self, x):
        logits = self.gate(x)
        # Add noise during training only (self.training is toggled by model.train()/model.eval()).
        # The noise encourages the router to try different experts, which helps load balancing.
        if self.training and self.noise_std > 0:
            logits = logits + torch.randn_like(logits) * self.noise_std
        routing_probabilities = F.softmax(logits, dim=-1)
        topk_probs, topk_indices = torch.topk(routing_probabilities, self.top_k, dim=-1)
        # Renormalize so the weights of the chosen experts sum to 1
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        return topk_probs, topk_indices


class MoEExpert(nn.Module):
    def __init__(self, d_model=256, d_ff=1024):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.activation = nn.GELU()

    def forward(self, x):
        return self.fc2(self.activation(self.fc1(x)))


class MoELayer(nn.Module):
    def __init__(self, d_model=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = MoERouter(d_model, num_experts, top_k)
        self.experts = nn.ModuleList([MoEExpert(d_model) for _ in range(num_experts)])

    def forward(self, x):
        batch_size, seq_len, d_model = x.shape
        routing_probs, routing_indices = self.router(x)
        output = torch.zeros_like(x)
        # Token-by-token loops are slow but easy to read; real systems batch tokens per expert
        for b in range(batch_size):
            for t in range(seq_len):
                token = x[b, t]
                token_output = torch.zeros(d_model, device=x.device)
                for k in range(self.top_k):
                    expert_idx = routing_indices[b, t, k].item()
                    prob = routing_probs[b, t, k]
                    expert_output = self.experts[expert_idx](token.unsqueeze(0))
                    token_output += prob * expert_output.squeeze(0)
                output[b, t] = token_output
        return output
5. Advanced MoE Features: Load Balancing and Expert Capacity
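In production, if most tokens are routed to just one expert, the GPU hosting it will be overloaded (or run out of memory) while the others sit idle. A common remedy is an expert capacity: a cap on how many tokens each expert may process at once. Tokens beyond the cap are skipped by that expert; in a full Transformer they still flow onward through the residual connection.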
Python
class AdvancedMoELayer(MoELayer):
    # Reuses the __init__ of MoELayer above (self.router, self.experts, self.top_k)
    def forward(self, x):
        batch_size, seq_len, d_model = x.shape
        flat_x = x.reshape(-1, d_model)  # Flatten for efficient batch routing
        routing_probs, routing_indices = self.router(flat_x)
        output = torch.zeros_like(flat_x)
        num_experts = len(self.experts)
        # Compute the absolute maximum number of tokens an expert is allowed to handle.
        # Every token is sent to top_k experts, so the average load is top_k * tokens / num_experts;
        # the factor 1.25 leaves 25% headroom above that average.
        capacity = int(1.25 * self.top_k * flat_x.size(0) / num_experts)
        for expert_idx in range(num_experts):
            # Find tokens routed to this expert
            chosen = routing_indices == expert_idx  # [Tokens, top_k]
            token_ids = torch.where(chosen.any(dim=-1))[0]
            # CRITICAL: Limit to capacity to prevent VRAM Out-of-Memory!
            if len(token_ids) > capacity:
                token_ids = token_ids[:capacity]
            if len(token_ids) > 0:
                expert_input = flat_x[token_ids]
                expert_output = self.experts[expert_idx](expert_input)
                # Apply the routing weight of the slot in which this expert was chosen
                weights = (routing_probs[token_ids] * chosen[token_ids]).sum(dim=-1)
                output[token_ids] += expert_output * weights.unsqueeze(-1)
        return output.view(batch_size, seq_len, d_model)
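Capping capacity treats the symptom. To address the cause, MoE models typically add an auxiliary load-balancing loss that nudges the Router toward spreading tokens evenly. The sketch below follows the form used in the Switch Transformer (number of experts times the sum of "fraction of tokens sent to expert i" times "average router probability for expert i"). It uses top-1 routing for simplicity, and the router logits are hand-made examples.

```python
import torch
import torch.nn.functional as F


def load_balancing_loss(router_logits, num_experts):
    """Auxiliary loss: num_experts * sum_i f_i * P_i (about 1.0 when perfectly balanced)."""
    probs = F.softmax(router_logits, dim=-1)              # [Tokens, Experts]
    top1 = probs.argmax(dim=-1)                           # Expert picked for each token
    f = F.one_hot(top1, num_experts).float().mean(dim=0)  # Fraction of tokens per expert
    p = probs.mean(dim=0)                                 # Mean router probability per expert
    return num_experts * (f * p).sum()


num_experts, num_tokens = 4, 8

# Balanced routing: token i strongly prefers expert (i mod 4)
balanced_logits = torch.full((num_tokens, num_experts), -10.0)
balanced_logits[torch.arange(num_tokens), torch.arange(num_tokens) % num_experts] = 10.0

# Collapsed routing: every token strongly prefers expert 0
collapsed_logits = torch.full((num_tokens, num_experts), -10.0)
collapsed_logits[:, 0] = 10.0

balanced_loss = load_balancing_loss(balanced_logits, num_experts)
collapsed_loss = load_balancing_loss(collapsed_logits, num_experts)
print(f"Balanced: {balanced_loss.item():.3f}, Collapsed: {collapsed_loss.item():.3f}")
```

During training this value is multiplied by a small coefficient and added to the main loss. The collapsed router is penalized roughly `num_experts` times more than the balanced one, so gradient descent has a reason to even out the load.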
6. Practical Deployment Considerations
When should you use Dense Models vs. Mixture of Experts?
| Scenario | Recommendation | Why? |
|---|---|---|
| < 10 Billion Parameters | Dense Model | Simpler architecture, easy to train, often fits on a single GPU. |
| > 50 Billion Parameters | MoE Model | Saves massive compute. If 80% of the expert parameters are inactive for each token, the expert layers need roughly 5x less compute per token. |
| Ultra-Low Latency | Dense Model | MoE routing introduces a slight overhead. For real-time, dense is predictable. |
Infrastructure Requirements for MoE
Because Experts often sit on different physical GPUs (this is called expert parallelism), you need very fast interconnects (like NVIDIA NVLink within a server) to pass token data between GPUs during the routing phase. Without this, the model will bottleneck at the network transfer level!
Chapter 5: Multimodal Futures (Audio, Video, Robotics)
We have covered how AI processes Images (Chapter 1) and Text (Chapters 2 and 4). But the true future of Artificial General Intelligence (AGI) lies in Multimodality. An AI must be able to hear audio, watch video, and physically interact with the real world using a robotic body. Remarkably, the industry discovered that we don't need entirely new mathematics for this. We can use the exact same Transformer architecture we've been using all along, simply by changing how we tokenize the data.
1. Audio (Whisper): Turning Sound into Images
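Raw audio waves are awkward to pass directly into a neural network: a single second contains thousands of samples (16,000 at a 16 kHz sampling rate), and the information we care about is spread across many frequencies. To solve this, models like OpenAI's Whisper use a brilliant trick: they convert the audio into an image-like representation.
Using Fourier Transforms, a 30-second audio clip is plotted on a 2D graph called a Mel Spectrogram (where the X-axis is time, the Y-axis is frequency, and the color is loudness). Whisper passes this spectrogram through two small 1D convolution layers and then into a standard Transformer encoder, while a Transformer decoder writes out the text. Other models, such as the Audio Spectrogram Transformer (AST), go one step further and literally treat the spectrogram like a picture: they slice it into 16x16 patches and feed them into a Vision Transformer (ViT).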
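To see the "sound becomes a picture" step in action, here is a minimal sketch using `torch.stft` on a synthetic 440 Hz tone. It produces a plain (linear-frequency) spectrogram; a Mel Spectrogram applies one extra step, a Mel filterbank that regroups the frequency rows, which is omitted here. The window of 400 samples and hop of 160 samples correspond to 25 ms windows every 10 ms at 16 kHz.

```python
import math
import torch

sample_rate = 16000
t = torch.arange(sample_rate) / sample_rate       # 1 second of time stamps
waveform = torch.sin(2 * math.pi * 440.0 * t)     # A pure 440 Hz tone

n_fft, hop_length = 400, 160
stft = torch.stft(
    waveform,
    n_fft=n_fft,
    hop_length=hop_length,
    window=torch.hann_window(n_fft),
    return_complex=True,
)
spectrogram = stft.abs() ** 2                     # Power per [frequency bin, time frame]

# Each row is a frequency bin that is (sample_rate / n_fft) = 40 Hz wide
peak_bin = spectrogram.mean(dim=1).argmax().item()
peak_hz = peak_bin * sample_rate / n_fft

print(f"Spectrogram shape [frequency, time]: {tuple(spectrogram.shape)}")
print(f"Loudest frequency: {peak_hz:.0f} Hz")
```

The 16,000 raw samples have turned into a small 2D grid in which the tone shows up as one bright horizontal line. That grid is what the neural network actually consumes.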
2. Video (Sora): Spatiotemporal 3D Patches
A video is simply a stack of images over time. In Chapter 1, we learned that a Vision Transformer slices a single 2D image into a flat grid of patches. But how does OpenAI's Sora process moving video?
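According to OpenAI's technical report, Sora is built on the Spatiotemporal (spacetime) Patch, an idea also used by earlier video Transformers such as ViViT. Instead of slicing in 2D (Width and Height), the model slices in 3D (Width, Height, and Time). A single patch might contain a 16x16 pixel square that spans across 4 consecutive frames of video. This allows the AI to learn both Space (what an object looks like) and Time (how that object moves from frame to frame). Sora cuts its patches from a compressed latent version of the video rather than from raw pixels, but the principle is the same.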
Applied Project: Extracting 3D Video Patches
In PyTorch, extracting 2D patches for an image is done with nn.Conv2d. To extract 3D Spatiotemporal patches for video, we simply upgrade to nn.Conv3d!
Python
import torch
import torch.nn as nn

class SpatiotemporalPatchEmbed(nn.Module):
    def __init__(self, patch_time=2, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # The magic of Sora: A 3D Convolution!
        # It slides a 3D box (Time, Height, Width) across the video tensor.
        self.proj = nn.Conv3d(
            in_channels=in_channels,
            out_channels=embed_dim,
            kernel_size=(patch_time, patch_size, patch_size),
            stride=(patch_time, patch_size, patch_size)
        )

    def forward(self, video):
        # video shape: [Batch, Channels, Frames, Height, Width]
        # e.g., A 10-frame video, 224x224 pixels: [1, 3, 10, 224, 224]
        # Extract the 3D Patches
        patches = self.proj(video)
        # Flatten the patches into a 1D sequence so a Transformer can read them
        # Shape goes from [Batch, EmbedDim, TimePatches, HPatches, WPatches]
        # to [Batch, EmbedDim, Total_Patches]
        patches = patches.flatten(2)
        # Transpose to [Batch, Sequence_Length, EmbedDim] (Standard Transformer input)
        return patches.transpose(1, 2)

# Test the Video Patcher!
patcher = SpatiotemporalPatchEmbed()
# Dummy Video: 10 frames of 224x224 RGB video
dummy_video = torch.randn(1, 3, 10, 224, 224)
transformer_sequence = patcher(dummy_video)
print(f"Video sequence shape for Transformer: {transformer_sequence.shape}")
print("(The Transformer will now process 980 sequential patches of spacetime!)")
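Why does a convolution count as "slicing into patches"? Because when the stride equals the kernel size, the 3D box never overlaps itself, so each output position sees exactly one patch. The sketch below checks this by cutting the patches out by hand, flattening each one, and multiplying by the same weights. The tensor sizes are small, hypothetical values chosen so the check runs instantly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small, hypothetical sizes so the check runs instantly
patch_time, patch_size, channels, embed_dim = 2, 16, 3, 32
video = torch.randn(1, channels, 4, 32, 32)  # [Batch, Channels, Frames, Height, Width]

conv = nn.Conv3d(
    channels, embed_dim,
    kernel_size=(patch_time, patch_size, patch_size),
    stride=(patch_time, patch_size, patch_size),
)

# Route 1: the strided 3D convolution
conv_tokens = conv(video).flatten(2).transpose(1, 2)

# Route 2: cut the video into non-overlapping 3D blocks by hand
B, C, T, H, W = video.shape
n_t, n_h, n_w = T // patch_time, H // patch_size, W // patch_size
blocks = video.reshape(B, C, n_t, patch_time, n_h, patch_size, n_w, patch_size)
# Bring the patch-grid axes to the front and the within-patch axes to the back
blocks = blocks.permute(0, 2, 4, 6, 1, 3, 5, 7)
flat_patches = blocks.reshape(B, n_t * n_h * n_w, C * patch_time * patch_size * patch_size)

# One linear projection per patch, reusing the convolution's weights and bias
manual_tokens = flat_patches @ conv.weight.reshape(embed_dim, -1).T + conv.bias

print(f"Conv3d tokens: {tuple(conv_tokens.shape)}")
print(f"Manual tokens: {tuple(manual_tokens.shape)}")
print(f"Same result: {torch.allclose(conv_tokens, manual_tokens, atol=1e-4)}")
```

The same bookkeeping gives the sequence length for any video: (Frames / patch_time) x (Height / patch_size) x (Width / patch_size). For the 10-frame, 224x224 video above, that is 5 x 14 x 14 = 980 patches.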
3. Robotics: Vision-Language-Action Models (VLAs)
If we want an AI to physically pick up an apple, we need a robotic arm. Historically, robotics relied on hand-written control code: inverse kinematics, motion planners, and carefully tuned physics models for each task. But with the arrival of VLA Models (like Google DeepMind's RT-2), researchers showed that a Language Model can do much of this work.
How do you make a ChatGPT-style model move a robot? You teach it to write actions as tokens. Conceptually, you expand its dictionary: instead of just knowing words like "apple", the model also knows physical tokens such as [MOVE_X_10], [MOVE_Y_20], or [GRIPPER_CLOSE] (illustrative names). RT-2 does this by splitting each dimension of the arm's movement into 256 discrete bins and writing every bin as a token from the model's existing vocabulary. The AI looks at an image, reads the user's prompt ("Pick up the apple"), and instead of generating text, it generates a sequence of motor tokens that are decoded back into numbers and sent to the robot's controller.
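One way to turn continuous motor commands into vocabulary entries is uniform binning: chop each action dimension's range into equal slots and use the slot index as the token. The sketch below assumes 256 bins per dimension (the figure reported for RT-2); the function names, the value ranges, and the four-number action layout are hypothetical choices for illustration only.

```python
import torch

NUM_BINS = 256  # assumed number of discrete slots per action dimension

def action_to_tokens(action, low, high, num_bins=NUM_BINS):
    """Map continuous values in [low, high] to integer bin indices 0..num_bins-1."""
    normalized = (action - low) / (high - low)           # scale to 0..1
    tokens = (normalized * num_bins).long()              # pick the slot
    return torch.clamp(tokens, 0, num_bins - 1)          # keep the top edge in range

def tokens_to_action(tokens, low, high, num_bins=NUM_BINS):
    """Map bin indices back to the value at the center of each bin."""
    centers = (tokens.float() + 0.5) / num_bins
    return low + centers * (high - low)

# Hypothetical action: X, Y, Z velocity in [-1, 1] and gripper closure in [0, 1]
low = torch.tensor([-1.0, -1.0, -1.0, 0.0])
high = torch.tensor([1.0, 1.0, 1.0, 1.0])
action = torch.tensor([0.10, -0.25, 0.00, 1.00])

tokens = action_to_tokens(action, low, high)
recovered = tokens_to_action(tokens, low, high)

print(f"Original action:  {action.tolist()}")
print(f"Action tokens:    {tokens.tolist()}")
print(f"Recovered action: {recovered.tolist()}")
```

The round trip is not exact, because every value inside a bin decodes to the same number, but the error is at most one bin width. With 256 bins over a range of 2, that is under 0.008, which is fine-grained enough for many manipulation tasks. In a full model, these indices would additionally be mapped onto specific entries in the language model's vocabulary.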
Applied Project: The Robotic Output Head
Here is a conceptual look at a simpler alternative to motor tokens: a Transformer that outputs physical actions as continuous numbers rather than text. We simply swap the final layer so that it predicts velocities and a gripper state instead of words.
Python
import torch
import torch.nn as nn

class VLARoboticHead(nn.Module):
    def __init__(self, d_model=1024):
        super().__init__()
        # Instead of mapping to 50,000 words in a dictionary...
        # We map the Transformer's memory directly to 4 physical robot actions:
        # 1. X Velocity, 2. Y Velocity, 3. Z Velocity, 4. Gripper Open/Close probability
        self.action_head = nn.Sequential(
            nn.Linear(d_model, 256),
            nn.ReLU(),
            nn.Linear(256, 4)
        )

    def forward(self, transformer_memory):
        # Extract the representation of the very last token in the sequence
        last_token_memory = transformer_memory[:, -1, :]
        # Predict the motor commands!
        actions = self.action_head(last_token_memory)
        # The first 3 outputs are continuous X,Y,Z velocities.
        velocities = actions[:, :3]
        # The 4th output is passed through Sigmoid to get a Gripper probability (0=Open, 1=Close)
        gripper = torch.sigmoid(actions[:, 3:])
        return velocities, gripper

# Testing the Robotic Head
robot_head = VLARoboticHead()
# Dummy Memory (e.g., the AI has looked at the apple and read the prompt)
memory = torch.randn(1, 50, 1024)
vel, grip = robot_head(memory)
print(f"Robot Arm Velocity (X, Y, Z): {vel.tolist()[0]}")
print(f"Gripper Close Probability: {grip.item():.2f}")
Conclusion
You have reached the end of Secrets of AI Models. We have traveled from the humble math of a Convolutional Filter, to the time-traveling gates of an LSTM, to the thermodynamic miracles of Diffusion, and finally to Spatiotemporal Transformers and physical robots driven by Vision-Language-Action models. You now possess the mathematical intuition and the code required to build the future. The secrets are yours. Now, go build.