Neural Networks & Deep Learning
Chapter 18: Generative Models (GANs, VAEs, and Diffusion)
Teaching Machines to Create, Not Just Classify: From Adversarial Games to Denoising Dreams
Reading Time: ~4 hours | Unit 6: Modern Deep Learning | Theory + Code + Ethics Chapter
Prerequisites: Chapter 16 (GANs & VAEs Intro), Chapter 12 (CNNs), Probability & KL Divergence
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| Remember | State the GAN minimax objective, VAE ELBO, reparameterization trick, DDPM forward/reverse equations, and key architecture names (DCGAN, WGAN, StyleGAN) |
| Understand | Explain why GANs frame generation as a game, why VAEs optimize a lower bound, why diffusion adds noise gradually, and why mode collapse occurs |
| Apply | Implement a vanilla GAN, DCGAN, and simple diffusion model from scratch on MNIST; use PyTorch for all three |
| Analyze | Derive the optimal discriminator D*, trace how Wasserstein distance fixes vanishing gradients, analyze the β-VAE disentanglement trade-off |
| Evaluate | Compare GANs vs VAEs vs Diffusion on sample quality, diversity, training stability, and compute cost; assess deepfake risks in Indian elections |
| Create | Design a diffusion-based product photography pipeline for Indian e-commerce; build a deepfake detection prototype |
Learning Objectives
By the end of this chapter, you will be able to:
- Distinguish discriminative models P(y|x) from generative models P(x), and explain why learning to generate data is harder than learning to classify it
- Derive the GAN minimax objective from first principles, prove the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x)), and show the connection to Jensen-Shannon divergence
- Implement a vanilla GAN, DCGAN, and WGAN from scratch in NumPy and PyTorch on MNIST
- Diagnose GAN training pathologies (mode collapse, vanishing gradients, oscillation) and apply fixes (label smoothing, spectral normalization, progressive growing)
- Explain the VAE ELBO decomposition: log P(x) ≥ 𝔼[log P(x|z)] − KL(q(z|x) ∥ p(z)), and implement the reparameterization trick
- Derive the forward and reverse processes of DDPM, explain the noise schedule, and implement a simple diffusion model
- Compare GANs, VAEs, and Diffusion models on five axes: sample quality, diversity, training stability, latent space structure, and compute requirements
- Analyze real-world systems: Meesho's diffusion-based product photography and Midjourney/DALL-E text-to-image generation
- Evaluate ethical implications of generative AI, including deepfakes, misinformation, and IP concerns
- Solve GATE-style problems on GAN objectives, VAE loss components, and diffusion mathematics
Opening Hook
The Night GANs Were Born
In June 2014, Ian Goodfellow was at a bar in Montreal with friends, debating how to make neural networks generate images. The prevailing approach, Boltzmann machines, was painfully slow, requiring complex Markov chain Monte Carlo sampling. Someone suggested using neural networks to generate images directly, but how would you train them without a clear loss function?
Then Goodfellow had his insight: don't define an explicit loss; let two neural networks fight each other. One network (the Generator) creates fake images. The other (the Discriminator) tries to tell real from fake. They compete, and in this adversarial game, the Generator learns to produce images so realistic that even the Discriminator can't tell the difference.
Goodfellow went home that night and coded the entire thing. It worked on the first try. Within a few hours, his laptop was generating recognizable handwritten digits from pure noise. The paper, submitted later that year to NeurIPS, became one of the most cited in deep learning history.
But here's the twist: GANs were just the beginning. In the decade since, we've seen Variational Autoencoders learn smooth latent spaces, and Diffusion Models, inspired by thermodynamics, overtake everything else in image quality. Today, a student in Jaipur can type "a tiger wearing a kurta in a Rajasthani palace" and a diffusion model will paint it in seconds. You're about to learn exactly how all three families of generative models work, from the math to the code.
The Intuition First: Three Roads to Creation
The Art Forgery Analogy (GANs)
Imagine a forger (Generator) trying to create fake Picasso paintings, and an art critic (Discriminator) trying to spot the fakes. Initially, the forger is terrible, scribbling stick figures, and the critic catches every fake easily. But as the forger gets feedback ("your brushstrokes are too uniform, your color palette is wrong"), they improve. Meanwhile, the critic must also improve, because the fakes are getting better.
This arms race continues until the forger creates paintings so perfect that even expert critics flip a coin: 50% chance any painting is real or fake. At that point, the Generator has learned the true distribution of Picasso paintings.
"Aha" question: What if the forger only learns to copy one perfect painting and shows it every time? The critic can't tell it from a real Picasso, but you've lost all diversity. This is mode collapse, and it's the central challenge of GAN training.
The Postal Code Analogy (VAEs)
Think of a VAE like a postal system for images. The encoder takes an image and compresses it into a "postal code": a small vector in latent space. The decoder takes any postal code and reconstructs the image. The key insight: VAEs don't just learn one postal code per image; they learn a region (a Gaussian cloud). Nearby postal codes decode to similar images. You can generate new images by sampling random postal codes and decoding them.
The Thermodynamics Analogy (Diffusion)
Drop a single drop of ink into a glass of water. Over time, the ink molecules spread out until they're uniformly distributed; this is the forward process (adding noise). Now imagine you could reverse time and watch the uniform ink reconcentrate into a single drop. Diffusion models learn exactly this: they learn to reverse the noise process, turning pure static back into a coherent image, step by step.
The "it worked first try" story is real: Goodfellow has confirmed it in multiple interviews. But he also says the first version was "very simple", just fully connected layers on MNIST. It took the community years to scale GANs to high-resolution photorealistic images (ProGAN, 2017; StyleGAN, 2018).
18.1 Discriminative vs. Generative Models
Before diving into specific architectures, you need to understand a fundamental split in machine learning.
Core Distinction
Discriminative Model: Learns the conditional distribution P(y|x): given an input x (image), predict a label y (cat/dog). Examples: Logistic Regression, CNNs for classification, Transformers for NER. These models draw decision boundaries.
Generative Model: Learns the joint distribution P(x, y) or just P(x), the full data distribution. Once you know P(x), you can sample from it to generate new data points. Examples: GANs, VAEs, Diffusion Models, GPT (autoregressive P(x₁, x₂, ..., xₙ)).
Why Generative is Harder: A discriminative model just needs to learn a boundary between classes. A generative model must understand the entire structure of the data: every texture, every edge, every statistical regularity. It's the difference between "tell me if this is a face" (easy) vs "draw me a face" (hard).
Mathematical Formulation
By Bayes' theorem: P(y|x) = P(x|y)P(y) / P(x). A generative model that learns P(x|y) and P(y) can therefore also compute P(y|x), because P(x) = Σ_y P(x|y)P(y). In that sense generative models are more general, but the generality comes at a cost.
| Property | Discriminative | Generative |
|---|---|---|
| Learns | P(y|x) | P(x) or P(x,y) |
| Can classify? | Yes, directly | Yes, via Bayes' rule |
| Can generate? | No | Yes |
| Data efficiency | Better (simpler task) | Worse (harder task) |
| Examples | SVM, CNN, BERT | GAN, VAE, GPT, Diffusion |
Q: What does a generative model learn?
A: The data distribution P(x), or the joint P(x, y) when labels are available. With P(x) it can generate new samples; with the joint it can also classify via Bayes' rule. Discriminative models only classify.
Key formula: P(y|x) = P(x|y)P(y) / P(x); a generative model learns the terms on the right-hand side.
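To make the Bayes' rule route concrete, here is a minimal sketch of classifying with a generative model. The two classes, their priors, and the 1D Gaussian class-conditionals are hypothetical values chosen only for illustration; they are not from the chapter.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical generative model: class priors P(y) and class-conditionals P(x|y).
priors = {"cat": 0.5, "dog": 0.5}
conditionals = {"cat": (0.0, 1.0), "dog": (3.0, 1.0)}  # (mean, std) per class

def posterior(x):
    """P(y|x) = P(x|y) P(y) / P(x), where P(x) = sum over y of P(x|y) P(y)."""
    joint = {y: gaussian_pdf(x, *conditionals[y]) * priors[y] for y in priors}
    evidence = sum(joint.values())  # this is P(x)
    return {y: joint[y] / evidence for y in joint}

print(posterior(0.0))   # a point near the "cat" mean
print(posterior(1.5))   # the midpoint between the two class means
```

The same learned pieces, P(x|y) and P(y), could also be used to sample new x values, which a purely discriminative model cannot do.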
18.2 GANs: The Minimax Game
The GAN Framework
A GAN consists of two neural networks trained simultaneously:
- Generator G(z; θ_g): Takes random noise z ~ P_z(z) (usually N(0, I)) and maps it to a fake sample G(z)
- Discriminator D(x; θ_d): Takes a sample x (real or fake) and outputs D(x) ∈ [0, 1], the probability that x is real
The Minimax Objective: Derivation from First Principles
Step 1: What does the Discriminator want?
D wants to maximize its accuracy. For real samples x ~ p_data, it wants D(x) → 1. For fake samples G(z), it wants D(G(z)) → 0.
This is just binary cross-entropy! D maximizes:
V(D) = 𝔼_{x~p_data}[log D(x)] + 𝔼_{z~p_z}[log(1 − D(G(z)))]
Step 2: What does the Generator want?
G wants to fool D. It wants D(G(z)) → 1 (discriminator thinks fake is real). So G minimizes the same objective V. Only the second term depends on G:
G minimizes: 𝔼_{z~p_z}[log(1 − D(G(z)))]
Step 3: The combined minimax game:
min_G max_D V(D, G) = 𝔼_{x~p_data}[log D(x)] + 𝔼_{z~p_z}[log(1 − D(G(z)))]
Step 4: Deriving the Optimal Discriminator D*
For fixed G, we maximize V with respect to D. Rewrite V as an integral:
V = ∫ [p_data(x) log D(x) + p_g(x) log(1 − D(x))] dx
For each x, this is of the form f(D) = a·log(D) + b·log(1−D) where a = p_data(x), b = p_g(x).
Take derivative and set to zero:
f'(D) = a/D − b/(1−D) = 0
a(1−D) = bD ⟹ a − aD = bD ⟹ a = D(a+b)
D*(x) = p_data(x) / (p_data(x) + p_g(x))
This is beautiful! The optimal discriminator is simply the ratio of real data density to total density. When p_g = p_data (Generator perfectly matches data), D*(x) = 1/2 everywhere, so the discriminator is reduced to random guessing.
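As a quick numerical sanity check of the derivation above, the sketch below scans a grid of candidate D values at one point x and confirms that the maximizer of f(D) = a·log(D) + b·log(1−D) sits at a/(a+b). The densities a = 0.8 and b = 0.2 are arbitrary illustrative values.

```python
import math

def pointwise_objective(d, a, b):
    """f(D) = a*log(D) + b*log(1 - D): the integrand of V at a single point x."""
    return a * math.log(d) + b * math.log(1.0 - d)

def optimal_d(a, b):
    """Closed-form maximizer derived above: D* = a / (a + b)."""
    return a / (a + b)

a, b = 0.8, 0.2                                # hypothetical p_data(x), p_g(x)
grid = [i / 1000 for i in range(1, 1000)]      # candidate D values in (0, 1)
best = max(grid, key=lambda d: pointwise_objective(d, a, b))

print("grid search maximizer:", best)
print("closed form D*:       ", optimal_d(a, b))
```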
Connection to Jensen-Shannon Divergence
Step 5: Substituting D* back into V
With D = D*, let's compute V(G, D*):
V(G, D*) = 𝔼_{x~p_data}[log (p_data/(p_data + p_g))] + 𝔼_{x~p_g}[log (p_g/(p_data + p_g))]
Let m = (p_data + p_g)/2. Then:
= 𝔼_{p_data}[log(p_data/(2m))] + 𝔼_{p_g}[log(p_g/(2m))]
= 𝔼_{p_data}[log(p_data/m)] + 𝔼_{p_g}[log(p_g/m)] − 2 log 2
= KL(p_data ∥ m) + KL(p_g ∥ m) − 2 log 2
= 2 · JSD(p_data ∥ p_g) − 2 log 2
where JSD is the Jensen-Shannon Divergence! So the GAN minimax game, at optimality, minimizes the JSD between p_data and p_g. The global minimum of −2 log 2 is reached when p_g = p_data.
V(G, D*) = 2 · JSD(p_data ∥ p_g) − 2 · log(2)
JSD(P ∥ Q) = ½ KL(P ∥ M) + ½ KL(Q ∥ M), where M = (P+Q)/2
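The identity V(G, D*) = 2·JSD − 2·log 2 can be checked numerically on small discrete distributions. The sketch below is illustrative only; the probability vectors are made-up examples.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """V(G, D*) with D* = p_data / (p_data + p_g), summed over discrete x."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total

p_data = [0.1, 0.4, 0.5]     # hypothetical data distribution
p_g = [0.3, 0.3, 0.4]        # hypothetical generator distribution
print(value_at_optimal_d(p_data, p_g), 2 * jsd(p_data, p_g) - 2 * math.log(2))
```

Setting p_g equal to p_data drives the value to −2·log 2, while fully disjoint distributions give JSD = log 2 and a value of 0.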
Practical Training: Alternating Gradient Descent
In practice, you don't solve the minimax analytically. Instead, you alternate:
- Train D for k steps: Sample a minibatch of real data x, sample noise z, compute loss = −[log D(x) + log(1−D(G(z)))], and update θ_d by gradient descent on this loss (equivalently, gradient ascent on V)
- Train G for 1 step: Sample noise z, compute loss = log(1−D(G(z))), update θ_g via gradient descent
The Non-Saturating Loss Trick: In practice, minimizing log(1−D(G(z))) gives weak gradients when D is confident (G(z) is clearly fake). Instead, G maximizes log D(G(z)). Same fixed point, but much stronger gradients early in training. This is what most practical GAN implementations use.
"Generative Adversarial Nets" (Goodfellow et al., 2014): The original paper. Theorem 1 shows that the global minimum of the training criterion is attained if and only if p_g = p_data, and Proposition 2 shows that p_g converges to p_data given sufficient model capacity. But the convergence argument assumes the discriminator reaches its optimum at each step, which never holds in practice. This gap between theory and practice drove a decade of research.
Read: arxiv.org/abs/1406.2661
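The non-saturating loss trick described above is easiest to see at the level of the discriminator's logit. The sketch below assumes D(G(z)) = sigmoid(a) for a scalar logit a (an assumption made for illustration) and compares the gradient each generator loss sends back through that logit.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), which equals -sigmoid(a)."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), which equals -(1 - sigmoid(a))."""
    return -(1.0 - sigmoid(a))

# A confident discriminator early in training: D(G(z)) is close to 0.
for logit in (-5.0, -2.0, 0.0):
    print(f"logit={logit:+.1f}  D(G(z))={sigmoid(logit):.4f}  "
          f"saturating={grad_saturating(logit):+.4f}  "
          f"non-saturating={grad_non_saturating(logit):+.4f}")
```

When the discriminator is confident that a sample is fake, the saturating gradient shrinks toward zero while the non-saturating gradient stays close to −1, so the generator keeps receiving a usable signal.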
18.3 GAN Variants: DCGAN, WGAN, and StyleGAN
DCGAN: Deep Convolutional GAN (Radford et al., 2015)
The original GAN used fully connected layers. DCGAN established the architectural guidelines that made convolutional GANs work:
DCGAN Architecture Rules
- Replace pooling with strided convolutions: Discriminator uses strided conv (downsampling), Generator uses transposed conv (upsampling)
- Use BatchNorm everywhere, except in G's output layer and D's input layer
- No fully connected hidden layers: the paper reports that global average pooling in D helped stability but slowed convergence, and instead flattens the last convolutional features into a single sigmoid output
- ReLU in Generator (except output: Tanh), LeakyReLU in Discriminator
- Adam optimizer with lr=0.0002, β₁=0.5
WGAN: Wasserstein GAN (Arjovsky et al., 2017)
The key insight of WGAN: Jensen-Shannon Divergence is a terrible training signal when the generator and data distributions don't overlap (which is almost always true early in training, since images live on a low-dimensional manifold in pixel space).
Why JSD Fails
When p_data and p_g have disjoint supports (don't overlap), JSD = log(2) regardless of how "close" the distributions are. The discriminator achieves perfect accuracy instantly, and gradients for G vanish. This is the vanishing gradient problem in GANs.
The Wasserstein Distance (Earth Mover's Distance)
W(p_data, p_g) = inf_{γ ∈ Π(p_data, p_g)} 𝔼_{(x,y)~γ}[‖x − y‖]
= minimum cost to "transport" p_data into p_g
The beauty of W: it's continuous and differentiable (almost everywhere) even when distributions don't overlap. Think of it as: "how much earth do you need to move, and how far?" Even when two piles of dirt don't touch, you can always measure the distance.
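A small discrete experiment illustrates the contrast. Below, two point masses on a 1D grid are moved apart: the JSD stays stuck at log 2 as soon as they stop overlapping, while the Wasserstein-1 distance grows with the gap. The grid, the helper names, and the CDF-based formula for 1D Wasserstein-1 are choices made for this sketch.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q, spacing=1.0):
    """1D Wasserstein-1 on a shared grid: sum of |CDF_p - CDF_q| times bin spacing."""
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap) * spacing
    return total

def point_mass(index, size):
    """All probability on one grid cell."""
    return [1.0 if i == index else 0.0 for i in range(size)]

real = point_mass(0, 8)
for shift in (1, 3, 5):
    fake = point_mass(shift, 8)
    print(f"shift={shift}  JSD={jsd(real, fake):.4f}  W1={wasserstein_1d(real, fake):.1f}")
```

Because W keeps changing as the fake distribution moves closer, its gradient tells the generator which direction to move, which is the property the WGAN critic exploits.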
WGAN Training Changes
| Vanilla GAN | WGAN |
|---|---|
| Discriminator outputs probability | Critic outputs unbounded score (no sigmoid) |
| Binary cross-entropy loss | Wasserstein loss: 𝔼[D(x_real)] − 𝔼[D(x_fake)] |
| Train D for 1 step per G step | Train Critic for 5 steps per G step |
| No weight constraint | Weight clipping: w ← clip(w, −0.01, 0.01) |
WGAN-GP (Gulrajani et al., 2017) replaced weight clipping with a gradient penalty: penalize the critic when ‖∇_x̂ D(x̂)‖ ≠ 1, where x̂ is a random interpolation between real and fake. This enforces the Lipschitz constraint more elegantly and is widely used in practice.
StyleGAN: Style-Based Generator Architecture (Karras et al., 2019)
StyleGAN revolutionized GANs by separating what is generated from how it's styled:
- Mapping Network: z → w (8-layer MLP, maps noise to "style" space W)
- Synthesis Network: Generates image progressively (4×4 → 8×8 → ... → 1024×1024)
- AdaIN (Adaptive Instance Normalization): Style vector w controls normalization at each layer
- Noise injection: Stochastic details (hair strands, pores) via per-pixel noise
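To show what "style controls normalization" means, here is a minimal sketch of AdaIN on a single channel. It assumes the style vector w has already been mapped (by a learned affine layer that is not shown) to a per-channel scale and bias; the function name, the flattened list representation, and the numbers are illustrative assumptions.

```python
import math

def adain(features, style_scale, style_bias, eps=1e-8):
    """AdaIN for one channel: normalize to zero mean and unit std, then apply the style.

    features: activations of one channel of one image, flattened to a list
    style_scale, style_bias: per-channel values assumed to come from the style vector w
    """
    n = len(features)
    mean = sum(features) / n
    var = sum((f - mean) ** 2 for f in features) / n
    std = math.sqrt(var + eps)
    return [style_scale * (f - mean) / std + style_bias for f in features]

channel = [1.0, 2.0, 3.0, 4.0, 10.0]             # hypothetical feature map values
styled = adain(channel, style_scale=2.0, style_bias=-1.0)
print(styled)
```

After AdaIN, the channel's statistics are set by the style (mean equal to the bias, standard deviation equal to the scale) regardless of what they were before, which is how w steers the look of each layer.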
thispersondoesnotexist.com, a website that generates a new photorealistic face every time you refresh, uses StyleGAN2. The faces are 1024×1024, often hard to tell apart from real photos, and no two are alike. These people literally do not exist.
18.4 Mode Collapse and GAN Training Tricks
What is Mode Collapse?
Imagine p_data is a mixture of 10 Gaussians (like MNIST's 10 digit classes). Mode collapse occurs when the Generator learns to produce only 1-2 of these modes, ignoring the rest. The Discriminator catches on ("you're only generating 7s!"), but the Generator responds by switching to another mode: "fine, now I'll only generate 3s."
Causes and Fixes
| Problem | Cause | Fix |
|---|---|---|
| Mode collapse | G finds "safe" mode that always fools D | Minibatch discrimination, unrolled GANs, diversity regularization |
| Vanishing gradients | D becomes too strong, so log(1−D(G(z))) saturates | Non-saturating loss, WGAN, label smoothing |
| Training oscillation | D and G alternate domination | Two-time-scale update rule (TTUR), spectral normalization |
| Gradient explosion | Unstable dynamics | Gradient clipping, spectral normalization |
Practical GAN Training Checklist
GAN Training Stability Checklist
- Use WGAN-GP or spectral normalization instead of vanilla GAN
- Use non-saturating loss for Generator: −log D(G(z))
- Label smoothing: real labels = 0.9 instead of 1.0
- Train D more steps than G (typically 5:1 for WGAN)
- Use Adam with lr=0.0002, β₁=0.5, β₂=0.999
- Normalize inputs to [−1, 1]; use Tanh in G's output
- Monitor both D and G losses; neither should go to zero
- Use FID score (Fréchet Inception Distance) for evaluation
MYTH: "The Generator loss should decrease over training."
TRUTH: In a healthy GAN, both G and D losses oscillate around equilibrium. If G loss drops to zero, D has collapsed. If D loss drops to zero, G isn't learning. You want both losses to hover, indicating an ongoing "game."
WHY IT MATTERS: Students often debug GANs by looking for decreasing loss curves like in supervised learning. This leads to premature stopping or incorrect hyperparameter tuning.
A student trains a GAN on MNIST and notices the Discriminator loss quickly drops to 0 while the Generator loss climbs to infinity. The generated images are pure noise. What went wrong? How would you fix it?
for epoch in range(100):
    # Train D
    real = next(dataloader)
    fake = G(torch.randn(64, 100))
    d_loss = -torch.mean(torch.log(D(real)) + torch.log(1 - D(fake)))
    d_optimizer.step()
    # Train G (same batch!)
    g_loss = torch.mean(torch.log(1 - D(G(torch.randn(64, 100)))))
    g_optimizer.step()
Bugs found: (1) Neither .zero_grad() nor .backward() is ever called, so the optimizers step without fresh gradients; once .backward() is added, omitting .zero_grad() would make gradients accumulate across steps. (2) G uses the saturating loss log(1−D(G(z))), which gives near-zero gradients when D is confident; use -torch.mean(torch.log(D(G(z)))) instead. (3) D's learning rate may be too high relative to G; try separate learning rates or a TTUR schedule.
18.5 Variational Autoencoders (VAEs) and β-VAE
From Autoencoders to VAEs
You already know autoencoders (Ch 12): encoder compresses x → z, decoder reconstructs z → x̂. But regular autoencoders learn a deterministic mapping to a messy, disconnected latent space. You can't sample from it meaningfully.
VAEs fix this by making the encoding probabilistic: instead of z = f(x), the encoder outputs parameters of a distribution: μ(x), σ(x). Then z is sampled from N(μ, σ²). A KL divergence term pushes this distribution toward the standard normal N(0, I), ensuring the latent space is smooth and continuous.
Deriving the ELBO from First Principles
Goal: We want to maximize log P(x), the log-likelihood of the data.
Problem: P(x) = ∫ P(x|z)P(z) dz, and this integral is intractable for complex decoders.
Solution: Introduce a tractable approximation q(z|x) ≈ P(z|x) and derive a lower bound.
Step 1: Start with log P(x) and use Jensen's inequality:
log P(x) = log ∫ P(x, z) dz = log ∫ q(z|x) · [P(x, z) / q(z|x)] dz
≥ ∫ q(z|x) · log[P(x, z) / q(z|x)] dz   (Jensen's inequality; log is concave)
= 𝔼_{q(z|x)}[log P(x, z) − log q(z|x)]
Step 2: Expand P(x, z) = P(x|z) · P(z):
= 𝔼_{q(z|x)}[log P(x|z)] + 𝔼_{q(z|x)}[log P(z) − log q(z|x)]
= 𝔼_{q(z|x)}[log P(x|z)] − KL(q(z|x) ∥ P(z))
Step 3: This lower bound is the ELBO (Evidence Lower BOund). Training minimizes its negative:
ℒ = −𝔼_{z~q(z|x)}[log P(x|z)] + KL(q(z|x) ∥ P(z))
= Reconstruction Loss + KL Divergence Regularizer
First term: "how well can you reconstruct?" Second term: "how close is your encoding to N(0,I)?"
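The bound can be checked on a toy model where everything is available in closed form. The sketch below assumes a 1D model chosen purely for illustration: prior z ~ N(0, 1), likelihood x|z ~ N(z, 1), and a Gaussian encoder q(z|x) = N(m, s2). For this model the exact marginal is x ~ N(0, 2) and the true posterior is N(x/2, 1/2).

```python
import math

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def elbo(x, m, s2):
    """ELBO = E_q[log P(x|z)] - KL(q(z|x) || P(z)) for the toy model described above."""
    # E_q[(x - z)^2] = (x - m)^2 + s2 when z ~ N(m, s2)
    recon = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl = 0.5 * (m ** 2 + s2 - 1.0 - math.log(s2))   # KL(N(m, s2) || N(0, 1))
    return recon - kl

x = 1.0
log_px = log_normal_pdf(x, 0.0, 2.0)      # exact log-evidence
tight = elbo(x, m=x / 2.0, s2=0.5)        # encoder equals the true posterior
loose = elbo(x, m=0.0, s2=1.0)            # encoder ignores x and returns the prior

print("log P(x):          ", log_px)
print("ELBO, q = posterior:", tight)
print("ELBO, q = prior:    ", loose)
```

The bound is tight exactly when q(z|x) matches the true posterior, and any other encoder gives a strictly smaller value; that gap is what training the encoder tries to close.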
The Reparameterization Trick
Problem: z ~ N(μ, σ²) is a stochastic node. You can't backpropagate through random sampling!
Solution: Rewrite z = μ + ε · σ, where ε ~ N(0, 1). Now the randomness (ε) is external to the computation graph, and gradients flow through μ and σ normally.
z = μ(x) + σ(x) ⊙ ε, ε ~ N(0, I)
∂z/∂μ = 1, ∂z/∂σ = ε, so gradients exist!
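A minimal sketch of the trick in plain Python follows. The encoder outputs (mu and log_var) are hypothetical numbers, and the finite-difference check at the end stands in for what an autograd framework would compute.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1); all randomness lives in eps."""
    sigma = math.exp(0.5 * log_var)
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps, eps

rng = random.Random(0)
mu, log_var = 1.5, math.log(0.25)        # hypothetical encoder outputs (sigma = 0.5)

samples = [reparameterize(mu, log_var, rng)[0] for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print("sample mean:", sample_mean, " sample std:", sample_std)

# With eps held fixed, z is an ordinary differentiable function of mu and sigma.
eps, sigma, h = 0.7, 0.5, 1e-6
dz_dmu = ((mu + h + sigma * eps) - (mu + sigma * eps)) / h          # approximately 1
dz_dsigma = ((mu + (sigma + h) * eps) - (mu + sigma * eps)) / h     # approximately eps
print("dz/dmu:", dz_dmu, " dz/dsigma:", dz_dsigma)
```

Encoders commonly output log σ² rather than σ, as assumed here, because it keeps the variance positive without any constraint.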
KL Divergence โ Closed Form
When q(z|x) = N(μ, σ²) and P(z) = N(0, 1), the KL divergence has a beautiful closed form:
KL(q(z|x) ∥ P(z)) = −½ Σ_{j=1}^{d} (1 + log σ_j² − μ_j² − σ_j²)
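The closed form translates directly into code. This is a minimal sketch that takes the encoder's mean and log-variance as plain lists; the function name and the sample inputs are illustrative.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)) = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))                  # encoder already matches the prior
print(kl_to_standard_normal([1.0], [0.0]))                            # mean shifted by one unit
print(kl_to_standard_normal([0.5, -1.0, 2.0], [-0.5, 0.0, 0.5]))      # a 3D latent code
```

The penalty is zero only when every dimension has μ_j = 0 and σ_j² = 1, and it grows as the mean moves away from 0 or the variance moves away from 1.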
β-VAE: Controlling Disentanglement
Higgins et al. (2017) introduced β-VAE by simply scaling the KL term:
ℒ_{β-VAE} = Reconstruction Loss + β · KL(q(z|x) ∥ P(z))
- β = 1: Standard VAE
- β > 1: Stronger regularization, giving more disentangled latent factors (each dimension captures one independent feature: rotation, scale, color), but blurrier reconstructions
- β < 1: Better reconstruction, but messier latent space
MYTH: "VAEs generate blurry images because they're bad models."
TRUTH: Blurriness comes from the pixel-wise MSE reconstruction loss: it averages over all possible reconstructions, creating blur. Use perceptual loss (comparing CNN features instead of pixels) or adversarial loss (VAE-GAN) for sharper results.
WHY IT MATTERS: Understanding why VAEs are blurry tells you that it's a loss function choice, not an architectural flaw. The framework is sound.
ML Research Scientist, Generative Models, at Adobe, NVIDIA, Google DeepMind. Roles focus on improving VAE/GAN/Diffusion architectures, typically requiring PhD + published papers. Salary range: ₹40-80 LPA (India) / $200-400K (US). Key skills: probabilistic modeling, PyTorch, distributed training.
18.6 Diffusion Models: DDPM and the Denoising Revolution
The Core Idea
Diffusion models draw inspiration from non-equilibrium thermodynamics. The idea is stunningly simple:
- Forward process (fixed, no learning): Gradually add Gaussian noise to an image over T steps until it becomes pure noise
- Reverse process (learned): Train a neural network to reverse each step, denoising slightly at each step, until you recover a clean image from pure noise
Forward Process (Adding Noise)
At each timestep t = 1, 2, ..., T:
q(x_t | x_{t-1}) = N(x_t; √(1−β_t) · x_{t-1}, β_t · I)
where β_t is a small noise variance (noise schedule, typically β₁ = 10⁻⁴ to β_T = 0.02).
Key mathematical trick: You can jump directly from x₀ to any x_t without computing all intermediate steps!
Define ᾱ_t = Π_{s=1}^{t} (1 − β_s). Then:
q(x_t | x₀) = N(x_t; √ᾱ_t · x₀, (1 − ᾱ_t) · I)
Equivalently: x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε, ε ~ N(0, I)
This means: at any timestep t, the noisy image is just a weighted sum of the original image and random noise. As t → T, ᾱ_t → 0, and x_t approaches pure noise.
x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε, ε ~ N(0, I)
where ᾱ_t = Π_{s=1}^{t} (1 − β_s) = cumulative signal retention
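The one-shot jump from x₀ to x_t can be sketched in a few lines. The code below assumes a linear schedule interpolated between the typical endpoints quoted above (10⁻⁴ and 0.02) with T = 1000; the function name q_sample and the two-element "image" are illustrative choices.

```python
import math
import random

T = 1000
# Assumed linear schedule between the typical endpoints 1e-4 and 0.02.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

alpha_bars = []            # alpha_bars[t - 1] holds the cumulative product for step t
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) directly; t is 1-indexed. Returns (x_t, eps)."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
x0 = [1.0, -1.0]           # a toy two-pixel "image"
for t in (1, 250, 1000):
    x_t, _ = q_sample(x0, t, rng)
    print(t, round(alpha_bars[t - 1], 6), x_t)
```

Returning eps alongside x_t is convenient because the same noise sample serves as the regression target when training a noise-prediction network.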
Reverse Process (Learning to Denoise)
The reverse process aims to undo each noise step:
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² · I)
We train a neural network ε_θ(x_t, t) to predict the noise ε that was added. Once we know the noise, we can compute the denoised image.
DDPM Training Objective
The loss is beautifully simple:
For each training step:
- Sample a clean image x₀ from the training set
- Sample a random timestep t ~ Uniform{1, ..., T}
- Sample noise ε ~ N(0, I)
- Create noisy image: x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε
- Feed x_t and t to the neural network, get prediction ε_θ(x_t, t)
- Loss = ‖ε − ε_θ(x_t, t)‖², which is just MSE between true and predicted noise!
ℒ = 𝔼_{t, x₀, ε}[‖ε − ε_θ(√ᾱ_t · x₀ + √(1−ᾱ_t) · ε, t)‖²]
"Predict the noise that was added at timestep t"
Sampling (Generating Images)
To generate an image from scratch:
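- Start with pure noise: x_T ~ N(0, I)
- For t = T, T−1, ..., 1: use the trained ε_θ to predict the noise and step back: x_{t−1} = (1/√α_t) · (x_t − (β_t/√(1−ᾱ_t)) · ε_θ(x_t, t)) + σ_t · z, where α_t = 1 − β_t, z ~ N(0, I) for t > 1, and z = 0 at t = 1
- The final x₀ is your generated image!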
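The sampling loop above can be run end to end without training anything if the data distribution is simple enough. The sketch below assumes a hypothetical 1D "dataset" x₀ ~ N(2, 0.5²), for which the best possible noise predictor 𝔼[ε | x_t] has a closed form, and uses it as a stand-in for a trained ε_θ. The update is the standard DDPM step from Ho et al. (2020) with σ_t² = β_t; the linear schedule is an assumption.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]   # assumed linear schedule
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

DATA_MEAN, DATA_STD = 2.0, 0.5     # hypothetical data distribution: x0 ~ N(2, 0.5^2)

def eps_oracle(x_t, t):
    """Stand-in for a trained eps_theta: the exact E[eps | x_t] for Gaussian data."""
    a_bar = alpha_bars[t - 1]
    var_xt = a_bar * DATA_STD ** 2 + (1.0 - a_bar)
    return math.sqrt(1.0 - a_bar) * (x_t - math.sqrt(a_bar) * DATA_MEAN) / var_xt

def sample(rng):
    """Ancestral sampling: start from noise and apply the reverse step T times."""
    x = rng.gauss(0.0, 1.0)                                   # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        beta, alpha, a_bar = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        mean = (x - beta / math.sqrt(1.0 - a_bar) * eps_oracle(x, t)) / math.sqrt(alpha)
        noise = rng.gauss(0.0, 1.0) if t > 1 else 0.0         # no noise on the final step
        x = mean + math.sqrt(beta) * noise                    # sigma_t^2 = beta_t
    return x

rng = random.Random(0)
samples = [sample(rng) for _ in range(300)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print("generated mean:", sample_mean, " generated std:", sample_std)
```

In a real model, eps_oracle would be replaced by a neural network (typically a U-Net) trained with the noise-prediction loss; the loop itself stays the same.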
MYTH: "Diffusion models are just fancy autoencoders."
TRUTH: Autoencoders learn to compress and reconstruct. Diffusion models learn to reverse a stochastic process. There's no encoder at inference time; you start from pure noise. The training uses a fixed, non-learned forward process with a learned reverse. The mathematical framework is closer to score matching and stochastic differential equations than to compression.
WHY IT MATTERS: Understanding this distinction helps explain why diffusion models tend to cover the data distribution better than GANs: they are trained with a likelihood-based objective over the whole data distribution and sample through many stochastic steps, rather than through an adversarial game that can reward a generator for covering only a few modes.
"Denoising Diffusion Probabilistic Models" (Ho, Jain, Abbeel, 2020): The paper that showed diffusion models can match GAN quality. Key insight: the simplified loss (just predicting noise) works as well as the full variational bound. Building on Sohl-Dickstein et al. (2015) and Song & Ermon (2019).
"Denoising Diffusion Implicit Models" (Song et al., 2021): DDIM, a deterministic version that needs far fewer steps (50 vs 1000) for generation.
18.7 Stable Diffusion and Text-to-Image
Latent Diffusion Models (LDM)
Running diffusion directly on 512×512×3 images is computationally expensive. Latent Diffusion (Rombach et al., 2022) solves this by running the diffusion process in a compressed latent space:
Three Components of Stable Diffusion
- VAE (Autoencoder): Compresses images from pixel space (512×512×3) to latent space (64×64×4) and back. Trained separately.
- U-Net Denoiser: The diffusion model itself, which predicts noise in latent space. Conditioned on timestep t and text embedding via cross-attention layers.
- Text Encoder (CLIP): Converts text prompts to embeddings that guide the U-Net's denoising process.
Classifier-Free Guidance
To make outputs follow text prompts more closely, Stable Diffusion uses classifier-free guidance:
ε̃ = ε_θ(x_t, ∅) + w · (ε_θ(x_t, c) − ε_θ(x_t, ∅))
where c is the text condition, ∅ is the unconditional embedding, and w is the guidance scale (typically 7.5). Higher w = more adherent to prompt but less diverse.
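The guidance formula is a simple extrapolation between two noise predictions. In the sketch below, the two lists stand in for the U-Net's outputs with and without the text condition; the function name and the numbers are illustrative.

```python
def classifier_free_guidance(eps_uncond, eps_cond, w):
    """eps_guided = eps_uncond + w * (eps_cond - eps_uncond), applied elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.20, 0.05]   # hypothetical prediction with the empty prompt
eps_cond = [0.30, 0.40, 0.05]      # hypothetical prediction with the text prompt

for w in (0.0, 1.0, 7.5):
    print(w, classifier_free_guidance(eps_uncond, eps_cond, w))
```

With w = 0 the prompt is ignored, with w = 1 the result equals the conditional prediction, and with w > 1 the result is pushed past the conditional prediction in the direction the prompt suggests. Each guided step therefore needs two forward passes of the denoiser.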
Q: Why does Stable Diffusion run diffusion in latent space instead of pixel space?
A: Computational efficiency. A 512×512×3 image has 786,432 dimensions. The VAE compresses this to 64×64×4 = 16,384 dimensions, a 48× reduction. Diffusion in this space is much faster, enabling generation on consumer GPUs.
18.8 GANs vs. VAEs vs. Diffusion: The Complete Comparison
| Axis | GAN | VAE | Diffusion |
|---|---|---|---|
| Sample Quality | ⭐⭐⭐⭐ (sharp) | ⭐⭐ (blurry) | ⭐⭐⭐⭐⭐ (best) |
| Diversity | ⭐⭐ (mode collapse risk) | ⭐⭐⭐⭐ (good coverage) | ⭐⭐⭐⭐⭐ (full coverage) |
| Training Stability | ⭐⭐ (hard to tune) | ⭐⭐⭐⭐ (stable) | ⭐⭐⭐⭐⭐ (very stable) |
| Latent Space | No meaningful latent | ⭐⭐⭐⭐⭐ (smooth, interpretable) | ⭐⭐⭐ (via guidance) |
| Inference Speed | ⭐⭐⭐⭐⭐ (one forward pass) | ⭐⭐⭐⭐⭐ (one forward pass) | ⭐ (50-1000 steps!) |
| Compute Cost | ⭐⭐⭐⭐ (moderate) | ⭐⭐⭐⭐ (moderate) | ⭐⭐ (expensive) |
| Likelihood | No explicit P(x) | Lower bound (ELBO) | Via variational bound |
| Math Foundation | Game theory, JSD | Variational inference, KL | Thermodynamics, SDE |
| Killer App | Face generation, style transfer | Representation learning, anomaly detection | Text-to-image, video generation |
When to use what? (1) Need interpretable latent space? Use a VAE. (2) Need fast generation? Use a GAN. (3) Need best quality regardless of speed? Use Diffusion. (4) Need text-conditioned generation? Use Diffusion (Stable Diffusion). (5) Need anomaly detection? Use a VAE (high reconstruction error = anomaly).
India: Current Landscape
- Meesho: Diffusion for product photography
- Flipkart: GAN-based virtual try-on
- ISRO: Super-resolution satellite imagery via diffusion
- IIT Bombay/Madras: VAE research for Indic script generation
- Startup ecosystem: 50+ GenAI startups (Krutrim, Sarvam AI)
- Key challenge: Compute access, data for Indian contexts
USA: Current Landscape
- OpenAI: DALL-E 3 (diffusion + CLIP)
- Midjourney: Proprietary diffusion model
- NVIDIA: StyleGAN series, GauGAN
- Google: Imagen, Gemini image gen
- Stability AI: Open-source Stable Diffusion
- Key challenge: Copyright lawsuits, ethical governance
Worked Examples
Worked Example 1: Computing Optimal Discriminator (By Hand)
Problem
Suppose our data lives in a 1D space. The real data distribution is p_data(x) = 2x for x ∈ [0, 1] (a triangle distribution). The current Generator produces p_g(x) = 1 for x ∈ [0, 1] (uniform). Find the optimal discriminator D*(x).
Solution
Using our derived formula:
D*(x) = p_data(x) / (p_data(x) + p_g(x)) = 2x / (2x + 1)
Let's check a few values:
- At x = 0: D*(0) = 0/(0+1) = 0. The discriminator knows real data never appears at x=0 (p_data(0) = 0)
- At x = 0.5: D*(0.5) = 1/(1+1) = 0.5. At x=0.5, both distributions have equal density
- At x = 1: D*(1) = 2/(2+1) = 2/3. Real data is twice as likely as fake at x=1
Interpretation: D*(x) is high where real data is dense relative to fake data. It equals 1/2 where both distributions have equal density. This is exactly what we'd expect from a perfect classifier!
Worked Example 2: VAE KL Divergence (Indian Industry Context)
Meesho Product Encoding
Meesho trains a VAE on product images. For a specific saree image, the encoder outputs: μ = [0.5, −1.0, 2.0], log σ² = [−0.5, 0.0, 0.5]. Compute the KL divergence.
Solution
KL = −½ Σ (1 + log σ² − μ² − σ²)
For each dimension j:
- j=1: −½(1 + (−0.5) − 0.25 − e^{−0.5}) = −½(1 − 0.5 − 0.25 − 0.607) = −½(−0.357) ≈ 0.178
- j=2: −½(1 + 0 − 1.0 − e^0) = −½(1 − 1 − 1) = −½(−1) = 0.500
- j=3: −½(1 + 0.5 − 4.0 − e^{0.5}) = −½(1.5 − 4 − 1.649) = −½(−4.149) ≈ 2.074
Total KL ≈ 0.178 + 0.500 + 2.074 ≈ 2.75
Interpretation: Dimension 3 contributes the most KL: its mean (2.0) is far from 0, and its variance (e^0.5 ≈ 1.65) is above 1. The KL penalty will push this encoding toward N(0,1), encouraging the latent space to stay organized.
Worked Example 3: DDPM Noise Scheduling (US/Global Context)
DALL-E Style Diffusion
A diffusion model uses T=1000 steps with linear noise schedule: β_t = 0.0001 + (0.02 − 0.0001) × t/1000. Compute ᾱ_t for t = 1, 500, and 1000.
Solution
ᾱ_t = Π_{s=1}^{t} (1 − β_s) = Π_{s=1}^{t} α_s
For this schedule, β₁ ≈ 0.00012, β₅₀₀ ≈ 0.01, β₁₀₀₀ = 0.02.
Since there are many steps, we use the log: log ᾱ_t = Σ log(1 − β_s) ≈ −Σ β_s (for small β).
- t=1: ᾱ₁ = 1 − β₁ ≈ 0.9999. Almost no noise; the image is nearly clean
- t=500: Σ_{s=1}^{500} β_s ≈ 2.54, so ᾱ₅₀₀ ≈ exp(−2.54) ≈ 0.079 (the exact product is about 0.078). The image is mostly noise
- t=1000: Σ_{s=1}^{1000} β_s ≈ 10.06, so ᾱ₁₀₀₀ ≈ exp(−10.06) ≈ 4.3 × 10⁻⁵ (the exact product is about 4.0 × 10⁻⁵). Virtually pure noise
Interpretation: The signal (√ᾱ_t) decreases from ~1.0 → ~0.28 → ~0.006 while noise (√(1−ᾱ_t)) increases from ~0.01 → ~0.96 → ~1.0. At t=500, the image is almost unrecognizable. At t=1000, it's pure Gaussian noise.
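The hand calculation can be cross-checked by multiplying out the product exactly. The sketch below uses the schedule stated in the problem and prints, for each requested timestep, the exact ᾱ_t, the first-order estimate exp(−Σ β_s), and the resulting signal and noise scales.

```python
import math

T = 1000

def beta(t):
    """Linear schedule from the problem statement; t runs from 1 to T."""
    return 0.0001 + (0.02 - 0.0001) * t / T

def alpha_bar(t):
    """Exact cumulative product of (1 - beta_s) for s = 1..t."""
    prod = 1.0
    for s in range(1, t + 1):
        prod *= 1.0 - beta(s)
    return prod

for t in (1, 500, 1000):
    exact = alpha_bar(t)
    first_order = math.exp(-sum(beta(s) for s in range(1, t + 1)))
    print(f"t={t:4d}  alpha_bar={exact:.6g}  exp(-sum beta)={first_order:.6g}  "
          f"signal={math.sqrt(exact):.4f}  noise={math.sqrt(1.0 - exact):.4f}")
```

The first-order estimate slightly overstates ᾱ_t because log(1 − β) < −β; the gap is negligible for small t and grows to a few percent by t = 1000.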
From-Scratch Code: NumPy Implementations
1. Vanilla GAN from Scratch (NumPy)
Python / NumPyimport numpy as np
# โโโ Utility Functions โโโ
def sigmoid(x):
return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
def sigmoid_deriv(x):
s = sigmoid(x)
return s * (1 - s)
def relu(x):
return np.maximum(0, x)
def relu_deriv(x):
return (x > 0).astype(np.float64)
def leaky_relu(x, alpha=0.2):
return np.where(x > 0, x, alpha * x)
def leaky_relu_deriv(x, alpha=0.2):
return np.where(x > 0, 1, alpha)
# โโโ Generator: z(100) โ hidden(256) โ output(784) โโโ
np.random.seed(42)
z_dim = 100
h_dim = 256
img_dim = 784 # 28ร28 for MNIST
lr = 0.0002
# Generator weights (Xavier init)
W_g1 = np.random.randn(z_dim, h_dim) * np.sqrt(2/z_dim)
b_g1 = np.zeros((1, h_dim))
W_g2 = np.random.randn(h_dim, img_dim) * np.sqrt(2/h_dim)
b_g2 = np.zeros((1, img_dim))
# Discriminator weights
W_d1 = np.random.randn(img_dim, h_dim) * np.sqrt(2/img_dim)
b_d1 = np.zeros((1, h_dim))
W_d2 = np.random.randn(h_dim, 1) * np.sqrt(2/h_dim)
b_d2 = np.zeros((1, 1))
def generator_forward(z):
"""z โ ReLU(zW1+b1) โ Sigmoid(hW2+b2) โ fake image"""
h = z @ W_g1 + b_g1 # (batch, 256)
h_act = relu(h) # ReLU activation
out = h_act @ W_g2 + b_g2 # (batch, 784)
img = sigmoid(out) # Sigmoid โ [0,1] pixel values
return z, h, h_act, out, img
def discriminator_forward(x):
"""x โ LeakyReLU(xW1+b1) โ Sigmoid(hW2+b2) โ probability"""
h = x @ W_d1 + b_d1 # (batch, 256)
h_act = leaky_relu(h) # LeakyReLU
out = h_act @ W_d2 + b_d2 # (batch, 1)
prob = sigmoid(out) # probability real/fake
return x, h, h_act, out, prob
def train_step(real_batch, batch_size=64):
    global W_g1, b_g1, W_g2, b_g2, W_d1, b_d1, W_d2, b_d2

    # -- Step 1: Train Discriminator --
    z = np.random.randn(batch_size, z_dim)
    _, g_h, g_h_act, g_out, fake = generator_forward(z)
    # D on real data (want D(x) close to 1)
    _, d_h_r, d_ha_r, d_out_r, d_prob_r = discriminator_forward(real_batch)
    # D on fake data (want D(G(z)) close to 0)
    _, d_h_f, d_ha_f, d_out_f, d_prob_f = discriminator_forward(fake)

    # Binary cross-entropy gradients for D
    # Loss_D = -[log(D(real)) + log(1-D(fake))]
    d_loss = -np.mean(np.log(d_prob_r + 1e-8) + np.log(1 - d_prob_f + 1e-8))

    # Backprop through D (real path)
    dL_dout_r = -(1 / (d_prob_r + 1e-8)) * sigmoid_deriv(d_out_r) / batch_size
    dW_d2_r = d_ha_r.T @ dL_dout_r
    db_d2_r = np.sum(dL_dout_r, axis=0, keepdims=True)
    dL_dha_r = dL_dout_r @ W_d2.T
    dL_dh_r = dL_dha_r * leaky_relu_deriv(d_h_r)
    dW_d1_r = real_batch.T @ dL_dh_r
    db_d1_r = np.sum(dL_dh_r, axis=0, keepdims=True)

    # Backprop through D (fake path)
    dL_dout_f = (1 / (1 - d_prob_f + 1e-8)) * sigmoid_deriv(d_out_f) / batch_size
    dW_d2_f = d_ha_f.T @ dL_dout_f
    db_d2_f = np.sum(dL_dout_f, axis=0, keepdims=True)
    dL_dha_f = dL_dout_f @ W_d2.T
    dL_dh_f = dL_dha_f * leaky_relu_deriv(d_h_f)
    dW_d1_f = fake.T @ dL_dh_f
    db_d1_f = np.sum(dL_dh_f, axis=0, keepdims=True)

    # Update D
    W_d2 -= lr * (dW_d2_r + dW_d2_f)
    b_d2 -= lr * (db_d2_r + db_d2_f)
    W_d1 -= lr * (dW_d1_r + dW_d1_f)
    b_d1 -= lr * (db_d1_r + db_d1_f)

    # -- Step 2: Train Generator --
    # Non-saturating loss: G maximizes log(D(G(z)))
    z = np.random.randn(batch_size, z_dim)
    _, g_h, g_h_act, g_out, fake = generator_forward(z)
    _, d_h_f, d_ha_f, d_out_f, d_prob_f = discriminator_forward(fake)
    g_loss = -np.mean(np.log(d_prob_f + 1e-8))

    # Backprop through D (frozen: its weights are not updated here) then through G
    dL_dout = -(1 / (d_prob_f + 1e-8)) * sigmoid_deriv(d_out_f) / batch_size
    dL_dha = dL_dout @ W_d2.T
    dL_dh = dL_dha * leaky_relu_deriv(d_h_f)
    dL_dfake = dL_dh @ W_d1.T      # gradient at fake image

    # Continue through G
    dL_gout = dL_dfake * sigmoid_deriv(g_out)
    dW_g2 = g_h_act.T @ dL_gout
    db_g2 = np.sum(dL_gout, axis=0, keepdims=True)
    dL_ghact = dL_gout @ W_g2.T
    dL_gh = dL_ghact * relu_deriv(g_h)
    dW_g1 = z.T @ dL_gh
    db_g1 = np.sum(dL_gh, axis=0, keepdims=True)

    # Update G
    W_g2 -= lr * dW_g2
    b_g2 -= lr * db_g2
    W_g1 -= lr * dW_g1
    b_g1 -= lr * db_g1
    return d_loss, g_loss
# Training loop (with simulated MNIST data); each iteration is one batch update
print("Training Vanilla GAN from scratch...")
for step in range(100):
    # Simulate a batch of "real" data (replace with actual MNIST)
    real = np.random.rand(64, img_dim) * 0.5 + 0.25
    d_loss, g_loss = train_step(real)
    if step % 20 == 0:
        print(f"Step {step}: D_loss={d_loss:.4f}, G_loss={g_loss:.4f}")
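The backprop lines in the listing above multiply 1/D (or 1/(1−D)) by the sigmoid derivative. Assuming `sigmoid_deriv(a)` returns σ(a)(1 − σ(a)) (its definition is not shown in this part of the listing), those products simplify to σ(a) − 1 on the real path and σ(a) on the fake path, which avoids dividing by a probability close to zero. The sketch below checks that simplification with the standard library only; the function names are illustrative and not part of the chapter's code.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_real_long(a):
    # d/da of -log(sigmoid(a)), written as in the listing: -(1/p) * p * (1 - p)
    p = sigmoid(a)
    return -(1.0 / p) * p * (1.0 - p)

def grad_real_short(a):
    # Simplified form of the same gradient
    return sigmoid(a) - 1.0

def grad_fake_long(a):
    # d/da of -log(1 - sigmoid(a)), written as in the listing: (1/(1-p)) * p * (1 - p)
    p = sigmoid(a)
    return (1.0 / (1.0 - p)) * p * (1.0 - p)

def grad_fake_short(a):
    # Simplified form of the same gradient
    return sigmoid(a)

def numeric_grad(f, a, eps=1e-6):
    # Central finite difference, used as an independent check
    return (f(a + eps) - f(a - eps)) / (2 * eps)

for a in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    fd_real = numeric_grad(lambda v: -math.log(sigmoid(v)), a)
    fd_fake = numeric_grad(lambda v: -math.log(1.0 - sigmoid(v)), a)
    print(f"a={a:+.1f}  real: {grad_real_short(a):+.6f} (fd {fd_real:+.6f})  "
          f"fake: {grad_fake_short(a):+.6f} (fd {fd_fake:+.6f})")
```

Using the short forms in `train_step` would be a drop-in change for `dL_dout_r`, `dL_dout_f`, and `dL_dout`, still divided by `batch_size`.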
2. Simple Diffusion Model from Scratch (NumPy)
Python / NumPy
import numpy as np

# --- DDPM from Scratch: Simplified for 1D data ---
# We'll implement the core math, then show PyTorch version for images
T = 100  # number of diffusion steps (1000 in practice)
beta_start = 1e-4
beta_end = 0.02

# Linear noise schedule
betas = np.linspace(beta_start, beta_end, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # ᾱ_t = cumulative product of α_1..α_t

print("Noise schedule check:")
print(f"  ᾱ_1   = {alpha_bars[0]:.6f} (almost clean)")
print(f"  ᾱ_50  = {alpha_bars[49]:.6f} (partially noisy)")
print(f"  ᾱ_100 = {alpha_bars[99]:.6f} (noise-dominated, but not pure noise at T=100)")

def forward_diffusion(x0, t, noise=None):
    """Add noise to x0 at timestep t: x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε"""
    if noise is None:
        noise = np.random.randn(*x0.shape)
    sqrt_ab = np.sqrt(alpha_bars[t])
    sqrt_1_ab = np.sqrt(1 - alpha_bars[t])
    return sqrt_ab * x0 + sqrt_1_ab * noise, noise
# Simple denoiser: 2-layer MLP that predicts noise
# Input: [x_t, t_embedding], Output: predicted noise
h_dim = 64
input_dim = 2  # 1D data + timestep encoding
W1 = np.random.randn(input_dim, h_dim) * 0.1
b1 = np.zeros(h_dim)
W2 = np.random.randn(h_dim, 1) * 0.1
b2 = np.zeros(1)

def predict_noise(x_t, t):
    """Simple MLP to predict noise ε from (x_t, t)"""
    t_norm = t / T  # normalize timestep to [0, 1]
    inp = np.column_stack([x_t.reshape(-1, 1),
                           np.full((len(x_t), 1), t_norm)])
    h = np.tanh(inp @ W1 + b1)  # hidden layer
    return h @ W2 + b2          # predicted noise
def train_diffusion(data, epochs=1000, lr=0.001):
    """Train denoiser to predict noise at random timesteps"""
    global W1, b1, W2, b2
    for epoch in range(epochs):
        # 1. Sample random data point
        x0 = data[np.random.randint(len(data))]
        # 2. Sample random timestep
        t = np.random.randint(0, T)
        # 3. Add noise (forward process)
        x_t, true_noise = forward_diffusion(np.array([x0]), t)
        # 4. Predict noise
        pred_noise = predict_noise(x_t, t)
        # 5. Loss = MSE(true_noise, pred_noise)
        loss = np.mean((true_noise - pred_noise.flatten()) ** 2)
        # Manual backprop (simplified for 1D)
        # ... gradient computation omitted for brevity ...
        # Note: without it, W1, b1, W2, b2 are never updated, so the loss
        # printed below only fluctuates and does not decrease.
        if epoch % 200 == 0:
            print(f"Epoch {epoch}: loss = {loss:.4f}")

# Generate 1D data: mixture of two Gaussians
data = np.concatenate([
    np.random.randn(500) * 0.5 + 3.0,  # mode 1
    np.random.randn(500) * 0.5 - 3.0,  # mode 2
])
print("Data shape:", data.shape)
print("Training simplified diffusion model...")
train_diffusion(data, epochs=500)
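The listing above leaves out the gradient step. One possible way to fill it in is sketched below, as a self-contained script rather than a patch to the functions above. It keeps the same schedule, toy data, and two-layer tanh MLP, but the mini-batch size, learning rate, step count, and the fixed evaluation batch are assumptions chosen for illustration, and the helper names (`make_batch`, `forward`, `mse`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule with the same values as the listing above
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

# Toy data: mixture of two Gaussians centred at +3 and -3
data = np.concatenate([rng.normal(3.0, 0.5, 500), rng.normal(-3.0, 0.5, 500)])

# Two-layer MLP denoiser: input [x_t, t/T] -> tanh hidden layer -> predicted noise
h_dim = 64
W1 = rng.normal(0.0, 0.1, (2, h_dim))
b1 = np.zeros(h_dim)
W2 = rng.normal(0.0, 0.1, (h_dim, 1))
b2 = np.zeros(1)

def make_batch(n):
    """Sample (x0, t, noise) and return network inputs and regression targets."""
    x0 = rng.choice(data, size=n)
    t = rng.integers(0, T, size=n)
    eps = rng.normal(size=n)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.column_stack([x_t, t / T]), eps.reshape(-1, 1)

def forward(inp):
    """Return hidden activations and predicted noise."""
    h = np.tanh(inp @ W1 + b1)
    return h, h @ W2 + b2

def mse(inp, eps):
    return float(np.mean((forward(inp)[1] - eps) ** 2))

# Fixed batch so the before/after comparison is not blurred by sampling noise
eval_inp, eval_eps = make_batch(2000)
loss_before = mse(eval_inp, eval_eps)

lr = 0.01  # assumed value for plain mini-batch SGD
for step in range(3000):
    inp, eps = make_batch(256)
    h, pred = forward(inp)
    # Backprop of the mean squared error through both layers
    d_pred = 2.0 * (pred - eps) / len(eps)       # dL/dpred, shape (n, 1)
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_pre = (d_pred @ W2.T) * (1.0 - h ** 2)     # tanh'(a) = 1 - tanh(a)^2
    dW1 = inp.T @ d_pre
    db1 = d_pre.sum(axis=0)
    # Gradient descent update
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

loss_after = mse(eval_inp, eval_eps)
print(f"eval MSE before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The loss cannot reach zero here: at small t the added noise is tiny compared with the spread of the data, so ε is only partly predictable from x_t.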
PyTorch Implementations
1. DCGAN on MNIST (PyTorch)
Python / PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# --- Hyperparameters ---
z_dim = 100
img_channels = 1
features_g = 64
features_d = 64
lr = 0.0002
batch_size = 128
epochs = 50

# --- Generator ---
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # z → 7×7×256
            nn.ConvTranspose2d(z_dim, features_g*4, 7, 1, 0),
            nn.BatchNorm2d(features_g*4),
            nn.ReLU(True),
            # 7×7 → 14×14
            nn.ConvTranspose2d(features_g*4, features_g*2, 4, 2, 1),
            nn.BatchNorm2d(features_g*2),
            nn.ReLU(True),
            # 14×14 → 28×28
            nn.ConvTranspose2d(features_g*2, img_channels, 4, 2, 1),
            nn.Tanh(),  # output in [-1, 1]
        )

    def forward(self, z):
        return self.net(z.view(-1, z_dim, 1, 1))

# --- Discriminator ---
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # 28×28 → 14×14
            nn.Conv2d(img_channels, features_d, 4, 2, 1),
            nn.LeakyReLU(0.2, inplace=True),
            # 14×14 → 7×7
            nn.Conv2d(features_d, features_d*2, 4, 2, 1),
            nn.BatchNorm2d(features_d*2),
            nn.LeakyReLU(0.2, inplace=True),
            # 7×7 → 1×1
            nn.Conv2d(features_d*2, 1, 7, 1, 0),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1, 1)

# --- Training Loop ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
G = Generator().to(device)
D = Discriminator().to(device)
criterion = nn.BCELoss()
opt_g = optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
opt_d = optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),  # → [-1, 1]
])
dataset = datasets.MNIST(root="./data", train=True, transform=transform, download=True)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

for epoch in range(epochs):
    for real, _ in loader:
        real = real.to(device)
        bs = real.size(0)

        # -- Train Discriminator --
        z = torch.randn(bs, z_dim).to(device)
        fake = G(z).detach()  # detach: no gradient flows into G during the D step
        d_real = D(real)
        d_fake = D(fake)
        # One-sided label smoothing: real targets are 0.9 instead of 1.0
        loss_d = criterion(d_real, torch.ones_like(d_real) * 0.9) + \
                 criterion(d_fake, torch.zeros_like(d_fake))
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

        # -- Train Generator (non-saturating) --
        z = torch.randn(bs, z_dim).to(device)
        fake = G(z)
        d_fake = D(fake)
        loss_g = criterion(d_fake, torch.ones_like(d_fake))  # fool D
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()

    # Losses of the last batch in the epoch
    print(f"Epoch [{epoch+1}/{epochs}] D_loss: {loss_d:.4f} G_loss: {loss_g:.4f}")
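The spatial sizes noted in the DCGAN comments (1 → 7 → 14 → 28 in the Generator, 28 → 14 → 7 → 1 in the Discriminator) follow from the standard output-size formulas for `nn.Conv2d` and `nn.ConvTranspose2d` with dilation 1 and no output padding. The sketch below reproduces that arithmetic in plain Python; the helper names are illustrative.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a Conv2d layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, padding):
    """Output side length of a ConvTranspose2d layer (dilation 1, output_padding 0)."""
    return (size - 1) * stride - 2 * padding + kernel

# (kernel, stride, padding) for each layer, as in the DCGAN listing above
gen_layers = [(7, 1, 0), (4, 2, 1), (4, 2, 1)]
disc_layers = [(4, 2, 1), (4, 2, 1), (7, 1, 0)]

size = 1  # the latent vector is reshaped to a 1x1 "image"
gen_sizes = [size]
for k, s, p in gen_layers:
    size = conv_transpose_out(size, k, s, p)
    gen_sizes.append(size)

size = 28  # MNIST image side length
disc_sizes = [size]
for k, s, p in disc_layers:
    size = conv_out(size, k, s, p)
    disc_sizes.append(size)

print(gen_sizes)   # [1, 7, 14, 28]
print(disc_sizes)  # [28, 14, 7, 1]
```

Running the same two functions before changing kernel sizes or strides is a quick way to catch shape mismatches without building the model.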
2. Simple DDPM on MNIST (PyTorch)
Python / PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Noise Schedule ---
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def forward_diffusion(x0, t, device):
    """q(x_t | x_0): add noise at timestep t"""
    noise = torch.randn_like(x0)
    # Move the schedule to the target device before indexing: a CPU tensor
    # cannot be indexed with an index tensor that lives on the GPU.
    ab = alpha_bars.to(device)[t].view(-1, 1, 1, 1)
    sqrt_ab = ab.sqrt()
    sqrt_1_ab = (1 - ab).sqrt()
    return sqrt_ab * x0 + sqrt_1_ab * noise, noise

# --- U-Net (simplified) ---
class SimpleUNet(nn.Module):
    """Minimal U-Net for noise prediction on 28×28 images."""
    def __init__(self):
        super().__init__()
        # Time embedding
        self.time_mlp = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 64)
        )
        # Encoder
        self.enc1 = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                                  nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # Bottleneck
        self.bottleneck = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        # Decoder (dec1 takes 128 channels: 64 from dec2 + 64 from the enc2 skip)
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(128, 32, 4, stride=2, padding=1), nn.ReLU())
        self.final = nn.Conv2d(64, 1, 1)  # predict noise (32 from dec1 + 32 from the enc1 skip)

    def forward(self, x, t):
        # Time conditioning
        t_emb = self.time_mlp(t.float().unsqueeze(1) / T)  # (B, 64)
        # Encoder
        e1 = self.enc1(x)   # (B, 32, 28, 28)
        e2 = self.enc2(e1)  # (B, 64, 14, 14)
        # Bottleneck + time embedding
        b = self.bottleneck(e2)  # (B, 128, 7, 7)
        # Tile the 64-dim embedding twice along channels; broadcasting covers the 7×7 grid
        b = b + t_emb.view(-1, 64, 1, 1).repeat(1, 2, 1, 1)
        # Decoder with skip connections
        d2 = self.dec2(b)                 # (B, 64, 14, 14)
        d2 = torch.cat([d2, e2], dim=1)   # skip: (B, 128, 14, 14)
        d1 = self.dec1(d2)                # (B, 32, 28, 28)
        d1 = torch.cat([d1, e1], dim=1)   # skip: (B, 64, 28, 28)
        return self.final(d1)             # (B, 1, 28, 28)

# --- Training ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleUNet().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

# Training loop (using same MNIST loader from above)
for epoch in range(20):
    total_loss = 0
    for images, _ in loader:
        images = images.to(device)
        t = torch.randint(0, T, (images.size(0),)).to(device)
        x_t, noise = forward_diffusion(images, t, device)
        pred_noise = model(x_t, t)
        loss = F.mse_loss(pred_noise, noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}: Loss = {total_loss/len(loader):.4f}")

# --- Sampling (Generate from noise) ---
@torch.no_grad()
def sample(model, n_samples=16):
    """Generate images via reverse diffusion"""
    model.eval()
    x = torch.randn(n_samples, 1, 28, 28).to(device)
    for t in reversed(range(T)):
        t_batch = torch.full((n_samples,), t, device=device)
        pred_noise = model(x, t_batch)
        alpha = alphas[t]
        alpha_bar = alpha_bars[t]
        beta = betas[t]
        # Reverse step: mean of x_{t-1} given x_t and the predicted noise
        x = (1/alpha.sqrt()) * (x - (beta / (1-alpha_bar).sqrt()) * pred_noise)
        if t > 0:
            noise = torch.randn_like(x)
            x += beta.sqrt() * noise  # add stochasticity (no noise at the final step)
    return x.clamp(-1, 1)

samples = sample(model)
print(f"Generated {samples.shape[0]} images of shape {samples.shape[1:]}")
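The reverse step in `sample` above can be read as follows: the predicted noise gives an estimate of x₀, and the update moves x_t to the mean of the forward-process posterior q(x_{t−1} | x_t, x₀) for that estimate. The scalar sketch below checks this identity numerically with the standard library. The schedule matches the listing; the function names and the example values of x₀, ε, and t are illustrative assumptions.

```python
import math

# Same linear schedule as the listing above, built without NumPy or PyTorch
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

def q_sample(x0, eps, t):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps

def predict_x0(x_t, eps, t):
    """Invert the forward process given the noise that was added."""
    return (x_t - math.sqrt(1.0 - alpha_bars[t]) * eps) / math.sqrt(alpha_bars[t])

def reverse_mean(x_t, eps, t):
    """Mean of the reverse step, written as in the sampling loop above."""
    return (x_t - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps) / math.sqrt(alphas[t])

def posterior_mean(x_t, x0, t):
    """Mean of q(x_{t-1} | x_t, x0); requires t >= 1."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_x0 = math.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coef_xt = math.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    return coef_x0 * x0 + coef_xt * x_t

x0, eps, t = 0.8, -1.3, 500
x_t = q_sample(x0, eps, t)
print("recovered x0:", predict_x0(x_t, eps, t))
print("reverse mean:", reverse_mean(x_t, eps, t))
print("posterior mean:", posterior_mean(x_t, x0, t))
```

With a perfect noise prediction the two means agree, so any sampling error comes from the network's noise estimate and from the added stochastic term.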
Visual Diagrams
Diagram 1: The Three Generative Paradigms
Diagram 2: VAE Latent Space
Diagram 3: Diffusion Forward/Reverse Process
Industry Case Study: Meesho AI Product Photography 🇮🇳
🇮🇳 Meesho: Diffusion Models for Small Seller Product Photography
The Problem
Meesho is India's largest social commerce platform, enabling 15+ million small sellers, many of them home-based women entrepreneurs in Tier-2/3 cities, to sell products online. The challenge: most sellers photograph products on bedsheets with phone cameras. Professional product photography costs ₹500-2000 per image, which is prohibitive when selling ₹200 kurtis.
The AI Solution
Meesho built a diffusion-based product photography pipeline with four stages:
- Background removal: Segment the product from the cluttered photo using U-Net segmentation
- Background generation: Use a fine-tuned Stable Diffusion model to generate studio-quality backgrounds conditioned on product category (e.g., "clean white surface with soft shadows for jewelry")
- Image enhancement: Diffusion-based super-resolution to upscale low-quality phone images
- Model photography: Generate virtual models wearing the clothing using ControlNet + DensePose conditioning
Technical Architecture
- Base model: Stable Diffusion XL (SDXL) fine-tuned on 2M Meesho product images
- Conditioning: ControlNet for pose/edge conditioning, IP-Adapter for style transfer
- Inference: SDXL Turbo for 4-step generation (down from 50 steps), essential for serving millions of sellers
- Infrastructure: NVIDIA A100 GPUs on AWS Mumbai region, with distilled models for edge deployment
Impact
- 23% increase in click-through rate for AI-enhanced images
- 15% increase in conversion rate
- ₹0 cost to sellers (free feature, platform investment)
- Democratizes professional photography for millions of women entrepreneurs
India's GenAI Startup Ecosystem: Beyond Meesho, companies like Navi AI (insurance document generation), Rephrase.ai (AI video generation, acquired by Adobe), Krutrim (Ola's multilingual generative AI), and Yellow.ai (conversational AI) are building generative AI products for India-specific use cases. The government's IndiaAI Mission has allocated ₹10,000 crore for AI compute infrastructure.
Industry Case Study: Midjourney & DALL-E 🇺🇸
🇺🇸 Midjourney / OpenAI DALL-E: Text-to-Image at Scale
DALL-E 3 (OpenAI, 2023)
DALL-E 3 represents the cutting edge of text-to-image generation:
- Architecture: The DALL-E 3 research paper describes latent diffusion models conditioned on a T5-XXL text encoder (rather than CLIP text embeddings) for better prompt understanding; the full production architecture is undisclosed
- Key innovation: Trained on synthetic captions. A purpose-built image captioner re-captioned the training images with detailed descriptions, and training on a blend dominated by these synthetic captions dramatically improved prompt adherence
- Safety: Built-in safety classifiers reject violent, sexual, or public-figure-likeness prompts; provenance metadata (C2PA) embedded in generated images
- Integration: Natively integrated into ChatGPT, so users describe images in conversation and DALL-E generates them
Midjourney (2022-present)
Midjourney took a different path: aesthetics first, research papers second.
- Team: ~40 people (tiny compared to OpenAI's thousands), founded by David Holz (ex-Leap Motion)
- Architecture: Proprietary diffusion model (details undisclosed), with emphasis on artistic quality
- Interface: Discord-only at launch, where users type /imagine prompts in a Discord channel
- Revenue: $200M+ ARR with just 40 employees, making it one of the most capital-efficient AI companies
- Quality: Consistently wins blind comparisons for aesthetic quality, especially in artistic styles
Technical Comparison: DALL-E 3 vs Midjourney v6
| Feature | DALL-E 3 | Midjourney v6 |
|---|---|---|
| Prompt adherence | ⭐⭐⭐⭐⭐ (best) | ⭐⭐⭐⭐ |
| Aesthetic quality | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ (best) |
| Text in images | ⭐⭐⭐⭐ (good) | ⭐⭐⭐ (improving) |
| Photorealism | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| API access | ✅ (OpenAI API) | ❌ (Discord/web only) |
| Open-source | ❌ | ❌ |
Common Misconceptions
❌ MYTH: "GANs 'learn' from the training images and can reproduce them."
✅ TRUTH: GANs learn the statistical distribution of the training images rather than memorizing individual images. The Generator maps random noise to the learned distribution, so generated images are new samples from it (though memorization can occur with small datasets or excessive capacity).
WHY IT MATTERS: This distinction is central to copyright and IP debates. If GANs "copied" images, they'd infringe copyright directly. The reality is more nuanced: they learn style, structure, and patterns.
❌ MYTH: "Diffusion models are slower than GANs so they'll be replaced."
✅ TRUTH: While base DDPM needs 1000 steps, modern techniques like DDIM (50 steps), consistency models (1-2 steps), and LCM-LoRA have brought diffusion inference to near-real-time. Stability AI's SDXL Turbo generates 512×512 images in a single forward pass. The speed gap is closing rapidly.
WHY IT MATTERS: Choosing between architectures based on 2020-era speed comparisons will lead to wrong engineering decisions in 2025.
❌ MYTH: "More GAN training always gives better results."
✅ TRUTH: GANs don't converge like supervised models. Training too long can cause mode collapse, oscillation, or the discriminator overwhelming the generator. You need to monitor FID/IS scores and save checkpoints regularly.
WHY IT MATTERS: In production (e.g., Meesho's pipeline), knowing when to stop training is as important as knowing how to start.
❌ MYTH: "VAEs are obsolete now that diffusion models exist."
✅ TRUTH: VAEs remain a strong choice for: (1) learning interpretable latent representations, (2) anomaly detection (high reconstruction error = anomaly), (3) the image compressor in Stable Diffusion itself, which encodes images to latents and decodes them back with a VAE.
WHY IT MATTERS: Understanding each model's strengths prevents dogmatic architecture choices.
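The advice above to monitor FID during GAN training can be made concrete with a toy version of the metric. The sketch below computes the Fréchet distance between two Gaussians under a simplifying assumption of diagonal covariances; the real FID uses Inception-v3 features and full covariance matrices with a matrix square root. The feature vectors here are made-up 2D points, and the function names are hypothetical.

```python
import math

def mean_and_std(features):
    """Per-dimension mean and (population) standard deviation of feature vectors."""
    n, dim = len(features), len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n)
            for d in range(dim)]
    return means, stds

def frechet_distance_diag(real_feats, fake_feats):
    """Frechet distance between Gaussian fits, assuming diagonal covariances:
    ||mu_r - mu_f||^2 + sum_i (sigma_r_i - sigma_f_i)^2."""
    mu_r, sd_r = mean_and_std(real_feats)
    mu_f, sd_f = mean_and_std(fake_feats)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_f))
    spread_term = sum((a - b) ** 2 for a, b in zip(sd_r, sd_f))
    return mean_term + spread_term

real = [[0.0, 0.0], [2.0, 2.0]]        # two "modes" in feature space
diverse_fake = [[0.0, 0.0], [2.0, 2.0]]
collapsed_fake = [[1.0, 1.0], [1.0, 1.0]]  # same mean, but no diversity

print(frechet_distance_diag(real, diverse_fake))    # 0.0
print(frechet_distance_diag(real, collapsed_fake))  # 2.0
```

The collapsed generator matches the real mean exactly yet still scores worse, because the spread term penalizes the missing diversity. This is the property that lets FID flag mode collapse.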
GATE / Exam Corner
Formula Sheet: Generative Models
- GAN Minimax: min_G max_D 𝔼[log D(x)] + 𝔼[log(1 − D(G(z)))]
- Optimal D*: D*(x) = p_data(x) / (p_data(x) + p_g(x))
- GAN → JSD: V(G, D*) = 2·JSD(p_data ∥ p_g) − 2·log(2)
- VAE ELBO: log P(x) ≥ 𝔼[log P(x|z)] − KL(q(z|x) ∥ p(z))
- Reparameterization: z = μ + ε·σ, ε ~ N(0, I)
- KL (Gaussian): −½ Σ(1 + log σ² − μ² − σ²)
- DDPM Forward: x_t = √ᾱ_t · x₀ + √(1−ᾱ_t) · ε
- DDPM Loss: ℒ = 𝔼[‖ε − ε_θ(x_t, t)‖²]
- WGAN Loss: L_critic = 𝔼[D(x_fake)] − 𝔼[D(x_real)]
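Two entries in the formula sheet, the GAN → JSD identity and the WGAN loss, can be compared on a small discrete example. The sketch below places p_data and p_g as point masses on a unit-spaced grid (an illustrative assumption) and shows that the JSD stays at log 2 however far apart they are, while the 1D Wasserstein distance grows with the gap. It also checks V(G, D*) = 2·JSD − 2·log 2 numerically. All function names are illustrative.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (natural log)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value_at_optimal_d(p, q):
    """V(G, D*) with D* = p / (p + q), for discrete p = p_data and q = p_g."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += pi * math.log(pi / (pi + qi))   # E_p[log D*]
        if qi > 0:
            total += qi * math.log(qi / (pi + qi))   # E_q[log(1 - D*)]
    return total

def wasserstein_1d(p, q):
    """W1 on a unit-spaced grid: sum of absolute differences between the CDFs."""
    cdf_p = cdf_q = total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

grid = 10
p_data = [0.0] * grid
p_data[0] = 1.0                      # all real mass at position 0
for shift in (1, 3, 7):
    p_g = [0.0] * grid
    p_g[shift] = 1.0                 # all generated mass at position `shift`
    print(f"shift={shift}: JSD={jsd(p_data, p_g):.4f}  "
          f"W1={wasserstein_1d(p_data, p_g):.1f}  "
          f"V(G,D*)={gan_value_at_optimal_d(p_data, p_g):.4f}")

uniform = [0.25] * 4
print("V at p_g = p_data:", gan_value_at_optimal_d(uniform, uniform))
```

Because the JSD is flat across all three shifts, a generator trained against it gets no signal about which direction reduces the gap; the Wasserstein distance does provide that signal.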
GATE-Style MCQs
For the GAN minimax objective V(D, G) = 𝔼[log D(x)] + 𝔼[log(1 − D(G(z)))], at the global optimum where p_g = p_data, what is the value of V?
- 0
- −log(2)
- −2·log(2)
- log(2)
In a VAE, the reparameterization trick is necessary because:
- Sampling from a Gaussian is computationally expensive
- Backpropagation cannot flow through a stochastic sampling operation
- The KL divergence requires a differentiable encoder
- The decoder needs a fixed-length input
A DDPM uses T=1000 steps. If ᾱ₅₀₀ = 0.05, what fraction of the original signal x₀ is retained in x₅₀₀?
- 5%
- 22.4% (√0.05)
- 50%
- 95%
GATE Prediction Table (2025-2027)
| Topic | GATE CS Probability | Likely Question Type |
|---|---|---|
| GAN minimax objective | ⭐⭐⭐⭐ High | Write the objective, compute optimal D* |
| VAE ELBO / KL divergence | ⭐⭐⭐⭐ High | Compute KL for given μ, σ |
| Mode collapse definition | ⭐⭐⭐ Medium | MCQ: identify from description |
| Diffusion forward process | ⭐⭐ Emerging | Compute x_t given x₀ and noise schedule |
| GAN vs VAE comparison | ⭐⭐⭐ Medium | Match properties to model types |
Interview Prep
Conceptual Questions
Top 8 Interview Questions on Generative Models
Q1: Explain the GAN minimax game in plain English. What happens at Nash equilibrium?
Answer: Two networks compete: the Generator creates fakes and the Discriminator detects them. At Nash equilibrium, the Generator produces images indistinguishable from real data, and the Discriminator outputs 0.5 for everything (random guessing). The Generator has learned the data distribution.
Q2: Why do GANs suffer from mode collapse? How would you fix it?
Answer: The Generator finds it easier to repeatedly produce one "safe" output that always fools the Discriminator, rather than exploring the full data distribution. Fixes: WGAN (smoother gradient landscape), minibatch discrimination (penalize low diversity), progressive growing, or unrolled GANs.
Q3: Explain the reparameterization trick and why it's necessary for VAEs.
Answer: The VAE's encoder outputs μ and σ, then samples z ~ N(μ, σ²). But sampling is non-differentiable, so you can't backprop through it. The trick: z = μ + ε·σ where ε ~ N(0, 1). Now the randomness (ε) is external to the computation graph, and ∂z/∂μ = 1, ∂z/∂σ = ε, so gradients exist.
Q4: How does a diffusion model generate images? What does it predict?
Answer: Start with pure Gaussian noise. At each step, a U-Net predicts the noise component ε, which is partially removed to get a slightly cleaner image. After T steps (50-1000), you arrive at a clean image. The model only needs to learn one thing: predict noise at any timestep.
Q5: Why did WGAN use Wasserstein distance instead of JSD?
Answer: When p_data and p_g have disjoint supports (common in high dimensions), JSD is constant (log 2), so it provides no gradient for the Generator. Wasserstein distance is continuous even for non-overlapping distributions, providing meaningful gradients everywhere.
Q6: What is the role of the VAE in Stable Diffusion?
Answer: The VAE compresses images from pixel space (512×512×3 ≈ 786K dims) to latent space (64×64×4 ≈ 16K dims). The diffusion process runs entirely in this compressed space, which holds 48× fewer values per image and is therefore far cheaper to process. The VAE decoder converts the denoised latent back to pixel space.
Q7: How does classifier-free guidance work?
Answer: During training, the text condition is randomly dropped (replaced with ∅) some percentage of the time. At inference, the model generates both conditional (with text) and unconditional predictions. The final prediction extrapolates toward the conditional: ε_guided = ε_unconditional + w·(ε_conditional − ε_unconditional). Higher w = stronger text adherence.
Q8: Compare FID and IS as GAN evaluation metrics.
Answer: FID (Fréchet Inception Distance) compares Inception feature distributions of real vs generated images; lower is better. It captures both quality and diversity. IS (Inception Score) only measures generated image quality/diversity using a pretrained classifier and doesn't compare to real data. FID is preferred in practice because it catches mode collapse (which IS might miss).
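The classifier-free guidance rule from Q7 is a one-line extrapolation between two noise predictions. The sketch below applies it to plain Python lists standing in for the model's outputs; the function name and the example numbers are hypothetical and do not come from any particular model.

```python
def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Combine unconditional and conditional noise predictions element-wise:
    eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # prediction with the text condition dropped
eps_cond = [1.0, 3.0]     # prediction with the text condition

print(classifier_free_guidance(eps_uncond, eps_cond, 0.0))  # [0.0, 1.0]
print(classifier_free_guidance(eps_uncond, eps_cond, 1.0))  # [1.0, 3.0]
print(classifier_free_guidance(eps_uncond, eps_cond, 2.0))  # [2.0, 5.0]
```

With w = 0 the text is ignored, w = 1 reproduces the conditional prediction, and w > 1 pushes past it, which is the regime used when stronger prompt adherence is wanted.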
Coding Challenge
Coding: Implement the VAE KL Loss
import torch

def vae_kl_loss(mu, log_var):
    """
    Compute KL divergence KL(N(μ, σ²) ∥ N(0, I))
    Args:
        mu: (batch, latent_dim), encoder mean
        log_var: (batch, latent_dim), log variance
    Returns:
        KL divergence (scalar, averaged over batch)
    """
    # KL = -0.5 * Σ(1 + log(σ²) - μ² - σ²), summed over latent dimensions
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)
    return kl.mean()

# Test
mu = torch.tensor([[0.5, -1.0], [0.0, 0.0]])
log_var = torch.tensor([[0.0, 0.5], [0.0, 0.0]])
print(f"KL = {vae_kl_loss(mu, log_var):.4f}")
# Row 2 (μ=0, σ²=1) should contribute 0 KL
Case Study Interview (India Focus)
Design: AI Product Photography for Flipkart/Meesho
Prompt: "Design a system that takes a phone photo of a product and generates studio-quality product images. Target users: small sellers in India with no photography skills."
Key points to cover:
- Pipeline: Segmentation → Background generation → Enhancement → Quality check
- Model choices: ControlNet for maintaining product shape, SDXL for background generation, ESRGAN for super-resolution
- India-specific: Low-bandwidth-friendly (generate server-side, send compressed), support for Indian product categories (sarees, jewelry, spices), regional language UI
- Evaluation: A/B test on CTR and conversion, FID vs real studio photos, user satisfaction surveys
- Scale: 10M+ images/day at Meesho scale, which calls for model distillation, batched inference, and caching
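For the scale point above, a quick capacity estimate helps in an interview. The sketch below turns 10M images/day into an average request rate and a rough GPU count. Only the daily volume comes from the prompt; the peak-to-average ratio and the GPU time per image are hypothetical placeholders to be replaced with measured values.

```python
import math

images_per_day = 10_000_000            # from the case-study prompt
seconds_per_day = 24 * 60 * 60
avg_rate = images_per_day / seconds_per_day   # images per second, averaged over the day

# Hypothetical assumptions, for illustration only:
peak_to_average = 3.0                  # traffic at peak hours relative to the daily average
gpu_seconds_per_image = 0.5            # end-to-end GPU time for one image after distillation

gpus_needed = math.ceil(avg_rate * peak_to_average * gpu_seconds_per_image)

print(f"average rate: {avg_rate:.1f} images/s")
print(f"GPUs needed at peak: {gpus_needed}")
```

Under these assumptions the average load is about 116 images per second. Halving the per-image GPU time, for example through fewer sampling steps, halves the GPU count, which is why distillation leads the list of scaling techniques.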
Hands-On Lab: Build a Conditional DCGAN
🔬 Mini-Project: Conditional Digit Generator
Objective
Build a Conditional DCGAN (cDCGAN) that generates MNIST digits based on a class label input. Instead of random digits, the user specifies "generate a 7" and gets a handwritten 7.
Requirements
- Modify the DCGAN Generator to accept a class label (one-hot encoded, concatenated with z)
- Modify the DCGAN Discriminator to accept a class label (embedded as additional channel)
- Train for 50 epochs on MNIST
- Generate a 10×10 grid: each row is one digit class (0-9), each column is a different random z
- Compute and report FID score
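The first two requirements concern how the class label enters each network. The NumPy sketch below shows only the tensor plumbing, under one possible reading of the requirements: the Generator input is z concatenated with a one-hot label, and the Discriminator input is the image plus ten constant label maps (one per class). A learned embedding reshaped to a single extra channel is another common choice. The function names are illustrative, and the model definitions and training loop are left to the lab.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Integer labels of shape (B,) -> one-hot matrix of shape (B, num_classes)."""
    out = np.zeros((len(labels), num_classes), dtype=np.float32)
    out[np.arange(len(labels)), labels] = 1.0
    return out

def generator_input(z, labels):
    """Concatenate noise (B, z_dim) with one-hot labels -> (B, z_dim + 10)."""
    return np.concatenate([z, one_hot(labels)], axis=1)

def discriminator_input(images, labels):
    """Append one constant map per class to images (B, 1, H, W) -> (B, 11, H, W)."""
    _, _, h, w = images.shape
    label_maps = one_hot(labels)[:, :, None, None] * np.ones((1, 1, h, w), dtype=np.float32)
    return np.concatenate([images, label_maps], axis=1)

labels = np.array([0, 3, 7, 9])
z = np.random.randn(4, 100).astype(np.float32)
images = np.zeros((4, 1, 28, 28), dtype=np.float32)

print(generator_input(z, labels).shape)           # (4, 110)
print(discriminator_input(images, labels).shape)  # (4, 11, 28, 28)
```

With this layout the first layer of each network needs its input channel count changed: 110 instead of 100 for the Generator's first transposed convolution, and 11 instead of 1 for the Discriminator's first convolution.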
Rubric
| Criterion | Points | Description |
|---|---|---|
| Working cDCGAN | 30 | Model trains without errors, losses are reasonable |
| Conditional generation | 25 | Can specify digit class and get recognizable output |
| Quality (FID < 50) | 20 | Generated digits are clear and diverse |
| Visualization | 15 | Grid showing all 10 classes, interpolation in z-space |
| Report | 10 | Discuss mode collapse observations, training stability |
Bonus Challenges
- ⭐ Implement WGAN-GP version and compare FID scores
- ⭐⭐ Add a simple diffusion model and compare all three
- ⭐⭐⭐ Train on Fashion-MNIST and build a "virtual wardrobe" generator
Exercises (22 Questions)
Section A: Conceptual (5 Questions)
Explain the difference between a discriminative model and a generative model. Give two examples of each.
In the GAN framework, what are the roles of the Generator and the Discriminator? What happens at Nash equilibrium?
Why does the VAE use a KL divergence term in its loss function? What would happen if you removed it?
Explain mode collapse using a concrete example with MNIST digits. How does WGAN help mitigate it?
Compare the inference (generation) process of GANs, VAEs, and diffusion models. Which is fastest? Which produces the highest quality? Why?
Section B: Mathematical (8 Questions)
Derive the optimal discriminator D*(x) from the GAN minimax objective. Show all steps.
Given p_data(x) = N(3, 1) and p_g(x) = N(5, 1), compute D*(x) at x = 3, x = 4, and x = 5.
Show that substituting D* into the GAN value function gives V(G, D*) = 2·JSD(p_data ∥ p_g) − 2·log(2). Derive each step.
Compute the KL divergence KL(N(μ, σ²) ∥ N(0, 1)) for: (a) μ = 0, σ = 1 (b) μ = 2, σ = 0.5 (c) μ = 0, σ = 3. Interpret each result.
In DDPM with linear schedule β_1 = 10⁻⁴, β_T = 0.02, T = 1000: (a) Compute ᾱ_1 and ᾱ_1000. (b) At what timestep t is the signal-to-noise ratio approximately 1? (c) Verify: x_t = √ᾱ_t · x₀ + √(1−ᾱ_t) · ε has variance ≈ 1 when x₀ is normalized.
Prove that the ELBO is indeed a lower bound on log P(x). Start from log P(x) = ELBO + KL(q(z|x) ∥ P(z|x)) and argue why the second term is non-negative.
For a β-VAE with β = 4 and a standard VAE (β = 1), given the same encoder output μ = [1, 0], log σ² = [0, −1], compute the total loss (reconstruction loss = 50 for both). Which model has a more "compressed" latent space?
In WGAN, the critic must be Lipschitz-continuous. (a) Define Lipschitz continuity. (b) Explain why weight clipping enforces it. (c) Explain how WGAN-GP's gradient penalty enforces it more elegantly. (d) What Lipschitz constant is enforced?
Section C: Coding (4 Questions)
Implement the VAE reparameterization trick in PyTorch. Write a function reparameterize(mu, log_var) that takes encoder outputs and returns sampled z. Include both training (with noise) and inference (deterministic) modes.
Implement the DDPM forward diffusion process. Write a function add_noise(x0, t, noise_schedule) that takes a clean image, a timestep, and returns the noisy image + the noise that was added.
Modify the DCGAN training loop to use WGAN-GP. Replace the loss function, remove the sigmoid from the discriminator, and implement the gradient penalty. Train on MNIST and compare FID with the vanilla DCGAN.
Build a latent space interpolation tool for a trained VAE. Given two MNIST images, encode both, linearly interpolate between their latent vectors (10 steps), and decode each intermediate vector. Visualize the smooth transition.
Section D: Critical Thinking (3 Questions)
A startup claims their GAN can generate "never-before-seen" chemical molecules for drug discovery. Critically evaluate: (a) What does "never-before-seen" mean in the context of learning p_data? (b) How would you validate that generated molecules are chemically valid? (c) What risks exist in using generative models for drug design?
Meesho uses diffusion models to generate product photography. A seller uploads a photo of a saree and gets a "model wearing the saree" image. Discuss: (a) What biases could the model introduce in generated model appearances? (b) How should Meesho handle diversity/representation? (c) What happens if a generated image misrepresents the product's color or texture?
Compare the economic impact of generative AI on professional photographers in India vs. the US. Consider: market size, pricing power, adaptation strategies, and regulatory differences.
★ Starred Research Questions (2 Questions)
Consistency Models (Song et al., 2023): These models distill the multi-step diffusion process into a single-step generator. Read the paper and explain: (a) What is the self-consistency property? (b) How does the consistency function map any point on a noise trajectory to the starting point? (c) What are the implications for real-time image generation?
Sora (OpenAI, 2024): OpenAI's text-to-video model uses diffusion in a "spacetime latent space." Propose an architecture that extends Stable Diffusion from images to video. Address: (a) How do you handle temporal consistency? (b) What is the computational cost scaling? (c) How would you train this on Indian content (Bollywood, cricket)?
Deepfakes and the Ethics of Generative AI
The Deepfake Crisis
Generative models, especially GANs and diffusion models, have created an unprecedented challenge: the ability to generate photorealistic fake content at scale. In 2023 alone:
- 95,820 deepfake videos were detected online (a 550% increase from 2019)
- India was the 6th most targeted country for deepfake attacks
- Political deepfakes were used in Indian state elections (manipulated speeches of politicians)
- Financial fraud using voice cloning resulted in ₹200+ crore losses in India
Ethical Framework for Generative AI
As engineers building these systems, you have a responsibility to consider:
- Consent: Does the generated content depict real people without their consent?
- Provenance: Can users tell if content is AI-generated? (C2PA metadata, watermarking)
- Harm potential: Could this content be used for fraud, harassment, or political manipulation?
- Bias amplification: Does the model perpetuate or amplify biases in training data?
- Economic displacement: How does this affect the livelihoods of artists, photographers, voice actors?
Regulatory Landscape
| Region | Key Regulations |
|---|---|
| 🇮🇳 India | IT Act Section 66D (deepfake fraud), MeitY advisory (2023) requiring platforms to label AI content, proposed Digital India Act |
| 🇺🇸 USA | No federal deepfake law (2024), state-level laws in California/Texas, FTC guidelines on AI-generated content |
| 🇪🇺 EU | EU AI Act (2024): deepfakes must be labeled, high-risk generative AI requires conformity assessment |
Detection Methods
Deepfake detection is itself a fascinating ML problem:
- Facial analysis: Detect inconsistencies in eye reflections, ear shapes, teeth
- Frequency analysis: GANs produce artifacts in the frequency domain that CNNs can detect
- Temporal analysis: Deepfake videos have unnatural blinking patterns, head movements
- Provenance: C2PA standard embeds cryptographic signatures at image creation
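The frequency-analysis idea in the list above can be illustrated with a hand-crafted statistic. The sketch below measures the fraction of an image's spectral energy that lies above a radial frequency cutoff. It is not a deepfake detector: practical systems train a classifier on spectra or learned features. The cutoff value, the function name, and the two synthetic test images are all illustrative assumptions.

```python
import numpy as np

def high_freq_energy_ratio(img, cutoff=0.25):
    """Fraction of spectral energy at radial frequency > cutoff (cycles/pixel)."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]   # vertical frequencies in [-0.5, 0.5)
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]   # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return float(power[radius > cutoff].sum() / power.sum())

# Smooth image: a soft blob with almost no fine detail
smooth = np.outer(np.hanning(32), np.hanning(32))
# Pixel-level checkerboard: a stand-in for periodic upsampling artifacts
checker = (np.indices((32, 32)).sum(axis=0) % 2).astype(float)

ratio_smooth = high_freq_energy_ratio(smooth)
ratio_checker = high_freq_energy_ratio(checker)
print(f"smooth: {ratio_smooth:.4f}, checkerboard: {ratio_checker:.4f}")
```

The checkerboard puts half of its energy at the highest representable frequency, while the smooth blob puts almost none there. Grid-like artifacts from upsampling layers shift real spectra in the same direction, only far more subtly.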
🇮🇳 Deepfakes in India
- Political deepfakes during elections (state + national)
- Celebrity face-swap scams targeting Bollywood fans
- Voice cloning fraud: "Your son has been kidnapped" scams
- MeitY crackdown: platforms must remove deepfakes within 24 hours
- IIT Delhi's deepfake detection research (evaluated on benchmarks such as FaceForensics++)
🇺🇸 Deepfakes in the USA
- Election misinformation (Biden robocall, 2024)
- Non-consensual intimate imagery (state laws emerging)
- Hollywood SAG-AFTRA strike partially about AI likenesses
- Taylor Swift deepfakes prompted bipartisan legislative action
- DARPA's MediFor program for media forensics research
Connections
How This Chapter Connects
- Chapter 12 (CNNs): DCGAN's Generator uses transposed convolutions; Discriminator uses regular convolutions
- Chapter 16 (GANs & VAEs Intro): This chapter extends with WGAN, StyleGAN, β-VAE, and adds diffusion
- Chapter 6 (Backpropagation): GAN training requires backprop through both D and G; reparameterization trick enables backprop through stochastic nodes
- Probability (KL Divergence): VAE ELBO, JSD in GANs, variational bound in diffusion
- Chapter 19 (Applied CV): Image generation, super-resolution, inpainting using models from this chapter
- Chapter 22 (Ethics & Future): Deepfakes, bias in generative AI, regulatory frameworks
- Text-to-Image systems: DALL-E, Stable Diffusion build on diffusion + CLIP from this chapter
- Video generation: Sora, Runway extend diffusion to temporal dimension
- Consistency Models (2023): Single-step generation from diffusion, combining diffusion-level quality with GAN-like speed
- Flow Matching (2023-2024): Alternative to diffusion with straight-line probability paths
- DiT (Diffusion Transformers): Replacing U-Net with Transformer backbone (used in Sora)
- 3D Generation: DreamFusion, Magic3D (text-to-3D via score distillation)
- Adobe Firefly: Commercially safe generative AI trained on licensed content
- Canva: Text-to-image integrated into design workflow
- Runway ML: Video generation and editing for creators
- Medical imaging: Generating synthetic MRI/CT data for rare diseases
Chapter Summary
7 Key Takeaways
- Generative vs Discriminative: Generative models learn P(x), enabling them to create new data. Discriminative models only learn P(y|x) for classification. Generation is fundamentally harder but more powerful.
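- GANs frame generation as a game: the Generator creates fakes and the Discriminator detects them. For a fixed Generator, the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)); with D* plugged in, the Generator's objective equals 2·JSD(p_data ∥ p_g) − log 4, so training the Generator minimizes the Jensen-Shannon divergence between p_data and p_g. At the Nash equilibrium p_g = p_data, so D*(x) = 1/2 everywhere and the Discriminator is reduced to random guessing.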
- Mode collapse is the central GAN challenge: The Generator can learn to produce only a few "safe" outputs. Solutions include WGAN (Wasserstein distance for smoother gradients), spectral normalization, and minibatch discrimination.
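- VAEs optimize a principled lower bound (ELBO): The training loss is the negative ELBO, i.e. reconstruction error plus the KL divergence between q(z|x) and the prior P(z). The reparameterization trick (z = μ + ε·σ, with ε ~ N(0, I)) moves the randomness into ε, so gradients can flow through μ and σ. β-VAE weights the KL term by β to trade reconstruction quality against disentanglement.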
- Diffusion models learn to reverse noise: The forward process is fixed (not learned) and adds Gaussian noise over T steps. The reverse process trains a U-Net to predict and remove noise at each step. The loss is simply MSE between true and predicted noise.
- Diffusion dominates in quality (2024): DDPM → DDIM → Latent Diffusion → Stable Diffusion → SDXL. The key insight of latent diffusion: run diffusion in a compressed latent space (64×64×4 latents instead of 512×512×3 pixels), so each denoising step handles about 48× fewer values.
- Ethics are inseparable from capability: Deepfakes, IP theft, and bias amplification are not hypothetical; they are real harms. Engineers must build detection, watermarking, and consent systems alongside generative models.
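The GAN takeaway above can be checked numerically on a toy problem. The sketch below is illustrative only: it assumes small discrete distributions in place of images, and the helper names (`optimal_discriminator`, `jsd`, `gan_value`) are hypothetical, not taken from the chapter's code. It evaluates the GAN value function at the optimal discriminator and compares it with 2·JSD − log 4.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def kl(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value(p_data, p_g, d):
    """V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_g}[log(1 - D(x))]."""
    return sum(pd * math.log(dx) + pg * math.log(1 - dx)
               for pd, pg, dx in zip(p_data, p_g, d))

# Toy distributions over three outcomes (arbitrary illustrative numbers).
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]

# With a mismatched generator, V(D*, G) should equal 2 * JSD - log 4.
d_star = optimal_discriminator(p_data, p_g)
value_at_d_star = gan_value(p_data, p_g, d_star)
identity_rhs = 2 * jsd(p_data, p_g) - math.log(4)

# With a perfect generator (p_g = p_data), D* is 1/2 everywhere.
d_star_matched = optimal_discriminator(p_data, p_data)
value_matched = gan_value(p_data, p_data, d_star_matched)

print(value_at_d_star, identity_rhs)
print(d_star_matched, value_matched, -math.log(4))
```

The two printed pairs should agree up to floating-point error: the first confirms the JSD identity, the second shows that a perfect Generator leaves the Discriminator at chance with value −log 4.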
GAN: min_G max_D 𝔼[log D(x)] + 𝔼[log(1 − D(G(z)))]
D*: p_data(x) / (p_data(x) + p_g(x))
VAE: ℒ = −𝔼[log P(x|z)] + KL(q(z|x) ∥ P(z))
Trick: z = μ + ε·σ, ε ~ N(0, I)
DDPM: ℒ = 𝔼[‖ε − ε_θ(√ᾱ_t · x₀ + √(1 − ᾱ_t) · ε, t)‖²]
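The VAE, Trick, and DDPM lines above translate almost directly into code. The following is a minimal sketch in plain Python (lists instead of tensors, so it runs without PyTorch). All function names are hypothetical, the KL term assumes a diagonal Gaussian encoder with a standard normal prior, and the linear β schedule from 1e-4 to 0.02 is an assumed choice in the style of the DDPM paper, not something this summary specifies.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """z = mu + eps * sigma with eps ~ N(0, I); the randomness lives only in eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + e * s for m, e, s in zip(mu, eps, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I))."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def alpha_bar_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule (assumed values)."""
    alpha_bars, running = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, alpha_bar_t, eps):
    """Forward process in one jump: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def noise_sq_error(eps, eps_pred):
    """Squared L2 distance between true and predicted noise (one DDPM loss term)."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

rng = random.Random(0)

# VAE side: sample a latent and measure how far the encoder is from the prior.
z = reparameterize([0.0, 1.0], [1.0, 0.5], rng)
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
kl_off_prior = kl_to_standard_normal([0.0, 1.0], [1.0, 0.5])

# Diffusion side: noise a toy "image" of three pixels at a mid-range timestep.
alpha_bars = alpha_bar_schedule(1000)
x0 = [0.5, -0.25, 1.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
t = 500
x_t = q_sample(x0, alpha_bars[t], eps)

# A perfect noise predictor gives zero loss; predicting all zeros does not.
loss_perfect = noise_sq_error(eps, eps)
loss_zero_prediction = noise_sq_error(eps, [0.0] * len(eps))
```

In a real model, `eps_pred` would come from the U-Net evaluated at (x_t, t), and `mu` and `sigma` would be encoder outputs; here they are fixed numbers so the arithmetic of each formula is visible on its own.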
Key Intuition: All three generative paradigms share one fundamental idea: learning to transform simple distributions (Gaussian noise) into complex data distributions. GANs do it adversarially (game), VAEs do it variationally (optimization), and diffusion models do it iteratively (denoising). The math differs, but the dream is the same: teach machines to create.
Further Reading
🇮🇳 Indian Resources
- NPTEL: "Deep Learning" by Prof. Mitesh Khapra (IIT Madras), Lectures 35-38 on generative models
- NPTEL: "Advanced Deep Learning" by Prof. Prabir Kumar Biswas (IIT KGP), VAE and GAN modules
- GATE CS syllabus: Generative models under "Machine Learning" (added in GATE 2024 pattern)
- AI4Bharat Wiki: Indian-language generative AI resources and datasets
🌍 Global Resources
- Papers:
- Goodfellow et al., "Generative Adversarial Nets" (NeurIPS 2014), arxiv.org/abs/1406.2661
- Kingma & Welling, "Auto-Encoding Variational Bayes" (ICLR 2014), arxiv.org/abs/1312.6114
- Ho et al., "Denoising Diffusion Probabilistic Models" (NeurIPS 2020), arxiv.org/abs/2006.11239
- Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (CVPR 2022), arxiv.org/abs/2112.10752
- Arjovsky et al., "Wasserstein GAN" (ICML 2017), arxiv.org/abs/1701.07875
- Karras et al., "A Style-Based Generator Architecture for Generative Adversarial Networks" (CVPR 2019), the StyleGAN paper
- Visual Explainers:
- Lil'Log: "What are Diffusion Models?", lilianweng.github.io
- 3Blue1Brown: "But what is a neural network?" (foundation for all chapters)
- Jay Alammar: "The Illustrated Stable Diffusion", jalammar.github.io
- Books:
- Goodfellow et al., "Deep Learning" (MIT Press), Chapter 20: Deep Generative Models
- Prince, "Understanding Deep Learning" (MIT Press, 2023), Chapters 14-18
- Foster, "Generative Deep Learning" (O'Reilly, 2nd ed.), hands-on with TensorFlow/Keras
- Code:
- HuggingFace Diffusers: github.com/huggingface/diffusers
- PyTorch GAN Zoo: github.com/facebookresearch/pytorch_GAN_zoo