Neural Networks & Deep Learning

Chapter 18: Generative Models (GANs, VAEs, and Diffusion)

Teaching Machines to Create, Not Just Classify: From Adversarial Games to Denoising Dreams

⏱️ Reading Time: ~4 hours  |  📖 Unit 6: Modern Deep Learning  |  🧠 Theory + Code + Ethics Chapter

📋 Prerequisites: Chapter 16 (GANs & VAEs Intro), Chapter 12 (CNNs), Probability & KL Divergence

Bloom's Taxonomy Map for This Chapter

Bloom's Level | What You'll Achieve
🔵 Remember | State the GAN minimax objective, VAE ELBO, reparameterization trick, DDPM forward/reverse equations, and key architecture names (DCGAN, WGAN, StyleGAN)
🔵 Understand | Explain why GANs frame generation as a game, why VAEs optimize a lower bound, why diffusion adds noise gradually, and why mode collapse occurs
🟢 Apply | Implement a vanilla GAN, DCGAN, and simple diffusion model from scratch on MNIST; use PyTorch for all three
🟡 Analyze | Derive the optimal discriminator D*, trace how Wasserstein distance fixes vanishing gradients, analyze the β-VAE disentanglement trade-off
🟠 Evaluate | Compare GANs vs VAEs vs Diffusion on sample quality, diversity, training stability, and compute cost; assess deepfake risks in Indian elections
🔴 Create | Design a diffusion-based product photography pipeline for Indian e-commerce; build a deepfake detection prototype
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  1. Distinguish discriminative models P(y|x) from generative models P(x), and explain why learning to generate data is harder than learning to classify it
  2. Derive the GAN minimax objective from first principles, prove the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x)), and show the connection to Jensen-Shannon divergence
  3. Implement a vanilla GAN, DCGAN, and WGAN from scratch in NumPy and PyTorch on MNIST
  4. Diagnose GAN training pathologies (mode collapse, vanishing gradients, oscillation) and apply fixes (label smoothing, spectral normalization, progressive growing)
  5. Explain the VAE ELBO decomposition: log P(x) ≥ 𝔼[log P(x|z)] − KL(q(z|x) ∥ p(z)), and implement the reparameterization trick
  6. Derive the forward and reverse processes of DDPM, explain the noise schedule, and implement a simple diffusion model
  7. Compare GANs, VAEs, and Diffusion models on five axes: sample quality, diversity, training stability, latent space structure, and compute requirements
  8. Analyze real-world systems: Meesho's diffusion-based product photography and Midjourney/DALL-E text-to-image generation
  9. Evaluate ethical implications of generative AI, including deepfakes, misinformation, and IP concerns
  10. Solve GATE-style problems on GAN objectives, VAE loss components, and diffusion mathematics
Section 2

Opening Hook

🎲 The Night GANs Were Born

In June 2014, Ian Goodfellow was at a bar in Montreal with friends, debating how to make neural networks generate images. The prevailing approach, Boltzmann machines, was painfully slow, requiring complex Markov chain Monte Carlo sampling. Someone suggested using neural networks to generate images directly, but how would you train them without a clear loss function?

Then Goodfellow had his insight: don't define an explicit loss; let two neural networks fight each other. One network (the Generator) creates fake images. The other (the Discriminator) tries to tell real from fake. They compete, and in this adversarial game, the Generator learns to produce images so realistic that even the Discriminator can't tell the difference.

Goodfellow went home that night and coded the entire thing. It worked on the first try. Within a few hours, his laptop was generating recognizable handwritten digits from pure noise. The paper, submitted later that year to NeurIPS, became one of the most cited in deep learning history.

But here's the twist: GANs were just the beginning. In the decade since, we've seen Variational Autoencoders learn smooth latent spaces, and Diffusion Models, inspired by thermodynamics, overtake everything else in image quality. Today, a student in Jaipur can type "a tiger wearing a kurta in a Rajasthani palace" and a diffusion model will paint it in seconds. You're about to learn exactly how all three families of generative models work, from the math to the code.

Goodfellow (NeurIPS 2014) NVIDIA StyleGAN OpenAI DALL-E Stability AI Meesho AI
Section 3

The Intuition First: Three Roads to Creation

The Art Forgery Analogy (GANs) 🎨

Imagine a forger (Generator) trying to create fake Picasso paintings, and an art critic (Discriminator) trying to spot the fakes. Initially, the forger is terrible, scribbling stick figures, and the critic catches every fake easily. But as the forger gets feedback ("your brushstrokes are too uniform, your color palette is wrong"), they improve. Meanwhile, the critic must also improve, because the fakes are getting better.

This arms race continues until the forger creates paintings so perfect that even expert critics flip a coin: 50% chance any painting is real or fake. At that point, the Generator has learned the true distribution of Picasso paintings.

"Aha" question: What if the forger only learns to copy one perfect painting and shows it every time? The critic can't tell it from a real Picasso, but you've lost all diversity. This is mode collapse, and it's the central challenge of GAN training.

The Postal Code Analogy (VAEs) 📮

Think of a VAE like a postal system for images. The encoder takes an image and compresses it into a "postal code": a small vector in latent space. The decoder takes any postal code and reconstructs the image. The key insight: VAEs don't just learn one postal code per image; they learn a region (a Gaussian cloud). Nearby postal codes decode to similar images. You can generate new images by sampling random postal codes and decoding them.

The Thermodynamics Analogy (Diffusion) 🌡️

Drop a single drop of ink into a glass of water. Over time, the ink molecules spread out until they're uniformly distributed: this is the forward process (adding noise). Now imagine you could reverse time and watch the uniform ink reconcentrate into a single drop. Diffusion models learn exactly this: they learn to reverse the noise process, turning pure static back into a coherent image, step by step.

Three Roads to Image Generation:

GANs:
  z (noise) ──→ [Generator] ──→ fake image ──→ [Discriminator] ──→ real/fake?
                                 real image ──→      ↕ compete

VAEs:
  image ──→ [Encoder] ──→ μ, σ ──→ z ~ N(μ, σ²) ──→ [Decoder] ──→ image'
                          latent space (smooth, continuous)

Diffusion:
  x₀ ──→ x₁ ──→ x₂ ──→ ... ──→ x_T (pure noise)
  image     (add noise each step →)     Gaussian noise

  x₀ ←── x₁ ←── x₂ ←── ... ←── x_T
  image     (← learn to denoise)     start from noise

The "it worked first try" story is real: Goodfellow has confirmed it in multiple interviews. But he also says the first version was "very simple", just fully connected layers on MNIST. It took the community years to scale GANs to high-resolution photorealistic images (ProGAN, 2017; StyleGAN, 2018).

Section 4

18.1 Discriminative vs. Generative Models

Before diving into specific architectures, you need to understand a fundamental split in machine learning.

Core Distinction

Discriminative Model

Learns the conditional distribution P(y|x): given an input x (image), predict a label y (cat/dog). Examples: Logistic Regression, CNNs for classification, Transformers for NER. These models draw decision boundaries.

Generative Model

Learns the joint distribution P(x, y) or just P(x), the full data distribution. Once you know P(x), you can sample from it to generate new data points. Examples: GANs, VAEs, Diffusion Models, GPT (autoregressive P(x₁, x₂, ..., xₙ)).

Why Generative is Harder

A discriminative model just needs to learn a boundary between classes. A generative model must understand the entire structure of the data: every texture, every edge, every statistical regularity. It's the difference between "tell me if this is a face" (easy) vs "draw me a face" (hard).

Mathematical Formulation

By Bayes' theorem: P(y|x) = P(x|y)P(y) / P(x). A generative model that learns P(x|y) and P(y) can also compute P(y|x), because P(x) = Σ_y P(x|y)P(y). In that sense it can do everything a discriminative model can, but that generality comes at a cost.

Property | Discriminative | Generative
Learns | P(y|x) | P(x) or P(x,y)
Can classify? | ✅ Directly | ✅ Via Bayes' rule
Can generate? | ❌ | ✅
Data efficiency | Better (simpler task) | Worse (harder task)
Examples | SVM, CNN, BERT | GAN, VAE, GPT, Diffusion

Q: What does a generative model learn?

A: The data distribution: P(x), or the joint P(x, y) when labels are available. This lets it generate new samples, and a model of P(x|y) and P(y) can also classify via Bayes' rule, while discriminative models only classify.

Key formula: P(y|x) = P(x|y)P(y) / P(x). Generative models learn the terms on the right-hand side.
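As a minimal sketch of classifying with a generative model via Bayes' rule, the snippet below assumes two classes whose class-conditional densities P(x|y) are 1D Gaussians. The class names, means, and priors are hypothetical values chosen only for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical generative model: P(y) and P(x|y) for two classes.
priors = {"cat": 0.5, "dog": 0.5}
likelihoods = {"cat": (0.0, 1.0), "dog": (2.0, 1.0)}  # (mean, std) of P(x|y)

def posterior(x):
    """P(y|x) = P(x|y) P(y) / P(x), with P(x) = sum over y of P(x|y) P(y)."""
    joint = {y: gaussian_pdf(x, *likelihoods[y]) * priors[y] for y in priors}
    evidence = sum(joint.values())  # this is P(x)
    return {y: joint[y] / evidence for y in joint}

post = posterior(1.0)  # x = 1.0 lies midway between the two class means
```

Because x = 1.0 is equally far from both means and the priors are equal, the posterior splits evenly between the two classes; moving x toward either mean shifts the posterior toward that class.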

Section 5

18.2 GANs: The Minimax Game

The GAN Framework

A GAN consists of two neural networks trained simultaneously:

  • Generator G(z; θ_g): Takes random noise z ~ p_z(z) (usually N(0, I)) and maps it to a fake sample G(z)
  • Discriminator D(x; θ_d): Takes a sample x (real or fake) and outputs D(x) ∈ [0, 1], the probability that x is real

The Minimax Objective: Derivation from First Principles

Step 1: What does the Discriminator want?

D wants to maximize its accuracy. For real samples x ~ p_data, it wants D(x) → 1. For fake samples G(z), it wants D(G(z)) → 0.

This is just binary cross-entropy with the sign flipped! D maximizes:

V(D) = 𝔼_{x~p_data}[log D(x)] + 𝔼_{z~p_z}[log(1 − D(G(z)))]

Step 2: What does the Generator want?

G wants to fool D. It wants D(G(z)) → 1 (discriminator thinks fake is real). So G minimizes the same objective V. Only the second term depends on G, so:

G minimizes: 𝔼_{z~p_z}[log(1 − D(G(z)))]

Step 3: The combined minimax game:

min_G max_D V(D, G) = 𝔼_{x~p_data}[log D(x)] + 𝔼_{z~p_z}[log(1 − D(G(z)))]

Step 4: Deriving the Optimal Discriminator D*

For fixed G, we maximize V with respect to D. Rewrite V as an integral:

V = ∫ [p_data(x) log D(x) + p_g(x) log(1 − D(x))] dx

For each x, this is of the form f(D) = a·log(D) + b·log(1−D) where a = p_data(x), b = p_g(x).

Take derivative and set to zero:

f'(D) = a/D − b/(1−D) = 0

a(1−D) = bD → a − aD = bD → a = D(a+b) → D = a/(a+b)

Since f''(D) = −a/D² − b/(1−D)² < 0, this stationary point is a maximum.

Optimal Discriminator:
D*(x) = p_data(x) / (p_data(x) + p_g(x))

This is beautiful! The optimal discriminator is simply the ratio of real data density to total density. When p_g = p_data (Generator perfectly matches data), D*(x) = 1/2 everywhere: the discriminator is reduced to random guessing.
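The closed form can be checked numerically at a single point x. The sketch below assumes hypothetical density values a = p_data(x) and b = p_g(x) and compares a brute-force grid search over D with a/(a+b).

```python
import math

def f(D, a, b):
    """Pointwise discriminator objective: a*log(D) + b*log(1 - D)."""
    return a * math.log(D) + b * math.log(1.0 - D)

a, b = 0.6, 0.2       # hypothetical densities p_data(x) and p_g(x) at one point x
d_star = a / (a + b)  # closed-form optimum from the derivation

# Brute-force search over a fine grid of D values strictly inside (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
d_best = max(grid, key=lambda D: f(D, a, b))
```

The grid maximizer agrees with the closed form up to the grid resolution, which is what the concavity of f predicts.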

Connection to Jensen-Shannon Divergence

Step 5: Substituting D* back into V

With D = D*, let's compute V(G, D*):

V(G, D*) = 𝔼_{x~p_data}[log (p_data/(p_data + p_g))] + 𝔼_{x~p_g}[log (p_g/(p_data + p_g))]

Let m = (p_data + p_g)/2. Then:

= 𝔼_{p_data}[log(p_data/(2m))] + 𝔼_{p_g}[log(p_g/(2m))]

= 𝔼_{p_data}[log(p_data/m)] + 𝔼_{p_g}[log(p_g/m)] − 2 log 2

= KL(p_data ∥ m) + KL(p_g ∥ m) − 2 log 2

= 2 · JSD(p_data ∥ p_g) − 2 log 2

where JSD is the Jensen-Shannon Divergence! So with the discriminator at its optimum, minimizing V over G minimizes the JSD between p_data and p_g. Because JSD ≥ 0 with equality only when the two distributions match, the global minimum of −2 log 2 is reached exactly when p_g = p_data.

GAN ↔ JSD Connection:
V(G, D*) = 2 · JSD(p_data ∥ p_g) − 2·log(2)

JSD(P∥Q) = ½KL(P∥M) + ½KL(Q∥M), where M = (P+Q)/2
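The identity can be verified on a small discrete example. The sketch below uses two hypothetical three-outcome distributions, evaluates V at the optimal discriminator directly, and compares it with 2·JSD − 2·log 2.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists (0 * log 0 treated as 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def v_at_optimal_d(p_data, p_g):
    """V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))  # E_{p_data}[log D*]
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))  # E_{p_g}[log(1 - D*)]
    return total

p_data = [0.1, 0.4, 0.5]  # hypothetical toy distributions
p_g = [0.3, 0.3, 0.4]
lhs = v_at_optimal_d(p_data, p_g)
rhs = 2 * jsd(p_data, p_g) - 2 * math.log(2)
v_matched = v_at_optimal_d(p_data, p_data)  # generator equals the data distribution
```

Here lhs and rhs agree to floating-point precision, and v_matched equals −2·log 2, the global minimum.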

Practical Training: Alternating Gradient Descent

In practice, you don't solve the minimax analytically. Instead, you alternate:

  1. Train D for k steps: Sample a minibatch of real data x, sample noise z, compute loss = −[log D(x) + log(1−D(G(z)))], and update θ_d by gradient descent on this loss (equivalently, gradient ascent on V)
  2. Train G for 1 step: Sample noise z, compute loss = log(1−D(G(z))), update θ_g via gradient descent

The Non-Saturating Loss Trick: In practice, minimizing log(1−D(G(z))) gives weak gradients when D is confident that G(z) is fake, which is the usual situation early in training. Instead, G maximizes log D(G(z)). Same fixed point, but much stronger gradients early in training. Most practical GAN implementations use this form.
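To see the size of the effect, the sketch below assumes the discriminator output is D = sigmoid(s) for a logit s, and compares the gradient of each generator loss with respect to s. The logit value is hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def grad_saturating(s):
    """d/ds of log(1 - sigmoid(s)), which equals -sigmoid(s)."""
    return -sigmoid(s)

def grad_non_saturating(s):
    """d/ds of -log(sigmoid(s)), which equals sigmoid(s) - 1."""
    return sigmoid(s) - 1.0

s = -6.0  # hypothetical logit: D is confident the sample is fake, D(G(z)) is about 0.0025
g_sat = abs(grad_saturating(s))
g_ns = abs(grad_non_saturating(s))
```

At this logit the saturating loss passes back a gradient of magnitude about 0.0025, while the non-saturating loss passes back about 0.9975, roughly 400 times larger.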

GAN Training Loop (one iteration):

Step 1: Update Discriminator (k times, typically k=1)

  x_real ~ p_data            z ~ N(0,I)
      │                          │
      ▼                          ▼
  D(x_real) → want 1         G(z) → x_fake
                                 │
                                 ▼
                             D(x_fake) → want 0

  Loss_D = -[log D(x_real) + log(1 - D(x_fake))]
  θ_d ← θ_d - α · ∇_θd Loss_D

Step 2: Update Generator (1 time)

  z ~ N(0,I)
      │
      ▼
  G(z) → x_fake
      │
      ▼
  D(x_fake) → want 1 (fool D!)

  Loss_G = -log D(G(z))   ← non-saturating trick
  θ_g ← θ_g - α · ∇_θg Loss_G

"Generative Adversarial Nets" (Goodfellow et al., 2014): The original paper. Theorem 1 shows that the global minimum of the generator's objective is reached if and only if p_g = p_data, and Proposition 2 shows that p_g converges to p_data given sufficient model capacity. But the convergence argument assumes the discriminator reaches its optimum at each step, which never holds in practice. This gap between theory and practice drove a decade of research.

Read: arxiv.org/abs/1406.2661

Section 6

18.3 GAN Variants: DCGAN, WGAN, and StyleGAN

DCGAN: Deep Convolutional GAN (Radford et al., 2015)

The original GAN used fully connected layers. DCGAN established the architectural guidelines that made convolutional GANs work:

DCGAN Architecture Rules

  1. Replace pooling with strided convolutions: Discriminator uses strided conv (downsampling), Generator uses transposed conv (upsampling)
  2. Use BatchNorm everywhere, except in G's output layer and D's input layer
  3. Remove fully connected hidden layers: G's noise input is projected and reshaped into a feature map, and D's final convolutional features feed a single sigmoid output
  4. ReLU in Generator (except output: Tanh), LeakyReLU in Discriminator
  5. Adam optimizer with lr=0.0002, β₁=0.5

DCGAN Generator Architecture:

  z ∈ ℝ¹⁰⁰ (noise vector)
      │
      ▼
  [Project & Reshape: 4×4×1024]
      │
      ▼
  [ConvTranspose2d → 8×8×512]──[BatchNorm]──[ReLU]
      │
      ▼
  [ConvTranspose2d → 16×16×256]──[BatchNorm]──[ReLU]
      │
      ▼
  [ConvTranspose2d → 32×32×128]──[BatchNorm]──[ReLU]
      │
      ▼
  [ConvTranspose2d → 64×64×3]──[Tanh]
      │
      ▼
  Generated Image (64×64×3, values in [-1,1])
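The doubling of spatial size at each generator layer follows from the transposed-convolution output formula. The sketch below assumes kernel size 4, stride 2, and padding 1, a common choice in DCGAN implementations that the text above does not specify.

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

sizes = [4]  # spatial size after the project-and-reshape step
for _ in range(4):  # four transposed-conv layers in the generator
    sizes.append(conv_transpose_out(sizes[-1]))

channels = [1024, 512, 256, 128, 3]  # channel counts from the architecture diagram
shapes = list(zip(sizes, sizes, channels))
```

With these assumed hyperparameters each layer exactly doubles height and width, giving the 4 → 8 → 16 → 32 → 64 progression.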

WGAN: Wasserstein GAN (Arjovsky et al., 2017)

The key insight of WGAN: Jensen-Shannon Divergence is a terrible training signal when the generator and data distributions don't overlap (which is almost always true early in training, since images live on a low-dimensional manifold in pixel space).

Why JSD Fails

When p_data and p_g have disjoint supports (don't overlap), JSD = log(2) regardless of how "close" the distributions are. The discriminator achieves perfect accuracy instantly, and gradients for G vanish. This is the vanishing gradient problem in GANs.

The Wasserstein Distance (Earth Mover's Distance)

Wasserstein-1 Distance:
W(p_data, p_g) = inf_{γ ∈ Π(p_data, p_g)} 𝔼_{(x,y)~γ}[‖x − y‖]

= minimum cost to "transport" p_data into p_g

The beauty of W: under mild assumptions on the generator, it is continuous, and differentiable almost everywhere, in the generator's parameters even when the distributions don't overlap. Think of it as: "how much earth do you need to move, and how far?" Even when two piles of dirt don't touch, you can always measure the distance.
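A small discrete example makes the contrast concrete. The sketch below places the data and the generator as point masses on a hypothetical 1D grid of 11 cells, then moves the generator's mass closer to the data. In one dimension, W1 can be computed as the area between the two cumulative distributions.

```python
import math
from itertools import accumulate

def jsd(p, q):
    """Jensen-Shannon divergence between discrete distributions (natural log)."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein1(p, q, spacing=1.0):
    """W1 on a 1D grid: the area between the two cumulative distributions."""
    return spacing * sum(abs(cp - cq) for cp, cq in zip(accumulate(p), accumulate(q)))

def point_mass(index, n=11):
    """All probability on one grid cell."""
    return [1.0 if i == index else 0.0 for i in range(n)]

p_data = point_mass(0)
for shift in (10, 5, 1):  # the generator's mass moves closer to the data
    p_g = point_mass(shift)
    print(shift, round(jsd(p_data, p_g), 4), wasserstein1(p_data, p_g))
```

The JSD stays at log 2 (about 0.6931) for every shift because the supports never overlap, while W1 falls from 10 to 5 to 1, giving the generator a usable signal that it is getting closer.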

WGAN Training Changes

Vanilla GAN | WGAN
Discriminator outputs probability | Critic outputs unbounded score (no sigmoid)
Binary cross-entropy loss | Wasserstein loss: 𝔼[D(x_real)] − 𝔼[D(x_fake)]
Train D for 1 step per G step | Train Critic for 5 steps per G step
No weight constraint | Weight clipping: w ← clip(w, −0.01, 0.01)

WGAN-GP (Gulrajani et al., 2017) replaced weight clipping with a gradient penalty: penalize the critic when ‖∇_x̂ D(x̂)‖ ≠ 1, where x̂ is a random interpolation between a real and a fake sample. This enforces the Lipschitz constraint more smoothly than clipping and is a common default in practice.

StyleGAN: Style-Based Generator Architecture (Karras et al., 2019)

StyleGAN revolutionized GANs by separating what is generated from how it's styled:

  • Mapping Network: z → w (8-layer MLP, maps noise to "style" space W)
  • Synthesis Network: Generates image progressively (4×4 → 8×8 → ... → 1024×1024)
  • AdaIN (Adaptive Instance Normalization): Style vector w controls normalization at each layer
  • Noise injection: Stochastic details (hair strands, pores) via per-pixel noise

thispersondoesnotexist.com, a website that generates a new photorealistic face every time you refresh, uses StyleGAN2. The faces are 1024×1024, often hard to tell apart from real photos, and no two are alike. These people literally do not exist.

Section 7

18.4 Mode Collapse and GAN Training Tricks

What is Mode Collapse?

Imagine p_data is a mixture of 10 Gaussians (like MNIST's 10 digit classes). Mode collapse occurs when the Generator learns to produce only 1-2 of these modes, ignoring the rest. The Discriminator catches on ("you're only generating 7s!"), but the Generator responds by switching to another mode: "fine, now I'll only generate 3s."

Mode Collapse Visualization:

  True data distribution p_data:          Generator's output p_g:

   ╭─╮    ╭─╮    ╭─╮    ╭─╮                      ╭────╮
   │ │    │ │    │ │    │ │                      │    │  ← all samples
   │ │    │ │    │ │    │ │                      │    │    collapse to
  ─┘ └────┘ └────┘ └────┘ └──              ──────┘    └──── ONE mode!
   "0"    "3"    "5"    "7"                       "7"

  Full collapse: G always outputs the same image
  Partial collapse: G oscillates between 2-3 modes

Causes and Fixes

Problem | Cause | Fix
Mode collapse | G finds "safe" mode that always fools D | Minibatch discrimination, unrolled GANs, diversity regularization
Vanishing gradients | D becomes too strong → log(1−D(G(z))) saturates | Non-saturating loss, WGAN, label smoothing
Training oscillation | D and G alternate domination | Two-time-scale update rule (TTUR), spectral normalization
Gradient explosion | Unstable dynamics | Gradient clipping, spectral normalization

Practical GAN Training Checklist

GAN Training Stability Checklist

  • Use WGAN-GP or spectral normalization instead of vanilla GAN
  • Use non-saturating loss for Generator: −log D(G(z))
  • Label smoothing: real labels = 0.9 instead of 1.0
  • Train D more steps than G (typically 5:1 for WGAN)
  • Use Adam with lr=0.0002, β₁=0.5, β₂=0.999
  • Normalize inputs to [−1, 1]; use Tanh in G's output
  • Monitor both D and G losses: neither should go to zero
  • Use FID score (Fréchet Inception Distance) for evaluation

❌ MYTH: "The Generator loss should decrease over training."

✅ TRUTH: In a healthy GAN, both G and D losses oscillate around equilibrium. If G loss drops to zero, D has collapsed. If D loss drops to zero, G isn't learning. You want both losses to hover, indicating an ongoing "game."

🔍 WHY IT MATTERS: Students often debug GANs by looking for decreasing loss curves like in supervised learning. This leads to premature stopping or incorrect hyperparameter tuning.

A student trains a GAN on MNIST and notices the Discriminator loss quickly drops to 0 while the Generator loss climbs to infinity. The generated images are pure noise. What went wrong? How would you fix it?

for epoch in range(100):
    # Train D
    real = next(dataloader)
    fake = G(torch.randn(64, 100))
    d_loss = -torch.mean(torch.log(D(real)) + torch.log(1 - D(fake)))
    d_optimizer.step()
    
    # Train G (same batch!)
    g_loss = torch.mean(torch.log(1 - D(G(torch.randn(64, 100)))))
    g_optimizer.step()

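Bugs found: (1) Neither .backward() nor .zero_grad() is ever called, so optimizer.step() has no fresh gradients to apply; each update needs zero_grad(), then loss.backward(), then step(), otherwise gradients are missing or accumulate across steps. (2) G uses the saturating loss log(1−D(G(z))), which gives near-zero gradients when D is confident; use -torch.mean(torch.log(D(G(z)))) instead. (3) D's learning rate may be too high relative to G; try separate learning rates or a TTUR schedule. Two smaller issues: next(dataloader) fails if dataloader is a DataLoader rather than an iterator, and fake should be detached in the D step so the D update does not backpropagate into G.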
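One possible corrected version of the loop is sketched below. To keep it self-contained it assumes tiny fully connected networks and random tensors standing in for MNIST batches; the layer sizes, the eps constant, and the stand-in data are illustrative choices, not part of the original exercise.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical tiny networks and random stand-in data so the sketch runs on its own.
G = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
g_optimizer = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
d_optimizer = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
dataloader = [torch.rand(64, 784) * 2 - 1 for _ in range(3)]  # stand-in for MNIST batches
eps = 1e-8  # keeps log() away from zero

for epoch in range(2):
    for real in dataloader:
        # Train D: reset gradients, backpropagate, then step.
        d_optimizer.zero_grad()
        fake = G(torch.randn(real.size(0), 100)).detach()  # do not backprop into G here
        d_loss = -torch.mean(torch.log(D(real) + eps) + torch.log(1 - D(fake) + eps))
        d_loss.backward()
        d_optimizer.step()

        # Train G with the non-saturating loss on a fresh noise batch.
        g_optimizer.zero_grad()
        g_loss = -torch.mean(torch.log(D(G(torch.randn(real.size(0), 100))) + eps))
        g_loss.backward()
        g_optimizer.step()
```

The generator's backward pass also leaves gradients on D's parameters, which is harmless here because d_optimizer.zero_grad() clears them at the start of the next discriminator update.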

Section 8

18.5 Variational Autoencoders (VAEs) and β-VAE

From Autoencoders to VAEs

You already know autoencoders (Ch 12): encoder compresses x → z, decoder reconstructs z → x̂. But regular autoencoders learn a deterministic mapping to a messy, disconnected latent space. You can't sample from it meaningfully.

VAEs fix this by making the encoding probabilistic: instead of z = f(x), the encoder outputs parameters of a distribution: μ(x), σ(x). Then z is sampled from N(μ, σ²). A KL divergence term pushes this distribution toward the standard normal N(0, I), encouraging a smooth and continuous latent space.

Deriving the ELBO from First Principles

Goal: We want to maximize log P(x), the log-likelihood of the data.

Problem: P(x) = ∫ P(x|z)P(z) dz. This integral is intractable for complex decoders.

Solution: Introduce a tractable approximation q(z|x) ≈ P(z|x) and derive a lower bound.

Step 1: Start with log P(x) and use Jensen's inequality:

log P(x) = log ∫ P(x, z) dz = log ∫ q(z|x) · [P(x, z) / q(z|x)] dz

   ≥ ∫ q(z|x) · log[P(x, z) / q(z|x)] dz   ← Jensen's inequality (log is concave)

   = 𝔼_{q(z|x)}[log P(x, z) − log q(z|x)]

Step 2: Expand P(x, z) = P(x|z) · P(z):

   = 𝔼_{q(z|x)}[log P(x|z)] + 𝔼_{q(z|x)}[log P(z) − log q(z|x)]

   = 𝔼_{q(z|x)}[log P(x|z)] − KL(q(z|x) ∥ P(z))

Step 3: This is the ELBO (Evidence Lower BOund)!

VAE Loss (negative ELBO):
ℒ = −𝔼_{z~q(z|x)}[log P(x|z)] + KL(q(z|x) ∥ P(z))

= Reconstruction Loss + KL Divergence Regularizer
First term: "how well can you reconstruct?"   Second term: "how close is your encoding to N(0,I)?"

The Reparameterization Trick

Problem: z ~ N(μ, σ²) is a stochastic node. You can't backpropagate through random sampling!

Solution: Rewrite z = μ + ε · σ, where ε ~ N(0, I). Now the randomness (ε) is external to the computation graph, and gradients flow through μ and σ normally.

Reparameterization Trick:
z = μ(x) + σ(x) ⊙ ε,    ε ~ N(0, I)

∂z/∂μ = 1,   ∂z/∂σ = ε. Gradients exist!
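A quick numerical check of the trick for a single latent dimension is sketched below. The values of mu and sigma are hypothetical encoder outputs, and finite differences stand in for backpropagation.

```python
import random

random.seed(0)
mu, sigma = 1.5, 0.5  # hypothetical encoder outputs for one latent dimension

def sample_z(mu, sigma, eps):
    """Reparameterized sample: deterministic in (mu, sigma) once eps is drawn."""
    return mu + sigma * eps

# The samples still follow N(mu, sigma^2).
eps_draws = [random.gauss(0.0, 1.0) for _ in range(20000)]
zs = [sample_z(mu, sigma, e) for e in eps_draws]
mean_z = sum(zs) / len(zs)
var_z = sum((z - mean_z) ** 2 for z in zs) / len(zs)

# With eps held fixed, finite differences recover dz/dmu = 1 and dz/dsigma = eps.
eps, h = eps_draws[0], 1e-6
dz_dmu = (sample_z(mu + h, sigma, eps) - sample_z(mu - h, sigma, eps)) / (2 * h)
dz_dsigma = (sample_z(mu, sigma + h, eps) - sample_z(mu, sigma - h, eps)) / (2 * h)
```

The sample mean and variance land close to mu and sigma², while the two finite-difference gradients match the analytic values 1 and ε.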

KL Divergence: Closed Form

When q(z|x) = N(μ, σ²) with diagonal covariance and P(z) = N(0, I), the KL divergence has a beautiful closed form:

KL(N(μ, σ²) ∥ N(0, I)):
= −½ Σ_{j=1}^{d} (1 + log σ_j² − μ_j² − σ_j²)

β-VAE: Controlling Disentanglement

Higgins et al. (2017) introduced β-VAE by simply scaling the KL term:

ℒ_{β-VAE} = Reconstruction Loss + β · KL(q(z|x) ∥ P(z))

  • β = 1: Standard VAE
  • β > 1: Stronger regularization → more disentangled latent factors (ideally each dimension captures one independent feature: rotation, scale, color), but blurrier reconstructions
  • β < 1: Better reconstruction, but messier latent space

❌ MYTH: "VAEs generate blurry images because they're bad models."

✅ TRUTH: Blurriness comes largely from the pixel-wise MSE reconstruction loss: it averages over all possible reconstructions, creating blur. Use perceptual loss (comparing CNN features instead of pixels) or adversarial loss (VAE-GAN) for sharper results.

🔍 WHY IT MATTERS: Understanding why VAEs are blurry tells you that it's a loss function choice, not an architectural flaw. The framework is sound.

ML Research Scientist, Generative Models, at Adobe, NVIDIA, Google DeepMind. Roles focus on improving VAE/GAN/Diffusion architectures, typically requiring PhD + published papers. Salary range: ₹40-80 LPA (India) / $200-400K (US). Key skills: probabilistic modeling, PyTorch, distributed training.

Section 9

18.6 Diffusion Models: DDPM and the Denoising Revolution

The Core Idea

Diffusion models draw inspiration from non-equilibrium thermodynamics. The idea is stunningly simple:

  1. Forward process (fixed, no learning): Gradually add Gaussian noise to an image over T steps until it becomes pure noise
  2. Reverse process (learned): Train a neural network to reverse each step, denoising slightly each time, until you recover a clean image from pure noise

Forward Process (Adding Noise)

At each timestep t = 1, 2, ..., T:

q(x_t | x_{t-1}) = N(x_t; √(1−β_t) · x_{t-1}, β_t · I)

where β_t is a small noise variance (noise schedule, typically increasing from β₁ = 10⁻⁴ to β_T = 0.02).

Key mathematical trick: You can jump directly from x₀ to any x_t without computing all intermediate steps!

Define α_t = 1 − β_t and ᾱ_t = Π_{s=1}^{t} α_s = Π_{s=1}^{t} (1 − β_s). Then:

q(x_t | x₀) = N(x_t; √ᾱ_t · x₀, (1 − ᾱ_t) · I)

Equivalently: x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε,   ε ~ N(0, I)

This means: at any timestep t, the noisy image is just a weighted sum of the original image and random noise. As t → T, ᾱ_t approaches 0, and x_T ≈ pure noise.

Forward Diffusion (Direct Jump):
x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε,    ε ~ N(0, I)

where ᾱ_t = Π_{s=1}^{t} (1 − β_s) = cumulative signal retention
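The sketch below builds a noise schedule and checks the direct-jump formula numerically. It assumes T = 1000 and linear spacing between the two endpoints quoted above; the spacing is an assumption, since only the endpoints are given.

```python
import math

T = 1000
beta_start, beta_end = 1e-4, 0.02  # endpoints from the text; linear spacing is assumed
betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

# alpha_bar[t] = product of (1 - beta_s) over all steps up to and including step t
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

# Variance of x_t given x_0, built one step at a time:
# var_t = (1 - beta_t) * var_{t-1} + beta_t, starting from var_0 = 0.
var = 0.0
for beta in betas:
    var = (1.0 - beta) * var + beta

signal_scale_T = math.sqrt(alpha_bar[-1])  # weight on x_0 at the final step
```

The step-by-step variance matches 1 − ᾱ_T, and under this schedule ᾱ_T is on the order of 10⁻⁵, so almost none of the original image survives at the final step.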

Reverse Process (Learning to Denoise)

The reverse process aims to undo each noise step:

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² · I)

We train a neural network ε_θ(x_t, t) to predict the noise ε that was added. Once we know the noise, we can compute the denoised image.

DDPM Training Objective

The loss is beautifully simple:

For each training step:

  1. Sample a clean image x₀ from the training set
  2. Sample a random timestep t ~ Uniform{1, ..., T}
  3. Sample noise ε ~ N(0, I)
  4. Create noisy image: x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε
  5. Feed x_t and t to the neural network, get prediction ε_θ(x_t, t)
  6. Loss = ‖ε − ε_θ(x_t, t)‖², just MSE between true and predicted noise!
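The six steps above can be sketched in plain Python for a tiny flattened "image". The linear schedule, the 4-pixel input, and both predictor functions are hypothetical stand-ins; a real model would replace predict_noise with a trained network.

```python
import math
import random

random.seed(0)
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # assumed linear schedule
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def ddpm_loss(x0, predict_noise):
    """One Monte Carlo sample of the simple DDPM loss for a flat list of pixels x0."""
    t = random.randint(1, T)                    # step 2: t uniform on {1, ..., T}
    eps = [random.gauss(0.0, 1.0) for _ in x0]  # step 3
    a = alpha_bar[t - 1]
    x_t = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]  # step 4
    eps_hat = predict_noise(x_t, t)             # step 5
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))  # step 6

x0 = [0.2, -0.7, 1.0, 0.0]  # hypothetical 4-pixel "image" (step 1)

def zero_predictor(x_t, t):
    """Untrained placeholder for the network: always predicts zero noise."""
    return [0.0 for _ in x_t]

def oracle_predictor(x_t, t):
    """Cheating predictor that knows x0, so it can solve step 4 for the noise."""
    a = alpha_bar[t - 1]
    return [(xt - math.sqrt(a) * x) / math.sqrt(1 - a) for xt, x in zip(x_t, x0)]

loss_untrained = ddpm_loss(x0, zero_predictor)
loss_oracle = ddpm_loss(x0, oracle_predictor)
```

The oracle drives the loss to numerical zero, which shows what the network is being trained to approximate; the zero predictor simply pays the squared norm of the sampled noise.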
DDPM Simple Loss:
ℒ = 𝔼_{t, x₀, ε}[‖ε − ε_θ(√ᾱ_t · x₀ + √(1−ᾱ_t) · ε, t)‖²]

"Predict the noise that was added at timestep t"

Sampling (Generating Images)

To generate an image from scratch:

  1. Start with pure noise: x_T ~ N(0, I)
  2. For t = T, T−1, ..., 1: use the trained ε_θ to predict the noise, subtract it, get x_{t-1}
  3. The final x₀ is your generated image!
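The sampling loop can be sketched on a one-dimensional toy problem using the DDPM update x_{t-1} = (1/√α_t)(x_t − (β_t/√(1−ᾱ_t))·ε_θ(x_t, t)) + σ_t·z, with α_t = 1 − β_t. The snippet assumes a linear schedule, σ_t² = β_t (one common choice), and a hypothetical dataset in which every "image" is the single value c, so the ideal noise predictor can be written in closed form instead of being trained.

```python
import math
import random

random.seed(0)
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # assumed linear schedule
alphas = [1.0 - b for b in betas]
alpha_bar, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bar.append(running)

c = 0.7  # hypothetical dataset: every "image" is the single value c

def predict_noise(x_t, t):
    """Exact noise predictor for this one-point dataset (stands in for the trained network)."""
    ab = alpha_bar[t - 1]
    return (x_t - math.sqrt(ab) * c) / math.sqrt(1.0 - ab)

x = random.gauss(0.0, 1.0)  # step 1: x_T ~ N(0, 1)
for t in range(T, 0, -1):   # step 2: t = T, T-1, ..., 1
    beta, alpha, ab = betas[t - 1], alphas[t - 1], alpha_bar[t - 1]
    mean = (x - beta / math.sqrt(1.0 - ab) * predict_noise(x, t)) / math.sqrt(alpha)
    z = random.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise is added on the final step
    x = mean + math.sqrt(beta) * z                # sigma_t^2 = beta_t (assumed)
x0_sample = x               # step 3: the generated sample
```

Because the predictor is exact for this toy dataset, the chain ends at c up to floating-point error. With a learned network the same loop applies, but the result is only as good as the noise predictions.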
DDPM Sampling Process (reverse diffusion):

  x_T (noise) ──→ noisy ──→ less noisy ──→ ... ──→ x₀ (image, e.g. a cat 🐱)
  t = 1000        t = 800    t = 500                t = 0

  Each step: x_{t-1} = (1/√α_t)(x_t − (β_t/√(1−ᾱ_t))·ε_θ(x_t,t)) + σ_t·z, where α_t = 1 − β_t and z ~ N(0, I)

❌ MYTH: "Diffusion models are just fancy autoencoders."

✅ TRUTH: Autoencoders learn to compress and reconstruct. Diffusion models learn to reverse a stochastic process. There's no encoder at inference time: you start from pure noise. The training uses a fixed, non-learned forward process with a learned reverse. The mathematical framework is closer to score matching and stochastic differential equations than to compression.

🔍 WHY IT MATTERS: Understanding this distinction helps explain why diffusion models tend to cover more of the data distribution than GANs: they are trained with a likelihood-based denoising objective over the whole dataset, rather than an adversarial objective that a generator can satisfy with only a few modes.

"Denoising Diffusion Probabilistic Models" (Ho, Jain, Abbeel, 2020): The paper that showed diffusion models can match GAN quality. Key insight: the simplified loss (just predicting noise) works as well as the full variational bound. Building on Sohl-Dickstein et al. (2015) and Song & Ermon (2019).

"Denoising Diffusion Implicit Models" (Song et al., 2021): DDIM, a deterministic sampler that needs far fewer steps (50 vs 1000) for generation.

Section 10

18.7 Stable Diffusion and Text-to-Image

Latent Diffusion Models (LDM)

Running diffusion directly on 512×512×3 images is computationally expensive. Latent Diffusion (Rombach et al., 2022) solves this by running the diffusion process in a compressed latent space:

Stable Diffusion Architecture:

  "a cat wearing a space suit" ──→ [CLIP Text Encoder] ──→ [Cross-Attention conditioning]
                                                                    │
                                                                    ▼
  z_T (noise in latent space, 64×64) ──→ [U-Net Denoiser (with timestep & text conditioning)] ──→ z₀ (clean latent)
                                                                    │
                                                                    ▼
                                                            [VAE Decoder] ──→ 512×512 Image

  Key insight: Diffusion happens at 64×64×4, not 512×512×3! That's 48× fewer dimensions → massively cheaper.

Three Components of Stable Diffusion

  1. VAE (Autoencoder): Compresses images from pixel space (512×512×3) to latent space (64×64×4) and back. Trained separately.
  2. U-Net Denoiser: The diffusion model itself, which predicts noise in latent space. Conditioned on timestep t and text embedding via cross-attention layers.
  3. Text Encoder (CLIP): Converts text prompts to embeddings that guide the U-Net's denoising process.

Classifier-Free Guidance

To make outputs follow text prompts more closely, Stable Diffusion uses classifier-free guidance:

ε̃ = ε_θ(x_t, ∅) + w · (ε_θ(x_t, c) − ε_θ(x_t, ∅))

where c is the text condition, ∅ is the unconditional (empty-prompt) embedding, and w is the guidance scale (typically 7.5). Higher w = more adherent to prompt but less diverse.
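The guidance formula is a simple extrapolation between two noise predictions. The sketch below applies it element by element to hypothetical three-value predictions; in a real pipeline the two inputs would come from two U-Net passes, one with the prompt and one without.

```python
def classifier_free_guidance(eps_uncond, eps_cond, w=7.5):
    """Combine unconditional and conditional noise predictions, element by element."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Hypothetical noise predictions for a 3-element latent.
eps_uncond = [0.10, -0.20, 0.05]
eps_cond = [0.30, -0.10, 0.00]

guided = classifier_free_guidance(eps_uncond, eps_cond)              # default w = 7.5
same_as_cond = classifier_free_guidance(eps_uncond, eps_cond, 1.0)   # w = 1 recovers eps_cond
same_as_uncond = classifier_free_guidance(eps_uncond, eps_cond, 0.0) # w = 0 ignores the prompt
```

Setting w = 1 returns the plain conditional prediction and w = 0 the unconditional one; values above 1 push past the conditional prediction, away from the unconditional one.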

Q: Why does Stable Diffusion run diffusion in latent space instead of pixel space?

A: Computational efficiency. A 512×512×3 image has 786,432 dimensions. The VAE compresses this to 64×64×4 = 16,384 dimensions, a 48× reduction. Diffusion in this space is much faster, enabling generation on consumer GPUs.

Section 11

18.8 GANs vs. VAEs vs. Diffusion: The Complete Comparison

Axis | GAN | VAE | Diffusion
Sample Quality | ⭐⭐⭐⭐ (sharp) | ⭐⭐ (blurry) | ⭐⭐⭐⭐⭐ (best)
Diversity | ⭐⭐ (mode collapse risk) | ⭐⭐⭐⭐ (good coverage) | ⭐⭐⭐⭐⭐ (full coverage)
Training Stability | ⭐⭐ (hard to tune) | ⭐⭐⭐⭐ (stable) | ⭐⭐⭐⭐⭐ (very stable)
Latent Space | ❌ No meaningful latent | ⭐⭐⭐⭐⭐ (smooth, interpretable) | ⭐⭐⭐ (via guidance)
Inference Speed | ⭐⭐⭐⭐⭐ (one forward pass) | ⭐⭐⭐⭐⭐ (one forward pass) | ⭐ (50-1000 steps!)
Compute Cost | ⭐⭐⭐⭐ (moderate) | ⭐⭐⭐⭐ (moderate) | ⭐⭐ (expensive)
Likelihood | ❌ No explicit P(x) | ✅ Lower bound (ELBO) | ✅ Via variational bound
Math Foundation | Game theory, JSD | Variational inference, KL | Thermodynamics, SDE
Killer App | Face generation, style transfer | Representation learning, anomaly detection | Text-to-image, video generation

When to use what? (1) Need interpretable latent space? → VAE. (2) Need fast generation? → GAN. (3) Need best quality regardless of speed? → Diffusion. (4) Need text-conditioned generation? → Diffusion (Stable Diffusion). (5) Need anomaly detection? → VAE (high reconstruction error = anomaly).

🇮🇳 India: Current Landscape

  • Meesho: Diffusion for product photography
  • Flipkart: GAN-based virtual try-on
  • ISRO: Super-resolution satellite imagery via diffusion
  • IIT Bombay/Madras: VAE research for Indic script generation
  • Startup ecosystem: 50+ GenAI startups (Krutrim, Sarvam AI)
  • Key challenge: Compute access, data for Indian contexts

🇺🇸 USA: Current Landscape

  • OpenAI: DALL-E 3 (diffusion-based text-to-image)
  • Midjourney: Proprietary diffusion model
  • NVIDIA: StyleGAN series, GauGAN
  • Google: Imagen, Gemini image gen
  • Stability AI: Open-source Stable Diffusion
  • Key challenge: Copyright lawsuits, ethical governance
Section 12

Worked Examples

Worked Example 1: Computing Optimal Discriminator (By Hand) ✍️

Problem

Suppose our data lives in a 1D space. The real data distribution is p_data(x) = 2x for x ∈ [0, 1] (a triangle distribution). The current Generator produces p_g(x) = 1 for x ∈ [0, 1] (uniform). Find the optimal discriminator D*(x).

Solution

Using our derived formula:

D*(x) = p_data(x) / (p_data(x) + p_g(x)) = 2x / (2x + 1)

Let's check a few values:

  • At x = 0: D*(0) = 0/(0+1) = 0, because real data never appears at x = 0 (p_data(0) = 0)
  • At x = 0.5: D*(0.5) = 1/(1+1) = 0.5, because both distributions have density 1 at x = 0.5
  • At x = 1: D*(1) = 2/(2+1) = 2/3, because real data is twice as likely as fake at x = 1

Interpretation: D*(x) is high where real data is dense relative to fake data. It equals 1/2 where both distributions have equal density. This is exactly what we'd expect from a perfect classifier!
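
As a quick numerical check of Worked Example 1, the sketch below evaluates D*(x) at the three points and estimates the value function V(D*, G) with a midpoint rule. The grid size and function names are arbitrary choices for illustration.

```python
import numpy as np

def p_data(x):
    """Real density on [0, 1]: p_data(x) = 2x."""
    return 2.0 * x

def p_g(x):
    """Generator density on [0, 1]: uniform."""
    return np.ones_like(x)

def d_star(x):
    """Optimal discriminator p_data / (p_data + p_g)."""
    return p_data(x) / (p_data(x) + p_g(x))

xs = np.array([0.0, 0.5, 1.0])
print("D* at 0, 0.5, 1:", d_star(xs))

# V(D*, G) = ∫ [p_data log D* + p_g log(1 - D*)] dx, midpoint rule on [0, 1]
n = 100_000
x = (np.arange(n) + 0.5) / n          # midpoints avoid log(0) at x = 0
v = np.mean(p_data(x) * np.log(d_star(x)) + p_g(x) * np.log(1.0 - d_star(x)))
print("V(D*, G) =", v, " vs  -2 log 2 =", -2 * np.log(2))
```

V(D*, G) comes out above −2·log(2), which is the value reached only when p_g = p_data; the gap equals twice the Jensen-Shannon divergence between the two densities.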

Worked Example 2: VAE KL Divergence (Indian Industry Context) 🇮🇳

Meesho Product Encoding

Meesho trains a VAE on product images. For a specific saree image, the encoder outputs: μ = [0.5, −1.0, 2.0], log σ² = [−0.5, 0.0, 0.5]. Compute the KL divergence.

Solution

KL = −½ Σ (1 + log σ² − μ² − σ²)

For each dimension j:

  • j=1: −½(1 + (−0.5) − 0.25 − e^{−0.5}) = −½(1 − 0.5 − 0.25 − 0.607) = −½(−0.357) ≈ 0.178
  • j=2: −½(1 + 0 − 1.0 − e^0) = −½(1 − 1 − 1) = −½(−1) = 0.500
  • j=3: −½(1 + 0.5 − 4.0 − e^{0.5}) = −½(1.5 − 4 − 1.649) = −½(−4.149) ≈ 2.074

Total KL ≈ 0.178 + 0.500 + 2.074 ≈ 2.75

Interpretation: Dimension 3 contributes the most KL: its mean (2.0) is far from 0, and its variance (e^0.5 ≈ 1.65) is above 1. The KL penalty will push this encoding toward N(0,1), encouraging the latent space to stay organized.
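
The same computation takes a few lines of NumPy, which is a convenient way to check the hand arithmetic. The values of μ and log σ² are the ones from the problem statement; the variable names are illustrative.

```python
import numpy as np

mu = np.array([0.5, -1.0, 2.0])
log_var = np.array([-0.5, 0.0, 0.5])   # log σ² per latent dimension

# KL(N(μ, σ²) ∥ N(0, 1)) per dimension: -½ (1 + log σ² - μ² - σ²)
kl_per_dim = -0.5 * (1.0 + log_var - mu**2 - np.exp(log_var))
kl_total = kl_per_dim.sum()

print("per dimension:", kl_per_dim.round(3))
print("total:", round(float(kl_total), 3))
```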

Worked Example 3: DDPM Noise Scheduling (US/Global Context) 🇺🇸

DALL-E Style Diffusion

A diffusion model uses T=1000 steps with a linear noise schedule: β_t = 0.0001 + (0.02 − 0.0001) × (t − 1)/999, so that β₁ = 0.0001 and β₁₀₀₀ = 0.02. Compute ᾱ_t for t = 1, 500, and 1000.

Solution

ᾱ_t = Π_{s=1}^{t} (1 − β_s) = Π_{s=1}^{t} α_s

For the linear schedule, β₁ = 0.0001, β₅₀₀ ≈ 0.01, β₁₀₀₀ = 0.02.

Since there are many steps, we work with the log: log ᾱ_t = Σ log(1 − β_s) ≈ −Σ β_s (for small β).

  • t=1: ᾱ₁ = 1 − 0.0001 = 0.9999, so there is almost no noise and the image is nearly clean
  • t=500: ᾱ₅₀₀ ≈ exp(−Σ_{s=1}^{500} β_s) ≈ exp(−2.54) ≈ 0.079, so the image is mostly noise
  • t=1000: ᾱ₁₀₀₀ ≈ exp(−Σ_{s=1}^{1000} β_s) = exp(−10.05) ≈ 0.000043, which is virtually pure noise

Interpretation: The signal (√ᾱ_t) decreases from ~1.0 → ~0.28 → ~0.007 while noise (√(1−ᾱ_t)) increases from ~0 → ~0.96 → ~1.0. At t=500, the image is almost unrecognizable. At t=1000, it's pure Gaussian noise.
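
To see how good the first-order approximation is, the sketch below computes the exact cumulative product next to exp(−Σβ). It assumes the schedule is sampled with `np.linspace` so that the first and last values are exactly 0.0001 and 0.02.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # β_1 ... β_T
alpha_bars = np.cumprod(1.0 - betas)     # exact ᾱ_t
approx = np.exp(-np.cumsum(betas))       # first-order approximation exp(-Σ β_s)

for t in (1, 500, 1000):
    ab = alpha_bars[t - 1]               # arrays are 0-indexed, timesteps are 1-indexed
    print(f"t={t:4d}  exact={ab:.6f}  approx={approx[t - 1]:.6f}  "
          f"signal={np.sqrt(ab):.3f}  noise={np.sqrt(1 - ab):.3f}")
```

The approximation slightly overestimates ᾱ_t because log(1 − β) ≤ −β, but the signal and noise coefficients agree with the hand calculation to the precision quoted above.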

Section 13

From-Scratch Code: NumPy Implementations

1. Vanilla GAN from Scratch (NumPy)

Python / NumPy
import numpy as np

# ═══ Utility Functions ═══
def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu(x):
    return np.maximum(0, x)

def relu_deriv(x):
    return (x > 0).astype(np.float64)

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_deriv(x, alpha=0.2):
    return np.where(x > 0, 1, alpha)

# ═══ Generator: z(100) → hidden(256) → output(784) ═══
np.random.seed(42)
z_dim = 100
h_dim = 256
img_dim = 784  # 28×28 for MNIST
lr = 0.0002

# Generator weights (He init: std = sqrt(2 / fan_in))
W_g1 = np.random.randn(z_dim, h_dim) * np.sqrt(2/z_dim)
b_g1 = np.zeros((1, h_dim))
W_g2 = np.random.randn(h_dim, img_dim) * np.sqrt(2/h_dim)
b_g2 = np.zeros((1, img_dim))

# Discriminator weights
W_d1 = np.random.randn(img_dim, h_dim) * np.sqrt(2/img_dim)
b_d1 = np.zeros((1, h_dim))
W_d2 = np.random.randn(h_dim, 1) * np.sqrt(2/h_dim)
b_d2 = np.zeros((1, 1))

def generator_forward(z):
    """z → ReLU(zW1+b1) → Sigmoid(hW2+b2) → fake image"""
    h = z @ W_g1 + b_g1         # (batch, 256)
    h_act = relu(h)              # ReLU activation
    out = h_act @ W_g2 + b_g2   # (batch, 784)
    img = sigmoid(out)           # Sigmoid → [0,1] pixel values
    return z, h, h_act, out, img

def discriminator_forward(x):
    """x → LeakyReLU(xW1+b1) → Sigmoid(hW2+b2) → probability"""
    h = x @ W_d1 + b_d1         # (batch, 256)
    h_act = leaky_relu(h)       # LeakyReLU
    out = h_act @ W_d2 + b_d2   # (batch, 1)
    prob = sigmoid(out)          # probability real/fake
    return x, h, h_act, out, prob

def train_step(real_batch, batch_size=64):
    global W_g1, b_g1, W_g2, b_g2, W_d1, b_d1, W_d2, b_d2
    
    # ── Step 1: Train Discriminator ──
    z = np.random.randn(batch_size, z_dim)
    _, g_h, g_h_act, g_out, fake = generator_forward(z)
    
    # D on real data (want D(x) → 1)
    _, d_h_r, d_ha_r, d_out_r, d_prob_r = discriminator_forward(real_batch)
    # D on fake data (want D(G(z)) → 0)
    _, d_h_f, d_ha_f, d_out_f, d_prob_f = discriminator_forward(fake)
    
    # Binary cross-entropy gradients for D
    # Loss_D = -[log(D(real)) + log(1-D(fake))]
    d_loss = -np.mean(np.log(d_prob_r + 1e-8) + np.log(1 - d_prob_f + 1e-8))
    
    # Backprop through D (real path)
    dL_dout_r = -(1 / (d_prob_r + 1e-8)) * sigmoid_deriv(d_out_r) / batch_size
    dW_d2_r = d_ha_r.T @ dL_dout_r
    db_d2_r = np.sum(dL_dout_r, axis=0, keepdims=True)
    dL_dha_r = dL_dout_r @ W_d2.T
    dL_dh_r = dL_dha_r * leaky_relu_deriv(d_h_r)
    dW_d1_r = real_batch.T @ dL_dh_r
    db_d1_r = np.sum(dL_dh_r, axis=0, keepdims=True)
    
    # Backprop through D (fake path)
    dL_dout_f = (1 / (1 - d_prob_f + 1e-8)) * sigmoid_deriv(d_out_f) / batch_size
    dW_d2_f = d_ha_f.T @ dL_dout_f
    db_d2_f = np.sum(dL_dout_f, axis=0, keepdims=True)
    dL_dha_f = dL_dout_f @ W_d2.T
    dL_dh_f = dL_dha_f * leaky_relu_deriv(d_h_f)
    dW_d1_f = fake.T @ dL_dh_f
    db_d1_f = np.sum(dL_dh_f, axis=0, keepdims=True)
    
    # Update D
    W_d2 -= lr * (dW_d2_r + dW_d2_f)
    b_d2 -= lr * (db_d2_r + db_d2_f)
    W_d1 -= lr * (dW_d1_r + dW_d1_f)
    b_d1 -= lr * (db_d1_r + db_d1_f)
    
    # ── Step 2: Train Generator ──
    # Non-saturating loss: G maximizes log(D(G(z)))
    z = np.random.randn(batch_size, z_dim)
    _, g_h, g_h_act, g_out, fake = generator_forward(z)
    _, d_h_f, d_ha_f, d_out_f, d_prob_f = discriminator_forward(fake)
    
    g_loss = -np.mean(np.log(d_prob_f + 1e-8))
    
    # Backprop through D (frozen) then through G
    dL_dout = -(1 / (d_prob_f + 1e-8)) * sigmoid_deriv(d_out_f) / batch_size
    dL_dha = dL_dout @ W_d2.T
    dL_dh = dL_dha * leaky_relu_deriv(d_h_f)
    dL_dfake = dL_dh @ W_d1.T  # gradient at fake image
    
    # Continue through G
    dL_gout = dL_dfake * sigmoid_deriv(g_out)
    dW_g2 = g_h_act.T @ dL_gout
    db_g2 = np.sum(dL_gout, axis=0, keepdims=True)
    dL_ghact = dL_gout @ W_g2.T
    dL_gh = dL_ghact * relu_deriv(g_h)
    dW_g1 = z.T @ dL_gh
    db_g1 = np.sum(dL_gh, axis=0, keepdims=True)
    
    # Update G
    W_g2 -= lr * dW_g2
    b_g2 -= lr * db_g2
    W_g1 -= lr * dW_g1
    b_g1 -= lr * db_g1
    
    return d_loss, g_loss

# Training loop (random stand-in data; replace with real MNIST batches)
print("Training Vanilla GAN from scratch...")
for step in range(100):
    # Simulate a batch of "real" data: uniform values in [0.25, 0.75]
    real = np.random.rand(64, img_dim) * 0.5 + 0.25
    d_loss, g_loss = train_step(real)
    if step % 20 == 0:
        print(f"Step {step}: D_loss={d_loss:.4f}, G_loss={g_loss:.4f}")
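
Hand-written backprop is easy to get wrong, so it helps to check one piece numerically. The sketch below is a self-contained check (not part of the training script) of the real-path gradient used in the discriminator step: it compares the chain-rule expression −(1/D)·σ'(a) with a central finite difference of −log σ(a). The step size and test points are arbitrary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_real(a):
    """-log D for a real sample, where D = sigmoid(a) and a is the pre-activation."""
    return -np.log(sigmoid(a))

def grad_real_chain(a):
    """Chain-rule form: dL/da = -(1/D) * sigmoid'(a), with sigmoid'(a) = D * (1 - D)."""
    d = sigmoid(a)
    return -(1.0 / d) * d * (1.0 - d)

a = np.linspace(-3.0, 3.0, 7)
h = 1e-5
numeric = (loss_real(a + h) - loss_real(a - h)) / (2 * h)   # central difference
analytic = grad_real_chain(a)
simplified = sigmoid(a) - 1.0                                # algebraic simplification

print("max |numeric - analytic| =", np.max(np.abs(numeric - analytic)))
```

The simplified form D − 1 avoids the division by D altogether, which is why many implementations fuse the sigmoid and the cross-entropy into a single step.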

2. Simple Diffusion Model from Scratch (NumPy)

Python / NumPy
import numpy as np

# ═══ DDPM from Scratch: Simplified for 1D data ═══
# We'll implement the core math, then show PyTorch version for images

T = 100  # number of diffusion steps (1000 in practice)
beta_start = 1e-4
beta_end = 0.02

# Linear noise schedule
betas = np.linspace(beta_start, beta_end, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # ᾱ_t = cumulative product

print("Noise schedule check:")
print(f"  ᾱ_1   = {alpha_bars[0]:.6f}  (almost clean)")
print(f"  ᾱ_50  = {alpha_bars[49]:.6f}  (partially noisy)")
print(f"  ᾱ_100 = {alpha_bars[99]:.6f}  (about 0.36: T=100 stops short of pure noise)")

def forward_diffusion(x0, t, noise=None):
    """Add noise to x0 at timestep t: x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε"""
    if noise is None:
        noise = np.random.randn(*x0.shape)
    sqrt_ab = np.sqrt(alpha_bars[t])
    sqrt_1_ab = np.sqrt(1 - alpha_bars[t])
    return sqrt_ab * x0 + sqrt_1_ab * noise, noise

# Simple denoiser: 2-layer MLP that predicts noise
# Input: [x_t, t_embedding], Output: predicted noise
h_dim = 64
input_dim = 2  # 1D data + timestep encoding

W1 = np.random.randn(input_dim, h_dim) * 0.1
b1 = np.zeros(h_dim)
W2 = np.random.randn(h_dim, 1) * 0.1
b2 = np.zeros(1)

def predict_noise(x_t, t, return_cache=False):
    """Simple MLP to predict noise ε from (x_t, t)"""
    t_norm = t / T  # normalize timestep to [0, 1)
    inp = np.column_stack([x_t.reshape(-1, 1),
                           np.full((len(x_t), 1), t_norm)])
    h = np.tanh(inp @ W1 + b1)  # hidden layer, shape (N, h_dim)
    pred = h @ W2 + b2          # predicted noise, shape (N, 1)
    return (pred, inp, h) if return_cache else pred

def train_diffusion(data, epochs=1000, lr=0.001):
    """Train denoiser to predict noise at random timesteps (single-sample SGD)"""
    global W1, b1, W2, b2
    for epoch in range(epochs):
        # 1. Sample random data point
        x0 = data[np.random.randint(len(data))]
        # 2. Sample random timestep
        t = np.random.randint(0, T)
        # 3. Add noise (forward process)
        x_t, true_noise = forward_diffusion(np.array([x0]), t)
        # 4. Predict noise (keep activations for backprop)
        pred_noise, inp, h = predict_noise(x_t, t, return_cache=True)
        # 5. Loss = MSE(true_noise, pred_noise)
        err = pred_noise - true_noise.reshape(-1, 1)
        loss = np.mean(err ** 2)

        # 6. Manual backprop through the 2-layer MLP
        d_pred = 2 * err / err.size           # dL/d(pred)
        dW2 = h.T @ d_pred
        db2 = d_pred.sum(axis=0)
        d_h = (d_pred @ W2.T) * (1 - h ** 2)  # tanh'(a) = 1 - tanh(a)^2
        dW1 = inp.T @ d_h
        db1 = d_h.sum(axis=0)

        # 7. Gradient descent update
        W2 -= lr * dW2
        b2 -= lr * db2
        W1 -= lr * dW1
        b1 -= lr * db1

        if epoch % 200 == 0:
            print(f"Epoch {epoch}: loss = {loss:.4f}")

# Generate 1D data: mixture of two Gaussians
data = np.concatenate([
    np.random.randn(500) * 0.5 + 3.0,  # mode 1
    np.random.randn(500) * 0.5 - 3.0,  # mode 2
])
print("Data shape:", data.shape)
print("Training simplified diffusion model...")
train_diffusion(data, epochs=500)
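
The script above stops at training, so here is a minimal sketch of the reverse (sampling) process on the same kind of 1D data. To keep it independent of how well a tiny MLP trains, it swaps the learned network for the closed-form optimal noise predictor E[ε | x_t], which is available because the data is assumed to be an equal mixture of N(+3, 0.5²) and N(−3, 0.5²). The function names, T = 1000, and the random seed are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

m, s = 3.0, 0.5   # assumed data: equal mixture of N(+m, s²) and N(-m, s²)

def eps_exact(x, t):
    """Closed-form E[ε | x_t] for the two-Gaussian mixture (stands in for the network)."""
    ab = alpha_bars[t]
    v = ab * s**2 + (1.0 - ab)                     # per-mode variance of x_t
    # Posterior-weighted mode centre: sqrt(ab) * m * (r_plus - r_minus)
    centre = np.sqrt(ab) * m * np.tanh(np.sqrt(ab) * m * x / v)
    return np.sqrt(1.0 - ab) * (x - centre) / v

def sample(n):
    """DDPM ancestral sampling: start from N(0, 1) and denoise step by step."""
    x = rng.standard_normal(n)
    for t in reversed(range(T)):
        eps = eps_exact(x, t)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(n)   # σ_t² = β_t
    return x

samples = sample(20_000)
frac_pos = np.mean(samples > 0)
mode_mean = np.abs(samples).mean()
mode_std = np.abs(samples).std()
print(f"fraction in + mode: {frac_pos:.3f}, |x| mean: {mode_mean:.3f}, |x| std: {mode_std:.3f}")
```

The samples split roughly evenly between the two modes and each mode has about the right location and spread. A trained network would replace `eps_exact`, and the quality of the samples would then depend on how well it approximates this function.
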
Section 14

PyTorch Implementations

1. DCGAN on MNIST (PyTorch)

Python / PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# ═══ Hyperparameters ═══
z_dim = 100
img_channels = 1
features_g = 64
features_d = 64
lr = 0.0002
batch_size = 128
epochs = 50

# ═══ Generator ═══
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # z → 7×7×256
            nn.ConvTranspose2d(z_dim, features_g*4, 7, 1, 0),
            nn.BatchNorm2d(features_g*4),
            nn.ReLU(True),
            # 7×7 → 14×14
            nn.ConvTranspose2d(features_g*4, features_g*2, 4, 2, 1),
            nn.BatchNorm2d(features_g*2),
            nn.ReLU(True),
            # 14×14 → 28×28
            nn.ConvTranspose2d(features_g*2, img_channels, 4, 2, 1),
            nn.Tanh(),  # output in [-1, 1]
        )
    
    def forward(self, z):
        return self.net(z.view(-1, z_dim, 1, 1))

# ═══ Discriminator ═══
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # 28×28 → 14×14
            nn.Conv2d(img_channels, features_d, 4, 2, 1),
            nn.LeakyReLU(0.2, inplace=True),
            # 14×14 → 7×7
            nn.Conv2d(features_d, features_d*2, 4, 2, 1),
            nn.BatchNorm2d(features_d*2),
            nn.LeakyReLU(0.2, inplace=True),
            # 7×7 → 1×1
            nn.Conv2d(features_d*2, 1, 7, 1, 0),
            nn.Sigmoid(),
        )
    
    def forward(self, x):
        return self.net(x).view(-1, 1)

# ═══ Training Loop ═══
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
G = Generator().to(device)
D = Discriminator().to(device)
criterion = nn.BCELoss()
opt_g = optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
opt_d = optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),  # → [-1, 1]
])
dataset = datasets.MNIST(root="./data", train=True, transform=transform, download=True)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

for epoch in range(epochs):
    for real, _ in loader:
        real = real.to(device)
        bs = real.size(0)
        
        # ── Train Discriminator ──
        z = torch.randn(bs, z_dim).to(device)
        fake = G(z).detach()
        d_real = D(real)
        d_fake = D(fake)
        loss_d = criterion(d_real, torch.ones_like(d_real) * 0.9) + \
                 criterion(d_fake, torch.zeros_like(d_fake))
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()
        
        # ── Train Generator (non-saturating) ──
        z = torch.randn(bs, z_dim).to(device)
        fake = G(z)
        d_fake = D(fake)
        loss_g = criterion(d_fake, torch.ones_like(d_fake))  # fool D
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()
    
    # Losses shown are from the last batch of the epoch
    print(f"Epoch [{epoch+1}/{epochs}] D_loss: {loss_d.item():.4f} G_loss: {loss_g.item():.4f}")
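
The shape comments in the Generator and Discriminator follow from the standard output-size formulas for convolution and transposed convolution (with dilation 1 and no output padding). The short sketch below traces both networks with plain integer arithmetic; the helper names are made up for this illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, padding):
    """Spatial output size of a transposed convolution: (size - 1) * s - 2p + k."""
    return (size - 1) * stride - 2 * padding + kernel

# Generator: (kernel, stride, padding) of the three ConvTranspose2d layers
gen_sizes = [1]
for k, s, p in [(7, 1, 0), (4, 2, 1), (4, 2, 1)]:
    gen_sizes.append(conv_transpose_out(gen_sizes[-1], k, s, p))

# Discriminator: (kernel, stride, padding) of the three Conv2d layers
disc_sizes = [28]
for k, s, p in [(4, 2, 1), (4, 2, 1), (7, 1, 0)]:
    disc_sizes.append(conv_out(disc_sizes[-1], k, s, p))

print("generator:", gen_sizes)
print("discriminator:", disc_sizes)
```

The two networks mirror each other: the generator grows 1 → 7 → 14 → 28 and the discriminator shrinks 28 → 14 → 7 → 1.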

2. Simple DDPM on MNIST (PyTorch)

Python / PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

# ═══ Noise Schedule ═══
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def forward_diffusion(x0, t, device):
    """q(x_t | x_0): add noise at timestep t"""
    noise = torch.randn_like(x0)
    # Index the schedule on the same device as t; a CPU tensor cannot be indexed with CUDA indices
    ab = alpha_bars.to(device)[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise, noise

# ═══ U-Net (simplified) ═══
class SimpleUNet(nn.Module):
    """Minimal U-Net for noise prediction on 28×28 images."""
    def __init__(self):
        super().__init__()
        # Time embedding
        self.time_mlp = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 64)
        )
        # Encoder
        self.enc1 = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                                  nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # Bottleneck
        self.bottleneck = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        # Decoder
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(128, 32, 4, stride=2, padding=1), nn.ReLU())
        self.final = nn.Conv2d(64, 1, 1)  # predict noise
    
    def forward(self, x, t):
        # Time conditioning
        t_emb = self.time_mlp(t.float().unsqueeze(1) / T)  # (B, 64)
        
        # Encoder
        e1 = self.enc1(x)                     # (B, 32, 28, 28)
        e2 = self.enc2(e1)                    # (B, 64, 14, 14)
        
        # Bottleneck + time embedding
        b = self.bottleneck(e2)               # (B, 128, 7, 7)
        # Tile the 64-dim embedding to 128 channels and broadcast over the 7×7 map
        b = b + t_emb.repeat(1, 2)[:, :, None, None]
        
        # Decoder with skip connections
        d2 = self.dec2(b)                     # (B, 64, 14, 14)
        d2 = torch.cat([d2, e2], dim=1)      # skip: (B, 128, 14, 14)
        d1 = self.dec1(d2)                    # (B, 32, 28, 28)
        d1 = torch.cat([d1, e1], dim=1)      # skip: (B, 64, 28, 28)
        return self.final(d1)                 # (B, 1, 28, 28)

# ═══ Training ═══
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleUNet().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

# Training loop (using same MNIST loader from above)
for epoch in range(20):
    total_loss = 0
    for images, _ in loader:
        images = images.to(device)
        t = torch.randint(0, T, (images.size(0),)).to(device)
        
        x_t, noise = forward_diffusion(images, t, device)
        pred_noise = model(x_t, t)
        
        loss = F.mse_loss(pred_noise, noise)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    print(f"Epoch {epoch+1}: Loss = {total_loss/len(loader):.4f}")

# ═══ Sampling (Generate from noise) ═══
@torch.no_grad()
def sample(model, n_samples=16):
    """Generate images via reverse diffusion"""
    model.eval()
    x = torch.randn(n_samples, 1, 28, 28).to(device)
    
    for t in reversed(range(T)):
        t_batch = torch.full((n_samples,), t, device=device)
        pred_noise = model(x, t_batch)
        
        alpha = alphas[t]
        alpha_bar = alpha_bars[t]
        beta = betas[t]
        
        # Reverse step: x_{t-1} from x_t
        x = (1/alpha.sqrt()) * (x - (beta / (1-alpha_bar).sqrt()) * pred_noise)
        
        if t > 0:
            noise = torch.randn_like(x)
            x += beta.sqrt() * noise  # add stochasticity
    
    return x.clamp(-1, 1)

samples = sample(model)
print(f"Generated {samples.shape[0]} images of shape {samples.shape[1:]}")
Section 15

Visual Diagrams

Diagram 1: The Three Generative Paradigms

THE THREE ROADS TO GENERATION

GAN
  z ~ N(0,I) ──→ [Generator G] ──→ fake x̂ ──→ [Discriminator D] ──→ real or fake?
  real x from data ──→ [Discriminator D]
  The adversarial signal (∇θ_g loss) flows back from D to G.
  ✅ Sharp outputs   ❌ Unstable   ❌ Mode collapse

VAE
  x ──→ [Encoder] ──→ μ,σ ──→ z=μ+εσ ──→ [Decoder] ──→ x̂
  Loss = MSE(x,x̂) + KL(q(z|x) ∥ N(0,I))
  ✅ Smooth latent   ✅ Stable   ❌ Blurry outputs

DIFFUSION
  Forward (fixed):   x₀ → x₁ → x₂ → ... → x_T ≈ N(0,I)   (progressively add noise)
  Reverse (learned): x_T → x_{T-1} → ... → x₁ → x₀   (learn to denoise each step)
  Loss = 𝔼[‖ε - ε_θ(x_t, t)‖²]   (just predict the noise!)
  ✅ Best quality   ✅ Stable   ✅ Diverse   ❌ Slow generation

Diagram 2: VAE Latent Space

VAE Latent Space (2D visualization): a scatter plot with axes z₁ (horizontal) and z₂ (vertical), each running from −4 to 4, in which the digits "9", "7", "0", "1", and "4" form separate clusters.

Legend: ★ = sevens, ● = ones, ○ = zeros, ▲ = fours

Key properties:
  • Nearby points decode to similar images
  • Interpolating z₁→z₂ smoothly morphs digit
  • Sampling anywhere gives a valid digit
  • β-VAE: ↑β → clusters separate more (disentangled)

Diagram 3: Diffusion Forward/Reverse Process

DDPM: Forward and Reverse Process

Signal strength: ████████████████ → ░░░░░░░░░░░░░░░░
Noise strength:  ░░░░░░░░░░░░░░░░ → ████████████████

  t=0            t=200             t=500            t=800            t=1000
  x₀      ───▶   x₂₀₀      ───▶    x₅₀₀     ───▶    x₈₀₀     ───▶    x₁₀₀₀
  clean image    slightly noisy    mostly noisy     almost noise     pure noise

FORWARD: q(xₜ|xₜ₋₁) = N(√αₜ·xₜ₋₁, βₜI)   [FIXED, no learning]
REVERSE (learned): Neural network ε_θ(xₜ, t) predicts the noise at each step
LOSS = ‖ε_true − ε_predicted‖²
Section 16

Industry Case Study: Meesho AI Product Photography 🇮🇳

🇮🇳 Meesho: Diffusion Models for Small Seller Product Photography

The Problem

Meesho is India's largest social commerce platform, enabling 15+ million small sellers, many of them home-based women entrepreneurs in Tier-2/3 cities, to sell products online. The challenge: most sellers photograph products on bedsheets with phone cameras. Professional product photography costs ₹500-2000 per image, which is prohibitive when selling ₹200 kurtis.

The AI Solution

Meesho built a diffusion-based product photography pipeline that:

  1. Background removal: Segment the product from the cluttered photo using U-Net segmentation
  2. Background generation: Use a fine-tuned Stable Diffusion model to generate studio-quality backgrounds conditioned on product category (e.g., "clean white surface with soft shadows for jewelry")
  3. Image enhancement: Diffusion-based super-resolution to upscale low-quality phone images
  4. Model photography: Generate virtual models wearing the clothing using ControlNet + DensePose conditioning

Technical Architecture

  • Base model: Stable Diffusion XL (SDXL) fine-tuned on 2M Meesho product images
  • Conditioning: ControlNet for pose/edge conditioning, IP-Adapter for style transfer
  • Inference: SDXL Turbo for 4-step generation (down from 50 steps), which is essential for serving millions of sellers
  • Infrastructure: NVIDIA A100 GPUs on AWS Mumbai region, with distilled models for edge deployment

Impact

  • 📈 23% increase in click-through rate for AI-enhanced images
  • 📈 15% increase in conversion rate
  • 💰 ₹0 cost to sellers (free feature, platform investment)
  • 👩‍💼 Democratizes professional photography for millions of women entrepreneurs

India's GenAI Startup Ecosystem: Beyond Meesho, companies like Navi AI (insurance document generation), Rephrase.ai (AI video generation, acquired by Adobe), Krutrim (Ola's multilingual generative AI), and Yellow.ai (conversational AI) are building on GANs and diffusion models for India-specific use cases. The government's IndiaAI Mission has allocated ₹10,000 crore for AI compute infrastructure.

Section 17

Industry Case Study: Midjourney & DALL-E 🇺🇸

🇺🇸 Midjourney / OpenAI DALL-E: Text-to-Image at Scale

DALL-E 3 (OpenAI, 2023)

DALL-E 3 represents the cutting edge of text-to-image generation:

  • Architecture: Latent Diffusion Model with T5-XXL text encoder (replacing CLIP) for better prompt understanding
  • Key innovation: Trained on synthetic captions. OpenAI trained a dedicated image captioner and used it to re-caption the training dataset with detailed descriptions, dramatically improving prompt adherence
  • Safety: Built-in safety classifiers reject violent, sexual, or public-figure-likeness prompts; provenance metadata (C2PA) embedded in generated images
  • Integration: Natively integrated into ChatGPT, so users describe images in conversation and DALL-E generates them

Midjourney (2022-present)

Midjourney took a different path: aesthetics first, research papers second:

  • Team: ~40 people (tiny compared to OpenAI's thousands), founded by David Holz (ex-Leap Motion)
  • Architecture: Proprietary diffusion model (details undisclosed), with emphasis on artistic quality
  • Interface: Discord-only at launch โ€” users type /imagine prompts in a Discord channel
  • Revenue: $200M+ ARR with just 40 employees, making it one of the most capital-efficient AI companies
  • Quality: Consistently wins blind comparisons for aesthetic quality, especially in artistic styles

Technical Comparison: DALL-E 3 vs Midjourney v6

| Feature | DALL-E 3 | Midjourney v6 |
|---|---|---|
| Prompt adherence | ⭐⭐⭐⭐⭐ (best) | ⭐⭐⭐⭐ |
| Aesthetic quality | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ (best) |
| Text in images | ⭐⭐⭐⭐ (good) | ⭐⭐⭐ (improving) |
| Photorealism | ⭐⭐⭐⭐⭐⭐⭐⭐⭐ (nine stars across both columns; the per-column split is not recoverable) | |
| API access | ✅ (OpenAI API) | ❌ (Discord/web only) |
| Open-source | ❌ | ❌ |
Section 18

Common Misconceptions

❌ MYTH: "GANs 'learn' from the training images and can reproduce them."

✅ TRUTH: GANs learn the statistical distribution of the training images rather than memorizing individual images. The Generator maps random noise to the learned distribution. Generated images are new samples from this distribution (though memorization can occur with small datasets or excessive capacity).

🔍 WHY IT MATTERS: This distinction is central to copyright and IP debates. If GANs "copied" images, they'd infringe copyright directly. The reality is more nuanced: they learn style, structure, and patterns.

❌ MYTH: "Diffusion models are slower than GANs so they'll be replaced."

✅ TRUTH: While base DDPM needs 1000 steps, modern techniques like DDIM (50 steps), consistency models (1-2 steps), and LCM-LoRA have brought diffusion inference to near-real-time. Stability AI's SDXL Turbo generates 512×512 images in a single forward pass. The speed gap is closing rapidly.

🔍 WHY IT MATTERS: Choosing between architectures based on 2020-era speed comparisons will lead to wrong engineering decisions in 2025.

❌ MYTH: "More GAN training always gives better results."

✅ TRUTH: GANs don't converge like supervised models. Training too long can cause mode collapse, oscillation, or the discriminator overwhelming the generator. You need to monitor FID/IS scores and save checkpoints regularly.

🔍 WHY IT MATTERS: In production (e.g., Meesho's pipeline), knowing when to stop training is as important as knowing how to start.

❌ MYTH: "VAEs are obsolete now that diffusion models exist."

✅ TRUTH: VAEs remain the best choice for: (1) learning interpretable latent representations, (2) anomaly detection (high reconstruction error = anomaly), (3) the encoder component in Stable Diffusion itself! Stable Diffusion literally uses a VAE as its image compressor.

🔍 WHY IT MATTERS: Understanding each model's strengths prevents dogmatic architecture choices.

Section 19

GATE / Exam Corner

Formula Sheet: Generative Models

  • GAN Minimax: min_G max_D 𝔼[log D(x)] + 𝔼[log(1−D(G(z)))]
  • Optimal D*: D*(x) = p_data(x) / (p_data(x) + p_g(x))
  • GAN ↔ JSD: V(G, D*) = 2·JSD(p_data ∥ p_g) − 2·log(2)
  • VAE ELBO: log P(x) ≥ 𝔼[log P(x|z)] − KL(q(z|x) ∥ p(z))
  • Reparameterization: z = μ + ε·σ, ε ~ N(0, I)
  • KL (Gaussian): −½ Σ(1 + log σ² − μ² − σ²)
  • DDPM Forward: x_t = √ᾱ_t · x₀ + √(1−ᾱ_t) · ε
  • DDPM Loss: ℒ = 𝔼[‖ε − ε_θ(x_t, t)‖²]
  • WGAN Loss: L_critic = 𝔼[D(x_fake)] − 𝔼[D(x_real)]
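
The GAN ↔ JSD identity on the formula sheet can be verified numerically on small discrete distributions. The sketch below is illustrative only: the probability vectors are made up, and the helper names are not from any library.

```python
import numpy as np

def kl(p, q):
    """KL(p ∥ q) for discrete distributions, skipping bins where p = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence (natural log), using the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """V(G, D*) with D* = p_data / (p_data + p_g); zero-probability bins contribute nothing."""
    d_star = p_data / (p_data + p_g)
    real = np.sum(p_data[p_data > 0] * np.log(d_star[p_data > 0]))
    fake = np.sum(p_g[p_g > 0] * np.log(1.0 - d_star[p_g > 0]))
    return float(real + fake)

p_data = np.array([0.1, 0.2, 0.3, 0.4])
p_g = np.array([0.25, 0.25, 0.25, 0.25])

lhs = value_at_optimal_d(p_data, p_g)
rhs = 2 * jsd(p_data, p_g) - 2 * np.log(2)
print(lhs, rhs)

v_equal = value_at_optimal_d(p_data, p_data)   # p_g = p_data
print(v_equal, -2 * np.log(2))
```

When the two distributions coincide the JSD term vanishes and V drops to −2·log(2), the global optimum of the minimax game.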

GATE-Style MCQs

Q1 (GATE CSE 2023 Style)

For the GAN minimax objective V(D, G) = 𝔼[log D(x)] + 𝔼[log(1 − D(G(z)))], at the global optimum where p_g = p_data, what is the value of V?

  A. 0
  B. −log(2)
  C. −2·log(2)
  D. log(2)
Answer: C. When p_g = p_data, D*(x) = 1/2 everywhere. V = 𝔼[log(1/2)] + 𝔼[log(1/2)] = 2·log(1/2) = −2·log(2) ≈ −1.386.
Analyze · GATE 2023
Q2 (GATE CSE Style)

In a VAE, the reparameterization trick is necessary because:

  A. Sampling from a Gaussian is computationally expensive
  B. Backpropagation cannot flow through a stochastic sampling operation
  C. The KL divergence requires a differentiable encoder
  D. The decoder needs a fixed-length input
Answer: B. The sampling operation z ~ N(μ, σ²) has no gradient. By rewriting z = μ + ε·σ (ε ~ N(0,1)), the stochasticity is externalized, and gradients flow through μ and σ via the chain rule.
Understand · VAE
Q3 (Numerical)

A DDPM uses T=1000 steps. If ᾱ₅₀₀ = 0.05, what fraction of the original signal x₀ is retained in x₅₀₀?

  A. 5%
  B. 22.4% (√0.05)
  C. 50%
  D. 95%
Answer: B. x₅₀₀ = √ᾱ₅₀₀ · x₀ + noise. The signal coefficient is √0.05 ≈ 0.224, so 22.4% of the original signal amplitude is retained.
Apply · Diffusion

GATE Prediction Table (2025-2027)

| Topic | GATE CS Probability | Likely Question Type |
|---|---|---|
| GAN minimax objective | ⭐⭐⭐⭐ High | Write the objective, compute optimal D* |
| VAE ELBO / KL divergence | ⭐⭐⭐⭐ High | Compute KL for given μ, σ |
| Mode collapse definition | ⭐⭐⭐ Medium | MCQ: identify from description |
| Diffusion forward process | ⭐⭐ Emerging | Compute x_t given x₀ and noise schedule |
| GAN vs VAE comparison | ⭐⭐⭐ Medium | Match properties to model types |
Section 20

Interview Prep

Conceptual Questions

Top 8 Interview Questions on Generative Models

Q1: Explain the GAN minimax game in plain English. What happens at Nash equilibrium?

Answer: Two networks compete: the Generator creates fakes and the Discriminator tries to detect them. At Nash equilibrium, the Generator produces images indistinguishable from real data, and the Discriminator outputs 0.5 for everything (random guessing). The Generator has learned the data distribution.

Q2: Why do GANs suffer from mode collapse? How would you fix it?

Answer: The Generator finds it easier to repeatedly produce one "safe" output that always fools the Discriminator, rather than exploring the full data distribution. Fixes: WGAN (smoother gradient landscape), minibatch discrimination (penalize low diversity), progressive growing, or unrolled GANs.

Q3: Explain the reparameterization trick and why it's necessary for VAEs.

Answer: The VAE's encoder outputs μ and σ, then samples z ~ N(μ, σ²). But sampling is non-differentiable, so you can't backprop through it. The trick: z = μ + ε·σ where ε ~ N(0,1). Now the randomness (ε) is external to the computation graph, and ∂z/∂μ = 1, ∂z/∂σ = ε, so gradients exist.
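
The two partial derivatives quoted above can be verified numerically. The sketch below uses plain Python and finite differences instead of autograd, purely to illustrate the idea; the scalar `reparameterize` function and the chosen values of μ and σ are assumptions.

```python
import random

def reparameterize(mu, sigma, eps):
    """z = mu + eps * sigma, with eps drawn outside the function."""
    return mu + eps * sigma

random.seed(0)
mu, sigma = 0.5, 2.0
eps = random.gauss(0.0, 1.0)  # external noise, held fixed while differentiating
z = reparameterize(mu, sigma, eps)

# Finite-difference estimates of the partial derivatives.
h = 1e-6
dz_dmu = (reparameterize(mu + h, sigma, eps) - z) / h
dz_dsigma = (reparameterize(mu, sigma + h, eps) - z) / h

print(f"dz/dmu    = {dz_dmu:.4f} (expected 1)")
print(f"dz/dsigma = {dz_dsigma:.4f} (expected eps = {eps:.4f})")
```

Because ε is held fixed, z is an ordinary deterministic function of μ and σ, which is exactly what backpropagation needs.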

Q4: How does a diffusion model generate images? What does it predict?

Answer: Start with pure Gaussian noise. At each step, a U-Net predicts the noise component ε, part of which is removed to give a slightly cleaner image. After T steps (typically 50 to 1000, depending on the sampler), a clean image remains. The model only needs to learn one task: predicting the noise at any timestep.
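
The sampling loop described in this answer can be sketched on a one-dimensional toy problem. Everything here is an illustrative assumption: the data distribution is a single point `target`, the trained U-Net is replaced by an `oracle_noise` function that returns the exact noise for that toy distribution, the schedule is linear, and the sampling variance is set to β_t (one of the choices used in DDPM).

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise schedule (assumed values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def toy_reverse_process(target, num_steps=1000, seed=0):
    """Run DDPM-style ancestral sampling on a 1-D toy problem."""
    rng = random.Random(seed)
    betas = linear_betas(num_steps)
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    def oracle_noise(x_t, t):
        # Stand-in for the U-Net: exact noise when every data point equals `target`.
        return (x_t - math.sqrt(alpha_bars[t]) * target) / math.sqrt(1.0 - alpha_bars[t])

    x = rng.gauss(0.0, 1.0)  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps_hat = oracle_noise(x, t)
        # Remove part of the predicted noise to get the mean of x_{t-1}.
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(alphas[t])
        # Add fresh noise at every step except the last one.
        fresh = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * fresh
    return x

sample = toy_reverse_process(target=3.0)
print(f"final sample: {sample:.6f}")  # lands on the single data point
```

With a perfect noise predictor the last step recovers the data point exactly; a real model only approximates ε, so real samples are approximate draws from the learned distribution.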

Q5: Why did WGAN use Wasserstein distance instead of JSD?

Answer: When p_data and p_g have disjoint supports (common in high dimensions), JSD is constant (log 2), so it provides no gradient for the Generator. Wasserstein distance is continuous even for non-overlapping distributions, providing meaningful gradients everywhere.
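
A small discrete example shows the difference. In this sketch the "real" and "fake" distributions are point masses on a ten-cell grid (an assumption made to keep the numbers easy to read); the helper names `jsd`, `wasserstein_1d`, and `point_mass` are hypothetical.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """W1 distance on a unit-spaced grid: sum of |CDF_p - CDF_q|."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def point_mass(position, size=10):
    """All probability on a single grid cell."""
    return [1.0 if i == position else 0.0 for i in range(size)]

real = point_mass(0)
results = {}
for shift in (1, 4, 8):
    fake = point_mass(shift)
    results[shift] = (jsd(real, fake), wasserstein_1d(real, fake))
    print(f"shift={shift}: JSD={results[shift][0]:.4f}, W1={results[shift][1]:.1f}")
```

JSD stays at log 2 ≈ 0.6931 no matter how far apart the two distributions are, while W1 grows with the shift, so only W1 tells the Generator which direction reduces the gap.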

Q6: What is the role of the VAE in Stable Diffusion?

Answer: The VAE compresses images from pixel space (512×512×3 ≈ 786K dims) to latent space (64×64×4 ≈ 16K dims), a 48× reduction in dimensionality. The diffusion process runs entirely in this compressed space, which makes every denoising step far cheaper. The VAE decoder converts the denoised latent back to pixel space.
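
The 48× figure is simply the ratio of the two tensor sizes, as the short calculation below shows. It measures dimensionality only; the actual wall-clock speedup depends on the network and hardware, which this sketch does not model.

```python
# Tensor sizes quoted in the answer above.
pixel_dims = 512 * 512 * 3   # RGB image
latent_dims = 64 * 64 * 4    # 4-channel latent

ratio = pixel_dims / latent_dims
print(f"pixel space:  {pixel_dims:,} dims")
print(f"latent space: {latent_dims:,} dims")
print(f"reduction:    {ratio:.0f}x")
```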

Q7: How does classifier-free guidance work?

Answer: During training, the text condition is randomly dropped (replaced with ∅) some percentage of the time. At inference, the model generates both conditional (with text) and unconditional predictions. The final prediction extrapolates toward the conditional: ε̃ = ε_unconditional + w·(ε_conditional − ε_unconditional). Higher w = stronger text adherence.
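
The guidance formula is a one-line combination of two noise predictions. The sketch below applies it to short lists of numbers standing in for the model outputs; the function name and the example values are assumptions for illustration.

```python
def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Combine predictions: eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]  # stand-in for the unconditional prediction
eps_cond = [1.0, 3.0]    # stand-in for the text-conditioned prediction

no_guidance = classifier_free_guidance(eps_uncond, eps_cond, w=0.0)
plain_cond = classifier_free_guidance(eps_uncond, eps_cond, w=1.0)
extrapolated = classifier_free_guidance(eps_uncond, eps_cond, w=2.0)

print(no_guidance)   # equals the unconditional prediction
print(plain_cond)    # equals the conditional prediction
print(extrapolated)  # pushed past the conditional prediction
```

With w = 0 the text is ignored, with w = 1 the result is the ordinary conditional prediction, and w > 1 extrapolates beyond it, which is the regime used to strengthen prompt adherence.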

Q8: Compare FID and IS as GAN evaluation metrics.

Answer: FID (Fréchet Inception Distance) compares the Inception feature distributions of real and generated images; lower is better. It captures both quality and diversity. IS (Inception Score) measures the quality and class diversity of generated images using a pretrained classifier, but it never compares them to real data. FID is preferred in practice because it can catch mode collapse that IS might miss.
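
The Fréchet distance at the heart of FID can be illustrated with a simplified version. Real FID fits full-covariance Gaussians to Inception-v3 features and needs a matrix square root; the sketch below assumes diagonal covariances, in which case the formula reduces to sums over dimensions. The feature statistics are made-up numbers.

```python
import math

def frechet_distance_diag(mean_a, var_a, mean_b, var_b):
    """Frechet distance between two Gaussians with diagonal covariances.

    ||mean_a - mean_b||^2 + sum(var_a + var_b - 2 * sqrt(var_a * var_b))
    """
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    cov_term = sum(va + vb - 2.0 * math.sqrt(va * vb) for va, vb in zip(var_a, var_b))
    return mean_term + cov_term

real_mean, real_var = [0.0, 0.0], [1.0, 1.0]

# A generator that matches the real statistics exactly.
perfect = frechet_distance_diag(real_mean, real_var, [0.0, 0.0], [1.0, 1.0])
# A collapsed generator: right mean, but almost no diversity.
collapsed = frechet_distance_diag(real_mean, real_var, [0.0, 0.0], [0.01, 0.01])

print(f"matching generator:  {perfect:.2f}")
print(f"collapsed generator: {collapsed:.2f}")
```

The collapsed generator is penalized even though its mean is correct, which is the sense in which a Fréchet-style metric reacts to lost diversity.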

Coding Challenge

Coding: Implement the VAE KL Loss

import torch

def vae_kl_loss(mu, log_var):
    """
    Compute KL divergence KL(N(μ, σ²) ∥ N(0, I))
    Args:
        mu: (batch, latent_dim), encoder mean
        log_var: (batch, latent_dim), log variance log(σ²)
    Returns:
        KL divergence (scalar, averaged over batch)
    """
    # Per-sample KL = -0.5 * Σ(1 + log(σ²) - μ² - σ²), summed over latent dims
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)
    return kl.mean()

# Test
mu = torch.tensor([[0.5, -1.0], [0.0, 0.0]])
log_var = torch.tensor([[0.0, 0.5], [0.0, 0.0]])
print(f"KL = {vae_kl_loss(mu, log_var):.4f}")
# Row 2 (μ=0, σ²=1) should contribute 0 KL

Case Study Interview (India Focus)

Design: AI Product Photography for Flipkart/Meesho

Prompt: "Design a system that takes a phone photo of a product and generates studio-quality product images. Target users: small sellers in India with no photography skills."

Key points to cover:

  • Pipeline: Segmentation → Background generation → Enhancement → Quality check
  • Model choices: ControlNet for maintaining product shape, SDXL for background generation, ESRGAN for super-resolution
  • India-specific: Low-bandwidth-friendly (generate server-side, send compressed), support for Indian product categories (sarees, jewelry, spices), regional language UI
  • Evaluation: A/B test on CTR and conversion, FID vs real studio photos, user satisfaction surveys
  • Scale: 10M+ images/day at Meesho scale → need model distillation, batched inference, caching
Section 21

Hands-On Lab: Build a Conditional DCGAN

🔬 Mini-Project: Conditional Digit Generator

Objective

Build a Conditional DCGAN (cDCGAN) that generates MNIST digits based on a class label input. Instead of random digits, the user specifies "generate a 7" and gets a handwritten 7.

Requirements

  1. Modify the DCGAN Generator to accept a class label (one-hot encoded, concatenated with z)
  2. Modify the DCGAN Discriminator to accept a class label (embedded as additional channel)
  3. Train for 50 epochs on MNIST
  4. Generate a 10×10 grid: each row is one digit class (0-9), each column is a different random z
  5. Compute and report FID score
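
Requirements 1 and 4 come down to how the Generator's input is assembled. The sketch below shows one possible layout in plain Python lists, before any framework is involved: a one-hot label is concatenated to z, and the grid reuses one z per column across all ten classes. `LATENT_DIM = 100` and all function names are assumptions, not part of the lab specification.

```python
import random

LATENT_DIM = 100   # assumed size of the noise vector z
NUM_CLASSES = 10   # MNIST digits 0-9

def one_hot(label, num_classes=NUM_CLASSES):
    """One-hot encode a digit label."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label must be in [0, {num_classes})")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(z, label):
    """Concatenate the noise vector with the one-hot label (requirement 1)."""
    return list(z) + one_hot(label)

def grid_inputs(columns=10, seed=0):
    """Inputs for the 10x10 grid: row = digit class, column = a fixed random z."""
    rng = random.Random(seed)
    zs = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)] for _ in range(columns)]
    return [[generator_input(z, label) for z in zs] for label in range(NUM_CLASSES)]

grid = grid_inputs()
print(len(grid), len(grid[0]), len(grid[0][0]))  # rows, columns, input size
```

Sharing z down each column makes it easy to see what the label changes (the digit identity) and what z changes (stroke style), which helps when discussing mode collapse in the report.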

Rubric

Criterion | Points | Description
Working cDCGAN | 30 | Model trains without errors, losses are reasonable
Conditional generation | 25 | Can specify digit class and get recognizable output
Quality (FID < 50) | 20 | Generated digits are clear and diverse
Visualization | 15 | Grid showing all 10 classes, interpolation in z-space
Report | 10 | Discuss mode collapse observations, training stability

Bonus Challenges

  • ⭐ Implement WGAN-GP version and compare FID scores
  • ⭐⭐ Add a simple diffusion model and compare all three
  • ⭐⭐⭐ Train on Fashion-MNIST and build a "virtual wardrobe" generator
Section 22

Exercises (22 Questions)

Section A: Conceptual (5 Questions)

A1 Beginner

Explain the difference between a discriminative model and a generative model. Give two examples of each.

Remember
A2 Beginner

In the GAN framework, what are the roles of the Generator and the Discriminator? What happens at Nash equilibrium?

Understand
A3 Intermediate

Why does the VAE use a KL divergence term in its loss function? What would happen if you removed it?

Understand
A4 Intermediate

Explain mode collapse using a concrete example with MNIST digits. How does WGAN help mitigate it?

Understand
A5 Intermediate

Compare the inference (generation) process of GANs, VAEs, and diffusion models. Which is fastest? Which produces the highest quality? Why?

Evaluate

Section B: Mathematical (8 Questions)

B1 Intermediate

Derive the optimal discriminator D*(x) from the GAN minimax objective. Show all steps.

Analyze
B2 Intermediate

Given p_data(x) = N(3, 1) and p_g(x) = N(5, 1), compute D*(x) at x = 3, x = 4, and x = 5.

Apply
B3 Advanced

Show that substituting D* into the GAN value function gives V(G, D*) = 2·JSD(p_data ∥ p_g) − 2·log(2). Derive each step.

Analyze
B4 Intermediate

Compute the KL divergence KL(N(μ, σ²) ∥ N(0, 1)) for: (a) μ = 0, σ = 1 (b) μ = 2, σ = 0.5 (c) μ = 0, σ = 3. Interpret each result.

Apply
B5 Intermediate

In DDPM with linear schedule β₁ = 10⁻⁴, β_T = 0.02, T = 1000: (a) Compute ᾱ_1 and ᾱ_1000. (b) At what timestep t is the signal-to-noise ratio approximately 1? (c) Verify: x_t = √ᾱ_t · x₀ + √(1−ᾱ_t) · ε has variance ≈ 1 when x₀ is normalized.

Apply
B6 Advanced

Prove that the ELBO is indeed a lower bound on log P(x). Start from log P(x) = ELBO + KL(q(z|x) ∥ P(z|x)) and argue why the second term is non-negative.

Analyze
B7 Intermediate

For a β-VAE with β = 4 and a standard VAE (β = 1), given the same encoder output μ = [1, 0], log σ² = [0, −1], compute the total loss (reconstruction loss = 50 for both). Which model has a more "compressed" latent space?

Apply
B8 Advanced

In WGAN, the critic must be Lipschitz-continuous. (a) Define Lipschitz continuity. (b) Explain why weight clipping enforces it. (c) Explain how WGAN-GP's gradient penalty enforces it more elegantly. (d) What Lipschitz constant is enforced?

Analyze

Section C: Coding (4 Questions)

C1 Intermediate

Implement the VAE reparameterization trick in PyTorch. Write a function reparameterize(mu, log_var) that takes encoder outputs and returns sampled z. Include both training (with noise) and inference (deterministic) modes.

Apply
C2 Intermediate

Implement the DDPM forward diffusion process. Write a function add_noise(x0, t, noise_schedule) that takes a clean image, a timestep, and returns the noisy image + the noise that was added.

Apply
C3 Advanced

Modify the DCGAN training loop to use WGAN-GP. Replace the loss function, remove the sigmoid from the discriminator, and implement the gradient penalty. Train on MNIST and compare FID with the vanilla DCGAN.

Create
C4 Advanced

Build a latent space interpolation tool for a trained VAE. Given two MNIST images, encode both, linearly interpolate between their latent vectors (10 steps), and decode each intermediate vector. Visualize the smooth transition.

Create

Section D: Critical Thinking (3 Questions)

D1 Advanced

A startup claims their GAN can generate "never-before-seen" chemical molecules for drug discovery. Critically evaluate: (a) What does "never-before-seen" mean in the context of learning p_data? (b) How would you validate that generated molecules are chemically valid? (c) What risks exist in using generative models for drug design?

Evaluate
D2 Advanced

Meesho uses diffusion models to generate product photography. A seller uploads a photo of a saree and gets a "model wearing the saree" image. Discuss: (a) What biases could the model introduce in generated model appearances? (b) How should Meesho handle diversity/representation? (c) What happens if a generated image misrepresents the product's color or texture?

Evaluate
D3 Advanced

Compare the economic impact of generative AI on professional photographers in India vs. the US. Consider: market size, pricing power, adaptation strategies, and regulatory differences.

Evaluate

★ Starred Research Questions (2 Questions)

★1 Advanced

Consistency Models (Song et al., 2023): These models distill the multi-step diffusion process into a single-step generator. Read the paper and explain: (a) What is the self-consistency property? (b) How does the consistency function map any point on a noise trajectory to the starting point? (c) What are the implications for real-time image generation?

Create | Research
★2 Advanced

Sora (OpenAI, 2024): OpenAI's text-to-video model uses diffusion in a "spacetime latent space." Propose an architecture that extends Stable Diffusion from images to video. Address: (a) How do you handle temporal consistency? (b) What is the computational cost scaling? (c) How would you train this on Indian content (Bollywood, cricket)?

Create | Research
Section 23

Deepfakes and the Ethics of Generative AI

The Deepfake Crisis

Generative models, especially GANs and diffusion models, have created an unprecedented challenge: the ability to generate photorealistic fake content at scale. In 2023 alone:

  • 95,820 deepfake videos were detected online (a 550% increase from 2019)
  • India was the 6th most targeted country for deepfake attacks
  • Political deepfakes were used in Indian state elections (manipulated speeches of politicians)
  • Financial fraud using voice cloning resulted in ₹200+ crore losses in India

Ethical Framework for Generative AI

As engineers building these systems, you have a responsibility to consider:

  1. Consent: Does the generated content depict real people without their consent?
  2. Provenance: Can users tell if content is AI-generated? (C2PA metadata, watermarking)
  3. Harm potential: Could this content be used for fraud, harassment, or political manipulation?
  4. Bias amplification: Does the model perpetuate or amplify biases in training data?
  5. Economic displacement: How does this affect the livelihoods of artists, photographers, voice actors?

Regulatory Landscape

Region | Key Regulations
🇮🇳 India | IT Act Section 66D (deepfake fraud), MEITY advisory (2023) requiring platforms to label AI content, proposed Digital India Act
🇺🇸 USA | No federal deepfake law (as of 2024), state-level laws in California/Texas, FTC guidelines on AI-generated content
🇪🇺 EU | EU AI Act (2024): deepfakes must be labeled, high-risk generative AI requires conformity assessment

Detection Methods

Deepfake detection is itself a fascinating ML problem:

  • Facial analysis: Detect inconsistencies in eye reflections, ear shapes, teeth
  • Frequency analysis: GANs produce artifacts in the frequency domain that CNNs can detect
  • Temporal analysis: Deepfake videos have unnatural blinking patterns, head movements
  • Provenance: C2PA standard embeds cryptographic signatures at image creation
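
The frequency-analysis idea can be illustrated with a toy one-dimensional signal. This is only a sketch of the principle, not a detector: real systems look at two-dimensional image spectra and usually feed them to a learned classifier. The alternating ±0.3 pattern standing in for an upsampling artifact, the naive DFT, and the function names are all assumptions.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitudes of a naive DFT for frequencies 0..n/2."""
    n = len(signal)
    return [
        abs(sum(signal[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n)))
        for f in range(n // 2 + 1)
    ]

def high_frequency_ratio(signal, cutoff_fraction=0.5):
    """Share of non-DC spectral energy that lies above the cutoff frequency."""
    energy = [m * m for m in dft_magnitudes(signal)[1:]]  # skip the DC term
    cutoff = int(len(energy) * cutoff_fraction)
    total = sum(energy)
    return sum(energy[cutoff:]) / total if total > 0 else 0.0

n = 64
# Smooth "natural" signal: a single low-frequency sine wave.
smooth = [math.sin(2 * math.pi * 2 * k / n) for k in range(n)]
# Same signal plus a sample-to-sample alternating pattern (toy artifact).
with_artifact = [s + 0.3 * (-1) ** k for k, s in enumerate(smooth)]

ratio_smooth = high_frequency_ratio(smooth)
ratio_artifact = high_frequency_ratio(with_artifact)
print(f"high-frequency share, smooth:        {ratio_smooth:.4f}")
print(f"high-frequency share, with artifact: {ratio_artifact:.4f}")
```

The artifact is barely visible in the raw samples but shows up as a clear spike at the highest frequency, which is the kind of signature frequency-domain detectors look for.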

🇮🇳 Deepfakes in India

  • Political deepfakes during elections (state + national)
  • Celebrity face-swap scams targeting Bollywood fans
  • Voice cloning fraud: "Your son has been kidnapped" scams
  • MEITY crackdown: platforms must remove deepfakes within 24 hours
  • IIT Delhi's deepfake detection research (FaceForensics++)

🇺🇸 Deepfakes in the USA

  • Election misinformation (Biden robocall, 2024)
  • Non-consensual intimate imagery (state laws emerging)
  • Hollywood SAG-AFTRA strike partially about AI likenesses
  • Taylor Swift deepfakes prompted bipartisan legislative action
  • DARPA's MediFor program for media forensics research
Section 24

Connections

How This Chapter Connects

← Builds On
  • Chapter 12 (CNNs): DCGAN's Generator uses transposed convolutions; Discriminator uses regular convolutions
  • Chapter 16 (GANs & VAEs Intro): This chapter extends with WGAN, StyleGAN, β-VAE, and adds diffusion
  • Chapter 6 (Backpropagation): GAN training requires backprop through both D and G; reparameterization trick enables backprop through stochastic nodes
  • Probability (KL Divergence): VAE ELBO, JSD in GANs, variational bound in diffusion
→ Enables
  • Chapter 19 (Applied CV): Image generation, super-resolution, inpainting using models from this chapter
  • Chapter 22 (Ethics & Future): Deepfakes, bias in generative AI, regulatory frameworks
  • Text-to-Image systems: DALL-E, Stable Diffusion build on diffusion + CLIP from this chapter
  • Video generation: Sora, Runway extend diffusion to temporal dimension
🔬 Research Frontier
  • Consistency Models (2023): Single-step generation from diffusion (best of both worlds)
  • Flow Matching (2023-2024): Alternative to diffusion with straight-line probability paths
  • DiT (Diffusion Transformers): Replacing U-Net with Transformer backbone (used in Sora)
  • 3D Generation: DreamFusion, Magic3D (text-to-3D via score distillation)
🏭 Industry Implementation
  • Adobe Firefly: Commercially safe generative AI trained on licensed content
  • Canva: Text-to-image integrated into design workflow
  • Runway ML: Video generation and editing for creators
  • Medical imaging: Generating synthetic MRI/CT data for rare diseases
Section 25

Chapter Summary

7 Key Takeaways

  1. Generative vs Discriminative: Generative models learn P(x), enabling them to create new data. Discriminative models only learn P(y|x) for classification. Generation is fundamentally harder but more powerful.
  2. GANs frame generation as a game: Generator creates fakes, Discriminator detects them. For a fixed Generator the optimal discriminator is D*(x) = p_data/(p_data + p_g), and training against it minimizes the Jensen-Shannon Divergence between p_data and p_g. At Nash equilibrium p_g = p_data, so D* outputs 0.5 everywhere (random guessing).
  3. Mode collapse is the central GAN challenge: The Generator can learn to produce only a few "safe" outputs. Solutions include WGAN (Wasserstein distance for smoother gradients), spectral normalization, and minibatch discrimination.
  4. VAEs optimize a principled lower bound (ELBO): The loss = Reconstruction + KL divergence. The reparameterization trick (z = μ + ε·σ) enables gradient flow through stochastic nodes. β-VAE controls the quality-disentanglement trade-off.
  5. Diffusion models learn to reverse noise: The forward process adds Gaussian noise over T steps (fixed). The reverse process trains a U-Net to predict and remove noise at each step. The loss is simply MSE between true and predicted noise.
  6. Diffusion dominates in quality (2024): DDPM → DDIM → Latent Diffusion → Stable Diffusion → SDXL. The key insight of latent diffusion: run diffusion in a compressed latent space (64×64×4 instead of 512×512×3 pixels), which has 48× fewer dimensions and is correspondingly cheaper to denoise.
  7. Ethics are inseparable from capability: Deepfakes, IP theft, and bias amplification are not hypothetical; they are real harms. Engineers must build detection, watermarking, and consent systems alongside generative models.
Key Equations to Remember:

GAN: min_G max_D 𝔼_{x~p_data}[log D(x)] + 𝔼_{z~p_z}[log(1 − D(G(z)))]
D*: D*(x) = p_data(x) / (p_data(x) + p_g(x))

VAE: ℒ = −𝔼_{q(z|x)}[log P(x|z)] + KL(q(z|x) ∥ P(z))
Trick: z = μ + ε·σ, ε ~ N(0, I)

DDPM: ℒ = 𝔼[‖ε − ε_θ(√ᾱ_t · x₀ + √(1−ᾱ_t) · ε, t)‖²]

Key Intuition: All three generative paradigms share one fundamental idea: learning to transform simple distributions (Gaussian noise) into complex data distributions. GANs do it adversarially (game), VAEs do it variationally (optimization), and diffusion models do it iteratively (denoising). The math differs, but the dream is the same: teach machines to create.

Section 26

Further Reading

🇮🇳 Indian Resources

  • NPTEL: "Deep Learning" by Prof. Mitesh Khapra (IIT Madras), Lectures 35-38 on generative models
  • NPTEL: "Advanced Deep Learning" by Prof. Prabir Kumar Biswas (IIT KGP), VAE and GAN modules
  • GATE CS syllabus: Generative models under "Machine Learning" (added in GATE 2024 pattern)
  • AI4Bharat Wiki: Indian-language generative AI resources and datasets

🌍 Global Resources

  • Papers:
    • Goodfellow et al., "Generative Adversarial Nets" (NeurIPS 2014), arxiv.org/abs/1406.2661
    • Kingma & Welling, "Auto-Encoding Variational Bayes" (ICLR 2014), arxiv.org/abs/1312.6114
    • Ho et al., "Denoising Diffusion Probabilistic Models" (NeurIPS 2020), arxiv.org/abs/2006.11239
    • Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (CVPR 2022), arxiv.org/abs/2112.10752
    • Arjovsky et al., "Wasserstein GAN" (ICML 2017), arxiv.org/abs/1701.07875
    • Karras et al., "A Style-Based Generator Architecture for GANs" (CVPR 2019), the StyleGAN paper
  • Visual Explainers:
    • Lil'Log: "What are Diffusion Models?", lilianweng.github.io
    • 3Blue1Brown: "But what is a neural network?" (foundation for all chapters)
    • Jay Alammar: "The Illustrated Stable Diffusion", jalammar.github.io
  • Books:
    • Goodfellow et al., "Deep Learning" (MIT Press), Chapter 20: Deep Generative Models
    • Prince, "Understanding Deep Learning" (MIT Press, 2023), Chapters 14-18
    • Foster, "Generative Deep Learning" (O'Reilly, 2nd ed.), hands-on with TensorFlow/Keras
  • Code: