Chapter 22: Autoencoders & Variational Inference

🎯 Learning Objectives

By the end of this chapter, you will be able to:

1. Explain the encoder→bottleneck→decoder architecture and the role of the latent space
2. Distinguish undercomplete from overcomplete autoencoders and when each is appropriate
3. Implement MSE and BCE reconstruction losses, and add sparsity penalties (L1, KL divergence)
4. Build denoising autoencoders that remove noise from images and signals
5. Derive the Evidence Lower Bound (ELBO) from first principles and understand the reparameterization trick
6. Implement Variational Autoencoders (VAE) and visualize latent space interpolation
7. Explain β-VAE and its role in disentangled representation learning
8. Connect autoencoders to modern diffusion models (DDPM, Stable Diffusion)
9. Apply autoencoders for anomaly detection, image denoising, and data compression
10. Design and train complete AE/VAE pipelines in Python, TensorFlow, and Scikit-Learn

📘 Introduction

Imagine you need to describe the Taj Mahal to someone who has never seen it. You wouldn't describe every brick and every grain of marble — you'd compress the essential features: "a white marble mausoleum with a central dome, four minarets, and a reflecting pool." This is encoding. When your listener imagines the building from your description, that's decoding. The compressed description is the latent representation.

This is exactly what an autoencoder does with data. It learns to compress inputs into a compact representation and then reconstruct them. This seemingly simple idea — learning to copy its input through a bottleneck — unlocks an astonishing range of applications: denoising images, detecting anomalies, compressing data, and even generating entirely new data.

In this chapter, we'll journey from the simplest autoencoder to the probabilistic elegance of Variational Autoencoders (VAEs), which introduced the machinery of variational inference to deep learning. We'll derive the famous Evidence Lower Bound (ELBO) from scratch, understand the reparameterization trick that makes VAEs trainable, and explore how these ideas connect to the latest revolution in AI: diffusion models that power Stable Diffusion and DALL-E.

Whether you're a Class 11 student curious about how AI creates images, or a PhD researcher exploring variational inference, this chapter is structured to take you from intuition to rigorous mathematical derivation to working code.

📜 Historical Background

The Origins (1980s–1990s)

The autoencoder concept traces back to the 1980s. David Rumelhart, Geoffrey Hinton, and Ronald Williams (1986) popularized backpropagation and showed that neural networks could learn internal representations by being trained to reproduce their input. Hinton and the PDP group demonstrated that a network forced through a narrow hidden layer would discover compact codes; for linear networks, this amounts to rediscovering PCA.

In 1989, Mark Kramer formalized nonlinear PCA through autoencoders, showing that neural networks could learn nonlinear manifolds that traditional PCA could not capture.

The Deep Learning Revival (2006–2012)

Geoffrey Hinton and Ruslan Salakhutdinov (2006) published a landmark Science paper showing that deep autoencoders — networks with many hidden layers — could dramatically outperform PCA for dimensionality reduction, provided they were pre-trained layer by layer using Restricted Boltzmann Machines. This paper was a key catalyst of the deep learning revolution.

Pascal Vincent et al. (2008) introduced Denoising Autoencoders (DAE), showing that training an autoencoder to reconstruct clean data from corrupted inputs learned much more robust features. Andrew Ng's group (2011) popularized Sparse Autoencoders with explicit sparsity penalties.

The Variational Revolution (2013–Present)

Diederik Kingma and Max Welling (2013) introduced the Variational Autoencoder (VAE) in their paper "Auto-Encoding Variational Bayes" — arguably one of the most influential papers in modern machine learning. Simultaneously, Danilo Rezende, Shakir Mohamed, and Daan Wierstra proposed a similar framework. The VAE married deep learning with Bayesian inference, creating a principled generative model.

Higgins et al. (2017) introduced β-VAE, showing that a simple modification to the VAE objective could encourage disentangled representations. This opened a rich line of research in representation learning.

The legacy of autoencoders extends directly to diffusion models (2020–present), where Stable Diffusion uses a VAE to compress images to a latent space before applying the diffusion process — a connection we'll explore in Section 10.

Year	Milestone	Researchers
1986	Backprop & internal representations	Rumelhart, Hinton, Williams
1989	Nonlinear PCA via autoencoders	Kramer
2006	Deep autoencoders for dimensionality reduction	Hinton & Salakhutdinov
2008	Denoising Autoencoders	Vincent et al.
2011	Sparse Autoencoders at scale	Ng et al.
2013	Variational Autoencoder (VAE)	Kingma & Welling
2017	β-VAE for disentanglement	Higgins et al.
2020	Denoising Diffusion (DDPM)	Ho et al.
2022	Stable Diffusion (uses VAE)	Rombach et al. / Stability AI

💡 Conceptual Explanation

4.1 What is an Autoencoder?

An autoencoder is a neural network trained to copy its input to its output — but with a twist. Between the input and output, the data must pass through a bottleneck (a layer with fewer neurons than the input). This forces the network to learn a compressed representation.

The architecture has three parts:

Encoder f(x): Maps input x to latent code z = f(x)
Bottleneck / Latent Space: The compressed representation z
Decoder g(z): Reconstructs the input: x̂ = g(z) = g(f(x))

The network is trained to minimize the reconstruction error: how different is x̂ from x? If the autoencoder can reconstruct well despite the bottleneck, it has learned the essential structure of the data.

4.2 Undercomplete vs. Overcomplete Autoencoders

Property	Undercomplete	Overcomplete
Bottleneck Size	dim(z) < dim(x)	dim(z) ≥ dim(x)
What it Learns	Compression by necessity	Identity function (without regularization)
Regularization Needed?	Not strictly	Yes — sparsity, denoising, etc.
Example Use	Dimensionality reduction	Sparse feature extraction
Analogy	Summarize a book in 100 words	Write a book report longer than the book, but only highlight key themes

4.3 Types of Autoencoders

Denoising Autoencoder (DAE)

Instead of inputting clean data x, we corrupt it with noise: x̃ = x + ε. The network must reconstruct the original clean x from noisy x̃. This forces learning robust, meaningful features rather than trivial identity mappings.
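
The corruption step can be sketched in a few lines of NumPy. This is a minimal illustration, not a prescribed recipe: the function names, the noise level 0.3, and the drop probability 0.25 are assumptions chosen for the example, and the masking variant is a common alternative to the additive noise described above.

```python
import numpy as np

def corrupt_gaussian(x, noise_std=0.3, rng=None):
    """Additive Gaussian corruption: x_tilde = x + eps, clipped back to [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = x + noise_std * rng.standard_normal(x.shape)
    return np.clip(noisy, 0.0, 1.0)

def corrupt_masking(x, drop_prob=0.25, rng=None):
    """Masking corruption: zero each entry independently with probability drop_prob."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x.shape) >= drop_prob
    return x * keep

rng = np.random.default_rng(0)
x_clean = rng.random((4, 6))                  # toy batch with values in [0, 1)
x_noisy = corrupt_gaussian(x_clean, rng=rng)
x_masked = corrupt_masking(x_clean, rng=rng)
# A denoising autoencoder takes x_noisy (or x_masked) as input,
# while the loss compares its output against x_clean.
```

The key design point is that only the input is corrupted; the reconstruction target stays clean.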

Sparse Autoencoder

We add a sparsity constraint: most neurons in the hidden layer should be inactive (close to 0) for any given input. This is achieved by adding an L1 penalty on activations or a KL divergence penalty that pushes the average activation toward a small target value ρ (e.g., 0.05).

Contractive Autoencoder

Adds a penalty on the Frobenius norm of the Jacobian of the encoder, forcing the learned representation to be insensitive to small input perturbations.
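
For a single sigmoid encoder layer h = sigmoid(Wx + b), the Jacobian has a simple closed form, so the penalty is cheap to compute. The sketch below assumes exactly that one-layer encoder (the shapes, names, and random values are illustrative) and checks the closed form against a finite-difference Jacobian.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx for h = sigmoid(W @ x + b)."""
    h = sigmoid(W @ x + b)
    dh = h * (1.0 - h)                  # sigmoid derivative at each hidden unit
    # Row j of the Jacobian is dh[j] * W[j, :], so the squared norm factorizes.
    return np.sum(dh**2 * np.sum(W**2, axis=1))

def numerical_jacobian(x, W, b, step=1e-6):
    """Central-difference Jacobian, used only to sanity-check the closed form."""
    k, d = W.shape
    J = np.zeros((k, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = step
        J[:, i] = (sigmoid(W @ (x + e) + b) - sigmoid(W @ (x - e) + b)) / (2 * step)
    return J

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))         # 5 inputs, 3 hidden units
b = rng.standard_normal(3)
x = rng.standard_normal(5)

penalty = contractive_penalty(x, W, b)
penalty_check = np.sum(numerical_jacobian(x, W, b) ** 2)
```

In training, this penalty would be averaged over the batch and added to the reconstruction loss with a weighting coefficient.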

Variational Autoencoder (VAE)

A probabilistic generative model that learns a distribution over the latent space rather than a deterministic mapping. This enables generation of new samples by sampling from the latent distribution.

4.4 Reconstruction Losses

Mean Squared Error (MSE)

Used when inputs are continuous (e.g., normalized pixel values in [0, 1]): L = (1/d) Σᵢ (xᵢ - x̂ᵢ)², where d is the input dimension. This treats reconstruction as a regression problem.

Binary Cross-Entropy (BCE)

Used when inputs are binary or can be interpreted as probabilities: L = -(1/d) Σᵢ [xᵢ log(x̂ᵢ) + (1-xᵢ) log(1-x̂ᵢ)]. It is the natural choice when the decoder's output layer uses a sigmoid activation, which keeps every x̂ᵢ in (0, 1).

4.5 The Latent Space

The latent space is where the magic happens. A well-trained autoencoder organizes similar data points near each other in latent space. For a VAE, the latent space is continuous and smooth, meaning:

Interpolation: Moving smoothly between two points in latent space produces semantically meaningful transitions (e.g., one face morphing into another)
Sampling: Points drawn from the prior N(0, I) decode into plausible data
Disentanglement: Ideally, different dimensions capture different independent factors of variation; a standard VAE does not guarantee this, which is what motivates β-VAE
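
Interpolation itself is just a straight line between two latent codes. The sketch below uses made-up 2D codes (`z_a` and `z_b` are hypothetical stand-ins for encoder outputs); in practice each row of the result would be passed through a trained decoder.

```python
import numpy as np

def interpolate_latent(z_start, z_end, num_steps=5):
    """Return num_steps points on the straight line from z_start to z_end."""
    weights = np.linspace(0.0, 1.0, num_steps)[:, None]   # shape (num_steps, 1)
    return (1.0 - weights) * z_start + weights * z_end

z_a = np.array([-1.0, 0.5])   # hypothetical latent code of one input
z_b = np.array([1.0, -0.5])   # hypothetical latent code of another input
path = interpolate_latent(z_a, z_b, num_steps=5)
# Each row of `path` would be decoded to render one frame of the transition.
```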

📐 Mathematical Foundation

5.1 Autoencoder Objective

Let x ∈ ℝᵈ be an input vector. The autoencoder consists of:

Encoder & Decoder

Encoder: z = f_φ(x) = σ(Wx + b) where z ∈ ℝᵏ, k < d
Decoder: x̂ = g_θ(z) = σ(W'z + b') where x̂ ∈ ℝᵈ

Objective: min_{θ,φ} L(x, g_θ(f_φ(x)))

5.2 Reconstruction Losses

Mean Squared Error

L_MSE(x, x̂) = (1/d) Σᵢ₌₁ᵈ (xᵢ - x̂ᵢ)²

Binary Cross-Entropy

L_BCE(x, x̂) = -(1/d) Σᵢ₌₁ᵈ [xᵢ log(x̂ᵢ) + (1 - xᵢ) log(1 - x̂ᵢ)]
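
Both losses translate directly into NumPy. The sketch below is a minimal version that averages over the d components of a single input; the clipping constant `eps` is an assumption added to avoid log(0). The vectors are the ones used in Example 1 of the worked examples, so the results can be compared with the hand calculation there.

```python
import numpy as np

def mse_loss(x, x_hat):
    """Mean squared error averaged over the d components."""
    return np.mean((x - x_hat) ** 2)

def bce_loss(x, x_hat, eps=1e-12):
    """Binary cross-entropy averaged over the d components (natural log)."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)   # guard against log(0)
    return -np.mean(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

x = np.array([0.8, 0.3, 0.9, 0.1])
x_hat = np.array([0.75, 0.35, 0.85, 0.15])
mse_value = mse_loss(x, x_hat)    # about 0.0025
bce_value = bce_loss(x, x_hat)    # about 0.449
```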

5.3 Sparse Autoencoder Penalty

Let ρ̂ⱼ = (1/m) Σᵢ aⱼ(xᵢ) be the average activation of hidden unit j over the training set, and ρ be the target sparsity (e.g., 0.05).

KL Divergence Sparsity

Ω_sparse = Σⱼ KL(ρ ‖ ρ̂ⱼ) = Σⱼ [ρ log(ρ/ρ̂ⱼ) + (1-ρ) log((1-ρ)/(1-ρ̂ⱼ))]

Total Loss = L_reconstruction + β · Ω_sparse
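
As a rough sketch, the KL sparsity penalty can be computed from a matrix of hidden activations as follows. It assumes the activations lie in (0, 1), for example sigmoid outputs; the function names and the clipping constant are illustrative. An L1 penalty is included for comparison. The toy input reproduces the setting of Example 3 (ρ = 0.05, ρ̂ = 0.3).

```python
import numpy as np

def kl_sparsity_penalty(activations, rho=0.05, eps=1e-8):
    """Sum over hidden units of KL(rho || rho_hat_j).

    activations: shape (num_samples, num_hidden), values in (0, 1).
    """
    rho_hat = np.clip(activations.mean(axis=0), eps, 1.0 - eps)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return kl.sum()

def l1_sparsity_penalty(activations):
    """Mean absolute activation of each hidden unit, summed over units."""
    return np.abs(activations).mean(axis=0).sum()

# One hidden unit whose average activation is 0.3, far above the target 0.05.
acts = np.full((10, 1), 0.3)
penalty_kl = kl_sparsity_penalty(acts, rho=0.05)   # about 0.2005
penalty_l1 = l1_sparsity_penalty(acts)             # 0.3
```

Either penalty would then be multiplied by the weight β and added to the reconstruction loss.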

5.4 VAE: Probabilistic Framework

The VAE treats the autoencoder as a probabilistic graphical model. We assume:

There exists a latent variable z drawn from a prior p(z) = N(0, I)
The data x is generated from z via a likelihood p_θ(x|z) (the decoder)
We want to infer the posterior p_θ(z|x) = p_θ(x|z) p(z) / p_θ(x), which is intractable because the evidence p_θ(x) requires integrating over all z
So we approximate it with q_φ(z|x) = N(μ_φ(x), σ²_φ(x)·I) (the encoder)

The Generative Story

For each data point x:
1. Sample z ~ p(z) = N(0, I) ← Prior
2. Generate x ~ p_θ(x|z) ← Decoder / Likelihood

Goal: maximize p_θ(x) = ∫ p_θ(x|z) p(z) dz ← Marginal likelihood (intractable!)
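
To get a feel for the marginal likelihood integral, it helps to look at a toy model where the answer is known. The sketch below assumes a 1D latent, a "decoder" whose mean is simply z, and a fixed noise variance of 0.5 (all illustrative choices, not part of the VAE above). In that case p(x) = N(0, 1 + 0.5) exactly, and a naive Monte Carlo estimate using samples from the prior can be compared against it.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def mc_log_marginal(x, num_samples, noise_var=0.5, rng=None):
    """Estimate log p(x) = log E_{z ~ p(z)}[p(x|z)] with samples from the prior."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(num_samples)          # z ~ N(0, 1)
    log_lik = log_normal_pdf(x, z, noise_var)     # log p(x|z), toy decoder mean = z
    m = log_lik.max()                             # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(log_lik - m)))

x_obs = 1.2
exact = log_normal_pdf(x_obs, 0.0, 1.0 + 0.5)     # closed form for this toy model
estimate = mc_log_marginal(x_obs, 200_000, rng=np.random.default_rng(0))
```

With a neural-network decoder and a high-dimensional z, most prior samples explain x very poorly, so this naive estimator needs far too many samples to be useful. That is the practical face of the intractability noted above.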

5.5 The Reparameterization Trick

We can't backpropagate through a stochastic sampling operation z ~ q_φ(z|x). The trick: express z as a deterministic function of φ and a noise variable ε:

Reparameterization

z = μ_φ(x) + σ_φ(x) ⊙ ε, where ε ~ N(0, I)

Now gradients flow through μ and σ — ε is just random noise, not a function of parameters!
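
A small numerical experiment makes this concrete. The sketch below uses a toy objective f(z) = z² (an assumption chosen because E[z²] = μ² + σ² has known gradients 2μ and 2σ) and estimates both gradients by differentiating through z = μ + σ·ε.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5
eps = rng.standard_normal(200_000)   # noise, independent of mu and sigma

z = mu + sigma * eps                 # reparameterized samples of N(mu, sigma^2)

# Chain rule: df/dmu = f'(z) * dz/dmu and df/dsigma = f'(z) * dz/dsigma
df_dz = 2.0 * z                      # derivative of the toy objective z^2
grad_mu = np.mean(df_dz * 1.0)       # dz/dmu = 1
grad_sigma = np.mean(df_dz * eps)    # dz/dsigma = eps

# Analytic values for comparison: 2 * mu = 4.0 and 2 * sigma = 1.0
```

The estimates land close to the analytic gradients, which is exactly what an autodiff framework exploits when it backpropagates through the sampling layer.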

5.6 β-VAE

β-VAE Objective

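L_β-VAE = E_{q_φ(z|x)}[log p_θ(x|z)] - β · KL(q_φ(z|x) ‖ p(z))

This objective is maximized; the training loss is its negative.

β = 1: Standard VAE
β > 1: Stronger pull toward the prior, which tends to encourage disentanglement (each z dimension capturing an independent factor), usually at some cost in reconstruction quality
β < 1: Better reconstruction, less disentangled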
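
In code, β is just a weight on the KL term. The sketch below assumes a Bernoulli decoder (BCE reconstruction) and a diagonal Gaussian encoder, and returns the negative objective so that lower is better. The random arrays are stand-ins for real encoder and decoder outputs.

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0, eps=1e-8):
    """Negative beta-VAE objective for one batch (lower is better).

    Sums run over features / latent dimensions; the mean runs over the batch.
    """
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    recon = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat), axis=1)
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1)
    return np.mean(recon + beta * kl), np.mean(recon), np.mean(kl)

rng = np.random.default_rng(0)
x = (rng.random((8, 5)) > 0.5).astype(float)   # toy binary batch
x_hat = rng.uniform(0.05, 0.95, size=(8, 5))   # stand-in decoder outputs
mu = rng.standard_normal((8, 2))               # stand-in encoder means
log_var = 0.1 * rng.standard_normal((8, 2))    # stand-in encoder log-variances

loss_b1, recon, kl = beta_vae_loss(x, x_hat, mu, log_var, beta=1.0)
loss_b4, _, _ = beta_vae_loss(x, x_hat, mu, log_var, beta=4.0)
# Raising beta from 1 to 4 adds exactly 3 * kl to the loss.
```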

🔬 Formula Derivations

6.1 Deriving the ELBO (Evidence Lower Bound)

This is one of the most important derivations in modern machine learning. We start from the goal of maximizing the log-likelihood log p_θ(x).

Step 1: Start with Log-Marginal Likelihood

log p_θ(x) = log ∫ p_θ(x, z) dz = log ∫ p_θ(x|z) p(z) dz

This integral is intractable for complex p_θ(x|z). We introduce an approximate posterior q_φ(z|x):

Step 2: Multiply and Divide by q_φ(z|x)

log p_θ(x) = log ∫ p_θ(x, z) · [q_φ(z|x) / q_φ(z|x)] dz
= log E_{q_φ(z|x)} [p_θ(x, z) / q_φ(z|x)]

Step 3: Apply Jensen's Inequality (log E[·] ≥ E[log ·])

log p_θ(x) ≥ E_{q_φ(z|x)} [log (p_θ(x, z) / q_φ(z|x))]

This lower bound is the ELBO!
ELBO = E_{q_φ(z|x)} [log p_θ(x, z) - log q_φ(z|x)]

Step 4: Decompose the ELBO

ELBO = E_{q_φ(z|x)} [log p_θ(x, z)] - E_{q_φ(z|x)} [log q_φ(z|x)]
= E_{q_φ(z|x)} [log p_θ(x|z) + log p(z)] - E_{q_φ(z|x)} [log q_φ(z|x)]
= E_{q_φ(z|x)} [log p_θ(x|z)] + E_{q_φ(z|x)} [log p(z) - log q_φ(z|x)]
= E_{q_φ(z|x)} [log p_θ(x|z)] - KL(q_φ(z|x) ‖ p(z))

ELBO = Reconstruction Term - KL Divergence Regularizer

6.2 Deriving the Exact Gap: log p(x) = ELBO + KL(q‖p)

Alternative Derivation (No Jensen's Inequality Needed)

log p_θ(x) = E_{q_φ(z|x)} [log p_θ(x)] ← p_θ(x) doesn't depend on z
= E_{q_φ(z|x)} [log (p_θ(x,z)/p_θ(z|x))] ← Bayes: p(x) = p(x,z)/p(z|x)
= E_{q_φ(z|x)} [log (p_θ(x,z)/q_φ(z|x) · q_φ(z|x)/p_θ(z|x))]
= E_{q_φ(z|x)} [log (p_θ(x,z)/q_φ(z|x))] + E_{q_φ(z|x)} [log (q_φ(z|x)/p_θ(z|x))]
= ELBO + KL(q_φ(z|x) ‖ p_θ(z|x))

Since KL ≥ 0, we get: log p_θ(x) ≥ ELBO ✓

This alternative derivation reveals the beautiful identity: the gap between the true log-likelihood and the ELBO is exactly the KL divergence between our approximate posterior and the true (intractable) posterior. As q approaches p(z|x), the ELBO becomes tight.
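
The identity can be verified numerically in a toy model where everything is available in closed form. The sketch below assumes p(z) = N(0, 1) and p(x|z) = N(z, 1), which gives p(x) = N(0, 2) and true posterior p(z|x) = N(x/2, 1/2). The approximate posterior q = N(m, s2) is an arbitrary choice for illustration.

```python
import numpy as np

x = 1.0                  # observed data point
m, s2 = 0.3, 0.8         # mean and variance of an arbitrary q(z|x) = N(m, s2)

# Exact log evidence: p(x) = N(0, 2)
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)

# ELBO = E_q[log p(x|z)] - KL(q || p(z)), both in closed form here
expected_log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
kl_to_prior = 0.5 * (m**2 + s2 - np.log(s2) - 1.0)
elbo = expected_log_lik - kl_to_prior

# KL(q || p(z|x)) between two univariate Gaussians
post_mean, post_var = x / 2.0, 0.5
kl_to_posterior = 0.5 * (
    np.log(post_var / s2) + (s2 + (m - post_mean) ** 2) / post_var - 1.0
)

gap = log_px - elbo      # should equal kl_to_posterior
```

Setting m = x/2 and s2 = 1/2 makes q equal to the true posterior, and the gap shrinks to zero.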

6.3 KL Divergence for Gaussians (Closed Form)

For the VAE, both q_φ(z|x) and p(z) are Gaussian. The KL divergence has a beautiful closed form:

KL(N(μ, σ²) ‖ N(0, 1)) — Scalar Case

KL = -½ (1 + log σ² - μ² - σ²)

For J-dimensional latent space:
KL = -½ Σⱼ₌₁ᴶ (1 + log σ²ⱼ - μ²ⱼ - σ²ⱼ)

Derivation of the Gaussian KL:

Full Derivation

KL(q‖p) = ∫ q(z) log(q(z)/p(z)) dz
= E_q[log q(z)] - E_q[log p(z)]

For q = N(μ, σ²): E_q[log q(z)] = -½ log(2πσ²) - ½
For p = N(0, 1): E_q[log p(z)] = -½ log(2π) - ½(μ² + σ²)

KL = -½ log(2πσ²) - ½ - (-½ log(2π) - ½(μ² + σ²))
= -½ log σ² - ½ + ½μ² + ½σ²
= -½ (1 + log σ² - μ² - σ²) ✓
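
The closed form is a one-liner in NumPy. The sketch below evaluates it on the encoder outputs used in Example 2 and also checks the scalar case against a Monte Carlo estimate of E_q[log q(z) - log p(z)]; the sample size and seed are arbitrary choices.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu, log_var = np.asarray(mu), np.asarray(log_var)
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

kl_example = gaussian_kl([0.5, -0.3], [-0.2, 0.1])   # values from Example 2
kl_zero = gaussian_kl([0.0, 0.0], [0.0, 0.0])        # q equals the prior

# Monte Carlo check of the scalar case
rng = np.random.default_rng(0)
mu1, var1 = 0.5, np.exp(-0.2)
z = mu1 + np.sqrt(var1) * rng.standard_normal(400_000)
log_q = -0.5 * (np.log(2 * np.pi * var1) + (z - mu1) ** 2 / var1)
log_p = -0.5 * (np.log(2 * np.pi) + z**2)
kl_mc = np.mean(log_q - log_p)
kl_closed = gaussian_kl([mu1], [-0.2])
```

Working with log σ² rather than σ keeps the variance positive without any constraint on the encoder output, which is why most implementations predict the log-variance.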

🔢 Worked Numerical Examples

Example 1: Reconstruction Loss Calculation

Problem

An autoencoder receives input x = [0.8, 0.3, 0.9, 0.1] and produces reconstruction x̂ = [0.75, 0.35, 0.85, 0.15]. Calculate MSE and BCE losses.

MSE Calculation

MSE = (1/4) [(0.8-0.75)² + (0.3-0.35)² + (0.9-0.85)² + (0.1-0.15)²]
= (1/4) [0.0025 + 0.0025 + 0.0025 + 0.0025]
= (1/4) × 0.01 = 0.0025

BCE Calculation

BCE = -(1/4) Σ [xᵢ log(x̂ᵢ) + (1-xᵢ) log(1-x̂ᵢ)]

Term 1: 0.8·log(0.75) + 0.2·log(0.25) = 0.8·(-0.2877) + 0.2·(-1.3863) = -0.2301 + (-0.2773) = -0.5074
Term 2: 0.3·log(0.35) + 0.7·log(0.65) = 0.3·(-1.0498) + 0.7·(-0.4308) = -0.3149 + (-0.3016) = -0.6165
Term 3: 0.9·log(0.85) + 0.1·log(0.15) = 0.9·(-0.1625) + 0.1·(-1.8971) = -0.1463 + (-0.1897) = -0.3360
Term 4: 0.1·log(0.15) + 0.9·log(0.85) = 0.1·(-1.8971) + 0.9·(-0.1625) = -0.1897 + (-0.1463) = -0.3360

BCE = -(1/4) × (-0.5074 + (-0.6165) + (-0.3360) + (-0.3360))
= -(1/4) × (-1.7959) = 0.4490

Example 2: KL Divergence Calculation

Problem

A VAE encoder produces μ = [0.5, -0.3] and log(σ²) = [-0.2, 0.1] for a data point. Calculate KL(q‖p).

Solution

KL = -½ Σⱼ (1 + log σ²ⱼ - μ²ⱼ - σ²ⱼ)

Given log σ² = [-0.2, 0.1], so σ² = [e^(-0.2), e^(0.1)] = [0.8187, 1.1052]

Dim 1: 1 + (-0.2) - (0.5)² - 0.8187 = 1 - 0.2 - 0.25 - 0.8187 = -0.2687
Dim 2: 1 + (0.1) - (-0.3)² - 1.1052 = 1 + 0.1 - 0.09 - 1.1052 = -0.0952

KL = -½ × (-0.2687 + (-0.0952)) = -½ × (-0.3639) = 0.1820

Example 3: Sparsity Penalty

Problem

Target sparsity ρ = 0.05. After a batch, a hidden unit has average activation ρ̂ = 0.3. Compute KL(ρ ‖ ρ̂).

Solution

KL(ρ ‖ ρ̂) = ρ·log(ρ/ρ̂) + (1-ρ)·log((1-ρ)/(1-ρ̂))
= 0.05·log(0.05/0.3) + 0.95·log(0.95/0.7)
= 0.05·log(0.1667) + 0.95·log(1.3571)
= 0.05·(-1.7918) + 0.95·(0.3054)
= -0.0896 + 0.2901 = 0.2005

The penalty is 0.2005 — quite high! This pushes the unit to be less active.

Example 4: Reparameterization Trick

Problem

Encoder outputs μ = 2.0, σ = 0.5. Noise sample ε = 1.3. Find z and compute ∂z/∂μ and ∂z/∂σ.

Solution

z = μ + σ · ε = 2.0 + 0.5 × 1.3 = 2.0 + 0.65 = 2.65

∂z/∂μ = 1 (gradient flows directly!)
∂z/∂σ = ε = 1.3 (gradient is scaled by the sampled ε, which is treated as a constant)

Without reparameterization, z ~ N(2.0, 0.25) would be a sampling operation
with no gradient — training would be impossible with standard backprop!

📊 Visual Diagrams

Diagram 1: Standard Autoencoder Architecture

 INPUT (784)       ENCODER          BOTTLENECK (32)        DECODER          OUTPUT (784)
┌─────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────┐
│  x₁     │    │  Hidden 512  │    │              │    │  Hidden 512  │    │  x̂₁     │
│  x₂     │───▶│  ReLU        │───▶│  z₁ z₂ ...   │───▶│  ReLU        │───▶│  x̂₂     │
│  ...    │    │              │    │  Latent      │    │              │    │  ...    │
│  x₇₈₄   │    │  Hidden 256  │    │  Code        │    │  Hidden 256  │    │  x̂₇₈₄   │
└─────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └─────────┘
               ◄── Compression ──▶ ◄── Bottleneck ──▶  ◄── Expansion ──▶

Loss = ‖x - x̂‖² (MSE) or BCE(x, x̂)

Diagram 2: Variational Autoencoder (VAE)

                                            ε ~ N(0, I)
                                                 │
┌─────────┐    ┌──────────┐    ┌──────┐    ┌─────▼─────┐    ┌──────────┐    ┌─────────┐
│         │    │          │    │  μ   │    │           │    │          │    │         │
│  Input  │───▶│ Encoder  │───▶│      │───▶│z = μ + σ·ε│───▶│ Decoder  │───▶│ Output  │
│    x    │    │ f_φ(x)   │    │ logσ²│    │           │    │ g_θ(z)   │    │   x̂     │
│         │    │          │    │      │    │ Reparam.  │    │          │    │         │
└─────────┘    └──────────┘    └──────┘    └───────────┘    └──────────┘    └─────────┘

Loss = -E[log p(x|z)] + β·KL(q(z|x) ‖ p(z))
       ◄─ Reconstruct ─▶  ◄─── Regularize ───▶

Diagram 3: Undercomplete vs Overcomplete

UNDERCOMPLETE                                  OVERCOMPLETE
(Bottleneck narrower than input)               (Bottleneck wider than input)

████████       Input (100)                     ████████       Input (100)
██████         Hidden (64)                     ██████████     Hidden (200)
████           Bottleneck (32) ← Compressed!   ████████████   Bottleneck (300) ← Wider!
██████         Hidden (64)                     ██████████     Hidden (200)
████████       Output (100)                    ████████       Output (100)

✓ Forces compression                           ✗ Can learn identity
✓ No extra regularization needed               ✓ Needs: sparsity / denoising

Diagram 4: Denoising Autoencoder

Clean Input x      Corrupt         Noisy x̃         Encode → Decode        Compare with
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐    ┌──────────┐
│  ██  ██  │    │   Add    │    │  ██░░██  │    │                  │    │  ██  ██  │
│  ██████  │───▶│  Noise   │───▶│ ░█████░  │───▶│   Autoencoder    │───▶│  ██████  │
│  ██  ██  │    │ ε~N(0,σ) │    │  ██░░██  │    │  Reconstruction  │    │  ██  ██  │
└──────────┘    └──────────┘    └──────────┘    └──────────────────┘    └──────────┘
      │                                                                       │
      └───────────────────────── Loss = ‖x - x̂‖² ─────────────────────────────┘
                            (Compare with CLEAN input!)

Diagram 5: VAE Latent Space — Smooth Interpolation

Latent Space (2D for visualization)
─────────────────────────────────────
│ ●₀  ·   ·   ·   ●₁  ·   ·   ·   ●₂ │   Digits cluster
│ ·   ●₃  ·   ·   ·   ●₄  ·   ●₅  ·  │   smoothly in
│ ·   ·   ●₆  ·   ·   ·   ·   ●₇  ·  │   latent space
│ ·   ·   ·   ●₈  ·   ·   ●₉  ·   ·  │
│ ·   ·   ·   ·   ·   ·   ·   ·   ·  │
─────────────────────────────────────

Interpolation from "3" to "8":

[●₃] ─────── [●] ─────── [●] ─────── [●] ─────── [●₈]
  3       3/8 blend     ~5ish     5/8 blend       8

Each intermediate point decodes to a smooth transition!

🔄 Flowcharts

Flowchart 1: Choosing the Right Autoencoder

START: What is your goal?
│
├──▶ Dimensionality Reduction?
│    └──▶ Use UNDERCOMPLETE AE (or PCA for linear)
│
├──▶ Denoising?
│    └──▶ Use DENOISING AE (add noise to inputs, reconstruct clean)
│
├──▶ Feature Selection / Sparse Features?
│    └──▶ Use SPARSE AE (L1 penalty or KL sparsity)
│
├──▶ Anomaly Detection?
│    ├──▶ Simple: Use UNDERCOMPLETE AE → high reconstruction error = anomaly
│    └──▶ Advanced: Use VAE → low p(x) in latent space = anomaly
│
├──▶ Generate New Data?
│    ├──▶ Simple generation: Use VAE (sample z ~ N(0,I), decode)
│    ├──▶ High-quality generation: Use VAE + adversarial training
│    └──▶ State-of-art: Use Diffusion Models (DDPM / Stable Diffusion)
│
└──▶ Disentangled Representations?
     └──▶ Use β-VAE (β > 1)

Flowchart 2: VAE Training Pipeline

┌──────────────────────────────────────────────────────────┐
│                    VAE Training Loop                     │
└──────────────────────────────────────────────────────────┘
                             │
                   ┌─────────▼──────────┐
                   │ Sample batch x     │
                   │ from dataset       │
                   └─────────┬──────────┘
                             │
                   ┌─────────▼──────────┐
                   │ Encode: compute    │
                   │ μ = f_μ(x)         │
                   │ log σ² = f_σ(x)    │
                   └─────────┬──────────┘
                             │
                   ┌─────────▼──────────┐
                   │ Sample ε ~ N(0,I)  │
                   │ z = μ + σ · ε      │
                   │ (Reparameterize)   │
                   └─────────┬──────────┘
                             │
                   ┌─────────▼──────────┐
                   │ Decode: x̂ = g(z)   │
                   └─────────┬──────────┘
                             │
                   ┌─────────▼──────────┐
                   │ Compute Loss:      │
                   │ L = -ELBO          │
                   │   = Recon + β·KL   │
                   └─────────┬──────────┘
                             │
                   ┌─────────▼──────────┐
                   │ Backprop & Update  │
                   │ θ, φ via Adam      │
                   └─────────┬──────────┘
                             │
                   ┌─────────▼──────────┐
                   │ Repeat until       │
                   │ convergence        │
                   └────────────────────┘

Flowchart 3: Anomaly Detection with Autoencoders

┌───────────┐     ┌───────────────┐     ┌───────────────┐
│ Train AE  │     │ New data x    │     │ Compute       │
│ on normal │────▶│ arrives       │────▶│ recon error   │
│ data only │     │               │     │ e = ‖x - x̂‖²  │
└───────────┘     └───────────────┘     └───────┬───────┘
                                                │
                                       ┌────────▼────────┐
                                       │ e > threshold?  │
                                       └────────┬────────┘
                                                │
                                    Yes ┌───────┴───────┐ No
                                        │               │
                                  ┌─────▼──────┐   ┌────▼─────┐
                                  │ 🚨 ANOMALY │   │ ✅ Normal │
                                  │ Flag for   │   │ Pass     │
                                  │ review     │   │          │
                                  └────────────┘   └──────────┘
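
The thresholding logic in this flowchart can be sketched independently of any particular network. In the example below, a rank-1 PCA projection stands in for a trained autoencoder (an assumption made purely to keep the example self-contained), and the threshold is set at the 99th percentile of errors on normal data, which is one common heuristic rather than a fixed rule.

```python
import numpy as np

def reconstruction_errors(x, reconstruct):
    """Per-sample squared error e = ||x - x_hat||^2."""
    return np.sum((x - reconstruct(x)) ** 2, axis=1)

def fit_threshold(errors_normal, percentile=99.0):
    """Pick the threshold as a high percentile of errors on normal data."""
    return np.percentile(errors_normal, percentile)

# "Normal" data: points near a line through the origin, plus small noise.
rng = np.random.default_rng(0)
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)
x_normal = np.outer(rng.standard_normal(500), direction)
x_normal = x_normal + 0.05 * rng.standard_normal((500, 2))

# Stand-in for a trained autoencoder: project onto the top principal direction.
mean = x_normal.mean(axis=0)
_, _, vt = np.linalg.svd(x_normal - mean, full_matrices=False)
component = vt[:1]                                   # shape (1, 2)

def reconstruct(x):
    return (x - mean) @ component.T @ component + mean

threshold = fit_threshold(reconstruction_errors(x_normal, reconstruct))

x_new = np.array([[0.5, 0.5],     # lies along the normal direction
                  [1.0, -1.0]])   # far from anything seen in training
is_anomaly = reconstruction_errors(x_new, reconstruct) > threshold
```

With a real autoencoder, `reconstruct` would simply be the model's forward pass, and the percentile would be tuned against the false-alarm rate the application can tolerate.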

🐍 Python Implementation from Scratch

10.1 Vanilla Autoencoder (NumPy)

Python — Vanilla AE from Scratch
import numpy as np

class AutoencoderNumpy:
    """Simple autoencoder using only NumPy — no frameworks."""

    def __init__(self, input_dim, hidden_dim, latent_dim, lr=0.001):
        self.lr = lr
        # Xavier initialization
        scale1 = np.sqrt(2.0 / (input_dim + hidden_dim))
        scale2 = np.sqrt(2.0 / (hidden_dim + latent_dim))

        # Encoder weights
        self.W1 = np.random.randn(input_dim, hidden_dim) * scale1
        self.b1 = np.zeros((1, hidden_dim))
        self.W2 = np.random.randn(hidden_dim, latent_dim) * scale2
        self.b2 = np.zeros((1, latent_dim))

        # Decoder weights
        self.W3 = np.random.randn(latent_dim, hidden_dim) * scale2
        self.b3 = np.zeros((1, hidden_dim))
        self.W4 = np.random.randn(hidden_dim, input_dim) * scale1
        self.b4 = np.zeros((1, input_dim))

    def relu(self, x):
        return np.maximum(0, x)

    def relu_deriv(self, x):
        return (x > 0).astype(float)

    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

    def encode(self, x):
        self.z1 = x @ self.W1 + self.b1
        self.a1 = self.relu(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.latent = self.relu(self.z2)  # Latent code
        return self.latent

    def decode(self, z):
        self.z3 = z @ self.W3 + self.b3
        self.a3 = self.relu(self.z3)
        self.z4 = self.a3 @ self.W4 + self.b4
        self.output = self.sigmoid(self.z4)  # Output in [0,1]
        return self.output

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z)

    def compute_loss(self, x, x_hat):
        """MSE Loss"""
        return np.mean((x - x_hat) ** 2)

    def backward(self, x, x_hat):
        batch_size = x.shape[0]

        # d(MSE)/d(x_hat)
        d_output = 2.0 * (x_hat - x) / x.shape[1]

        # Through sigmoid
        d_z4 = d_output * x_hat * (1 - x_hat)

        # Decoder gradients
        d_W4 = self.a3.T @ d_z4 / batch_size
        d_b4 = np.mean(d_z4, axis=0, keepdims=True)

        d_a3 = d_z4 @ self.W4.T
        d_z3 = d_a3 * self.relu_deriv(self.z3)

        d_W3 = self.latent.T @ d_z3 / batch_size
        d_b3 = np.mean(d_z3, axis=0, keepdims=True)

        # Encoder gradients
        d_latent = d_z3 @ self.W3.T
        d_z2 = d_latent * self.relu_deriv(self.z2)

        d_W2 = self.a1.T @ d_z2 / batch_size
        d_b2 = np.mean(d_z2, axis=0, keepdims=True)

        d_a1 = d_z2 @ self.W2.T
        d_z1 = d_a1 * self.relu_deriv(self.z1)

        d_W1 = x.T @ d_z1 / batch_size
        d_b1 = np.mean(d_z1, axis=0, keepdims=True)

        # Update weights (gradient descent)
        self.W4 -= self.lr * d_W4
        self.b4 -= self.lr * d_b4
        self.W3 -= self.lr * d_W3
        self.b3 -= self.lr * d_b3
        self.W2 -= self.lr * d_W2
        self.b2 -= self.lr * d_b2
        self.W1 -= self.lr * d_W1
        self.b1 -= self.lr * d_b1

    def train(self, X, epochs=100, batch_size=64, verbose=True):
        n = X.shape[0]
        history = []
        for epoch in range(epochs):
            indices = np.random.permutation(n)
            total_loss = 0
            for i in range(0, n, batch_size):
                batch = X[indices[i:i+batch_size]]
                x_hat = self.forward(batch)
                loss = self.compute_loss(batch, x_hat)
                self.backward(batch, x_hat)
                total_loss += loss * batch.shape[0]
            avg_loss = total_loss / n
            history.append(avg_loss)
            if verbose and (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch+1}/{epochs} — Loss: {avg_loss:.6f}")
        return history


# Demo with synthetic data
np.random.seed(42)
# Create simple data: points on a noisy circle
t = np.linspace(0, 2*np.pi, 500)
X = np.column_stack([
    np.cos(t) + np.random.randn(500) * 0.1,
    np.sin(t) + np.random.randn(500) * 0.1,
    0.5 * np.cos(2*t) + np.random.randn(500) * 0.1,
    0.5 * np.sin(2*t) + np.random.randn(500) * 0.1
])
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # Normalize to [0,1]

ae = AutoencoderNumpy(input_dim=4, hidden_dim=8, latent_dim=2, lr=0.01)
history = ae.train(X, epochs=100, batch_size=32)
print(f"\nFinal Loss: {history[-1]:.6f}")
print(f"Latent codes shape: {ae.encode(X).shape}")  # (500, 2)

10.2 Variational Autoencoder from Scratch (NumPy)

Python — VAE from Scratch
import numpy as np

class VAENumpy:
    """Variational Autoencoder using only NumPy."""

    def __init__(self, input_dim, hidden_dim, latent_dim, lr=0.001):
        self.lr = lr
        self.latent_dim = latent_dim

        # Encoder: input → hidden → (mu, log_var)
        s1 = np.sqrt(2.0 / (input_dim + hidden_dim))
        s2 = np.sqrt(2.0 / (hidden_dim + latent_dim))

        self.We1 = np.random.randn(input_dim, hidden_dim) * s1
        self.be1 = np.zeros((1, hidden_dim))
        self.W_mu = np.random.randn(hidden_dim, latent_dim) * s2
        self.b_mu = np.zeros((1, latent_dim))
        self.W_logvar = np.random.randn(hidden_dim, latent_dim) * s2
        self.b_logvar = np.zeros((1, latent_dim))

        # Decoder: latent → hidden → output
        self.Wd1 = np.random.randn(latent_dim, hidden_dim) * s2
        self.bd1 = np.zeros((1, hidden_dim))
        self.Wd2 = np.random.randn(hidden_dim, input_dim) * s1
        self.bd2 = np.zeros((1, input_dim))

    def relu(self, x):
        return np.maximum(0, x)

    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

    def encode(self, x):
        """Encode to mu and log_var."""
        self.h_enc_pre = x @ self.We1 + self.be1
        self.h_enc = self.relu(self.h_enc_pre)
        self.mu = self.h_enc @ self.W_mu + self.b_mu
        self.log_var = self.h_enc @ self.W_logvar + self.b_logvar
        return self.mu, self.log_var

    def reparameterize(self, mu, log_var):
        """z = mu + sigma * epsilon (reparameterization trick)."""
        self.std = np.exp(0.5 * log_var)
        self.eps = np.random.randn(*mu.shape)
        z = mu + self.std * self.eps
        return z

    def decode(self, z):
        """Decode from latent space."""
        self.h_dec_pre = z @ self.Wd1 + self.bd1
        self.h_dec = self.relu(self.h_dec_pre)
        self.out_pre = self.h_dec @ self.Wd2 + self.bd2
        self.x_hat = self.sigmoid(self.out_pre)
        return self.x_hat

    def forward(self, x):
        mu, log_var = self.encode(x)
        self.z = self.reparameterize(mu, log_var)
        return self.decode(self.z)

    def compute_loss(self, x, x_hat):
        """Negative ELBO = reconstruction loss (BCE) + KL divergence."""
        # Reconstruction: Binary Cross-Entropy
        bce = -np.mean(np.sum(
            x * np.log(x_hat + 1e-8) + (1 - x) * np.log(1 - x_hat + 1e-8),
            axis=1
        ))
        # KL divergence: -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
        kl = -0.5 * np.mean(np.sum(
            1 + self.log_var - self.mu**2 - np.exp(self.log_var),
            axis=1
        ))
        return bce + kl, bce, kl

    def train_step(self, x):
        """One full-batch training step using hand-derived analytic gradients."""
        x_hat = self.forward(x)
        loss, recon, kl = self.compute_loss(x, x_hat)

        # Backprop through decoder
        batch_size = x.shape[0]
        # Gradient of the batch-averaged BCE w.r.t. the pre-sigmoid logits:
        # sigmoid followed by BCE simplifies to (x_hat - x)
        d_out = (x_hat - x) / batch_size

        d_Wd2 = self.h_dec.T @ d_out
        d_bd2 = np.sum(d_out, axis=0, keepdims=True)

        d_h_dec = d_out @ self.Wd2.T
        d_h_dec_pre = d_h_dec * (self.h_dec_pre > 0).astype(float)

        d_Wd1 = self.z.T @ d_h_dec_pre
        d_bd1 = np.sum(d_h_dec_pre, axis=0, keepdims=True)

        d_z = d_h_dec_pre @ self.Wd1.T

        # Reparameterization: z = mu + std * eps
        # KL gradient w.r.t mu: mu/batch_size
        # KL gradient w.r.t log_var: 0.5*(exp(log_var) - 1)/batch_size
        d_mu = d_z + self.mu / batch_size
        d_log_var = d_z * 0.5 * self.std * self.eps + \
                    0.5 * (np.exp(self.log_var) - 1) / batch_size

        # Encoder gradients
        d_W_mu = self.h_enc.T @ d_mu
        d_b_mu = np.sum(d_mu, axis=0, keepdims=True)

        d_W_logvar = self.h_enc.T @ d_log_var
        d_b_logvar = np.sum(d_log_var, axis=0, keepdims=True)

        d_h_enc = d_mu @ self.W_mu.T + d_log_var @ self.W_logvar.T
        d_h_enc_pre = d_h_enc * (self.h_enc_pre > 0).astype(float)

        d_We1 = x.T @ d_h_enc_pre
        d_be1 = np.sum(d_h_enc_pre, axis=0, keepdims=True)

        # Update all weights
        for param, grad in [
            ('Wd2', d_Wd2), ('bd2', d_bd2),
            ('Wd1', d_Wd1), ('bd1', d_bd1),
            ('W_mu', d_W_mu), ('b_mu', d_b_mu),
            ('W_logvar', d_W_logvar), ('b_logvar', d_b_logvar),
            ('We1', d_We1), ('be1', d_be1),
        ]:
            setattr(self, param, getattr(self, param) - self.lr * grad)

        return loss, recon, kl

    def generate(self, n_samples=10):
        """Generate new samples by sampling from the prior."""
        z = np.random.randn(n_samples, self.latent_dim)
        return self.decode(z)


# Usage
np.random.seed(42)
X = np.random.rand(1000, 20)  # 1000 samples, 20 features
X = (X > 0.5).astype(float)   # Binary data

vae = VAENumpy(input_dim=20, hidden_dim=64, latent_dim=4, lr=0.001)
for epoch in range(50):
    loss, recon, kl = vae.train_step(X)
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}: Loss={loss:.4f}, Recon={recon:.4f}, KL={kl:.4f}")

# Generate new samples
new_samples = vae.generate(5)
print(f"\nGenerated samples shape: {new_samples.shape}")
print(f"Sample values (rounded): {np.round(new_samples[0, :5], 3)}")

🔶 TensorFlow Implementation

11.1 MNIST Autoencoder

TensorFlow / Keras — MNIST Autoencoder
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt

# Load MNIST
(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train.reshape(-1, 784)
x_test = x_test.reshape(-1, 784)

# ===================== VANILLA AUTOENCODER =====================
class SimpleAutoencoder(keras.Model):
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder
        self.encoder = keras.Sequential([
            layers.Dense(256, activation='relu'),
            layers.Dense(128, activation='relu'),
            layers.Dense(latent_dim, activation='relu', name='bottleneck'),
        ])
        # Decoder
        self.decoder = keras.Sequential([
            layers.Dense(128, activation='relu'),
            layers.Dense(256, activation='relu'),
            layers.Dense(784, activation='sigmoid'),
        ])

    def call(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# Build and train
ae = SimpleAutoencoder(latent_dim=32)
ae.compile(optimizer='adam', loss='mse')
history = ae.fit(x_train, x_train,
                 epochs=20, batch_size=256,
                 validation_data=(x_test, x_test),
                 verbose=1)

# Visualize reconstructions
reconstructed = ae.predict(x_test[:10])
fig, axes = plt.subplots(2, 10, figsize=(20, 4))
for i in range(10):
    axes[0, i].imshow(x_test[i].reshape(28, 28), cmap='gray')
    axes[0, i].axis('off')
    axes[0, i].set_title('Original')
    axes[1, i].imshow(reconstructed[i].reshape(28, 28), cmap='gray')
    axes[1, i].axis('off')
    axes[1, i].set_title('Reconstructed')
plt.tight_layout()
plt.savefig('ae_reconstruction.png', dpi=100)
plt.show()

11.2 Denoising Autoencoder (TensorFlow)

TensorFlow — Denoising Autoencoder
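# Continues from Section 11.1: reuses x_train, x_test, keras, layers, np, and plt.
# Add Gaussian noise to the training and test images
noise_factor = 0.35
x_train_noisy = x_train + noise_factor * np.random.randn(*x_train.shape)
x_test_noisy = x_test + noise_factor * np.random.randn(*x_test.shape)
# Clip back to the valid pixel range and keep float32 to match the clean data
x_train_noisy = np.clip(x_train_noisy, 0., 1.).astype('float32')
x_test_noisy = np.clip(x_test_noisy, 0., 1.).astype('float32')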

# Use a convolutional architecture for better denoising
class ConvDenoisingAE(keras.Model):
    def __init__(self):
        super().__init__()
        self.encoder = keras.Sequential([
            layers.Reshape((28, 28, 1)),
            layers.Conv2D(32, 3, activation='relu', padding='same'),
            layers.MaxPooling2D(2, padding='same'),
            layers.Conv2D(32, 3, activation='relu', padding='same'),
            layers.MaxPooling2D(2, padding='same'),
        ])
        self.decoder = keras.Sequential([
            layers.Conv2D(32, 3, activation='relu', padding='same'),
            layers.UpSampling2D(2),
            layers.Conv2D(32, 3, activation='relu', padding='same'),
            layers.UpSampling2D(2),
            layers.Conv2D(1, 3, activation='sigmoid', padding='same'),
            layers.Reshape((784,))
        ])

    def call(self, x):
        z = self.encoder(x)
        return self.decoder(z)

dae = ConvDenoisingAE()
dae.compile(optimizer='adam', loss='mse')
dae.fit(x_train_noisy, x_train,   # Input: noisy, Target: clean!
        epochs=15, batch_size=128,
        validation_data=(x_test_noisy, x_test))

# Visualize denoising results
denoised = dae.predict(x_test_noisy[:10])
fig, axes = plt.subplots(3, 10, figsize=(20, 6))
for i in range(10):
    axes[0, i].imshow(x_test[i].reshape(28, 28), cmap='gray')
    axes[0, i].axis('off'); axes[0, i].set_title('Clean')
    axes[1, i].imshow(x_test_noisy[i].reshape(28, 28), cmap='gray')
    axes[1, i].axis('off'); axes[1, i].set_title('Noisy')
    axes[2, i].imshow(denoised[i].reshape(28, 28), cmap='gray')
    axes[2, i].axis('off'); axes[2, i].set_title('Denoised')
plt.tight_layout()
plt.savefig('denoising_results.png', dpi=100)
plt.show()

11.3 VAE with Latent Space Visualization

TensorFlow — Full VAE with Latent Space Visualization
import tensorflow as tf
from tensorflow import keras  # needed for keras.Input, keras.Model, keras.metrics below
from tensorflow.keras import layers, Model
import numpy as np
import matplotlib.pyplot as plt

# ===================== SAMPLING LAYER =====================
class Sampling(layers.Layer):
    """Reparameterization trick: z = mu + sigma * epsilon."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        epsilon = tf.random.normal(shape=(batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

# ===================== ENCODER =====================
latent_dim = 2  # 2D for visualization!

encoder_inputs = keras.Input(shape=(784,))
x = layers.Dense(512, activation='relu')(encoder_inputs)
x = layers.Dense(256, activation='relu')(x)
z_mean = layers.Dense(latent_dim, name='z_mean')(x)
z_log_var = layers.Dense(latent_dim, name='z_log_var')(x)
z = Sampling()([z_mean, z_log_var])

encoder = Model(encoder_inputs, [z_mean, z_log_var, z], name='encoder')
encoder.summary()

# ===================== DECODER =====================
decoder_inputs = keras.Input(shape=(latent_dim,))
x = layers.Dense(256, activation='relu')(decoder_inputs)
x = layers.Dense(512, activation='relu')(x)
decoder_outputs = layers.Dense(784, activation='sigmoid')(x)

decoder = Model(decoder_inputs, decoder_outputs, name='decoder')
decoder.summary()

# ===================== VAE MODEL =====================
class VAE(keras.Model):
    def __init__(self, encoder, decoder, beta=1.0, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        self.beta = beta  # β-VAE parameter
        self.total_loss_tracker = keras.metrics.Mean(name='total_loss')
        self.recon_loss_tracker = keras.metrics.Mean(name='recon_loss')
        self.kl_loss_tracker = keras.metrics.Mean(name='kl_loss')

    @property
    def metrics(self):
        return [self.total_loss_tracker,
                self.recon_loss_tracker,
                self.kl_loss_tracker]

    def train_step(self, data):
        with tf.GradientTape() as tape:
            z_mean, z_log_var, z = self.encoder(data)
            reconstruction = self.decoder(z)

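            # Reconstruction loss (BCE), summed over pixels.
            # binary_crossentropy already averages over the last axis,
            # so rescale by the pixel count to recover the per-image sum.
            n_pixels = tf.cast(tf.shape(data)[-1], tf.float32)
            recon_loss = tf.reduce_mean(
                n_pixels * keras.losses.binary_crossentropy(data, reconstruction)
            )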
            # KL divergence loss
            kl_loss = -0.5 * tf.reduce_mean(
                tf.reduce_sum(
                    1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var),
                    axis=-1
                )
            )
            total_loss = recon_loss + self.beta * kl_loss

        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))

        self.total_loss_tracker.update_state(total_loss)
        self.recon_loss_tracker.update_state(recon_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {
            'loss': self.total_loss_tracker.result(),
            'recon_loss': self.recon_loss_tracker.result(),
            'kl_loss': self.kl_loss_tracker.result(),
        }

# Train!
vae = VAE(encoder, decoder, beta=1.0)
vae.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3))
vae.fit(x_train, epochs=30, batch_size=128)

# ===================== LATENT SPACE VISUALIZATION =====================
def plot_latent_space(encoder, data, labels, title='VAE Latent Space'):
    z_mean, _, _ = encoder.predict(data, batch_size=512)
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(z_mean[:, 0], z_mean[:, 1],
                          c=labels, cmap='tab10',
                          alpha=0.5, s=2)
    plt.colorbar(scatter, label='Digit')
    plt.xlabel('z₁'); plt.ylabel('z₂')
    plt.title(title)
    plt.savefig('vae_latent_space.png', dpi=150)
    plt.show()

# Reload labels for coloring
(_, y_train), (_, y_test) = keras.datasets.mnist.load_data()
plot_latent_space(encoder, x_test, y_test)

# ===================== LATENT SPACE MANIFOLD =====================
def plot_latent_manifold(decoder, n=20, figsize=15):
    """Sample points on a grid in latent space and decode them."""
    figure = np.zeros((28 * n, 28 * n))
    # Linearly spaced coordinates over the square [-3, 3] × [-3, 3]
    grid_x = np.linspace(-3, 3, n)
    grid_y = np.linspace(-3, 3, n)[::-1]

    for i, yi in enumerate(grid_y):
        for j, xi in enumerate(grid_x):
            z_sample = np.array([[xi, yi]])
            x_decoded = decoder.predict(z_sample, verbose=0)
            digit = x_decoded[0].reshape(28, 28)
            figure[i*28:(i+1)*28, j*28:(j+1)*28] = digit

    plt.figure(figsize=(figsize, figsize))
    plt.imshow(figure, cmap='gray')
    plt.title('VAE Latent Space Manifold')
    plt.xlabel('z₁'); plt.ylabel('z₂')
    plt.savefig('vae_manifold.png', dpi=150)
    plt.show()

plot_latent_manifold(decoder, n=20)

# ===================== GENERATE NEW DIGITS =====================
def generate_samples(decoder, n=10):
    """Generate new digits by sampling from N(0,I)."""
    z = np.random.randn(n, latent_dim)
    generated = decoder.predict(z)
    fig, axes = plt.subplots(1, n, figsize=(2*n, 2))
    for i in range(n):
        axes[i].imshow(generated[i].reshape(28, 28), cmap='gray')
        axes[i].axis('off')
        axes[i].set_title(f'z=[{z[i,0]:.1f},{z[i,1]:.1f}]')
    plt.suptitle('Generated Samples from VAE')
    plt.tight_layout()
    plt.savefig('vae_generated.png', dpi=100)
    plt.show()

generate_samples(decoder, n=10)
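
The KL term in train_step relies on the closed-form expression for a diagonal Gaussian measured against N(0, I). As a sanity check, the sketch below compares that formula with a Monte Carlo estimate in plain NumPy. The values of mu and log_var are arbitrary illustrative choices, not outputs of the trained encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example posterior parameters for a 2-D latent (assumed values).
mu = np.array([0.5, -1.0])
log_var = np.array([-0.2, 0.3])
sigma = np.exp(0.5 * log_var)

# Closed form: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
kl_closed = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

# Monte Carlo: E_q[log q(z) - log p(z)], sampling z = mu + sigma * eps
eps = rng.standard_normal((200_000, 2))
z = mu + sigma * eps
log_q = -0.5 * np.sum(log_var + eps**2 + np.log(2 * np.pi), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(f"closed form: {kl_closed:.4f}")
print(f"Monte Carlo: {kl_mc:.4f}")
```

The two numbers should agree to roughly two decimal places. The sampling line is the same reparameterization used by the Sampling layer, which is why the estimate can be written entirely in terms of eps.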

📦 Scikit-Learn Implementation

Scikit-learn has no native autoencoder class, but its MLPRegressor can serve as a "poor man's autoencoder" when it is trained to reproduce its own input, and scikit-learn's metrics make the evaluation straightforward. We also show PCA for comparison, since it is the linear analogue of an undercomplete autoencoder.

Scikit-Learn — PCA as Linear Autoencoder + Anomaly Detection
from sklearn.decomposition import PCA, KernelPCA
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, classification_report
from sklearn.datasets import make_classification
import numpy as np

# =================== PCA as Linear Autoencoder ===================
from sklearn.datasets import fetch_openml

# Load a subset of MNIST
X, y = fetch_openml('mnist_784', version=1, return_X_y=True,
                     as_frame=False, parser='auto')
X = X[:10000] / 255.0

# PCA with different components (analogous to different bottleneck sizes)
for n_comp in [2, 10, 32, 100]:
    pca = PCA(n_components=n_comp)
    X_encoded = pca.fit_transform(X)
    X_reconstructed = pca.inverse_transform(X_encoded)
    recon_error = mean_squared_error(X, X_reconstructed)
    variance = pca.explained_variance_ratio_.sum()
    print(f"PCA-{n_comp:3d}: Recon MSE = {recon_error:.6f}, "
          f"Variance Explained = {variance:.4f}")

# =================== MLPRegressor as Autoencoder ===================
# Train neural network to reconstruct input
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X[:5000])

ae_mlp = MLPRegressor(
    hidden_layer_sizes=(256, 32, 256),  # Bottleneck = 32
    activation='relu',
    solver='adam',
    max_iter=50,
    batch_size=128,
    learning_rate_init=0.001,
    verbose=True,
    random_state=42
)
# Train: input = output (autoencoder!)
ae_mlp.fit(X_scaled, X_scaled)

X_recon = ae_mlp.predict(X_scaled)
print(f"\nMLP AE Recon MSE: {mean_squared_error(X_scaled, X_recon):.6f}")

# =================== Anomaly Detection with AE ===================
# Generate normal data and anomalies
np.random.seed(42)
X_normal = np.random.randn(1000, 10) * 0.5 + 2.0
X_anomaly = np.random.randn(50, 10) * 3.0 + 7.0  # Different distribution

# Train autoencoder on normal data only
scaler_ad = MinMaxScaler()
X_normal_scaled = scaler_ad.fit_transform(X_normal)

ae_detector = MLPRegressor(
    hidden_layer_sizes=(32, 8, 32),
    activation='relu', solver='adam',
    max_iter=100, random_state=42
)
ae_detector.fit(X_normal_scaled, X_normal_scaled)

# Compute reconstruction errors
X_all = np.vstack([X_normal, X_anomaly])
y_true = np.array([0]*1000 + [1]*50)  # 0=normal, 1=anomaly
X_all_scaled = scaler_ad.transform(X_all)

X_all_recon = ae_detector.predict(X_all_scaled)
recon_errors = np.mean((X_all_scaled - X_all_recon)**2, axis=1)

# Set threshold (e.g., 95th percentile of normal errors)
threshold = np.percentile(recon_errors[:1000], 95)
y_pred = (recon_errors > threshold).astype(int)

print(f"\nAnomaly Detection Results:")
print(f"Threshold: {threshold:.6f}")
print(classification_report(y_true, y_pred,
                            target_names=['Normal', 'Anomaly']))

# Kernel PCA for nonlinear dimensionality reduction
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.01,
                 fit_inverse_transform=True)
X_kpca = kpca.fit_transform(X[:5000])
X_kpca_recon = kpca.inverse_transform(X_kpca)
print(f"\nKernel PCA Recon MSE: {mean_squared_error(X[:5000], X_kpca_recon):.6f}")

🇮🇳 Indian Case Studies

Case Study 1: Aadhaar Biometric Data Compression

India's Aadhaar system, managed by UIDAI, stores biometric data (fingerprints and iris scans) for 1.4+ billion residents. Each fingerprint template is roughly 20–40 KB. With 10 fingerprints per person, the raw storage requirement is enormous.

The Challenge

Store biometric templates for 1.4 billion people efficiently
Enable real-time matching during authentication (Aadhaar e-KYC processes ~100 million verifications/month)
Maintain high recognition accuracy despite compression

The Autoencoder Solution

Autoencoder-based compression techniques can reduce biometric template sizes by 80–90% while maintaining near-perfect match accuracy:

Python — Biometric Compression Concept
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

class BiometricCompressor(Model):
    """Autoencoder for fingerprint template compression.
    Reduces 512-dimensional templates to 64 dimensions (87.5% compression).
    """
    def __init__(self):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            layers.Dense(256, activation='relu'),
            layers.BatchNormalization(),
            layers.Dense(128, activation='relu'),
            layers.BatchNormalization(),
            layers.Dense(64, activation='relu', name='compressed_template'),
        ])
        self.decoder = tf.keras.Sequential([
            layers.Dense(128, activation='relu'),
            layers.BatchNormalization(),
            layers.Dense(256, activation='relu'),
            layers.BatchNormalization(),
            layers.Dense(512, activation='sigmoid'),
        ])

    def call(self, x):
        z = self.encoder(x)
        return self.decoder(z)

    def compress(self, template):
        """Compress a biometric template for storage."""
        return self.encoder(template)

    def decompress(self, compressed):
        """Decompress for matching."""
        return self.decoder(compressed)

# Simulated usage
model = BiometricCompressor()
model.compile(optimizer='adam', loss='mse')

# Simulate fingerprint templates (512-dim feature vectors)
templates = np.random.rand(10000, 512).astype('float32')
model.fit(templates, templates, epochs=10, batch_size=256, verbose=0)

original = templates[:5]
compressed = model.compress(original)
reconstructed = model.decompress(compressed)

print(f"Original shape:    {original.shape}")      # (5, 512)
print(f"Compressed shape:  {compressed.shape}")     # (5, 64)
print(f"Compression ratio: {512/64:.1f}x")          # 8.0x
print(f"Recon MSE:         {np.mean((original - reconstructed.numpy())**2):.6f}")

Impact: With 8x compression, storage requirements would drop from ~560 TB (1.4 billion residents × 10 fingerprints × 40 KB per template) to ~70 TB for the entire Aadhaar database, saving significant infrastructure costs while maintaining sub-second authentication times.
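
The storage figures in this case study follow from simple arithmetic, sketched below. The sketch assumes the 40 KB upper end of the template range quoted earlier, and it assumes that an 8x reduction in feature dimensions translates directly into an 8x reduction in bytes, which a real system would have to verify.

```python
# Back-of-the-envelope storage estimate using the case study's figures.
residents = 1.4e9              # enrolled residents
fingers_per_person = 10
template_kb = 40               # upper end of the 20-40 KB range (assumption)
compression_ratio = 512 / 64   # 8x, from the BiometricCompressor bottleneck

# Decimal units: 1 TB = 1e9 KB
raw_tb = residents * fingers_per_person * template_kb / 1e9
compressed_tb = raw_tb / compression_ratio

print(f"Raw storage:        {raw_tb:.0f} TB")
print(f"Compressed storage: {compressed_tb:.0f} TB")
```

Using the 20 KB lower bound instead halves both numbers, so the estimate is best read as an order of magnitude.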

Case Study 2: Network Anomaly Detection at Jio

Reliance Jio, India's largest telecom operator with 450+ million subscribers, processes massive volumes of network traffic data. Detecting anomalies — DDoS attacks, unusual traffic patterns, equipment failures — in real-time is critical.

The Challenge

Monitor millions of network flows per second across 200,000+ cell towers
Distinguish normal traffic variations (cricket matches, festivals) from genuine attacks
Minimize false alarms that waste engineer time

Autoencoder-Based Detection

Python — Network Anomaly Detection (Jio-style)
import numpy as np
from sklearn.preprocessing import StandardScaler

class NetworkAnomalyDetector:
    """Autoencoder-based anomaly detector for telecom network traffic."""

    def __init__(self):
        import tensorflow as tf
        from tensorflow.keras import layers

        self.model = tf.keras.Sequential([
            # Encoder
            layers.Dense(64, activation='relu', input_shape=(20,)),
            layers.Dense(32, activation='relu'),
            layers.Dense(8, activation='relu'),    # Bottleneck
            # Decoder
            layers.Dense(32, activation='relu'),
            layers.Dense(64, activation='relu'),
            layers.Dense(20, activation='linear'),
        ])
        self.model.compile(optimizer='adam', loss='mse')
        self.scaler = StandardScaler()
        self.threshold = None

    def fit(self, X_normal, epochs=50):
        """Train on NORMAL traffic only."""
        X_scaled = self.scaler.fit_transform(X_normal)
        self.model.fit(X_scaled, X_scaled, epochs=epochs,
                       batch_size=256, verbose=0)
        # Set threshold as 99th percentile of training errors
        recon = self.model.predict(X_scaled, verbose=0)
        errors = np.mean((X_scaled - recon)**2, axis=1)
        self.threshold = np.percentile(errors, 99)
        print(f"Threshold set at: {self.threshold:.6f}")

    def detect(self, X_new):
        """Returns anomaly scores and predictions."""
        X_scaled = self.scaler.transform(X_new)
        recon = self.model.predict(X_scaled, verbose=0)
        errors = np.mean((X_scaled - recon)**2, axis=1)
        is_anomaly = errors > self.threshold
        return errors, is_anomaly

# Simulate network traffic features
# Features: packet_count, byte_count, flow_duration, port_entropy,
#           src_diversity, dst_diversity, protocol_dist, ...
np.random.seed(42)
X_normal = np.random.randn(50000, 20) + np.array([5]*20)

# Simulate attacks (different distribution)
X_ddos = np.random.randn(100, 20) * 3 + np.array([15]*20)
X_scan = np.random.randn(100, 20) * 0.1 + np.array([0.5]*20)

detector = NetworkAnomalyDetector()
detector.fit(X_normal, epochs=20)

# Test detection
X_test = np.vstack([X_normal[:1000], X_ddos, X_scan])
y_true = np.array([0]*1000 + [1]*100 + [1]*100)

errors, predictions = detector.detect(X_test)
tp = np.sum(predictions[1000:])  # True positives
fp = np.sum(predictions[:1000])  # False positives
print(f"Detection rate: {tp}/{200} = {tp/200*100:.1f}%")
print(f"False alarm rate: {fp}/{1000} = {fp/1000*100:.2f}%")

Impact: Autoencoder-based anomaly detection at Jio can process 10 million+ flows per minute, detecting sophisticated attacks that rule-based systems miss, while maintaining false positive rates below 1%.
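
The false positive rate is tied directly to the percentile used for the threshold: a threshold at the 99th percentile of normal-traffic errors flags about 1% of normal traffic by construction. The sketch below illustrates this with synthetic reconstruction errors. The gamma distribution and its parameters are assumptions chosen only for illustration, not measurements from any real network.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for reconstruction errors on NORMAL traffic.
# The gamma shape/scale values are assumptions for illustration only.
train_errors = rng.gamma(shape=2.0, scale=0.05, size=50_000)
heldout_errors = rng.gamma(shape=2.0, scale=0.05, size=50_000)

rates = {}
for pct in (95, 99, 99.9):
    # Threshold fitted on training errors, evaluated on unseen normal traffic
    threshold = np.percentile(train_errors, pct)
    rates[pct] = np.mean(heldout_errors > threshold)
    print(f"{pct:>5}th percentile -> threshold {threshold:.4f}, "
          f"held-out false alarm rate {rates[pct]:.2%}")
```

Raising the percentile lowers the false alarm rate but also raises the bar an attack must clear, so in practice the choice is a trade-off between engineer workload and missed detections.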

🌍 Global Case Studies

Case Study 1: Stability AI — Stable Diffusion's VAE

Stable Diffusion, released by Stability AI in 2022, uses a VAE as a critical architectural component. Instead of running the computationally expensive diffusion process directly on high-resolution images (512×512×3 = 786,432 dimensions), the image is first compressed by a VAE encoder into a much smaller latent space.

Architecture

VAE Encoder: Compresses 512×512×3 images to 64×64×4 latent space (48x compression)
U-Net: Performs the iterative denoising diffusion in this compressed latent space
VAE Decoder: Upsamples the denoised latent back to a full 512×512 image
Text Encoder (CLIP): Converts text prompts into embeddings that guide the U-Net

Stable Diffusion Architecture

"A photo of a cat"
        │
 ┌──────▼───────┐     ┌────────────────────────────────────────┐
 │  CLIP Text   │     │        Latent Diffusion Process        │
 │   Encoder    │     │                                        │
 └──────┬───────┘     │  z_T (noise) ──▶ U-Net ──▶ z_0 (clean) │
        │             │                    ▲ denoise           │
        │             │                    │ (T steps)         │
        └─────────────┤ text embedding guides                  │
                      └───────────────────┬────────────────────┘
                                          │
 ┌──────────────┐                  ┌──────▼───────┐
 │ Input Image  │   VAE Encoder    │ VAE Decoder  │  Output Image
 │  512×512×3   │──────────────▶   │ 64×64×4 ──▶  │──▶ 512×512×3
 └──────────────┘   (compress)     │ 512×512×3    │  (high quality)
                                   └──────────────┘

Key Innovation: By performing diffusion in the VAE's latent space rather than pixel space, Stable Diffusion is 10–100x faster than pixel-space diffusion models, enabling consumer GPUs to generate images in seconds.
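
The 48x figure quoted for the VAE encoder can be checked directly from the tensor shapes. The short calculation below uses only the shapes given in this case study.

```python
import math

pixel_shape = (512, 512, 3)   # RGB image
latent_shape = (64, 64, 4)    # VAE latent

pixel_dims = math.prod(pixel_shape)
latent_dims = math.prod(latent_shape)

print(f"Pixel-space dimensions:  {pixel_dims:,}")
print(f"Latent-space dimensions: {latent_dims:,}")
print(f"Compression factor:      {pixel_dims / latent_dims:.0f}x")
print(f"Spatial downsampling:    {pixel_shape[0] // latent_shape[0]}x per side")
```

Every denoising step of the U-Net therefore operates on a tensor 48 times smaller than the image it will eventually produce.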

Case Study 2: OpenAI — DALL-E Architecture

DALL-E (2021) uses a discrete VAE (dVAE) that tokenizes images into a grid of discrete tokens, similar to how text is tokenized into words. DALL-E 2 (2022) moved to a different architecture with CLIP embeddings and diffusion, but the VAE concept remains central.

DALL-E 1 Pipeline

Stage 1 — dVAE Training: Train a discrete VAE to compress 256×256 images into a 32×32 grid of tokens, each drawn from a vocabulary of 8192 possible codes
Stage 2 — Transformer: Train an autoregressive transformer to model the joint distribution of text tokens and image tokens
Generation: Given text, autoregressively generate image tokens, then decode with the dVAE

Impact: DALL-E demonstrated that the language of "tokens" could unify text and image generation, with the VAE serving as the bridge between continuous pixel space and discrete token space.
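
To make the idea of image tokens concrete, the sketch below mimics the tokenization step with NumPy. Random scores stand in for a trained encoder, and EMBED_DIM is a hypothetical width picked for the example. Only the hard argmax used to read out tokens is shown; training a discrete VAE needs a differentiable relaxation of that step, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

GRID = 32        # 32x32 token grid, as in Stage 1
VOCAB = 8192     # number of possible codes per grid position
EMBED_DIM = 16   # hypothetical embedding width, chosen only for this sketch

# Stand-in for the encoder output: one score per code at each grid position.
logits = rng.standard_normal((GRID, GRID, VOCAB), dtype=np.float32)

# Tokenize: keep the highest-scoring code at each position.
tokens = logits.argmax(axis=-1)          # shape (32, 32), values in [0, 8192)

# Stand-in for the decoder's first step: look up one embedding per token.
codebook = rng.standard_normal((VOCAB, EMBED_DIM), dtype=np.float32)
embedded = codebook[tokens]              # shape (32, 32, 16)

# Size comparison: 8-bit RGB pixels versus fixed-width token indices.
bits_per_token = VOCAB.bit_length() - 1  # log2(8192)
image_bits = 256 * 256 * 3 * 8
token_bits = GRID * GRID * bits_per_token

print(tokens.shape, embedded.shape)
print(f"Bits per token: {bits_per_token}")
print(f"Image bits / token bits: {image_bits / token_bits:.0f}")
```

Flattening tokens row by row gives a sequence of 1024 integers, which is the form the Stage 2 transformer consumes alongside the text tokens.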

🚀 Startup Applications

15.1 Medical Imaging Startups

Qure.ai (Mumbai): Uses autoencoders for anomaly detection in chest X-rays. Normal X-rays are encoded well; abnormal ones (tuberculosis, pneumonia) show high reconstruction error, flagging them for radiologist review.

15.2 E-commerce Product Search

ViSenze (Singapore): Uses autoencoders to create compact visual embeddings for products. Users upload a photo, and the autoencoder's latent space enables fast similarity search across millions of products.

15.3 FinTech Fraud Detection

Razorpay (India): Employs autoencoder-based anomaly detection to flag unusual payment patterns. The system trains on normal transactions and flags any transaction with high reconstruction error — potentially fraudulent patterns that don't match learned normal behavior.

15.4 Audio/Music Generation

AIVA (Luxembourg): Uses VAEs to generate music. Musical pieces are encoded into a latent space where interpolation between styles produces novel compositions. The latent space captures genre, tempo, key, and instrumentation as disentangled factors.

Python — Startup Product Similarity Search
# Simplified visual product search using AE embeddings
import numpy as np

class ProductSearchEngine:
    def __init__(self, ae_model):
        self.encoder = ae_model.encoder
        self.product_embeddings = {}

    def index_product(self, product_id, image_features):
        """Add product to search index."""
        embedding = self.encoder.predict(
            image_features.reshape(1, -1), verbose=0
        ).flatten()
        self.product_embeddings[product_id] = embedding

    def search(self, query_image, top_k=5):
        """Find most similar products."""
        query_emb = self.encoder.predict(
            query_image.reshape(1, -1), verbose=0
        ).flatten()

        distances = {}
        for pid, emb in self.product_embeddings.items():
            dist = np.linalg.norm(query_emb - emb)
            distances[pid] = dist

        return sorted(distances.items(), key=lambda x: x[1])[:top_k]

🏛️ Government Applications

16.1 Satellite Image Compression (ISRO)

India's ISRO satellites generate terabytes of multispectral imagery daily. Autoencoders can compress hyperspectral images (100+ bands) into compact representations for efficient downlink and storage, preserving spectral information better than traditional JPEG-like methods.
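
A rough sense of why hyperspectral pixels compress well comes from treating each pixel's spectrum as a mixture of a few material signatures. The sketch below uses a simulated cube and a truncated SVD (PCA) as a linear stand-in for a trained autoencoder. The cube size, number of materials, noise level, and code size K are all assumptions, and real sensor data will be less cleanly low-rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated hyperspectral cube: 32x32 pixels, 120 bands (all values assumed).
H, W, BANDS, N_MATERIALS, K = 32, 32, 120, 5, 8

# Each pixel is a mixture of a few material spectra plus sensor noise.
spectra = rng.random((N_MATERIALS, BANDS))
abundances = rng.dirichlet(np.ones(N_MATERIALS), size=H * W)
pixels = abundances @ spectra + 0.01 * rng.standard_normal((H * W, BANDS))

# Linear "encoder/decoder" via truncated SVD (PCA), standing in for an AE.
mean = pixels.mean(axis=0)
_, _, vt = np.linalg.svd(pixels - mean, full_matrices=False)
components = vt[:K]                       # shape (K, BANDS)

codes = (pixels - mean) @ components.T    # encode: 120 bands -> K values
recon = codes @ components + mean         # decode: K values -> 120 bands

mse = np.mean((pixels - recon) ** 2)
print(f"Per-pixel compression: {BANDS / K:.0f}x")
print(f"Reconstruction MSE:    {mse:.6f}")
```

A nonlinear autoencoder replaces the two matrix products with learned encoder and decoder networks, which can help when the mixing of spectra is not linear.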

16.2 Tax Fraud Detection (Income Tax Department)

The Indian Income Tax Department uses anomaly detection systems to flag suspicious returns. An autoencoder trained on normal tax return patterns can identify returns with unusual combinations of income, deductions, and claimed exemptions — potential fraud or errors.

16.3 Smart City Surveillance (MoHUA)

Under the Smart Cities Mission, surveillance systems process massive video streams. Autoencoders compress video features for efficient storage and detect anomalous events (abandoned objects, unusual crowd movements) by flagging frames with high reconstruction error.

16.4 Cybersecurity (CERT-In)

India's Computer Emergency Response Team (CERT-In) uses autoencoder-based IDS (Intrusion Detection Systems) to monitor network traffic across government networks, detecting zero-day attacks that signature-based systems miss.

🏭 Industry Applications

Industry	Application	AE Type	Impact
Manufacturing	Predictive maintenance — detect sensor anomalies before equipment failure	Undercomplete AE	30% reduction in downtime
Healthcare	Medical image denoising (MRI, CT scans)	Denoising AE	Clearer images, better diagnosis
Finance	Credit card fraud detection	AE + anomaly score	95%+ detection rate
Retail	Recommendation via learned embeddings	VAE embeddings	15% CTR improvement
Autonomous Driving	LiDAR point cloud compression	Convolutional AE	10x compression
Drug Discovery	Molecular generation & optimization	VAE on SMILES	10x faster screening
NLP	Sentence embeddings for semantic search	Seq2Seq AE	Fast similarity search
Gaming	Procedural content generation	β-VAE	Infinite level variations
Telecom	Network intrusion detection	Sparse AE	Real-time threat detection
Energy	Smart grid anomaly detection	LSTM-AE	Predictive grid management

🔧 Mini Projects

Mini Project 1: Image Denoiser

Objective

Build a denoising autoencoder that cleans noisy images from the Fashion-MNIST dataset.

Python — Fashion-MNIST Image Denoiser
import tensorflow as tf
from tensorflow.keras import layers, Model
import numpy as np
import matplotlib.pyplot as plt

# Load Fashion-MNIST
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Add Gaussian noise
def add_noise(images, noise_factor=0.4):
    noisy = images + noise_factor * np.random.randn(*images.shape)
    return np.clip(noisy, 0.0, 1.0)

x_train_noisy = add_noise(x_train)
x_test_noisy = add_noise(x_test)

# Build Convolutional Denoising Autoencoder
class FashionDenoiser(Model):
    def __init__(self):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            layers.Reshape((28, 28, 1), input_shape=(28, 28)),
            layers.Conv2D(64, 3, activation='relu', padding='same'),
            layers.MaxPooling2D(2, padding='same'),
            layers.Conv2D(32, 3, activation='relu', padding='same'),
            layers.MaxPooling2D(2, padding='same'),
            layers.Conv2D(16, 3, activation='relu', padding='same'),
            layers.MaxPooling2D(2, padding='same'),
        ])
        self.decoder = tf.keras.Sequential([
            layers.Conv2D(16, 3, activation='relu', padding='same'),
            layers.UpSampling2D(2),
            layers.Conv2D(32, 3, activation='relu', padding='same'),
            layers.UpSampling2D(2),
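            # No padding='same' here on purpose: 16×16 shrinks to 14×14,
            # so the final upsampling lands exactly on 28×28.
            layers.Conv2D(64, 3, activation='relu'),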
            layers.UpSampling2D(2),
            layers.Conv2D(1, 3, activation='sigmoid', padding='same'),
            layers.Reshape((28, 28)),
        ])

    def call(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

# Train
denoiser = FashionDenoiser()
denoiser.compile(optimizer='adam', loss='mse')
history = denoiser.fit(
    x_train_noisy, x_train,  # noisy input → clean target
    epochs=20, batch_size=128,
    validation_data=(x_test_noisy, x_test)
)

# Evaluate
denoised = denoiser.predict(x_test_noisy[:10])
fig, axes = plt.subplots(3, 10, figsize=(20, 6))
labels = ['T-shirt', 'Trouser', 'Pullover', 'Dress', 'Coat',
          'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Boot']
for i in range(10):
    axes[0, i].imshow(x_test[i], cmap='gray')
    axes[0, i].set_title('Clean', fontsize=8)
    axes[0, i].axis('off')
    axes[1, i].imshow(x_test_noisy[i], cmap='gray')
    axes[1, i].set_title('Noisy', fontsize=8)
    axes[1, i].axis('off')
    axes[2, i].imshow(denoised[i], cmap='gray')
    axes[2, i].set_title('Denoised', fontsize=8)
    axes[2, i].axis('off')
plt.suptitle('Fashion-MNIST Image Denoiser', fontsize=14)
plt.tight_layout()
plt.savefig('fashion_denoiser.png', dpi=150)
plt.show()

# PSNR calculation
def psnr(original, reconstructed):
    mse = np.mean((original - reconstructed)**2)
    if mse == 0:
        return float('inf')
    return 10 * np.log10(1.0 / mse)

denoised_all = denoiser.predict(x_test_noisy)
print(f"PSNR (noisy vs original):    {psnr(x_test, x_test_noisy):.2f} dB")
print(f"PSNR (denoised vs original): {psnr(x_test, denoised_all):.2f} dB")
print(f"Improvement: {psnr(x_test, denoised_all) - psnr(x_test, x_test_noisy):.2f} dB")

Mini Project 2: Credit Card Fraud Anomaly Detector

Objective

Build an autoencoder-based anomaly detector for credit card transactions using the Kaggle Credit Card Fraud dataset approach.

Python — Credit Card Fraud Detector
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (precision_recall_curve, auc,
                             confusion_matrix, classification_report)

# Simulate credit card transaction data
np.random.seed(42)

# Normal transactions (~99.5% of this simulated data)
n_normal = 10000
X_normal = np.random.randn(n_normal, 29) * np.array(
    [1.5, 0.8, 1.2, 0.5, 0.9, 1.1, 0.7, 0.6, 1.0, 0.8,
     0.9, 1.3, 0.4, 0.7, 0.8, 1.0, 0.5, 0.6, 0.9, 1.1,
     0.7, 0.8, 0.6, 0.9, 1.0, 0.8, 0.7, 0.5, 0.3]
)
# Add spending amount (log-normal)
amounts_normal = np.random.lognormal(3.0, 1.0, n_normal)

# Fraudulent transactions (~0.5% of this simulated data)
n_fraud = 50
X_fraud = np.random.randn(n_fraud, 29) * np.array(
    [3.0, 2.5, 3.0, 2.0, 2.5, 3.0, 2.0, 1.5, 2.5, 2.0,
     2.5, 3.0, 1.5, 2.0, 2.5, 3.0, 1.5, 2.0, 2.5, 3.0,
     2.0, 2.5, 2.0, 2.5, 3.0, 2.5, 2.0, 1.5, 1.0]
) + np.array([2]*29)
amounts_fraud = np.random.lognormal(5.0, 1.5, n_fraud)

# Combine
X_all = np.column_stack([
    np.vstack([X_normal, X_fraud]),
    np.concatenate([amounts_normal, amounts_fraud]).reshape(-1, 1)
])
y_all = np.array([0]*n_normal + [1]*n_fraud)

# Split: train on normal only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_all[:8000])  # First 8000 normal
X_test = scaler.transform(X_all[8000:])       # Remaining normal + all fraud
y_test = y_all[8000:]

# Build autoencoder
model = tf.keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(30,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(8, activation='relu'),        # Bottleneck
    layers.Dense(16, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(30, activation='linear'),
])

model.compile(optimizer='adam', loss='mse')
model.fit(X_train, X_train, epochs=50, batch_size=128,
          validation_split=0.1, verbose=0)

# Predict and compute anomaly scores
X_test_recon = model.predict(X_test, verbose=0)
anomaly_scores = np.mean((X_test - X_test_recon)**2, axis=1)

# Find optimal threshold using precision-recall
precision, recall, thresholds = precision_recall_curve(y_test, anomaly_scores)
pr_auc = auc(recall, precision)
print(f"PR-AUC: {pr_auc:.4f}")

# Use threshold at F1 maximum
f1_scores = 2 * precision * recall / (precision + recall + 1e-8)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[min(optimal_idx, len(thresholds)-1)]

# precision_recall_curve counts score >= threshold as positive, so match it here
y_pred = (anomaly_scores >= optimal_threshold).astype(int)
print(f"\nOptimal threshold: {optimal_threshold:.6f}")
print(classification_report(y_test, y_pred,
                            target_names=['Normal', 'Fraud']))

# Visualize reconstruction error distribution
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(anomaly_scores[y_test==0], bins=50, alpha=0.7, label='Normal', color='green')
ax.hist(anomaly_scores[y_test==1], bins=50, alpha=0.7, label='Fraud', color='red')
ax.axvline(optimal_threshold, color='black', linestyle='--', label=f'Threshold={optimal_threshold:.4f}')
ax.set_xlabel('Reconstruction Error')
ax.set_ylabel('Count')
ax.set_title('Anomaly Score Distribution')
ax.legend()
plt.tight_layout()
plt.savefig('fraud_detection.png', dpi=150)
plt.show()

Mini Project 3: Latent Space Explorer

Objective

Build a VAE on MNIST and create a latent space explorer that generates digits by sweeping the z₁ and z₂ coordinates over a grid, then interpolates between two digits in latent space.

Python — Latent Space Explorer
import numpy as np
import matplotlib.pyplot as plt

def latent_space_explorer(decoder, z1_range=(-3, 3), z2_range=(-3, 3),
                          steps=20, save_path='latent_explorer.png'):
    """Generate a grid of decoded images across the latent space."""
    z1_values = np.linspace(z1_range[0], z1_range[1], steps)
    z2_values = np.linspace(z2_range[0], z2_range[1], steps)

    canvas = np.zeros((28 * steps, 28 * steps))

    for i, z2 in enumerate(reversed(z2_values)):
        for j, z1 in enumerate(z1_values):
            z = np.array([[z1, z2]])
            img = decoder.predict(z, verbose=0).reshape(28, 28)
            canvas[i*28:(i+1)*28, j*28:(j+1)*28] = img

    fig, ax = plt.subplots(figsize=(12, 12))
    im = ax.imshow(canvas, cmap='inferno')
    ax.set_xlabel(f'z₁ ({z1_range[0]} → {z1_range[1]})')
    ax.set_ylabel(f'z₂ ({z2_range[0]} → {z2_range[1]})')
    ax.set_title('VAE Latent Space Explorer')

    # Add z-value ticks
    tick_positions = np.linspace(0, 28*steps-1, 7)
    ax.set_xticks(tick_positions)
    ax.set_xticklabels([f'{v:.1f}' for v in np.linspace(z1_range[0], z1_range[1], 7)])
    ax.set_yticks(tick_positions)
    ax.set_yticklabels([f'{v:.1f}' for v in np.linspace(z2_range[1], z2_range[0], 7)])

    plt.tight_layout()
    plt.savefig(save_path, dpi=150)
    plt.show()
    print(f"Saved to {save_path}")

# After training the VAE from Section 11.3:
# latent_space_explorer(decoder, steps=25)

def interpolate_digits(encoder, decoder, x1, x2, steps=10):
    """Smoothly interpolate between two digits in latent space."""
    z1, _, _ = encoder.predict(x1.reshape(1, -1), verbose=0)
    z2, _, _ = encoder.predict(x2.reshape(1, -1), verbose=0)

    alphas = np.linspace(0, 1, steps)
    fig, axes = plt.subplots(1, steps, figsize=(2*steps, 2))
    for i, alpha in enumerate(alphas):
        z = (1 - alpha) * z1 + alpha * z2
        img = decoder.predict(z, verbose=0).reshape(28, 28)
        axes[i].imshow(img, cmap='gray')
        axes[i].axis('off')
        axes[i].set_title(f'α={alpha:.1f}', fontsize=8)
    plt.suptitle('Latent Space Interpolation')
    plt.tight_layout()
    plt.savefig('interpolation.png', dpi=100)
    plt.show()

📝 End-of-Chapter Exercises

Exercise 1 (Conceptual)

Explain why a linear autoencoder with MSE loss learns the same subspace as PCA. What is the relationship between the encoder weights and principal components?

Exercise 2 (Mathematical)

Derive the gradient of the MSE reconstruction loss with respect to the decoder's final layer weights. Show all intermediate steps.

Exercise 3 (Coding)

Implement an autoencoder with a 3-dimensional bottleneck for the Iris dataset (4 features → 3 → 4). Plot the 3D latent space colored by species. Does it separate the classes?

Exercise 4 (Analytical)

For a sparse autoencoder with target sparsity ρ = 0.1, compute the KL penalty when a unit has average activation ρ̂ = 0.5. What is the gradient ∂KL/∂ρ̂ at this point?

Exercise 5 (Conceptual)

Why is the reparameterization trick necessary for VAEs? What goes wrong if you try to directly sample z ~ N(μ, σ²) and backpropagate through the sampling step?

Exercise 6 (Mathematical)

Prove that the KL divergence KL(N(μ, σ²) ‖ N(0, 1)) = -½(1 + log σ² - μ² - σ²) using the definition of KL divergence for continuous distributions.

Exercise 7 (Coding)

Train a VAE on the Fashion-MNIST dataset. Generate new clothing items by sampling from N(0, I). Which categories are easier to generate? Why?

Exercise 8 (Design)

You have a dataset of 100,000 sensor readings (50 features) from a factory. Only 0.1% are known faults. Design a complete anomaly detection pipeline using an autoencoder. Specify architecture, training strategy, threshold selection, and evaluation metrics.

Exercise 9 (Analytical)

In a β-VAE with β = 5, how does the loss landscape change compared to β = 1? What happens to reconstruction quality and latent space organization?

Exercise 10 (Coding)

Implement a denoising autoencoder that handles three types of noise: Gaussian, salt-and-pepper, and speckle. Compare PSNR for each noise type.

Exercise 11 (Research)

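Read the original VAE paper (Kingma & Welling, 2013), which compares its method to the wake-sleep algorithm. Explain how wake-sleep trains its recognition model in the "wake" and "sleep" phases, and why its two objectives, unlike the ELBO, do not jointly optimize a single bound on the marginal likelihood.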

Exercise 12 (Mathematical)

For a multivariate Gaussian q(z|x) = N(μ, diag(σ²)) and prior p(z) = N(0, I), derive the KL divergence for J dimensions. Show the step where the diagonal covariance assumption simplifies the trace term.

Exercise 13 (Coding)

Build a convolutional autoencoder for CIFAR-10 (32×32×3 color images). Compare reconstruction quality (SSIM, PSNR) for bottleneck sizes of 64, 128, and 256.

Exercise 14 (Application)

Design an autoencoder-based image compression system. For a 28×28 grayscale image, what is the compression ratio when using a 32-dimensional latent space? Compare with JPEG at equivalent bitrates.

Exercise 15 (Conceptual)

Explain the "posterior collapse" problem in VAEs. What causes it, and what strategies can mitigate it? (Hint: consider KL annealing and free bits.)

Exercise 16 (Coding)

Implement a sparse autoencoder with KL-divergence sparsity penalty. Train on MNIST and visualize the learned features (decoder weights) for the hidden layer. How do they compare to PCA components?

Exercise 17 (Mathematical)

Show that the ELBO is tight (equals log p(x)) if and only if q_φ(z|x) = p_θ(z|x). Prove this using the relationship log p(x) = ELBO + KL(q‖p).

Exercise 18 (Application)

A hospital wants to detect rare diseases from blood test panels (20 measurements). Design an autoencoder system. What bottleneck size would you choose? How would you handle the class imbalance? How would you set the anomaly threshold?

Exercise 19 (Coding)

Implement interpolation in VAE latent space: encode two MNIST digits of different classes, linearly interpolate between their latent representations, and decode the intermediate points. Create a smooth animation.

Exercise 20 (Conceptual)

Compare autoencoders with GANs for image generation. What are the strengths and weaknesses of each approach? When would you choose a VAE over a GAN?

Exercise 21 (Advanced)

Explain how Stable Diffusion uses a VAE. Why is performing diffusion in latent space advantageous over pixel space? What are the computational savings?

Exercise 22 (Coding)

Build a β-VAE with β ∈ {0.5, 1, 2, 4, 10}. For each β, compute and plot: (a) reconstruction MSE, (b) KL divergence, (c) latent space visualization. Identify the optimal β for MNIST.

❓ Multiple Choice Questions

1. What is the purpose of the bottleneck in an autoencoder?

(a) To increase the dimensionality of the data (b) To force the network to learn a compressed representation (c) To add noise to the data (d) To regularize the decoder weights

✅ (b) The bottleneck forces the network to learn a compressed, meaningful representation of the input, capturing only the most important features.

2. In the VAE loss function ELBO = E[log p(x|z)] - KL(q(z|x)‖p(z)), what does the KL divergence term encourage?

(a) Better reconstruction quality (b) The approximate posterior to match the prior distribution N(0, I) (c) Larger latent dimensions (d) Sparser activations in the encoder

✅ (b) The KL term pushes the learned posterior q(z|x) toward the standard normal prior p(z) = N(0,I), ensuring a smooth, continuous latent space suitable for generation.

3. What problem does the reparameterization trick solve?

(a) Vanishing gradients in deep networks (b) Inability to backpropagate through stochastic sampling operations (c) Overfitting on small datasets (d) Mode collapse during training

✅ (b) Sampling z ~ N(μ, σ²) is a non-differentiable operation. The reparameterization trick writes z = μ + σ·ε (where ε ~ N(0,1)), making z a differentiable function of μ and σ, enabling standard backpropagation.

4. A denoising autoencoder differs from a standard autoencoder in that:

(a) It has a larger bottleneck (b) Its input is corrupted while the target remains clean (c) It uses a different loss function (d) It has no decoder

✅ (b) The denoising AE receives corrupted input x̃ = x + noise but must reconstruct the original clean x. This forces the model to learn robust features rather than the identity function.

5. In β-VAE, setting β > 1 results in:

(a) Better reconstruction but worse latent space (b) More disentangled representations at the cost of reconstruction quality (c) Faster training convergence (d) Larger model capacity

✅ (b) Higher β puts more weight on the KL term, forcing each latent dimension to be more independent (disentangled), but reconstruction quality typically degrades.

6. An overcomplete autoencoder has:

(a) A bottleneck smaller than the input dimension (b) A bottleneck equal to or larger than the input dimension (c) No hidden layers (d) Only linear activations

✅ (b) An overcomplete autoencoder has dim(z) ≥ dim(x). Without regularization, it can learn the trivial identity function. Sparsity, denoising, or contractive penalties are needed.

7. In anomaly detection using autoencoders, an anomaly is identified when:

(a) The latent code is close to zero (b) The reconstruction error is below a threshold (c) The reconstruction error exceeds a threshold (d) The input matches training data exactly

✅ (c) The AE is trained on normal data. Anomalous inputs are poorly reconstructed (high error) because they differ from the learned normal patterns.

8. Which loss function is most appropriate for an autoencoder with binary input data?

(a) Mean Absolute Error (b) Mean Squared Error (c) Binary Cross-Entropy (d) Hinge Loss

✅ (c) BCE is natural for binary data. It corresponds to maximizing the Bernoulli log-likelihood of the reconstruction, with the decoder using a sigmoid output activation.

9. In Stable Diffusion, the VAE is used to:

(a) Generate text prompts (b) Compress images to a latent space where diffusion happens (c) Train the CLIP text encoder (d) Perform attention computations

✅ (b) The VAE compresses 512×512 images to a 64×64 latent representation. The diffusion process operates in this smaller space for efficiency, and the VAE decoder upsamples back to full resolution.

10. The ELBO (Evidence Lower BOund) is a lower bound on:

(a) The reconstruction error (b) The KL divergence (c) The log marginal likelihood log p(x) (d) The entropy of the latent distribution

✅ (c) ELBO ≤ log p(x), with equality when q(z|x) = p(z|x). Maximizing the ELBO simultaneously improves the generative model and the approximate posterior.

11. Posterior collapse in VAEs occurs when:

(a) The decoder becomes too powerful and ignores the latent code z (b) The encoder produces very large μ values (c) The learning rate is too small (d) The bottleneck is too narrow

✅ (a) When the decoder is powerful enough (e.g., autoregressive), it can reconstruct well without using z, causing q(z|x) to collapse to the prior p(z). The KL goes to 0 but the latent space becomes uninformative.

12. Which of the following is NOT a valid regularization technique for autoencoders?

(a) Denoising (corrupting inputs) (b) Sparsity penalty (L1 or KL) (c) Contractive penalty (Jacobian Frobenius norm) (d) Increasing the bottleneck to match input dimension

✅ (d) Increasing the bottleneck size removes the compression constraint. It makes the model overcomplete without regularization, which is the opposite of regularizing — it makes learning the identity trivial.

💼 Interview Questions

Q1: Explain the difference between an Autoencoder and a Variational Autoencoder.

Standard AE: Deterministic mapping. Encoder produces a single point z = f(x) in latent space. Good for reconstruction and compression. Cannot generate new data because the latent space may have "holes" — regions that don't correspond to valid data.

VAE: Probabilistic. Encoder produces parameters of a distribution q(z|x) = N(μ, σ²). Each input maps to a region, not a point. The KL regularization ensures the latent space is smooth and continuous. New data can be generated by sampling z ~ N(0, I) and decoding.

Key differences: (1) VAE has a principled training objective (ELBO), (2) VAE latent space is smooth → good for generation, (3) VAE reconstructions tend to be blurrier than AE due to the KL regularization.

Q2: What is the reparameterization trick, and why is it essential?

In a VAE, we need to sample z from q(z|x) = N(μ_φ(x), σ²_φ(x)). However, sampling is non-differentiable — you can't compute gradients through a random sampling operation.

The reparameterization trick rewrites: z = μ + σ ⊙ ε, where ε ~ N(0, I). Now z is a deterministic function of μ, σ, and ε. The randomness is "externalized" to ε, which doesn't depend on parameters. Gradients ∂L/∂μ and ∂L/∂σ can be computed normally via backpropagation.
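The rewrite above can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's implementation: the function name `reparameterize` and the log-variance parameterization are assumptions, and in a real VAE the framework's automatic differentiation would carry gradients through the same expression.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Return z = mu + sigma * eps with eps ~ N(0, I), plus the eps that was drawn."""
    sigma = np.exp(0.5 * log_var)          # log-variance -> standard deviation
    eps = rng.standard_normal(mu.shape)    # the randomness does not depend on mu or sigma
    return mu + sigma * eps, eps

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, np.log(4.0)])     # sigma = [1, 2]

z, eps = reparameterize(mu, log_var, rng)

# With eps held fixed, z is an ordinary differentiable function of the parameters:
#   dz/dmu = 1   and   dz/dsigma = eps
dz_dmu = np.ones_like(mu)
dz_dsigma = eps

# Sanity check: many draws recover the intended mean and standard deviation.
mu_batch = np.broadcast_to(mu, (20000, 2))
samples, _ = reparameterize(mu_batch, log_var, rng)
print("sample mean:", samples.mean(axis=0))
print("sample std: ", samples.std(axis=0))
```

The design point is that `eps` is drawn without reference to `mu` or `log_var`, so the only path from the parameters to `z` is the deterministic arithmetic on the return line.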

Q3: How would you use autoencoders for anomaly detection? Walk through the complete pipeline.

Pipeline: (1) Collect labeled normal data. (2) Train an autoencoder to reconstruct normal data only. (3) For new data, compute reconstruction error e = ‖x - x̂‖². (4) Set threshold τ using validation data (e.g., 95th percentile of normal errors, or optimize F1 on a validation set). (5) Flag points with e > τ as anomalies.

Why it works: The AE learns to reconstruct normal patterns well. Anomalies have different patterns → poor reconstruction → high error.

Advanced: Use VAE and compute log p(x) ≈ ELBO for anomaly scoring. Combine reconstruction error + KL divergence. Use ensemble of autoencoders. Consider the Mahalanobis distance in latent space.
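The five pipeline steps can be sketched end to end on synthetic data. To keep the example self-contained, a PCA projection (a linear autoencoder) stands in for the trained network; the data generator, the 3-dimensional bottleneck, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: "normal" points lie near a 3-D subspace of a 10-D space.
basis = rng.standard_normal((3, 10))

def make_normal(n):
    return rng.standard_normal((n, 3)) @ basis + 0.05 * rng.standard_normal((n, 10))

X_train, X_val = make_normal(2000), make_normal(500)
X_test_normal = make_normal(500)
X_test_anomaly = 2.0 * rng.standard_normal((50, 10))   # points far off the subspace

# Steps 1-2: fit the "autoencoder" on normal data only (here: top-3 PCA directions).
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
V = Vt[:3].T                                           # shape (10, 3)

def reconstruction_error(X):
    Z = (X - mean) @ V              # encode
    X_hat = Z @ V.T + mean          # decode
    return np.sum((X - X_hat) ** 2, axis=1)            # step 3: e = ||x - x_hat||^2

# Step 4: threshold = 95th percentile of errors on held-out normal data.
tau = np.percentile(reconstruction_error(X_val), 95)

# Step 5: flag anything above the threshold.
false_alarm_rate = np.mean(reconstruction_error(X_test_normal) > tau)
detection_rate = np.mean(reconstruction_error(X_test_anomaly) > tau)
print(f"tau={tau:.4f}  false alarms={false_alarm_rate:.1%}  detected={detection_rate:.1%}")
```

A percentile threshold fixes the false-alarm rate on normal data at roughly 5% by construction; when labeled anomalies are available, tuning the threshold for F1 on a validation set is the alternative mentioned above.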

Q4: Derive the ELBO and explain each term intuitively.

Derivation: Start from log p(x) = log ∫ p(x|z)p(z)dz. Introduce q(z|x), apply Jensen's inequality (or the Bayes' rule derivation). Get ELBO = E_q[log p(x|z)] - KL(q(z|x)‖p(z)).

Term 1 (Reconstruction): "How well can the decoder reconstruct x from the sampled z?" Maximizing this improves reconstruction quality.

Term 2 (KL Regularization): "How close is the learned posterior to the prior?" Minimizing this ensures the latent space is well-organized and suitable for generation.

The gap: log p(x) = ELBO + KL(q‖p_true). The ELBO becomes tight when our approximate posterior equals the true posterior.
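Written out, the Jensen's-inequality route sketched above looks as follows. The notation q_φ, p_θ follows the chapter's exercises; this is one standard presentation of the derivation, not the only one.

```latex
\begin{aligned}
\log p_\theta(x)
  &= \log \int p_\theta(x \mid z)\, p(z)\, dz
   = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] \\
  &\ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]
   \quad \text{(Jensen's inequality, since $\log$ is concave)} \\
  &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
   - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
   \;=\; \mathrm{ELBO}, \\[4pt]
\log p_\theta(x) - \mathrm{ELBO}
  &= \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) \;\ge\; 0 .
\end{aligned}
```

The last line is the gap: it vanishes exactly when the approximate posterior equals the true posterior.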

Q5: What is posterior collapse, and how do you mitigate it?

Problem: When the decoder is powerful (e.g., autoregressive RNN), it can reconstruct without using z. The optimizer drives KL(q‖p) → 0, making q(z|x) ≈ p(z) for all x. The latent code becomes uninformative.

Mitigations: (1) KL annealing: start with β=0 and gradually increase to 1 during training. (2) Free bits: set a minimum KL per dimension (e.g., KL ≥ 0.1 per dim). (3) Weaken the decoder (e.g., use a simpler architecture). (4) Use skip connections from encoder to decoder. (5) Use aggressive training schedule for encoder.
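Mitigations (1) and (2) are small enough to sketch directly. The linear warm-up shape, the function names, and the example KL values below are assumptions for illustration; other annealing schedules (cyclical, sigmoid) are also used in practice.

```python
import numpy as np

def kl_weight(step, warmup_steps):
    """Linear KL annealing: the weight rises from 0 to 1 over `warmup_steps`, then stays at 1."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, min_kl=0.1):
    """Free bits: every dimension contributes at least `min_kl` nats to the penalty,
    so the optimizer gains nothing by pushing a dimension's KL below that floor."""
    return np.sum(np.maximum(kl_per_dim, min_kl))

# Hypothetical batch-averaged KL for four latent dimensions.
kl_per_dim = np.array([0.02, 0.5, 0.0, 1.3])

print([kl_weight(s, 1000) for s in (0, 250, 1000, 5000)])   # [0.0, 0.25, 1.0, 1.0]
print(free_bits_kl(kl_per_dim))                             # 0.1 + 0.5 + 0.1 + 1.3, about 2.0
```

In a training loop the loss would be `reconstruction + kl_weight(step, warmup_steps) * kl`, with `free_bits_kl` optionally replacing the plain KL sum.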

Q6: Compare VAEs and GANs for image generation.

VAEs: ✅ Principled training (ELBO), stable training, meaningful latent space, good for interpolation, provides a tractable lower bound on the likelihood. ❌ Blurry outputs, less sharp than GANs.

GANs: ✅ Sharp, realistic images. ❌ Mode collapse, training instability, no explicit likelihood, harder to control.

When to use VAE: Need smooth latent space, stable training, likelihood estimation, anomaly detection, representation learning. When to use GAN: Need highest visual quality, have resources for careful tuning, super-resolution, style transfer.

Q7: Explain disentanglement and how β-VAE achieves it.

Disentanglement: Each dimension of the latent space captures one independent factor of variation (e.g., z₁ = rotation, z₂ = color, z₃ = size). Changing one dimension should change only one attribute.

β-VAE: By setting β > 1, we increase the weight of the KL term. This forces q(z|x) to be very close to N(0,I), which has independent dimensions. The stronger KL pressure forces each dimension to be independent, encouraging disentanglement. The cost is lower reconstruction quality.
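As a concrete sketch, the β-VAE objective is the usual negative ELBO with the KL term multiplied by β. The squared-error reconstruction term, the function names, and the default β = 4 are illustrative assumptions; the closed-form KL is the diagonal-Gaussian expression used in the exercises.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims, one value per example."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    """Negative ELBO with the KL term scaled by beta (beta = 1 is the standard VAE)."""
    recon = np.sum((x - x_hat) ** 2, axis=-1)      # squared-error reconstruction term
    kl = gaussian_kl(mu, log_var)
    return np.mean(recon + beta * kl), np.mean(recon), np.mean(kl)

# One toy example: reconstruction error 0.25, KL 0.5.
x = np.array([[1.0, 0.0, 1.0]])
x_hat = np.array([[0.5, 0.0, 1.0]])
mu = np.array([[1.0, 0.0]])
log_var = np.zeros((1, 2))

for beta in (1.0, 4.0):
    total, recon, kl = beta_vae_loss(x, x_hat, mu, log_var, beta)
    print(f"beta={beta}: total={total:.2f}  recon={recon:.2f}  kl={kl:.2f}")
```

With the same encoder and decoder outputs, raising β from 1 to 4 changes only how heavily the KL term counts, which is what shifts the optimum toward a posterior closer to N(0, I).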

Q8: How does Stable Diffusion use VAEs?

Stable Diffusion performs "Latent Diffusion." Step 1: A pre-trained VAE encoder compresses a 512×512×3 image to a 64×64×4 latent representation. Step 2: The diffusion process (iterative denoising by a U-Net) operates entirely in this latent space, guided by text embeddings from CLIP. Step 3: The VAE decoder upsamples the final latent back to 512×512 pixels.

Why: The latent holds about 48× fewer values than the image (512·512·3 = 786,432 vs. 64·64·4 = 16,384), so every U-Net denoising step is far cheaper and both training and inference are dramatically faster. The VAE's latent space is perceptually meaningful, so diffusion learns high-level structure rather than pixel-level details.
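The factor of 48 follows directly from the tensor shapes quoted in the answer. A quick check, using only those shapes:

```python
import math

image_shape = (512, 512, 3)    # height, width, RGB channels
latent_shape = (64, 64, 4)     # height, width, latent channels

image_values = math.prod(image_shape)      # 786432
latent_values = math.prod(latent_shape)    # 16384

print(image_values / latent_values)            # 48.0
print(image_shape[0] // latent_shape[0])       # 8 (downsampling factor per side)
```

Note that this counts values per image, not FLOPs; the actual compute saving depends on the U-Net architecture.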

Q9: You're building a data compression system. Why might you prefer an autoencoder over JPEG?

Advantages of AE-based compression: (1) Learns domain-specific compression — an AE trained on faces will compress faces better than general-purpose JPEG. (2) Can achieve much higher compression ratios for specific data types. (3) The latent space captures semantic information, enabling search and manipulation of compressed data. (4) Can handle arbitrary data types beyond images (audio, molecular data, etc.).

Disadvantages: Requires training, slower encode/decode (GPU needed), not standardized, model must be shipped with compressed data. JPEG is universal, fast, hardware-supported, and good enough for general images.

Q10: Explain the connection between PCA and linear autoencoders.

A single-layer linear autoencoder with MSE loss learns the same subspace as PCA. Specifically, the encoder weight matrix spans the same column space as the top-k principal components. However, the weights may not be orthogonal — PCA gives orthogonal components, while the linear AE finds an arbitrary basis for the same subspace.

Adding nonlinear activations (typically together with more than one hidden layer) lets the autoencoder capture curved, nonlinear manifolds that PCA, which is restricted to linear subspaces, cannot. This is why autoencoders are sometimes called "nonlinear PCA."
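The "same subspace, arbitrary basis" point can be checked numerically. The sketch below does not train anything: it constructs linear encoder and decoder weights by remixing the PCA basis with an invertible matrix `A` (a hypothetical choice) and confirms that the reconstructions match PCA even though the encoder weights are not orthonormal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6)) @ rng.standard_normal((6, 6))   # correlated features
X = X - X.mean(axis=0)
k = 2

# PCA: the top-k right singular vectors are orthonormal principal directions.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                                  # shape (6, k)
X_pca = X @ V_k @ V_k.T

# A linear autoencoder spanning the same subspace with a non-orthogonal basis.
A = np.array([[2.0, 1.0],
              [0.5, 3.0]])                      # any invertible k x k matrix
W_enc = V_k @ A                                 # encoder: z = x @ W_enc
W_dec = np.linalg.inv(A) @ V_k.T                # decoder: x_hat = z @ W_dec
X_ae = X @ W_enc @ W_dec

print(np.allclose(X_pca, X_ae))                     # True: identical reconstructions
print(np.allclose(W_enc.T @ W_enc, np.eye(k)))      # False: basis is not orthonormal
```

Gradient descent on a linear autoencoder can land on any such `A`, which is why its learned weights generally differ from the principal components while the reconstruction error is the same.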

Q11: How would you choose the latent dimension for an autoencoder?

Methods: (1) Cross-validation on reconstruction error: plot error vs. latent dim and find the "elbow." (2) Information-theoretic: estimate the intrinsic dimensionality of the data (e.g., with MLE or correlation-dimension estimators). (3) Downstream task performance: pick the latent dim that maximizes classification/clustering performance. (4) Variance explained: analogous to choosing PCA components by cumulative variance. (5) β-VAE active units: count how many latent dimensions have an average KL above a small threshold (e.g., 0.01 nats); dimensions whose KL stays near zero have collapsed to the prior and can be removed.
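Method (5) can be sketched as follows. The helper name `count_active_units`, the 0.01-nat threshold, and the synthetic encoder outputs are assumptions for illustration; in practice `mu` and `log_var` would come from running the trained encoder over a validation set.

```python
import numpy as np

def count_active_units(mu, log_var, threshold=0.01):
    """Count latent dims whose KL to N(0, 1), averaged over the data, exceeds `threshold`."""
    kl_per_dim = -0.5 * (1.0 + log_var - mu**2 - np.exp(log_var))   # shape (N, J)
    mean_kl = kl_per_dim.mean(axis=0)                               # shape (J,)
    return int(np.sum(mean_kl > threshold)), mean_kl

# Hypothetical encoder outputs for N=1000 inputs and J=4 latent dims:
# dims 0-1 carry information, dims 2-3 have collapsed to the prior.
rng = np.random.default_rng(0)
mu = np.concatenate([rng.standard_normal((1000, 2)),
                     0.01 * rng.standard_normal((1000, 2))], axis=1)
log_var = np.concatenate([np.full((1000, 2), -2.0),
                          np.zeros((1000, 2))], axis=1)

n_active, mean_kl = count_active_units(mu, log_var)
print(n_active, np.round(mean_kl, 3))
```

Here `n_active` is 2: the collapsed dimensions have a mean KL on the order of 1e-4 nats, far below the threshold.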

Q12: What is the difference between denoising, sparse, contractive, and variational autoencoders in terms of regularization?

Denoising: Regularizes by corrupting the input (additive noise, masking, etc.). Forces learning robust features.

Sparse: Regularizes by penalizing hidden activations (L1 norm or KL on average activations). Forces selective feature activation.

Contractive: Regularizes by penalizing the Frobenius norm of the encoder's Jacobian ∂h/∂x. Forces the representation to be locally invariant to small input changes.

Variational: Regularizes by enforcing a prior distribution on the latent space via KL divergence. Forces a smooth, continuous, generation-friendly latent space.

All four prevent the autoencoder from learning trivial identity mappings, but each biases the solution toward different properties.
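Two of these penalties have compact closed forms that are easy to sketch. The function names, the sigmoid encoder, and the toy values below are assumptions for illustration; the contractive formula relies on the Jacobian of h = sigmoid(Wx + b) being diag(h(1 − h))·W.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sparsity_kl(rho, rho_hat):
    """Sparse AE penalty: sum over hidden units of KL(Bernoulli(rho) || Bernoulli(rho_hat))."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def contractive_penalty(x, W, b):
    """Contractive AE penalty: squared Frobenius norm of dh/dx for h = sigmoid(W x + b)."""
    h = sigmoid(W @ x + b)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))      # 5 inputs -> 3 hidden units
b = rng.standard_normal(3)
x = rng.standard_normal(5)

# The sparsity penalty is zero when every unit's average activation hits the target.
print(sparsity_kl(0.05, np.array([0.05, 0.05, 0.05])))
print(sparsity_kl(0.05, np.array([0.05, 0.20, 0.40])))

# The closed form agrees with the explicitly built Jacobian.
h = sigmoid(W @ x + b)
jacobian = (h * (1 - h))[:, None] * W
print(contractive_penalty(x, W, b), np.sum(jacobian ** 2))
```

Either term would be added to the reconstruction loss with a weighting coefficient chosen by validation.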

🔬 Research Problems

Research Problem 1: Hierarchical VAEs for Indian Language Generation

Design a hierarchical VAE (with multiple levels of latent variables) for generating text in Indian languages (Hindi, Tamil, Telugu). The hierarchy should capture character-level, word-level, and sentence-level structure. How would you handle the diverse scripts (Devanagari, Tamil, Telugu)? Propose a unified tokenization scheme and evaluate using perplexity and BLEU scores.

Starting point: Sønderby et al. (2016) "Ladder Variational Autoencoders" + IndicNLP corpus

Research Problem 2: VAE-based Drug Discovery for Tropical Diseases

Tropical diseases disproportionately affect India and other developing countries but receive less pharmaceutical R&D investment. Propose a VAE architecture for molecular generation that targets specific protein targets for malaria, dengue, or tuberculosis. The model should encode SMILES strings of known active compounds and generate novel candidates with desired properties (binding affinity, toxicity, synthesizability). How would you evaluate the generated molecules?

Starting point: Gómez-Bombarelli et al. (2018) "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules"

Research Problem 3: Improving VAE Latent Space Quality

Despite β-VAE and other techniques, disentanglement in VAE latent spaces remains an open problem. Propose and evaluate a novel regularization technique that encourages disentanglement without sacrificing reconstruction quality. Compare your method against β-VAE, FactorVAE, and DIP-VAE on standard benchmarks (dSprites, CelebA). Prove theoretically under what conditions your method achieves perfect disentanglement.

Starting point: Locatello et al. (2019) "Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations"

Research Problem 4: Autoencoder-Diffusion Hybrid for Indian Art Generation

Combine VAEs with diffusion models to generate Indian art styles (Madhubani, Warli, Rajasthani miniature, Tanjore). Design a conditional generative model where the VAE encodes art style into a latent code, and a diffusion model generates the content. How do you handle the limited training data for each art form? Propose few-shot learning strategies and evaluate with FID score and art expert evaluation.

🎯 Key Takeaways

1

An autoencoder learns to compress data through a bottleneck (encoder→latent→decoder). The bottleneck forces the model to capture only the most essential features, making it a powerful tool for dimensionality reduction, denoising, and feature learning.

2

Undercomplete autoencoders (dim(z) < dim(x)) compress by necessity, while overcomplete ones (dim(z) ≥ dim(x)) need regularization (sparsity, denoising, contractive penalties) to learn useful features instead of the trivial identity function.

3

VAEs transform autoencoders into principled generative models by learning a probability distribution q(z|x) rather than a deterministic mapping. The ELBO objective = Reconstruction – KL Divergence balances quality with latent space regularity.

4

The reparameterization trick (z = μ + σ·ε, ε ~ N(0,I)) is what makes VAEs trainable — it externalizes randomness so that gradients can flow through the sampling step via standard backpropagation.

5

β-VAE controls the reconstruction-disentanglement tradeoff: β = 1 is standard VAE, β > 1 encourages disentangled representations (each latent dim captures an independent factor), and β < 1 prioritizes reconstruction quality.

6

Autoencoders are the backbone of anomaly detection: train on normal data, then flag high-reconstruction-error inputs as anomalies. This is used in fraud detection (Razorpay), network security (Jio), and medical diagnostics (Qure.ai).

7

Latent diffusion models such as Stable Diffusion rely on a VAE as a compression front-end: the VAE compresses images to a compact latent space (512×512×3 → 64×64×4, about 48× fewer values), and diffusion operates in that space at a fraction of the cost of pixel-space diffusion.

8

The latent space of a well-trained VAE is continuous and smooth — interpolation between two points produces semantically meaningful transitions, and random sampling from N(0,I) generates plausible new data. This makes VAEs indispensable for creative AI applications.

9

From Aadhaar biometric compression to Stable Diffusion image generation, autoencoders bridge fundamental representation learning and cutting-edge generative AI — making them one of the most versatile architectures in modern deep learning.

📚 References

Foundational Papers

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning representations by back-propagating errors." Nature, 323(6088), 533–536.
Hinton, G. E., & Salakhutdinov, R. R. (2006). "Reducing the Dimensionality of Data with Neural Networks." Science, 313(5786), 504–507.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008). "Extracting and Composing Robust Features with Denoising Autoencoders." ICML 2008.
Kingma, D. P., & Welling, M. (2013). "Auto-Encoding Variational Bayes." arXiv:1312.6114. — The VAE paper.
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). "Stochastic Backpropagation and Approximate Inference in Deep Generative Models." ICML 2014.
Higgins, I., Matthey, L., Pal, A., et al. (2017). "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework." ICLR 2017.

Diffusion Models & Modern Extensions

Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." NeurIPS 2020.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR 2022. — Stable Diffusion paper.
Ramesh, A., et al. (2021). "Zero-Shot Text-to-Image Generation." ICML 2021. — DALL-E paper.
Ramesh, A., et al. (2022). "Hierarchical Text-Conditional Image Generation with CLIP Latents." — DALL-E 2.

Textbooks & Surveys

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapters 14 & 20.
Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Chapters 21–22.
Bank, D., Koenigstein, N., & Giryes, R. (2020). "Autoencoders." arXiv:2003.05991. — Comprehensive survey.
Kingma, D. P., & Welling, M. (2019). "An Introduction to Variational Autoencoders." Foundations and Trends in ML, 12(4).

Indian Context

UIDAI Technical Architecture documents — Aadhaar biometric system specifications.
Jio Network Operations Center — Public technical blog posts on AI-driven network management.
ISRO NRSC — Remote sensing data compression standards for Indian satellites.
Qure.ai research publications on medical image analysis for Indian healthcare contexts.

🌊 Bonus: Diffusion Models Overview

From VAEs to Diffusion: The Connection

Diffusion models can be seen as an extreme form of hierarchical VAE with T levels of latent variables. Instead of compressing data in a single step, they gradually add noise over T timesteps (forward process) and learn to reverse each step (reverse process).

DDPM (Denoising Diffusion Probabilistic Model)

Forward Process (Adding Noise)

q(x_t | x_{t-1}) = N(x_t; √(1-β_t) · x_{t-1}, β_t · I)

After T steps, x_T is approximately distributed as N(0, I) (pure noise).
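The forward process can be simulated by applying the one-step Gaussian above T times. In this sketch the linear β schedule from 1e-4 to 0.02 with T = 1000 follows the DDPM paper, while the toy "data" (values clustered around 3.0) is an assumption chosen to be clearly non-Gaussian-standard at the start.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # noise schedule beta_1 ... beta_T

# Toy data far from N(0, 1): 10,000 scalar samples clustered around 3.0.
x = 3.0 + 0.1 * rng.standard_normal(10000)
print(f"t=0:    mean={x.mean():.3f}  std={x.std():.3f}")

# Apply q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I) step by step.
for beta_t in betas:
    eps = rng.standard_normal(x.shape)
    x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * eps

print(f"t={T}: mean={x.mean():.3f}  std={x.std():.3f}")
```

After the loop the original signal has been scaled down by the product of all the √(1 − β_t) factors, so the samples are statistically close to standard normal noise regardless of where they started.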

Reverse Process (Denoising — what the network learns)

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))

The network predicts the noise ε at each step:
Loss = E[‖ε - ε_θ(x_t, t)‖²] ← Simple MSE on predicted noise!
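A sketch of how this loss is evaluated for one batch. It uses the standard closed form x_t = √ᾱ_t·x_0 + √(1 − ᾱ_t)·ε, with ᾱ_t the running product of (1 − β_s), which follows from composing the Gaussian forward steps. The function `eps_model` is a placeholder for the trained network, and the batch shape and schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)          # product of (1 - beta_s) for s <= t

def q_sample(x0, t, eps):
    """Jump straight to step t: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar[t][:, None]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def eps_model(x_t, t):
    """Placeholder for the trained network eps_theta(x_t, t); here it always predicts zeros."""
    return np.zeros_like(x_t)

x0 = rng.standard_normal((8, 16))            # hypothetical batch of 8 flattened samples
t = rng.integers(0, T, size=8)               # one random (0-based) timestep per sample
eps = rng.standard_normal(x0.shape)          # the noise the network must recover

x_t = q_sample(x0, t, eps)
loss = np.mean(np.sum((eps - eps_model(x_t, t)) ** 2, axis=1))
print(f"loss with the zero predictor: {loss:.3f}")

# A perfect predictor would invert q_sample and drive the loss to zero.
a = alpha_bar[t][:, None]
eps_recovered = (x_t - np.sqrt(a) * x0) / np.sqrt(1.0 - a)
```

Training replaces `eps_model` with a neural network (a U-Net in DDPM) and minimizes this loss over random draws of `x0`, `t`, and `eps`.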

Stable Diffusion = VAE + Diffusion in Latent Space

The key insight of Latent Diffusion Models (LDM):

Train a powerful VAE to compress images (512×512 → 64×64 latent)
Perform the entire diffusion process in the compressed latent space
Use the VAE decoder to upsample the final clean latent back to an image
Condition on text via CLIP embeddings through cross-attention in the U-Net

This approach is what enabled consumer-grade GPUs to generate stunning images — the VAE handles the compression, and diffusion handles the creative generation.

Evolution: VAE → Diffusion → Latent Diffusion (Stable Diffusion)

VAE (2013)         DDPM (2020)        Stable Diffusion (2022)
┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐
│ x→[Enc]→z   │    │ x→noise→... │    │ x→[VAE Enc]→z       │
│ z→[Dec]→x̂   │    │ denoise←... │    │ z→noise→...         │
│ ELBO loss   │    │ predict ε   │    │ denoise in z-space  │
└─────────────┘    │ pixel space │    │ z→[VAE Dec]→x̂       │
✓ Generation       └─────────────┘    └─────────────────────┘
✗ Blurry           ✓ High quality     ✓ High quality
                   ✗ Slow (pixel)     ✓ Fast (latent)
                   ✗ GPU hungry       ✓ Consumer GPU OK