Chapter 16: Dimensionality Reduction — PCA, t-SNE, UMAP

1

Learning Objectives

Upon completing this chapter, you will be able to:

1

Explain the curse of dimensionality and why reducing dimensions is essential

2

Derive PCA from the maximum variance perspective using eigenvalue decomposition

3

Compute the covariance matrix, eigenvalues, eigenvectors, and project data manually

4

Select the optimal number of principal components using explained variance ratios and scree plots

5

Connect Singular Value Decomposition (SVD) to PCA and understand Kernel PCA for non-linear data

6

Understand t-SNE's probabilistic framework: conditional probabilities, KL divergence, perplexity, and the crowding problem

7

Compare UMAP's topological approach: fuzzy simplicial sets, cross-entropy optimization, and superior scalability

8

Apply LDA as a supervised dimensionality reduction technique maximizing class separability

9

Distinguish feature selection from feature extraction and choose the right technique for a given scenario

10

Implement PCA from scratch in Python using NumPy eigendecomposition

11

Build complete pipelines with scikit-learn PCA, t-SNE, and UMAP with publication-quality visualizations

12

Apply dimensionality reduction to real-world problems: Eigenfaces, satellite imagery, biometrics, and genomics

2

Introduction

Imagine you're trying to describe a person. You could use thousands of measurements — height, weight, shoe size, arm length, hair color (encoded numerically), income, education years, and so on. But do you really need all of them? Height and arm length are highly correlated; so are income and education level. Much of the information in those thousands of features is redundant.

Dimensionality reduction is the art and science of finding a smaller set of variables that captures the essential structure of your data. It is one of the most powerful tools in the machine learning arsenal — used for visualization, noise removal, computational speedup, and combating the infamous curse of dimensionality.

This chapter explores four major families of techniques:

PCA (Principal Component Analysis) — The classical linear method based on maximum variance / eigenvalue decomposition
t-SNE (t-distributed Stochastic Neighbor Embedding) — A non-linear method optimized for 2D/3D visualization
UMAP (Uniform Manifold Approximation and Projection) — A newer non-linear method that preserves both local and global structure
LDA (Linear Discriminant Analysis) — A supervised method that maximizes class separability

We'll also study the connection between PCA and SVD (Singular Value Decomposition), explore Kernel PCA for non-linear mappings, and contrast feature selection with feature extraction. Every concept will be accompanied by mathematical derivations from first principles, hand-computed examples, and Python code.

3

Historical Background

3.1 The Birth of PCA (1901–1936)

Karl Pearson introduced the concept in 1901 as fitting planes to point clouds by minimizing orthogonal distances. Harold Hotelling formalized it in 1933 using eigenvalue decomposition of the covariance matrix, giving us the PCA we know today. The method became computationally practical only in the 1960s with the advent of digital computers and efficient algorithms for eigenvalue computation (QR algorithm by John Francis, 1961).

3.2 SVD and the Numerical Revolution (1960s–1970s)

Gene Golub and William Kahan's 1965 algorithm for computing the Singular Value Decomposition made SVD numerically stable and efficient. The realization that PCA could be computed via SVD (without explicitly forming the covariance matrix) was transformative for large datasets.

3.3 Kernel Methods and Non-Linear Extensions (1998)

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller introduced Kernel PCA in 1998, extending PCA to non-linear manifolds using the kernel trick — the same insight that powered SVMs.

3.4 t-SNE: A Visualization Revolution (2008)

Laurens van der Maaten and Geoffrey Hinton published t-SNE in 2008, building on Hinton and Roweis's earlier SNE (2002). The key innovation was using a Student's t-distribution in the low-dimensional space to solve the crowding problem. The method quickly became the gold standard for visualizing high-dimensional data like MNIST digits and gene expression data.

3.5 UMAP: Speed and Global Structure (2018)

Leland McInnes, John Healy, and James Melville published UMAP in 2018, grounding dimensionality reduction in the mathematics of algebraic topology (Riemannian geometry and fuzzy simplicial sets). UMAP offered dramatic speedups over t-SNE while better preserving global structure.

4

Conceptual Explanation

4.1 The Curse of Dimensionality

Richard Bellman coined the term "curse of dimensionality" in 1961. The key problems are:

Problem 1: Distance Concentration

In high dimensions, all points become approximately equidistant. If you sample points uniformly in a d-dimensional unit hypercube, the ratio of the maximum to minimum distance approaches 1 as d → ∞. This makes distance-based algorithms (KNN, K-Means, RBF kernels) unreliable.

Problem 2: Volume Explosion

A d-dimensional unit hypercube has 2d corners. To sample the space with the same density as 10 samples per dimension, you need 10^d points. For d = 100 features, that's 10¹⁰⁰ — vastly more than atoms in the universe (~10⁸⁰).

Problem 3: Empty Space

Most of the volume of a high-dimensional unit sphere concentrates near its surface. A d-dimensional sphere of radius r = 0.99 has volume (0.99)^d relative to the full sphere. For d = 500, this is essentially 0. Almost all points are "on the boundary."

Problem 4: Statistical Sparsity

More features mean more parameters to estimate, requiring exponentially more data to achieve the same statistical reliability. Overfitting becomes inevitable.

4.2 Feature Selection vs Feature Extraction

Aspect	Feature Selection	Feature Extraction
Approach	Choose a subset of original features	Create new features as combinations of originals
Interpretability	High — original features are preserved	Low — new features may be abstract
Methods	Filter (correlation), Wrapper (RFE), Embedded (Lasso)	PCA, t-SNE, UMAP, Autoencoders
Information Loss	Higher — discarded features are gone	Lower — all features contribute to new ones
When to Use	When interpretability is critical	When maximum information preservation is needed

4.3 Overview of Methods

PCA — Linear, Unsupervised

PCA finds orthogonal directions (principal components) that maximize variance. It's a linear transformation — the new features are linear combinations of the old ones. Best for: preprocessing, noise removal, compression, and exploratory analysis when data lies on a linear subspace.

t-SNE — Non-Linear, Unsupervised

t-SNE models pairwise similarities as probabilities in both high-D and low-D spaces, then minimizes the KL divergence between them. It excels at revealing cluster structure in 2D/3D visualizations. Limitations: non-parametric (can't transform new points), slow on large datasets, random initialization can give different results.

UMAP — Non-Linear, Unsupervised

UMAP builds a weighted graph in high-D space (fuzzy simplicial set), then optimizes a low-D graph to match it. It preserves both local and global structure, runs much faster than t-SNE, and produces a parametric mapping that can transform new data.

LDA — Linear, Supervised

LDA finds directions that maximize the ratio of between-class variance to within-class variance. Unlike PCA, it uses class labels. Maximum components = C-1 (where C is the number of classes).

Method	Linear?	Supervised?	Preserves	Speed	Best For
PCA	✅ Yes	❌ No	Global variance	⚡ Fast	Preprocessing, compression
Kernel PCA	❌ No	❌ No	Non-linear structure	🐢 Slow	Non-linear manifolds
t-SNE	❌ No	❌ No	Local neighborhoods	🐢 Slow	2D/3D visualization
UMAP	❌ No	❌ No	Local + global	⚡ Fast	Visualization + downstream
LDA	✅ Yes	✅ Yes	Class separability	⚡ Fast	Classification preprocessing

5

Mathematical Foundation

5.1 Covariance Matrix

Given a data matrix X ∈ ℝ^n×d (n samples, d features), center the data by subtracting the mean: X̃ = X - μ, where μ = (1/n)Σ_ix_i.

Covariance Matrix C = (1/(n-1)) X̃ T X̃ \in ℝ d\timesd C jk = (1/(n-1)) Σ i=1 n (x ij - μ j)(x ik - μ k)

The covariance matrix is symmetric positive semi-definite (all eigenvalues ≥ 0). Diagonal entries are variances; off-diagonal entries are covariances between feature pairs.

5.2 Eigenvalue Decomposition

Since C is symmetric, it admits an eigendecomposition:

Eigenvalue Decomposition C v i = λ i v i C = V Λ V T where V = [v 1 | v 2 | ... | v d] (orthonormal eigenvectors) and Λ = diag(λ 1, λ 2, ..., λ d) with λ 1 \geq λ 2 \geq ... \geq λ d \geq 0

5.3 PCA Projection

To reduce from d to k dimensions, select the top k eigenvectors W = [v₁ | ... | v_k] ∈ ℝ^d×k:

PCA Projection Z = X̃ W \in ℝ n\timesk Each row z i = W T x̃ i is the projection of sample i onto k principal components

5.4 Explained Variance Ratio

Explained Variance Ratio EVR i = λ i / Σ j=1 d λ j Cumulative EVR for k components = Σ i=1 k λ i / Σ j=1 d λ j

The explained variance ratio tells us what fraction of the total variance in the data is captured by each principal component. A common heuristic is to choose k such that the cumulative EVR exceeds 95%.

5.5 SVD Connection to PCA

The centered data matrix X̃ has a Singular Value Decomposition:

SVD of Centered Data X̃ = U Σ V T where U \in ℝ n\timesn (left singular vectors), Σ \in ℝ n\timesd (singular values σ i) V \in ℝ d\timesd (right singular vectors = eigenvectors of X̃ T X̃)

The connection: C = (1/(n-1))X̃^TX̃ = (1/(n-1))VΣ^TΣV^T. So λ_i = σ_i² / (n-1), and the right singular vectors V are the principal component directions. This means PCA can be computed via SVD without ever forming the d×d covariance matrix — crucial when d is large.

5.6 Kernel PCA

For non-linear data, we map to a higher-dimensional space φ: ℝ^d → ℝ^D (D >> d) and perform PCA there. Using the kernel trick K(x_i, x_j) = φ(x_i)·φ(x_j), we work with the n×n kernel matrix instead of the D×D covariance matrix:

Kernel PCA Eigenvalue Problem K̃ α i = n λ i α i where K̃ is the centered kernel matrix: K̃ = K - 1 n K - K 1 n + 1 n K 1 n (1 n = (1/n) matrix of all ones)

5.7 t-SNE Mathematics

Step 1: High-D Pairwise Similarities

For each pair of points, compute a conditional probability representing "how likely is x_j to be a neighbor of x_i under a Gaussian centered at x_i":

t-SNE: High-D Conditional Probability p j|i = exp(-||x i - x j ||² / 2σ i ²) / Σ k\neqi exp(-||x i - x k ||² / 2σ i ²) Symmetrize: p ij = (p j|i + p i|j) / 2n

The bandwidth σ_i is set so that the perplexity of the conditional distribution equals a user-specified value (typically 5–50). Perplexity ≈ effective number of neighbors:

Perplexity Perp(P i) = 2 H(P i) where H(P i) = -Σ j p j|i log 2 (p j|i)

Step 2: Low-D Similarities with Student-t Distribution

In the low-dimensional space, similarities use a heavy-tailed Student's t-distribution (1 degree of freedom = Cauchy):

t-SNE: Low-D Probability q ij = (1 + ||y i - y j ||²) -1 / Σ k\neql (1 + ||y k - y l ||²) -1

The heavy tail allows moderately distant points in high-D to be placed far apart in low-D without penalty, solving the crowding problem.

Step 3: Minimize KL Divergence

t-SNE Objective: KL Divergence C = KL(P || Q) = Σ i Σ j p ij log(p ij / q ij) Gradient: \partialC/\partialy i = 4 Σ j (p ij - q ij)(y i - y j)(1 + ||y i - y j ||²) -1

5.8 UMAP Mathematics (Simplified)

UMAP constructs a weighted graph in high-D using k-nearest neighbors with a fuzzy membership function:

UMAP: Fuzzy Membership w(x i, x j) = exp(-(d(x i, x j) - ρ i) / σ i) where ρ i = distance to nearest neighbor (ensures local connectivity) σ i is calibrated to achieve log 2 (k) = Σ j w(x i, x j)

The fuzzy union of local membership functions is symmetrized: w_ij = w_i→j + w_j→i - w_i→j·w_j→i. UMAP then minimizes a cross-entropy objective (not KL divergence), which allows repulsive forces between non-neighbors:

UMAP Objective: Cross-Entropy C = Σ edges w ij log(w ij /q ij) + (1-w ij) log((1-w ij)/(1-q ij))

5.9 LDA Mathematics

LDA: Fisher's Criterion J(w) = (w T S B w) / (w T S W w) S B = Σ c n c (μ c - μ)(μ c - μ) T (Between-class scatter) S W = Σ c Σ i\inc (x i - μ c)(x i - μ c) T (Within-class scatter) Optimal: S W -1 S B w = λ w \to generalized eigenproblem

6

Formula Derivations

6.1 Deriving PCA from Maximum Variance

We want to find a unit vector w such that the variance of the projected data is maximized.

Step 1: Set up the projection

Given centered data x̃_i, the projection of point i onto direction w is: z_i = w^Tx̃_i

Step 2: Write the variance of the projection

Var(z) = (1/(n-1)) Σ_i z_i² = (1/(n-1)) Σ_i (w^Tx̃_i)² = (1/(n-1)) w^T (Σ_i x̃_ix̃_i^T) w = w^T C w

Step 3: Constrained optimization

Maximize w^TCw subject to w^Tw = 1.

Using Lagrange multipliers: L = w^TCw - λ(w^Tw - 1)

Step 4: Take the derivative and set to zero

∂L/∂w = 2Cw - 2λw = 0

⟹ Cw = λw

Step 5: Interpret the result

This is the eigenvalue equation! w must be an eigenvector of C. The variance along this direction is: w^TCw = w^Tλw = λ. So the variance equals the eigenvalue, and to maximize variance, we pick the eigenvector with the largest eigenvalue. ∎

Step 6: Second component

The second principal component maximizes variance subject to being orthogonal to the first: w₂^Tw₁ = 0. By an analogous derivation with two Lagrange constraints, w₂ is the eigenvector with the second largest eigenvalue. This pattern continues for all k components.

6.2 Deriving the t-SNE Gradient

The cost function is C = Σ_i≠j p_ij log(p_ij/q_ij).

Step 1: Rewrite using q_ij

Let d_ij = ||y_i - y_j||² and Z = Σ_k≠l(1 + d_kl)^-1. Then q_ij = (1 + d_ij)^-1 / Z.

Step 2: Differentiate

∂C/∂y_i = 2 Σ_j≠i (p_ij - q_ij) · 2(y_i - y_j) · (1 + d_ij)^-1

= 4 Σ_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||²)^-1

Interpretation: If p_ij > q_ij (points should be closer than they are), the gradient pulls y_i toward y_j. If p_ij < q_ij, it pushes them apart. The (1+d_ij)^-1 term gives more weight to nearby points.

6.3 PCA via SVD Derivation

Start with the SVD: X̃ = UΣV^T. The covariance matrix:

C = (1/(n-1)) X̃^TX̃ = (1/(n-1)) VΣ^TU^TUΣV^T = (1/(n-1)) VΣ²V^T

Comparing with C = VΛV^T, we get Λ = Σ²/(n-1), i.e., eigenvalue λ_i = σ_i²/(n-1).

The projection: Z = X̃V_k = UΣV^TV_k = U_kΣ_k (using orthogonality of V).

Key advantage: For n << d, computing the SVD of X̃ (n×d) is cheaper than the eigendecomposition of C (d×d). Scikit-learn uses this approach.

7

Worked Numerical Examples

7.1 PCA by Hand: 3 Features → 2 Components

Given Data (5 samples, 3 features)

Sample	x₁	x₂	x₃
A	2	4	1
B	4	6	3
C	6	8	5
D	8	10	7
E	10	12	9

Step 1: Compute means

μ₁ = (2+4+6+8+10)/5 = 6, μ₂ = (4+6+8+10+12)/5 = 8, μ₃ = (1+3+5+7+9)/5 = 5

Step 2: Center the data (subtract means)

Sample	x̃₁	x̃₂	x̃₃
A	-4	-4	-4
B	-2	-2	-2
C	0	0	0
D	2	2	2
E	4	4	4

Step 3: Compute covariance matrix C = (1/4) X̃ᵀ X̃

X̃ᵀX̃ = [(-4)²+(-2)²+0+4+16, ...] Let's compute each entry:

Σx̃₁² = 16+4+0+4+16 = 40 → C₁₁ = 40/4 = 10

Σx̃₁x̃₂ = 16+4+0+4+16 = 40 → C₁₂ = 40/4 = 10

Σx̃₁x̃₃ = 16+4+0+4+16 = 40 → C₁₃ = 40/4 = 10

Similarly: C₂₂ = 10, C₂₃ = 10, C₃₃ = 10

Covariance Matrix C = ⎡10 10 10⎤ ⎢10 10 10⎥ ⎣10 10 10⎦

Step 4: Find eigenvalues

det(C - λI) = 0. Since all rows are identical (rank 1 matrix), we know there are two zero eigenvalues.

trace(C) = 30 = sum of eigenvalues. Since rank = 1, we have λ₁ = 30, λ₂ = 0, λ₃ = 0.

Verification: C·[1,1,1]ᵀ = [30,30,30]ᵀ = 30·[1,1,1]ᵀ ✓

Step 5: Find eigenvectors

For λ₁ = 30: v₁ = [1,1,1]ᵀ/√3 = [0.577, 0.577, 0.577]ᵀ

For λ₂ = 0: any vector orthogonal to v₁, e.g., v₂ = [1,-1,0]ᵀ/√2 = [0.707, -0.707, 0]ᵀ

For λ₃ = 0: v₃ = [1,1,-2]ᵀ/√6 = [0.408, 0.408, -0.816]ᵀ

Step 6: Project onto top 2 components

W = [v₁ | v₂] (3×2 matrix). Z = X̃ · W:

z_A = [-4, -4, -4] · W = [-4(0.577+0.577+0.577), -4(0.707-0.707+0)] = [-6.928, 0]

z_B = [-3.464, 0], z_C = [0, 0], z_D = [3.464, 0], z_E = [6.928, 0]

Step 7: Explained variance

EVR₁ = 30/30 = 100%. All variance is captured by PC1 (because the data lies on a perfect line in 3D).

7.2 Explained Variance — A More Realistic Example

Suppose after computing PCA on a 10-feature dataset, the eigenvalues are:

λ = [5.8, 3.2, 1.5, 0.9, 0.6, 0.4, 0.3, 0.2, 0.07, 0.03]

Total = 13.0

EVR = [44.6%, 24.6%, 11.5%, 6.9%, 4.6%, 3.1%, 2.3%, 1.5%, 0.5%, 0.2%]

Cumulative = [44.6%, 69.2%, 80.8%, 87.7%, 92.3%, 95.4%, ...]

Decision: For 95% threshold, choose k = 6 components (reducing 10 → 6). For 90%, choose k = 5.

8

Visual Diagrams

Figure 16.1: PCA Geometric Interpretation — Finding Axes of Maximum Variance Original 2D Data: After PCA Rotation: x₂ ↑ · · PC2 ↑ | · · · | | · ·· ··· | · · |· ··· ·· | · ·· ·· · |·· · |·· · · ··· ···· +----------→ x₁ +------------------→ PC1 (max variance) PC1 = direction of maximum spread (largest eigenvalue) PC2 = orthogonal to PC1 (smaller eigenvalue) If λ₂ ≈ 0, we can discard PC2 → 2D reduced to 1D!

Figure 16.2: The Curse of Dimensionality — Volume of Hypersphere Ratio of Hypersphere to Hypercube Volume: 1.0 |█ |█ █ |█ █ █ 0.5 |█ █ █ █ |█ █ █ █ █ |█ █ █ █ █ █ 0.0 |█ █ █ █ █ █ █ █ █ █ · · · · · +----------------------------------------→ Dimensions 1 2 3 5 10 20 50 100 500 1000 d=1: ratio = 1.00 d=10: ratio = 0.0025 d=3: ratio = 0.52 d=100: ratio ≈ 10⁻⁷⁰ ⟹ In high-D, almost ALL volume is in the CORNERS!

Figure 16.3: t-SNE vs PCA on Cluster Data PCA (Linear): t-SNE (Non-linear): PC2 ↑ y₂ ↑ | ●●●○○○△△△ | ●●● △△△ |●●●○○○△△△ | ●●●●● △△△△ | ●●●○○○△△ | ●●● △△△ +------------→ PC1 | | ○○○ Clusters overlap in | ○○○○○ linear projection | ○○○ +------------→ y₁ Clusters clearly separated! ● = Class A ○ = Class B △ = Class C

Figure 16.4: Scree Plot — Choosing Number of Components Eigenvalue 6 |█ |█ 5 |█ |█ 4 |█ |█ █ 3 |█ █ |█ █ 2 |█ █ █ |█ █ █ 1 |█ █ █ █ █ ← "elbow" here |█ █ █ █ █ █ █ █ █ █ 0 +--+--+--+--+--+--+--+--+--+--→ PC1 PC2 PC3 PC4 PC5 PC6 PC7... Choose k at the "elbow" where eigenvalues flatten out. Here: k = 3 or 4 captures most variance. Cumulative: 44% 69% 81% 88% 92% 95% 97% ...

Figure 16.5: LDA vs PCA — Different Objectives PCA (maximize total variance): LDA (maximize class separation): ↑ PC1 direction ↑ LD1 direction | ●●● | ●●● | ●●●● | ●●●● | ○○○●● | | ○○○○ | ○○○○ | ○○○ | ○○○ +--→ max spread +--→ max between/within PCA might project onto a direction LDA separates classes even if where classes overlap! it doesn't maximize total variance.

9

Flowcharts

Flowchart 16.1: PCA Pipeline ┌─────────────────────┐ │ Raw Data X (n × d) │ └──────────┬──────────┘ ▼ ┌─────────────────────┐ │ 1. Standardize │ z = (x - μ) / σ │ (zero mean, │ (important if features │ unit variance) │ have different scales) └──────────┬──────────┘ ▼ ┌─────────────────────┐ │ 2. Compute Covariance│ C = (1/(n-1)) X̃ᵀX̃ │ Matrix C (d × d) │ OR use SVD directly └──────────┬──────────┘ ▼ ┌─────────────────────┐ │ 3. Eigendecomposition│ C vᵢ = λᵢ vᵢ │ (or SVD) │ Sort: λ₁ ≥ λ₂ ≥ ... ≥ λd └──────────┬──────────┘ ▼ ┌─────────────────────┐ │ 4. Choose k │ - Scree plot (elbow) │ components │ - Cumulative EVR ≥ 95% │ │ - Kaiser rule (λᵢ > 1) └──────────┬──────────┘ ▼ ┌─────────────────────┐ │ 5. Project Data │ Z = X̃ · W (n × k) │ W = [v₁|...|vk] │ W is d × k └──────────┬──────────┘ ▼ ┌─────────────────────┐ │ Reduced Data Z (n×k)│ │ Use for: │ │ - Visualization │ │ - ML model input │ │ - Noise reduction │ └─────────────────────┘

Flowchart 16.2: Choosing a Dimensionality Reduction Method ┌───────────────────────────┐ │ Do you have class labels? │ └──────────┬────────────────┘ │ ┌────┴────┐ YES NO │ │ ▼ ▼ ┌──────────┐ ┌──────────────────────┐ │ LDA │ │ Is the goal │ │(max C-1 │ │ visualization only? │ │ dims) │ └──────┬───────────────┘ └──────────┘ ┌────┴────┐ YES NO │ │ ▼ ▼ ┌────────────────────┐ ┌──────────────────┐ │ n < 10,000? │ │ Linear enough? │ └───────┬────────────┘ └────┬─────────────┘ ┌────┴───┐ ┌────┴────┐ YES NO YES NO │ │ │ │ ▼ ▼ ▼ ▼ ┌──────┐ ┌──────┐ ┌──────┐ ┌───────────┐ │t-SNE │ │UMAP │ │ PCA │ │Kernel PCA │ │ │ │ │ │ │ │or UMAP │ │(best │ │(fast,│ │(fast,│ │ │ │local)│ │local+│ │exact)│ │ │ │ │ │global│ │ │ │ │ └──────┘ └──────┘ └──────┘ └───────────┘

Flowchart 16.3: t-SNE Algorithm Steps ┌─────────────────────────────┐ │ Input: High-D data X, n pts │ │ Hyperparams: perplexity, │ │ learning_rate, n_iter │ └────────────┬────────────────┘ ▼ ┌─────────────────────────────┐ │ 1. Compute pairwise │ │ distances in high-D │ │ d²ᵢⱼ = ||xᵢ - xⱼ||² │ └────────────┬────────────────┘ ▼ ┌─────────────────────────────┐ │ 2. Binary search for σᵢ │ │ to match target │ │ perplexity │ └────────────┬────────────────┘ ▼ ┌─────────────────────────────┐ │ 3. Compute pᵢⱼ (symmetrize)│ └────────────┬────────────────┘ ▼ ┌─────────────────────────────┐ │ 4. Initialize low-D points │ │ yᵢ ∼ N(0, 10⁻⁴I) │ └────────────┬────────────────┘ ▼ ┌─────────────────────────────┐ │ 5. For iter = 1 to T: │ │ a) Compute qᵢⱼ (t-dist) │ │ b) Compute gradient │ │ c) Update: y ← y + η∇ │ │ d) Apply momentum │ │ (Early exaggeration for │ │ first ~250 iterations) │ └────────────┬────────────────┘ ▼ ┌─────────────────────────────┐ │ Output: Low-D embedding Y │ └─────────────────────────────┘

10

Python Implementation from Scratch

10.1 PCA Class — Full Eigendecomposition

Python pca_from_scratch.py

import numpy as np
import matplotlib.pyplot as plt

class PCA:
    """
    Principal Component Analysis from scratch.
    
    Implements the eigendecomposition approach:
    1. Center data (optionally standardize)
    2. Compute covariance matrix
    3. Eigendecompose
    4. Project onto top-k eigenvectors
    """
    
    def __init__(self, n_components=2, standardize=True):
        """
        Parameters:
            n_components: int or float
                If int: number of components to keep
                If float (0-1): keep enough for this explained variance ratio
            standardize: bool
                If True, scale features to unit variance (recommended)
        """
        self.n_components = n_components
        self.standardize = standardize
        self.components_ = None       # Principal component directions (k x d)
        self.eigenvalues_ = None      # All eigenvalues
        self.explained_variance_ = None
        self.explained_variance_ratio_ = None
        self.mean_ = None
        self.std_ = None
    
    def fit(self, X):
        """Fit PCA on training data X (n_samples x n_features)."""
        n, d = X.shape
        
        # Step 1: Center (and optionally scale)
        self.mean_ = np.mean(X, axis=0)
        X_centered = X - self.mean_
        
        if self.standardize:
            self.std_ = np.std(X, axis=0, ddof=0)
            # Avoid division by zero for constant features
            self.std_[self.std_ == 0] = 1.0
            X_centered = X_centered / self.std_
        
        # Step 2: Covariance matrix (d x d)
        # Using (n-1) for unbiased estimator
        cov_matrix = (X_centered.T @ X_centered) / (n - 1)
        
        # Step 3: Eigenvalue decomposition
        # np.linalg.eigh is for symmetric matrices (faster, numerically stable)
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
        
        # Sort in descending order (eigh returns ascending)
        sorted_idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[sorted_idx]
        eigenvectors = eigenvectors[:, sorted_idx]
        
        # Clip tiny negative eigenvalues (numerical noise)
        eigenvalues = np.maximum(eigenvalues, 0)
        
        self.eigenvalues_ = eigenvalues
        self.explained_variance_ = eigenvalues
        total_var = np.sum(eigenvalues)
        self.explained_variance_ratio_ = eigenvalues / total_var if total_var > 0 else eigenvalues
        
        # Determine number of components
        if isinstance(self.n_components, float) and 0 < self.n_components < 1:
            cumulative = np.cumsum(self.explained_variance_ratio_)
            k = np.searchsorted(cumulative, self.n_components) + 1
            self.n_components = min(k, d)
        
        self.n_components = min(self.n_components, d)
        
        # Step 4: Select top-k eigenvectors
        # components_ shape: (k, d) — each row is a principal direction
        self.components_ = eigenvectors[:, :self.n_components].T
        
        return self
    
    def transform(self, X):
        """Project data onto principal components."""
        X_centered = X - self.mean_
        if self.standardize:
            X_centered = X_centered / self.std_
        # Z = X_centered @ W where W = components_.T (d x k)
        return X_centered @ self.components_.T
    
    def fit_transform(self, X):
        """Fit and transform in one step."""
        self.fit(X)
        return self.transform(X)
    
    def inverse_transform(self, Z):
        """Reconstruct data from reduced representation."""
        X_approx = Z @ self.components_  # (n x k) @ (k x d) = (n x d)
        if self.standardize:
            X_approx = X_approx * self.std_
        X_approx = X_approx + self.mean_
        return X_approx
    
    def plot_scree(self, figsize=(10, 4)):
        """Plot scree diagram and cumulative explained variance."""
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)
        
        k = min(len(self.eigenvalues_), 20)  # Show up to 20
        x = range(1, k + 1)
        
        # Scree plot
        ax1.bar(x, self.eigenvalues_[:k], color='#059669', alpha=0.7)
        ax1.plot(x, self.eigenvalues_[:k], 'o-', color='#0891b2', linewidth=2)
        ax1.set_xlabel('Principal Component')
        ax1.set_ylabel('Eigenvalue')
        ax1.set_title('Scree Plot')
        ax1.set_xticks(x)
        ax1.grid(alpha=0.3)
        
        # Cumulative variance
        cumulative = np.cumsum(self.explained_variance_ratio_[:k]) * 100
        ax2.plot(x, cumulative, 'o-', color='#059669', linewidth=2, markersize=6)
        ax2.axhline(y=95, color='#f43f5e', linestyle='--', label='95% threshold')
        ax2.fill_between(x, cumulative, alpha=0.1, color='#059669')
        ax2.set_xlabel('Number of Components')
        ax2.set_ylabel('Cumulative Explained Variance (%)')
        ax2.set_title('Cumulative Explained Variance')
        ax2.set_xticks(x)
        ax2.legend()
        ax2.grid(alpha=0.3)
        
        plt.tight_layout()
        plt.show()


# ============== DEMO ==============
if __name__ == "__main__":
    # Generate sample data: 3 clusters in 5D
    np.random.seed(42)
    n = 200
    
    # 3 clusters with different centers
    c1 = np.random.randn(n, 5) + np.array([3, 3, 3, 0, 0])
    c2 = np.random.randn(n, 5) + np.array([-3, -3, -3, 0, 0])
    c3 = np.random.randn(n, 5) + np.array([0, 0, 0, 5, 5])
    X = np.vstack([c1, c2, c3])
    labels = np.array([0]*n + [1]*n + [2]*n)
    
    # Apply PCA
    pca = PCA(n_components=2, standardize=True)
    Z = pca.fit_transform(X)
    
    print(f"Original shape: {X.shape}")
    print(f"Reduced shape:  {Z.shape}")
    print(f"Explained variance ratios: {pca.explained_variance_ratio_[:5].round(4)}")
    print(f"Cumulative (2 components): {sum(pca.explained_variance_ratio_[:2]):.4f}")
    
    # Plot
    pca.plot_scree()
    
    plt.figure(figsize=(8, 6))
    colors = ['#059669', '#0891b2', '#f59e0b']
    for i, c in enumerate(colors):
        mask = labels == i
        plt.scatter(Z[mask, 0], Z[mask, 1], c=c, label=f'Cluster {i}', alpha=0.6, s=30)
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.title('PCA: 5D → 2D Projection')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.show()

10.2 PCA via SVD (Numerically Superior)

Python pca_svd.py

import numpy as np

class PCA_SVD:
    """PCA via Singular Value Decomposition — avoids forming covariance matrix."""
    
    def __init__(self, n_components=2):
        self.n_components = n_components
    
    def fit(self, X):
        n, d = X.shape
        self.mean_ = X.mean(axis=0)
        X_centered = X - self.mean_
        
        # Truncated SVD: only compute top-k singular values/vectors
        # Full SVD: X̃ = U Σ Vᵀ
        U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
        
        # Principal directions are rows of Vt
        self.components_ = Vt[:self.n_components]  # (k x d)
        
        # Eigenvalues of covariance = σ²/(n-1)
        self.explained_variance_ = (S ** 2) / (n - 1)
        total = self.explained_variance_.sum()
        self.explained_variance_ratio_ = self.explained_variance_ / total
        
        return self
    
    def transform(self, X):
        return (X - self.mean_) @ self.components_.T
    
    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

# Demo
np.random.seed(42)
X = np.random.randn(500, 50)  # 500 samples, 50 features
pca = PCA_SVD(n_components=5)
Z = pca.fit_transform(X)
print(f"Reduced: {X.shape} → {Z.shape}")
print(f"Top 5 EVR: {pca.explained_variance_ratio_[:5].round(4)}")

10.3 t-SNE Simplified Implementation

Python tsne_simplified.py

import numpy as np

def compute_pairwise_distances(X):
    """Compute squared Euclidean distances between all pairs."""
    sum_sq = np.sum(X ** 2, axis=1)
    # ||xi - xj||² = ||xi||² + ||xj||² - 2 xi·xj
    D = sum_sq[:, np.newaxis] + sum_sq[np.newaxis, :] - 2 * X @ X.T
    return np.maximum(D, 0)  # Clip numerical negatives

def compute_perplexity_probabilities(D, target_perplexity=30.0, tol=1e-5, max_iter=50):
    """
    For each point, find σᵢ via binary search so that
    Perplexity(Pᵢ) = target_perplexity.
    Returns symmetrized joint probabilities P.
    """
    n = D.shape[0]
    P = np.zeros((n, n))
    target_entropy = np.log2(target_perplexity)
    
    for i in range(n):
        # Binary search for σᵢ (actually searching over β = 1/(2σ²))
        beta_min, beta_max = -np.inf, np.inf
        beta = 1.0
        
        Di = D[i].copy()
        Di[i] = np.inf  # Exclude self
        
        for _ in range(max_iter):
            # Compute conditional probabilities p(j|i)
            exp_D = np.exp(-Di * beta)
            exp_D[i] = 0
            sum_exp = np.sum(exp_D)
            
            if sum_exp == 0:
                P_i = np.zeros(n)
            else:
                P_i = exp_D / sum_exp
            
            # Compute entropy
            H = -np.sum(P_i[P_i > 0] * np.log2(P_i[P_i > 0]))
            
            # Check if we hit target
            diff = H - target_entropy
            if abs(diff) < tol:
                break
            
            # Update β via binary search
            if diff > 0:  # Entropy too high → increase β (decrease σ)
                beta_min = beta
                beta = beta * 2 if beta_max == np.inf else (beta + beta_max) / 2
            else:
                beta_max = beta
                beta = beta / 2 if beta_min == -np.inf else (beta + beta_min) / 2
        
        P[i] = P_i
    
    # Symmetrize: p_ij = (p_j|i + p_i|j) / (2n)
    P = (P + P.T) / (2 * n)
    P = np.maximum(P, 1e-12)  # Avoid log(0)
    return P

def tsne(X, n_components=2, perplexity=30.0, n_iter=500, lr=200.0, momentum=0.8):
    """
    Simplified t-SNE implementation.
    
    Parameters:
        X: ndarray (n_samples, n_features)
        n_components: output dimensions (2 or 3)
        perplexity: effective number of neighbors (5-50)
        n_iter: gradient descent iterations
        lr: learning rate
        momentum: momentum coefficient
    """
    n = X.shape[0]
    
    # Step 1: Pairwise distances in high-D
    D_high = compute_pairwise_distances(X)
    
    # Step 2: Compute high-D probabilities with target perplexity
    P = compute_perplexity_probabilities(D_high, target_perplexity=perplexity)
    
    # Early exaggeration (multiply P by 4 for first 100 iterations)
    P_exag = P * 4.0
    
    # Step 3: Initialize low-D embedding
    Y = np.random.randn(n, n_components) * 1e-4
    velocity = np.zeros_like(Y)
    
    for t in range(n_iter):
        # Use exaggerated P for early iterations
        P_use = P_exag if t < 100 else P
        
        # Pairwise distances in low-D
        D_low = compute_pairwise_distances(Y)
        
        # Compute q_ij using Student-t distribution
        Q_num = 1.0 / (1.0 + D_low)  # (1 + ||yi - yj||²)⁻¹
        np.fill_diagonal(Q_num, 0)
        Q = Q_num / np.sum(Q_num)
        Q = np.maximum(Q, 1e-12)
        
        # Compute gradient
        # ∂C/∂yᵢ = 4 Σⱼ (pᵢⱼ - qᵢⱼ)(yᵢ - yⱼ)(1 + ||yᵢ - yⱼ||²)⁻¹
        PQ_diff = P_use - Q
        
        # Efficient gradient computation
        grad = np.zeros_like(Y)
        for i in range(n):
            diff = Y[i] - Y  # (n, n_components)
            grad[i] = 4 * np.sum((PQ_diff[i, :, np.newaxis]) * diff * Q_num[i, :, np.newaxis], axis=0)
        
        # Update with momentum
        velocity = momentum * velocity - lr * grad
        Y += velocity
        
        # Re-center
        Y -= Y.mean(axis=0)
        
        if (t + 1) % 100 == 0:
            cost = np.sum(P_use * np.log(P_use / Q))
            print(f"  Iteration {t+1}/{n_iter}, KL divergence: {cost:.4f}")
    
    return Y

# Demo
if __name__ == "__main__":
    np.random.seed(42)
    # Create 3 clusters in 20D
    X = np.vstack([
        np.random.randn(50, 20) + np.array([5]*10 + [0]*10),
        np.random.randn(50, 20) + np.array([0]*10 + [5]*10),
        np.random.randn(50, 20) + np.array([-5]*10 + [-5]*10),
    ])
    
    print("Running t-SNE (this may take a moment)...")
    Y = tsne(X, n_components=2, perplexity=15, n_iter=500, lr=100)
    print(f"Embedding shape: {Y.shape}")

11

TensorFlow Implementation

11.1 PCA with TensorFlow (GPU-Accelerated)

TensorFlow pca_tensorflow.py

import tensorflow as tf
import numpy as np

class TF_PCA:
    """GPU-accelerated PCA using TensorFlow's SVD."""
    
    def __init__(self, n_components=2):
        self.n_components = n_components
    
    def fit_transform(self, X_np):
        """
        Fit PCA and return transformed data.
        Uses TF's SVD for GPU acceleration on large matrices.
        """
        X = tf.constant(X_np, dtype=tf.float64)
        
        # Center data
        mean = tf.reduce_mean(X, axis=0)
        X_centered = X - mean
        
        # SVD: X̃ = U Σ Vᵀ
        S, U, V = tf.linalg.svd(X_centered, full_matrices=False)
        
        # Project onto top k right singular vectors
        V_k = V[:, :self.n_components]  # (d x k)
        Z = tf.matmul(X_centered, V_k)  # (n x k)
        
        # Explained variance
        n = tf.cast(tf.shape(X)[0], tf.float64)
        eigenvalues = S ** 2 / (n - 1)
        total = tf.reduce_sum(eigenvalues)
        evr = eigenvalues / total
        
        self.explained_variance_ratio_ = evr.numpy()
        self.components_ = tf.transpose(V_k).numpy()
        
        return Z.numpy()

# Demo
np.random.seed(42)
X = np.random.randn(1000, 100)
pca = TF_PCA(n_components=10)
Z = pca.fit_transform(X)
print(f"Shape: {X.shape} → {Z.shape}")
print(f"Top 5 EVR: {pca.explained_variance_ratio_[:5].round(4)}")

11.2 Autoencoder for Non-Linear Dimensionality Reduction

TensorFlow/Keras autoencoder_dimred.py

import tensorflow as tf
from tensorflow import keras
import numpy as np

def build_autoencoder(input_dim, encoding_dim=2, hidden_dims=[128, 64]):
    """
    Build a symmetric autoencoder.
    Encoder: input_dim → hidden1 → hidden2 → encoding_dim
    Decoder: encoding_dim → hidden2 → hidden1 → input_dim
    """
    # Encoder
    encoder_input = keras.Input(shape=(input_dim,))
    x = encoder_input
    for dim in hidden_dims:
        x = keras.layers.Dense(dim, activation='relu')(x)
        x = keras.layers.BatchNormalization()(x)
    encoded = keras.layers.Dense(encoding_dim, activation='linear', name='bottleneck')(x)
    
    # Decoder (mirror of encoder)
    x = encoded
    for dim in reversed(hidden_dims):
        x = keras.layers.Dense(dim, activation='relu')(x)
        x = keras.layers.BatchNormalization()(x)
    decoded = keras.layers.Dense(input_dim, activation='linear')(x)
    
    # Models
    autoencoder = keras.Model(encoder_input, decoded, name='autoencoder')
    encoder = keras.Model(encoder_input, encoded, name='encoder')
    
    autoencoder.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss='mse'
    )
    
    return autoencoder, encoder

# Example: MNIST dimensionality reduction
# Load data
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train = X_train.reshape(-1, 784).astype('float32') / 255.0
X_test = X_test.reshape(-1, 784).astype('float32') / 255.0

# Build and train
autoencoder, encoder = build_autoencoder(784, encoding_dim=2)
autoencoder.summary()

history = autoencoder.fit(
    X_train, X_train,
    epochs=30,
    batch_size=256,
    validation_data=(X_test, X_test),
    verbose=1
)

# Get 2D embeddings
Z_train = encoder.predict(X_train)
print(f"Embedding shape: {Z_train.shape}")  # (60000, 2)

# Visualize (compare with PCA and t-SNE!)
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
scatter = plt.scatter(Z_train[:5000, 0], Z_train[:5000, 1],
                      c=y_train[:5000], cmap='tab10', s=3, alpha=0.5)
plt.colorbar(scatter, label='Digit')
plt.title('Autoencoder: MNIST 784D → 2D')
plt.xlabel('Latent Dim 1')
plt.ylabel('Latent Dim 2')
plt.show()

12

Scikit-Learn Implementation

12.1 Complete PCA Pipeline with Visualization

Scikit-Learn sklearn_pca_pipeline.py

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_wine, load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# ============================================
# PART 1: PCA on Wine Dataset
# ============================================
wine = load_wine()
X, y = wine.data, wine.target
print(f"Wine dataset: {X.shape} → {len(wine.feature_names)} features")

# Pipeline: Scale → PCA
pipe_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2))
])
X_pca = pipe_pca.fit_transform(X)
pca = pipe_pca.named_steps['pca']

print(f"\nExplained Variance Ratio: {pca.explained_variance_ratio_.round(4)}")
print(f"Cumulative: {np.cumsum(pca.explained_variance_ratio_).round(4)}")

# Scree Plot
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Full PCA for scree plot
pca_full = PCA().fit(StandardScaler().fit_transform(X))
axes[0].bar(range(1, 14), pca_full.explained_variance_ratio_, color='#059669', alpha=0.7)
axes[0].plot(range(1, 14), pca_full.explained_variance_ratio_, 'o-', color='#0891b2')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Scree Plot')

# Cumulative variance
axes[1].plot(range(1, 14), np.cumsum(pca_full.explained_variance_ratio_), 
             'o-', color='#059669', linewidth=2)
axes[1].axhline(y=0.95, color='red', linestyle='--', label='95%')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Variance')
axes[1].legend()

# 2D Projection
colors = ['#059669', '#0891b2', '#f59e0b']
for i in range(3):
    mask = y == i
    axes[2].scatter(X_pca[mask, 0], X_pca[mask, 1], 
                    c=colors[i], label=wine.target_names[i], s=50, alpha=0.7)
axes[2].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
axes[2].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
axes[2].set_title('Wine: PCA 2D Projection')
axes[2].legend()
plt.tight_layout()
plt.show()

# Biplot: show feature contributions
def biplot(pca_model, X_pca, feature_names, y, ax):
    """Create a biplot showing both data points and feature loadings."""
    colors = ['#059669', '#0891b2', '#f59e0b']
    for i, c in enumerate(colors):
        mask = y == i
        ax.scatter(X_pca[mask, 0], X_pca[mask, 1], c=c, alpha=0.4, s=30)
    
    # Feature loadings (arrows)
    loadings = pca_model.components_.T  # (d x k)
    scale = 3  # Arrow scale factor
    for i, name in enumerate(feature_names):
        ax.arrow(0, 0, loadings[i, 0]*scale, loadings[i, 1]*scale,
                color='red', alpha=0.7, head_width=0.05)
        ax.text(loadings[i, 0]*scale*1.1, loadings[i, 1]*scale*1.1,
               name, fontsize=7, color='red')
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')
    ax.set_title('Biplot')
    ax.grid(alpha=0.3)

fig, ax = plt.subplots(figsize=(10, 8))
biplot(pca, X_pca, wine.feature_names, y, ax)
plt.tight_layout()
plt.show()

# ============================================
# PART 2: PCA for Classification Improvement
# ============================================
print("\n--- PCA + SVM Classification ---")
for k in [2, 4, 6, 8, 10, 13]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=k)),
        ('svm', SVC(kernel='rbf', C=10))
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"  k={k:2d}: accuracy = {scores.mean():.4f} ± {scores.std():.4f}")

# ============================================
# PART 3: Kernel PCA
# ============================================
from sklearn.datasets import make_moons

X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Linear PCA
pca_linear = PCA(n_components=1)
X_lin = pca_linear.fit_transform(X_moons)
axes[0].scatter(X_lin, np.zeros_like(X_lin), c=y_moons, cmap='RdYlGn', s=30)
axes[0].set_title('Linear PCA (1D)')

# Kernel PCA with RBF
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X_moons)
axes[1].scatter(X_kpca[:, 0], X_kpca[:, 1], c=y_moons, cmap='RdYlGn', s=30)
axes[1].set_title('Kernel PCA (RBF, γ=10)')

# Original data
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='RdYlGn', s=30)
axes[2].set_title('Original Moon Data')

plt.tight_layout()
plt.show()

12.2 t-SNE Pipeline with MNIST

Scikit-Learn sklearn_tsne_mnist.py

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import time

# Load digits dataset (8x8 images, 64 features, 10 classes)
digits = load_digits()
X, y = digits.data, digits.target
print(f"Digits dataset: {X.shape}, {len(np.unique(y))} classes")

# Scale data
X_scaled = StandardScaler().fit_transform(X)

# ============================================
# Strategy: PCA first (64→30), then t-SNE (30→2)
# This is the recommended approach for large datasets!
# ============================================
pca_pre = PCA(n_components=30)
X_pca30 = pca_pre.fit_transform(X_scaled)
print(f"PCA 64→30: {sum(pca_pre.explained_variance_ratio_):.2%} variance retained")

# t-SNE with different perplexities
perplexities = [5, 15, 30, 50]
fig, axes = plt.subplots(1, 4, figsize=(24, 5))

for ax, perp in zip(axes, perplexities):
    start = time.time()
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42,
                n_iter=1000, learning_rate='auto', init='pca')
    X_tsne = tsne.fit_transform(X_pca30)
    elapsed = time.time() - start
    
    scatter = ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=8, alpha=0.7)
    ax.set_title(f'Perplexity = {perp}\n(KL = {tsne.kl_divergence_:.3f}, {elapsed:.1f}s)')
    ax.set_xticks([])
    ax.set_yticks([])

plt.suptitle('t-SNE: Effect of Perplexity on MNIST Digits', fontsize=14, fontweight='bold')
plt.colorbar(scatter, ax=axes, label='Digit', shrink=0.8)
plt.tight_layout()
plt.show()

# ============================================
# Comparison: PCA vs t-SNE
# ============================================
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# PCA
X_pca2 = PCA(n_components=2).fit_transform(X_scaled)
axes[0].scatter(X_pca2[:, 0], X_pca2[:, 1], c=y, cmap='tab10', s=8, alpha=0.5)
axes[0].set_title('PCA (2D)')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')

# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42,
            n_iter=1000, learning_rate='auto', init='pca')
X_tsne2 = tsne.fit_transform(X_pca30)
scatter = axes[1].scatter(X_tsne2[:, 0], X_tsne2[:, 1], c=y, cmap='tab10', s=8, alpha=0.5)
axes[1].set_title('t-SNE (2D)')
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')

plt.colorbar(scatter, ax=axes, label='Digit')
plt.suptitle('PCA vs t-SNE: MNIST Digits', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

12.3 UMAP Pipeline

Python + UMAP umap_pipeline.py

# pip install umap-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import time

try:
    import umap
    
    digits = load_digits()
    X, y = digits.data, digits.target
    X_scaled = StandardScaler().fit_transform(X)
    
    # UMAP with different n_neighbors
    n_neighbors_list = [5, 15, 50, 200]
    fig, axes = plt.subplots(1, 4, figsize=(24, 5))
    
    for ax, nn in zip(axes, n_neighbors_list):
        start = time.time()
        reducer = umap.UMAP(
            n_components=2,
            n_neighbors=nn,
            min_dist=0.1,
            metric='euclidean',
            random_state=42
        )
        X_umap = reducer.fit_transform(X_scaled)
        elapsed = time.time() - start
        
        scatter = ax.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=8, alpha=0.7)
        ax.set_title(f'n_neighbors = {nn}\n({elapsed:.1f}s)')
        ax.set_xticks([])
        ax.set_yticks([])
    
    plt.suptitle('UMAP: Effect of n_neighbors on MNIST', fontsize=14, fontweight='bold')
    plt.colorbar(scatter, ax=axes, label='Digit', shrink=0.8)
    plt.tight_layout()
    plt.show()
    
    # ============================================
    # Grand Comparison: PCA vs t-SNE vs UMAP
    # ============================================
    from sklearn.manifold import TSNE
    from sklearn.decomposition import PCA
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # PCA
    t0 = time.time()
    X_pca = PCA(n_components=2).fit_transform(X_scaled)
    axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=5, alpha=0.5)
    axes[0].set_title(f'PCA ({time.time()-t0:.3f}s)')
    
    # t-SNE
    t0 = time.time()
    X_tsne = TSNE(n_components=2, perplexity=30, random_state=42, init='pca').fit_transform(X_scaled)
    axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=5, alpha=0.5)
    axes[1].set_title(f't-SNE ({time.time()-t0:.1f}s)')
    
    # UMAP
    t0 = time.time()
    X_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X_scaled)
    scatter = axes[2].scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=5, alpha=0.5)
    axes[2].set_title(f'UMAP ({time.time()-t0:.1f}s)')
    
    plt.colorbar(scatter, ax=axes, label='Digit')
    plt.suptitle('Grand Comparison: PCA vs t-SNE vs UMAP', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

except ImportError:
    print("Install umap-learn: pip install umap-learn")

12.4 LDA Implementation

Scikit-Learn lda_pipeline.py

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import numpy as np

wine = load_wine()
X, y = wine.data, wine.target
X_scaled = StandardScaler().fit_transform(X)

# LDA: max components = C-1 = 2
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

# Compare PCA vs LDA
from sklearn.decomposition import PCA

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
colors = ['#059669', '#0891b2', '#f59e0b']
names = wine.target_names

for ax, X_proj, title in [(axes[0], PCA(2).fit_transform(X_scaled), 'PCA'),
                            (axes[1], X_lda, 'LDA')]:
    for i, (c, name) in enumerate(zip(colors, names)):
        mask = y == i
        ax.scatter(X_proj[mask, 0], X_proj[mask, 1], c=c, label=name, s=50, alpha=0.7)
    ax.set_title(f'{title} Projection (Wine Dataset)')
    ax.legend()
    ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Classification accuracy comparison
print("\n--- PCA vs LDA for Classification ---")
for method_name, reducer in [('PCA(2)', PCA(2)), ('LDA(2)', LDA(n_components=2))]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('reduce', reducer),
        ('svm', SVC(kernel='rbf'))
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"  {method_name}: {scores.mean():.4f} ± {scores.std():.4f}")

13

Indian Case Studies

🇮🇳 Case Study 1: Aadhaar Biometric Dimensionality Reduction

Challenge

India's Aadhaar system stores biometrics for 1.4 billion people. Each fingerprint scan produces a high-dimensional minutiae template (~200+ features per finger × 10 fingers). Iris scans generate IrisCodes of 2,048 bits per eye. Storing and matching these at scale is computationally prohibitive.

Solution: Dimensionality Reduction in Biometric Matching

UIDAI employs several dimensionality reduction strategies:

Fingerprint templates: ISO/IEC 19794-2 minutiae extraction reduces raw fingerprint images (typically 300×400 pixels = 120K dimensions) to ~60-100 minutiae points (each with x, y, angle = 3 features), achieving roughly 400:1 reduction
IrisCode: Daugman's algorithm projects iris texture patterns onto a 2,048-bit binary code using Gabor wavelets — a form of feature extraction that reduces ~300K pixel values to 256 bytes
Multi-modal fusion: PCA is used to combine fingerprint and iris features into a compact representation for 1:N identification (searching the entire database)

Scale

Over 100 billion biometric authentications processed. PCA-based indexing reduces search from 1.4B full template comparisons to ~100K candidates per query. Average authentication time: under 200ms.

Python Simulation

Python aadhaar_biometric_pca.py

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Simulate biometric data: 10,000 people, 200 features per person
np.random.seed(42)
n_people = 10000
n_features = 200  # Minutiae + IrisCode features

# Generate clustered data (each person = cluster of templates)
n_classes = 100  # Simulate 100 unique identities
samples_per_class = n_people // n_classes

X_biometric = []
y_identity = []
for person_id in range(n_classes):
    center = np.random.randn(n_features) * 3
    templates = center + np.random.randn(samples_per_class, n_features) * 0.5
    X_biometric.append(templates)
    y_identity.extend([person_id] * samples_per_class)

X = np.vstack(X_biometric)
y = np.array(y_identity)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Without PCA: KNN on full 200D
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

knn_full = KNeighborsClassifier(n_neighbors=3)
knn_full.fit(X_train_s, y_train)
acc_full = accuracy_score(y_test, knn_full.predict(X_test_s))

# With PCA: reduce to 20D
pca = PCA(n_components=20)
X_train_pca = pca.fit_transform(X_train_s)
X_test_pca = pca.transform(X_test_s)

knn_pca = KNeighborsClassifier(n_neighbors=3)
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print("=== Aadhaar Biometric Matching Simulation ===")
print(f"Full features (200D): accuracy = {acc_full:.4f}")
print(f"PCA reduced  ( 20D): accuracy = {acc_pca:.4f}")
print(f"Variance retained:   {sum(pca.explained_variance_ratio_):.2%}")
print(f"Dimensionality reduction: {200/20:.0f}x")
print(f"Search speedup (KNN): ~{(200/20):.0f}x faster distance computations")

🇮🇳 Case Study 2: ISRO Satellite Image Compression

Challenge

ISRO's Cartosat and Resourcesat satellites generate hyperspectral images with 200+ spectral bands per pixel. A single scene can be 10,000×10,000 pixels × 200 bands = 20 billion values. Downlinking this data to ground stations (bandwidth-limited) requires massive compression.

Solution: PCA-Based Spectral Compression

ISRO uses PCA to reduce spectral dimensionality:

200+ spectral bands → 10-15 principal components (retaining 99%+ variance)
Each pixel: 200 values → 15 values = 13:1 compression
Combined with spatial compression (JPEG2000), achieves 100:1 total compression
Enables near-real-time crop monitoring under the Fasal programme

Python isro_spectral_pca.py

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Simulate hyperspectral image: 100x100 pixels, 200 bands
np.random.seed(42)
height, width, bands = 100, 100, 200
image = np.random.randn(height * width, bands)

# Add correlated structure (simulating real spectral correlations)
basis = np.random.randn(10, bands)  # 10 underlying spectral patterns
coefficients = np.random.randn(height * width, 10)
image = coefficients @ basis + np.random.randn(height * width, bands) * 0.1

# Apply PCA for spectral compression
pca = PCA(n_components=15)
compressed = pca.fit_transform(image)
reconstructed = pca.inverse_transform(compressed)

# Compute reconstruction quality
mse = np.mean((image - reconstructed) ** 2)
signal_power = np.mean(image ** 2)
psnr = 10 * np.log10(signal_power / mse) if mse > 0 else float('inf')

print("=== ISRO Spectral Compression Simulation ===")
print(f"Original: {height}x{width}x{bands} = {height*width*bands:,} values")
print(f"Compressed: {height}x{width}x{pca.n_components_} = {height*width*pca.n_components_:,} values")
print(f"Compression ratio: {bands/pca.n_components_:.1f}:1")
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.4%}")
print(f"Reconstruction PSNR: {psnr:.1f} dB")

🇮🇳 Case Study 3: Flipkart Product Embedding

Application

Flipkart manages a catalog of 150+ million products. Product descriptions are embedded into 768-dimensional vectors using BERT. For real-time similar-product recommendations, they use PCA to reduce embeddings from 768D to 128D, followed by approximate nearest neighbor search (FAISS). This reduces memory from 4.5 GB to 760 MB for 10M products while maintaining 95%+ recommendation quality.

14

Global Case Studies

🌍 Case Study 1: MNIST Visualization with t-SNE

The Iconic Visualization

The t-SNE visualization of MNIST handwritten digits (784 pixels → 2D) is perhaps the most famous use of dimensionality reduction in ML. Published by van der Maaten in 2008, it showed for the first time that t-SNE could separate all 10 digit classes into distinct, tight clusters — something PCA completely fails to do.

Key insights from the MNIST t-SNE plot:

Digits 0, 1, 6 form well-separated clusters (visually distinct shapes)
Digits 3, 5, 8 often overlap (similar curved strokes)
Digits 4, 7, 9 partially overlap (shared angular features)
Within-cluster structure often reveals writing styles (e.g., 7 with/without crossbar)

🌍 Case Study 2: Genomics Data Reduction

The Challenge

A typical gene expression study measures ~20,000 genes across a few hundred patient samples. This is the extreme n << d regime where the curse of dimensionality is most severe.

Application: Cancer Subtype Discovery

The Cancer Genome Atlas (TCGA) project used PCA and t-SNE to discover cancer subtypes:

Breast cancer: PCA of 20,000 gene expressions revealed 4 molecular subtypes (Luminal A, Luminal B, HER2+, Basal-like) that had different prognoses and treatment responses
Single-cell RNA-seq: UMAP is now the standard for visualizing single-cell data (10,000-50,000 genes × 100,000+ cells). Tools like Scanpy and Seurat use UMAP by default

🌍 Case Study 3: Google — Embedding Spaces at Scale

Application

Google's recommendation systems embed users and items into high-dimensional spaces (typically 64-512 dimensions). For serving recommendations at 8.5 billion search queries/day, they use:

Product Quantization (PQ): A form of dimensionality reduction that splits 128D vectors into 8 sub-vectors of 16D each, quantizing each sub-vector to 256 centroids. This compresses each embedding from 512 bytes to 8 bytes
ScaNN (Scalable Nearest Neighbors): Uses PCA as a preprocessing step for approximate nearest neighbor search across billions of embeddings

🌍 Case Study 4: Tesla Autopilot — Sensor Fusion

Application

Tesla's 8 cameras produce ~1.2 billion pixels per second. The vision backbone reduces each camera's 1280×960×3 image (~3.7M dimensions) to a 256-dimensional feature vector using a neural network (a learned form of dimensionality reduction). These compact representations are fused across cameras and time for 3D scene understanding.

15

Startup Applications

🧬 Elucidata (India)

Uses UMAP on multi-omics data (genomics + proteomics + metabolomics) to help pharmaceutical companies identify drug targets. Their platform reduces 50,000+ molecular features to interpretable 2D visualizations that biologists can explore interactively.

🎨 Runway ML (US)

Uses PCA and t-SNE in their "Concept Canvas" to let artists explore the latent space of generative models. Users can navigate the reduced space to find novel image generations. The 512D StyleGAN latent space is projected to 2D for interactive exploration.

🛡️ Darktrace (UK)

Their Enterprise Immune System uses PCA on network traffic features (100+ dimensions including packet sizes, timing patterns, protocols) to detect anomalies. Reducing to 10 principal components makes the real-time computation feasible across 10,000+ devices per enterprise.

📊 SigTuple (India)

Uses dimensionality reduction on medical image features (blood cell morphology — 200+ shape/color features) to assist pathologists. PCA reduces feature vectors before classification, enabling real-time analysis on low-power edge devices in rural health centers.

16

Government Applications

🇮🇳 DRDO — Radar Signal Processing

DRDO's radar systems use PCA for clutter suppression. Radar returns contain signal (target) + clutter (ground, weather) + noise. PCA decomposes returns into principal components: the first few correspond to strong clutter, which can be removed. This is equivalent to applying a subspace projection filter, improving target detection by 10-15 dB.

🇺🇸 NSA — Signal Intelligence

Communication signals contain thousands of features (frequency components, timing patterns, modulation characteristics). PCA and LDA are used to classify signal types and identify emitters. Dimensionality reduction enables real-time processing of intercepted communications.

🇮🇳 Census of India — Demographic Analysis

India's census collects 100+ socioeconomic variables per household across 640+ districts. PCA identifies the key factors driving demographic variation: typically 5-7 principal components capture 80% of variance, corresponding to interpretable factors like urbanization, literacy, agricultural dependency, and industrial development.

🇪🇺 ESA/Copernicus — Earth Observation

The European Space Agency's Sentinel-2 satellites capture 13 spectral bands. PCA reduces these to 3 components for false-color visualization, enabling rapid assessment of vegetation health, water quality, and urban sprawl across the European Union.

17

Industry Applications

💊 Pharma — Drug Discovery

Molecular descriptors for drug candidates often number in the thousands (topological indices, electronic properties, binding affinities). PCA reduces these to 10-20 "chemical space" coordinates. Companies like AstraZeneca use this to navigate vast libraries of candidate molecules efficiently.

🏭 Manufacturing — Process Control

A semiconductor fab monitors 500+ process variables per wafer. PCA-based multivariate statistical process control (MSPC) reduces these to ~20 principal components, detecting out-of-control conditions as deviations in the reduced space. This catches defects 3-5x faster than monitoring individual variables.

🎵 Spotify — Audio Features

Spotify extracts 128 audio features from each song (tempo, key, energy, spectral features, MFCCs). PCA and autoencoders compress these into 32-dimensional "taste profiles" for users and songs. The Discover Weekly algorithm matches users to songs by proximity in this reduced space.

💰 Finance — Risk Factor Models

Portfolio managers track thousands of assets. PCA on the correlation matrix of returns identifies the "risk factors" driving market movements. The first principal component typically captures 30-40% of variance (the "market factor"), followed by sector rotations, value/growth, and size factors — the foundation of factor investing.

18

Mini Projects

🎯 Mini Project 1: Eigenfaces — Face Recognition with PCA

Build a face recognition system using PCA (the "Eigenfaces" method, Turk & Pentland, 1991).

Python eigenfaces.py

"""
Mini Project 1: Eigenfaces for Face Recognition
=================================================
Uses PCA to reduce face image dimensionality, then KNN for classification.
The principal components (eigenfaces) capture the key modes of variation
in human faces: lighting, expression, identity.
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load Labeled Faces in the Wild dataset
# Only keep people with ≥ 50 images for reliable training
print("Loading LFW dataset (this may download ~200MB on first run)...")
lfw = fetch_lfw_people(min_faces_per_person=50, resize=0.5)
X, y = lfw.data, lfw.target
names = lfw.target_names
n_samples, n_features = X.shape
h, w = lfw.images.shape[1], lfw.images.shape[2]

print(f"Dataset: {n_samples} images, {n_features} pixels each ({h}x{w})")
print(f"Classes: {len(names)} people: {names}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Step 1: PCA (finding Eigenfaces)
n_components = 100  # Keep 100 principal components
pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True, random_state=42)
pca.fit(X_train)

# Transform data
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

print(f"\nDimensionality: {n_features} → {n_components}")
print(f"Variance explained: {sum(pca.explained_variance_ratio_):.2%}")

# Visualize top eigenfaces
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i, ax in enumerate(axes.flat):
    eigenface = pca.components_[i].reshape(h, w)
    ax.imshow(eigenface, cmap='RdBu_r')
    ax.set_title(f'PC {i+1}\n({pca.explained_variance_ratio_[i]:.1%})')
    ax.axis('off')
plt.suptitle('Top 10 Eigenfaces', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Step 2: SVM Classification on PCA-reduced features
svm = SVC(kernel='rbf', C=10, gamma='scale', class_weight='balanced')
svm.fit(X_train_pca, y_train)
y_pred = svm.predict(X_test_pca)

print("\n" + classification_report(y_test, y_pred, target_names=names))

# Step 3: Reconstruction quality at different k
fig, axes = plt.subplots(2, 6, figsize=(18, 6))
sample = X_test[0]  # Original face

axes[0, 0].imshow(sample.reshape(h, w), cmap='gray')
axes[0, 0].set_title('Original')
axes[0, 0].axis('off')

for i, k in enumerate([5, 10, 25, 50, 100]):
    pca_k = PCA(n_components=k).fit(X_train)
    reconstructed = pca_k.inverse_transform(pca_k.transform([sample]))[0]
    axes[0, i+1].imshow(reconstructed.reshape(h, w), cmap='gray')
    axes[0, i+1].set_title(f'k={k} ({sum(pca_k.explained_variance_ratio_):.0%})')
    axes[0, i+1].axis('off')

# Show reconstruction error
for i, k in enumerate([5, 10, 25, 50, 100]):
    pca_k = PCA(n_components=k).fit(X_train)
    recon = pca_k.inverse_transform(pca_k.transform([sample]))[0]
    error = np.abs(sample - recon)
    axes[1, i+1].imshow(error.reshape(h, w), cmap='hot')
    axes[1, i+1].set_title(f'Error (k={k})')
    axes[1, i+1].axis('off')

axes[1, 0].axis('off')
plt.suptitle('Face Reconstruction at Different PCA Dimensions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

🎯 Mini Project 2: High-Dimensional Data Explorer

Build an interactive tool that applies PCA, t-SNE, and UMAP to any dataset and compares results side-by-side.

Python dim_reduction_explorer.py

"""
Mini Project 2: High-Dimensional Data Explorer
=================================================
A comprehensive tool for comparing dimensionality reduction methods.
Generates publication-quality comparison plots.
"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits, load_wine, load_breast_cancer
import time

class DimReductionExplorer:
    """Compares PCA, Kernel PCA, t-SNE, and (optionally) UMAP."""
    
    def __init__(self, X, y, feature_names=None, class_names=None, dataset_name="Dataset"):
        self.X_raw = X
        self.y = y
        self.feature_names = feature_names
        self.class_names = class_names
        self.dataset_name = dataset_name
        self.scaler = StandardScaler()
        self.X = self.scaler.fit_transform(X)
        self.results = {}
    
    def run_all(self, tsne_perplexity=30, kpca_kernel='rbf', kpca_gamma=None):
        """Run all dimensionality reduction methods."""
        print(f"=== {self.dataset_name}: {self.X.shape[0]} samples, {self.X.shape[1]} features ===\n")
        
        # PCA
        t0 = time.time()
        pca = PCA(n_components=2)
        self.results['PCA'] = {
            'embedding': pca.fit_transform(self.X),
            'time': time.time() - t0,
            'model': pca
        }
        print(f"PCA:        {time.time()-t0:.3f}s | EVR: {sum(pca.explained_variance_ratio_):.2%}")
        
        # Kernel PCA
        t0 = time.time()
        kpca = KernelPCA(n_components=2, kernel=kpca_kernel, gamma=kpca_gamma)
        self.results['Kernel PCA'] = {
            'embedding': kpca.fit_transform(self.X),
            'time': time.time() - t0,
            'model': kpca
        }
        print(f"Kernel PCA: {time.time()-t0:.3f}s")
        
        # t-SNE
        t0 = time.time()
        tsne = TSNE(n_components=2, perplexity=tsne_perplexity, random_state=42, init='pca')
        self.results['t-SNE'] = {
            'embedding': tsne.fit_transform(self.X),
            'time': time.time() - t0,
            'model': tsne
        }
        print(f"t-SNE:      {time.time()-t0:.1f}s | KL div: {tsne.kl_divergence_:.4f}")
        
        # UMAP (optional)
        try:
            import umap
            t0 = time.time()
            reducer = umap.UMAP(n_components=2, random_state=42)
            self.results['UMAP'] = {
                'embedding': reducer.fit_transform(self.X),
                'time': time.time() - t0,
                'model': reducer
            }
            print(f"UMAP:       {time.time()-t0:.1f}s")
        except ImportError:
            print("UMAP: skipped (pip install umap-learn)")
    
    def plot_comparison(self, figsize=(20, 5)):
        """Plot all methods side by side."""
        methods = list(self.results.keys())
        n_methods = len(methods)
        
        fig, axes = plt.subplots(1, n_methods, figsize=(5*n_methods, 5))
        if n_methods == 1:
            axes = [axes]
        
        cmap = 'tab10' if len(np.unique(self.y)) <= 10 else 'tab20'
        
        for ax, method in zip(axes, methods):
            Z = self.results[method]['embedding']
            t = self.results[method]['time']
            scatter = ax.scatter(Z[:, 0], Z[:, 1], c=self.y, cmap=cmap,
                               s=10, alpha=0.6, edgecolors='none')
            ax.set_title(f'{method}\n({t:.2f}s)', fontsize=12, fontweight='bold')
            ax.set_xticks([])
            ax.set_yticks([])
        
        plt.colorbar(scatter, ax=axes, label='Class', shrink=0.8)
        plt.suptitle(f'{self.dataset_name}: Dimensionality Reduction Comparison',
                    fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
    
    def plot_pca_analysis(self, max_components=None):
        """Detailed PCA analysis: scree plot, cumulative variance, loadings."""
        if max_components is None:
            max_components = min(self.X.shape)
        
        pca_full = PCA(n_components=max_components).fit(self.X)
        
        fig, axes = plt.subplots(1, 3, figsize=(18, 5))
        k = min(max_components, 20)
        
        # Scree plot
        axes[0].bar(range(1, k+1), pca_full.explained_variance_ratio_[:k],
                   color='#059669', alpha=0.7)
        axes[0].set_xlabel('Component')
        axes[0].set_ylabel('Explained Variance Ratio')
        axes[0].set_title('Scree Plot')
        
        # Cumulative
        cum = np.cumsum(pca_full.explained_variance_ratio_[:k]) * 100
        axes[1].plot(range(1, k+1), cum, 'o-', color='#059669', linewidth=2)
        axes[1].axhline(y=95, color='red', linestyle='--', label='95%')
        axes[1].axhline(y=90, color='orange', linestyle='--', label='90%')
        axes[1].set_xlabel('Number of Components')
        axes[1].set_ylabel('Cumulative Variance (%)')
        axes[1].set_title('Cumulative Explained Variance')
        axes[1].legend()
        
        # Top features for PC1 and PC2
        if self.feature_names is not None:
            loadings = pca_full.components_[:2]
            top_features_pc1 = np.argsort(np.abs(loadings[0]))[-5:]
            axes[2].barh(range(5), loadings[0][top_features_pc1], color='#059669')
            axes[2].set_yticks(range(5))
            axes[2].set_yticklabels([self.feature_names[i] for i in top_features_pc1])
            axes[2].set_title('Top 5 Feature Loadings (PC1)')
        
        plt.suptitle(f'{self.dataset_name}: PCA Analysis', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()

# ============================================
# DEMO: Run on multiple datasets
# ============================================
if __name__ == "__main__":
    # Dataset 1: Wine
    wine = load_wine()
    explorer = DimReductionExplorer(
        wine.data, wine.target,
        feature_names=wine.feature_names,
        class_names=wine.target_names,
        dataset_name="Wine Dataset"
    )
    explorer.run_all()
    explorer.plot_comparison()
    explorer.plot_pca_analysis()
    
    # Dataset 2: Digits
    digits = load_digits()
    explorer2 = DimReductionExplorer(
        digits.data, digits.target,
        dataset_name="Digits Dataset (64D)"
    )
    explorer2.run_all(tsne_perplexity=30)
    explorer2.plot_comparison()

🎯 Mini Project 3: Image Compression with PCA

Compress images by applying PCA to image patches and reconstructing at different compression ratios.

Python image_compression_pca.py

"""
Mini Project 3: Image Compression using PCA
=============================================
Treats each row of the image as a sample with width features.
PCA extracts principal components = dominant patterns across rows.
"""
import numpy as np
import matplotlib.pyplot as plt

# Create a sample grayscale image (or load your own)
from sklearn.datasets import load_sample_image

try:
    china = load_sample_image("china.jpg")
    # Convert to grayscale
    img = np.mean(china, axis=2)  # Average RGB channels
except:
    # Fallback: create a synthetic image
    x = np.linspace(0, 4*np.pi, 200)
    y = np.linspace(0, 4*np.pi, 300)
    X, Y = np.meshgrid(x, y)
    img = np.sin(X) * np.cos(Y) + np.random.randn(300, 200) * 0.1

print(f"Image shape: {img.shape}")
print(f"Original size: {img.nbytes / 1024:.1f} KB")

# Apply PCA with different numbers of components
from sklearn.decomposition import PCA

fig, axes = plt.subplots(2, 4, figsize=(20, 10))
k_values = [1, 5, 10, 25, 50, 100, 200, img.shape[1]]

for ax, k in zip(axes.flat, k_values):
    k = min(k, min(img.shape))
    pca = PCA(n_components=k)
    compressed = pca.fit_transform(img)     # (height x k)
    reconstructed = pca.inverse_transform(compressed)  # (height x width)
    
    # Compression ratio
    original_size = img.shape[0] * img.shape[1]
    compressed_size = compressed.shape[0] * compressed.shape[1] + pca.components_.size
    ratio = original_size / compressed_size
    mse = np.mean((img - reconstructed) ** 2)
    
    ax.imshow(reconstructed, cmap='gray')
    ax.set_title(f'k={k}\nRatio: {ratio:.1f}x, MSE: {mse:.2f}')
    ax.axis('off')

plt.suptitle('Image Compression with PCA', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

19

End-of-Chapter Exercises

Conceptual Questions

Exercise 16.1

Explain the curse of dimensionality using the "volume of hypersphere" argument. Show mathematically that V_sphere/V_cube → 0 as d → ∞.

Exercise 16.2

Why must we center the data before applying PCA? What happens if we don't? Prove that the mean of the projected data along each principal component is zero.

Exercise 16.3

Explain why PCA is sensitive to the scale of features. Give an example where standardization changes the principal components dramatically.

Exercise 16.4

Derive the reconstruction error formula for PCA: E = Σ_i=k+1^d λ_i. Show that discarding the smallest eigenvalues minimizes the MSE.

Exercise 16.5

Explain the crowding problem in SNE and how t-SNE solves it using a heavy-tailed t-distribution. Why is KL divergence asymmetric, and how does this affect the embedding?

Exercise 16.6

Compare PCA, t-SNE, and UMAP along five dimensions: (1) linearity, (2) computational complexity, (3) preservation of global vs. local structure, (4) out-of-sample extension, (5) determinism.

Exercise 16.7

Explain why LDA is limited to at most C-1 components (where C is the number of classes). What happens if S_W is singular?

Computational Exercises

Exercise 16.8

Given the data matrix X = [[1,2],[3,4],[5,6],[7,8]], compute PCA by hand: (a) center the data, (b) compute the covariance matrix, (c) find eigenvalues and eigenvectors, (d) project onto PC1.

Exercise 16.9

For the covariance matrix C = [[4, 2], [2, 3]], find the eigenvalues, eigenvectors, and explain what the principal components represent geometrically.

Exercise 16.10

Implement Kernel PCA from scratch using the RBF kernel. Test it on the make_moons dataset and compare with linear PCA.

Exercise 16.11

Write a function that computes the optimal number of PCA components using: (a) the 95% variance threshold, (b) Kaiser's rule (λ > 1 for standardized data), (c) the elbow method (maximum curvature of scree plot).

Exercise 16.12

Implement the biplot function from scratch. For the Wine dataset, create a biplot and interpret which features drive each principal component.

Programming Projects

Exercise 16.13

Apply PCA to the MNIST dataset. Plot accuracy of a KNN classifier vs. number of PCA components (k = 1, 2, 5, 10, 20, 50, 100, 200, 784). Find the optimal k.

Exercise 16.14

Run t-SNE on MNIST with perplexities [5, 10, 30, 50, 100] and learning rates [10, 50, 200, 1000]. Create a 5×4 grid of visualizations. Which hyperparameter has more impact?

Exercise 16.15

Compare PCA + Logistic Regression vs. LDA + Logistic Regression on three datasets (Wine, Iris, Breast Cancer). Report accuracy, training time, and number of components used.

Exercise 16.16

Implement incremental PCA for a dataset that doesn't fit in memory. Use sklearn's IncrementalPCA with batch_size=100 on a 100,000-sample dataset. Compare results with standard PCA.

Exercise 16.17

Apply UMAP with different metrics (euclidean, cosine, manhattan) to the 20 Newsgroups text dataset (TF-IDF features). Which metric produces the best cluster separation?

Research-Oriented Exercises

Exercise 16.18

Implement Sparse PCA using sklearn and compare the resulting components with standard PCA. When and why would sparse components be preferred?

Exercise 16.19

The Johnson-Lindenstrauss lemma states that random projections can approximately preserve distances. Implement a random projection and compare with PCA for nearest-neighbor queries. At what dimensionality do they converge?

Exercise 16.20

Implement Probabilistic PCA (PPCA) which models data as X = Wz + μ + ε where z ~ N(0, I) and ε ~ N(0, σ²I). Show how it handles missing data gracefully, unlike standard PCA.

Exercise 16.21

Compare the stability of t-SNE and UMAP embeddings across different random seeds. Run each 10 times and quantify consistency using Procrustes distance. Which is more reproducible?

Exercise 16.22

Build an "embedding space interpolation" tool: given two images, project them to PCA space, interpolate between their PCA coordinates, and reconstruct intermediate images. Apply to faces and digits.

20

Multiple Choice Questions

Q1: What do the eigenvalues of the covariance matrix represent in PCA?

A) The mean of each feature
B) The variance explained by each principal component
C) The correlation between features
D) The number of samples

✅ B) Each eigenvalue λᵢ equals the variance of the data projected onto eigenvector vᵢ. Var(z) = wᵀCw = wᵀλw = λ (when w is unit length).

Q2: If a dataset has d features and n samples (n < d), what is the maximum number of non-zero eigenvalues of the covariance matrix?

A) d
B) n
C) n - 1
D) min(n, d)

✅ C) After centering, the data lies in at most an (n-1)-dimensional subspace. The rank of the centered data matrix is at most n-1, so the covariance matrix has at most n-1 non-zero eigenvalues.

Q3: What distribution does t-SNE use in the low-dimensional space?

A) Gaussian distribution
B) Student's t-distribution with 1 degree of freedom (Cauchy)
C) Uniform distribution
D) Exponential distribution

✅ B) The Student's t-distribution with 1 DoF (Cauchy) has heavier tails than a Gaussian. This solves the crowding problem by allowing moderate distances in high-D to map to large distances in low-D without excessive penalty.

Q4: What does the perplexity parameter in t-SNE approximately control?

A) The output dimensionality
B) The learning rate
C) The effective number of neighbors considered
D) The number of iterations

✅ C) Perplexity = 2^H(P_i) where H is the entropy of the conditional distribution. A perplexity of 30 means each point effectively has ~30 neighbors. Typical range: 5-50.

Q5: Why does UMAP generally outperform t-SNE in preserving global structure?

A) UMAP uses a Gaussian distribution
B) UMAP uses more iterations
C) UMAP minimizes cross-entropy (not KL divergence), which penalizes both false neighbors and missing neighbors
D) UMAP always uses more neighbors

✅ C) KL(P||Q) only penalizes cases where p_ij > q_ij (nearby points placed far apart). Cross-entropy also penalizes q_ij > 0 when w_ij ≈ 0 (distant points placed close together), helping preserve global structure.

Q6: What is the maximum number of discriminant components LDA can produce for a C-class problem?

A) C
B) C - 1
C) d (number of features)
D) min(n, d)

✅ B) The between-class scatter matrix S_B has rank at most C-1 (since it's the sum of C rank-1 matrices minus one constraint). Therefore S_W⁻¹S_B has at most C-1 non-zero eigenvalues.

Q7: In the SVD of centered data X̃ = UΣVᵀ, the principal components directions correspond to:

A) Columns of U
B) Diagonal of Σ
C) Columns of V (rows of Vᵀ)
D) Rows of U

✅ C) V contains the right singular vectors, which are the eigenvectors of X̃ᵀX̃ (proportional to the covariance matrix). The eigenvalues are λᵢ = σᵢ²/(n-1).

Q8: Which statement about Kernel PCA is FALSE?

A) It can capture non-linear relationships
B) It uses the kernel trick to avoid explicit mapping
C) It is always faster than linear PCA
D) It works with an n×n kernel matrix instead of d×d covariance

✅ C) Kernel PCA requires computing and eigendecomposing an n×n kernel matrix (O(n³)), which is slower than linear PCA's O(d³) or O(nd²) when n > d. It's only preferable when data has non-linear structure.

Q9: What is the primary advantage of feature selection over feature extraction?

A) Better compression ratio
B) Preserved interpretability of original features
C) Lower information loss
D) Works with any data type

✅ B) Feature selection keeps original features intact, so a doctor can still interpret "blood glucose > 126" rather than "PC3 > 1.5". Feature extraction (PCA, t-SNE) creates abstract combinations that are mathematically optimal but hard to interpret.

Q10: The Kaiser criterion for selecting principal components recommends keeping components with:

A) Explained variance ratio > 5%
B) Eigenvalue > 1 (for standardized data)
C) Cumulative variance > 50%
D) Loading > 0.5

✅ B) For standardized data, each original feature has variance 1. A principal component with eigenvalue > 1 explains more variance than any single original feature, making it worth keeping. This criterion works well for moderate-dimensional data (d < 50).

Q11: Why is it recommended to apply PCA before t-SNE for very high-dimensional data?

A) t-SNE requires linear data
B) PCA improves cluster separation
C) PCA reduces noise and speeds up t-SNE's O(n²) distance computation
D) PCA makes the data Gaussian

✅ C) t-SNE has O(n²d) complexity for computing pairwise distances. PCA as a preprocessing step (e.g., 784D → 50D) reduces d by 15x, dramatically speeding up t-SNE while also removing noisy dimensions that could confuse the embedding.

Q12: What is the "whitening" transformation in PCA?

A) Removing outliers
B) Normalizing to [0, 1]
C) Scaling principal components to have unit variance (Z̃ = ZΛ^(-1/2))
D) Removing correlated features

✅ C) Whitening divides each PC by √λᵢ, making the transformed data have identity covariance matrix. This is useful for algorithms that assume uncorrelated, unit-variance features (e.g., ICA, some neural networks).

21

Interview Questions

IQ 16.1 — Conceptual (Google, Amazon)

Q: Explain PCA to a non-technical stakeholder. When would you use it and when would you avoid it?

Model Answer: PCA finds the most important "directions" in your data. Imagine a cloud of data points in 3D — PCA finds the best 2D plane to project them onto while losing the least information, like taking the best possible photograph of a 3D object. Use it when: features are correlated, you need to speed up ML models, or reduce storage. Avoid it when: feature interpretability is critical (e.g., regulatory compliance), data has strong non-linear structure, or when the number of features is already small.

IQ 16.2 — Technical (Microsoft, Meta)

Q: What's the computational complexity of PCA? How does SVD help with the n >> d and n << d cases?

Model Answer: Eigendecomposition of d×d covariance matrix: O(d³). But forming the covariance matrix costs O(nd²). SVD of n×d data matrix costs O(min(n²d, nd²)). When n >> d: SVD costs O(nd²), same as eigendecomposition route. When n << d (e.g., genomics): SVD is O(n²d), much cheaper than the O(d³) eigendecomposition. Randomized SVD (used in sklearn) can further reduce to O(ndk) for k components.

IQ 16.3 — System Design (Flipkart, Netflix)

Q: You're building a product recommendation system with 1 million products, each represented as a 768-dimensional BERT embedding. How would you enable real-time nearest-neighbor search?

Model Answer: (1) PCA: reduce 768D → 128D (retaining ~95% variance, 6x memory reduction). (2) Product Quantization: split 128D into 8 sub-vectors of 16D, quantize each to 256 centroids = 8 bytes/product. Total memory: 1M × 8 bytes = 8 MB (vs 3 GB original). (3) Use FAISS or ScaNN for ANN search. (4) For production: build an IVF (Inverted File) index with ~1000 partitions. Query time: ~1ms (vs ~1s for brute force). PCA is the critical first step that makes everything else feasible.

IQ 16.4 — Deep Technical (OpenAI, DeepMind)

Q: Why is t-SNE non-parametric (can't transform new points), while UMAP can? How does parametric t-SNE solve this?

Model Answer: t-SNE optimizes positions y_i directly via gradient descent — there's no learned function f(x) → y. To embed a new point x_new, you'd need to re-run the entire optimization. UMAP, however, learns a graph structure that can be used to map new points by finding their nearest neighbors in the training set and interpolating. Parametric t-SNE (Sainburg et al., 2021) uses a neural network to learn f(x) → y, training it with the same KL divergence objective. This gives the best of both worlds: t-SNE quality + out-of-sample capability.

IQ 16.5 — Applied (ISRO, DRDO)

Q: How would you apply PCA to hyperspectral satellite images for crop classification?

Model Answer: (1) Stack all pixels as rows, spectral bands as columns (N×B matrix). (2) Standardize each band. (3) Apply PCA to reduce B=200 bands to k=15 components (capturing 99%+ variance). (4) Feed reduced features to a Random Forest or SVM classifier with training labels (crop type). Key insight: spectral bands are highly correlated (neighboring wavelengths measure similar properties), making PCA extremely effective. Additional trick: use "MNF transform" (Minimum Noise Fraction), a noise-aware variant of PCA that orders components by signal-to-noise ratio rather than variance.

IQ 16.6 — Tricky (Amazon, Bloomberg)

Q: Can you use PCA for feature selection? What's the difference?

Model Answer: PCA does feature extraction, not selection. PCs are linear combinations of ALL original features — you can't remove any single feature. However, you can use PCA loadings to identify which features contribute most to the top PCs, then keep only those features (this is "PCA-guided feature selection"). Also, Sparse PCA (which adds an L1 penalty) produces components that use only a few original features, bridging the gap between extraction and selection.

IQ 16.7 — ML Engineering (Uber, Airbnb)

Q: You applied t-SNE and got beautiful clusters, but your classifier accuracy didn't improve. Why?

Model Answer: t-SNE is optimized for visualization (preserving local neighborhoods), not for preserving distances or linear separability. Several issues: (1) t-SNE distorts distances — close clusters might actually be far apart and vice versa. (2) t-SNE is non-parametric — you can't apply the same transformation to test data. (3) The 2D embedding discards information that a classifier could use. For classification, use PCA (preserves variance) or LDA (maximizes class separation) instead. t-SNE is a diagnostic tool, not a preprocessing step.

IQ 16.8 — Statistics (Goldman Sachs, JP Morgan)

Q: In finance, PCA applied to the correlation matrix of stock returns typically shows the first PC explaining 30-40% of variance. What does this PC represent?

Model Answer: The first PC is the "market factor" — it has roughly equal positive loadings on all stocks and represents broad market movements (risk-on/risk-off sentiment). PC2 often represents sector rotation (e.g., growth vs. value). PC3 might capture size effects. This is the basis of factor models (APT, Fama-French). Traders use PCA to decompose portfolio risk: the residual variance after removing the top PCs represents idiosyncratic/stock-specific risk.

IQ 16.9 — ML Theory (Research Labs)

Q: Prove that PCA minimizes reconstruction error. What's the connection between maximum variance and minimum error?

Model Answer: Total variance = Σ λᵢ. Keeping k components preserves Σ_{i=1}^k λᵢ variance. Reconstruction error = Σ_{i=k+1}^d λᵢ = Total - Preserved. Minimizing reconstruction error ⟺ maximizing preserved variance ⟺ choosing the largest k eigenvalues. The proof uses the Eckart-Young theorem: the best rank-k approximation to a matrix (in Frobenius norm) is given by its top-k SVD components.

IQ 16.10 — Practical (Data Science Roles)

Q: You have a dataset with 1000 features but only 100 samples. What problems do you face, and how does PCA help?

Model Answer: With p >> n: (1) Covariance matrix (1000×1000) is rank-deficient (rank ≤ 99), making it non-invertible. (2) Models will overfit — more parameters than data points. (3) Distance-based methods fail (curse of dimensionality). PCA helps by: (a) Reducing to at most 99 meaningful components (99 non-zero eigenvalues). (b) Typically 10-20 PCs capture most variance, giving a well-conditioned feature space. (c) Computing PCA via SVD of the 100×1000 data matrix (cheaper than 1000×1000 eigendecomposition). This is the standard approach in genomics, text mining, and neuroimaging.

22

Research Problems

🔬 Research Problem 1: Adaptive Perplexity for t-SNE

Standard t-SNE uses a single perplexity value for all points. But in datasets with varying density (e.g., a dense cluster near a sparse one), uniform perplexity is suboptimal. Design and implement an adaptive perplexity scheme where each point uses a different perplexity based on local density. Compare with standard t-SNE on synthetic datasets with multi-scale cluster structure.

Starting point: Compute local density ρ_i = 1/d̄_k (inverse of mean k-nearest-neighbor distance). Set perplexity_i ∝ ρ_i^-α for some parameter α. Evaluate using: (a) trustworthiness metric, (b) silhouette score of known clusters, (c) visual quality.

Key reference: Linderman et al., "Clustering with t-SNE, provably" (2019).

🔬 Research Problem 2: Dimensionality Reduction for Time Series

Standard PCA treats each feature independently, ignoring temporal correlations. Design a "Temporal PCA" that finds principal components in the time-frequency domain. Apply to: (a) EEG brain signals (64 channels × 1000 time points), (b) stock price histories (500 stocks × 252 trading days).

Approach: Apply Short-Time Fourier Transform to each channel, flatten the time-frequency representation, then apply PCA. Compare with: standard PCA on raw time series, Dynamic Time Warping + MDS, and the method of Dynamic Mode Decomposition (DMD).

🔬 Research Problem 3: Fairness in Dimensionality Reduction

PCA maximizes total variance, which may disproportionately represent majority groups. For example, in face recognition, eigenfaces trained on predominantly light-skinned faces may poorly represent darker-skinned faces. Design a "Fair PCA" that ensures equal reconstruction error across demographic groups.

Formulation: Minimize max_g∈Groups E_g[||x - x̂||²] subject to dim(projection) = k. Compare with: standard PCA, group-stratified PCA (separate PCA per group), and regularized PCA with group-wise reconstruction constraints.

Key reference: Samadi et al., "The Price of Fair PCA" (NeurIPS 2018).

🔬 Research Problem 4: Topology-Preserving Dimensionality Reduction

UMAP is inspired by topology, but how well does it actually preserve topological features (holes, loops, connected components)? Develop quantitative metrics based on persistent homology to evaluate how well PCA, t-SNE, UMAP, and autoencoders preserve the topological structure of data manifolds.

Tools: Use Ripser or GUDHI for persistent homology computation. Test on: (a) torus, (b) Klein bottle samples, (c) real protein conformation data.

23

Key Takeaways

🎯

The Curse is Real: High-dimensional spaces are geometrically weird — distances concentrate, volumes explode, and data becomes sparse. Dimensionality reduction is not optional for high-D data; it's essential for both computational and statistical reasons.

📐

PCA = Maximum Variance = Minimum Reconstruction Error: PCA finds orthogonal directions that capture the most variance. Mathematically, principal components are eigenvectors of the covariance matrix. The eigenvalue equals the variance along that direction. This can be computed via eigendecomposition or SVD.

🔗

SVD is PCA's Best Friend: PCA via SVD avoids forming the covariance matrix, is numerically more stable, and is cheaper when n << d. Scikit-learn uses SVD internally. The connection: λᵢ = σᵢ²/(n-1).

🌀

t-SNE Reveals Local Structure: By modeling pairwise similarities as probabilities and minimizing KL divergence, t-SNE produces stunning 2D visualizations that reveal cluster structure invisible to PCA. The t-distribution's heavy tail solves the crowding problem. Key hyperparameter: perplexity (≈ number of neighbors).

⚡

UMAP: The Modern Standard: UMAP preserves both local and global structure, runs much faster than t-SNE, and can transform new data. Its cross-entropy objective (vs. t-SNE's KL divergence) handles both attraction and repulsion. Use UMAP as your default for visualization and as a preprocessing step.

🏷️

LDA Uses Labels: When you have class labels, LDA is better than PCA for classification preprocessing because it maximizes class separability (between-class to within-class variance ratio). Limited to C-1 components.

⚖️

Selection vs. Extraction: Feature selection (filter, wrapper, embedded) preserves interpretability. Feature extraction (PCA, t-SNE, UMAP) preserves information. Choose based on whether you need to explain which features matter or how much information to keep.

📊

Use Scree Plots and EVR: Always visualize eigenvalues (scree plot) and cumulative explained variance before choosing k. The 95% threshold, elbow method, and Kaiser's rule (λ > 1) are complementary criteria. There is no single "correct" k.

⚠️

t-SNE is for Visualization Only: Never use t-SNE embeddings as features for a classifier. It distorts distances, is non-deterministic, and is non-parametric. PCA, LDA, or UMAP are better for preprocessing. t-SNE is a diagnostic tool — treat it like a microscope, not a scalpel.

24

References

Foundational Papers

Pearson, K. (1901). "On Lines and Planes of Closest Fit to Systems of Points in Space." Philosophical Magazine, 2(11), 559–572.
Hotelling, H. (1933). "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology, 24(6), 417–441.
Fisher, R.A. (1936). "The Use of Multiple Measurements in Taxonomic Problems." Annals of Eugenics, 7(2), 179–188.
Golub, G.H. & Kahan, W. (1965). "Calculating the Singular Values and Pseudo-Inverse of a Matrix." SIAM Journal on Numerical Analysis, 2(2), 205–224.
Schölkopf, B., Smola, A., & Müller, K.R. (1998). "Nonlinear Component Analysis as a Kernel Eigenvalue Problem." Neural Computation, 10(5), 1299–1319.

t-SNE and UMAP

Hinton, G.E. & Roweis, S.T. (2002). "Stochastic Neighbor Embedding." NeurIPS.
van der Maaten, L. & Hinton, G. (2008). "Visualizing Data using t-SNE." JMLR, 9, 2579–2605.
McInnes, L., Healy, J., & Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." arXiv:1802.03426.
Linderman, G.C. et al. (2019). "Fast Interpolation-based t-SNE for Improved Visualization of Single-Cell RNA-Seq Data." Nature Methods, 16, 243–245.

Applications

Turk, M. & Pentland, A. (1991). "Eigenfaces for Recognition." Journal of Cognitive Neuroscience, 3(1), 71–86.
Jolliffe, I.T. & Cadima, J. (2016). "Principal Component Analysis: A Review and Recent Developments." Phil. Trans. R. Soc. A, 374, 20150202.
Becht, E. et al. (2019). "Dimensionality Reduction for Visualizing Single-Cell Data Using UMAP." Nature Biotechnology, 37, 38–44.

Textbooks

Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 12 (PCA).
Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 20.
Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed. O'Reilly. Chapter 8.

Indian Context

UIDAI Technical Reports on Biometric De-duplication Architecture (2020).
ISRO/NRSC Hyperspectral Remote Sensing Documentation, NRSC-TR-2019.
Rao, C.R. (1948). "The Utilization of Multiple Measurements in Problems of Biological Classification." JRSS B, 10(2), 159–203.

Learning Objectives

Introduction

Historical Background

3.1 The Birth of PCA (1901–1936)

3.2 SVD and the Numerical Revolution (1960s–1970s)

3.3 Kernel Methods and Non-Linear Extensions (1998)

3.4 t-SNE: A Visualization Revolution (2008)

3.5 UMAP: Speed and Global Structure (2018)

Conceptual Explanation

4.1 The Curse of Dimensionality

Problem 1: Distance Concentration

Problem 2: Volume Explosion

Problem 3: Empty Space

Problem 4: Statistical Sparsity

4.2 Feature Selection vs Feature Extraction

4.3 Overview of Methods

PCA — Linear, Unsupervised

t-SNE — Non-Linear, Unsupervised

UMAP — Non-Linear, Unsupervised

LDA — Linear, Supervised

Mathematical Foundation

5.1 Covariance Matrix

5.2 Eigenvalue Decomposition

5.3 PCA Projection

5.4 Explained Variance Ratio

5.5 SVD Connection to PCA

5.6 Kernel PCA

5.7 t-SNE Mathematics

Step 1: High-D Pairwise Similarities

Step 2: Low-D Similarities with Student-t Distribution

Step 3: Minimize KL Divergence

5.8 UMAP Mathematics (Simplified)

5.9 LDA Mathematics

Formula Derivations

6.1 Deriving PCA from Maximum Variance

Step 1: Set up the projection

Step 2: Write the variance of the projection

Step 3: Constrained optimization

Step 4: Take the derivative and set to zero

Step 5: Interpret the result

Step 6: Second component

6.2 Deriving the t-SNE Gradient

Step 1: Rewrite using qij

Step 2: Differentiate

6.3 PCA via SVD Derivation

Worked Numerical Examples

7.1 PCA by Hand: 3 Features → 2 Components

Given Data (5 samples, 3 features)

Step 1: Compute means

Step 2: Center the data (subtract means)

Step 3: Compute covariance matrix C = (1/4) X̃ᵀ X̃

Step 4: Find eigenvalues

Step 5: Find eigenvectors

Step 6: Project onto top 2 components

Step 7: Explained variance

7.2 Explained Variance — A More Realistic Example

Visual Diagrams

Flowcharts

Python Implementation from Scratch

10.1 PCA Class — Full Eigendecomposition

10.2 PCA via SVD (Numerically Superior)

10.3 t-SNE Simplified Implementation

TensorFlow Implementation

11.1 PCA with TensorFlow (GPU-Accelerated)

11.2 Autoencoder for Non-Linear Dimensionality Reduction

Scikit-Learn Implementation

12.1 Complete PCA Pipeline with Visualization

12.2 t-SNE Pipeline with MNIST

12.3 UMAP Pipeline

12.4 LDA Implementation

Indian Case Studies

🇮🇳 Case Study 1: Aadhaar Biometric Dimensionality Reduction

Challenge

Solution: Dimensionality Reduction in Biometric Matching

Scale

Python Simulation

🇮🇳 Case Study 2: ISRO Satellite Image Compression

Challenge

Solution: PCA-Based Spectral Compression

🇮🇳 Case Study 3: Flipkart Product Embedding

Step 1: Rewrite using q_ij