Part V ยท Chapter 16

Dimensionality Reduction

PCA, t-SNE, UMAP โ€” Taming the Curse of High-Dimensional Data

๐Ÿ“– 3.5 Hours ๐Ÿ“ Advanced Mathematics ๐Ÿ”— Prereqs: Ch 6, Ch 15 ๐Ÿ’ป 3 Implementations
1

Learning Objectives

Upon completing this chapter, you will be able to:

1
Explain the curse of dimensionality and why reducing dimensions is essential
2
Derive PCA from the maximum variance perspective using eigenvalue decomposition
3
Compute the covariance matrix, eigenvalues, eigenvectors, and project data manually
4
Select the optimal number of principal components using explained variance ratios and scree plots
5
Connect Singular Value Decomposition (SVD) to PCA and understand Kernel PCA for non-linear data
6
Understand t-SNE's probabilistic framework: conditional probabilities, KL divergence, perplexity, and the crowding problem
7
Compare UMAP's topological approach: fuzzy simplicial sets, cross-entropy optimization, and superior scalability
8
Apply LDA as a supervised dimensionality reduction technique maximizing class separability
9
Distinguish feature selection from feature extraction and choose the right technique for a given scenario
10
Implement PCA from scratch in Python using NumPy eigendecomposition
11
Build complete pipelines with scikit-learn PCA, t-SNE, and UMAP with publication-quality visualizations
12
Apply dimensionality reduction to real-world problems: Eigenfaces, satellite imagery, biometrics, and genomics
2

Introduction

Imagine you're trying to describe a person. You could use thousands of measurements โ€” height, weight, shoe size, arm length, hair color (encoded numerically), income, education years, and so on. But do you really need all of them? Height and arm length are highly correlated; so are income and education level. Much of the information in those thousands of features is redundant.

Dimensionality reduction is the art and science of finding a smaller set of variables that captures the essential structure of your data. It is one of the most powerful tools in the machine learning arsenal โ€” used for visualization, noise removal, computational speedup, and combating the infamous curse of dimensionality.

This chapter explores four major families of techniques:

We'll also study the connection between PCA and SVD (Singular Value Decomposition), explore Kernel PCA for non-linear mappings, and contrast feature selection with feature extraction. Every concept will be accompanied by mathematical derivations from first principles, hand-computed examples, and Python code.

3

Historical Background

3.1 The Birth of PCA (1901โ€“1936)

Karl Pearson introduced the concept in 1901 as fitting planes to point clouds by minimizing orthogonal distances. Harold Hotelling formalized it in 1933 using eigenvalue decomposition of the covariance matrix, giving us the PCA we know today. The method became computationally practical only in the 1960s with the advent of digital computers and efficient algorithms for eigenvalue computation (QR algorithm by John Francis, 1961).

3.2 SVD and the Numerical Revolution (1960sโ€“1970s)

Gene Golub and William Kahan's 1965 algorithm for computing the Singular Value Decomposition made SVD numerically stable and efficient. The realization that PCA could be computed via SVD (without explicitly forming the covariance matrix) was transformative for large datasets.

3.3 Kernel Methods and Non-Linear Extensions (1998)

Bernhard Schรถlkopf, Alexander Smola, and Klaus-Robert Mรผller introduced Kernel PCA in 1998, extending PCA to non-linear manifolds using the kernel trick โ€” the same insight that powered SVMs.

3.4 t-SNE: A Visualization Revolution (2008)

Laurens van der Maaten and Geoffrey Hinton published t-SNE in 2008, building on Hinton and Roweis's earlier SNE (2002). The key innovation was using a Student's t-distribution in the low-dimensional space to solve the crowding problem. The method quickly became the gold standard for visualizing high-dimensional data like MNIST digits and gene expression data.

3.5 UMAP: Speed and Global Structure (2018)

Leland McInnes, John Healy, and James Melville published UMAP in 2018, grounding dimensionality reduction in the mathematics of algebraic topology (Riemannian geometry and fuzzy simplicial sets). UMAP offered dramatic speedups over t-SNE while better preserving global structure.

4

Conceptual Explanation

4.1 The Curse of Dimensionality

Richard Bellman coined the term "curse of dimensionality" in 1961. The key problems are:

Problem 1: Distance Concentration

In high dimensions, all points become approximately equidistant. If you sample points uniformly in a d-dimensional unit hypercube, the ratio of the maximum to minimum distance approaches 1 as d โ†’ โˆž. This makes distance-based algorithms (KNN, K-Means, RBF kernels) unreliable.

Problem 2: Volume Explosion

A d-dimensional unit hypercube has 2d corners. To sample the space with the same density as 10 samples per dimension, you need 10d points. For d = 100 features, that's 10100 โ€” vastly more than atoms in the universe (~1080).

Problem 3: Empty Space

Most of the volume of a high-dimensional unit sphere concentrates near its surface. A d-dimensional sphere of radius r = 0.99 has volume (0.99)d relative to the full sphere. For d = 500, this is essentially 0. Almost all points are "on the boundary."

Problem 4: Statistical Sparsity

More features mean more parameters to estimate, requiring exponentially more data to achieve the same statistical reliability. Overfitting becomes inevitable.

4.2 Feature Selection vs Feature Extraction

AspectFeature SelectionFeature Extraction
ApproachChoose a subset of original featuresCreate new features as combinations of originals
InterpretabilityHigh โ€” original features are preservedLow โ€” new features may be abstract
MethodsFilter (correlation), Wrapper (RFE), Embedded (Lasso)PCA, t-SNE, UMAP, Autoencoders
Information LossHigher โ€” discarded features are goneLower โ€” all features contribute to new ones
When to UseWhen interpretability is criticalWhen maximum information preservation is needed

4.3 Overview of Methods

PCA โ€” Linear, Unsupervised

PCA finds orthogonal directions (principal components) that maximize variance. It's a linear transformation โ€” the new features are linear combinations of the old ones. Best for: preprocessing, noise removal, compression, and exploratory analysis when data lies on a linear subspace.

t-SNE โ€” Non-Linear, Unsupervised

t-SNE models pairwise similarities as probabilities in both high-D and low-D spaces, then minimizes the KL divergence between them. It excels at revealing cluster structure in 2D/3D visualizations. Limitations: non-parametric (can't transform new points), slow on large datasets, random initialization can give different results.

UMAP โ€” Non-Linear, Unsupervised

UMAP builds a weighted graph in high-D space (fuzzy simplicial set), then optimizes a low-D graph to match it. It preserves both local and global structure, runs much faster than t-SNE, and produces a parametric mapping that can transform new data.

LDA โ€” Linear, Supervised

LDA finds directions that maximize the ratio of between-class variance to within-class variance. Unlike PCA, it uses class labels. Maximum components = C-1 (where C is the number of classes).

MethodLinear?Supervised?PreservesSpeedBest For
PCAโœ… YesโŒ NoGlobal varianceโšก FastPreprocessing, compression
Kernel PCAโŒ NoโŒ NoNon-linear structure๐Ÿข SlowNon-linear manifolds
t-SNEโŒ NoโŒ NoLocal neighborhoods๐Ÿข Slow2D/3D visualization
UMAPโŒ NoโŒ NoLocal + globalโšก FastVisualization + downstream
LDAโœ… Yesโœ… YesClass separabilityโšก FastClassification preprocessing
5

Mathematical Foundation

5.1 Covariance Matrix

Given a data matrix X โˆˆ โ„nร—d (n samples, d features), center the data by subtracting the mean: Xฬƒ = X - ฮผ, where ฮผ = (1/n)ฮฃixi.

Covariance Matrix
C = (1/(n-1)) XฬƒT Xฬƒ โˆˆ โ„dร—d

Cjk = (1/(n-1)) ฮฃi=1n (xij - ฮผj)(xik - ฮผk)

The covariance matrix is symmetric positive semi-definite (all eigenvalues โ‰ฅ 0). Diagonal entries are variances; off-diagonal entries are covariances between feature pairs.

5.2 Eigenvalue Decomposition

Since C is symmetric, it admits an eigendecomposition:

Eigenvalue Decomposition
C vi = ฮปi vi

C = V ฮ› VT

where V = [v1 | v2 | ... | vd] (orthonormal eigenvectors)
and ฮ› = diag(ฮป1, ฮป2, ..., ฮปd) with ฮป1 โ‰ฅ ฮป2 โ‰ฅ ... โ‰ฅ ฮปd โ‰ฅ 0

5.3 PCA Projection

To reduce from d to k dimensions, select the top k eigenvectors W = [v1 | ... | vk] โˆˆ โ„dร—k:

PCA Projection
Z = Xฬƒ W โˆˆ โ„nร—k

Each row zi = WT xฬƒi is the projection of sample i onto k principal components

5.4 Explained Variance Ratio

Explained Variance Ratio
EVRi = ฮปi / ฮฃj=1d ฮปj

Cumulative EVR for k components = ฮฃi=1k ฮปi / ฮฃj=1d ฮปj

The explained variance ratio tells us what fraction of the total variance in the data is captured by each principal component. A common heuristic is to choose k such that the cumulative EVR exceeds 95%.

5.5 SVD Connection to PCA

The centered data matrix Xฬƒ has a Singular Value Decomposition:

SVD of Centered Data
Xฬƒ = U ฮฃ VT

where U โˆˆ โ„nร—n (left singular vectors), ฮฃ โˆˆ โ„nร—d (singular values ฯƒi)
V โˆˆ โ„dร—d (right singular vectors = eigenvectors of XฬƒTXฬƒ)

The connection: C = (1/(n-1))XฬƒTXฬƒ = (1/(n-1))VฮฃTฮฃVT. So ฮปi = ฯƒiยฒ / (n-1), and the right singular vectors V are the principal component directions. This means PCA can be computed via SVD without ever forming the dร—d covariance matrix โ€” crucial when d is large.

5.6 Kernel PCA

For non-linear data, we map to a higher-dimensional space ฯ†: โ„d โ†’ โ„D (D >> d) and perform PCA there. Using the kernel trick K(xi, xj) = ฯ†(xi)ยทฯ†(xj), we work with the nร—n kernel matrix instead of the Dร—D covariance matrix:

Kernel PCA Eigenvalue Problem
Kฬƒ ฮฑi = n ฮปi ฮฑi

where Kฬƒ is the centered kernel matrix:
Kฬƒ = K - 1nK - K1n + 1nK1n
(1n = (1/n) matrix of all ones)

5.7 t-SNE Mathematics

Step 1: High-D Pairwise Similarities

For each pair of points, compute a conditional probability representing "how likely is xj to be a neighbor of xi under a Gaussian centered at xi":

t-SNE: High-D Conditional Probability
pj|i = exp(-||xi - xj||ยฒ / 2ฯƒiยฒ) / ฮฃkโ‰ i exp(-||xi - xk||ยฒ / 2ฯƒiยฒ)

Symmetrize: pij = (pj|i + pi|j) / 2n

The bandwidth ฯƒi is set so that the perplexity of the conditional distribution equals a user-specified value (typically 5โ€“50). Perplexity โ‰ˆ effective number of neighbors:

Perplexity
Perp(Pi) = 2H(Pi) where H(Pi) = -ฮฃj pj|i log2(pj|i)

Step 2: Low-D Similarities with Student-t Distribution

In the low-dimensional space, similarities use a heavy-tailed Student's t-distribution (1 degree of freedom = Cauchy):

t-SNE: Low-D Probability
qij = (1 + ||yi - yj||ยฒ)-1 / ฮฃkโ‰ l (1 + ||yk - yl||ยฒ)-1

The heavy tail allows moderately distant points in high-D to be placed far apart in low-D without penalty, solving the crowding problem.

Step 3: Minimize KL Divergence

t-SNE Objective: KL Divergence
C = KL(P || Q) = ฮฃi ฮฃj pij log(pij / qij)

Gradient: โˆ‚C/โˆ‚yi = 4 ฮฃj (pij - qij)(yi - yj)(1 + ||yi - yj||ยฒ)-1

5.8 UMAP Mathematics (Simplified)

UMAP constructs a weighted graph in high-D using k-nearest neighbors with a fuzzy membership function:

UMAP: Fuzzy Membership
w(xi, xj) = exp(-(d(xi, xj) - ฯi) / ฯƒi)

where ฯi = distance to nearest neighbor (ensures local connectivity)
ฯƒi is calibrated to achieve log2(k) = ฮฃj w(xi, xj)

The fuzzy union of local membership functions is symmetrized: wij = wiโ†’j + wjโ†’i - wiโ†’jยทwjโ†’i. UMAP then minimizes a cross-entropy objective (not KL divergence), which allows repulsive forces between non-neighbors:

UMAP Objective: Cross-Entropy
C = ฮฃedges wij log(wij/qij) + (1-wij) log((1-wij)/(1-qij))

5.9 LDA Mathematics

LDA: Fisher's Criterion
J(w) = (wT SB w) / (wT SW w)

SB = ฮฃc nc (ฮผc - ฮผ)(ฮผc - ฮผ)T (Between-class scatter)
SW = ฮฃc ฮฃiโˆˆc (xi - ฮผc)(xi - ฮผc)T (Within-class scatter)

Optimal: SW-1 SB w = ฮป w โ†’ generalized eigenproblem
6

Formula Derivations

6.1 Deriving PCA from Maximum Variance

We want to find a unit vector w such that the variance of the projected data is maximized.

Step 1: Set up the projection

Given centered data xฬƒi, the projection of point i onto direction w is: zi = wTxฬƒi

Step 2: Write the variance of the projection

Var(z) = (1/(n-1)) ฮฃi ziยฒ = (1/(n-1)) ฮฃi (wTxฬƒi)ยฒ = (1/(n-1)) wT (ฮฃi xฬƒixฬƒiT) w = wT C w

Step 3: Constrained optimization

Maximize wTCw subject to wTw = 1.

Using Lagrange multipliers: L = wTCw - ฮป(wTw - 1)

Step 4: Take the derivative and set to zero

โˆ‚L/โˆ‚w = 2Cw - 2ฮปw = 0

โŸน Cw = ฮปw

Step 5: Interpret the result

This is the eigenvalue equation! w must be an eigenvector of C. The variance along this direction is: wTCw = wTฮปw = ฮป. So the variance equals the eigenvalue, and to maximize variance, we pick the eigenvector with the largest eigenvalue. โˆŽ

Step 6: Second component

The second principal component maximizes variance subject to being orthogonal to the first: w2Tw1 = 0. By an analogous derivation with two Lagrange constraints, w2 is the eigenvector with the second largest eigenvalue. This pattern continues for all k components.

6.2 Deriving the t-SNE Gradient

The cost function is C = ฮฃiโ‰ j pij log(pij/qij).

Step 1: Rewrite using qij

Let dij = ||yi - yj||ยฒ and Z = ฮฃkโ‰ l(1 + dkl)-1. Then qij = (1 + dij)-1 / Z.

Step 2: Differentiate

โˆ‚C/โˆ‚yi = 2 ฮฃjโ‰ i (pij - qij) ยท 2(yi - yj) ยท (1 + dij)-1

= 4 ฮฃj (pij - qij)(yi - yj)(1 + ||yi - yj||ยฒ)-1

Interpretation: If pij > qij (points should be closer than they are), the gradient pulls yi toward yj. If pij < qij, it pushes them apart. The (1+dij)-1 term gives more weight to nearby points.

6.3 PCA via SVD Derivation

Start with the SVD: Xฬƒ = UฮฃVT. The covariance matrix:

C = (1/(n-1)) XฬƒTXฬƒ = (1/(n-1)) VฮฃTUTUฮฃVT = (1/(n-1)) Vฮฃ2VT

Comparing with C = Vฮ›VT, we get ฮ› = ฮฃ2/(n-1), i.e., eigenvalue ฮปi = ฯƒiยฒ/(n-1).

The projection: Z = XฬƒVk = UฮฃVTVk = Ukฮฃk (using orthogonality of V).

Key advantage: For n << d, computing the SVD of Xฬƒ (nร—d) is cheaper than the eigendecomposition of C (dร—d). Scikit-learn uses this approach.

7

Worked Numerical Examples

7.1 PCA by Hand: 3 Features โ†’ 2 Components

Given Data (5 samples, 3 features)

Samplexโ‚xโ‚‚xโ‚ƒ
A241
B463
C685
D8107
E10129

Step 1: Compute means

ฮผโ‚ = (2+4+6+8+10)/5 = 6, ฮผโ‚‚ = (4+6+8+10+12)/5 = 8, ฮผโ‚ƒ = (1+3+5+7+9)/5 = 5

Step 2: Center the data (subtract means)

Samplexฬƒโ‚xฬƒโ‚‚xฬƒโ‚ƒ
A-4-4-4
B-2-2-2
C000
D222
E444

Step 3: Compute covariance matrix C = (1/4) Xฬƒแต€ Xฬƒ

Xฬƒแต€Xฬƒ = [(-4)ยฒ+(-2)ยฒ+0+4+16, ...] Let's compute each entry:

ฮฃxฬƒโ‚ยฒ = 16+4+0+4+16 = 40 โ†’ Cโ‚โ‚ = 40/4 = 10

ฮฃxฬƒโ‚xฬƒโ‚‚ = 16+4+0+4+16 = 40 โ†’ Cโ‚โ‚‚ = 40/4 = 10

ฮฃxฬƒโ‚xฬƒโ‚ƒ = 16+4+0+4+16 = 40 โ†’ Cโ‚โ‚ƒ = 40/4 = 10

Similarly: Cโ‚‚โ‚‚ = 10, Cโ‚‚โ‚ƒ = 10, Cโ‚ƒโ‚ƒ = 10

Covariance Matrix
C = โŽก10 10 10โŽค โŽข10 10 10โŽฅ โŽฃ10 10 10โŽฆ

Step 4: Find eigenvalues

det(C - ฮปI) = 0. Since all rows are identical (rank 1 matrix), we know there are two zero eigenvalues.

trace(C) = 30 = sum of eigenvalues. Since rank = 1, we have ฮปโ‚ = 30, ฮปโ‚‚ = 0, ฮปโ‚ƒ = 0.

Verification: Cยท[1,1,1]แต€ = [30,30,30]แต€ = 30ยท[1,1,1]แต€ โœ“

Step 5: Find eigenvectors

For ฮปโ‚ = 30: vโ‚ = [1,1,1]แต€/โˆš3 = [0.577, 0.577, 0.577]แต€

For ฮปโ‚‚ = 0: any vector orthogonal to vโ‚, e.g., vโ‚‚ = [1,-1,0]แต€/โˆš2 = [0.707, -0.707, 0]แต€

For ฮปโ‚ƒ = 0: vโ‚ƒ = [1,1,-2]แต€/โˆš6 = [0.408, 0.408, -0.816]แต€

Step 6: Project onto top 2 components

W = [vโ‚ | vโ‚‚] (3ร—2 matrix). Z = Xฬƒ ยท W:

z_A = [-4, -4, -4] ยท W = [-4(0.577+0.577+0.577), -4(0.707-0.707+0)] = [-6.928, 0]

z_B = [-3.464, 0], z_C = [0, 0], z_D = [3.464, 0], z_E = [6.928, 0]

Step 7: Explained variance

EVRโ‚ = 30/30 = 100%. All variance is captured by PC1 (because the data lies on a perfect line in 3D).

7.2 Explained Variance โ€” A More Realistic Example

Suppose after computing PCA on a 10-feature dataset, the eigenvalues are:

ฮป = [5.8, 3.2, 1.5, 0.9, 0.6, 0.4, 0.3, 0.2, 0.07, 0.03]

Total = 13.0

EVR = [44.6%, 24.6%, 11.5%, 6.9%, 4.6%, 3.1%, 2.3%, 1.5%, 0.5%, 0.2%]

Cumulative = [44.6%, 69.2%, 80.8%, 87.7%, 92.3%, 95.4%, ...]

Decision: For 95% threshold, choose k = 6 components (reducing 10 โ†’ 6). For 90%, choose k = 5.

8

Visual Diagrams

Figure 16.1: PCA Geometric Interpretation โ€” Finding Axes of Maximum Variance Original 2D Data: After PCA Rotation: xโ‚‚ โ†‘ ยท ยท PC2 โ†‘ | ยท ยท ยท | | ยท ยทยท ยทยทยท | ยท ยท |ยท ยทยทยท ยทยท | ยท ยทยท ยทยท ยท |ยทยท ยท |ยทยท ยท ยท ยทยทยท ยทยทยทยท +----------โ†’ xโ‚ +------------------โ†’ PC1 (max variance) PC1 = direction of maximum spread (largest eigenvalue) PC2 = orthogonal to PC1 (smaller eigenvalue) If ฮปโ‚‚ โ‰ˆ 0, we can discard PC2 โ†’ 2D reduced to 1D!
Figure 16.2: The Curse of Dimensionality โ€” Volume of Hypersphere Ratio of Hypersphere to Hypercube Volume: 1.0 |โ–ˆ |โ–ˆ โ–ˆ |โ–ˆ โ–ˆ โ–ˆ 0.5 |โ–ˆ โ–ˆ โ–ˆ โ–ˆ |โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ |โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ 0.0 |โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ ยท ยท ยท ยท ยท +----------------------------------------โ†’ Dimensions 1 2 3 5 10 20 50 100 500 1000 d=1: ratio = 1.00 d=10: ratio = 0.0025 d=3: ratio = 0.52 d=100: ratio โ‰ˆ 10โปโทโฐ โŸน In high-D, almost ALL volume is in the CORNERS!
Figure 16.3: t-SNE vs PCA on Cluster Data PCA (Linear): t-SNE (Non-linear): PC2 โ†‘ yโ‚‚ โ†‘ | โ—โ—โ—โ—‹โ—‹โ—‹โ–ณโ–ณโ–ณ | โ—โ—โ— โ–ณโ–ณโ–ณ |โ—โ—โ—โ—‹โ—‹โ—‹โ–ณโ–ณโ–ณ | โ—โ—โ—โ—โ— โ–ณโ–ณโ–ณโ–ณ | โ—โ—โ—โ—‹โ—‹โ—‹โ–ณโ–ณ | โ—โ—โ— โ–ณโ–ณโ–ณ +------------โ†’ PC1 | | โ—‹โ—‹โ—‹ Clusters overlap in | โ—‹โ—‹โ—‹โ—‹โ—‹ linear projection | โ—‹โ—‹โ—‹ +------------โ†’ yโ‚ Clusters clearly separated! โ— = Class A โ—‹ = Class B โ–ณ = Class C
Figure 16.4: Scree Plot โ€” Choosing Number of Components Eigenvalue 6 |โ–ˆ |โ–ˆ 5 |โ–ˆ |โ–ˆ 4 |โ–ˆ |โ–ˆ โ–ˆ 3 |โ–ˆ โ–ˆ |โ–ˆ โ–ˆ 2 |โ–ˆ โ–ˆ โ–ˆ |โ–ˆ โ–ˆ โ–ˆ 1 |โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ† "elbow" here |โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ โ–ˆ 0 +--+--+--+--+--+--+--+--+--+--โ†’ PC1 PC2 PC3 PC4 PC5 PC6 PC7... Choose k at the "elbow" where eigenvalues flatten out. Here: k = 3 or 4 captures most variance. Cumulative: 44% 69% 81% 88% 92% 95% 97% ...
Figure 16.5: LDA vs PCA โ€” Different Objectives PCA (maximize total variance): LDA (maximize class separation): โ†‘ PC1 direction โ†‘ LD1 direction | โ—โ—โ— | โ—โ—โ— | โ—โ—โ—โ— | โ—โ—โ—โ— | โ—‹โ—‹โ—‹โ—โ— | | โ—‹โ—‹โ—‹โ—‹ | โ—‹โ—‹โ—‹โ—‹ | โ—‹โ—‹โ—‹ | โ—‹โ—‹โ—‹ +--โ†’ max spread +--โ†’ max between/within PCA might project onto a direction LDA separates classes even if where classes overlap! it doesn't maximize total variance.
9

Flowcharts

Flowchart 16.1: PCA Pipeline โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Raw Data X (n ร— d) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 1. Standardize โ”‚ z = (x - ฮผ) / ฯƒ โ”‚ (zero mean, โ”‚ (important if features โ”‚ unit variance) โ”‚ have different scales) โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 2. Compute Covarianceโ”‚ C = (1/(n-1)) Xฬƒแต€Xฬƒ โ”‚ Matrix C (d ร— d) โ”‚ OR use SVD directly โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 3. Eigendecompositionโ”‚ C vแตข = ฮปแตข vแตข โ”‚ (or SVD) โ”‚ Sort: ฮปโ‚ โ‰ฅ ฮปโ‚‚ โ‰ฅ ... โ‰ฅ ฮปd โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 4. Choose k โ”‚ - Scree plot (elbow) โ”‚ components โ”‚ - Cumulative EVR โ‰ฅ 95% โ”‚ โ”‚ - Kaiser rule (ฮปแตข > 1) โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 5. Project Data โ”‚ Z = Xฬƒ ยท W (n ร— k) โ”‚ W = [vโ‚|...|vk] โ”‚ W is d ร— k โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Reduced Data Z (nร—k)โ”‚ โ”‚ Use for: โ”‚ โ”‚ - Visualization โ”‚ โ”‚ - ML model input โ”‚ โ”‚ - Noise reduction โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
Flowchart 16.2: Choosing a Dimensionality Reduction Method โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Do you have class labels? โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ” YES NO โ”‚ โ”‚ โ–ผ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ LDA โ”‚ โ”‚ Is the goal โ”‚ โ”‚(max C-1 โ”‚ โ”‚ visualization only? โ”‚ โ”‚ dims) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ” YES NO โ”‚ โ”‚ โ–ผ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ n < 10,000? โ”‚ โ”‚ Linear enough? โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ” YES NO YES NO โ”‚ โ”‚ โ”‚ โ”‚ โ–ผ โ–ผ โ–ผ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚t-SNE โ”‚ โ”‚UMAP โ”‚ โ”‚ PCA โ”‚ โ”‚Kernel PCA โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚or UMAP โ”‚ โ”‚(best โ”‚ โ”‚(fast,โ”‚ โ”‚(fast,โ”‚ โ”‚ โ”‚ โ”‚local)โ”‚ โ”‚local+โ”‚ โ”‚exact)โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚globalโ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
Flowchart 16.3: t-SNE Algorithm Steps โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Input: High-D data X, n pts โ”‚ โ”‚ Hyperparams: perplexity, โ”‚ โ”‚ learning_rate, n_iter โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 1. Compute pairwise โ”‚ โ”‚ distances in high-D โ”‚ โ”‚ dยฒแตขโฑผ = ||xแตข - xโฑผ||ยฒ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 2. Binary search for ฯƒแตข โ”‚ โ”‚ to match target โ”‚ โ”‚ perplexity โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 3. Compute pแตขโฑผ (symmetrize)โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 4. Initialize low-D points โ”‚ โ”‚ yแตข โˆผ N(0, 10โปโดI) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 5. For iter = 1 to T: โ”‚ โ”‚ a) Compute qแตขโฑผ (t-dist) โ”‚ โ”‚ b) Compute gradient โ”‚ โ”‚ c) Update: y โ† y + ฮทโˆ‡ โ”‚ โ”‚ d) Apply momentum โ”‚ โ”‚ (Early exaggeration for โ”‚ โ”‚ first ~250 iterations) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Output: Low-D embedding Y โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
10

Python Implementation from Scratch

10.1 PCA Class โ€” Full Eigendecomposition

Python pca_from_scratch.py
import numpy as np
import matplotlib.pyplot as plt

class PCA:
    """
    Principal Component Analysis from scratch.
    
    Implements the eigendecomposition approach:
    1. Center data (optionally standardize)
    2. Compute covariance matrix
    3. Eigendecompose
    4. Project onto top-k eigenvectors
    """
    
    def __init__(self, n_components=2, standardize=True):
        """
        Parameters:
            n_components: int or float
                If int: number of components to keep
                If float (0-1): keep enough for this explained variance ratio
            standardize: bool
                If True, scale features to unit variance (recommended)
        """
        self.n_components = n_components
        self.standardize = standardize
        self.components_ = None       # Principal component directions (k x d)
        self.eigenvalues_ = None      # All eigenvalues
        self.explained_variance_ = None
        self.explained_variance_ratio_ = None
        self.mean_ = None
        self.std_ = None
    
    def fit(self, X):
        """Fit PCA on training data X (n_samples x n_features)."""
        n, d = X.shape
        
        # Step 1: Center (and optionally scale)
        self.mean_ = np.mean(X, axis=0)
        X_centered = X - self.mean_
        
        if self.standardize:
            self.std_ = np.std(X, axis=0, ddof=0)
            # Avoid division by zero for constant features
            self.std_[self.std_ == 0] = 1.0
            X_centered = X_centered / self.std_
        
        # Step 2: Covariance matrix (d x d)
        # Using (n-1) for unbiased estimator
        cov_matrix = (X_centered.T @ X_centered) / (n - 1)
        
        # Step 3: Eigenvalue decomposition
        # np.linalg.eigh is for symmetric matrices (faster, numerically stable)
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
        
        # Sort in descending order (eigh returns ascending)
        sorted_idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[sorted_idx]
        eigenvectors = eigenvectors[:, sorted_idx]
        
        # Clip tiny negative eigenvalues (numerical noise)
        eigenvalues = np.maximum(eigenvalues, 0)
        
        self.eigenvalues_ = eigenvalues
        self.explained_variance_ = eigenvalues
        total_var = np.sum(eigenvalues)
        self.explained_variance_ratio_ = eigenvalues / total_var if total_var > 0 else eigenvalues
        
        # Determine number of components
        if isinstance(self.n_components, float) and 0 < self.n_components < 1:
            cumulative = np.cumsum(self.explained_variance_ratio_)
            k = np.searchsorted(cumulative, self.n_components) + 1
            self.n_components = min(k, d)
        
        self.n_components = min(self.n_components, d)
        
        # Step 4: Select top-k eigenvectors
        # components_ shape: (k, d) โ€” each row is a principal direction
        self.components_ = eigenvectors[:, :self.n_components].T
        
        return self
    
    def transform(self, X):
        """Project data onto principal components."""
        X_centered = X - self.mean_
        if self.standardize:
            X_centered = X_centered / self.std_
        # Z = X_centered @ W where W = components_.T (d x k)
        return X_centered @ self.components_.T
    
    def fit_transform(self, X):
        """Fit and transform in one step."""
        self.fit(X)
        return self.transform(X)
    
    def inverse_transform(self, Z):
        """Reconstruct data from reduced representation."""
        X_approx = Z @ self.components_  # (n x k) @ (k x d) = (n x d)
        if self.standardize:
            X_approx = X_approx * self.std_
        X_approx = X_approx + self.mean_
        return X_approx
    
    def plot_scree(self, figsize=(10, 4)):
        """Plot scree diagram and cumulative explained variance."""
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)
        
        k = min(len(self.eigenvalues_), 20)  # Show up to 20
        x = range(1, k + 1)
        
        # Scree plot
        ax1.bar(x, self.eigenvalues_[:k], color='#059669', alpha=0.7)
        ax1.plot(x, self.eigenvalues_[:k], 'o-', color='#0891b2', linewidth=2)
        ax1.set_xlabel('Principal Component')
        ax1.set_ylabel('Eigenvalue')
        ax1.set_title('Scree Plot')
        ax1.set_xticks(x)
        ax1.grid(alpha=0.3)
        
        # Cumulative variance
        cumulative = np.cumsum(self.explained_variance_ratio_[:k]) * 100
        ax2.plot(x, cumulative, 'o-', color='#059669', linewidth=2, markersize=6)
        ax2.axhline(y=95, color='#f43f5e', linestyle='--', label='95% threshold')
        ax2.fill_between(x, cumulative, alpha=0.1, color='#059669')
        ax2.set_xlabel('Number of Components')
        ax2.set_ylabel('Cumulative Explained Variance (%)')
        ax2.set_title('Cumulative Explained Variance')
        ax2.set_xticks(x)
        ax2.legend()
        ax2.grid(alpha=0.3)
        
        plt.tight_layout()
        plt.show()


# ============== DEMO ==============
if __name__ == "__main__":
    # Generate sample data: 3 clusters in 5D
    np.random.seed(42)
    n = 200
    
    # 3 clusters with different centers
    c1 = np.random.randn(n, 5) + np.array([3, 3, 3, 0, 0])
    c2 = np.random.randn(n, 5) + np.array([-3, -3, -3, 0, 0])
    c3 = np.random.randn(n, 5) + np.array([0, 0, 0, 5, 5])
    X = np.vstack([c1, c2, c3])
    labels = np.array([0]*n + [1]*n + [2]*n)
    
    # Apply PCA
    pca = PCA(n_components=2, standardize=True)
    Z = pca.fit_transform(X)
    
    print(f"Original shape: {X.shape}")
    print(f"Reduced shape:  {Z.shape}")
    print(f"Explained variance ratios: {pca.explained_variance_ratio_[:5].round(4)}")
    print(f"Cumulative (2 components): {sum(pca.explained_variance_ratio_[:2]):.4f}")
    
    # Plot
    pca.plot_scree()
    
    plt.figure(figsize=(8, 6))
    colors = ['#059669', '#0891b2', '#f59e0b']
    for i, c in enumerate(colors):
        mask = labels == i
        plt.scatter(Z[mask, 0], Z[mask, 1], c=c, label=f'Cluster {i}', alpha=0.6, s=30)
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.title('PCA: 5D โ†’ 2D Projection')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.show()

10.2 PCA via SVD (Numerically Superior)

Python pca_svd.py
import numpy as np

class PCA_SVD:
    """PCA via Singular Value Decomposition โ€” avoids forming covariance matrix."""
    
    def __init__(self, n_components=2):
        self.n_components = n_components
    
    def fit(self, X):
        n, d = X.shape
        self.mean_ = X.mean(axis=0)
        X_centered = X - self.mean_
        
        # Truncated SVD: only compute top-k singular values/vectors
        # Full SVD: Xฬƒ = U ฮฃ Vแต€
        U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
        
        # Principal directions are rows of Vt
        self.components_ = Vt[:self.n_components]  # (k x d)
        
        # Eigenvalues of covariance = ฯƒยฒ/(n-1)
        self.explained_variance_ = (S ** 2) / (n - 1)
        total = self.explained_variance_.sum()
        self.explained_variance_ratio_ = self.explained_variance_ / total
        
        return self
    
    def transform(self, X):
        return (X - self.mean_) @ self.components_.T
    
    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

# Demo
np.random.seed(42)
X = np.random.randn(500, 50)  # 500 samples, 50 features
pca = PCA_SVD(n_components=5)
Z = pca.fit_transform(X)
print(f"Reduced: {X.shape} โ†’ {Z.shape}")
print(f"Top 5 EVR: {pca.explained_variance_ratio_[:5].round(4)}")

10.3 t-SNE Simplified Implementation

Python tsne_simplified.py
import numpy as np

def compute_pairwise_distances(X):
    """Compute squared Euclidean distances between all pairs."""
    sum_sq = np.sum(X ** 2, axis=1)
    # ||xi - xj||ยฒ = ||xi||ยฒ + ||xj||ยฒ - 2 xiยทxj
    D = sum_sq[:, np.newaxis] + sum_sq[np.newaxis, :] - 2 * X @ X.T
    return np.maximum(D, 0)  # Clip numerical negatives

def compute_perplexity_probabilities(D, target_perplexity=30.0, tol=1e-5, max_iter=50):
    """
    For each point, find ฯƒแตข via binary search so that
    Perplexity(Pแตข) = target_perplexity.
    Returns symmetrized joint probabilities P.
    """
    n = D.shape[0]
    P = np.zeros((n, n))
    target_entropy = np.log2(target_perplexity)
    
    for i in range(n):
        # Binary search for ฯƒแตข (actually searching over ฮฒ = 1/(2ฯƒยฒ))
        beta_min, beta_max = -np.inf, np.inf
        beta = 1.0
        
        Di = D[i].copy()
        Di[i] = np.inf  # Exclude self
        
        for _ in range(max_iter):
            # Compute conditional probabilities p(j|i)
            exp_D = np.exp(-Di * beta)
            exp_D[i] = 0
            sum_exp = np.sum(exp_D)
            
            if sum_exp == 0:
                P_i = np.zeros(n)
            else:
                P_i = exp_D / sum_exp
            
            # Compute entropy
            H = -np.sum(P_i[P_i > 0] * np.log2(P_i[P_i > 0]))
            
            # Check if we hit target
            diff = H - target_entropy
            if abs(diff) < tol:
                break
            
            # Update ฮฒ via binary search
            if diff > 0:  # Entropy too high โ†’ increase ฮฒ (decrease ฯƒ)
                beta_min = beta
                beta = beta * 2 if beta_max == np.inf else (beta + beta_max) / 2
            else:
                beta_max = beta
                beta = beta / 2 if beta_min == -np.inf else (beta + beta_min) / 2
        
        P[i] = P_i
    
    # Symmetrize: p_ij = (p_j|i + p_i|j) / (2n)
    P = (P + P.T) / (2 * n)
    P = np.maximum(P, 1e-12)  # Avoid log(0)
    return P

def tsne(X, n_components=2, perplexity=30.0, n_iter=500, lr=200.0, momentum=0.8):
    """
    Simplified t-SNE implementation.
    
    Parameters:
        X: ndarray (n_samples, n_features)
        n_components: output dimensions (2 or 3)
        perplexity: effective number of neighbors (5-50)
        n_iter: gradient descent iterations
        lr: learning rate
        momentum: momentum coefficient
    """
    n = X.shape[0]
    
    # Step 1: Pairwise distances in high-D
    D_high = compute_pairwise_distances(X)
    
    # Step 2: Compute high-D probabilities with target perplexity
    P = compute_perplexity_probabilities(D_high, target_perplexity=perplexity)
    
    # Early exaggeration (multiply P by 4 for first 100 iterations)
    P_exag = P * 4.0
    
    # Step 3: Initialize low-D embedding
    Y = np.random.randn(n, n_components) * 1e-4
    velocity = np.zeros_like(Y)
    
    for t in range(n_iter):
        # Use exaggerated P for early iterations
        P_use = P_exag if t < 100 else P
        
        # Pairwise distances in low-D
        D_low = compute_pairwise_distances(Y)
        
        # Compute q_ij using Student-t distribution
        Q_num = 1.0 / (1.0 + D_low)  # (1 + ||yi - yj||ยฒ)โปยน
        np.fill_diagonal(Q_num, 0)
        Q = Q_num / np.sum(Q_num)
        Q = np.maximum(Q, 1e-12)
        
        # Compute gradient
        # โˆ‚C/โˆ‚yแตข = 4 ฮฃโฑผ (pแตขโฑผ - qแตขโฑผ)(yแตข - yโฑผ)(1 + ||yแตข - yโฑผ||ยฒ)โปยน
        PQ_diff = P_use - Q
        
        # Efficient gradient computation
        grad = np.zeros_like(Y)
        for i in range(n):
            diff = Y[i] - Y  # (n, n_components)
            grad[i] = 4 * np.sum((PQ_diff[i, :, np.newaxis]) * diff * Q_num[i, :, np.newaxis], axis=0)
        
        # Update with momentum
        velocity = momentum * velocity - lr * grad
        Y += velocity
        
        # Re-center
        Y -= Y.mean(axis=0)
        
        if (t + 1) % 100 == 0:
            cost = np.sum(P_use * np.log(P_use / Q))
            print(f"  Iteration {t+1}/{n_iter}, KL divergence: {cost:.4f}")
    
    return Y

# Demo
if __name__ == "__main__":
    np.random.seed(42)
    # Create 3 clusters in 20D
    X = np.vstack([
        np.random.randn(50, 20) + np.array([5]*10 + [0]*10),
        np.random.randn(50, 20) + np.array([0]*10 + [5]*10),
        np.random.randn(50, 20) + np.array([-5]*10 + [-5]*10),
    ])
    
    print("Running t-SNE (this may take a moment)...")
    Y = tsne(X, n_components=2, perplexity=15, n_iter=500, lr=100)
    print(f"Embedding shape: {Y.shape}")
11

TensorFlow Implementation

11.1 PCA with TensorFlow (GPU-Accelerated)

TensorFlow pca_tensorflow.py
import tensorflow as tf
import numpy as np

class TF_PCA:
    """GPU-accelerated PCA using TensorFlow's SVD."""
    
    def __init__(self, n_components=2):
        self.n_components = n_components
    
    def fit_transform(self, X_np):
        """
        Fit PCA and return transformed data.
        Uses TF's SVD for GPU acceleration on large matrices.
        """
        X = tf.constant(X_np, dtype=tf.float64)
        
        # Center data
        mean = tf.reduce_mean(X, axis=0)
        X_centered = X - mean
        
        # SVD: Xฬƒ = U ฮฃ Vแต€
        S, U, V = tf.linalg.svd(X_centered, full_matrices=False)
        
        # Project onto top k right singular vectors
        V_k = V[:, :self.n_components]  # (d x k)
        Z = tf.matmul(X_centered, V_k)  # (n x k)
        
        # Explained variance
        n = tf.cast(tf.shape(X)[0], tf.float64)
        eigenvalues = S ** 2 / (n - 1)
        total = tf.reduce_sum(eigenvalues)
        evr = eigenvalues / total
        
        self.explained_variance_ratio_ = evr.numpy()
        self.components_ = tf.transpose(V_k).numpy()
        
        return Z.numpy()

# Demo
np.random.seed(42)
X = np.random.randn(1000, 100)
pca = TF_PCA(n_components=10)
Z = pca.fit_transform(X)
print(f"Shape: {X.shape} โ†’ {Z.shape}")
print(f"Top 5 EVR: {pca.explained_variance_ratio_[:5].round(4)}")

11.2 Autoencoder for Non-Linear Dimensionality Reduction

TensorFlow/Keras autoencoder_dimred.py
import tensorflow as tf
from tensorflow import keras
import numpy as np

def build_autoencoder(input_dim, encoding_dim=2, hidden_dims=[128, 64]):
    """
    Build a symmetric autoencoder.
    Encoder: input_dim โ†’ hidden1 โ†’ hidden2 โ†’ encoding_dim
    Decoder: encoding_dim โ†’ hidden2 โ†’ hidden1 โ†’ input_dim
    """
    # Encoder
    encoder_input = keras.Input(shape=(input_dim,))
    x = encoder_input
    for dim in hidden_dims:
        x = keras.layers.Dense(dim, activation='relu')(x)
        x = keras.layers.BatchNormalization()(x)
    encoded = keras.layers.Dense(encoding_dim, activation='linear', name='bottleneck')(x)
    
    # Decoder (mirror of encoder)
    x = encoded
    for dim in reversed(hidden_dims):
        x = keras.layers.Dense(dim, activation='relu')(x)
        x = keras.layers.BatchNormalization()(x)
    decoded = keras.layers.Dense(input_dim, activation='linear')(x)
    
    # Models
    autoencoder = keras.Model(encoder_input, decoded, name='autoencoder')
    encoder = keras.Model(encoder_input, encoded, name='encoder')
    
    autoencoder.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss='mse'
    )
    
    return autoencoder, encoder

# Example: MNIST dimensionality reduction
# Load data
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train = X_train.reshape(-1, 784).astype('float32') / 255.0
X_test = X_test.reshape(-1, 784).astype('float32') / 255.0

# Build and train
autoencoder, encoder = build_autoencoder(784, encoding_dim=2)
autoencoder.summary()

history = autoencoder.fit(
    X_train, X_train,
    epochs=30,
    batch_size=256,
    validation_data=(X_test, X_test),
    verbose=1
)

# Get 2D embeddings
Z_train = encoder.predict(X_train)
print(f"Embedding shape: {Z_train.shape}")  # (60000, 2)

# Visualize (compare with PCA and t-SNE!)
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
scatter = plt.scatter(Z_train[:5000, 0], Z_train[:5000, 1],
                      c=y_train[:5000], cmap='tab10', s=3, alpha=0.5)
plt.colorbar(scatter, label='Digit')
plt.title('Autoencoder: MNIST 784D โ†’ 2D')
plt.xlabel('Latent Dim 1')
plt.ylabel('Latent Dim 2')
plt.show()
12

Scikit-Learn Implementation

12.1 Complete PCA Pipeline with Visualization

Scikit-Learn sklearn_pca_pipeline.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_wine, load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# ============================================
# PART 1: PCA on Wine Dataset
# ============================================
wine = load_wine()
X, y = wine.data, wine.target
print(f"Wine dataset: {X.shape} โ†’ {len(wine.feature_names)} features")

# Pipeline: Scale โ†’ PCA
pipe_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2))
])
X_pca = pipe_pca.fit_transform(X)
pca = pipe_pca.named_steps['pca']

print(f"\nExplained Variance Ratio: {pca.explained_variance_ratio_.round(4)}")
print(f"Cumulative: {np.cumsum(pca.explained_variance_ratio_).round(4)}")

# Scree Plot
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Full PCA for scree plot
pca_full = PCA().fit(StandardScaler().fit_transform(X))
axes[0].bar(range(1, 14), pca_full.explained_variance_ratio_, color='#059669', alpha=0.7)
axes[0].plot(range(1, 14), pca_full.explained_variance_ratio_, 'o-', color='#0891b2')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Scree Plot')

# Cumulative variance
axes[1].plot(range(1, 14), np.cumsum(pca_full.explained_variance_ratio_), 
             'o-', color='#059669', linewidth=2)
axes[1].axhline(y=0.95, color='red', linestyle='--', label='95%')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Variance')
axes[1].legend()

# 2D Projection
colors = ['#059669', '#0891b2', '#f59e0b']
for i in range(3):
    mask = y == i
    axes[2].scatter(X_pca[mask, 0], X_pca[mask, 1], 
                    c=colors[i], label=wine.target_names[i], s=50, alpha=0.7)
axes[2].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
axes[2].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
axes[2].set_title('Wine: PCA 2D Projection')
axes[2].legend()
plt.tight_layout()
plt.show()

# Biplot: show feature contributions
def biplot(pca_model, X_pca, feature_names, y, ax):
    """Create a biplot showing both data points and feature loadings."""
    colors = ['#059669', '#0891b2', '#f59e0b']
    for i, c in enumerate(colors):
        mask = y == i
        ax.scatter(X_pca[mask, 0], X_pca[mask, 1], c=c, alpha=0.4, s=30)
    
    # Feature loadings (arrows)
    loadings = pca_model.components_.T  # (d x k)
    scale = 3  # Arrow scale factor
    for i, name in enumerate(feature_names):
        ax.arrow(0, 0, loadings[i, 0]*scale, loadings[i, 1]*scale,
                color='red', alpha=0.7, head_width=0.05)
        ax.text(loadings[i, 0]*scale*1.1, loadings[i, 1]*scale*1.1,
               name, fontsize=7, color='red')
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')
    ax.set_title('Biplot')
    ax.grid(alpha=0.3)

fig, ax = plt.subplots(figsize=(10, 8))
biplot(pca, X_pca, wine.feature_names, y, ax)
plt.tight_layout()
plt.show()

# ============================================
# PART 2: PCA for Classification Improvement
# ============================================
print("\n--- PCA + SVM Classification ---")
for k in [2, 4, 6, 8, 10, 13]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=k)),
        ('svm', SVC(kernel='rbf', C=10))
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"  k={k:2d}: accuracy = {scores.mean():.4f} ยฑ {scores.std():.4f}")

# ============================================
# PART 3: Kernel PCA
# ============================================
from sklearn.datasets import make_moons

X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Linear PCA
pca_linear = PCA(n_components=1)
X_lin = pca_linear.fit_transform(X_moons)
axes[0].scatter(X_lin, np.zeros_like(X_lin), c=y_moons, cmap='RdYlGn', s=30)
axes[0].set_title('Linear PCA (1D)')

# Kernel PCA with RBF
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X_moons)
axes[1].scatter(X_kpca[:, 0], X_kpca[:, 1], c=y_moons, cmap='RdYlGn', s=30)
axes[1].set_title('Kernel PCA (RBF, ฮณ=10)')

# Original data
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='RdYlGn', s=30)
axes[2].set_title('Original Moon Data')

plt.tight_layout()
plt.show()

12.2 t-SNE Pipeline with MNIST

Scikit-Learn sklearn_tsne_mnist.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import time

# Load digits dataset (8x8 images, 64 features, 10 classes)
digits = load_digits()
X, y = digits.data, digits.target
print(f"Digits dataset: {X.shape}, {len(np.unique(y))} classes")

# Scale data
X_scaled = StandardScaler().fit_transform(X)

# ============================================
# Strategy: PCA first (64โ†’30), then t-SNE (30โ†’2)
# This is the recommended approach for large datasets!
# ============================================
pca_pre = PCA(n_components=30)
X_pca30 = pca_pre.fit_transform(X_scaled)
print(f"PCA 64โ†’30: {sum(pca_pre.explained_variance_ratio_):.2%} variance retained")

# t-SNE with different perplexities
perplexities = [5, 15, 30, 50]
fig, axes = plt.subplots(1, 4, figsize=(24, 5))

for ax, perp in zip(axes, perplexities):
    start = time.time()
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42,
                n_iter=1000, learning_rate='auto', init='pca')
    X_tsne = tsne.fit_transform(X_pca30)
    elapsed = time.time() - start
    
    scatter = ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=8, alpha=0.7)
    ax.set_title(f'Perplexity = {perp}\n(KL = {tsne.kl_divergence_:.3f}, {elapsed:.1f}s)')
    ax.set_xticks([])
    ax.set_yticks([])

plt.suptitle('t-SNE: Effect of Perplexity on MNIST Digits', fontsize=14, fontweight='bold')
plt.colorbar(scatter, ax=axes, label='Digit', shrink=0.8)
plt.tight_layout()
plt.show()

# ============================================
# Comparison: PCA vs t-SNE
# ============================================
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# PCA
X_pca2 = PCA(n_components=2).fit_transform(X_scaled)
axes[0].scatter(X_pca2[:, 0], X_pca2[:, 1], c=y, cmap='tab10', s=8, alpha=0.5)
axes[0].set_title('PCA (2D)')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')

# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42,
            n_iter=1000, learning_rate='auto', init='pca')
X_tsne2 = tsne.fit_transform(X_pca30)
scatter = axes[1].scatter(X_tsne2[:, 0], X_tsne2[:, 1], c=y, cmap='tab10', s=8, alpha=0.5)
axes[1].set_title('t-SNE (2D)')
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')

plt.colorbar(scatter, ax=axes, label='Digit')
plt.suptitle('PCA vs t-SNE: MNIST Digits', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

12.3 UMAP Pipeline

Python + UMAP umap_pipeline.py
# pip install umap-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import time

try:
    import umap
    
    digits = load_digits()
    X, y = digits.data, digits.target
    X_scaled = StandardScaler().fit_transform(X)
    
    # UMAP with different n_neighbors
    n_neighbors_list = [5, 15, 50, 200]
    fig, axes = plt.subplots(1, 4, figsize=(24, 5))
    
    for ax, nn in zip(axes, n_neighbors_list):
        start = time.time()
        reducer = umap.UMAP(
            n_components=2,
            n_neighbors=nn,
            min_dist=0.1,
            metric='euclidean',
            random_state=42
        )
        X_umap = reducer.fit_transform(X_scaled)
        elapsed = time.time() - start
        
        scatter = ax.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=8, alpha=0.7)
        ax.set_title(f'n_neighbors = {nn}\n({elapsed:.1f}s)')
        ax.set_xticks([])
        ax.set_yticks([])
    
    plt.suptitle('UMAP: Effect of n_neighbors on MNIST', fontsize=14, fontweight='bold')
    plt.colorbar(scatter, ax=axes, label='Digit', shrink=0.8)
    plt.tight_layout()
    plt.show()
    
    # ============================================
    # Grand Comparison: PCA vs t-SNE vs UMAP
    # ============================================
    from sklearn.manifold import TSNE
    from sklearn.decomposition import PCA
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # PCA
    t0 = time.time()
    X_pca = PCA(n_components=2).fit_transform(X_scaled)
    axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=5, alpha=0.5)
    axes[0].set_title(f'PCA ({time.time()-t0:.3f}s)')
    
    # t-SNE
    t0 = time.time()
    X_tsne = TSNE(n_components=2, perplexity=30, random_state=42, init='pca').fit_transform(X_scaled)
    axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=5, alpha=0.5)
    axes[1].set_title(f't-SNE ({time.time()-t0:.1f}s)')
    
    # UMAP
    t0 = time.time()
    X_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X_scaled)
    scatter = axes[2].scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=5, alpha=0.5)
    axes[2].set_title(f'UMAP ({time.time()-t0:.1f}s)')
    
    plt.colorbar(scatter, ax=axes, label='Digit')
    plt.suptitle('Grand Comparison: PCA vs t-SNE vs UMAP', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

except ImportError:
    print("Install umap-learn: pip install umap-learn")

12.4 LDA Implementation

Scikit-Learn lda_pipeline.py
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import numpy as np

wine = load_wine()
X, y = wine.data, wine.target
X_scaled = StandardScaler().fit_transform(X)

# LDA: max components = C-1 = 2
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

# Compare PCA vs LDA
from sklearn.decomposition import PCA

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
colors = ['#059669', '#0891b2', '#f59e0b']
names = wine.target_names

for ax, X_proj, title in [(axes[0], PCA(2).fit_transform(X_scaled), 'PCA'),
                            (axes[1], X_lda, 'LDA')]:
    for i, (c, name) in enumerate(zip(colors, names)):
        mask = y == i
        ax.scatter(X_proj[mask, 0], X_proj[mask, 1], c=c, label=name, s=50, alpha=0.7)
    ax.set_title(f'{title} Projection (Wine Dataset)')
    ax.legend()
    ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Classification accuracy comparison
print("\n--- PCA vs LDA for Classification ---")
for method_name, reducer in [('PCA(2)', PCA(2)), ('LDA(2)', LDA(n_components=2))]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('reduce', reducer),
        ('svm', SVC(kernel='rbf'))
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"  {method_name}: {scores.mean():.4f} ยฑ {scores.std():.4f}")
13

Indian Case Studies

๐Ÿ‡ฎ๐Ÿ‡ณ Case Study 1: Aadhaar Biometric Dimensionality Reduction

Challenge

India's Aadhaar system stores biometrics for 1.4 billion people. Each fingerprint scan produces a high-dimensional minutiae template (~200+ features per finger ร— 10 fingers). Iris scans generate IrisCodes of 2,048 bits per eye. Storing and matching these at scale is computationally prohibitive.

Solution: Dimensionality Reduction in Biometric Matching

UIDAI employs several dimensionality reduction strategies:

  • Fingerprint templates: ISO/IEC 19794-2 minutiae extraction reduces raw fingerprint images (typically 300ร—400 pixels = 120K dimensions) to ~60-100 minutiae points (each with x, y, angle = 3 features), achieving roughly 400:1 reduction
  • IrisCode: Daugman's algorithm projects iris texture patterns onto a 2,048-bit binary code using Gabor wavelets โ€” a form of feature extraction that reduces ~300K pixel values to 256 bytes
  • Multi-modal fusion: PCA is used to combine fingerprint and iris features into a compact representation for 1:N identification (searching the entire database)

Scale

Over 100 billion biometric authentications processed. PCA-based indexing reduces search from 1.4B full template comparisons to ~100K candidates per query. Average authentication time: under 200ms.

Python Simulation

Python aadhaar_biometric_pca.py
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Simulate biometric data: 10,000 people, 200 features per person
np.random.seed(42)
n_people = 10000
n_features = 200  # Minutiae + IrisCode features

# Generate clustered data (each person = cluster of templates)
n_classes = 100  # Simulate 100 unique identities
samples_per_class = n_people // n_classes

X_biometric = []
y_identity = []
for person_id in range(n_classes):
    center = np.random.randn(n_features) * 3
    templates = center + np.random.randn(samples_per_class, n_features) * 0.5
    X_biometric.append(templates)
    y_identity.extend([person_id] * samples_per_class)

X = np.vstack(X_biometric)
y = np.array(y_identity)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Without PCA: KNN on full 200D
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

knn_full = KNeighborsClassifier(n_neighbors=3)
knn_full.fit(X_train_s, y_train)
acc_full = accuracy_score(y_test, knn_full.predict(X_test_s))

# With PCA: reduce to 20D
pca = PCA(n_components=20)
X_train_pca = pca.fit_transform(X_train_s)
X_test_pca = pca.transform(X_test_s)

knn_pca = KNeighborsClassifier(n_neighbors=3)
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print("=== Aadhaar Biometric Matching Simulation ===")
print(f"Full features (200D): accuracy = {acc_full:.4f}")
print(f"PCA reduced  ( 20D): accuracy = {acc_pca:.4f}")
print(f"Variance retained:   {sum(pca.explained_variance_ratio_):.2%}")
print(f"Dimensionality reduction: {200/20:.0f}x")
print(f"Search speedup (KNN): ~{(200/20):.0f}x faster distance computations")

๐Ÿ‡ฎ๐Ÿ‡ณ Case Study 2: ISRO Satellite Image Compression

Challenge

ISRO's Cartosat and Resourcesat satellites generate hyperspectral images with 200+ spectral bands per pixel. A single scene can be 10,000ร—10,000 pixels ร— 200 bands = 20 billion values. Downlinking this data to ground stations (bandwidth-limited) requires massive compression.

Solution: PCA-Based Spectral Compression

ISRO uses PCA to reduce spectral dimensionality:

  • 200+ spectral bands โ†’ 10-15 principal components (retaining 99%+ variance)
  • Each pixel: 200 values โ†’ 15 values = 13:1 compression
  • Combined with spatial compression (JPEG2000), achieves 100:1 total compression
  • Enables near-real-time crop monitoring under the Fasal programme
Python isro_spectral_pca.py
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Simulate hyperspectral image: 100x100 pixels, 200 bands
np.random.seed(42)
height, width, bands = 100, 100, 200
image = np.random.randn(height * width, bands)

# Add correlated structure (simulating real spectral correlations)
basis = np.random.randn(10, bands)  # 10 underlying spectral patterns
coefficients = np.random.randn(height * width, 10)
image = coefficients @ basis + np.random.randn(height * width, bands) * 0.1

# Apply PCA for spectral compression
pca = PCA(n_components=15)
compressed = pca.fit_transform(image)
reconstructed = pca.inverse_transform(compressed)

# Compute reconstruction quality
mse = np.mean((image - reconstructed) ** 2)
signal_power = np.mean(image ** 2)
psnr = 10 * np.log10(signal_power / mse) if mse > 0 else float('inf')

print("=== ISRO Spectral Compression Simulation ===")
print(f"Original: {height}x{width}x{bands} = {height*width*bands:,} values")
print(f"Compressed: {height}x{width}x{pca.n_components_} = {height*width*pca.n_components_:,} values")
print(f"Compression ratio: {bands/pca.n_components_:.1f}:1")
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.4%}")
print(f"Reconstruction PSNR: {psnr:.1f} dB")

๐Ÿ‡ฎ๐Ÿ‡ณ Case Study 3: Flipkart Product Embedding

Application

Flipkart manages a catalog of 150+ million products. Product descriptions are embedded into 768-dimensional vectors using BERT. For real-time similar-product recommendations, they use PCA to reduce embeddings from 768D to 128D, followed by approximate nearest neighbor search (FAISS). This reduces memory from 4.5 GB to 760 MB for 10M products while maintaining 95%+ recommendation quality.

14

Global Case Studies

๐ŸŒ Case Study 1: MNIST Visualization with t-SNE

The Iconic Visualization

The t-SNE visualization of MNIST handwritten digits (784 pixels โ†’ 2D) is perhaps the most famous use of dimensionality reduction in ML. Published by van der Maaten in 2008, it showed for the first time that t-SNE could separate all 10 digit classes into distinct, tight clusters โ€” something PCA completely fails to do.

Key insights from the MNIST t-SNE plot:

  • Digits 0, 1, 6 form well-separated clusters (visually distinct shapes)
  • Digits 3, 5, 8 often overlap (similar curved strokes)
  • Digits 4, 7, 9 partially overlap (shared angular features)
  • Within-cluster structure often reveals writing styles (e.g., 7 with/without crossbar)

๐ŸŒ Case Study 2: Genomics Data Reduction

The Challenge

A typical gene expression study measures ~20,000 genes across a few hundred patient samples. This is the extreme n << d regime where the curse of dimensionality is most severe.

Application: Cancer Subtype Discovery

The Cancer Genome Atlas (TCGA) project used PCA and t-SNE to discover cancer subtypes:

  • Breast cancer: PCA of 20,000 gene expressions revealed 4 molecular subtypes (Luminal A, Luminal B, HER2+, Basal-like) that had different prognoses and treatment responses
  • Single-cell RNA-seq: UMAP is now the standard for visualizing single-cell data (10,000-50,000 genes ร— 100,000+ cells). Tools like Scanpy and Seurat use UMAP by default

๐ŸŒ Case Study 3: Google โ€” Embedding Spaces at Scale

Application

Google's recommendation systems embed users and items into high-dimensional spaces (typically 64-512 dimensions). For serving recommendations at 8.5 billion search queries/day, they use:

  • Product Quantization (PQ): A form of dimensionality reduction that splits 128D vectors into 8 sub-vectors of 16D each, quantizing each sub-vector to 256 centroids. This compresses each embedding from 512 bytes to 8 bytes
  • ScaNN (Scalable Nearest Neighbors): Uses PCA as a preprocessing step for approximate nearest neighbor search across billions of embeddings

๐ŸŒ Case Study 4: Tesla Autopilot โ€” Sensor Fusion

Application

Tesla's 8 cameras produce ~1.2 billion pixels per second. The vision backbone reduces each camera's 1280ร—960ร—3 image (~3.7M dimensions) to a 256-dimensional feature vector using a neural network (a learned form of dimensionality reduction). These compact representations are fused across cameras and time for 3D scene understanding.

15

Startup Applications

๐Ÿงฌ Elucidata (India)

Uses UMAP on multi-omics data (genomics + proteomics + metabolomics) to help pharmaceutical companies identify drug targets. Their platform reduces 50,000+ molecular features to interpretable 2D visualizations that biologists can explore interactively.

๐ŸŽจ Runway ML (US)

Uses PCA and t-SNE in their "Concept Canvas" to let artists explore the latent space of generative models. Users can navigate the reduced space to find novel image generations. The 512D StyleGAN latent space is projected to 2D for interactive exploration.

๐Ÿ›ก๏ธ Darktrace (UK)

Their Enterprise Immune System uses PCA on network traffic features (100+ dimensions including packet sizes, timing patterns, protocols) to detect anomalies. Reducing to 10 principal components makes the real-time computation feasible across 10,000+ devices per enterprise.

๐Ÿ“Š SigTuple (India)

Uses dimensionality reduction on medical image features (blood cell morphology โ€” 200+ shape/color features) to assist pathologists. PCA reduces feature vectors before classification, enabling real-time analysis on low-power edge devices in rural health centers.

16

Government Applications

๐Ÿ‡ฎ๐Ÿ‡ณ DRDO โ€” Radar Signal Processing

DRDO's radar systems use PCA for clutter suppression. Radar returns contain signal (target) + clutter (ground, weather) + noise. PCA decomposes returns into principal components: the first few correspond to strong clutter, which can be removed. This is equivalent to applying a subspace projection filter, improving target detection by 10-15 dB.

๐Ÿ‡บ๐Ÿ‡ธ NSA โ€” Signal Intelligence

Communication signals contain thousands of features (frequency components, timing patterns, modulation characteristics). PCA and LDA are used to classify signal types and identify emitters. Dimensionality reduction enables real-time processing of intercepted communications.

๐Ÿ‡ฎ๐Ÿ‡ณ Census of India โ€” Demographic Analysis

India's census collects 100+ socioeconomic variables per household across 640+ districts. PCA identifies the key factors driving demographic variation: typically 5-7 principal components capture 80% of variance, corresponding to interpretable factors like urbanization, literacy, agricultural dependency, and industrial development.

๐Ÿ‡ช๐Ÿ‡บ ESA/Copernicus โ€” Earth Observation

The European Space Agency's Sentinel-2 satellites capture 13 spectral bands. PCA reduces these to 3 components for false-color visualization, enabling rapid assessment of vegetation health, water quality, and urban sprawl across the European Union.

17

Industry Applications

๐Ÿ’Š Pharma โ€” Drug Discovery

Molecular descriptors for drug candidates often number in the thousands (topological indices, electronic properties, binding affinities). PCA reduces these to 10-20 "chemical space" coordinates. Companies like AstraZeneca use this to navigate vast libraries of candidate molecules efficiently.

๐Ÿญ Manufacturing โ€” Process Control

A semiconductor fab monitors 500+ process variables per wafer. PCA-based multivariate statistical process control (MSPC) reduces these to ~20 principal components, detecting out-of-control conditions as deviations in the reduced space. This catches defects 3-5x faster than monitoring individual variables.

๐ŸŽต Spotify โ€” Audio Features

Spotify extracts 128 audio features from each song (tempo, key, energy, spectral features, MFCCs). PCA and autoencoders compress these into 32-dimensional "taste profiles" for users and songs. The Discover Weekly algorithm matches users to songs by proximity in this reduced space.

๐Ÿ’ฐ Finance โ€” Risk Factor Models

Portfolio managers track thousands of assets. PCA on the correlation matrix of returns identifies the "risk factors" driving market movements. The first principal component typically captures 30-40% of variance (the "market factor"), followed by sector rotations, value/growth, and size factors โ€” the foundation of factor investing.

18

Mini Projects

๐ŸŽฏ Mini Project 1: Eigenfaces โ€” Face Recognition with PCA

Build a face recognition system using PCA (the "Eigenfaces" method, Turk & Pentland, 1991).

Python eigenfaces.py
"""
Mini Project 1: Eigenfaces for Face Recognition
=================================================
Uses PCA to reduce face image dimensionality, then KNN for classification.
The principal components (eigenfaces) capture the key modes of variation
in human faces: lighting, expression, identity.
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load Labeled Faces in the Wild dataset
# Only keep people with โ‰ฅ 50 images for reliable training
print("Loading LFW dataset (this may download ~200MB on first run)...")
lfw = fetch_lfw_people(min_faces_per_person=50, resize=0.5)
X, y = lfw.data, lfw.target
names = lfw.target_names
n_samples, n_features = X.shape
h, w = lfw.images.shape[1], lfw.images.shape[2]

print(f"Dataset: {n_samples} images, {n_features} pixels each ({h}x{w})")
print(f"Classes: {len(names)} people: {names}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Step 1: PCA (finding Eigenfaces)
n_components = 100  # Keep 100 principal components
pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True, random_state=42)
pca.fit(X_train)

# Transform data
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

print(f"\nDimensionality: {n_features} โ†’ {n_components}")
print(f"Variance explained: {sum(pca.explained_variance_ratio_):.2%}")

# Visualize top eigenfaces
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i, ax in enumerate(axes.flat):
    eigenface = pca.components_[i].reshape(h, w)
    ax.imshow(eigenface, cmap='RdBu_r')
    ax.set_title(f'PC {i+1}\n({pca.explained_variance_ratio_[i]:.1%})')
    ax.axis('off')
plt.suptitle('Top 10 Eigenfaces', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Step 2: SVM Classification on PCA-reduced features
svm = SVC(kernel='rbf', C=10, gamma='scale', class_weight='balanced')
svm.fit(X_train_pca, y_train)
y_pred = svm.predict(X_test_pca)

print("\n" + classification_report(y_test, y_pred, target_names=names))

# Step 3: Reconstruction quality at different k
fig, axes = plt.subplots(2, 6, figsize=(18, 6))
sample = X_test[0]  # Original face

axes[0, 0].imshow(sample.reshape(h, w), cmap='gray')
axes[0, 0].set_title('Original')
axes[0, 0].axis('off')

for i, k in enumerate([5, 10, 25, 50, 100]):
    pca_k = PCA(n_components=k).fit(X_train)
    reconstructed = pca_k.inverse_transform(pca_k.transform([sample]))[0]
    axes[0, i+1].imshow(reconstructed.reshape(h, w), cmap='gray')
    axes[0, i+1].set_title(f'k={k} ({sum(pca_k.explained_variance_ratio_):.0%})')
    axes[0, i+1].axis('off')

# Show reconstruction error
for i, k in enumerate([5, 10, 25, 50, 100]):
    pca_k = PCA(n_components=k).fit(X_train)
    recon = pca_k.inverse_transform(pca_k.transform([sample]))[0]
    error = np.abs(sample - recon)
    axes[1, i+1].imshow(error.reshape(h, w), cmap='hot')
    axes[1, i+1].set_title(f'Error (k={k})')
    axes[1, i+1].axis('off')

axes[1, 0].axis('off')
plt.suptitle('Face Reconstruction at Different PCA Dimensions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

๐ŸŽฏ Mini Project 2: High-Dimensional Data Explorer

Build an interactive tool that applies PCA, t-SNE, and UMAP to any dataset and compares results side-by-side.

Python dim_reduction_explorer.py
"""
Mini Project 2: High-Dimensional Data Explorer
=================================================
A comprehensive tool for comparing dimensionality reduction methods.
Generates publication-quality comparison plots.
"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits, load_wine, load_breast_cancer
import time

class DimReductionExplorer:
    """Compares PCA, Kernel PCA, t-SNE, and (optionally) UMAP."""
    
    def __init__(self, X, y, feature_names=None, class_names=None, dataset_name="Dataset"):
        self.X_raw = X
        self.y = y
        self.feature_names = feature_names
        self.class_names = class_names
        self.dataset_name = dataset_name
        self.scaler = StandardScaler()
        self.X = self.scaler.fit_transform(X)
        self.results = {}
    
    def run_all(self, tsne_perplexity=30, kpca_kernel='rbf', kpca_gamma=None):
        """Run all dimensionality reduction methods."""
        print(f"=== {self.dataset_name}: {self.X.shape[0]} samples, {self.X.shape[1]} features ===\n")
        
        # PCA
        t0 = time.time()
        pca = PCA(n_components=2)
        self.results['PCA'] = {
            'embedding': pca.fit_transform(self.X),
            'time': time.time() - t0,
            'model': pca
        }
        print(f"PCA:        {time.time()-t0:.3f}s | EVR: {sum(pca.explained_variance_ratio_):.2%}")
        
        # Kernel PCA
        t0 = time.time()
        kpca = KernelPCA(n_components=2, kernel=kpca_kernel, gamma=kpca_gamma)
        self.results['Kernel PCA'] = {
            'embedding': kpca.fit_transform(self.X),
            'time': time.time() - t0,
            'model': kpca
        }
        print(f"Kernel PCA: {time.time()-t0:.3f}s")
        
        # t-SNE
        t0 = time.time()
        tsne = TSNE(n_components=2, perplexity=tsne_perplexity, random_state=42, init='pca')
        self.results['t-SNE'] = {
            'embedding': tsne.fit_transform(self.X),
            'time': time.time() - t0,
            'model': tsne
        }
        print(f"t-SNE:      {time.time()-t0:.1f}s | KL div: {tsne.kl_divergence_:.4f}")
        
        # UMAP (optional)
        try:
            import umap
            t0 = time.time()
            reducer = umap.UMAP(n_components=2, random_state=42)
            self.results['UMAP'] = {
                'embedding': reducer.fit_transform(self.X),
                'time': time.time() - t0,
                'model': reducer
            }
            print(f"UMAP:       {time.time()-t0:.1f}s")
        except ImportError:
            print("UMAP: skipped (pip install umap-learn)")
    
    def plot_comparison(self, figsize=(20, 5)):
        """Plot all methods side by side."""
        methods = list(self.results.keys())
        n_methods = len(methods)
        
        fig, axes = plt.subplots(1, n_methods, figsize=(5*n_methods, 5))
        if n_methods == 1:
            axes = [axes]
        
        cmap = 'tab10' if len(np.unique(self.y)) <= 10 else 'tab20'
        
        for ax, method in zip(axes, methods):
            Z = self.results[method]['embedding']
            t = self.results[method]['time']
            scatter = ax.scatter(Z[:, 0], Z[:, 1], c=self.y, cmap=cmap,
                               s=10, alpha=0.6, edgecolors='none')
            ax.set_title(f'{method}\n({t:.2f}s)', fontsize=12, fontweight='bold')
            ax.set_xticks([])
            ax.set_yticks([])
        
        plt.colorbar(scatter, ax=axes, label='Class', shrink=0.8)
        plt.suptitle(f'{self.dataset_name}: Dimensionality Reduction Comparison',
                    fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
    
    def plot_pca_analysis(self, max_components=None):
        """Detailed PCA analysis: scree plot, cumulative variance, loadings."""
        if max_components is None:
            max_components = min(self.X.shape)
        
        pca_full = PCA(n_components=max_components).fit(self.X)
        
        fig, axes = plt.subplots(1, 3, figsize=(18, 5))
        k = min(max_components, 20)
        
        # Scree plot
        axes[0].bar(range(1, k+1), pca_full.explained_variance_ratio_[:k],
                   color='#059669', alpha=0.7)
        axes[0].set_xlabel('Component')
        axes[0].set_ylabel('Explained Variance Ratio')
        axes[0].set_title('Scree Plot')
        
        # Cumulative
        cum = np.cumsum(pca_full.explained_variance_ratio_[:k]) * 100
        axes[1].plot(range(1, k+1), cum, 'o-', color='#059669', linewidth=2)
        axes[1].axhline(y=95, color='red', linestyle='--', label='95%')
        axes[1].axhline(y=90, color='orange', linestyle='--', label='90%')
        axes[1].set_xlabel('Number of Components')
        axes[1].set_ylabel('Cumulative Variance (%)')
        axes[1].set_title('Cumulative Explained Variance')
        axes[1].legend()
        
        # Top features for PC1 and PC2
        if self.feature_names is not None:
            loadings = pca_full.components_[:2]
            top_features_pc1 = np.argsort(np.abs(loadings[0]))[-5:]
            axes[2].barh(range(5), loadings[0][top_features_pc1], color='#059669')
            axes[2].set_yticks(range(5))
            axes[2].set_yticklabels([self.feature_names[i] for i in top_features_pc1])
            axes[2].set_title('Top 5 Feature Loadings (PC1)')
        
        plt.suptitle(f'{self.dataset_name}: PCA Analysis', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()

# ============================================
# DEMO: Run on multiple datasets
# ============================================
if __name__ == "__main__":
    # Dataset 1: Wine
    wine = load_wine()
    explorer = DimReductionExplorer(
        wine.data, wine.target,
        feature_names=wine.feature_names,
        class_names=wine.target_names,
        dataset_name="Wine Dataset"
    )
    explorer.run_all()
    explorer.plot_comparison()
    explorer.plot_pca_analysis()
    
    # Dataset 2: Digits
    digits = load_digits()
    explorer2 = DimReductionExplorer(
        digits.data, digits.target,
        dataset_name="Digits Dataset (64D)"
    )
    explorer2.run_all(tsne_perplexity=30)
    explorer2.plot_comparison()

๐ŸŽฏ Mini Project 3: Image Compression with PCA

Compress images by applying PCA to image patches and reconstructing at different compression ratios.

Python image_compression_pca.py
"""
Mini Project 3: Image Compression using PCA
=============================================
Treats each row of the image as a sample with width features.
PCA extracts principal components = dominant patterns across rows.
"""
import numpy as np
import matplotlib.pyplot as plt

# Create a sample grayscale image (or load your own)
from sklearn.datasets import load_sample_image

try:
    china = load_sample_image("china.jpg")
    # Convert to grayscale
    img = np.mean(china, axis=2)  # Average RGB channels
except:
    # Fallback: create a synthetic image
    x = np.linspace(0, 4*np.pi, 200)
    y = np.linspace(0, 4*np.pi, 300)
    X, Y = np.meshgrid(x, y)
    img = np.sin(X) * np.cos(Y) + np.random.randn(300, 200) * 0.1

print(f"Image shape: {img.shape}")
print(f"Original size: {img.nbytes / 1024:.1f} KB")

# Apply PCA with different numbers of components
from sklearn.decomposition import PCA

fig, axes = plt.subplots(2, 4, figsize=(20, 10))
k_values = [1, 5, 10, 25, 50, 100, 200, img.shape[1]]

for ax, k in zip(axes.flat, k_values):
    k = min(k, min(img.shape))
    pca = PCA(n_components=k)
    compressed = pca.fit_transform(img)     # (height x k)
    reconstructed = pca.inverse_transform(compressed)  # (height x width)
    
    # Compression ratio
    original_size = img.shape[0] * img.shape[1]
    compressed_size = compressed.shape[0] * compressed.shape[1] + pca.components_.size
    ratio = original_size / compressed_size
    mse = np.mean((img - reconstructed) ** 2)
    
    ax.imshow(reconstructed, cmap='gray')
    ax.set_title(f'k={k}\nRatio: {ratio:.1f}x, MSE: {mse:.2f}')
    ax.axis('off')

plt.suptitle('Image Compression with PCA', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
19

End-of-Chapter Exercises

Conceptual Questions

Exercise 16.1

Explain the curse of dimensionality using the "volume of hypersphere" argument. Show mathematically that Vsphere/Vcube โ†’ 0 as d โ†’ โˆž.

Exercise 16.2

Why must we center the data before applying PCA? What happens if we don't? Prove that the mean of the projected data along each principal component is zero.

Exercise 16.3

Explain why PCA is sensitive to the scale of features. Give an example where standardization changes the principal components dramatically.

Exercise 16.4

Derive the reconstruction error formula for PCA: E = ฮฃi=k+1d ฮปi. Show that discarding the smallest eigenvalues minimizes the MSE.

Exercise 16.5

Explain the crowding problem in SNE and how t-SNE solves it using a heavy-tailed t-distribution. Why is KL divergence asymmetric, and how does this affect the embedding?

Exercise 16.6

Compare PCA, t-SNE, and UMAP along five dimensions: (1) linearity, (2) computational complexity, (3) preservation of global vs. local structure, (4) out-of-sample extension, (5) determinism.

Exercise 16.7

Explain why LDA is limited to at most C-1 components (where C is the number of classes). What happens if SW is singular?

Computational Exercises

Exercise 16.8

Given the data matrix X = [[1,2],[3,4],[5,6],[7,8]], compute PCA by hand: (a) center the data, (b) compute the covariance matrix, (c) find eigenvalues and eigenvectors, (d) project onto PC1.

Exercise 16.9

For the covariance matrix C = [[4, 2], [2, 3]], find the eigenvalues, eigenvectors, and explain what the principal components represent geometrically.

Exercise 16.10

Implement Kernel PCA from scratch using the RBF kernel. Test it on the make_moons dataset and compare with linear PCA.

Exercise 16.11

Write a function that computes the optimal number of PCA components using: (a) the 95% variance threshold, (b) Kaiser's rule (ฮป > 1 for standardized data), (c) the elbow method (maximum curvature of scree plot).

Exercise 16.12

Implement the biplot function from scratch. For the Wine dataset, create a biplot and interpret which features drive each principal component.

Programming Projects

Exercise 16.13

Apply PCA to the MNIST dataset. Plot accuracy of a KNN classifier vs. number of PCA components (k = 1, 2, 5, 10, 20, 50, 100, 200, 784). Find the optimal k.

Exercise 16.14

Run t-SNE on MNIST with perplexities [5, 10, 30, 50, 100] and learning rates [10, 50, 200, 1000]. Create a 5ร—4 grid of visualizations. Which hyperparameter has more impact?

Exercise 16.15

Compare PCA + Logistic Regression vs. LDA + Logistic Regression on three datasets (Wine, Iris, Breast Cancer). Report accuracy, training time, and number of components used.

Exercise 16.16

Implement incremental PCA for a dataset that doesn't fit in memory. Use sklearn's IncrementalPCA with batch_size=100 on a 100,000-sample dataset. Compare results with standard PCA.

Exercise 16.17

Apply UMAP with different metrics (euclidean, cosine, manhattan) to the 20 Newsgroups text dataset (TF-IDF features). Which metric produces the best cluster separation?

Research-Oriented Exercises

Exercise 16.18

Implement Sparse PCA using sklearn and compare the resulting components with standard PCA. When and why would sparse components be preferred?

Exercise 16.19

The Johnson-Lindenstrauss lemma states that random projections can approximately preserve distances. Implement a random projection and compare with PCA for nearest-neighbor queries. At what dimensionality do they converge?

Exercise 16.20

Implement Probabilistic PCA (PPCA) which models data as X = Wz + ฮผ + ฮต where z ~ N(0, I) and ฮต ~ N(0, ฯƒยฒI). Show how it handles missing data gracefully, unlike standard PCA.

Exercise 16.21

Compare the stability of t-SNE and UMAP embeddings across different random seeds. Run each 10 times and quantify consistency using Procrustes distance. Which is more reproducible?

Exercise 16.22

Build an "embedding space interpolation" tool: given two images, project them to PCA space, interpolate between their PCA coordinates, and reconstruct intermediate images. Apply to faces and digits.

20

Multiple Choice Questions

Q1: What do the eigenvalues of the covariance matrix represent in PCA?
  • A) The mean of each feature
  • B) The variance explained by each principal component
  • C) The correlation between features
  • D) The number of samples
โœ… B) Each eigenvalue ฮปแตข equals the variance of the data projected onto eigenvector vแตข. Var(z) = wแต€Cw = wแต€ฮปw = ฮป (when w is unit length).
Q2: If a dataset has d features and n samples (n < d), what is the maximum number of non-zero eigenvalues of the covariance matrix?
  • A) d
  • B) n
  • C) n - 1
  • D) min(n, d)
โœ… C) After centering, the data lies in at most an (n-1)-dimensional subspace. The rank of the centered data matrix is at most n-1, so the covariance matrix has at most n-1 non-zero eigenvalues.
Q3: What distribution does t-SNE use in the low-dimensional space?
  • A) Gaussian distribution
  • B) Student's t-distribution with 1 degree of freedom (Cauchy)
  • C) Uniform distribution
  • D) Exponential distribution
โœ… B) The Student's t-distribution with 1 DoF (Cauchy) has heavier tails than a Gaussian. This solves the crowding problem by allowing moderate distances in high-D to map to large distances in low-D without excessive penalty.
Q4: What does the perplexity parameter in t-SNE approximately control?
  • A) The output dimensionality
  • B) The learning rate
  • C) The effective number of neighbors considered
  • D) The number of iterations
โœ… C) Perplexity = 2^H(P_i) where H is the entropy of the conditional distribution. A perplexity of 30 means each point effectively has ~30 neighbors. Typical range: 5-50.
Q5: Why does UMAP generally outperform t-SNE in preserving global structure?
  • A) UMAP uses a Gaussian distribution
  • B) UMAP uses more iterations
  • C) UMAP minimizes cross-entropy (not KL divergence), which penalizes both false neighbors and missing neighbors
  • D) UMAP always uses more neighbors
โœ… C) KL(P||Q) only penalizes cases where p_ij > q_ij (nearby points placed far apart). Cross-entropy also penalizes q_ij > 0 when w_ij โ‰ˆ 0 (distant points placed close together), helping preserve global structure.
Q6: What is the maximum number of discriminant components LDA can produce for a C-class problem?
  • A) C
  • B) C - 1
  • C) d (number of features)
  • D) min(n, d)
โœ… B) The between-class scatter matrix S_B has rank at most C-1 (since it's the sum of C rank-1 matrices minus one constraint). Therefore S_WโปยนS_B has at most C-1 non-zero eigenvalues.
Q7: In the SVD of centered data Xฬƒ = UฮฃVแต€, the principal components directions correspond to:
  • A) Columns of U
  • B) Diagonal of ฮฃ
  • C) Columns of V (rows of Vแต€)
  • D) Rows of U
โœ… C) V contains the right singular vectors, which are the eigenvectors of Xฬƒแต€Xฬƒ (proportional to the covariance matrix). The eigenvalues are ฮปแตข = ฯƒแตขยฒ/(n-1).
Q8: Which statement about Kernel PCA is FALSE?
  • A) It can capture non-linear relationships
  • B) It uses the kernel trick to avoid explicit mapping
  • C) It is always faster than linear PCA
  • D) It works with an nร—n kernel matrix instead of dร—d covariance
โœ… C) Kernel PCA requires computing and eigendecomposing an nร—n kernel matrix (O(nยณ)), which is slower than linear PCA's O(dยณ) or O(ndยฒ) when n > d. It's only preferable when data has non-linear structure.
Q9: What is the primary advantage of feature selection over feature extraction?
  • A) Better compression ratio
  • B) Preserved interpretability of original features
  • C) Lower information loss
  • D) Works with any data type
โœ… B) Feature selection keeps original features intact, so a doctor can still interpret "blood glucose > 126" rather than "PC3 > 1.5". Feature extraction (PCA, t-SNE) creates abstract combinations that are mathematically optimal but hard to interpret.
Q10: The Kaiser criterion for selecting principal components recommends keeping components with:
  • A) Explained variance ratio > 5%
  • B) Eigenvalue > 1 (for standardized data)
  • C) Cumulative variance > 50%
  • D) Loading > 0.5
โœ… B) For standardized data, each original feature has variance 1. A principal component with eigenvalue > 1 explains more variance than any single original feature, making it worth keeping. This criterion works well for moderate-dimensional data (d < 50).
Q11: Why is it recommended to apply PCA before t-SNE for very high-dimensional data?
  • A) t-SNE requires linear data
  • B) PCA improves cluster separation
  • C) PCA reduces noise and speeds up t-SNE's O(nยฒ) distance computation
  • D) PCA makes the data Gaussian
โœ… C) t-SNE has O(nยฒd) complexity for computing pairwise distances. PCA as a preprocessing step (e.g., 784D โ†’ 50D) reduces d by 15x, dramatically speeding up t-SNE while also removing noisy dimensions that could confuse the embedding.
Q12: What is the "whitening" transformation in PCA?
  • A) Removing outliers
  • B) Normalizing to [0, 1]
  • C) Scaling principal components to have unit variance (Zฬƒ = Zฮ›^(-1/2))
  • D) Removing correlated features
โœ… C) Whitening divides each PC by โˆšฮปแตข, making the transformed data have identity covariance matrix. This is useful for algorithms that assume uncorrelated, unit-variance features (e.g., ICA, some neural networks).
21

Interview Questions

IQ 16.1 โ€” Conceptual (Google, Amazon)

Q: Explain PCA to a non-technical stakeholder. When would you use it and when would you avoid it?

Model Answer: PCA finds the most important "directions" in your data. Imagine a cloud of data points in 3D โ€” PCA finds the best 2D plane to project them onto while losing the least information, like taking the best possible photograph of a 3D object. Use it when: features are correlated, you need to speed up ML models, or reduce storage. Avoid it when: feature interpretability is critical (e.g., regulatory compliance), data has strong non-linear structure, or when the number of features is already small.

IQ 16.2 โ€” Technical (Microsoft, Meta)

Q: What's the computational complexity of PCA? How does SVD help with the n >> d and n << d cases?

Model Answer: Eigendecomposition of dร—d covariance matrix: O(dยณ). But forming the covariance matrix costs O(ndยฒ). SVD of nร—d data matrix costs O(min(nยฒd, ndยฒ)). When n >> d: SVD costs O(ndยฒ), same as eigendecomposition route. When n << d (e.g., genomics): SVD is O(nยฒd), much cheaper than the O(dยณ) eigendecomposition. Randomized SVD (used in sklearn) can further reduce to O(ndk) for k components.

IQ 16.3 โ€” System Design (Flipkart, Netflix)

Q: You're building a product recommendation system with 1 million products, each represented as a 768-dimensional BERT embedding. How would you enable real-time nearest-neighbor search?

Model Answer: (1) PCA: reduce 768D โ†’ 128D (retaining ~95% variance, 6x memory reduction). (2) Product Quantization: split 128D into 8 sub-vectors of 16D, quantize each to 256 centroids = 8 bytes/product. Total memory: 1M ร— 8 bytes = 8 MB (vs 3 GB original). (3) Use FAISS or ScaNN for ANN search. (4) For production: build an IVF (Inverted File) index with ~1000 partitions. Query time: ~1ms (vs ~1s for brute force). PCA is the critical first step that makes everything else feasible.

IQ 16.4 โ€” Deep Technical (OpenAI, DeepMind)

Q: Why is t-SNE non-parametric (can't transform new points), while UMAP can? How does parametric t-SNE solve this?

Model Answer: t-SNE optimizes positions y_i directly via gradient descent โ€” there's no learned function f(x) โ†’ y. To embed a new point x_new, you'd need to re-run the entire optimization. UMAP, however, learns a graph structure that can be used to map new points by finding their nearest neighbors in the training set and interpolating. Parametric t-SNE (Sainburg et al., 2021) uses a neural network to learn f(x) โ†’ y, training it with the same KL divergence objective. This gives the best of both worlds: t-SNE quality + out-of-sample capability.

IQ 16.5 โ€” Applied (ISRO, DRDO)

Q: How would you apply PCA to hyperspectral satellite images for crop classification?

Model Answer: (1) Stack all pixels as rows, spectral bands as columns (Nร—B matrix). (2) Standardize each band. (3) Apply PCA to reduce B=200 bands to k=15 components (capturing 99%+ variance). (4) Feed reduced features to a Random Forest or SVM classifier with training labels (crop type). Key insight: spectral bands are highly correlated (neighboring wavelengths measure similar properties), making PCA extremely effective. Additional trick: use "MNF transform" (Minimum Noise Fraction), a noise-aware variant of PCA that orders components by signal-to-noise ratio rather than variance.

IQ 16.6 โ€” Tricky (Amazon, Bloomberg)

Q: Can you use PCA for feature selection? What's the difference?

Model Answer: PCA does feature extraction, not selection. PCs are linear combinations of ALL original features โ€” you can't remove any single feature. However, you can use PCA loadings to identify which features contribute most to the top PCs, then keep only those features (this is "PCA-guided feature selection"). Also, Sparse PCA (which adds an L1 penalty) produces components that use only a few original features, bridging the gap between extraction and selection.

IQ 16.7 โ€” ML Engineering (Uber, Airbnb)

Q: You applied t-SNE and got beautiful clusters, but your classifier accuracy didn't improve. Why?

Model Answer: t-SNE is optimized for visualization (preserving local neighborhoods), not for preserving distances or linear separability. Several issues: (1) t-SNE distorts distances โ€” close clusters might actually be far apart and vice versa. (2) t-SNE is non-parametric โ€” you can't apply the same transformation to test data. (3) The 2D embedding discards information that a classifier could use. For classification, use PCA (preserves variance) or LDA (maximizes class separation) instead. t-SNE is a diagnostic tool, not a preprocessing step.

IQ 16.8 โ€” Statistics (Goldman Sachs, JP Morgan)

Q: In finance, PCA applied to the correlation matrix of stock returns typically shows the first PC explaining 30-40% of variance. What does this PC represent?

Model Answer: The first PC is the "market factor" โ€” it has roughly equal positive loadings on all stocks and represents broad market movements (risk-on/risk-off sentiment). PC2 often represents sector rotation (e.g., growth vs. value). PC3 might capture size effects. This is the basis of factor models (APT, Fama-French). Traders use PCA to decompose portfolio risk: the residual variance after removing the top PCs represents idiosyncratic/stock-specific risk.

IQ 16.9 โ€” ML Theory (Research Labs)

Q: Prove that PCA minimizes reconstruction error. What's the connection between maximum variance and minimum error?

Model Answer: Total variance = ฮฃ ฮปแตข. Keeping k components preserves ฮฃ_{i=1}^k ฮปแตข variance. Reconstruction error = ฮฃ_{i=k+1}^d ฮปแตข = Total - Preserved. Minimizing reconstruction error โŸบ maximizing preserved variance โŸบ choosing the largest k eigenvalues. The proof uses the Eckart-Young theorem: the best rank-k approximation to a matrix (in Frobenius norm) is given by its top-k SVD components.

IQ 16.10 โ€” Practical (Data Science Roles)

Q: You have a dataset with 1000 features but only 100 samples. What problems do you face, and how does PCA help?

Model Answer: With p >> n: (1) Covariance matrix (1000ร—1000) is rank-deficient (rank โ‰ค 99), making it non-invertible. (2) Models will overfit โ€” more parameters than data points. (3) Distance-based methods fail (curse of dimensionality). PCA helps by: (a) Reducing to at most 99 meaningful components (99 non-zero eigenvalues). (b) Typically 10-20 PCs capture most variance, giving a well-conditioned feature space. (c) Computing PCA via SVD of the 100ร—1000 data matrix (cheaper than 1000ร—1000 eigendecomposition). This is the standard approach in genomics, text mining, and neuroimaging.

22

Research Problems

๐Ÿ”ฌ Research Problem 1: Adaptive Perplexity for t-SNE

Standard t-SNE uses a single perplexity value for all points. But in datasets with varying density (e.g., a dense cluster near a sparse one), uniform perplexity is suboptimal. Design and implement an adaptive perplexity scheme where each point uses a different perplexity based on local density. Compare with standard t-SNE on synthetic datasets with multi-scale cluster structure.

Starting point: Compute local density ฯi = 1/dฬ„k (inverse of mean k-nearest-neighbor distance). Set perplexityi โˆ ฯi-ฮฑ for some parameter ฮฑ. Evaluate using: (a) trustworthiness metric, (b) silhouette score of known clusters, (c) visual quality.

Key reference: Linderman et al., "Clustering with t-SNE, provably" (2019).

๐Ÿ”ฌ Research Problem 2: Dimensionality Reduction for Time Series

Standard PCA treats each feature independently, ignoring temporal correlations. Design a "Temporal PCA" that finds principal components in the time-frequency domain. Apply to: (a) EEG brain signals (64 channels ร— 1000 time points), (b) stock price histories (500 stocks ร— 252 trading days).

Approach: Apply Short-Time Fourier Transform to each channel, flatten the time-frequency representation, then apply PCA. Compare with: standard PCA on raw time series, Dynamic Time Warping + MDS, and the method of Dynamic Mode Decomposition (DMD).

๐Ÿ”ฌ Research Problem 3: Fairness in Dimensionality Reduction

PCA maximizes total variance, which may disproportionately represent majority groups. For example, in face recognition, eigenfaces trained on predominantly light-skinned faces may poorly represent darker-skinned faces. Design a "Fair PCA" that ensures equal reconstruction error across demographic groups.

Formulation: Minimize maxgโˆˆGroups Eg[||x - xฬ‚||ยฒ] subject to dim(projection) = k. Compare with: standard PCA, group-stratified PCA (separate PCA per group), and regularized PCA with group-wise reconstruction constraints.

Key reference: Samadi et al., "The Price of Fair PCA" (NeurIPS 2018).

๐Ÿ”ฌ Research Problem 4: Topology-Preserving Dimensionality Reduction

UMAP is inspired by topology, but how well does it actually preserve topological features (holes, loops, connected components)? Develop quantitative metrics based on persistent homology to evaluate how well PCA, t-SNE, UMAP, and autoencoders preserve the topological structure of data manifolds.

Tools: Use Ripser or GUDHI for persistent homology computation. Test on: (a) torus, (b) Klein bottle samples, (c) real protein conformation data.

23

Key Takeaways

๐ŸŽฏ
The Curse is Real: High-dimensional spaces are geometrically weird โ€” distances concentrate, volumes explode, and data becomes sparse. Dimensionality reduction is not optional for high-D data; it's essential for both computational and statistical reasons.
๐Ÿ“
PCA = Maximum Variance = Minimum Reconstruction Error: PCA finds orthogonal directions that capture the most variance. Mathematically, principal components are eigenvectors of the covariance matrix. The eigenvalue equals the variance along that direction. This can be computed via eigendecomposition or SVD.
๐Ÿ”—
SVD is PCA's Best Friend: PCA via SVD avoids forming the covariance matrix, is numerically more stable, and is cheaper when n << d. Scikit-learn uses SVD internally. The connection: ฮปแตข = ฯƒแตขยฒ/(n-1).
๐ŸŒ€
t-SNE Reveals Local Structure: By modeling pairwise similarities as probabilities and minimizing KL divergence, t-SNE produces stunning 2D visualizations that reveal cluster structure invisible to PCA. The t-distribution's heavy tail solves the crowding problem. Key hyperparameter: perplexity (โ‰ˆ number of neighbors).
โšก
UMAP: The Modern Standard: UMAP preserves both local and global structure, runs much faster than t-SNE, and can transform new data. Its cross-entropy objective (vs. t-SNE's KL divergence) handles both attraction and repulsion. Use UMAP as your default for visualization and as a preprocessing step.
๐Ÿท๏ธ
LDA Uses Labels: When you have class labels, LDA is better than PCA for classification preprocessing because it maximizes class separability (between-class to within-class variance ratio). Limited to C-1 components.
โš–๏ธ
Selection vs. Extraction: Feature selection (filter, wrapper, embedded) preserves interpretability. Feature extraction (PCA, t-SNE, UMAP) preserves information. Choose based on whether you need to explain which features matter or how much information to keep.
๐Ÿ“Š
Use Scree Plots and EVR: Always visualize eigenvalues (scree plot) and cumulative explained variance before choosing k. The 95% threshold, elbow method, and Kaiser's rule (ฮป > 1) are complementary criteria. There is no single "correct" k.
โš ๏ธ
t-SNE is for Visualization Only: Never use t-SNE embeddings as features for a classifier. It distorts distances, is non-deterministic, and is non-parametric. PCA, LDA, or UMAP are better for preprocessing. t-SNE is a diagnostic tool โ€” treat it like a microscope, not a scalpel.
24

References

Foundational Papers

  1. Pearson, K. (1901). "On Lines and Planes of Closest Fit to Systems of Points in Space." Philosophical Magazine, 2(11), 559โ€“572.
  2. Hotelling, H. (1933). "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology, 24(6), 417โ€“441.
  3. Fisher, R.A. (1936). "The Use of Multiple Measurements in Taxonomic Problems." Annals of Eugenics, 7(2), 179โ€“188.
  4. Golub, G.H. & Kahan, W. (1965). "Calculating the Singular Values and Pseudo-Inverse of a Matrix." SIAM Journal on Numerical Analysis, 2(2), 205โ€“224.
  5. Schรถlkopf, B., Smola, A., & Mรผller, K.R. (1998). "Nonlinear Component Analysis as a Kernel Eigenvalue Problem." Neural Computation, 10(5), 1299โ€“1319.

t-SNE and UMAP

  1. Hinton, G.E. & Roweis, S.T. (2002). "Stochastic Neighbor Embedding." NeurIPS.
  2. van der Maaten, L. & Hinton, G. (2008). "Visualizing Data using t-SNE." JMLR, 9, 2579โ€“2605.
  3. McInnes, L., Healy, J., & Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." arXiv:1802.03426.
  4. Linderman, G.C. et al. (2019). "Fast Interpolation-based t-SNE for Improved Visualization of Single-Cell RNA-Seq Data." Nature Methods, 16, 243โ€“245.

Applications

  1. Turk, M. & Pentland, A. (1991). "Eigenfaces for Recognition." Journal of Cognitive Neuroscience, 3(1), 71โ€“86.
  2. Jolliffe, I.T. & Cadima, J. (2016). "Principal Component Analysis: A Review and Recent Developments." Phil. Trans. R. Soc. A, 374, 20150202.
  3. Becht, E. et al. (2019). "Dimensionality Reduction for Visualizing Single-Cell Data Using UMAP." Nature Biotechnology, 37, 38โ€“44.

Textbooks

  1. Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 12 (PCA).
  2. Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 20.
  3. Gรฉron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed. O'Reilly. Chapter 8.

Indian Context

  1. UIDAI Technical Reports on Biometric De-duplication Architecture (2020).
  2. ISRO/NRSC Hyperspectral Remote Sensing Documentation, NRSC-TR-2019.
  3. Rao, C.R. (1948). "The Utilization of Multiple Measurements in Problems of Biological Classification." JRSS B, 10(2), 159โ€“203.