Part V: Ensemble & Unsupervised ~4 Hours Chapter 15 of 30 Prerequisites: Ch 6

Unsupervised Learning — Clustering

Discovering Hidden Structure: K-Means, Hierarchical, DBSCAN, and Gaussian Mixture Models

Learning Objectives

Distinguish supervised from unsupervised learning and articulate when clustering is the right choice

Implement K-Means from scratch, derive convergence via within-cluster SSE minimization

Explain K-Means++ initialization and prove its O(log k)-competitive guarantee

Apply Elbow Method and Silhouette Score to select optimal number of clusters

Build hierarchical clustering (agglomerative & divisive) and interpret dendrograms

Master DBSCAN for density-based clustering with noise handling via ε and MinPts

Derive the Expectation-Maximization (EM) algorithm for Gaussian Mixture Models

Evaluate clusters using Silhouette, Davies-Bouldin, Calinski-Harabasz, and ARI metrics

Compare K-Means, Hierarchical, DBSCAN, and GMM — know which to use when

Build end-to-end clustering projects: customer segmentation and image color quantization

Introduction

Imagine walking into a library where none of the books have category labels. You'd naturally start grouping them — novels here, science texts there, history books in another pile. You're performing clustering: discovering natural groupings in data without being told what those groups are.

In supervised learning (Chapters 6–14), we had labeled data: "this email is spam," "this image is a cat." In unsupervised learning, we have only the data itself — no labels, no teacher. The algorithm must discover structure on its own.

Clustering is the most fundamental unsupervised learning task. Given a dataset of n points in d-dimensional space, the goal is to partition them into k groups (clusters) such that points within the same cluster are "similar" to each other and "dissimilar" to points in other clusters.

Why Clustering Matters

Historical Background

Clustering has one of the richest histories in machine learning, spanning mathematics, statistics, and computer science:

YearMilestoneContributorSignificance
1932Anthropometric clusteringDriver & KroeberFirst formal cluster analysis in social science
1957K-Means algorithmStuart Lloyd (Bell Labs)Originally for PCM signal quantization; published 1982
1963K-Means rediscoveryEdward ForgyPublished the iterative K-Means independently
1967Hierarchical clusteringJohnson (single/complete linkage)Systematic taxonomy via dendrograms
1977EM algorithm formalizedDempster, Laird, RubinGeneral framework for GMMs and missing data
1996DBSCANEster, Kriegel, Sander, XuDensity-based clustering, arbitrary-shaped clusters
2007K-Means++Arthur & VassilvitskiiO(log k)-competitive initialization
2011Mini-Batch K-MeansSculley (Google)Scalable K-Means for web-scale data
2017Deep Clustering (DEC)Xie, Girshick, FarhadiNeural network-based unsupervised clustering

Conceptual Explanation

4.1 Supervised vs. Unsupervised Learning

In supervised learning, we have pairs (xᵢ, yᵢ) and learn a mapping f: X → Y. In unsupervised learning, we have only {x₁, x₂, ..., xₙ} and seek to discover hidden patterns, groupings, or representations.

  SUPERVISED LEARNING              UNSUPERVISED LEARNING
  ═══════════════════              ═════════════════════
  Input: (x₁,y₁), ..., (xₙ,yₙ)   Input: x₁, x₂, ..., xₙ
          ↓                                 ↓
  Learn: f(x) → y                 Discover: Structure in X
          ↓                                 ↓
  Output: Predicted labels         Output: Clusters, components,
                                           associations
  
  Examples:                        Examples:
  • Classification                 • Clustering ← THIS CHAPTER
  • Regression                     • Dimensionality Reduction
                                   • Anomaly Detection
                                   • Association Rules
    
Figure 15.1: Supervised vs. Unsupervised Learning paradigm comparison

4.2 K-Means Clustering

K-Means is the most popular clustering algorithm. The idea is elegantly simple:

  1. Choose k — the number of clusters
  2. Initialize — pick k points as initial centroids
  3. Assign — assign each data point to the nearest centroid
  4. Update — recompute each centroid as the mean of its assigned points
  5. Repeat — iterate steps 3–4 until convergence

The algorithm converges because it monotonically decreases the within-cluster sum of squares (WCSS) — also called inertia:

J = Σₖ Σ_{x∈Cₖ} ||x - μₖ||²
Equation 15.1: K-Means Objective (WCSS / Inertia)

4.3 K-Means++ Initialization

Standard K-Means randomly initializes centroids, which can lead to poor local minima. K-Means++ is a smart initialization strategy:

  1. Pick the first centroid uniformly at random from the data
  2. For each remaining point x, compute D(x) = distance to nearest existing centroid
  3. Pick the next centroid with probability proportional to D(x)²
  4. Repeat steps 2–3 until k centroids are chosen

This ensures centroids are spread out, leading to O(log k)-competitive approximation to the optimal WCSS.

4.4 Choosing Optimal K

Elbow Method

Plot WCSS vs. k. The "elbow" — where the curve bends sharply — suggests the optimal k. Beyond this point, adding more clusters gives diminishing returns.

Silhouette Score

For each point i, the silhouette score measures how similar it is to its own cluster versus the nearest neighboring cluster:

s(i) = (b(i) - a(i)) / max(a(i), b(i))
Equation 15.2: Silhouette Score per point. a(i) = avg intra-cluster dist, b(i) = avg nearest-cluster dist

s(i) ranges from -1 (wrong cluster) to +1 (perfectly clustered). The overall silhouette score is the mean over all points.

4.5 Hierarchical Clustering

Instead of specifying k upfront, hierarchical clustering builds a tree (dendrogram) of nested clusterings:

Linkage Types

The key design choice is how to measure distance between clusters:

LinkageDistance Between A and BBehavior
Singlemin{d(a,b) : a∈A, b∈B}Can find elongated clusters; sensitive to noise ("chaining")
Completemax{d(a,b) : a∈A, b∈B}Produces compact, equal-diameter clusters
Average (UPGMA)mean{d(a,b) : a∈A, b∈B}Compromise between single and complete
Ward'sΔJ (increase in WCSS upon merge)Minimizes variance; produces equal-sized clusters

4.6 DBSCAN: Density-Based Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines clusters as dense regions separated by sparse regions. Two hyperparameters:

Three point types:

4.7 Gaussian Mixture Models (GMM)

While K-Means does "hard" assignment (each point belongs to exactly one cluster), GMMs do soft clustering — each point has a probability of belonging to each cluster.

A GMM models the data as a mixture of k Gaussian distributions:

p(x) = Σₖ πₖ · N(x | μₖ, Σₖ)
Equation 15.3: Gaussian Mixture Model density. πₖ = mixing coefficients, Σπₖ = 1

The parameters {πₖ, μₖ, Σₖ} are learned via the Expectation-Maximization (EM) algorithm.

4.8 Cluster Evaluation Metrics

Silhouette Score

How well-separated are clusters? Range: [-1, 1]. Higher = better.

Davies-Bouldin Index

Ratio of within-cluster to between-cluster distances. Lower = better.

Calinski-Harabasz

Ratio of between-cluster to within-cluster variance. Higher = better.

ARI (Adjusted Rand)

Measures agreement with ground truth (when available). Range: [-1, 1].

Mathematical Foundation

5.1 K-Means as an Optimization Problem

Given n data points {x₁, ..., xₙ} in ℝᵈ, K-Means seeks a partition C = {C₁, ..., Cₖ} and centroids {μ₁, ..., μₖ} that minimize:

J(C, μ) = Σ_{j=1}^{k} Σ_{i∈Cⱼ} ||xᵢ - μⱼ||²
Equation 15.4: K-Means Objective Function

This is an NP-hard problem in general (even for k=2 in d-dimensions). K-Means gives a local optimum via alternating minimization.

5.2 Why K-Means Converges

The algorithm alternates two steps, each of which can only decrease (or maintain) J:

  1. Assignment step (fix μ, optimize C): For each xᵢ, assigning it to the nearest centroid minimizes ||xᵢ - μⱼ||². This cannot increase J.
  2. Update step (fix C, optimize μ): Given cluster Cⱼ, the value of μ that minimizes Σ_{i∈Cⱼ} ||xᵢ - μ||² is the mean: μⱼ = (1/|Cⱼ|) Σ_{i∈Cⱼ} xᵢ. This cannot increase J.

Since J is bounded below by 0 and each step decreases it, the algorithm must converge. There are finitely many partitions, so it terminates in finite steps (usually fast).

5.3 K-Means++ Competitive Ratio

Let OPT denote the optimal WCSS. K-Means++ initialization produces initial centroids with expected cost:

E[J(initial)] ≤ 8(ln k + 2) · OPT
Equation 15.5: K-Means++ approximation guarantee (O(log k)-competitive)

5.4 Silhouette Score Formalization

For point xᵢ assigned to cluster Cₖ:

a(i) = (1 / |Cₖ| - 1) Σ_{j∈Cₖ, j≠i} d(xᵢ, xⱼ) [mean intra-cluster distance]
b(i) = min_{ℓ≠k} (1 / |Cₗ|) Σ_{j∈Cₗ} d(xᵢ, xⱼ) [mean nearest-cluster distance]
s(i) = (b(i) - a(i)) / max(a(i), b(i))
Equation 15.6: Silhouette coefficient derivation

5.5 Gaussian Mixture Model — Full Specification

The probability density under a GMM with k components:

p(x | θ) = Σ_{j=1}^{k} πⱼ · N(x | μⱼ, Σⱼ)
where N(x|μ,Σ) = (2π)^{-d/2} |Σ|^{-1/2} exp(-½(x-μ)ᵀ Σ⁻¹ (x-μ))
θ = {π₁,...,πₖ, μ₁,...,μₖ, Σ₁,...,Σₖ}
Equation 15.7: Full GMM specification with multivariate Gaussian

5.6 Davies-Bouldin Index

DB = (1/k) Σ_{i=1}^{k} max_{j≠i} ((σᵢ + σⱼ) / d(cᵢ, cⱼ))
Equation 15.8: σᵢ = avg distance of points in Cᵢ to centroid cᵢ. Lower DB = better.

5.7 Calinski-Harabasz Index

CH = [BSS / (k-1)] / [WSS / (n-k)]
BSS = Σⱼ |Cⱼ| · ||μⱼ - μ̄||² (between-cluster sum of squares)
WSS = Σⱼ Σ_{i∈Cⱼ} ||xᵢ - μⱼ||² (within-cluster sum of squares)
Equation 15.9: Calinski-Harabasz Index (higher = better-defined clusters)

Formula Derivations

6.1 Deriving the Optimal Centroid (Mean Minimizes SSE)

We prove that the centroid μⱼ that minimizes within-cluster SSE is the mean of the cluster points.

Given: Cluster Cⱼ = {x₁, x₂, ..., xₘ}. Find μ* = argmin_μ Σᵢ ||xᵢ - μ||²

Step 1: Define f(μ) = Σ_{i=1}^{m} ||xᵢ - μ||² = Σᵢ (xᵢ - μ)ᵀ(xᵢ - μ)
Step 2: Take gradient: ∇_μ f(μ) = Σᵢ -2(xᵢ - μ) = -2 Σᵢ xᵢ + 2mμ
Step 3: Set to zero: 2mμ* = 2 Σᵢ xᵢ
Step 4: μ* = (1/m) Σᵢ xᵢ = mean(Cⱼ) ✓
Derivation 15.1: The cluster mean minimizes within-cluster SSE (convex optimization)

The Hessian ∇²f = 2mI ≻ 0 (positive definite), confirming this is a global minimum.

6.2 Deriving the EM Algorithm for GMMs

The log-likelihood for a GMM is:

ℓ(θ) = Σ_{i=1}^{n} ln [Σ_{j=1}^{k} πⱼ N(xᵢ | μⱼ, Σⱼ)]
Equation 15.10: GMM log-likelihood

Direct maximization is intractable (ln of sum). The EM algorithm introduces latent variables zᵢ ∈ {1,...,k} indicating cluster assignment.

E-Step: Compute Responsibilities

γᵢⱼ = P(zᵢ = j | xᵢ, θ^{old}) = πⱼ N(xᵢ | μⱼ, Σⱼ) / Σₗ πₗ N(xᵢ | μₗ, Σₗ)
Derivation 15.2: Posterior responsibility via Bayes' theorem

γᵢⱼ is the "soft" probability that point xᵢ belongs to cluster j.

M-Step: Update Parameters

Maximize the expected complete-data log-likelihood. Define Nⱼ = Σᵢ γᵢⱼ (effective number of points in cluster j).

μⱼ^{new} = (1/Nⱼ) Σᵢ γᵢⱼ xᵢ
Σⱼ^{new} = (1/Nⱼ) Σᵢ γᵢⱼ (xᵢ - μⱼ^{new})(xᵢ - μⱼ^{new})ᵀ
πⱼ^{new} = Nⱼ / n
Derivation 15.3: M-step updates — weighted MLE with soft assignments

Proof sketch for μⱼ: Taking ∂/∂μⱼ of the expected log-likelihood Q(θ, θ^{old}) = Σᵢ Σⱼ γᵢⱼ ln[πⱼ N(xᵢ|μⱼ,Σⱼ)] and setting to zero:

∂Q/∂μⱼ = Σᵢ γᵢⱼ Σⱼ⁻¹ (xᵢ - μⱼ) = 0
⟹ Σᵢ γᵢⱼ xᵢ = (Σᵢ γᵢⱼ) μⱼ = Nⱼ μⱼ
⟹ μⱼ = (1/Nⱼ) Σᵢ γᵢⱼ xᵢ ✓
Derivation 15.4: M-step mean update from first principles

6.3 Ward's Linkage — Merge Cost Derivation

When merging clusters A and B, the increase in total WCSS is:

ΔJ(A,B) = (|A|·|B|)/(|A|+|B|) · ||μ_A - μ_B||²
Derivation 15.5: Ward's merge cost — penalizes merging distant or large clusters

Proof: The WCSS of the merged cluster A∪B with mean μ_{A∪B} = (|A|μ_A + |B|μ_B)/(|A|+|B|) can be expanded. The increase equals the variance due to the difference in means, weighted by the harmonic-like factor |A|·|B|/(|A|+|B|).

Worked Numerical Examples

Example 1: K-Means — 3 Iterations by Hand

Data points (2D): A(1,1), B(1.5,2), C(3,4), D(5,7), E(3.5,5), F(4.5,5), G(3.5,4.5)

k = 2. Initial centroids: μ₁ = A(1,1), μ₂ = D(5,7)

Iteration 1 — Assignment Step

PointCoordsdist² to μ₁(1,1)dist² to μ₂(5,7)Cluster
A(1, 1)052C₁
B(1.5, 2)1.2537.25C₁
C(3, 4)1313C₁ (tie → pick first)
D(5, 7)520C₂
E(3.5, 5)22.256.25C₂
F(4.5, 5)28.254.25C₂
G(3.5, 4.5)18.58.5C₂

C₁ = {A, B, C}, C₂ = {D, E, F, G}

Iteration 1 — Update Step

μ₁ = ((1+1.5+3)/3, (1+2+4)/3) = (1.833, 2.333)
μ₂ = ((5+3.5+4.5+3.5)/4, (7+5+5+4.5)/4) = (4.125, 5.375)

Iteration 2 — Assignment Step

Pointdist² to μ₁(1.83,2.33)dist² to μ₂(4.13,5.38)Cluster
A(1,1)2.4728.89C₁
B(1.5,2)0.2218.33C₁
C(3,4)4.153.16C₂ ← moved!
D(5,7)31.783.41C₂
E(3.5,5)9.900.53C₂
F(4.5,5)18.260.28C₂
G(3.5,4.5)7.471.14C₂

C₁ = {A, B}, C₂ = {C, D, E, F, G} — Point C moved to C₂!

Iteration 2 — Update Step

μ₁ = ((1+1.5)/2, (1+2)/2) = (1.25, 1.5)
μ₂ = ((3+5+3.5+4.5+3.5)/5, (4+7+5+5+4.5)/5) = (3.9, 5.1)

Iteration 3 — Assignment Step

Recomputing distances with μ₁=(1.25,1.5) and μ₂=(3.9,5.1):

All points remain in the same clusters: C₁ = {A, B}, C₂ = {C, D, E, F, G}. Converged!

Final WCSS = [0.125 + 0.375] + [2.02 + 4.82 + 0.17 + 0.37 + 0.52] = 0.5 + 7.9 = 8.4
Final inertia = 8.4

Example 2: Silhouette Score Computation

Clusters: C₁ = {(0,0), (1,0), (0,1)}, C₂ = {(5,5), (6,5), (5,6)}

Compute silhouette for point x₁ = (0,0) in C₁:

a(x₁) = (d(x₁,(1,0)) + d(x₁,(0,1))) / 2 = (1 + 1) / 2 = 1.0
b(x₁) = mean dist to C₂ = (√50 + √61 + √61) / 3 ≈ (7.07 + 7.81 + 7.81) / 3 ≈ 7.56
s(x₁) = (7.56 - 1.0) / max(1.0, 7.56) = 6.56 / 7.56 ≈ 0.868
Point (0,0) has silhouette ≈ 0.87 — well clustered!

Visual Diagrams

K-Means Iteration Visualization

  Iteration 0 (Init)        Iteration 1              Iteration 2 (Converged)
  ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
  │        · ·       │    │        ■ ■       │    │        ■ ■       │
  │       ·     ·    │    │       ■     ■    │    │       ■   ★ ■    │
  │          ★       │    │          ★       │    │                  │
  │                  │    │                  │    │                  │
  │   ★              │    │   ★              │    │   ★              │
  │  · ·             │    │  ● ●             │    │  ● ●             │
  │    ·             │    │    ●             │    │    ●             │
  └──────────────────┘    └──────────────────┘    └──────────────────┘
  
  ★ = centroid    · = unassigned    ● = cluster 1    ■ = cluster 2
    
Figure 15.2: K-Means assigns points to nearest centroid, then moves centroids to cluster means

DBSCAN Point Classification

  ε = 1.5, MinPts = 3
  
  ┌─────────────────────────────────┐
  │                                 │
  │    ●───●       Core points: ●   │
  │    │ ╲ │       Border pts:  ○   │
  │    ●───●───○   Noise pts:   ✕   │
  │                                 │
  │         ✕                       │
  │                                 │
  │              ●───●              │
  │              │ ╲ │              │
  │         ○────●───●              │
  │                                 │
  │                     ✕           │
  └─────────────────────────────────┘
  
  Cluster 1: top-left group (4 core + 1 border)
  Cluster 2: bottom-center group (4 core + 1 border)
  Noise: 2 isolated points (✕)
    
Figure 15.3: DBSCAN identifies dense regions as clusters, isolated points as noise

Dendrogram Reading

  Height (distance)
    │
  8 ├─────────────────────┐
    │                     │
  6 ├─────────┐           │
    │         │           │
  4 ├───┐     │           │
    │   │     │           │
  2 ├─┐ │   ┌─┘         ┌─┘
    │ │ │   │           │
  0 ├─┴─┴───┴───────────┴──
    A  B  C  D     E  F  G
    └──┘  │  │     └──┘  │
    merge  │  │     merge │
    at 2   │  │     at 2  │
    └─────┘  │     └──────┘
    merge     │     merge
    at 4      │     at 3
    └─────────┘
    merge at 6          
    └─────────────────────┘
              merge at 8
  
  Cut at height 5 → 2 clusters: {A,B,C,D} and {E,F,G}
  Cut at height 3 → 3 clusters: {A,B}, {C,D}, {E,F,G}
    
Figure 15.4: Dendrogram — cut at different heights gives different numbers of clusters

GMM Soft Clustering vs K-Means Hard Clustering

  K-MEANS (Hard Assignment)           GMM (Soft Assignment)
  ┌──────────────────────┐            ┌──────────────────────┐
  │ ● ● ●  │  ■ ■ ■     │            │ ● ● ●    ■ ■ ■      │
  │  ● ●   │   ■ ■      │            │  ● ●   ◐  ■ ■       │
  │   ●    ▌   ■         │            │   ●    ◑   ■        │
  │        │             │            │                      │
  │ Hard boundary        │            │ ◐ = 70% cluster 1   │
  │ between clusters     │            │ ◑ = 40% cluster 1   │
  │                      │            │ Smooth probabilities │
  └──────────────────────┘            └──────────────────────┘
    
Figure 15.5: K-Means uses hard boundaries; GMM assigns soft probabilities

Flowcharts

K-Means Algorithm Flowchart

         ┌──────────────────┐
         │   START: Input   │
         │  X, k (clusters) │
         └────────┬─────────┘
                  ▼
         ┌──────────────────┐
         │  Initialize k    │
         │  centroids       │
         │  (random/K-M++)  │
         └────────┬─────────┘
                  ▼
    ┌────►┌──────────────────┐
    │     │  ASSIGN each xᵢ  │
    │     │  to nearest μⱼ   │
    │     │  cᵢ = argmin     │
    │     │    ||xᵢ - μⱼ||²  │
    │     └────────┬─────────┘
    │              ▼
    │     ┌──────────────────┐
    │     │  UPDATE centroids│
    │     │  μⱼ = mean(Cⱼ)  │
    │     └────────┬─────────┘
    │              ▼
    │     ┌──────────────────┐
    │  NO │  Converged?      │
    ├─────┤  (assignments    │
    │     │   unchanged?)    │
    │     └────────┬─────────┘
    │              │ YES
    │              ▼
    │     ┌──────────────────┐
    │     │  OUTPUT: Clusters│
    │     │  C₁,...,Cₖ and   │
    │     │  centroids μ₁..μₖ│
    │     └──────────────────┘
    
Figure 15.6: K-Means Algorithm Flowchart

DBSCAN Algorithm Flowchart

    ┌─────────────────────────┐
    │ START: Input X, ε, MinPts│
    └───────────┬─────────────┘
                ▼
    ┌─────────────────────────┐
    │ Mark all points UNVISITED│
    │ cluster_id = 0           │
    └───────────┬─────────────┘
                ▼
    ┌─────────────────────────┐
    │ Pick unvisited point P   │◄──────────────┐
    │ Mark P as VISITED        │               │
    └───────────┬─────────────┘               │
                ▼                              │
    ┌─────────────────────────┐               │
    │ Find neighbors N(P)      │               │
    │ within radius ε          │               │
    └───────────┬─────────────┘               │
                ▼                              │
        ┌───────────────┐                      │
        │|N(P)| ≥ MinPts?│                     │
        └───┬───────┬───┘                      │
         YES│       │NO                        │
            ▼       ▼                          │
    ┌──────────┐ ┌──────────┐                  │
    │ CORE PT  │ │ NOISE    │                  │
    │ cluster++│ │ (may     │                  │
    │ Expand   │ │ become   │                  │
    │ cluster  │ │ border)  │                  │
    │ via BFS  │ └──────────┘                  │
    └─────┬────┘                               │
          ▼                                    │
    ┌───────────────┐                          │
    │ More unvisited │───YES───────────────────┘
    │   points?      │
    └───────┬───────┘
            │ NO
            ▼
    ┌──────────────────┐
    │ OUTPUT: Clusters  │
    │ + Noise points    │
    └──────────────────┘
    
Figure 15.7: DBSCAN — density reachability expands clusters from core points

EM Algorithm Flowchart

    ┌──────────────────────────┐
    │  START: Data X, k comps  │
    │  Init: πⱼ, μⱼ, Σⱼ       │
    └────────────┬─────────────┘
                 ▼
    ┌───►┌──────────────────────┐
    │    │  E-STEP              │
    │    │  Compute γᵢⱼ =       │
    │    │  P(zᵢ=j | xᵢ, θ)   │
    │    │  (responsibilities)  │
    │    └────────────┬─────────┘
    │                 ▼
    │    ┌──────────────────────┐
    │    │  M-STEP              │
    │    │  Update: μⱼ, Σⱼ, πⱼ │
    │    │  using γᵢⱼ weights   │
    │    └────────────┬─────────┘
    │                 ▼
    │    ┌──────────────────────┐
    │    │  Compute ℓ(θ)        │
    │ NO │  Converged?          │
    ├────┤  |ℓ_new - ℓ_old| <ε? │
    │    └────────────┬─────────┘
    │                 │ YES
    │                 ▼
    │    ┌──────────────────────┐
    │    │  OUTPUT:              │
    │    │  πⱼ, μⱼ, Σⱼ for each │
    │    │  component + γᵢⱼ     │
    │    └──────────────────────┘
    
Figure 15.8: EM Algorithm for GMMs — alternates E-step and M-step until convergence

Choosing the Right Clustering Algorithm

                    ┌─────────────────────┐
                    │ Do you know k       │
                    │ (number of clusters)?│
                    └─────┬──────┬────────┘
                      YES │      │ NO
                          ▼      ▼
                ┌──────────┐  ┌──────────────┐
                │Spherical  │  │Need to handle│
                │clusters?  │  │noise/outliers│
                └──┬────┬──┘  └──┬───────┬───┘
                YES│    │NO   YES│       │NO
                   ▼    ▼       ▼       ▼
              ┌──────┐┌─────┐┌──────┐┌──────────┐
              │K-Means││ GMM ││DBSCAN││Hierarchi-│
              │       ││     ││      ││cal (cut  │
              │Fast,  ││Soft ││Auto k││dendrogram│
              │simple ││prob ││Noise ││at desired│
              │       ││     ││label ││height)   │
              └──────┘└─────┘└──────┘└──────────┘
    
Figure 15.9: Decision tree for choosing a clustering algorithm

Python Implementation from Scratch

10.1 K-Means from Scratch

kmeans_scratch.py Python
import numpy as np
import matplotlib.pyplot as plt

class KMeans:
    """K-Means clustering from scratch with K-Means++ initialization."""
    
    def __init__(self, k=3, max_iters=100, tol=1e-4, init='kmeans++', random_state=42):
        self.k = k
        self.max_iters = max_iters
        self.tol = tol
        self.init = init
        self.random_state = random_state
        self.centroids = None
        self.labels = None
        self.inertia_ = None
        self.n_iters_ = 0
    
    def _init_centroids(self, X):
        """Initialize centroids using random or K-Means++."""
        rng = np.random.RandomState(self.random_state)
        n = X.shape[0]
        
        if self.init == 'random':
            indices = rng.choice(n, self.k, replace=False)
            return X[indices].copy()
        
        elif self.init == 'kmeans++':
            centroids = [X[rng.randint(n)]]
            
            for _ in range(1, self.k):
                # Compute squared distances to nearest centroid
                dists = np.array([
                    min(np.sum((x - c) ** 2) for c in centroids)
                    for x in X
                ])
                # Probability proportional to D(x)^2
                probs = dists / dists.sum()
                # Weighted random selection
                idx = rng.choice(n, p=probs)
                centroids.append(X[idx])
            
            return np.array(centroids)
    
    def _assign_clusters(self, X):
        """Assign each point to nearest centroid."""
        # Vectorized distance computation
        # X: (n, d), centroids: (k, d)
        # Using broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d)
        diffs = X[:, np.newaxis, :] - self.centroids[np.newaxis, :, :]
        distances = np.sum(diffs ** 2, axis=2)  # (n, k)
        return np.argmin(distances, axis=1)
    
    def _update_centroids(self, X, labels):
        """Update centroids as cluster means."""
        new_centroids = np.zeros_like(self.centroids)
        for j in range(self.k):
            cluster_points = X[labels == j]
            if len(cluster_points) > 0:
                new_centroids[j] = cluster_points.mean(axis=0)
            else:
                # Handle empty cluster: reinitialize randomly
                new_centroids[j] = X[np.random.randint(X.shape[0])]
        return new_centroids
    
    def _compute_inertia(self, X, labels):
        """Compute within-cluster sum of squares (WCSS)."""
        inertia = 0
        for j in range(self.k):
            cluster_points = X[labels == j]
            if len(cluster_points) > 0:
                inertia += np.sum((cluster_points - self.centroids[j]) ** 2)
        return inertia
    
    def fit(self, X):
        """Fit K-Means to data X."""
        self.centroids = self._init_centroids(X)
        
        for i in range(self.max_iters):
            # Assignment step
            self.labels = self._assign_clusters(X)
            
            # Update step
            new_centroids = self._update_centroids(X, self.labels)
            
            # Check convergence
            shift = np.sum((new_centroids - self.centroids) ** 2)
            self.centroids = new_centroids
            self.n_iters_ = i + 1
            
            if shift < self.tol:
                break
        
        self.inertia_ = self._compute_inertia(X, self.labels)
        return self
    
    def predict(self, X):
        """Predict cluster labels for new data."""
        return self._assign_clusters(X)
    
    def fit_predict(self, X):
        """Fit and return labels."""
        self.fit(X)
        return self.labels


# --- Demo ---
np.random.seed(42)
# Generate 3 blobs
X = np.vstack([
    np.random.randn(50, 2) + [0, 0],
    np.random.randn(50, 2) + [5, 5],
    np.random.randn(50, 2) + [10, 0],
])

km = KMeans(k=3, init='kmeans++')
labels = km.fit_predict(X)

print(f"Converged in {km.n_iters_} iterations")
print(f"Inertia (WCSS): {km.inertia_:.2f}")
print(f"Centroids:\n{km.centroids}")

# Elbow method
inertias = []
K_range = range(1, 10)
for k in K_range:
    km_temp = KMeans(k=k)
    km_temp.fit(X)
    inertias.append(km_temp.inertia_)

print("\nElbow Method - Inertias:")
for k, inertia in zip(K_range, inertias):
    bar = '█' * int(inertia / 50)
    print(f"  k={k}: {inertia:8.1f} |{bar}")

10.2 DBSCAN from Scratch

dbscan_scratch.py Python
import numpy as np
from collections import deque

class DBSCAN:
    """DBSCAN clustering from scratch."""
    
    def __init__(self, eps=0.5, min_pts=5):
        self.eps = eps
        self.min_pts = min_pts
        self.labels = None
        self.core_points_ = None
        self.n_clusters_ = 0
    
    def _get_neighbors(self, X, point_idx):
        """Find all points within eps of X[point_idx]."""
        distances = np.sqrt(np.sum((X - X[point_idx]) ** 2, axis=1))
        return np.where(distances <= self.eps)[0]
    
    def fit(self, X):
        """Fit DBSCAN to data X."""
        n = X.shape[0]
        self.labels = np.full(n, -1)  # -1 = unvisited/noise
        visited = np.zeros(n, dtype=bool)
        cluster_id = 0
        core_mask = np.zeros(n, dtype=bool)
        
        # Precompute distance matrix for efficiency
        # (for small datasets; for large ones, use spatial index)
        dist_matrix = np.sqrt(
            np.sum((X[:, np.newaxis] - X[np.newaxis, :]) ** 2, axis=2)
        )
        
        for i in range(n):
            if visited[i]:
                continue
            visited[i] = True
            
            # Find neighbors
            neighbors = np.where(dist_matrix[i] <= self.eps)[0]
            
            if len(neighbors) < self.min_pts:
                # Mark as noise (may become border later)
                self.labels[i] = -1
                continue
            
            # Core point — start new cluster
            core_mask[i] = True
            self.labels[i] = cluster_id
            
            # BFS expansion
            queue = deque(neighbors.tolist())
            
            while queue:
                q = queue.popleft()
                
                if not visited[q]:
                    visited[q] = True
                    q_neighbors = np.where(dist_matrix[q] <= self.eps)[0]
                    
                    if len(q_neighbors) >= self.min_pts:
                        core_mask[q] = True
                        queue.extend(q_neighbors.tolist())
                
                # Assign to cluster if not yet assigned
                if self.labels[q] == -1:
                    self.labels[q] = cluster_id
            
            cluster_id += 1
        
        self.core_points_ = core_mask
        self.n_clusters_ = cluster_id
        return self
    
    def fit_predict(self, X):
        """Fit and return labels."""
        self.fit(X)
        return self.labels


# --- Demo ---
np.random.seed(42)
# Two crescent moons (non-spherical clusters)
from sklearn.datasets import make_moons
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=42)

db = DBSCAN(eps=0.2, min_pts=5)
labels = db.fit_predict(X)

print(f"Number of clusters found: {db.n_clusters_}")
print(f"Noise points: {np.sum(labels == -1)}")
print(f"Core points: {np.sum(db.core_points_)}")
print(f"Cluster sizes: {[np.sum(labels == i) for i in range(db.n_clusters_)]}")

10.3 Silhouette Score from Scratch

silhouette_scratch.py Python
import numpy as np

def silhouette_score(X, labels):
    """Compute mean silhouette score from scratch."""
    n = X.shape[0]
    unique_labels = np.unique(labels)
    k = len(unique_labels)
    
    if k <= 1 or k >= n:
        return 0.0  # Undefined for 1 cluster or n clusters
    
    # Pairwise distance matrix
    dist_matrix = np.sqrt(
        np.sum((X[:, np.newaxis] - X[np.newaxis, :]) ** 2, axis=2)
    )
    
    silhouettes = np.zeros(n)
    
    for i in range(n):
        own_cluster = labels[i]
        own_mask = labels == own_cluster
        
        # a(i): mean intra-cluster distance
        if own_mask.sum() > 1:
            a_i = dist_matrix[i][own_mask].sum() / (own_mask.sum() - 1)
        else:
            a_i = 0
        
        # b(i): min mean distance to other clusters
        b_i = np.inf
        for label in unique_labels:
            if label == own_cluster or label == -1:
                continue
            other_mask = labels == label
            if other_mask.sum() > 0:
                mean_dist = dist_matrix[i][other_mask].mean()
                b_i = min(b_i, mean_dist)
        
        if b_i == np.inf:
            silhouettes[i] = 0
        else:
            silhouettes[i] = (b_i - a_i) / max(a_i, b_i)
    
    return silhouettes.mean()


# --- Demo ---
np.random.seed(42)
X = np.vstack([
    np.random.randn(30, 2) + [0, 0],
    np.random.randn(30, 2) + [5, 5],
    np.random.randn(30, 2) + [10, 0],
])

from kmeans_scratch import KMeans  # from above

for k in range(2, 7):
    km = KMeans(k=k)
    labels = km.fit_predict(X)
    score = silhouette_score(X, labels)
    bar = '█' * int(score * 30)
    print(f"k={k}: Silhouette = {score:.4f} |{bar}")
# Best k should be 3 (highest silhouette)

TensorFlow Implementation

While clustering is traditionally non-neural, TensorFlow can accelerate K-Means on GPU and enables deep clustering approaches.

tf_kmeans.py TensorFlow
import tensorflow as tf
import numpy as np

class TFKMeans:
    """GPU-accelerated K-Means using TensorFlow."""
    
    def __init__(self, k=3, max_iters=100, tol=1e-4):
        self.k = k
        self.max_iters = max_iters
        self.tol = tol
    
    @tf.function
    def _assign_step(self, X, centroids):
        """Vectorized assignment using TF ops."""
        # X: (n, d), centroids: (k, d)
        # Expand: (n, 1, d) - (1, k, d) = (n, k, d)
        diffs = tf.expand_dims(X, 1) - tf.expand_dims(centroids, 0)
        distances = tf.reduce_sum(diffs ** 2, axis=2)  # (n, k)
        return tf.argmin(distances, axis=1)  # (n,)
    
    @tf.function
    def _update_step(self, X, labels, k):
        """Update centroids as cluster means."""
        new_centroids = []
        for j in tf.range(k):
            mask = tf.equal(labels, tf.cast(j, labels.dtype))
            cluster_pts = tf.boolean_mask(X, mask)
            new_centroids.append(tf.reduce_mean(cluster_pts, axis=0))
        return tf.stack(new_centroids)
    
    def fit(self, X_np):
        """Fit K-Means on data."""
        X = tf.constant(X_np, dtype=tf.float32)
        n = X.shape[0]
        
        # K-Means++ init
        indices = [np.random.randint(n)]
        for _ in range(1, self.k):
            centroids_so_far = tf.gather(X, indices)
            diffs = tf.expand_dims(X, 1) - tf.expand_dims(centroids_so_far, 0)
            dists = tf.reduce_min(tf.reduce_sum(diffs**2, axis=2), axis=1)
            probs = dists / tf.reduce_sum(dists)
            idx = np.random.choice(n, p=probs.numpy())
            indices.append(idx)
        
        self.centroids = tf.Variable(tf.gather(X, indices))
        
        for i in range(self.max_iters):
            labels = self._assign_step(X, self.centroids)
            new_centroids = self._update_step(X, labels, self.k)
            
            shift = tf.reduce_sum((new_centroids - self.centroids) ** 2)
            self.centroids.assign(new_centroids)
            
            if shift.numpy() < self.tol:
                print(f"Converged at iteration {i+1}")
                break
        
        self.labels_ = labels.numpy()
        return self


# --- Demo ---
np.random.seed(42)
X = np.vstack([
    np.random.randn(100, 2) + [0, 0],
    np.random.randn(100, 2) + [6, 6],
    np.random.randn(100, 2) + [12, 0],
]).astype(np.float32)

model = TFKMeans(k=3)
model.fit(X)
print(f"TF Centroids:\n{model.centroids.numpy()}")
print(f"Cluster sizes: {[np.sum(model.labels_ == i) for i in range(3)]}")


# --- Deep Clustering with Autoencoder ---
class DeepCluster(tf.keras.Model):
    """Autoencoder-based deep clustering."""
    
    def __init__(self, latent_dim=10, n_clusters=3):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(latent_dim, activation='relu'),
        ])
        self.decoder = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dense(784, activation='sigmoid'),  # MNIST
        ])
        self.cluster_centers = tf.Variable(
            tf.random.normal([n_clusters, latent_dim])
        )
    
    def call(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat, z

print("\nDeep Clustering architecture defined for high-dimensional data.")

Scikit-Learn Implementation

sklearn_clustering_pipeline.py Scikit-Learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    silhouette_score, silhouette_samples,
    davies_bouldin_score, calinski_harabasz_score,
    adjusted_rand_score
)
from sklearn.datasets import make_blobs, make_moons
from scipy.cluster.hierarchy import dendrogram, linkage
import warnings
warnings.filterwarnings('ignore')

# =========================================
# 1. K-MEANS with Elbow & Silhouette
# =========================================
print("=" * 60)
print("1. K-MEANS CLUSTERING")
print("=" * 60)

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8,
                        random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Elbow Method
inertias = []
sil_scores = []
K_range = range(2, 10)

for k in K_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_scaled, km.labels_))

print("\nElbow Method Results:")
for k, inertia, sil in zip(K_range, inertias, sil_scores):
    bar_i = '█' * int(inertia / 20)
    bar_s = '█' * int(sil * 40)
    print(f"  k={k}: Inertia={inertia:7.1f} |{bar_i}")
    print(f"       Silhouette={sil:.4f} |{bar_s}")

# Best K-Means
best_k = list(K_range)[np.argmax(sil_scores)]
print(f"\nOptimal k by Silhouette: {best_k}")

km_final = KMeans(n_clusters=best_k, init='k-means++', n_init=10, 
                   random_state=42)
labels_km = km_final.fit_predict(X_scaled)

print(f"Final Inertia: {km_final.inertia_:.2f}")
print(f"Silhouette: {silhouette_score(X_scaled, labels_km):.4f}")
print(f"Davies-Bouldin: {davies_bouldin_score(X_scaled, labels_km):.4f}")
print(f"Calinski-Harabasz: {calinski_harabasz_score(X_scaled, labels_km):.1f}")
print(f"ARI (vs true): {adjusted_rand_score(y_true, labels_km):.4f}")


# =========================================
# 2. HIERARCHICAL CLUSTERING
# =========================================
print("\n" + "=" * 60)
print("2. HIERARCHICAL CLUSTERING")
print("=" * 60)

# Test all linkage types
for link in ['ward', 'complete', 'average', 'single']:
    hc = AgglomerativeClustering(
        n_clusters=4, linkage=link
    )
    labels_hc = hc.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels_hc)
    ari = adjusted_rand_score(y_true, labels_hc)
    print(f"  {link:10s} linkage: Silhouette={sil:.4f}, ARI={ari:.4f}")

# Dendrogram with scipy
Z = linkage(X_scaled[:50], method='ward')  # Subset for readability
print(f"\nLinkage matrix shape: {Z.shape}")
print("  Columns: [idx1, idx2, distance, n_points]")
print(f"  First 3 merges:")
for row in Z[:3]:
    print(f"    Merge clusters {int(row[0])} & {int(row[1])}, "
          f"dist={row[2]:.3f}, size={int(row[3])}")


# =========================================
# 3. DBSCAN
# =========================================
print("\n" + "=" * 60)
print("3. DBSCAN CLUSTERING")
print("=" * 60)

X_moons, y_moons = make_moons(n_samples=300, noise=0.08, random_state=42)
X_moons_scaled = StandardScaler().fit_transform(X_moons)

# Parameter search
print("\nParameter sensitivity:")
for eps in [0.2, 0.3, 0.5, 0.7]:
    for min_pts in [3, 5, 10]:
        db = DBSCAN(eps=eps, min_samples=min_pts)
        labels_db = db.fit_predict(X_moons_scaled)
        n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
        n_noise = (labels_db == -1).sum()
        if n_clusters >= 2:
            sil = silhouette_score(X_moons_scaled[labels_db != -1],
                                    labels_db[labels_db != -1])
        else:
            sil = -1
        print(f"  eps={eps}, min_pts={min_pts}: "
              f"clusters={n_clusters}, noise={n_noise}, sil={sil:.3f}")

# Best DBSCAN
db_best = DBSCAN(eps=0.3, min_samples=5)
labels_db = db_best.fit_predict(X_moons_scaled)
print(f"\nBest DBSCAN: {len(set(labels_db)) - 1} clusters found")
print(f"Noise points: {(labels_db == -1).sum()}")
print(f"ARI vs true: {adjusted_rand_score(y_moons, labels_db):.4f}")


# =========================================
# 4. GAUSSIAN MIXTURE MODEL
# =========================================
print("\n" + "=" * 60)
print("4. GAUSSIAN MIXTURE MODEL (EM)")
print("=" * 60)

# BIC/AIC for model selection
print("\nModel selection via BIC/AIC:")
for k in range(2, 8):
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                           n_init=3, random_state=42)
    gmm.fit(X_scaled)
    labels_gmm = gmm.predict(X_scaled)
    sil = silhouette_score(X_scaled, labels_gmm)
    print(f"  k={k}: BIC={gmm.bic(X_scaled):8.1f}, "
          f"AIC={gmm.aic(X_scaled):8.1f}, Silhouette={sil:.4f}")

# Best GMM
gmm_best = GaussianMixture(n_components=4, covariance_type='full',
                             random_state=42)
gmm_best.fit(X_scaled)
labels_gmm = gmm_best.predict(X_scaled)
proba = gmm_best.predict_proba(X_scaled)

print(f"\nBest GMM (k=4):")
print(f"  Converged: {gmm_best.converged_}")
print(f"  Iterations: {gmm_best.n_iter_}")
print(f"  ARI vs true: {adjusted_rand_score(y_true, labels_gmm):.4f}")
print(f"\nSoft assignments (first 5 points):")
for i in range(5):
    probs_str = ', '.join(f'{p:.3f}' for p in proba[i])
    print(f"  Point {i}: [{probs_str}] → Cluster {labels_gmm[i]}")

print(f"\nMixing weights: {gmm_best.weights_.round(3)}")


# =========================================
# 5. COMPREHENSIVE COMPARISON
# =========================================
print("\n" + "=" * 60)
print("5. ALGORITHM COMPARISON")
print("=" * 60)

algorithms = {
    'K-Means (k=4)': KMeans(n_clusters=4, random_state=42),
    'Hierarchical (Ward)': AgglomerativeClustering(n_clusters=4, linkage='ward'),
    'DBSCAN (0.5, 5)': DBSCAN(eps=0.5, min_samples=5),
    'GMM (k=4)': GaussianMixture(n_components=4, random_state=42),
}

print(f"\n{'Algorithm':<25} {'Clusters':>8} {'Silhouette':>10} "
      f"{'DB-Index':>10} {'ARI':>8}")
print("-" * 65)

for name, algo in algorithms.items():
    if hasattr(algo, 'fit_predict'):
        labels = algo.fit_predict(X_scaled)
    else:
        algo.fit(X_scaled)
        labels = algo.predict(X_scaled)
    
    mask = labels >= 0
    n_clusters = len(set(labels[mask]))
    
    if n_clusters >= 2:
        sil = silhouette_score(X_scaled[mask], labels[mask])
        db = davies_bouldin_score(X_scaled[mask], labels[mask])
    else:
        sil = db = float('nan')
    ari = adjusted_rand_score(y_true, labels)
    
    print(f"{name:<25} {n_clusters:>8d} {sil:>10.4f} {db:>10.4f} {ari:>8.4f}")

Indian Case Studies

Jio Customer Segmentation — 200M+ Users

Telecom · Clustering at Scale

Problem

Reliance Jio has 400M+ subscribers generating diverse usage patterns. Targeted plans, content recommendations, and network optimization require understanding distinct user segments — without predefined labels.

Approach

  • Features: Daily data usage (GB), voice minutes, video streaming hours, recharge frequency, location (urban/rural), device tier, app usage patterns
  • Preprocessing: StandardScaler + PCA to 15 dimensions (from 50+ raw features)
  • Algorithm: Mini-Batch K-Means (k=8 discovered via elbow + business validation)
  • Validation: Silhouette score = 0.41 (acceptable for high-dimensional real data)

Discovered Segments

SegmentSizeProfileJio Action
Data Heavy18%10GB+/day, heavy streaming₹999 unlimited plans
Voice Primary22%Low data, high calls, rural₹149 voice-first plans
Price Sensitive25%Recharge only when offersFlash sales, sachets
JioTV Fans12%Heavy video, moderate dataContent bundles
Business Users8%Weekday heavy, enterprise appsJioBusiness plans
Light Users15%Feature phone, minimal dataJioPhone specials

Impact

Clustering-driven personalized plans increased ARPU (Average Revenue Per User) by 15% and reduced churn by 12% in the price-sensitive segment through targeted retention offers.

jio_segmentation_demo.py Python
# Simulated Jio customer segmentation
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

np.random.seed(42)

# Simulate 10,000 users with 8 features
n_users = 10000
data_usage_gb = np.concatenate([
    np.random.exponential(2, 3000),   # light users
    np.random.normal(8, 2, 4000),     # moderate
    np.random.normal(15, 3, 3000),    # heavy
])
voice_mins = np.random.normal(120, 60, n_users).clip(0)
video_hours = np.random.exponential(1.5, n_users)
recharge_freq = np.random.poisson(2, n_users) + 1
location = np.random.binomial(1, 0.6, n_users)  # 1=urban
device_tier = np.random.choice([1, 2, 3], n_users, p=[0.3, 0.5, 0.2])
app_diversity = np.random.poisson(5, n_users)
monthly_spend = np.random.lognormal(5.5, 0.8, n_users)

X = np.column_stack([
    data_usage_gb, voice_mins, video_hours, recharge_freq,
    location, device_tier, app_diversity, monthly_spend
])

# Pipeline
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA for visualization
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
print(f"PCA explained variance: {pca.explained_variance_ratio_.sum():.2%}")

# Mini-Batch K-Means for scalability
mbkm = MiniBatchKMeans(n_clusters=6, batch_size=1000, random_state=42)
labels = mbkm.fit_predict(X_scaled)

# Segment analysis
print("\nSegment Analysis:")
print(f"{'Segment':<12} {'Count':>6} {'Avg Data(GB)':>12} "
      f"{'Avg Voice':>10} {'Avg Spend':>10}")
print("-" * 55)
for seg in range(6):
    mask = labels == seg
    print(f"Segment {seg:<4} {mask.sum():>6} "
          f"{data_usage_gb[mask].mean():>12.1f} "
          f"{voice_mins[mask].mean():>10.0f} "
          f"{monthly_spend[mask].mean():>10.0f}")

Indian Census — District-Level Population Clustering

Government · Socioeconomic Analysis

Problem

India's Census (2011: 640 districts, 2021: 773 districts) captures hundreds of indicators per district. Clustering helps identify development patterns and target government programs.

Features Used

  • Population density, urbanization rate, literacy rate, sex ratio
  • Agricultural employment %, industrial employment %, service sector %
  • Infant mortality rate, access to electricity, road connectivity
  • Scheduled Caste/Tribe population %, HDI indicators

Methodology

Ward's hierarchical clustering with 5 clusters, validated against known development indices:

  • Cluster 1 (Metro): High urbanization, high literacy — Delhi, Mumbai, Bangalore districts
  • Cluster 2 (Developing Urban): Medium urbanization, growing services — Tier-2 cities
  • Cluster 3 (Agricultural): High agricultural employment, moderate literacy — Punjab, Haryana heartland
  • Cluster 4 (Tribal/Forest): Low density, high ST population — Chhattisgarh, Jharkhand
  • Cluster 5 (Under-served): Low literacy, high IMR — parts of Bihar, UP

Policy Impact

The clustering directly informed Aspirational Districts Programme — identifying 112 districts requiring concentrated development resources. Cluster membership helps predict which interventions (education, healthcare, infrastructure) will be most effective.

Global Case Studies

Spotify — Music Clustering & Discover Weekly

Entertainment · Audio Features

How Spotify Clusters Music

Spotify's audio analysis API extracts features for each of its 100M+ tracks:

  • Acoustic features: Danceability, energy, valence (mood), tempo, key, loudness, speechiness, acousticness, instrumentalness
  • Spectral features: MFCC (Mel-Frequency Cepstral Coefficients) — 13-dimensional representation of audio spectrum
  • Embedding features: 128-dimensional vectors from deep learning models trained on listener co-occurrence

Multi-Level Clustering

  1. Macro clusters (~5000): K-Means on audio embeddings — broad genres (rock, jazz, hip-hop, classical, electronic)
  2. Micro clusters (~100,000): Further subdivision — "90s grunge," "lo-fi bedroom pop," "Bollywood dance"
  3. User taste clusters: GMM on user listening histories — each user is a mixture of music taste clusters

Discover Weekly Pipeline

For each user: find their taste cluster → find songs in those clusters → filter out already-heard songs → rank by popularity and novelty → deliver 30 songs every Monday. This serves 600M+ personalized playlists weekly.

spotify_clustering_demo.py Python
# Spotify-style audio feature clustering
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

np.random.seed(42)

# Simulated tracks with Spotify-like features
n_tracks = 1000
features = {
    'danceability': np.random.beta(5, 3, n_tracks),
    'energy': np.random.beta(4, 3, n_tracks),
    'valence': np.random.beta(3, 3, n_tracks),
    'tempo': np.random.normal(120, 30, n_tracks).clip(60, 200),
    'acousticness': np.random.beta(2, 5, n_tracks),
    'instrumentalness': np.random.beta(1.5, 8, n_tracks),
    'speechiness': np.random.beta(1.5, 10, n_tracks),
}

X = np.column_stack(list(features.values()))
X_scaled = StandardScaler().fit_transform(X)

# Cluster into micro-genres
km = KMeans(n_clusters=8, init='k-means++', n_init=10, random_state=42)
genre_labels = km.fit_predict(X_scaled)

# Analyze clusters
print("Discovered Micro-Genres:")
print(f"{'Cluster':<10} {'Count':>6} {'Dance':>7} {'Energy':>8} "
      f"{'Valence':>9} {'Tempo':>7}")
print("-" * 52)
for c in range(8):
    mask = genre_labels == c
    print(f"Genre {c:<5} {mask.sum():>5}  "
          f"{features['danceability'][mask].mean():>6.3f}  "
          f"{features['energy'][mask].mean():>7.3f}  "
          f"{features['valence'][mask].mean():>8.3f}  "
          f"{features['tempo'][mask].mean():>6.1f}")

Amazon — Product Grouping & Recommendation

E-Commerce · Product Taxonomy

Problem

Amazon's catalog has 350M+ products. Manual categorization doesn't scale. Clustering helps:

  • Product similarity: Group products with similar descriptions, images, and purchase patterns
  • "Customers also bought": Users within the same behavioral cluster tend to buy similar items
  • Inventory optimization: Cluster warehouses by demand patterns to optimize stocking

Technique Stack

  1. Text clustering: TF-IDF embeddings of product descriptions → K-Means for initial grouping
  2. Image clustering: ResNet features of product images → DBSCAN for visual similarity
  3. Behavioral clustering: User-product interaction matrices → GMM on user purchase embeddings
  4. Hierarchical taxonomy: Agglomerative clustering builds the category tree (Electronics → Phones → Android Phones → Samsung Galaxy)

Scale

Amazon processes 2.5M product clustering updates daily. Mini-Batch K-Means with FAISS (Facebook AI Similarity Search) enables nearest-centroid queries in <1ms for real-time recommendation.

Startup Applications

Razorpay 🇮🇳

Clusters merchants by transaction patterns for fraud risk scoring. Unusual clusters flag potential fraud rings.

Meesho 🇮🇳

Clusters resellers by behavior to personalize product catalogs — "fashion-focused" vs "electronics-focused" resellers.

CRED 🇮🇳

Clusters credit card users by spending patterns to offer personalized rewards and cashback categories.

PhonePe 🇮🇳

Clusters UPI merchants by transaction volume and timing for dynamic pricing of payment processing.

Government Applications

Industry Applications

IndustryClustering ApplicationAlgorithmImpact
HealthcarePatient subtyping (diabetes subtypes)GMMPersonalized treatment protocols
FinancePortfolio diversificationHierarchicalOptimal asset allocation
RetailMarket basket analysisK-MeansStore layout optimization
ManufacturingQuality control (defect grouping)DBSCANRoot cause identification
CybersecurityNetwork intrusion detectionDBSCANAnomalous traffic isolation
NLPTopic discovery in documentsK-Means + TF-IDFContent organization
GenomicsGene expression profilingHierarchicalDisease biomarker discovery
AstronomyGalaxy classificationGMMMorphological taxonomy

Mini Projects

Project 1: Customer Segmentation Dashboard

Objective: Build an end-to-end customer segmentation pipeline with interactive visualization.

customer_segmentation.py Python
"""
Mini Project 1: Customer Segmentation Dashboard
Dataset: Mall Customer Segmentation (simulated)
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, silhouette_samples

np.random.seed(42)

# --- Simulate Mall Customer Data ---
n_customers = 500
data = {
    'annual_income_k': np.concatenate([
        np.random.normal(30, 8, 150),   # Low income
        np.random.normal(55, 10, 200),  # Middle
        np.random.normal(90, 12, 150),  # High
    ]),
    'spending_score': np.concatenate([
        np.random.normal(70, 15, 100),   # High spenders (low income)
        np.random.normal(30, 10, 100),   # Low spenders (low income)
        np.random.normal(50, 15, 150),   # Average
        np.random.normal(75, 12, 80),    # High spenders (high income)
        np.random.normal(20, 8, 70),     # Low spenders (high income)
    ]),
    'age': np.concatenate([
        np.random.normal(25, 5, 150),
        np.random.normal(40, 8, 200),
        np.random.normal(55, 7, 150),
    ]),
    'visit_frequency': np.concatenate([
        np.random.poisson(8, 200),
        np.random.poisson(3, 150),
        np.random.poisson(5, 150),
    ]),
}

X = np.column_stack([data[f] for f in data]).astype(float)

# --- Preprocessing ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Elbow + Silhouette Analysis ---
K_range = range(2, 10)
results = {'k': [], 'inertia': [], 'silhouette': []}

for k in K_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    results['k'].append(k)
    results['inertia'].append(km.inertia_)
    results['silhouette'].append(silhouette_score(X_scaled, labels))

# Print analysis
print("Optimal K Analysis:")
print(f"{'k':>3} {'Inertia':>10} {'Silhouette':>12}")
print("-" * 28)
for k, ine, sil in zip(results['k'], results['inertia'], results['silhouette']):
    marker = ' ← BEST' if sil == max(results['silhouette']) else ''
    print(f"{k:>3} {ine:>10.1f} {sil:>12.4f}{marker}")

# --- Final Clustering ---
best_k = results['k'][np.argmax(results['silhouette'])]
km_final = KMeans(n_clusters=best_k, init='k-means++', n_init=10, 
                   random_state=42)
labels = km_final.fit_predict(X_scaled)

# --- Segment Profiles ---
print(f"\n--- Customer Segments (k={best_k}) ---")
print(f"{'Segment':<12} {'Count':>6} {'Income':>8} {'Spend':>7} "
      f"{'Age':>6} {'Visits':>7}")
print("-" * 52)

segment_names = ['Budget', 'Value', 'Premium', 'VIP', 'Casual']
for seg in range(best_k):
    mask = labels == seg
    name = segment_names[seg] if seg < len(segment_names) else f"Seg{seg}"
    print(f"{name:<12} {mask.sum():>6} "
          f"{data['annual_income_k'][mask].mean():>8.1f} "
          f"{data['spending_score'][mask].mean():>7.1f} "
          f"{data['age'][mask].mean():>6.1f} "
          f"{data['visit_frequency'][mask].mean():>7.1f}")

# --- PCA Visualization ---
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(f"\nPCA Explained Variance: {pca.explained_variance_ratio_.sum():.2%}")
print(f"Silhouette Score: {silhouette_score(X_scaled, labels):.4f}")

# --- Actionable Recommendations ---
print("\n--- Marketing Recommendations ---")
for seg in range(best_k):
    mask = labels == seg
    income = data['annual_income_k'][mask].mean()
    spend = data['spending_score'][mask].mean()
    name = segment_names[seg] if seg < len(segment_names) else f"Seg{seg}"
    
    if income > 70 and spend > 60:
        action = "VIP loyalty program, exclusive previews"
    elif income > 70 and spend < 40:
        action = "Targeted promotions to increase engagement"
    elif income < 40 and spend > 60:
        action = "Installment plans, EMI options"
    elif income < 40 and spend < 40:
        action = "Value deals, festive season discounts"
    else:
        action = "Balanced offers, membership benefits"
    
    print(f"  {name}: {action}")

Project 2: Image Color Quantization

Objective: Reduce the number of colors in an image using K-Means — a classic application of clustering to compress images.

color_quantization.py Python
"""
Mini Project 2: Image Color Quantization using K-Means
Reduces 16M possible colors to k representative colors
"""
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min

def create_sample_image(height=100, width=100):
    """Create a synthetic image with distinct color regions."""
    img = np.zeros((height, width, 3), dtype=np.uint8)
    
    # Sky (light blue)
    img[:40, :, :] = [135, 206, 235]
    # Sun (yellow)
    for i in range(15, 30):
        for j in range(70, 85):
            if (i-22)**2 + (j-77)**2 < 49:
                img[i, j] = [255, 223, 0]
    # Grass (green)
    img[40:70, :, :] = [34, 139, 34]
    # Earth (brown)
    img[70:, :, :] = [139, 90, 43]
    # Add noise
    noise = np.random.randint(-15, 15, img.shape)
    img = np.clip(img.astype(int) + noise, 0, 255).astype(np.uint8)
    
    return img

def quantize_image(image, n_colors=8):
    """Quantize image colors using K-Means."""
    h, w, c = image.shape
    
    # Reshape to (n_pixels, 3)
    pixels = image.reshape(-1, 3).astype(np.float64)
    
    print(f"Original: {len(np.unique(pixels, axis=0))} unique colors")
    
    # Mini-Batch K-Means for speed
    kmeans = MiniBatchKMeans(
        n_clusters=n_colors, 
        batch_size=1000,
        random_state=42
    )
    labels = kmeans.fit_predict(pixels)
    
    # Replace each pixel with its centroid color
    quantized_pixels = kmeans.cluster_centers_[labels]
    quantized_image = quantized_pixels.reshape(h, w, c).astype(np.uint8)
    
    # Compression ratio
    original_size = h * w * 3 * 8  # bits
    # Quantized: palette (n_colors * 3 * 8 bits) + indices (h*w * log2(n_colors) bits)
    compressed_size = n_colors * 3 * 8 + h * w * np.ceil(np.log2(n_colors))
    ratio = original_size / compressed_size
    
    print(f"Quantized: {n_colors} unique colors")
    print(f"Compression ratio: {ratio:.1f}x")
    print(f"Palette (RGB):")
    for i, color in enumerate(kmeans.cluster_centers_.astype(int)):
        count = (labels == i).sum()
        pct = count / len(labels) * 100
        print(f"  Color {i}: RGB({color[0]:3d}, {color[1]:3d}, {color[2]:3d})"
              f" — {pct:.1f}% of pixels")
    
    return quantized_image, kmeans

# --- Run ---
img = create_sample_image()
print(f"Image shape: {img.shape}")
print(f"Total pixels: {img.shape[0] * img.shape[1]}\n")

for n_colors in [4, 8, 16]:
    print(f"\n{'='*40}")
    print(f"Quantization with {n_colors} colors:")
    print(f"{'='*40}")
    q_img, km = quantize_image(img, n_colors)
    
    # MSE between original and quantized
    mse = np.mean((img.astype(float) - q_img.astype(float)) ** 2)
    psnr = 10 * np.log10(255**2 / mse) if mse > 0 else float('inf')
    print(f"MSE: {mse:.2f}")
    print(f"PSNR: {psnr:.2f} dB")

Project 3: Document Topic Discovery

Objective: Cluster news articles by topic using TF-IDF + K-Means.

document_clustering.py Python
"""
Mini Project 3: Document Topic Discovery
Clusters text documents into topics using TF-IDF + KMeans
"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sample news headlines
documents = [
    # Sports
    "India wins cricket world cup after thrilling final",
    "Virat Kohli scores century in IPL match",
    "Premier League football season kicks off",
    "Olympic gold medal for Indian athlete in javelin",
    "Tennis grand slam results surprise fans",
    # Technology
    "New AI model breaks language understanding records",
    "Smartphone sales surge in Indian market",
    "Tech startup raises funding for cloud platform",
    "Machine learning transforms healthcare diagnosis",
    "Quantum computing milestone achieved by researchers",
    # Politics
    "Parliament session discusses new education policy",
    "State elections results show shifting voter patterns",
    "Government announces new infrastructure spending bill",
    "International summit focuses on climate change policy",
    "Political parties prepare for upcoming general elections",
    # Business
    "Stock market reaches all-time high on positive earnings",
    "RBI announces interest rate decision for quarter",
    "Startup unicorn IPO values company at billion dollars",
    "Manufacturing sector shows strong growth this quarter",
    "Oil prices impact global economic recovery forecast",
]

# TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=200, stop_words='english')
X_tfidf = vectorizer.fit_transform(documents)

print(f"TF-IDF matrix shape: {X_tfidf.shape}")
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")

# Find optimal k
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_tfidf)
    sil = silhouette_score(X_tfidf, labels)
    print(f"k={k}: Silhouette={sil:.4f}")

# Final clustering
km = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = km.fit_predict(X_tfidf)

# Display results
print("\nDiscovered Topics:")
feature_names = vectorizer.get_feature_names_out()
for c in range(4):
    print(f"\n--- Topic {c} ---")
    # Top terms per cluster (highest centroid values)
    center = km.cluster_centers_[c]
    top_indices = center.argsort()[-5:][::-1]
    top_terms = [feature_names[i] for i in top_indices]
    print(f"Top terms: {', '.join(top_terms)}")
    print(f"Documents:")
    for i, doc in enumerate(documents):
        if labels[i] == c:
            print(f"  • {doc[:60]}")

End-of-Chapter Exercises

1 Easy Define clustering. How does it differ from classification?
2 Easy List the five steps of the K-Means algorithm. What does "convergence" mean in this context?
3 Easy Explain what the Elbow Method is and how to interpret the elbow plot.
4 Easy What is the difference between agglomerative and divisive hierarchical clustering?
5 Easy Define core point, border point, and noise point in DBSCAN.
6 Medium Given points {(0,0), (1,0), (0,1), (10,10), (11,10), (10,11)} and k=2, run K-Means for 2 iterations starting with centroids at (0,0) and (10,10). Show each step.
7 Medium Compute the silhouette score for point (1,0) if Cluster 1 = {(0,0), (1,0), (0,1)} and Cluster 2 = {(10,10), (11,10), (10,11)}.
8 Medium Explain why K-Means can get stuck in local minima. Give a 2D example where random initialization leads to a suboptimal result.
9 Medium Compare single linkage and complete linkage. Sketch how each would cluster the dataset: {1, 2, 3, 10, 11, 12, 6}.
10 Medium For DBSCAN with ε=2 and MinPts=3, classify each point as core, border, or noise: {(0,0), (1,0), (0,1), (5,5), (1,1), (10,10)}.
11 Medium Explain the E-step and M-step of the EM algorithm for GMMs. What does each step compute and why?
12 Medium Derive the Ward's linkage merge cost formula. What does it minimize?
13 Medium How does K-Means++ initialization work? Why is it better than random initialization?
14 Medium Write Python code to generate 3 clusters and apply the Elbow Method to determine the optimal k. Plot the results.
15 Medium Compare Davies-Bouldin index and Calinski-Harabasz index. When would you use each? Which is higher = better?
16 Hard Prove that the K-Means objective function J(C,μ) is monotonically non-increasing across iterations.
17 Hard Implement DBSCAN from scratch using only NumPy. Test on make_moons dataset and compare with sklearn.
18 Hard Derive the M-step update for the covariance matrix Σⱼ in a GMM. Show every step.
19 Hard Implement a Gaussian Mixture Model from scratch with full covariance matrices. Verify log-likelihood increases at each iteration.
20 Hard Design a clustering pipeline for e-commerce customer segmentation. Include preprocessing, algorithm selection, evaluation, and business interpretation. Write complete code.
21 Hard K-Means is equivalent to EM for GMMs with equal, isotropic covariances and uniform mixing weights. Prove this equivalence.
22 Hard Explain how HDBSCAN improves upon DBSCAN. What is the role of mutual reachability distance?
23 Hard Apply spectral clustering to a dataset where K-Means fails (e.g., concentric circles). Explain why it works.
24 Hard Implement Mini-Batch K-Means from scratch. Compare convergence speed and final WCSS with standard K-Means on a dataset of 100,000 points.

Multiple Choice Questions

Q1. K-Means clustering minimizes which objective?

  • A) Between-cluster sum of squares
  • B) Within-cluster sum of squares (WCSS)
  • C) Silhouette score
  • D) Davies-Bouldin index
Answer: B. K-Means minimizes J = Σₖ Σ_{x∈Cₖ} ||x - μₖ||², the within-cluster sum of squared distances (inertia).

Q2. In K-Means++, the probability of selecting the next centroid is proportional to:

  • A) D(x) — distance to nearest centroid
  • B) D(x)² — squared distance to nearest centroid
  • C) 1/D(x) — inverse distance
  • D) Uniform random
Answer: B. K-Means++ selects each new centroid with probability proportional to D(x)², ensuring far-away points are more likely to be chosen, leading to O(log k)-competitive initialization.

Q3. Which clustering algorithm can automatically determine the number of clusters?

  • A) K-Means
  • B) Agglomerative Clustering (with n_clusters parameter)
  • C) DBSCAN
  • D) GMM (with n_components parameter)
Answer: C. DBSCAN discovers the number of clusters automatically based on density — you specify ε and MinPts, not k.

Q4. A silhouette score of -0.3 for a data point indicates:

  • A) The point is well-clustered
  • B) The point is on the cluster boundary
  • C) The point is likely assigned to the wrong cluster
  • D) Insufficient data for evaluation
Answer: C. Negative silhouette means a(i) > b(i) — the point is closer to a neighboring cluster than to its own. It's probably misassigned.

Q5. Which linkage type in hierarchical clustering is most susceptible to the "chaining" effect?

  • A) Single linkage
  • B) Complete linkage
  • C) Average linkage
  • D) Ward's linkage
Answer: A. Single linkage uses minimum inter-cluster distance. A chain of points bridging two clusters can cause them to merge prematurely ("chaining effect").

Q6. In the EM algorithm for GMMs, the E-step computes:

  • A) New cluster centroids
  • B) Posterior responsibilities γᵢⱼ = P(zᵢ = j | xᵢ)
  • C) Updated covariance matrices
  • D) The log-likelihood gradient
Answer: B. The E-step (Expectation) computes the posterior probability that each point belongs to each Gaussian component, using Bayes' theorem with current parameters.

Q7. DBSCAN with ε=0.5 and MinPts=5 classifies a point as a core point if:

  • A) It has exactly 5 neighbors within radius 0.5
  • B) It has at least 5 neighbors within radius 0.5 (including itself)
  • C) It is within 0.5 of at least 5 clusters
  • D) Its nearest neighbor is within 0.5
Answer: B. A core point has ≥ MinPts points (including itself) within its ε-neighborhood.

Q8. The key advantage of GMMs over K-Means is:

  • A) GMMs are faster
  • B) GMMs require fewer parameters
  • C) GMMs provide soft (probabilistic) cluster assignments
  • D) GMMs don't require specifying the number of clusters
Answer: C. GMMs assign each point a probability of belonging to each cluster (soft clustering), while K-Means assigns each point to exactly one cluster (hard clustering). GMMs also model elliptical clusters.

Q9. The time complexity of standard K-Means per iteration is:

  • A) O(n)
  • B) O(nkd)
  • C) O(n² k)
  • D) O(n² d)
Answer: B. Each iteration computes distances from n points to k centroids in d dimensions: O(nkd). This is why K-Means is efficient — linear in n.

Q10. For the Davies-Bouldin Index, which value indicates better clustering?

  • A) Lower values
  • B) Higher values
  • C) Values close to 0.5
  • D) Values close to 1.0
Answer: A. DB = (1/k)Σ max((σᵢ+σⱼ)/d(cᵢ,cⱼ)). Lower DB means clusters are compact (small σ) and well-separated (large d). The minimum possible is 0.

Q11. What happens if a K-Means cluster becomes empty during an iteration?

  • A) The algorithm always terminates
  • B) The empty cluster is removed, reducing k
  • C) Typically, the centroid is re-initialized (implementation-dependent)
  • D) All points are reassigned uniformly
Answer: C. Empty clusters are handled differently across implementations — sklearn reinitializes the centroid to the point furthest from any existing centroid, or the centroid remains where it was. The theoretical algorithm doesn't specify this edge case.

Q12. Which evaluation metric requires ground truth labels?

  • A) Silhouette Score
  • B) Davies-Bouldin Index
  • C) Adjusted Rand Index (ARI)
  • D) Calinski-Harabasz Index
Answer: C. ARI compares cluster assignments against ground truth labels. Silhouette, DB, and CH are intrinsic metrics that don't need labels — they only use the data geometry.

Interview Questions

Q1: How would you choose between K-Means, DBSCAN, and Hierarchical clustering?

Answer: The choice depends on the data characteristics and requirements:

  • K-Means: Use when you expect roughly spherical, equally-sized clusters and know k. Fast (O(nkd)), scalable. Best for: customer segmentation, image quantization.
  • DBSCAN: Use when clusters have arbitrary shapes, you don't know k, and you need to identify outliers. Best for: spatial data, anomaly detection. Struggles with varying densities.
  • Hierarchical: Use when you want to explore multiple granularities via a dendrogram, small-to-medium datasets. Best for: taxonomy building, gene expression analysis. O(n²) memory limits scalability.
  • GMM: Use when clusters overlap and you need soft assignments. Best for: probabilistic applications, when clusters are elliptical.
Q2: K-Means converges — but to what? Can it give wrong results?

Answer: K-Means converges to a local minimum of the WCSS objective, not necessarily the global minimum. The NP-hard nature of the optimization means different initializations can lead to different solutions. Mitigations: (1) K-Means++ initialization, (2) multiple restarts (n_init=10 in sklearn), (3) verify using evaluation metrics. A "converged but wrong" result often happens when k is wrong or when clusters are non-spherical.

Q3: How do you handle high-dimensional data in clustering?

Answer: The "curse of dimensionality" makes distances meaningless in high dimensions (all points become equally far apart). Solutions:

  • PCA/UMAP: Reduce dimensionality before clustering
  • Feature selection: Remove irrelevant features
  • Subspace clustering: Find clusters in different subspaces
  • Deep clustering: Learn a low-dimensional embedding using autoencoders, then cluster in latent space
  • Cosine distance: More meaningful than Euclidean in high dimensions (used in text clustering)
Q4: Explain the connection between K-Means and EM algorithm.

Answer: K-Means is a special case of EM for GMMs where: (1) all covariance matrices are σ²I (identical isotropic), (2) mixing weights are uniform (πⱼ = 1/k), and (3) we take σ → 0 so soft assignments become hard assignments. The E-step becomes hard assignment (argmin distance), and the M-step becomes computing cluster means. This is known as "hard EM."

Q5: How would you evaluate clustering when there are no ground truth labels?

Answer: Use intrinsic evaluation metrics: (1) Silhouette Score: measures cluster cohesion vs separation [-1, 1]. (2) Davies-Bouldin Index: measures ratio of within/between cluster distances (lower = better). (3) Calinski-Harabasz: ratio of between/within variance (higher = better). (4) Elbow method: look for diminishing returns in WCSS. (5) Business validation: do the clusters make domain sense? (6) Stability: bootstrap the data and check if clusters are consistent.

Q6: Your DBSCAN returns everything as one cluster or everything as noise. What went wrong?

Answer: The ε and MinPts parameters are mistuned:

  • Everything in one cluster: ε is too large — every point reaches every other point via density connectivity. Solution: decrease ε.
  • Everything is noise: ε is too small or MinPts is too large — no point has enough neighbors to be a core point. Solution: increase ε or decrease MinPts.
  • Tuning strategy: Plot the k-distance graph (sorted distances to the k-th nearest neighbor). The "elbow" in this graph suggests a good ε value. MinPts ≈ 2 × dimensionality is a common heuristic.
Q7: How would you cluster 10 billion data points?

Answer: Standard algorithms won't fit in memory. Scalable approaches:

  • Mini-Batch K-Means: Processes random subsets; sklearn implementation works on 10M+ points
  • BIRCH: Builds a compact summary (CF tree) in one pass, then clusters summaries
  • Distributed K-Means: MapReduce/Spark — each mapper assigns local points, reducer updates global centroids
  • FAISS (Facebook): GPU-accelerated nearest neighbor search for trillion-scale clustering
  • Approximate methods: Sample 1%, cluster, assign rest to nearest centroids
Q8: What preprocessing is essential before clustering?

Answer: Clustering is heavily influenced by preprocessing:

  • Feature scaling: StandardScaler or MinMaxScaler — distance-based algorithms are sensitive to scale
  • Outlier handling: Winsorize or remove extreme outliers (they distort centroids)
  • Dimensionality reduction: PCA for de-correlation; UMAP for nonlinear manifolds
  • Feature engineering: Log-transform skewed features, encode categoricals properly
  • Missing values: Impute before clustering (KNN imputation is good for clustering pipelines)
Q9: Explain soft clustering vs hard clustering with a real example.

Answer: Hard clustering (K-Means): each point belongs to exactly one cluster. Example: "This customer is in the 'Premium' segment." Soft clustering (GMM): each point has probabilities across all clusters. Example: "This customer is 60% Premium, 30% Value-seeker, 10% Budget." Soft clustering is better when categories overlap — a Spotify user who listens to both jazz and rock shouldn't be forced into one genre cluster. GMMs, fuzzy c-means, and topic models (LDA) provide soft assignments.

Q10: What's the time and space complexity of hierarchical clustering vs K-Means?

Answer:

  • K-Means: Time: O(n·k·d·I) where I = iterations (usually <100). Space: O(n·d + k·d). Very scalable.
  • Hierarchical (agglomerative): Time: O(n³) naive, O(n² log n) with priority queue. Space: O(n²) for the distance matrix. This limits it to ~10,000-50,000 points in practice.
  • DBSCAN: Time: O(n²) naive, O(n log n) with spatial index (KD-tree). Space: O(n).
  • GMM (EM): Time: O(n·k·d²·I) (d² for covariance). Space: O(k·d²).

Research Problems

R1 Research Deep Clustering with Contrastive Learning: Standard K-Means fails on high-dimensional image data. Design a deep clustering pipeline: (1) pretrain an encoder using SimCLR or MoCo (contrastive learning), (2) cluster the learned representations, (3) fine-tune end-to-end with a clustering loss. Compare against DEC (Deep Embedded Clustering) on MNIST and CIFAR-10. Measure NMI, ARI, and clustering accuracy.
R2 Research Clustering Indian Languages by Phonetic Similarity: India has 22 scheduled languages and 100+ spoken languages. Collect phonetic feature vectors (IPA representations, formant frequencies) for core vocabulary (Swadesh list) across 30+ Indian languages. Apply hierarchical clustering to build a linguistic family tree. Compare your clustering against the known Indo-Aryan / Dravidian / Sino-Tibetan classification. Does the data-driven tree match linguistic theory?
R3 Research Fair Clustering: Standard clustering can produce clusters that are biased on protected attributes (gender, caste, religion). Formulate the "fair K-Means" problem: clusters should have similar proportions of protected groups as the overall population. Implement a constrained optimization approach, test on census data, and measure the trade-off between clustering quality (WCSS) and fairness (demographic parity within clusters).
R4 Research Online Clustering for Streaming Data: Design an algorithm for clustering data that arrives one point at a time (streaming). The algorithm should: (1) never store all data, (2) update clusters incrementally, (3) detect when a new cluster emerges or an old one disappears. Apply to real-time Twitter trend detection. Compare against sliding-window K-Means.

Key Takeaways

References

Foundational Papers

  1. Lloyd, S. P. (1982). Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. [Original K-Means paper, written 1957]
  2. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD-96. [DBSCAN]
  3. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. JRSS Series B, 39(1), 1–38. [EM Algorithm]
  4. Arthur, D. & Vassilvitskii, S. (2007). K-Means++: The Advantages of Careful Seeding. SODA '07. [K-Means++ initialization]
  5. Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. JASA, 58(301), 236–244. [Ward's linkage]

Textbooks

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 9: Mixture Models and EM.
  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapter 14: Unsupervised Learning.
  3. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 21: Clustering.
  4. Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed. O'Reilly. Chapter 9.
  5. Jain, A. K. (2010). Data Clustering: 50 Years Beyond K-Means. Pattern Recognition Letters, 31(8), 651–666.

Indian Context References

  1. NITI Aayog (2018). Aspirational Districts — Methodology for Identification. Government of India.
  2. Rao, C. R. (1945). Information and Accuracy Attainable in the Estimation of Statistical Parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–91. [Rao distance for clustering]
  3. UIDAI (2016). Aadhaar Technology Architecture. [Biometric deduplication via blocking/clustering]
  4. TRAI Reports on Telecom Usage Patterns — subscriber segmentation methodology.

Software Documentation

  1. Scikit-Learn Clustering Documentation: sklearn.cluster
  2. SciPy Hierarchical Clustering: scipy.cluster.hierarchy
  3. HDBSCAN Library: McInnes, L. (2017). hdbscan.readthedocs.io