Part VI: Production ML — Chapter 17

Cross-Validation, Hyperparameter Tuning & Model Selection

Master the art and science of choosing the right model, tuning it to perfection, and validating that your results will generalize to unseen data.

📖 Reading Time: ~3.5 hours 📋 Prerequisites: Ch 13, Ch 14 📊 Difficulty: Intermediate–Advanced 💻 Code Examples: 20+

🎯

Learning Objectives

After completing this chapter, you will be able to:

Explain why train/test split alone is insufficient and articulate the need for cross-validation.
Implement K-Fold, Stratified K-Fold, Leave-One-Out, and Time-Series Cross-Validation from scratch and via scikit-learn.
Diagnose underfitting vs. overfitting using bias-variance decomposition and learning curves.
Execute Grid Search, Random Search, and Bayesian Optimization (Optuna) for hyperparameter tuning.
Avoid data leakage by building proper Pipeline objects with StandardScaler, feature selection, and model training.
Compare models rigorously using paired t-tests, Wilcoxon signed-rank tests, and confidence intervals.
Build end-to-end AutoML pipelines that automate model selection and hyperparameter optimization.
Tune regularization parameters (L1/L2 λ) as a model selection mechanism.
Analyze hyperparameter importance to prioritize which parameters to tune first.
Apply these techniques to real-world scenarios across Indian and global industry contexts.

📝 Exam Tip

Model selection and cross-validation questions appear in nearly every ML interview and exam. Be prepared to explain why random shuffling breaks time-series validation and how data leakage invalidates your entire experiment.

📘

Introduction

Building a machine learning model is only half the story. The other half — arguably the harder half — is figuring out which model to build, what settings to use, and how confident you can be that your model will work on data it has never seen before.

Consider a data scientist at Flipkart trying to predict customer churn. She has access to Random Forests, Gradient Boosting, SVMs, and Neural Networks. Each model has dozens of adjustable knobs (hyperparameters). A Random Forest alone has parameters like n_estimators, max_depth, min_samples_split, and max_features. With just 4 parameters and 5 choices each, that's 5⁴ = 625 possible combinations, and that is for a single model type.

This chapter addresses three interlinked questions:

Validation: How do we reliably estimate how well a model will perform on unseen data? (Cross-validation)
Optimization: How do we efficiently search the vast space of hyperparameters? (Hyperparameter tuning)
Selection: How do we compare multiple models and choose the best one with statistical confidence? (Model selection)

These are not academic exercises — they are the difference between a model that works in a notebook and one that works in production. A model that achieves 95% accuracy in training but 70% in production is worse than useless; it's dangerous, because it creates false confidence.

🎓 Professor's Insight

I tell my students: "Training a model is like cooking in your kitchen. Cross-validation is like having strangers taste your food. Grid search is like adjusting your recipe systematically. Model selection is choosing between your best dishes for the restaurant menu." The goal isn't to impress yourself — it's to satisfy the real-world diner.

This chapter forms the bridge between learning algorithms (Chapters 4–16) and deploying them in production (Chapter 18+). Without these techniques, every model you've built so far is just an optimistic guess.

📜

Historical Background

The quest for robust model evaluation has evolved over more than a century:

The Holdout Era (Early 1900s–1960s)

The simplest validation idea — splitting data into train and test sets — dates back to early statistical practice. However, results were highly variable depending on which data ended up in which partition. A "lucky" split could make a bad model look great.

Cross-Validation Emerges (1960s–1970s)

M. Stone (1974) formalized cross-validation as a model assessment technique and later (1977) showed its asymptotic equivalence to Akaike's information criterion. Allen (1974) independently proposed the PRESS (Predicted Residual Sum of Squares) statistic, essentially leave-one-out CV. Geisser (1975) established the theoretical foundations of predictive inference using cross-validation.

K-Fold Becomes Standard (1980s–1990s)

Researchers found that K=5 or K=10 offered good bias-variance tradeoffs. Kohavi (1995) published a landmark empirical study comparing bootstrap, holdout, and K-fold methods, establishing K=10 stratified cross-validation as the recommended default. Dietterich (1998) proposed the 5×2 cross-validation paired t-test for model comparison.

Hyperparameter Optimization Revolution (2000s–2010s)

Bergstra & Bengio (2012) demonstrated that random search is more efficient than grid search — a result that surprised many practitioners. Snoek, Larochelle & Adams (2012) brought Bayesian optimization to ML with their NIPS paper on Gaussian-process-based hyperparameter tuning. Google Vizier (2017) operationalized these ideas as a cloud service.

AutoML & Modern Methods (2015–Present)

Auto-sklearn (Feurer et al., 2015) combined Bayesian optimization with meta-learning. Optuna (Akiba et al., 2019) introduced an efficient define-by-run API for hyperparameter search. Today, frameworks like Google AutoML, H2O AutoML, and TCS's iON platform automate the entire pipeline from data to deployed model.

🇮🇳 India Spotlight

TCS Research developed the iON AutoML platform in the late 2010s, enabling automated model selection for clients across banking, insurance, and retail sectors. Their approach combines meta-learning (learning which models work well on which types of data) with efficient hyperparameter optimization.

💡

Conceptual Explanation

4.1 Why Not Just Use Training Accuracy?

Imagine a student who memorizes every answer in the textbook but can't solve a new problem. Training accuracy measures memorization; we need a metric that measures generalization — the ability to perform well on data the model has never seen.

4.2 The Train/Test Split

The simplest approach: split your dataset into two parts — typically 70-80% for training and 20-30% for testing. The model learns from the training set and is evaluated on the test set.

⚠️ Critical Warning: Data Leakage

The test set must never be used during any part of training — not for feature selection, not for scaling, not for imputation. If test data influences any training decision, your evaluation is invalid. This is called data leakage, and it's the #1 mistake in applied ML.

Stratified Splitting

When classes are imbalanced (e.g., 95% non-fraud, 5% fraud), a random split might give you a test set with 0% fraud cases! Stratified splitting preserves the class proportions in both sets.
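A minimal sketch of a stratified split, assuming scikit-learn's `train_test_split` with its `stratify` argument. The 95/5 labels and the placeholder feature column are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 95% class 0, 5% class 1 (illustrative only)
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y asks the splitter to keep the class ratio in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_ratio = y_train.mean()  # fraction of positives in the training set
test_ratio = y_test.mean()    # fraction of positives in the test set
print(f"Positive rate: train={train_ratio:.3f}, test={test_ratio:.3f}")
```

Dropping `stratify=y` makes the positive rate in the test set depend on the random seed, which is the failure mode described above.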

4.3 K-Fold Cross-Validation

A single train/test split gives us one estimate of performance, but that estimate depends heavily on which samples happened to land in each partition. K-Fold CV solves this by using every data point for both training and testing:

Divide the dataset into K equal-sized folds.
For each fold i (i = 1 to K): use fold i as the validation set; use all other folds for training.
Record the score from each iteration.
Report the mean and standard deviation of all K scores.

With K=5, each data point appears in exactly one validation set and four training sets. This gives us 5 estimates instead of 1, along with a measure of variability.
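The four steps above map onto a few lines of scikit-learn. This is a sketch only: the Iris dataset and the decision tree are arbitrary choices for illustration, not part of the procedure itself.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: define K=5 folds (shuffled, because Iris is sorted by class)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Steps 2-3: fit on K-1 folds, score on the held-out fold, repeat K times
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="accuracy"
)

# Step 4: report mean and standard deviation across folds
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```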

4.4 Stratified K-Fold

Same as K-Fold, but each fold preserves the original class distribution. This is critical for imbalanced datasets and is the default recommendation for classification tasks.

4.5 Leave-One-Out Cross-Validation (LOOCV)

The extreme case: K = N (one fold per data point). Each iteration trains on N-1 samples and tests on 1. LOOCV produces the least biased estimate, but it has high variance and is computationally expensive (N full training runs).

4.6 Time-Series Split

For time-series data, standard K-Fold is invalid because some training folds come after the validation fold in time (and random shuffling mixes past and future even more thoroughly), allowing the model to "peek into the future." Time-series CV uses an expanding window: train on periods 1–t, test on period t+1.

🎓 Professor's Insight

Think of time-series split as simulating what would actually happen in deployment: at each point in time, you only have access to past data. Any validation strategy that violates this temporal ordering is fundamentally flawed.
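As a quick illustration of the expanding window, the sketch below uses scikit-learn's `TimeSeriesSplit` on 12 time-ordered placeholder observations. The array contents are arbitrary; only the index ordering matters.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations (placeholder)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

for k, (train_idx, test_idx) in enumerate(splits, start=1):
    # Every training index precedes every test index
    print(f"Split {k}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each successive split grows the training window and moves the test window forward, so no test observation ever precedes its training data.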

4.7 The Bias-Variance Tradeoff

The expected test error of any model can be decomposed into three components:

E[Error] = Bias² + Variance + Irreducible Noise

High Bias (Underfitting): Model is too simple — it can't capture the true pattern. Training error is high. Example: fitting a line to quadratic data.
High Variance (Overfitting): Model is too complex — it memorizes noise. Training error is low but test error is high. Example: a degree-20 polynomial on 10 data points.

4.8 Learning Curves

A learning curve plots model performance (y-axis) against training set size (x-axis) for both training and validation sets:

Underfitting: Both curves plateau at a high error. More data won't help — you need a more complex model.
Overfitting: Large gap between training (low error) and validation (high error). More data may help close the gap.
Good fit: Both curves converge to low error as training size increases.
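The numbers behind such a plot can be produced with scikit-learn's `learning_curve`. The sketch below assumes an unpruned decision tree on the breast-cancer dataset, chosen only because it tends to show a visible train/validation gap; any estimator could be substituted.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Evaluate at 5 training-set sizes, each with 5-fold CV
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="accuracy",
)

# Average over folds and convert accuracy to error for each size
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train error={1 - tr:.3f}  validation error={1 - va:.3f}")
```

A persistent gap between the two error columns points toward overfitting, while two high and similar columns point toward underfitting.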

4.9 Hyperparameter Tuning Strategies

Grid Search

Exhaustively tries every combination of a pre-defined set of hyperparameter values. Guarantees finding the best in the grid but is exponentially expensive: with p parameters and v values each, cost = v^p.

Random Search

Randomly samples hyperparameter combinations from defined distributions. Bergstra & Bengio (2012) showed that for many problems, random search finds better hyperparameters with far fewer iterations, because only a few hyperparameters usually matter and random sampling tries more distinct values of each one.
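A minimal random-search sketch, assuming scikit-learn's `RandomizedSearchCV` with `scipy.stats.randint` distributions. The search ranges and the budget of 20 trials are illustrative choices, not recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Distributions to sample from (randint's upper bound is exclusive)
param_distributions = {
    "max_depth": randint(2, 11),          # integers 2..10
    "min_samples_split": randint(2, 21),  # integers 2..20
    "criterion": ["gini", "entropy"],     # sampled uniformly from the list
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=20,        # trial budget, independent of the size of the space
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

The budget `n_iter` is set directly, so adding another hyperparameter does not multiply the cost the way it does for a grid.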

Bayesian Optimization (Optuna)

Uses a probabilistic model (surrogate function) to predict which hyperparameter combinations are likely to be best, then intelligently chooses the next point to evaluate. Much more sample-efficient than random search for expensive evaluations.

4.10 Model Comparison with Statistical Tests

Saying "Model A scored 0.85 and Model B scored 0.83" is not enough. We need statistical tests to determine if the difference is significant or just due to random variation in the folds:

Paired t-test: Assumes scores are normally distributed. Tests if the mean difference across folds is significantly different from zero.
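Wilcoxon signed-rank test: Non-parametric alternative that doesn't assume normality. Preferable when the fold-score differences are clearly non-normal; with very few folds it has limited power (for K=5 the exact two-sided p-value cannot fall below 0.0625).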
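Both tests are available in SciPy. The sketch below applies `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` to the per-fold scores used in Example 3 later in this chapter. Treat it as an illustration of the API rather than a complete comparison protocol.

```python
import numpy as np
from scipy import stats

# Per-fold scores from Example 3 (Worked Numerical Examples)
scores_a = np.array([0.88, 0.85, 0.90, 0.82, 0.87])
scores_b = np.array([0.84, 0.82, 0.86, 0.81, 0.85])

# Paired t-test on the fold-wise differences
t_stat, t_p = stats.ttest_rel(scores_a, scores_b)

# Wilcoxon signed-rank test on the same pairs
w_stat, w_p = stats.wilcoxon(scores_a, scores_b)

print(f"Paired t-test: t={t_stat:.2f}, p={t_p:.4f}")
print(f"Wilcoxon:      W={w_stat:.1f}, p={w_p:.4f}")
```

The t-statistic should agree with the hand calculation in Example 3. SciPy may warn that the sample is small for the Wilcoxon test, which is expected with five folds.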

4.11 Pipeline: Preventing Data Leakage

A Pipeline chains preprocessing steps and the final model into a single object. When used inside cross_val_score, the scaler is fit only on the training folds, never on the validation fold. Without a pipeline, you'd accidentally scale using statistics from the entire dataset, including the validation fold, which is data leakage.
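A short sketch of the leak-free pattern, assuming scikit-learn's `Pipeline`, `StandardScaler`, and `cross_val_score`. The dataset and the logistic-regression model are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler and classifier travel together, so the scaler is re-fit
# on the training portion of every fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

The leaky alternative would call `StandardScaler().fit_transform(X)` on the full dataset before cross-validation, letting validation-fold statistics influence training.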

🚀 Career Path

ML Engineer / MLOps Specialist — Companies like Flipkart, PhonePe, and Razorpay actively hire for roles that require expertise in model selection pipelines, A/B testing, and production validation. Salary range in India: ₹18–45 LPA. In global roles (Google, Netflix): $120K–200K+.

📐

Mathematical Foundation

5.1 Cross-Validation Error Estimator

Given dataset D with N samples, partition into K folds {F₁, F₂, ..., F_K}. The K-fold CV estimate of generalization error is:

$$\mathrm{CV}(K) = \frac{1}{K} \sum_{i=1}^{K} L\big(y_i, \hat{f}_{-i}(x_i)\big)$$

Where $\hat{f}_{-i}$ is the model trained on all data except fold i, $(x_i, y_i)$ denotes the samples in fold i, and $L$ is the loss function averaged over fold i.

5.2 Bias-Variance Decomposition

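For a regression model $\hat{f}(x)$ estimating the true function $f(x)$ with noise $\varepsilon \sim N(0, \sigma^2)$:

$$E\big[(y - \hat{f}(x))^2\big] = \big(f(x) - E[\hat{f}(x)]\big)^2 + E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big] + \sigma^2 = \mathrm{Bias}^2[\hat{f}(x)] + \mathrm{Var}[\hat{f}(x)] + \sigma^2$$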

5.3 K-Fold CV: Bias and Variance Properties

Let n_train = N × (K-1)/K be the training set size per fold:

K = N (LOOCV): n_train = N-1 → Low bias (almost entire dataset used for training), but high variance (training sets differ by only 1 point, so models are highly correlated).
K = 2: n_train = N/2 → High bias (only half the data used), but low variance.
K = 5 or 10: Good compromise between bias and variance. Empirical studies (Kohavi, 1995) confirm this.

5.4 Variance of CV Estimate

$$\mathrm{Var}(\mathrm{CV}) = \frac{1}{K^2}\left[K\sigma^2 + K(K-1)\rho\sigma^2\right] = \frac{\sigma^2}{K}\left[1 + (K-1)\rho\right]$$

Where ρ is the correlation between fold error estimates and σ² is the variance of individual fold errors. Note: as K → N (LOOCV), ρ → 1 because training sets are nearly identical, so variance does not decrease with more folds.
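To see how the correlation term controls the benefit of averaging, the formula can be evaluated directly. The values of σ², K, and ρ below are arbitrary illustrations.

```python
def cv_variance(sigma2: float, K: int, rho: float) -> float:
    """Var(CV) = (sigma^2 / K) * (1 + (K - 1) * rho)."""
    return (sigma2 / K) * (1 + (K - 1) * rho)

# Illustrative values: unit fold variance, K = 10 folds
for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho:.1f}: Var(CV) = {cv_variance(1.0, 10, rho):.3f}")
```

With ρ = 0 the variance falls to σ²/K, while with ρ = 1 it stays at σ² no matter how many folds are averaged.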

5.5 Paired t-Test for Model Comparison

Given K fold differences $d_i = \mathrm{score}_i(A) - \mathrm{score}_i(B)$:

$$t = \frac{\bar{d}}{s_d / \sqrt{K}}, \qquad \bar{d} = \frac{1}{K}\sum_{i=1}^{K} d_i, \qquad s_d = \sqrt{\frac{1}{K-1}\sum_{i=1}^{K} (d_i - \bar{d})^2}$$

Under H₀: E[dᵢ] = 0, t follows a t-distribution with K-1 degrees of freedom.

5.6 Regularization Parameter Selection

For Ridge Regression (L2), the objective with regularization parameter λ:

$$J(w; \lambda) = \frac{1}{2N}\lVert Xw - y\rVert^2 + \lambda\lVert w\rVert^2$$

The optimal λ is found by minimizing CV error across a grid of λ values (typically log-spaced from 10⁻⁴ to 10⁴).
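A sketch of this selection procedure using scikit-learn's `RidgeCV` over a log-spaced grid. Two assumptions to note: the regression data is synthetic (`make_regression`), and scikit-learn's `alpha` multiplies the penalty in an objective without the 1/(2N) factor, so `alpha` plays the role of λ only up to a rescaling.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic regression problem (illustrative only)
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Log-spaced grid from 1e-4 to 1e4, as suggested in the text
alphas = np.logspace(-4, 4, 50)

# 5-fold CV picks the alpha with the best average validation score
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print(f"Selected alpha: {model.alpha_:.4g}")
```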

5.7 Expected Improvement (Bayesian Optimization)

In Bayesian optimization, the acquisition function Expected Improvement is:

$$\mathrm{EI}(x) = E\big[\max(f(x) - f(x^+),\, 0)\big] = \big(\mu(x) - f(x^+)\big)\,\Phi(Z) + \sigma(x)\,\varphi(Z), \qquad Z = \frac{\mu(x) - f(x^+)}{\sigma(x)}$$

Here, μ(x) and σ(x) are the mean and std from the surrogate model (Gaussian Process), f(x⁺) is the best observed value, and Φ/φ are the CDF/PDF of the standard normal.
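The closed form is simple enough to evaluate with the standard library. The sketch below assumes a maximization problem, as in the formula above. The function name `expected_improvement` and the σ = 0 fallback are illustrative choices rather than any particular library's API.

```python
import math

def expected_improvement(mu: float, sigma: float, f_best: float) -> float:
    """EI for maximization, given surrogate mean/std at a candidate point."""
    if sigma <= 0.0:
        # No uncertainty: improvement is deterministic
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (mu - f_best) * cdf + sigma * pdf

# Same predicted mean as the incumbent, but uncertain: EI is still positive
print(expected_improvement(mu=0.80, sigma=0.10, f_best=0.80))
# Lower predicted mean with large uncertainty can still be worth exploring
print(expected_improvement(mu=0.75, sigma=0.20, f_best=0.80))
```

The second term, σ(x)·φ(Z), is what rewards exploration: a point whose mean is no better than the incumbent still has positive EI when its uncertainty is large.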

📝 Exam Tip

The bias-variance decomposition formula appears in almost every ML exam. Remember: bias measures systematic error (the model is consistently wrong), while variance measures sensitivity to training data (the model changes drastically with different samples).

🔬

Formula Derivations

6.1 Deriving the Bias-Variance Decomposition

Start with the expected MSE for a prediction at point x:

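$$E\big[(y - \hat{f}(x))^2\big]$$

Since y = f(x) + ε where E[ε] = 0, Var(ε) = σ²:

$$E\big[(y - \hat{f}(x))^2\big] = E\big[(f(x) + \varepsilon - \hat{f}(x))^2\big]$$

Step 1: Add and subtract $E[\hat{f}(x)]$:

$$= E\Big[\big((f(x) - E[\hat{f}(x)]) + (E[\hat{f}(x)] - \hat{f}(x)) + \varepsilon\big)^2\Big]$$

Step 2: Let $A = f(x) - E[\hat{f}(x)]$ (a constant), $B = E[\hat{f}(x)] - \hat{f}(x)$ (random), $C = \varepsilon$ (random, independent of B). Expand $(A + B + C)^2$ and take expectations:

$$= A^2 + E[B^2] + E[C^2] + 2A\,E[B] + 2A\,E[C] + 2E[BC]$$

Step 3: Simplify using $E[B] = 0$, $E[C] = 0$, $E[BC] = E[B]\,E[C] = 0$ (ε is independent of $\hat{f}$), and $E[C^2] = \sigma^2$:

$$= \big(f(x) - E[\hat{f}(x)]\big)^2 + E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big] + \sigma^2 = \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{Noise}$$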
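The decomposition can be checked numerically. The sketch below is a Monte Carlo experiment under assumed settings: the true function f(x) = x², a straight-line model, σ = 0.3, and the query point x₀ = 0.8 are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3      # noise standard deviation (assumed)
x0 = 0.8         # point at which the prediction is evaluated (assumed)

def f(x):
    return x ** 2  # "true" function for this experiment (assumed)

n_trials, n_train = 5000, 20
preds = np.empty(n_trials)
for t in range(n_trials):
    # Draw a fresh training set and fit a straight line (a high-bias model)
    x = rng.uniform(-1, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    slope, intercept = np.polyfit(x, y, 1)
    preds[t] = slope * x0 + intercept

# Left-hand side: expected squared error against fresh noisy targets at x0
y0 = f(x0) + rng.normal(0, sigma, n_trials)
mse = np.mean((y0 - preds) ** 2)

# Right-hand side: the three components
bias_sq = (f(x0) - preds.mean()) ** 2
variance = preds.var()
total = bias_sq + variance + sigma ** 2

print(f"MSE={mse:.4f}  Bias²={bias_sq:.4f}  Var={variance:.4f}  "
      f"Noise={sigma**2:.4f}  Sum={total:.4f}")
```

The two sides agree only up to Monte Carlo error, which shrinks as `n_trials` grows.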

6.2 Deriving the CV Variance Formula

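Let $e_i$ be the error on fold i. The CV estimate is $\mathrm{CV} = \frac{1}{K}\sum_i e_i$:

$$\mathrm{Var}(\mathrm{CV}) = \mathrm{Var}\Big(\frac{1}{K}\sum_i e_i\Big) = \frac{1}{K^2}\,\mathrm{Var}\Big(\sum_i e_i\Big) = \frac{1}{K^2}\Big[\sum_i \mathrm{Var}(e_i) + \sum_{i \neq j} \mathrm{Cov}(e_i, e_j)\Big]$$

Assume all folds have equal variance σ² and pairwise correlation ρ, so that $\mathrm{Cov}(e_i, e_j) = \rho\sigma^2$ for $i \neq j$:

$$\mathrm{Var}(\mathrm{CV}) = \frac{1}{K^2}\left[K\sigma^2 + K(K-1)\rho\sigma^2\right] = \frac{\sigma^2}{K}\left[1 + (K-1)\rho\right]$$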

Key insight: If ρ = 0 (independent folds), variance decreases as 1/K. But for LOOCV, ρ ≈ 1, so Var(CV) ≈ σ² regardless of K — variance doesn't shrink!

6.3 Deriving the Corrected t-Test (Nadeau & Bengio, 2003)

The standard paired t-test underestimates variance because CV folds share training data. The corrected variance:

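$$\sigma^2_{\text{corrected}} = \left(\frac{1}{K} + \frac{n_{\text{test}}}{n_{\text{train}}}\right) s_d^2$$

Where $n_{\text{test}}$ and $n_{\text{train}}$ are the test and train sizes per fold. The corrected t-statistic:

$$t_{\text{corrected}} = \frac{\bar{d}}{\sqrt{\left(\frac{1}{K} + \frac{n_{\text{test}}}{n_{\text{train}}}\right) s_d^2}}$$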
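A small sketch of the corrected statistic, applied to the fold differences from Example 3 later in this chapter. The fold sizes (80 training and 20 test samples) are an assumption made here for illustration, since Example 3 does not state the dataset size.

```python
import math

def corrected_t(diffs, n_train, n_test):
    """Corrected paired t-statistic (Nadeau & Bengio variance correction)."""
    k = len(diffs)
    d_bar = sum(diffs) / k
    s2_d = sum((d - d_bar) ** 2 for d in diffs) / (k - 1)  # sample variance
    return d_bar / math.sqrt((1 / k + n_test / n_train) * s2_d)

def standard_t(diffs):
    """Ordinary paired t-statistic, for comparison."""
    k = len(diffs)
    d_bar = sum(diffs) / k
    s2_d = sum((d - d_bar) ** 2 for d in diffs) / (k - 1)
    return d_bar / math.sqrt(s2_d / k)

diffs = [0.04, 0.03, 0.04, 0.01, 0.02]                # d_i from Example 3
t_std = standard_t(diffs)
t_corr = corrected_t(diffs, n_train=80, n_test=20)    # assumed fold sizes
print(f"standard t = {t_std:.2f}, corrected t = {t_corr:.2f}")
```

Under these assumed sizes the corrected statistic is noticeably smaller than the standard one, reflecting the extra variance from overlapping training sets.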

6.4 Deriving Optimal Lambda for Ridge Regression via CV

The Ridge solution for a given λ:

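$$\hat{w}(\lambda) = (X^\top X + \lambda I)^{-1} X^\top y$$

(This is the minimizer of $\lVert Xw - y\rVert^2 + \lambda\lVert w\rVert^2$; under the 1/(2N) scaling of Section 5.6 the same form holds with λ replaced by 2Nλ.)

For LOOCV, there's an efficient closed form (bypassing N separate model fits):

$$\mathrm{CV}(\lambda) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_i - x_i^\top \hat{w}(\lambda)}{1 - H_{ii}(\lambda)}\right]^2$$

Where $H(\lambda) = X(X^\top X + \lambda I)^{-1}X^\top$ is the hat matrix. This allows computing LOOCV for all λ values with a single matrix decomposition: an O(Np²) operation instead of O(N²p²).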

🎓 Professor's Insight

The LOOCV shortcut for Ridge Regression is one of the most elegant results in statistical learning. It shows that sometimes, what seems computationally intractable has a beautiful closed-form solution hiding in the linear algebra.
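The shortcut can be verified against brute-force leave-one-out on a small problem. The sketch below assumes ridge regression without an intercept, a fixed λ = 1.0, and synthetic Gaussian data; all three are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 40, 5, 1.0                      # illustrative sizes and penalty
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=N)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Shortcut: one fit on all data, then rescale residuals by 1 - H_ii
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
residuals = y - H @ y
loocv_shortcut = np.mean((residuals / (1 - np.diag(H))) ** 2)

# Brute force: refit N times, each time leaving one sample out
errors = []
for i in range(N):
    mask = np.arange(N) != i
    w = ridge_fit(X[mask], y[mask], lam)
    errors.append((y[i] - X[i] @ w) ** 2)
loocv_brute = np.mean(errors)

print(f"shortcut={loocv_shortcut:.6f}  brute force={loocv_brute:.6f}")
```

The two values should match to floating-point precision, since the identity is exact for a fixed λ.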

✏️

Worked Numerical Examples

Example 1: 5-Fold Cross-Validation by Hand

Dataset: 10 samples with labels: [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]

Task: Compute 5-fold CV accuracy for a hypothetical classifier.

Fold 1: Train on {3,4,...,10}, Test on {1,2}
Fold 2: Train on {1,2,5,6,...,10}, Test on {3,4}
Fold 3: Train on {1,...,4,7,...,10}, Test on {5,6}
Fold 4: Train on {1,...,6,9,10}, Test on {7,8}
Fold 5: Train on {1,...,8}, Test on {9,10}

Suppose the classifier produces these fold accuracies:

Fold	Train Size	Test Size	Predictions	Accuracy
1	8	2	[0✓, 1✓]	2/2 = 1.00
2	8	2	[1✓, 1✗]	1/2 = 0.50
3	8	2	[1✓, 0✓]	2/2 = 1.00
4	8	2	[0✓, 0✗]	1/2 = 0.50
5	8	2	[1✓, 0✓]	2/2 = 1.00

$$\text{CV Accuracy} = \frac{1.00 + 0.50 + 1.00 + 0.50 + 1.00}{5} = \frac{4.00}{5} = 0.80$$

$$\text{Std Dev} = \sqrt{\frac{0.20^2 + 0.30^2 + 0.20^2 + 0.30^2 + 0.20^2}{4}} = \sqrt{\frac{0.04 + 0.09 + 0.04 + 0.09 + 0.04}{4}} = \sqrt{\frac{0.30}{4}} = \sqrt{0.075} \approx 0.274$$

Report: Accuracy = 0.80 ± 0.27
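The same arithmetic can be reproduced with Python's `statistics` module. The sketch also shows the population standard deviation (dividing by K instead of K-1), which is what NumPy's `std()` returns by default.

```python
import statistics

fold_acc = [1.00, 0.50, 1.00, 0.50, 1.00]   # fold accuracies from the table

mean_acc = statistics.mean(fold_acc)
sample_std = statistics.stdev(fold_acc)     # divides by K-1, as in the hand calculation
pop_std = statistics.pstdev(fold_acc)       # divides by K, like numpy's default std()

print(f"mean={mean_acc:.2f}  sample std={sample_std:.3f}  population std={pop_std:.3f}")
```

When comparing hand calculations with library output, it helps to check which of the two conventions is in use.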

Example 2: Grid Search Parameter Space Calculation

Model: Random Forest with the following hyperparameter grid:

Parameter	Values	Count
n_estimators	[50, 100, 200, 500]	4
max_depth	[5, 10, 20, None]	4
min_samples_split	[2, 5, 10]	3
max_features	['sqrt', 'log2']	2

Total combinations: $4 \times 4 \times 3 \times 2 = 96$
With 5-fold CV, total model fits: $96 \times 5 = 480$
If each fit takes 30 seconds, total time: $480 \times 30 = 14{,}400$ seconds = 4 hours
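The count can be confirmed programmatically before launching an expensive search. This sketch assumes scikit-learn's `ParameterGrid`, which enumerates the same Cartesian product that `GridSearchCV` would evaluate.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 200, 500],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2"],
}

n_combinations = len(ParameterGrid(param_grid))
n_fits = n_combinations * 5        # 5-fold CV
hours = n_fits * 30 / 3600         # 30 seconds per fit, as in the example

print(n_combinations, n_fits, hours)
```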

Example 3: Paired t-Test for Model Comparison

Given: 5-fold CV scores for Model A and Model B:

Fold	Model A	Model B	Difference (dᵢ)
1	0.88	0.84	+0.04
2	0.85	0.82	+0.03
3	0.90	0.86	+0.04
4	0.82	0.81	+0.01
5	0.87	0.85	+0.02

$$\bar{d} = \frac{0.04 + 0.03 + 0.04 + 0.01 + 0.02}{5} = \frac{0.14}{5} = 0.028$$

$$s_d^2 = \frac{0.012^2 + 0.002^2 + 0.012^2 + 0.018^2 + 0.008^2}{4} = \frac{0.000144 + 0.000004 + 0.000144 + 0.000324 + 0.000064}{4} = \frac{0.00068}{4} = 0.00017$$

$$s_d = \sqrt{0.00017} \approx 0.01304$$

$$t = \frac{0.028}{0.01304 / \sqrt{5}} = \frac{0.028}{0.00583} \approx 4.80$$

$t_{\text{critical}}$ (df = 4, α = 0.05, two-tailed) ≈ 2.776

Since |t| = 4.80 > 2.776, reject H₀: Model A is significantly better than Model B (p < 0.05).

Example 4: Random Search Efficiency

Scenario: 5 hyperparameters, 10 values each. Grid search: 10⁵ = 100,000 combinations. Consider instead random search with 60 iterations:

$$P(\text{at least one of 60 trials lands in the top 5\% region}) = 1 - (1 - 0.05)^{60} = 1 - 0.95^{60} \approx 1 - 0.046 = 0.954$$

So 60 random trials have a 95.4% probability of including at least one point in the top 5% of any single parameter's range, and every trial tests a new value of every parameter. Grid search with 100,000 fits still tests only 10 distinct values per parameter.
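The calculation generalizes to any target region and confidence level. The helper names `p_hit` and `trials_needed` below are hypothetical, introduced only for this illustration.

```python
import math

def p_hit(n_trials: int, top_fraction: float = 0.05) -> float:
    """Probability that at least one of n_trials uniform samples
    falls in a region covering top_fraction of the range."""
    return 1 - (1 - top_fraction) ** n_trials

def trials_needed(target: float = 0.95, top_fraction: float = 0.05) -> int:
    """Smallest number of trials with p_hit >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - top_fraction))

print(f"P(hit) with 60 trials: {p_hit(60):.3f}")
print(f"Trials needed for 95%: {trials_needed()}")
```

The calculation assumes trials are sampled uniformly and independently over the range in question.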

Example 5: Time-Series CV Split

Dataset: Monthly sales data for 12 months (Jan–Dec)

Time-Series Cross-Validation (Expanding Window)

Split 1: Train=[Jan,Feb,Mar] Test=[Apr]
Split 2: Train=[Jan,Feb,Mar,Apr] Test=[May]
Split 3: Train=[Jan,...,May] Test=[Jun]
Split 4: Train=[Jan,...,Jun] Test=[Jul]
Split 5: Train=[Jan,...,Jul] Test=[Aug]

Note: Train ALWAYS comes before Test, so no "future" data leaks in!

💻 Code Challenge

Implement the paired t-test from Example 3 in Python without using scipy.stats.ttest_rel. Compute the t-statistic and p-value from scratch using only numpy. Compare your result with scipy.

📊

Visual Diagrams

8.1 K-Fold Cross-Validation (K=5)

5-Fold Cross-Validation

Dataset: ████████████████████████████████████████████████████

Fold 1: [TEST ][TRAIN][TRAIN][TRAIN][TRAIN] → Score₁
Fold 2: [TRAIN][TEST ][TRAIN][TRAIN][TRAIN] → Score₂
Fold 3: [TRAIN][TRAIN][TEST ][TRAIN][TRAIN] → Score₃
Fold 4: [TRAIN][TRAIN][TRAIN][TEST ][TRAIN] → Score₄
Fold 5: [TRAIN][TRAIN][TRAIN][TRAIN][TEST ] → Score₅

Final Score = Mean(Score₁...Score₅) ± Std(Score₁...Score₅)

8.2 Bias-Variance Tradeoff

Bias-Variance Tradeoff vs Model Complexity

[Figure: Error (y-axis) vs. Model Complexity (x-axis), showing three curves: Bias², Variance, and Total Error. The minimum of Total Error is marked as the "Sweet Spot" (optimal complexity). Left of the optimum: UNDERFITTING (High Bias). Right of the optimum: OVERFITTING (High Variance).]

8.3 Learning Curves

Learning Curves: Underfitting vs Good Fit vs Overfitting

[Figure: three panels of Error (y-axis) vs. Training Size (x-axis), each with a Training curve and a Validation curve.
UNDERFITTING: both curves sit at HIGH error. Fix: a more complex model is needed.
GOOD FIT: Training ≈ Validation, and both converge to LOW error. This is ideal.
OVERFITTING: a BIG GAP between low Training error and high Validation error. Fix: more data or regularization.]

8.4 Grid Search vs Random Search

Grid Search vs Random Search (9 trials, 2 parameters)

[Figure: two panels plotting trial points over Param 1 (x-axis) and Parameter 2 (y-axis).
GRID SEARCH: a 3 × 3 lattice of points, so only 3 distinct values are tested per parameter.
RANDOM SEARCH: 9 scattered points, so 9 distinct values are tested per parameter.]

Grid: If the best value is between grid points, you miss it.
Random: Explores more of each dimension with the same budget.

8.5 Stratified vs Non-Stratified Split

Stratified Split Preserves Class Distribution

Original Dataset (20% positive █, 80% negative ░):
░░░░░░░░░░░░░░░░████

NON-STRATIFIED RANDOM SPLIT:
Train: ░░░░░░░░░░░░░██
Test:  ░░░██ (40% positive! ✗)

STRATIFIED SPLIT (preserves 20%/80%):
Train: ░░░░░░░░░░░░███
Test:  ░░░░█ (20% positive ✓)

Critical for: fraud detection, rare disease diagnosis, churn prediction

🔄

Flowcharts

9.1 Complete Model Selection Pipeline

End-to-End Model Selection Flowchart

Raw Dataset (D)
  ↓
Train/Test Split (80/20 or 70/30): stratified for classification, time-ordered for time series
  ↓
TRAINING SET ONLY:
  1. Define candidate models: Logistic Regression, Random Forest, Gradient Boosting (XGBoost), SVM
  2. For each model: run inner CV (K=5) with GridSearchCV or Optuna to tune hyperparameters inside a Pipeline (!) → best hyperparams + CV score
  3. Compare best CV scores using paired t-test / Wilcoxon → select winning model
  ↓
Retrain Winner on FULL Training Set
  ↓
Final Evaluation on HELD-OUT Test (use only ONCE, no going back!)
  ↓
Deploy to Production

9.2 Choosing a CV Strategy

Decision Tree: Which Cross-Validation Method?

Is your data ordered in time (time series)?
  Yes → Use TimeSeriesCV (expanding window)
  No → Is it a classification problem?
    No → Use standard K-Fold (K=5 or 10)
    Yes → Are classes imbalanced (>1:5)?
      Yes → Use Stratified K-Fold (K=5 or 10); ALWAYS for imbalanced
      No → K-Fold is fine (K=10); Stratified is also safe to use anyway

Special case: N < 100 → Consider LOOCV
Special case: Need quick result → K=5 instead of K=10

9.3 Bayesian Optimization Loop

Bayesian Optimization Workflow

1. Initialize: sample n₀ random points.
2. Evaluate f(x) via CV for each sampled point.
3. Fit/update the surrogate model (Tree-Parzen Estimator or Gaussian Process).
4. Compute the acquisition function (Expected Improvement).
5. Select the next point x* that maximizes the acquisition.
6. Budget remaining (iterations < max)?
   Yes → return to step 2 and evaluate x*.
   No → return the best x* found.

🐍

Python Implementation

10.1 K-Fold Cross-Validation from Scratch

Python

import numpy as np

class KFoldCV:
    """K-Fold Cross-Validation implemented from scratch."""

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state

    def split(self, X):
        """Generate train/test indices for K folds."""
        n_samples = len(X)
        indices = np.arange(n_samples)

        if self.shuffle:
            rng = np.random.RandomState(self.random_state)
            rng.shuffle(indices)

        # Compute fold sizes (handle uneven splits)
        fold_sizes = np.full(self.n_splits, n_samples // self.n_splits)
        fold_sizes[:n_samples % self.n_splits] += 1

        current = 0
        for fold_size in fold_sizes:
            test_idx = indices[current:current + fold_size]
            train_idx = np.concatenate([
                indices[:current],
                indices[current + fold_size:]
            ])
            yield train_idx, test_idx
            current += fold_size

    def cross_val_score(self, model_class, X, y, **model_params):
        """Compute cross-validated scores."""
        scores = []
        for fold, (train_idx, test_idx) in enumerate(self.split(X)):
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]

            # Create fresh model for each fold
            model = model_class(**model_params)
            model.fit(X_train, y_train)

            score = np.mean(model.predict(X_test) == y_test)
            scores.append(score)
            print(f"  Fold {fold+1}: Accuracy = {score:.4f}")

        scores = np.array(scores)
        print(f"\n  Mean: {scores.mean():.4f} ± {scores.std():.4f}")
        return scores

# --- Demo with a simple model ---
class SimpleThresholdClassifier:
    """Nearest-centroid classifier: predict the class whose mean is closest."""
    def fit(self, X, y):
        # Learn one centroid (mean feature vector) per class
        self.centroid0 = X[y == 0].mean(axis=0)
        self.centroid1 = X[y == 1].mean(axis=0)
        return self

    def predict(self, X):
        # Assign each sample to the class with the nearer centroid
        d0 = np.linalg.norm(X - self.centroid0, axis=1)
        d1 = np.linalg.norm(X - self.centroid1, axis=1)
        return (d1 < d0).astype(int)

# Generate sample data
np.random.seed(42)
X = np.random.randn(100, 3)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Run 5-Fold CV
print("=" * 50)
print("5-Fold Cross-Validation from Scratch")
print("=" * 50)
kfold = KFoldCV(n_splits=5, shuffle=True, random_state=42)
scores = kfold.cross_val_score(SimpleThresholdClassifier, X, y)

10.2 Stratified K-Fold from Scratch

Python

class StratifiedKFold:
    """Stratified K-Fold: preserves class distribution in each fold."""

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state

    def split(self, X, y):
        n_samples = len(y)
        classes = np.unique(y)
        rng = np.random.RandomState(self.random_state)

        # Group indices by class
        class_indices = {}
        for cls in classes:
            idx = np.where(y == cls)[0]
            if self.shuffle:
                rng.shuffle(idx)
            class_indices[cls] = idx

        # Split each class into K folds, then combine
        folds = [[] for _ in range(self.n_splits)]
        for cls in classes:
            idx = class_indices[cls]
            fold_sizes = np.full(self.n_splits, len(idx) // self.n_splits)
            fold_sizes[:len(idx) % self.n_splits] += 1

            current = 0
            for i, size in enumerate(fold_sizes):
                folds[i].extend(idx[current:current + size])
                current += size

        # Generate train/test pairs
        for i in range(self.n_splits):
            test_idx = np.array(folds[i])
            train_idx = np.concatenate([folds[j] for j in range(self.n_splits) if j != i])
            yield train_idx.astype(int), test_idx.astype(int)

# Demo: Imbalanced dataset
np.random.seed(42)
X_imb = np.random.randn(200, 4)
y_imb = np.array([0]*180 + [1]*20)  # 90% vs 10%

print("\n" + "=" * 50)
print("Stratified K-Fold (Imbalanced: 90/10 split)")
print("=" * 50)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X_imb, y_imb)):
    train_dist = np.bincount(y_imb[train_idx])
    test_dist = np.bincount(y_imb[test_idx])
    print(f"Fold {fold+1}: Train {train_dist} | Test {test_dist} | "
          f"Test class-1 ratio: {test_dist[1]/test_dist.sum():.2%}")

10.3 Time-Series Cross-Validation

Python

class TimeSeriesSplit:
    """Time-series CV with expanding training window."""

    def __init__(self, n_splits=5, min_train_size=None):
        self.n_splits = n_splits
        self.min_train_size = min_train_size

    def split(self, X):
        n_samples = len(X)
        test_size = n_samples // (self.n_splits + 1)
        min_train = self.min_train_size or test_size

        for i in range(self.n_splits):
            train_end = min_train + i * test_size
            test_start = train_end
            test_end = test_start + test_size

            if test_end > n_samples:
                break

            train_idx = np.arange(0, train_end)
            test_idx = np.arange(test_start, test_end)
            yield train_idx, test_idx

# Demo
print("\n" + "=" * 50)
print("Time-Series Split (60 months of data)")
print("=" * 50)
X_ts = np.arange(60)  # 60 months
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    print(f"Fold {fold+1}: Train months [0..{train_idx[-1]}] "
          f"({len(train_idx)} months) → Test months "
          f"[{test_idx[0]}..{test_idx[-1]}] ({len(test_idx)} months)")

10.4 Grid Search from Scratch

Python

# Reuses the KFoldCV class defined in Section 10.1
import numpy as np
from itertools import product
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

def grid_search_cv(model_class, param_grid, X, y, cv=5):
    """Grid Search with K-Fold Cross-Validation from scratch."""
    # Generate all combinations
    keys = list(param_grid.keys())
    values = list(param_grid.values())
    combinations = list(product(*values))

    print(f"Total combinations: {len(combinations)}")
    print(f"Total model fits: {len(combinations) * cv}\n")

    best_score = -np.inf
    best_params = None
    results = []

    for combo in combinations:
        params = dict(zip(keys, combo))

        # K-Fold CV for this combination
        fold_scores = []
        kf = KFoldCV(n_splits=cv, shuffle=True, random_state=42)
        for train_idx, test_idx in kf.split(X):
            model = model_class(**params)
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            fold_scores.append(accuracy_score(y[test_idx], pred))

        mean_score = np.mean(fold_scores)
        std_score = np.std(fold_scores)
        results.append((params, mean_score, std_score))

        if mean_score > best_score:
            best_score = mean_score
            best_params = params

    # Sort and display top 5
    results.sort(key=lambda x: x[1], reverse=True)
    print("Top 5 Configurations:")
    print("-" * 60)
    for i, (params, mean, std) in enumerate(results[:5]):
        print(f"  {i+1}. Score: {mean:.4f} ± {std:.4f} | {params}")

    print(f"\n★ Best: {best_params} → {best_score:.4f}")
    return best_params, best_score

# Run Grid Search
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

param_grid = {
    'max_depth': [2, 3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20],
    'criterion': ['gini', 'entropy']
}

best_params, best_score = grid_search_cv(
    DecisionTreeClassifier, param_grid, X_iris, y_iris, cv=5
)

10.5 Bayesian Optimization with Optuna

Python

# pip install optuna
import optuna
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

def objective(trial):
    """Optuna objective: define search space and evaluate."""
    # Choose model type
    model_name = trial.suggest_categorical('model', ['rf', 'gb'])

    if model_name == 'rf':
        params = {
            'n_estimators': trial.suggest_int('rf_n_estimators', 50, 500),
            'max_depth': trial.suggest_int('rf_max_depth', 2, 20),
            'min_samples_split': trial.suggest_int('rf_min_samples_split', 2, 20),
            'max_features': trial.suggest_categorical('rf_max_features',
                                                       ['sqrt', 'log2']),
        }
        model = RandomForestClassifier(**params, random_state=42)
    else:
        params = {
            'n_estimators': trial.suggest_int('gb_n_estimators', 50, 500),
            'max_depth': trial.suggest_int('gb_max_depth', 2, 10),
            'learning_rate': trial.suggest_float('gb_lr', 0.01, 0.3, log=True),
            'subsample': trial.suggest_float('gb_subsample', 0.5, 1.0),
        }
        model = GradientBoostingClassifier(**params, random_state=42)

    # 5-Fold Stratified CV
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    return scores.mean()

# Run optimization
study = optuna.create_study(direction='maximize',
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50, show_progress_bar=True)

print(f"\nBest trial:")
print(f"  Value (accuracy): {study.best_trial.value:.4f}")
print(f"  Params: {study.best_trial.params}")

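# Hyperparameter importance
# Note: by default only parameters present in every completed trial are
# assessed; with this conditional search space that is just 'model'.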
importances = optuna.importance.get_param_importances(study)
print(f"\nHyperparameter Importance:")
for param, imp in importances.items():
    print(f"  {param}: {imp:.4f}")

10.6 Learning Curves with scikit-learn

Python

import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

def plot_learning_curves(model, X, y, title="Learning Curve"):
    """Plot training and validation learning curves."""
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5, scoring='accuracy',
        n_jobs=-1, random_state=42
    )

    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    plt.figure(figsize=(10, 6))
    plt.fill_between(train_sizes, train_mean - train_std,
                     train_mean + train_std, alpha=0.1, color='#059669')
    plt.fill_between(train_sizes, val_mean - val_std,
                     val_mean + val_std, alpha=0.1, color='#0891b2')
    plt.plot(train_sizes, train_mean, 'o-', color='#059669',
             label='Training Score', linewidth=2)
    plt.plot(train_sizes, val_mean, 'o-', color='#0891b2',
             label='Validation Score', linewidth=2)
    plt.xlabel('Training Set Size', fontsize=12)
    plt.ylabel('Accuracy', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.legend(loc='lower right', fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

    # Diagnose (0.7 and 0.1 are rule-of-thumb thresholds; adjust for your task)
    gap = train_mean[-1] - val_mean[-1]
    if val_mean[-1] < 0.7:
        print("→ UNDERFITTING: Both scores are low. Try a more complex model.")
    elif gap > 0.1:
        print(f"→ OVERFITTING: Training-validation gap = {gap:.3f}. "
              f"Try regularization or more data.")
    else:
        print(f"→ GOOD FIT: Gap = {gap:.3f}. Model generalizes well.")

# Example usage
data = load_breast_cancer()
plot_learning_curves(SVC(kernel='rbf', C=1.0), data.data, data.target,
                     title="SVM (RBF) Learning Curve")

10.7 Paired t-Test for Model Comparison

Python

from scipy import stats

def compare_models(model_a, model_b, X, y, cv=10, alpha=0.05):
    """Compare two models using paired t-test and Wilcoxon test."""
    from sklearn.base import clone
    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    scores_a = []
    scores_b = []

    for train_idx, test_idx in skf.split(X, y):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        # Model A: clone() returns a fresh, unfitted copy (also works for Pipelines)
        ma = clone(model_a)
        ma.fit(X_train, y_train)
        scores_a.append(accuracy_score(y_test, ma.predict(X_test)))

        # Model B
        mb = clone(model_b)
        mb.fit(X_train, y_train)
        scores_b.append(accuracy_score(y_test, mb.predict(X_test)))

    scores_a = np.array(scores_a)
    scores_b = np.array(scores_b)
    differences = scores_a - scores_b

    print("=" * 60)
    print("MODEL COMPARISON REPORT")
    print("=" * 60)
    print(f"Model A: {model_a.__class__.__name__}")
    print(f"  Mean accuracy: {scores_a.mean():.4f} ± {scores_a.std():.4f}")
    print(f"Model B: {model_b.__class__.__name__}")
    print(f"  Mean accuracy: {scores_b.mean():.4f} ± {scores_b.std():.4f}")
    print(f"\nMean difference: {differences.mean():.4f}")

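    # Paired t-test (caveat: CV folds share training data, so fold scores are
    # correlated and this p-value tends to be optimistic)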
    t_stat, p_value_t = stats.ttest_rel(scores_a, scores_b)
    print(f"\nPaired t-test: t={t_stat:.4f}, p={p_value_t:.4f}")

    # Wilcoxon signed-rank test
    try:
        w_stat, p_value_w = stats.wilcoxon(scores_a, scores_b)
        print(f"Wilcoxon test:  W={w_stat:.4f}, p={p_value_w:.4f}")
    except ValueError:
        print("Wilcoxon test: Cannot compute (all differences may be zero)")

    # Decision
    if p_value_t < alpha:
        winner = "A" if differences.mean() > 0 else "B"
        print(f"\n✓ SIGNIFICANT difference (p={p_value_t:.4f} < {alpha})")
        print(f"  → Model {winner} is statistically better.")
    else:
        print(f"\n✗ NO significant difference (p={p_value_t:.4f} ≥ {alpha})")
        print(f"  → Choose the simpler or faster model.")

    return scores_a, scores_b

# Usage
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
scores_a, scores_b = compare_models(
    RandomForestClassifier(n_estimators=100, random_state=42),
    LogisticRegression(max_iter=1000, random_state=42),
    data.data, data.target, cv=10
)
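
Because the folds of a K-fold CV reuse most of the same training data, the fold-wise differences are not independent, and the plain paired t-test tends to report significance too readily. One commonly cited remedy is the corrected resampled t-test of Nadeau and Bengio, which inflates the variance by the test-to-train size ratio. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the fold scores are made up, and the fold sizes assume 10-fold CV on the 569-sample breast cancer data.

```python
import numpy as np
from scipy import stats

def corrected_resampled_ttest(scores_a, scores_b, n_train, n_test):
    """Paired t-test with the Nadeau-Bengio variance correction.

    scores_a, scores_b: per-fold scores of two models on identical folds.
    n_train, n_test: number of samples in each training / test fold.
    """
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    k = diffs.size
    var = diffs.var(ddof=1)
    if var == 0.0:
        raise ValueError("All fold differences are identical; the test is undefined.")
    # The naive test uses var / k; the correction adds the n_test / n_train term
    se = np.sqrt((1.0 / k + n_test / n_train) * var)
    t_stat = diffs.mean() / se
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=k - 1)
    return t_stat, p_value

# Illustrative (made-up) fold scores from a 10-fold CV
demo_a = np.array([0.96, 0.95, 0.97, 0.94, 0.98, 0.96, 0.95, 0.97, 0.96, 0.95])
demo_b = np.array([0.95, 0.95, 0.96, 0.93, 0.96, 0.96, 0.94, 0.95, 0.96, 0.94])

t_naive, p_naive = stats.ttest_rel(demo_a, demo_b)
t_corr, p_corr = corrected_resampled_ttest(demo_a, demo_b, n_train=512, n_test=57)
print(f"Naive paired t-test: t={t_naive:.3f}, p={p_naive:.4f}")
print(f"Corrected t-test:    t={t_corr:.3f}, p={p_corr:.4f}")
```

The corrected statistic is always smaller in magnitude than the naive one, so a difference that survives it is more trustworthy.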

🔥

TensorFlow Implementation

11.1 Cross-Validating Neural Networks with Keras

Python (TensorFlow/Keras)

# pip install keras-tuner
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# --- Manual K-Fold CV for Neural Networks ---
def nn_cross_validate(X, y, build_fn, epochs=50, n_splits=5):
    """Cross-validate a Keras model with proper data handling."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    fold_scores = []

    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        print(f"\n--- Fold {fold+1}/{n_splits} ---")

        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        # Scale INSIDE the fold (no data leakage!)
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)  # transform only

        # Build fresh model for each fold
        model = build_fn(input_dim=X_train.shape[1])

        # Train with early stopping
        early_stop = keras.callbacks.EarlyStopping(
            monitor='val_loss', patience=10, restore_best_weights=True
        )
        model.fit(
            X_train_scaled, y_train,
            validation_split=0.15,
            epochs=epochs, batch_size=32,
            callbacks=[early_stop],
            verbose=0
        )

        # Evaluate
        _, accuracy = model.evaluate(X_test_scaled, y_test, verbose=0)
        fold_scores.append(accuracy)
        print(f"  Fold {fold+1} accuracy: {accuracy:.4f}")

    fold_scores = np.array(fold_scores)
    print(f"\n{'='*50}")
    print(f"CV Result: {fold_scores.mean():.4f} ± {fold_scores.std():.4f}")
    return fold_scores

# --- Model builder ---
def build_model(input_dim, units=64, dropout=0.3, lr=0.001):
    model = keras.Sequential([
        keras.layers.Dense(units, activation='relu',
                          input_shape=(input_dim,)),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(units // 2, activation='relu'),
        keras.layers.Dropout(dropout / 2),
        keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model

# Run CV
scores = nn_cross_validate(X, y, build_model, epochs=100, n_splits=5)

11.2 Keras Tuner for Automated Hyperparameter Search

Python (Keras Tuner)

import keras_tuner as kt

def build_tunable_model(hp):
    """Define hyperparameter search space for Keras Tuner."""
    model = keras.Sequential()

    # Tune number of layers
    for i in range(hp.Int('n_layers', 1, 4)):
        model.add(keras.layers.Dense(
            units=hp.Int(f'units_{i}', min_value=16, max_value=256, step=16),
            activation=hp.Choice(f'activation_{i}', ['relu', 'tanh', 'selu'])
        ))
        if hp.Boolean(f'dropout_{i}'):
            model.add(keras.layers.Dropout(
                rate=hp.Float(f'dropout_rate_{i}', 0.1, 0.5, step=0.1)
            ))

    model.add(keras.layers.Dense(1, activation='sigmoid'))

    # Tune learning rate
    lr = hp.Float('learning_rate', 1e-4, 1e-2, sampling='log')
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model

# Bayesian optimization tuner
tuner = kt.BayesianOptimization(
    build_tunable_model,
    objective='val_accuracy',
    max_trials=30,
    directory='tuner_results',
    project_name='breast_cancer_nn'
)

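# Hold out a validation set FIRST, then fit the scaler on the training part only
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_val_scaled = scaler.transform(X_val)  # transform only

# Search
tuner.search(
    X_tr_scaled, y_tr,
    validation_data=(X_val_scaled, y_val),
    epochs=50,
    batch_size=32,
    callbacks=[keras.callbacks.EarlyStopping(patience=5)]
)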

# Results
print("\nTop 3 Models:")
tuner.results_summary(num_trials=3)

best_model = tuner.get_best_models(num_models=1)[0]
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print(f"\nBest hyperparameters: {best_hp.values}")

⚠️ Data Leakage in Neural Networks

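When you pass validation_split to model.fit(), Keras holds out the last fraction of the samples as validation data, and it does so before any shuffling. If the rows are ordered (by class, time, or source), that tail is not representative and the validation score is biased. Shuffle the data beforehand, or pass validation_data with an explicit split. Also fit any scaler on the training portion only: scaling the full dataset before splitting leaks validation statistics into training.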
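
The effect of a tail split is easy to see on a toy example. The sketch below does not call Keras; it only mimics the "hold out the last 20%" behaviour with array slicing, then builds an explicit stratified split of the kind you would pass as validation_data. The toy labels and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels sorted by class: 80 negatives followed by 20 positives
y_sorted = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_sorted)).reshape(-1, 1)

# Mimic validation_split=0.2: hold out the last 20% of rows, no shuffling
n_val = len(y_sorted) // 5
tail_val = y_sorted[-n_val:]
print(f"Tail split -> share of positives in validation: {tail_val.mean():.0%}")

# Explicit stratified split; pass as validation_data=(X_val_toy, y_val_toy)
X_tr_toy, X_val_toy, y_tr_toy, y_val_toy = train_test_split(
    X_toy, y_sorted, test_size=0.2, stratify=y_sorted, random_state=42
)
print(f"Stratified -> share of positives in validation: {y_val_toy.mean():.0%}")
```

With the tail split, the validation set contains only positives, while the training set sees only 0 positives; the stratified split preserves the 20% positive rate in both parts.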

🧪

Scikit-Learn Implementation

12.1 cross_val_score — The Swiss Army Knife

Python (sklearn)

from sklearn.model_selection import (
    cross_val_score, StratifiedKFold, LeaveOneOut,
    TimeSeriesSplit, GridSearchCV, RandomizedSearchCV
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load data
X, y = load_breast_cancer(return_X_y=True)

# --- 1. Basic cross_val_score ---
print("=" * 60)
print("1. Basic cross_val_score (5-Fold)")
print("=" * 60)
models = {
    'Logistic Regression': LogisticRegression(max_iter=5000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"  {name:25s}: {scores.mean():.4f} ± {scores.std():.4f}")

# --- 2. Stratified K-Fold with custom CV object ---
print(f"\n{'=' * 60}")
print("2. Stratified K-Fold (10 folds)")
print("=" * 60)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=skf, scoring='accuracy'
)
print(f"  RF (10-Fold Stratified): {scores.mean():.4f} ± {scores.std():.4f}")
print(f"  Individual fold scores: {np.round(scores, 4)}")

# --- 3. Multiple metrics ---
from sklearn.model_selection import cross_validate
print(f"\n{'=' * 60}")
print("3. Multiple Metrics with cross_validate")
print("=" * 60)
results = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=5,
    scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'],
    return_train_score=True
)
for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
    train = results[f'train_{metric}'].mean()
    test = results[f'test_{metric}'].mean()
    print(f"  {metric:12s}: Train={train:.4f} | Test={test:.4f} | "
          f"Gap={train-test:.4f}")

12.2 GridSearchCV — Exhaustive Hyperparameter Search

Python (sklearn)

from sklearn.model_selection import GridSearchCV

# Define pipeline (prevents data leakage!)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])

# Define parameter grid (note: 'clf__' prefix for pipeline)
param_grid = {
    'clf__n_estimators': [50, 100, 200, 300],
    'clf__max_depth': [5, 10, 15, 20, None],
    'clf__min_samples_split': [2, 5, 10],
    'clf__max_features': ['sqrt', 'log2'],
}

# Run Grid Search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
    n_jobs=-1,       # Use all CPU cores
    verbose=1,
    return_train_score=True,
    refit=True       # Refit best model on full training data
)

grid_search.fit(X, y)

# Results
print(f"\n{'='*60}")
print("GRID SEARCH RESULTS")
print(f"{'='*60}")
print(f"Best Score: {grid_search.best_score_:.4f}")
print(f"Best Params: {grid_search.best_params_}")
print(f"Total fits: {grid_search.n_splits_} × "
      f"{len(grid_search.cv_results_['mean_test_score'])} = "
      f"{grid_search.n_splits_ * len(grid_search.cv_results_['mean_test_score'])}")

# Top 5 combinations
import pandas as pd
results_df = pd.DataFrame(grid_search.cv_results_)
top5 = results_df.nsmallest(5, 'rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score', 'mean_fit_time']
]
print("\nTop 5 Configurations:")
for _, row in top5.iterrows():
    print(f"  Score: {row['mean_test_score']:.4f} ± {row['std_test_score']:.4f} "
          f"| Time: {row['mean_fit_time']:.2f}s")

12.3 RandomizedSearchCV — Efficient Exploration

Python (sklearn)

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Distributions instead of fixed lists
param_distributions = {
    'clf__n_estimators': randint(50, 500),
    'clf__max_depth': randint(3, 30),
    'clf__min_samples_split': randint(2, 20),
    'clf__max_features': ['sqrt', 'log2'],
    'clf__min_samples_leaf': randint(1, 10),
}

random_search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=60,         # Only 60 random combinations (vs 1000s for grid)
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1,
    return_train_score=True
)

random_search.fit(X, y)

print(f"\n{'='*60}")
print("RANDOMIZED SEARCH RESULTS")
print(f"{'='*60}")
print(f"Best Score: {random_search.best_score_:.4f}")
print(f"Best Params: {random_search.best_params_}")
print(f"Evaluated {random_search.n_splits_ * 60} model fits "
      f"(a 4×5×3×2×4 grid would need {4*5*3*2*4 * 5})")

12.4 Complete Pipeline with Data Leakage Prevention

Python (sklearn)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# The CORRECT way: everything inside the pipeline
proper_pipeline = Pipeline([
    ('scaler', StandardScaler()),              # Step 1: Scale
    ('feature_select', SelectKBest(f_classif)), # Step 2: Feature Selection
    ('pca', PCA()),                             # Step 3: Dimensionality Reduction
    ('clf', LogisticRegression(max_iter=5000))  # Step 4: Classifier
])

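# Grid search over PIPELINE parameters
# Note: combinations with pca__n_components > feature_select__k cannot be fit;
# GridSearchCV scores them as NaN (default error_score) and emits a FitFailedWarning.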
pipeline_param_grid = {
    'feature_select__k': [5, 10, 15, 20, 'all'],
    'pca__n_components': [3, 5, 10, 15],
    'clf__C': [0.01, 0.1, 1, 10, 100],
    'clf__penalty': ['l1', 'l2'],
    'clf__solver': ['saga'],
}

# This ensures NO data leakage:
# - StandardScaler is fit ONLY on training folds
# - SelectKBest chooses features ONLY from training folds
# - PCA components are learned ONLY from training folds
pipeline_search = GridSearchCV(
    proper_pipeline,
    pipeline_param_grid,
    cv=5, scoring='accuracy',
    n_jobs=-1, verbose=0
)
pipeline_search.fit(X, y)

print(f"Pipeline Best Score: {pipeline_search.best_score_:.4f}")
print(f"Pipeline Best Params: {pipeline_search.best_params_}")

# ⚠️ THE WRONG WAY (DATA LEAKAGE):
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X)  ← Sees ALL data including test!
# X_selected = SelectKBest(f_classif, k=10).fit_transform(X_scaled, y)  ← Leakage!
# scores = cross_val_score(LogisticRegression(max_iter=5000), X_selected, y, cv=5)
# ^^ This score is OPTIMISTICALLY BIASED!
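
How large can the optimistic bias be? A classic way to see it is to run both workflows on pure noise, where the honest accuracy is about 50%. The sketch below is an illustration on synthetic data (the sample count, feature count, and k=20 are arbitrary assumptions): selecting features on all rows before CV yields a high score on noise, while the pipeline version stays near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.standard_normal((100, 5000))  # pure noise features
y_noise = np.repeat([0, 1], 50)             # balanced labels, unrelated to X_noise

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# WRONG: pick the "best" features using all rows (test folds included)
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=cv
).mean()

# RIGHT: selection is refit inside each training fold via the pipeline
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
proper_acc = cross_val_score(pipe, X_noise, y_noise, cv=cv).mean()

print(f"Selection before CV (leaky): {leaky_acc:.3f}")
print(f"Selection inside pipeline:   {proper_acc:.3f}")
```

Any accuracy well above 50% here is an artifact of the workflow, since the labels carry no relationship to the features.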

12.5 Regularization Parameter Tuning (L1/L2)

Python (sklearn)

from sklearn.linear_model import LogisticRegressionCV, RidgeCV, LassoCV
import matplotlib.pyplot as plt

# --- Logistic Regression with built-in CV for C ---
lr_cv = LogisticRegressionCV(
    Cs=np.logspace(-4, 4, 50),  # 50 values from 0.0001 to 10000
    cv=10,
    penalty='l2',
    scoring='accuracy',
    max_iter=5000,
    random_state=42
)
lr_cv.fit(X, y)
print(f"Best C (L2): {lr_cv.C_[0]:.4f}")
print(f"Best score: {lr_cv.scores_[1].mean(axis=0).max():.4f}")

# --- Visualize C vs CV Accuracy ---
mean_scores = lr_cv.scores_[1].mean(axis=0)
plt.figure(figsize=(10, 5))
plt.semilogx(np.logspace(-4, 4, 50), mean_scores, 'o-', color='#059669')
plt.axvline(x=lr_cv.C_[0], color='#f43f5e', linestyle='--',
            label=f'Best C = {lr_cv.C_[0]:.4f}')
plt.xlabel('Regularization Parameter C (log scale)')
plt.ylabel('CV Accuracy')
plt.title('L2 Regularization: C vs Cross-Validated Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# --- Ridge Regression with built-in CV ---
from sklearn.datasets import fetch_california_housing
X_reg, y_reg = fetch_california_housing(return_X_y=True)
ridge_cv = RidgeCV(
    alphas=np.logspace(-4, 4, 100),
    cv=10, scoring='neg_mean_squared_error'
)
ridge_cv.fit(X_reg, y_reg)
print(f"\nBest Ridge alpha: {ridge_cv.alpha_:.4f}")

# --- Lasso with CV (automatic feature selection) ---
lasso_cv = LassoCV(cv=10, random_state=42, max_iter=10000)
lasso_cv.fit(X_reg, y_reg)
print(f"Best Lasso alpha: {lasso_cv.alpha_:.6f}")
print(f"Features selected: {np.sum(lasso_cv.coef_ != 0)} / {X_reg.shape[1]}")

📝 Exam Tip

In sklearn, C in LogisticRegression is the inverse of regularization strength (C = 1/λ). Small C = strong regularization = simpler model. Large C = weak regularization = complex model. This is the opposite of the λ convention!
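
The inverse relationship can be checked directly by fitting the same model at several values of C and comparing the size of the learned coefficients. This is a small sketch on the breast cancer data; the three C values and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=C, max_iter=5000))
    model.fit(X_bc, y_bc)
    # model[-1] is the fitted LogisticRegression step of the pipeline
    coef_norms[C] = float(np.linalg.norm(model[-1].coef_))
    print(f"C={C:>6}: ||w||_2 = {coef_norms[C]:.3f}")
```

Smaller C shrinks the weight vector more, which is the same effect a larger λ has in the λ convention.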

🇮🇳

Indian Case Studies

Case Study 1: Flipkart's A/B Testing & Model Selection Framework

🛒 Flipkart — India's E-Commerce Giant

Challenge: Flipkart runs hundreds of ML models simultaneously — for search ranking, product recommendations, delivery time estimation, and fraud detection. Selecting and tuning each model independently was slow and error-prone.

Solution: They built an internal Model Selection Platform (MSP) that:

Automatically runs Stratified K-Fold CV on candidate models
Uses Bayesian optimization (similar to Optuna) for hyperparameter tuning
Performs statistical A/B tests before deploying model updates
Monitors production performance and triggers retraining when CV scores drift

Result: 40% reduction in model deployment time, 15% improvement in recommendation click-through rates after systematic tuning.

Key Insight: Flipkart uses time-aware splits for their demand forecasting models. They discovered that random K-fold on transaction data was giving optimistic estimates because the model was "seeing the future."

Case Study 2: TCS AutoML (iON Platform)

🏢 TCS — Automated ML for Enterprise

Context: TCS developed the iON Cognitive Platform to democratize ML for enterprises that lack dedicated data science teams.

Approach:

Meta-learning: The system learns from past experiments which model families work best for different data characteristics (number of features, class imbalance ratio, data types)
Multi-fidelity optimization: Instead of full K-fold CV for every candidate, initial screening uses 2-fold CV with small data subsets. Only promising candidates get full 10-fold evaluation
Ensemble construction: Top-performing models are automatically combined into weighted ensembles

Deployment: Used by TCS clients in the BFSI (Banking, Financial Services, Insurance) sector. One banking client reported a 25% reduction in loan default prediction errors after switching from manually tuned models to AutoML-selected pipelines.
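
The multi-fidelity idea described above (cheap screening first, full evaluation only for survivors) can be sketched with scikit-learn's successive-halving search, which is still marked experimental. This is not the TCS system; it is an analogous technique, and the grid, factor, and fold count below are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV

X_bc, y_bc = load_breast_cancer(return_X_y=True)

param_grid = {
    "max_depth": [2, 4, 8, None],
    "min_samples_split": [2, 5, 10],
}  # 12 candidates in total

halving = HalvingGridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    resource="n_samples",  # fidelity = number of training samples
    factor=3,              # keep roughly the best third each round
    cv=3,
    random_state=42,
)
halving.fit(X_bc, y_bc)

print("Candidates per round:", halving.n_candidates_)
print("Samples per round:   ", halving.n_resources_)
print("Best params:", halving.best_params_)
```

Each round evaluates fewer candidates on more data, so most of the compute goes to the configurations that looked promising at low fidelity.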

Case Study 3: Aadhaar Biometric Matching — Cross-Validation at Scale

🆔 UIDAI — 1.3 Billion Identity Records

Challenge: UIDAI's biometric matching system must maintain extremely low false acceptance rates (FAR < 0.01%) while keeping false rejection rates manageable. Model selection is literally a matter of national identity.

CV Strategy:

Group K-Fold: folds are grouped by geographic region to ensure geographic generalization
Stratified by biometric quality: ensures each fold has representative samples across quality levels
Repeated CV (3×10 fold): 30 total evaluations per model to reduce variance

Statistical Testing: McNemar's test (rather than a paired t-test) is used because it compares two classifiers through their paired correct/incorrect outcomes on the same test samples. The significance level is set at α = 0.001 (extremely stringent for a national system).
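
McNemar's test looks only at the samples on which the two models disagree. A minimal sketch of the exact (binomial) version follows; the function name and toy predictions are hypothetical, and nothing here reflects UIDAI's actual implementation.

```python
import numpy as np
from scipy import stats

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact (binomial) McNemar test on paired predictions of two classifiers."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_right = pred_a == y_true
    b_right = pred_b == y_true
    b = int(np.sum(a_right & ~b_right))  # A correct, B wrong
    c = int(np.sum(~a_right & b_right))  # A wrong, B correct
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree
    # Under H0 each disagreement favours A or B with probability 0.5
    p_value = min(1.0, 2.0 * stats.binom.cdf(min(b, c), n, 0.5))
    return b, c, float(p_value)

# Toy example: A is always right, B is wrong on the first 10 of 20 samples
y_toy = np.zeros(20, dtype=int)
pred_a_toy = np.zeros(20, dtype=int)
pred_b_toy = np.array([1] * 10 + [0] * 10)

b, c, p = mcnemar_exact(y_toy, pred_a_toy, pred_b_toy)
print(f"A-only correct: {b}, B-only correct: {c}, p-value: {p:.6f}")
```

Samples that both models get right, or both get wrong, do not enter the statistic at all, which is why the test suits a single shared test set.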

Case Study 4: PhonePe Fraud Detection Pipeline

💳 PhonePe — Real-Time UPI Fraud Detection

Challenge: With 4+ billion monthly UPI transactions, PhonePe needs models that detect fraud in <50ms while maintaining <0.1% false positive rate on legitimate transactions.

Pipeline: StandardScaler → SMOTE (inside CV!) → Feature Selection → XGBoost → Calibration

Key Design: SMOTE (oversampling) is applied inside each CV fold, to the training portion only, never before splitting. SMOTE run before CV interpolates synthetic samples from points that later land in the test fold, so near-copies of test data end up in training: severe data leakage!
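
The "resample inside the fold" pattern can be sketched without extra dependencies. Note the swap: the code below uses plain random oversampling (duplicating minority rows) rather than SMOTE, and it runs on synthetic data; the helper name, class ratio, and model are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold

def oversample_minority(X, y, rng):
    """Duplicate random minority-class rows until both classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

X_imb, y_imb = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.95, 0.05], random_state=42)
rng = np.random.default_rng(42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores, train_balance = [], []
for train_idx, test_idx in skf.split(X_imb, y_imb):
    # Resample the TRAINING fold only; the test fold keeps its real imbalance
    X_res, y_res = oversample_minority(X_imb[train_idx], y_imb[train_idx], rng)
    train_balance.append(y_res.mean())
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    proba = model.predict_proba(X_imb[test_idx])[:, 1]
    fold_scores.append(average_precision_score(y_imb[test_idx], proba))

print(f"Mean average precision: {np.mean(fold_scores):.3f}")
```

Evaluating on untouched test folds keeps the metric tied to the class ratio the model will face in production.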

🌍

Global Case Studies

Case Study 1: Google Vizier — Hyperparameter Optimization as a Service

🔍 Google — Vizier (2017)

What: Google's internal black-box optimization service used across the company for hyperparameter tuning. Published at KDD 2017.

Scale: Handles thousands of concurrent optimization studies, from tuning neural machine translation models to optimizing ad auction parameters.

Key innovations:

Transfer learning: Warm-starts optimization using results from similar past studies
Early stopping: Terminates unpromising trials early using median stopping rules
Multi-objective: Can optimize accuracy AND latency simultaneously
Safe optimization: Handles constraints (e.g., "model must fit in 2GB memory")

Impact: Reduced average tuning time by 3× while finding better hyperparameters than manual tuning.

Case Study 2: Netflix — Personalization Model Selection

🎬 Netflix — Model Selection at Scale

Challenge: Netflix runs hundreds of A/B tests simultaneously. For each recommendation algorithm update, they need rigorous offline evaluation before deploying to users.

Approach:

Time-based splits: Always train on past viewing history, evaluate on future behavior
Replay evaluation: Simulates what would have happened if the new model had been deployed in the past
Interleaving experiments: Before full A/B tests, they "interleave" recommendations from two models on the same page and measure user preference

Scale: Their model selection pipeline evaluates 1000+ model variants weekly, with automated statistical significance testing determining which proceed to A/B testing.

Case Study 3: Tesla — AutoML for Self-Driving

🚗 Tesla — Automated Neural Architecture Search

Tesla uses Neural Architecture Search (NAS) — an extreme form of model selection where the architecture itself is a hyperparameter. Their Dojo supercomputer runs thousands of architecture evaluations in parallel, using progressive CV-like validation on increasingly larger subsets of their driving dataset.

Case Study 4: OpenAI — Scaling Law-Based Model Selection

🤖 OpenAI — Predicting Performance Before Training

OpenAI's scaling laws research (Kaplan et al., 2020) found that language-model test loss follows smooth power laws in model size, dataset size, and compute budget. This enables "model selection" before expensive full training: you can estimate which configuration will perform best by running small-scale experiments and extrapolating the fitted curve.
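
The extrapolation step amounts to fitting a straight line in log-log space. The sketch below uses synthetic numbers generated from an assumed power law (the constants a = 10 and alpha = 0.08 are made up, not values from the paper) to show the mechanics of fitting on small runs and predicting a larger one.

```python
import numpy as np

# Synthetic "small-scale runs": loss = a * N**(-alpha) with slight noise
rng = np.random.default_rng(0)
n_params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
true_a, true_alpha = 10.0, 0.08
noise = np.exp(rng.normal(0.0, 0.005, size=n_params.size))
loss = true_a * n_params ** (-true_alpha) * noise

# A power law is linear in log-log space: log L = log a - alpha * log N
slope, intercept = np.polyfit(np.log(n_params), np.log(loss), deg=1)
alpha_hat, a_hat = -slope, np.exp(intercept)

# Extrapolate to a model 10x larger than any that was "trained"
predicted_loss = a_hat * 1e9 ** (-alpha_hat)
true_loss = true_a * 1e9 ** (-true_alpha)

print(f"Fitted alpha: {alpha_hat:.4f} (true {true_alpha})")
print(f"Predicted loss at 1e9 params: {predicted_loss:.3f} (true {true_loss:.3f})")
```

Real training curves are noisier and can bend away from a pure power law, so extrapolations far beyond the fitted range deserve caution.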

🚀

Startup Applications

15.1 When Resources Are Limited

Startups typically can't afford Google-scale hyperparameter searches. Practical strategies:

Stage	Strategy	CV Method	Tuning
MVP (0-1K samples)	Simple models first	LOOCV or 5-fold	Manual + intuition
Growth (1K-100K)	Compare 3-5 models	Stratified 5-fold	Random Search (30 trials)
Scale (100K+)	Full pipeline	Stratified 10-fold	Bayesian (Optuna)

15.2 Indian Startup Examples

Razorpay (Payments): Uses time-series CV for fraud model evaluation; updates models weekly with expanding-window retraining
Nykaa (Beauty E-commerce): Random Search + 5-fold CV for product recommendation model selection; found that a well-tuned LightGBM outperformed a default neural network
CRED (FinTech): Group K-Fold CV where groups are individual users — ensures no user's data appears in both train and test
Ola (Ride-hailing): Geographic K-Fold for demand prediction — train on some cities, validate on held-out cities
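
User-level and city-level splits like those in the examples above map onto scikit-learn's GroupKFold. A minimal sketch on random data follows; the user IDs, sizes, and variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
n_rows = 200
user_ids = rng.integers(0, 25, size=n_rows)  # 25 hypothetical users
X_users = rng.standard_normal((n_rows, 4))
y_users = rng.integers(0, 2, size=n_rows)

gkf = GroupKFold(n_splits=5)
overlaps = []
for fold, (train_idx, test_idx) in enumerate(
        gkf.split(X_users, y_users, groups=user_ids), start=1):
    # Users present on both sides of the split (should be none)
    shared = set(user_ids[train_idx]) & set(user_ids[test_idx])
    overlaps.append(len(shared))
    print(f"Fold {fold}: {len(set(user_ids[test_idx]))} test users, "
          f"{len(shared)} shared with train")
```

The same splitter can be passed as cv= to cross_val_score or GridSearchCV, together with the groups argument at fit time.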

💻 Code Challenge

Build a "Quick Model Selector" function that takes a dataset and automatically: (1) tries Logistic Regression, Random Forest, SVM, and XGBoost, (2) runs 5-fold Stratified CV for each, (3) returns a ranked leaderboard with mean scores, std, and training time. Complete it in under 30 lines.

🏛️

Government Applications

16.1 ISRO — Satellite Image Classification

ISRO's remote sensing division uses spatial cross-validation for land-cover classification. Standard K-fold would allow spatially adjacent pixels (which are highly correlated) to appear in both train and test, inflating accuracy. Their solution: block CV where folds correspond to geographic tiles, ensuring spatial independence.
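
Block CV of this kind can be approximated by assigning every pixel a tile ID and splitting on tiles rather than pixels. The sketch below is an illustration on random coordinates, not ISRO's pipeline; the 100 km extent, the 25 km tile size, and the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(1000, 2))  # hypothetical pixel positions (km)

tile_size = 25.0                              # yields a 4 x 4 grid of tiles
tile_xy = np.floor(coords / tile_size).astype(int)
tile_id = tile_xy[:, 0] * 4 + tile_xy[:, 1]   # one integer label per tile

block_cv = GroupKFold(n_splits=4)
shared_tiles = [
    len(set(tile_id[train_idx]) & set(tile_id[test_idx]))
    for train_idx, test_idx in block_cv.split(coords, groups=tile_id)
]
print("Tiles shared between train and test, per fold:", shared_tiles)
```

Pixels near a tile border can still be correlated with neighbours in an adjacent tile, so larger tiles or a buffer strip between train and test tiles may be needed in practice.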

16.2 Ministry of Health — Epidemic Prediction

India's Integrated Disease Surveillance Programme (IDSP) uses time-series CV to validate epidemic forecasting models. They compare ARIMA, Prophet, and ensemble ML models using expanding-window CV on weekly disease incidence data. The paired t-test determines if newer models statistically improve on existing ones before deployment.

16.3 Income Tax Department — Fraud Detection

The IT department uses stratified CV with extreme class imbalance (fraud rates < 0.5%). They found that SMOTE inside CV folds, combined with XGBoost tuned via Bayesian optimization, significantly outperformed manually-tuned rules-based systems.

16.4 Election Commission — Voter Turnout Prediction

Uses geographic group CV — trains on some constituencies, validates on others. This prevents the model from memorizing constituency-specific patterns and ensures generalization to new electoral scenarios.

🏭

Industry Applications

17.1 Manufacturing — Predictive Maintenance

CV Strategy: Group K-Fold (grouped by machine ID). Never allow the same machine's data in both train and test.

Tuning: Multi-objective optimization — balance precision (don't trigger unnecessary maintenance) with recall (don't miss failures).

17.2 Pharma — Drug Discovery

CV Strategy: Scaffold splitting — molecules are grouped by chemical scaffold. Standard random CV overestimates drug activity prediction because structurally similar molecules end up in both sets.

Tuning: Bayesian optimization with Gaussian Processes, due to extremely expensive evaluations (each "trial" may take hours of molecular simulation).

17.3 Finance — Credit Scoring

CV Strategy: Time-series split (train on past applications, test on future ones) + Stratified for default/non-default balance.

Regulatory requirement: Models must be explainable, so model selection favors logistic regression and decision trees over black-box models, even if black-box models perform slightly better in CV.

17.4 Retail — Demand Forecasting

CV Strategy: Sliding window time-series CV with store-level grouping.

Tuning Priority: Hyperparameter importance analysis reveals that $learning_rate$ and $n_estimators$ matter most for GBM-based forecasters; $max_depth$ has diminishing returns beyond 8.

⚡ Industry Alert

In production ML, the choice of CV strategy often matters more than the choice of model. A well-validated simple model usually beats a poorly validated complex one. Top companies spend 50-70% of model development time on validation infrastructure.

🛠️

Mini Projects

Mini Project 1: Complete Model Selection Framework

Python

"""
Mini Project 1: AutoModelSelector
A complete framework for automated model selection with:
- Multiple CV strategies
- Statistical model comparison
- Leakage-free pipelines
- Results visualization
"""

import numpy as np
import pandas as pd
from sklearn.model_selection import (
    cross_val_score, StratifiedKFold, TimeSeriesSplit
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier
)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from scipy import stats
import time

class AutoModelSelector:
    """Automated model selection with statistical comparison."""

    def __init__(self, cv_strategy='stratified', n_folds=5,
                 random_state=42):
        self.cv_strategy = cv_strategy
        self.n_folds = n_folds
        self.random_state = random_state
        self.results_ = {}
        self.best_model_ = None

        # Define candidate models (all inside pipelines)
        self.candidates = {
            'LogisticRegression': Pipeline([
                ('scaler', StandardScaler()),
                ('clf', LogisticRegression(max_iter=5000,
                                           random_state=random_state))
            ]),
            'RandomForest': Pipeline([
                ('scaler', StandardScaler()),
                ('clf', RandomForestClassifier(n_estimators=100,
                                               random_state=random_state))
            ]),
            'GradientBoosting': Pipeline([
                ('scaler', StandardScaler()),
                ('clf', GradientBoostingClassifier(
                    random_state=random_state))
            ]),
            'SVM_RBF': Pipeline([
                ('scaler', StandardScaler()),
                ('clf', SVC(kernel='rbf', random_state=random_state))
            ]),
            'KNN': Pipeline([
                ('scaler', StandardScaler()),
                ('clf', KNeighborsClassifier(n_neighbors=5))
            ]),
        }

    def _get_cv(self, y=None):
        if self.cv_strategy == 'stratified':
            return StratifiedKFold(n_splits=self.n_folds,
                                   shuffle=True,
                                   random_state=self.random_state)
        elif self.cv_strategy == 'timeseries':
            return TimeSeriesSplit(n_splits=self.n_folds)
        else:
            return self.n_folds

    def fit(self, X, y, scoring='accuracy'):
        """Evaluate all candidate models."""
        cv = self._get_cv(y)
        print("=" * 70)
        print("AUTO MODEL SELECTION")
        print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
        print(f"CV Strategy: {self.cv_strategy} ({self.n_folds} folds)")
        print("=" * 70)

        for name, pipeline in self.candidates.items():
            start = time.time()
            scores = cross_val_score(pipeline, X, y, cv=cv,
                                     scoring=scoring, n_jobs=-1)
            elapsed = time.time() - start

            self.results_[name] = {
                'mean': scores.mean(),
                'std': scores.std(),
                'scores': scores,
                'time': elapsed
            }
            print(f"  {name:25s} | {scores.mean():.4f} ± {scores.std():.4f} "
                  f"| {elapsed:.2f}s")

        # Find best
        self.best_model_ = max(self.results_,
                               key=lambda k: self.results_[k]['mean'])
        print(f"\n★ Best Model: {self.best_model_} "
              f"({self.results_[self.best_model_]['mean']:.4f})")

        return self

    def compare_top2(self, alpha=0.05):
        """Statistical comparison of top 2 models."""
        sorted_models = sorted(self.results_,
                               key=lambda k: self.results_[k]['mean'],
                               reverse=True)
        m1, m2 = sorted_models[0], sorted_models[1]
        s1 = self.results_[m1]['scores']
        s2 = self.results_[m2]['scores']

        # Paired t-test
        t_stat, p_val = stats.ttest_rel(s1, s2)
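        # Wilcoxon signed-rank test (can raise ValueError when every
        # fold-wise difference is zero, so fall back to "not available")
        try:
            w_stat, p_wil = stats.wilcoxon(s1, s2)
        except ValueError:
            w_stat, p_wil = None, None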

        print(f"\n{'='*70}")
        print(f"STATISTICAL COMPARISON: {m1} vs {m2}")
        print(f"{'='*70}")
        print(f"  {m1}: {s1.mean():.4f} ± {s1.std():.4f}")
        print(f"  {m2}: {s2.mean():.4f} ± {s2.std():.4f}")
        print(f"  Paired t-test: t={t_stat:.3f}, p={p_val:.4f}")
        if p_wil is not None:
            print(f"  Wilcoxon test: W={w_stat:.3f}, p={p_wil:.4f}")

        if p_val < alpha:
            print(f"  → Significant difference (p < {alpha})")
            print(f"  → Choose {m1}")
        else:
            print(f"  → No significant difference (p ≥ {alpha})")
            print(f"  → Choose simpler/faster model")

    def leaderboard(self):
        """Return results as a sorted DataFrame."""
        rows = []
        for name, res in self.results_.items():
            rows.append({
                'Model': name,
                'Mean Score': res['mean'],
                'Std': res['std'],
                'Time (s)': res['time'],
                'Score per Second': res['mean'] / max(res['time'], 0.01)
            })
        df = pd.DataFrame(rows).sort_values('Mean Score', ascending=False)
        return df.reset_index(drop=True)

# --- Run the framework ---
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

selector = AutoModelSelector(cv_strategy='stratified', n_folds=10)
selector.fit(X, y)
selector.compare_top2()
print("\n" + selector.leaderboard().to_string(index=False))

Mini Project 2: AutoML Pipeline with Optuna

Python

"""
Mini Project 2: End-to-End AutoML Pipeline
Combines model selection + hyperparameter tuning using Optuna.
"""

import optuna
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier,
    AdaBoostClassifier
)
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import numpy as np
import warnings
warnings.filterwarnings('ignore')

X, y = load_breast_cancer(return_X_y=True)

def automl_objective(trial):
    """Unified objective: select model AND tune hyperparameters."""

    # Step 1: Choose model family
    model_type = trial.suggest_categorical(
        'model_type',
        ['lr', 'rf', 'gb', 'svm', 'adaboost']
    )

    # Step 2: Define model-specific hyperparameters
    if model_type == 'lr':
        C = trial.suggest_float('lr_C', 1e-3, 100, log=True)
        penalty = trial.suggest_categorical('lr_penalty', ['l1', 'l2'])
        clf = LogisticRegression(
            C=C, penalty=penalty, solver='saga',
            max_iter=5000, random_state=42
        )

    elif model_type == 'rf':
        clf = RandomForestClassifier(
            n_estimators=trial.suggest_int('rf_n', 50, 300),
            max_depth=trial.suggest_int('rf_depth', 3, 20),
            min_samples_split=trial.suggest_int('rf_split', 2, 15),
            max_features=trial.suggest_categorical(
                'rf_feat', ['sqrt', 'log2']
            ),
            random_state=42
        )

    elif model_type == 'gb':
        clf = GradientBoostingClassifier(
            n_estimators=trial.suggest_int('gb_n', 50, 300),
            max_depth=trial.suggest_int('gb_depth', 2, 10),
            learning_rate=trial.suggest_float('gb_lr', 0.01, 0.3, log=True),
            subsample=trial.suggest_float('gb_sub', 0.5, 1.0),
            random_state=42
        )

    elif model_type == 'svm':
        clf = SVC(
            C=trial.suggest_float('svm_C', 0.1, 100, log=True),
            kernel=trial.suggest_categorical(
                'svm_kernel', ['rbf', 'poly']
            ),
            gamma=trial.suggest_categorical(
                'svm_gamma', ['scale', 'auto']
            ),
            random_state=42
        )

    else:  # adaboost
        clf = AdaBoostClassifier(
            n_estimators=trial.suggest_int('ada_n', 25, 200),
            learning_rate=trial.suggest_float('ada_lr', 0.01, 2.0, log=True),
            random_state=42
        )

    # Step 3: Wrap in pipeline (leak-proof!)
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', clf)
    ])

    # Step 4: Cross-validate
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X, y, cv=cv,
                             scoring='accuracy', n_jobs=-1)
    return scores.mean()

# Run AutoML
optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42)
)
study.optimize(automl_objective, n_trials=100)

# Results
print("=" * 70)
print("AutoML RESULTS")
print("=" * 70)
print(f"Best accuracy: {study.best_value:.4f}")
print(f"Best model type: {study.best_params['model_type']}")
print(f"Best params:")
for k, v in study.best_params.items():
    print(f"  {k}: {v}")

# Show model type distribution
model_counts = {}
for trial in study.trials:
    mt = trial.params.get('model_type', 'unknown')
    if mt not in model_counts:
        model_counts[mt] = {'count': 0, 'best_score': 0}
    model_counts[mt]['count'] += 1
    model_counts[mt]['best_score'] = max(
        model_counts[mt]['best_score'], trial.value or 0
    )

print(f"\nModel Type Distribution (out of {len(study.trials)} trials):")
for mt, info in sorted(model_counts.items(),
                       key=lambda x: x[1]['best_score'],
                       reverse=True):
    print(f"  {mt:12s}: {info['count']:3d} trials, "
          f"best = {info['best_score']:.4f}")

Mini Project 3: Hyperparameter Importance Analyzer

Python

"""
Mini Project 3: Hyperparameter Importance Analysis
Determine which hyperparameters matter most for performance.
"""

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from itertools import product

X, y = load_breast_cancer(return_X_y=True)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 0.85, 1.0],
}

# Run all combinations
results = []
keys = list(param_grid.keys())
values = list(param_grid.values())

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("Running parameter sweep...")
for combo in product(*values):
    params = dict(zip(keys, combo))
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', GradientBoostingClassifier(**params, random_state=42))
    ])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
    results.append({**params, 'mean_score': scores.mean()})

# Analyze importance using variance-based method (fANOVA-like)
import pandas as pd
df = pd.DataFrame(results)

print(f"\n{'='*60}")
print("HYPERPARAMETER IMPORTANCE (Marginal Variance Method)")
print(f"{'='*60}")

# Population variance (ddof=0) so each ratio is a fraction of total variance
overall_var = df['mean_score'].var(ddof=0)
importances = {}

for param in keys:
    # Group by this parameter, compute mean score per group
    group_means = df.groupby(param)['mean_score'].mean()
    # Main-effect importance = variance of group means / total variance
    # (a valid variance fraction here because the grid is balanced)
    param_var = group_means.var(ddof=0)
    importance = param_var / overall_var if overall_var > 0 else 0
    importances[param] = importance

# Sort and display
sorted_imp = sorted(importances.items(), key=lambda x: x[1], reverse=True)
total_imp = sum(v for _, v in sorted_imp)

print(f"\nOverall score range: {df['mean_score'].min():.4f} – "
      f"{df['mean_score'].max():.4f}")
print(f"\nRanked Importance:")
for param, imp in sorted_imp:
    bar = '█' * int(imp / max(importances.values()) * 30)
    normalized = imp / total_imp * 100 if total_imp > 0 else 0
    print(f"  {param:20s} {bar:30s} {normalized:5.1f}%")

# Show best value for each parameter
print(f"\nBest value for each hyperparameter:")
for param, _ in sorted_imp:
    group_means = df.groupby(param)['mean_score'].mean()
    best_val = group_means.idxmax()
    best_score = group_means.max()
    print(f"  {param:20s} = {str(best_val):10s} (avg score: {best_score:.4f})")

📝

End-of-Chapter Exercises

Exercise 1 (Conceptual)

Explain why using the test set for any form of model selection (choosing hyperparameters, comparing models) invalidates its use for estimating generalization error. What should be used instead?

Exercise 2 (Computation)

A dataset has 1,000 samples. For 10-fold CV, compute: (a) training set size per fold, (b) test set size per fold, (c) total number of model training runs, (d) the fraction of folds in which each sample appears in the test set.

Exercise 3 (Coding)

Implement Leave-One-Out Cross-Validation from scratch (no sklearn). Apply it to a dataset with 50 samples using KNN (K=3). Report the accuracy.

Exercise 4 (Analysis)

Given these 5-fold CV scores: Model X = [0.91, 0.88, 0.93, 0.87, 0.90], Model Y = [0.89, 0.90, 0.88, 0.91, 0.89]. Perform a paired t-test by hand. Is there a significant difference at α=0.05?

Exercise 5 (Conceptual)

Why is random shuffle invalid for time-series cross-validation? Give a concrete example from stock price prediction where shuffling would give misleadingly high accuracy.

Exercise 6 (Coding)

Write a Python function that takes a parameter grid dictionary and returns the total number of combinations. Test it with: {'C': [0.1, 1, 10], 'kernel': ['rbf', 'poly'], 'gamma': ['scale', 'auto', 0.1]}.

Exercise 7 (Design)

Design a cross-validation strategy for a fraud detection system where fraud accounts for 0.1% of transactions. Justify your choice of K, stratification, and evaluation metric.

Exercise 8 (Computation)

A Grid Search with 4 parameters having [5, 3, 4, 2] values respectively uses 5-fold CV. Each model fit takes 10 seconds. How long will the full search take? What if Random Search with 50 iterations is used instead?

Exercise 9 (Coding)

Build a Pipeline that chains StandardScaler → PCA(n_components=5) → RandomForest, and run GridSearchCV to tune both PCA components and RF max_depth simultaneously.

Exercise 10 (Theory)

Prove that LOOCV gives an approximately unbiased estimate of the generalization error. Then explain why, despite being unbiased, it can be a poor estimator due to high variance.

Exercise 11 (Coding)

Use Optuna to tune an SVM on the Iris dataset. Search over kernel (rbf, poly), C (0.1-100 log), and gamma (1e-4 to 1 log). Report the best parameters and CV accuracy.

Exercise 12 (Conceptual)

Explain the "double descent" phenomenon and how it challenges the traditional bias-variance tradeoff picture. Under what conditions does test error decrease again after increasing?

Exercise 13 (Coding)

Create learning curves for a Decision Tree with max_depth=2 (underfitting), max_depth=20 (overfitting), and max_depth=5 (good fit). Plot all three on the same figure.

Exercise 14 (Analysis)

A team scales features before splitting data into CV folds (i.e., fit_transform on entire dataset, then cross-validate). Quantify the impact: run the same model (a) with this leakage and (b) using a proper Pipeline. Compare the CV scores.

Exercise 15 (Design)

Design a model selection pipeline for a hospital predicting patient readmission. Consider: HIPAA compliance, class imbalance (15% readmission), temporal ordering, multi-site data.

Exercise 16 (Computation)

For the corrected t-test (Nadeau & Bengio), compute σ²_corrected for a 5-fold CV where s²_d = 0.001. Compare with the uncorrected variance.

Exercise 17 (Coding)

Implement Repeated Stratified K-Fold (3 repeats × 5 folds = 15 evaluations) from scratch. Compare the standard error with single 5-fold CV.

Exercise 18 (Research)

Read Bergstra & Bengio (2012). Summarize their key finding about random vs grid search in 200 words. When would grid search still be preferable?

Exercise 19 (Coding)

Use sklearn's learning_curve function to generate learning curves for Logistic Regression, SVM, and Random Forest on the same dataset. Which model shows the most overfitting?

Exercise 20 (Application)

A Flipkart data scientist reports 95% accuracy using 5-fold CV for predicting whether a customer will make a purchase. List 5 things you would check before trusting this number.

Exercise 21 (Advanced)

Implement the LOOCV shortcut for Ridge Regression using the hat matrix formula. Verify it gives the same result as explicit LOOCV on a small dataset.

Exercise 22 (Coding)

Build a GroupKFold implementation from scratch where groups are user IDs. Test it on a dataset where each user has multiple transactions.

Exercise 23 (Integration)

Combine everything: Load a dataset → Exploratory Analysis → Feature Engineering Pipeline → Model Selection with AutoModelSelector → Hyperparameter Tuning with Optuna → Final Evaluation on held-out test set → Report with confidence intervals.

✅

Multiple Choice Questions

MCQ 1

In 10-fold cross-validation on a dataset of 500 samples, each model is trained on how many samples per fold?

(A) 50 (B) 450 (C) 500 (D) 400

MCQ 2

Which cross-validation method is most appropriate for stock price prediction?

(A) Stratified K-Fold (B) Random K-Fold (C) TimeSeriesSplit (D) Leave-One-Out

MCQ 3

Bergstra & Bengio (2012) showed that random search is more efficient than grid search primarily because:

(A) It uses fewer iterations (B) It explores more distinct values per hyperparameter (C) It is parallelizable (D) It uses gradient information

MCQ 4

Data leakage in cross-validation occurs when:

(A) The model is too complex (B) Preprocessing uses information from the test fold (C) K is too large (D) The dataset is too small

MCQ 5

In the bias-variance tradeoff, increasing model complexity generally:

(A) Increases both bias and variance (B) Decreases bias, increases variance (C) Decreases both (D) Increases bias, decreases variance

MCQ 6

In scikit-learn, what does C represent in LogisticRegression(C=0.01)?

(A) Strong regularization (B) Weak regularization (C) Number of clusters (D) Convergence threshold

MCQ 7

LOOCV has high variance because:

(A) Training sets are very small (B) Test sets have only 1 sample (C) Training sets overlap extensively, making fold errors highly correlated (D) It doesn't use stratification

MCQ 8

Which sklearn class prevents data leakage by ensuring preprocessing is fit only on training data?

(A) GridSearchCV (B) Pipeline (C) StandardScaler (D) cross_val_score

MCQ 9

Bayesian optimization uses which component to decide the next hyperparameter combination to try?

(A) Loss function (B) Gradient descent (C) Acquisition function (D) Random number generator

MCQ 10

If Model A beats Model B on 7 out of 10 CV folds, can we conclude A is significantly better?

(A) Yes, 7/10 is a clear majority (B) No, we need a statistical test (C) Yes, if the mean scores differ (D) No, we need more folds

MCQ 11 (Bonus)

In the Nadeau & Bengio corrected t-test, the correction factor accounts for:

(A) Class imbalance (B) Non-normality of scores (C) Overlapping training sets across folds (D) Computational time

💼

Interview Questions

Interview Q1 (Google/Microsoft)

Q: Explain data leakage with a concrete example. How would you detect it?

A: Data leakage occurs when information from outside the training set inappropriately influences model development. Example: fitting StandardScaler on the entire dataset before CV means the scaler's mean/std include test data statistics. Detection: compare CV scores with Pipeline (correct) vs without (leaked) — if leaked scores are significantly higher, you have leakage. Also check if production performance is much worse than CV estimates.
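The detection recipe above (compare leaked and pipelined CV scores) can be sketched on synthetic data. As an assumption made purely for illustration, this sketch swaps the scaler for univariate feature selection (`SelectKBest`) on pure-noise features, because leakage from scaling alone is often too small to see; the data, the seed, and `k=20` are made up.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure noise: no feature carries signal
y = np.tile([0, 1], 50)            # balanced labels, independent of X
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# (a) Leaked: features are selected using ALL rows, test folds included
X_leaked = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaked_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaked, y, cv=cv
).mean()

# (b) Clean: selection is refit inside each training fold via a Pipeline
clean_pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
clean_acc = cross_val_score(clean_pipe, X, y, cv=cv).mean()

print(f"Leaked CV accuracy:    {leaked_acc:.3f}")
print(f"Pipelined CV accuracy: {clean_acc:.3f}")
```

Because the features carry no signal, an honest estimate should sit near 50%; any clear excess in the leaked run comes from the selection step having seen the test folds.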

Interview Q2 (Amazon)

Q: You have a classification dataset with 98% negative, 2% positive. How would you set up cross-validation?

A: Use Stratified K-Fold (K=5 or 10) to ensure each fold maintains the 98/2 ratio. Use appropriate metrics (PR-AUC, F1, precision@K) instead of accuracy. If using oversampling (SMOTE), apply it inside each fold only on the training set. Consider GroupKFold if samples have group structure (e.g., multiple transactions per user).
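A minimal sketch of this setup, assuming a synthetic 98/2 dataset from `make_classification` as a stand-in for real data. Class weighting is used here in place of SMOTE (which lives in the separate imbalanced-learn package and is not shown), and PR-AUC is approximated by scikit-learn's `average_precision` scorer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 98% negative / 2% positive problem
X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           weights=[0.98, 0.02], flip_y=0, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Every test fold should keep roughly the global positive rate
fold_pos_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
print("Overall positive rate:", round(float(y.mean()), 4))
print("Per-fold positive rates:", np.round(fold_pos_rates, 4))

# Scaling and the classifier live in one Pipeline, so both are refit per fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pr_auc = cross_val_score(pipe, X, y, cv=cv, scoring="average_precision")
print(f"PR-AUC (average precision): {pr_auc.mean():.3f} +/- {pr_auc.std():.3f}")
```

Any resampling step would follow the same rule as the scaler: it belongs inside the per-fold training step, never before the split.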

Interview Q3 (Netflix)

Q: How would you evaluate a recommendation model offline before A/B testing?

A: Use time-based splits (train on historical interactions, evaluate on future ones). Metrics: NDCG@K, MAP@K, Hit Rate@K. Beyond accuracy, measure diversity, novelty, and coverage. Use replay evaluation: simulate what would have happened if the new model had been deployed, accounting for selection bias in logged data.
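Two of the ranking metrics named above can be sketched from scratch for a single user. The helper names `hit_rate_at_k` and `ndcg_at_k` are hypothetical, relevance is assumed to be binary (an item is either in the user's future interactions or not), and the item IDs are made up.

```python
import numpy as np

def hit_rate_at_k(ranked_items, relevant_items, k):
    """1.0 if any of the top-k recommendations is relevant, else 0.0."""
    return float(any(item in relevant_items for item in ranked_items[:k]))

def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items[:k]]
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), ...
    dcg = sum(g / np.log2(pos + 2) for pos, g in enumerate(gains))
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / np.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: the model's ranking vs. what the user watched later
recommended = ["m7", "m2", "m9", "m4", "m1"]
watched_later = {"m2", "m1"}

print("HitRate@3:", hit_rate_at_k(recommended, watched_later, 3))
print("NDCG@5:   ", round(ndcg_at_k(recommended, watched_later, 5), 4))
```

In an offline evaluation these per-user values would be averaged over all users in the future (test) period.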

Interview Q4 (Flipkart)

Q: Grid search is taking too long. How would you speed it up without sacrificing quality?

A: (1) Switch to RandomizedSearchCV: with 60 random trials there is roughly a 95% chance that at least one lands in the top 5% of the search space (1 − 0.95⁶⁰ ≈ 0.95), which is often close enough to an exhaustive grid. (2) Use Bayesian optimization (Optuna) for intelligent sampling. (3) Use successive halving: evaluate all candidates on small data, keep top 50%, evaluate on more data, repeat. (4) Reduce CV folds (K=3 for screening, K=10 for final). (5) Use n_jobs=-1 for parallelization. (6) Start with coarse grid, then refine around the best region.
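One way to see why a few dozen random trials go a long way: if "good" means the top 5% of the search space, each independent random draw misses it with probability 0.95. The sketch below computes that probability and then runs a small `RandomizedSearchCV`; the helper `prob_hit_top_fraction`, the search range for `C`, and the budget of 20 iterations are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def prob_hit_top_fraction(n_trials, top_fraction=0.05):
    """P(at least one of n independent random draws lands in the top fraction)."""
    return 1.0 - (1.0 - top_fraction) ** n_trials

for n in (10, 30, 60, 100):
    print(f"{n:3d} trials -> P(hit top 5%) = {prob_hit_top_fraction(n):.3f}")

# Random search over a log-uniform range instead of a fixed grid of C values
X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])
search = RandomizedSearchCV(
    pipe,
    param_distributions={"clf__C": loguniform(1e-3, 1e2)},
    n_iter=20,          # fixed budget, independent of how fine the range is
    cv=3,               # fewer folds for quick screening
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
print("Best C:", search.best_params_["clf__C"])
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

The probability argument assumes trials are drawn independently from the same distribution; it says nothing about how large the top 5% region's advantage actually is.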

Interview Q5 (Uber/Ola)

Q: Why can't you use standard K-fold CV for a surge pricing model?

A: Surge pricing depends on temporal patterns (time of day, day of week, events). Standard K-fold ignores time order, and with shuffling it mixes timestamps freely, so the model is trained on observations that come after the ones it is tested on. This is data leakage: the model appears to predict well only because it has already seen the future demand patterns it is being scored on. Use TimeSeriesSplit with expanding or sliding windows instead.
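The difference is easy to inspect on the split indices themselves. The sketch below assumes 24 time-ordered observations (for instance, hourly demand) and compares expanding and sliding `TimeSeriesSplit` windows with a shuffled `KFold`; the sizes are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)  # 24 observations, already in time order

def show(name, folds):
    print(name)
    for train_idx, test_idx in folds:
        sees_future = train_idx.max() > test_idx.min()
        print(f"  train {train_idx.min():2d}..{train_idx.max():2d} "
              f"(n={len(train_idx):2d}) | test {test_idx.min():2d}..{test_idx.max():2d} "
              f"| trains on future: {sees_future}")

# Expanding window: training set grows, always ends before the test block
expanding_folds = list(TimeSeriesSplit(n_splits=4).split(X))
# Sliding window: same, but training is capped at the 8 most recent points
sliding_folds = list(TimeSeriesSplit(n_splits=4, max_train_size=8).split(X))
# Shuffled K-fold: training indices are scattered around the test indices
shuffled_folds = list(KFold(n_splits=4, shuffle=True, random_state=0).split(X))

show("Expanding TimeSeriesSplit", expanding_folds)
show("Sliding TimeSeriesSplit (max_train_size=8)", sliding_folds)
show("Shuffled KFold", shuffled_folds)
```

Only the two time-series splitters guarantee that every training index precedes every test index.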

Interview Q6 (Goldman Sachs)

Q: How do you determine if the difference between two models' CV scores is statistically significant?

A: Use the paired t-test or Wilcoxon signed-rank test on the K fold-wise score differences. For the t-test: compute d̄ and s_d from fold differences, test statistic t = d̄/(s_d/√K), compare with t-distribution(K-1 df). Use the Nadeau-Bengio corrected version to account for non-independence of folds. For non-normal distributions, prefer Wilcoxon. Also report confidence intervals, not just p-values.
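Both versions of the test can be sketched in a few lines. The fold scores below are hypothetical, as are the fold sizes (5-fold CV on 500 samples, so 400 training and 100 test samples per fold); `corrected_paired_ttest` is an illustrative helper that applies the Nadeau-Bengio variance correction by replacing the factor 1/K with 1/K + n_test/n_train.

```python
import numpy as np
from scipy import stats

def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Paired t-test on fold-wise differences with the Nadeau-Bengio correction."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    k = len(d)
    mean_d = d.mean()
    var_d = d.var(ddof=1)                       # sample variance s_d^2
    # Uncorrected SE^2 is var_d / k; the correction adds n_test / n_train
    se = np.sqrt((1.0 / k + n_test / n_train) * var_d)
    t_stat = mean_d / se
    p_value = 2 * stats.t.sf(abs(t_stat), df=k - 1)   # two-sided
    return t_stat, p_value

# Hypothetical fold-wise accuracies for two models on the SAME 5 folds
scores_a = [0.91, 0.93, 0.90, 0.94, 0.92]
scores_b = [0.89, 0.92, 0.88, 0.91, 0.91]

t_plain, p_plain = stats.ttest_rel(scores_a, scores_b)
t_corr, p_corr = corrected_paired_ttest(scores_a, scores_b,
                                        n_train=400, n_test=100)

print(f"Standard paired t-test:  t = {t_plain:.3f}, p = {p_plain:.4f}")
print(f"Corrected (N&B) t-test:  t = {t_corr:.3f}, p = {p_corr:.4f}")
```

The corrected statistic is always smaller in magnitude than the uncorrected one, so it is the more conservative of the two.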

Interview Q7 (Meta/Facebook)

Q: What's the relationship between regularization and model selection?

A: Regularization (L1/L2) controls model complexity, which is equivalent to selecting from a family of models indexed by the regularization parameter λ. Cross-validating over λ values is a form of model selection — you're choosing the "model" (complexity level) that generalizes best. L1 additionally performs feature selection, effectively choosing among models with different feature subsets. RidgeCV and LassoCV in sklearn combine this naturally.
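A minimal sketch of cross-validating the regularization strength with `RidgeCV` and `LassoCV`. Note that scikit-learn calls the penalty weight `alpha` where this chapter writes λ. The diabetes dataset and the candidate grid are illustrative choices; the dataset's features ship already centered and scaled, which is why no scaler appears here.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, RidgeCV

X, y = load_diabetes(return_X_y=True)

# Ridge (L2): pick alpha from an explicit candidate grid.
# With the default cv=None, RidgeCV uses an efficient leave-one-out scheme.
alphas = np.logspace(-3, 3, 13)
ridge = RidgeCV(alphas=alphas).fit(X, y)

# Lasso (L1): LassoCV builds its own alpha path and scores it with 5-fold CV
lasso = LassoCV(cv=5).fit(X, y)
n_kept = int(np.sum(lasso.coef_ != 0))

print(f"Ridge: selected alpha = {ridge.alpha_:.4g}")
print(f"Lasso: selected alpha = {lasso.alpha_:.4g}, "
      f"features kept = {n_kept} of {X.shape[1]}")
```

Each candidate alpha corresponds to one member of the model family; the Lasso fit additionally reports which feature subset that member uses.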

Interview Q8 (Apple)

Q: Explain the bias-variance tradeoff in the context of K in K-fold CV.

A: Large K (approaching LOOCV): training sets are nearly full-size → low bias (estimate is close to true generalization error), but high variance because training sets overlap heavily, making fold errors correlated. Small K (K=2): each model is trained on only half the data → high (pessimistic) bias, because a model trained on less data typically performs worse than the final model trained on all of it, but lower variance because the training sets overlap less. K=5 or 10 provides a good balance. This is different from the model's own bias-variance tradeoff but follows the same principle.
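The two quantities driving this tradeoff follow from simple counting, sketched below under the assumption of equal-sized folds. Each training set holds K−1 of the K folds, and any two training sets share K−2 folds; `kfold_geometry` is a hypothetical helper and n = 1000 is an arbitrary dataset size.

```python
def kfold_geometry(k):
    """Return (fraction of data used for training, overlap between two training sets)."""
    train_fraction = (k - 1) / k   # drives the bias of the CV estimate
    overlap = (k - 2) / (k - 1)    # shared share of two training sets
    return train_fraction, overlap

n_samples = 1000
for k in (2, 5, 10, n_samples):    # k = n_samples is LOOCV
    train_fraction, overlap = kfold_geometry(k)
    print(f"K={k:4d}: train size = {round(n_samples * train_fraction):4d}, "
          f"overlap between training sets = {overlap:.3f}")
```

At K=2 the two training sets are disjoint, while under LOOCV any two training sets are almost identical, which is why the fold errors become so strongly correlated.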

Interview Q9 (Tesla/Waymo)

Q: How would you validate a self-driving perception model?

A: Use geographic/scenario-based splits (not random): train on some cities/routes, validate on unseen ones. Stratify by driving conditions (rain, night, highway, urban). Respect temporal ordering: never validate on data recorded before the training data. Evaluate per-class (pedestrian detection vs. vehicle detection). Use domain-specific metrics (mAP@IoU). Critical: test on adversarial/edge cases separately. Cross-validate across different sensor conditions.
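A geographic split of this kind can be sketched with scikit-learn's `GroupKFold`, treating the city as the group. The four city names, the random features, and the labels are placeholders standing in for real perception data.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Placeholder data: 25 frames from each of 4 hypothetical cities
cities = np.repeat(["city_A", "city_B", "city_C", "city_D"], 25)
X = rng.normal(size=(len(cities), 5))
y = rng.integers(0, 2, size=len(cities))

# Each fold holds out whole cities, so validation cities are never seen in training
folds = list(GroupKFold(n_splits=4).split(X, y, groups=cities))
for i, (train_idx, test_idx) in enumerate(folds):
    train_cities = sorted(set(cities[train_idx]))
    test_cities = sorted(set(cities[test_idx]))
    print(f"Fold {i}: train on {train_cities}, validate on {test_cities}")
```

The same pattern applies to any grouping variable, such as route, vehicle, or recording session.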

Interview Q10 (Startup CTO)

Q: You're a startup with limited compute. How do you approach model selection pragmatically?

A: (1) Start with baselines: logistic regression or simple random forest. (2) Use 5-fold CV (not 10) to save time. (3) Random search with 30-50 trials instead of grid search. (4) Focus on the 2-3 most important hyperparameters (learning rate, regularization strength, tree depth) — hyperparameter importance analysis shows most other parameters have minimal impact. (5) Use early stopping. (6) Only run full evaluation on the top 2-3 candidates from quick screening.

🔬

Research Problems

Research Problem 1: Adaptive K Selection

Open Question: Can we develop a method that automatically selects the optimal K for K-fold CV based on dataset properties (size, noise level, class distribution)? Current practice (K=5 or 10) is a heuristic. Design and experimentally validate an adaptive algorithm.

Starting Points: Kohavi (1995), Arlot & Celisse (2010), Rodriguez et al. (2010).

Deliverable: A function select_k(X, y) that returns the recommended K, with empirical validation on 20+ UCI datasets.

Research Problem 2: Fair Model Selection

Open Question: Standard model selection optimizes for accuracy or AUC. How should model selection be modified when fairness constraints (demographic parity, equalized odds) must be satisfied? Develop a multi-objective cross-validation framework that balances accuracy and fairness.

Context: Critical for Indian applications: Aadhaar verification, loan approvals, criminal justice risk assessment where bias against marginalized communities could be amplified by biased model selection.

Research Problem 3: Meta-Learning for Hyperparameter Initialization

Open Question: Can we learn a mapping from dataset meta-features (number of samples, features, class imbalance ratio, etc.) to good initial hyperparameters for Bayesian optimization warm-starting? Implement and evaluate a meta-learning system using a database of past optimization runs.

Benchmark: Compare against cold-start Optuna on 50 OpenML datasets. Target: achieve equivalent accuracy with 50% fewer trials.

Research Problem 4: Cross-Validation for Graph Data

Open Question: Standard CV methods assume i.i.d. data. How should CV be conducted for graph-structured data (social networks, molecular graphs) where nodes/edges have complex dependencies? Develop and validate a graph-aware CV method.

⭐

Key Takeaways

Cross-validation is essential, not optional. A single train/test split gives unreliable estimates. K-Fold CV (K=5 or 10) provides multiple estimates with a measure of variability, giving you a much more reliable picture of your model's true performance.

Choose the right CV strategy for your data. Stratified K-Fold for classification, TimeSeriesSplit for temporal data, GroupKFold when samples are grouped. Using the wrong strategy gives misleadingly optimistic results.

Data leakage is the #1 enemy. Always use Pipelines to ensure preprocessing (scaling, feature selection, imputation) is fit only on training folds. Leakage creates an illusion of excellent performance that vanishes in production.

Random Search > Grid Search in most cases. With the same computational budget, Random Search explores more of the hyperparameter space and often finds better configurations. Use Bayesian optimization (Optuna) for expensive evaluations.
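The coverage argument can be checked with a small count. The sketch assumes two hyperparameters scaled to the unit interval and a budget of nine model fits; the point is how many distinct values of each single hyperparameter get tried.

```python
from itertools import product

import numpy as np

budget = 9  # same number of model fits for both strategies

# Grid search: 3 values per hyperparameter -> 3 x 3 = 9 configurations
grid_axis = np.linspace(0.0, 1.0, 3)
grid_points = np.array(list(product(grid_axis, grid_axis)))

# Random search: 9 independent draws from the same unit square
rng = np.random.default_rng(0)
random_points = rng.uniform(0.0, 1.0, size=(budget, 2))

distinct_grid = len(np.unique(grid_points[:, 0]))
distinct_random = len(np.unique(random_points[:, 0]))

print(f"Grid search:   {distinct_grid} distinct values of hyperparameter 1")
print(f"Random search: {distinct_random} distinct values of hyperparameter 1")
```

If only one of the two hyperparameters actually matters, the grid has effectively run three informative experiments and random search nine.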

Not all hyperparameters are equal. Hyperparameter importance analysis often shows that 1-2 parameters (typically learning rate and regularization strength) account for most of the performance variation. Focus tuning effort on these first.

Use statistical tests for model comparison. Don't just compare mean CV scores. Use paired t-tests or Wilcoxon signed-rank tests to determine if differences are statistically significant. A 1% accuracy improvement might be noise.

Bias-variance tradeoff guides all decisions. Underfitting (high bias): increase model complexity or add features. Overfitting (high variance): add regularization, get more data, or simplify the model. Learning curves are your diagnostic tool.
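A minimal sketch of reading this diagnosis off scikit-learn's `learning_curve`, without plotting. The two decision trees (a depth-1 stump and an unrestricted tree) and the breast cancer dataset are illustrative choices, and `curve_summary` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

def curve_summary(max_depth):
    """Mean train/validation accuracy at 5 increasing training-set sizes."""
    sizes, train_scores, val_scores = learning_curve(
        DecisionTreeClassifier(max_depth=max_depth, random_state=0),
        X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 5), scoring="accuracy",
    )
    return sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)

sizes, shallow_train, shallow_val = curve_summary(max_depth=1)
_, deep_train, deep_val = curve_summary(max_depth=None)

for name, tr, va in [("depth=1 stump", shallow_train, shallow_val),
                     ("unrestricted tree", deep_train, deep_val)]:
    print(f"{name:18s} | final train = {tr[-1]:.3f}, "
          f"final validation = {va[-1]:.3f}, gap = {tr[-1] - va[-1]:.3f}")
```

A training score well below 1.0 that sits close to the validation score points toward high bias; a near-perfect training score with a persistent gap points toward high variance.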

Regularization IS model selection. Tuning λ in L1/L2 regularization selects from a family of models of varying complexity. Cross-validating λ is one of the most principled approaches to model selection.

The test set is sacred. Use it exactly once — for the final evaluation of your chosen model. Never use it for hyperparameter tuning, feature selection, or model comparison. That's what the validation set (or inner CV) is for.
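The workflow this takeaway describes can be sketched as a held-out test split plus an inner cross-validated search. The dataset, the SVM pipeline, and the small parameter grid are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Lock the test set away before any tuning happens
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])
param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01]}

# All tuning decisions are made with inner CV on the development data only
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="accuracy")
search.fit(X_dev, y_dev)

# The test set is touched exactly once, after the model is chosen
test_acc = search.score(X_test, y_test)
print("Chosen params:", search.best_params_)
print(f"Inner-CV accuracy: {search.best_score_:.4f}")
print(f"Held-out test accuracy: {test_acc:.4f}")
```

The inner-CV score is used for selection and is therefore slightly optimistic; the held-out score is the number to report.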

📚

References & Further Reading

Foundational Papers

Stone, M. (1974). "Cross-Validatory Choice and Assessment of Statistical Predictions." JRSS Series B, 36(2), 111–147.
Geisser, S. (1975). "The Predictive Sample Reuse Method with Applications." JASA, 70(350), 320–328.
Kohavi, R. (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection." IJCAI, 14(2), 1137–1143.
Dietterich, T.G. (1998). "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms." Neural Computation, 10(7), 1895–1923.
Nadeau, C., & Bengio, Y. (2003). "Inference for the Generalization Error." Machine Learning, 52(3), 239–281.

Hyperparameter Optimization

Bergstra, J., & Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization." JMLR, 13, 281–305.
Snoek, J., Larochelle, H., & Adams, R.P. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms." NeurIPS.
Golovin, D., et al. (2017). "Google Vizier: A Service for Black-Box Optimization." KDD, 1487–1495.
Akiba, T., et al. (2019). "Optuna: A Next-generation Hyperparameter Optimization Framework." KDD, 2623–2631.
Li, L., et al. (2017). "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization." JMLR, 18(185), 1–52.

AutoML & Model Selection

Feurer, M., et al. (2015). "Efficient and Robust Automated Machine Learning." NeurIPS, 28, 2962–2970.
Hutter, F., Kotthoff, L., & Vanschoren, J. (Eds.) (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer.
Arlot, S., & Celisse, A. (2010). "A Survey of Cross-Validation Procedures for Model Selection." Statistics Surveys, 4, 40–79.

Textbooks

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Chapter 7 (Model Assessment and Selection). Springer.
Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed. O'Reilly.

Online Resources

Scikit-learn documentation: Cross-validation: evaluating estimator performance — scikit-learn.org/stable/modules/cross_validation.html
Optuna documentation — optuna.readthedocs.io
Google Vizier documentation — cloud.google.com/ai-platform/optimizer/docs
AutoML Benchmark — openml.github.io/automlbenchmark