Cross-Validation, Hyperparameter Tuning & Model Selection
Master the art and science of choosing the right model, tuning it to perfection, and validating that your results will generalize to unseen data.
Learning Objectives
After completing this chapter, you will be able to:
- Explain why train/test split alone is insufficient and articulate the need for cross-validation.
- Implement K-Fold, Stratified K-Fold, Leave-One-Out, and Time-Series Cross-Validation from scratch and via scikit-learn.
- Diagnose underfitting vs. overfitting using bias-variance decomposition and learning curves.
- Execute Grid Search, Random Search, and Bayesian Optimization (Optuna) for hyperparameter tuning.
- Avoid data leakage by building proper Pipeline objects with StandardScaler, feature selection, and model training.
- Compare models rigorously using paired t-tests, Wilcoxon signed-rank tests, and confidence intervals.
- Build end-to-end AutoML pipelines that automate model selection and hyperparameter optimization.
- Tune regularization parameters (L1/L2 λ) as a model selection mechanism.
- Analyze hyperparameter importance to prioritize which parameters to tune first.
- Apply these techniques to real-world scenarios across Indian and global industry contexts.
Model selection and cross-validation questions appear in nearly every ML interview and exam. Be prepared to explain why random shuffling breaks time-series validation and how data leakage invalidates your entire experiment.
Introduction
Building a machine learning model is only half the story. The other half, arguably the harder half, is figuring out which model to build, what settings to use, and how confident you can be that your model will work on data it has never seen before.
Consider a data scientist at Flipkart trying to predict customer churn. She has access to Random Forests, Gradient Boosting, SVMs, and Neural Networks. Each model has dozens of adjustable knobs (hyperparameters). A Random Forest alone has parameters like n_estimators, max_depth, min_samples_split, and max_features. With just 4 parameters and 5 choices each, that's 5⁴ = 625 possible combinations for a single model type.
This chapter addresses three interlinked questions:
- Validation: How do we reliably estimate how well a model will perform on unseen data? (Cross-validation)
- Optimization: How do we efficiently search the vast space of hyperparameters? (Hyperparameter tuning)
- Selection: How do we compare multiple models and choose the best one with statistical confidence? (Model selection)
These are not academic exercises; they are the difference between a model that works in a notebook and one that works in production. A model that achieves 95% accuracy in training but 70% in production is worse than useless; it's dangerous, because it creates false confidence.
I tell my students: "Training a model is like cooking in your kitchen. Cross-validation is like having strangers taste your food. Grid search is like adjusting your recipe systematically. Model selection is choosing between your best dishes for the restaurant menu." The goal isn't to impress yourself; it's to satisfy the real-world diner.
This chapter forms the bridge between learning algorithms (Chapters 4–16) and deploying them in production (Chapter 18+). Without these techniques, every model you've built so far is just an optimistic guess.
Historical Background
The quest for robust model evaluation has evolved over more than a century:
The Holdout Era (Early 1900s–1960s)
The simplest validation idea, splitting data into train and test sets, dates back to early statistical practice. However, results were highly variable depending on which data ended up in which partition. A "lucky" split could make a bad model look great.
Cross-Validation Emerges (1960s–1970s)
M. Stone (1974) formalized cross-validation as a model assessment technique, and in later work (1977) connected it to Akaike's information criterion. Allen (1974) independently proposed the PRESS (Predicted Residual Sum of Squares) statistic, essentially leave-one-out CV. Geisser (1975) established the theoretical foundations of predictive inference using cross-validation.
K-Fold Becomes Standard (1980s–1990s)
Researchers found that K=5 or K=10 offered good bias-variance tradeoffs. Kohavi (1995) published a landmark empirical study comparing bootstrap, holdout, and K-fold methods, establishing K=10 stratified cross-validation as the recommended default. Dietterich (1998) proposed the 5×2 cross-validation paired t-test for model comparison.
Hyperparameter Optimization Revolution (2000s–2010s)
Bergstra & Bengio (2012) demonstrated that random search is often more efficient than grid search, a result that surprised many practitioners. Snoek, Larochelle & Adams (2012) brought Bayesian optimization to ML with their NIPS paper on Gaussian-process-based hyperparameter tuning. Google Vizier (2017) operationalized these ideas as a cloud service.
AutoML & Modern Methods (2015–Present)
Auto-sklearn (Feurer et al., 2015) combined Bayesian optimization with meta-learning. Optuna (Akiba et al., 2019) introduced an efficient define-by-run API for hyperparameter search. Today, frameworks like Google AutoML, H2O AutoML, and TCS's iON platform automate the entire pipeline from data to deployed model.
TCS Research developed the iON AutoML platform in the late 2010s, enabling automated model selection for clients across banking, insurance, and retail sectors. Their approach combines meta-learning (learning which models work well on which types of data) with efficient hyperparameter optimization.
Conceptual Explanation
4.1 Why Not Just Use Training Accuracy?
Imagine a student who memorizes every answer in the textbook but can't solve a new problem. Training accuracy measures memorization; we need a metric that measures generalization: the ability to perform well on data the model has never seen.
4.2 The Train/Test Split
The simplest approach: split your dataset into two parts, typically 70–80% for training and 20–30% for testing. The model learns from the training set and is evaluated on the test set.
The test set must never be used during any part of training: not for feature selection, not for scaling, not for imputation. If test data influences any training decision, your evaluation is invalid. This is called data leakage, and it is one of the most common mistakes in applied ML.
Stratified Splitting
When classes are imbalanced (e.g., 95% non-fraud, 5% fraud), a random split might give you a test set with 0% fraud cases! Stratified splitting preserves the class proportions in both sets.
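As a minimal sketch of stratified splitting, the example below uses scikit-learn's train_test_split with its stratify argument. The data is synthetic and the 5% positive rate is only an assumption chosen to mirror the fraud example above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1000 samples, 5% positive class
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 50 + [0] * 950)

# stratify=y keeps the class proportions (approximately) equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"Positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Dropping stratify=y from the call gives a plain random split, where the positive rate in the test set can drift away from 5%.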
4.3 K-Fold Cross-Validation
A single train/test split gives us one estimate of performance, but that estimate depends heavily on which samples happened to land in each partition. K-Fold CV solves this by using every data point for both training and testing:
- Divide the dataset into K equal-sized folds.
- For each fold i (i = 1 to K): use fold i as the validation set; use all other folds for training.
- Record the score from each iteration.
- Report the mean and standard deviation of all K scores.
With K=5, each data point appears in exactly one validation set and four training sets. This gives us 5 estimates instead of 1, along with a measure of variability.
4.4 Stratified K-Fold
Same as K-Fold, but each fold preserves the original class distribution. This is critical for imbalanced datasets and is the default recommendation for classification tasks.
4.5 Leave-One-Out Cross-Validation (LOOCV)
The extreme case: K = N (one fold per data point). Each iteration trains on N-1 samples and tests on 1. LOOCV gives a nearly unbiased estimate of generalization error, but the estimate can have high variance and it is computationally expensive (N full training runs).
4.6 Time-Series Split
For time-series data, standard K-Fold is invalid because some training folds come after the validation fold in time (with or without shuffling), allowing the model to "peek into the future." Time-series CV uses an expanding window: train on periods 1 through t, test on period t+1.
Think of time-series split as simulating what would actually happen in deployment: at each point in time, you only have access to past data. Any validation strategy that violates this temporal ordering is fundamentally flawed.
4.7 The Bias-Variance Tradeoff
The expected test error of any model can be decomposed into three components: bias, variance, and irreducible noise.
- High Bias (Underfitting): Model is too simple, so it can't capture the true pattern. Training error is high. Example: fitting a line to quadratic data.
- High Variance (Overfitting): Model is too complex, so it memorizes noise. Training error is low but test error is high. Example: a degree-20 polynomial on 10 data points.
- Irreducible Noise: Randomness in the data itself (σ²). No model can remove it, so it sets a floor on the achievable test error.
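The two failure modes can be made concrete with a small simulation. The sketch below is an illustration under assumed settings (true function f(x) = x², noise σ = 0.3, 12 training points, polynomial degrees 1 and 8), not a general recipe: it refits each model on many noisy training sets and estimates squared bias and variance empirically.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Assumed true function for this illustration."""
    return x ** 2

sigma = 0.3                           # noise standard deviation (assumption)
x_train = np.linspace(-1, 1, 12)      # fixed training inputs
x_eval = np.linspace(-0.9, 0.9, 50)   # points where predictions are examined
n_repeats = 500                       # number of simulated training sets

def bias2_and_variance(degree):
    """Estimate average squared bias and variance of a polynomial fit."""
    preds = np.empty((n_repeats, x_eval.size))
    for r in range(n_repeats):
        y_train = f(x_train) + rng.normal(0, sigma, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[r] = np.polyval(coefs, x_eval)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - f(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

bias2_lin, var_lin = bias2_and_variance(degree=1)
bias2_poly, var_poly = bias2_and_variance(degree=8)
print(f"degree 1: bias^2={bias2_lin:.4f}, variance={var_lin:.4f}")
print(f"degree 8: bias^2={bias2_poly:.4f}, variance={var_poly:.4f}")
```

In this setup the straight line shows the larger squared bias, and the degree-8 polynomial shows the larger variance.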
4.8 Learning Curves
A learning curve plots model performance (y-axis) against training set size (x-axis) for both training and validation sets:
- Underfitting: Both curves plateau at a high error. More data won't help; you need a more complex model.
- Overfitting: Large gap between training (low error) and validation (high error). More data may help close the gap.
- Good fit: Both curves converge to low error as training size increases.
4.9 Hyperparameter Tuning Strategies
Grid Search
Exhaustively tries every combination of a pre-defined set of hyperparameter values. Guarantees finding the best in the grid but is exponentially expensive: with p parameters and v values each, cost = v^p.
Random Search
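Randomly samples hyperparameter combinations from defined distributions. Bergstra & Bengio (2012) showed empirically that for many problems, random search finds equally good or better hyperparameters with far fewer iterations. The reason is that it tries more distinct values of each hyperparameter, which matters when only a few hyperparameters strongly affect performance.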
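A minimal sketch of random search using scikit-learn's RandomizedSearchCV is shown below. The dataset (Iris), the decision-tree model, and the sampling ranges are illustrative assumptions, chosen only so the example runs quickly.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Distributions to sample from (illustrative ranges, not recommendations)
param_distributions = {
    "max_depth": randint(2, 11),          # integers 2..10
    "min_samples_split": randint(2, 21),  # integers 2..20
    "criterion": ["gini", "entropy"],     # lists are sampled uniformly
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=20,           # number of sampled configurations
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

The budget is set directly through n_iter, so the cost stays at n_iter × cv fits no matter how many hyperparameters are added to the search space.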
Bayesian Optimization (Optuna)
Uses a probabilistic model (surrogate function) to predict which hyperparameter combinations are likely to be best, then intelligently chooses the next point to evaluate. Much more sample-efficient than random search for expensive evaluations.
4.10 Model Comparison with Statistical Tests
Saying "Model A scored 0.85 and Model B scored 0.83" is not enough. We need statistical tests to determine if the difference is significant or just due to random variation in the folds:
- Paired t-test: Assumes scores are normally distributed. Tests if the mean difference across folds is significantly different from zero.
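- Wilcoxon signed-rank test: Non-parametric alternative that doesn't assume normality. Useful for non-normal score distributions, though it has little power with very few folds (with K = 5 the smallest possible two-sided p-value is 0.0625).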
4.11 Pipeline: Preventing Data Leakage
A Pipeline chains preprocessing steps and the final model into a single object. When used inside cross_val_score, the scaler is fit only on the training folds, never on the validation fold. Without a pipeline, it is easy to scale using statistics from the entire dataset, including the validation fold, which is data leakage.
ML Engineer / MLOps Specialist: Companies like Flipkart, PhonePe, and Razorpay actively hire for roles that require expertise in model selection pipelines, A/B testing, and production validation. Salary range in India: ₹18–45 LPA. In global roles (Google, Netflix): $120K–200K+.
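The idea can be sketched with scikit-learn's Pipeline as follows. The dataset and the logistic-regression model are illustrative choices; the step names "scaler" and "clf" are arbitrary labels.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is part of the estimator, so it is re-fit on each training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```

The leaky alternative would be to call StandardScaler().fit_transform(X) on the full dataset first and then cross-validate the classifier alone.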
Mathematical Foundation
5.1 Cross-Validation Error Estimator
Given dataset D with N samples, partition into K folds {F₁, F₂, ..., F_K}. The K-fold CV estimate of generalization error is:
CV_K = (1/K) Σ_{i=1}^{K} (1/|Fᵢ|) Σ_{(x,y)∈Fᵢ} L(y, f̂₋ᵢ(x))
Where f̂₋ᵢ is the model trained on all data except fold i, |Fᵢ| is the number of samples in fold i, and L is the loss function evaluated on fold i.
5.2 Bias-Variance Decomposition
For a regression model f̂(x) estimating the true function f(x) with noise ε ~ N(0, σ²):
E[(y − f̂(x))²] = Bias²[f̂(x)] + Var[f̂(x)] + σ²
5.3 K-Fold CV: Bias and Variance Properties
Let n_train = N × (K-1)/K be the training set size per fold:
- K = N (LOOCV): n_train = N-1 → Low bias (almost the entire dataset is used for training), but high variance (training sets differ by only 1 point, so the fold models are highly correlated).
- K = 2: n_train = N/2 → High bias (only half the data used), but low variance.
- K = 5 or 10: Good compromise between bias and variance. Empirical studies (Kohavi, 1995) confirm this.
5.4 Variance of CV Estimate
Var(CV) = (σ²/K) × [1 + (K-1) × ρ]
Where ρ is the correlation between fold error estimates and σ² is the variance of individual fold errors. Note: as K → N (LOOCV), ρ → 1 because training sets are nearly identical, so variance does not decrease with more folds.
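To get a feel for the formula Var(CV) = (σ²/K) × [1 + (K−1)ρ], the small sketch below evaluates it for a few correlation values. The function name cv_variance and the numbers plugged in are illustrative assumptions.

```python
def cv_variance(sigma2, K, rho):
    """Variance of the mean of K fold errors with common variance sigma2
    and common pairwise correlation rho."""
    return (sigma2 / K) * (1 + (K - 1) * rho)

sigma2 = 0.01  # assumed variance of a single fold error
K = 10
var_independent = cv_variance(sigma2, K, rho=0.0)  # shrinks as 1/K
var_moderate = cv_variance(sigma2, K, rho=0.5)
var_identical = cv_variance(sigma2, K, rho=1.0)    # no shrinkage at all

for rho, v in [(0.0, var_independent), (0.5, var_moderate), (1.0, var_identical)]:
    print(f"rho={rho:.1f}: Var(CV)={v:.4f}")
```

With ρ = 0 averaging over 10 folds cuts the variance by a factor of 10, while with ρ = 1 the averaged estimate is exactly as noisy as a single fold.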
5.5 Paired t-Test for Model Comparison
Given K fold differences dᵢ = scoreᵢ(A) − scoreᵢ(B):
t = d̄ / (s_d / √K)
where d̄ = (1/K) Σ dᵢ, s_d = √[(1/(K-1)) Σ (dᵢ − d̄)²]
Under H₀: E[dᵢ] = 0, t follows a t-distribution with K-1 degrees of freedom.
5.6 Regularization Parameter Selection
For Ridge Regression (L2), the objective with regularization parameter λ:
J(β) = ‖y − Xβ‖² + λ‖β‖²
The optimal λ is found by minimizing CV error across a grid of λ values (typically log-spaced from 10⁻⁴ to 10⁴).
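A minimal sketch of this selection with scikit-learn's RidgeCV follows. Note that scikit-learn calls the regularization strength alpha rather than λ; the synthetic regression data and the 50-point grid are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic regression problem (illustrative)
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Log-spaced grid from 1e-4 to 1e4
alphas = np.logspace(-4, 4, 50)

# cv=5 selects alpha by 5-fold cross-validation
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
best_alpha = model.alpha_
print(f"Selected alpha: {best_alpha:.4g}")
```

If the selected value sits at either end of the grid, that is usually a sign the grid should be widened.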
5.7 Expected Improvement (Bayesian Optimization)
In Bayesian optimization, the acquisition function Expected Improvement (written here for maximization) is:
EI(x) = (μ(x) − f(x⁺)) × Φ(Z) + σ(x) × φ(Z)
where Z = (μ(x) − f(x⁺)) / σ(x)
Here, μ(x) and σ(x) are the mean and standard deviation from the surrogate model (Gaussian Process), f(x⁺) is the best observed value, and Φ/φ are the CDF/PDF of the standard normal.
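The Expected Improvement formula translates almost directly into code. The sketch below assumes a maximization problem and takes the surrogate's predictive mean and standard deviation as plain arrays; the function name and the handling of σ(x) = 0 are choices made for this illustration.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Expected Improvement for maximization.

    mu, sigma: surrogate predictive mean and std at candidate points.
    f_best: best objective value observed so far.
    """
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    sigma = np.atleast_1d(np.asarray(sigma, dtype=float))
    improvement = mu - f_best
    ei = np.zeros_like(mu)
    positive = sigma > 0
    z = improvement[positive] / sigma[positive]
    ei[positive] = (improvement[positive] * norm.cdf(z)
                    + sigma[positive] * norm.pdf(z))
    # With zero predictive uncertainty, EI reduces to max(improvement, 0)
    ei[~positive] = np.maximum(improvement[~positive], 0.0)
    return ei

# Three hypothetical candidates: same mean as the best, certain gain, worse mean
ei_values = expected_improvement(
    mu=[0.90, 1.00, 0.80], sigma=[0.10, 0.00, 0.10], f_best=0.90
)
print(ei_values)
```

The first candidate has no predicted gain on average but still gets positive EI because of its uncertainty, which is how the acquisition function encourages exploration.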
The bias-variance decomposition formula appears in almost every ML exam. Remember: bias measures systematic error (the model is consistently wrong), while variance measures sensitivity to training data (the model changes drastically with different samples).
Formula Derivations
6.1 Deriving the Bias-Variance Decomposition
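Start with the expected MSE for a prediction at point x:
MSE(x) = E[(y − f̂(x))²]
Since y = f(x) + ε where E[ε] = 0, Var(ε) = σ²:
MSE(x) = E[(f(x) + ε − f̂(x))²]
Step 1: Add and subtract E[f̂(x)]:
MSE(x) = E[(f(x) − E[f̂(x)] + E[f̂(x)] − f̂(x) + ε)²]
Step 2: Let A = f(x) − E[f̂(x)] (a constant), B = E[f̂(x)] − f̂(x) (random), C = ε (random, independent of B). Expand (A + B + C)²:
E[(A + B + C)²] = A² + E[B²] + E[C²] + 2A·E[B] + 2A·E[C] + 2E[BC]
Step 3: Simplify using E[B] = 0, E[C] = 0, E[BC] = 0 (ε is independent of f̂):
MSE(x) = A² + E[B²] + E[C²] = (f(x) − E[f̂(x)])² + E[(f̂(x) − E[f̂(x)])²] + σ²
= Bias² + Variance + Noise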
6.2 Deriving the CV Variance Formula
Let eᵢ be the error on fold i. The CV estimate is CV = (1/K) Σ eᵢ:
Var(CV) = (1/K²) × [Σᵢ Var(eᵢ) + Σ_{i≠j} Cov(eᵢ, eⱼ)]
Assume all folds have equal variance σ² and pairwise correlation ρ:
Var(CV) = (1/K²) × [Kσ² + K(K−1)ρσ²] = (σ²/K) × [1 + (K−1)ρ]
Key insight: If ρ = 0 (independent folds), variance decreases as 1/K. But for LOOCV, ρ ≈ 1, so Var(CV) ≈ σ² regardless of K, and the variance doesn't shrink!
6.3 Deriving the Corrected t-Test (Nadeau & Bengio, 2003)
The standard paired t-test underestimates variance because CV folds share training data. The corrected variance:
σ²_corrected = (1/K + n_test/n_train) × s_d²
Where n_test and n_train are the test and train sizes per fold. The corrected t-statistic:
t_corrected = d̄ / √[(1/K + n_test/n_train) × s_d²]
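As a minimal sketch, the correction replaces the 1/K factor in the usual variance of the mean difference with (1/K + n_test/n_train). The function name below, the fold differences, and the fold sizes (80 train, 20 test, as in a 5-fold split of 100 samples) are hypothetical values for illustration.

```python
import numpy as np
from scipy import stats

def corrected_paired_ttest(diffs, n_train, n_test):
    """Nadeau & Bengio style corrected paired t-test on per-fold differences."""
    diffs = np.asarray(diffs, dtype=float)
    k = diffs.size
    mean_diff = diffs.mean()
    var_diff = diffs.var(ddof=1)  # sample variance s_d^2
    # Inflate the variance to account for overlapping training sets
    corrected_var = (1.0 / k + n_test / n_train) * var_diff
    t_stat = mean_diff / np.sqrt(corrected_var)
    p_value = 2 * stats.t.sf(abs(t_stat), df=k - 1)
    return t_stat, p_value

# Hypothetical per-fold score differences (model A minus model B)
diffs = [0.04, 0.03, 0.04, 0.01, 0.02]
t_corr, p_corr = corrected_paired_ttest(diffs, n_train=80, n_test=20)
t_plain = np.mean(diffs) / (np.std(diffs, ddof=1) / np.sqrt(len(diffs)))
print(f"uncorrected t = {t_plain:.3f}, corrected t = {t_corr:.3f}, p = {p_corr:.4f}")
```

The corrected statistic is always smaller in magnitude than the uncorrected one, so borderline "significant" differences can disappear after the correction.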
6.4 Deriving Optimal Lambda for Ridge Regression via CV
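The Ridge solution for a given λ:
β̂(λ) = (XᵀX + λI)⁻¹Xᵀy
For LOOCV, there's an efficient closed form (bypassing N separate model fits):
LOOCV(λ) = (1/N) Σᵢ [(yᵢ − ŷᵢ(λ)) / (1 − Hᵢᵢ(λ))]²
Where H(λ) = X(XᵀX + λI)⁻¹Xᵀ is the hat matrix, ŷ(λ) = H(λ)y are the fitted values from the full-data fit, and Hᵢᵢ(λ) is the i-th diagonal entry. This allows computing LOOCV for all λ values with a single matrix decomposition, an O(Np²) operation instead of O(N²p²).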
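The shortcut can be checked numerically. The sketch below assumes a ridge model without an intercept and a fixed λ, builds the hat matrix explicitly, and compares the closed-form LOOCV error with a brute-force loop over N refits. The data is synthetic and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 40, 5, 1.0
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=N)

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Shortcut: one full-data fit plus the diagonal of the hat matrix
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
residuals = y - H @ y
loocv_shortcut = np.mean((residuals / (1 - np.diag(H))) ** 2)

# Brute force: refit N times, each time leaving one sample out
errors = []
for i in range(N):
    mask = np.arange(N) != i
    beta = ridge_fit(X[mask], y[mask], lam)
    errors.append((y[i] - X[i] @ beta) ** 2)
loocv_brute = np.mean(errors)

print(f"shortcut = {loocv_shortcut:.6f}, brute force = {loocv_brute:.6f}")
```

Forming the full N × N hat matrix is done here only for clarity; for large N one would compute just its diagonal.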
The LOOCV shortcut for Ridge Regression is one of the most elegant results in statistical learning. It shows that sometimes, what seems computationally intractable has a beautiful closed-form solution hiding in the linear algebra.
Worked Numerical Examples
Example 1: 5-Fold Cross-Validation by Hand
Dataset: 10 samples with labels: [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
Task: Compute 5-fold CV accuracy for a hypothetical classifier.
Suppose the classifier produces these fold accuracies:
| Fold | Train Size | Test Size | Predictions | Accuracy |
|---|---|---|---|---|
| 1 | 8 | 2 | [0✓, 1✓] | 2/2 = 1.00 |
| 2 | 8 | 2 | [1✓, 1✗] | 1/2 = 0.50 |
| 3 | 8 | 2 | [1✓, 0✓] | 2/2 = 1.00 |
| 4 | 8 | 2 | [0✓, 0✗] | 1/2 = 0.50 |
| 5 | 8 | 2 | [1✓, 0✓] | 2/2 = 1.00 |
Mean Accuracy = (1.00 + 0.50 + 1.00 + 0.50 + 1.00) / 5 = 4.00 / 5 = 0.80
Std Dev = √[((0.20² + 0.30² + 0.20² + 0.30² + 0.20²) / 4)] = √[(0.04+0.09+0.04+0.09+0.04)/4] = √[0.30/4] = √0.075 ≈ 0.274
Report: Accuracy = 0.80 ± 0.27
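The hand calculation can be checked in a few lines of NumPy. One detail matters: the worked example divides by K−1 (sample standard deviation), which corresponds to ddof=1, whereas NumPy's default std() divides by K.

```python
import numpy as np

fold_acc = np.array([1.00, 0.50, 1.00, 0.50, 1.00])

mean_acc = fold_acc.mean()
std_sample = fold_acc.std(ddof=1)      # divides by K-1, as in the hand calculation
std_population = fold_acc.std()        # NumPy default, divides by K

print(f"Accuracy = {mean_acc:.2f} ± {std_sample:.2f}")
print(f"(population std would be {std_population:.3f})")
```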
Example 2: Grid Search Parameter Space Calculation
Model: Random Forest with the following hyperparameter grid:
| Parameter | Values | Count |
|---|---|---|
| n_estimators | [50, 100, 200, 500] | 4 |
| max_depth | [5, 10, 20, None] | 4 |
| min_samples_split | [2, 5, 10] | 3 |
| max_features | ['sqrt', 'log2'] | 2 |
Total combinations = 4 × 4 × 3 × 2 = 96
With 5-fold CV: Total model fits = 96 × 5 = 480
If each fit takes 30 seconds: Total time = 480 × 30 = 14,400 seconds = 4 hours
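The same count can be obtained programmatically, which is handy before launching an expensive search. This sketch uses scikit-learn's ParameterGrid on the grid from the table; the 30 seconds per fit is the assumption stated in the example.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 200, 500],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2"],
}

n_combinations = len(ParameterGrid(param_grid))
n_fits = n_combinations * 5          # 5-fold CV
hours = n_fits * 30 / 3600           # assuming 30 seconds per fit

print(f"{n_combinations} combinations, {n_fits} fits, about {hours:.1f} hours")
```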
Example 3: Paired t-Test for Model Comparison
Given: 5-fold CV scores for Model A and Model B:
| Fold | Model A | Model B | Difference (dᵢ) |
|---|---|---|---|
| 1 | 0.88 | 0.84 | +0.04 |
| 2 | 0.85 | 0.82 | +0.03 |
| 3 | 0.90 | 0.86 | +0.04 |
| 4 | 0.82 | 0.81 | +0.01 |
| 5 | 0.87 | 0.85 | +0.02 |
d̄ = (0.04 + 0.03 + 0.04 + 0.01 + 0.02) / 5 = 0.14 / 5 = 0.028
s²_d = [(0.012² + 0.002² + 0.012² + 0.018² + 0.008²) / 4] = [(0.000144 + 0.000004 + 0.000144 + 0.000324 + 0.000064) / 4] = 0.00068 / 4 = 0.00017
s_d = √0.00017 ≈ 0.01304
t = 0.028 / (0.01304 / √5) = 0.028 / 0.00583 ≈ 4.80
t_critical (df=4, α=0.05, two-tailed) ≈ 2.776
Since |t| = 4.80 > 2.776 → Reject H₀ → Model A is significantly better than Model B (p < 0.05)
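As a quick sanity check of the arithmetic, the same test can be run with scipy.stats.ttest_rel on the fold scores from the table.

```python
import numpy as np
from scipy import stats

scores_a = np.array([0.88, 0.85, 0.90, 0.82, 0.87])
scores_b = np.array([0.84, 0.82, 0.86, 0.81, 0.85])

# Paired (dependent-samples) t-test on the per-fold scores
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

The t-statistic matches the hand calculation, and the two-sided p-value falls below 0.01.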
Example 4: Random Search Efficiency
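Scenario: 5 hyperparameters, 10 values each. Grid search: 10⁵ = 100,000 combinations. Random search with 60 iterations covers:
P(at least one of 60 trials is in the top 5% of configurations) = 1 − (1 − 0.05)⁶⁰ = 1 − 0.95⁶⁰ ≈ 1 − 0.046 = 0.954
So with 60 random trials, there's about a 95.4% probability that at least one trial lands in the top 5% of the search space. Grid search with 100,000 fits, by contrast, tests only 10 distinct values per parameter, while 60 random trials drawn from continuous ranges can try up to 60 distinct values of each.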
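The probability follows from treating each random trial as an independent draw that lands in the top 5% of configurations with probability 0.05. A short sketch of that calculation, including the smallest number of trials needed to reach 95%:

```python
import math

top_fraction = 0.05   # "top 5%" of the search space
n_trials = 60

# P(at least one trial lands in the top fraction)
p_hit = 1 - (1 - top_fraction) ** n_trials

# Smallest n with 1 - (1 - top_fraction)^n >= 0.95
n_needed = math.ceil(math.log(1 - 0.95) / math.log(1 - top_fraction))

print(f"P(hit) with {n_trials} trials = {p_hit:.4f}")
print(f"Trials needed for 95%: {n_needed}")
```

Note that the result does not depend on the number of hyperparameters, only on the size of the target region and the number of trials.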
Example 5: Time-Series CV Split
Dataset: Monthly sales data for 12 months (JanโDec)
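One way to lay out expanding-window splits for these 12 months is sketched below with scikit-learn's TimeSeriesSplit. The choice of n_splits=3 (which produces 3-month test windows) is an assumption made for illustration, not a requirement of the method.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

months = np.array(["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                   "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])

# n_splits=3 is an illustrative assumption
tscv = TimeSeriesSplit(n_splits=3)

splits = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(months), start=1):
    train_months = months[train_idx].tolist()
    test_months = months[test_idx].tolist()
    splits.append((train_months, test_months))
    print(f"Fold {fold}: train {train_months[0]}..{train_months[-1]} "
          f"| test {test_months[0]}..{test_months[-1]}")
```

Every training window ends before its test window starts, and each fold's training window contains the previous fold's training and test months.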
Implement the paired t-test from Example 3 in Python without using scipy.stats.ttest_rel. Compute the t-statistic and p-value from scratch using only numpy. Compare your result with scipy.
Visual Diagrams
8.1 K-Fold Cross-Validation (K=5)
8.2 Bias-Variance Tradeoff
8.3 Learning Curves
8.4 Grid Search vs Random Search
8.5 Stratified vs Non-Stratified Split
Flowcharts
9.1 Complete Model Selection Pipeline
9.2 Choosing a CV Strategy
9.3 Bayesian Optimization Loop
Python Implementation
10.1 K-Fold Cross-Validation from Scratch
import numpy as np


class KFoldCV:
    """K-Fold Cross-Validation implemented from scratch."""

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state

    def split(self, X):
        """Generate train/test indices for K folds."""
        n_samples = len(X)
        indices = np.arange(n_samples)
        if self.shuffle:
            rng = np.random.RandomState(self.random_state)
            rng.shuffle(indices)
        # Compute fold sizes (handle uneven splits)
        fold_sizes = np.full(self.n_splits, n_samples // self.n_splits)
        fold_sizes[:n_samples % self.n_splits] += 1
        current = 0
        for fold_size in fold_sizes:
            test_idx = indices[current:current + fold_size]
            train_idx = np.concatenate([
                indices[:current],
                indices[current + fold_size:]
            ])
            yield train_idx, test_idx
            current += fold_size

    def cross_val_score(self, model_class, X, y, **model_params):
        """Compute cross-validated accuracy scores."""
        scores = []
        for fold, (train_idx, test_idx) in enumerate(self.split(X)):
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]
            # Create a fresh model for each fold
            model = model_class(**model_params)
            model.fit(X_train, y_train)
            score = np.mean(model.predict(X_test) == y_test)
            scores.append(score)
            print(f"  Fold {fold+1}: Accuracy = {score:.4f}")
        scores = np.array(scores)
        print(f"\n  Mean: {scores.mean():.4f} ± {scores.std():.4f}")
        return scores


# --- Demo with a simple model ---
class SimpleCentroidClassifier:
    """Assign each sample to the class whose feature mean is nearest."""

    def fit(self, X, y):
        # Learn: for each class, compute the mean feature vector
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance from every sample to every class centroid
        distances = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[distances.argmin(axis=1)]


# Generate sample data
np.random.seed(42)
X = np.random.randn(100, 3)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Run 5-Fold CV
print("=" * 50)
print("5-Fold Cross-Validation from Scratch")
print("=" * 50)
kfold = KFoldCV(n_splits=5, shuffle=True, random_state=42)
scores = kfold.cross_val_score(SimpleCentroidClassifier, X, y)
10.2 Stratified K-Fold from Scratch
import numpy as np


class StratifiedKFold:
    """Stratified K-Fold: preserves class distribution in each fold."""

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state

    def split(self, X, y):
        classes = np.unique(y)
        rng = np.random.RandomState(self.random_state)
        # Group indices by class
        class_indices = {}
        for cls in classes:
            idx = np.where(y == cls)[0]
            if self.shuffle:
                rng.shuffle(idx)
            class_indices[cls] = idx
        # Split each class into K folds, then combine
        folds = [[] for _ in range(self.n_splits)]
        for cls in classes:
            idx = class_indices[cls]
            fold_sizes = np.full(self.n_splits, len(idx) // self.n_splits)
            fold_sizes[:len(idx) % self.n_splits] += 1
            current = 0
            for i, size in enumerate(fold_sizes):
                folds[i].extend(idx[current:current + size])
                current += size
        # Generate train/test pairs
        for i in range(self.n_splits):
            test_idx = np.array(folds[i])
            train_idx = np.concatenate([folds[j] for j in range(self.n_splits) if j != i])
            yield train_idx.astype(int), test_idx.astype(int)


# Demo: Imbalanced dataset
np.random.seed(42)
X_imb = np.random.randn(200, 4)
y_imb = np.array([0]*180 + [1]*20)  # 90% vs 10%
print("\n" + "=" * 50)
print("Stratified K-Fold (Imbalanced: 90/10 split)")
print("=" * 50)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X_imb, y_imb)):
    train_dist = np.bincount(y_imb[train_idx])
    test_dist = np.bincount(y_imb[test_idx])
    print(f"Fold {fold+1}: Train {train_dist} | Test {test_dist} | "
          f"Test class-1 ratio: {test_dist[1]/test_dist.sum():.2%}")
10.3 Time-Series Cross-Validation
import numpy as np


class TimeSeriesSplit:
    """Time-series CV with expanding training window."""

    def __init__(self, n_splits=5, min_train_size=None):
        self.n_splits = n_splits
        self.min_train_size = min_train_size

    def split(self, X):
        n_samples = len(X)
        test_size = n_samples // (self.n_splits + 1)
        min_train = self.min_train_size or test_size
        for i in range(self.n_splits):
            # Training window always starts at 0 and grows by one test block per fold
            train_end = min_train + i * test_size
            test_start = train_end
            test_end = test_start + test_size
            if test_end > n_samples:
                break
            train_idx = np.arange(0, train_end)
            test_idx = np.arange(test_start, test_end)
            yield train_idx, test_idx


# Demo
print("\n" + "=" * 50)
print("Time-Series Split (60 months of data)")
print("=" * 50)
X_ts = np.arange(60)  # 60 months
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    print(f"Fold {fold+1}: Train months [0..{train_idx[-1]}] "
          f"({len(train_idx)} months) -> Test months "
          f"[{test_idx[0]}..{test_idx[-1]}] ({len(test_idx)} months)")
10.4 Grid Search from Scratch
from itertools import product

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Note: uses the KFoldCV class defined in Section 10.1


def grid_search_cv(model_class, param_grid, X, y, cv=5):
    """Grid Search with K-Fold Cross-Validation from scratch."""
    # Generate all combinations
    keys = list(param_grid.keys())
    values = list(param_grid.values())
    combinations = list(product(*values))
    print(f"Total combinations: {len(combinations)}")
    print(f"Total model fits: {len(combinations) * cv}\n")
    best_score = -np.inf
    best_params = None
    results = []
    for combo in combinations:
        params = dict(zip(keys, combo))
        # K-Fold CV for this combination (same folds for every combination)
        fold_scores = []
        kf = KFoldCV(n_splits=cv, shuffle=True, random_state=42)
        for train_idx, test_idx in kf.split(X):
            model = model_class(**params)
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            fold_scores.append(accuracy_score(y[test_idx], pred))
        mean_score = np.mean(fold_scores)
        std_score = np.std(fold_scores)
        results.append((params, mean_score, std_score))
        if mean_score > best_score:
            best_score = mean_score
            best_params = params
    # Sort and display top 5
    results.sort(key=lambda x: x[1], reverse=True)
    print("Top 5 Configurations:")
    print("-" * 60)
    for i, (params, mean, std) in enumerate(results[:5]):
        print(f"  {i+1}. Score: {mean:.4f} ± {std:.4f} | {params}")
    print(f"\nBest: {best_params} -> {best_score:.4f}")
    return best_params, best_score


# Run Grid Search
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
param_grid = {
    'max_depth': [2, 3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20],
    'criterion': ['gini', 'entropy']
}
best_params, best_score = grid_search_cv(
    DecisionTreeClassifier, param_grid, X_iris, y_iris, cv=5
)
10.5 Bayesian Optimization with Optuna
# pip install optuna
import optuna
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

# Load data
data = load_breast_cancer()
X, y = data.data, data.target


def objective(trial):
    """Optuna objective: define search space and evaluate."""
    # Choose model type
    model_name = trial.suggest_categorical('model', ['rf', 'gb'])
    if model_name == 'rf':
        params = {
            'n_estimators': trial.suggest_int('rf_n_estimators', 50, 500),
            'max_depth': trial.suggest_int('rf_max_depth', 2, 20),
            'min_samples_split': trial.suggest_int('rf_min_samples_split', 2, 20),
            'max_features': trial.suggest_categorical('rf_max_features',
                                                      ['sqrt', 'log2']),
        }
        model = RandomForestClassifier(**params, random_state=42)
    else:
        params = {
            'n_estimators': trial.suggest_int('gb_n_estimators', 50, 500),
            'max_depth': trial.suggest_int('gb_max_depth', 2, 10),
            'learning_rate': trial.suggest_float('gb_lr', 0.01, 0.3, log=True),
            'subsample': trial.suggest_float('gb_subsample', 0.5, 1.0),
        }
        model = GradientBoostingClassifier(**params, random_state=42)
    # 5-fold CV (stratified by default for classifiers when cv is an integer)
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    return scores.mean()


# Run optimization
study = optuna.create_study(direction='maximize',
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50, show_progress_bar=True)
print("\nBest trial:")
print(f"  Value (accuracy): {study.best_trial.value:.4f}")
print(f"  Params: {study.best_trial.params}")

# Hyperparameter importance
importances = optuna.importance.get_param_importances(study)
print("\nHyperparameter Importance:")
for param, imp in importances.items():
    print(f"  {param}: {imp:.4f}")
10.6 Learning Curves from Scratch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC


def plot_learning_curves(model, X, y, title="Learning Curve"):
    """Plot training and validation learning curves."""
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5, scoring='accuracy',
        n_jobs=-1, random_state=42
    )
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    plt.figure(figsize=(10, 6))
    plt.fill_between(train_sizes, train_mean - train_std,
                     train_mean + train_std, alpha=0.1, color='#059669')
    plt.fill_between(train_sizes, val_mean - val_std,
                     val_mean + val_std, alpha=0.1, color='#0891b2')
    plt.plot(train_sizes, train_mean, 'o-', color='#059669',
             label='Training Score', linewidth=2)
    plt.plot(train_sizes, val_mean, 'o-', color='#0891b2',
             label='Validation Score', linewidth=2)
    plt.xlabel('Training Set Size', fontsize=12)
    plt.ylabel('Accuracy', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.legend(loc='lower right', fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    # Diagnose (the 0.7 and 0.1 thresholds are rough rules of thumb)
    gap = train_mean[-1] - val_mean[-1]
    if val_mean[-1] < 0.7:
        print("UNDERFITTING: Both scores are low. Try a more complex model.")
    elif gap > 0.1:
        print(f"OVERFITTING: Training-validation gap = {gap:.3f}. "
              f"Try regularization or more data.")
    else:
        print(f"GOOD FIT: Gap = {gap:.3f}. Model generalizes well.")


# Example usage
data = load_breast_cancer()
plot_learning_curves(SVC(kernel='rbf', C=1.0), data.data, data.target,
                     title="SVM (RBF) Learning Curve")
10.7 Paired t-Test for Model Comparison
import numpy as np
from scipy import stats
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold


def compare_models(model_a, model_b, X, y, cv=10, alpha=0.05):
    """Compare two models using paired t-test and Wilcoxon test."""
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    scores_a = []
    scores_b = []
    for train_idx, test_idx in skf.split(X, y):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        # Model A: fresh, unfitted copy for this fold
        ma = clone(model_a)
        ma.fit(X_train, y_train)
        scores_a.append(accuracy_score(y_test, ma.predict(X_test)))
        # Model B: evaluated on exactly the same fold
        mb = clone(model_b)
        mb.fit(X_train, y_train)
        scores_b.append(accuracy_score(y_test, mb.predict(X_test)))
    scores_a = np.array(scores_a)
    scores_b = np.array(scores_b)
    differences = scores_a - scores_b
    print("=" * 60)
    print("MODEL COMPARISON REPORT")
    print("=" * 60)
    print(f"Model A: {model_a.__class__.__name__}")
    print(f"  Mean accuracy: {scores_a.mean():.4f} ± {scores_a.std():.4f}")
    print(f"Model B: {model_b.__class__.__name__}")
    print(f"  Mean accuracy: {scores_b.mean():.4f} ± {scores_b.std():.4f}")
    print(f"\nMean difference: {differences.mean():.4f}")
    # Paired t-test
    t_stat, p_value_t = stats.ttest_rel(scores_a, scores_b)
    print(f"\nPaired t-test: t={t_stat:.4f}, p={p_value_t:.4f}")
    # Wilcoxon signed-rank test
    try:
        w_stat, p_value_w = stats.wilcoxon(scores_a, scores_b)
        print(f"Wilcoxon test: W={w_stat:.4f}, p={p_value_w:.4f}")
    except ValueError:
        print("Wilcoxon test: Cannot compute (all differences may be zero)")
    # Decision
    if p_value_t < alpha:
        winner = "A" if differences.mean() > 0 else "B"
        print(f"\nSIGNIFICANT difference (p={p_value_t:.4f} < {alpha})")
        print(f"  -> Model {winner} is statistically better.")
    else:
        print(f"\nNO significant difference (p={p_value_t:.4f} ≥ {alpha})")
        print("  -> Choose the simpler or faster model.")
    return scores_a, scores_b


# Usage
data = load_breast_cancer()
scores_a, scores_b = compare_models(
    RandomForestClassifier(n_estimators=100, random_state=42),
    LogisticRegression(max_iter=1000, random_state=42),
    data.data, data.target, cv=10
)
TensorFlow Implementation
11.1 Neural Network Hyperparameter Tuning with Keras Tuner
# pip install keras-tuner
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load data
data = load_breast_cancer()
X, y = data.data, data.target


# --- Manual K-Fold CV for Neural Networks ---
def nn_cross_validate(X, y, build_fn, epochs=50, n_splits=5):
    """Cross-validate a Keras model with proper data handling."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    fold_scores = []
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        print(f"\n--- Fold {fold+1}/{n_splits} ---")
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        # Scale INSIDE the fold (no data leakage!)
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)  # transform only
        # Build fresh model for each fold
        model = build_fn(input_dim=X_train.shape[1])
        # Train with early stopping
        early_stop = keras.callbacks.EarlyStopping(
            monitor='val_loss', patience=10, restore_best_weights=True
        )
        model.fit(
            X_train_scaled, y_train,
            validation_split=0.15,
            epochs=epochs, batch_size=32,
            callbacks=[early_stop],
            verbose=0
        )
        # Evaluate on the held-out fold
        _, accuracy = model.evaluate(X_test_scaled, y_test, verbose=0)
        fold_scores.append(accuracy)
        print(f"  Fold {fold+1} accuracy: {accuracy:.4f}")
    fold_scores = np.array(fold_scores)
    print(f"\n{'='*50}")
    print(f"CV Result: {fold_scores.mean():.4f} ± {fold_scores.std():.4f}")
    return fold_scores


# --- Model builder ---
def build_model(input_dim, units=64, dropout=0.3, lr=0.001):
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        keras.layers.Dense(units, activation='relu'),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(units // 2, activation='relu'),
        keras.layers.Dropout(dropout / 2),
        keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model


# Run CV
scores = nn_cross_validate(X, y, build_model, epochs=100, n_splits=5)
11.2 Keras Tuner for Automated Hyperparameter Search
import keras_tuner as kt
def build_tunable_model(hp):
    """Define hyperparameter search space for Keras Tuner."""
    model = keras.Sequential()
    # Tune number of layers
    for i in range(hp.Int('n_layers', 1, 4)):
        model.add(keras.layers.Dense(
            units=hp.Int(f'units_{i}', min_value=16, max_value=256, step=16),
            activation=hp.Choice(f'activation_{i}', ['relu', 'tanh', 'selu'])
        ))
        if hp.Boolean(f'dropout_{i}'):
            model.add(keras.layers.Dropout(
                rate=hp.Float(f'dropout_rate_{i}', 0.1, 0.5, step=0.1)
            ))
    model.add(keras.layers.Dense(1, activation='sigmoid'))
    # Tune learning rate
    lr = hp.Float('learning_rate', 1e-4, 1e-2, sampling='log')
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model
# Bayesian optimization tuner
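tuner = kt.BayesianOptimization(
    build_tunable_model,
    objective='val_accuracy',
    max_trials=30,
    directory='tuner_results',
    project_name='breast_cancer_nn'
)
# Hold out a stratified validation set FIRST, then fit the scaler on the
# training part only (scaling all of X before splitting would leak statistics)
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_val_scaled = scaler.transform(X_val)  # transform only
# Search
tuner.search(
    X_tr_scaled, y_tr,
    validation_data=(X_val_scaled, y_val),
    epochs=50,
    batch_size=32,
    callbacks=[keras.callbacks.EarlyStopping(patience=5)]
)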
# Results
print("\nTop 3 Models:")
tuner.results_summary(num_trials=3)
best_model = tuner.get_best_models(num_models=1)[0]
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print(f"\nBest hyperparameters: {best_hp.values}")
When you pass validation_split to model.fit(), Keras takes the last fraction of the samples as the validation set, before any shuffling. If the data is ordered (for example, sorted by class), that validation set can be unrepresentative. Shuffle beforehand, or pass validation_data with an explicit split.
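To make the risk concrete, the sketch below mimics a "last fraction" split on a small synthetic label array that is sorted by class. The array and the helper name tail_split are illustrative assumptions; no Keras call is made.

```python
import numpy as np

# Synthetic labels sorted by class: 80 negatives followed by 20 positives (illustrative only)
y_sorted = np.array([0] * 80 + [1] * 20)

def tail_split(y, validation_split):
    """Mimic a tail split: the LAST fraction of rows becomes the validation set."""
    n_val = int(len(y) * validation_split)
    return y[:-n_val], y[-n_val:]

train_part, val_part = tail_split(y_sorted, 0.2)
print("Unshuffled -> share of positives in train:", train_part.mean(),
      "| in validation:", val_part.mean())

# Shuffling first typically gives both parts a mix of classes
rng = np.random.default_rng(42)
y_shuffled = y_sorted[rng.permutation(len(y_sorted))]
train_shuf, val_shuf = tail_split(y_shuffled, 0.2)
print("Shuffled   -> share of positives in train:", train_shuf.mean(),
      "| in validation:", val_shuf.mean())
```

On the sorted array the training part contains no positives at all, so a model trained on it could never learn the positive class.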
Scikit-Learn Implementation
12.1 cross_val_score - The Swiss Army Knife
from sklearn.model_selection import (
    cross_val_score, StratifiedKFold, LeaveOneOut,
    TimeSeriesSplit, GridSearchCV, RandomizedSearchCV
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import numpy as np
# Load data
X, y = load_breast_cancer(return_X_y=True)
# --- 1. Basic cross_val_score ---
print("=" * 60)
print("1. Basic cross_val_score (5-Fold)")
print("=" * 60)
models = {
    'Logistic Regression': LogisticRegression(max_iter=5000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f" {name:25s}: {scores.mean():.4f} ± {scores.std():.4f}")
# --- 2. Stratified K-Fold with custom CV object ---
print(f"\n{'=' * 60}")
print("2. Stratified K-Fold (10 folds)")
print("=" * 60)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=skf, scoring='accuracy'
)
print(f" RF (10-Fold Stratified): {scores.mean():.4f} ± {scores.std():.4f}")
print(f" Individual fold scores: {np.round(scores, 4)}")
# --- 3. Multiple metrics ---
from sklearn.model_selection import cross_validate
print(f"\n{'=' * 60}")
print("3. Multiple Metrics with cross_validate")
print("=" * 60)
results = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=5,
    scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'],
    return_train_score=True
)
for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
    train = results[f'train_{metric}'].mean()
    test = results[f'test_{metric}'].mean()
    print(f" {metric:12s}: Train={train:.4f} | Test={test:.4f} | "
          f"Gap={train-test:.4f}")
12.2 GridSearchCV - Exhaustive Hyperparameter Search
from sklearn.model_selection import GridSearchCV
# Define pipeline (prevents data leakage!)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])
# Define parameter grid (note: 'clf__' prefix for pipeline)
param_grid = {
    'clf__n_estimators': [50, 100, 200, 300],
    'clf__max_depth': [5, 10, 15, 20, None],
    'clf__min_samples_split': [2, 5, 10],
    'clf__max_features': ['sqrt', 'log2'],
}
# Run Grid Search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
    n_jobs=-1,  # Use all CPU cores
    verbose=1,
    return_train_score=True,
    refit=True  # Refit best model on full training data
)
grid_search.fit(X, y)
# Results
print(f"\n{'='*60}")
print("GRID SEARCH RESULTS")
print(f"{'='*60}")
print(f"Best Score: {grid_search.best_score_:.4f}")
print(f"Best Params: {grid_search.best_params_}")
print(f"Total fits: {grid_search.n_splits_} × "
      f"{len(grid_search.cv_results_['mean_test_score'])} = "
      f"{grid_search.n_splits_ * len(grid_search.cv_results_['mean_test_score'])}")
# Top 5 combinations
import pandas as pd
results_df = pd.DataFrame(grid_search.cv_results_)
top5 = results_df.nsmallest(5, 'rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score', 'mean_fit_time']
]
print("\nTop 5 Configurations:")
for _, row in top5.iterrows():
    print(f" Score: {row['mean_test_score']:.4f} ± {row['std_test_score']:.4f} "
          f"| Time: {row['mean_fit_time']:.2f}s")
12.3 RandomizedSearchCV - Efficient Exploration
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Distributions instead of fixed lists
param_distributions = {
    'clf__n_estimators': randint(50, 500),
    'clf__max_depth': randint(3, 30),
    'clf__min_samples_split': randint(2, 20),
    'clf__max_features': ['sqrt', 'log2'],
    'clf__min_samples_leaf': randint(1, 10),
}
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=60,  # Only 60 random combinations (vs 1000s for grid)
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1,
    return_train_score=True
)
random_search.fit(X, y)
print(f"\n{'='*60}")
print("RANDOMIZED SEARCH RESULTS")
print(f"{'='*60}")
print(f"Best Score: {random_search.best_score_:.4f}")
print(f"Best Params: {random_search.best_params_}")
# Candidates in a comparable 5-parameter grid with 4, 5, 3, 2 and 4 values
n_grid = 4 * 5 * 3 * 2 * 4
print(f"Evaluated {random_search.n_splits_ * 60} model fits "
      f"(such a grid would need {n_grid} × 5 = {n_grid * 5})")
12.4 Complete Pipeline with Data Leakage Prevention
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
# The CORRECT way: everything inside the pipeline
proper_pipeline = Pipeline([
    ('scaler', StandardScaler()),                 # Step 1: Scale
    ('feature_select', SelectKBest(f_classif)),   # Step 2: Feature Selection
    ('pca', PCA()),                               # Step 3: Dimensionality Reduction
    ('clf', LogisticRegression(max_iter=5000))    # Step 4: Classifier
])
# Grid search over PIPELINE parameters.
# PCA needs n_components <= number of selected features (k), so the grid is
# split into sub-grids that contain only valid (k, n_components) pairs.
clf_grid = {
    'clf__C': [0.01, 0.1, 1, 10, 100],
    'clf__penalty': ['l1', 'l2'],
    'clf__solver': ['saga'],
}
pipeline_param_grid = [
    {'feature_select__k': [5], 'pca__n_components': [3, 5], **clf_grid},
    {'feature_select__k': [10], 'pca__n_components': [3, 5, 10], **clf_grid},
    {'feature_select__k': [15, 20, 'all'],
     'pca__n_components': [3, 5, 10, 15], **clf_grid},
]
# This ensures NO data leakage:
# - StandardScaler is fit ONLY on training folds
# - SelectKBest chooses features ONLY from training folds
# - PCA components are learned ONLY from training folds
pipeline_search = GridSearchCV(
    proper_pipeline,
    pipeline_param_grid,
    cv=5, scoring='accuracy',
    n_jobs=-1, verbose=0
)
pipeline_search.fit(X, y)
print(f"Pipeline Best Score: {pipeline_search.best_score_:.4f}")
print(f"Pipeline Best Params: {pipeline_search.best_params_}")
# WARNING: THE WRONG WAY (DATA LEAKAGE):
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X)  # <- Sees ALL data, including test folds!
# X_selected = SelectKBest(f_classif, k=10).fit_transform(X_scaled, y)  # <- Leakage!
# scores = cross_val_score(LogisticRegression(), X_selected, y, cv=5)
# ^^ This score is OPTIMISTICALLY BIASED!
12.5 Regularization Parameter Tuning (L1/L2)
from sklearn.linear_model import LogisticRegressionCV, RidgeCV, LassoCV
import matplotlib.pyplot as plt
# --- Logistic Regression with built-in CV for C ---
lr_cv = LogisticRegressionCV(
    Cs=np.logspace(-4, 4, 50),  # 50 values from 0.0001 to 10000
    cv=10,
    penalty='l2',
    scoring='accuracy',
    max_iter=5000,
    random_state=42
)
lr_cv.fit(X, y)
print(f"Best C (L2): {lr_cv.C_[0]:.4f}")
print(f"Best score: {lr_cv.scores_[1].mean(axis=0).max():.4f}")
# --- Visualize C vs CV Accuracy ---
mean_scores = lr_cv.scores_[1].mean(axis=0)
plt.figure(figsize=(10, 5))
plt.semilogx(np.logspace(-4, 4, 50), mean_scores, 'o-', color='#059669')
plt.axvline(x=lr_cv.C_[0], color='#f43f5e', linestyle='--',
            label=f'Best C = {lr_cv.C_[0]:.4f}')
plt.xlabel('Regularization Parameter C (log scale)')
plt.ylabel('CV Accuracy')
plt.title('L2 Regularization: C vs Cross-Validated Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# --- Ridge Regression with built-in CV ---
from sklearn.datasets import fetch_california_housing
X_reg, y_reg = fetch_california_housing(return_X_y=True)
ridge_cv = RidgeCV(
    alphas=np.logspace(-4, 4, 100),
    cv=10, scoring='neg_mean_squared_error'
)
ridge_cv.fit(X_reg, y_reg)
print(f"\nBest Ridge alpha: {ridge_cv.alpha_:.4f}")
# --- Lasso with CV (automatic feature selection) ---
lasso_cv = LassoCV(cv=10, random_state=42, max_iter=10000)
lasso_cv.fit(X_reg, y_reg)
print(f"Best Lasso alpha: {lasso_cv.alpha_:.6f}")
print(f"Features selected: {np.sum(lasso_cv.coef_ != 0)} / {X_reg.shape[1]}")
In sklearn, C in LogisticRegression is the inverse of regularization strength (C = 1/λ). Small C = strong regularization = simpler model. Large C = weak regularization = more complex model. This is the opposite of the λ convention!
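A quick way to check the direction of C is to fit the same model at a few values and look at the size of the learned weights. The sketch below does this on the breast cancer dataset; the variable names are illustrative, and the features are scaled once up front only to help the solver converge (no score is estimated here, so this is not a CV experiment).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)  # convergence aid for this demo only

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_demo, y_demo)
    # L2 norm of the weight vector: smaller norm = stronger shrinkage
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:>6} (lambda = 1/C = {1 / C:>6}): ||w||_2 = {coef_norms[C]:.3f}")
```

The weight norm should grow as C grows, which is the same as saying it shrinks as λ = 1/C grows.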
Indian Case Studies
Case Study 1: Flipkart's A/B Testing & Model Selection Framework
Challenge: Flipkart runs hundreds of ML models simultaneously for search ranking, product recommendations, delivery time estimation, and fraud detection. Selecting and tuning each model independently was slow and error-prone.
Solution: They built an internal Model Selection Platform (MSP) that:
- Automatically runs Stratified K-Fold CV on candidate models
- Uses Bayesian optimization (similar to Optuna) for hyperparameter tuning
- Performs statistical A/B tests before deploying model updates
- Monitors production performance and triggers retraining when CV scores drift
Result: 40% reduction in model deployment time, 15% improvement in recommendation click-through rates after systematic tuning.
Key Insight: Flipkart uses time-aware splits for their demand forecasting models. They discovered that random K-fold on transaction data was giving optimistic estimates because the model was "seeing the future."
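The "seeing the future" problem can be checked mechanically. The sketch below, on a hypothetical 12-day index, counts how many folds contain a training row that is later than the earliest test row, first for shuffled K-Fold and then for scikit-learn's TimeSeriesSplit. The helper name folds_that_see_future is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

n_days = 12  # hypothetical: one row per day, already sorted by date
X_time = np.arange(n_days).reshape(-1, 1)

def folds_that_see_future(splitter, X):
    """Count folds in which some training row is later than the earliest test row."""
    return sum(int(train.max() > test.min()) for train, test in splitter.split(X))

leaky = folds_that_see_future(KFold(n_splits=4, shuffle=True, random_state=0), X_time)
safe = folds_that_see_future(TimeSeriesSplit(n_splits=4), X_time)
print(f"Shuffled K-Fold folds that train on the future: {leaky} of 4")
print(f"TimeSeriesSplit folds that train on the future: {safe} of 4")
```

TimeSeriesSplit always places the test block after the training block, so its count is zero by construction.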
Case Study 2: TCS AutoML (iON Platform)
Context: TCS developed the iON Cognitive Platform to democratize ML for enterprises that lack dedicated data science teams.
Approach:
- Meta-learning: The system learns from past experiments which model families work best for different data characteristics (number of features, class imbalance ratio, data types)
- Multi-fidelity optimization: Instead of full K-fold CV for every candidate, initial screening uses 2-fold CV with small data subsets. Only promising candidates get full 10-fold evaluation
- Ensemble construction: Top-performing models are automatically combined into weighted ensembles
Deployment: Used by TCS clients in BFSI (Banking, Financial Services, Insurance) sector. One banking client reported 25% reduction in loan default prediction errors after switching from manually-tuned models to AutoML-selected pipelines.
Case Study 3: Aadhaar Biometric Matching - Cross-Validation at Scale
Challenge: UIDAI's biometric matching system must maintain extremely low false acceptance rates (FAR < 0.01%) while keeping false rejection rates manageable. Model selection is literally a matter of national identity.
CV Strategy:
- Group K-Fold: folds are grouped by geographic region to ensure geographic generalization
- Stratified by biometric quality: ensures each fold has representative samples across quality levels
- Repeated CV (3×10-fold): 30 total evaluations per model to reduce variance
Statistical Testing: McNemar's test (not a paired t-test) is used because individual predictions are binary. The significance level is set at α = 0.001 (extremely stringent for a national system).
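McNemar's test looks only at the cases where two models disagree on correctness. A minimal sketch of the exact (binomial) version follows, using only the standard library; the function name mcnemar_exact and the toy predictions are assumptions for illustration, not taken from any production system.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact (binomial) McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts cases A got right and B got wrong,
    and c counts cases A got wrong and B got right.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    # Two-sided p-value under H0: discordant pairs split 50/50
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy data (made up): model A is right on 10 cases that model B gets wrong
y_true = [1] * 20
pred_a = [1] * 20
pred_b = [0] * 10 + [1] * 10
b, c, p = mcnemar_exact(y_true, pred_a, pred_b)
print(f"b={b}, c={c}, p={p:.6f}")
```

With ten discordant cases all in favor of one model, the p-value is 2/1024, or about 0.002. At a threshold of α = 0.001, even this one-sided pattern would not count as significant, which shows how demanding such a threshold is on small samples.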
Case Study 4: PhonePe Fraud Detection Pipeline
Challenge: With 4+ billion monthly UPI transactions, PhonePe needs models that detect fraud in <50ms while maintaining <0.1% false positive rate on legitimate transactions.
Pipeline: StandardScaler → SMOTE (inside CV!) → Feature Selection → XGBoost → Calibration
Key Design: SMOTE (oversampling) is applied inside each CV fold, not before splitting. Applying SMOTE before CV creates synthetic near-copies that can appear in both train and test, which is severe data leakage!
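The leakage mechanism is easiest to see with plain duplication. The sketch below swaps SMOTE for simple random oversampling (copying minority rows), which is a stand-in and not SMOTE itself, and counts how many original rows end up on both sides of a split. All data and helper names are synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
row_id = np.arange(n)            # each original row gets a unique id
y_imb = np.zeros(n, dtype=int)
y_imb[:10] = 1                   # 5% minority class (synthetic)

def oversample(ids, labels, rng):
    """Duplicate minority rows (with replacement) until the classes are balanced."""
    minority = ids[labels == 1]
    n_extra = int((labels == 0).sum() - (labels == 1).sum())
    extra = rng.choice(minority, size=n_extra, replace=True)
    return np.concatenate([ids, extra])

def holdout(ids, rng, test_frac=0.25):
    """Random train/test split of an id array."""
    shuffled = rng.permutation(ids)
    n_test = int(len(ids) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

# WRONG: oversample first, then split, so copies of one row land on both sides
train_bad, test_bad = holdout(oversample(row_id, y_imb, rng), rng)
shared_bad = np.intersect1d(train_bad, test_bad).size

# RIGHT: split first, then oversample the training part only
train_ids, test_ids = holdout(row_id, rng)
train_good = oversample(train_ids, y_imb[train_ids], rng)
shared_good = np.intersect1d(train_good, test_ids).size

print("Rows shared by train and test (oversample before split):", shared_bad)
print("Rows shared by train and test (oversample inside split):", shared_good)
```

SMOTE interpolates between neighbors rather than copying rows, but the same logic applies: synthetic points built from a test row carry that row's information into training.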
Global Case Studies
Case Study 1: Google Vizier - Hyperparameter Optimization as a Service
What: Google's internal black-box optimization service used across the company for hyperparameter tuning. Published at KDD 2017.
Scale: Handles thousands of concurrent optimization studies, from tuning neural machine translation models to optimizing ad auction parameters.
Key innovations:
- Transfer learning: Warm-starts optimization using results from similar past studies
- Early stopping: Terminates unpromising trials early using median stopping rules
- Multi-objective: Can optimize accuracy AND latency simultaneously
- Safe optimization: Handles constraints (e.g., "model must fit in 2GB memory")
Impact: Reduced average tuning time by 3× while finding better hyperparameters than manual tuning.
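The early-stopping idea in the list above can be sketched in a few lines. This is a simplified reading of a median stopping rule for a metric that should be maximized: a running trial is stopped at a given step if its best value so far is below the median of the completed trials' running averages at that step. The function name should_stop and the learning curves are made up for illustration; this is not Vizier's code.

```python
from statistics import median

def should_stop(trial_curve, completed_curves, step):
    """Return True if the running trial looks unpromising at `step` (1-based)."""
    best_so_far = max(trial_curve[:step])
    running_avgs = [sum(curve[:step]) / step
                    for curve in completed_curves if len(curve) >= step]
    if not running_avgs:
        return False  # nothing to compare against yet
    return best_so_far < median(running_avgs)

# Made-up validation accuracies per epoch for three finished trials
completed = [[0.50, 0.60, 0.70], [0.60, 0.70, 0.80], [0.40, 0.50, 0.60]]

stop_weak = should_stop([0.30, 0.40], completed, step=2)
stop_strong = should_stop([0.50, 0.70], completed, step=2)
print("Stop the weak trial:", stop_weak)
print("Stop the strong trial:", stop_strong)
```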
Case Study 2: Netflix - Personalization Model Selection
Challenge: Netflix runs 100s of A/B tests simultaneously. For each recommendation algorithm update, they need rigorous offline evaluation before deploying to users.
Approach:
- Time-based splits: Always train on past viewing history, evaluate on future behavior
- Replay evaluation: Simulates what would have happened if the new model had been deployed in the past
- Interleaving experiments: Before full A/B tests, they "interleave" recommendations from two models on the same page and measure user preference
Scale: Their model selection pipeline evaluates 1000+ model variants weekly, with automated statistical significance testing determining which proceed to A/B testing.
Case Study 3: Tesla - AutoML for Self-Driving
Tesla uses Neural Architecture Search (NAS), an extreme form of model selection where the architecture itself is a hyperparameter. Their Dojo supercomputer runs thousands of architecture evaluations in parallel, using progressive CV-like validation on increasingly larger subsets of their driving dataset.
Case Study 4: OpenAI - Scaling Law-Based Model Selection
OpenAI's scaling laws research (Kaplan et al., 2020) discovered that model performance can be predicted based on model size, dataset size, and compute budget. This enables "model selection" before expensive full training: you can estimate which configuration will perform best using small-scale experiments and extrapolation.
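As a rough illustration of "small-scale experiments and extrapolation", the sketch below fits a power law to a handful of synthetic (model size, loss) pairs and extrapolates to a larger size. The numbers are generated from an assumed curve, not taken from any paper, and the single-variable form loss = a * N^(-b) is a simplification.

```python
import numpy as np

# Synthetic small-scale results (NOT real measurements): loss = a * N**(-b)
true_a, true_b = 5.0, 0.1
model_sizes = np.array([1e6, 3e6, 1e7, 3e7])
losses = true_a * model_sizes ** (-true_b)

# A power law is a straight line in log-log space: log L = log a - b * log N
slope, intercept = np.polyfit(np.log(model_sizes), np.log(losses), deg=1)
b_hat, a_hat = -slope, np.exp(intercept)

def predicted_loss(n_params):
    """Extrapolate the fitted power law to a new model size."""
    return a_hat * n_params ** (-b_hat)

print(f"Fitted exponent b = {b_hat:.3f}")
print(f"Extrapolated loss at 1e9 parameters: {predicted_loss(1e9):.3f}")
```

With real, noisy measurements the fit would carry uncertainty, so extrapolations far beyond the measured range deserve caution.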
Startup Applications
15.1 When Resources Are Limited
Startups typically can't afford Google-scale hyperparameter searches. Practical strategies:
| Stage | Strategy | CV Method | Tuning |
|---|---|---|---|
| MVP (0-1K samples) | Simple models first | LOOCV or 5-fold | Manual + intuition |
| Growth (1K-100K) | Compare 3-5 models | Stratified 5-fold | Random Search (30 trials) |
| Scale (100K+) | Full pipeline | Stratified 10-fold | Bayesian (Optuna) |
15.2 Indian Startup Examples
- Razorpay (Payments): Uses time-series CV for fraud model evaluation; updates models weekly with expanding-window retraining
- Nykaa (Beauty E-commerce): Random Search + 5-fold CV for product recommendation model selection; found that a well-tuned LightGBM outperformed a default neural network
- CRED (FinTech): Group K-Fold CV where groups are individual users, which ensures no user's data appears in both train and test
- Ola (Ride-hailing): Geographic K-Fold for demand prediction: train on some cities, validate on held-out cities
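Group-wise splitting of the kind mentioned in the list above is available in scikit-learn as GroupKFold. The sketch below uses a hypothetical table of 12 transactions from 4 users and verifies that no user appears on both sides of any fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 12 transactions from 4 users (3 rows per user)
user_ids = np.repeat(["u1", "u2", "u3", "u4"], 3)
X_tx = np.arange(12).reshape(-1, 1)
y_tx = np.array([0, 1] * 6)

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X_tx, y_tx, groups=user_ids):
    # Users present in both the training and the test part of this fold
    shared = set(user_ids[train_idx]) & set(user_ids[test_idx])
    overlaps.append(len(shared))
    print("test users:", sorted(set(user_ids[test_idx])),
          "| users shared with train:", len(shared))
```

The same pattern covers geographic splits: pass a city or region label as groups instead of a user id.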
Build a "Quick Model Selector" function that takes a dataset and automatically: (1) tries Logistic Regression, Random Forest, SVM, and XGBoost, (2) runs 5-fold Stratified CV for each, (3) returns a ranked leaderboard with mean scores, std, and training time. Complete it in under 30 lines.
Government Applications
16.1 ISRO - Satellite Image Classification
ISRO's remote sensing division uses spatial cross-validation for land-cover classification. Standard K-fold would allow spatially adjacent pixels (which are highly correlated) to appear in both train and test, inflating accuracy. Their solution: block CV where folds correspond to geographic tiles, ensuring spatial independence.
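A minimal sketch of tile-based block CV, assuming synthetic pixel coordinates and a hypothetical 2 x 2 grid of tiles: each pixel is assigned to the tile that contains it, and each fold holds out one whole tile.

```python
import numpy as np

rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, size=(500, 2))  # synthetic pixel coordinates in a 100 x 100 area
tile_size = 50                               # hypothetical tile edge, giving a 2 x 2 grid

# Tile id = row-major index of the grid cell that contains the pixel
cells = (coords // tile_size).astype(int)
tile_id = cells[:, 0] * 2 + cells[:, 1]

def leave_one_tile_out(tile_id):
    """Yield (train_idx, test_idx) with one whole tile held out per fold."""
    for tile in np.unique(tile_id):
        test_mask = tile_id == tile
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

folds = list(leave_one_tile_out(tile_id))
for train_idx, test_idx in folds:
    print(f"held-out tile {tile_id[test_idx][0]}: "
          f"{len(test_idx)} test pixels, {len(train_idx)} train pixels")
```

Pixels close to a tile border can still sit next to training pixels in the neighboring tile, so a buffer strip around the held-out tile is a possible refinement.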
16.2 Ministry of Health - Epidemic Prediction
India's Integrated Disease Surveillance Programme (IDSP) uses time-series CV to validate epidemic forecasting models. They compare ARIMA, Prophet, and ensemble ML models using expanding-window CV on weekly disease incidence data. The paired t-test determines if newer models statistically improve on existing ones before deployment.
16.3 Income Tax Department - Fraud Detection
The IT department uses stratified CV with extreme class imbalance (fraud rates < 0.5%). They found that SMOTE inside CV folds, combined with XGBoost tuned via Bayesian optimization, significantly outperformed manually-tuned rules-based systems.
16.4 Election Commission - Voter Turnout Prediction
The approach uses geographic group CV: train on some constituencies, validate on others. This prevents the model from memorizing constituency-specific patterns and ensures generalization to new electoral scenarios.
Industry Applications
17.1 Manufacturing - Predictive Maintenance
CV Strategy: Group K-Fold (grouped by machine ID). Never allow same machine's data in both train and test.
Tuning: Multi-objective optimization that balances precision (don't trigger unnecessary maintenance) with recall (don't miss failures).
17.2 Pharma - Drug Discovery
CV Strategy: Scaffold splitting, in which molecules are grouped by chemical scaffold. Standard random CV overestimates drug activity prediction because structurally similar molecules end up in both sets.
Tuning: Bayesian optimization with Gaussian Processes, due to extremely expensive evaluations (each "trial" may take hours of molecular simulation).
17.3 Finance - Credit Scoring
CV Strategy: Time-series split (train on past applications, test on future ones) + Stratified for default/non-default balance.
Regulatory requirement: Models must be explainable, so model selection favors logistic regression and decision trees over black-box models, even if black-box models perform slightly better in CV.
17.4 Retail - Demand Forecasting
CV Strategy: Sliding window time-series CV with store-level grouping.
Tuning Priority: Hyperparameter importance analysis reveals that learning_rate and n_estimators matter most for GBM-based forecasters; max_depth has diminishing returns beyond 8.
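A sliding-window split keeps the training window at a fixed length and moves it forward in time, unlike an expanding window that keeps all history. A minimal pure-Python sketch follows; the function name and the window sizes are illustrative assumptions, and store-level grouping would be layered on top (for example by applying the same windows to every store's series).

```python
def sliding_window_splits(n_samples, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) for a fixed-length sliding window."""
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += step

# Hypothetical: 10 weekly observations, train on 4 weeks, test on the next 2
splits = list(sliding_window_splits(n_samples=10, train_size=4, test_size=2))
for train, test in splits:
    print("train weeks", train, "-> test weeks", test)
```

In scikit-learn, TimeSeriesSplit with its max_train_size parameter set gives a similar fixed-length training window.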
In production ML, the choice of CV strategy often matters more than the choice of model. A well-validated simple model beats a poorly-validated complex one every time. Top companies spend 50-70% of model development time on validation infrastructure.
Mini Projects
Mini Project 1: Complete Model Selection Framework
"""
Mini Project 1: AutoModelSelector
A complete framework for automated model selection with:
- Multiple CV strategies
- Statistical model comparison
- Leakage-free pipelines
- Results visualization
"""
import numpy as np
import pandas as pd
from sklearn.model_selection import (
    cross_val_score, StratifiedKFold, TimeSeriesSplit
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier
)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from scipy import stats
import time
class AutoModelSelector:
    """Automated model selection with statistical comparison."""
    def __init__(self, cv_strategy='stratified', n_folds=5,
                 random_state=42):
        self.cv_strategy = cv_strategy
        self.n_folds = n_folds
        self.random_state = random_state
        self.results_ = {}
        self.best_model_ = None
        # Define candidate models (all inside pipelines)
        self.candidates = {
            'LogisticRegression': Pipeline([
                ('scaler', StandardScaler()),
                ('clf', LogisticRegression(max_iter=5000,
                                           random_state=random_state))
            ]),
            'RandomForest': Pipeline([
                ('scaler', StandardScaler()),
                ('clf', RandomForestClassifier(n_estimators=100,
                                               random_state=random_state))
            ]),
            'GradientBoosting': Pipeline([
                ('scaler', StandardScaler()),
                ('clf', GradientBoostingClassifier(
                    random_state=random_state))
            ]),
            'SVM_RBF': Pipeline([
                ('scaler', StandardScaler()),
                ('clf', SVC(kernel='rbf', random_state=random_state))
            ]),
            'KNN': Pipeline([
                ('scaler', StandardScaler()),
                ('clf', KNeighborsClassifier(n_neighbors=5))
            ]),
        }
    def _get_cv(self, y=None):
        if self.cv_strategy == 'stratified':
            return StratifiedKFold(n_splits=self.n_folds,
                                   shuffle=True,
                                   random_state=self.random_state)
        elif self.cv_strategy == 'timeseries':
            return TimeSeriesSplit(n_splits=self.n_folds)
        else:
            return self.n_folds
    def fit(self, X, y, scoring='accuracy'):
        """Evaluate all candidate models."""
        cv = self._get_cv(y)
        print("=" * 70)
        print("AUTO MODEL SELECTION")
        print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
        print(f"CV Strategy: {self.cv_strategy} ({self.n_folds} folds)")
        print("=" * 70)
        for name, pipeline in self.candidates.items():
            start = time.time()
            scores = cross_val_score(pipeline, X, y, cv=cv,
                                     scoring=scoring, n_jobs=-1)
            elapsed = time.time() - start
            self.results_[name] = {
                'mean': scores.mean(),
                'std': scores.std(),
                'scores': scores,
                'time': elapsed
            }
            print(f" {name:25s} | {scores.mean():.4f} ± {scores.std():.4f} "
                  f"| {elapsed:.2f}s")
        # Find best
        self.best_model_ = max(self.results_,
                               key=lambda k: self.results_[k]['mean'])
        print(f"\nBest Model: {self.best_model_} "
              f"({self.results_[self.best_model_]['mean']:.4f})")
        return self
    def compare_top2(self, alpha=0.05):
        """Statistical comparison of top 2 models."""
        sorted_models = sorted(self.results_,
                               key=lambda k: self.results_[k]['mean'],
                               reverse=True)
        m1, m2 = sorted_models[0], sorted_models[1]
        s1 = self.results_[m1]['scores']
        s2 = self.results_[m2]['scores']
        # Paired t-test
        t_stat, p_val = stats.ttest_rel(s1, s2)
        # Wilcoxon
        try:
            w_stat, p_wil = stats.wilcoxon(s1, s2)
        except ValueError:  # e.g. every fold-wise difference is zero
            w_stat, p_wil = None, None
        print(f"\n{'='*70}")
        print(f"STATISTICAL COMPARISON: {m1} vs {m2}")
        print(f"{'='*70}")
        print(f" {m1}: {s1.mean():.4f} ± {s1.std():.4f}")
        print(f" {m2}: {s2.mean():.4f} ± {s2.std():.4f}")
        print(f" Paired t-test: t={t_stat:.3f}, p={p_val:.4f}")
        if p_wil is not None:
            print(f" Wilcoxon test: W={w_stat:.3f}, p={p_wil:.4f}")
        if p_val < alpha:
            print(f" ✓ Significant difference (p < {alpha})")
            print(f" → Choose {m1}")
        else:
            print(f" ✗ No significant difference (p ≥ {alpha})")
            print(" → Choose simpler/faster model")
    def leaderboard(self):
        """Return results as a sorted DataFrame."""
        rows = []
        for name, res in self.results_.items():
            rows.append({
                'Model': name,
                'Mean Score': res['mean'],
                'Std': res['std'],
                'Time (s)': res['time'],
                'Score per Second': res['mean'] / max(res['time'], 0.01)
            })
        df = pd.DataFrame(rows).sort_values('Mean Score', ascending=False)
        return df.reset_index(drop=True)
# --- Run the framework ---
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
selector = AutoModelSelector(cv_strategy='stratified', n_folds=10)
selector.fit(X, y)
selector.compare_top2()
print("\n" + selector.leaderboard().to_string(index=False))
Mini Project 2: AutoML Pipeline with Optuna
"""
Mini Project 2: End-to-End AutoML Pipeline
Combines model selection + hyperparameter tuning using Optuna.
"""
import optuna
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier,
    AdaBoostClassifier
)
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import numpy as np
import warnings
warnings.filterwarnings('ignore')
X, y = load_breast_cancer(return_X_y=True)
def automl_objective(trial):
    """Unified objective: select model AND tune hyperparameters."""
    # Step 1: Choose model family
    model_type = trial.suggest_categorical(
        'model_type',
        ['lr', 'rf', 'gb', 'svm', 'adaboost']
    )
    # Step 2: Define model-specific hyperparameters
    if model_type == 'lr':
        C = trial.suggest_float('lr_C', 1e-3, 100, log=True)
        penalty = trial.suggest_categorical('lr_penalty', ['l1', 'l2'])
        clf = LogisticRegression(
            C=C, penalty=penalty, solver='saga',
            max_iter=5000, random_state=42
        )
    elif model_type == 'rf':
        clf = RandomForestClassifier(
            n_estimators=trial.suggest_int('rf_n', 50, 300),
            max_depth=trial.suggest_int('rf_depth', 3, 20),
            min_samples_split=trial.suggest_int('rf_split', 2, 15),
            max_features=trial.suggest_categorical(
                'rf_feat', ['sqrt', 'log2']
            ),
            random_state=42
        )
    elif model_type == 'gb':
        clf = GradientBoostingClassifier(
            n_estimators=trial.suggest_int('gb_n', 50, 300),
            max_depth=trial.suggest_int('gb_depth', 2, 10),
            learning_rate=trial.suggest_float('gb_lr', 0.01, 0.3, log=True),
            subsample=trial.suggest_float('gb_sub', 0.5, 1.0),
            random_state=42
        )
    elif model_type == 'svm':
        clf = SVC(
            C=trial.suggest_float('svm_C', 0.1, 100, log=True),
            kernel=trial.suggest_categorical(
                'svm_kernel', ['rbf', 'poly']
            ),
            gamma=trial.suggest_categorical(
                'svm_gamma', ['scale', 'auto']
            ),
            random_state=42
        )
    else:  # adaboost
        clf = AdaBoostClassifier(
            n_estimators=trial.suggest_int('ada_n', 25, 200),
            learning_rate=trial.suggest_float('ada_lr', 0.01, 2.0, log=True),
            random_state=42
        )
    # Step 3: Wrap in pipeline (leak-proof!)
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', clf)
    ])
    # Step 4: Cross-validate
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X, y, cv=cv,
                             scoring='accuracy', n_jobs=-1)
    return scores.mean()
# Run AutoML
optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42)
)
study.optimize(automl_objective, n_trials=100)
# Results
print("=" * 70)
print("AutoML RESULTS")
print("=" * 70)
print(f"Best accuracy: {study.best_value:.4f}")
print(f"Best model type: {study.best_params['model_type']}")
print("Best params:")
for k, v in study.best_params.items():
    print(f" {k}: {v}")
# Show model type distribution
model_counts = {}
for trial in study.trials:
    mt = trial.params.get('model_type', 'unknown')
    if mt not in model_counts:
        model_counts[mt] = {'count': 0, 'best_score': 0}
    model_counts[mt]['count'] += 1
    model_counts[mt]['best_score'] = max(
        model_counts[mt]['best_score'], trial.value or 0
    )
print(f"\nModel Type Distribution (out of {len(study.trials)} trials):")
for mt, info in sorted(model_counts.items(),
                       key=lambda x: x[1]['best_score'],
                       reverse=True):
    print(f" {mt:12s}: {info['count']:3d} trials, "
          f"best = {info['best_score']:.4f}")
Mini Project 3: Hyperparameter Importance Analyzer
"""
Mini Project 3: Hyperparameter Importance Analysis
Determine which hyperparameters matter most for performance.
"""
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from itertools import product
X, y = load_breast_cancer(return_X_y=True)
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 0.85, 1.0],
}
# Run all combinations
results = []
keys = list(param_grid.keys())
values = list(param_grid.values())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("Running parameter sweep...")
for combo in product(*values):
    params = dict(zip(keys, combo))
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', GradientBoostingClassifier(**params, random_state=42))
    ])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
    results.append({**params, 'mean_score': scores.mean()})
# Analyze importance using variance-based method (fANOVA-like)
import pandas as pd
df = pd.DataFrame(results)
print(f"\n{'='*60}")
print("HYPERPARAMETER IMPORTANCE (Marginal Variance Method)")
print(f"{'='*60}")
overall_var = df['mean_score'].var()
importances = {}
for param in keys:
    # Group by this parameter, compute mean score per group
    group_means = df.groupby(param)['mean_score'].mean()
    # Importance = variance of group means / total variance
    param_var = group_means.var()
    importance = param_var / overall_var if overall_var > 0 else 0
    importances[param] = importance
# Sort and display
sorted_imp = sorted(importances.items(), key=lambda x: x[1], reverse=True)
total_imp = sum(v for _, v in sorted_imp)
print(f"\nOverall score range: {df['mean_score'].min():.4f} to "
      f"{df['mean_score'].max():.4f}")
print("\nRanked Importance:")
for param, imp in sorted_imp:
    bar = '█' * int(imp / max(importances.values()) * 30)
    normalized = imp / total_imp * 100 if total_imp > 0 else 0
    print(f" {param:20s} {bar:30s} {normalized:5.1f}%")
# Show best value for each parameter
print("\nBest value for each hyperparameter:")
for param, _ in sorted_imp:
    group_means = df.groupby(param)['mean_score'].mean()
    best_val = group_means.idxmax()
    best_score = group_means.max()
    print(f" {param:20s} = {str(best_val):10s} (avg score: {best_score:.4f})")
End-of-Chapter Exercises
Use scikit-learn's learning_curve function to generate learning curves for Logistic Regression, SVM, and Random Forest on the same dataset. Which model shows the most overfitting?
Multiple Choice Questions
In 10-fold cross-validation on a dataset of 500 samples, each model is trained on how many samples per fold?
Which cross-validation method is most appropriate for stock price prediction?
Bergstra & Bengio (2012) showed that random search is more efficient than grid search primarily because:
Data leakage in cross-validation occurs when:
In the bias-variance tradeoff, increasing model complexity generally:
In scikit-learn, what does C represent in LogisticRegression(C=0.01)?
LOOCV has high variance because:
Which sklearn class prevents data leakage by ensuring preprocessing is fit only on training data?
Bayesian optimization uses which component to decide the next hyperparameter combination to try?
If Model A beats Model B on 7 out of 10 CV folds, can we conclude A is significantly better?
In the Nadeau & Bengio corrected t-test, the correction factor accounts for:
Interview Questions
Q: Explain data leakage with a concrete example. How would you detect it?
A: Data leakage occurs when information from outside the training set inappropriately influences model development. Example: fitting StandardScaler on the entire dataset before CV means the scaler's mean/std include test data statistics. Detection: compare CV scores with a Pipeline (correct) against scores without one (leaked); if the leaked scores are significantly higher, you have leakage. Also check whether production performance is much worse than the CV estimates.
Q: You have a classification dataset with 98% negative, 2% positive. How would you set up cross-validation?
A: Use Stratified K-Fold (K=5 or 10) to ensure each fold maintains the 98/2 ratio. Use appropriate metrics (PR-AUC, F1, precision@K) instead of accuracy. If using oversampling (SMOTE), apply it inside each fold only on the training set. Consider GroupKFold if samples have group structure (e.g., multiple transactions per user).
Q: How would you evaluate a recommendation model offline before A/B testing?
A: Use time-based splits (train on historical interactions, evaluate on future ones). Metrics: NDCG@K, MAP@K, Hit Rate@K. Beyond accuracy, measure diversity, novelty, and coverage. Use replay evaluation: simulate what would have happened if the new model had been deployed, accounting for selection bias in logged data.
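As an illustration of one of the ranking metrics named above, here is a small sketch of NDCG@K. It assumes graded relevance labels, the linear-gain form of DCG, and an ideal ranking built from the same returned list; other conventions (exponential gain, ideal ranking over the full catalogue) give different numbers.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of the items in the order the model ranked them (illustrative values).
ranked_relevance = [3, 2, 3, 0, 1, 2]
score = ndcg_at_k(ranked_relevance, k=6)
print(round(score, 3))
```

For this list the score is about 0.961. A perfectly ordered list scores exactly 1.0.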
Q: Grid search is taking too long. How would you speed it up without sacrificing quality?
A: (1) Switch to RandomizedSearchCV: 60 random trials often match exhaustive grid search, because 60 independent draws have about a 95% chance (1 - 0.95^60) of hitting at least one configuration in the best 5% of the search space. (2) Use Bayesian optimization (Optuna) for intelligent sampling. (3) Use successive halving: evaluate all candidates on small data, keep top 50%, evaluate on more data, repeat. (4) Reduce CV folds (K=3 for screening, K=10 for final). (5) Use n_jobs=-1 for parallelization. (6) Start with coarse grid, then refine around the best region.
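The successive halving idea in point (3) can be sketched in a few lines. This is an illustrative toy, not a library implementation: successive_halving and toy_score are hypothetical names, the "budget" stands for the number of training samples, and the toy score ignores the budget, whereas a real CV score would become less noisy as the budget grows.

```python
import math

def successive_halving(candidates, evaluate, min_budget, max_budget):
    """Keep the top half of the candidates each round while doubling the budget."""
    survivors = list(candidates)
    budget = min_budget
    while len(survivors) > 1 and budget <= max_budget:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
        budget *= 2
    return survivors[0]

spent = []  # records the budget of every evaluation so the total cost can be compared

def toy_score(c_value, budget):
    """Stand-in for a cross-validated score; peaks at C = 1 (assumed for the demo)."""
    spent.append(budget)
    return -abs(math.log10(c_value))

grid = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
best = successive_halving(grid, toy_score, min_budget=100, max_budget=800)
total_cost = sum(spent)

print("best C:", best)
print("samples processed:", total_cost, "vs exhaustive:", len(grid) * 800)
```

Here the eight candidates cost 8×100 + 4×200 + 2×400 = 2,400 sample-evaluations, against 6,400 if every candidate were evaluated at the full budget of 800.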
Q: Why can't you use standard K-fold CV for a surge pricing model?
A: Surge pricing depends on temporal patterns (time of day, day of week, events). Standard K-fold randomly mixes timestamps, allowing the model to "see" future demand patterns during training. This creates data leakage: the model appears to predict well but has actually just memorized temporal correlations. Use TimeSeriesSplit with expanding or sliding windows instead.
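To make the expanding-window idea concrete, here is a from-scratch sketch of the split logic. The generator expanding_window_splits is a hypothetical helper, intended to mirror the default behaviour of scikit-learn's TimeSeriesSplit; it assumes the samples are already sorted by timestamp.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train, test) index lists where training data always precedes test data."""
    test_size = n_samples // (n_splits + 1)
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        train = list(range(0, test_start))                      # everything before the test block
        test = list(range(test_start, test_start + test_size))  # the next block in time
        yield train, test

splits = list(expanding_window_splits(n_samples=8, n_splits=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each training window grows to include the previous test block, and no test index ever appears earlier in time than a training index.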
Q: How do you determine if the difference between two models' CV scores is statistically significant?
A: Use the paired t-test or Wilcoxon signed-rank test on the K fold-wise score differences. For the t-test: compute d̄ and s_d from fold differences, test statistic t = d̄/(s_d/√K), compare with a t-distribution with K-1 degrees of freedom. Use the Nadeau-Bengio corrected version to account for non-independence of folds. For non-normal distributions, prefer Wilcoxon. Also report confidence intervals, not just p-values.
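A worked sketch of both statistics on made-up fold differences (the ten values below are illustrative, not results from any experiment). The corrected version replaces the 1/K factor in the variance with 1/K + n_test/n_train, the form in which the Nadeau-Bengio correction is commonly applied to K-fold results. The function name paired_t_statistics is hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t_statistics(diffs, n_train, n_test):
    """Return (naive t, Nadeau-Bengio corrected t) for fold-wise score differences."""
    k = len(diffs)
    d_bar = mean(diffs)
    s_d = stdev(diffs)  # sample standard deviation (K-1 in the denominator)
    t_naive = d_bar / (s_d * math.sqrt(1 / k))
    t_corrected = d_bar / (s_d * math.sqrt(1 / k + n_test / n_train))
    return t_naive, t_corrected

# Score of model A minus score of model B on each of 10 folds (illustrative values).
diffs = [0.02, 0.01, 0.03, -0.01, 0.02, 0.00, 0.01, 0.02, -0.01, 0.01]
t_naive, t_corrected = paired_t_statistics(diffs, n_train=900, n_test=100)

T_CRIT = 2.262  # two-sided 5% critical value of the t-distribution with 9 degrees of freedom
print(f"naive t = {t_naive:.2f}, significant: {abs(t_naive) > T_CRIT}")
print(f"corrected t = {t_corrected:.2f}, significant: {abs(t_corrected) > T_CRIT}")
```

On these numbers the naive statistic is about 2.37 and clears the 5% threshold, while the corrected statistic is about 1.63 and does not. The same data can look significant or not depending on whether the overlap between training sets is accounted for.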
Q: What's the relationship between regularization and model selection?
A: Regularization (L1/L2) controls model complexity, which is equivalent to selecting from a family of models indexed by the regularization parameter λ. Cross-validating over λ values is a form of model selection: you're choosing the "model" (complexity level) that generalizes best. L1 additionally performs feature selection, effectively choosing among models with different feature subsets. RidgeCV and LassoCV in sklearn combine this naturally.
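The "family of models indexed by λ" view can be illustrated with the simplest possible case: ridge regression with one feature and no intercept, where the solution has the closed form w = Σxy / (Σx² + λ). The data, the λ grid, and the helper names below are assumptions made for the sketch.

```python
import random

def ridge_weight(xs, ys, lam):
    """Closed-form ridge solution for one feature and no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_mse(xs, ys, lam, k=5):
    """Mean squared error of ridge with penalty lam, averaged over k interleaved folds."""
    n = len(xs)
    fold_scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        w = ridge_weight(train_x, train_y, lam)  # fit on the training part only
        errors = [(ys[i] - w * xs[i]) ** 2 for i in test_idx]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / k

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]  # synthetic data: true slope 2, noisy

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(xs, ys, lam) for lam in lambdas}
best_lambda = min(scores, key=scores.get)

for lam in lambdas:
    print(f"lambda={lam:<6} CV MSE={scores[lam]:.4f}")
print("selected lambda:", best_lambda)
```

Each λ defines a different model, from unregularized (λ = 0) to almost constant (λ = 100), and cross-validation picks among them exactly as it would pick among different algorithms.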
Q: Explain the bias-variance tradeoff in the context of K in K-fold CV.
A: Large K (approaching LOOCV): training sets are nearly full-size, so the estimate has low bias (it is close to the true generalization error of a model trained on all the data), but high variance because the training sets overlap heavily, making fold errors correlated. Small K (K=2): each model is trained on only half the data, so the estimate is pessimistically biased, but its variance is lower because the folds are more independent. K=5 or 10 provides a good balance. This is different from the model's own bias-variance tradeoff but follows the same principle.
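The two quantities behind this argument follow directly from K. Each training set holds (K-1)/K of the data, and any two training sets share (K-2)/(K-1) of their samples. A short sketch, with the hypothetical helper fold_geometry, tabulates both:

```python
from fractions import Fraction

def fold_geometry(k):
    """Return (fraction of data used for training, overlap between two training sets)."""
    train_fraction = Fraction(k - 1, k)
    overlap = Fraction(k - 2, k - 1)
    return train_fraction, overlap

for k in (2, 5, 10, 1000):
    train_fraction, overlap = fold_geometry(k)
    print(f"K={k:<5} train fraction={float(train_fraction):.3f} "
          f"overlap between training sets={float(overlap):.3f}")
```

At K=2 the two training sets are disjoint and each sees half the data. At K=10 each sees 90% and any two share about 89% of their samples, which is why the fold errors are strongly correlated.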
Q: How would you validate a self-driving perception model?
A: Use geographic/scenario-based splits (not random): train on some cities/routes, validate on unseen ones. Stratify by driving conditions (rain, night, highway, urban). Respect temporal ordering: never validate on data that predates the training set. Evaluate per-class (pedestrian detection vs. vehicle detection). Use domain-specific metrics (mAP@IoU). Critical: test on adversarial/edge cases separately. Cross-validate across different sensor conditions.
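A minimal sketch of a geography-based split, holding out one whole city at a time. The city labels and the generator leave_one_group_out are placeholders invented for the example; each list entry represents one recorded drive.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx), holding out one whole group per split."""
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx

# One (placeholder) city label per recorded drive.
cities = ["city_a", "city_a", "city_b", "city_b", "city_b", "city_c", "city_c"]
splits = list(leave_one_group_out(cities))

for held_out, train_idx, test_idx in splits:
    print(f"validate on {held_out}: train={train_idx} test={test_idx}")
```

Because a city never appears on both sides of a split, the validation score reflects performance on genuinely unseen locations rather than on familiar streets.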
Q: You're a startup with limited compute. How do you approach model selection pragmatically?
A: (1) Start with baselines: logistic regression or simple random forest. (2) Use 5-fold CV (not 10) to save time. (3) Random search with 30-50 trials instead of grid search. (4) Focus on the 2-3 most important hyperparameters (learning rate, regularization strength, tree depth); hyperparameter importance analysis typically shows that most other parameters have minimal impact. (5) Use early stopping. (6) Only run full evaluation on the top 2-3 candidates from quick screening.
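Points (3) and (4) can be sketched as a small random search over three hyperparameters. Everything here is illustrative: the search ranges, the function names, and especially toy_objective, which stands in for a real 5-fold CV score. Scale-type parameters are sampled log-uniformly so that each order of magnitude is equally likely.

```python
import math
import random

def sample_config(rng):
    """Draw one configuration; scale parameters are sampled on a log scale."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform in [1e-4, 1e-1]
        "reg_strength": 10 ** rng.uniform(-3, 2),    # log-uniform in [1e-3, 1e2]
        "max_depth": rng.randint(2, 10),             # uniform integer, inclusive
    }

def random_search(objective, n_trials, seed=0):
    """Evaluate n_trials random configurations and return the best one found."""
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_objective(config):
    """Stand-in for a cross-validated score; assumed to peak at lr=1e-2, depth=6."""
    return (-abs(math.log10(config["learning_rate"]) + 2)
            - 0.1 * abs(config["max_depth"] - 6))

best_config, best_score = random_search(toy_objective, n_trials=40)
print(best_config, round(best_score, 3))
```

In a real project the objective would wrap a Pipeline and a 5-fold cross-validation call, and only the best two or three configurations would go on to a full evaluation.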
Research Problems
Open Question: Can we develop a method that automatically selects the optimal K for K-fold CV based on dataset properties (size, noise level, class distribution)? Current practice (K=5 or 10) is a heuristic. Design and experimentally validate an adaptive algorithm.
Starting Points: Kohavi (1995), Arlot & Celisse (2010), Rodriguez et al. (2010).
Deliverable: A function select_k(X, y) that returns the recommended K, with empirical validation on 20+ UCI datasets.
Open Question: Standard model selection optimizes for accuracy or AUC. How should model selection be modified when fairness constraints (demographic parity, equalized odds) must be satisfied? Develop a multi-objective cross-validation framework that balances accuracy and fairness.
Context: Critical for Indian applications: Aadhaar verification, loan approvals, criminal justice risk assessment where bias against marginalized communities could be amplified by biased model selection.
Open Question: Can we learn a mapping from dataset meta-features (number of samples, features, class imbalance ratio, etc.) to good initial hyperparameters for Bayesian optimization warm-starting? Implement and evaluate a meta-learning system using a database of past optimization runs.
Benchmark: Compare against cold-start Optuna on 50 OpenML datasets. Target: achieve equivalent accuracy with 50% fewer trials.
Open Question: Standard CV methods assume i.i.d. data. How should CV be conducted for graph-structured data (social networks, molecular graphs) where nodes/edges have complex dependencies? Develop and validate a graph-aware CV method.
Key Takeaways
References & Further Reading
Foundational Papers
- Stone, M. (1974). "Cross-Validatory Choice and Assessment of Statistical Predictions." JRSS Series B, 36(2), 111-147.
- Geisser, S. (1975). "The Predictive Sample Reuse Method with Applications." JASA, 70(350), 320-328.
- Kohavi, R. (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection." IJCAI, 14(2), 1137-1143.
- Dietterich, T.G. (1998). "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms." Neural Computation, 10(7), 1895-1923.
- Nadeau, C., & Bengio, Y. (2003). "Inference for the Generalization Error." Machine Learning, 52(3), 239-281.
Hyperparameter Optimization
- Bergstra, J., & Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization." JMLR, 13, 281-305.
- Snoek, J., Larochelle, H., & Adams, R.P. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms." NeurIPS.
- Golovin, D., et al. (2017). "Google Vizier: A Service for Black-Box Optimization." KDD, 1487-1495.
- Akiba, T., et al. (2019). "Optuna: A Next-generation Hyperparameter Optimization Framework." KDD, 2623-2631.
- Li, L., et al. (2017). "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization." JMLR, 18(185), 1-52.
AutoML & Model Selection
- Feurer, M., et al. (2015). "Efficient and Robust Automated Machine Learning." NeurIPS, 28, 2962-2970.
- Hutter, F., Kotthoff, L., & Vanschoren, J. (Eds.) (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer.
- Arlot, S., & Celisse, A. (2010). "A Survey of Cross-Validation Procedures for Model Selection." Statistics Surveys, 4, 40-79.
Textbooks
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Chapter 7 (Model Assessment and Selection). Springer.
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
- Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed. O'Reilly.
Online Resources
- Scikit-learn documentation, "Cross-validation: evaluating estimator performance": scikit-learn.org/stable/modules/cross_validation.html
- Optuna documentation: optuna.readthedocs.io
- Google Vizier documentation: cloud.google.com/ai-platform/optimizer/docs
- AutoML Benchmark: openml.github.io/automlbenchmark