Neural Networks & Deep Learning
Chapter 11: Hyperparameter Tuning & ML Strategy
The Art and Science of Finding the Right Knobs to Turn
⏱️ Reading Time: ~2.5 hours | Part III: Training Deep Networks | 🧠 Strategy + Code Chapter
Prerequisites: Chapters 6–10 (Deep Networks, Optimization, Regularization, Batch Norm)
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| 🔵 Remember | Recall the priority ranking of hyperparameters (Tier 1–4) and default recommended values |
| 🔵 Understand | Explain why random search dominates grid search and why logarithmic scale is needed for learning rate |
| 🟢 Apply | Implement a hyperparameter search loop and LR finder from scratch in Python |
| 🟡 Analyze | Perform structured error analysis: break down misclassifications into actionable categories |
| 🟠 Evaluate | Diagnose train/dev mismatch, decide whether a model has avoidable bias or variance problems |
| 🔴 Create | Design an end-to-end ML strategy for a real-world project with mismatched data distributions |
Learning Objectives
By the end of this chapter, you will be able to:
- Rank hyperparameters by importance (Tier 1–4) and justify why learning rate is the single most critical hyperparameter
- Compare grid search vs. random search using the probability argument, and explain when random search is strictly superior
- Apply the coarse-to-fine search strategy and use logarithmic sampling for learning rate and regularization strength
- Design train/dev/test splits appropriate for big-data scenarios (1M+ examples) vs. small-data scenarios
- Diagnose train/dev distribution mismatch and propose corrective strategies (data synthesis, re-weighting)
- Perform structured error analysis by manually inspecting misclassified examples and building error breakdown tables
- Apply orthogonalization: fix high bias first (bigger model, longer training), then fix high variance (more data, regularization)
- Use human-level performance as a proxy for Bayes optimal error to calculate avoidable bias
- Implement a hyperparameter search loop and a fastai-style LR finder from scratch in Python
- Evaluate when to use "Panda" vs. "Caviar" strategy for hyperparameter tuning based on compute budget
Opening Hook
🎛️ Too Many Knobs, Too Little Time
You've built a 12-layer neural network. You sit down to train it and realize there are at least 8 hyperparameters staring at you:
α (learning rate) · hidden units · # layers · epochs · batch size · dropout rate · λ (L2 reg) · β (momentum)
If you try just 5 values for each, that's 5⁸ = 390,625 experiments. At 10 minutes per experiment on a single GPU, that's 7.4 years of compute. Even on a ₹5 lakh cloud budget, you'd burn through it in days.
So how do teams at Flipkart, Ola, and Jio tune models that serve hundreds of millions of users, and do it in weeks, not years?
The answer is not brute force. It's strategy.
Core Concepts
This chapter covers two tightly related topics: (A) Hyperparameter Tuning, the mechanics of finding good values, and (B) ML Strategy, the decision framework for what to work on next. Together, they form the practitioner's toolkit for going from a working prototype to a production-quality model.
The Hyperparameter Landscape
Not all hyperparameters are created equal. Andrew Ng's practical hierarchy, refined across years of production ML projects, ranks them into four tiers of importance:
Tier 1 (tune first):
- Learning Rate (α): the single most important hyperparameter. A 10× change in α can make or break your model.
Tier 2:
- Momentum term (β): typically 0.9, but values like 0.95 or 0.99 can matter
- Number of hidden units: directly controls model capacity
- Mini-batch size: affects gradient noise and training speed
Tier 3:
- Number of layers: depth vs. width trade-off
- Learning rate decay: schedule type and decay factor
Tier 4:
- Adam parameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸; these almost never need tuning
Key Insight: Why Learning Rate is King
Learning rate controls how big each gradient descent step is. Too large → the model overshoots and loss explodes. Too small → the model crawls and never reaches a good minimum in reasonable time. Every other hyperparameter (hidden units, layers, regularization) only shapes the loss landscape, but α determines whether you can even navigate it.
Analogy: Think of tuning a TV. Learning rate is like the power switch and channel selector: get it wrong and you see nothing. Hidden units are like brightness and contrast: they refine the picture. Adam's β₁, β₂ are like the backlight frequency: you'd never touch them unless you're an engineer.
Grid Search vs. Random Search
Grid Search: The Naïve Approach
In grid search, you define a set of values for each hyperparameter and try every combination:
Python
# Grid search: 5 values × 5 values = 25 experiments
# (train_and_evaluate is a placeholder for your own training routine)
learning_rates = [0.001, 0.003, 0.01, 0.03, 0.1]
hidden_units = [64, 128, 256, 512, 1024]
for lr in learning_rates:
    for hu in hidden_units:
        train_and_evaluate(lr=lr, hidden=hu)  # 25 runs
Problem: If learning rate matters much more than hidden units (which it does: Tier 1 vs. Tier 2), then in a 5×5 grid you only test 5 unique learning rates. The other 20 experiments are "wasted" exploring hidden unit values at learning rates that may already be suboptimal.
Random Search: The Better Alternative
In random search, you sample each hyperparameter independently from a range:
Python
import numpy as np
# Random search: 25 experiments, but 25 UNIQUE learning rates!
# (train_and_evaluate is a placeholder for your own training routine)
for trial in range(25):
    lr = 10 ** np.random.uniform(-4, -1)  # log-uniform: 0.0001 to 0.1
    hu = np.random.choice([64, 128, 256, 512, 1024])
    train_and_evaluate(lr=lr, hidden=hu)
Advantage: Now you test 25 unique learning rates instead of just 5. Since α matters most, random search explores the most important dimension much more richly.
The Probability Argument for Random Search
Suppose the optimal learning rate lies in a narrow "sweet spot" that covers 10% of your search range. With grid search using 5 values, and treating each value as an independent 10% chance of landing in the sweet spot, the probability of at least one value falling in it is:
$$P_{\text{grid}} = 1 - (1 - 0.1)^{5} = 1 - 0.9^{5} \approx 0.41$$
With random search using 25 independently sampled points projected onto the LR axis:
$$P_{\text{random}} = 1 - (1 - 0.1)^{25} = 1 - 0.9^{25} \approx 0.93$$
Same total budget (25 experiments), but random search gives you 93% probability of hitting the sweet spot for the most important hyperparameter, vs. only 41% for grid search. This gap widens as the number of hyperparameters increases.
Paper Reference: Bergstra & Bengio (2012), "Random Search for Hyper-Parameter Optimization", showed this both empirically and theoretically. It is one of the most cited hyperparameter tuning papers in ML.
Reality: Grid search is exhaustive only across the full combination, but for the individual axis that matters most, it's extremely wasteful. Random search with the same budget samples more unique values along every individual axis.
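The 41% and 93% figures can be checked numerically. The sketch below is illustrative only: it keeps the simplification used in the argument above (every tested learning rate is an independent draw with a 10% chance of landing in the sweet spot), and the names `hit_probability` and `simulate` are made up for this example.

```python
import numpy as np

def hit_probability(n_unique_values, sweet_spot_fraction=0.10):
    """P(at least one of n independent draws lands in the sweet spot)."""
    return 1.0 - (1.0 - sweet_spot_fraction) ** n_unique_values

def simulate(n_unique_values, sweet_spot_fraction=0.10, n_runs=100_000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = np.random.default_rng(seed)
    # Each row is one search; each column is one tested learning rate,
    # expressed as a position in [0, 1) along the (log-scaled) LR axis.
    draws = rng.random((n_runs, n_unique_values))
    # Put the sweet spot at [0, fraction); its location does not matter
    # for uniformly distributed draws.
    hits = (draws < sweet_spot_fraction).any(axis=1)
    return hits.mean()

grid_p = hit_probability(5)      # 5x5 grid: only 5 unique learning rates
random_p = hit_probability(25)   # 25 random trials: 25 unique learning rates
print(f"grid:   analytic {grid_p:.3f}, simulated {simulate(5):.3f}")
print(f"random: analytic {random_p:.3f}, simulated {simulate(25):.3f}")
```

A real grid is evenly spaced rather than random, so the grid figure is an approximation; the comparison of 5 versus 25 unique values on the important axis is the point.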
Coarse-to-Fine Strategy & Logarithmic Scale
The Coarse-to-Fine Workflow
In practice, you don't run one massive search. You iterate in rounds:
- Round 1 (Coarse): Sample broadly. LR from 10⁻⁴ to 10⁰, hidden units from 32 to 2048. Run ~25 experiments with fewer epochs (5–10).
- Identify the promising region: e.g., best results cluster around LR ∈ [10⁻³, 10⁻²] and hidden units ∈ [256, 512].
- Round 2 (Fine): Zoom into the promising region. LR from 10⁻³ to 10⁻², hidden units from 200 to 600. Run ~25 more experiments with more epochs (20–50).
- Round 3 (Final): Narrow further if needed. Train the top 3 candidates with full epochs and pick the best.
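The "identify the promising region" step can be automated in a simple way. The following is a minimal sketch, not a prescribed method: it takes the top-k trials from a coarse round and returns a padded log10 range around their learning rates. The helper name `narrow_log_range`, the `margin` padding, and the trial numbers are all assumptions made for illustration.

```python
import numpy as np

def narrow_log_range(trials, key="lr", metric="dev_acc", top_k=5, margin=0.25):
    """Return (low_exp, high_exp) in log10 units around the top-k trials.

    trials: list of dicts such as {"lr": 3e-3, "dev_acc": 0.978}
    margin: padding, in decades, added on each side of the best region
    """
    best = sorted(trials, key=lambda t: t[metric], reverse=True)[:top_k]
    exponents = np.log10([t[key] for t in best])
    return float(exponents.min() - margin), float(exponents.max() + margin)

# Hypothetical coarse-round results (illustrative numbers only).
coarse = [
    {"lr": 3.2e-3, "dev_acc": 0.978},
    {"lr": 1.8e-3, "dev_acc": 0.976},
    {"lr": 5.1e-3, "dev_acc": 0.975},
    {"lr": 2.0e-1, "dev_acc": 0.412},
    {"lr": 4.0e-5, "dev_acc": 0.881},
    {"lr": 8.7e-3, "dev_acc": 0.971},
]

low, high = narrow_log_range(coarse, top_k=4)
rng = np.random.default_rng(0)
fine_lrs = 10 ** rng.uniform(low, high, size=25)  # Round 2 samples, log-uniform
print(f"Round 2 LR range: [{10 ** low:.1e}, {10 ** high:.1e}]")
```

Padding the range slightly guards against cutting off a good value that sits just outside the best coarse trials.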
Why Logarithmic Scale for Learning Rate?
Learning rate values that "matter" are spread across orders of magnitude. The difference between 0.0001 and 0.001 is just as significant as the difference between 0.01 and 0.1, since each is a 10× change. If you sampled uniformly from [0.0001, 1], you'd spend 90% of your samples in [0.1, 1] and only about 0.1% of samples in [0.0001, 0.001].
Example: r ~ Uniform(−4, −1) gives α = 10^r ∈ [10⁻⁴, 10⁻¹] = [0.0001, 0.1]
Python
import numpy as np
# ✅ CORRECT: Log-uniform sampling for learning rate
r = np.random.uniform(-4, -1)  # exponent between -4 and -1
alpha = 10 ** r  # α ∈ [0.0001, 0.1]
# ❌ WRONG: Uniform sampling for learning rate
alpha = np.random.uniform(0.0001, 0.1)  # ~90% of samples land in [0.01, 0.1]
# ✅ Log-uniform for β (momentum): sample (1 - β) on log scale
# β ∈ [0.9, 0.999] → (1 - β) ∈ [0.001, 0.1] → r ∈ [-3, -1]
r = np.random.uniform(-3, -1)
beta = 1 - 10 ** r  # β ∈ [0.9, 0.999]
For momentum, do not sample β itself uniformly; sample (1 − β) on a log scale instead.
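To see the effect of the sampling scale directly, the sketch below counts how many samples fall into each decade of [0.0001, 1] under linear and log-uniform sampling. It is an illustration only; the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

uniform_lr = rng.uniform(1e-4, 1.0, size=n)   # linear-scale sampling
log_lr = 10 ** rng.uniform(-4, 0, size=n)     # log-scale sampling

edges = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]         # one bin per decade
uniform_share = np.histogram(uniform_lr, bins=edges)[0] / n
log_share = np.histogram(log_lr, bins=edges)[0] / n

for lo, hi, u, g in zip(edges[:-1], edges[1:], uniform_share, log_share):
    print(f"[{lo:.0e}, {hi:.0e}): uniform {u:6.2%}   log-uniform {g:6.2%}")
```

Linear sampling puts roughly nine samples in ten into the top decade, while log-uniform sampling gives each decade about a quarter of the budget.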
Panda 🐼 vs. Caviar 🐟 Strategy
Two Approaches to Hyperparameter Tuning
🐼 Panda Strategy (Babysit One Model)
When compute is limited (e.g., a single GPU at your university lab), you train one model at a time, carefully watching the loss curve and adjusting hyperparameters day by day. Like a panda caring for a single baby.
Best for: Students, startups with limited GPU budget (e.g., a single NVIDIA RTX 4090 costing ₹1,60,000).
🐟 Caviar Strategy (Spawn Many in Parallel)
When compute is abundant (e.g., a cloud cluster), you launch dozens of experiments simultaneously with different hyperparameters and pick the winner. Like a fish spawning thousands of eggs.
Best for: Companies like TCS, Infosys, Jio with cloud budgets or on-premise GPU clusters.
Train/Dev/Test Split Strategy
The Classical Split
Traditionally (pre-deep learning era, small datasets of 100–10,000 examples):
| Split | Classical Ratio | Purpose |
|---|---|---|
| Train | 60% | Fit model parameters |
| Dev (Validation) | 20% | Tune hyperparameters, model selection |
| Test | 20% | Final unbiased evaluation |
The Big Data Split
In the deep learning era with 1M+ examples, you don't need 200K examples just for dev set. Modern splits:
| Dataset Size | Train | Dev | Test |
|---|---|---|---|
| 10,000 | 60% | 20% | 20% |
| 100,000 | 90% | 5% | 5% |
| 1,000,000 | 98% | 1% | 1% |
| 10,000,000 | 99.5% | 0.25% | 0.25% |
Even 0.25% of 10M = 25,000 examples, which is more than enough to get a reliable performance estimate.
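The absolute counts behind the table are worth printing once, because the percentages hide how large the dev and test sets still are. A small sketch, with a hypothetical helper `split_sizes` and the fractions copied from the table above:

```python
def split_sizes(n_examples, dev_frac, test_frac):
    """Return (train, dev, test) example counts for the given fractions."""
    dev = round(n_examples * dev_frac)
    test = round(n_examples * test_frac)
    return n_examples - dev - test, dev, test

# (dev fraction, test fraction) for each dataset size in the table.
table = {
    10_000: (0.20, 0.20),
    100_000: (0.05, 0.05),
    1_000_000: (0.01, 0.01),
    10_000_000: (0.0025, 0.0025),
}

for n, (dev_frac, test_frac) in table.items():
    train, dev, test = split_sizes(n, dev_frac, test_frac)
    print(f"{n:>10,}: train={train:>9,}  dev={dev:>7,}  test={test:>7,}")
```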
Guidelines for Dev/Test Set Design
Rule 1: Dev and Test Sets Must Come from the Same Distribution
If your dev set is from distribution A and your test set is from distribution B, you're optimizing hyperparameters for the wrong target. It's like practicing archery aiming at one target, then being graded on a different one.
Rule 2: Dev Set Must Be Large Enough to Detect Differences
If algorithm A has 90.0% accuracy and B has 90.1%, you need enough dev examples to tell them apart reliably. Rule of thumb: 1,000–10,000 dev examples is a common starting point, but on a 1,000-example dev set a 0.1% gap is a single example, so differences that small call for 10,000 examples or more.
Rule 3: Test Set is Optional (But Recommended)
If you only need to pick the best model (no unbiased final estimate needed), you can skip the test set. But for papers, competitions, and production, always keep a held-out test set.
Why tuning on the test set is wrong: it turns your test set into a second dev set. You lose the ability to get an unbiased estimate of real-world performance. Many Kaggle beginners make this mistake and are shocked when their leaderboard score drops on private test data.
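Rule 2 can be made concrete with a rough calculation. The sketch below assumes dev examples are independent and uses the binomial standard error of a measured accuracy; the function names are illustrative, and the 90% accuracy is the figure from the example above.

```python
import math

def accuracy_std_error(accuracy, n_dev):
    """Binomial standard error of an accuracy measured on n_dev examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_dev)

def resolution(n_dev):
    """Smallest accuracy change the dev set can register (one example)."""
    return 1.0 / n_dev

for n_dev in (100, 1_000, 10_000, 100_000):
    se = accuracy_std_error(0.90, n_dev)
    print(f"n_dev={n_dev:>7,}: one example = {resolution(n_dev):.3%}, "
          f"std error at 90% accuracy = {se:.3%}")
```

When two models are scored on the same dev examples, their errors are correlated, so a paired comparison can often resolve smaller gaps than this independent-sample estimate suggests.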
Train/Dev Distribution Mismatch
The Real-World Problem
In many production ML systems, your training data comes from a different distribution than what you'll see at inference time.
Why Does Mismatch Happen?
- Data availability: You have millions of web-scraped images but only 10,000 images from your actual camera app
- Temporal shift: Training on historical data, deploying on current data
- Geographic shift: Training on US data, deploying for Indian users
- Platform shift: Training on desktop clicks, deploying on mobile
The Solution: Prioritize Target Distribution for Dev/Test
Always make your dev and test sets reflect the target distribution (what you'll see in production). Use the mismatched (but larger) data for training.
The Train-Dev Set: A Diagnostic Tool
If your training error is 1% and dev error is 10%, is the 9% gap due to variance (overfitting) or distribution mismatch? You can't tell with just train and dev errors.
Solution: Create a "Train-Dev" Set
Carve out a small subset from training data (same distribution as train, but not used for training). Now you can decompose the error:
| Set | Distribution | Used For | Error |
|---|---|---|---|
| Training set | Source | Training | 1% |
| Train-Dev set | Source (held out) | Diagnosis | 9% |
| Dev set | Target | HP tuning | 10% |
Train → Train-Dev gap (1% → 9%): This is variance (the model overfits training data).
Train-Dev → Dev gap (9% → 10%): This is data mismatch (only 1%, negligible).
Conclusion: The main problem here is variance, not data mismatch. Fix with more regularization or more data.
The full error chain, with the name of each gap:
Human-level ≈ Bayes error
↓ (Avoidable Bias)
Training error
↓ (Variance)
Train-Dev error
↓ (Data Mismatch)
Dev error
↓ (Overfitting to Dev Set)
Test error
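The error chain translates directly into a few subtractions. The sketch below is one possible helper (the name `decompose_errors` and its signature are assumptions for this example); it is applied to the Train / Train-Dev / Dev errors from the table above, which does not give a human-level figure, so that gap is left out.

```python
def decompose_errors(train, train_dev, dev, human=None, test=None):
    """Split the error chain into named gaps (all values in the same units)."""
    gaps = {}
    if human is not None:
        gaps["avoidable_bias"] = train - human
    gaps["variance"] = train_dev - train
    gaps["data_mismatch"] = dev - train_dev
    if test is not None:
        gaps["dev_overfitting"] = test - dev
    dominant = max(gaps, key=gaps.get)  # name of the largest gap
    return gaps, dominant

# Errors in percent, taken from the Train / Train-Dev / Dev table above.
gaps, dominant = decompose_errors(train=1.0, train_dev=9.0, dev=10.0)
for name, gap in gaps.items():
    print(f"{name:<16s} {gap:4.1f}%")
print("largest gap:", dominant)
```

Passing `human` and `test` as well yields the full four-gap breakdown.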
Error Analysis
The Power of Manual Inspection
Before spending weeks collecting more data or redesigning your architecture, spend 30–60 minutes manually inspecting 100 misclassified dev-set examples. This simple practice is one of the highest-ROI activities in ML.
Structured Error Analysis Process
- Pull 100 misclassified examples from the dev set
- Create a spreadsheet with columns for each potential error category
- For each example, mark which categories apply
- Count percentages to identify the biggest error sources
- Prioritize the category with the highest ceiling for improvement
Example: Food Classification for Zomato
Imagine you're building a food image classifier for Zomato's photo-based search. You inspect 100 misclassified images:
| Error Category | Count (out of 100) | Ceiling for Improvement |
|---|---|---|
| Blurry/low-quality images | 38 | 38% |
| Multiple food items in one image | 25 | 25% |
| Unusual plating/presentation | 18 | 18% |
| Mislabeled training data | 12 | 12% |
| Rare regional dishes | 7 | 7% |
Conclusion: Working on handling blurry images (data augmentation, super-resolution preprocessing) could fix up to 38% of errors, which makes it the highest-impact area. Don't waste time on rare regional dishes (only 7% ceiling).
Should You Fix Incorrect Labels?
Deep learning algorithms are robust to random label noise in the training set: if errors are random (not systematic), a small percentage (1–2%) of wrong labels usually doesn't hurt much. However:
- Dev/test set labels must be correct. If 6% of your dev set is mislabeled, you can't trust your model selection process.
- Systematic errors are dangerous. If all "dosa" images are labeled "uttapam", the model will learn this wrong mapping.
- If you fix dev labels, fix test labels too, so that both sets stay on the same distribution.
Orthogonalization
One Knob, One Function
In a well-designed system, each control affects exactly one thing. In an old TV: brightness knob changes brightness, volume knob changes volume. If one knob changed both, debugging would be impossible.
Apply the same principle to ML. The four sequential goals, each with its own "knob":
The Four Knobs of ML Orthogonalization
Knob 1: Fit Training Set Well (Fix High Bias)
Tools: Bigger network, train longer, better optimizer (Adam), different architecture.
Goal: Training error ≈ Human-level performance
Knob 2: Fit Dev Set Well (Fix High Variance)
Tools: Regularization (L2, dropout, data augmentation), more training data, early stopping (use cautiously, since it affects Knob 1 too).
Goal: Dev error ≈ Training error
Knob 3: Fit Test Set Well (Fix Dev Overfitting)
Tools: Bigger dev set, don't over-tune on dev set.
Goal: Test error ≈ Dev error
Knob 4: Perform Well in Real World
Tools: Change dev/test set distribution to match real world, change cost function to better reflect reality.
Goal: Real-world performance ≈ Test error
Human-Level Performance & Bayes Optimal Error
Why Compare to Humans?
For tasks that humans are good at (vision, speech, NLP), human-level error is a useful proxy for the Bayes optimal error: the theoretical best any function can achieve given the noise in the data.
Avoidable Bias = Training Error − Human-Level Error
Variance = Dev Error − Training Error
Which "Human Level" to Use?
Consider a medical imaging task:
| Human | Error Rate |
|---|---|
| Typical medical student | 5% |
| Experienced radiologist | 2% |
| Team of expert radiologists | 0.7% |
For the purpose of computing avoidable bias, use the best human performance (0.7%) as the proxy for Bayes error, because Bayes error ≤ 0.7%.
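The choice of proxy changes the diagnosis. The sketch below uses the three human-level error rates from the table above together with hypothetical model errors (1.2% training, 1.4% dev) that were picked only to illustrate the effect.

```python
def avoidable_bias(train_error, human_level_error):
    """Gap between training error and the chosen Bayes-error proxy."""
    return train_error - human_level_error

# Human-level error rates (%) from the table above.
proxies = {
    "typical medical student": 5.0,
    "experienced radiologist": 2.0,
    "team of expert radiologists": 0.7,
}

# Hypothetical model errors (%), chosen only for illustration.
train_error, dev_error = 1.2, 1.4
variance = dev_error - train_error

for name, human in proxies.items():
    bias = avoidable_bias(train_error, human)
    focus = "bias" if bias > variance else "variance"
    print(f"{name:<28s} avoidable bias = {bias:+.1f}%, "
          f"variance = {variance:.1f}% -> focus on {focus}")
```

With a weak proxy the avoidable bias comes out negative, which hides room for improvement that the strongest proxy still reveals.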
Worked Diagnostic Examples
Scenario A: High Avoidable Bias
| Metric | Error |
|---|---|
| Human-level (Bayes proxy) | 1% |
| Training error | 8% |
| Dev error | 10% |
Avoidable bias = 8% − 1% = 7%
Variance = 10% − 8% = 2%
Diagnosis: Focus on reducing bias: bigger model, train longer.
Scenario B: High Variance
| Metric | Error |
|---|---|
| Human-level (Bayes proxy) | 1% |
| Training error | 2% |
| Dev error | 10% |
Avoidable bias = 2% − 1% = 1%
Variance = 10% − 2% = 8%
Diagnosis: Focus on reducing variance: regularization, more data, dropout.
Surpassing Human-Level Performance
Once a model surpasses human-level error, progress typically slows down because:
- You can no longer use human-level as a reliable Bayes proxy, so the remaining gap becomes unclear
- You can't do manual error analysis (if the model is better than you, how do you know which errors to fix?)
- You're approaching the theoretical ceiling (Bayes error), where further gains require exponentially more effort
From-Scratch Code
4.1 Hyperparameter Random Search Loop
A complete, reusable hyperparameter search engine from scratch:
Python
import numpy as np
from datetime import datetime


class HyperparameterSearcher:
    """Random search over hyperparameters with logging."""

    def __init__(self, search_space, n_trials=25, seed=42):
        """
        search_space: dict mapping param_name -> dict with:
            'type': 'log_uniform' | 'uniform' | 'choice' | 'int_uniform'
                    | 'log_complement'
            'low', 'high': bounds (log10 exponents for the two log types)
            'options': list for choice
        """
        self.search_space = search_space
        self.n_trials = n_trials
        self.rng = np.random.RandomState(seed)
        self.results = []

    def _sample_params(self):
        """Sample one set of hyperparameters."""
        params = {}
        for name, spec in self.search_space.items():
            if spec['type'] == 'log_uniform':
                # Sample on log scale: 10^Uniform(low, high)
                r = self.rng.uniform(spec['low'], spec['high'])
                params[name] = 10 ** r
            elif spec['type'] == 'uniform':
                params[name] = self.rng.uniform(spec['low'], spec['high'])
            elif spec['type'] == 'choice':
                params[name] = self.rng.choice(spec['options'])
            elif spec['type'] == 'int_uniform':
                # randint excludes the upper bound, so add 1 to include 'high'
                params[name] = int(self.rng.randint(spec['low'], spec['high'] + 1))
            elif spec['type'] == 'log_complement':
                # For β: sample (1-β) on log scale
                r = self.rng.uniform(spec['low'], spec['high'])
                params[name] = 1 - 10 ** r
            else:
                raise ValueError(f"Unknown type for {name!r}: {spec['type']}")
        return params

    def search(self, train_fn):
        """
        Run the search.
        train_fn: callable(params_dict) -> dict with 'train_loss',
                  'dev_loss', 'dev_acc', etc.
        """
        print(f"Starting random search: {self.n_trials} trials")
        print("=" * 60)
        for trial in range(self.n_trials):
            params = self._sample_params()
            print(f"\nTrial {trial+1}/{self.n_trials}")
            print(f"  Params: {params}")
            # Train and evaluate
            metrics = train_fn(params)
            # Log results
            result = {
                'trial': trial + 1,
                'params': params,
                'metrics': metrics,
                'timestamp': datetime.now().isoformat()
            }
            self.results.append(result)
            print(f"  Dev Acc: {metrics.get('dev_acc', 'N/A')}")
        # Find best trial (trials without 'dev_acc' count as 0)
        best = max(self.results,
                   key=lambda x: x['metrics'].get('dev_acc', 0))
        print(f"\n{'='*60}")
        print(f"Best Trial: #{best['trial']}")
        print(f"Best Params: {best['params']}")
        print(f"Best Dev Acc: {best['metrics'].get('dev_acc', 0):.4f}")
        return best

    def top_k(self, k=5):
        """Return top-k trials by dev accuracy."""
        sorted_results = sorted(
            self.results,
            key=lambda x: x['metrics'].get('dev_acc', 0),
            reverse=True
        )
        return sorted_results[:k]


# --- Usage Example ---
search_space = {
    'learning_rate': {'type': 'log_uniform', 'low': -4, 'high': -1},
    'hidden_units': {'type': 'choice', 'options': [64, 128, 256, 512]},
    'dropout_rate': {'type': 'uniform', 'low': 0.1, 'high': 0.5},
    'momentum': {'type': 'log_complement', 'low': -3, 'high': -1},
    'batch_size': {'type': 'choice', 'options': [32, 64, 128, 256]},
}
searcher = HyperparameterSearcher(search_space, n_trials=25)
# best = searcher.search(my_train_function)
4.2 Learning Rate Finder (fastai-style)
The LR Finder is one of the most practical tools in deep learning. It trains for a short run (one epoch or less) while exponentially increasing the learning rate, and records the loss at each step. A good LR is typically near where the loss decreases fastest (steepest downward slope), not where the loss is lowest.
Python
import numpy as np
import copy


class LRFinder:
    """
    Learning Rate Finder (Smith 2017 / fastai-style).
    Exponentially increases LR from lr_min to lr_max over
    one pass through the data, recording loss at each step.
    """

    def __init__(self, model, optimizer_fn, loss_fn):
        """
        model: object with .forward(X) and .parameters()
        optimizer_fn: callable(params, lr) -> optimizer
        loss_fn: callable(y_pred, y_true) -> scalar loss with .backward()
        """
        self.model = model
        self.optimizer_fn = optimizer_fn
        self.loss_fn = loss_fn
        self.lrs = []
        self.losses = []

    def find(self, X_train, y_train, lr_min=1e-7, lr_max=10,
             num_steps=100, smooth_factor=0.05):
        """Run the LR range test."""
        # Save initial model state
        initial_state = copy.deepcopy(self.model)
        # Compute multiplication factor per step
        mult = (lr_max / lr_min) ** (1 / num_steps)
        lr = lr_min
        best_loss = float('inf')
        avg_loss = 0
        n = len(X_train)
        batch_size = min(64, n)
        for step in range(num_steps):
            # Sample a mini-batch
            idx = np.random.choice(n, batch_size, replace=False)
            X_batch = X_train[idx]
            y_batch = y_train[idx]
            # Forward pass
            y_pred = self.model.forward(X_batch)
            loss = self.loss_fn(y_pred, y_batch)
            # Smooth the loss (bias-corrected exponential moving average)
            avg_loss = smooth_factor * float(loss) + (1 - smooth_factor) * avg_loss
            smoothed_loss = avg_loss / (1 - (1 - smooth_factor) ** (step + 1))
            # Stop if loss explodes (> 4× best)
            if step > 10 and smoothed_loss > 4 * best_loss:
                print(f"Stopping: loss exploded at lr={lr:.2e}")
                break
            if smoothed_loss < best_loss:
                best_loss = smoothed_loss
            # Record
            self.lrs.append(lr)
            self.losses.append(smoothed_loss)
            # Backward pass & update (simplified: a fresh optimizer each
            # step, which is fine for plain SGD but resets Adam-style state)
            optimizer = self.optimizer_fn(
                self.model.parameters(), lr=lr
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Increase learning rate exponentially
            lr *= mult
        # Restore initial model state. Note: this rebinds self.model to the
        # saved copy; the object the caller passed in keeps the swept weights,
        # so continue with finder.model (or rebuild the model) afterwards.
        self.model = initial_state
        return self.lrs, self.losses

    def suggest_lr(self):
        """Suggest the LR where loss decreases fastest."""
        if len(self.losses) < 3:
            return None
        # Compute gradient of loss w.r.t. log10(lr)
        log_lrs = np.log10(self.lrs)
        gradients = np.gradient(self.losses, log_lrs)
        # Find the LR with steepest negative gradient
        # (skip first 10% and last 10% for stability)
        start = len(gradients) // 10
        end = len(gradients) - len(gradients) // 10
        min_idx = start + np.argmin(gradients[start:end])
        suggested_lr = self.lrs[min_idx]
        print(f"Suggested LR: {suggested_lr:.2e}")
        print("  (where loss decreased fastest)")
        print(f"  Rule of thumb: use ~{suggested_lr/10:.2e} to "
              f"{suggested_lr:.2e}")
        return suggested_lr
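The class above relies on a model and optimizer that supply their own `.backward()` and `.step()`. For readers who want to run a range test with nothing but NumPy, here is a self-contained sketch on a toy linear regression with hand-written gradients. The function name `lr_range_test`, the synthetic data, and the batch size of 256 are all assumptions made for this illustration.

```python
import numpy as np

def lr_range_test(X, y, lr_min=1e-5, lr_max=10.0, num_steps=100,
                  batch_size=256, smooth=0.05, seed=0):
    """LR range test for linear regression trained with plain SGD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    mult = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    lr, avg, best = lr_min, 0.0, np.inf
    lrs, losses = [], []
    for step in range(num_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        err = X[idx] @ w - y[idx]
        loss = float(np.mean(err ** 2))
        # Bias-corrected exponential moving average of the loss
        avg = smooth * loss + (1 - smooth) * avg
        smoothed = avg / (1 - (1 - smooth) ** (step + 1))
        if not np.isfinite(smoothed) or (step > 10 and smoothed > 4 * best):
            break                                   # loss has diverged
        best = min(best, smoothed)
        lrs.append(lr)
        losses.append(smoothed)
        grad = 2.0 * X[idx].T @ err / batch_size    # gradient of the MSE
        w -= lr * grad                              # SGD step at current LR
        lr *= mult                                  # exponential LR increase
    return np.array(lrs), np.array(losses)

# Synthetic regression problem (made-up weights and noise level).
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + 0.1 * data_rng.normal(size=1000)

lrs, losses = lr_range_test(X, y)
slopes = np.gradient(losses, np.log10(lrs))
skip = len(lrs) // 10                               # ignore noisy ends
suggested = lrs[skip + np.argmin(slopes[skip:len(lrs) - skip])]
print(f"steepest descent near lr = {suggested:.1e}; "
      f"sweep stopped at lr = {lrs[-1]:.1e}")
```

The larger batch keeps the early part of the curve smooth enough that the steepest slope reflects real progress rather than mini-batch noise.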
4.3 Error Analysis Helper
Python
import numpy as np
from collections import Counter


class ErrorAnalyzer:
    """Structured error analysis on misclassified examples."""

    def __init__(self, X_dev, y_dev, y_pred, class_names=None):
        self.X_dev = X_dev
        self.y_dev = np.asarray(y_dev)
        self.y_pred = np.asarray(y_pred)
        self.class_names = class_names
        # Find misclassified indices
        self.misclassified = np.where(self.y_dev != self.y_pred)[0]
        self.total_errors = len(self.misclassified)
        self.dev_error = self.total_errors / len(self.y_dev)
        print(f"Dev set size: {len(self.y_dev)}")
        print(f"Misclassified: {self.total_errors}")
        print(f"Dev error rate: {self.dev_error:.2%}")

    def confusion_breakdown(self):
        """Show which classes are most confused."""
        confusion_pairs = []
        for idx in self.misclassified:
            true_label = self.y_dev[idx]
            pred_label = self.y_pred[idx]
            if self.class_names:
                pair = (self.class_names[true_label],
                        self.class_names[pred_label])
            else:
                pair = (true_label, pred_label)
            confusion_pairs.append(pair)
        counts = Counter(confusion_pairs)
        print("\nTop Confusion Pairs (True → Predicted):")
        print("-" * 50)
        for (true, pred), count in counts.most_common(10):
            pct = count / self.total_errors * 100
            # str() so that integer labels also work with the 's' format
            print(f"  {str(true):>15s} → {str(pred):<15s} "
                  f"{count:4d} ({pct:.1f}%)")
        return counts

    def ceiling_analysis(self, categories):
        """
        Given error categories with counts, print each category's share of
        errors and the dev error that would remain if it were fully fixed.
        categories: dict mapping category_name -> count_of_errors
        """
        print(f"\nError Ceiling Analysis (Dev Error: {self.dev_error:.2%})")
        print("-" * 55)
        print(f"{'Category':<25s} {'Count':>6s} {'% Errors':>9s} {'If Fixed':>9s}")
        print("-" * 55)
        for cat, count in sorted(categories.items(),
                                 key=lambda x: -x[1]):
            pct = count / self.total_errors * 100
            # Best-case dev error after removing every error in this category
            remaining = self.dev_error * (1 - count / self.total_errors)
            print(f"  {cat:<23s} {count:6d} {pct:8.1f}% "
                  f"{remaining:8.2%}")
        print("-" * 55)
Industry Code โ PyTorch & Optuna
5.1 PyTorch LR Finder (torch-lr-finder)
Python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch_lr_finder import LRFinder  # pip install torch-lr-finder

# Define a simple model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.BatchNorm1d(256),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)
# Start from a tiny LR; the range test increases it up to end_lr
optimizer = optim.Adam(model.parameters(), lr=1e-7, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
# Create data loader (train_dataset is assumed to be defined elsewhere,
# yielding flattened 784-dimensional inputs)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
# Run LR Finder (falls back to CPU when no GPU is available)
device = "cuda" if torch.cuda.is_available() else "cpu"
lr_finder = LRFinder(model, optimizer, criterion, device=device)
lr_finder.range_test(train_loader, end_lr=10, num_iter=200)
lr_finder.plot()   # Shows loss vs. LR curve
lr_finder.reset()  # Restore model and optimizer to initial state
# Read the plot: pick LR where loss is steepest
# Typically: suggested_lr ≈ 3e-3 for this architecture
5.2 Optuna โ Automated Hyperparameter Search
Python
import optuna
import torch
import torch.nn as nn
import torch.optim as optim

# Assumes train_loader and dev_loader (torch DataLoaders yielding flattened
# 784-dimensional inputs) are already defined.


def objective(trial):
    """Optuna objective function for HP search."""
    # -- Sample hyperparameters --
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 2, 5)
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    hidden_size = trial.suggest_categorical(
        "hidden_size", [64, 128, 256, 512]
    )
    # Note: batch_size only takes effect if the DataLoaders are rebuilt
    # with it; this simplified loop uses the global loaders as they are.
    batch_size = trial.suggest_categorical(
        "batch_size", [32, 64, 128]
    )
    # -- Build model dynamically --
    layers = []
    in_dim = 784
    for i in range(n_layers):
        layers.append(nn.Linear(in_dim, hidden_size))
        layers.append(nn.ReLU())
        layers.append(nn.Dropout(dropout))
        in_dim = hidden_size
    layers.append(nn.Linear(in_dim, 10))
    model = nn.Sequential(*layers)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    # -- Training loop (simplified) --
    for epoch in range(20):
        model.train()
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            loss.backward()
            optimizer.step()
        # -- Evaluate on dev set after every epoch --
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for X_val, y_val in dev_loader:
                preds = model(X_val).argmax(dim=1)
                correct += (preds == y_val).sum().item()
                total += len(y_val)
        dev_acc = correct / total
        # Pruning: stop bad trials early
        trial.report(dev_acc, epoch)
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()
    return dev_acc


# -- Run the study --
study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=5)
)
study.optimize(objective, n_trials=50, timeout=3600)  # 1-hour budget
# -- Results --
print(f"Best trial: {study.best_trial.number}")
print(f"Best accuracy: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
# Visualize (optional; requires plotly). Each call returns a figure.
optuna.visualization.plot_optimization_history(study).show()
optuna.visualization.plot_param_importances(study).show()
🏭 Industry Note: Optuna at Scale
Optuna (developed by Preferred Networks, Japan) is one of the most widely used HP tuning frameworks in production ML. By default it uses Tree-structured Parzen Estimators (TPE), a Bayesian approach that uses the results of earlier trials to choose later ones instead of sampling blindly. Key features: pruning (kills bad trials early), distributed search across GPUs, and dashboard visualization. Many Indian ML teams at Flipkart, Swiggy, and PhonePe use Optuna or similar frameworks (Ray Tune, Weights & Biases Sweeps).
Visual Diagrams
6.1 Grid Search vs. Random Search โ Visual
6.2 Error Decomposition Waterfall
6.3 The LR Finder Plot
6.4 Orthogonalization โ The Four Knobs
Worked Example โ Complete HP Tuning Workflow
Problem: MNIST Digit Classifier on a ₹500 Cloud Budget
You're a student at VIT Vellore. Your professor gives you a ₹500 Google Cloud credit and asks you to build the best MNIST classifier you can. You have a single T4 GPU (training one model takes ~3 minutes). Budget allows ~100 experiments.
Step 1: Define the Search Space (Using Tier Priorities)
| Hyperparameter | Tier | Range | Scale |
|---|---|---|---|
| Learning Rate (α) | 1 | [10⁻⁴, 10⁻¹] | Log-uniform |
| Hidden Units | 2 | {64, 128, 256, 512} | Categorical |
| Batch Size | 2 | {32, 64, 128} | Categorical |
| Dropout | 2 | [0.1, 0.5] | Uniform |
| Number of Layers | 3 | {2, 3, 4} | Categorical |
Step 2: Round 1, Coarse Search (30 trials, 5 epochs each)
30 trials × 5 epochs = 150 training epochs, about 15 GPU-minutes in total (which works out to roughly 6 seconds per epoch on the T4). Cost: ~₹10.
Python
# Round 1 results (top 5 of 30 trials)
# Trial | LR     | Units | Batch | Drop | Layers | Dev Acc
# ------+--------+-------+-------+------+--------+--------
#   7   | 3.2e-3 |  256  |   64  | 0.30 |   3    | 97.8%
#  12   | 1.8e-3 |  512  |   64  | 0.25 |   3    | 97.6%
#  19   | 5.1e-3 |  256  |  128  | 0.35 |   2    | 97.5%
#  23   | 2.4e-3 |  128  |   64  | 0.20 |   3    | 97.3%
#   3   | 8.7e-3 |  256  |   32  | 0.40 |   2    | 97.1%
Observation: Best LRs cluster around [1e-3, 1e-2]. Hidden units 256–512. 3 layers seems best. Batch size 64. Dropout 0.20–0.35.
Step 3: Round 2, Fine Search (20 trials, 20 epochs each)
Narrow the ranges based on Round 1 findings:
Python
import numpy as np
# Narrowed search space
fine_search_space = {
    'lr': {'type': 'log_uniform', 'low': np.log10(1e-3),
           'high': np.log10(1e-2)},  # narrowed!
    'units': {'type': 'choice', 'options': [192, 256, 320, 384, 512]},
    'dropout': {'type': 'uniform', 'low': 0.2, 'high': 0.4},
    'layers': {'type': 'choice', 'options': [3, 4]},
}
# Top result from Round 2:
# LR = 2.8e-3, units = 320, dropout = 0.28, layers = 3
# Dev Accuracy: 98.4%
Step 4: Final Training (Top 3 candidates, full 50 epochs)
Python
# Final evaluation on held-out test set
# Model        | Dev Acc | Test Acc
# -------------+---------+---------
# Candidate A  | 98.4%   | 98.3%   <- Selected (dev ≈ test, no overfitting)
# Candidate B  | 98.3%   | 98.2%
# Candidate C  | 98.1%   | 98.0%
Step 5: Error Analysis on the Best Model
Inspect 100 misclassified dev examples (error analysis belongs on the dev set, so the test set stays untouched):
| Error Category | Count | Ceiling |
|---|---|---|
| Ambiguous digits (e.g., 4 vs 9) | 42 | 42% |
| Rotated/skewed digits | 28 | 28% |
| Very thin strokes | 18 | 18% |
| Noisy backgrounds | 12 | 12% |
Conclusion: To go beyond 98.3%, focus on data augmentation for rotation/skew (28% ceiling) and potentially ensemble models for ambiguous digits. Total cloud spend: ~₹80 out of ₹500 budget. 🎯
Case Study: Ola Surge Pricing
Ola's Surge Pricing Model: When Historical ≠ Real-Time
The Problem
Ola's surge pricing model predicts demand-supply imbalance in each zone to set dynamic pricing. The challenge: training data is historical (past ride requests, driver locations, weather, events), but inference happens in real-time (current conditions that may be very different).
Distribution Mismatch Sources
| Factor | Training Data (Historical) | Inference (Real-Time) |
|---|---|---|
| Events | Diwali 2024 patterns | IPL 2025 final (new stadium, new traffic patterns) |
| Geography | Data from 5 tier-1 cities | Expanding to 50 tier-2/3 cities |
| Weather | Historical averages | Unexpected cyclone Biparjoy |
| Competition | Pre-Rapido bike-taxi era | Post-Rapido with 2-wheeler competition |
| Regulation | Pre-fare-cap rules | New state-level fare caps in Karnataka |
Ola's ML Strategy (Reconstructed)
Step 1: Dev/Test Set Design
- Training set: 500M historical ride records (2020–2024) from all cities. Source distribution.
- Dev set: 100K most recent records (last 30 days) from target cities. Target distribution.
- Test set: 50K records from the last 7 days (same target distribution, held out).
- Train-Dev set: 50K records randomly sampled from training data (source distribution, but held out).
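The four-way split above can be sketched as follows. This is a minimal illustration under assumptions: the records are stand-in ID strings, the sizes are scaled down from the case study, and `make_splits` is a hypothetical helper. The key property is that train and train-dev come from the source distribution, while dev and test come from the target distribution and never overlap.

```python
import random


def make_splits(source_records, target_records, n_train_dev, n_dev, n_test, seed=0):
    """Split source data into train/train-dev and target data into dev/test."""
    if n_train_dev > len(source_records):
        raise ValueError("train-dev set is larger than the source data")
    if n_dev + n_test > len(target_records):
        raise ValueError("dev + test sets are larger than the target data")
    rng = random.Random(seed)
    source = list(source_records)
    target = list(target_records)
    rng.shuffle(source)
    rng.shuffle(target)
    return {
        "train_dev": source[:n_train_dev],        # source distribution, held out
        "train": source[n_train_dev:],            # source distribution, used for fitting
        "dev": target[:n_dev],                    # target distribution, used for tuning
        "test": target[n_dev:n_dev + n_test],     # target distribution, final check only
    }


# Toy stand-ins for ride records (hypothetical IDs, far smaller than the real data).
source = [f"hist-{i}" for i in range(10_000)]
target = [f"recent-{i}" for i in range(300)]
splits = make_splits(source, target, n_train_dev=100, n_dev=200, n_test=100)
print({name: len(rows) for name, rows in splits.items()})
```

For time-ordered data such as ride records, a chronological cut (the newest records go to test) may be preferable to the random shuffle used here.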
Step 2: Error Decomposition
| Metric | MAE (₹) |
|---|---|
| Human expert (Bayes proxy) | ₹8 |
| Training error | ₹12 |
| Train-Dev error | ₹15 |
| Dev error | ₹28 |
| Test error | ₹30 |
Decomposition:
- Avoidable Bias: ₹12 − ₹8 = ₹4
- Variance: ₹15 − ₹12 = ₹3
- Data Mismatch: ₹28 − ₹15 = ₹13 ← Biggest problem!
- Dev Overfitting: ₹30 − ₹28 = ₹2
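The same decomposition can be written as a small helper that takes the five error measurements and returns the four gaps. This is a sketch using the MAE values from the table; the name `decompose_errors` is illustrative.

```python
def decompose_errors(human, train, train_dev, dev, test):
    """Break the error chain into the four gaps used in this chapter."""
    return {
        "avoidable_bias": train - human,
        "variance": train_dev - train,
        "data_mismatch": dev - train_dev,
        "dev_overfitting": test - dev,
    }


# MAE values (in rupees) from the error table above.
gaps = decompose_errors(human=8, train=12, train_dev=15, dev=28, test=30)
largest = max(gaps, key=gaps.get)
print(gaps)
print("Top priority:", largest)
```

Picking the largest gap is only a first-pass heuristic; the cost of closing each gap also matters when setting priorities.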
Step 3: Fixing Data Mismatch (₹13 gap)
- Real-time feature injection: Add current weather, event calendars, and live traffic as features at inference time, not just historical averages.
- Synthetic data: Simulate "what if IPL final + heavy rain" scenarios by combining historical ride patterns with synthetic weather overlays.
- Online learning: Fine-tune the model daily on the most recent 24 hours of data to reduce temporal mismatch.
- Domain adaptation: Use a small amount of target city data to adapt the model trained on tier-1 cities.
Step 4: Hyperparameter Tuning
After fixing data mismatch, re-run HP search using the Caviar strategy (Ola has a GPU cluster):
- Tier 1: Learning rate, log-uniform over [10⁻⁵, 10⁻²]
- Tier 2: Hidden units (128โ1024), batch size (256โ2048 for large data)
- 50 parallel Optuna trials with pruning
- Result: Dev MAE dropped from ₹28 to ₹16
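To make the search space concrete, here is a minimal sketch that samples from the ranges listed above. Note the substitutions: it uses plain sequential random search from the standard library rather than parallel Optuna trials with pruning, and `toy_dev_score` is a made-up stand-in for "train the model and measure dev MAE", so its numbers mean nothing beyond illustrating the loop.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration from the Tier 1 and Tier 2 ranges."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -2),        # log-uniform over [1e-5, 1e-2]
        "hidden_units": rng.randint(128, 1024),
        "batch_size": rng.choice([256, 512, 1024, 2048]),
    }


def toy_dev_score(config):
    """Hypothetical objective (lower is better); NOT a real training run."""
    lr_penalty = (math.log10(config["learning_rate"]) + 3.0) ** 2  # pretend 1e-3 is ideal
    size_penalty = abs(config["hidden_units"] - 512) / 512          # pretend 512 is ideal
    return lr_penalty + size_penalty


def random_search(n_trials, seed=0):
    """Evaluate n_trials random configurations and return them best first."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = sample_config(rng)
        trials.append((toy_dev_score(config), config))
    return sorted(trials, key=lambda trial: trial[0])


results = random_search(50)
best_score, best_config = results[0]
print(f"best score {best_score:.3f} with {best_config}")
```

In a real pipeline each trial would train a model, and a library such as Optuna would add pruning of unpromising trials and parallel execution.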
Business Impact
Reducing surge pricing prediction error from ₹28 MAE to ₹16 MAE meant:
- Fewer overcharged rides → 12% reduction in ride cancellations
- Better driver allocation → 8% improvement in driver utilization
- Estimated revenue impact: ₹45 crore annually across 250+ cities
Common Mistakes & Misconceptions
Don't spend time searching over Adam's ε (Tier 4) when you haven't even found a good learning rate (Tier 1). Follow the tier ordering: as a rule of thumb, it can save 80%+ of your compute budget.
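Grid search wastes experiments on redundant combinations. For the same budget, random search covers the important dimensions much more thoroughly. Bergstra & Bengio (2012) demonstrated this empirically and supported it with a theoretical argument based on the low effective dimensionality of most hyperparameter spaces.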
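Sampling the learning rate uniformly on a linear scale, for example over [0.0001, 1], concentrates about 90% of samples above 0.1, where most reasonable learning rates don't live. Always use log-uniform sampling, for example for the range [10⁻⁴, 10⁻¹]:
α = 10 ** uniform(-4, -1).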
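The effect is easy to check numerically. The sketch below draws learning rates over [10⁻⁴, 10⁻¹] both ways and reports the share of samples in each decade; the helper names are illustrative. On this range, linear sampling puts roughly 90% of draws in the top decade (above 0.01), while log-uniform sampling spreads them about evenly.

```python
import math
import random


def sample_linear(rng, low, high):
    """Draw uniformly on a linear scale."""
    return rng.uniform(low, high)


def sample_log_uniform(rng, low, high):
    """Draw the exponent uniformly in log10 space, then map back."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def share_per_decade(samples, lowest_exp=-4, highest_exp=-2):
    """Fraction of samples falling in the decades starting at 1e-4, 1e-3, 1e-2."""
    counts = {exp: 0 for exp in range(lowest_exp, highest_exp + 1)}
    for x in samples:
        exp = math.floor(math.log10(x))
        exp = min(max(exp, lowest_exp), highest_exp)  # clamp values on the range boundary
        counts[exp] += 1
    return {exp: n / len(samples) for exp, n in counts.items()}


rng = random.Random(0)
n = 100_000
linear_shares = share_per_decade([sample_linear(rng, 1e-4, 1e-1) for _ in range(n)])
log_shares = share_per_decade([sample_log_uniform(rng, 1e-4, 1e-1) for _ in range(n)])
print("linear     :", {f"1e{e}": round(s, 3) for e, s in linear_shares.items()})
print("log-uniform:", {f"1e{e}": round(s, 3) for e, s in log_shares.items()})
```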
The moment you use test set performance to make model decisions, it becomes a second dev set and your final evaluation is biased. Keep the test set locked away until the very end.
If dev error is high, many practitioners assume "overfitting" and add regularization. But the real problem might be distribution mismatch. Use a train-dev set to distinguish variance from data mismatch.
Spending 30 minutes inspecting 100 misclassified examples often reveals insights that save weeks of engineering effort. Don't jump straight to "collect more data" without understanding why the model fails.
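Early stopping simultaneously affects bias (training halts before the training set is fully fit) and variance (it limits overfitting, which lowers dev error). One knob moving two quantities violates orthogonalization. Prefer L2 regularization and dropout, which target variance more directly and let you train to convergence, although very strong regularization can also cost some training fit.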
Comparison Tables
10.1 Search Strategies Compared
| Aspect | Grid Search | Random Search | Bayesian (Optuna) |
|---|---|---|---|
| Unique values per axis (N trials) | N^(1/k) for k HPs (∛N for k = 3) | N | N (informed) |
| Handles HP importance | ❌ Equal spacing on every axis | ✅ Every trial tests a new value on the important axes | ✅ Learns importance |
| Parallelizable | ✅ Fully | ✅ Fully | ⚠️ Partially |
| Compute efficiency | Low | Medium | High |
| Implementation effort | Easy | Easy | Medium (library) |
| Best for | ≤2 HPs | 3–7 HPs | 3–20 HPs |
| Uses previous results | ❌ No | ❌ No | ✅ Yes (surrogate model) |
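The first row of the table can be verified by counting. The sketch below builds a 3 × 3 × 3 grid (27 trials over three hyperparameters scaled to [0, 1]) and 27 random trials, then counts the distinct values tried on one axis. The axis values and helper names are illustrative assumptions.

```python
import itertools
import random


def grid_search_trials(axis_values):
    """Cartesian product of the candidate values on each axis."""
    return list(itertools.product(*axis_values))


def random_search_trials(n_trials, n_axes, seed=0):
    """Independent uniform draws on every axis for every trial."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(n_axes)) for _ in range(n_trials)]


def unique_values_on_axis(trials, axis):
    """Number of distinct values tested for one hyperparameter."""
    return len({trial[axis] for trial in trials})


grid = grid_search_trials([[0.0, 0.5, 1.0]] * 3)   # 3 x 3 x 3 = 27 trials
rand = random_search_trials(len(grid), n_axes=3)    # same budget of 27 trials

print("grid  :", len(grid), "trials,", unique_values_on_axis(grid, 0), "unique values on axis 0")
print("random:", len(rand), "trials,", unique_values_on_axis(rand, 0), "unique values on axis 0")
```

If axis 0 is the learning rate and the other two axes barely matter, the grid has effectively tested three learning rates while random search has tested one per trial.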
10.2 Bias vs. Variance vs. Mismatch Diagnostics
| Problem | Symptom | Solution |
|---|---|---|
| High Avoidable Bias | Train error >> Human error | Bigger model, train longer, better architecture |
| High Variance | Train-Dev error >> Train error | Regularization, more data, dropout |
| Data Mismatch | Dev error >> Train-Dev error | More target data, data synthesis, domain adaptation |
| Dev Overfitting | Test error >> Dev error | Bigger dev set, less HP tuning on dev |
10.3 Panda vs. Caviar Strategy
| Aspect | 🐼 Panda (Babysitting) | 🐟 Caviar (Parallel) |
|---|---|---|
| Compute resources | 1 GPU | 10โ100+ GPUs |
| Experiments at a time | 1 | Dozens |
| Human attention | High (daily monitoring) | Low (set and forget) |
| Time to result | Days–weeks | Hours–days |
| Cost per experiment | Low | High (but faster) |
| Typical user | Student, startup | Large company, research lab |
| Indian example | IIT M.Tech thesis on ₹1L budget | Jio AI team with private GPU cluster |
Exercises
Multiple Choice Questions (10)
According to Andrew Ng's hyperparameter priority ranking, which hyperparameter belongs to Tier 1 (highest priority)?
- Number of hidden layers
- Learning rate α
- Adam's β₂ parameter
- Mini-batch size
In a 5ร5 grid search over learning rate and hidden units (25 total experiments), how many unique learning rate values are tested?
- 25
- 10
- 5
- 1
You want to search over learning rates in [10⁻⁴, 10⁻¹]. What is the correct log-uniform sampling in Python?
- np.random.uniform(0.0001, 0.1)
- 10 ** np.random.uniform(-4, -1)
- np.random.choice([0.0001, 0.001, 0.01, 0.1])
- np.exp(np.random.uniform(-4, -1))
Answer: 10 ** np.random.uniform(-4, -1). This samples the exponent uniformly between −4 and −1, then raises 10 to that power, giving equal probability to each order of magnitude. Option A is linear uniform (biased toward large values). Option C only ever tests four fixed values. Option D is also log-uniform, but over [e⁻⁴, e⁻¹] ≈ [0.018, 0.37] rather than the requested range.
You have 10 million training examples. What is the recommended train/dev/test split?
- 60% / 20% / 20%
- 80% / 10% / 10%
- 98% / 1% / 1%
- 99.5% / 0.25% / 0.25%
Your model has: Human error = 1%, Training error = 2%, Dev error = 10%. What is the dominant problem?
- High avoidable bias
- High variance
- Data mismatch
- Bayes error is too high
What is the purpose of a "train-dev" set?
- To augment the training data with more examples
- To distinguish between variance and data mismatch as causes of dev error
- To replace the test set for final evaluation
- To validate that the train set has no label errors
Why does early stopping violate the principle of orthogonalization?
- It requires manual monitoring of training
- It simultaneously affects training set fit (bias) and dev set fit (variance)
- It only works with SGD, not Adam
- It prevents the model from reaching zero training loss
In a food classification task, error analysis reveals that 38% of misclassified images are blurry. If the dev error is 10%, what is the best possible dev error if you perfectly solve the blurry image problem, that is, if you realize the full ceiling for that category?
- 3.8%
- 6.2%
- 10%
- 0%
A medical imaging model achieves 0.5% error, while the best team of radiologists achieves 0.7% error. Which statement is TRUE?
- The model has zero avoidable bias
- Human-level error can no longer serve as a useful proxy for Bayes error
- The model must have overfitted since it beat humans
- You should use the average doctor's error rate as the Bayes proxy
When should you use the "Panda" (babysitting) strategy over the "Caviar" (parallel) strategy?
- When you have a large cloud budget and many GPUs
- When you have limited compute (e.g., a single GPU)
- When the dataset is very large (10M+ examples)
- When you're tuning only Tier 4 hyperparameters
Short Answer Questions (5)
B1. Beginner
List the four tiers of hyperparameter importance according to Andrew Ng. Give one example hyperparameter for each tier.
B2. Intermediate
Explain with a numerical example why sampling the momentum parameter β uniformly from [0.9, 0.999] is a bad idea. What should you do instead?
B3. Intermediate
You're building a crop disease classifier for an agri-tech startup in Pune. You have 2 million images from web-scraped agricultural databases but only 8,000 images from Indian farmers' phones (the target use case). How would you design the train/dev/test split?
B4. Intermediate
A model has: Human error = 3%, Train error = 4%, Train-Dev error = 5%, Dev error = 12%, Test error = 13%. Compute all four error gaps (avoidable bias, variance, data mismatch, dev overfitting) and identify the top priority to fix.
B5. Advanced
Explain why ML progress typically slows down after surpassing human-level performance. Give two concrete reasons related to the ML workflow.
Long Answer Questions (3)
C1. Intermediate - The Complete Tuning Workflow (10 marks)
You are building a text classification model to categorize customer complaints for a telecom company (Jio). You have 5 million labeled complaints from email (historical) and need the model to work on WhatsApp messages (target). Describe a complete ML strategy covering: (a) data split design, (b) hyperparameter search approach, (c) error analysis methodology, (d) how to handle distribution mismatch.
C2. Advanced - Grid vs. Random: Mathematical Proof (8 marks)
Prove mathematically that for a fixed budget of N experiments, random search tests more unique values along the most important hyperparameter axis than grid search when there are k hyperparameters (k ≥ 2). State your assumptions clearly. Compute the exact number of unique values per axis for both methods when N = 64 and k = 3.
C3. Advanced - Orthogonalization in Practice (8 marks)
A junior ML engineer at Infosys reports the following results for a sentiment analysis model: Human error = 2%, Training error = 15%, Dev error = 16%. They propose: "Let's add dropout and L2 regularization since the dev error is high." Critique this proposal using orthogonalization principles. What would you recommend instead? Explain your reasoning step by step.
Programming Exercises (2)
D1. Intermediate - Build a Complete HP Search Pipeline
Using the HyperparameterSearcher class from Section 4, build a complete pipeline that:
- Defines a search space for a 3-layer neural network (LR, hidden units, dropout, batch size)
- Implements a dummy train_fn that simulates training (you can use synthetic data or sklearn's make_classification)
- Runs a 2-round coarse-to-fine search (Round 1: 20 trials, broad range; Round 2: 10 trials, narrowed range)
- Prints the top-5 results from each round
- Reports the final best hyperparameters
Deliverable: A Python script that runs end-to-end and prints results to console.
D2. Advanced - LR Finder with PyTorch
Implement a LR Finder for a PyTorch model on the Fashion-MNIST dataset:
- Load Fashion-MNIST using torchvision.datasets
- Define a 3-layer network with BatchNorm and Dropout
- Implement the LR range test: exponentially increase LR from 10⁻⁷ to 10 over one epoch
- Plot the loss vs. log(LR) curve using matplotlib
- Automatically find and print the suggested LR (steepest descent point)
- Train the model using the suggested LR and report test accuracy
Deliverable: A Python script with a matplotlib plot and final test accuracy ≥ 88%.
Mini-Project
End-to-End ML Strategy for Indian Language Sentiment Analysis
Scenario: You're building a sentiment classifier for Hinglish (Hindi-English code-mixed) product reviews for a Flipkart internship project. You have:
- 500K English Amazon reviews (labeled positive/negative): source distribution
- 5K Hinglish Flipkart reviews (labeled): target distribution
- 50K unlabeled Hinglish reviews
Tasks:
- Data Split Design: Design your train/dev/test/train-dev splits. Justify your choices with exact numbers.
- Baseline Model: Train a simple logistic regression or 2-layer NN on English data. Report train, train-dev, dev, and test errors.
- Error Decomposition: Compute avoidable bias, variance, and data mismatch. Identify the biggest problem.
- Hyperparameter Search: Run a 2-round coarse-to-fine random search using Optuna. Document all trials.
- Error Analysis: Manually inspect 50 misclassified dev examples. Create an error category spreadsheet and perform ceiling analysis.
- Strategy Proposal: Based on your error analysis, propose 3 concrete next steps with expected impact. Prioritize using ceilings.
Deliverable: A Jupyter notebook with all code, analysis tables, and a 500-word strategy writeup. Grading emphasizes strategic reasoning over raw accuracy.
Chapter Summary
Key Takeaways from Chapter 11
- Hyperparameter Priority: Learning rate (Tier 1) → momentum, hidden units, batch size (Tier 2) → layers, LR decay (Tier 3) → Adam params (Tier 4). Spend 80% of your budget on Tiers 1–2.
- Random Search > Grid Search: For N experiments, random search tests N unique values per axis vs. N^(1/k) for grid search with k hyperparameters. The probability of finding the optimal region is dramatically higher with random search.
- Coarse-to-Fine: Don't run one massive search. Start broad with few epochs, zoom into the promising region, then run longer training on the finalists.
- Logarithmic Scale: Always sample learning rate and regularization strength on log scale. For momentum β, sample (1−β) on log scale.
- Big Data Splits: With 1M+ examples, use 98/1/1 or 99.5/0.25/0.25 splits. Dev and test sets must come from the same (target) distribution.
- Train-Dev Set: A diagnostic tool to distinguish variance from data mismatch. Same distribution as training, but held out from training.
- Error Analysis: Manually inspect 100 misclassified examples, categorize errors, compute ceilings, and prioritize the highest-ceiling category.
- Orthogonalization: Fix problems in order: bias first (bigger model) → variance (regularization) → data mismatch (more target data) → real-world performance (change metric/distribution).
- Human-Level Performance: Use the best human performance as a proxy for Bayes error. Avoidable bias = train error − human error. Variance = dev error − train error when train and dev share a distribution; otherwise measure it as train-dev error − train error.
- Panda vs. Caviar: Babysit one model (limited compute) vs. spawn many in parallel (abundant compute). Choose based on your resource constraints.
High avoidable bias? → Bigger model, train longer
High variance? → Regularize, more data
Data mismatch? → More target data, domain adaptation
Dev overfitting? → Bigger dev set
All good? → Ship it!
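The takeaway about sampling (1−β) on a log scale can be illustrated with a short sketch. It assumes the search range β ∈ [0.9, 0.999] and compares how often each sampling scheme lands above 0.99, where the effective averaging window 1/(1−β) changes fastest. The function names are illustrative.

```python
import math
import random


def sample_beta_log(rng, low=0.9, high=0.999):
    """Sample beta so that (1 - beta) is log-uniform over [1 - high, 1 - low]."""
    exponent = rng.uniform(math.log10(1 - high), math.log10(1 - low))  # about [-3, -1]
    return 1 - 10 ** exponent


def sample_beta_linear(rng, low=0.9, high=0.999):
    """Naive alternative: beta uniform on a linear scale."""
    return rng.uniform(low, high)


rng = random.Random(0)
n = 20_000
log_betas = [sample_beta_log(rng) for _ in range(n)]
linear_betas = [sample_beta_linear(rng) for _ in range(n)]

share_log = sum(b > 0.99 for b in log_betas) / n
share_linear = sum(b > 0.99 for b in linear_betas) / n
print(f"share of samples with beta > 0.99: log scale {share_log:.1%}, linear {share_linear:.1%}")
```

With log-scale sampling about half of the trials explore β between 0.99 and 0.999 (averaging windows of roughly 100 to 1000 steps), whereas linear sampling spends under a tenth of its trials there.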
References & Further Reading
Foundational Papers
- Bergstra, J., & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281–305. The seminal paper demonstrating the advantage of random search over grid search.
- Smith, L.N. (2017). Cyclical Learning Rates for Training Neural Networks. IEEE WACV. Introduced the LR range test (LR Finder) and cyclical learning rates.
- Snoek, J., Larochelle, H., & Adams, R.P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. NeurIPS. Foundations of Bayesian hyperparameter optimization.
- Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. KDD. The Optuna paper.
Textbooks & Courses
- Ng, A. (2017). Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization. Coursera Deep Learning Specialization, Course 2. Primary source for the tier ranking, orthogonalization, and error analysis frameworks.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press, Chapter 11: Practical Methodology. Covers hyperparameter selection strategy.
- Howard, J., & Gugger, S. (2020). Deep Learning for Coders with fastai and PyTorch. O'Reilly. Practical LR Finder and 1cycle policy implementation.
Indian Industry Applications
- Ola Engineering Blog (2023). Dynamic Pricing at Scale: ML Architecture Behind Surge Pricing. engineering.olacabs.com
- Flipkart Tech Blog (2023). Scaling Recommendations for 400M Users. tech.flipkart.com
- Niramai Health Analytix. AI for Breast Cancer Screening. niramai.com. Affordable AI-powered diagnostics using thermal imaging.
Tools & Libraries
- Optuna Documentation: optuna.readthedocs.io
- torch-lr-finder: github.com/davidtvs/pytorch-lr-finder
- Weights & Biases Sweeps: wandb.ai/site/sweeps
- Ray Tune: docs.ray.io/en/latest/tune/