Chapter 6: Feature Engineering, Scaling & Regularization

0 Learning Objectives

After completing this chapter, you will be able to:

1. Create, transform, and select features from raw datasets using systematic engineering techniques
2. Handle missing data using mean/median/mode imputation, KNN imputation, and MICE (Multiple Imputation by Chained Equations)
3. Encode categorical variables using Label, One-Hot, Target, and Ordinal encoding, and know which to use when
4. Apply StandardScaler, MinMaxScaler, RobustScaler, and MaxAbsScaler, understanding the mathematical difference between normalization and standardization
5. Apply Log, Box-Cox, and Yeo-Johnson transforms to handle skewed distributions
6. Derive L1 (Lasso) regularization from first principles and prove geometrically why it creates sparsity
7. Derive L2 (Ridge) regularization and its closed-form solution: w = (XᵀX + λI)⁻¹Xᵀy
8. Understand ElasticNet as a combination of L1 and L2, and derive its objective function
9. Interpret regularization path plots and choose optimal λ using cross-validation
10. Measure feature importance using permutation importance, coefficient magnitudes, and mutual information
11. Implement filter, wrapper, and embedded feature selection methods
12. Build end-to-end feature pipelines for Indian Census data and credit risk applications

1 Introduction

In the world of machine learning, there's a saying that has stood the test of time: "Garbage in, garbage out." No matter how powerful your model architecture is — whether it's a simple linear regression or a billion-parameter deep neural network — the quality of predictions fundamentally depends on the quality of features fed into it.

Feature engineering is the process of using domain knowledge to transform raw data into informative features that improve model performance. Feature scaling ensures that features are on comparable scales so that algorithms converge properly. Regularization adds constraints to the learning process to prevent overfitting, particularly when dealing with high-dimensional data where features may outnumber observations.

These three topics — engineering, scaling, and regularization — form a tightly coupled triad. A feature engineer who doesn't understand regularization may create too many features without knowing how to control model complexity. A data scientist who doesn't scale features may watch their gradient descent diverge or their regularization behave inconsistently.

This chapter is structured in three major arcs:

Feature Engineering (§3.1-3.3): Creating features, handling missing data, encoding categoricals, and selecting the most informative features
Feature Scaling & Transformation (§3.4-3.6, with the mathematics in §4.1-4.2): Standardization, normalization, and power transforms with mathematical foundations
Regularization (§3.7, with derivations in §4.3 and §5): L1, L2, and ElasticNet, derived from first principles with geometric intuition

Throughout, we ground every concept in Indian case studies — from handling 100+ features in Census of India microdata to regularizing credit risk models at HDFC Bank. We also study global patterns from Kaggle competitions and the Netflix Prize.

2 Historical Background

The story of feature engineering is as old as statistics itself, but its formalization as a discipline is surprisingly recent.

The Pre-Computing Era (1800s–1950s)

Carl Friedrich Gauss (1809) introduced the method of least squares for fitting astronomical observations. Even then, the choice of which variables to include (which features to use) was crucial. Gauss manually selected celestial coordinates, velocities, and time as his "features."

Francis Galton (1886) discovered the concept of "regression toward the mean" and, in doing so, also implicitly performed feature engineering — he transformed raw height measurements into standardized deviations from the mean.

The Regularization Revolution (1943–1970)

Andrey Tikhonov (1943) introduced what we now call Tikhonov regularization (equivalent to Ridge/L2 regularization) in the context of solving ill-posed integral equations. His contribution was recognizing that adding a penalty term to the objective function could stabilize otherwise unstable solutions.

Arthur Hoerl & Robert Kennard (1970) independently developed Ridge Regression for statistics, showing that biased estimators with smaller variance could outperform OLS estimates (the bias-variance trade-off). Their seminal paper "Ridge Regression: Biased Estimation for Nonorthogonal Problems" changed how statisticians approached multicollinearity.

The Sparsity Era (1996–2005)

Robert Tibshirani (1996) published his foundational paper on the LASSO (Least Absolute Shrinkage and Selection Operator), introducing L1 regularization to statistics. The key insight was that L1 penalty produces exactly zero coefficients — performing automatic variable selection. This paper has been cited over 45,000 times.

Hui Zou & Trevor Hastie (2005) introduced the Elastic Net, combining L1 and L2 penalties to get the best of both worlds: sparsity from L1 and stability from L2.

The Kaggle Era & AutoML (2010–Present)

Kaggle competitions (starting ~2010) democratized feature engineering knowledge. Winners like Xavier Conort (Liberty Mutual), Lucas & team (Allstate), and many others showed that creative feature engineering — polynomial features, target encoding, frequency encoding, interaction features — often mattered more than model architecture.

Today, AutoML tools (Google AutoML, H2O, Auto-sklearn, TPOT) attempt to automate feature engineering, but domain expertise remains irreplaceable for most real-world problems.

3 Conceptual Explanation

3.1 Feature Engineering: The Art and Science

A feature (also called a variable, attribute, or predictor) is a measurable property of the phenomenon being observed. Feature engineering involves three major activities:

Feature Creation

Deriving new features from existing ones using domain knowledge:

Interaction features: Product or ratio of two features (e.g., BMI = weight/height²)
Polynomial features: Squares, cubes of existing features (e.g., x², x₁·x₂)
Date/Time decomposition: Extracting year, month, day, weekday, hour from timestamps
Text features: TF-IDF, word count, sentiment scores from text data
Aggregation features: Mean, sum, count over grouped entities (e.g., avg transaction per customer)
Domain-specific features: Recency-Frequency-Monetary (RFM) in marketing, technical indicators in finance
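The feature-creation patterns listed above can be sketched with NumPy and the standard library. The toy arrays, column meanings, and customer IDs below are hypothetical and exist only to illustrate the idea; a real pipeline would more likely operate on a DataFrame.

```python
from datetime import datetime

import numpy as np

# Hypothetical toy records; the column meanings are assumptions for illustration.
weight_kg = np.array([60.0, 82.5, 71.0])
height_m = np.array([1.60, 1.80, 1.75])
timestamps = ["2024-01-15 09:30:00", "2024-03-02 18:05:00", "2024-07-21 23:45:00"]

# Interaction (ratio) feature: BMI = weight / height^2
bmi = weight_kg / height_m ** 2

# Polynomial features of degree 2 for two inputs: x1^2, x2^2 and the cross term x1*x2
poly = np.column_stack([weight_kg ** 2, height_m ** 2, weight_kg * height_m])

# Date/time decomposition: year, month, day, weekday (Monday = 0), hour
parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]
date_parts = np.array([[d.year, d.month, d.day, d.weekday(), d.hour] for d in parsed])

# Aggregation feature: mean transaction amount per customer
customer_ids = np.array([101, 102, 101, 103, 102, 101])
amounts = np.array([250.0, 90.0, 400.0, 1200.0, 110.0, 550.0])
avg_per_customer = {
    int(c): float(amounts[customer_ids == c].mean()) for c in np.unique(customer_ids)
}

print(np.round(bmi, 2))
print(date_parts)
print(avg_per_customer)
```

Each derived array has one row per original record (or one entry per customer for the aggregation), so the results can be appended to the feature matrix directly.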

Feature Transformation

Modifying existing features to make them more suitable for ML models:

Scaling: StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
Power transforms: Log, Box-Cox, Yeo-Johnson for skewed data
Binning/Discretization: Converting continuous to categorical (age → age group)
Encoding: Converting categorical to numerical representations

Feature Selection

Choosing the most informative subset of features:

Filter methods: Statistical tests independent of the model (correlation, χ², mutual information)
Wrapper methods: Use model performance to evaluate subsets (forward/backward selection, RFE)
Embedded methods: Feature selection built into model training (Lasso, decision tree importance)
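As a rough illustration of a filter method, the sketch below ranks features by the absolute Pearson correlation with the target and keeps the top k. The synthetic data and the helper name `correlation_filter` are assumptions made for this example; correlation only captures linear dependence, so mutual information or χ² may be more appropriate in other settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
# Assumed ground truth for the demo: only features 0 and 3 drive the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

def correlation_filter(X, y, k):
    """Return the indices of the k features with the largest |Pearson r| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    top = np.argsort(-np.abs(r))[:k]
    return top, r

selected, scores = correlation_filter(X, y, k=2)
print("scores:", np.round(scores, 3))
print("selected:", sorted(selected.tolist()))
```

The scores are computed without fitting any model, which is what distinguishes a filter method from wrapper and embedded methods.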

3.2 Handling Missing Data

Real-world datasets are rarely complete. Missing data arises from sensor failures, user non-response, data entry errors, or intentional omission. Understanding the mechanism of missingness is crucial:

Mechanism	Definition	Example	Strategy
MCAR (Missing Completely At Random)	Missingness unrelated to any variable	Sensor randomly fails	Any imputation works; listwise deletion OK
MAR (Missing At Random)	Missingness depends on observed variables	Income missing more for younger respondents	Multiple imputation, model-based methods
MNAR (Missing Not At Random)	Missingness depends on the missing value itself	High-income people don't report income	Specialized models, sensitivity analysis

Imputation Methods

Simple Imputation:

Mean imputation: Replace with column mean. Fast but reduces variance and distorts correlations.
Median imputation: Robust to outliers. Better for skewed distributions.
Mode imputation: For categorical features. Replace with most frequent category.
Constant imputation: Replace with a domain-specific value (e.g., 0, "Unknown").
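A minimal sketch of mean and median imputation with NumPy follows. The income values are hypothetical, and the helper `impute` is written for this example rather than taken from any library.

```python
import numpy as np

# Hypothetical income column (in thousands) with two missing entries and one outlier.
income = np.array([40.0, np.nan, 55.0, 35.0, 80.0, np.nan, 900.0])

def impute(x, strategy="mean"):
    """Fill NaNs with the mean or median of the observed values."""
    fill = np.nanmean(x) if strategy == "mean" else np.nanmedian(x)
    out = x.copy()
    out[np.isnan(out)] = fill
    return out, fill

mean_filled, mean_value = impute(income, "mean")
median_filled, median_value = impute(income, "median")

print("mean fill:  ", mean_value)    # pulled upward by the 900 outlier
print("median fill:", median_value)
# Imputing a constant shrinks the spread relative to the observed values.
print("observed variance:", np.nanvar(income))
print("variance after mean imputation:", mean_filled.var())
```

The comparison of the two variances illustrates the point made above: mean imputation adds points with zero deviation, so the variance of the completed column is smaller than that of the observed values.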

Advanced Imputation:

KNN Imputation: Find K nearest neighbors based on non-missing features, impute with their weighted average. Captures local structure but is computationally expensive (O(n²)).
MICE (Multiple Imputation by Chained Equations): Iteratively imputes each feature using a regression model conditioned on all other features. Produces multiple plausible imputations, allowing uncertainty quantification.

3.3 Encoding Categorical Variables

Method	How it Works	Pros	Cons	Best For
Label Encoding	Map each category to an integer (0, 1, 2, …)	Simple, memory-efficient	Imposes ordinal relationship	Ordinal data, tree-based models
One-Hot Encoding	Create binary column for each category	No ordinal assumption	High-cardinality → many columns	Linear models, low-cardinality
Ordinal Encoding	Map to integers preserving order	Preserves ordering	Assumes equal spacing	Truly ordinal features
Target Encoding	Replace category with mean of target	Handles high-cardinality	Target leakage risk	High-cardinality + careful CV
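The difference between ordinal, label, and one-hot encoding can be seen on a tiny example. The category values below are hypothetical, and the explicit `size_order` mapping is an assumption that a domain expert would normally supply.

```python
import numpy as np

sizes = np.array(["small", "large", "medium", "small"])        # ordinal feature
cities = np.array(["Mumbai", "Delhi", "Bangalore", "Delhi"])   # nominal feature

# Ordinal encoding: the order is supplied explicitly, not inferred from the data.
size_order = {"small": 0, "medium": 1, "large": 2}
size_encoded = np.array([size_order[s] for s in sizes])

# Label encoding: arbitrary integers (here alphabetical), which implies an order
# that does not exist for city names.
categories, city_labels = np.unique(cities, return_inverse=True)

# One-hot encoding: one binary column per category, no implied order.
city_one_hot = np.eye(len(categories), dtype=int)[city_labels]

print(size_encoded)
print(categories, city_labels)
print(city_one_hot)
```

Every one-hot row sums to 1, and the number of columns equals the number of distinct categories, which is why high-cardinality features become expensive under this scheme.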

3.4 Feature Scaling: Why and When

Many ML algorithms — gradient descent-based methods, distance-based methods (KNN, SVM), and PCA — are sensitive to the scale of features. If one feature ranges from 0 to 1 and another from 0 to 1,000,000, the larger feature will dominate.

Scaler	Formula	Range	Handles Outliers?	Best For
StandardScaler	z = (x - μ) / σ	~[-3, 3]	No	Normally distributed data, SVM, logistic regression
MinMaxScaler	x' = (x - min) / (max - min)	[0, 1]	No	Neural networks, image pixel values
RobustScaler	x' = (x - Q₂) / (Q₃ - Q₁)	Variable	Yes	Data with outliers
MaxAbsScaler	x' = x / max(|x|)	[-1, 1]	No	Sparse data, already centered at 0
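To see why max-abs scaling suits sparse data, the following sketch compares it with min-max scaling on a hypothetical, mostly-zero feature. Both transforms are written out by hand with NumPy rather than through a library scaler.

```python
import numpy as np

# Hypothetical sparse feature: mostly zeros, with positive and negative entries.
x = np.array([0.0, 0.0, -4.0, 0.0, 2.0, 0.0, 8.0])

# Max-abs scaling: divide by the largest absolute value, max(|x|).
max_abs_scaled = x / np.max(np.abs(x))

# Min-max scaling: shift by the minimum, then divide by the range.
min_max_scaled = (x - x.min()) / (x.max() - x.min())

print(max_abs_scaled)   # zeros stay zero, values lie in [-1, 1]
print(min_max_scaled)   # zeros are shifted to 4/12, so sparsity is lost
```

Because max-abs scaling never shifts the data, a sparse matrix stays sparse after the transform, whereas the shift in min-max scaling turns every original zero into a non-zero value whenever the minimum is not 0.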

3.5 Normalization vs. Standardization

These terms are often confused. Here's the precise distinction:

Standardization (Z-score normalization)

Transforms data to have zero mean and unit variance. The transformed data follows a standard normal distribution if the original data was normal. Formula: z = (x - μ) / σ

Use when: Data is approximately Gaussian, algorithm assumes zero-centered data (PCA, SVM, Logistic Regression).

Normalization (Min-Max scaling)

Rescales data to a fixed range [0, 1] (or [-1, 1]). Preserves the shape of the distribution. Formula: x' = (x - x_min) / (x_max - x_min)

Use when: You need bounded values (neural network inputs) or the data is NOT Gaussian. Zero entries in sparse data are preserved only when the feature minimum is 0; otherwise MaxAbsScaler is the safer choice for sparse data.

3.6 Power Transforms for Skewed Data

Many ML models assume (or perform better with) normally distributed features. Real-world data is often right-skewed (income, house prices, transaction amounts). Power transforms help:

Log Transform: x' = log(x + 1). Simple, works for right-skewed positive data. Adding 1 handles zeros.
Box-Cox Transform: Parametric family for positive data. Finds optimal λ automatically. When λ=0, it's log; when λ=1, it's linear.
Yeo-Johnson Transform: Extension of Box-Cox that handles negative values and zeros. More general-purpose.
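The effect of the log transform on a right-skewed feature can be checked numerically. The sketch below draws a hypothetical "income" sample from a log-normal distribution and uses a hand-written `skewness` helper (the third standardized moment); both are assumptions made for illustration.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(42)
# Hypothetical right-skewed "income" sample drawn from a log-normal distribution.
income = rng.lognormal(mean=10.0, sigma=1.0, size=5000)

log_income = np.log1p(income)  # log(x + 1), defined at x = 0

print("skewness before:", round(skewness(income), 2))
print("skewness after: ", round(skewness(log_income), 2))
```

For log-normal data the transformed values are close to Gaussian, so the skewness drops to roughly zero; real data will usually improve less cleanly.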

3.7 Regularization: Controlling Model Complexity

When models have too many features relative to observations, they tend to overfit — memorizing noise rather than learning patterns. Regularization adds a penalty term to the loss function that discourages overly complex models:

The Regularization Framework

General form: Loss_total = Loss_data + λ × Penalty(w)

L1 Penalty (Lasso): Penalty = Σ|wᵢ| → Encourages sparsity (zeros out coefficients)
L2 Penalty (Ridge): Penalty = Σwᵢ² → Shrinks coefficients toward zero (but not exactly zero)
Elastic Net: Penalty = α·Σ|wᵢ| + (1-α)·Σwᵢ² → Best of both worlds

The hyperparameter λ (regularization strength) controls the trade-off: λ=0 is no regularization (OLS), λ→∞ shrinks all coefficients to zero.

4 Mathematical Foundation

4.1 Feature Scaling Mathematics

StandardScaler (Z-Score Normalization)

Given a feature vector x = [x₁, x₂, ..., xₙ]:
Mean: μ = (1/n) Σᵢ xᵢ
Std Dev: σ = √[(1/n) Σᵢ (xᵢ - μ)²]
Z-score: zᵢ = (xᵢ - μ) / σ

Properties after transformation:
• E[z] = 0 (zero mean)
• Var(z) = 1 (unit variance)
• z is unitless (dimensionless)

MinMaxScaler

x'ᵢ = (xᵢ - x_min) / (x_max - x_min)

For scaling to [a, b] instead of [0, 1]:
x'ᵢ = a + (xᵢ - x_min)(b - a) / (x_max - x_min)

Properties:
• x'_min = 0, x'_max = 1
• Preserves zero entries only if x_min = 0
• Sensitive to outliers (a single extreme value stretches the scale)

RobustScaler

x'ᵢ = (xᵢ - Q₂) / IQR

where:
Q₂ = median(x) (50th percentile)
IQR = Q₃ - Q₁ (interquartile range)
Q₁ = 25th percentile
Q₃ = 75th percentile

Advantage: IQR is robust to outliers, because extreme values don't affect Q₁, Q₂, Q₃ as much as they affect μ and σ.

4.2 Box-Cox and Yeo-Johnson Transforms

Box-Cox Transform (x > 0)

y(x, λ) = (x^λ - 1) / λ    if λ ≠ 0
y(x, λ) = ln(x)            if λ = 0

Special cases:
λ = 1 → y = x - 1 (linear, shifted)
λ = 0.5 → y = 2(√x - 1) (square root transform)
λ = 0 → y = ln(x) (log transform)
λ = -1 → y = 1 - 1/x (reciprocal transform)

The optimal λ is found by maximum likelihood estimation.

Yeo-Johnson Transform (x ∈ ℝ)

y(x, λ) = [(x + 1)^λ - 1] / λ               if λ ≠ 0, x ≥ 0
y(x, λ) = ln(x + 1)                         if λ = 0, x ≥ 0
y(x, λ) = -[(-x + 1)^(2-λ) - 1] / (2-λ)     if λ ≠ 2, x < 0
y(x, λ) = -ln(-x + 1)                       if λ = 2, x < 0

Advantage: Works for zero and negative values (unlike Box-Cox).
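Both transforms are short enough to write out directly for a fixed λ. The sketch below follows the piecewise definitions above; the function names `box_cox` and `yeo_johnson` are chosen for this example, and the maximum likelihood search for the optimal λ is deliberately left out.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform for strictly positive x and a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

def yeo_johnson(x, lam):
    """Yeo-Johnson transform for any real x and a fixed lambda."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Non-negative branch
    if lam == 0:
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    # Negative branch
    if lam == 2:
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -(((-x[~pos] + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    return out

x_pos = np.array([0.5, 1.0, 4.0, 9.0])
print(box_cox(x_pos, 0.5))   # equals 2 * (sqrt(x) - 1)
print(box_cox(x_pos, -1.0))  # equals 1 - 1/x
print(yeo_johnson(np.array([-3.0, -1.0, 0.0, 2.0]), 1.0))  # identity at lambda = 1
```

The printed values reproduce the special cases listed for Box-Cox, and λ = 1 leaves the data unchanged under Yeo-Johnson on both branches.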

4.3 Regularization Mathematics

Ordinary Least Squares (OLS): Baseline

Model: ŷ = Xw + b (or ŷ = Xw with the bias absorbed into X)
Loss: J_OLS(w) = (1/2n) ||y - Xw||² = (1/2n) Σᵢ (yᵢ - wᵀxᵢ)²
Solution: ∇J = 0 → w_OLS = (XᵀX)⁻¹Xᵀy

Problem: If XᵀX is singular (perfectly collinear features), the inverse does not exist and the solution is not unique; if XᵀX is ill-conditioned (nearly singular), w_OLS is unstable with huge variance.

L2 Regularization (Ridge Regression)

J_ridge(w) = (1/2n) ||y - Xw||² + λ ||w||₂²
           = (1/2n) Σᵢ (yᵢ - wᵀxᵢ)² + λ Σⱼ wⱼ²

where λ ≥ 0 is the regularization strength.

Closed-form solution: setting ∇J_ridge = -(1/n) Xᵀ(y - Xw) + 2λw = 0 gives
w_ridge = (XᵀX + 2nλI)⁻¹ Xᵀy
(Often written as w = (XᵀX + λI)⁻¹Xᵀy by absorbing the factor 2n into λ.)

Key insight: Adding λI to XᵀX makes it invertible even when XᵀX is singular. The eigenvalues of (XᵀX + λI) are (σᵢ² + λ), where σᵢ are the singular values of X, so they are all positive when λ > 0.
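The closed form with the 2nλ factor can be sanity-checked numerically: at the ridge solution, the gradient of the objective should vanish. The synthetic data and the chosen λ below are arbitrary assumptions for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 4, 0.3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=n)

def ridge_gradient(w):
    """Gradient of J(w) = (1/2n)||y - Xw||^2 + lam * ||w||^2."""
    return -(X.T @ (y - X @ w)) / n + 2.0 * lam * w

# Closed form for this scaling of the objective: (X^T X + 2 n lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + 2.0 * n * lam * np.eye(p), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("gradient norm at w_ridge:", np.linalg.norm(ridge_gradient(w_ridge)))
print("||w_ridge|| =", np.linalg.norm(w_ridge), " ||w_ols|| =", np.linalg.norm(w_ols))
```

The gradient norm is zero up to floating-point error, and the ridge coefficient vector is shorter than the OLS one, which is the shrinkage effect described above.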

L1 Regularization (Lasso)

J_lasso(w) = (1/2n) ||y - Xw||² + λ ||w||₁
           = (1/2n) Σᵢ (yᵢ - wᵀxᵢ)² + λ Σⱼ |wⱼ|

No closed-form solution because |wⱼ| is not differentiable at wⱼ = 0. Solved using coordinate descent or subgradient methods.

Subgradient of |w|:
∂|w| = 1          if w > 0
∂|w| = [-1, 1]    if w = 0 (set-valued)
∂|w| = -1         if w < 0

Elastic Net

J_elastic(w) = (1/2n) ||y - Xw||² + λ₁ ||w||₁ + λ₂ ||w||₂²

Often parameterized as:
J(w) = (1/2n) ||y - Xw||² + λ [α ||w||₁ + (1-α) ||w||₂²]

where α ∈ [0, 1] controls the L1 vs L2 mix:
α = 1 → pure Lasso
α = 0 → pure Ridge
0 < α < 1 → Elastic Net

4.4 Mutual Information for Feature Selection

Mutual Information I(X; Y)

I(X; Y) = Σ_x Σ_y p(x, y) log [p(x, y) / (p(x) · p(y))]

Properties:
• I(X; Y) ≥ 0 (non-negative)
• I(X; Y) = 0 iff X and Y are independent
• I(X; Y) = H(X) - H(X|Y), where H is entropy
• Captures non-linear dependencies (unlike correlation)
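For discrete variables the definition can be evaluated directly from a joint probability table. The sketch below uses natural logarithms (so the result is in nats) and two hypothetical 2×2 tables, one independent and one fully dependent.

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in nats from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x), column vector
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y), row vector
    mask = joint > 0                        # 0 * log(0) is treated as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

independent = np.outer([0.5, 0.5], [0.3, 0.7])   # p(x, y) = p(x) p(y)
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])   # Y is a copy of X

print(mutual_information(independent))  # approximately 0
print(mutual_information(dependent))    # log(2), the entropy of a fair coin
```

The second case matches the identity I(X; Y) = H(X) - H(X|Y): knowing Y removes all uncertainty about X, so the mutual information equals H(X).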

5 Formula Derivations

5.1 Deriving Ridge Regression Closed-Form

📐 Derivation: Ridge Regression w = (XᵀX + λI)⁻¹Xᵀy

Step 1: Write the Ridge objective

J(w) = (1/2) (y - Xw)ᵀ(y - Xw) + (λ/2) wᵀw

Step 2: Expand the quadratic

J(w) = (1/2)[yᵀy - 2yᵀXw + wᵀXᵀXw] + (λ/2) wᵀw

Step 3: Take the gradient with respect to w

∇_w J = -Xᵀy + XᵀXw + λw = -Xᵀy + (XᵀX + λI)w

Step 4: Set gradient to zero

(XᵀX + λI)w = Xᵀy

Step 5: Solve for w

w_ridge = (XᵀX + λI)⁻¹ Xᵀy ∎

Note: (XᵀX + λI) is always invertible for λ > 0 because its eigenvalues are (σᵢ² + λ) > 0, where σᵢ are the singular values of X.
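The invertibility argument can be illustrated with a deliberately degenerate design matrix. In the sketch below the second column duplicates the first, so XᵀX is singular; the data and λ = 0.5 are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=20)
# Perfect multicollinearity: the second column duplicates the first.
X = np.column_stack([x1, x1, rng.normal(size=20)])
y = 2.0 * x1 + rng.normal(scale=0.1, size=20)

gram = X.T @ X
lam = 0.5

eig_ols = np.linalg.eigvalsh(gram)
eig_ridge = np.linalg.eigvalsh(gram + lam * np.eye(3))

print("eigenvalues of X^T X:        ", np.round(eig_ols, 6))    # smallest is ~0
print("eigenvalues of X^T X + lam I:", np.round(eig_ridge, 6))  # each shifted up by lam

w_ridge = np.linalg.solve(gram + lam * np.eye(3), X.T @ y)
print("ridge coefficients:", np.round(w_ridge, 3))  # the duplicated columns share the weight
```

OLS has no unique solution here, while ridge returns one in which the two identical columns receive equal coefficients.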

5.2 Deriving Lasso Gradient with L1 Penalty

📐 Derivation: Lasso Coordinate Descent Update

Step 1: Write the Lasso objective for a single coordinate wⱼ

Fixing all other weights, isolate terms involving wⱼ: J(wⱼ) = (1/2) Σᵢ(yᵢ - Σₖ≠ⱼ wₖxᵢₖ - wⱼxᵢⱼ)² + λ|wⱼ| + const

Step 2: Define the partial residual

Let rᵢⱼ = yᵢ - Σₖ≠ⱼ wₖxᵢₖ (residual without feature j).
Then: J(wⱼ) = (1/2) Σᵢ(rᵢⱼ - wⱼxᵢⱼ)² + λ|wⱼ|

Step 3: Compute the derivative (where it exists)

∂J/∂wⱼ = -Σᵢ xᵢⱼ(rᵢⱼ - wⱼxᵢⱼ) + λ · sign(wⱼ)    (for wⱼ ≠ 0)
Let ρⱼ = Σᵢ xᵢⱼrᵢⱼ and zⱼ = Σᵢ xᵢⱼ². Then ∂J/∂wⱼ = -ρⱼ + zⱼwⱼ + λ · sign(wⱼ).

Step 4: Apply the soft-thresholding operator

Setting the subdifferential to contain 0:

wⱼ* = (ρⱼ - λ) / zⱼ    if ρⱼ > λ
wⱼ* = 0                if |ρⱼ| ≤ λ
wⱼ* = (ρⱼ + λ) / zⱼ    if ρⱼ < -λ

This is the soft-thresholding operator: wⱼ* = S(ρⱼ, λ) / zⱼ, where S(ρ, λ) = sign(ρ) · max(|ρ| - λ, 0)

Step 5: Key insight — sparsity!

When |ρⱼ| ≤ λ, the optimal wⱼ = 0 EXACTLY. This means the L1 penalty forces coefficients to be exactly zero when their correlation with the residual (ρⱼ) is not strong enough to overcome the penalty threshold λ. ∎
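The soft-thresholding formula can be checked against a brute-force search over the one-dimensional objective. Expanding the squared term shows that J(wⱼ) equals (1/2)zⱼwⱼ² - ρⱼwⱼ + λ|wⱼ| up to a constant; the values of z, λ, and ρ below are arbitrary choices for the check.

```python
import numpy as np

def soft_threshold(rho, lam):
    """S(rho, lam) = sign(rho) * max(|rho| - lam, 0), applied elementwise."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def coordinate_objective(w, rho, z, lam):
    """J(w_j) up to a constant: (1/2) z w^2 - rho w + lam |w|."""
    return 0.5 * z * w ** 2 - rho * w + lam * np.abs(w)

z, lam = 4.0, 1.0
grid = np.linspace(-3.0, 3.0, 60001)  # step 1e-4
for rho in [-6.0, -0.5, 0.0, 0.8, 3.0]:
    w_closed = soft_threshold(rho, lam) / z
    w_grid = grid[np.argmin(coordinate_objective(grid, rho, z, lam))]
    print(f"rho={rho:5.1f}  closed-form={w_closed:7.4f}  grid-search={w_grid:7.4f}")
```

For the three values with |ρ| ≤ λ, both the closed form and the grid search land on zero, which is the sparsity mechanism described in Step 5.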

5.3 Geometric Proof: Why L1 Creates Sparsity

📐 Geometric Intuition: L1 vs L2 Constraint Regions

The Constrained Optimization View

Ridge: minimize ||y - Xw||² subject to Σwⱼ² ≤ t
Lasso: minimize ||y - Xw||² subject to Σ|wⱼ| ≤ t

These are equivalent to the penalized forms via Lagrange multipliers.

L2 constraint region (Ridge)

Σwⱼ² ≤ t defines a disk bounded by a CIRCLE (a ball bounded by a sphere in higher d). The circle has no corners, so the contour of the loss function typically touches it at a point where BOTH w₁ ≠ 0 AND w₂ ≠ 0.

L1 constraint region (Lasso)

Σ|wⱼ| ≤ t defines a DIAMOND (cross-polytope in higher d). The diamond has sharp CORNERS on the axes. Because the loss contours are ellipses expanding outward from the OLS solution, they often first touch the diamond at a CORNER, where one coordinate = 0.

In d dimensions

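The L1 ball in d dimensions has 2d vertices, all on the coordinate axes, together with edges and other lower-dimensional faces on which one or more coordinates are exactly zero. The loss contours can first touch the ball on any of these non-smooth faces, and the number of such faces grows rapidly with d, so solutions with some exactly-zero coordinates become the typical outcome when there are many features. This is the geometric intuition for why Lasso is effective for feature selection in high dimensions. ∎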

5.4 ElasticNet Derivation

📐 Derivation: Elastic Net Objective

Step 1: Combine L1 and L2 penalties

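J(w) = (1/2n)||y - Xw||² + λ[α||w||₁ + (1-α)/2 · ||w||₂²], where α ∈ [0,1] is the L1 ratio. This matches the Section 4.3 form except for the factor 1/2 on the L2 term, which cancels the 2 produced by differentiating ||w||₂².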

Step 2: Coordinate descent update for wⱼ

Partial residual term: ρⱼ = Σᵢ xᵢⱼ(yᵢ - ŷᵢ⁽⁻ʲ⁾), where ŷᵢ⁽⁻ʲ⁾ = Σₖ≠ⱼ wₖxᵢₖ is the prediction without feature j
Denominator: zⱼ = Σᵢ xᵢⱼ² + nλ(1-α)
Update: wⱼ* = S(ρⱼ, nλα) / zⱼ, where S is the soft-thresholding operator.

Step 3: Why Elastic Net?

• When features are correlated, Lasso tends to select one of them somewhat arbitrarily and ignore the others. Elastic Net's L2 term groups correlated features together.
• Elastic Net can select more than n features when p > n, whereas Lasso selects at most n. ∎
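The coordinate descent update from Step 2 translates into a short loop. This is a minimal sketch under stated assumptions: the data are centered beforehand, no intercept is fit, and the function name `elastic_net_cd` and the synthetic data with a duplicated column are invented for the example.

```python
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def elastic_net_cd(X, y, lam, alpha, max_iter=1000, tol=1e-10):
    """Coordinate descent for
    (1/2n)||y - Xw||^2 + lam * [alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2].
    Assumes X and y are already centered (no intercept is fit)."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(max_iter):
        w_old = w.copy()
        for j in range(p):
            # Partial residual: remove every feature's contribution except j's
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ r_j
            z_j = col_sq[j] + n * lam * (1.0 - alpha)
            w[j] = soft_threshold(rho_j, n * lam * alpha) / z_j
        if np.max(np.abs(w - w_old)) < tol:
            break
    return w

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1, rng.normal(size=n)])  # columns 0 and 1 are identical
X = X - X.mean(axis=0)
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)
y = y - y.mean()

w_lasso = elastic_net_cd(X, y, lam=0.1, alpha=1.0)   # pure L1
w_enet = elastic_net_cd(X, y, lam=0.1, alpha=0.5)    # L1 + L2 mix

print("lasso:      ", np.round(w_lasso, 3))
print("elastic net:", np.round(w_enet, 3))
```

With the L2 term present the objective is strictly convex, so the two identical columns end up with equal weights (the grouping effect). With α = 1 the split between them is not determined by the objective, and in this setup coordinate descent tends to put the weight on whichever column it updates first.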

6 Worked Numerical Examples

6.1 Feature Scaling — Step by Step

Example: Scaling House Prices Data

Consider a small dataset with two features:

House	Area (sq ft)	Bedrooms	Price (₹ lakhs)
1	1200	2	45
2	1800	3	65
3	2400	4	80
4	900	1	35
5	3000	5	110

Step 1: StandardScaler for Area

μ_area = (1200+1800+2400+900+3000)/5 = 1860

σ_area = √[((-660)²+(-60)²+540²+(-960)²+1140²)/5] = √[(435600+3600+291600+921600+1299600)/5] = √590400 ≈ 768.4

z₁ = (1200-1860)/768.4 = -0.859

z₂ = (1800-1860)/768.4 = -0.078

z₃ = (2400-1860)/768.4 = +0.703

z₄ = (900-1860)/768.4 = -1.249

z₅ = (3000-1860)/768.4 = +1.484

Step 2: MinMaxScaler for Area

x_min = 900, x_max = 3000, range = 2100

x'₁ = (1200-900)/2100 = 0.143

x'₂ = (1800-900)/2100 = 0.429

x'₃ = (2400-900)/2100 = 0.714

x'₄ = (900-900)/2100 = 0.000

x'₅ = (3000-900)/2100 = 1.000

Step 3: RobustScaler for Area

Sorted: [900, 1200, 1800, 2400, 3000]

Q₁ = 1200, Q₂ (median) = 1800, Q₃ = 2400, IQR = 1200

x'₁ = (1200-1800)/1200 = -0.500

x'₂ = (1800-1800)/1200 = 0.000

x'₃ = (2400-1800)/1200 = 0.500

x'₄ = (900-1800)/1200 = -0.750

x'₅ = (3000-1800)/1200 = 1.000
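The hand calculations for the Area column can be reproduced with a few lines of NumPy. The sketch assumes the population standard deviation (dividing by n, as in Step 1) and NumPy's default linear interpolation for percentiles.

```python
import numpy as np

area = np.array([1200.0, 1800.0, 2400.0, 900.0, 3000.0])

# Standardization: population standard deviation (ddof=0), as in Step 1
z = (area - area.mean()) / area.std()

# Min-max scaling: map [min, max] to [0, 1]
min_max = (area - area.min()) / (area.max() - area.min())

# Robust scaling: center on the median, divide by the interquartile range
q1, q2, q3 = np.percentile(area, [25, 50, 75])
robust = (area - q2) / (q3 - q1)

print("mean, std:", area.mean(), round(area.std(), 1))
print("standard:", np.round(z, 3))
print("min-max: ", np.round(min_max, 3))
print("robust:  ", np.round(robust, 3))
```

Note that `np.std` divides by n by default, whereas pandas' `Series.std` divides by n - 1, so the two give slightly different z-scores on small samples.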

6.2 Ridge vs Lasso — Same Dataset

Example: Ridge vs Lasso on a 3-Feature Problem

Dataset: 5 observations, 3 features (x₁, x₂, x₃), target y. Features are standardized.

x₁	x₂	x₃	y
-1.2	0.5	-0.3	2.1
0.8	-1.1	0.7	4.5
0.3	0.9	-1.2	1.8
1.5	-0.4	1.1	6.2
-0.6	0.2	0.4	3.0

OLS Solution (no regularization)

w_OLS = (XᵀX)⁻¹Xᵀy

After computation: w₁ = 1.82, w₂ = -0.43, w₃ = 1.05

All three features have non-zero coefficients. Note x₂ has a small coefficient.

Ridge (λ = 1.0)

w_ridge = (XᵀX + I)⁻¹Xᵀy

Result: w₁ = 1.41, w₂ = -0.28, w₃ = 0.83

All coefficients are shrunk toward zero, but none is exactly zero. The ratio between coefficients is approximately preserved.

Lasso (λ = 0.5)

Using coordinate descent:

Result: w₁ = 1.35, w₂ = 0.00, w₃ = 0.71

x₂'s coefficient is set to exactly zero! Lasso has performed automatic feature selection, determining that x₂ is not informative enough to justify its inclusion.

Comparison

Method	w₁	w₂	w₃	# Non-zero
OLS	1.82	-0.43	1.05	3
Ridge (λ=1)	1.41	-0.28	0.83	3
Lasso (λ=0.5)	1.35	0.00	0.71	2

6.3 Target Encoding — Worked Example

Example: Target Encoding for City Feature

City	Default (y)
Mumbai	0
Delhi	1
Mumbai	0
Bangalore	0
Delhi	1
Mumbai	1
Bangalore	0
Delhi	0

Step 1: Compute mean target per category

Mumbai: (0+0+1)/3 = 0.333

Delhi: (1+1+0)/3 = 0.667

Bangalore: (0+0)/2 = 0.000

Step 2: Apply smoothing (to prevent overfitting)

Smoothed encoding = (nᵢ × meanᵢ + m × global_mean) / (nᵢ + m)

Global mean = 3/8 = 0.375, smoothing factor m = 2

Mumbai: (3×0.333 + 2×0.375)/(3+2) = 1.749/5 = 0.350

Delhi: (3×0.667 + 2×0.375)/(3+2) = 2.751/5 = 0.550

Bangalore: (2×0.000 + 2×0.375)/(2+2) = 0.750/4 = 0.188
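The smoothed encoding above can be computed with plain Python. The helper name `smoothed_target_encoding` is an assumption for this sketch; it uses the category sum in place of nᵢ × meanᵢ, which is the same quantity.

```python
from collections import defaultdict

cities = ["Mumbai", "Delhi", "Mumbai", "Bangalore", "Delhi", "Mumbai", "Bangalore", "Delhi"]
defaults = [0, 1, 0, 0, 1, 1, 0, 0]

def smoothed_target_encoding(categories, targets, m):
    """Blend each category mean with the global mean, weighted by category size."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}

encoding = smoothed_target_encoding(cities, defaults, m=2)
for city, value in encoding.items():
    print(f"{city}: {value:.4f}")
```

To limit the target leakage mentioned in Section 3.3, the encoding would normally be fitted on training folds only and then applied to held-out rows; this sketch computes it on the full table purely to mirror the worked example.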

7 Visual Diagrams

Feature Engineering Pipeline Overview

RAW DATA, by type, with the processing applied to each:
Numerical values → Scaling and transforms (Log/Box-Cox, Z-score)
Categorical values → Encoding (One-Hot/Target/Ordinal)
Text data → TF-IDF, Word2Vec, BERT
DateTime stamps → Year, Month, DayOfWeek
Missing values → Imputation (Mean/KNN/MICE)
        ↓
FEATURE MATRIX (n × p matrix): all numerical, scaled, complete
        ↓
FEATURE SELECTION (Filter: MI, χ²; Wrapper: RFE; Embedded: Lasso)
        ↓
SELECTED FEATURES → Model Training

L1 vs L2 Constraint Regions (Why Lasso Creates Sparsity)

[Figure: L2 and L1 constraint regions in the (w₁, w₂) plane, each with elliptical loss contours.]
L2 (Ridge), circle w₁² + w₂² ≤ t: the contour touches the circle (★) at a SMOOTH point where w₁ ≠ 0 and w₂ ≠ 0. No corners → no sparsity.
L1 (Lasso), diamond |w₁| + |w₂| ≤ t: the contour touches the diamond (★) at a CORNER on the w₁ axis, where w₂ = 0 (SPARSE). Corners on axes → sparsity.

Regularization Path: Coefficients vs λ

[Figure: coefficient value (vertical axis) against λ (horizontal axis, ticks at 0, 0.01, 0.1, 0.5, 1, 5, 10) for w₁ (important feature), w₃ (moderately important), and w₂ (least important, zeroed first).]
Ridge: coefficients shrink smoothly but never reach zero
Lasso: coefficients hit zero at different λ values
Feature w₂ → 0 first (least important)
Feature w₁ → 0 last (most important)

Effect of Scaling on Gradient Descent

[Figure: gradient descent paths on loss contours in the (w₁, w₂) plane, without and with feature scaling.]
WITHOUT scaling (features on different scales): elongated contours → zig-zag path (slow), many iterations needed
WITH scaling (features on the same scale): circular contours → direct path (fast), few iterations needed

Missing Data Imputation Strategies

Original Data:

Age	Income	City	Rating
25	40K	DEL	4.2
32	??	MUM	3.8
??	55K	BLR	??
28	35K	??	4.5
45	80K	DEL	3.9

Strategy 1: Mean/Median (Age and Income columns)

Age	Income	Note
25	40K	
32	52.5K	Income imputed with the column mean
32.5	55K	Age imputed with the column mean (the median would be 30)
28	35K	
45	80K	

Strategy 2: KNN (K=2)

Age	Income	Note
25	40K	
32	37.5K	Income imputed with the average of the neighbors
28	55K	Age imputed from the nearest match
28	35K	
45	80K	

Strategy 3: MICE (Iterative)

Iteration 1: Impute Age from Income, City → predict Age=30
Iteration 2: Impute Income from Age, City → predict Income=42K
Iteration 3: Re-impute Age...
... repeat until convergence

8 Flowcharts

Flowchart: Choosing the Right Scaler

Is your data approximately Gaussian?
  Yes → Are there significant outliers?
    Yes → RobustScaler
    No → StandardScaler (Z-score)
  No → Does the data have many outliers?
    Yes → RobustScaler
    No → MinMaxScaler [0, 1]
Is the data sparse (many zeros)?
  Yes → MaxAbsScaler
  No → Use the decision above

Flowchart: Choosing Regularization Method

Do you need automatic feature selection?
  No → Use Ridge (L2): shrinks all coefficients, no feature selection
  Yes → Are features highly correlated (multicollinear)?
    Yes → Use ElasticNet: groups correlated features, α ∈ (0, 1)
    No → Use Lasso (L1): sets unimportant coefficients to zero
Then: tune λ and α using cross-validation (LassoCV, RidgeCV, ElasticNetCV)

Flowchart: Encoding Categorical Variables

Is the categorical variable ORDINAL (has a natural order)?
  Yes → Ordinal Encoding (preserve order)
  No → How many unique categories (k)?
    k < 10 → One-Hot Encoding (dummy variables)
    k ≥ 10 → Target Encoding (with smoothing) or Frequency Encoding
Tree models? → Label Encoding OK (trees handle it)
Linear models? → Must use One-Hot or Target Encoding

9 Python Implementation (From Scratch)

9.1 Feature Scaling from Scratch

feature_scaling.py: scalers from scratch (Python)

import numpy as np

class StandardScalerFromScratch:
    """Z-score standardization: z = (x - mean) / std"""
    def fit(self, X):
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        # Avoid division by zero for constant features
        self.std_[self.std_ == 0] = 1.0
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_

    def inverse_transform(self, Z):
        return Z * self.std_ + self.mean_

    def fit_transform(self, X):
        return self.fit(X).transform(X)


class MinMaxScalerFromScratch:
    """Min-Max scaling to [0, 1]: x' = (x - min) / (max - min)"""
    def __init__(self, feature_range=(0, 1)):
        self.min_val, self.max_val = feature_range

    def fit(self, X):
        self.data_min_ = np.min(X, axis=0)
        self.data_max_ = np.max(X, axis=0)
        self.data_range_ = self.data_max_ - self.data_min_
        self.data_range_[self.data_range_ == 0] = 1.0
        return self

    def transform(self, X):
        X_std = (X - self.data_min_) / self.data_range_
        return X_std * (self.max_val - self.min_val) + self.min_val

    def fit_transform(self, X):
        return self.fit(X).transform(X)


class RobustScalerFromScratch:
    """Robust scaling using median and IQR"""
    def fit(self, X):
        self.median_ = np.median(X, axis=0)
        Q1 = np.percentile(X, 25, axis=0)
        Q3 = np.percentile(X, 75, axis=0)
        self.iqr_ = Q3 - Q1
        self.iqr_[self.iqr_ == 0] = 1.0
        return self

    def transform(self, X):
        return (X - self.median_) / self.iqr_

    def fit_transform(self, X):
        return self.fit(X).transform(X)


# ---- Demo ----
np.random.seed(42)
X = np.array([[1200, 2], [1800, 3], [2400, 4],
              [900,  1], [3000, 5]])

print("Original:\n", X)
print("StandardScaler:\n", StandardScalerFromScratch().fit_transform(X))
print("MinMaxScaler:\n", MinMaxScalerFromScratch().fit_transform(X))
print("RobustScaler:\n", RobustScalerFromScratch().fit_transform(X))

9.2 Ridge Regression from Scratch

ridge_from_scratch.py: Ridge with closed-form solution (Python)

import numpy as np

class RidgeRegressionFromScratch:
    """
    Ridge Regression: w = (X^T X + λI)^{-1} X^T y
    Implements closed-form solution derived in Section 5.1.
    """
    def __init__(self, alpha=1.0, fit_intercept=True):
        self.alpha = alpha
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        X = np.array(X, dtype=np.float64)
        y = np.array(y, dtype=np.float64)

        if self.fit_intercept:
            # Center X and y to handle intercept
            self.X_mean_ = np.mean(X, axis=0)
            self.y_mean_ = np.mean(y)
            X_c = X - self.X_mean_
            y_c = y - self.y_mean_
        else:
            X_c, y_c = X, y

        n, p = X_c.shape
        # Closed-form: w = (X^T X + alpha * I)^(-1) X^T y
        I = np.eye(p)
        XtX = X_c.T @ X_c
        Xty = X_c.T @ y_c
        self.coef_ = np.linalg.solve(XtX + self.alpha * I, Xty)

        if self.fit_intercept:
            self.intercept_ = self.y_mean_ - self.X_mean_ @ self.coef_
        else:
            self.intercept_ = 0.0

        return self

    def predict(self, X):
        X = np.array(X, dtype=np.float64)
        return X @ self.coef_ + self.intercept_

    def score(self, X, y):
        """R² score"""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot


# ---- Demo ----
np.random.seed(42)
X = np.random.randn(100, 5)
true_w = np.array([3, 0, -2, 0, 1.5])  # Only 3 features matter
y = X @ true_w + np.random.randn(100) * 0.5

for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = RidgeRegressionFromScratch(alpha=alpha)
    model.fit(X, y)
    print(f"λ={alpha:5.2f}  coefs={np.round(model.coef_, 2)}  R²={model.score(X, y):.4f}")

9.3 Lasso Regression from Scratch (Coordinate Descent)

lasso_from_scratch.py: coordinate descent with soft-thresholding (Python)

import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator S(ρ, λ) = sign(ρ) · max(|ρ| - λ, 0)"""
    if rho > lam:
        return rho - lam
    elif rho < -lam:
        return rho + lam
    else:
        return 0.0


class LassoFromScratch:
    """
    Lasso Regression using coordinate descent.
    Implements the soft-thresholding update derived in Section 5.2.
    """
    def __init__(self, alpha=1.0, max_iter=1000, tol=1e-6):
        self.alpha = alpha
        self.max_iter = max_iter
        self.tol = tol

    def fit(self, X, y):
        X = np.array(X, dtype=np.float64)
        y = np.array(y, dtype=np.float64)
        n, p = X.shape

        # Center data
        self.X_mean_ = np.mean(X, axis=0)
        self.y_mean_ = np.mean(y)
        X_c = X - self.X_mean_
        y_c = y - self.y_mean_

        # Initialize coefficients
        w = np.zeros(p)
        # Precompute z_j = sum(x_ij^2) for each feature
        z = np.sum(X_c ** 2, axis=0)

        for iteration in range(self.max_iter):
            w_old = w.copy()
            for j in range(p):
                # Compute partial residual (excluding feature j)
                r_j = y_c - X_c @ w + X_c[:, j] * w[j]
                # ρ_j = correlation of feature j with residual
                rho_j = X_c[:, j] @ r_j
                # Apply soft-thresholding
                w[j] = soft_threshold(rho_j, n * self.alpha) / z[j]

            # Check convergence
            if np.max(np.abs(w - w_old)) < self.tol:
                break

        self.coef_ = w
        self.intercept_ = self.y_mean_ - self.X_mean_ @ self.coef_
        self.n_iter_ = iteration + 1
        return self

    def predict(self, X):
        return np.array(X) @ self.coef_ + self.intercept_


# ---- Demo: Lasso automatic feature selection ----
np.random.seed(42)
X = np.random.randn(100, 10)
true_w = np.array([3, 0, 0, -2, 0, 1.5, 0, 0, 0, 0.8])
y = X @ true_w + np.random.randn(100) * 0.3

model = LassoFromScratch(alpha=0.05)
model.fit(X, y)
print("True weights: ", true_w)
print("Lasso weights:", np.round(model.coef_, 3))
print("Non-zero features:", np.sum(model.coef_ != 0), "/ 10")
# Typically Lasso recovers the 4 truly non-zero features; at this small α a
# weak noise feature can occasionally survive with a tiny coefficient.
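
The scalar `soft_threshold` above follows its docstring formula branch by branch. A minimal vectorized sketch of the same operator (the name `soft_threshold_vec` is illustrative) applies sign(ρ) · max(|ρ| − λ, 0) to a whole array at once and makes the "dead zone" that produces exact zeros easy to see.

```python
import numpy as np

def soft_threshold_vec(rho, lam):
    """Elementwise S(rho, lam) = sign(rho) * max(|rho| - lam, 0)."""
    rho = np.asarray(rho, dtype=float)
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

rho = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
shrunk = soft_threshold_vec(rho, lam=1.0)
print(shrunk)  # inputs with |rho| <= 1 are mapped to zero, the rest move 1 toward zero
```

Every input inside [−λ, λ] maps to exactly zero, which is why counting `coef_ != 0` is meaningful for a coordinate-descent Lasso.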

9.4 Missing Data Imputation

imputation.py — KNN Imputation from scratch Python

import numpy as np

class KNNImputerFromScratch:
    """
    K-Nearest Neighbors Imputation.
    For each missing value, find K nearest neighbors based on
    non-missing features, then impute with their weighted average.
    """
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit_transform(self, X):
        X = np.array(X, dtype=np.float64)
        X_imputed = X.copy()
        n, p = X.shape

        for i in range(n):
            for j in range(p):
                if np.isnan(X[i, j]):
                    # Find features that are NOT missing for row i
                    valid_features = ~np.isnan(X[i, :])
                    # Find rows that have value for feature j
                    candidate_rows = ~np.isnan(X[:, j])
                    candidate_rows[i] = False  # exclude self

                    if np.sum(candidate_rows) == 0:
                        # Column j has no observed values at all, so there is
                        # nothing to impute from; leave this entry as NaN
                        continue

                    # Compute distances using valid features
                    distances = []
                    for k in range(n):
                        if not candidate_rows[k]:
                            continue
                        # Only use features present in both rows
                        shared = valid_features & ~np.isnan(X[k, :])
                        if np.sum(shared) == 0:
                            continue
                        dist = np.sqrt(np.sum((X[i, shared] - X[k, shared])**2))
                        distances.append((dist, k))

                    # Sort by distance, take K nearest
                    distances.sort()
                    neighbors = distances[:self.n_neighbors]

                    if len(neighbors) == 0:
                        col_values = X[~np.isnan(X[:, j]), j]
                        X_imputed[i, j] = np.mean(col_values)
                    else:
                        # Weighted average (inverse distance weighting)
                        weights = [1/(d + 1e-8) for d, _ in neighbors]
                        values = [X[idx, j] for _, idx in neighbors]
                        X_imputed[i, j] = np.average(values, weights=weights)

        return X_imputed


# ---- Demo ----
X = np.array([
    [25, 40000, 4.2],
    [32, np.nan, 3.8],
    [np.nan, 55000, np.nan],
    [28, 35000, 4.5],
    [45, 80000, 3.9]
])
imputer = KNNImputerFromScratch(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print("Imputed data:\n", np.round(X_filled, 1))
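
One caveat on the listing above: distances are computed on raw feature values, so whichever shared feature has the largest numeric spread decides who the neighbors are. The sketch below, which assumes NaN-aware z-scoring before the distance computation (the helpers `nan_standardize` and `nearest_row` are illustrative names), reuses the demo matrix and shows the nearest neighbor of row 1 changing once age and rating are on a comparable scale.

```python
import numpy as np

X = np.array([
    [25, 40000, 4.2],
    [32, np.nan, 3.8],
    [np.nan, 55000, np.nan],
    [28, 35000, 4.5],
    [45, 80000, 3.9],
])

def nan_standardize(X):
    """Z-score each column using only its observed (non-NaN) entries."""
    mu = np.nanmean(X, axis=0)
    sigma = np.nanstd(X, axis=0)
    return (X - mu) / sigma

def nearest_row(X, i, candidates):
    """Index of the candidate closest to row i over features both rows observe."""
    best, best_dist = None, np.inf
    for k in candidates:
        shared = ~np.isnan(X[i]) & ~np.isnan(X[k])
        if not shared.any():
            continue  # no overlapping features, distance undefined
        dist = np.sqrt(np.sum((X[i, shared] - X[k, shared]) ** 2))
        if dist < best_dist:
            best, best_dist = k, dist
    return best

candidates = [0, 2, 3, 4]  # rows with an observed income (column 1)
raw_nn = nearest_row(X, 1, candidates)
scaled_nn = nearest_row(nan_standardize(X), 1, candidates)
print("Nearest to row 1 on raw values:   ", raw_nn)
print("Nearest to row 1 after z-scoring: ", scaled_nn)
```

On raw values the age gap dominates the rating gap, so row 3 wins; after z-scoring the rating difference counts as well and row 0 becomes the closest. Whether to scale first is a design choice, but for mixed-unit data it usually is the safer default.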

9.5 Feature Importance: Permutation Importance

permutation_importance.py — From scratch Python

import numpy as np

def permutation_importance(model, X, y, n_repeats=10,
                            scoring=None, random_state=42):
    """
    Compute permutation importance for each feature.
    
    Algorithm:
    1. Compute baseline score with original data.
    2. For each feature j:
       a. Shuffle feature j's values randomly (n_repeats times).
       b. Compute new score with shuffled feature.
       c. Importance = baseline_score - shuffled_score.
    """
    rng = np.random.RandomState(random_state)

    if scoring is None:
        # Default: R² for regression
        def scoring(model, X, y):
            y_pred = model.predict(X)
            ss_res = np.sum((y - y_pred) ** 2)
            ss_tot = np.sum((y - np.mean(y)) ** 2)
            return 1 - ss_res / ss_tot

    baseline_score = scoring(model, X, y)
    n, p = X.shape
    importances = np.zeros((n_repeats, p))

    for r in range(n_repeats):
        for j in range(p):
            X_perm = X.copy()
            # Randomly shuffle feature j
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            perm_score = scoring(model, X_perm, y)
            importances[r, j] = baseline_score - perm_score

    return {
        'importances_mean': np.mean(importances, axis=0),
        'importances_std': np.std(importances, axis=0),
        'importances': importances
    }


# ---- Demo ----
from sklearn.linear_model import LinearRegression

np.random.seed(42)
X = np.random.randn(200, 5)
y = 3*X[:,0] - 2*X[:,2] + 0.5*X[:,4] + np.random.randn(200)*0.3

model = LinearRegression().fit(X, y)
result = permutation_importance(model, X, y)
for j in range(5):
    print(f"Feature {j}: importance = {result['importances_mean'][j]:.4f} ± {result['importances_std'][j]:.4f}")
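
The demo above scores importance on the same rows the model was fitted on, which can flatter features the model has overfitted to. A minimal sketch of the held-out variant follows; it assumes a plain least-squares model fitted with NumPy and a simple 300/100 split, both chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.5 * X[:, 4] + 0.3 * rng.normal(size=400)

# Simple hold-out split: first 300 rows for fitting, last 100 for scoring
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

# Ordinary least squares with an intercept column
A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)

def r2(X_eval, y_eval):
    pred = beta[0] + X_eval @ beta[1:]
    return 1 - np.sum((y_eval - pred) ** 2) / np.sum((y_eval - y_eval.mean()) ** 2)

baseline = r2(X_te, y_te)
n_repeats = 20
drops = np.zeros((n_repeats, X.shape[1]))
for r in range(n_repeats):
    for j in range(X.shape[1]):
        X_perm = X_te.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between feature j and y
        drops[r, j] = baseline - r2(X_perm, y_te)

held_out_importance = drops.mean(axis=0)
ranking = np.argsort(held_out_importance)[::-1]
print("Held-out R²:", round(baseline, 4))
print("Features ranked by importance:", ranking)
```

Features 0, 2, and 4 should lead the ranking in that order, mirroring the magnitudes of their true coefficients, while the two unused features hover around zero.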

10

TensorFlow Implementation

regularization_tf.py — L1, L2, ElasticNet in TensorFlow TensorFlow

import tensorflow as tf
import numpy as np

# ---- Generate synthetic data ----
np.random.seed(42)
n_samples, n_features = 500, 20
X_train = np.random.randn(n_samples, n_features).astype(np.float32)
true_weights = np.zeros(n_features)
true_weights[[0, 3, 7, 15]] = [3.0, -2.0, 1.5, -1.0]  # Only 4 active features
y_train = (X_train @ true_weights + np.random.randn(n_samples) * 0.5).astype(np.float32)


class RegularizedLinearModel(tf.keras.Model):
    """Linear model with configurable regularization in TF."""
    def __init__(self, n_features, l1_lambda=0.0, l2_lambda=0.0):
        super().__init__()
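        # add_weight registers the variables with Keras so they show up in
        # model.trainable_variables (Keras 3 does not auto-track raw tf.Variable attributes)
        self.w = self.add_weight(shape=(n_features, 1), initializer='zeros', name='weights')
        self.b = self.add_weight(shape=(1,), initializer='zeros', name='bias')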
        self.l1_lambda = l1_lambda
        self.l2_lambda = l2_lambda

    def call(self, X):
        return tf.matmul(X, self.w) + self.b

    def regularization_loss(self):
        l1_loss = self.l1_lambda * tf.reduce_sum(tf.abs(self.w))
        l2_loss = self.l2_lambda * tf.reduce_sum(tf.square(self.w))
        return l1_loss + l2_loss


def train_model(model, X, y, epochs=500, lr=0.01):
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    X_tensor = tf.constant(X)
    y_tensor = tf.constant(y.reshape(-1, 1))

    for epoch in range(epochs):
        with tf.GradientTape() as tape:
            predictions = model(X_tensor)
            mse_loss = tf.reduce_mean(tf.square(y_tensor - predictions))
            total_loss = mse_loss + model.regularization_loss()

        gradients = tape.gradient(total_loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

        if (epoch + 1) % 100 == 0:
            print(f"Epoch {epoch+1}: Loss={total_loss.numpy():.4f}")

    return model


# Train with different regularizations
print("="*50, "RIDGE (L2)")
ridge_model = RegularizedLinearModel(n_features, l2_lambda=0.1)
train_model(ridge_model, X_train, y_train)
ridge_weights = ridge_model.w.numpy().flatten()
print(f"Non-zero weights (|w| > 0.1): {np.sum(np.abs(ridge_weights) > 0.1)}")

print("="*50, "LASSO (L1)")
lasso_model = RegularizedLinearModel(n_features, l1_lambda=0.1)
train_model(lasso_model, X_train, y_train)
lasso_weights = lasso_model.w.numpy().flatten()
print(f"Non-zero weights (|w| > 0.1): {np.sum(np.abs(lasso_weights) > 0.1)}")

print("="*50, "ELASTIC NET")
enet_model = RegularizedLinearModel(n_features, l1_lambda=0.05, l2_lambda=0.05)
train_model(enet_model, X_train, y_train)
enet_weights = enet_model.w.numpy().flatten()
print(f"Non-zero weights (|w| > 0.1): {np.sum(np.abs(enet_weights) > 0.1)}")


# ---- Using Keras built-in regularizers ----
model_keras = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu',
                          kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.01, l2=0.01),
                          input_shape=(n_features,)),
    tf.keras.layers.Dense(32, activation='relu',
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(1)
])
model_keras.compile(optimizer='adam', loss='mse')
model_keras.fit(X_train, y_train, epochs=100, verbose=0)
print("Keras model training MSE:", model_keras.evaluate(X_train, y_train, verbose=0))
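
The listing above counts weights with |w| > 0.1 because plain gradient steps on an L1 penalty shrink weights toward zero but rarely land on it exactly. A minimal NumPy sketch of proximal gradient descent (ISTA) shows the alternative: take a gradient step on the squared-error term only, then apply soft-thresholding. The objective here uses the (1/2n) convention, so its λ is not directly comparable to `l1_lambda` above; the function name and data are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    # Step size 1/L, where L is the largest eigenvalue of the Hessian X^T X / n
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n                     # gradient of the smooth part only
        w = soft_threshold(w - step * grad, step * lam)  # proximal step for the L1 term
    return w

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))
true_w = np.zeros(20)
true_w[[0, 3, 7, 15]] = [3.0, -2.0, 1.5, -1.0]
y = X @ true_w + 0.5 * rng.normal(size=500)

w = lasso_ista(X, y, lam=0.1)
print("Exactly-zero weights:", int(np.sum(w == 0.0)), "of", w.size)
print("Active indices:", np.flatnonzero(w))
```

Because the thresholding step writes literal zeros, no cutoff such as 0.1 is needed to read off the selected features.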

feature_preprocessing_tf.py — TF Data Pipeline with Scaling TensorFlow

import tensorflow as tf

# Keras preprocessing layers: scaling, encoding, and binning inside the model graph
class TFFeaturePipeline:
    """Feature engineering pipeline using TensorFlow layers."""

    def build_preprocessing_model(self):
        # Normalization layer (StandardScaler equivalent)
        normalizer = tf.keras.layers.Normalization(axis=-1)

        # StringLookup + OneHot (One-Hot encoding)
        category_encoder = tf.keras.layers.StringLookup(
            vocabulary=['Mumbai', 'Delhi', 'Bangalore', 'Chennai'],
            output_mode='one_hot'
        )

        # Discretization (binning)
        age_binner = tf.keras.layers.Discretization(
            bin_boundaries=[18, 25, 35, 45, 55, 65]
        )

        return normalizer, category_encoder, age_binner

    def demo(self):
        # Example: Normalize numerical features
        data = tf.constant([[1200.0, 2.0], [1800.0, 3.0],
                           [2400.0, 4.0], [900.0, 1.0]])
        normalizer = tf.keras.layers.Normalization(axis=-1)
        normalizer.adapt(data)
        print("Normalized:\n", normalizer(data).numpy())

pipeline = TFFeaturePipeline()
pipeline.demo()

11

Scikit-Learn Implementation

complete_pipeline.py — Full Feature Engineering + Regularization Pipeline Scikit-Learn

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler,
    OneHotEncoder, OrdinalEncoder, LabelEncoder,
    PowerTransformer, PolynomialFeatures
)
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must run before importing IterativeImputer)
from sklearn.impute import IterativeImputer  # MICE
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LassoCV, RidgeCV
from sklearn.feature_selection import (
    SelectKBest, mutual_info_regression, RFE,
    SelectFromModel
)
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

# ============================================================
# 1. DATA PREPARATION
# ============================================================

# Create a realistic mixed-type dataset
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'age': np.random.randint(18, 70, n),
    'income': np.random.lognormal(10, 1, n),  # Skewed!
    'city': np.random.choice(['Mumbai', 'Delhi', 'Bangalore',
                             'Chennai', 'Hyderabad'], n),
    'education': np.random.choice(['High School', 'Bachelor',
                                   'Master', 'PhD'], n),
    'experience': np.random.randint(0, 40, n),
    'credit_score': np.random.randint(300, 850, n),
})

# Inject missing values (about 10% per column)
for col in ['income', 'credit_score', 'age']:
    df[col] = df[col].astype(float)  # integer columns cannot hold NaN
    mask = np.random.random(n) < 0.1
    df.loc[mask, col] = np.nan

# Create target
y = (0.5 * df['age'].fillna(35) +
     0.001 * df['income'].fillna(50000) +
     0.8 * df['experience'] +
     np.random.randn(n) * 5)

# ============================================================
# 2. COLUMN TRANSFORMER: Different processing for different types
# ============================================================

numeric_features = ['age', 'income', 'experience', 'credit_score']
categorical_features = ['city']
ordinal_features = ['education']

numeric_pipeline = Pipeline([
    ('imputer', KNNImputer(n_neighbors=5)),  # KNN imputation
    ('scaler', StandardScaler()),           # Z-score scaling
])

skewed_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('power', PowerTransformer(method='yeo-johnson')),  # Handle skew
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

ordinal_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(
        categories=[['High School', 'Bachelor', 'Master', 'PhD']])),
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, ['age', 'experience', 'credit_score']),
    ('skewed', skewed_pipeline, ['income']),
    ('cat', categorical_pipeline, categorical_features),
    ('ord', ordinal_pipeline, ordinal_features),
])

# ============================================================
# 3. FULL PIPELINE WITH REGULARIZED MODEL
# ============================================================

# Ridge Pipeline
ridge_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', Ridge(alpha=1.0))
])

# Lasso Pipeline
lasso_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', Lasso(alpha=0.1))
])

# ElasticNet Pipeline
enet_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', ElasticNet(alpha=0.1, l1_ratio=0.5))
])

# Cross-validation comparison
for name, pipe in [('Ridge', ridge_pipeline),
                   ('Lasso', lasso_pipeline),
                   ('ElasticNet', enet_pipeline)]:
    scores = cross_val_score(pipe, df, y, cv=5, scoring='r2')
    print(f"{name:12s}: R² = {scores.mean():.4f} ± {scores.std():.4f}")

# ============================================================
# 4. REGULARIZATION PATH: Coefficients vs Lambda
# ============================================================

from sklearn.linear_model import lasso_path

# Prepare data
X_processed = preprocessor.fit_transform(df)
alphas, coefs, _ = lasso_path(X_processed, y, alphas=np.logspace(-3, 1, 50))

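print("\nLasso Path Summary:")
print(f"  Alphas tested: {len(alphas)}")
print(f"  Features: {coefs.shape[0]}")
# lasso_path returns alphas sorted from largest to smallest,
# so column 0 holds α=10.0 and column -1 holds α=0.001
print(f"  At α={alphas[-1]:.3f}: {np.sum(np.abs(coefs[:, -1]) > 1e-6)} non-zero features")
print(f"  At α={alphas[0]:.1f}:  {np.sum(np.abs(coefs[:, 0]) > 1e-6)} non-zero features")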

# ============================================================
# 5. FEATURE SELECTION METHODS
# ============================================================

# Filter method: Mutual Information
selector_mi = SelectKBest(mutual_info_regression, k=5)
X_selected = selector_mi.fit_transform(X_processed, y)
print(f"\nMutual Info - selected {X_selected.shape[1]} features")
print(f"  MI scores: {np.round(selector_mi.scores_, 3)}")

# Wrapper method: Recursive Feature Elimination
rfe = RFE(Ridge(alpha=1.0), n_features_to_select=5)
rfe.fit(X_processed, y)
print(f"\nRFE - selected features: {rfe.support_}")
print(f"  Feature ranking: {rfe.ranking_}")

# Embedded method: Lasso-based selection
selector_lasso = SelectFromModel(Lasso(alpha=0.1))
selector_lasso.fit(X_processed, y)
print(f"\nLasso selection - selected {np.sum(selector_lasso.get_support())} features")

# ============================================================
# 6. MICE IMPUTATION
# ============================================================

mice_imputer = IterativeImputer(
    max_iter=10,
    random_state=42,
    sample_posterior=True  # Draw from the posterior; rerun with other seeds for multiple imputations
)
X_numeric = df[numeric_features].values
X_mice = mice_imputer.fit_transform(X_numeric)
print(f"\nMICE imputation: {np.sum(np.isnan(X_numeric))} NaN → {np.sum(np.isnan(X_mice))} NaN")
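
The pipelines above fix the penalty strength by hand, although `LassoCV` is already imported. A minimal sketch of choosing it by cross-validation follows; the synthetic data and the 5-fold setting are assumptions for illustration, and scikit-learn calls the penalty strength `alpha` where the chapter text writes λ.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 15))
true_w = np.zeros(15)
true_w[:3] = [2.0, -1.5, 1.0]          # only the first three features carry signal
y = X @ true_w + 0.5 * rng.normal(size=300)

# Scaling sits inside the pipeline so each CV fold is scaled using its own training part
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
print(f"Selected alpha: {lasso.alpha_:.4f}")
print("Non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
```

The fitted estimator also exposes `mse_path_` and `alphas_`, which is the raw material for the cross-validation curve used to mark the optimal λ on a regularization path plot.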

scaling_comparison.py — Visual comparison of scalers Scikit-Learn + Matplotlib

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler,
    PowerTransformer
)

# Generate skewed data
np.random.seed(42)
data = np.random.lognormal(0, 1, 1000).reshape(-1, 1)

# Add outliers
data = np.vstack([data, [[50], [100], [200]]])

scalers = {
    'Original': None,
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler(),
    'MaxAbsScaler': MaxAbsScaler(),
    'Yeo-Johnson': PowerTransformer(method='yeo-johnson'),
}

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for ax, (name, scaler) in zip(axes.flat, scalers.items()):
    if scaler is None:
        transformed = data
    else:
        transformed = scaler.fit_transform(data)

    ax.hist(transformed, bins=50, color='#059669', alpha=0.7, edgecolor='white')
    ax.set_title(name, fontsize=14, fontweight='bold')
    ax.set_xlabel(f"Range: [{transformed.min():.1f}, {transformed.max():.1f}]")

plt.tight_layout()
plt.savefig('scaling_comparison.png', dpi=150)
plt.show()
print("Plot saved as scaling_comparison.png")

12

Indian Case Studies

🇮🇳 Case Study 1: Census of India — Handling 100+ Mixed-Type Features

The Census of India microdata (2011) contains records for 600+ million individuals across 100+ variables: demographic (age, gender, religion, caste), geographic (state, district, urban/rural), economic (occupation, industry, income proxy), educational, and housing characteristics.

Feature Engineering Challenges:

Mixed types: 40+ categorical variables (religion with 6+ categories, caste with 4 groups, occupation with 100+ codes), 30+ numerical, 20+ ordinal (education levels), 10+ binary
High cardinality: District (700+), occupation (National Classification of Occupations has 500+ codes)
Missing data: 15-30% missing for income-related variables (MNAR — higher income individuals more likely to skip)
Hierarchical geography: State → District → Tehsil → Village — requires hierarchical encoding

Solution Pipeline:

Imputation: MICE for numerical variables (captures cross-variable dependencies in census data). Mode imputation for categorical.
Encoding: Target encoding for district (700+ categories), ordinal encoding for education, one-hot for religion/caste
Feature creation: Dependency ratio (dependents/working-age), literacy index (weighted education score), urbanization index
Scaling: RobustScaler for income-related features (heavily right-skewed with outliers)
Selection: Lasso for dimensionality reduction from 100+ to ~20 key features for poverty prediction
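
The target-encoding step in the pipeline above can be sketched as follows. This is a minimal version that assumes additive smoothing toward the global mean; the smoothing strength `m`, the function name, and the toy district labels are hypothetical and not taken from the case study.

```python
import numpy as np

def smoothed_target_encode(categories, target, m=10.0):
    """Per-category target mean shrunk toward the global mean.

    encoding_c = (n_c * mean_c + m * global_mean) / (n_c + m)
    """
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    levels, inverse = np.unique(categories, return_inverse=True)
    counts = np.bincount(inverse)                   # n_c for each category
    sums = np.bincount(inverse, weights=target)     # n_c * mean_c
    global_mean = target.mean()
    encoding = (sums + m * global_mean) / (counts + m)
    return dict(zip(levels, encoding)), encoding[inverse]

district = ["Pune", "Pune", "Pune", "Pune", "Patna", "Patna", "Kochi"]
is_poor = [0, 0, 1, 0, 1, 1, 0]
mapping, encoded = smoothed_target_encode(district, is_poor, m=2.0)
for name, value in mapping.items():
    print(f"{name:6s} -> {value:.4f}")
```

Rare categories such as the single-row district are pulled strongly toward the global rate, which limits overfitting on a 700+ level variable. In practice the mapping should be learned on training folds only, since it uses the target.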

Results: After feature engineering, a simple Lasso model achieved 82% accuracy for poverty classification — comparable to random forests on raw data, but with 5x faster inference and interpretable coefficients.

🇮🇳 Case Study 2: Flipkart Product Pricing — E-Commerce Feature Engineering

Flipkart, India's largest e-commerce platform (owned by Walmart), uses ML for product pricing, demand forecasting, and seller recommendations. The product catalog has millions of SKUs across electronics, fashion, groceries, and more.

Feature Engineering for Price Prediction:

Text features: Product title → TF-IDF (top 500 terms), brand extraction, category keywords
Image features: Product image → CNN embeddings (128-dim vector from ResNet)
Seller features: Seller rating, # reviews, days on platform, return rate
Temporal features: Day of week, month, festival season (Diwali/Dussehra effect), sale event (Big Billion Days)
Interaction features: Brand × Category, Seller Rating × Price Range, Discount % × Season
Aggregate features: Avg price in category, price rank within brand, price vs. MRP ratio

Scaling challenge: Price ranges from ₹50 (phone case) to ₹2,00,000+ (laptops). Log transform + RobustScaler is essential for price features. MinMaxScaler for ratings (1-5 bounded).
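
To make the scaling point concrete, here is a minimal sketch that applies a log transform followed by median/IQR scaling, the same centering and scaling that `RobustScaler` uses by default. The price list is illustrative and not Flipkart data.

```python
import numpy as np

prices = np.array([50, 299, 999, 4999, 15999, 64999, 200000], dtype=float)  # INR, illustrative

def robust_scale(x):
    """Center on the median and divide by the interquartile range."""
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

raw_robust = robust_scale(prices)             # robust scaling alone
log_robust = robust_scale(np.log1p(prices))   # log transform first, then robust scaling

print("Robust only,  largest value:", round(raw_robust.max(), 2))
print("Log + robust, largest value:", round(log_robust.max(), 2))
```

Robust scaling alone still leaves the most expensive item several IQRs from the median, whereas the log transform compresses the right tail first, so the scaled values stay within a narrow band.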

Impact: Feature engineering improvements led to 12% improvement in price prediction MAE and a reported ₹200 crore annual revenue increase from better pricing strategies.

🇮🇳 Case Study 3: HDFC Bank Credit Risk — Regularization for High-Dimensional Financial Data

HDFC Bank, India's largest private bank, uses ML for credit risk assessment. With 60+ million customers and 500+ potential features per customer (bureau data, transaction history, demographics, account behavior), regularization is not just helpful — it's essential.

The Challenge:

Curse of dimensionality: 500+ features but only ~5% default rate (imbalanced + high-dimensional)
Multicollinearity: Many bureau variables are correlated (e.g., total debt, EMI, # active loans)
Regulatory requirements: RBI mandates interpretable models for credit decisions (can't use black-box models)

Regularization Strategy:

ElasticNet (mixing parameter α = 0.7, i.e. `l1_ratio` in scikit-learn): Primary model. L1 component selects ~50 key features from 500+. L2 component handles correlated bureau variables.
Regularization path analysis: λ is chosen via 5-fold CV on historical data. The model identified 12 "core" features that remain non-zero across the range of λ values examined, including DPD (days past due), utilization ratio, age of oldest account, and number of inquiries.
Stability selection: Run Lasso 100 times with subsampled data. Features selected in >70% of runs are considered "stable" — these are the ones reported to regulators.
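
The stability selection step can be sketched as below. The data are synthetic, and the subsample fraction and `alpha` are assumptions chosen for illustration; only the 100 runs and the 70% threshold come from the description above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 300, 30
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:4] = [2.0, -1.5, 1.0, 0.8]      # four genuinely predictive features
y = X @ true_w + rng.normal(size=n)

n_runs, subsample, threshold = 100, 0.5, 0.7
selected_counts = np.zeros(p)
for _ in range(n_runs):
    # Fit Lasso on a random subsample drawn without replacement
    idx = rng.choice(n, size=int(subsample * n), replace=False)
    coef = Lasso(alpha=0.2).fit(X[idx], y[idx]).coef_
    selected_counts += (coef != 0)

frequency = selected_counts / n_runs
stable = np.flatnonzero(frequency > threshold)
print("Selection frequency of first 6 features:", np.round(frequency[:6], 2))
print("Stable features:", stable)
```

Features that survive only on some subsamples get a low frequency and are filtered out, which makes the reported set less sensitive to any single draw of the data than one Lasso fit would be.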

Impact: Regularized model reduced false positive rate by 18% (fewer good customers rejected) while maintaining the same default detection rate, saving an estimated ₹500 crore annually in reduced losses.

🇮🇳 India Spotlight

India's fintech revolution (UPI, PhonePe, Paytm, Razorpay) generates billions of daily transactions. Feature engineering on transaction data — temporal patterns (when do people spend?), merchant categories, amount distributions, peer-to-peer vs merchant payments — is a massive opportunity for credit scoring the "new-to-credit" population. Regularized models are preferred because they're interpretable and comply with RBI's AI governance framework (2023).

13

Global Case Studies

🌍 Case Study 1: Kaggle Competition Feature Engineering Patterns

Analysis of top Kaggle competition solutions reveals consistent feature engineering patterns used by winners:

Competition	Key Feature Engineering	Impact
Ames Housing (Regression)	Log-transform price, polynomial interactions (area × quality), target encoding for neighborhood	Reduced RMSE by 35%
Titanic (Classification)	Title extraction from name, family size, deck from cabin, fare per person	Accuracy 78% → 84%
Home Credit Default Risk	500+ features from 7 tables: aggregations (mean, max, count), time-since features, bureau score interactions	Top 10 solutions all had 1000+ engineered features
Allstate Claims Severity	Stacking + target encoding + frequency encoding of 100+ categorical columns	Winners used 2000+ features with ElasticNet

Common Patterns:

Always try log-transform on right-skewed targets (price, claims, revenue)
Target encoding > One-hot for high-cardinality categoricals with tree models
Feature interactions between top correlated features often help
Null indicator features ("is_missing" binary column) often carry signal
Aggregation features from relational tables (mean, count, std per group)

🌍 Case Study 2: Netflix Prize — Feature Engineering at Scale

The Netflix Prize (2006-2009, $1M prize) was one of the most famous ML competitions. The winning team (BellKor's Pragmatic Chaos) demonstrated the power of feature engineering for recommendation systems.

Key Features Engineered:

Temporal effects: User rating behavior changes over time. A user who averaged 4 stars early on might average 3.5 stars a few years later (rating inflation/deflation). Time-binned user biases were created.
User/Movie biases: bᵤ = average rating of user u - global mean. bᵢ = average rating of movie i - global mean.
Implicit features: Which movies a user chose to rate (not just the rating value) carries signal. Binary "did user rate this movie" features.
Neighborhood features: Similarity to K nearest users, weighted average of similar users' ratings.
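
The user and movie biases defined above can be computed directly from rating triples. The sketch below uses a toy table (not Netflix data) and, as an assumption for illustration, combines the biases into the additive baseline μ + bᵤ + bᵢ.

```python
import numpy as np

# (user, movie, rating) triples; toy data
users   = np.array([0, 0, 1, 1, 2, 2, 2])
movies  = np.array([0, 1, 0, 2, 0, 1, 2])
ratings = np.array([5, 3, 4, 2, 3, 1, 2], dtype=float)

mu = ratings.mean()                                                          # global mean
b_user = np.bincount(users, weights=ratings) / np.bincount(users) - mu       # b_u
b_movie = np.bincount(movies, weights=ratings) / np.bincount(movies) - mu    # b_i

def baseline(u, i):
    """Baseline prediction: global mean plus user and movie offsets."""
    return mu + b_user[u] + b_movie[i]

print("Global mean:", round(mu, 3))
print("User biases: ", np.round(b_user, 3))
print("Movie biases:", np.round(b_movie, 3))
print("Baseline for user 1, movie 1:", round(baseline(1, 1), 3))
```

User 1 never rated movie 1, yet the baseline still gives a prediction from the two offsets alone. Matrix factorization then models what remains after this baseline is subtracted.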

Regularization insight: The SVD++ models in the winning blend used L2 regularization on all latent factors. Without regularization, the model overfitted dramatically on the sparse rating matrix (about 99% of entries missing). λ = 0.02 was optimal, found via cross-validation.

Result: The combination of feature engineering + regularized matrix factorization achieved a 10.06% improvement over Netflix's own algorithm (Cinematch), barely crossing the 10% threshold required for the $1M prize.

14

Startup Applications

🚀 HealthTech: Practo — Patient Risk Scoring

Practo (India's largest healthtech platform) uses feature engineering on patient data: symptom frequency, appointment history, lab results, age-comorbidity interactions. RobustScaler handles extreme lab values. L1 regularization selects the top 20 risk factors from 200+ potential features.

🚀 AgriTech: CropIn — Crop Yield Prediction

CropIn engineers features from satellite imagery (NDVI, soil moisture), weather data (cumulative rainfall, GDD), and historical yields. Box-Cox transforms correct skewed yield distributions. ElasticNet handles correlated weather variables (temperature and humidity are negatively correlated).

🚀 FinTech: Razorpay — Fraud Detection

Key features include transaction velocity (# transactions in last hour/day), amount deviation from user mean, merchant category risk scores, and device fingerprint features. All features are scaled with StandardScaler for the SVM-based fraud detector. L1 regularization identifies the top 30 fraud indicators from 500+ candidate features.

15

Government Applications

🏛️ Aadhaar — Biometric Feature Quality Assessment

UIDAI processes 1.3 billion biometric records. Features are engineered from fingerprint quality scores (NFIQ), iris clarity metrics, and face image quality. Missing biometric modalities are handled with cascaded imputation. Regularized classifiers determine if a biometric sample is "good enough" for authentication.

🏛️ NITI Aayog — District-Level Development Indices

NITI Aayog's Aspirational Districts Programme creates composite development indices from 49 indicators across health, education, agriculture, financial inclusion, and infrastructure. Feature scaling (MinMax to [0,100]) ensures indicators on different scales are comparable. PCA with standardized features identifies the key development dimensions.
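
A minimal sketch of the scaling step described above: each indicator is min-max scaled to [0, 100] and then averaged into a composite score. The indicator values and the equal weighting are assumptions for illustration, not the programme's actual data or methodology.

```python
import numpy as np

# Rows = districts, columns = indicators measured in different units (toy values)
indicators = np.array([
    [62.0, 0.41, 1200.0],
    [78.0, 0.55,  800.0],
    [91.0, 0.73, 2500.0],
    [70.0, 0.48, 1500.0],
])

col_min = indicators.min(axis=0)
col_max = indicators.max(axis=0)
scaled = 100.0 * (indicators - col_min) / (col_max - col_min)   # each column now spans 0 to 100

composite = scaled.mean(axis=1)   # equal weights across indicators (assumption)
print("Scaled indicators:\n", np.round(scaled, 1))
print("Composite index:", np.round(composite, 1))
```

Without the rescaling, the third indicator (values in the thousands) would swamp the second (values below 1) in any average.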

🏛️ ISRO — Satellite Image Feature Extraction

ISRO's remote sensing satellites (Cartosat, Resourcesat) generate multi-spectral imagery. Feature engineering involves computing vegetation indices (NDVI, EVI), texture features (GLCM), and spectral ratios. StandardScaler is applied across spectral bands before classification. Ridge regularization handles correlated spectral features.
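
As a small illustration of a spectral-ratio feature, the sketch below computes NDVI from its standard definition, (NIR − Red) / (NIR + Red), on a toy 2×2 reflectance grid. The pixel values are invented for the example.

```python
import numpy as np

# Toy surface reflectance for a 2x2 image patch (values between 0 and 1)
nir = np.array([[0.50, 0.60], [0.30, 0.05]])
red = np.array([[0.10, 0.08], [0.25, 0.04]])

ndvi = (nir - red) / (nir + red)   # bounded in [-1, 1] for non-negative reflectance
print(np.round(ndvi, 3))
```

Dense vegetation reflects strongly in the near-infrared and absorbs red light, so the top-row pixels score high while the bottom row stays near zero. Because the ratio is already bounded, it needs less rescaling than the raw bands it is built from.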

16

Industry Applications

🏭 Manufacturing: Tata Steel — Quality Prediction

Tata Steel's ML system predicts steel quality from 200+ process parameters: temperature profiles, chemical composition, rolling speeds. Feature engineering includes time-series aggregations (moving averages, peak detection), interaction features (carbon × temperature), and polynomial features for non-linear effects. ElasticNet with mixing parameter α = 0.3 balances interpretability (few key parameters) with handling correlated temperature sensors.

🏭 Telecom: Jio — Customer Churn Prediction

Jio's churn model uses 400+ features: usage patterns (data consumption trends, call frequency, recharge behavior), network quality metrics (drop rate, speed tests), and customer service interactions. Lasso reduces features to ~40 key predictors. Feature scaling is critical because usage (in GB) and call duration (in minutes) are on very different scales.

🏭 Pharma: Dr. Reddy's — Drug Response Prediction

Genomic data has 20,000+ gene expression features for a few hundred patients (p >> n, classic high-dimensional problem). ElasticNet is the standard approach: L1 selects the most relevant genes, L2 handles the extreme multicollinearity among co-expressed genes. Feature scaling (StandardScaler) is essential because gene expression levels vary by orders of magnitude.

17

Mini Projects

🔨 Mini Project 1: Indian Census Feature Engineering Pipeline

⏱️ Duration: 3-4 hours 📊 Difficulty: Intermediate 🔧 Tools: pandas, sklearn, matplotlib

Objective: Build an end-to-end feature engineering pipeline for a subset of Indian Census data to predict household poverty status.

Steps:

Data Loading: Use the Census 2011 sample dataset (or simulate with similar structure): age, gender, religion (6 categories), caste (4 categories), education (7 levels), occupation (10 categories), state (28), district (100+), rural/urban, house condition (3 levels), water source, toilet type, income proxy
Missing Data Analysis: Visualize missingness patterns using a heatmap. Test MCAR vs MAR using Little's test. Apply MICE imputation for numerical features and mode imputation for categorical.
Feature Engineering:
- Create dependency ratio, literacy score, housing quality index
- Target encode district (100+ categories)
- Ordinal encode education level
- One-hot encode religion and caste
Scaling: RobustScaler for income (skewed), StandardScaler for others
Feature Selection: Compare filter (mutual information), wrapper (RFE), and embedded (Lasso) methods. Create a Venn diagram showing overlap between selected feature sets.
Modeling: Compare Ridge, Lasso, ElasticNet with 5-fold CV. Plot regularization paths.

Deliverables: Jupyter notebook with pipeline code, visualizations, comparison table, and 1-page report.

🔨 Mini Project 2: Credit Risk Regularization Study

⏱️ Duration: 4-5 hours 📊 Difficulty: Advanced 🔧 Tools: sklearn, pandas, matplotlib, seaborn

Objective: Investigate how L1, L2, and ElasticNet regularization affect credit risk model performance and interpretability across different λ values.

Steps:

Data: Use Kaggle's "Give Me Some Credit" dataset or simulate credit data with: monthly income, age, # dependents, # times 30-59 DPD, # times 60-89 DPD, # times 90+ DPD, revolving utilization, # open credit lines, # real estate loans, debt ratio + 10 noise features
Feature Engineering: Create total DPD score, utilization buckets, income-to-debt ratio, age × experience interaction, polynomial features for top 5 predictors
Regularization Path Analysis:
- Plot Ridge coefficients vs λ (all coefficients shrink smoothly)
- Plot Lasso coefficients vs λ (observe features dropping to zero at different λ)
- Plot ElasticNet coefficients vs λ with mixing parameter α = 0.5
- Mark the optimal λ from cross-validation on each plot
Stability Selection: Run Lasso 100 times with 80% subsampling. Plot feature selection frequency. Identify "stable" features (selected >70% of the time).
Performance Comparison: AUC-ROC curves for L2-penalized (Ridge), L1-penalized (Lasso), and ElasticNet-penalized logistic regression at optimal λ. Compare with unregularized logistic regression.

Deliverables: Regularization path plots, stability selection plot, AUC comparison table, feature importance ranking.

🔨 Mini Project 3: Scaling Effect Visualization Lab

⏱️ Duration: 2 hours 📊 Difficulty: Beginner-Intermediate 🔧 Tools: sklearn, matplotlib, seaborn

Objective: Visually demonstrate how different scaling methods affect model performance and gradient descent convergence.

Steps:

Create a dataset where features have very different scales (area: 500-5000, rooms: 1-10, price: 10⁶-10⁸)
Train linear regression with gradient descent (from scratch) on: (a) unscaled, (b) StandardScaled, (c) MinMaxScaled, (d) RobustScaled data
Plot the loss curve (loss vs iteration) for each — show how scaling dramatically reduces iterations to convergence
Plot 2D contours of the loss surface with and without scaling to show the "elongated ellipse vs circle" effect
Compare KNN accuracy with and without scaling on the same dataset
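
A starting point for the from-scratch gradient descent comparison is sketched below. The data-generating coefficients, noise level, iteration budget, and the step-size rule (1/L, with L the largest eigenvalue of the loss Hessian) are assumptions made for illustration; only the unscaled and standardized cases are shown.

```python
import numpy as np

def gradient_descent(X, y, n_iter=500):
    """Batch gradient descent on the half-MSE loss with fixed step 1/L."""
    Xb = np.column_stack([np.ones(len(X)), X])    # add intercept column
    H = Xb.T @ Xb / len(y)                        # Hessian of the quadratic loss
    lr = 1.0 / np.linalg.eigvalsh(H).max()        # step size 1/L
    w = np.zeros(Xb.shape[1])
    losses = []
    for _ in range(n_iter):
        resid = Xb @ w - y
        losses.append(0.5 * np.mean(resid ** 2))  # loss before this update
        w -= lr * (Xb.T @ resid) / len(y)
    return np.array(losses)

rng = np.random.default_rng(0)
n = 500
area = rng.uniform(500, 5000, n)
rooms = rng.integers(1, 11, n).astype(float)
# Hypothetical price model used only to generate data.
price = 3000 * area + 50000 * rooms + 1e5 + rng.normal(scale=1e4, size=n)

X_raw = np.column_stack([area, rooms])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # StandardScaler by hand

loss_raw = gradient_descent(X_raw, price)
loss_std = gradient_descent(X_std, price)
print(f"final loss, unscaled:     {loss_raw[-1]:.3e}")
print(f"final loss, standardized: {loss_std[-1]:.3e}")
```

With the same iteration budget, the unscaled run is limited by the tiny step that the large `area` values force, so it barely moves along the `rooms` and intercept directions. Plotting both loss arrays on a log axis makes the gap visible.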

18

End-of-Chapter Exercises

Exercise 1 (Easy)

Given the data [10, 20, 30, 40, 50], compute the StandardScaler transformation by hand. Verify that the mean of the transformed data is 0 and the standard deviation is 1.

Exercise 2 (Easy)

Apply MinMaxScaler to [100, 200, 300, 400, 500] to scale to the range [0, 1]. What happens if a new data point of 600 appears in test data?

Exercise 3 (Easy)

Explain why tree-based models (Random Forest, XGBoost) don't need feature scaling, but SVM and KNN do. Give a concrete numerical example.

Exercise 4 (Medium)

Given the data [1, 2, 3, 100, 5], compute both StandardScaler and RobustScaler transformations. Which one is less affected by the outlier (100)? Explain mathematically.

Exercise 5 (Medium)

For the categorical variable "Blood Type" with values {A, B, AB, O}, apply (a) Label Encoding, (b) One-Hot Encoding, (c) explain why Label Encoding is problematic for linear models.

Exercise 6 (Medium)

Prove that for StandardScaler, the transformed data always has mean = 0 and variance = 1, regardless of the original distribution.

Exercise 7 (Medium)

Derive the Ridge regression closed-form solution starting from the Lagrangian formulation: minimize ||y - Xw||² subject to ||w||² ≤ t. Show that the Lagrangian multiplier corresponds to the regularization parameter λ.

Exercise 8 (Medium)

Given X = [[1, 2], [3, 4], [5, 6]], y = [3, 7, 11], compute the Ridge regression solution for λ = 0.5 by hand. Compare with OLS.

Exercise 9 (Medium)

Explain the difference between MCAR, MAR, and MNAR with an example from an e-commerce context (e.g., missing product reviews on Flipkart).

Exercise 10 (Medium)

Implement target encoding for a categorical variable with 5 categories. Include smoothing with m=10. Show that without smoothing, rare categories overfit.

Exercise 11 (Medium)

Apply Box-Cox transform to [1, 4, 9, 16, 25] for λ = 0.5 and λ = 0. Compare the skewness before and after transformation.

Exercise 12 (Hard)

Prove the soft-thresholding result for Lasso: show that the optimal w_j is S(ρ, λ)/z where S is the soft-thresholding operator. Start from the subgradient optimality condition.

Exercise 13 (Hard)

Implement MICE (Multiple Imputation by Chained Equations) from scratch in Python. Test on a dataset with 20% MCAR missing values and compare with mean imputation. Measure RMSE of imputed values against true values.

Exercise 14 (Hard)

Show that Ridge regression coefficients are a scaled version of OLS coefficients when X has orthonormal columns (XᵀX = I). Specifically, show w_ridge = w_OLS / (1 + λ).

Exercise 15 (Hard)

Plot the regularization path for Lasso on a synthetic dataset with 20 features, where only 5 are truly predictive. Identify the λ value where the "correct" set of 5 features is selected. How does noise level affect this λ?

Exercise 16 (Medium)

Compare permutation importance, coefficient magnitude, and mutual information for feature importance on the same dataset. Do they agree? When might they disagree?

Exercise 17 (Medium)

Create a sklearn Pipeline that: (a) imputes missing values, (b) scales features, (c) applies Lasso feature selection, (d) fits a Ridge model. Use cross_val_score with 5-fold CV.

Exercise 18 (Hard)

Prove that the Elastic Net can select more than n features when p > n, while Lasso is limited to at most n. (Hint: consider the rank of the Lasso subproblem.)

Exercise 19 (Medium)

Explain why you should NEVER fit a scaler on test data. Construct a concrete example where fitting on test data leads to data leakage and inflated test accuracy.

Exercise 20 (Hard)

Implement coordinate descent for Elastic Net from scratch. Test on a dataset with groups of correlated features. Show that Elastic Net selects all features in a correlated group, while Lasso selects only one.

Exercise 21 (Easy)

What is the difference between LabelEncoder and OrdinalEncoder in scikit-learn? When should you use each?

Exercise 22 (Medium)

Design a feature engineering pipeline for an Indian ride-hailing app (Ola/Uber). List at least 10 engineered features for predicting ride demand, specifying the encoding/scaling method for each.

19

Multiple Choice Questions

1. Which scaler is most robust to outliers?

A) RobustScaler
B) StandardScaler
C) MinMaxScaler
D) MaxAbsScaler

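✅ A) RobustScaler. It centers on the median and scales by the IQR, both of which are far less sensitive to extreme values than the mean, standard deviation, minimum, or maximum used by the other scalers.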

2. What property does Lasso (L1) regularization have that Ridge (L2) does NOT?

A) Shrinks coefficients toward zero
B) Sets some coefficients to exactly zero
C) Has a closed-form solution
D) Reduces overfitting

✅ B) Lasso produces sparse solutions (some coefficients = exactly 0), performing automatic feature selection. Ridge shrinks but never zeros out.
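
The contrast can be checked numerically. The sketch below uses synthetic data with three informative and seven noise features; the penalty strengths are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.array([4.0, -3.0, 2.0] + [0.0] * 7)   # seven pure-noise features
y = X @ w_true + rng.normal(scale=1.0, size=200)

lasso = Lasso(alpha=0.3).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
print("Lasso exact zeros:", int(np.sum(lasso.coef_ == 0)))
print("Ridge exact zeros:", int(np.sum(ridge.coef_ == 0)))
print("Ridge smallest |coef|:", float(np.abs(ridge.coef_).min()))
```

Ridge coefficients on the noise features are typically small but non-zero, whereas Lasso sets most of them to exactly zero.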

3. The Ridge regression closed-form solution is:

A) w = (XᵀX)⁻¹Xᵀy
B) w = (XᵀX + λI)⁻¹Xᵀy
C) w = (XᵀX − λI)⁻¹Xᵀy
D) w = (XᵀX)⁻¹(Xᵀy + λI)

✅ B) Adding λI to XᵀX ensures invertibility and shrinks coefficients.

4. When should you use Yeo-Johnson transform instead of Box-Cox?

A) When data is normally distributed
B) When data has no outliers
C) When data contains zero or negative values
D) When data is already scaled

✅ C) Box-Cox requires strictly positive data. Yeo-Johnson extends it to handle zeros and negatives.
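
A quick check with sklearn's `PowerTransformer` illustrates the difference. The sample values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Toy column containing a negative value and a zero (assumed data).
x = np.array([[-2.0], [0.0], [1.0], [3.0], [10.0], [50.0]])

try:
    PowerTransformer(method="box-cox").fit_transform(x)
    boxcox_ok = True
except ValueError as err:
    boxcox_ok = False          # Box-Cox rejects non-positive input
    print("Box-Cox failed:", err)

yj = PowerTransformer(method="yeo-johnson")
x_yj = yj.fit_transform(x)     # works on zeros and negatives
print("Yeo-Johnson lambda:", yj.lambdas_)
print("transformed:", x_yj.ravel().round(3))
```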

5. In Elastic Net, setting α = 1 gives:

A) Pure Ridge
B) Pure Lasso
C) OLS (no regularization)
D) Equal mix of L1 and L2

✅ B) α (l1_ratio) = 1 means 100% L1 penalty = Lasso. α = 0 means pure Ridge.
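
Note the naming clash in scikit-learn: the mixing parameter called α here is `l1_ratio`, while sklearn's `alpha` argument is the overall penalty strength (λ in this chapter). A small sketch on synthetic data, with an illustrative λ, confirms that `l1_ratio=1` reproduces Lasso:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

lam = 0.1                                    # sklearn's `alpha`, the chapter's λ
enet_l1 = ElasticNet(alpha=lam, l1_ratio=1.0).fit(X, y)   # pure L1 penalty
lasso = Lasso(alpha=lam).fit(X, y)
enet_mix = ElasticNet(alpha=lam, l1_ratio=0.5).fit(X, y)  # equal L1/L2 mix

print("l1_ratio=1 matches Lasso:", np.allclose(enet_l1.coef_, lasso.coef_))
print("l1_ratio=0.5 coefficients:", enet_mix.coef_.round(3))
```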

6. Which feature encoding method risks target leakage if not used carefully?

A) One-Hot Encoding
B) Label Encoding
C) Target Encoding
D) Ordinal Encoding

✅ C) Target Encoding uses the target variable to create features. Without proper CV or smoothing, it leaks target information into features.

7. What does MICE stand for?

A) Maximum Iterative Conditional Estimation
B) Multiple Imputation by Chained Equations
C) Multivariate Imputation by Conditional Expectation
D) Mean Imputation with Cross Evaluation

✅ B) MICE iteratively imputes each feature using a model conditioned on all other features.

8. Which algorithm does NOT require feature scaling?

A) KNN
B) SVM
C) Random Forest
D) Logistic Regression with gradient descent

✅ C) Tree-based models split on thresholds and are invariant to monotonic feature transformations.

9. Mutual information for feature selection is preferred over Pearson correlation when:

A) Features are normally distributed
B) The relationship is strictly linear
C) The relationship may be non-linear
D) There are only two features

✅ C) Mutual information captures any statistical dependency (linear or non-linear), while Pearson correlation only measures linear relationships.
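
A symmetric quadratic relationship makes the point concrete. In this sketch (synthetic data, illustrative noise level), Pearson correlation is near zero while the mutual information estimate is clearly positive:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2 + rng.normal(scale=0.01, size=2000)   # strong but purely non-linear dependence

pearson = np.corrcoef(x, y)[0, 1]
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
print(f"Pearson r = {pearson:.3f}")
print(f"Mutual information = {mi:.3f} nats")
```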

10. The geometric reason Lasso produces sparse solutions is:

A) The L1 ball is smooth
B) The L1 ball is differentiable everywhere
C) The L1 ball (diamond) has sharp corners on the axes
D) The L1 ball is smaller than the L2 ball

✅ C) The diamond-shaped L1 constraint region has corners on coordinate axes. Loss contour ellipses are likely to touch these corners, giving solutions where some coordinates = 0.

11. In sklearn, which is the correct way to preprocess data for cross-validation?

A) Scale all data first, then split into folds
B) Use Pipeline with cross_val_score
C) Scale training data, then fit scaler on test data
D) Use the same scaler fit on test data for training

✅ B) Pipeline ensures scaling is fit only on training folds and applied to validation folds, preventing data leakage.
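
A minimal sketch of option B, using a synthetic classification dataset and logistic regression as assumed stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                  # re-fit on the training folds of each split
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score clones and fits the whole pipeline once per fold,
# so the validation fold never influences the scaler's mean or variance.
scores = cross_val_score(pipe, X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```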

12. For a dataset with 500 features and 100 samples (p >> n), the best regularization is:

A) No regularization (OLS)
B) Ridge only
C) Lasso only
D) Elastic Net

✅ D) Elastic Net. With p >> n, Lasso can select at most n features. Elastic Net combines L1 sparsity with L2 grouping, handling both the need for selection and correlated features.

20

Interview Questions

Interview Q1 — Conceptual (Google, Amazon)

Explain the difference between L1 and L2 regularization. When would you choose one over the other?

L1 (Lasso): Adds sum of absolute values of weights as penalty. Produces sparse models (some weights exactly zero) → automatic feature selection. Use when you believe only a few features are truly important.

L2 (Ridge): Adds sum of squared weights as penalty. Shrinks all weights toward zero but never exactly zero. Use when many features contribute small amounts to the prediction, or when features are correlated.

Geometric intuition: L1 constraint is a diamond (corners on axes → sparse). L2 constraint is a circle (smooth → no exact zeros).

When to choose: L1 for feature selection + interpretability. L2 for multicollinearity. ElasticNet when you want both.

Interview Q2 — Practical (Flipkart, Paytm)

How would you handle a categorical feature with 10,000 unique values in a linear model?

One-hot encoding would create 10,000 binary columns — too many. Options:

Target encoding (with smoothing and cross-validation to prevent leakage): Replace each category with the mean target value. Use sklearn's TargetEncoder with CV.
Frequency encoding: Replace with count/frequency of each category.
Hash encoding: Hash categories into a fixed number of buckets (e.g., 256). Some collisions, but manageable.
Embedding (for deep learning): Learn a dense vector representation for each category.
Group rare categories: Combine categories with <N occurrences into an "Other" bucket, then one-hot encode.
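
The first option can be sketched by hand with pandas. The function name `target_encode_oof`, the smoothing weight `m=10`, and the synthetic merchant data are illustrative assumptions; sklearn's `TargetEncoder` implements a related cross-fitted scheme with its own smoothing rule.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(cat, target, m=10.0, n_splits=5, seed=0):
    """Out-of-fold smoothed target encoding (hypothetical helper).

    Each row is encoded using statistics from the other folds only:
    enc = (n_c * mean_c + m * global_mean) / (n_c + m)
    """
    cat = pd.Series(cat).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        prior = target.iloc[fit_idx].mean()
        stats = target.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["sum", "count"])
        smooth = (stats["sum"] + m * prior) / (stats["count"] + m)
        # Categories unseen in the fitting folds fall back to the prior.
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smooth).fillna(prior).to_numpy()
    return encoded

rng = np.random.default_rng(0)
cats = rng.choice([f"merchant_{i}" for i in range(50)], size=1000)
y = rng.integers(0, 2, size=1000)
enc = target_encode_oof(cats, y)
print(enc.describe())
```

Larger `m` pulls rare categories more strongly toward the global mean, which is what limits overfitting on categories with few rows.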

Interview Q3 — Mathematical (DeepMind, MSRI)

Derive the closed-form solution for Ridge regression. Why does adding λI make (XᵀX + λI) always invertible?

Derivation: J(w) = ||y-Xw||² + λ||w||². Take gradient: ∇J = -2Xᵀ(y-Xw) + 2λw = 0. Rearranging: (XᵀX + λI)w = Xᵀy → w = (XᵀX + λI)⁻¹Xᵀy.

Invertibility: XᵀX is positive semi-definite with eigenvalues σᵢ² ≥ 0. Adding λI shifts all eigenvalues to σᵢ² + λ > 0 (when λ > 0). A matrix with all positive eigenvalues is positive definite, hence invertible.
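
Both claims can be verified numerically. The sketch below builds a deliberately rank-deficient design matrix (an assumption for illustration), checks the eigenvalue shift, and compares the closed-form weights against sklearn's `Ridge` without an intercept:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = np.column_stack([X, X[:, 0] + X[:, 1]])   # redundant fifth column makes XᵀX singular
y = rng.normal(size=50)
lam = 0.5

gram = X.T @ X
eig_before = np.linalg.eigvalsh(gram)                    # smallest eigenvalue is ~0
eig_after = np.linalg.eigvalsh(gram + lam * np.eye(5))   # every eigenvalue shifted up by lam

# Closed form: solve (XᵀX + λI) w = Xᵀy rather than forming the inverse.
w_closed = np.linalg.solve(gram + lam * np.eye(5), X.T @ y)
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print("min eigenvalue before:", eig_before.min())
print("min eigenvalue after: ", eig_after.min())
print("closed form matches sklearn:", np.allclose(w_closed, w_sklearn, atol=1e-6))
```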

Interview Q4 — System Design (Netflix, Spotify)

Design a feature store for a recommendation system that serves 100M users. How do you handle feature freshness, versioning, and serving latency?

Architecture: Two-layer feature store:

Offline store (batch): Precomputed features updated daily (user lifetime metrics, item popularity scores). Stored in Hive/S3.
Online store (real-time): Low-latency features updated in real-time (last 10 items viewed, session duration). Stored in Redis/DynamoDB.
Feature versioning: Each feature definition is versioned (v1.0, v1.1). Models reference specific versions to ensure reproducibility.
Serving: Feature vectors precomputed and cached. Serving latency < 10ms via key-value lookup.
Freshness: Critical features (last click) → real-time stream processing. Less critical (monthly spend) → batch update.

Interview Q5 — Debugging (Microsoft, Meta)

Your model's training accuracy is 99% but test accuracy is 60%. How do you diagnose and fix this using regularization and feature engineering?

Diagnosis: Classic overfitting. The model memorizes training data.

Fixes:

Regularization: Add L1/L2 penalty. Start with Ridge (λ=1) and increase until gap closes.
Feature reduction: Too many features? Use Lasso to eliminate irrelevant ones. Check if removing features improves test accuracy.
Check for leakage: Is a feature that perfectly correlates with the target leaking information? (e.g., using future data to predict past events)
Cross-validation: Use k-fold CV to get a realistic estimate instead of a single train/test split.
More data: If possible, collect more training samples to reduce variance.

Interview Q6 — Applied (HDFC, ICICI, Bajaj)

How would you build a credit scoring model for "new-to-credit" customers who have no credit bureau history?

Alternative data features:

UPI transaction patterns (frequency, amount distribution, merchant types)
Mobile phone metadata (recharge frequency, data usage, handset value)
Social signals (LinkedIn profile completeness, education background)
Rent payment history (via landlord verification)
E-commerce purchase patterns (from Flipkart/Amazon history)

Feature engineering: RFM (Recency-Frequency-Monetary) from UPI, spending volatility, income stability score.

Regularization: ElasticNet is a good fit here because many of these features are noisy or correlated, and the model must remain interpretable for RBI compliance.

Interview Q7 — Conceptual (Any ML role)

What is the difference between normalization and standardization? Give an example where each is preferred.

Standardization: z = (x - μ)/σ. Centers at 0, unit variance. Preferred when: data is Gaussian, using algorithms that assume zero-centered data (SVM, PCA, logistic regression).

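Normalization (Min-Max): x' = (x - min)/(max - min). Scales to [0,1]. Preferred when: data is NOT Gaussian, or bounded inputs are needed (neural network inputs, image pixels). Note that min-max scaling shifts the data, so it does not preserve zero entries in sparse data unless the feature minimum is 0; MaxAbsScaler is the usual choice for sparse inputs.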

Interview Q8 — Coding (Amazon, Google)

Write a function that implements the soft-thresholding operator used in Lasso coordinate descent.

def soft_threshold(rho, lam):
    """S(ρ, λ) = sign(ρ) · max(|ρ| - λ, 0)"""
    if rho > lam:
        return rho - lam
    elif rho < -lam:
        return rho + lam
    else:
        return 0.0
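
To show where the operator is used, here is a sketch of cyclic coordinate descent for Lasso built around it. It assumes the objective (1/2n)||y - Xw||² + λ||w||₁ with no intercept, which is the scaling sklearn's `Lasso` uses; the function name, data, and sweep count are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(rho, lam):
    """S(rho, lam) = sign(rho) * max(|rho| - lam, 0)."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=500):
    """Minimise (1/2n)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n                      # z_j = x_j^T x_j / n
    for _ in range(n_sweeps):
        for j in range(p):
            partial_resid = y - X @ w + X[:, j] * w[j]  # residual with feature j removed
            rho = X[:, j] @ partial_resid / n
            w[j] = soft_threshold(rho, lam) / z[j]      # w_j = S(rho, lam) / z_j
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=100)
lam = 0.1

w_cd = lasso_coordinate_descent(X, y, lam)
w_ref = Lasso(alpha=lam, fit_intercept=False, tol=1e-10, max_iter=10000).fit(X, y).coef_
print("coordinate descent:", w_cd.round(4))
print("sklearn Lasso:     ", w_ref.round(4))
```

Recomputing the full residual for every coordinate keeps the sketch readable; an efficient implementation would update the residual incrementally.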

Interview Q9 — Practical (Data Science roles)

You have a dataset with 30% missing values. Walk me through your approach to handle this.

Step-by-step approach:

Understand the mechanism: Is it MCAR, MAR, or MNAR? This determines the strategy.
Visualize patterns: Use missingno library to see correlations between missing values.
Drop if possible: If a feature has >70% missing, consider dropping it. If a row has >50% missing, consider dropping it.
Simple imputation first: Median for skewed numerical, mean for symmetric, mode for categorical.
Advanced imputation: KNN or MICE for MAR data where inter-feature relationships matter.
Create missing indicator: Add binary "is_missing" columns — sometimes missingness itself is a signal.
Evaluate: Compare model performance with different imputation strategies using cross-validation.
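
The final step, comparing imputation strategies with cross-validation, might look like the sketch below. It assumes synthetic data with 30% of cells removed completely at random and uses a Ridge model as the downstream learner; `IterativeImputer` is sklearn's MICE-style imputer and is still flagged experimental.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.datasets import make_regression
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.3] = np.nan    # ~30% of cells missing (MCAR)

imputers = {
    "median + indicator": SimpleImputer(strategy="median", add_indicator=True),
    "knn": KNNImputer(n_neighbors=5),
    "iterative (MICE-style)": IterativeImputer(max_iter=10, random_state=0),
}

results = {}
for name, imputer in imputers.items():
    # Imputer sits inside the pipeline so it is fit on training folds only.
    pipe = make_pipeline(imputer, Ridge(alpha=1.0))
    results[name] = cross_val_score(pipe, X_missing, y, cv=5, scoring="r2").mean()

for name, score in results.items():
    print(f"{name:>24s}: mean R^2 = {score:.3f}")
```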

Interview Q10 — Advanced (Research / Senior ML)

Prove that Lasso can select at most n features when p > n. Why does Elastic Net overcome this limitation?

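Lasso limitation (sketch): Let A be the set of features with non-zero weight, s = sign(w_A), and r = y - Xw the residual. The optimality (KKT) conditions require x_jᵀr = c·sign(w_j) for every j in A, where c > 0 is proportional to λ. If |A| > n, the columns of X_A are linearly dependent, so some non-zero v satisfies X_A v = 0. Multiplying the KKT conditions by v gives sᵀv = 0, so moving w_A along v changes neither the fit nor the L1 norm until some coordinate reaches zero. Repeating this yields a solution with the same objective value and at most n non-zero coefficients; when the Lasso solution is unique (for example, when X is in general position), it therefore has at most n non-zeros.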

Elastic Net overcomes this: The L2 penalty makes the objective strictly convex, so the solution is unique. The L2 term "stabilizes" the selection, allowing groups of correlated features to all receive non-zero weights. Mathematically, the Elastic Net can be rewritten as a Lasso problem on an augmented dataset that appends p "pseudo-observations" (the rows of √λ₂·I) to X, giving n + p rows in total. The augmented design matrix has rank p, so the "at most n" bound becomes "at most p" and all p features can potentially be selected.
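
The augmented-data view (from Zou and Hastie's naive elastic net, shown here without their rescaling factor) can be checked directly. The dimensions and the L2 strength `lam2` below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                                # more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam2 = 0.7                                   # L2 penalty strength (illustrative)

# Append p pseudo-observations that encode the ridge penalty.
X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

# For any w: ||y_aug - X_aug w||^2 == ||y - X w||^2 + lam2 * ||w||^2
w = rng.normal(size=p)
lhs = np.sum((y_aug - X_aug @ w) ** 2)
rhs = np.sum((y - X @ w) ** 2) + lam2 * np.sum(w ** 2)

rank_X = np.linalg.matrix_rank(X)
rank_aug = np.linalg.matrix_rank(X_aug)
print("objective identity holds:", np.isclose(lhs, rhs))
print("rank of X:", rank_X, "| rank of augmented X:", rank_aug)
```

Adding the L1 term to both sides turns the Elastic Net into a Lasso on `(X_aug, y_aug)`, whose design has full column rank.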

21

Research Problems

Research 1

Adaptive Regularization for Non-Stationary Data: Traditional regularization assumes a fixed λ throughout training. In streaming/online learning settings (e.g., real-time UPI fraud detection), the optimal λ changes as the data distribution shifts. Design an adaptive regularization scheme that adjusts λ based on detected distribution drift. How would you formalize "drift" and "adapt" mathematically? Implement and evaluate on a synthetic non-stationary dataset.

Research 2

Fairness-Aware Feature Selection: Standard Lasso selects features purely based on predictive power. In credit scoring (HDFC, SBI), selected features may serve as proxies for protected attributes (caste, religion, gender). Design a regularization framework that simultaneously optimizes predictive accuracy AND fairness (equalized odds). Formulate this as a constrained optimization problem and derive the Lagrangian. How does the Pareto frontier between accuracy and fairness change as you increase regularization?

Research 3

Neural Feature Engineering: Can we learn optimal feature transformations (scaling, encoding, interaction selection) end-to-end using a neural network? Design a "feature engineering layer" that sits before the main model and learns: (a) optimal power transform parameters (generalizing Box-Cox), (b) learned embeddings for categoricals, (c) automatic interaction detection. Compare with manual feature engineering on 5 Kaggle datasets. Under what conditions does learned feature engineering outperform manual engineering?

Research 4

Group Lasso for Hierarchical Indian Census Data: Indian census data has natural groupings: geographic features (state, district, tehsil), demographic features (age, gender, caste), economic features (occupation, industry). Standard Lasso selects individual features. Group Lasso selects/removes entire groups. Design a two-level regularization (group-level L1 + within-group L2) for census poverty prediction. Does the hierarchical structure improve both prediction and interpretability compared to standard Elastic Net?

22

Key Takeaways

🔑 Feature engineering is often the single most impactful step in ML pipeline development. As a rule of thumb, plan to spend 70-80% of your time understanding, cleaning, and transforming features; the model choice often matters less.

🔑 Scaling is essential for gradient-based (linear regression, SVM, neural nets) and distance-based (KNN, K-Means) algorithms, but not needed for tree-based models (Random Forest, XGBoost).

🔑 L1 (Lasso) creates sparsity due to the diamond-shaped constraint region with corners on axes. Use it for automatic feature selection when you believe the true signal is sparse.

🔑 L2 (Ridge) shrinks but doesn't zero out. Its closed-form solution w = (XᵀX + λI)⁻¹Xᵀy stabilizes OLS when features are correlated (multicollinear) or when p > n.

🔑 Elastic Net combines L1 + L2 and is a sensible default for high-dimensional data with correlated features. It can select more than n features (Lasso cannot when p > n).

🔑 Never fit a scaler on test data. Always use fit_transform() on training data and transform() on test data. sklearn Pipelines handle this automatically.

🔑 For missing data: understand the mechanism first. MCAR → simple imputation is usually adequate. MAR → MICE or KNN. MNAR → requires domain-specific modeling. Consider adding a "missing" indicator column, since missingness itself can be a signal.

🔑 For categorical encoding: Low cardinality → One-Hot. High cardinality → Target encoding (with smoothing). Ordinal data → Ordinal encoding. Tree models → Label encoding is fine.

🔑 Regularization path plots are your best tool for understanding feature importance across λ values. Features that survive high λ are the most robust predictors.

23

References & Further Reading

[1] Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B, 58(1), 267-288. — The foundational Lasso paper.
[2] Hoerl, A. E., & Kennard, R. W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics, 12(1), 55-67. — Ridge regression origin.
[3] Zou, H., & Hastie, T. (2005). "Regularization and Variable Selection via the Elastic Net." JRSS-B, 67(2), 301-320. — ElasticNet paper.
[4] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapters 3, 5, 7. — The ML theory bible.
[5] Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed. O'Reilly. Chapters 4, 6. — Practical implementation reference.
[6] van Buuren, S. (2018). Flexible Imputation of Missing Data, 2nd ed. CRC Press. — Comprehensive missing data treatment.
[7] Kuhn, M., & Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press. — Modern feature engineering practices.
[8] Box, G. E. P., & Cox, D. R. (1964). "An Analysis of Transformations." JRSS-B, 26(2), 211-252. — Box-Cox transform origin.
[9] Yeo, I.-K., & Johnson, R. A. (2000). "A New Family of Power Transformations to Improve Normality or Symmetry." Biometrika, 87(4), 954-959.
[10] Census of India 2011 Microdata Documentation. Ministry of Home Affairs, Government of India. — Indian census data reference.
[11] Reserve Bank of India (2023). "Master Direction on IT Governance, Risk, Data Management and Business Continuity Planning." — RBI AI governance framework.
[12] Koren, Y. (2009). "The BellKor Solution to the Netflix Grand Prize." Netflix Prize documentation. — Netflix Prize feature engineering.
[13] Scikit-learn documentation: sklearn.preprocessing, sklearn.impute, sklearn.linear_model. https://scikit-learn.org/
[14] Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." JMLR, 12, 2825-2830.