Feature Engineering, Scaling & Regularization
Master the art and science of preparing features: from raw data wrangling and imputation to sophisticated regularization techniques that prevent overfitting and reveal the most important signals in your data.
Learning Objectives
After completing this chapter, you will be able to:
Introduction
In the world of machine learning, there's a saying that has stood the test of time: "Garbage in, garbage out." No matter how powerful your model architecture is, whether it's a simple linear regression or a billion-parameter deep neural network, the quality of predictions fundamentally depends on the quality of features fed into it.
Feature engineering is the process of using domain knowledge to transform raw data into informative features that improve model performance. Feature scaling ensures that features are on comparable scales so that algorithms converge properly. Regularization adds constraints to the learning process to prevent overfitting, particularly when dealing with high-dimensional data where features may outnumber observations.
These three topics (engineering, scaling, and regularization) form a tightly coupled triad. A feature engineer who doesn't understand regularization may create too many features without knowing how to control model complexity. A data scientist who doesn't scale features may watch their gradient descent diverge or their regularization behave inconsistently.
This chapter is structured in three major arcs:
- Feature Engineering (§6.3-6.5): Creating features, handling missing data, encoding categoricals, and selecting the most informative features
- Feature Scaling & Transformation (§6.4-6.6): Standardization, normalization, and power transforms with mathematical foundations
- Regularization (§6.5-6.7): L1, L2, and ElasticNet, derived from first principles with geometric intuition
Throughout, we ground every concept in Indian case studies, from handling 100+ features in Census of India microdata to regularizing credit risk models at HDFC Bank. We also study global patterns from Kaggle competitions and the Netflix Prize.
Historical Background
The story of feature engineering is as old as statistics itself, but its formalization as a discipline is surprisingly recent.
The Pre-Computing Era (1800s–1950s)
Carl Friedrich Gauss (1809) introduced the method of least squares for fitting astronomical observations. Even then, the choice of which variables to include (which features to use) was crucial. Gauss manually selected celestial coordinates, velocities, and time as his "features."
Francis Galton (1886) discovered the concept of "regression toward the mean" and, in doing so, also implicitly performed feature engineering: he transformed raw height measurements into standardized deviations from the mean.
The Regularization Revolution (1943–1970)
Andrey Tikhonov (1943) introduced what we now call Tikhonov regularization (equivalent to Ridge/L2 regularization) in the context of solving ill-posed integral equations. His contribution was recognizing that adding a penalty term to the objective function could stabilize otherwise unstable solutions.
Arthur Hoerl & Robert Kennard (1970) independently developed Ridge Regression for statistics, showing that biased estimators with smaller variance could outperform OLS estimates (the bias-variance trade-off). Their seminal paper "Ridge Regression: Biased Estimation for Nonorthogonal Problems" changed how statisticians approached multicollinearity.
The Sparsity Era (1996–2005)
Robert Tibshirani (1996) published his foundational paper on the LASSO (Least Absolute Shrinkage and Selection Operator), introducing L1 regularization to statistics. The key insight was that the L1 penalty can shrink coefficients to exactly zero, performing automatic variable selection. This paper has been cited over 45,000 times.
Hui Zou & Trevor Hastie (2005) introduced the Elastic Net, combining L1 and L2 penalties to get the best of both worlds: sparsity from L1 and stability from L2.
The Kaggle Era & AutoML (2010–Present)
Kaggle competitions (starting ~2010) democratized feature engineering knowledge. Winners like Xavier Conort (Liberty Mutual), Lucas & team (Allstate), and many others showed that creative feature engineering (polynomial features, target encoding, frequency encoding, interaction features) often mattered more than model architecture.
Today, AutoML tools (Google AutoML, H2O, Auto-sklearn, TPOT) attempt to automate feature engineering, but domain expertise remains irreplaceable for most real-world problems.
Conceptual Explanation
3.1 Feature Engineering: The Art and Science
A feature (also called a variable, attribute, or predictor) is a measurable property of the phenomenon being observed. Feature engineering involves three major activities:
Feature Creation
Deriving new features from existing ones using domain knowledge:
- Interaction features: Product or ratio of two features (e.g., BMI = weight/height²)
- Polynomial features: Squares, cubes, and products of existing features (e.g., x², x₁·x₂)
- Date/Time decomposition: Extracting year, month, day, weekday, hour from timestamps
- Text features: TF-IDF, word count, sentiment scores from text data
- Aggregation features: Mean, sum, count over grouped entities (e.g., avg transaction per customer)
- Domain-specific features: Recency-Frequency-Monetary (RFM) in marketing, technical indicators in finance
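Several of the feature-creation ideas above can be sketched in a few lines of NumPy. The toy records and variable names (`weight_kg`, `customer_id`, `amount`, and so on) are hypothetical and chosen only for illustration; a real pipeline would more likely use pandas group-bys on actual data.

```python
import numpy as np
from datetime import datetime

# Hypothetical toy records (not from any real dataset)
weight_kg = np.array([70.0, 82.0, 55.0])
height_m = np.array([1.75, 1.80, 1.60])
timestamps = [
    datetime(2024, 1, 15, 9, 30),
    datetime(2024, 3, 2, 18, 5),
    datetime(2024, 3, 3, 23, 45),
]
customer_id = np.array([101, 102, 101])
amount = np.array([250.0, 40.0, 150.0])

# Interaction (ratio) feature: BMI = weight / height^2
bmi = weight_kg / height_m ** 2

# Polynomial features: a square and a pairwise product
weight_sq = weight_kg ** 2
weight_x_height = weight_kg * height_m

# Date/time decomposition
month = np.array([t.month for t in timestamps])
weekday = np.array([t.weekday() for t in timestamps])  # Monday = 0
hour = np.array([t.hour for t in timestamps])

# Aggregation feature: average transaction amount per customer,
# broadcast back to one value per row
avg_amount = {int(c): float(amount[customer_id == c].mean())
              for c in np.unique(customer_id)}
avg_amount_per_row = np.array([avg_amount[int(c)] for c in customer_id])

print("BMI:", np.round(bmi, 2))
print("month / weekday / hour:", month, weekday, hour)
print("avg amount per row:", avg_amount_per_row)
```

Each derived array has one entry per original row, so it can be appended to the feature matrix as a new column.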
Feature Transformation
Modifying existing features to make them more suitable for ML models:
- Scaling: StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
- Power transforms: Log, Box-Cox, Yeo-Johnson for skewed data
- Binning/Discretization: Converting continuous to categorical (age → age group)
- Encoding: Converting categorical to numerical representations
Feature Selection
Choosing the most informative subset of features:
- Filter methods: Statistical tests independent of the model (correlation, χ², mutual information)
- Wrapper methods: Use model performance to evaluate subsets (forward/backward selection, RFE)
- Embedded methods: Feature selection built into model training (Lasso, decision tree importance)
3.2 Handling Missing Data
Real-world datasets are rarely complete. Missing data arises from sensor failures, user non-response, data entry errors, or intentional omission. Understanding the mechanism of missingness is crucial:
| Mechanism | Definition | Example | Strategy |
|---|---|---|---|
| MCAR (Missing Completely At Random) | Missingness unrelated to any variable | Sensor randomly fails | Any imputation works; listwise deletion OK |
| MAR (Missing At Random) | Missingness depends on observed variables | Income missing more for younger respondents | Multiple imputation, model-based methods |
| MNAR (Missing Not At Random) | Missingness depends on the missing value itself | High-income people don't report income | Specialized models, sensitivity analysis |
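To make the three mechanisms concrete, the following sketch simulates each one on synthetic data and measures how far the mean of the observed incomes drifts from the true mean. The income model, the missingness probabilities, and the thresholds are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
age = rng.uniform(20, 60, n)
# Hypothetical income model: rises with age, plus noise
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, n)

# MCAR: every value has the same 30% chance of being missing
mcar_missing = rng.random(n) < 0.3
# MAR: missingness depends only on an observed variable (age)
mar_missing = rng.random(n) < np.where(age < 35, 0.6, 0.1)
# MNAR: missingness depends on the unobserved value itself (high income)
mnar_missing = rng.random(n) < np.where(income > 70_000, 0.6, 0.1)

true_mean = income.mean()
bias = {}
for name, miss in [("MCAR", mcar_missing),
                   ("MAR", mar_missing),
                   ("MNAR", mnar_missing)]:
    # Bias of the naive "just use the observed rows" estimate
    bias[name] = income[~miss].mean() - true_mean
    print(f"{name}: observed mean - true mean = {bias[name]:+.0f}")
```

Under MCAR the observed mean stays close to the truth. Under MAR it is biased, but the bias can in principle be corrected by conditioning on age, which is observed. Under MNAR the bias cannot be removed using the observed data alone, which is why the table recommends sensitivity analysis.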
Imputation Methods
Simple Imputation:
- Mean imputation: Replace with column mean. Fast but reduces variance and distorts correlations.
- Median imputation: Robust to outliers. Better for skewed distributions.
- Mode imputation: For categorical features. Replace with most frequent category.
- Constant imputation: Replace with a domain-specific value (e.g., 0, "Unknown").
Advanced Imputation:
- KNN Imputation: Find K nearest neighbors based on non-missing features, impute with their weighted average. Captures local structure but is computationally expensive (O(n²)).
- MICE (Multiple Imputation by Chained Equations): Iteratively imputes each feature using a regression model conditioned on all other features. Produces multiple plausible imputations, allowing uncertainty quantification.
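The claim that mean imputation reduces variance and distorts correlations can be checked numerically. This is a minimal sketch on synthetic data; the distributions and the 30% missing rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(50, 10, 1_000)
y = 2 * x + rng.normal(0, 5, 1_000)   # y is strongly correlated with x

# Knock out roughly 30% of x completely at random
x_missing = x.copy()
x_missing[rng.random(x.size) < 0.3] = np.nan

# Mean / median imputation: fill every gap with one summary statistic
# computed from the observed values only
x_mean_imputed = np.where(np.isnan(x_missing), np.nanmean(x_missing), x_missing)
x_median_imputed = np.where(np.isnan(x_missing), np.nanmedian(x_missing), x_missing)

corr_original = np.corrcoef(x, y)[0, 1]
corr_imputed = np.corrcoef(x_mean_imputed, y)[0, 1]

print(f"std:  original={x.std():.2f}  mean-imputed={x_mean_imputed.std():.2f}")
print(f"corr: original={corr_original:.3f}  mean-imputed={corr_imputed:.3f}")
```

Every imputed point sits exactly at the centre of the distribution, so the spread shrinks and the relationship with `y` is diluted. Methods such as KNN or MICE try to avoid this by using the other features to predict each missing value.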
3.3 Encoding Categorical Variables
| Method | How it Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Label Encoding | Map each category to an integer (0, 1, 2, …) | Simple, memory-efficient | Imposes ordinal relationship | Ordinal data, tree-based models |
| One-Hot Encoding | Create binary column for each category | No ordinal assumption | High cardinality → many columns | Linear models, low-cardinality |
| Ordinal Encoding | Map to integers preserving order | Preserves ordering | Assumes equal spacing | Truly ordinal features |
| Target Encoding | Replace category with mean of target | Handles high-cardinality | Target leakage risk | High-cardinality + careful CV |
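The first three encodings in the table can be sketched with plain NumPy. The sample values and the `education_order` list are illustrative assumptions; in practice scikit-learn's encoders would normally be used.

```python
import numpy as np

cities = np.array(["Mumbai", "Delhi", "Mumbai", "Bangalore"])
education = np.array(["Bachelor", "PhD", "High School", "Master"])

# Label encoding: integers assigned by sorted category name.
# The resulting order (Bangalore < Delhi < Mumbai) carries no meaning.
categories, label_encoded = np.unique(cities, return_inverse=True)

# One-hot encoding: one binary column per category
one_hot = np.eye(len(categories), dtype=int)[label_encoded]

# Ordinal encoding: integers follow an explicit, meaningful order
education_order = ["High School", "Bachelor", "Master", "PhD"]
ordinal_encoded = np.array([education_order.index(e) for e in education])

print("categories:", categories)
print("label:", label_encoded)
print("one-hot:\n", one_hot)
print("ordinal:", ordinal_encoded)
```

The one-hot matrix has exactly one 1 per row, which is why it makes no ordinal assumption but grows by one column for every new category.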
3.4 Feature Scaling: Why and When
Many ML algorithms, including gradient descent-based methods, distance-based methods (KNN, SVM), and PCA, are sensitive to the scale of features. If one feature ranges from 0 to 1 and another from 0 to 1,000,000, the larger feature will dominate distance computations and gradient updates.
| Scaler | Formula | Range | Handles Outliers? | Best For |
|---|---|---|---|---|
| StandardScaler | z = (x - μ) / σ | Unbounded (mostly within [-3, 3]) | No | Normally distributed data, SVM, logistic regression |
| MinMaxScaler | x' = (x - min) / (max - min) | [0, 1] | No | Neural networks, image pixel values |
| RobustScaler | x' = (x - Q₂) / (Q₃ - Q₁) | Variable | Yes | Data with outliers |
| MaxAbsScaler | x' = x / max(abs(x)) | [-1, 1] | No | Sparse data, already centered at 0 |
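The "Handles Outliers?" column is easier to appreciate with a single outlier in the data. The sketch below applies the four formulas from the table to a small hypothetical array.

```python
import numpy as np

# Hypothetical values; 100 is a deliberate outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

standard = (x - x.mean()) / x.std()
min_max = (x - x.min()) / (x.max() - x.min())
q1, q2, q3 = np.percentile(x, [25, 50, 75])
robust = (x - q2) / (q3 - q1)
max_abs = x / np.abs(x).max()

for name, values in [("standard", standard), ("min_max", min_max),
                     ("robust", robust), ("max_abs", max_abs)]:
    print(f"{name:9s}", np.round(values, 3))
```

With min-max and max-abs scaling the four ordinary points are squeezed into a tiny interval near zero, because the outlier defines the range. The robust scaler keeps them evenly spread, since the median and IQR ignore the extreme value.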
3.5 Normalization vs. Standardization
These terms are often confused. Here's the precise distinction:
Standardization (Z-score normalization)
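Transforms data to have zero mean and unit variance. The transformed data follows a standard normal distribution only if the original data was normal; standardization shifts and rescales the data but does not change the shape of its distribution. Formula: z = (x - μ) / σ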
Use when: Data is approximately Gaussian, algorithm assumes zero-centered data (PCA, SVM, Logistic Regression).
Normalization (Min-Max scaling)
Rescales data to a fixed range [0, 1] (or [-1, 1]). Preserves the shape of the distribution.
Formula: x' = (x - x_min) / (x_max - x_min)
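Use when: You need bounded values (neural network inputs) or the data is not Gaussian. Note that min-max scaling shifts zeros unless the feature minimum is already 0; to preserve zero entries in sparse data, use MaxAbsScaler instead.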
3.6 Power Transforms for Skewed Data
Many ML models assume (or perform better with) normally distributed features. Real-world data is often right-skewed (income, house prices, transaction amounts). Power transforms help:
- Log Transform: x' = log(x + 1). Simple, works for right-skewed positive data. Adding 1 handles zeros.
- Box-Cox Transform: Parametric family for strictly positive data. The optimal λ is typically estimated from the data by maximum likelihood. When λ=0, it's the log transform; when λ=1, it's linear (a shift by 1).
- Yeo-Johnson Transform: Extension of Box-Cox that handles negative values and zeros. More general-purpose.
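The effect of these transforms on skewness can be sketched with NumPy alone. The helper names `skewness` and `box_cox` are hypothetical, the data is synthetic, and λ is fixed by hand here rather than estimated by maximum likelihood as a library implementation would do.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def box_cox(x, lam):
    """Box-Cox transform for strictly positive x with a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=1, size=10_000)  # synthetic, right-skewed

print("raw        :", round(skewness(income), 2))
print("log1p      :", round(skewness(np.log1p(income)), 2))
print("box-cox λ=0:", round(skewness(box_cox(income, 0.0)), 2))
print("box-cox λ=1:", round(skewness(box_cox(income, 1.0)), 2))
```

The log-type transforms bring the skewness close to zero, while λ=1 only subtracts 1 from every value and therefore leaves the skewness unchanged.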
3.7 Regularization: Controlling Model Complexity
When models have too many features relative to observations, they tend to overfit, memorizing noise rather than learning patterns. Regularization adds a penalty term to the loss function that discourages overly complex models:
The Regularization Framework
General form: Loss_total = Loss_data + λ × Penalty(w)
- L1 Penalty (Lasso): Penalty = Σ|wᵢ|; encourages sparsity (zeros out coefficients)
- L2 Penalty (Ridge): Penalty = Σwᵢ²; shrinks coefficients toward zero (but not exactly zero)
- Elastic Net: Penalty = α·Σ|wᵢ| + (1-α)·Σwᵢ²; best of both worlds
The hyperparameter λ (regularization strength) controls the trade-off: λ=0 is no regularization (OLS), and as λ→∞ all coefficients shrink to zero.
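The framework above translates almost directly into code. In this sketch the function names `penalty` and `total_loss` are hypothetical, and mean squared error is assumed as the data loss; any other loss could be substituted.

```python
import numpy as np

def penalty(w, kind, alpha=0.5):
    """L1, L2, or Elastic Net penalty for a weight vector w."""
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    if kind == "l1":
        return l1
    if kind == "l2":
        return l2
    if kind == "elastic_net":
        return alpha * l1 + (1 - alpha) * l2
    raise ValueError(f"unknown penalty kind: {kind}")

def total_loss(X, y, w, lam, kind="l2", alpha=0.5):
    """Loss_total = Loss_data + lam * Penalty(w), with MSE as the data loss."""
    mse = np.mean((y - X @ w) ** 2)
    return mse + lam * penalty(w, kind, alpha)

w = np.array([3.0, 0.0, -2.0])
print("L1:", penalty(w, "l1"))                     # |3| + |0| + |-2|
print("L2:", penalty(w, "l2"))                     # 9 + 0 + 4
print("Elastic Net:", penalty(w, "elastic_net"))   # 0.5 * L1 + 0.5 * L2
```

Setting `lam=0` recovers the unregularized loss, which corresponds to the OLS case mentioned above.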
Mathematical Foundation
4.1 Feature Scaling Mathematics
4.2 Box-Cox and Yeo-Johnson Transforms
4.3 Regularization Mathematics
4.4 Mutual Information for Feature Selection
Formula Derivations
5.1 Deriving Ridge Regression Closed-Form
Derivation: Ridge Regression w = (XᵀX + λI)⁻¹Xᵀy
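The steps of this derivation are not reproduced in the text here, so the following is a sketch of the standard argument. It assumes a squared-error loss with no 1/n scaling and an intercept handled separately (for example by centering X and y).

```latex
\begin{aligned}
J(\mathbf{w}) &= \lVert \mathbf{y} - X\mathbf{w} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_2^2 \\
\nabla_{\mathbf{w}} J &= -2X^\top(\mathbf{y} - X\mathbf{w}) + 2\lambda \mathbf{w} = \mathbf{0} \\
(X^\top X + \lambda I)\,\mathbf{w} &= X^\top \mathbf{y} \\
\mathbf{w} &= (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}
\end{aligned}
```

For any λ > 0 the matrix XᵀX + λI is positive definite and therefore invertible, even when XᵀX itself is singular (for example when features are collinear or outnumber observations).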
5.2 Deriving Lasso Gradient with L1 Penalty
Derivation: Lasso Coordinate Descent Update
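The steps of this derivation are not reproduced in the text here, so the following is a sketch of the standard argument. It assumes the objective is scaled by 1/(2n), which is the convention that matches the soft-threshold level n·alpha used in the from-scratch code of Section 9.3.

```latex
\begin{aligned}
J(\mathbf{w}) &= \frac{1}{2n}\lVert \mathbf{y} - X\mathbf{w} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_1 \\
\mathbf{r}^{(j)} &= \mathbf{y} - \sum_{k \neq j} \mathbf{x}_k w_k, \qquad
\rho_j = \mathbf{x}_j^\top \mathbf{r}^{(j)}, \qquad
z_j = \mathbf{x}_j^\top \mathbf{x}_j \\
0 &\in \frac{1}{n}\left(z_j w_j - \rho_j\right) + \lambda\, \partial\lvert w_j \rvert \\
w_j &= \frac{S(\rho_j,\, n\lambda)}{z_j}, \qquad
S(\rho, t) = \operatorname{sign}(\rho)\,\max(\lvert \rho \rvert - t,\, 0)
\end{aligned}
```

The third line is the subgradient optimality condition for coordinate j with all other coefficients held fixed. Whenever |ρ_j| ≤ nλ, the only solution is w_j = 0, which is where the exact zeros of the Lasso come from.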
5.3 Geometric Proof: Why L1 Creates Sparsity
Geometric Intuition: L1 vs L2 Constraint Regions
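One way to see the difference numerically is the special case of an orthonormal design (XᵀX = I), where both penalized problems separate per coordinate and have closed-form solutions. The sketch below assumes that setting; the function names and the OLS coefficients are hypothetical.

```python
import numpy as np

def ridge_orthonormal(w_ols, lam):
    """Per-coordinate minimizer of 0.5*(w - w_ols)^2 + 0.5*lam*w^2."""
    return w_ols / (1.0 + lam)

def lasso_orthonormal(w_ols, lam):
    """Per-coordinate minimizer of 0.5*(w - w_ols)^2 + lam*|w| (soft-thresholding)."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

w_ols = np.array([2.0, -0.3, 1.0])   # hypothetical OLS coefficients
lam = 0.5

w_ridge = ridge_orthonormal(w_ols, lam)
w_lasso = lasso_orthonormal(w_ols, lam)

print("ridge:", np.round(w_ridge, 3))
print("lasso:", np.round(w_lasso, 3))
print("exact zeros (ridge, lasso):",
      int(np.sum(w_ridge == 0)), int(np.sum(w_lasso == 0)))
```

Ridge rescales every coefficient by the same factor, so a non-zero coefficient never reaches zero. Lasso subtracts a fixed amount from each magnitude, so any coefficient smaller than λ in absolute value lands exactly on zero. This mirrors the geometric picture: the L1 constraint region has corners on the axes, while the L2 region is smooth.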
5.4 ElasticNet Derivation
Derivation: Elastic Net Objective
Worked Numerical Examples
6.1 Feature Scaling: Step by Step
Example: Scaling House Prices Data
Consider a small dataset with two features:
| House | Area (sq ft) | Bedrooms | Price (₹ lakhs) |
|---|---|---|---|
| 1 | 1200 | 2 | 45 |
| 2 | 1800 | 3 | 65 |
| 3 | 2400 | 4 | 80 |
| 4 | 900 | 1 | 35 |
| 5 | 3000 | 5 | 110 |
Standardization (Area):
μ_area = (1200+1800+2400+900+3000)/5 = 1860
σ_area = √[((-660)²+(-60)²+540²+(-960)²+1140²)/5] = √[(435600+3600+291600+921600+1299600)/5] = √590400 ≈ 768.4
z₁ = (1200-1860)/768.4 = -0.859
z₂ = (1800-1860)/768.4 = -0.078
z₃ = (2400-1860)/768.4 = +0.703
z₄ = (900-1860)/768.4 = -1.249
z₅ = (3000-1860)/768.4 = +1.484
Min-Max scaling (Area):
x_min = 900, x_max = 3000, range = 2100
x'₁ = (1200-900)/2100 = 0.143
x'₂ = (1800-900)/2100 = 0.429
x'₃ = (2400-900)/2100 = 0.714
x'₄ = (900-900)/2100 = 0.000
x'₅ = (3000-900)/2100 = 1.000
Robust scaling (Area):
Sorted: [900, 1200, 1800, 2400, 3000]
Q₁ = 1200, Q₂ (median) = 1800, Q₃ = 2400, IQR = Q₃ - Q₁ = 1200
x'₁ = (1200-1800)/1200 = -0.500
x'₂ = (1800-1800)/1200 = 0.000
x'₃ = (2400-1800)/1200 = 0.500
x'₄ = (900-1800)/1200 = -0.750
x'₅ = (3000-1800)/1200 = 1.000
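The hand calculations for the Area column can be cross-checked with a few lines of NumPy. This sketch uses the population standard deviation (dividing by n), as in the worked example; the printed values should agree with the figures above up to rounding.

```python
import numpy as np

area = np.array([1200, 1800, 2400, 900, 3000], dtype=float)

# Standardization (population standard deviation, i.e. divide by n)
mu = area.mean()
sigma = area.std()
z = (area - mu) / sigma

# Min-max scaling
min_max = (area - area.min()) / (area.max() - area.min())

# Robust scaling (median and interquartile range)
q1, median, q3 = np.percentile(area, [25, 50, 75])
robust = (area - median) / (q3 - q1)

print("mean, std:", mu, round(sigma, 1))
print("z-scores :", np.round(z, 3))
print("min-max  :", np.round(min_max, 3))
print("robust   :", np.round(robust, 3))
```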
6.2 Ridge vs Lasso ā Same Dataset
Example: Ridge vs Lasso on a 3-Feature Problem
Dataset: 5 observations, 3 features (x₁, x₂, x₃), target y. Features are standardized.
| x₁ | x₂ | x₃ | y |
|---|---|---|---|
| -1.2 | 0.5 | -0.3 | 2.1 |
| 0.8 | -1.1 | 0.7 | 4.5 |
| 0.3 | 0.9 | -1.2 | 1.8 |
| 1.5 | -0.4 | 1.1 | 6.2 |
| -0.6 | 0.2 | 0.4 | 3.0 |
OLS (no regularization):
w_OLS = (XᵀX)⁻¹Xᵀy
After computation: w₁ = 1.82, w₂ = -0.43, w₃ = 1.05
All three features have non-zero coefficients. Note x₂ has a small coefficient.
Ridge (λ = 1):
w_ridge = (XᵀX + I)⁻¹Xᵀy
Result: w₁ = 1.41, w₂ = -0.28, w₃ = 0.83
All coefficients are shrunk toward zero, but none is exactly zero. The ratio between coefficients is approximately preserved.
Lasso (λ = 0.5), using coordinate descent:
Result: w₁ = 1.35, w₂ = 0.00, w₃ = 0.71
x₂'s coefficient is set to exactly zero. Lasso has performed automatic feature selection, determining that x₂ is not informative enough to justify its inclusion.
| Method | w₁ | w₂ | w₃ | # Non-zero |
|---|---|---|---|---|
| OLS | 1.82 | -0.43 | 1.05 | 3 |
| Ridge (λ=1) | 1.41 | -0.28 | 0.83 | 3 |
| Lasso (λ=0.5) | 1.35 | 0.00 | 0.71 | 2 |
6.3 Target Encoding: Worked Example
Example: Target Encoding for City Feature
| City | Default (y) |
|---|---|
| Mumbai | 0 |
| Delhi | 1 |
| Mumbai | 0 |
| Bangalore | 0 |
| Delhi | 1 |
| Mumbai | 1 |
| Bangalore | 0 |
| Delhi | 0 |
Naive encoding (mean of the target per city):
Mumbai: (0+0+1)/3 = 0.333
Delhi: (1+1+0)/3 = 0.667
Bangalore: (0+0)/2 = 0.000
Smoothed encoding (shrinks small groups toward the global mean):
Smoothed encoding = (nᵢ × meanᵢ + m × global_mean) / (nᵢ + m)
Global mean = 3/8 = 0.375, smoothing factor m = 2
Mumbai: (3×0.333 + 2×0.375)/(3+2) = 1.750/5 = 0.350
Delhi: (3×0.667 + 2×0.375)/(3+2) = 2.750/5 = 0.550
Bangalore: (2×0.000 + 2×0.375)/(2+2) = 0.750/4 = 0.188
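The smoothed encoding can be reproduced with a short function. This is a minimal sketch using only the standard library; the function name `smoothed_target_encoding` is hypothetical. To limit target leakage, the statistics would in practice be computed on training folds only and then applied to held-out rows.

```python
from collections import defaultdict

cities = ["Mumbai", "Delhi", "Mumbai", "Bangalore",
          "Delhi", "Mumbai", "Bangalore", "Delhi"]
defaults = [0, 1, 0, 0, 1, 1, 0, 0]

def smoothed_target_encoding(categories, targets, m=2.0):
    """Return {category: (n_i * mean_i + m * global_mean) / (n_i + m)}."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target      # n_i * mean_i is just the per-category sum
        counts[category] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}

encoding = smoothed_target_encoding(cities, defaults, m=2.0)
for city, value in encoding.items():
    print(f"{city:10s} {value:.4f}")

# Replace each row's category with its encoded value
encoded_column = [encoding[c] for c in cities]
```

A larger `m` pulls every category closer to the global mean, which matters most for rare categories whose raw means are unreliable.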
Visual Diagrams
Flowcharts
Python Implementation (From Scratch)
9.1 Feature Scaling from Scratch
    import numpy as np


    class StandardScalerFromScratch:
        """Z-score standardization: z = (x - mean) / std"""

        def fit(self, X):
            self.mean_ = np.mean(X, axis=0)
            self.std_ = np.std(X, axis=0)
            # Avoid division by zero for constant features
            self.std_[self.std_ == 0] = 1.0
            return self

        def transform(self, X):
            return (X - self.mean_) / self.std_

        def inverse_transform(self, Z):
            return Z * self.std_ + self.mean_

        def fit_transform(self, X):
            return self.fit(X).transform(X)


    class MinMaxScalerFromScratch:
        """Min-Max scaling to [0, 1]: x' = (x - min) / (max - min)"""

        def __init__(self, feature_range=(0, 1)):
            self.min_val, self.max_val = feature_range

        def fit(self, X):
            self.data_min_ = np.min(X, axis=0)
            self.data_max_ = np.max(X, axis=0)
            self.data_range_ = self.data_max_ - self.data_min_
            # Avoid division by zero for constant features
            self.data_range_[self.data_range_ == 0] = 1.0
            return self

        def transform(self, X):
            X_std = (X - self.data_min_) / self.data_range_
            return X_std * (self.max_val - self.min_val) + self.min_val

        def fit_transform(self, X):
            return self.fit(X).transform(X)


    class RobustScalerFromScratch:
        """Robust scaling using median and IQR"""

        def fit(self, X):
            self.median_ = np.median(X, axis=0)
            Q1 = np.percentile(X, 25, axis=0)
            Q3 = np.percentile(X, 75, axis=0)
            self.iqr_ = Q3 - Q1
            # Avoid division by zero for constant features
            self.iqr_[self.iqr_ == 0] = 1.0
            return self

        def transform(self, X):
            return (X - self.median_) / self.iqr_

        def fit_transform(self, X):
            return self.fit(X).transform(X)


    # ---- Demo ----
    np.random.seed(42)
    X = np.array([[1200, 2], [1800, 3], [2400, 4], [900, 1], [3000, 5]])
    print("Original:\n", X)
    print("StandardScaler:\n", StandardScalerFromScratch().fit_transform(X))
    print("MinMaxScaler:\n", MinMaxScalerFromScratch().fit_transform(X))
    print("RobustScaler:\n", RobustScalerFromScratch().fit_transform(X))
9.2 Ridge Regression from Scratch
    import numpy as np


    class RidgeRegressionFromScratch:
        """
        Ridge Regression: w = (X^T X + λI)^{-1} X^T y
        Implements closed-form solution derived in Section 5.1.
        """

        def __init__(self, alpha=1.0, fit_intercept=True):
            self.alpha = alpha
            self.fit_intercept = fit_intercept

        def fit(self, X, y):
            X = np.array(X, dtype=np.float64)
            y = np.array(y, dtype=np.float64)

            if self.fit_intercept:
                # Center X and y to handle intercept
                self.X_mean_ = np.mean(X, axis=0)
                self.y_mean_ = np.mean(y)
                X_c = X - self.X_mean_
                y_c = y - self.y_mean_
            else:
                X_c, y_c = X, y

            n, p = X_c.shape
            # Closed-form: w = (X^T X + alpha * I)^(-1) X^T y
            I = np.eye(p)
            XtX = X_c.T @ X_c
            Xty = X_c.T @ y_c
            # Solve the linear system rather than forming the inverse explicitly
            self.coef_ = np.linalg.solve(XtX + self.alpha * I, Xty)

            if self.fit_intercept:
                self.intercept_ = self.y_mean_ - self.X_mean_ @ self.coef_
            else:
                self.intercept_ = 0.0
            return self

        def predict(self, X):
            X = np.array(X, dtype=np.float64)
            return X @ self.coef_ + self.intercept_

        def score(self, X, y):
            """R² score"""
            y_pred = self.predict(X)
            ss_res = np.sum((y - y_pred) ** 2)
            ss_tot = np.sum((y - np.mean(y)) ** 2)
            return 1 - ss_res / ss_tot


    # ---- Demo ----
    np.random.seed(42)
    X = np.random.randn(100, 5)
    true_w = np.array([3, 0, -2, 0, 1.5])  # Only 3 features matter
    y = X @ true_w + np.random.randn(100) * 0.5

    for alpha in [0.01, 0.1, 1.0, 10.0]:
        model = RidgeRegressionFromScratch(alpha=alpha)
        model.fit(X, y)
        print(f"λ={alpha:5.2f} coefs={np.round(model.coef_, 2)} R²={model.score(X, y):.4f}")
9.3 Lasso Regression from Scratch (Coordinate Descent)
    import numpy as np


    def soft_threshold(rho, lam):
        """Soft-thresholding operator S(ρ, λ) = sign(ρ) · max(|ρ| - λ, 0)"""
        if rho > lam:
            return rho - lam
        elif rho < -lam:
            return rho + lam
        else:
            return 0.0


    class LassoFromScratch:
        """
        Lasso Regression using coordinate descent.
        Implements the soft-thresholding update derived in Section 5.2.
        """

        def __init__(self, alpha=1.0, max_iter=1000, tol=1e-6):
            self.alpha = alpha
            self.max_iter = max_iter
            self.tol = tol

        def fit(self, X, y):
            X = np.array(X, dtype=np.float64)
            y = np.array(y, dtype=np.float64)
            n, p = X.shape

            # Center data
            self.X_mean_ = np.mean(X, axis=0)
            self.y_mean_ = np.mean(y)
            X_c = X - self.X_mean_
            y_c = y - self.y_mean_

            # Initialize coefficients
            w = np.zeros(p)
            # Precompute z_j = sum(x_ij^2) for each feature
            z = np.sum(X_c ** 2, axis=0)

            for iteration in range(self.max_iter):
                w_old = w.copy()
                for j in range(p):
                    # Compute partial residual (excluding feature j)
                    r_j = y_c - X_c @ w + X_c[:, j] * w[j]
                    # ρ_j = inner product of feature j with the partial residual
                    rho_j = X_c[:, j] @ r_j
                    # Apply soft-thresholding (threshold n * alpha matches
                    # an objective scaled by 1 / (2n))
                    w[j] = soft_threshold(rho_j, n * self.alpha) / z[j]
                # Check convergence
                if np.max(np.abs(w - w_old)) < self.tol:
                    break

            self.coef_ = w
            self.intercept_ = self.y_mean_ - self.X_mean_ @ self.coef_
            self.n_iter_ = iteration + 1
            return self

        def predict(self, X):
            return np.array(X) @ self.coef_ + self.intercept_


    # ---- Demo: Lasso automatic feature selection ----
    np.random.seed(42)
    X = np.random.randn(100, 10)
    true_w = np.array([3, 0, 0, -2, 0, 1.5, 0, 0, 0, 0.8])
    y = X @ true_w + np.random.randn(100) * 0.3

    model = LassoFromScratch(alpha=0.05)
    model.fit(X, y)
    print("True weights: ", true_w)
    print("Lasso weights:", np.round(model.coef_, 3))
    print("Non-zero features:", np.sum(model.coef_ != 0), "/ 10")
    # Expected: the 4 truly non-zero features are recovered; a few small spurious
    # coefficients may also survive, depending on alpha and the noise draw.
9.4 Missing Data Imputation
    import numpy as np


    class KNNImputerFromScratch:
        """
        K-Nearest Neighbors Imputation.
        For each missing value, find K nearest neighbors based on
        non-missing features, then impute with their weighted average.
        """

        def __init__(self, n_neighbors=5):
            self.n_neighbors = n_neighbors

        def fit_transform(self, X):
            X = np.array(X, dtype=np.float64)
            X_imputed = X.copy()
            n, p = X.shape

            for i in range(n):
                for j in range(p):
                    if np.isnan(X[i, j]):
                        # Find features that are NOT missing for row i
                        valid_features = ~np.isnan(X[i, :])
                        # Find rows that have value for feature j
                        candidate_rows = ~np.isnan(X[:, j])
                        candidate_rows[i] = False  # exclude self

                        if np.sum(candidate_rows) == 0:
                            # Fallback: use column mean
                            col_values = X[~np.isnan(X[:, j]), j]
                            X_imputed[i, j] = np.mean(col_values)
                            continue

                        # Compute distances using valid features
                        distances = []
                        for k in range(n):
                            if not candidate_rows[k]:
                                continue
                            # Only use features present in both rows
                            shared = valid_features & ~np.isnan(X[k, :])
                            if np.sum(shared) == 0:
                                continue
                            dist = np.sqrt(np.sum((X[i, shared] - X[k, shared])**2))
                            distances.append((dist, k))

                        # Sort by distance, take K nearest
                        distances.sort()
                        neighbors = distances[:self.n_neighbors]

                        if len(neighbors) == 0:
                            col_values = X[~np.isnan(X[:, j]), j]
                            X_imputed[i, j] = np.mean(col_values)
                        else:
                            # Weighted average (inverse distance weighting)
                            weights = [1/(d + 1e-8) for d, _ in neighbors]
                            values = [X[idx, j] for _, idx in neighbors]
                            X_imputed[i, j] = np.average(values, weights=weights)

            return X_imputed


    # ---- Demo ----
    X = np.array([
        [25, 40000, 4.2],
        [32, np.nan, 3.8],
        [np.nan, 55000, np.nan],
        [28, 35000, 4.5],
        [45, 80000, 3.9]
    ])
    imputer = KNNImputerFromScratch(n_neighbors=2)
    X_filled = imputer.fit_transform(X)
    print("Imputed data:\n", np.round(X_filled, 1))
9.5 Feature Importance: Permutation Importance
    import numpy as np


    def permutation_importance(model, X, y, n_repeats=10, scoring=None, random_state=42):
        """
        Compute permutation importance for each feature.

        Algorithm:
        1. Compute baseline score with original data.
        2. For each feature j:
           a. Shuffle feature j's values randomly (n_repeats times).
           b. Compute new score with shuffled feature.
           c. Importance = baseline_score - shuffled_score.
        """
        rng = np.random.RandomState(random_state)

        if scoring is None:
            # Default: R² for regression
            def scoring(model, X, y):
                y_pred = model.predict(X)
                ss_res = np.sum((y - y_pred) ** 2)
                ss_tot = np.sum((y - np.mean(y)) ** 2)
                return 1 - ss_res / ss_tot

        baseline_score = scoring(model, X, y)
        n, p = X.shape
        importances = np.zeros((n_repeats, p))

        for r in range(n_repeats):
            for j in range(p):
                X_perm = X.copy()
                # Randomly shuffle feature j
                X_perm[:, j] = rng.permutation(X_perm[:, j])
                perm_score = scoring(model, X_perm, y)
                importances[r, j] = baseline_score - perm_score

        return {
            'importances_mean': np.mean(importances, axis=0),
            'importances_std': np.std(importances, axis=0),
            'importances': importances
        }


    # ---- Demo ----
    from sklearn.linear_model import LinearRegression

    np.random.seed(42)
    X = np.random.randn(200, 5)
    y = 3*X[:,0] - 2*X[:,2] + 0.5*X[:,4] + np.random.randn(200)*0.3

    model = LinearRegression().fit(X, y)
    result = permutation_importance(model, X, y)
    for j in range(5):
        print(f"Feature {j}: importance = {result['importances_mean'][j]:.4f} ± {result['importances_std'][j]:.4f}")
TensorFlow Implementation
    import tensorflow as tf
    import numpy as np

    # ---- Generate synthetic data ----
    np.random.seed(42)
    n_samples, n_features = 500, 20
    X_train = np.random.randn(n_samples, n_features).astype(np.float32)
    true_weights = np.zeros(n_features)
    true_weights[[0, 3, 7, 15]] = [3.0, -2.0, 1.5, -1.0]  # Only 4 active features
    y_train = (X_train @ true_weights + np.random.randn(n_samples) * 0.5).astype(np.float32)


    class RegularizedLinearModel(tf.keras.Model):
        """Linear model with configurable regularization in TF."""

        def __init__(self, n_features, l1_lambda=0.0, l2_lambda=0.0):
            super().__init__()
            # add_weight registers the variables with Keras so they show up in
            # trainable_variables (raw tf.Variable attributes are not tracked
            # automatically in Keras 3)
            self.w = self.add_weight(name='weights', shape=(n_features, 1),
                                     initializer='zeros')
            self.b = self.add_weight(name='bias', shape=(1,), initializer='zeros')
            self.l1_lambda = l1_lambda
            self.l2_lambda = l2_lambda

        def call(self, X):
            return tf.matmul(X, self.w) + self.b

        def regularization_loss(self):
            l1_loss = self.l1_lambda * tf.reduce_sum(tf.abs(self.w))
            l2_loss = self.l2_lambda * tf.reduce_sum(tf.square(self.w))
            return l1_loss + l2_loss


    def train_model(model, X, y, epochs=500, lr=0.01):
        optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
        X_tensor = tf.constant(X)
        y_tensor = tf.constant(y.reshape(-1, 1))

        for epoch in range(epochs):
            with tf.GradientTape() as tape:
                predictions = model(X_tensor)
                mse_loss = tf.reduce_mean(tf.square(y_tensor - predictions))
                total_loss = mse_loss + model.regularization_loss()
            gradients = tape.gradient(total_loss, model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))
            if (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}: Loss={total_loss.numpy():.4f}")
        return model


    # Train with different regularizations.
    # Gradient-based training rarely drives weights to exactly zero,
    # so "non-zero" is judged with a threshold of 0.1.
    print("="*50, "RIDGE (L2)")
    ridge_model = RegularizedLinearModel(n_features, l2_lambda=0.1)
    train_model(ridge_model, X_train, y_train)
    ridge_weights = ridge_model.w.numpy().flatten()
    print(f"Non-zero weights (|w| > 0.1): {np.sum(np.abs(ridge_weights) > 0.1)}")

    print("="*50, "LASSO (L1)")
    lasso_model = RegularizedLinearModel(n_features, l1_lambda=0.1)
    train_model(lasso_model, X_train, y_train)
    lasso_weights = lasso_model.w.numpy().flatten()
    print(f"Non-zero weights (|w| > 0.1): {np.sum(np.abs(lasso_weights) > 0.1)}")

    print("="*50, "ELASTIC NET")
    enet_model = RegularizedLinearModel(n_features, l1_lambda=0.05, l2_lambda=0.05)
    train_model(enet_model, X_train, y_train)
    enet_weights = enet_model.w.numpy().flatten()
    print(f"Non-zero weights (|w| > 0.1): {np.sum(np.abs(enet_weights) > 0.1)}")

    # ---- Using Keras built-in regularizers ----
    model_keras = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu',
                              kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.01, l2=0.01),
                              input_shape=(n_features,)),
        tf.keras.layers.Dense(32, activation='relu',
                              kernel_regularizer=tf.keras.regularizers.l2(0.01)),
        tf.keras.layers.Dense(1)
    ])
    model_keras.compile(optimizer='adam', loss='mse')
    model_keras.fit(X_train, y_train, epochs=100, verbose=0)
    # evaluate() returns the training loss (MSE plus regularization terms)
    print("Keras model score:", model_keras.evaluate(X_train, y_train, verbose=0))
    import tensorflow as tf


    # Feature columns with built-in preprocessing
    class TFFeaturePipeline:
        """Feature engineering pipeline using TensorFlow layers."""

        def build_preprocessing_model(self):
            # Normalization layer (StandardScaler equivalent)
            normalizer = tf.keras.layers.Normalization(axis=-1)

            # StringLookup + OneHot (One-Hot encoding)
            category_encoder = tf.keras.layers.StringLookup(
                vocabulary=['Mumbai', 'Delhi', 'Bangalore', 'Chennai'],
                output_mode='one_hot'
            )

            # Discretization (binning)
            age_binner = tf.keras.layers.Discretization(
                bin_boundaries=[18, 25, 35, 45, 55, 65]
            )
            return normalizer, category_encoder, age_binner

        def demo(self):
            # Example: Normalize numerical features
            data = tf.constant([[1200.0, 2.0], [1800.0, 3.0], [2400.0, 4.0], [900.0, 1.0]])
            normalizer = tf.keras.layers.Normalization(axis=-1)
            normalizer.adapt(data)
            print("Normalized:\n", normalizer(data).numpy())


    pipeline = TFFeaturePipeline()
    pipeline.demo()
Scikit-Learn Implementation
    import numpy as np
    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import (
        StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler,
        OneHotEncoder, OrdinalEncoder, LabelEncoder,
        PowerTransformer, PolynomialFeatures
    )
    from sklearn.impute import SimpleImputer, KNNImputer
    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer  # MICE
    from sklearn.linear_model import Ridge, Lasso, ElasticNet, LassoCV, RidgeCV
    from sklearn.feature_selection import (
        SelectKBest, mutual_info_regression, RFE, SelectFromModel
    )
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import make_regression

    # ============================================================
    # 1. DATA PREPARATION
    # ============================================================
    # Create a realistic mixed-type dataset
    np.random.seed(42)
    n = 500
    df = pd.DataFrame({
        # Stored as float so that NaN can be injected without a dtype change
        'age': np.random.randint(18, 70, n).astype(float),
        'income': np.random.lognormal(10, 1, n),  # Skewed!
        'city': np.random.choice(['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Hyderabad'], n),
        'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n),
        'experience': np.random.randint(0, 40, n),
        'credit_score': np.random.randint(300, 850, n).astype(float),
    })

    # Inject missing values (10%)
    for col in ['income', 'credit_score', 'age']:
        mask = np.random.random(n) < 0.1
        df.loc[mask, col] = np.nan

    # Create target
    y = (0.5 * df['age'].fillna(35)
         + 0.001 * df['income'].fillna(50000)
         + 0.8 * df['experience']
         + np.random.randn(n) * 5)

    # ============================================================
    # 2. COLUMN TRANSFORMER: Different processing for different types
    # ============================================================
    numeric_features = ['age', 'income', 'experience', 'credit_score']
    categorical_features = ['city']
    ordinal_features = ['education']

    numeric_pipeline = Pipeline([
        ('imputer', KNNImputer(n_neighbors=5)),  # KNN imputation
        ('scaler', StandardScaler()),            # Z-score scaling
    ])

    skewed_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('power', PowerTransformer(method='yeo-johnson')),  # Handle skew
    ])

    categorical_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
    ])

    ordinal_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OrdinalEncoder(
            categories=[['High School', 'Bachelor', 'Master', 'PhD']])),
    ])

    preprocessor = ColumnTransformer([
        ('num', numeric_pipeline, ['age', 'experience', 'credit_score']),
        ('skewed', skewed_pipeline, ['income']),
        ('cat', categorical_pipeline, categorical_features),
        ('ord', ordinal_pipeline, ordinal_features),
    ])

    # ============================================================
    # 3. FULL PIPELINE WITH REGULARIZED MODEL
    # ============================================================
    # Ridge Pipeline
    ridge_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', Ridge(alpha=1.0))
    ])

    # Lasso Pipeline
    lasso_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', Lasso(alpha=0.1))
    ])

    # ElasticNet Pipeline
    enet_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', ElasticNet(alpha=0.1, l1_ratio=0.5))
    ])

    # Cross-validation comparison
    for name, pipe in [('Ridge', ridge_pipeline),
                       ('Lasso', lasso_pipeline),
                       ('ElasticNet', enet_pipeline)]:
        scores = cross_val_score(pipe, df, y, cv=5, scoring='r2')
        print(f"{name:12s}: R² = {scores.mean():.4f} ± {scores.std():.4f}")

    # ============================================================
    # 4. REGULARIZATION PATH: Coefficients vs Lambda
    # ============================================================
    from sklearn.linear_model import lasso_path

    # Prepare data
    X_processed = preprocessor.fit_transform(df)
    alphas, coefs, _ = lasso_path(X_processed, y, alphas=np.logspace(-3, 1, 50))
    # lasso_path returns alphas sorted from largest to smallest, so column 0
    # of coefs is the strongest penalty and column -1 the weakest
    print("\nLasso Path Summary:")
    print(f"  Alphas tested: {len(alphas)}")
    print(f"  Features: {coefs.shape[0]}")
    print(f"  At α={alphas[-1]:.3f}: {np.sum(np.abs(coefs[:, -1]) > 1e-6)} non-zero features")
    print(f"  At α={alphas[0]:.1f}: {np.sum(np.abs(coefs[:, 0]) > 1e-6)} non-zero features")

    # ============================================================
    # 5. FEATURE SELECTION METHODS
    # ============================================================
    # Filter method: Mutual Information
    selector_mi = SelectKBest(mutual_info_regression, k=5)
    X_selected = selector_mi.fit_transform(X_processed, y)
    print(f"\nMutual Info - selected {X_selected.shape[1]} features")
    print(f"  MI scores: {np.round(selector_mi.scores_, 3)}")

    # Wrapper method: Recursive Feature Elimination
    rfe = RFE(Ridge(alpha=1.0), n_features_to_select=5)
    rfe.fit(X_processed, y)
    print(f"\nRFE - selected features: {rfe.support_}")
    print(f"  Feature ranking: {rfe.ranking_}")

    # Embedded method: Lasso-based selection
    selector_lasso = SelectFromModel(Lasso(alpha=0.1))
    selector_lasso.fit(X_processed, y)
    print(f"\nLasso selection - selected {np.sum(selector_lasso.get_support())} features")

    # ============================================================
    # 6. MICE IMPUTATION
    # ============================================================
    mice_imputer = IterativeImputer(
        max_iter=10,
        random_state=42,
        sample_posterior=True  # For multiple imputation
    )
    X_numeric = df[numeric_features].values
    X_mice = mice_imputer.fit_transform(X_numeric)
    print(f"\nMICE imputation: {np.sum(np.isnan(X_numeric))} NaN → {np.sum(np.isnan(X_mice))} NaN")
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    MaxAbsScaler, PowerTransformer
)

# Generate skewed data
np.random.seed(42)
data = np.random.lognormal(0, 1, 1000).reshape(-1, 1)
# Add outliers
data = np.vstack([data, [[50], [100], [200]]])

scalers = {
    'Original': None,
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler(),
    'MaxAbsScaler': MaxAbsScaler(),
    'Yeo-Johnson': PowerTransformer(method='yeo-johnson'),
}

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for ax, (name, scaler) in zip(axes.flat, scalers.items()):
    if scaler is None:
        transformed = data
    else:
        transformed = scaler.fit_transform(data)
    ax.hist(transformed, bins=50, color='#059669', alpha=0.7, edgecolor='white')
    ax.set_title(name, fontsize=14, fontweight='bold')
    ax.set_xlabel(f"Range: [{transformed.min():.1f}, {transformed.max():.1f}]")

plt.tight_layout()
plt.savefig('scaling_comparison.png', dpi=150)
plt.show()
print("Plot saved as scaling_comparison.png")
Indian Case Studies
Case Study 1: Census of India - Handling 100+ Mixed-Type Features
The Census of India microdata (2011) contains records for 600+ million individuals across 100+ variables: demographic (age, gender, religion, caste), geographic (state, district, urban/rural), economic (occupation, industry, income proxy), educational, and housing characteristics.
Feature Engineering Challenges:
- Mixed types: 40+ categorical variables (religion with 6+ categories, caste with 4 groups, occupation with 100+ codes), 30+ numerical, 20+ ordinal (education levels), 10+ binary
- High cardinality: District (700+), occupation (National Classification of Occupations has 500+ codes)
- Missing data: 15-30% missing for income-related variables (MNAR: higher-income individuals are more likely to skip)
- Hierarchical geography: State → District → Tehsil → Village, which requires hierarchical encoding
Solution Pipeline:
- Imputation: MICE for numerical variables (captures cross-variable dependencies in census data). Mode imputation for categorical.
- Encoding: Target encoding for district (700+ categories), ordinal encoding for education, one-hot for religion/caste
- Feature creation: Dependency ratio (dependents/working-age), literacy index (weighted education score), urbanization index
- Scaling: RobustScaler for income-related features (heavily right-skewed with outliers)
- Selection: Lasso for dimensionality reduction from 100+ to ~20 key features for poverty prediction
Results: After feature engineering, a simple Lasso model achieved 82% accuracy for poverty classification, comparable to random forests on raw data but with 5x faster inference and interpretable coefficients.
Case Study 2: Flipkart Product Pricing - E-Commerce Feature Engineering
Flipkart, India's largest e-commerce platform (owned by Walmart), uses ML to predict optimal product pricing, demand forecasting, and seller recommendations. The product catalog has millions of SKUs across electronics, fashion, groceries, and more.
Feature Engineering for Price Prediction:
- Text features: Product title → TF-IDF (top 500 terms), brand extraction, category keywords
- Image features: Product image → CNN embeddings (128-dim vector from ResNet)
- Seller features: Seller rating, # reviews, days on platform, return rate
- Temporal features: Day of week, month, festival season (Diwali/Dussehra effect), sale event (Big Billion Days)
- Interaction features: Brand × Category, Seller Rating × Price Range, Discount % × Season
- Aggregate features: Avg price in category, price rank within brand, price vs. MRP ratio
Scaling challenge: Price ranges from ₹50 (phone case) to ₹2,00,000+ (laptops). Log transform + RobustScaler is essential for price features. MinMaxScaler for ratings (1-5 bounded).
Impact: Feature engineering improvements led to 12% improvement in price prediction MAE and a reported ₹200 crore annual revenue increase from better pricing strategies.
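The scaling choices described above can be sketched in a few lines. This is a minimal illustration, not Flipkart's pipeline: the price and rating values are made up, and `np.log1p` is assumed as the log transform.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical prices in rupees, from a phone case to a laptop
prices = np.array([[50.0], [499.0], [1_999.0], [15_000.0], [60_000.0], [200_000.0]])

# log1p compresses the multi-order-of-magnitude range before scaling
log_prices = np.log1p(prices)

# RobustScaler centers on the median and divides by the interquartile range
scaled_prices = RobustScaler().fit_transform(log_prices)

# Ratings are bounded in [1, 5], so min-max scaling is a natural fit
ratings = np.array([[1.0], [3.5], [4.2], [5.0]])
scaled_ratings = MinMaxScaler().fit_transform(ratings)

print(scaled_prices.ravel().round(2))
print(scaled_ratings.ravel().round(2))
```

Applying the log first matters: without it, the few very expensive items would still dominate the interquartile range and the scaled values.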
Case Study 3: HDFC Bank Credit Risk - Regularization for High-Dimensional Financial Data
HDFC Bank, India's largest private bank, uses ML for credit risk assessment. With 60+ million customers and 500+ potential features per customer (bureau data, transaction history, demographics, account behavior), regularization is not just helpful; it's essential.
The Challenge:
- Curse of dimensionality: 500+ features but only ~5% default rate (imbalanced + high-dimensional)
- Multicollinearity: Many bureau variables are correlated (e.g., total debt, EMI, # active loans)
- Regulatory requirements: RBI mandates interpretable models for credit decisions (can't use black-box models)
Regularization Strategy:
- ElasticNet (L1 mixing ratio α=0.7): Primary model. L1 component selects ~50 key features from 500+. L2 component handles correlated bureau variables.
- Regularization path analysis: λ is chosen via 5-fold CV on historical data. The model identified 12 "core" features that remain non-zero across the λ values examined, including DPD (days past due), utilization ratio, age of oldest account, and number of inquiries.
- Stability selection: Run Lasso 100 times with subsampled data. Features selected in >70% of runs are considered "stable"; these are the ones reported to regulators.
Impact: Regularized model reduced false positive rate by 18% (fewer good customers rejected) while maintaining the same default detection rate, saving an estimated ₹500 crore annually in reduced losses.
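The stability selection step can be sketched as follows. This is a simplified illustration on synthetic regression data, not the bank's model: the data, the `alpha=0.1` penalty, and the 80% subsample fraction are all assumptions, and a credit model would use an L1-penalized classifier instead of `Lasso`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 300, 20
X = rng.normal(size=(n, p))
# Synthetic target: only the first 3 features carry signal
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)
X = StandardScaler().fit_transform(X)

n_runs, subsample_frac = 100, 0.8
selection_counts = np.zeros(p)
for _ in range(n_runs):
    # Fit Lasso on a random subsample drawn without replacement
    idx = rng.choice(n, size=int(subsample_frac * n), replace=False)
    model = Lasso(alpha=0.1).fit(X[idx], y[idx])
    selection_counts += np.abs(model.coef_) > 1e-8

# Fraction of runs in which each feature received a non-zero coefficient
selection_freq = selection_counts / n_runs
stable_features = np.where(selection_freq > 0.7)[0]
print("Selection frequency:", selection_freq.round(2))
print("Stable features:", stable_features)
```

Features that only appear in a minority of runs are likely artifacts of a particular sample, which is why the frequency threshold is a useful filter before reporting.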
Global Case Studies
Case Study 1: Kaggle Competition Feature Engineering Patterns
Analysis of top Kaggle competition solutions reveals consistent feature engineering patterns used by winners:
| Competition | Key Feature Engineering | Impact |
|---|---|---|
| Ames Housing (Regression) | Log-transform price, polynomial interactions (area × quality), target encoding for neighborhood | Reduced RMSE by 35% |
| Titanic (Classification) | Title extraction from name, family size, deck from cabin, fare per person | Accuracy 78% → 84% |
| Home Credit Default Risk | 500+ features from 7 tables: aggregations (mean, max, count), time-since features, bureau score interactions | Top 10 solutions all had 1000+ engineered features |
| Allstate Claims Severity | Stacking + target encoding + frequency encoding of 100+ categorical columns | Winners used 2000+ features with ElasticNet |
Common Patterns:
- Always try log-transform on right-skewed targets (price, claims, revenue)
- Target encoding > One-hot for high-cardinality categoricals with tree models
- Feature interactions between the most strongly correlated features often help, but should be validated with cross-validation
- Null indicator features ("is_missing" binary column) often carry signal
- Aggregation features from relational tables (mean, count, std per group)
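Three of the patterns above (null indicators, log transforms, and per-group aggregations) are easy to sketch with pandas. The toy table and column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical transaction table with missing amounts
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [100.0, np.nan, 250.0, 300.0, np.nan, 80.0],
})

# Null indicator: keep the fact that a value was missing as its own feature
df["amount_is_missing"] = df["amount"].isna().astype(int)

# Log-transform a right-skewed quantity (NaN stays NaN)
df["log_amount"] = np.log1p(df["amount"])

# Aggregation features per group, joined back onto each row
agg = (
    df.groupby("customer_id")["amount"]
      .agg(["mean", "count", "std"])
      .add_prefix("amount_")
      .reset_index()
)
features = df.merge(agg, on="customer_id", how="left")
print(features)
```

Note that `count` ignores missing values, so it doubles as a "number of observed transactions" feature.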
Case Study 2: Netflix Prize - Feature Engineering at Scale
The Netflix Prize (2006-2009, $1M prize) was one of the most famous ML competitions. The winning team (BellKor's Pragmatic Chaos) demonstrated the power of feature engineering for recommendation systems.
Key Features Engineered:
- Temporal effects: User rating behavior changes over time. A user who averaged 4 stars in one period might average 3.5 stars a few years later (rating inflation/deflation). Time-binned user biases were created to capture this drift.
- User/Movie biases: bᵤ = average rating of user u - global mean. bᵢ = average rating of movie i - global mean.
- Implicit features: Which movies a user chose to rate (not just the rating value) carries signal. Binary "did user rate this movie" features.
- Neighborhood features: Similarity to K nearest users, weighted average of similar users' ratings.
Regularization insight: SVD++, a key component of the winning blend, used L2 regularization on all latent factors. Without regularization, the model overfitted dramatically on the sparse rating matrix (99% of entries missing). λ = 0.02 was optimal, found via cross-validation.
Result: The combination of feature engineering + regularized matrix factorization achieved a 10.06% improvement over Netflix's own algorithm (Cinematch), barely crossing the 10% improvement threshold required for the $1M prize.
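The user and movie bias features defined above can be computed directly from a ratings table. This is a minimal sketch on a made-up five-row table, using the simple unregularized averages given in the text. The `baseline_prediction` helper is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Toy ratings table (hypothetical data)
ratings = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u2", "u3"],
    "movie":  ["m1", "m2", "m1", "m3", "m2"],
    "rating": [5, 3, 4, 2, 1],
})

global_mean = ratings["rating"].mean()

# b_u = average rating of user u - global mean
user_bias = ratings.groupby("user")["rating"].mean() - global_mean
# b_i = average rating of movie i - global mean
movie_bias = ratings.groupby("movie")["rating"].mean() - global_mean

def baseline_prediction(user, movie):
    """Global mean plus user and movie biases; unseen ids fall back to a bias of 0."""
    return global_mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)

print(baseline_prediction("u1", "m3"))
print(baseline_prediction("new_user", "m1"))
```

On sparse data these raw averages are noisy for users or movies with few ratings, which is the same overfitting problem the L2 penalty addresses for latent factors.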
Startup Applications
HealthTech: Practo - Patient Risk Scoring
Practo (India's largest healthtech platform) uses feature engineering on patient data: symptom frequency, appointment history, lab results, age-comorbidity interactions. RobustScaler handles extreme lab values. L1 regularization selects the top 20 risk factors from 200+ potential features.
AgriTech: CropIn - Crop Yield Prediction
CropIn engineers features from satellite imagery (NDVI, soil moisture), weather data (cumulative rainfall, GDD), and historical yields. Box-Cox transforms handle skewed yield distributions. ElasticNet handles correlated weather variables (temperature and humidity are negatively correlated).
FinTech: Razorpay - Fraud Detection
Features include transaction velocity (# transactions in last hour/day), amount deviation from user mean, merchant category risk scores, and device fingerprint features. All features are scaled with StandardScaler for the SVM-based fraud detector. L1 regularization identifies the top 30 fraud indicators from 500+ candidate features.
Government Applications
Aadhaar - Biometric Feature Quality Assessment
UIDAI processes 1.3 billion biometric records. Feature engineering draws on fingerprint quality scores (NFIQ), iris clarity metrics, and face image quality. Missing biometric modalities are handled with cascaded imputation. Regularized classifiers determine if a biometric sample is "good enough" for authentication.
NITI Aayog - District-Level Development Indices
NITI Aayog's Aspirational Districts Programme creates composite development indices from 49 indicators across health, education, agriculture, financial inclusion, and infrastructure. Feature scaling (MinMax to [0,100]) ensures indicators on different scales are comparable. PCA with standardized features identifies the key development dimensions.
ISRO - Satellite Image Feature Extraction
ISRO's remote sensing satellites (Cartosat, Resourcesat) generate multi-spectral imagery. Feature engineering involves computing vegetation indices (NDVI, EVI), texture features (GLCM), and spectral ratios. StandardScaler is applied across spectral bands before classification. Ridge regularization handles correlated spectral features.
Industry Applications
Manufacturing: Tata Steel - Quality Prediction
Tata Steel's ML system predicts steel quality from 200+ process parameters: temperature profiles, chemical composition, rolling speeds. Feature engineering includes time-series aggregations (moving averages, peak detection), interaction features (carbon × temperature), and polynomial features for non-linear effects. ElasticNet with α=0.3 balances interpretability (few key parameters) with handling correlated temperature sensors.
Telecom: Jio - Customer Churn Prediction
Jio's churn model uses 400+ features: usage patterns (data consumption trends, call frequency, recharge behavior), network quality metrics (drop rate, speed tests), and customer service interactions. Lasso reduces features to ~40 key predictors. Feature scaling is critical because usage (in GB) and call duration (in minutes) are on very different scales.
Pharma: Dr. Reddy's - Drug Response Prediction
Genomic data has 20,000+ gene expression features for a few hundred patients (p >> n, classic high-dimensional problem). ElasticNet is the standard approach: L1 selects the most relevant genes, L2 handles the extreme multicollinearity among co-expressed genes. Feature scaling (StandardScaler) is essential because gene expression levels vary by orders of magnitude.
Mini Projects
Mini Project 1: Indian Census Feature Engineering Pipeline
Objective: Build an end-to-end feature engineering pipeline for a subset of Indian Census data to predict household poverty status.
Steps:
- Data Loading: Use the Census 2011 sample dataset (or simulate with similar structure): age, gender, religion (6 categories), caste (4 categories), education (7 levels), occupation (10 categories), state (28), district (100+), rural/urban, house condition (3 levels), water source, toilet type, income proxy
- Missing Data Analysis: Visualize missingness patterns using a heatmap. Test MCAR vs MAR using Little's test. Apply MICE imputation for numerical features and mode imputation for categorical.
- Feature Engineering:
- Create dependency ratio, literacy score, housing quality index
- Target encode district (100+ categories)
- Ordinal encode education level
- One-hot encode religion and caste
- Scaling: RobustScaler for income (skewed), StandardScaler for others
- Feature Selection: Compare filter (mutual information), wrapper (RFE), and embedded (Lasso) methods. Create a Venn diagram showing overlap between selected feature sets.
- Modeling: Compare Ridge, Lasso, ElasticNet with 5-fold CV. Plot regularization paths.
Deliverables: Jupyter notebook with pipeline code, visualizations, comparison table, and 1-page report.
Mini Project 2: Credit Risk Regularization Study
Objective: Investigate how L1, L2, and ElasticNet regularization affect credit risk model performance and interpretability across different λ values.
Steps:
- Data: Use Kaggle's "Give Me Some Credit" dataset or simulate credit data with: monthly income, age, # dependents, # times 30-59 DPD, # times 60-89 DPD, # times 90+ DPD, revolving utilization, # open credit lines, # real estate loans, debt ratio + 10 noise features
- Feature Engineering: Create total DPD score, utilization buckets, income-to-debt ratio, age × experience interaction, polynomial features for top 5 predictors
- Regularization Path Analysis:
- Plot Ridge coefficients vs λ (all coefficients shrink smoothly)
- Plot Lasso coefficients vs λ (observe features dropping to zero at different λ)
- Plot ElasticNet coefficients vs λ with α = 0.5
- Mark the optimal λ from cross-validation on each plot
- Stability Selection: Run Lasso 100 times with 80% subsampling. Plot feature selection frequency. Identify "stable" features (selected >70% of the time).
- Performance Comparison: AUC-ROC curves for Ridge, Lasso, ElasticNet at optimal λ. Compare with unregularized logistic regression.
Deliverables: Regularization path plots, stability selection plot, AUC comparison table, feature importance ranking.
Mini Project 3: Scaling Effect Visualization Lab
Objective: Visually demonstrate how different scaling methods affect model performance and gradient descent convergence.
Steps:
- Create a dataset where features have very different scales (area: 500-5000, rooms: 1-10, price: 10⁶-10⁸)
- Train linear regression with gradient descent (from scratch) on: (a) unscaled, (b) StandardScaled, (c) MinMaxScaled, (d) RobustScaled data
- Plot the loss curve (loss vs iteration) for each to show how scaling dramatically reduces iterations to convergence
- Plot 2D contours of the loss surface with and without scaling to show the "elongated ellipse vs circle" effect
- Compare KNN accuracy with and without scaling on the same dataset
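As a starting point for the gradient descent comparison, the sketch below runs batch gradient descent on unscaled and standardized versions of a synthetic housing dataset. The data-generating coefficients, the noise level, and the step-size rule (1/L, with L the largest Hessian eigenvalue) are assumptions chosen for illustration; the MinMax and Robust variants are left to the project.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
area = rng.uniform(500, 5000, size=n)
rooms = rng.integers(1, 11, size=n).astype(float)
# Synthetic prices (hypothetical coefficients and noise)
price = 20_000 * area + 500_000 * rooms + rng.normal(scale=1e5, size=n)

def run_gd(X, y, n_iter=500):
    """Batch gradient descent on MSE with step size 1/L (L = largest Hessian eigenvalue)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # add intercept column
    hessian = 2.0 / len(y) * Xb.T @ Xb
    eigvals = np.linalg.eigvalsh(hessian)           # ascending order
    lr = 1.0 / eigvals[-1]
    w = np.zeros(Xb.shape[1])
    losses = []
    for _ in range(n_iter):
        residual = Xb @ w - y
        losses.append(np.mean(residual ** 2))
        w -= lr * (2.0 / len(y)) * Xb.T @ residual
    # Closed-form least-squares optimum, used only as a reference point
    w_opt, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    loss_opt = np.mean((Xb @ w_opt - y) ** 2)
    rel_gap = (losses[-1] - loss_opt) / (losses[0] - loss_opt)
    return rel_gap, eigvals[-1] / eigvals[0], losses

X_raw = np.column_stack([area, rooms])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

gap_raw, cond_raw, losses_raw = run_gd(X_raw, price)
gap_std, cond_std, losses_std = run_gd(X_std, price)
print(f"unscaled:     condition number {cond_raw:.1e}, remaining loss gap {gap_raw:.2e}")
print(f"standardized: condition number {cond_std:.1e}, remaining loss gap {gap_std:.2e}")
```

The condition number of the Hessian is the quantity behind the "elongated ellipse vs circle" picture: the closer it is to 1, the rounder the loss contours and the fewer iterations gradient descent needs. The returned `losses` lists can be plotted directly for the loss-curve step.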
End-of-Chapter Exercises
Given the data [10, 20, 30, 40, 50], compute the StandardScaler transformation by hand. Verify that the mean of the transformed data is 0 and the standard deviation is 1.
Apply MinMaxScaler to [100, 200, 300, 400, 500] to scale to the range [0, 1]. What happens if a new data point of 600 appears in test data?
Explain why tree-based models (Random Forest, XGBoost) don't need feature scaling, but SVM and KNN do. Give a concrete numerical example.
Given the data [1, 2, 3, 100, 5], compute both StandardScaler and RobustScaler transformations. Which one is less affected by the outlier (100)? Explain mathematically.
For the categorical variable "Blood Type" with values {A, B, AB, O}, apply (a) Label Encoding, (b) One-Hot Encoding, (c) explain why Label Encoding is problematic for linear models.
Prove that for StandardScaler, the transformed data always has mean = 0 and variance = 1, regardless of the original distribution.
Derive the Ridge regression closed-form solution starting from the Lagrangian formulation: minimize ||y - Xw||² subject to ||w||² ≤ t. Show that the Lagrange multiplier corresponds to the regularization parameter λ.
Given X = [[1, 2], [3, 4], [5, 6]], y = [3, 7, 11], compute the Ridge regression solution for λ = 0.5 by hand. Compare with OLS.
Explain the difference between MCAR, MAR, and MNAR with an example from an e-commerce context (e.g., missing product reviews on Flipkart).
Implement target encoding for a categorical variable with 5 categories. Include smoothing with m=10. Show that without smoothing, rare categories overfit.
Apply Box-Cox transform to [1, 4, 9, 16, 25] for λ = 0.5 and λ = 0. Compare the skewness before and after transformation.
Prove the soft-thresholding result for Lasso: show that the optimal w_j is S(ρ, λ)/z where S is the soft-thresholding operator. Start from the subgradient optimality condition.
Implement MICE (Multiple Imputation by Chained Equations) from scratch in Python. Test on a dataset with 20% MCAR missing values and compare with mean imputation. Measure RMSE of imputed values against true values.
Show that Ridge regression coefficients are a scaled version of OLS coefficients when X has orthonormal columns (XᵀX = I). Specifically, show w_ridge = w_OLS / (1 + λ).
Plot the regularization path for Lasso on a synthetic dataset with 20 features, where only 5 are truly predictive. Identify the λ value where the "correct" set of 5 features is selected. How does noise level affect this λ?
Compare permutation importance, coefficient magnitude, and mutual information for feature importance on the same dataset. Do they agree? When might they disagree?
Create a sklearn Pipeline that: (a) imputes missing values, (b) scales features, (c) applies Lasso feature selection, (d) fits a Ridge model. Use cross_val_score with 5-fold CV.
Prove that the Elastic Net can select more than n features when p > n, while Lasso is limited to at most n. (Hint: consider the rank of the Lasso subproblem.)
Explain why you should NEVER fit a scaler on test data. Construct a concrete example where fitting on test data leads to data leakage and inflated test accuracy.
Implement coordinate descent for Elastic Net from scratch. Test on a dataset with groups of correlated features. Show that Elastic Net selects all features in a correlated group, while Lasso selects only one.
What is the difference between LabelEncoder and OrdinalEncoder in scikit-learn? When should you use each?
Design a feature engineering pipeline for an Indian ride-hailing app (Ola/Uber). List at least 10 engineered features for predicting ride demand, specifying the encoding/scaling method for each.
Multiple Choice Questions
Interview Questions
L1 (Lasso): Adds sum of absolute values of weights as penalty. Produces sparse models (some weights exactly zero), which gives automatic feature selection. Use when you believe only a few features are truly important.
L2 (Ridge): Adds sum of squared weights as penalty. Shrinks all weights toward zero but never exactly zero. Use when many features contribute small amounts to the prediction, or when features are correlated.
Geometric intuition: L1 constraint is a diamond (corners on axes → sparse). L2 constraint is a circle (smooth → no exact zeros).
When to choose: L1 for feature selection + interpretability. L2 for multicollinearity. ElasticNet when you want both.
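The sparsity difference is easy to see empirically. The sketch below fits both models on synthetic data in which only 3 of 10 features matter. The data and penalty strengths are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
n, p = 200, 10
X = rng.normal(size=(n, p))
true_w = np.array([4.0, -3.0, 2.0] + [0.0] * 7)   # only 3 informative features
y = X @ true_w + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.2).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
print("Lasso coefficients:", lasso.coef_.round(3))
print("Ridge coefficients:", ridge.coef_.round(3))
print(f"Exact zeros: Lasso={n_zero_lasso}, Ridge={n_zero_ridge}")
```

Ridge shrinks the noise coefficients to small values, but they stay non-zero; Lasso sets most of them to exactly zero.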
One-hot encoding would create 10,000 binary columns, which is too many. Options:
- Target encoding (with smoothing and cross-validation to prevent leakage): Replace each category with the mean target value. Use sklearn's TargetEncoder with CV.
- Frequency encoding: Replace with count/frequency of each category.
- Hash encoding: Hash categories into a fixed number of buckets (e.g., 256). Some collisions, but manageable.
- Embedding (for deep learning): Learn a dense vector representation for each category.
- Group rare categories: Combine categories with <N occurrences into an "Other" bucket, then one-hot encode.
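Three of these options can be sketched with plain pandas. The toy data is hypothetical, and the smoothing formula used here, (n · category mean + m · global mean) / (n + m), is one common form assumed for illustration. For brevity the target encoding is computed in-sample; in practice it should be fit inside cross-validation folds to avoid leakage.

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["Pune", "Pune", "Pune", "Delhi", "Delhi", "Agra"],
    "target": [1, 0, 1, 1, 1, 0],
})

# Frequency encoding: share of rows belonging to each category
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Smoothed target encoding: rare categories are pulled toward the global mean
m = 2.0  # smoothing strength (assumed value)
global_mean = df["target"].mean()
stats = df.groupby("city")["target"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["city_te"] = df["city"].map(smoothed)

# Group rare categories (fewer than 2 rows) into "Other" before one-hot encoding
counts = df["city"].value_counts()
rare = counts[counts < 2].index
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "Other")
print(df)
```

Without smoothing, the single "Agra" row would be encoded as exactly 0, its own target value, which is the overfitting risk that smoothing reduces.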
Derivation: J(w) = ||y - Xw||² + λ||w||². Set the gradient to zero: ∇J = -2Xᵀ(y - Xw) + 2λw = 0. Rearranging: (XᵀX + λI)w = Xᵀy, so w = (XᵀX + λI)⁻¹Xᵀy.
Invertibility: XᵀX is positive semi-definite with eigenvalues σᵢ² ≥ 0. Adding λI shifts all eigenvalues to σᵢ² + λ > 0 (when λ > 0). A symmetric matrix with all positive eigenvalues is positive definite, hence invertible.
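The closed form can be checked numerically. The sketch below solves the linear system on random data and compares the result with scikit-learn's `Ridge`, which minimizes the same objective when `fit_intercept=False`. The data and the value λ = 2 are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)
lam = 2.0

# Solve (X^T X + lam * I) w = X^T y; solve() avoids forming an explicit inverse
A = X.T @ X + lam * np.eye(X.shape[1])
w_closed = np.linalg.solve(A, X.T @ y)

# Same objective in scikit-learn: ||y - Xw||^2 + alpha * ||w||^2
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

# Every eigenvalue of X^T X + lam * I is at least lam, so A is invertible
min_eig = np.linalg.eigvalsh(A).min()
print(w_closed.round(4), w_sklearn.round(4), round(min_eig, 4))
```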
Architecture: Two-layer feature store:
- Offline store (batch): Precomputed features updated daily (user lifetime metrics, item popularity scores). Stored in Hive/S3.
- Online store (real-time): Low-latency features updated in real-time (last 10 items viewed, session duration). Stored in Redis/DynamoDB.
- Feature versioning: Each feature definition is versioned (v1.0, v1.1). Models reference specific versions to ensure reproducibility.
- Serving: Feature vectors precomputed and cached. Serving latency < 10ms via key-value lookup.
- Freshness: Critical features (last click) → real-time stream processing. Less critical (monthly spend) → batch update.
Diagnosis: Classic overfitting. The model memorizes training data.
Fixes:
- Regularization: Add L1/L2 penalty. Start with Ridge (λ=1) and increase until gap closes.
- Feature reduction: Too many features? Use Lasso to eliminate irrelevant ones. Check if removing features improves test accuracy.
- Check for leakage: Is a feature that perfectly correlates with the target leaking information? (e.g., using future data to predict past events)
- Cross-validation: Use k-fold CV to get a realistic estimate instead of a single train/test split.
- More data: If possible, collect more training samples to reduce variance.
Alternative data features:
- UPI transaction patterns (frequency, amount distribution, merchant types)
- Mobile phone metadata (recharge frequency, data usage, handset value)
- Social signals (LinkedIn profile completeness, education background)
- Rent payment history (via landlord verification)
- E-commerce purchase patterns (from Flipkart/Amazon history)
Feature engineering: RFM (Recency-Frequency-Monetary) from UPI, spending volatility, income stability score.
Regularization: ElasticNet is critical here because many features are noisy or correlated, and we need interpretability for RBI compliance.
Standardization: z = (x - μ)/σ. Centers at 0, unit variance. Preferred when: data is roughly Gaussian, or when using algorithms that are sensitive to feature scale and benefit from zero-centered inputs (SVM, PCA, regularized logistic regression).
Normalization (Min-Max): x' = (x - min)/(max - min). Scales to [0,1]. Preferred when: data is NOT Gaussian, or bounded inputs are needed (neural network inputs, image pixels). Note that min-max scaling preserves zero entries only when the feature minimum is 0; for sparse data, MaxAbsScaler is the safer choice because it rescales without shifting.
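Both formulas can be checked against scikit-learn in a few lines. The small arrays below are made up; note that `StandardScaler` uses the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x_train = np.array([[2.0], [4.0], [6.0], [8.0]])
x_new = np.array([[12.0]])   # a later value outside the training range

# Standardization by hand: z = (x - mean) / std
z_manual = (x_train - x_train.mean()) / x_train.std()
z_sklearn = StandardScaler().fit_transform(x_train)

# Min-max by hand: x' = (x - min) / (max - min)
mm_manual = (x_train - x_train.min()) / (x_train.max() - x_train.min())
mm_scaler = MinMaxScaler().fit(x_train)
mm_sklearn = mm_scaler.transform(x_train)

# Values beyond the training range land outside [0, 1]
mm_new = mm_scaler.transform(x_new)
print(z_sklearn.ravel().round(3), mm_sklearn.ravel().round(3), mm_new.ravel().round(3))
```

The last line illustrates a practical difference: min-max output is only bounded for values within the range seen during fitting.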
def soft_threshold(rho, lam):
    """S(ρ, λ) = sign(ρ) · max(|ρ| - λ, 0)"""
    if rho > lam:
        return rho - lam
    elif rho < -lam:
        return rho + lam
    else:
        return 0.0
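To show where the operator is used, here is a minimal sketch of cyclic coordinate descent for Lasso built on it. It assumes scikit-learn's scaling of the objective, (1/(2n))·||y - Xw||² + λ||w||₁, with no intercept, so that the result can be compared with `Lasso`. The function name `lasso_coordinate_descent` and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(rho, lam):
    """S(rho, lam) = sign(rho) * max(|rho| - lam, 0)."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n                  # per-feature curvature
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution added back
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n
            w[j] = soft_threshold(rho, lam) / z[j]
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=100)

w_cd = lasso_coordinate_descent(X, y, lam=0.1)
w_sk = Lasso(alpha=0.1, fit_intercept=False, tol=1e-10, max_iter=10_000).fit(X, y).coef_
print(w_cd.round(4))
print(w_sk.round(4))
```

Each coordinate update is exactly the S(ρ, λ)/z form: coefficients whose correlation with the partial residual falls below λ are set to zero.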
Step-by-step approach:
- Understand the mechanism: Is it MCAR, MAR, or MNAR? This determines the strategy.
- Visualize patterns: Use missingno library to see correlations between missing values.
- Drop if possible: If a feature has >70% missing, consider dropping it. If a row has >50% missing, consider dropping it.
- Simple imputation first: Median for skewed numerical, mean for symmetric, mode for categorical.
- Advanced imputation: KNN or MICE for MAR data where inter-feature relationships matter.
- Create missing indicator: Add binary "is_missing" columns, since sometimes missingness itself is a signal.
- Evaluate: Compare model performance with different imputation strategies using cross-validation.
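The simple-imputation and missing-indicator steps can be combined in one call using `SimpleImputer` with `add_indicator=True`. The tiny age/income array is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Columns: age, income (hypothetical values; the last income is an outlier)
X = np.array([
    [25.0, 50_000.0],
    [32.0, np.nan],
    [np.nan, 72_000.0],
    [41.0, 1_200_000.0],
])

# Median imputation is robust to the outlier; add_indicator appends one
# binary "was missing" column per feature that had missing values during fit
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

The output has four columns: imputed age, imputed income, and the two indicator columns in the same feature order.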
Lasso limitation: The Lasso objective is a convex program. The solution lies on the boundary of the L1 ball. For a linear system with n equations and p unknowns (p > n), the solution space is (p-n)-dimensional. The L1 optimization selects a vertex of the feasible polytope, which can have at most n non-zero coordinates (by the theory of linear programming, a basic feasible solution has at most n non-zero variables).
Elastic Net overcomes this: The L2 penalty makes the objective strictly convex, so the solution is unique and doesn't need to be a vertex of a polytope. The L2 term "stabilizes" the selection, allowing groups of correlated features to all receive non-zero weights. Mathematically, the Elastic Net can be rewritten as a Lasso problem on augmented data with p additional "pseudo-observations" (n + p rows in total). The augmented design matrix has rank p, so up to all p features can be selected.
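The augmented-data view (due to Zou and Hastie, 2005) can be checked numerically: appending p pseudo-observations of the form √(n·l2)·I with zero targets turns the Elastic Net into a Lasso on n + p rows. The sketch below assumes scikit-learn's parameterization of the penalties; the data and penalty values are arbitrary, and the L1 weight is rescaled because scikit-learn divides the squared loss by the number of rows.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n, p = 20, 50                                     # more features than observations
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.array([2.0, -1.0, 1.5, 0.5, -2.0]) + rng.normal(scale=0.1, size=n)

alpha, l1_ratio = 0.1, 0.5
l1_pen = alpha * l1_ratio                         # weight on ||w||_1
l2_pen = alpha * (1.0 - l1_ratio)                 # weight on 0.5 * ||w||_2^2

enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                  tol=1e-10, max_iter=100_000).fit(X, y)

# Augmented data: p extra rows sqrt(n * l2_pen) * I, each with target 0
X_aug = np.vstack([X, np.sqrt(n * l2_pen) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

# Plain Lasso on the augmented problem, with the L1 weight rescaled for n + p rows
lasso_aug = Lasso(alpha=l1_pen * n / (n + p), fit_intercept=False,
                  tol=1e-10, max_iter=100_000).fit(X_aug, y_aug)

rank_X, rank_aug = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X_aug)
print(f"rank(X) = {rank_X}, rank(X_aug) = {rank_aug}")
print("max coefficient difference:", np.abs(enet.coef_ - lasso_aug.coef_).max())
```

The rank jump from n to p is what removes the "at most n features" ceiling.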
Research Problems
Adaptive Regularization for Non-Stationary Data: Traditional regularization assumes a fixed λ throughout training. In streaming/online learning settings (e.g., real-time UPI fraud detection), the optimal λ changes as the data distribution shifts. Design an adaptive regularization scheme that adjusts λ based on detected distribution drift. How would you formalize "drift" and "adapt" mathematically? Implement and evaluate on a synthetic non-stationary dataset.
Fairness-Aware Feature Selection: Standard Lasso selects features purely based on predictive power. In credit scoring (HDFC, SBI), selected features may serve as proxies for protected attributes (caste, religion, gender). Design a regularization framework that simultaneously optimizes predictive accuracy AND fairness (equalized odds). Formulate this as a constrained optimization problem and derive the Lagrangian. How does the Pareto frontier between accuracy and fairness change as you increase regularization?
Neural Feature Engineering: Can we learn optimal feature transformations (scaling, encoding, interaction selection) end-to-end using a neural network? Design a "feature engineering layer" that sits before the main model and learns: (a) optimal power transform parameters (generalizing Box-Cox), (b) learned embeddings for categoricals, (c) automatic interaction detection. Compare with manual feature engineering on 5 Kaggle datasets. Under what conditions does learned feature engineering outperform manual engineering?
Group Lasso for Hierarchical Indian Census Data: Indian census data has natural groupings: geographic features (state, district, tehsil), demographic features (age, gender, caste), economic features (occupation, industry). Standard Lasso selects individual features. Group Lasso selects/removes entire groups. Design a two-level regularization (group-level L1 + within-group L2) for census poverty prediction. Does the hierarchical structure improve both prediction and interpretability compared to standard Elastic Net?
Key Takeaways
Use fit_transform() on training data and transform() on test data. sklearn Pipelines handle this automatically.
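A minimal sketch of this rule, on synthetic data: the manual version fits the scaler on the training split only, and the pipeline version produces identical predictions because it applies the same discipline internally.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=20.0, size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Manual version: scaling statistics come from the training split only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)              # transform only, never fit on test data
manual = Ridge(alpha=1.0).fit(X_train_s, y_train)

# Pipeline version: fit() fits the scaler on X_train, predict() only transforms
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)

print(np.allclose(pipe.predict(X_test), manual.predict(X_test_s)))
```

The same property holds inside `cross_val_score`: when the whole pipeline is passed in, the scaler is refit on each training fold.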
References & Further Reading
- [1] Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B, 58(1), 267-288. The foundational Lasso paper.
- [2] Hoerl, A. E., & Kennard, R. W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics, 12(1), 55-67. Ridge regression origin.
- [3] Zou, H., & Hastie, T. (2005). "Regularization and Variable Selection via the Elastic Net." JRSS-B, 67(2), 301-320. ElasticNet paper.
- [4] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapters 3, 5, 7. The ML theory bible.
- [5] Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed. O'Reilly. Chapters 4, 6. Practical implementation reference.
- [6] van Buuren, S. (2018). Flexible Imputation of Missing Data, 2nd ed. CRC Press. Comprehensive missing data treatment.
- [7] Kuhn, M., & Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press. Modern feature engineering practices.
- [8] Box, G. E. P., & Cox, D. R. (1964). "An Analysis of Transformations." JRSS-B, 26(2), 211-252. Box-Cox transform origin.
- [9] Yeo, I.-K., & Johnson, R. A. (2000). "A New Family of Power Transformations to Improve Normality or Symmetry." Biometrika, 87(4), 954-959.
- [10] Census of India 2011 Microdata Documentation. Ministry of Home Affairs, Government of India. Indian census data reference.
- [11] Reserve Bank of India (2023). "Master Direction on IT Governance, Risk, Data Management and Business Continuity Planning." RBI AI governance framework.
- [12] Koren, Y. (2009). "The BellKor Solution to the Netflix Grand Prize." Netflix Prize documentation. Netflix Prize feature engineering.
- [13] Scikit-learn documentation: sklearn.preprocessing, sklearn.impute, sklearn.linear_model. https://scikit-learn.org/
- [14] Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." JMLR, 12, 2825-2830.