Learning Objectives
By the end of this chapter, you will be able to:
Entropy & Information Gain
Derive entropy from first principles and compute information gain for any dataset split.
Splitting Criteria Mastery
Compare and apply Information Gain, Gain Ratio, Gini Impurity, and Variance Reduction.
ID3, C4.5 & CART Algorithms
Implement and differentiate between the three fundamental tree-building algorithms.
Tree Construction & Pruning
Build trees recursively, apply pre-pruning constraints, and perform cost-complexity post-pruning.
Regression Trees
Extend classification trees to predict continuous values using variance reduction.
Random Forests & Bagging
Understand bootstrap aggregation, feature randomness, and why ensembles reduce variance.
OOB Error & Feature Importance
Estimate out-of-bag error and compute Gini importance and permutation importance.
Full Implementation
Code Decision Trees and Random Forests from scratch in Python and with scikit-learn/TensorFlow.
Decision Trees appear in virtually every ML interview and exam. You must be able to build a tree by hand, compute entropy and Gini impurity, and explain the bias-variance tradeoff of trees vs. forests. These are non-negotiable skills.
Introduction
Imagine you are a loan officer at a bank. A customer walks in and asks for a โน5,00,000 personal loan. How do you decide? You might ask: "What is their annual income?" If it's above โน8,00,000, you feel more confident. Then: "Do they have an existing loan?" If yes, you're cautious. "What's their credit score?" Above 750? Approve. Below 600? Reject.
Congratulations โ you just built a decision tree in your head. Decision trees are one of the most natural, interpretable, and powerful machine learning algorithms. They mirror how humans actually make decisions: through a series of yes/no questions that partition the problem space.
๐ณ The Big Idea
A Decision Tree recursively partitions the feature space using axis-aligned splits, creating a tree structure where each internal node represents a test on a feature, each branch represents an outcome, and each leaf node represents a prediction (class label or continuous value). A Random Forest builds many trees on random subsets of data and features, then aggregates their predictions โ turning a high-variance model into a low-variance powerhouse.
Why Study Decision Trees in 2025?
- Interpretability: Trees are "white-box" models โ you can trace every decision. Essential in finance, healthcare, and legal domains.
- No Preprocessing: Unlike SVMs and neural networks, trees don't need feature scaling, normalization, or one-hot encoding of ordinal features.
- Non-linearity: Trees capture complex, non-linear decision boundaries effortlessly.
- Foundation for Ensembles: Random Forests, Gradient Boosting (XGBoost, LightGBM, CatBoost) are all built on decision trees. These dominate Kaggle competitions and industry applications.
- Real-time Deployment: A trained tree is just a series of if-else statements โ lightning fast at inference.
Decision trees bridge the gap between simple linear models (Chapters 4-7) and complex neural networks (Chapters 10-12). They show that you don't always need gradient descent โ sometimes the best approach is to greedily partition the data. Understanding when trees work well (tabular data, mixed features, need for interpretability) vs. when neural networks shine (images, text, large-scale unstructured data) is a hallmark of ML maturity.
Chapter Roadmap
We begin by deriving entropy and information gain from scratch, then build the ID3, C4.5, and CART algorithms step by step. We'll work through complete numerical examples by hand, learn pruning strategies, extend to regression trees, and then level up to Random Forests. We'll implement everything from scratch in Python, then use scikit-learn and TensorFlow for production pipelines. Along the way, we'll explore real-world case studies from SBI's credit scoring to Microsoft's Kinect body tracking.
Historical Background
The Origins of Information Theory
The mathematical foundation of decision trees begins with Claude Shannon's groundbreaking 1948 paper "A Mathematical Theory of Communication." Shannon defined entropy as a measure of uncertainty or information content in a random variable. This concept, borrowed from thermodynamics, would become the cornerstone of the ID3 algorithm decades later.
Timeline of Key Developments
| Year | Milestone | Contributor | Significance |
|---|---|---|---|
| 1948 | Information Theory | Claude Shannon | Defined entropy, foundation for Information Gain |
| 1963 | AID (Automatic Interaction Detection) | Morgan & Sonquist | First systematic use of trees in statistics |
| 1979 | ID3 Algorithm | J. Ross Quinlan | First practical decision tree using Information Gain |
| 1984 | CART (Classification and Regression Trees) | Breiman, Friedman, Olshen, Stone | Introduced Gini impurity, binary trees, cost-complexity pruning |
| 1993 | C4.5 Algorithm | J. Ross Quinlan | Improved ID3 with Gain Ratio, handles continuous features, missing values |
| 1996 | Bagging | Leo Breiman | Bootstrap Aggregating โ reduce variance via ensemble |
| 2001 | Random Forests | Leo Breiman | Added feature randomness to bagging โ a landmark paper |
| 2014 | XGBoost | Tianqi Chen | Gradient boosted trees with regularization โ Kaggle dominance begins |
| 2017 | LightGBM & CatBoost | Microsoft / Yandex | Faster, more efficient gradient boosting implementations |
Leo Breiman: The Father of Random Forests
Leo Breiman (1928โ2005) was a statistician at UC Berkeley whose work transformed machine learning. His 2001 paper "Random Forests" combined two of his earlier ideas โ bagging (1996) and random subspace method โ into a single, elegant framework. Breiman was also a vocal critic of traditional statistics, arguing in his famous 2001 paper "Statistical Modeling: The Two Cultures" that predictive modeling (algorithmic approach) was often more useful than explanatory modeling (data modeling approach).
India's banking sector was among the earliest adopters of decision tree models. In the early 2000s, ICICI Bank and HDFC Bank began using CART-based models for credit scoring. Today, the Reserve Bank of India (RBI) mandates that banks must be able to explain why a loan application was rejected โ making interpretable models like decision trees essential for regulatory compliance under the Fair Practices Code.
J. Ross Quinlan's Legacy
Australian computer scientist J. Ross Quinlan developed both ID3 (1979) and its successor C4.5 (1993). C4.5 was ranked #1 in the "Top 10 Algorithms in Data Mining" by the IEEE International Conference on Data Mining in 2006. Quinlan later developed C5.0, a commercial version with boosting support and faster execution.
Conceptual Explanation
What is a Decision Tree?
A decision tree is a flowchart-like structure that maps observations (features) to conclusions (predictions). Think of it as a game of "20 Questions" where each question is carefully chosen to maximally reduce uncertainty about the answer.
๐ฟ Tree Anatomy
- Root Node: The topmost node โ the first question asked (most informative feature)
- Internal Nodes: Decision points that test a feature condition
- Branches: Outcomes of a test (e.g., "Age โค 30" โ left, "Age > 30" โ right)
- Leaf Nodes: Terminal nodes that give the final prediction
- Depth: Length of the longest path from root to leaf
๐ฏ The Core Question
At each node, the algorithm asks: "Which feature, and which threshold, gives me the best split?"
A "good split" is one that makes the resulting child nodes as pure as possible โ ideally, each child should contain samples of only one class.
The key difference between algorithms (ID3, C4.5, CART) lies in how they define "best split."
The Three Major Algorithms
๐ต ID3 (Iterative Dichotomiser 3)
Creator: J. Ross Quinlan (1979) | Split Criterion: Information Gain | Tree Type: Multi-way
- Works only with categorical features
- Creates one branch per unique value of the selected feature
- Greedy algorithm โ selects the feature with highest Information Gain at each node
- Limitation: Biased toward features with many values (e.g., "Student ID" would always win)
- No built-in pruning โ prone to overfitting
๐ข C4.5 (Successor to ID3)
Creator: J. Ross Quinlan (1993) | Split Criterion: Gain Ratio | Tree Type: Multi-way
- Handles both categorical and continuous features
- Uses Gain Ratio to correct ID3's bias toward high-cardinality features
- Handles missing values by distributing them proportionally
- Includes post-pruning (error-based pruning)
- Ranked #1 in Top 10 Data Mining Algorithms (IEEE ICDM, 2006)
๐ด CART (Classification and Regression Trees)
Creator: Breiman et al. (1984) | Split Criterion: Gini Impurity / Variance Reduction | Tree Type: Binary
- Always produces binary trees (yes/no splits)
- Uses Gini impurity for classification, variance reduction for regression
- Handles both categorical and continuous features naturally
- Features cost-complexity pruning (the gold standard for post-pruning)
- scikit-learn's default โ this is the algorithm you use when you call
DecisionTreeClassifier()
In practice, the difference between Information Gain and Gini Impurity rarely matters โ they produce similar trees 95%+ of the time. The real advantage of CART is its binary-only structure, which makes it computationally simpler and pairs beautifully with ensemble methods like Random Forests and Gradient Boosting. When you see "decision tree" in industry, it almost always means CART.
How Does a Decision Tree Learn?
The learning process follows a greedy, top-down, recursive partitioning strategy called TDIDT (Top-Down Induction of Decision Trees):
- Start with all training data at the root node
- Select the best feature (and threshold for continuous features) to split on
- Create child nodes based on the split
- Recurse on each child node with its subset of data
- Stop when a stopping criterion is met (pure node, max depth, min samples, etc.)
- Assign prediction to leaf nodes (majority class for classification, mean value for regression)
Regression Trees
While classification trees predict discrete labels, regression trees predict continuous values. The key differences:
- Splitting criterion: Instead of Gini or entropy, use variance reduction (or MSE reduction)
- Leaf prediction: Instead of majority vote, predict the mean value of training samples in that leaf
- Everything else stays the same: recursive splitting, pruning, the tree structure itself
Random Forests: The Ensemble Approach
A single decision tree is powerful but unstable โ small changes in the data can produce completely different trees (high variance). Random Forests solve this by building many trees and averaging their predictions:
๐ฒ๐ฒ๐ฒ Random Forest = Bagging + Feature Randomness
Step 1 โ Bootstrap: Create B random subsets of the training data by sampling with replacement (each subset is ~63.2% unique samples).
Step 2 โ Feature Randomness: At each split in each tree, consider only a random subset of m features (typically โp for classification, p/3 for regression, where p = total features).
Step 3 โ Grow Trees: Build a full decision tree on each bootstrap sample, with no pruning.
Step 4 โ Aggregate: For classification, take the majority vote. For regression, take the mean.
Why Does Random Forest Work?
The magic comes from decorrelation. If all trees are built on the same features, they'll be similar (correlated), and averaging them won't help much. By randomizing features, each tree captures different patterns. When you average many uncorrelated, individually-imperfect estimators, the errors cancel out.
where ฯ = average pairwise correlation between trees,
ฯยฒ = variance of each tree, B = number of trees
As B โ โ: Var โ ฯฯยฒ (irreducible correlation floor)
Feature randomness reduces ฯ โ lower variance!
In industry, Random Forests are often the first model you try on tabular data. They work well out of the box with minimal tuning, handle missing values and mixed feature types, provide feature importance for free, and serve as a strong baseline before trying more complex models like XGBoost or neural networks.
Mathematical Foundation
Entropy: Measuring Uncertainty
Entropy quantifies the "impurity" or "disorder" in a dataset. For a random variable with K possible outcomes:
where pi = proportion of class i in set S
Convention: 0 ยท logโ(0) = 0
Intuition: Entropy is highest (= logโK) when all classes are equally likely (maximum uncertainty), and lowest (= 0) when all samples belong to one class (complete certainty).
Binary Entropy Function
For a two-class problem (positive proportion p, negative proportion 1-p):
H(0) = 0, H(0.5) = 1 (max), H(1) = 0
Information Gain
Information Gain measures how much entropy is reduced after splitting on a feature A:
= H(parent) - weighted average H(children)
Gain Ratio
C4.5 normalizes Information Gain by the feature's intrinsic value to avoid bias toward high-cardinality features:
IV(A) = -โv โ Values(A) (|Sv| / |S|) ยท logโ(|Sv| / |S|)
Gini Impurity
CART uses Gini impurity โ a faster-to-compute alternative to entropy:
= โiโ j pi ยท pj
For binary: Gini = 2p(1-p)
Gini โ [0, 0.5] for binary; [0, 1-1/K] for K classes
Interpretation: Gini impurity equals the probability that two randomly chosen samples from the node would have different class labels. Perfect purity โ Gini = 0.
Variance Reduction (Regression Trees)
Var(S) = (1/|S|) โi (yi - ศณ)ยฒ
Gini vs. Entropy: Mathematical Comparison
| Property | Entropy | Gini Impurity |
|---|---|---|
| Formula | -โ pแตข logโ(pแตข) | 1 - โ pแตขยฒ |
| Range (binary) | [0, 1] | [0, 0.5] |
| Maximum at | p = 0.5 โ H = 1 | p = 0.5 โ G = 0.5 |
| Computation | Requires logarithm (slower) | Polynomial (faster) |
| Sensitivity | More sensitive to class probability changes | Prefers larger, purer partitions |
| Used by | ID3, C4.5 | CART (scikit-learn default) |
Out-of-Bag (OOB) Error Estimation
In bagging, each bootstrap sample contains ~63.2% of the data. The remaining ~36.8% are out-of-bag samples. For each sample xแตข, we can predict using only the trees where xแตข was OOB, giving a "free" validation estimate:
OOB Error = (1/n) โi=1n ๐[ลทแตขOOB โ yแตข]
where ลทแตขOOB = majority vote of trees where xแตข was out-of-bag
Feature Importance Measures
๐ต Gini Importance (MDI)
Mean Decrease in Impurity: Sum the total impurity reduction provided by each feature across all trees:
FI(j) = โtrees โnodes splitting on j ฮGininode
Pro: Fast, computed during training. Con: Biased toward high-cardinality features.
๐ข Permutation Importance (MDA)
Mean Decrease in Accuracy: Randomly permute feature j's values and measure how much accuracy drops:
FI(j) = baseline_score - score_after_permuting_j
Pro: Model-agnostic, unbiased. Con: Slower (requires re-evaluation per feature).
In exams, you'll often be asked: "Why is Gini importance biased?" The answer: continuous or high-cardinality features have more possible split points, giving them more chances to reduce impurity. A random ID number would rank as highly important by Gini importance โ which is clearly wrong. Always prefer permutation importance for reliable feature ranking.
Formula Derivations
Deriving Entropy from First Principles
We want a function H(pโ, pโ, ..., pโ) that measures "how surprised we are" by a random event. Shannon showed that the unique function satisfying three natural axioms is entropy.
๐ Derivation: Shannon Entropy from Axioms
Axiom 1 (Continuity): H should be continuous in the probabilities pแตข. Small changes in probabilities should cause small changes in uncertainty.
Axiom 2 (Maximum): For a uniform distribution (all pแตข = 1/K), H should be maximized, and H(1/K, ..., 1/K) should increase with K. More equally likely outcomes = more uncertainty.
Axiom 3 (Additivity / Chain Rule): If a choice is broken into sub-choices, the total entropy should decompose as:
The joint entropy of choosing A then B equals the entropy of A plus the expected entropy of B given A.
Start with uniform distribution: Define f(n) = H(1/n, ..., 1/n). From Axiom 2, f is increasing. From Axiom 3, consider nแต equally likely outcomes grouped in stages:
This is the functional equation for logarithms! So f(n) = c ยท log(n) for some constant c > 0.
Extend to non-uniform distributions: Consider a distribution (pโ, ..., pโ) where each pแตข = nแตข/N is rational. Using the grouping axiom:
Choose c = 1 and base 2: With c = 1 and logโ, we get:
The unit is bits (binary digits) because we use base 2. One bit = the information from a fair coin flip.
Deriving Gini Impurity
๐ Derivation: Gini Impurity from Misclassification Probability
Setup: Suppose we randomly pick a sample from node S, and then randomly assign it a label according to the class distribution in S. What's the probability of misclassification?
Compute: Probability that we pick a sample of class i AND assign label j โ i:
Gini impurity is the expected error rate if we randomly classify samples according to the class distribution. A pure node (all same class) has Gini = 0.
Deriving the Relationship Between Entropy and Gini
For binary classification with positive proportion p, using the Taylor expansion of logโ(p) around p = 0.5:
H(p) โ 2ยทGini(p) for p near 0.5
This is why entropy and Gini produce similar trees!
Entropy is a slightly more "curved" version of Gini.
Cost-Complexity Pruning Derivation
๐ Cost-Complexity Pruning (CART)
Define the cost-complexity measure: For a tree T with terminal nodes (leaves) |Tฬ|:
where R(T) = total misclassification rate (resubstitution error), ฮฑ โฅ 0 is the complexity parameter.
Prune when the subtree isn't worth the complexity: For an internal node t with subtree Tโ, prune Tโ (replace with a leaf) when:
This ratio is called the effective alpha of node t. We prune from the weakest link up.
Use cross-validation: Generate a sequence of pruned trees Tโ โ Tโ โ ... for increasing ฮฑ. Select the ฮฑ (and corresponding tree) that minimizes cross-validated error. scikit-learn's ccp_alpha parameter controls this.
Worked Numerical Examples
Example 1: Building a Decision Tree by Hand
Dataset: Should a bank approve a loan?
| # | Income | Credit Score | Employed | Approve? |
|---|---|---|---|---|
| 1 | High | Good | Yes | โ Yes |
| 2 | High | Good | No | โ Yes |
| 3 | Low | Good | Yes | โ Yes |
| 4 | Low | Bad | Yes | โ No |
| 5 | Low | Bad | No | โ No |
Step 1: Compute Parent Entropy
S has 3 Yes, 2 No โ p(Yes) = 3/5, p(No) = 2/5
H(S) = -(3/5)ยทlogโ(3/5) - (2/5)ยทlogโ(2/5)
= -(0.6)ยท(-0.737) - (0.4)ยท(-1.322)
= 0.442 + 0.529
= 0.971 bits
Step 2: Information Gain for "Income"
Income = High: {1โ , 2โ } โ 2 Yes, 0 No โ H = 0
Income = Low: {3โ , 4โ, 5โ} โ 1 Yes, 2 No
H(Low) = -(1/3)ยทlogโ(1/3) - (2/3)ยทlogโ(2/3)
= -(0.333)ยท(-1.585) - (0.667)ยท(-0.585)
= 0.528 + 0.390
= 0.918 bits
IG(S, Income) = 0.971 - [(2/5)ยท0 + (3/5)ยท0.918]
= 0.971 - 0.551
= 0.420 bits
Step 3: Information Gain for "Credit Score"
Credit = Good: {1โ , 2โ , 3โ } โ 3 Yes, 0 No โ H = 0
Credit = Bad: {4โ, 5โ} โ 0 Yes, 2 No โ H = 0
IG(S, Credit) = 0.971 - [(3/5)ยท0 + (2/5)ยท0]
= 0.971 - 0
= 0.971 bits โ PERFECT SPLIT!
Step 4: Information Gain for "Employed"
Employed = Yes: {1โ , 3โ , 4โ} โ 2 Yes, 1 No
Employed = No: {2โ , 5โ} โ 1 Yes, 1 No
H(Yes) = -(2/3)ยทlogโ(2/3) - (1/3)ยทlogโ(1/3) = 0.918
H(No) = -(1/2)ยทlogโ(1/2) - (1/2)ยทlogโ(1/2) = 1.000
IG(S, Employed) = 0.971 - [(3/5)ยท0.918 + (2/5)ยท1.000]
= 0.971 - [0.551 + 0.400]
= 0.971 - 0.951
= 0.020 bits
Step 5: Select Best Feature & Build Tree
Information Gain Summary:
Credit Score: 0.971 โ WINNER (highest)
Income: 0.420
Employed: 0.020
Result: Split on Credit Score at the root!
- Credit = Good โ Pure leaf: Approve = Yes
- Credit = Bad โ Pure leaf: Approve = No
Final Tree:
[Credit Score?]
/ \
Good Bad
| |
โ
Yes โ No
The tree perfectly classifies all 5 samples in just 1 split! This is because Credit Score has maximum Information Gain (0.971 = parent entropy), meaning it alone captures all the decision-making information.
Example 2: Gini Impurity Computation
# Parent Gini
Gini(S) = 1 - (3/5)ยฒ - (2/5)ยฒ
= 1 - 0.36 - 0.16
= 0.48
# Split on Credit Score
Gini(Good) = 1 - (3/3)ยฒ = 1 - 1 = 0
Gini(Bad) = 1 - (2/2)ยฒ = 1 - 1 = 0
Gini_split = (3/5)ยท0 + (2/5)ยท0 = 0
ฮGini = 0.48 - 0 = 0.48
# Split on Income
Gini(High) = 1 - (2/2)ยฒ = 0
Gini(Low) = 1 - (1/3)ยฒ - (2/3)ยฒ = 1 - 0.111 - 0.444 = 0.444
Gini_split = (2/5)ยท0 + (3/5)ยท0.444 = 0.267
ฮGini = 0.48 - 0.267 = 0.213
# Split on Employed
Gini(Yes) = 1 - (2/3)ยฒ - (1/3)ยฒ = 0.444
Gini(No) = 1 - (1/2)ยฒ - (1/2)ยฒ = 0.500
Gini_split = (3/5)ยท0.444 + (2/5)ยท0.500 = 0.467
ฮGini = 0.48 - 0.467 = 0.013
# Same winner: Credit Score (ฮGini = 0.48)
Example 3: Random Forest Feature Importance
Consider a Random Forest with 3 trees, each splitting on different features:
# Tree 1: Splits on Feature A (ฮGini=0.15), then Feature B (ฮGini=0.10)
# Tree 2: Splits on Feature B (ฮGini=0.20), then Feature C (ฮGini=0.08)
# Tree 3: Splits on Feature A (ฮGini=0.18), then Feature A (ฮGini=0.05)
Total ฮGini per feature:
Feature A: 0.15 + 0.18 + 0.05 = 0.38
Feature B: 0.10 + 0.20 = 0.30
Feature C: 0.08 = 0.08
Total: 0.76
Normalized Feature Importance:
Feature A: 0.38 / 0.76 = 0.500 (50%)
Feature B: 0.30 / 0.76 = 0.395 (39.5%)
Feature C: 0.08 / 0.76 = 0.105 (10.5%)
Feature A is the most important โ it contributes 50% of the total impurity reduction across all trees.
Notice how both entropy and Gini chose the same best split (Credit Score) in our example. This is typical โ they agree on the best split about 95% of the time. The difference matters more at deeper levels of the tree. In practice, the choice between entropy and Gini is far less important than proper pruning and hyperparameter tuning.
Visual Diagrams (ASCII)
Flowcharts (ASCII)
Python Implementation (From Scratch)
Decision Tree Classifier โ Complete Implementation
import numpy as np
from collections import Counter
class Node:
"""Represents a node in the decision tree."""
def __init__(self, feature=None, threshold=None,
left=None, right=None, *, value=None):
self.feature = feature # Index of feature to split on
self.threshold = threshold # Threshold for the split
self.left = left # Left child (feature <= threshold)
self.right = right # Right child (feature > threshold)
self.value = value # Prediction (for leaf nodes)
def is_leaf(self):
return self.value is not None
class DecisionTreeClassifier:
"""Decision Tree Classifier built from scratch.
Implements the CART algorithm with Gini impurity or Entropy.
Supports both classification and can be extended for regression.
Parameters:
max_depth (int): Maximum depth of the tree
min_samples_split (int): Minimum samples needed to split
criterion (str): 'gini' or 'entropy'
max_features (int): Max features to consider at each split
"""
def __init__(self, max_depth=100, min_samples_split=2,
criterion='gini', max_features=None):
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.criterion = criterion
self.max_features = max_features
self.root = None
self.n_classes_ = None
self.feature_importances_ = None
def fit(self, X, y):
"""Build the decision tree from training data."""
self.n_classes_ = len(set(y))
self.n_features_ = X.shape[1]
self.feature_importances_ = np.zeros(self.n_features_)
# Determine max_features to consider at each split
if self.max_features is None:
self.max_features_ = self.n_features_
else:
self.max_features_ = min(self.max_features, self.n_features_)
self.root = self._build_tree(X, y, depth=0)
# Normalize feature importances
total = self.feature_importances_.sum()
if total > 0:
self.feature_importances_ /= total
return self
def _build_tree(self, X, y, depth):
"""Recursively build the decision tree."""
n_samples, n_features = X.shape
n_classes = len(set(y))
# Stopping conditions
if (depth >= self.max_depth
or n_classes == 1
or n_samples < self.min_samples_split):
leaf_value = self._most_common_label(y)
return Node(value=leaf_value)
# Select random subset of features (for Random Forest support)
feature_indices = np.random.choice(
n_features, self.max_features_, replace=False
)
# Find the best split
best_feature, best_threshold = self._best_split(
X, y, feature_indices
)
if best_feature is None:
leaf_value = self._most_common_label(y)
return Node(value=leaf_value)
# Split the data
left_mask = X[:, best_feature] <= best_threshold
right_mask = ~left_mask
# Track feature importance (Gini importance)
n_total = len(y)
n_left = left_mask.sum()
n_right = right_mask.sum()
importance = (
self._impurity(y)
- (n_left / n_total) * self._impurity(y[left_mask])
- (n_right / n_total) * self._impurity(y[right_mask])
)
self.feature_importances_[best_feature] += importance * n_total
# Recursively build children
left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
right_child = self._build_tree(X[right_mask], y[right_mask], depth + 1)
return Node(best_feature, best_threshold, left_child, right_child)
def _best_split(self, X, y, feature_indices):
"""Find the best feature and threshold to split on."""
best_gain = -1
best_feature, best_threshold = None, None
parent_impurity = self._impurity(y)
n_total = len(y)
for feature_idx in feature_indices:
# Get sorted unique thresholds
thresholds = np.unique(X[:, feature_idx])
for threshold in thresholds:
left_mask = X[:, feature_idx] <= threshold
right_mask = ~left_mask
if left_mask.sum() == 0 or right_mask.sum() == 0:
continue
# Weighted impurity of children
n_left = left_mask.sum()
n_right = right_mask.sum()
child_impurity = (
(n_left / n_total) * self._impurity(y[left_mask])
+ (n_right / n_total) * self._impurity(y[right_mask])
)
gain = parent_impurity - child_impurity
if gain > best_gain:
best_gain = gain
best_feature = feature_idx
best_threshold = threshold
return best_feature, best_threshold
def _impurity(self, y):
"""Compute impurity (Gini or Entropy)."""
counts = np.bincount(y.astype(int))
probabilities = counts / len(y)
if self.criterion == 'gini':
return 1.0 - np.sum(probabilities ** 2)
elif self.criterion == 'entropy':
# Filter out zero probabilities to avoid log(0)
probabilities = probabilities[probabilities > 0]
return -np.sum(probabilities * np.log2(probabilities))
def _most_common_label(self, y):
"""Return the most common class label."""
counter = Counter(y)
return counter.most_common(1)[0][0]
def predict(self, X):
"""Predict class labels for samples in X."""
return np.array([self._traverse_tree(x, self.root) for x in X])
def _traverse_tree(self, x, node):
"""Traverse the tree to make a prediction for a single sample."""
if node.is_leaf():
return node.value
if x[node.feature] <= node.threshold:
return self._traverse_tree(x, node.left)
else:
return self._traverse_tree(x, node.right)
def score(self, X, y):
"""Return accuracy on the given data."""
predictions = self.predict(X)
return np.mean(predictions == y)
def print_tree(self, node=None, indent=""):
"""Print a text representation of the tree."""
if node is None:
node = self.root
if node.is_leaf():
print(f"{indent}Predict: {node.value}")
return
print(f"{indent}Feature[{node.feature}] <= {node.threshold:.4f}?")
print(f"{indent}โโโ Yes:")
self.print_tree(node.left, indent + "โ ")
print(f"{indent}โโโ No:")
self.print_tree(node.right, indent + " ")
# ========== TESTING THE DECISION TREE ==========
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate synthetic data
X, y = make_classification(
n_samples=200, n_features=4, n_informative=3,
n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train our Decision Tree
dt = DecisionTreeClassifier(max_depth=5, criterion='gini')
dt.fit(X_train, y_train)
print(f"Training Accuracy: {dt.score(X_train, y_train):.4f}")
print(f"Testing Accuracy: {dt.score(X_test, y_test):.4f}")
print(f"\nFeature Importances: {dt.feature_importances_}")
print(f"\nTree Structure:")
dt.print_tree()
Random Forest Classifier โ Complete Implementation
class RandomForestClassifier:
"""Random Forest Classifier built from scratch.
Uses bootstrap aggregating (bagging) with feature randomness.
Each tree is a DecisionTreeClassifier (from above).
Parameters:
n_estimators (int): Number of trees in the forest
max_depth (int): Maximum depth of each tree
min_samples_split (int): Minimum samples to split a node
max_features (str or int): 'sqrt', 'log2', or integer
criterion (str): 'gini' or 'entropy'
oob_score (bool): Whether to compute OOB score
"""
def __init__(self, n_estimators=100, max_depth=None,
min_samples_split=2, max_features='sqrt',
criterion='gini', oob_score=False,
random_state=None):
self.n_estimators = n_estimators
self.max_depth = max_depth if max_depth else 100
self.min_samples_split = min_samples_split
self.max_features = max_features
self.criterion = criterion
self.oob_score = oob_score
self.random_state = random_state
self.trees = []
self.oob_score_ = None
self.feature_importances_ = None
def _get_max_features(self, n_features):
"""Determine the number of features to consider."""
if self.max_features == 'sqrt':
return int(np.sqrt(n_features))
elif self.max_features == 'log2':
return int(np.log2(n_features))
elif isinstance(self.max_features, int):
return min(self.max_features, n_features)
else:
return n_features
def fit(self, X, y):
"""Build the random forest from training data."""
if self.random_state:
np.random.seed(self.random_state)
n_samples, n_features = X.shape
max_feat = self._get_max_features(n_features)
self.trees = []
oob_predictions = {} # {sample_idx: [predictions]}
for i in range(self.n_estimators):
# Bootstrap sample
bootstrap_indices = np.random.choice(
n_samples, size=n_samples, replace=True
)
oob_indices = list(
set(range(n_samples)) - set(bootstrap_indices)
)
X_bootstrap = X[bootstrap_indices]
y_bootstrap = y[bootstrap_indices]
# Build tree with feature randomness
tree = DecisionTreeClassifier(
max_depth=self.max_depth,
min_samples_split=self.min_samples_split,
criterion=self.criterion,
max_features=max_feat
)
tree.fit(X_bootstrap, y_bootstrap)
self.trees.append(tree)
# Collect OOB predictions
if self.oob_score and len(oob_indices) > 0:
oob_preds = tree.predict(X[oob_indices])
for idx, pred in zip(oob_indices, oob_preds):
if idx not in oob_predictions:
oob_predictions[idx] = []
oob_predictions[idx].append(pred)
# Compute OOB score
if self.oob_score:
correct = 0
total = 0
for idx, preds in oob_predictions.items():
majority = Counter(preds).most_common(1)[0][0]
if majority == y[idx]:
correct += 1
total += 1
self.oob_score_ = correct / total if total > 0 else 0
# Aggregate feature importances
self.feature_importances_ = np.mean(
[tree.feature_importances_ for tree in self.trees],
axis=0
)
return self
def predict(self, X):
"""Predict class labels using majority vote."""
# Get predictions from all trees
all_predictions = np.array([
tree.predict(X) for tree in self.trees
]) # Shape: (n_estimators, n_samples)
# Majority vote for each sample
majority_votes = []
for i in range(X.shape[0]):
votes = all_predictions[:, i]
majority = Counter(votes).most_common(1)[0][0]
majority_votes.append(majority)
return np.array(majority_votes)
def predict_proba(self, X):
"""Predict class probabilities (vote proportions)."""
all_predictions = np.array([
tree.predict(X) for tree in self.trees
])
probas = []
for i in range(X.shape[0]):
votes = all_predictions[:, i]
counts = Counter(votes)
prob = [counts.get(c, 0) / self.n_estimators
for c in range(self.n_classes_)]
probas.append(prob)
return np.array(probas)
def score(self, X, y):
"""Return accuracy on the given data."""
return np.mean(self.predict(X) == y)
# ========== TEST RANDOM FOREST ==========
rf = RandomForestClassifier(
n_estimators=50, max_depth=10,
max_features='sqrt', oob_score=True,
random_state=42
)
rf.fit(X_train, y_train)
print(f"Random Forest Results:")
print(f"Training Accuracy: {rf.score(X_train, y_train):.4f}")
print(f"Testing Accuracy: {rf.score(X_test, y_test):.4f}")
print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"Feature Importances: {rf.feature_importances_}")
Regression Tree โ From Scratch
class DecisionTreeRegressor:
"""Decision Tree Regressor using variance reduction."""
def __init__(self, max_depth=10, min_samples_split=2):
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.root = None
def fit(self, X, y):
self.root = self._build_tree(X, y, depth=0)
return self
def _build_tree(self, X, y, depth):
n_samples = X.shape[0]
# Stopping conditions
if depth >= self.max_depth or n_samples < self.min_samples_split:
return Node(value=np.mean(y))
# If variance is very small, stop
if np.var(y) < 1e-7:
return Node(value=np.mean(y))
best_feature, best_threshold = self._best_split(X, y)
if best_feature is None:
return Node(value=np.mean(y))
left_mask = X[:, best_feature] <= best_threshold
right_mask = ~left_mask
left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
right_child = self._build_tree(X[right_mask], y[right_mask], depth + 1)
return Node(best_feature, best_threshold, left_child, right_child)
def _best_split(self, X, y):
"""Find best split by maximizing variance reduction."""
best_var_reduction = -1
best_feature, best_threshold = None, None
parent_var = np.var(y)
n = len(y)
for feature_idx in range(X.shape[1]):
thresholds = np.unique(X[:, feature_idx])
for t in thresholds:
left = y[X[:, feature_idx] <= t]
right = y[X[:, feature_idx] > t]
if len(left) == 0 or len(right) == 0:
continue
# Weighted variance of children
weighted_var = (
(len(left) / n) * np.var(left)
+ (len(right) / n) * np.var(right)
)
var_reduction = parent_var - weighted_var
if var_reduction > best_var_reduction:
best_var_reduction = var_reduction
best_feature = feature_idx
best_threshold = t
return best_feature, best_threshold
def predict(self, X):
return np.array([self._traverse(x, self.root) for x in X])
def _traverse(self, x, node):
if node.is_leaf():
return node.value
if x[node.feature] <= node.threshold:
return self._traverse(x, node.left)
return self._traverse(x, node.right)
# Test Regression Tree
from sklearn.datasets import make_regression
X_reg, y_reg = make_regression(
n_samples=200, n_features=3, noise=10, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_reg, test_size=0.2)
dtr = DecisionTreeRegressor(max_depth=6)
dtr.fit(X_tr, y_tr)
preds = dtr.predict(X_te)
mse = np.mean((preds - y_te) ** 2)
print(f"Regression Tree MSE: {mse:.2f}")
Extend the DecisionTreeClassifier to support ID3-style multi-way splits on categorical features. Hint: instead of a single threshold, store the set of values for each branch, and create a list of children (not just left/right).
TensorFlow Implementation
TensorFlow Decision Forests (TF-DF) is Google's library for training decision tree models within the TensorFlow ecosystem. It supports Random Forests, Gradient Boosted Trees, and CART.
# ========== TensorFlow Decision Forests ==========
# Install: pip install tensorflow_decision_forests
import tensorflow_decision_forests as tfdf
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# ---- Step 1: Load and Prepare Data ----
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
# Split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# Convert to TensorFlow Datasets
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
train_df, label="species", task=tfdf.keras.Task.CLASSIFICATION
)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
test_df, label="species", task=tfdf.keras.Task.CLASSIFICATION
)
# ---- Step 2: Train a Random Forest ----
rf_model = tfdf.keras.RandomForestModel(
num_trees=300,
max_depth=16,
min_examples=5,
task=tfdf.keras.Task.CLASSIFICATION,
verbose=0
)
rf_model.fit(train_ds)
# ---- Step 3: Evaluate ----
rf_model.compile(metrics=["accuracy"])
evaluation = rf_model.evaluate(test_ds, return_dict=True)
print(f"TF-DF Random Forest Accuracy: {evaluation['accuracy']:.4f}")
# ---- Step 4: Feature Importance ----
inspector = rf_model.make_inspector()
print("\nFeature Importances (NUM_AS_ROOT):")
for fi in inspector.variable_importances()["NUM_AS_ROOT"]:
print(f" {fi[0].name}: {fi[1]:.4f}")
# ---- Step 5: Train a CART Decision Tree ----
cart_model = tfdf.keras.CartModel(
max_depth=6,
task=tfdf.keras.Task.CLASSIFICATION,
verbose=0
)
cart_model.fit(train_ds)
cart_model.compile(metrics=["accuracy"])
cart_eval = cart_model.evaluate(test_ds, return_dict=True)
print(f"\nTF-DF CART Accuracy: {cart_eval['accuracy']:.4f}")
# ---- Step 6: Gradient Boosted Trees ----
gbt_model = tfdf.keras.GradientBoostedTreesModel(
num_trees=300,
max_depth=6,
learning_rate=0.1,
task=tfdf.keras.Task.CLASSIFICATION,
verbose=0
)
gbt_model.fit(train_ds)
gbt_model.compile(metrics=["accuracy"])
gbt_eval = gbt_model.evaluate(test_ds, return_dict=True)
print(f"TF-DF GBT Accuracy: {gbt_eval['accuracy']:.4f}")
# ---- Step 7: Save and Load Model ----
rf_model.save("saved_rf_model")
loaded_model = tf.keras.models.load_model("saved_rf_model")
print("\nModel saved and loaded successfully!")
TensorFlow Decision Forests is unique because it doesn't use gradient descent โ trees are trained using the classical algorithms (ID3/CART). TF-DF wraps them in the Keras API for convenience. It's especially useful when you want to combine tree models with neural networks in a single pipeline, or deploy to TensorFlow Serving for production.
Scikit-Learn Implementation
# ========== Complete Scikit-Learn Pipeline ==========
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import (
DecisionTreeClassifier, DecisionTreeRegressor,
plot_tree, export_text
)
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import (
train_test_split, cross_val_score, GridSearchCV
)
from sklearn.metrics import (
accuracy_score, classification_report,
confusion_matrix, roc_auc_score
)
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.inspection import permutation_importance
# ---- 1. Decision Tree on Iris Dataset ----
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
# Train with Gini (default)
dt_gini = DecisionTreeClassifier(
criterion='gini', max_depth=4,
min_samples_split=5, random_state=42
)
dt_gini.fit(X_train, y_train)
# Train with Entropy
dt_entropy = DecisionTreeClassifier(
criterion='entropy', max_depth=4,
min_samples_split=5, random_state=42
)
dt_entropy.fit(X_train, y_train)
print("=== Decision Tree Results ===")
print(f"Gini Accuracy: {dt_gini.score(X_test, y_test):.4f}")
print(f"Entropy Accuracy: {dt_entropy.score(X_test, y_test):.4f}")
# Print tree rules as text
print("\nTree Rules (Gini):")
print(export_text(dt_gini, feature_names=iris.feature_names))
# Feature importances
print("Feature Importances (Gini-based):")
for name, importance in zip(iris.feature_names, dt_gini.feature_importances_):
print(f" {name}: {importance:.4f}")
# ---- 2. Visualize the Tree ----
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(
dt_gini, feature_names=iris.feature_names,
class_names=iris.target_names, filled=True,
rounded=True, fontsize=10, ax=ax
)
plt.title("Decision Tree - Iris Dataset")
plt.tight_layout()
plt.savefig("decision_tree_iris.png", dpi=150)
plt.show()
# ---- 3. Cost-Complexity Pruning ----
# Get the pruning path
path = dt_gini.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
impurities = path.impurities
# Train trees for each alpha
trees = []
for alpha in ccp_alphas:
tree = DecisionTreeClassifier(
ccp_alpha=alpha, random_state=42
)
tree.fit(X_train, y_train)
trees.append(tree)
# Plot accuracy vs alpha
train_scores = [t.score(X_train, y_train) for t in trees]
test_scores = [t.score(X_test, y_test) for t in trees]
fig, ax = plt.subplots()
ax.plot(ccp_alphas, train_scores, 'o-', label="Train")
ax.plot(ccp_alphas, test_scores, 'o-', label="Test")
ax.set_xlabel("Cost-Complexity Alpha")
ax.set_ylabel("Accuracy")
ax.set_title("Cost-Complexity Pruning")
ax.legend()
plt.show()
# ---- 4. Random Forest Pipeline ----
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(
n_estimators=200,
max_depth=10,
min_samples_split=5,
max_features='sqrt',
oob_score=True,
random_state=42,
n_jobs=-1 # Use all CPU cores
)
rf.fit(X_train, y_train)
print("\n=== Random Forest Results ===")
print(f"Training Accuracy: {rf.score(X_train, y_train):.4f}")
print(f"Testing Accuracy: {rf.score(X_test, y_test):.4f}")
print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"\n{classification_report(y_test, rf.predict(X_test))}")
# ---- 5. Permutation Importance ----
perm_importance = permutation_importance(
rf, X_test, y_test, n_repeats=30, random_state=42
)
# Sort by importance
sorted_idx = perm_importance.importances_mean.argsort()[::-1]
print("Top 10 Features (Permutation Importance):")
for idx in sorted_idx[:10]:
print(f" {cancer.feature_names[idx]}: "
f"{perm_importance.importances_mean[idx]:.4f} "
f"ยฑ {perm_importance.importances_std[idx]:.4f}")
# ---- 6. Hyperparameter Tuning with GridSearchCV ----
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10],
'max_features': ['sqrt', 'log2']
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42, oob_score=True),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")
print(f"Test Score: {grid_search.score(X_test, y_test):.4f}")
Data Scientists at companies like Flipkart, Swiggy, and Razorpay use scikit-learn's Random Forest as a baseline for every tabular ML problem. Knowing how to tune hyperparameters, interpret feature importances, and compare with gradient boosting (XGBoost) is essential for cracking data science interviews at Indian tech companies.
Indian Case Studies
๐ฆ Case Study 1: SBI Credit Scoring with Decision Trees
Organization: State Bank of India (SBI) โ India's largest bank with 450M+ customers
Problem: SBI needed to automate credit decisions for personal loans (โน50,000 to โน20,00,000). The RBI's Fair Practices Code requires banks to provide reasons for loan rejections โ making black-box models like neural networks inappropriate.
Solution: Interpretable Decision Tree + Random Forest Ensemble
- Feature Engineering: 45+ features including income, employment tenure, CIBIL score, existing EMIs, age, city tier, account history, and digital footprint.
- Model 1 (Explainability): CART Decision Tree (max_depth=6) for generating human-readable rejection reasons. "Your loan was rejected because: CIBIL score < 650 AND monthly income < โน25,000 AND existing EMI ratio > 50%."
- Model 2 (Accuracy): Random Forest (500 trees) for the actual decision, achieving 89% accuracy and 0.94 AUC-ROC.
- Rule Extraction: The tree's paths were converted into business rules for the core banking system.
Implementation Details
# Simplified SBI credit scoring pipeline
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
# Sample features (anonymized)
features = [
'cibil_score', 'monthly_income', 'employment_years',
'existing_emi_ratio', 'age', 'city_tier',
'savings_balance', 'loan_amount_requested',
'num_credit_cards', 'previous_defaults'
]
# Random Forest for prediction
rf_credit = RandomForestClassifier(
n_estimators=500, max_depth=12,
min_samples_leaf=50, # Prevent overfitting on rare cases
class_weight='balanced', # Handle class imbalance
oob_score=True, random_state=42
)
# Decision Tree for explainability
dt_explain = DecisionTreeClassifier(
max_depth=6, # Shallow for interpretability
min_samples_leaf=100,
random_state=42
)
# After training, extract rejection rules:
# rules = export_text(dt_explain, feature_names=features)
print("Credit Scoring Model Ready for SBI deployment")
Impact: Processing time reduced from 7 days (manual) to 30 seconds (automated). Default rate decreased by 23%. RBI compliance maintained with interpretable rejection reasons.
๐ Case Study 2: IRCTC Booking Demand Prediction with Random Forest
Organization: Indian Railway Catering and Tourism Corporation (IRCTC)
Problem: IRCTC handles 12 lakh+ ticket bookings daily. During festivals (Diwali, Holi, Chhath Puja), demand spikes 3-5x. They needed to predict demand 30-90 days ahead for dynamic pricing, extra coach allocation, and server capacity planning.
Solution: Random Forest Demand Forecaster
- Features: Route, day of week, month, festival proximity (days to Diwali/Holi/Eid), historical bookings (lag features), weather, class (1AC/2AC/3AC/SL), tatkal ratio, cancellation rate.
- Target: Number of bookings in the next 7 days per route-class combination.
- Model: Random Forest Regressor with 300 trees, max_depth=15.
- Results: MAPE of 12.3% (vs. 28% for simple historical average), enabling accurate extra coach allocation.
# IRCTC Demand Prediction Pipeline (Simplified)
from sklearn.ensemble import RandomForestRegressor
# Feature engineering for railway demand
def create_irctc_features(df):
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
# Festival proximity features
festivals = {'diwali': '2025-10-20', 'holi': '2025-03-14'}
for name, date in festivals.items():
df[f'days_to_{name}'] = (
pd.to_datetime(date) - df['date']
).dt.days.clip(lower=0)
# Lag features
df['bookings_7d_ago'] = df['bookings'].shift(7)
df['bookings_30d_avg'] = df['bookings'].rolling(30).mean()
return df
rf_irctc = RandomForestRegressor(
n_estimators=300, max_depth=15,
min_samples_leaf=10, n_jobs=-1
)
print("IRCTC Demand Model: Ready for training")
Impact: โน200 crore annual savings from optimized coach allocation. 15% improvement in dynamic pricing revenue. Reduced passenger complaints about unavailable seats by 30%.
India's financial sector is a massive adopter of decision tree models. CIBIL (TransUnion), Paytm's credit scoring, and HDFC's home loan approval systems all use ensemble tree models. The RBI's emphasis on explainable AI in the 2023 AI governance framework makes interpretable models like decision trees crucial for regulatory compliance in Indian banking.
Global Case Studies
๐ฎ Case Study 3: Microsoft Kinect โ Random Forests for Body Pose Estimation
Organization: Microsoft Research, Cambridge (2011)
Problem: The Xbox Kinect needed to estimate 3D body joint positions (20 joints: head, shoulders, elbows, wrists, hips, knees, ankles) from a single depth image in real-time (30 fps). No markers, no special clothing โ just a depth camera.
Solution: Per-Pixel Classification with Randomized Decision Forests
- Approach: Classify each pixel in the depth image into one of 31 body parts using an extremely efficient random forest.
- Training Data: Generated 1 million synthetic depth images using motion-captured poses rendered on diverse body shapes โ a massive synthetic dataset.
- Features: Simple depth comparison features: f(x) = d(x + ฮดโ/d(x)) - d(x + ฮดโ/d(x)), comparing depth values at offset positions. These are extremely fast to compute.
- Model: 3 random forests, each with 20 trees, max depth = 20. Trained on 1M images, 2000 pixels per image.
- Speed: Classification of all pixels in a 640ร480 depth image in < 5ms on Xbox 360 hardware.
Impact: Sold 35 million Kinect units. Enabled natural user interfaces without controllers. The paper "Real-Time Human Pose Recognition in Parts from Single Depth Images" (Shotton et al., 2011) is one of the most cited papers in computer vision. It demonstrated that random forests could compete with deep learning for real-time applications.
๐ Case Study 4: Kaggle Competitions โ The Dominance of Tree-Based Models
Context: Kaggle is the world's largest data science competition platform.
Finding: Between 2015 and 2023, tree-based ensemble methods (Random Forest, XGBoost, LightGBM, CatBoost) won approximately 70%+ of tabular data competitions on Kaggle.
Notable Wins
- Higgs Boson Challenge (2014): XGBoost's breakthrough โ first major competition won by gradient boosted trees.
- Porto Seguro's Safe Driver Prediction (2017): Top solutions used LightGBM ensembles.
- Home Credit Default Risk (2018): LightGBM-based solutions dominated, achieving AUC > 0.80.
- Google Analytics Customer Revenue (2019): Random Forest + feature engineering outperformed deep learning approaches.
Key Insight: For structured/tabular data, tree-based models consistently outperform neural networks. Deep learning wins on images, text, and audio, but trees remain king for tabular datasets. This was confirmed by the benchmark paper "Why do tree-based models still outperform deep learning on tabular data?" (Grinsztajn et al., 2022).
Netflix uses gradient boosted trees (a cousin of random forests) for content recommendation ranking. Amazon uses random forests for fraud detection. Google uses decision trees in TF-Ranking for search result ordering. These are not toy applications โ tree-based models power billion-dollar decisions daily.
Startup Applications
๐ฅ HealthTech: Disease Risk Scoring
Indian startup Niramai uses tree-based models for breast cancer risk assessment. Decision trees analyze thermal imaging features, blood markers, and patient history. The interpretable output helps doctors explain risk factors to patients.
๐พ AgriTech: Crop Yield Prediction
CropIn (Bangalore) uses Random Forests to predict crop yields from satellite imagery, weather data, soil type, and irrigation patterns. Features like NDVI, rainfall, and temperature are naturally suited for tree-based models with mixed feature types.
๐ณ FinTech: Fraud Detection
Razorpay uses real-time Random Forest models to flag suspicious transactions. Features include transaction amount, merchant category, time of day, device fingerprint, and historical patterns. Random Forests handle the class imbalance (99.8% legitimate vs 0.2% fraud) well with class weights.
๐ InsurTech: Dynamic Pricing
Acko (Mumbai) uses gradient boosted trees for car insurance pricing. Features include vehicle age, driver age, city, claim history, driving score (from telematics), and vehicle model. Tree-based models naturally capture non-linear interactions between features.
Government Applications
๐๏ธ Tax Fraud Detection (Income Tax Department)
India's Central Board of Direct Taxes (CBDT) uses decision tree models under Project Insight to flag suspicious tax returns. Features include income-expense ratios, investment patterns, property transactions, and cross-referencing with PAN-Aadhaar linked data. The interpretability requirement ensures tax officers can explain why a return was flagged for scrutiny.
๐ Environmental Monitoring (ISRO)
ISRO uses Random Forest classifiers for land use / land cover (LULC) classification from satellite imagery. Spectral bands from IRS satellites serve as features. Random Forests handle multi-class classification (forest, water, urban, agriculture, barren) with 92%+ accuracy.
๐ฅ Public Health (ICMR)
During COVID-19, the Indian Council of Medical Research used decision tree models to triage patients: based on age, oxygen saturation, comorbidities, and symptoms, patients were classified into mild/moderate/severe categories for hospital bed allocation.
๐ Crime Pattern Analysis (State Police)
Several Indian state police departments use Random Forests to predict crime hotspots based on historical crime data, time of day, population density, socioeconomic indicators, and seasonal patterns. This supports preventive patrolling strategies.
Industry Applications
| Industry | Application | Model | Key Features |
|---|---|---|---|
| Banking | Credit scoring, loan approval | CART + RF | CIBIL score, income, employment, DTI ratio |
| Healthcare | Disease diagnosis, drug response | Random Forest | Lab values, symptoms, genetics, age |
| E-commerce | Customer churn, product recommendation | XGBoost/RF | Purchase history, RFM, browsing patterns |
| Manufacturing | Quality control, predictive maintenance | Random Forest | Sensor data, temperature, vibration, pressure |
| Telecom | Network failure prediction | Gradient Boosted Trees | Signal strength, usage, complaints, hardware age |
| Insurance | Claim fraud detection | Random Forest | Claim amount, history, timing, documentation |
| Retail | Inventory demand forecasting | RF Regressor | Sales history, promotions, season, price |
| Energy | Solar/wind power output prediction | RF Regressor | Weather, cloud cover, time, historical output |
The dominance of tree-based models in industry comes down to three factors: (1) minimal preprocessing โ trees handle missing values, mixed types, and outliers naturally; (2) interpretability โ regulators and business stakeholders demand explanations; (3) robust performance โ Random Forests work well with default hyperparameters and rarely catastrophically fail. The old saying holds: "Start with trees, beat them with boosting, explain them with SHAP."
Mini Projects
๐ฆ Mini Project 1: Loan Default Predictor
Objective: Build a complete loan default prediction system using Decision Trees and Random Forests.
Dataset
Use the German Credit Dataset (1000 samples, 20 features) from UCI ML Repository, or generate synthetic data:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
classification_report, confusion_matrix,
roc_auc_score, roc_curve
)
from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt
# ---- Generate Synthetic Loan Data ----
np.random.seed(42)
n = 2000
data = pd.DataFrame({
'age': np.random.randint(21, 65, n),
'income': np.random.lognormal(12.5, 0.6, n).astype(int),
'credit_score': np.random.randint(300, 850, n),
'employment_years': np.random.randint(0, 30, n),
'loan_amount': np.random.randint(50000, 2000000, n),
'num_existing_loans': np.random.poisson(1.5, n),
'monthly_emi': np.random.randint(0, 50000, n),
'savings': np.random.lognormal(11, 1, n).astype(int),
'has_property': np.random.binomial(1, 0.4, n),
'previous_defaults': np.random.binomial(1, 0.1, n),
})
# Create target (default) based on realistic rules
default_prob = (
0.3 * (data['credit_score'] < 550).astype(float)
+ 0.2 * (data['income'] < 300000).astype(float)
+ 0.25 * data['previous_defaults']
+ 0.15 * (data['num_existing_loans'] > 3).astype(float)
+ 0.1 * (data['monthly_emi'] / data['income'] > 0.5).astype(float)
+ np.random.normal(0, 0.1, n)
)
data['default'] = (default_prob > 0.4).astype(int)
print(f"Default rate: {data['default'].mean():.2%}")
# ---- Split Data ----
features = data.columns.drop('default')
X = data[features].values
y = data['default'].values
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# ---- Model 1: Decision Tree ----
dt = DecisionTreeClassifier(
max_depth=5, min_samples_leaf=20,
criterion='gini', random_state=42
)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
print("\n=== Decision Tree ===")
print(classification_report(y_test, dt_pred,
target_names=['No Default', 'Default']))
print(f"AUC-ROC: {roc_auc_score(y_test, dt.predict_proba(X_test)[:,1]):.4f}")
# Print rules
print("\nDecision Rules:")
print(export_text(dt, feature_names=list(features), max_depth=3))
# ---- Model 2: Random Forest ----
rf = RandomForestClassifier(
n_estimators=200, max_depth=10,
min_samples_leaf=10, max_features='sqrt',
oob_score=True, class_weight='balanced',
random_state=42, n_jobs=-1
)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print("\n=== Random Forest ===")
print(classification_report(y_test, rf_pred,
target_names=['No Default', 'Default']))
print(f"AUC-ROC: {roc_auc_score(y_test, rf.predict_proba(X_test)[:,1]):.4f}")
print(f"OOB Score: {rf.oob_score_:.4f}")
# ---- Feature Importance Analysis ----
print("\nTop Features (Gini Importance):")
importances = pd.Series(rf.feature_importances_, index=features)
for feat, imp in importances.sort_values(ascending=False).items():
print(f" {feat}: {imp:.4f}")
# Cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5, scoring='roc_auc')
print(f"\n5-Fold CV AUC: {cv_scores.mean():.4f} ยฑ {cv_scores.std():.4f}")
Deliverables
- Comparison table: Decision Tree vs Random Forest (accuracy, AUC, F1)
- Visualization of the decision tree (max 5 levels)
- Feature importance bar chart
- Business rules extracted from the decision tree
- Report on which customers are most likely to default
๐ Mini Project 2: Customer Churn Analyzer
Objective: Build a churn prediction system for a telecom company and identify key retention drivers.
# ========== Customer Churn Analysis ==========
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
train_test_split, cross_val_score, StratifiedKFold
)
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.inspection import permutation_importance
# Generate synthetic telecom churn data
np.random.seed(42)
n = 5000
churn_data = pd.DataFrame({
'tenure_months': np.random.randint(1, 72, n),
'monthly_charges': np.random.uniform(199, 2999, n),
'total_charges': np.random.uniform(500, 150000, n),
'num_complaints': np.random.poisson(1.2, n),
'data_usage_gb': np.random.lognormal(2, 0.8, n),
'call_minutes': np.random.lognormal(5, 0.5, n),
'contract_type': np.random.choice([0, 1, 2], n, p=[0.5, 0.3, 0.2]),
'customer_service_calls': np.random.poisson(1.5, n),
'has_international_plan': np.random.binomial(1, 0.1, n),
'network_quality_score': np.random.uniform(1, 5, n),
})
# Realistic churn logic
churn_score = (
0.3 * (churn_data['num_complaints'] > 2).astype(float)
+ 0.2 * (churn_data['tenure_months'] < 12).astype(float)
+ 0.15 * (churn_data['contract_type'] == 0).astype(float)
+ 0.2 * (churn_data['network_quality_score'] < 2.5).astype(float)
+ 0.15 * (churn_data['customer_service_calls'] > 3).astype(float)
+ np.random.normal(0, 0.15, n)
)
churn_data['churned'] = (churn_score > 0.45).astype(int)
print(f"Churn rate: {churn_data['churned'].mean():.2%}")
# Build and evaluate model
features = churn_data.columns.drop('churned')
X = churn_data[features].values
y = churn_data['churned'].values
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Random Forest with class balancing
rf_churn = RandomForestClassifier(
n_estimators=300, max_depth=12,
class_weight='balanced_subsample',
oob_score=True, random_state=42, n_jobs=-1
)
rf_churn.fit(X_train, y_train)
print("\n=== Churn Prediction Results ===")
print(f"Test Accuracy: {rf_churn.score(X_test, y_test):.4f}")
print(f"OOB Score: {rf_churn.oob_score_:.4f}")
print(f"AUC-ROC: {roc_auc_score(y_test, rf_churn.predict_proba(X_test)[:,1]):.4f}")
# Actionable insights: Which features drive churn?
perm_imp = permutation_importance(rf_churn, X_test, y_test, n_repeats=20)
print("\nChurn Drivers (Permutation Importance):")
for idx in perm_imp.importances_mean.argsort()[::-1]:
print(f" {features[idx]}: {perm_imp.importances_mean[idx]:.4f}")
Deliverables
- Churn prediction model with 85%+ AUC-ROC
- Top 5 churn drivers with business interpretation
- Customer segmentation: High/Medium/Low churn risk
- Retention strategies based on decision tree rules
- Cost-benefit analysis: cost of retention offers vs. revenue from prevented churn
End-of-Chapter Exercises
compute_entropy(labels) that takes a list of class labels and returns the entropy in bits.compute_gini(labels) that takes a list of class labels and returns the Gini impurity.| Outlook | Temp | Humidity | Play? |
|---|---|---|---|
| Sunny | Hot | High | No |
| Sunny | Mild | Normal | Yes |
| Overcast | Hot | High | Yes |
| Rain | Mild | High | Yes |
| Rain | Cool | Normal | No |
max_depth, min_samples_split, min_samples_leaf, max_features.n_estimators in a Random Forest never causes overfitting (in theory). Does this hold in practice?Multiple Choice Questions
Interview Questions
def information_gain(parent, left, right):
def entropy(labels):
from collections import Counter
n = len(labels)
counts = Counter(labels)
return -sum((c/n) * np.log2(c/n)
for c in counts.values() if c > 0)
n = len(parent)
H_parent = entropy(parent)
H_children = (len(left)/n * entropy(left)
+ len(right)/n * entropy(right))
return H_parent - H_children
Research Problems
๐ฌ Research Problem 1: Fairness-Aware Decision Trees
Background: Decision trees can learn biased splits that discriminate based on protected attributes (gender, caste, religion). Standard pruning doesn't account for fairness.
Research Question: Can we modify the CART splitting criterion to include a fairness constraint (e.g., demographic parity or equalized odds) without significantly sacrificing accuracy?
Approach: Define a modified splitting criterion: ฮImpurity_fair = ฮGini - ฮป ยท ฮFairness, where ฮFairness measures the increase in demographic disparity after the split. Investigate the accuracy-fairness Pareto frontier for different ฮป values on credit scoring datasets.
References: Kamiran et al. (2010) "Discrimination aware decision tree learning"; Aghaei et al. (2019) "Learning optimal fair decision trees".
๐ฌ Research Problem 2: Optimal Random Forest Calibration
Background: Random Forests are known to produce poorly calibrated probability estimates โ they tend to push probabilities toward 0.5 (underconfident). This is problematic in applications requiring accurate probability estimates (medical diagnosis, insurance pricing).
Research Question: What is the theoretical reason for RF miscalibration, and can we develop a calibration method specifically designed for ensemble tree models that outperforms Platt scaling and isotonic regression?
References: Niculescu-Mizil & Caruana (2005) "Predicting good probabilities with supervised learning"; Bostrรถm (2008) "Calibrating Random Forests".
๐ฌ Research Problem 3: Streaming Random Forests
Background: Standard Random Forests require all data in memory. In streaming applications (real-time sensor data, social media feeds), data arrives continuously and distributions change over time (concept drift).
Research Question: Design an incremental Random Forest algorithm that (a) updates trees without full retraining, (b) detects and adapts to concept drift, (c) maintains bounded memory, and (d) has provable convergence guarantees.
References: Abdulsalam et al. (2007) "Streaming Random Forests"; Gomes et al. (2017) "Adaptive Random Forests for data streams".
๐ฌ Research Problem 4: Neural-Tree Hybrid Architectures
Background: Recent work (TabNet, NODE, DNDT) attempts to combine the representational power of neural networks with the interpretability and inductive bias of decision trees.
Research Question: Can differentiable decision trees (soft routing) achieve the generalization properties of hard decision trees while being trainable end-to-end with gradient descent? Investigate the conditions under which soft trees converge to hard trees.
References: Kontschieder et al. (2015) "Deep Neural Decision Forests"; Hazimeh et al. (2020) "The Tree Ensemble Layer".
Key Takeaways
References & Further Reading
Foundational Papers
- Shannon, C.E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal, 27, 379-423.
- Quinlan, J.R. (1986). "Induction of Decision Trees." Machine Learning, 1(1), 81-106. [ID3 paper]
- Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Wadsworth. [CART book]
- Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann. [C4.5 book]
- Breiman, L. (1996). "Bagging Predictors." Machine Learning, 24(2), 123-140.
- Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5-32.
Advanced References
- Shotton, J., et al. (2011). "Real-Time Human Pose Recognition in Parts from Single Depth Images." CVPR. [Kinect paper]
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapters 9-10.
- Grinsztajn, L., et al. (2022). "Why do tree-based models still outperform deep learning on tabular data?" NeurIPS.
- Louppe, G. (2014). "Understanding Random Forests: From Theory to Practice." PhD Thesis, University of Liรจge.
- Chen, T. & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." KDD.
- Breiman, L. (2001). "Statistical Modeling: The Two Cultures." Statistical Science, 16(3), 199-231.
Indian References
- RBI (2023). "Framework for Responsible and Ethical Use of AI/ML in Financial Services." Reserve Bank of India.
- Sharma, V. & Kumar, R. (2020). "Credit Scoring in Indian Banking: A Comparative Study." Indian Journal of Finance.
- ISRO (2021). "Land Use Land Cover Classification using Machine Learning." NRSC Technical Report.
Online Resources
- scikit-learn Documentation: Decision Trees
- scikit-learn Documentation: Random Forests
- TensorFlow Decision Forests: tf-df
- Visual Introduction to Machine Learning: R2D3
- Distill.pub: "Feature Visualization" โ related to understanding tree-based feature importance
Chapter 13 covered the foundation: individual trees and random forests. In Chapter 14: Gradient Boosting & XGBoost, we'll see how building trees sequentially (each correcting the previous one's errors) leads to even more powerful models. Gradient boosting is to random forests what focused practice is to random exploration โ both are valuable, but boosting optimizes more deliberately. Master both, and you'll have the tools to win any tabular data competition.