Chapter 13: Decision Trees & Random Forests

Learning Objectives

By the end of this chapter, you will be able to:

Entropy & Information Gain

Derive entropy from first principles and compute information gain for any dataset split.

Splitting Criteria Mastery

Compare and apply Information Gain, Gain Ratio, Gini Impurity, and Variance Reduction.

ID3, C4.5 & CART Algorithms

Implement and differentiate between the three fundamental tree-building algorithms.

Tree Construction & Pruning

Build trees recursively, apply pre-pruning constraints, and perform cost-complexity post-pruning.

Regression Trees

Extend classification trees to predict continuous values using variance reduction.

Random Forests & Bagging

Understand bootstrap aggregation, feature randomness, and why ensembles reduce variance.

OOB Error & Feature Importance

Estimate out-of-bag error and compute Gini importance and permutation importance.

Full Implementation

Code Decision Trees and Random Forests from scratch in Python and with scikit-learn/TensorFlow.

🎓 Exam Tip

Decision Trees appear in virtually every ML interview and exam. You must be able to build a tree by hand, compute entropy and Gini impurity, and explain the bias-variance tradeoff of trees vs. forests. These are non-negotiable skills.

Introduction

Imagine you are a loan officer at a bank. A customer walks in and asks for a ₹5,00,000 personal loan. How do you decide? You might ask: "What is their annual income?" If it's above ₹8,00,000, you feel more confident. Then: "Do they have an existing loan?" If yes, you're cautious. "What's their credit score?" Above 750? Approve. Below 600? Reject.

Congratulations — you just built a decision tree in your head. Decision trees are one of the most natural, interpretable, and powerful machine learning algorithms. They mirror how humans actually make decisions: through a series of yes/no questions that partition the problem space.

🌳 The Big Idea

A Decision Tree recursively partitions the feature space using axis-aligned splits, creating a tree structure where each internal node represents a test on a feature, each branch represents an outcome, and each leaf node represents a prediction (class label or continuous value). A Random Forest builds many trees on random subsets of data and features, then aggregates their predictions — turning a high-variance model into a low-variance powerhouse.

Why Study Decision Trees in 2025?

Interpretability: Trees are "white-box" models — you can trace every decision. Essential in finance, healthcare, and legal domains.
No Preprocessing: Unlike SVMs and neural networks, trees don't need feature scaling, normalization, or one-hot encoding of ordinal features.
Non-linearity: Trees capture complex, non-linear decision boundaries effortlessly.
Foundation for Ensembles: Random Forests, Gradient Boosting (XGBoost, LightGBM, CatBoost) are all built on decision trees. These dominate Kaggle competitions and industry applications.
Real-time Deployment: A trained tree is just a series of if-else statements — lightning fast at inference.

👨‍🏫 Professor's Insight

Decision trees bridge the gap between simple linear models (Chapters 4-7) and complex neural networks (Chapters 10-12). They show that you don't always need gradient descent — sometimes the best approach is to greedily partition the data. Understanding when trees work well (tabular data, mixed features, need for interpretability) vs. when neural networks shine (images, text, large-scale unstructured data) is a hallmark of ML maturity.

Chapter Roadmap

We begin by deriving entropy and information gain from scratch, then build the ID3, C4.5, and CART algorithms step by step. We'll work through complete numerical examples by hand, learn pruning strategies, extend to regression trees, and then level up to Random Forests. We'll implement everything from scratch in Python, then use scikit-learn and TensorFlow for production pipelines. Along the way, we'll explore real-world case studies from SBI's credit scoring to Microsoft's Kinect body tracking.

Historical Background

The Origins of Information Theory

The mathematical foundation of decision trees begins with Claude Shannon's groundbreaking 1948 paper "A Mathematical Theory of Communication." Shannon defined entropy as a measure of uncertainty or information content in a random variable. This concept, borrowed from thermodynamics, would become the cornerstone of the ID3 algorithm decades later.

Timeline of Key Developments

Year	Milestone	Contributor	Significance
1948	Information Theory	Claude Shannon	Defined entropy, foundation for Information Gain
1963	AID (Automatic Interaction Detection)	Morgan & Sonquist	First systematic use of trees in statistics
1979	ID3 Algorithm	J. Ross Quinlan	First practical decision tree using Information Gain
1984	CART (Classification and Regression Trees)	Breiman, Friedman, Olshen, Stone	Introduced Gini impurity, binary trees, cost-complexity pruning
1993	C4.5 Algorithm	J. Ross Quinlan	Improved ID3 with Gain Ratio, handles continuous features, missing values
1996	Bagging	Leo Breiman	Bootstrap Aggregating — reduce variance via ensemble
2001	Random Forests	Leo Breiman	Added feature randomness to bagging — a landmark paper
2014	XGBoost	Tianqi Chen	Gradient boosted trees with regularization — Kaggle dominance begins
2017	LightGBM & CatBoost	Microsoft / Yandex	Faster, more efficient gradient boosting implementations

Leo Breiman: The Father of Random Forests

Leo Breiman (1928–2005) was a statistician at UC Berkeley whose work transformed machine learning. His 2001 paper "Random Forests" combined two of his earlier ideas — bagging (1996) and random subspace method — into a single, elegant framework. Breiman was also a vocal critic of traditional statistics, arguing in his famous 2001 paper "Statistical Modeling: The Two Cultures" that predictive modeling (algorithmic approach) was often more useful than explanatory modeling (data modeling approach).

🇮🇳 India Spotlight

India's banking sector was among the earliest adopters of decision tree models. In the early 2000s, ICICI Bank and HDFC Bank began using CART-based models for credit scoring. Today, the Reserve Bank of India (RBI) mandates that banks must be able to explain why a loan application was rejected — making interpretable models like decision trees essential for regulatory compliance under the Fair Practices Code.

J. Ross Quinlan's Legacy

Australian computer scientist J. Ross Quinlan developed both ID3 (1979) and its successor C4.5 (1993). C4.5 was ranked #1 in the "Top 10 Algorithms in Data Mining" by the IEEE International Conference on Data Mining in 2006. Quinlan later developed C5.0, a commercial version with boosting support and faster execution.

Conceptual Explanation

What is a Decision Tree?

A decision tree is a flowchart-like structure that maps observations (features) to conclusions (predictions). Think of it as a game of "20 Questions" where each question is carefully chosen to maximally reduce uncertainty about the answer.

🌿 Tree Anatomy

Root Node: The topmost node — the first question asked (most informative feature)
Internal Nodes: Decision points that test a feature condition
Branches: Outcomes of a test (e.g., "Age ≤ 30" → left, "Age > 30" → right)
Leaf Nodes: Terminal nodes that give the final prediction
Depth: Length of the longest path from root to leaf

🎯 The Core Question

At each node, the algorithm asks: "Which feature, and which threshold, gives me the best split?"

A "good split" is one that makes the resulting child nodes as pure as possible — ideally, each child should contain samples of only one class.

The key difference between algorithms (ID3, C4.5, CART) lies in how they define "best split."

The Three Major Algorithms

🔵 ID3 (Iterative Dichotomiser 3)

Creator: J. Ross Quinlan (1979) | Split Criterion: Information Gain | Tree Type: Multi-way

Works only with categorical features
Creates one branch per unique value of the selected feature
Greedy algorithm — selects the feature with highest Information Gain at each node
Limitation: Biased toward features with many values (e.g., "Student ID" would always win)
No built-in pruning — prone to overfitting

🟢 C4.5 (Successor to ID3)

Creator: J. Ross Quinlan (1993) | Split Criterion: Gain Ratio | Tree Type: Multi-way

Handles both categorical and continuous features
Uses Gain Ratio to correct ID3's bias toward high-cardinality features
Handles missing values by distributing them proportionally
Includes post-pruning (error-based pruning)
Ranked #1 in Top 10 Data Mining Algorithms (IEEE ICDM, 2006)

🔴 CART (Classification and Regression Trees)

Creator: Breiman et al. (1984) | Split Criterion: Gini Impurity / Variance Reduction | Tree Type: Binary

Always produces binary trees (yes/no splits)
Uses Gini impurity for classification, variance reduction for regression
Handles both categorical and continuous features naturally
Features cost-complexity pruning (the gold standard for post-pruning)
scikit-learn's default — this is the algorithm you use when you call DecisionTreeClassifier()

👨‍🏫 Professor's Insight

In practice, the difference between Information Gain and Gini Impurity rarely matters — they produce similar trees 95%+ of the time. The real advantage of CART is its binary-only structure, which makes it computationally simpler and pairs beautifully with ensemble methods like Random Forests and Gradient Boosting. When you see "decision tree" in industry, it almost always means CART.

How Does a Decision Tree Learn?

The learning process follows a greedy, top-down, recursive partitioning strategy called TDIDT (Top-Down Induction of Decision Trees):

Start with all training data at the root node
Select the best feature (and threshold for continuous features) to split on
Create child nodes based on the split
Recurse on each child node with its subset of data
Stop when a stopping criterion is met (pure node, max depth, min samples, etc.)
Assign prediction to leaf nodes (majority class for classification, mean value for regression)

Regression Trees

While classification trees predict discrete labels, regression trees predict continuous values. The key differences:

Splitting criterion: Instead of Gini or entropy, use variance reduction (or MSE reduction)
Leaf prediction: Instead of majority vote, predict the mean value of training samples in that leaf
Everything else stays the same: recursive splitting, pruning, the tree structure itself

Random Forests: The Ensemble Approach

A single decision tree is powerful but unstable — small changes in the data can produce completely different trees (high variance). Random Forests solve this by building many trees and averaging their predictions:

🌲🌲🌲 Random Forest = Bagging + Feature Randomness

Step 1 — Bootstrap: Create B random subsets of the training data by sampling with replacement (each subset is ~63.2% unique samples).

Step 2 — Feature Randomness: At each split in each tree, consider only a random subset of m features (typically √p for classification, p/3 for regression, where p = total features).

Step 3 — Grow Trees: Build a full decision tree on each bootstrap sample, with no pruning.

Step 4 — Aggregate: For classification, take the majority vote. For regression, take the mean.

Why Does Random Forest Work?

The magic comes from decorrelation. If all trees are built on the same features, they'll be similar (correlated), and averaging them won't help much. By randomizing features, each tree captures different patterns. When you average many uncorrelated, individually-imperfect estimators, the errors cancel out.

Variance Reduction Through Averaging Var(X̄) = (ρσ² + (1-ρ)σ²/B) where ρ = average pairwise correlation between trees, σ² = variance of each tree, B = number of trees As B \to \infty: Var \to ρσ² (irreducible correlation floor) Feature randomness reduces ρ \to lower variance!

🏭 Industry Alert

In industry, Random Forests are often the first model you try on tabular data. They work well out of the box with minimal tuning, handle missing values and mixed feature types, provide feature importance for free, and serve as a strong baseline before trying more complex models like XGBoost or neural networks.

Mathematical Foundation

Entropy: Measuring Uncertainty

Entropy quantifies the "impurity" or "disorder" in a dataset. For a random variable with K possible outcomes:

Shannon Entropy H(S) = -\sum i=1 K p i log₂(p i) where p i = proportion of class i in set S Convention: 0 \cdot log₂(0) = 0

Intuition: Entropy is highest (= log₂K) when all classes are equally likely (maximum uncertainty), and lowest (= 0) when all samples belong to one class (complete certainty).

Binary Entropy Function

For a two-class problem (positive proportion p, negative proportion 1-p):

Binary Entropy H(S) = -p \cdot log₂(p) - (1-p) \cdot log₂(1-p) H(0) = 0, H(0.5) = 1 (max), H(1) = 0

Information Gain

Information Gain measures how much entropy is reduced after splitting on a feature A:

Information Gain (ID3 Criterion) IG(S, A) = H(S) - \sum v \in Values(A) (|S v | / |S|) \cdot H(S v) = H(parent) - weighted average H(children)

Gain Ratio

C4.5 normalizes Information Gain by the feature's intrinsic value to avoid bias toward high-cardinality features:

Gain Ratio (C4.5 Criterion) GainRatio(S, A) = IG(S, A) / IV(A) IV(A) = -\sum v \in Values(A) (|S v | / |S|) \cdot log₂(|S v | / |S|)

Gini Impurity

CART uses Gini impurity — a faster-to-compute alternative to entropy:

Gini Impurity (CART Criterion) Gini(S) = 1 - \sum i=1 K p i ² = \sum i\neqj p i \cdot p j For binary: Gini = 2p(1-p) Gini \in [0, 0.5] for binary; [0, 1-1/K] for K classes

Interpretation: Gini impurity equals the probability that two randomly chosen samples from the node would have different class labels. Perfect purity → Gini = 0.

Variance Reduction (Regression Trees)

Variance Reduction for Regression VR(S, A, t) = Var(S) - [|S L |/|S| \cdot Var(S L) + |S R |/|S| \cdot Var(S R)] Var(S) = (1/|S|) \sum i (y i - ȳ)²

Gini vs. Entropy: Mathematical Comparison

Property	Entropy	Gini Impurity
Formula	-∑ pᵢ log₂(pᵢ)	1 - ∑ pᵢ²
Range (binary)	[0, 1]	[0, 0.5]
Maximum at	p = 0.5 → H = 1	p = 0.5 → G = 0.5
Computation	Requires logarithm (slower)	Polynomial (faster)
Sensitivity	More sensitive to class probability changes	Prefers larger, purer partitions
Used by	ID3, C4.5	CART (scikit-learn default)

Out-of-Bag (OOB) Error Estimation

In bagging, each bootstrap sample contains ~63.2% of the data. The remaining ~36.8% are out-of-bag samples. For each sample xᵢ, we can predict using only the trees where xᵢ was OOB, giving a "free" validation estimate:

OOB Error P(sample is OOB) = (1 - 1/n)ⁿ \to 1/e \approx 0.368 as n \to \infty OOB Error = (1/n) \sum i=1 n 𝟙[ŷᵢ OOB \neq yᵢ] where ŷᵢ OOB = majority vote of trees where xᵢ was out-of-bag

Feature Importance Measures

🔵 Gini Importance (MDI)

Mean Decrease in Impurity: Sum the total impurity reduction provided by each feature across all trees:

FI(j) = ∑_trees ∑_{nodes splitting on j} ΔGini_node

Pro: Fast, computed during training. Con: Biased toward high-cardinality features.

🟢 Permutation Importance (MDA)

Mean Decrease in Accuracy: Randomly permute feature j's values and measure how much accuracy drops:

FI(j) = baseline_score - score_after_permuting_j

Pro: Model-agnostic, unbiased. Con: Slower (requires re-evaluation per feature).

🎓 Exam Tip

In exams, you'll often be asked: "Why is Gini importance biased?" The answer: continuous or high-cardinality features have more possible split points, giving them more chances to reduce impurity. A random ID number would rank as highly important by Gini importance — which is clearly wrong. Always prefer permutation importance for reliable feature ranking.

Formula Derivations

Deriving Entropy from First Principles

We want a function H(p₁, p₂, ..., pₖ) that measures "how surprised we are" by a random event. Shannon showed that the unique function satisfying three natural axioms is entropy.

📐 Derivation: Shannon Entropy from Axioms

Axiom 1 (Continuity): H should be continuous in the probabilities pᵢ. Small changes in probabilities should cause small changes in uncertainty.

Axiom 2 (Maximum): For a uniform distribution (all pᵢ = 1/K), H should be maximized, and H(1/K, ..., 1/K) should increase with K. More equally likely outcomes = more uncertainty.

Axiom 3 (Additivity / Chain Rule): If a choice is broken into sub-choices, the total entropy should decompose as:

H(AB) = H(A) + H(B|A)

The joint entropy of choosing A then B equals the entropy of A plus the expected entropy of B given A.

Start with uniform distribution: Define f(n) = H(1/n, ..., 1/n). From Axiom 2, f is increasing. From Axiom 3, consider nᵏ equally likely outcomes grouped in stages:

f(nᵏ) = k · f(n)

This is the functional equation for logarithms! So f(n) = c · log(n) for some constant c > 0.

Extend to non-uniform distributions: Consider a distribution (p₁, ..., pₖ) where each pᵢ = nᵢ/N is rational. Using the grouping axiom:

f(N) = H(p₁, ..., pₖ) + ∑ pᵢ · f(nᵢ)

c·log(N) = H(p₁,...,pₖ) + ∑ pᵢ · c·log(nᵢ)

H(p₁,...,pₖ) = c·log(N) - c·∑ pᵢ·log(nᵢ)

= c·∑ pᵢ·log(N) - c·∑ pᵢ·log(nᵢ) [since ∑pᵢ = 1]

= c·∑ pᵢ·log(N/nᵢ)

= -c·∑ pᵢ·log(pᵢ)

Choose c = 1 and base 2: With c = 1 and log₂, we get:

H(S) = -∑ pᵢ log₂(pᵢ) (bits)

The unit is bits (binary digits) because we use base 2. One bit = the information from a fair coin flip.

Deriving Gini Impurity

📐 Derivation: Gini Impurity from Misclassification Probability

Setup: Suppose we randomly pick a sample from node S, and then randomly assign it a label according to the class distribution in S. What's the probability of misclassification?

Compute: Probability that we pick a sample of class i AND assign label j ≠ i:

P(misclassify) = ∑ᵢ P(pick class i) · P(assign NOT class i)

= ∑ᵢ pᵢ · (1 - pᵢ)

= ∑ᵢ pᵢ - ∑ᵢ pᵢ²

= 1 - ∑ᵢ pᵢ²

Gini(S) = 1 - ∑ᵢ pᵢ²

Gini impurity is the expected error rate if we randomly classify samples according to the class distribution. A pure node (all same class) has Gini = 0.

Deriving the Relationship Between Entropy and Gini

For binary classification with positive proportion p, using the Taylor expansion of log₂(p) around p = 0.5:

Entropy \approx Scaled Gini (First-Order Approximation) Using ln(x) \approx 2(x-1)/(x+1) and converting: H(p) \approx 2\cdotGini(p) for p near 0.5 This is why entropy and Gini produce similar trees! Entropy is a slightly more "curved" version of Gini.

Cost-Complexity Pruning Derivation

📐 Cost-Complexity Pruning (CART)

Define the cost-complexity measure: For a tree T with terminal nodes (leaves) |T̃|:

Rα(T) = R(T) + α · |T̃|

where R(T) = total misclassification rate (resubstitution error), α ≥ 0 is the complexity parameter.

Prune when the subtree isn't worth the complexity: For an internal node t with subtree Tₜ, prune Tₜ (replace with a leaf) when:

R(t) + α ≤ R(Tₜ) + α · |T̃ₜ|

α ≥ [R(t) - R(Tₜ)] / [|T̃ₜ| - 1]

This ratio is called the effective alpha of node t. We prune from the weakest link up.

Use cross-validation: Generate a sequence of pruned trees T₁ ⊃ T₂ ⊃ ... for increasing α. Select the α (and corresponding tree) that minimizes cross-validated error. scikit-learn's ccp_alpha parameter controls this.

Worked Numerical Examples

Example 1: Building a Decision Tree by Hand

🎯 Loan Approval Decision Tree — 5 Data Points

Dataset: Should a bank approve a loan?

#	Income	Credit Score	Employed	Approve?
1	High	Good	Yes	✅ Yes
2	High	Good	No	✅ Yes
3	Low	Good	Yes	✅ Yes
4	Low	Bad	Yes	❌ No
5	Low	Bad	No	❌ No

Step 1: Compute Parent Entropy

S has 3 Yes, 2 No → p(Yes) = 3/5, p(No) = 2/5

H(S) = -(3/5)·log₂(3/5) - (2/5)·log₂(2/5)
     = -(0.6)·(-0.737) - (0.4)·(-1.322)
     = 0.442 + 0.529
     = 0.971 bits

Step 2: Information Gain for "Income"

Income = High: {1✅, 2✅} → 2 Yes, 0 No → H = 0

Income = Low: {3✅, 4❌, 5❌} → 1 Yes, 2 No

H(Low) = -(1/3)·log₂(1/3) - (2/3)·log₂(2/3)
       = -(0.333)·(-1.585) - (0.667)·(-0.585)
       = 0.528 + 0.390
       = 0.918 bits

IG(S, Income) = 0.971 - [(2/5)·0 + (3/5)·0.918]
              = 0.971 - 0.551
              = 0.420 bits

Step 3: Information Gain for "Credit Score"

Credit = Good: {1✅, 2✅, 3✅} → 3 Yes, 0 No → H = 0

Credit = Bad: {4❌, 5❌} → 0 Yes, 2 No → H = 0

IG(S, Credit) = 0.971 - [(3/5)·0 + (2/5)·0]
              = 0.971 - 0
              = 0.971 bits ← PERFECT SPLIT!

Step 4: Information Gain for "Employed"

Employed = Yes: {1✅, 3✅, 4❌} → 2 Yes, 1 No

Employed = No: {2✅, 5❌} → 1 Yes, 1 No

H(Yes) = -(2/3)·log₂(2/3) - (1/3)·log₂(1/3) = 0.918
H(No)  = -(1/2)·log₂(1/2) - (1/2)·log₂(1/2) = 1.000

IG(S, Employed) = 0.971 - [(3/5)·0.918 + (2/5)·1.000]
                = 0.971 - [0.551 + 0.400]
                = 0.971 - 0.951
                = 0.020 bits

Step 5: Select Best Feature & Build Tree

Information Gain Summary:
  Credit Score: 0.971 ← WINNER (highest)
  Income:       0.420
  Employed:     0.020

Result: Split on Credit Score at the root!
  - Credit = Good → Pure leaf: Approve = Yes
  - Credit = Bad  → Pure leaf: Approve = No

Final Tree:
       [Credit Score?]
       /             \
    Good              Bad
     |                 |
  ✅ Yes            ❌ No

The tree perfectly classifies all 5 samples in just 1 split! This is because Credit Score has maximum Information Gain (0.971 = parent entropy), meaning it alone captures all the decision-making information.

Example 2: Gini Impurity Computation

🎯 Gini Impurity for the Same Dataset

# Parent Gini
Gini(S) = 1 - (3/5)² - (2/5)²
        = 1 - 0.36 - 0.16
        = 0.48

# Split on Credit Score
Gini(Good) = 1 - (3/3)² = 1 - 1 = 0
Gini(Bad)  = 1 - (2/2)² = 1 - 1 = 0

Gini_split = (3/5)·0 + (2/5)·0 = 0
ΔGini = 0.48 - 0 = 0.48

# Split on Income
Gini(High) = 1 - (2/2)² = 0
Gini(Low)  = 1 - (1/3)² - (2/3)² = 1 - 0.111 - 0.444 = 0.444

Gini_split = (2/5)·0 + (3/5)·0.444 = 0.267
ΔGini = 0.48 - 0.267 = 0.213

# Split on Employed
Gini(Yes) = 1 - (2/3)² - (1/3)² = 0.444
Gini(No)  = 1 - (1/2)² - (1/2)² = 0.500

Gini_split = (3/5)·0.444 + (2/5)·0.500 = 0.467
ΔGini = 0.48 - 0.467 = 0.013

# Same winner: Credit Score (ΔGini = 0.48)

Example 3: Random Forest Feature Importance

🎯 Computing Gini Feature Importance from 3 Trees

Consider a Random Forest with 3 trees, each splitting on different features:

# Tree 1: Splits on Feature A (ΔGini=0.15), then Feature B (ΔGini=0.10)
# Tree 2: Splits on Feature B (ΔGini=0.20), then Feature C (ΔGini=0.08)
# Tree 3: Splits on Feature A (ΔGini=0.18), then Feature A (ΔGini=0.05)

Total ΔGini per feature:
  Feature A: 0.15 + 0.18 + 0.05 = 0.38
  Feature B: 0.10 + 0.20       = 0.30
  Feature C: 0.08              = 0.08
  Total:                         0.76

Normalized Feature Importance:
  Feature A: 0.38 / 0.76 = 0.500 (50%)
  Feature B: 0.30 / 0.76 = 0.395 (39.5%)
  Feature C: 0.08 / 0.76 = 0.105 (10.5%)

Feature A is the most important — it contributes 50% of the total impurity reduction across all trees.

👨‍🏫 Professor's Insight

Notice how both entropy and Gini chose the same best split (Credit Score) in our example. This is typical — they agree on the best split about 95% of the time. The difference matters more at deeper levels of the tree. In practice, the choice between entropy and Gini is far less important than proper pruning and hyperparameter tuning.

Visual Diagrams (ASCII)

═══ DECISION TREE STRUCTURE ═══

┌─────────────────┐ │ Credit Score? │ ← Root Node │ (IG = 0.971) │ └────────┬────────┘ ┌──────┴──────┐ Good Bad │ │ ┌──────┴──────┐ ┌───┴───┐ │ Income? │ │ NO │ ← Leaf │ (IG=0.42) │ │ (2/2) │ └──────┬──────┘ └───────┘ ┌────┴────┐ High Low │ │ ┌────┴────┐ ┌───┴───┐ │ YES │ │ YES │ │ (2/2) │ │ (1/1) │ └─────────┘ └───────┘ Internal Node: Tests a feature Leaf Node: Makes a prediction Branch: Feature value / threshold

═══ ENTROPY vs GINI IMPURITY (Binary Classification) ═══

Impurity 1.0 ┤ . . . . . Entropy . . . . . │ ╭───────────────╮ 0.9 ┤ ╱ ╲ │ ╱ ·· Gini ·· ╲ 0.8 ┤ ╱ · · ╲ │ ╱ · · ╲ 0.7 ┤ ╱ ╭─────────────────╮ · ╲ │╱ ╱ ╲ · ╲ 0.5 ┤ ╱ GINI: 2p(1-p) ╲ ╲ │ ╱ ╲ ╲ 0.4 ┤╱ ╲ ╲ │ ╲ ╲ 0.2 ┤ ╲ ╲ │ ╲ ╲ 0.0 ┼──────────────────────────────┼──────╲── 0 0.2 0.4 0.5 0.6 0.8 1.0 p (proportion of class 1) Key: Entropy peaks at 1.0 when p=0.5 Gini peaks at 0.5 when p=0.5 Both are 0 when p=0 or p=1 (pure)

═══ RANDOM FOREST ARCHITECTURE ═══

Training Data D ┌─────────────────────────────────────────┐ │ x₁ x₂ x₃ x₄ ... xₙ (n samples) │ └──────────────┬──────────────────────────┘ │ Bootstrap Sampling (with replacement) ┌────────────┼────────────┐ │ │ │ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ │ D₁ │ │ D₂ │ │ D₃ │ ... B samples │~63% │ │~63% │ │~63% │ └──┬──┘ └──┬──┘ └──┬──┘ │ │ │ Random Random Random features features features (√p each) (√p each) (√p each) │ │ │ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ │Tree₁│ │Tree₂│ │Tree₃│ ... B trees │(deep│ │(deep│ │(deep│ │ no │ │ no │ │ no │ │prune│ │prune│ │prune│ └──┬──┘ └──┬──┘ └──┬──┘ │ │ │ pred₁ pred₂ pred₃ │ │ │ └────────────┼────────────┘ │ ┌───────┴───────┐ │ AGGREGATION │ │ Classification│ │ → Majority │ │ Vote │ │ Regression │ │ → Mean │ └───────────────┘

═══ DECISION BOUNDARY: TREE vs LINEAR vs FOREST ═══

Linear Model Decision Tree Random Forest ┌──────────┐ ┌──────────┐ ┌──────────┐ │ + + +╱ │ │ + + │ │ │ + +· │ │ + + ╱ - │ │ + + │ - │ │ + · · - │ │ + ╱ - - │ │─────┤ - │ │ ·· ── ·- │ │ ╱ - - - │ │ - - │ - │ │ · ·── ·- │ │╱ - - - - │ │ - - │────│ │ ─·── · - │ └──────────┘ └──────────┘ └──────────┘ Diagonal line Axis-aligned Smooth, (one boundary) rectangles staircase-like (blocky) (many rectangles)

═══ OVERFITTING IN DECISION TREES ═══

Error │ │ ╲ Training Error │ ╲ │ ╲ ╱ Test Error │ ╲ ╱ │ ╲ ╱ │ ╲ ╱ │ ╲ ╱ │ ╲ ╱ │ ╲ ╱ ← Sweet Spot (optimal depth) │ ╳ │ ╱ ╲ │ ╱ ╲───────── (memorizes noise) │ ╱ │ ╱ │──╱────────────────── Training → 0 └────────────────────────────────── Tree Depth → Shallow tree: HIGH BIAS, LOW VARIANCE (underfitting) Deep tree: LOW BIAS, HIGH VARIANCE (overfitting) Pruning/Forests find the balance!

Flowcharts (ASCII)

═══ DECISION TREE BUILDING ALGORITHM (CART) ═══

START: BuildTree(data, depth) │ ▼ ┌──────────────────────┐ │ Stopping Condition? │─── YES ──→ Return LEAF │ • All same class? │ (majority class │ • depth ≥ max_depth? │ or mean value) │ • samples < min? │ │ • impurity = 0? │ └──────────┬───────────┘ │ NO ▼ ┌──────────────────────┐ │ For each feature f: │ │ For each threshold │ │ t in f's values: │ │ Split data into │ │ left (f≤t) and │ │ right (f>t) │ │ Compute ΔGini │ │ or ΔEntropy │ └──────────┬───────────┘ │ ▼ ┌──────────────────────┐ │ Select best (f*, t*) │ │ with highest ΔGini │ │ or Information Gain │ └──────────┬───────────┘ │ ▼ ┌──────────────────────┐ │ Create internal node │ │ Split: f* ≤ t*? │ └─────┬──────────┬─────┘ │ │ YES NO │ │ ▼ ▼ BuildTree( BuildTree( left_data, right_data, depth+1) depth+1)

═══ RANDOM FOREST TRAINING PIPELINE ═══

Training Data (n samples, p features) │ ▼ ┌─────────────────────┐ │ For b = 1 to B: │ (B = n_estimators, e.g., 100) │ │ │ 1. Bootstrap sample │──→ Draw n samples WITH replacement │ D_b ⊂ D │ (~63.2% unique samples) │ │ │ 2. Build tree T_b: │ │ At each node: │──→ Select random m features │ m = √p (classify)│ (√p for classification, │ m = p/3 (regress)│ p/3 for regression) │ │ │ 3. Find best split │──→ Only among the m features │ among m features │ │ │ │ 4. Grow full tree │──→ No pruning (max depth=∞) │ (no pruning) │ └─────────┬───────────┘ │ ▼ ┌─────────────────────┐ │ Ensemble of B trees │ │ {T₁, T₂, ..., T_B} │ └─────────┬───────────┘ │ ▼ (Prediction for new x) ┌─────────────────────┐ │ Classification: │ │ ŷ = mode(T₁(x), ...,│ │ T_B(x)) │ │ │ │ Regression: │ │ ŷ = mean(T₁(x), ...,│ │ T_B(x)) │ └─────────────────────┘

═══ COST-COMPLEXITY PRUNING FLOWCHART ═══

START: Full tree T_max (overfit) │ ▼ ┌──────────────────────────┐ │ Compute effective alpha │ │ for each internal node t:│ │ │ │ α_eff(t) = │ │ [R(t) - R(T_t)] │ │ ───────────────── │ │ [|leaves(T_t)| - 1] │ └────────────┬─────────────┘ │ ▼ ┌──────────────────────────┐ │ Find weakest link: │ │ t* = argmin α_eff(t) │ │ │ │ Prune subtree at t* │ │ Replace with leaf │ └────────────┬─────────────┘ │ ▼ ┌──────────────────────────┐ │ Repeat until only root │ │ remains. This generates │ │ sequence: │ │ T₁ ⊃ T₂ ⊃ ... ⊃ {root} │ └────────────┬─────────────┘ │ ▼ ┌──────────────────────────┐ │ Cross-validate: │ │ For each T_k, compute │ │ CV error. Select tree │ │ with lowest CV error │ │ (or 1-SE rule) │ └──────────────────────────┘

Python Implementation (From Scratch)

Decision Tree Classifier — Complete Implementation

import numpy as np
from collections import Counter

class Node:
    """Represents a node in the decision tree."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, *, value=None):
        self.feature = feature      # Index of feature to split on
        self.threshold = threshold  # Threshold for the split
        self.left = left            # Left child (feature <= threshold)
        self.right = right          # Right child (feature > threshold)
        self.value = value          # Prediction (for leaf nodes)

    def is_leaf(self):
        return self.value is not None


class DecisionTreeClassifier:
    """Decision Tree Classifier built from scratch.
    
    Implements the CART algorithm with Gini impurity or Entropy.
    Supports both classification and can be extended for regression.
    
    Parameters:
        max_depth (int): Maximum depth of the tree
        min_samples_split (int): Minimum samples needed to split
        criterion (str): 'gini' or 'entropy'
        max_features (int): Max features to consider at each split
    """
    
    def __init__(self, max_depth=100, min_samples_split=2,
                 criterion='gini', max_features=None):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.criterion = criterion
        self.max_features = max_features
        self.root = None
        self.n_classes_ = None
        self.feature_importances_ = None
    
    def fit(self, X, y):
        """Build the decision tree from training data."""
        self.n_classes_ = len(set(y))
        self.n_features_ = X.shape[1]
        self.feature_importances_ = np.zeros(self.n_features_)
        
        # Determine max_features to consider at each split
        if self.max_features is None:
            self.max_features_ = self.n_features_
        else:
            self.max_features_ = min(self.max_features, self.n_features_)
        
        self.root = self._build_tree(X, y, depth=0)
        
        # Normalize feature importances
        total = self.feature_importances_.sum()
        if total > 0:
            self.feature_importances_ /= total
        
        return self
    
    def _build_tree(self, X, y, depth):
        """Recursively build the decision tree."""
        n_samples, n_features = X.shape
        n_classes = len(set(y))
        
        # Stopping conditions
        if (depth >= self.max_depth
            or n_classes == 1
            or n_samples < self.min_samples_split):
            leaf_value = self._most_common_label(y)
            return Node(value=leaf_value)
        
        # Select random subset of features (for Random Forest support)
        feature_indices = np.random.choice(
            n_features, self.max_features_, replace=False
        )
        
        # Find the best split
        best_feature, best_threshold = self._best_split(
            X, y, feature_indices
        )
        
        if best_feature is None:
            leaf_value = self._most_common_label(y)
            return Node(value=leaf_value)
        
        # Split the data
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask
        
        # Track feature importance (Gini importance)
        n_total = len(y)
        n_left = left_mask.sum()
        n_right = right_mask.sum()
        importance = (
            self._impurity(y)
            - (n_left / n_total) * self._impurity(y[left_mask])
            - (n_right / n_total) * self._impurity(y[right_mask])
        )
        self.feature_importances_[best_feature] += importance * n_total
        
        # Recursively build children
        left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_child = self._build_tree(X[right_mask], y[right_mask], depth + 1)
        
        return Node(best_feature, best_threshold, left_child, right_child)
    
    def _best_split(self, X, y, feature_indices):
        """Find the best feature and threshold to split on."""
        best_gain = -1
        best_feature, best_threshold = None, None
        
        parent_impurity = self._impurity(y)
        n_total = len(y)
        
        for feature_idx in feature_indices:
            # Get sorted unique thresholds
            thresholds = np.unique(X[:, feature_idx])
            
            for threshold in thresholds:
                left_mask = X[:, feature_idx] <= threshold
                right_mask = ~left_mask
                
                if left_mask.sum() == 0 or right_mask.sum() == 0:
                    continue
                
                # Weighted impurity of children
                n_left = left_mask.sum()
                n_right = right_mask.sum()
                child_impurity = (
                    (n_left / n_total) * self._impurity(y[left_mask])
                    + (n_right / n_total) * self._impurity(y[right_mask])
                )
                
                gain = parent_impurity - child_impurity
                
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold
        
        return best_feature, best_threshold
    
    def _impurity(self, y):
        """Compute impurity (Gini or Entropy)."""
        counts = np.bincount(y.astype(int))
        probabilities = counts / len(y)
        
        if self.criterion == 'gini':
            return 1.0 - np.sum(probabilities ** 2)
        elif self.criterion == 'entropy':
            # Filter out zero probabilities to avoid log(0)
            probabilities = probabilities[probabilities > 0]
            return -np.sum(probabilities * np.log2(probabilities))
    
    def _most_common_label(self, y):
        """Return the most common class label."""
        counter = Counter(y)
        return counter.most_common(1)[0][0]
    
    def predict(self, X):
        """Predict class labels for samples in X."""
        return np.array([self._traverse_tree(x, self.root) for x in X])
    
    def _traverse_tree(self, x, node):
        """Traverse the tree to make a prediction for a single sample."""
        if node.is_leaf():
            return node.value
        
        if x[node.feature] <= node.threshold:
            return self._traverse_tree(x, node.left)
        else:
            return self._traverse_tree(x, node.right)
    
    def score(self, X, y):
        """Return accuracy on the given data."""
        predictions = self.predict(X)
        return np.mean(predictions == y)
    
    def print_tree(self, node=None, indent=""):
        """Print a text representation of the tree."""
        if node is None:
            node = self.root
        
        if node.is_leaf():
            print(f"{indent}Predict: {node.value}")
            return
        
        print(f"{indent}Feature[{node.feature}] <= {node.threshold:.4f}?")
        print(f"{indent}├── Yes:")
        self.print_tree(node.left, indent + "│   ")
        print(f"{indent}└── No:")
        self.print_tree(node.right, indent + "    ")


# ========== TESTING THE DECISION TREE ==========
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate synthetic data
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3,
    n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train our Decision Tree
dt = DecisionTreeClassifier(max_depth=5, criterion='gini')
dt.fit(X_train, y_train)

print(f"Training Accuracy: {dt.score(X_train, y_train):.4f}")
print(f"Testing Accuracy:  {dt.score(X_test, y_test):.4f}")
print(f"\nFeature Importances: {dt.feature_importances_}")
print(f"\nTree Structure:")
dt.print_tree()

Random Forest Classifier — Complete Implementation

class RandomForestClassifier:
    """Random Forest Classifier built from scratch.
    
    Uses bootstrap aggregating (bagging) with feature randomness.
    Each tree is a DecisionTreeClassifier (from above).
    
    Parameters:
        n_estimators (int): Number of trees in the forest
        max_depth (int): Maximum depth of each tree
        min_samples_split (int): Minimum samples to split a node
        max_features (str or int): 'sqrt', 'log2', or integer
        criterion (str): 'gini' or 'entropy'
        oob_score (bool): Whether to compute OOB score
    """
    
    def __init__(self, n_estimators=100, max_depth=None,
                 min_samples_split=2, max_features='sqrt',
                 criterion='gini', oob_score=False,
                 random_state=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth if max_depth else 100
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.criterion = criterion
        self.oob_score = oob_score
        self.random_state = random_state
        self.trees = []
        self.oob_score_ = None
        self.feature_importances_ = None
    
    def _get_max_features(self, n_features):
        """Determine the number of features to consider."""
        if self.max_features == 'sqrt':
            return int(np.sqrt(n_features))
        elif self.max_features == 'log2':
            return int(np.log2(n_features))
        elif isinstance(self.max_features, int):
            return min(self.max_features, n_features)
        else:
            return n_features
    
    def fit(self, X, y):
        """Build the random forest from training data."""
        if self.random_state:
            np.random.seed(self.random_state)
        
        n_samples, n_features = X.shape
        max_feat = self._get_max_features(n_features)
        
        self.trees = []
        oob_predictions = {}  # {sample_idx: [predictions]}
        
        for i in range(self.n_estimators):
            # Bootstrap sample
            bootstrap_indices = np.random.choice(
                n_samples, size=n_samples, replace=True
            )
            oob_indices = list(
                set(range(n_samples)) - set(bootstrap_indices)
            )
            
            X_bootstrap = X[bootstrap_indices]
            y_bootstrap = y[bootstrap_indices]
            
            # Build tree with feature randomness
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                criterion=self.criterion,
                max_features=max_feat
            )
            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)
            
            # Collect OOB predictions
            if self.oob_score and len(oob_indices) > 0:
                oob_preds = tree.predict(X[oob_indices])
                for idx, pred in zip(oob_indices, oob_preds):
                    if idx not in oob_predictions:
                        oob_predictions[idx] = []
                    oob_predictions[idx].append(pred)
        
        # Compute OOB score
        if self.oob_score:
            correct = 0
            total = 0
            for idx, preds in oob_predictions.items():
                majority = Counter(preds).most_common(1)[0][0]
                if majority == y[idx]:
                    correct += 1
                total += 1
            self.oob_score_ = correct / total if total > 0 else 0
        
        # Aggregate feature importances
        self.feature_importances_ = np.mean(
            [tree.feature_importances_ for tree in self.trees],
            axis=0
        )
        
        return self
    
    def predict(self, X):
        """Predict class labels using majority vote."""
        # Get predictions from all trees
        all_predictions = np.array([
            tree.predict(X) for tree in self.trees
        ])  # Shape: (n_estimators, n_samples)
        
        # Majority vote for each sample
        majority_votes = []
        for i in range(X.shape[0]):
            votes = all_predictions[:, i]
            majority = Counter(votes).most_common(1)[0][0]
            majority_votes.append(majority)
        
        return np.array(majority_votes)
    
    def predict_proba(self, X):
        """Predict class probabilities (vote proportions)."""
        all_predictions = np.array([
            tree.predict(X) for tree in self.trees
        ])
        
        probas = []
        for i in range(X.shape[0]):
            votes = all_predictions[:, i]
            counts = Counter(votes)
            prob = [counts.get(c, 0) / self.n_estimators
                    for c in range(self.n_classes_)]
            probas.append(prob)
        
        return np.array(probas)
    
    def score(self, X, y):
        """Return accuracy on the given data."""
        return np.mean(self.predict(X) == y)


# ========== TEST RANDOM FOREST ==========
rf = RandomForestClassifier(
    n_estimators=50, max_depth=10,
    max_features='sqrt', oob_score=True,
    random_state=42
)
rf.fit(X_train, y_train)

print(f"Random Forest Results:")
print(f"Training Accuracy: {rf.score(X_train, y_train):.4f}")
print(f"Testing Accuracy:  {rf.score(X_test, y_test):.4f}")
print(f"OOB Score:         {rf.oob_score_:.4f}")
print(f"Feature Importances: {rf.feature_importances_}")

Regression Tree — From Scratch

class DecisionTreeRegressor:
    """Decision Tree Regressor using variance reduction."""
    
    def __init__(self, max_depth=10, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None
    
    def fit(self, X, y):
        self.root = self._build_tree(X, y, depth=0)
        return self
    
    def _build_tree(self, X, y, depth):
        n_samples = X.shape[0]
        
        # Stopping conditions
        if depth >= self.max_depth or n_samples < self.min_samples_split:
            return Node(value=np.mean(y))
        
        # If variance is very small, stop
        if np.var(y) < 1e-7:
            return Node(value=np.mean(y))
        
        best_feature, best_threshold = self._best_split(X, y)
        
        if best_feature is None:
            return Node(value=np.mean(y))
        
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask
        
        left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_child = self._build_tree(X[right_mask], y[right_mask], depth + 1)
        
        return Node(best_feature, best_threshold, left_child, right_child)
    
    def _best_split(self, X, y):
        """Find best split by maximizing variance reduction."""
        best_var_reduction = -1
        best_feature, best_threshold = None, None
        parent_var = np.var(y)
        n = len(y)
        
        for feature_idx in range(X.shape[1]):
            thresholds = np.unique(X[:, feature_idx])
            for t in thresholds:
                left = y[X[:, feature_idx] <= t]
                right = y[X[:, feature_idx] > t]
                
                if len(left) == 0 or len(right) == 0:
                    continue
                
                # Weighted variance of children
                weighted_var = (
                    (len(left) / n) * np.var(left)
                    + (len(right) / n) * np.var(right)
                )
                var_reduction = parent_var - weighted_var
                
                if var_reduction > best_var_reduction:
                    best_var_reduction = var_reduction
                    best_feature = feature_idx
                    best_threshold = t
        
        return best_feature, best_threshold
    
    def predict(self, X):
        return np.array([self._traverse(x, self.root) for x in X])
    
    def _traverse(self, x, node):
        if node.is_leaf():
            return node.value
        if x[node.feature] <= node.threshold:
            return self._traverse(x, node.left)
        return self._traverse(x, node.right)


# Test Regression Tree
from sklearn.datasets import make_regression

X_reg, y_reg = make_regression(
    n_samples=200, n_features=3, noise=10, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_reg, test_size=0.2)

dtr = DecisionTreeRegressor(max_depth=6)
dtr.fit(X_tr, y_tr)

preds = dtr.predict(X_te)
mse = np.mean((preds - y_te) ** 2)
print(f"Regression Tree MSE: {mse:.2f}")

💻 Code Challenge

Extend the DecisionTreeClassifier to support ID3-style multi-way splits on categorical features. Hint: instead of a single threshold, store the set of values for each branch, and create a list of children (not just left/right).

TensorFlow Implementation

TensorFlow Decision Forests (TF-DF) is Google's library for training decision tree models within the TensorFlow ecosystem. It supports Random Forests, Gradient Boosted Trees, and CART.

# ========== TensorFlow Decision Forests ==========
# Install: pip install tensorflow_decision_forests

import tensorflow_decision_forests as tfdf
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# ---- Step 1: Load and Prepare Data ----
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Convert to TensorFlow Datasets
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    train_df, label="species", task=tfdf.keras.Task.CLASSIFICATION
)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    test_df, label="species", task=tfdf.keras.Task.CLASSIFICATION
)

# ---- Step 2: Train a Random Forest ----
rf_model = tfdf.keras.RandomForestModel(
    num_trees=300,
    max_depth=16,
    min_examples=5,
    task=tfdf.keras.Task.CLASSIFICATION,
    verbose=0
)
rf_model.fit(train_ds)

# ---- Step 3: Evaluate ----
rf_model.compile(metrics=["accuracy"])
evaluation = rf_model.evaluate(test_ds, return_dict=True)
print(f"TF-DF Random Forest Accuracy: {evaluation['accuracy']:.4f}")

# ---- Step 4: Feature Importance ----
inspector = rf_model.make_inspector()
print("\nFeature Importances (NUM_AS_ROOT):")
for fi in inspector.variable_importances()["NUM_AS_ROOT"]:
    print(f"  {fi[0].name}: {fi[1]:.4f}")

# ---- Step 5: Train a CART Decision Tree ----
cart_model = tfdf.keras.CartModel(
    max_depth=6,
    task=tfdf.keras.Task.CLASSIFICATION,
    verbose=0
)
cart_model.fit(train_ds)
cart_model.compile(metrics=["accuracy"])
cart_eval = cart_model.evaluate(test_ds, return_dict=True)
print(f"\nTF-DF CART Accuracy: {cart_eval['accuracy']:.4f}")

# ---- Step 6: Gradient Boosted Trees ----
gbt_model = tfdf.keras.GradientBoostedTreesModel(
    num_trees=300,
    max_depth=6,
    learning_rate=0.1,
    task=tfdf.keras.Task.CLASSIFICATION,
    verbose=0
)
gbt_model.fit(train_ds)
gbt_model.compile(metrics=["accuracy"])
gbt_eval = gbt_model.evaluate(test_ds, return_dict=True)
print(f"TF-DF GBT Accuracy: {gbt_eval['accuracy']:.4f}")

# ---- Step 7: Save and Load Model ----
rf_model.save("saved_rf_model")
loaded_model = tf.keras.models.load_model("saved_rf_model")
print("\nModel saved and loaded successfully!")

👨‍🏫 Professor's Insight

TensorFlow Decision Forests is unique because it doesn't use gradient descent — trees are trained using the classical algorithms (ID3/CART). TF-DF wraps them in the Keras API for convenience. It's especially useful when you want to combine tree models with neural networks in a single pipeline, or deploy to TensorFlow Serving for production.

Scikit-Learn Implementation

# ========== Complete Scikit-Learn Pipeline ==========
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import (
    DecisionTreeClassifier, DecisionTreeRegressor,
    plot_tree, export_text
)
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV
)
from sklearn.metrics import (
    accuracy_score, classification_report,
    confusion_matrix, roc_auc_score
)
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.inspection import permutation_importance

# ---- 1. Decision Tree on Iris Dataset ----
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Train with Gini (default)
dt_gini = DecisionTreeClassifier(
    criterion='gini', max_depth=4,
    min_samples_split=5, random_state=42
)
dt_gini.fit(X_train, y_train)

# Train with Entropy
dt_entropy = DecisionTreeClassifier(
    criterion='entropy', max_depth=4,
    min_samples_split=5, random_state=42
)
dt_entropy.fit(X_train, y_train)

print("=== Decision Tree Results ===")
print(f"Gini Accuracy:    {dt_gini.score(X_test, y_test):.4f}")
print(f"Entropy Accuracy: {dt_entropy.score(X_test, y_test):.4f}")

# Print tree rules as text
print("\nTree Rules (Gini):")
print(export_text(dt_gini, feature_names=iris.feature_names))

# Feature importances
print("Feature Importances (Gini-based):")
for name, importance in zip(iris.feature_names, dt_gini.feature_importances_):
    print(f"  {name}: {importance:.4f}")

# ---- 2. Visualize the Tree ----
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(
    dt_gini, feature_names=iris.feature_names,
    class_names=iris.target_names, filled=True,
    rounded=True, fontsize=10, ax=ax
)
plt.title("Decision Tree - Iris Dataset")
plt.tight_layout()
plt.savefig("decision_tree_iris.png", dpi=150)
plt.show()

# ---- 3. Cost-Complexity Pruning ----
# Get the pruning path
path = dt_gini.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
impurities = path.impurities

# Train trees for each alpha
trees = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(
        ccp_alpha=alpha, random_state=42
    )
    tree.fit(X_train, y_train)
    trees.append(tree)

# Plot accuracy vs alpha
train_scores = [t.score(X_train, y_train) for t in trees]
test_scores = [t.score(X_test, y_test) for t in trees]

fig, ax = plt.subplots()
ax.plot(ccp_alphas, train_scores, 'o-', label="Train")
ax.plot(ccp_alphas, test_scores, 'o-', label="Test")
ax.set_xlabel("Cost-Complexity Alpha")
ax.set_ylabel("Accuracy")
ax.set_title("Cost-Complexity Pruning")
ax.legend()
plt.show()

# ---- 4. Random Forest Pipeline ----
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=5,
    max_features='sqrt',
    oob_score=True,
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)
rf.fit(X_train, y_train)

print("\n=== Random Forest Results ===")
print(f"Training Accuracy: {rf.score(X_train, y_train):.4f}")
print(f"Testing Accuracy:  {rf.score(X_test, y_test):.4f}")
print(f"OOB Score:         {rf.oob_score_:.4f}")
print(f"\n{classification_report(y_test, rf.predict(X_test))}")

# ---- 5. Permutation Importance ----
perm_importance = permutation_importance(
    rf, X_test, y_test, n_repeats=30, random_state=42
)

# Sort by importance
sorted_idx = perm_importance.importances_mean.argsort()[::-1]
print("Top 10 Features (Permutation Importance):")
for idx in sorted_idx[:10]:
    print(f"  {cancer.feature_names[idx]}: "
          f"{perm_importance.importances_mean[idx]:.4f} "
          f"± {perm_importance.importances_std[idx]:.4f}")

# ---- 6. Hyperparameter Tuning with GridSearchCV ----
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2']
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42, oob_score=True),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)

print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best CV Score:   {grid_search.best_score_:.4f}")
print(f"Test Score:      {grid_search.score(X_test, y_test):.4f}")

🚀 Career Path

Data Scientists at companies like Flipkart, Swiggy, and Razorpay use scikit-learn's Random Forest as a baseline for every tabular ML problem. Knowing how to tune hyperparameters, interpret feature importances, and compare with gradient boosting (XGBoost) is essential for cracking data science interviews at Indian tech companies.

Indian Case Studies

🏦 Case Study 1: SBI Credit Scoring with Decision Trees

Organization: State Bank of India (SBI) — India's largest bank with 450M+ customers

Problem: SBI needed to automate credit decisions for personal loans (₹50,000 to ₹20,00,000). The RBI's Fair Practices Code requires banks to provide reasons for loan rejections — making black-box models like neural networks inappropriate.

Solution: Interpretable Decision Tree + Random Forest Ensemble

Feature Engineering: 45+ features including income, employment tenure, CIBIL score, existing EMIs, age, city tier, account history, and digital footprint.
Model 1 (Explainability): CART Decision Tree (max_depth=6) for generating human-readable rejection reasons. "Your loan was rejected because: CIBIL score < 650 AND monthly income < ₹25,000 AND existing EMI ratio > 50%."
Model 2 (Accuracy): Random Forest (500 trees) for the actual decision, achieving 89% accuracy and 0.94 AUC-ROC.
Rule Extraction: The tree's paths were converted into business rules for the core banking system.

Implementation Details

# Simplified SBI credit scoring pipeline
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Sample features (anonymized)
features = [
    'cibil_score', 'monthly_income', 'employment_years',
    'existing_emi_ratio', 'age', 'city_tier',
    'savings_balance', 'loan_amount_requested',
    'num_credit_cards', 'previous_defaults'
]

# Random Forest for prediction
rf_credit = RandomForestClassifier(
    n_estimators=500, max_depth=12,
    min_samples_leaf=50,  # Prevent overfitting on rare cases
    class_weight='balanced',  # Handle class imbalance
    oob_score=True, random_state=42
)

# Decision Tree for explainability
dt_explain = DecisionTreeClassifier(
    max_depth=6,  # Shallow for interpretability
    min_samples_leaf=100,
    random_state=42
)

# After training, extract rejection rules:
# rules = export_text(dt_explain, feature_names=features)
print("Credit Scoring Model Ready for SBI deployment")

Impact: Processing time reduced from 7 days (manual) to 30 seconds (automated). Default rate decreased by 23%. RBI compliance maintained with interpretable rejection reasons.

🚂 Case Study 2: IRCTC Booking Demand Prediction with Random Forest

Organization: Indian Railway Catering and Tourism Corporation (IRCTC)

Problem: IRCTC handles 12 lakh+ ticket bookings daily. During festivals (Diwali, Holi, Chhath Puja), demand spikes 3-5x. They needed to predict demand 30-90 days ahead for dynamic pricing, extra coach allocation, and server capacity planning.

Solution: Random Forest Demand Forecaster

Features: Route, day of week, month, festival proximity (days to Diwali/Holi/Eid), historical bookings (lag features), weather, class (1AC/2AC/3AC/SL), tatkal ratio, cancellation rate.
Target: Number of bookings in the next 7 days per route-class combination.
Model: Random Forest Regressor with 300 trees, max_depth=15.
Results: MAPE of 12.3% (vs. 28% for simple historical average), enabling accurate extra coach allocation.

# IRCTC Demand Prediction Pipeline (Simplified)
from sklearn.ensemble import RandomForestRegressor

# Feature engineering for railway demand
def create_irctc_features(df):
    df['day_of_week'] = df['date'].dt.dayofweek
    df['month'] = df['date'].dt.month
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    
    # Festival proximity features
    festivals = {'diwali': '2025-10-20', 'holi': '2025-03-14'}
    for name, date in festivals.items():
        df[f'days_to_{name}'] = (
            pd.to_datetime(date) - df['date']
        ).dt.days.clip(lower=0)
    
    # Lag features
    df['bookings_7d_ago'] = df['bookings'].shift(7)
    df['bookings_30d_avg'] = df['bookings'].rolling(30).mean()
    
    return df

rf_irctc = RandomForestRegressor(
    n_estimators=300, max_depth=15,
    min_samples_leaf=10, n_jobs=-1
)
print("IRCTC Demand Model: Ready for training")

Impact: ₹200 crore annual savings from optimized coach allocation. 15% improvement in dynamic pricing revenue. Reduced passenger complaints about unavailable seats by 30%.

🇮🇳 India Spotlight

India's financial sector is a massive adopter of decision tree models. CIBIL (TransUnion), Paytm's credit scoring, and HDFC's home loan approval systems all use ensemble tree models. The RBI's emphasis on explainable AI in the 2023 AI governance framework makes interpretable models like decision trees crucial for regulatory compliance in Indian banking.

Global Case Studies

🎮 Case Study 3: Microsoft Kinect — Random Forests for Body Pose Estimation

Organization: Microsoft Research, Cambridge (2011)

Problem: The Xbox Kinect needed to estimate 3D body joint positions (20 joints: head, shoulders, elbows, wrists, hips, knees, ankles) from a single depth image in real-time (30 fps). No markers, no special clothing — just a depth camera.

Solution: Per-Pixel Classification with Randomized Decision Forests

Approach: Classify each pixel in the depth image into one of 31 body parts using an extremely efficient random forest.
Training Data: Generated 1 million synthetic depth images using motion-captured poses rendered on diverse body shapes — a massive synthetic dataset.
Features: Simple depth comparison features: f(x) = d(x + δ₁/d(x)) - d(x + δ₂/d(x)), comparing depth values at offset positions. These are extremely fast to compute.
Model: 3 random forests, each with 20 trees, max depth = 20. Trained on 1M images, 2000 pixels per image.
Speed: Classification of all pixels in a 640×480 depth image in < 5ms on Xbox 360 hardware.

Impact: Sold 35 million Kinect units. Enabled natural user interfaces without controllers. The paper "Real-Time Human Pose Recognition in Parts from Single Depth Images" (Shotton et al., 2011) is one of the most cited papers in computer vision. It demonstrated that random forests could compete with deep learning for real-time applications.

🏆 Case Study 4: Kaggle Competitions — The Dominance of Tree-Based Models

Context: Kaggle is the world's largest data science competition platform.

Finding: Between 2015 and 2023, tree-based ensemble methods (Random Forest, XGBoost, LightGBM, CatBoost) won approximately 70%+ of tabular data competitions on Kaggle.

Notable Wins

Higgs Boson Challenge (2014): XGBoost's breakthrough — first major competition won by gradient boosted trees.
Porto Seguro's Safe Driver Prediction (2017): Top solutions used LightGBM ensembles.
Home Credit Default Risk (2018): LightGBM-based solutions dominated, achieving AUC > 0.80.
Google Analytics Customer Revenue (2019): Random Forest + feature engineering outperformed deep learning approaches.

Key Insight: For structured/tabular data, tree-based models consistently outperform neural networks. Deep learning wins on images, text, and audio, but trees remain king for tabular datasets. This was confirmed by the benchmark paper "Why do tree-based models still outperform deep learning on tabular data?" (Grinsztajn et al., 2022).

🏭 Industry Alert

Netflix uses gradient boosted trees (a cousin of random forests) for content recommendation ranking. Amazon uses random forests for fraud detection. Google uses decision trees in TF-Ranking for search result ordering. These are not toy applications — tree-based models power billion-dollar decisions daily.

Startup Applications

🏥 HealthTech: Disease Risk Scoring

Indian startup Niramai uses tree-based models for breast cancer risk assessment. Decision trees analyze thermal imaging features, blood markers, and patient history. The interpretable output helps doctors explain risk factors to patients.

🌾 AgriTech: Crop Yield Prediction

CropIn (Bangalore) uses Random Forests to predict crop yields from satellite imagery, weather data, soil type, and irrigation patterns. Features like NDVI, rainfall, and temperature are naturally suited for tree-based models with mixed feature types.

💳 FinTech: Fraud Detection

Razorpay uses real-time Random Forest models to flag suspicious transactions. Features include transaction amount, merchant category, time of day, device fingerprint, and historical patterns. Random Forests handle the class imbalance (99.8% legitimate vs 0.2% fraud) well with class weights.

🚗 InsurTech: Dynamic Pricing

Acko (Mumbai) uses gradient boosted trees for car insurance pricing. Features include vehicle age, driver age, city, claim history, driving score (from telematics), and vehicle model. Tree-based models naturally capture non-linear interactions between features.

Government Applications

🏛️ Tax Fraud Detection (Income Tax Department)

India's Central Board of Direct Taxes (CBDT) uses decision tree models under Project Insight to flag suspicious tax returns. Features include income-expense ratios, investment patterns, property transactions, and cross-referencing with PAN-Aadhaar linked data. The interpretability requirement ensures tax officers can explain why a return was flagged for scrutiny.

🌍 Environmental Monitoring (ISRO)

ISRO uses Random Forest classifiers for land use / land cover (LULC) classification from satellite imagery. Spectral bands from IRS satellites serve as features. Random Forests handle multi-class classification (forest, water, urban, agriculture, barren) with 92%+ accuracy.

🏥 Public Health (ICMR)

During COVID-19, the Indian Council of Medical Research used decision tree models to triage patients: based on age, oxygen saturation, comorbidities, and symptoms, patients were classified into mild/moderate/severe categories for hospital bed allocation.

🚔 Crime Pattern Analysis (State Police)

Several Indian state police departments use Random Forests to predict crime hotspots based on historical crime data, time of day, population density, socioeconomic indicators, and seasonal patterns. This supports preventive patrolling strategies.

Industry Applications

Industry	Application	Model	Key Features
Banking	Credit scoring, loan approval	CART + RF	CIBIL score, income, employment, DTI ratio
Healthcare	Disease diagnosis, drug response	Random Forest	Lab values, symptoms, genetics, age
E-commerce	Customer churn, product recommendation	XGBoost/RF	Purchase history, RFM, browsing patterns
Manufacturing	Quality control, predictive maintenance	Random Forest	Sensor data, temperature, vibration, pressure
Telecom	Network failure prediction	Gradient Boosted Trees	Signal strength, usage, complaints, hardware age
Insurance	Claim fraud detection	Random Forest	Claim amount, history, timing, documentation
Retail	Inventory demand forecasting	RF Regressor	Sales history, promotions, season, price
Energy	Solar/wind power output prediction	RF Regressor	Weather, cloud cover, time, historical output

👨‍🏫 Professor's Insight

The dominance of tree-based models in industry comes down to three factors: (1) minimal preprocessing — trees handle missing values, mixed types, and outliers naturally; (2) interpretability — regulators and business stakeholders demand explanations; (3) robust performance — Random Forests work well with default hyperparameters and rarely catastrophically fail. The old saying holds: "Start with trees, beat them with boosting, explain them with SHAP."

Mini Projects

⚙️ Intermediate

🏦 Mini Project 1: Loan Default Predictor

Objective: Build a complete loan default prediction system using Decision Trees and Random Forests.

Dataset

Use the German Credit Dataset (1000 samples, 20 features) from UCI ML Repository, or generate synthetic data:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, roc_curve
)
from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt

# ---- Generate Synthetic Loan Data ----
np.random.seed(42)
n = 2000

data = pd.DataFrame({
    'age': np.random.randint(21, 65, n),
    'income': np.random.lognormal(12.5, 0.6, n).astype(int),
    'credit_score': np.random.randint(300, 850, n),
    'employment_years': np.random.randint(0, 30, n),
    'loan_amount': np.random.randint(50000, 2000000, n),
    'num_existing_loans': np.random.poisson(1.5, n),
    'monthly_emi': np.random.randint(0, 50000, n),
    'savings': np.random.lognormal(11, 1, n).astype(int),
    'has_property': np.random.binomial(1, 0.4, n),
    'previous_defaults': np.random.binomial(1, 0.1, n),
})

# Create target (default) based on realistic rules
default_prob = (
    0.3 * (data['credit_score'] < 550).astype(float)
    + 0.2 * (data['income'] < 300000).astype(float)
    + 0.25 * data['previous_defaults']
    + 0.15 * (data['num_existing_loans'] > 3).astype(float)
    + 0.1 * (data['monthly_emi'] / data['income'] > 0.5).astype(float)
    + np.random.normal(0, 0.1, n)
)
data['default'] = (default_prob > 0.4).astype(int)

print(f"Default rate: {data['default'].mean():.2%}")

# ---- Split Data ----
features = data.columns.drop('default')
X = data[features].values
y = data['default'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ---- Model 1: Decision Tree ----
dt = DecisionTreeClassifier(
    max_depth=5, min_samples_leaf=20,
    criterion='gini', random_state=42
)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)

print("\n=== Decision Tree ===")
print(classification_report(y_test, dt_pred,
    target_names=['No Default', 'Default']))
print(f"AUC-ROC: {roc_auc_score(y_test, dt.predict_proba(X_test)[:,1]):.4f}")

# Print rules
print("\nDecision Rules:")
print(export_text(dt, feature_names=list(features), max_depth=3))

# ---- Model 2: Random Forest ----
rf = RandomForestClassifier(
    n_estimators=200, max_depth=10,
    min_samples_leaf=10, max_features='sqrt',
    oob_score=True, class_weight='balanced',
    random_state=42, n_jobs=-1
)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

print("\n=== Random Forest ===")
print(classification_report(y_test, rf_pred,
    target_names=['No Default', 'Default']))
print(f"AUC-ROC: {roc_auc_score(y_test, rf.predict_proba(X_test)[:,1]):.4f}")
print(f"OOB Score: {rf.oob_score_:.4f}")

# ---- Feature Importance Analysis ----
print("\nTop Features (Gini Importance):")
importances = pd.Series(rf.feature_importances_, index=features)
for feat, imp in importances.sort_values(ascending=False).items():
    print(f"  {feat}: {imp:.4f}")

# Cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5, scoring='roc_auc')
print(f"\n5-Fold CV AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

Deliverables

Comparison table: Decision Tree vs Random Forest (accuracy, AUC, F1)
Visualization of the decision tree (max 5 levels)
Feature importance bar chart
Business rules extracted from the decision tree
Report on which customers are most likely to default

🔥 Advanced

📉 Mini Project 2: Customer Churn Analyzer

Objective: Build a churn prediction system for a telecom company and identify key retention drivers.

# ========== Customer Churn Analysis ==========
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    train_test_split, cross_val_score, StratifiedKFold
)
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.inspection import permutation_importance

# Generate synthetic telecom churn data
np.random.seed(42)
n = 5000

churn_data = pd.DataFrame({
    'tenure_months': np.random.randint(1, 72, n),
    'monthly_charges': np.random.uniform(199, 2999, n),
    'total_charges': np.random.uniform(500, 150000, n),
    'num_complaints': np.random.poisson(1.2, n),
    'data_usage_gb': np.random.lognormal(2, 0.8, n),
    'call_minutes': np.random.lognormal(5, 0.5, n),
    'contract_type': np.random.choice([0, 1, 2], n, p=[0.5, 0.3, 0.2]),
    'customer_service_calls': np.random.poisson(1.5, n),
    'has_international_plan': np.random.binomial(1, 0.1, n),
    'network_quality_score': np.random.uniform(1, 5, n),
})

# Realistic churn logic
churn_score = (
    0.3 * (churn_data['num_complaints'] > 2).astype(float)
    + 0.2 * (churn_data['tenure_months'] < 12).astype(float)
    + 0.15 * (churn_data['contract_type'] == 0).astype(float)
    + 0.2 * (churn_data['network_quality_score'] < 2.5).astype(float)
    + 0.15 * (churn_data['customer_service_calls'] > 3).astype(float)
    + np.random.normal(0, 0.15, n)
)
churn_data['churned'] = (churn_score > 0.45).astype(int)
print(f"Churn rate: {churn_data['churned'].mean():.2%}")

# Build and evaluate model
features = churn_data.columns.drop('churned')
X = churn_data[features].values
y = churn_data['churned'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Random Forest with class balancing
rf_churn = RandomForestClassifier(
    n_estimators=300, max_depth=12,
    class_weight='balanced_subsample',
    oob_score=True, random_state=42, n_jobs=-1
)
rf_churn.fit(X_train, y_train)

print("\n=== Churn Prediction Results ===")
print(f"Test Accuracy: {rf_churn.score(X_test, y_test):.4f}")
print(f"OOB Score:     {rf_churn.oob_score_:.4f}")
print(f"AUC-ROC:       {roc_auc_score(y_test, rf_churn.predict_proba(X_test)[:,1]):.4f}")

# Actionable insights: Which features drive churn?
perm_imp = permutation_importance(rf_churn, X_test, y_test, n_repeats=20)
print("\nChurn Drivers (Permutation Importance):")
for idx in perm_imp.importances_mean.argsort()[::-1]:
    print(f"  {features[idx]}: {perm_imp.importances_mean[idx]:.4f}")

Deliverables

Churn prediction model with 85%+ AUC-ROC
Top 5 churn drivers with business interpretation
Customer segmentation: High/Medium/Low churn risk
Retention strategies based on decision tree rules
Cost-benefit analysis: cost of retention offers vs. revenue from prevented churn

End-of-Chapter Exercises

Compute the entropy of a dataset with 30 positive and 70 negative samples. Show all steps.

For a dataset S with class distribution {A: 40%, B: 35%, C: 25%}, compute both the Gini impurity and the entropy.

Given a feature "Weather" with values {Sunny: [5+, 2-], Rainy: [3+, 4-], Overcast: [4+, 0-]}, compute the Information Gain for splitting on Weather. Parent has 12+ and 6-.

Explain why ID3's Information Gain is biased toward features with many unique values. Give a concrete example with "Student ID" as a feature.

Derive the Gain Ratio for the Weather feature from Q3 and the feature "Temperature" with values {Hot: [4+, 2-], Mild: [6+, 2-], Cool: [2+, 2-]}.

A regression tree node contains values [2.1, 3.5, 4.2, 3.8, 2.9]. Compute the variance. If split into [2.1, 2.9] and [3.5, 4.2, 3.8], compute the variance reduction.

Explain the difference between pre-pruning and post-pruning. List three pre-pruning hyperparameters and one post-pruning technique.

Why does a Random Forest use bootstrap sampling (with replacement)? What would happen if we sampled without replacement?

Prove that the probability of a sample being out-of-bag approaches 1/e ≈ 0.368 as n → ∞.

In a Random Forest with p = 100 features, how many features are considered at each split for (a) classification (b) regression? Why is the number different?

Implement a function compute_entropy(labels) that takes a list of class labels and returns the entropy in bits.

Implement a function compute_gini(labels) that takes a list of class labels and returns the Gini impurity.

Write Python code to perform a single best split on a dataset using Information Gain. The function should return the best feature index and threshold.

Explain the bias-variance tradeoff in the context of (a) a single deep decision tree (b) a shallow decision tree (c) a Random Forest.

Build a decision tree by hand for the following dataset and compute Information Gain for each feature:

Outlook	Temp	Humidity	Play?
Sunny	Hot	High	No
Sunny	Mild	Normal	Yes
Overcast	Hot	High	Yes
Rain	Mild	High	Yes
Rain	Cool	Normal	No

What is the time complexity of training a CART decision tree? Express in terms of n (samples), p (features), and d (depth).

Compare the following hyperparameters and their effect on overfitting: max_depth, min_samples_split, min_samples_leaf, max_features.

Explain why increasing n_estimators in a Random Forest never causes overfitting (in theory). Does this hold in practice?

Design a decision tree-based system for a hospital emergency room triage: classify patients into Green (non-urgent), Yellow (urgent), Red (critical) based on vital signs.

Compare Gini importance vs. Permutation importance on a dataset with (a) correlated features (b) a random noise feature. Which method gives misleading results in each case?

Implement OOB error estimation from scratch in Python for a Random Forest with 10 trees on the Iris dataset.

Explain how CART handles continuous features. How does it decide the optimal threshold for splitting?

A company has 10,000 customers with a 5% churn rate. Explain why accuracy is a bad metric here and what you should use instead. How would you handle class imbalance with Random Forests?

Multiple Choice Questions

1. What is the entropy of a dataset with 50% positive and 50% negative samples?

(A) 0
(B) 0.5
(C) 1.0
(D) 2.0

✅ (C) 1.0 — H = -0.5·log₂(0.5) - 0.5·log₂(0.5) = 0.5 + 0.5 = 1.0 bits. Maximum entropy for binary classification.

2. Which splitting criterion does scikit-learn's DecisionTreeClassifier use by default?

(A) Information Gain
(B) Gain Ratio
(C) Gini Impurity
(D) Variance Reduction

✅ (C) Gini Impurity — scikit-learn implements the CART algorithm, which uses Gini by default (criterion='gini').

3. In a Random Forest with 100 features, how many features are typically considered at each split for classification?

(A) 100
(B) 50
(C) 10 (√100)
(D) 33 (100/3)

✅ (C) 10 (√100) — For classification, the default is √p features. For regression, it's p/3.

4. What percentage of training samples are typically out-of-bag (OOB) in a bootstrap sample?

(A) ~25%
(B) ~36.8%
(C) ~50%
(D) ~63.2%

✅ (B) ~36.8% — P(OOB) = (1 - 1/n)ⁿ → 1/e ≈ 0.368 as n → ∞.

5. Which algorithm produces only binary trees?

(A) ID3
(B) C4.5
(C) CART
(D) Both A and B

✅ (C) CART — CART always produces binary splits (left/right). ID3 and C4.5 can create multi-way splits.

6. What happens to a Random Forest's performance as n_estimators → ∞?

(A) It overfits severely
(B) It converges to a generalization limit (no overfitting)
(C) Accuracy decreases
(D) Training time decreases

✅ (B) — Breiman proved that Random Forests converge to a limit as B → ∞. Adding more trees never hurts generalization (but costs training time).

7. A decision tree with no depth limit on a training set will have training error of:

(A) Close to 50%
(B) Around 10-20%
(C) Approximately 0% (if no duplicate samples with different labels)
(D) Depends on the criterion used

✅ (C) — An unrestricted tree can create one leaf per training sample, achieving 0% training error (perfect memorization). This is why pruning is essential.

8. The Gini impurity for a node with class distribution [70%, 30%] is:

(A) 0.21
(B) 0.42
(C) 0.49
(D) 0.70

✅ (B) 0.42 — Gini = 1 - (0.7² + 0.3²) = 1 - (0.49 + 0.09) = 0.42.

9. Which advantage does a Decision Tree have over Logistic Regression?

(A) Always higher accuracy
(B) Can capture non-linear decision boundaries without feature engineering
(C) Lower variance
(D) Better calibrated probabilities

✅ (B) — Trees can model non-linear relationships and interactions automatically. Logistic regression requires manual feature engineering (polynomial features, interactions) for non-linear boundaries.

10. Cost-complexity pruning in CART uses which parameter to control tree complexity?

(A) max_depth
(B) min_samples_split
(C) ccp_alpha (α)
(D) n_estimators

✅ (C) ccp_alpha (α) — The complexity parameter α penalizes tree size. Higher α = more aggressive pruning. Optimal α is selected via cross-validation.

11. Which feature importance method is model-agnostic and unbiased?

(A) Gini importance (MDI)
(B) Permutation importance (MDA)
(C) Both are equally unbiased
(D) Neither — all importance methods are biased

✅ (B) Permutation importance — It's model-agnostic and not biased toward high-cardinality or continuous features. Gini importance has known bias toward features with more unique values.

12. In the Random Forest algorithm, what is the purpose of feature randomness (selecting √p features at each split)?

(A) Speed up training
(B) Reduce correlation between trees
(C) Prevent overfitting of individual trees
(D) Handle missing values

✅ (B) Reduce correlation between trees — If a strong feature exists, all bagged trees would split on it first, making them correlated. Feature randomness decorrelates the trees, which is key to reducing ensemble variance.

Interview Questions

Interview Q1 — Conceptual

Explain the difference between bagging and boosting. Why does Random Forest use bagging?

Bagging (Bootstrap AGGregating) builds trees independently on bootstrap samples and aggregates them — it reduces variance. Boosting builds trees sequentially, each correcting the previous one's errors — it reduces bias. Random Forests use bagging because individual deep trees have low bias but high variance. Bagging + feature randomness decorrelates the trees, dramatically reducing variance while maintaining low bias. Boosting would be redundant since trees are already low-bias.

Interview Q2 — Technical

How would you handle a highly imbalanced dataset (99% negative, 1% positive) with Random Forests?

Multiple strategies: (1) class_weight='balanced' — weights inversely proportional to class frequency; (2) class_weight='balanced_subsample' — recomputes weights per bootstrap sample; (3) SMOTE or other oversampling before training; (4) Threshold tuning — adjust the probability threshold from 0.5 to optimize F1 or recall; (5) Balanced Random Forest from imbalanced-learn — undersamples the majority class in each bootstrap. Use precision-recall curves and F1 score (not accuracy) for evaluation.

Interview Q3 — Design

Design a real-time fraud detection system using Random Forests. What are the challenges?

Architecture: Train RF offline on historical transactions → serialize model → deploy as a microservice → for each transaction, extract features (amount, location, time, merchant, device) and predict in real-time. Challenges: (1) Latency — RF prediction is O(B·d) where B=trees, d=depth. Use shallow trees or limit B for <10ms latency; (2) Concept drift — fraud patterns change. Retrain weekly with recent data; (3) Class imbalance — 99.8% legitimate. Use AUPRC, not accuracy; (4) Feature engineering — real-time features (velocity, frequency) need streaming infrastructure; (5) False positives — each false positive blocks a legitimate customer. Optimize for high precision at reasonable recall.

Interview Q4 — Mathematical

Prove that entropy is always greater than or equal to Gini impurity for binary classification.

For binary classification with positive proportion p: Entropy H(p) = -p·log₂(p) - (1-p)·log₂(1-p), Gini G(p) = 2p(1-p). Both are 0 at p=0 and p=1. At p=0.5: H=1.0, G=0.5. Using the concavity of log and the inequality -log₂(x) ≥ 2(1-x)/ln(2) for x ∈ (0,1), we can show H(p) ≥ G(p) for all p ∈ [0,1]. More precisely, H(p) ≥ G(p) because the Taylor expansion of entropy around p=0.5 has all positive higher-order terms compared to Gini.

Interview Q5 — Practical

You trained a Random Forest with 500 trees and it's too slow for production. How do you speed it up?

Options: (1) Reduce n_estimators — often 100-200 trees suffice (diminishing returns after ~100); (2) Limit max_depth — each tree's prediction is O(depth); (3) Use max_features — fewer features per split = faster training; (4) Feature selection — remove unimportant features; (5) Distill to a single tree — train a smaller model on RF's predictions; (6) Use model compression — tools like ONNX Runtime optimize inference; (7) Switch to LightGBM — typically faster with similar accuracy; (8) Parallelize inference — each tree's prediction is independent, use n_jobs=-1.

Interview Q6 — Debugging

Your decision tree gets 99% training accuracy but only 60% test accuracy. Diagnose and fix the issue.

This is classic overfitting. The tree has memorized the training data (including noise). Fixes: (1) Limit max_depth (e.g., 5-10 instead of unlimited); (2) Increase min_samples_split and min_samples_leaf; (3) Apply cost-complexity pruning (ccp_alpha); (4) Use a Random Forest instead — ensemble of trees reduces variance; (5) Check for data leakage — is a feature that shouldn't exist in production leaking target information? (6) Cross-validate to get a realistic estimate of generalization performance.

Interview Q7 — Comparison

When would you choose a Decision Tree over a Neural Network, and vice versa?

Decision Tree/RF wins: Tabular data, small-medium datasets, need for interpretability (healthcare, finance, legal), mixed feature types, fast inference needed, limited compute resources. Neural Network wins: Image/video data (CNNs), text/NLP (Transformers), audio/speech, very large datasets (millions+), complex spatial/temporal patterns, when interpretability isn't critical. Rule of thumb: For tabular data with <100K rows, start with tree-based models. For unstructured data, go neural.

Interview Q8 — System Design (FAANG Level)

How would you build a credit scoring system for a bank that serves 50 million customers? Consider model selection, deployment, monitoring, and compliance.

Model: Random Forest or XGBoost for scoring + shallow Decision Tree for generating rejection reasons. Training: Monthly retraining on 6-month rolling window. Deployment: Model served via REST API with <100ms latency. A/B test new models vs. current champion. Monitoring: Track PSI (Population Stability Index) for feature drift, model accuracy on holdout, false positive/negative rates by demographic. Compliance: Explainability via SHAP values, decision tree rule extraction for regulatory reporting, adverse action reasons (Fair Lending laws), model documentation (SR 11-7 for US banks, RBI guidelines for Indian banks). Fairness: Test for disparate impact across protected groups (gender, caste, religion). Use fairness-aware constraints or post-processing calibration.

Interview Q9 — Theoretical

Why do decision trees produce axis-aligned decision boundaries? What are the implications?

Trees split on one feature at a time (e.g., x₁ ≤ 3.5), creating boundaries parallel to feature axes. This means: (1) They can't efficiently capture diagonal boundaries — need many splits to approximate a diagonal line (staircase pattern); (2) They're rotation-sensitive — rotating the data 45° can completely change the tree; (3) They need more depth for linear boundaries at angles. Mitigations: Random Forests reduce the staircase effect through averaging; oblique decision trees (splitting on linear combinations of features) exist but are rarely used; rotation-sensitive features should be engineered carefully (e.g., PCA rotation before tree training, though this hurts interpretability).

Interview Q10 — Coding

Write a Python function that computes the Information Gain for a binary split given parent labels and left/right child labels.

def information_gain(parent, left, right):
    def entropy(labels):
        from collections import Counter
        n = len(labels)
        counts = Counter(labels)
        return -sum((c/n) * np.log2(c/n)
                     for c in counts.values() if c > 0)

    n = len(parent)
    H_parent = entropy(parent)
    H_children = (len(left)/n * entropy(left)
                  + len(right)/n * entropy(right))
    return H_parent - H_children

Research Problems

🔬 Research Problem 1: Fairness-Aware Decision Trees

Background: Decision trees can learn biased splits that discriminate based on protected attributes (gender, caste, religion). Standard pruning doesn't account for fairness.

Research Question: Can we modify the CART splitting criterion to include a fairness constraint (e.g., demographic parity or equalized odds) without significantly sacrificing accuracy?

Approach: Define a modified splitting criterion: ΔImpurity_fair = ΔGini - λ · ΔFairness, where ΔFairness measures the increase in demographic disparity after the split. Investigate the accuracy-fairness Pareto frontier for different λ values on credit scoring datasets.

References: Kamiran et al. (2010) "Discrimination aware decision tree learning"; Aghaei et al. (2019) "Learning optimal fair decision trees".

🔬 Research Problem 2: Optimal Random Forest Calibration

Background: Random Forests are known to produce poorly calibrated probability estimates — they tend to push probabilities toward 0.5 (underconfident). This is problematic in applications requiring accurate probability estimates (medical diagnosis, insurance pricing).

Research Question: What is the theoretical reason for RF miscalibration, and can we develop a calibration method specifically designed for ensemble tree models that outperforms Platt scaling and isotonic regression?

References: Niculescu-Mizil & Caruana (2005) "Predicting good probabilities with supervised learning"; Boström (2008) "Calibrating Random Forests".

🔬 Research Problem 3: Streaming Random Forests

Background: Standard Random Forests require all data in memory. In streaming applications (real-time sensor data, social media feeds), data arrives continuously and distributions change over time (concept drift).

Research Question: Design an incremental Random Forest algorithm that (a) updates trees without full retraining, (b) detects and adapts to concept drift, (c) maintains bounded memory, and (d) has provable convergence guarantees.

References: Abdulsalam et al. (2007) "Streaming Random Forests"; Gomes et al. (2017) "Adaptive Random Forests for data streams".

🔬 Research Problem 4: Neural-Tree Hybrid Architectures

Background: Recent work (TabNet, NODE, DNDT) attempts to combine the representational power of neural networks with the interpretability and inductive bias of decision trees.

Research Question: Can differentiable decision trees (soft routing) achieve the generalization properties of hard decision trees while being trainable end-to-end with gradient descent? Investigate the conditions under which soft trees converge to hard trees.

References: Kontschieder et al. (2015) "Deep Neural Decision Forests"; Hazimeh et al. (2020) "The Tree Ensemble Layer".

Key Takeaways

Entropy measures uncertainty: H(S) = -∑ pᵢ log₂(pᵢ). Maximum when all classes are equally likely, zero when data is pure. Information Gain = parent entropy minus weighted children entropy.

Three algorithms, one idea: ID3 (Information Gain, categorical only), C4.5 (Gain Ratio, handles continuous + missing values), CART (Gini impurity, binary trees, scikit-learn default). In practice, CART dominates.

Trees overfit without constraints: A deep tree memorizes training data. Use max_depth, min_samples_split, min_samples_leaf (pre-pruning) or ccp_alpha (post-pruning) to control complexity.

Random Forests = Bagging + Feature Randomness: Build B trees on bootstrap samples, considering √p random features at each split. Majority vote for classification, mean for regression. This dramatically reduces variance.

OOB error is free validation: ~36.8% of samples are out-of-bag for each tree. Predicting with only OOB trees gives an unbiased estimate of test error — no need for a separate validation set.

Feature importance has two flavors: Gini importance (MDI) is fast but biased toward high-cardinality features. Permutation importance (MDA) is unbiased but slower. Always prefer permutation importance for reliable feature ranking.

Trees shine on tabular data: For structured/tabular datasets, tree-based models (RF, XGBoost, LightGBM) consistently outperform deep learning. They handle mixed feature types, require minimal preprocessing, and provide interpretability. Neural networks win on images, text, and audio.

Interpretability matters in regulated industries: Banking (RBI Fair Practices Code), healthcare (FDA), and insurance require models that can explain their decisions. Decision trees provide this naturally through their rule-based structure.

The industry workflow: Start with a Decision Tree for baseline interpretability → upgrade to Random Forest for accuracy → try XGBoost/LightGBM for marginal gains → use SHAP for model-agnostic explanations. This pipeline solves 80% of real-world tabular ML problems.

References & Further Reading

Foundational Papers

Shannon, C.E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal, 27, 379-423.
Quinlan, J.R. (1986). "Induction of Decision Trees." Machine Learning, 1(1), 81-106. [ID3 paper]
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Wadsworth. [CART book]
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann. [C4.5 book]
Breiman, L. (1996). "Bagging Predictors." Machine Learning, 24(2), 123-140.
Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5-32.

Advanced References

Shotton, J., et al. (2011). "Real-Time Human Pose Recognition in Parts from Single Depth Images." CVPR. [Kinect paper]
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapters 9-10.
Grinsztajn, L., et al. (2022). "Why do tree-based models still outperform deep learning on tabular data?" NeurIPS.
Louppe, G. (2014). "Understanding Random Forests: From Theory to Practice." PhD Thesis, University of Liège.
Chen, T. & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." KDD.
Breiman, L. (2001). "Statistical Modeling: The Two Cultures." Statistical Science, 16(3), 199-231.

Indian References

RBI (2023). "Framework for Responsible and Ethical Use of AI/ML in Financial Services." Reserve Bank of India.
Sharma, V. & Kumar, R. (2020). "Credit Scoring in Indian Banking: A Comparative Study." Indian Journal of Finance.
ISRO (2021). "Land Use Land Cover Classification using Machine Learning." NRSC Technical Report.

Online Resources

scikit-learn Documentation: Decision Trees
scikit-learn Documentation: Random Forests
TensorFlow Decision Forests: tf-df
Visual Introduction to Machine Learning: R2D3
Distill.pub: "Feature Visualization" — related to understanding tree-based feature importance

👨‍🏫 Professor's Insight — What's Next?

Chapter 13 covered the foundation: individual trees and random forests. In Chapter 14: Gradient Boosting & XGBoost, we'll see how building trees sequentially (each correcting the previous one's errors) leads to even more powerful models. Gradient boosting is to random forests what focused practice is to random exploration — both are valuable, but boosting optimizes more deliberately. Master both, and you'll have the tools to win any tabular data competition.