Machine Learning: From Scratch to Advanced

Chapter 1

Introduction to Machine Learning

Learning Objectives

Understand what Machine Learning is and why it matters
Distinguish between Supervised, Unsupervised, and Reinforcement Learning
Identify real-world ML applications across industries
Understand the end-to-end ML workflow

What is Machine Learning?

Machine Learning (ML) is a branch of Artificial Intelligence that enables computers to learn patterns from data and make decisions without being explicitly programmed for every scenario. Instead of writing rules by hand, you feed the algorithm data and let it discover the rules itself.

Arthur Samuel (1959) is widely credited with describing ML as the "field of study that gives computers the ability to learn without being explicitly programmed." Tom Mitchell (1997) provided a more formal definition:

A computer program is said to learn from experience E with respect to task T and performance measure P, if its performance at T, as measured by P, improves with experience E.

Types of Machine Learning

1. Supervised Learning

The algorithm learns from labeled data — each training example comes with the correct answer (label). The model learns a mapping from inputs to outputs.

Classification: Predicting a category — spam vs. not spam, cat vs. dog
Regression: Predicting a continuous value — house prices, temperature

2. Unsupervised Learning

The algorithm works with unlabeled data and tries to find hidden patterns or groupings without knowing the correct answers.

Clustering: Grouping similar customers together
Dimensionality Reduction: Compressing data while keeping important features (PCA)
Association: Finding items that frequently co-occur (market basket analysis)

3. Reinforcement Learning

An agent learns by interacting with an environment, receiving rewards for good actions and penalties for bad ones. Used in game AI, robotics, and self-driving cars.

Type	Data	Goal	Example
Supervised	Labeled	Predict output	Email spam detection
Unsupervised	Unlabeled	Find structure	Customer segmentation
Reinforcement	Rewards/Penalties	Maximize reward	AlphaGo, robotics
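
To make the contrast in the table concrete, the sketch below fits a supervised and an unsupervised model to the same data. It is only an illustration: the Iris dataset, `LogisticRegression`, and `KMeans` are choices made here, not something the chapter prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the model sees the labels y during training
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print("Predicted classes:", clf.predict(X[:3]))

# Unsupervised: the model sees only X and assigns its own group ids
km = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = km.fit_predict(X)
print("Cluster ids:", cluster_ids[:3])
```

The cluster ids are arbitrary integers; they need not match the species labels, because the clustering model never saw them.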

Real-World Applications

Machine Learning powers many products and services you use daily:

Healthcare: Disease diagnosis from X-rays, drug discovery, patient risk prediction
Finance: Fraud detection, algorithmic trading, credit scoring
E-commerce: Product recommendations (Amazon, Netflix), dynamic pricing
Transportation: Self-driving cars (Tesla), route optimization (Google Maps)
Language: Machine translation (Google Translate), voice assistants (Siri, Alexa)
Social Media: Content recommendation, face recognition, sentiment analysis

The ML Workflow

Most ML projects follow a similar pipeline, and in practice the steps are often revisited iteratively:

Define the Problem: What question are you trying to answer?
Collect Data: Gather relevant, high-quality data
Explore & Preprocess: Clean data, handle missing values, visualize patterns
Feature Engineering: Select and transform the most informative features
Train the Model: Choose an algorithm and fit it to the training data
Evaluate: Test the model on unseen data using appropriate metrics
Tune & Optimize: Adjust hyperparameters for better performance
Deploy: Put the model into production and monitor its performance
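
The steps above can be sketched end to end in a few lines. This is a minimal illustration, assuming the built-in Iris dataset, a logistic regression model, and a small grid over the regularization strength `C`; none of these choices come from the text, and deployment is left out.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2: problem = classify iris species; data = a built-in toy dataset
X, y = load_iris(return_X_y=True)

# Steps 3-4: hold out unseen data; scaling stands in for preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Steps 5 and 7: train while tuning one hyperparameter with cross-validation
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# Step 6: evaluate on the held-out test set
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(f"Best C: {search.best_params_['logisticregression__C']}")
print(f"Test accuracy: {test_accuracy:.3f}")
# Step 8 (deployment and monitoring) is outside the scope of this sketch
```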

Setting Up Your Environment

bash
# Install essential ML libraries
pip install numpy pandas matplotlib scikit-learn
pip install tensorflow keras
pip install seaborn jupyter

# Verify installation
python -c "import sklearn; print(sklearn.__version__)"

Exercises

Exercise 1.1: Classify each problem as Supervised, Unsupervised, or Reinforcement Learning

a) Predicting house prices from square footage → Supervised (Regression)

b) Grouping news articles by topic without labels → Unsupervised (Clustering)

c) Teaching a robot to walk → Reinforcement Learning

d) Detecting fraudulent credit card transactions → Supervised (Classification)

e) Reducing image features from 1000 to 50 → Unsupervised (Dimensionality Reduction)

Exercise 1.2: List 3 ML applications in your daily life and identify the type

Example answers:

YouTube recommendations → Supervised Learning (predicting what you'll click)
Google Photos grouping faces → Unsupervised Learning (clustering)
Siri learning your preferences → Reinforcement Learning

Exercise 1.3: Describe the 8 steps of the ML workflow for a movie recommendation system

1. Define: Recommend movies users will enjoy.

2. Collect: User ratings, watch history, movie metadata.

3. Explore: Analyze rating distributions, popular genres, viewing patterns.

4. Feature Engineering: User preferences, genre encoding, watch time features.

5. Train: Collaborative filtering or content-based model.

6. Evaluate: Measure with RMSE, precision@k on held-out data.

7. Tune: Optimize number of factors, learning rate.

8. Deploy: Serve recommendations in real-time via API.

Chapter Summary

ML enables computers to learn from data rather than explicit programming
Three main types: Supervised, Unsupervised, and Reinforcement Learning
ML is used across healthcare, finance, e-commerce, transportation, and more
Every ML project follows a systematic workflow from problem definition to deployment

Chapter 2

Python Essentials for Machine Learning

Learning Objectives

Master NumPy arrays and operations for numerical computing
Use Pandas DataFrames for data manipulation and analysis
Create informative visualizations with Matplotlib
Understand vectorized operations for efficient computation

NumPy: Numerical Computing Foundation

NumPy is the backbone of scientific computing in Python. It provides high-performance multidimensional arrays and tools for working with them.

Python
import numpy as np

# Creating arrays
a = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Useful array generators
zeros = np.zeros((3, 4))         # 3x4 matrix of zeros
ones = np.ones((2, 3))           # 2x3 matrix of ones
rng = np.arange(0, 10, 2)       # [0, 2, 4, 6, 8]
lin = np.linspace(0, 1, 5)      # [0, 0.25, 0.5, 0.75, 1.0]
rand = np.random.randn(3, 3)    # 3x3 random normal values

# Array properties
print(matrix.shape)    # (3, 3)
print(matrix.dtype)    # int64 (may be int32 on some platforms)
print(matrix.ndim)     # 2

Vectorized Operations

Python
# Element-wise operations (much faster than loops)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b)      # [5, 7, 9]
print(a * b)      # [4, 10, 18]
print(a ** 2)     # [1, 4, 9]
print(np.dot(a, b))  # 32  (dot product)

# Statistical operations
data = np.array([14, 23, 18, 29, 35, 22])
print(np.mean(data))    # 23.5
print(np.std(data))     # 6.90 (population std, ddof=0)
print(np.median(data))  # 22.5

# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B)               # Matrix multiplication
print(np.linalg.inv(A))    # Matrix inverse
print(np.linalg.det(A))    # Determinant: approximately -2.0 (floating-point result)
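
As a rough check on the claim that vectorized operations beat loops, the sketch below squares the same array both ways and times each version. The helper names `squares_loop` and `squares_vectorized` are made up for this example, and the timings depend on the machine, so no expected numbers are given.

```python
import timeit
import numpy as np

values = np.arange(100_000, dtype=np.float64)

def squares_loop(arr):
    # Pure-Python loop: one interpreter step per element
    return [v * v for v in arr]

def squares_vectorized(arr):
    # Single NumPy expression: the loop runs in compiled code
    return arr ** 2

loop_result = squares_loop(values)
vec_result = squares_vectorized(values)
same = np.allclose(loop_result, vec_result)
print("Same result:", same)

# Time five repetitions of each version
t_loop = timeit.timeit(lambda: squares_loop(values), number=5)
t_vec = timeit.timeit(lambda: squares_vectorized(values), number=5)
print(f"loop: {t_loop:.4f}s, vectorized: {t_vec:.4f}s")
```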

Pandas: Data Analysis Powerhouse

Python
import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 75000, 55000],
    'Department': ['ML', 'Web', 'ML', 'Data']
})

# Exploring data
print(df.head())          # First 5 rows
df.info()                 # Column types, non-null counts (prints directly, returns None)
print(df.describe())      # Statistical summary

# Selecting & filtering
ml_team = df[df['Department'] == 'ML']
high_salary = df[df['Salary'] > 55000]

# Groupby operations
avg_salary = df.groupby('Department')['Salary'].mean()

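# Handling missing data (each call returns a new DataFrame; df itself is unchanged)
df.dropna()                              # Remove rows with NaN
df.fillna(0)                             # Replace NaN with 0
df.fillna(df.mean(numeric_only=True))    # Replace NaN with column mean (numeric columns only)

# Writing to and reading from CSV
df.to_csv('output.csv', index=False)
df = pd.read_csv('output.csv')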

Matplotlib: Data Visualization

Python
import matplotlib.pyplot as plt
import numpy as np

# Line plot
x = np.linspace(0, 10, 100)
plt.figure(figsize=(10, 6))
plt.plot(x, np.sin(x), label='sin(x)', linewidth=2)
plt.plot(x, np.cos(x), label='cos(x)', linewidth=2)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Trigonometric Functions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Scatter plot
np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5
plt.scatter(x, y, alpha=0.7, c='#4f46e5')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Linear Relationship with Noise')
plt.show()

# Histogram
data = np.random.normal(100, 15, 1000)
plt.hist(data, bins=30, edgecolor='white', color='#7c3aed')
plt.title('Normal Distribution')
plt.show()

Exercises

Exercise 2.1: Create a NumPy array of 20 random integers between 1-100. Find the mean, max, min, and standard deviation.

import numpy as np
arr = np.random.randint(1, 101, size=20)
print(f"Array: {arr}")
print(f"Mean: {np.mean(arr):.2f}")
print(f"Max: {np.max(arr)}")
print(f"Min: {np.min(arr)}")
print(f"Std: {np.std(arr):.2f}")

Exercise 2.2: Create a Pandas DataFrame of 5 students with Name, Maths, Science, English scores. Add a Total and Percentage column.

import pandas as pd
df = pd.DataFrame({
    'Name': ['Amit', 'Priya', 'Ravi', 'Sneha', 'Karan'],
    'Maths': [85, 92, 78, 95, 88],
    'Science': [90, 88, 82, 91, 76],
    'English': [78, 95, 85, 89, 92]
})
df['Total'] = df[['Maths','Science','English']].sum(axis=1)
df['Percentage'] = (df['Total'] / 300 * 100).round(2)
print(df)

Exercise 2.3: Plot a bar chart showing the average score per subject from Exercise 2.2

import matplotlib.pyplot as plt

# Reuses df from Exercise 2.2
subjects = ['Maths', 'Science', 'English']
averages = [df[s].mean() for s in subjects]
plt.bar(subjects, averages, color=['#4f46e5','#10b981','#f59e0b'])
plt.ylabel('Average Score')
plt.title('Average Score by Subject')
plt.ylim(70, 100)
plt.show()

Chapter Summary

NumPy provides fast, vectorized array operations essential for ML math
Pandas simplifies data loading, cleaning, filtering, and groupby analysis
Matplotlib enables line, scatter, bar, and histogram visualizations
Vectorized operations are often 10-100x faster than equivalent Python loops

Chapter 3

Data Preprocessing & Feature Engineering

Learning Objectives

Handle missing data with various imputation strategies
Encode categorical variables using Label and One-Hot encoding
Scale numerical features with StandardScaler and MinMaxScaler
Perform feature selection and train-test splitting

Why Preprocessing Matters

Raw data is messy: it contains missing values, inconsistent formats, outliers, and mixed data types. Garbage in = garbage out. Preprocessing transforms raw data into a clean, ML-ready format. Industry surveys are often cited as showing that data scientists spend 60-80% of their time on data preparation.

Handling Missing Data

Python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Age': [25, np.nan, 35, 28, np.nan],
    'Salary': [50000, 60000, np.nan, 55000, 70000],
    'City': ['Delhi', 'Mumbai', 'Delhi', None, 'Bangalore']
})

# Check missing values
print(df.isnull().sum())

# Strategy 1: Drop rows with any NaN
df_dropped = df.dropna()

# Strategy 2: Fill with mean/median (numerical)
# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Strategy 3: Fill with mode (categorical)
df['City'] = df['City'].fillna(df['City'].mode()[0])

# Strategy 4: Scikit-learn SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
df[['Age','Salary']] = imputer.fit_transform(df[['Age','Salary']])

Encoding Categorical Variables

Python
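from sklearn.preprocessing import LabelEncoder

# Label Encoding: maps each category to an integer in alphabetical order.
# The integers imply an order, so for ordinal data (low < medium < high)
# check that the assigned order matches the real one.
le = LabelEncoder()
df['City_encoded'] = le.fit_transform(df['City'])
# Bangalore=0, Delhi=1, Mumbai=2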

# One-Hot Encoding (for nominal data — no order)
df_encoded = pd.get_dummies(df, columns=['City'], drop_first=True)
# Creates: City_Delhi, City_Mumbai (Bangalore is baseline)

Feature Scaling

Many algorithms (KNN, SVM, neural networks, and other models trained with gradient descent) are sensitive to the scale of features. A salary feature (50000-100000) would dominate an age feature (20-60) without scaling.

Python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# StandardScaler: mean=0, std=1 (z-score normalization)
scaler = StandardScaler()
df[['Age_scaled','Salary_scaled']] = scaler.fit_transform(df[['Age','Salary']])

# MinMaxScaler: scales to [0, 1]
mm_scaler = MinMaxScaler()
df[['Age_mm','Salary_mm']] = mm_scaler.fit_transform(df[['Age','Salary']])

StandardScaler: z = (x - μ) / σ

MinMaxScaler: x' = (x - x_min) / (x_max - x_min)
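
The two formulas can be checked by hand against scikit-learn. The sketch below is a small worked example on the four ages used earlier; it assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

ages = np.array([[25.0], [30.0], [35.0], [28.0]])

# z = (x - mean) / std, with the population std (ddof=0)
z_manual = (ages - ages.mean()) / ages.std()
z_sklearn = StandardScaler().fit_transform(ages)

# x' = (x - x_min) / (x_max - x_min)
mm_manual = (ages - ages.min()) / (ages.max() - ages.min())
mm_sklearn = MinMaxScaler().fit_transform(ages)

print("z-score matches:", np.allclose(z_manual, z_sklearn))
print("min-max matches:", np.allclose(mm_manual, mm_sklearn))
print("min-max values:", mm_manual.ravel())
```

For these ages the min-max values are 0, 0.5, 1, and 0.3: the youngest maps to 0 and the oldest to 1.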

Train-Test Split

Python
from sklearn.model_selection import train_test_split

# Assumes df also contains a label column named 'Target' (not created in the examples above)
X = df[['Age', 'Salary']]  # Features
y = df['Target']            # Label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

When to use which scaler?

StandardScaler — when data follows a normal distribution. Best for SVM, Logistic Regression, PCA.

MinMaxScaler — when you need values in a fixed range [0,1]. Best for Neural Networks, KNN.

Mini-Project: Titanic Data Cleaning Pipeline

Python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load Titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# Drop irrelevant columns
df = df.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)

# Fill missing values (assign back instead of using inplace=True on a column)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Encode categorical variables as 0/1 integer columns
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True, dtype=int)

# Split features and target
X = df.drop('Survived', axis=1)
y = df['Survived']

# Split into train and test BEFORE scaling to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print("Data is clean and ready for modeling!")

Exercises

Exercise 3.1: Given a dataset with 15% missing values in 'Income', when would you drop vs. impute?

Drop when: the dataset is very large AND the missing rows are random (MCAR). Losing 15% of a 1M row dataset still leaves 850k rows.

Impute when: the dataset is small, or the missing data has a pattern (MAR/MNAR). Use median for skewed data, mean for normal distributions.

Exercise 3.2: Encode ['Red', 'Green', 'Blue', 'Red', 'Blue'] using both Label and One-Hot encoding

import pandas as pd
from sklearn.preprocessing import LabelEncoder
colors = ['Red', 'Green', 'Blue', 'Red', 'Blue']
le = LabelEncoder()
label_encoded = le.fit_transform(colors)
print("Label:", label_encoded)  # [2, 1, 0, 2, 0]

df = pd.DataFrame({'Color': colors})
one_hot = pd.get_dummies(df, columns=['Color'])
print("One-Hot:\n", one_hot)

Exercise 3.3: Why should you fit the scaler ONLY on training data?

If you fit the scaler on the entire dataset (including test data), you introduce data leakage. The scaler would learn the mean/std of the test set, which shouldn't be available during training. This leads to overly optimistic performance estimates. Always: scaler.fit(X_train), then scaler.transform(X_test).
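
The fit-on-train, transform-on-test pattern can be sketched as follows. The breast cancer dataset is only a stand-in chosen for this example; any numeric dataset would do.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics come from training rows only
X_test_s = scaler.transform(X_test)        # test rows reuse those statistics

# Training features are centered exactly; test features only approximately
print("max |mean| on train:", np.abs(X_train_s.mean(axis=0)).max())
print("max |mean| on test: ", np.abs(X_test_s.mean(axis=0)).max())
```

The test-set means are not exactly zero, which is expected: the scaler never saw those rows, just as a deployed model would never see future data.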

Chapter Summary

Handle missing data using dropping, mean/median/mode imputation, or SimpleImputer
Use Label Encoding for ordinal categories, One-Hot Encoding for nominal categories
Scale features with StandardScaler or MinMaxScaler to normalize ranges
Always split data before scaling to prevent data leakage

Chapter 4

Linear Regression

Learning Objectives

Understand the math behind linear regression — cost function and gradient descent
Implement linear regression from scratch in Python
Use scikit-learn for efficient linear regression
Build a house price prediction project

The Idea

Linear regression fits a straight line (or hyperplane) through data to predict a continuous target variable. It finds the best relationship between input features (X) and output (y).

ŷ = w₁x₁ + w₂x₂ + ... + wₙxₙ + b (or simply: ŷ = wX + b)

Where w (weights) and b (bias) are learned from data. The goal is to find values of w and b that minimize the error between predicted ŷ and actual y.

Cost Function: Mean Squared Error

MSE = (1/n) × Σ(yᵢ - ŷᵢ)² — we want to minimize this

Gradient Descent

Gradient descent is an optimization algorithm that iteratively adjusts w and b to minimize the cost function by moving in the direction of the steepest descent.

w = w - α × ∂MSE/∂w

b = b - α × ∂MSE/∂b

(α = learning rate)
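
For reference, the two partial derivatives in the update rule can be written out for the single-feature case. This is a standard derivation from the MSE definition above, shown as a sketch; the indexed notation (x_i, y_i) is introduced here for clarity.

```latex
\begin{aligned}
\hat{y}_i &= w x_i + b, \qquad
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\frac{\partial\,\mathrm{MSE}}{\partial w}
  &= -\frac{2}{n} \sum_{i=1}^{n} x_i\,(y_i - \hat{y}_i)
   = \frac{2}{n} \sum_{i=1}^{n} x_i\,(\hat{y}_i - y_i) \\
\frac{\partial\,\mathrm{MSE}}{\partial b}
  &= -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)
   = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)
\end{aligned}
```

Implementations often drop the constant factor 2, since it only rescales the learning rate α.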

Implementation from Scratch

Python
import numpy as np

class LinearRegressionScratch:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iter):
            y_pred = np.dot(X, self.weights) + self.bias

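            # Compute gradients (the constant factor 2 from differentiating the
            # squared error is omitted; it is absorbed into the learning rate)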
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)

            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

            # Track cost
            cost = np.mean((y_pred - y) ** 2)
            self.cost_history.append(cost)

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

# Usage
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.flatten() + np.random.randn(100)

model = LinearRegressionScratch(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)
print(f"Weight: {model.weights[0]:.3f}, Bias: {model.bias:.3f}")
# Should be close to Weight: 3.0, Bias: 4.0

Using Scikit-Learn

Python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_:.3f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.3f}")
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")

Project: House Price Prediction

Python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Train model
model = LinearRegression()
model.fit(X_train_s, y_train)

# Evaluate
y_pred = model.predict(X_test_s)
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)) * 100000:.0f}")
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")

# Feature importance
for name, coef in zip(housing.feature_names, model.coef_):
    print(f"  {name}: {coef:.4f}")

Exercises

Exercise 4.1: What happens if the learning rate is too high or too low?

Too high: Gradient descent overshoots the minimum — the cost oscillates or diverges to infinity. The model never converges.

Too low: Convergence is extremely slow — training takes thousands of extra iterations. You might run out of patience and stop before reaching the optimal solution.

Best practice: Start with 0.01, try 0.001, 0.1, and use learning rate schedules that decrease over time.
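
The effect of the learning rate can be seen in a small experiment. The sketch below is an assumption-laden toy: a noise-free line y = 4 + 3x, 100 iterations, and three hand-picked rates; `final_cost` is a helper written for this example only.

```python
import numpy as np

def final_cost(learning_rate, n_iterations=100):
    # Noise-free line y = 4 + 3x, so the minimum achievable MSE is 0
    x = np.linspace(0, 2, 50)
    y = 4 + 3 * x
    w, b = 0.0, 0.0
    for _ in range(n_iterations):
        error = (w * x + b) - y
        w -= learning_rate * 2 * np.mean(error * x)  # dMSE/dw
        b -= learning_rate * 2 * np.mean(error)      # dMSE/db
    return np.mean(((w * x + b) - y) ** 2)

start_cost = final_cost(0.1, n_iterations=0)  # cost before any update
cost_low = final_cost(0.0001)                 # too low: barely moves
cost_good = final_cost(0.1)                   # reasonable: approaches 0
cost_high = final_cost(1.0)                   # too high: diverges

results = [("start", start_cost), ("lr=0.0001", cost_low),
           ("lr=0.1", cost_good), ("lr=1.0", cost_high)]
for name, cost in results:
    print(f"{name:>10}: MSE = {cost:.3g}")
```

With the rate too low the cost stays close to its starting value, and with the rate too high it ends far above where it started.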

Exercise 4.2: Implement linear regression from scratch for y = 5x + 3 with noise

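import numpy as np

# Reuses the LinearRegressionScratch class defined earlier in this chapter
np.random.seed(0)
X = np.random.rand(200, 1) * 10
y = 5 * X.flatten() + 3 + np.random.randn(200) * 2

model = LinearRegressionScratch(learning_rate=0.01, n_iterations=2000)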
model.fit(X, y)
print(f"Learned: y = {model.weights[0]:.2f}x + {model.bias:.2f}")
# Expected: y ≈ 5.00x + 3.00

Exercise 4.3: What is R² score? What does R²=0.85 mean?

R² (Coefficient of Determination) measures how well the model explains the variance in the target variable.

R² = 1 - (SS_res / SS_tot) where SS_res is the sum of squared residuals and SS_tot is the total sum of squares.

R² = 0.85 means the model explains 85% of the variance in the target. The remaining 15% is not captured by the model; it may be noise or structure the model is missing.

R² = 1.0 → perfect prediction, R² = 0 → model predicts just the mean, R² < 0 → worse than predicting mean.
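
The R² formula can be verified on a tiny example. The four target values and predictions below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R²:  {r2_manual:.4f}")
print(f"sklearn R²: {r2_score(y_true, y_pred):.4f}")

# Predicting the mean everywhere gives R² = 0
baseline = np.full_like(y_true, y_true.mean())
print(f"Mean baseline R²: {r2_score(y_true, baseline):.4f}")
```

Here SS_res = 0.75 and SS_tot = 20, so R² = 1 - 0.75/20 = 0.9625.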

Chapter Summary

Linear regression minimizes MSE to find the best-fit line: ŷ = wX + b
Gradient descent iteratively updates weights to minimize cost
R² score measures how well the model explains variance (higher is better)
Feature scaling is crucial for gradient descent convergence

Chapter 5

Classification: Logistic Regression

Learning Objectives

Understand logistic regression and the sigmoid function
Learn binary cross-entropy loss and decision boundaries
Implement logistic regression with scikit-learn
Build an email spam classifier

From Regression to Classification

Logistic regression predicts class probabilities. It passes the same linear combination used in linear regression (z = wX + b) through the sigmoid function to produce values between 0 and 1.

σ(z) = 1 / (1 + e⁻ᶻ) where z = wX + b → output ∈ (0, 1)

If σ(z) ≥ 0.5, predict class 1. Otherwise, predict class 0. The 0.5 threshold defines the decision boundary.

Binary Cross-Entropy Loss

Loss = -(1/n) × Σ [yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ)]

Python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Sigmoid outputs for different inputs
z_values = np.array([-5, -2, 0, 2, 5])
print(sigmoid(z_values))
# [0.007, 0.119, 0.500, 0.881, 0.993]
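
The binary cross-entropy formula above can be computed directly and compared with scikit-learn's `log_loss`. The labels and predicted probabilities below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4])  # predicted P(class 1)

# Loss = -(1/n) * sum(y*log(p) + (1-y)*log(1-p))
bce_manual = -np.mean(
    y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(f"Manual BCE:  {bce_manual:.4f}")
print(f"sklearn BCE: {log_loss(y_true, y_prob):.4f}")

# Per-sample loss when the true class is 1:
# a hesitant miss (p=0.4) costs far less than a confident miss (p=0.01)
hesitant, confident_wrong = -np.log(0.4), -np.log(0.01)
print(f"p=0.4: {hesitant:.3f}, p=0.01: {confident_wrong:.3f}")
```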

Scikit-Learn Implementation

Python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of class 1

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Project: Email Spam Classifier

Python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Sample email data
emails = [
    "Win a free iPhone now! Click here!",
    "Meeting scheduled for tomorrow at 3pm",
    "Congratulations! You won $1000000",
    "Please review the attached report",
    "FREE OFFER! Buy one get one free!!!",
    "Can we reschedule the project discussion?",
    "URGENT: Your account will be suspended",
    "Hey, are you coming to lunch today?",
    "Claim your prize before it expires!!!",
    "The quarterly report is ready for review",
]
labels = [1,0,1,0,1,0,1,0,1,0]  # 1=spam, 0=ham

# Convert text to numerical features using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(emails)
y = np.array(labels)

# Train classifier
model = LogisticRegression()
model.fit(X, y)

# Test on new emails
new_emails = [
    "You are selected for a cash prize!",
    "Let's finalize the budget this week"
]
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)

for email, pred in zip(new_emails, predictions):
    label = "SPAM" if pred == 1 else "HAM"
    print(f"[{label}] {email}")

Exercises

Exercise 5.1: What is the sigmoid output for z = 0? z = 10? z = -10?

σ(0) = 0.5 (exactly at the decision boundary)

σ(10) ≈ 0.99995 (very confident class 1)

σ(-10) ≈ 0.00005 (very confident class 0)

Exercise 5.2: Explain the difference between accuracy and precision

Accuracy = (TP + TN) / Total: overall correctness. It can be misleading with imbalanced classes: if 99% of samples are negative, a model that always predicts negative reaches 99% accuracy without finding a single positive.

Precision = TP / (TP + FP) — of all positive predictions, how many were correct? Important when false positives are costly (e.g., spam filter marking important email as spam).
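
A quick sketch of why accuracy alone can mislead: the data below is synthetic (1% positives), and the "model" is a stand-in that always predicts the negative class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1000 samples, 1% positive; the "model" always predicts negative
y_true = np.array([1] * 10 + [0] * 990)
y_always_negative = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_negative)
# zero_division=0 reports 0 when there are no positive predictions at all
prec = precision_score(y_true, y_always_negative, zero_division=0)
rec = recall_score(y_true, y_always_negative)
print(f"Accuracy: {acc:.2f}, Precision: {prec:.2f}, Recall: {rec:.2f}")
```

Accuracy comes out at 0.99 while precision and recall are both 0, because not one positive case was found.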

Exercise 5.3: When would you change the decision threshold from 0.5?

Lower threshold (e.g., 0.3): When missing a positive is very costly — medical diagnosis (prefer false alarms over missed diseases).

Higher threshold (e.g., 0.8): When false positives are costly — fraud detection (don't block legitimate transactions unnecessarily).
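
Moving the threshold can be sketched by comparing `predict_proba` output with a cutoff of your choice instead of calling `predict`. The synthetic imbalanced dataset and the three thresholds are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of class 1

recalls = {}
for threshold in (0.3, 0.5, 0.8):
    # Apply a custom cutoff instead of the default 0.5
    y_pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    print(f"threshold={threshold}: precision={precision:.2f}, "
          f"recall={recalls[threshold]:.2f}")
```

Lowering the threshold can only add positive predictions, so recall never decreases as the threshold drops; precision usually moves the other way.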

Chapter Summary

Logistic regression uses the sigmoid function to produce probabilities for binary classification
Binary cross-entropy is the loss function optimized during training
Confusion matrix, precision, recall, and F1-score provide comprehensive evaluation
TF-IDF converts text to numerical features for NLP classification tasks

Chapter 6

Decision Trees & Ensemble Methods

Learning Objectives

Understand decision tree construction using Gini impurity and information gain
Learn ensemble methods: Bagging, Random Forest, and Gradient Boosting
Implement tree-based models with scikit-learn
Build a customer churn prediction project

Decision Trees

A decision tree splits data recursively based on feature thresholds, creating a tree structure of if-else decisions. It chooses splits that maximize information gain (or minimize Gini impurity).

Gini Impurity = 1 - Σ(pᵢ²) where pᵢ = probability of class i

Entropy = -Σ pᵢ log₂(pᵢ)

Information Gain = Entropy(parent) - Weighted Avg Entropy(children)
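
The impurity formulas can be worked through on a toy node. The helper functions `gini`, `entropy`, and `information_gain` below are written for this sketch and are not scikit-learn APIs.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Children's entropy is weighted by their share of the parent's samples
    n = len(parent)
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 50/50 class mix
print("Gini(parent):   ", gini(parent))
print("Entropy(parent):", entropy(parent))

# A perfect split separates the classes; a weak split leaves them mixed
gain_perfect = information_gain(parent, parent[:4], parent[4:])
gain_weak = information_gain(
    parent, np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1]))
print(f"Gain (perfect split): {gain_perfect:.4f}")
print(f"Gain (weak split):    {gain_weak:.4f}")
```

The 50/50 parent has Gini 0.5 and entropy 1 bit. The perfect split recovers the full bit, while the 3-to-1 split gains only about 0.19 bits, which is why the tree prefers the former.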

Python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(f"Accuracy: {tree.score(X_test, y_test):.3f}")
print(export_text(tree, feature_names=iris.feature_names))

Random Forest (Bagging)

Random Forest trains multiple decision trees on random subsets of data and features, then combines their predictions (majority voting for classification, averaging for regression). This reduces overfitting.

Python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
print(f"Random Forest Accuracy: {rf.score(X_test, y_test):.3f}")

# Feature importance
for name, imp in zip(iris.feature_names, rf.feature_importances_):
    print(f"  {name}: {imp:.4f}")

Gradient Boosting & XGBoost

Gradient Boosting builds trees sequentially — each tree corrects errors of the previous one. XGBoost is an optimized, high-performance implementation.

Python
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X_train, y_train)
print(f"Gradient Boosting Accuracy: {gb.score(X_test, y_test):.3f}")

Method	How it works	Strength
Decision Tree	Single tree, recursive splits	Interpretable, fast
Random Forest	Many trees in parallel (bagging)	Reduces overfitting
Gradient Boosting	Trees built sequentially (boosting)	Often the highest accuracy on tabular data

Project: Customer Churn Prediction

Python
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Simulated telecom churn dataset
np.random.seed(42)
n = 1000
data = pd.DataFrame({
    'tenure': np.random.randint(1, 72, n),
    'monthly_charge': np.random.uniform(20, 120, n).round(2),
    'total_charges': np.random.uniform(100, 8000, n).round(2),
    'contract_type': np.random.choice([0,1,2], n),  # month/year/2year
    'support_tickets': np.random.randint(0, 10, n),
})
# Churn likelihood: short tenure + high charges + many tickets
churn_prob = (1 / (1 + np.exp(-(
    -0.05 * data['tenure'] +
    0.02 * data['monthly_charge'] +
    0.3 * data['support_tickets'] +
    -1.0 * data['contract_type'] - 1
))))
data['churned'] = (np.random.random(n) < churn_prob).astype(int)

X = data.drop('churned', axis=1)
y = data['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
rf.fit(X_train, y_train)

print(f"Accuracy: {rf.score(X_test, y_test):.3f}")
print(classification_report(y_test, rf.predict(X_test)))
print("Top churn predictors:")
for name, imp in sorted(zip(X.columns, rf.feature_importances_), key=lambda x: -x[1]):
    print(f"  {name}: {imp:.4f}")

Exercises

Exercise 6.1: Why do decision trees tend to overfit? How does Random Forest solve this?

Decision trees grow deep to fit every training sample perfectly, memorizing noise. They have high variance.

Random Forest reduces variance by: (1) training each tree on a random bootstrap sample, (2) considering only a random subset of features at each split, (3) averaging predictions across many trees. This ensemble approach smooths out the noise of individual trees.

Exercise 6.2: What is the difference between Bagging and Boosting?

Bagging (Bootstrap Aggregating): Train trees independently in parallel on random subsets, then average/vote. Reduces variance. Example: Random Forest.

Boosting: Train trees sequentially — each tree focuses on correcting errors of the previous one. Reduces bias. Example: Gradient Boosting, XGBoost, AdaBoost.

Exercise 6.3: Explain feature importance in Random Forest

Feature importance measures how much each feature contributes to reducing impurity across all trees. Calculated as the average decrease in Gini impurity when a feature is used for splitting, weighted by the number of samples it affects. Higher importance = feature is more predictive.

Chapter Summary

Decision trees split data using Gini impurity or information gain
Random Forest (bagging) reduces overfitting by averaging many trees
Gradient Boosting builds trees sequentially for higher accuracy
Feature importance reveals which variables drive predictions

Chapter 7

Support Vector Machines

Learning Objectives

Understand the concept of maximum margin hyperplanes
Learn the kernel trick for non-linear classification
Implement SVM with different kernels
Apply SVM to image classification

The Intuition

SVM finds the optimal hyperplane that separates classes with the maximum margin — the largest gap between the closest data points (support vectors) of each class. Wider margin = better generalization.

Kernel Trick

When data isn't linearly separable, the kernel trick maps data to a higher dimension where it becomes separable, without actually computing the transformation.

Kernel	Use Case	Formula
Linear	Linearly separable data	K(x,y) = x · y
RBF (Gaussian)	Complex, non-linear boundaries	K(x,y) = exp(-γ||x-y||²)
Polynomial	Polynomial decision boundaries	K(x,y) = (x · y + r)^d
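
The kernel formulas in the table can be evaluated by hand for a single pair of points and compared with scikit-learn's pairwise kernel functions. The points and the values of γ, r, and d are arbitrary choices for this sketch. Note that scikit-learn's polynomial kernel also multiplies the dot product by γ, so γ is set to 1 in that call to match the table.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel)

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
gamma, r, d = 0.5, 1.0, 2

# Formulas from the table, computed directly
k_linear = x @ y
k_rbf = np.exp(-gamma * np.sum((x - y) ** 2))
k_poly = (x @ y + r) ** d

# scikit-learn expects 2-D inputs: one row per sample
X2, Y2 = x.reshape(1, -1), y.reshape(1, -1)
sk_linear = linear_kernel(X2, Y2)[0, 0]
sk_rbf = rbf_kernel(X2, Y2, gamma=gamma)[0, 0]
sk_poly = polynomial_kernel(X2, Y2, degree=d, gamma=1.0, coef0=r)[0, 0]

print(f"linear: {k_linear:.4f} vs {sk_linear:.4f}")
print(f"rbf:    {k_rbf:.4f} vs {sk_rbf:.4f}")
print(f"poly:   {k_poly:.4f} vs {sk_poly:.4f}")
```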

Python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Non-linear data
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Compare kernels
for kernel in ['linear', 'rbf', 'poly']:
    svm = SVC(kernel=kernel, C=1.0, gamma='scale')
    svm.fit(X_train_s, y_train)
    acc = svm.score(X_test_s, y_test)
    print(f"{kernel:>8} kernel → Accuracy: {acc:.3f}")

Key Hyperparameters

C (Regularization): High C = strict margin (may overfit), Low C = wide margin (may underfit).

gamma: High gamma = complex boundary (close influence), Low gamma = smooth boundary (wide influence).
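
The effect of gamma can be sketched by fitting the same RBF SVM with three values. The noisy two-moons data and the particular gamma values are assumptions chosen for illustration; exact scores will vary with the data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

train_acc = {}
for gamma in (0.01, 1.0, 100.0):
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X_train, y_train)
    train_acc[gamma] = svm.score(X_train, y_train)
    test_acc = svm.score(X_test, y_test)
    n_sv = svm.n_support_.sum()  # support vectors across both classes
    print(f"gamma={gamma:>6}: train={train_acc[gamma]:.3f}, "
          f"test={test_acc:.3f}, support vectors={n_sv}")
```

A very high gamma lets the boundary bend around individual training points, so training accuracy rises; a gap between training and test accuracy is the usual sign of overfitting.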

Project: Handwritten Digit Classification

Python
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

svm = SVC(kernel='rbf', C=10, gamma=0.001)
svm.fit(X_train_s, y_train)

y_pred = svm.predict(X_test_s)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred))

Exercises

Exercise 7.1: Why is feature scaling critical for SVMs?

SVM calculates distances between data points to find the margin. If one feature has range [0, 100000] and another [0, 1], the large-scale feature dominates the distance calculation. Scaling ensures all features contribute equally to the decision boundary.
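The following sketch shows the effect on three hypothetical samples (an income in dollars and a fraction between 0 and 1). The data and the helper `distances_from_first` are made up for the demonstration, and the standardization is written out with NumPy rather than `StandardScaler` so every step is visible.

```python
import numpy as np

# Hypothetical samples: [income in dollars, fraction between 0 and 1]
X = np.array([
    [50_000.0, 0.10],
    [50_500.0, 0.90],   # similar income, very different second feature
    [90_000.0, 0.10],   # very different income, identical second feature
])

def distances_from_first(X):
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(X[1:] - X[0], axis=1)

raw = distances_from_first(X)

# Standardize each column to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = distances_from_first(X_scaled)

print("Raw distances:   ", raw.round(2))
print("Scaled distances:", scaled.round(2))
```

Before scaling, the second feature is practically invisible in the distance. After scaling, a large change in either feature produces a comparable distance.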

Exercise 7.2: When would you choose RBF kernel over Linear?

Use Linear when: data is linearly separable, you have many features relative to samples (text classification), or you need interpretability.

Use RBF when: the decision boundary is non-linear (moon/circle patterns), you have a moderate number of features, and accuracy matters more than interpretability.

Exercise 7.3: What are support vectors and why are they important?

Support vectors are the training points closest to the decision boundary (hyperplane), which makes them the hardest examples to classify. The fitted SVM is determined entirely by these points: removing any non-support vector from the training set and retraining gives the same model. As a result, only the support vectors need to be kept for prediction, which keeps the model compact.

Chapter Summary

SVM finds the maximum margin hyperplane for classification
Kernel trick enables non-linear classification without explicit transformation
RBF kernel works well for most non-linear problems
Feature scaling has a large effect on SVM performance and should be part of the pipeline

Chapter 8

Unsupervised Learning

Learning Objectives

Understand K-Means and hierarchical clustering algorithms
Use the elbow method and silhouette score to choose k
Apply PCA for dimensionality reduction
Build a customer segmentation project

K-Means Clustering

K-Means groups data into k clusters by iteratively: (1) assigning each point to the nearest centroid, (2) recalculating centroids as the mean of assigned points.
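The two steps can be written out directly in NumPy. This is a minimal sketch of the plain algorithm with random initialization, not a replacement for scikit-learn's implementation; the function name `kmeans`, the empty-cluster handling, and the synthetic blobs are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans(X, k=2)
print(centroids.round(2))
```

On data this clean the loop settles within a few iterations. On harder data the result depends on the starting centroids, which is why library implementations rerun the algorithm several times.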

Python
from sklearn.cluster import KMeans
import numpy as np

# Generate sample data
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

print(f"Cluster centers:\n{kmeans.cluster_centers_}")
print(f"Inertia: {kmeans.inertia_:.2f}")

Choosing k: Elbow Method

Python
inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

# Plot the elbow curve
import matplotlib.pyplot as plt
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Look for the "elbow" where the curve bends

PCA: Dimensionality Reduction

Principal Component Analysis projects high-dimensional data onto fewer dimensions while preserving maximum variance.
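One way to see what "preserving maximum variance" means is to compute the projection by hand. The sketch below does this with a singular value decomposition of the centered data; the function name `pca` and the synthetic dataset are illustrative assumptions.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = S**2 / (len(X) - 1)
    ratio = explained_variance / explained_variance.sum()
    return X_centered @ Vt[:n_components].T, ratio

# Synthetic 3-D data that mostly varies along one direction
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t, 0.5 * t]) + rng.normal(scale=0.1, size=(200, 3))

X_1d, ratio = pca(X, n_components=1)
print("Reduced shape:", X_1d.shape)
print("Explained variance ratio:", ratio.round(3))
```

Because the three columns are nearly multiples of one another, a single component captures almost all of the variance, and the remaining two mostly describe noise.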

Python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data  # 4 features

# Reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(f"Original shape: {X.shape}")      # (150, 4)
print(f"Reduced shape: {X_2d.shape}")     # (150, 2)
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.2%}")

Project: Customer Segmentation

Python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Simulated customer data
np.random.seed(42)
customers = pd.DataFrame({
    'annual_income': np.concatenate([
        np.random.normal(30000, 5000, 100),
        np.random.normal(70000, 10000, 100),
        np.random.normal(120000, 15000, 100)
    ]),
    'spending_score': np.concatenate([
        np.random.normal(20, 10, 100),
        np.random.normal(60, 15, 100),
        np.random.normal(80, 8, 100)
    ])
})

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)

# Cluster with k=3, matching the three simulated groups
# (on real data, choose k with the elbow method or silhouette score)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
customers['segment'] = kmeans.fit_predict(X_scaled)

# Analyze segments
print(customers.groupby('segment').mean().round(0))
# Cluster IDs are arbitrary, so read the printed means to see which is which:
# Budget-conscious: low income, low spending
# Middle market: medium income, medium spending
# Premium: high income, high spending

Exercises

Exercise 8.1: K-Means found 3 clusters but you expected 5. What could be wrong?

Possible reasons: (1) Two clusters may overlap and K-Means merges them. (2) The elbow method might suggest 3 is optimal. (3) Features may need better scaling. (4) The data genuinely has 3 natural groupings. Try: silhouette analysis, different initializations, or DBSCAN which doesn't need k.

Exercise 8.2: How much variance should PCA retain?

Common practice: retain 95% of variance. Use PCA(n_components=0.95) to automatically select the number of components. For visualization, 2-3 components are used even if they retain less variance.
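The selection rule behind a variance target can be sketched with a cumulative sum. The ratios below are hypothetical values chosen for illustration, not the output of a real dataset.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted from largest to smallest
ratios = np.array([0.60, 0.25, 0.08, 0.04, 0.02, 0.01])

cumulative = np.cumsum(ratios)
# First index where the running total reaches the target, plus one for a count
n_components = int(np.argmax(cumulative >= 0.95) + 1)

print(cumulative.round(2))
print("Components needed for 95%:", n_components)
```

Here three components reach only 93%, so a fourth is needed to pass the 95% threshold.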

Exercise 8.3: What are limitations of K-Means?

(1) Must specify k in advance. (2) Assumes spherical clusters of equal size. (3) Sensitive to initialization — use k-means++. (4) Sensitive to outliers. (5) Doesn't work well with non-convex shapes. Alternatives: DBSCAN (no k needed, finds arbitrary shapes), Hierarchical clustering.

Chapter Summary

K-Means clusters data by iteratively updating centroids
Elbow method and silhouette score help choose the number of clusters
PCA reduces dimensions while preserving maximum variance
Customer segmentation is a classic unsupervised learning application

Chapter 9

Neural Networks Fundamentals

Learning Objectives

Understand the perceptron and multi-layer neural networks
Learn activation functions: ReLU, Sigmoid, Softmax
Understand forward propagation and backpropagation
Build a neural network from scratch

The Perceptron

A perceptron is the simplest neural network — a single neuron that takes inputs, applies weights and bias, then passes through an activation function.

output = activation(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)
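As a concrete instance of this formula, the sketch below wires up a perceptron that computes logical AND. The weights and bias are picked by hand for illustration rather than learned, and a step function stands in for the activation.

```python
import numpy as np

def step(z):
    """Step activation: 1 if the weighted sum is non-negative, else 0."""
    return (z >= 0).astype(int)

def perceptron(X, w, b):
    return step(X @ w + b)

# Hand-picked weights that implement logical AND (an illustrative choice)
w = np.array([1.0, 1.0])
b = -1.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(perceptron(X, w, b))  # [0 0 0 1]
```

Only the input (1, 1) pushes the weighted sum above zero, so only that row produces a 1.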

Activation Functions

Function	Formula	Range	Use Case
Sigmoid	1/(1+e⁻ˣ)	(0, 1)	Binary output, output layer
ReLU	max(0, x)	[0, ∞)	Hidden layers (most common)
Tanh	(eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)	(-1, 1)	Hidden layers, zero-centered
Softmax	e^(xᵢ) / Σⱼ e^(xⱼ)	(0, 1), sum=1	Multi-class output
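Softmax is the one function in the table that works on a whole vector rather than a single number. A minimal NumPy sketch follows; subtracting the row maximum before exponentiating is a standard stability step that leaves the result unchanged, and the sample logits are made up.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])  # would overflow a naive exp
probs = softmax(logits)
print(probs.round(3))
print(probs.sum(axis=1))
```

Each row sums to 1, so the outputs can be read as class probabilities.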

Python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

Building a Neural Network from Scratch

Python
import numpy as np
# Relies on relu, relu_derivative, and sigmoid defined in the previous listing

class NeuralNetwork:
    def __init__(self, layer_sizes):
        """layer_sizes: e.g., [2, 4, 1] = 2 inputs, 4 hidden, 1 output"""
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * 0.5
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, X):
        self.activations = [X]
        self.z_values = []
        current = X
        for i in range(len(self.weights)):
            z = current @ self.weights[i] + self.biases[i]
            self.z_values.append(z)
            if i == len(self.weights) - 1:
                current = sigmoid(z)  # Output layer
            else:
                current = relu(z)     # Hidden layers
            self.activations.append(current)
        return current

    def backward(self, X, y, lr=0.01):
        m = X.shape[0]
        output = self.activations[-1]
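        # For a sigmoid output with binary cross-entropy loss, the gradient
        # with respect to the pre-activation simplifies to (output - y)
        delta = output - y.reshape(-1, 1)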

        for i in range(len(self.weights) - 1, -1, -1):
            dw = self.activations[i].T @ delta / m
            db = np.sum(delta, axis=0, keepdims=True) / m
            if i > 0:
                delta = (delta @ self.weights[i].T) * relu_derivative(self.z_values[i-1])
            self.weights[i] -= lr * dw
            self.biases[i] -= lr * db

    def train(self, X, y, epochs=1000, lr=0.1):
        for epoch in range(epochs):
            output = self.forward(X)
            self.backward(X, y, lr)
            if epoch % 200 == 0:
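                # MSE is printed for monitoring only; the weight updates
                # follow the cross-entropy gradient used in backward()
                loss = np.mean((output - y.reshape(-1,1))**2)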
                print(f"Epoch {epoch}, Loss: {loss:.4f}")

# XOR Problem — not linearly separable!
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0, 1, 1, 0])

np.random.seed(42)  # repeatable weight initialization; other seeds can behave differently
nn = NeuralNetwork([2, 8, 1])
nn.train(X, y, epochs=2000, lr=0.5)

predictions = nn.forward(X)
print("\nPredictions:")
for inp, pred in zip(X, predictions):
    print(f"  {inp} → {pred[0]:.4f} (expected {int(inp[0]) ^ int(inp[1])})")

Exercises

Exercise 9.1: Why can't a single perceptron solve XOR?

XOR is not linearly separable — no single straight line can separate the classes (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. A perceptron can only create linear decision boundaries. You need at least one hidden layer (2+ neurons) to create the non-linear boundary needed for XOR.
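To see that two hidden neurons really are enough, the sketch below builds an XOR network by hand instead of training it. The weights are one of many possible solutions and were picked for readability.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hand-picked weights (one possible solution, chosen for illustration)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])      # both hidden units start from x1 + x2
b1 = np.array([0.0, -1.0])       # the second unit only fires when both inputs are 1
W2 = np.array([1.0, -2.0])       # output = h1 - 2 * h2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
hidden = relu(X @ W1 + b1)
output = hidden @ W2
print(output)  # [0. 1. 1. 0.]
```

The first hidden unit counts how many inputs are on, and the second subtracts the case where both are on. That subtraction is what bends the boundary.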

Exercise 9.2: Why is ReLU preferred over Sigmoid in hidden layers?

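(1) Less vanishing gradient: The sigmoid's derivative is at most 0.25 and approaches zero for large positive or negative inputs, so gradients shrink as they pass back through many layers, making deep networks hard to train. ReLU's gradient is exactly 1 for positive inputs.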

(2) Computational speed: ReLU is just max(0,x) — much faster than computing exponentials.

(3) Sparsity: ReLU outputs zero for negative inputs, creating sparse representations that can improve efficiency.
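The gradient argument in point (1) can be checked numerically. The sketch below compares the two derivatives at a few inputs and computes an upper bound on what ten stacked sigmoids contribute to a gradient. Weights are ignored, so this is a simplification.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

def relu_grad(x):
    return (x > 0).astype(float)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("sigmoid':", sigmoid_grad(x).round(5))
print("relu':   ", relu_grad(x))

# Each sigmoid layer multiplies the gradient by at most 0.25
# (the peak of the sigmoid derivative, reached at x = 0).
layers = 10
bound = 0.25 ** layers
print(f"Upper bound after {layers} sigmoid layers: {bound:.2e}")
```

ReLU passes the gradient through unchanged wherever the unit is active. The flip side is that a unit whose input stays negative receives no gradient at all.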

Exercise 9.3: Modify the neural network to have 2 hidden layers: [2, 8, 4, 1]

nn = NeuralNetwork([2, 8, 4, 1])
nn.train(X, y, epochs=3000, lr=0.5)
predictions = nn.forward(X)
for inp, pred in zip(X, predictions):
    print(f"  {inp} → {pred[0]:.4f}")

Adding a second hidden layer gives the network more capacity to learn complex patterns.

Chapter Summary

Neural networks combine layers of neurons with non-linear activation functions
ReLU is the default hidden layer activation; Sigmoid/Softmax for output
Backpropagation computes gradients to update weights layer by layer
Even a simple network with one hidden layer can solve non-linear problems like XOR

Chapter 10

Deep Learning with TensorFlow & Keras

Learning Objectives

Build deep learning models with the Keras API
Understand CNNs for image recognition
Learn RNNs and LSTMs for sequential data
Build a digit recognition project with MNIST

Keras: High-Level Deep Learning API

Python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Simple sequential model
model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(784,)),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

Convolutional Neural Networks (CNNs)

CNNs use convolutional filters to automatically detect visual features — edges, textures, shapes — making them ideal for image recognition.
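What a single filter does can be shown without any deep learning library. The sketch below slides a hand-written vertical-edge filter over a tiny image. The helper `conv2d_valid`, the image, and the filter values are illustrative; in a real CNN the filter values are learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation,
    which is the operation deep learning libraries call convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# 5x5 image: dark on the left (0), bright on the right (1)
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-written vertical-edge filter
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The output is large where the window straddles the dark-to-bright transition and zero where the window covers a uniform region. That response pattern is what "detecting an edge" means.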

Python
cnn_model = keras.Sequential([
    # Convolutional layers — feature extraction
    layers.Conv2D(32, (3,3), activation='relu', input_shape=(28,28,1)),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (3,3), activation='relu'),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (3,3), activation='relu'),

    # Dense layers — classification
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

Recurrent Neural Networks (RNNs & LSTMs)

RNNs process sequential data (text, time series) by maintaining a hidden state that is updated at every step. LSTMs mitigate the vanishing gradient problem that affects plain RNNs by using gates that control what information is kept, updated, and output.
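The hidden-state idea fits in a few lines of NumPy. The sketch below runs the recurrence of a plain (non-LSTM) RNN cell over a short random sequence. The dimensions and parameter names are assumptions for the example, and the weights are untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 4, 5

# Randomly initialized parameters (untrained, for illustration only)
W_x = rng.normal(scale=0.5, size=(input_dim, hidden_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

sequence = rng.normal(size=(seq_len, input_dim))
h = np.zeros(hidden_dim)  # initial hidden state

for t, x_t in enumerate(sequence):
    # The new state depends on the current input AND the previous state
    h = np.tanh(x_t @ W_x + h @ W_h + b)
    print(f"t={t}: h = {h.round(3)}")
```

The same weights are reused at every time step, and the final `h` summarizes the whole sequence. An LSTM replaces the single `tanh` update with gated updates to a separate cell state.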

Python
# LSTM for sequence classification
lstm_model = keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=128),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32),
    layers.Dense(1, activation='sigmoid')
])
lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Project: MNIST Digit Recognition with CNN

Python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load MNIST
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()

# Preprocess
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
X_test = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Build CNN
model = keras.Sequential([
    layers.Conv2D(32, (3,3), activation='relu', input_shape=(28,28,1)),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (3,3), activation='relu'),
    layers.MaxPooling2D((2,2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train
history = model.fit(X_train, y_train, epochs=5,
                    batch_size=64,
                    validation_split=0.1)

# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"\nTest Accuracy: {test_acc:.4f}")
# Typically around 99% accuracy; exact values vary between runs

Exercises

Exercise 10.1: What does Dropout do and why is it important?

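Dropout randomly sets a fraction of a layer's activations to zero during training. This discourages neurons from co-adapting and acts as regularization. Dropout(0.5) means each unit is dropped with probability 0.5 on every training step. At test time no units are dropped. Keras compensates during training by scaling the surviving activations up by 1/(1 - rate), so no rescaling is needed at inference. In practice dropout often reduces overfitting noticeably.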
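The sketch below implements inverted dropout in NumPy, the variant in which surviving activations are rescaled during training so that nothing needs to change at inference. The function signature is an assumption for this example and is not the Keras API.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction of units and rescale the rest."""
    if not training or rate == 0.0:
        return x  # inference: pass activations through unchanged
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1, 10))

print("train:", dropout(activations, rate=0.5, training=True, rng=rng))
print("infer:", dropout(activations, rate=0.5, training=False, rng=rng))
```

During training each surviving unit is doubled (1 / 0.5), which keeps the expected sum of the layer's output the same as at inference.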

Exercise 10.2: Why use Conv2D + MaxPooling instead of just Dense layers for images?

(1) Parameter efficiency: A 28×28 image flattened = 784 inputs. A Dense layer with 128 neurons on top of that has about 100k parameters. Conv2D with 32 3×3 filters has only 320 parameters. (2) Spatial features: Convolution preserves spatial relationships such as edges, corners, and textures. Dense layers on flattened pixels lose this. (3) Translation handling: The same filter is applied at every position, so a feature is detected wherever it appears in the image, and pooling adds a degree of invariance to small shifts.
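The parameter counts in point (1) follow from simple arithmetic, reproduced below. The two helper functions are written for this example and assume one bias per unit or filter, which matches the Keras defaults.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    """Each filter has kernel_h * kernel_w * in_channels weights plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters

print(dense_params(28 * 28, 128))     # 100480
print(conv2d_params(3, 3, 1, 32))     # 320
```

The convolutional layer is so much smaller because its 3×3 weights are shared across every position in the image.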

Exercise 10.3: When would you use LSTM instead of a simple Dense network?

Use LSTM when data has temporal/sequential dependencies: (1) Time series forecasting (stock prices, weather). (2) Natural language processing (word order matters). (3) Speech recognition. (4) Video analysis (frame sequences). A Dense network has no built-in notion of order: each input position gets its own weights, and nothing carries information from one time step to the next.

Chapter Summary

Keras provides a high-level API for building deep learning models quickly
CNNs use convolution and pooling layers for image feature extraction
LSTMs handle sequential data with memory gates
Dropout is essential for preventing overfitting in deep networks

Chapter 11

Natural Language Processing

Learning Objectives

Understand text preprocessing: tokenization, stopwords, stemming
Learn TF-IDF and Bag of Words representations
Understand word embeddings (Word2Vec, GloVe)
Build a sentiment analysis project

Text Preprocessing Pipeline

Python
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def preprocess_text(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # 3. Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Example
raw = "This movie was AMAZING!!! I loved it 100%. Best film of 2024."
clean = preprocess_text(raw)
print(clean)  # "this movie was amazing i loved it best film of"

Bag of Words & TF-IDF

Python
corpus = [
    "machine learning is amazing",
    "deep learning is a subset of machine learning",
    "natural language processing uses machine learning"
]

# Bag of Words — simple word counts
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print("Vocabulary:", bow.get_feature_names_out())
print("BoW Matrix:\n", X_bow.toarray())

# TF-IDF — weights by importance
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print("\nTF-IDF Matrix:\n", X_tfidf.toarray().round(3))

TF-IDF Intuition

TF (Term Frequency): How often a word appears in a document. IDF (Inverse Document Frequency): How rare a word is across all documents. Common words like "is", "the" get low IDF. Rare, distinctive words get high IDF. TF-IDF = TF × IDF.
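The formula can be worked through by hand on the three-sentence corpus used above. This sketch uses the plain textbook definition, tf × log(N / df). scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

corpus = [
    "machine learning is amazing",
    "deep learning is a subset of machine learning",
    "natural language processing uses machine learning",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc))

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)          # relative term frequency
    idf = math.log(n_docs / df[word])        # textbook (unsmoothed) IDF
    return tf * idf

for word in ("learning", "is", "amazing"):
    print(word, [round(tf_idf(word, doc), 3) for doc in docs])
```

"learning" appears in every document, so its IDF is log(1) = 0 and it scores zero everywhere. "amazing" appears in only one document and gets the highest weight there.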

Word Embeddings

Word embeddings represent words as dense vectors where similar words have similar vectors. A well-known illustration is that vector arithmetic can capture analogies: king - man + woman lands near queen.
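The analogy can be reproduced with toy vectors. The 2-D "embeddings" below are invented so the arithmetic is easy to follow (one axis for royalty, one for gender). Real embeddings have hundreds of dimensions and are learned from text, and their analogies hold only approximately.

```python
import numpy as np

# Toy 2-D "embeddings" made up for illustration: [royalty, gender]
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vectors["king"] - vectors["man"] + vectors["woman"]

# Rank the remaining words by cosine similarity to the target vector
candidates = {w: v for w, v in vectors.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(candidates[w], target))
print(best, round(cosine(vectors["queen"], target), 3))
```

Cosine similarity compares direction rather than length, which is the usual way to measure closeness between embeddings.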

Python
# A trainable embedding layer in Keras
from tensorflow.keras.layers import Embedding

# Embedding layer: vocab_size → embedding_dim
# Learns a 128-dim vector for each of 10000 words during training.
# Pre-trained vectors (Word2Vec, GloVe) would have to be loaded separately.
embedding = Embedding(input_dim=10000, output_dim=128)

Project: Movie Review Sentiment Analysis

Python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Sample movie reviews
reviews = [
    "This movie was fantastic! Great acting and storyline.",
    "Terrible film. Waste of time and money.",
    "Loved every moment. A masterpiece of cinema.",
    "Boring and predictable. Would not recommend.",
    "An incredible journey with amazing performances.",
    "The worst movie I have ever seen. Awful.",
    "Brilliant storytelling and breathtaking visuals.",
    "Dull plot with terrible dialogue throughout.",
    "A beautiful and moving cinematic experience.",
    "Completely disappointing. Save your money.",
    "Outstanding performances by the entire cast.",
    "Unwatchable garbage from start to finish.",
    "One of the best films of the decade.",
    "Painfully slow and utterly forgettable.",
    "A thrilling and emotionally powerful movie.",
    "So bad I walked out of the theater.",
]
# 1=positive, 0=negative
sentiments = [1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0]

# TF-IDF features
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf.fit_transform(reviews)
y = np.array(sentiments)

# Train model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Only 4 of the 16 reviews end up in the test set, so this number is very noisy
print(f"Accuracy: {model.score(X_test, y_test):.2f}")

# Test on new reviews
new_reviews = [
    "This is an absolutely wonderful and amazing movie!",
    "I hated this film, it was terrible and boring.",
    "A decent movie with some good moments."
]
X_new = tfidf.transform(new_reviews)
predictions = model.predict(X_new)

for review, pred in zip(new_reviews, predictions):
    sentiment = "POSITIVE ✓" if pred == 1 else "NEGATIVE ✗"
    print(f"[{sentiment}] {review}")

Exercises

Exercise 11.1: What's the difference between Bag of Words and TF-IDF?

Bag of Words: Simply counts word occurrences. "the" appears 5 times → score 5. Problem: common words dominate.

TF-IDF: Weights word counts by how rare they are across documents. "the" appears everywhere → low IDF → low score. "amazing" appears rarely → high IDF → high score. TF-IDF better captures word importance.

Exercise 11.2: Why are word embeddings better than one-hot encoding?

(1) Dense vs. sparse: One-hot vectors are huge sparse vectors (10000-dim for 10000 words). Embeddings are dense 128-300 dim. (2) Semantic meaning: Embeddings capture meaning — similar words have similar vectors. One-hot treats every word as equally different. (3) Generalization: "excellent" and "wonderful" have similar embeddings, so the model can generalize from one to the other.

Exercise 11.3: What preprocessing steps would you add for tweets?

(1) Remove @mentions and #hashtags (or extract them as features). (2) Remove URLs. (3) Handle emojis (convert to text or use as features). (4) Expand contractions (can't → cannot). (5) Handle slang/abbreviations (lol, brb). (6) Remove or handle repeated characters ("sooooo" → "so").
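Several of these steps can be sketched with regular expressions. The function below is a minimal illustration: the name `preprocess_tweet`, the tiny contraction table, and the choice to keep hashtag words while dropping the `#` are all assumptions, and emoji and slang handling are left out.

```python
import re

CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}  # tiny sample

def preprocess_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)        # remove URLs
    text = re.sub(r"@\w+", " ", text)                # remove @mentions
    text = re.sub(r"#(\w+)", r"\1", text)            # keep hashtag word, drop '#'
    for short, full in CONTRACTIONS.items():         # expand a few contractions
        text = text.replace(short, full)
    text = re.sub(r"(.)\1{2,}", r"\1", text)         # "sooooo" -> "so"
    text = re.sub(r"[^a-z\s]", "", text)             # drop remaining symbols and digits
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_tweet("@bob I can't waaait for #MachineLearning!!! https://example.com"))
```

The order matters: URLs and mentions are removed before punctuation is stripped, otherwise their fragments would survive as ordinary words.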

Chapter Summary

Text preprocessing (lowercasing, cleaning, stopword removal) is essential before modeling
TF-IDF weights word importance better than raw counts
Word embeddings capture semantic meaning in dense vectors
Sentiment analysis combines text features with classification models

Chapter 12

Model Evaluation, Tuning & Deployment

Learning Objectives

Master cross-validation for robust model evaluation
Tune hyperparameters with Grid Search and Random Search
Understand the full confusion matrix and classification metrics
Deploy a model as a REST API with Flask

Cross-Validation

Instead of a single train-test split, k-fold cross-validation splits data into k parts, trains on k-1, tests on 1, and rotates. This gives a more reliable performance estimate.
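The rotation can be sketched without scikit-learn. The generator below is a simplified stand-in for what a k-fold splitter does: the name `k_fold_indices` is made up for this example, and it shuffles once and then cuts the indices into k nearly equal parts.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, k=5)):
    print(f"Fold {fold}: train={sorted(train_idx.tolist())} "
          f"test={sorted(test_idx.tolist())}")
```

Every sample lands in the test set exactly once and in the training set k-1 times, which is why the averaged score is less dependent on one lucky or unlucky split.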

Python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation
scores = cross_val_score(model, iris.data, iris.target, cv=5, scoring='accuracy')
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")

Hyperparameter Tuning

Python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid Search — exhaustive search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(iris.data, iris.target)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
best_model = grid_search.best_estimator_

Comprehensive Evaluation Metrics

Python
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

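from sklearn.model_selection import train_test_split
from sklearn.base import clone

# Hold out a test set and refit the tuned model on the training portion only.
# (The grid search above saw all of the data; for a strict estimate,
# run the search on X_train as well.)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target)
eval_model = clone(best_model).fit(X_train, y_train)
y_pred = eval_model.predict(X_test)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Rows are actual classes, columns are predicted classes (3x3 for Iris).
# For a binary problem the layout is:
#              Predicted
#            Neg    Pos
# Actual Neg [TN]   [FP]
# Actual Pos [FN]   [TP]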

print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred, average='weighted'):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred, average='weighted'):.3f}")

Metric Cheat Sheet

Accuracy: (TP+TN)/Total — overall correctness. Misleading with imbalanced data.

Precision: TP/(TP+FP) — "of predicted positives, how many are correct?" Use when FP is costly.

Recall: TP/(TP+FN) — "of actual positives, how many did we catch?" Use when FN is costly (disease detection).

F1 Score: 2×(P×R)/(P+R), the harmonic mean of precision and recall. A useful single-number summary when classes are imbalanced.
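These formulas are easy to check by plugging in counts. The sketch below uses hypothetical confusion-matrix counts for a fraud-style problem with 20 positives in 1,000 cases. The function name and the numbers are invented for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four cheat-sheet metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 1,000 transactions, 20 of them fraudulent
always_negative = classification_metrics(tp=0, fp=0, fn=20, tn=980)
some_model = classification_metrics(tp=15, fp=10, fn=5, tn=970)

print("Always 'non-fraud':", always_negative)
print("Hypothetical model:", some_model)
```

The always-negative predictor reaches 98% accuracy while catching no fraud at all. The second model is only slightly more accurate, yet its recall and F1 show that it is actually doing useful work.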

Overfitting vs. Underfitting

Problem	Symptom	Solution
Overfitting	High train accuracy, low test accuracy	More data, regularization, dropout, simpler model (use cross-validation to detect it)
Underfitting	Low train and test accuracy	More features, complex model, less regularization, more training

Project: Deploy ML Model with Flask

Python
# save_model.py — Train and save model
import pickle
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(iris.data, iris.target)

# Save model to file
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

print("Model saved!")

Python
# app.py — Flask API
from flask import Flask, request, jsonify
import pickle
import numpy as np

app = Flask(__name__)

# Load trained model
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

species = ['setosa', 'versicolor', 'virginica']

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    features = np.array(data['features']).reshape(1, -1)
    prediction = model.predict(features)[0]
    probability = model.predict_proba(features)[0].tolist()

    return jsonify({
        'prediction': species[prediction],
        'confidence': max(probability),
        'probabilities': dict(zip(species, probability))
    })

@app.route('/health')
def health():
    return jsonify({'status': 'healthy'})

if __name__ == '__main__':
    app.run(debug=True, port=5000)  # debug mode is for local development only

bash
# Test the API
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'

# Response:
# {"prediction": "setosa", "confidence": 1.0, "probabilities": {...}}

Exercises

Exercise 12.1: You have 95% accuracy on a fraud detection dataset where 98% are non-fraud. Is this good?

No. A naive model that always predicts "non-fraud" achieves 98% accuracy, so 95% is actually worse than that baseline. For imbalanced datasets, rely on Precision, Recall, F1-Score, and AUC-ROC instead of accuracy. Focus on recall (catching actual fraud) and precision (not flagging legitimate transactions).

Exercise 12.2: Grid Search is very slow. What alternatives exist?

RandomizedSearchCV: Samples random combinations instead of trying all. Often finds good results in a fraction of the time.

Bayesian Optimization (Optuna, Hyperopt): Uses past results to intelligently pick the next parameters to try. Much more efficient.

Halving Grid Search: Starts with all candidates on a small subset, eliminates poor performers, and allocates more resources to promising ones.
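The cost difference between exhaustive and random search comes down to counting. The sketch below enumerates the Random Forest grid from earlier in this chapter and draws a random subset of it using only the standard library. It illustrates the idea and does not reproduce how `RandomizedSearchCV` is implemented; the budget of 10 is an arbitrary choice.

```python
import itertools
import random

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
}

# Grid search: every combination
keys = list(param_grid)
all_combos = [dict(zip(keys, values))
              for values in itertools.product(*param_grid.values())]

# Random search: a fixed budget of combinations drawn without replacement
random.seed(42)
n_iter = 10
sampled = random.sample(all_combos, n_iter)

cv = 5
print(f"Grid search:   {len(all_combos)} combos x {cv} folds = {len(all_combos) * cv} fits")
print(f"Random search: {len(sampled)} combos x {cv} folds = {len(sampled) * cv} fits")
```

The grid grows multiplicatively with every added hyperparameter, while the random-search budget stays whatever you set it to. In scikit-learn that budget is the `n_iter` argument of `RandomizedSearchCV`.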

Exercise 12.3: What should you consider before deploying a model to production?

(1) Model versioning: Track which model version is live. (2) Monitoring: Track prediction accuracy, latency, data drift over time. (3) A/B testing: Compare new model vs. existing before full rollout. (4) Input validation: Handle invalid or adversarial inputs. (5) Scalability: Can it handle peak traffic? (6) Fallback: What happens if the model fails? (7) Fairness: Check for bias across demographic groups.
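As one example, input validation (point 4) for the Iris API could start with a check like the one below. The helper `validate_features` is hypothetical and not part of Flask, and it only covers structure and types. Range checks and handling of NaN or infinite values would still be needed.

```python
def validate_features(payload, n_features=4):
    """Return (features, error). Exactly one of the two is None."""
    if not isinstance(payload, dict) or "features" not in payload:
        return None, "Request body must be a JSON object with a 'features' key."
    features = payload["features"]
    if not isinstance(features, list) or len(features) != n_features:
        return None, f"'features' must be a list of {n_features} numbers."
    for value in features:
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None, "Every feature must be a number."
    return [float(v) for v in features], None

print(validate_features({"features": [5.1, 3.5, 1.4, 0.2]}))
print(validate_features({"features": [5.1, "oops", 1.4, 0.2]}))
print(validate_features(None))
```

In the Flask route, a non-None error would typically be returned to the client with a 400 status instead of being passed on to the model.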

Chapter Summary

Cross-validation gives reliable performance estimates across multiple data splits
Grid Search and Random Search automate the search for good hyperparameters
Precision, Recall, and F1 are more informative than accuracy for imbalanced datasets
Flask makes it easy to deploy ML models as REST APIs
Production deployment requires monitoring, versioning, and input validation

🎓 Congratulations!

You've completed the Machine Learning book. You now have the knowledge to build, evaluate, and deploy ML models. Keep practicing!