Introduction to Machine Learning
Learning Objectives
- Understand what Machine Learning is and why it matters
- Distinguish between Supervised, Unsupervised, and Reinforcement Learning
- Identify real-world ML applications across industries
- Understand the end-to-end ML workflow
What is Machine Learning?
Machine Learning (ML) is a branch of Artificial Intelligence that enables computers to learn patterns from data and make decisions without being explicitly programmed for every scenario. Instead of writing rules by hand, you feed the algorithm data and let it discover the rules itself.
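Arthur Samuel (1959) defined ML as the "field of study that gives computers the ability to learn without being explicitly programmed." Tom Mitchell (1997) provided a more formal definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."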
Types of Machine Learning
1. Supervised Learning
The algorithm learns from labeled data — each training example comes with the correct answer (label). The model learns a mapping from inputs to outputs.
- Classification: Predicting a category — spam vs. not spam, cat vs. dog
- Regression: Predicting a continuous value — house prices, temperature
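To make "learning a mapping from labeled examples" concrete, here is a minimal sketch of a 1-nearest-neighbor classifier in plain Python. The measurements and the `predict` helper are hypothetical and chosen only for illustration; real projects would normally use a library such as scikit-learn.

```python
import math

# Hypothetical labeled training data: (height_cm, weight_kg) -> label.
training_data = [
    ((25.0, 4.0), "cat"),
    ((30.0, 5.0), "cat"),
    ((60.0, 25.0), "dog"),
    ((70.0, 30.0), "dog"),
]

def predict(point):
    """Return the label of the closest training example (1-nearest neighbor)."""
    nearest = min(training_data, key=lambda pair: math.dist(pair[0], point))
    return nearest[1]

print(predict((28.0, 4.5)))   # cat
print(predict((65.0, 28.0)))  # dog
```

The labels in `training_data` are what make this supervised: the model can only answer "cat" or "dog" because every training example came with one of those answers.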
2. Unsupervised Learning
The algorithm works with unlabeled data and tries to find hidden patterns or groupings without knowing the correct answers.
- Clustering: Grouping similar customers together
- Dimensionality Reduction: Compressing data while keeping important features (PCA)
- Association: Finding items that frequently co-occur (market basket analysis)
3. Reinforcement Learning
An agent learns by interacting with an environment, receiving rewards for good actions and penalties for bad ones. Used in game AI, robotics, and self-driving cars.
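The reward-driven loop described above can be sketched with a multi-armed bandit, one of the simplest reinforcement learning settings. This is an illustrative sketch only: the function name `run_bandit`, the reward probabilities, and the epsilon-greedy strategy are assumptions made for the example, not a description of how any production system works.

```python
import random

def run_bandit(true_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a multi-armed bandit (a one-state environment)."""
    rng = random.Random(seed)
    counts = [0] * len(true_probs)    # how often each action was tried
    values = [0.0] * len(true_probs)  # running average reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(true_probs))  # explore a random action
        else:
            action = max(range(len(values)), key=values.__getitem__)  # exploit
        # The environment returns reward 1 with the action's hidden probability.
        reward = 1.0 if rng.random() < true_probs[action] else 0.0
        counts[action] += 1
        # Incremental update of the average reward for this action.
        values[action] += (reward - values[action]) / counts[action]
    return counts, values

counts, values = run_bandit([0.2, 0.5, 0.8])
print(counts)
print([round(v, 2) for v in values])
```

The agent is never told which action is best; it only sees rewards, and over time it concentrates its choices on the action with the highest estimated value.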
| Type | Data | Goal | Example |
|---|---|---|---|
| Supervised | Labeled | Predict output | Email spam detection |
| Unsupervised | Unlabeled | Find structure | Customer segmentation |
| Reinforcement | Rewards/Penalties | Maximize reward | AlphaGo, robotics |
Real-World Applications
Machine Learning powers many products and services you use daily:
- Healthcare: Disease diagnosis from X-rays, drug discovery, patient risk prediction
- Finance: Fraud detection, algorithmic trading, credit scoring
- E-commerce: Product recommendations (Amazon, Netflix), dynamic pricing
- Transportation: Self-driving cars (Tesla), route optimization (Google Maps)
- Language: Machine translation (Google Translate), voice assistants (Siri, Alexa)
- Social Media: Content recommendation, face recognition, sentiment analysis
The ML Workflow
Most ML projects follow a similar pipeline, though in practice the steps are iterative and earlier steps are often revisited:
- Define the Problem: What question are you trying to answer?
- Collect Data: Gather relevant, high-quality data
- Explore & Preprocess: Clean data, handle missing values, visualize patterns
- Feature Engineering: Select and transform the most informative features
- Train the Model: Choose an algorithm and fit it to the training data
- Evaluate: Test the model on unseen data using appropriate metrics
- Tune & Optimize: Adjust hyperparameters for better performance
- Deploy: Put the model into production and monitor its performance
Setting Up Your Environment
bash
# Install essential ML libraries
pip install numpy pandas matplotlib scikit-learn
pip install tensorflow keras
pip install seaborn jupyter
# Verify installation
python -c "import sklearn; print(sklearn.__version__)"
Exercises
Exercise 1.1: Classify each problem as Supervised, Unsupervised, or Reinforcement Learning
a) Predicting house prices from square footage → Supervised (Regression)
b) Grouping news articles by topic without labels → Unsupervised (Clustering)
c) Teaching a robot to walk → Reinforcement Learning
d) Detecting fraudulent credit card transactions → Supervised (Classification)
e) Reducing image features from 1000 to 50 → Unsupervised (Dimensionality Reduction)
Exercise 1.2: List 3 ML applications in your daily life and identify the type
Example answers:
- YouTube recommendations → Supervised Learning (predicting what you'll click)
- Google Photos grouping faces → Unsupervised Learning (clustering)
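- Siri learning your preferences → Reinforcement Learning (if it adapts from feedback on its own actions; production assistants typically combine several approaches)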
Exercise 1.3: Describe the 8 steps of the ML workflow for a movie recommendation system
1. Define: Recommend movies users will enjoy.
2. Collect: User ratings, watch history, movie metadata.
3. Explore: Analyze rating distributions, popular genres, viewing patterns.
4. Feature Engineering: User preferences, genre encoding, watch time features.
5. Train: Collaborative filtering or content-based model.
6. Evaluate: Measure with RMSE, precision@k on held-out data.
7. Tune: Optimize number of factors, learning rate.
8. Deploy: Serve recommendations in real-time via API.
Chapter Summary
- ML enables computers to learn from data rather than explicit programming
- Three main types: Supervised, Unsupervised, and Reinforcement Learning
- ML is used across healthcare, finance, e-commerce, transportation, and more
- Every ML project follows a systematic workflow from problem definition to deployment
Python Essentials for Machine Learning
Learning Objectives
- Master NumPy arrays and operations for numerical computing
- Use Pandas DataFrames for data manipulation and analysis
- Create informative visualizations with Matplotlib
- Understand vectorized operations for efficient computation
NumPy: Numerical Computing Foundation
NumPy is the backbone of scientific computing in Python. It provides high-performance multidimensional arrays and tools for working with them.
Python
import numpy as np
# Creating arrays
a = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Useful array generators
zeros = np.zeros((3, 4)) # 3x4 matrix of zeros
ones = np.ones((2, 3)) # 2x3 matrix of ones
rng = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
lin = np.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1.0]
rand = np.random.randn(3, 3) # 3x3 random normal values
# Array properties
print(matrix.shape) # (3, 3)
print(matrix.dtype) # int64 on most platforms
print(matrix.ndim) # 2
Vectorized Operations
Python
# Element-wise operations (much faster than loops)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b) # [5, 7, 9]
print(a * b) # [4, 10, 18]
print(a ** 2) # [1, 4, 9]
print(np.dot(a, b)) # 32 (dot product)
# Statistical operations
data = np.array([14, 23, 18, 29, 35, 22])
print(np.mean(data)) # 23.5
print(np.std(data)) # 6.90 (population standard deviation)
print(np.median(data)) # 22.5
# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B) # Matrix multiplication
print(np.linalg.inv(A)) # Matrix inverse
print(np.linalg.det(A)) # Determinant: -2.0
Pandas: Data Analysis Powerhouse
Python
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 30, 35, 28],
'Salary': [50000, 60000, 75000, 55000],
'Department': ['ML', 'Web', 'ML', 'Data']
})
# Exploring data
print(df.head()) # First 5 rows
df.info() # Column types, non-null counts (prints directly, returns None)
print(df.describe()) # Statistical summary
# Selecting & filtering
ml_team = df[df['Department'] == 'ML']
high_salary = df[df['Salary'] > 55000]
# Groupby operations
avg_salary = df.groupby('Department')['Salary'].mean()
# Handling missing data
df.dropna() # Returns a copy without rows containing NaN
df.fillna(0) # Returns a copy with NaN replaced by 0
df.fillna(df.mean(numeric_only=True)) # Replace NaN in numeric columns with the column mean
# Reading from CSV
df = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)
Matplotlib: Data Visualization
Python
import matplotlib.pyplot as plt
import numpy as np
# Line plot
x = np.linspace(0, 10, 100)
plt.figure(figsize=(10, 6))
plt.plot(x, np.sin(x), label='sin(x)', linewidth=2)
plt.plot(x, np.cos(x), label='cos(x)', linewidth=2)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Trigonometric Functions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Scatter plot
np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5
plt.scatter(x, y, alpha=0.7, c='#4f46e5')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Linear Relationship with Noise')
plt.show()
# Histogram
data = np.random.normal(100, 15, 1000)
plt.hist(data, bins=30, edgecolor='white', color='#7c3aed')
plt.title('Normal Distribution')
plt.show()
Exercises
Exercise 2.1: Create a NumPy array of 20 random integers between 1-100. Find the mean, max, min, and standard deviation.
import numpy as np
arr = np.random.randint(1, 101, size=20)
print(f"Array: {arr}")
print(f"Mean: {np.mean(arr):.2f}")
print(f"Max: {np.max(arr)}")
print(f"Min: {np.min(arr)}")
print(f"Std: {np.std(arr):.2f}")
Exercise 2.2: Create a Pandas DataFrame of 5 students with Name, Maths, Science, English scores. Add a Total and Percentage column.
import pandas as pd
df = pd.DataFrame({
'Name': ['Amit', 'Priya', 'Ravi', 'Sneha', 'Karan'],
'Maths': [85, 92, 78, 95, 88],
'Science': [90, 88, 82, 91, 76],
'English': [78, 95, 85, 89, 92]
})
df['Total'] = df[['Maths','Science','English']].sum(axis=1)
df['Percentage'] = (df['Total'] / 300 * 100).round(2)
print(df)
Exercise 2.3: Plot a bar chart showing the average score per subject from Exercise 2.2
import matplotlib.pyplot as plt
subjects = ['Maths', 'Science', 'English']
averages = [df[s].mean() for s in subjects]
plt.bar(subjects, averages, color=['#4f46e5','#10b981','#f59e0b'])
plt.ylabel('Average Score')
plt.title('Average Score by Subject')
plt.ylim(70, 100)
plt.show()
Chapter Summary
- NumPy provides fast, vectorized array operations essential for ML math
- Pandas simplifies data loading, cleaning, filtering, and groupby analysis
- Matplotlib enables line, scatter, bar, and histogram visualizations
- Vectorized operations are often 10-100x faster than equivalent Python loops on large arrays
Data Preprocessing & Feature Engineering
Learning Objectives
- Handle missing data with various imputation strategies
- Encode categorical variables using Label and One-Hot encoding
- Scale numerical features with StandardScaler and MinMaxScaler
- Perform feature selection and train-test splitting
Why Preprocessing Matters
Raw data is messy: it contains missing values, inconsistent formats, outliers, and mixed data types. Garbage in, garbage out. Preprocessing transforms raw data into a clean, ML-ready format. A commonly cited estimate is that data scientists spend 60-80% of their time on data preparation.
Handling Missing Data
Python
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Age': [25, np.nan, 35, 28, np.nan],
'Salary': [50000, 60000, np.nan, 55000, 70000],
'City': ['Delhi', 'Mumbai', 'Delhi', None, 'Bangalore']
})
# Check missing values
print(df.isnull().sum())
# Strategy 1: Drop rows with any NaN
df_dropped = df.dropna()
# Strategy 2: Fill with mean/median (numerical)
# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
# Strategy 3: Fill with mode (categorical)
df['City'] = df['City'].fillna(df['City'].mode()[0])
# Strategy 4: Scikit-learn SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
df[['Age','Salary']] = imputer.fit_transform(df[['Age','Salary']])
Encoding Categorical Variables
Python
from sklearn.preprocessing import LabelEncoder
# Label Encoding: maps each category to an integer in alphabetical order.
# LabelEncoder is intended for target labels; for ordinal features
# (low < medium < high), use OrdinalEncoder with an explicit category order.
le = LabelEncoder()
df['City_encoded'] = le.fit_transform(df['City'])
# Bangalore=0, Delhi=1, Mumbai=2
# One-Hot Encoding (for nominal data with no order)
df_encoded = pd.get_dummies(df, columns=['City'], drop_first=True)
# Creates: City_Delhi, City_Mumbai (Bangalore is the dropped baseline)
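To show what the two encodings actually compute, here is a small library-free sketch. The `size_order` mapping and the sample values are hypothetical; the point is that an ordinal encoding needs the order stated explicitly, while one-hot encoding creates one 0/1 column per category.

```python
# Ordinal feature: the order low < medium < high is stated explicitly.
size_order = {"low": 0, "medium": 1, "high": 2}
sizes = ["medium", "low", "high", "low"]
ordinal_encoded = [size_order[s] for s in sizes]
print(ordinal_encoded)  # [1, 0, 2, 0]

# Nominal feature: one 0/1 column per category, no order implied.
cities = ["Delhi", "Mumbai", "Delhi", "Bangalore"]
categories = sorted(set(cities))  # ['Bangalore', 'Delhi', 'Mumbai']
one_hot = [[int(city == cat) for cat in categories] for city in cities]
for city, row in zip(cities, one_hot):
    print(city, row)
```

Encoding cities as 0, 1, 2 would suggest to a distance-based model that Mumbai is "further" from Bangalore than Delhi is, which is why nominal features are usually one-hot encoded.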
Feature Scaling
Many algorithms (KNN, SVM, neural networks, and other models trained with gradient descent) are sensitive to the scale of features. A salary feature (50000-100000) would dominate an age feature (20-60) without scaling.
Python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Note: for brevity the scalers are fit on the full DataFrame here;
# in a real pipeline, fit them on the training split only.
# StandardScaler: mean=0, std=1 (z-score normalization)
scaler = StandardScaler()
df[['Age_scaled','Salary_scaled']] = scaler.fit_transform(df[['Age','Salary']])
# MinMaxScaler: scales to [0, 1]
mm_scaler = MinMaxScaler()
df[['Age_mm','Salary_mm']] = mm_scaler.fit_transform(df[['Age','Salary']])
Train-Test Split
Python
from sklearn.model_selection import train_test_split
# 'Target' is a hypothetical label column, added here only so the example runs
df['Target'] = [0, 1, 0, 1, 1]
X = df[['Age', 'Salary']] # Features
y = df['Target'] # Label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
When to use which scaler?
StandardScaler: when data is roughly normally distributed. Commonly used with SVM, Logistic Regression, PCA.
MinMaxScaler: when you need values in a fixed range [0, 1]; note that it is sensitive to outliers. Commonly used with Neural Networks, KNN.
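The arithmetic behind the two scalers is short enough to write out by hand. The sketch below uses only the standard library and hypothetical age values; it uses the population standard deviation, which is the convention StandardScaler follows.

```python
import statistics

ages = [25.0, 30.0, 35.0, 28.0, 32.0]  # hypothetical values

# Z-score scaling: subtract the mean, divide by the population std.
mean = statistics.fmean(ages)
std = statistics.pstdev(ages)
z_scores = [(a - mean) / std for a in ages]

# Min-max scaling: map the smallest value to 0 and the largest to 1.
lo, hi = min(ages), max(ages)
min_max = [(a - lo) / (hi - lo) for a in ages]

print([round(z, 2) for z in z_scores])
print([round(m, 2) for m in min_max])  # [0.0, 0.5, 1.0, 0.3, 0.7]
```

After z-score scaling the values have mean 0 and standard deviation 1; after min-max scaling a single extreme value would squeeze all other values toward one end of the range.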
Mini-Project: Titanic Data Cleaning Pipeline
Python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Load Titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
# Drop irrelevant columns
df = df.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)
# Fill missing values
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Encode categorical variables
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
# Split features and target
X = df.drop('Survived', axis=1)
y = df['Survived']
# Split into train and test before scaling to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print("Data is clean and ready for modeling!")
Exercises
Exercise 3.1: Given a dataset with 15% missing values in 'Income', when would you drop vs. impute?
Drop when: the dataset is very large AND the missing rows are random (MCAR). Losing 15% of a 1M row dataset still leaves 850k rows.
Impute when: the dataset is small, or the missing data has a pattern (MAR/MNAR). Use median for skewed data, mean for normal distributions.
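The advice to prefer the median for skewed data can be checked with a few numbers. This is a sketch with hypothetical incomes, using only the standard library; `None` stands in for a missing entry.

```python
import statistics

# Hypothetical incomes with one extreme value; None marks a missing entry.
incomes = [32_000, 35_000, None, 38_000, 41_000, None, 400_000]
observed = [x for x in incomes if x is not None]

mean_fill = statistics.fmean(observed)     # pulled upward by the outlier
median_fill = statistics.median(observed)  # robust to the outlier

imputed = [median_fill if x is None else x for x in incomes]
print(f"mean={mean_fill:.0f}, median={median_fill}")  # mean=109200, median=38000
print(imputed)
```

Filling with the mean would assign the missing people an income higher than four of the five observed values, while the median stays close to the typical case.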
Exercise 3.2: Encode ['Red', 'Green', 'Blue', 'Red', 'Blue'] using both Label and One-Hot encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder
colors = ['Red', 'Green', 'Blue', 'Red', 'Blue']
le = LabelEncoder()
label_encoded = le.fit_transform(colors)
print("Label:", label_encoded) # [2, 1, 0, 2, 0]
df = pd.DataFrame({'Color': colors})
one_hot = pd.get_dummies(df, columns=['Color'])
print("One-Hot:\n", one_hot)
Exercise 3.3: Why should you fit the scaler ONLY on training data?
If you fit the scaler on the entire dataset (including test data), you introduce data leakage. The scaler would learn the mean/std of the test set, which shouldn't be available during training. This leads to overly optimistic performance estimates. Always: scaler.fit(X_train), then scaler.transform(X_test).
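The fit-on-train, transform-on-test pattern can be sketched without any library. `SimpleStandardizer` is a hypothetical stand-in written for this example (it is not a scikit-learn class), and the data values are made up.

```python
import statistics

class SimpleStandardizer:
    """Hypothetical minimal scaler with separate fit and transform steps."""

    def fit(self, values):
        # Learn the statistics from the data passed in (training data only).
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Apply the stored statistics; nothing is re-estimated here.
        return [(v - self.mean_) / self.std_ for v in values]

train = [10.0, 20.0, 30.0]  # hypothetical training feature
test = [40.0]               # hypothetical held-out value

scaler = SimpleStandardizer().fit(train)  # statistics come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)      # reuses the training mean and std

print(scaler.mean_)  # 20.0
print(test_scaled)
```

Had the scaler been fit on all four values, the mean would have shifted to 25.0, so information about the held-out point would have leaked into the preprocessing of the training set.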
Chapter Summary
- Handle missing data using dropping, mean/median/mode imputation, or SimpleImputer
- Use integer (label/ordinal) encoding for ordinal categories, with the order specified explicitly; use One-Hot Encoding for nominal categories
- Scale features with StandardScaler or MinMaxScaler to normalize ranges
- Always split data before scaling to prevent data leakage
Linear Regression
Learning Objectives
- Understand the math behind linear regression — cost function and gradient descent
- Implement linear regression from scratch in Python
- Use scikit-learn for efficient linear regression
- Build a house price prediction project
The Idea
Linear regression fits a straight line (or hyperplane) through data to predict a continuous target variable. It finds the best linear relationship between input features (X) and output (y):
ŷ = wX + b
where w (weights) and b (bias) are learned from data. The goal is to find values of w and b that minimize the error between predicted ŷ and actual y.
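Cost Function: Mean Squared Error
The mean squared error averages the squared differences between predictions and targets over the n training samples:
J(w, b) = (1/n) Σᵢ (ŷᵢ - yᵢ)²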
Gradient Descent
Gradient descent is an optimization algorithm that iteratively adjusts w and b to minimize the cost function by repeatedly stepping in the direction of steepest descent (the negative gradient), with the step size set by the learning rate.
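For reference, the gradients used by gradient descent can be written out for the mean-squared-error cost. This is a sketch under stated assumptions: n is the number of samples, x_i is the feature vector of sample i, and the symbol α for the learning rate is introduced here for illustration.

```latex
\begin{align*}
J(w, b) &= \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,
  \qquad \hat{y}_i = w^\top x_i + b \\
\frac{\partial J}{\partial w} &= \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)\, x_i,
  \qquad
\frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \\
w &\leftarrow w - \alpha \, \frac{\partial J}{\partial w},
  \qquad
b \leftarrow b - \alpha \, \frac{\partial J}{\partial b}
\end{align*}
```

Many implementations, including the from-scratch class in this chapter, drop the constant factor 2 from the gradients. This only rescales the effective learning rate and does not change where the minimum lies.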
Implementation from Scratch
Python
import numpy as np
class LinearRegressionScratch:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for _ in range(self.n_iter):
            y_pred = np.dot(X, self.weights) + self.bias
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            # Track cost
            cost = np.mean((y_pred - y) ** 2)
            self.cost_history.append(cost)
    def predict(self, X):
        return np.dot(X, self.weights) + self.bias
# Usage
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.flatten() + np.random.randn(100)
model = LinearRegressionScratch(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)
print(f"Weight: {model.weights[0]:.3f}, Bias: {model.bias:.3f}")
# Should be close to Weight: 3.0, Bias: 4.0
Using Scikit-Learn
Python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_:.3f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.3f}")
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
Project: House Price Prediction
Python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Train model
model = LinearRegression()
model.fit(X_train_s, y_train)
# Evaluate
y_pred = model.predict(X_test_s)
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)) * 100000:.0f}")
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
# Feature importance
for name, coef in zip(housing.feature_names, model.coef_):
    print(f" {name}: {coef:.4f}")
Exercises
Exercise 4.1: What happens if the learning rate is too high or too low?
Too high: Gradient descent overshoots the minimum — the cost oscillates or diverges to infinity. The model never converges.
Too low: Convergence is extremely slow — training takes thousands of extra iterations. You might run out of patience and stop before reaching the optimal solution.
Best practice: Start with 0.01, try 0.001, 0.1, and use learning rate schedules that decrease over time.
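The three regimes are easy to observe on a toy problem. The sketch below minimizes f(w) = w² (gradient 2w, minimum at w = 0) with three hypothetical learning rates; the function name `descend` and the starting point are choices made for this example.

```python
def descend(lr, steps=50, start=10.0):
    """Minimize f(w) = w**2 with gradient descent; the gradient is 2*w."""
    w = start
    for _ in range(steps):
        w -= lr * 2 * w
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 50 steps = {descend(lr):.4g}")
```

With lr=0.001 the value has barely moved from 10 after 50 steps, with lr=0.1 it is very close to the minimum at 0, and with lr=1.1 each step overshoots by more than it corrects, so the magnitude of w grows without bound.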
Exercise 4.2: Implement linear regression from scratch for y = 5x + 3 with noise
np.random.seed(0)
X = np.random.rand(200, 1) * 10
y = 5 * X.flatten() + 3 + np.random.randn(200) * 2
model = LinearRegressionScratch(learning_rate=0.01, n_iterations=2000)
model.fit(X, y)
print(f"Learned: y = {model.weights[0]:.2f}x + {model.bias:.2f}")
# Expected: y ≈ 5.00x + 3.00
Exercise 4.3: What is R² score? What does R²=0.85 mean?
R² (Coefficient of Determination) measures how well the model explains the variance in the target variable.
R² = 1 - (SS_res / SS_tot) where SS_res is the sum of squared residuals and SS_tot is the total sum of squares.
R² = 0.85 means the model explains 85% of the variance in the target variable. The remaining 15% is not captured by the model, whether because of noise or missing features.
R² = 1.0 → perfect prediction, R² = 0 → model predicts just the mean, R² < 0 → worse than predicting mean.
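The formula and the three reference cases above can be verified with a few lines of plain Python. The helper name `r2` and the data are hypothetical, chosen so the arithmetic is easy to follow (SS_tot = 20 for these targets).

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]     # hypothetical targets, mean = 6
good_pred = [3.5, 4.5, 7.5, 8.5]  # each prediction is off by 0.5
mean_pred = [6.0, 6.0, 6.0, 6.0]  # always predicts the mean
bad_pred = [9.0, 7.0, 5.0, 3.0]   # reversed trend

print(round(r2(y_true, good_pred), 2))  # 0.95
print(r2(y_true, mean_pred))            # 0.0
print(r2(y_true, bad_pred))             # -3.0
```

The negative value for `bad_pred` illustrates that R² is not a squared quantity despite its name: a model can do worse than simply predicting the mean.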
Chapter Summary
- Linear regression minimizes MSE to find the best-fit line: ŷ = wX + b
- Gradient descent iteratively updates weights to minimize cost
- R² score measures how well the model explains variance (higher is better)
- Feature scaling is crucial for gradient descent convergence
Classification: Logistic Regression
Learning Objectives
- Understand logistic regression and the sigmoid function
- Learn binary cross-entropy loss and decision boundaries
- Implement logistic regression with scikit-learn
- Build an email spam classifier
From Regression to Classification
Logistic regression predicts probabilities for classification. It passes the linear regression output z = wX + b through the sigmoid function to produce values between 0 and 1:
σ(z) = 1 / (1 + e^(-z))
If σ(z) ≥ 0.5, predict class 1. Otherwise, predict class 0. The 0.5 threshold defines the decision boundary.
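Binary Cross-Entropy Loss
Training minimizes the binary cross-entropy (log loss) between the true labels yᵢ ∈ {0, 1} and the predicted probabilities ŷᵢ = σ(zᵢ):
L = -(1/n) Σᵢ [ yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ) ]
The sigmoid that produces these probabilities can be computed as follows: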
Python
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Sigmoid outputs for different inputs
z_values = np.array([-5, -2, 0, 2, 5])
print(sigmoid(z_values))
# [0.007, 0.119, 0.500, 0.881, 0.993]
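As a companion to the sigmoid, here is a minimal sketch of the binary cross-entropy loss itself, in plain Python. The clipping constant `eps` is an assumption added to avoid taking log(0); the labels and probabilities are hypothetical.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
confident_wrong = [0.1, 0.9, 0.2, 0.8]

print(round(binary_cross_entropy(labels, confident_right), 3))  # 0.164
print(round(binary_cross_entropy(labels, confident_wrong), 3))  # 1.956
```

The loss is small when the predicted probabilities agree with the labels and grows sharply for confident mistakes, which is what pushes the model toward well-calibrated probabilities during training.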
Scikit-Learn Implementation
Python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1] # Probability of class 1
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
Project: Email Spam Classifier
Python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Sample email data
emails = [
"Win a free iPhone now! Click here!",
"Meeting scheduled for tomorrow at 3pm",
"Congratulations! You won $1000000",
"Please review the attached report",
"FREE OFFER! Buy one get one free!!!",
"Can we reschedule the project discussion?",
"URGENT: Your account will be suspended",
"Hey, are you coming to lunch today?",
"Claim your prize before it expires!!!",
"The quarterly report is ready for review",
]
labels = [1,0,1,0,1,0,1,0,1,0] # 1=spam, 0=ham
# Convert text to numerical features using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(emails)
y = np.array(labels)
# Train classifier
model = LogisticRegression()
model.fit(X, y)
# Test on new emails
new_emails = [
"You are selected for a cash prize!",
"Let's finalize the budget this week"
]
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)
for email, pred in zip(new_emails, predictions):
    label = "SPAM" if pred == 1 else "HAM"
    print(f"[{label}] {email}")
Exercises
Exercise 5.1: What is the sigmoid output for z = 0? z = 10? z = -10?
σ(0) = 0.5 (exactly at the decision boundary)
σ(10) ≈ 0.99995 (very confident class 1)
σ(-10) ≈ 0.00005 (very confident class 0)
Exercise 5.2: Explain the difference between accuracy and precision
Accuracy = (TP + TN) / Total — overall correctness. Can be misleading with imbalanced classes (99% accuracy if 99% are negative).
Precision = TP / (TP + FP) — of all positive predictions, how many were correct? Important when false positives are costly (e.g., spam filter marking important email as spam).
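The difference becomes visible on imbalanced data. The following sketch counts TP, TN, FP, and FN by hand for two hypothetical classifiers on a made-up dataset with 95 negatives and 5 positives.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

# Hypothetical imbalanced data: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100                        # never flags anything
cautious = [0] * 93 + [1] * 2 + [1] * 3 + [0] * 2  # 2 FP, then 3 TP and 2 FN

for name, y_pred in [("always negative", always_negative), ("cautious", cautious)]:
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    print(f"{name}: accuracy={accuracy:.2f}, precision={precision:.2f}")
```

The classifier that never predicts the positive class still reaches 95% accuracy while being useless, and its precision is undefined because it makes no positive predictions at all. The second classifier has similar accuracy (96%) but a precision of 0.60, which says far more about how trustworthy its positive predictions are.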
Exercise 5.3: When would you change the decision threshold from 0.5?
Lower threshold (e.g., 0.3): When missing a positive is very costly — medical diagnosis (prefer false alarms over missed diseases).
Higher threshold (e.g., 0.8): When false positives are costly — fraud detection (don't block legitimate transactions unnecessarily).
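Changing the threshold is a one-line operation once you have predicted probabilities. The sketch below uses hypothetical probabilities and a hypothetical `classify` helper to show how the same scores yield different labels.

```python
# Hypothetical predicted probabilities of the positive class.
probabilities = [0.15, 0.35, 0.55, 0.75, 0.95]

def classify(probs, threshold=0.5):
    """Label a sample 1 when its probability meets the threshold."""
    return [int(p >= threshold) for p in probs]

print(classify(probabilities, 0.3))  # [0, 1, 1, 1, 1]
print(classify(probabilities))       # [0, 0, 1, 1, 1]
print(classify(probabilities, 0.8))  # [0, 0, 0, 0, 1]
```

With a scikit-learn classifier, the same idea is usually applied to the second column of `predict_proba`, for example `(model.predict_proba(X_test)[:, 1] >= 0.3).astype(int)`.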
Chapter Summary
- Logistic regression uses the sigmoid function to produce probabilities for binary classification
- Binary cross-entropy is the loss function optimized during training
- Confusion matrix, precision, recall, and F1-score provide comprehensive evaluation
- TF-IDF converts text to numerical features for NLP classification tasks
Decision Trees & Ensemble Methods
Learning Objectives
- Understand decision tree construction using Gini impurity and information gain
- Learn ensemble methods: Bagging, Random Forest, and Gradient Boosting
- Implement tree-based models with scikit-learn
- Build a customer churn prediction project
Decision Trees
A decision tree splits data recursively based on feature thresholds, creating a tree structure of if-else decisions. It chooses splits that maximize information gain (or minimize Gini impurity).
Python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(f"Accuracy: {tree.score(X_test, y_test):.3f}")
print(export_text(tree, feature_names=iris.feature_names))
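The split criteria mentioned above, Gini impurity and information gain, can be computed by hand. This is a sketch in plain Python on a hypothetical split of ten labels; the function names are chosen for this example.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4

print(gini(parent))                                     # 0.5
print(round(gini(left), 2))                             # 0.32
print(round(information_gain(parent, left, right), 3))  # 0.278
```

A perfectly mixed node has Gini impurity 0.5 (for two classes), and a split that produces purer children lowers the impurity and yields positive information gain. A tree learner evaluates many candidate splits this way and keeps the best one.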
Random Forest (Bagging)
Random Forest trains multiple decision trees on random subsets of data and features, then combines their predictions (majority voting for classification, averaging for regression). This reduces overfitting.
Python
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
print(f"Random Forest Accuracy: {rf.score(X_test, y_test):.3f}")
# Feature importance
for name, imp in zip(iris.feature_names, rf.feature_importances_):
    print(f" {name}: {imp:.4f}")
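The two building blocks of bagging, bootstrap sampling and majority voting, can be sketched without any ML library. The row indices and the votes below are hypothetical stand-ins; a real Random Forest would train one tree per bootstrap sample and also subsample features at each split.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement (some repeat, some are left out)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = list(range(10))  # stand-in for 10 training rows
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for s in samples:
    print(sorted(s))

# Hypothetical votes from five trees for one test sample.
print(majority_vote(["churn", "stay", "churn", "churn", "stay"]))  # churn
```

Because each sample repeats some rows and omits others, every tree sees a slightly different dataset, and the vote averages out the mistakes that individual trees make.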
Gradient Boosting & XGBoost
Gradient Boosting builds trees sequentially — each tree corrects errors of the previous one. XGBoost is an optimized, high-performance implementation.
Python
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X_train, y_train)
print(f"Gradient Boosting Accuracy: {gb.score(X_test, y_test):.3f}")
| Method | How it works | Strength |
|---|---|---|
| Decision Tree | Single tree, recursive splits | Interpretable, fast |
| Random Forest | Many trees in parallel (bagging) | Reduces overfitting |
| Gradient Boosting | Trees built sequentially (boosting) | Often the highest accuracy on tabular data |
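The sequential error-correction idea behind gradient boosting can be sketched for regression with squared error, where "correcting errors" means fitting each new learner to the current residuals. This is an illustrative sketch only: the one-split stump learner, the data, and the learning rate are assumptions for this example, and scikit-learn's classifier optimizes a classification loss rather than squared error.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - l_mean) ** 2 for r in left) + sum((r - r_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, l_mean, r_mean)
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x <= threshold else r_mean

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical 1-D feature
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]  # hypothetical target
learning_rate = 0.5

prediction = [sum(ys) / len(ys)] * len(ys)  # round 0: predict the mean
mse_history = []
for _ in range(10):
    residuals = [y - p for y, p in zip(ys, prediction)]
    mse_history.append(sum(r * r for r in residuals) / len(ys))
    stump = fit_stump(xs, residuals)  # the next learner targets the current errors
    prediction = [p + learning_rate * stump(x) for p, x in zip(prediction, xs)]

print([round(m, 3) for m in mse_history])
```

The printed training error shrinks from round to round because every stump is fit to whatever the ensemble still gets wrong. The learning rate scales each correction, trading training speed for robustness.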
Project: Customer Churn Prediction
Python
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Simulated telecom churn dataset
np.random.seed(42)
n = 1000
data = pd.DataFrame({
'tenure': np.random.randint(1, 72, n),
'monthly_charge': np.random.uniform(20, 120, n).round(2),
'total_charges': np.random.uniform(100, 8000, n).round(2),
'contract_type': np.random.choice([0,1,2], n), # month/year/2year
'support_tickets': np.random.randint(0, 10, n),
})
# Churn likelihood: short tenure + high charges + many tickets
churn_prob = (1 / (1 + np.exp(-(
-0.05 * data['tenure'] +
0.02 * data['monthly_charge'] +
0.3 * data['support_tickets'] +
-1.0 * data['contract_type'] - 1
))))
data['churned'] = (np.random.random(n) < churn_prob).astype(int)
X = data.drop('churned', axis=1)
y = data['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
rf.fit(X_train, y_train)
print(f"Accuracy: {rf.score(X_test, y_test):.3f}")
print(classification_report(y_test, rf.predict(X_test)))
print("Top churn predictors:")
for name, imp in sorted(zip(X.columns, rf.feature_importances_), key=lambda x: -x[1]):
    print(f" {name}: {imp:.4f}")
Exercises
Exercise 6.1: Why do decision trees tend to overfit? How does Random Forest solve this?
Decision trees grow deep to fit every training sample perfectly, memorizing noise. They have high variance.
Random Forest reduces variance by: (1) training each tree on a random bootstrap sample, (2) considering only a random subset of features at each split, (3) averaging predictions across many trees. This ensemble approach smooths out the noise of individual trees.
Exercise 6.2: What is the difference between Bagging and Boosting?
Bagging (Bootstrap Aggregating): Train trees independently in parallel on random subsets, then average/vote. Reduces variance. Example: Random Forest.
Boosting: Train trees sequentially — each tree focuses on correcting errors of the previous one. Reduces bias. Example: Gradient Boosting, XGBoost, AdaBoost.
Exercise 6.3: Explain feature importance in Random Forest
Feature importance measures how much each feature contributes to reducing impurity across all trees. Calculated as the average decrease in Gini impurity when a feature is used for splitting, weighted by the number of samples it affects. Higher importance = feature is more predictive.
Chapter Summary
- Decision trees split data using Gini impurity or information gain
- Random Forest (bagging) reduces overfitting by averaging many trees
- Gradient Boosting builds trees sequentially for higher accuracy
- Feature importance reveals which variables drive predictions
Support Vector Machines
Learning Objectives
- Understand the concept of maximum margin hyperplanes
- Learn the kernel trick for non-linear classification
- Implement SVM with different kernels
- Apply SVM to image classification
The Intuition
SVM finds the optimal hyperplane that separates classes with the maximum margin — the largest gap between the closest data points (support vectors) of each class. Wider margin = better generalization.
Kernel Trick
When data isn't linearly separable, the kernel trick maps data to a higher dimension where it becomes separable, without actually computing the transformation.
| Kernel | Use Case | Formula |
|---|---|---|
| Linear | Linearly separable data | K(x,y) = x · y |
| RBF (Gaussian) | Complex, non-linear boundaries | K(x,y) = exp(-γ‖x-y‖²) |
| Polynomial | Polynomial decision boundaries | K(x,y) = (x · y + r)^d |
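The claim that a kernel works in a higher dimension "without actually computing the transformation" can be checked numerically. The sketch below uses a degree-2 polynomial kernel with r = 0 on 2-D inputs; the explicit `feature_map` is the standard textbook mapping for this special case, and the input vectors are hypothetical.

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel with r = 0: K(x, y) = (x . y)**2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit 3-D mapping whose dot product equals the kernel above."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))

print(poly_kernel(x, y))   # 121.0
print(round(explicit, 6))  # 121.0
```

Both routes give the same number, but the kernel needs only a 2-D dot product. For kernels such as RBF the implicit feature space is infinite-dimensional, so the kernel shortcut is the only practical option.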
Python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Non-linear data
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Compare kernels
for kernel in ['linear', 'rbf', 'poly']:
    svm = SVC(kernel=kernel, C=1.0, gamma='scale')
    svm.fit(X_train_s, y_train)
    acc = svm.score(X_test_s, y_test)
    print(f"{kernel:>8} kernel → Accuracy: {acc:.3f}")
Key Hyperparameters
C (Regularization): High C = strict margin (may overfit), Low C = wide margin (may underfit).
gamma: High gamma = complex boundary (close influence), Low gamma = smooth boundary (wide influence).
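The effect of gamma can be seen directly in the RBF kernel value, which acts as a similarity score between two points. This sketch uses hypothetical points and three illustrative gamma values.

```python
import math

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

origin = (0.0, 0.0)
near, far = (1.0, 1.0), (3.0, 3.0)

for gamma in (0.01, 1.0, 10.0):
    print(f"gamma={gamma}: near={rbf_kernel(origin, near, gamma):.4f}, "
          f"far={rbf_kernel(origin, far, gamma):.4f}")
```

With a low gamma even the distant point still counts as similar, so each training point influences a wide region and the boundary is smooth. With a high gamma the similarity drops to almost zero within a short distance, so the boundary can bend tightly around individual points.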
Project: Handwritten Digit Classification
Python
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
digits.data, digits.target, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
svm = SVC(kernel='rbf', C=10, gamma=0.001)
svm.fit(X_train_s, y_train)
y_pred = svm.predict(X_test_s)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred))
Exercises
Exercise 7.1: Why is feature scaling critical for SVMs?
SVM calculates distances between data points to find the margin. If one feature has range [0, 100000] and another [0, 1], the large-scale feature dominates the distance calculation. Scaling ensures all features contribute equally to the decision boundary.
Exercise 7.2: When would you choose RBF kernel over Linear?
Use Linear when: data is linearly separable, you have many features relative to samples (text classification), or you need interpretability.
Use RBF when: the decision boundary is non-linear (moon/circle patterns), you have a moderate number of features, and accuracy matters more than interpretability.
Exercise 7.3: What are support vectors and why are they important?
Support vectors are the data points closest to the decision boundary (hyperplane). They are the most "difficult" examples to classify. The SVM model is completely defined by these support vectors — removing any non-support vector from the training set does not change the model. This makes SVM memory-efficient and robust.
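The claim that only the support vectors matter can be checked directly. The sketch below, which assumes two hand-placed, well-separated blobs, fits a linear SVM, refits a second one on the support vectors alone, and compares the predictions of both models.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs (centers chosen by hand for this illustration)
X, y = make_blobs(n_samples=200, centers=[[-3, -3], [3, 3]],
                  cluster_std=1.0, random_state=0)

full = SVC(kernel='linear', C=1.0).fit(X, y)
sv_idx = full.support_                      # indices of the support vectors
print(f"{len(sv_idx)} of {len(X)} points are support vectors")

# Retrain using only the support vectors
reduced = SVC(kernel='linear', C=1.0).fit(X[sv_idx], y[sv_idx])

same = np.array_equal(full.predict(X), reduced.predict(X))
print("Same predictions on all points:", same)
```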
Chapter Summary
- SVM finds the maximum margin hyperplane for classification
- Kernel trick enables non-linear classification without explicit transformation
- RBF kernel works well for most non-linear problems
- Feature scaling is effectively required for good SVM performance, since kernels such as RBF depend on distances
Unsupervised Learning
Learning Objectives
- Understand K-Means and hierarchical clustering algorithms
- Use the elbow method and silhouette score to choose k
- Apply PCA for dimensionality reduction
- Build a customer segmentation project
K-Means Clustering
K-Means groups data into k clusters. Starting from k initial centroids, it repeats two steps until the assignments stop changing: (1) assign each point to the nearest centroid, (2) recalculate each centroid as the mean of its assigned points.
Python
from sklearn.cluster import KMeans
import numpy as np
# Generate sample data
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
print(f"Cluster centers:\n{kmeans.cluster_centers_}")
print(f"Inertia: {kmeans.inertia_:.2f}")
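The two steps described above can be sketched in plain NumPy. This is a minimal illustration rather than what scikit-learn does internally: the function name kmeans_numpy is hypothetical, initialization is a simple random pick of data points (not k-means++), and the demo data is two synthetic blobs.

```python
import numpy as np

def kmeans_numpy(X, k, n_iters=100, seed=0):
    """Plain Lloyd's algorithm: assign points, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break   # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans_numpy(X_demo, k=2)
print(centroids.round(2))
```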
Choosing k: Elbow Method
Python
inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)
# Plot the elbow curve
import matplotlib.pyplot as plt
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Look for the "elbow" where the curve bends
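The silhouette score is a second way to choose k: it compares how close each point is to its own cluster versus the nearest other cluster, and higher is better. The sketch below assumes four hand-placed blobs so that the expected answer is known. On real data the peak is usually less clear-cut.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four hand-placed, well-separated blobs (an assumption for this illustration)
X_sil, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                      cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 9):   # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_sil)
    scores[k] = silhouette_score(X_sil, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```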
PCA: Dimensionality Reduction
Principal Component Analysis projects high-dimensional data onto fewer dimensions while preserving maximum variance.
Python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data # 4 features
# Reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(f"Original shape: {X.shape}") # (150, 4)
print(f"Reduced shape: {X_2d.shape}") # (150, 2)
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.2%}")
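To see what "preserving maximum variance" means, PCA can be reproduced by hand: center the data, take the eigenvectors of the covariance matrix, and project onto those with the largest eigenvalues. The sketch below is one way to do this with NumPy and compares the result with scikit-learn. Component signs may differ between the two, which does not affect the variance explained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
X_centered = X - X.mean(axis=0)          # PCA works on mean-centered data

# Eigen-decomposition of the covariance matrix
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # re-sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_manual = X_centered @ eigvecs[:, :2]   # project onto the top 2 components
manual_ratio = eigvals[:2] / eigvals.sum()

sk_pca = PCA(n_components=2).fit(X)
print("Manual :", manual_ratio.round(4))
print("sklearn:", sk_pca.explained_variance_ratio_.round(4))
```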
Project: Customer Segmentation
Python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Simulated customer data
np.random.seed(42)
customers = pd.DataFrame({
    'annual_income': np.concatenate([
        np.random.normal(30000, 5000, 100),
        np.random.normal(70000, 10000, 100),
        np.random.normal(120000, 15000, 100)
    ]),
    'spending_score': np.concatenate([
        np.random.normal(20, 10, 100),
        np.random.normal(60, 15, 100),
        np.random.normal(80, 8, 100)
    ])
})
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)
# Cluster with k=3 (the simulated data was generated from three groups)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
customers['segment'] = kmeans.fit_predict(X_scaled)
# Analyze segments
print(customers.groupby('segment').mean().round(0))
# Segment numbers are arbitrary labels; read the printed means to see which is which:
# - Budget-conscious: low income, low spending
# - Middle market: medium income, medium spending
# - Premium: high income, high spending
Exercises
Exercise 8.1: K-Means found 3 clusters but you expected 5. What could be wrong?
Possible reasons: (1) Two clusters may overlap and K-Means merges them. (2) The elbow method might suggest 3 is optimal. (3) Features may need better scaling. (4) The data genuinely has 3 natural groupings. Try: silhouette analysis, different initializations, or DBSCAN which doesn't need k.
Exercise 8.2: How much variance should PCA retain?
Common practice: retain 95% of variance. Use PCA(n_components=0.95) to automatically select the number of components. For visualization, 2-3 components are used even if they retain less variance.
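As a small illustration of the 95% rule, the sketch below applies PCA(n_components=0.95) to the 64-feature digits dataset and reports how many components were kept. Standardizing first is an assumption made here, not a requirement of PCA.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_digits = StandardScaler().fit_transform(load_digits().data)   # 64 features

pca95 = PCA(n_components=0.95)      # keep enough components for at least 95% variance
X_reduced = pca95.fit_transform(X_digits)

print(f"Components kept: {pca95.n_components_} of {X_digits.shape[1]}")
print(f"Variance retained: {pca95.explained_variance_ratio_.sum():.2%}")
```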
Exercise 8.3: What are limitations of K-Means?
(1) Must specify k in advance. (2) Assumes spherical clusters of equal size. (3) Sensitive to initialization — use k-means++. (4) Sensitive to outliers. (5) Doesn't work well with non-convex shapes. Alternatives: DBSCAN (no k needed, finds arbitrary shapes), Hierarchical clustering.
Chapter Summary
- K-Means clusters data by iteratively updating centroids
- Elbow method and silhouette score help choose the number of clusters
- PCA reduces dimensions while preserving maximum variance
- Customer segmentation is a classic unsupervised learning application
Neural Networks Fundamentals
Learning Objectives
- Understand the perceptron and multi-layer neural networks
- Learn activation functions: ReLU, Sigmoid, Softmax
- Understand forward propagation and backpropagation
- Build a neural network from scratch
The Perceptron
A perceptron is the simplest neural network — a single neuron that takes inputs, applies weights and bias, then passes through an activation function.
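A perceptron fits in a few lines of NumPy. The sketch below uses the classic perceptron learning rule with a step activation. The function names and the choice of 20 epochs are illustrative assumptions. It learns the AND gate, which is linearly separable, and fails on XOR, which is not.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Classic perceptron rule: weights change only on misclassified points."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0   # step activation
            error = target - pred               # 0 when the prediction is correct
            w += lr * error * xi
            b += lr * error
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

w, b = train_perceptron(X, y_and)
and_pred = predict(X, w, b)
print("AND:", and_pred)

w, b = train_perceptron(X, y_xor)
xor_pred = predict(X, w, b)
print("XOR:", xor_pred, "(cannot match all four targets)")
```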
Activation Functions
| Function | Formula | Range | Use Case |
|---|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | (0, 1) | Binary output, output layer |
| ReLU | max(0, x) | [0, ∞) | Hidden layers (most common) |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Hidden layers, zero-centered |
| Softmax | eˣⁱ/Σeˣʲ | (0, 1), sum=1 | Multi-class output |
Python
import numpy as np
def relu(x):
    return np.maximum(0, x)
def relu_derivative(x):
    return (x > 0).astype(float)
def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)
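The table also lists Softmax and Tanh, which the code above does not cover. A possible NumPy version is sketched below. Subtracting the row maximum before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])   # large values would overflow a naive exp
probs = softmax(logits)
print(probs.round(3))
print("Row sums:", probs.sum(axis=1))
```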
Building a Neural Network from Scratch
Python
import numpy as np
class NeuralNetwork:
    def __init__(self, layer_sizes):
        """layer_sizes: e.g., [2, 4, 1] = 2 inputs, 4 hidden, 1 output"""
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * 0.5
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(w)
            self.biases.append(b)
    def forward(self, X):
        self.activations = [X]
        self.z_values = []
        current = X
        for i in range(len(self.weights)):
            z = current @ self.weights[i] + self.biases[i]
            self.z_values.append(z)
            if i == len(self.weights) - 1:
                current = sigmoid(z)  # Output layer
            else:
                current = relu(z)  # Hidden layers
            self.activations.append(current)
        return current
    def backward(self, X, y, lr=0.01):
        m = X.shape[0]
        output = self.activations[-1]
        # Gradient of binary cross-entropy w.r.t. the output pre-activation (sigmoid output)
        delta = output - y.reshape(-1, 1)
        for i in range(len(self.weights) - 1, -1, -1):
            dw = self.activations[i].T @ delta / m
            db = np.sum(delta, axis=0, keepdims=True) / m
            if i > 0:
                # Propagate the error to the previous layer before updating weights[i]
                delta = (delta @ self.weights[i].T) * relu_derivative(self.z_values[i-1])
            self.weights[i] -= lr * dw
            self.biases[i] -= lr * db
    def train(self, X, y, epochs=1000, lr=0.1):
        for epoch in range(epochs):
            output = self.forward(X)
            self.backward(X, y, lr)
            if epoch % 200 == 0:
                # Mean squared error, printed only to monitor progress
                loss = np.mean((output - y.reshape(-1,1))**2)
                print(f"Epoch {epoch}, Loss: {loss:.4f}")
# XOR problem: not linearly separable!
# Uses relu, relu_derivative and sigmoid from the previous block.
# Results vary with the random initialization; re-run if training gets stuck.
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0, 1, 1, 0])
nn = NeuralNetwork([2, 8, 1])
nn.train(X, y, epochs=2000, lr=0.5)
predictions = nn.forward(X)
print("\nPredictions:")
for inp, pred in zip(X, predictions):
    print(f" {inp} → {pred[0]:.4f} (expected {int(inp[0]) ^ int(inp[1])})")
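The backward pass starts from delta = output - y. That expression is the gradient of the binary cross-entropy loss with respect to the output pre-activation when the output uses a sigmoid. With a squared-error loss there would be an extra sigmoid-derivative factor. The sketch below checks this numerically with a central difference. The sample values of z and y are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(z, y):
    """Binary cross-entropy of a sigmoid output, as a function of the pre-activation z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([-2.0, -0.5, 0.3, 1.5])   # arbitrary pre-activations
y = np.array([0.0, 1.0, 1.0, 0.0])     # arbitrary targets

analytic = sigmoid(z) - y              # the "delta" used at the output layer
eps = 1e-6
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)   # central difference

print("analytic:", analytic.round(6))
print("numeric :", numeric.round(6))
```

The same finite-difference idea, applied to each weight, is a common way to test a hand-written backpropagation routine.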
Exercises
Exercise 9.1: Why can't a single perceptron solve XOR?
XOR is not linearly separable — no single straight line can separate the classes (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. A perceptron can only create linear decision boundaries. You need at least one hidden layer (2+ neurons) to create the non-linear boundary needed for XOR.
Exercise 9.2: Why is ReLU preferred over Sigmoid in hidden layers?
(1) Less vanishing gradient: Sigmoid saturates for large positive or negative inputs, where its gradient is close to zero (and it is never above 0.25), making deep networks hard to train. ReLU's gradient is exactly 1 for positive inputs.
(2) Computational speed: ReLU is just max(0,x) — much faster than computing exponentials.
(3) Sparsity: ReLU outputs zero for negative inputs, creating sparse representations that can improve efficiency.
Exercise 9.3: Modify the neural network to have 2 hidden layers: [2, 8, 4, 1]
nn = NeuralNetwork([2, 8, 4, 1])
nn.train(X, y, epochs=3000, lr=0.5)
predictions = nn.forward(X)
for inp, pred in zip(X, predictions):
    print(f" {inp} → {pred[0]:.4f}")
Adding a second hidden layer gives the network more capacity to learn complex patterns.
Chapter Summary
- Neural networks combine layers of neurons with non-linear activation functions
- ReLU is the default hidden layer activation; Sigmoid/Softmax for output
- Backpropagation computes gradients to update weights layer by layer
- Even a simple network with one hidden layer can solve non-linear problems like XOR
Deep Learning with TensorFlow & Keras
Learning Objectives
- Build deep learning models with the Keras API
- Understand CNNs for image recognition
- Learn RNNs and LSTMs for sequential data
- Build a digit recognition project with MNIST
Keras: High-Level Deep Learning API
Python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Simple sequential model
model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(784,)),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(10, activation='softmax')
])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
model.summary()
Convolutional Neural Networks (CNNs)
CNNs use convolutional filters to automatically detect visual features — edges, textures, shapes — making them ideal for image recognition.
Python
cnn_model = keras.Sequential([
    # Convolutional layers: feature extraction
    layers.Conv2D(32, (3,3), activation='relu', input_shape=(28,28,1)),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (3,3), activation='relu'),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (3,3), activation='relu'),
    # Dense layers: classification
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])
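What a single convolutional filter computes can be shown without any deep learning library. The sketch below slides a hand-written vertical edge filter over a tiny image. The function name conv2d_valid and the filter values are illustrative. As in most deep learning frameworks, the operation is technically cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding and stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# 6x6 image: dark left half (0), bright right half (1)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written vertical edge detector (a hypothetical example filter)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map)   # each row is [0, 3, 3, 0]: strong response only at the edge
```

In a trained CNN the filter values are learned rather than written by hand, but the sliding-window computation is the same.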
Recurrent Neural Networks (RNNs & LSTMs)
RNNs process sequential data (text, time series) by maintaining a hidden state that is updated at every time step. LSTMs mitigate the vanishing gradient problem of plain RNNs with gates that control what information is kept, written, and output.
Python
# LSTM for sequence classification
lstm_model = keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=128),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32),
    layers.Dense(1, activation='sigmoid')
])
lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
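The idea of a hidden state can be made concrete with a plain (non-LSTM) recurrent step in NumPy. This sketch uses random, untrained weights and hypothetical sizes. It only shows how each new state is computed from the current input and the previous state.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 4, 5   # illustrative sizes

# Randomly initialised parameters (untrained, for illustration only)
W_x = rng.normal(scale=0.5, size=(input_dim, hidden_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

sequence = rng.normal(size=(seq_len, input_dim))   # one sequence of 5 time steps
h = np.zeros(hidden_dim)                           # initial hidden state

for t, x_t in enumerate(sequence):
    # The new state depends on the current input AND the previous state
    h = np.tanh(x_t @ W_x + h @ W_h + b)
    print(f"t={t}: h={h.round(3)}")
```

An LSTM replaces the single tanh update with gated updates, but it keeps the same pattern of carrying state from one step to the next.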
Project: MNIST Digit Recognition with CNN
Python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Load MNIST
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
# Preprocess
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
X_test = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0
# Build CNN
model = keras.Sequential([
    layers.Conv2D(32, (3,3), activation='relu', input_shape=(28,28,1)),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (3,3), activation='relu'),
    layers.MaxPooling2D((2,2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Train
history = model.fit(X_train, y_train, epochs=5,
                    batch_size=64,
                    validation_split=0.1)
# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"\nTest Accuracy: {test_acc:.4f}")
# Expected: ~99% accuracy
Exercises
Exercise 10.1: What does Dropout do and why is it important?
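Dropout randomly sets a fraction of neuron outputs to zero during training. This discourages co-adaptation of neurons and acts as regularization. At test time all neurons are active. Keras uses inverted dropout: the surviving outputs are scaled up by 1/(1 - rate) during training, so no rescaling is needed at inference. Dropout(0.5) means each unit is dropped with probability 0.5 on every training forward pass. It often reduces overfitting noticeably.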
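The mechanics can be sketched in NumPy. The function below is an illustrative re-implementation of inverted dropout, the variant Keras uses, not Keras's actual code: units are zeroed at random during training and the survivors are rescaled, while inference passes values through unchanged.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: drop units with probability `rate` and rescale the rest."""
    if not training or rate == 0.0:
        return activations                      # inference: identity
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob       # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones((1000, 100))

train_out = dropout(a, rate=0.5, training=True, rng=rng)
test_out = dropout(a, rate=0.5, training=False, rng=rng)

print(f"Fraction zeroed in training: {(train_out == 0).mean():.3f}")
print(f"Mean activation in training: {train_out.mean():.3f}")
print(f"Mean activation at inference: {test_out.mean():.3f}")
```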
Exercise 10.2: Why use Conv2D + MaxPooling instead of just Dense layers for images?
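(1) Parameter efficiency: A 28×28 image flattened = 784 inputs. A Dense layer with 128 neurons needs about 100k parameters (784×128 weights + 128 biases). A Conv2D layer with 32 3×3 filters on a single-channel image needs only 320. (2) Spatial features: Convolution preserves spatial relationships such as edges, corners, and textures. Dense layers on flattened pixels lose this. (3) Translation robustness: The same filter is applied at every position, so a feature is detected wherever it appears, and pooling makes the result less sensitive to small shifts.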
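The parameter counts quoted above follow from two small formulas, sketched here with hypothetical helper names: a dense layer has one weight per input-unit pair plus one bias per unit, and a convolutional layer has one weight per kernel cell, input channel, and filter, plus one bias per filter.

```python
def dense_params(n_inputs, n_units):
    return n_inputs * n_units + n_units          # weights + biases

def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

dense = dense_params(28 * 28, 128)
conv = conv2d_params(3, 3, 1, 32)
print(f"Dense(128) on 784 inputs: {dense:,} parameters")        # 100,480
print(f"Conv2D(32, (3,3)) on 1 channel: {conv:,} parameters")   # 320
```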
Exercise 10.3: When would you use LSTM instead of a simple Dense network?
Use LSTM when data has temporal/sequential dependencies: (1) Time series forecasting (stock prices, weather). (2) Natural language processing (word order matters). (3) Speech recognition. (4) Video analysis (frame sequences). Dense networks have no built-in notion of order: each input position gets its own weights, so a pattern learned at one time step does not carry over to another.
Chapter Summary
- Keras provides a high-level API for building deep learning models quickly
- CNNs use convolution and pooling layers for image feature extraction
- LSTMs handle sequential data with memory gates
- Dropout is a simple, widely used way to reduce overfitting in deep networks
Natural Language Processing
Learning Objectives
- Understand text preprocessing: tokenization, stopwords, stemming
- Learn TF-IDF and Bag of Words representations
- Understand word embeddings (Word2Vec, GloVe)
- Build a sentiment analysis project
Text Preprocessing Pipeline
Python
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
def preprocess_text(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # 3. Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
# Example
raw = "This movie was AMAZING!!! I loved it 100%. Best film of 2024."
clean = preprocess_text(raw)
print(clean) # "this movie was amazing i loved it best film of"
Bag of Words & TF-IDF
Python
corpus = [
    "machine learning is amazing",
    "deep learning is a subset of machine learning",
    "natural language processing uses machine learning"
]
# Bag of Words — simple word counts
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print("Vocabulary:", bow.get_feature_names_out())
print("BoW Matrix:\n", X_bow.toarray())
# TF-IDF — weights by importance
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print("\nTF-IDF Matrix:\n", X_tfidf.toarray().round(3))
TF-IDF Intuition
TF (Term Frequency): How often a word appears in a document. IDF (Inverse Document Frequency): How rare a word is across all documents. Common words like "is", "the" get low IDF. Rare, distinctive words get high IDF. TF-IDF = TF × IDF.
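The TF × IDF product can be reproduced by hand. The sketch below assumes scikit-learn's default settings, where IDF is smoothed as ln((1 + n) / (1 + df)) + 1 and each document vector is L2-normalised, and checks the manual result against TfidfVectorizer. Textbook definitions of IDF differ slightly from this variant.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "machine learning is amazing",
    "deep learning is a subset of machine learning",
    "natural language processing uses machine learning",
]

counts = CountVectorizer().fit_transform(corpus).toarray()   # raw term counts (TF)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                    # number of documents containing each term

idf = np.log((1 + n_docs) / (1 + df)) + 1        # scikit-learn's smoothed IDF
tfidf_manual = counts * idf
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)   # L2-normalise rows

tfidf_sklearn = TfidfVectorizer().fit_transform(corpus).toarray()
print("Manual result matches TfidfVectorizer:", np.allclose(tfidf_manual, tfidf_sklearn))
```

Words that occur in every document, such as "learning" and "machine" here, receive the lowest IDF, which is the behaviour the intuition above describes.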
Word Embeddings
Word embeddings represent words as dense vectors where similar words have similar vectors. King - Man + Woman ≈ Queen.
Python
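# Trainable embedding layer in Keras (these vectors are learned from scratch, not pre-trained)
from tensorflow.keras.layers import Embedding
# Embedding layer: vocab_size → embedding_dim
# Learns a 128-dim vector for each of 10000 word indices
# The sequence length (e.g. 200 tokens) is inferred from the input;
# the older input_length argument is deprecated in Keras 3
embedding = Embedding(input_dim=10000, output_dim=128)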
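The King - Man + Woman ≈ Queen analogy is vector arithmetic followed by a nearest-neighbour search using cosine similarity. The sketch below uses hand-made 3-dimensional toy vectors, which are hypothetical values and not real Word2Vec or GloVe embeddings, purely to show the mechanics.

```python
import numpy as np

# Hand-made 3-d vectors (dimensions loosely: royalty, male, female).
# These are hypothetical toy values, not real Word2Vec or GloVe embeddings.
vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vectors["king"] - vectors["man"] + vectors["woman"]

# Rank the remaining words by similarity to the target vector
candidates = {w: cosine(target, v) for w, v in vectors.items()
              if w not in {"king", "man", "woman"}}
best = max(candidates, key=candidates.get)
print(best, round(candidates[best], 3))
```

With real embeddings the match is approximate rather than exact, and the input words are excluded from the search in the same way.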
Project: Movie Review Sentiment Analysis
Python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Sample movie reviews
reviews = [
    "This movie was fantastic! Great acting and storyline.",
    "Terrible film. Waste of time and money.",
    "Loved every moment. A masterpiece of cinema.",
    "Boring and predictable. Would not recommend.",
    "An incredible journey with amazing performances.",
    "The worst movie I have ever seen. Awful.",
    "Brilliant storytelling and breathtaking visuals.",
    "Dull plot with terrible dialogue throughout.",
    "A beautiful and moving cinematic experience.",
    "Completely disappointing. Save your money.",
    "Outstanding performances by the entire cast.",
    "Unwatchable garbage from start to finish.",
    "One of the best films of the decade.",
    "Painfully slow and utterly forgettable.",
    "A thrilling and emotionally powerful movie.",
    "So bad I walked out of the theater.",
]
# 1=positive, 0=negative
sentiments = [1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0]
# TF-IDF features
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf.fit_transform(reviews)
y = np.array(sentiments)
# Train model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2f}")
# Test on new reviews
new_reviews = [
    "This is an absolutely wonderful and amazing movie!",
    "I hated this film, it was terrible and boring.",
    "A decent movie with some good moments."
]
X_new = tfidf.transform(new_reviews)
predictions = model.predict(X_new)
for review, pred in zip(new_reviews, predictions):
    sentiment = "POSITIVE ✓" if pred == 1 else "NEGATIVE ✗"
    print(f"[{sentiment}] {review}")
Exercises
Exercise 11.1: What's the difference between Bag of Words and TF-IDF?
Bag of Words: Simply counts word occurrences. "the" appears 5 times → score 5. Problem: common words dominate.
TF-IDF: Weights word counts by how rare they are across documents. "the" appears everywhere → low IDF → low score. "amazing" appears rarely → high IDF → high score. TF-IDF better captures word importance.
Exercise 11.2: Why are word embeddings better than one-hot encoding?
(1) Dense vs. sparse: One-hot vectors are huge sparse vectors (10000-dim for 10000 words). Embeddings are dense 128-300 dim. (2) Semantic meaning: Embeddings capture meaning — similar words have similar vectors. One-hot treats every word as equally different. (3) Generalization: "excellent" and "wonderful" have similar embeddings, so the model can generalize from one to the other.
Exercise 11.3: What preprocessing steps would you add for tweets?
(1) Remove @mentions and #hashtags (or extract them as features). (2) Remove URLs. (3) Handle emojis (convert to text or use as features). (4) Expand contractions (can't → cannot). (5) Handle slang/abbreviations (lol, brb). (6) Remove or handle repeated characters ("sooooo" → "so").
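Several of these steps can be sketched with regular expressions. The function below is a minimal illustration: the name preprocess_tweet, the single contraction rule, and the example tweet are all made up here. The repeated-character rule is deliberately crude and would also shorten legitimate words.

```python
import re

def preprocess_tweet(text):
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)        # remove URLs
    text = re.sub(r'@\w+', '', text)                # remove @mentions
    text = re.sub(r'#(\w+)', r'\1', text)           # keep the hashtag word, drop '#'
    text = re.sub(r"can't", 'cannot', text)         # one example contraction
    text = re.sub(r'(.)\1{2,}', r'\1', text)        # "sooooo" -> "so"
    text = re.sub(r'[^a-z\s]', '', text)            # drop punctuation, digits, emojis
    return re.sub(r'\s+', ' ', text).strip()

tweet = "@bob I can't believe it!!! Sooooo good #MustWatch https://example.com/x"
clean_tweet = preprocess_tweet(tweet)
print(clean_tweet)   # i cannot believe it so good mustwatch
```

A production pipeline would more likely use a contraction dictionary and an emoji-to-text mapping instead of dropping emojis.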
Chapter Summary
- Text preprocessing (lowercasing, cleaning, stopword removal) is essential before modeling
- TF-IDF weights word importance better than raw counts
- Word embeddings capture semantic meaning in dense vectors
- Sentiment analysis combines text features with classification models
Model Evaluation, Tuning & Deployment
Learning Objectives
- Master cross-validation for robust model evaluation
- Tune hyperparameters with Grid Search and Random Search
- Understand the full confusion matrix and classification metrics
- Deploy a model as a REST API with Flask
Cross-Validation
Instead of a single train-test split, k-fold cross-validation splits data into k parts, trains on k-1, tests on 1, and rotates. This gives a more reliable performance estimate.
Python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold cross-validation
scores = cross_val_score(model, iris.data, iris.target, cv=5, scoring='accuracy')
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")
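What cross_val_score does internally can be written out as an explicit loop. The sketch below uses StratifiedKFold with shuffling, which is an assumption made here so that every fold contains all three iris classes, and prints the split sizes and score for each fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X[train_idx], y[train_idx])            # train on k-1 folds
    score = model.score(X[test_idx], y[test_idx])    # test on the held-out fold
    fold_scores.append(score)
    print(f"Fold {fold}: {len(train_idx)} train / {len(test_idx)} test, accuracy={score:.3f}")

print(f"Mean: {np.mean(fold_scores):.3f}")
```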
Hyperparameter Tuning
Python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Grid Search — exhaustive search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(iris.data, iris.target)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
best_model = grid_search.best_estimator_
Comprehensive Evaluation Metrics
Python
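from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.base import clone
# Hold out a test set, then refit the tuned model on the training part only.
# best_model above was fitted on all of iris, so scoring it on that data would be optimistic.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target)
eval_model = clone(best_model).fit(X_train, y_train)
y_pred = eval_model.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Rows are actual classes, columns are predicted classes.
# Iris has 3 classes, so this matrix is 3×3. For a binary problem the layout is:
#               Predicted
#               Neg    Pos
# Actual Neg   [TN]   [FP]
# Actual Pos   [FN]   [TP]
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.3f}")
print(f"Recall: {recall_score(y_test, y_pred, average='weighted'):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred, average='weighted'):.3f}")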
Metric Cheat Sheet
Accuracy: (TP+TN)/Total — overall correctness. Misleading with imbalanced data.
Precision: TP/(TP+FP) — "of predicted positives, how many are correct?" Use when FP is costly.
Recall: TP/(TP+FN) — "of actual positives, how many did we catch?" Use when FN is costly (disease detection).
F1 Score: 2×(P×R)/(P+R), the harmonic mean of precision and recall. A useful single-number summary for imbalanced datasets when precision and recall matter equally.
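The formulas can be verified on a small example. The sketch below uses made-up binary labels, computes each metric from the confusion-matrix counts, and cross-checks the results against scikit-learn.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up binary labels for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} Recall={recall:.3f} F1={f1:.3f}")

# Cross-check against scikit-learn's implementations
manual = (accuracy, precision, recall, f1)
library = (accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
           recall_score(y_true, y_hat), f1_score(y_true, y_hat))
print("Matches scikit-learn:", all(abs(m - s) < 1e-12 for m, s in zip(manual, library)))
```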
Overfitting vs. Underfitting
| Problem | Symptom | Solution |
|---|---|---|
| Overfitting | High train accuracy, low test accuracy | More data, regularization, dropout, simpler model (use cross-validation to detect it) |
| Underfitting | Low train and test accuracy | More features, complex model, less regularization, more training |
Project: Deploy ML Model with Flask
Python
# save_model.py — Train and save model
import pickle
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(iris.data, iris.target)
# Save model to file
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
print("Model saved!")
Python
# app.py — Flask API
from flask import Flask, request, jsonify
import pickle
import numpy as np
app = Flask(__name__)
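# Load trained model (only unpickle files you created or trust)
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
species = ['setosa', 'versicolor', 'virginica']
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(silent=True)
    feats = data.get('features') if isinstance(data, dict) else None
    # Reject malformed requests with a 400 instead of crashing with a 500
    if not isinstance(feats, list) or len(feats) != 4:
        return jsonify({'error': 'features must be a list of 4 numbers'}), 400
    features = np.array(feats, dtype=float).reshape(1, -1)
    prediction = int(model.predict(features)[0])
    probability = model.predict_proba(features)[0].tolist()
    return jsonify({
        'prediction': species[prediction],
        'confidence': max(probability),
        'probabilities': dict(zip(species, probability))
    })
@app.route('/health')
def health():
    return jsonify({'status': 'healthy'})
if __name__ == '__main__':
    # debug=True is for local development only; disable it in production
    app.run(debug=True, port=5000)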
bash
# Test the API
curl -X POST http://localhost:5000/predict \
-H "Content-Type: application/json" \
-d '{"features": [5.1, 3.5, 1.4, 0.2]}'
# Response:
# {"prediction": "setosa", "confidence": 1.0, "probabilities": {...}}
Exercises
Exercise 12.1: You have 95% accuracy on a fraud detection dataset where 98% are non-fraud. Is this good?
No! A naive model predicting "non-fraud" always achieves 98% accuracy. Your 95% is actually worse than the naive baseline. For imbalanced datasets, use Precision, Recall, F1-Score, and AUC-ROC instead. Focus on recall (catching actual fraud) and precision (not flagging legitimate transactions).
Exercise 12.2: Grid Search is very slow. What alternatives exist?
RandomizedSearchCV: Samples random combinations instead of trying all. Often finds good results in a fraction of the time.
Bayesian Optimization (Optuna, Hyperopt): Uses past results to intelligently pick the next parameters to try. Much more efficient.
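Halving Grid Search: Starts with all candidates on a small subset, eliminates poor performers, and allocates more resources to promising ones. In scikit-learn this is HalvingGridSearchCV, which is still experimental and must be enabled by importing enable_halving_search_cv from sklearn.experimental.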
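As an example of the first alternative, the sketch below runs RandomizedSearchCV over roughly the same search space as the grid search earlier in the chapter. The specific distributions and n_iter=10 are illustrative choices: only 10 combinations are sampled, versus 36 in the full grid.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_distributions = {
    'n_estimators': randint(50, 201),        # integers from 50 to 200
    'max_depth': [3, 5, 10, None],           # lists are sampled uniformly
    'min_samples_split': randint(2, 11),     # integers from 2 to 10
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,                # number of sampled combinations
    cv=5, scoring='accuracy', random_state=42
)
random_search.fit(X, y)
print(f"Best params: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.3f}")
```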
Exercise 12.3: What should you consider before deploying a model to production?
(1) Model versioning: Track which model version is live. (2) Monitoring: Track prediction accuracy, latency, data drift over time. (3) A/B testing: Compare new model vs. existing before full rollout. (4) Input validation: Handle invalid or adversarial inputs. (5) Scalability: Can it handle peak traffic? (6) Fallback: What happens if the model fails? (7) Fairness: Check for bias across demographic groups.
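Input validation, point (4) above, is often the easiest item to add. The sketch below is one hypothetical way to check a request payload for the iris model before it reaches model.predict. The function name, error messages, and the non-negativity rule are assumptions for illustration.

```python
import math

N_FEATURES = 4   # the iris model expects four measurements

def validate_features(payload):
    """Return (features, None) if the payload is usable, else (None, error message)."""
    if not isinstance(payload, dict) or 'features' not in payload:
        return None, "JSON body must contain a 'features' field"
    features = payload['features']
    if not isinstance(features, list) or len(features) != N_FEATURES:
        return None, f"'features' must be a list of {N_FEATURES} numbers"
    for value in features:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None, "all features must be numeric"
        if not math.isfinite(value) or value < 0:
            return None, "features must be finite, non-negative measurements"
    return [float(v) for v in features], None

print(validate_features({'features': [5.1, 3.5, 1.4, 0.2]}))
print(validate_features({'features': [5.1, 'abc', 1.4, 0.2]}))
print(validate_features({}))
```

In a Flask route, a non-None error message would typically be returned with a 400 status code.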
Chapter Summary
- Cross-validation gives reliable performance estimates across multiple data splits
- Grid Search and Random Search look for good hyperparameters within the search space you define
- Precision, Recall, and F1 are more informative than accuracy for imbalanced datasets
- Flask makes it easy to deploy ML models as REST APIs
- Production deployment requires monitoring, versioning, and input validation
🎓 Congratulations!
You've completed the Machine Learning book. You now have the knowledge to build, evaluate, and deploy ML models. Keep practicing!
© 2025 EduArtha — Machine Learning Complete Guide