Neural Networks & Deep Learning
Chapter 12: Convolutional Neural Networks
Seeing the World
Reading Time: ~5 hours | Part IV: Architectures | Theory + Code Chapter
Prerequisites: Chapters 6–8 (Deep Networks, Backpropagation, Optimization), Basic Linear Algebra
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| Remember | Recall the convolution formula, output-size equations, and architectural details of LeNet-5, AlexNet, VGG-16, and ResNet |
| Understand | Explain why convolution preserves spatial structure, how pooling achieves translation invariance, and why CNNs need far fewer parameters than fully connected networks |
| Apply | Implement 2D convolution from scratch in NumPy and build a full CNN in TensorFlow/Keras for CIFAR-10 classification |
| Analyze | Compare feature maps across layers, diagnose overfitting vs. underfitting in CNN training curves, and analyze the effect of filter size and stride |
| Evaluate | Choose between training from scratch vs. transfer learning; select an appropriate pretrained backbone (VGG, ResNet, MobileNet) for a given deployment constraint |
| Create | Design an end-to-end CNN pipeline for Indian traffic sign recognition using transfer learning with MobileNetV2 |
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why flattening an image into a 1-D vector destroys spatial information and causes a parameter explosion (e.g., 224×224×3 = 150,528 inputs per neuron)
- Define the discrete 2D convolution operation, draw a kernel sliding over an input, and compute the output feature map dimensions: ⌊(n − f + 2p) / s⌋ + 1
- Distinguish between "valid" padding (no padding) and "same" padding (p = (f−1)/2) and explain when each is appropriate
- Describe how multiple convolutional filters produce multiple feature maps, each detecting different features (edges, textures, patterns)
- Compare Max Pooling and Average Pooling: their formulas, behaviour, and the fact that pooling layers have zero learnable parameters
- Sketch the canonical CNN pipeline: [CONV → ReLU → POOL] × N → Flatten → FC → Softmax and trace tensor shapes through each layer
- Summarise the evolution of classic architectures: LeNet-5 (1998), AlexNet (2012), VGG-16 (2014), ResNet (2015) and explain residual connections
- Implement 2D convolution from scratch in NumPy, verifying output against SciPy's correlate2d
- Build a complete CNN in TensorFlow/Keras for CIFAR-10 achieving ≥ 75% test accuracy
- Apply transfer learning using pretrained VGG-16 / ResNet-50 / MobileNetV2 to a custom Indian dataset with fine-tuning
Opening Hook: The Eyes of E-Commerce
50,000 Product Images Every Day, Classified in Milliseconds
Meesho, India's social commerce platform with 150 million+ monthly active users, onboards over 50,000 new seller product images every single day. Each image must be checked for quality: Is it blurry? Does the product fill at least 80% of the frame? Is the background clean? Are there watermarks or offensive content?
Before CNNs, a team of 200+ moderators manually reviewed each image at a cost of ₹3–5 per image. Today, a Convolutional Neural Network classifies each image in under 12 milliseconds on a single GPU, achieving 96.3% accuracy at a cost of ₹0.002 per image. That's roughly a 2,000× cost reduction.
The secret? CNNs don't just see pixels; they see edges, textures, shapes, and objects, much like the human visual cortex. This chapter teaches you exactly how.
Core Concepts
12.1 Why Not Flatten? The Parameter Explosion Problem
In Chapters 6–7 we built fully connected (dense) networks where every input neuron connects to every hidden neuron. This works beautifully for tabular data, but what happens when the input is an image?
The Numbers That Break Dense Networks
Consider a modest 224 × 224 colour image (the standard ImageNet input size):
- Total pixel values: 224 × 224 × 3 (RGB channels) = 150,528
- If the first hidden layer has 1,000 neurons: 150,528 × 1,000 = 150.5 million weights, just for the first layer!
- A 5-layer dense network on this input could easily exceed 500 million parameters
For 224×224×3 with 1,000 neurons: 150,528 × 1,000 = 150,528,000 parameters
Three Fatal Problems with Flattening
| Problem | What Goes Wrong | Consequence |
|---|---|---|
| 1. Spatial destruction | Flattening converts a 2D grid into a 1D vector, so pixel (0,0) is now equally "far" from pixel (0,1) and pixel (223,223) | The network cannot learn that nearby pixels are related |
| 2. Parameter explosion | 150K+ inputs per neuron means hundreds of millions of parameters for even moderate networks | Massive overfitting, huge memory requirements, slow training |
| 3. No translation invariance | A cat in the top-left corner and the same cat in the bottom-right are completely different inputs to a dense network | Network must see the same object at every possible position during training |
A dense network can technically learn from images, but it is extraordinarily wasteful. A ResNet-50 achieves about 76% top-1 ImageNet accuracy with 25.6M parameters. A comparable fully connected network would need billions of parameters and still perform worse because it cannot exploit spatial locality.
The solution? Convolutional Neural Networks (CNNs), which exploit three key ideas:
- Local connectivity: each neuron connects only to a small local region (not the entire input)
- Parameter sharing: the same set of weights (filter/kernel) is applied across the entire image
- Translation equivariance: if the input shifts, the output shifts by the same amount
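Translation equivariance can be checked numerically. The sketch below is illustrative only: `cross_correlate_valid` is a hypothetical helper written for this example (not a library function), and the 7×7 image is made up. It moves a bright blob one pixel to the right and confirms that the feature map moves by the same amount.

```python
import numpy as np

def cross_correlate_valid(image, kernel):
    """Naive cross-correlation with stride 1 and no padding."""
    f = kernel.shape[0]
    h_out = image.shape[0] - f + 1
    w_out = image.shape[1] - f + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# One kernel, reused at every position (parameter sharing).
kernel = np.array([[1.0, 0.0, -1.0]] * 3)

image = np.zeros((7, 7))
image[2:5, 2:4] = 1.0                        # a small bright blob
shifted = np.roll(image, shift=1, axis=1)    # same blob, one pixel to the right

out_original = cross_correlate_valid(image, kernel)
out_shifted = cross_correlate_valid(shifted, kernel)

# Equivariance: column j of the shifted output equals column j-1 of the original.
equivariant = np.allclose(out_shifted[:, 1:], out_original[:, :-1])
print("Output shifts with the input:", equivariant)
```

A dense layer has no such guarantee, because each input position is multiplied by its own independent weight.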
12.2 The Convolution Operation
Intuition: A Sliding Magnifying Glass
Imagine placing a small 3×3 magnifying glass (called a kernel or filter) on the top-left corner of an image. You multiply each pixel under the glass by the corresponding weight in the kernel, sum the results, and write the answer in a new grid called the feature map. Then you slide the glass one pixel to the right and repeat. When you reach the right edge, you move down one row and start from the left again.
Formal Definition: 2D Discrete Convolution (Cross-Correlation)
For an input matrix X of size n×n and a kernel K of size f×f, the output feature map Y is:
$$Y[i, j] = \sum_{a=0}^{f-1} \sum_{b=0}^{f-1} X[i + a,\; j + b] \cdot K[a, b]$$
Output dimension (no padding, stride=1): n_out = n − f + 1
For a 6×6 input with a 3×3 kernel: 6 − 3 + 1 = 4, giving a 4×4 output
Deep Learning Convention: In deep learning, what we call "convolution" is technically cross-correlation (no kernel flipping). This distinction doesn't matter in practice because the network learns the optimal kernel weights regardless of flipping.
Step-by-Step Example: Edge Detection
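Consider this 5×5 grayscale image and a vertical-edge-detection kernel:
The kernel produces large values exactly where the vertical edge occurs, where pixels transition from bright (10) to dark (0), and zero over uniform regions. With the image and kernel used in the NumPy demo later in this chapter, each window that covers the edge sums to 3 × (10 − 0) = 30. This is the magic of convolution: simple element-wise multiplication and summation can detect meaningful visual patterns.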
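The arithmetic can be checked with a few lines of NumPy. This is a minimal sketch, assuming the same 5×5 image (three bright columns, two dark) and 3×3 kernel that the NumPy demo later in this chapter uses; the variable names are illustrative.

```python
import numpy as np

# Left three columns bright (10), right two dark (0): a vertical edge.
image = np.array([[10, 10, 10, 0, 0]] * 5, dtype=float)

# Positive weights on the left, negative on the right.
kernel = np.array([[1, 0, -1]] * 3, dtype=float)

f = kernel.shape[0]
n_out = image.shape[0] - f + 1            # 5 - 3 + 1 = 3
feature_map = np.zeros((n_out, n_out))
for i in range(n_out):
    for j in range(n_out):
        patch = image[i:i + f, j:j + f]   # 3x3 region under the kernel
        feature_map[i, j] = np.sum(patch * kernel)

print(feature_map)
# [[ 0. 30. 30.]
#  [ 0. 30. 30.]
#  [ 0. 30. 30.]]
```

The first column is 0 because that window sees only bright pixels; each window that overlaps the bright-to-dark transition sums to 3 × (10 − 0) = 30.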
12.3 Padding and Stride
The Shrinking Problem
With the basic convolution formula (n − f + 1), each layer shrinks the spatial dimensions. A 32×32 input through 10 successive 3×3 convolutions becomes: 30 → 28 → 26 → ... → 12×12. We run out of spatial information fast! Also, corner pixels contribute to only 1 output position, while center pixels contribute to f² positions, so edges are severely under-represented.
Padding: Preserving Spatial Dimensions
Valid vs. Same Padding
Valid Padding (p = 0): No padding applied. Output shrinks: n_out = n − f + 1. Use when you deliberately want dimensionality reduction.
Same Padding (p = (f − 1) / 2): Pad the input with zeros so output size equals input size (at stride 1). For a 3×3 kernel: p = (3−1)/2 = 1 (add 1 ring of zeros). For a 5×5 kernel: p = (5−1)/2 = 2.
Stride: Controlling the Step Size
Instead of sliding the kernel 1 pixel at a time, we can slide it by s pixels. Stride > 1 acts as a built-in downsampling mechanism.
n_out = ⌊(n + 2p − f) / s⌋ + 1
where: n = input size, f = filter size, p = padding, s = stride
Quick Examples
| Input (n) | Filter (f) | Padding (p) | Stride (s) | Output |
|---|---|---|---|---|
| 32 | 3 | 0 | 1 | ⌊(32+0−3)/1⌋+1 = 30 |
| 32 | 3 | 1 | 1 | ⌊(32+2−3)/1⌋+1 = 32 (same) |
| 32 | 5 | 2 | 1 | ⌊(32+4−5)/1⌋+1 = 32 (same) |
| 32 | 3 | 1 | 2 | ⌊(32+2−3)/2⌋+1 = 16 (halved) |
| 224 | 7 | 3 | 2 | ⌊(224+6−7)/2⌋+1 = 112 |
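The output-size formula translates directly into integer arithmetic. The helper below is a small sketch (the name `conv_output_size` is chosen for this example) that reproduces each row of the table; Python's `//` operator provides the floor.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size along one spatial axis: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# (n, f, p, s) for each row of the Quick Examples table
rows = [(32, 3, 0, 1), (32, 3, 1, 1), (32, 5, 2, 1), (32, 3, 1, 2), (224, 7, 3, 2)]
sizes = [conv_output_size(n, f, p, s) for n, f, p, s in rows]

for (n, f, p, s), out in zip(rows, sizes):
    print(f"n={n:3d}, f={f}, p={p}, s={s} -> {out}")
# sizes == [30, 32, 32, 16, 112]
```

The floor matters whenever (n + 2p − f) is not divisible by s, as in the last two rows, where the exact quotients are 15.5 and 111.5.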
12.4 Multiple Filters → Feature Maps
A single filter detects one type of feature (e.g., vertical edges). But images contain horizontal edges, curves, textures, colours, and complex patterns. The solution: use multiple filters.
Filters, Channels, and Feature Maps
An RGB image has 3 channels. A single filter must also have 3 channels: K is f × f × 3. The filter performs element-wise multiplication across all channels simultaneously and sums everything into a single output value.
n_f Filters → n_f Feature Maps: If we use n_f = 32 filters on an input of size H × W × C, the output is H_out × W_out × 32. Each filter produces one feature map (one "channel" of the output).
Parameter Count: Each filter has f × f × C_in weights + 1 bias = f²·C_in + 1
Total for n_f filters: n_f × (f² × C_in + 1)
Parameter Count Example
| Layer | Input Channels | Filters | Kernel Size | Parameters |
|---|---|---|---|---|
| Conv1 | 3 (RGB) | 32 | 3×3 | 32 × (3×3×3 + 1) = 896 |
| Conv2 | 32 | 64 | 3×3 | 64 × (3×3×32 + 1) = 18,496 |
| Conv3 | 64 | 128 | 3×3 | 128 × (3×3×64 + 1) = 73,856 |
Total: 93,248 parameters. Compare that to the 150 million for a single fully connected layer! This is the power of parameter sharing.
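The parameter formula is easy to verify in code. The sketch below uses a hypothetical helper, `conv2d_params`, to recompute the three rows of the table and compares the total with the dense first layer from Section 12.1.

```python
def conv2d_params(c_in, n_filters, f):
    """Each filter has f*f*c_in weights plus one bias."""
    return n_filters * (f * f * c_in + 1)

# (name, input channels, number of filters, kernel size)
conv_layers = [("Conv1", 3, 32, 3), ("Conv2", 32, 64, 3), ("Conv3", 64, 128, 3)]

per_layer = {name: conv2d_params(c_in, n_f, f) for name, c_in, n_f, f in conv_layers}
conv_total = sum(per_layer.values())

# Weights of one dense layer: 224*224*3 inputs connected to 1,000 neurons
dense_first_layer = 224 * 224 * 3 * 1000

print(per_layer)                       # {'Conv1': 896, 'Conv2': 18496, 'Conv3': 73856}
print("Conv total:", conv_total)       # 93248
print("Dense first layer:", dense_first_layer)
```

Note that the convolutional count does not depend on the image size at all: the same 93,248 parameters apply to a 32×32 or a 224×224 input.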
12.5 Pooling Layers: Downsample Without Learning
Pooling layers reduce spatial dimensions while retaining the most important information. They have zero learnable parameters, which makes them computationally cheap and means the pooling operation itself has nothing to overfit.
Max Pooling vs. Average Pooling
Max Pooling: Takes the maximum value from each pooling window. Captures the most prominent feature in each region. Acts as a "was this feature present anywhere in this region?" detector.
Average Pooling: Takes the mean value from each pooling window. Smooths features. Often used as the final layer (Global Average Pooling) before the classifier to replace fully connected layers.
Hyperparameters: Typical values are f = 2, s = 2, which halves the spatial dimensions. No padding is typically used.
Learnable parameters: 0 (just a fixed operation)
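To make the difference between the two operations concrete, here is a small sketch on a hand-written 4×4 feature map. The helper `pool2d` is written for this example and assumes non-overlapping windows (stride equal to window size) and input sides divisible by the window size.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling on a 2D array (stride == size)."""
    h, w = x.shape
    # blocks[i, a, j, b] == x[size*i + a, size*j + b]
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [5, 6, 1, 2],
              [7, 2, 9, 4],
              [1, 0, 3, 8]], dtype=float)

max_pooled = pool2d(x, size=2, mode="max")
avg_pooled = pool2d(x, size=2, mode="mean")

print(max_pooled)   # [[6. 2.]
                    #  [7. 9.]]
print(avg_pooled)   # [[3.75 1.25]
                    #  [2.5  6.  ]]
```

Both outputs are 2×2, and neither call involves any weights: the result depends only on the input values.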
Why Pooling Helps
- Reduces computation: Halving both spatial dims reduces FLOPs by 4× in the next layer
- Translation invariance: A small shift in input doesn't change max-pool output much
- Controls overfitting: Reduces the number of values the network must process
- Increases receptive field: After pooling, each neuron "sees" a larger region of the original input
12.6 The Full CNN Architecture
A complete CNN follows a canonical pipeline that transforms raw pixels into class probabilities:
[CONV → ReLU → POOL] × N → Flatten → FC → Softmax
Shape Trace Through a Simple CNN
| Layer | Operation | Output Shape | Parameters |
|---|---|---|---|
| Input | - | 32 × 32 × 3 | 0 |
| Conv1 | 32 filters, 3×3, same, s=1 | 32 × 32 × 32 | 896 |
| ReLU1 | max(0, x) | 32 × 32 × 32 | 0 |
| Pool1 | MaxPool 2×2, s=2 | 16 × 16 × 32 | 0 |
| Conv2 | 64 filters, 3×3, same, s=1 | 16 × 16 × 64 | 18,496 |
| ReLU2 | max(0, x) | 16 × 16 × 64 | 0 |
| Pool2 | MaxPool 2×2, s=2 | 8 × 8 × 64 | 0 |
| Conv3 | 128 filters, 3×3, same, s=1 | 8 × 8 × 128 | 73,856 |
| ReLU3 | max(0, x) | 8 × 8 × 128 | 0 |
| Pool3 | MaxPool 2×2, s=2 | 4 × 4 × 128 | 0 |
| Flatten | Reshape | 2,048 | 0 |
| FC1 | Dense(128) | 128 | 262,272 |
| Output | Dense(10), softmax | 10 | 1,290 |
| Total | | | 356,810 |
Compare: A fully connected network on 32×32×3 input with similar hidden sizes would need millions of parameters. The CNN achieves comparable performance with ~357K parameters, a 10–30× reduction.
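The shape trace above can be reproduced programmatically. The following sketch walks the same three CONV → POOL blocks and the two dense layers using only the output-size and parameter formulas from Sections 12.3 and 12.4; `conv_out` is a helper defined for this example.

```python
def conv_out(n, f, p, s):
    """Spatial output size: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

size, channels = 32, 3            # input: 32 x 32 x 3
total_params = 0

for n_filters in (32, 64, 128):
    # 3x3 'same' convolution (p = 1, s = 1): spatial size unchanged
    total_params += n_filters * (3 * 3 * channels + 1)
    size, channels = conv_out(size, f=3, p=1, s=1), n_filters
    print(f"Conv -> {size} x {size} x {channels}")
    # 2x2 max pooling with stride 2: spatial size halves, no parameters
    size = conv_out(size, f=2, p=0, s=2)
    print(f"Pool -> {size} x {size} x {channels}")

flat = size * size * channels                 # Flatten: 4 * 4 * 128
total_params += flat * 128 + 128              # Dense(128)
total_params += 128 * 10 + 10                 # Dense(10)

print("Flattened length:", flat)              # 2048
print("Total parameters:", total_params)      # 356810
```

Note that FC1 alone accounts for 262,272 of the 356,810 parameters, which is one reason later architectures replace Flatten + FC with global average pooling.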
12.7 Classic Architectures: From LeNet to ResNet
The Evolution Timeline
Residual Connections: The Key Innovation
The central problem with very deep networks: gradients vanish as they propagate through many layers. ResNet's solution is elegantly simple:
Output = F(x) + x
Instead of learning H(x) directly, learn the residual F(x) = H(x) − x
If the optimal transformation is near-identity, F(x) ≈ 0 is easy to learn!
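The "easy identity" argument can be illustrated numerically. The sketch below is a simplification and not ResNet's actual block: it uses two dense matrices as a stand-in for the convolutional layers inside F, omits batch normalisation, and omits the ReLU that ResNet applies after the addition. All names are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, w1, w2):
    """y = F(x) + x, with F(x) = relu(x @ w1) @ w2 as a stand-in for conv layers."""
    return relu(x @ w1) @ w2 + x

def plain_block(x, w1, w2):
    """Same layers without the skip connection: y = F(x)."""
    return relu(x @ w1) @ w2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # batch of 4 feature vectors
w1 = rng.normal(size=(8, 8))
w2_zero = np.zeros((8, 8))           # last layer's weights driven to zero

# With F(x) = 0, the residual block passes its input through unchanged ...
identity_recovered = np.allclose(residual_block(x, w1, w2_zero), x)
# ... while the plain block outputs all zeros and the signal is lost.
plain_is_zero = np.allclose(plain_block(x, w1, w2_zero), 0.0)

print("Residual block acts as identity:", identity_recovered)
print("Plain block collapses to zero:  ", plain_is_zero)
```

The same skip path also gives gradients a direct route backwards: the derivative of F(x) + x with respect to x always contains an identity term.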
| Architecture | Year | Layers | Parameters | Top-5 Error | Key Innovation |
|---|---|---|---|---|---|
| LeNet-5 | 1998 | 7 | 60K | n/a (MNIST) | First practical CNN |
| AlexNet | 2012 | 8 | 60M | 16.4% | ReLU, Dropout, GPU |
| VGG-16 | 2014 | 16 | 138M | 7.3% | Small 3×3 filters throughout |
| GoogLeNet | 2014 | 22 | 5M | 6.7% | Inception modules, 1×1 conv |
| ResNet-50 | 2015 | 50 | 25.6M | 3.57% (ResNet ensemble, ILSVRC 2015) | Residual connections |
| MobileNetV2 | 2018 | 53 | 3.4M | 8.9% | Depthwise separable convs |
12.8 Transfer Learning: Standing on the Shoulders of Giants
Training a CNN from scratch on ImageNet takes 2–4 weeks on 8 GPUs and costs approximately ₹5–15 lakh in cloud compute. Transfer learning lets you reuse this expensive pre-training at no extra cost.
Transfer Learning: The Three Strategies
Strategy 1: Feature Extraction (Small dataset, < 5K images): Freeze all convolutional layers of a pretrained model. Remove the final classification head. Add your own FC layers for your task. Only train the new FC layers.
Strategy 2: Fine-Tuning (Medium dataset, 5K–100K images): Start with a pretrained model. Freeze early layers (generic features like edges). Unfreeze later layers (task-specific features). Train with a very small learning rate (1/10th to 1/100th of original).
Strategy 3: Full Retraining (Large dataset, > 100K images): Use pretrained weights as initialisation (instead of random init). Train all layers with a moderate learning rate. Converges faster than random init but adapts fully to your data.
Why Transfer Learning Works
CNN layers learn a hierarchy of features:
Early layers learn universal features that are useful for any vision task. Only the later layers need to be adapted for your specific problem.
From-Scratch Code: NumPy 2D Convolution
Let's implement the core 2D convolution operation from scratch, then verify it against SciPy.
4.1 Single-Channel 2D Convolution
Python
import numpy as np

def conv2d(image, kernel, padding=0, stride=1):
    """
    Perform 2D convolution (cross-correlation) from scratch.

    Parameters:
        image  : np.array of shape (H, W)
        kernel : np.array of shape (f, f)
        padding: int, number of zero-padding rings
        stride : int, step size
    Returns:
        output : np.array of shape (H_out, W_out)
    """
    # Step 1: Pad the input image
    if padding > 0:
        image = np.pad(image, pad_width=padding, mode='constant', constant_values=0)
    H, W = image.shape
    f = kernel.shape[0]

    # Step 2: Calculate output dimensions
    H_out = (H - f) // stride + 1
    W_out = (W - f) // stride + 1

    # Step 3: Initialize output
    output = np.zeros((H_out, W_out))

    # Step 4: Slide kernel and compute element-wise multiply + sum
    for i in range(H_out):
        for j in range(W_out):
            h_start = i * stride
            w_start = j * stride
            receptive_field = image[h_start:h_start+f, w_start:w_start+f]
            output[i, j] = np.sum(receptive_field * kernel)
    return output

# --- Demo: Vertical edge detection ---
image = np.array([
    [10, 10, 10, 0, 0],
    [10, 10, 10, 0, 0],
    [10, 10, 10, 0, 0],
    [10, 10, 10, 0, 0],
    [10, 10, 10, 0, 0]
], dtype=np.float64)

vertical_edge_kernel = np.array([
    [ 1, 0, -1],
    [ 1, 0, -1],
    [ 1, 0, -1]
], dtype=np.float64)

result = conv2d(image, vertical_edge_kernel, padding=0, stride=1)
print("Output shape:", result.shape)   # (3, 3)
print("Feature map:\n", result)
4.2 Multi-Channel Convolution (RGB Support)
Python
def conv2d_multichannel(image, kernels, bias, padding=0, stride=1):
    """
    Multi-channel, multi-filter 2D convolution.

    Parameters:
        image   : np.array of shape (H, W, C_in)
        kernels : np.array of shape (n_filters, f, f, C_in)
        bias    : np.array of shape (n_filters,)
        padding : int
        stride  : int
    Returns:
        output  : np.array of shape (H_out, W_out, n_filters)
    """
    n_filters = kernels.shape[0]
    f = kernels.shape[1]

    # Pad spatial dimensions only (not channels)
    if padding > 0:
        image = np.pad(image,
                       [(padding, padding), (padding, padding), (0, 0)],
                       mode='constant')
    H, W, C = image.shape
    H_out = (H - f) // stride + 1
    W_out = (W - f) // stride + 1
    output = np.zeros((H_out, W_out, n_filters))

    for k in range(n_filters):
        for i in range(H_out):
            for j in range(W_out):
                h_s, w_s = i * stride, j * stride
                patch = image[h_s:h_s+f, w_s:w_s+f, :]   # (f, f, C_in)
                output[i, j, k] = np.sum(patch * kernels[k]) + bias[k]
    return output

# --- Demo: 2 filters on a 5x5 RGB image ---
np.random.seed(42)
rgb_image = np.random.randint(0, 255, (5, 5, 3)).astype(np.float64)
kernels = np.random.randn(2, 3, 3, 3)   # 2 filters, each 3x3x3
biases = np.zeros(2)

output = conv2d_multichannel(rgb_image, kernels, biases, padding=1, stride=1)
print("Input shape :", rgb_image.shape)   # (5, 5, 3)
print("Output shape:", output.shape)      # (5, 5, 2): same padding, 2 filters
4.3 Max Pooling From Scratch
Python
def max_pool2d(feature_map, pool_size=2, stride=2):
    """
    Max pooling on a multi-channel feature map.

    Parameters:
        feature_map : np.array of shape (H, W, C)
        pool_size   : int, size of pooling window
        stride      : int, step size
    Returns:
        output      : np.array of shape (H_out, W_out, C)
    """
    H, W, C = feature_map.shape
    H_out = (H - pool_size) // stride + 1
    W_out = (W - pool_size) // stride + 1
    output = np.zeros((H_out, W_out, C))

    for i in range(H_out):
        for j in range(W_out):
            h_s, w_s = i * stride, j * stride
            window = feature_map[h_s:h_s+pool_size, w_s:w_s+pool_size, :]
            output[i, j, :] = np.max(window, axis=(0, 1))
    return output

# --- Demo ---
fmap = np.random.randn(4, 4, 2)
pooled = max_pool2d(fmap, pool_size=2, stride=2)
print("Before pooling:", fmap.shape)    # (4, 4, 2)
print("After pooling :", pooled.shape)  # (2, 2, 2)
4.4 Verification Against SciPy
Python
from scipy.signal import correlate2d

# Our implementation
our_result = conv2d(image, vertical_edge_kernel, padding=0, stride=1)

# SciPy's implementation (cross-correlation, valid mode)
scipy_result = correlate2d(image, vertical_edge_kernel, mode='valid')

print("Match:", np.allclose(our_result, scipy_result))
print("Max difference:", np.max(np.abs(our_result - scipy_result)))
The from-scratch implementation above uses nested Python for loops and runs in O(H × W × f²) time per filter, which is far too slow for real images. Production frameworks (TensorFlow, PyTorch) use highly optimised C++/CUDA implementations with im2col or Winograd transforms that run 100–1000× faster. The from-scratch code is for understanding the algorithm, not for production use.
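To give a feel for the im2col idea, here is a minimal single-channel sketch (stride 1, no padding). It is not how any particular framework implements convolution; it only shows the core trick of turning every patch into a row so that the whole feature map comes from one matrix product. It assumes NumPy 1.20 or later for `sliding_window_view`, and the function names are chosen for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_loops(image, kernel):
    """Reference: nested Python loops (stride 1, no padding)."""
    f = kernel.shape[0]
    h_out, w_out = image.shape[0] - f + 1, image.shape[1] - f + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

def conv2d_im2col(image, kernel):
    """Same result via im2col: one row per patch, then one matrix-vector product."""
    f = kernel.shape[0]
    windows = sliding_window_view(image, (f, f))      # (H_out, W_out, f, f) view
    h_out, w_out = windows.shape[:2]
    columns = windows.reshape(h_out * w_out, f * f)   # each row is a flattened patch
    return (columns @ kernel.ravel()).reshape(h_out, w_out)

rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64))
kernel = rng.normal(size=(3, 3))

results_match = np.allclose(conv2d_loops(image, kernel), conv2d_im2col(image, kernel))
print("Loop and im2col results match:", results_match)
```

The trade-off is memory: the patch matrix stores each pixel up to f² times, in exchange for handing the arithmetic to an optimised matrix-multiply routine.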
Industry Code: Full CNN with TensorFlow/Keras (CIFAR-10)
Now let's build a production-quality CNN using TensorFlow/Keras on the CIFAR-10 dataset (60,000 32×32 colour images across 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).
5.1 Data Loading and Preprocessing
Python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
import numpy as np

# --- Load CIFAR-10 ---
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()

# --- Normalize pixels to [0, 1] ---
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

# --- Class names ---
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

print(f"Training set : {X_train.shape}")   # (50000, 32, 32, 3)
print(f"Test set     : {X_test.shape}")    # (10000, 32, 32, 3)
print(f"Pixel range  : [{X_train.min()}, {X_train.max()}]")
5.2 Data Augmentation
Python
# --- Data augmentation to reduce overfitting ---
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),           # factor is a fraction of a full turn: ±0.1 × 360° = ±36°
    layers.RandomTranslation(0.1, 0.1),   # ±10% shift in height and width
    layers.RandomZoom(0.1),               # ±10% zoom
], name="data_augmentation")
5.3 CNN Model Architecture
Python
def build_cnn(input_shape=(32, 32, 3), num_classes=10):
    """Build a CNN following the [CONV -> BN -> ReLU -> POOL] pattern."""
    inputs = layers.Input(shape=input_shape)
    x = data_augmentation(inputs)   # Augmentation (only active during training)

    # --- Block 1: 32 filters ---
    x = layers.Conv2D(32, (3,3), padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Conv2D(32, (3,3), padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.MaxPooling2D((2,2))(x)   # 32x32 -> 16x16
    x = layers.Dropout(0.25)(x)

    # --- Block 2: 64 filters ---
    x = layers.Conv2D(64, (3,3), padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Conv2D(64, (3,3), padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.MaxPooling2D((2,2))(x)   # 16x16 -> 8x8
    x = layers.Dropout(0.25)(x)

    # --- Block 3: 128 filters ---
    x = layers.Conv2D(128, (3,3), padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Conv2D(128, (3,3), padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.MaxPooling2D((2,2))(x)   # 8x8 -> 4x4
    x = layers.Dropout(0.25)(x)

    # --- Classification Head ---
    x = layers.GlobalAveragePooling2D()(x)   # 4x4x128 -> 128
    x = layers.Dense(256)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(inputs, outputs, name="CIFAR10_CNN")

model = build_cnn()
model.summary()
5.4 Compile and Train
Python
# --- Compile with the Adam optimizer (the learning rate is lowered by a callback below) ---
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# --- Callbacks ---
callbacks = [
    keras.callbacks.EarlyStopping(
        monitor='val_accuracy', patience=10, restore_best_weights=True
    ),
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6
    )
]

# --- Train ---
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=64,
    validation_split=0.1,
    callbacks=callbacks,
    verbose=1
)
5.5 Evaluate and Visualise
Python
import matplotlib.pyplot as plt

# --- Test set evaluation ---
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Accuracy: {test_acc:.4f}")
print(f"Test Loss    : {test_loss:.4f}")

# --- Plot training curves ---
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(history.history['accuracy'], label='Train')
ax1.plot(history.history['val_accuracy'], label='Validation')
ax1.set_title('Accuracy')
ax1.set_xlabel('Epoch')
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2.plot(history.history['loss'], label='Train')
ax2.plot(history.history['val_loss'], label='Validation')
ax2.set_title('Loss')
ax2.set_xlabel('Epoch')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('cifar10_training_curves.png', dpi=150)
plt.show()
5.6 Transfer Learning with MobileNetV2
Python
# --- Transfer Learning: MobileNetV2 pretrained on ImageNet ---
# Note: the ImageNet weights were trained on much larger images (224x224 by default),
# so accuracy at 32x32 is typically limited; upsampling the inputs is a common remedy.
base_model = keras.applications.MobileNetV2(
    input_shape=(32, 32, 3),
    include_top=False,      # Remove ImageNet classification head
    weights='imagenet'
)
base_model.trainable = False   # Freeze all pretrained layers

# --- New classification head for CIFAR-10 ---
inputs = layers.Input(shape=(32, 32, 3))
x = data_augmentation(inputs)
# X_train was scaled to [0, 1] in 5.1, but preprocess_input expects pixels in
# [0, 255] (it maps them to [-1, 1]), so undo the earlier scaling first.
x = layers.Rescaling(255.0)(x)
x = keras.applications.mobilenet_v2.preprocess_input(x)
x = base_model(x, training=False)   # keep BatchNorm layers in inference mode
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(10, activation='softmax')(x)
transfer_model = models.Model(inputs, outputs)

print(f"Total params     : {transfer_model.count_params():,}")
print(f"Trainable params : {sum(p.numpy().size for p in transfer_model.trainable_weights):,}")

# --- Train (feature extraction phase) ---
transfer_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
transfer_model.fit(X_train, y_train, epochs=10, batch_size=64,
                   validation_split=0.1)

# --- Fine-tuning phase: unfreeze last 30 layers ---
base_model.trainable = True
for layer in base_model.layers[:-30]:
    layer.trainable = False

# Recompile so the changed trainable flags take effect
transfer_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),   # Very small LR!
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
transfer_model.fit(X_train, y_train, epochs=10, batch_size=64,
                   validation_split=0.1)
Visual Diagrams
6.1 CNN Feature Hierarchy
6.2 Convolution Operation Step-by-Step
6.3 VGG-16 Architecture
6.4 ResNet Skip Connection
Worked Example: Shape Tracing Through a CNN
Problem: Trace the tensor shapes and parameter counts through a CNN designed for Myntra's product category classification (224×224 RGB images, 50 categories).
Architecture Specification
Input: 224 × 224 × 3 RGB image
Block 1: Conv(64, 7×7, stride=2, same padding) → BN → ReLU → MaxPool(3×3, stride=2, same padding)
Block 2: Conv(128, 3×3, stride=1, same) → BN → ReLU → MaxPool(2×2, stride=2)
Block 3: Conv(256, 3×3, stride=1, same) → BN → ReLU → Conv(256, 3×3, stride=1, same) → BN → ReLU → MaxPool(2×2, stride=2)
Head: GlobalAveragePooling → Dense(512) → ReLU → Dropout(0.5) → Dense(50, softmax)
Step-by-Step Shape Trace
Block 1: Initial Feature Extraction
Conv(64, 7×7, stride=2, same):
Output: 112 × 112 × 64
Parameters: 64 × (7×7×3 + 1) = 64 × 148 = 9,472
BN + ReLU: Shape unchanged → 112 × 112 × 64. BN params: 64×4 = 256 (scale, shift, moving mean, and moving variance per channel)
MaxPool(3×3, stride=2, same):
Output: 56 × 56 × 64
Parameters: 0 (pooling has no learnable params)
Block 2: Deepening Features
Conv(128, 3×3, stride=1, same):
Output: 56 × 56 × 128
Parameters: 128 × (3×3×64 + 1) = 128 × 577 = 73,856
MaxPool(2×2, stride=2): → 28 × 28 × 128, Params: 0
Block 3: High-Level Features
Conv(256, 3×3, same) × 2:
First Conv: 256 × (3×3×128 + 1) = 295,168, Output: 28 × 28 × 256
Second Conv: 256 × (3×3×256 + 1) = 590,080, Output: 28 × 28 × 256
MaxPool(2×2, stride=2): → 14 × 14 × 256
Classification Head
GlobalAveragePooling: 14 × 14 × 256 → 256, Params: 0
Dense(512): 256 × 512 + 512 = 131,584
Dense(50): 512 × 50 + 50 = 25,650
Complete Summary
| Layer | Output Shape | Parameters |
|---|---|---|
| Input | 224 × 224 × 3 | 0 |
| Conv1 (64, 7×7, s=2) | 112 × 112 × 64 | 9,472 |
| BN + ReLU | 112 × 112 × 64 | 256 |
| MaxPool (3×3, s=2) | 56 × 56 × 64 | 0 |
| Conv2 (128, 3×3, s=1) | 56 × 56 × 128 | 73,856 |
| BN + ReLU | 56 × 56 × 128 | 512 |
| MaxPool (2×2, s=2) | 28 × 28 × 128 | 0 |
| Conv3a (256, 3×3) | 28 × 28 × 256 | 295,168 |
| BN + ReLU | 28 × 28 × 256 | 1,024 |
| Conv3b (256, 3×3) | 28 × 28 × 256 | 590,080 |
| BN + ReLU | 28 × 28 × 256 | 1,024 |
| MaxPool (2×2, s=2) | 14 × 14 × 256 | 0 |
| GlobalAvgPool | 256 | 0 |
| Dense(512) + ReLU | 512 | 131,584 |
| Dense(50) + Softmax | 50 | 25,650 |
| Total | | 1,128,626 (~1.1M) |
Compare this to a fully connected network: the first layer alone would need 224×224×3 × 512 ≈ 77.1 million parameters. Our CNN achieves similar expressive power with just 1.1M parameters, a 68× reduction.
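The totals in the summary can be double-checked with a short script. This is a sketch using three small helper functions defined for this example; the BatchNorm count of 4 per channel follows the convention used in the trace above (it includes the two non-trainable moving statistics).

```python
def conv_params(c_in, c_out, f):
    return c_out * (f * f * c_in + 1)      # weights + one bias per filter

def bn_params(channels):
    return 4 * channels                    # gamma, beta, moving mean, moving variance

def dense_params(n_in, n_out):
    return n_in * n_out + n_out            # weights + biases

layer_params = {
    "Conv1 (64, 7x7)":   conv_params(3, 64, 7),
    "BN1":               bn_params(64),
    "Conv2 (128, 3x3)":  conv_params(64, 128, 3),
    "BN2":               bn_params(128),
    "Conv3a (256, 3x3)": conv_params(128, 256, 3),
    "BN3a":              bn_params(256),
    "Conv3b (256, 3x3)": conv_params(256, 256, 3),
    "BN3b":              bn_params(256),
    "Dense(512)":        dense_params(256, 512),
    "Dense(50)":         dense_params(512, 50),
}
cnn_total = sum(layer_params.values())

# Weights of a dense first layer with 512 units on the flattened image
dense_first_layer = 224 * 224 * 3 * 512
reduction = dense_first_layer / cnn_total

for name, count in layer_params.items():
    print(f"{name:<18} {count:>9,}")
print(f"{'CNN total':<18} {cnn_total:>9,}")
print(f"Dense first layer / CNN total = {reduction:.1f}x")
```

Because the head uses global average pooling, Dense(512) sees only 256 inputs; flattening the 14 × 14 × 256 tensor instead would feed it 50,176 inputs.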
Case Study: DigiYatra (CNN-Powered Face Recognition at Indian Airports)
DigiYatra: Seamless, Paperless Air Travel Across India
The Challenge
Indian airports handled 376 million passengers in FY 2023-24. At peak hours, passengers spend 45–60 minutes in check-in and security queues. The Ministry of Civil Aviation launched DigiYatra, an AI-powered face recognition system that allows paperless, contactless boarding using CNN-based facial verification.
The Technical Pipeline
| Stage | CNN Component | Details |
|---|---|---|
| 1. Enrollment | Face Detection (MTCNN) | Multi-task CNN detects face bounding boxes, facial landmarks (eyes, nose, mouth). Runs at 30 FPS on airport cameras. |
| 2. Feature Extraction | FaceNet / ArcFace CNN | ResNet-based backbone extracts a 512-dimensional embedding vector from each face. This embedding captures identity-specific features. |
| 3. Verification | Cosine Similarity | At each checkpoint, the live face embedding is compared against the enrolled embedding. Threshold: cosine similarity > 0.85 → match. |
| 4. Anti-spoofing | Liveness Detection CNN | A separate MobileNet classifies real faces vs. printed photos / screen displays. Prevents fraudulent boarding. |
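The verification stage reduces to a single similarity computation between two vectors. The sketch below illustrates only that step: the 512-D "embeddings" are random stand-ins rather than the output of a real face model, the function names are hypothetical, and the 0.85 threshold is the one quoted in the table.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_match(enrolled, live, threshold=0.85):
    """Accept the live face if its embedding is close enough to the enrolled one."""
    return cosine_similarity(enrolled, live) > threshold

rng = np.random.default_rng(0)
enrolled = rng.normal(size=512)                        # stand-in for an enrolled embedding
same_person = enrolled + 0.1 * rng.normal(size=512)    # small change (pose, lighting)
different_person = rng.normal(size=512)                # unrelated embedding

print("Same person     :", round(cosine_similarity(enrolled, same_person), 3),
      "->", is_match(enrolled, same_person))
print("Different person:", round(cosine_similarity(enrolled, different_person), 3),
      "->", is_match(enrolled, different_person))
```

In a real system the threshold is tuned on validation data to trade off false accepts against false rejects.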
Architecture Details
- Backbone: Modified ResNet-50 with ArcFace loss (additive angular margin) for discriminative face embeddings
- Input: 112 × 112 aligned and cropped face images
- Output: 512-D normalised embedding vector (unit sphere)
- Inference time: ~15ms per face on NVIDIA T4 GPU (deployed at airports)
- Accuracy: 99.2% True Accept Rate at 0.1% False Accept Rate
Scale and Impact
- Deployed at 24 airports including Delhi T3, Bangalore, Hyderabad, Varanasi, Pune
- 6.5 million+ passengers processed through DigiYatra (as of March 2024)
- Average time savings: 40 seconds per checkpoint × 3 checkpoints = 2 minutes per passenger
- Projected savings: ₹1,200 crore annually in operational efficiency across Indian airports
Privacy Considerations: DPDPA 2023 / PDPB Compliance
Face recognition raises critical privacy concerns. India's Digital Personal Data Protection Act (DPDPA) 2023 mandates:
- Consent: DigiYatra is opt-in only; passengers must voluntarily enroll via the DigiYatra app
- Data minimisation: Face embeddings (not raw images) are stored; embeddings are deleted within 24 hours of the flight
- Purpose limitation: Biometric data used exclusively for airport identity verification, not shared with third parties
- Right to erasure: Passengers can delete their DigiYatra profile and all associated biometric data at any time
- Data localisation: All processing happens on servers physically located in India (no cross-border transfer)
Technical Challenges for Indian Deployment
- Diversity: Indian faces span enormous diversity in skin tone, facial structure, and accessories (turbans, bindis, veils). The training dataset must be representative.
- Lighting: Airport lighting varies drastically, from bright terminal lobbies to dim corridors. The CNN must be robust to illumination changes.
- Scale: Delhi T3 handles 70M+ passengers/year. The system must process ~200 faces/second at peak hours.
- Cost: Edge deployment uses NVIDIA Jetson modules (₹35,000 each) to keep per-checkpoint costs under ₹50,000/month.
Common Mistakes & Misconceptions
Adding too many filters (e.g., 512 filters in the first layer of a small CNN) causes massive overfitting on small datasets. Start with 32 or 64 filters and double at each pooling stage. Monitor the validation gap.
Raw pixel values range [0, 255]. Without normalising to [0, 1] or standardising to zero-mean/unit-variance, the loss landscape becomes poorly conditioned, gradients can explode, and training often fails. Always normalise. For transfer learning, use the model's specific preprocessing function (e.g., preprocess_input()).
VGG showed that two stacked 3×3 convolutions have the same receptive field as one 5×5, but with fewer parameters and more non-linearity. Modern CNNs use 3×3 almost exclusively (except sometimes 7×7 in the very first layer for large images).
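The receptive-field and parameter claim is simple arithmetic. The sketch below assumes stride-1 convolutions in which every layer maps C channels to C channels, and ignores biases; the helper names and C = 64 are chosen for illustration.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions: 1 + sum of (f - 1)."""
    rf = 1
    for f in kernel_sizes:
        rf += f - 1
    return rf

def conv_weights(kernel_sizes, channels):
    """Weights (biases ignored) when every layer maps `channels` -> `channels`."""
    return sum(f * f * channels * channels for f in kernel_sizes)

c = 64
rf_two_3x3 = stacked_receptive_field([3, 3])   # 5
rf_one_5x5 = stacked_receptive_field([5])      # 5
w_two_3x3 = conv_weights([3, 3], c)            # 2 * 9 * 64 * 64  = 73,728
w_one_5x5 = conv_weights([5], c)               # 25 * 64 * 64     = 102,400

print("Receptive field:", rf_two_3x3, "vs", rf_one_5x5)
print("Weights        :", w_two_3x3, "vs", w_one_5x5)
print("Saving         :", f"{1 - w_two_3x3 / w_one_5x5:.0%}")   # 28%
```

The ratio 18C² to 25C² is independent of C, and the stacked version also inserts an extra non-linearity between the two layers.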
BatchNormalization after convolution layers dramatically stabilises training. Without it, you need careful weight initialisation and lower learning rates. With BN, you can often use learning rates around 10× higher and converge faster.
If you have fewer than 10,000 images, always try transfer learning first. A pretrained ResNet-50 fine-tuned on 1,000 images often outperforms a CNN trained from scratch on 10,000 images. Only train from scratch when your domain is very different from ImageNet (e.g., medical X-rays, satellite imagery).
Pretrained weights are already near a good minimum. A high learning rate (e.g., 0.01) will destroy these weights in the first few epochs. Use 1/10th to 1/100th of the original learning rate (e.g., 1e-4 or 1e-5) when fine-tuning.
Mathematically, convolution flips the kernel; cross-correlation does not. Deep learning frameworks implement cross-correlation but call it "convolution." Since kernels are learned, the distinction is irrelevant in practice, but know the difference for exams!
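The difference is easy to see with a fixed (not learned) kernel. In this sketch, `cross_correlate` and `convolve` are helpers written for the example; true convolution is obtained by flipping the kernel on both axes before sliding it. The 5×5 ramp image is made up for illustration.

```python
import numpy as np

def cross_correlate(image, kernel):
    """What deep learning frameworks compute: slide the kernel as-is."""
    f = kernel.shape[0]
    h_out, w_out = image.shape[0] - f + 1, image.shape[1] - f + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

def convolve(image, kernel):
    """Mathematical convolution: cross-correlation with the kernel flipped on both axes."""
    return cross_correlate(image, np.flip(kernel))

image = np.arange(25, dtype=float).reshape(5, 5)        # intensity increases left to right
edge_kernel = np.array([[1, 0, -1]] * 3, dtype=float)   # antisymmetric left-to-right

xcorr = cross_correlate(image, edge_kernel)
conv = convolve(image, edge_kernel)

sign_flipped = np.allclose(conv, -xcorr)
print("Cross-correlation response:", xcorr[0, 0])   # -6.0
print("Convolution response      :", conv[0, 0])    #  6.0
print("Outputs differ only in sign for this kernel:", sign_flipped)
```

For a kernel that is symmetric under a 180° rotation the two operations coincide, and for a learned kernel the network simply learns the flipped weights, which is why the naming mismatch is harmless in practice.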
Comparison Tables
10.1 CNN vs. Fully Connected Network
| Aspect | Fully Connected (Dense) | Convolutional Neural Network |
|---|---|---|
| Input | Flattened 1D vector | Preserves 2D/3D spatial structure |
| Connectivity | Every input → every neuron | Local receptive field (e.g., 3×3 patch) |
| Parameter sharing | None; unique weights per connection | Same kernel shared across entire spatial extent |
| Parameters (224×224×3) | ~150M (first layer alone, 1,000 neurons) | ~1.8K (first conv layer with 64 3×3 filters: 64 × (3×3×3 + 1) = 1,792) |
| Translation invariance | None: cat at (0,0) ≠ cat at (100,100) | Built-in through weight sharing + pooling |
| Best for | Tabular data, final classification layers | Images, video, spatial/temporal data |
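The parameter row of the table can be reproduced with two short helper functions. This is a sketch under stated assumptions: the dense layer width of 1,000 neurons is an illustrative choice, and both counts include one bias per neuron or filter.

```python
def dense_params(n_inputs: int, n_neurons: int) -> int:
    """Weights plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


def conv_params(n_filters: int, f: int, c_in: int) -> int:
    """n_f * (f^2 * C_in + 1): each filter spans all input channels, plus a bias."""
    return n_filters * (f * f * c_in + 1)


n_inputs = 224 * 224 * 3             # flattened RGB image
print(n_inputs)                      # inputs seen by every dense neuron
print(dense_params(n_inputs, 1000))  # 1,000 neurons is an assumed layer width
print(conv_params(64, 3, 3))         # 64 filters of size 3x3 on an RGB input
```

The convolutional count does not depend on the image size at all, which is the practical consequence of parameter sharing.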
10.2: Padding Comparison
| Type | Padding Value | Output Size | When to Use |
|---|---|---|---|
| Valid | p = 0 | ⌊(n − f)/s⌋ + 1 | When shrinking is acceptable; last conv before pooling |
| Same | p = (f − 1)/2 | ⌈n / s⌉ | Most layers: preserves spatial info for stacking many layers |
| Full | p = f − 1 | n + f − 1 (stride 1) | Rare; used in signal processing, not common in deep learning |
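The three padding types in the table can be compared with one small function. This sketch assumes an odd kernel size and symmetric padding on each side, and applies the general output-size formula with integer (floor) division; the input size of 32 and kernel size of 5 are illustrative.

```python
def padding_amount(mode: str, f: int) -> int:
    """Padding per side for an odd kernel size f."""
    if mode == "valid":
        return 0
    if mode == "same":
        return (f - 1) // 2
    if mode == "full":
        return f - 1
    raise ValueError(f"unknown padding mode: {mode}")


def conv_output_size(n: int, f: int, s: int = 1, mode: str = "valid") -> int:
    """Output size: floor((n + 2p - f) / s) + 1."""
    p = padding_amount(mode, f)
    return (n + 2 * p - f) // s + 1


for mode in ("valid", "same", "full"):
    print(mode, conv_output_size(32, 5, s=1, mode=mode))
# valid 28
# same 32
# full 36
```

Passing a stride greater than 1 to the same function shows how "same" padding then yields roughly n / s rather than n.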
10.3: Pooling Comparison
| Aspect | Max Pooling | Average Pooling | Global Average Pooling |
|---|---|---|---|
| Operation | max(window) | mean(window) | mean(entire feature map) |
| Output | Strongest activation in region | Average activation in region | Single value per channel |
| Translation invariance | Strong | Moderate | Complete (position ignored) |
| Parameters | 0 | 0 | 0 |
| Common use | Between conv blocks | Less common in classification CNNs | Final layer before classifier (replaces FC) |
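The three pooling operations in the table differ only in the reduction applied and the window used. Here is a minimal pure-Python sketch on a single 4×4 channel; the function names and the sample feature map are illustrative.

```python
def pool2d(fmap, size=2, stride=2, reduce=max):
    """Apply `reduce` to each size x size window of a 2D feature map."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [fmap[i * stride + a][j * stride + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out


def mean(values):
    """Arithmetic mean of a flat list of numbers."""
    return sum(values) / len(values)


def global_average_pool(fmap):
    """Collapse the whole feature map to a single value."""
    return mean([v for row in fmap for v in row])


fmap = [[1, 3, 2, 0],
        [5, 7, 4, 6],
        [9, 8, 1, 2],
        [3, 4, 0, 5]]

print(pool2d(fmap, reduce=max))    # [[7, 6], [9, 5]]
print(pool2d(fmap, reduce=mean))   # [[4.0, 3.0], [6.0, 2.0]]
print(global_average_pool(fmap))   # 3.75
```

None of these functions holds any weights, which is why the parameter row of the table is zero throughout.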
10.4: When to Use Which Architecture
| Scenario | Recommended Architecture | Reasoning |
|---|---|---|
| Small dataset (< 5K images) | MobileNetV2 + Transfer Learning | Too few images to train from scratch; lightweight backbone |
| Medium dataset (5K–50K), server GPU | ResNet-50 + Fine-tuning | Best accuracy; sufficient data for fine-tuning later layers |
| Large dataset (> 100K), full control | Custom CNN or EfficientNet | Enough data to train from scratch; can optimise architecture |
| Mobile/Edge deployment | MobileNetV2 or MobileNetV3 | Small models (MobileNetV2 has ~3.4M params), fast inference on smartphones |
| Medical imaging | DenseNet-121 or ResNet-50 | Dense connections help with limited data; well-studied in medical AI |
Exercises
Section A: Multiple Choice Questions (10)
A 64×64×3 input image is convolved with 32 filters of size 5×5, stride=1, no padding. What is the output shape?
- 60 × 60 × 3
- 60 × 60 × 32
- 64 × 64 × 32
- 32 × 32 × 32
How many learnable parameters does a Conv2D layer with 64 filters of size 3×3, applied to an input with 32 channels, have?
- 576
- 18,432
- 18,496
- 36,928
What padding value is needed to make a 7×7 convolution a "same" convolution (output size equals input size, stride=1)?
- 1
- 2
- 3
- 7
A MaxPool layer with pool size 2×2 and stride 2 is applied to a 28×28×64 feature map. How many learnable parameters does this layer have?
- 0
- 4
- 256
- 512
Two stacked 3×3 convolutions have the same effective receptive field as a single:
- 3×3 convolution
- 5×5 convolution
- 7×7 convolution
- 9×9 convolution
What is the key innovation of ResNet that enabled training networks with 100+ layers?
- Using 1×1 convolutions for dimensionality reduction
- Inception modules with parallel convolutions
- Skip (residual) connections that add the input to the block's output
- Depthwise separable convolutions
In transfer learning, when fine-tuning a pretrained model on a small dataset, you should:
- Use a very high learning rate to quickly adapt the weights
- Freeze all layers and only train a new classification head
- Unfreeze later layers and use a very small learning rate
- Re-initialise all weights randomly and train from scratch
An input of size 112×112×3 is passed through Conv(64, 3×3, padding='same', stride=2). What is the output shape?
- 112 × 112 × 64
- 56 × 56 × 64
- 110 × 110 × 64
- 55 × 55 × 64
Which architecture introduced the concept of bottleneck layers using 1×1 convolutions to reduce computational cost?
- AlexNet
- VGG-16
- GoogLeNet / Inception
- LeNet-5
Global Average Pooling applied to a feature map of shape 7×7×512 produces an output of shape:
- 1 × 1 × 512
- 7 × 7 × 1
- 512
- 3584
Section B: Short Answer Questions (5)
Explain the three properties of CNNs (local connectivity, parameter sharing, translation equivariance) using a real-world analogy of a security guard scanning a crowd with binoculars.
Calculate the total number of learnable parameters in a convolutional layer that takes a 28×28×1 input (grayscale) and applies 16 filters of size 5×5 with stride 1 and valid padding. Show all work.
Explain why VGG uses two stacked 3×3 convolutions instead of one 5×5 convolution. Calculate the parameter savings for an input with 256 channels.
A CNN has the following architecture: Input(32×32×3) → Conv(32, 3×3, same) → MaxPool(2×2) → Conv(64, 3×3, same) → MaxPool(2×2) → Flatten → Dense(128) → Dense(10). Trace the tensor shape after each layer and compute the total flatten size.
Explain the residual learning formulation F(x) = H(x) − x and why learning the residual is easier than learning H(x) directly. How does this solve the degradation problem in deep networks?
Section C: Long Answer Questions (3)
[15 marks] Compare and contrast the architectures of LeNet-5, AlexNet, VGG-16, and ResNet-50. For each, describe: (a) the key architectural innovation, (b) the approximate number of parameters, (c) one advantage and one limitation. Explain how each architecture addressed a specific limitation of its predecessor. Use a table followed by a detailed discussion.
[15 marks] Discuss the concept of transfer learning in the context of CNNs. (a) Explain why features learned on ImageNet transfer to other vision tasks. (b) Describe the three strategies (feature extraction, fine-tuning, full retraining) with pseudocode for each. (c) Using the example of Niramai's breast cancer detection from thermal images, explain how transfer learning from ImageNet helps despite the domain difference. (d) Discuss potential failure modes of transfer learning.
[15 marks] Analyse the DigiYatra face recognition pipeline from a systems perspective. (a) Draw the complete pipeline from camera capture to gate opening. (b) Explain the role of CNNs at each stage (detection, alignment, embedding, liveness). (c) Discuss the privacy implications under DPDPA 2023: what safeguards should be mandatory? (d) Propose a privacy-preserving alternative using federated learning or on-device processing. (e) Should face recognition in public spaces be banned in India? Argue both sides.
Section D: Programming Assignments (3)
[Intermediate] Build a CNN from scratch in Keras to classify Indian traffic signs. Use the Indian Traffic Sign Dataset (or GTSRB as a proxy with 43 classes). Your solution should include:
- Data loading, exploration (class distribution), and preprocessing (resize to 32×32, normalize)
- Data augmentation (rotation, shift, zoom; no horizontal flip, since signs are directional!)
- A CNN with at least 3 conv blocks: [Conv→BN→ReLU→Conv→BN→ReLU→MaxPool→Dropout]×3
- Training with Adam, early stopping, and learning rate reduction
- Confusion matrix and per-class accuracy analysis
- Identify which sign classes are most confused and hypothesise why
Target: ≥ 95% test accuracy. Report training time on your hardware.
[Intermediate] Use transfer learning with MobileNetV2 to classify 5 types of Indian cuisine from images: dosa, biryani, chole bhature, pav bhaji, and pani puri. Steps:
- Collect 200+ images per class from the web (use `bing-image-downloader` or manual collection)
- Split into 70/15/15 train/val/test
- Phase 1 (feature extraction): freeze MobileNetV2, train only the classification head (10 epochs)
- Phase 2 (fine-tuning): unfreeze last 30 layers, train with lr=1e-5 (10 more epochs)
- Compare Phase 1 vs Phase 2 accuracy
- Visualise what the CNN "sees" using Grad-CAM on 5 correctly classified and 5 misclassified images
[Beginner] Take a pretrained VGG-16 model and visualise the feature maps (activations) at layers conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 (named `block1_conv1` through `block5_conv1` in Keras) for a single input image. Steps:
- Load VGG-16 with ImageNet weights
- Create a feature extraction model that outputs activations at the specified layers
- Pass a sample image (e.g., a photo of the Taj Mahal) through the model
- Plot the first 16 feature maps at each layer as a 4×4 grid
- Write 1–2 paragraphs describing what you observe: how features progress from edges → textures → object parts → semantic features
Section E: Mini-Project
Project: Product Quality Classifier for Meesho Sellers
Scenario: You are an ML engineer at a social commerce company (like Meesho). Sellers upload product images, and you need to automatically classify image quality as: Good (clear, well-lit, product-focused), Acceptable (minor issues), or Reject (blurry, cluttered, watermarked).
Deliverables:
- Dataset Creation (20%): Collect/annotate 500+ product images across 3 quality levels. Document your annotation guidelines.
- Baseline Model (20%): Train a simple 3-block CNN from scratch. Report accuracy, precision, recall per class.
- Transfer Learning Model (20%): Fine-tune MobileNetV2 on the same data. Compare with baseline.
- Error Analysis (20%): Analyse misclassifications with Grad-CAM. What visual patterns confuse the model? How would you improve the dataset?
- Deployment Plan (20%): Write a 1-page plan for deploying this model as an API endpoint. Include: input preprocessing, model serving (TF Serving / FastAPI), latency requirements (< 50ms), cost estimate for 50K images/day on AWS/GCP, and monitoring strategy.
Constraints:
- Final model must be < 20MB (suitable for edge deployment)
- Inference time < 50ms on a CPU
- Must handle images from ₹8,000 smartphones (low resolution, poor lighting)
- False rejection rate (good images marked as reject) must be < 2%
Duration: 2 weeks | Team size: 2–3 students | Submission: GitHub repo + 5-minute demo video
Chapter Summary
Key Takeaways (CNNs: Seeing the World)
- The problem with dense layers for images: Flattening a 224×224×3 image creates 150,528 inputs per neuron, destroying spatial structure and causing parameter explosion. CNNs solve this through local connectivity and parameter sharing.
- Convolution operation: A kernel (filter) slides over the input, computing an element-wise multiply-and-sum at each position. Output size: `n_out = ⌊(n + 2p − f) / s⌋ + 1`.
- Padding preserves dimensions: "Same" padding (`p = (f−1)/2`) keeps output size equal to input size at stride 1. "Valid" padding (`p = 0`) shrinks the output by `f−1` pixels.
- Stride controls downsampling: Stride=2 halves spatial dimensions, acting as a learnable alternative to pooling.
- Multiple filters → multiple feature maps: Each filter detects a different feature. Parameter count: `n_f × (f² × C_in + 1)`, dramatically fewer than dense layers.
- Pooling reduces dimensions without learning: MaxPool(2×2, stride=2) halves spatial dimensions with zero learnable parameters. Global Average Pooling can replace Flatten and the hidden FC layers before the classifier.
- Canonical CNN pipeline: [CONV → BN → ReLU → POOL] × N → GlobalAvgPool → FC → Softmax. Spatial dims decrease while channel depth increases.
- Classic architectures evolved systematically: LeNet (1998, pioneering CNN) → AlexNet (2012, ReLU+GPU) → VGG (2014, small filters) → GoogLeNet (2014, inception modules) → ResNet (2015, skip connections for 150+ layers).
- Residual connections: y = F(x) + x allows gradients to flow through identity shortcuts, enabling training of very deep networks. Learning the residual F(x) is easier than learning H(x) directly.
- Transfer learning is almost always the right choice: Pretrained models (ImageNet) provide excellent feature extractors. Three strategies: feature extraction (freeze all), fine-tuning (unfreeze later layers), or full retraining (use as initialisation).
- Indian AI ecosystem: From DigiYatra's face recognition at airports to Flipkart's visual search, Niramai's cancer detection, and TCS's crop disease classification, CNNs are powering India's AI infrastructure across healthcare, commerce, agriculture, and security.
Output size: n_out = ⌊(n + 2p − f) / s⌋ + 1
Same padding: p = (f − 1) / 2
Conv params: n_f × (f² × C_in + 1)
Residual block: y = F(x) + x
Receptive field of k stacked 3×3 convs = (2k + 1) × (2k + 1)
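The residual block formula y = F(x) + x can be illustrated with a toy sketch. Here F is passed in as an ordinary function on a list of numbers; both example functions are hypothetical stand-ins for the learned convolutional layers inside a real block.

```python
def residual_block(x, residual_fn):
    """y = F(x) + x, with F supplied as `residual_fn` (must preserve the shape)."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    # Hypothetical F whose weights are all zero, so F(x) = 0.
    return [0.0 for _ in x]


def small_residual(x):
    # Hypothetical F that learns a small correction to its input.
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_residual))   # [1.0, -2.0, 3.0], the identity mapping
print(residual_block(x, small_residual))  # the input plus a small correction
```

With F(x) = 0 the block passes its input through unchanged, so an extra residual block can always fall back to the identity. This is the intuition behind why adding blocks should not make a deep network worse.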
References & Further Reading
Foundational Papers
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). "Gradient-Based Learning Applied to Document Recognition." Proceedings of the IEEE, 86(11), 2278–2324. [The LeNet paper: foundation of all modern CNNs]
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS 2012. [AlexNet: started the deep learning revolution]
- Simonyan, K. & Zisserman, A. (2015). "Very Deep Convolutional Networks for Large-Scale Image Recognition." ICLR 2015. [VGG: showed that depth with small filters wins]
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." CVPR 2016. [ResNet: skip connections enable 150+ layer networks]
- Sandler, M., Howard, A., et al. (2018). "MobileNetV2: Inverted Residuals and Linear Bottlenecks." CVPR 2018. [MobileNetV2: efficient CNN for mobile devices]
Textbooks
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 9: Convolutional Networks. MIT Press. [Comprehensive theoretical treatment]
- Chollet, F. (2021). Deep Learning with Python, 2nd Ed., Chapter 8: Computer Vision. Manning. [Practical Keras implementation guide]
- Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Ed., Chapter 14. O'Reilly. [Excellent practical coverage with code]
Indian Context
- DigiYatra Foundation. (2024). "DigiYatra Technical Specifications." Ministry of Civil Aviation, Government of India. [Official documentation on India's facial recognition boarding system]
- Digital Personal Data Protection Act (DPDPA), 2023. Government of India Gazette. [India's primary data protection legislation]
- Sharma, A., et al. (2022). "Crop Disease Detection using Transfer Learning on Indian Agricultural Images." TCS Research. [CNN applications in Indian agriculture]
- Niramai Health Analytix. (2023). "Thermalytix: AI-Powered Breast Cancer Screening." Technical Whitepaper. [CNN for medical imaging in rural India]
Online Resources
- CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University (2023). [The gold-standard course on CNNs]
- TensorFlow CNN Tutorial: https://www.tensorflow.org/tutorials/images/cnn
- Keras Applications API: https://keras.io/api/applications/ [Pretrained models documentation]
Datasets
- CIFAR-10 & CIFAR-100, Alex Krizhevsky (2009). [60K 32×32 images each, 10/100 classes]
- ImageNet (ILSVRC), Deng et al. (2009). [14M images in the full dataset; the ILSVRC subset has ~1.2M training images in 1000 classes. The benchmark that drove CNN evolution]
- German Traffic Sign Recognition Benchmark (GTSRB). [~50K images, 43 classes; proxy for Indian traffic signs]