Neural Networks & Deep Learning
Chapter 20: Applied Deep Learning
Computer Vision Projects: From Farm to Hospital to Highway
Reading Time: ~4 hours | Unit VII: Applications & Industry | Project-Driven Chapter
Prerequisites: Chapter 13 (CNN Architectures & Transfer Learning), Chapter 17 (Object Detection & Segmentation)
Bloom's Taxonomy Progression
| Bloom's Level | What You'll Achieve |
|---|---|
| Remember | Recall the standard CV project pipeline: problem framing → dataset engineering → model selection → training → evaluation → deployment |
| Understand | Explain why ResNet50 transfer learning works for crop disease detection, why Grad-CAM is critical for medical AI, and how YOLOv8 achieves real-time detection |
| Apply | Build 5 complete CV projects: crop disease detection, currency authentication, traffic sign recognition, chest X-ray diagnosis, and real-time object detection |
| Analyze | Diagnose model failures through confusion matrices, precision-recall trade-offs, Grad-CAM heatmaps, and per-class error analysis |
| Evaluate | Choose optimal architectures and deployment strategies for real-world constraints (mobile phone, hospital PACS, edge GPU) |
| Create | Design end-to-end deployable CV systems with data pipelines, model optimization (ONNX/TorchScript), and production monitoring |
Learning Objectives
By the end of this chapter, you will be able to:
- Build a crop disease detection system using ResNet50 transfer learning on the PlantVillage dataset (38 classes) with Indian-crop-specific data augmentation, achieving >95% accuracy
- Develop an Indian currency note authentication CNN that distinguishes genuine ₹500/₹2000 notes from counterfeits using texture and watermark features
- Train a traffic sign recognition model adapted for Indian road signs: multilingual text, non-standard shapes, and conditions distinct from German GTSRB
- Implement a chest X-ray pneumonia detection classifier with Grad-CAM explainability and understand the medical ethics of deploying AI in healthcare
- Deploy YOLOv8 for real-time object detection on Indian traffic scenarios: auto-rickshaws, cows on roads, pedestrians, and two-wheelers
- Evaluate every model using precision, recall, F1-score, confusion matrices, ROC-AUC, and domain-appropriate metrics (sensitivity/specificity for medical, mAP for detection)
- Visualize model decisions using Grad-CAM heatmaps to build trust, debug failures, and meet regulatory requirements
- Compare Indian deployment constraints with US/global equivalents and adapt solutions accordingly
Opening Hook: Theory Without Practice Is Empty
Five Problems. Five Models. One Chapter.
Theory without practice is empty. Practice without theory is blind. For 19 chapters, you've built up a formidable arsenal: perceptrons, backpropagation, CNNs, transfer learning, object detection, attention mechanisms. Now it's time to deploy that arsenal on real problems that matter.
In a village near Nagpur, a cotton farmer loses ₹3 lakh to bollworm-related leaf disease because he misidentified the symptoms. At an RBI currency chest in Lucknow, a clerk handles 10,000 notes daily. How many counterfeits slip through? On NH-48 near Gurugram, a self-driving car prototype encounters a cow sitting on the highway median, a scenario that never appears in Stanford's datasets. At AIIMS Delhi, a radiologist reads 200 chest X-rays daily and misses a subtle pneumonia case at 4 PM because of fatigue.
Each of these problems has a deep learning solution that you will build in this chapter. Not toy examples. Not MNIST. Full production-grade projects with real datasets, proper evaluation, Grad-CAM explainability, and deployment code. These are projects you can show in interviews, deploy on your phone, and even monetize.
CropIn RBI NHAI AIIMS Ola/Uber Google Health Waymo
The Intuition First: Why Projects, Not Just Theory?
The "Cooking Class" Analogy
Imagine you've spent a semester learning about heat transfer, Maillard reactions, emulsification, and flavor compounds. You know the science of cooking. But can you actually cook a biryani? Making biryani requires you to orchestrate all that knowledge simultaneously: choosing the right rice, managing the dum, timing the layers. That's what this chapter is.
Each project is a "dish" that forces you to combine multiple skills:
The "Aha" Question
Here's something that might surprise you: the model architecture is usually the least important decision in an applied CV project. The same ResNet50 can get you 60% or 98% accuracy on the same dataset. The difference? Data quality, augmentation strategy, learning rate schedule, and evaluation methodology. This chapter teaches you the 80% of effort that determines success: the "engineering" around the model.
Mathematical Foundation: Metrics That Matter
Before diving into projects, you need to master the evaluation metrics that determine whether your model is production-ready. Accuracy alone is dangerously misleading.
Deriving Precision, Recall, and F1 from First Principles
Consider a binary classifier (e.g., "pneumonia" vs. "normal"). Every prediction falls into one of four categories:
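| | Predicted + | Predicted − |
|---|---|---|
| Actual + | TP (true positive) | FN (false negative) |
| Actual − | FP (false positive) | TN (true negative) |
Now we derive the key metrics:
$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Multi-class: $\text{Macro-}F_1 = \frac{1}{C} \sum_{c=1}^{C} F_{1,c}$ | $\text{Weighted-}F_1 = \sum_{c=1}^{C} \frac{n_c}{N} \times F_{1,c}$, where $n_c$ is the number of samples in class $c$ and $N$ is the total number of samples.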
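To make the confusion-matrix arithmetic concrete, here is a minimal pure-Python sketch. It assumes the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), F1 as their harmonic mean); the function names and the screening counts are hypothetical, chosen only for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def macro_and_weighted_f1(per_class_f1, per_class_support):
    """Macro-F1 treats classes equally; Weighted-F1 weights by class size."""
    n_total = sum(per_class_support)
    macro = sum(per_class_f1) / len(per_class_f1)
    weighted = sum(n / n_total * f
                   for f, n in zip(per_class_f1, per_class_support))
    return macro, weighted


# Hypothetical screening run: 1,000 X-rays, 50 of them with pneumonia.
m = binary_metrics(tp=40, fp=30, fn=10, tn=920)
print(m)  # accuracy is 0.96 even though 10 of 50 sick patients were missed

macro, weighted = macro_and_weighted_f1([0.9, 0.5], [900, 100])
print(macro, weighted)
```

The example shows why accuracy alone misleads: 96% accuracy coexists with a recall of only 0.8, and the weighted F1 hides the weak minority class that the macro average exposes.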
Grad-CAM: Making CNNs Explain Themselves
Gradient-weighted Class Activation Mapping (Grad-CAM) produces a heatmap highlighting which regions of the input image the model "looked at" to make its prediction. Let's derive it from scratch:
Grad-CAM Derivation
Let $A^k$ be the k-th feature map of the last convolutional layer (shape: $H \times W$), and $y^c$ be the score for class c (before softmax).
1. Compute the gradient $\partial y^c / \partial A^k$. This tells us how much each spatial location in feature map k influences class c.
2. Global-average-pool the gradients: $\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$, where $Z = H \times W$. This single number tells us how important feature map k is for class c.
3. Combine the feature maps: $L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha_k^c \cdot A^k\right)$. ReLU because we only care about features that have a positive influence on class c.
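The Grad-CAM recipe (average the gradients per feature map, take the weighted sum of the maps, apply ReLU) can be traced by hand on a toy input. The sketch below uses plain nested lists rather than tensors; the two 2×2 feature maps and their gradients are made-up numbers, not outputs of a real network.

```python
def grad_cam_map(feature_maps, gradients):
    """feature_maps, gradients: lists of K maps, each an H x W nested list."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    z = height * width
    # alpha_k^c: global average of the gradient over each feature map
    alphas = [sum(sum(row) for row in grad) / z for grad in gradients]
    # Weighted sum of feature maps
    cam = [[0.0] * width for _ in range(height)]
    for alpha, fmap in zip(alphas, feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * fmap[i][j]
    # ReLU keeps only locations with a positive influence on the class
    cam = [[max(0.0, v) for v in row] for row in cam]
    return alphas, cam


# Toy example: K = 2 feature maps of size 2 x 2 (illustrative values only)
A = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
dY_dA = [
    [[0.5, 0.5], [0.5, 0.5]],      # this map supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],  # this map argues against the class
]
alphas, cam = grad_cam_map(A, dY_dA)
print(alphas)  # [0.5, -1.0]
print(cam)     # [[0.5, 0.0], [0.0, 1.0]]
```

Locations dominated by the second map end up at zero after the ReLU, which is why the heatmap shows only evidence in favour of the chosen class.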
mAP for Object Detection
For Project 5 (YOLOv8), we need mean Average Precision (mAP):
$AP_c = \int_0^1 p(r)\, dr$ (area under the precision-recall curve for class c)
$\text{mAP@0.5} = \frac{1}{C} \sum_c AP_c$ at IoU threshold = 0.5
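A small sketch of how these two quantities are typically computed for one class. It assumes boxes in (x1, y1, x2, y2) pixel format and uses the all-point interpolation variant of AP; detection toolkits differ in the exact interpolation (some sample the curve at fixed recall points), so treat the numbers as illustrative. The detections list is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, num_ground_truth):
    """detections: list of (confidence, is_true_positive) for one class.

    A detection counts as a true positive when its IoU with an unmatched
    ground-truth box is at least the chosen threshold (0.5 for mAP@0.5).
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Precision envelope: make p(r) non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the stepwise precision-recall curve
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
ap = average_precision(
    [(0.9, True), (0.8, False), (0.7, True), (0.6, True)],
    num_ground_truth=4,
)
print(overlap, ap)
```

mAP is then the mean of `average_precision` over all classes.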
Q: In a medical screening test, which metric should you optimize โ precision or recall?
A: Recall (sensitivity). Missing a disease case (FN) is far more dangerous than a false alarm (FP). A false alarm leads to more tests; a missed case can lead to death. That's why medical AI systems target recall ≥ 0.95 even if precision drops.
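One way to act on this is to pick the decision threshold from the recall target instead of defaulting to 0.5. The sketch below is a simplified, pure-Python illustration; the probabilities and labels are invented, and `threshold_for_recall` is a hypothetical helper, not a library function.

```python
def threshold_for_recall(probs, labels, target_recall=0.95):
    """Highest threshold t whose rule 'positive if prob >= t' reaches the target recall."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        recall = tp / positives
        if recall >= target_recall:
            return t, recall, tp / (tp + fp)
    raise ValueError("target recall not reachable")


# Hypothetical validation scores (1 = pneumonia, 0 = normal)
probs = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

threshold, recall, precision = threshold_for_recall(probs, labels)
print(threshold, recall, precision)
```

On this toy data the default 0.5 threshold gives recall 0.75 and precision 0.75; lowering the threshold to 0.30 lifts recall to 1.0 at the cost of precision dropping to about 0.67, which is exactly the trade the answer above describes.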
Indian Crop Disease Detection
Problem Statement
Indian agriculture loses 15-25% of crop yield annually to plant diseases. An average farmer in Maharashtra or Punjab cannot afford an agronomist visit (₹2,000-5,000). You will build a model that photographs a leaf and identifies the disease within 3 seconds, running entirely on a ₹10,000 smartphone with no internet required.
Dataset: PlantVillage
| Property | Value |
|---|---|
| Total Images | 54,305 |
| Classes | 38 (14 crop species × diseases + healthy) |
| Indian Crops Included | Tomato, Potato, Corn, Pepper (+ augment for Rice, Wheat, Cotton) |
| Image Size | 256×256 RGB |
| Class Imbalance | Moderate (healthy classes overrepresented) |
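One common response to moderate imbalance like this is to weight each class inversely to its frequency. The sketch below shows the arithmetic in pure Python; the class names and counts are hypothetical stand-ins, not the real PlantVillage distribution. The resulting weights could then be passed, in class-index order, to the `weight` argument of `nn.CrossEntropyLoss`.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """weight_c = N / (C * n_c): rare classes get weights above 1."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical label list: an over-represented healthy class and a rare disease
labels = (["tomato_healthy"] * 600
          + ["tomato_late_blight"] * 300
          + ["potato_early_blight"] * 100)

weights = inverse_frequency_weights(labels)
for cls, w in sorted(weights.items()):
    print(f"{cls:22s} {w:.3f}")
```

With this scheme a perfectly balanced dataset yields a weight of 1.0 for every class, so the loss scale stays comparable across datasets.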
Architecture: ResNet50 + Custom Head
Full PyTorch Implementation
Python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms, datasets
from torch.utils.data import DataLoader
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# --- Device Setup ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# --- Data Augmentation (Indian crop-aware) ---
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(30),
    transforms.ColorJitter(
        brightness=0.3, contrast=0.3,
        saturation=0.3, hue=0.1  # Simulate Indian soil/lighting
    ),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.GaussianBlur(kernel_size=3),  # Phone camera blur
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])
val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

# --- Dataset Loading ---
train_dataset = datasets.ImageFolder("plantvillage/train", train_transforms)
val_dataset = datasets.ImageFolder("plantvillage/val", val_transforms)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=4, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False,
                        num_workers=4, pin_memory=True)


# --- Model: ResNet50 with Custom Head ---
class CropDiseaseNet(nn.Module):
    def __init__(self, num_classes=38, pretrained=True):
        super().__init__()
        self.backbone = models.resnet50(
            weights=models.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
        )
        # Freeze backbone initially
        for param in self.backbone.parameters():
            param.requires_grad = False
        # Replace classifier head (new layers are trainable by default)
        in_features = self.backbone.fc.in_features  # 2048
        self.backbone.fc = nn.Sequential(
            nn.Dropout(0.4),
            nn.Linear(in_features, 512),
            nn.ReLU(),
            nn.BatchNorm1d(512),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def unfreeze_backbone(self, layers="layer4"):
        """Gradually unfreeze backbone layers for fine-tuning."""
        for name, param in self.backbone.named_parameters():
            if layers in name:
                param.requires_grad = True

    def forward(self, x):
        return self.backbone(x)


model = CropDiseaseNet(num_classes=38).to(device)


# --- Training Loop with 2-Phase Strategy ---
def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss, correct, total = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
        _, preds = outputs.max(1)
        correct += preds.eq(labels).sum().item()
        total += labels.size(0)
    return running_loss / total, correct / total


def evaluate(model, loader, criterion, device):
    model.eval()
    running_loss, correct, total = 0.0, 0, 0
    all_preds, all_labels = [], []
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            running_loss += loss.item() * images.size(0)
            _, preds = outputs.max(1)
            correct += preds.eq(labels).sum().item()
            total += labels.size(0)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    return running_loss / total, correct / total, all_preds, all_labels


# --- Phase 1: Train head only (5 epochs) ---
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.backbone.fc.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)
for epoch in range(5):
    train_loss, train_acc = train_one_epoch(model, train_loader,
                                            criterion, optimizer, device)
    val_loss, val_acc, _, _ = evaluate(model, val_loader, criterion, device)
    scheduler.step()
    print(f"Phase1 Epoch {epoch+1}: Train Acc={train_acc:.4f}, Val Acc={val_acc:.4f}")

# --- Phase 2: Unfreeze layer4 + fine-tune (15 epochs) ---
model.unfreeze_backbone("layer4")
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()),
                       lr=1e-5)  # Much lower LR!
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15)
best_val_acc = 0
for epoch in range(15):
    train_loss, train_acc = train_one_epoch(model, train_loader,
                                            criterion, optimizer, device)
    val_loss, val_acc, preds, labels = evaluate(model, val_loader,
                                                criterion, device)
    scheduler.step()
    print(f"Phase2 Epoch {epoch+1}: Train={train_acc:.4f}, Val={val_acc:.4f}")
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "crop_disease_best.pth")

# --- Evaluation (report on the best checkpoint, not the last epoch) ---
model.load_state_dict(torch.load("crop_disease_best.pth", map_location=device))
_, _, preds, labels = evaluate(model, val_loader, criterion, device)
print(classification_report(labels, preds, target_names=train_dataset.classes))
Expected Results
Grad-CAM Visualization
Python
import torch.nn.functional as F
import matplotlib.pyplot as plt


def grad_cam(model, image_tensor, target_class, target_layer):
    """Generate Grad-CAM heatmap for a given image and class."""
    model.eval()
    activations, gradients = {}, {}

    # Register hooks on target layer
    def forward_hook(module, inputs, output):
        activations['value'] = output.detach()

    def backward_hook(module, grad_in, grad_out):
        gradients['value'] = grad_out[0].detach()

    handle_f = target_layer.register_forward_hook(forward_hook)
    handle_b = target_layer.register_full_backward_hook(backward_hook)

    # Forward pass
    output = model(image_tensor.unsqueeze(0).to(device))
    # Backward pass for target class
    model.zero_grad()
    output[0, target_class].backward()

    # Compute weights (global average pooling of gradients)
    weights = gradients['value'].mean(dim=[2, 3], keepdim=True)  # alpha_k^c
    # Weighted combination + ReLU
    cam = F.relu((weights * activations['value']).sum(dim=1, keepdim=True))
    # Upsample to input size
    cam = F.interpolate(cam, size=(224, 224), mode='bilinear', align_corners=False)
    cam = cam.squeeze().cpu().numpy()
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

    handle_f.remove()
    handle_b.remove()
    return cam


# Usage
# sample_image: a normalized 3x224x224 tensor; original_image: the same
# 224x224 crop as an RGB array, so the heatmap lines up with the leaf.
target_layer = model.backbone.layer4[-1]  # Last bottleneck in layer4
heatmap = grad_cam(model, sample_image, predicted_class, target_layer)
plt.imshow(original_image)
plt.imshow(heatmap, alpha=0.5, cmap='jet')
plt.title(f"Grad-CAM: {class_names[predicted_class]}")
plt.show()
India:
- 38+ Indian crop diseases
- Offline-first (no 4G in fields)
- ₹8,000 phone target hardware
- Hindi/Marathi/Telugu voice output
- ICAR partnership for ground truth
- Revenue: ₹500/farmer/season
US / Global:
- Satellite + drone imagery (not phone)
- Cloud-based processing (5G available)
- $100K+ precision ag platforms
- English-only interface
- USDA partnership for datasets
- Revenue: $15/acre/season
Deployment: ONNX Export
Python
import os

# Export to ONNX for mobile deployment
model.eval()  # Export with Dropout/BatchNorm in inference mode
dummy_input = torch.randn(1, 3, 224, 224).to(device)
torch.onnx.export(model, dummy_input, "crop_disease.onnx",
                  input_names=["image"], output_names=["prediction"],
                  dynamic_axes={"image": {0: "batch"},
                                "prediction": {0: "batch"}})
print("Exported! ONNX model size:",
      os.path.getsize("crop_disease.onnx") / 1e6, "MB")
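Because the classifier head ends in a linear layer, the exported graph's `prediction` output contains raw logits rather than probabilities. The app still has to turn them into a ranked list for the farmer. A minimal sketch of that post-processing step follows, in pure Python so it is independent of any inference runtime; the four class names and the logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, class_names, k=3):
    """Return the k most probable (class_name, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(class_names, probs), key=lambda t: t[1], reverse=True)
    return ranked[:k]


# Hypothetical 4-class output for a single leaf image
class_names = ["healthy", "early_blight", "late_blight", "leaf_mold"]
logits = [0.2, 2.1, 3.4, -1.0]

predictions = top_k(logits, class_names, k=2)
for name, prob in predictions:
    print(f"{name}: {prob:.1%}")
```

Showing the top two or three candidates with their probabilities, rather than a single label, gives the user a cue when the model is unsure.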
Indian Currency Note Authentication
Problem Statement
Post-demonetization (Nov 2016), India introduced new ₹500 and ₹2000 notes. The RBI seized ₹8.26 crore in counterfeit currency in FY2023. You will build a CNN that analyzes texture patterns, watermark regions, and security thread features to classify notes as genuine or counterfeit: a binary classification problem with critical precision requirements.
Dataset Engineering
No public dataset exists for Indian counterfeit notes (for obvious security reasons). You'll create a synthetic pipeline:
| Source | Genuine Notes | Counterfeit Simulation |
|---|---|---|
| ₹500 notes | 2,000 images (varied lighting, angles) | 2,000 (print-scanned, washed, photocopy artifacts) |
| ₹2000 notes | 2,000 images | 2,000 (degraded security features) |
| Augmented | ×5 (= 10,000 per denomination) | ×5 (noise injection, color shift) |
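With five augmented copies per captured note, a random image-level split can place copies of the same physical note in both train and validation sets, which inflates validation scores. One hedge against that, sketched below, is to split by source note and keep all of its augmentations together. This is an illustrative suggestion, not part of the pipeline described above; the note IDs are hypothetical.

```python
import random


def split_by_source(sample_ids, val_fraction=0.2, seed=0):
    """sample_ids: list of (source_note_id, augmentation_index) pairs.

    All augmented copies of one source note land on the same side of the split.
    """
    sources = sorted({source for source, _ in sample_ids})
    rng = random.Random(seed)
    rng.shuffle(sources)
    n_val = max(1, int(len(sources) * val_fraction))
    val_sources = set(sources[:n_val])
    train = [s for s in sample_ids if s[0] not in val_sources]
    val = [s for s in sample_ids if s[0] in val_sources]
    return train, val


# Hypothetical: 100 captured notes, each with 5 augmented copies
samples = [(f"note_{i:04d}", aug) for i in range(100) for aug in range(5)]
train, val = split_by_source(samples)
print(len(train), len(val))  # 400 100
```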
Architecture: Multi-Region CNN
Python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms


class CurrencyAuthNet(nn.Module):
    """Multi-region CNN for Indian currency authentication.
    Analyzes watermark, security thread, latent image, and texture
    regions separately, then fuses features for final prediction."""

    def __init__(self):
        super().__init__()

        # Same architecture for every region, but each branch has its own weights
        def make_branch():
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
                nn.Flatten(),
                nn.Linear(128 * 4 * 4, 256), nn.ReLU(), nn.Dropout(0.3)
            )

        self.watermark_branch = make_branch()  # Region 1: Watermark area
        self.thread_branch = make_branch()     # Region 2: Security thread
        self.latent_branch = make_branch()     # Region 3: Latent image
        self.texture_branch = make_branch()    # Region 4: Overall texture
        # Fusion classifier
        self.classifier = nn.Sequential(
            nn.Linear(256 * 4, 512), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Dropout(0.4),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 2)  # genuine vs counterfeit
        )

    def forward(self, watermark, thread, latent, texture):
        f1 = self.watermark_branch(watermark)
        f2 = self.thread_branch(thread)
        f3 = self.latent_branch(latent)
        f4 = self.texture_branch(texture)
        fused = torch.cat([f1, f2, f3, f4], dim=1)
        return self.classifier(fused)


# --- Region Extraction Utility ---
def extract_regions(note_image):
    """Extract 4 security-feature regions from a currency note image.
    Expects a (C, H, W) tensor.
    Coordinates calibrated for ₹500/₹2000 note dimensions."""
    h, w = note_image.shape[1:]
    watermark = note_image[:, :h//2, :w//3]        # Upper half, left third
    thread = note_image[:, :, w//3:w//3 + w//10]   # Vertical strip
    latent = note_image[:, h//2:, :w//3]           # Lower half, left third
    texture = note_image                           # Full note for texture
    # Resize all to 64×64 for uniform processing
    resize = transforms.Resize((64, 64))
    return resize(watermark), resize(thread), resize(latent), resize(texture)


# --- Training ---
# Assumes label 0 = genuine, 1 = counterfeit, and a train_loader that yields
# (watermark, thread, latent, texture, label) batches.
model = CurrencyAuthNet().to(device)
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0]).to(device))
# Weight=2.0 for counterfeit class: missing a fake note is worse!
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
for epoch in range(30):
    model.train()
    for batch in train_loader:
        wm, th, lt, tx, labels = [b.to(device) for b in batch]
        optimizer.zero_grad()
        outputs = model(wm, th, lt, tx)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
TRUTH: Public images of notes are low-resolution scans. Real authentication requires high-DPI captures (600+ DPI) of security features. You need controlled capture conditions.
WHY IT MATTERS: A model trained on web-scraped images will learn color/shape patterns, not the micro-texture and UV-response features that distinguish genuine from counterfeit notes.
India:
- ₹500, ₹2000 notes with Mahatma Gandhi Series features
- Demonetization created surge in counterfeiting
- ₹8.26 crore seized in FY2023
- Bank-level deployment needed
- UV + tactile features unique to Indian notes
US / Global:
- $100 "supernotes" (North Korean counterfeits)
- $20 is most counterfeited denomination
- $70M+ seized annually
- FedEye automated detection systems
- Color-shifting ink + 3D security ribbon
Traffic Sign Recognition for Indian Roads
Problem Statement
India has 4.6 lakh road accidents annually, the highest in the world. Indian traffic signs are fundamentally different from the German Traffic Sign Recognition Benchmark (GTSRB) used in most research: they're multilingual (Hindi + English + regional), have different color conventions, and are often occluded by trees, ads, or dust. You will build a real-time classifier for Indian road signs.
Indian vs. German Signs: Key Differences
| Feature | German (GTSRB) | Indian |
|---|---|---|
| Language | German only | Hindi + English + Regional |
| Shape Standards | Strict EU compliance | IRC standards (often non-compliant) |
| Conditions | Clean, well-maintained | Dusty, faded, partially occluded |
| Categories | 43 classes | ~50+ classes (including toll, speed breaker) |
| Number Plates | Standard EU format | White/yellow with varying fonts |
Architecture: MobileNetV2 for Edge Speed
Python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms


class IndianTrafficSignNet(nn.Module):
    """MobileNetV2-based Indian traffic sign classifier.
    Optimized for real-time inference on edge devices (Jetson Nano, phones).
    Handles 50 Indian sign categories including multilingual signs."""

    def __init__(self, num_classes=50):
        super().__init__()
        self.backbone = models.mobilenet_v2(
            weights=models.MobileNet_V2_Weights.IMAGENET1K_V2
        )
        # Freeze the first 14 of the 19 modules in `features`
        # (stem conv + 17 inverted residual blocks + final 1x1 conv)
        for block in self.backbone.features[:14]:
            for param in block.parameters():
                param.requires_grad = False
        # Replace classifier
        self.backbone.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(1280, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Dropout(0.2),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        return self.backbone(x)


# --- Indian-specific augmentation ---
indian_sign_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),  # Partial occlusion
    transforms.RandomRotation(15),                        # Tilted signs
    transforms.ColorJitter(
        brightness=0.4, contrast=0.4,
        saturation=0.2, hue=0.05
    ),                                                    # Dust/sun fading
    transforms.RandomPerspective(
        distortion_scale=0.3, p=0.5
    ),                                                    # Viewing angle variation
    transforms.GaussianBlur(5, sigma=(0.1, 2.0)),         # Rain/fog
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.3, scale=(0.02, 0.15))   # Sticker occlusion
])

# Assumes `device` is defined as in Project 1 and `train_loader` wraps a
# sign dataset built with indian_sign_transforms.
model = IndianTrafficSignNet(num_classes=50).to(device)
optimizer = optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()),
                        lr=3e-4, weight_decay=0.01)
# OneCycleLR takes over the learning rate: it starts at max_lr / 25 by
# default, so the lr passed to AdamW above is overridden.
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-3, epochs=25,
    steps_per_epoch=len(train_loader)
)
criterion = nn.CrossEntropyLoss()

# --- Training with OneCycleLR ---
for epoch in range(25):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()  # OneCycleLR steps once per batch, not per epoch
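To see what the one-cycle policy does to the learning rate over the 25 epochs, the schedule can be reproduced without PyTorch. The sketch below is a simplified re-implementation that assumes the scheduler's usual defaults (30% warm-up, cosine annealing, `div_factor=25`, `final_div_factor=1e4`); it is meant for building intuition, not as a drop-in replacement, and the 100 batches per epoch are hypothetical.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at a given optimizer step under a two-phase cosine one-cycle policy."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1

    def cos_anneal(start, end, pct):
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1)

    if step <= warmup_end:
        # Phase 1: ramp up from initial_lr to max_lr
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    # Phase 2: anneal from max_lr down to min_lr
    pct = (step - warmup_end) / (total_steps - 1 - warmup_end)
    return cos_anneal(max_lr, min_lr, pct)


total_steps = 25 * 100  # 25 epochs x a hypothetical 100 batches per epoch
schedule = [one_cycle_lr(s, total_steps, max_lr=3e-3) for s in range(total_steps)]

print(f"start: {schedule[0]:.2e}")
print(f"peak:  {max(schedule):.2e}")
print(f"end:   {schedule[-1]:.2e}")
```

The schedule starts at max_lr / 25, peaks at max_lr roughly 30% of the way through training, and ends several orders of magnitude lower, which is why the scheduler must be stepped after every batch.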
"Deep Learning for Indian Traffic Sign Detection and Recognition" (ICCV Workshop 2023): Researchers from IIT Bombay created the ITSR-50 dataset with 15,000 images of Indian traffic signs. Their EfficientNet-B3 model achieved 96.8% accuracy, but dropped to 82.4% on rain/fog conditions, highlighting the domain gap challenge for Indian road scenarios. Their work also showed that bilingual signs are 12% harder to classify than English-only signs.
Chest X-Ray Pneumonia Detection
Problem Statement
India has only 1 radiologist per 100,000 people (vs. 1 per 10,000 in the US). A single radiologist at a district hospital in Jharkhand reads 200+ chest X-rays daily. Fatigue-related misdiagnosis is a real risk. You will build a pneumonia detection system that serves as a "second opinion" (not a replacement) for radiologists.
Dataset: Kermany Chest X-Ray
| Property | Value |
|---|---|
| Source | Kermany et al. (Mendeley Data) |
| Total Images | 5,856 chest X-rays |
| Classes | 2 (Normal: 1,583, Pneumonia: 4,273) |
| Image Size | Variable (resize to 224×224) |
| Class Imbalance | 2.7:1 ratio (pneumonia-heavy) |
- Never deploy as sole diagnostic tool. This is a screening aid, not a replacement for a radiologist's expertise.
- Sensitivity over specificity. Missing pneumonia (FN) can be fatal. A false alarm (FP) only means one more test.
- Grad-CAM is mandatory. Clinicians must be able to see why the model made its prediction. Black-box medical AI is unethical.
- Regulatory compliance: In India, medical AI requires CDSCO approval; in the US, it requires FDA 510(k) clearance.
- Dataset bias: The Kermany dataset is predominantly from pediatric patients in Guangzhou, China. Deploying on Indian adult patients without domain adaptation is dangerous.
Architecture: DenseNet121 with Grad-CAM
Python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms
from sklearn.metrics import roc_auc_score, roc_curve


class PneumoniaNet(nn.Module):
    """DenseNet121-based chest X-ray pneumonia detector.
    DenseNet chosen because:
    1. Feature reuse via dense connections: better with limited data
    2. Smaller model than ResNet50 (8M vs 25M params)
    3. CheXNet (Rajpurkar et al., 2017) validated on 14 pathologies"""

    def __init__(self):
        super().__init__()
        self.densenet = models.densenet121(
            weights=models.DenseNet121_Weights.IMAGENET1K_V1
        )
        # DenseNet121 final features: 1024 channels
        in_features = self.densenet.classifier.in_features
        self.densenet.classifier = nn.Sequential(
            nn.Linear(in_features, 256), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 1)  # Binary: one logit, sigmoid applied in the loss
        )

    def forward(self, x):
        return self.densenet(x)


model = PneumoniaNet().to(device)

# --- Weighted BCE for class imbalance ---
# Pneumonia:Normal = 4273:1583. pos_weight scales the positive (pneumonia)
# term, so n_negative / n_positive < 1 gives Normal relatively more weight.
pos_weight = torch.tensor([1583/4273]).to(device)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# --- Medical-appropriate augmentation (conservative!) ---
medical_transforms = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # Grayscale X-ray -> 3 channels for DenseNet
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),   # X-rays can be flipped
    transforms.RandomRotation(10),       # Slight rotation only!
    transforms.RandomAffine(
        degrees=0, translate=(0.05, 0.05)
    ),                                   # Small translation
    # NO color jitter: X-rays are grayscale!
    # NO aggressive crops: might remove pathology!
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])  # ImageNet stats for the pretrained backbone
])

# --- Training with sensitivity-constrained model selection ---
optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=3
)  # Monitor specificity at recall >= 0.95, not loss!
best_specificity = 0.0
for epoch in range(20):
    model.train()
    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.float().unsqueeze(1).to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Evaluate with medical metrics
    model.eval()
    all_probs, all_labels = [], []
    with torch.no_grad():
        for images, labels in val_loader:
            probs = torch.sigmoid(model(images.to(device)))
            all_probs.extend(probs.cpu().numpy().flatten())
            all_labels.extend(labels.numpy())

    fpr, tpr, thresholds = roc_curve(all_labels, all_probs)
    auc = roc_auc_score(all_labels, all_probs)
    # First ROC point whose TPR (recall) reaches 0.95, i.e. the highest
    # threshold that still meets the recall target
    idx = int(np.argmax(tpr >= 0.95))
    optimal_threshold = thresholds[idx]
    recall = tpr[idx]
    # Recall is pinned near 0.95 by construction, so it cannot rank epochs;
    # specificity at that operating point can.
    specificity = 1.0 - fpr[idx]
    print(f"Epoch {epoch+1}: AUC={auc:.4f}, Recall={recall:.4f}, "
          f"Specificity={specificity:.4f}, Threshold={optimal_threshold:.3f}")
    scheduler.step(specificity)
    if specificity > best_specificity:
        best_specificity = specificity
        torch.save(model.state_dict(), "pneumonia_best.pth")
Grad-CAM for Medical Explainability
Python
def medical_grad_cam(model, image, target_layer):
    """Generate Grad-CAM for chest X-ray interpretation.
    The heatmap must highlight lung regions where pathology is detected.
    If it highlights bones, borders, or text, the model is wrong!"""
    model.eval()
    activations, gradients = {}, {}

    def fwd_hook(m, i, o):
        activations['val'] = o.detach()

    def bwd_hook(m, gi, go):
        gradients['val'] = go[0].detach()

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    output = model(image.unsqueeze(0).to(device))
    model.zero_grad()
    output.backward()  # Binary: single logit, no class selection needed

    weights = gradients['val'].mean(dim=[2, 3], keepdim=True)
    cam = torch.relu((weights * activations['val']).sum(dim=1))
    cam = nn.functional.interpolate(
        cam.unsqueeze(0), size=(224, 224),
        mode='bilinear', align_corners=False
    ).squeeze().cpu().numpy()
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

    h1.remove()
    h2.remove()
    return cam


# Validate: Check if Grad-CAM focuses on lung regions
# If attention is on diaphragm/text/borders, the model learned shortcuts!
# test_image: a preprocessed 3x224x224 tensor
target_layer = model.densenet.features.denseblock4
cam = medical_grad_cam(model, test_image, target_layer)
TRUTH: Accuracy alone is close to meaningless for medical AI. You need: (1) AUC-ROC ≥ 0.95, (2) Sensitivity ≥ 0.95 at a clinically relevant specificity, (3) Grad-CAM showing attention on pathology (not artifacts), (4) External validation on a different hospital's dataset, (5) Regulatory approval (CDSCO/FDA).
WHY IT MATTERS: CheXNet (2017) claimed "radiologist-level performance", but later studies reported that chest X-ray models of this kind can degrade sharply on data from other hospitals. Distribution shift kills medical AI.
India:
- 1 radiologist per 100,000 people
- qXR by Qure.ai: TB + pneumonia screening
- Deployed in 90+ countries from Mumbai
- CDSCO Class B medical device approval
- ₹10-50 per scan pricing model
- Works on low-quality portable X-rays
US / Global:
- 1 radiologist per 10,000 people
- FDA 510(k) cleared AI products
- $100-500 per scan pricing
- Integrated with PACS systems
- Focus on efficiency, not access
- High-quality DICOM inputs expected
Roles using this skill:
- Medical AI Engineer at Qure.ai (Mumbai), SigTuple (Bengaluru): ₹18-35 LPA
- Clinical ML Scientist at Google Health, Aidoc: $150-250K USD
- Regulatory AI Specialist: bridging model development and CDSCO/FDA approval
- Research Scientist at AIIMS/IIT medical AI labs: academic + consulting income
Real-time Object Detection: Indian Traffic
Problem Statement
Self-driving car companies training on US/European data fail spectacularly on Indian roads. Why? Because their models have never seen an auto-rickshaw, a cow sitting on the highway median, or 4 people riding a single two-wheeler. You will train YOLOv8 to detect India-specific objects in real-time traffic video.
Indian Traffic Object Classes
| Class | India-Specific? | Challenge |
|---|---|---|
| Auto-rickshaw | Yes | Highly variable shapes (Pune vs Chennai vs Delhi) |
| Cow / Buffalo | Yes | Stationary obstacle, rare in COCO dataset |
| Pedestrian | Partial | Jaywalking, sari/dhoti clothing occlusion |
| Two-wheeler | Partial | 1-4 riders; helmet/no-helmet detection needed too |
| Truck | Partial | Heavily decorated "horn OK please" trucks |
| Bus | Partial | State transport with varying paint schemes |
| Car | No | Standard COCO class, good baseline |
| Street Dog | Yes | Small, fast-moving, frequently on roads |
| Cart / Thela | Yes | Hand-drawn carts, not in any standard dataset |
| Road Barrier | Partial | Non-standard barriers, construction debris |
YOLOv8: Architecture Overview
Full Implementation with Ultralytics
Python
# --- Install: pip install ultralytics ---
from ultralytics import YOLO
import cv2
import yaml

# --- Step 1: Prepare dataset config (YOLO format) ---
dataset_config = {
    'path': 'indian_traffic_dataset',
    'train': 'images/train',
    'val': 'images/val',
    'test': 'images/test',
    'nc': 10,  # Number of classes
    'names': [
        'auto_rickshaw', 'cow', 'pedestrian',
        'two_wheeler', 'truck', 'bus', 'car',
        'street_dog', 'cart', 'road_barrier'
    ]
}
with open('indian_traffic.yaml', 'w') as f:
    yaml.dump(dataset_config, f)

# --- Step 2: Load pretrained YOLOv8 and fine-tune ---
model = YOLO('yolov8m.pt')  # Medium model: good speed/accuracy balance

# --- Step 3: Train on Indian traffic data ---
results = model.train(
    data='indian_traffic.yaml',
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
    lrf=0.01,            # Final LR = lr0 × lrf
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3,
    warmup_momentum=0.8,
    augment=True,        # Note: the Ultralytics docs list `augment` as a predict-time flag;
                         # training augmentation is set by mosaic, mixup and the hsv_* values
    mosaic=1.0,          # Mosaic probability
    mixup=0.1,
    close_mosaic=10,     # Disable mosaic last 10 epochs
    device='0',
    project='indian_traffic',
    name='yolov8m_exp1'
)

# --- Step 4: Evaluate ---
metrics = model.val()
print(f"mAP@0.5: {metrics.box.map50:.4f}")
print(f"mAP@0.5:0.95: {metrics.box.map:.4f}")
# Per-class AP
for i, name in enumerate(dataset_config['names']):
    print(f"  {name:20s}: AP50={metrics.box.ap50[i]:.3f}")

# --- Step 5: Real-time inference on Indian dashcam video ---
model = YOLO('indian_traffic/yolov8m_exp1/weights/best.pt')
cap = cv2.VideoCapture('indian_highway_dashcam.mp4')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    results = model(frame, conf=0.5, iou=0.45)
    annotated = results[0].plot()  # Draw boxes + labels
    cv2.imshow('Indian Traffic Detection', annotated)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()

# --- Step 6: Export for edge deployment ---
model.export(format='onnx', simplify=True, dynamic=True)
model.export(format='engine', half=True, device='0')  # TensorRT FP16
Per-Class Detection Performance
| Class | AP@0.5 | Notes |
|---|---|---|
| Car | 0.942 | Best: abundant in COCO pretraining |
| Bus | 0.918 | Large objects, easy to detect |
| Auto-rickshaw | 0.876 | Good: unique shape signature |
| Truck | 0.891 | Decorated trucks need more data |
| Two-wheeler | 0.834 | Multiple riders cause confusion |
| Pedestrian | 0.812 | Sari/kurta clothing challenges |
| Cow | 0.788 | Stationary + background blend |
| Street Dog | 0.741 | Small, fast; hardest class |
| Cart | 0.763 | Limited training data |
| Barrier | 0.802 | Non-standard shapes |
India:
- Chaotic, rule-defying traffic
- Auto-rickshaws, cows, handcarts
- No lane discipline, mixed traffic
- Ather Energy scooters + ADAS
- IIT Hyderabad iHub for AV research
- NHAI exploring AI-based toll plazas
USA:
- Structured lanes, clear markings
- COCO/nuScenes standard datasets
- LiDAR + Camera fusion
- SAE Level 4 robotaxis in San Francisco
- NHTSA regulation framework
- $1B+ investment per company
A student's YOLOv8 training gets stuck at mAP = 0.45 after 50 epochs. Find the 3 bugs in their config:
results = model.train(
    data='indian_traffic.yaml',
    epochs=50,
    imgsz=320,  # Bug 1: ???
    batch=2,    # Bug 2: ???
    lr0=0.1,    # Bug 3: ???
    augment=False,
    mosaic=0.0,
)
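Hints: (1) imgsz=320 is half the 640px used in the reference config, so small objects lose most of their pixels. (2) batch=2 gives extremely noisy gradient estimates. (3) lr0=0.1 is 10× the lr0=0.01 used above, which is too high for fine-tuning pretrained weights. Also, augmentation is disabled (augment=False, mosaic=0.0). Fix: imgsz=640, batch=16, lr0=0.01, and re-enable augmentation (augment=True, mosaic=1.0).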
Visual Aid: The 5-Project Architecture Comparison
Transfer Learning Decision Flowchart
Common Misconceptions
TRUTH: Cleaner data leads to better accuracy. 5,000 well-labeled, diverse images often outperform 50,000 noisy images. Label quality is the #1 bottleneck in applied CV. Garbage in, garbage out, even with ResNet.
WHY IT MATTERS: Many Indian startups scrape large datasets from the web but don't invest in annotation quality. A mislabeled "healthy" leaf that actually has early-stage disease will teach your model to ignore disease symptoms.
TRUTH: Test set performance is necessary but not sufficient. Production readiness requires: (1) Performance on out-of-distribution data, (2) Inference latency within budget, (3) Grad-CAM showing reasonable attention, (4) Graceful failure on invalid inputs, (5) Monitoring for data drift.
WHY IT MATTERS: Your crop disease model trained on lab-photographed leaves will likely fail on field photos with soil, hands, and shadows in the frame.
TRUTH: Effective transfer learning is a 2-phase process: (1) Train only the new head with high LR for 5-10 epochs, (2) Gradually unfreeze backbone layers and fine-tune with 10-100× lower LR. Unfreezing too early or with too high a LR will destroy the pretrained features.
WHY IT MATTERS: The difference between a good and bad fine-tuning strategy is often 5-15% accuracy, more than most architectural changes.
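The two phases described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the chapter's own model: the tiny `backbone` is a hypothetical stand-in for a pretrained network, and the learning rates (1e-3 in Phase 1, then 1e-4 and 1e-5 in Phase 2) are assumed values chosen only to show the 10-100× reduction.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pretrained backbone; a real project would
# load pretrained weights (e.g. a ResNet50) here instead.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(16, 38)  # new task head (38 classes, as in PlantVillage)
model = nn.Sequential(backbone, head)


def count_trainable(module):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Phase 1: freeze the whole backbone and train only the head (higher LR).
for p in backbone.parameters():
    p.requires_grad = False
phase1_opt = torch.optim.Adam(head.parameters(), lr=1e-3)
phase1_trainable = count_trainable(model)

# Phase 2: unfreeze only the last conv layer of the backbone and
# fine-tune it with a 100x lower LR; the head LR drops 10x.
for p in backbone[2].parameters():
    p.requires_grad = True
phase2_opt = torch.optim.Adam([
    {"params": backbone[2].parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-4},
])
phase2_trainable = count_trainable(model)

print(phase1_trainable, phase2_trainable)
```

Using a fresh optimizer with per-group learning rates in Phase 2 is one common way to keep the pretrained layers moving slowly while the head continues to adapt; gradual unfreezing would repeat the Phase 2 step for earlier layers.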
TRUTH: YOLOv8 is anchor-free and slightly more accurate, but YOLOv5 has a more mature ecosystem, better documentation, and wider deployment support. For production systems, ecosystem maturity often matters more than marginal accuracy gains.
WHY IT MATTERS: Choose tools based on your deployment constraints, not benchmarks. A YOLOv5 model deployed and running is infinitely more useful than a YOLOv8 model stuck in development.
GATE / Exam Corner
Transfer Learning Formula Sheet
GATE PYQ-Style Questions
In a binary classification for medical diagnosis with 100 positive and 900 negative samples, a model predicts all samples as negative. What is the accuracy and recall?
- Accuracy = 90%, Recall = 0%
- Accuracy = 10%, Recall = 100%
- Accuracy = 90%, Recall = 90%
- Accuracy = 0%, Recall = 0%
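The arithmetic behind the question above can be checked directly. The sketch below rebuilds the scenario from the counts given in the question (100 positives, 900 negatives, every prediction negative); the variable names are illustrative only.

```python
# Counts taken from the question: 100 positive and 900 negative samples.
y_true = [1] * 100 + [0] * 900
y_pred = [0] * len(y_true)  # the model predicts "negative" for every sample

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)   # fraction of all samples classified correctly
recall = tp / (tp + fn)              # fraction of true positives that were found

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.90, recall=0.00
```

High accuracy with zero recall is exactly why accuracy alone is misleading on imbalanced medical data.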
In transfer learning, freezing all backbone layers and training only the classification head is equivalent to using the pretrained CNN as a:
- Generative model
- Fixed feature extractor
- Autoencoder
- Data augmentation tool
Grad-CAM computes importance weights by performing global average pooling on:
- The input image gradients
- The gradients of the output w.r.t. the last convolutional layer's feature maps
- The activations of the first convolutional layer
- The loss function gradients w.r.t. the weights
YOLOv8 differs from YOLOv3/v5 primarily because it is:
- A two-stage detector
- Anchor-free with decoupled detection head
- Based on Vision Transformers
- Uses only a single-scale feature map
Prediction Table: Likely GATE 2026-27 Topics
| Topic | Probability | Focus Area |
|---|---|---|
| Precision/Recall/F1 | Very High | Numerical computation from confusion matrix |
| Transfer Learning concept | High | When to freeze vs. fine-tune |
| IoU computation | High | Numerical + definition |
| Data Augmentation effects | Medium | Which augmentations preserve labels |
| Grad-CAM/Explainability | Medium | Conceptual understanding |
Interview Prep
Conceptual Questions
Q1: "Walk me through how you'd build a crop disease detection app for Indian farmers."
"I'd start with problem framing: the app must work offline on ₹8,000 phones. This rules out large models and cloud inference. For the model, I'd use MobileNetV2 pretrained on ImageNet, fine-tuned on PlantVillage (38 classes), augmented with Indian crop images from ICAR. Two-phase training: frozen backbone for 5 epochs, then unfreeze last 3 blocks with 100× lower LR. For evaluation, I'd prioritize per-class recall, because missing a disease (FN) costs the farmer their crop. Deployment on Android via TFLite with INT8 quantization to get the model under 10MB (ONNX Runtime Mobile is an alternative runtime). Add voice output in Hindi/Marathi using Android TTS for low-literacy users."
๐ฏ Q2: "Your medical AI model has 99% accuracy but the hospital rejects it. Why?"
"99% accuracy on an imbalanced dataset (95% normal, 5% disease) could mean the model just predicts 'normal' for everything. I'd ask: (1) What's the sensitivity? If it's below 90%, the model is missing diseases. (2) Does Grad-CAM show attention on lung pathology or on patient ID text in the X-ray corner? (3) Was it validated on data from a different hospital? Distribution shift kills medical AI. (4) Does it have regulatory approval? In India, CDSCO; in the US, FDA 510(k). (5) Does the UI show confidence scores and Grad-CAM to the radiologist? Hospitals won't trust a black box."
Q3: "How would you handle classes like 'cow on road' that don't exist in COCO?"
"This is a domain adaptation problem. COCO has 80 classes; it has no auto-rickshaw class, and although it does include 'cow', those images rarely show cattle in road traffic. My approach: (1) Collect 2,000-5,000 images per new class from Indian dashcam footage (available from Ola, BDD100K-India). (2) Annotate using CVAT or LabelImg with bounding boxes. (3) Start from COCO-pretrained YOLOv8; the backbone features (edges, textures, shapes) transfer well even to new classes. (4) Train with mosaic augmentation to compose new training scenes. (5) Monitor per-class AP, and expect novel classes like 'auto_rickshaw' to take longer to converge than familiar ones like 'car'. (6) Consider few-shot detection techniques if annotation budget is tight."
Coding Challenge
Live Coding: "Implement Grad-CAM from scratch in PyTorch in 15 minutes"
What they're testing: PyTorch hooks, backward pass understanding, tensor manipulation. The implementation is in Section 4 (Project 1). Key points to cover: (1) register_forward_hook / register_full_backward_hook, (2) global average pooling of gradients, (3) weighted sum + ReLU, (4) upsampling to input size. Common mistakes: forgetting .detach() in hooks, wrong dimension for mean operation.
Companies hiring for these skills (2024-2026):
- India: CropIn (Bengaluru), Qure.ai (Mumbai), Stellantis India, Ola Krutrim, SigTuple, Wadhwani AI: ₹15-45 LPA
- USA: Google Health, Waymo, Tesla Autopilot, Aidoc, PathAI: $130-250K USD
- Remote: Roboflow, Ultralytics, Hugging Face: competitive global salaries
Hands-On Lab: End-to-End Crop Disease Detector
Lab: Build, Train, Evaluate, and Deploy a Plant Disease Classifier
Part A: Data Preparation (30 min)
- Download PlantVillage dataset from Kaggle
- Split into train/val/test (70/15/15) with stratification
- Implement the Indian-crop augmentation pipeline from Project 1
- Visualize 5 augmented samples per class to verify augmentations are reasonable
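The stratified 70/15/15 split from Part A can be sketched with the standard library alone. This is one possible approach, not the lab's reference solution: `stratified_split`, the seed, and the toy label list are all hypothetical, and a real pipeline would pass the PlantVillage labels instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve per-class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy, imbalanced label list standing in for the real dataset labels.
labels = ["healthy"] * 100 + ["blight"] * 40
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 98 21 21
```

Splitting each class separately is what keeps rare diseases represented in all three subsets; a plain random split can leave a small class out of validation entirely.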
Part B: Model Training (60 min)
- Build CropDiseaseNet with ResNet50 backbone
- Phase 1: Train head only for 5 epochs (expect ~85% val accuracy)
- Phase 2: Unfreeze layer4, fine-tune for 15 epochs with cosine LR (expect ~96%)
- Plot training/validation loss and accuracy curves
Part C: Evaluation (45 min)
- Generate full classification report (precision/recall/F1 per class)
- Plot 38×38 confusion matrix and identify the most confused class pairs
- Generate Grad-CAM heatmaps for 10 correct and 5 incorrect predictions
- Write a 200-word "failure analysis": why does the model confuse certain diseases?
Part D: Deployment (45 min)
- Export model to ONNX format
- Measure inference time on CPU (should be <100ms for 224×224)
- Build a simple Gradio web interface for uploading leaf photos
- Test with 5 real leaf photos from your garden/campus
| Component | Points | Criteria |
|---|---|---|
| Data Pipeline | 20 | Correct split, augmentation, dataloaders |
| Model Training | 25 | 2-phase strategy, val accuracy ≥ 93% |
| Evaluation | 25 | Full metrics report, confusion matrix, Grad-CAM |
| Deployment | 20 | ONNX export, Gradio demo working |
| Analysis Write-up | 10 | Thoughtful failure analysis |
Exercises (22 Problems)
Section A: Conceptual Questions (5)
Why is transfer learning from ImageNet effective for crop disease detection, even though ImageNet doesn't contain any leaf disease images?
Explain why aggressive data augmentation (random erasing, cutout) is appropriate for traffic sign recognition but dangerous for chest X-ray diagnosis.
What is the "accuracy paradox"? Give an example from the pneumonia detection project.
Why does the multi-region architecture (Project 2) outperform a single-image CNN for currency authentication?
Why does YOLOv8 use an anchor-free design? What problem did anchor-based detection have?
Section B: Mathematical Problems (8)
A crop disease model produces the following confusion matrix for 3 classes (Healthy=H, Blight=B, Rust=R). Compute macro and weighted F1.
| | Pred-H | Pred-B | Pred-R | Total |
|---|---|---|---|---|
| Actual-H | 85 | 10 | 5 | 100 |
| Actual-B | 5 | 70 | 25 | 100 |
| Actual-R | 2 | 8 | 90 | 100 |
H: P=85/92=0.924, R=85/100=0.85, F1=2(0.924×0.85)/(0.924+0.85)=0.886
B: P=70/88=0.795, R=70/100=0.70, F1=2(0.795×0.70)/(0.795+0.70)=0.745
R: P=90/120=0.75, R=90/100=0.90, F1=2(0.75×0.90)/(0.75+0.90)=0.818
Macro F1 = (0.886+0.745+0.818)/3 = 0.816
Weighted F1 = same as Macro F1 here since all classes have equal support (100 each) = 0.816
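The hand computation above can be cross-checked in a few lines of plain Python. The sketch below reads precision from column sums and recall from row sums of the same confusion matrix; the variable names are illustrative.

```python
# Rows = actual class, columns = predicted class; order: Healthy, Blight, Rust.
confusion = [
    [85, 10, 5],
    [5, 70, 25],
    [2, 8, 90],
]
names = ["Healthy", "Blight", "Rust"]

f1_scores = []
for c, name in enumerate(names):
    tp = confusion[c][c]
    predicted_c = sum(row[c] for row in confusion)  # column sum: everything predicted as c
    actual_c = sum(confusion[c])                    # row sum: everything that truly is c
    precision = tp / predicted_c
    recall = tp / actual_c
    f1 = 2 * precision * recall / (precision + recall)
    f1_scores.append(f1)
    print(f"{name:8s} P={precision:.3f} R={recall:.3f} F1={f1:.3f}")

macro_f1 = sum(f1_scores) / len(f1_scores)
print(f"Macro F1 = {macro_f1:.3f}")  # Macro F1 = 0.816
```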
Compute IoU for two bounding boxes: Box A = (x1=10, y1=10, x2=50, y2=50) and Box B = (x1=30, y1=30, x2=70, y2=70). Is this a valid detection at IoU threshold 0.5?
Intersection area = (50-30)×(50-30) = 20×20 = 400
Area A = (50-10)×(50-10) = 40×40 = 1600
Area B = (70-30)×(70-30) = 40×40 = 1600
Union = 1600 + 1600 - 400 = 2800
IoU = 400/2800 = 0.143
At IoU threshold 0.5, this is NOT a valid detection (0.143 < 0.5).
A pneumonia detector has these results: TP=190, FP=30, TN=170, FN=10. Compute sensitivity, specificity, PPV, NPV, and F1-score.
Sensitivity (Recall) = 190/(190+10) = 190/200 = 0.95
Specificity = 170/(170+30) = 170/200 = 0.85
PPV (Precision) = 190/(190+30) = 190/220 = 0.864
NPV = 170/(170+10) = 170/180 = 0.944
F1 = 2×(0.864×0.95)/(0.864+0.95) = 0.905
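These five screening metrics all come from the same four counts, so they fit naturally into one helper. The function name `screening_metrics` is hypothetical; the inputs are the counts from the problem.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute standard binary screening metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall: diseased cases that were caught
    specificity = tn / (tn + fp)   # healthy cases correctly cleared
    ppv = tp / (tp + fp)           # precision: positive calls that were right
    npv = tn / (tn + fn)           # negative calls that were right
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


m = screening_metrics(tp=190, fp=30, tn=170, fn=10)
for key, value in m.items():
    print(f"{key:12s} {value:.3f}")
```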
A ResNet50 backbone has 25.6M parameters. If we freeze all backbone layers and only train a head with layers [Linear(2048,512), Linear(512,38)], how many trainable parameters does the model have? (Ignore biases for simplicity)
Linear(2048, 512): 2048 × 512 = 1,048,576 params
Linear(512, 38): 512 × 38 = 19,456 params
Total trainable = 1,048,576 + 19,456 = 1,068,032 ≈ 1.07M
That's only about 4.2% of the backbone's 25.6M parameters, which is why frozen-backbone training is so fast!
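The same count can be reproduced programmatically, which also shows how little the "ignore biases" simplification matters. The helper `linear_params` is a hypothetical name, and the 25.6M backbone figure is taken from the problem statement.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


head_layers = [(2048, 512), (512, 38)]
backbone_params = 25.6e6  # figure quoted in the problem

no_bias = sum(linear_params(i, o, bias=False) for i, o in head_layers)
with_bias = sum(linear_params(i, o, bias=True) for i, o in head_layers)

print(f"without biases: {no_bias:,}")    # without biases: 1,068,032
print(f"with biases:    {with_bias:,}")  # with biases:    1,068,582
print(f"share of backbone size: {no_bias / backbone_params:.1%}")
```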
In Grad-CAM, the importance weight $\alpha_k^c$ is the global average pool of the gradients $\partial y^c / \partial A^k$. If each feature map $A^k$ has spatial dimensions 7×7 and there are 512 channels, what is the shape of the final Grad-CAM heatmap (before upsampling)?
Each $\alpha_k^c$ is a scalar, one per channel, so there are 512 weights.
Each $A^k$ has shape (7, 7).
$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_{k=1}^{512} \alpha_k^c A^k\right)$ is a weighted sum over 512 channels.
Final shape: (7, 7), a single 7×7 heatmap that gets upsampled to input size (e.g., 224×224).
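The shape argument above can be verified numerically. The NumPy sketch below uses random arrays as stand-ins for real activations and gradients (an assumption made purely for illustration); only the shapes and the order of operations matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for one image: 512 feature maps of size 7x7.
activations = rng.standard_normal((512, 7, 7))  # A^k
gradients = rng.standard_normal((512, 7, 7))    # dy^c / dA^k

# Global average pooling of the gradients: one scalar weight per channel.
alpha = gradients.mean(axis=(1, 2))             # shape (512,)

# Weighted sum over channels, then ReLU to keep only positive evidence.
weighted_sum = (alpha[:, None, None] * activations).sum(axis=0)
cam = np.maximum(weighted_sum, 0.0)             # shape (7, 7)

print(alpha.shape, cam.shape)  # (512,) (7, 7)
```

In a real model the two arrays would come from forward and backward hooks on the last convolutional layer, and `cam` would then be upsampled to the input resolution.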
A YOLOv8 model outputs detections at 3 scales: 80×80, 40×40, and 20×20. How many candidate detections are generated per image?
Scale 1: 80 × 80 = 6,400 candidates
Scale 2: 40 × 40 = 1,600 candidates
Scale 3: 20 × 20 = 400 candidates
Total = 6,400 + 1,600 + 400 = 8,400 candidates
NMS (Non-Maximum Suppression) then reduces these to typically 10-50 final detections.
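The grid sizes in this problem follow from the input resolution and the detection strides. The sketch below assumes a 640×640 input and strides of 8, 16, and 32, which reproduce the 80/40/20 grids given in the problem; with one prediction per grid cell, the total is just the sum of squared grid sizes.

```python
imgsz = 640             # assumed input resolution
strides = (8, 16, 32)   # assumed downsampling factor of each detection scale

grid_sizes = [imgsz // s for s in strides]       # [80, 40, 20]
per_scale = [g * g for g in grid_sizes]          # one candidate per grid cell
total_candidates = sum(per_scale)

print(grid_sizes, per_scale, total_candidates)   # [80, 40, 20] [6400, 1600, 400] 8400
```

Halving `imgsz` to 320 would cut the total to 2,100 candidates, which is one way to see why low resolution hurts small-object recall.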
Show that the harmonic mean (F1-score) is always ≤ the arithmetic mean of precision and recall. When are they equal?
Let a = Precision and b = Recall, with a, b > 0. We need to show:
(a+b)/2 ≥ 2ab/(a+b)
⟺ (a+b)² ≥ 4ab (multiplying both sides by 2(a+b) > 0)
⟺ a² + 2ab + b² ≥ 4ab
⟺ a² - 2ab + b² ≥ 0
⟺ (a-b)² ≥ 0 (always true)
Equality holds when a = b, i.e., F1 = AM only when Precision = Recall. This shows F1 penalizes imbalance between P and R more harshly than arithmetic mean does.
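A quick numerical check of the inequality is sketched below. It evaluates the gap between the arithmetic and harmonic means on a grid of precision/recall values and compares it with the closed form (a-b)²/(2(a+b)), which follows from the algebra above; the grid resolution is an arbitrary choice.

```python
def harmonic_mean(a, b):
    """F1-style harmonic mean of two positive numbers."""
    return 2 * a * b / (a + b)


grid = [i / 20 for i in range(1, 21)]  # 0.05, 0.10, ..., 1.00

max_error = 0.0
min_gap = float("inf")
for p in grid:
    for r in grid:
        gap = (p + r) / 2 - harmonic_mean(p, r)        # AM - HM
        closed_form = (p - r) ** 2 / (2 * (p + r))     # predicted gap
        max_error = max(max_error, abs(gap - closed_form))
        min_gap = min(min_gap, gap)

print(f"smallest gap: {min_gap:.2e}, max deviation from closed form: {max_error:.2e}")
```

The gap is zero only on the diagonal p = r and grows as precision and recall drift apart.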
In the currency authentication model, the loss uses class weight 2.0 for counterfeits. If the base cross-entropy loss for a counterfeit sample is -log(0.9) = 0.105, what is the weighted loss? How does this affect gradient magnitude?
Weighted loss = 2.0 × 0.105 = 0.210
The gradient is also scaled by 2×: ∂(weighted_loss)/∂θ = 2.0 × ∂(base_loss)/∂θ.
This means the model updates its weights 2× more aggressively when it misclassifies a counterfeit note, effectively telling the optimizer "missing a fake note is twice as bad as a false alarm."
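The scaling of both the loss and its gradient can be confirmed with a few lines of arithmetic. The sketch below differentiates the weighted loss with respect to the predicted probability p rather than the network weights θ (a simplification assumed here; by the chain rule the same factor of 2 carries through to θ).

```python
import math

class_weight = 2.0   # weight for the counterfeit class, from the problem
p_correct = 0.9      # predicted probability of the true class

base_loss = -math.log(p_correct)
weighted_loss = class_weight * base_loss

# d/dp of -w * log(p) is -w / p, so the weight scales the gradient directly.
base_grad = -1.0 / p_correct
weighted_grad = -class_weight / p_correct

print(f"base loss     = {base_loss:.3f}")      # base loss     = 0.105
print(f"weighted loss = {weighted_loss:.3f}")  # weighted loss = 0.211
print(f"gradient ratio = {weighted_grad / base_grad:.1f}")
```

Note that the unrounded weighted loss is 0.211; the 0.210 in the worked solution comes from doubling the already rounded 0.105.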
Section C: Coding Problems (4)
Write a PyTorch function compute_metrics(y_true, y_pred, num_classes) that computes per-class precision, recall, F1, and macro-averaged F1 from scratch (no sklearn).
def compute_metrics(y_true, y_pred, num_classes):
    """Per-class precision/recall/F1 and macro F1 from 1-D tensors of class indices."""
    metrics = {}
    f1_scores = []
    for c in range(num_classes):
        # One-vs-rest counts for class c
        tp = ((y_pred == c) & (y_true == c)).sum().item()
        fp = ((y_pred == c) & (y_true != c)).sum().item()
        fn = ((y_pred != c) & (y_true == c)).sum().item()
        # The 1e-8 terms guard against division by zero for absent classes
        p = tp / (tp + fp + 1e-8)
        r = tp / (tp + fn + 1e-8)
        f1 = 2 * p * r / (p + r + 1e-8)
        metrics[c] = {'precision': p, 'recall': r, 'f1': f1}
        f1_scores.append(f1)
    metrics['macro_f1'] = sum(f1_scores) / num_classes
    return metrics
Write a function compute_iou(box_a, box_b) that computes IoU between two bounding boxes in [x1, y1, x2, y2] format. Handle the no-overlap case.
def compute_iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2]-box_a[0]) * (box_a[3]-box_a[1])
    area_b = (box_b[2]-box_b[0]) * (box_b[3]-box_b[1])
    union = area_a + area_b - inter
    return inter / (union + 1e-8)
Implement a complete 2-phase transfer learning pipeline: Phase 1 trains only the head, Phase 2 unfreezes the last N layers. Include learning rate adjustment.
Key elements: (1) set param.requires_grad = False to freeze layers, (2) separate optimizers for each phase, (3) LR for Phase 2 should be 10-100× lower than Phase 1, (4) cosine annealing or ReduceLROnPlateau scheduler, (5) save the best model based on the validation metric.
Write a custom PyTorch Dataset class for the multi-region currency authentication model that loads a note image and returns 4 cropped regions + label.
import torch
from PIL import Image
from torchvision import transforms

class CurrencyDataset(torch.utils.data.Dataset):
    """Loads a note image and returns 4 regions (watermark, thread, latent, texture) + label."""
    def __init__(self, image_paths, labels, transform=None):
        self.paths = image_paths
        self.labels = labels
        self.transform = transform
        self.resize = transforms.Resize((64, 64))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert('RGB')
        w, h = img.size
        # Crop boxes are (left, upper, right, lower); always resize so regions share one shape
        watermark = self.resize(img.crop((0, 0, w//3, h//2)))
        thread = self.resize(img.crop((w//3, 0, w//3 + w//10, h)))
        latent = self.resize(img.crop((0, h//2, w//3, h)))
        texture = self.resize(img)  # whole note
        if self.transform:
            watermark, thread, latent, texture = (
                self.transform(region) for region in (watermark, thread, latent, texture)
            )
        return watermark, thread, latent, texture, self.labels[idx]
Section D: Critical Thinking (3)
Your chest X-ray model shows excellent performance on the Kermany dataset but fails when deployed at AIIMS Delhi. What are 3 likely reasons and how would you fix each?
A startup claims their Indian traffic sign recognition system achieves 99% accuracy. You're an investor evaluating this claim. What 5 questions would you ask?
Discuss the ethical implications of deploying a cow-detection model for autonomous vehicles in India. Consider: religious sentiments, animal welfare, liability, and regional variation.
Starred Research Problems (2)
Read the CheXNet paper (Rajpurkar et al., 2017). They claim "radiologist-level performance" on 14 pathologies using DenseNet121. Critically analyze: (1) How did they compare against radiologists? (2) What criticisms has the paper received? (3) How would you design a more rigorous evaluation? Write a 500-word analysis.
Design a "few-shot" crop disease detection system that can learn to identify a new disease from only 5 example images. Propose an architecture (hint: metric learning or prototypical networks) and describe how you'd evaluate it. Include a comparison with standard fine-tuning.
Connections
How Chapter 20 Connects to the Rest of the Book
- Chapter 13 (CNN Architectures): ResNet50, MobileNetV2, DenseNet121 are all architectures used in this chapter
- Chapter 13 (Transfer Learning): The 2-phase fine-tuning strategy comes directly from transfer learning theory
- Chapter 9 (Regularization): Dropout, data augmentation, weight decay are all used extensively in every project
- Chapter 4 (Loss Functions): Cross-entropy, BCE with logits, weighted loss for class imbalance
- Chapter 21 (MLOps): Deploying these models in production requires CI/CD, monitoring, model versioning
- Chapter 22 (Ethics & Future): The medical AI ethics discussion in Project 4 is expanded in the ethics chapter
- Foundation Models for CV: DINOv2, SAM (Segment Anything Model). Can these replace task-specific fine-tuning?
- Vision-Language Models: GPT-4V, Gemini. Can you describe a disease in text and have the model classify it?
- Federated Learning for Medical AI: Training on hospital data without moving it, for privacy-preserving medical AI
- CropIn (Bengaluru): Serves 7M+ farmers with AI-powered crop advisory
- Qure.ai (Mumbai): Deployed in 90+ countries for chest X-ray screening
- Ultralytics: YOLOv8 used in 100K+ projects worldwide
Chapter Summary
7 Key Takeaways
- The model is the least important decision. Data quality, augmentation strategy, evaluation methodology, and deployment constraints matter more than whether you use ResNet50 or EfficientNet-B3.
- Transfer learning is the default. Always start with a pretrained backbone. Use 2-phase training: head-only with high LR, then gradual unfreezing with 10-100× lower LR. This gives you 95%+ of the benefit with 10% of the compute.
- Metrics must match the domain. Accuracy for traffic signs. Recall (sensitivity) for medical screening. mAP@0.5 for detection. F1 when precision and recall both matter. Never use accuracy alone on imbalanced datasets.
- Grad-CAM is not optional. For medical AI, it's ethically required. For all projects, it's a debugging tool: if your crop disease model is looking at the background instead of the leaf spots, your model learned a shortcut.
- Indian CV projects require domain adaptation. PlantVillage needs Indian crop augmentation. COCO needs Indian traffic classes. Chest X-ray models need Indian hospital validation. Off-the-shelf models from US/European research fail on Indian data.
- Deployment is half the battle. A 98% accurate model on a GPU server is useless to a farmer in Madhya Pradesh without internet. ONNX export, INT8 quantization, and mobile runtime optimization are not afterthoughts; they're design requirements.
- Ethics is engineering. In medical AI, a false negative can kill. In autonomous driving, a missed cow can cause an accident. Build safety margins, regulatory awareness, and human-in-the-loop design into every project from day one.
The Key Equations
Grad-CAM: $L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$, where $\alpha_k^c = \mathrm{GAP}\left(\partial y^c / \partial A^k\right)$
IoU: $\mathrm{IoU}(A,B) = |A \cap B| \,/\, |A \cup B|$
The Key Intuition
Applied deep learning is not about knowing the fanciest architecture; it's about engineering discipline. The same ResNet50 can give you 60% or 98% on the same dataset. The difference is in data curation, augmentation design, learning rate scheduling, evaluation rigor, and deployment optimization. Mastering these "boring" engineering skills is what separates a student who knows deep learning theory from an engineer who can deploy it to save crops, detect diseases, and prevent accidents.
Further Reading
Indian Resources
- NPTEL Course: "Deep Learning for Computer Vision" by Prof. Vineeth N Balasubramanian (IIT Hyderabad): covers CNN architectures, transfer learning, and object detection with Indian examples
- NPTEL Course: "Computer Vision" by Prof. Jayanthi Sivaswamy (IIIT Hyderabad): classical + deep learning approaches
- CropIn Technical Blog: How they build AI models for Indian agriculture with limited labeled data
- Qure.ai Research Papers: Multiple publications on chest X-ray AI deployment in low-resource settings
- GATE Preparation: "Deep Learning" by Ian Goodfellow: Chapters 9 (CNN) and 12 (Applications)
- IIT Bombay ITSR Dataset: Indian Traffic Sign Recognition benchmark for academic research
Global Resources
- Paper: He et al., "Deep Residual Learning for Image Recognition," CVPR 2016: the ResNet paper used in Project 1
- Paper: Selvaraju et al., "Grad-CAM: Visual Explanations from Deep Networks," ICCV 2017: the Grad-CAM paper used across all projects
- Paper: Rajpurkar et al., "CheXNet: Radiologist-Level Pneumonia Detection," 2017: inspiration for Project 4
- Paper: Jocher et al., "Ultralytics YOLOv8," 2023: the YOLOv8 architecture documentation
- Paper: Hughes & Salathé, "An open access repository of images on plant health," 2015: PlantVillage dataset paper
- Video: 3Blue1Brown, "But what is a Neural Network?": foundational intuition
- Interactive: Distill.pub: Building Blocks of Interpretability: excellent visualization of neural network features
- Docs: Ultralytics YOLOv8 Documentation: complete training, evaluation, and deployment guide
- Platform: Roboflow: dataset annotation, augmentation, and model training platform
- Book: François Chollet, "Deep Learning with Python" (2nd edition): practical Keras/TF approach to CV projects