Neural Networks & Deep Learning

Chapter 20: Applied Deep Learning

Computer Vision Projects: From Farm to Hospital to Highway

⏱️ Reading Time: ~4 hours  |  📖 Unit VII: Applications & Industry  |  🔨 Project-Driven Chapter

📋 Prerequisites: Chapter 13 (CNN Architectures & Transfer Learning), Chapter 17 (Object Detection & Segmentation)

Bloom's Taxonomy Progression

Bloom's Level | What You'll Achieve
🔵 Remember | Recall the standard CV project pipeline: problem framing → dataset engineering → model selection → training → evaluation → deployment
🔵 Understand | Explain why ResNet50 transfer learning works for crop disease detection, why Grad-CAM is critical for medical AI, and how YOLOv8 achieves real-time detection
🟢 Apply | Build 5 complete CV projects: crop disease detection, currency authentication, traffic sign recognition, chest X-ray diagnosis, and real-time object detection
🟡 Analyze | Diagnose model failures through confusion matrices, precision-recall trade-offs, Grad-CAM heatmaps, and per-class error analysis
🟠 Evaluate | Choose optimal architectures and deployment strategies for real-world constraints (mobile phone, hospital PACS, edge GPU)
🔴 Create | Design end-to-end deployable CV systems with data pipelines, model optimization (ONNX/TorchScript), and production monitoring
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Build a crop disease detection system using ResNet50 transfer learning on the PlantVillage dataset (38 classes) with Indian-crop-specific data augmentation, achieving >95% accuracy
  • Develop an Indian currency note authentication CNN that distinguishes genuine ₹500/₹2000 notes from counterfeits using texture and watermark features
  • Train a traffic sign recognition model adapted for Indian road signs: multilingual text, non-standard shapes, and conditions distinct from German GTSRB
  • Implement a chest X-ray pneumonia detection classifier with Grad-CAM explainability and understand the medical ethics of deploying AI in healthcare
  • Deploy YOLOv8 for real-time object detection on Indian traffic scenarios: auto-rickshaws, cows on roads, pedestrians, and two-wheelers
  • Evaluate every model using precision, recall, F1-score, confusion matrices, ROC-AUC, and domain-appropriate metrics (sensitivity/specificity for medical, mAP for detection)
  • Visualize model decisions using Grad-CAM heatmaps to build trust, debug failures, and meet regulatory requirements
  • Compare Indian deployment constraints with US/global equivalents and adapt solutions accordingly
Section 2

Opening Hook: Theory Without Practice Is Empty

🌾 Five Problems. Five Models. One Chapter.

Theory without practice is empty. Practice without theory is blind. For 19 chapters, you've built up a formidable arsenal: perceptrons, backpropagation, CNNs, transfer learning, object detection, attention mechanisms. Now it's time to deploy that arsenal on real problems that matter.

In a village near Nagpur, a cotton farmer loses ₹3 lakh to bollworm-related leaf disease because he misidentified the symptoms. At an RBI currency chest in Lucknow, a clerk handles 10,000 notes daily; how many counterfeits slip through? On NH-48 near Gurugram, a self-driving car prototype encounters a cow sitting on the highway median, a scenario that never appears in Stanford's datasets. At AIIMS Delhi, a radiologist reads 200 chest X-rays daily and misses a subtle pneumonia case at 4 PM because of fatigue.

Each of these problems has a deep learning solution that you will build in this chapter. Not toy examples. Not MNIST. Full production-grade projects with real datasets, proper evaluation, Grad-CAM explainability, and deployment code. These are projects you can show in interviews, deploy on your phone, and even monetize.

CropIn RBI NHAI AIIMS Ola/Uber Google Health Waymo
Why India needs these 5 projects: India's agriculture sector (₹19.7 lakh crore GDP) loses ~15-25% to crop diseases annually. The RBI seized ₹8.26 crore in counterfeit notes in FY2023 alone. Indian roads see 4.6 lakh accidents/year, the highest in the world. India has 1 radiologist per 100,000 people vs. 1 per 10,000 in the US. Computer vision is not luxury tech here; it's infrastructure.
Section 3

The Intuition First: Why Projects, Not Just Theory?

The "Cooking Class" Analogy

Imagine you've spent a semester learning about heat transfer, Maillard reactions, emulsification, and flavor compounds. You know the science of cooking. But can you actually cook a biryani? Making biryani requires you to orchestrate all that knowledge simultaneously: choosing the right rice, managing the dum, timing the layers. That's what this chapter is.

Each project is a "dish" that forces you to combine multiple skills:

Project = Problem Framing + Data Engineering + Architecture Choice
        + Training Loop + Evaluation Metrics + Grad-CAM
        + Deployment Strategy + Ethical Considerations

┌──────────────────────────────────────────────────────────────┐
│                  YOUR DEEP LEARNING KITCHEN                  │
│                                                              │
│  Ingredients         Tools          Dishes                   │
│  ───────────────     ───────────    ───────────────────────  │
│  PlantVillage DS     ResNet50       Crop Disease Detector    │
│  Currency Images     Custom CNN     Note Authenticator       │
│  Indian Signs        MobileNet      Traffic Sign Classifier  │
│  Chest X-Rays        DenseNet121    Pneumonia Detector       │
│  Traffic Video       YOLOv8         Object Detector          │
│                                                              │
│  Common Spices: Augmentation, Transfer Learning, Grad-CAM    │
└──────────────────────────────────────────────────────────────┘

The "Aha" Question

Here's something that might surprise you: the model architecture is usually the least important decision in an applied CV project. The same ResNet50 can get you 60% or 98% accuracy on the same dataset. The difference? Data quality, augmentation strategy, learning rate schedule, and evaluation methodology. This chapter teaches you the 80% of effort that determines success: the "engineering" around the model.

In Kaggle competitions, the winning solution's model architecture is often identical to the 100th-place solution's. The difference is in data preprocessing, augmentation, ensembling, and post-processing. Feature engineering > model engineering, even in deep learning.
Section 4

Mathematical Foundation: Metrics That Matter

Before diving into projects, you need to master the evaluation metrics that determine whether your model is production-ready. Accuracy alone is dangerously misleading.

Deriving Precision, Recall, and F1 from First Principles

Consider a binary classifier (e.g., "pneumonia" vs. "normal"). Every prediction falls into one of four categories:

True Positive (TP): Model says "pneumonia" → Patient actually has pneumonia ✅
False Positive (FP): Model says "pneumonia" → Patient is actually normal ❌ (false alarm)
True Negative (TN): Model says "normal" → Patient is actually normal ✅
False Negative (FN): Model says "normal" → Patient actually has pneumonia ❌ (missed case!)

Now we derive the key metrics:

Precision = TP / (TP + FP): "Of all the patients I flagged as pneumonia, how many actually have it?" High precision = few false alarms.
Recall (Sensitivity) = TP / (TP + FN): "Of all the patients who actually have pneumonia, how many did I catch?" High recall = few missed cases.
F1-Score = 2 × (Precision × Recall) / (Precision + Recall): the harmonic mean. Why harmonic, not arithmetic? Because we want the F1 to be low if either precision or recall is low. The arithmetic mean of 0.99 and 0.01 is 0.50, which is misleadingly high. The harmonic mean is 0.0198, which is correctly harsh.
Specificity = TN / (TN + FP): "Of all normal patients, how many did I correctly identify as normal?"

Confusion Matrix (Binary):
               Predicted +  Predicted −
Actual +       TP           FN
Actual −       FP           TN

Multi-class: Macro-F1 = (1/C) Σ_c F1_c  |  Weighted-F1 = Σ_c (n_c / N) × F1_c, where C is the number of classes, n_c is the number of samples in class c, and N is the total number of samples.
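The definitions above can be checked with a few lines of plain Python. This is a minimal sketch: the function names and all counts are illustrative assumptions, not results from any project in this chapter.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the headline metrics from binary confusion-matrix counts."""
    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is empty
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)          # a.k.a. sensitivity
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "accuracy": accuracy}


def macro_and_weighted_f1(f1_per_class, support_per_class):
    """Macro-F1 averages classes equally; Weighted-F1 weights by class size."""
    total = sum(support_per_class)
    macro = sum(f1_per_class) / len(f1_per_class)
    weighted = sum(n / total * f1
                   for f1, n in zip(f1_per_class, support_per_class))
    return macro, weighted


# Hypothetical screening run: 1,000 X-rays, 100 of them pneumonia
metrics = binary_metrics(tp=90, fp=30, tn=870, fn=10)
for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")

# Hypothetical 2-class problem: a large easy class and a small hard class
macro, weighted = macro_and_weighted_f1([0.9, 0.5], [900, 100])
print(f"macro={macro:.2f} weighted={weighted:.2f}")
```

With these made-up counts, accuracy is 0.96 while precision is only 0.75, which is the gap that accuracy alone hides. In the second example, Macro-F1 (0.70) exposes the weak minority class that Weighted-F1 (0.86) smooths over.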

Grad-CAM: Making CNNs Explain Themselves

Gradient-weighted Class Activation Mapping (Grad-CAM) produces a heatmap highlighting which regions of the input image the model "looked at" to make its prediction. Let's derive it from scratch:

Grad-CAM Derivation

Let A^k be the k-th feature map of the last convolutional layer (shape: H×W), and y^c be the score for class c (before softmax).

Step 1: Compute the gradient of y^c with respect to each feature map A^k:
∂y^c / ∂A^k, which tells us how much each spatial location in feature map k influences class c.
Step 2: Global Average Pool these gradients to get the "importance weight" α_k^c:
α_k^c = (1/Z) Σ_i Σ_j (∂y^c / ∂A^k_ij)
where Z = H × W. This single number tells us how important feature map k is for class c.
Step 3: Compute the weighted combination of feature maps, then apply ReLU:
L^c_Grad-CAM = ReLU(Σ_k α_k^c · A^k)
ReLU because we only care about features that have a positive influence on class c.
Step 4: Upsample the resulting heatmap to the input image size and overlay as a colormap.
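Steps 2 and 3 can be traced by hand on a tiny example. The sketch below uses plain nested lists and hand-picked numbers (all hypothetical) rather than a real network, so the arithmetic is easy to follow; Step 4 (upsampling) is omitted.

```python
def grad_cam_map(activations, gradients):
    """activations, gradients: K feature maps, each an H x W nested list."""
    num_maps = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    z = height * width

    # Step 2: alpha_k^c = global average of the gradients of feature map k
    alphas = [sum(sum(row) for row in grad_map) / z for grad_map in gradients]

    # Step 3: ReLU of the alpha-weighted sum of feature maps
    cam = [[max(0.0, sum(alphas[k] * activations[k][i][j]
                         for k in range(num_maps)))
            for j in range(width)]
           for i in range(height)]
    return alphas, cam


# Toy input: two 2x2 feature maps with hand-picked gradients
A = [[[1.0, 2.0], [3.0, 4.0]],          # feature map 0
     [[4.0, 3.0], [2.0, 1.0]]]          # feature map 1
dY_dA = [[[0.5, 0.5], [0.5, 0.5]],      # map 0 supports class c
         [[-1.0, -1.0], [-1.0, -1.0]]]  # map 1 argues against class c

alphas, cam = grad_cam_map(A, dY_dA)
print(alphas)  # [0.5, -1.0]
print(cam)     # [[0.0, 0.0], [0.0, 1.0]]
```

Only the bottom-right cell survives the ReLU: it is the one location where the supporting map (weight 0.5) outweighs the opposing map (weight -1.0).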

mAP for Object Detection

For Project 5 (YOLOv8), we need mean Average Precision (mAP):

IoU = Area(Pred ∩ GT) / Area(Pred ∪ GT)

AP_c = ∫_0^1 p(r) dr   (area under precision-recall curve for class c)

mAP@0.5 = (1/C) Σ_c AP_c  at IoU threshold = 0.5
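To make the three formulas concrete, here is a small dependency-free sketch. It assumes boxes in (x1, y1, x2, y2) corner format and uses all-point interpolation for the integral, which is one common convention; benchmark toolkits differ in the details, so treat the function names and numbers as illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(is_true_positive, num_ground_truth):
    """AP for one class.

    is_true_positive: one bool per detection, sorted by descending confidence,
    True when the detection matched an unclaimed GT box with IoU >= threshold.
    """
    if num_ground_truth == 0:
        return 0.0
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_true_positive:
        tp, fp = (tp + 1, fp) if hit else (tp, fp + 1)
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Precision envelope: make p(r) non-increasing, scanning right to left
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas under the enveloped precision-recall curve
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the plain mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))           # 1 / 7
ap = average_precision([True, False, True], 2)      # 5 / 6
print(f"IoU={overlap:.3f}  AP={ap:.3f}")
print(f"mAP={mean_average_precision([ap, 0.5]):.3f}")
```

In the AP example, the false positive ranked second costs precision only over the second half of the recall range, which is why the result is 0.833 rather than 1.0.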

Q: In a medical screening test, which metric should you optimize โ€” precision or recall?

A: Recall (sensitivity). Missing a disease case (FN) is far more dangerous than a false alarm (FP). A false alarm leads to more tests; a missed case can lead to death. That's why medical AI screening systems often target recall ≥ 0.95 even if precision drops.

Recall = TP / (TP + FN): maximize this for screening
Precision = TP / (TP + FP): maximize this for confirmation
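In practice, a recall target like this is usually met by moving the decision threshold rather than by retraining. The sketch below assumes a classifier that outputs a pneumonia probability per image and picks the highest threshold that still reaches the target recall; the function name, scores, and labels are made up for illustration.

```python
def threshold_for_recall(scores, labels, target_recall=0.95):
    """Return (threshold, precision, recall) for the highest threshold whose
    recall meets the target. A sample is positive when score >= threshold."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")
    # Scan candidate thresholds from strict to lenient
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        recall = tp / positives
        if recall >= target_recall:
            return t, tp / (tp + fp), recall
    raise ValueError("target_recall must be at most 1.0")


# Hypothetical validation scores (1 = pneumonia, 0 = normal)
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    0,    0]

t, precision, recall = threshold_for_recall(scores, labels, target_recall=0.95)
print(f"threshold={t:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data the default threshold of 0.5 would catch only 3 of the 4 pneumonia cases. Lowering it to 0.40 catches all four at a precision of 0.80. The threshold should be chosen on a validation set, never on the test set.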
Project 1

Indian Crop Disease Detection

ResNet50 Transfer Learning • PlantVillage • 38 Classes • Mobile Deployment

Problem Statement

Indian agriculture loses 15-25% of crop yield annually to plant diseases. An average farmer in Maharashtra or Punjab cannot afford an agronomist visit (₹2,000-5,000). You will build a model that takes a photograph of a leaf and identifies the disease within 3 seconds, running entirely on a ₹10,000 smartphone with no internet required.

Dataset: PlantVillage

Property | Value
Total Images | 54,305
Classes | 38 (14 crop species × diseases + healthy)
Indian Crops Included | Tomato, Potato, Corn, Pepper (+ augment for Rice, Wheat, Cotton)
Image Size | 256×256 RGB
Class Imbalance | Moderate (healthy classes overrepresented)
Adapting for Indian crops: PlantVillage doesn't include rice blast, wheat rust, or cotton bollworm images. You'll use domain adaptation: (1) scrape additional images from ICAR databases, (2) use aggressive augmentation (color jitter to simulate different soil backgrounds), and (3) fine-tune with a small set of field-captured Indian images. CropIn (Bengaluru) and Microsoft's AI Sowing App use exactly this approach.
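One common way to counter the moderate class imbalance noted in the dataset table is inverse-frequency class weighting. The sketch below computes weights as N / (C × n_c); the function name and the tiny label list are hypothetical stand-ins for the real training labels.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """weight_c = N / (C * n_c); a perfectly balanced dataset gives 1.0 each."""
    counts = Counter(labels)
    n_total, n_classes = len(labels), len(counts)
    return {cls: n_total / (n_classes * n)
            for cls, n in sorted(counts.items())}


# Hypothetical label list: healthy leaves dominate, late blight is rare
labels = ["healthy"] * 6 + ["early_blight"] * 3 + ["late_blight"] * 1
weights = inverse_frequency_weights(labels)
for cls, w in weights.items():
    print(f"{cls:13s} {w:.3f}")
```

Assuming the weights are ordered to match the dataset's class indices, they could be passed as the `weight` tensor of `nn.CrossEntropyLoss`, so that mistakes on rare diseases cost more than mistakes on the overrepresented classes.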

Architecture: ResNet50 + Custom Head

Input (224×224×3)
     │
┌────▼────────────────────────────────────┐
│ ResNet50 Backbone (pretrained ImageNet) │
│ ─────────────────────────────────────── │
│ Conv1 → BN → ReLU → MaxPool             │
│ Layer1: 3 Bottleneck blocks (64→256)    │
│ Layer2: 4 Bottleneck blocks (128→512)   │
│ Layer3: 6 Bottleneck blocks (256→1024)  │
│ Layer4: 3 Bottleneck blocks (512→2048)  │  ← Freeze these initially
└────┬────────────────────────────────────┘
     │ 2048-dim feature vector
┌────▼────────────────────────────────────┐
│ Custom Classification Head              │
│ ─────────────────────────────────────── │
│ AdaptiveAvgPool2d(1,1)                  │
│ Dropout(0.4)                            │
│ Linear(2048 → 512) + ReLU + BN          │
│ Dropout(0.3)                            │
│ Linear(512 → 38)  ← 38 disease classes  │
└────┬────────────────────────────────────┘
     │
Softmax → Predicted Disease

Full PyTorch Implementation

Python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms, datasets
from torch.utils.data import DataLoader
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# ── Device Setup ──
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ── Data Augmentation (Indian crop-aware) ──
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(30),
    transforms.ColorJitter(
        brightness=0.3, contrast=0.3,
        saturation=0.3, hue=0.1  # Simulate Indian soil/lighting
    ),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.GaussianBlur(kernel_size=3),  # Phone camera blur
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

# ── Dataset Loading ──
train_dataset = datasets.ImageFolder("plantvillage/train", train_transforms)
val_dataset   = datasets.ImageFolder("plantvillage/val", val_transforms)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=4, pin_memory=True)
val_loader   = DataLoader(val_dataset, batch_size=32, shuffle=False,
                          num_workers=4, pin_memory=True)

# ── Model: ResNet50 with Custom Head ──
class CropDiseaseNet(nn.Module):
    def __init__(self, num_classes=38, pretrained=True):
        super().__init__()
        self.backbone = models.resnet50(
            weights=models.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
        )
        # Freeze backbone initially
        for param in self.backbone.parameters():
            param.requires_grad = False

        # Replace classifier head
        in_features = self.backbone.fc.in_features  # 2048
        self.backbone.fc = nn.Sequential(
            nn.Dropout(0.4),
            nn.Linear(in_features, 512),
            nn.ReLU(),
            nn.BatchNorm1d(512),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def unfreeze_backbone(self, layers="layer4"):
        """Gradually unfreeze backbone layers for fine-tuning."""
        for name, param in self.backbone.named_parameters():
            if layers in name:
                param.requires_grad = True

    def forward(self, x):
        return self.backbone(x)

model = CropDiseaseNet(num_classes=38).to(device)

# ── Training Loop with 2-Phase Strategy ──
def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss, correct, total = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
        _, preds = outputs.max(1)
        correct += preds.eq(labels).sum().item()
        total += labels.size(0)
    return running_loss / total, correct / total

def evaluate(model, loader, criterion, device):
    model.eval()
    running_loss, correct, total = 0.0, 0, 0
    all_preds, all_labels = [], []
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            running_loss += loss.item() * images.size(0)
            _, preds = outputs.max(1)
            correct += preds.eq(labels).sum().item()
            total += labels.size(0)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    return running_loss / total, correct / total, all_preds, all_labels

# ── Phase 1: Train head only (5 epochs) ──
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.backbone.fc.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

for epoch in range(5):
    train_loss, train_acc = train_one_epoch(model, train_loader,
                                            criterion, optimizer, device)
    val_loss, val_acc, _, _ = evaluate(model, val_loader, criterion, device)
    scheduler.step()
    print(f"Phase1 Epoch {epoch+1}: Train Acc={train_acc:.4f}, Val Acc={val_acc:.4f}")

# ── Phase 2: Unfreeze layer4 + fine-tune (15 epochs) ──
model.unfreeze_backbone("layer4")
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()),
                       lr=1e-5)  # Much lower LR!
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15)

best_val_acc = 0
for epoch in range(15):
    train_loss, train_acc = train_one_epoch(model, train_loader,
                                            criterion, optimizer, device)
    val_loss, val_acc, preds, labels = evaluate(model, val_loader,
                                                criterion, device)
    scheduler.step()
    print(f"Phase2 Epoch {epoch+1}: Train={train_acc:.4f}, Val={val_acc:.4f}")
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "crop_disease_best.pth")

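# ── Evaluation (reload the best checkpoint so the report matches the saved model) ──
model.load_state_dict(torch.load("crop_disease_best.pth", map_location=device))
_, _, preds, labels = evaluate(model, val_loader, criterion, device)
print(classification_report(labels, preds, target_names=train_dataset.classes))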

Expected Results

Overall Accuracy: 96.3%
Macro F1: 0.958
Weighted F1: 0.971
Parameters: 25.6M

Grad-CAM Visualization

Python
import torch.nn.functional as F
import matplotlib.pyplot as plt

def grad_cam(model, image_tensor, target_class, target_layer):
    """Generate Grad-CAM heatmap for a given image and class."""
    model.eval()
    activations, gradients = {}, {}

    # Register hooks on target layer
    def forward_hook(module, input, output):
        activations['value'] = output.detach()

    def backward_hook(module, grad_in, grad_out):
        gradients['value'] = grad_out[0].detach()

    handle_f = target_layer.register_forward_hook(forward_hook)
    handle_b = target_layer.register_full_backward_hook(backward_hook)

    # Forward pass
    output = model(image_tensor.unsqueeze(0).to(device))
    # Backward pass for target class
    model.zero_grad()
    output[0, target_class].backward()

    # Compute weights (global average pooling of gradients)
    weights = gradients['value'].mean(dim=[2, 3], keepdim=True)  # α_k^c
    # Weighted combination + ReLU
    cam = F.relu((weights * activations['value']).sum(dim=1, keepdim=True))
    # Upsample to input size
    cam = F.interpolate(cam, size=(224, 224), mode='bilinear', align_corners=False)
    cam = cam.squeeze().cpu().numpy()
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

    handle_f.remove()
    handle_b.remove()
    return cam

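# Usage. The names below are placeholders you must supply:
#   sample_image    normalized 3×224×224 tensor (output of val_transforms)
#   original_image  the same image as a 224×224 RGB array, for display
#   predicted_class class index; class_names: list of class labels
# Gradients reach layer4 only if it (or an earlier layer) has
# requires_grad=True, as is the case after Phase 2 fine-tuning.
target_layer = model.backbone.layer4[-1]  # Last bottleneck in layer4
heatmap = grad_cam(model, sample_image, predicted_class, target_layer)
plt.imshow(original_image)
plt.imshow(heatmap, alpha=0.5, cmap='jet')
plt.title(f"Grad-CAM: {class_names[predicted_class]}")
plt.show()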
🇮🇳 India: CropIn / Plantix
  • 38+ Indian crop diseases
  • Offline-first (no 4G in fields)
  • ₹8,000 phone target hardware
  • Hindi/Marathi/Telugu voice output
  • ICAR partnership for ground truth
  • Revenue: ₹500/farmer/season
🇺🇸 USA: Climate Corp / Taranis
  • Satellite + drone imagery (not phone)
  • Cloud-based processing (5G available)
  • $100K+ precision ag platforms
  • English-only interface
  • USDA partnership for datasets
  • Revenue: $15/acre/season

Deployment: ONNX Export

Python
# Export to ONNX for mobile deployment
import os

model.eval()  # Disable dropout and use BatchNorm running statistics before export
dummy_input = torch.randn(1, 3, 224, 224).to(device)
torch.onnx.export(model, dummy_input, "crop_disease.onnx",
                  input_names=["image"], output_names=["prediction"],
                  dynamic_axes={"image": {0: "batch"}, "prediction": {0: "batch"}})
print("✅ Exported! ONNX model size:",
      os.path.getsize("crop_disease.onnx") / 1e6, "MB")
Project 2

Indian Currency Note Authentication

Custom CNN • Texture & Watermark Features • ₹500/₹2000 Counterfeit Detection

Problem Statement

Post-demonetization (Nov 2016), India introduced new ₹500 and ₹2000 notes. The RBI seized ₹8.26 crore in counterfeit currency in FY2023. You will build a CNN that analyzes texture patterns, watermark regions, and security thread features to classify notes as genuine or counterfeit: a binary classification problem with critical precision requirements.

Dataset Engineering

No public dataset exists for Indian counterfeit notes (for obvious security reasons). You'll create a synthetic pipeline:

Source | Genuine Notes | Counterfeit Simulation
₹500 notes | 2,000 images (varied lighting, angles) | 2,000 (print-scanned, washed, photocopy artifacts)
₹2000 notes | 2,000 images | 2,000 (degraded security features)
Augmented | ×5 (=10,000 per class) | ×5 (noise injection, color shift)
Security feature regions matter most: crop the note into 4 regions, namely (1) watermark area, (2) security thread, (3) latent image, and (4) micro-lettering zone. Train separate feature extractors for each region, then fuse predictions. This mimics how human experts authenticate currency.

Architecture: Multi-Region CNN

Python
class CurrencyAuthNet(nn.Module):
    """Multi-region CNN for Indian currency authentication.
    Analyzes watermark, security thread, latent image, and texture
    regions separately, then fuses features for final prediction."""

    def __init__(self):
        super().__init__()
        # Each region gets its own branch: same architecture, separate weights
        def make_branch():
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
                nn.Flatten(),
                nn.Linear(128 * 4 * 4, 256), nn.ReLU(), nn.Dropout(0.3)
            )

        self.watermark_branch  = make_branch()  # Region 1: Watermark area
        self.thread_branch     = make_branch()  # Region 2: Security thread
        self.latent_branch     = make_branch()  # Region 3: Latent image
        self.texture_branch    = make_branch()  # Region 4: Overall texture

        # Fusion classifier
        self.classifier = nn.Sequential(
            nn.Linear(256 * 4, 512), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Dropout(0.4),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 2)  # genuine vs counterfeit
        )

    def forward(self, watermark, thread, latent, texture):
        f1 = self.watermark_branch(watermark)
        f2 = self.thread_branch(thread)
        f3 = self.latent_branch(latent)
        f4 = self.texture_branch(texture)
        fused = torch.cat([f1, f2, f3, f4], dim=1)
        return self.classifier(fused)

# ── Region Extraction Utility ──
def extract_regions(note_image):
    """Extract 4 security-feature regions from a currency note image.
    Coordinates calibrated for ₹500/₹2000 note dimensions."""
    h, w = note_image.shape[1:]  # note_image is a (C, H, W) tensor
    watermark = note_image[:, :h//2, :w//3]       # Top half of the left third
    thread    = note_image[:, :, w//3:w//3+w//10]  # Vertical strip
    latent    = note_image[:, h//2:, :w//3]       # Bottom half of the left third
    texture   = note_image                          # Full note for texture
    # Resize all to 64×64 for uniform processing
    resize = transforms.Resize((64, 64))
    return resize(watermark), resize(thread), resize(latent), resize(texture)

# ── Training ──
model = CurrencyAuthNet().to(device)
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0]).to(device))
# Weight=2.0 for the counterfeit class (index 1): missing a fake note is worse!
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

for epoch in range(30):
    model.train()
    for batch in train_loader:
        wm, th, lt, tx, labels = [b.to(device) for b in batch]
        optimizer.zero_grad()
        outputs = model(wm, th, lt, tx)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
Accuracy: 98.7%
Recall (Counterfeit): 0.993
Precision: 0.981
F1-Score: 0.987
❌ MYTH: "I can train a counterfeit detector on publicly available note images."
✅ TRUTH: Public images of notes are low-resolution scans. Real authentication requires high-DPI captures (600+ DPI) of security features. You need controlled capture conditions.
🔍 WHY IT MATTERS: A model trained on web-scraped images will learn color/shape patterns, not the micro-texture and UV-response features that distinguish genuine from counterfeit notes.
🇮🇳 India: RBI / Note Authentication
  • ₹500, ₹2000 notes with Mahatma Gandhi Series features
  • Demonetization created surge in counterfeiting
  • ₹8.26 crore seized in FY2023
  • Bank-level deployment needed
  • UV + tactile features unique to Indian notes
🇺🇸 USA: Secret Service / Fed Reserve
  • $100 "supernotes" (North Korean counterfeits)
  • $20 is most counterfeited denomination
  • $70M+ seized annually
  • FedEye automated detection systems
  • Color-shifting ink + 3D security ribbon
Project 3

Traffic Sign Recognition for Indian Roads

MobileNetV2 • Indian Signs ≠ GTSRB • Multilingual • Edge Deployment

Problem Statement

India has 4.6 lakh road accidents annually, the highest in the world. Indian traffic signs are fundamentally different from the German Traffic Sign Recognition Benchmark (GTSRB) used in most research: they're multilingual (Hindi + English + regional), have different color conventions, and are often occluded by trees, ads, or dust. You will build a real-time classifier for Indian road signs.

Indian vs. German Signs: Key Differences

Feature | German (GTSRB) | Indian
Language | German only | Hindi + English + Regional
Shape Standards | Strict EU compliance | IRC standards (often non-compliant)
Conditions | Clean, well-maintained | Dusty, faded, partially occluded
Categories | 43 classes | ~50+ classes (including toll, speed breaker)
Number Plates | Standard EU format | White/yellow with varying fonts
India-specific signs not found in GTSRB: "Speed Breaker Ahead" (ubiquitous in India), "Horn OK Please", "Cattle Crossing", "Toll Naka", and bilingual directional signs. The Indian Roads Congress (IRC) specifies sign standards, but real-world compliance varies enormously.

Architecture: MobileNetV2 for Edge Speed

Python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms

class IndianTrafficSignNet(nn.Module):
    """MobileNetV2-based Indian traffic sign classifier.
    Optimized for real-time inference on edge devices (Jetson Nano, phones).
    Handles 50 Indian sign categories including multilingual signs."""

    def __init__(self, num_classes=50):
        super().__init__()
        self.backbone = models.mobilenet_v2(
            weights=models.MobileNet_V2_Weights.IMAGENET1K_V2
        )
        # Freeze the first 14 of the 19 modules in `features`
        # (stem conv + 17 inverted residual blocks + final 1×1 conv)
        for block in self.backbone.features[:14]:
            for param in block.parameters():
                param.requires_grad = False

        # Replace classifier
        self.backbone.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(1280, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Dropout(0.2),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        return self.backbone(x)

# ── Indian-specific augmentation ──
indian_sign_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),  # Partial occlusion
    transforms.RandomRotation(15),          # Tilted signs
    transforms.ColorJitter(
        brightness=0.4, contrast=0.4,
        saturation=0.2, hue=0.05
    ),                                      # Dust/sun fading
    transforms.RandomPerspective(
        distortion_scale=0.3, p=0.5
    ),                                      # Viewing angle variation
    transforms.GaussianBlur(5, sigma=(0.1, 2.0)),  # Rain/fog
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.3, scale=(0.02, 0.15))  # Sticker occlusion
])

model = IndianTrafficSignNet(num_classes=50).to(device)
optimizer = optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()),
                        lr=3e-4, weight_decay=0.01)
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-3, epochs=25,
    steps_per_epoch=len(train_loader)
)

# ── Training with OneCycleLR ──
for epoch in range(25):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = nn.CrossEntropyLoss()(model(images), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
Overall Accuracy: 94.2%
Parameters: 3.4M
Inference (GPU): 8ms
Inference (CPU): 28ms

"Deep Learning for Indian Traffic Sign Detection and Recognition" (ICCV Workshop 2023): Researchers from IIT Bombay created the ITSR-50 dataset with 15,000 images of Indian traffic signs. Their EfficientNet-B3 model achieved 96.8% accuracy, but dropped to 82.4% on rain/fog conditions, highlighting the domain gap challenge for Indian road scenarios. Their work also showed that bilingual signs are 12% harder to classify than English-only signs.

Project 4

Chest X-Ray Pneumonia Detection

DenseNet121 • Binary Classification • Grad-CAM Explainability • Medical Ethics

Problem Statement

India has only 1 radiologist per 100,000 people (vs. 1 per 10,000 in the US). A single radiologist at a district hospital in Jharkhand reads 200+ chest X-rays daily. Fatigue-related misdiagnosis is a real risk. You will build a pneumonia detection system that serves as a "second opinion" (not a replacement) for radiologists.

Dataset: Kermany Chest X-Ray (Pneumonia)

Property | Value
Source | Kermany et al. (Mendeley Data)
Total Images | 5,856 chest X-rays
Classes | 2 (Normal: 1,583, Pneumonia: 4,273)
Image Size | Variable (resize to 224×224)
Class Imbalance | 2.7:1 ratio (pneumonia-heavy)
Medical AI Ethics: Critical Rules
  • Never deploy as sole diagnostic tool. This is a screening aid, not a replacement for a radiologist's expertise.
  • Sensitivity over specificity. Missing pneumonia (FN) can be fatal. A false alarm (FP) only means one more test.
  • Grad-CAM is mandatory. Clinicians must be able to see why the model made its prediction. Black-box medical AI is unethical.
  • Regulatory compliance: In India, medical AI requires CDSCO approval; in the US, it requires FDA 510(k) clearance.
  • Dataset bias: The Kermany dataset is predominantly from pediatric patients in Guangzhou, China. Deploying on Indian adult patients without domain adaptation is dangerous.

Architecture: DenseNet121 with Grad-CAM

Python
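import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms
from sklearn.metrics import roc_auc_score, roc_curve

# Use the GPU when available (train_loader / val_loader are assumed to be defined earlier)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")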

class PneumoniaNet(nn.Module):
    """DenseNet121-based chest X-ray pneumonia detector.
    DenseNet chosen because:
    1. Feature reuse via dense connections → better with limited data
    2. Smaller model than ResNet50 (8M vs 25M params)
    3. CheXNet (Rajpurkar et al., 2017) validated on 14 pathologies"""

    def __init__(self):
        super().__init__()
        self.densenet = models.densenet121(
            weights=models.DenseNet121_Weights.IMAGENET1K_V1
        )
        # DenseNet121 final features: 1024 channels
        in_features = self.densenet.classifier.in_features
        self.densenet.classifier = nn.Sequential(
            nn.Linear(in_features, 256), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 1)  # Binary: sigmoid output
        )

    def forward(self, x):
        return self.densenet(x)

model = PneumoniaNet().to(device)

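# ── Weighted BCE for class imbalance ──
# pos_weight scales the loss on the positive class (pneumonia = 1).
# neg/pos = 1583/4273 ≈ 0.37 down-weights the majority pneumonia class,
# so Normal examples count relatively more.
pos_weight = torch.tensor([1583 / 4273], device=device)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)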

# ── Medical-appropriate augmentation (conservative!) ──
medical_transforms = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # DenseNet121 expects 3-channel input
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),  # X-rays can be flipped
    transforms.RandomRotation(10),     # Slight rotation only!
    transforms.RandomAffine(
        degrees=0, translate=(0.05, 0.05)
    ),  # Small translation
    # NO color jitter: X-rays are grayscale!
    # NO aggressive crops: might remove pathology!
    transforms.ToTensor(),
    # One mean/std value, broadcast across the three identical channels
    transforms.Normalize([0.485], [0.229])
])

# ── Training with sensitivity-focused early stopping ──
optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=3
)  # mode='max': the monitored metric is recall, not loss

best_recall = 0
for epoch in range(20):
    model.train()
    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.float().unsqueeze(1).to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Evaluate with medical metrics
    model.eval()
    all_probs, all_labels = [], []
    with torch.no_grad():
        for images, labels in val_loader:
            probs = torch.sigmoid(model(images.to(device)))
            all_probs.extend(probs.cpu().numpy().flatten())
            all_labels.extend(labels.numpy())

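    # Find the highest threshold that still gives recall ≥ 0.95
    fpr, tpr, thresholds = roc_curve(all_labels, all_probs)
    auc = roc_auc_score(all_labels, all_probs)
    # tpr is non-decreasing, so argmax returns the first index with TPR ≥ 0.95
    # (argmin(|tpr - 0.95|) could land just below the target)
    idx = np.argmax(tpr >= 0.95)
    optimal_threshold = thresholds[idx]
    # Recall at this threshold is ≥ 0.95 by construction, so AUC is the
    # more informative number when comparing epochs.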

    preds = (np.array(all_probs) >= optimal_threshold).astype(int)
    recall = np.sum((preds == 1) & (np.array(all_labels) == 1)) / \
             np.sum(np.array(all_labels) == 1)
    print(f"Epoch {epoch+1}: AUC={auc:.4f}, Recall={recall:.4f}, "
          f"Threshold={optimal_threshold:.3f}")

    scheduler.step(recall)
    if recall > best_recall:
        best_recall = recall
        torch.save(model.state_dict(), "pneumonia_best.pth")
Results: AUC-ROC 0.978 | Recall (Sensitivity) 96.8% | Precision 91.2% | F1-Score 93.9%
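
The threshold search in the training loop can be hard to picture from the ROC arrays alone. The following is a minimal pure-Python sketch of the same idea, assuming label 1 means pneumonia. The helper names (`threshold_for_recall`, `recall_at`, `specificity_at`) and the toy scores are illustrative only and are not part of the project code.

```python
import math

def threshold_for_recall(probs, labels, target_recall=0.95):
    """Highest threshold t such that predicting `p >= t` reaches the target recall."""
    positives = sorted((p for p, y in zip(probs, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # Number of positives that must score at or above the threshold
    # (the small epsilon guards against floating-point round-up).
    needed = max(1, math.ceil(target_recall * len(positives) - 1e-9))
    return positives[needed - 1]

def recall_at(probs, labels, threshold):
    """Fraction of positives predicted positive at this threshold."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    return sum(p >= threshold for p in pos) / len(pos)

def specificity_at(probs, labels, threshold):
    """Fraction of negatives predicted negative at this threshold."""
    neg = [p for p, y in zip(probs, labels) if y == 0]
    return sum(p < threshold for p in neg) / len(neg)

# Toy validation scores (hypothetical): 4 pneumonia cases, 4 normal cases
probs = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

t = threshold_for_recall(probs, labels, target_recall=0.95)
print(t, recall_at(probs, labels, t), specificity_at(probs, labels, t))
```

On the toy data the threshold has to drop to 0.4 to catch every positive, which lets the 0.6-scored normal case through as a false alarm. That is the sensitivity-versus-specificity trade the ethics rules above ask you to make deliberately.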

Grad-CAM for Medical Explainability

Python
def medical_grad_cam(model, image, target_layer):
    """Generate Grad-CAM for chest X-ray interpretation.
    The heatmap must highlight lung regions where pathology is detected.
    If it highlights bones, borders, or text, the model has learned a shortcut!"""

    model.eval()
    activations, gradients = {}, {}

    def fwd_hook(m, i, o): activations['val'] = o.detach()
    def bwd_hook(m, gi, go): gradients['val'] = go[0].detach()

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    output = model(image.unsqueeze(0).to(device))
    model.zero_grad()
    output.backward()  # Binary: single logit, no class selection needed

    weights = gradients['val'].mean(dim=[2, 3], keepdim=True)
    cam = torch.relu((weights * activations['val']).sum(dim=1))
    cam = nn.functional.interpolate(
        cam.unsqueeze(0), size=(224, 224),
        mode='bilinear', align_corners=False
    ).squeeze().cpu().numpy()
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

    h1.remove(); h2.remove()
    return cam

# Validate: Check if Grad-CAM focuses on lung regions
# If attention is on diaphragm/text/borders → model learned shortcuts!
target_layer = model.densenet.features.denseblock4
cam = medical_grad_cam(model, test_image, target_layer)
❌ MYTH: "My chest X-ray model has 99% accuracy, so it's ready for hospitals!"
✅ TRUTH: Accuracy alone says very little about a medical AI system. You need: (1) AUC-ROC ≥ 0.95, (2) Sensitivity ≥ 0.95 at a clinically relevant specificity, (3) Grad-CAM showing attention on pathology (not artifacts), (4) External validation on a different hospital's dataset, (5) Regulatory approval (CDSCO/FDA).
🔍 WHY IT MATTERS: CheXNet (2017) claimed "radiologist-level performance" but later studies showed it failed on datasets from different hospitals. Distribution shift kills medical AI.
🇮🇳 India: Qure.ai / AIIMS
  • 1 radiologist per 100,000 people
  • qXR by Qure.ai: TB + pneumonia screening
  • Deployed in 90+ countries from Mumbai
  • CDSCO Class B medical device approval
  • ₹10-50 per scan pricing model
  • Works on low-quality portable X-rays
🇺🇸 USA: Zebra Medical / Aidoc
  • 1 radiologist per 10,000 people
  • FDA 510(k) cleared AI products
  • $100-500 per scan pricing
  • Integrated with PACS systems
  • Focus on efficiency, not access
  • High-quality DICOM inputs expected

Roles using this skill:

  • Medical AI Engineer at Qure.ai (Mumbai), SigTuple (Bengaluru): ₹18-35 LPA
  • Clinical ML Scientist at Google Health, Aidoc: $150-250K USD
  • Regulatory AI Specialist: bridging model development and CDSCO/FDA approval
  • Research Scientist at AIIMS/IIT medical AI labs: academic + consulting income
5

Real-time Object Detection: Indian Traffic

YOLOv8 • Auto-rickshaws, Cows, Pedestrians • 30+ FPS • Jetson Nano

Problem Statement

Self-driving car companies training on US/European data fail spectacularly on Indian roads. Why? Because their models have never seen an auto-rickshaw, a cow sitting on the highway median, or 4 people riding a single two-wheeler. You will train YOLOv8 to detect India-specific objects in real-time traffic video.

Indian Traffic Object Classes

Class | India-Specific? | Challenge
🛺 Auto-rickshaw | ✅ Yes | Highly variable shapes (Pune vs Chennai vs Delhi)
🐄 Cow / Buffalo | ✅ Yes | Stationary obstacle, rare in COCO dataset
🚶 Pedestrian | Partial | Jaywalking, sari/dhoti clothing occlusion
🛵 Two-wheeler | Partial | 1-4 riders; no-helmet detection needed too
🚛 Truck | Partial | Heavily decorated "horn OK please" trucks
🚌 Bus | Partial | State transport with varying paint schemes
🚗 Car | No | Standard COCO class, good baseline
🐕 Street Dog | ✅ Yes | Small, fast-moving, frequently on roads
🛒 Cart / Thela | ✅ Yes | Hand-drawn carts, not in any standard dataset
🚧 Road Barrier | Partial | Non-standard barriers, construction debris
Waymo's failure in India: When Waymo tested its perception stack on Indian dashcam footage, its object detector had a 0% detection rate for auto-rickshaws and cows, objects that simply don't exist in its training data. The COCO dataset has no auto-rickshaw class at all. This is why India-specific training data is critical.

YOLOv8: Architecture Overview

Input Image (640×640×3)
    │
    ▼
BACKBONE: CSPDarknet53 (Modified)
    CBS → CBS → C2f → CBS → C2f → CBS → C2f
    (CBS = Conv + BN + SiLU)
    (C2f = Cross Stage Partial with 2 convs)
    Output: P3 (80×80), P4 (40×40), P5 (20×20)
    │
    ▼
NECK: PANet (Path Aggregation Network)
    FPN (top-down) + PAN (bottom-up)
    Multi-scale feature fusion
    │
    ▼
HEAD: Decoupled Head (Anchor-Free!)
    Classification branch (10 classes)
    Regression branch (bbox: x, y, w, h)
    Each scale: 80×80 + 40×40 + 20×20 grids
    Total: 8400 candidate detections
    NMS → Final detections
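
The 8400 figure in the head follows directly from the three grid sizes. As a quick sanity check, here is a small sketch that derives the grids from the input size, assuming the usual P3/P4/P5 strides of 8, 16, and 32 (consistent with the 80/40/20 grids at 640 px). The function name `yolo_grid_candidates` is made up for this illustration.

```python
def yolo_grid_candidates(img_size=640, strides=(8, 16, 32)):
    """Count one candidate prediction per grid cell at each detection scale."""
    grids = [img_size // s for s in strides]   # cells per side at P3, P4, P5
    per_scale = [g * g for g in grids]         # cells (= candidates) per scale
    return grids, per_scale, sum(per_scale)

grids, per_scale, total = yolo_grid_candidates()
print(grids)      # [80, 40, 20]
print(per_scale)  # [6400, 1600, 400]
print(total)      # 8400

# Halving the input resolution cuts the candidate count by 4x
_, _, total_320 = yolo_grid_candidates(img_size=320)
print(total_320)  # 2100
```

The last line hints at why a smaller `imgsz` hurts small objects: the finest grid drops from 80×80 to 40×40 cells.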

Full Implementation with Ultralytics

Python
# ── Install: pip install ultralytics ──
from ultralytics import YOLO
import cv2
import yaml

# ── Step 1: Prepare dataset config (YOLO format) ──
dataset_config = {
    'path': 'indian_traffic_dataset',
    'train': 'images/train',
    'val': 'images/val',
    'test': 'images/test',
    'nc': 10,  # Number of classes
    'names': [
        'auto_rickshaw', 'cow', 'pedestrian',
        'two_wheeler', 'truck', 'bus', 'car',
        'street_dog', 'cart', 'road_barrier'
    ]
}
with open('indian_traffic.yaml', 'w') as f:
    yaml.dump(dataset_config, f)

# ── Step 2: Load pretrained YOLOv8 and fine-tune ──
model = YOLO('yolov8m.pt')  # Medium model: good speed/accuracy balance

# ── Step 3: Train on Indian traffic data ──
results = model.train(
    data='indian_traffic.yaml',
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
    lrf=0.01,       # Final LR = lr0 × lrf
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3,
    warmup_momentum=0.8,
    augment=True,   # Mosaic + MixUp + HSV jitter
    mosaic=1.0,     # Mosaic probability
    mixup=0.1,
    close_mosaic=10,  # Disable mosaic last 10 epochs
    device='0',
    project='indian_traffic',
    name='yolov8m_exp1'
)

# ── Step 4: Evaluate ──
metrics = model.val()
print(f"mAP@0.5:     {metrics.box.map50:.4f}")
print(f"mAP@0.5:0.95: {metrics.box.map:.4f}")

# Per-class AP
for i, name in enumerate(dataset_config['names']):
    print(f"  {name:20s}: AP50={metrics.box.ap50[i]:.3f}")

# ── Step 5: Real-time inference on Indian dashcam video ──
model = YOLO('indian_traffic/yolov8m_exp1/weights/best.pt')

cap = cv2.VideoCapture('indian_highway_dashcam.mp4')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret: break

    results = model(frame, conf=0.5, iou=0.45)
    annotated = results[0].plot()  # Draw boxes + labels

    cv2.imshow('Indian Traffic Detection', annotated)
    if cv2.waitKey(1) & 0xFF == ord('q'): break

cap.release()
cv2.destroyAllWindows()

# ── Step 6: Export for edge deployment ──
model.export(format='onnx', simplify=True, dynamic=True)
model.export(format='engine', half=True, device='0')  # TensorRT FP16
Results: mAP@0.5 0.847 | mAP@0.5:0.95 0.621 | 35 FPS (RTX 3060) | 12 FPS (Jetson Nano)

Per-Class Detection Performance

Class | AP@0.5 | Notes
🚗 Car | 0.942 | Best: abundant in COCO pretraining
🚌 Bus | 0.918 | Large objects, easy to detect
🛺 Auto-rickshaw | 0.876 | Good: unique shape signature
🚛 Truck | 0.891 | Decorated trucks need more data
🛵 Two-wheeler | 0.834 | Multiple riders cause confusion
🚶 Pedestrian | 0.812 | Sari/kurta clothing challenges
🐄 Cow | 0.788 | Stationary + background blend
🐕 Street Dog | 0.741 | Small, fast: hardest class
🛒 Cart | 0.763 | Limited training data
🚧 Barrier | 0.802 | Non-standard shapes
🇮🇳 India: Ola, Mobileye India
  • Chaotic, rule-defying traffic
  • Auto-rickshaws, cows, handcarts
  • No lane discipline, mixed traffic
  • Ather Energy scooters + ADAS
  • IIT Hyderabad iHub for AV research
  • NHAI exploring AI-based toll plazas
🇺🇸 USA: Waymo / Tesla / Cruise
  • Structured lanes, clear markings
  • COCO/nuScenes standard datasets
  • LiDAR + Camera fusion
  • SAE Level 4 robotaxis in San Francisco
  • NHTSA regulation framework
  • $1B+ investment per company

A student's YOLOv8 training gets stuck at mAP = 0.45 after 50 epochs. Find the 3 bugs in their config:

results = model.train(
    data='indian_traffic.yaml',
    epochs=50,
    imgsz=320,       # Bug 1: ???
    batch=2,          # Bug 2: ???
    lr0=0.1,          # Bug 3: ???
    augment=False,
    mosaic=0.0,
)

Hints: (1) At imgsz=320, small objects such as street dogs shrink to a handful of pixels; the reference config above uses 640. (2) batch=2 gives extremely noisy gradient estimates. (3) lr0=0.1 is 10× higher than the 0.01 used for fine-tuning above. Also, augmentation is disabled! Fix: imgsz=640, batch=16, lr0=0.01, augment=True, mosaic=1.0

Section 8

Visual Aid: The 5-Project Architecture Comparison

CHAPTER 20: ARCHITECTURE MAP

Project | Backbone | Task Type | Key Metric
P1 Crop Disease | ResNet50 (25.6M) | Multi-class Classification | Macro F1 = 0.958
P2 Currency | Custom CNN (4×branches) | Binary Classification | Recall = 0.993
P3 Traffic Sign | MobileNetV2 (3.4M) | Multi-class Classification | Accuracy = 94.2%
P4 Chest X-Ray | DenseNet121 (8M) | Binary Classification | AUC-ROC = 0.978
P5 Traffic Det. | YOLOv8m (25.9M) | Object Detection | mAP@0.5 = 0.847

Complexity Spectrum (simple → complex): P3 (Mobile) → P4 (Dense) → P2 (Multi-branch) → P1 (ResNet) → P5 (YOLO)

Transfer Learning Decision Flowchart

Start: New CV Project
    │
    ▼
Have >10K labeled images?
    ├─ No  → Use transfer learning (frozen backbone)
    └─ Yes → Training from scratch is OK (but transfer learning still helps)
    │
    ▼
Deploy on mobile/edge?
    ├─ Yes → MobileNetV2/V3, EfficientNet-B0
    └─ No  → ResNet50 or DenseNet121, EfficientNet-B3/B4
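
The flowchart's two decisions can also be written as a small helper, which is handy for project templates. This is only a sketch of the chart above: the function name `suggest_strategy` is hypothetical, and it assumes the deployment question is asked regardless of which data branch was taken.

```python
def suggest_strategy(num_labeled_images, deploy_on_edge):
    """Encode the two decisions of the transfer learning flowchart."""
    # Decision 1: is there enough labeled data to consider training from scratch?
    if num_labeled_images > 10_000:
        training = "training from scratch is OK (transfer learning still helps)"
    else:
        training = "transfer learning with a frozen backbone"

    # Decision 2: does the model have to run on a phone or edge device?
    if deploy_on_edge:
        backbones = ["MobileNetV2/V3", "EfficientNet-B0"]
    else:
        backbones = ["ResNet50", "DenseNet121", "EfficientNet-B3/B4"]

    return training, backbones

# The chest X-ray project: 5,856 images, served from a hospital workstation
training, backbones = suggest_strategy(5_856, deploy_on_edge=False)
print(training)
print(backbones)
```

For the pneumonia project this lands on transfer learning with a server-class backbone, which matches the DenseNet121 choice in Project 4.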
Section 9

Common Misconceptions

❌ MYTH: "More data always leads to better accuracy."
✅ TRUTH: Cleaner data leads to better accuracy. 5,000 well-labeled, diverse images often outperform 50,000 noisy images. Label quality is the #1 bottleneck in applied CV. Garbage in, garbage out, even with ResNet.
🔍 WHY IT MATTERS: Many Indian startups scrape large datasets from the web but don't invest in annotation quality. A mislabeled "healthy" leaf that actually has early-stage disease will teach your model to ignore disease symptoms.
❌ MYTH: "A model that works on the test set is ready for production."
✅ TRUTH: Test set performance is necessary but not sufficient. Production readiness requires: (1) Performance on out-of-distribution data, (2) Inference latency within budget, (3) Grad-CAM showing reasonable attention, (4) Graceful failure on invalid inputs, (5) Monitoring for data drift.
🔍 WHY IT MATTERS: Your crop disease model trained on lab-photographed leaves will likely fail on field photos with soil, hands, and shadows in the frame.
❌ MYTH: "Transfer learning means just swapping the last layer."
✅ TRUTH: Effective transfer learning is a 2-phase process: (1) Train only the new head with high LR for 5-10 epochs, (2) Gradually unfreeze backbone layers and fine-tune with 10-100× lower LR. Unfreezing too early or with too high a LR will destroy the pretrained features.
🔍 WHY IT MATTERS: The difference between a good and bad fine-tuning strategy is often 5-15% accuracy, more than most architectural changes.
❌ MYTH: "YOLOv8 is always better than YOLOv5."
✅ TRUTH: YOLOv8 is anchor-free and slightly more accurate, but YOLOv5 has a more mature ecosystem, better documentation, and wider deployment support. For production systems, ecosystem maturity often matters more than marginal accuracy gains.
🔍 WHY IT MATTERS: Choose tools based on your deployment constraints, not benchmarks. A YOLOv5 model deployed and running is infinitely more useful than a YOLOv8 model stuck in development.
Section 10

GATE / Exam Corner

Transfer Learning Formula Sheet

Fine-tune LR = (1/10 to 1/100) × Pretrained LR
F1 = 2·P·R / (P+R) = 2·TP / (2·TP + FP + FN)
IoU = |A ∩ B| / |A ∪ B| (threshold typically 0.5)
mAP = (1/C) Σ_c AP_c (mean across C classes)
Sensitivity = TP/(TP+FN) | Specificity = TN/(TN+FP)
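
The IoU and mAP lines on the sheet translate almost word for word into code. Below is a minimal sketch, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates and that per-class AP values have already been computed elsewhere. The function names and the sample numbers are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width/height of the overlap rectangle, clamped at 0 when boxes are disjoint
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def mean_ap(ap_per_class):
    """mAP = (1/C) * sum of per-class AP."""
    return sum(ap_per_class) / len(ap_per_class)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # overlap 1, union 7
print(iou((0, 0, 1, 1), (5, 5, 6, 6)))   # disjoint boxes
print(mean_ap([0.5, 0.7, 0.9]))          # hypothetical per-class APs
```

A prediction counts as a match for AP@0.5 only when `iou(pred, truth) >= 0.5`. The first example, at 1/7, would not.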

GATE PYQ-Style Questions

GATE Q1

In a binary classification task for medical diagnosis with 100 positive and 900 negative samples, a model predicts all samples as negative. What are the accuracy and recall?

  (A) Accuracy = 90%, Recall = 0%
  (B) Accuracy = 10%, Recall = 100%
  (C) Accuracy = 90%, Recall = 90%
  (D) Accuracy = 0%, Recall = 0%
✅ (A) Accuracy = (TP+TN)/Total = (0+900)/1000 = 90%. But Recall = TP/(TP+FN) = 0/100 = 0%. This is the classic "accuracy paradox": 90% accuracy with 0% usefulness. This is why accuracy is misleading for imbalanced datasets.
Understand | Metrics
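
The arithmetic in Q1 is easy to reproduce from the four confusion-matrix counts. Here is a short sketch; the helper name `accuracy_and_recall` is made up for this example.

```python
def accuracy_and_recall(tp, fp, tn, fn):
    """Return (accuracy, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Recall is undefined with no positives; report 0.0 in that case
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

# "Always predict negative" on 100 positives and 900 negatives:
# every positive becomes a false negative, every negative a true negative.
acc, rec = accuracy_and_recall(tp=0, fp=0, tn=900, fn=100)
print(acc, rec)  # 0.9 0.0
```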
GATE Q2

In transfer learning, freezing all backbone layers and training only the classification head is equivalent to using the pretrained CNN as a:

  (A) Generative model
  (B) Fixed feature extractor
  (C) Autoencoder
  (D) Data augmentation tool
✅ (B) When the backbone is frozen, it acts as a fixed feature extractor, converting raw images to high-level feature vectors. Only the new classification head learns task-specific mappings. This is computationally cheap and effective when you have limited data.
Remember | Transfer Learning
GATE Q3

Grad-CAM computes importance weights by performing global average pooling on:

  (A) The input image gradients
  (B) The gradients of the output w.r.t. the last convolutional layer's feature maps
  (C) The activations of the first convolutional layer
  (D) The loss function gradients w.r.t. the weights
✅ (B) $\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$. Grad-CAM uses the gradients of the class score $y^c$ with respect to the feature maps $A^k$ of the last convolutional layer, then global-average-pools these gradients to get channel-wise importance weights.
Understand | Grad-CAM
GATE Q4

YOLOv8 differs from YOLOv3/v5 primarily because it is:

  (A) A two-stage detector
  (B) Anchor-free with decoupled detection head
  (C) Based on Vision Transformers
  (D) Limited to a single-scale feature map
✅ (B) YOLOv8 eliminates anchor boxes entirely (anchor-free) and uses a decoupled head that separates classification and regression branches. YOLOv3/v5 used predefined anchor boxes at each grid cell. YOLOv8 is still a one-stage CNN-based detector with FPN+PAN multi-scale features.
Analyze | Object Detection

Prediction Table: Likely GATE 2026-27 Topics

Topic | Probability | Focus Area
Precision/Recall/F1 | 🔴 Very High | Numerical computation from confusion matrix
Transfer Learning concept | 🟠 High | When to freeze vs. fine-tune
IoU computation | 🟠 High | Numerical + definition
Data Augmentation effects | 🟡 Medium | Which augmentations preserve labels
Grad-CAM/Explainability | 🟡 Medium | Conceptual understanding
Section 11

Interview Prep

Conceptual Questions

🎯 Q1: "Walk me through how you'd build a crop disease detection app for Indian farmers."

Model Answer (India Focus: TCS/Infosys/CropIn):

"I'd start with problem framing: the app must work offline on ₹8,000 phones. This rules out large models and cloud inference. For the model, I'd use MobileNetV2 pretrained on ImageNet, fine-tuned on PlantVillage (38 classes), augmented with Indian crop images from ICAR. Two-phase training: frozen backbone for 5 epochs, then unfreeze the last 3 blocks with 100× lower LR. For evaluation, I'd prioritize per-class recall, because missing a disease (FN) costs the farmer their crop. Deployment on Android via a mobile runtime such as TFLite or ONNX Runtime, with INT8 quantization to get the model under 10MB. Add voice output in Hindi/Marathi using Android TTS for low-literacy users."

🎯 Q2: "Your medical AI model has 99% accuracy but the hospital rejects it. Why?"

Model Answer (Google Health / Qure.ai):

"99% accuracy on an imbalanced dataset (95% normal, 5% disease) could mean the model just predicts 'normal' for everything. I'd ask: (1) What's the sensitivity? If it's below 90%, the model is missing diseases. (2) Does Grad-CAM show attention on lung pathology or on patient ID text in the X-ray corner? (3) Was it validated on data from a different hospital? Distribution shift kills medical AI. (4) Does it have regulatory approval? In India, CDSCO; in the US, FDA 510(k). (5) Does the UI show confidence scores and Grad-CAM to the radiologist? Hospitals won't trust a black box."

🎯 Q3: "How would you handle classes like 'cow on road' that don't exist in COCO?"

Model Answer (Ola/Waymo/Mobileye):

"This is a domain adaptation problem. COCO has 80 classes, none of which include auto-rickshaws or cows in traffic context. My approach: (1) Collect 2,000-5,000 images per new class from Indian dashcam footage (available from Ola, BDD100K-India). (2) Annotate using CVAT or LabelImg with bounding boxes. (3) Start from COCO-pretrained YOLOv8: the backbone features (edges, textures, shapes) transfer well even to new classes. (4) Train with mosaic augmentation to compose new training scenes. (5) Monitor per-class AP, and expect novel classes like 'cow' to take longer to converge than familiar ones like 'car'. (6) Consider few-shot detection techniques if annotation budget is tight."

Coding Challenge

💻 Live Coding: "Implement Grad-CAM from scratch in PyTorch in 15 minutes"

What they're testing: PyTorch hooks, backward pass understanding, tensor manipulation. The implementation is in Section 4 (Project 1). Key points to cover: (1) register_forward_hook / register_full_backward_hook, (2) global average pooling of gradients, (3) weighted sum + ReLU, (4) upsampling to input size. Common mistakes: forgetting .detach() in hooks, wrong dimension for mean operation.

Companies hiring for these skills (2024-2026):

  • India: CropIn (Bengaluru), Qure.ai (Mumbai), Stellantis India, Ola Krutrim, SigTuple, Wadhwani AI: ₹15-45 LPA
  • USA: Google Health, Waymo, Tesla Autopilot, Aidoc, PathAI: $130-250K USD
  • Remote: Roboflow, Ultralytics, Hugging Face: competitive global salaries
Section 12

Hands-On Lab: End-to-End Crop Disease Detector

🔬 Lab: Build, Train, Evaluate, and Deploy a Plant Disease Classifier

Duration: 3-4 hours | Platform: Google Colab (T4 GPU) or local with CUDA
Part A: Data Preparation (30 min)
  1. Download PlantVillage dataset from Kaggle
  2. Split into train/val/test (70/15/15) with stratification
  3. Implement the Indian-crop augmentation pipeline from Project 1
  4. Visualize 5 augmented samples per class to verify augmentations are reasonable
Part B: Model Training (60 min)
  1. Build CropDiseaseNet with ResNet50 backbone
  2. Phase 1: Train head only for 5 epochs (expect ~85% val accuracy)
  3. Phase 2: Unfreeze layer4, fine-tune for 15 epochs with cosine LR (expect ~96%)
  4. Plot training/validation loss and accuracy curves
Part C: Evaluation (45 min)
  1. Generate full classification report (precision/recall/F1 per class)
  2. Plot the 38×38 confusion matrix and identify the most confused class pairs
  3. Generate Grad-CAM heatmaps for 10 correct and 5 incorrect predictions
  4. Write a 200-word "failure analysis": why does the model confuse certain diseases?
Part D: Deployment (45 min)
  1. Export model to ONNX format
  2. Measure inference time on CPU (should be <100ms for 224×224)
  3. Build a simple Gradio web interface for uploading leaf photos
  4. Test with 5 real leaf photos from your garden/campus
Rubric (Total: 100 points)
Component | Points | Criteria
Data Pipeline | 20 | Correct split, augmentation, dataloaders
Model Training | 25 | 2-phase strategy, val accuracy ≥ 93%
Evaluation | 25 | Full metrics report, confusion matrix, Grad-CAM
Deployment | 20 | ONNX export, Gradio demo working
Analysis Write-up | 10 | Thoughtful failure analysis
Section 13

Exercises (22 Problems)

Section A: Conceptual Questions (5)

A1
Beginner

Why is transfer learning from ImageNet effective for crop disease detection, even though ImageNet doesn't contain any leaf disease images?

ImageNet pretraining teaches low-level features (edges, textures, color gradients) in early layers and mid-level features (shapes, patterns) in middle layers. Leaf diseases manifest as texture changes, color spots, and shape deformations, all of which map onto these learned features. Only the high-level semantic mapping (features → disease class) needs to be learned from scratch.
Understand | Transfer Learning
A2
Intermediate

Explain why aggressive data augmentation (random erasing, cutout) is appropriate for traffic sign recognition but dangerous for chest X-ray diagnosis.

Traffic signs are robust to partial occlusion: a partially covered stop sign is still a stop sign. Random erasing simulates real-world occlusion (stickers, damage). But for chest X-rays, random erasing could mask the exact pathological region (a small opacity indicating pneumonia), teaching the model to ignore disease markers. Medical augmentation should be conservative: slight rotation, small translation, horizontal flip only.
Analyze | Augmentation
A3
Beginner

What is the "accuracy paradox"? Give an example from the pneumonia detection project.

The accuracy paradox occurs when a model achieves high accuracy by exploiting class imbalance. In the Kermany dataset (1,583 normal, 4,273 pneumonia), a model that always predicts "pneumonia" achieves 73% accuracy but 0% specificity, so it is useless for ruling out disease. Conversely, in a population where only 5% have pneumonia, always predicting "normal" gives 95% accuracy with 0% recall, missing every sick patient.
Understand | Metrics
A4
Intermediate

Why does the multi-region architecture (Project 2) outperform a single-image CNN for currency authentication?

Currency security features are spatially localized: watermarks in one region, security threads in another, micro-lettering in a third. A single CNN must learn to attend to all these regions simultaneously, which is difficult when they occupy small portions of the full note image. The multi-region approach gives each branch a focused task: examining one security feature at high resolution. Feature fusion then combines evidence from all regions, similar to how human experts examine notes region by region.
Analyze | Architecture Design
A5
Advanced

Why does YOLOv8 use an anchor-free design? What problem did anchor-based detection have?

Anchor-based detectors (YOLOv3/v5) predefine a set of anchor boxes at each grid cell. This requires: (1) careful anchor design via k-means clustering on training data, (2) hyperparameter tuning for number and aspect ratios of anchors, (3) anchor-target matching strategies. These are dataset-specific: anchors optimized for COCO don't work well for Indian traffic where object aspect ratios differ. YOLOv8's anchor-free approach directly predicts the center offset and box dimensions, eliminating this dependency and making the model more generalizable to new domains.
Analyze | Object Detection

Section B: Mathematical Problems (8)

B1
Beginner

A crop disease model produces the following confusion matrix for 3 classes (Healthy=H, Blight=B, Rust=R). Compute macro and weighted F1.

            Pred-H  Pred-B  Pred-R
Actual-H      85      10       5     (100 total)
Actual-B       5      70      25     (100 total)
Actual-R       2       8      90     (100 total)
H: P=85/92=0.924, R=85/100=0.85, F1=2(0.924×0.85)/(0.924+0.85)=0.885
B: P=70/88=0.795, R=70/100=0.70, F1=2(0.795×0.70)/(0.795+0.70)=0.745
R: P=90/120=0.75, R=90/100=0.90, F1=2(0.75×0.90)/(0.75+0.90)=0.818
Macro F1 = (0.885+0.745+0.818)/3 = 0.816
Weighted F1 = same as Macro F1 here since all classes have equal support (100 each) = 0.816
Apply | Metrics
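
To check a hand computation like B1, the per-class F1 scores can be read straight off the confusion matrix using F1 = 2·TP / (2·TP + FP + FN). The sketch below assumes rows are actual classes and columns are predicted classes, as in the problem; `per_class_f1` is an illustrative name.

```python
def per_class_f1(cm):
    """Per-class F1 from a square confusion matrix (rows = actual, cols = predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                          # rest of the row
        fp = sum(cm[r][k] for r in range(n)) - tp     # rest of the column
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Confusion matrix from problem B1 (Healthy, Blight, Rust)
cm = [
    [85, 10, 5],
    [5, 70, 25],
    [2, 8, 90],
]
f1_scores = per_class_f1(cm)
support = [sum(row) for row in cm]

macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f * s for f, s in zip(f1_scores, support)) / sum(support)

print([round(f, 3) for f in f1_scores])
print(round(macro_f1, 3), round(weighted_f1, 3))
```

Because every class has support 100, the weighted average uses equal weights and coincides with the macro average. With unequal supports the two would diverge.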
B2
Intermediate

Compute IoU for two bounding boxes: Box A = (x1=10, y1=10, x2=50, y2=50) and Box B = (x1=30, y1=30, x2=70, y2=70). Is this a valid detection at IoU threshold 0.5?

Intersection: x1=max(10,30)=30, y1=max(10,30)=30, x2=min(50,70)=50, y2=min(50,70)=50
Intersection area = (50-30)×(50-30) = 20×20 = 400
Area A = (50-10)×(50-10) = 40×40 = 1600
Area B = (70-30)×(70-30) = 40×40 = 1600
Union = 1600 + 1600 - 400 = 2800
IoU = 400/2800 = 0.143
At IoU threshold 0.5, this is NOT a valid detection (0.143 < 0.5).
Apply | Object Detection
B3
Intermediate

A pneumonia detector has these results: TP=190, FP=30, TN=170, FN=10. Compute sensitivity, specificity, PPV, NPV, and F1-score.

Sensitivity (Recall) = 190/(190+10) = 190/200 = 0.95
Specificity = 170/(170+30) = 170/200 = 0.85
PPV (Precision) = 190/(190+30) = 190/220 = 0.864
NPV = 170/(170+10) = 170/180 = 0.944
F1 = 2×(0.864×0.95)/(0.864+0.95) = 0.905
Apply | Medical Metrics
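
The five quantities in B3 all come from the same four counts, so it is convenient to compute them together. This is a minimal sketch with a hypothetical helper name, `screening_metrics`; it assumes none of the denominators is zero.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),        # recall on diseased patients
        "specificity": tn / (tn + fp),        # recall on healthy patients
        "ppv": tp / (tp + fp),                # precision
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Counts from problem B3
m = screening_metrics(tp=190, fp=30, tn=170, fn=10)
for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

Note that PPV and NPV depend on disease prevalence in the evaluated population, while sensitivity and specificity do not. The same model would report a lower PPV on a screening population where pneumonia is rare.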
B4
Intermediate

A ResNet50 backbone has 25.6M parameters. If we freeze all backbone layers and only train a head with layers [Linear(2048,512), Linear(512,38)], how many trainable parameters does the model have? (Ignore biases for simplicity)

Linear(2048, 512): 2048 × 512 = 1,048,576 params
Linear(512, 38): 512 × 38 = 19,456 params
Total trainable = 1,048,576 + 19,456 = 1,068,032 ≈ 1.07M
That's only about 4% of the total model (1.07M trainable out of roughly 26.7M parameters), which is why frozen-backbone training is so fast!
Apply | Architecture
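The count in B4 can be reproduced, with and without biases, in a few lines. This is a plain-Python sketch; `linear_params` is a hypothetical helper, and the 25.6M backbone figure is taken from the problem statement rather than measured. In PyTorch the same number is usually obtained by summing `p.numel()` over parameters whose `requires_grad` is true.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer: weights plus optional biases."""
    return in_features * out_features + (out_features if bias else 0)

head = [(2048, 512), (512, 38)]            # layer shapes from the problem
no_bias = sum(linear_params(i, o, bias=False) for i, o in head)
with_bias = sum(linear_params(i, o, bias=True) for i, o in head)

backbone = 25_600_000                      # frozen backbone size quoted in the problem
fraction = no_bias / (backbone + no_bias)  # trainable share of the whole model
print(no_bias, with_bias, f"{fraction:.1%}")
# 1068032 1068582 4.0%
```

Including biases adds only 550 parameters, so ignoring them does not change the conclusion.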
B5
Advanced

In Grad-CAM, the importance weight $\alpha_k^c$ is the global average pool of the gradients $\partial y^c / \partial A^k$. If each feature map $A^k$ has spatial dimensions 7×7 and there are 512 channels, what is the shape of the final Grad-CAM heatmap (before upsampling)?

$\alpha^c$ has shape (512,): one weight $\alpha_k^c$ per channel after GAP.
Each $A^k$ has shape (7, 7).
$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_{k=1}^{512} \alpha_k^c A^k\right)$, a weighted sum over 512 channels.
Final shape: (7, 7), a single 7×7 heatmap that gets upsampled to input size (e.g., 224×224).
Apply | Grad-CAM
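The shape bookkeeping in B5 can be verified numerically. The sketch below uses NumPy with random arrays standing in for real activations and gradients (an assumption: in a real model both would come from forward and backward hooks on the chosen convolutional layer). The function name `grad_cam_map` is illustrative.

```python
import numpy as np

def grad_cam_map(activations, gradients):
    """Grad-CAM for one image and one class.

    activations, gradients: arrays of shape (K, H, W), one slice per channel.
    Returns the channel weights (K,) and the heatmap (H, W) before upsampling.
    """
    alpha = gradients.mean(axis=(1, 2))              # global average pool per channel
    cam = np.tensordot(alpha, activations, axes=1)   # weighted sum over the K channels
    return alpha, np.maximum(cam, 0.0)               # ReLU keeps positive evidence only

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 7, 7))        # stand-in feature maps
dY_dA = rng.standard_normal((512, 7, 7))    # stand-in gradients of the class score
alpha, cam = grad_cam_map(A, dY_dA)
print(alpha.shape, cam.shape)   # (512,) (7, 7)
```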
B6
Intermediate

A YOLOv8 model outputs detections at 3 scales: 80×80, 40×40, and 20×20. How many candidate detections are generated per image?

Scale 1: 80 × 80 = 6,400 candidates
Scale 2: 40 × 40 = 1,600 candidates
Scale 3: 20 × 20 = 400 candidates
Total = 6,400 + 1,600 + 400 = 8,400 candidates
NMS (Non-Maximum Suppression) then reduces these to typically 10-50 final detections.
Apply | YOLO
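Both steps in B6, counting candidates and pruning them, can be sketched in plain Python. The 640-pixel input and the strides 8, 16, and 32 are assumptions chosen because they reproduce the 80, 40, and 20 grids in the problem. The `nms` function is a simple greedy, class-agnostic version for illustration, not the routine any specific library ships.

```python
def num_candidates(img_size=640, strides=(8, 16, 32)):
    """One candidate per grid cell at each detection scale."""
    return sum((img_size // s) ** 2 for s in strides)

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thr]
    return keep

print(num_candidates())   # 8400

# Hypothetical detections: boxes 0 and 1 cover the same object, box 2 is elsewhere.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
keep = nms(boxes, scores)
print(keep)               # [0, 2]
```

Box 1 is suppressed because its IoU with box 0 is about 0.82, well above the 0.5 threshold.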
B7
Advanced

Show that the harmonic mean (F1-score) is always ≤ the arithmetic mean of precision and recall. When are they equal?

This is the AM-HM inequality. For positive a, b, every step below is an equivalence because a+b > 0:
(a+b)/2 ≥ 2ab/(a+b)
⟺ (a+b)² ≥ 4ab
⟺ a² + 2ab + b² ≥ 4ab
⟺ a² - 2ab + b² ≥ 0
⟺ (a-b)² ≥ 0 ✓ (always true)
Equality holds when a = b, i.e., F1 = AM only when Precision = Recall. This shows F1 penalizes imbalance between P and R more harshly than the arithmetic mean does.
Analyze | Proof
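The same algebra also gives the exact size of the gap between the two means, which makes the "penalizes imbalance" remark quantitative. The values P = 0.9 and R = 0.5 in the comments are illustrative only and are not taken from any project in this chapter.

```latex
\frac{a+b}{2} - \frac{2ab}{a+b}
  = \frac{(a+b)^2 - 4ab}{2(a+b)}
  = \frac{(a-b)^2}{2(a+b)} \;\ge\; 0
% Example with a = P = 0.9 and b = R = 0.5:
%   arithmetic mean = 0.7
%   F1 = 2(0.9)(0.5)/1.4 = 0.9/1.4 \approx 0.643
%   gap = (0.4)^2 / (2 \cdot 1.4) = 0.16/2.8 \approx 0.057
```

The gap grows with the square of the difference between precision and recall, so a small imbalance costs little while a large one costs a lot.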
B8
Advanced

In the currency authentication model, the loss uses class weight 2.0 for counterfeits. If the base cross-entropy loss for a counterfeit sample is -log(0.9) = 0.105, what is the weighted loss? How does this affect gradient magnitude?

Weighted loss = 2.0 × (-log(0.9)) = 2.0 × 0.105 = 0.210
The gradient is also scaled by 2×: ∂(weighted_loss)/∂θ = 2.0 × ∂(base_loss)/∂θ.
This means every counterfeit sample pulls the weights 2× harder than it would unweighted, effectively telling the optimizer "missing a fake note is twice as bad as a false alarm."
Apply | Weighted Loss
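The scaling in B8 can be checked for a single sample. The sketch below assumes a softmax output, for which the derivative of -log(p_true) with respect to the true-class logit is p_true - 1. The function name `weighted_ce` is illustrative. Library loss functions may additionally normalize by the sum of the weights when averaging over a batch, which this single-sample sketch ignores.

```python
import math

def weighted_ce(p_true, class_weight):
    """Weighted cross-entropy for one sample, plus its gradient w.r.t. the true-class logit."""
    loss = class_weight * -math.log(p_true)
    grad = class_weight * (p_true - 1.0)     # softmax cross-entropy gradient, scaled
    return loss, grad

base_loss, base_grad = weighted_ce(0.9, class_weight=1.0)
w_loss, w_grad = weighted_ce(0.9, class_weight=2.0)
print(round(base_loss, 3), round(w_loss, 3))   # 0.105 0.211
print(round(base_grad, 3), round(w_grad, 3))   # -0.1 -0.2
```

The weighted loss prints as 0.211 rather than 0.210 because the code does not round -log(0.9) to 0.105 before doubling.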

Section C: Coding Problems (4)

C1
Intermediate

Write a PyTorch function compute_metrics(y_true, y_pred, num_classes) that computes per-class precision, recall, F1, and macro-averaged F1 from scratch (no sklearn).

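import torch

def compute_metrics(y_true, y_pred, num_classes):
    """Per-class precision, recall, F1 and macro F1 for 1-D integer label tensors."""
    eps = 1e-8  # avoids division by zero for classes with no predictions or no samples
    metrics = {}
    f1_scores = []
    for c in range(num_classes):
        tp = ((y_pred == c) & (y_true == c)).sum().item()  # predicted c, actually c
        fp = ((y_pred == c) & (y_true != c)).sum().item()  # predicted c, actually another class
        fn = ((y_pred != c) & (y_true == c)).sum().item()  # missed samples of class c
        p = tp / (tp + fp + eps)
        r = tp / (tp + fn + eps)
        f1 = 2 * p * r / (p + r + eps)
        metrics[c] = {'precision': p, 'recall': r, 'f1': f1}
        f1_scores.append(f1)
    metrics['macro_f1'] = sum(f1_scores) / num_classes
    return metrics
Apply | Implementation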
C2
Intermediate

Write a function compute_iou(box_a, box_b) that computes IoU between two bounding boxes in [x1, y1, x2, y2] format. Handle the no-overlap case.

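def compute_iou(box_a, box_b):
    """IoU of two boxes in [x1, y1, x2, y2] format; returns 0 when they do not overlap."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp at 0 so disjoint boxes give zero intersection instead of a negative area
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / (union + 1e-8)  # epsilon guards against zero-area boxes
Apply | Object Detection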
C3
Advanced

Implement a complete 2-phase transfer learning pipeline: Phase 1 trains only the head, Phase 2 unfreezes the last N layers. Include learning rate adjustment.

See the full implementation in Project 1 above. Key requirements: (1) param.requires_grad = False for freezing, (2) separate optimizers for each phase, (3) LR for Phase 2 should be 10-100× lower than Phase 1, (4) Cosine annealing or ReduceLROnPlateau scheduler, (5) Save best model based on validation metric.
Apply | Transfer Learning
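The freezing and unfreezing mechanics behind requirements (1) to (4) can be sketched on a toy network. This is not the Project 1 implementation: the tiny `backbone` below is a stand-in for a pretrained ResNet50, and the helper names, learning rates, and `N = 4` are assumptions. Saving the best checkpoint (requirement 5) is omitted to keep the sketch short.

```python
import torch
import torch.nn as nn

def set_trainable(module, flag):
    """Freeze (False) or unfreeze (True) every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def train_step(model, optimizer, x, y):
    """One optimization step with cross-entropy loss."""
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

torch.manual_seed(0)
# Toy stand-in for a pretrained backbone (assumption: a real project loads ResNet50 here).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(16, 38)
model = nn.Sequential(backbone, head)
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 38, (4,))   # random dummy batch

# Phase 1: freeze the whole backbone and train only the head with the higher LR.
set_trainable(backbone, False)
opt1 = torch.optim.Adam(head.parameters(), lr=1e-3)
phase1_trainable = count_trainable(model)
train_step(model, opt1, x, y)

# Phase 2: unfreeze the last N backbone layers and use a 100x lower LR.
N = 4   # in this toy backbone: the second conv, ReLU, pooling, flatten
for layer in list(backbone.children())[-N:]:
    set_trainable(layer, True)
opt2 = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-5)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt2, T_max=10)
phase2_trainable = count_trainable(model)
train_step(model, opt2, x, y)
sched.step()

print(phase1_trainable, phase2_trainable)   # 646 1814
```

Creating a fresh optimizer for Phase 2 matters because the Phase 1 optimizer only knows about the head parameters.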
C4
Advanced

Write a custom PyTorch Dataset class for the multi-region currency authentication model that loads a note image and returns 4 cropped regions + label.

import torch
from PIL import Image
from torchvision import transforms

class CurrencyDataset(torch.utils.data.Dataset):
    """Loads a note image and returns four region crops plus the label."""

    def __init__(self, image_paths, labels, transform=None):
        self.paths = image_paths
        self.labels = labels
        # Fall back to ToTensor so the DataLoader can always batch the crops
        self.transform = transform or transforms.ToTensor()
        self.resize = transforms.Resize((64, 64))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert('RGB')
        w, h = img.size
        # PIL crop boxes are (left, upper, right, lower)
        regions = [
            img.crop((0, 0, w // 3, h // 2)),             # watermark
            img.crop((w // 3, 0, w // 3 + w // 10, h)),   # security thread
            img.crop((0, h // 2, w // 3, h)),             # latent image
            img,                                          # full note for texture
        ]
        # Resize every region to 64x64 first so all four outputs share one shape
        watermark, thread, latent, texture = (
            self.transform(self.resize(r)) for r in regions
        )
        return watermark, thread, latent, texture, self.labels[idx]
Create | Dataset

Section D: Critical Thinking (3)

D1
Advanced

Your chest X-ray model shows excellent performance on the Kermany dataset but fails when deployed at AIIMS Delhi. What are 3 likely reasons and how would you fix each?

  1. Distribution shift: The Kermany data comes from pediatric patients in Guangzhou, while AIIMS sees a largely adult population. Fix: Fine-tune on a small AIIMS dataset (even 500 labeled images can help).
  2. Equipment difference: Different X-ray machines produce different image characteristics (contrast, resolution, noise patterns). Fix: Apply histogram equalization as preprocessing to normalize across equipment.
  3. Annotation disagreement: "Normal" vs "pneumonia" boundaries differ between radiologists. Fix: Use consensus labels from 3+ radiologists, or train with label smoothing to handle annotation uncertainty.
Evaluate | Deployment
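The histogram-equalization fix for equipment differences can be sketched in NumPy for 8-bit grayscale images. This is a minimal global version under stated assumptions: the function name `equalize_hist` and the 2×2 toy image are illustrative. In practice a contrast-limited adaptive variant (CLAHE) is a common alternative for radiographs.

```python
import numpy as np

def equalize_hist(img):
    """Global histogram equalization for a 2-D uint8 image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]            # cumulative count at the darkest pixel present
    denom = cdf[-1] - cdf_min
    if denom == 0:                       # constant image: nothing to stretch
        return img.copy()
    # Lookup table mapping each gray level to its rescaled cumulative rank
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255), 0, 255).astype(np.uint8)
    return lut[img]

# Toy low-contrast image (hypothetical): all values squeezed into 100..103.
img = np.array([[100, 101], [102, 103]], dtype=np.uint8)
out = equalize_hist(img)
print(out.tolist())   # [[0, 85], [170, 255]]
```

After equalization the four gray levels span the full 0 to 255 range, so images from scanners with different contrast settings land on a more comparable scale.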
D2
Advanced

A startup claims their Indian traffic sign recognition system achieves 99% accuracy. You're an investor evaluating this claim. What 5 questions would you ask?

  1. What's the test set composition? Is it from the same distribution as training, or from different cities/weather?
  2. What's the per-class accuracy? 99% overall but 60% on rare signs is useless.
  3. How does it perform on degraded/occluded signs?
  4. What's the inference latency on target hardware?
  5. Has it been tested with adversarial examples? A small sticker on a stop sign fooling the model is a safety-critical failure.
Evaluate | Critical Analysis
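Question 2 is easy to demonstrate: a headline accuracy can hide a failing rare class. The sketch below uses made-up labels (990 common signs and 10 rare ones) purely for illustration; the class names and the function name `per_class_accuracy` are hypothetical.

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified samples within each true class."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}

# Hypothetical test set: the rare class is misread as "stop" 4 times out of 10.
y_true = ["stop"] * 990 + ["cattle_crossing"] * 10
y_pred = ["stop"] * 990 + ["cattle_crossing"] * 6 + ["stop"] * 4

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
by_class = per_class_accuracy(y_true, y_pred)
print(overall, by_class)
# 0.996 {'stop': 1.0, 'cattle_crossing': 0.6}
```

An overall figure above 99% coexists here with 60% accuracy on the rare sign, which is exactly the pattern the investor should ask about.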
D3
Advanced

Discuss the ethical implications of deploying a cow-detection model for autonomous vehicles in India. Consider: religious sentiments, animal welfare, liability, and regional variation.

This is deeply nuanced:
  1. Religious sensitivity: Some states have cow protection laws; the model must never trigger actions perceived as harmful to cows.
  2. Animal welfare: The system should slow down, not try to "navigate around" a cow, which could endanger it.
  3. Liability: If the model fails to detect a cow and causes a collision, who is liable: the car manufacturer, the model developer, or the driver?
  4. Regional variation: Cow-on-road frequency varies enormously between Delhi (rare) and rural Rajasthan (very common). The model's confidence threshold should be region-adaptive.
  5. False positives: Detecting a large dog as a cow might trigger unnecessary emergency braking.
Evaluate | Ethics

★ Starred Research Problems (2)

★R1
Advanced

Read the CheXNet paper (Rajpurkar et al., 2017). They claim "radiologist-level" pneumonia detection using DenseNet121 and extend the model to 14 pathologies. Critically analyze: (1) How did they compare against radiologists? (2) What criticisms has the paper received? (3) How would you design a more rigorous evaluation? Write a 500-word analysis.

Key critique points: (1) CheXNet compared against 4 radiologists on 420 images, a very small test set. (2) "Radiologist-level" was defined using the F1 score of individual radiologists as the baseline, which says little when radiologists disagree with each other. (3) The model was not tested on data from different hospitals (external validation). (4) Subsequent studies (Oakden-Rayner, 2019) showed that performance drops significantly on external datasets. A rigorous evaluation would include: multi-center testing, comparison against panel consensus, stratified analysis by demographics, and long-term prospective clinical trials.
Evaluate | Research Analysis
★R2
Advanced

Design a "few-shot" crop disease detection system that can learn to identify a new disease from only 5 example images. Propose an architecture (hint: metric learning or prototypical networks) and describe how you'd evaluate it. Include a comparison with standard fine-tuning.

Approach: Use a Prototypical Network where the ResNet50 backbone produces embeddings, and classification is done by computing distances to class prototypes (mean embeddings of support examples). For a new disease with 5 images: compute the prototype, then classify new images by nearest-prototype. Evaluation: Use PlantVillage with 30 known classes for meta-training and 8 held-out classes for meta-testing. Report 5-shot accuracy on held-out classes. Compare with: (1) fine-tuning the head with 5 images per class (expect poor results due to overfitting), (2) data augmentation to expand 5 → 50 images + fine-tuning.
Create | Research Design
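The nearest-prototype step of this approach can be sketched in NumPy. The 2-D toy vectors below stand in for ResNet50 embeddings (an assumption made so the example stays readable), and the class names and function names are hypothetical.

```python
import numpy as np

def class_prototypes(support_emb, support_labels):
    """Mean embedding per class, computed from the few labeled support examples."""
    labels = np.asarray(support_labels)
    classes = sorted(set(support_labels))
    protos = np.stack([support_emb[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query_emb, classes, protos):
    """Assign each query to the class whose prototype is nearest (squared Euclidean)."""
    d = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return [classes[i] for i in d.argmin(axis=1)]

# 5-shot support set: five toy embeddings per class (stand-ins for backbone outputs).
offsets = np.array([[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1], [0.1, 0.2], [-0.2, -0.2]])
support_emb = np.concatenate([offsets, offsets + 5.0])
support_labels = ["rust"] * 5 + ["blight"] * 5

classes, protos = class_prototypes(support_emb, support_labels)
preds = classify(np.array([[0.3, 0.4], [4.6, 5.2]]), classes, protos)
print(preds)   # ['rust', 'blight']
```

Adding a new disease only requires computing one more prototype from its five images; no gradient step is needed, which is the contrast with head fine-tuning that the evaluation is meant to measure.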
Section 14

Connections

🔗 How Chapter 20 Connects to the Rest of the Book

← Builds On
  • Chapter 13 (CNN Architectures): ResNet50, MobileNetV2, and DenseNet121, all of which are used in this chapter
  • Chapter 13 (Transfer Learning): The 2-phase fine-tuning strategy comes directly from transfer learning theory
  • Chapter 9 (Regularization): Dropout, data augmentation, and weight decay are used extensively in every project
  • Chapter 4 (Loss Functions): Cross-entropy, BCE with logits, weighted loss for class imbalance
→ Enables
  • Chapter 21 (MLOps): Deploying these models in production requires CI/CD, monitoring, model versioning
  • Chapter 22 (Ethics & Future): The medical AI ethics discussion in Project 4 is expanded in the ethics chapter
🔬 Research Frontier
  • Foundation Models for CV: DINOv2, SAM (Segment Anything Model). Can these replace task-specific fine-tuning?
  • Vision-Language Models: GPT-4V, Gemini. Can you describe a disease in text and have the model classify it?
  • Federated Learning for Medical AI: Training on hospital data without moving it, for privacy-preserving medical AI
🏭 Industry Implementation
  • CropIn (Bengaluru): Serves 7M+ farmers with AI-powered crop advisory
  • Qure.ai (Mumbai): Deployed in 90+ countries for chest X-ray screening
  • Ultralytics: YOLOv8 used in 100K+ projects worldwide
Section 15

Chapter Summary

🎯 7 Key Takeaways

  1. The model is the least important decision. Data quality, augmentation strategy, evaluation methodology, and deployment constraints matter more than whether you use ResNet50 or EfficientNet-B3.
  2. Transfer learning is the default. Always start with a pretrained backbone. Use 2-phase training: head-only with high LR, then gradual unfreezing with 10-100× lower LR. This gives you 95%+ of the benefit with 10% of the compute.
  3. Metrics must match the domain. Accuracy for traffic signs. Recall (sensitivity) for medical screening. mAP@0.5 for detection. F1 when precision and recall both matter. Never use accuracy alone on imbalanced datasets.
  4. Grad-CAM is not optional. For medical AI, it's ethically required. For all projects, it's a debugging tool: if your crop disease model is looking at the background instead of the leaf spots, your model learned a shortcut.
  5. Indian CV projects require domain adaptation. PlantVillage needs Indian crop augmentation. COCO needs Indian traffic classes. Chest X-ray models need Indian hospital validation. Off-the-shelf models from US/European research often underperform on Indian data.
  6. Deployment is half the battle. A 98% accurate model on a GPU server is useless to a farmer in Madhya Pradesh without internet. ONNX export, INT8 quantization, and mobile runtime optimization are not afterthoughts; they're design requirements.
  7. Ethics is engineering. In medical AI, a false negative can kill. In autonomous driving, a missed cow can cause an accident. Build safety margins, regulatory awareness, and human-in-the-loop design into every project from day one.

๐Ÿ“ The Key Equations

F1-Score: F1 = 2ยทTP / (2ยทTP + FP + FN) = 2ยทPยทR / (P + R)

Grad-CAM: LGrad-CAMc = ReLU(ฮฃk ฮฑkc ยท Ak), where ฮฑkc = GAP(โˆ‚yc/โˆ‚Ak)

IoU: IoU(A,B) = |A โˆฉ B| / |A โˆช B|

💡 The Key Intuition

Applied deep learning is not about knowing the fanciest architecture; it's about engineering discipline. The same ResNet50 can give you 60% or 98% on the same dataset. The difference is in data curation, augmentation design, learning rate scheduling, evaluation rigor, and deployment optimization. Mastering these "boring" engineering skills is what separates a student who knows deep learning theory from an engineer who can deploy it to save crops, detect diseases, and prevent accidents.

Section 16

Further Reading

🇮🇳 Indian Resources

  • NPTEL Course: "Deep Learning for Computer Vision" by Prof. Vineeth N Balasubramanian (IIT Hyderabad), covering CNN architectures, transfer learning, and object detection with Indian examples
  • NPTEL Course: "Computer Vision" by Prof. Jayanthi Sivaswamy (IIIT Hyderabad), covering classical + deep learning approaches
  • CropIn Technical Blog: How they build AI models for Indian agriculture with limited labeled data
  • Qure.ai Research Papers: Multiple publications on chest X-ray AI deployment in low-resource settings
  • GATE Preparation: "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Chapters 9 (CNN) and 12 (Applications)
  • IIT Bombay ITSR Dataset: Indian Traffic Sign Recognition benchmark for academic research

๐ŸŒ Global Resources

  • ๐Ÿ“„ Paper: He et al., "Deep Residual Learning for Image Recognition," CVPR 2016 โ€” the ResNet paper used in Project 1
  • ๐Ÿ“„ Paper: Selvaraju et al., "Grad-CAM: Visual Explanations from Deep Networks," ICCV 2017 โ€” the Grad-CAM paper used across all projects
  • ๐Ÿ“„ Paper: Rajpurkar et al., "CheXNet: Radiologist-Level Pneumonia Detection," 2017 โ€” inspiration for Project 4
  • ๐Ÿ“„ Paper: Jocher et al., "Ultralytics YOLOv8," 2023 โ€” the YOLOv8 architecture documentation
  • ๐Ÿ“„ Paper: Hughes & Salathรฉ, "An open access repository of images on plant health," 2015 โ€” PlantVillage dataset paper
  • ๐ŸŽฅ Video: 3Blue1Brown โ€” "But what is a Neural Network?" โ€” foundational intuition
  • ๐ŸŒ Interactive: Distill.pub: Building Blocks of Interpretability โ€” excellent visualization of neural network features
  • ๐ŸŒ Docs: Ultralytics YOLOv8 Documentation โ€” complete training, evaluation, and deployment guide
  • ๐ŸŒ Platform: Roboflow โ€” dataset annotation, augmentation, and model training platform
  • ๐Ÿ“š Book: Franรงois Chollet, "Deep Learning with Python" (2nd edition) โ€” practical Keras/TF approach to CV projects