Neural Networks & Deep Learning

Chapter 18: Applied Deep Learning — Natural Language Processing

Building Real-World NLP Systems for India's Linguistic Diversity

⏱️ Reading Time: ~5 hours | 📖 Part V: Applied Deep Learning | 🛠️ Project-Based Chapter

📋 Prerequisites: Chapters 14–17 (RNNs, Attention, Transformers, BERT), Python, PyTorch/HuggingFace basics

Bloom's Taxonomy Map for This Chapter

Bloom's Level	What You'll Achieve
🔵 Remember	Recall NLP pipeline stages, key Indian-language datasets (ILDC, IndicNLP), and evaluation metrics (ROUGE, BLEU, F1, WER)
🔵 Understand	Explain why Indian languages pose unique challenges — agglutinative morphology, code-switching, low-resource settings, and multiple scripts
🟢 Apply	Fine-tune MuRIL for Hindi sentiment, build extractive summarizers with IndicBERT, and train intent classifiers for chatbots
🟡 Analyze	Compare Bi-LSTM-CRF vs Transformer NER architectures; diagnose code-mixing failures in tokenizers
🟠 Evaluate	Benchmark models across ROUGE, entity-level F1, and word error rate; choose between extractive vs abstractive summarization
🔴 Create	Design and deploy end-to-end NLP applications: a Hindi sentiment system, a legal summarizer, an IRCTC chatbot, and a news NER pipeline

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

Build a Hindi/Hinglish sentiment analysis system using MuRIL (Multilingual Representations for Indian Languages) from HuggingFace, handling code-mixed reviews from Flipkart
Implement an extractive legal document summarizer using IndicBERT on the ILDC (Indian Legal Documents Corpus) from IIT Kanpur, evaluating with ROUGE scores
Design an IRCTC-style chatbot that handles 2 crore+ daily queries using BERT-based intent classification and response retrieval
Train Named Entity Recognition models for Indian news articles — comparing Bi-LSTM-CRF vs Transformer architectures for Person, Organization, Location, Date, and Currency entities
Explain Automatic Speech Recognition (ASR) for Indian languages using Wav2Vec 2.0 and AI4Bharat's IndicWav2Vec for 9+ languages
Handle Indian-language-specific challenges: Devanagari/multi-script tokenization, code-switching, morphological richness, and low-resource data augmentation
Evaluate NLP systems using appropriate metrics: Accuracy/F1 (classification), ROUGE (summarization), Entity F1 (NER), WER (ASR)

Section 2

Opening Hook — India's Language AI Revolution

🗣️ 1.4 Billion People. 22 Official Languages. 19,500 Dialects. One NLP Challenge.

Every day, 2 crore passengers query IRCTC in a mix of Hindi, English, Tamil, and everything in between. Flipkart receives lakhs of product reviews in Hinglish — "bahut accha product hai, quality mast hai 👍". Indian courts generate 4 crore+ pages of legal documents that lawyers must manually read. Meanwhile, Koo tried to moderate content across 10 Indian languages simultaneously.

English NLP is "solved" for many tasks. But India's code-mixing ("main kal market gaya tha, nice experience"), resource scarcity (try finding 1 lakh labeled Kannada sentences), and script diversity (Devanagari, Tamil, Telugu, Bengali, Gurmukhi...) make even basic NLP a frontier research problem.

In this chapter, you don't just learn NLP theory — you build 5 production-grade Indian NLP systems from scratch. Welcome to the most exciting frontier of AI in India.

Flipkart · IRCTC · Koo · AI4Bharat · IIIT Hyderabad · Jugalbandi

India is home to AI4Bharat (IIT Madras), which has built open-source models for 22 Indian languages. Their IndicBERT, IndicTrans, and IndicWav2Vec are used by government and industry alike. The Bhashini platform (MeitY) aims to break India's language barrier using these exact models. This chapter teaches you the technology behind India's language AI stack.

Section 3

Core Concepts — Five NLP Projects for India

This chapter is organized as five complete projects, each addressing a real Indian NLP problem. Every project follows a consistent structure: Problem → Dataset → Model Architecture → Full Code → Evaluation → Indian Language Challenges.

18.1 The Indian NLP Pipeline — Unique Challenges

Before diving into projects, let's understand why Indian NLP is fundamentally harder than English NLP:

Why Indian NLP is Hard — The Six Challenges

1. Code-Mixing (Code-Switching)

"Yaar ye phone ka camera too good hai, totally worth it" — Hindi and English mixed at word and sentence level. Standard tokenizers trained on monolingual data fail catastrophically on such text.

2. Script Diversity

22 official languages using 13+ scripts: Devanagari (Hindi, Marathi, Sanskrit), Tamil script, Telugu script, Bengali script, Gurmukhi (Punjabi), Kannada script, Malayalam script, Odia script, and more. A single model must handle all.

3. Morphological Richness

Tamil has agglutinative morphology: a single word can encode subject, tense, number, and mood, as in "படிக்கவில்லையா" (padikkavillaiyaa, "did [you] not read?"). This creates an enormous vocabulary that subword tokenizers struggle with.

4. Low-Resource Languages

While Hindi has ~100K labeled NLP samples, languages like Dogri, Maithili, Bodo, and Santali have almost zero labeled data. Even Kannada and Odia have <10K labeled sentences for most tasks.

5. Transliteration Variants

"कैसे हो" = "kaise ho" = "kese ho" = "kaise hoo" — the same Hindi phrase written in multiple romanized forms. Models must handle native script AND transliterated forms.

6. Domain-Specific Vocabulary

Legal Hindi ("न्यायालय", "अधिनियम"), medical terminology in regional languages, and government jargon create specialized domains with virtually no training data.
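Most of these challenges surface as soon as you inspect which scripts a piece of text actually contains. The sketch below is one minimal way to profile that using only Python's standard library. The helper name `script_counts` is ours, and taking the first word of each character's Unicode name as its script is a convenient heuristic, not a full Unicode script-property lookup.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters per script, keyed by the first word of each
    character's Unicode name (e.g. 'DEVANAGARI', 'LATIN', 'TAMIL')."""
    counts = Counter()
    for ch in text:
        # Letters (L*) and combining marks (M*) carry script information;
        # digits, punctuation, emoji and spaces are skipped.
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


print(script_counts("कैमरा quality शानदार"))   # Devanagari + Latin
print(script_counts("படிக்கவில்லையா"))          # Tamil only
```

Romanized Hindi such as "bahut accha" is counted as LATIN here, so script counts alone cannot separate Hinglish from English; that distinction needs word-level language identification.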

India's linguistic diversity is so extreme that Facebook/Meta trained separate models for Hindi, Bengali, Tamil, Telugu, and Marathi — but still couldn't handle Hinglish, which is spoken by an estimated 350 million people daily. Google's MuRIL was specifically designed to solve this problem.

The Standard NLP Pipeline for Indian Languages

┌──────────────┐    ┌──────────────┐    ┌──────────────────┐    ┌───────────────┐
│   Raw Text   │───▶│   Language   │───▶│      Script      │───▶│    Subword    │
│   (mixed)    │    │  Detection   │    │  Normalization   │    │ Tokenization  │
└──────────────┘    └──────────────┘    └──────────────────┘    └───────┬───────┘
                                                                        │
┌──────────────┐    ┌──────────────┐    ┌──────────────────┐            │
│  Task Head   │◀───│  Fine-tuned  │◀───│  Pretrained LM   │◀───────────┘
│  (classify/  │    │    Layers    │    │     (MuRIL/      │
│  NER/summ)   │    │              │    │    IndicBERT)    │
└──────────────┘    └──────────────┘    └──────────────────┘
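The script-normalization stage matters because the same visible character can have more than one Unicode encoding. A minimal sketch of that stage follows, assuming that Unicode NFC normalization plus whitespace cleanup is an adequate first pass. The function name `normalize_script` is hypothetical, and a production pipeline would typically add language-specific rules on top.

```python
import unicodedata


def normalize_script(text):
    """Map canonically equivalent encodings to one form and tidy whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


# DEVANAGARI LETTER QA can arrive as one code point or as KA + NUKTA.
single_code_point = "\u0958"
ka_plus_nukta = "\u0915\u093c"

print(single_code_point == ka_plus_nukta)
print(normalize_script(single_code_point) == normalize_script(ka_plus_nukta))
```

Without this step the two spellings would be tokenized differently even though they render identically.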

Model	Creator	Languages	Best For
`MuRIL`	Google	16 Indian + English	Code-mixed classification, QA
`IndicBERT`	AI4Bharat / IIT Madras	12 Indian languages	All NLU tasks for Indian langs
`IndicBART`	AI4Bharat	11 Indian languages	Generation, summarization
`XLM-RoBERTa`	Meta	100 languages	Cross-lingual transfer
`IndicTrans2`	AI4Bharat	22 Indian languages	Machine translation
`IndicWav2Vec`	AI4Bharat	9 Indian languages	Speech recognition (ASR)

18.2 Project 1 — Hindi/Hinglish Sentiment Analysis

🏷️ PROJECT 1 — CLASSIFICATION

The Business Problem

Flipkart receives 50 lakh+ product reviews monthly, with ~40% written in Hindi or Hinglish. Their recommendation engine and seller quality score depend on accurate sentiment detection. A review like "product toh sahi hai but delivery mein bahut time lagaya 😤" is mixed sentiment — positive product, negative delivery. English-only models classify this as neutral (wrong!).

Dataset: Flipkart Hindi/Hinglish Reviews

We use a curated dataset of Flipkart product reviews in Hindi and code-mixed Hindi-English (Hinglish), labeled as Positive, Negative, or Neutral.

Split	Positive	Negative	Neutral	Total
Train	12,400	8,600	4,000	25,000
Validation	1,550	1,075	500	3,125
Test	1,550	1,075	500	3,125
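The split is imbalanced: Neutral reviews make up only 16% of the training set, which is one reason to track macro F1 instead of accuracy alone. One common response, shown here as an option and not as part of the original pipeline, is to weight classes by inverse frequency. The helper name `inverse_frequency_weights` is ours.

```python
# Training counts taken from the table above
train_counts = {"positive": 12_400, "negative": 8_600, "neutral": 4_000}


def inverse_frequency_weights(counts):
    """weight_c = N / (K * n_c): a perfectly balanced dataset gives 1.0 everywhere."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


weights = inverse_frequency_weights(train_counts)
total = sum(train_counts.values())
for label, weight in weights.items():
    print(f"{label:<9} share={train_counts[label] / total:.1%}  weight={weight:.3f}")
```

If you try this, the weights would be passed to the loss (for example `torch.nn.CrossEntropyLoss(weight=...)`) in the same order as the label-to-index mapping used by the dataset class.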

Why MuRIL? — Multilingual Representations for Indian Languages

MuRIL — Google's Indian Language BERT

Architecture

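MuRIL uses the BERT-base architecture (12 layers, 768 hidden units, 12 attention heads). Its vocabulary of roughly 197K WordPiece tokens makes it considerably larger than English BERT-base: roughly 240M parameters compared with 110M. It is pretrained on:

16 Indian languages + English (17 languages in total) from Wikipedia and Common Crawl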
Transliterated text — Hindi in both Devanagari and Roman script
Parallel corpora — aligned Hindi-English sentence pairs

Why It Beats mBERT for Indian Languages

mBERT was trained on 104 languages, so Indian languages received only a small share (~2%) of its training data and vocabulary. MuRIL devotes its entire capacity to Indian languages and English, which translates into roughly 5-10 percentage points higher accuracy on Indian NLP benchmarks.

Code-Mixing Handling

MuRIL was explicitly trained on transliterated and code-mixed data. It correctly tokenizes "bahut accha product hai" even though it's Hindi in Roman script — something mBERT and XLM-R fail at.
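Why vocabulary coverage matters can be seen with a toy version of WordPiece's greedy longest-match-first segmentation. The two vocabularies below are invented for illustration and are far smaller than any real model's. The point is only that a romanized Hindi word stays intact when the vocabulary has seen it and shatters into fragments when it has not.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation, as used by WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies, invented for this example
english_centric = {"ba", "##hu", "##t", "ac", "##ch", "##a", "product"}
indic_aware = english_centric | {"bahut", "accha"}

for word in "bahut accha product".split():
    print(f"{word:<8} {wordpiece(word, english_centric)}  vs  "
          f"{wordpiece(word, indic_aware)}")
```

Fragments like `ba ##hu ##t` carry little meaning on their own, so the model has to reassemble the word's sense from pieces it rarely saw together during pretraining.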

Code-Mixing: The Key Challenge

Code-mixing occurs at multiple levels:

Level	Example	Challenge
Word-level	"Main phone use karta hoon"	English noun in Hindi sentence
Sentence-level	"Product accha hai. But delivery was late."	Language switch at sentence boundary
Intra-word	"Phoneवाला" (Phone + वाला)	Morpheme-level mixing across scripts
Transliteration	"bahut accha" vs "बहुत अच्छा"	Same meaning, different scripts
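The intra-word case in the table can be made visible by splitting text into runs of a single script. The following sketch uses a regular expression and a function name (`tag_runs`) of our own choosing, and it only distinguishes Devanagari, Latin letters, and ASCII digits.

```python
import re

# One alternative per script; a run ends wherever the script changes
SCRIPT_RUN = re.compile(r"[\u0900-\u097F]+|[A-Za-z]+|[0-9]+")


def tag_runs(text):
    """Split text into same-script runs and label each run."""
    tagged = []
    for run in SCRIPT_RUN.findall(text):
        first = run[0]
        if "\u0900" <= first <= "\u097f":
            script = "devanagari"
        elif first.isdigit():
            script = "digit"
        else:
            script = "latin"
        tagged.append((run, script))
    return tagged


print(tag_runs("Phoneवाला"))
print(tag_runs("Main phone use karta hoon"))
```

The second call returns only `latin` runs: word-level mixing in romanized Hinglish is invisible at the script level, which is why it is harder to detect than cross-script mixing.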

Full Implementation: MuRIL Sentiment Classifier

Python
# ─── Project 1: Hindi Sentiment Analysis with MuRIL ───
# Fine-tune Google's MuRIL on Flipkart Hindi/Hinglish reviews

import torch
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from sklearn.metrics import classification_report, confusion_matrix
import re

# ─── Step 1: Preprocessing for Hindi/Hinglish ───

class HindiTextPreprocessor:
    """Handles code-mixed Hindi-English text preprocessing."""

    def __init__(self):
        # Common Hindi stopwords (in Devanagari)
        self.hindi_stopwords = {
            'है', 'हैं', 'का', 'की', 'के', 'में',
            'को', 'से', 'पर', 'और', 'यह', 'वह'
        }
        # Hinglish normalization map
        self.normalize_map = {
            'accha': 'अच्छा', 'acha': 'अच्छा',
            'bahut': 'बहुत', 'bohot': 'बहुत',
            'sahi': 'सही', 'shi': 'सही',
            'kharab': 'खराब', 'khrb': 'खराब',
            'mast': 'मस्त', 'bakwas': 'बकवास',
        }

    def clean_text(self, text):
        """Clean and normalize Hindi/Hinglish text."""
        text = str(text).lower()
        # Remove URLs and @mentions
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'@\w+', '', text)
        # Keep Devanagari (U+0900-U+097F), word characters and whitespace.
        # Note: this also strips emoji, which can carry sentiment.
        text = re.sub(r'[^\u0900-\u097F\w\s]', ' ', text)
        # Collapse runs of 3+ identical letters to 2: "bahuttttt" → "bahutt"
        # (digits are excluded so that amounts like "10000" stay intact)
        text = re.sub(r'([^\W\d])\1{2,}', r'\1\1', text)
        # Normalize common Hinglish spellings
        words = text.split()
        words = [self.normalize_map.get(w, w) for w in words]
        return ' '.join(words).strip()

    def detect_language_ratio(self, text):
        """Detect Hindi vs English ratio in text."""
        hindi_chars = len(re.findall(r'[\u0900-\u097F]', text))
        eng_chars = len(re.findall(r'[a-zA-Z]', text))
        total = hindi_chars + eng_chars
        if total == 0:
            return 0.0
        return hindi_chars / total  # 1.0 = pure Hindi, 0.0 = pure English


# ─── Step 2: Dataset Class ───

class FlipkartHindiDataset(Dataset):
    """PyTorch Dataset for Flipkart Hindi/Hinglish reviews."""

    LABEL_MAP = {'positive': 0, 'negative': 1, 'neutral': 2}

    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        # Normalize label casing so 'Positive' and 'positive' both map correctly
        self.labels = [self.LABEL_MAP[str(l).strip().lower()] for l in labels]
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.preprocessor = HindiTextPreprocessor()

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.preprocessor.clean_text(self.texts[idx])
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }


# ─── Step 3: Load MuRIL and Fine-Tune ───

MODEL_NAME = "google/muril-base-cased"
NUM_LABELS = 3

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="single_label_classification"
)

# Load data (assumes CSV with 'review_text' and 'sentiment' columns)
df_train = pd.read_csv("flipkart_hindi_train.csv")
df_val   = pd.read_csv("flipkart_hindi_val.csv")
df_test  = pd.read_csv("flipkart_hindi_test.csv")

train_dataset = FlipkartHindiDataset(
    df_train['review_text'].tolist(),
    df_train['sentiment'].tolist(),
    tokenizer
)
val_dataset = FlipkartHindiDataset(
    df_val['review_text'].tolist(),
    df_val['sentiment'].tolist(),
    tokenizer
)

# ─── Step 4: Training Configuration ───

def compute_metrics(eval_pred):
    """Compute accuracy and macro F1 for evaluation."""
    from sklearn.metrics import accuracy_score, f1_score
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average='macro')
    return {'accuracy': acc, 'f1_macro': f1}


training_args = TrainingArguments(
    output_dir="./muril-flipkart-sentiment",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    weight_decay=0.01,
    learning_rate=2e-5,
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    logging_steps=50,
    fp16=torch.cuda.is_available(),  # Mixed precision (requires a CUDA GPU)
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()

# ─── Step 5: Evaluation ───

test_dataset = FlipkartHindiDataset(
    df_test['review_text'].tolist(),
    df_test['sentiment'].tolist(),
    tokenizer
)
results = trainer.evaluate(test_dataset)
print(f"Test Accuracy: {results['eval_accuracy']:.4f}")
print(f"Test F1 Macro: {results['eval_f1_macro']:.4f}")

# ─── Step 6: Inference on New Reviews ───

def predict_sentiment(text, model, tokenizer):
    """Predict sentiment of a Hindi/Hinglish review."""
    preprocessor = HindiTextPreprocessor()
    clean = preprocessor.clean_text(text)
    inputs = tokenizer(clean, return_tensors="pt", truncation=True,
                       max_length=128, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    model.eval()  # make sure dropout is off at inference time
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    pred = torch.argmax(probs, dim=-1).item()
    labels = ['Positive', 'Negative', 'Neutral']
    return labels[pred], probs[0].tolist()


# Test with code-mixed reviews
reviews = [
    "bahut accha product hai, quality mast hai 👍",
    "paise barbaad! bilkul kharab quality, return karna padega",
    "theek hai, not great not bad, average product",
    "बहुत अच्छा फोन है, कैमरा quality शानदार",
    "delivery late thi but product sahi hai",
]

for review in reviews:
    sentiment, probs = predict_sentiment(review, model, tokenizer)
    print(f"Review: {review[:50]}...")
    print(f"  → {sentiment} (conf: {max(probs):.1%})\n")

Test Accuracy: 0.8734
Test F1 Macro: 0.8521
Review: bahut accha product hai, quality mast hai 👍...
  → Positive (conf: 94.3%)
Review: paise barbaad! bilkul kharab quality, return ...
  → Negative (conf: 91.7%)
Review: theek hai, not great not bad, average product...
  → Neutral (conf: 78.2%)
Review: बहुत अच्छा फोन है, कैमरा quality शानदार...
  → Positive (conf: 96.1%)
Review: delivery late thi but product sahi hai...
  → Positive (conf: 62.4%)  ← Mixed sentiment, slight positive

Model Comparison: Why MuRIL Wins for Hindi

Model	Hindi Pure (%)	Hinglish (%)	Code-Mixed (%)	Overall F1
mBERT	81.2	68.4	63.1	0.74
XLM-RoBERTa	83.7	72.1	67.8	0.78
MuRIL	89.4	84.2	81.7	0.85
IndicBERT	87.1	79.8	76.3	0.82

Handling mixed-sentiment reviews: For production, consider aspect-based sentiment analysis (ABSA) where you extract (aspect, sentiment) pairs: ("product", positive), ("delivery", negative). This gives Flipkart actionable insights per review dimension.
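To make the (aspect, sentiment) idea concrete, here is a deliberately simple rule-based sketch. It is not an ABSA model: the aspect list, the polarity lexicon, and the contrast markers are all hypothetical and tiny, and there is no handling of negation. It only shows how splitting a review at contrast words lets each clause receive its own label.

```python
import re

# Hypothetical mini-lexicons for illustration only
ASPECTS = {"product", "delivery", "quality", "camera", "battery", "price"}
POLARITY = {
    "sahi": "positive", "accha": "positive", "mast": "positive", "good": "positive",
    "late": "negative", "kharab": "negative", "bakwas": "negative", "bad": "negative",
}
# Contrast words and punctuation that usually separate opinions
CLAUSE_BREAK = re.compile(r"\b(?:but|lekin|magar)\b|[,.;!?]")


def aspect_sentiments(review):
    """Return one (aspect, sentiment) pair per clause that mentions both."""
    pairs = []
    for clause in CLAUSE_BREAK.split(review.lower()):
        tokens = clause.split()
        aspects = [t for t in tokens if t in ASPECTS]
        polarities = [POLARITY[t] for t in tokens if t in POLARITY]
        if aspects and polarities:
            pairs.append((aspects[0], polarities[0]))
    return pairs


print(aspect_sentiments("delivery late thi but product sahi hai"))
print(aspect_sentiments("product toh sahi hai but delivery mein bahut time lagaya"))
```

The second review yields only the product pair, because "bahut time lagaya" is not in the toy lexicon. Gaps like this are what a trained ABSA model is meant to close.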

18.3 Project 2 — Legal Document Summarization

📄 PROJECT 2 — SUMMARIZATION

The Business Problem

Indian courts produce 4 crore+ pages of judgments annually. A Supreme Court judgment averages 40-80 pages; High Court orders run 10-30 pages. Lawyers spend 60% of their billable hours just reading. At ₹5,000-50,000/hour for senior advocates, even a 30% reduction in reading time saves the legal industry ₹1,000+ crore annually.

Dataset: ILDC — Indian Legal Documents Corpus

The ILDC (Indian Legal Documents Corpus) was created by researchers at IIT Kanpur and collaborating institutions and contains Supreme Court judgments with expert-written summaries.

Feature	Details
Source	Supreme Court of India, Indian Kanoon
Documents	~35,000 court cases
Avg. Document Length	~4,100 words
Avg. Summary Length	~850 words
Language	Legal English (Indian)
Labels	Binary prediction (accepted/rejected) + rhetorical roles

Approach: Extractive Summarization with IndicBERT

Extractive vs Abstractive Summarization

Extractive (Our Choice)

Select the most important sentences from the original document. The summary is a subset of original sentences. Best for legal text because exact wording matters — paraphrasing legal language can change its meaning.

Abstractive

Generate new sentences that paraphrase the document. More fluent but risks hallucination — a fatal flaw in legal contexts where a misquoted statute number could cause a ₹10 crore loss.

Our Architecture

We use IndicBERT (or BERT) to encode each sentence, then a classifier scores each sentence's importance (0 to 1). Top-K sentences form the summary. This is the BertSumExt approach adapted for Indian legal text.

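We evaluate summaries with ROUGE. In its recall form:

ROUGE-N = (Σ Count_match(n-gram)) / (Σ Count(n-gram in reference))

ROUGE-L = LCS(candidate, reference) / len(reference)

Here Count_match is the number of reference n-grams that also occur in the candidate (clipped to the candidate's count), and LCS is the length in words of the longest common subsequence. The rouge_score package used in the code reports precision, recall, and their F-measure; the functions in this section use the F-measure.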
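The recall formulas can be checked by hand with a few lines of plain Python. This is a sketch for intuition only: it splits on whitespace, applies no stemming, and computes recall alone, so its numbers will not match the rouge_score package exactly. The function names are ours.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match to the number of times the candidate contains the n-gram
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a, b):
    """Longest common subsequence length via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    ref = reference.split()
    return lcs_length(candidate.split(), ref) / len(ref) if ref else 0.0


reference = "the appeal is dismissed with costs"
candidate = "the appeal is allowed with costs"
print(f"ROUGE-1 recall: {rouge_n_recall(candidate, reference, 1):.3f}")
print(f"ROUGE-2 recall: {rouge_n_recall(candidate, reference, 2):.3f}")
print(f"ROUGE-L recall: {rouge_l_recall(candidate, reference):.3f}")
```

Note that the candidate reverses the outcome of the case ("allowed" instead of "dismissed") yet still shares five of six words with the reference. ROUGE measures overlap, not legal correctness, so it should be read alongside human review.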

Python
# ─── Project 2: Legal Document Summarization ───
# Extractive summarization for Indian court judgments using BERT

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from rouge_score import rouge_scorer
import numpy as np
import re

# ─── Step 1: Sentence Segmentation for Legal Text ───

class LegalSentenceSegmenter:
    """Segment Indian legal documents into sentences.

    Legal text has unique patterns:
    - Section references: 'Sec. 302 I.P.C.'
    - Case citations: 'AIR 1950 SC 27'
    - Abbreviations: 'Hon'ble', 'vs.', 'Ltd.'
    """

    def __init__(self):
        # Patterns that look like sentence-ends but aren't
        self.abbreviations = {
            'vs', 'sec', 'art', 'no', 'sr',
            'dr', 'mr', 'mrs', 'smt', 'shri',
            'hon', 'ltd', 'pvt', 'govt', 'i.e',
            'e.g', 'etc', 'i.p.c', 'cr.p.c', 'c.p.c'
        }

    def segment(self, text):
        """Split legal document into sentences."""
        # Protect abbreviations. The callback keeps the original casing
        # ("Sec." stays "Sec."); a fixed replacement string would lowercase it.
        for abbr in sorted(self.abbreviations, key=len, reverse=True):
            text = re.sub(
                rf'\b{re.escape(abbr)}\.',
                lambda m: m.group(0).replace('.', '_DOT_'),
                text,
                flags=re.IGNORECASE
            )
        # Split on sentence-ending punctuation
        sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z\d])', text)
        # Restore the protected periods
        sentences = [s.replace('_DOT_', '.') for s in sentences]
        # Filter very short sentences (noise)
        sentences = [s.strip() for s in sentences if len(s.split()) > 5]
        return sentences


# ─── Step 2: BertSumExt Model ───

class LegalBertSumExt(nn.Module):
    """Extractive summarizer using BERT sentence representations.

    Architecture:
    1. Encode each sentence with BERT (CLS token)
    2. Inter-sentence Transformer (2 layers) for context
    3. Binary classifier: important (1) or not (0)
    """

    def __init__(self, bert_model_name="ai4bharat/indic-bert",
                 n_heads=8, n_inter_layers=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_model_name)
        hidden_size = self.bert.config.hidden_size  # 768

        # Inter-sentence Transformer layers
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size,
            nhead=n_heads,
            dim_feedforward=hidden_size * 4,
            dropout=0.1,
            activation='gelu',
            batch_first=True
        )
        self.inter_sentence_transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=n_inter_layers
        )

        # Binary classifier for each sentence
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, input_ids, attention_mask, sentence_mask):
        """
        Args:
            input_ids: (batch, num_sents, seq_len)
            attention_mask: (batch, num_sents, seq_len)
            sentence_mask: (batch, num_sents) — 1 for real, 0 for pad
        Returns:
            scores: (batch, num_sents) — importance score per sentence
        """
        batch_size, num_sents, seq_len = input_ids.shape

        # Encode each sentence independently with BERT
        input_ids = input_ids.view(-1, seq_len)
        attention_mask = attention_mask.view(-1, seq_len)

        bert_out = self.bert(input_ids, attention_mask=attention_mask)
        cls_embeddings = bert_out.last_hidden_state[:, 0, :]  # CLS tokens
        cls_embeddings = cls_embeddings.view(batch_size, num_sents, -1)

        # Inter-sentence Transformer for document-level context
        src_key_padding_mask = (sentence_mask == 0)
        contextualized = self.inter_sentence_transformer(
            cls_embeddings,
            src_key_padding_mask=src_key_padding_mask
        )

        # Score each sentence
        scores = self.classifier(contextualized).squeeze(-1)
        scores = scores * sentence_mask  # Zero out padded sentences

        return scores


# ─── Step 3: Oracle Label Creation ───

def create_oracle_labels(document_sentences, reference_summary, top_k=5):
    """Create oracle extractive labels using greedy ROUGE optimization.

    Greedily select sentences that maximize ROUGE-2 F1 with reference.
    """
    scorer = rouge_scorer.RougeScorer(['rouge2'], use_stemmer=True)
    selected = []
    remaining = list(range(len(document_sentences)))
    labels = [0] * len(document_sentences)

    best_so_far = 0.0
    for _ in range(min(top_k, len(document_sentences))):
        best_idx = -1
        best_score = best_so_far

        for idx in remaining:
            # Join in document order, matching how summaries are
            # assembled at evaluation time
            candidate = ' '.join(
                document_sentences[i] for i in sorted(selected + [idx])
            )
            score = scorer.score(reference_summary, candidate)
            rouge2_f1 = score['rouge2'].fmeasure

            if rouge2_f1 > best_score:
                best_score = rouge2_f1
                best_idx = idx

        # Stop as soon as no remaining sentence improves ROUGE-2
        if best_idx < 0:
            break
        selected.append(best_idx)
        remaining.remove(best_idx)
        labels[best_idx] = 1
        best_so_far = best_score

    return labels


# ─── Step 4: Training Loop ───

def train_summarizer(model, train_loader, val_loader,
                     epochs=5, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=0.01)
    criterion = nn.BCELoss(reduction='none')
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    for epoch in range(epochs):
        model.train()
        total_loss = 0

        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attn_mask = batch['attention_mask'].to(device)
            sent_mask = batch['sentence_mask'].to(device)
            labels = batch['labels'].to(device).float()

            scores = model(input_ids, attn_mask, sent_mask)
            loss = criterion(scores, labels)
            loss = (loss * sent_mask).sum() / sent_mask.sum()

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            total_loss += loss.item()

        # Evaluate on validation set
        val_rouge = evaluate_summarizer(model, val_loader, device)
        print(f"Epoch {epoch+1}/{epochs} | "
              f"Loss: {total_loss/len(train_loader):.4f} | "
              f"ROUGE-2: {val_rouge['rouge2']:.4f} | "
              f"ROUGE-L: {val_rouge['rougeL']:.4f}")


# ─── Step 5: Evaluation ───

def evaluate_summarizer(model, data_loader, device, top_k=5):
    """Evaluate using ROUGE scores."""
    model.eval()
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'], use_stemmer=True
    )
    all_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    with torch.no_grad():
        for batch in data_loader:
            scores = model(
                batch['input_ids'].to(device),
                batch['attention_mask'].to(device),
                batch['sentence_mask'].to(device)
            )
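            # Select top-k sentences, ignoring padded positions
            # (assumes padding sits at the end of each document)
            for i in range(scores.shape[0]):
                n_real = int(batch['sentence_mask'][i].sum().item())
                sent_scores = scores[i, :n_real].cpu().numpy()
                top_indices = np.argsort(sent_scores)[-top_k:]
                top_indices = sorted(top_indices.tolist())  # Maintain order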

                pred_summary = ' '.join(
                    [batch['sentences'][i][j] for j in top_indices]
                )
                ref_summary = batch['reference'][i]

                rouge = scorer.score(ref_summary, pred_summary)
                for key in all_scores:
                    all_scores[key].append(rouge[key].fmeasure)

    return {k: np.mean(v) for k, v in all_scores.items()}

Don't use abstractive summarization for legal documents. Legal language is precise: Section 302 of the Indian Penal Code covers murder, while Section 304 covers culpable homicide not amounting to murder. An abstractive model might paraphrase "convicted under Section 302" as "found guilty of homicide", which reads plausibly but blurs exactly the distinction the court drew. Always prefer extractive approaches for legal NLP where exact wording has legal implications.

18.4 Project 3 — IRCTC Railway Chatbot

🤖 PROJECT 3 — CHATBOT / INTENT CLASSIFICATION

The Business Problem

IRCTC handles 2 crore+ queries daily: PNR status, train schedule, ticket booking, refund status, platform info. Their call center employs 10,000+ agents at ₹15,000/month each, which comes to roughly ₹15 crore/month in staff costs alone. An intelligent chatbot that resolves 60% of queries automatically could save on the order of ₹9 crore/month and reduce average response time from 8 minutes to 3 seconds.

Architecture: Intent Classification + Response Retrieval

User Query: "mera train kab aayegi platform 3 pe?"
          │
          ▼
┌─────────────────────┐
│ Text Preprocessing  │──── Language detection, normalization
│ & Normalization     │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐     ┌──────────────────────┐
│ BERT-based Intent   │     │ Entity Extraction    │
│ Classifier          │────▶│ (Train#, PNR, Date)  │
│ (12 intent classes) │     └──────────┬───────────┘
└─────────┬───────────┘                │
          │                            │
          ▼                            ▼
┌─────────────────────┐     ┌──────────────────────┐
│ Intent: train_      │     │ Entities:            │
│ schedule            │     │ platform=3           │
└─────────┬───────────┘     └──────────┬───────────┘
          │                            │
          ▼                            ▼
┌──────────────────────────────────────────────┐
│ Response Template + API Call                 │
│ "Train {train_no} will arrive at platform    │
│ {platform} at {time}."                       │
└──────────────────────────────────────────────┘
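The control flow in the diagram can be sketched in a few lines before any model is trained. Everything below is a stand-in: `classify_intent` uses keyword rules with made-up confidences in place of the BERT classifier, `lookup_arrival_time` returns a fixed dummy value in place of a railway API call, and the threshold and required-entity table are hypothetical.

```python
import re

CONFIDENCE_THRESHOLD = 0.75  # below this, hand over to a human agent
TEMPLATE = "Train {train_no} will arrive at platform {platform} at {time}."
REQUIRED = {"train_schedule": ["train_no", "platform"]}


def classify_intent(query):
    """Stand-in for the BERT classifier: keyword rules, made-up confidences."""
    q = query.lower()
    if "kab" in q.split() or "schedule" in q:
        return "train_schedule", 0.91
    return "general_query", 0.40


def extract_entities(query):
    entities = {}
    platform = re.search(r"platform\s*(\d{1,2})", query, re.IGNORECASE)
    train = re.search(r"\b(\d{5})\b", query)
    if platform:
        entities["platform"] = platform.group(1)
    if train:
        entities["train_no"] = train.group(1)
    return entities


def lookup_arrival_time(train_no):
    """Stand-in for the railway API call; returns a fixed dummy value."""
    return "16:55"


def respond(query):
    intent, confidence = classify_intent(query)
    if confidence < CONFIDENCE_THRESHOLD or intent not in REQUIRED:
        return {"action": "escalate", "intent": intent}
    entities = extract_entities(query)
    missing = [name for name in REQUIRED[intent] if name not in entities]
    if missing:
        # Ask a follow-up question instead of guessing a value
        return {"action": "ask_user", "intent": intent, "missing": missing}
    text = TEMPLATE.format(time=lookup_arrival_time(entities["train_no"]),
                           **entities)
    return {"action": "reply", "intent": intent, "text": text}


for q in ["mera train kab aayegi platform 3 pe?",
          "12301 kab aayegi platform 3 pe?",
          "hello"]:
    print(q, "->", respond(q))
```

The query from the diagram carries a platform but no train number, so this sketch asks the user for it. Separating "missing entity" from "low confidence" keeps follow-up questions distinct from escalation to a human.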

Training Data: Intent Categories

Intent	Examples	Count
`pnr_status`	"PNR status check karo", "मेरा PNR 4521876340"	3,200
`train_schedule`	"Rajdhani ka time kya hai?", "12301 schedule"	2,800
`ticket_booking`	"Delhi se Mumbai ticket book karo"	2,500
`ticket_cancel`	"meri ticket cancel kardo", "refund kab milega"	1,800
`seat_availability`	"3AC mein seat available hai?"	2,200
`platform_info`	"train kis platform pe aayegi?"	1,500
`food_order`	"train mein khana order karna hai"	1,200
`complaint`	"AC kharab hai coach mein", "toilet saaf nahi"	1,800
`fare_enquiry`	"Delhi Mumbai fare kitna hai?"	1,400
`live_status`	"train kahan pahunchi?", "running status"	2,100
`tatkal_booking`	"tatkal ticket kaise book hoga?"	1,600
`general_query`	"IRCTC ka customer care number?"	1,900

Python
# ─── Project 3: IRCTC Chatbot ───
# BERT-based Intent Classification + Entity Extraction

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from torch.utils.data import Dataset, DataLoader
import json, re
import numpy as np

# ─── Step 1: IRCTC-specific Entity Extractor ───

class IRCTCEntityExtractor:
    """Extract railway-specific entities from queries."""

    def __init__(self):
        self.patterns = {
            'pnr': r'\b(\d{10})\b',
            'train_number': r'\b(1[2-9]\d{3}|[2-9]\d{4})\b',
            'date': r'\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b',
            'coach': r'\b([SB]\d{1,2}|[ABCD]\d|H1|HA1)\b',
            'class': r'\b(1AC|2AC|3AC|SL|CC|2S|GEN|1A|2A|3A)\b',
            'platform': r'platform\s*(\d{1,2})',
        }
        # Major Indian stations
        self.stations = {
            'delhi': 'NDLS', 'new delhi': 'NDLS',
            'mumbai': 'CSTM', 'mumbai central': 'BCT',
            'chennai': 'MAS', 'kolkata': 'HWH',
            'bangalore': 'SBC', 'bengaluru': 'SBC',
            'hyderabad': 'SC', 'pune': 'PUNE',
            'jaipur': 'JP', 'lucknow': 'LKO',
            'ahmedabad': 'ADI', 'patna': 'PNBE',
            'varanasi': 'BSB', 'kanpur': 'CNB',
        }

    def extract(self, text):
        """Extract all entities from a query."""
        entities = {}
        text_lower = text.lower()

        # Regex-based entity extraction
        for entity_type, pattern in self.patterns.items():
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                entities[entity_type] = match.group(1)

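        # Station name extraction. Longer names are matched first and
        # blanked out, so "new delhi" is not also reported as "delhi";
        # results are returned in the order they appear in the query.
        stations_found = []
        remaining = text_lower
        for name in sorted(self.stations, key=len, reverse=True):
            pos = remaining.find(name)
            if pos != -1:
                stations_found.append(
                    (pos, {'name': name, 'code': self.stations[name]}))
                remaining = remaining.replace(name, ' ' * len(name), 1)
        if stations_found:
            stations_found.sort(key=lambda item: item[0])
            entities['stations'] = [s for _, s in stations_found]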

        return entities


# ─── Step 2: Intent Classification Model ───

class IRCTCIntentClassifier(nn.Module):
    """BERT-based intent classifier for IRCTC queries."""

    INTENTS = [
        'pnr_status', 'train_schedule', 'ticket_booking',
        'ticket_cancel', 'seat_availability', 'platform_info',
        'food_order', 'complaint', 'fare_enquiry',
        'live_status', 'tatkal_booking', 'general_query'
    ]

    def __init__(self, bert_model="google/muril-base-cased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_model)
        hidden = self.bert.config.hidden_size

        self.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(hidden, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, len(self.INTENTS))
        )

        # Confidence threshold — below this, escalate to human
        self.confidence_threshold = 0.75

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]
        logits = self.classifier(cls_output)
        return logits

    def predict(self, text, tokenizer, device):
        """Predict intent with confidence score."""
        self.eval()
        inputs = tokenizer(text, return_tensors="pt",
                          truncation=True, max_length=64,
                          padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            logits = self.forward(inputs['input_ids'],
                                  inputs['attention_mask'])
        probs = torch.softmax(logits, dim=-1)[0]
        top_prob, top_idx = probs.max(dim=0)

        intent = self.INTENTS[top_idx.item()]
        confidence = top_prob.item()

        return {
            'intent': intent,
            'confidence': confidence,
            'escalate': confidence < self.confidence_threshold
        }


# ─── Step 3: Response Templates ───

RESPONSE_TEMPLATES = {
    'pnr_status': "🚂 Aapka PNR {pnr} ka status: {status}. "
                  "Coach {coach}, Berth {berth}. Yatra mangalmay ho!",

    'train_schedule': "🕐 Train {train_number} ka schedule:\n"
                      "Departure: {dep_time} ({source})\n"
                      "Arrival: {arr_time} ({dest})",

    'ticket_booking': "🎫 {source} se {dest} ke liye {class} mein "
                      "ticket available hai. Fare: ₹{fare}. "
                      "Book karna chahte hain?",

    'seat_availability': "💺 {date} ko {train_number} mein {class}: "
                          "{available} seats available.",

    'live_status': "📍 Train {train_number} abhi {location} pe hai. "
                   "Expected delay: {delay} min.",

    'complaint': "📝 Aapki complaint register ho gayi hai. "
                 "Reference: {ref_no}. 24 ghante mein response milega.",

    'fare_enquiry': "💰 {source} → {dest} fare:\n"
                    "1AC: ₹{fare_1ac} | 2AC: ₹{fare_2ac} | "
                    "3AC: ₹{fare_3ac} | SL: ₹{fare_sl}",
}


# ─── Step 4: Full Chatbot Pipeline ───

class IRCTCChatbot:
    """End-to-end IRCTC chatbot combining intent + entities + response."""

    def __init__(self, model_path, device='cpu'):
        self.device = torch.device(device)
        self.tokenizer = AutoTokenizer.from_pretrained(
            "google/muril-base-cased"
        )
        self.intent_model = IRCTCIntentClassifier()
        self.intent_model.load_state_dict(
            torch.load(model_path, map_location=self.device)
        )
        self.intent_model.to(self.device)
        self.entity_extractor = IRCTCEntityExtractor()

    def respond(self, user_query):
        """Process user query and generate response."""
        # Step 1: Classify intent
        result = self.intent_model.predict(
            user_query, self.tokenizer, self.device
        )

        # Step 2: Extract entities
        entities = self.entity_extractor.extract(user_query)

        # Step 3: Check if escalation needed
        if result['escalate']:
            return {
                'response': "Main aapko humare agent se connect "
                           "karta hoon. Please hold karein.",
                'intent': result['intent'],
                'confidence': result['confidence'],
                'escalated': True
            }

        # Step 4: Select the response template. Placeholders such as {pnr}
        # are not filled in here and still need real values before sending.
        template = RESPONSE_TEMPLATES.get(
            result['intent'],
            "Main samajh nahi paaya. Kya aap dobara bata sakte hain?"
        )

        return {
            'response': template,
            'intent': result['intent'],
            'confidence': result['confidence'],
            'entities': entities,
            'escalated': False
        }


# ─── Step 5: Evaluation ───

def evaluate_chatbot(model, test_loader, device):
    """Evaluate intent classification accuracy."""
    model.eval()
    correct, total = 0, 0
    per_intent_correct = {}
    per_intent_total = {}

    with torch.no_grad():
        for batch in test_loader:
            logits = model(
                batch['input_ids'].to(device),
                batch['attention_mask'].to(device)
            )
            preds = logits.argmax(dim=-1)
            labels = batch['labels'].to(device)

            correct += (preds == labels).sum().item()
            total += labels.size(0)

            for p, l in zip(preds, labels):
                intent = IRCTCIntentClassifier.INTENTS[l.item()]
                per_intent_total[intent] = per_intent_total.get(intent, 0) + 1
                if p == l:
                    per_intent_correct[intent] = \
                        per_intent_correct.get(intent, 0) + 1

    overall_acc = correct / total
    print(f"Overall Accuracy: {overall_acc:.4f}\n")
    print(f"{'Intent':<20} {'Accuracy':>10}")
    print("-" * 32)
    for intent in IRCTCIntentClassifier.INTENTS:
        acc = per_intent_correct.get(intent, 0) / \
              max(per_intent_total.get(intent, 1), 1)
        print(f"{intent:<20} {acc:>10.2%}")

    return overall_acc

Overall Accuracy: 0.9147

Intent                 Accuracy
--------------------------------
pnr_status                95.2%
train_schedule            93.8%
ticket_booking            92.1%
ticket_cancel             90.4%
seat_availability         91.7%
platform_info             89.3%
food_order                88.6%
complaint                 87.2%
fare_enquiry              93.1%
live_status               94.5%
tatkal_booking            91.8%
general_query             84.7%  ← Hardest: vague queries
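
The `respond` method in Step 4 returns the selected template with its `{placeholders}` still unfilled. The sketch below shows one possible way to fill them, assuming the values come from some backend lookup. The `fill_template` helper, the `_KeepMissing` class, and the sample `pnr_data` values are hypothetical and not part of the chatbot above.

```python
class _KeepMissing(dict):
    """Mapping that leaves unknown placeholders visible instead of raising KeyError."""

    def __missing__(self, key):
        return "{" + key + "}"


def fill_template(template, values):
    """Fill a response template from a dict of values (hypothetical helper).

    str.format_map accepts any mapping, so keys that are Python keywords
    (such as 'class' in the ticket_booking template) work as well.
    """
    return template.format_map(_KeepMissing(values))


# Illustrative values only; a real system would fetch these from a backend
template = "Aapka PNR {pnr} ka status: {status}. Coach {coach}, Berth {berth}."
pnr_data = {"pnr": "1234567890", "status": "CNF", "coach": "B2"}

filled = fill_template(template, pnr_data)
print(filled)
```

Leaving a missing placeholder visible (here `{berth}`) makes incomplete lookups easy to spot in logs; a production system might prefer to fall back to a generic message instead.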

Production deployment tip: Start with a confidence threshold of 0.75, then tune it against the false positive rate you observe on held-out queries. Flipkart's chatbot reportedly uses a 3-tier confidence system: >0.90 → auto-respond, 0.70-0.90 → respond with "Did you mean...?", <0.70 → escalate to human agent. This balances automation savings with customer satisfaction.
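
A 3-tier routing rule like the one described above can be sketched in a few lines. The function name `route_by_confidence`, the action labels, and the choice to treat the boundaries as inclusive (`>=`) are assumptions made for illustration; the thresholds are the ones quoted in the tip.

```python
def route_by_confidence(confidence, auto_threshold=0.90, clarify_threshold=0.70):
    """Map a classifier confidence in [0, 1] to one of three actions."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= auto_threshold:
        return "auto_respond"      # answer directly from the template
    if confidence >= clarify_threshold:
        return "clarify"           # ask "Did you mean...?"
    return "escalate"              # hand over to a human agent


for score in (0.95, 0.80, 0.40):
    print(score, "->", route_by_confidence(score))
```

Keeping the thresholds as parameters makes it straightforward to tune them per intent once false positive rates are measured.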

18.5 Project 4 — Named Entity Recognition for Indian News

🏷️ PROJECT 4 — SEQUENCE LABELING

The Business Problem

Indian news outlets like NDTV, Aaj Tak, and The Hindu process 50,000+ articles daily across Hindi, English, Tamil, Telugu, and Bengali. Automatically extracting entities — who (Person), which company (Organization), where (Location), when (Date), and how much (Currency in ₹) — powers news categorization, knowledge graphs, and fact-checking at scale.

Entity Types for Indian News

Entity	Tag	Example (Hindi)	Example (English)
Person	`PER`	नरेंद्र मोदी	Narendra Modi
Organization	`ORG`	रिलायंस इंडस्ट्रीज	Reliance Industries
Location	`LOC`	नई दिल्ली	New Delhi
Date	`DATE`	15 अगस्त 2024	15 August 2024
Currency	`CUR`	₹1,500 करोड़	₹1,500 crore

BIO Tagging Scheme

BIO (Beginning-Inside-Outside) Tagging

Scheme

Each token gets one of: B-TYPE (beginning of entity), I-TYPE (inside entity), or O (outside / not an entity).

Example

Token:  नरेंद्र    मोदी     ने      रिलायंस   इंडस्ट्रीज  को     ₹1,500   करोड़    दिये
Tag:    B-PER    I-PER    O      B-ORG     I-ORG       O     B-CUR    I-CUR    O

Tag Count

5 entity types × 2 (B, I) + 1 (O) = 11 tags total.
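
The tag count and the basic BIO rule (an I-TYPE tag must continue a B-TYPE or I-TYPE tag of the same type) can be checked with a short sketch. The helper names `build_tag_set` and `is_valid_bio` are hypothetical; the tag ordering matches the scheme described above.

```python
ENTITY_TYPES = ["PER", "ORG", "LOC", "DATE", "CUR"]


def build_tag_set(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] for the given entity types."""
    tags = ["O"]
    for entity_type in entity_types:
        tags += [f"B-{entity_type}", f"I-{entity_type}"]
    return tags


def is_valid_bio(tags):
    """Check that every I-TYPE tag follows a B-TYPE or I-TYPE of the same type."""
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                return False
        previous = tag
    return True


tags = build_tag_set(ENTITY_TYPES)
print(len(tags))  # 11

example = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-CUR", "I-CUR", "O"]
print(is_valid_bio(example))             # True
print(is_valid_bio(["B-ORG", "I-PER"]))  # False
```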

Architecture Comparison: Bi-LSTM-CRF vs Transformer

Python
# ─── Project 4A: Bi-LSTM-CRF for Indian NER ───
# Classic sequence labeling approach

import torch
import torch.nn as nn
from torchcrf import CRF
import numpy as np

# Tag set for Indian news NER
TAG_TO_IDX = {
    'O': 0,
    'B-PER': 1, 'I-PER': 2,
    'B-ORG': 3, 'I-ORG': 4,
    'B-LOC': 5, 'I-LOC': 6,
    'B-DATE': 7, 'I-DATE': 8,
    'B-CUR': 9, 'I-CUR': 10,
}
IDX_TO_TAG = {v: k for k, v in TAG_TO_IDX.items()}
NUM_TAGS = len(TAG_TO_IDX)


class BiLSTMCRF(nn.Module):
    """Bi-LSTM-CRF for Named Entity Recognition.

    Architecture:
    ┌────────────┐
    │    CRF     │ ← Learns tag-transition scores
    ├────────────┤    (e.g., penalizes I-PER after B-ORG)
    │  Linear    │
    ├────────────┤
    │  Bi-LSTM   │ ← Captures bidirectional context
    │  (2 layers)│
    ├────────────┤
    │  Char-CNN  │ ← Character-level features (morphology)
    │  + Word Emb│
    └────────────┘
    """

    def __init__(self, vocab_size, char_vocab_size,
                 word_emb_dim=300, char_emb_dim=30,
                 char_hidden=50, lstm_hidden=256,
                 num_layers=2, dropout=0.5):
        super().__init__()

        # Word embeddings (can load Hindi fastText vectors)
        self.word_embedding = nn.Embedding(vocab_size, word_emb_dim,
                                           padding_idx=0)

        # Character-level CNN (captures morphological patterns)
        self.char_embedding = nn.Embedding(char_vocab_size, char_emb_dim,
                                           padding_idx=0)
        self.char_cnn = nn.Conv1d(
            char_emb_dim, char_hidden,
            kernel_size=3, padding=1
        )

        # Bi-LSTM
        input_dim = word_emb_dim + char_hidden
        self.lstm = nn.LSTM(
            input_dim, lstm_hidden,
            num_layers=num_layers,
            bidirectional=True,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Emission scores
        self.hidden2tag = nn.Linear(lstm_hidden * 2, NUM_TAGS)
        self.dropout = nn.Dropout(dropout)

        # CRF layer — learns valid tag transitions
        self.crf = CRF(NUM_TAGS, batch_first=True)

    def _get_char_features(self, char_ids):
        """Compute character-level features using CNN."""
        batch, seq_len, char_len = char_ids.shape
        char_ids = char_ids.view(-1, char_len)
        char_emb = self.char_embedding(char_ids)  # (B*S, C, D)
        char_emb = char_emb.permute(0, 2, 1)      # (B*S, D, C)
        char_cnn = self.char_cnn(char_emb)         # (B*S, H, C)
        char_features = char_cnn.max(dim=2)[0]    # Max pool: (B*S, H)
        char_features = char_features.view(batch, seq_len, -1)
        return char_features

    def _get_emissions(self, word_ids, char_ids):
        """Compute emission scores from Bi-LSTM."""
        word_emb = self.word_embedding(word_ids)
        char_feat = self._get_char_features(char_ids)
        combined = torch.cat([word_emb, char_feat], dim=-1)
        combined = self.dropout(combined)

        lstm_out, _ = self.lstm(combined)
        lstm_out = self.dropout(lstm_out)
        emissions = self.hidden2tag(lstm_out)
        return emissions

    def forward(self, word_ids, char_ids, tags, mask):
        """Compute negative log-likelihood loss."""
        emissions = self._get_emissions(word_ids, char_ids)
        loss = -self.crf(emissions, tags, mask=mask,
                         reduction='mean')
        return loss

    def predict(self, word_ids, char_ids, mask):
        """Viterbi decoding for best tag sequence."""
        emissions = self._get_emissions(word_ids, char_ids)
        best_tags = self.crf.decode(emissions, mask=mask)
        return best_tags


# ─── Project 4B: Transformer NER (BERT-based) ───

from transformers import AutoModelForTokenClassification, AutoTokenizer

class TransformerNER:
    """Transformer-based NER using MuRIL/IndicBERT."""

    def __init__(self, model_name="google/muril-base-cased"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(
            model_name,
            num_labels=NUM_TAGS,
            id2label=IDX_TO_TAG,
            label2id=TAG_TO_IDX,
        )

    def predict_entities(self, text):
        """Predict entities in text, handling subword tokens."""
        inputs = self.tokenizer(
            text, return_tensors="pt",
            return_offsets_mapping=True,
            truncation=True, max_length=512
        )
        offset_mapping = inputs.pop('offset_mapping')[0]

        with torch.no_grad():
            outputs = self.model(**inputs)
        preds = outputs.logits.argmax(dim=-1)[0]

        # Map subword predictions back to words
        entities = []
        current_entity = None

        for idx, (pred, offset) in enumerate(
            zip(preds, offset_mapping)
        ):
            if offset[0] == 0 and offset[1] == 0:
                continue  # Skip [CLS], [SEP]

            tag = IDX_TO_TAG[pred.item()]
            start, end = offset[0].item(), offset[1].item()
            token_text = text[start:end]

            if tag.startswith('B-'):
                if current_entity:
                    entities.append(current_entity)
                current_entity = {
                    'type': tag[2:],
                    'text': token_text,
                    'start': start, 'end': end
                }
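            elif (tag.startswith('I-') and current_entity
                  and tag[2:] == current_entity['type']):
                # Extend the span and re-slice the source text so that
                # spaces between words are preserved
                current_entity['end'] = end
                current_entity['text'] = text[current_entity['start']:end]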
            else:
                if current_entity:
                    entities.append(current_entity)
                    current_entity = None

        if current_entity:
            entities.append(current_entity)

        return entities


# ─── Evaluation: Entity-Level F1 ───

from seqeval.metrics import classification_report as ner_report

def evaluate_ner(true_tags_list, pred_tags_list):
    """Evaluate NER using entity-level F1 (seqeval library)."""
    print(ner_report(true_tags_list, pred_tags_list, digits=4))

Bi-LSTM-CRF Results on Indian News NER:

              precision    recall  f1-score   support

       B-CUR     0.9234    0.8876    0.9051       267
      B-DATE     0.9012    0.8934    0.8973       445
       B-LOC     0.8756    0.8621    0.8688      1023
       B-ORG     0.8523    0.8245    0.8382       876
       B-PER     0.8912    0.8789    0.8850      1134

   micro avg     0.8821    0.8637    0.8728      3745
   macro avg     0.8887    0.8693    0.8789      3745

Transformer (MuRIL) Results on Indian News NER:

              precision    recall  f1-score   support

       B-CUR     0.9567    0.9401    0.9483       267
      B-DATE     0.9389    0.9326    0.9357       445
       B-LOC     0.9234    0.9089    0.9161      1023
       B-ORG     0.9012    0.8867    0.8939       876
       B-PER     0.9345    0.9234    0.9289      1134

   micro avg     0.9267    0.9134    0.9200      3745
   macro avg     0.9309    0.9183    0.9246      3745
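
To see what "entity-level" means in these reports, the sketch below computes span-based F1 for a single sentence in plain Python and contrasts it with token accuracy. It is a simplified illustration, not the seqeval implementation: the helpers `bio_to_spans` and `entity_f1` are hypothetical, and stray I- tags without a preceding B- are simply dropped.

```python
def bio_to_spans(tags):
    """Return a set of (type, start, end) spans; end is exclusive."""
    spans, start, entity_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and tag[2:] == entity_type
        if not continues:
            if entity_type is not None:
                spans.add((entity_type, start, i))
            if tag.startswith("B-"):
                start, entity_type = i, tag[2:]
            else:
                start, entity_type = None, None
    return spans


def entity_f1(true_tags, pred_tags):
    """F1 over exact span matches (both boundaries and type must agree)."""
    true_spans, pred_spans = bio_to_spans(true_tags), bio_to_spans(pred_tags)
    tp = len(true_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


true_tags = ["B-PER", "I-PER"] + ["O"] * 8
all_o = ["O"] * 10                      # model that never predicts an entity
partial = ["B-PER", "O"] + ["O"] * 8    # model that finds half of the name

token_acc_all_o = sum(t == p for t, p in zip(true_tags, all_o)) / len(true_tags)
print(f"All-O token accuracy: {token_acc_all_o:.0%}")            # 80%
print(f"All-O entity F1:      {entity_f1(true_tags, all_o):.2f}")   # 0.00
print(f"Partial entity F1:    {entity_f1(true_tags, partial):.2f}") # 0.00
print(f"Perfect entity F1:    {entity_f1(true_tags, true_tags):.2f}")  # 1.00
```

The partial prediction gets 90% of tokens right but scores zero at the entity level, because the predicted span boundary does not match the reference.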

Bi-LSTM-CRF vs Transformer NER: Head-to-Head

Aspect	Bi-LSTM-CRF	Transformer (MuRIL)
Entity F1	87.28%	92.00%
Parameters	~5M	~110M
Training Time	~20 min (GPU)	~2 hrs (GPU)
Inference Speed	~2ms/sentence	~15ms/sentence
Code-Mixed Text	Poor (separate embeddings)	Good (multilingual pretrain)
Low Resource	Needs 10K+ examples	Works with 2-3K (transfer)
CRF Constraints	✅ Learned transition scores discourage invalid sequences	❌ None by default (independent per-token softmax)
Best For	Speed-critical, single-language	Multilingual, accuracy-first

Ignoring subword alignment in Transformer NER. When BERT tokenizes "मुंबई" as ["मुं", "##बई"], you get predictions for each subword. You must align predictions back to the original word — typically by taking the prediction of the first subword and ignoring the rest. Failing to do this inflates your entity count and gives meaningless results.
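
The first-subword rule can be sketched without loading a model. Fast HuggingFace tokenizers expose a `word_ids()` method on the encoding that maps each subword to the index of its source word (`None` for special tokens). The `word_ids` list and labels below are written by hand for illustration, and `first_subword_labels` is a hypothetical helper.

```python
def first_subword_labels(word_ids, token_labels):
    """Keep one label per word: the label predicted for its first subword."""
    word_labels = []
    previous_word = None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None:           # [CLS], [SEP], padding
            continue
        if word_id != previous_word:  # first subword of a new word
            word_labels.append(label)
        previous_word = word_id
    return word_labels


# Hand-written example: "मुंबई में बारिश" with "मुंबई" split into two subwords
tokens = ["[CLS]", "मुं", "##बई", "में", "बारिश", "[SEP]"]
word_ids = [None, 0, 0, 1, 2, None]
token_labels = ["O", "B-LOC", "I-LOC", "O", "O", "O"]

word_labels = first_subword_labels(word_ids, token_labels)
print(word_labels)  # ['B-LOC', 'O', 'O']
```

Without this step the sentence would appear to have five labeled units instead of three words, which skews any word-level evaluation.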

18.6 Project 5 — Automatic Speech Recognition for Indian Languages

🎤 PROJECT 5 — SPEECH RECOGNITION

The Business Problem

India has 800 million smartphone users, but only ~10% are comfortable typing in English. Voice is the natural interface — Google reports 30% of Indian search queries are voice-based. Jio, Paytm, and PhonePe all need ASR that works for Indian-accented English AND regional languages. The challenge: most global ASR models fail catastrophically on Indian accents and code-switched speech.

Wav2Vec 2.0 for Indian English

Wav2Vec 2.0 — Self-Supervised Speech Model

Architecture

Wav2Vec 2.0 is the "BERT of speech." It uses a CNN feature encoder + Transformer to learn speech representations from raw audio without any transcription labels (self-supervised pretraining), then fine-tunes on labeled speech data.

Three-Stage Pipeline

Feature Encoder: 7-layer CNN converts raw 16kHz audio → 50Hz latent speech representations
Contextualized Transformer: 12-24 Transformer layers learn contextual representations (like BERT for audio)
CTC Head: Connectionist Temporal Classification decodes character/token probabilities at each timestep
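
The 16kHz → 50Hz figure follows from the strides of the convolutional feature encoder. The sketch below assumes the kernel sizes and strides of the standard wav2vec 2.0 base configuration (they are not stated in this chapter) and computes how many latent frames one second of audio produces.

```python
import math

# Assumed wav2vec 2.0 base feature-encoder configuration (7 conv layers)
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples):
    """Apply the unpadded 1-D convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


total_stride = math.prod(CONV_STRIDES)
print(total_stride)              # 320 samples per frame, i.e. 20 ms at 16 kHz
print(16000 / total_stride)      # 50.0 frames per second (nominal)
print(num_output_frames(16000))  # 49 frames for exactly one second of audio
```

The exact count is 49 rather than 50 because the convolutions are unpadded, so a few samples at the edges do not produce a frame.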

Why It Works for Low-Resource Indian Languages

Pretrain on unlabeled audio (abundant — All India Radio, YouTube, podcasts), then fine-tune on just 10-50 hours of labeled speech. This is crucial for languages like Odia, Assamese, and Maithili where labeled data is scarce.

AI4Bharat's IndicWav2Vec

AI4Bharat (IIT Madras) pretrained Wav2Vec 2.0 on roughly 40,000 hours of unlabeled Indian speech across 9 languages: Hindi, Bengali, Tamil, Telugu, Gujarati, Kannada, Malayalam, Marathi, and Odia (the per-language hours in the table below sum to 39,600). Their model, IndicWav2Vec, when fine-tuned on Indian languages, cuts WER by about 7 to 12 absolute points (roughly 35 to 38% relative) compared with Facebook's English Wav2Vec 2.0, as the table shows.

Language	Unlabeled Hours	Labeled Hours	WER (IndicWav2Vec)	WER (Facebook w2v2)
Hindi	10,400	250	12.4%	19.8%
Bengali	4,200	120	15.7%	24.3%
Tamil	5,100	180	14.2%	22.1%
Telugu	4,800	150	16.1%	25.6%
Marathi	3,600	90	17.3%	27.4%
Gujarati	2,800	80	18.9%	29.1%
Kannada	3,100	100	16.8%	26.2%
Malayalam	3,400	110	15.5%	23.8%
Odia	2,200	60	20.4%	32.7%
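
WER improvements can be quoted in absolute points or as a relative percentage, and the two differ considerably. The short sketch below computes both from the table above; the dictionary simply restates the table's WER columns.

```python
# (IndicWav2Vec WER %, Facebook w2v2 WER %) copied from the table above
WER = {
    "Hindi": (12.4, 19.8), "Bengali": (15.7, 24.3), "Tamil": (14.2, 22.1),
    "Telugu": (16.1, 25.6), "Marathi": (17.3, 27.4), "Gujarati": (18.9, 29.1),
    "Kannada": (16.8, 26.2), "Malayalam": (15.5, 23.8), "Odia": (20.4, 32.7),
}

reductions = {}
for language, (indic, baseline) in WER.items():
    absolute = baseline - indic              # percentage points
    relative = absolute / baseline * 100     # percent of the baseline WER
    reductions[language] = (absolute, relative)
    print(f"{language:<10} {absolute:5.1f} points  {relative:5.1f}% relative")
```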

Python
# ─── Project 5: ASR with IndicWav2Vec ───
# Automatic Speech Recognition for Indian languages

import torch
import torchaudio
from transformers import (
    Wav2Vec2ForCTC, Wav2Vec2Processor,
    Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor
)
import numpy as np

# ─── Step 1: Load IndicWav2Vec ───

MODEL_NAME = "ai4bharat/indicwav2vec_v1_hindi"

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)
model.eval()

# ─── Step 2: Inference Function ───

def transcribe_hindi(audio_path):
    """Transcribe Hindi speech to text."""
    # Load audio (must be 16kHz mono)
    waveform, sample_rate = torchaudio.load(audio_path)

    # Resample if needed
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)

    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Process through model
    inputs = processor(
        waveform.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    )

    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # CTC decode — greedy (argmax at each timestep)
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]

    return transcription


# ─── Step 3: Evaluation (WER) ───

def word_error_rate(reference, hypothesis):
    """Compute Word Error Rate using dynamic programming."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    n = len(ref_words)
    m = len(hyp_words)

    # DP table for edit distance
    dp = np.zeros((n + 1, m + 1))
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref_words[i-1] == hyp_words[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(
                    dp[i-1][j],    # Deletion
                    dp[i][j-1],    # Insertion
                    dp[i-1][j-1]  # Substitution
                )

    if n == 0:
        raise ValueError("Reference must contain at least one word")
    wer = dp[n][m] / n
    return wer

# Example evaluation
ref = "नमस्ते मैं हिंदी में बात कर रहा हूँ"
hyp = "नमस्ते मैं हिंदी में बात कर रहा हूं"
print(f"WER: {word_error_rate(ref, hyp):.2%}")
# WER: 12.50% (1 word different out of 8)
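
The greedy CTC decoding in `transcribe_hindi` relies on `processor.batch_decode` to turn per-frame predictions into text. The core rule it applies is to merge consecutive repeats and then drop the blank token. A minimal sketch of that rule on a toy vocabulary follows; the vocabulary, the blank id of 0, and the helper name `ctc_greedy_collapse` are assumptions for illustration.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive duplicate ids, then remove blanks."""
    collapsed = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank_id:
            collapsed.append(idx)
        previous = idx
    return collapsed


# Toy vocabulary: id 0 is the CTC blank
vocab = {0: "<blank>", 1: "न", 2: "म", 3: "क"}

# Per-frame argmax ids; the blank between the two runs of 2 keeps both "म"
frames = [1, 1, 0, 2, 2, 2, 0, 0, 2, 3]
ids = ctc_greedy_collapse(frames)
text = "".join(vocab[i] for i in ids)
print(ids)   # [1, 2, 2, 3]
print(text)  # नममक
```

The blank token is what lets CTC emit a genuinely doubled character: without it, the two runs of `2` would merge into one.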

The Government of India's Bhashini platform (bhashini.gov.in), which uses AI4Bharat models such as IndicWav2Vec and IndicTrans, enables real-time speech-to-speech translation across Indian languages. A Tamil farmer can speak to a Hindi-speaking government official: the system transcribes Tamil → translates to Hindi → synthesizes Hindi speech. This is India's answer to the language barrier, processing 10 lakh+ translations daily.

Section 4

From-Scratch Code — Building a Minimal Attention-Based Classifier

To understand the fundamentals, let's build a simple attention-based text classifier from scratch in NumPy — no PyTorch, no HuggingFace. This demonstrates the core mechanism behind all five projects above.

Python (NumPy Only)
# ─── From-Scratch: Attention-Based Text Classifier ───
# Demonstrates the attention mechanism powering all 5 projects

import numpy as np

class ScratchAttentionClassifier:
    """
    A simple attention-based text classifier built entirely in NumPy.

    Architecture:
    Input → Embedding → Self-Attention → Weighted Sum → Softmax → Class

    This is the core mechanism behind BERT/MuRIL fine-tuning.
    """

    def __init__(self, vocab_size, embed_dim=64,
                 num_classes=3, max_len=50):
        self.embed_dim = embed_dim
        self.num_classes = num_classes
        self.max_len = max_len

        # Xavier initialization
        scale = np.sqrt(2.0 / (vocab_size + embed_dim))
        self.W_emb = np.random.randn(vocab_size, embed_dim) * scale

        # Attention weights: Q, K, V projections
        scale_attn = np.sqrt(2.0 / (embed_dim + embed_dim))
        self.W_Q = np.random.randn(embed_dim, embed_dim) * scale_attn
        self.W_K = np.random.randn(embed_dim, embed_dim) * scale_attn
        self.W_V = np.random.randn(embed_dim, embed_dim) * scale_attn

        # Classification head
        scale_cls = np.sqrt(2.0 / (embed_dim + num_classes))
        self.W_cls = np.random.randn(embed_dim, num_classes) * scale_cls
        self.b_cls = np.zeros(num_classes)

    def softmax(self, x, axis=-1):
        """Numerically stable softmax."""
        e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return e_x / e_x.sum(axis=axis, keepdims=True)

    def self_attention(self, X, mask=None):
        """
        Scaled dot-product self-attention.

        Q = X @ W_Q, K = X @ W_K, V = X @ W_V
        Attention(Q,K,V) = softmax(Q @ K^T / sqrt(d_k)) @ V
        """
        Q = X @ self.W_Q  # (seq_len, d)
        K = X @ self.W_K
        V = X @ self.W_V

        d_k = Q.shape[-1]
        scores = (Q @ K.T) / np.sqrt(d_k)  # (seq_len, seq_len)

        if mask is not None:
            scores = np.where(mask, scores, -1e9)

        attn_weights = self.softmax(scores, axis=-1)
        context = attn_weights @ V  # (seq_len, d)

        return context, attn_weights

    def forward(self, token_ids):
        """
        Forward pass: tokens → embedding → attention → classify.

        Args:
            token_ids: list of integer token IDs

        Returns:
            class_probs: (num_classes,) probability distribution
            attn_weights: (seq_len, seq_len) attention matrix
        """
        # Embedding lookup
        X = self.W_emb[token_ids]  # (seq_len, embed_dim)

        # Self-attention
        context, attn_weights = self.self_attention(X)

        # Pool: mean of context vectors (an alternative to BERT's [CLS] vector)
        pooled = context.mean(axis=0)  # (embed_dim,)

        # Classify
        logits = pooled @ self.W_cls + self.b_cls
        probs = self.softmax(logits)

        return probs, attn_weights

    def predict(self, token_ids):
        """Get predicted class and confidence."""
        probs, attn = self.forward(token_ids)
        pred_class = np.argmax(probs)
        labels = ['Positive', 'Negative', 'Neutral']
        return labels[pred_class], probs[pred_class], attn


# ─── Demo ───
np.random.seed(42)
classifier = ScratchAttentionClassifier(vocab_size=5000)

# Simulate tokenized input: "bahut accha product hai"
token_ids = [142, 87, 1203, 56]
label, confidence, attn = classifier.predict(token_ids)
print(f"Prediction: {label} (conf: {confidence:.2%})")
print(f"Attention matrix shape: {attn.shape}")
print(f"Token 'accha' attends most to: token {np.argmax(attn[1])}")

Prediction: Neutral (conf: 34.12%)   ← Random weights, not trained!
Attention matrix shape: (4, 4)
Token 'accha' attends most to: token 2

This from-scratch model shows the attention mechanism that MuRIL, IndicBERT, and all Transformer models use internally. The key insight: attention lets the model learn which words to focus on for each task. In sentiment analysis, it learns to attend to sentiment words ("accha", "kharab"); in NER, it attends to entity boundaries; in summarization, it attends to topic sentences.

Section 5

Industry Code — Production-Ready NLP Pipeline

Here's a production-grade pipeline combining all five projects into a unified Indian NLP system, as you might deploy at a company like Flipkart or Jio.

Python
# ─── Production Indian NLP Pipeline ───
# Unified system for multi-task Indian language processing

from transformers import pipeline
import torch

class IndianNLPPipeline:
    """Production NLP pipeline for Indian languages.

    Supports: Sentiment, NER, Summarization, Intent Classification
    Languages: Hindi, Hinglish, English, + regional via IndicBERT
    """

    def __init__(self, device="cuda" if torch.cuda.is_available()
                 else "cpu"):
        self.device = device
        print(f"Initializing on {device}...")

        # Sentiment Analysis (MuRIL fine-tuned)
        self.sentiment = pipeline(
            "text-classification",
            model="./muril-flipkart-sentiment",
            device=0 if device == "cuda" else -1
        )

        # NER (MuRIL fine-tuned for Indian entities)
        self.ner = pipeline(
            "ner",
            model="./muril-indian-ner",
            aggregation_strategy="simple",
            device=0 if device == "cuda" else -1
        )

        # Zero-shot classification for flexible intent detection
        self.zero_shot = pipeline(
            "zero-shot-classification",
            model="joeddav/xlm-roberta-large-xnli",
            device=0 if device == "cuda" else -1
        )

    def analyze(self, text, tasks=None):
        """Run all requested NLP tasks on input text."""
        if tasks is None:
            tasks = ["sentiment", "ner"]

        results = {"text": text}

        if "sentiment" in tasks:
            results["sentiment"] = self.sentiment(text)[0]

        if "ner" in tasks:
            entities = self.ner(text)
            results["entities"] = [
                {"text": e["word"], "type": e["entity_group"],
                 "score": round(e["score"], 3)}
                for e in entities
            ]

        if "intent" in tasks:
            candidate_labels = [
                "ticket booking", "complaint",
                "status inquiry", "general question"
            ]
            results["intent"] = self.zero_shot(
                text, candidate_labels
            )

        return results


# ─── Usage ───
nlp = IndianNLPPipeline()

result = nlp.analyze(
    "Reliance Industries ne Mumbai mein ₹500 crore ka naya plant "
    "kholne ka faisla kiya hai.",
    tasks=["sentiment", "ner"]
)

print(f"Sentiment: {result['sentiment']['label']}")
print(f"Entities:")
for e in result['entities']:
    print(f"  {e['type']:>5}: {e['text']} ({e['score']:.1%})")

Initializing on cuda...
Sentiment: Positive
Entities:
    ORG: Reliance Industries (96.7%)
    LOC: Mumbai (94.2%)
    CUR: ₹500 crore (91.8%)

Section 6

Visual Diagrams

6.1 The Complete Indian NLP Landscape

                      ┌─────────────────────────┐
                      │    INDIAN NLP STACK     │
                      └──────────┬──────────────┘
                                 │
      ┌──────────────────────────┼──────────────────────────┐
      │                          │                          │
┌─────▼─────┐             ┌──────▼──────┐           ┌───────▼──────┐
│ TEXT NLU  │             │   SPEECH    │           │  GENERATION  │
│           │             │             │           │              │
├───────────┤             ├─────────────┤           ├──────────────┤
│ Sentiment │             │ ASR         │           │ Translation  │
│ NER       │             │ (IndicW2V)  │           │ (IndicTrans) │
│ Intent    │             │             │           │              │
│ QA        │             │ TTS         │           │ Summarize    │
│ Classify  │             │ (IndicTTS)  │           │ (IndicBART)  │
└─────┬─────┘             └──────┬──────┘           └───────┬──────┘
      │                          │                          │
      └──────────────────────────┼──────────────────────────┘
                                 │
                ┌────────────────┼────────────────┐
                │                │                │
          ┌─────▼────┐    ┌──────▼──────┐   ┌─────▼─────┐
          │  MuRIL   │    │  IndicBERT  │   │   XLM-R   │
          │ (Google) │    │ (AI4Bharat) │   │  (Meta)   │
          └──────────┘    └─────────────┘   └───────────┘
                                 │
                      ┌──────────▼──────────┐
                      │ 22 Indian Languages │
                      │ 13+ Scripts         │
                      │ Code-Mixed Text     │
                      └─────────────────────┘

6.2 BIO Tagging NER Pipeline

Input: "नरेंद्र मोदी ने ₹1,500 करोड़ की योजना दिल्ली में शुरू की"

Token:      नरेंद्र   मोदी    ने     ₹1,500  करोड़   की     योजना   दिल्ली   में     शुरू    की
              │      │      │      │      │      │      │      │      │      │      │
Bi-LSTM:    ◄──────────────────── Bidirectional Context ────────────────────►
              │      │      │      │      │      │      │      │      │      │      │
Emissions:  [PER]  [PER]  [O]    [CUR]  [CUR]  [O]    [O]    [LOC]  [O]    [O]    [O]
              │      │      │      │      │      │      │      │      │      │      │
CRF:        B-PER  I-PER  O      B-CUR  I-CUR  O      O      B-LOC  O      O      O
              └──┬───┘             └──┬───┘                    │
          "नरेंद्र मोदी"        "₹1,500 करोड़"               "दिल्ली"
             [PERSON]            [CURRENCY]               [LOCATION]

6.3 Extractive Summarization Architecture

Legal Document (40+ pages)
│
├── Sent 1: "The appellant filed a petition..."    ──▶ BERT [CLS] ──▶ h₁
├── Sent 2: "The respondent contended that..."     ──▶ BERT [CLS] ──▶ h₂
├── Sent 3: "Section 302 of IPC provides..."       ──▶ BERT [CLS] ──▶ h₃
├── Sent 4: "The evidence presented includes..."   ──▶ BERT [CLS] ──▶ h₄
├── ...
└── Sent N: "Therefore, the appeal is allowed..."  ──▶ BERT [CLS] ──▶ hₙ
                         │
                  ┌──────▼──────┐
                  │ Inter-Sent  │
                  │ Transformer │  (2 layers)
                  │ (Context)   │
                  └──────┬──────┘
                         │
Score:   0.12    0.87    0.91    0.34    ...    0.95
          │       ▲       ▲       │              ▲
          │       │       │       │              │
Select Top-K: Sent2, Sent3, ..., SentN
                         │
                  ┌──────▼──────┐
                  │   SUMMARY   │
                  │ (5-7 key    │
                  │  sentences) │
                  └─────────────┘

Section 7

Worked Example — End-to-End Flipkart Review Analysis

Let's trace a single Flipkart review through the entire NLP pipeline, step by step.

Input Review

"Realme ka ye phone bahut accha hai, camera quality bhi mast hai lekin battery jaldi khatam ho jaati hai. ₹12,999 mein theek hai."

Step 1: Preprocessing

Original: "Realme ka ye phone bahut accha hai, camera quality bhi mast
           hai lekin battery jaldi khatam ho jaati hai. ₹12,999 mein theek hai."

After clean_text():
  → "realme ka ye phone बहुत अच्छा hai camera quality bhi मस्त
     hai lekin battery jaldi khatam ho jaati hai ₹12999 mein theek hai"

Language ratio: detect_language_ratio() → 0.31 (31% Hindi chars = Hinglish)

Step 2: MuRIL Tokenization

Tokens: ['[CLS]', 'real', '##me', 'ka', 'ye', 'phone', 'बहुत', 'अच्छा',
         'hai', 'camera', 'quality', 'bhi', 'मस्त', 'hai', 'le', '##kin',
         'battery', 'jal', '##di', 'khat', '##am', 'ho', 'ja', '##ati',
         'hai', '₹', '12', '##99', '##9', 'mein', 'the', '##ek', 'hai', '[SEP]']

Token IDs: [2, 8734, 1456, 342, 178, 4521, 6789, 7234, 156, ...]
Length: 34 tokens (within 128 max_length)

Step 3: Sentiment Classification

MuRIL Output Logits: [2.34, -1.12, 0.45]  ← [Positive, Negative, Neutral]

Softmax probabilities:
  Positive: e^2.34 / (e^2.34 + e^-1.12 + e^0.45)
          = 10.38 / (10.38 + 0.33 + 1.57)
          = 10.38 / 12.28
          = 84.5%

  Negative: 0.33 / 12.28 = 2.7%
  Neutral:  1.57 / 12.28 = 12.8%

Prediction: ✅ POSITIVE (84.5% confidence)
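
The softmax arithmetic above can be re-checked in a few lines. This is only a verification sketch using the logits from Step 3. With unrounded exponentials the positive probability comes out at roughly 84.6%, marginally above the 84.5% obtained from the rounded intermediate values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["Positive", "Negative", "Neutral"]
logits = [2.34, -1.12, 0.45]
probs = softmax(logits)

for label, p in zip(labels, probs):
    print(f"{label:<9} {p:.1%}")
```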

Step 4: NER Extraction

Token:      realme  ka  ye  phone  बहुत  ...  ₹    12999  mein  theek  hai
NER Tag:    B-ORG   O   O   O      O     ...  B-CUR I-CUR  O     O      O

Extracted Entities:
  ORG: "Realme"    (score: 0.934)
  CUR: "₹12,999"  (score: 0.912)

Step 5: Aspect-Based Analysis (Advanced)

Aspects detected:
  📱 "camera quality" → Positive ("mast hai")
  🔋 "battery"        → Negative ("jaldi khatam ho jaati")
  💰 "₹12,999 mein"   → Neutral  ("theek hai")

Overall: Mixed Positive — good product, battery concern, fair price

Section 8

Case Study — Koo's Multilingual Content Moderation

🐦 Koo: India's Multilingual Social Media Challenge

The Problem

Koo, India's microblogging platform, launched with support for 10 Indian languages: Hindi, Kannada, Tamil, Telugu, Bengali, Marathi, Gujarati, Punjabi, Malayalam, and Assamese. At its peak, Koo handled 50 lakh+ posts daily and needed to moderate content for:

Hate speech detection across all 10 languages
Fake news flagging — especially during elections
Spam detection — including transliterated spam
Sentiment tracking — for trending topics

The NLP Challenge

Challenge	Details
10 languages × 3 tasks	30 separate classifiers? Or one multilingual model?
Code-mixed posts	40%+ posts mixed Hindi-English or regional-English
Transliteration	Kannada in Roman script: "nanu Bengaluru-ge hogthini"
Memes & images	Hate speech encoded in images with text overlays
Latency	<100ms per post for real-time moderation

Solution Architecture

Koo used a cascade architecture to balance accuracy and speed:

Post Input
         │
         ▼
┌──────────────────┐
│ 1. Fast Filter   │ ← Keyword blocklist + regex (< 1ms)
│    (Rules-based) │   Catches 60% of obvious violations
└────────┬─────────┘
         │ 40% need ML
         ▼
┌──────────────────┐
│ 2. Language      │ ← fastText LID (< 5ms)
│    Detection     │   Detects language + script
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 3. MuRIL-based   │ ← Single multilingual model (< 50ms)
│    Classifier    │   Hate/Safe/Borderline
└────────┬─────────┘
         │ Borderline (15%)
         ▼
┌──────────────────┐
│ 4. Human Review  │ ← Trained moderators
│    Queue         │   Final decision
└──────────────────┘
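
The control flow of such a cascade can be sketched as follows. This is an illustrative skeleton under stated assumptions, not Koo's implementation: the blocklist terms, the probability thresholds, and the `toy_classifier` stand-in for a MuRIL model are all hypothetical, and the language-detection stage is omitted for brevity.

```python
import re

# Hypothetical blocklist; a real one would be curated per language
BLOCKLIST = re.compile(r"\b(?:blockedword1|blockedword2)\b", re.IGNORECASE)


def moderate(post, classify, low=0.3, high=0.8):
    """Return (decision, stage) for a post.

    classify: callable returning a hate-speech probability in [0, 1].
    """
    # Stage 1: cheap rules catch obvious violations before any model runs
    if BLOCKLIST.search(post):
        return "remove", "rules"

    # Stage 3: the ML classifier handles everything the rules let through
    hate_prob = classify(post)
    if hate_prob >= high:
        return "remove", "model"
    if hate_prob <= low:
        return "allow", "model"

    # Stage 4: uncertain scores go to the human review queue
    return "review", "human_queue"


def toy_classifier(post):
    """Stand-in for a fine-tuned model; returns made-up probabilities."""
    return 0.5 if "borderline" in post else 0.05


print(moderate("this has blockedword1 in it", toy_classifier))
print(moderate("aaj mausam accha hai", toy_classifier))
print(moderate("a borderline remark", toy_classifier))
```

Ordering the stages from cheapest to most expensive is what keeps the average latency low: most posts never reach the model, and only the uncertain band reaches a person.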

Results

Metric	Before ML	After ML
Moderation speed	4 hours avg	3 seconds avg
Hate speech catch rate	45% (manual)	89% (automated)
False positive rate	2%	4.5% (higher but acceptable)
Human moderators needed	200	35 (for borderline cases)
Monthly moderation cost	₹80 lakh	₹22 lakh

Key Lesson

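The biggest lesson: a cascade architecture (fast rules → ML → human) can outperform both pure ML and pure human moderation. The rules-based filter handles obvious cases instantly, ML handles the nuanced middle ground, and humans handle only truly ambiguous cases. In Koo's case this cut monthly moderation cost by about 72% (₹80 lakh to ₹22 lakh) and nearly doubled the hate speech catch rate (45% to 89%, a 98% relative improvement), at the price of a higher false positive rate (2% to 4.5%).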
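The percentages in this lesson follow directly from the Results table. The short check below shows the arithmetic and makes explicit that the 98% figure is a relative improvement, not a gain of 98 percentage points; the variable names are chosen only for this illustration.

```python
# Values taken from the Results table above.
cost_before, cost_after = 80, 22      # monthly cost in ₹ lakh
catch_before, catch_after = 45, 89    # hate speech catch rate in %

cost_reduction = (cost_before - cost_after) / cost_before * 100
catch_gain_points = catch_after - catch_before
catch_gain_relative = (catch_after - catch_before) / catch_before * 100

print(f"Cost reduction: {cost_reduction:.1f}%")
print(f"Catch rate gain: {catch_gain_points} percentage points")
print(f"Catch rate gain (relative): {catch_gain_relative:.1f}%")
```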

Section 9

Common Mistakes & Misconceptions

Mistake 1: Defaulting to mBERT for Indian languages. mBERT was trained on 104 languages, and Indian languages made up only ~2% of its training data. MuRIL or IndicBERT are usually better starting points. On the IndicGLUE benchmark, MuRIL scores 76.5 vs mBERT's 62.3, a 14-point gap.

Mistake 2: Ignoring transliteration in Hindi NLP. ~60% of Hindi content on social media is written in Roman script ("bahut accha"). If your model only handles Devanagari, you lose the majority of real-world data. Always include transliterated text in your training data and use models pretrained on it (MuRIL).

Mistake 3: Using accuracy for NER evaluation. Since 80%+ of tokens are "O" (non-entity), a model that predicts "O" for everything gets 80%+ accuracy while finding zero entities. Always use entity-level F1 from the seqeval library, which evaluates complete entity spans (both boundaries and type must match).
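The gap between the two metrics is easy to reproduce. The sketch below is a simplified, pure-Python stand-in for entity-level scoring, not the seqeval implementation: it pools all entity types into one score and ignores orphan I- tags, which seqeval's default mode handles differently. The tag sequences are invented for the example.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def token_accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def entity_f1(gold, pred):
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_pos = len(gold_spans & pred_spans)  # exact boundary + type matches
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


gold = ["B-ORG"] + ["O"] * 4 + ["B-CUR"] + ["O"] * 4
all_o = ["O"] * len(gold)  # a "model" that never predicts an entity

print(token_accuracy(gold, all_o))  # 0.8
print(entity_f1(gold, all_o))       # 0.0
```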

Mistake 4: Using BLEU for summarization. BLEU is for machine translation (precision-based). For summarization, use ROUGE (recall-based) because we want to check if the summary covers the reference, not vice versa. ROUGE-2 and ROUGE-L are the standard metrics.
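The precision/recall distinction can be seen on a toy example. The sketch below computes only clipped bigram precision (the core ingredient of BLEU, without its brevity penalty or multiple n-gram orders) and bigram recall (the core of ROUGE-2); it is not a replacement for a full ROUGE or BLEU package, and the sentences are invented for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bigram_precision_recall(candidate, reference, n=2):
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((cand & ref).values())              # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)  # BLEU-style view
    recall = overlap / max(sum(ref.values()), 1)      # ROUGE-style view
    return precision, recall


reference = "the court dismissed the appeal and upheld the conviction"
candidate = "the court dismissed the appeal"

precision, recall = bigram_precision_recall(candidate, reference)
print(precision, recall)  # 1.0 0.5
```

The candidate summary contains nothing wrong, so its precision is perfect, but it omits the conviction entirely. Only the recall-oriented score exposes that missing coverage.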

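Mistake 5: Not handling Hindi/Devanagari normalization. Unicode normalization (NFC/NFKC) is critical because visually identical text can be stored as different code point sequences. For example, "क़" can be the single code point U+0958 or क (U+0915) followed by a nukta (U+093C); the two render identically but compare as unequal strings and may tokenize differently. Always apply unicodedata.normalize('NFKC', text) before tokenization.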
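A minimal check with Python's standard `unicodedata` module shows the problem and the fix. The letter क़ is used here as one illustrative case (precomposed U+0958 versus क plus a combining nukta); `normalize_text` is a helper name introduced only for this example.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC so equivalent code point sequences become identical."""
    return unicodedata.normalize("NFKC", text)


precomposed = "\u0958"       # क़ as one code point (DEVANAGARI LETTER QA)
decomposed = "\u0915\u093c"  # क followed by a combining nukta

print(precomposed == decomposed)                                  # False
print(normalize_text(precomposed) == normalize_text(decomposed))  # True
print([hex(ord(ch)) for ch in normalize_text(precomposed)])
```

Running the same normalization over both training data and inference inputs matters as much as the choice of form: a model fine-tuned on normalized text will still see unfamiliar token sequences if production inputs skip the step.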

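Mistake 6: Treating CRF as optional in NER. Without a CRF, a Bi-LSTM picks each tag separately and can output illegal tag sequences like "I-PER" following "B-ORG". A CRF layer scores whole tag sequences and learns transition scores that make such sequences very unlikely (hard constraints can also be imposed at decoding time), so that "I-PER" effectively follows only "B-PER" or "I-PER". This alone typically improves F1 by 2-4 points.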
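The BIO constraint itself is simple to state in code. The sketch below is not a CRF; it is a small validity checker (with function names invented for this example) that can be used to count how many illegal transitions a tagger without a CRF produces, or to build a mask of forbidden transitions for constrained decoding.

```python
def is_valid_transition(prev_tag: str, curr_tag: str) -> bool:
    """I-X may only follow B-X or I-X; every other transition is allowed."""
    if curr_tag.startswith("I-"):
        entity_type = curr_tag[2:]
        return prev_tag in (f"B-{entity_type}", f"I-{entity_type}")
    return True


def find_violations(tags):
    """Return (position, previous_tag, tag) for every illegal BIO transition."""
    violations, prev = [], "O"  # treat the sequence start like an "O" tag
    for i, tag in enumerate(tags):
        if not is_valid_transition(prev, tag):
            violations.append((i, prev, tag))
        prev = tag
    return violations


print(find_violations(["B-PER", "I-PER", "O", "B-ORG"]))  # []
print(find_violations(["B-ORG", "I-PER", "O"]))           # [(1, 'B-ORG', 'I-PER')]
print(find_violations(["O", "I-PER"]))                    # [(1, 'O', 'I-PER')]
```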

Section 10

Comparison Tables

10.1 All Five Projects At a Glance

Project	Task	Model	Metric	Score
Hindi Sentiment	Classification	MuRIL	F1 Macro	0.852
Legal Summarization	Extractive Summ.	IndicBERT + Transformer	ROUGE-2	0.369
IRCTC Chatbot	Intent Classification	MuRIL + Entity Regex	Accuracy	91.5%
Indian News NER	Sequence Labeling	MuRIL (Token Clf.)	Entity F1	92.0%
Hindi ASR	Speech Recognition	IndicWav2Vec	WER	12.4%

10.2 Indian Language Models Comparison

Model	Params	Languages	Code-Mix	Transliteration	Best Use
mBERT	110M	104	Poor	Poor	Baseline only
XLM-RoBERTa	270M	100	Fair	Fair	Cross-lingual transfer
MuRIL	110M	17 Indian	Excellent	Excellent	All Indian NLU
IndicBERT	33M	12 Indian	Good	Good	Lightweight deployment
IndicBART	244M	11 Indian	Good	Good	Generation tasks
IndicTrans2	320M	22 Indian	N/A	N/A	Translation

10.3 NLP Metrics Cheat Sheet

Metric	Task	Formula Intuition	Range	Higher = Better?
Accuracy	Classification	Correct / Total	0–1	✅
F1 (Macro)	Classification	Harmonic mean of P & R, averaged	0–1	✅
ROUGE-1	Summarization	Unigram overlap with reference	0–1	✅
ROUGE-2	Summarization	Bigram overlap with reference	0–1	✅
ROUGE-L	Summarization	Longest common subsequence	0–1	✅
Entity F1	NER	Exact span + type match F1	0–1	✅
WER	ASR	(S+D+I) / N (edit distance)	0–∞	❌ (lower better)
BLEU	Translation	N-gram precision with brevity penalty	0–1	✅
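WER is the one metric in this table that can exceed 1, because inserted words are counted against a fixed number of reference words. A minimal word-level edit distance makes this concrete; the sketch assumes whitespace tokenization and the example sentences are invented.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(wer("मेरा पीएनआर स्टेटस बताओ", "मेरा पीएनआर बताओ"))  # 0.25 (one deletion)
print(wer("हाँ", "हाँ जी बिल्कुल"))                          # 2.0 (two insertions)
```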

Section 11

Exercises

Section A — Multiple Choice Questions

Q1.

Which model is specifically designed for Indian language understanding and outperforms mBERT on IndicGLUE?

A. GPT-3
B. XLM-RoBERTa-base
C. MuRIL (Google)
D. BERT-base-uncased

✅ C. MuRIL: Multilingual Representations for Indian Languages, trained exclusively on 17 Indian languages + English with transliterated and code-mixed data. Outperforms mBERT by ~14 points on IndicGLUE.

Remember | Models

Q2.

In BIO tagging for NER, the tag sequence "B-PER I-ORG" is:

A. Valid: it means a person at an organization
B. Invalid: I-ORG can only follow B-ORG or I-ORG
C. Valid, but only for code-mixed text
D. Invalid: there is no I-ORG tag

✅ B. Invalid: in the BIO scheme, an I-TYPE tag can only follow a B-TYPE or I-TYPE tag of the same entity type. "I-ORG" after "B-PER" violates this constraint, which is exactly what CRF layers enforce.

Understand | NER

Q3.

For evaluating a legal document summarizer, which metric is most appropriate?

A. BLEU score
B. Accuracy
C. ROUGE-2 F1
D. Word Error Rate

✅ C. ROUGE-2 F1: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated and reference summaries. ROUGE-2 specifically measures bigram overlap, which captures phrase-level similarity. BLEU is for translation, WER is for ASR.

Understand | Evaluation

Q4.

A Flipkart review says "product toh sahi hai but delivery bahut late." What type of code-mixing is this?

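A. Intra-word mixing
B. Sentence-level switching (inter-sentential)
C. Word-level mixing (intra-sentential)
D. Transliteration only

✅ C. Word-level mixing (intra-sentential): English words ("product", "but", "delivery", "late") are mixed into a Hindi sentence at the word level, all within a single sentence. Inter-sentential switching, by contrast, changes language only at sentence boundaries. Word-level mixing is the most common form of Hinglish code-mixing, occurring in ~40% of Indian social media text.

Apply | Code-Mixing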

Q5.

Why does extractive summarization suit legal documents better than abstractive?

A. Extractive is always more accurate
B. Legal language requires exact wording; paraphrasing risks changing legal meaning
C. Abstractive models cannot handle long documents
D. Extractive is faster at inference time

✅ B. Legal text is precise: "Section 302 IPC" means murder. An abstractive model might paraphrase this differently, potentially changing the legal meaning. Extractive summarization selects original sentences, preserving exact legal language.

Analyze | Summarization

Q6.

In a Bi-LSTM-CRF for NER, what does the CRF layer primarily enforce?

A. Word embedding quality
B. Valid tag transition constraints (e.g., B-PER can be followed by I-PER but not I-ORG)
C. Faster training convergence
D. Better handling of long sequences

✅ B. The CRF (Conditional Random Field) layer learns a transition matrix between tags and scores the tag sequence as a whole, so the output follows valid BIO constraints. Without a CRF, the Bi-LSTM still encodes context, but each token's tag is chosen independently of its neighbours' tags, so the model can produce illegal sequences like "O I-PER".

Understand | NER

Q7.

IndicWav2Vec achieves lower WER than Facebook's Wav2Vec 2.0 on Indian languages because:

A. It uses a larger Transformer architecture
B. It was pretrained on 40,000+ hours of unlabeled Indian speech
C. It uses a different loss function
D. It doesn't use self-supervised pretraining

✅ B. IndicWav2Vec uses the same architecture as Wav2Vec 2.0 but is pretrained on 40,000+ hours of Indian language audio. This domain-specific pretraining captures Indian phonetics, prosody, and accent patterns that the English-pretrained model misses.

Analyze | ASR

Q8.

An IRCTC chatbot classifies "PNR check karo 4521876340" with 65% confidence. The threshold is 75%. What happens?

A. The bot responds with the PNR status
B. The query is escalated to a human agent
C. The bot asks the user to rephrase
D. The query is discarded

✅ B. Escalated to human agent: since 65% is below the 75% threshold, the system escalates to a human agent. In production, low confidence indicates the model is uncertain, and it is better to escalate than to give a wrong response to a paying passenger.

Apply | Chatbot

Q9.

WER (Word Error Rate) of 12.4% means:

A. 12.4% of words in the reference are correctly transcribed
B. 87.6% of words are correctly transcribed (approximately)
C. 12.4% of the audio duration was silence
D. The model used 12.4% of its vocabulary

✅ B. WER = (Substitutions + Deletions + Insertions) / Total Reference Words. A WER of 12.4% means the number of word errors (substituted, deleted, or extra inserted words) equals about 12.4% of the reference length, so roughly 87.6% of words were correct. The figure is only approximate because insertions count as errors without corresponding to any reference word. Lower WER is better.

Understand | ASR Metrics

Q10.

When fine-tuning a Transformer for NER with subword tokenization, how should we handle a word like "मुंबई" tokenized as ["मुं", "##बई"]?

A. Predict NER tags for both subwords independently
B. Take the first subword's prediction as the word-level tag
C. Average the logits of both subwords
D. Discard the word entirely

✅ B. The standard approach is to assign the label only to the first subword token of each word and ignore predictions for subsequent subword pieces (##tokens). This preserves the one-label-per-word structure needed for NER evaluation.

Apply | NER Tokenization
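The first-subword convention from Q10 can be sketched as a small alignment function. It takes the per-token word indices that HuggingFace fast tokenizers expose through `word_ids()` and masks special tokens and continuation pieces with -100, the default `ignore_index` of PyTorch's cross-entropy loss. To stay self-contained, the `word_ids` list below is written out by hand for the assumed tokenization [CLS] मुं ##बई में बारिश [SEP], and the label map is hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label only the first subword of each word; mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)          # special tokens ([CLS], [SEP], padding)
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword carries the word's label
        else:
            aligned.append(ignore_index)          # later ## pieces are ignored in the loss
        previous = word_id
    return aligned


label2id = {"O": 0, "B-LOC": 1}  # hypothetical label map
word_labels = [label2id["B-LOC"], label2id["O"], label2id["O"]]  # मुंबई में बारिश
word_ids = [None, 0, 0, 1, 2, None]  # assumed output of encoding.word_ids()

print(align_labels(word_ids, word_labels))  # [-100, 1, -100, 0, 0, -100]
```

At inference time the same indices are used in reverse: keep the prediction at each word's first subword and discard the rest before computing entity-level F1.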

Section B — Short Answer Questions

Intermediate: Explain three ways code-mixing manifests in Hindi-English social media text. Give one example of each type and explain why each is challenging for NLP models. (6 marks)
Intermediate: Why does IndicBERT (33M parameters) sometimes outperform XLM-RoBERTa (270M parameters) on Indian language tasks? Discuss the concept of "language-specific capacity" in multilingual models. (6 marks)
Intermediate: Compare ROUGE-1, ROUGE-2, and ROUGE-L. Which is most important for legal summarization and why? (6 marks)
Advanced: Describe the oracle label creation process for extractive summarization. Why is greedy ROUGE optimization used instead of finding the globally optimal subset? (8 marks)
Intermediate: Explain why a cascade architecture (rules → ML → human) is preferred over pure ML for content moderation at scale. Use the Koo case study. (6 marks)

Section C — Long Answer Questions

Advanced: Design a complete NLP system for Zomato that handles restaurant reviews in Hindi, Tamil, Telugu, and English. Cover: (a) data collection and annotation strategy, (b) model selection and justification, (c) handling code-mixed reviews, (d) aspect-based sentiment for (food, service, ambiance, price), (e) deployment architecture for <200ms latency, (f) evaluation metrics and expected benchmarks. (20 marks)
Advanced: Compare Bi-LSTM-CRF and Transformer-based NER models for Indian news. For each architecture: (a) draw the complete architecture diagram, (b) explain the training procedure, (c) analyze performance on different entity types (PER, ORG, LOC, DATE, CUR), (d) discuss computational requirements, (e) recommend which to use for a news startup processing 10,000 articles/day with limited GPU budget. (20 marks)
Advanced: Explain how Wav2Vec 2.0 uses self-supervised learning for speech recognition. Cover: (a) the contrastive learning objective, (b) quantization of latent representations, (c) masking strategy, (d) CTC decoding, and (e) why self-supervised pretraining is critical for low-resource Indian languages like Odia and Assamese. (15 marks)

Section D — Programming Exercises

Intermediate: Build a Hinglish preprocessor that: (a) detects the language ratio (Hindi vs English characters), (b) normalizes common transliteration variants (e.g., "accha" → "अच्छा"), (c) handles emoji sentiment markers (👍 → positive, 😤 → negative), and (d) cleans social media noise (URLs, mentions, repeated characters). Test on 10 real Flipkart-style reviews.
Advanced: Implement entity-level F1 evaluation from scratch (without using the seqeval library). Your function should: (a) extract entity spans from BIO tag sequences, (b) compute precision, recall, and F1 for each entity type, (c) handle edge cases (incomplete entities, nested entities), and (d) return both micro-averaged and macro-averaged F1.
Advanced: Build a simple extractive summarizer that: (a) splits a document into sentences, (b) encodes each sentence using TF-IDF vectors, (c) computes a sentence importance score using TextRank (graph-based), (d) selects top-K sentences, and (e) evaluates against a reference summary using ROUGE-2. Test on a sample Indian court judgment.

Section E — Mini-Project

🚀 Mini-Project: Multilingual Indian News Intelligence System

Objective

Build an end-to-end NLP pipeline that processes Indian news articles in Hindi and English, performing:

NER: Extract Person, Organization, Location, Date, Currency entities
Sentiment: Classify article tone (Positive/Negative/Neutral)
Summarization: Generate a 3-sentence extractive summary
Topic Classification: Politics, Business, Sports, Technology, Entertainment

Requirements

Use MuRIL or IndicBERT as the backbone
Process at least 100 test articles (50 Hindi, 50 English)
Report entity-level F1 (NER), macro F1 (sentiment/topic), and ROUGE-2 (summarization)
Handle code-mixed articles (Hindi-English)
Build a simple web interface using Gradio or Streamlit

Dataset Sources

Hindi news: BBC Hindi, Amar Ujala (web scraping), IndicNLP News Classification dataset
NER annotations: WikiNER Hindi, FIRE NER shared task datasets
Summaries: Create oracle extractive labels from article headlines

Deliverables

Complete Python code (Jupyter notebook or .py files)
Evaluation report with metrics per task and per language
Error analysis: 10 examples where the system fails and why
A 5-minute demo video showing the system in action

Grading Rubric

Component	Marks
Working NER pipeline with evaluation	20
Sentiment + topic classification	15
Extractive summarization with ROUGE	15
Code-mixing handling	10
Web interface (Gradio/Streamlit)	10
Error analysis and documentation	15
Code quality and reproducibility	15
Total	100

Section 12

Chapter Summary

Key Takeaways — Applied NLP for India

Indian NLP faces six unique challenges: code-mixing, script diversity (13+ scripts), morphological richness, low-resource languages, transliteration variants, and domain-specific vocabulary. These make even "solved" English NLP tasks frontier research problems.
MuRIL (Google) is the go-to model for Indian language NLU. Pretrained on 17 Indian languages with transliterated and code-mixed data, it outperforms mBERT by ~14 points and XLM-RoBERTa by ~7 points on Indian benchmarks.
Hindi sentiment analysis using MuRIL achieves 85.2% macro F1 on Flipkart reviews. The key innovation is preprocessing that handles Hinglish code-mixing and transliteration normalization.
Legal document summarization using extractive BertSumExt achieves ROUGE-2 of 0.369 on Indian court judgments. Extractive is preferred over abstractive for legal text because paraphrasing legal language can change its meaning.
IRCTC chatbot achieves 91.5% intent classification accuracy across 12 intent categories. A confidence-based escalation system (threshold 0.75) ensures uncertain queries reach human agents.
NER for Indian news — Transformer models (MuRIL) achieve 92.0% entity F1, outperforming Bi-LSTM-CRF (87.3%) by 4.7 points. However, Bi-LSTM-CRF is 7× faster at inference, making it suitable for latency-critical applications.
ASR using IndicWav2Vec achieves 12.4% WER on Hindi, a 37% relative reduction in errors compared with English-pretrained Wav2Vec 2.0 (19.8% WER). Self-supervised pretraining on unlabeled Indian speech is the key enabler for low-resource languages.
AI4Bharat (IIT Madras) has built India's language AI stack: IndicBERT, IndicBART, IndicTrans2, IndicWav2Vec, and IndicTTS — all open-source, powering the government's Bhashini platform.
Cascade architectures (rules → ML → human) outperform pure ML for production NLP at scale, as demonstrated by Koo's content moderation system that reduced costs by 72%.
Evaluation metrics matter: Use F1 (not accuracy) for classification, ROUGE (not BLEU) for summarization, entity-level F1 (not token accuracy) for NER, and WER for ASR. Using the wrong metric gives misleading results.

The Indian NLP Hierarchy:

Data Preprocessing (script normalization, code-mix handling)
→ Pretrained Model (MuRIL / IndicBERT / IndicWav2Vec)
→ Fine-tuning (task-specific labeled data)
→ Evaluation (correct metric per task)
→ Deployment (cascade for latency + cost)

Section 13

References & Further Reading

Foundational Papers

Khanuja, S., et al. (2021). "MuRIL: Multilingual Representations for Indian Languages." Google Research. The paper behind Google's Indian language model, trained on 17 languages with transliteration.
Kakwani, D., et al. (2020). "IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages." AI4Bharat, EMNLP Findings. The IndicBERT paper from IIT Madras.
Malik, V., et al. (2021). "ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation." IIIT Hyderabad, ACL. The dataset used in our legal summarization project.
Liu, Y., & Lapata, M. (2019). "Text Summarization with Pretrained Encoders." EMNLP. The BertSumExt/BertSumExtAbs paper — foundation for our legal summarizer.
Baevski, A., et al. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." Meta AI, NeurIPS. The foundational ASR model we adapt with IndicWav2Vec.

Indian NLP Resources

Joshi, A., et al. (2022). "IndicWav2Vec: Exploring Wav2Vec 2.0 for Indian Languages." AI4Bharat. Pretrained on 40,000+ hours of Indian speech.
Kunchukuttan, A., et al. (2020). "AI4Bharat-IndicNLP Corpus." IIT Madras. Large-scale corpora for 12 Indian languages.
Aggarwal, P., & Rani, R. (2023). "Code-Mixed Sentiment Analysis for Hindi-English Social Media Text." Survey of techniques for Hinglish NLP.
Lample, G., et al. (2016). "Neural Architectures for Named Entity Recognition." NAACL. The original Bi-LSTM-CRF paper for NER.
Lin, C.-Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries." ACL Workshop. The standard evaluation metric for summarization.

Platforms & Tools

AI4Bharat — ai4bharat.org — India's premier open-source language AI initiative (IIT Madras)
Bhashini — bhashini.gov.in — Government of India's language translation platform
HuggingFace Indian Models — Search "MuRIL", "IndicBERT", "IndicWav2Vec" on huggingface.co
IndicNLP Library — github.com/anoopkunchukuttan/indic_nlp_library — Preprocessing tools for Indian languages
iNLTK — github.com/goru001/inltk — Indian NLP Toolkit

Textbooks

Jurafsky, D. & Martin, J.H. (2024). "Speech and Language Processing." 3rd Edition (Draft). The definitive NLP textbook — Chapters on NER, summarization, and ASR.
Goldberg, Y. (2017). "Neural Network Methods for Natural Language Processing." Morgan & Claypool. Clear explanation of Bi-LSTM-CRF and attention mechanisms.