Neural Networks & Deep Learning

Chapter 18: Applied Deep Learning - Natural Language Processing

Building Real-World NLP Systems for India's Linguistic Diversity

โฑ๏ธ Reading Time: ~5 hours  |  ๐Ÿ“– Part V: Applied Deep Learning  |  ๐Ÿ› ๏ธ Project-Based Chapter

📋 Prerequisites: Chapters 14–17 (RNNs, Attention, Transformers, BERT), Python, PyTorch/HuggingFace basics

Bloom's Taxonomy Map for This Chapter

Bloom's Level | What You'll Achieve
🔵 Remember | Recall NLP pipeline stages, key Indian-language datasets (ILDC, IndicNLP), and evaluation metrics (ROUGE, BLEU, F1, WER)
🔵 Understand | Explain why Indian languages pose unique challenges: agglutinative morphology, code-switching, low-resource settings, and multiple scripts
🟢 Apply | Fine-tune MuRIL for Hindi sentiment, build extractive summarizers with IndicBERT, and train intent classifiers for chatbots
🟡 Analyze | Compare Bi-LSTM-CRF vs Transformer NER architectures; diagnose code-mixing failures in tokenizers
🟠 Evaluate | Benchmark models across ROUGE, entity-level F1, and word error rate; choose between extractive vs abstractive summarization
🔴 Create | Design and deploy end-to-end NLP applications: a Hindi sentiment system, a legal summarizer, an IRCTC chatbot, and a news NER pipeline

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Build a Hindi/Hinglish sentiment analysis system using MuRIL (Multilingual Representations for Indian Languages) from HuggingFace, handling code-mixed reviews from Flipkart
  • Implement an extractive legal document summarizer using IndicBERT on the ILDC (Indian Legal Documents Corpus) from IIIT Hyderabad, evaluating with ROUGE scores
  • Design an IRCTC-style chatbot that handles 2 crore+ daily queries using BERT-based intent classification and response retrieval
  • Train Named Entity Recognition models for Indian news articles, comparing Bi-LSTM-CRF vs Transformer architectures for Person, Organization, Location, Date, and Currency entities
  • Explain Automatic Speech Recognition (ASR) for Indian languages using Wav2Vec 2.0 and AI4Bharat's IndicWav2Vec for 9+ languages
  • Handle Indian-language-specific challenges: Devanagari/multi-script tokenization, code-switching, morphological richness, and low-resource data augmentation
  • Evaluate NLP systems using appropriate metrics: Accuracy/F1 (classification), ROUGE (summarization), Entity F1 (NER), WER (ASR)
Section 2

Opening Hook: India's Language AI Revolution

🗣️ 1.4 Billion People. 22 Official Languages. 19,500 Dialects. One NLP Challenge.

Every day, 2 crore passengers query IRCTC in a mix of Hindi, English, Tamil, and everything in between. Flipkart receives lakhs of product reviews in Hinglish, such as "bahut accha product hai, quality mast hai 👍". Indian courts generate 4 crore+ pages of legal documents that lawyers must manually read. Meanwhile, Koo tried to moderate content across 10 Indian languages simultaneously.

English NLP is "solved" for many tasks. But India's code-mixing ("main kal market gaya tha, nice experience"), resource scarcity (try finding 1 lakh labeled Kannada sentences), and script diversity (Devanagari, Tamil, Telugu, Bengali, Gurmukhi...) make even basic NLP a frontier research problem.

In this chapter, you don't just learn NLP theory; you build five Indian NLP systems end to end, using the same building blocks that production systems rely on. Welcome to the most exciting frontier of AI in India.

Flipkart | IRCTC | Koo | AI4Bharat | IIIT Hyderabad | Jugalbandi

India is home to AI4Bharat (IIT Madras), which has built open-source models for 22 Indian languages. Their IndicBERT, IndicTrans, and IndicWav2Vec are used by government and industry alike. The Bhashini platform (MeitY) aims to break India's language barrier using these exact models. This chapter teaches you the technology behind India's language AI stack.
Section 3

Core Concepts: Five NLP Projects for India

This chapter is organized as five complete projects, each addressing a real Indian NLP problem. Every project follows a consistent structure: Problem → Dataset → Model Architecture → Full Code → Evaluation → Indian Language Challenges.

18.1 The Indian NLP Pipeline: Unique Challenges

Before diving into projects, let's understand why Indian NLP is fundamentally harder than English NLP:

Why Indian NLP is Hard: The Six Challenges

1. Code-Mixing (Code-Switching)

"Yaar ye phone ka camera too good hai, totally worth it": Hindi and English are mixed at word and sentence level. Tokenizers trained on monolingual data have no good vocabulary for such text and break down on it.

2. Script Diversity

22 official languages using 13+ scripts: Devanagari (Hindi, Marathi, Sanskrit), Tamil script, Telugu script, Bengali script, Gurmukhi (Punjabi), Kannada script, Malayalam script, Odia script, and more. A single model must handle all.

3. Morphological Richness

Tamil has agglutinative morphology: a single word can encode subject, tense, number, and mood, as in "படிக்கவில்லையா" (padikkavillaiyaa = "did [you] not read?"). This creates an enormous vocabulary that subword tokenizers struggle with.

4. Low-Resource Languages

While Hindi has ~100K labeled NLP samples, languages like Dogri, Maithili, Bodo, and Santali have almost zero labeled data. Even Kannada and Odia have <10K labeled sentences for most tasks.

5. Transliteration Variants

"कैसे हो" = "kaise ho" = "kese ho" = "kaise hoo": the same Hindi phrase written in multiple romanized forms. Models must handle native script AND transliterated forms.

6. Domain-Specific Vocabulary

Legal Hindi ("न्यायालय", "अधिनियम"), medical terminology in regional languages, and government jargon create specialized domains with virtually no training data.

India's linguistic diversity is so extreme that Facebook/Meta trained separate models for Hindi, Bengali, Tamil, Telugu, and Marathi, but still couldn't handle Hinglish, which is spoken by an estimated 350 million people daily. Google's MuRIL was specifically designed to solve this problem.
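
Because romanized Hindi has no standard spelling, one common first step is to map spelling variants to a shared key before lookup or deduplication. Below is a toy sketch, assuming two hand-picked rewrite rules; the `roman_key` helper is hypothetical, and real systems learn such mappings from data rather than hard-coding them.

```python
import re


def roman_key(text):
    """Map romanized Hindi spelling variants to one crude canonical key."""
    key = text.lower().strip()
    key = re.sub(r"([a-z])\1+", r"\1", key)   # "hoo" -> "ho", "accha" -> "acha"
    key = key.replace("ai", "e")               # "kaise" -> "kese"
    key = re.sub(r"\s+", " ", key)             # collapse whitespace
    return key


variants = ["kaise ho", "kese ho", "kaise hoo"]
print({roman_key(v) for v in variants})        # a single shared key

# Rules this simple miss vowel variation such as "bahut" vs "bohot".
print(roman_key("bahut") == roman_key("bohot"))
```

The second print shows the limit of hand-written rules, which is one motivation for models pretrained on transliterated text.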

The Standard NLP Pipeline for Indian Languages

┌──────────────┐    ┌──────────────┐    ┌──────────────────┐    ┌───────────────┐
│   Raw Text   │───▶│   Language   │───▶│      Script      │───▶│    Subword    │
│   (mixed)    │    │  Detection   │    │  Normalization   │    │ Tokenization  │
└──────────────┘    └──────────────┘    └──────────────────┘    └───────┬───────┘
                                                                        │
┌──────────────┐    ┌──────────────┐    ┌──────────────────┐            │
│  Task Head   │◀───│  Fine-tuned  │◀───│  Pretrained LM   │◀───────────┘
│  (classify/  │    │    Layers    │    │     (MuRIL/      │
│  NER/summ)   │    │              │    │    IndicBERT)    │
└──────────────┘    └──────────────┘    └──────────────────┘

Model | Creator | Languages | Best For
MuRIL | Google | 16 Indian + English | Code-mixed classification, QA
IndicBERT | AI4Bharat / IIT Madras | 12 Indian languages | All NLU tasks for Indian langs
IndicBART | AI4Bharat | 11 Indian languages | Generation, summarization
XLM-RoBERTa | Meta | 100 languages | Cross-lingual transfer
IndicTrans2 | AI4Bharat | 22 Indian languages | Machine translation
IndicWav2Vec | AI4Bharat | 9 Indian languages | Speech recognition (ASR)
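
The script-normalization stage of the pipeline can be sketched with the standard library alone. The example assumes that Unicode NFC normalization plus whitespace cleanup is sufficient for this stage; the `normalize_script` name is illustrative, and real pipelines often add script-specific rules on top.

```python
import re
import unicodedata


def normalize_script(text):
    """Script normalization: canonical Unicode form plus whitespace cleanup."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


# The Devanagari letter "qa" can be stored as one code point (U+0958)
# or as "ka" followed by a nukta (U+0915 U+093C). They render the same
# but compare as different strings until normalized.
precomposed = "\u0958"
decomposed = "\u0915\u093c"

print(precomposed == decomposed)
print(normalize_script(precomposed) == normalize_script(decomposed))
```

Running normalization before tokenization keeps visually identical strings from being split into different subword sequences.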

18.2 Project 1: Hindi/Hinglish Sentiment Analysis

๐Ÿท๏ธ PROJECT 1 โ€” CLASSIFICATION

The Business Problem

Flipkart receives 50 lakh+ product reviews monthly, with ~40% written in Hindi or Hinglish. Their recommendation engine and seller quality score depend on accurate sentiment detection. A review like "product toh sahi hai but delivery mein bahut time lagaya 😤" carries mixed sentiment: positive about the product, negative about the delivery. English-only models classify this as neutral (wrong!).

Dataset: Flipkart Hindi/Hinglish Reviews

We use a curated dataset of Flipkart product reviews in Hindi and code-mixed Hindi-English (Hinglish), labeled as Positive, Negative, or Neutral.

Split | Positive | Negative | Neutral | Total
Train | 12,400 | 8,600 | 4,000 | 25,000
Validation | 1,550 | 1,075 | 500 | 3,125
Test | 1,550 | 1,075 | 500 | 3,125
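
The split is imbalanced: neutral reviews are the smallest class. One common response, shown here as a sketch rather than as what this chapter's training code does, is to compute inverse-frequency class weights that could be passed to a weighted loss. The `inverse_frequency_weights` helper is an illustrative name; the counts are copied from the training row above.

```python
# Class counts copied from the training split.
train_counts = {"positive": 12_400, "negative": 8_600, "neutral": 4_000}


def inverse_frequency_weights(counts):
    """weight_c = N / (K * n_c); a weight of 1.0 means a perfectly balanced class."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


weights = inverse_frequency_weights(train_counts)
total = sum(train_counts.values())
for label, w in weights.items():
    share = train_counts[label] / total
    print(f"{label:<9} share={share:.1%}  weight={w:.2f}")
```

With these counts the neutral class receives roughly three times the weight of the positive class. Reporting macro F1, as the evaluation code in this project does, is the complementary safeguard on the metric side.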

Why MuRIL? Multilingual Representations for Indian Languages

MuRIL: Google's Indian Language BERT

Architecture

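MuRIL uses the BERT-base architecture (12 layers, 768 hidden units, 12 attention heads). Standard BERT-base has about 110M parameters; MuRIL's much larger multilingual vocabulary makes its embedding layer, and therefore its total size, bigger. It is pretrained on: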

  • 16 Indian languages + English from Wikipedia and Common Crawl
  • Transliterated text: Hindi in both Devanagari and Roman script
  • Parallel corpora: aligned Hindi-English sentence pairs

Why It Beats mBERT for Indian Languages

mBERT shares one vocabulary and one set of weights across 104 languages, and Indian languages made up only ~2% of its training data. MuRIL spends its entire capacity on Indian languages and English, resulting in 5-10% higher accuracy on Indian NLP benchmarks.

Code-Mixing Handling

MuRIL's pretraining data explicitly includes transliterated text. As a result it tokenizes "bahut accha product hai" into meaningful pieces even though it is Hindi in Roman script, whereas mBERT and XLM-R tend to fragment such words.

Code-Mixing: The Key Challenge

Code-mixing occurs at multiple levels:

LevelExampleChallenge
Word-level"Main phone use karta hoon"English noun in Hindi sentence
Sentence-level"Product accha hai. But delivery was late."Language switch at sentence boundary
Intra-word"Phoneเคตเคพเคฒเคพ" (Phone + เคตเคพเคฒเคพ)Morpheme-level mixing across scripts
Transliteration"bahut accha" vs "เคฌเคนเฅเคค เค…เคšเฅเค›เคพ"Same meaning, different scripts

Full Implementation: MuRIL Sentiment Classifier

# ─── Project 1: Hindi Sentiment Analysis with MuRIL ───
# Fine-tune Google's MuRIL on Flipkart Hindi/Hinglish reviews

import torch
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from sklearn.metrics import classification_report, confusion_matrix
import re

# ─── Step 1: Preprocessing for Hindi/Hinglish ───

class HindiTextPreprocessor:
    """Handles code-mixed Hindi-English text preprocessing."""

    def __init__(self):
        # Common Hindi stopwords (in Devanagari)
        self.hindi_stopwords = {
            'है', 'हैं', 'का', 'की', 'के', 'में',
            'को', 'से', 'पर', 'और', 'यह', 'वह'
        }
        # Hinglish normalization map
        self.normalize_map = {
            'accha': 'अच्छा', 'acha': 'अच्छा',
            'bahut': 'बहुत', 'bohot': 'बहुत',
            'sahi': 'सही', 'shi': 'सही',
            'kharab': 'खराब', 'khrb': 'खराब',
            'mast': 'मस्त', 'bakwas': 'बकवास',
        }

    def clean_text(self, text):
        """Clean and normalize Hindi/Hinglish text."""
        text = str(text).lower()
        # Remove URLs and @mentions (the '#' of hashtags is stripped below)
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'@\w+', '', text)
        # Keep Devanagari chars (U+0900-U+097F), English, numbers;
        # note that this also drops emoji such as 👍
        text = re.sub(r'[^\u0900-\u097F\w\s]', ' ', text)
        # Cap runs of 3+ identical letters at two: "bahuttttt" → "bahutt".
        # Letters only, so numbers such as "1000" stay intact.
        text = re.sub(r'([^\W\d_])\1{2,}', r'\1\1', text)
        # Normalize common Hinglish spellings
        words = text.split()
        words = [self.normalize_map.get(w, w) for w in words]
        return ' '.join(words).strip()

    def detect_language_ratio(self, text):
        """Detect Hindi vs English ratio in text."""
        hindi_chars = len(re.findall(r'[\u0900-\u097F]', text))
        eng_chars = len(re.findall(r'[a-zA-Z]', text))
        total = hindi_chars + eng_chars
        if total == 0:
            return 0.0
        return hindi_chars / total  # 1.0 = pure Hindi, 0.0 = pure English


# ─── Step 2: Dataset Class ───

class FlipkartHindiDataset(Dataset):
    """PyTorch Dataset for Flipkart Hindi/Hinglish reviews."""

    LABEL_MAP = {'positive': 0, 'negative': 1, 'neutral': 2}

    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = [self.LABEL_MAP[l] for l in labels]
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.preprocessor = HindiTextPreprocessor()

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.preprocessor.clean_text(self.texts[idx])
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }


# ─── Step 3: Load MuRIL and Fine-Tune ───

MODEL_NAME = "google/muril-base-cased"
NUM_LABELS = 3

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="single_label_classification"
)

# Load data (assumes CSV with 'review_text' and 'sentiment' columns)
df_train = pd.read_csv("flipkart_hindi_train.csv")
df_val   = pd.read_csv("flipkart_hindi_val.csv")
df_test  = pd.read_csv("flipkart_hindi_test.csv")

train_dataset = FlipkartHindiDataset(
    df_train['review_text'].tolist(),
    df_train['sentiment'].tolist(),
    tokenizer
)
val_dataset = FlipkartHindiDataset(
    df_val['review_text'].tolist(),
    df_val['sentiment'].tolist(),
    tokenizer
)

# ─── Step 4: Training Configuration ───

def compute_metrics(eval_pred):
    """Compute accuracy and macro F1 for evaluation."""
    from sklearn.metrics import accuracy_score, f1_score
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average='macro')
    return {'accuracy': acc, 'f1_macro': f1}


training_args = TrainingArguments(
    output_dir="./muril-flipkart-sentiment",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    weight_decay=0.01,
    learning_rate=2e-5,
    evaluation_strategy="epoch",  # Renamed eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    logging_steps=50,
    fp16=torch.cuda.is_available(),  # Mixed precision needs a CUDA GPU
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()

# ─── Step 5: Evaluation ───

test_dataset = FlipkartHindiDataset(
    df_test['review_text'].tolist(),
    df_test['sentiment'].tolist(),
    tokenizer
)
results = trainer.evaluate(test_dataset)
print(f"Test Accuracy: {results['eval_accuracy']:.4f}")
print(f"Test F1 Macro: {results['eval_f1_macro']:.4f}")

# ─── Step 6: Inference on New Reviews ───

def predict_sentiment(text, model, tokenizer):
    """Predict sentiment of a Hindi/Hinglish review."""
    preprocessor = HindiTextPreprocessor()
    clean = preprocessor.clean_text(text)
    inputs = tokenizer(clean, return_tensors="pt", truncation=True,
                       max_length=128, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    model.eval()  # Disable dropout so predictions are deterministic

    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    pred = torch.argmax(probs, dim=-1).item()
    labels = ['Positive', 'Negative', 'Neutral']
    return labels[pred], probs[0].tolist()


# Test with code-mixed reviews
reviews = [
    "bahut accha product hai, quality mast hai 👍",
    "paise barbaad! bilkul kharab quality, return karna padega",
    "theek hai, not great not bad, average product",
    "बहुत अच्छा फोन है, कैमरा quality शानदार",
    "delivery late thi but product sahi hai",
]

for review in reviews:
    sentiment, probs = predict_sentiment(review, model, tokenizer)
    print(f"Review: {review[:50]}...")
    print(f"  → {sentiment} (conf: {max(probs):.1%})\n")
Test Accuracy: 0.8734
Test F1 Macro: 0.8521
Review: bahut accha product hai, quality mast hai 👍...
  → Positive (conf: 94.3%)

Review: paise barbaad! bilkul kharab quality, return ...
  → Negative (conf: 91.7%)

Review: theek hai, not great not bad, average product...
  → Neutral (conf: 78.2%)

Review: बहुत अच्छा फोन है, कैमरा quality शानदार...
  → Positive (conf: 96.1%)

Review: delivery late thi but product sahi hai...
  → Positive (conf: 62.4%)  ← Mixed sentiment, slight positive

Model Comparison: Why MuRIL Wins for Hindi

Model | Hindi Pure (%) | Hinglish (%) | Code-Mixed (%) | Overall F1
mBERT | 81.2 | 68.4 | 63.1 | 0.74
XLM-RoBERTa | 83.7 | 72.1 | 67.8 | 0.78
MuRIL | 89.4 | 84.2 | 81.7 | 0.85
IndicBERT | 87.1 | 79.8 | 76.3 | 0.82

Handling mixed-sentiment reviews: For production, consider aspect-based sentiment analysis (ABSA) where you extract (aspect, sentiment) pairs: ("product", positive), ("delivery", negative). This gives Flipkart actionable insights per review dimension.
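
To illustrate what (aspect, sentiment) pairs look like, here is a deliberately naive sketch: split the review at contrastive conjunctions, then score each clause with small word lists. The lexicons `ASPECTS`, `POSITIVE` and `NEGATIVE` are hypothetical and tiny; a production ABSA system would learn aspects and polarity from labeled data.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
ASPECTS = {"product": "product", "quality": "product",
           "delivery": "delivery", "packaging": "delivery"}
POSITIVE = {"sahi", "accha", "mast", "good", "great"}
NEGATIVE = {"late", "kharab", "bakwas", "bad", "slow"}
CONTRAST = r"\b(?:but|lekin|magar)\b"


def aspect_sentiments(review):
    """Split on contrastive conjunctions, then label each clause's aspects."""
    pairs = []
    for clause in re.split(CONTRAST, review.lower()):
        words = re.findall(r"[a-z]+", clause)
        aspects = {ASPECTS[w] for w in words if w in ASPECTS}
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        pairs.extend((aspect, label) for aspect in sorted(aspects))
    return pairs


print(aspect_sentiments("delivery late thi but product sahi hai"))
```

For the review that the classifier labeled weakly Positive, this yields a negative pair for delivery and a positive pair for product. Word lists miss phrasing such as "bahut time lagaya", which is where a learned model is needed.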

18.3 Project 2: Legal Document Summarization

📄 PROJECT 2: SUMMARIZATION

The Business Problem

Indian courts produce 4 crore+ pages of judgments annually. A Supreme Court judgment averages 40-80 pages; High Court orders run 10-30 pages. Lawyers spend 60% of their billable hours just reading. At ₹5,000-50,000/hour for senior advocates, even a 30% reduction in reading time saves the legal industry ₹1,000+ crore annually.

Dataset: ILDC (Indian Legal Documents Corpus)

The ILDC (Indian Legal Documents Corpus) was created by researchers at IIIT Hyderabad and contains Supreme Court judgments with expert-written summaries.

Feature | Details
Source | Supreme Court of India, Indian Kanoon
Documents | ~35,000 court cases
Avg. Document Length | ~4,100 words
Avg. Summary Length | ~850 words
Language | Legal English (Indian)
Labels | Binary prediction (accepted/rejected) + rhetorical roles

Approach: Extractive Summarization with IndicBERT

Extractive vs Abstractive Summarization

Extractive (Our Choice)

Select the most important sentences from the original document. The summary is a subset of original sentences. Best for legal text because exact wording matters: paraphrasing legal language can change its meaning.

Abstractive

Generate new sentences that paraphrase the document. More fluent but risks hallucination, a fatal flaw in legal contexts where a misquoted statute number could cause a ₹10 crore loss.

Our Architecture

We use IndicBERT (or BERT) to encode each sentence, then a classifier scores each sentence's importance (0 to 1). Top-K sentences form the summary. This is the BertSumExt approach adapted for Indian legal text.
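
The final selection step is simple enough to show on its own. This sketch assumes the model has already produced one score per sentence; `select_top_k` is an illustrative helper, not part of any library.

```python
def select_top_k(sentences, scores, k=3):
    """Return the k highest-scoring sentences in their original document order."""
    k = min(k, len(sentences))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


sentences = [
    "The appellant was convicted by the trial court.",
    "The High Court upheld the conviction.",
    "Counsel were heard at length.",
    "The appeal is dismissed.",
]
scores = [0.62, 0.81, 0.12, 0.93]

print(select_top_k(sentences, scores, k=2))
```

Re-sorting the chosen indices keeps the summary in document order, which matters for legal text where the reasoning precedes the ruling.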

ROUGE-N = (Σ Count_match(n-gram)) / (Σ Count(n-gram in reference))

ROUGE-L = LCS(candidate, reference) / len(reference)
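
Both formulas are the recall form of ROUGE, since they divide by the size of the reference. A from-scratch sketch on whitespace tokens makes the definitions concrete. It assumes no stemming and no special tokenization, so its numbers will not match the `rouge_score` package exactly, and the code in this project reads that package's F-measure rather than recall.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=2):
    """Clipped n-gram matches divided by the number of reference n-grams."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    cand, ref = candidate.split(), reference.split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0


reference = "the appeal is dismissed with costs"
candidate = "the appeal is allowed with costs"
print(f"ROUGE-1 recall: {rouge_n_recall(candidate, reference, 1):.3f}")
print(f"ROUGE-2 recall: {rouge_n_recall(candidate, reference, 2):.3f}")
print(f"ROUGE-L recall: {rouge_l_recall(candidate, reference):.3f}")
```

The example also shows a limitation of overlap metrics: "allowed" versus "dismissed" reverses the outcome of the case, yet five of six unigrams still match.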
# ─── Project 2: Legal Document Summarization ───
# Extractive summarization for Indian court judgments using BERT

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from rouge_score import rouge_scorer
import numpy as np
import re

# ─── Step 1: Sentence Segmentation for Legal Text ───

class LegalSentenceSegmenter:
    """Segment Indian legal documents into sentences.

    Legal text has unique patterns:
    - Section references: 'Sec. 302 I.P.C.'
    - Case citations: 'AIR 1950 SC 27'
    - Abbreviations: 'Hon'ble', 'vs.', 'Ltd.'
    """

    def __init__(self):
        # Patterns that look like sentence-ends but aren't
        self.abbreviations = {
            'vs', 'sec', 'art', 'no', 'sr',
            'dr', 'mr', 'mrs', 'smt', 'shri',
            'hon', 'ltd', 'pvt', 'govt', 'i.e',
            'e.g', 'etc', 'i.p.c', 'cr.p.c', 'c.p.c'
        }

    def segment(self, text):
        """Split legal document into sentences."""
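        # Protect abbreviations, keeping the original casing of each match
        # (a plain string replacement would turn "Sec." into "sec.")
        for abbr in self.abbreviations:
            text = re.sub(
                rf'\b{re.escape(abbr)}\.',
                lambda m: m.group(0)[:-1].replace('.', '_DOT_') + '_ABBR_',
                text,
                flags=re.IGNORECASE
            )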
        # Split on sentence-ending punctuation
        sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z\d])', text)
        # Restore abbreviations
        sentences = [s.replace('_DOT_', '.').replace('_ABBR_', '.')
                     for s in sentences]
        # Filter very short sentences (noise)
        sentences = [s.strip() for s in sentences if len(s.split()) > 5]
        return sentences


# ─── Step 2: BertSumExt Model ───

class LegalBertSumExt(nn.Module):
    """Extractive summarizer using BERT sentence representations.

    Architecture:
    1. Encode each sentence with BERT (CLS token)
    2. Inter-sentence Transformer (2 layers) for context
    3. Binary classifier: important (1) or not (0)
    """

    def __init__(self, bert_model_name="ai4bharat/indic-bert",
                 n_heads=8, n_inter_layers=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_model_name)
        hidden_size = self.bert.config.hidden_size  # 768

        # Inter-sentence Transformer layers
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size,
            nhead=n_heads,
            dim_feedforward=hidden_size * 4,
            dropout=0.1,
            activation='gelu',
            batch_first=True
        )
        self.inter_sentence_transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=n_inter_layers
        )

        # Binary classifier for each sentence
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, input_ids, attention_mask, sentence_mask):
        """
        Args:
            input_ids: (batch, num_sents, seq_len)
            attention_mask: (batch, num_sents, seq_len)
            sentence_mask: (batch, num_sents), 1 for real, 0 for pad
        Returns:
            scores: (batch, num_sents), importance score per sentence
        """
        batch_size, num_sents, seq_len = input_ids.shape

        # Encode each sentence independently with BERT
        input_ids = input_ids.view(-1, seq_len)
        attention_mask = attention_mask.view(-1, seq_len)

        bert_out = self.bert(input_ids, attention_mask=attention_mask)
        cls_embeddings = bert_out.last_hidden_state[:, 0, :]  # CLS tokens
        cls_embeddings = cls_embeddings.view(batch_size, num_sents, -1)

        # Inter-sentence Transformer for document-level context
        src_key_padding_mask = (sentence_mask == 0)
        contextualized = self.inter_sentence_transformer(
            cls_embeddings,
            src_key_padding_mask=src_key_padding_mask
        )

        # Score each sentence
        scores = self.classifier(contextualized).squeeze(-1)
        scores = scores * sentence_mask  # Zero out padded sentences

        return scores


# ─── Step 3: Oracle Label Creation ───

def create_oracle_labels(document_sentences, reference_summary, top_k=5):
    """Create oracle extractive labels using greedy ROUGE optimization.

    Greedily select sentences that maximize ROUGE-2 F1 with reference.
    """
    scorer = rouge_scorer.RougeScorer(['rouge2'], use_stemmer=True)
    selected = []
    remaining = list(range(len(document_sentences)))
    labels = [0] * len(document_sentences)

    for _ in range(min(top_k, len(document_sentences))):
        best_idx = -1
        best_score = -1

        for idx in remaining:
            candidate = ' '.join(
                [document_sentences[i] for i in selected + [idx]]
            )
            score = scorer.score(reference_summary, candidate)
            rouge2_f1 = score['rouge2'].fmeasure

            if rouge2_f1 > best_score:
                best_score = rouge2_f1
                best_idx = idx

        if best_idx >= 0 and best_score > 0:
            selected.append(best_idx)
            remaining.remove(best_idx)
            labels[best_idx] = 1
        else:
            break  # No remaining sentence overlaps with the reference

    return labels


# ─── Step 4: Training Loop ───

def train_summarizer(model, train_loader, val_loader,
                     epochs=5, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=0.01)
    criterion = nn.BCELoss(reduction='none')
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    for epoch in range(epochs):
        model.train()
        total_loss = 0

        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attn_mask = batch['attention_mask'].to(device)
            sent_mask = batch['sentence_mask'].to(device)
            labels = batch['labels'].to(device).float()

            scores = model(input_ids, attn_mask, sent_mask)
            loss = criterion(scores, labels)
            loss = (loss * sent_mask).sum() / sent_mask.sum()

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            total_loss += loss.item()

        # Evaluate on validation set
        val_rouge = evaluate_summarizer(model, val_loader, device)
        print(f"Epoch {epoch+1}/{epochs} | "
              f"Loss: {total_loss/len(train_loader):.4f} | "
              f"ROUGE-2: {val_rouge['rouge2']:.4f} | "
              f"ROUGE-L: {val_rouge['rougeL']:.4f}")


# ─── Step 5: Evaluation ───

def evaluate_summarizer(model, data_loader, device, top_k=5):
    """Evaluate using ROUGE scores."""
    model.eval()
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'], use_stemmer=True
    )
    all_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    with torch.no_grad():
        for batch in data_loader:
            scores = model(
                batch['input_ids'].to(device),
                batch['attention_mask'].to(device),
                batch['sentence_mask'].to(device)
            )
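            # Select top-k sentences, ignoring padded positions
            # (assumes padding comes after the real sentences)
            for i in range(scores.shape[0]):
                n_real = int(batch['sentence_mask'][i].sum())
                sent_scores = scores[i, :n_real].cpu().numpy()
                top_indices = np.argsort(sent_scores)[-top_k:]
                top_indices = sorted(top_indices)  # Maintain order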

                pred_summary = ' '.join(
                    [batch['sentences'][i][j] for j in top_indices]
                )
                ref_summary = batch['reference'][i]

                rouge = scorer.score(ref_summary, pred_summary)
                for key in all_scores:
                    all_scores[key].append(rouge[key].fmeasure)

    return {k: np.mean(v) for k, v in all_scores.items()}
Epoch 1/5 | Loss: 0.4231 | ROUGE-2: 0.2847 | ROUGE-L: 0.3921
Epoch 2/5 | Loss: 0.3156 | ROUGE-2: 0.3214 | ROUGE-L: 0.4287
Epoch 3/5 | Loss: 0.2734 | ROUGE-2: 0.3498 | ROUGE-L: 0.4512
Epoch 4/5 | Loss: 0.2501 | ROUGE-2: 0.3612 | ROUGE-L: 0.4634
Epoch 5/5 | Loss: 0.2389 | ROUGE-2: 0.3687 | ROUGE-L: 0.4721

Don't use abstractive summarization for legal documents. Legal language is precise: Section 302 of the IPC covers murder, while Section 304 covers culpable homicide not amounting to murder. An abstractive model might paraphrase "convicted under Section 302" as "found guilty of homicide", which is technically correct but legally imprecise. Always prefer extractive approaches for legal NLP where exact wording has legal implications.
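
The same concern can be checked mechanically for any summary, whichever model produced it. The sketch below flags section numbers that a summary cites but the source never mentions. The regular expression is an assumption about how sections are written ("Section 302", "Sec. 304A") and will miss other citation styles.

```python
import re

# Assumed citation styles: "Section 302", "Sec. 302", "section 304A".
SECTION_REF = re.compile(r"\b(?:Section|Sec\.)\s*(\d+[A-Z]?)", re.IGNORECASE)


def section_numbers(text):
    """Set of section numbers cited in the text, upper-cased."""
    return {ref.upper() for ref in SECTION_REF.findall(text)}


def unsupported_sections(summary, source):
    """Section numbers the summary cites that never appear in the source."""
    return section_numbers(summary) - section_numbers(source)


source = "The accused was convicted under Section 302 IPC by the trial court."
print(unsupported_sections("Convicted under Sec. 302.", source))     # empty set
print(unsupported_sections("Convicted under Section 304.", source))  # flags 304
```

An extractive summary passes this check by construction, which is one concrete way to state its advantage.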

18.4 Project 3: IRCTC Railway Chatbot

🤖 PROJECT 3: CHATBOT / INTENT CLASSIFICATION

The Business Problem

IRCTC handles 2 crore+ queries daily: PNR status, train schedule, ticket booking, refund status, platform info. Their call center employs 10,000+ agents at ₹15,000/month each, which is at least ₹15 crore/month in staff costs alone. An intelligent chatbot that resolves 60% of queries automatically could save roughly ₹9 crore/month or more and reduce average response time from 8 minutes to 3 seconds.
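
The cost figures are easy to sanity-check. This sketch assumes that staffing cost scales linearly with the share of queries still handled by people, which is a simplification; the function names are illustrative.

```python
CRORE = 10_000_000  # 1 crore = 10^7


def monthly_staff_cost(agents, salary_per_agent=15_000):
    """Total monthly salary bill in rupees."""
    return agents * salary_per_agent


def monthly_saving(agents, automated_share=0.60, salary_per_agent=15_000):
    """Saving if staffing shrinks in proportion to the automated share of queries."""
    return monthly_staff_cost(agents, salary_per_agent) * automated_share


for agents in (10_000, 12_000):
    cost = monthly_staff_cost(agents) / CRORE
    saving = monthly_saving(agents) / CRORE
    print(f"{agents:,} agents: cost ₹{cost:.1f} crore/month, "
          f"saving ₹{saving:.1f} crore/month")
```

At exactly 10,000 agents the bill is ₹15 crore and the saving ₹9 crore; a headcount of 12,000 gives ₹18 crore and about ₹10.8 crore.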

Architecture: Intent Classification + Response Retrieval

User Query: "mera train kab aayegi platform 3 pe?"
          │
          ▼
┌─────────────────────┐
│ Text Preprocessing  │──── Language detection, normalization
│  & Normalization    │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐     ┌──────────────────────┐
│  BERT-based Intent  │     │  Entity Extraction   │
│  Classifier         │────▶│  (Train#, PNR, Date) │
│  (12 intent classes)│     └──────────┬───────────┘
└─────────┬───────────┘                │
          │                            │
          ▼                            ▼
┌─────────────────────┐     ┌──────────────────────┐
│  Intent: train_     │     │  Entities:           │
│  schedule           │     │  platform=3          │
└─────────┬───────────┘     └──────────┬───────────┘
          │                            │
          ▼                            ▼
┌──────────────────────────────────────────────┐
│        Response Template + API Call          │
│  "Train {train_no} will arrive at platform   │
│   {platform} at {time}."                     │
└──────────────────────────────────────────────┘
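
The last box of the diagram, where intent and entities meet a response template, can be sketched as a small routing function. Everything here is hypothetical scaffolding: `RESPONSE_TEMPLATES`, `route`, and the `api_data` dictionary stand in for a real template store and railway API, and the 0.75 threshold mirrors the confidence threshold used by the intent classifier in this project.

```python
# Hypothetical template store; a real bot would have one entry per intent.
RESPONSE_TEMPLATES = {
    "train_schedule": "Train {train_no} will arrive at platform {platform} at {time}.",
    "pnr_status": "PNR {pnr} status: {status}.",
}
FALLBACK = "Connecting you to an agent."


def route(intent, confidence, entities, api_data, threshold=0.75):
    """Fill the template for an intent, or escalate when unsure or a slot is missing."""
    template = RESPONSE_TEMPLATES.get(intent)
    if template is None or confidence < threshold:
        return FALLBACK
    slots = {**entities, **api_data}
    try:
        return template.format(**slots)
    except KeyError:
        return FALLBACK   # a required slot was not extracted or returned


reply = route(
    intent="train_schedule",
    confidence=0.91,
    entities={"platform": "3"},
    api_data={"train_no": "12301", "time": "16:25"},  # stubbed API response
)
print(reply)
```

Escalating on low confidence or a missing slot keeps the bot from answering confidently with wrong train information.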

Training Data: Intent Categories

Intent | Examples | Count
pnr_status | "PNR status check karo", "मेरा PNR 4521876340" | 3,200
train_schedule | "Rajdhani ka time kya hai?", "12301 schedule" | 2,800
ticket_booking | "Delhi se Mumbai ticket book karo" | 2,500
ticket_cancel | "meri ticket cancel kardo", "refund kab milega" | 1,800
seat_availability | "3AC mein seat available hai?" | 2,200
platform_info | "train kis platform pe aayegi?" | 1,500
food_order | "train mein khana order karna hai" | 1,200
complaint | "AC kharab hai coach mein", "toilet saaf nahi" | 1,800
fare_enquiry | "Delhi Mumbai fare kitna hai?" | 1,400
live_status | "train kahan pahunchi?", "running status" | 2,100
tatkal_booking | "tatkal ticket kaise book hoga?" | 1,600
general_query | "IRCTC ka customer care number?" | 1,900

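
Before training, it helps to know what a trivial classifier would score on this data. The sketch below uses the counts from the table to compute the dataset size, the majority-class baseline, and the imbalance ratio; the variable names are illustrative.

```python
# Example counts per intent, copied from the table.
intent_counts = {
    "pnr_status": 3_200, "train_schedule": 2_800, "ticket_booking": 2_500,
    "ticket_cancel": 1_800, "seat_availability": 2_200, "platform_info": 1_500,
    "food_order": 1_200, "complaint": 1_800, "fare_enquiry": 1_400,
    "live_status": 2_100, "tatkal_booking": 1_600, "general_query": 1_900,
}

total = sum(intent_counts.values())
majority = max(intent_counts, key=intent_counts.get)
baseline_accuracy = intent_counts[majority] / total
imbalance_ratio = max(intent_counts.values()) / min(intent_counts.values())

print(f"Total examples: {total:,}")
print(f"Majority class: {majority} ({baseline_accuracy:.1%} baseline accuracy)")
print(f"Largest/smallest class ratio: {imbalance_ratio:.2f}")
```

Any trained intent classifier should beat the majority baseline by a wide margin, and the moderate imbalance suggests checking per-intent F1 rather than accuracy alone.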
# ─── Project 3: IRCTC Chatbot ───
# BERT-based Intent Classification + Entity Extraction

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from torch.utils.data import Dataset, DataLoader
import json, re
import numpy as np

# ─── Step 1: IRCTC-specific Entity Extractor ───

class IRCTCEntityExtractor:
    """Extract railway-specific entities from queries."""

    def __init__(self):
        self.patterns = {
            'pnr': r'\b(\d{10})\b',
            # Train numbers are 5 digits; accept any leading digit
            'train_number': r'\b(\d{5})\b',
            'date': r'\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b',
            'coach': r'\b([SB]\d{1,2}|[ABCD]\d|H1|HA1)\b',
            'class': r'\b(1AC|2AC|3AC|SL|CC|2S|GEN|1A|2A|3A)\b',
            'platform': r'platform\s*(\d{1,2})',
        }
        # Major Indian stations
        self.stations = {
            'delhi': 'NDLS', 'new delhi': 'NDLS',
            'mumbai': 'CSMT', 'mumbai central': 'MMCT',
            'chennai': 'MAS', 'kolkata': 'HWH',
            'bangalore': 'SBC', 'bengaluru': 'SBC',
            'hyderabad': 'SC', 'pune': 'PUNE',
            'jaipur': 'JP', 'lucknow': 'LKO',
            'ahmedabad': 'ADI', 'patna': 'PNBE',
            'varanasi': 'BSB', 'kanpur': 'CNB',
        }

    def extract(self, text):
        """Extract all entities from a query."""
        entities = {}
        text_lower = text.lower()

        # Regex-based entity extraction
        for entity_type, pattern in self.patterns.items():
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                entities[entity_type] = match.group(1)

        # Station name extraction
        stations_found = []
        for name, code in self.stations.items():
            if name in text_lower:
                stations_found.append({'name': name, 'code': code})
        if stations_found:
            entities['stations'] = stations_found

        return entities


# ─── Step 2: Intent Classification Model ───

class IRCTCIntentClassifier(nn.Module):
    """BERT-based intent classifier for IRCTC queries."""

    INTENTS = [
        'pnr_status', 'train_schedule', 'ticket_booking',
        'ticket_cancel', 'seat_availability', 'platform_info',
        'food_order', 'complaint', 'fare_enquiry',
        'live_status', 'tatkal_booking', 'general_query'
    ]

    def __init__(self, bert_model="google/muril-base-cased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_model)
        hidden = self.bert.config.hidden_size

        self.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(hidden, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, len(self.INTENTS))
        )

        # Confidence threshold: below this, escalate to human
        self.confidence_threshold = 0.75

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]
        logits = self.classifier(cls_output)
        return logits

    def predict(self, text, tokenizer, device):
        """Predict intent with confidence score."""
        self.eval()
        inputs = tokenizer(text, return_tensors="pt",
                          truncation=True, max_length=64,
                          padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            logits = self.forward(inputs['input_ids'],
                                  inputs['attention_mask'])
        probs = torch.softmax(logits, dim=-1)[0]
        top_prob, top_idx = probs.max(dim=0)

        intent = self.INTENTS[top_idx.item()]
        confidence = top_prob.item()

        return {
            'intent': intent,
            'confidence': confidence,
            'escalate': confidence < self.confidence_threshold
        }


# ─── Step 3: Response Templates ───

RESPONSE_TEMPLATES = {
    'pnr_status': "🚂 Aapka PNR {pnr} ka status: {status}. "
                  "Coach {coach}, Berth {berth}. Yatra mangalmay ho!",

    'train_schedule': "🕐 Train {train_number} ka schedule:\n"
                      "Departure: {dep_time} ({source})\n"
                      "Arrival: {arr_time} ({dest})",

    'ticket_booking': "🎫 {source} se {dest} ke liye {class} mein "
                      "ticket available hai. Fare: ₹{fare}. "
                      "Book karna chahte hain?",

    'seat_availability': "💺 {date} ko {train_number} mein {class}: "
                          "{available} seats available.",

    'live_status': "📍 Train {train_number} abhi {location} pe hai. "
                   "Expected delay: {delay} min.",

    'complaint': "📝 Aapki complaint register ho gayi hai. "
                 "Reference: {ref_no}. 24 ghante mein response milega.",

    'fare_enquiry': "💰 {source} → {dest} fare:\n"
                    "1AC: ₹{fare_1ac} | 2AC: ₹{fare_2ac} | "
                    "3AC: ₹{fare_3ac} | SL: ₹{fare_sl}",
}


# ─── Step 4: Full Chatbot Pipeline ───

class IRCTCChatbot:
    """End-to-end IRCTC chatbot combining intent + entities + response."""

    def __init__(self, model_path, device='cpu'):
        self.device = torch.device(device)
        self.tokenizer = AutoTokenizer.from_pretrained(
            "google/muril-base-cased"
        )
        self.intent_model = IRCTCIntentClassifier()
        self.intent_model.load_state_dict(
            torch.load(model_path, map_location=self.device)
        )
        self.intent_model.to(self.device)
        self.entity_extractor = IRCTCEntityExtractor()

    def respond(self, user_query):
        """Process user query and generate response."""
        # Step 1: Classify intent
        result = self.intent_model.predict(
            user_query, self.tokenizer, self.device
        )

        # Step 2: Extract entities
        entities = self.entity_extractor.extract(user_query)

        # Step 3: Check if escalation needed
        if result['escalate']:
            return {
                'response': "Main aapko humare agent se connect "
                           "karta hoon. Please hold karein.",
                'intent': result['intent'],
                'confidence': result['confidence'],
                'escalated': True
            }

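        # Step 4: Generate response from template. Slots that match an
        # extracted entity are filled here; the remaining slots
        # ({status}, {dep_time}, ...) stay as placeholders until a
        # backend API lookup (not shown) supplies their values.
        template = RESPONSE_TEMPLATES.get(
            result['intent'],
            "Main samajh nahi paaya. Kya aap dobara bata sakte hain?"
        )
        response = re.sub(
            r'\{(\w+)\}',
            lambda m: str(entities.get(m.group(1), m.group(0))),
            template
        )

        return {
            'response': response,
            'intent': result['intent'],
            'confidence': result['confidence'],
            'entities': entities,
            'escalated': False
        }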


# ─── Step 5: Evaluation ───

def evaluate_chatbot(model, test_loader, device):
    """Evaluate intent classification accuracy."""
    model.eval()
    correct, total = 0, 0
    per_intent_correct = {}
    per_intent_total = {}

    with torch.no_grad():
        for batch in test_loader:
            logits = model(
                batch['input_ids'].to(device),
                batch['attention_mask'].to(device)
            )
            preds = logits.argmax(dim=-1)
            labels = batch['labels'].to(device)

            correct += (preds == labels).sum().item()
            total += labels.size(0)

            for p, l in zip(preds, labels):
                intent = IRCTCIntentClassifier.INTENTS[l.item()]
                per_intent_total[intent] = per_intent_total.get(intent, 0) + 1
                if p == l:
                    per_intent_correct[intent] = \
                        per_intent_correct.get(intent, 0) + 1

    overall_acc = correct / total
    print(f"Overall Accuracy: {overall_acc:.4f}\n")
    print(f"{'Intent':<20} {'Accuracy':>10}")
    print("-" * 32)
    for intent in IRCTCIntentClassifier.INTENTS:
        acc = per_intent_correct.get(intent, 0) / \
              max(per_intent_total.get(intent, 1), 1)
        print(f"{intent:<20} {acc:>10.2%}")

    return overall_acc
Overall Accuracy: 0.9147

Intent                 Accuracy
--------------------------------
pnr_status                95.2%
train_schedule            93.8%
ticket_booking            92.1%
ticket_cancel             90.4%
seat_availability         91.7%
platform_info             89.3%
food_order                88.6%
complaint                 87.2%
fare_enquiry              93.1%
live_status               94.5%
tatkal_booking            91.8%
general_query             84.7%  ← Hardest: vague queries
Production deployment tip: Start with a confidence threshold of 0.75, then tune it against the false positive rate (confident but wrong answers) measured on held-out queries. At Flipkart, the chatbot uses a 3-tier confidence system: >0.90 → auto-respond, 0.70-0.90 → respond with "Did you mean...?", <0.70 → escalate to a human agent. This balances automation savings against customer satisfaction.
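
The three-tier scheme in the tip can be sketched as a small routing function. The function name, the action labels, and the choice to treat the boundary values 0.70 and 0.90 as part of the middle tier are assumptions made for illustration; the thresholds themselves come from the tip.

```python
def route_by_confidence(confidence, auto_threshold=0.90,
                        clarify_threshold=0.70):
    """Map a classifier confidence to one of three actions.

    Hypothetical helper: >auto_threshold answers directly, the middle
    band asks the user to confirm, anything lower goes to a human.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence > auto_threshold:
        return "auto_respond"
    if confidence >= clarify_threshold:
        return "confirm_intent"   # e.g. reply with "Did you mean ...?"
    return "escalate"


for score in (0.97, 0.82, 0.41):
    print(f"{score:.2f} -> {route_by_confidence(score)}")
```

Keeping the thresholds as parameters makes it straightforward to tune them per intent once production error rates are available.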

18.5 Project 4: Named Entity Recognition for Indian News

🏷️ PROJECT 4: SEQUENCE LABELING

The Business Problem

Indian news outlets like NDTV, Aaj Tak, and The Hindu process 50,000+ articles daily across Hindi, English, Tamil, Telugu, and Bengali. Automatically extracting entities (who: Person; which company: Organization; where: Location; when: Date; how much: Currency in ₹) powers news categorization, knowledge graphs, and fact-checking at scale.

Entity Types for Indian News

Entity | Tag | Example (Hindi) | Example (English)
Person | PER | नरेंद्र मोदी | Narendra Modi
Organization | ORG | रिलायंस इंडस्ट्रीज | Reliance Industries
Location | LOC | नई दिल्ली | New Delhi
Date | DATE | 15 अगस्त 2024 | 15 August 2024
Currency | CUR | ₹1,500 करोड़ | ₹1,500 crore

BIO Tagging Scheme

BIO (Beginning-Inside-Outside) Tagging

Scheme

Each token gets one of: B-TYPE (beginning of entity), I-TYPE (inside entity), or O (outside / not an entity).

Example
Token:  नरेंद्र   मोदी    ने   रिलायंस   इंडस्ट्रीज   को   ₹1,500   करोड़   दिये
Tag:    B-PER   I-PER   O    B-ORG     I-ORG       O    B-CUR    I-CUR   O

Tag Count

5 entity types × 2 (B, I) + 1 (O) = 11 tags total.
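
The tag arithmetic and the tagging rule can be checked with a few lines of code. This is a minimal sketch: the helper names build_tag_set and spans_to_bio are hypothetical, spans are given as (start, end_exclusive, type) over token indices, and the tokens are a romanized version of the example sentence.

```python
ENTITY_TYPES = ["PER", "ORG", "LOC", "DATE", "CUR"]


def build_tag_set(entity_types):
    """One O tag plus a B- and I- tag per entity type."""
    tags = ["O"]
    for etype in entity_types:
        tags += [f"B-{etype}", f"I-{etype}"]
    return tags


def spans_to_bio(num_tokens, spans):
    """Convert (start, end_exclusive, type) token spans to BIO tags."""
    tags = ["O"] * num_tokens
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"           # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"           # remaining tokens
    return tags


tokens = ["Narendra", "Modi", "ne", "Reliance", "Industries",
          "ko", "₹1,500", "crore", "diye"]
spans = [(0, 2, "PER"), (3, 5, "ORG"), (6, 8, "CUR")]
bio = spans_to_bio(len(tokens), spans)

print(len(build_tag_set(ENTITY_TYPES)), "tags")
for token, tag in zip(tokens, bio):
    print(f"{token:<12} {tag}")
```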

Architecture Comparison: Bi-LSTM-CRF vs Transformer

Python
# ─── Project 4A: Bi-LSTM-CRF for Indian NER ───
# Classic sequence labeling approach

import torch
import torch.nn as nn
from torchcrf import CRF
import numpy as np

# Tag set for Indian news NER
TAG_TO_IDX = {
    'O': 0,
    'B-PER': 1, 'I-PER': 2,
    'B-ORG': 3, 'I-ORG': 4,
    'B-LOC': 5, 'I-LOC': 6,
    'B-DATE': 7, 'I-DATE': 8,
    'B-CUR': 9, 'I-CUR': 10,
}
IDX_TO_TAG = {v: k for k, v in TAG_TO_IDX.items()}
NUM_TAGS = len(TAG_TO_IDX)


class BiLSTMCRF(nn.Module):
    """Bi-LSTM-CRF for Named Entity Recognition.

    Architecture:
    ┌────────────┐
    │    CRF     │ ← Ensures valid tag transitions
    ├────────────┤    (e.g., I-PER can't follow B-ORG)
    │  Linear    │
    ├────────────┤
    │  Bi-LSTM   │ ← Captures bidirectional context
    │  (2 layers)│
    ├────────────┤
    │  Char-CNN  │ ← Character-level features (morphology)
    │  + Word Emb│
    └────────────┘
    """

    def __init__(self, vocab_size, char_vocab_size,
                 word_emb_dim=300, char_emb_dim=30,
                 char_hidden=50, lstm_hidden=256,
                 num_layers=2, dropout=0.5):
        super().__init__()

        # Word embeddings (can load Hindi fastText vectors)
        self.word_embedding = nn.Embedding(vocab_size, word_emb_dim,
                                           padding_idx=0)

        # Character-level CNN (captures morphological patterns)
        self.char_embedding = nn.Embedding(char_vocab_size, char_emb_dim,
                                           padding_idx=0)
        self.char_cnn = nn.Conv1d(
            char_emb_dim, char_hidden,
            kernel_size=3, padding=1
        )

        # Bi-LSTM
        input_dim = word_emb_dim + char_hidden
        self.lstm = nn.LSTM(
            input_dim, lstm_hidden,
            num_layers=num_layers,
            bidirectional=True,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Emission scores
        self.hidden2tag = nn.Linear(lstm_hidden * 2, NUM_TAGS)
        self.dropout = nn.Dropout(dropout)

        # CRF layer: learns valid tag transitions
        self.crf = CRF(NUM_TAGS, batch_first=True)

    def _get_char_features(self, char_ids):
        """Compute character-level features using CNN."""
        batch, seq_len, char_len = char_ids.shape
        char_ids = char_ids.view(-1, char_len)
        char_emb = self.char_embedding(char_ids)  # (B*S, C, D)
        char_emb = char_emb.permute(0, 2, 1)      # (B*S, D, C)
        char_cnn = self.char_cnn(char_emb)         # (B*S, H, C)
        char_features = char_cnn.max(dim=2)[0]    # Max pool: (B*S, H)
        char_features = char_features.view(batch, seq_len, -1)
        return char_features

    def _get_emissions(self, word_ids, char_ids):
        """Compute emission scores from Bi-LSTM."""
        word_emb = self.word_embedding(word_ids)
        char_feat = self._get_char_features(char_ids)
        combined = torch.cat([word_emb, char_feat], dim=-1)
        combined = self.dropout(combined)

        lstm_out, _ = self.lstm(combined)
        lstm_out = self.dropout(lstm_out)
        emissions = self.hidden2tag(lstm_out)
        return emissions

    def forward(self, word_ids, char_ids, tags, mask):
        """Compute negative log-likelihood loss."""
        emissions = self._get_emissions(word_ids, char_ids)
        loss = -self.crf(emissions, tags, mask=mask,
                         reduction='mean')
        return loss

    def predict(self, word_ids, char_ids, mask):
        """Viterbi decoding for best tag sequence."""
        emissions = self._get_emissions(word_ids, char_ids)
        best_tags = self.crf.decode(emissions, mask=mask)
        return best_tags


# ─── Project 4B: Transformer NER (BERT-based) ───

# AutoTokenizer is used by TransformerNER below, so import it here too
from transformers import AutoTokenizer, AutoModelForTokenClassification

class TransformerNER:
    """Transformer-based NER using MuRIL/IndicBERT."""

    def __init__(self, model_name="google/muril-base-cased"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(
            model_name,
            num_labels=NUM_TAGS,
            id2label=IDX_TO_TAG,
            label2id=TAG_TO_IDX,
        )

    def predict_entities(self, text):
        """Predict entities in text, handling subword tokens."""
        inputs = self.tokenizer(
            text, return_tensors="pt",
            return_offsets_mapping=True,
            truncation=True, max_length=512
        )
        offset_mapping = inputs.pop('offset_mapping')[0]

        with torch.no_grad():
            outputs = self.model(**inputs)
        preds = outputs.logits.argmax(dim=-1)[0]

        # Map subword predictions back to words
        entities = []
        current_entity = None

        for idx, (pred, offset) in enumerate(
            zip(preds, offset_mapping)
        ):
            if offset[0] == 0 and offset[1] == 0:
                continue  # Skip [CLS], [SEP]

            tag = IDX_TO_TAG[pred.item()]
            start, end = offset[0].item(), offset[1].item()
            token_text = text[start:end]

            if tag.startswith('B-'):
                if current_entity:
                    entities.append(current_entity)
                current_entity = {
                    'type': tag[2:],
                    'text': token_text,
                    'start': start, 'end': end
                }
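            elif tag.startswith('I-') and current_entity:
                # Re-slice the source text so that spaces between the
                # words of a multi-word entity are preserved
                current_entity['end'] = end
                current_entity['text'] = text[current_entity['start']:end]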
            else:
                if current_entity:
                    entities.append(current_entity)
                    current_entity = None

        if current_entity:
            entities.append(current_entity)

        return entities


# ─── Evaluation: Entity-Level F1 ───

from seqeval.metrics import classification_report as ner_report

def evaluate_ner(true_tags_list, pred_tags_list):
    """Evaluate NER using entity-level F1 (seqeval library)."""
    print(ner_report(true_tags_list, pred_tags_list, digits=4))
Bi-LSTM-CRF Results on Indian News NER:
              precision    recall  f1-score   support
       B-CUR     0.9234    0.8876    0.9051       267
      B-DATE     0.9012    0.8934    0.8973       445
       B-LOC     0.8756    0.8621    0.8688      1023
       B-ORG     0.8523    0.8245    0.8382       876
       B-PER     0.8912    0.8789    0.8850      1134
   micro avg     0.8821    0.8637    0.8728      3745
   macro avg     0.8887    0.8693    0.8789      3745

Transformer (MuRIL) Results on Indian News NER:
              precision    recall  f1-score   support
       B-CUR     0.9567    0.9401    0.9483       267
      B-DATE     0.9389    0.9326    0.9357       445
       B-LOC     0.9234    0.9089    0.9161      1023
       B-ORG     0.9012    0.8867    0.8939       876
       B-PER     0.9345    0.9234    0.9289      1134
   micro avg     0.9267    0.9134    0.9200      3745
   macro avg     0.9309    0.9183    0.9246      3745
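
Entity-level F1 counts a prediction as correct only when both the span and the type match exactly, which is stricter than token accuracy. The sketch below illustrates that idea in plain Python. It is not seqeval's implementation: the function names are hypothetical, and a stray I- tag with no matching B- is simply ignored here, whereas seqeval handles such cases according to its own rules.

```python
def extract_entities(tags):
    """Return a set of (type, start, end_exclusive) spans from BIO tags."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel flushes last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            entities.add((etype, start, i))        # close the open span
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        else:
            start, etype = None, None              # "O" or stray I- tag
    return entities


def entity_f1(true_tags, pred_tags):
    """Exact-match precision, recall and F1 over entity spans."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-CUR", "I-CUR", "O"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O", "O", "B-CUR", "I-CUR", "O"]
p, r, f = entity_f1(gold, pred)
token_acc = sum(g == q for g, q in zip(gold, pred)) / len(gold)
print(f"token accuracy={token_acc:.3f}  entity P={p:.3f} R={r:.3f} F1={f:.3f}")
```

A single wrong token leaves token accuracy at 8 of 9, but the truncated ORG span counts as both a missed entity and a spurious one, so entity-level F1 drops to 2/3.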

Bi-LSTM-CRF vs Transformer NER: Head-to-Head

Aspect | Bi-LSTM-CRF | Transformer (MuRIL)
Entity F1 | 87.28% | 92.00%
Parameters | ~5M | ~237M (BERT-base encoder plus a ~197K-token vocabulary)
Training Time | ~20 min (GPU) | ~2 hrs (GPU)
Inference Speed | ~2ms/sentence | ~15ms/sentence
Code-Mixed Text | Poor (separate embeddings) | Good (multilingual pretrain)
Low Resource | Needs 10K+ examples | Works with 2-3K (transfer)
CRF Constraints | ✅ Hard constraints | ❌ Soft (but learns them)
Best For | Speed-critical, single-language | Multilingual, accuracy-first
Ignoring subword alignment in Transformer NER. When BERT tokenizes "मुंबई" as ["मुं", "##बई"], you get predictions for each subword. You must align predictions back to the original word, typically by taking the prediction of the first subword and ignoring the rest. Failing to do this inflates your entity count and gives meaningless results.
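
The first-subword rule can be sketched as follows. The function takes the per-token word indices in the form that a HuggingFace fast tokenizer's word_ids() returns (None for special tokens) and keeps a label only on the first subword of each word. The function name and the sample values are illustrative assumptions; -100 is the default ignore_index of PyTorch's CrossEntropyLoss.

```python
IGNORE_INDEX = -100  # skipped by torch.nn.CrossEntropyLoss by default


def align_labels_to_first_subword(word_ids, word_labels):
    """Label the first subword of each word; mask everything else."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:                  # special tokens: [CLS], [SEP], pad
            aligned.append(IGNORE_INDEX)
        elif wid != previous:            # first subword of a new word
            aligned.append(word_labels[wid])
        else:                            # continuation subword ("##...")
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned


# Illustrative input: 3 words split into 2 + 1 + 3 subwords.
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_labels = [5, 0, 1]   # B-LOC, O, B-PER under TAG_TO_IDX
print(align_labels_to_first_subword(word_ids, word_labels))
```

At inference time the same word_ids list can be used in reverse: read the prediction at each word's first subword and discard the rest.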

18.6 Project 5: Automatic Speech Recognition for Indian Languages

🎤 PROJECT 5: SPEECH RECOGNITION

The Business Problem

India has 800 million smartphone users, but only ~10% are comfortable typing in English. Voice is the natural interface: Google reports 30% of Indian search queries are voice-based. Jio, Paytm, and PhonePe all need ASR that works for Indian-accented English AND regional languages. The challenge: most global ASR models fail catastrophically on Indian accents and code-switched speech.

Wav2Vec 2.0 for Indian English

Wav2Vec 2.0: Self-Supervised Speech Model

Architecture

Wav2Vec 2.0 is the "BERT of speech." It uses a CNN feature encoder + Transformer to learn speech representations from raw audio without any transcription labels (self-supervised pretraining), then fine-tunes on labeled speech data.

Three-Stage Pipeline
  1. Feature Encoder: 7-layer CNN converts raw 16kHz audio → 50Hz latent speech representations
  2. Contextualized Transformer: 12-24 Transformer layers learn contextual representations (like BERT for audio)
  3. CTC Head: Connectionist Temporal Classification decodes character/token probabilities at each timestep
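
The CTC head in stage 3 emits one symbol per frame, so the raw output contains repeats and blank symbols. Greedy CTC decoding collapses consecutive repeats and then removes blanks. A minimal sketch of that rule; the tiny vocabulary, the blank id of 0, and the frame sequence are made up for illustration:

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeats, then drop blank symbols."""
    chars, previous = [], None
    for idx in frame_ids:
        # Emit only when the symbol changes and is not the blank.
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


vocab = {0: "<blank>", 1: "n", 2: "a", 3: "m", 4: "s", 5: "t", 6: "e"}

# Per-frame argmax ids, roughly one every 20 ms of audio.
frames = [1, 1, 2, 0, 3, 2, 2, 0, 4, 5, 5, 6, 0]
print(ctc_greedy_decode(frames, vocab))

# A blank between identical symbols keeps both of them.
print(ctc_greedy_decode([2, 0, 2], vocab))
```

The blank symbol is what lets CTC represent genuinely doubled characters: without the blank in [2, 0, 2], the two "a" frames would merge into one.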
Why It Works for Low-Resource Indian Languages

Pretrain on unlabeled audio (abundant: All India Radio, YouTube, podcasts), then fine-tune on just 10-50 hours of labeled speech. This is crucial for languages like Odia, Assamese, and Maithili where labeled data is scarce.

AI4Bharat's IndicWav2Vec

AI4Bharat (IIT Madras) pretrained Wav2Vec 2.0 on roughly 40,000 hours of unlabeled Indian speech across 9 languages: Hindi, Bengali, Tamil, Telugu, Gujarati, Kannada, Malayalam, Marathi, and Odia. In the table below, their model, IndicWav2Vec, lowers WER by about 7 to 12 absolute points (roughly 35-38% relative) compared to Facebook's English Wav2Vec when fine-tuned on Indian languages.
Language | Unlabeled Hours | Labeled Hours | WER (IndicWav2Vec) | WER (Facebook w2v2)
Hindi | 10,400 | 250 | 12.4% | 19.8%
Bengali | 4,200 | 120 | 15.7% | 24.3%
Tamil | 5,100 | 180 | 14.2% | 22.1%
Telugu | 4,800 | 150 | 16.1% | 25.6%
Marathi | 3,600 | 90 | 17.3% | 27.4%
Gujarati | 2,800 | 80 | 18.9% | 29.1%
Kannada | 3,100 | 100 | 16.8% | 26.2%
Malayalam | 3,400 | 110 | 15.5% | 23.8%
Odia | 2,200 | 60 | 20.4% | 32.7%
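
WER improvements can be quoted as absolute points or as a relative percentage, and the two are easy to confuse. The sketch below computes both for every row of the table; the WER values are copied from the table, while the function name wer_reduction is a hypothetical helper.

```python
# (IndicWav2Vec WER %, Facebook w2v2 WER %) copied from the table above.
WER_TABLE = {
    "Hindi": (12.4, 19.8), "Bengali": (15.7, 24.3), "Tamil": (14.2, 22.1),
    "Telugu": (16.1, 25.6), "Marathi": (17.3, 27.4), "Gujarati": (18.9, 29.1),
    "Kannada": (16.8, 26.2), "Malayalam": (15.5, 23.8), "Odia": (20.4, 32.7),
}


def wer_reduction(indic_wer, baseline_wer):
    """Return (absolute points, relative fraction) of WER reduction."""
    absolute = baseline_wer - indic_wer
    return absolute, absolute / baseline_wer


reductions = {lang: wer_reduction(*pair) for lang, pair in WER_TABLE.items()}
for lang, (abs_pts, rel) in reductions.items():
    print(f"{lang:<10} -{abs_pts:4.1f} points  ({rel:.1%} relative)")
```

Stating which of the two measures is meant avoids ambiguity when comparing results across papers.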
Python
# ─── Project 5: ASR with IndicWav2Vec ───
# Automatic Speech Recognition for Indian languages

import torch
import torchaudio
from transformers import (
    Wav2Vec2ForCTC, Wav2Vec2Processor,
    Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor
)
import numpy as np

# ─── Step 1: Load IndicWav2Vec ───

MODEL_NAME = "ai4bharat/indicwav2vec_v1_hindi"

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)
model.eval()

# ─── Step 2: Inference Function ───

def transcribe_hindi(audio_path):
    """Transcribe Hindi speech to text."""
    # Load audio (must be 16kHz mono)
    waveform, sample_rate = torchaudio.load(audio_path)

    # Resample if needed
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)

    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Process through model
    inputs = processor(
        waveform.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    )

    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # CTC decode: greedy (argmax at each timestep)
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]

    return transcription


# ─── Step 3: Evaluation (WER) ───

def word_error_rate(reference, hypothesis):
    """Compute Word Error Rate using dynamic programming."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    n = len(ref_words)
    m = len(hyp_words)

    # DP table for edit distance
    dp = np.zeros((n + 1, m + 1))
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref_words[i-1] == hyp_words[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(
                    dp[i-1][j],    # Deletion
                    dp[i][j-1],    # Insertion
                    dp[i-1][j-1]  # Substitution
                )

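    # Guard against an empty reference to avoid division by zero
    if n == 0:
        return 0.0 if m == 0 else float('inf')

    wer = dp[n][m] / n
    return wer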

# Example evaluation
ref = "नमस्ते मैं हिंदी में बात कर रहा हूँ"
hyp = "नमस्ते मैं हिंदी में बात कर रहा हूं"
print(f"WER: {word_error_rate(ref, hyp):.2%}")
# WER: 12.50% (1 word different out of 8)
The Bhashini platform (bhashini.gov.in), which draws on AI4Bharat models such as IndicWav2Vec and IndicTrans, enables real-time speech-to-speech translation across Indian languages. A Tamil farmer can speak to a Hindi-speaking government official: the system transcribes Tamil → translates to Hindi → synthesizes Hindi speech. This is India's answer to the language barrier, processing 10 lakh+ translations daily.
Section 4

From-Scratch Code: Building a Minimal Attention-Based Classifier

To understand the fundamentals, let's build a simple attention-based text classifier from scratch in NumPy, with no PyTorch and no HuggingFace. This demonstrates the core mechanism behind the Transformer models used in the projects above.

Python (NumPy Only)
# ─── From-Scratch: Attention-Based Text Classifier ───
# Demonstrates the attention mechanism powering all 5 projects

import numpy as np

class ScratchAttentionClassifier:
    """
    A simple attention-based text classifier built entirely in NumPy.

    Architecture:
    Input → Embedding → Self-Attention → Weighted Sum → Softmax → Class

    This is the core mechanism behind BERT/MuRIL fine-tuning.
    """

    def __init__(self, vocab_size, embed_dim=64,
                 num_classes=3, max_len=50):
        self.embed_dim = embed_dim
        self.num_classes = num_classes
        self.max_len = max_len

        # Xavier initialization
        scale = np.sqrt(2.0 / (vocab_size + embed_dim))
        self.W_emb = np.random.randn(vocab_size, embed_dim) * scale

        # Attention weights: Q, K, V projections
        scale_attn = np.sqrt(2.0 / (embed_dim + embed_dim))
        self.W_Q = np.random.randn(embed_dim, embed_dim) * scale_attn
        self.W_K = np.random.randn(embed_dim, embed_dim) * scale_attn
        self.W_V = np.random.randn(embed_dim, embed_dim) * scale_attn

        # Classification head
        scale_cls = np.sqrt(2.0 / (embed_dim + num_classes))
        self.W_cls = np.random.randn(embed_dim, num_classes) * scale_cls
        self.b_cls = np.zeros(num_classes)

    def softmax(self, x, axis=-1):
        """Numerically stable softmax."""
        e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return e_x / e_x.sum(axis=axis, keepdims=True)

    def self_attention(self, X, mask=None):
        """
        Scaled dot-product self-attention.

        Q = X @ W_Q, K = X @ W_K, V = X @ W_V
        Attention(Q,K,V) = softmax(Q @ K^T / sqrt(d_k)) @ V
        """
        Q = X @ self.W_Q  # (seq_len, d)
        K = X @ self.W_K
        V = X @ self.W_V

        d_k = Q.shape[-1]
        scores = (Q @ K.T) / np.sqrt(d_k)  # (seq_len, seq_len)

        if mask is not None:
            scores = np.where(mask, scores, -1e9)

        attn_weights = self.softmax(scores, axis=-1)
        context = attn_weights @ V  # (seq_len, d)

        return context, attn_weights

    def forward(self, token_ids):
        """
        Forward pass: tokens → embedding → attention → classify.

        Args:
            token_ids: list of integer token IDs

        Returns:
            class_probs: (num_classes,) probability distribution
            attn_weights: (seq_len, seq_len) attention matrix
        """
        # Embedding lookup
        X = self.W_emb[token_ids]  # (seq_len, embed_dim)

        # Self-attention
        context, attn_weights = self.self_attention(X)

        # Pool: mean of context vectors (plays the role of BERT's [CLS] vector)
        pooled = context.mean(axis=0)  # (embed_dim,)

        # Classify
        logits = pooled @ self.W_cls + self.b_cls
        probs = self.softmax(logits)

        return probs, attn_weights

    def predict(self, token_ids):
        """Get predicted class and confidence."""
        probs, attn = self.forward(token_ids)
        pred_class = np.argmax(probs)
        labels = ['Positive', 'Negative', 'Neutral']
        return labels[pred_class], probs[pred_class], attn


# ─── Demo ───
np.random.seed(42)
classifier = ScratchAttentionClassifier(vocab_size=5000)

# Simulate tokenized input: "bahut accha product hai"
token_ids = [142, 87, 1203, 56]
label, confidence, attn = classifier.predict(token_ids)
print(f"Prediction: {label} (conf: {confidence:.2%})")
print(f"Attention matrix shape: {attn.shape}")
print(f"Token 'accha' attends most to: token {np.argmax(attn[1])}")
Prediction: Neutral (conf: 34.12%)  ← Random weights, not trained!
Attention matrix shape: (4, 4)
Token 'accha' attends most to: token 2
This from-scratch model shows the attention mechanism that MuRIL, IndicBERT, and all Transformer models use internally. The key insight: attention lets the model learn which words to focus on for each task. In sentiment analysis, it learns to attend to sentiment words ("accha", "kharab"); in NER, it attends to entity boundaries; in summarization, it attends to topic sentences.
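
The self_attention method above accepts an optional mask, but forward never builds one. When sequences are padded to a common length, a padding mask keeps real tokens from attending to pad positions. A minimal NumPy sketch, assuming pad id 0 (an assumption, since the chapter does not fix a pad id) and random scores in place of Q @ K.T:

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def padding_mask(token_ids, pad_id=0):
    """Boolean (seq_len, seq_len) mask: True where the key is a real token."""
    is_real = np.asarray(token_ids) != pad_id            # (seq_len,)
    n = len(token_ids)
    return np.broadcast_to(is_real, (n, n))              # same row repeated


token_ids = [142, 87, 1203, 56, 0, 0]                    # last two are padding
mask = padding_mask(token_ids)

rng = np.random.default_rng(0)
scores = rng.standard_normal((6, 6))                     # stand-in for QK^T/sqrt(d)
weights = softmax(np.where(mask, scores, -1e9), axis=-1)

print(np.round(weights[0], 3))   # weight on the two pad columns is ~0
```

With such a mask in place, the mean pooling step would also need to average over the real tokens only, otherwise the pad rows still dilute the pooled vector.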
Section 5

Industry Code: Production-Ready NLP Pipeline

Here's a production-grade pipeline combining all five projects into a unified Indian NLP system, as you might deploy at a company like Flipkart or Jio.

Python
# ─── Production Indian NLP Pipeline ───
# Unified system for multi-task Indian language processing

from transformers import pipeline
import torch

class IndianNLPPipeline:
    """Production NLP pipeline for Indian languages.

    Supports: Sentiment, NER, Intent Classification (zero-shot)
    Languages: Hindi, Hinglish, English, + regional via IndicBERT
    """

    def __init__(self, device="cuda" if torch.cuda.is_available()
                 else "cpu"):
        self.device = device
        print(f"Initializing on {device}...")

        # Sentiment Analysis (MuRIL fine-tuned)
        self.sentiment = pipeline(
            "text-classification",
            model="./muril-flipkart-sentiment",
            device=0 if device == "cuda" else -1
        )

        # NER (MuRIL fine-tuned for Indian entities)
        self.ner = pipeline(
            "ner",
            model="./muril-indian-ner",
            aggregation_strategy="simple",
            device=0 if device == "cuda" else -1
        )

        # Zero-shot classification for flexible intent detection
        self.zero_shot = pipeline(
            "zero-shot-classification",
            model="joeddav/xlm-roberta-large-xnli",
            device=0 if device == "cuda" else -1
        )

    def analyze(self, text, tasks=None):
        """Run all requested NLP tasks on input text."""
        if tasks is None:
            tasks = ["sentiment", "ner"]

        results = {"text": text}

        if "sentiment" in tasks:
            results["sentiment"] = self.sentiment(text)[0]

        if "ner" in tasks:
            entities = self.ner(text)
            results["entities"] = [
                {"text": e["word"], "type": e["entity_group"],
                 "score": round(e["score"], 3)}
                for e in entities
            ]

        if "intent" in tasks:
            candidate_labels = [
                "ticket booking", "complaint",
                "status inquiry", "general question"
            ]
            results["intent"] = self.zero_shot(
                text, candidate_labels
            )

        return results


# ─── Usage ───
nlp = IndianNLPPipeline()

result = nlp.analyze(
    "Reliance Industries ne Mumbai mein ₹500 crore ka naya plant "
    "kholne ka faisla kiya hai.",
    tasks=["sentiment", "ner"]
)

print(f"Sentiment: {result['sentiment']['label']}")
print(f"Entities:")
for e in result['entities']:
    print(f"  {e['type']:>5}: {e['text']} ({e['score']:.1%})")
Initializing on cuda...
Sentiment: Positive
Entities:
    ORG: Reliance Industries (96.7%)
    LOC: Mumbai (94.2%)
    CUR: ₹500 crore (91.8%)
Section 6

Visual Diagrams

6.1 The Complete Indian NLP Landscape

                      ┌─────────────────────────┐
                      │    INDIAN NLP STACK     │
                      └──────────┬──────────────┘
                                 │
      ┌──────────────────────────┼──────────────────────────┐
      │                          │                          │
┌─────▼─────┐             ┌──────▼──────┐           ┌───────▼──────┐
│ TEXT NLU  │             │   SPEECH    │           │  GENERATION  │
├───────────┤             ├─────────────┤           ├──────────────┤
│ Sentiment │             │ ASR         │           │ Translation  │
│ NER       │             │ (IndicW2V)  │           │ (IndicTrans) │
│ Intent    │             │             │           │              │
│ QA        │             │ TTS         │           │ Summarize    │
│ Classify  │             │ (IndicTTS)  │           │ (IndicBART)  │
└─────┬─────┘             └──────┬──────┘           └───────┬──────┘
      │                          │                          │
      └──────────────────────────┼──────────────────────────┘
                                 │
                ┌────────────────┼────────────────┐
                │                │                │
          ┌─────▼────┐    ┌──────▼──────┐   ┌─────▼─────┐
          │  MuRIL   │    │  IndicBERT  │   │   XLM-R   │
          │ (Google) │    │ (AI4Bharat) │   │  (Meta)   │
          └──────────┘    └─────────────┘   └───────────┘
                                 │
                      ┌──────────▼──────────┐
                      │ 22 Indian Languages │
                      │ 13+ Scripts         │
                      │ Code-Mixed Text     │
                      └─────────────────────┘

6.2 BIO Tagging NER Pipeline

Input: "नरेंद्र मोदी ने ₹1,500 करोड़ की योजना दिल्ली में शुरू की"

Token:      नरेंद्र   मोदी    ने    ₹1,500   करोड़   की    योजना   दिल्ली   में    शुरू   की
Bi-LSTM:    ◄─────────────────── Bidirectional Context ───────────────────►
Emissions:  [PER]   [PER]   [O]   [CUR]    [CUR]   [O]   [O]     [LOC]   [O]   [O]    [O]
CRF:        B-PER   I-PER   O     B-CUR    I-CUR   O     O       B-LOC   O     O      O

Entities:   "नरेंद्र मोदी" [PERSON]   "₹1,500 करोड़" [CURRENCY]   "दिल्ली" [LOCATION]

6.3 Extractive Summarization Architecture

Legal Document (40+ pages)
│
├── Sent 1: "The appellant filed a petition..."    ──▶ BERT [CLS] ──▶ h₁
├── Sent 2: "The respondent contended that..."     ──▶ BERT [CLS] ──▶ h₂
├── Sent 3: "Section 302 of IPC provides..."       ──▶ BERT [CLS] ──▶ h₃
├── Sent 4: "The evidence presented includes..."   ──▶ BERT [CLS] ──▶ h₄
├── ...
└── Sent N: "Therefore, the appeal is allowed..."  ──▶ BERT [CLS] ──▶ hₙ
                      │
               ┌──────▼──────┐
               │ Inter-Sent  │
               │ Transformer │  (2 layers)
               │ (Context)   │
               └──────┬──────┘
                      │
Score:    0.12    0.87    0.91    0.34    ...    0.95
                   ▲       ▲                      ▲
                   │       │                      │
Select Top-K: Sent2, Sent3, ..., SentN
                      │
               ┌──────▼──────┐
               │ SUMMARY     │
               │ (5-7 key    │
               │ sentences)  │
               └─────────────┘
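The final "Select Top-K" step can be sketched in a few lines. The scores below are the five shown in the diagram. The function name `select_top_k` and the choice to return sentences in document order are assumptions made for illustration.

```python
def select_top_k(sentences: list[str], scores: list[float], k: int) -> list[str]:
    """Keep the k highest-scoring sentences, then restore document order."""
    if len(sentences) != len(scores):
        raise ValueError("need exactly one score per sentence")
    # Indices sorted from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # back to the order in which sentences appear
    return [sentences[i] for i in keep]

sentences = ["Sent 1", "Sent 2", "Sent 3", "Sent 4", "Sent N"]
scores = [0.12, 0.87, 0.91, 0.34, 0.95]
summary = select_top_k(sentences, scores, k=3)
print(summary)
```

Restoring document order matters for legal text: a summary that states the holding before the facts is harder to follow than one that keeps the judgment's own sequence.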
Section 7

Worked Example: End-to-End Flipkart Review Analysis

Let's trace a single Flipkart review through the entire NLP pipeline, step by step.

Input Review

"Realme ka ye phone bahut accha hai, camera quality bhi mast hai lekin battery jaldi khatam ho jaati hai. ₹12,999 mein theek hai."

Step 1: Preprocessing

Original: "Realme ka ye phone bahut accha hai, camera quality bhi mast
           hai lekin battery jaldi khatam ho jaati hai. ₹12,999 mein theek hai."

After clean_text():
  → "realme ka ye phone बहुत अच्छा hai camera quality bhi मस्त
     hai lekin battery jaldi khatam ho jaati hai ₹12999 mein theek hai"

Language ratio: detect_language_ratio() → 0.31 (31% Hindi chars = Hinglish)
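One simple way to estimate a language ratio is to count characters by script. The sketch below is an assumption about how such a helper might work, not the chapter's `detect_language_ratio()`: the name `devanagari_ratio` is hypothetical, and counting code points (including vowel signs) is only one of several reasonable conventions, so its values need not match the figure quoted above.

```python
def devanagari_ratio(text: str) -> float:
    """Share of Devanagari code points among Devanagari + ASCII-letter characters.

    Digits, punctuation, spaces, and currency symbols are ignored.
    """
    # U+0900..U+097F is the Devanagari block (letters, vowel signs, digits, danda).
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097F")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = devanagari + latin
    return devanagari / total if total else 0.0

sample = "phone बहुत accha hai"
ratio = devanagari_ratio(sample)  # 4 Devanagari code points, 13 Latin letters
print(f"{ratio:.3f}")
```

A ratio near 0 suggests romanized or English text, a ratio near 1 suggests native-script Hindi, and values in between point to mixed-script input that needs the code-mixing path.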

Step 2: MuRIL Tokenization

Tokens: ['[CLS]', 'real', '##me', 'ka', 'ye', 'phone', 'बहुत', 'अच्छा',
         'hai', 'camera', 'quality', 'bhi', 'मस्त', 'hai', 'le', '##kin',
         'battery', 'jal', '##di', 'khat', '##am', 'ho', 'ja', '##ati',
         'hai', '₹', '12', '##99', '##9', 'mein', 'the', '##ek', 'hai', '[SEP]']

Token IDs: [2, 8734, 1456, 342, 178, 4521, 6789, 7234, 156, ...]
Length: 34 tokens (within 128 max_length)
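The "##" pieces above are characteristic of BERT-style WordPiece tokenization, which splits each word greedily into the longest vocabulary entries it can find. The sketch below shows that matching rule on a toy vocabulary. The vocabulary and the function name `wordpiece` are assumptions, and a real tokenizer also handles casing, punctuation, and special tokens.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, chosen only to reproduce a few splits from the example above.
toy_vocab = {"real", "##me", "le", "##kin", "jal", "##di", "phone", "hai"}
for w in ["realme", "lekin", "jaldi", "phone"]:
    print(w, "->", wordpiece(w, toy_vocab))
```

Romanized Hindi words such as "lekin" and "jaldi" fragment into several pieces when the vocabulary lacks them as whole entries, which is one reason vocabularies built with transliterated text matter.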

Step 3: Sentiment Classification

MuRIL Output Logits: [2.34, -1.12, 0.45]  ← [Positive, Negative, Neutral]

Softmax probabilities:
  Positive: e^2.34 / (e^2.34 + e^-1.12 + e^0.45)
          = 10.38 / (10.38 + 0.33 + 1.57)
          = 10.38 / 12.28
          = 84.5%

  Negative: 0.33 / 12.28 = 2.7%
  Neutral:  1.57 / 12.28 = 12.8%

Prediction: ✅ POSITIVE (84.5% confidence)
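The softmax arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the standard library. The function name `softmax` and the max-subtraction trick for numerical stability are additions for illustration. Because the worked example rounds each exponential to two decimals, the unrounded probabilities can differ from it in the last displayed digit.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Positive", "Negative", "Neutral"]
probs = softmax([2.34, -1.12, 0.45])
for label, p in zip(labels, probs):
    print(f"{label}: {p:.4f}")
prediction = labels[probs.index(max(probs))]
```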

Step 4: NER Extraction

Token:      realme  ka  ye  phone  बहुत  ...  ₹      12999  mein  theek  hai
NER Tag:    B-ORG   O   O   O      O     ...  B-CUR  I-CUR  O     O      O

Extracted Entities:
  ORG: "Realme"    (score: 0.934)
  CUR: "₹12,999"  (score: 0.912)
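Turning the tag row into the entity list is a small decoding step. The sketch below shows one way to do it. The name `extract_entities` is hypothetical, tokens are joined with a space for simplicity, and a stray I- tag that does not continue an open entity is dropped, which is only one of several conventions.

```python
def extract_entities(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        nonlocal current_type, current_tokens
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()  # close any open entity before starting a new one
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            flush()  # "O", or an I- tag that does not continue the open entity
    flush()
    return entities

tokens = ["realme", "ka", "ye", "phone", "₹", "12999", "mein", "theek", "hai"]
tags = ["B-ORG", "O", "O", "O", "B-CUR", "I-CUR", "O", "O", "O"]
print(extract_entities(tokens, tags))
```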

Step 5: Aspect-Based Analysis (Advanced)

Aspects detected:
  📱 "camera quality" → Positive ("mast hai")
  🔋 "battery"        → Negative ("jaldi khatam ho jaati")
  💰 "₹12,999 mein"   → Neutral  ("theek hai")

Overall: Mixed Positive (good product, battery concern, fair price)
Section 8

Case Study: Koo's Multilingual Content Moderation

🐦 Koo: India's Multilingual Social Media Challenge

The Problem

Koo, India's microblogging platform, launched with support for 10 Indian languages: Hindi, Kannada, Tamil, Telugu, Bengali, Marathi, Gujarati, Punjabi, Malayalam, and Assamese. At its peak, Koo handled 50 lakh+ posts daily and needed to moderate content for:

  • Hate speech detection across all 10 languages
  • Fake news flagging, especially during elections
  • Spam detection, including transliterated spam
  • Sentiment tracking for trending topics

The NLP Challenge

Challenge | Details
10 languages × 3 tasks | 30 separate classifiers? Or one multilingual model?
Code-mixed posts | 40%+ posts mixed Hindi-English or regional-English
Transliteration | Kannada in Roman script: "nanu Bengaluru-ge hogthini"
Memes & images | Hate speech encoded in images with text overlays
Latency | <100ms per post for real-time moderation

Solution Architecture

Koo used a cascade architecture to balance accuracy and speed:

Post Input
         │
         ▼
┌──────────────────┐
│ 1. Fast Filter   │ ← Keyword blocklist + regex (< 1ms)
│    (Rules-based) │   Catches 60% of obvious violations
└────────┬─────────┘
         │ 40% need ML
         ▼
┌──────────────────┐
│ 2. Language      │ ← fastText LID (< 5ms)
│    Detection     │   Detects language + script
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 3. MuRIL-based   │ ← Single multilingual model (< 50ms)
│    Classifier    │   Hate/Safe/Borderline
└────────┬─────────┘
         │ Borderline (15%)
         ▼
┌──────────────────┐
│ 4. Human Review  │ ← Trained moderators
│    Queue         │   Final decision
└──────────────────┘
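The control flow of this cascade can be sketched with stub components. Everything below is illustrative: the blocklist terms, the `Decision` class, the `moderate` function, and the stub language detector and classifier are assumptions standing in for the real rules, fastText LID, and MuRIL-based model.

```python
from dataclasses import dataclass
from typing import Callable

BLOCKLIST = {"scamlink", "spamword"}  # hypothetical placeholder terms

@dataclass
class Decision:
    action: str  # "block", "allow", or "human_review"
    stage: str   # which stage decided: "rules" or "ml"

def moderate(
    post: str,
    detect_language: Callable[[str], str],
    classify: Callable[[str, str], str],
) -> Decision:
    # Stage 1: cheap rules run first and short-circuit obvious violations.
    if any(word in BLOCKLIST for word in post.lower().split()):
        return Decision("block", "rules")
    # Stage 2: language identification feeds the classifier.
    language = detect_language(post)
    # Stage 3: the ML classifier returns "hate", "safe", or "borderline".
    label = classify(post, language)
    if label == "hate":
        return Decision("block", "ml")
    if label == "borderline":
        return Decision("human_review", "ml")  # Stage 4: queue for moderators
    return Decision("allow", "ml")

# Stub components so the sketch runs without any model.
def fake_lid(post: str) -> str:
    return "hi"

def fake_classifier(post: str, language: str) -> str:
    if "hateful" in post:
        return "hate"
    if "maybe" in post:
        return "borderline"
    return "safe"

for text in ["buy now scamlink", "maybe offensive", "nice phone"]:
    print(text, "->", moderate(text, fake_lid, fake_classifier))
```

Passing the detector and classifier in as callables keeps the routing logic testable on its own and lets each stage be swapped without touching the others.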

Results

Metric | Before ML | After ML
Moderation speed | 4 hours avg | 3 seconds avg
Hate speech catch rate | 45% (manual) | 89% (automated)
False positive rate | 2% | 4.5% (higher but acceptable)
Human moderators needed | 200 | 35 (for borderline cases)
Monthly moderation cost | ₹80 lakh | ₹22 lakh

Key Lesson

The biggest lesson: a cascade architecture (fast rules → ML → human) outperforms pure ML or pure human moderation. The rules-based filter handles obvious cases instantly, ML handles the nuanced middle ground, and humans handle only truly ambiguous cases. This cut monthly moderation costs by about 72% (₹80 lakh to ₹22 lakh) while nearly doubling the hate speech catch rate (45% to 89%, a 98% relative improvement).
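The two percentages in the key lesson follow from the Results table, and both are relative changes rather than percentage-point differences. A quick check (the helper name `relative_change` is illustrative):

```python
def relative_change(before: float, after: float) -> float:
    """Change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before

cost_change = relative_change(80, 22)    # monthly cost in lakh rupees
catch_change = relative_change(45, 89)   # catch rate in percent
catch_gain_points = 89 - 45              # the same gain in percentage points

print(f"cost: {cost_change:.1%}, catch rate: {catch_change:.1%}, "
      f"gain: {catch_gain_points} points")
```

Reporting the catch-rate gain as 44 percentage points alongside the relative figure avoids the impression that the system catches 98% of hate speech.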

Section 9

Common Mistakes & Misconceptions

Mistake 1: Using mBERT for Indian languages. mBERT was trained on 104 languages, and Indian languages made up only ~2% of its training data. MuRIL or IndicBERT are usually better choices. On the IndicGLUE benchmark, MuRIL scores 76.5 vs mBERT's 62.3, a 14-point gap.
Mistake 2: Ignoring transliteration in Hindi NLP. ~60% of Hindi content on social media is written in Roman script ("bahut accha"). If your model only handles Devanagari, you lose the majority of real-world data. Always include transliterated text in your training data and use models pretrained on it (MuRIL).
Mistake 3: Using accuracy for NER evaluation. Since 80%+ tokens are "O" (non-entity), a model that predicts "O" for everything gets 80% accuracy. Always use entity-level F1 from the seqeval library, which evaluates complete entity spans (both boundaries and type must match).
Mistake 4: Using BLEU for summarization. BLEU is for machine translation (precision-based). For summarization, use ROUGE (recall-based) because we want to check if the summary covers the reference, not vice versa. ROUGE-2 and ROUGE-L are the standard metrics.
Mistake 5: Not handling Hindi/Devanagari normalization. Unicode normalization (NFC/NFKC) is critical. Two strings can look identical but use different Unicode sequences (precomposed vs. combining nukta, halant variations). For example, "क़" can be stored as the single code point U+0958 or as "क" (U+0915) followed by the nukta sign (U+093C). Always apply unicodedata.normalize('NFKC', text) before tokenization.
Mistake 6: Treating CRF as optional in NER. Without a CRF, a Bi-LSTM can predict illegal tag sequences such as "I-PER" directly after "B-ORG". The CRF layer learns transition scores that enforce the BIO constraints: for example, "I-PER" can only follow "B-PER" or "I-PER". This alone typically improves F1 by 2-4 points.
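The normalization issue in Mistake 5 is easy to reproduce with Python's standard library. The sketch below uses the letter क़, which Unicode allows as one precomposed code point or as a base letter plus a combining nukta. The wrapper name `normalize_text` is illustrative. The default form follows the NFKC recommendation above.

```python
import unicodedata

precomposed = "\u0958"       # क़ as a single code point (DEVANAGARI LETTER QA)
decomposed = "\u0915\u093C"  # क followed by the combining nukta sign

def normalize_text(text: str, form: str = "NFKC") -> str:
    """Map canonically equivalent sequences to a single representation."""
    return unicodedata.normalize(form, text)

print(precomposed == decomposed)                                  # raw strings differ
print(normalize_text(precomposed) == normalize_text(decomposed))  # equal after normalizing
print(len(precomposed), len(decomposed))                          # 1 vs 2 code points
```

Without this step, a tokenizer can map the two spellings to different pieces, so identical-looking reviews get different token IDs.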
Section 10

Comparison Tables

10.1 All Five Projects At a Glance

Project | Task | Model | Metric | Score
Hindi Sentiment | Classification | MuRIL | F1 Macro | 0.852
Legal Summarization | Extractive Summ. | IndicBERT + Transformer | ROUGE-2 | 0.369
IRCTC Chatbot | Intent Classification | MuRIL + Entity Regex | Accuracy | 91.5%
Indian News NER | Sequence Labeling | MuRIL (Token Clf.) | Entity F1 | 92.0%
Hindi ASR | Speech Recognition | IndicWav2Vec | WER | 12.4%

10.2 Indian Language Models Comparison

Model | Params | Languages | Code-Mix | Transliteration | Best Use
mBERT | 110M | 104 | Poor | Poor | Baseline only
XLM-RoBERTa | 270M | 100 | Fair | Fair | Cross-lingual transfer
MuRIL | 110M | 17 Indian | Excellent | Excellent | All Indian NLU
IndicBERT | 33M | 12 Indian | Good | Good | Lightweight deployment
IndicBART | 244M | 11 Indian | Good | Good | Generation tasks
IndicTrans2 | 320M | 22 Indian | N/A | N/A | Translation

10.3 NLP Metrics Cheat Sheet

Metric | Task | Formula Intuition | Range | Higher = Better?
Accuracy | Classification | Correct / Total | 0–1 | ✅
F1 (Macro) | Classification | Harmonic mean of P & R, averaged | 0–1 | ✅
ROUGE-1 | Summarization | Unigram overlap with reference | 0–1 | ✅
ROUGE-2 | Summarization | Bigram overlap with reference | 0–1 | ✅
ROUGE-L | Summarization | Longest common subsequence | 0–1 | ✅
Entity F1 | NER | Exact span + type match F1 | 0–1 | ✅
WER | ASR | (S+D+I) / N (edit distance) | 0–∞ | ❌ (lower is better)
BLEU | Translation | N-gram precision with brevity penalty | 0–1 | ✅
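The WER row is the least intuitive entry in the table, because its range is not capped at 1. The sketch below computes WER as word-level edit distance divided by the reference length. The function name `wer`, whitespace tokenization, and the romanized example sentences are assumptions made for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis: i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

print(wer("mera pnr status batao", "mera pnr batao"))  # one deleted word out of four
print(wer("a b", "a b c d e"))                         # insertions can push WER above 1
```

The second call shows why the range is 0–∞: a hypothesis much longer than the reference accumulates more insertions than there are reference words.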
Section 11

Exercises

Section A: Multiple Choice Questions

Q1.

Which model is specifically designed for Indian language understanding and outperforms mBERT on IndicGLUE?

  A. GPT-3
  B. XLM-RoBERTa-base
  C. MuRIL (Google)
  D. BERT-base-uncased
✅ C. MuRIL (Multilingual Representations for Indian Languages) is trained exclusively on 17 Indian languages + English with transliterated and code-mixed data. It outperforms mBERT by ~14 points on IndicGLUE.
Remember | Models
Q2.

In BIO tagging for NER, the tag sequence "B-PER I-ORG" is:

  A. Valid: it means a person at an organization
  B. Invalid: I-ORG can only follow B-ORG or I-ORG
  C. Valid, but only for code-mixed text
  D. Invalid: there is no I-ORG tag
✅ B. Invalid. In the BIO scheme, an I-TYPE tag can only follow a B-TYPE or I-TYPE tag of the same entity type. "I-ORG" after "B-PER" violates this constraint, which is exactly what CRF layers enforce.
Understand | NER
Q3.

For evaluating a legal document summarizer, which metric is most appropriate?

  A. BLEU score
  B. Accuracy
  C. ROUGE-2 F1
  D. Word Error Rate
✅ C. ROUGE-2 F1. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated and reference summaries. ROUGE-2 specifically measures bigram overlap, which captures phrase-level similarity. BLEU is for translation, WER is for ASR.
Understand | Evaluation
Q4.

A Flipkart review says "product toh sahi hai but delivery bahut late." What type of code-mixing is this?

  A. Intra-word mixing
  B. Sentence-level switching
  C. Word-level mixing (intra-sentential)
  D. Transliteration only
✅ C. Word-level mixing. English words ("product", "but", "delivery", "late") are mixed into a Hindi sentence at the word level, so the switching happens within the sentence rather than between sentences. This is the most common form of Hinglish code-mixing, occurring in ~40% of Indian social media text.
Apply | Code-Mixing
Q5.

Why does extractive summarization suit legal documents better than abstractive?

  A. Extractive is always more accurate
  B. Legal language requires exact wording; paraphrasing risks changing legal meaning
  C. Abstractive models cannot handle long documents
  D. Extractive is faster at inference time
✅ B. Legal text is precise: "Section 302 IPC" means murder. An abstractive model might paraphrase this differently, potentially changing the legal meaning. Extractive summarization selects original sentences, preserving exact legal language.
Analyze | Summarization
Q6.

In a Bi-LSTM-CRF for NER, what does the CRF layer primarily enforce?

  A. Word embedding quality
  B. Valid tag transition constraints (e.g., B-PER can be followed by I-PER but not I-ORG)
  C. Faster training convergence
  D. Better handling of long sequences
✅ B. The CRF (Conditional Random Field) layer learns a transition matrix between tags, enforcing that the output sequence follows valid BIO constraints. Without a CRF, the model predicts each token's tag independently and can produce illegal sequences like "O I-PER".
Understand | NER
Q7.

IndicWav2Vec achieves lower WER than Facebook's Wav2Vec 2.0 on Indian languages because:

  A. It uses a larger Transformer architecture
  B. It was pretrained on 40,000+ hours of unlabeled Indian speech
  C. It uses a different loss function
  D. It doesn't use self-supervised pretraining
✅ B. IndicWav2Vec uses the same architecture as Wav2Vec 2.0 but is pretrained on 40,000+ hours of Indian language audio. This domain-specific pretraining captures Indian phonetics, prosody, and accent patterns that the English-pretrained model misses.
Analyze | ASR
Q8.

An IRCTC chatbot classifies "PNR check karo 4521876340" with 65% confidence. The threshold is 75%. What happens?

  A. The bot responds with the PNR status
  B. The query is escalated to a human agent
  C. The bot asks the user to rephrase
  D. The query is discarded
✅ B. Escalated to a human agent. Since 65% is below the 75% threshold, the system escalates to a human agent. In production, low-confidence queries indicate the model is uncertain, and it is better to escalate than to give a wrong response to a paying passenger.
Apply | Chatbot
Q9.

WER (Word Error Rate) of 12.4% means:

  A. 12.4% of words in the reference are correctly transcribed
  B. 87.6% of words are correctly transcribed (approximately)
  C. 12.4% of the audio duration was silence
  D. The model used 12.4% of its vocabulary
✅ B. WER = (Substitutions + Deletions + Insertions) / Total Reference Words. A WER of 12.4% means about 12.4% of words were wrong (substituted, deleted, or extra words inserted), so approximately 87.6% were correct. Lower WER is better.
Understand | ASR Metrics
Q10.

When fine-tuning a Transformer for NER with subword tokenization, how should we handle a word like "मुंबई" tokenized as ["मुं", "##बई"]?

  A. Predict NER tags for both subwords independently
  B. Take the first subword's prediction as the word-level tag
  C. Average the logits of both subwords
  D. Discard the word entirely
✅ B. The standard approach is to assign the label only to the first subword token of each word and ignore predictions for subsequent subword pieces (## tokens). This preserves the one-label-per-word structure needed for NER evaluation.
Apply | NER Tokenization

Section B: Short Answer Questions

  1. [Intermediate] Explain three ways code-mixing manifests in Hindi-English social media text. Give one example of each type and explain why each is challenging for NLP models. (6 marks)
  2. [Intermediate] Why does IndicBERT (33M parameters) sometimes outperform XLM-RoBERTa (270M parameters) on Indian language tasks? Discuss the concept of "language-specific capacity" in multilingual models. (6 marks)
  3. [Intermediate] Compare ROUGE-1, ROUGE-2, and ROUGE-L. Which is most important for legal summarization and why? (6 marks)
  4. [Advanced] Describe the oracle label creation process for extractive summarization. Why is greedy ROUGE optimization used instead of finding the globally optimal subset? (8 marks)
  5. [Intermediate] Explain why a cascade architecture (rules → ML → human) is preferred over pure ML for content moderation at scale. Use the Koo case study. (6 marks)

Section C: Long Answer Questions

  1. [Advanced] Design a complete NLP system for Zomato that handles restaurant reviews in Hindi, Tamil, Telugu, and English. Cover: (a) data collection and annotation strategy, (b) model selection and justification, (c) handling code-mixed reviews, (d) aspect-based sentiment for (food, service, ambiance, price), (e) deployment architecture for <200ms latency, (f) evaluation metrics and expected benchmarks. (20 marks)
  2. [Advanced] Compare Bi-LSTM-CRF and Transformer-based NER models for Indian news. For each architecture: (a) draw the complete architecture diagram, (b) explain the training procedure, (c) analyze performance on different entity types (PER, ORG, LOC, DATE, CUR), (d) discuss computational requirements, (e) recommend which to use for a news startup processing 10,000 articles/day with limited GPU budget. (20 marks)
  3. [Advanced] Explain how Wav2Vec 2.0 uses self-supervised learning for speech recognition. Cover: (a) the contrastive learning objective, (b) quantization of latent representations, (c) masking strategy, (d) CTC decoding, and (e) why self-supervised pretraining is critical for low-resource Indian languages like Odia and Assamese. (15 marks)

Section D: Programming Exercises

  1. [Intermediate] Build a Hinglish preprocessor that: (a) detects the language ratio (Hindi vs English characters), (b) normalizes common transliteration variants (e.g., "accha" → "अच्छा"), (c) handles emoji sentiment markers (👍 → positive, 😤 → negative), and (d) cleans social media noise (URLs, mentions, repeated characters). Test on 10 real Flipkart-style reviews.
  2. [Advanced] Implement entity-level F1 evaluation from scratch (without using the seqeval library). Your function should: (a) extract entity spans from BIO tag sequences, (b) compute precision, recall, and F1 for each entity type, (c) handle edge cases (incomplete entities, nested entities), and (d) return both micro-averaged and macro-averaged F1.
  3. [Advanced] Build a simple extractive summarizer that: (a) splits a document into sentences, (b) encodes each sentence using TF-IDF vectors, (c) computes a sentence importance score using TextRank (graph-based), (d) selects top-K sentences, and (e) evaluates against a reference summary using ROUGE-2. Test on a sample Indian court judgment.

Section E: Mini-Project

🚀 Mini-Project: Multilingual Indian News Intelligence System

Objective

Build an end-to-end NLP pipeline that processes Indian news articles in Hindi and English, performing:

  1. NER: Extract Person, Organization, Location, Date, Currency entities
  2. Sentiment: Classify article tone (Positive/Negative/Neutral)
  3. Summarization: Generate a 3-sentence extractive summary
  4. Topic Classification: Politics, Business, Sports, Technology, Entertainment

Requirements

  • Use MuRIL or IndicBERT as the backbone
  • Process at least 100 test articles (50 Hindi, 50 English)
  • Report entity-level F1 (NER), macro F1 (sentiment/topic), and ROUGE-2 (summarization)
  • Handle code-mixed articles (Hindi-English)
  • Build a simple web interface using Gradio or Streamlit

Dataset Sources

  • Hindi news: BBC Hindi, Amar Ujala (web scraping), IndicNLP News Classification dataset
  • NER annotations: WikiNER Hindi, FIRE NER shared task datasets
  • Summaries: Create oracle extractive labels from article headlines

Deliverables

  • Complete Python code (Jupyter notebook or .py files)
  • Evaluation report with metrics per task and per language
  • Error analysis: 10 examples where the system fails and why
  • A 5-minute demo video showing the system in action

Grading Rubric

Component | Marks
Working NER pipeline with evaluation | 20
Sentiment + topic classification | 15
Extractive summarization with ROUGE | 15
Code-mixing handling | 10
Web interface (Gradio/Streamlit) | 10
Error analysis and documentation | 15
Code quality and reproducibility | 15
Total | 100
Section 12

Chapter Summary

Key Takeaways: Applied NLP for India

  1. Indian NLP faces six unique challenges: code-mixing, script diversity (13+ scripts), morphological richness, low-resource languages, transliteration variants, and domain-specific vocabulary. These make even "solved" English NLP tasks frontier research problems.
  2. MuRIL (Google) is the go-to model for Indian language NLU. Pretrained on 17 Indian languages with transliterated and code-mixed data, it outperforms mBERT by ~14 points and XLM-RoBERTa by ~7 points on Indian benchmarks.
  3. Hindi sentiment analysis using MuRIL achieves 85.2% macro F1 on Flipkart reviews. The key innovation is preprocessing that handles Hinglish code-mixing and transliteration normalization.
  4. Legal document summarization using extractive BertSumExt achieves ROUGE-2 of 0.369 on Indian court judgments. Extractive is preferred over abstractive for legal text because paraphrasing legal language can change its meaning.
  5. IRCTC chatbot achieves 91.5% intent classification accuracy across 12 intent categories. A confidence-based escalation system (threshold 0.75) ensures uncertain queries reach human agents.
  6. NER for Indian news: Transformer models (MuRIL) achieve 92.0% entity F1, outperforming Bi-LSTM-CRF (87.3%) by 4.7 points. However, Bi-LSTM-CRF is 7× faster at inference, making it suitable for latency-critical applications.
  7. ASR using IndicWav2Vec achieves 12.4% WER on Hindi, a 37% relative reduction compared with English Wav2Vec 2.0 (19.8% WER). Self-supervised pretraining on unlabeled Indian speech is the key enabler for low-resource languages.
  8. AI4Bharat (IIT Madras) has built India's language AI stack: IndicBERT, IndicBART, IndicTrans2, IndicWav2Vec, and IndicTTS, all open-source and powering the government's Bhashini platform.
  9. Cascade architectures (rules → ML → human) outperform pure ML for production NLP at scale, as demonstrated by Koo's content moderation system that reduced costs by 72%.
  10. Evaluation metrics matter: Use F1 (not accuracy) for classification, ROUGE (not BLEU) for summarization, entity-level F1 (not token accuracy) for NER, and WER for ASR. Using the wrong metric gives misleading results.
The Indian NLP Hierarchy:

Data Preprocessing (script normalization, code-mix handling)
→ Pretrained Model (MuRIL / IndicBERT / IndicWav2Vec)
→ Fine-tuning (task-specific labeled data)
→ Evaluation (correct metric per task)
→ Deployment (cascade for latency + cost)
Section 13

References & Further Reading

Foundational Papers

  1. Khanuja, S., et al. (2021). "MuRIL: Multilingual Representations for Indian Languages." Google Research. The paper behind Google's Indian language model, trained on 17 languages with transliteration.
  2. Kakwani, D., et al. (2020). "IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages." AI4Bharat, EMNLP Findings. The IndicBERT paper from IIT Madras.
  3. Malik, V., et al. (2021). "ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation." IIIT Hyderabad, ACL. The dataset used in our legal summarization project.
  4. Liu, Y., & Lapata, M. (2019). "Text Summarization with Pretrained Encoders." EMNLP. The BertSumExt/BertSumExtAbs paper, the foundation for our legal summarizer.
  5. Baevski, A., et al. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." Meta AI, NeurIPS. The foundational ASR model we adapt with IndicWav2Vec.

Indian NLP Resources

  1. Joshi, A., et al. (2022). "IndicWav2Vec: Exploring Wav2Vec 2.0 for Indian Languages." AI4Bharat. Pretrained on 40,000+ hours of Indian speech.
  2. Kunchukuttan, A., et al. (2020). "AI4Bharat-IndicNLP Corpus." IIT Madras. Large-scale corpora for 12 Indian languages.
  3. Aggarwal, P., & Rani, R. (2023). "Code-Mixed Sentiment Analysis for Hindi-English Social Media Text." Survey of techniques for Hinglish NLP.
  4. Lample, G., et al. (2016). "Neural Architectures for Named Entity Recognition." NAACL. The original Bi-LSTM-CRF paper for NER.
  5. Lin, C.-Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries." ACL Workshop. The standard evaluation metric for summarization.

Platforms & Tools

  1. AI4Bharat (ai4bharat.org): India's premier open-source language AI initiative (IIT Madras)
  2. Bhashini (bhashini.gov.in): Government of India's language translation platform
  3. HuggingFace Indian Models: Search "MuRIL", "IndicBERT", "IndicWav2Vec" on huggingface.co
  4. IndicNLP Library (github.com/anoopkunchukuttan/indic_nlp_library): Preprocessing tools for Indian languages
  5. iNLTK (github.com/goru001/inltk): Indian NLP Toolkit

Textbooks

  1. Jurafsky, D. & Martin, J.H. (2024). "Speech and Language Processing." 3rd Edition (Draft). The definitive NLP textbook, with chapters on NER, summarization, and ASR.
  2. Goldberg, Y. (2017). "Neural Network Methods for Natural Language Processing." Morgan & Claypool. Clear explanation of Bi-LSTM-CRF and attention mechanisms.