Neural Networks & Deep Learning
Chapter 18: Applied Deep Learning - Natural Language Processing
Building Real-World NLP Systems for India's Linguistic Diversity
Reading Time: ~5 hours | Part V: Applied Deep Learning | Project-Based Chapter
Prerequisites: Chapters 14-17 (RNNs, Attention, Transformers, BERT), Python, PyTorch/HuggingFace basics
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| Remember | Recall NLP pipeline stages, key Indian-language datasets (ILDC, IndicNLP), and evaluation metrics (ROUGE, BLEU, F1, WER) |
| Understand | Explain why Indian languages pose unique challenges: agglutinative morphology, code-switching, low-resource settings, and multiple scripts |
| Apply | Fine-tune MuRIL for Hindi sentiment, build extractive summarizers with IndicBERT, and train intent classifiers for chatbots |
| Analyze | Compare Bi-LSTM-CRF vs Transformer NER architectures; diagnose code-mixing failures in tokenizers |
| Evaluate | Benchmark models across ROUGE, entity-level F1, and word error rate; choose between extractive vs abstractive summarization |
| Create | Design and deploy end-to-end NLP applications: a Hindi sentiment system, a legal summarizer, an IRCTC chatbot, and a news NER pipeline |
Learning Objectives
By the end of this chapter, you will be able to:
- Build a Hindi/Hinglish sentiment analysis system using MuRIL (Multilingual Representations for Indian Languages) from HuggingFace, handling code-mixed reviews from Flipkart
- Implement an extractive legal document summarizer using IndicBERT on the ILDC (Indian Legal Documents Corpus), released by researchers at IIT Kanpur, evaluating with ROUGE scores
- Design an IRCTC-style chatbot that handles 2 crore+ daily queries using BERT-based intent classification and response retrieval
- Train Named Entity Recognition models for Indian news articles, comparing Bi-LSTM-CRF vs Transformer architectures for Person, Organization, Location, Date, and Currency entities
- Explain Automatic Speech Recognition (ASR) for Indian languages using Wav2Vec 2.0 and AI4Bharat's IndicWav2Vec for 9+ languages
- Handle Indian-language-specific challenges: Devanagari/multi-script tokenization, code-switching, morphological richness, and low-resource data augmentation
- Evaluate NLP systems using appropriate metrics: Accuracy/F1 (classification), ROUGE (summarization), Entity F1 (NER), WER (ASR)
Opening Hook: India's Language AI Revolution
1.4 Billion People. 22 Official Languages. 19,500 Dialects. One NLP Challenge.
Every day, 2 crore passengers query IRCTC in a mix of Hindi, English, Tamil, and everything in between. Flipkart receives lakhs of product reviews in Hinglish, such as "bahut accha product hai, quality mast hai". Indian courts generate 4 crore+ pages of legal documents that lawyers must read manually. Meanwhile, Koo tried to moderate content across 10 Indian languages simultaneously.
English NLP is "solved" for many tasks. But India's code-mixing ("main kal market gaya tha, nice experience"), resource scarcity (try finding 1 lakh labeled Kannada sentences), and script diversity (Devanagari, Tamil, Telugu, Bengali, Gurmukhi...) make even basic NLP a frontier research problem.
In this chapter, you don't just learn NLP theory; you build 5 production-grade Indian NLP systems from scratch. Welcome to the most exciting frontier of AI in India.
Flipkart | IRCTC | Koo | AI4Bharat | IIIT Hyderabad | Jugalbandi
Core Concepts: Five NLP Projects for India
This chapter is organized as five complete projects, each addressing a real Indian NLP problem. Every project follows a consistent structure: Problem → Dataset → Model Architecture → Full Code → Evaluation → Indian Language Challenges.
18.1 The Indian NLP Pipeline: Unique Challenges
Before diving into the projects, let's understand why Indian NLP is fundamentally harder than English NLP:
Why Indian NLP is Hard: The Six Challenges
1. Code-Mixing: "Yaar ye phone ka camera too good hai, totally worth it" mixes Hindi and English at the word and sentence level. Standard tokenizers trained on monolingual data handle such text poorly.
2. Script Diversity: 22 official languages using 13+ scripts: Devanagari (Hindi, Marathi, Sanskrit), Tamil script, Telugu script, Bengali script, Gurmukhi (Punjabi), Kannada script, Malayalam script, Odia script, and more. A single model must handle all of them.
3. Morphological Richness: Tamil has agglutinative morphology, so a single word can encode subject, tense, number, and mood: "படிக்கவில்லையா" (padikkaavillaiyaa = "did [you] not read?"). This creates an enormous vocabulary that subword tokenizers struggle with.
4. Low-Resource Languages: While Hindi has ~100K labeled NLP samples, languages like Dogri, Maithili, Bodo, and Santali have almost zero labeled data. Even Kannada and Odia have <10K labeled sentences for most tasks.
5. Transliteration Variants: "कैसे हो" = "kaise ho" = "kese ho" = "kaise hoo". The same Hindi phrase is written in multiple romanized forms, so models must handle native script AND transliterated forms.
6. Domain-Specific Vocabulary: Legal Hindi ("न्यायालय", "अधिनियम"), medical terminology in regional languages, and government jargon create specialized domains with virtually no training data.
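The script-diversity challenge can be made concrete with a small sketch that assigns each character to a script using Unicode block ranges. The helper names `char_script` and `script_profile` are illustrative assumptions, not part of any library, and only a handful of Indic blocks are covered.

```python
from collections import Counter

# Unicode block ranges for some major Indic scripts (Unicode Standard).
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def char_script(ch):
    """Return the script name for one character, or None if unclassified."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return None  # digits, punctuation, whitespace, emoji


def script_profile(text):
    """Fraction of classified characters that belong to each script."""
    counts = Counter(s for s in map(char_script, text) if s is not None)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


# A transliterated phrase next to its native-script form
profile = script_profile("kaise ho कैसे हो")
print(profile)
```

A profile like this is only a routing signal. It says which scripts are present, not which languages, since romanized Hindi is counted as Latin.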
The Standard NLP Pipeline for Indian Languages
| Model | Creator | Languages | Best For |
|---|---|---|---|
| MuRIL | Google | 17 Indian + English | Code-mixed classification, QA |
| IndicBERT | AI4Bharat / IIT Madras | 12 Indian languages | All NLU tasks for Indian langs |
| IndicBART | AI4Bharat | 11 Indian languages | Generation, summarization |
| XLM-RoBERTa | Meta | 100 languages | Cross-lingual transfer |
| IndicTrans2 | AI4Bharat | 22 Indian languages | Machine translation |
| IndicWav2Vec | AI4Bharat | 9 Indian languages | Speech recognition (ASR) |
18.2 Project 1: Hindi/Hinglish Sentiment Analysis
PROJECT 1: CLASSIFICATION
The Business Problem
Flipkart receives 50 lakh+ product reviews monthly, with ~40% written in Hindi or Hinglish. Their recommendation engine and seller quality score depend on accurate sentiment detection. A review like "product toh sahi hai but delivery mein bahut time lagaya" carries mixed sentiment: positive about the product, negative about the delivery. English-only models tend to misclassify it as neutral.
Dataset: Flipkart Hindi/Hinglish Reviews
We use a curated dataset of Flipkart product reviews in Hindi and code-mixed Hindi-English (Hinglish), labeled as Positive, Negative, or Neutral.
| Split | Positive | Negative | Neutral | Total |
|---|---|---|---|---|
| Train | 12,400 | 8,600 | 4,000 | 25,000 |
| Validation | 1,550 | 1,075 | 500 | 3,125 |
| Test | 1,550 | 1,075 | 500 | 3,125 |
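The split above is imbalanced: Neutral makes up only 16% of the training set. That is one reason to track macro F1 rather than accuracy alone. A common mitigation, which the chapter's training code does not apply, is to weight the loss by inverse class frequency. The sketch below works through that arithmetic on the table's counts; the function name is an illustrative assumption.

```python
# Training-split counts copied from the table above.
train_counts = {"positive": 12_400, "negative": 8_600, "neutral": 4_000}


def inverse_frequency_weights(counts):
    """weight_c = N / (K * n_c), so rarer classes get larger weights."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


weights = inverse_frequency_weights(train_counts)
total = sum(train_counts.values())
for label, w in weights.items():
    share = train_counts[label] / total
    print(f"{label:<9} share={share:.1%}  weight={w:.3f}")
```

Weights like these could be passed to a weighted cross-entropy loss. Whether that helps on this dataset would have to be checked on the validation split.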
Why MuRIL? Multilingual Representations for Indian Languages
MuRIL: Google's Indian Language BERT
MuRIL uses the BERT-base architecture (12 layers, 768 hidden units, 12 attention heads). Its parameter count is higher than English BERT-base's 110M because of its much larger multilingual vocabulary. It was pretrained on:
- 17 Indian languages + English from Wikipedia and Common Crawl
- Transliterated text: Hindi in both Devanagari and Roman script
- Parallel corpora: aligned Hindi-English sentence pairs
mBERT spreads its capacity over 104 languages, and Indian languages account for only ~2% of its training data. MuRIL concentrates its capacity on Indian languages (plus English), with a reported 5-10% higher accuracy on Indian NLP benchmarks.
Code-Mixing Handling: MuRIL was explicitly trained on transliterated and code-mixed data. It tokenizes "bahut accha product hai" sensibly even though it is Hindi in Roman script, something mBERT and XLM-R handle poorly.
Code-Mixing: The Key Challenge
Code-mixing occurs at multiple levels:
| Level | Example | Challenge |
|---|---|---|
| Word-level | "Main phone use karta hoon" | English noun in Hindi sentence |
| Sentence-level | "Product accha hai. But delivery was late." | Language switch at sentence boundary |
| Intra-word | "Phoneเคตเคพเคฒเคพ" (Phone + เคตเคพเคฒเคพ) | Morpheme-level mixing across scripts |
| Transliteration | "bahut accha" vs "เคฌเคนเฅเคค เค เคเฅเคเคพ" | Same meaning, different scripts |
Full Implementation: MuRIL Sentiment Classifier
Python
# --- Project 1: Hindi Sentiment Analysis with MuRIL ---
# Fine-tune Google's MuRIL on Flipkart Hindi/Hinglish reviews
import torch
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from sklearn.metrics import classification_report, confusion_matrix
import re

# --- Step 1: Preprocessing for Hindi/Hinglish ---
class HindiTextPreprocessor:
    """Handles code-mixed Hindi-English text preprocessing."""

    def __init__(self):
        # Common Hindi stopwords (in Devanagari)
        self.hindi_stopwords = {
            'है', 'हैं', 'का', 'की', 'के', 'में',
            'को', 'से', 'पर', 'और', 'यह', 'वह'
        }
        # Hinglish normalization map (romanized spelling -> Devanagari)
        self.normalize_map = {
            'accha': 'अच्छा', 'acha': 'अच्छा',
            'bahut': 'बहुत', 'bohot': 'बहुत',
            'sahi': 'सही', 'shi': 'सही',
            'kharab': 'खराब', 'khrb': 'खराब',
            'mast': 'मस्त', 'bakwas': 'बकवास',
        }

    def clean_text(self, text):
        """Clean and normalize Hindi/Hinglish text."""
        text = str(text).lower()
        # Remove URLs and @mentions
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'@\w+', '', text)
        # Keep Devanagari chars (U+0900-U+097F), word characters, and
        # whitespace; punctuation, '#', and emoji become spaces
        text = re.sub(r'[^\u0900-\u097F\w\s]', ' ', text)
        # Collapse runs of 3+ repeated letters to two: "bahuttttt" -> "bahutt".
        # Digits are excluded so that prices such as "1000" stay intact.
        text = re.sub(r'([^\W\d])\1{2,}', r'\1\1', text)
        # Normalize common Hinglish spellings
        words = text.split()
        words = [self.normalize_map.get(w, w) for w in words]
        return ' '.join(words).strip()

    def detect_language_ratio(self, text):
        """Detect Hindi vs English ratio in text (by script)."""
        hindi_chars = len(re.findall(r'[\u0900-\u097F]', text))
        eng_chars = len(re.findall(r'[a-zA-Z]', text))
        total = hindi_chars + eng_chars
        if total == 0:
            return 0.0
        return hindi_chars / total  # 1.0 = pure Devanagari, 0.0 = pure Latin

# --- Step 2: Dataset Class ---
class FlipkartHindiDataset(Dataset):
    """PyTorch Dataset for Flipkart Hindi/Hinglish reviews."""
    LABEL_MAP = {'positive': 0, 'negative': 1, 'neutral': 2}

    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = [self.LABEL_MAP[l] for l in labels]
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.preprocessor = HindiTextPreprocessor()

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.preprocessor.clean_text(self.texts[idx])
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            # squeeze(0) drops only the batch dimension added by return_tensors
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

# --- Step 3: Load MuRIL and Fine-Tune ---
MODEL_NAME = "google/muril-base-cased"
NUM_LABELS = 3

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="single_label_classification"
)

# Load data (assumes CSV with 'review_text' and 'sentiment' columns)
df_train = pd.read_csv("flipkart_hindi_train.csv")
df_val = pd.read_csv("flipkart_hindi_val.csv")
df_test = pd.read_csv("flipkart_hindi_test.csv")

train_dataset = FlipkartHindiDataset(
    df_train['review_text'].tolist(),
    df_train['sentiment'].tolist(),
    tokenizer
)
val_dataset = FlipkartHindiDataset(
    df_val['review_text'].tolist(),
    df_val['sentiment'].tolist(),
    tokenizer
)

# --- Step 4: Training Configuration ---
def compute_metrics(eval_pred):
    """Compute accuracy and macro F1 for evaluation."""
    from sklearn.metrics import accuracy_score, f1_score
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average='macro')
    return {'accuracy': acc, 'f1_macro': f1}

training_args = TrainingArguments(
    output_dir="./muril-flipkart-sentiment",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    weight_decay=0.01,
    learning_rate=2e-5,
    # Newer transformers releases name this argument eval_strategy
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    logging_steps=50,
    fp16=True,  # Mixed precision for faster training (needs a CUDA GPU)
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()

# --- Step 5: Evaluation ---
test_dataset = FlipkartHindiDataset(
    df_test['review_text'].tolist(),
    df_test['sentiment'].tolist(),
    tokenizer
)
results = trainer.evaluate(test_dataset)
print(f"Test Accuracy: {results['eval_accuracy']:.4f}")
print(f"Test F1 Macro: {results['eval_f1_macro']:.4f}")

# --- Step 6: Inference on New Reviews ---
def predict_sentiment(text, model, tokenizer):
    """Predict sentiment of a Hindi/Hinglish review."""
    preprocessor = HindiTextPreprocessor()
    clean = preprocessor.clean_text(text)
    inputs = tokenizer(clean, return_tensors="pt", truncation=True,
                       max_length=128, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    model.eval()  # Disable dropout for deterministic predictions
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    pred = torch.argmax(probs, dim=-1).item()
    # Order must match FlipkartHindiDataset.LABEL_MAP
    labels = ['Positive', 'Negative', 'Neutral']
    return labels[pred], probs[0].tolist()

# Test with code-mixed reviews
reviews = [
    "bahut accha product hai, quality mast hai",
    "paise barbaad! bilkul kharab quality, return karna padega",
    "theek hai, not great not bad, average product",
    "बहुत अच्छा फोन है, कैमरा quality शानदार",
    "delivery late thi but product sahi hai",
]
for review in reviews:
    sentiment, probs = predict_sentiment(review, model, tokenizer)
    print(f"Review: {review[:50]}...")
    print(f"  → {sentiment} (conf: {max(probs):.2%})\n")
Model Comparison: Why MuRIL Wins for Hindi
| Model | Hindi Pure (%) | Hinglish (%) | Code-Mixed (%) | Overall F1 |
|---|---|---|---|---|
| mBERT | 81.2 | 68.4 | 63.1 | 0.74 |
| XLM-RoBERTa | 83.7 | 72.1 | 67.8 | 0.78 |
| MuRIL | 89.4 | 84.2 | 81.7 | 0.85 |
| IndicBERT | 87.1 | 79.8 | 76.3 | 0.82 |
18.3 Project 2: Legal Document Summarization
PROJECT 2: SUMMARIZATION
The Business Problem
Indian courts produce 4 crore+ pages of judgments annually. A Supreme Court judgment averages 40-80 pages; High Court orders run 10-30 pages. Lawyers spend 60% of their billable hours just reading. At ₹5,000-50,000/hour for senior advocates, even a 30% reduction in reading time saves the legal industry ₹1,000+ crore annually.
Dataset: ILDC (Indian Legal Documents Corpus)
The ILDC (Indian Legal Documents Corpus) was released by researchers at IIT Kanpur and contains Supreme Court judgments with expert-written summaries.
| Feature | Details |
|---|---|
| Source | Supreme Court of India, Indian Kanoon |
| Documents | ~35,000 court cases |
| Avg. Document Length | ~4,100 words |
| Avg. Summary Length | ~850 words |
| Language | Legal English (Indian) |
| Labels | Binary prediction (accepted/rejected) + rhetorical roles |
Approach: Extractive Summarization with IndicBERT
Extractive vs Abstractive Summarization
Extractive: Select the most important sentences from the original document. The summary is a subset of the original sentences. This is best for legal text because exact wording matters; paraphrasing legal language can change its meaning.
Abstractive: Generate new sentences that paraphrase the document. The result is more fluent but risks hallucination, a fatal flaw in legal contexts where a misquoted statute number could cause a ₹10 crore loss.
Our Architecture: We use IndicBERT (or BERT) to encode each sentence, then a classifier scores each sentence's importance (0 to 1). The top-K sentences, kept in document order, form the summary. This is the BertSumExt approach adapted for Indian legal text.
ROUGE-L (recall) = LCS(candidate, reference) / len(reference), where LCS is the length in tokens of the longest common subsequence. Dividing by len(candidate) instead gives precision; the evaluation code in this project reports the F-measure, which combines the two.
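To make the ROUGE-L definition concrete, here is a minimal pure-Python sketch of the LCS computation and the resulting precision, recall, and F1. It assumes lowercased whitespace tokenization, and the function names are illustrative. The `rouge_score` package used in this project also applies its own tokenization and optional stemming, so its numbers can differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L precision, recall, and F1 on whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand)   # LCS / len(candidate)
    recall = lcs / len(ref)       # LCS / len(reference)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_l(
    "the appeal is dismissed with costs",   # candidate summary
    "the appeal is hereby dismissed",       # reference summary
)
print(scores)
```

In this example the LCS is "the appeal is dismissed" (4 tokens), so recall is 4/5 and precision is 4/6.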
Python
# --- Project 2: Legal Document Summarization ---
# Extractive summarization for Indian court judgments using BERT
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from rouge_score import rouge_scorer
import numpy as np
import re

# --- Step 1: Sentence Segmentation for Legal Text ---
class LegalSentenceSegmenter:
    """Segment Indian legal documents into sentences.

    Legal text has unique patterns:
    - Section references: 'Sec. 302 I.P.C.'
    - Case citations: 'AIR 1950 SC 27'
    - Abbreviations: 'Hon'ble', 'vs.', 'Ltd.'
    """

    def __init__(self):
        # Patterns that look like sentence-ends but aren't
        self.abbreviations = {
            'vs', 'sec', 'art', 'no', 'sr',
            'dr', 'mr', 'mrs', 'smt', 'shri',
            'hon', 'ltd', 'pvt', 'govt', 'i.e',
            'e.g', 'etc', 'i.p.c', 'cr.p.c', 'c.p.c'
        }

    def segment(self, text):
        """Split legal document into sentences."""
        # Protect abbreviations by hiding their dots. The matched text is
        # reused so that the original casing ('Sec.', 'I.P.C.') survives.
        for abbr in self.abbreviations:
            text = re.sub(
                rf'\b{re.escape(abbr)}\.',
                lambda m: m.group(0).replace('.', '_DOT_'),
                text,
                flags=re.IGNORECASE
            )
        # Split on sentence-ending punctuation
        sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z\d])', text)
        # Restore abbreviations
        sentences = [s.replace('_DOT_', '.') for s in sentences]
        # Filter very short sentences (noise)
        sentences = [s.strip() for s in sentences if len(s.split()) > 5]
        return sentences

# --- Step 2: BertSumExt Model ---
class LegalBertSumExt(nn.Module):
    """Extractive summarizer using BERT sentence representations.

    Architecture:
    1. Encode each sentence with BERT (CLS token)
    2. Inter-sentence Transformer (2 layers) for context
    3. Binary classifier: important (1) or not (0)
    """

    def __init__(self, bert_model_name="ai4bharat/indic-bert",
                 n_heads=8, n_inter_layers=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_model_name)
        hidden_size = self.bert.config.hidden_size  # 768
        # Inter-sentence Transformer layers
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size,
            nhead=n_heads,
            dim_feedforward=hidden_size * 4,
            dropout=0.1,
            activation='gelu',
            batch_first=True
        )
        # Nested tensors are disabled so the output always keeps the
        # padded (batch, num_sents, hidden) shape, also in eval mode
        self.inter_sentence_transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=n_inter_layers,
            enable_nested_tensor=False
        )
        # Binary classifier for each sentence
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, input_ids, attention_mask, sentence_mask):
        """
        Args:
            input_ids: (batch, num_sents, seq_len)
            attention_mask: (batch, num_sents, seq_len)
            sentence_mask: (batch, num_sents), 1 for real, 0 for pad
        Returns:
            scores: (batch, num_sents), importance score per sentence
        """
        batch_size, num_sents, seq_len = input_ids.shape
        # Encode each sentence independently with BERT
        input_ids = input_ids.view(-1, seq_len)
        attention_mask = attention_mask.view(-1, seq_len)
        bert_out = self.bert(input_ids, attention_mask=attention_mask)
        cls_embeddings = bert_out.last_hidden_state[:, 0, :]  # CLS tokens
        cls_embeddings = cls_embeddings.view(batch_size, num_sents, -1)
        # Inter-sentence Transformer for document-level context
        src_key_padding_mask = (sentence_mask == 0)
        contextualized = self.inter_sentence_transformer(
            cls_embeddings,
            src_key_padding_mask=src_key_padding_mask
        )
        # Score each sentence
        scores = self.classifier(contextualized).squeeze(-1)
        scores = scores * sentence_mask  # Zero out padded sentences
        return scores

# --- Step 3: Oracle Label Creation ---
def create_oracle_labels(document_sentences, reference_summary, top_k=5):
    """Create oracle extractive labels using greedy ROUGE optimization.

    Greedily add the sentence that most improves ROUGE-2 F1 against the
    reference, and stop once no remaining sentence improves the score.
    """
    scorer = rouge_scorer.RougeScorer(['rouge2'], use_stemmer=True)
    selected = []
    remaining = list(range(len(document_sentences)))
    labels = [0] * len(document_sentences)
    current_score = 0.0
    for _ in range(min(top_k, len(document_sentences))):
        best_idx = -1
        best_score = current_score
        for idx in remaining:
            candidate = ' '.join(
                [document_sentences[i] for i in selected + [idx]]
            )
            score = scorer.score(reference_summary, candidate)
            rouge2_f1 = score['rouge2'].fmeasure
            if rouge2_f1 > best_score:
                best_score = rouge2_f1
                best_idx = idx
        if best_idx < 0:
            break  # No sentence improves the summary any further
        selected.append(best_idx)
        remaining.remove(best_idx)
        labels[best_idx] = 1
        current_score = best_score
    return labels

# --- Step 4: Training Loop ---
def train_summarizer(model, train_loader, val_loader,
                     epochs=5, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=0.01)
    criterion = nn.BCELoss(reduction='none')
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attn_mask = batch['attention_mask'].to(device)
            sent_mask = batch['sentence_mask'].to(device)
            labels = batch['labels'].to(device).float()
            scores = model(input_ids, attn_mask, sent_mask)
            # Per-sentence loss, averaged over real (non-padded) sentences
            loss = criterion(scores, labels)
            loss = (loss * sent_mask).sum() / sent_mask.sum()
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            total_loss += loss.item()
        # Evaluate on validation set
        val_rouge = evaluate_summarizer(model, val_loader, device)
        print(f"Epoch {epoch+1}/{epochs} | "
              f"Loss: {total_loss/len(train_loader):.4f} | "
              f"ROUGE-2: {val_rouge['rouge2']:.4f} | "
              f"ROUGE-L: {val_rouge['rougeL']:.4f}")

# --- Step 5: Evaluation ---
def evaluate_summarizer(model, data_loader, device, top_k=5):
    """Evaluate using ROUGE scores.

    Expects each batch to also carry 'sentences' (list of sentence lists)
    and 'reference' (list of reference summaries) from the collate function.
    """
    model.eval()
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'], use_stemmer=True
    )
    all_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
    with torch.no_grad():
        for batch in data_loader:
            scores = model(
                batch['input_ids'].to(device),
                batch['attention_mask'].to(device),
                batch['sentence_mask'].to(device)
            )
            # Select top-k sentences
            for i in range(scores.shape[0]):
                # Rank only real sentences (padding is assumed to be at the
                # end), so short documents never select a padded slot
                n_real = int(batch['sentence_mask'][i].sum().item())
                sent_scores = scores[i, :n_real].cpu().numpy()
                top_indices = np.argsort(sent_scores)[-top_k:]
                top_indices = sorted(top_indices)  # Maintain order
                pred_summary = ' '.join(
                    [batch['sentences'][i][j] for j in top_indices]
                )
                ref_summary = batch['reference'][i]
                rouge = scorer.score(ref_summary, pred_summary)
                for key in all_scores:
                    all_scores[key].append(rouge[key].fmeasure)
    return {k: np.mean(v) for k, v in all_scores.items()}
18.4 Project 3: IRCTC Railway Chatbot
PROJECT 3: CHATBOT / INTENT CLASSIFICATION
The Business Problem
IRCTC handles 2 crore+ queries daily: PNR status, train schedule, ticket booking, refund status, platform info. Their call center employs 10,000+ agents at ₹15,000/month each, which is ₹15 crore+/month in staff costs alone. An intelligent chatbot that resolves 60% of queries automatically could save on the order of ₹9 crore/month and reduce average response time from 8 minutes to 3 seconds.
Architecture: Intent Classification + Response Retrieval
Training Data: Intent Categories
| Intent | Examples | Count |
|---|---|---|
| pnr_status | "PNR status check karo", "मेरा PNR 4521876340" | 3,200 |
| train_schedule | "Rajdhani ka time kya hai?", "12301 schedule" | 2,800 |
| ticket_booking | "Delhi se Mumbai ticket book karo" | 2,500 |
| ticket_cancel | "meri ticket cancel kardo", "refund kab milega" | 1,800 |
| seat_availability | "3AC mein seat available hai?" | 2,200 |
| platform_info | "train kis platform pe aayegi?" | 1,500 |
| food_order | "train mein khana order karna hai" | 1,200 |
| complaint | "AC kharab hai coach mein", "toilet saaf nahi" | 1,800 |
| fare_enquiry | "Delhi Mumbai fare kitna hai?" | 1,400 |
| live_status | "train kahan pahunchi?", "running status" | 2,100 |
| tatkal_booking | "tatkal ticket kaise book hoga?" | 1,600 |
| general_query | "IRCTC ka customer care number?" | 1,900 |
Python
# --- Project 3: IRCTC Chatbot ---
# BERT-based Intent Classification + Entity Extraction
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from torch.utils.data import Dataset, DataLoader
import json, re
import numpy as np

# --- Step 1: IRCTC-specific Entity Extractor ---
class IRCTCEntityExtractor:
    """Extract railway-specific entities from queries."""

    def __init__(self):
        self.patterns = {
            'pnr': r'\b(\d{10})\b',
            # Indian train numbers are 5 digits (e.g. 12301, 11077)
            'train_number': r'\b(\d{5})\b',
            'date': r'\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b',
            'coach': r'\b([SB]\d{1,2}|[ABCD]\d|H1|HA1)\b',
            'class': r'\b(1AC|2AC|3AC|SL|CC|2S|GEN|1A|2A|3A)\b',
            'platform': r'platform\s*(\d{1,2})',
        }
        # Major Indian stations
        self.stations = {
            'delhi': 'NDLS', 'new delhi': 'NDLS',
            'mumbai': 'CSTM', 'mumbai central': 'BCT',
            'chennai': 'MAS', 'kolkata': 'HWH',
            'bangalore': 'SBC', 'bengaluru': 'SBC',
            'hyderabad': 'SC', 'pune': 'PUNE',
            'jaipur': 'JP', 'lucknow': 'LKO',
            'ahmedabad': 'ADI', 'patna': 'PNBE',
            'varanasi': 'BSB', 'kanpur': 'CNB',
        }

    def extract(self, text):
        """Extract all entities from a query."""
        entities = {}
        text_lower = text.lower()
        # Regex-based entity extraction (first match per entity type)
        for entity_type, pattern in self.patterns.items():
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                entities[entity_type] = match.group(1)
        # Station name extraction: try longer names first and blank out
        # each match, so "new delhi" is not also reported as "delhi"
        stations_found = []
        remaining = text_lower
        for name in sorted(self.stations, key=len, reverse=True):
            match = re.search(rf'\b{re.escape(name)}\b', remaining)
            if match:
                stations_found.append((match.start(), name))
                remaining = (remaining[:match.start()]
                             + ' ' * len(name)
                             + remaining[match.end():])
        if stations_found:
            # Report stations in the order they appear in the query
            entities['stations'] = [
                {'name': name, 'code': self.stations[name]}
                for _, name in sorted(stations_found)
            ]
        return entities

# --- Step 2: Intent Classification Model ---
class IRCTCIntentClassifier(nn.Module):
    """BERT-based intent classifier for IRCTC queries."""
    INTENTS = [
        'pnr_status', 'train_schedule', 'ticket_booking',
        'ticket_cancel', 'seat_availability', 'platform_info',
        'food_order', 'complaint', 'fare_enquiry',
        'live_status', 'tatkal_booking', 'general_query'
    ]

    def __init__(self, bert_model="google/muril-base-cased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_model)
        hidden = self.bert.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(hidden, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, len(self.INTENTS))
        )
        # Confidence threshold: below this, escalate to a human agent
        self.confidence_threshold = 0.75

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]
        logits = self.classifier(cls_output)
        return logits

    def predict(self, text, tokenizer, device):
        """Predict intent with confidence score."""
        self.eval()
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=64,
                           padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            logits = self.forward(inputs['input_ids'],
                                  inputs['attention_mask'])
        probs = torch.softmax(logits, dim=-1)[0]
        top_prob, top_idx = probs.max(dim=0)
        intent = self.INTENTS[top_idx.item()]
        confidence = top_prob.item()
        return {
            'intent': intent,
            'confidence': confidence,
            'escalate': confidence < self.confidence_threshold
        }

# --- Step 3: Response Templates ---
# Placeholders in braces are slots. Fill them with template.format(**slots);
# 'class' is a Python keyword, so it has to be passed through a dict.
RESPONSE_TEMPLATES = {
    'pnr_status': "Aapka PNR {pnr} ka status: {status}. "
                  "Coach {coach}, Berth {berth}. Yatra mangalmay ho!",
    'train_schedule': "Train {train_number} ka schedule:\n"
                      "Departure: {dep_time} ({source})\n"
                      "Arrival: {arr_time} ({dest})",
    'ticket_booking': "{source} se {dest} ke liye {class} mein "
                      "ticket available hai. Fare: ₹{fare}. "
                      "Book karna chahte hain?",
    'seat_availability': "{date} ko {train_number} mein {class}: "
                         "{available} seats available.",
    'live_status': "Train {train_number} abhi {location} pe hai. "
                   "Expected delay: {delay} min.",
    'complaint': "Aapki complaint register ho gayi hai. "
                 "Reference: {ref_no}. 24 ghante mein response milega.",
    'fare_enquiry': "{source} → {dest} fare:\n"
                    "1AC: ₹{fare_1ac} | 2AC: ₹{fare_2ac} | "
                    "3AC: ₹{fare_3ac} | SL: ₹{fare_sl}",
}

# --- Step 4: Full Chatbot Pipeline ---
class IRCTCChatbot:
    """End-to-end IRCTC chatbot combining intent + entities + response."""

    def __init__(self, model_path, device='cpu'):
        self.device = torch.device(device)
        self.tokenizer = AutoTokenizer.from_pretrained(
            "google/muril-base-cased"
        )
        self.intent_model = IRCTCIntentClassifier()
        self.intent_model.load_state_dict(
            torch.load(model_path, map_location=self.device)
        )
        self.intent_model.to(self.device)
        self.entity_extractor = IRCTCEntityExtractor()

    def respond(self, user_query):
        """Process user query and generate response."""
        # Step 1: Classify intent
        result = self.intent_model.predict(
            user_query, self.tokenizer, self.device
        )
        # Step 2: Extract entities
        entities = self.entity_extractor.extract(user_query)
        # Step 3: Check if escalation needed
        if result['escalate']:
            return {
                'response': "Main aapko humare agent se connect "
                            "karta hoon. Please hold karein.",
                'intent': result['intent'],
                'confidence': result['confidence'],
                'escalated': True
            }
        # Step 4: Pick the response template for this intent. The template
        # is returned with its slots unfilled; a real system would fill
        # them from the extracted entities and backend data. Intents
        # without a template fall back to a clarification message.
        template = RESPONSE_TEMPLATES.get(
            result['intent'],
            "Main samajh nahi paaya. Kya aap dobara bata sakte hain?"
        )
        return {
            'response': template,
            'intent': result['intent'],
            'confidence': result['confidence'],
            'entities': entities,
            'escalated': False
        }

# --- Step 5: Evaluation ---
def evaluate_chatbot(model, test_loader, device):
    """Evaluate intent classification accuracy."""
    model.eval()
    correct, total = 0, 0
    per_intent_correct = {}
    per_intent_total = {}
    with torch.no_grad():
        for batch in test_loader:
            logits = model(
                batch['input_ids'].to(device),
                batch['attention_mask'].to(device)
            )
            preds = logits.argmax(dim=-1)
            labels = batch['labels'].to(device)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
            for p, l in zip(preds, labels):
                intent = IRCTCIntentClassifier.INTENTS[l.item()]
                per_intent_total[intent] = per_intent_total.get(intent, 0) + 1
                if p == l:
                    per_intent_correct[intent] = \
                        per_intent_correct.get(intent, 0) + 1
    overall_acc = correct / total
    print(f"Overall Accuracy: {overall_acc:.4f}\n")
    print(f"{'Intent':<20} {'Accuracy':>10}")
    print("-" * 32)
    for intent in IRCTCIntentClassifier.INTENTS:
        acc = per_intent_correct.get(intent, 0) / \
            max(per_intent_total.get(intent, 1), 1)
        print(f"{intent:<20} {acc:>10.2%}")
    return overall_acc
18.5 Project 4: Named Entity Recognition for Indian News
PROJECT 4: SEQUENCE LABELING
The Business Problem
Indian news outlets like NDTV, Aaj Tak, and The Hindu process 50,000+ articles daily across Hindi, English, Tamil, Telugu, and Bengali. Automatically extracting entities, namely who (Person), which company (Organization), where (Location), when (Date), and how much (Currency in ₹), powers news categorization, knowledge graphs, and fact-checking at scale.
Entity Types for Indian News
| Entity | Tag | Example (Hindi) | Example (English) |
|---|---|---|---|
| Person | PER | नरेंद्र मोदी | Narendra Modi |
| Organization | ORG | रिलायंस इंडस्ट्रीज | Reliance Industries |
| Location | LOC | नई दिल्ली | New Delhi |
| Date | DATE | 15 अगस्त 2024 | 15 August 2024 |
| Currency | CUR | ₹1,500 करोड़ | ₹1,500 crore |
BIO Tagging Scheme
BIO (Beginning-Inside-Outside) Tagging
Each token gets one of: B-TYPE (beginning of entity), I-TYPE (inside entity), or O (outside / not an entity).
Example:
| Token | नरेंद्र | मोदी | ने | रिलायंस | इंडस्ट्रीज | को | ₹1,500 | करोड़ | दिये |
|---|---|---|---|---|---|---|---|---|---|
| Tag | B-PER | I-PER | O | B-ORG | I-ORG | O | B-CUR | I-CUR | O |
Tag Count
5 entity types × 2 (B, I) + 1 (O) = 11 tags total.
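The tag arithmetic and the BIO scheme can be checked with a short sketch. It builds the tag set from the five entity types and converts token-index spans into BIO tags. The helper names and the `(start, end_exclusive, type)` span format are illustrative assumptions, not a fixed annotation standard.

```python
ENTITY_TYPES = ["PER", "ORG", "LOC", "DATE", "CUR"]


def build_tag_set(entity_types):
    """One O tag plus a B- and an I- tag for every entity type."""
    tags = ["O"]
    for etype in entity_types:
        tags += [f"B-{etype}", f"I-{etype}"]
    return tags


def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, type) token spans to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the entity
    return tags


tag_set = build_tag_set(ENTITY_TYPES)
tokens = "Narendra Modi gave Reliance Industries ₹1,500 crore".split()
spans = [(0, 2, "PER"), (3, 5, "ORG"), (5, 7, "CUR")]
bio = spans_to_bio(tokens, spans)

print(len(tag_set))
for token, tag in zip(tokens, bio):
    print(f"{token:<12} {tag}")
```

Because two entities can be adjacent, as with "Industries" followed by "₹1,500" here, the B- prefix is what marks where one entity ends and the next begins.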
Architecture Comparison: Bi-LSTM-CRF vs Transformer
Python
# --- Project 4A: Bi-LSTM-CRF for Indian NER ---
# Classic sequence labeling approach
import torch
import torch.nn as nn
from torchcrf import CRF
import numpy as np

# Tag set for Indian news NER
TAG_TO_IDX = {
    'O': 0,
    'B-PER': 1, 'I-PER': 2,
    'B-ORG': 3, 'I-ORG': 4,
    'B-LOC': 5, 'I-LOC': 6,
    'B-DATE': 7, 'I-DATE': 8,
    'B-CUR': 9, 'I-CUR': 10,
}
IDX_TO_TAG = {v: k for k, v in TAG_TO_IDX.items()}
NUM_TAGS = len(TAG_TO_IDX)
class BiLSTMCRF(nn.Module):
    """Bi-LSTM-CRF for Named Entity Recognition.

    Architecture (top of the stack first):
        +--------------+
        |     CRF      |  <- Ensures valid tag transitions
        +--------------+     (e.g., I-PER can't follow B-ORG)
        |    Linear    |
        +--------------+
        |   Bi-LSTM    |  <- Captures bidirectional context
        |  (2 layers)  |
        +--------------+
        |   Char-CNN   |  <- Character-level features (morphology)
        |  + Word Emb  |
        +--------------+
    """

    def __init__(self, vocab_size, char_vocab_size,
                 word_emb_dim=300, char_emb_dim=30,
                 char_hidden=50, lstm_hidden=256,
                 num_layers=2, dropout=0.5):
        super().__init__()
        # Word embeddings (can load Hindi fastText vectors)
        self.word_embedding = nn.Embedding(vocab_size, word_emb_dim,
                                           padding_idx=0)
        # Character-level CNN (captures morphological patterns)
        self.char_embedding = nn.Embedding(char_vocab_size, char_emb_dim,
                                           padding_idx=0)
        self.char_cnn = nn.Conv1d(
            char_emb_dim, char_hidden,
            kernel_size=3, padding=1
        )
        # Bi-LSTM
        input_dim = word_emb_dim + char_hidden
        self.lstm = nn.LSTM(
            input_dim, lstm_hidden,
            num_layers=num_layers,
            bidirectional=True,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        # Emission scores
        self.hidden2tag = nn.Linear(lstm_hidden * 2, NUM_TAGS)
        self.dropout = nn.Dropout(dropout)
        # CRF layer: learns valid tag transitions
        self.crf = CRF(NUM_TAGS, batch_first=True)

    def _get_char_features(self, char_ids):
        """Compute character-level features using CNN."""
        batch, seq_len, char_len = char_ids.shape
        char_ids = char_ids.view(-1, char_len)
        char_emb = self.char_embedding(char_ids)   # (B*S, C, D)
        char_emb = char_emb.permute(0, 2, 1)       # (B*S, D, C)
        char_cnn = self.char_cnn(char_emb)         # (B*S, H, C)
        char_features = char_cnn.max(dim=2)[0]     # Max pool: (B*S, H)
        char_features = char_features.view(batch, seq_len, -1)
        return char_features

    def _get_emissions(self, word_ids, char_ids):
        """Compute emission scores from Bi-LSTM."""
        word_emb = self.word_embedding(word_ids)
        char_feat = self._get_char_features(char_ids)
        combined = torch.cat([word_emb, char_feat], dim=-1)
        combined = self.dropout(combined)
        lstm_out, _ = self.lstm(combined)
        lstm_out = self.dropout(lstm_out)
        emissions = self.hidden2tag(lstm_out)
        return emissions

    def forward(self, word_ids, char_ids, tags, mask):
        """Compute negative log-likelihood loss."""
        emissions = self._get_emissions(word_ids, char_ids)
        loss = -self.crf(emissions, tags, mask=mask,
                         reduction='mean')
        return loss

    def predict(self, word_ids, char_ids, mask):
        """Viterbi decoding for best tag sequence."""
emissions = self._get_emissions(word_ids, char_ids)
best_tags = self.crf.decode(emissions, mask=mask)
return best_tags
# ─── Project 4B: Transformer NER (BERT-based) ───
from transformers import AutoModelForTokenClassification, AutoTokenizer


class TransformerNER:
    """Transformer-based NER using MuRIL/IndicBERT."""

    def __init__(self, model_name="google/muril-base-cased"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Note: the token-classification head is randomly initialized
        # until the model is fine-tuned on NER data.
        self.model = AutoModelForTokenClassification.from_pretrained(
            model_name,
            num_labels=NUM_TAGS,
            id2label=IDX_TO_TAG,
            label2id=TAG_TO_IDX,
        )
        self.model.eval()  # inference mode: disables dropout

    def predict_entities(self, text):
        """Predict entities in text, handling subword tokens."""
        inputs = self.tokenizer(
            text, return_tensors="pt",
            return_offsets_mapping=True,
            truncation=True, max_length=512
        )
        offset_mapping = inputs.pop('offset_mapping')[0]
        with torch.no_grad():
            outputs = self.model(**inputs)
        preds = outputs.logits.argmax(dim=-1)[0]

        # Map subword predictions back to character spans in the text
        entities = []
        current_entity = None
        for pred, offset in zip(preds, offset_mapping):
            if offset[0] == 0 and offset[1] == 0:
                continue  # Skip [CLS], [SEP]
            tag = IDX_TO_TAG[pred.item()]
            start, end = offset[0].item(), offset[1].item()
            if tag.startswith('B-'):
                if current_entity:
                    entities.append(current_entity)
                current_entity = {
                    'type': tag[2:],
                    'text': text[start:end],
                    'start': start, 'end': end
                }
            elif tag.startswith('I-') and current_entity:
                # Extend the span and re-slice the original text so that
                # spaces between words are preserved
                current_entity['end'] = end
                current_entity['text'] = text[current_entity['start']:end]
            else:
                if current_entity:
                    entities.append(current_entity)
                current_entity = None
        if current_entity:
            entities.append(current_entity)
        return entities


# ─── Evaluation: Entity-Level F1 ───
from seqeval.metrics import classification_report as ner_report


def evaluate_ner(true_tags_list, pred_tags_list):
    """Evaluate NER using entity-level F1 (seqeval library)."""
    print(ner_report(true_tags_list, pred_tags_list, digits=4))
Bi-LSTM-CRF vs Transformer NER: Head-to-Head
| Aspect | Bi-LSTM-CRF | Transformer (MuRIL) |
|---|---|---|
| Entity F1 | 87.28% | 92.00% |
| Parameters | ~5M | ~110M |
| Training Time | ~20 min (GPU) | ~2 hrs (GPU) |
| Inference Speed | ~2ms/sentence | ~15ms/sentence |
| Code-Mixed Text | Poor (separate embeddings) | Good (multilingual pretrain) |
| Low Resource | Needs 10K+ examples | Works with 2-3K (transfer) |
| CRF Constraints | Yes (explicit transition layer) | No (valid transitions learned implicitly) |
| Best For | Speed-critical, single-language | Multilingual, accuracy-first |
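To make the "CRF Constraints" row concrete, the sketch below checks whether a tag sequence obeys the BIO rule that an I- tag must continue an entity of the same type. The function names are hypothetical. A check of this kind can be run on Transformer predictions, which have no explicit transition layer, to count how often invalid sequences occur.

```python
def is_valid_transition(prev_tag, tag):
    """An I-X tag is only valid directly after B-X or I-X."""
    if tag.startswith("I-"):
        etype = tag[2:]
        return prev_tag in (f"B-{etype}", f"I-{etype}")
    return True  # O and B-X may follow any tag

def is_valid_sequence(tags):
    """Check every adjacent pair, treating the sentence start as O."""
    prev = "O"
    for tag in tags:
        if not is_valid_transition(prev, tag):
            return False
        prev = tag
    return True

print(is_valid_sequence(["B-PER", "I-PER", "O", "B-ORG", "I-ORG"]))  # True
print(is_valid_sequence(["B-PER", "I-ORG"]))                          # False
```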
18.6 Project 5: Automatic Speech Recognition for Indian Languages
🎤 PROJECT 5: SPEECH RECOGNITION
The Business Problem
India has 800 million smartphone users, but only ~10% are comfortable typing in English. Voice is the natural interface: Google reports 30% of Indian search queries are voice-based. Jio, Paytm, and PhonePe all need ASR that works for Indian-accented English and regional languages. The challenge: most global ASR models fail catastrophically on Indian accents and code-switched speech.
Wav2Vec 2.0 for Indian English
Wav2Vec 2.0: Self-Supervised Speech Model
Wav2Vec 2.0 is the "BERT of speech." It uses a CNN feature encoder + Transformer to learn speech representations from raw audio without any transcription labels (self-supervised pretraining), then fine-tunes on labeled speech data.
Three-Stage Pipeline
- Feature Encoder: 7-layer CNN converts raw 16kHz audio → 50Hz latent speech representations
- Contextualized Transformer: 12–24 Transformer layers learn contextual representations (like BERT for audio)
- CTC Head: Connectionist Temporal Classification decodes character/token probabilities at each timestep
Pretrain on unlabeled audio (abundant: All India Radio, YouTube, podcasts), then fine-tune on just 10–50 hours of labeled speech. This is crucial for languages like Odia, Assamese, and Maithili where labeled data is scarce.
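Two steps of this pipeline are easy to check by hand: how the CNN turns 16kHz samples into roughly 50 frames per second, and how a CTC head's per-frame predictions collapse into text. The sketch below is illustrative only. The kernel sizes and strides are assumed to be those of the base wav2vec 2.0 configuration, and the toy vocabulary and the function names are made up for the example.

```python
import math

# Assumed kernel sizes and strides of the 7 convolutional layers
# (base wav2vec 2.0 configuration)
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_frames(num_samples):
    """Number of latent frames the CNN encoder emits for raw audio."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1  # unpadded 1-D convolution
    return length

def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Greedy CTC decoding: merge repeated ids, then drop blanks."""
    chars = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars)

total_stride = math.prod(CONV_STRIDES)
print(total_stride)            # 320 samples per frame
print(16000 / total_stride)    # 50.0 frames per second (nominal)
print(num_frames(16000))       # 49 frames for exactly one second of audio

toy_vocab = {1: "n", 2: "a", 3: "m"}          # hypothetical vocabulary
frames = [1, 1, 0, 2, 2, 0, 3, 0, 2, 2]       # per-frame argmax ids
print(ctc_greedy_decode(frames, toy_vocab))   # nama
```

The blank symbol is what lets CTC emit a doubled letter: `[1, 0, 1]` decodes to "nn", while `[1, 1]` decodes to a single "n".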
AI4Bharat's IndicWav2Vec
| Language | Unlabeled Hours | Labeled Hours | WER (IndicWav2Vec) | WER (Facebook w2v2) |
|---|---|---|---|---|
| Hindi | 10,400 | 250 | 12.4% | 19.8% |
| Bengali | 4,200 | 120 | 15.7% | 24.3% |
| Tamil | 5,100 | 180 | 14.2% | 22.1% |
| Telugu | 4,800 | 150 | 16.1% | 25.6% |
| Marathi | 3,600 | 90 | 17.3% | 27.4% |
| Gujarati | 2,800 | 80 | 18.9% | 29.1% |
| Kannada | 3,100 | 100 | 16.8% | 26.2% |
| Malayalam | 3,400 | 110 | 15.5% | 23.8% |
| Odia | 2,200 | 60 | 20.4% | 32.7% |
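The gap between the two WER columns is easier to compare across languages as a relative reduction. The sketch below computes it from the figures in the table above; the dictionary and function names are assumptions for this example.

```python
# (IndicWav2Vec WER, Facebook w2v2 WER) in percent, copied from the table
WER_TABLE = {
    "Hindi": (12.4, 19.8),
    "Bengali": (15.7, 24.3),
    "Tamil": (14.2, 22.1),
    "Telugu": (16.1, 25.6),
    "Marathi": (17.3, 27.4),
    "Gujarati": (18.9, 29.1),
    "Kannada": (16.8, 26.2),
    "Malayalam": (15.5, 23.8),
    "Odia": (20.4, 32.7),
}

def relative_wer_reduction(new_wer, baseline_wer):
    """Percentage by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer * 100

for language, (indic, baseline) in WER_TABLE.items():
    reduction = relative_wer_reduction(indic, baseline)
    print(f"{language:<10} {reduction:5.1f}% relative WER reduction")
```

For Hindi this gives about 37%, which is a relative figure; the absolute gap is 7.4 WER points.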
Python
# ─── Project 5: ASR with IndicWav2Vec ───
# Automatic Speech Recognition for Indian languages
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import numpy as np

# ─── Step 1: Load IndicWav2Vec ───
MODEL_NAME = "ai4bharat/indicwav2vec_v1_hindi"
processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)
model.eval()


# ─── Step 2: Inference Function ───
def transcribe_hindi(audio_path):
    """Transcribe Hindi speech to text."""
    # Load audio (the model expects 16kHz mono)
    waveform, sample_rate = torchaudio.load(audio_path)
    # Resample if needed
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)
    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    # Process through model
    inputs = processor(
        waveform.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    )
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # CTC decode: greedy (argmax at each timestep)
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]
    return transcription


# ─── Step 3: Evaluation (WER) ───
def word_error_rate(reference, hypothesis):
    """Compute Word Error Rate using dynamic programming."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    n = len(ref_words)
    m = len(hyp_words)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    # DP table for word-level edit distance
    dp = np.zeros((n + 1, m + 1))
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref_words[i-1] == hyp_words[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(
                    dp[i-1][j],    # Deletion
                    dp[i][j-1],    # Insertion
                    dp[i-1][j-1]   # Substitution
                )
    wer = dp[n][m] / n
    return wer
# Example evaluation
# (the hypothesis differs from the reference in one word: the final word's nasal mark)
ref = "नमस्ते मैं हिंदी में बात कर रहा हूँ"
hyp = "नमस्ते मैं हिंदी में बात कर रहा हूं"
print(f"WER: {word_error_rate(ref, hyp):.2%}")
# WER: 12.50% (1 word different out of 8)
From-Scratch Code: Building a Minimal Attention-Based Classifier
To understand the fundamentals, let's build a simple attention-based text classifier from scratch in NumPy, with no PyTorch and no HuggingFace. This demonstrates the attention mechanism at the core of the Transformer models used in the projects above.
Python (NumPy Only)
# ─── From-Scratch: Attention-Based Text Classifier ───
# Demonstrates the attention mechanism behind the Transformer models above
import numpy as np


class ScratchAttentionClassifier:
    """
    A simple attention-based text classifier built entirely in NumPy.

    Architecture:
        Input → Embedding → Self-Attention → Mean Pool → Softmax → Class

    This is the core mechanism behind BERT/MuRIL fine-tuning.
    """

    def __init__(self, vocab_size, embed_dim=64,
                 num_classes=3, max_len=50):
        self.embed_dim = embed_dim
        self.num_classes = num_classes
        self.max_len = max_len
        # Xavier (Glorot) normal initialization
        scale = np.sqrt(2.0 / (vocab_size + embed_dim))
        self.W_emb = np.random.randn(vocab_size, embed_dim) * scale
        # Attention weights: Q, K, V projections
        scale_attn = np.sqrt(2.0 / (embed_dim + embed_dim))
        self.W_Q = np.random.randn(embed_dim, embed_dim) * scale_attn
        self.W_K = np.random.randn(embed_dim, embed_dim) * scale_attn
        self.W_V = np.random.randn(embed_dim, embed_dim) * scale_attn
        # Classification head
        scale_cls = np.sqrt(2.0 / (embed_dim + num_classes))
        self.W_cls = np.random.randn(embed_dim, num_classes) * scale_cls
        self.b_cls = np.zeros(num_classes)

    def softmax(self, x, axis=-1):
        """Numerically stable softmax."""
        e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return e_x / e_x.sum(axis=axis, keepdims=True)

    def self_attention(self, X, mask=None):
        """
        Scaled dot-product self-attention.
        Q = X @ W_Q, K = X @ W_K, V = X @ W_V
        Attention(Q,K,V) = softmax(Q @ K^T / sqrt(d_k)) @ V
        """
        Q = X @ self.W_Q  # (seq_len, d)
        K = X @ self.W_K
        V = X @ self.W_V
        d_k = Q.shape[-1]
        scores = (Q @ K.T) / np.sqrt(d_k)  # (seq_len, seq_len)
        if mask is not None:
            # Positions where mask is False get a large negative score
            scores = np.where(mask, scores, -1e9)
        attn_weights = self.softmax(scores, axis=-1)
        context = attn_weights @ V  # (seq_len, d)
        return context, attn_weights

    def forward(self, token_ids):
        """
        Forward pass: tokens → embedding → attention → classify.

        Args:
            token_ids: list of integer token IDs
        Returns:
            class_probs: (num_classes,) probability distribution
            attn_weights: (seq_len, seq_len) attention matrix
        """
        # Embedding lookup
        X = self.W_emb[token_ids]  # (seq_len, embed_dim)
        # Self-attention
        context, attn_weights = self.self_attention(X)
        # Pool: mean of context vectors (a simple alternative to
        # using the [CLS] token's vector as BERT does)
        pooled = context.mean(axis=0)  # (embed_dim,)
        # Classify
        logits = pooled @ self.W_cls + self.b_cls
        probs = self.softmax(logits)
        return probs, attn_weights

    def predict(self, token_ids):
        """Get predicted class and confidence."""
        probs, attn = self.forward(token_ids)
        pred_class = np.argmax(probs)
        labels = ['Positive', 'Negative', 'Neutral']
        return labels[pred_class], probs[pred_class], attn


# ─── Demo ───
# The weights are random (untrained), so the predicted label is arbitrary;
# the demo only shows the shapes and the data flow.
np.random.seed(42)
classifier = ScratchAttentionClassifier(vocab_size=5000)
# Simulate tokenized input: "bahut accha product hai"
token_ids = [142, 87, 1203, 56]
label, confidence, attn = classifier.predict(token_ids)
print(f"Prediction: {label} (conf: {confidence:.2%})")
print(f"Attention matrix shape: {attn.shape}")
print(f"Token 'accha' attends most to: token {np.argmax(attn[1])}")
Industry Code: Production-Ready NLP Pipeline
Here's a production-style pipeline that combines sentiment analysis, NER, and zero-shot intent detection into a unified Indian NLP system, as you might deploy at a company like Flipkart or Jio.
Python
# ─── Production Indian NLP Pipeline ───
# Unified system for multi-task Indian language processing
from transformers import pipeline
import torch


class IndianNLPPipeline:
    """Production NLP pipeline for Indian languages.

    Supports: Sentiment, NER, Intent Classification
    Languages: Hindi, Hinglish, English, + regional via IndicBERT
    """

    def __init__(self, device="cuda" if torch.cuda.is_available() else "cpu"):
        self.device = device
        print(f"Initializing on {device}...")
        pipeline_device = 0 if device == "cuda" else -1
        # Sentiment Analysis (local MuRIL checkpoint, fine-tuned)
        self.sentiment = pipeline(
            "text-classification",
            model="./muril-flipkart-sentiment",
            device=pipeline_device
        )
        # NER (local MuRIL checkpoint, fine-tuned for Indian entities)
        self.ner = pipeline(
            "ner",
            model="./muril-indian-ner",
            aggregation_strategy="simple",
            device=pipeline_device
        )
        # Zero-shot classification for flexible intent detection
        self.zero_shot = pipeline(
            "zero-shot-classification",
            model="joeddav/xlm-roberta-large-xnli",
            device=pipeline_device
        )

    def analyze(self, text, tasks=None):
        """Run all requested NLP tasks on input text."""
        if tasks is None:
            tasks = ["sentiment", "ner"]
        results = {"text": text}
        if "sentiment" in tasks:
            results["sentiment"] = self.sentiment(text)[0]
        if "ner" in tasks:
            entities = self.ner(text)
            results["entities"] = [
                {"text": e["word"], "type": e["entity_group"],
                 "score": round(float(e["score"]), 3)}
                for e in entities
            ]
        if "intent" in tasks:
            candidate_labels = [
                "ticket booking", "complaint",
                "status inquiry", "general question"
            ]
            results["intent"] = self.zero_shot(
                text, candidate_labels
            )
        return results


# ─── Usage ───
nlp = IndianNLPPipeline()
result = nlp.analyze(
    "Reliance Industries ne Mumbai mein ₹500 crore ka naya plant "
    "kholne ka faisla kiya hai.",
    tasks=["sentiment", "ner"]
)
print(f"Sentiment: {result['sentiment']['label']}")
print("Entities:")
for e in result['entities']:
    print(f"  {e['type']:>5}: {e['text']} ({e['score']:.1%})")
Visual Diagrams
6.1 The Complete Indian NLP Landscape
6.2 BIO Tagging NER Pipeline
6.3 Extractive Summarization Architecture
Worked Example: End-to-End Flipkart Review Analysis
Let's trace a single Flipkart review through the entire NLP pipeline, step by step.
Input Review
"Realme ka ye phone bahut accha hai, camera quality bhi mast hai lekin battery jaldi khatam ho jaati hai. ₹12,999 mein theek hai."
Step 1: Preprocessing
Original: "Realme ka ye phone bahut accha hai, camera quality bhi mast hai lekin battery jaldi khatam ho jaati hai. ₹12,999 mein theek hai."
After clean_text():
→ "realme ka ye phone बहुत अच्छा hai camera quality bhi मस्त hai lekin battery jaldi khatam ho jaati hai ₹12999 mein theek hai"
Language ratio: detect_language_ratio() → 0.31 (31% Hindi chars = Hinglish)
Step 2: MuRIL Tokenization
Tokens: ['[CLS]', 'real', '##me', 'ka', 'ye', 'phone', 'बहुत', 'अच्छा',
'hai', 'camera', 'quality', 'bhi', 'मस्त', 'hai', 'le', '##kin',
'battery', 'jal', '##di', 'khat', '##am', 'ho', 'ja', '##ati',
'hai', '₹', '12', '##99', '##9', 'mein', 'the', '##ek', 'hai', '[SEP]']
Token IDs: [2, 8734, 1456, 342, 178, 4521, 6789, 7234, 156, ...]
Length: 34 tokens (within 128 max_length)
Step 3: Sentiment Classification
MuRIL Output Logits: [2.34, -1.12, 0.45] → [Positive, Negative, Neutral]
Softmax probabilities:
Positive: e^2.34 / (e^2.34 + e^-1.12 + e^0.45)
= 10.38 / (10.38 + 0.33 + 1.57)
= 10.38 / 12.28
= 84.5%
Negative: 0.33 / 12.28 = 2.7%
Neutral: 1.57 / 12.28 = 12.8%
Prediction: ✅ POSITIVE (84.5% confidence)
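The softmax arithmetic above can be reproduced with the standard library. This is a small check of the worked numbers, not the model code; the variable names are chosen for this example.

```python
import math

logits = [2.34, -1.12, 0.45]
labels = ["Positive", "Negative", "Neutral"]

# Subtract the maximum logit first for numerical stability;
# this does not change the resulting probabilities.
max_logit = max(logits)
exps = [math.exp(z - max_logit) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

for label, p in zip(labels, probs):
    print(f"{label}: {p:.1%}")

prediction = labels[probs.index(max(probs))]
print(f"Prediction: {prediction}")
```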
Step 4: NER Extraction
Token: realme ka ye phone बहुत ... ₹ 12999 mein theek hai
NER Tag: B-ORG O O O O ... B-CUR I-CUR O O O
Extracted Entities:
ORG: "Realme" (score: 0.934)
CUR: "₹12,999" (score: 0.912)
Step 5: Aspect-Based Analysis (Advanced)
Aspects detected:
📱 "camera quality" → Positive ("mast hai")
🔋 "battery" → Negative ("jaldi khatam ho jaati")
💰 "₹12,999 mein" → Neutral ("theek hai")
Overall: Mixed Positive (good product, battery concern, fair price)
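One simple way to arrive at an aspect breakdown like this is to split the review into clauses and pair each aspect keyword with a polarity cue found in the same clause. The sketch below is a toy, rule-based illustration under that assumption; the keyword lists and the function name are hypothetical, and a production system would use a trained aspect-based sentiment model instead.

```python
import re

# Hypothetical keyword lists, just large enough for the example review
ASPECT_TERMS = {"camera": "camera quality", "battery": "battery", "₹": "price"}
POLARITY_CUES = {
    "mast": "Positive",
    "accha": "Positive",
    "jaldi khatam": "Negative",
    "theek": "Neutral",
}

def aspect_sentiment(review):
    """Assign a polarity to each aspect mentioned in the review."""
    # Split on ", ", ". ", "lekin" and "but"; the comma inside ₹12,999
    # is not followed by whitespace, so the price stays in one clause
    clauses = re.split(r",\s|\.\s|\blekin\b|\bbut\b", review.lower())
    results = {}
    for clause in clauses:
        for term, aspect in ASPECT_TERMS.items():
            if term not in clause:
                continue
            for cue, polarity in POLARITY_CUES.items():
                if cue in clause:
                    results[aspect] = polarity
                    break
    return results

review = ("Realme ka ye phone bahut accha hai, camera quality bhi mast hai "
          "lekin battery jaldi khatam ho jaati hai. ₹12,999 mein theek hai.")
print(aspect_sentiment(review))
```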
Case Study: Koo's Multilingual Content Moderation
🐦 Koo: India's Multilingual Social Media Challenge
The Problem
Koo, India's microblogging platform, launched with support for 10 Indian languages: Hindi, Kannada, Tamil, Telugu, Bengali, Marathi, Gujarati, Punjabi, Malayalam, and Assamese. At its peak, Koo handled 50 lakh+ posts daily and needed to moderate content for:
- Hate speech detection across all 10 languages
- Fake news flagging, especially during elections
- Spam detection, including transliterated spam
- Sentiment tracking for trending topics
The NLP Challenge
| Challenge | Details |
|---|---|
| 10 languages × 4 tasks | 40 separate classifiers? Or one multilingual model? |
| Code-mixed posts | 40%+ posts mixed Hindi-English or regional-English |
| Transliteration | Kannada in Roman script: "nanu Bengaluru-ge hogthini" |
| Memes & images | Hate speech encoded in images with text overlays |
| Latency | <100ms per post for real-time moderation |
Solution Architecture
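Koo used a cascade architecture to balance accuracy and speed: a fast rules-based filter first, then an ML classifier, and finally human review for borderline cases.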
Results
| Metric | Before ML | After ML |
|---|---|---|
| Moderation speed | 4 hours avg | 3 seconds avg |
| Hate speech catch rate | 45% (manual) | 89% (automated) |
| False positive rate | 2% | 4.5% (higher but acceptable) |
| Human moderators needed | 200 | 35 (for borderline cases) |
| Monthly moderation cost | ₹80 lakh | ₹22 lakh |
Key Lesson
The biggest lesson: a cascade architecture (fast rules → ML → human) outperforms pure ML or pure human moderation. The rules-based filter handles obvious cases instantly, ML handles the nuanced middle ground, and humans handle only truly ambiguous cases. This cut monthly moderation costs by 72% (₹80 lakh to ₹22 lakh) and nearly doubled the hate speech catch rate (45% to 89%, a 98% relative improvement).
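The cascade can be sketched as three ordered checks. This is a minimal illustration of the idea, not Koo's implementation: the blocklist entry, the thresholds, and the scoring function passed in as `ml_score_fn` are all hypothetical.

```python
# Hypothetical blocklist for the fast rules stage
BLOCKLIST = {"spamlink.example"}

def moderate(post, ml_score_fn, block_threshold=0.9, allow_threshold=0.1):
    """Route a post through rules, then ML, then human review.

    ml_score_fn is assumed to return the probability (0 to 1)
    that the post violates policy.
    """
    text = post.lower()
    # Stage 1: cheap rules catch obvious cases without running a model
    if any(term in text for term in BLOCKLIST):
        return "blocked_by_rules"
    # Stage 2: the ML model decides when it is confident either way
    score = ml_score_fn(post)
    if score >= block_threshold:
        return "blocked_by_ml"
    if score <= allow_threshold:
        return "allowed"
    # Stage 3: everything in between goes to a human moderator
    return "human_review"

# Usage with stub scorers standing in for a real classifier
print(moderate("Visit spamlink.example now", lambda post: 0.0))  # blocked_by_rules
print(moderate("some borderline post", lambda post: 0.5))        # human_review
```

Widening the gap between the two thresholds sends more posts to humans and lowers automated error rates; narrowing it does the opposite, which is the cost and accuracy trade-off the results table reflects.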
Common Mistakes & Misconceptions
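- Evaluating NER with token-level accuracy: report entity-level F1 instead, using the seqeval library, which evaluates complete entity spans (both boundaries and type must match).
- Skipping Unicode normalization: apply unicodedata.normalize('NFKC', text) before tokenization.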
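A short check shows why normalization matters for Devanagari and mixed-script input. The same visible character can be stored as different code point sequences, and compatibility characters such as fullwidth digits can appear in scraped text. The example strings below were chosen for illustration.

```python
import unicodedata

def normalize_text(text):
    """Apply NFKC normalization before tokenization."""
    return unicodedata.normalize("NFKC", text)

# The letter QA can be stored as one code point or as KA + nukta sign
precomposed = "\u0958"
decomposed = "\u0915\u093C"
print(precomposed == decomposed)                                   # False
print(normalize_text(precomposed) == normalize_text(decomposed))   # True

# Fullwidth digits are folded to ASCII digits
print(normalize_text("\uFF11\uFF12\uFF13"))                        # 123
```

Without this step, the two spellings of the same word map to different token IDs, so the model treats them as unrelated.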
Comparison Tables
10.1 All Five Projects At a Glance
| Project | Task | Model | Metric | Score |
|---|---|---|---|---|
| Hindi Sentiment | Classification | MuRIL | F1 Macro | 0.852 |
| Legal Summarization | Extractive Summ. | IndicBERT + Transformer | ROUGE-2 | 0.369 |
| IRCTC Chatbot | Intent Classification | MuRIL + Entity Regex | Accuracy | 91.5% |
| Indian News NER | Sequence Labeling | MuRIL (Token Clf.) | Entity F1 | 92.0% |
| Hindi ASR | Speech Recognition | IndicWav2Vec | WER | 12.4% |
10.2 Indian Language Models Comparison
| Model | Params | Languages | Code-Mix | Transliteration | Best Use |
|---|---|---|---|---|---|
| mBERT | 110M | 104 | Poor | Poor | Baseline only |
| XLM-RoBERTa | 270M | 100 | Fair | Fair | Cross-lingual transfer |
| MuRIL | 110M | 17 Indian | Excellent | Excellent | All Indian NLU |
| IndicBERT | 33M | 12 Indian | Good | Good | Lightweight deployment |
| IndicBART | 244M | 11 Indian | Good | Good | Generation tasks |
| IndicTrans2 | 320M | 22 Indian | N/A | N/A | Translation |
10.3 NLP Metrics Cheat Sheet
| Metric | Task | Formula Intuition | Range | Higher = Better? |
|---|---|---|---|---|
| Accuracy | Classification | Correct / Total | 0–1 | Yes |
| F1 (Macro) | Classification | Harmonic mean of P & R, averaged | 0–1 | Yes |
| ROUGE-1 | Summarization | Unigram overlap with reference | 0–1 | Yes |
| ROUGE-2 | Summarization | Bigram overlap with reference | 0–1 | Yes |
| ROUGE-L | Summarization | Longest common subsequence | 0–1 | Yes |
| Entity F1 | NER | Exact span + type match F1 | 0–1 | Yes |
| WER | ASR | (S+D+I) / N (edit distance) | 0–∞ | No (lower is better) |
| BLEU | Translation | N-gram precision with brevity penalty | 0–1 | Yes |
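The "bigram overlap" intuition behind ROUGE-2 can be made concrete in a few lines. This is a simplified sketch that assumes whitespace tokenization and a single reference, with no stemming; the function names and example sentences are made up, and reported scores should come from an established ROUGE implementation.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=2):
    """Return (precision, recall, F1) of n-gram overlap."""
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    overlap = sum((ref & cand).values())          # clipped matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "the court dismissed the appeal"
candidate = "the court dismissed appeal"
precision, recall, f1 = rouge_n(reference, candidate, n=2)
print(f"ROUGE-2 P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here the candidate shares 2 of the reference's 4 bigrams and 2 of its own 3, so dropping a single word already halves the recall.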
Exercises
Section A: Multiple Choice Questions
Which model is specifically designed for Indian language understanding and outperforms mBERT on IndicGLUE?
- GPT-3
- XLM-RoBERTa-base
- MuRIL (Google)
- BERT-base-uncased
In BIO tagging for NER, the tag sequence "B-PER I-ORG" is:
- Valid: it means a person at an organization
- Invalid: I-ORG can only follow B-ORG or I-ORG
- Valid, but only for code-mixed text
- Invalid: there is no I-ORG tag
For evaluating a legal document summarizer, which metric is most appropriate?
- BLEU score
- Accuracy
- ROUGE-2 F1
- Word Error Rate
A Flipkart review says "product toh sahi hai but delivery bahut late." What type of code-mixing is this?
- Intra-word mixing
- Sentence-level switching
- Word-level mixing (intra-sentential)
- Transliteration only
Why does extractive summarization suit legal documents better than abstractive?
- Extractive is always more accurate
- Legal language requires exact wording; paraphrasing risks changing legal meaning
- Abstractive models cannot handle long documents
- Extractive is faster at inference time
In a Bi-LSTM-CRF for NER, what does the CRF layer primarily enforce?
- Word embedding quality
- Valid tag transition constraints (e.g., B-PER can be followed by I-PER but not I-ORG)
- Faster training convergence
- Better handling of long sequences
IndicWav2Vec achieves lower WER than Facebook's Wav2Vec 2.0 on Indian languages because:
- It uses a larger Transformer architecture
- It was pretrained on 40,000+ hours of unlabeled Indian speech
- It uses a different loss function
- It doesn't use self-supervised pretraining
An IRCTC chatbot classifies "PNR check karo 4521876340" with 65% confidence. The threshold is 75%. What happens?
- The bot responds with the PNR status
- The query is escalated to a human agent
- The bot asks the user to rephrase
- The query is discarded
WER (Word Error Rate) of 12.4% means:
- 12.4% of words in the reference are correctly transcribed
- 87.6% of words are correctly transcribed (approximately)
- 12.4% of the audio duration was silence
- The model used 12.4% of its vocabulary
When fine-tuning a Transformer for NER with subword tokenization, how should we handle a word like "मुंबई" tokenized as ["मुं", "##बई"]?
- Predict NER tags for both subwords independently
- Take the first subword's prediction as the word-level tag
- Average the logits of both subwords
- Discard the word entirely
Section B: Short Answer Questions
- Intermediate: Explain three ways code-mixing manifests in Hindi-English social media text. Give one example of each type and explain why each is challenging for NLP models. (6 marks)
- Intermediate: Why does IndicBERT (33M parameters) sometimes outperform XLM-RoBERTa (270M parameters) on Indian language tasks? Discuss the concept of "language-specific capacity" in multilingual models. (6 marks)
- Intermediate: Compare ROUGE-1, ROUGE-2, and ROUGE-L. Which is most important for legal summarization and why? (6 marks)
- Advanced: Describe the oracle label creation process for extractive summarization. Why is greedy ROUGE optimization used instead of finding the globally optimal subset? (8 marks)
- Intermediate: Explain why a cascade architecture (rules → ML → human) is preferred over pure ML for content moderation at scale. Use the Koo case study. (6 marks)
Section C: Long Answer Questions
- Advanced: Design a complete NLP system for Zomato that handles restaurant reviews in Hindi, Tamil, Telugu, and English. Cover: (a) data collection and annotation strategy, (b) model selection and justification, (c) handling code-mixed reviews, (d) aspect-based sentiment for (food, service, ambiance, price), (e) deployment architecture for <200ms latency, (f) evaluation metrics and expected benchmarks. (20 marks)
- Advanced: Compare Bi-LSTM-CRF and Transformer-based NER models for Indian news. For each architecture: (a) draw the complete architecture diagram, (b) explain the training procedure, (c) analyze performance on different entity types (PER, ORG, LOC, DATE, CUR), (d) discuss computational requirements, (e) recommend which to use for a news startup processing 10,000 articles/day with limited GPU budget. (20 marks)
- Advanced: Explain how Wav2Vec 2.0 uses self-supervised learning for speech recognition. Cover: (a) the contrastive learning objective, (b) quantization of latent representations, (c) masking strategy, (d) CTC decoding, and (e) why self-supervised pretraining is critical for low-resource Indian languages like Odia and Assamese. (15 marks)
Section D: Programming Exercises
- Intermediate: Build a Hinglish preprocessor that: (a) detects the language ratio (Hindi vs English characters), (b) normalizes common transliteration variants (e.g., "accha" → "अच्छा"), (c) handles emoji sentiment markers (e.g., 😊 → positive, 😤 → negative), and (d) cleans social media noise (URLs, mentions, repeated characters). Test on 10 real Flipkart-style reviews.
- Advanced: Implement entity-level F1 evaluation from scratch (without using the seqeval library). Your function should: (a) extract entity spans from BIO tag sequences, (b) compute precision, recall, and F1 for each entity type, (c) handle edge cases (incomplete entities, nested entities), and (d) return both micro-averaged and macro-averaged F1.
- Advanced: Build a simple extractive summarizer that: (a) splits a document into sentences, (b) encodes each sentence using TF-IDF vectors, (c) computes a sentence importance score using TextRank (graph-based), (d) selects top-K sentences, and (e) evaluates against a reference summary using ROUGE-2. Test on a sample Indian court judgment.
Section E: Mini-Project
Mini-Project: Multilingual Indian News Intelligence System
Objective
Build an end-to-end NLP pipeline that processes Indian news articles in Hindi and English, performing:
- NER: Extract Person, Organization, Location, Date, Currency entities
- Sentiment: Classify article tone (Positive/Negative/Neutral)
- Summarization: Generate a 3-sentence extractive summary
- Topic Classification: Politics, Business, Sports, Technology, Entertainment
Requirements
- Use MuRIL or IndicBERT as the backbone
- Process at least 100 test articles (50 Hindi, 50 English)
- Report entity-level F1 (NER), macro F1 (sentiment/topic), and ROUGE-2 (summarization)
- Handle code-mixed articles (Hindi-English)
- Build a simple web interface using Gradio or Streamlit
Dataset Sources
- Hindi news: BBC Hindi, Amar Ujala (web scraping), IndicNLP News Classification dataset
- NER annotations: WikiNER Hindi, FIRE NER shared task datasets
- Summaries: Create oracle extractive labels from article headlines
Deliverables
- Complete Python code (Jupyter notebook or .py files)
- Evaluation report with metrics per task and per language
- Error analysis: 10 examples where the system fails and why
- A 5-minute demo video showing the system in action
Grading Rubric
| Component | Marks |
|---|---|
| Working NER pipeline with evaluation | 20 |
| Sentiment + topic classification | 15 |
| Extractive summarization with ROUGE | 15 |
| Code-mixing handling | 10 |
| Web interface (Gradio/Streamlit) | 10 |
| Error analysis and documentation | 15 |
| Code quality and reproducibility | 15 |
| Total | 100 |
Chapter Summary
Key Takeaways: Applied NLP for India
- Indian NLP faces six unique challenges: code-mixing, script diversity (13+ scripts), morphological richness, low-resource languages, transliteration variants, and domain-specific vocabulary. These make even "solved" English NLP tasks frontier research problems.
- MuRIL (Google) is the go-to model for Indian language NLU. Pretrained on 17 Indian languages with transliterated and code-mixed data, it outperforms mBERT by ~14 points and XLM-RoBERTa by ~7 points on Indian benchmarks.
- Hindi sentiment analysis using MuRIL achieves 85.2% macro F1 on Flipkart reviews. The key innovation is preprocessing that handles Hinglish code-mixing and transliteration normalization.
- Legal document summarization using extractive BertSumExt achieves ROUGE-2 of 0.369 on Indian court judgments. Extractive is preferred over abstractive for legal text because paraphrasing legal language can change its meaning.
- IRCTC chatbot achieves 91.5% intent classification accuracy across 12 intent categories. A confidence-based escalation system (threshold 0.75) ensures uncertain queries reach human agents.
- NER for Indian news: Transformer models (MuRIL) achieve 92.0% entity F1, outperforming Bi-LSTM-CRF (87.3%) by 4.7 points. However, Bi-LSTM-CRF is about 7× faster at inference, making it suitable for latency-critical applications.
- ASR using IndicWav2Vec achieves 12.4% WER on Hindi, a 37% relative improvement over the English Wav2Vec 2.0 (19.8% WER). Self-supervised pretraining on unlabeled Indian speech is the key enabler for low-resource languages.
- AI4Bharat (IIT Madras) has built India's language AI stack: IndicBERT, IndicBART, IndicTrans2, IndicWav2Vec, and IndicTTS. All are open-source and power the government's Bhashini platform.
- Cascade architectures (rules → ML → human) outperform pure ML for production NLP at scale, as demonstrated by Koo's content moderation system that reduced costs by 72%.
- Evaluation metrics matter: Use F1 (not accuracy) for classification, ROUGE (not BLEU) for summarization, entity-level F1 (not token accuracy) for NER, and WER for ASR. Using the wrong metric gives misleading results.
Data Preprocessing (script normalization, code-mix handling)
→ Pretrained Model (MuRIL / IndicBERT / IndicWav2Vec)
→ Fine-tuning (task-specific labeled data)
→ Evaluation (correct metric per task)
→ Deployment (cascade for latency + cost)
References & Further Reading
Foundational Papers
- Khanuja, S., et al. (2021). "MuRIL: Multilingual Representations for Indian Languages." Google Research. The paper behind Google's Indian language model, trained on 17 languages with transliteration.
- Kakwani, D., et al. (2020). "IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages." AI4Bharat, EMNLP Findings. The IndicBERT paper from IIT Madras.
- Malik, V., et al. (2021). "ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation." IIIT Hyderabad, ACL. The dataset used in our legal summarization project.
- Liu, Y., & Lapata, M. (2019). "Text Summarization with Pretrained Encoders." EMNLP. The BertSumExt/BertSumExtAbs paper, the foundation for our legal summarizer.
- Baevski, A., et al. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." Meta AI, NeurIPS. The foundational ASR model we adapt with IndicWav2Vec.
Indian NLP Resources
- Joshi, A., et al. (2022). "IndicWav2Vec: Exploring Wav2Vec 2.0 for Indian Languages." AI4Bharat. Pretrained on 40,000+ hours of Indian speech.
- Kunchukuttan, A., et al. (2020). "AI4Bharat-IndicNLP Corpus." IIT Madras. Large-scale corpora for 12 Indian languages.
- Aggarwal, P., & Rani, R. (2023). "Code-Mixed Sentiment Analysis for Hindi-English Social Media Text." Survey of techniques for Hinglish NLP.
- Lample, G., et al. (2016). "Neural Architectures for Named Entity Recognition." NAACL. The original Bi-LSTM-CRF paper for NER.
- Lin, C.-Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries." ACL Workshop. The standard evaluation metric for summarization.
Platforms & Tools
- AI4Bharat: ai4bharat.org. India's premier open-source language AI initiative (IIT Madras)
- Bhashini: bhashini.gov.in. Government of India's language translation platform
- HuggingFace Indian Models: search "MuRIL", "IndicBERT", "IndicWav2Vec" on huggingface.co
- IndicNLP Library: github.com/anoopkunchukuttan/indic_nlp_library. Preprocessing tools for Indian languages
- iNLTK: github.com/goru001/inltk. Indian NLP Toolkit
Textbooks
- Jurafsky, D. & Martin, J.H. (2024). "Speech and Language Processing." 3rd Edition (Draft). The definitive NLP textbook, with chapters on NER, summarization, and ASR.
- Goldberg, Y. (2017). "Neural Network Methods for Natural Language Processing." Morgan & Claypool. Clear explanation of Bi-LSTM-CRF and attention mechanisms.