Part IX: Advanced Topics ⏱️ 4 Hours Advanced

Chapter 24
Natural Language Processing & Text Mining

From tokenization to transformers — master the art of teaching machines to read, write, understand, and generate human language. Explore Word2Vec, BERT, RAG, LoRA, and multilingual NLP for India's diverse languages.

Prerequisites: Chapter 19 (Deep Learning Fundamentals), Chapter 20 (Sequence Models & RNNs). Familiarity with Python, PyTorch/TensorFlow, and probability theory is recommended.

📋 Learning Objectives

By the end of this chapter, you will be able to:

Preprocess text using tokenization, stemming, lemmatization, and stop word removal in Python with NLTK and spaCy.
Build text representations using Bag-of-Words, TF-IDF, and understand their mathematical foundations.
Train and use word embeddings — Word2Vec (CBOW & Skip-gram), GloVe, and FastText — understanding the objective functions behind each.
Design sentiment analysis pipelines from data collection through feature extraction to classification and deployment.
Implement Named Entity Recognition (NER) using sequence labeling with BIO tags, CRFs, and transformer models.
Build text classifiers for spam detection, topic classification, and multilingual settings using both classical ML and deep learning.
Understand text summarization — extractive (TextRank) vs. abstractive (seq2seq, BART, T5) — and implement both approaches.
Design question answering systems using retrieval-based and generative approaches.
Trace the evolution of language models from n-grams → neural LMs → GPT/BERT → modern LLMs.
Apply NLP to Indian languages, handling multilingual challenges, code-mixing, and script diversity using AI4Bharat tools.
Fine-tune LLMs using LoRA, QLoRA, and prompt engineering techniques for domain-specific tasks.
Implement Retrieval-Augmented Generation (RAG) pipelines combining vector databases with generative models.

🎯 Career Path

NLP Engineer roles are among the highest-paid in AI (₹25–80 LPA in India, $150K–$300K+ in the US). Key skills: transformers, fine-tuning, prompt engineering, multilingual NLP. Companies hiring: Google, Microsoft, Amazon, Flipkart, Jio, Krutrim, Sarvam AI.

📖 Introduction

Natural Language Processing (NLP) sits at the intersection of linguistics, computer science, and artificial intelligence. It is the field dedicated to enabling machines to understand, interpret, generate, and interact with human language — one of the most complex and nuanced systems ever created by humanity.

Consider the staggering scale: there are approximately 7,000 languages spoken worldwide, with India alone home to 22 scheduled languages and over 19,500 dialects. Every day, humans produce about 2.5 quintillion bytes of data, and roughly 80% of it is unstructured text — emails, social media posts, documents, reviews, and conversations. NLP is the key that unlocks this vast reservoir of information.

Text Mining, closely related to NLP, focuses on extracting meaningful patterns, trends, and insights from large text corpora. While NLP provides the tools (parsing, understanding, generation), text mining applies them to discover knowledge hidden in text.

Why NLP Matters Now More Than Ever

ChatGPT & LLMs: The launch of ChatGPT (November 2022) and subsequent models like GPT-4, Gemini, and Claude showed that a single large language model can handle a wide range of tasks through a natural-language interface.
India's Digital Transformation: India has 800M+ internet users, many of whom are not English speakers. NLP for Indian languages is critical for financial inclusion (UPI voice payments), governance (e-courts), and education.
Enterprise Adoption: 75% of Fortune 500 companies now use NLP for customer service, document processing, compliance monitoring, and market intelligence.
Healthcare: NLP extracts diagnoses from clinical notes, enables medical chatbots, and mines research papers for drug discovery.

🎓 Professor's Insight

NLP has undergone three major paradigm shifts: (1) rule-based systems (1960s–1990s), built on hand-crafted grammars; (2) statistical methods (1990s–2013), built on probabilistic models like HMMs and CRFs; (3) the deep learning era (2013–present), from Word2Vec through BERT to GPT-4. No paradigm fully replaced the one before it, so understanding all three is essential for building robust NLP systems.

The NLP Technology Stack

Layer	Components	Examples
Raw Text	Documents, tweets, reviews	Web crawls, user inputs
Preprocessing	Tokenization, normalization, cleaning	NLTK, spaCy, regex
Representation	BoW, TF-IDF, embeddings	Word2Vec, BERT embeddings
Understanding	POS tagging, NER, parsing	spaCy, Stanford NLP
Application	Classification, QA, summarization	HuggingFace Transformers
Generation	Text generation, translation	GPT-4, mBART, IndicTrans

🏛️ Historical Background

The history of NLP is a fascinating journey from ambitious dreams to remarkable realities.

The Pioneering Era (1950–1970)

1950 — Alan Turing proposed the Turing Test in "Computing Machinery and Intelligence," framing language understanding as the benchmark for machine intelligence.

1954 — Georgetown-IBM Experiment: The first machine translation demonstration translated 60 Russian sentences into English using a dictionary of 250 words and 6 grammar rules. Researchers predicted full MT in 3–5 years — it took 60+.

1966 — ELIZA (Joseph Weizenbaum, MIT): The first chatbot, simulating a Rogerian psychotherapist using simple pattern matching. People formed emotional bonds with it, leading Weizenbaum to warn about AI deception.

1969 — SHRDLU (Terry Winograd, MIT): A natural language understanding system that could manipulate virtual blocks on a table, showing deep but extremely narrow language understanding.

The Statistical Revolution (1980–2010)

1980s — Hidden Markov Models (HMMs) revolutionized speech recognition and POS tagging. Fred Jelinek (IBM): "Every time I fire a linguist, the performance of the speech recognizer goes up."

1993 — Penn Treebank: The creation of large annotated corpora enabled statistical parsing.

2001 — Conditional Random Fields (CRFs) by John Lafferty: Superior to HMMs for sequence labeling tasks like NER.

2003 — Latent Dirichlet Allocation (LDA) by David Blei: Topic modeling from text corpora.

The Deep Learning Revolution (2013–Present)

2013 — Word2Vec (Mikolov et al., Google): Dense word embeddings capturing semantic relationships. "king - man + woman = queen."

2014 — GloVe (Stanford): Global Vectors combining count-based and prediction-based approaches.

2014 — Seq2Seq + Attention (Bahdanau et al.): Transformed machine translation.

2017 — "Attention Is All You Need" (Vaswani et al., Google): The Transformer architecture — the single most impactful paper in modern AI.

2018 — BERT (Devlin et al., Google): Bidirectional pre-training shattered NLP benchmarks.

2020 — GPT-3 (OpenAI): 175B parameters, few-shot learning, emergent abilities.

2022 — ChatGPT: NLP went mainstream, reaching 100M users in 2 months.

2023–2025 — AI4Bharat: IndicTrans2 covering 22 Indian languages; Krutrim and Sarvam AI building India-first LLMs.

🇮🇳 India Spotlight

India's NLP Heritage: Panini's Ashtadhyayi (4th century BCE) — 3,959 rules describing Sanskrit grammar — is considered the world's first formal grammar, predating modern computational linguistics by millennia. Modern Indian NLP builds on this rich linguistic tradition through projects like AI4Bharat's IndicNLP Suite, which provides models for 22+ Indian languages.

💡 Conceptual Explanation

4.1 Text Preprocessing Pipeline

Raw text is messy — it contains HTML tags, special characters, inconsistent casing, and irrelevant words. Preprocessing transforms raw text into a clean, standardized format suitable for analysis.

Tokenization

Tokenization splits text into individual units (tokens). These can be words, subwords, or characters.

Word Tokenization: "I can't believe it!" → ["I", "can't", "believe", "it", "!"] or ["I", "ca", "n't", "believe", "it", "!"]
Subword Tokenization: "unhappiness" → ["un", "happiness"] or ["un", "happ", "iness"]. GPT models use Byte-Pair Encoding (BPE); BERT uses the closely related WordPiece algorithm.
Character Tokenization: "hello" → ["h", "e", "l", "l", "o"]. Useful for morphologically rich languages like Hindi, Tamil.
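
The three granularities above can be sketched with the standard library alone. The regular expression, the function names, and the tiny `toy_vocab` are illustrative assumptions; in particular, the subword function is a greedy longest-match lookup against a fixed vocabulary, not a trained BPE or WordPiece model.

```python
import re

def word_tokenize_simple(text):
    """Split into words (keeping internal apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

def char_tokenize(text):
    """One token per character."""
    return list(text)

def greedy_subword_tokenize(word, vocab):
    """Greedy longest-match segmentation against a fixed subword vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical vocabulary, chosen only to reproduce the example in the text.
toy_vocab = {"un", "happ", "iness", "happiness"}

print(word_tokenize_simple("I can't believe it!"))
print(greedy_subword_tokenize("unhappiness", toy_vocab))
print(char_tokenize("hello"))
```

Removing "happiness" from `toy_vocab` makes the same call return the finer split ["un", "happ", "iness"], which shows how the vocabulary, not the word, decides the segmentation.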

Stemming vs. Lemmatization

Stemming crudely chops off word endings: "running" → "run", but "better" stays "better" because no suffix rule can relate it to "good". Fast but inaccurate.

Lemmatization uses vocabulary and morphological analysis: "running" → "run", "better" → "good" (given the correct part-of-speech tag). Slower but linguistically correct.

Stop Word Removal

Common words like "the", "is", "at" carry little meaning for many NLP tasks, but not all: in sentiment analysis, dropping a stop word such as "not" can flip the meaning of a sentence.

4.2 Text Representation Models

Bag of Words (BoW)

Represents text as a vector of word frequencies, ignoring order. Simple but effective for many classification tasks.

Document: "the cat sat on the mat" → {the: 2, cat: 1, sat: 1, on: 1, mat: 1}
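
A minimal sketch of the same idea in Python, assuming whitespace tokenization and lowercase text. The helper names `bag_of_words` and `to_vector` are made up for this illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each whitespace-separated token occurs."""
    return Counter(text.lower().split())

def to_vector(counts, vocabulary):
    """Lay the counts out as a fixed-order vector over a vocabulary."""
    return [counts[word] for word in vocabulary]

doc = "the cat sat on the mat"
counts = bag_of_words(doc)
vocabulary = sorted(counts)

print(dict(counts))
print(vocabulary, to_vector(counts, vocabulary))

# Word order is discarded: a reordered sentence yields the same bag.
print(bag_of_words("the mat sat on the cat") == counts)
```

The last line prints True, which is exactly the limitation the text describes: two sentences with different meanings can share one BoW vector.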

TF-IDF (Term Frequency-Inverse Document Frequency)

Improves BoW by weighting words based on how informative they are. Words common across all documents (like "the") get low weights; words unique to specific documents get high weights.

4.3 Word Embeddings

Unlike BoW/TF-IDF (sparse, high-dimensional), word embeddings are dense, low-dimensional vectors that capture semantic meaning. The key insight is the distributional hypothesis: "You shall know a word by the company it keeps" (J.R. Firth, 1957).

Word2Vec — CBOW (Continuous Bag of Words)

Predicts the center word from surrounding context words. Given context ["the", "cat", "on", "the"], predict "sat".

Word2Vec — Skip-gram

The reverse: predicts context words from the center word. Given "sat", predict ["the", "cat", "on", "the"]. Works better for rare words and small datasets.
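
To make the two training setups concrete, this sketch enumerates the examples each one would see for the running sentence. The helper names (`context_window`, `cbow_examples`, `skipgram_pairs`) are hypothetical, and it leaves out refinements used by real implementations such as frequent-word subsampling and randomly shrunk windows.

```python
def context_window(tokens, center_index, window):
    """Tokens within `window` positions of the center, excluding the center."""
    left = tokens[max(0, center_index - window):center_index]
    right = tokens[center_index + 1:center_index + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """CBOW: one example per position, (context words) -> center word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_pairs(tokens, window):
    """Skip-gram: one example per (center word, single context word) pair."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

tokens = "the cat sat on the".split()

print(cbow_examples(tokens, window=2)[2])
print([pair for pair in skipgram_pairs(tokens, window=2) if pair[0] == "sat"])
```

CBOW produces one training example per position, while Skip-gram produces one per context word, so Skip-gram gives each (possibly rare) center word more gradient updates.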

GloVe (Global Vectors)

Combines the best of count-based (SVD on co-occurrence matrix) and prediction-based (Word2Vec) methods by factorizing the log co-occurrence matrix.

FastText

Extends Word2Vec by representing words as bags of character n-grams: "where" → {"<wh", "whe", "her", "ere", "re>"}. This handles out-of-vocabulary words and morphologically rich languages.
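
The n-gram extraction can be sketched in a few lines. This illustration uses a single n-gram length (n = 3) to match the example above; FastText itself combines several lengths and also keeps the whole word, which is omitted here. The function name `char_ngrams` is an assumption of this sketch.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("where"))

# An unseen word still shares n-grams with known words, which is what
# lets a FastText-style model build a vector for it.
shared = set(char_ngrams("where")) & set(char_ngrams("nowhere"))
print(sorted(shared))
```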

4.4 Sentiment Analysis

Determines the emotional tone of text: positive, negative, or neutral (or fine-grained scales). Applications include product review analysis, brand monitoring, stock market prediction, and political polling.

4.5 Named Entity Recognition (NER)

Identifies and classifies named entities: persons, organizations, locations, dates, monetary values, etc. Uses BIO tagging (Beginning, Inside, Outside) for sequence labeling.
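
As a small illustration of how BIO tags encode entity boundaries, the sketch below converts a tagged token sequence back into entity spans. The function name `bio_to_spans` is hypothetical, and the handling of a stray I- tag (treated as outside any entity) is one possible convention, not a standard.

```python
def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (entity text, entity type) spans."""
    spans, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) closes the span.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = ["Sundar", "Pichai", "leads", "Google", "from", "California"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
```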

4.6 Language Models: Evolution

A language model assigns probabilities to sequences of words. The evolution: N-gram → Neural LM → RNN LM → Transformer LM (GPT, BERT) → LLMs (GPT-4, Gemini).

4.7 Retrieval-Augmented Generation (RAG)

Combines retrieval (finding relevant documents from a knowledge base) with generation (using an LLM to synthesize answers). Grounding the answer in retrieved text mitigates hallucination, knowledge-cutoff, and domain-specificity problems, though it does not eliminate them.
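
The retrieve-then-prompt flow can be sketched without any external service. Everything here is a stand-in: `embed` is a toy bag-of-words vector instead of a neural encoder, the in-memory list replaces a vector database, and the final LLM call is left out, so the sketch stops at the constructed prompt.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts (a real system would use a neural encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, retrieved):
    """Place the retrieved chunks in front of the question."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved)
    return f"Given this context:\n{context}\nAnswer: {query}"

# Hypothetical knowledge base for the illustration.
knowledge_base = [
    "BERT is a bidirectional transformer trained with masked language modeling.",
    "GPT is an autoregressive language model.",
    "TF-IDF weights terms by how rare they are across documents.",
]
query = "How is BERT trained?"
top = retrieve(query, knowledge_base, k=1)
prompt = build_prompt(query, top)
print(prompt)  # this string would be sent to the LLM
```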

📝 Exam Tip

For GATE/NET exams, remember: BoW ignores word order, TF-IDF adds document-level weighting, Word2Vec learns from local context windows, GloVe uses global co-occurrence statistics. BERT is bidirectional (masked LM), GPT is unidirectional (autoregressive). These distinctions are frequently tested.

📐 Mathematical Foundation

5.1 TF-IDF Mathematics

Term Frequency

TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in d)

Inverse Document Frequency

IDF(t, D) = log(N / |{d ∈ D : t ∈ d}|)

TF-IDF Score

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

Where N = total number of documents, and |{d ∈ D : t ∈ d}| = number of documents containing term t.

5.2 Word2Vec Skip-gram Objective

The Skip-gram model maximizes the probability of context words given a center word. Equivalently, it minimizes the average negative log-likelihood J(θ):

Skip-gram Objective Function

J(θ) = -(1/T) Σ_{t=1}^{T} Σ_{-c≤j≤c, j≠0} log P(w_{t+j} | w_t)

Where T is the total number of words, c is the context window size, and the conditional probability uses softmax:

Softmax Probability

P(w_O | w_I) = exp(v'_{w_O}ᵀ v_{w_I}) / Σ_{w=1}^{V} exp(v'_wᵀ v_{w_I})

Here v_{w} and v'_{w} are the input and output vector representations of word w, and V is the vocabulary size.

5.3 GloVe Objective

GloVe Cost Function

J = Σ_{i,j=1}^{V} f(X_{ij})(w_iᵀ w̃_j + b_i + b̃_j - log X_{ij})²

Where X_{ij} is the co-occurrence count of words i and j, f(x) is a weighting function that caps high-frequency pairs, and b_i, b̃_j are bias terms.
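
The weighting function is easier to read in code. The sketch below assumes the form f(x) = (x / x_max)^α for x < x_max and 1 otherwise, with x_max = 100 and α = 0.75 as the values commonly quoted from the GloVe paper; treat both the defaults and the helper names as assumptions of this illustration.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight for a co-occurrence count: grows sublinearly, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """One term of the GloVe sum for a word pair with co-occurrence count x_ij."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# Rare pairs get small weights; very frequent pairs are capped at 1.
for count in (1, 10, 100, 10_000):
    print(count, round(glove_weight(count), 4))

# The loss is zero when the dot product plus biases equals log(X_ij).
print(glove_pair_loss([1.0, 0.0], [math.log(50), 0.0], 0.0, 0.0, 50))
```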

5.4 Negative Sampling Approximation

Computing the full softmax over vocabulary V (which can be 100K–1M words) is expensive. Negative sampling approximates this:

Negative Sampling Objective

log σ(v'_{w_O}ᵀ v_{w_I}) + Σ_{i=1}^{k} E_{w_i ~ P_n(w)} [log σ(-v'_{w_i}ᵀ v_{w_I})]

Where σ is the sigmoid function, k is the number of negative samples (typically 5–20), and P_n(w) is the noise distribution (usually the unigram distribution raised to the 3/4 power).

5.5 Attention Mechanism (Transformer)

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Where Q (queries), K (keys), V (values) are linear projections of the input, and d_k is the dimension of the keys. The √d_k scaling prevents dot products from growing too large.
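
A dependency-free sketch of the formula, using plain lists of floats as matrices. It covers a single attention head with no masking or batching, and the function names are chosen for this illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for row-major lists of vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One score per key: scaled dot product of this query with each key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        alpha = softmax(scores)
        weights.append(alpha)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(a * v[j] for a, v in zip(alpha, V))
                        for j in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
print(weights)
print(outputs)
```

Each row of `weights` sums to 1, and each query puts more weight on the key it is aligned with, so each output row is pulled toward the matching value vector.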

5.6 N-gram Language Model

Bigram Probability (Markov Assumption)

P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})

5.7 Perplexity (Language Model Evaluation)

Perplexity

PP(W) = P(w_1, w_2, ..., w_N)^{-1/N} = 2^{H(W)}

Lower perplexity = better model. Here H(W) is the model's per-word cross-entropy on W, measured in bits. A perplexity of k means the model is as confused as if it had to choose uniformly among k possibilities at each step.

5.8 Cosine Similarity (Embedding Comparison)

Cosine Similarity

cos(A, B) = (A · B) / (||A|| × ||B||) = Σ A_i B_i / (√Σ A_i² × √Σ B_i²)

🔣 Formula Derivations

6.1 Deriving TF-IDF from First Principles

Motivation: We want a scoring function that tells us how important a word is to a specific document in a collection.

Step 1 — Term Frequency: If a word appears more often in a document, it's likely more relevant to that document.

  TF(t, d) = count(t, d) / |d|
  
  Example: Document d has 100 words, "machine" appears 5 times
  TF("machine", d) = 5/100 = 0.05

Step 2 — The Problem with TF Alone: Common words like "the", "is" have high TF in every document but carry no discriminative information.

Step 3 — Inverse Document Frequency: We need a factor that penalizes words appearing in many documents:

  df(t) = number of documents containing term t
  N = total number of documents
  
  If df(t) = N → word appears everywhere → not informative → weight should be LOW
  If df(t) = 1 → word is unique to one document → very informative → weight should be HIGH
  
  Simple ratio: N / df(t) gives higher values for rarer terms
  
  But this grows linearly and can be huge. We take the logarithm:
  IDF(t) = log(N / df(t))
  
  When df(t) = N: IDF = log(1) = 0 (zero weight — perfect!)
  When df(t) = 1: IDF = log(N) (maximum weight — perfect!)

Step 4 — Combine: TF-IDF(t, d) = TF(t, d) × IDF(t) gives high scores to words that are frequent in a specific document but rare across the collection.

6.2 Deriving Skip-gram with Negative Sampling

Motivation: The full softmax in Word2Vec is O(V) per training step — too slow for large vocabularies.

Step 1 — Original objective for a single (center, context) pair:

  maximize: log P(w_c | w_t) = log [exp(u_c · v_t) / Σ_w exp(u_w · v_t)]
  
  This requires summing over ALL V words in the vocabulary — O(V) per step.

Step 2 — Reformulate as binary classification: Instead of multi-class softmax, ask: "Is (w_t, w_c) a real pair or a fake pair?"

  P(D=1 | w_t, w_c) = σ(u_c · v_t) = 1 / (1 + exp(-u_c · v_t))
  
  For a real pair, maximize σ(u_c · v_t)
  For a fake pair (w_t, w_k), maximize σ(-u_k · v_t), i.e., minimize σ(u_k · v_t)

Step 3 — Sample k negative (fake) pairs for each real pair:

  J = log σ(u_c · v_t) + Σ_{i=1}^{k} E_{w_i ~ P_n} [log σ(-u_{w_i} · v_t)]
  
  Now each step is O(k) instead of O(V), where k = 5-20 ≪ V
  
  Noise distribution P_n(w) = [count(w)]^{3/4} / Σ_w [count(w)]^{3/4}
  The 3/4 power smooths the distribution, giving rare words more sampling chance.
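
The two ingredients of the derivation, the smoothed noise distribution and the per-pair objective, can be sketched as follows. The vectors and counts are made-up toy values, and the negatives are passed in directly rather than sampled, to keep the example deterministic.

```python
import math

def noise_distribution(counts, power=0.75):
    """P_n(w): unigram counts raised to `power`, then normalized."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_objective(v_center, u_context, u_negatives):
    """log sigma(u_c . v_t) + sum over negatives of log sigma(-u_k . v_t)."""
    positive = math.log(sigmoid(dot(u_context, v_center)))
    negative = sum(math.log(sigmoid(-dot(u_k, v_center))) for u_k in u_negatives)
    return positive + negative

# Smoothing raises the sampling probability of the rare word.
counts = {"the": 1000, "cat": 10}
probs = noise_distribution(counts)
print(round(counts["cat"] / sum(counts.values()), 4), round(probs["cat"], 4))

# The objective is higher (closer to 0) when the true context vector aligns
# with the center vector and the negative vector points away from it.
print(sgns_objective([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]]))
print(sgns_objective([1.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]]))
```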

6.3 Deriving Attention Scaling Factor

Why divide by √d_k?

  Let q and k be independent random vectors with components ~ N(0, 1)
  
  q · k = Σ_{i=1}^{d_k} q_i × k_i
  
  E[q_i × k_i] = E[q_i] × E[k_i] = 0 × 0 = 0
  Var(q_i × k_i) = E[q_i²] × E[k_i²] - 0² = 1 × 1 = 1
  
  The d_k terms are independent, so their variances add:
  Var(q · k) = Σ_{i=1}^{d_k} Var(q_i × k_i) = d_k
  
  So q · k has variance d_k. For large d_k, dot products become very large,
  pushing softmax into saturated regions (extremely peaked distribution).
  
  Solution: Divide by √d_k to normalize variance back to 1:
  Var(q · k / √d_k) = d_k / d_k = 1 ✓
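
The variance argument can be checked empirically with a short simulation. This sketch draws standard-normal vectors with Python's `random` module; the sample size, seed, and function name are arbitrary choices for the illustration, so the printed values are estimates close to d_k and 1, not exact.

```python
import random
import statistics

def dot_product_variance(d_k, samples=5000, scale=False, seed=0):
    """Estimate Var(q . k), optionally after dividing by sqrt(d_k)."""
    rng = random.Random(seed)
    values = []
    for _ in range(samples):
        s = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        values.append(s / d_k ** 0.5 if scale else s)
    return statistics.pvariance(values)

raw = dot_product_variance(64)
scaled = dot_product_variance(64, scale=True)
print(f"unscaled variance ~ {raw:.1f} (theory: 64)")
print(f"scaled variance   ~ {scaled:.2f} (theory: 1)")
```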

🎓 Professor's Insight

The √d_k scaling in attention is often treated as a "trick," but it's deeply principled. Without it, gradient flow through the softmax degrades for high-dimensional models (d_k = 64 in BERT-base), making training unstable. This is an example of how understanding variance propagation is critical in deep learning.

📝 Worked Numerical Examples

Example 1: TF-IDF Calculation

Problem: Given 3 documents, compute TF-IDF for the word "learning":

  D1: "machine learning is great"          (4 words)
  D2: "deep learning and machine learning"  (5 words, "learning" appears 2x)
  D3: "great machines work well"            (4 words)
  
  Step 1: TF("learning", D1) = 1/4 = 0.25
           TF("learning", D2) = 2/5 = 0.40
           TF("learning", D3) = 0/4 = 0.00
  
  Step 2: df("learning") = 2 (appears in D1 and D2)
           N = 3
           IDF("learning") = log₂(3/2) = log₂(1.5) = 0.585
  
  Step 3: TF-IDF("learning", D1) = 0.25 × 0.585 = 0.146
           TF-IDF("learning", D2) = 0.40 × 0.585 = 0.234
           TF-IDF("learning", D3) = 0.00 × 0.585 = 0.000
  
  ✓ "learning" is most important to D2 (highest TF-IDF score)
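
The arithmetic in Example 1 can be reproduced with a few lines of Python. The sketch assumes whitespace tokenization and log base 2, as in the example; the helper names `tf` and `idf` are chosen for this illustration, and `idf` assumes the term occurs in at least one document.

```python
import math

docs = {
    "D1": "machine learning is great",
    "D2": "deep learning and machine learning",
    "D3": "great machines work well",
}

def tf(term, text):
    """Term count divided by document length."""
    words = text.split()
    return words.count(term) / len(words)

def idf(term, corpus, base=2):
    """log(N / df); assumes the term appears in at least one document."""
    df = sum(1 for text in corpus if term in text.split())
    return math.log(len(corpus) / df, base)

scores = {name: tf("learning", text) * idf("learning", docs.values())
          for name, text in docs.items()}
for name, score in scores.items():
    print(name, round(score, 3))
```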

Example 2: Bigram Probability

  Corpus: "I like deep learning. I like machine learning."
  
  Compute: P("learning" | "deep")
  
  C("deep", "learning") = 1
  C("deep") = 1
  
  P("learning" | "deep") = C("deep", "learning") / C("deep") = 1/1 = 1.0
  
  Compute: P("learning" | "machine")  
  C("machine", "learning") = 1
  C("machine") = 1
  
  P("learning" | "machine") = 1/1 = 1.0
  
  Compute: P("deep" | "like")
  C("like", "deep") = 1
  C("like") = 2
  
  P("deep" | "like") = 1/2 = 0.5
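
The same counts can be obtained programmatically. This sketch lowercases the corpus, splits sentences on periods, and does not add sentence-boundary tokens, which is a simplification compared with a full n-gram model; `bigram_prob` is a name chosen for the illustration.

```python
from collections import Counter

corpus = "I like deep learning. I like machine learning."
sentences = [s.split() for s in corpus.lower().split(".") if s.strip()]

unigrams = Counter(word for sent in sentences for word in sent)
# Bigrams are counted within sentences only, never across a period.
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate C(prev, word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("deep", "learning"))
print(bigram_prob("machine", "learning"))
print(bigram_prob("like", "deep"))
```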

Example 3: Cosine Similarity Between Word Vectors

  word_king   = [0.8, 0.6, 0.2]
  word_queen  = [0.7, 0.7, 0.3]
  word_apple  = [0.1, 0.2, 0.9]
  
  cos(king, queen):
    Dot product = 0.8×0.7 + 0.6×0.7 + 0.2×0.3 = 0.56 + 0.42 + 0.06 = 1.04
    ||king||  = √(0.64 + 0.36 + 0.04) = √1.04 = 1.020
    ||queen|| = √(0.49 + 0.49 + 0.09) = √1.07 = 1.034
    cos = 1.04 / (1.020 × 1.034) = 1.04 / 1.055 = 0.986  → Very similar! ✓
  
  cos(king, apple):
    Dot product = 0.08 + 0.12 + 0.18 = 0.38
    ||apple|| = √(0.01 + 0.04 + 0.81) = √0.86 = 0.927
    cos = 0.38 / (1.020 × 0.927) = 0.38 / 0.946 = 0.402  → Less similar ✓
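
A short check of Example 3 in Python, using the same three toy vectors. The function name `cosine_similarity` is chosen for this sketch, and it assumes neither vector is all zeros.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

word_king = [0.8, 0.6, 0.2]
word_queen = [0.7, 0.7, 0.3]
word_apple = [0.1, 0.2, 0.9]

print(round(cosine_similarity(word_king, word_queen), 3))
print(round(cosine_similarity(word_king, word_apple), 3))
```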

Example 4: Perplexity Calculation

  Test sentence: "the cat sat" (3 words)
  
  Model probabilities:
    P(the) = 0.1
    P(cat | the) = 0.05
    P(sat | cat) = 0.2
  
  P(sentence) = 0.1 × 0.05 × 0.2 = 0.001
  
  Perplexity = P(sentence)^{-1/N} = (0.001)^{-1/3} = (1000)^{1/3} = 10.0
  
  Interpretation: The model is as uncertain as choosing randomly from 10 options.
  
  Better model with P = 0.2 × 0.3 × 0.5 = 0.03:
  PP = (0.03)^{-1/3} = (33.33)^{1/3} = 3.22 → Much better! ✓
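
Example 4 can be verified in code. The sketch computes perplexity through the average log2 probability, which also shows the 2^{H(W)} form of the definition; the function name is an assumption of this illustration.

```python
import math

def perplexity(token_probs):
    """2 ** H, where H is the average negative log2 probability per token."""
    n = len(token_probs)
    cross_entropy = -sum(math.log2(p) for p in token_probs) / n
    return 2 ** cross_entropy

print(round(perplexity([0.1, 0.05, 0.2]), 2))
print(round(perplexity([0.2, 0.3, 0.5]), 2))
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause on longer sentences.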

📝 Exam Tip

In competitive exams, TF-IDF and perplexity calculations are common. Remember: (1) the log base used for IDF varies (usually 2 or 10; check the question), (2) perplexity is the geometric mean of inverse probabilities, (3) lower perplexity = better model.

📊 Visual Diagrams

8.1 NLP Pipeline Architecture

NLP PROCESSING PIPELINE

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│ Raw Text │──▶│ Cleaning │──▶│ Tokenize │──▶│Stop Word │
│          │   │ HTML/URLs│   │          │   │ Removal  │
└──────────┘   └──────────┘   └──────────┘   └────┬─────┘
     ┌────────────────────────────────────────────┘
     ▼
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│Stemming/ │──▶│  BoW /   │──▶│Embedding │──▶│  Model   │
│Lemmatize │   │  TF-IDF  │   │ Word2Vec │   │ Training │
└──────────┘   └──────────┘   └──────────┘   └────┬─────┘
                                                  ▼
                                             ┌──────────┐
                                             │ Predict/ │
                                             │ Generate │
                                             └──────────┘

8.2 Word2Vec: CBOW vs Skip-gram

CBOW (Predict Center)                 Skip-gram (Predict Context)
─────────────────────                 ───────────────────────────
Context Words     Center              Center Word     Context Words
┌─────────┐                           ┌─────────┐     ┌─────────┐
│  "the"  │──┐                        │         │──▶  │  "the"  │
├─────────┤  │   ┌──────────┐         │         │     ├─────────┤
│  "cat"  │──┼──▶│  "sat"   │         │  "sat"  │──▶  │  "cat"  │
├─────────┤  │   └──────────┘         │         │     ├─────────┤
│  "on"   │──┤    Predicted           │         │──▶  │  "on"   │
├─────────┤  │                        │         │     ├─────────┤
│  "the"  │──┘                        │         │──▶  │  "the"  │
└─────────┘                           └─────────┘     └─────────┘
     │                                     │
     ▼                                     ▼
Average vectors                       Individual predictions
then predict                          for each position

8.3 Transformer Self-Attention

SELF-ATTENTION MECHANISM
────────────────────────
Input: "The cat sat on the mat"
For token "cat":

┌──────┐  ┌──────┐  ┌──────┐
│  Q   │  │  K   │  │  V   │    Q, K, V = Linear projections
│(cat) │  │(all) │  │(all) │    of input embeddings
└──┬───┘  └──┬───┘  └──┬───┘
   │         │         │
   └────┬────┘         │
        ▼              │
 ┌──────────────┐      │
 │    Q × Kᵀ    │      │        Attention scores:
 │    ──────    │      │        "The"=0.1   "cat"=0.3
 │     √d_k     │      │        "sat"=0.25  "on"=0.05
 └──────┬───────┘      │        "the"=0.1   "mat"=0.2
        ▼              │
 ┌──────────────┐      │
 │   Softmax    │──────┤        Weighted sum of V vectors
 └──────┬───────┘      │        = Context-aware representation
        │              │          of "cat"
        └──────┬───────┘
               ▼
        ┌──────────────┐
        │  Σ αᵢ × Vᵢ   │
        │   (Output)   │
        └──────────────┘

8.4 RAG Architecture

RETRIEVAL-AUGMENTED GENERATION (RAG)
════════════════════════════════════

┌──────────────┐         ┌───────────────────────┐
│  User Query  │         │    Knowledge Base     │
│              │         │   (Documents, PDFs)   │
└──────┬───────┘         └───────────┬───────────┘
       ▼                             ▼
┌──────────────┐         ┌───────────────────────┐
│    Query     │         │     Chunk + Embed     │
│  Embedding   │         │ (sentence-transformers│
└──────┬───────┘         │      → vectors)       │
       │                 └───────────┬───────────┘
       │                             ▼
       │                 ┌───────────────────────┐
       │   Similarity    │    Vector Database    │
       ├────────────────▶│   (FAISS, Pinecone,   │
       │     Search      │  ChromaDB, Weaviate)  │
       │                 └───────────┬───────────┘
       │                             │ Top-K relevant chunks
       ▼                             ▼
┌────────────────────────────────────────────────┐
│              PROMPT CONSTRUCTION               │
│  "Given this context: {retrieved_chunks}       │
│   Answer: {user_query}"                        │
└───────────────────────┬────────────────────────┘
                        ▼
               ┌──────────────────┐
               │       LLM        │
               │  (GPT-4, Llama,  │
               │     Mistral)     │
               └────────┬─────────┘
                        ▼
               ┌──────────────────┐
               │ Grounded Answer  │
               │ (with citations) │
               └──────────────────┘

🔄 Flowcharts

9.1 Sentiment Analysis Pipeline

┌─────────────────┐
│  Collect Data   │  Reviews, tweets, comments
│   (Raw Text)    │
└────────┬────────┘
         ▼
┌─────────────────┐
│   Preprocess    │  Lowercase, remove HTML, handle emojis
│                 │  Tokenize, remove stopwords
└────────┬────────┘
         ▼
┌─────────────────┐
│   Label Data    │  Positive / Negative / Neutral
│ (if supervised) │  OR use lexicons (VADER, SentiWordNet)
└────────┬────────┘
         ▼
┌─────────────────────────────────────┐
│         Feature Extraction          │
│ ┌─────────┐ ┌───────┐ ┌─────────┐   │
│ │   BoW   │ │TF-IDF │ │  BERT   │   │
│ │         │ │       │ │Embedding│   │
│ └─────────┘ └───────┘ └─────────┘   │
└────────┬────────────────────────────┘
         ▼
┌─────────────────────────────────────┐
│          Train Classifier           │
│ ┌─────────┐ ┌───────┐ ┌─────────┐   │
│ │  Naive  │ │  SVM  │ │  BERT   │   │
│ │  Bayes  │ │       │ │Fine-tune│   │
│ └─────────┘ └───────┘ └─────────┘   │
└────────┬────────────────────────────┘
         ▼
┌─────────────────┐
│    Evaluate     │  Accuracy, F1, Confusion Matrix
│   (Test Set)    │
└────────┬────────┘
         ▼
┌─────────────────┐
│   Deploy API    │  Flask / FastAPI endpoint
│    + Monitor    │  Track drift, retrain
└─────────────────┘

9.2 NER Decision Flowchart

            ┌─────────────────────┐
            │ Input: "Sundar      │
            │ Pichai leads Google │
            │ from California"    │
            └──────────┬──────────┘
                       ▼
            ┌─────────────────────┐
            │ Tokenize            │
            │ ["Sundar", "Pichai",│
            │  "leads", ...]      │
            └──────────┬──────────┘
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Rule-based? │ │ CRF/BiLSTM? │ │ Transformer?│
│(Regex, Dict)│ │  (Sequence  │ │ (BERT-NER)  │
│             │ │  Labeling)  │ │             │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
       └───────────────┼───────────────┘
                       ▼
            ┌─────────────────────┐
            │ BIO Tags            │
            │ Sundar     → B-PER  │
            │ Pichai     → I-PER  │
            │ leads      → O      │
            │ Google     → B-ORG  │
            │ from       → O      │
            │ California → B-LOC  │
            └─────────────────────┘

9.3 LLM Fine-tuning Decision Tree

              ┌────────────────────────┐
              │   Need to adapt LLM?   │
              └───────────┬────────────┘
                          ▼
              ┌────────────────────────┐
       ┌──────│  How much data do you  │──────┐
       │      │         have?          │      │
       │      └────────────────────────┘      │
       ▼                                      ▼
┌─────────────┐                        ┌─────────────┐
│Few examples │                        │    Lots     │
│   (< 100)   │                        │  (1K-100K)  │
└──────┬──────┘                        └──────┬──────┘
       ▼                                      ▼
┌─────────────┐                 ┌──────┌─────────────┐──────┐
│   Prompt    │                 │      │  Full GPU   │      │
│Engineering /│                 │      │ available?  │      │
│     ICL     │                 │      └─────────────┘      │
└─────────────┘                 ▼                           ▼
                         ┌─────────────┐             ┌─────────────┐
                         │    Full     │             │   LoRA /    │
                         │  Fine-tune  │             │    QLoRA    │
                         │(All params) │             │   (4-bit)   │
                         └─────────────┘             └─────────────┘

🐍 Python Implementation

10.1 Text Preprocessing with NLTK & spaCy

import nltk
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# Download required NLTK data
# 'punkt_tab' is required by recent NLTK releases; older releases only need 'punkt'
nltk.download(['punkt', 'punkt_tab', 'stopwords', 'wordnet', 'averaged_perceptron_tagger'])

class TextPreprocessor:
    """Complete text preprocessing pipeline."""
    
    def __init__(self, language='english'):
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words(language))
    
    def clean_text(self, text):
        """Remove HTML, URLs, special characters."""
        text = re.sub(r'<[^>]+>', '', text)          # HTML tags
        text = re.sub(r'http\S+|www\.\S+', '', text) # URLs
        text = re.sub(r'[^a-zA-Z\s]', '', text)      # Non-alpha
        text = text.lower().strip()
        return text
    
    def tokenize(self, text):
        """Word tokenization."""
        return word_tokenize(text)
    
    def remove_stopwords(self, tokens):
        """Remove common stop words."""
        return [t for t in tokens if t not in self.stop_words]
    
    def stem(self, tokens):
        """Apply Porter stemming."""
        return [self.stemmer.stem(t) for t in tokens]
    
    def lemmatize(self, tokens):
        """Apply WordNet lemmatization."""
        return [self.lemmatizer.lemmatize(t) for t in tokens]
    
    def preprocess(self, text, use_lemma=True):
        """Full preprocessing pipeline."""
        text = self.clean_text(text)
        tokens = self.tokenize(text)
        tokens = self.remove_stopwords(tokens)
        tokens = self.lemmatize(tokens) if use_lemma else self.stem(tokens)
        return tokens

# Demo
pp = TextPreprocessor()
sample = "The quick brown foxes are jumping over the lazy dogs! Visit https://nlp.com"
print("Original:", sample)
print("Processed:", pp.preprocess(sample))
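# Processed: ['quick', 'brown', 'fox', 'jumping', 'lazy', 'dog', 'visit']
# ('jumping' is unchanged because WordNetLemmatizer treats words as nouns by default;
#  'visit' survives because only the URL itself is stripped)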

10.2 spaCy NLP Pipeline

import spacy

# Load English model
nlp = spacy.load("en_core_web_sm")

text = "Google CEO Sundar Pichai announced a $10 billion investment in India in March 2025."
doc = nlp(text)

# Tokenization
print("=== Tokens ===")
for token in doc:
    print(f"  {token.text:15s} | POS: {token.pos_:6s} | Lemma: {token.lemma_:15s} | Stop: {token.is_stop}")

# Named Entity Recognition
print("\n=== Named Entities ===")
for ent in doc.ents:
    print(f"  {ent.text:25s} | Label: {ent.label_:10s} | Explanation: {spacy.explain(ent.label_)}")

# Dependency Parsing
print("\n=== Dependencies ===")
for token in doc:
    print(f"  {token.text:15s} --{token.dep_:12s}--> {token.head.text}")

# Typical output (entities can differ between model versions):
# === Named Entities ===
#   Google                    | Label: ORG        | Explanation: Companies, agencies...
#   Sundar Pichai             | Label: PERSON     | Explanation: People, including fictional
#   $10 billion               | Label: MONEY      | Explanation: Monetary values
#   India                     | Label: GPE        | Explanation: Countries, cities, states
#   March 2025                | Label: DATE       | Explanation: Absolute or relative dates

10.3 TF-IDF from Scratch

import numpy as np
from collections import Counter
import math

class TFIDFVectorizer:
    """TF-IDF implementation from scratch."""
    
    def __init__(self):
        self.vocabulary = {}
        self.idf = {}
    
    def fit(self, documents):
        """Build vocabulary and compute IDF."""
        # Build vocabulary
        all_words = set()
        for doc in documents:
            all_words.update(doc.lower().split())
        self.vocabulary = {w: i for i, w in enumerate(sorted(all_words))}
        
        # Compute IDF
        N = len(documents)
        for word in self.vocabulary:
            df = sum(1 for doc in documents if word in doc.lower().split())
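            # Natural log here; the worked example earlier uses log base 2,
            # so absolute scores differ by a constant factor.
            self.idf[word] = math.log(N / df) if df > 0 else 0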
        
        return self
    
    def transform(self, documents):
        """Compute TF-IDF matrix."""
        matrix = np.zeros((len(documents), len(self.vocabulary)))
        
        for i, doc in enumerate(documents):
            words = doc.lower().split()
            word_counts = Counter(words)
            total_words = len(words)
            
            for word, count in word_counts.items():
                if word in self.vocabulary:
                    tf = count / total_words
                    matrix[i][self.vocabulary[word]] = tf * self.idf[word]
        
        return matrix
    
    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)

# Demo
docs = [
    "machine learning is great for prediction",
    "deep learning uses neural networks",
    "machine learning and deep learning are related"
]

vectorizer = TFIDFVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

print("Vocabulary:", vectorizer.vocabulary)
print("\nTF-IDF Matrix shape:", tfidf_matrix.shape)
print("\nTop words per document:")
for i, doc in enumerate(docs):
    scores = tfidf_matrix[i]
    top_indices = scores.argsort()[-3:][::-1]
    words = list(vectorizer.vocabulary.keys())
    top_words = [(words[j], scores[j]) for j in top_indices if scores[j] > 0]
    print(f"  D{i+1}: {top_words}")

10.4 Word2Vec Training with Gensim

from gensim.models import Word2Vec

# Sample corpus (in practice, use a large corpus)
sentences = [
    ["machine", "learning", "is", "a", "subset", "of", "artificial", "intelligence"],
    ["deep", "learning", "uses", "neural", "networks"],
    ["natural", "language", "processing", "deals", "with", "text"],
    ["word", "embeddings", "capture", "semantic", "meaning"],
    ["transformers", "revolutionized", "natural", "language", "processing"],
    ["bert", "is", "a", "bidirectional", "transformer", "model"],
    ["gpt", "is", "an", "autoregressive", "language", "model"],
    ["attention", "mechanism", "is", "key", "to", "transformers"],
    ["recurrent", "neural", "networks", "process", "sequences"],
    ["convolutional", "networks", "work", "well", "for", "images"],
]

# Train Word2Vec (Skip-gram)
model_sg = Word2Vec(
    sentences=sentences,
    vector_size=100,     # Embedding dimension
    window=5,            # Context window size
    min_count=1,         # Minimum word frequency
    sg=1,                # 1 = Skip-gram, 0 = CBOW
    negative=5,          # Negative samples
    epochs=100,          # Training epochs
    workers=4            # Parallel threads
)

# Train Word2Vec (CBOW)
model_cbow = Word2Vec(
    sentences=sentences,
    vector_size=100, window=5, min_count=1,
    sg=0, epochs=100, workers=4
)

# Find similar words
print("Similar to 'learning' (Skip-gram):")
for word, score in model_sg.wv.most_similar("learning", topn=5):
    print(f"  {word}: {score:.4f}")

# Inspect a raw embedding (first 5 of the 100 dimensions)
# Note: analogy-style vector arithmetic needs a large corpus for meaningful results
print("\nVector for 'learning':", model_sg.wv["learning"][:5], "...")

# Save and load
model_sg.save("word2vec_skipgram.model")
# loaded_model = Word2Vec.load("word2vec_skipgram.model")
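
The listing above notes that vector arithmetic only becomes meaningful with a large corpus. As a rough illustration of what that arithmetic does, the sketch below uses hand-made three-dimensional toy vectors (hypothetical values, not trained embeddings) and a hypothetical helper named analogy that finds the word closest to vec(a) - vec(b) + vec(c) by cosine similarity.

```python
import math

# Hand-made 3-d toy vectors (hypothetical values, NOT trained embeddings).
# The dimensions loosely stand for: royalty, maleness, femaleness.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman", toy_vectors))
```

With a model trained on a sufficiently large corpus, the comparable Gensim call would be model_sg.wv.most_similar(positive=["king", "woman"], negative=["man"]); on the ten-sentence sample corpus those words are not even in the vocabulary.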

10.5 Sentiment Analysis with VADER & TextBlob

from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import nltk
nltk.download('vader_lexicon')

# VADER Sentiment Analysis (rule-based, great for social media)
sia = SentimentIntensityAnalyzer()

texts = [
    "This product is absolutely amazing! Best purchase ever! 😍",
    "Terrible customer service. Never buying again. 😤",
    "The movie was okay, nothing special.",
    "AI4Bharat's IndicNLP is revolutionizing Indian language processing!",
    "The new UPI interface is confusing and slow.",
]

print("=== VADER Sentiment Analysis ===")
for text in texts:
    scores = sia.polarity_scores(text)
    sentiment = "Positive" if scores['compound'] > 0.05 else "Negative" if scores['compound'] < -0.05 else "Neutral"
    print(f"\n  Text: {text[:60]}...")
    print(f"  Scores: pos={scores['pos']:.3f} neu={scores['neu']:.3f} neg={scores['neg']:.3f}")
    print(f"  Compound: {scores['compound']:.3f} → {sentiment}")

# TextBlob Sentiment Analysis
print("\n\n=== TextBlob Sentiment Analysis ===")
for text in texts:
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity      # -1 to 1
    subjectivity = blob.sentiment.subjectivity  # 0 to 1
    print(f"\n  Text: {text[:60]}...")
    print(f"  Polarity: {polarity:.3f}  Subjectivity: {subjectivity:.3f}")

💻 Code Challenge

Modify the TextPreprocessor class to handle Hindi text: add Devanagari tokenization, Hindi stop words (से, का, के, की, में, है, और, को), and integration with the IndicNLP library. Test with sample Hindi movie reviews.
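
One possible starting point for the challenge is sketched below. It is a standalone sketch, not a change to the TextPreprocessor class itself: the names TOKEN_PATTERN, tokenize_hindi and remove_stop_words are hypothetical, the stop word list is just the eight words given above, and IndicNLP integration is left out. Tokenization here is a simple regular expression over the Devanagari Unicode block, which is an assumption that will not cover every edge case (for example zero-width joiners).

```python
import re

# Stop words taken from the challenge statement; a real list would be much longer.
HINDI_STOP_WORDS = {"से", "का", "के", "की", "में", "है", "और", "को"}

# Devanagari block is U+0900 to U+097F. The danda (U+0964) and double danda
# (U+0965) are sentence punctuation, so they are excluded from word tokens.
# Latin words and digits are kept so that code-mixed reviews still tokenize.
TOKEN_PATTERN = re.compile(r"[\u0900-\u0963\u0966-\u097F]+|[A-Za-z0-9]+")

def tokenize_hindi(text):
    """Split text into Devanagari and Latin word tokens, dropping punctuation."""
    return TOKEN_PATTERN.findall(text)

def remove_stop_words(tokens):
    """Filter out tokens that appear in the Hindi stop word list."""
    return [t for t in tokens if t not in HINDI_STOP_WORDS]

review = "यह फिल्म बहुत अच्छी है।"
print(remove_stop_words(tokenize_hindi(review)))

mixed_review = "Movie की story अच्छी है"
print(remove_stop_words(tokenize_hindi(mixed_review)))
```

The same two functions could later be wrapped as methods of the preprocessing class, with the regular expression swapped for an IndicNLP tokenizer once that library is installed.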

🔷 TensorFlow Implementation

11.1 BERT Text Classification

import tensorflow as tf
# Note: the TF* model classes require a transformers release that still ships TensorFlow support
from transformers import BertTokenizer, TFBertForSequenceClassification
import numpy as np

# ========================================
# BERT Fine-tuning for Sentiment Classification
# ========================================

# Load pre-trained BERT tokenizer and model
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3  # Positive, Negative, Neutral
)

# Sample training data
train_texts = [
    "This movie is absolutely fantastic and thrilling",
    "Terrible acting and boring storyline",
    "The film was decent, nothing remarkable",
    "Best performance I've ever seen on screen",
    "Complete waste of time and money",
    "It was an average movie with good songs",
]
train_labels = [2, 0, 1, 2, 0, 1]  # 0=Neg, 1=Neutral, 2=Pos

# Tokenize inputs
def encode_texts(texts, max_length=128):
    return tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='tf'
    )

train_encodings = encode_texts(train_texts)

# Create TF dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    tf.constant(train_labels)
)).shuffle(100).batch(2)

# Compile model
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

# Train
print("Training BERT classifier...")
history = model.fit(train_dataset, epochs=3, verbose=1)

# Predict on new text
test_texts = ["This is a wonderful experience!", "I hated every minute of it"]
test_encodings = encode_texts(test_texts)
predictions = model.predict(dict(test_encodings))
predicted_labels = tf.argmax(predictions.logits, axis=1).numpy()

label_map = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}
for text, label in zip(test_texts, predicted_labels):
    print(f"  '{text}' → {label_map[label]}")

11.2 Text Summarization with T5

from transformers import T5Tokenizer, TFT5ForConditionalGeneration

# Load T5 model for summarization
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = TFT5ForConditionalGeneration.from_pretrained(model_name)

def summarize(text, max_input_length=512, max_output_length=150):
    """Generate abstractive summary using T5."""
    # T5 expects "summarize: " prefix
    input_text = "summarize: " + text
    
    # Tokenize
    inputs = tokenizer(
        input_text,
        max_length=max_input_length,
        truncation=True,
        return_tensors="tf"
    )
    
    # Generate summary
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=max_output_length,
        num_beams=4,           # Beam search
        length_penalty=2.0,     # Favor longer summaries
        early_stopping=True,
        no_repeat_ngram_size=3  # Avoid repetition
    )
    
    # Decode
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Test with a long article
article = """
India's digital transformation has been remarkable. The Unified Payments Interface (UPI)
processed over 13 billion transactions worth more than Rs 20 lakh crore in a single month
in 2024. This achievement makes India the world leader in real-time digital payments.
The success of UPI can be attributed to several factors: government support through the
Digital India initiative, widespread smartphone adoption, affordable internet connectivity
through Jio's entry, and the collaborative approach of NPCI in building an open platform.
The technology has now been adopted by countries like Singapore, UAE, and France.
India is also making significant strides in AI and NLP. Organizations like AI4Bharat
are developing language models for 22 Indian languages, enabling millions of non-English
speakers to access digital services in their native languages.
"""

summary = summarize(article)
print("=== Original Article ===")
print(article[:200], "...")
print("\n=== T5 Summary ===")
print(summary)

11.3 LSTM Text Generator

import tensorflow as tf
import numpy as np

class TextGenerator:
    """Simple LSTM-based text generator."""
    
    def __init__(self, corpus, seq_length=40):
        self.seq_length = seq_length
        self.corpus = corpus.lower()
        
        # Create character vocabulary
        self.chars = sorted(set(self.corpus))
        self.char_to_idx = {c: i for i, c in enumerate(self.chars)}
        self.idx_to_char = {i: c for c, i in self.char_to_idx.items()}
        self.vocab_size = len(self.chars)
        
        print(f"Corpus length: {len(self.corpus)} chars")
        print(f"Vocabulary size: {self.vocab_size} unique chars")
    
    def prepare_data(self):
        """Create input-output sequences."""
        X, y = [], []
        for i in range(len(self.corpus) - self.seq_length):
            seq_in = self.corpus[i:i + self.seq_length]
            seq_out = self.corpus[i + self.seq_length]
            X.append([self.char_to_idx[c] for c in seq_in])
            y.append(self.char_to_idx[seq_out])
        
        X = np.array(X) / self.vocab_size  # Normalize
        X = X.reshape(len(X), self.seq_length, 1)
        y = tf.keras.utils.to_categorical(y, self.vocab_size)
        return X, y
    
    def build_model(self):
        """Build LSTM model."""
        self.model = tf.keras.Sequential([
            tf.keras.layers.LSTM(256, input_shape=(self.seq_length, 1), return_sequences=True),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.LSTM(256),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(self.vocab_size, activation='softmax')
        ])
        self.model.compile(loss='categorical_crossentropy', optimizer='adam')
        self.model.summary()
    
    def generate(self, seed_text, length=200, temperature=0.8):
        """Generate text from seed."""
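        generated = seed_text.lower()
        # Map the seed to indices, skipping characters the model never saw in training
        pattern = [self.char_to_idx[c] for c in generated[-self.seq_length:] if c in self.char_to_idx]
        # Left-pad short seeds so the input always matches the trained sequence length
        pattern = [0] * (self.seq_length - len(pattern)) + pattern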
        
        for _ in range(length):
            x = np.array(pattern) / self.vocab_size
            x = x.reshape(1, len(pattern), 1)
            
            pred = self.model.predict(x, verbose=0)[0]
            
            # Temperature sampling
            pred = np.log(pred + 1e-8) / temperature
            pred = np.exp(pred) / np.sum(np.exp(pred))
            
            idx = np.random.choice(len(pred), p=pred)
            generated += self.idx_to_char[idx]
            pattern.append(idx)
            pattern = pattern[1:]
        
        return generated

# Usage:
# gen = TextGenerator(large_text_corpus)
# X, y = gen.prepare_data()
# gen.build_model()
# gen.model.fit(X, y, batch_size=128, epochs=50)
# print(gen.generate("natural language", length=300))

🏭 Industry Alert

In production, don't train BERT from scratch — fine-tune a pre-trained model. For Indian languages, use IndicBERT (AI4Bharat) or MuRIL (Google) as base models. These are pre-trained on multilingual Indian corpora and significantly outperform vanilla mBERT on Indian language tasks.

🔶 Scikit-Learn Implementation

12.1 Text Classification Pipeline (Spam Detection)

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Sample SMS dataset (in practice, use full SMS Spam Collection dataset)
messages = [
    ("Free entry to win a brand new car! Text WIN to 80085", "spam"),
    ("Hey, are you free for dinner tonight?", "ham"),
    ("URGENT! You have won £1000. Call now!", "spam"),
    ("Meeting rescheduled to 3pm tomorrow", "ham"),
    ("Congratulations! Claim your prize money NOW!!!", "spam"),
    ("Can you pick up groceries on the way home?", "ham"),
    ("You've been selected for a free iPhone! Click here", "spam"),
    ("Don't forget mom's birthday next week", "ham"),
    ("WINNER!! You won a vacation trip! Reply YES", "spam"),
    ("Project deadline extended to Friday", "ham"),
    ("Get cheap loans at lowest interest rates!", "spam"),
    ("See you at the park at 5pm", "ham"),
] * 10  # Repeat to enlarge the toy set; duplicates leak across train/test, so the scores below are optimistic

texts, labels = zip(*messages)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Method 1: Simple Pipeline (TF-IDF + Naive Bayes)
pipeline_nb = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=5000,
        ngram_range=(1, 2),    # Unigrams and bigrams
        stop_words='english'
    )),
    ('clf', MultinomialNB(alpha=0.1))  # Additive (Lidstone) smoothing; alpha=1.0 would be Laplace
])

# Method 2: TF-IDF + SVM
pipeline_svm = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('clf', LinearSVC(C=1.0, max_iter=1000))
])

# Train and evaluate
for name, pipeline in [("Naive Bayes", pipeline_nb), ("SVM", pipeline_svm)]:
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    
    print(f"\n=== {name} ===")
    print(classification_report(y_test, y_pred))
    
    # Cross-validation
    scores = cross_val_score(pipeline, texts, labels, cv=5, scoring='f1_macro')
    print(f"  Cross-val F1: {scores.mean():.3f} ± {scores.std():.3f}")

# Predict on new messages
new_messages = [
    "Win a free laptop! Click now!",
    "Can we reschedule our meeting to Monday?"
]
predictions = pipeline_svm.predict(new_messages)
for msg, pred in zip(new_messages, predictions):
    print(f"  '{msg[:50]}...' → {pred}")

12.2 Topic Modeling with LDA

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents about different topics
documents = [
    "Machine learning algorithms improve with more data and computing power",
    "Deep neural networks have revolutionized computer vision and NLP",
    "Indian cricket team won the World Cup with brilliant batting display",
    "Virat Kohli scored a century in the test match against Australia",
    "Stock market crashed due to global economic concerns and inflation",
    "Sensex and Nifty recovered after RBI announced rate cut policy",
    "New electric vehicles from Tata Motors are gaining market share",
    "Tesla's self-driving technology uses deep learning and sensors",
    "Artificial intelligence is transforming healthcare diagnostics",
    "ISRO launched new satellite for weather prediction using ML models",
]

# Vectorize
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(documents)

# Fit LDA
n_topics = 3
lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    max_iter=20,
    learning_method='online'
)
lda.fit(X)

# Display topics
feature_names = vectorizer.get_feature_names_out()
print("=== Discovered Topics ===")
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-8:][::-1]]
    print(f"  Topic {topic_idx + 1}: {', '.join(top_words)}")

# Assign topics to documents
doc_topics = lda.transform(X)
for i, doc in enumerate(documents[:5]):
    dominant_topic = doc_topics[i].argmax()
    print(f"  Doc {i+1} → Topic {dominant_topic + 1} ({doc[:50]}...)")

🇮🇳 Indian Case Studies

Case Study 1: AI4Bharat IndicNLP — Democratizing Indian Language AI

Challenge: India has 22 scheduled languages, 121 languages with 10,000 or more speakers (Census 2011), and 12+ scripts. Most NLP models are English-centric, leaving 1 billion+ Indians underserved.

Solution: AI4Bharat (IIT Madras) developed:

IndicBERT: Multilingual BERT-family model (ALBERT architecture) pre-trained on 12 major Indian languages
IndicTrans2: State-of-the-art translation model covering all 22 scheduled languages, trained on the Bharat Parallel Corpus Collection (roughly 230M sentence pairs)
IndicNLP Library: Tokenization, transliteration, and embeddings for Indian languages
Bhasha-Abhijnaanam: Language identification for 22 Indian languages

Impact: Used by Google Translate for Indian language improvements, NPTEL for lecture translation, and multiple government portals for multilingual access.

Technical Innovation: Handled challenges like code-mixing (Hindi-English: "Yeh movie bahut amazing thi!"), agglutinative morphology (Tamil, Kannada), and free word order (Hindi, Sanskrit).
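
To make the code-mixing challenge concrete, the sketch below labels each whitespace-separated token by script using Unicode ranges. The function names token_script and script_profile are hypothetical and this is not how AI4Bharat's tools work internally; it is only a first-pass heuristic. Note its main limitation: Romanized Hindi such as "bahut" is written in Latin script, so a script check alone cannot separate it from English, and word-level language identification is needed for that.

```python
from collections import Counter

def token_script(token):
    """Label a token by the Unicode range of its characters."""
    has_devanagari = any("\u0900" <= ch <= "\u097F" for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_latin:
        return "latin"
    return "other"  # ASCII digits, punctuation, emoji, other scripts

def script_profile(text):
    """Count how many tokens of a sentence fall into each script label."""
    return dict(Counter(token_script(tok) for tok in text.split()))

# Devanagari sentence with one English word embedded
print(script_profile("मस्त picture है भाई"))

# Romanized Hindi mixed with English: every token looks Latin to this heuristic
print(script_profile("Yeh movie bahut amazing thi!"))
```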

Case Study 2: Flipkart — Product Review Analysis at Scale

Problem: Processing 10M+ reviews across 150M products in multiple Indian languages. Reviews often contain code-mixed text, transliterated Hindi/Tamil, and regional expressions.

Solution:

Custom NER model to extract product attributes ("battery life", "camera quality") from reviews
Aspect-based sentiment analysis: positive about price but negative about delivery
Multilingual sentiment model handling Hindi, Tamil, Telugu, Bengali transliterations
Fake review detection using linguistic patterns and behavioral signals

Results: 40% improvement in product recommendation quality, 25% reduction in customer complaints about wrong products.
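
The aspect-based sentiment idea in this case study (positive about price, negative about delivery) can be illustrated with a very small lexicon-based sketch. Everything here is an assumption for illustration: the aspect terms, the opinion word lists and the function name aspect_sentiment are hypothetical, and a production system would use a trained model rather than keyword matching. The sketch splits a review into clauses and attaches each clause's polarity to the aspects mentioned in it.

```python
import re

# Tiny hypothetical lexicons; real systems learn these from labelled data.
ASPECT_TERMS = {
    "price": {"price", "cost"},
    "delivery": {"delivery", "shipping"},
    "battery": {"battery"},
    "camera": {"camera"},
}
POSITIVE = {"good", "great", "excellent", "amazing", "fast"}
NEGATIVE = {"bad", "poor", "slow", "late", "terrible"}

def aspect_sentiment(review):
    """Return {aspect: 'positive' | 'negative'} for aspects with a clear polarity."""
    results = {}
    # Split on contrast markers and punctuation so each clause is scored on its own
    clauses = re.split(r"\bbut\b|\bhowever\b|[.,;!?]", review.lower())
    for clause in clauses:
        words = set(re.findall(r"[a-z]+", clause))
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        if score == 0:
            continue  # no opinion words, or they cancel out
        for aspect, terms in ASPECT_TERMS.items():
            if words & terms:
                results[aspect] = "positive" if score > 0 else "negative"
    return results

print(aspect_sentiment("Great price, but the delivery was slow and late."))
```

Clause-level splitting is the key design choice: scoring the whole review at once would let the positive and negative words cancel and hide the per-aspect signal.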

Case Study 3: Aadhaar — Multilingual Document Processing

Challenge: Processing identity documents (birth certificates, bank statements) in 22+ languages for Aadhaar verification across India.

Solution: OCR + NER pipeline handling Devanagari, Tamil, Bengali, Telugu, Gujarati, and other scripts. Achieved 97% accuracy in name extraction across 11 languages using a fine-tuned mBERT model.

Case Study 4: Indian Judiciary — SUPACE (e-Courts)

Problem: 44 million pending cases in Indian courts. Judges need to review vast amounts of legal text to find relevant precedents.

Solution: NLP system for legal document summarization, case similarity finding, and relevant statute identification. Handles legal text in English, Hindi, and regional languages.

🇮🇳 India Spotlight

India-First LLMs: Krutrim (Ola's LLM supporting 22 Indian languages), Sarvam AI (open-source Indian LLMs), and Hanooman (SML India with the IIT Bombay-led BharatGPT group) represent India's push for sovereign AI. These address unique challenges: Hindi-English code-mixing in 60% of urban social media posts, and supporting both Devanagari and Roman scripts for the same language.

🌍 Global Case Studies

Case Study 1: Google — BERT & Search Revolution

Impact: In October 2019, Google integrated BERT into search, calling it "the biggest leap in 5 years." BERT helps understand the nuance of search queries — "can you get medicine for someone pharmacy" now correctly interprets "for someone" as picking up a prescription for another person.

Scale: Google reports that BERT is now used on almost every English search query (out of roughly 3.5 billion searches per day). Google also developed MUM (Multitask Unified Model), which it describes as 1000× more powerful than BERT and which is trained across 75 languages.

Case Study 2: OpenAI — From GPT to ChatGPT

Evolution: GPT-1 (117M params, 2018) → GPT-2 (1.5B, 2019) → GPT-3 (175B, 2020) → GPT-4 (rumored 1.7T MoE, 2023) → GPT-4o (2024).

Innovation: Reinforcement Learning from Human Feedback (RLHF) transformed GPT-3-class text predictors into ChatGPT (built on GPT-3.5), an instruction-following assistant. The key was combining pre-training (massive text data) with alignment (human preferences).

Case Study 3: Netflix — Personalized Content Description

NLP powers Netflix's content recommendation beyond just collaborative filtering. NLP analyzes plot summaries, reviews, subtitles, and social media chatter. The system generates personalized show descriptions — the same show may have different thumbnails and descriptions for different users based on their viewing history.

Case Study 4: Amazon Alexa — Conversational AI at Scale

Alexa processes 50+ billion voice interactions yearly across 100+ million devices. The NLP pipeline: ASR (speech→text) → NLU (intent+entities) → Dialog Management → NLG (response text) → TTS (text→speech). Handles 8+ languages with accent variations.
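
The NLU stage of such a pipeline (intent plus entities) can be sketched with a few regular expressions. This is a toy illustration only and says nothing about how Alexa is actually built: the intent names, patterns and the function name understand are all hypothetical, and real assistants use trained classifiers and sequence labellers for this step.

```python
import re

# Hypothetical intents, checked in order; the first matching pattern wins.
INTENT_PATTERNS = {
    "play_music": re.compile(r"\bplay\b"),
    "set_alarm": re.compile(r"\b(alarm|wake me)\b"),
    "get_weather": re.compile(r"\b(weather|forecast)\b"),
}
TIME_PATTERN = re.compile(r"\b(\d{1,2})(?::(\d{2}))?\s*(am|pm)\b")
CITY_PATTERN = re.compile(r"\bin ([a-z]+)\b")

def understand(utterance):
    """Map transcribed text (the ASR output) to an intent and a dict of entities."""
    text = utterance.lower()
    intent = next(
        (name for name, pattern in INTENT_PATTERNS.items() if pattern.search(text)),
        "unknown",
    )
    entities = {}
    time_match = TIME_PATTERN.search(text)
    if time_match:
        entities["time"] = time_match.group(0)
    city_match = CITY_PATTERN.search(text)
    if intent == "get_weather" and city_match:
        entities["city"] = city_match.group(1)
    return {"intent": intent, "entities": entities}

print(understand("Set an alarm for 6:30 am"))
print(understand("What's the weather in Mumbai?"))
```

The structured result is what a dialog manager would consume next to decide on an action and a response.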

Case Study 5: Tesla — NLP for Autonomous Driving Documentation

Tesla uses NLP to analyze driving incident reports, owner manual queries, and regulatory documents across 40+ markets. NER extracts location, vehicle model, weather conditions from crash reports to improve Autopilot safety algorithms.

🚀 Startup Applications

Startup	NLP Application	Technology	Impact
Sarvam AI (India)	Indian language LLMs	Custom transformer, IndicNLP	Open-source models for 10+ Indian languages
Yellow.ai (India)	Enterprise chatbots	Multi-LLM orchestration	Serves 1000+ enterprises, 35+ languages
Haptik (India)	Conversational AI	Intent classification, NER	Acquired by Jio, 100M+ users
Grammarly (Global)	Writing assistant	BERT + custom models	30M daily active users
Jasper AI (Global)	Content generation	GPT API + fine-tuning	$1.5B valuation for marketing AI
Hugging Face (Global)	NLP model hub	Transformers library	500K+ models, 100K+ datasets
Cohere (Global)	Enterprise NLP API	Custom LLMs, RAG	Multilingual embeddings, 100+ languages
Reverie Language Tech (India)	Indian language platform	NMT, TTS, STT	Powers UMANG, DigiLocker in 12 languages

🎯 Career Path

NLP Startup Roles in India (2025): Prompt Engineer (₹8–20 LPA), NLP Data Scientist (₹12–35 LPA), LLM Fine-tuning Engineer (₹18–50 LPA), Applied ML Scientist (₹25–80 LPA). Hot skills: RAG, LoRA/QLoRA, multilingual models, vector databases, evaluation frameworks.

🏛️ Government Applications

16.1 Bhashini (National Language Translation Mission)

India's ambitious initiative to break language barriers in digital services. Bhashini provides:

Real-time translation across 22 scheduled languages
Speech-to-speech translation for non-literate users
API access for government apps and services
Crowdsourced data collection via the Bhashini app

16.2 e-Courts Project (India)

NLP-powered legal document processing: automatic case summarization, judgment prediction research, and multilingual legal text analysis helping manage India's judicial backlog of 44M+ cases.

16.3 MyGov Chatbot

India's citizen engagement chatbot handles queries about government schemes (PM-KISAN, Ayushman Bharat, MGNREGA) in Hindi and English, processing 10M+ queries monthly.

16.4 EU AI Act & NLP Regulation

The EU classifies emotion recognition and biometric categorization NLP systems as "high-risk AI," requiring transparency, human oversight, and non-discrimination testing — setting global regulatory precedent.

16.5 US Intelligence Community

DARPA's KAIROS program uses NLP to extract complex events from multilingual text, building knowledge graphs from news and intelligence reports.

🏭 Industry Applications

Industry	NLP Application	Business Impact
Healthcare	Clinical note extraction, medical chatbots, drug interaction NER	30% faster diagnosis, reduced documentation time
Finance	Sentiment-driven trading, compliance monitoring, fraud detection	News sentiment predicts stock moves with 58% accuracy
Legal	Contract analysis, e-discovery, case law search	90% reduction in document review time
E-commerce	Product search, review mining, chatbot support	25% increase in search relevance
Education	Essay grading, question generation, tutoring	Personalized learning at scale
Media	Content recommendation, headline generation, fake news detection	40% increase in engagement
Manufacturing	Maintenance report analysis, safety document mining	Predictive maintenance from text logs
Agriculture	Farmer helpline chatbots (Kisan Call Centre)	24/7 crop advisory in regional languages

🏭 Industry Alert

RAG is the enterprise standard (2025): Over 80% of enterprise LLM deployments use Retrieval-Augmented Generation rather than pure generation. This reduces hallucinations, enables domain-specific answers, and provides auditable source citations — critical for regulated industries like finance and healthcare.

🛠️ Mini Projects

🎯 Mini Project 1: Hindi Sentiment Analyzer

Objective: Build a sentiment classifier for Hindi movie reviews using IndicBERT.

Dataset

Use IIIT-H Hindi Sentiment Dataset or scrape reviews from IMDB Hindi / BookMyShow.

Implementation

# Mini Project 1: Hindi Sentiment Analyzer
# ==========================================

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import Dataset
import pandas as pd
import numpy as np

# --- Step 1: Prepare Hindi Dataset ---
hindi_reviews = {
    'text': [
        "यह फिल्म बहुत अच्छी है। कहानी दिल को छू लेती है।",
        "बेकार फिल्म। समय की बर्बादी।",
        "अभिनय शानदार है, लेकिन कहानी कमजोर है।",
        "इस फिल्म ने मेरा दिल जीत लिया! बहुत खूबसूरत!",
        "बहुत बोरिंग फिल्म। एक्टिंग भी ठीक नहीं थी।",
        "Bahut acchi movie thi! Must watch!",        # Code-mixed
        "Kya bakwas film hai, avoid karo",           # Transliterated Hindi
        "Direction kamaal ka hai, screenplay tight hai",
        "मस्त picture है भाई, interval के बाद और मज़ा आया",
        "Waste of money, mat jao dekhne",
    ],
    'label': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # 1=Positive, 0=Negative
}

df = pd.DataFrame(hindi_reviews)
dataset = Dataset.from_pandas(df)

# Split dataset
dataset = dataset.train_test_split(test_size=0.2, seed=42)

# --- Step 2: Load IndicBERT / MuRIL ---
model_name = "ai4bharat/IndicBERTv2-MLM-only"  # Or "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)

# --- Step 3: Tokenize ---
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# --- Step 4: Training ---
training_args = TrainingArguments(
    output_dir='./hindi_sentiment_model',
    num_train_epochs=5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy='epoch',  # Renamed to eval_strategy in recent transformers releases
    save_strategy='epoch',
    load_best_model_at_end=True,
    logging_steps=10,
)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=-1)
    accuracy = (preds == labels).mean()
    return {'accuracy': accuracy}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()

# --- Step 5: Inference ---
def predict_sentiment(text):
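    model.eval()  # Disable dropout for inference
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Trainer may have moved the model to GPU; keep the inputs on the same device
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)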
    probs = torch.softmax(outputs.logits, dim=-1)
    label = torch.argmax(probs).item()
    return "सकारात्मक (Positive) 😊" if label == 1 else "नकारात्मक (Negative) 😞"

# Test
test_reviews = [
    "यह फिल्म कमाल की है! हर किसी को देखनी चाहिए!",
    "Bahut hi boring movie, mat jaao",
    "Acting acchi thi but story weak hai",
]

for review in test_reviews:
    result = predict_sentiment(review)
    print(f"  Review: {review}")
    print(f"  Sentiment: {result}\n")

Expected Outcomes

Handle both Devanagari and Romanized Hindi (code-mixing)
Achieve 80%+ accuracy on Hindi sentiment classification
Deploy as a FastAPI endpoint with a simple web interface

🎯 Mini Project 2: News Article Summarizer with RAG

Objective: Build a news summarization and Q&A system using extractive + abstractive methods with RAG.

Implementation

# Mini Project 2: News Summarizer with RAG
# ==========================================

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
import heapq

# === PART A: Extractive Summarization (TextRank) ===

class TextRankSummarizer:
    """Extractive summarization using TextRank algorithm."""
    
    def __init__(self):
        nltk.download('punkt', quiet=True)
        nltk.download('punkt_tab', quiet=True)  # Needed by newer NLTK releases for sent_tokenize
        nltk.download('stopwords', quiet=True)
        self.stop_words = set(stopwords.words('english'))
    
    def _sentence_similarity(self, sent1, sent2):
        """Compute similarity between two sentences."""
        words1 = [w.lower() for w in word_tokenize(sent1) if w.isalpha() and w.lower() not in self.stop_words]
        words2 = [w.lower() for w in word_tokenize(sent2) if w.isalpha() and w.lower() not in self.stop_words]
        
        all_words = list(set(words1 + words2))
        vec1 = [1 if w in words1 else 0 for w in all_words]
        vec2 = [1 if w in words2 else 0 for w in all_words]
        
        dot_product = sum(a * b for a, b in zip(vec1, vec2))
        norm1 = sum(a**2 for a in vec1) ** 0.5
        norm2 = sum(b**2 for b in vec2) ** 0.5
        
        return dot_product / (norm1 * norm2 + 1e-8)
    
    def summarize(self, text, num_sentences=3):
        """Generate extractive summary."""
        sentences = sent_tokenize(text)
        
        if len(sentences) <= num_sentences:
            return text
        
        # Build similarity matrix
        n = len(sentences)
        sim_matrix = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if i != j:
                    sim_matrix[i][j] = self._sentence_similarity(
                        sentences[i], sentences[j]
                    )
        
        # PageRank-style scoring
        scores = np.ones(n) / n
        damping = 0.85
        
        for _ in range(50):  # Fixed number of power-iteration steps; usually enough to converge here
            new_scores = np.ones(n) * (1 - damping) / n
            for i in range(n):
                for j in range(n):
                    if i != j and sim_matrix[j].sum() > 0:
                        new_scores[i] += damping * sim_matrix[j][i] / sim_matrix[j].sum() * scores[j]
            scores = new_scores
        
        # Select top sentences (maintain original order)
        ranked_indices = sorted(range(n), key=lambda i: scores[i], reverse=True)
        top_indices = sorted(ranked_indices[:num_sentences])
        
        summary = ' '.join([sentences[i] for i in top_indices])
        return summary

# === PART B: RAG-based Q&A System ===

class SimpleRAG:
    """Simple RAG skeleton: TF-IDF retrieval, with the generation step left as a stub."""
    
    def __init__(self):
        self.documents = []
        self.embeddings = None
    
    def add_documents(self, docs):
        """Add documents to the knowledge base."""
        self.documents = docs
        # Simple TF-IDF-based embeddings
        from sklearn.feature_extraction.text import TfidfVectorizer
        self.vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
        self.embeddings = self.vectorizer.fit_transform(docs)
        print(f"Added {len(docs)} documents to knowledge base.")
    
    def retrieve(self, query, top_k=3):
        """Retrieve most relevant documents."""
        query_vec = self.vectorizer.transform([query])
        similarities = cosine_similarity(query_vec, self.embeddings)[0]
        top_indices = similarities.argsort()[-top_k:][::-1]
        
        results = []
        for idx in top_indices:
            results.append({
                'document': self.documents[idx],
                'score': similarities[idx],
                'index': idx
            })
        return results
    
    def answer(self, query, top_k=3):
        """Retrieve relevant docs and generate answer."""
        retrieved = self.retrieve(query, top_k)
        
        # In production, pass to LLM. Here, return relevant excerpts.
        context = "\n".join([r['document'][:200] for r in retrieved])
        
        print(f"\n📌 Query: {query}")
        print(f"\n📄 Retrieved Context (top-{top_k}):")
        for i, r in enumerate(retrieved):
            print(f"  [{i+1}] (score: {r['score']:.3f}) {r['document'][:100]}...")
        
        # In production, you'd call:
        # prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
        # answer = llm.generate(prompt)
        return retrieved

# === Demo ===

# Extractive summarization
summarizer = TextRankSummarizer()

article = """
India's space program has achieved remarkable milestones in recent years.
The Indian Space Research Organisation (ISRO) successfully landed Chandrayaan-3
on the lunar south pole in August 2023, making India the fourth country to 
land on the Moon and the first to reach the south pole region. The mission cost
just $75 million, a fraction of NASA's comparable missions. ISRO's Chairman
S. Somanath credited the team's innovative engineering and frugal approach.
The success has boosted India's commercial space sector, with startups like
Skyroot Aerospace and Agnikul Cosmos developing reusable rockets. India now
aims to send astronauts to space through the Gaganyaan mission by 2025 and 
establish a space station by 2035. The space economy is projected to reach
$44 billion by 2033, creating thousands of high-tech jobs across the country.
"""

print("=== Extractive Summary (TextRank) ===")
summary = summarizer.summarize(article, num_sentences=3)
print(summary)

# RAG demo
print("\n\n=== RAG Q&A System ===")
rag = SimpleRAG()
rag.add_documents([
    "Chandrayaan-3 landed on lunar south pole in August 2023. It cost $75 million.",
    "ISRO plans Gaganyaan mission to send Indian astronauts to space by 2025.",
    "India's space economy is projected to reach $44 billion by 2033.",
    "Skyroot Aerospace launched India's first private rocket Vikram-S in 2022.",
    "ISRO's PSLV has launched over 300 foreign satellites commercially.",
])

rag.answer("How much did the Moon mission cost?")
rag.answer("What are India's future space plans?")

Enhancements

Replace TF-IDF with sentence-transformers for better retrieval
Integrate with ChromaDB or FAISS for scalable vector search
Connect to an LLM (GPT-4, Llama) for abstractive answer generation
Add Hindi news support using IndicTrans for translation
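
The commented-out prompt in `answer` and the LLM enhancement listed above can be sketched as a small prompt builder. This is a minimal sketch, assuming `retrieved` is the list of dicts with `document` and `score` keys that `SimpleRAG.answer` returns; the helper name `build_rag_prompt`, the `max_chars` budget, the instruction wording, and the scores in the sample data are all hypothetical, and the LLM call itself is left out.

```python
def build_rag_prompt(query, retrieved, max_chars=1500):
    """Assemble a grounded prompt from retrieved passages (hypothetical helper)."""
    lines = []
    used = 0
    for i, r in enumerate(retrieved, start=1):
        passage = f"[{i}] {r['document']}"
        # Stop adding passages once the character budget for context is used up
        if used + len(passage) > max_chars:
            break
        lines.append(passage)
        used += len(passage)
    context = "\n".join(lines)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )

# Sample retrieval results; the scores are illustrative, not real output
retrieved = [
    {"document": "Chandrayaan-3 landed on lunar south pole in August 2023. It cost $75 million.", "score": 0.61},
    {"document": "India's space economy is projected to reach $44 billion by 2033.", "score": 0.12},
]
prompt = build_rag_prompt("How much did the Moon mission cost?", retrieved)
print(prompt)
```

Numbering the passages lets the generated answer point back to a source, which is what makes RAG output auditable.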

🎯 Mini Project 3: LLM Fine-tuning with LoRA

Objective: Fine-tune a language model for Indian legal document Q&A using LoRA (Low-Rank Adaptation).

# Mini Project 3: LoRA Fine-tuning
# ==================================

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# --- Concept: LoRA (Low-Rank Adaptation) ---
# Instead of updating all model parameters (billions),
# LoRA adds small trainable low-rank matrices to attention layers.
# 
# Original: W (d × d) — frozen
# LoRA:     W + ΔW = W + A × B  where A (d × r), B (r × d), r << d
# 
# For r=8, d=4096: LoRA trains 65K params vs 16M (0.4%!)

# Step 1: Load base model
model_name = "microsoft/phi-2"  # Or "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Step 2: Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                    # Rank of the low-rank matrices
    lora_alpha=32,           # Scaling factor
    lora_dropout=0.05,       # Dropout for regularization
    target_modules=[         # Which layers to adapt; names are model-specific
        "q_proj", "k_proj", "v_proj", "dense"   # Llama-style models use "o_proj", not "dense"
    ],
    bias="none",
)

# Step 3: Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable params, all params, and trainable %; with this config the
# trainable share should be well under 1% (exact counts depend on the model).

# Step 4: Prepare legal Q&A dataset
legal_qa = [
    {
        "instruction": "What is Section 302 of the Indian Penal Code?",
        "response": "Section 302 IPC deals with punishment for murder. Whoever commits murder shall be punished with death or imprisonment for life, and shall also be liable to fine."
    },
    {
        "instruction": "Explain the Right to Education under Indian Constitution.",
        "response": "Article 21A, inserted by the 86th Constitutional Amendment Act 2002, makes education a fundamental right for children aged 6-14 years. The Right of Children to Free and Compulsory Education Act, 2009, provides the legal framework."
    },
    # ... add more training examples
]

# Format for training
def format_prompt(sample):
    return f"""### Instruction: {sample['instruction']}
### Response: {sample['response']}"""

# Step 5: QLoRA variant (4-bit quantization)
# from transformers import BitsAndBytesConfig
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_use_double_quant=True,
# )
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, quantization_config=bnb_config
# )

print("✅ LoRA model ready for training!")
print(f"   Total parameters (base + LoRA): {sum(p.numel() for p in model.parameters()):,}")
print(f"   Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
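
The parameter arithmetic in the LoRA comment of the listing above can be checked without downloading a model. The following is a minimal NumPy sketch, not the PEFT implementation: it keeps the listing's convention ΔW = A × B, applies the `lora_alpha / r` scaling, and assumes one factor starts at zero so that ΔW is zero before training. The small dimensions and the choice of which factor is zero-initialised are assumptions made for illustration.

```python
import numpy as np

d, r = 4096, 8                      # dimensions from the LoRA comment in the listing
full_params = d * d                 # parameters in a full d x d update
lora_params = d * r + r * d         # parameters in A (d x r) plus B (r x d)
print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {lora_params / full_params:.2%}")

# Small-scale forward pass: y = x W + (alpha / r) * x A B
rng = np.random.default_rng(0)
d_small, r_small, alpha = 64, 4, 8
W = rng.normal(size=(d_small, d_small))          # frozen pre-trained weight
A = rng.normal(size=(d_small, r_small)) * 0.01   # trainable, small random init
B = np.zeros((r_small, d_small))                 # trainable, zero init so delta W starts at 0
x = rng.normal(size=(2, d_small))

def lora_forward(x, W, A, B, alpha, r):
    # Apply the two thin matrices in sequence instead of forming the d x d product
    return x @ W + (alpha / r) * (x @ A) @ B

y0 = lora_forward(x, W, A, B, alpha, r_small)
print(np.allclose(y0, x @ W))                    # adapter is a no-op before training

B_trained = rng.normal(size=(r_small, d_small))  # stand-in for B after some training steps
delta_W = A @ B_trained
print(np.linalg.matrix_rank(delta_W))            # never exceeds r_small
```

Only `A` and `B` would receive gradients during training; `W` stays frozen, which is where the memory savings come from.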

✏️ End-of-Chapter Exercises

Exercise 1: Tokenization Comparison

Tokenize the sentence "I can't believe she'd do that!" using: (a) whitespace splitting, (b) NLTK word_tokenize, (c) spaCy tokenizer, (d) BERT WordPiece tokenizer. Compare the outputs and explain the differences.

Exercise 2: TF-IDF by Hand

Given documents D1="the cat sat on the mat", D2="the dog sat on the log", D3="cats and dogs are pets", compute TF-IDF for every unique word. Which word has the highest TF-IDF score in D3?

Exercise 3: Word2Vec Analogies

Using pre-trained Word2Vec (Google News vectors), find: (a) king - man + woman = ?, (b) Paris - France + India = ?, (c) doctor - man + woman = ?. Discuss any biases you observe.

Exercise 4: Sentiment Classification

Build a sentiment classifier for Amazon product reviews using: (a) Naive Bayes + BoW, (b) SVM + TF-IDF, (c) Fine-tuned BERT. Compare F1 scores and training time. Which is best for production?

Exercise 5: NER Pipeline

Implement a NER system for Indian news articles that extracts: PERSON, ORG, LOCATION, DATE, MONEY. Use spaCy and test on 20 news headlines from The Hindu or NDTV.

Exercise 6: N-gram Language Model

Build a trigram language model from a corpus of 1000 sentences. Implement: (a) MLE estimation, (b) Add-1 (Laplace) smoothing, (c) Kneser-Ney smoothing. Compare perplexity on a test set.

Exercise 7: Text Summarization Evaluation

Implement both extractive (TextRank) and abstractive (T5-small) summarizers. Evaluate both on 10 news articles using ROUGE-1, ROUGE-2, and ROUGE-L metrics. Which performs better and why?

Exercise 8: Topic Modeling

Apply LDA to a dataset of 500 Wikipedia articles across 5 categories (Science, Sports, Politics, Technology, Entertainment). Find the optimal number of topics using coherence score.

Exercise 9: Code-Mixed NLP

Collect 200 Hindi-English code-mixed tweets. Build a language identification system that tags each word as Hindi or English. Then build a sentiment classifier for the code-mixed text.

Exercise 10: Embedding Visualization

Train Word2Vec on a corpus of Indian news articles. Visualize word embeddings using t-SNE for 100 words across 5 semantic categories (politics, cricket, Bollywood, technology, food). Discuss clustering patterns.

Exercise 11: Spam Filter

Build an email spam filter using the Enron dataset. Compare BoW, TF-IDF, and BERT embeddings as features with Logistic Regression, Random Forest, and XGBoost classifiers (9 combinations).

Exercise 12: Question Answering

Using the SQuAD 2.0 dataset, fine-tune a DistilBERT model for extractive QA. Evaluate using Exact Match (EM) and F1 scores.

Exercise 13: Text Generation

Fine-tune GPT-2 on a corpus of Rabindranath Tagore's poetry. Generate 10 poems and evaluate them using: (a) perplexity, (b) BLEU score against real poems, (c) human evaluation (ask 5 people to rate coherence).

Exercise 14: RAG System

Build a RAG system for Indian Constitution Q&A: chunk the constitution text, create embeddings with sentence-transformers, store in ChromaDB, retrieve relevant sections, and generate answers using an LLM.

Exercise 15: Prompt Engineering

Design 5 different prompting strategies (zero-shot, few-shot, chain-of-thought, role-playing, structured output) for the same task: "Classify Indian court judgments by legal area." Compare accuracy across strategies.

Exercise 16: Multilingual Translation

Using the IndicTrans2 model, build a Hindi↔English↔Tamil translation pipeline. Evaluate BLEU scores and analyze common error patterns.

Exercise 17: GloVe Implementation

Implement GloVe from scratch: (a) build co-occurrence matrix from a 10K-word corpus, (b) implement the weighted least squares objective, (c) optimize using AdaGrad. Compare resulting embeddings with pre-trained GloVe.

Exercise 18: Attention Visualization

For a fine-tuned BERT model, extract and visualize attention weights for 5 example sentences. Identify which attention heads capture syntactic vs semantic relationships.

Exercise 19: Fake News Detection

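Build a fake news detector using the LIAR dataset. Combine textual features (TF-IDF) with metadata features (speaker, context). Report accuracy and macro-F1 on the 6-class task and compare against a majority-class baseline; 6-class accuracy on LIAR is typically far lower than on binary fake-news datasets, so interpret your numbers relative to that baseline rather than a fixed threshold.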

Exercise 20: LoRA vs Full Fine-tuning

Fine-tune a model (e.g., Phi-2 or Llama-2-7B) on a custom dataset using: (a) full fine-tuning, (b) LoRA (r=4, 8, 16, 32). Compare: accuracy, training time, memory usage, and number of trainable parameters.

Exercise 21: Dependency Parsing

Implement a simple transition-based dependency parser using an arc-standard system with shift, left-arc, and right-arc transitions. Test on 50 sentences from the Universal Dependencies treebank.

Exercise 22: Document Clustering

Cluster 1000 news articles into groups using: (a) TF-IDF + K-Means, (b) BERT embeddings + K-Means, (c) BERT embeddings + HDBSCAN. Compare using silhouette score and visual inspection.

🎯 Multiple Choice Questions

MCQ 1: In Word2Vec Skip-gram, what is being predicted?

(A) The center word from context words

(B) Context words from the center word

(D) The POS tag of the word

MCQ 2: TF-IDF assigns the HIGHEST weight to words that are:

(A) Common across all documents

(B) Rare in the document but common in corpus

(D) Equally distributed across all documents

MCQ 3: BERT uses which pre-training objective?

(A) Next word prediction (autoregressive)

(B) Masked Language Model + Next Sentence Prediction

(D) Denoising autoencoder only

MCQ 4: In the Transformer attention formula Attention(Q,K,V) = softmax(QKᵀ/√d_k)V, why divide by √d_k?

(A) To reduce computation cost

(B) To prevent dot products from becoming too large, causing vanishing gradients in softmax

(D) To make the model invariant to sequence length

MCQ 5: Which approach does RAG (Retrieval-Augmented Generation) combine?

(A) Supervised learning + unsupervised learning

(B) Information retrieval + language generation

(D) Rule-based systems + statistical models

MCQ 6: What is the BIO tagging scheme used for?

(A) Part-of-speech tagging

(B) Named Entity Recognition (sequence labeling)

(D) Language identification

MCQ 7: LoRA fine-tuning is efficient because:

(A) It removes layers from the model

(B) It adds low-rank decomposition matrices (A × B) to frozen weights

(D) It trains only the last layer

MCQ 8: Perplexity of a language model measures:

(A) The size of the model

(B) How well the model predicts the test data (lower = better)

(D) Number of parameters

MCQ 9: FastText's advantage over Word2Vec is:

(A) It uses attention mechanism

(B) It can handle out-of-vocabulary words using character n-grams

(D) It requires less training data

MCQ 10: TextRank for summarization is inspired by:

(A) Convolutional Neural Networks

(B) Google's PageRank algorithm

(D) Huffman coding

MCQ 11: Which Indian NLP challenge is NOT common in English NLP?

(A) Named Entity Recognition

(B) Code-mixing (Hindi-English in the same sentence)

(D) Sentiment analysis

MCQ 12: In abstractive summarization, the model:

(A) Selects the most important sentences from the original text

(B) Generates new sentences that may not appear in the original text

(D) Only removes stop words

💼 Interview Questions

Q1: Explain the difference between Word2Vec, GloVe, and FastText. When would you use each?

Answer: Word2Vec learns embeddings from local context windows (Skip-gram/CBOW) — good for general-purpose embeddings. GloVe factorizes the global co-occurrence matrix — better at capturing global statistics. FastText extends Word2Vec with character n-grams — best for morphologically rich languages (Hindi, Turkish) and handling OOV words. In practice: use FastText for Indian languages, GloVe for analogy tasks, and nowadays most production systems use contextual embeddings from BERT/transformers instead.
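
To make the out-of-vocabulary point concrete, here is a minimal sketch of how a word can be broken into character n-grams with boundary markers, in the style FastText uses. It only illustrates the decomposition; the helper name `char_ngrams` is hypothetical, the 3-to-6 range is assumed as a typical setting, and real FastText additionally hashes n-grams into buckets and sums their learned vectors.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers, FastText-style."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares many n-grams with words seen during training
seen = set(char_ngrams("playing")) | set(char_ngrams("replay"))
oov = char_ngrams("replaying")
shared = [g for g in oov if g in seen]
print(f"{len(shared)} of {len(oov)} n-grams of the unseen word were seen in training words")
```

Because the vector of an unseen word is built from n-grams it shares with known words, inflected forms in morphologically rich languages still get usable embeddings.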

Q2: How does BERT differ from GPT architecturally? Why does this matter?

Answer: BERT uses the Transformer encoder (bidirectional) with Masked Language Model pre-training — it sees context from BOTH sides. GPT uses the Transformer decoder (unidirectional/autoregressive) — it only sees left context. This means BERT excels at understanding tasks (classification, NER, QA) while GPT excels at generation tasks (text completion, dialogue). BERT can't generate text naturally; GPT can't attend to future tokens.

Q3: What is the attention mechanism and why was it revolutionary?

Answer: Attention allows each position in a sequence to attend to all other positions, computing a weighted sum of value vectors based on query-key similarities. Before attention, RNNs compressed entire sequences into fixed-size vectors (information bottleneck). Attention enables: (1) parallel processing (unlike sequential RNNs), (2) direct long-range dependencies, (3) interpretable weights. The Transformer replaces recurrence entirely with self-attention, enabling massive parallelism and scaling to billions of parameters.
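
The weighted-sum description above corresponds to the formula Attention(Q,K,V) = softmax(QKᵀ/√d_k)V used elsewhere in this chapter. The following is a minimal NumPy sketch of single-head scaled dot-product attention; the shapes and random inputs are arbitrary assumptions, and masking, multiple heads, and learned projections are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_queries, n_keys) similarity matrix
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 positions, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))   # d_v = 16
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

Every row of `weights` is computed from one matrix product, so all positions are processed at once rather than step by step as in an RNN.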

Q4: How would you build a sentiment analysis system for Hindi product reviews?

Answer: (1) Data collection: scrape Flipkart/Amazon Hindi reviews or use existing datasets (IIIT-H); (2) Handle code-mixing: detect language per word, normalize transliterated text; (3) Preprocessing: use IndicNLP tokenizer for Devanagari; (4) Model: fine-tune IndicBERT or MuRIL (pre-trained on Indian languages) rather than training from scratch; (5) Handle class imbalance with oversampling/weighted loss; (6) Evaluate with F1 (not just accuracy); (7) Deploy with FastAPI + sentence caching for speed.

Q5: Explain RAG. Why is it preferred over fine-tuning for enterprise applications?

Answer: RAG (Retrieval-Augmented Generation) retrieves relevant documents from a knowledge base and feeds them as context to an LLM. Advantages over fine-tuning: (1) Knowledge can be updated without retraining, (2) Answers are grounded in source documents (auditable), (3) Reduces hallucination, (4) Domain-specific without expensive GPU training, (5) Source attribution enables trust. Architecture: Document chunking → Embedding → Vector DB → Query → Retrieve → Prompt construction → LLM → Answer.
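
The first stage of that architecture, document chunking, can be sketched as overlapping word windows. This is a minimal sketch under simple assumptions: whitespace tokenization, fixed window sizes, and a hypothetical helper name `chunk_words`. Production systems often split on sentence or section boundaries instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into overlapping word windows (hypothetical helper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already reaches the end of the text
    return chunks

# Synthetic 120-word document: w0 w1 ... w119
text = " ".join(f"w{i}" for i in range(120))
chunks = chunk_words(text, chunk_size=50, overlap=10)
print(len(chunks), [len(c.split()) for c in chunks])
# 3 [50, 50, 40]
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.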

Q6: What is LoRA and how does it make LLM fine-tuning accessible?

Answer: LoRA (Low-Rank Adaptation) freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into each transformer layer. Instead of updating W (d×d), it learns ΔW = A(d×r) × B(r×d) where rank r << d (typically 4-64). This reduces trainable parameters from billions to millions (0.1-1% of total), enabling fine-tuning on consumer GPUs. QLoRA additionally quantizes the frozen weights to 4-bit, making it possible to fine-tune a 7B model on a single 16GB GPU.

Q7: How would you evaluate an NLP model for bias?

Answer: (1) Test with demographic-swapped inputs (replace "he" with "she", different cultural names), (2) Check embedding bias with WEAT (Word Embedding Association Test), (3) Disaggregate accuracy by subgroups, (4) Test for stereotypical associations ("nurse" with gender), (5) Use tools like AI Fairness 360, (6) For Indian context: check caste, religion, and regional biases, (7) Continuous monitoring with diverse test sets post-deployment.
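
Point (1), demographic-swapped inputs, can be sketched as a counterfactual test: swap gendered terms and count how often the prediction changes. This is a minimal sketch; the swap table is deliberately tiny, `swap_terms` and `flip_rate` are hypothetical helpers, and `toy_predict` is an intentionally biased stand-in for a real classifier.

```python
import re

# "her" is left out as a key because it is ambiguous (it maps to both "his" and "him")
SWAPS = {"he": "she", "she": "he", "his": "her", "him": "her",
         "man": "woman", "woman": "man"}

def swap_terms(text, swaps=SWAPS):
    """Replace each whole-word match with its counterpart, preserving capitalisation."""
    def repl(match):
        word = match.group(0)
        new = swaps[word.lower()]
        return new.capitalize() if word[0].isupper() else new
    pattern = r"\b(" + "|".join(map(re.escape, swaps)) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

def flip_rate(predict, texts):
    """Fraction of inputs whose predicted label changes after the swap."""
    changed = sum(predict(t) != predict(swap_terms(t)) for t in texts)
    return changed / len(texts)

def toy_predict(text):
    # Deliberately biased toy classifier: the label depends only on the word "he"
    return "positive" if "he" in text.lower().split() else "negative"

print(swap_terms("He said his team won"))
print(flip_rate(toy_predict, ["He is a doctor", "The nurse arrived"]))
```

A flip rate well above zero on inputs where gender should be irrelevant is a signal to investigate further, not proof of bias on its own.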

Q8: What are the challenges of NLP for Indian languages?

Answer: (1) Morphological richness: Tamil and Kannada are agglutinative, so a single inflected word can carry what English expresses as a whole phrase or clause; (2) Script diversity: 12+ scripts (Devanagari, Tamil, Bengali, etc.); (3) Code-mixing: a large share of urban social media text mixes Hindi and English, often within one sentence; (4) Resource scarcity: limited labeled datasets for most languages; (5) Transliteration: the same word is written in multiple scripts, with no standard romanized spelling; (6) Free word order: Hindi and Sanskrit allow flexible sentence structure; (7) Dialectal variation: Hindi as spoken in Uttar Pradesh differs from the Hindi spoken in Rajasthan.
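
Script diversity and code-mixing show up even in a single sentence. The sketch below labels each token with a coarse script based on Unicode code point ranges. It is only a first step: the helper `token_script` is hypothetical, only three Indic blocks are listed, and romanized Hindi is labeled "Latin", so real language identification still needs a trained classifier.

```python
# Unicode block ranges for a few Indic scripts (only a small subset is listed)
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Tamil": (0x0B80, 0x0BFF),
}

def token_script(token):
    """Return the script of the first recognisable character in the token."""
    for ch in token:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                return name
        if ch.isascii() and ch.isalpha():
            return "Latin"
    return "Other"

sentence = "yaar ye movie बहुत अच्छी thi 👍"
labels = [(tok, token_script(tok)) for tok in sentence.split()]
print(labels)
```

Here "yaar", "ye", and "thi" are Hindi words written in Latin script, which is exactly the transliteration problem that script detection alone cannot solve.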

Q9: Explain the difference between extractive and abstractive summarization.

Answer: Extractive selects the most important sentences from the original text (TextRank, LexRank). Pros: faithful to source, no hallucination. Cons: may lack coherence, can't paraphrase. Abstractive generates new text summarizing the original (T5, BART, GPT). Pros: more natural, concise. Cons: may hallucinate facts, harder to evaluate. In practice, many systems use hybrid approaches: extract key sentences, then rephrase them abstractively.

Q10: Design a real-time fake news detection system. What NLP components would you use?

Answer: Architecture: (1) Content analysis: BERT-based claim verification against fact-check databases, linguistic feature extraction (sensationalism score, emotion intensity), (2) Source credibility: domain reputation scoring, author history NER, (3) Propagation analysis: social network spread patterns, bot detection, (4) Cross-reference: RAG-based fact verification against trusted news corpus, (5) Multilingual: IndicBERT for Hindi/regional language fake news, (6) Real-time pipeline: Kafka → preprocessing → ensemble model → confidence score → human-in-the-loop for borderline cases.

🔬 Research Problems

Research Problem 1: Code-Mixed Sentiment Analysis for Indian Social Media

Problem: Build a sentiment analysis model that handles Hindi-English, Tamil-English, and Telugu-English code-mixed text without language-specific preprocessing.

Approach: Explore unified multilingual embeddings, script-agnostic subword tokenization, and transliteration-augmented training. Can a single model handle code-mixing across all Indian language pairs?

Dataset: SemEval-2020 Task 9 (Sentiment Analysis for Code-Mixed Social Media Text), SAIL 2015 Hindi-English dataset.

Open Questions: Does explicit language identification improve or hurt performance? How to handle triple code-mixing (Hindi + English + Urdu)?

Research Problem 2: Hallucination Detection and Mitigation in LLMs

Problem: LLMs generate fluent but factually incorrect text. Develop methods to detect and reduce hallucination in domain-specific applications (medical, legal).

Approach: (1) Self-consistency checking (sample multiple outputs, detect contradictions), (2) Fact verification against knowledge graphs, (3) Uncertainty quantification using token probabilities, (4) RAG with strict grounding constraints.

Metric: Define hallucination rate, factual consistency score, and source grounding score.
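
Approach (1), self-consistency checking, can be prototyped before any model is involved. The sketch below takes several sampled answers to the same question and measures how many agree with the majority answer. The samples, the exact-string normalization, the helper name `self_consistency`, and the 0.7 threshold are all hypothetical; a real system would compare answers semantically rather than by string equality.

```python
from collections import Counter

def self_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    return top, count / len(normalized)

# Hypothetical samples from the same prompt at non-zero temperature
samples = ["Section 302", "section 302", "Section 304", "Section 302 ", "Section 302"]
answer, agreement = self_consistency(samples)
flagged = agreement < 0.7   # hypothetical threshold for routing to human review
print(answer, agreement, flagged)
```

Low agreement does not prove a hallucination, and high agreement does not rule one out, so this score is best combined with the grounding checks in approaches (2) and (4).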

Research Problem 3: Low-Resource Indian Language NLP

Problem: Of India's 22 scheduled languages, many lack sufficient training data for modern NLP. Develop methods for NLP in languages with <10,000 labeled examples.

Approach: (1) Cross-lingual transfer from Hindi/English to low-resource languages, (2) Data augmentation via back-translation, (3) Zero-shot transfer using multilingual models, (4) Active learning for efficient annotation, (5) Synthetic data generation using LLMs.

Languages to focus on: Bodo, Dogri, Maithili, Santhali, Konkani — the least-resourced scheduled languages.

Research Problem 4: Efficient Transformer Architectures for Edge Devices

Problem: Deploy NLP models on Indian smartphones (many with <4GB RAM). Design architectures that maintain accuracy while fitting in constrained environments.

Approach: Knowledge distillation, pruning, quantization (INT4/INT8), architecture search for mobile transformers. Target: sentiment analysis and language ID in <50MB model size.

🎓 Key Takeaways

Preprocessing is foundational: Tokenization, stemming, lemmatization, and stop word removal form the bedrock of every NLP pipeline. Choice of preprocessing depends on the task — sentiment analysis may need emojis and stop words that other tasks don't.
Representations evolved from sparse to dense: BoW → TF-IDF → Word2Vec → Contextual Embeddings (BERT). Each generation captures richer linguistic information.
Word embeddings capture meaning: Word2Vec, GloVe, and FastText encode semantic and syntactic relationships in dense vectors. FastText handles morphologically rich Indian languages better through character n-grams.
Transformers transformed NLP: The self-attention mechanism enables parallel processing and captures long-range dependencies. BERT (bidirectional, for understanding) and GPT (unidirectional, for generation) are two sides of the same coin.
Indian NLP has unique challenges: 22+ languages, 12+ scripts, pervasive code-mixing, dialectal variation, and limited labeled data. Solutions include IndicBERT, IndicTrans2, and Bhashini platform.
RAG is the enterprise standard: Retrieval-Augmented Generation reduces hallucination, enables domain specificity, and provides auditable answers, all of which are critical for production NLP systems.
LoRA democratizes fine-tuning: Low-Rank Adaptation enables fine-tuning billion-parameter models on consumer hardware by training only about 0.1-1% of the parameters.
Evaluation matters: Perplexity for language models, ROUGE for summarization, F1 for classification, BLEU for translation, Exact Match for QA. Always use the right metric for the task.
NLP is the gateway to General AI: Language understanding underpins reasoning, planning, and knowledge — mastering NLP is essential for the next generation of AI researchers and engineers.
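
Three of the metrics named above are short enough to write by hand, which helps when checking library output. This is a minimal sketch: perplexity is computed from per-token probabilities, and Exact Match and token-level F1 use lowercase whitespace tokens only. That is a simplification of SQuAD-style scoring, which also strips punctuation and articles; the function names are hypothetical.

```python
import math
from collections import Counter

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def exact_match(pred, gold):
    """1 if the strings match after trimming and lowercasing, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall between prediction and gold answer."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(perplexity([0.25, 0.25, 0.25, 0.25]))            # model choosing uniformly among 4 tokens
print(exact_match("Article 21A", "article 21a"))
print(token_f1("the 86th amendment act", "86th amendment"))
```

A model that is uniformly unsure among four tokens has a perplexity of about 4, which is why perplexity is often read as an effective branching factor.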

🎓 Professor's Insight

The field moves fast — GPT-4 is already being superseded by newer models. But the fundamentals don't change: understanding tokenization, attention, embeddings, and evaluation will serve you regardless of which model is trending. Learn the principles deeply, and you'll adapt to any new architecture.
