Chapter 24
Natural Language Processing
& Text Mining
From tokenization to transformers: master the art of teaching machines to read, write, understand, and generate human language. Explore Word2Vec, BERT, RAG, LoRA, and multilingual NLP for India's diverse languages.
Learning Objectives
By the end of this chapter, you will be able to:
- Preprocess text using tokenization, stemming, lemmatization, and stop word removal in Python with NLTK and spaCy.
- Build text representations using Bag-of-Words, TF-IDF, and understand their mathematical foundations.
- Train and use word embeddings, including Word2Vec (CBOW & Skip-gram), GloVe, and FastText, and understand the objective function behind each.
- Design sentiment analysis pipelines from data collection through feature extraction to classification and deployment.
- Implement Named Entity Recognition (NER) using sequence labeling with BIO tags, CRFs, and transformer models.
- Build text classifiers for spam detection, topic classification, and multilingual settings using both classical ML and deep learning.
- Understand text summarization, both extractive (TextRank) and abstractive (seq2seq, BART, T5), and implement both approaches.
- Design question answering systems using retrieval-based and generative approaches.
- Trace the evolution of language models from n-grams → neural LMs → GPT/BERT → modern LLMs.
- Apply NLP to Indian languages, handling multilingual challenges, code-mixing, and script diversity using AI4Bharat tools.
- Fine-tune LLMs using LoRA, QLoRA, and prompt engineering techniques for domain-specific tasks.
- Implement Retrieval-Augmented Generation (RAG) pipelines combining vector databases with generative models.
NLP Engineer roles are among the highest-paid in AI (₹25–80 LPA in India, $150K–$300K+ in the US). Key skills: transformers, fine-tuning, prompt engineering, multilingual NLP. Companies hiring: Google, Microsoft, Amazon, Flipkart, Jio, Krutrim, Sarvam AI.
Introduction
Natural Language Processing (NLP) sits at the intersection of linguistics, computer science, and artificial intelligence. It is the field dedicated to enabling machines to understand, interpret, generate, and interact with human language, one of the most complex and nuanced systems ever created by humanity.
Consider the staggering scale: there are approximately 7,000 languages spoken worldwide, with India alone home to 22 scheduled languages and over 19,500 dialects. Every day, humans produce about 2.5 quintillion bytes of data, and roughly 80% of it is unstructured, much of it text: emails, social media posts, documents, reviews, and conversations. NLP is the key that unlocks this vast reservoir of information.
Text Mining, closely related to NLP, focuses on extracting meaningful patterns, trends, and insights from large text corpora. While NLP provides the tools (parsing, understanding, generation), text mining applies them to discover knowledge hidden in text.
Why NLP Matters Now More Than Ever
- ChatGPT & LLMs: The launch of ChatGPT (November 2022) and subsequent models like GPT-4, Gemini, and Claude showed how far large-scale language modeling can go toward general-purpose AI assistants.
- India's Digital Transformation: India has 800M+ internet users, many of whom are not English speakers. NLP for Indian languages is critical for financial inclusion (UPI voice payments), governance (e-courts), and education.
- Enterprise Adoption: 75% of Fortune 500 companies now use NLP for customer service, document processing, compliance monitoring, and market intelligence.
- Healthcare: NLP extracts diagnoses from clinical notes, enables medical chatbots, and mines research papers for drug discovery.
NLP has undergone three major paradigm shifts: (1) rule-based systems (1960s–1990s), built on hand-crafted grammars; (2) statistical methods (1990s–2013), built on probabilistic models like HMMs and CRFs; and (3) the deep learning era (2013–present), from Word2Vec through BERT to GPT-4. No paradigm eliminated its predecessor, so understanding all three is essential for building robust NLP systems.
The NLP Technology Stack
| Layer | Components | Examples |
|---|---|---|
| Raw Text | Documents, tweets, reviews | Web crawls, user inputs |
| Preprocessing | Tokenization, normalization, cleaning | NLTK, spaCy, regex |
| Representation | BoW, TF-IDF, embeddings | Word2Vec, BERT embeddings |
| Understanding | POS tagging, NER, parsing | spaCy, Stanford NLP |
| Application | Classification, QA, summarization | HuggingFace Transformers |
| Generation | Text generation, translation | GPT-4, mBART, IndicTrans |
Historical Background
The history of NLP is a fascinating journey from ambitious dreams to remarkable realities.
The Pioneering Era (1950–1970)
1950: Alan Turing proposed the Turing Test in "Computing Machinery and Intelligence," framing language understanding as the benchmark for machine intelligence.
1954: Georgetown-IBM Experiment. The first machine translation demonstration translated 60 Russian sentences into English using a dictionary of 250 words and 6 grammar rules. Researchers predicted full MT in 3–5 years; it took 60+.
1966: ELIZA (Joseph Weizenbaum, MIT). The first chatbot, simulating a Rogerian psychotherapist using simple pattern matching. People formed emotional bonds with it, leading Weizenbaum to warn about AI deception.
1969: SHRDLU (Terry Winograd, MIT). A natural language understanding system that could manipulate virtual blocks on a table, showing deep but extremely narrow language understanding.
The Statistical Revolution (1980–2010)
1980s: Hidden Markov Models (HMMs) revolutionized speech recognition and POS tagging. A quip attributed to Fred Jelinek (IBM): "Every time I fire a linguist, the performance of the speech recognizer goes up."
1993: Penn Treebank. The creation of large annotated corpora enabled statistical parsing.
2001: Conditional Random Fields (CRFs), introduced by Lafferty, McCallum, and Pereira, proved superior to HMMs for sequence labeling tasks like NER.
2003: Latent Dirichlet Allocation (LDA), introduced by Blei, Ng, and Jordan, enabled topic modeling from text corpora.
The Deep Learning Revolution (2013–Present)
2013: Word2Vec (Mikolov et al., Google). Dense word embeddings capturing semantic relationships, such as "king - man + woman ≈ queen."
2014: GloVe (Stanford). Global Vectors combining count-based and prediction-based approaches.
2014: Seq2Seq (Sutskever et al.) and attention (Bahdanau et al.) transformed machine translation.
2017: "Attention Is All You Need" (Vaswani et al., Google). The Transformer architecture, arguably the most influential paper in modern AI.
2018: BERT (Devlin et al., Google). Bidirectional pre-training shattered NLP benchmarks.
2020: GPT-3 (OpenAI). 175B parameters, few-shot learning, emergent abilities.
2022: ChatGPT. NLP went mainstream, reaching an estimated 100M users in 2 months.
2023–2025: AI4Bharat released IndicTrans2, covering 22 Indian languages; Krutrim and Sarvam AI are building India-first LLMs.
India's NLP Heritage: Panini's Ashtadhyayi (4th century BCE), with 3,959 rules describing Sanskrit grammar, is considered the world's first formal grammar, predating modern computational linguistics by millennia. Modern Indian NLP builds on this rich linguistic tradition through projects like AI4Bharat's IndicNLP Suite, which provides models for 22+ Indian languages.
Conceptual Explanation
4.1 Text Preprocessing Pipeline
Raw text is messy: it contains HTML tags, special characters, inconsistent casing, and irrelevant words. Preprocessing transforms raw text into a clean, standardized format suitable for analysis.
Tokenization
Tokenization splits text into individual units (tokens). These can be words, subwords, or characters.
- Word Tokenization: "I can't believe it!" → ["I", "can't", "believe", "it", "!"] or ["I", "ca", "n't", "believe", "it", "!"], depending on how the tokenizer treats contractions.
- Subword Tokenization (BPE): "unhappiness" → ["un", "happiness"] or ["un", "happ", "iness"], depending on the learned merges. GPT models use BPE; BERT uses the closely related WordPiece algorithm.
- Character Tokenization: "hello" → ["h", "e", "l", "l", "o"]. Useful for morphologically rich languages like Hindi and Tamil.
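The word-level and character-level cases above can be sketched with the standard library alone. This is a minimal illustration, not how NLTK or spaCy tokenize; the function names `word_tokenize_simple` and `char_tokenize` and the regular expression are assumptions made for this example.

```python
import re

def word_tokenize_simple(text):
    """Split text into words (contractions kept whole) and punctuation marks."""
    # \w+(?:'\w+)?  a word, optionally followed by an apostrophe suffix (can't)
    # [^\w\s]       any single punctuation character
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def char_tokenize(text):
    """Split text into individual Unicode code points."""
    return list(text)

print(word_tokenize_simple("I can't believe it!"))
print(char_tokenize("hello"))
```

Note that `list(text)` splits by code point, so in scripts such as Devanagari it separates vowel signs from their base consonants; grapheme-aware splitting needs extra handling.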
Stemming vs. Lemmatization
Stemming crudely chops off word endings: "running" → "run", but "better" → "better" (it cannot recover the base form "good"). Fast but inaccurate.
Lemmatization uses vocabulary and morphological analysis: "running" → "run", "better" → "good". Slower but linguistically correct.
Stop Word Removal
Common words like "the", "is", "at" carry little meaning for many NLP tasks, but not all: negations such as "not" appear on many stop word lists, and removing them can flip the meaning in sentiment analysis.
4.2 Text Representation Models
Bag of Words (BoW)
Represents text as a vector of word frequencies, ignoring order. Simple but effective for many classification tasks.
Document: "the cat sat on the mat" → {the: 2, cat: 1, sat: 1, on: 1, mat: 1}
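The example above can be reproduced in a few lines. This is a minimal sketch assuming whitespace tokenization; the helper names `bag_of_words` and `to_vector` are illustrative, not part of any library.

```python
from collections import Counter

def bag_of_words(document):
    """Count how often each whitespace-separated token occurs."""
    return Counter(document.lower().split())

def to_vector(counts, vocabulary):
    """Lay the counts out in a fixed vocabulary order."""
    return [counts.get(word, 0) for word in vocabulary]

bow = bag_of_words("the cat sat on the mat")
vocab = sorted(bow)
print(dict(bow))
print(vocab, to_vector(bow, vocab))

# Word order is discarded: a shuffled sentence gives the same representation
print(bag_of_words("mat the on sat cat the") == bow)
```

The last line shows the main limitation: two sentences with the same words in a different order are indistinguishable.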
TF-IDF (Term Frequency-Inverse Document Frequency)
Improves BoW by weighting words based on how informative they are. Words common across all documents (like "the") get low weights; words unique to specific documents get high weights.
4.3 Word Embeddings
Unlike BoW/TF-IDF (sparse, high-dimensional), word embeddings are dense, low-dimensional vectors that capture semantic meaning. The key insight is the distributional hypothesis: "You shall know a word by the company it keeps" (J.R. Firth, 1957).
Word2Vec: CBOW (Continuous Bag of Words)
Predicts the center word from surrounding context words. Given context ["the", "cat", "on", "the"], predict "sat".
Word2Vec: Skip-gram
The reverse: predicts context words from the center word. Given "sat", predict ["the", "cat", "on", "the"]. Works better for rare words and small datasets.
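To make the two training setups concrete, the sketch below generates the (center, context) pairs that Skip-gram trains on. The function name `skipgram_pairs` and the window size of 2 are assumptions for this illustration; CBOW uses the same windows but treats the context words as input and the center word as the target.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in the token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(tokens, window=2)
print([p for p in pairs if p[0] == "sat"])
```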
GloVe (Global Vectors)
Combines the best of count-based (SVD on co-occurrence matrix) and prediction-based (Word2Vec) methods by factorizing the log co-occurrence matrix.
FastText
Extends Word2Vec by representing words as bags of character n-grams: "where" → {"<wh", "whe", "her", "ere", "re>"}. This handles out-of-vocabulary words and morphologically rich languages.
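The n-gram decomposition above is easy to reproduce. The following is a minimal sketch with a hypothetical helper `char_ngrams`; it extracts n-grams of a single length, whereas FastText itself uses a range of lengths and also keeps the whole word as a unit.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("where"))
```

Because an unseen word still shares n-grams with known words, its vector can be assembled from the n-gram vectors it does have.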
4.4 Sentiment Analysis
Determines the emotional tone of text: positive, negative, or neutral (or fine-grained scales). Applications include product review analysis, brand monitoring, stock market prediction, and political polling.
4.5 Named Entity Recognition (NER)
Identifies and classifies named entities: persons, organizations, locations, dates, monetary values, etc. Uses BIO tagging (Beginning, Inside, Outside) for sequence labeling.
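As a small illustration of BIO tagging, the sketch below converts entity spans into one tag per token. The function name `bio_tags`, the span format (start, end-exclusive, label), and the example sentence are assumptions for this example.

```python
def bio_tags(tokens, entities):
    """Assign BIO tags given entity spans as (start, end_exclusive, label)."""
    tags = ["O"] * len(tokens)          # default: outside any entity
    for start, end, label in entities:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the entity
    return tags

tokens = ["Sundar", "Pichai", "leads", "Google", "in", "California"]
entities = [(0, 2, "PER"), (3, 4, "ORG"), (5, 6, "LOC")]
for token, tag in zip(tokens, bio_tags(tokens, entities)):
    print(f"{token:12s} {tag}")
```

A sequence labeler (CRF or transformer) is then trained to predict these tags token by token.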
4.6 Language Models: Evolution
A language model assigns probabilities to sequences of words. The evolution: N-gram → Neural LM → RNN LM → Transformer LM (GPT, BERT) → LLMs (GPT-4, Gemini).
4.7 Retrieval-Augmented Generation (RAG)
Combines retrieval (finding relevant documents from a knowledge base) with generation (using an LLM to synthesize answers). By grounding answers in retrieved text, it mitigates hallucination, knowledge cutoff, and domain specificity problems.
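The retrieve-then-generate flow can be sketched without any external service. In this toy version the "embedding" is a bag-of-words count vector and the generator is omitted: the code stops at the augmented prompt that would be sent to an LLM. All names (`embed`, `retrieve`, `build_prompt`, `knowledge_base`) are hypothetical; a real pipeline would likely use a dense encoder and a vector database instead.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': bag-of-words counts (stand-in for a dense encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, top_k=1):
    """Rank documents by similarity to the query and return the best top_k."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, passages):
    """Assemble the augmented prompt that would be passed to a generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"

knowledge_base = [
    "UPI is a real-time payment system operated by NPCI",
    "BERT is a bidirectional transformer model",
    "TextRank is an extractive summarization algorithm",
]
query = "who operates UPI"
passages = retrieve(query, knowledge_base, top_k=1)
print(build_prompt(query, passages))
```

Word-overlap retrieval misses paraphrases ("operates" vs. "operated" do not match here), which is one reason dense embeddings are usually preferred in practice.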
For GATE/NET exams, remember: BoW ignores word order, TF-IDF adds document-level weighting, Word2Vec learns from local context windows, GloVe uses global co-occurrence statistics. BERT is bidirectional (masked LM), GPT is unidirectional (autoregressive). These distinctions are frequently tested.
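Mathematical Foundation
5.1 TF-IDF Mathematics
$$\text{TF}(t, d) = \frac{\text{count}(t, d)}{|d|}, \qquad \text{IDF}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$
$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$
Where count(t, d) is the number of times term t occurs in document d, |d| is the length of d in words, N = total number of documents, and |{d ∈ D : t ∈ d}| = number of documents containing term t.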
5.2 Word2Vec Skip-gram Objective
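The Skip-gram model maximizes the average log probability of context words given a center word:
$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log P(w_{t+j} \mid w_t)$$
Where T is the total number of words, c is the context window size, and the conditional probability uses softmax:
$$P(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$
Here v_{w} and v'_{w} are the input and output vector representations of word w, and V is the vocabulary size.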
5.3 GloVe Objective
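$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
Where X_{ij} is the co-occurrence count of words i and j, w_i and w̃_j are the word and context vectors, f(x) is a weighting function that caps high-frequency pairs, and b_i, b̃_j are bias terms.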
5.4 Negative Sampling Approximation
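Computing the full softmax over vocabulary V (which can be 100K–1M words) is expensive. Negative sampling approximates this by replacing each log P(w_O | w_I) term with:
$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]$$
Where σ is the sigmoid function, k is the number of negative samples (typically 5–20), and P_n(w) is the noise distribution (usually the unigram distribution raised to the 3/4 power).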
5.5 Attention Mechanism (Transformer)
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
Where Q (queries), K (keys), V (values) are linear projections of the input, and d_k is the dimension of the keys. The √d_k scaling prevents dot products from growing too large.
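5.6 N-gram Language Model
$$P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1}), \qquad P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1}, \dots, w_i)}{C(w_{i-n+1}, \dots, w_{i-1})}$$
Where C(·) is the count of a word sequence in the training corpus. For a bigram model (n = 2), this reduces to P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1}).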
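5.7 Perplexity (Language Model Evaluation)
$$\text{PP}(W) = P(w_1, w_2, \dots, w_N)^{-1/N}$$
Where N is the number of words in the test sequence W. Lower perplexity = better model. A perplexity of k means the model is as confused as if it had to choose uniformly among k possibilities at each step.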
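5.8 Cosine Similarity (Embedding Comparison)
$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$
Where a and b are two embedding vectors. The value ranges from -1 (opposite directions) to 1 (same direction).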
Formula Derivations
6.1 Deriving TF-IDF from First Principles
Motivation: We want a scoring function that tells us how important a word is to a specific document in a collection.
Step 1. Term Frequency: If a word appears more often in a document, it's likely more relevant to that document.
TF(t, d) = count(t, d) / |d|
Example: Document d has 100 words, "machine" appears 5 times
TF("machine", d) = 5/100 = 0.05
Step 2. The Problem with TF Alone: Common words like "the", "is" have high TF in every document but carry no discriminative information.
Step 3. Inverse Document Frequency: We need a factor that penalizes words appearing in many documents:
df(t) = number of documents containing term t
N = total number of documents
If df(t) = N → word appears everywhere → not informative → weight should be LOW
If df(t) = 1 → word is unique to one document → very informative → weight should be HIGH
Simple ratio: N / df(t) gives higher values for rarer terms
But this grows linearly and can be huge. We take the logarithm:
IDF(t) = log(N / df(t))
When df(t) = N: IDF = log(1) = 0 (zero weight: perfect!)
When df(t) = 1: IDF = log(N) (maximum weight: perfect!)
Step 4. Combine: TF-IDF(t, d) = TF(t, d) × IDF(t) gives high scores to words that are frequent in a specific document but rare across the collection.
6.2 Deriving Skip-gram with Negative Sampling
Motivation: The full softmax in Word2Vec is O(V) per training step, which is too slow for large vocabularies.
Step 1. Original objective for a single (center, context) pair:
maximize: log P(w_c | w_t) = log [exp(u_c · v_t) / Σ_w exp(u_w · v_t)]
This requires summing over ALL V words in the vocabulary → O(V) per step.
Step 2. Reformulate as binary classification: Instead of multi-class softmax, ask: "Is (w_t, w_c) a real pair or a fake pair?"
P(D=1 | w_t, w_c) = σ(u_c · v_t) = 1 / (1 + exp(-u_c · v_t))
For a real pair, maximize σ(u_c · v_t)
For a fake pair (w_t, w_k), maximize σ(-u_k · v_t), i.e., minimize σ(u_k · v_t)
Step 3. Sample k negative (fake) pairs for each real pair:
J = log σ(u_c · v_t) + Σ_{i=1}^{k} E_{w_i ~ P_n} [log σ(-u_{w_i} · v_t)]
Now each step is O(k) instead of O(V), where k = 5-20 ≪ V
Noise distribution P_n(w) = [count(w)]^{3/4} / Σ_w [count(w)]^{3/4}
The 3/4 power smooths the distribution, giving rare words more sampling chance.
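The effect of the 3/4 power is easy to see numerically. The sketch below uses made-up word counts (an assumption for illustration only) and a hypothetical helper `noise_distribution` to compare the raw unigram distribution with the smoothed one.

```python
def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, renormalized to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

# Illustrative counts, not taken from a real corpus
counts = {"the": 1000, "learning": 100, "perplexity": 1}
total_count = sum(counts.values())
raw = {w: c / total_count for w, c in counts.items()}
smoothed = noise_distribution(counts)

for w in counts:
    print(f"{w:12s} unigram={raw[w]:.4f}  smoothed={smoothed[w]:.4f}")
```

Frequent words lose probability mass and rare words gain it, so negatives are drawn less overwhelmingly from the most common words.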
6.3 Deriving Attention Scaling Factor
Why divide by √d_k?
Let q and k be independent random vectors with components ~ N(0, 1)
q · k = Σ_{i=1}^{d_k} q_i × k_i
E[q_i × k_i] = E[q_i] × E[k_i] = 0 × 0 = 0
Var(q_i × k_i) = Var(q_i) × Var(k_i) = 1 × 1 = 1 (valid because q_i and k_i are independent with zero mean)
Since the d_k terms are independent, their variances add: Var(q · k) = Σ_{i=1}^{d_k} Var(q_i × k_i) = d_k
So q · k has variance d_k. For large d_k, dot products become very large,
pushing softmax into saturated regions (extremely peaked distribution).
Solution: Divide by √d_k to normalize variance back to 1:
Var(q · k / √d_k) = d_k / d_k = 1 ✓
The √d_k scaling in attention is often treated as a "trick," but it's deeply principled. Without it, gradient flow through the softmax degrades for high-dimensional models (d_k = 64 in BERT-base), making training unstable. This is an example of how understanding variance propagation is critical in deep learning.
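The variance argument can be checked empirically. The following is a small simulation sketch using only the standard library; the function name `simulate`, the trial count, and the seed are arbitrary choices for this illustration, and the printed variances are sample estimates that only approximate d_k and 1.

```python
import random
import statistics

def dot(q, k):
    return sum(qi * ki for qi, ki in zip(q, k))

def simulate(d_k, trials=5000, seed=0):
    """Empirical variance of q.k and of q.k / sqrt(d_k) for standard normal vectors."""
    rng = random.Random(seed)
    raw = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        raw.append(dot(q, k))
    scaled = [x / d_k ** 0.5 for x in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)

results = {d_k: simulate(d_k) for d_k in (4, 64)}
for d_k, (var_raw, var_scaled) in results.items():
    print(f"d_k={d_k:3d}  Var(q.k)={var_raw:7.2f}  Var(q.k/sqrt(d_k))={var_scaled:.2f}")
```

The unscaled variance grows roughly in proportion to d_k, while the scaled version stays near 1 regardless of dimension.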
Worked Numerical Examples
Example 1: TF-IDF Calculation
Problem: Given 3 documents, compute TF-IDF for the word "learning":
D1: "machine learning is great" (4 words)
D2: "deep learning and machine learning" (5 words, "learning" appears 2x)
D3: "great machines work well" (4 words)
Step 1: TF("learning", D1) = 1/4 = 0.25
TF("learning", D2) = 2/5 = 0.40
TF("learning", D3) = 0/4 = 0.00
Step 2: df("learning") = 2 (appears in D1 and D2)
N = 3
IDF("learning") = log₂(3/2) = log₂(1.5) = 0.585
Step 3: TF-IDF("learning", D1) = 0.25 × 0.585 = 0.146
TF-IDF("learning", D2) = 0.40 × 0.585 = 0.234
TF-IDF("learning", D3) = 0.00 × 0.585 = 0.000
→ "learning" is most important to D2 (highest TF-IDF score)
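The hand calculation above can be verified with a short script. This sketch assumes whitespace tokenization and a base-2 logarithm, as in the example; the helper names `tf` and `idf` are illustrative, and `idf` assumes the term occurs in at least one document.

```python
import math

docs = {
    "D1": "machine learning is great",
    "D2": "deep learning and machine learning",
    "D3": "great machines work well",
}

def tf(term, text):
    """Relative frequency of the term within one document."""
    words = text.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    """log2(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for text in corpus.values() if term in text.split())
    return math.log2(len(corpus) / df)

scores = {name: tf("learning", text) * idf("learning", docs) for name, text in docs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```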
Example 2: Bigram Probability
Corpus: "I like deep learning. I like machine learning."
Compute: P("learning" | "deep")
C("deep", "learning") = 1
C("deep") = 1
P("learning" | "deep") = C("deep", "learning") / C("deep") = 1/1 = 1.0
Compute: P("learning" | "machine")
C("machine", "learning") = 1
C("machine") = 1
P("learning" | "machine") = 1/1 = 1.0
Compute: P("deep" | "like")
C("like", "deep") = 1
C("like") = 2
P("deep" | "like") = 1/2 = 0.5
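The same bigram estimates can be computed by counting. A minimal sketch, assuming lowercase whitespace tokens and sentences split on periods so that no bigram spans a sentence boundary; `bigram_prob` is a hypothetical helper name.

```python
from collections import Counter

corpus = "I like deep learning. I like machine learning."

# One token list per sentence, so bigrams never cross a sentence boundary
sentences = [s.split() for s in corpus.lower().split(".") if s.strip()]
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))

def bigram_prob(word, prev):
    """Maximum-likelihood estimate P(word | prev) = C(prev, word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("learning", "deep"))
print(bigram_prob("learning", "machine"))
print(bigram_prob("deep", "like"))
```

Unseen bigrams get probability 0 under this estimate, which is why practical n-gram models add smoothing.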
Example 3: Cosine Similarity Between Word Vectors
word_king = [0.8, 0.6, 0.2]
word_queen = [0.7, 0.7, 0.3]
word_apple = [0.1, 0.2, 0.9]
cos(king, queen):
Dot product = 0.8×0.7 + 0.6×0.7 + 0.2×0.3 = 0.56 + 0.42 + 0.06 = 1.04
||king|| = √(0.64 + 0.36 + 0.04) = √1.04 = 1.020
||queen|| = √(0.49 + 0.49 + 0.09) = √1.07 = 1.034
cos = 1.04 / (1.020 × 1.034) = 1.04 / 1.055 = 0.986 → Very similar! ✓
cos(king, apple):
Dot product = 0.8×0.1 + 0.6×0.2 + 0.2×0.9 = 0.08 + 0.12 + 0.18 = 0.38
||apple|| = √(0.01 + 0.04 + 0.81) = √0.86 = 0.927
cos = 0.38 / (1.020 × 0.927) = 0.38 / 0.946 = 0.402 → Less similar ✓
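These two similarities can be checked in plain Python. The function name `cosine_similarity` is an illustrative choice; the vectors are the toy values from the example, not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

king, queen, apple = [0.8, 0.6, 0.2], [0.7, 0.7, 0.3], [0.1, 0.2, 0.9]
print(f"cos(king, queen) = {cosine_similarity(king, queen):.3f}")
print(f"cos(king, apple) = {cosine_similarity(king, apple):.3f}")
```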
Example 4: Perplexity Calculation
Test sentence: "the cat sat" (3 words)
Model probabilities:
P(the) = 0.1
P(cat | the) = 0.05
P(sat | cat) = 0.2
P(sentence) = 0.1 ร 0.05 ร 0.2 = 0.001
Perplexity = P(sentence)^{-1/N} = (0.001)^{-1/3} = (1000)^{1/3} = 10.0
Interpretation: The model is as uncertain as choosing randomly from 10 options.
Better model with P = 0.2 × 0.3 × 0.5 = 0.03:
PP = (0.03)^{-1/3} = (33.33)^{1/3} = 3.22 → Much better! ✓
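The perplexity values above can be reproduced as follows. This sketch computes the result in log space, which is the usual way to avoid underflow on long sequences; the function name `perplexity` is an assumption for this example.

```python
import math

def perplexity(probabilities):
    """exp(-(1/N) * sum(log p_i)), equivalent to P(sentence)^(-1/N)."""
    n = len(probabilities)
    return math.exp(-sum(math.log(p) for p in probabilities) / n)

print(f"{perplexity([0.1, 0.05, 0.2]):.2f}")
print(f"{perplexity([0.2, 0.3, 0.5]):.2f}")
```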
In competitive exams, TF-IDF and perplexity calculations are common. Remember: (1) the IDF log base varies (usually 2 or 10; check the question), (2) perplexity is the geometric mean of inverse probabilities, (3) lower perplexity = better model.
Visual Diagrams
8.1 NLP Pipeline Architecture
8.2 Word2Vec: CBOW vs Skip-gram
8.3 Transformer Self-Attention
8.4 RAG Architecture
Flowcharts
9.1 Sentiment Analysis Pipeline
9.2 NER Decision Flowchart
9.3 LLM Fine-tuning Decision Tree
Python Implementation
10.1 Text Preprocessing with NLTK & spaCy
import nltk
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
# Download required NLTK data (newer NLTK releases load the tokenizer from 'punkt_tab')
nltk.download(['punkt', 'punkt_tab', 'stopwords', 'wordnet', 'averaged_perceptron_tagger'])
class TextPreprocessor:
    """Complete text preprocessing pipeline."""
    def __init__(self, language='english'):
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words(language))
    def clean_text(self, text):
        """Remove HTML, URLs, special characters."""
        text = re.sub(r'<[^>]+>', '', text)  # HTML tags
        text = re.sub(r'http\S+|www\.\S+', '', text)  # URLs
        text = re.sub(r'[^a-zA-Z\s]', '', text)  # Non-alpha (also drops digits and non-Latin scripts)
        text = text.lower().strip()
        return text
    def tokenize(self, text):
        """Word tokenization."""
        return word_tokenize(text)
    def remove_stopwords(self, tokens):
        """Remove common stop words."""
        return [t for t in tokens if t not in self.stop_words]
    def stem(self, tokens):
        """Apply Porter stemming."""
        return [self.stemmer.stem(t) for t in tokens]
    def lemmatize(self, tokens):
        """Apply WordNet lemmatization (every token is treated as a noun by default)."""
        return [self.lemmatizer.lemmatize(t) for t in tokens]
    def preprocess(self, text, use_lemma=True):
        """Full preprocessing pipeline."""
        text = self.clean_text(text)
        tokens = self.tokenize(text)
        tokens = self.remove_stopwords(tokens)
        tokens = self.lemmatize(tokens) if use_lemma else self.stem(tokens)
        return tokens
# Demo
pp = TextPreprocessor()
sample = "The quick brown foxes are jumping over the lazy dogs! Visit https://nlp.com"
print("Original:", sample)
print("Processed:", pp.preprocess(sample))
# Output: ['quick', 'brown', 'fox', 'jumping', 'lazy', 'dog', 'visit']
# "jumping" is unchanged because the lemmatizer is called without a verb POS tag;
# "visit" survives because only the URL itself is removed.
10.2 spaCy NLP Pipeline
import spacy
# Load English model (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
text = "Google CEO Sundar Pichai announced a $10 billion investment in India in March 2025."
doc = nlp(text)
# Tokenization
print("=== Tokens ===")
for token in doc:
    print(f" {token.text:15s} | POS: {token.pos_:6s} | Lemma: {token.lemma_:15s} | Stop: {token.is_stop}")
# Named Entity Recognition
print("\n=== Named Entities ===")
for ent in doc.ents:
    print(f" {ent.text:25s} | Label: {ent.label_:10s} | Explanation: {spacy.explain(ent.label_)}")
# Dependency Parsing
print("\n=== Dependencies ===")
for token in doc:
    print(f" {token.text:15s} --{token.dep_:12s}--> {token.head.text}")
# Typical output (exact spans and labels can vary with the model version):
# === Named Entities ===
# Google | Label: ORG | Explanation: Companies, agencies...
# Sundar Pichai | Label: PERSON | Explanation: People, including fictional
# $10 billion | Label: MONEY | Explanation: Monetary values
# India | Label: GPE | Explanation: Countries, cities, states
# March 2025 | Label: DATE | Explanation: Absolute or relative dates
10.3 TF-IDF from Scratch
import numpy as np
from collections import Counter
import math
class TFIDFVectorizer:
    """TF-IDF implementation from scratch."""
    def __init__(self):
        self.vocabulary = {}
        self.idf = {}
    def fit(self, documents):
        """Build vocabulary and compute IDF."""
        # Build vocabulary
        all_words = set()
        for doc in documents:
            all_words.update(doc.lower().split())
        self.vocabulary = {w: i for i, w in enumerate(sorted(all_words))}
        # Compute IDF (natural log; the worked example above uses log base 2)
        N = len(documents)
        for word in self.vocabulary:
            df = sum(1 for doc in documents if word in doc.lower().split())
            self.idf[word] = math.log(N / df) if df > 0 else 0
        return self
    def transform(self, documents):
        """Compute TF-IDF matrix."""
        matrix = np.zeros((len(documents), len(self.vocabulary)))
        for i, doc in enumerate(documents):
            words = doc.lower().split()
            word_counts = Counter(words)
            total_words = len(words)
            for word, count in word_counts.items():
                if word in self.vocabulary:
                    tf = count / total_words
                    matrix[i][self.vocabulary[word]] = tf * self.idf[word]
        return matrix
    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)
# Demo
docs = [
    "machine learning is great for prediction",
    "deep learning uses neural networks",
    "machine learning and deep learning are related"
]
vectorizer = TFIDFVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)
print("Vocabulary:", vectorizer.vocabulary)
print("\nTF-IDF Matrix shape:", tfidf_matrix.shape)
print("\nTop words per document:")
for i, doc in enumerate(docs):
    scores = tfidf_matrix[i]
    top_indices = scores.argsort()[-3:][::-1]
    # Vocabulary keys were inserted in index order, so position j maps to column j
    words = list(vectorizer.vocabulary.keys())
    top_words = [(words[j], scores[j]) for j in top_indices if scores[j] > 0]
    print(f" D{i+1}: {top_words}")
10.4 Word2Vec Training with Gensim
from gensim.models import Word2Vec
# Sample corpus (in practice, use a large corpus)
sentences = [
    ["machine", "learning", "is", "a", "subset", "of", "artificial", "intelligence"],
    ["deep", "learning", "uses", "neural", "networks"],
    ["natural", "language", "processing", "deals", "with", "text"],
    ["word", "embeddings", "capture", "semantic", "meaning"],
    ["transformers", "revolutionized", "natural", "language", "processing"],
    ["bert", "is", "a", "bidirectional", "transformer", "model"],
    ["gpt", "is", "an", "autoregressive", "language", "model"],
    ["attention", "mechanism", "is", "key", "to", "transformers"],
    ["recurrent", "neural", "networks", "process", "sequences"],
    ["convolutional", "networks", "work", "well", "for", "images"],
]
# Train Word2Vec (Skip-gram)
model_sg = Word2Vec(
    sentences=sentences,
    vector_size=100,  # Embedding dimension
    window=5,         # Context window size
    min_count=1,      # Minimum word frequency
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    negative=5,       # Negative samples
    epochs=100,       # Training epochs
    workers=4         # Parallel threads
)
# Train Word2Vec (CBOW)
model_cbow = Word2Vec(
    sentences=sentences,
    vector_size=100, window=5, min_count=1,
    sg=0, epochs=100, workers=4
)
# Find similar words
print("Similar to 'learning' (Skip-gram):")
for word, score in model_sg.wv.most_similar("learning", topn=5):
    print(f" {word}: {score:.4f}")
# Inspect a word vector
# Note: vector arithmetic and similarities need a large corpus for meaningful results
print("\nVector for 'learning':", model_sg.wv["learning"][:5], "...")
# Save and load
model_sg.save("word2vec_skipgram.model")
# loaded_model = Word2Vec.load("word2vec_skipgram.model")
10.5 Sentiment Analysis with VADER & TextBlob
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import nltk
nltk.download('vader_lexicon')
# VADER Sentiment Analysis (rule-based, great for social media)
sia = SentimentIntensityAnalyzer()
texts = [
    "This product is absolutely amazing! Best purchase ever!",
    "Terrible customer service. Never buying again.",
    "The movie was okay, nothing special.",
    "AI4Bharat's IndicNLP is revolutionizing Indian language processing!",
    "The new UPI interface is confusing and slow.",
]
print("=== VADER Sentiment Analysis ===")
for text in texts:
    scores = sia.polarity_scores(text)
    # Conventional VADER thresholds on the compound score
    if scores['compound'] >= 0.05:
        sentiment = "Positive"
    elif scores['compound'] <= -0.05:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    print(f"\n Text: {text[:60]}...")
    print(f" Scores: pos={scores['pos']:.3f} neu={scores['neu']:.3f} neg={scores['neg']:.3f}")
    print(f" Compound: {scores['compound']:.3f} → {sentiment}")
# TextBlob Sentiment Analysis
print("\n\n=== TextBlob Sentiment Analysis ===")
for text in texts:
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity  # -1 to 1
    subjectivity = blob.sentiment.subjectivity  # 0 to 1
    print(f"\n Text: {text[:60]}...")
    print(f" Polarity: {polarity:.3f} Subjectivity: {subjectivity:.3f}")
Modify the TextPreprocessor class to handle Hindi text: add Devanagari tokenization, Hindi stop words (से, का, की, के, में, है, और, को), and integration with the IndicNLP library. Test with sample Hindi movie reviews.
TensorFlow Implementation
11.1 BERT Text Classification
import tensorflow as tf
# The TF* classes below need a transformers release that includes TensorFlow support
from transformers import BertTokenizer, TFBertForSequenceClassification
import numpy as np
# ========================================
# BERT Fine-tuning for Sentiment Classification
# ========================================
# Load pre-trained BERT tokenizer and model
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3  # Positive, Negative, Neutral
)
# Sample training data (far too small for real training; for illustration only)
train_texts = [
    "This movie is absolutely fantastic and thrilling",
    "Terrible acting and boring storyline",
    "The film was decent, nothing remarkable",
    "Best performance I've ever seen on screen",
    "Complete waste of time and money",
    "It was an average movie with good songs",
]
train_labels = [2, 0, 1, 2, 0, 1]  # 0=Neg, 1=Neutral, 2=Pos
# Tokenize inputs
def encode_texts(texts, max_length=128):
    return tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='tf'
    )
train_encodings = encode_texts(train_texts)
# Create TF dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    tf.constant(train_labels)
)).shuffle(100).batch(2)
# Compile model
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
# Train
print("Training BERT classifier...")
history = model.fit(train_dataset, epochs=3, verbose=1)
# Predict on new text
test_texts = ["This is a wonderful experience!", "I hated every minute of it"]
test_encodings = encode_texts(test_texts)
predictions = model.predict(dict(test_encodings))
predicted_labels = tf.argmax(predictions.logits, axis=1).numpy()
label_map = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}
for text, label in zip(test_texts, predicted_labels):
    print(f" '{text}' → {label_map[label]}")
11.2 Text Summarization with T5
from transformers import T5Tokenizer, TFT5ForConditionalGeneration
# Load T5 model for summarization
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = TFT5ForConditionalGeneration.from_pretrained(model_name)
def summarize(text, max_input_length=512, max_output_length=150):
    """Generate abstractive summary using T5."""
    # T5 expects "summarize: " prefix
    input_text = "summarize: " + text
    # Tokenize
    inputs = tokenizer(
        input_text,
        max_length=max_input_length,
        truncation=True,
        return_tensors="tf"
    )
    # Generate summary
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=max_output_length,
        num_beams=4,             # Beam search
        length_penalty=2.0,      # Favor longer summaries
        early_stopping=True,
        no_repeat_ngram_size=3   # Avoid repetition
    )
    # Decode
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary
# Test with a long article
article = """
India's digital transformation has been remarkable. The Unified Payments Interface (UPI)
processed over 13 billion transactions worth more than Rs 20 lakh crore in a single month
in 2024. This achievement makes India the world leader in real-time digital payments.
The success of UPI can be attributed to several factors: government support through the
Digital India initiative, widespread smartphone adoption, affordable internet connectivity
through Jio's entry, and the collaborative approach of NPCI in building an open platform.
The technology has now been adopted by countries like Singapore, UAE, and France.
India is also making significant strides in AI and NLP. Organizations like AI4Bharat
are developing language models for 22 Indian languages, enabling millions of non-English
speakers to access digital services in their native languages.
"""
summary = summarize(article)
print("=== Original Article ===")
print(article[:200], "...")
print("\n=== T5 Summary ===")
print(summary)
11.3 LSTM Text Generator
import tensorflow as tf
import numpy as np
class TextGenerator:
    """Simple character-level LSTM text generator."""
    def __init__(self, corpus, seq_length=40):
        self.seq_length = seq_length
        self.corpus = corpus.lower()
        # Create character vocabulary
        self.chars = sorted(set(self.corpus))
        self.char_to_idx = {c: i for i, c in enumerate(self.chars)}
        self.idx_to_char = {i: c for c, i in self.char_to_idx.items()}
        self.vocab_size = len(self.chars)
        print(f"Corpus length: {len(self.corpus)} chars")
        print(f"Vocabulary size: {self.vocab_size} unique chars")
    def prepare_data(self):
        """Create input-output sequences."""
        X, y = [], []
        for i in range(len(self.corpus) - self.seq_length):
            seq_in = self.corpus[i:i + self.seq_length]
            seq_out = self.corpus[i + self.seq_length]
            X.append([self.char_to_idx[c] for c in seq_in])
            y.append(self.char_to_idx[seq_out])
        X = np.array(X) / self.vocab_size  # Normalize
        X = X.reshape(len(X), self.seq_length, 1)
        y = tf.keras.utils.to_categorical(y, self.vocab_size)
        return X, y
    def build_model(self):
        """Build LSTM model."""
        self.model = tf.keras.Sequential([
            tf.keras.layers.LSTM(256, input_shape=(self.seq_length, 1), return_sequences=True),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.LSTM(256),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(self.vocab_size, activation='softmax')
        ])
        self.model.compile(loss='categorical_crossentropy', optimizer='adam')
        self.model.summary()
    def generate(self, seed_text, length=200, temperature=0.8):
        """Generate text from seed (every seed character must occur in the corpus)."""
        generated = seed_text.lower()
        pattern = [self.char_to_idx[c] for c in generated[-self.seq_length:]]
        # Left-pad short seeds so the input always has seq_length time steps
        pattern = [0] * (self.seq_length - len(pattern)) + pattern
        for _ in range(length):
            x = np.array(pattern) / self.vocab_size
            x = x.reshape(1, len(pattern), 1)
            pred = self.model.predict(x, verbose=0)[0]
            # Temperature sampling: lower values make the output more conservative
            pred = np.log(pred + 1e-8) / temperature
            pred = np.exp(pred) / np.sum(np.exp(pred))
            idx = np.random.choice(len(pred), p=pred)
            generated += self.idx_to_char[idx]
            pattern.append(idx)
            pattern = pattern[1:]
        return generated
# Usage:
# gen = TextGenerator(large_text_corpus)
# X, y = gen.prepare_data()
# gen.build_model()
# gen.model.fit(X, y, batch_size=128, epochs=50)
# print(gen.generate("natural language", length=300))
In production, don't train BERT from scratch; fine-tune a pre-trained model instead. For Indian languages, use IndicBERT (AI4Bharat) or MuRIL (Google) as base models. These are pre-trained on multilingual Indian corpora and significantly outperform vanilla mBERT on Indian language tasks.
Scikit-Learn Implementation
12.1 Text Classification Pipeline (Spam Detection)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

# Sample SMS dataset (in practice, use the full SMS Spam Collection dataset)
messages = [
    ("Free entry to win a brand new car! Text WIN to 80085", "spam"),
    ("Hey, are you free for dinner tonight?", "ham"),
    ("URGENT! You have won £1000. Call now!", "spam"),
    ("Meeting rescheduled to 3pm tomorrow", "ham"),
    ("Congratulations! Claim your prize money NOW!!!", "spam"),
    ("Can you pick up groceries on the way home?", "ham"),
    ("You've been selected for a free iPhone! Click here", "spam"),
    ("Don't forget mom's birthday next week", "ham"),
    ("WINNER!! You won a vacation trip! Reply YES", "spam"),
    ("Project deadline extended to Friday", "ham"),
    ("Get cheap loans at lowest interest rates!", "spam"),
    ("See you at the park at 5pm", "ham"),
] * 10  # Repeat for more training data
# Note: repeating the list puts identical messages in both the train and test
# splits, so the scores printed below are optimistic. Use real data to evaluate.
texts, labels = map(list, zip(*messages))

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Method 1: Simple Pipeline (TF-IDF + Naive Bayes)
pipeline_nb = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=5000,
        ngram_range=(1, 2),  # Unigrams and bigrams
        stop_words='english'
    )),
    ('clf', MultinomialNB(alpha=0.1))  # Additive (Lidstone) smoothing; alpha=1 is Laplace
])

# Method 2: TF-IDF + SVM
pipeline_svm = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('clf', LinearSVC(C=1.0, max_iter=1000))
])

# Train and evaluate
for name, pipeline in [("Naive Bayes", pipeline_nb), ("SVM", pipeline_svm)]:
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print(f"\n=== {name} ===")
    print(classification_report(y_test, y_pred))
    # Cross-validation (refits a fresh copy of the pipeline on each fold)
    scores = cross_val_score(pipeline, texts, labels, cv=5, scoring='f1_macro')
    print(f"  Cross-val F1: {scores.mean():.3f} ± {scores.std():.3f}")

# Predict on new messages
new_messages = [
    "Win a free laptop! Click now!",
    "Can we reschedule our meeting to Monday?"
]
predictions = pipeline_svm.predict(new_messages)
for msg, pred in zip(new_messages, predictions):
    print(f"  '{msg[:50]}...' -> {pred}")
12.2 Topic Modeling with LDA
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents about different topics
documents = [
    "Machine learning algorithms improve with more data and computing power",
    "Deep neural networks have revolutionized computer vision and NLP",
    "Indian cricket team won the World Cup with brilliant batting display",
    "Virat Kohli scored a century in the test match against Australia",
    "Stock market crashed due to global economic concerns and inflation",
    "Sensex and Nifty recovered after RBI announced rate cut policy",
    "New electric vehicles from Tata Motors are gaining market share",
    "Tesla's self-driving technology uses deep learning and sensors",
    "Artificial intelligence is transforming healthcare diagnostics",
    "ISRO launched new satellite for weather prediction using ML models",
]

# Vectorize (LDA works on raw term counts, not TF-IDF weights)
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(documents)

# Fit LDA
n_topics = 3
lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    max_iter=20,
    learning_method='online'
)
lda.fit(X)

# Display topics
feature_names = vectorizer.get_feature_names_out()
print("=== Discovered Topics ===")
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-8:][::-1]]
    print(f"  Topic {topic_idx + 1}: {', '.join(top_words)}")

# Assign topics to documents
doc_topics = lda.transform(X)
for i, doc in enumerate(documents[:5]):
    dominant_topic = doc_topics[i].argmax()
    print(f"  Doc {i+1} -> Topic {dominant_topic + 1} ({doc[:50]}...)")
Indian Case Studies
Case Study 1: AI4Bharat IndicNLP - Democratizing Indian Language AI
Challenge: India has 22 scheduled languages, 121 languages with at least 10,000 speakers each (Census 2011), and 12+ scripts. Most NLP models are English-centric, leaving 1 billion+ Indians underserved.
Solution: AI4Bharat (IIT Madras) developed:
- IndicBERT: Multilingual model with an ALBERT-style architecture, pre-trained on 12 major Indian languages
- IndicTrans2: State-of-the-art translation model covering all 22 scheduled languages, trained on hundreds of millions of parallel sentence pairs
- IndicNLP Library: Tokenization, transliteration, and embeddings for Indian languages
- Bhasha-Abhijnaanam: Language identification for 22 Indian languages
Impact: Used by Google Translate for Indian language improvements, NPTEL for lecture translation, and multiple government portals for multilingual access.
Technical Innovation: Handled challenges like code-mixing (Hindi-English: "Yeh movie bahut amazing thi!"), agglutinative morphology (Tamil, Kannada), and free word order (Hindi, Sanskrit).
Case Study 2: Flipkart - Product Review Analysis at Scale
Problem: Processing 10M+ reviews across 150M products in multiple Indian languages. Reviews often contain code-mixed text, transliterated Hindi/Tamil, and regional expressions.
Solution:
- Custom NER model to extract product attributes ("battery life", "camera quality") from reviews
- Aspect-based sentiment analysis: positive about price but negative about delivery
- Multilingual sentiment model handling Hindi, Tamil, Telugu, Bengali transliterations
- Fake review detection using linguistic patterns and behavioral signals
Results: 40% improvement in product recommendation quality, 25% reduction in customer complaints about wrong products.
Case Study 3: Aadhaar - Multilingual Document Processing
Challenge: Processing identity documents (birth certificates, bank statements) in 22+ languages for Aadhaar verification across India.
Solution: OCR + NER pipeline handling Devanagari, Tamil, Bengali, Telugu, Gujarati, and other scripts. Achieved 97% accuracy in name extraction across 11 languages using a fine-tuned mBERT model.
Case Study 4: Indian Judiciary - SUPACE (e-Courts)
Problem: 44 million pending cases in Indian courts. Judges need to review vast amounts of legal text to find relevant precedents.
Solution: NLP system for legal document summarization, case similarity finding, and relevant statute identification. Handles legal text in English, Hindi, and regional languages.
India-First LLMs: Krutrim (Ola's LLM supporting 22 Indian languages), Sarvam AI (open-source Indian LLMs), and Hanooman (from the BharatGPT group led by IIT Bombay) represent India's push for sovereign AI. These address challenges specific to India: Hindi-English code-mixing in 60% of urban social media posts, and the need to support both Devanagari and Roman scripts for the same language.
Global Case Studies
Case Study 1: Google - BERT & Search Revolution
Impact: In October 2019, Google integrated BERT into search, calling it the biggest leap forward in the past five years. BERT helps Search understand the nuance of queries: "can you get medicine for someone pharmacy" is now correctly read as a question about picking up a prescription for another person.
Scale: BERT is applied to nearly every English search query (about 3.5 billion searches per day). Google also developed MUM (Multitask Unified Model), which it describes as 1000× more powerful than BERT and which is trained across 75+ languages.
Case Study 2: OpenAI - From GPT to ChatGPT
Evolution: GPT-1 (117M params, 2018) → GPT-2 (1.5B, 2019) → GPT-3 (175B, 2020) → GPT-4 (rumored 1.7T MoE, 2023) → GPT-4o (2024).
Innovation: Reinforcement Learning from Human Feedback (RLHF) transformed GPT-3 from a text predictor into ChatGPT, an instruction-following assistant. The key was combining pre-training (massive text data) with alignment (human preferences).
Case Study 3: Netflix - Personalized Content Description
NLP powers Netflix's content recommendation beyond collaborative filtering alone. It analyzes plot summaries, reviews, subtitles, and social media chatter. The system also personalizes how a title is presented: the same show may have different thumbnails and descriptions for different users based on their viewing history.
Case Study 4: Amazon Alexa - Conversational AI at Scale
Alexa processes 50+ billion voice interactions yearly across 100+ million devices. The pipeline: ASR (speech→text) → NLU (intent + entities) → Dialog Management → NLG (response text) → TTS (text→speech). It handles 8+ languages with accent variations.
Case Study 5: Tesla - NLP for Autonomous Driving Documentation
Tesla uses NLP to analyze driving incident reports, owner manual queries, and regulatory documents across 40+ markets. NER extracts location, vehicle model, weather conditions from crash reports to improve Autopilot safety algorithms.
Startup Applications
| Startup | NLP Application | Technology | Impact |
|---|---|---|---|
| Sarvam AI (India) | Indian language LLMs | Custom transformer, IndicNLP | Open-source models for 10+ Indian languages |
| Yellow.ai (India) | Enterprise chatbots | Multi-LLM orchestration | Serves 1000+ enterprises, 35+ languages |
| Haptik (India) | Conversational AI | Intent classification, NER | Acquired by Jio, 100M+ users |
| Grammarly (Global) | Writing assistant | BERT + custom models | 30M daily active users |
| Jasper AI (Global) | Content generation | GPT API + fine-tuning | $1.5B valuation for marketing AI |
| Hugging Face (Global) | NLP model hub | Transformers library | 500K+ models, 100K+ datasets |
| Cohere (Global) | Enterprise NLP API | Custom LLMs, RAG | Multilingual embeddings, 100+ languages |
| Reverie Language Tech (India) | Indian language platform | NMT, TTS, STT | Powers UMANG, DigiLocker in 12 languages |
NLP Startup Roles in India (2025): Prompt Engineer (₹8-20 LPA), NLP Data Scientist (₹12-35 LPA), LLM Fine-tuning Engineer (₹18-50 LPA), Applied ML Scientist (₹25-80 LPA). Hot skills: RAG, LoRA/QLoRA, multilingual models, vector databases, evaluation frameworks.
Government Applications
16.1 Bhashini (National Language Translation Mission)
India's ambitious initiative to break language barriers in digital services. Bhashini provides:
- Real-time translation across 22 scheduled languages
- Speech-to-speech translation for non-literate users
- API access for government apps and services
- Crowdsourced data collection via the Bhashini app
16.2 e-Courts Project (India)
NLP-powered legal document processing: automatic case summarization, judgment prediction research, and multilingual legal text analysis helping manage India's judicial backlog of 44M+ cases.
16.3 MyGov Chatbot
India's citizen engagement chatbot handles queries about government schemes (PM-KISAN, Ayushman Bharat, MGNREGA) in Hindi and English, processing 10M+ queries monthly.
16.4 EU AI Act & NLP Regulation
The EU classifies emotion recognition and biometric categorization NLP systems as "high-risk AI," requiring transparency, human oversight, and non-discrimination testing, and setting a global regulatory precedent.
16.5 US Intelligence Community
DARPA's KAIROS program uses NLP to extract complex events from multilingual text, building knowledge graphs from news and intelligence reports.
Industry Applications
| Industry | NLP Application | Business Impact |
|---|---|---|
| Healthcare | Clinical note extraction, medical chatbots, drug interaction NER | 30% faster diagnosis, reduced documentation time |
| Finance | Sentiment-driven trading, compliance monitoring, fraud detection | News sentiment predicts stock moves with 58% accuracy |
| Legal | Contract analysis, e-discovery, case law search | 90% reduction in document review time |
| E-commerce | Product search, review mining, chatbot support | 25% increase in search relevance |
| Education | Essay grading, question generation, tutoring | Personalized learning at scale |
| Media | Content recommendation, headline generation, fake news detection | 40% increase in engagement |
| Manufacturing | Maintenance report analysis, safety document mining | Predictive maintenance from text logs |
| Agriculture | Farmer helpline chatbots (Kisan Call Centre) | 24/7 crop advisory in regional languages |
RAG is the enterprise standard (2025): Over 80% of enterprise LLM deployments use Retrieval-Augmented Generation rather than pure generation. This reduces hallucinations, enables domain-specific answers, and provides auditable source citations, which is critical for regulated industries like finance and healthcare.
Mini Projects
Mini Project 1: Hindi Sentiment Analyzer
Objective: Build a sentiment classifier for Hindi movie reviews using IndicBERT.
Dataset
Use IIIT-H Hindi Sentiment Dataset or scrape reviews from IMDB Hindi / BookMyShow.
Implementation
# Mini Project 1: Hindi Sentiment Analyzer
# ==========================================
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import Dataset
import pandas as pd
import numpy as np

# --- Step 1: Prepare Hindi Dataset ---
hindi_reviews = {
    'text': [
        "यह फिल्म बहुत अच्छी है। कहानी दिल को छू लेती है।",  # "This film is very good. The story touches the heart."
        "बेकार फिल्म। समय की बर्बादी।",  # "Useless film. A waste of time."
        "अभिनय शानदार है, लेकिन कहानी कमजोर है।",  # "The acting is superb, but the story is weak."
        "इस फिल्म ने मेरा दिल जीत लिया! बहुत खूबसूरत!",  # "This film won my heart! Very beautiful!"
        "बहुत बोरिंग फिल्म। एक्टिंग भी ठीक नहीं थी।",  # "Very boring film. The acting wasn't good either."
        "Bahut acchi movie thi! Must watch!",  # Code-mixed
        "Kya bakwas film hai, avoid karo",  # Transliterated Hindi
        "Direction kamaal ka hai, screenplay tight hai",
        "मस्त picture है भाई, interval के बाद और मज़ा आया",  # Mixed script
        "Waste of money, mat jao dekhne",
    ],
    'label': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # 1=Positive, 0=Negative
}
df = pd.DataFrame(hindi_reviews)
dataset = Dataset.from_pandas(df)

# Split dataset (10 toy examples only; use a real dataset for meaningful results)
dataset = dataset.train_test_split(test_size=0.2, seed=42)

# --- Step 2: Load IndicBERT / MuRIL ---
model_name = "ai4bharat/IndicBERTv2-MLM-only"  # Or "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)

# --- Step 3: Tokenize ---
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# --- Step 4: Training ---
training_args = TrainingArguments(
    output_dir='./hindi_sentiment_model',
    num_train_epochs=5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy='epoch',  # Named evaluation_strategy in older transformers releases
    save_strategy='epoch',
    load_best_model_at_end=True,
    logging_steps=10,
)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=-1)
    accuracy = (preds == labels).mean()
    return {'accuracy': accuracy}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()

# --- Step 5: Inference ---
def predict_sentiment(text):
    model.eval()  # Disable dropout for inference
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    # After training the model may live on the GPU; move the inputs to the same device
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    label = torch.argmax(probs, dim=-1).item()
    return "सकारात्मक (Positive)" if label == 1 else "नकारात्मक (Negative)"

# Test
test_reviews = [
    "यह फिल्म कमाल की है! हर किसी को देखनी चाहिए!",  # "This film is amazing! Everyone should watch it!"
    "Bahut hi boring movie, mat jaao",
    "Acting acchi thi but story weak hai",
]
for review in test_reviews:
    result = predict_sentiment(review)
    print(f"  Review: {review}")
    print(f"  Sentiment: {result}\n")
Expected Outcomes
- Handle both Devanagari and Romanized Hindi (code-mixing)
- Achieve 80%+ accuracy on Hindi sentiment classification
- Deploy as a FastAPI endpoint with a simple web interface
Mini Project 2: News Article Summarizer with RAG
Objective: Build a news summarization and Q&A system using extractive + abstractive methods with RAG.
Implementation
# Mini Project 2: News Summarizer with RAG
# ==========================================
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# === PART A: Extractive Summarization (TextRank) ===
class TextRankSummarizer:
    """Extractive summarization using TextRank algorithm."""

    def __init__(self):
        nltk.download('punkt', quiet=True)
        nltk.download('punkt_tab', quiet=True)  # Needed by recent NLTK releases
        nltk.download('stopwords', quiet=True)
        self.stop_words = set(stopwords.words('english'))

    def _sentence_similarity(self, sent1, sent2):
        """Cosine similarity between binary bag-of-words vectors of two sentences."""
        words1 = [w.lower() for w in word_tokenize(sent1) if w.isalpha() and w.lower() not in self.stop_words]
        words2 = [w.lower() for w in word_tokenize(sent2) if w.isalpha() and w.lower() not in self.stop_words]
        all_words = list(set(words1 + words2))
        vec1 = [1 if w in words1 else 0 for w in all_words]
        vec2 = [1 if w in words2 else 0 for w in all_words]
        dot_product = sum(a * b for a, b in zip(vec1, vec2))
        norm1 = sum(a**2 for a in vec1) ** 0.5
        norm2 = sum(b**2 for b in vec2) ** 0.5
        return dot_product / (norm1 * norm2 + 1e-8)  # Epsilon guards against empty sentences

    def summarize(self, text, num_sentences=3):
        """Generate extractive summary."""
        sentences = sent_tokenize(text)
        if len(sentences) <= num_sentences:
            return text
        # Build similarity matrix
        n = len(sentences)
        sim_matrix = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if i != j:
                    sim_matrix[i][j] = self._sentence_similarity(
                        sentences[i], sentences[j]
                    )
        # PageRank-style scoring
        scores = np.ones(n) / n
        damping = 0.85
        for _ in range(50):  # Fixed number of iterations (no convergence check)
            new_scores = np.ones(n) * (1 - damping) / n
            for i in range(n):
                for j in range(n):
                    if i != j and sim_matrix[j].sum() > 0:
                        new_scores[i] += damping * sim_matrix[j][i] / sim_matrix[j].sum() * scores[j]
            scores = new_scores
        # Select top sentences (maintain original order)
        ranked_indices = sorted(range(n), key=lambda i: scores[i], reverse=True)
        top_indices = sorted(ranked_indices[:num_sentences])
        summary = ' '.join([sentences[i] for i in top_indices])
        return summary


# === PART B: RAG-based Q&A System ===
class SimpleRAG:
    """Simple RAG system using TF-IDF retrieval + generation."""

    def __init__(self):
        self.documents = []
        self.vectorizer = None
        self.embeddings = None

    def add_documents(self, docs):
        """Add documents to the knowledge base."""
        self.documents = docs
        # Simple TF-IDF-based embeddings
        self.vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
        self.embeddings = self.vectorizer.fit_transform(docs)
        print(f"Added {len(docs)} documents to knowledge base.")

    def retrieve(self, query, top_k=3):
        """Retrieve most relevant documents."""
        query_vec = self.vectorizer.transform([query])
        similarities = cosine_similarity(query_vec, self.embeddings)[0]
        top_indices = similarities.argsort()[-top_k:][::-1]
        results = []
        for idx in top_indices:
            results.append({
                'document': self.documents[idx],
                'score': similarities[idx],
                'index': idx
            })
        return results

    def answer(self, query, top_k=3):
        """Retrieve relevant docs and generate answer."""
        retrieved = self.retrieve(query, top_k)
        # In production, pass to LLM. Here, return relevant excerpts.
        context = "\n".join([r['document'][:200] for r in retrieved])
        print(f"\nQuery: {query}")
        print(f"\nRetrieved Context (top-{top_k}):")
        for i, r in enumerate(retrieved):
            print(f"  [{i+1}] (score: {r['score']:.3f}) {r['document'][:100]}...")
        # In production, you'd call:
        # prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
        # answer = llm.generate(prompt)
        return retrieved


# === Demo ===
# Extractive summarization
summarizer = TextRankSummarizer()
article = """
India's space program has achieved remarkable milestones in recent years.
The Indian Space Research Organisation (ISRO) successfully landed Chandrayaan-3
on the lunar south pole in August 2023, making India the fourth country to
land on the Moon and the first to reach the south pole region. The mission cost
just $75 million, a fraction of NASA's comparable missions. ISRO's Chairman
S. Somanath credited the team's innovative engineering and frugal approach.
The success has boosted India's commercial space sector, with startups like
Skyroot Aerospace and Agnikul Cosmos developing reusable rockets. India now
aims to send astronauts to space through the Gaganyaan mission by 2025 and
establish a space station by 2035. The space economy is projected to reach
$44 billion by 2033, creating thousands of high-tech jobs across the country.
"""
print("=== Extractive Summary (TextRank) ===")
summary = summarizer.summarize(article, num_sentences=3)
print(summary)

# RAG demo
print("\n\n=== RAG Q&A System ===")
rag = SimpleRAG()
rag.add_documents([
    "Chandrayaan-3 landed on lunar south pole in August 2023. It cost $75 million.",
    "ISRO plans Gaganyaan mission to send Indian astronauts to space by 2025.",
    "India's space economy is projected to reach $44 billion by 2033.",
    "Skyroot Aerospace launched India's first private rocket Vikram-S in 2022.",
    "ISRO's PSLV has launched over 300 foreign satellites commercially.",
])
rag.answer("How much did the Moon mission cost?")
rag.answer("What are India's future space plans?")
Enhancements
- Replace TF-IDF with sentence-transformers for better retrieval
- Integrate with ChromaDB or FAISS for scalable vector search
- Connect to an LLM (GPT-4, Llama) for abstractive answer generation
- Add Hindi news support using IndicTrans for translation
Mini Project 3: LLM Fine-tuning with LoRA
Objective: Fine-tune a language model for Indian legal document Q&A using LoRA (Low-Rank Adaptation).
# Mini Project 3: LoRA Fine-tuning
# ==================================
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# --- Concept: LoRA (Low-Rank Adaptation) ---
# Instead of updating all model parameters (billions),
# LoRA adds small trainable low-rank matrices to attention layers.
#
# Original: W (d × d), frozen
# LoRA:     W + ΔW = W + A × B   where A (d × r), B (r × d), r << d
#
# For r=8, d=4096: LoRA trains 65K params vs 16M (0.4%!)

# Step 1: Load base model
model_name = "microsoft/phi-2"  # Or "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Step 2: Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # Rank of the low-rank matrices
    lora_alpha=32,      # Scaling factor
    lora_dropout=0.05,  # Dropout for regularization
    target_modules=[    # Which layers to adapt (module names are model-specific)
        "q_proj", "k_proj", "v_proj", "dense"
    ],
    bias="none",
)

# Step 3: Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. The exact numbers depend on the
# base model and target modules; the trainable share stays well under 1% here.

# Step 4: Prepare legal Q&A dataset
legal_qa = [
    {
        "instruction": "What is Section 302 of the Indian Penal Code?",
        "response": "Section 302 IPC deals with punishment for murder. Whoever commits murder shall be punished with death or imprisonment for life, and shall also be liable to fine."
    },
    {
        "instruction": "Explain the Right to Education under Indian Constitution.",
        "response": "Article 21A, inserted by the 86th Constitutional Amendment Act 2002, makes education a fundamental right for children aged 6-14 years. The Right of Children to Free and Compulsory Education Act, 2009, provides the legal framework."
    },
    # ... add more training examples
]

# Format for training
def format_prompt(sample):
    return (
        f"### Instruction: {sample['instruction']}\n"
        f"### Response: {sample['response']}"
    )

# Step 5: QLoRA variant (4-bit quantization)
# from transformers import BitsAndBytesConfig
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_use_double_quant=True,
# )
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, quantization_config=bnb_config
# )

print("LoRA model ready for training!")
print(f"  Base model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"  Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
End-of-Chapter Exercises
Exercise 1: Tokenization Comparison
Tokenize the sentence "I can't believe she'd do that!" using: (a) whitespace splitting, (b) NLTK word_tokenize, (c) spaCy tokenizer, (d) BERT WordPiece tokenizer. Compare the outputs and explain the differences.
Exercise 2: TF-IDF by Hand
Given documents D1="the cat sat on the mat", D2="the dog sat on the log", D3="cats and dogs are pets", compute TF-IDF for every unique word. Which word has the highest TF-IDF score in D3?
Exercise 3: Word2Vec Analogies
Using pre-trained Word2Vec (Google News vectors), find: (a) king - man + woman = ?, (b) Paris - France + India = ?, (c) doctor - man + woman = ?. Discuss any biases you observe.
Exercise 4: Sentiment Classification
Build a sentiment classifier for Amazon product reviews using: (a) Naive Bayes + BoW, (b) SVM + TF-IDF, (c) Fine-tuned BERT. Compare F1 scores and training time. Which is best for production?
Exercise 5: NER Pipeline
Implement a NER system for Indian news articles that extracts: PERSON, ORG, LOCATION, DATE, MONEY. Use spaCy and test on 20 news headlines from The Hindu or NDTV.
Exercise 6: N-gram Language Model
Build a trigram language model from a corpus of 1000 sentences. Implement: (a) MLE estimation, (b) Add-1 (Laplace) smoothing, (c) Kneser-Ney smoothing. Compare perplexity on a test set.
Exercise 7: Text Summarization Evaluation
Implement both extractive (TextRank) and abstractive (T5-small) summarizers. Evaluate both on 10 news articles using ROUGE-1, ROUGE-2, and ROUGE-L metrics. Which performs better and why?
Exercise 8: Topic Modeling
Apply LDA to a dataset of 500 Wikipedia articles across 5 categories (Science, Sports, Politics, Technology, Entertainment). Find the optimal number of topics using coherence score.
Exercise 9: Code-Mixed NLP
Collect 200 Hindi-English code-mixed tweets. Build a language identification system that tags each word as Hindi or English. Then build a sentiment classifier for the code-mixed text.
Exercise 10: Embedding Visualization
Train Word2Vec on a corpus of Indian news articles. Visualize word embeddings using t-SNE for 100 words across 5 semantic categories (politics, cricket, Bollywood, technology, food). Discuss clustering patterns.
Exercise 11: Spam Filter
Build an email spam filter using the Enron dataset. Compare BoW, TF-IDF, and BERT embeddings as features with Logistic Regression, Random Forest, and XGBoost classifiers (9 combinations).
Exercise 12: Question Answering
Using the SQuAD 2.0 dataset, fine-tune a DistilBERT model for extractive QA. Evaluate using Exact Match (EM) and F1 scores.
Exercise 13: Text Generation
Fine-tune GPT-2 on a corpus of Rabindranath Tagore's poetry. Generate 10 poems and evaluate them using: (a) perplexity, (b) BLEU score against real poems, (c) human evaluation (ask 5 people to rate coherence).
Exercise 14: RAG System
Build a RAG system for Indian Constitution Q&A: chunk the constitution text, create embeddings with sentence-transformers, store in ChromaDB, retrieve relevant sections, and generate answers using an LLM.
Exercise 15: Prompt Engineering
Design 5 different prompting strategies (zero-shot, few-shot, chain-of-thought, role-playing, structured output) for the same task: "Classify Indian court judgments by legal area." Compare accuracy across strategies.
Exercise 16: Multilingual Translation
Using the IndicTrans2 model, build a Hindi→English→Tamil translation pipeline. Evaluate BLEU scores and analyze common error patterns.
Exercise 17: GloVe Implementation
Implement GloVe from scratch: (a) build co-occurrence matrix from a 10K-word corpus, (b) implement the weighted least squares objective, (c) optimize using AdaGrad. Compare resulting embeddings with pre-trained GloVe.
Exercise 18: Attention Visualization
For a fine-tuned BERT model, extract and visualize attention weights for 5 example sentences. Identify which attention heads capture syntactic vs semantic relationships.
Exercise 19: Fake News Detection
Build a fake news detector using the LIAR dataset. Combine textual features (TF-IDF) with metadata features (speaker, context). Report accuracy on the 6-class task; published results on LIAR are well below 70%, so treat 70% as a stretch target rather than a requirement.
Exercise 20: LoRA vs Full Fine-tuning
Fine-tune a model (e.g., Phi-2 or Llama-2-7B) on a custom dataset using: (a) full fine-tuning, (b) LoRA (r=4, 8, 16, 32). Compare: accuracy, training time, memory usage, and number of trainable parameters.
Exercise 21: Dependency Parsing
Implement a simple transition-based dependency parser using an arc-standard system with shift, left-arc, and right-arc transitions. Test on 50 sentences from the Universal Dependencies treebank.
Exercise 22: Document Clustering
Cluster 1000 news articles into groups using: (a) TF-IDF + K-Means, (b) BERT embeddings + K-Means, (c) BERT embeddings + HDBSCAN. Compare using silhouette score and visual inspection.
Multiple Choice Questions
MCQ 1: In Word2Vec Skip-gram, what is being predicted?
MCQ 2: TF-IDF assigns the HIGHEST weight to words that are:
MCQ 3: BERT uses which pre-training objective?
MCQ 4: In the Transformer attention formula Attention(Q,K,V) = softmax(QKᵀ/√d_k)V, why divide by √d_k?
MCQ 5: Which approach does RAG (Retrieval-Augmented Generation) combine?
MCQ 6: What is the BIO tagging scheme used for?
MCQ 7: LoRA fine-tuning is efficient because:
MCQ 8: Perplexity of a language model measures:
MCQ 9: FastText's advantage over Word2Vec is:
MCQ 10: TextRank for summarization is inspired by:
MCQ 11: Which Indian NLP challenge is NOT common in English NLP?
MCQ 12: In abstractive summarization, the model:
Interview Questions
Q1: Explain the difference between Word2Vec, GloVe, and FastText. When would you use each?
Answer: Word2Vec learns embeddings from local context windows (Skip-gram/CBOW), which makes it a good source of general-purpose embeddings. GloVe factorizes the global co-occurrence matrix, so it is better at capturing global statistics. FastText extends Word2Vec with character n-grams, which makes it the best fit for morphologically rich languages (Hindi, Turkish) and for handling OOV words. In practice: use FastText for Indian languages, GloVe for analogy tasks, and nowadays most production systems use contextual embeddings from BERT/transformers instead.
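The character n-gram idea behind FastText's OOV handling can be illustrated with a small sketch. It is not FastText's implementation: the helper names `char_ngrams` and `shared_ngrams` are hypothetical, and the sketch only shows how a word is split into boundary-marked n-grams (FastText additionally keeps the whole word as its own unit and hashes n-grams into buckets).

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return FastText-style character n-grams for a word.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    produce n-grams distinct from word-internal ones.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def shared_ngrams(word_a, word_b, n_min=3, n_max=6):
    """N-grams that two words have in common (sorted for stable output)."""
    grams_a = set(char_ngrams(word_a, n_min, n_max))
    grams_b = set(char_ngrams(word_b, n_min, n_max))
    return sorted(grams_a & grams_b)


print(char_ngrams("where", 3, 3))
print(shared_ngrams("playing", "played", 3, 4))
```

Because an unseen word still shares n-grams with known words, its vector can be composed from the n-gram vectors learned during training, which is why inflected forms in morphologically rich languages end up close to each other.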
Q2: How does BERT differ from GPT architecturally? Why does this matter?
Answer: BERT uses the Transformer encoder (bidirectional) with Masked Language Model pre-training, so it sees context from BOTH sides. GPT uses the Transformer decoder (unidirectional/autoregressive), so it only sees left context. This means BERT excels at understanding tasks (classification, NER, QA) while GPT excels at generation tasks (text completion, dialogue). BERT can't generate text naturally; GPT can't attend to future tokens.
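The bidirectional versus left-to-right distinction comes down to which positions each token is allowed to attend to. A minimal NumPy sketch of the two mask patterns follows; the function name `attention_mask` is hypothetical and the masks are illustrative, not taken from either model's code.

```python
import numpy as np


def attention_mask(seq_len, causal):
    """Return a (seq_len, seq_len) matrix with 1 where row position i
    may attend to column position j, and 0 where it may not."""
    full = np.ones((seq_len, seq_len), dtype=int)
    # Lower-triangular mask: position i sees only positions 0..i
    return np.tril(full) if causal else full


bert_style = attention_mask(4, causal=False)  # Encoder: every token sees every token
gpt_style = attention_mask(4, causal=True)    # Decoder: tokens see only the past

print("Encoder (bidirectional):")
print(bert_style)
print("Decoder (causal):")
print(gpt_style)
```

In the causal mask the first row allows only the first token, which is what lets a decoder be trained to predict the next token without seeing it.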
Q3: What is the attention mechanism and why was it revolutionary?
Answer: Attention allows each position in a sequence to attend to all other positions, computing a weighted sum of value vectors based on query-key similarities. Before attention, RNNs compressed entire sequences into fixed-size vectors (information bottleneck). Attention enables: (1) parallel processing (unlike sequential RNNs), (2) direct long-range dependencies, (3) interpretable weights. The Transformer replaces recurrence entirely with self-attention, enabling massive parallelism and scaling to billions of parameters.
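The "weighted sum of value vectors based on query-key similarities" can be written in a few lines of NumPy. This is a single-head sketch with no masking, batching, or learned projections; the shapes and random inputs are arbitrary choices for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # Query-key similarities, shape (n_q, n_k)
    weights = softmax(scores, axis=-1)  # Each row is a distribution over the keys
    return weights @ V, weights         # Weighted sum of the value vectors


rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 query positions, d_k = 4
K = rng.normal(size=(5, 4))  # 5 key positions
V = rng.normal(size=(5, 2))  # One value vector per key
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)          # One output vector per query
print(weights.sum(axis=-1))  # Each row of weights sums to 1
```

Every query is compared with every key in one matrix product, which is the source of both the parallelism and the direct long-range connections mentioned in the answer.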
Q4: How would you build a sentiment analysis system for Hindi product reviews?
Answer: (1) Data collection: scrape Flipkart/Amazon Hindi reviews or use existing datasets (IIIT-H); (2) Handle code-mixing: detect language per word, normalize transliterated text; (3) Preprocessing: use IndicNLP tokenizer for Devanagari; (4) Model: fine-tune IndicBERT or MuRIL (pre-trained on Indian languages) rather than training from scratch; (5) Handle class imbalance with oversampling/weighted loss; (6) Evaluate with F1 (not just accuracy); (7) Deploy with FastAPI + sentence caching for speed.
Q5: Explain RAG. Why is it preferred over fine-tuning for enterprise applications?
Answer: RAG (Retrieval-Augmented Generation) retrieves relevant documents from a knowledge base and feeds them as context to an LLM. Advantages over fine-tuning: (1) Knowledge can be updated without retraining, (2) Answers are grounded in source documents (auditable), (3) Reduces hallucination, (4) Domain-specific without expensive GPU training, (5) Source attribution enables trust. Architecture: Document chunking → Embedding → Vector DB → Query → Retrieve → Prompt construction → LLM → Answer.
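The retrieval and prompt-construction stages of that architecture can be sketched without any external services. Everything here is a stand-in chosen for illustration: `embed` is a bag-of-words counter rather than a neural encoder, the in-memory list plays the role of the vector database, the sample chunks are invented, and the final LLM call is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a neural encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, retrieved):
    """Assemble a grounded prompt with numbered sources for attribution."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return (
        "Answer using only the numbered context and cite the source number.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )


# Hypothetical knowledge-base chunks, invented for this example.
chunks = [
    "Refunds are processed within 7 working days of receiving the returned item.",
    "Our support desk is open from 9 am to 6 pm on weekdays.",
    "Premium members get free delivery on all orders.",
]
query = "How many working days does a refund take?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)  # this string would be sent to the LLM
```

Updating the knowledge base means editing `chunks` and re-embedding, with no model retraining, and the numbered context is what makes the final answer auditable.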
Q6: What is LoRA and how does it make LLM fine-tuning accessible?
Answer: LoRA (Low-Rank Adaptation) freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into each transformer layer. Instead of updating W (d×d), it learns ΔW = A(d×r) × B(r×d) where rank r << d (typically 4-64). This reduces trainable parameters from billions to millions (0.1-1% of total), enabling fine-tuning on consumer GPUs. QLoRA additionally quantizes the frozen weights to 4-bit, making it possible to fine-tune a 7B model on a single 16GB GPU.
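The parameter savings follow from simple arithmetic, sketched below. The values d = 4096 and r = 8 are illustrative assumptions (not taken from a specific model), and `lora_param_counts` and `matmul` are hypothetical helper names.

```python
def lora_param_counts(d, r):
    """Trainable parameters for one d x d weight matrix: full update vs. LoRA."""
    full = d * d            # updating W directly
    lora = d * r + r * d    # A is d x r, B is r x d
    return full, lora


def matmul(a, b):
    """Plain-Python matrix product, used to form the low-rank update A x B."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


full, lora = lora_param_counts(d=4096, r=8)
print(full, lora, f"{100 * lora / full:.2f}%")

# Tiny illustration with d=3, r=1: the update is d x d but has only 2*d*r free values.
A = [[1.0], [2.0], [3.0]]   # d x r
B = [[0.5, 0.0, -0.5]]      # r x d
delta_W = matmul(A, B)      # d x d, rank at most r
print(delta_W)
```

For this single matrix the adapter trains 65,536 values instead of 16,777,216, about 0.39% of the full update, which sits inside the 0.1-1% range quoted in the answer.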
Q7: How would you evaluate an NLP model for bias?
Answer: (1) Test with demographic-swapped inputs (replace "he" with "she", different cultural names), (2) Check embedding bias with WEAT (Word Embedding Association Test), (3) Disaggregate accuracy by subgroups, (4) Test for stereotypical associations ("nurse" with gender), (5) Use tools like AI Fairness 360, (6) For Indian context: check caste, religion, and regional biases, (7) Continuous monitoring with diverse test sets post-deployment.
Q8: What are the challenges of NLP for Indian languages?
Answer: (1) Morphological richness: Tamil, Kannada are agglutinative (single word = entire sentence in English); (2) Script diversity: 12+ scripts (Devanagari, Tamil, Bengali, etc.); (3) Code-mixing: 60%+ of urban social media is mixed Hindi-English; (4) Resource scarcity: limited labeled datasets for most languages; (5) Transliteration: same word written in multiple scripts; (6) Free word order: Hindi/Sanskrit allow flexible sentence structure; (7) Dialectal variation: Hindi spoken in UP differs from Rajasthani Hindi.
Q9: Explain the difference between extractive and abstractive summarization.
Answer: Extractive selects the most important sentences from the original text (TextRank, LexRank). Pros: faithful to source, no hallucination. Cons: may lack coherence, can't paraphrase. Abstractive generates new text summarizing the original (T5, BART, GPT). Pros: more natural, concise. Cons: may hallucinate facts, harder to evaluate. In practice, many systems use hybrid approaches: extract key sentences, then rephrase them abstractively.
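As a rough sketch of the extractive side, the following plain-Python code ranks sentences with a simplified TextRank-style procedure: sentences are graph nodes, word overlap gives edge weights, and a PageRank-like iteration scores them. The similarity function, damping value, iteration count, and sample text are assumptions made for illustration; production systems typically use better tokenization and sentence embeddings.

```python
import math
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """TextRank-style overlap: shared words normalised by sentence lengths."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_summary(text, k=1, damping=0.85, iterations=50):
    """Return the k highest-scoring sentences, kept in their original order."""
    sents = split_sentences(text)
    n = len(sents)
    if n == 0:
        return []
    sim = [[similarity(sents[i], sents[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_sum = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(sim[j][i] / out_sum[j] * scores[j]
                            for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sents[i] for i in sorted(top)]


text = (
    "Transformers use self-attention to model language. "
    "Self-attention lets transformers relate distant words. "
    "The weather was pleasant yesterday. "
    "Attention weights in transformers show which words matter."
)
print(textrank_summary(text, k=2))
```

Because every returned sentence is copied verbatim from the input, the output cannot introduce new facts, but it also cannot merge or rephrase sentences, which is the trade-off the answer describes.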
Q10: Design a real-time fake news detection system. What NLP components would you use?
Answer: Architecture: (1) Content analysis: BERT-based claim verification against fact-check databases, linguistic feature extraction (sensationalism score, emotion intensity), (2) Source credibility: domain reputation scoring, author history NER, (3) Propagation analysis: social network spread patterns, bot detection, (4) Cross-reference: RAG-based fact verification against trusted news corpus, (5) Multilingual: IndicBERT for Hindi/regional language fake news, (6) Real-time pipeline: Kafka → preprocessing → ensemble model → confidence score → human-in-the-loop for borderline cases.
Research Problems
Research Problem 1: Code-Mixed Sentiment Analysis for Indian Social Media
Problem: Build a sentiment analysis model that handles Hindi-English, Tamil-English, and Telugu-English code-mixed text without language-specific preprocessing.
Approach: Explore unified multilingual embeddings, script-agnostic subword tokenization, and transliteration-augmented training. Can a single model handle code-mixing across all Indian language pairs?
Dataset: SemEval-2020 Task 9 (Sentiment Analysis for Code-Mixed Social Media Text), SAIL 2015 Hindi-English dataset.
Open Questions: Does explicit language identification improve or hurt performance? How to handle triple code-mixing (Hindi + English + Urdu)?
Research Problem 2: Hallucination Detection and Mitigation in LLMs
Problem: LLMs generate fluent but factually incorrect text. Develop methods to detect and reduce hallucination in domain-specific applications (medical, legal).
Approach: (1) Self-consistency checking (sample multiple outputs, detect contradictions), (2) Fact verification against knowledge graphs, (3) Uncertainty quantification using token probabilities, (4) RAG with strict grounding constraints.
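Two of these ideas, self-consistency checking and token-probability uncertainty, can be prototyped in a few lines as a starting point. The sketch below assumes the sampled answers and per-token probabilities have already been obtained from a model; the sample strings, the exact-match comparison, and the 0.7 threshold are arbitrary illustrations, not recommended settings.

```python
import math
from collections import Counter


def self_consistency(samples):
    """Majority answer and the fraction of samples that agree with it."""
    normalised = [s.strip().lower() for s in samples]
    answer, votes = Counter(normalised).most_common(1)[0]
    return answer, votes / len(normalised)


def mean_token_nll(token_probs):
    """Average negative log-probability; higher means the model was less certain."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical outputs from sampling the same question five times.
samples = ["New Delhi", "new delhi ", "Mumbai", "New Delhi", "Kolkata"]
answer, agreement = self_consistency(samples)
flagged = agreement < 0.7   # threshold is an arbitrary illustration
print(answer, agreement, flagged)

# Hypothetical per-token probabilities for one generated answer.
print(round(mean_token_nll([0.9, 0.8, 0.2]), 3))
```

Exact string matching is the weakest part of this sketch; free-form answers would need a semantic comparison (for example an entailment model) to detect contradictions.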
Metric: Define hallucination rate, factual consistency score, and source grounding score.
Research Problem 3: Low-Resource Indian Language NLP
Problem: Of India's 22 scheduled languages, many lack sufficient training data for modern NLP. Develop methods for NLP in languages with <10,000 labeled examples.
Approach: (1) Cross-lingual transfer from Hindi/English to low-resource languages, (2) Data augmentation via back-translation, (3) Zero-shot transfer using multilingual models, (4) Active learning for efficient annotation, (5) Synthetic data generation using LLMs.
Languages to focus on: Bodo, Dogri, Maithili, Santhali, Konkani, which are among the least-resourced scheduled languages.
Research Problem 4: Efficient Transformer Architectures for Edge Devices
Problem: Deploy NLP models on Indian smartphones (many with <4GB RAM). Design architectures that maintain accuracy while fitting in constrained environments.
Approach: Knowledge distillation, pruning, quantization (INT4/INT8), architecture search for mobile transformers. Target: sentiment analysis and language ID in <50MB model size.
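Of the listed techniques, quantization is the easiest to illustrate. Below is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python; the weight values are made up, and real toolchains add per-channel scales, calibration data, and packed integer storage that this sketch omits.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: returns integer codes and the scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


# Hypothetical float weights from one layer.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight now needs one byte instead of the four used by a 32-bit float, at the cost of a rounding error of at most half the scale per weight. Whether that error is acceptable for a given task has to be measured on held-out data.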
Key Takeaways
- Preprocessing is foundational: Tokenization, stemming, lemmatization, and stop word removal form the bedrock of every NLP pipeline. Choice of preprocessing depends on the task: sentiment analysis may need emojis and stop words that other tasks discard.
- Representations evolved from sparse to dense: BoW → TF-IDF → Word2Vec → Contextual Embeddings (BERT). Each generation captures richer linguistic information.
- Word embeddings capture meaning: Word2Vec, GloVe, and FastText encode semantic and syntactic relationships in dense vectors. FastText handles morphologically rich Indian languages better through character n-grams.
- Transformers transformed NLP: The self-attention mechanism enables parallel processing and captures long-range dependencies. BERT (bidirectional, for understanding) and GPT (unidirectional, for generation) are two sides of the same coin.
- Indian NLP has unique challenges: 22+ languages, 12+ scripts, pervasive code-mixing, dialectal variation, and limited labeled data. Solutions include IndicBERT, IndicTrans2, and the Bhashini platform.
- RAG is the enterprise standard: Retrieval-Augmented Generation reduces hallucination, enables domain specificity, and provides auditable answers, all of which are critical for production NLP systems.
- LoRA democratizes fine-tuning: Low-Rank Adaptation enables fine-tuning billion-parameter models on consumer hardware by training only about 0.1-1% of the parameters.
- Evaluation matters: Perplexity for language models, ROUGE for summarization, F1 for classification, BLEU for translation, Exact Match for QA. Always use the right metric for the task.
- NLP is the gateway to General AI: Language understanding underpins reasoning, planning, and knowledge, so mastering NLP is essential for the next generation of AI researchers and engineers.
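Two of the metrics named in the takeaways are simple enough to compute by hand, which helps when checking library output. The sketch below uses made-up probabilities and labels; `perplexity` and `f1_score` are hypothetical helper names written for this example.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# A model that gives every true token probability 0.25 is as uncertain
# as a uniform choice among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# Imbalanced labels: predicting the majority class looks fine on accuracy only.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, f1_score(y_true, y_pred))
```

The second example shows why accuracy alone can mislead on imbalanced data: the all-negative classifier scores 0.8 accuracy but 0.0 F1 on the positive class.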
The field moves fast: GPT-4 is already being superseded by newer models. But the fundamentals don't change: understanding tokenization, attention, embeddings, and evaluation will serve you regardless of which model is trending. Learn the principles deeply, and you'll adapt to any new architecture.
References
Foundational Papers
- Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781 (Word2Vec paper).
- Pennington, J., Socher, R., Manning, C. (2014). "GloVe: Global Vectors for Word Representation." EMNLP.
- Bojanowski, P., et al. (2017). "Enriching Word Vectors with Subword Information." TACL (FastText paper).
- Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS (the Transformer paper).
- Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL.
- Brown, T., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS (GPT-3 paper).
- Hu, E., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685.
- Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS.
Indian NLP Resources
- Kakwani, D., et al. (2020). "IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Models for Indian Languages." EMNLP Findings.
- Gala, J., et al. (2023). "IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages." TMLR.
- Khanuja, S., et al. (2020). "GLUECoS: An Evaluation Benchmark for Code-Switched NLP." ACL.
- Joshi, A., et al. (2016). "Towards Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text." COLING.
Textbooks
- Jurafsky, D. & Martin, J.H. (2024). Speech and Language Processing, 3rd Edition (online draft). The definitive NLP textbook.
- Goldberg, Y. (2017). Neural Network Methods for Natural Language Processing. Morgan & Claypool.
- Eisenstein, J. (2019). Introduction to Natural Language Processing. MIT Press.
Tools & Libraries
- NLTK: nltk.org (Natural Language Toolkit for Python).
- spaCy: spacy.io (industrial-strength NLP in Python).
- HuggingFace Transformers: huggingface.co (500K+ pre-trained models).
- AI4Bharat: ai4bharat.org (Indian language NLP models and datasets).
- Gensim: radimrehurek.com/gensim (topic modeling and Word2Vec).
Online Courses
- Stanford CS224N: Natural Language Processing with Deep Learning (free on YouTube).
- fast.ai NLP Course: Practical NLP with modern techniques.
- NPTEL: NLP courses in Hindi/English by IIT professors.