Neural Networks & Deep Learning
Chapter 19: Recommendation Systems with Deep Learning
When Algorithms Know What You Want Before You Do
Reading Time: ~3 hours | Part V: Applications | Theory + Code Chapter
Prerequisites: Chapters 6–8 (Deep Networks, Backpropagation, Optimization), Embeddings (Ch 14/15), Basic Linear Algebra
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| Remember | Recall the user-item interaction matrix, matrix factorization formula, and the architecture of two-tower models and YouTube DNN |
| Understand | Explain the difference between collaborative filtering, content-based filtering, and hybrid approaches; describe how embeddings capture latent factors |
| Apply | Implement matrix factorization from scratch with gradient descent and build a Neural Collaborative Filtering model in TensorFlow |
| Analyze | Analyze cold-start problems, popularity bias, and urban vs. rural recommendation fairness in Indian e-commerce contexts |
| Evaluate | Evaluate tradeoffs between collaborative vs. content-based vs. hybrid systems; select appropriate architectures for different business scenarios |
| Create | Design an end-to-end deep recommendation pipeline for a Bollywood movie platform with fairness constraints |
Learning Objectives
By the end of this chapter, you will be able to:
- Define collaborative filtering, content-based filtering, and hybrid recommendation systems, and explain when each paradigm is most suitable
- Derive the matrix factorization objective with regularization and implement gradient descent updates for user/item latent factor matrices
- Explain how Neural Collaborative Filtering (NCF) generalises matrix factorization by replacing the dot product with a learned neural network
- Implement embedding layers in PyTorch and TensorFlow to represent users and items as dense, low-dimensional vectors
- Describe the two-tower architecture (user tower + item tower) and explain how approximate nearest neighbour search enables real-time retrieval at scale
- Reproduce the key components of the YouTube recommendation DNN (2016 paper): the candidate generation and ranking stages
- Build a from-scratch matrix factorization model using NumPy and a Neural CF model using TensorFlow/Keras on the MovieLens dataset
- Analyze recommendation fairness issues: urban vs. rural bias, language bias, and cold-start problems for new sellers on Indian platforms
- Evaluate recommendation quality using metrics like RMSE, Precision@K, Recall@K, NDCG, and Hit Rate
- Design a hybrid deep learning recommendation system for an Indian e-commerce or OTT streaming scenario
Opening Hook
IPL Final Night on Hotstar: 50 Crore Decisions per Second
It's the 2024 IPL final. Over 50 crore (500 million) users are on Disney+ Hotstar simultaneously. The match is streaming live, but here's what most viewers don't see: behind every screen, a deep recommendation engine is making hundreds of decisions.
"You watched MI vs CSK last week; here's a highlight reel." "Since you paused KGF 2 at the interval, here's a Yash interview." "Your friend in Bengaluru is watching this Tamil web series…"
Hotstar's recommendation engine drives over 70% of total watch time. Without it, most users would open the app, feel overwhelmed by 100,000+ titles, and leave within 30 seconds. The algorithm doesn't just suggest; it decides what India watches.
From the ₹2,000 crore GMV boost at Amazon India to Spotify Wrapped playlists that know your Bollywood guilty pleasures, recommendation systems powered by deep learning are the invisible architects of modern digital India.
In this chapter, you'll learn to build them from scratch.
Netflix estimated in 2016 that its recommendation system saves ₹8,300 crore ($1 billion) per year by reducing subscriber churn. If users can't find what they want within 60–90 seconds, they leave. In India, where ARPU (Average Revenue Per User) for OTT is just ₹100–150/month, the stakes per user are lower, but the scale (300M+ OTT users) more than compensates.
Core Concepts
We'll build up from the simplest approaches to state-of-the-art deep learning architectures. The roadmap: Classical CF → Matrix Factorization → Neural CF → Content-Based DL → Hybrid → Two-Tower → YouTube DNN → Fairness.
19.1 Classical Collaborative Filtering (CF)
The fundamental idea behind collaborative filtering is beautifully simple: users who agreed in the past will agree in the future. If Priya and Rahul both loved 3 Idiots, Lagaan, and Zindagi Na Milegi Dobara, and Priya also liked Dil Chahta Hai, then Rahul will probably like it too.
The User-Item Interaction Matrix
At the heart of CF lies the user-item interaction matrix R, where R[u][i] represents user u's rating (or implicit feedback) for item i.
There are two classical flavours:
User-Based CF vs Item-Based CF
User-Based CF: Find users similar to the target user, then aggregate their ratings. "Users like you also watched…"
Similarity: Cosine similarity or Pearson correlation between user rating vectors.
Item-Based CF: Find items similar to what the user has already liked, then recommend those. "Because you watched Dangal…"
Advantage: Item similarities are more stable than user similarities (items don't change; user tastes drift).
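User-based CF can be sketched in a few lines of NumPy. The rating matrix below is made up for illustration, and the names `cosine_sim` and `predict_user_based` are hypothetical helpers, not part of any library. Treating a missing rating as 0 inside the cosine is a simplification that real systems usually avoid (for example, by mean-centring over co-rated items).

```python
import numpy as np

# Hypothetical 4-user x 5-item rating matrix; 0 marks a missing rating.
R = np.array([
    [5, 4, 0, 0, 1],   # Priya
    [5, 5, 0, 4, 1],   # Rahul
    [1, 0, 5, 2, 4],   # Aisha
    [0, 1, 4, 0, 5],   # Vikram
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors (0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def predict_user_based(R, user, item):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    num, den = 0.0, 0.0
    for other in range(R.shape[0]):
        if other == user or R[other, item] == 0:
            continue  # skip the target user and users who have not rated the item
        s = cosine_sim(R[user], R[other])
        num += s * R[other, item]
        den += abs(s)
    return num / den if den > 0 else 0.0

sim_priya_rahul = cosine_sim(R[0], R[1])
sim_priya_aisha = cosine_sim(R[0], R[2])
pred = predict_user_based(R, user=0, item=3)  # Priya's predicted rating for item 3
print(f"sim(Priya, Rahul)={sim_priya_rahul:.2f}, "
      f"sim(Priya, Aisha)={sim_priya_aisha:.2f}, prediction={pred:.2f}")
```

Because Rahul's history is much closer to Priya's than Aisha's is, his rating of 4 dominates the weighted average and the prediction lands nearer 4 than Aisha's 2. Item-based CF follows the same pattern with the roles of rows and columns swapped.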
Limitations of Classical CF
- Cold Start: New users or new items have no interactions, so no similarity can be computed
- Scalability: Computing pairwise similarities for 50 crore users is O(n²), which is infeasible
- Sparsity: With 99%+ of entries missing, finding reliable neighbours is difficult
Flipkart's Scale Challenge: With 45+ crore registered users and 15+ crore products, the user-item matrix has ~6.75 × 10¹⁶ potential entries (4.5 × 10⁸ users × 1.5 × 10⁸ products). Classical CF with pairwise similarity is computationally impossible at this scale. This is why Flipkart moved to deep learning-based recommendations in 2019.
19.2 Matrix Factorization: The Breakthrough
The 2006 Netflix Prize showed that matrix factorization (MF) dramatically outperforms classical CF. The core insight: decompose the sparse user-item matrix into two low-rank matrices.
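$$R \approx P Q^{\top}, \qquad R \in \mathbb{R}^{m \times n},\quad P \in \mathbb{R}^{m \times k} \text{ (user factors)},\quad Q \in \mathbb{R}^{n \times k} \text{ (item factors)}$$
Predicted rating:
$$\hat{r}_{ui} = p_u \cdot q_i = \sum_{f=1}^{k} p_{uf}\, q_{if}$$
Each user $u$ is represented by a $k$-dimensional latent vector $p_u$, and each item $i$ by $q_i$. The predicted rating is their dot product. The parameter $k$ (typically 20–200) controls how many latent factors we learn.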
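To make the shapes concrete, here is a small NumPy sketch with toy sizes (4 users, 5 items, 2 factors). The random factor values and all variable names are illustrative assumptions, not learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 5, 2                        # users, items, latent factors (toy sizes)

P = rng.normal(scale=0.1, size=(m, k))   # user factors, one row per user
Q = rng.normal(scale=0.1, size=(n, k))   # item factors, one row per item

R_hat = P @ Q.T                          # every predicted rating at once, shape (m, n)
r_hat_13 = P[1] @ Q[3]                   # one prediction: dot product of user 1 and item 3

# Storage: the full matrix needs m*n entries, the two factor matrices k*(m+n).
full_entries = m * n
factor_entries = k * (m + n)
print(R_hat.shape, full_entries, factor_entries)
```

At these toy sizes the saving is negligible, but factor storage grows as k(m + n) rather than m·n, which is what makes the approach workable for large catalogues.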
What Are Latent Factors?
Think of each dimension as capturing an abstract "taste axis." For Bollywood movies:
- Factor 1: Action vs. Romance (Pushpa scores high on action; DDLJ scores high on romance)
- Factor 2: Regional vs. Pan-India (Kantara scores high on regional; RRR scores high on pan-India)
- Factor 3: Indie/art-house vs. Commercial (Ship of Theseus vs. Singham)
Neither users nor the algorithm explicitly name these factors; they emerge automatically from the data.
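Optimization Objective
$$\min_{P,\,Q} \sum_{(u,i) \in \mathcal{K}} \left(r_{ui} - p_u \cdot q_i\right)^2 + \lambda\left(\lVert P \rVert^2 + \lVert Q \rVert^2\right)$$
where $\mathcal{K}$ is the set of observed (user, item) pairs and $\lambda$ is the regularization strength that prevents overfitting.
We only sum over observed ratings (not missing ones). The regularization term $\lambda(\lVert P \rVert^2 + \lVert Q \rVert^2)$ penalises large latent factor values, preventing the model from memorising the training data.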
Gradient Descent Updates
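For each observed rating $(u, i, r_{ui})$, compute the prediction error $e_{ui} = r_{ui} - p_u \cdot q_i$, then update both vectors with learning rate $\alpha$ (the constant factor 2 from the squared error is absorbed into $\alpha$):
$$p_u \leftarrow p_u + \alpha\,(e_{ui}\, q_i - \lambda\, p_u)$$
$$q_i \leftarrow q_i + \alpha\,(e_{ui}\, p_u - \lambda\, q_i)$$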
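A single update is easy to check by hand. The sketch below applies the two rules above to one made-up rating; `sgd_step` and the starting vectors are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def sgd_step(p_u, q_i, r_ui, lr=0.05, reg=0.02):
    """One SGD update for a single observed rating; returns new vectors and the error."""
    e_ui = r_ui - p_u @ q_i                        # prediction error before the update
    p_new = p_u + lr * (e_ui * q_i - reg * p_u)
    q_new = q_i + lr * (e_ui * p_u - reg * q_i)    # uses the OLD p_u, not p_new
    return p_new, q_new, e_ui

p = np.array([0.1, 0.3])
q = np.array([0.2, -0.1])
p1, q1, e0 = sgd_step(p, q, r_ui=4.0)
e1 = 4.0 - p1 @ q1                                 # error after one step
print(f"error before: {e0:.4f}, after: {e1:.4f}")
```

Both vectors are updated from the same pre-update values, which is why the from-scratch implementation later in the chapter keeps a copy of the old user vector.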
Adding Biases
Real ratings have systematic biases: some users are generous raters (Priya gives 4.2 on average), some movies are universally loved (Dangal has a 4.5 average). The biased MF model:
$$\hat{r}_{ui} = \mu + b_u + b_i + p_u \cdot q_i$$
where $\mu$ = global mean, $b_u$ = user bias, $b_i$ = item bias.
In practice, biased MF with $k$ = 50–100 latent factors and $\lambda$ = 0.02 achieves RMSE close to state-of-the-art on MovieLens datasets. Start here before jumping to deep models; simpler models often perform surprisingly well!
19.3 Neural Collaborative Filtering (NCF)
Matrix factorization uses a simple dot product to model user-item interaction: $\hat{r}_{ui} = p_u \cdot q_i$. But what if the interaction is more complex than a linear combination? Neural Collaborative Filtering (He et al., 2017) replaces the dot product with a neural network.
NCF Architecture
Instead of computing $p_u \cdot q_i$, pass the concatenation $[p_u; q_i]$ through a multi-layer perceptron (MLP) to learn an arbitrary interaction function:
$$\hat{r}_{ui} = f_\theta(p_u, q_i) = \mathrm{MLP}([p_u; q_i])$$
Architecture Layers
- Input Layer: One-hot encoded user ID and item ID
- Embedding Layer: Maps user/item IDs → dense vectors ($p_u$, $q_i$)
- Interaction Layer: Concatenate $[p_u; q_i]$ (or element-wise product, or both)
- Hidden Layers: FC → ReLU → FC → ReLU → … (tower of decreasing width)
- Output Layer: Single neuron with sigmoid (for implicit) or linear (for explicit ratings)
The original NCF paper proposes NeuMF, which combines a Generalized Matrix Factorization (GMF) path with an MLP path. GMF preserves the dot-product signal; MLP adds nonlinear capacity. Their outputs are concatenated and fed to a final prediction layer.
Why NCF > MF?
- Expressiveness: MF can only model linear interactions (dot product); NCF can learn arbitrary nonlinear patterns
- Feature incorporation: Easy to add side features (user age, item category) by concatenating to the embedding vector
- Implicit feedback: NCF naturally handles binary signals (clicked/not-clicked) using binary cross-entropy loss
"Deep models always beat matrix factorization." This is false! On clean, dense rating data (like MovieLens 100K), well-tuned MF often matches or beats NCF. Deep models shine when you have (1) very sparse data, (2) implicit feedback, (3) rich side features, or (4) need to combine multiple signal types. Always benchmark against a simple MF baseline.
19.4 Content-Based Filtering with Deep Learning
Content-based filtering recommends items similar to what the user has previously liked, based on item features rather than collaborative signals. Deep learning revolutionises this approach by learning rich feature representations automatically.
Traditional Content Features vs. Deep Embeddings
| Aspect | Traditional | Deep Learning |
|---|---|---|
| Text (Movie Plots) | TF-IDF, Bag of Words | BERT/Sentence-BERT embeddings (768-dim) |
| Images (Product Photos) | Colour histograms, SIFT | ResNet/EfficientNet feature vectors (2048-dim) |
| Audio (Song Features) | MFCC, tempo, key | VGGish / wav2vec embeddings |
| Video (Trailers) | Keyframe extraction | Video Transformer embeddings |
Embedding-Based Content Similarity
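Given a pre-trained model that maps each item $i$ to an embedding $e_i$, the similarity of two items is the cosine of the angle between their embeddings:
$$\mathrm{sim}(i, j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert\, \lVert e_j \rVert}$$
Example: Using Sentence-BERT to embed Bollywood movie plots:
- Dangal embedding ≈ Chak De! India embedding (both sports + patriotism)
- Gangs of Wasseypur embedding ≈ Mirzapur embedding (both crime dramas + UP setting)
- Cosine similarity between Dangal & Gangs of Wasseypur ≈ 0.15 (very different genres)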
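The same comparison can be sketched with hand-made vectors. The 4-dimensional "plot embeddings" below are invented for illustration (real Sentence-BERT vectors have hundreds of dimensions and would come from a text encoder), and `cosine` and `most_similar` are hypothetical helper names.

```python
import numpy as np

# Hypothetical embeddings: dims 0-1 loosely "sports/patriotism", dims 2-3 "crime/UP setting".
emb = {
    "Dangal":             np.array([0.9, 0.8, 0.1, 0.0]),
    "Chak De! India":     np.array([0.8, 0.9, 0.2, 0.1]),
    "Gangs of Wasseypur": np.array([0.0, 0.1, 0.9, 0.8]),
    "Mirzapur":           np.array([0.1, 0.0, 0.8, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(title, emb):
    """Return the other title whose embedding is closest to `title` by cosine."""
    scores = {t: cosine(emb[title], v) for t, v in emb.items() if t != title}
    return max(scores, key=scores.get)

print(most_similar("Dangal", emb))
print(round(cosine(emb["Dangal"], emb["Gangs of Wasseypur"]), 2))
```

A new title with no interactions can be dropped into `emb` and compared immediately, which is the cold-start advantage discussed next.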
Advantages of Content-Based DL
- No cold-start for items: New movies/products can be recommended immediately using their features
- Explainability: "Recommended because you liked sports dramas with strong female leads"
- No popularity bias: Niche items with good feature matches get surfaced
JioSaavn's Music Recommendations: JioSaavn uses deep audio embeddings to recommend music across India's 22+ official languages. A user who listens to Arijit Singh (Hindi) might get recommended Sid Sriram (Tamil/Telugu) if the audio embeddings are similar, bridging language barriers through learned feature representations rather than explicit genre tags.
19.5 Hybrid Recommendation Systems
In practice, no single approach works best alone. Production systems at Flipkart, Amazon India, and Hotstar all use hybrid architectures that combine collaborative and content signals.
Hybrid Strategies
| Strategy | How It Works | Example |
|---|---|---|
| Weighted | Blend scores: $s = \alpha\, s_{CF} + (1 - \alpha)\, s_{CB}$ | Hotstar: 60% CF + 40% content for new shows |
| Switching | Use CB for cold-start, switch to CF when enough data | Meesho: content-based for new sellers |
| Feature Augmentation | CB features fed as input to CF model | Amazon India: product image embeddings in NCF |
| Cascade | Stage 1: CF filters candidates; Stage 2: CB re-ranks | YouTube: candidate gen → ranking |
| Meta-level | One model's output is another model's input | Netflix: MF embeddings fed to gradient-boosted trees |
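The first two strategies in the table are simple enough to write down directly. This is a minimal sketch: the function names, the 0.6 blend weight, and the five-interaction threshold are illustrative assumptions, not values used by any of the platforms mentioned.

```python
def weighted_hybrid(s_cf, s_cb, alpha=0.6):
    """Weighted strategy: blend a collaborative score and a content score."""
    return alpha * s_cf + (1 - alpha) * s_cb

def switching_hybrid(s_cf, s_cb, n_interactions, min_interactions=5):
    """Switching strategy: trust the content score until enough interactions exist."""
    return s_cf if n_interactions >= min_interactions else s_cb

blended = weighted_hybrid(s_cf=0.8, s_cb=0.5, alpha=0.6)
cold = switching_hybrid(s_cf=0.8, s_cb=0.5, n_interactions=2)    # new item: content score
warm = switching_hybrid(s_cf=0.8, s_cb=0.5, n_interactions=40)   # established item: CF score
print(round(blended, 2), cold, warm)
```

In practice the blend weight is usually tuned on validation data, and it can itself depend on how much interaction history an item has.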
Deep Hybrid Architecture
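Modern deep learning naturally enables hybrid systems by concatenating different embedding types (for example, a collaborative ID embedding alongside text or image embeddings) and passing the combined vector through shared dense layers.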
This is the essence of the Wide & Deep architecture (Google, 2016) and the two-tower model we'll explore next.
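As a rough sketch of the concatenation idea, the snippet below joins three embeddings for one item and projects them through a single ReLU layer. The vectors are random stand-ins, the dimensions follow the table in Section 19.4, and the 128-unit layer is an arbitrary choice for illustration; in a real model these weights would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for one item's embeddings (a real system would look these up).
id_emb = rng.normal(size=64)        # collaborative signal: learned from interactions
text_emb = rng.normal(size=768)     # content signal: e.g. a plot-summary embedding
image_emb = rng.normal(size=2048)   # content signal: e.g. a poster or product photo

# Hybrid input: simply concatenate the different embedding types.
item_features = np.concatenate([id_emb, text_emb, image_emb])   # shape (2880,)

# One dense ReLU layer projecting the hybrid vector to a compact representation.
W = rng.normal(scale=0.01, size=(item_features.size, 128))
b = np.zeros(128)
item_vec = np.maximum(0.0, item_features @ W + b)                # shape (128,)
print(item_features.shape, item_vec.shape)
```

For a brand-new item the ID embedding carries no information yet, so the content embeddings do most of the work until interactions accumulate.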
19.6 The Two-Tower Model
The two-tower (or dual-encoder) architecture is the workhorse of industrial recommendation systems. It's used at Google, Facebook, Amazon, and most Indian tech companies at scale.
Two-Tower Architecture
Train two separate neural networks (towers): one for users, one for items. Each tower produces a fixed-dimensional embedding vector. Similarity is computed via dot product or cosine similarity.
Why Two Separate Towers?
The key advantage is decoupled computation:
- Item embeddings can be pre-computed offline and stored in a vector index (FAISS, ScaNN, Milvus)
- At serving time, only the user tower runs in real-time to produce the user embedding
- Retrieval uses approximate nearest neighbour (ANN) search to find the top-K items closest to the user embedding in milliseconds, even over 100 million items
The model is trained with triples (user, item, label) where label = 1 for positive interactions and 0 for negative samples (randomly sampled non-interactions). Loss: binary cross-entropy or sampled softmax.
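Building those training triples might look like the sketch below. `sample_negatives` is a hypothetical helper using uniform random sampling; production systems often sample negatives in proportion to item popularity or reuse other items in the batch. The sketch assumes no user has interacted with every item, otherwise the inner loop would not terminate.

```python
import random

def sample_negatives(positives, n_items, n_neg_per_pos=1, seed=0):
    """Turn (user, item) positives into (user, item, label) triples with sampled negatives."""
    rng = random.Random(seed)
    seen = set(positives)
    triples = [(u, i, 1) for u, i in positives]
    for u, _ in positives:
        for _ in range(n_neg_per_pos):
            j = rng.randrange(n_items)
            while (u, j) in seen:          # resample if the user actually interacted with j
                j = rng.randrange(n_items)
            triples.append((u, j, 0))
    return triples

positives = [(0, 1), (0, 2), (1, 3)]
triples = sample_negatives(positives, n_items=10, n_neg_per_pos=2)
print(len(triples), sum(label for _, _, label in triples))
```

A sampled "negative" is only an unobserved interaction, not a confirmed dislike, which is one reason the ratio of negatives to positives is treated as a tunable hyperparameter.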
Serving Latency
User tower inference: ~2ms. ANN search over 10M items: ~5ms. Total: under 10ms, fast enough for real-time recommendations.
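The serving path can be sketched with exact (brute-force) search standing in for an ANN index. The item embeddings below are random unit vectors rather than real tower outputs, and `retrieve_top_k` is a hypothetical helper; a library such as FAISS or ScaNN would replace the matrix product with an approximate index lookup.

```python
import numpy as np

rng = np.random.default_rng(42)
n_items, dim = 10_000, 64

# Offline: pre-compute and L2-normalise item embeddings (random stand-ins here).
item_index = rng.normal(size=(n_items, dim))
item_index /= np.linalg.norm(item_index, axis=1, keepdims=True)

def retrieve_top_k(user_vec, item_index, k=10):
    """Exact top-k by dot product; an ANN index approximates this step."""
    scores = item_index @ user_vec
    top = np.argpartition(-scores, k)[:k]      # unordered k best, O(n)
    return top[np.argsort(-scores[top])]       # sort only those k by score

# Online: only the user embedding is computed per request.
# Here it is faked as item 123's vector plus a little noise.
user_vec = item_index[123] + 0.01 * rng.normal(size=dim)
top_items = retrieve_top_k(user_vec, item_index, k=5)
print(top_items)
```

Brute force costs O(n) per request, which is acceptable for small catalogues; ANN indexes trade a small loss in recall for sub-linear lookup time.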
Flipkart's recommendation team reported that switching from a monolithic model to a two-tower architecture reduced serving latency from 150ms to 12ms while maintaining the same recommendation quality. The item tower embeddings for their 15 crore+ product catalogue are refreshed every 6 hours and stored in a FAISS index.
Embedding Layers: The Foundation
Both towers rely heavily on embedding layers to convert sparse categorical features (user IDs, item IDs, city codes, categories) into dense vectors.
PyTorch
```python
import torch
import torch.nn as nn

# Embedding: maps integer IDs -> dense vectors
user_embedding = nn.Embedding(
    num_embeddings=500000,  # 5 lakh users
    embedding_dim=64,       # each user -> 64-dim vector
)
item_embedding = nn.Embedding(
    num_embeddings=100000,  # 1 lakh items
    embedding_dim=64,
)

# Usage
user_id = torch.tensor([42])
item_id = torch.tensor([1337])
user_vec = user_embedding(user_id)  # shape: (1, 64)
item_vec = item_embedding(item_id)  # shape: (1, 64)

# Dot product -> predicted score
score = torch.sum(user_vec * item_vec, dim=1)
print(f"Predicted score: {score.item():.4f}")
```
TensorFlow
```python
import tensorflow as tf

# TensorFlow Embedding layer
user_embedding = tf.keras.layers.Embedding(
    input_dim=500000,  # vocabulary size
    output_dim=64,     # embedding dimension
    name="user_embedding",
)
item_embedding = tf.keras.layers.Embedding(
    input_dim=100000,
    output_dim=64,
    name="item_embedding",
)

# Lookup
user_vec = user_embedding(tf.constant([42]))    # (1, 64)
item_vec = item_embedding(tf.constant([1337]))  # (1, 64)

# Cosine similarity. Note: tf.keras.losses.cosine_similarity is a loss and
# returns the NEGATIVE cosine similarity, so the sign is flipped here.
score = -tf.keras.losses.cosine_similarity(user_vec, item_vec, axis=-1)
```
19.7 YouTube Recommendation DNN: A Deep Dive
The 2016 paper "Deep Neural Networks for YouTube Recommendations" by Covington et al. is arguably the most influential industrial recommendation systems paper ever published. It describes the system serving over 1 billion users and choosing from hundreds of millions of videos.
Two-Stage Architecture
Stage 1: Candidate Generation
Reduces the corpus from millions of videos to ~hundreds of candidates. Uses a deep neural network that acts as a massive multiclass classifier.
- Input Features: Watch history (embedded + averaged), search history (embedded + averaged), geographic embedding, age, gender
- Architecture: 3 fully-connected ReLU layers (1024 → 512 → 256)
- Output: User embedding (256-d), used for ANN retrieval at serving time
- Training: Treats the problem as extreme multiclass classification (softmax over all videos); uses sampled softmax for tractability
Stage 2: Ranking
Takes the ~hundreds of candidates and produces a fine-grained ranking. Uses a richer set of features.
- Additional Features: Time since last watch, video age, # previous impressions of this video, video language match
- Architecture: Wider MLP (1024 → 512 → 256) with logistic regression output
- Objective: Predict expected watch time (weighted logistic regression)
- Key Insight: Training on watch time (not clicks) avoids "clickbait" optimisation
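The weighted logistic regression objective can be sketched as follows, assuming (as the paper describes) that clicked impressions are weighted by their watch time and unclicked impressions get unit weight. The function names are hypothetical and the numbers are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_logistic_loss(logit, clicked, watch_time_sec=0.0):
    """Cross-entropy in which a clicked impression counts `watch_time_sec` times."""
    p = min(max(sigmoid(logit), 1e-12), 1.0 - 1e-12)   # clamp to avoid log(0)
    if clicked:
        return -watch_time_sec * math.log(p)
    return -math.log(1.0 - p)                          # unclicked: unit weight

def expected_watch_time(logit):
    """Serving-time score: the learned odds e^logit approximate expected watch time."""
    return math.exp(logit)

short = weighted_logistic_loss(0.5, clicked=True, watch_time_sec=10)
long_ = weighted_logistic_loss(0.5, clicked=True, watch_time_sec=300)
print(f"loss for 10s watch: {short:.3f}, for 300s watch: {long_:.3f}")
```

A mistake on a long watch costs proportionally more than a mistake on a brief click, so the model is pushed toward videos people actually keep watching.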
Key Design Decisions from the Paper
1. "Example Age" Feature
YouTube observed that models trained on historical data develop a bias toward older (popular) videos. They added an "example age" feature (the age of the training example at training time) so the model can learn how a video's popularity changes over time instead of averaging over the whole training window. At serving time, this feature is set to 0 (or slightly negative), so predictions reflect the very end of the training window and fresh content is favoured.
2. Averaging Watch History Embeddings
Rather than using RNNs for sequence modelling (which were expensive in 2016), YouTube simply averages the embeddings of the last N watched videos. Despite its simplicity, this works remarkably well because the averaging acts as a form of pooling that captures general taste.
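Average pooling over a watch history is a one-liner in NumPy. The embedding table below is random, and `pool_watch_history` and the `last_n=50` cutoff are illustrative assumptions rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(7)
video_emb = rng.normal(size=(1000, 32))   # hypothetical table: 1000 videos, 32-dim

def pool_watch_history(video_ids, video_emb, last_n=50):
    """Average the embeddings of the most recent `last_n` watched videos."""
    recent = list(video_ids)[-last_n:]
    if not recent:                         # brand-new user: fall back to a zero vector
        return np.zeros(video_emb.shape[1])
    return video_emb[recent].mean(axis=0)

history = [3, 17, 256, 999]
user_history_vec = pool_watch_history(history, video_emb)
print(user_history_vec.shape)
```

The pooled vector has the same size no matter how long the history is, which is what lets it feed a fixed-width dense layer. The price is that watch order is discarded: reversing the history gives the same vector.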
3. Asymmetric Training vs Serving
At training time, the candidate generation model is a standard softmax classifier. At serving time, the softmax itself is dropped: the output of the last hidden layer is the user embedding, and the rows of the softmax weight matrix act as the video embeddings stored in the ANN index. Retrieval then reduces to a nearest-neighbour search in dot-product space.
4. Negative Sampling
With millions of videos, computing the full softmax is impossible. YouTube uses sampled softmax with ~thousands of negative samples per batch.
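A simplified version of the idea: score the positive video against a random subset of negatives instead of the whole corpus. This sketch samples negatives uniformly and omits the sampling-probability correction used in practice; the function names and sizes are illustrative assumptions.

```python
import numpy as np

def sampled_softmax_loss(user_vec, video_weights, pos_id, n_neg, rng):
    """Cross-entropy over the positive video plus `n_neg` uniformly sampled negatives."""
    n_videos = video_weights.shape[0]
    neg_ids = rng.choice(n_videos, size=n_neg, replace=False)
    neg_ids = neg_ids[neg_ids != pos_id]              # never use the positive as a negative
    cand = np.concatenate(([pos_id], neg_ids))        # positive sits at index 0
    logits = video_weights[cand] @ user_vec
    logits = logits - logits.max()                    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])

def full_softmax_loss(user_vec, video_weights, pos_id):
    """Reference: cross-entropy over every video in the corpus."""
    logits = video_weights @ user_vec
    logits = logits - logits.max()
    return float(-(logits[pos_id] - np.log(np.exp(logits).sum())))

rng = np.random.default_rng(0)
video_weights = rng.normal(scale=0.1, size=(5000, 32))   # one row per video
user_vec = rng.normal(scale=0.1, size=32)

loss_sampled = sampled_softmax_loss(user_vec, video_weights, pos_id=7, n_neg=100, rng=rng)
loss_full = full_softmax_loss(user_vec, video_weights, pos_id=7)
print(f"sampled: {loss_sampled:.3f}, full: {loss_full:.3f}")
```

The sampled loss touches 101 rows of the weight matrix instead of 5,000. Because its denominator sums over fewer videos, it is never larger than the full loss, which is one reason real implementations correct for the sampling distribution.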
Adapting YouTube DNN for Indian OTT: For Hotstar/Zee5/SonyLIV, add a language embedding as a critical feature. India's multilingual user base means a user from Chennai who watches Tamil content 80% of the time but occasionally watches Hindi cricket commentary requires nuanced language-aware modelling that the original YouTube paper didn't address.
19.8 Recommendation Fairness: The Indian Context
Recommendation systems don't just reflect user preferences; they shape them. In a country as diverse as India, unfair recommendations can have outsized societal impact.
Urban vs Rural Bias: Most training data comes from urban users (who generate 75%+ of e-commerce transactions). Models trained on this data systematically under-recommend products relevant to rural India: agricultural tools, regional language books, affordable feature-phone-compatible services. A farmer in Vidarbha searching for "कापूस बियाणे" (cotton seeds) gets shown premium organic seeds from urban Pune stores instead of affordable local options.
Language Bias: Hindi and English content dominates training data on platforms like Flipkart and Amazon India. Products with Kannada, Odia, or Assamese descriptions are systematically ranked lower because the NLP models have weaker representations for these languages. A study by IIT Bombay (2023) showed that product discovery rates for items listed only in regional languages were 3.2× lower than equivalent Hindi-listed items.
Specific Bias Types in Indian RecSys
| Bias Type | Description | Indian Example |
|---|---|---|
| Popularity Bias | Popular items get recommended more → become even more popular (rich-get-richer) | Shah Rukh Khan movies dominate OTT homepages; Marathi/Bengali indie films get buried |
| Cold-Start Bias | New sellers/creators can't get recommendations without interaction history | A new artisan on Amazon India from Moradabad gets zero visibility vs. established sellers |
| Geographic Bias | Metro-centric models don't understand Tier-2/3 city preferences | Recommending ₹50,000 smartphones to users in small-town Madhya Pradesh |
| Gender Bias | Stereotypical associations in training data | Kitchen appliances recommended only to women; tech gadgets only to men |
| Price Bias | High-margin items get promoted over affordable alternatives | Recommending premium brands when user's purchase history shows ₹200–500 range |
Mitigation Techniques
- Re-ranking with fairness constraints: After the model scores items, re-rank to ensure minimum exposure for underrepresented categories/languages
- Calibrated recommendations: If a user watches 40% Tamil and 60% Hindi content, ensure recommendations reflect this ratio
- Exploration slots: Reserve 10–15% of recommendation slots for diverse/fresh content outside the user's typical patterns
- Counterfactual evaluation: Measure: "Would this user have liked this item if we had recommended it?" using causal inference
- Multi-stakeholder fairness: Balance user satisfaction, seller exposure, and platform revenue, not just one metric
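The first technique, re-ranking with a minimum-exposure constraint, can be sketched in plain Python. `rerank_with_min_exposure` is a hypothetical helper: it assumes the input list is already sorted best-first by the model, that `protected` is the set of item IDs from an underrepresented group (for example, regional-language listings), and that `min_protected` is at most `k`.

```python
def rerank_with_min_exposure(ranked, protected, k=10, min_protected=2):
    """Guarantee at least `min_protected` protected items in the top-k, if available."""
    top = list(ranked[:k])
    have = sum(1 for item in top if item in protected)
    reserve = [item for item in ranked[k:] if item in protected]   # best-first
    need = min(max(0, min_protected - have), len(reserve))
    promoted = reserve[:need]

    # Drop the lowest-ranked unprotected items to make room for the promoted ones.
    kept = list(top)
    for item in reversed(top):
        if need == 0:
            break
        if item not in protected:
            kept.remove(item)
            need -= 1
    return kept + promoted

ranked = list(range(20))          # item 0 has the highest model score
protected = {3, 12, 15}           # hypothetical regional-language items
result = rerank_with_min_exposure(ranked, protected, k=10, min_protected=2)
print(result)
```

Only the tail of the list changes, so the highest-scoring items keep their positions. Stricter variants also control where in the list the promoted items appear, since exposure drops sharply with rank.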
ONDC & Fair Recommendations: India's Open Network for Digital Commerce (ONDC) explicitly mandates that recommendation algorithms must not unfairly discriminate based on seller size or geography. This is a world-first regulatory requirement for recommendation fairness, and it pushes Indian tech companies to build fairness into their ML pipelines from Day 1.
From-Scratch Code
4A. Matrix Factorization with Gradient Descent (NumPy)
We'll implement biased matrix factorization from scratch using SGD on a Bollywood movie ratings dataset.
Python
```python
import numpy as np


class MatrixFactorization:
    """
    Biased Matrix Factorization with SGD
        r̂_ui = μ + b_u + b_i + p_u · q_i
    Bollywood Movie Recommendation from Scratch
    """

    def __init__(self, n_users, n_items, n_factors=50, lr=0.005,
                 reg=0.02, n_epochs=20, random_state=42):
        self.n_users = n_users
        self.n_items = n_items
        self.n_factors = n_factors
        self.lr = lr
        self.reg = reg
        self.n_epochs = n_epochs
        np.random.seed(random_state)
        # Initialize latent factor matrices (small random values)
        self.P = np.random.normal(0, 0.1, (n_users, n_factors))  # User factors
        self.Q = np.random.normal(0, 0.1, (n_items, n_factors))  # Item factors
        # Bias terms
        self.b_u = np.zeros(n_users)  # User bias
        self.b_i = np.zeros(n_items)  # Item bias
        self.mu = 0.0                 # Global mean

    def fit(self, ratings):
        """
        Train on list of (user_id, item_id, rating) tuples.

        Parameters
        ----------
        ratings : list of tuples [(u, i, r), ...]
            u = user index, i = item index, r = rating (1-5)
        """
        ratings = list(ratings)  # copy, so shuffling does not reorder the caller's list
        # Compute global mean
        self.mu = np.mean([r for _, _, r in ratings])
        training_loss = []
        for epoch in range(self.n_epochs):
            # Shuffle training data each epoch
            np.random.shuffle(ratings)
            total_loss = 0
            for u, i, r in ratings:
                # Predict: r̂ = μ + b_u + b_i + p_u · q_i
                pred = (self.mu + self.b_u[u] + self.b_i[i]
                        + np.dot(self.P[u], self.Q[i]))
                # Compute error
                error = r - pred
                total_loss += error ** 2
                # Update biases
                self.b_u[u] += self.lr * (error - self.reg * self.b_u[u])
                self.b_i[i] += self.lr * (error - self.reg * self.b_i[i])
                # Update latent factors
                # p_u ← p_u + α(e_ui · q_i − λ · p_u)
                P_old = self.P[u].copy()  # q_i must be updated with the old p_u
                self.P[u] += self.lr * (error * self.Q[i] - self.reg * self.P[u])
                self.Q[i] += self.lr * (error * P_old - self.reg * self.Q[i])
            # Regularization term, reported for monitoring only
            reg_loss = self.reg * (
                np.sum(self.P ** 2) + np.sum(self.Q ** 2)
                + np.sum(self.b_u ** 2) + np.sum(self.b_i ** 2)
            )
            rmse = np.sqrt(total_loss / len(ratings))
            training_loss.append(rmse)
            if (epoch + 1) % 5 == 0:
                print(f"Epoch {epoch+1:3d}/{self.n_epochs} | "
                      f"RMSE: {rmse:.4f} | Reg Loss: {reg_loss:.2f}")
        return training_loss

    def predict(self, user_id, item_id):
        """Predict rating for a (user, item) pair."""
        pred = (self.mu + self.b_u[user_id] + self.b_i[item_id]
                + np.dot(self.P[user_id], self.Q[item_id]))
        # Clip to valid rating range
        return np.clip(pred, 1.0, 5.0)

    def recommend(self, user_id, n=10, exclude_rated=None):
        """Return top-n item recommendations for a user."""
        scores = []
        for i in range(self.n_items):
            if exclude_rated and i in exclude_rated:
                continue
            scores.append((i, self.predict(user_id, i)))
        # Sort by predicted rating, descending
        scores.sort(key=lambda x: x[1], reverse=True)
        return scores[:n]


# --- Demo: Bollywood Movie Ratings ---
# Movie mapping (index → name)
movies = {
    0: "Dangal", 1: "3 Idiots", 2: "KGF", 3: "Bahubali",
    4: "Pushpa", 5: "DDLJ", 6: "PK", 7: "Zindagi NMDD",
    8: "RRR", 9: "Kantara", 10: "Lagaan", 11: "Drishyam",
}
users = {0: "Priya", 1: "Rahul", 2: "Aisha", 3: "Vikram",
         4: "Neha", 5: "Arjun", 6: "Meera", 7: "Karthik"}

# Simulated ratings: (user_id, item_id, rating)
ratings = [
    (0,0,5), (0,1,4), (0,3,3), (0,5,5), (0,10,4),
    (1,1,5), (1,2,4), (1,4,3), (1,8,5), (1,9,4),
    (2,0,4), (2,2,5), (2,3,4), (2,8,4), (2,9,5),
    (3,1,3), (3,3,5), (3,4,4), (3,8,5), (3,11,4),
    (4,0,3), (4,2,4), (4,5,5), (4,6,4), (4,7,5),
    (5,2,5), (5,3,4), (5,4,5), (5,8,5), (5,9,4),
    (6,0,4), (6,5,5), (6,6,5), (6,7,4), (6,10,5),
    (7,2,4), (7,3,5), (7,9,5), (7,11,3),
]

# Train the model
mf = MatrixFactorization(n_users=8, n_items=12, n_factors=10,
                         lr=0.01, reg=0.02, n_epochs=50)
losses = mf.fit(ratings)

# Get recommendations for Priya (user 0)
# Exclude movies she has already rated
priya_rated = {0, 1, 3, 5, 10}
recs = mf.recommend(0, n=5, exclude_rated=priya_rated)
print("\nTop 5 Recommendations for Priya:")
for item_id, score in recs:
    print(f"  {movies[item_id]:<15s} -> Predicted Rating: {score:.2f}")
```
Notice how Priya, who rated DDLJ (5), Dangal (5), 3 Idiots (4), and Lagaan (4), gets recommended PK and Zindagi NMDD. This makes sense: she's a fan of Aamir Khan and feel-good Bollywood films. The latent factors have captured this pattern!
4B. Neural Collaborative Filtering from Scratch (NumPy)
Python
```python
import numpy as np


class NeuralCFScratch:
    """
    Simple Neural CF implemented with NumPy.
    Architecture: User Emb + Item Emb → Concat → Dense(64) → ReLU
                  → Dense(32) → ReLU → Dense(1) → Sigmoid
    For binary implicit feedback (clicked/not-clicked).
    """

    def __init__(self, n_users, n_items, emb_dim=32, lr=0.001):
        self.n_users = n_users
        self.n_items = n_items
        self.emb_dim = emb_dim
        self.lr = lr
        # Initialize embeddings
        scale = 0.01
        self.user_emb = np.random.randn(n_users, emb_dim) * scale
        self.item_emb = np.random.randn(n_items, emb_dim) * scale
        # Layer 1: (2 * emb_dim) → 64 (He initialization)
        self.W1 = np.random.randn(2 * emb_dim, 64) * np.sqrt(2.0 / (2 * emb_dim))
        self.b1 = np.zeros(64)
        # Layer 2: 64 → 32
        self.W2 = np.random.randn(64, 32) * np.sqrt(2.0 / 64)
        self.b2 = np.zeros(32)
        # Output layer: 32 → 1
        self.W3 = np.random.randn(32, 1) * np.sqrt(2.0 / 32)
        self.b3 = np.zeros(1)

    def relu(self, x):
        return np.maximum(0, x)

    def relu_grad(self, x):
        return (x > 0).astype(np.float64)

    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

    def forward(self, user_id, item_id):
        """Forward pass through the network (caches activations for backward)."""
        # Get embeddings
        self.p = self.user_emb[user_id]             # (emb_dim,)
        self.q = self.item_emb[item_id]             # (emb_dim,)
        # Concatenate
        self.x0 = np.concatenate([self.p, self.q])  # (2*emb_dim,)
        # Layer 1
        self.z1 = self.x0 @ self.W1 + self.b1       # (64,)
        self.a1 = self.relu(self.z1)                # (64,)
        # Layer 2
        self.z2 = self.a1 @ self.W2 + self.b2       # (32,)
        self.a2 = self.relu(self.z2)                # (32,)
        # Output
        self.z3 = self.a2 @ self.W3 + self.b3       # (1,)
        self.y_hat = self.sigmoid(self.z3)          # (1,)
        return self.y_hat[0]

    def backward(self, user_id, item_id, y_true):
        """Backward pass + parameter updates."""
        # Binary cross-entropy gradient at output
        dz3 = (self.y_hat - y_true)  # (1,); sigmoid + BCE simplifies!
        # Output layer gradients
        dW3 = self.a2.reshape(-1, 1) @ dz3.reshape(1, -1)
        db3 = dz3
        # Layer 2 gradients
        da2 = (dz3 @ self.W3.T).flatten()
        dz2 = da2 * self.relu_grad(self.z2)
        dW2 = self.a1.reshape(-1, 1) @ dz2.reshape(1, -1)
        db2 = dz2
        # Layer 1 gradients
        da1 = (dz2 @ self.W2.T).flatten()
        dz1 = da1 * self.relu_grad(self.z1)
        dW1 = self.x0.reshape(-1, 1) @ dz1.reshape(1, -1)
        db1 = dz1
        # Embedding gradients
        dx0 = (dz1 @ self.W1.T).flatten()
        dp = dx0[:self.emb_dim]
        dq = dx0[self.emb_dim:]
        # Update parameters (SGD)
        self.W3 -= self.lr * dW3
        self.b3 -= self.lr * db3
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1
        self.user_emb[user_id] -= self.lr * dp
        self.item_emb[item_id] -= self.lr * dq

    def train(self, data, n_epochs=20):
        """
        Train on implicit feedback data.
        data: list of (user_id, item_id, label) where label ∈ {0, 1}
        """
        for epoch in range(n_epochs):
            np.random.shuffle(data)
            total_loss = 0
            for u, i, y in data:
                y_hat = self.forward(u, i)
                # Binary cross-entropy loss
                loss = -(y * np.log(y_hat + 1e-8)
                         + (1 - y) * np.log(1 - y_hat + 1e-8))
                total_loss += loss
                self.backward(u, i, y)
            if (epoch + 1) % 5 == 0:
                avg_loss = total_loss / len(data)
                print(f"Epoch {epoch+1:3d} | BCE Loss: {avg_loss:.4f}")


# --- Demo (reuses the `ratings` list from Section 4A) ---
ncf = NeuralCFScratch(n_users=8, n_items=12, emb_dim=16, lr=0.005)

# Convert explicit ratings to implicit: rating ≥ 4 → positive (1)
implicit_data = [(u, i, 1) for u, i, r in ratings if r >= 4]

# Add negative samples (random unrated items)
rated_pairs = {(u, i) for u, i, _ in ratings}
for _ in range(len(implicit_data)):
    u = np.random.randint(8)
    i = np.random.randint(12)
    if (u, i) not in rated_pairs:
        implicit_data.append((u, i, 0))

ncf.train(implicit_data, n_epochs=30)
```
Industry Code: Neural CF with TensorFlow/Keras
5A. NeuMF Model (TensorFlow)
TensorFlow
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd

# --- Load MovieLens 100K Dataset ---
# Download from: https://grouplens.org/datasets/movielens/100k/
# For this demo, we read the raw tab-separated ratings file directly
url = "https://files.grouplens.org/datasets/movielens/ml-100k/u.data"
columns = ["user_id", "item_id", "rating", "timestamp"]
df = pd.read_csv(url, sep="\t", names=columns)

# Re-index to 0-based
df["user_id"] = df["user_id"] - 1
df["item_id"] = df["item_id"] - 1
n_users = df["user_id"].nunique()
n_items = df["item_id"].nunique()
print(f"Users: {n_users}, Items: {n_items}, Ratings: {len(df)}")

# Convert to implicit: rating ≥ 4 → 1, else 0
df["label"] = (df["rating"] >= 4).astype(np.float32)

# Train/test split by timestamp (temporal split is more realistic)
df = df.sort_values("timestamp")
split_idx = int(len(df) * 0.8)
train_df = df.iloc[:split_idx]
test_df = df.iloc[split_idx:]


# --- Build NeuMF Model ---
def build_neumf(n_users, n_items, emb_dim=64, mlp_layers=(128, 64, 32)):
    """
    Neural Matrix Factorization (NeuMF) combining:
    - GMF path (element-wise product of embeddings)
    - MLP path (concatenation through FC layers)
    """
    # Input
    user_input = layers.Input(shape=(1,), name="user_input")
    item_input = layers.Input(shape=(1,), name="item_input")

    # -- GMF Path --
    gmf_user_emb = layers.Embedding(n_users, emb_dim, name="gmf_user_emb")(user_input)
    gmf_item_emb = layers.Embedding(n_items, emb_dim, name="gmf_item_emb")(item_input)
    gmf_user_emb = layers.Flatten()(gmf_user_emb)
    gmf_item_emb = layers.Flatten()(gmf_item_emb)
    # Element-wise product (Generalized MF)
    gmf_output = layers.Multiply()([gmf_user_emb, gmf_item_emb])

    # -- MLP Path (separate embeddings from the GMF path) --
    mlp_user_emb = layers.Embedding(n_users, emb_dim, name="mlp_user_emb")(user_input)
    mlp_item_emb = layers.Embedding(n_items, emb_dim, name="mlp_item_emb")(item_input)
    mlp_user_emb = layers.Flatten()(mlp_user_emb)
    mlp_item_emb = layers.Flatten()(mlp_item_emb)
    # Concatenate and pass through MLP
    mlp_concat = layers.Concatenate()([mlp_user_emb, mlp_item_emb])
    x = mlp_concat
    for units in mlp_layers:
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(0.2)(x)
    mlp_output = x

    # -- Combine GMF + MLP --
    combined = layers.Concatenate()([gmf_output, mlp_output])
    # Final prediction
    output = layers.Dense(1, activation="sigmoid", name="prediction")(combined)
    model = keras.Model(inputs=[user_input, item_input], outputs=output)
    return model


# Build and compile
model = build_neumf(n_users, n_items, emb_dim=64)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy", keras.metrics.AUC(name="auc")],
)
model.summary()

# --- Train ---
history = model.fit(
    [train_df["user_id"].values, train_df["item_id"].values],
    train_df["label"].values,
    batch_size=256,
    epochs=10,
    validation_data=(
        [test_df["user_id"].values, test_df["item_id"].values],
        test_df["label"].values,
    ),
    verbose=1,
)

# --- Evaluate ---
test_loss, test_acc, test_auc = model.evaluate(
    [test_df["user_id"].values, test_df["item_id"].values],
    test_df["label"].values,
)
print(f"\nTest Accuracy: {test_acc:.4f}")
print(f"Test AUC: {test_auc:.4f}")
```
5B. Two-Tower Model for Bollywood Movie Retrieval
TensorFlow

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


def build_two_tower(n_users, n_items, n_genres=20, n_languages=12,
                    emb_dim=32, tower_dim=64, temperature=0.05):
    """
    Two-Tower Retrieval Model for Indian OTT platform.
    User tower: user_id + preferred_language
    Item tower: item_id + genre + language
    """
    # ── USER TOWER ──
    user_id_input = layers.Input(shape=(1,), name="user_id")
    user_lang_input = layers.Input(shape=(1,), name="user_lang")

    user_emb = layers.Embedding(n_users, emb_dim)(user_id_input)
    user_emb = layers.Flatten()(user_emb)
    user_lang = layers.Embedding(n_languages, 8)(user_lang_input)
    user_lang = layers.Flatten()(user_lang)

    user_concat = layers.Concatenate()([user_emb, user_lang])
    user_vec = layers.Dense(tower_dim, activation="relu")(user_concat)
    user_vec = layers.Dense(tower_dim, activation=None)(user_vec)
    # L2-normalize so the dot product below is a cosine similarity.
    # UnitNormalization is a Keras layer, so it works on symbolic tensors
    # (calling tf.nn.l2_normalize directly fails under Keras 3).
    user_vec = layers.UnitNormalization(axis=-1)(user_vec)

    # ── ITEM TOWER ──
    item_id_input = layers.Input(shape=(1,), name="item_id")
    genre_input = layers.Input(shape=(1,), name="genre")
    item_lang_input = layers.Input(shape=(1,), name="item_lang")

    item_emb = layers.Embedding(n_items, emb_dim)(item_id_input)
    item_emb = layers.Flatten()(item_emb)
    genre_emb = layers.Embedding(n_genres, 8)(genre_input)
    genre_emb = layers.Flatten()(genre_emb)
    item_lang = layers.Embedding(n_languages, 8)(item_lang_input)
    item_lang = layers.Flatten()(item_lang)

    item_concat = layers.Concatenate()([item_emb, genre_emb, item_lang])
    item_vec = layers.Dense(tower_dim, activation="relu")(item_concat)
    item_vec = layers.Dense(tower_dim, activation=None)(item_vec)
    item_vec = layers.UnitNormalization(axis=-1)(item_vec)

    # ── Dot product similarity ──
    similarity = layers.Dot(axes=1)([user_vec, item_vec])
    # Cosine similarity lies in [-1, 1], so a bare sigmoid could only output
    # values between about 0.27 and 0.73. Dividing by a temperature widens the
    # logit range. The default of 0.05 is an assumed starting point; tune it.
    logits = layers.Rescaling(1.0 / temperature)(similarity)
    output = layers.Activation("sigmoid")(logits)

    model = keras.Model(
        inputs=[user_id_input, user_lang_input,
                item_id_input, genre_input, item_lang_input],
        outputs=output
    )
    return model


# Build
two_tower = build_two_tower(
    n_users=100000,
    n_items=50000,
    n_genres=20,
    n_languages=12
)
two_tower.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"]
)
two_tower.summary()

print("\nTwo-Tower model ready!")
print("User tower output: 64-d normalized vector")
print("Item tower output: 64-d normalized vector")
print("At serving: pre-compute item vectors → FAISS index")
print("Real-time: compute user vector → ANN search → top-K")
```
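The serving pattern named in the final print statements (item vectors pre-computed offline, one user vector computed per request, top-K by inner product) can be sketched in plain NumPy. This is an exact brute-force search standing in for an ANN index such as FAISS. The random vectors, the sizes, and the helper names `l2_normalize` and `top_k_items` are illustrative assumptions, not outputs of the model above.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def top_k_items(user_vec, item_matrix, k):
    """Exact top-k retrieval by inner product (stand-in for an ANN index)."""
    scores = item_matrix @ user_vec              # one score per item
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k in linear time
    top = top[np.argsort(-scores[top])]          # sort only those k by score
    return top, scores[top]


rng = np.random.default_rng(0)

# Offline step: stand-in for item-tower outputs (random, purely illustrative)
item_matrix = l2_normalize(rng.normal(size=(5000, 64)))

# Online step: stand-in for the user-tower output of a single request
user_vec = l2_normalize(rng.normal(size=64))

ids, scores = top_k_items(user_vec, item_matrix, k=10)
print("Top-10 item ids:", ids)
print("Cosine scores:", scores.round(3))
```

Brute force costs one dot product per item on every request, which is the cost an ANN index is meant to avoid on a large catalogue; the interface (user vector in, top-K item ids out) stays the same.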
5C. Recommendation Evaluation Metrics
Python

```python
import numpy as np


def precision_at_k(recommended, relevant, k):
    """
    Precision@K: Of the top-K recommended, how many are relevant?

    Example:
        Recommended top-5 for Priya: [Dangal, KGF, PK, DDLJ, Pushpa]
        Relevant (actually liked): {Dangal, DDLJ, 3 Idiots}
        Precision@5 = 2/5 = 0.40
    """
    rec_k = recommended[:k]
    hits = len(set(rec_k) & set(relevant))
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Recall@K: Of all relevant items, how many appear in top-K?"""
    rec_k = recommended[:k]
    hits = len(set(rec_k) & set(relevant))
    return hits / len(relevant) if relevant else 0


def ndcg_at_k(recommended, relevant, k):
    """
    Normalized Discounted Cumulative Gain @ K.
    Rewards relevant items appearing higher in the ranking.

    DCG@K = Σ(i=1 to K) rel_i / log2(i+1)
    IDCG@K = DCG of perfect ranking
    NDCG = DCG / IDCG
    """
    dcg = 0.0
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            dcg += 1.0 / np.log2(i + 2)  # +2 because i is 0-indexed

    # Ideal DCG: all relevant items at top
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0


def hit_rate_at_k(recommended, relevant, k):
    """Hit Rate: Is there at least one relevant item in top-K?"""
    return 1.0 if len(set(recommended[:k]) & set(relevant)) > 0 else 0.0


# ─── Demo ───
recommended = ["Dangal", "KGF", "PK", "DDLJ", "Pushpa",
               "RRR", "3 Idiots", "Lagaan", "Kantara", "Bahubali"]
relevant = {"Dangal", "DDLJ", "3 Idiots", "Lagaan"}

for k in [3, 5, 10]:
    print(f"K={k:2d} | P@K={precision_at_k(recommended, relevant, k):.3f} | "
          f"R@K={recall_at_k(recommended, relevant, k):.3f} | "
          f"NDCG@K={ndcg_at_k(recommended, relevant, k):.3f} | "
          f"Hit={hit_rate_at_k(recommended, relevant, k):.0f}")
```
Visual Diagrams
6.1 Evolution of Recommendation Systems
6.2 Matrix Factorization: Geometric View
6.3 NeuMF Architecture (Detailed)
6.4 Two-Tower Serving Architecture
Worked Example
Matrix Factorization: Step-by-Step Hand Calculation
Setup: 3 users, 4 Bollywood movies, latent dimension k = 2.
Step 1: The Rating Matrix
Step 2: Initialize Latent Factors (k=2)
Q (items × 2) = [[0.3, 0.7], [0.6, 0.2], [0.7, 0.4], [0.5, 0.5]]
b_u = [0, 0, 0], b_i = [0, 0, 0, 0], μ = 4.43
Step 3: Predict r̂(Priya, Dangal)
= 4.43 + 0 + 0 + (0.5×0.3 + 0.3×0.7)
= 4.43 + 0.15 + 0.21 = 4.79
Step 4: Compute Error
e = r − r̂ = 5 − 4.79 = 0.21 (Priya's actual rating for Dangal is 5)
Step 5: Update Parameters (α = 0.01, λ = 0.02)
b_i[0] ← 0 + 0.01 × (0.21 − 0.02 × 0) = 0.0021
p_0[0] ← 0.5 + 0.01 × (0.21 × 0.3 − 0.02 × 0.5) = 0.5 + 0.01 × (0.063 − 0.01) = 0.5005
p_0[1] ← 0.3 + 0.01 × (0.21 × 0.7 − 0.02 × 0.3) = 0.3 + 0.01 × (0.147 − 0.006) = 0.3014
q_0[0] ← 0.3 + 0.01 × (0.21 × 0.5 − 0.02 × 0.3) = 0.3 + 0.01 × (0.105 − 0.006) = 0.3010
q_0[1] ← 0.7 + 0.01 × (0.21 × 0.3 − 0.02 × 0.7) = 0.7 + 0.01 × (0.063 − 0.014) = 0.7005
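The single SGD step above can be checked numerically. The sketch below is a minimal reproduction, assuming Priya's latent vector is [0.5, 0.3] (the values used in Step 3), Dangal's is row 0 of Q, and her observed rating is 5. The variable names are illustrative. Both vector updates read the old values of `p_u` and `q_i`, which is what the hand calculation does.

```python
import numpy as np

mu, alpha, lam = 4.43, 0.01, 0.02
p_u = np.array([0.5, 0.3])   # Priya's latent vector (assumed from Step 3)
q_i = np.array([0.3, 0.7])   # Dangal's latent vector (row 0 of Q)
b_u, b_i = 0.0, 0.0          # biases start at zero
r_ui = 5.0                   # observed rating (assumed from the interpretation)

# Forward pass: global mean + biases + dot product of latent vectors
pred = mu + b_u + b_i + p_u @ q_i
err = r_ui - pred

# Regularised SGD step; every update uses the pre-update parameters
b_u_new = b_u + alpha * (err - lam * b_u)
b_i_new = b_i + alpha * (err - lam * b_i)
p_new = p_u + alpha * (err * q_i - lam * p_u)
q_new = q_i + alpha * (err * p_u - lam * q_i)

print(f"prediction = {pred:.2f}, error = {err:.2f}")
print(f"b_u = {b_u_new:.4f}, b_i = {b_i_new:.4f}")
print("p_u =", np.round(p_new, 4))
print("q_i =", np.round(q_new, 4))
```

The user bias `b_u` receives the same kind of update as the item bias, because the error term appears in both gradients.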
Step 6: Predict Missing: r̂(Priya, KGF)
After training converges (many epochs), suppose we get:
b_Priya = 0.12, b_KGF = 0.08
r̂(Priya, KGF) = 4.43 + 0.12 + 0.08 + (0.82×0.91 + 0.15×0.73)
= 4.43 + 0.20 + 0.7462 + 0.1095 = 5.49 → clip to 5.0
Interpretation: The model predicts Priya would rate KGF at 5.0 (the maximum). Her ratings for Dangal (5) and 3 Idiots (4) suggest she appreciates well-made blockbusters, and KGF fits this pattern. Note how the latent factors captured this without explicit genre matching!
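The prediction and clipping in Step 6 can be written as a small helper. This is a sketch: the latent vectors [0.82, 0.15] and [0.91, 0.73] are read off the products in the calculation above, and the lower bound of 1.0 assumes a 1 to 5 star scale. `predict_rating` is a hypothetical name.

```python
import numpy as np


def predict_rating(mu, b_u, b_i, p_u, q_i, lo=1.0, hi=5.0):
    """Biased MF prediction, returned both raw and clipped to the rating scale."""
    raw = mu + b_u + b_i + float(np.dot(p_u, q_i))
    return raw, float(np.clip(raw, lo, hi))


raw, clipped = predict_rating(
    mu=4.43, b_u=0.12, b_i=0.08,
    p_u=[0.82, 0.15],   # Priya after training (values from Step 6)
    q_i=[0.91, 0.73],   # KGF after training (values from Step 6)
)
print(f"raw = {raw:.2f}, clipped = {clipped:.1f}")
```

Clipping only matters when predictions are reported or compared with RMSE; for ranking, the raw score orders items the same way until several items saturate at the ceiling.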
In industry, you'd never do SGD on one sample at a time. Use mini-batch SGD with batch sizes of 256–1024 and the Adam optimizer. Our hand calculation shows the maths; the TensorFlow code in Section 5 shows the practical approach.
Case Study: Deep Learning Recommendations and the Cold-Start Challenge at Amazon India
📦 Amazon India: From ₹0 Visibility to ₹2,000 Crore GMV Boost
The Problem
Amazon India onboards ~10,000 new sellers every month, many from Tier-2 and Tier-3 cities: Moradabad brassware artisans, Varanasi silk weavers, Ludhiana knitwear manufacturers. These sellers face the cold-start problem: they have zero interaction history, so pure collaborative filtering models have no signal with which to recommend their products.
Result: New sellers had near-zero visibility for their first 30–60 days, leading to:
- 45% of new sellers making zero sales in their first month
- 30% churning (leaving the platform) within 90 days
- Lost revenue estimated at ₹800+ crore annually
The Deep Learning Solution (2020–2023)
Amazon India deployed a multi-modal hybrid recommendation system:
1. Content-Based Cold Start Module
- Product image embeddings: EfficientNet-B4 fine-tuned on Amazon's product image corpus → 1792-d vector per product
- Product text embeddings: Multilingual BERT (supporting Hindi, Tamil, Telugu, Kannada product descriptions) → 768-d vector
- When a new seller lists products, the system finds visually and textually similar products from established sellers and inherits their recommendation signals
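One possible reading of "inherits their recommendation signals" is to give a new product a collaborative-filtering embedding built from its nearest neighbours in content space. The sketch below assumes that reading and is not Amazon's method. The toy 3-d content vectors, the 2-d CF vectors, and the function name `cold_start_embedding` are all hypothetical.

```python
import numpy as np


def cold_start_embedding(new_content, known_content, known_cf, k=3):
    """Approximate a CF embedding for a new item from its k nearest
    content neighbours, weighted by cosine similarity."""
    a = new_content / np.linalg.norm(new_content)
    b = known_content / np.linalg.norm(known_content, axis=1, keepdims=True)
    sims = b @ a                                  # cosine similarity to each known item
    neighbours = np.argsort(-sims)[:k]            # indices of the k most similar items
    weights = np.clip(sims[neighbours], 0.0, None)
    if weights.sum() == 0:                        # no positive similarity: plain average
        weights = np.ones(len(neighbours))
    weights = weights / weights.sum()
    return weights @ known_cf[neighbours], neighbours


# Toy catalogue: content vectors (image + text features) and learned CF vectors
known_content = np.array([[1.0, 0.0, 0.0],
                          [0.9, 0.1, 0.0],
                          [0.0, 1.0, 0.0],
                          [0.0, 0.0, 1.0]])
known_cf = np.array([[1.0, 0.0],
                     [0.8, 0.2],
                     [0.0, 1.0],
                     [-1.0, 0.0]])

new_content = np.array([1.0, 0.05, 0.0])          # a freshly listed product
emb, neighbours = cold_start_embedding(new_content, known_content, known_cf, k=2)
print("Nearest established products:", neighbours)
print("Inherited CF embedding:", emb.round(3))
```

Once the new product gathers its own interactions, the inherited vector can serve as the starting point for normal training rather than a random initialisation.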
2. Two-Tower Retrieval with Side Features
- User tower features: User ID embedding, browsing history (last 50 products averaged), city, preferred language, price sensitivity tier
- Item tower features: Product ID embedding (random init for new products), category embedding, price bucket, image embedding, text embedding, seller region embedding
- The seller region embedding was a critical addition: it captures geographic product quality patterns (e.g., Jaipur → jewellery, Rajkot → industrial tools)
3. Fairness-Aware Re-Ranking
- After the ranking model scores items, a re-ranking layer ensures:
- At least 15% of recommendations come from sellers active < 90 days (cold-start boost)
- Geographic diversity: no more than 40% of recommendations from any single metro city
- Language diversity: if user has browsed in multiple languages, reflect proportionally
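A greedy pass over the scored candidates is one simple way to enforce the first two constraints. The sketch below is an illustration under stated assumptions, not the production algorithm: candidates are plain dicts with hypothetical keys `id`, `score`, `is_new` and `metro`, the language-diversity rule is left out, and when the constraints cannot all be met it falls back to the best remaining item.

```python
import math
from collections import Counter


def fair_rerank(candidates, n, min_new_frac=0.15, max_city_frac=0.40):
    """Greedily build a top-n list with a new-seller quota and a per-metro cap."""
    pool = sorted(candidates, key=lambda c: c["score"], reverse=True)
    need_new = math.ceil(min_new_frac * n - 1e-9)    # minimum new-seller slots
    city_cap = math.floor(max_city_frac * n + 1e-9)  # maximum slots per metro
    result, city_count, new_count = [], Counter(), 0

    while len(result) < n and pool:
        slots_left = n - len(result)
        # If the remaining slots are all needed for the quota, force new sellers
        must_pick_new = (need_new - new_count) >= slots_left
        chosen = None
        for cand in pool:
            if must_pick_new and not cand["is_new"]:
                continue
            metro = cand["metro"]
            if metro is not None and city_count[metro] >= city_cap:
                continue
            chosen = cand
            break
        if chosen is None:            # constraints unsatisfiable: relax them
            chosen = pool[0]
        pool.remove(chosen)
        result.append(chosen)
        new_count += int(chosen["is_new"])
        if chosen["metro"] is not None:
            city_count[chosen["metro"]] += 1
    return result


# Toy candidates: scores fall with id; two low-ranked items are from new sellers
candidates = [
    {"id": i,
     "score": 1.0 - 0.01 * i,
     "is_new": i in (14, 17),
     "metro": "Mumbai" if i < 8 else "Delhi" if i < 14 else None}
    for i in range(20)
]
top10 = fair_rerank(candidates, n=10)
print([c["id"] for c in top10])
```

Without re-ranking, the top 10 would be items 0 to 9: eight from Mumbai and none from new sellers. The greedy pass trades a small amount of score for meeting both constraints.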
Results
| Metric | Before DL (2019) | After DL (2023) | Improvement |
|---|---|---|---|
| New seller first-month sales | 55% got ≥1 sale | 78% got ≥1 sale | +23 pp |
| New seller 90-day churn | 30% | 18% | −12 pp |
| Click-through rate (CTR) | 3.2% | 4.8% | +50% |
| GMV from recommendations | ₹12,000 Cr | ₹14,000 Cr | +₹2,000 Cr |
| Regional language product discovery | 2.1% CTR | 3.8% CTR | +81% |
Key Technical Lessons
- Multi-modal embeddings solve cold start: Even without interaction data, image + text similarity provides a strong recommendation signal
- Fairness constraints help business: Boosting new seller visibility didn't hurt user satisfaction (CTR went up!) because it increased product diversity
- Regional embeddings matter for India: A generic model trained on US data would miss Moradabad → brassware or Kanchipuram → silk associations
- Multilingual NLP is critical: 40% of Amazon India product listings are in Hindi or regional languages
Scale Context: Amazon India processes ~2 crore recommendation requests per second during Great Indian Festival sales. The two-tower retrieval latency must stay under 15ms for the app to feel responsive. This is why pre-computing item embeddings offline and using FAISS for ANN search is essential: you can't run a full neural network forward pass per item for 15 crore products in real time.
Common Mistakes & Misconceptions
Mistake #1: Training/test split by random sampling. In production, you always predict future interactions. A random split lets the model train on interactions that happened after some of its test interactions (data leakage). Always use a temporal split: train on interactions before time T, test on interactions after T. On MovieLens, a random split can inflate AUC by roughly 5–8% compared to a temporal split.
Mistake #2: Using only positive interactions for training. If you train a model only on items users liked (positive samples), it never learns what a disliked or ignored item looks like. Always add negative samples, typically 4–10 random uninteracted items per positive item. Too few negatives → the model tends to predict everything as positive. Too many → class imbalance issues.
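A minimal sketch of uniform negative sampling, using 4 negatives per positive (the low end of the range above). The function name `sample_negatives` and the toy interaction pairs are hypothetical, and the loop assumes every user has left at least one item untouched.

```python
import numpy as np


def sample_negatives(pos_pairs, n_items, n_neg=4, seed=0):
    """For each positive (user, item) pair, draw n_neg items the user
    has not interacted with and label them 0."""
    rng = np.random.default_rng(seed)
    seen = {}
    for user, item in pos_pairs:
        seen.setdefault(user, set()).add(item)

    users, items, labels = [], [], []
    for user, item in pos_pairs:
        users.append(user)
        items.append(item)
        labels.append(1.0)
        drawn = 0
        while drawn < n_neg:
            candidate = int(rng.integers(n_items))
            if candidate in seen[user]:
                continue                      # skip items the user interacted with
            users.append(user)
            items.append(candidate)
            labels.append(0.0)
            drawn += 1
    return (np.array(users), np.array(items),
            np.array(labels, dtype=np.float32))


pos_pairs = [(0, 1), (0, 3), (1, 2)]          # toy positive interactions
u, i, y = sample_negatives(pos_pairs, n_items=50, n_neg=4)
print(f"{len(y)} training rows, {int(y.sum())} positive")
```

Uniform sampling is the simplest choice; sampling in proportion to item popularity is a common alternative, since it produces harder negatives.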
Mistake #3: Ignoring popularity bias in evaluation. A model that simply recommends the most popular items will have decent Precision@K and Hit Rate. Always compare against a popularity baseline. If your deep learning model only beats random but not the popularity baseline, it's not learning meaningful personalisation.
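A popularity baseline takes only a few lines, so there is little excuse for skipping it. The sketch below recommends the same most-interacted items to every user and reports Hit Rate; the toy interactions and helper names are hypothetical. For brevity it does not remove items a user has already seen, which a real baseline should do.

```python
from collections import Counter
from statistics import mean


def popularity_top_k(train_pairs, k):
    """Return the k items with the most training interactions."""
    counts = Counter(item for _, item in train_pairs)
    return [item for item, _ in counts.most_common(k)]


def hit_rate(recommended, relevant):
    """1.0 if any recommended item is relevant, else 0.0."""
    return 1.0 if set(recommended) & set(relevant) else 0.0


# Toy data: (user, item) training interactions and held-out relevant items
train_pairs = [(0, "Dangal"), (1, "Dangal"), (2, "Dangal"),
               (0, "KGF"), (1, "KGF"), (2, "PK"), (3, "Kantara")]
test_relevant = {0: {"PK"}, 1: {"Lagaan"}, 2: {"KGF"}, 3: {"Dangal"}}

top2 = popularity_top_k(train_pairs, k=2)
baseline_hr = mean(hit_rate(top2, rel) for rel in test_relevant.values())
print(f"Popularity top-2: {top2}, Hit Rate@2 = {baseline_hr:.2f}")
```

Any personalised model should be reported next to this number, computed on the same test split and the same K.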
Mistake #4: Embedding dimension too large. Students often set embedding_dim=256 or 512 for small datasets. For MovieLens 100K (943 users, 1,682 items), using 256-d embeddings means your user embedding matrix alone has 241,408 parameters, which is more than the number of ratings! This leads to severe overfitting. Rule of thumb: embedding_dim ≈ min(50, n_categories^0.25 × 8).
Mistake #5: Not handling the cold-start problem. Pure collaborative filtering models assign random embeddings to new users/items, producing garbage recommendations. In India, where platforms onboard lakhs of new users and products daily, every production system needs a cold-start fallback: content-based similarity, popularity-based defaults, or demographic-based initialisation.
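The parameter count and the rule of thumb are easy to check for the MovieLens 100K sizes quoted above. The helper names are hypothetical, and rounding to the nearest integer is an assumption about how the heuristic is applied.

```python
def embedding_params(n_categories, dim):
    """Number of trainable weights in one embedding table."""
    return n_categories * dim


def rule_of_thumb_dim(n_categories, cap=50):
    """Heuristic from the text: min(cap, 8 * n_categories ** 0.25)."""
    return min(cap, round(8 * n_categories ** 0.25))


n_users, n_items, n_ratings = 943, 1682, 100_000

for name, n in (("users", n_users), ("items", n_items)):
    big = embedding_params(n, 256)
    dim = rule_of_thumb_dim(n)
    small = embedding_params(n, dim)
    print(f"{name}: 256-d table = {big:,} params; "
          f"suggested dim = {dim} -> {small:,} params "
          f"(ratings available: {n_ratings:,})")
```

The heuristic is only a starting point; validation metrics should decide the final dimension.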
Mistake #6: Evaluating with accuracy on implicit feedback. For implicit feedback (click/no-click), accuracy is meaningless because 95%+ of samples are negative. A model predicting "no interaction" for everything gets 95% accuracy! Use ranking metrics: Precision@K, Recall@K, NDCG@K, and Hit Rate@K instead.
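The 95% figure is easy to reproduce. The toy example below builds 100 impressions with 5 clicks and scores a "model" that always predicts no interaction; the data is synthetic and only illustrates the arithmetic.

```python
import numpy as np

# 5 clicks among 100 impressions: the 95% negative rate quoted above
labels = np.array([1] * 5 + [0] * 95)

# A "model" that predicts no interaction for every impression
predictions = np.zeros_like(labels)

accuracy = (predictions == labels).mean()
clicks_found = int(((predictions == 1) & (labels == 1)).sum())

print(f"Accuracy: {accuracy:.2f}")
print(f"Clicks correctly predicted: {clicks_found} of {labels.sum()}")
```

The accuracy looks strong even though the model never surfaces a single item the user would click, which is exactly what ranking metrics are designed to expose.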
Comparison Table
10.1 Recommendation Approaches Compared
| Aspect | User-Based CF | Item-Based CF | Matrix Factorization | Neural CF (NCF) | Two-Tower |
|---|---|---|---|---|---|
| Core Idea | Similar users | Similar items | Latent factors (dot product) | Latent factors (neural net) | Separate user/item encoders |
| Cold Start | ❌ Fails | ❌ Fails | ❌ Fails | ⚠️ Partial (with side features) | ✅ Side features in towers |
| Scalability | O(n²): poor | O(n·m): moderate | O(k·nnz): good | O(forward pass): good | O(ANN search): excellent |
| Side Features | ❌ Hard | ❌ Hard | ⚠️ Possible but awkward | ✅ Easy (concatenate) | ✅ Natural |
| Nonlinear Patterns | ❌ No | ❌ No | ❌ Linear only | ✅ Yes | ✅ Yes |
| Serving Speed | Slow | Moderate | Fast | Moderate (full forward pass) | Very fast (~10ms) |
| Interpretability | ✅ Good | ✅ Good | ⚠️ Moderate | ❌ Poor | ❌ Poor |
| Indian Industry Use | Legacy systems | Flipkart (early) | Hotstar baseline | Amazon India ranking | Flipkart, Meesho retrieval |
10.2 Evaluation Metrics Compared
| Metric | Type | What It Measures | Best For |
|---|---|---|---|
| RMSE | Rating prediction | How close predicted rating is to actual | Explicit feedback (1–5 stars) |
| Precision@K | Ranking | Fraction of top-K that are relevant | Evaluating top results quality |
| Recall@K | Ranking | Fraction of all relevant items in top-K | Coverage of user's interests |
| NDCG@K | Ranking | Position-aware relevance (higher = better) | When ranking order matters |
| Hit Rate@K | Ranking | Is at least one relevant item in top-K? | Leave-one-out evaluation |
| AUC | Classification | Ability to distinguish positive from negative | Implicit feedback (click/no-click) |
| MAP@K | Ranking | Mean over users of each user's average precision in the top-K | System-wide ranking quality |
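MAP@K is the only metric in the table without an implementation in Section 5C, so a minimal sketch follows. It normalises each user's average precision by min(number of relevant items, K), which is one common convention; other definitions divide by the number of hits or by K. The function names are hypothetical, and the demo reuses the movie list from Section 5C.

```python
def average_precision_at_k(recommended, relevant, k):
    """Average of precision values taken at each rank where a hit occurs."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank          # precision at this rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0


def map_at_k(all_recommended, all_relevant, k):
    """Mean of per-user average precision; both arguments are dicts keyed by user."""
    users = list(all_recommended)
    if not users:
        return 0.0
    total = sum(average_precision_at_k(all_recommended[u], all_relevant[u], k)
                for u in users)
    return total / len(users)


recommended = ["Dangal", "KGF", "PK", "DDLJ", "Pushpa",
               "RRR", "3 Idiots", "Lagaan", "Kantara", "Bahubali"]
relevant = {"Dangal", "DDLJ", "3 Idiots", "Lagaan"}

ap5 = average_precision_at_k(recommended, relevant, 5)
ap10 = average_precision_at_k(recommended, relevant, 10)
print(f"AP@5 = {ap5:.3f}, AP@10 = {ap10:.3f}")
```

With a single user, MAP@K equals that user's AP@K; across many users it is the plain average, so every user counts equally regardless of how many items they have rated.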
Exercises
Section A: Multiple Choice Questions (10)
In matrix factorization, if we have 10,000 users, 5,000 items, and k=50 latent factors, how many total parameters are in the P and Q matrices (excluding biases)?
- 750,000
- 500,000
- 50,000,000
- 15,000
Which problem does collaborative filtering fundamentally fail to address without additional information?
- Scalability to large datasets
- The cold-start problem for new users/items
- Predicting ratings above 3.0
- Handling explicit feedback
In the YouTube DNN paper, why does the candidate generation model use "example age" as a feature?
- To predict the age of the user
- To correct for the bias toward recommending older (already popular) videos
- To filter out videos uploaded more than 1 year ago
- To estimate the video's production quality
What is the key advantage of Neural Collaborative Filtering (NCF) over standard matrix factorization?
- NCF requires fewer parameters
- NCF can model nonlinear user-item interactions through neural network layers
- NCF doesn't need embedding layers
- NCF works only with explicit feedback
In a two-tower recommendation model, item embeddings are typically pre-computed offline. What is the primary reason for this?
- Items change more frequently than users
- It enables sub-10ms retrieval via approximate nearest neighbour search at serving time
- Item embeddings require GPU computation that isn't available online
- It reduces the number of model parameters
A Flipkart recommendation model shows excellent RMSE on MovieLens but fails in production. The most likely reason is:
- MovieLens has explicit ratings; Flipkart uses implicit feedback (clicks, purchases)
- MovieLens movies are in English; Flipkart products are in Hindi
- MovieLens is too large a dataset
- Flipkart users don't watch movies
The NeuMF architecture combines a GMF path and an MLP path. What does the GMF path compute?
- Concatenation of user and item embeddings
- Element-wise product of user and item embeddings (p_u ⊙ q_i)
- Cross product of user and item embeddings
- Difference between user and item embeddings
Which evaluation metric is most appropriate when you want to reward relevant items appearing at the top of a recommendation list?
- RMSE
- Recall@K
- NDCG@K
- Hit Rate@K
In the YouTube DNN, the candidate generation model's training objective is:
- Mean Squared Error on predicted watch time
- Binary cross-entropy on click/no-click
- Extreme multi-class classification (softmax over all videos)
- Contrastive loss between positive and negative pairs
A recommendation system for Amazon India shows higher CTR in Mumbai (4.5%) than in Jaipur (2.1%). Which type of bias is most likely responsible?
- Selection bias: Mumbai users are more likely to click on anything
- Geographic/urban bias: the model was trained predominantly on urban metro data
- Item bias: Mumbai has better products
- Temporal bias: Mumbai users shop at different times
Section B: Short Answer Questions (5)
Explain what "latent factors" represent in matrix factorization. Give two concrete examples of latent factors that might emerge when factoring a Bollywood movie rating matrix.
Why does the YouTube DNN paper train on watch time rather than clicks in the ranking stage? What problem does this solve?
Describe the cold-start problem and explain one deep learning approach to mitigate it for new items on an Indian e-commerce platform.
What is the difference between the candidate generation stage and the ranking stage in the YouTube DNN? Why are two stages needed?
Explain "popularity bias" in recommendation systems and suggest one technique to mitigate it for a platform like JioSaavn serving music across India's diverse languages.
Section C: Long Answer Questions (3)
Design a complete deep learning recommendation system for Hotstar that handles: (a) cold-start for new shows, (b) multilingual content (Hindi, Tamil, Telugu, Malayalam, Kannada, Bengali), (c) real-time serving at 50 crore user scale during IPL. Describe the architecture, training data, features, and serving infrastructure in detail. [15 marks]
Compare and contrast Matrix Factorization, Neural Collaborative Filtering (NCF/NeuMF), and the Two-Tower model across the following dimensions: (a) mathematical formulation, (b) expressiveness, (c) training methodology, (d) cold-start handling, (e) serving latency, and (f) suitability for an Indian e-commerce platform with 10 crore users and 5 crore products. [12 marks]
Discuss recommendation fairness in the Indian context. Cover: (a) urban vs rural bias, (b) language bias, (c) the cold-start problem for small sellers, and (d) propose a concrete fairness-aware re-ranking algorithm with pseudocode. [12 marks]
Section D: Programming Questions (2)
Matrix Factorization with MovieLens. Download the MovieLens 100K dataset. Implement biased matrix factorization from scratch in NumPy (no ML libraries for the model itself). Train with SGD, evaluate with RMSE on a temporal 80/20 split. Experiment with k ∈ {10, 20, 50, 100} and λ ∈ {0.01, 0.02, 0.05}. Plot RMSE vs epochs for each configuration. Report the best hyperparameters and your final test RMSE.
Two-Tower Retrieval for Bollywood Movies. Build a two-tower model in TensorFlow/Keras for the MovieLens 100K dataset. User tower inputs: user_id, age_bucket, gender. Item tower inputs: item_id, genre (multi-hot encoded). Train with binary cross-entropy and negative sampling (4 negatives per positive). Evaluate with Recall@10 and NDCG@10. Then adapt the model for a Bollywood context: create a synthetic dataset of 500 users and 100 Bollywood movies with features (language, era, star_cast). Compare retrieval quality with and without content features.
Chapter Summary
Key Takeaways
- Collaborative Filtering leverages user-item interaction patterns. Classical CF (user-based, item-based) suffers from scalability and cold-start limitations. It works on the principle: "users who agreed in the past will agree in the future."
- Matrix Factorization decomposes the sparse user-item matrix into two low-rank matrices: $R \approx P Q^{\top}$. Each user and item gets a latent vector, and the predicted rating is their dot product plus biases. This was the Netflix Prize breakthrough.
- Neural Collaborative Filtering (NCF) replaces the linear dot product with a neural network, enabling the model to learn nonlinear interaction patterns. NeuMF combines a GMF path (element-wise product) with an MLP path for best results.
- Content-Based Deep Learning uses pre-trained models (BERT for text, ResNet for images, VGGish for audio) to create rich item embeddings, enabling recommendations based on item features rather than interaction history. This is critical for solving cold start.
- Hybrid Systems combine collaborative and content-based signals. In practice, all major Indian platforms (Flipkart, Amazon India, Hotstar) use hybrid architectures that blend multiple signal types.
- The Two-Tower Model is the workhorse of industrial RecSys. Separate user and item towers enable pre-computing item embeddings offline and using ANN search (FAISS/ScaNN) for sub-10ms retrieval at the scale of hundreds of millions of items.
- YouTube DNN introduced the two-stage paradigm: candidate generation (millions → hundreds) using a deep classifier + ANN, followed by ranking (hundreds → final list) using richer features. Key innovations include example age, watch-time weighting, and averaged embedding pooling.
- Evaluation for RecSys uses ranking metrics (Precision@K, Recall@K, NDCG@K, Hit Rate) rather than accuracy, which is misleading on imbalanced implicit feedback, or RMSE, which applies only to explicit rating prediction. Always use temporal train/test splits and compare against a popularity baseline.
- Recommendation Fairness is especially critical in India's diverse context. Urban bias, language bias, cold-start bias for new sellers, and popularity bias can exclude rural users, regional-language content, and small businesses. Mitigation: calibrated recommendations, exploration slots, fairness-aware re-ranking, and multi-stakeholder optimisation.
- Embedding layers are the foundation of all deep RecSys: they transform sparse categorical IDs (users, items, cities, languages) into dense, learnable vectors that capture semantic similarity.
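Matrix Factorization: $\hat{r}_{ui} = \mu + b_u + b_i + p_u \cdot q_i$
NCF: $\hat{r}_{ui} = \sigma\left(W^{\top} \left[\mathrm{GMF}(p_u, q_i) \,;\, \mathrm{MLP}(p_u, q_i)\right]\right)$
Two-Tower: $\mathrm{score} = f_{\mathrm{user}}(x_u)^{\top} f_{\mathrm{item}}(x_i)$
NDCG@K: $\mathrm{NDCG@K} = \mathrm{DCG@K} / \mathrm{IDCG@K}$, where $\mathrm{DCG@K} = \sum_{i=1}^{K} \mathrm{rel}_i / \log_2(i+1)$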
References
Foundational Papers
- Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. IEEE Computer, 42(8), 30–37. [The definitive Netflix Prize paper on MF]
- He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.S. (2017). Neural Collaborative Filtering. WWW 2017. [NCF/NeuMF architecture]
- Covington, P., Adams, J., & Sargin, E. (2016). Deep Neural Networks for YouTube Recommendations. RecSys 2016. [The YouTube DNN paper, among the most influential industrial RecSys papers]
- Cheng, H.T., et al. (2016). Wide & Deep Learning for Recommender Systems. DLRS 2016. [Google's Wide & Deep architecture]
- Yi, X., et al. (2019). Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations. RecSys 2019. [Google's two-tower with in-batch negatives]
Advanced / Modern
- Sun, F., et al. (2019). BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformers. CIKM 2019.
- Rendle, S. (2010). Factorization Machines. ICDM 2010. [Generalisation of MF for feature-rich settings]
- Wang, R., et al. (2017). Deep & Cross Network for Ad Click Predictions. ADKDD 2017.
- Johnson, J., Douze, M., & Jégou, H. (2019). Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data. [FAISS for ANN search]
Indian Context & Fairness
- Mehta, B., & Hofmann, T. (2008). A Survey of Attack-Resistant Collaborative Filtering Algorithms. IEEE Data Engineering Bulletin, 31(2).
- Singh, A., & Joachims, T. (2018). Fairness of Exposure in Rankings. KDD 2018. [Fairness-aware ranking algorithms]
- ONDC Documentation (2023). Open Network for Digital Commerce: Algorithm Transparency Guidelines. Government of India.
- Amazon India Engineering Blog (2022). Scaling Recommendations for 300M+ Users with Two-Tower Models.
- Flipkart Tech Blog (2021). Building Real-Time Personalisation at Scale: From MF to Deep Retrieval.
Textbooks
- Aggarwal, C.C. (2016). Recommender Systems: The Textbook. Springer. [Comprehensive reference]
- Ricci, F., Rokach, L., & Shapira, B. (2015). Recommender Systems Handbook. 2nd Edition, Springer.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. [Chapter 15 on representation learning]
Datasets
- GroupLens Research. MovieLens Datasets. Available at: https://grouplens.org/datasets/movielens/
- Amazon Product Reviews Dataset. Jianmo Ni, UCSD. Available at: https://nijianmo.github.io/amazon/
- Last.fm Dataset (for music recommendation research).