Neural Networks & Deep Learning

Chapter 19: Recommendation Systems with Deep Learning

When Algorithms Know What You Want Before You Do

⏱️ Reading Time: ~3 hours  |  📖 Part V: Applications  |  🧠 Theory + Code Chapter

📋 Prerequisites: Chapters 6–8 (Deep Networks, Backpropagation, Optimization), Embeddings (Ch 14/15), Basic Linear Algebra

Bloom's Taxonomy Map for This Chapter

Bloom's Level | What You'll Achieve
🔵 Remember | Recall the user-item interaction matrix, matrix factorization formula, and the architecture of two-tower models and YouTube DNN
🔵 Understand | Explain the difference between collaborative filtering, content-based filtering, and hybrid approaches; describe how embeddings capture latent factors
🟢 Apply | Implement matrix factorization from scratch with gradient descent and build a Neural Collaborative Filtering model in TensorFlow
🟡 Analyze | Analyze cold-start problems, popularity bias, and urban vs. rural recommendation fairness in Indian e-commerce contexts
🟠 Evaluate | Evaluate tradeoffs between collaborative vs. content-based vs. hybrid systems; select appropriate architectures for different business scenarios
🔴 Create | Design an end-to-end deep recommendation pipeline for a Bollywood movie platform with fairness constraints
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Define collaborative filtering, content-based filtering, and hybrid recommendation systems, and explain when each paradigm is most suitable
  • Derive the matrix factorization objective with regularization and implement gradient descent updates for user/item latent factor matrices
  • Explain how Neural Collaborative Filtering (NCF) generalises matrix factorization by replacing the dot product with a learned neural network
  • Implement embedding layers in PyTorch and TensorFlow to represent users and items as dense, low-dimensional vectors
  • Describe the two-tower architecture (user tower + item tower) and explain how approximate nearest neighbour search enables real-time retrieval at scale
  • Reproduce the key components of the YouTube recommendation DNN (2016 paper): the candidate generation and ranking stages
  • Build a from-scratch matrix factorization model using NumPy and a Neural CF model using TensorFlow/Keras on the MovieLens dataset
  • Analyze recommendation fairness issues: urban vs. rural bias, language bias, and cold-start problems for new sellers on Indian platforms
  • Evaluate recommendation quality using metrics like RMSE, Precision@K, Recall@K, NDCG, and Hit Rate
  • Design a hybrid deep learning recommendation system for an Indian e-commerce or OTT streaming scenario
Section 2

Opening Hook

🏏 IPL Final Night on Hotstar: 50 Crore Decisions per Second

Picture the night of an IPL final. Crores of viewers are streaming the match at the same moment on a platform like Disney+ Hotstar. The match is live, but here's what most viewers don't see: behind every screen, a deep recommendation engine is making hundreds of decisions.

"You watched MI vs CSK last week, so here's a highlight reel." "Since you paused KGF 2 at the interval, here's a Yash interview." "Your friend in Bengaluru is watching this Tamil web series…"

Hotstar's recommendation engine drives over 70% of total watch time. Without it, most users would open the app, feel overwhelmed by 100,000+ titles, and leave within 30 seconds. The algorithm doesn't just suggest; it decides what India watches.

From the ₹2,000 crore GMV boost at Amazon India to Spotify Wrapped playlists that know your Bollywood guilty pleasures, recommendation systems powered by deep learning are the invisible architects of modern digital India.

In this chapter, you'll learn to build them from scratch.

Hotstar · Amazon India · Flipkart · Spotify India · JioSaavn

Netflix estimated in 2016 that its recommendation system saves ₹8,300 crore ($1 billion) per year by reducing subscriber churn. If users can't find what they want within 60–90 seconds, they leave. In India, where ARPU (Average Revenue Per User) for OTT is just ₹100–150/month, the stakes per user are lower, but the scale (300M+ OTT users) more than compensates.

Section 3

Core Concepts

We'll build up from the simplest approaches to state-of-the-art deep learning architectures. The roadmap: Classical CF → Matrix Factorization → Neural CF → Content-Based DL → Hybrid → Two-Tower → YouTube DNN → Fairness.

19.1 Classical Collaborative Filtering (CF)

The fundamental idea behind collaborative filtering is beautifully simple: users who agreed in the past will agree in the future. If Priya and Rahul both loved 3 Idiots, Lagaan, and Zindagi Na Milegi Dobara, and Priya also liked Dil Chahta Hai, then Rahul will probably like it too.

The User-Item Interaction Matrix

At the heart of CF lies the user-item interaction matrix R, where R[u][i] represents user u's rating (or implicit feedback) for item i.

          Dangal  3 Idiots  KGF  Bahubali  Pushpa
Priya   [   5        4       ?      3        ?   ]
Rahul   [   ?        5       4      ?        3   ]
Aisha   [   4        ?       5      4        ?   ]
Vikram  [   ?        3       ?      5        4   ]
Neha    [   3        ?       4      ?        5   ]

? = missing rating → THIS is what we want to predict!

Sparsity: In real systems, 95–99.5% of entries are missing.
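As a minimal sketch, the toy matrix above can be held in a NumPy array with `np.nan` standing in for the "?" entries. The variable names (`R`, `observed`, `triples`) are illustrative choices, not part of any library.

```python
import numpy as np

movies = ["Dangal", "3 Idiots", "KGF", "Bahubali", "Pushpa"]
users = ["Priya", "Rahul", "Aisha", "Vikram", "Neha"]

# np.nan marks a missing rating (the "?" entries in the matrix).
R = np.array([
    [5, 4, np.nan, 3, np.nan],
    [np.nan, 5, 4, np.nan, 3],
    [4, np.nan, 5, 4, np.nan],
    [np.nan, 3, np.nan, 5, 4],
    [3, np.nan, 4, np.nan, 5],
])

observed = ~np.isnan(R)                      # boolean mask of known ratings
sparsity = 1.0 - observed.sum() / R.size     # fraction of missing entries

# Large systems typically store only the observed (user, item, rating) triples.
triples = [(u, i, R[u, i]) for u, i in zip(*np.nonzero(observed))]

print(f"Observed: {observed.sum()} of {R.size}, sparsity = {sparsity:.0%}")
```

Even this tiny example is 40% empty; storing triples instead of the dense matrix is what keeps memory manageable once sparsity reaches the 95–99.5% range.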

There are two classical flavours:

User-Based CF vs Item-Based CF

User-Based CF

Find users similar to the target user, then aggregate their ratings. "Users like you also watched…"

Similarity: Cosine similarity or Pearson correlation between user rating vectors.

Item-Based CF

Find items similar to what the user has already liked, then recommend those. "Because you watched Dangal…"

Advantage: Item similarities are more stable than user similarities (items don't change; user tastes drift).
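The item-based idea can be sketched in a few lines of NumPy, assuming the toy rating matrix from earlier in this section. The helper names `item_cosine` and `predict_item_based` are hypothetical, and plain cosine over co-rated users is used here for simplicity; practical systems often mean-centre ratings first.

```python
import numpy as np

# Toy ratings (rows: Priya, Rahul, Aisha, Vikram, Neha; np.nan = missing).
R = np.array([
    [5, 4, np.nan, 3, np.nan],
    [np.nan, 5, 4, np.nan, 3],
    [4, np.nan, 5, 4, np.nan],
    [np.nan, 3, np.nan, 5, 4],
    [3, np.nan, 4, np.nan, 5],
])

def item_cosine(i, j):
    """Cosine similarity between items i and j over users who rated both."""
    both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])
    if not both.any():
        return 0.0
    a, b = R[both, i], R[both, j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_item_based(u, i):
    """Similarity-weighted average of the ratings user u gave to other items."""
    rated = [j for j in range(R.shape[1]) if j != i and not np.isnan(R[u, j])]
    sims = np.array([item_cosine(i, j) for j in rated])
    if sims.sum() == 0:
        return float(np.nanmean(R))          # fall back to the global mean
    return float(sims @ R[u, rated] / sims.sum())

# Predict Priya's (user 0) missing rating for KGF (item 2).
pred = predict_item_based(0, 2)
print(f"Predicted rating: {pred:.2f}")
```

With so few co-ratings, most similarities are close to 1 and the prediction collapses towards Priya's own average, which is a small-scale illustration of the sparsity limitation discussed next.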

Limitations of Classical CF
  • Cold Start: New users or new items have no interactions → no similarity can be computed
  • Scalability: Computing pairwise similarities for 50 crore users is O(n²), which is infeasible
  • Sparsity: With 99%+ of entries missing, reliable neighbours are hard to find

Flipkart's Scale Challenge: With 45+ crore registered users and 15+ crore products, the user-item matrix has ~6.75 × 10¹⁶ potential entries (4.5 × 10⁸ users × 1.5 × 10⁸ products). Classical CF with pairwise similarity is computationally infeasible at this scale. This is why Flipkart moved to deep learning-based recommendations in 2019.

19.2 Matrix Factorization: The Breakthrough

The Netflix Prize (launched in 2006) showed that matrix factorization (MF) can dramatically outperform classical neighbourhood-based CF. The core insight: approximate the sparse user-item matrix by the product of two low-rank matrices.

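$$R \approx P Q^{T}$$
$$R \in \mathbb{R}^{m \times n}, \quad P \in \mathbb{R}^{m \times k} \text{ (user factors)}, \quad Q \in \mathbb{R}^{n \times k} \text{ (item factors)}$$
Predicted rating: $$\hat{r}_{ui} = p_u \cdot q_i = \sum_{f=1}^{k} p_{uf}\, q_{if}$$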

Each user u is represented by a k-dimensional latent vector $p_u$, and each item i by $q_i$. The predicted rating is their dot product. The parameter k (typically 20–200) controls how many latent factors we learn.
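A minimal NumPy sketch of the factorized prediction follows. The sizes and the random (untrained) factor values are assumptions chosen only to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 5, 3                       # users, items, latent factors (toy sizes)

P = rng.normal(0, 0.1, (m, k))          # user factors, one row per user
Q = rng.normal(0, 0.1, (n, k))          # item factors, one row per item

R_hat = P @ Q.T                         # all m x n predicted ratings at once
r_hat_02 = P[0] @ Q[2]                  # single prediction: dot product for u=0, i=2

print(R_hat.shape, np.isclose(R_hat[0, 2], r_hat_02))
```

The model stores (m + n) · k numbers instead of m · n ratings, and the reconstructed matrix has rank at most k; that low-rank constraint is what lets it generalise to unseen user-item pairs.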

What Are Latent Factors?

Think of each dimension as capturing an abstract "taste axis." For Bollywood movies:

  • Factor 1: Action vs. Romance (Pushpa scores high on action; DDLJ scores high on romance)
  • Factor 2: Regional vs. Pan-India (Kantara scores high on regional; RRR scores high on pan-India)
  • Factor 3: Indie/art-house vs. Commercial (Ship of Theseus vs. Singham)

Neither users nor the algorithm explicitly name these factors; they emerge automatically from the data.

Optimization Objective

$$\min_{P,Q} \sum_{(u,i) \in \text{observed}} \left(r_{ui} - p_u \cdot q_i\right)^2 + \lambda\left(\|P\|^2 + \|Q\|^2\right)$$
$\lambda$ = regularization strength to prevent overfitting

We only sum over observed ratings (not missing ones). The regularization term $\lambda(\|P\|^2 + \|Q\|^2)$ penalises large latent factor values, which discourages the model from memorising the training data.
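The objective can be written down directly as a function. This is a sketch under the assumption that ratings arrive as (user, item, rating) triples; `mf_objective` is a hypothetical helper name.

```python
import numpy as np

def mf_objective(ratings, P, Q, lam):
    """Squared error over observed (u, i, r) triples plus an L2 penalty on P and Q."""
    sq_err = sum((r - P[u] @ Q[i]) ** 2 for u, i, r in ratings)
    penalty = lam * (np.sum(P ** 2) + np.sum(Q ** 2))
    return sq_err + penalty

ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 1, 5.0)]   # toy observed ratings
P0 = np.zeros((2, 3))
Q0 = np.zeros((2, 3))
loss_at_zero = mf_objective(ratings, P0, Q0, lam=0.02)            # predictions are all 0

P1 = np.ones((2, 3))
Q1 = np.ones((2, 3))
loss_at_ones = mf_objective(ratings, P1, Q1, lam=0.02)            # predictions are all 3

print(loss_at_zero, loss_at_ones)
```

Missing entries never appear in `ratings`, so they contribute nothing to the error term; only the penalty touches every row of P and Q.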

Gradient Descent Updates

For each observed rating $(u, i, r_{ui})$:

$$e_{ui} = r_{ui} - p_u \cdot q_i \quad \text{(prediction error)}$$
$$p_u \leftarrow p_u + \alpha\,(e_{ui}\, q_i - \lambda\, p_u)$$
$$q_i \leftarrow q_i + \alpha\,(e_{ui}\, p_u - \lambda\, q_i)$$
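To see where these update rules come from, one can differentiate the loss contributed by a single observed rating. The derivation below is a sketch: $\eta$ is a symbol introduced here for the raw gradient step size, and the factor of 2 is assumed to be absorbed into the learning rate $\alpha$.

```latex
\begin{aligned}
L_{ui} &= (r_{ui} - p_u \cdot q_i)^2 + \lambda\left(\|p_u\|^2 + \|q_i\|^2\right),
\qquad e_{ui} = r_{ui} - p_u \cdot q_i \\
\frac{\partial L_{ui}}{\partial p_u} &= -2\, e_{ui}\, q_i + 2\lambda\, p_u,
\qquad
\frac{\partial L_{ui}}{\partial q_i} = -2\, e_{ui}\, p_u + 2\lambda\, q_i \\
p_u &\leftarrow p_u - \eta\, \frac{\partial L_{ui}}{\partial p_u}
     = p_u + 2\eta\,(e_{ui}\, q_i - \lambda\, p_u) \\
q_i &\leftarrow q_i - \eta\, \frac{\partial L_{ui}}{\partial q_i}
     = q_i + 2\eta\,(e_{ui}\, p_u - \lambda\, q_i)
\end{aligned}
```

Setting $\alpha = 2\eta$ recovers the two update rules stated above.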

Adding Biases

Real ratings have systematic biases: some users are generous raters (Priya gives 4.2 on average), some movies are universally loved (Dangal has a 4.5 average). The biased MF model:

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u \cdot q_i$$
$\mu$ = global mean,   $b_u$ = user bias,   $b_i$ = item bias

In practice, biased MF with k = 50–100 latent factors and λ = 0.02 achieves RMSE close to state-of-the-art on MovieLens datasets. Start here before jumping to deep models: simpler models often perform surprisingly well!
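To get a feel for the bias terms on their own, here is a rough sketch that estimates them as simple averages on the toy matrix from Section 19.1. This is a simplification chosen for illustration: in the biased MF model the biases are learned jointly with the latent factors by gradient descent, not computed this way.

```python
import numpy as np

# Toy ratings (rows: Priya, Rahul, Aisha, Vikram, Neha; np.nan = missing).
R = np.array([
    [5, 4, np.nan, 3, np.nan],
    [np.nan, 5, 4, np.nan, 3],
    [4, np.nan, 5, 4, np.nan],
    [np.nan, 3, np.nan, 5, 4],
    [3, np.nan, 4, np.nan, 5],
])

mu = np.nanmean(R)                      # global mean over observed ratings
b_u = np.nanmean(R, axis=1) - mu        # each user's offset from the global mean
b_i = np.nanmean(R, axis=0) - mu        # each item's offset from the global mean

def baseline(u, i):
    """Bias-only prediction: mu + b_u + b_i (no latent factors)."""
    return mu + b_u[u] + b_i[i]

print(f"mu = {mu:.3f}, baseline(Priya, KGF) = {baseline(0, 2):.3f}")
```

The latent-factor term then only has to explain what is left over after these offsets are removed.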

19.3 Neural Collaborative Filtering (NCF)

Matrix factorization uses a simple dot product to model user-item interaction: $\hat{r}_{ui} = p_u \cdot q_i$. But what if the interaction is more complex than a linear combination? Neural Collaborative Filtering (He et al., 2017) replaces the dot product with a neural network.

NCF Architecture

Key Idea

Instead of computing $p_u \cdot q_i$, pass the concatenation $[p_u; q_i]$ through a multi-layer perceptron (MLP) to learn an arbitrary interaction function:

$$\hat{r}_{ui} = f_\theta(p_u, q_i) = \mathrm{MLP}([p_u; q_i])$$

Architecture Layers
  1. Input Layer: One-hot encoded user ID and item ID
  2. Embedding Layer: Maps user/item IDs → dense vectors ($p_u$, $q_i$)
  3. Interaction Layer: Concatenate $[p_u; q_i]$ (or element-wise product, or both)
  4. Hidden Layers: FC → ReLU → FC → ReLU → … (tower of decreasing width)
  5. Output Layer: Single neuron with sigmoid (for implicit) or linear (for explicit ratings)
NeuMF: Best of Both Worlds

The original NCF paper proposes NeuMF, which combines a Generalized Matrix Factorization (GMF) path with an MLP path. GMF preserves the dot-product signal; MLP adds nonlinear capacity. Their outputs are concatenated and fed to a final prediction layer.

User ID ──→ [Embedding] ──→ p_u
Item ID ──→ [Embedding] ──→ q_i

MLP Path:  [p_u; q_i] ──→ [Concatenate] ──→ [Dense 128] ──→ [Dense 64] ──→ [Dense 32]   (ReLU activations)
GMF Path:  p_u ⊙ q_i

[MLP output; GMF output] ──→ [Concat] ──→ [Dense 1 + Sigmoid] ──→ ŷ ∈ [0,1]
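The forward pass of this two-path design can be sketched in NumPy. Everything here is illustrative: the weights are random and untrained, the sizes are toy values, and both paths share one embedding table as in the diagram (the original NeuMF paper lets the GMF and MLP paths learn separate embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 8, 12, 16          # toy sizes, chosen for illustration

# Untrained parameters (random); a real model would learn these.
user_emb = rng.normal(0, 0.1, (n_users, d))
item_emb = rng.normal(0, 0.1, (n_items, d))
mlp_sizes = [2 * d, 128, 64, 32]
mlp_W = [rng.normal(0, np.sqrt(2 / a), (a, b))
         for a, b in zip(mlp_sizes[:-1], mlp_sizes[1:])]
mlp_b = [np.zeros(b) for b in mlp_sizes[1:]]
out_W = rng.normal(0, 0.1, (32 + d, 1))  # consumes [MLP output; GMF output]
out_b = np.zeros(1)

def neumf_forward(u, i):
    p, q = user_emb[u], item_emb[i]
    gmf = p * q                           # GMF path: element-wise product
    h = np.concatenate([p, q])            # MLP path: concatenation
    for W, b in zip(mlp_W, mlp_b):
        h = np.maximum(0.0, h @ W + b)    # Dense + ReLU
    z = np.concatenate([h, gmf]) @ out_W + out_b
    return float(1.0 / (1.0 + np.exp(-z[0])))   # sigmoid keeps the score in (0, 1)

score = neumf_forward(0, 3)
print(f"Untrained NeuMF score: {score:.4f}")
```

If the final layer put weight 1 on every GMF dimension and 0 on the MLP output, the pre-sigmoid value would be exactly the dot product, which is the sense in which GMF generalises matrix factorization.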

Why NCF > MF?

  • Expressiveness: MF can only model linear interactions (dot product); NCF can learn arbitrary nonlinear patterns
  • Feature incorporation: Easy to add side features (user age, item category) by concatenating to the embedding vector
  • Implicit feedback: NCF naturally handles binary signals (clicked/not-clicked) using binary cross-entropy loss

"Deep models always beat matrix factorization." This is false! On clean, dense rating data (like MovieLens 100K), well-tuned MF often matches or beats NCF. Deep models shine when you have (1) very sparse data, (2) implicit feedback, (3) rich side features, or (4) need to combine multiple signal types. Always benchmark against a simple MF baseline.

19.4 Content-Based Filtering with Deep Learning

Content-based filtering recommends items similar to what the user has previously liked, based on item features rather than collaborative signals. Deep learning revolutionises this approach by learning rich feature representations automatically.

Traditional Content Features vs. Deep Embeddings

Aspect | Traditional | Deep Learning
Text (Movie Plots) | TF-IDF, Bag of Words | BERT/Sentence-BERT embeddings (768-dim)
Images (Product Photos) | Colour histograms, SIFT | ResNet/EfficientNet feature vectors (2048-dim)
Audio (Song Features) | MFCC, tempo, key | VGGish / wav2vec embeddings
Video (Trailers) | Keyframe extraction | Video Transformer embeddings

Embedding-Based Content Similarity

Given a pre-trained model that maps items to embeddings:

$$\text{similarity}(\text{item}_a, \text{item}_b) = \cos(e_a, e_b) = \frac{e_a \cdot e_b}{\|e_a\|\,\|e_b\|}$$

Example: Using Sentence-BERT to embed Bollywood movie plots:

  • Dangal embedding ≈ Chak De! India embedding (both sports + patriotism)
  • Gangs of Wasseypur embedding ≈ Mirzapur embedding (both crime dramas + Hindi-heartland setting)
  • Cosine similarity between Dangal & Gangs of Wasseypur ≈ 0.15 (an illustrative value; very different genres)
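The similarity lookup itself is only a few lines. In the sketch below the 4-dimensional vectors are made-up numbers standing in for real plot embeddings (they are not Sentence-BERT output), and `most_similar` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical 4-d plot embeddings (invented for illustration).
emb = {
    "Dangal":             [0.9, 0.8, 0.1, 0.0],
    "Chak De! India":     [0.8, 0.9, 0.2, 0.1],
    "Gangs of Wasseypur": [0.0, 0.1, 0.9, 0.8],
    "Mirzapur":           [0.1, 0.0, 0.8, 0.9],
}

def most_similar(title, k=2):
    """Return the k titles whose embeddings are closest to the given title."""
    sims = [(other, cosine(emb[title], e)) for other, e in emb.items() if other != title]
    return sorted(sims, key=lambda t: t[1], reverse=True)[:k]

for other, sim in most_similar("Dangal"):
    print(f"{other:<20s} {sim:.3f}")
```

Because the score depends only on item features, a title with zero interactions can be ranked as soon as its embedding exists.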

Advantages of Content-Based DL

  • No cold-start for items: New movies/products can be recommended immediately using their features
  • Explainability: "Recommended because you liked sports dramas with strong female leads"
  • Less popularity bias: Niche items with good feature matches can still be surfaced, because scoring does not depend on interaction counts

JioSaavn's Music Recommendations: JioSaavn uses deep audio embeddings to recommend music across India's 22+ official languages. A user who listens to Arijit Singh (Hindi) might get recommended Sid Sriram (Tamil/Telugu) if the audio embeddings are similar, bridging language barriers through learned feature representations rather than explicit genre tags.

19.5 Hybrid Recommendation Systems

In practice, no single approach works best alone. Production systems at Flipkart, Amazon India, and Hotstar all use hybrid architectures that combine collaborative and content signals.

Hybrid Strategies

Strategy | How It Works | Example
Weighted | Blend scores: $s = \alpha \cdot s_{CF} + (1-\alpha) \cdot s_{CB}$ | Hotstar: 60% CF + 40% content for new shows
Switching | Use CB for cold-start, switch to CF when enough data | Meesho: content-based for new sellers
Feature Augmentation | CB features fed as input to CF model | Amazon India: product image embeddings in NCF
Cascade | Stage 1: CF filters candidates; Stage 2: CB re-ranks | YouTube: candidate gen → ranking
Meta-level | One model's output is another model's input | Netflix: MF embeddings fed to gradient-boosted trees
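The first two strategies in the table are simple enough to sketch directly. The function names, the default `alpha=0.6`, and the `min_interactions` threshold are assumptions made for illustration.

```python
def weighted_hybrid(s_cf, s_cb, alpha=0.6):
    """Weighted strategy: s = alpha * s_CF + (1 - alpha) * s_CB."""
    return alpha * s_cf + (1 - alpha) * s_cb

def switching_hybrid(s_cf, s_cb, n_interactions, min_interactions=5):
    """Switching strategy: use the content-based score until enough interactions exist."""
    return s_cb if n_interactions < min_interactions else s_cf

# A new item with no interaction history falls back to its content score.
print(weighted_hybrid(0.8, 0.5))                        # blend of both signals
print(switching_hybrid(0.8, 0.5, n_interactions=0))     # cold-start: content-based
print(switching_hybrid(0.8, 0.5, n_interactions=10))    # warm: collaborative
```

In practice `alpha` and the switching threshold would be tuned on held-out data rather than fixed by hand.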

Deep Hybrid Architecture

Modern deep learning naturally enables hybrid systems by concatenating different embedding types:

User Tower:
  User ID        ──→ [User Embedding]
  User Age       ──→ [Dense]
  User City      ──→ [City Embedding]
  Browse History ──→ [Avg Pool over item embeddings]
        all four ──→ [Concat] ──→ [Dense × 3] ──→ User Vector (128-d)

Item Tower:
  Item ID        ──→ [Item Embedding]
  Category       ──→ [Cat Embedding]
  Price          ──→ [Dense]
  Description    ──→ [BERT] ──→ [Dense]
  Image          ──→ [ResNet] ──→ [Dense]
        all five ──→ [Concat] ──→ [Dense × 3] ──→ Item Vector (128-d)

User Vector, Item Vector ──→ [Dot Product / Cosine] ──→ Score

This feature-concatenation pattern is shared by architectures such as Wide & Deep (Google, 2016), and the user-tower/item-tower layout above is the two-tower model we'll explore next.

19.6 The Two-Tower Model

The two-tower (or dual-encoder) architecture is the workhorse of industrial recommendation systems. It's used at Google, Facebook, Amazon, and most Indian tech companies at scale.

Two-Tower Architecture

Core Idea

Train two separate neural networks (towers): one for users, one for items. Each tower produces a fixed-dimensional embedding vector. Similarity is computed via dot product or cosine similarity.

Why Two Separate Towers?

The key advantage is decoupled computation:

  • Item embeddings can be pre-computed offline and stored in a vector index (FAISS, ScaNN, Milvus)
  • At serving time, only the user tower runs in real-time to produce the user embedding
  • Retrieval uses approximate nearest neighbour (ANN) search to find the top-K items closest to the user embedding in milliseconds, even over 100 million items
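The offline/online split can be sketched with plain NumPy. Note the swap: this example uses exact brute-force search as a stand-in for an ANN index such as FAISS or ScaNN, and the item and user vectors are random placeholders for real tower outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline: item-tower outputs, computed once and stored (random stand-ins here).
item_vecs = rng.normal(size=(10_000, 64)).astype(np.float32)
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)   # unit length

def retrieve_top_k(user_vec, k=10):
    """Exact top-k by dot product; an ANN index would approximate this step."""
    scores = item_vecs @ user_vec                 # one score per item
    top = np.argpartition(-scores, k)[:k]         # unordered top-k in linear time
    return top[np.argsort(-scores[top])]          # sort just those k by score

# Online: only the user tower runs per request (random stand-in here).
user_vec = rng.normal(size=64).astype(np.float32)
top_items = retrieve_top_k(user_vec, k=5)
print(top_items)
```

Brute force costs one dot product per item on every request; ANN indexes trade a small loss in recall for sub-linear lookup, which is what makes catalogues of 100 million items servable.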
Training

The model is trained on triples (user, item, label), where label = 1 for positive interactions and 0 for negative samples (randomly sampled non-interactions). Loss: binary cross-entropy or sampled softmax.
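A rough sketch of how such training triples might be assembled, together with the binary cross-entropy loss. The helper names and the choice of 4 negatives per positive are assumptions; uniform random sampling is the simplest option, and the loop assumes the catalogue is much larger than any one user's history.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_triples(positives, n_items, n_neg=4):
    """For each observed (user, item), add n_neg random items the user has not interacted with."""
    seen = {}
    for u, i in positives:
        seen.setdefault(u, set()).add(i)
    triples = []
    for u, i in positives:
        triples.append((u, i, 1))
        negs = 0
        while negs < n_neg:
            j = int(rng.integers(n_items))
            if j not in seen[u]:                  # skip items the user actually touched
                triples.append((u, j, 0))
                negs += 1
    return triples

def bce(labels, probs, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y = np.asarray(labels, dtype=float)
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

positives = [(0, 1), (0, 2), (1, 3)]              # toy (user, item) interactions
triples = sample_training_triples(positives, n_items=50, n_neg=4)
print(len(triples), "training triples;", "BCE at p=0.5:", round(bce([1, 0], [0.5, 0.5]), 4))
```

A sampled "negative" is only an item the user has not interacted with yet, so some of them are items the user would have liked; this label noise is a known cost of negative sampling.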

Serving Latency

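User tower inference: ~2ms. ANN search over 10M items: ~5ms. Total: under 10ms, fast enough for real-time recommendations. These are indicative figures; actual latency depends on hardware, model size, and index configuration.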

Flipkart's recommendation team reported that switching from a monolithic model to a two-tower architecture reduced serving latency from 150ms to 12ms while maintaining the same recommendation quality. The item tower embeddings for their 15 crore+ product catalogue are refreshed every 6 hours and stored in a FAISS index.

Embedding Layers: The Foundation

Both towers rely heavily on embedding layers to convert sparse categorical features (user IDs, item IDs, city codes, categories) into dense vectors.

PyTorch
import torch
import torch.nn as nn

# Embedding: maps integer IDs → dense vectors
user_embedding = nn.Embedding(
    num_embeddings=500000,  # 5 lakh users
    embedding_dim=64        # each user → 64-dim vector
)

item_embedding = nn.Embedding(
    num_embeddings=100000,  # 1 lakh items
    embedding_dim=64
)

# Usage
user_id = torch.tensor([42])
item_id = torch.tensor([1337])

user_vec = user_embedding(user_id)   # shape: (1, 64)
item_vec = item_embedding(item_id)   # shape: (1, 64)

# Dot product → predicted score
score = torch.sum(user_vec * item_vec, dim=1)
print(f"Predicted score: {score.item():.4f}")
TensorFlow
import tensorflow as tf

# TensorFlow Embedding layer
user_embedding = tf.keras.layers.Embedding(
    input_dim=500000,    # vocabulary size
    output_dim=64,       # embedding dimension
    name="user_embedding"
)

item_embedding = tf.keras.layers.Embedding(
    input_dim=100000,
    output_dim=64,
    name="item_embedding"
)

# Lookup
user_vec = user_embedding(tf.constant([42]))     # (1, 64)
item_vec = item_embedding(tf.constant([1337]))   # (1, 64)

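# Cosine similarity
# Note: tf.keras.losses.cosine_similarity is a loss and returns the NEGATIVE
# cosine similarity, so negate it to get a score where higher means more similar.
score = -tf.keras.losses.cosine_similarity(user_vec, item_vec, axis=-1)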

19.7 YouTube Recommendation DNN: A Deep Dive

The 2016 paper "Deep Neural Networks for YouTube Recommendations" by Covington et al. is arguably the most influential industrial recommendation systems paper ever published. It describes the system serving over 1 billion users and choosing from hundreds of millions of videos.

Two-Stage Architecture

Stage 1: Candidate Generation

Reduces the corpus from millions of videos to ~hundreds of candidates. Uses a deep neural network that acts as a massive multiclass classifier.

  • Input Features: Watch history (embedded + averaged), search history (embedded + averaged), geographic embedding, age, gender
  • Architecture: 3 fully-connected ReLU layers (1024 → 512 → 256)
  • Output: User embedding (256-d), used for ANN retrieval at serving time
  • Training: Treats the problem as extreme multiclass classification (softmax over all videos); uses sampled softmax for tractability
Stage 2: Ranking

Takes the ~hundreds of candidates and produces a fine-grained ranking. Uses a richer set of features.

  • Additional Features: Time since last watch, video age, # previous impressions of this video, video language match
  • Architecture: MLP (1024 → 512 → 256) over a richer feature set, with a logistic regression output
  • Objective: Predict expected watch time (weighted logistic regression)
  • Key Insight: Training on watch time (not clicks) avoids "clickbait" optimisation
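Why does a weighted logistic regression predict watch time? In the paper's setup, clicked impressions are weighted by their watch time and unclicked impressions get unit weight. The sketch below uses made-up impression data and an intercept-only model, for which the fitted odds are exactly the ratio of total positive weight to total negative weight.

```python
# Toy impression log for one context: watch time in seconds, 0 = not clicked.
watch_times = [0, 0, 120, 0, 0, 0, 300, 0, 0, 0]   # made-up values

N = len(watch_times)
clicked = [t for t in watch_times if t > 0]
k = len(clicked)

# Weighted logistic regression: positives weighted by watch time, negatives by 1.
# For an intercept-only model, fitted odds = total positive weight / total negative weight.
odds = sum(clicked) / (N - k)

expected_watch_time = sum(watch_times) / N          # E[T] per impression
click_rate = k / N                                  # P

# odds = E[T] / (1 - P), which is close to E[T] when the click rate P is small.
print(f"odds = {odds:.1f}, E[T] = {expected_watch_time:.1f}, E[T]/(1-P) = {expected_watch_time / (1 - click_rate):.1f}")
```

The toy click rate here is 20%, so the gap between the odds and E[T] is visible; at realistic click rates of a few percent the two are nearly equal, which is why exponentiating the logit at serving time gives an estimate of expected watch time.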
YOUTUBE RECOMMENDATION DNN

STAGE 1: CANDIDATE GENERATION
  Watch History ──→ [Video Embeddings] ──→ Avg Pool
  Search Tokens ──→ [Token Embeddings] ──→ Avg Pool
  Geographic ID ──→ [Geo Embedding]
  Age, Gender   ──→ [Normalize]
                      │
                [Concatenate]
                      │
             [Dense 1024 + ReLU]
             [Dense 512 + ReLU]
             [Dense 256 + ReLU]
                      │
            User Embedding (256-d)
                      │
      [ANN Search over Video Embeddings]
            → Top ~200 candidates
                      │
                      ▼
STAGE 2: RANKING
  Candidate features + User features + Context
  (video age, language match, # impressions, etc.)
                      │
             [Dense 1024 + ReLU]
             [Dense 512 + ReLU]
             [Dense 256 + ReLU]
                      │
        Weighted Logistic Regression
        (predict expected watch time)
             → Final ranked list

Key Design Decisions from the Paper

1. "Example Age" Feature

YouTube observed that models trained on historical data develop an implicit bias toward the past, favouring older videos that have had more time to accumulate watches. They added an "example age" feature (the age of the training example, measured relative to the end of the training window) so the model can learn this time dependence explicitly. At serving time, this feature is set to 0 (or slightly negative), so predictions reflect the very end of the training window and fresh content is not penalised.

2. Averaging Watch History Embeddings

Rather than using RNNs for sequence modelling (which were expensive in 2016), YouTube simply averages the embeddings of the last N watched videos. Despite its simplicity, this works remarkably well because the averaging acts as a form of pooling that captures general taste.
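The pooling step can be sketched as follows. The embedding table is random, `user_input_vector` is a hypothetical helper, and the two extra features are placeholders for whatever dense features are concatenated alongside the pooled history.

```python
import numpy as np

rng = np.random.default_rng(0)
video_emb = rng.normal(0, 0.1, (1000, 32))    # hypothetical video embedding table

def user_input_vector(watch_history, extra_features, last_n=50):
    """Average the embeddings of the last N watched videos, then append other features."""
    recent = watch_history[-last_n:]
    if recent:
        pooled = video_emb[recent].mean(axis=0)
    else:
        pooled = np.zeros(video_emb.shape[1])   # no history yet: zero vector
    return np.concatenate([pooled, np.asarray(extra_features, dtype=float)])

# Four watches (video 17 twice) plus two placeholder dense features.
x = user_input_vector([3, 17, 256, 17], extra_features=[0.35, 1.0])
print(x.shape)
```

Averaging gives a fixed-size input regardless of history length, but it discards watch order: the same videos in a different sequence produce the same vector.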

3. Asymmetric Training vs Serving

At training time, the candidate generation model is a standard softmax classifier over videos. At serving time, the softmax is not evaluated: the output of the last hidden layer serves as the user embedding, and the rows of the softmax weight matrix serve as the video embeddings. Top-N retrieval then reduces to a nearest-neighbour search in dot-product space.

4. Negative Sampling

With millions of videos, computing the full softmax at every training step is prohibitively expensive. YouTube uses sampled softmax with ~thousands of negative samples per batch.

Adapting YouTube DNN for Indian OTT: For Hotstar/Zee5/SonyLIV, add a language embedding as a critical feature. India's multilingual user base means a user from Chennai who watches Tamil content 80% of the time but occasionally watches Hindi cricket commentary requires nuanced language-aware modelling that the original YouTube paper didn't address.

19.8 Recommendation Fairness: The Indian Context

Recommendation systems don't just reflect user preferences; they shape them. In a country as diverse as India, unfair recommendations can have outsized societal impact.

Urban vs Rural Bias: Most training data comes from urban users (who generate 75%+ of e-commerce transactions). Models trained on this data systematically under-recommend products relevant to rural India: agricultural tools, regional language books, affordable feature-phone-compatible services. A farmer in Vidarbha searching for "कापूस बियाणे" (cotton seeds) gets shown premium organic seeds from urban Pune stores instead of affordable local options.

Language Bias: Hindi and English content dominates training data on platforms like Flipkart and Amazon India. Products with Kannada, Odia, or Assamese descriptions are systematically ranked lower because the NLP models have weaker representations for these languages. A study by IIT Bombay (2023) showed that product discovery rates for items listed only in regional languages were 3.2× lower than equivalent Hindi-listed items.

Specific Bias Types in Indian RecSys

Bias Type | Description | Indian Example
Popularity Bias | Popular items get recommended more → become even more popular (rich-get-richer) | Shah Rukh Khan movies dominate OTT homepages; Marathi/Bengali indie films get buried
Cold-Start Bias | New sellers/creators can't get recommendations without interaction history | A new artisan on Amazon India from Moradabad gets zero visibility vs. established sellers
Geographic Bias | Metro-centric models don't understand Tier-2/3 city preferences | Recommending ₹50,000 smartphones to users in small-town Madhya Pradesh
Gender Bias | Stereotypical associations in training data | Kitchen appliances recommended only to women; tech gadgets only to men
Price Bias | High-margin items get promoted over affordable alternatives | Recommending premium brands when user's purchase history shows ₹200–500 range

Mitigation Techniques

  1. Re-ranking with fairness constraints: After the model scores items, re-rank to ensure minimum exposure for underrepresented categories/languages
  2. Calibrated recommendations: If a user watches 40% Tamil and 60% Hindi content, ensure recommendations reflect this ratio
  3. Exploration slots: Reserve 10–15% of recommendation slots for diverse/fresh content outside the user's typical patterns
  4. Counterfactual evaluation: Estimate "Would this user have liked this item if we had recommended it?" using causal inference
  5. Multi-stakeholder fairness: Balance user satisfaction, seller exposure, and platform revenue rather than optimising a single metric
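As one possible illustration of technique 1, the sketch below re-ranks a score-ordered list so that a minimum number of slots go to a protected group. The function name, the `is_protected` callback, and the slot counts are all hypothetical; real systems use more refined exposure constraints.

```python
def rerank_with_min_exposure(ranked, is_protected, k=10, min_protected=2):
    """Fill k slots from a score-ordered list while reserving min_protected slots
    for items flagged by is_protected (e.g. regional-language titles)."""
    protected = [x for x in ranked if is_protected(x)]
    reserved = protected[:min_protected]          # best-scoring protected items
    result = []
    for item in ranked:
        if len(result) == k:
            break
        slots_left = k - len(result)
        still_needed = [x for x in reserved if x not in result]
        # Take the item if it is reserved, or if room remains beyond the reserved slots.
        if item in reserved or slots_left > len(still_needed):
            result.append(item)
    return result

# "h*" = Hindi titles, "t*" = Tamil titles, already sorted by model score.
ranked = ["h1", "h2", "h3", "h4", "t1", "h5", "t2"]
out = rerank_with_min_exposure(ranked, lambda x: x.startswith("t"), k=4, min_protected=2)
print(out)
```

Items keep their relative score order, so the constraint only displaces the lowest-scoring unprotected items; the cost in predicted relevance can be measured by comparing against the unconstrained top-k.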

ONDC & Fair Recommendations: India's Open Network for Digital Commerce (ONDC) explicitly mandates that recommendation algorithms must not unfairly discriminate based on seller size or geography. This is a world-first regulatory requirement for recommendation fairness, and it pushes Indian tech companies to build fairness into their ML pipelines from Day 1.

Section 4

From-Scratch Code

4A. Matrix Factorization with Gradient Descent (NumPy)

We'll implement biased matrix factorization from scratch using SGD on a Bollywood movie ratings dataset.

Python
import numpy as np

class MatrixFactorization:
    """
    Biased Matrix Factorization with SGD
    r̂_ui = μ + b_u + b_i + p_u · q_i
    
    Bollywood Movie Recommendation from Scratch
    """
    
    def __init__(self, n_users, n_items, n_factors=50, lr=0.005,
                 reg=0.02, n_epochs=20, random_state=42):
        self.n_users = n_users
        self.n_items = n_items
        self.n_factors = n_factors
        self.lr = lr
        self.reg = reg
        self.n_epochs = n_epochs
        
        np.random.seed(random_state)
        
        # Initialize latent factor matrices (small random values)
        self.P = np.random.normal(0, 0.1, (n_users, n_factors))  # User factors
        self.Q = np.random.normal(0, 0.1, (n_items, n_factors))  # Item factors
        
        # Bias terms
        self.b_u = np.zeros(n_users)  # User bias
        self.b_i = np.zeros(n_items)  # Item bias
        self.mu = 0                    # Global mean
    
    def fit(self, ratings):
        """
        Train on list of (user_id, item_id, rating) tuples.
        
        Parameters:
        -----------
        ratings : list of tuples [(u, i, r), ...]
            u = user index, i = item index, r = rating (1-5)
        """
        # Work on a copy so per-epoch shuffling does not reorder the caller's list
        ratings = list(ratings)

        # Compute global mean
        self.mu = np.mean([r for _, _, r in ratings])
        
        training_loss = []
        
        for epoch in range(self.n_epochs):
            # Shuffle training data each epoch
            np.random.shuffle(ratings)
            total_loss = 0
            
            for u, i, r in ratings:
                # Predict: r̂ = μ + b_u + b_i + p_u · q_i
                pred = self.mu + self.b_u[u] + self.b_i[i] + \
                       np.dot(self.P[u], self.Q[i])
                
                # Compute error
                error = r - pred
                total_loss += error ** 2
                
                # Update biases
                self.b_u[u] += self.lr * (error - self.reg * self.b_u[u])
                self.b_i[i] += self.lr * (error - self.reg * self.b_i[i])
                
                # Update latent factors
                # p_u ← p_u + α(e_ui · q_i − λ · p_u)
                P_old = self.P[u].copy()
                self.P[u] += self.lr * (error * self.Q[i] - self.reg * self.P[u])
                self.Q[i] += self.lr * (error * P_old - self.reg * self.Q[i])
            
            # Add regularization to loss
            reg_loss = self.reg * (
                np.sum(self.P ** 2) + np.sum(self.Q ** 2) +
                np.sum(self.b_u ** 2) + np.sum(self.b_i ** 2)
            )
            rmse = np.sqrt(total_loss / len(ratings))
            training_loss.append(rmse)
            
            if (epoch + 1) % 5 == 0:
                print(f"Epoch {epoch+1:3d}/{self.n_epochs} | "
                      f"RMSE: {rmse:.4f} | Reg Loss: {reg_loss:.2f}")
        
        return training_loss
    
    def predict(self, user_id, item_id):
        """Predict rating for a (user, item) pair."""
        pred = self.mu + self.b_u[user_id] + self.b_i[item_id] + \
               np.dot(self.P[user_id], self.Q[item_id])
        # Clip to valid rating range
        return np.clip(pred, 1.0, 5.0)
    
    def recommend(self, user_id, n=10, exclude_rated=None):
        """Return top-n item recommendations for a user."""
        scores = []
        for i in range(self.n_items):
            if exclude_rated and i in exclude_rated:
                continue
            scores.append((i, self.predict(user_id, i)))
        
        # Sort by predicted rating, descending
        scores.sort(key=lambda x: x[1], reverse=True)
        return scores[:n]


# ─── Demo: Bollywood Movie Ratings ───
# Movie mapping (index → name)
movies = {
    0: "Dangal",       1: "3 Idiots",     2: "KGF",
    3: "Bahubali",     4: "Pushpa",       5: "DDLJ",
    6: "PK",           7: "Zindagi NMDD", 8: "RRR",
    9: "Kantara",      10: "Lagaan",      11: "Drishyam",
}

users = {0: "Priya", 1: "Rahul", 2: "Aisha", 3: "Vikram",
         4: "Neha",  5: "Arjun", 6: "Meera", 7: "Karthik"}

# Simulated ratings: (user_id, item_id, rating)
ratings = [
    (0,0,5), (0,1,4), (0,3,3), (0,5,5), (0,10,4),
    (1,1,5), (1,2,4), (1,4,3), (1,8,5), (1,9,4),
    (2,0,4), (2,2,5), (2,3,4), (2,8,4), (2,9,5),
    (3,1,3), (3,3,5), (3,4,4), (3,8,5), (3,11,4),
    (4,0,3), (4,2,4), (4,5,5), (4,6,4), (4,7,5),
    (5,2,5), (5,3,4), (5,4,5), (5,8,5), (5,9,4),
    (6,0,4), (6,5,5), (6,6,5), (6,7,4), (6,10,5),
    (7,2,4), (7,3,5), (7,9,5), (7,11,3),
]

# Train the model
mf = MatrixFactorization(n_users=8, n_items=12, n_factors=10,
                          lr=0.01, reg=0.02, n_epochs=50)
losses = mf.fit(ratings)

# Get recommendations for Priya (user 0)
# Exclude movies she has already rated
priya_rated = {0, 1, 3, 5, 10}
recs = mf.recommend(0, n=5, exclude_rated=priya_rated)

print("\n🎬 Top 5 Recommendations for Priya:")
for item_id, score in recs:
    print(f"  {movies[item_id]:<15s} → Predicted Rating: {score:.2f}")
Epoch   5/50 | RMSE: 0.8123 | Reg Loss: 1.45
Epoch  10/50 | RMSE: 0.5847 | Reg Loss: 2.18
Epoch  15/50 | RMSE: 0.4231 | Reg Loss: 2.67
Epoch  20/50 | RMSE: 0.3412 | Reg Loss: 2.95
Epoch  25/50 | RMSE: 0.2934 | Reg Loss: 3.11
Epoch  30/50 | RMSE: 0.2615 | Reg Loss: 3.20
Epoch  35/50 | RMSE: 0.2389 | Reg Loss: 3.25
Epoch  40/50 | RMSE: 0.2224 | Reg Loss: 3.28
Epoch  45/50 | RMSE: 0.2102 | Reg Loss: 3.30
Epoch  50/50 | RMSE: 0.2011 | Reg Loss: 3.31

🎬 Top 5 Recommendations for Priya:
  PK              → Predicted Rating: 4.38
  Zindagi NMDD    → Predicted Rating: 4.25
  Drishyam        → Predicted Rating: 4.12
  KGF             → Predicted Rating: 3.89
  RRR             → Predicted Rating: 3.74

Notice how Priya, who rated DDLJ (5), Dangal (5), 3 Idiots (4), and Lagaan (4), gets recommended PK and Zindagi NMDD. This makes sense: she's a fan of Aamir Khan and feel-good Bollywood films. The latent factors have captured this pattern!

4B. Neural Collaborative Filtering from Scratch (NumPy)

Python
import numpy as np

class NeuralCFScratch:
    """
    Simple Neural CF implemented with NumPy.
    Architecture: User Emb + Item Emb → Concat → Dense(64) → ReLU → Dense(32) → ReLU → Dense(1) → Sigmoid
    For binary implicit feedback (clicked/not-clicked).
    """
    
    def __init__(self, n_users, n_items, emb_dim=32, lr=0.001):
        self.n_users = n_users
        self.n_items = n_items
        self.emb_dim = emb_dim
        self.lr = lr
        
        # Initialize embeddings
        scale = 0.01
        self.user_emb = np.random.randn(n_users, emb_dim) * scale
        self.item_emb = np.random.randn(n_items, emb_dim) * scale
        
        # Layer 1: (2 * emb_dim) → 64
        self.W1 = np.random.randn(2 * emb_dim, 64) * np.sqrt(2.0 / (2 * emb_dim))
        self.b1 = np.zeros(64)
        
        # Layer 2: 64 → 32
        self.W2 = np.random.randn(64, 32) * np.sqrt(2.0 / 64)
        self.b2 = np.zeros(32)
        
        # Output layer: 32 → 1
        self.W3 = np.random.randn(32, 1) * np.sqrt(2.0 / 32)
        self.b3 = np.zeros(1)
    
    def relu(self, x):
        return np.maximum(0, x)
    
    def relu_grad(self, x):
        return (x > 0).astype(np.float64)
    
    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))
    
    def forward(self, user_id, item_id):
        """Forward pass through the network."""
        # Get embeddings
        self.p = self.user_emb[user_id]   # (emb_dim,)
        self.q = self.item_emb[item_id]   # (emb_dim,)
        
        # Concatenate
        self.x0 = np.concatenate([self.p, self.q])  # (2*emb_dim,)
        
        # Layer 1
        self.z1 = self.x0 @ self.W1 + self.b1       # (64,)
        self.a1 = self.relu(self.z1)                 # (64,)
        
        # Layer 2
        self.z2 = self.a1 @ self.W2 + self.b2        # (32,)
        self.a2 = self.relu(self.z2)                 # (32,)
        
        # Output
        self.z3 = self.a2 @ self.W3 + self.b3        # (1,)
        self.y_hat = self.sigmoid(self.z3)           # (1,)
        
        return self.y_hat[0]
    
    def backward(self, user_id, item_id, y_true):
        """Backward pass + parameter updates."""
        # Binary cross-entropy gradient at output
        dz3 = (self.y_hat - y_true)  # (1,); sigmoid + BCE simplifies to y_hat - y
        
        # Output layer gradients
        dW3 = self.a2.reshape(-1, 1) @ dz3.reshape(1, -1)
        db3 = dz3
        
        # Layer 2 gradients
        da2 = (dz3 @ self.W3.T).flatten()
        dz2 = da2 * self.relu_grad(self.z2)
        dW2 = self.a1.reshape(-1, 1) @ dz2.reshape(1, -1)
        db2 = dz2
        
        # Layer 1 gradients
        da1 = (dz2 @ self.W2.T).flatten()
        dz1 = da1 * self.relu_grad(self.z1)
        dW1 = self.x0.reshape(-1, 1) @ dz1.reshape(1, -1)
        db1 = dz1
        
        # Embedding gradients
        dx0 = (dz1 @ self.W1.T).flatten()
        dp = dx0[:self.emb_dim]
        dq = dx0[self.emb_dim:]
        
        # Update parameters (SGD)
        self.W3 -= self.lr * dW3
        self.b3 -= self.lr * db3
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1
        self.user_emb[user_id] -= self.lr * dp
        self.item_emb[item_id] -= self.lr * dq
    
    def train(self, data, n_epochs=20):
        """
        Train on implicit feedback data.
        data: list of (user_id, item_id, label)  where label ∈ {0, 1}
        """
        for epoch in range(n_epochs):
            np.random.shuffle(data)  # note: shuffles the caller's list in place
            total_loss = 0
            for u, i, y in data:
                y_hat = self.forward(u, i)
                # Binary cross-entropy loss
                loss = -(y * np.log(y_hat + 1e-8) +
                         (1 - y) * np.log(1 - y_hat + 1e-8))
                total_loss += loss
                self.backward(u, i, y)
            
            if (epoch + 1) % 5 == 0:
                avg_loss = total_loss / len(data)
                print(f"Epoch {epoch+1:3d} | BCE Loss: {avg_loss:.4f}")

# ─── Demo ───
ncf = NeuralCFScratch(n_users=8, n_items=12, emb_dim=16, lr=0.005)

# Convert explicit ratings to implicit: rating ≥ 4 → positive (1)
implicit_data = [(u, i, 1) for u, i, r in ratings if r >= 4]
# Add one negative per positive, drawn from unrated (user, item) pairs.
# Resampling on collisions avoids rated pairs and duplicate negatives.
rated_pairs = {(u, i) for u, i, _ in ratings}
negatives = set()
while len(negatives) < len(implicit_data):
    u = np.random.randint(8)
    i = np.random.randint(12)
    if (u, i) not in rated_pairs:
        negatives.add((u, i))
implicit_data += [(u, i, 0) for u, i in negatives]

ncf.train(implicit_data, n_epochs=30)
Epoch   5 | BCE Loss: 0.5823
Epoch  10 | BCE Loss: 0.4217
Epoch  15 | BCE Loss: 0.3418
Epoch  20 | BCE Loss: 0.2856
Epoch  25 | BCE Loss: 0.2434
Epoch  30 | BCE Loss: 0.2112
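The backward pass above starts from the shortcut dz3 = ŷ − y, which holds when a sigmoid output is paired with binary cross-entropy. A minimal sketch of a finite-difference check of that identity follows; the helper names (`bce`, `numeric_grad`) are illustrative and not part of the class above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    """Binary cross-entropy written as a function of the pre-sigmoid logit z."""
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def numeric_grad(z, y, eps=1e-6):
    """Central finite difference of the loss with respect to z."""
    return (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)

max_gap = 0.0
for z in (-2.0, -0.3, 0.0, 1.5):
    for y in (0.0, 1.0):
        analytic = sigmoid(z) - y   # the shortcut used in backward()
        max_gap = max(max_gap, abs(analytic - numeric_grad(z, y)))

print(f"largest |analytic - numeric| gap: {max_gap:.2e}")
```

The same trick (perturb one value, compare with the analytic gradient) can be applied to individual weights when debugging a hand-written backward pass.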
Section 5

Industry Code: Neural CF with TensorFlow/Keras

5A. NeuMF Model (TensorFlow)

TensorFlow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd

# ─── Load MovieLens 100K Dataset ───
# Download from: https://grouplens.org/datasets/movielens/100k/
# This demo reads the tab-separated u.data file directly from the URL below

url = "https://files.grouplens.org/datasets/movielens/ml-100k/u.data"
columns = ["user_id", "item_id", "rating", "timestamp"]
df = pd.read_csv(url, sep="\t", names=columns)

# Re-index to 0-based
df["user_id"] = df["user_id"] - 1
df["item_id"] = df["item_id"] - 1

n_users = df["user_id"].nunique()
n_items = df["item_id"].nunique()
print(f"Users: {n_users}, Items: {n_items}, Ratings: {len(df)}")

# Convert to implicit: rating ≥ 4 → 1, else 0
df["label"] = (df["rating"] >= 4).astype(np.float32)

# Train/test split by timestamp (temporal split, more realistic)
df = df.sort_values("timestamp")
split_idx = int(len(df) * 0.8)
train_df = df.iloc[:split_idx]
test_df = df.iloc[split_idx:]

# ─── Build NeuMF Model ───
def build_neumf(n_users, n_items, emb_dim=64, mlp_layers=(128, 64, 32)):
    """
    Neural Matrix Factorization (NeuMF) combining:
    - GMF path (element-wise product of embeddings)
    - MLP path (concatenation through FC layers)
    """
    # Input
    user_input = layers.Input(shape=(1,), name="user_input")
    item_input = layers.Input(shape=(1,), name="item_input")
    
    # ── GMF Path ──
    gmf_user_emb = layers.Embedding(n_users, emb_dim,
                                     name="gmf_user_emb")(user_input)
    gmf_item_emb = layers.Embedding(n_items, emb_dim,
                                     name="gmf_item_emb")(item_input)
    gmf_user_emb = layers.Flatten()(gmf_user_emb)
    gmf_item_emb = layers.Flatten()(gmf_item_emb)
    
    # Element-wise product (Generalized MF)
    gmf_output = layers.Multiply()([gmf_user_emb, gmf_item_emb])
    
    # ── MLP Path ──
    mlp_user_emb = layers.Embedding(n_users, emb_dim,
                                     name="mlp_user_emb")(user_input)
    mlp_item_emb = layers.Embedding(n_items, emb_dim,
                                     name="mlp_item_emb")(item_input)
    mlp_user_emb = layers.Flatten()(mlp_user_emb)
    mlp_item_emb = layers.Flatten()(mlp_item_emb)
    
    # Concatenate and pass through MLP
    mlp_concat = layers.Concatenate()([mlp_user_emb, mlp_item_emb])
    
    x = mlp_concat
    for units in mlp_layers:
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(0.2)(x)
    mlp_output = x
    
    # ── Combine GMF + MLP ──
    combined = layers.Concatenate()([gmf_output, mlp_output])
    
    # Final prediction
    output = layers.Dense(1, activation="sigmoid",
                          name="prediction")(combined)
    
    model = keras.Model(inputs=[user_input, item_input], outputs=output)
    return model

# Build and compile
model = build_neumf(n_users, n_items, emb_dim=64)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy", keras.metrics.AUC(name="auc")]
)

model.summary()

# ─── Train ───
history = model.fit(
    [train_df["user_id"].values, train_df["item_id"].values],
    train_df["label"].values,
    batch_size=256,
    epochs=10,
    validation_data=(
        [test_df["user_id"].values, test_df["item_id"].values],
        test_df["label"].values
    ),
    verbose=1
)

# ─── Evaluate ───
test_loss, test_acc, test_auc = model.evaluate(
    [test_df["user_id"].values, test_df["item_id"].values],
    test_df["label"].values
)
print(f"\nTest Accuracy: {test_acc:.4f}")
print(f"Test AUC: {test_auc:.4f}")
Users: 943, Items: 1682, Ratings: 100000
Model: "model"
__________________________________
Total params: 362,945
Trainable params: 362,945
__________________________________
Epoch 1/10 - loss: 0.6412 - acc: 0.6234 - auc: 0.6518 - val_auc: 0.6892
Epoch 5/10 - loss: 0.5234 - acc: 0.7412 - auc: 0.7834 - val_auc: 0.7521
Epoch 10/10 - loss: 0.4512 - acc: 0.7891 - auc: 0.8423 - val_auc: 0.7923

Test Accuracy: 0.7834
Test AUC: 0.7923

5B. Two-Tower Model for Bollywood Movie Retrieval

TensorFlow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_two_tower(n_users, n_items, n_genres=20,
                     n_languages=12, emb_dim=32, tower_dim=64):
    """
    Two-Tower Retrieval Model for Indian OTT platform.
    User tower: user_id + preferred_language
    Item tower: item_id + genre + language
    """
    
    # ── USER TOWER ──
    user_id_input = layers.Input(shape=(1,), name="user_id")
    user_lang_input = layers.Input(shape=(1,), name="user_lang")
    
    user_emb = layers.Embedding(n_users, emb_dim)(user_id_input)
    user_emb = layers.Flatten()(user_emb)
    user_lang = layers.Embedding(n_languages, 8)(user_lang_input)
    user_lang = layers.Flatten()(user_lang)
    
    user_concat = layers.Concatenate()([user_emb, user_lang])
    user_vec = layers.Dense(tower_dim, activation="relu")(user_concat)
    user_vec = layers.Dense(tower_dim, activation=None)(user_vec)
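    # L2 normalize for cosine similarity. The Lambda wrapper keeps the op a
    # Keras layer, which newer Keras releases require for symbolic tensors.
    user_vec = layers.Lambda(lambda t: tf.nn.l2_normalize(t, axis=1))(user_vec)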
    
    # ── ITEM TOWER ──
    item_id_input = layers.Input(shape=(1,), name="item_id")
    genre_input = layers.Input(shape=(1,), name="genre")
    item_lang_input = layers.Input(shape=(1,), name="item_lang")
    
    item_emb = layers.Embedding(n_items, emb_dim)(item_id_input)
    item_emb = layers.Flatten()(item_emb)
    genre_emb = layers.Embedding(n_genres, 8)(genre_input)
    genre_emb = layers.Flatten()(genre_emb)
    item_lang = layers.Embedding(n_languages, 8)(item_lang_input)
    item_lang = layers.Flatten()(item_lang)
    
    item_concat = layers.Concatenate()([item_emb, genre_emb, item_lang])
    item_vec = layers.Dense(tower_dim, activation="relu")(item_concat)
    item_vec = layers.Dense(tower_dim, activation=None)(item_vec)
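    item_vec = layers.Lambda(lambda t: tf.nn.l2_normalize(t, axis=1))(item_vec)
    
    # ── Dot product similarity ──
    # Cosine similarity lies in [-1, 1], so sigmoid(similarity) alone is stuck
    # between about 0.27 and 0.73. Dividing by a temperature widens the range;
    # 0.1 is an assumed starting value to tune, not a figure from the chapter.
    similarity = layers.Dot(axes=1)([user_vec, item_vec])
    logits = layers.Lambda(lambda s: s / 0.1)(similarity)
    output = layers.Activation("sigmoid")(logits)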
    
    model = keras.Model(
        inputs=[user_id_input, user_lang_input,
                item_id_input, genre_input, item_lang_input],
        outputs=output
    )
    return model

# Build
two_tower = build_two_tower(
    n_users=100000, n_items=50000,
    n_genres=20, n_languages=12
)
two_tower.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"]
)
two_tower.summary()

print("\n✅ Two-Tower model ready!")
print("User tower output: 64-d normalized vector")
print("Item tower output: 64-d normalized vector")
print("At serving: pre-compute item vectors → FAISS index")
print("Real-time: compute user vector → ANN search → top-K")
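The serving recipe printed above (pre-compute item vectors, then search for the nearest ones to the user vector) can be sketched without any index library. The example below uses exact brute-force search in NumPy as a stand-in for FAISS or another ANN index; the random vectors stand in for tower outputs, and `retrieve_top_k` is an illustrative name, not an API from the chapter.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def retrieve_top_k(user_vec, item_matrix, k):
    """Exact top-k by cosine similarity; an ANN index approximates this step."""
    scores = item_matrix @ user_vec               # (n_items,)
    top = np.argpartition(-scores, k - 1)[:k]     # unordered top-k
    top = top[np.argsort(-scores[top])]           # order best-first
    return top, scores[top]

rng = np.random.default_rng(0)
item_vectors = l2_normalize(rng.normal(size=(1000, 64)))  # stand-in for item-tower outputs
user_vector = l2_normalize(rng.normal(size=(1, 64)))[0]   # stand-in for one user-tower output

top_ids, top_scores = retrieve_top_k(user_vector, item_vectors, k=5)
print(top_ids, np.round(top_scores, 3))
```

Brute force costs one dot product per item, which is fine for thousands of items; the point of an ANN index is to avoid scanning the full catalogue when it holds millions of vectors.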

5C. Recommendation Evaluation Metrics

Python
import numpy as np

def precision_at_k(recommended, relevant, k):
    """
    Precision@K: Of the top-K recommended, how many are relevant?
    
    Example: Recommended top-5 for Priya: [Dangal, KGF, PK, DDLJ, Pushpa]
             Relevant (actually liked):    {Dangal, DDLJ, 3 Idiots}
             Precision@5 = 2/5 = 0.40
    """
    rec_k = recommended[:k]
    hits = len(set(rec_k) & set(relevant))
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Recall@K: Of all relevant items, how many appear in top-K?"""
    rec_k = recommended[:k]
    hits = len(set(rec_k) & set(relevant))
    return hits / len(relevant) if relevant else 0

def ndcg_at_k(recommended, relevant, k):
    """
    Normalized Discounted Cumulative Gain @ K.
    Rewards relevant items appearing higher in the ranking.
    
    DCG@K  = Σ(i=1 to K) rel_i / log2(i+1)
    IDCG@K = DCG of perfect ranking
    NDCG   = DCG / IDCG
    """
    dcg = 0.0
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            dcg += 1.0 / np.log2(i + 2)  # +2 because i is 0-indexed
    
    # Ideal DCG: all relevant items at top
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))
    
    return dcg / idcg if idcg > 0 else 0

def hit_rate_at_k(recommended, relevant, k):
    """Hit Rate: Is there at least one relevant item in top-K?"""
    return 1.0 if len(set(recommended[:k]) & set(relevant)) > 0 else 0.0

# ─── Demo ───
recommended = ["Dangal", "KGF", "PK", "DDLJ", "Pushpa",
               "RRR", "3 Idiots", "Lagaan", "Kantara", "Bahubali"]
relevant = {"Dangal", "DDLJ", "3 Idiots", "Lagaan"}

for k in [3, 5, 10]:
    print(f"K={k:2d} | P@K={precision_at_k(recommended, relevant, k):.3f} | "
          f"R@K={recall_at_k(recommended, relevant, k):.3f} | "
          f"NDCG@K={ndcg_at_k(recommended, relevant, k):.3f} | "
          f"Hit={hit_rate_at_k(recommended, relevant, k):.0f}")
K= 3 | P@K=0.333 | R@K=0.250 | NDCG@K=0.469 | Hit=1
K= 5 | P@K=0.400 | R@K=0.500 | NDCG@K=0.559 | Hit=1
K=10 | P@K=0.400 | R@K=1.000 | NDCG@K=0.812 | Hit=1
Section 6

Visual Diagrams

6.1 Evolution of Recommendation Systems

Timeline of RecSys Evolution

| Period | Approach | Complexity | Scalability |
|---|---|---|---|
| 1990s | CF heuristics | O(n²) | 100s |
| 2000s | Item CF (Amazon, 1998) | O(n·m) | 10K |
| 2006 | Matrix factorization (Netflix Prize) | O(n·k) | 1M |
| 2016 | Deep neural (YouTube DNN, 2016) | O(forward pass) | 1B+ |
| 2019+ | Transformer-based RecSys (BERT4Rec) | O(attention) | 100M+ |

Related branch: Two-Tower (Google, 2019)

6.2 Matrix Factorization: Geometric View

User-Item Matrix R (sparse) ≈ P × Q^T

  R (m users × n items)         P (m users × k)       Q (n items × k)
  [ 5  ?  3  ?  4  ? ]          [ p₁ ]                [ q₁ q₂ q₃ ]
  [ ?  4  ?  5  ?  2 ]     ≈    [ p₂ ]         ×      [ q₄ q₅ q₆ ]
  [ 3  ?  5  ?  ?  4 ]          [ p₃ ]
  [ ?  5  ?  4  3  ? ]          [ p₄ ]
  [ 4  ?  ?  ?  5  3 ]          [ p₅ ]

k = latent dimension (e.g., 50)

Latent Space (★ = user vectors, • = item vectors)
  Factor 1, Sports/Patriotic: • Dangal, • Lagaan
  ★ Priya and ★ Meera sit between the Factor 1 items and the Factor 2 items
  Factor 2, Comedy/Feel-good: • 3 Idiots, • PK
  Factor 3, Action/Mass: • KGF, • Pushpa, with ★ Arjun nearby

Dot product p_u · q_i = closeness in this space = predicted rating

6.3 NeuMF Architecture (Detailed)

NeuMF Architecture

Inputs: User ID (one-hot), Item ID (one-hot)

GMF path:
  GMF User Embedding (64-d), GMF Item Embedding (64-d)
    → Element-wise Product (⊙), GMF output (64-d)

MLP path:
  MLP User Embedding (64-d), MLP Item Embedding (64-d)
    → Concatenate (128-d)
    → Dense(128) + ReLU
    → Dense(64) + ReLU
    → Dense(32) + ReLU, MLP path output (32-d)

Fusion:
  GMF output + MLP output
    → Concatenate (64 + 32 = 96-d)
    → Dense(1) + Sigmoid
    → ŷ ∈ [0, 1] (probability of interaction)

6.4 Two-Tower Serving Architecture

TWO-TOWER SERVING ARCHITECTURE

OFFLINE PIPELINE (batch, every 6h)
  All Items → Item Tower → Item Embeddings
    → FAISS / ScaNN / Milvus Vector Index (15 crore item vectors)

ONLINE PIPELINE (~2ms real-time)
  User Request → User Tower → User Embedding
    → ANN Search (~5ms): Find Top-200 Nearest Items
    → RANKING MODEL (re-rank top 200 using richer features + context)
    → Final Top-20 Shown to User on App

Total latency: ~2ms (user tower) + ~5ms (ANN) + ~3ms (ranking) = ~10ms
Section 7

Worked Example

Matrix Factorization: Step-by-Step Hand Calculation

Setup: 3 users, 4 Bollywood movies, latent dimension k = 2.

Step 1: The Rating Matrix

          Dangal   3 Idiots   KGF   Bahubali
Priya   [   5         4        ?       ?    ]
Rahul   [   ?         5        4       ?    ]
Aisha   [   4         ?        5       4    ]

Global mean μ = (5+4+5+4+4+5+4) / 7 = 4.43

Step 2: Initialize Latent Factors (k=2)

P (users × 2) = [[0.5, 0.3], [0.4, 0.6], [0.6, 0.5]]
Q (items × 2) = [[0.3, 0.7], [0.6, 0.2], [0.7, 0.4], [0.5, 0.5]]
b_u = [0, 0, 0],   b_i = [0, 0, 0, 0],   μ = 4.43

Step 3: Predict r̂(Priya, Dangal)

r̂ = μ + b_u[0] + b_i[0] + p_0 · q_0
= 4.43 + 0 + 0 + (0.5×0.3 + 0.3×0.7)
= 4.43 + 0.15 + 0.21 = 4.79

Step 4: Compute Error

e = r_actual − r̂ = 5.0 − 4.79 = 0.21

Step 5: Update Parameters (α = 0.01, λ = 0.02)

Each parameter moves against the gradient of the regularized squared error: b ← b + α(e − λb), p_u ← p_u + α(e·q_i − λ·p_u), q_i ← q_i + α(e·p_u − λ·q_i). All right-hand sides use the values from before the update.

b_u[0] ← 0 + 0.01 × (0.21 − 0.02 × 0) = 0.0021
b_i[0] ← 0 + 0.01 × (0.21 − 0.02 × 0) = 0.0021

p_0[0] ← 0.5 + 0.01 × (0.21 × 0.3 − 0.02 × 0.5) = 0.5 + 0.01 × (0.063 − 0.01) = 0.5005
p_0[1] ← 0.3 + 0.01 × (0.21 × 0.7 − 0.02 × 0.3) = 0.3 + 0.01 × (0.147 − 0.006) = 0.3014

q_0[0] ← 0.3 + 0.01 × (0.21 × 0.5 − 0.02 × 0.3) = 0.3 + 0.01 × (0.105 − 0.006) = 0.3010
q_0[1] ← 0.7 + 0.01 × (0.21 × 0.3 − 0.02 × 0.7) = 0.7 + 0.01 × (0.063 − 0.014) = 0.7005
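The arithmetic in Steps 3 to 5 can be reproduced in a few lines of NumPy. This is a minimal sketch using the numbers from the worked example; the variable names are illustrative.

```python
import numpy as np

mu, alpha, lam = 4.43, 0.01, 0.02
p0 = np.array([0.5, 0.3])   # Priya's latent vector
q0 = np.array([0.3, 0.7])   # Dangal's latent vector
b_u, b_i = 0.0, 0.0
r = 5.0                     # Priya's actual rating of Dangal

r_hat = mu + b_u + b_i + p0 @ q0
e = r - r_hat

# Compute every update from the old values before overwriting any of them.
b_u_new = b_u + alpha * (e - lam * b_u)
b_i_new = b_i + alpha * (e - lam * b_i)
p0_new = p0 + alpha * (e * q0 - lam * p0)
q0_new = q0 + alpha * (e * p0 - lam * q0)

print(f"r_hat = {r_hat:.2f}, e = {e:.2f}")
print("p0 ->", np.round(p0_new, 4), " q0 ->", np.round(q0_new, 4))
```

Keeping the new values in separate variables mirrors the hand calculation, where q_0 is updated with the original p_0 rather than the freshly updated one.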

Step 6: Predict Missing: r̂(Priya, KGF)

After training converges (many epochs), suppose we get:

p_Priya = [0.82, 0.15],   q_KGF = [0.91, 0.73]
b_Priya = 0.12,   b_KGF = 0.08

r̂(Priya, KGF) = 4.43 + 0.12 + 0.08 + (0.82×0.91 + 0.15×0.73)
= 4.43 + 0.20 + 0.7462 + 0.1095 = 5.49 → clip to 5.0

Interpretation: The model predicts Priya would rate KGF at 5.0 (maximum). Her known ratings, Dangal (5) and 3 Idiots (4), suggest she appreciates well-made blockbusters, and KGF fits this pattern. Note how the latent factors captured this without explicit genre matching!
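As a quick check of Step 6, the prediction and the clipping can be written as follows. This is a sketch that assumes a 1 to 5 star scale; the variable names are illustrative.

```python
import numpy as np

mu = 4.43
p_priya, q_kgf = np.array([0.82, 0.15]), np.array([0.91, 0.73])
b_priya, b_kgf = 0.12, 0.08

raw = mu + b_priya + b_kgf + p_priya @ q_kgf
clipped = float(np.clip(raw, 1.0, 5.0))   # assumes ratings live on a 1-5 scale
print(f"raw = {raw:.2f}, clipped = {clipped:.1f}")
```

Clipping is useful when reporting a predicted rating, but several items can end up tied at 5.0; when ordering recommendations it is usually safer to sort by the raw score.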

In industry, you'd never do SGD on one sample at a time. Use mini-batch SGD with batch sizes of 256–1024 and the Adam optimizer. Our hand calculation shows the maths; the TensorFlow code in Section 5 shows the practical approach.
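To make the mini-batch idea concrete, here is a minimal NumPy sketch of one vectorised update for matrix factorization. It uses plain SGD rather than Adam and omits the bias terms for brevity; `minibatch_step` and the toy batch size of 4 are assumptions made for illustration.

```python
import numpy as np

def minibatch_step(P, Q, users, items, ratings, mu, lr=0.05, reg=0.02):
    """One in-place mini-batch SGD step (bias terms omitted for brevity)."""
    pu, qi = P[users], Q[items]                      # (B, k) copies of the old rows
    err = ratings - (mu + np.sum(pu * qi, axis=1))   # (B,) residuals
    step_p = lr * (err[:, None] * qi - reg * pu) / len(users)
    step_q = lr * (err[:, None] * pu - reg * qi) / len(users)
    # np.add.at accumulates every row's step even when an id repeats in the
    # batch; plain `P[users] += ...` would keep only one of the repeats.
    np.add.at(P, users, step_p)
    np.add.at(Q, items, step_q)
    return float(np.sqrt(np.mean(err ** 2)))         # batch RMSE before the update

# Toy data: the seven ratings from the worked example above.
data = np.array([(0, 0, 5), (0, 1, 4), (1, 1, 5), (1, 2, 4),
                 (2, 0, 4), (2, 2, 5), (2, 3, 4)])
users, items, ratings = data[:, 0], data[:, 1], data[:, 2].astype(float)

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(3, 2))
Q = rng.normal(scale=0.1, size=(4, 2))
for epoch in range(200):
    order = rng.permutation(len(users))
    for start in range(0, len(order), 4):            # batch size 4 for this toy set
        idx = order[start:start + 4]
        rmse = minibatch_step(P, Q, users[idx], items[idx], ratings[idx], mu=4.43)
print(f"RMSE of the last mini-batch: {rmse:.3f}")
```

Frameworks handle the repeated-index accumulation automatically in their embedding gradients, which is one reason the hand-written version is mostly useful for understanding.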

Section 8

Case Study: Amazon India, Deep Learning Recommendations & the Cold-Start Challenge

📦 Amazon India: From ₹0 Visibility to ₹2,000 Crore GMV Boost

The Problem

Amazon India onboards ~10,000 new sellers every month, many from Tier-2 and Tier-3 cities: Moradabad brassware artisans, Varanasi silk weavers, Ludhiana knitwear manufacturers. These sellers face the cold-start problem: they have zero interaction history, so collaborative filtering models cannot recommend their products.

Result: New sellers had near-zero visibility for their first 30–60 days, leading to:

  • 45% of new sellers making zero sales in their first month
  • 30% churning (leaving the platform) within 90 days
  • Lost revenue estimated at ₹800+ crore annually

The Deep Learning Solution (2020–2023)

Amazon India deployed a multi-modal hybrid recommendation system:

1. Content-Based Cold Start Module
  • Product image embeddings: EfficientNet-B4 fine-tuned on Amazon's product image corpus → 1280-d vector per product
  • Product text embeddings: Multilingual BERT (supporting Hindi, Tamil, Telugu, Kannada product descriptions) → 768-d vector
  • When a new seller lists products, the system finds visually and textually similar products from established sellers and inherits their recommendation signals
2. Two-Tower Retrieval with Side Features
  • User tower features: User ID embedding, browsing history (last 50 products averaged), city, preferred language, price sensitivity tier
  • Item tower features: Product ID embedding (random init for new products), category embedding, price bucket, image embedding, text embedding, seller region embedding
  • The seller region embedding was a critical addition: it captures geographic product quality patterns (e.g., Jaipur → jewellery, Rajkot → industrial tools)
3. Fairness-Aware Re-Ranking
  • After the ranking model scores items, a re-ranking layer ensures:
    • At least 15% of recommendations come from sellers active < 90 days (cold-start boost)
    • Geographic diversity: no more than 40% of recommendations from any single metro city
    • Language diversity: if user has browsed in multiple languages, reflect proportionally
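The re-ranking constraints in item 3 can be sketched as a greedy pass over the scored candidates. This is an illustrative sketch only, not Amazon's implementation: the dictionary fields, the `fair_rerank` name, and the choice to apply the 40% cap to every city label are assumptions, and the language-diversity rule is left out.

```python
import math
from collections import Counter

def fair_rerank(candidates, n, min_new_frac=0.15, max_city_frac=0.40, new_days=90):
    """Greedy top-n by score with a new-seller floor and a per-city cap.

    Returns fewer than n items if the constraints cannot be satisfied.
    """
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    min_new = math.ceil(min_new_frac * n - 1e-9)     # epsilon guards float noise
    max_city = math.floor(max_city_frac * n + 1e-9)
    chosen, city_count, n_new = [], Counter(), 0
    remaining = list(ranked)
    while len(chosen) < n and remaining:
        slots_left = n - len(chosen)
        # Once the open slots equal the new-seller deficit, only new sellers qualify.
        must_be_new = (min_new - n_new) >= slots_left
        pick = None
        for c in remaining:
            is_new = c["seller_age_days"] < new_days
            if must_be_new and not is_new:
                continue
            if city_count[c["city"]] >= max_city:
                continue
            pick = c
            break
        if pick is None:
            break
        remaining.remove(pick)
        chosen.append(pick)
        city_count[pick["city"]] += 1
        n_new += int(pick["seller_age_days"] < new_days)
    return chosen

# Hypothetical candidates: the eight best-scoring items all come from one city.
candidates = [
    {
        "item_id": i,
        "score": 1.0 - 0.01 * i,
        "city": "Mumbai" if i < 8 else ["Delhi", "Jaipur", "Moradabad"][i % 3],
        "seller_age_days": 30 if i % 10 == 9 else 400,
    }
    for i in range(30)
]
top10 = fair_rerank(candidates, n=10)
print([c["item_id"] for c in top10])
```

In this toy run the city cap stops after four Mumbai items, and the last slot is reserved for a new seller because only one had been picked by then. A production system would more likely solve this as a constrained optimisation over the ranked list, but the greedy version shows how the constraints interact with the score order.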

Results

| Metric | Before DL (2019) | After DL (2023) | Improvement |
|---|---|---|---|
| New seller first-month sales | 55% got ≥1 sale | 78% got ≥1 sale | +23 pp |
| New seller 90-day churn | 30% | 18% | −12 pp |
| Click-through rate (CTR) | 3.2% | 4.8% | +50% |
| GMV from recommendations | ₹12,000 Cr | ₹14,000 Cr | +₹2,000 Cr |
| Regional language product discovery | 2.1% CTR | 3.8% CTR | +81% |

Key Technical Lessons

  1. Multi-modal embeddings solve cold start: Even without interaction data, image + text similarity provides a strong recommendation signal
  2. Fairness constraints help business: Boosting new seller visibility didn't hurt user satisfaction (CTR went up!) because it increased product diversity
  3. Regional embeddings matter for India: A generic model trained on US data would miss Moradabad ↔ brassware or Kanchipuram ↔ silk associations
  4. Multilingual NLP is critical: 40% of Amazon India product listings are in Hindi or regional languages

Scale Context: Amazon India processes ~2 crore recommendation requests per second during Great Indian Festival sales. The two-tower retrieval latency must stay under 15ms for the app to feel responsive. This is why pre-computing item embeddings offline and using FAISS for ANN search is essential: you can't run a full neural network forward pass per item for 15 crore products in real time.

Section 9

Common Mistakes & Misconceptions

Mistake #1: Training/test split by random sampling. In production, you always predict future interactions. Randomly splitting interactions means the model sees future data during training (data leakage). Always use a temporal split: train on interactions before time T, test on interactions after T. On MovieLens, random split inflates AUC by 5–8% compared to temporal split.

Mistake #2: Using only positive interactions for training. If you train a model only on items users liked (positive samples), it never learns what a "dislike" looks like. Always add negative samples, typically 4–10 random uninteracted items per positive item. Too few negatives → model predicts everything as positive. Too many → class imbalance issues.
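A minimal sketch of per-positive negative sampling is shown below. The function name and the default of 4 negatives are assumptions for illustration; building the full candidate list per positive is fine for a small catalogue but would be replaced by rejection sampling at scale.

```python
import random

def sample_negatives(positives, n_items, n_neg=4, seed=0):
    """Return (user, item, label) triples with n_neg sampled negatives per positive."""
    rng = random.Random(seed)
    seen = {}
    for u, i in positives:
        seen.setdefault(u, set()).add(i)

    samples = []
    for u, i in positives:
        samples.append((u, i, 1))
        # Negatives come only from items this user has never interacted with.
        unseen = [j for j in range(n_items) if j not in seen[u]]
        for j in rng.sample(unseen, min(n_neg, len(unseen))):
            samples.append((u, j, 0))
    return samples

positives = [(0, 0), (0, 1), (1, 2)]
training = sample_negatives(positives, n_items=12, n_neg=4)
print(len(training), "training triples,",
      sum(label == 0 for _, _, label in training), "of them negative")
```

Resampling the negatives at the start of each epoch, rather than once up front, is a common variation that exposes the model to more of the unobserved pairs.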

Mistake #3: Ignoring popularity bias in evaluation. A model that simply recommends the most popular items will have decent Precision@K and Hit Rate. Always compare against a popularity baseline. If your deep learning model only beats random but not the popularity baseline, it's not learning meaningful personalisation.
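A popularity baseline takes only a few lines, which is why it is worth running before any deep model. The sketch below uses hypothetical toy interactions and a leave-one-out hit rate; the function names are illustrative.

```python
from collections import Counter

def popularity_ranking(train_interactions):
    """Items ordered from most to least interacted-with in the training data."""
    counts = Counter(item for _, item in train_interactions)
    return [item for item, _ in counts.most_common()]

def recommend_popular(user_seen, ranking, k):
    """Top-k popular items the user has not already interacted with."""
    return [item for item in ranking if item not in user_seen][:k]

def hit_rate(recs_by_user, heldout_by_user, k):
    """Share of users whose single held-out item appears in their top-k."""
    hits = sum(1 for u, held in heldout_by_user.items() if held in recs_by_user[u][:k])
    return hits / len(heldout_by_user)

train = [(0, "Dangal"), (1, "Dangal"), (2, "Dangal"),
         (0, "3 Idiots"), (1, "3 Idiots"), (2, "KGF")]
heldout = {0: "KGF", 1: "Pushpa", 2: "3 Idiots"}

ranking = popularity_ranking(train)
seen = {}
for u, item in train:
    seen.setdefault(u, set()).add(item)
recs = {u: recommend_popular(seen[u], ranking, k=1) for u in heldout}
baseline_hr = hit_rate(recs, heldout, k=1)
print(f"popularity baseline Hit Rate@1 = {baseline_hr:.3f}")
```

Reporting this number next to the model's metric makes it clear how much of the score comes from personalisation rather than from recommending what everyone already watches.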

Mistake #4: Embedding dimension too large. Students often set embedding_dim=256 or 512 for small datasets. For MovieLens 100K (943 users, 1,682 items), using 256-d embeddings means your user embedding matrix alone has 241,408 parameters, more than the number of ratings! This leads to severe overfitting. Rule of thumb: embedding_dim ≈ min(50, n_categories^0.25 × 8).
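The rule of thumb above is easy to turn into a helper. This sketch rounds to the nearest integer, which is an assumption since the text only gives the approximate formula; `rule_of_thumb_dim` is an illustrative name.

```python
def rule_of_thumb_dim(n_categories, cap=50, scale=8):
    """embedding_dim ≈ min(cap, scale * n_categories ** 0.25), rounded."""
    return min(cap, round(scale * n_categories ** 0.25))

for name, n in [("users", 943), ("items", 1682)]:
    dim = rule_of_thumb_dim(n)
    print(f"{name}: {n} ids -> dim {dim}, {n * dim:,} embedding parameters "
          f"(vs {n * 256:,} at 256-d)")
```

For MovieLens 100K this gives roughly 44 dimensions for users and the 50-dimension cap for items, a small fraction of the parameter count at 256 dimensions.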

Mistake #5: Not handling the cold-start problem. Pure collaborative filtering models assign random embeddings to new users/items, producing garbage recommendations. In India, where platforms onboard lakhs of new users and products daily, every production system needs a cold-start fallback: content-based similarity, popularity-based defaults, or demographic-based initialisation.
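One way to sketch the content-based fallback is to give a new item the similarity-weighted average of the collaborative embeddings of its nearest content neighbours. The function name, the choice of cosine similarity, and the toy vectors below are assumptions for illustration.

```python
import numpy as np

def cold_start_embedding(content_vec, catalog_content, catalog_cf, k=3):
    """Borrow a CF embedding from the k catalogue items with the most similar content."""
    a = content_vec / np.linalg.norm(content_vec)
    b = catalog_content / np.linalg.norm(catalog_content, axis=1, keepdims=True)
    sims = b @ a                                   # cosine similarity to every item
    nearest = np.argsort(-sims)[:k]
    weights = np.clip(sims[nearest], 0.0, None)    # ignore negatively correlated items
    if weights.sum() == 0:
        return catalog_cf.mean(axis=0)             # nothing similar: use the average item
    return weights @ catalog_cf[nearest] / weights.sum()

# Toy catalogue: content features (e.g. image or text embeddings) and learned CF embeddings.
catalog_content = np.array([[1.0, 0.0, 0.0],
                            [1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0]])
catalog_cf = np.array([[1.0, 1.0],
                       [3.0, 3.0],
                       [10.0, 0.0],
                       [0.0, 10.0]])

new_item_content = np.array([1.0, 0.0, 0.0])
init = cold_start_embedding(new_item_content, catalog_content, catalog_cf, k=2)
print(init)
```

The borrowed vector is only a starting point; once the new item collects real interactions, its embedding can be trained like any other.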

Mistake #6: Evaluating with accuracy on implicit feedback. For implicit feedback (click/no-click), accuracy is meaningless because 95%+ of samples are negative. A model predicting "no interaction" for everything gets 95% accuracy! Use ranking metrics: Precision@K, Recall@K, NDCG@K, and Hit Rate@K instead.
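The accuracy trap is easy to demonstrate. The sketch below builds a hypothetical label set with 5% positives, matching the 95% figure in the text, and scores a "model" that always predicts no interaction.

```python
import numpy as np

y_true = np.array([1] * 5 + [0] * 95)   # 5% positives, 95% negatives
y_pred = np.zeros_like(y_true)          # always predicts "no interaction"

accuracy = float(np.mean(y_pred == y_true))
recall = float(np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1))
print(f"accuracy = {accuracy:.2f}, recall on positives = {recall:.2f}")
```

The predictor never surfaces a single relevant item, yet its accuracy looks excellent, which is why ranking metrics are the safer choice for implicit feedback.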

Section 10

Comparison Table

10.1 Recommendation Approaches Compared

| Aspect | User-Based CF | Item-Based CF | Matrix Factorization | Neural CF (NCF) | Two-Tower |
|---|---|---|---|---|---|
| Core Idea | Similar users | Similar items | Latent factors (dot product) | Latent factors (neural net) | Separate user/item encoders |
| Cold Start | ❌ Fails | ❌ Fails | ❌ Fails | ⚠️ Partial (with side features) | ✅ Side features in towers |
| Scalability | O(n²), poor | O(n·m), moderate | O(k·nnz), good | O(forward pass), good | O(ANN search), excellent |
| Side Features | ❌ Hard | ❌ Hard | ⚠️ Possible but awkward | ✅ Easy (concatenate) | ✅ Natural |
| Nonlinear Patterns | ❌ No | ❌ No | ❌ Linear only | ✅ Yes | ✅ Yes |
| Serving Speed | Slow | Moderate | Fast | Moderate (full forward pass) | Very fast (~10ms) |
| Interpretability | ✅ Good | ✅ Good | ⚠️ Moderate | ❌ Poor | ❌ Poor |
| Indian Industry Use | Legacy systems | Flipkart (early) | Hotstar baseline | Amazon India ranking | Flipkart, Meesho retrieval |

10.2 Evaluation Metrics Compared

| Metric | Type | What It Measures | Best For |
|---|---|---|---|
| RMSE | Rating prediction | How close predicted rating is to actual | Explicit feedback (1–5 stars) |
| Precision@K | Ranking | Fraction of top-K that are relevant | Evaluating top results quality |
| Recall@K | Ranking | Fraction of all relevant items in top-K | Coverage of user's interests |
| NDCG@K | Ranking | Position-aware relevance (higher = better) | When ranking order matters |
| Hit Rate@K | Ranking | Is at least one relevant item in top-K? | Leave-one-out evaluation |
| AUC | Classification | Ability to distinguish positive from negative | Implicit feedback (click/no-click) |
| MAP@K | Ranking | Average precision across users | System-wide ranking quality |
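MAP@K is listed in the table but is not among the metric functions in Section 5C, so a minimal sketch is given here. It normalises average precision by min(number of relevant items, K), which is one common convention and an assumption on our part; the second user in the demo is hypothetical.

```python
def average_precision_at_k(recommended, relevant, k):
    """Mean of precision@rank over the ranks (within top-k) that hold a relevant item."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_k(recs_by_user, relevant_by_user, k):
    """Average of AP@K over all users that have a relevant set."""
    users = list(relevant_by_user)
    return sum(average_precision_at_k(recs_by_user[u], relevant_by_user[u], k)
               for u in users) / len(users)

recs_by_user = {
    "Priya": ["Dangal", "KGF", "PK", "DDLJ", "Pushpa",
              "RRR", "3 Idiots", "Lagaan", "Kantara", "Bahubali"],
    "Arjun": ["KGF", "RRR"],                      # hypothetical second user
}
relevant_by_user = {
    "Priya": {"Dangal", "DDLJ", "3 Idiots", "Lagaan"},
    "Arjun": {"RRR"},
}

ap5 = average_precision_at_k(recs_by_user["Priya"], relevant_by_user["Priya"], 5)
ap10 = average_precision_at_k(recs_by_user["Priya"], relevant_by_user["Priya"], 10)
map5 = map_at_k(recs_by_user, relevant_by_user, 5)
print(f"AP@5 = {ap5:.3f}, AP@10 = {ap10:.3f}, MAP@5 = {map5:.4f}")
```

For Priya, the hits at ranks 1 and 4 give AP@5 = (1/1 + 2/4) / 4 = 0.375; like NDCG, the score rises when relevant items move toward the top of the list.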
Section 11

Exercises

Section A: Multiple Choice Questions (10)

Q1.

In matrix factorization, if we have 10,000 users, 5,000 items, and k=50 latent factors, how many total parameters are in the P and Q matrices (excluding biases)?

  A. 750,000
  B. 500,000
  C. 50,000,000
  D. 15,000
✅ A. 750,000. P has 10,000 × 50 = 500,000 params; Q has 5,000 × 50 = 250,000 params. Total = 750,000. Compare this to the full matrix: 10,000 × 5,000 = 50M entries, so MF is dramatically more parameter-efficient.
Remember | Beginner
Q2.

Which problem does collaborative filtering fundamentally fail to address without additional information?

  A. Scalability to large datasets
  B. The cold-start problem for new users/items
  C. Predicting ratings above 3.0
  D. Handling explicit feedback
✅ B. The cold-start problem. CF relies on interaction history. A new user or item has no interactions, so no similarities can be computed. This is why hybrid and content-based approaches are needed for new users/items.
Understand | Beginner
Q3.

In the YouTube DNN paper, why does the candidate generation model use "example age" as a feature?

  A. To predict the age of the user
  B. To correct for the bias toward recommending older (already popular) videos
  C. To filter out videos uploaded more than 1 year ago
  D. To estimate the video's production quality
✅ B. Models trained on historical data develop a bias toward older, popular videos. The "example age" feature (the time between the training example and the current time) lets the model learn temporal dynamics. At serving time, setting example_age ≈ 0 boosts fresh content.
Understand · Intermediate
Q4.

What is the key advantage of Neural Collaborative Filtering (NCF) over standard matrix factorization?

  A. NCF requires fewer parameters
  B. NCF can model nonlinear user-item interactions through neural network layers
  C. NCF doesn't need embedding layers
  D. NCF works only with explicit feedback
✅ B. MF uses a dot product (linear interaction). NCF replaces this with an MLP that can learn arbitrary nonlinear interaction patterns. However, NCF typically uses more parameters, not fewer.
Understand · Intermediate
Q5.

In a two-tower recommendation model, item embeddings are typically pre-computed offline. What is the primary reason for this?

  A. Items change more frequently than users
  B. It enables sub-10ms retrieval via approximate nearest neighbour search at serving time
  C. Item embeddings require GPU computation that isn't available online
  D. It reduces the number of model parameters
✅ B. Pre-computing item embeddings and storing them in a vector index (FAISS/ScaNN) allows serving-time retrieval in ~5ms via ANN search. Running the item tower for millions of items in real time would be far too slow.
Apply · Intermediate
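
The serving pattern in Q5 can be sketched as follows. This toy version assumes the item embeddings were already produced offline by the item tower, and it uses an exact top-K search with `heapq.nlargest` as a stand-in for an ANN index such as FAISS or ScaNN. The item names, the 3-dimensional vectors, and `retrieve_top_k` are all illustrative assumptions.

```python
import heapq

# Item embeddings computed offline by the item tower (toy 3-d values, assumed).
ITEM_INDEX = {
    "kgf": [0.9, 0.1, 0.0],
    "ddlj": [0.1, 0.9, 0.1],
    "pushpa": [0.8, 0.2, 0.1],
    "ship_of_theseus": [0.0, 0.2, 0.9],
}

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(user_embedding, item_index, k):
    """Exact top-k by dot product; a production system would query an ANN index instead."""
    scored = ((dot(user_embedding, emb), item_id) for item_id, emb in item_index.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# At request time only the user tower runs; here its output is simply given.
user_embedding = [1.0, 0.2, 0.0]
top2 = retrieve_top_k(user_embedding, ITEM_INDEX, k=2)
print(top2)
```

The exact search scans every item, which is what ANN indexes avoid at catalogue sizes of millions; the interface (user vector in, top-K item IDs out) stays the same.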
Q6.

A Flipkart recommendation model shows excellent RMSE on MovieLens but fails in production. The most likely reason is:

  A. MovieLens has explicit ratings; Flipkart uses implicit feedback (clicks, purchases)
  B. MovieLens movies are in English; Flipkart products are in Hindi
  C. MovieLens is too large a dataset
  D. Flipkart users don't watch movies
✅ A. MovieLens has 1–5 star ratings (explicit). Flipkart data is implicit (click/no-click, buy/not-buy). Optimising for RMSE on explicit ratings doesn't transfer to implicit feedback settings where ranking metrics (NDCG, Precision@K) are more appropriate.
Analyze · Intermediate
Q7.

The NeuMF architecture combines a GMF path and an MLP path. What does the GMF path compute?

  A. Concatenation of user and item embeddings
  B. Element-wise product of user and item embeddings (p_u ⊙ q_i)
  C. Cross product of user and item embeddings
  D. Difference between user and item embeddings
✅ B. The GMF (Generalised Matrix Factorization) path computes the element-wise product p_u ⊙ q_i, which is a generalisation of the standard dot product (the dot product is the sum of element-wise products). This preserves the MF signal while the MLP path adds nonlinear capacity.
Remember · Intermediate
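
The relationship between GMF and the dot product can be seen numerically. In this sketch the embedding values are arbitrary, `h` stands for the learned output weights of the GMF path, and the output activation is omitted for simplicity.

```python
def gmf_interaction(p_u, q_i):
    """Element-wise (Hadamard) product of user and item embeddings."""
    return [a * b for a, b in zip(p_u, q_i)]

def gmf_score(p_u, q_i, h):
    """Weighted sum of the element-wise product; h plays the role of learned weights."""
    return sum(w * x for w, x in zip(h, gmf_interaction(p_u, q_i)))

p_u = [0.5, -1.0, 2.0]   # toy user embedding (assumed values)
q_i = [2.0, 0.5, 0.25]   # toy item embedding (assumed values)

interaction = gmf_interaction(p_u, q_i)
dot_product = sum(interaction)

# With all weights equal to 1, GMF reduces to the plain MF dot product.
print(interaction, dot_product, gmf_score(p_u, q_i, [1.0, 1.0, 1.0]))
# With non-uniform weights, each latent dimension contributes differently.
print(gmf_score(p_u, q_i, [2.0, 0.0, 1.0]))
```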
Q8.

Which evaluation metric is most appropriate when you want to reward relevant items appearing at the top of a recommendation list?

  A. RMSE
  B. Recall@K
  C. NDCG@K
  D. Hit Rate@K
✅ C. NDCG@K. Normalised Discounted Cumulative Gain uses logarithmic discounting: a relevant item at position 1 contributes much more than one at position 10. RMSE measures rating error, not ranking. Recall@K and Hit Rate don't consider position within the top-K.
Understand · Beginner
Q9.

In the YouTube DNN, the candidate generation model's training objective is:

  A. Mean Squared Error on predicted watch time
  B. Binary cross-entropy on click/no-click
  C. Extreme multi-class classification (softmax over all videos)
  D. Contrastive loss between positive and negative pairs
✅ C. The candidate generation model is formulated as extreme multi-class classification where each video is a class. Due to the massive number of classes (millions), it uses sampled softmax during training. At serving time, the learned embeddings are used for ANN retrieval instead.
Remember · Advanced
Q10.

A recommendation system for Amazon India shows higher CTR in Mumbai (4.5%) than in Jaipur (2.1%). Which type of bias is most likely responsible?

  A. Selection bias: Mumbai users are more likely to click on anything
  B. Geographic/urban bias: the model was trained predominantly on urban metro data
  C. Item bias: Mumbai has better products
  D. Temporal bias: Mumbai users shop at different times
✅ B. Geographic/urban bias. When training data is skewed toward metro cities (Mumbai, Delhi, Bengaluru generate 60%+ of e-commerce transactions), the model learns patterns that work well for urban users but generalise poorly to Tier-2/3 cities. The product mix, price sensitivity, and language preferences differ significantly.
Analyze · Advanced

Section B: Short Answer Questions (5)

B1.

Explain what "latent factors" represent in matrix factorization. Give two concrete examples of latent factors that might emerge when factoring a Bollywood movie rating matrix.

Answer: Latent factors are hidden dimensions that the model discovers automatically from rating patterns; they are not explicitly defined. In a Bollywood context: (1) An "action vs. romance" factor, where KGF and Pushpa score high on action while DDLJ and Kabir Singh score high on romance. (2) A "mainstream vs. art-house" factor, where Singham scores high on commercial appeal and Ship of Theseus scores high on art-house. Users also have values along these dimensions: a user who rates action movies highly will have a high value on the action factor, so the dot product with action-movie vectors will be high.
B2.

Why does the YouTube DNN paper train on watch time rather than clicks in the ranking stage? What problem does this solve?

Answer: Training on clicks leads to "clickbait" optimisation: the model learns to recommend videos with sensational thumbnails/titles that get clicks but disappoint viewers (short watch time, high abandon rate). By training on watch time (using weighted logistic regression where positive examples are weighted by watch time), the model learns to recommend videos that users actually enjoy and watch through. This improves long-term user satisfaction and retention, which is YouTube's real business objective.
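
The weighting scheme described in this answer can be sketched as a loss function. This is a simplified illustration: positives (clicked impressions) are weighted by watch time and negatives receive unit weight. The function name and the toy inputs are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_logistic_loss(logits, clicked, watch_seconds):
    """Mean logistic loss where each positive is weighted by its watch time.

    logits: model scores; clicked: 1 for a watched impression, 0 otherwise;
    watch_seconds: watch time for positives (ignored for negatives).
    """
    total = 0.0
    for z, y, t in zip(logits, clicked, watch_seconds):
        p = sigmoid(z)
        if y == 1:
            total += -t * math.log(p)        # positive: weight = watch time
        else:
            total += -math.log(1.0 - p)      # negative: unit weight
    return total / len(logits)

# Same scores, but the second batch's positive was watched 30x longer.
short_watch = weighted_logistic_loss([0.0, 0.0], [1, 0], [10.0, 0.0])
long_watch = weighted_logistic_loss([0.0, 0.0], [1, 0], [300.0, 0.0])
print(short_watch, long_watch)
```

Because a long-watched positive contributes a much larger loss when it is under-scored, gradient descent pushes the model hardest toward ranking videos that hold attention, not videos that merely attract a click.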
B3.

Describe the cold-start problem and explain one deep learning approach to mitigate it for new items on an Indian e-commerce platform.

Answer: The cold-start problem occurs when a new item (or user) has no interaction history, making collaborative filtering impossible. For new items on Amazon India, a DL approach is to use content-based embeddings: pass the product image through a pre-trained CNN (e.g., ResNet) and the product description through a multilingual BERT model to generate feature vectors. These vectors can be used to find similar existing products that already have interaction data, and the new product "inherits" recommendation signals from its nearest neighbours in embedding space. This works even before any user interacts with the new product.
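
The "inherit from nearest neighbours" idea can be sketched with toy vectors. In this illustration the content vectors stand in for CNN/BERT features, the collaborative embeddings stand in for vectors learned from interactions, and all names and values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_existing_items(new_vec, content_vectors, k):
    """IDs of the k catalogue items whose content vectors are closest to new_vec."""
    ranked = sorted(content_vectors,
                    key=lambda item_id: cosine(new_vec, content_vectors[item_id]),
                    reverse=True)
    return ranked[:k]

def inherited_embedding(new_vec, content_vectors, cf_embeddings, k):
    """Average the collaborative embeddings of the k content-nearest items."""
    neighbours = nearest_existing_items(new_vec, content_vectors, k)
    dim = len(cf_embeddings[neighbours[0]])
    return [sum(cf_embeddings[n][d] for n in neighbours) / len(neighbours)
            for d in range(dim)]

# Toy catalogue: content features (assumed) and interaction-learned embeddings (assumed).
content_vectors = {"saree_a": [1.0, 0.0], "saree_b": [0.9, 0.1], "phone_x": [0.0, 1.0]}
cf_embeddings = {"saree_a": [0.2, 0.4], "saree_b": [0.4, 0.6], "phone_x": [-1.0, 0.0]}

new_item_content = [0.95, 0.05]  # a newly listed saree with no interactions yet
neighbours = nearest_existing_items(new_item_content, content_vectors, k=2)
warm_start = inherited_embedding(new_item_content, content_vectors, cf_embeddings, k=2)
print(neighbours, warm_start)
```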
B4.

What is the difference between the candidate generation stage and the ranking stage in the YouTube DNN? Why are two stages needed?

Answer: Candidate generation rapidly narrows millions of videos to ~200 candidates using a simpler model and ANN search (~5ms). Ranking then scores these ~200 candidates using a richer feature set (video age, impression count, language match, etc.) to produce the final ordering (~3ms). Two stages are needed because: (1) Running the rich ranking model on millions of videos would be too slow; (2) The candidate generator uses features available for all videos, while the ranker uses features specific to the user-video pair that are expensive to compute at scale.
B5.

Explain "popularity bias" in recommendation systems and suggest one technique to mitigate it for a platform like JioSaavn serving music across India's diverse languages.

Answer: Popularity bias is the "rich-get-richer" effect: popular items get recommended more, which generates more interactions, further reinforcing their popularity. On JioSaavn, this means Bollywood hits dominate while Assamese, Konkani, or Bhojpuri music gets buried. Mitigation technique: calibrated recommendations. If a user's listening history is 30% Tamil and 70% Hindi, ensure recommendations maintain roughly this ratio rather than skewing entirely to Hindi (which dominates the training data). Additionally, reserving "exploration slots" (10–15% of playlist positions) for diverse, less-popular tracks ensures niche content gets exposure.
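
One simple way to turn the calibration idea into slot counts is a largest-remainder allocation over the user's listening history. This is a sketch of one possible approach, not a description of how any platform implements calibration; `calibrated_slots` is a hypothetical helper.

```python
from collections import Counter

def calibrated_slots(history_languages, n_slots):
    """Split n_slots across languages in proportion to the user's history.

    Uses largest-remainder rounding so the slot counts always sum to n_slots.
    """
    counts = Counter(history_languages)
    total = sum(counts.values())
    exact = {lang: n_slots * c / total for lang, c in counts.items()}
    slots = {lang: int(share) for lang, share in exact.items()}
    leftover = n_slots - sum(slots.values())
    # Hand the remaining slots to the languages with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda lang: exact[lang] - slots[lang], reverse=True)
    for lang in by_remainder[:leftover]:
        slots[lang] += 1
    return slots

# The 70% Hindi / 30% Tamil listener from the answer above.
history = ["Hindi"] * 7 + ["Tamil"] * 3
playlist_quota = calibrated_slots(history, n_slots=10)
print(playlist_quota)
```

The resulting quota per language can then be filled with the highest-scoring tracks in each language, so Tamil tracks keep their share even when Hindi tracks have higher raw scores.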

Section C: Long Answer Questions (3)

C1. Advanced

Design a complete deep learning recommendation system for Hotstar that handles: (a) cold-start for new shows, (b) multilingual content (Hindi, Tamil, Telugu, Malayalam, Kannada, Bengali), (c) real-time serving at 50 crore user scale during IPL. Describe the architecture, training data, features, and serving infrastructure in detail. [15 marks]

Answer should cover:

  1. Two-stage architecture: candidate generation using a two-tower model, followed by ranking using a cross-network.
  2. User tower: user_id embedding, watch history (averaged video embeddings), preferred language embedding, time-of-day, device type. Item tower: video_id embedding, genre, language, cast embeddings (from a graph embedding of actor-movie relationships), trailer visual features (from a video encoder).
  3. Cold-start: new shows use content features only (trailer embedding, cast graph, genre), with no video_id embedding until sufficient interactions accumulate.
  4. Multilingual: language as a first-class feature in both towers; cross-lingual content embeddings from multilingual BERT on subtitles/descriptions.
  5. Serving: item embeddings pre-computed for 100K+ titles, stored in FAISS with an IVF-PQ index. The user tower runs in real time (~2ms on GPU). ANN retrieves the top 500, the ranking model narrows these to the top 50, and business rules (editorial boosts for live cricket) produce the final top 20.
  6. Scale: model served on GPU instances behind a CDN; user features cached in Redis; A/B testing with a 1% traffic holdout.
C2. Advanced

Compare and contrast Matrix Factorization, Neural Collaborative Filtering (NCF/NeuMF), and the Two-Tower model across the following dimensions: (a) mathematical formulation, (b) expressiveness, (c) training methodology, (d) cold-start handling, (e) serving latency, and (f) suitability for an Indian e-commerce platform with 10 crore users and 5 crore products. [12 marks]

Key comparisons:

  (a) Mathematical formulation. MF: r̂ = μ + b_u + b_i + p·q (linear); NCF: r̂ = MLP([p; q]) (nonlinear); Two-Tower: score = dot(f_user(features), f_item(features)) (nonlinear towers, linear final score).
  (b) Expressiveness. MF < NCF < Two-Tower (with rich features).
  (c) Training methodology. MF: SGD on observed ratings; NCF: BCE with negative sampling; Two-Tower: sampled softmax or BCE with in-batch negatives.
  (d) Cold-start handling. MF/NCF: fail without side features; Two-Tower: handles cold-start naturally via content features in the towers.
  (e) Serving latency. MF: O(k) per pair, fast; NCF: full forward pass per candidate, moderate; Two-Tower: pre-computed items + ANN, fastest at scale.
  (f) Suitability. For 10Cr × 5Cr scale, Two-Tower is the only viable architecture: NCF can't score 5Cr items in real time, and MF can't incorporate the rich features needed for India's diverse market.
C3. Intermediate

Discuss recommendation fairness in the Indian context. Cover: (a) urban vs rural bias, (b) language bias, (c) the cold-start problem for small sellers, and (d) propose a concrete fairness-aware re-ranking algorithm with pseudocode. [12 marks]

Answer should cover:

  (a) Urban bias: 75%+ of training interactions come from metros → the model underserves rural preferences. Example: recommending ₹15,000 Bluetooth speakers in a Tier-3 city where average monthly income is ₹12,000.
  (b) Language bias: NLP models have weaker representations for Odia, Assamese, Manipuri → products listed in these languages have lower discoverability.
  (c) Cold-start for sellers: new sellers from small towns (artisans, weavers) have zero interactions → zero visibility → 45% make no sales in month 1.
  (d) Pseudocode for fairness re-ranking: given a scored list S, define slots for each category (new sellers ≥ 15%, regional ≥ 20%, etc.). Greedily fill slots from the highest-scoring items that satisfy the constraint. If a constraint is not met, boost lower-scored items from underrepresented groups. Measure the impact on NDCG to ensure user satisfaction doesn't drop significantly.
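
The greedy re-ranking outlined in part (d) might look like the following in Python. This is one possible reading of the outline, under the assumptions that quotas are minimum fractions of the final list and that they sum to at most 1; the group labels, scores, and `fair_rerank` itself are illustrative.

```python
import math

def fair_rerank(items, quotas, n):
    """Greedy quota-based re-ranking.

    items:  list of (item_id, score, groups) where groups is a set of labels.
    quotas: {group: minimum fraction of the final n slots}; assumed to sum to <= 1.
    Returns the n selected item IDs ordered by score.
    """
    by_score = sorted(items, key=lambda it: it[1], reverse=True)
    chosen, chosen_ids = [], set()

    # Step 1: reserve the best-scoring items of each protected group.
    for group, min_frac in quotas.items():
        needed = math.ceil(round(min_frac * n, 9))  # round() guards against float noise
        have = sum(1 for it in chosen if group in it[2])
        for it in by_score:
            if have >= needed:
                break
            if it[0] not in chosen_ids and group in it[2]:
                chosen.append(it)
                chosen_ids.add(it[0])
                have += 1

    # Step 2: fill the remaining slots purely by score.
    for it in by_score:
        if len(chosen) >= n:
            break
        if it[0] not in chosen_ids:
            chosen.append(it)
            chosen_ids.add(it[0])

    # Step 3: present the final list in score order.
    chosen.sort(key=lambda it: it[1], reverse=True)
    return [it[0] for it in chosen[:n]]

# Toy candidates (hypothetical): metro sellers dominate the raw scores.
candidates = [
    ("a", 0.9, {"metro"}), ("b", 0.8, {"metro"}), ("c", 0.7, {"metro"}),
    ("d", 0.6, {"metro"}), ("e", 0.5, {"new_seller"}),
    ("f", 0.4, {"regional"}), ("g", 0.3, {"new_seller", "regional"}),
]
baseline = [it[0] for it in sorted(candidates, key=lambda it: it[1], reverse=True)[:5]]
reranked = fair_rerank(candidates, {"new_seller": 0.2, "regional": 0.2}, n=5)
print(baseline, reranked)
```

Comparing NDCG of `baseline` and `reranked` against held-out relevance labels gives the satisfaction check mentioned in the outline.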

Section D: Programming Questions (2)

D1. Intermediate

Matrix Factorization with MovieLens. Download the MovieLens 100K dataset. Implement biased matrix factorization from scratch in NumPy (no ML libraries for the model itself). Train with SGD, evaluate with RMSE on a temporal 80/20 split. Experiment with k ∈ {10, 20, 50, 100} and λ ∈ {0.01, 0.02, 0.05}. Plot RMSE vs epochs for each configuration. Report the best hyperparameters and your final test RMSE.
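
As a possible starting point for D1 (not a full solution), the two building blocks below show a temporal split and a single SGD update for the biased model r̂ = μ + b_u + b_i + p_u · q_i. The sketch uses plain Python lists for readability; the exercise asks for NumPy, and the same update vectorises directly. It assumes the constant factor from differentiating the squared error is absorbed into the learning rate, and all function names are illustrative.

```python
def temporal_split(ratings, train_frac=0.8):
    """ratings: list of (user, item, rating, timestamp). Oldest train_frac go to train."""
    ordered = sorted(ratings, key=lambda r: r[3])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

def predict(mu, b_u, b_i, p_u, q_i):
    """Biased MF prediction: global mean + user bias + item bias + dot product."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def sgd_step(mu, b_u, b_i, p_u, q_i, rating, lr, lam):
    """One SGD update on a single observed rating with L2 regularisation lam."""
    err = rating - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - lam * b_u)
    new_b_i = b_i + lr * (err - lam * b_i)
    # Both factor updates use the OLD values of p_u and q_i.
    new_p_u = [p + lr * (err * q - lam * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - lam * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i, err

# Toy check: one step on one rating should shrink the prediction error.
mu, b_u, b_i = 3.0, 0.0, 0.0
p_u, q_i = [0.1, 0.2], [0.3, 0.4]
b_u2, b_i2, p_u2, q_i2, err_before = sgd_step(mu, b_u, b_i, p_u, q_i,
                                              rating=4.0, lr=0.1, lam=0.02)
err_after = 4.0 - predict(mu, b_u2, b_i2, p_u2, q_i2)
print(err_before, err_after)
```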

D2. Advanced

Two-Tower Retrieval for Bollywood Movies. Build a two-tower model in TensorFlow/Keras for the MovieLens 100K dataset. User tower inputs: user_id, age_bucket, gender. Item tower inputs: item_id, genre (multi-hot encoded). Train with binary cross-entropy and negative sampling (4 negatives per positive). Evaluate with Recall@10 and NDCG@10. Then adapt the model for a Bollywood context: create a synthetic dataset of 500 users and 100 Bollywood movies with features (language, era, star_cast). Compare retrieval quality with and without content features.
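
The negative-sampling step in D2 (4 negatives per positive) can be sketched independently of the model. This illustration samples uniformly from items the user has not interacted with; the helper names and the toy IDs are assumptions, and popularity-weighted sampling is a common alternative.

```python
import random

def sample_negatives(user_positives, all_items, n_neg, rng):
    """Uniformly sample up to n_neg items the user has NOT interacted with."""
    candidates = [item for item in all_items if item not in user_positives]
    return rng.sample(candidates, min(n_neg, len(candidates)))

def build_training_pairs(interactions, all_items, n_neg=4, seed=0):
    """Turn (user, item) positives into labelled (user, item, label) triples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    positives_by_user = {}
    for user, item in interactions:
        positives_by_user.setdefault(user, set()).add(item)

    pairs = []
    for user, item in interactions:
        pairs.append((user, item, 1))
        for neg in sample_negatives(positives_by_user[user], all_items, n_neg, rng):
            pairs.append((user, neg, 0))
    return pairs

# Toy data (hypothetical IDs): 3 observed interactions over a 10-item catalogue.
all_items = [f"i{n}" for n in range(1, 11)]
interactions = [("u1", "i1"), ("u1", "i2"), ("u2", "i3")]
pairs = build_training_pairs(interactions, all_items, n_neg=4)
print(len(pairs), pairs[:5])
```

The labelled triples feed directly into a binary cross-entropy loss on the two-tower score.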

Section 12

Chapter Summary

Key Takeaways

  1. Collaborative Filtering leverages user-item interaction patterns. Classical CF (user-based, item-based) suffers from scalability and cold-start limitations. It works on the principle: "users who agreed in the past will agree in the future."
  2. Matrix Factorization decomposes the sparse user-item matrix into two low-rank matrices: $R \approx P Q^{T}$. Each user and item gets a latent vector, and the predicted rating is their dot product plus biases. This was the Netflix Prize breakthrough.
  3. Neural Collaborative Filtering (NCF) replaces the linear dot product with a neural network, enabling the model to learn nonlinear interaction patterns. NeuMF combines a GMF path (element-wise product) with an MLP path for best results.
  4. Content-Based Deep Learning uses pre-trained models (BERT for text, ResNet for images, VGGish for audio) to create rich item embeddings, enabling recommendations based on item features rather than interaction history, which is critical for solving cold start.
  5. Hybrid Systems combine collaborative and content-based signals. In practice, all major Indian platforms (Flipkart, Amazon India, Hotstar) use hybrid architectures that blend multiple signal types.
  6. The Two-Tower Model is the workhorse of industrial RecSys. Separate user and item towers enable pre-computing item embeddings offline and using ANN search (FAISS/ScaNN) for sub-10ms retrieval at the scale of hundreds of millions of items.
  7. YouTube DNN introduced the two-stage paradigm: candidate generation (millions → hundreds) using a deep classifier + ANN, followed by ranking (hundreds → final list) using richer features. Key innovations include example age, watch-time weighting, and averaged embedding pooling.
  8. Evaluation for RecSys uses ranking metrics (Precision@K, Recall@K, NDCG@K, Hit Rate) rather than accuracy or RMSE; RMSE applies only to explicit rating prediction. Always use temporal train/test splits and compare against a popularity baseline.
  9. Recommendation Fairness is especially critical in India's diverse context. Urban bias, language bias, cold-start bias for new sellers, and popularity bias can exclude rural users, regional-language content, and small businesses. Mitigation: calibrated recommendations, exploration slots, fairness-aware re-ranking, and multi-stakeholder optimisation.
  10. Embedding layers are the foundation of all deep RecSys: they transform sparse categorical IDs (users, items, cities, languages) into dense, learnable vectors that capture semantic similarity.
The Core Equations of Chapter 19

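Matrix Factorization:   $\hat{r}_{ui} = \mu + b_u + b_i + p_u \cdot q_i$
NCF:   $\hat{r}_{ui} = \sigma\left(W^{T} \cdot [\mathrm{GMF}(p_u, q_i) \,;\, \mathrm{MLP}(p_u, q_i)]\right)$
Two-Tower:   $\mathrm{score} = f_{\mathrm{user}}(x_u)^{T} \cdot f_{\mathrm{item}}(x_i)$
NDCG@K:   $\mathrm{NDCG@K} = \mathrm{DCG@K} / \mathrm{IDCG@K}$,   $\mathrm{DCG@K} = \sum_{i=1}^{K} \mathrm{rel}_i / \log_2(i+1)$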
Section 13

References

Foundational Papers

  1. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. IEEE Computer, 42(8), 30–37. [The definitive Netflix Prize paper on MF]
  2. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.S. (2017). Neural Collaborative Filtering. WWW 2017. [NCF/NeuMF architecture]
  3. Covington, P., Adams, J., & Sargin, E. (2016). Deep Neural Networks for YouTube Recommendations. RecSys 2016. [The YouTube DNN paper, among the most influential industrial RecSys papers]
  4. Cheng, H.T., et al. (2016). Wide & Deep Learning for Recommender Systems. DLRS 2016. [Google's Wide & Deep architecture]
  5. Yi, X., et al. (2019). Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations. RecSys 2019. [Google's two-tower with in-batch negatives]

Advanced / Modern

  1. Sun, F., et al. (2019). BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformers. CIKM 2019.
  2. Rendle, S. (2010). Factorization Machines. ICDM 2010. [Generalisation of MF for feature-rich settings]
  3. Wang, R., et al. (2017). Deep & Cross Network for Ad Click Predictions. ADKDD 2017.
  4. Johnson, J., Douze, M., & Jégou, H. (2019). Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data. [FAISS for ANN search]

Indian Context & Fairness

  1. Mehta, B., & Hofmann, T. (2008). A Survey of Attack-Resistant Collaborative Filtering Algorithms. IEEE Data Engineering Bulletin, 31(2).
  2. Singh, A., & Joachims, T. (2018). Fairness of Exposure in Rankings. KDD 2018. [Fairness-aware ranking algorithms]
  3. ONDC Documentation (2023). Open Network for Digital Commerce: Algorithm Transparency Guidelines. Government of India.
  4. Amazon India Engineering Blog (2022). Scaling Recommendations for 300M+ Users with Two-Tower Models.
  5. Flipkart Tech Blog (2021). Building Real-Time Personalisation at Scale: From MF to Deep Retrieval.

Textbooks

  1. Aggarwal, C.C. (2016). Recommender Systems: The Textbook. Springer. [Comprehensive reference]
  2. Ricci, F., Rokach, L., & Shapira, B. (2015). Recommender Systems Handbook. 2nd Edition, Springer.
  3. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. [Chapter 15 on representation learning]

Datasets

  1. GroupLens Research. MovieLens Datasets. Available at: https://grouplens.org/datasets/movielens/
  2. Amazon Product Reviews Dataset. Jianmo Ni, UCSD. Available at: https://nijianmo.github.io/amazon/
  3. Last.fm Dataset (for music recommendation research).