Chapter 30: Graph Neural Networks & Knowledge Graphs

PART X: Specialized Domains | Reading Time: 3.5 hours | Prerequisites: Ch 12 (Neural Networks)

1. Learning Objectives

2. Introduction

Traditional deep learning models, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), are designed for data that resides in Euclidean space. Images can be viewed as 2D grids of pixels, and text as 1D sequences of words. However, much of the world's most valuable data is naturally represented as graphs: social networks, citation networks, biological protein-protein interactions, knowledge bases, and financial transaction networks.

A graph $G = (V, E)$ consists of a set of vertices (or nodes) $V$ and a set of edges $E$ that connect them. Unlike grids, graphs lack a fixed coordinate system, have varying neighborhood sizes, and are invariant to permutations of their nodes. Graph Neural Networks (GNNs) were developed precisely to handle this non-Euclidean, unstructured data by leveraging the topology of the graph along with the features of the individual nodes and edges.

👨‍🏫 Professor's Insight: Non-Euclidean Data

Think of an image as a highly regular, predictable graph where every inner pixel has exactly 8 neighbors. CNN filters exploit this regularity. Now imagine a social network where one person has 2 friends and another has 2 million. You cannot slide a fixed-size $3 \times 3$ filter over this network. GNNs dynamically adapt to the local structure of each node, making them generalized, flexible convolutions.

In this chapter, we will build an intuition for how GNNs operate, dive into the rigorous mathematics of spectral and spatial convolutions, and explore specialized structures known as Knowledge Graphs. By the end, you will be able to construct a fraud detection engine and a social network analyzer using modern GNN frameworks.

3. Historical Background

The roots of graph theory trace back to 1736 when Leonhard Euler solved the Seven Bridges of Königsberg problem, establishing the foundations of topology and graph analysis. For centuries, graph analysis relied on statistical measures like degree centrality, clustering coefficients, and PageRank (introduced by Google's founders in 1998).

In the deep learning era, the timeline accelerated rapidly:

4. Conceptual Explanation

To understand GNNs, we must first break down the components of a graph and the framework that powers learning across it: Message Passing.

Graph Components

The Message Passing Framework

In a standard neural network layer, a node's feature vector is multiplied by a weight matrix and passed through an activation function. In a GNN, we add a crucial step: Neighborhood Aggregation.

For every node, we:

  1. Generate a message: Based on the neighboring nodes' features.
  2. Aggregate messages: Combine the messages from all neighbors. This aggregation function must be permutation invariant (e.g., Sum, Mean, Max) because the order of neighbors in a graph does not matter.
  3. Update node representation: Combine the aggregated message with the node's own current feature vector to create its new embedding.
🇮🇳 India Spotlight: UPI & Multi-Hop Context

Consider the UPI payment ecosystem. A fraudster might create multiple synthetic accounts. Individually, each account looks normal (standard features). But via Message Passing, if account A is 1 hop away from 5 accounts that are 2 hops away from known fraudulent IPs, the GNN pulls that "guilt by association" context directly into account A's embedding. This multi-hop aggregation is why GNNs are revolutionizing NPCI's fraud detection.

Inductive vs. Transductive Learning

Early GCNs required the entire graph (Adjacency matrix) to be present during training. If a new node joined the network tomorrow, the model had to be retrained. This is Transductive learning. Models like GraphSAGE changed this by learning aggregation functions rather than node-specific embeddings, enabling Inductive learning where the model can easily generate embeddings for entirely unseen nodes or even completely new graphs.

5. Mathematical Foundation

GNNs rely heavily on linear algebra and spectral graph theory. Let's define the fundamental matrices.

The Adjacency & Degree Matrices

Given graph $G$ with $N$ nodes. The Adjacency Matrix $A \in \mathbb{R}^{N \times N}$ represents connections.
The Degree Matrix $D$ is a diagonal matrix where $D_{ii} = \sum_j A_{ij}$, representing the number of connections for node $i$.

The Graph Laplacian

The unnormalized Graph Laplacian is defined as:

L = D - A

The Laplacian is pivotal in spectral graph theory because it acts as a differential operator on graphs (similar to the Laplace operator in continuous calculus). The eigenvectors of the Laplacian provide a basis for graph Fourier analysis.

The Normalized Laplacian

To prevent numerical instabilities and scale issues (where high-degree nodes accumulate massive feature values), we use the symmetrically normalized Laplacian:

L_sym = I - D^(-1/2) * A * D^(-1/2)

This ensures that the eigenvalues of the Laplacian lie within the range [0, 2].

Spectral vs. Spatial Convolutions

6. Formula Derivations

Derivation of the Graph Convolutional Network (GCN)

Thomas Kipf and Max Welling bridged the gap between spectral and spatial methods by applying localized first-order approximations to spectral graph convolutions.

Starting with the spectral convolution of signal $x$ with filter $g_\theta$:

g_θ * x = U g_θ U^T x

Where $U$ is the matrix of eigenvectors of $L$. Computing $U$ is $O(N^3)$.

Step 1: Chebyshev Polynomial Approximation
Hammond et al. suggested approximating $g_\theta$ using a truncated expansion of Chebyshev polynomials $T_k(x)$ up to $K^{th}$ order. This avoids eigendecomposition and localizes the filter to $K$-hops.

Step 2: First-Order Approximation ($K=1$)
Kipf and Welling limited the convolution to $K=1$ (just 1-hop neighbors) to prevent overfitting on specific graph structures and to speed up computation. This simplifies the equation to a function of the adjacency matrix $A$ and identity matrix $I$.

Step 3: The Renormalization Trick
To ensure numerical stability and include the node's own features (self-loops), they added the identity matrix to $A$: $\tilde{A} = A + I$.
Correspondingly, the new degree matrix is $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.

This leads to the famous, elegant GCN forward propagation rule for layer $l$:

H^(l+1) = σ( \tilde{D}^(-1/2) \tilde{A} \tilde{D}^(-1/2) H^(l) W^(l) )

Where $H^{(l)}$ is the matrix of node features at layer $l$, $W^{(l)}$ is the trainable weight matrix, and $\sigma$ is an activation function like ReLU.

Graph Attention Networks (GAT)

GAT introduces the attention mechanism. Instead of statically weighting neighbor contributions using the degree matrix (as GCN does), GAT computes dynamic attention coefficients $e_{ij}$ between node $i$ and $j$:

e_{ij} = LeakyReLU( a^T [ W h_i || W h_j ] )

These are then normalized using the softmax function across all neighbors of node $i$ to get $\alpha_{ij}$. The final update rule is:

h_i^(l+1) = σ( \sum_{j \in N(i)} \alpha_{ij} W h_j^(l) )
📝 Exam Tip

Be prepared to contrast GCN, GraphSAGE, and GAT. GCN uses fixed, structure-based weights ($\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$). GraphSAGE allows different aggregation functions (Mean, LSTM, Pool). GAT uses learnable, feature-based weights (Attention).

7. Worked Numerical Examples

Let's manually compute one layer of GCN for a miniature graph.

Graph: 3 nodes forming a line: Node 1 --- Node 2 --- Node 3.

Adjacency Matrix (A):


0 1 0
1 0 1
0 1 0
            

Step 1: Add Self-Loops ($\tilde{A} = A + I$):


1 1 0
1 1 1
0 1 1
            

Step 2: Compute Degree Matrix ($\tilde{D}$):


2 0 0
0 3 0
0 0 2
            

Step 3: Compute $\tilde{D}^{-1/2}$:


1/√2  0     0
0     1/√3  0
0     0     1/√2
            

Step 4: Compute Normalized Adjacency ($\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$):

Let's calculate the element at row 1, col 2:
$\hat{A}_{12} = \frac{1}{\sqrt{2}} \times 1 \times \frac{1}{\sqrt{3}} = \frac{1}{\sqrt{6}} \approx 0.408$

The resulting matrix smoothly diffuses information based on the degrees of the source and target nodes.

Step 5: Feature Multiplication:
Assume input features $H^{(0)}$ and weights $W^{(0)}$. The operation $\hat{A} H^{(0)} W^{(0)}$ first projects features into a new space (via $W$), then averages them across neighborhoods (via $\hat{A}$). Finally, apply ReLU.

8. Visual Diagrams (ASCII art)

A typical graph and its representation matrices:


      [Node A] 
     /        \
 [Node B] -- [Node C]
    |
 [Node D]

 Nodes (V): {A, B, C, D}
 Edges (E): {(A,B), (A,C), (B,C), (B,D)}

 Adjacency Matrix (A)      Feature Matrix (H)
    A  B  C  D             [ F1  F2  F3 ]
 A  0  1  1  0          A  [0.5  0.1  0.0 ]
 B  1  0  1  1          B  [0.9  0.8  0.2 ]
 C  1  1  0  0          C  [0.1  0.1  0.8 ]
 D  0  1  0  0          D  [0.2  0.6  0.4 ]
            

Message Passing Visualization:


 Target Node (B) receives messages from its neighbors (A, C, D)
 
 [A] --msg--> \
 [C] --msg-->  (AGGREGATE) ---> [Update Function] ---> New B Embedding
 [D] --msg--> /
            

9. Flowcharts (ASCII art)

The architecture of a typical Node Classification GCN Pipeline:


+-------------------+      +-------------------+      +-------------------+
| Input Graph Data  |      |   GNN Layer 1     |      |   GNN Layer 2     |
| (A, H_0)          | ---> | Neighborhood Agg  | ---> | Neighborhood Agg  |
+-------------------+      | Linear Transform  |      | Linear Transform  |
                           | ReLU Activation   |      | Softmax (Output)  |
                           +-------------------+      +-------------------+
                                                           |
                                                           v
                                                +-----------------------+
                                                | Node Classifications  |
                                                | (e.g. Fraud / Normal) |
                                                +-----------------------+
            

10. Python Implementation (from scratch)

Let's implement a rudimentary Graph Convolutional Layer using pure NumPy to solidify the mathematical concepts.

💻 Code Challenge: Numpy GCN

Study this implementation carefully. Notice how the symmetric normalization handles the division by square roots of degrees.

import numpy as np

class GCNLayer:
    def __init__(self, in_features, out_features):
        # Initialize weights randomly
        self.W = np.random.randn(in_features, out_features) * 0.1
        
    def normalize_adjacency(self, A):
        # A_tilde = A + I (Add self-loops)
        I = np.eye(A.shape[0])
        A_tilde = A + I
        
        # D_tilde: Degree matrix
        D_tilde = np.diag(np.sum(A_tilde, axis=1))
        
        # D_tilde^(-1/2)
        # Add small epsilon to prevent division by zero
        D_inv_sqrt = np.linalg.inv(np.sqrt(D_tilde + 1e-9))
        
        # Symmetric normalization: D^(-1/2) * A_tilde * D^(-1/2)
        A_norm = np.dot(np.dot(D_inv_sqrt, A_tilde), D_inv_sqrt)
        return A_norm
        
    def relu(self, x):
        return np.maximum(0, x)
        
    def forward(self, A, H):
        """
        A: Adjacency matrix (N x N)
        H: Node feature matrix (N x in_features)
        """
        A_norm = self.normalize_adjacency(A)
        
        # Message passing: A_norm * H
        neighborhood_agg = np.dot(A_norm, H)
        
        # Linear transformation: ... * W
        Z = np.dot(neighborhood_agg, self.W)
        
        # Activation
        return self.relu(Z)

# Testing the layer
A = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0]
])
H = np.array([
    [1.0, -1.0],
    [2.0,  0.5],
    [0.0,  1.0]
])

gcn = GCNLayer(in_features=2, out_features=4)
H_new = gcn.forward(A, H)

print("Output Node Embeddings:\n", H_new)

11. TensorFlow Implementation

While NumPy is great for learning, in production we use deep learning frameworks. Here is how you can implement a GCN layer in TensorFlow / Keras.

import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense

class GraphConvLayer(Layer):
    def __init__(self, units, activation=tf.nn.relu, **kwargs):
        super(GraphConvLayer, self).__init__(**kwargs)
        self.units = units
        self.activation = activation

    def build(self, input_shape):
        # input_shape is a list: [shape_of_features, shape_of_adjacency]
        feature_shape = input_shape[0]
        self.kernel = self.add_weight(shape=(feature_shape[-1], self.units),
                                      initializer='glorot_uniform',
                                      name='kernel')
        self.bias = self.add_weight(shape=(self.units,),
                                    initializer='zeros',
                                    name='bias')
        super(GraphConvLayer, self).build(input_shape)

    def call(self, inputs):
        features, adjacency = inputs
        
        # 1. Feature Transformation: H * W
        transformed_features = tf.matmul(features, self.kernel)
        
        # 2. Neighborhood Aggregation: A * (H * W)
        # Note: In practice, 'adjacency' should be pre-normalized.
        aggregated = tf.matmul(adjacency, transformed_features)
        
        # Add bias
        outputs = aggregated + self.bias
        
        # Apply activation
        if self.activation is not None:
            outputs = self.activation(outputs)
            
        return outputs

# Constructing a simple 2-layer GCN model
def build_gcn_model(num_nodes, num_features, num_classes):
    features_input = tf.keras.Input(shape=(num_features,))
    adjacency_input = tf.keras.Input(shape=(num_nodes,), sparse=False)
    
    x = GraphConvLayer(16)([features_input, adjacency_input])
    x = tf.keras.layers.Dropout(0.5)(x)
    output = GraphConvLayer(num_classes, activation=tf.nn.softmax)([x, adjacency_input])
    
    model = tf.keras.Model(inputs=[features_input, adjacency_input], outputs=output)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

Note: In real-world PyTorch/TF pipelines, libraries like PyTorch Geometric (PyG) or Deep Graph Library (DGL) are preferred over custom layers because they optimize sparse matrix multiplications and message passing on GPUs using scatter-gather operations.

12. Scikit-Learn Pipeline

GNNs require joint optimization of structure and features, which standard Scikit-Learn pipelines cannot do directly. However, a common classical machine learning workflow is to extract graph features (like Node2Vec embeddings or PageRank) using NetworkX, and feed those into a Random Forest.

import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Create a dummy graph using NetworkX
G = nx.erdos_renyi_graph(100, 0.05)
# Assign dummy labels based on degree parity
labels = {node: (1 if G.degree[node] % 2 == 0 else 0) for node in G.nodes()}

# 2. Feature Engineering: Compute Graph Metrics
pageranks = nx.pagerank(G)
clustering_coeffs = nx.clustering(G)
centralities = nx.betweenness_centrality(G)

X = []
y = []
for node in G.nodes():
    # Build feature vector
    features = [
        G.degree[node],
        pageranks[node],
        clustering_coeffs[node],
        centralities[node]
    ]
    X.append(features)
    y.append(labels[node])

X = np.array(X)
y = np.array(y)

# 3. Scikit-Learn Pipeline
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

preds = clf.predict(X_test)
print(f"Random Forest Accuracy on Graph Features: {accuracy_score(y_test, preds):.2f}")

13. Indian Case Studies

🇮🇳 India Spotlight: Implementing GNNs at Scale

India's massive population and digital public infrastructure (India Stack) generate some of the largest graph datasets in the world. Applying GNNs to these datasets yields unique solutions.

1. UPI Fraud Ring Detection (NPCI & Banks)

The Problem: Traditional rule-based engines look at individual transaction volumes. Fraudsters bypass this by creating "mule" accounts—hundreds of fake accounts that transfer small, sub-radar amounts.

GNN Solution: By constructing a Transaction Graph where Nodes = Bank Accounts/VPA IDs and Edges = Transactions, GNNs can identify dense subgraphs (cliques or bipartite cores) that are indicative of money laundering or fraud rings. A GAT model can assign high attention weights to edges connecting suspicious nodes, flagging the entire ring simultaneously.

2. Indian Railways (IRCTC) Network Optimization

The Problem: A delay in one train causes a cascading effect on the entire railway network, delaying dozens of other trains due to shared tracks and platforms.

GNN Solution: The rail network is modeled as a Spatial-Temporal Graph. Nodes are railway stations, and edges are the tracks. Train movements add a temporal dimension. Spatio-Temporal GCNs (ST-GCNs) are used to predict traffic bottlenecks and dynamically route trains to minimize systemic delays across zones.

3. Aadhaar Deduplication & Knowledge Graphs

The Problem: Resolving identities to prevent duplicate Aadhaar card issuances.

GNN Solution: A Knowledge Graph is constructed where entities are demographic data points (addresses, partial names, phone numbers) and biometric hashes. Knowledge Graph Embeddings (like TransE) map these entities into a dense vector space, allowing the UIDAI system to perform rapid link prediction to see if a "new" applicant actually resolves to an existing identity.

14. Global Case Studies

1. Google Knowledge Graph

When you search for "Leonardo da Vinci," the rich panel on the right side of the search results is powered by the Google Knowledge Graph. It contains billions of facts connected by relationships (e.g., [Da Vinci] -> IS_A -> [Painter]). Graph embeddings allow Google to infer missing facts and answer complex multi-hop queries instantly.

2. AlphaFold 2 (DeepMind)

AlphaFold solved the 50-year-old protein folding problem. It represents amino acid residues as nodes in a graph. Edges represent spatial proximity. It uses an SE(3)-equivariant Graph Neural Network to iteratively refine the 3D coordinates of the atoms, leading to atomic-level accuracy in structure prediction.

3. Pinterest (PinSage)

Pinterest uses a massive bipartite graph of Users and Pins. They developed PinSage, an industrial-scale random-walk based GraphSAGE model. PinSage generates embeddings for 3 billion nodes and 18 billion edges, powering highly visual, context-aware content recommendations.

15. Startup Applications

Startups are heavily leveraging GNNs for niche, high-value problems:

16. Government Applications

Governments utilize graph analytics to enhance public service and security:

17. Industry Applications

🏢 Industry Alert: Recommendation Engines

If you are building an e-commerce platform, Collaborative Filtering is dead. GNN-based recommendation systems (modeling User-Item bipartite graphs) are the industry standard because they inherently capture high-order collaborative signals (e.g., Users who bought Item A also bought Item B, which is similar to Item C).

18. Mini Projects (2-3)

Mini Project 1: Social Network Analyzer (Node Classification)

Goal: Predict the political affiliation of a user based on their social connections and profile text.
Dataset: Cora Citation Network (Nodes = Papers, Edges = Citations, Features = Bag of Words).
Task: Use PyTorch Geometric (PyG) to build a 2-layer GCN. Compare the accuracy of standard MLP vs. GCN to prove the value of edge data.
Key Learning: Message passing implementations.

Mini Project 2: Fraud Ring Detection (Link Prediction)

Goal: Identify missing edges (hidden associations) between seemingly unrelated accounts.
Dataset: Elliptic Data Set (Bitcoin transactions).
Task: Implement GraphSAGE. Train the model using positive edges (actual transactions) and negative edges (random unconnected nodes). Use binary cross-entropy loss to predict edge existence.
Key Learning: Inductive learning and negative sampling techniques.

Mini Project 3: Knowledge Graph Explorer

Goal: Build a movie recommendation engine using Knowledge Graph Embeddings.
Dataset: Freebase or MovieLens-1M mapped to entities.
Task: Implement the TransE model. Train embeddings such that vector(User) + vector(Likes) ≈ vector(Movie). Perform nearest neighbor search in the embedding space to recommend movies.
Key Learning: Relational triples ($h, r, t$) and translational distances.

19. Exercises (20+)

  1. Explain why standard Convolutional Neural Networks cannot be directly applied to graph data.
  2. Define the Adjacency Matrix and Degree Matrix. Write out both for a 4-node square graph.
  3. What is the significance of the Graph Laplacian in spectral graph theory?
  4. Why do we add self-loops (the identity matrix) to the Adjacency matrix in Kipf & Welling's GCN?
  5. Mathematically explain the symmetric normalization trick: $D^{-1/2} A D^{-1/2}$.
  6. Explain the concept of permutation invariance in neighborhood aggregation.
  7. Why is the Mean aggregation function sometimes inadequate compared to the Sum aggregation function? Give an example.
  8. What is the difference between Transductive and Inductive learning in GNNs?
  9. How does GraphSAGE achieve inductive capabilities?
  10. Describe the attention mechanism in Graph Attention Networks (GAT).
  11. What is the Weisfeiler-Lehman (WL) graph isomorphism test?
  12. How does the Graph Isomorphism Network (GIN) relate to the WL test?
  13. In a Knowledge Graph, what does a triple $(h, r, t)$ represent? Give an example.
  14. Explain the TransE objective function. Why does it fail on 1-to-N relationships?
  15. How does RotatE improve upon TransE for complex relations?
  16. Describe how a GNN can be used for Graph-Level Classification (e.g., molecular property prediction).
  17. What is the "oversmoothing" problem in deep GNNs (e.g., with 10+ layers)?
  18. How do skip-connections or architectures like JK-Nets mitigate oversmoothing?
  19. Explain how Spatio-Temporal GNNs are structured to handle traffic forecasting.
  20. Using NetworkX, write a Python snippet to calculate the PageRank of a graph and print the top 3 nodes.

20. MCQs (10+ with click-to-reveal)

1. Which of the following handles Inductive learning natively?
  • A. Spectral GCN
  • B. GraphSAGE
  • C. DeepWalk
  • D. TransE
Answer: B. GraphSAGE learns aggregation functions rather than node-specific embeddings, making it inductive.
2. In Kipf and Welling's GCN, the Adjacency matrix is normalized using:
  • A. $D^{-1} A$
  • B. $D^{-1/2} A D^{-1/2}$
  • C. $A + D$
  • D. Softmax
Answer: B. Symmetric normalization $D^{-1/2} \tilde{A} D^{-1/2}$ prevents numerical instability.
3. What does TransE attempt to satisfy in its embedding space?
  • A. $h \times r = t$
  • B. $h + r \approx t$
  • C. $h \cdot t = r$
  • D. $||h|| + ||t|| = r$
Answer: B. It models relationships as translations in the vector space.
4. The oversmoothing problem in GNNs refers to:
  • A. Gradients vanishing during backprop
  • B. Node embeddings converging to the same vector in deep architectures
  • C. The loss curve becoming too smooth
  • D. Too many edges being removed
Answer: B. After many rounds of message passing, all nodes in a connected component mix their features until they become indistinguishable.
5. Which task involves predicting whether a connection exists between two nodes?
  • A. Node Classification
  • B. Graph Classification
  • C. Link Prediction
  • D. Graph Generation
Answer: C. Link Prediction.
6. GAT utilizes which mechanism to weigh neighbor importance?
  • A. Random Walks
  • B. Node Degrees
  • C. Self-Attention
  • D. PageRank
Answer: C. Self-Attention allows GAT to learn dynamic edge weights based on node features.
7. Which algorithm is considered the foundation for continuous node embeddings based on random walks?
  • A. GIN
  • B. ResNet
  • C. DeepWalk
  • D. YOLO
Answer: C. DeepWalk applied Word2Vec skip-gram techniques to random graph walks.
8. In a molecular graph predicting drug toxicity, which GNN pooling strategy is most appropriate?
  • A. Global Average Pooling over all nodes
  • B. Dropping 50% of the nodes randomly
  • C. Only looking at the highest degree node
  • D. Link prediction between carbon atoms
Answer: A. Global Average Pooling combines all node embeddings to form a single graph-level embedding for classification.
9. Which framework is explicitly built to optimize operations on graph data using PyTorch?
  • A. Pandas
  • B. PyTorch Geometric (PyG)
  • C. Keras
  • D. Scikit-learn
Answer: B. PyTorch Geometric.
10. To capture directed and complex logical relationships (e.g., symmetric, antisymmetric), which KG embedding is preferred?
  • A. TransE
  • B. RotatE
  • C. Word2Vec
  • D. TF-IDF
Answer: B. RotatE models relationships as rotations in the complex plane, capturing symmetry and inversion.

21. Interview Questions (10+)

🚀 Career Path: AI Graph Engineer

Graph AI Engineers are highly sought after at companies like Meta, Google, and Stripe. Mastering these interview questions sets you apart from standard NLP/CV candidates.

  1. What is the difference between Transductive and Inductive learning in the context of graphs?
    Answer: Transductive learning requires the entire graph structure during training. It cannot generate embeddings for unseen nodes without retraining. Inductive learning (like GraphSAGE) learns the aggregation functions, allowing it to generalize to completely new nodes or graphs at inference time.
  2. How do you handle extremely large graphs (e.g., 1 billion nodes) that don't fit into GPU memory?
    Answer: By using neighborhood sampling techniques (e.g., GraphSAGE random sampling, Cluster-GCN, or GraphSAINT) which create mini-batches of subgraphs rather than loading the full adjacency matrix.
  3. Explain the Weisfeiler-Lehman (WL) Test and its relation to GNNs.
    Answer: The WL test checks if two graphs are isomorphic by iteratively hashing node features and neighborhood features. It was proven that standard message-passing GNNs are at most as powerful as the 1-WL test. GIN was designed to theoretically match this maximum expressive power.
  4. What causes oversmoothing in GNNs and how do you prevent it?
    Answer: After many layers, a node's receptive field encompasses the entire graph, causing all node embeddings to converge to a similar average vector. Prevent it by keeping GNNs shallow (2-4 layers), using skip connections (ResGCN), or methods like DropEdge.
  5. Describe TransE. What is its major limitation?
    Answer: TransE models relationships as $h + r \approx t$. Its major limitation is handling 1-to-N or N-to-1 relationships. For example, if [Shakespeare] + [wrote] = [Hamlet] and [Shakespeare] + [wrote] = [Macbeth], TransE forces the embeddings of Hamlet and Macbeth to be identical.
  6. How does Graph Attention (GAT) differ from standard GCN?
    Answer: GCN uses a fixed, structure-based normalization factor ($\frac{1}{\sqrt{d_i d_j}}$). GAT uses learnable attention weights based on node features, allowing the model to dynamically prioritize which neighbors are most important.
  7. What are the challenges of applying GNNs to directed graphs?
    Answer: In directed graphs, the adjacency matrix is asymmetric. This complicates the symmetric normalization of the Laplacian. Message passing must explicitly handle directionality (e.g., maintaining separate weights for incoming vs. outgoing edges).
  8. How would you use a GNN for an unsupervised anomaly detection task?
    Answer: By using an Autoencoder framework. A GNN encoder compresses the graph into node embeddings. A decoder attempts to reconstruct the adjacency matrix and feature matrix. Nodes or edges with high reconstruction error are flagged as anomalies.
  9. Explain the concept of Heterogeneous Graphs.
    Answer: Heterogeneous graphs have multiple types of nodes and edges (e.g., an academic graph with Paper, Author, and Institution nodes, and 'Wrote', 'Affiliated_with' edges). Specialized architectures like Heterogeneous Graph Neural Networks (HGNNs) use type-specific weight matrices during message passing.
  10. What is the difference between Node2Vec and a GCN?
    Answer: Node2Vec is a shallow, unsupervised method based on random walks that only captures structural equivalence. It cannot utilize node features. GCN is a deep learning model that jointly optimizes both graph structure and rich node features, typically in a supervised or semi-supervised manner.

22. Research Problems (3-4)

23. Key Takeaways

24. References