Overview
Link prediction in social graphs asks: given the current structure of a friendship network, which missing edges are likely to exist? This is the problem behind “people you may know” recommendations on social platforms — predicting future connections from current graph topology.
The Facebook benchmark dataset provides a snapshot of a real social graph. The task is to train a classifier on known edges, hold out a test set of edges, and predict whether a given node pair should be connected.
Graph EDA
Before any modeling, the graph was analyzed for structural properties that would inform feature design.
Degree distribution: node degree followed a power law — a small number of nodes with very high degree (hubs), and many nodes with few connections. This is characteristic of social networks and has implications for certain features: a common neighbor who is a hub (high degree) is a weaker signal than a common neighbor who is a niche connector.
Clustering coefficient: the probability that two neighbors of a node are also connected to each other. High local clustering means the graph has dense communities — a strong prior for future edge formation within communities.
Connected components and path lengths: shortest path analysis revealed the “small world” property — most node pairs can be reached in a small number of hops despite the large network size.
These properties directly shaped which structural features to compute.
Feature Engineering
Two classes of features were engineered for each candidate node pair (u, v):
Structural Features
Common Neighbors: count of shared neighbors between u and v. The most direct signal — if u and v have many mutual friends, they’re likely to know each other.
Jaccard Coefficient: common neighbors / union of neighbors. Normalizes common neighbors by network neighborhood size.
Adamic-Adar Index: sum of 1/log(degree) for each common neighbor. Down-weights high-degree common neighbors — a shared friend who has 1000 friends is a weaker signal than one who has 10. This turned out to be the single most predictive feature.
Preferential Attachment: product of degree(u) × degree(v). The intuition: high-degree nodes are more likely to gain new connections simply because they are already well-connected.
Shortest Path Length: distance between u and v in the current graph. Shorter distance = higher probability of future connection.
Node-Level Features
Degree, eigenvector centrality, clustering coefficient, and PageRank for each node in the pair.
Modeling and Results
XGBoost and Random Forest were both trained on the engineered feature set. Evaluated on AUC-ROC on a held-out test set.
Key finding: structural features dramatically outperformed node-level features. The top predictors were Adamic-Adar, common neighbors count, and shortest path length. Individual node centrality measures contributed marginal lift at best.
This result makes intuitive sense: link prediction is fundamentally about the relationship between two nodes, not the properties of each node individually. The structural features directly measure that relationship.
Feature importance ranking (XGBoost):
- Adamic-Adar index
- Common neighbors count
- Shortest path length
- Jaccard coefficient
- Preferential attachment
- Node degree features (marginal contribution)
Connection to Later Research
This project was the precursor to the graph ML research completed at NUS, where the same core featurization logic was extended to temporal graphs using graph attention networks. The insight that neighborhood structure dominates over node properties in link prediction carried over directly into the feature design choices for that research.
See Graph Link Prediction with Temporal Attention for the published research version of this problem.