Case Studies Predicting Missing Friendship Links from Social Graph Structure
ml Personal Research Project · Applied Project

Predicting Missing Friendship Links from Social Graph Structure

Full-pipeline graph Machine Learning project predicting missing friendship links in a Facebook social network - from graph EDA through structural feature engineering to XGBoost and Random Forest evaluation.

Problem

Given the current state of a social graph, predict which pairs of currently unconnected users are likely to form friendships - the core recommendation problem behind 'people you may know' features.

Outcome

AUC-ROC evaluation showing that pair-level structural features (Adamic-Adar, common neighbors, shortest path) decisively outperform individual node-level features for link prediction, with Adamic-Adar the single strongest predictor.

Impact - who used it & what changed

A self-directed research project that established the featurization strategy I later carried directly into temporal-graph link prediction research at NUS - its value is methodological, not commercial.

The problem

Link prediction in social graphs asks a deceptively simple question: given the current structure of a friendship network, which missing edges are likely to exist? This is the problem behind “people you may know” recommendations - predicting future connections from current graph topology.

I took this on as a self-directed project to learn graph Machine Learning end to end, before tackling the harder temporal version of the same problem at NUS. Starting from a benchmark Facebook social-graph snapshot, the task was concrete: train a classifier on known edges, hold out a test set of edges, and predict whether a given node pair should be connected. The deliverable I cared about was not a single accuracy number - it was understanding which kind of features actually carry the signal in link prediction, so that insight would transfer to later work.

Reading the graph before building features

Before any modeling, I analyzed the graph for structural properties that would inform feature design. Skipping this step is how you build the wrong features.

Degree distribution: node degree followed a power law. A small number of nodes with very high degree (hubs), and many nodes with few connections. Characteristic of social networks, and it has implications for certain features. A common neighbor who is a hub (high degree) is a weaker signal than a common neighbor who is a niche connector.

Clustering coefficient: the probability that two neighbors of a node are also connected to each other. High local clustering means the graph has dense communities - a strong prior for future edge formation within those communities.

Connected components and path lengths: shortest path analysis revealed the “small world” property. Most node pairs reachable in a small number of hops despite the large network size.

These three properties directly shaped which structural features to compute - and, more importantly, gave a prior for why pair-level features should beat node-level ones.

Feature engineering

Two classes of features for each candidate node pair (u, v), built deliberately so the comparison between them would be informative.

Structural features (the relationship between u and v)

Common Neighbors: count of shared neighbors between u and v. The most direct signal - if u and v have many mutual friends, they are likely to know each other.

Jaccard Coefficient: common neighbors over union of neighbors. Normalizes common neighbors by network neighborhood size.

Adamic-Adar Index: sum of 1/log(degree) for each common neighbor. Down-weights high-degree common neighbors - a shared friend who has 1,000 friends is a weaker signal than one who has 10. This turned out to be the single most predictive feature.

Preferential Attachment: product of degree(u) × degree(v). The intuition: high-degree nodes are more likely to gain new connections simply because they are already well-connected.

Shortest Path Length: distance between u and v in the current graph. Shorter distance equals higher probability of future connection.

Node-level features (properties of each node alone)

Degree, eigenvector centrality, clustering coefficient, and PageRank for each node in the pair. Included specifically so I could measure how much they add on top of the structural features.

Modeling and what it showed

I trained XGBoost and Random Forest on the engineered feature set and evaluated AUC-ROC on a held-out edge set.

The result that mattered: pair-level structural features decisively outperformed node-level features. The top predictors were Adamic-Adar, common neighbors count, and shortest path length, with Adamic-Adar the single strongest. Individual node centrality measures contributed marginal lift at best.

This makes intuitive sense in hindsight, but seeing it in the feature importances is what made it stick: link prediction is fundamentally about the relationship between two nodes, not the properties of each node individually. The structural features directly measure that relationship; node-level features do not.

Feature importance ranking (XGBoost):

  1. Adamic-Adar index
  2. Common neighbors count
  3. Shortest path length
  4. Jaccard coefficient
  5. Preferential attachment
  6. Node degree features (marginal contribution)

Why it mattered

This was a learning project, and its payoff was the next project. It was the direct precursor to the graph Machine Learning research I completed at NUS, where the same core featurization logic was extended to temporal graphs using graph attention networks. The conclusion here - that neighborhood structure dominates node properties in link prediction - carried straight into the feature-design choices for that research, so I went into it already knowing where the signal lives instead of rediscovering it.

See Graph Link Prediction with Temporal Attention for the research version of this problem.

Stack

Python NetworkX XGBoost Scikit-learn Pandas Matplotlib
graph-ml link-prediction social-networks feature-engineering

Have a problem worth solving?

Whether you need a quantitative researcher, a Machine Learning systems builder, or a technical advisor, I take a small number of consulting engagements at a time.

Book a call →