Context
During my undergraduate studies at IIT Bombay, I did a research internship at the National University of Singapore with Prof. Bryan Hooi’s group, an active graph-mining lab. The task: temporal graph representation learning - use attention to capture how relationships evolve over time in a dynamic network, then use that representation to predict which edges form next.
It was my first exposure to cutting-edge Machine Learning research, and it is where I first learned that a result and a claim are not the same thing - and that Machine Learning research and Machine Learning engineering are different disciplines. The engineering question is “does it run and scale.” The research question is “is the number real, and does it survive an honest comparison.” This project was about the second one.
The Problem
Most graph Machine Learning work at the time treated graphs as static: a fixed set of nodes and edges, embeddings learned from that frozen structure. But real networks evolve. Social connections form and dissolve, transactions create temporal patterns, information spreads in time-ordered sequences.
For link prediction - predicting which connections will form in the future - the timing of past interactions is not noise. It is signal. A model that ignores when interactions happened is throwing away the most predictive part of the data. The question was how to represent a time-evolving graph so that signal survives into the prediction.
Why It Mattered
Temporal link prediction underpins social-network recommendation, fraud detection (spotting unusual connection patterns), academic collaboration prediction, and dynamic knowledge-graph completion. The lab’s interest was the core representation question - how to encode a time-evolving graph usefully - rather than any single application.
Data & Inputs
- College Messages dataset: a temporal graph of timestamped direct messages between students at a US college. The task: predict which node pairs will interact in the future.
- Additional datasets to cross-check the approach beyond a single graph.
- Node features where available; structural features derived from graph topology otherwise.
The timestamp on each interaction was the thing we were trying to exploit. That is what made the problem interesting and what every baseline either used or ignored.
The Work
I implemented and benchmarked the relevant temporal and static methods on a common footing:
- node2vec - static random-walk embeddings; the static baseline.
- TMF (Temporal Matrix Factorization) - time-aware matrix factorization.
- CTDNE (Continuous-Time Dynamic Network Embeddings) - random walks biased by temporal order.
- BANE (Binarized Attributed Network Embedding) - attribute-aware embeddings.
The contribution was a temporal attention model that:
- encodes the sequence of interactions a node has participated in,
- weights past interactions with an attention mechanism rather than a fixed recency rule, and
- combines that temporal history with structural graph features for the link-prediction head.
The attention mechanism was the load-bearing design choice. It let the model learn which past interactions were most predictive, instead of assuming the most recent one always was - a subtle distinction with a measurable effect on the result.
Engineering & Validation
The harder half of the project was making the comparison honest:
- Implemented in PyTorch from scratch - no existing library had the exact temporal attention architecture I needed.
- Negative sampling for training: random negatives plus hard-negative mining near the decision boundary.
- Strict temporal split - train only on edges before a cutoff time, evaluate only on edges after it, so no future information leaks into training. This is the difference between a real number and an inflated one.
- Hyperparameters tuned by grid search on a validation split.
- Reimplemented every baseline myself and ran all methods on identical train/test splits, identical negative sampling, and the same metric.
Without that discipline, the comparison would have been meaningless - you can make almost any model “win” with a generous split. Holding every method to the same protocol is what turned the 86% from a claim into a result.
Outcome
- 86% AUC on the College Messages dataset for link prediction.
- Outperformed every benchmarked baseline - node2vec, TMF, CTDNE, BANE - under the one shared, leak-free protocol.
- Demonstrated that temporal attention captures structure that static methods cannot, and that the gain held under the same evaluation given to the baselines.
Impact
This was research, not a product, so the honest measure of impact is academic: the work was conducted inside Prof. Bryan Hooi’s NUS lab and reviewed by his research group, with findings presented back to them. That review by an active graph-mining group is the external validation - the approach and its results were scrutinized by people whose job is to find the hole in a comparison. There was no production or commercial deployment, and I am not claiming one.
Limitations & What I’d Do Differently
- The attention model was trained end-to-end on the link-prediction objective alone. It would likely benefit from self-supervised temporal pre-training before fine-tuning - the BERT-style recipe applied to interaction sequences.
- It assumed every node had enough interaction history to learn a meaningful representation. New nodes needed special handling that wasn’t part of this implementation - cold start is always the hard problem.
Stack
Python, PyTorch, NetworkX, NumPy, Pandas, Matplotlib.