Detecting Semantically Duplicate Questions Despite Different Wording (Quora Question Pairs)

An NLP pipeline for the Quora Question Pairs problem - pairing hand-crafted semantic features with embedding-based and siamese-LSTM models to flag questions that ask the same thing in different words, with a documented ablation showing what each feature family actually contributes.

Problem

Given two questions, decide whether they are semantically equivalent. The same meaning can be worded in countless ways: pairs sharing almost no words can be duplicates, and pairs sharing most of their words can be unrelated. That mismatch between surface form and meaning breaks rule-based matching and makes simple bag-of-words classifiers unreliable.

Outcome

An end-to-end pipeline that combines hand-crafted semantic features (lexical overlap, TF-IDF cosine, fuzzy matching, length) with Word2Vec embedding distances and a weight-shared siamese LSTM, ensembled together - with an ablation that isolates where each family carries signal.

Impact - who used it & what changed

A self-directed NLP project on a public benchmark; its value is methodological - it established the feature-vs-deep-learning ablation discipline and the semantic-similarity featurization I reused in later graph and biomedical text work. No commercial deployment is claimed.

The work

I built a duplicate-question detector for the Quora Question Pairs dataset, treating it as the genuinely hard problem it is rather than another text-classification exercise. The pipeline runs two complementary feature families through gradient-boosted and deep models, then ensembles them, and I documented an ablation that shows what each family actually buys.
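One way the ensembling and comparison step could look is sketched below. It assumes a simple weighted average of the two models' predicted probabilities, with the weight chosen on held-out data by log loss; the blending scheme, the metric, and all numbers are illustrative assumptions rather than the project's recorded setup.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def blend(p_a, p_b, weight_a):
    """Weighted average of two models' duplicate probabilities."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(p_a, p_b)]


def best_blend_weight(y_true, p_a, p_b, steps=20):
    """Grid-search the blend weight in [0, 1] that minimises validation log loss."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: log_loss(y_true, blend(p_a, p_b, w)))


# Invented validation labels and predictions, for illustration only.
y_val = [1, 0, 1, 0]
p_features = [0.9, 0.4, 0.3, 0.1]   # hypothetical feature-model outputs
p_siamese = [0.6, 0.2, 0.8, 0.4]    # hypothetical siamese-LSTM outputs

w = best_blend_weight(y_val, p_features, p_siamese)
print("feature model alone:", log_loss(y_val, p_features))
print("siamese model alone:", log_loss(y_val, p_siamese))
print("blend at weight", w, ":", log_loss(y_val, blend(p_features, p_siamese, w)))
```

Because weights 0 and 1 are both in the grid, the selected blend can never score worse on the validation set than either model alone, and comparing the three numbers is a small-scale version of the ablation question.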

Hand-crafted semantic features - the strong baseline: lexical overlap, TF-IDF cosine similarity, fuzzy-match scores, and length features.
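A minimal sketch of how features in this family could be computed for one question pair. It uses only the standard library so that it stays self-contained: `difflib.SequenceMatcher` stands in for a fuzzy-matching library, and the TF-IDF weighting is hand-rolled with a smoothed IDF. The function names, the tokenizer, and the exact feature set are assumptions for illustration, not the project's code.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(corpus):
    """Smoothed inverse document frequency over a list of question strings."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(tokenize(text)))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf_cosine(tokens_a, tokens_b, idf):
    """Cosine similarity between the TF-IDF vectors of two token lists."""
    default_idf = max(idf.values(), default=1.0)  # unseen tokens are treated as rare
    vec_a = {t: c * idf.get(t, default_idf) for t, c in Counter(tokens_a).items()}
    vec_b = {t: c * idf.get(t, default_idf) for t, c in Counter(tokens_b).items()}
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_features(q1, q2, idf):
    """One feature row for a question pair: overlap, TF-IDF cosine, fuzzy ratio, length gap."""
    t1, t2 = tokenize(q1), tokenize(q2)
    s1, s2 = set(t1), set(t2)
    union = s1 | s2
    return {
        "jaccard": len(s1 & s2) / len(union) if union else 0.0,
        "tfidf_cosine": tfidf_cosine(t1, t2, idf),
        "fuzzy_ratio": SequenceMatcher(None, q1.lower(), q2.lower()).ratio(),
        "len_diff": abs(len(t1) - len(t2)),
    }


# Tiny illustrative corpus; in practice the IDF would be fit on all training questions.
corpus = ["How do I learn Python?", "What should I learn after Python?", "How do I cook pasta?"]
idf = build_idf(corpus)
print(pair_features(corpus[0], corpus[1], idf))
```

Each pair becomes one flat row of numbers, which is the shape a gradient-boosted model such as XGBoost expects.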

Embedding-based features - for the cases overlap misses: distances between the two questions' Word2Vec embeddings.
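A rough sketch of one embedding-based distance: average each question's word vectors, then take the cosine distance between the two averages. The averaging scheme, the choice of cosine distance, and the tiny hand-written vector table are assumptions for illustration; a real run would look vectors up in a trained Word2Vec model (for example through Gensim) instead of `TOY_VECTORS`.

```python
import math

# Toy 3-dimensional vectors standing in for trained Word2Vec embeddings.
# The values are invented for illustration only.
TOY_VECTORS = {
    "learn": [0.9, 0.1, 0.0],
    "study": [0.8, 0.2, 0.1],
    "python": [0.1, 0.9, 0.2],
    "programming": [0.2, 0.8, 0.3],
    "cook": [0.0, 0.1, 0.9],
    "pasta": [0.1, 0.0, 0.8],
}


def mean_vector(tokens, vectors):
    """Average the vectors of in-vocabulary tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_distance(u, v):
    """1 - cosine similarity; zero-length vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def embedding_distance(q1_tokens, q2_tokens, vectors):
    """Cosine distance between the mean word vectors of two tokenized questions."""
    v1 = mean_vector(q1_tokens, vectors)
    v2 = mean_vector(q2_tokens, vectors)
    if v1 is None or v2 is None:
        return 1.0  # assumption: fully out-of-vocabulary questions count as distant
    return cosine_distance(v1, v2)


# No shared words, yet the paraphrase pair sits much closer than the unrelated pair.
paraphrase = embedding_distance(["learn", "python"], ["study", "programming"], TOY_VECTORS)
unrelated = embedding_distance(["learn", "python"], ["cook", "pasta"], TOY_VECTORS)
print(paraphrase, unrelated)
```

The first pair shares no tokens at all, so any overlap feature scores it zero, while the embedding distance still places it close; that is the gap this feature family is meant to cover.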

Models: XGBoost on the hand-crafted features as the baseline; a weight-shared siamese LSTM over the embedding sequences; and an ensemble of the two. The siamese branches share weights, so both questions pass through the same encoder, which encourages (though, with a concatenated head, does not strictly guarantee) the symmetry constraint that similarity(Q1, Q2) equals similarity(Q2, Q1). Their final hidden states are concatenated into a dense sigmoid head, with dropout (0.3) after the LSTM layers, and the model is trained with binary cross-entropy loss.
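The siamese architecture described above could be sketched in Keras roughly as follows. The dropout rate, the concatenated sigmoid head, and the binary cross-entropy loss come from the description; the vocabulary size, sequence length, layer widths, the single LSTM layer, the trainable embedding layer, and the Adam optimizer are placeholder assumptions, and `build_siamese_lstm` is a hypothetical name.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_siamese_lstm(vocab_size=20000, max_len=30, embed_dim=100, lstm_units=64):
    """Siamese LSTM: one shared encoder applied to both questions."""
    q1_in = keras.Input(shape=(max_len,), dtype="int32", name="q1")
    q2_in = keras.Input(shape=(max_len,), dtype="int32", name="q2")

    # Shared layers: the same layer objects are called on both inputs,
    # so the two branches use one set of weights.
    embed = layers.Embedding(vocab_size, embed_dim)
    encoder = layers.LSTM(lstm_units)  # returns the final hidden state
    drop = layers.Dropout(0.3)

    h1 = drop(encoder(embed(q1_in)))
    h2 = drop(encoder(embed(q2_in)))

    # Concatenate the two final hidden states and score with a sigmoid unit.
    merged = layers.Concatenate()([h1, h2])
    out = layers.Dense(1, activation="sigmoid")(merged)

    model = keras.Model(inputs=[q1_in, q2_in], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_siamese_lstm()
model.summary()
```

Because the embedding, LSTM, and dropout layers are each created once and called twice, the summary lists a single set of encoder weights rather than one per branch.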

The context

Most text-classification tasks reward simple solutions. This one punishes them. “How do I learn Python?” and “What’s the best way to start learning Python programming?” are duplicates with little word overlap; “How do I learn Python?” and “What should I learn after Python?” share most of their words and are not. Rule-based matching breaks immediately on paraphrase, so the question becomes which feature strategy recovers meaning from surface form.
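The mismatch in those three questions can be checked with a few lines of Python. This is only a worked illustration using Jaccard word overlap and a naive tokenizer, both chosen here for simplicity.

```python
import re


def word_set(text):
    """Lowercased set of words (letters and apostrophes only)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def overlap(q1, q2):
    """Jaccard overlap: shared words divided by all distinct words in the pair."""
    a, b = word_set(q1), word_set(q2)
    return len(a & b) / len(a | b)


anchor = "How do I learn Python?"
paraphrase = "What's the best way to start learning Python programming?"
lookalike = "What should I learn after Python?"

print(round(overlap(anchor, paraphrase), 3))  # 0.077: a duplicate sharing only "python"
print(round(overlap(anchor, lookalike), 3))   # 0.375: not a duplicate, shares "i", "learn", "python"
```

A threshold on word overlap would rank these two pairs in exactly the wrong order.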

The data is text-only, with no user metadata or behavioral signals.

The outcome

A working duplicate-detection pipeline with a full feature ablation that makes the trade-offs explicit rather than implicit.

The impact

This is a self-directed project on a public benchmark, so the honest claim is methodological, not commercial. What it produced was a transferable discipline: build the cheap, interpretable feature baseline first, then prove the deep model earns its place by measuring where it actually improves over that baseline. That semantic-similarity featurization and the feature-vs-deep-learning ablation habit carried directly into my later graph link-prediction and biomedical text work - the value is the method, not a deployment.

Stack

Python, TensorFlow, Keras, NLTK, Gensim (Word2Vec), Scikit-learn, XGBoost, Pandas.
