
Semantic Duplicate Detection at Scale (Quora Question Pairs)

Built an NLP pipeline for semantic duplicate question detection on the Quora Question Pairs dataset — combining feature engineering with deep learning to identify semantically equivalent questions despite surface-level phrasing differences.

Problem

Quora's content quality depends on detecting semantically duplicate questions — questions that ask the same thing in different words — to consolidate answers and reduce redundancy at platform scale.

Outcome

End-to-end NLP pipeline combining hand-crafted semantic features with embedding-based similarity for duplicate detection, with documented ablation analysis.

Overview

The Quora Question Pairs benchmark is a canonical NLP task: given two questions, determine if they are semantically equivalent. It sounds straightforward but is genuinely hard — natural language allows infinite surface-level variation over a fixed semantic space. “How do I learn Python?” and “What’s the best way to start learning Python programming?” are duplicates; “How do I learn Python?” and “What should I learn after Python?” are not.

The Problem

At platform scale, duplicate question detection affects answer quality (consolidate answers to the same question), user experience (redirect users to the canonical question), and content quality (reduce noise). Rule-based approaches break immediately on paraphrase variation.

Data & Inputs

~400K labeled question pairs from the public Quora Question Pairs dataset, of which ~37% are labeled duplicates.

Approach

Two complementary feature families:

Hand-crafted semantic features: lexical overlap (word and character n-grams), TF-IDF cosine similarity, length statistics, and fuzzy-matching scores.

Embedding-based features: Word2Vec representations and sequence encodings that measure similarity even where surface word overlap is low.

Models: XGBoost on hand-crafted features (strong baseline), LSTM-based siamese network on embedding sequences, ensemble.

Results & Impact

The ensemble of hand-crafted and embedding-based features outperformed either family alone; detailed ablation findings are documented under Technical Detail below.

Technical Detail

Dataset characteristics: ~400K question pairs, ~37% labeled duplicate. The class imbalance reflects the natural platform distribution — most pairs of questions are not duplicates. Class-weighted training prevents the model from defaulting to “not duplicate” for all predictions.
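
A minimal sketch of the weighting arithmetic, using toy labels that mirror the ~37% positive rate. `scale_pos_weight` is the standard XGBoost parameter for this; the exact weighting scheme used in the project is an assumption here:

```python
import numpy as np

# Toy labels mirroring the dataset's ~37% duplicate rate
y = np.array([1] * 37 + [0] * 63)

pos, neg = int(y.sum()), int(len(y) - y.sum())
scale_pos_weight = neg / pos          # ≈ 1.70, passed straight to XGBoost

# Equivalent per-sample weights for frameworks that accept sample_weight
sample_weight = np.where(y == 1, scale_pos_weight, 1.0)

print(round(scale_pos_weight, 2))     # → 1.7
```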

Hand-crafted similarity features:

Lexical overlap: unigram Jaccard similarity (|intersection| / |union| of word sets), bigram overlap, character n-gram similarity at the 3- and 4-gram level (handles spelling variations and domain abbreviations)
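
These overlap measures are plain set arithmetic; a stdlib-only sketch (the example questions are illustrative, not drawn from the dataset):

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, with 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def char_ngrams(text: str, n: int) -> set:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

q1 = "how do i learn python"
q2 = "how to learn python programming"

unigram_jaccard = jaccard(set(q1.split()), set(q2.split()))  # 3 shared / 7 total ≈ 0.43
trigram_sim = jaccard(char_ngrams(q1, 3), char_ngrams(q2, 3))
```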

TF-IDF cosine similarity: vectorize each question against the full corpus vocabulary, then compute cosine similarity between the pair vectors. Down-weights common words and emphasizes discriminative terms.
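
With scikit-learn (listed in the stack) this is a few lines; the toy corpus below stands in for the full question corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "how do i learn python",
    "best way to learn python",
    "what is the capital of france",
    "which city is the capital of france",
]

# Fit IDF weights on the corpus, then vectorize the candidate pair
vec = TfidfVectorizer().fit(corpus)
pair = vec.transform(["how do i learn python", "best way to learn python"])
sim = cosine_similarity(pair[0], pair[1])[0, 0]
```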

Length features: absolute length difference, ratio of shorter to longer question. Extreme length differences weakly signal that questions are asking different things.

Fuzzy matching: token sort ratio and token set ratio using edit-distance variants — handles word reordering between semantically identical questions (“best Python resources” vs. “Python best resources”).
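
Token sort/set ratios are commonly provided by libraries such as fuzzywuzzy or RapidFuzz (the source does not name one); a stdlib approximation with `difflib` shows the idea:

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a: str, b: str) -> float:
    """Compare strings after sorting their tokens, so word order no longer matters."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

raw = ratio("best python resources", "python best resources")                   # < 1.0
sorted_r = token_sort_ratio("best python resources", "python best resources")   # 1.0
```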

Embedding-based features: Word2Vec embeddings (trained with Gensim) map each token to a dense vector; question-level similarity computed over these vectors captures paraphrase relationships that lexical overlap misses.
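
One common way to turn token embeddings into a question-level similarity feature is to average them and take the cosine; the exact aggregation used in the project is not specified, so averaging is an assumption. Toy 3-d vectors stand in for 300-d Word2Vec embeddings:

```python
import numpy as np

def avg_vector(tokens, emb, dim=3):
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d vectors standing in for real Word2Vec embeddings
emb = {
    "learn":  np.array([1.0, 0.2, 0.0]),
    "study":  np.array([0.9, 0.3, 0.1]),
    "python": np.array([0.1, 1.0, 0.4]),
}

sim = cosine(avg_vector(["learn", "python"], emb),
             avg_vector(["study", "python"], emb))   # near-synonymous → high
```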

Siamese LSTM architecture: both questions pass through a shared embedding layer and a shared LSTM encoder, and the two resulting encodings are compared by a similarity layer that outputs the duplicate probability. Weight sharing guarantees the pair is treated symmetrically.
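
A compact Keras sketch of the shared-encoder idea. The hyperparameters (vocab size, embedding dim, sequence length, LSTM units) and the exponential negative Manhattan distance head are assumptions, not specified in the source:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, DIM, MAXLEN = 20000, 100, 30   # assumed hyperparameters

def build_siamese() -> Model:
    # One encoder applied to both questions — weights are shared
    encoder = tf.keras.Sequential([
        layers.Embedding(VOCAB, DIM),  # initialize from Word2Vec in practice
        layers.LSTM(64),
    ])
    q1 = layers.Input(shape=(MAXLEN,), dtype="int32")
    q2 = layers.Input(shape=(MAXLEN,), dtype="int32")
    h1, h2 = encoder(q1), encoder(q2)
    # exp(-||h1 - h2||_1): identical encodings → 1.0, distant ones → near 0.0
    sim = layers.Lambda(
        lambda t: tf.exp(-tf.reduce_sum(tf.abs(t[0] - t[1]), axis=1, keepdims=True))
    )([h1, h2])
    return Model(inputs=[q1, q2], outputs=sim)
```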

Ablation findings: hand-crafted lexical features alone gave strong baseline performance, confirming that surface-level overlap is a powerful signal. Embedding-based features added the most value on paraphrase cases where word overlap is low but semantic meaning is equivalent (“What is the capital of France?” vs. “Which city is France’s capital?”). The ensemble of both feature families outperformed either alone — they capture complementary signal and their errors are not fully correlated.
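
The combination method is not specified in the write-up; a plausible minimal version is a weighted average of the two models' predicted probabilities (the numbers below are purely illustrative):

```python
import numpy as np

# Hypothetical duplicate probabilities from the two models
p_xgb  = np.array([0.82, 0.10, 0.55])   # XGBoost on hand-crafted features
p_lstm = np.array([0.70, 0.05, 0.71])   # siamese LSTM on embedding sequences

p_blend = 0.5 * p_xgb + 0.5 * p_lstm    # equal weights; tune on validation data
labels = (p_blend >= 0.5).astype(int)

print(labels.tolist())                  # → [1, 0, 1]
```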

Stack

Python, TensorFlow, Keras, NLTK, Gensim (Word2Vec), Scikit-learn, XGBoost, Pandas


Let's collaborate!

Whether you need a quantitative researcher, a machine learning systems builder, or a technical advisor — I'm available for select consulting engagements.

Get in Touch →