Semantic Duplicate Detection at Scale (Quora Question Pairs)

Overview

The Quora Question Pairs benchmark is a canonical NLP task: given two questions, determine if they are semantically equivalent. It sounds straightforward but is genuinely hard — natural language allows infinite surface-level variation over a fixed semantic space. “How do I learn Python?” and “What’s the best way to start learning Python programming?” are duplicates; “How do I learn Python?” and “What should I learn after Python?” are not.

The Problem

At platform scale, duplicate question detection affects answer quality (consolidate answers to the same question), user experience (redirect users to the canonical question), and content quality (reduce noise). Rule-based approaches break immediately on paraphrase variation.

Data & Inputs

Quora Question Pairs dataset: ~400K question pairs, labeled duplicate/not-duplicate
Class imbalance: approximately 37% duplicates
Text only — no user metadata or behavioral signals

Approach

Two complementary feature families:

Hand-crafted semantic features:

Word overlap: shared unigrams, bigrams, trigrams (Jaccard similarity)
TF-IDF weighted cosine similarity
Edit distance and character n-gram similarity
Length-based features: absolute and relative length difference

Embedding-based features:

Word2Vec sentence embeddings (average and weighted average)
Distance metrics between embedding pairs: cosine, Euclidean, Manhattan

Models: XGBoost on hand-crafted features (strong baseline), an LSTM-based siamese network on embedding sequences, and an ensemble of the two.

Results & Impact

Working duplicate detection pipeline with full feature ablation analysis
Hand-crafted semantic features provided a strong baseline, demonstrating that simple NLP features carry significant signal
LSTM siamese network improved over the baseline on paraphrase cases where word overlap is low
Ensemble combined the complementary strengths of both approaches

Technical Detail

Dataset characteristics: ~400K question pairs, ~37% labeled duplicate. The released dataset is a curated sample, so this ratio should not be read as the true duplicate rate on the platform, where the large majority of question pairs are not duplicates. Because non-duplicates are still the majority class, class-weighted training is used so the model does not lean toward predicting “not duplicate” by default.
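
As a rough illustration of class weighting, the sketch below derives per-class weights from label counts. The `balanced_class_weights` helper and the toy label list are assumptions for illustration, not the project's code. The weights follow the common "balanced" heuristic, and the negative-to-positive ratio is the form usually passed to XGBoost's `scale_pos_weight` parameter.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels mirroring the ~37% duplicate rate (1 = duplicate, 0 = not duplicate).
labels = [1] * 37 + [0] * 63
weights = balanced_class_weights(labels)

# Ratio of negatives to positives, the usual value for XGBoost's scale_pos_weight.
scale_pos_weight = labels.count(0) / labels.count(1)
```

With these weights each class contributes the same total weight to the loss, so the minority "duplicate" class is not drowned out.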

Hand-crafted similarity features:

Lexical overlap: unigram Jaccard similarity (|intersection| / |union| of word sets), bigram overlap, character n-gram similarity at the 3- and 4-gram level (handles spelling variations and domain abbreviations)
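
A minimal sketch of these overlap features, assuming lowercased whitespace tokenization (the source does not specify the tokenizer; NLTK is listed in the stack). The helper names are illustrative.

```python
def word_ngrams(text, n):
    """Set of word n-grams (as tuples) from a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def char_ngrams(text, n):
    """Set of character n-grams after lowercasing and collapsing whitespace."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """|intersection| / |union|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


q1 = "how do i learn python"
q2 = "how do i learn java"

features = {
    "unigram_jaccard": jaccard(word_ngrams(q1, 1), word_ngrams(q2, 1)),
    "bigram_jaccard": jaccard(word_ngrams(q1, 2), word_ngrams(q2, 2)),
    "char3_jaccard": jaccard(char_ngrams(q1, 3), char_ngrams(q2, 3)),
    "char4_jaccard": jaccard(char_ngrams(q1, 4), char_ngrams(q2, 4)),
}
```

This pair also shows the limit of overlap features: four of six distinct words are shared, yet the questions are not duplicates.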

TF-IDF cosine similarity: vectorize each question against the full corpus vocabulary, then compute cosine similarity between the pair vectors. Down-weights common words and emphasizes discriminative terms.
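
The computation can be sketched in plain Python as follows. This is an illustrative stand-in, not the project's implementation: in practice a library vectorizer such as Scikit-learn's would likely be used. The smoothed IDF formula and the four-question toy corpus are assumptions.

```python
import math
from collections import Counter


def fit_idf(corpus):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc.lower().split()))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; terms outside the vocabulary are dropped."""
    term_freq = Counter(text.lower().split())
    return {t: count * idf[t] for t, count in term_freq.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


corpus = [
    "how do i learn python",
    "what is the best way to learn python",
    "what should i learn after python",
    "how do i cook rice",
]
idf = fit_idf(corpus)

sim_related = cosine(tfidf_vector(corpus[0], idf), tfidf_vector(corpus[1], idf))
sim_unrelated = cosine(tfidf_vector(corpus[1], idf), tfidf_vector(corpus[3], idf))
```

Terms that appear in many questions (here "python") receive a lower IDF than rare ones (here "cook"), which is the down-weighting described above.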

Length features: absolute length difference, ratio of shorter to longer question. Extreme length differences weakly signal that questions are asking different things.
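
A small sketch of the two length features, assuming length is measured in whitespace-separated tokens (character counts would work the same way; the source does not say which was used).

```python
def length_features(q1, q2):
    """Absolute token-count difference and shorter/longer ratio for a question pair."""
    len1, len2 = len(q1.split()), len(q2.split())
    longer = max(len1, len2)
    return {
        "abs_len_diff": abs(len1 - len2),
        # Ratio lies in [0, 1]; two empty questions are treated as equal length.
        "len_ratio": min(len1, len2) / longer if longer else 1.0,
    }


example = length_features("how do i learn python", "learn python")
```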

Fuzzy matching: token sort ratio and token set ratio using edit-distance variants — handles word reordering between semantically identical questions (“best Python resources” vs. “Python best resources”).
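
The idea behind both ratios can be sketched with the standard library. Note the substitution: `difflib.SequenceMatcher` uses a matching-blocks similarity rather than a true edit distance, and the token set ratio below is a simplified version of what fuzzy-matching libraries implement. The source does not say which library was used.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity in [0, 1] between two strings (stand-in for an edit-distance ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def token_sort_ratio(q1, q2):
    """Compare the questions after sorting their tokens, so word order is ignored."""
    s1 = " ".join(sorted(q1.lower().split()))
    s2 = " ".join(sorted(q2.lower().split()))
    return ratio(s1, s2)


def token_set_ratio(q1, q2):
    """Compare shared tokens against each question's shared-plus-unique tokens."""
    t1, t2 = set(q1.lower().split()), set(q2.lower().split())
    common = " ".join(sorted(t1 & t2))
    combined1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    return max(
        ratio(common, combined1),
        ratio(common, combined2),
        ratio(combined1, combined2),
    )


plain = ratio("best Python resources", "Python best resources")
sorted_score = token_sort_ratio("best Python resources", "Python best resources")
subset_score = token_set_ratio("learn python", "how do i learn python")
```

On the reordered pair the plain ratio is below 1 while the token sort ratio is exactly 1, which is the reordering robustness described above.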

Embedding-based features:

Word2Vec embeddings trained on the Quora corpus (100-dimensional, skip-gram architecture)
Sentence representation: simple average of word vectors, and TF-IDF weighted average (down-weights stop words during aggregation)
Distance metrics between sentence embedding pairs: cosine distance, Euclidean distance, Manhattan distance — each captures different aspects of the embedding geometry
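
The aggregation and distance steps above can be sketched as follows. The 3-dimensional vectors and TF-IDF weights are made-up toy values standing in for the 100-dimensional Word2Vec vectors, and the helper names are illustrative; a real pipeline would more likely use NumPy arrays.

```python
import math


def weighted_average(tokens, vectors, weights=None):
    """Average word vectors; with `weights`, each token is scaled by its weight."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # skip out-of-vocabulary tokens
        w = 1.0 if weights is None else weights.get(tok, 0.0)
        for i, x in enumerate(vectors[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing could be aggregated
    return [x / weight_sum for x in total]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical toy embeddings and TF-IDF weights.
toy_vectors = {
    "capital": [0.9, 0.1, 0.0], "france": [0.1, 0.9, 0.0],
    "city": [0.8, 0.2, 0.1], "of": [0.0, 0.0, 1.0], "is": [0.0, 0.1, 0.9],
}
toy_tfidf = {"capital": 2.0, "france": 2.0, "city": 1.5, "of": 0.2, "is": 0.2}

q1 = ["capital", "of", "france"]
q2 = ["city", "is", "france", "capital"]

s1 = weighted_average(q1, toy_vectors, toy_tfidf)
s2 = weighted_average(q2, toy_vectors, toy_tfidf)
embedding_features = {
    "cosine": cosine_distance(s1, s2),
    "euclidean": euclidean_distance(s1, s2),
    "manhattan": manhattan_distance(s1, s2),
}
```

Calling `weighted_average` without `weights` gives the simple average; passing TF-IDF weights shrinks the influence of words like "of" and "is".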

Siamese LSTM architecture:

Two LSTM branches sharing weights, each processing one question in the pair
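Shared weights mean both questions are encoded by the same function, which encourages a symmetric score: similarity(Q1, Q2) ≈ similarity(Q2, Q1)
Final hidden states concatenated and passed to a dense classification layer with sigmoid output; because concatenation is order-sensitive, exact symmetry is not guaranteed by the architecture alone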
Dropout (0.3) applied after LSTM layers for regularization
Binary cross-entropy loss
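
To make the classification head concrete, here is a toy sketch in plain Python of the concatenate, dense, sigmoid, and loss steps. The hidden states `h1` and `h2`, the dense weights, and the bias are made-up numbers standing in for what the shared LSTM and training would produce; this is not the Keras model itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_sigmoid(features, weights, bias):
    """Single-unit dense layer followed by a sigmoid activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def binary_cross_entropy(y_true, p, eps=1e-7):
    """BCE for one example; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def symmetric_merge(a, b):
    """Order-invariant alternative to concatenation: |a - b| and a * b."""
    return [abs(x - y) for x, y in zip(a, b)] + [x * y for x, y in zip(a, b)]


# Hypothetical final hidden states from the two (weight-sharing) LSTM branches.
h1 = [0.2, -0.5]
h2 = [0.7, 0.1]
weights = [0.4, -0.3, 0.1, 0.8]  # hypothetical dense-layer weights
bias = 0.0

p_forward = dense_sigmoid(h1 + h2, weights, bias)   # concat(h1, h2)
p_swapped = dense_sigmoid(h2 + h1, weights, bias)   # concat(h2, h1)
loss = binary_cross_entropy(1, p_forward)           # label 1 = duplicate

p_sym_forward = dense_sigmoid(symmetric_merge(h1, h2), weights, bias)
p_sym_swapped = dense_sigmoid(symmetric_merge(h2, h1), weights, bias)
```

With concatenation, swapping the two questions generally changes the score (`p_forward` differs from `p_swapped`), whereas an order-invariant merge such as `symmetric_merge` yields the same score either way. Training on both orderings of each pair is another common way to approximate symmetry.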

Ablation findings: hand-crafted lexical features alone gave strong baseline performance, confirming that surface-level overlap is a powerful signal. Embedding-based features added the most value on paraphrase cases where word overlap is low but semantic meaning is equivalent (“What is the capital of France?” vs. “Which city is France’s capital?”). The ensemble of both feature families outperformed either alone — they capture complementary signal and their errors are not fully correlated.
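
The source does not describe how the ensemble was built. One simple possibility, shown here purely as an assumption, is a weighted average of the two models' predicted duplicate probabilities; the `blend` helper and its default weight are hypothetical.

```python
def blend(p_xgb, p_lstm, weight_xgb=0.5):
    """Weighted average of two lists of predicted duplicate probabilities."""
    if not 0.0 <= weight_xgb <= 1.0:
        raise ValueError("weight_xgb must be in [0, 1]")
    if len(p_xgb) != len(p_lstm):
        raise ValueError("prediction lists must have the same length")
    return [weight_xgb * a + (1.0 - weight_xgb) * b for a, b in zip(p_xgb, p_lstm)]


# Toy probabilities for two question pairs from each model.
blended = blend([0.9, 0.2], [0.5, 0.4])
```

The blend weight would normally be tuned on a held-out validation set.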

Stack

Python, TensorFlow, Keras, NLTK, Gensim (Word2Vec), Scikit-learn, XGBoost, Pandas
