Overview
The Quora Question Pairs benchmark is a canonical NLP task: given two questions, determine if they are semantically equivalent. It sounds straightforward but is genuinely hard — natural language allows infinite surface-level variation over a fixed semantic space. “How do I learn Python?” and “What’s the best way to start learning Python programming?” are duplicates; “How do I learn Python?” and “What should I learn after Python?” are not.
The Problem
At platform scale, duplicate question detection affects answer quality (consolidate answers to the same question), user experience (redirect users to the canonical question), and content quality (reduce noise). Rule-based approaches break immediately on paraphrase variation.
Data & Inputs
- Quora Question Pairs dataset: ~400K question pairs, labeled duplicate/not-duplicate
- Class imbalance: approximately 37% duplicates
- Text only — no user metadata or behavioral signals
Approach
Two complementary feature families:
Hand-crafted semantic features:
- Word overlap: shared unigrams, bigrams, trigrams (Jaccard similarity)
- TF-IDF weighted cosine similarity
- Edit distance and character n-gram similarity
- Length-based features: absolute and relative length difference
Embedding-based features:
- Word2Vec sentence embeddings (average and weighted average)
- Distance metrics between embedding pairs: cosine, Euclidean, Manhattan
Models: XGBoost on hand-crafted features (strong baseline), an LSTM-based Siamese network on embedding sequences, and an ensemble of the two.
Results & Impact
- Working duplicate detection pipeline with full feature ablation analysis
- Hand-crafted semantic features provided strong baseline — demonstrating that simple NLP features carry significant signal
- Siamese LSTM network improved over the baseline on paraphrase cases where word overlap is low
- Ensemble combined the complementary strengths of both approaches
Technical Detail
Dataset characteristics: ~400K question pairs, ~37% labeled duplicate. The class imbalance reflects the natural platform distribution — most pairs of questions are not duplicates. Class-weighted training prevents the model from defaulting to “not duplicate” for all predictions.
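The class weighting can be sketched with scikit-learn's balanced weighting; the 37/63 split mirrors the dataset statistics above, while the inverse-frequency scheme is a standard choice rather than a confirmed detail of the original pipeline:

```python
# Sketch: class-weighted training to counter the ~37% duplicate rate.
# Inverse-frequency ("balanced") weighting is an assumed, standard choice.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([1] * 37 + [0] * 63)  # toy labels mirroring the 37% duplicate rate

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))
# The minority "duplicate" class receives the larger weight, so errors on
# duplicates cost more and the model cannot win by always predicting
# "not duplicate".
```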
Hand-crafted similarity features:
Lexical overlap: unigram Jaccard similarity (|intersection| / |union| of word sets), bigram overlap, character n-gram similarity at the 3- and 4-gram level (handles spelling variations and domain abbreviations)
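The overlap features above can be sketched in a few lines; function names are illustrative, not from the original code:

```python
# Sketch of the lexical-overlap features; names are illustrative.

def jaccard(q1: str, q2: str) -> float:
    """Unigram Jaccard similarity: |intersection| / |union| of word sets."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def char_ngrams(text: str, n: int) -> set:
    """Character n-grams, which tolerate spelling variation ('colour'/'color')."""
    t = text.lower()
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def char_ngram_jaccard(q1: str, q2: str, n: int = 3) -> float:
    """Jaccard similarity over character n-gram sets."""
    a, b = char_ngrams(q1, n), char_ngrams(q2, n)
    return len(a & b) / len(a | b) if a | b else 0.0
```

For example, "how do i learn python" vs. "how do i learn python fast" share 5 of 6 distinct words, giving a unigram Jaccard of 5/6.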
TF-IDF cosine similarity: vectorize each question against the full corpus vocabulary, then compute cosine similarity between the pair vectors. Down-weights common words and emphasizes discriminative terms.
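A minimal sketch of the TF-IDF cosine feature; in the real pipeline the vectorizer would be fit on the full question corpus, so the tiny toy corpus here is only a stand-in:

```python
# Sketch: TF-IDF weighted cosine similarity between a question pair.
# The toy corpus below stands in for the full Quora question corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "how do i learn python",
    "what is the best way to start learning python programming",
    "what should i learn after python",
]
vectorizer = TfidfVectorizer().fit(corpus)

def tfidf_cosine(q1: str, q2: str) -> float:
    """Cosine similarity between the TF-IDF vectors of the two questions."""
    v = vectorizer.transform([q1, q2])
    return float(cosine_similarity(v[0], v[1])[0, 0])
```

Because IDF down-weights words that appear in most questions ("what", "how"), two questions match strongly only when they share discriminative terms.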
Length features: absolute length difference, ratio of shorter to longer question. Extreme length differences weakly signal that questions are asking different things.
Fuzzy matching: token sort ratio and token set ratio using edit-distance variants — handles word reordering between semantically identical questions (“best Python resources” vs. “Python best resources”).
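The token sort ratio can be sketched with the standard library's `difflib`; a real pipeline would typically call a library such as fuzzywuzzy or rapidfuzz (which also provide the token set ratio) rather than this hand-rolled version:

```python
# Sketch of a fuzzywuzzy-style token sort ratio built on difflib.
# A production pipeline would normally use fuzzywuzzy/rapidfuzz directly.
from difflib import SequenceMatcher

def token_sort_ratio(q1: str, q2: str) -> float:
    """Sort tokens before comparing, so word reordering no longer matters."""
    s1 = " ".join(sorted(q1.lower().split()))
    s2 = " ".join(sorted(q2.lower().split()))
    return SequenceMatcher(None, s1, s2).ratio()
```

After sorting, "best Python resources" and "Python best resources" become the same string, so reordered-but-identical questions score a perfect match.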
Embedding-based features:
- Word2Vec embeddings trained on the Quora corpus (100-dimensional, skip-gram architecture)
- Sentence representation: simple average of word vectors, and TF-IDF weighted average (down-weights stop words during aggregation)
- Distance metrics between sentence embedding pairs: cosine distance, Euclidean distance, Manhattan distance — each captures different aspects of the embedding geometry
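The averaging and distance features above can be sketched with a toy 4-dimensional vocabulary; the real features use the 100-dimensional Word2Vec vectors trained on the corpus, and the vectors below are invented for illustration:

```python
# Sketch: sentence embeddings by averaging word vectors, plus the three
# distance features. Toy 4-d vectors stand in for trained 100-d Word2Vec.
import numpy as np

EMB = {  # toy word vectors (illustrative, not trained)
    "learn":  np.array([1.0, 0.0, 0.0, 0.0]),
    "python": np.array([0.0, 1.0, 0.0, 0.0]),
    "study":  np.array([0.9, 0.1, 0.0, 0.0]),
}

def sentence_vec(text: str) -> np.ndarray:
    """Simple average of word vectors (out-of-vocabulary words skipped)."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(4)

def distances(q1: str, q2: str) -> dict:
    """Cosine, Euclidean, and Manhattan distances between sentence vectors."""
    v1, v2 = sentence_vec(q1), sentence_vec(q2)
    cos = 1.0 - float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return {
        "cosine": cos,
        "euclidean": float(np.linalg.norm(v1 - v2)),
        "manhattan": float(np.abs(v1 - v2).sum()),
    }
```

Cosine distance ignores vector magnitude (and hence sentence length effects), while Euclidean and Manhattan distances are sensitive to it, which is why all three are kept as separate features.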
Siamese LSTM architecture:
- Two LSTM branches sharing weights, each processing one question in the pair
- Shared weights enforce the symmetry constraint: similarity(Q1, Q2) = similarity(Q2, Q1)
- Final hidden states concatenated and passed to a dense classification layer with sigmoid output
- Dropout (0.3) applied after LSTM layers for regularization
- Binary cross-entropy loss
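The architecture above can be sketched in Keras; the vocabulary size, sequence length, and LSTM width below are illustrative assumptions, while the 0.3 dropout, concatenation, and sigmoid output follow the description:

```python
# Sketch of the shared-weight Siamese LSTM in Keras.
# VOCAB, MAX_LEN, and the LSTM width are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import Model, layers

VOCAB, EMB_DIM, MAX_LEN = 5000, 100, 30

q1_in = layers.Input(shape=(MAX_LEN,), name="q1")
q2_in = layers.Input(shape=(MAX_LEN,), name="q2")

# A single embedding layer and a single LSTM applied to both inputs:
# this is what shares the weights, so both questions are encoded by
# the same function.
embed = layers.Embedding(VOCAB, EMB_DIM)
encoder = layers.LSTM(64)
dropout = layers.Dropout(0.3)

h1 = dropout(encoder(embed(q1_in)))
h2 = dropout(encoder(embed(q2_in)))

merged = layers.Concatenate()([h1, h2])
out = layers.Dense(1, activation="sigmoid")(merged)  # P(duplicate)

model = Model([q1_in, q2_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Reusing the same layer objects (rather than constructing two LSTMs) is what makes the network Siamese in Keras: both branches read and update one set of parameters.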
Ablation findings: hand-crafted lexical features alone gave strong baseline performance, confirming that surface-level overlap is a powerful signal. Embedding-based features added the most value on paraphrase cases where word overlap is low but semantic meaning is equivalent (“What is the capital of France?” vs. “Which city is France’s capital?”). The ensemble of both feature families outperformed either alone — they capture complementary signal and their errors are not fully correlated.
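The ensemble can be sketched as a blend of the two models' predicted probabilities; the exact combination rule (simple mean vs. tuned weights) is an assumption here:

```python
# Sketch: ensemble by blending duplicate probabilities from both models.
# The 50/50 weighting is an assumed default, not the original tuned value.
import numpy as np

def ensemble(p_xgb: np.ndarray, p_lstm: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Weighted average of duplicate probabilities from the two models."""
    return w * p_xgb + (1.0 - w) * p_lstm

blended = ensemble(np.array([0.9, 0.2]), np.array([0.7, 0.4]))
```

Averaging helps precisely because the two models' errors are not fully correlated: the lexical model's misses on low-overlap paraphrases can be offset by the embedding model, and vice versa.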
Stack
Python, TensorFlow, Keras, NLTK, Gensim (Word2Vec), Scikit-learn, XGBoost, Pandas