The work
I built a duplicate-question detector for the Quora Question Pairs dataset, treating it as the genuinely hard problem it is rather than another text-classification exercise. The pipeline runs two complementary feature families through gradient-boosted and deep models, then ensembles them, and I documented an ablation that shows what each family actually buys.
Hand-crafted semantic features - the strong baseline:
- Lexical overlap: unigram Jaccard similarity (intersection over union of word sets), bigram overlap, and character 3-gram and 4-gram similarity to absorb spelling variation and abbreviations.
- TF-IDF cosine similarity: each question is vectorized against the corpus vocabulary and the pair is scored by the cosine of its two vectors, which down-weights common words and emphasizes discriminative ones.
- Length features: absolute length difference and the ratio of shorter to longer question; extreme differences weakly signal the questions are after different things.
- Fuzzy matching: token sort ratio and token set ratio (edit-distance-based scores computed after sorting or de-duplicating the tokens), to survive word reordering between otherwise identical questions.
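The overlap, length, and reordering features above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the function names, the tokenizer regex, and the feature keys are assumptions, and `difflib.SequenceMatcher` stands in for the Levenshtein-based ratios that dedicated fuzzy-matching libraries compute. TF-IDF cosine is left out because it needs a corpus-level vocabulary.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase and split on non-alphanumerics (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Intersection over union of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def char_ngrams(text, n):
    """Character n-grams over the normalized, space-joined tokens."""
    s = " ".join(tokens(text))
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def pair_features(q1, q2):
    """Hand-crafted features for one question pair (hypothetical feature names)."""
    t1, t2 = tokens(q1), tokens(q2)
    len1, len2 = len(t1), len(t2)
    longer = max(len1, len2)
    # Sorting tokens before comparing makes the score insensitive to word order.
    sorted1, sorted2 = " ".join(sorted(t1)), " ".join(sorted(t2))
    return {
        "unigram_jaccard": jaccard(t1, t2),
        "bigram_jaccard": jaccard(zip(t1, t1[1:]), zip(t2, t2[1:])),
        "char3_jaccard": jaccard(char_ngrams(q1, 3), char_ngrams(q2, 3)),
        "len_diff": abs(len1 - len2),
        "len_ratio": min(len1, len2) / longer if longer else 0.0,
        # Stand-in for a token sort ratio; not identical to Levenshtein-based scores.
        "token_sort_ratio": SequenceMatcher(None, sorted1, sorted2).ratio(),
    }


if __name__ == "__main__":
    print(pair_features("How do I learn Python?", "What should I learn after Python?"))
```

Each feature is symmetric in the two questions, so the order of a pair does not change the feature vector fed to the gradient-boosted model.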
Embedding-based features - for the cases overlap misses:
- Word2Vec embeddings trained on the Quora corpus (100-dimensional, skip-gram).
- Sentence representation by simple average of word vectors and by TF-IDF-weighted average (so stop words are down-weighted during aggregation).
- Distance metrics between the sentence-embedding pairs - cosine, Euclidean, Manhattan - each reading a different aspect of the embedding geometry.
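As a rough sketch of the aggregation and distance steps above: the code below averages word vectors, optionally with per-token weights, and computes the three distances. The `vectors` and `weights` lookups are hypothetical plain dictionaries (in practice they would come from a trained Word2Vec model and a fitted TF-IDF vocabulary), and the 2-dimensional vectors are made up for illustration.

```python
import math


def weighted_average(tokens, vectors, weights=None):
    """Average the vectors of `tokens`; `weights` maps token -> weight (e.g. IDF).

    Tokens missing from `vectors` are skipped. With weights=None this is the
    simple mean of the word vectors.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # out-of-vocabulary token
        w = 1.0 if weights is None else weights.get(tok, 0.0)
        for i, x in enumerate(vectors[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing could be aggregated
    return [x / weight_sum for x in total]


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


if __name__ == "__main__":
    # Toy 2-d vectors and weights, invented for the example.
    vectors = {"learn": [1.0, 0.0], "python": [0.0, 1.0], "the": [0.5, 0.5]}
    weights = {"learn": 2.0, "python": 3.0, "the": 0.1}
    s1 = weighted_average(["learn", "python"], vectors, weights)
    s2 = weighted_average(["the", "python"], vectors, weights)
    print(cosine_distance(s1, s2), euclidean_distance(s1, s2), manhattan_distance(s1, s2))
```

Cosine distance ignores vector length and reads only direction, while Euclidean and Manhattan distances also respond to magnitude, which is one reason to feed all three to the downstream model.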
Models: XGBoost on the hand-crafted features as the baseline; a weight-shared siamese LSTM over the embedding sequences; and an ensemble of the two. The siamese branches share weights, so both questions pass through the same encoder and a question's representation does not depend on which slot it occupies. That pushes the model toward the symmetry similarity(Q1, Q2) = similarity(Q2, Q1), though it does not strictly guarantee it, because the final hidden states are concatenated into a dense sigmoid head and concatenation is order-sensitive. Dropout (0.3) follows the LSTM layers, and the loss is binary cross-entropy.
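A minimal Keras sketch of the siamese architecture described above, assuming inputs that are already sequences of 100-dimensional word vectors. The sequence length, the LSTM width, the single LSTM layer, the optimizer, and every layer name are assumptions made for illustration; only the shared weights, the concatenation, the 0.3 dropout, the sigmoid head, and the binary cross-entropy loss come from the description.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_siamese_lstm(seq_len=30, embed_dim=100, lstm_units=64, dropout=0.3):
    """Two inputs, one shared LSTM encoder, concatenated states, sigmoid output."""
    q1 = keras.Input(shape=(seq_len, embed_dim), name="q1")
    q2 = keras.Input(shape=(seq_len, embed_dim), name="q2")

    # One LSTM instance applied to both inputs: this is what "weight-shared" means.
    encoder = layers.LSTM(lstm_units, name="shared_lstm")
    drop = layers.Dropout(dropout, name="post_lstm_dropout")

    h1 = drop(encoder(q1))  # final hidden state for question 1
    h2 = drop(encoder(q2))  # final hidden state for question 2

    merged = layers.Concatenate(name="concat_states")([h1, h2])
    out = layers.Dense(1, activation="sigmoid", name="is_duplicate")(merged)

    model = keras.Model(inputs=[q1, q2], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model


if __name__ == "__main__":
    model = build_siamese_lstm(seq_len=5, lstm_units=8)
    # Random stand-in for two batches of embedded questions.
    x1 = np.random.rand(4, 5, 100).astype("float32")
    x2 = np.random.rand(4, 5, 100).astype("float32")
    print(model.predict([x1, x2], verbose=0).shape)
```

Reusing the same layer object for both inputs is what ties the weights; creating two separate `LSTM` layers would produce two independent encoders instead.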
The context
Most text-classification tasks reward simple solutions. This one punishes them. “How do I learn Python?” and “What’s the best way to start learning Python programming?” are duplicates with little word overlap; “How do I learn Python?” and “What should I learn after Python?” share their key content words and are not. Rule-based matching breaks immediately on paraphrase, so the question becomes which feature strategy recovers meaning from surface form.
The data is text-only - no user metadata or behavioral signals:
- Quora Question Pairs: ~400K question pairs, labeled duplicate / not-duplicate.
- Class imbalance: roughly 37% duplicates, so training is class-weighted to stop the model defaulting to “not duplicate” for everything - a common, quiet failure mode when imbalance is ignored.
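One way to derive the weights for a split like this is sketched below. The write-up does not say which weighting mechanism was used, so both helpers are assumptions: the first mirrors the common "balanced" heuristic, and the second computes the negative-to-positive ratio usually passed to XGBoost's `scale_pos_weight` parameter. The label list is synthetic.

```python
from collections import Counter


def class_weights(labels):
    """Balanced heuristic: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * k) for cls, k in counts.items()}


def scale_pos_weight(labels):
    """Ratio of negatives to positives, for a binary 0/1 label list."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("no positive examples in labels")
    return neg / pos


if __name__ == "__main__":
    # Synthetic labels with 37% duplicates, matching the rough split above.
    labels = [1] * 37 + [0] * 63
    print(class_weights(labels))      # per-class weights, e.g. for a Keras fit call
    print(scale_pos_weight(labels))   # single ratio, e.g. for a boosted-tree model
```

With 37% positives the duplicate class ends up weighted above 1 and the majority class below 1, so errors on duplicates cost more during training.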
The outcome
A working duplicate-detection pipeline with a full feature ablation that makes the trade-offs explicit:
- Hand-crafted lexical features alone gave a strong baseline - surface-level overlap carries a lot of signal, which routinely surprises people who reach for deep learning first.
- The embedding features and siamese LSTM added the most value precisely where overlap is low but meaning is equivalent - the paraphrase cases the baseline gets wrong.
- The ensemble beat either family alone, because their errors are not fully correlated: the hand-crafted and embedding views fail on different pairs, so combining them recovers signal neither captures on its own.
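The uncorrelated-errors argument can be made concrete with a toy blend. The sketch below assumes the ensemble is a weighted average of the two models' predicted probabilities scored with log loss; the write-up specifies neither, and the four predictions are invented for illustration, not results from the project.

```python
import math


def blend(p_a, p_b, weight_a=0.5):
    """Weighted average of two models' predicted probabilities."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(p_a, p_b)]


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


if __name__ == "__main__":
    # Toy labels and predictions: each model is badly wrong on a different pair.
    y = [1, 1, 0, 0]
    p_features = [0.9, 0.3, 0.1, 0.6]    # stand-in for the feature-based model
    p_embedding = [0.4, 0.9, 0.5, 0.1]   # stand-in for the embedding-based model
    p_blend = blend(p_features, p_embedding)
    for name, p in [("features", p_features), ("embedding", p_embedding), ("blend", p_blend)]:
        print(name, round(log_loss(y, p), 4))
```

Because each toy model is confident and wrong on a different pair, averaging pulls both mistakes back toward the truth, and the blend's loss lands below either input's. When two models make the same mistakes, averaging gives no such gain.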
The impact
This is a self-directed project on a public benchmark, so the honest claim is methodological, not commercial. What it produced was a transferable discipline: build the cheap, interpretable feature baseline first, then prove the deep model earns its place by measuring where it actually improves over that baseline. That semantic-similarity featurization and the feature-vs-deep-learning ablation habit carried directly into my later graph link-prediction and biomedical text work - the value is the method, not a deployment.
Stack
Python, TensorFlow, Keras, NLTK, Gensim (Word2Vec), Scikit-learn, XGBoost, Pandas.