Context
The MSKCC “Personalized Medicine: Redefining Cancer Treatment” challenge is a benchmark biomedical-NLP problem. Given a genetic mutation and the clinical literature associated with it, classify the mutation into one of nine categories describing its role in cancer causation. The hard part is that the knowledge lives in medical text - not structured data - and the class boundaries come from expert consensus, not clean rules.
Today that consensus is reached by hand: a pathologist reads clinical papers and decides whether a mutation is a driver (causally involved in tumor progression) or a passenger (a coincidental mutation with no causal role). At the scale of modern genomic sequencing - thousands of mutations per patient - manual review is the bottleneck. The bet of this project was that systematic feature engineering on clinical text could triage that review for the common cases without compromising the expert-review protocol.
The Work
Data. Two inputs per mutation: a structured record (gene, mutation type, annotated class label) and free text - excerpts from the medical papers tied to that mutation. The text is noisy: domain-specific vocabulary, abbreviations, and citation styles. The labels are imbalanced - some mutation classes are rare.
Feature engineering. The text was normalized (lowercase, HTML tag removal, medical-abbreviation expansion) and cleaned with NLTK stopword removal augmented by a custom domain list of clinical terms that carry no discriminative signal. Porter stemming reduced vocabulary size before term-document matrix construction. TF-IDF vectorization capped the vocabulary at 10,000 terms, selected by document-frequency thresholding and term-frequency ranking. Crucially, gene and mutation type were encoded as categorical indicator features alongside the text vectors - gene identity alone carries significant signal given established oncology literature.
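The feature steps described above can be sketched without any dependencies. This is a minimal illustration, not the project's code: the stopword list, the toy records, and every function name are assumptions, stemming and abbreviation expansion are left out, and the IDF formula is the smoothed variant that scikit-learn's TfidfVectorizer uses by default. In practice the library's vectorizer and a one-hot encoder would replace most of this.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; the project used NLTK's list plus a custom clinical one.
STOPWORDS = {"the", "a", "of", "in", "and", "is", "was", "patient", "study"}

def normalize(text):
    """Lowercase, strip HTML tags, tokenize on alphanumerics, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text.lower())
    return [tok for tok in re.findall(r"[a-z0-9]+", text) if tok not in STOPWORDS]

def build_vocabulary(docs, min_df=1, max_features=10_000):
    """Keep terms found in at least min_df documents, then the most frequent ones."""
    df = Counter(tok for doc in docs for tok in set(doc))
    tf = Counter(tok for doc in docs for tok in doc)
    kept = [t for t in tf if df[t] >= min_df]
    kept.sort(key=lambda t: (-tf[t], t))  # rank by corpus frequency, ties alphabetical
    return {t: i for i, t in enumerate(kept[:max_features])}, df

def tfidf_row(doc, vocab, df, n_docs):
    """L2-normalized TF-IDF vector with smoothed IDF ln((1 + n) / (1 + df)) + 1."""
    row = [0.0] * len(vocab)
    for tok, count in Counter(doc).items():
        if tok in vocab:
            row[vocab[tok]] = count * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
    norm = math.sqrt(sum(v * v for v in row)) or 1.0
    return [v / norm for v in row]

def one_hot(value, categories):
    """Indicator vector for a categorical field such as the gene symbol."""
    return [1.0 if value == c else 0.0 for c in categories]

# Toy records, invented for illustration (not taken from the competition data).
records = [
    {"gene": "BRCA1", "text": "<p>Truncating mutation causes loss of function</p>"},
    {"gene": "EGFR", "text": "Activating mutation in the kinase domain"},
    {"gene": "BRCA1", "text": "Loss of function in the DNA repair pathway"},
]
docs = [normalize(r["text"]) for r in records]
vocab, df = build_vocabulary(docs, min_df=1, max_features=10_000)
genes = sorted({r["gene"] for r in records})

# Text vector and gene indicator are concatenated into one feature row per record.
features = [
    tfidf_row(doc, vocab, df, len(docs)) + one_hot(r["gene"], genes)
    for doc, r in zip(docs, records)
]
```

Concatenating the indicator columns after the text columns keeps the gene signal visible to a linear model as its own coefficients rather than burying it in the vocabulary.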
Models. A documented comparison across three classifiers:
- Logistic Regression on TF-IDF features - a strong multi-class baseline, particularly on majority classes.
- Random Forest - less effective on high-dimensional sparse text; outperformed by the linear model.
- Multinomial Naive Bayes - fast and interpretable, competitive on majority classes.
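Of the three classifiers, Multinomial Naive Bayes is small enough to sketch from scratch, which also shows why it is described as interpretable: the model is nothing more than per-class token counts. The function names, the add-one smoothing value, and the toy documents and labels below are assumptions for illustration; the project itself would have relied on the scikit-learn estimators.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and add-alpha smoothed log likelihoods from token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {"vocab": set(vocab), "log_prior": {}, "log_lik": {}}
    for c, n_c in class_counts.items():
        model["log_prior"][c] = math.log(n_c / len(labels))
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        model["log_lik"][c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return model

def predict_proba(model, doc):
    """Return {class: posterior probability} for one tokenized document."""
    scores = {}
    for c, log_prior in model["log_prior"].items():
        # Tokens never seen in training are skipped rather than penalized.
        scores[c] = log_prior + sum(
            model["log_lik"][c][t] for t in doc if t in model["vocab"]
        )
    # Softmax over log scores, shifted by the max for numerical stability.
    top = max(scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}

# Toy tokenized documents with invented labels (illustration only).
train_docs = [
    ["loss", "function", "truncating"],
    ["loss", "function", "deletion"],
    ["gain", "function", "activating"],
    ["activating", "kinase", "gain"],
]
train_labels = [1, 1, 7, 7]

nb = fit_multinomial_nb(train_docs, train_labels)
probs = predict_proba(nb, ["truncating", "loss", "unseen_token"])
```

Because the output is a full probability distribution over classes, it can be scored directly with a probabilistic metric rather than only by accuracy.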
Protocol. Stratified 64% train / 16% cross-validation / 20% holdout test, with class distribution preserved across all splits and folds. The nine-class label set is heavily skewed - in the public training data, classes 7, 4, 1, and 2 account for most observations while 3, 8, and 9 are rare - so training used class-weighted loss to stop majority classes from swamping the learning signal, and evaluation used multi-class log loss (the Kaggle competition metric). Log loss penalizes confident wrong predictions heavily, which is the right pressure for a medical setting where false certainty is more dangerous than honest uncertainty.
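The three pieces of this protocol (stratified splitting, class weighting, and log loss) can be illustrated in a few lines of plain Python. This is a sketch under assumptions: the helper names and the toy labels are invented, the weighting formula is the common "balanced" heuristic n_samples / (n_classes * class_count) rather than a confirmed detail of the project, and scikit-learn provides equivalents for all three.

```python
import math
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.64, 0.16, 0.20), seed=0):
    """Split indices into train/cv/test so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    splits = ([], [], [])
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_cv = round(fractions[1] * len(idxs))
        splits[0].extend(idxs[:n_train])
        splits[1].extend(idxs[n_train:n_train + n_cv])
        splits[2].extend(idxs[n_train + n_cv:])  # remainder goes to the holdout
    return splits

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}

def multiclass_log_loss(y_true, probas, eps=1e-15):
    """Mean negative log probability assigned to the true class."""
    total = 0.0
    for y, p in zip(y_true, probas):
        total -= math.log(min(max(p[y], eps), 1.0))  # clip to avoid log(0)
    return total / len(y_true)

# Invented, skewed toy labels: class 7 common, class 8 rare.
labels = [7] * 75 + [1] * 50 + [8] * 25
train_idx, cv_idx, test_idx = stratified_split(labels)

# Weights are computed on the training split only, so the holdout stays untouched.
weights = balanced_class_weights([labels[i] for i in train_idx])

# The same wrong call on a true class-8 example, stated with different confidence.
hedged = multiclass_log_loss([8], [{7: 0.5, 1: 0.2, 8: 0.3}])
confident = multiclass_log_loss([8], [{7: 0.98, 1: 0.01, 8: 0.01}])
```

In this toy case the rare class receives the largest weight, and the confidently wrong prediction scores several times worse than the hedged one, which is the pressure the metric is meant to apply.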
Outcome
A reproducible, end-to-end pipeline with a documented comparison across models. Logistic Regression with TF-IDF performed competitively with more complex models - a consistent result in medical text classification, where linear models over sparse high-dimensional features are hard to beat. The pipeline showed that systematic feature engineering on clinical text can triage classification for the common mutation classes; how the submission ranked on the competition leaderboard is not recorded here.
Impact
This is a competition and research project (Kaggle MSKCC), not a clinically deployed system - nothing here was used to make patient-level decisions. Its value is methodological: a clean, leakage-free protocol and one finding worth carrying forward.
The key finding: gene identity carried disproportionate discriminative signal. Some genes are almost exclusively associated with one or two mutation classes per the established oncology literature. That reframes the task - mutation classification is not purely a text problem but a gene-mutation co-occurrence problem with known biological priors. A system that throws away gene identity discards information a clinical expert would never ignore. The lesson generalizes past this dataset: in domain-specific machine learning, the structured priors experts already trust often outweigh the raw text.
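One way to make the co-occurrence framing concrete is to estimate a per-gene class distribution from the structured records alone. The sketch below is an assumption about how such a prior could be computed, not the project's method: the function name, the smoothing constant, and the gene names and counts are all invented for illustration.

```python
from collections import Counter, defaultdict

def gene_class_priors(records, n_classes=9, alpha=1.0):
    """Estimate P(class | gene) with add-alpha smoothing from (gene, label) pairs."""
    counts = defaultdict(Counter)
    for gene, label in records:
        counts[gene][label] += 1
    priors = {}
    for gene, c in counts.items():
        total = sum(c.values()) + alpha * n_classes
        priors[gene] = {k: (c[k] + alpha) / total for k in range(1, n_classes + 1)}
    return priors

# Invented (gene, class) pairs; placeholders, not real competition counts.
records = (
    [("GENE_A", 1)] * 8 + [("GENE_A", 4)] * 2
    + [("GENE_B", 7)] * 9 + [("GENE_B", 2)]
)
priors = gene_class_priors(records)
top_class = {gene: max(p, key=p.get) for gene, p in priors.items()}
```

To stay consistent with a leakage-free protocol, a table like this would be built from the training split only and then looked up for validation and test records.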
Stack
Python, scikit-learn, NLTK, pandas, NumPy, Matplotlib