Overview
The MSKCC cancer treatment challenge is a benchmark problem in biomedical ML: given a genetic mutation and its associated clinical literature, classify the mutation into one of nine classes describing its role in cancer causation. The difficulty is that the knowledge is encoded in free-form medical text rather than structured data, and the class boundaries are defined by expert consensus rather than clean rules.
The Problem
Pathologists manually review clinical papers to classify whether a genetic mutation is a driver (causally involved in tumor progression) or a passenger (a random mutation with no causal role). At the scale of modern genomic sequencing — thousands of mutations per patient — this manual review is a bottleneck. An ML system that could automate or triage this process would directly accelerate cancer treatment research.
Data & Inputs
- Mutation dataset: gene, mutation type, annotated class label
- Clinical literature: text excerpts from medical papers associated with each mutation
- Key challenges: class imbalance (some mutation classes are rare) and noisy text (clinical literature has domain-specific vocabulary, abbreviations, and citation styles)
Approach
Text preprocessing: tokenization, stopword removal, Porter stemming, and TF-IDF vectorization. Multiple feature representations were tested.
Models: Logistic Regression (strong baseline for text classification), Random Forest, and Naive Bayes. Evaluated on multi-class log loss, with class weighting and stratified splits to handle class imbalance.
Experimental protocol: 64% training, 16% validation, 20% holdout test (equivalent to an 80/20 split applied twice), with class distribution preserved across splits using stratified sampling.
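One way to obtain the 64/16/20 proportions is to apply a stratified 80/20 split twice. The sketch below uses scikit-learn's train_test_split on a toy label array; the data, the random seed, and the variable names are illustrative assumptions, not the project's actual code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 rows, 3 imbalanced classes (assumed).
X = np.arange(100).reshape(-1, 1)              # placeholder feature rows
y = np.array([1] * 50 + [2] * 30 + [7] * 20)   # placeholder class labels

# Step 1: hold out 20% as the final test set, stratified by class.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Step 2: take 20% of the remaining 80% as validation (0.8 * 0.2 = 16%).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.20, stratify=y_rest, random_state=42
)

def class_share(labels):
    """Return {class: fraction of rows} for a label array."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): round(float(c) / len(labels), 2) for v, c in zip(values, counts)}

print(len(X_train), len(X_val), len(X_test))
print(class_share(y_train), class_share(y_val), class_share(y_test))
```

Stratifying the second split on y_rest rather than y keeps the validation set's class mix aligned with the training set it was carved from.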
Results & Impact
- Working classification pipeline with documented comparison across models
- Logistic Regression with TF-IDF performed competitively with more complex models, a pattern often seen when linear models are applied to high-dimensional sparse text features
- Demonstrated that systematic feature engineering on clinical text can partially automate expert review for common mutation types
Technical Detail
Class structure: The MSKCC dataset has 9 mutation classes with significant imbalance. Classes 7, 4, 1, and 2 account for the large majority of observations, while classes 3, 8, and 9 are rare. Imbalance was handled in two ways: stratified cross-validation preserved the class distribution across all folds, and class-weighted loss functions during training kept the majority classes from dominating the learning signal.
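As an illustration of class weighting, the sketch below computes scikit-learn's "balanced" weights, defined as n_samples / (n_classes * class_count), on a made-up label array. The labels and counts are assumptions chosen for readability; the write-up does not state which weighting scheme was actually used.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: one dominant class, one mid-sized class, one rare class.
y = np.array([7] * 6 + [1] * 3 + [8] * 1)
classes = np.unique(y)  # sorted: [1, 7, 8]

# "balanced" weight for class c = n_samples / (n_classes * count(c)).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

for label, weight in class_weight.items():
    print(f"class {label}: weight {weight:.3f}")
```

The rare class receives the largest weight, so each of its errors contributes more to the training loss. Passing class_weight="balanced" directly to estimators that support it, such as LogisticRegression, applies the same rule without the explicit dictionary.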
Feature engineering pipeline:
- Text normalization: lowercase, HTML tag removal, medical abbreviation expansion, stopword removal using NLTK’s English corpus augmented with a custom domain list of clinical terms that carry no discriminative signal
- Stemming via Porter Stemmer to reduce vocabulary size, then term-document matrix construction
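- TF-IDF vectorization with vocabulary limited to 10,000 terms: terms outside the document frequency thresholds are dropped, and the remainder are ranked by corpus term frequency (a label-independent selection, so it favors frequent terms rather than class-discriminative ones)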
- Gene and mutation type encoded as categorical indicator features alongside the text vectors — the gene identity alone carries significant classification signal given established oncology literature
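The feature steps above could be wired together roughly as follows. This is a minimal sketch under several assumptions: the column names Gene, Variation, and Text, the tiny abbreviation map, and the stopword set are hypothetical stand-ins (the real lists are not given here), and min_df=1 is used only because the toy frame has three rows.

```python
import re

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-ins for the abbreviation map and the combined stopword list.
ABBREVIATIONS = {"nsclc": "non small cell lung cancer", "wt": "wild type"}
STOPWORDS = {"the", "a", "of", "in", "and", "was", "were", "et", "al", "fig"}

TAG_RE = re.compile(r"<[^>]+>")      # crude HTML tag matcher
TOKEN_RE = re.compile(r"[a-z0-9]+")  # alphanumeric tokens after lowercasing

def normalize(text, stem=None):
    """Lowercase, strip HTML tags, expand abbreviations, drop stopwords.

    `stem` is an optional callable applied to each surviving token.
    """
    text = TAG_RE.sub(" ", text.lower())
    tokens = []
    for tok in TOKEN_RE.findall(text):
        for part in ABBREVIATIONS.get(tok, tok).split():
            if part not in STOPWORDS:
                tokens.append(stem(part) if stem else part)
    return " ".join(tokens)

# Text column -> TF-IDF (vocabulary capped at 10,000 terms);
# gene and variation columns -> one-hot indicator features.
features = ColumnTransformer([
    ("text", TfidfVectorizer(preprocessor=normalize, max_features=10_000, min_df=1), "Text"),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gene", "Variation"]),
])

# Toy rows with placeholder gene and variation names.
df = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_B", "GENE_A"],
    "Variation": ["Truncating Mutations", "V1", "Deletion"],
    "Text": [
        "<p>The truncation was WT-like</p>",
        "Activating variant in NSCLC",
        "Deletion in the tumor",
    ],
})
X = features.fit_transform(df)
print(X.shape)
```

To include the Porter stemming step, one option is to pass functools.partial(normalize, stem=PorterStemmer().stem) as the preprocessor, using NLTK's PorterStemmer. Setting handle_unknown="ignore" lets the encoder produce an all-zero indicator block for genes that appear only at prediction time.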
Model comparison:
- Logistic Regression (TF-IDF features): strong multi-class classifier, particularly on majority classes
- Random Forest: less effective on high-dimensional sparse text; outperformed by the linear model
- Naive Bayes (Multinomial): fast and interpretable, competitive on majority classes
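A comparison loop for the three model families might look like the sketch below. The corpus is synthetic and trivially separable, the hyperparameters are assumptions, and the resulting scores say nothing about how the models ranked on the real data; the point is only the shape of the evaluation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic corpus: each class has its own keywords plus shared filler words.
rng = np.random.default_rng(0)
class_words = {
    1: ["loss", "truncating", "deletion"],
    2: ["gain", "amplification", "fusion"],
    7: ["activating", "kinase", "phosphorylation"],
}
shared_words = ["tumor", "cells", "mutation", "patient"]

docs, labels = [], []
for label, n_docs in [(1, 30), (2, 20), (7, 50)]:
    for _ in range(n_docs):
        words = list(rng.choice(class_words[label], size=4))
        words += list(rng.choice(shared_words, size=4))
        docs.append(" ".join(words))
        labels.append(label)

train_docs, val_docs, y_train, y_val = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=0
)

# Fit the vectorizer on training text only to avoid leaking validation vocabulary.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_val = vectorizer.transform(val_docs)

models = {
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "random_forest": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0
    ),
    "naive_bayes": MultinomialNB(),  # no class_weight option; uses class priors
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)
    scores[name] = log_loss(y_val, proba, labels=model.classes_)
    print(f"{name}: validation log loss = {scores[name]:.4f}")
```

Passing labels=model.classes_ to log_loss keeps the probability columns aligned with the class order even if a rare class happens to be absent from the validation split.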
Evaluation metric: multi-class log loss (the Kaggle competition metric). Log loss scores the probability assigned to the true class, so a confident wrong prediction is penalized far more heavily than a hesitant one. That makes it an appropriate metric for medical classification, where false certainty is more dangerous than a correct prediction made with low confidence.
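For N examples, multi-class log loss is the negative mean of log p(i, y_i), where p(i, y_i) is the probability the model assigned to the true class of example i. The sketch below is a small NumPy implementation written for illustration (the function name and the clipping constant are assumptions), scoring one example under three different predictions.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log probability assigned to the true class.

    y_true: integer class indices, shape (n,)
    proba:  predicted probabilities, shape (n, n_classes)
    """
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)  # renormalize after clipping
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(proba[rows, np.asarray(y_true)])))

# One example whose true class is index 0, scored under three predictions.
confident_wrong = multiclass_log_loss([0], [[0.01, 0.98, 0.01]])  # -ln(0.01), about 4.61
hedged_right = multiclass_log_loss([0], [[0.40, 0.30, 0.30]])     # -ln(0.40), about 0.92
confident_right = multiclass_log_loss([0], [[0.98, 0.01, 0.01]])  # -ln(0.98), about 0.02

print(confident_wrong, hedged_right, confident_right)
```

The confident wrong prediction costs roughly five times as much as the hesitant correct one, which is the asymmetry the metric is chosen for.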
Key finding: the gene identity feature carried disproportionate discriminative signal. Some genes are almost exclusively associated with one or two mutation classes based on established oncology literature. This reflects the domain structure: mutation classification is not purely a text problem — it is a gene-mutation co-occurrence problem with known biological priors. A system that ignores gene identity loses information that clinical experts would never ignore.
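One simple way to inspect this kind of signal is a per-gene class distribution. The sketch below uses pandas.crosstab on made-up rows; the gene names, class labels, and column names are placeholders, not values from the dataset.

```python
import pandas as pd

# Made-up rows with placeholder gene names and class labels.
df = pd.DataFrame({
    "Gene": ["GENE_A"] * 4 + ["GENE_B"] * 3 + ["GENE_C"] * 2,
    "Class": [1, 1, 1, 4, 7, 7, 2, 1, 4],
})

# Each row is a gene; each column holds the share of that gene's rows in a class.
gene_prior = pd.crosstab(df["Gene"], df["Class"], normalize="index")

# Most common class per gene and how concentrated the gene is on it.
summary = pd.DataFrame({
    "top_class": gene_prior.idxmax(axis=1),
    "top_share": gene_prior.max(axis=1),
})
print(gene_prior.round(2))
print(summary)
```

Genes with a top_share close to 1.0 are the ones where gene identity alone nearly determines the class. On real data such a table should be computed on the training split only, so the prior does not leak information from validation or test rows.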
Stack
Python, Scikit-learn, NLTK, Pandas, NumPy, Matplotlib