The hypothesis I set out to test
An estimated 10 to 15% of medical diagnoses involve some form of error. Radiology and pathology are among the highest-risk specialties. Second opinions from specialists improve accuracy - but they are time-consuming, expensive, and geographically constrained to where specialists practice.
The hypothesis: machine learning systems trained on large annotated datasets could serve as rapid first-pass screening tools. Not replacing the clinician. Flagging cases that warrant specialist review. Prioritizing the human expert’s attention on the highest-uncertainty cases.
Good hypothesis. The data reality was more complicated - and the work below is the record of testing it before writing a line of modeling code.
The context: India as the target environment
The project was scoped specifically toward Indian healthcare. India has a distinct cancer burden: lung and head-and-neck cancers are among the most prevalent (driven heavily by tobacco use), and cervical and stomach cancers are also common. The dominant imaging modalities available in Indian hospitals - particularly in tier-2 and tier-3 cities - are X-ray and CT rather than MRI, which is expensive and less accessible.
This shapes both the dataset selection criteria and the feasibility constraints. A tool built on MRI data from North American academic medical centers is not a tool that works in Rajasthan.
The work: EDA on The Cancer Imaging Archive (TCIA)
TCIA is a public repository of de-identified medical imaging datasets across multiple cancer types, contributed by hospitals and research institutions worldwide. It hosts CT, MRI, and PET imaging, along with associated clinical annotations and pathology reports.
I went into TCIA to answer one question before committing to any architecture: does this data actually support a screening tool for the Indian context, or does it just support a good benchmark number? Pulling collections through the TCIA API and profiling them surfaced four findings, the third of which turned out to be the critical one.
What the EDA found
Dataset coverage: TCIA includes dozens of collections covering lung, breast, prostate, brain, and other cancer types. Coverage varies substantially. Some collections have thousands of annotated cases, others a few dozen. Sample size constraints are binding for deep learning approaches.
Imaging modality distribution: different collections use different modalities. Lung cancer datasets skew toward CT. Brain tumor datasets include both MRI and PET. Cross-modality generalization is limited. A model trained on CT scans does not transfer directly to MRI.
Annotation quality: this was the critical finding. Annotations in TCIA datasets are contributed by different radiologists and institutions, following different annotation protocols. Lesion boundary definitions, severity grading conventions, and what counts as a “positive” case vary across datasets. Models trained on one dataset’s annotation convention may not generalize to another’s.
Patient demographics: demographic distributions across collections reflect the populations of contributing institutions. A model trained primarily on data from academic medical centers in North America may not generalize to community hospitals with different patient demographics and equipment.
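The coverage and modality checks above come down to simple counting over series-level metadata. The following is a minimal sketch of that kind of profiling, not the code used in the project. The record layout, the collection names, the `min_patients` threshold, and the set of target modalities are all assumptions made for illustration, and the records are hand-written toy data, not real TCIA counts. The modality strings follow the standard DICOM codes (CT, MR, PT, CR, DX).

```python
from collections import Counter, defaultdict

# Toy series-level records (hypothetical; NOT real TCIA data).
SERIES = [
    {"collection": "lung-demo", "patient_id": "p1", "modality": "CT"},
    {"collection": "lung-demo", "patient_id": "p1", "modality": "CT"},
    {"collection": "lung-demo", "patient_id": "p2", "modality": "CT"},
    {"collection": "lung-demo", "patient_id": "p3", "modality": "CT"},
    {"collection": "brain-demo", "patient_id": "p4", "modality": "MR"},
    {"collection": "brain-demo", "patient_id": "p4", "modality": "PT"},
    {"collection": "brain-demo", "patient_id": "p5", "modality": "MR"},
]

# Assumption: CT plus plain radiography (CR, DX) stand in for the
# "X-ray and CT" modalities of the target environment.
TARGET_MODALITIES = frozenset({"CT", "CR", "DX"})


def profile_collections(series, target_modalities=TARGET_MODALITIES, min_patients=3):
    """Summarize patient count and modality mix for each collection."""
    patients = defaultdict(set)
    modalities = defaultdict(Counter)
    for rec in series:
        patients[rec["collection"]].add(rec["patient_id"])
        modalities[rec["collection"]][rec["modality"]] += 1

    report = {}
    for name in sorted(patients):
        counts = modalities[name]
        total = sum(counts.values())
        on_target = sum(n for m, n in counts.items() if m in target_modalities)
        report[name] = {
            "n_patients": len(patients[name]),
            "modalities": dict(counts),
            # Share of series acquired with a modality the target setting has.
            "target_share": on_target / total,
            # Illustrative threshold only; real cutoffs depend on the task.
            "enough_patients": len(patients[name]) >= min_patients,
        }
    return report


report = profile_collections(SERIES)
for name, row in report.items():
    print(name, row)
```

Counting distinct patients rather than series matters here: one patient can contribute many series, so series counts alone overstate how much independent data a collection holds.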
The gap the benchmark would have hidden
Benchmark performance on a held-out test set from the same dataset understates the challenges of clinical deployment. Significantly. The EDA findings above are exactly the inputs to that gap.
Distribution shift: medical equipment (CT scanner models, MRI field strengths), acquisition protocols, and patient populations differ between institutions. Models trained on one hospital’s data frequently perform substantially worse on another hospital’s data. Sometimes catastrophically worse.
Annotation inconsistency: if the ground-truth labels are defined inconsistently across annotators, measured model performance is effectively capped by inter-rater agreement, because a model cannot be scored as more correct than the labels it is scored against. For some clinical tasks, radiologist agreement rates are lower than most people assume.
Regulatory path: FDA 510(k) clearance for AI-based diagnostic tools is a multi-year undertaking requiring clinical validation studies, not just benchmark performance. The gap between a research model with high AUC-ROC and a clinically deployable tool is measured in years and millions of dollars.
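The annotation-inconsistency point can be made concrete with a chance-corrected agreement statistic. Below is a minimal sketch of Cohen's kappa for two raters labeling the same cases, written in plain Python. The two label lists are invented for illustration and do not come from any TCIA collection; the function name is my own.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Fraction of cases where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected if each rater labeled independently
    # at their own observed label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented labels (1 = lesion flagged, 0 = not flagged) for ten cases.
reader_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
reader_2 = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]

kappa = cohens_kappa(reader_1, reader_2)
print(f"raw agreement: 0.70, kappa: {kappa:.2f}")
```

In this toy case the readers agree on 7 of 10 cases, but half of that agreement is expected by chance, so kappa comes out at about 0.4. Raw agreement percentages can make labels look more reliable than they are.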
The outcome: a documented decision to stop
The project was put on hold after the initial EDA. The data quality issues encountered were more fundamental than initially anticipated, and the gap between research prototype and clinical deployment was assessed as too large for a personal project without clinical partners.
I treat that as the deliverable, not a failure. The EDA produced a concrete, evidence-backed answer to the feasibility question - no, not on this data, not without partners - which is more useful than a model with a high AUC that would have masked the real obstacles. No clinical use, no patients, no commercial deployment ever followed; this stayed a research investigation throughout.
The lasting value is the transferable judgment. Three lessons apply broadly to healthcare AI:
- Benchmark performance is necessary but not sufficient. A model that achieves an AUC of 0.95 on a held-out split from the same dataset may perform substantially worse on a new institution’s data. The benchmark tells you the model learned something. It does not tell you what that something generalizes to.
- Data quality constraints bind before modeling constraints. Annotation inconsistency limits what any model can learn, regardless of architecture choices. Fixing the data is more valuable than improving the model.
- The deployment environment is part of the problem definition. A screening tool that reduces specialist workload in one hospital system may be unusable in another due to workflow integration constraints. Define the deployment context before you build anything.
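One practical way to act on the first lesson is to report AUC per contributing site instead of one pooled number. The sketch below assumes each prediction is tagged with a site name; the site names, labels, and scores are made up to show the mechanics and are not results from this project. AUC is computed directly as the probability that a positive case outscores a negative one, with ties counted as half.

```python
from collections import defaultdict


def auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_by_site(records):
    """Group (site, label, score) records by site and compute AUC for each."""
    grouped = defaultdict(lambda: ([], []))
    for site, label, score in records:
        grouped[site][0].append(label)
        grouped[site][1].append(score)
    return {site: auc(labels, scores) for site, (labels, scores) in grouped.items()}


# Made-up predictions: (site, true label, model score).
RECORDS = [
    ("internal", 1, 0.9), ("internal", 1, 0.8),
    ("internal", 0, 0.2), ("internal", 0, 0.1),
    ("external", 1, 0.6), ("external", 1, 0.3),
    ("external", 0, 0.5), ("external", 0, 0.2),
]

per_site = auc_by_site(RECORDS)
for site, value in per_site.items():
    print(f"{site}: AUC = {value:.2f}")
```

A pooled AUC over all eight toy records would hide the drop at the external site, which is the pattern the lesson warns about. The pairwise loop is quadratic in the number of cases, which is fine for a sanity check but not for large test sets.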
The biomedical machine learning work that was completed - text-based cancer mutation classification on the MSKCC dataset - was more tractable because the annotation protocol was consistent and the task was well-defined. See NLP-Driven Cancer Mutation Classification for that project.