Biomedical Second Opinion: ML Feasibility Study

Motivation

Medical diagnostic errors are a significant cause of patient harm — estimates suggest that 10–15% of diagnoses involve some form of error, and radiology and pathology are among the highest-risk specialties for diagnostic mistakes. Second opinions from specialists improve accuracy, but at a cost: they’re time-consuming, expensive, and geographically constrained to where specialists practice.

The hypothesis driving this project: machine learning systems trained on large annotated datasets could serve as rapid first-pass screening tools — not replacing the clinician, but flagging cases that warrant specialist review and prioritizing the human expert’s attention on the highest-uncertainty cases.
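The prioritization idea can be sketched in a few lines. This is only an illustration, assuming a hypothetical binary classifier that outputs a malignancy probability per case; the case IDs, scores, and function names below are invented, and binary entropy is one of several possible uncertainty measures.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli prediction; it peaks at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def triage_order(predictions: dict[str, float]) -> list[str]:
    """Return case IDs sorted from most to least uncertain."""
    return sorted(
        predictions,
        key=lambda case_id: binary_entropy(predictions[case_id]),
        reverse=True,
    )


# Hypothetical model outputs: case ID -> predicted probability of malignancy.
scores = {"case_a": 0.97, "case_b": 0.52, "case_c": 0.10, "case_d": 0.35}
review_queue = triage_order(scores)
print(review_queue)
```

Ranking purely by uncertainty pushes confident positives to the back of the queue, so a real triage policy would presumably also escalate high-probability cases; that logic is left out here.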

India as the Target Context

The project was scoped specifically to Indian healthcare. India has a distinct cancer burden: lung and head/neck cancers are among the most prevalent (driven heavily by tobacco use), and cervical and stomach cancers are also common. The dominant imaging modalities available in Indian hospitals, particularly in tier-2 and tier-3 cities, are X-ray and CT rather than MRI, which is expensive and less accessible. This shapes both the dataset selection criteria and the feasibility constraints.

The landscape for AI in Indian healthcare is nascent but growing: companies such as OncoStem, Wadhwani AI, and MedyMatch have built targeted tools, but none has achieved broad clinical deployment. A useful second-opinion tool would need to work with X-ray and CT data specifically, tolerate lower-quality imaging from under-resourced facilities, and integrate with Indian radiologist workflows.

Data: The Cancer Imaging Archive (TCIA)

TCIA is a public repository of de-identified medical imaging datasets across multiple cancer types, contributed by hospitals and research institutions worldwide. It covers modalities including CT, MRI, and PET, along with associated clinical annotations and pathology reports.

EDA Findings

Dataset coverage: TCIA includes dozens of collections covering lung, breast, prostate, brain, and other cancer types. Coverage varies substantially: some collections have thousands of annotated cases, others a few dozen. For the smaller collections, sample size is the binding constraint on deep learning approaches.

Imaging modality distribution: different collections use different modalities. Lung cancer datasets skew toward CT. Brain tumor datasets include both MRI and PET. Cross-modality generalization is limited — a model trained on CT scans does not transfer directly to MRI.
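A modality tally of this kind can be sketched as follows. The manifest rows and collection names are hypothetical stand-ins, not real TCIA data, and the target set (CT plus the X-ray codes CR and DX) is an assumption based on the X-ray/CT constraint described earlier. The two-letter codes follow DICOM modality conventions (PT is PET, MR is MRI).

```python
from collections import Counter, defaultdict

# Assumption: only CT and plain X-ray (CR, DX) are usable for the target setting.
TARGET_MODALITIES = frozenset({"CT", "CR", "DX"})

# Hypothetical manifest: one (collection, modality) row per imaging series.
manifest = [
    ("example_lung", "CT"),
    ("example_lung", "CT"),
    ("example_lung", "CT"),
    ("example_lung", "PT"),
    ("example_brain", "MR"),
    ("example_brain", "MR"),
    ("example_brain", "PT"),
]


def modality_counts(rows):
    """Count series per modality within each collection."""
    counts = defaultdict(Counter)
    for collection, modality in rows:
        counts[collection][modality] += 1
    return counts


def target_fraction(counter, targets=TARGET_MODALITIES):
    """Share of a collection's series that fall in the target modalities."""
    total = sum(counter.values())
    return sum(n for m, n in counter.items() if m in targets) / total


counts = modality_counts(manifest)
fractions = {name: target_fraction(c) for name, c in counts.items()}
for name, frac in fractions.items():
    print(f"{name}: {dict(counts[name])}, target share = {frac:.2f}")
```

Collections with a low target share would be set aside early, since cross-modality transfer cannot be assumed.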

Annotation quality: this was the critical finding. Annotations in TCIA datasets are contributed by different radiologists and institutions, following different annotation protocols. Lesion boundary definitions, severity grading conventions, and what counts as a “positive” case vary across datasets. Models trained on one dataset’s annotation convention may not generalize to another’s.

Patient demographics: demographic distributions across collections reflect the populations of contributing institutions. A model trained primarily on data from academic medical centers in North America may not generalize to community hospitals with different patient demographics and equipment.

The Clinical Deployment Gap

Benchmark performance on a held-out test set from the same dataset understates the challenges of clinical deployment:

Distribution shift: medical equipment (CT scanner models, MRI field strengths), acquisition protocols, and patient populations differ between institutions. Models trained on one hospital’s data frequently perform significantly worse on another hospital’s data — sometimes catastrophically worse.
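One way to make this visible is to report the metric per institution instead of pooling all test cases. The sketch below is illustrative only: the site names, labels, and scores are made up, and AUC is computed directly from its pairwise definition (the probability that a random positive outscores a random negative), which is quadratic in the number of cases and suited only to small examples.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC-ROC from its pairwise (Mann-Whitney) definition; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_site(records):
    """records: iterable of (site, label, score). Returns {site: AUC}."""
    grouped = defaultdict(lambda: ([], []))
    for site, label, score in records:
        grouped[site][0].append(label)
        grouped[site][1].append(score)
    return {site: auc(ys, ss) for site, (ys, ss) in grouped.items()}


# Made-up predictions: the model separates classes cleanly at the site it was
# trained on, and less cleanly at an unseen site.
records = [
    ("site_train", 1, 0.9), ("site_train", 1, 0.8),
    ("site_train", 0, 0.3), ("site_train", 0, 0.2),
    ("site_new", 1, 0.6), ("site_new", 1, 0.4),
    ("site_new", 0, 0.5), ("site_new", 0, 0.3),
]
per_site = auc_by_site(records)
print(per_site)
```

A pooled AUC over both sites would hide the drop at `site_new`, which is why a held-out institution is a more informative test than a held-out split.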

Annotation inconsistency: if ground-truth labels are defined inconsistently across annotators, measured model performance is effectively capped by inter-rater agreement, because the model is scored against labels that the raters themselves dispute. For some clinical tasks, radiologist agreement rates are lower than often assumed.
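Rater agreement is commonly summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal two-rater implementation; the reader labels are invented for illustration and do not come from any TCIA collection.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Fraction of cases where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical reads of the same ten studies by two radiologists.
reader_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg", "pos", "neg"]
reader_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
kappa = cohens_kappa(reader_1, reader_2)
print(f"kappa = {kappa:.3f}")
```

In this toy example the readers agree on 8 of 10 cases, yet kappa is only about 0.58 once chance agreement is removed, which illustrates how raw agreement can overstate label reliability.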

Regulatory path: FDA 510(k) clearance for AI-based diagnostic tools is a multi-year undertaking requiring clinical validation studies, not just benchmark performance. The gap between a research model with high AUC-ROC and a clinically deployable tool is measured in years and millions of dollars.

Project Status and Lessons

The project was put on hold after the initial EDA. The data quality issues encountered were more fundamental than initially anticipated, and the gap between research prototype and clinical deployment was assessed as too large for a personal project without clinical partners.

The lessons are broadly applicable to healthcare AI:

Benchmark performance is necessary but not sufficient. A model that achieves an AUC of 0.95 on a held-out split from the same dataset may perform substantially worse on a new institution’s data.
Data quality constraints bind before modeling constraints. Annotation inconsistency limits what any model can learn, regardless of architecture choices.
The deployment environment is part of the problem definition. A screening tool that reduces specialist workload in one hospital system may be unusable in another due to workflow integration constraints.

The biomedical ML work that was completed — text-based cancer mutation classification on the MSKCC dataset — was more tractable because the annotation protocol was consistent and the task was well-defined. See NLP-Driven Cancer Mutation Classification for that project.