ml Personal Research Project · Personal Project

Why TCIA Cancer Imaging Won't Carry a Clinical Screening Tool: A Feasibility Study

Feasibility investigation for a machine learning-powered medical second-opinion tool built on cancer imaging data from The Cancer Imaging Archive (TCIA), scoped for Indian healthcare. The project stopped at the EDA stage once annotation inconsistency and cross-institution distribution shift, not modeling, proved to be the binding constraints.

Problem

Medical diagnostic errors cause significant patient harm. Second opinions from specialists improve accuracy but are expensive and slow. Can machine learning provide rapid first-pass screening to prioritize specialist attention - specifically in Indian hospitals that rely on X-ray and CT rather than MRI?

Outcome

Early-stage EDA on TCIA showed that the binding constraints were in the data, not the model: annotation inconsistency across contributing institutions, and distribution shift between scanners and patient populations. That led to a documented, evidence-based decision to halt the project rather than build a model whose benchmark AUC would overstate clinical viability.

Impact - who used it & what changed

A self-directed feasibility study whose value is the kill decision itself. It converted a plausible-sounding 'machine learning second opinion' idea into a concrete account of why TCIA data and the absence of clinical partners made deployment infeasible. This is a research finding only: no clinical use, patients, or commercial deployment was involved.

The hypothesis I set out to test

Commonly cited estimates put the rate of diagnostic error at roughly 10 to 15% of cases. Radiology and pathology, where a diagnosis rests on interpreting images, are often named among the higher-risk specialties. Second opinions from specialists improve accuracy - but they are time-consuming, expensive, and geographically constrained to where specialists practice.

The hypothesis: machine learning systems trained on large annotated datasets could serve as rapid first-pass screening tools. Not replacing the clinician. Flagging cases that warrant specialist review. Prioritizing the human expert’s attention on the highest-uncertainty cases.

Good hypothesis. The data reality was more complicated - and the work below is the record of testing it before writing a line of modeling code.

The context: India as the target environment

The project was specifically scoped toward Indian healthcare. India has a distinct cancer burden: lung and head and neck cancers are among the most common (driven heavily by tobacco use), with cervical and stomach cancers also highly prevalent. The dominant imaging modalities available in Indian hospitals - particularly in tier-2 and tier-3 cities - are X-ray and CT rather than MRI, which is expensive and less accessible.

This shapes both the dataset selection criteria and the feasibility constraints. A tool built on MRI data from North American academic medical centers is not a tool that works in Rajasthan.

The work: EDA on The Cancer Imaging Archive (TCIA)

TCIA is a public repository of de-identified medical imaging datasets across multiple cancer types, contributed by hospitals and research institutions worldwide. Collections span CT, MRI, and PET, along with associated clinical annotations and pathology reports.

I went into TCIA to answer one question before committing to any architecture: does this data actually support a screening tool for the Indian context, or does it just support a good benchmark number? Pulling collections through the TCIA API and profiling them surfaced four findings, of which annotation quality hurt the most.
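The pull step can be sketched roughly as follows. This is a minimal sketch, not the code used in the project. It assumes the public TCIA (NBIA) REST API v1 base URL and its `getCollectionValues` and `getSeries` endpoints, which should be checked against the current TCIA documentation. The helper names `build_url`, `fetch_json`, and `list_series` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of the public TCIA (NBIA) REST API v1.
TCIA_BASE = "https://services.cancerimagingarchive.net/nbia-api/services/v1"


def build_url(endpoint, **params):
    """Build a query URL, dropping parameters whose value is None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{TCIA_BASE}/{endpoint}" + (f"?{query}" if query else "")


def fetch_json(url, timeout=30):
    """GET a URL and decode the JSON body (performs a network call)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def list_collections():
    """Return the collection names exposed by the API (assumed endpoint)."""
    return fetch_json(build_url("getCollectionValues"))


def list_series(collection, modality=None):
    """Return series-level metadata records for one collection (assumed endpoint)."""
    return fetch_json(build_url("getSeries", Collection=collection, Modality=modality))
```

Keeping URL construction separate from the network call means the query logic can be checked offline, and the series records returned by `list_series` can be cached once and profiled repeatedly.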

What the EDA found

Dataset coverage: TCIA includes dozens of collections covering lung, breast, prostate, brain, and other cancer types. Coverage varies substantially. Some collections have thousands of annotated cases, others a few dozen. For the smaller collections, sample size alone is a binding constraint for deep learning approaches.
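A coverage check of this kind can be sketched as below. It assumes each series-level record is a dict carrying `Collection` and `PatientID` keys, which is an assumption about the metadata layout. The 500-patient floor is an arbitrary illustrative threshold, not a figure from the study.

```python
from collections import defaultdict


def patients_per_collection(series_records):
    """Count distinct PatientID values per Collection."""
    patients = defaultdict(set)
    for rec in series_records:
        patients[rec["Collection"]].add(rec["PatientID"])
    return {name: len(ids) for name, ids in patients.items()}


def flag_small(counts, min_patients=500):
    """Return, sorted by name, the collections below the patient-count floor."""
    return sorted(name for name, n in counts.items() if n < min_patients)
```

Counting distinct patients rather than series matters here: one patient can contribute many series, so series counts overstate how many independent cases a collection holds.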

Imaging modality distribution: different collections use different modalities. Lung cancer datasets skew toward CT. Brain tumor datasets include both MRI and PET. Cross-modality generalization is limited. A model trained on CT scans does not transfer directly to MRI.
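One way to quantify the modality mix against the X-ray and CT target is sketched below. It assumes each series record has a `Modality` key holding a DICOM modality code (`CT`, `MR`, `PT`, `CR` for computed radiography, `DX` for digital radiography). The function names and the target set are illustrative choices.

```python
from collections import Counter

# Modalities assumed relevant to the target setting: CT plus the two X-ray codes.
TARGET_MODALITIES = frozenset({"CT", "CR", "DX"})


def modality_shares(series_records):
    """Return the fraction of series per modality code."""
    counts = Counter(rec.get("Modality", "UNKNOWN") for rec in series_records)
    total = sum(counts.values())
    return {m: n / total for m, n in counts.items()} if total else {}


def target_share(series_records, targets=TARGET_MODALITIES):
    """Return the fraction of series whose modality is in the target set."""
    shares = modality_shares(series_records)
    return sum(share for m, share in shares.items() if m in targets)
```

A collection with a low `target_share` may still be a fine benchmark, but it says little about a deployment setting where MRI is rarely available.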

Annotation quality: this was the critical finding. Annotations in TCIA datasets are contributed by different radiologists and institutions, following different annotation protocols. Lesion boundary definitions, severity grading conventions, and what counts as a “positive” case vary across datasets. Models trained on one dataset’s annotation convention may not generalize to another’s.

Patient demographics: demographic distributions across collections reflect the populations of contributing institutions. A model trained primarily on data from academic medical centers in North America may not generalize to community hospitals with different patient demographics and equipment.

The gap the benchmark would have hidden

Benchmark performance on a held-out test set from the same dataset overstates how well a model will hold up in clinical deployment. Significantly. The EDA findings above are exactly the inputs to that gap.

Distribution shift: medical equipment (CT scanner models, MRI field strengths), acquisition protocols, and patient populations differ between institutions. Models trained on one hospital’s data frequently perform substantially worse on another hospital’s data. Sometimes catastrophically worse.
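A common way to expose this shift before deployment is to hold out whole institutions rather than random cases. The sketch below assumes each case is a dict with a `site` key identifying the contributing institution. The key name and the function are hypothetical, and no such split was run in this project.

```python
def leave_one_site_out(cases, site_key="site"):
    """Yield (held_out_site, train_cases, test_cases), one fold per site.

    Every case from the held-out site goes to the test fold, so the model
    is never evaluated on an institution it saw during training.
    """
    sites = sorted({case[site_key] for case in cases})
    for held_out in sites:
        train = [c for c in cases if c[site_key] != held_out]
        test = [c for c in cases if c[site_key] == held_out]
        yield held_out, train, test
```

A random split mixes every institution into both train and test, so scanner and protocol quirks are visible during training. Holding out a whole site is a closer proxy for deploying at a new hospital.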

Annotation inconsistency: if ground truth labels are defined inconsistently across annotators, measured model performance is effectively capped by inter-rater agreement. For some clinical tasks, radiologist agreement rates are lower than most people assume.
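Rater agreement is usually reported with a chance-corrected statistic such as Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. A minimal sketch for two raters labelling the same cases follows. The function name and the treatment of the degenerate case are illustrative choices.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same cases in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of cases where the raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, two raters who agree on three of four cases, with chance agreement of 0.5, get kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5, which is noticeably less flattering than the raw 75% agreement.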

Regulatory path: taking the US route as a reference point, FDA 510(k) clearance for AI-based diagnostic tools is typically a multi-year undertaking that requires clinical validation studies, not just benchmark performance. The gap between a research model with a high AUC and a clinically deployable tool is measured in years and millions of dollars.

The outcome: a documented decision to stop

The project was put on hold after the initial EDA. The data quality issues encountered were more fundamental than initially anticipated, and the gap between research prototype and clinical deployment was assessed as too large for a personal project without clinical partners.

I treat that as the deliverable, not a failure. The EDA produced a concrete, evidence-backed answer to the feasibility question - no, not on this data, not without partners - which is more useful than a model with a high AUC that would have masked the real obstacles. No clinical use, no patients, no commercial deployment ever followed; this stayed a research investigation throughout.

The impact is the transferable judgment. The lessons below are broadly applicable to healthcare AI.

  1. Benchmark performance is necessary but not sufficient. A model that achieves an AUC of 0.95 on a held-out split from the same dataset may perform substantially worse on a new institution’s data. The benchmark tells you the model learned something. It does not tell you what that something generalizes to.
  2. Data quality constraints bind before modeling constraints. Annotation inconsistency limits what any model can learn, regardless of architecture choices. Fixing the data is more valuable than improving the model.
  3. The deployment environment is part of the problem definition. A screening tool that reduces specialist workload in one hospital system may be unusable in another due to workflow integration constraints. Define the deployment context before you build anything.
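The first lesson suggests reporting AUC per institution rather than as one pooled number. A minimal sketch follows, using the rank-based definition of AUC: the probability that a randomly chosen positive case scores above a randomly chosen negative one, with ties counted as half. The `(site, label, score)` tuple layout and the function names are assumptions for illustration.

```python
def auc(labels, scores):
    """AUC as P(positive score > negative score), with ties counted as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def auc_by_site(cases):
    """cases: iterable of (site, label, score) tuples. Returns {site: AUC}."""
    by_site = {}
    for site, label, score in cases:
        labels, scores = by_site.setdefault(site, ([], []))
        labels.append(label)
        scores.append(score)
    return {site: auc(labels, scores) for site, (labels, scores) in by_site.items()}
```

The pairwise comparison is quadratic in the number of cases, which is fine for a sanity check. A large spread between sites in the `auc_by_site` output is the kind of signal a single pooled benchmark number hides.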

The biomedical machine learning work that was completed - text-based cancer mutation classification on the MSKCC dataset - was more tractable because the annotation protocol was consistent and the task was well-defined. See NLP-Driven Cancer Mutation Classification for that project.

Stack

Python, TCIA API, Pandas, Matplotlib, Medical Imaging (DICOM)
biomedical, medical-imaging, feasibility-study, healthcare-ai
