Machine Learning Models for Pancreatic Cancer Risk Prediction Using Electronic Health Records


Plain-English Explanations
Pages 1-2
Why Pancreatic Cancer Needs Better Risk Prediction

Pancreatic cancer (PC) ranks as the 11th most common cancer globally, with 458,918 new cases reported in 2018. Despite its relatively lower incidence, it is projected to become the second leading cause of cancer-related death in the United States by 2030. The key problem is late-stage diagnosis: only 15 to 20% of patients are eligible for surgical resection at the time of discovery. Earlier detection of localized disease is strongly correlated with improved survival, but effective screening remains elusive because of low overall incidence and the absence of accurate early-stage biomarkers.

Current screening limitations: Population-wide PC screening is not recommended. Current practice restricts screening to individuals carrying pathogenic or likely pathogenic germline mutations in PC susceptibility genes and those with multiple affected family members. However, fewer than 20% of PC patients have identifiable familial or genetic risk factors, which leaves the vast majority of at-risk individuals unscreened.

The EHR opportunity: Electronic Health Records contain a wealth of structured data (diagnoses, labs, medications) and unstructured data (clinical notes). Combined with advances in machine learning and deep learning, EHR data could be leveraged to identify high-risk individuals who would benefit from targeted screening. Explainable AI (X-AI) techniques add the possibility of uncovering novel, previously unrecognized risk factors directly from clinical data.

This systematic review from Mayo Clinic researchers evaluated ML and AI models applied to EHR data for PC risk prediction, covering studies published between January 2012 and February 2024. The review assessed model development strategies, evaluation methods, and overall effectiveness in predicting pancreatic cancer.

TL;DR: Pancreatic cancer has a dismal prognosis largely due to late detection. Only 15 to 20% of patients qualify for surgery, and fewer than 20% have known genetic risk factors. This review evaluates ML models that use EHR data to identify high-risk individuals for targeted screening.
Pages 2-3
Search Strategy, Inclusion Criteria, and Quality Assessment

The authors searched six major databases: Ovid MEDLINE, Ovid EMBASE, Ovid Cochrane Central Register of Controlled Trials, Ovid Cochrane Database of Systematic Reviews, Scopus, and Web of Science. The search covered articles published between January 1, 2012 and February 1, 2024, restricted to the English language. An experienced librarian designed the search strategy using controlled vocabulary supplemented with keywords targeting ML, natural language processing, and EHR-based pancreatic cancer prediction.

Screening process: Two independent reviewers screened articles by title and abstract, followed by full-text review. Reference lists and citation matching identified additional eligible studies. A third reviewer adjudicated disagreements. The initial search yielded 183 articles after deduplication, ultimately narrowing to 21 articles, with 9 more added from reference screening for a total of 30 studies.

Quality tools: The CHARMS (CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies) checklist guided data extraction, covering study type, data sources, participants, missing data handling, ML methods, calibration, and validation. PROBAST was used to evaluate risk of bias and applicability. PRISMA guided the overall systematic review process. Four articles were excluded due to unclear data sources, absence of multivariate model development, unclear predictor use, or significant overlap with another included study.

Performance metric conventions: The C-index (concordance index) served as the primary performance metric. Because studies varied widely in data exclusion windows, prediction time windows, datasets, and modeling techniques, the authors adopted specific conventions. For multiple exclusion windows, they reported results from the smallest exclusion interval. For multiple prediction windows, results from the shortest window were used. Multiple datasets were included as individual data points, but subset analyses (e.g., new-onset diabetes subgroups) were excluded in favor of full-cohort results.
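To make the review's primary metric concrete, here is a minimal sketch of the pairwise concordance computation with made-up risk scores and case labels; for a binary outcome like PC occurrence, the C-index is equivalent to the AUC:

```python
from itertools import product

def c_index(scores, labels):
    """Concordance index for a binary outcome: the fraction of
    (case, control) pairs in which the case receives the higher
    risk score; ties count as half-concordant."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for c, n in product(cases, controls):
        if c > n:
            concordant += 1.0
        elif c == n:
            concordant += 0.5
    return concordant / (len(cases) * len(controls))

# Toy data: two PC cases (label 1) and three controls (label 0).
scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1, 0, 1, 0, 0]
print(round(c_index(scores, labels), 2))  # prints: 0.83
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect discrimination, which is why the reported ranges (roughly 0.6 to 1.0) span "barely better than chance" to "perfect on that dataset."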

TL;DR: The review searched 6 databases and identified 30 studies (169,149 PC cases). Quality was assessed using CHARMS, PROBAST, and PRISMA. The C-index was the primary performance metric, with specific conventions for handling heterogeneous study designs.
Pages 4-5
Thirty Studies, 169,149 Cases, and Three Model Groups

The 30 included studies encompassed a total of 169,149 pancreatic cancer cases. Most studies defined PC as a composite outcome using a range of ICD codes, without differentiating between pancreatic ductal adenocarcinoma (PDAC, approximately 85% of cases), pancreatic neuroendocrine tumors (PNET, under 5%), or other specific subtypes. This composite approach is problematic because PDAC, PNET, and other types have distinct tumor biology, natural history, and risk factors.

Predictor selection: Twenty of the 30 studies used a curated set of known PC risk predictors based on published literature or clinical expertise. The remaining studies used non-curated approaches, allowing the models to select from a broader set of EHR variables. Interestingly, models using curated predictors showed mean discrimination performance (C-index = 0.81, range 0.61 to 1.0, n=18) nearly identical to those using non-curated predictors (C-index = 0.80, range 0.72 to 0.93, n=19); the n values exceed the 30 studies because, per the review's conventions, each dataset within a study contributed its own data point.

Model categorization: The authors grouped modeling techniques into three categories. Group A consisted of linear ML models (such as logistic regression). Group B included non-linear models excluding deep learning (such as XGBoost, random forests, and Cox regression). Group C comprised deep learning models only (such as gated recurrent units and transformers). A greater proportion of Group A (8 of 14) and Group B (9 of 14) models used curated predictors compared to Group C (1 of 9), suggesting deep learning approaches more often worked with raw or uncurated EHR features.

TL;DR: Across 30 studies and 169,149 PC cases, curated vs. non-curated predictor sets yielded similar mean C-index values (0.81 vs. 0.80). Models were grouped as linear (Group A), non-linear non-DL (Group B), and deep learning (Group C). Deep learning models more frequently used non-curated EHR features.
Pages 4-5
Modeling Techniques, Missing Data, and Prediction Windows

Modeling methods: Logistic regression was the most frequently used technique (n=18 studies). Beyond logistic regression, the studies employed a diverse range of methods including tree-based models (XGBoost, random forests), survival models (random survival forests, Cox regression, multistate models), artificial neural networks, and advanced deep learning architectures such as gated recurrent units (GRUs) and transformers.

Missing data handling: Only 16 of the 30 studies reported information about missing data. The most common strategies included excluding patients with missing values, excluding predictors with high missingness percentages, and imputation. Some studies replaced missing values with categorical labels such as "Not known" or "missing," or created binary indicator variables. One study treated missing laboratory results as equivalent to normal values, a questionable assumption. The remaining 14 studies provided no information about missing data at all, which represents a significant threat to model validity.
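The contrast between the common strategies can be sketched on a hypothetical lab column (CA 19-9 values are invented here purely for illustration): complete-case exclusion simply drops patients, while imputation plus a binary indicator keeps everyone and lets the model learn from the fact that a lab was never ordered:

```python
from statistics import median

# Hypothetical lab panel; None marks a value that was never measured.
patients = [
    {"id": 1, "ca19_9": 40.0},
    {"id": 2, "ca19_9": None},
    {"id": 3, "ca19_9": 15.0},
    {"id": 4, "ca19_9": None},
]

# Strategy 1: complete-case analysis drops patients with missing labs,
# which is unbiased only if missingness is completely random.
complete = [p for p in patients if p["ca19_9"] is not None]

# Strategy 2: impute the observed median and add a 0/1 missingness
# indicator, preserving the signal that the lab was never ordered.
fill = median(p["ca19_9"] for p in complete)
imputed = [
    {**p,
     "ca19_9": fill if p["ca19_9"] is None else p["ca19_9"],
     "ca19_9_missing": int(p["ca19_9"] is None)}
    for p in patients
]
print(len(complete), fill)  # prints: 2 27.5
```

Note how strategy 1 discards half of this toy cohort, while strategy 2 retains all four patients, illustrating why complete-case exclusion can both shrink the sample and bias it when missingness is informative.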

Prediction and exclusion windows: Studies predicted PC occurrence within windows of up to 8 years from the date of risk assessment. Six articles did not provide any information about prediction time windows or data exclusion intervals. Only 12 studies experimented with data exclusion intervals ranging from 1 month to 5 years. For models without curated predictors and a 1-year lead time, C-index ranged from 0.71 to 0.83 for internal validation and 0.60 to 0.78 for external validation.

Data exclusion intervals are critical because predictor data close to the time of PC diagnosis likely reflects symptoms of existing disease rather than true predictive risk factors. Models that include these late-stage signals may appear to perform well in retrospective evaluation but would fail as genuine early-detection tools. Research suggests that predictor data considered with a lead time of 24 to 36 months before diagnosis may be most appropriate for genuine risk prediction.
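An exclusion interval amounts to a simple filter on event dates relative to diagnosis. The sketch below uses a hypothetical event schema (code plus date) and approximates months as 30 days; with a 24-month exclusion window, a jaundice code recorded weeks before diagnosis is dropped while an older diabetes code survives:

```python
from datetime import date, timedelta

def exclude_recent(events, dx_date, exclusion_months=24):
    """Drop predictor events recorded inside the exclusion window
    before diagnosis, so the model cannot learn from symptoms of
    disease that is already present (hypothetical event schema;
    months approximated as 30 days)."""
    cutoff = dx_date - timedelta(days=30 * exclusion_months)
    return [e for e in events if e["date"] <= cutoff]

events = [
    {"code": "E11", "date": date(2017, 6, 1)},   # diabetes, years earlier
    {"code": "R17", "date": date(2020, 1, 15)},  # jaundice, weeks before dx
]
kept = exclude_recent(events, dx_date=date(2020, 2, 1))
print([e["code"] for e in kept])  # prints: ['E11']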

TL;DR: Logistic regression dominated (18/30 studies). Missing data was severely underreported (14 studies silent). With 1-year lead time and non-curated predictors, internal validation C-index ranged 0.71 to 0.83 and external validation ranged 0.60 to 0.78. Only 12 studies tested data exclusion intervals.
Pages 5-6
Internal vs. External Validation and Model Calibration

Twenty-four of the 30 studies performed either internal validation, external validation, or both. Internal validations typically used a holdout test set (commonly 20% of the dataset) or bootstrapping. External validations evaluated models on data from a different health system or geographic region, which is the gold standard for assessing generalizability.

Performance by model group: For internal validation, the average C-index was 0.77 for Group A (linear models), 0.83 for Group B (non-linear models), and 0.83 for Group C (deep learning). For external validation, the averages were 0.77 for Group A, 0.79 for Group B, and 0.88 for Group C. However, the Group C external validation result came from a single study, making it impossible to draw reliable conclusions about deep learning's generalizability advantage.

Data exclusion effects: When comparing models with and without data exclusion intervals, deep learning models (Group C) performed best with minimal or no data exclusion but showed a notable decline in performance when longer exclusion periods were applied. In contrast, non-linear models (Group B) performed comparably under maximum data exclusion conditions. Linear models (Group A) had the weakest discrimination with data exclusion. This pattern suggests deep learning models in these studies may have relied more heavily on data close to the cancer event, raising concerns about their utility as true early-detection tools.

Calibration: Only 10 of the 30 studies performed any calibration analysis. Methods included the Hosmer-Lemeshow chi-square goodness-of-fit test, Greenwood-Nam-D'Agostino (GND) calibration tests, Platt calibration, and calibration graphs. Calibration is essential for clinical deployment because a model must not only rank patients correctly (discrimination) but also produce accurate absolute risk probabilities to guide clinical decision-making.
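As an illustration of one of these methods, here is a pure-Python sketch of the Hosmer-Lemeshow statistic on toy predictions: patients are sorted by predicted risk, split into equal-size groups, and observed event counts are compared with expected counts; the statistic would then be referred to a chi-square distribution with n_groups − 2 degrees of freedom for a p-value:

```python
def hosmer_lemeshow(probs, outcomes, n_groups=5):
    """Accumulate (observed - expected)^2 / (expected * (1 - expected/n))
    over risk-sorted groups. Smaller values indicate better calibration."""
    pairs = sorted(zip(probs, outcomes))
    size = len(pairs) // n_groups
    stat = 0.0
    for g in range(n_groups):
        # Last group absorbs any remainder from uneven division.
        chunk = pairs[g * size:] if g == n_groups - 1 else pairs[g * size:(g + 1) * size]
        n = len(chunk)
        expected = sum(p for p, _ in chunk)
        observed = sum(y for _, y in chunk)
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n))
    return stat

# Toy predicted risks and observed outcomes for 10 patients.
probs = [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(round(hosmer_lemeshow(probs, outcomes), 2))  # prints: 2.81
```

The point of the exercise: a model can rank these ten patients perfectly (discrimination) and still be penalized here if its absolute probabilities drift from the observed event rates, which is exactly the property clinical deployment depends on.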

TL;DR: Internal validation C-index averaged 0.77 (linear), 0.83 (non-linear), and 0.83 (deep learning). External validation averaged 0.77, 0.79, and 0.88 respectively, though deep learning had only one external study. Only 10 studies performed calibration, and 6 studies performed no validation at all.
Page 6
Explainable AI Uncovers New Pancreatic Cancer Risk Signals

Six studies that did not rely on curated predictor sets used explainable AI (X-AI) techniques to identify novel risk factors from EHR data. These approaches are particularly important because approximately 80% of pancreatic cancer is considered sporadic, meaning it occurs without known familial or genetic risk factors.

Key findings from individual studies: Chen et al. used XGBoost feature importance (gain scores) and found that non-cancerous pancreatic disorders unrelated to diabetes were the most important model predictor. Placido et al. applied integrated gradients in neural networks and identified jaundice, abdominal pain, and weight loss as key features within 0 to 6 months before diagnosis. At longer intervals before diagnosis, diabetes mellitus, anemia, functional bowel disease, and other pancreatic and bile duct diseases emerged as contributors.

Additional discoveries: Salvatore et al. grouped ICD codes into clinically relevant "phecodes" and found that digestive and neoplasm phecodes were strong predictors. Park et al. used SHAP (SHapley Additive exPlanations) values and identified kidney function, liver function, diabetes, red blood cell counts, and white blood cell counts as top laboratory contributors. Jia et al. ranked features by univariate AUC and identified age, number of recent records, serum creatinine, number of early records, and uncomplicated diabetes as top predictors. Zhu et al. reported that unspecified pancreatic disease (ICD10 K86.9), transverse colon malignancy, pancreatic pseudocyst, breast hypertrophy, and digestive system neoplasm of unspecified behavior were key factors based on model odds ratios.
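The univariate-AUC ranking attributed to Jia et al. can be sketched in a few lines, using entirely hypothetical feature values: each candidate feature is scored by its standalone pairwise AUC against the outcome, and features are sorted by that score:

```python
def auc(values, labels):
    """Pairwise AUC of a single feature against a binary outcome."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical feature columns for six patients; outcome 1 = PC case.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "age": [72, 68, 75, 50, 45, 60],
    "creatinine": [1.4, 1.1, 1.3, 1.0, 0.9, 1.2],
    "bmi": [24, 31, 22, 27, 25, 29],
}
ranking = sorted(features, key=lambda f: auc(features[f], labels), reverse=True)
print(ranking)  # prints: ['age', 'creatinine', 'bmi']
```

Unlike SHAP or integrated gradients, this ranking ignores interactions between features; it is cheap and transparent, but a feature that only matters in combination with others would score poorly.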

The most commonly identified risk factors across studies included pancreatic disorders, biliary tract diseases, abdominal and pelvic pain, digestive neoplasms, and jaundice. These findings align with clinical intuition but also highlight the potential for EHR-based models to systematically quantify and rank risk signals that might otherwise be overlooked in routine clinical practice.

TL;DR: Six studies used X-AI techniques (SHAP, integrated gradients, XGBoost gains) to identify novel risk factors. Top signals included pancreatic disorders, jaundice, biliary tract diseases, diabetes, and abdominal pain. About 80% of PC is sporadic, making novel risk factor discovery essential.
Pages 7-8
Bias, Fairness Gaps, and Missing Data Problems

Risk of bias: PROBAST assessment found that models from only 4 of the 30 studies had low risk of bias. Common sources of bias included outcome definition (composite PC rather than PDAC-specific), predictor selection approaches, short prediction windows, and inadequate handling of missing data. Most studies using logistic regression did not report whether they assessed modeling assumptions or whether control populations were appropriately sampled.

Fairness and subgroup analysis: Only four studies performed any subgroup analysis, and these focused narrowly on age or race. Jia et al. developed separate models for different racial groups and geographic locations, then tested cross-group performance. However, none of the studies reported formal fairness metrics such as equalized odds or equal opportunity. Given the known disparities in pancreatic cancer incidence and outcomes across demographic groups, this is a significant gap.

Missing data underreporting: Fourteen studies provided no information about missing data or how missingness was handled. This matters because prognostic model performance and applicability can be severely affected by missing data patterns. Predictor-outcome associations are unbiased only if excluded participants are a completely random subset of the original sample. For structured EHR data, multiple imputation has been shown to be superior in terms of bias and precision. Deep learning approaches like recurrent neural networks can also handle irregularities and missing patterns in time-series clinical data.

Limited real-world deployment: Only two studies conducted prospective validation after model development. None of the studies reported integration of their model into an EHR system or deployment to identify high-risk individuals in a real-world clinical setting. The authors note this is appropriate, as all algorithms likely require further external validation before clinical deployment.

TL;DR: Only 4 of 30 studies had low PROBAST risk of bias. Fairness analysis was nearly absent (no formal metrics reported). Fourteen studies ignored missing data entirely. Only 2 studies performed prospective validation, and none deployed models in real clinical settings.
Pages 8-9
From Curated Features to Language Models and Multidisciplinary Collaboration

Structured plus unstructured data: Most studies relied on curated, structured EHR features. The authors argue that combining structured data (diagnoses, labs, medications) with unstructured data (free-text clinical notes) could substantially improve risk prediction. Transformer-based language models can retain the context of words and phrases in clinical notes, unlike approaches such as XGBoost that treat each word individually. This contextual understanding could enable more nuanced identification of risk signals buried in clinical narratives.
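The limitation of word-by-word featurization is easy to see in a toy bag-of-words encoding (the note snippets are invented): both notes produce identical counts for "weight" and "loss" even though the second note negates the finding, which is exactly the context a transformer-based model can retain:

```python
from collections import Counter

# Invented note snippets: the second note explicitly negates weight loss.
notes = [
    "new onset diabetes with weight loss",
    "no weight loss diabetes well controlled",
]

def bag_of_words(note):
    """Count each word independently, discarding order and context."""
    return Counter(note.split())

counts = [bag_of_words(n) for n in notes]
# A model consuming these counts cannot distinguish the negated mention.
print(counts[0]["weight"], counts[1]["weight"])  # prints: 1 1
```

Negation handling, temporal qualifiers ("history of"), and family-history attribution are all variations of this same context problem in clinical text.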

Explainability as a priority: The review emphasizes that future studies should incorporate X-AI techniques more systematically. With only 6 of 30 studies attempting novel risk factor identification through explainability, there is substantial untapped potential. Techniques such as SHAP values, integrated gradients, and attention-based interpretability in transformers can make model predictions clinically actionable by revealing which features drive individual risk assessments.

Best practice recommendations: The authors provide a set of best practice recommendations for future AI/ML model development for PC prediction using EHR data. These include clearly defining the target outcome (preferably PDAC-specific rather than composite PC), implementing appropriate data exclusion intervals (24 to 36 months before diagnosis), reporting missing data and handling strategies transparently, performing both internal and external validation, conducting subgroup and fairness analyses, and following the TRIPOD statement for reporting prediction model development and validation details.

Multidisciplinary collaboration: The authors strongly recommend that PC risk modeling efforts involve collaboration across physicians, epidemiologists, biostatisticians, data scientists, and AI/ML experts. This breadth of expertise is necessary to evaluate modeling assumptions, minimize biased estimates, avoid inefficient models, and draw correct conclusions from EHR data. The combined use of structured and unstructured data, coupled with X-AI techniques and rigorous validation, represents the most promising path toward clinically deployable pancreatic cancer risk prediction tools.

TL;DR: Future work should combine structured and unstructured EHR data using transformer-based models, systematically apply X-AI for novel risk factor discovery, use 24 to 36 month data exclusion windows, follow TRIPOD reporting standards, and ensure multidisciplinary collaboration across clinical and data science expertise.
Citation: Mishra AK, Chong B, Arunachalam SP, Oberg AL, Majumder S. 2024. Open access; available at PMC11296923. DOI: 10.14309/ajg.0000000000002870.