Can AI Be Useful in the Early Detection of Pancreatic Cancer in Patients with New-Onset Diabetes?


Plain-English Explanations
Pages 1-3
Why Pancreatic Cancer in New-Onset Diabetes Patients Deserves Special Attention

Pancreatic ductal adenocarcinoma (PDAC) accounts for roughly 90% of all pancreatic cancers and carries a dismal 5-year survival rate of just 12.8%. Between 2000 and 2014, the 5-year survival rate in the United States improved only from 7.2% to 11.5%. A major reason is late diagnosis: only about 20% of patients are eligible for surgical resection, which remains the only curative option. Patients who do receive surgery have a 5-year survival of 27%, while those diagnosed at metastatic stages survive a median of just six months.

New-onset diabetes (NOD) as a warning signal: Diabetes or impaired glucose tolerance is present in 38-75% of PDAC patients, and NOD specifically can appear up to 3 years before a pancreatic cancer diagnosis. NOD accounts for up to 58% of diabetes cases in pancreatic cancer, and it resolves after surgical removal of the tumor in 57% of cases, supporting the view that the cancer itself drives the diabetes. However, the 3-year cumulative incidence of PDAC among NOD patients aged 50 or older is only about 0.6%, which is too low for blanket screening to be cost-effective. Screening becomes cost-effective only when the screened group has at least a 10% likelihood that PDAC is the cause of the diabetes.

Biological mechanism: The link between PDAC and diabetes involves disrupted insulin and insulin-like growth factor (IGF) axes. IGF-1 and its receptor are overexpressed in PDAC tissue, and their expression correlates with tumor grade. Drugs that increase circulating insulin (such as sulfonylureas) are associated with elevated PDAC risk, while metformin, which reduces insulin resistance, shows the opposite effect. Amyloid deposition in PDAC islets may cause beta-cell dysfunction, and substances secreted by pancreatic cancer cells appear to impair beta-cell function and promote insulin resistance.

Population-wide screening for pancreatic cancer is not recommended because the disease is rare and current screening methods are invasive. False-positive results can lead to unnecessary surgeries whose mortality risk outweighs the benefits. Screening is currently recommended only for hereditary syndromes (Peutz-Jeghers, hereditary pancreatitis, familial atypical multiple mole melanoma) and mutation carriers (BRCA1, BRCA2, PALB2, ATM, Lynch syndrome) with affected relatives, where screening may extend life expectancy by up to 260 days.

TL;DR: PDAC has a 12.8% five-year survival rate, and only 20% of patients qualify for curative surgery. NOD appears up to 3 years before cancer diagnosis in many patients, but the 0.6% three-year PDAC incidence among NOD patients is too low for routine screening. The goal is to use AI to narrow the NOD population to a high-risk subgroup where screening becomes cost-effective (requiring at least 10% PDAC likelihood).
Pages 4-5
Review Approach and AI/ML Terminology Primer

This is a narrative review. The authors searched PubMed, Scopus, Google Scholar and ClinicalTrials.gov using keywords including "artificial intelligence," "machine learning," "deep learning," "pancreatic cancer" and "new-onset diabetes mellitus" (and variations). All literature published before February 2025 was eligible for inclusion. The review summarizes both traditional statistical models and AI-based models for PDAC risk prediction in NOD patients, then evaluates gaps and future directions.

Machine learning (ML) vs. deep learning (DL): Classical (shallow) ML algorithms, such as gradient boosting (GB), random forests (RF), support vector machines (SVM), k-nearest neighbors (KNN) and naive Bayes (NB), require human-directed feature extraction. The training set contains known outcomes (e.g., whether a patient was later diagnosed with PDAC), and the model learns to predict those outcomes in new cases. Deep learning (DL) uses artificial neural networks (ANNs) that can work on raw, unlabeled data with minimal human intervention, automatically discovering relevant features. DL has shown superior performance in histopathological and radiological image classification for cancer diagnosis.

Feature engineering: One of the most critical steps in building ML models is feature engineering, which creates new predictive variables from raw data. For example, instead of using a single HbA1c measurement, the rate of change in HbA1c over time may be a more powerful predictor. However, feature engineering can only work well when the data set is sufficiently diverse. Overfitting, where a model learns noise rather than real patterns, is a persistent risk, especially with the small sample sizes typical in PDAC research (where cancer cases represent only 0.1-0.2% of diabetic cohorts).
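As a toy illustration of this kind of engineered feature, the sketch below collapses a series of HbA1c readings into a single rate-of-change predictor. The patient record and values are invented for the example.

```python
from datetime import date

# Toy longitudinal record for one hypothetical NOD patient:
# (measurement date, HbA1c in %). Values are illustrative only.
hba1c = [
    (date(2022, 1, 10), 5.9),
    (date(2022, 7, 15), 6.4),
    (date(2023, 1, 20), 7.3),
]

def hba1c_slope(readings):
    """Rate of change in HbA1c (% per year) between first and last reading."""
    (d0, v0), (d1, v1) = readings[0], readings[-1]
    years = (d1 - d0).days / 365.25
    return (v1 - v0) / years

print(f"HbA1c slope: {hba1c_slope(hba1c):.2f} %/year")
```

A rising slope like this captures the glucose trajectory that several of the reviewed models found more predictive than any single reading.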

Evaluation metrics: The reviewed studies use AUROC (area under the receiver operating characteristic curve), sensitivity (recall), specificity, positive predictive value (PPV, or precision), negative predictive value (NPV) and F1 score (the harmonic mean of precision and recall). Harrell's C index and calibration slope also appear in some studies. These metrics are critical for evaluating whether a model can meaningfully stratify NOD patients into actionable risk groups.
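As a quick refresher, all of these threshold-based metrics fall out of a single confusion matrix. The counts below are invented for the example, not taken from any reviewed study.

```python
# Hypothetical confusion matrix for a PDAC risk classifier
# (illustrative counts only, not from any reviewed study).
tp, fp, fn, tn = 80, 120, 20, 780

sensitivity = tp / (tp + fn)  # recall: PDAC cases correctly flagged
specificity = tn / (tn + fp)  # non-cases correctly cleared
ppv = tp / (tp + fp)          # precision: flagged patients who truly have PDAC
npv = tn / (tn + fn)          # cleared patients who are truly cancer-free
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of the two

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"PPV={ppv:.2f} NPV={npv:.2f} F1={f1:.2f}")
```

Note how PPV stays low even with decent sensitivity and specificity when true positives are rare, which is exactly the class-imbalance problem the review keeps returning to.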

TL;DR: Narrative review searching PubMed, Scopus, Google Scholar and ClinicalTrials.gov for literature on AI, ML and DL applied to PDAC screening in NOD patients (all publications before February 2025). The paper distinguishes classical ML (requiring human feature extraction) from DL (automated feature discovery via neural networks) and emphasizes that overfitting is a core challenge given the extreme class imbalance in PDAC cohorts.
Pages 6-9
ENDPAC and Other Traditional Screening Models for Pancreatic Cancer in NOD

ENDPAC model: The Enriching New-Onset Diabetes for Pancreatic Cancer (ENDPAC) model is the most widely known tool for identifying patients with pancreatic cancer-associated diabetes (PCD). It uses weight change and blood glucose trajectories to assign risk scores. At a cut-off of 3 or higher, it achieved 78% sensitivity and 85% specificity in the initial validation cohort, with a 3-year PDAC incidence of 3.6% in the high-risk group (versus 0.82% in general NOD). However, validation studies on larger cohorts showed diminished performance, with AUROC values of only 0.72-0.75. A key discriminative insight is that PCD patients tend to lose weight before diabetes onset, while T2DM patients typically gain weight.

Cost-effectiveness: Wang et al. evaluated ENDPAC-based screening and found it yielded an additional 0.54 quality-adjusted life years (QALYs) for patients later diagnosed with PDAC, at a cost of USD 65,076 per QALY gained. This falls below the standard US willingness-to-pay threshold of USD 100,000, making it satisfactory from a health-economics perspective, though limited to the US healthcare system.

Boursi et al. (2016): This earlier model incorporated age, BMI, BMI change, smoking status, proton pump inhibitor use, anti-diabetic medications, HbA1c, cholesterol, hemoglobin, creatinine and alkaline phosphatase. It achieved an AUROC of 0.82 (95% CI: 0.75-0.89) but with only 44.7% sensitivity and 94% specificity at a 1% risk cut-off. Ali et al. (2024): A model restricted to Australian women with NOD used only age and medication data, achieving AUROC 0.73 (95% CI: 0.68-0.78) with 69% sensitivity and 69% specificity. Though less discriminative than ENDPAC, its simplicity makes it more accessible clinically.

Boursi et al. (2022) prediabetes model: Targeting patients with impaired fasting glucose (100-125 mg/dL), this model incorporated age, BMI, proton pump inhibitor use and lab values (total cholesterol, LDL, ALT, alkaline phosphatase), reaching AUROC 0.71, 66.5% sensitivity and 54.9% specificity. Proton pump inhibitor use was included as a predictor because a meta-analysis of 10 studies linked PPIs to a 69.8% increase in pancreatic cancer risk, though data quality was limited.

TL;DR: ENDPAC remains the best-known screening tool (sensitivity 78%, specificity 85% initially, AUROC 0.72-0.75 in validation). Cost-effectiveness analysis showed USD 65,076 per QALY, below the US threshold. Other models (Boursi 2016: AUROC 0.82; Ali 2024: AUROC 0.73; Boursi 2022: AUROC 0.71) offer trade-offs between complexity and accessibility. None have achieved the 10% PDAC probability threshold needed for broad screening implementation.
Pages 9-12
Early AI Attempts at Predicting Pancreatic Cancer in Diabetic Patients

Hsieh et al. (2018), the first study: Using the Taiwanese National Health Insurance Program database, the researchers gathered 1,358,634 patients with T2DM, of whom only 3,092 (0.23%) developed pancreatic cancer. They compared logistic regression against an artificial neural network (ANN). Logistic regression outperformed the ANN, with AUROCs of 0.707 (95% CI: 0.650-0.765) vs. 0.642 (95% CI: 0.576-0.708). While both models achieved 99.5% precision, the ANN had lower sensitivity (87.3% vs. 99.8%). The ANN's underperformance was attributed to asymmetric outcome distribution and the fact that logistic regression handles categorical variables more naturally. Patients with PDAC were older (mean age 63.8 vs. 57.3), had more pancreatitis, gallstones and cirrhosis, but fewer classic T2DM complications (retinopathy, nephropathy, neuropathy).

Chen et al. (2023), eight-model comparison: Drawing from the Taipei Medical University Clinical Research Database (66,384 patients with non-T1 DM, 89 with subsequent PDAC), this study tested eight algorithms. Linear discriminant analysis (LDA) won with AUROC 0.9073, 84% accuracy, 86% sensitivity and 84% specificity. Classical logistic regression performed worst (AUROC 0.6669, 38% accuracy). Other models ranged from SVM (AUROC 0.7721) to ensemble voting (AUROC 0.9049). The most important features were blood glucose levels (HbA1c and fasting glucose) within 12 months before study entry, along with hyperlipidemia. The superior performance of ML models over logistic regression was attributed to their ability to analyze higher-dimensional data and capture complex, nonlinear patterns.

Cichosz et al. (2024), random forest on Danish data: Using a nationwide Danish cohort (1998-2018), this study applied random forest classification to 716 PCD and 716 T2DM patients. NOD onset was defined by ICD-10 or ATC medication codes, and patients with PDAC within 3 years of NOD were identified (consistent with the known 32-month peak interval between NOD and PDAC). After feature engineering on routine biochemical measurements, the top 20 features were selected. The model achieved AUROC 0.78 (95% CI: 0.75-0.83). The most significant discriminators were age and HbA1c rate of change, along with altered triglyceride trajectories and liver function markers. In a simulated surveillance of 1 million NOD patients, the top 1% highest-risk group had a relative risk 20 times higher than general NOD, translating to a cumulative 3-year cancer risk of 12%, above the 10% threshold needed for cost-effective screening.
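The back-of-the-envelope arithmetic behind that 12% figure is simply the baseline incidence scaled by the relative risk:

```python
baseline_3yr_risk = 0.006  # ~0.6% PDAC incidence in the general NOD population
relative_risk = 20         # top 1% highest-risk group vs. general NOD
high_risk_incidence = baseline_3yr_risk * relative_risk

print(f"3-year PDAC risk in top 1%: {high_risk_incidence:.0%}")
```

This is how a modest-looking AUROC of 0.78 can still clear the 10% cost-effectiveness threshold: the model only needs to concentrate enough risk into a small enough subgroup.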

TL;DR: The first AI study (Hsieh 2018) saw logistic regression beat an ANN (AUROC 0.707 vs. 0.642). Chen et al. (2023) showed LDA achieving AUROC 0.9073 across eight models. Cichosz et al. (2024) used random forest on Danish data (AUROC 0.78) and demonstrated that the top 1% risk group had a 12% three-year PDAC incidence, exceeding the 10% threshold for cost-effective screening.
Pages 11-13
XGBoost, Random Forest and Genetic Integration Models

Clift et al. (2024), UK QResearch database: This study compared three models for predicting pancreatic cancer within two years of DM diagnosis, using data from over 1,500 UK general practices (2010-2021). Surprisingly, the Cox proportional hazards model (Harrell's C index: 0.802) outperformed both XGBoost (0.723) and an ANN (0.650). The statistical model explained 46% of the variation in time to diagnosis. The highest-risk 1% and 10% of patients captured 12.5% and 44.1% of pancreatic cancer cases, respectively. This far exceeds the 3.95% sensitivity of current UK guidelines (urgent imaging for patients over 60 with NOD and weight loss), demonstrating that even traditional models with a well-chosen set of clinical predictors can surpass more complex AI approaches.

Khan et al. (2023), head-to-head comparison: Using the TriNetX global network (3,224,021 patients with NOD), the authors built an XGBoost model and directly compared it against ENDPAC and the Boursi model on the same data. XGBoost achieved AUROC 0.80, substantially outperforming ENDPAC (0.63) and Boursi (0.68). Consistent with other studies, patients with PCD were older, more anemic, weighed less, had higher alkaline phosphatase levels and were more likely to be prescribed proton pump inhibitors than T2DM patients.

Chen et al. (2023), random forest with HbA1c focus: Working with Kaiser Permanente Southern California data (2010-2018) on patients aged 50-84 with HbA1c of 6.5% or higher, three RF models achieved AUROCs of 0.808-0.822. The best model used age, weight change and HbA1c change over 18 months as predictors. At the top 20% risk threshold, the models achieved approximately 60% sensitivity and 80% specificity. The emphasis on HbA1c rate of change rather than single glucose readings may explain the improvement over ENDPAC. The algorithms have been made publicly available online for external evaluation.

Sun et al. (2024), clinical plus genetic integration: This study, using the UK Biobank (502,407 initial participants, 25,897 in the final NOD group, 100 with PCD), pioneered combining clinical data with genetic factors. From 82 candidate predictors, the top five clinical features (age, platelet count, systolic blood pressure, immature reticulocyte fraction, platelet crit) were combined with single nucleotide polymorphisms (SNPs), including rs6465133 (SRI gene) and rs61759623 (KRAS). The combined clinical-genetic model achieved the highest reported AUROCs: logistic regression at 0.897 and a multi-layer perceptron (MLP) at 0.884. At a 1.28% probability cut-off, the model could identify 76% of PCD cases while testing only 13% of the NOD population. At a stricter 5.26% threshold, it captured 46% of cases while testing just 2%, with 98% specificity, 18.1% PPV, 99.6% NPV and 97.9% accuracy.

TL;DR: Clift et al. showed Cox regression (C index 0.802) outperforming XGBoost and ANNs. Khan et al. demonstrated XGBoost (AUROC 0.80) beating ENDPAC (0.63) and Boursi (0.68) on the same 3.2 million-patient dataset. Chen et al. achieved AUROC 0.808-0.822 with RF models focused on HbA1c change. Sun et al. reached the highest AUROC of 0.897 by integrating clinical features with genetic SNP data, identifying 76% of PCD cases while testing only 13% of the NOD population.
Pages 14-15
Hormonal Testing and Biomarker-Based Differentiation of PCD from T2DM

Bao et al., mixed meal tolerance test approach: This study took a fundamentally different direction by using pancreatic hormone responses rather than routine clinical data. Patients with pancreatic cancer underwent a fasting period of 10 hours or more before a mixed meal tolerance test (MMTT), which measures hormonal responses to complex food (fats, proteins and carbohydrates) rather than glucose alone. A control group included healthy volunteers and patients with NOD but no pancreatic cancer history.

The algorithms were built on insulin sensitivity, insulin secretion and pancreatic polypeptide measurements. Patients with PCD showed significantly weaker insulin and C-peptide responses to MMTT than controls. Paradoxically, the PCD group had better insulin sensitivity than the T2DM group, but their poor glucose response was driven by lower insulin secretion, suggesting direct beta-cell damage rather than peripheral resistance. Among four candidate models (RF, logistic regression, SVM and naive Bayes), naive Bayes achieved the highest AUROC of 0.965, with 81.5% classification accuracy and 92.2% specificity.

CA19-9 and sorcin pathway: The review also discusses existing biomarkers. Carbohydrate antigen 19-9 (CA19-9), despite high sensitivity and specificity in symptomatic patients, lacks the positive predictive value needed for screening. A newer candidate involves the SRI gene encoding sorcin, a protein overexpressed in PDAC. Sorcin triggers a pathway leading to plasminogen activator inhibitor-1 (PAI-1), which was found to be significantly elevated in PCD patients compared to T2DM patients, making it a potential blood-based biomarker. However, these findings still require validation in larger cohorts.

It is important to note that the Bao et al. results were published only as a conference abstract, so the full methodology and reproducibility cannot be assessed. Despite the impressive AUROC of 0.965, the lack of a complete peer-reviewed publication means these numbers should be interpreted cautiously.

TL;DR: Bao et al. used hormonal responses from mixed meal tolerance testing to differentiate PCD from T2DM, with naive Bayes achieving AUROC 0.965, 81.5% accuracy and 92.2% specificity. PCD patients showed weaker insulin secretion but better insulin sensitivity than T2DM patients. However, these results are from a conference abstract only. The sorcin/PAI-1 pathway is an emerging blood-based biomarker candidate that needs larger validation studies.
Pages 15-16
Legal, Ethical and Clinical Barriers to Implementing AI-Based Screening

Black-box problem: Deep learning models in particular operate as "black boxes," providing predictions without transparent reasoning. This makes translation to clinical settings difficult and raises legal challenges. The European Union's GDPR requires that individuals receive information about how their data is processed "in a concise, transparent, intelligible and easily accessible form," which is difficult or impossible with some DL architectures. The EU's AI Act, which entered into force in August 2024, represents the first legal framework on AI development, though its implications for healthcare remain unclear.

Data privacy and liability: The need for large, multi-institutional data sets creates new threats to patient privacy. Responsibility for data leaks and errors from AI-generated decisions remains legally ambiguous, because AI systems can learn and make decisions semi-autonomously. The FDA, Health Canada and the UK's MHRA have proposed 10 guiding principles for Good Machine Learning Practice (GMLP) in medical device development, but clear and precise policy on AI-based cancer screening approval is still lacking.

Clinician concerns: Surveys reveal that oncologists worry about misleading diagnoses, overreliance on AI, algorithmic or data bias, patient privacy breaches, delayed regulatory adaptation, conflicting recommendations that undermine patient confidence, and low flexibility of AI systems to handle unusual situations. In 2021, Tamori et al. found that 73.5% of doctors expressed concern about liability for negative outcomes resulting from AI-assisted decisions. While clinicians show optimism toward AI in general, the majority do not believe their hospitals are prepared for practical implementation.

Integration challenge: The authors emphasize that AI should be integrated into clinical workflows as a supportive tool rather than a replacement for human decision-making. This framing sidesteps some liability concerns by treating the algorithm as a decision-support instrument under physician oversight, rather than an autonomous diagnostic agent.

TL;DR: Key barriers include DL black-box opacity (conflicting with GDPR transparency requirements), unresolved liability for AI-driven errors, and clinician reluctance (73.5% of doctors worried about liability). The EU AI Act (August 2024) and FDA/MHRA GMLP guidelines are early regulatory efforts, but clear policy on AI cancer screening approval does not yet exist.
Pages 16-18
Explainable AI, Federated Learning and the Path Forward

Explainable AI (XAI): The review highlights XAI as a critical next step. Shapley Additive Explanations (SHAP), considered a gold standard for feature-importance visualization, was used in only 3 of the reviewed studies, and most papers did not explicitly explain how their features were selected. One of the four principles of XAI is a clear indication of knowledge limits: a model should specify the conditions under which it can be expected to function well. Future models should incorporate XAI methods not only through text-based explanations but also through worked examples.
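SHAP is grounded in Shapley values from cooperative game theory: each feature's attribution is its average marginal contribution across all feature subsets. As a minimal illustration of the idea (not the SHAP library itself), the sketch below computes exact Shapley attributions by brute force for a made-up three-feature risk model; the feature names, coefficients and values are all hypothetical.

```python
from itertools import combinations
from math import factorial

def model(age, hba1c_slope, weight_loss):
    # A made-up nonlinear risk score with one interaction term.
    return 0.02 * age + 0.5 * hba1c_slope + 0.3 * weight_loss * hba1c_slope

baseline = {"age": 55, "hba1c_slope": 0.0, "weight_loss": 0.0}
patient  = {"age": 68, "hba1c_slope": 1.2, "weight_loss": 4.0}
features = list(patient)
n = len(features)

def value(subset):
    """Model output with features in `subset` at patient values, rest at baseline."""
    args = {f: (patient[f] if f in subset else baseline[f]) for f in features}
    return model(**args)

shapley = {}
for f in features:
    others = [g for g in features if g != f]
    phi = 0.0
    for k in range(n):
        for s in combinations(others, k):
            # Standard Shapley weight for a coalition of size k.
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            phi += w * (value(set(s) | {f}) - value(set(s)))
    shapley[f] = phi

print(shapley)  # contributions sum to model(patient) - model(baseline)
```

Real SHAP implementations approximate this efficiently for models with many features, but the additive decomposition is the same: the attributions always sum to the gap between the patient's prediction and the baseline prediction.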

Federated learning (FL): FL trains models without exchanging raw patient data between institutions. Each hospital trains the model locally and sends only the resulting parameters to a central coordinator, which assembles a global model. This approach could resolve privacy concerns, open previously inaccessible databases and ease regulatory approval.

Self-supervised learning (SSL): SSL enables models to learn from unlabeled data, reducing the need for manual curation. Given the enormous volume of T2DM patient data passing through hospitals daily, SSL could help discover new factors discriminating T2DM from PCD in routinely collected records.
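The federated-averaging idea can be sketched in a few lines: each site fits a one-parameter linear model on its own private data, and only the fitted weight (never the data) is shared and averaged. The sites, data and hyperparameters below are all hypothetical.

```python
# Minimal federated-averaging (FedAvg) sketch with a one-feature linear model.
# Hospitals, data and learning rate are hypothetical.

def local_train(w, data, lr=0.01, epochs=50):
    """One site fits y ~ w*x by gradient descent on its private data."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of the squared error
            w -= lr * grad
    return w  # only the parameter leaves the site, never the raw records

# Two sites whose private data follow the same underlying relation y = 3x.
site_a = [(x, 3 * x) for x in (1.0, 2.0, 3.0)]
site_b = [(x, 3 * x) for x in (0.5, 1.5, 2.5)]

global_w = 0.0
for _ in range(3):  # a few communication rounds
    local_ws = [local_train(global_w, d) for d in (site_a, site_b)]
    global_w = sum(local_ws) / len(local_ws)  # server averages the parameters

print(f"global weight ~ {global_w:.2f}")
```

In practice the "parameter" is a full neural-network weight vector and the averaging is weighted by site sample counts, but the privacy property is the same: raw patient records never leave the hospital.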

Imaging integration: AI has achieved excellent results in detecting PDAC on CT scans (AUROC 0.97, sensitivity 88%, specificity 95%) and endoscopic ultrasound images (AUROC 0.95, sensitivity 93%, specificity 90%). However, none of the reviewed biochemical risk models have been combined with image-based AI screening. Trials of DL for detecting T2DM on CT scans have also been published, further motivating the integration of clinical data, biomarkers and imaging in future screening programs.

Validation gaps: Most studies relied on internal cross-validation, which can produce overly optimistic results. Only Sun et al. used nested cross-validation, which is more robust. External validation on independent populations is essential because model performance can vary significantly across different patient groups. The review also notes that only 2 of the reviewed studies referenced TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis) guidelines, and none referenced CONSORT-AI. No study has assessed the cost-effectiveness of AI-based PCD screening, and the "No Free Lunch" theorem reminds us that no single algorithm is universally superior, meaning algorithm selection must be guided by the specific problem and data characteristics.

TL;DR: Key future directions include XAI (only 3 studies used SHAP), federated learning for privacy-preserving multi-institutional training, self-supervised learning on massive unlabeled T2DM datasets, and integration with imaging AI (CT: AUROC 0.97, EUS: AUROC 0.95). Major gaps: no external validation in most studies, no cost-effectiveness analysis for AI-based screening, and minimal adherence to TRIPOD or CONSORT-AI reporting guidelines.
Citation: Mejza M, Bajer A, Wanibuchi S, Małecka-Wojciesko E. Biomedicines, 2025. Open access (CC BY). Available at PMC12025102. DOI: 10.3390/biomedicines13040836.