Personalized Three-Year Survival Prediction and Prognosis Forecast by Interpretable Machine Learning for Pancreatic Cancer Patients

Frontiers in Oncology, 2024

Plain-English Explanations
Pages 1-2
Why This Study Was Needed and What It Sets Out to Do

Pancreatic cancer is one of the deadliest malignancies in the world, with a five-year survival rate of only about 9%. Roughly 1% of patients with metastatic disease survive beyond three years. Surgical resection remains the only potentially curative treatment, but only a small fraction of patients are eligible at diagnosis because the disease is typically detected at an advanced stage. Existing prognostic tools, such as nomograms built on Cox regression models, have limited sensitivity and specificity when it comes to accurately predicting individual patient outcomes.

Biomarkers like CA19-9, circulating tumor DNA (ctDNA), microRNAs, and tumor mutational burden (TMB) have been explored for prognosis prediction, but each has significant drawbacks. CA19-9 produces false positives when elevated by other conditions like cholangitis, while ctDNA is limited by low abundance in early-stage cancers. The authors argue that machine learning (ML), which excels at capturing complex, non-linear relationships in large datasets, offers a better path forward for developing accurate, personalized prognostic tools.

This study, led by Teng and colleagues from hospitals affiliated with Xuzhou Medical University, aimed to build and validate ML-based models for two distinct tasks: (1) predicting whether a pancreatic cancer patient will survive beyond three years, and (2) forecasting overall prognosis. The team used population-level data from the SEER (Surveillance, Epidemiology, and End Results) database spanning 2000 to 2021, supplemented by an external validation cohort from a Chinese hospital. In total, 20,064 patients from SEER and 103 patients from the external site were included.

TL;DR: Pancreatic cancer has dismal survival rates and existing prognostic tools are insufficient. This study develops machine learning models using over 20,000 SEER database patients to predict three-year survival and overall prognosis, with external validation from a Chinese hospital cohort.
Pages 2-4
How the Most Important Clinical Variables Were Identified

Before building any predictive model, the researchers needed to determine which clinical variables actually matter for predicting three-year survival. They started with 24 candidate variables covering demographics (age, sex, race, marital status, household income), tumor characteristics (grade, stage, size, histology, location), treatment information (surgery type, chemotherapy, radiotherapy, lymph node surgery), and metastasis data (bone, brain, liver, lung). Univariate and multivariate logistic regression analyses were performed on the training cohort to identify statistically significant predictors.

The multivariate analysis revealed that age, marital status, household income, histology, grade, summary stage, tumor size, AJCC stage, surgery type, radiotherapy, chemotherapy, lung metastasis, and M stage were all independently associated with three-year survival (P < 0.05). Among these, AJCC stage showed the strongest correlation with three-year survival in the initial correlation analysis.

Next, the team applied Recursive Feature Elimination (RFE) based on six ML algorithms: CatBoost, Random Forest (RF), Support Vector Machine (SVM), XGBoost, decision tree, and gradient boosting machine (GBM). Each algorithm ranked the variables by importance, and the Robust Rank Aggregation (RRA) algorithm integrated these rankings into a comprehensive ordering. The optimal feature subset came from GBM-based RFE, which retained 12 variables and achieved the highest AUC (0.819).

After removing highly correlated variables (such as Summary Stage, which overlapped with AJCC Stage), the final set contained eight variables selected for model development: AJCC Stage, Chemotherapy, Age, Grade, Lung Metastasis, M Stage, Surgery Type, and Tumor Size. These eight variables appeared in at least four of the six ML-based RFE selections, confirming their consistent importance across different algorithmic perspectives.
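
The GBM-based RFE step can be illustrated with a short sketch. This is not the authors' pipeline (they used the mlr3 R package on SEER data); it is a minimal scikit-learn analogue on synthetic stand-in data, with the cohort size and feature names invented for illustration. Only the design (24 candidate variables, a GBM ranker, a 12-feature target subset) mirrors the study.

```python
# Hedged sketch: recursive feature elimination driven by a gradient boosting
# ranker, mirroring the paper's GBM-based RFE step on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# Synthetic cohort: 500 "patients", 24 candidate variables (as in the study design)
X, y = make_classification(n_samples=500, n_features=24, n_informative=8,
                           random_state=0)

# RFE repeatedly fits the GBM and drops the least important feature
# until the requested subset size (here 12, matching the optimal subset) remains.
selector = RFE(GradientBoostingClassifier(random_state=0),
               n_features_to_select=12, step=1)
selector.fit(X, y)

kept = np.flatnonzero(selector.support_)
print(f"{len(kept)} features retained:", kept.tolist())
```

In the study, this ranking was repeated with six different algorithms and the per-algorithm orderings were merged by Robust Rank Aggregation before the correlation-based pruning described above.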

TL;DR: Starting from 24 candidate variables, the researchers used logistic regression and Recursive Feature Elimination across six ML algorithms to distill the eight most important predictors: AJCC Stage, Chemotherapy, Age, Grade, Lung Metastasis, M Stage, Surgery Type, and Tumor Size.
Pages 4-8
Building and Validating the CatBoost Model for Three-Year Survival

Using the eight selected variables, the team trained 13 different ML algorithms: CatBoost, RF, SVM, XGBoost, decision tree, GBM, k-nearest neighbor (KNN), logistic regression, naive Bayes, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), neural network (NNET), and generalized linear model (GLM). All models were built using the mlr3 R package with hyperparameter optimization via 1,000-evaluation random search across 5-fold cross-validation, repeated five times. To handle the class imbalance problem (81.7% of patients died within three years versus 18.3% who survived), the Synthetic Minority Over-sampling Technique (SMOTE) was applied.

CatBoost emerged as the best-performing model across all evaluation metrics. On the training set, it achieved an AUC of 0.932 [0.924, 0.939]. On the internal validation set (5,937 patients), the AUC was 0.899 [0.873, 0.934], and on the external validation set (103 patients), it reached 0.826 [0.735, 0.919]. The model demonstrated an accuracy of 0.839, sensitivity of 0.872, specificity of 0.803, and precision of 0.832. The optimized hyperparameters were: depth of 5, learning rate of 0.01678, 548 iterations, and L2 leaf regularization of 7.409.

Calibration curves confirmed that CatBoost's predicted probabilities closely matched actual observed outcomes, meaning the risk estimates provided by the model can be trusted to reflect true likelihoods. Decision curve analysis (DCA) showed that CatBoost delivered the greatest net clinical benefit compared to all other models, indicating it would lead to better clinical decisions than either a treat-all or treat-none strategy across a wide range of threshold probabilities. Ten-fold cross-validation further confirmed the model's robustness and resistance to overfitting.
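
Decision curve analysis rests on a simple quantity, net benefit, which weighs true positives against false positives at each threshold probability. The sketch below computes it from scratch on toy labels and predictions (invented for illustration, not the paper's CatBoost outputs), comparing a model against the treat-all strategy.

```python
# Hedged sketch of decision curve analysis (DCA): net benefit of a model
# versus a treat-all strategy across threshold probabilities.
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit = TP/n - FP/n * (pt / (1 - pt)) at threshold pt."""
    treat = y_prob >= threshold
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - fp / n * (threshold / (1 - threshold))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
# A toy "model": probabilities nudged toward the true label
p = np.clip(0.5 * y + rng.uniform(0, 0.5, 200), 0, 1)

for pt in (0.2, 0.5, 0.8):
    nb_model = net_benefit(y, p, pt)
    nb_all = net_benefit(y, np.ones(200), pt)   # treat-all strategy
    print(f"pt={pt:.1f}  model NB={nb_model:.3f}  treat-all NB={nb_all:.3f}")
```

A model "delivers the greatest net clinical benefit", as CatBoost did here, when its net-benefit curve sits above both the treat-all line and zero (treat-none) across clinically relevant thresholds.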

TL;DR: CatBoost outperformed all 12 competing ML algorithms with an AUC of 0.932 on training data and 0.826 on external validation. It achieved 83.9% accuracy, 87.2% sensitivity, and 80.3% specificity using just eight clinical variables, with well-calibrated probability estimates and strong clinical decision utility.
Pages 8-10
SHAP Analysis Reveals Which Factors Drive Predictions

A major strength of this study is its emphasis on interpretability. While many ML models function as "black boxes," the authors used the SHAP (SHapley Additive exPlanations) framework to explain exactly how the CatBoost model reaches its predictions. SHAP values quantify the contribution of each feature to every individual prediction, making it possible to understand not just which variables matter globally, but how they influence each specific patient's predicted outcome.

The feature importance analysis across eight ML algorithms consistently identified Surgery Type as the most impactful variable for three-year survival prediction. AJCC Stage and M Stage followed as the next most important predictors. The SHAP beeswarm plot revealed clear directional relationships: patients who received no surgery, had higher tumor grade, were older, had lung metastasis, received no chemotherapy, had higher AJCC stage, and had M1 stage all showed increased SHAP values, corresponding to a higher probability of death within three years.
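
The additive attribution idea behind SHAP can be shown exactly on a tiny example. The three-feature "risk score", the feature names, and the two reference patients below are all invented; real SHAP values for a CatBoost model come from the shap library's tree explainer, not from this brute-force enumeration, which is only feasible for a handful of features.

```python
# Hedged sketch: exact Shapley values for a toy hand-made scoring function,
# illustrating the additive attribution principle behind SHAP.
from itertools import combinations
from math import factorial

FEATURES = ["surgery", "ajcc_stage", "age"]           # hypothetical names
baseline = {"surgery": 1, "ajcc_stage": 1, "age": 60}  # reference "patient"
patient  = {"surgery": 0, "ajcc_stage": 4, "age": 75}  # patient to explain

def model(x):
    # Toy risk score: no surgery and higher stage/age raise predicted risk.
    return 0.4 * (x["surgery"] == 0) + 0.1 * x["ajcc_stage"] + 0.002 * x["age"]

def shapley(feature):
    """Average marginal contribution of `feature` over all coalitions."""
    others = [f for f in FEATURES if f != feature]
    n, total = len(FEATURES), 0.0
    for size in range(len(others) + 1):
        for coalition in combinations(others, size):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            with_f  = {f: (patient[f] if f in coalition + (feature,) else baseline[f])
                       for f in FEATURES}
            without = {f: (patient[f] if f in coalition else baseline[f])
                       for f in FEATURES}
            total += weight * (model(with_f) - model(without))
    return total

phi = {f: shapley(f) for f in FEATURES}
print(phi)
# Additivity: the attributions sum to f(patient) - f(baseline)
print(sum(phi.values()), model(patient) - model(baseline))
```

The additivity property in the last line is what makes per-patient explanations like the two case studies below possible: each prediction decomposes exactly into per-feature contributions.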

The authors demonstrated the model's interpretability through two representative patient cases. The first patient, who survived beyond three years, had a low SHAP prediction score of 0.0276 (well below the median cutoff of 0.0962), driven by favorable surgery type and early-stage disease. The second patient, who did not survive, showed a prediction score of 0.187, reflecting the combined negative impact of advanced stage and absence of surgical intervention.

This level of per-patient transparency is clinically valuable because it allows oncologists to see exactly which factors are pushing a given patient toward a favorable or unfavorable prediction, facilitating more informed conversations about treatment options and prognosis with patients and their families.

TL;DR: SHAP analysis revealed Surgery Type as the single most important predictor of three-year survival, followed by AJCC Stage and M Stage. The framework provides patient-level explanations showing exactly which factors drive each individual prediction, making the "black box" model transparent and clinically actionable.
Pages 10-12
The RSF+GBM Model for Overall Survival Prognosis

Beyond the three-year survival classification task, the researchers built a separate prognostic model to predict overall survival (OS) as a continuous outcome. Univariate and multivariate Cox regression analysis identified 20 independent prognostic variables, including sex, race, age, marital status, household income, household location, tumor primary site, histology, grade, tumor size, AJCC stage, T stage, surgery type, lymph node surgery, regional lymph node status, radiotherapy, chemotherapy, bone metastasis, liver metastasis, and lung metastasis.

The team then evaluated 101 different ML algorithm combinations using a leave-one-out cross-validation (LOOCV) framework. These combinations paired 10 survival-focused algorithms for feature selection (Random Survival Forest (RSF), elastic net, Lasso, Ridge, stepwise Cox, CoxBoost, plsRcox, SuperPC, GBM, and survival-SVM) with algorithms from the same pool for model construction. The concordance index (C-index) for each combination was calculated across training, internal validation, and external validation cohorts.

The winning combination was "RSF+GBM", which used Random Survival Forest for feature selection and gradient boosting machine for model construction. This model achieved a C-index of 0.774 in the training set, 0.722 in internal validation, and 0.674 in external validation, yielding the highest average C-index (0.723) across all three cohorts. Surgery Type was again identified as the most significant variable in both the RSF feature importance and GBM model construction steps.
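
The C-index used to rank these combinations has a simple definition: among all comparable patient pairs, it is the fraction in which the patient assigned the higher risk score actually fails first. A minimal pure-Python version, on toy follow-up data rather than study data:

```python
# Hedged sketch: Harrell's concordance index (C-index), the metric used to
# rank the 101 algorithm combinations, handling right-censored patients.
def c_index(times, events, risks):
    """Fraction of comparable pairs in which the higher-risk patient fails first.
    events[i] == 1 means death observed; 0 means censored follow-up."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if i has an observed event before j's time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

times  = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0]             # two patients censored
risks  = [0.9, 0.7, 0.6, 0.4, 0.1]   # risks perfectly anti-ordered with time
print(c_index(times, events, risks))  # → 1.0 for perfectly ranked risks
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the RSF+GBM model's 0.723 average indicates clearly better-than-chance but imperfect discrimination.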

Time-dependent ROC curves showed that the RSF+GBM model outperformed conventional clinical variables at most time points for predicting 1-, 3-, and 5-year OS. Kaplan-Meier survival analysis confirmed that patients stratified into low-risk and high-risk groups by the model's median risk score showed significantly different survival trajectories across all three cohorts (training, internal validation, and external validation), validating the model's risk stratification capability.
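
The Kaplan-Meier curves used for that risk stratification can be reproduced with a compact estimator. The follow-up times and event flags below are toy values, not cohort data; the function drops the survival probability at each observed death time and handles censored patients by removing them from the risk set.

```python
# Hedged sketch: a minimal Kaplan-Meier estimator of the kind used to compare
# the model-defined low-risk and high-risk groups.
def kaplan_meier(times, events):
    """Return (time, survival probability) steps; events: 1 = death, 0 = censored."""
    data = sorted(zip(times, events))
    at_risk, surv, curve = len(data), 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:   # group ties at time t
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk            # product-limit update
            curve.append((t, surv))
        at_risk -= removed
    return curve

times  = [2, 4, 4, 6, 9]
events = [1, 1, 0, 0, 1]
print(kaplan_meier(times, events))
```

Comparing two such curves (typically with a log-rank test) is how "significantly different survival trajectories" between risk groups is established.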

TL;DR: From 101 ML algorithm combinations, the RSF+GBM model emerged as the best prognostic tool with an average C-index of 0.723 across three cohorts. It uses 20 clinical variables with Random Survival Forest for feature selection and gradient boosting for prediction, effectively stratifying patients into distinct risk groups.
Pages 12-16
Patient Characteristics and Cohort Composition

The study drew from 20,064 pancreatic cancer patients in the SEER database (2000-2021) who had confirmed pancreatic adenocarcinoma, were over 18 years old, had positive histology, and had complete follow-up information including TNM stage and grade details. The SEER cohort was randomly split 7:3 into a training set (14,127 patients) and an internal validation set (5,937 patients). An external validation cohort of 103 patients came from The First People's Hospital of Lianyungang (2015-2024).

The demographics reflected broad representation: 50.7% of patients were male, 80.3% were White, 11.0% were Black, and 60.7% were married. By age, the largest group was 60-69 years (32.3%), followed by 70-79 years (27.9%). Roughly two-thirds of tumors were located in the pancreas head (66.0%), and the most common histological subtype was adenomas and adenocarcinomas (65.7%). Nearly half of tumors were moderately differentiated (grade II, 46.8%), while 40.0% were poorly differentiated (grade III).

For staging, the majority were AJCC stage II (56.6%), followed by stage IV (23.6%), stage III (10.2%), and stage I (9.5%). In terms of treatment, 49.2% received local or partial pancreatectomy, 40.0% received no surgery, and 10.9% underwent total pancreatectomy. Chemotherapy was administered to 67.0% of patients, while 29.2% received radiotherapy. In the training set, only 18.3% of patients were alive at three-year follow-up, while 81.7% had died, underscoring the devastating lethality of this disease.

TL;DR: The study analyzed 20,064 SEER patients plus 103 external validation patients. Over 80% of patients died within three years. The majority had stage II disease, two-thirds received chemotherapy, and about half underwent partial pancreatectomy, providing a large and clinically representative cohort.
Pages 13-15
How Surgery, Chemotherapy, and Other Factors Influence Outcomes

Surgery Type was the dominant predictor across both models, which aligns with established clinical knowledge. Patients who undergo surgical resection have significantly improved survival rates compared to those who do not, but surgery alone is often insufficient for long-term survival, with median survival typically ranging from 8 to 10 months even after resection, frequently accompanied by tumor recurrence. The model quantified this relationship precisely: the SHAP analysis showed that "no surgery" was the single strongest driver of predicted death within three years.

Chemotherapy was identified through both logistic and Cox regression analyses as a key independent factor in enhancing patient survival. Adjuvant chemotherapy has been shown to double median survival rates compared to patients who do not receive it, while neoadjuvant chemotherapy improves overall survival and increases the likelihood of achieving a microscopically clear (R0) resection margin. The model captures these treatment effects and can help identify patients who would benefit most from aggressive chemotherapy regimens.

Age emerged as an independent risk factor, with older patients exhibiting lower survival rates, likely due to diminished immunity and physical decline. The study also found that race plays a role in pancreatic cancer prognosis: African American patients had higher incidence rates and lower overall survival, potentially reflecting differences in socioeconomic status, healthcare access, and genetic or environmental factors. Gender showed a modest effect, with women generally demonstrating slightly better overall survival, consistent with prior studies analyzing outcomes from standard treatments and more aggressive regimens like FOLFIRINOX.

Metastasis patterns had distinct prognostic implications. Liver metastasis was associated with the poorest prognosis due to the liver's role in drug metabolism, while lung metastasis, though serious, generally carried a slightly better prognosis. The model integrates these nuanced relationships between metastatic sites and survival, providing more granular risk assessment than conventional staging systems alone.

TL;DR: Surgery type is the strongest survival predictor, but even after resection, median survival is only 8-10 months. Chemotherapy doubles median survival rates. Older age, African American race, and liver metastasis are associated with worse outcomes. The model captures these complex interactions for personalized risk assessment.
Pages 16-19
The Machine Learning Framework and Methodology

The study employed a rigorous benchmarking framework to ensure the selected models genuinely outperformed alternatives rather than winning by chance. For the predictive model, 13 ML algorithms were compared using ten evaluation metrics: AUC, area under the precision-recall curve (PRAUC), accuracy, sensitivity, specificity, precision, cross-entropy, Brier score, balanced accuracy, and F-beta score. The selection criteria prioritized the model with the highest AUC, highest PRAUC, and lowest Brier score while maintaining good calibration.
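
A few of these metrics are worth seeing computed from scratch. The toy labels and probabilities below are invented for illustration; the definitions (Brier score as the mean squared error of predicted probabilities, balanced accuracy as the mean of the true-positive and true-negative rates) are standard.

```python
# Hedged sketch: three of the benchmark metrics computed from scratch
# on toy predictions.
import numpy as np

def brier(y, p):                      # mean squared error of probabilities
    return float(np.mean((p - y) ** 2))

def sensitivity(y, yhat):             # true-positive rate
    return float(np.sum((yhat == 1) & (y == 1)) / np.sum(y == 1))

def balanced_accuracy(y, yhat):       # mean of TPR and TNR
    tnr = np.sum((yhat == 0) & (y == 0)) / np.sum(y == 0)
    return float((sensitivity(y, yhat) + tnr) / 2)

y    = np.array([1, 1, 1, 0, 0, 0, 0, 0])
p    = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.2])
yhat = (p >= 0.5).astype(int)

print("Brier:", brier(y, p))
print("Sensitivity:", sensitivity(y, yhat))
print("Balanced accuracy:", balanced_accuracy(y, yhat))
```

Balanced accuracy matters here precisely because of the 81.7% / 18.3% class imbalance: plain accuracy would reward a model that simply predicts death for everyone.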

CatBoost (Categorical Boosting) is a gradient boosting framework based on symmetric decision trees (also called oblivious trees). It is particularly effective at handling categorical features without requiring manual encoding, which made it well-suited for this dataset where most variables (AJCC stage, surgery type, grade) are categorical. The algorithm also has built-in mechanisms to prevent overfitting and requires fewer hyperparameters than many competing methods, making it both efficient and practical for clinical deployment.

For the prognostic model, the 101 ML algorithm combinations were evaluated using a leave-one-out cross-validation framework, which is computationally expensive but provides the least biased estimate of model performance. The top five combinations by average C-index were further evaluated using k-fold cross-validation, logarithmic loss, recall, and decision calibration to confirm robustness. The final RSF+GBM model used Random Survival Forest to identify the most relevant features from the full set of 20 prognostic variables, then fed those features into a GBM algorithm to construct the final survival prediction model.

The class imbalance issue (only 18.3% of patients survived three years) was addressed using SMOTE, which generates synthetic examples of the minority class rather than simply duplicating existing ones. Nested resampling with a two-tiered k-fold cross-validation process was used to prevent information leakage between hyperparameter tuning and model selection, a common pitfall in ML studies that can lead to overly optimistic performance estimates.
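
The core SMOTE idea is small enough to sketch directly: synthesize a new minority-class point by interpolating between a minority sample and one of its minority-class nearest neighbours. This is a minimal NumPy version on random toy data, not the imbalanced-learn implementation the study's pipeline would correspond to.

```python
# Hedged sketch: the SMOTE interpolation step. Each synthetic point lies on
# the segment between a minority sample and one of its k nearest minority
# neighbours, rather than being a duplicate of an existing sample.
import numpy as np

def smote(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic samples from minority-class matrix X_min."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # k nearest minority neighbours of sample i (excluding itself)
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]
        j = rng.choice(nbrs)
        lam = rng.uniform()                      # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

rng = np.random.default_rng(1)
minority = rng.normal(0, 1, size=(20, 4))        # toy minority class
synthetic = smote(minority, n_new=60)
print(synthetic.shape)  # → (60, 4)
```

Crucially, SMOTE must be applied only inside each training fold, never before the train/validation split; otherwise synthetic points derived from validation patients leak into training, which is exactly the information-leakage pitfall the nested resampling scheme guards against.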

TL;DR: The study used a rigorous benchmarking framework comparing 13 algorithms for prediction and 101 algorithm combinations for prognosis. CatBoost won for classification due to its strength with categorical data, while RSF+GBM won for survival prediction. SMOTE addressed class imbalance and nested cross-validation prevented information leakage.
Pages 19-21
Study Limitations and What Comes Next

The most significant limitation is the small external validation cohort of only 103 patients. While the authors calculated the minimum required sample size using the Riley formula and attempted to collect the largest available sample, the limited number of patients with complete follow-up information at the external hospital constrained the cohort size. The authors partially mitigated this by applying 10-fold cross-validation on the SEER data to assess generalizability, and they plan to expand the external validation set in future studies.

The study relies on retrospective data from the SEER database, which introduces potential selection bias. Inconsistent data collection across multiple hospitals and the retrospective design led to missing clinical feature data in some cases. The handling of missing values by categorizing them into an "unknown" category, while preserving data integrity, may introduce noise into the model's predictions for patients with incomplete records.

Several important clinicopathological parameters were unavailable in the SEER database, including imaging data, laboratory test results (such as CA19-9 and CEA levels), and KRAS gene mutation status. These biomarkers are known to be prognostically relevant, and their absence likely limits the model's predictive ceiling. Additionally, the model uses a broad range of baseline clinical features, which somewhat complicates its practical application, as all eight variables need to be available at the time of prediction.

The model has not yet been implemented in clinical practice, necessitating prospective, multicenter, and large-scale validation studies before it can be adopted as a clinical decision support tool. The authors envision future work incorporating molecular biomarkers, imaging features, and treatment response data to create even more accurate personalized prognostic models. Despite these limitations, the study establishes a strong foundation for ML-based prognostication in pancreatic cancer, demonstrating that relatively accessible clinical variables can achieve high predictive accuracy when combined with sophisticated algorithmic approaches.

TL;DR: Key limitations include a small external validation cohort (103 patients), retrospective SEER data with potential selection bias, and the absence of molecular biomarkers like CA19-9 and KRAS mutations. Prospective multicenter validation is needed before clinical adoption, but the framework demonstrates strong potential for personalized pancreatic cancer prognostication.
Citation: Teng B, Zhang X, Ge M, Miao M, Li W, Ma J. Personalized three-year survival prediction and prognosis forecast by interpretable machine learning for pancreatic cancer patients. Front Oncol. 2024. Open Access (CC BY). PMC11532159. DOI: 10.3389/fonc.2024.1488118.