Pancreatic cancer is one of the deadliest malignancies worldwide, having surpassed breast cancer to become the third leading cause of cancer-related death in the United States. It is projected to climb to the second leading cause, behind only lung cancer, before 2040. The high mortality rate stems primarily from late-stage diagnosis: the pancreas is anatomically difficult to access for routine screening, early-stage disease typically presents with no symptoms or only nonspecific ones, and there are no reliable diagnostic biomarkers for early detection. The rising incidence of pancreatic tumors is further driven by the obesity epidemic and increased life expectancy.
This 2024 study from Guangxi Medical University and Guangxi University in China set out to build a machine learning classification model that could distinguish benign from malignant pancreatic tumors by combining clinical patient data with radiomics features extracted from ultrasound images. The core idea is that neither clinical features nor imaging features alone are sufficient for reliable differential diagnosis, but fusing the two data types could substantially improve accuracy.
The researchers enrolled 242 patients who were hospitalized between January 2020 and June 2023, splitting them into a training cohort of 169 patients and a test cohort of 73 patients (a 7:3 ratio). They collected 28 clinical features and extracted 306 radiomics features from endoscopic ultrasound (EUS) images. Three separate models were built: a clinical-only model, a radiomics-only model, and a fusion model combining both. The fusion model achieved the best performance, with an AUC of 0.978 on cross-validation and 0.925 on the independent test set.
The study was approved by the ethics review board of The First Affiliated Hospital of Guangxi Medical University and used a retrospective design. The inclusion criteria required patients to have a pathologically confirmed diagnosis of either benign or malignant pancreatic tumors. Key exclusion criteria filtered out patients who had received any antitumor treatment prior to laboratory tests or ultrasonography, patients with malignant tumors elsewhere in the body, patients with recurrent pancreatic tumors, and patients with incomplete clinical or ultrasound data.
Of the 242 enrolled patients, 169 (approximately 70%) were assigned to the training cohort and 73 (approximately 30%) to the test cohort through random allocation. The two cohorts showed no statistically significant differences in baseline characteristics such as sex, age, BMI, abdominal pain, jaundice, tumor diameter, or tumor classification (all P values greater than 0.05). In total, 169 patients had malignant tumors (117 in training, 52 in test) and 73 had benign tumors (52 in training, 21 in test). The mean patient age was approximately 57 years, with a mean BMI of roughly 22 and an average tumor diameter of about 4 cm.
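To make the cohort allocation concrete, here is a minimal sketch of a 7:3 stratified random split using scikit-learn. The feature matrix and labels are synthetic stand-ins for the study's data; stratifying on the label keeps the malignant/benign ratio similar in both cohorts, as the paper's baseline comparison suggests was achieved.

```python
# Sketch of a 7:3 split like the one described; data is synthetic, not the study's.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(242, 28))       # 242 patients, 28 clinical features
y = np.array([1] * 169 + [0] * 73)   # 1 = malignant, 0 = benign

# stratify=y preserves the class ratio in both cohorts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))     # 169 73
```

With 242 patients, a 0.3 test fraction rounds up to exactly 73 test patients and 169 training patients, matching the cohort sizes reported in the study.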
The comparable baseline characteristics across cohorts matter: they indicate that the random split did not skew either cohort toward a particular patient profile, so test-set performance can more plausibly be attributed to genuine generalization rather than sampling artifacts. The pathological examination served as the ground truth label for each patient, providing the definitive benign-or-malignant classification the models were trained to predict.
All patient clinical data, ultrasound images, and pathological results were sourced from the hospital's Health Information System. The 28 clinical features collected fell into three categories: general patient information (sex, age, BMI), clinical signs (abdominal pain, jaundice, tumor location, tumor diameter, tumor subtype), and laboratory test results (blood type, blood sugar, total bilirubin, LDL cholesterol, apolipoprotein B, CA125, CA 19-9, CEA, HbA1c, and others).
During preprocessing, features with more than 10% missing values were discarded. For continuous numerical variables, missing values were imputed using the median; for categorical variables, the mode was used. Continuous variables such as BMI, CA125, and CA 19-9 were converted into categorical variables based on WHO classification criteria. The clinical model was then built using logistic regression. Univariate and multivariate analyses identified age, abdominal pain, CEA, CA125, CA 19-9, and HbA1c as significant independent predictors. Tumor diameter, with a P value of 0.058 (just above the 0.05 threshold), was also included given its clinical importance.
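The preprocessing pipeline described above can be sketched as follows. This is a hedged illustration on synthetic data: the column names and values are invented, and the actual WHO-based categorization rules are not reproduced here.

```python
# Sketch of the preprocessing described: drop features with >10% missing values,
# impute the rest (median for numeric, mode for categorical), then fit a
# logistic regression. All data below is synthetic.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.integers(30, 80, 200).astype(float),
    "ca19_9": rng.lognormal(3, 1, 200),
    "jaundice": rng.choice(["yes", "no"], 200),
})
df.loc[rng.choice(200, 10, replace=False), "age"] = np.nan  # 5% missing: kept
y = rng.integers(0, 2, 200)

# Discard columns with more than 10% missing values
df = df.loc[:, df.isna().mean() <= 0.10]

# Median imputation for numeric columns, mode for categorical ones
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())

X = pd.get_dummies(df, drop_first=True)   # encode categoricals
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(X.isna().sum().sum())               # 0 -- no missing values remain
```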
The clinical model achieved an AUC of 0.892 (95% CI: 0.85 to 0.94) in 5-fold cross-validation on the training cohort and 0.882 (95% CI: 0.74 to 0.95) on the test cohort. These results confirm that clinical variables alone carry meaningful predictive signal, but the authors hypothesized that adding imaging-derived features could push performance even higher.
Radiomics is the process of extracting large numbers of quantitative features from medical images that are not visible to the human eye. In this study, all procedures followed the Image Biomarker Standardization Initiative (IBSI) standards. Endoscopic ultrasound (EUS) images were acquired using Olympus or Fuji ultrasonic equipment with linear probes operating at 5.0 to 7.5 MHz and saved in DICOM format. The EUS examinations were performed by gastroenterologists with 10 years of experience, and one image containing the tumor was collected per patient.
The ultrasound grayscale images were imported into MaZda software (v4.6), where a physician with ten years of experience delineated the region of interest (ROI) encompassing the tumor. To assess interobserver reproducibility, 100 images were randomly re-segmented by a senior radiologist with 20 years of experience. An intraclass correlation coefficient (ICC) above 0.75 was used as the threshold for high feature stability, and discrepancies were resolved through consultation. Importantly, the radiologists were blinded to the histopathological type when delineating ROIs.
A total of 306 radiomics features were extracted and grouped into seven categories: first-order statistics, shape-based features, gray level co-occurrence matrix (GLCM), gray level run length matrix (GLRLM), gray level size zone matrix (GLSZM), neighboring gray tone difference matrix (NGTDM), and gray level dependence matrix (GLDM). Feature preprocessing included handling outliers through log transformation and normalization using min-max scaling.
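The two preprocessing steps named above (log transformation for outliers, min-max normalization) might look like the following sketch. The feature matrix is synthetic; `log1p` is used here as one common variant of log transformation that tolerates zero-valued features, though the paper does not specify its exact formula.

```python
# Sketch of the radiomics feature preprocessing described: log-transform to
# tame heavy-tailed outliers, then min-max scaling to [0, 1]. Synthetic data.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(2)
features = rng.lognormal(mean=0.0, sigma=2.0, size=(169, 306))  # heavy-tailed

log_features = np.log1p(features)          # log(1 + x) handles zeros safely
scaled = MinMaxScaler().fit_transform(log_features)

print(scaled.min(), scaled.max())          # each feature now spans [0, 1]
```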
Four different machine learning algorithms were used to build classification models from the 306 radiomics features: support vector machine (SVM), random forest (RF), XGBoost, and K-nearest neighbors (KNN). Each algorithm was trained on the training cohort and evaluated using both 5-fold cross-validation and the independent test cohort. The goal was to find the best-performing radiomics-only model before combining it with clinical data.
In 5-fold cross-validation, the KNN algorithm produced the best results with an AUC of 0.854 (95% CI: 0.78 to 0.92), accuracy of 0.811, precision of 0.824, recall of 0.923, and F1 score of 0.871. XGBoost came in second with an AUC of 0.798, followed by SVM and RF, both at 0.780. On the test cohort, KNN again led with an AUC of 0.739, while the other three algorithms ranged from 0.601 to 0.639. The KNN model was therefore selected as the radiomics model going forward.
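A comparison like the one in the paper can be sketched with scikit-learn's cross-validation utilities. Note the hedges: the data here is synthetic, the resulting AUCs will not match the paper's numbers, and `GradientBoostingClassifier` stands in for XGBoost to keep the example dependency-free.

```python
# Sketch of comparing four classifiers by 5-fold cross-validated AUC on
# synthetic data standing in for the 306 radiomics features.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=169, n_features=306,
                           n_informative=20, random_state=0)
models = {
    "SVM": SVC(probability=True, random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost
    "KNN": KNeighborsClassifier(),
}
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean CV AUC = {scores[name]:.3f}")
```

The design choice mirrored here is evaluating every candidate under the same cross-validation protocol before committing to one, which is what let the authors justify selecting KNN.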
Although the radiomics model performed decently, its test-set AUC of 0.739 was notably lower than the clinical model's 0.882. This gap underscores the point that radiomics features from ultrasound images, while informative, are not sufficient on their own for high-confidence classification. The variability in performance across algorithms also highlights the importance of systematically comparing multiple approaches rather than defaulting to a single algorithm.
The fusion model represents the central contribution of this study. Rather than simply concatenating all 306 radiomics features with the 28 clinical features (which would create a very high-dimensional and potentially unstable model), the researchers adopted a more elegant two-step approach. First, they used the KNN radiomics model to calculate the probability of each patient having a malignant pancreatic tumor. This probability was named RAD-prob, effectively condensing the entire radiomics feature space into a single informative variable.
The RAD-prob was then combined with the clinical features in a multivariate logistic regression analysis. Through this process, four features were selected for the final fusion model: age, RAD-prob, CA125, and CA 19-9. This parsimonious model achieved an AUC of 0.978 (95% CI: 0.96 to 0.99) in 5-fold cross-validation and an AUC of 0.925 (95% CI: 0.86 to 0.98) on the test cohort. In the training cohort, the fusion model reached an accuracy of 0.917, precision of 0.940, recall of 0.940, and F1 score of 0.940.
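The two-step fusion can be sketched as a stacking-style pipeline. One caveat: the paper does not state how RAD-prob was computed on the training cohort, so this sketch uses out-of-fold predictions (`cross_val_predict`), a common way to avoid leaking training labels into the stacked feature; the data and the label-generating rule are synthetic.

```python
# Sketch of the fusion approach: condense the radiomics features into a single
# malignancy probability (RAD-prob) via KNN, then fit a logistic regression on
# RAD-prob plus the three selected clinical predictors. Synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
n = 169
X_rad = rng.normal(size=(n, 306))              # radiomics features
X_clin = rng.normal(size=(n, 3))               # stand-ins for age, CA125, CA 19-9
y = (X_rad[:, 0] + X_clin[:, 0] + rng.normal(size=n) > 0).astype(int)

# Step 1: RAD-prob = out-of-fold malignancy probability from the KNN model
rad_prob = cross_val_predict(KNeighborsClassifier(), X_rad, y,
                             cv=5, method="predict_proba")[:, 1]

# Step 2: fuse RAD-prob with the clinical predictors in a logistic regression
X_fusion = np.column_stack([rad_prob, X_clin])
fusion = LogisticRegression(max_iter=1000).fit(X_fusion, y)
print(fusion.coef_.shape)                      # (1, 4): four predictors
```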
The fusion model consistently outperformed both the clinical model (AUC 0.892/0.882) and the radiomics model (AUC 0.854/0.739) across all evaluation metrics and in both the cross-validation and test cohort settings. This result aligns with a growing body of literature showing that multimodal models integrating clinical and imaging data outperform single-modality approaches for pancreatic tumor classification.
To facilitate practical clinical application, the researchers developed a nomogram based on the fusion model. A nomogram is a graphical calculation tool that allows clinicians to quickly estimate a patient's probability of having a malignant pancreatic tumor without needing a computer. Each variable (age, RAD-prob, CA125, CA 19-9) has its own axis on the nomogram. To use it, a clinician locates the value for each variable on its respective axis, draws a line up to the points axis to obtain a score, sums the scores across all variables, and then reads the corresponding malignancy probability from the total points axis.
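Since a nomogram is a graphical rendering of an underlying logistic regression, the same probability can be computed directly from the linear predictor. The coefficients below are illustrative placeholders, not the study's fitted values; only the functional form is what a nomogram encodes.

```python
# The nomogram's point-summing procedure is equivalent to evaluating the
# fitted logistic model. Intercept and coefficients here are MADE UP for
# illustration; the paper's actual values are not reproduced.
import math

def malignancy_probability(age, rad_prob, ca125, ca19_9,
                           intercept=-8.0, coefs=(0.05, 5.0, 0.01, 0.002)):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * x for b, x in
                        zip(coefs, (age, rad_prob, ca125, ca19_9)))
    return 1.0 / (1.0 + math.exp(-z))

p = malignancy_probability(age=65, rad_prob=0.9, ca125=40, ca19_9=300)
print(round(p, 3))  # 0.679 with these placeholder coefficients
```

Each axis of the printed nomogram is just a rescaled version of one `b_i * x_i` term, and the total-points axis is the inverse-logit lookup at the end.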
The nomogram's accuracy was validated through calibration curves, which compare predicted probabilities against observed outcomes. The calibration curves for both the training and test cohorts demonstrated good agreement between predictions and actual results, indicating that the model's probability estimates are well-calibrated and not systematically over- or under-predicting risk. Decision curve analysis further showed that the nomogram provides high clinical utility across a range of threshold probabilities, meaning it delivers net benefit over both a "treat all" and "treat none" strategy.
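A calibration check of the kind described can be sketched with scikit-learn's `calibration_curve`, which bins predicted probabilities and compares each bin's mean prediction against the observed event rate. The data below is synthetic and constructed to be well calibrated, so the two columns should track each other closely.

```python
# Sketch of a calibration curve: binned predicted probabilities versus
# observed outcome rates. Synthetic, well-calibrated-by-construction data.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(4)
probs = rng.uniform(0.0, 1.0, 500)                       # predicted probabilities
labels = (rng.uniform(0.0, 1.0, 500) < probs).astype(int)  # outcomes drawn at those rates

frac_pos, mean_pred = calibration_curve(labels, probs, n_bins=5)
for p_hat, f_obs in zip(mean_pred, frac_pos):
    print(f"predicted {p_hat:.2f} -> observed {f_obs:.2f}")
```

A well-calibrated model's points lie near the diagonal; systematic deviation above or below it is exactly the over- or under-prediction the authors checked for.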
This type of translational tool is particularly valuable in resource-limited settings where access to advanced computational infrastructure may be limited. A printed nomogram can be used at the bedside, making the fusion model's predictions accessible even without specialized software.
The study identified eight independent predictors of malignant pancreatic tumors from the clinical analysis: age, abdominal pain, tumor diameter, CA 19-9, CEA, CA125, LPS (lipopolysaccharide), and HbA1c. These findings align with existing literature. CA 19-9 is the only serum biomarker for pancreatic cancer recommended in the European Oncology Guidelines, yet its sensitivity and specificity hover around 80%, reflecting limited standalone diagnostic performance. CA125, traditionally used as a marker for ovarian cancer, has also shown positive correlation with malignant pancreatic tumors in multiple studies.
On the radiomics side, the top features in the KNN model spanned multiple categories. GLCM features such as correlation metrics captured pixel-to-pixel relationships across different directions. Wavelet transform energy features revealed high-frequency textures and edge details in the ultrasound images. First-order statistical features like area and kurtosis described the size of the image region and the peakedness of the gray-level distribution. Inverse difference moment features indicated the uniformity and local similarity of image texture. Together, these features reveal both the physical geometry and biological characteristics of the lesions.
The integration strategy of condensing all radiomics features into the single RAD-prob variable was inspired by prior work on ovarian cancer and gallbladder carcinoma nomograms. This approach simplifies the complexity of multi-feature analysis while preserving the predictive power of the radiomics information. It also makes the final model more interpretable, as clinicians need to consider only four variables rather than hundreds.
While the fusion model achieved strong performance metrics, several limitations should be considered. The study was conducted at a single center (The First Affiliated Hospital of Guangxi Medical University) with a relatively modest sample size of 242 patients. Single-center studies are susceptible to institutional biases in patient demographics, imaging protocols, and clinical practices. The authors acknowledge that validation with multicenter data is needed to assess the model's performance in real-world settings with greater patient diversity.
The retrospective design introduces additional caveats. All data were collected from existing medical records, which means the study cannot fully control for selection biases or ensure uniform data quality. A prospective study in which patients are enrolled and data collected according to a predefined protocol would provide stronger evidence of the model's clinical utility. Additionally, the EUS images were acquired using equipment from only two manufacturers (Olympus and Fuji), which raises questions about generalizability to images from other vendors or different ultrasound frequencies.
The authors note that no single radiological or laboratory test can reliably distinguish between malignant and nonmalignant pancreatic tumors, which is precisely why multimodal fusion approaches are needed. Traditional diagnostic methods that rely on doctors interpreting test reports are limited by individual experience, potentially leading to inconsistent diagnoses. AI-based models can objectively integrate diverse data sources to assist in this process.
Looking forward, the researchers suggest that methods such as ensemble learning and deep learning could further enhance the model's predictive performance. The current approach used classical machine learning algorithms (SVM, RF, KNN, XGBoost), and more advanced architectures could potentially extract richer representations from the ultrasound data. Multicenter validation studies and integration of additional imaging modalities (such as CT or MRI) represent natural extensions of this work.