A machine learning approach in a monocentric cohort for predicting primary refractory disease

Haematologica 2024

Plain-English Explanations
Pages 1-3
Why Predicting Refractory DLBCL Matters

Diffuse large B-cell lymphoma (DLBCL) is the most common subtype of non-Hodgkin lymphoma. While many patients respond to frontline immuno-chemotherapy such as R-CHOP, approximately 30 to 40% are either unresponsive to initial treatment or experience early relapse. These patients are classified as having "primary refractory disease," and their prognosis is notably poor. Managing refractory DLBCL is complicated by the disease's heterogeneity, its complex genetic underpinnings, and patient-specific factors like comorbidities and age.

Treatment landscape for refractory disease: Options include salvage chemotherapy followed by autologous hematopoietic stem cell transplantation (HSCT) for eligible patients, chimeric antigen receptor (CAR) T-cell therapy, or enrollment in clinical trials of targeted regimens. The choice depends heavily on patient eligibility and disease characteristics. An online Cox model-based tool already exists for estimating survival once disease has progressed, but no validated statistical model supports early decision-making by predicting refractoriness before it occurs.

Machine learning in oncology: Recent advances in supervised ML have demonstrated value in creating predictive, diagnostic, and prognostic tools across multiple cancer types, including breast, brain, lung, liver, and prostate cancers. In hematology specifically, ML has been applied to outcome prediction after allogeneic HSCT, clinical deterioration monitoring, and laboratory diagnostics. The authors cite prior work using EcoTyper (a framework integrating transcriptome deconvolution and single-cell RNA sequencing) to characterize DLBCL cell states, and studies using ML to identify therapeutic targets such as core genes in the PI3K-Akt signaling pathway.

This study set out to test five supervised ML algorithms on a monocentric dataset from the Grand Hôpital de Charleroi in Belgium, aiming to build a clinically useful model for predicting primary refractory DLBCL before treatment failure becomes apparent.

TL;DR: About 30 to 40% of DLBCL patients are refractory to first-line therapy with poor outcomes. This study tested five supervised ML models on a single-center Belgian cohort to predict refractory status early, addressing a gap where no validated predictive tool currently exists.
Pages 4-6
Study Design, Data Preparation, and Model Building

Cohort and setting: This was a retrospective single-center study at the Grand Hôpital de Charleroi, a large non-academic hospital with 1,154 beds in Belgium's Walloon Region. The cohort included consecutive adult patients (age 18 or older) with a first diagnosis of DLBCL treated between January 2017 and December 2022. Patient data were extracted from the center's oncological-hematological database (RegistreOncoHematoGHdC, FileMaker Pro v.17) and anonymized before analysis. The study was approved by the institution's Ethics Committee (G2-2023-E006).

Feature set: The extracted variables spanned demographics (age, gender, BMI), clinical factors (Ann Arbor stage, IPI score, comorbidities, smoking, alcohol use), disease biology (germinal center phenotype, MYC/BCL2/BCL6 mutations, double-hit and triple-hit rearrangements), viral infections (CMV, EBV, HIV, Helicobacter pylori), treatment details (first-line regimen, interim PET/CT after 2 cycles), and socioeconomic factors. Categorical variables were compared using chi-squared tests and continuous variables with Kruskal-Wallis tests, with significance set at p less than 0.05.

Five algorithms tested: The study evaluated Support Vector Machine (SVM using NuSVC), Random Forest Classifier (RFC), Logistic Regression (LR), Naive Bayes Categorical Classifier (NBC), and eXtreme Gradient Boosting (XGBoost). Each algorithm received specific data preprocessing: One-hot Encoding for Random Forest, StandardScaler for SVM and LR, and OrdinalEncoder for XGBoost, NBC, SVM, and LR. Missing data were handled with SimpleImputer (most frequent value) for categorical variables and KNNImputer (k-nearest neighbors) for continuous variables in the RF model. For other models, missing nervous system extension values were set to "no" and missing delay values were filled with medians (17.5 days for referral delay, 6 days for diagnosis-to-treatment delay).

Training and validation: The dataset was split into 85% training and 15% validation, with 10-fold cross-validation on the training portion. Grid search was applied for hyperparameter tuning of RFC, SVM, and XGBoost. Performance was evaluated using ROC-AUC, accuracy, false positive rate, sensitivity, and F1-score. The random state was fixed at 1 for reproducibility. All work was done in Python 3.9.13 using Scikit-learn v1.2.2, with code and data publicly available on GitHub.
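
The evaluation protocol reads as a standard scikit-learn loop; a minimal sketch, using synthetic data in place of the clinical features and a Random Forest as one of the five candidates:

```python
# 85/15 split, 10-fold cross-validation on the training portion, then
# held-out ROC-AUC on the 15% validation set, with random_state fixed at 1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=130, n_features=10, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.15, random_state=1, stratify=y)

clf = RandomForestClassifier(random_state=1)
cv_auc = cross_val_score(clf, X_tr, y_tr, cv=10, scoring="roc_auc")

clf.fit(X_tr, y_tr)
val_auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
```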

TL;DR: Retrospective single-center study of DLBCL patients (2017 to 2022) from a Belgian hospital. Five ML models (SVM, RFC, LR, NBC, XGBoost) were trained on 85% of data with 10-fold cross-validation and tested on a 15% holdout set. Features included demographics, staging, genetics, viral infections, and treatment variables.
Pages 7-9
Patient Demographics, Disease Profile, and Outcomes

The study analyzed 130 patients aged 25 to 95 years. The median age was 69.5 years, with 72.3% of patients older than 60. The cohort was 53% male and 47% female. Median follow-up was 19.5 months (range 1 to 77 months). By the end of the study, 78 patients (60%) were alive, and among those, 70 (89%) were in complete remission.

Disease staging and biology: A total of 75.3% of patients presented with advanced-stage disease (Ann Arbor stage III or IV), and 33% had high-risk IPI scores (4 to 5). According to the Hans Criteria, 33.85% were classified as germinal center B-cell (GCB) subtype and 66.15% as non-GCB. Genetic rearrangements were notable: 13% had double-hit (MYC plus BCL2 or BCL6) and 2.3% had triple-hit (MYC plus BCL2 plus BCL6). Over half the cohort (52.3%) had comorbidities, 29% were active or former smokers, and 10% reported excessive alcohol intake.

Survival data: Overall survival at 3 years was 58.5% (95% CI, 51 to 68.5) and progression-free survival at 3 years was 63% (95% CI, 54 to 71) by Kaplan-Meier analysis. At 5 years, OS dropped to 51% (95% CI, 38 to 61) and PFS was 57% (95% CI, 46 to 67.5). These results were consistent with the published literature, including the Rovira et al. large-cohort study of 468 patients reporting 5-year OS of 59% and 5-year PFS of 51%.

Treatment and refractory outcomes: The majority of patients received R-CHOP-based regimens (74.2% R-CHOP, 13% R-mini-CHOP). Of 124 patients who received first-line treatment, 42 (33.8%) developed primary refractory disease and 2 (1.6%) relapsed within 6 months. Among the 42 refractory patients, 8 died from DLBCL before salvage therapy could be attempted, 34 (77.3%) underwent salvage chemotherapy, and at study end only 14 (33%) were alive with 10 (23.8%) in remission.

TL;DR: Of 130 patients (median age 69.5), 75.3% had advanced-stage disease and 33.8% developed primary refractory disease. Three-year OS was 58.5% and 3-year PFS was 63%. Among the 42 refractory patients, only 33% were alive and 23.8% in remission at study end.
Pages 7, 10
Univariate Risk Factors and Feature Importance

Before building the ML models, the authors performed univariate analysis to identify clinical variables associated with primary refractory disease. Eight factors reached statistical significance: age (p = 0.009), Ann Arbor stage (p = 0.013), CMV infection (p = 0.012), presence of comorbidity (p = 0.019), IPI score (p less than 0.001), first line of treatment (p less than 0.001), EBV infection (p = 0.008), and socioeconomic status (p = 0.02).

Decision Tree Classifier feature importance: A complementary Decision Tree Classifier (DTC) was used to visualize feature importance by measuring each variable's contribution to reducing node impurity (Gini index). The top features were IPI score (feature-score = 54), BMI (feature-score = 22), patient age (feature-score = 9.3), time between diagnosis and treatment (feature-score = 9.2), and BMI category 35 to 40 (feature-score = 4.4). The dominance of the IPI score in the DTC analysis aligned with its strong significance in univariate testing.
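
Reading Gini-based importances off a fitted tree is a one-liner in scikit-learn; a minimal sketch with synthetic data and illustrative feature names (not the study's actual variables):

```python
# Decision Tree feature importance: each score measures how much a feature
# reduces Gini impurity across the splits that use it.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=130, n_features=5, random_state=1)
names = ["ipi_score", "bmi", "age", "dx_to_tx_delay", "stage"]  # illustrative

tree = DecisionTreeClassifier(criterion="gini", random_state=1).fit(X, y)
# feature_importances_ is normalized to sum to 1; rank features by it.
ranked = sorted(zip(names, tree.feature_importances_),
                key=lambda kv: kv[1], reverse=True)
```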

Consistency with literature: The identified risk factors align with previously described predictors of refractory DLBCL, including older age, elevated Ann Arbor stage, high IPI score, and comorbidity burden. Interestingly, viral infections (CMV and EBV) and socioeconomic status emerged as significant factors in this cohort, which is less commonly reported in the literature and may reflect this population's specific characteristics.

Only 38 patients (30%) had interim PET/CT scans after 2 cycles of treatment, as systematic use was not fully implemented at the center until 2021. Among this sub-cohort, 42% achieved complete response, 31.5% had partial metabolic response, and 26.3% were refractory.

TL;DR: Eight variables were significantly associated with refractory disease, with IPI score (p less than 0.001) and first-line treatment (p less than 0.001) showing the strongest associations. DTC feature importance confirmed IPI score as the top predictor (feature-score = 54), followed by BMI (22) and age (9.3).
Pages 9-11
Head-to-Head Comparison of Five ML Algorithms

The Naive Bayes Categorical (NBC) classifier emerged as the top-performing model with a ROC-AUC of 0.81 (95% CI, 0.64 to 0.96), accuracy of 83%, F1-score of 0.82, sensitivity of 0.71, and a false positive rate of just 10%. According to standard benchmarks, an AUC of 0.80 to 0.89 is classified as "good" discriminative ability, and an F1-score of 0.8 to 0.9 indicates good performance balance between precision and recall.

XGBoost and Random Forest: The XGBoost model followed with a ROC-AUC of 0.74 (95% CI, 0.52 to 0.93), accuracy of 78%, F1-score of 0.75, and sensitivity of 0.57. Random Forest achieved a ROC-AUC of 0.67 (95% CI, 0.46 to 0.88), accuracy of 72%, F1-score of 0.67, and sensitivity of 0.43. Both maintained a low false positive rate of 10%, but their sensitivity was notably weaker, meaning they missed a larger proportion of truly refractory patients.

SVM and Logistic Regression: SVM (NuSVC) performed poorly with a ROC-AUC of 0.65 (95% CI, 0.40 to 0.87), accuracy of 67%, and a high false positive rate of 28%. Logistic Regression was the worst performer with a ROC-AUC of 0.45 (95% CI, 0.29 to 0.64), accuracy of 50% (essentially random), F1-score of 0.43, sensitivity of only 0.29, and a false positive rate of 37%.

Tuned hyperparameters: The GridSearchCV-optimized settings for Random Forest were max_depth of 5, n_estimators of 50, max_samples of 0.75, min_samples_split of 10, and min_samples_leaf of 2. XGBoost used colsample_bytree of 0.8, learning_rate of 0.2, max_depth of 5, and subsample of 0.8.
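
A grid search over settings like the reported Random Forest values can be sketched as follows; the grid and synthetic data here are illustrative, not the authors' exact search space.

```python
# GridSearchCV around the reported RF hyperparameters (max_depth=5,
# n_estimators=50, max_samples=0.75, min_samples_split=10, min_samples_leaf=2).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=130, n_features=10, random_state=1)
grid = {
    "max_depth": [3, 5],
    "n_estimators": [50],
    "max_samples": [0.75],
    "min_samples_split": [10],
    "min_samples_leaf": [2],
}
search = GridSearchCV(RandomForestClassifier(random_state=1), grid,
                      cv=5, scoring="roc_auc").fit(X, y)
best = search.best_params_  # the winning combination under cross-validated AUC
```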

The authors note that the NBC classifier's advantage stems from its probabilistic approach based on Bayes' theorem with strong feature independence assumptions. This makes it computationally efficient and particularly well-suited to small datasets with categorical features, which matches the characteristics of this 120-patient clinical dataset.
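
The model class behind the winning NBC is scikit-learn's CategoricalNB; a minimal sketch on synthetic ordinal-encoded features showing the Bayes-posterior output:

```python
# CategoricalNB applies Bayes' theorem per feature, assuming conditional
# independence; predict_proba returns P(class | features) per patient.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(120, 6))   # 6 ordinal-encoded categorical features
y = rng.integers(0, 2, size=120)        # refractory yes/no (synthetic labels)

nbc = CategoricalNB().fit(X, y)
proba = nbc.predict_proba(X[:5])        # class probabilities, rows sum to 1
```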

TL;DR: NBC classifier was the best model: ROC-AUC 0.81, accuracy 83%, F1-score 0.82, false positive rate 10%. XGBoost came second (AUC 0.74, accuracy 78%). Logistic Regression was worst (AUC 0.45, accuracy 50%). NBC's probabilistic design and efficiency with small categorical datasets drove its superior performance.
Pages 10-11
Key Features Driving the Top Three Models

The authors examined which clinical variables mattered most to each of the three best-performing models. For XGBoost, the top five features by importance score were: IPI score (125), age category over 80 (73), time elapsed between diagnosis and treatment (69), BMI category 30 to 35 (67), and PET/CT scan performed after 2 cycles of treatment (64). For the Random Forest Classifier, the top features were: age (139), time between diagnosis and treatment (102), IPI score (87), Ann Arbor stage (86), and BMI (79).

NBC classifier feature contributions: For the Naive Bayes model, the authors used the 'feature_log_prob_' attribute to examine conditional log-probabilities for each feature given each class. The variables contributing most, in order of importance, were: triple-hit status, nervous system involvement, CMV infection, all age categories, all BMI categories, alcohol consumption, double-hit status, Ann Arbor stage, first line of treatment, referring department, germinal center phenotype, smoking, comorbidity, gender, IPI score, SARS-CoV-2 pneumonia, and overweight.

Cross-model consistency: Despite different algorithmic approaches, IPI score and patient age emerged as important predictors across all three models, which is consistent with the known clinical literature on DLBCL prognosis. The NBC classifier uniquely highlighted genetic markers (triple-hit, double-hit) and viral infections (CMV) among its top features, suggesting it captures risk patterns that tree-based models may underweight. BMI and the time interval between diagnosis and first treatment also appeared consistently, pointing to the influence of patient fitness and healthcare access speed on treatment outcomes.

TL;DR: IPI score and age were consistently top predictors across all three best models. The NBC classifier also weighted triple-hit status, nervous system involvement, and CMV infection heavily. XGBoost prioritized IPI score (125) and diagnosis-to-treatment delay (69), while Random Forest ranked age (139) and treatment delay (102) highest.
Pages 13, 15-16
Single-Center Design, Small Dataset, and Validation Gaps

Retrospective single-center design: The most significant limitation is that all data came from one hospital (Grand Hôpital de Charleroi). Single-center studies inherently carry selection bias and may not generalize to other populations, treatment protocols, or healthcare settings. The cohort of 130 patients (120 evaluable) is small by ML standards, which limits the complexity of models that can be reliably trained and increases the risk of overfitting.

No unseen test set: This study was primarily a feasibility study and did not include a held-out test set with data completely unseen by the algorithms during development. The 15% validation set, while separate from training, was part of the same dataset and distribution. True external validation on independent cohorts would be necessary to confirm generalizability. The authors acknowledge this and point to their ongoing prospective study (NCT06241729) which will include a proper unseen test set.

Limited feature engineering and data availability: The feature set was constrained by what was available in the hospital's electronic database. Adding biological parameters, tumor-specific characteristics, or genomic profiling data could enhance model performance. The lack of systematic interim PET/CT scanning before 2021 meant only 30% of patients had this data point, reducing its utility as a feature. Additionally, the study did not thoroughly compare ML model performance against established prognostic scoring systems like IPI alone or cell-of-origin classification.

Model interpretability: While the NBC classifier performed well on standard metrics, the authors did not apply formal explainability techniques such as LIME (Local Interpretable Model-agnostic Explanations) in this study. Without such tools, it is difficult for clinicians to understand why a specific prediction was made for an individual patient. The authors plan to incorporate LIME in their prospective follow-up study to address this gap.

TL;DR: Key limitations include the single-center retrospective design with only 130 patients, no external validation on unseen data, constrained feature availability, incomplete PET/CT data (only 30% of patients), and lack of formal model explainability techniques like LIME.
Pages 15-17
Prospective Validation and Clinical Deployment

Registered prospective study: The authors have already registered a prospective multicenter study (ClinicalTrials.gov identifier NCT06241729) that aims to validate the NBC classifier on previously unseen patient data. This study will include a proper held-out test set and apply LIME to ensure model decisions are transparent and interpretable by clinicians. The ultimate goal is deployment as a web application that clinicians can use at the point of care.

Feature set expansion: Future iterations could incorporate additional biological parameters, tumor microenvironment data, and emerging treatment variables. For example, the authors note that polatuzumab vedotin-R-CHP (where polatuzumab vedotin replaces vincristine in R-CHOP) could be added as a new treatment feature as it enters routine practice. Improving data collection through enhanced electronic medical records at the center would also expand the available feature space.

Ensemble techniques: The authors suggest that combining multiple models through ensemble methods such as stacking, boosting, or bagging could yield more robust predictions than any single algorithm. Given that the NBC classifier, XGBoost, and Random Forest each capture different patterns in the data (probabilistic vs. sequential error correction vs. aggregated decision trees), an ensemble approach could leverage complementary strengths.
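
The stacking idea can be sketched with scikit-learn's StackingClassifier; this is an illustration under stated substitutions, with GradientBoostingClassifier standing in for XGBoost and GaussianNB for the categorical NBC so the example needs only scikit-learn, and synthetic data in place of the cohort.

```python
# Stacking: base learners' out-of-fold predictions feed a logistic-regression
# meta-learner, which learns how to weight their complementary strengths.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=130, n_features=10, random_state=1)

stack = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),                              # stand-in for NBC
        ("rf", RandomForestClassifier(random_state=1)),
        ("gb", GradientBoostingClassifier(random_state=1)),  # stand-in for XGBoost
    ],
    final_estimator=LogisticRegression(),
    cv=5,
).fit(X, y)
preds = stack.predict(X)
```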

Clinical impact: The broader vision is to use ML-predicted refractory status to guide early treatment decisions. Given the growing number of effective salvage therapies and the availability of CAR-T cell therapy, intervening quickly in a patient's therapeutic pathway could maximize cure rates while minimizing toxicity. Rather than waiting for treatment failure to become clinically apparent, a validated predictive model could enable proactive switching to alternative regimens for patients identified as high-risk for refractoriness.

TL;DR: A registered prospective multicenter study (NCT06241729) will validate the model on unseen data with LIME explainability. Future work includes expanding features (e.g., polatuzumab vedotin-R-CHP), testing ensemble techniques, and deploying a clinical web application to guide early treatment decisions for high-risk refractory patients.
Citation: Detrait MY, Warnon S, Lagasse R, et al. Open Access, 2024. Available at PMC11444388. DOI: 10.1371/journal.pone.0311261. License: CC BY.