Prediction of immunochemotherapy response for diffuse large B-cell lymphoma using artificial intelligence digital pathology


Plain-English Explanations
Pages 1-2
Why Predicting Drug Response in DLBCL Matters

Diffuse large B-cell lymphoma (DLBCL) is the most common subtype of aggressive non-Hodgkin lymphoma. The standard frontline treatment is R-CHOP, a combination of rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone. While approximately 60% of patients are cured with R-CHOP, the remaining 40% are chemorefractory or eventually relapse and face a dismal prognosis. Identifying non-responders before or early in treatment could allow clinicians to switch to more aggressive or alternative regimens sooner.

Diagnostic complexity: DLBCL is highly heterogeneous at the clinical, pathological, and molecular levels. It can be sub-classified by cell of origin into germinal centre B-cell-like (GCB) and activated B-cell-like (ABC) subtypes using immunohistochemistry markers such as CD10, BCL6, and MUM1. Double- or triple-hit lymphomas, involving concurrent rearrangements of MYC with BCL2 and/or BCL6, occur in 4 to 8% of DLBCL cases and carry particularly poor outcomes after R-CHOP. Standard prognostic tools include the International Prognostic Index (IPI), PET/CT staging, and FISH for MYC and BCL2 rearrangements.

The AI opportunity: Traditional histopathological examination remains the bedrock of lymphoma diagnosis but suffers from inter-observer variability and time-intensive evaluation. Digital pathology, which converts glass slides into high-resolution digital images, opens the door to deep learning-based computational analysis. Previous AI studies in DLBCL focused on diagnostic classification and MYC translocation detection. However, no prior study had used digital pathology and AI to predict immunochemotherapy response or prognosis in DLBCL, making this study the first of its kind.

TL;DR: DLBCL is the most common aggressive non-Hodgkin lymphoma. R-CHOP cures about 60% of patients, but 40% relapse. This study is the first to use AI-driven digital pathology to predict which DLBCL patients will respond to immunochemotherapy.
Pages 3-4
Study Design, Cohort, and Slide Processing

The study retrospectively enrolled 729 patients newly diagnosed with DLBCL and treated with R-CHOP between 2005 and 2020 at Chonnam National University Hwasun Hospital in South Korea. Tissue slides were reviewed by two pathologists (MGN and YDC). Because staining on many of the older slides had faded, the team generated recut H&E-stained slides from the paraffin blocks. Slides with insufficient tumour cells or poor stain quality were excluded. Ultimately, 338 patients had usable whole slide images (WSIs), of which 216 patients (251 WSIs) had complete clinical information and final response evaluations.

Data split: The 251 WSIs were divided using consecutive split validation, with roughly 80% (200 WSIs) allocated to the training and validation sets and the remaining 51 WSIs forming the test set. All slides were scanned at 40x magnification (0.25 micrometres per pixel) using a Leica Aperio GT450 scanner. Clinical data collected included age, performance status, LDH levels, extranodal involvement, Ann Arbor stage, spleen and bone marrow involvement on baseline FDG-PET/CT, IPI score, revised IPI, bulky disease status, Bcl-2 expression, and treatment details.
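The consecutive (non-shuffled) 80/20 split can be sketched as below. This is a minimal illustration assuming the split operates over the 251 WSIs in enrollment order; the slide identifiers and function name are hypothetical, and the paper does not publish its splitting code.

```python
# Sketch of a consecutive (non-shuffled) 80/20 split over the 251 WSIs.
# Slide IDs are placeholders; only the counts (200 / 51) come from the paper.
def consecutive_split(items, train_frac=0.8):
    """Split a sequence in its existing order: first part train/val, rest test."""
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

slides = [f"WSI_{i:03d}" for i in range(251)]
train_val, test = consecutive_split(slides)
assert len(train_val) == 200 and len(test) == 51
```

Because the split preserves enrollment order rather than shuffling, the test set approximates a later, unseen batch of cases.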

Response assessment: Treatment response was evaluated with 18F-FDG PET/CT according to the Lugano response criteria for non-Hodgkin lymphoma. Interim PET/CT scans were obtained after 3 to 4 cycles of R-CHOP, and end-of-treatment PET/CT was performed more than a month after completion of immunochemotherapy. Responses were scored on the Deauville five-point scale (DS 1 to 5), with DS 1 to 3 classified as complete response (CR). Non-responders were defined as patients who did not achieve CR at the final assessment. Of the 216 patients, 186 (86.1%) achieved CR, 9 (4.2%) had partial response, 2 (0.9%) had stable disease, and 19 (8.8%) had progressive disease.
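The Deauville-to-response mapping described above is simple enough to state in code; a minimal sketch (the function name is my own, not from the paper):

```python
# Lugano/Deauville mapping used in the study: Deauville scores 1-3 count as
# complete response (CR); non-CR at the final assessment defines a non-responder.
def is_complete_response(deauville_score: int) -> bool:
    if not 1 <= deauville_score <= 5:
        raise ValueError("Deauville score must be between 1 and 5")
    return deauville_score <= 3
```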

TL;DR: From an initial pool of 729 DLBCL patients, 216 with 251 H&E whole slide images and complete clinical data were included. Slides were scanned at 40x magnification. Response was assessed via PET/CT using Lugano criteria. The CR rate was 86.1%, with 30 non-responders.
Pages 4-5
Self-Supervised Learning, Multiple Instance Learning, and Knowledge Distillation

Feature extraction with DINO: The authors used a self-supervised learning method called DINO (self-distillation with no labels) to extract features from histopathology patches. Non-overlapping 448 x 448 pixel patches were extracted from WSIs at 40x magnification, then downscaled to 224 x 224 pixels using Lanczos filtering. Artifacts, non-tissue background, and noise were filtered by pixel brightness and a depth-first search algorithm that excluded contiguous regions spanning 25 or fewer tiles. The backbone model was a Vision Transformer (ViT-S/8), producing 384-dimensional feature vectors. Because publicly available datasets like TCGA contain only about 40 DLBCL slides, the team retrained the DINO model specifically on their DLBCL data, starting from pre-trained weights.
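The depth-first-search tile filter can be sketched as follows. The 25-tile threshold comes from the paper; the boolean tile-grid representation and function name are my own simplification (in practice the tissue flags would come from the brightness threshold applied per tile).

```python
# Sketch of the tile-filtering step: tiles already flagged as tissue (e.g. by
# a brightness threshold) are grouped into connected regions via depth-first
# search, and regions spanning 25 or fewer tiles are discarded as noise.
def filter_small_regions(tissue, min_tiles=26):
    rows, cols = len(tissue), len(tissue[0])
    keep = [[False] * cols for _ in range(rows)]
    seen = set()
    for r in range(rows):
        for c in range(cols):
            if tissue[r][c] and (r, c) not in seen:
                # Iterative DFS collecting one 4-connected component.
                stack, component = [(r, c)], []
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    component.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and tissue[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
                if len(component) >= min_tiles:  # keep regions of > 25 tiles
                    for y, x in component:
                        keep[y][x] = True
    return keep
```

On a toy grid, a 36-tile tissue block survives while an isolated single tile is dropped.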

Multiple instance learning (MIL): The study employed a dual-stream MIL network with attention-based pooling to aggregate information across all patches in a WSI. The attention mechanism assigns importance scores to each patch, which were normalized using min-max scaling and visualized as heatmaps. This approach is well-suited for gigapixel histopathology images because it operates at the "bag" (whole-slide) level rather than requiring patch-level annotations.
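Attention pooling over patch features can be sketched in a few lines of numpy. This follows the common tanh-attention MIL formulation (Ilse et al.) rather than the paper's exact dual-stream architecture, and all shapes and weights here are illustrative.

```python
import numpy as np

# Minimal sketch of attention-based MIL pooling: each 384-d patch feature gets
# an attention score, scores are softmaxed into weights, and the slide-level
# embedding is the weighted sum of patch features.
rng = np.random.default_rng(0)
n_patches, feat_dim, attn_dim = 500, 384, 128
H = rng.normal(size=(n_patches, feat_dim))        # patch features (e.g. DINO)
V = rng.normal(size=(feat_dim, attn_dim)) * 0.01  # attention projection
w = rng.normal(size=(attn_dim,)) * 0.01           # attention vector

scores = np.tanh(H @ V) @ w                       # one raw score per patch
a = np.exp(scores - scores.max())
a /= a.sum()                                      # softmax attention weights
slide_embedding = a @ H                           # (384,) bag-level feature

# Min-max scale attention weights for heatmap visualization, as in the paper.
heat = (a - a.min()) / (a.max() - a.min())
```

The bag operates at the whole-slide level, so no patch-level labels are needed; the `heat` values are what get painted back onto the WSI as a heatmap.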

Multi-modal integration and knowledge distillation: To combat overfitting on only 251 WSIs, the team first built a multi-modal model that combined 384 histopathology features from the MIL model with 54 clinical features extracted using unsupervised learning with TabNet. These combined features were passed through a final linear layer for prediction. Then, a pathology-only model was trained using knowledge distillation from the multi-modal model. The pathology model's 384 MIL features were guided by the multi-modal model's pathology feature representation using cosine similarity as a loss function. Models were trained for 300 epochs with an initial learning rate of 0.0001, halved if validation loss plateaued for 10 consecutive epochs.
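The two training mechanics named above, the cosine-similarity distillation loss and the plateau-based learning-rate halving, can be sketched as below. This is a pure-numpy illustration with class and function names of my own; the actual training used a deep learning framework.

```python
import numpy as np

# Distillation loss: the pathology-only student's 384-d MIL features are pulled
# toward the multi-modal teacher's pathology representation; the loss is zero
# when the two representations point in the same direction.
def cosine_distill_loss(student_feat, teacher_feat, eps=1e-8):
    s = student_feat / (np.linalg.norm(student_feat) + eps)
    t = teacher_feat / (np.linalg.norm(teacher_feat) + eps)
    return 1.0 - float(s @ t)

# LR schedule from the paper: start at 1e-4, halve the learning rate after
# 10 consecutive epochs without validation-loss improvement.
class HalveOnPlateau:
    def __init__(self, lr=1e-4, patience=10):
        self.lr, self.patience = lr, patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= 2
                self.bad_epochs = 0
        return self.lr
```

Identical student and teacher features give a loss near 0; opposed features give a loss near 2.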

TL;DR: The pipeline uses DINO (ViT-S/8) for self-supervised feature extraction from patches, dual-stream MIL with attention for slide-level prediction, TabNet for clinical features, and knowledge distillation to train a pathology-only model from the multi-modal teacher. Training ran 300 epochs, halving the learning rate when validation loss plateaued.
Pages 5-6
Cohort Demographics and Clinical Features

The 216-patient cohort had a median age of 66 years (range 20 to 87), and 95 patients (44.0%) were male. A total of 115 patients (53.2%) were in Ann Arbor stage III or IV, and 38 (17.6%) were classified as high-risk IPI. Treatment consisted of 3 to 4 cycles of R-CHOP for 26 patients with limited-stage disease (with or without involved-field radiotherapy), 6 cycles for 141 patients, and 8 cycles for 45 patients. Consolidation radiotherapy was administered to 12 patients.

Responders vs. non-responders: Significant differences emerged between the 186 responders and 30 non-responders. Non-responders had a higher median age (71 vs. 65 years, p = 0.004), higher rates of elevated LDH (86.7% vs. 56.5%, p = 0.002), more extranodal involvement at two or more sites (43.3% vs. 18.8%, p = 0.005), and higher IPI risk classification (p = 0.001). Non-responders were also more likely to be in advanced Ann Arbor stage, with 43.3% in stage IV compared to 24.7% among responders, though the overall stage distribution showed a borderline p value of 0.074.

These findings confirm that non-responders present with more aggressive disease characteristics at baseline. The clinical heterogeneity between the two groups underscores the need for a predictive tool that can integrate both pathological and clinical information, rather than relying on any single prognostic factor.

TL;DR: Non-responders (n=30) were older (median 71 vs. 65 years), had higher LDH (86.7% vs. 56.5%), more extranodal sites (43.3% vs. 18.8%), and worse IPI risk (p = 0.001) compared to responders (n=186).
Pages 6-7
Drug Response Prediction Accuracy and Survival Analysis

Pathology-only model: The model trained solely on histopathology images achieved an AUROC of 0.744 (95% CI: 0.605 to 0.883). At its optimal threshold (Youden's index), the model reached a sensitivity of 63.4%, specificity of 90.0%, positive predictive value (PPV) of 96.3%, and negative predictive value (NPV) of 37.5%. The area under the precision-recall curve (AUPRC) was 0.935, reflecting strong performance even with class imbalance.
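How an AUROC and a Youden-index operating point are derived from model scores can be sketched as below. The labels and scores are synthetic illustrations, not the study's data, and the function name is my own.

```python
import numpy as np

# Sweep every score as a threshold, trace the ROC curve, integrate it with the
# trapezoidal rule, and pick the threshold maximizing Youden's J = sens + spec - 1.
def roc_and_youden(y_true, scores):
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, float)
    thresholds = np.unique(scores)[::-1]          # descending score cutoffs
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    tpr, fpr = [], []
    for t in thresholds:
        pred = scores >= t
        tpr.append((pred & (y_true == 1)).sum() / n_pos)
        fpr.append((pred & (y_true == 0)).sum() / n_neg)
    tpr, fpr = np.array(tpr), np.array(fpr)
    # Thresholds descend, so (fpr, tpr) is already ordered; anchor at (0,0)/(1,1).
    fx = np.concatenate(([0.0], fpr, [1.0]))
    fy = np.concatenate(([0.0], tpr, [1.0]))
    auroc = float(np.sum((fx[1:] - fx[:-1]) * (fy[1:] + fy[:-1]) / 2))
    best = int(np.argmax(tpr - fpr))              # Youden's J
    return auroc, float(thresholds[best])

y = [1, 1, 1, 0, 0, 1, 0, 1]
s = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.2, 0.1]
auc, thr = roc_and_youden(y, s)
```

In practice one would use `sklearn.metrics.roc_curve`, but the manual version makes the Youden criterion explicit.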

Multi-modal model: When clinical data was integrated with histopathology features, the multi-modal model achieved an AUROC of 0.856 (95% CI: 0.733 to 0.980). This model's sensitivity was 90.2%, specificity 70.0%, PPV 92.5%, and NPV 63.6%. The AUPRC reached 0.961. The substantial improvement from 0.744 to 0.856 AUROC demonstrates the complementary value of clinical and pathological data.

Survival analysis: Kaplan-Meier analysis showed that the pathology-only model significantly stratified patients by relapse-free survival (RFS), with a log-rank test p value of 0.041. The multi-modal model further improved RFS stratification with a p value of 0.026. These results indicate that both models capture meaningful prognostic information, with the multi-modal approach providing the strongest discrimination between patients likely to remain in remission and those at risk of relapse.
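The Kaplan-Meier estimate underlying this stratification can be sketched compactly. Times and events below are synthetic, and the log-rank p values reported above would normally come from a survival package such as lifelines rather than this hand-rolled estimator.

```python
import numpy as np

# Minimal Kaplan-Meier estimator: at each observed event time, multiply the
# running survival probability by (1 - events / patients still at risk).
def kaplan_meier(times, events):
    """Return (event_times, survival_probabilities)."""
    times = np.asarray(times, float)
    events = np.asarray(events, int)
    out_t, out_s, surv = [], [], 1.0
    for t in np.unique(times[events == 1]):       # ascending event times
        at_risk = int(np.sum(times >= t))
        d = int(np.sum((times == t) & (events == 1)))
        surv *= 1.0 - d / at_risk
        out_t.append(t)
        out_s.append(surv)
    return np.array(out_t), np.array(out_s)

t = [5, 8, 12, 12, 20, 24, 30]   # months to relapse or censoring (synthetic)
e = [1, 0, 1, 1, 0, 1, 0]        # 1 = relapse observed, 0 = censored
event_times, survival = kaplan_meier(t, e)
```

Running the estimator once per model-defined risk group and comparing the two curves with a log-rank test reproduces the kind of analysis reported above.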

TL;DR: The pathology-only model achieved AUROC 0.744 and the multi-modal model reached AUROC 0.856 for predicting drug response. Both models significantly stratified relapse-free survival (p = 0.041 and p = 0.026, respectively). AUPRC values were 0.935 and 0.961.
Pages 7-8
TCGA Validation and Associations with Prognostic Factors

External validation: The pathology-only model was externally validated on TCGA data. Of the 48 TCGA patients with DLBCL, 36 had received R-CHOP; 40 patients with follow-up data, vital status records, age, sex, and clinical stage were included in the analysis. Patients were stratified into two groups at the median predicted value from the pathology model. Kaplan-Meier analysis demonstrated a statistically significant survival difference (p = 0.037). Among the seven recorded deaths, only one belonged to the group predicted to respond well to treatment.
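The median-split stratification is a one-liner worth making explicit; the scores below are synthetic placeholders, not the model's TCGA predictions.

```python
import numpy as np

# Divide patients into two groups at the median predicted response score;
# True marks the group predicted to respond well.
def median_split(scores):
    scores = np.asarray(scores, float)
    return scores >= np.median(scores)

preds = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.6])
groups = median_split(preds)
```

A median split guarantees two roughly equal-sized groups regardless of how the scores are calibrated, which matters in a cohort as small as 40 patients.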

Cox proportional hazards analysis: A multivariable Cox regression incorporating age, sex, clinical stage, and the pathology-based prediction score was performed on the TCGA cohort. Although the pathology prediction did not reach statistical significance in this small external set, it displayed the lowest p value among all clinical factors and carried a positive coefficient, suggesting independent prognostic value beyond standard clinical variables.

Clinical variable correlations: Spearman correlation analysis of the histopathology-only model's predictions against clinical factors revealed a significant negative correlation with IPI risk (rho = -0.289, p = 0.040) and a near-significant negative correlation with Ann Arbor stage (rho = -0.264, p = 0.061). The association with bulky disease was borderline (p = 0.055), while serum LDH levels did not show a significant correlation (p = 0.37). The authors note that rituximab has reduced the influence of some classical prognostic factors like LDH and bulky disease, which may explain these weaker correlations.
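Spearman's rho, as used for these correlations, is Pearson correlation applied to average ranks; a minimal numpy sketch (p values would normally come from `scipy.stats.spearmanr`):

```python
import numpy as np

# Spearman rank correlation: rank both variables (averaging ranks over ties),
# then compute the Pearson correlation of the ranks.
def spearman_rho(x, y):
    def avg_rank(v):
        v = np.asarray(v, float)
        order = np.argsort(v)
        ranks = np.empty(len(v))
        ranks[order] = np.arange(1, len(v) + 1)
        for val in np.unique(v):          # average ranks over tied values
            mask = v == val
            ranks[mask] = ranks[mask].mean()
        return ranks

    rx, ry = avg_rank(x), avg_rank(y)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Higher IPI risk paired with lower predicted response yields a negative rho,
# matching the sign of the reported correlation (values here are synthetic).
rho = spearman_rho([1, 2, 3, 4], [0.9, 0.7, 0.6, 0.2])
```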

TL;DR: External validation on TCGA data (40 patients) showed significant survival stratification (p = 0.037). The model correlated significantly with IPI risk (rho = -0.289, p = 0.040) and borderline with Ann Arbor stage (p = 0.061) and bulky disease (p = 0.055).
Pages 8-9
Morphological Features Linked to Treatment Response

A major strength of attention-based MIL is its interpretability. The model's attention mechanism identifies which patches contribute most to the prediction, enabling pathologists to examine the histological features driving the AI's decisions. The team extracted the 4,020 most predictive patches (3,040 for responders and 980 for non-responders) from 216 WSIs and had expert pathologists review them for morphological characteristics.

Responder-associated features: Patches with the highest attention scores for predicting drug response (complete response) showed centroblastic and immunoblastic features. According to the previous WHO (2008) classification, the centroblastic subtype is the most common morphological variant of DLBCL and is known to carry a better prognosis and higher overall survival. The AI model's findings align with this established morphological classification, reinforcing the biological plausibility of the predictions.

Non-responder-associated features: In contrast, patches most predictive of non-response showed anaplastic features and clear cytoplasm. These morphological patterns were visible in the heatmap distributions across entire WSIs. In responder WSIs, anaplastic and clear cytoplasmic features appeared as high-signal regions in the heatmap, while in non-responder WSIs, immunoblastic and centroblastic features appeared as high-signal regions. This inversion pattern suggests the model captures a spectrum of morphological risk rather than binary categories.

The consistency between AI-identified features and prior WHO morphological classifications demonstrates that the deep learning model is detecting biologically meaningful patterns rather than artifacts. Importantly, AI-based classification overcomes the inter- and intra-observer variability that has historically plagued morphological sub-typing of DLBCL by human pathologists.

TL;DR: The AI identified centroblastic and immunoblastic features as predictive of drug response, and anaplastic features with clear cytoplasm as predictive of non-response. These findings align with the WHO (2008) morphological classification, confirming biological plausibility. A total of 4,020 top-attention patches were reviewed by expert pathologists.
Pages 9-11
Study Constraints and Opportunities for Improvement

Sample attrition: From the initial 729 DLBCL patients, only 216 (with 251 WSIs) were ultimately used. Many slides were excluded due to faded staining, insufficient tumour cells, or missing clinical data. This significant attrition raises questions about selection bias and limits the statistical power for detecting smaller effect sizes, particularly in the non-responder group (n = 30).

Single-center, retrospective design: All data came from one institution (Chonnam National University Hwasun Hospital), which limits generalizability. While external validation on TCGA data was performed, the TCGA DLBCL cohort included only 40 patients with available follow-up, providing a limited external test. Multi-center prospective validation with larger cohorts is essential before clinical adoption.

No tumour segmentation: The DINO model was trained on all WSI patches without distinguishing neoplastic from non-neoplastic tissue. DLBCL arises in various organs (not just lymph nodes), and background non-neoplastic tissues vary considerably across anatomical sites, potentially introducing noise into the model. Despite this, the model generalized to the TCGA dataset, but incorporating a tumour segmentation step could improve performance. The authors note that manual tumour region annotation requires expert haematopathologists and is extremely labour-intensive, which is why it was omitted.

Limited data modalities: The study used only H&E-stained slides and clinical data. In practice, DLBCL diagnosis requires additional molecular pathological tests, including immunohistochemical staining (for markers like CD10, BCL6, MUM1, BCL2, MYC). The authors anticipate that multi-modal learning incorporating immunohistochemically stained slides, other test results, and molecular genetic data (such as gene expression profiling and targeted deep sequencing) could substantially improve model performance. Integrating genomic subtypes like MCD, N1, A53, BN2, ST2, and EZB could further refine treatment response prediction.

Future potential: Despite these limitations, the study establishes a proof of concept that digital pathology combined with deep learning can predict immunochemotherapy response in DLBCL, a task that human pathologists cannot reliably perform from morphology alone. The knowledge distillation framework is particularly promising because it enables a pathology-only model to benefit from clinical data during training without requiring clinical data at inference time, simplifying potential clinical deployment.

TL;DR: Key limitations include single-center design, sample attrition from 729 to 216 patients, no tumour segmentation, small external validation set (40 TCGA patients), and use of H&E slides only. Future work should incorporate multi-center data, tumour segmentation, immunohistochemistry, and genomic subtypes.
Citation: Lee JH, Song GY, Lee J, et al. 2024. Open access; PMC10999948; DOI: 10.1002/2056-4538.12370. License: CC BY.