Diagnosing lymphoma under the microscope is notoriously difficult. In France, the nationwide Lymphopath network revealed a 20% discrepancy rate between referral pathologists and expert reviewers, directly impacting patient care. Studies in the US and UK have reported similar disagreement rates ranging from 14.8% to 27.3%. These numbers highlight a real clinical problem: the accuracy of a lymphoma diagnosis often depends on who is reading the slide.
Follicular lymphoma vs. follicular hyperplasia: The specific challenge addressed in this paper is distinguishing follicular lymphoma (FL), the second most common lymphoma subtype, from follicular hyperplasia (FH), a benign reactive condition. Both can look strikingly similar on haematoxylin and eosin (H&E)-stained tissue sections. The definitive answer typically requires immunohistochemistry to detect Bcl2 and CD10 expression, but roughly 10% of FL cases are Bcl2-negative, making even immunostaining insufficient.
The deep learning proposition: The authors developed a deep learning framework built on Bayesian neural networks (BNN) that not only classifies whole-slide images (WSI) of lymph nodes as FL or FH, but also provides a certainty estimate for each prediction. This uncertainty quantification is a critical addition. Previous deep learning approaches for histopathology achieved high accuracy but offered no way to flag unreliable predictions, limiting their clinical utility.
The study was conducted at the University Cancer Institute of Toulouse-Oncopole and the University Hospital of Dijon, drawing on 378 lymph node WSI (197 FL cases and 181 FH cases). An additional 65 slides of other small B-cell lymphomas were included to test how the system handles unfamiliar data outside its training scope.
Image acquisition: All H&E-stained slides were digitised using a Pannoramic 250 Flash II scanner (3DHISTECH) equipped with a Zeiss Plan-Apochromat 20x objective, producing images at 0.24 microns per pixel. The 378 WSI were randomly split into training (50%), validation (25%), and testing (25%) sets. FL cases came from two centres (Toulouse and Dijon), while FH cases were sourced exclusively from Toulouse.
Patch-based classification: Because forwarding an entire WSI through a CNN at full resolution is computationally prohibitive, the authors adopted a patch-based framework. Each slide was divided into non-overlapping 299 x 299 pixel patches, discarding any patch with less than 50% tissue coverage. A total of 320,000 patches were extracted: 160,000 for training (20,000 per resolution level), 80,000 for validation, and 80,000 for testing. Each patch inherited its slide-level label (FL or FH) without manual region annotation by a pathologist.
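The patch-extraction step can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names, the parameterised patch size, and the `tissue_mask` input (a binary map of tissue vs. background pixels) are assumptions made for the example.

```python
# Sketch of non-overlapping patch extraction with a tissue-coverage filter.
# All names are illustrative; only the 299-px patch size and the 50% coverage
# rule come from the paper.

PATCH = 299          # patch side length in pixels, as in the paper
MIN_COVERAGE = 0.5   # discard patches with less than 50% tissue

def patch_grid(width, height, patch=PATCH):
    """Top-left corners of a non-overlapping patch grid over one WSI level."""
    return [(x, y)
            for y in range(0, height - patch + 1, patch)
            for x in range(0, width - patch + 1, patch)]

def keep_patch(tissue_mask, x, y, patch=PATCH, min_cov=MIN_COVERAGE):
    """tissue_mask: 2D list of 0/1 flags (1 = tissue pixel).
    Returns True if the patch at (x, y) has enough tissue coverage."""
    rows = tissue_mask[y:y + patch]
    covered = sum(sum(row[x:x + patch]) for row in rows)
    return covered / (patch * patch) >= min_cov
```

In a real pipeline the grid would be computed per pyramid level and the mask derived from a background-segmentation step; here both are left abstract.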
Multi-resolution approach: Patches were extracted at eight resolution levels ranging from 0.49 to 125.44 microns per pixel, corresponding to different pyramid levels of the digitised image. The rationale is that malignant tissues exhibit both cellular-level atypia (visible at high resolution) and structural abnormalities (visible at lower resolution), so different magnifications each contribute important diagnostic information. Separate CNN models were trained at each resolution level, and their performance was compared.
Slide-level diagnosis: After individual patch predictions, the final slide-level diagnosis was computed by averaging the patch predictions across the entire slide. This averaging treats each patch as an independent measurement statistically centred on the correct diagnosis, consistent with Bayesian inference principles. The more patches available, the more robust the averaged prediction becomes.
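The aggregation rule is simple enough to state directly. A minimal sketch (function name and the 0.5 decision threshold are illustrative, not taken from the paper):

```python
# Slide-level diagnosis as the mean of per-patch P(FL) scores.
def slide_prediction(patch_probs, threshold=0.5):
    """patch_probs: list of per-patch P(FL) values for one slide.
    Returns (mean P(FL), slide-level label)."""
    mean_prob = sum(patch_probs) / len(patch_probs)
    return mean_prob, ("FL" if mean_prob >= threshold else "FH")
```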
The deep CNN achieved an overall patch-level accuracy of 91%, correctly classifying 72,895 out of 80,000 test patches. However, the real clinical metric is slide-level performance, since the goal is to diagnose the patient, not individual tiles. At the slide level, the model achieved AUC values between 0.92 and 0.99 depending on the resolution level used.
Resolution matters: Counterintuitively, the best slide-level AUC (0.99) was obtained at a coarse resolution (pyramid level 4, corresponding to 7.84 microns/pixel), even though patch-level validation accuracy was lower at this setting. The explanation is straightforward: at lower resolution, each patch covers a larger tissue area, so the non-overlapping patches sample broader, less redundant regions of the slide. Averaging more independent measurements produces a more robust prediction. At the highest resolution levels, each patch captured only a minute fragment of tissue, so neighbouring patches were highly correlated and the average less stable.
Visualisation of predictions: When patch-level classifications were overlaid on WSI, FH slides showed uniformly low FL probability across all patches, while FL slides showed FL probability close to 1.0 everywhere. This spatial consistency suggests the algorithm captures a pervasive morphological difference rather than relying on isolated focal features, and it implies the system could potentially work on smaller samples such as needle biopsies.
With an optimised decision threshold, accepting a 20% false-alarm rate (benign FH slides flagged as suspicious) allowed 100% FL detection. This is a clinically meaningful trade-off for a screening tool, where missing a lymphoma case carries far greater consequences than triggering additional confirmatory testing on a benign sample.
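Picking such an operating point amounts to scanning candidate thresholds and keeping the one that maximises FL sensitivity subject to a false-alarm ceiling. A minimal sketch, using hypothetical slide scores rather than the paper's data:

```python
# Choose the operating point that maximises FL sensitivity while the
# false-alarm rate on FH slides stays under a ceiling. Scores and the
# function name are illustrative, not the authors' code.
def sensitivity_at_fpr(fl_scores, fh_scores, max_fpr):
    """fl_scores / fh_scores: slide-level P(FL) for true FL / FH slides."""
    best = 0.0
    for t in sorted(set(fl_scores + fh_scores)):
        fpr = sum(s >= t for s in fh_scores) / len(fh_scores)
        if fpr <= max_fpr:
            sens = sum(s >= t for s in fl_scores) / len(fl_scores)
            best = max(best, sens)
    return best
```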
Bayesian neural networks: The key innovation of this study is the use of BNN to attach a certainty score to each prediction. The authors implemented the approach proposed by Gal and Ghahramani, which keeps dropout (the random deactivation of network units) active at inference time as an approximation to Bayesian inference. For each input image, multiple forward passes with different dropout configurations were performed. The average output served as the final prediction, and the variance across passes served as the uncertainty measure.
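The Monte-Carlo dropout procedure can be illustrated with a toy model. The sketch below keeps dropout active at inference, runs several stochastic passes, and reports the mean as the prediction and the variance as the uncertainty; the tiny logistic "network" stands in for the authors' CNN and is purely illustrative.

```python
import math
import random
import statistics

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mc_dropout_predict(features, weights, bias,
                       p_drop=0.5, n_passes=50, rng=None):
    """Monte-Carlo dropout in the spirit of Gal & Ghahramani: dropout stays
    on at inference, the mean over stochastic passes is the prediction, and
    the variance across passes is the uncertainty score."""
    rng = rng or random.Random(0)
    outputs = []
    for _ in range(n_passes):
        # Drop each input with probability p_drop; rescale survivors
        # by 1/(1 - p_drop) to keep the expected activation unchanged.
        masked = [f * (0.0 if rng.random() < p_drop else 1.0 / (1.0 - p_drop))
                  for f in features]
        z = sum(w * m for w, m in zip(weights, masked)) + bias
        outputs.append(sigmoid(z))
    return statistics.mean(outputs), statistics.variance(outputs)
```

In the study the same idea is applied to a deep CNN: inputs with high pass-to-pass variance are exactly the ones flagged as uncertain.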
Uncertainty correlates with errors: A critical finding was that erroneous predictions consistently showed higher uncertainty values than correct ones, regardless of whether the misclassified case was FL or FH. This means the uncertainty score is genuinely informative, not just noise. The system was more confident when predicting FL (variance = 0.02) than FH (variance = 0.04), suggesting that the morphological signature of malignancy may be more distinctive than that of benign hyperplasia.
Performance gains from uncertainty filtering: By removing the 10% most uncertain slides from the test set, the system achieved perfect FL detection with only about 2% false alarms, and AUC increased across all resolution levels. Using class-specific variance thresholds (different cutoffs for FL and FH predictions) was more efficient than a single global threshold: perfect accuracy was reached after removing 23% of cases with class-specific thresholding versus 36% with global thresholding.
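Class-specific thresholding is a small change to the referral logic: each prediction is compared against the variance cutoff of its own predicted class. A minimal sketch (the default cutoffs are illustrative, not the paper's tuned values):

```python
# Uncertainty-based filtering with class-specific variance thresholds.
# Default cutoffs are illustrative placeholders, not the paper's values.
def filter_by_uncertainty(cases, thr_fl=0.02, thr_fh=0.04):
    """cases: list of (predicted_label, dropout_variance) pairs.
    Returns (kept, referred): confident cases vs. cases for expert review."""
    kept, referred = [], []
    for label, var in cases:
        thr = thr_fl if label == "FL" else thr_fh
        (kept if var <= thr else referred).append((label, var))
    return kept, referred
```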
This uncertainty-driven referral mechanism has a natural clinical analogue. Cases the system flags as uncertain could be routed to expert pathologists for manual review, while high-certainty cases could be processed with minimal oversight. This tiered workflow could significantly reduce the burden on expert centres like the Lymphopath network.
The biased dataset experiment: To test how the system handles data from unfamiliar sources, the authors deliberately built a biased dataset. Training and validation used only slides from Toulouse, while the test set mixed 24 internal Toulouse cases with 24 external cases from Dijon. The models achieved perfect validation AUC (1.0) on the internal data but dropped to AUC 0.63-0.69 on the mixed test set. This dramatic decline confirmed that CNNs are extremely sensitive to differences in tissue processing, staining protocols, and scanning equipment between centres.
Uncertainty as a centre-bias detector: Although the biased model performed poorly on external data, the uncertainty distributions for internal versus external cases were significantly different. Setting a variance threshold of 0.03 removed no more than 10% of internal predictions but triggered rejection of over 50% of external predictions. This means dropout variance can serve as a statistical index of whether data is within the model's reliable prediction range, even if it cannot perfectly separate every individual case.
Unfamiliar pathologies: Beyond centre-related differences, the authors also tested 65 slides of other small B-cell lymphomas (chronic lymphocytic leukaemia/lymphoma, mantle cell lymphoma, and marginal zone lymphoma) that the network had never seen during training. The system still forced a FL or FH prediction on these slides, but the predictions carried markedly higher uncertainty values compared to genuine FL/FH test data. This demonstrates that dropout variance can detect out-of-scope pathologies, flagging cases where the model should not be trusted.
This dual capability, detecting both technically heterogeneous data and unfamiliar disease subtypes, makes the uncertainty framework particularly valuable for real-world deployment where unexpected inputs are inevitable.
The biased dataset experiment exposed a fundamental challenge: when training and test slides come from different laboratories, accuracy collapses. This is not a theoretical concern. Different pathology departments use different staining protocols, fixation times, and scanner hardware, all of which subtly alter how tissue appears in digitised images. CNNs are sensitive enough to pick up these technical variations and inadvertently incorporate them into their decision-making process.
The multi-centre solution: When training and validation sets included slides from both Toulouse and Dijon (the unbiased configuration used in the main experiments), the system achieved its peak AUC of 0.99. This confirms that exposure to heterogeneous pre-processing during training is essential for building robust diagnostic tools. The network learned to look past staining variability and focus on the morphological features that actually distinguish FL from FH.
Practical implications: The authors argue that developing clinically deployable AI tools for histopathology requires large training sets drawn from multiple institutions with diverse technical procedures. Maximum accuracy with a limited number of cases was achievable only when all slides came from one department, but that accuracy did not generalise. The trade-off between internal perfection and external robustness must be resolved in favour of heterogeneous training data. The French Lymphopath network, which has collected over 100,000 lymphoma cases from multiple sources covering all WHO-classified entities, represents a uniquely valuable resource for building such universally applicable models.
Binary classification scope: The system was designed only to distinguish FL from FH. It cannot classify other lymphoma subtypes, grade FL, or handle the full spectrum of lymph node pathologies a pathologist encounters. The 65 unfamiliar small B-cell lymphoma slides were correctly flagged as uncertain, but the system offered no specific diagnosis for them. Extending the model to cover more entities from the WHO classification would require substantially larger and more diverse training datasets.
Retrospective, two-centre design: All data came from two French university hospitals. While the multi-centre experiment proved the importance of diverse training data, two centres still represent a narrow slice of global laboratory practices. Different countries, scanner manufacturers, and tissue preparation methods could introduce variability beyond what the current model has learned to handle. Prospective validation across a wider geographic and institutional range is needed.
No manual region annotation: The authors deliberately avoided manual patch-level annotation, instead labelling all patches from an FL slide as "FL" and all from an FH slide as "FH." While this simplifies dataset construction enormously, it means the training data includes patches of normal tissue, artefacts, and non-diagnostic regions labelled with the slide-level diagnosis. The 91% patch-level accuracy may partly reflect this noisy labelling. It remains unclear whether manual annotation of diagnostic regions would meaningfully improve performance.
Uncertainty is not a complete safety net: While dropout variance effectively detects many out-of-distribution inputs, the biased dataset experiment showed it cannot perfectly separate every external case from internal data on a case-by-case basis. The 10% internal vs. 50% external rejection ratio is statistically favourable but not clinically sufficient as a standalone filter. Performance on the remaining 50% of "certain" external predictions was still as low as on the full external test set (AUC 0.63-0.69), meaning certainty did not correlate with accuracy for truly out-of-distribution data.
Leveraging the Lymphopath network: The authors identify the French Lymphopath network's collection of over 100,000 lymphoma cases, encompassing all clinical-pathological entities in the WHO classification and sourced from multiple institutions, as the ideal foundation for a more comprehensive model. Training on this dataset could extend the system beyond FL/FH to cover the full range of lymphoma subtypes, transforming it from a proof-of-concept into a practical screening tool.
Iterative uncertainty-driven learning: The study proposes using dropout variance as a mechanism for continuous model improvement. In a deployment scenario, cases flagged as uncertain could be automatically collected and reviewed by experts. These curated, uncertain cases could then be fed back into a new training cycle, gradually expanding the model's competence to handle data from new centres and new staining protocols. This iterative approach could make the system progressively more robust without requiring a single massive retraining effort.
Applicability to smaller samples: Because FL probability was spatially consistent across entire slides, the authors speculate the system could work on needle biopsies and other small specimens where immunomorphological interpretation is especially challenging, even for expert pathologists. Validating performance on such samples would open up new clinical use cases, particularly in settings where excisional biopsies are not feasible.
Broader histopathology applications: The BNN framework with uncertainty quantification is not specific to lymphoma. The authors suggest it could be applied to diagnosing other pathologies on H&E-stained digital slides. The core insight, that deep learning predictions without reliability estimates are insufficient for medical practice, applies across all of computational pathology. Open-source code for the experiments is available on GitHub (ArnaudAbreu/DiagFLFH), facilitating adaptation to other diagnostic tasks.