A deep learning system for prostate cancer diagnosis and grading in whole slide images


Plain-English Explanations
Pages 1-2
Why Gleason Grading Needs AI Assistance

Prostate cancer is the most commonly diagnosed cancer in men and one of the leading causes of cancer death. Treatment decisions hinge on the Gleason grade assigned by a pathologist after examining core needle biopsies (CNBs) under a microscope. The Gleason scoring system divides prostate cancer into five prognostically distinct grade groups, ranging from group 1 (3+3, low risk) to group 5 (high risk). On a CNB, the score is the sum of the most common (primary) Gleason pattern and the highest remaining (secondary) pattern.
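Concretely, the score-to-grade-group mapping works as follows. This is a minimal sketch of the standard ISUP five-group rules described above, not code from the paper:

```python
def gleason_grade_group(primary: int, secondary: int) -> int:
    """Map a Gleason score (primary + secondary pattern, each 3-5)
    to one of the five ISUP grade groups."""
    score = primary + secondary
    if score <= 6:
        return 1                         # e.g. 3+3, low risk
    if score == 7:
        return 2 if primary == 3 else 3  # 3+4 -> group 2, 4+3 -> group 3
    if score == 8:
        return 4                         # 4+4, 3+5, 5+3
    return 5                             # scores 9-10, high risk

assert gleason_grade_group(3, 4) == 2    # 3+4 -> grade group 2
```

Note that 3+4 and 4+3 share the same score (7) but land in different grade groups; this is exactly the grade group 2 vs. 3 distinction the system's ROC analysis later evaluates.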

The subjectivity problem: Gleason grading is notoriously subjective, with significant inter-observer and intra-observer variability. While uropathologists achieve higher agreement rates, their expertise is not widely available. Recent guidelines also require pathologists to estimate and quantify the percentage of tumor across multiple Gleason patterns, which further increases the burden and exacerbates subjectivity. Despite the introduction of the five-group system, research shows that inter- and intra-observer variability has not decreased.

The domain shift challenge: Prior deep learning (DL) systems for Gleason grading have shown promise, but most exhibit reduced performance when applied to whole slide images (WSI) from institutions not represented in training data. This phenomenon, called domain shift, limits the real-world applicability of these systems. Most previous methods also focus solely on core-level grading, ignoring gland-level segmentation and overlap with pathologists' pixel-level annotations.

This study proposes a DL system that segments and grades epithelial tissue using a novel training methodology designed to learn domain-agnostic features. The system was evaluated on 6,670 WSI from three cohorts: Muljibhai Patel Urological Hospital (MPUH), Radboud University Medical Center (RUMC), and Karolinska Institute. The goal was to build a system that works as an assistive tool for CNB review, improving consistency and accuracy of grading across institutions.

TL;DR: Gleason grading of prostate cancer biopsies is highly subjective (mean inter-observer kappa of 0.79), and most existing DL systems suffer from domain shift when applied to new institutions. This study proposes a domain-agnostic DL approach trained on 3,741 CNBs and evaluated on 6,670 WSI from three independent cohorts.
Pages 2-3
Multi-Center Dataset from Three Institutions

The study used a total of 6,670 WSI sourced from three institutions: MPUH (India), Radboud University Medical Center (Netherlands), and Karolinska Institute (Sweden). Cases were randomly assigned to either development (training/tuning) or independent validation datasets, ensuring a rigorous separation between training and evaluation data.

MPUH dataset: The pathology archives provided 580 de-identified, anonymized CNB slides from 110 individuals. Hematoxylin and Eosin (H&E)-stained, formalin-fixed paraffin-embedded (FFPE) needle core biopsies were digitized using a Hamamatsu Nanozoomer XR at 40x magnification. This dataset was split into 155 slides for training and 425 for testing. The MPUH Institutional Review Board (IRB #EC/678/2020) authorized the study and waived informed consent because the data was de-identified and used retrospectively.

PANDA challenge dataset: The publicly available Prostate cANcer graDe Assessment (PANDA) dataset, created as part of a Kaggle competition by RUMC and Karolinska Institute in collaboration with Tampere University, provided the remaining data. From RUMC, 3,586 biopsies were used for training and 1,201 for testing. The Karolinska Institute contributed 1,303 biopsies that were used exclusively as unseen test data, meaning the system was never exposed to this institution's data during training.

Consensus reference standard: For the internal test set of 425 biopsies, a panel of four pathologists (two uropathologists and two general surgical pathologists) independently graded each biopsy to create a consensus reference. This rigorous approach allowed the authors to compare the DL system's performance against a robust ground truth rather than a single pathologist's opinion.

TL;DR: The study used 6,670 WSI from three institutions across three countries. Training used 3,741 CNBs from MPUH and RUMC. Testing involved three independent sets: 425 internal (MPUH), 1,201 external (RUMC), and 1,303 completely unseen (Karolinska). A four-pathologist consensus panel provided the reference standard.
Pages 3-6
Semi-Supervised Active Learning and Network Architecture

The system's pipeline consists of three major steps. First, a pathologist selects a limited fraction of images containing pure Gleason patterns (3+3, 4+4, and 5+5) and annotates their tumor glands. A Fully Convolutional Network (FCN) is then trained to segment epithelial tissue, and the pathologist's Gleason grade is assigned to predicted tumor locations as initial labels.

Active learning loop: In the second step, the system simulates iterative active learning-based data labelling, inspired by the Cost-Effective Active Learning paradigm. Using the initial labelled dataset, the FCN semantic segmentation model is trained for Gleason grade group identification. After training, unlabeled images are fed into the FCN, and an uncertainty measure is computed for each sample. A pathologist then annotates only the most uncertain samples (those exceeding an uncertainty threshold), and these are added to the training set. The FCN is retrained with each new batch. This cycle repeats until no samples exceed the uncertainty threshold, efficiently targeting the hardest cases for expert annotation.
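The loop above can be sketched in a few lines of pure Python. The `retrain` and `uncertainty` functions below are illustrative stand-ins (in the paper these are the FCN training step and the Monte-Carlo-dropout score); here a sample's uncertainty simply shrinks as the amount of supervision grows:

```python
def retrain(labelled):
    """Stand-in for FCN training: the 'model' is just the amount of
    supervision it has seen."""
    return len(labelled)

def uncertainty(model, sample):
    """Stand-in for MC-dropout uncertainty: harder samples (larger
    values) stay uncertain longer as the model improves."""
    return sample / (1 + model)

def active_learning(labelled, unlabelled, threshold):
    """Cost-effective active-learning loop: retrain, score the unlabelled
    pool, send only over-threshold samples for expert annotation, repeat
    until nothing exceeds the threshold."""
    model = retrain(labelled)
    while True:
        hard = [s for s in unlabelled if uncertainty(model, s) > threshold]
        if not hard:                 # stopping criterion: pool is confident
            return model, labelled
        labelled = labelled + hard   # pathologist annotates the hard cases
        unlabelled = [s for s in unlabelled if s not in hard]
        model = retrain(labelled)    # retrain with each new batch

model, labelled = active_learning([1, 2], list(range(10)), threshold=0.5)
```

The key property, mirrored in this toy run, is that easy samples (here 0 and 1) never consume annotation effort: only samples exceeding the uncertainty threshold are ever sent to the expert.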

Network architecture: The FCN is built on a U-Net architecture with several key enhancements. The encoder (contracting path) uses ResNeXt50 instead of the vanilla U-Net encoder. ResNeXt50 introduces "cardinality" as an additional dimension beyond depth and width, reducing hyperparameters while combining ResNet's residual block repetition with the Inception Network's split-transform-merge strategy. Atrous Spatial Pyramid Pooling (ASPP) is added at the bottleneck to capture contextual data at various scales for more precise classification. A Feature Pyramid Network (FPN) is also included to improve segmentation of very small glands by blending low-resolution, semantically rich features with high-resolution features via top-down connections.

Ensemble distillation: Multiple models with this architecture are trained on five folds of data with various stain-based augmentations. The aggregate knowledge of these five teacher models is then transferred via ensemble distillation into a single student network using KL divergence loss. This step acts as a strong regularizer, reducing the effect of label noise in the training set.
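The distillation objective can be sketched with NumPy: the five teachers' softmax outputs are averaged into a soft target, and the student is penalized by the KL divergence to that target. Shapes and helper names are illustrative, not from the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_kl(teacher_logits, student_logits):
    """KL(teacher ensemble || student), averaged over pixels.

    teacher_logits: (n_teachers, n_pixels, n_classes)
    student_logits: (n_pixels, n_classes)
    """
    p = softmax(teacher_logits).mean(axis=0)   # averaged soft targets
    q = softmax(student_logits)
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1).mean())

rng = np.random.default_rng(0)
teachers = rng.normal(size=(5, 4, 3))  # 5 teachers, 4 pixels, 3 classes
student = rng.normal(size=(4, 3))
loss = distillation_kl(teachers, student)
assert loss >= 0.0  # KL divergence is non-negative
```

Averaging the teachers' probabilities smooths out their individual mistakes, which is why the distilled targets act as a regularizer against label noise.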

Uncertainty estimation: The system uses Monte-Carlo dropout to estimate prediction confidence. By keeping dropout active during inference and repeating forward passes, slightly different predictions emerge due to stochastic neuron activation. The standard deviation of posterior probabilities generates an uncertainty map, and the total uncertainty for each unlabeled sample is calculated by summing pixel values in the map.

TL;DR: The system uses a U-Net architecture enhanced with ResNeXt50 encoder, ASPP, and FPN. Training employs semi-supervised active learning where only the most uncertain samples get manual annotation. Five teacher models trained on different folds are distilled into a single student network via KL divergence loss, and Monte-Carlo dropout provides uncertainty estimates.
Pages 6-7
How the System Overcomes Stain Variation Across Laboratories

Histopathology images vary significantly in appearance from lab to lab due to differences in tissue slide preparation, fixation, processing, sectioning, staining procedures, and scanning equipment. This variation causes "domain shift," which degrades deep learning model performance on out-of-distribution samples. The authors address this with a domain-neutral training methodology based on a multi-task paradigm.

Multi-task model design: The segmentation model shares its feature extractor with a stain-normalization network (Generator Head). During training, the model receives pairs of raw and color-augmented images that simulate stain color variations. The model must simultaneously reconstruct a normalized image from the color-augmented input (matching the raw image) and semantically segment the image into pixel-level class labels. To enforce learning stain-robust features, the system penalizes the distance between logits for color-augmented and raw images.

Training loss function: The overall loss is a weighted combination of four components: (1) MSE loss between stain-normalized and original image, (2) KL divergence loss between color-augmented and raw image logits, (3) cross-entropy pixel segmentation loss, and (4) KL divergence ensemble distillation loss. During inference, the stain normalization layers are removed, leaving only the layers needed for image segmentation. This means the model has internalized stain-invariant representations without any inference-time overhead.
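The four-term objective might be combined as follows. This is a NumPy sketch under stated assumptions: the weights `w`, the KL directions, and the array shapes are placeholders, since the paper's exact weighting is not restated here:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1).mean())

def multitask_loss(norm_img, raw_img, aug_logits, raw_logits,
                   labels, teacher_logits, w=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four training terms; weights are placeholders."""
    mse = float(((norm_img - raw_img) ** 2).mean())         # (1) stain-normalization MSE
    consist = kl(softmax(raw_logits), softmax(aug_logits))  # (2) augmented-vs-raw KL
    p = softmax(aug_logits)                                 # (3) pixel cross-entropy
    ce = float(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())
    distill = kl(softmax(teacher_logits), softmax(aug_logits))  # (4) distillation KL
    return w[0] * mse + w[1] * consist + w[2] * ce + w[3] * distill
```

When the normalized image matches the raw one and the augmented logits agree with both the raw logits and the teachers, only the cross-entropy term remains, which is the intended equilibrium of the multi-task design.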

Proof of effectiveness: The authors used t-SNE visualization of feature vectors from the FCN's last convolutional layer to demonstrate the impact. Without domain-agnostic training, features from Radboud and Karolinska datasets formed discrete, separated clusters. With domain-agnostic training, the previously fragmented representations overlapped and showed smooth distributions. On the Karolinska test set, the domain-agnostic model achieved 83.1% accuracy (kappa 0.93), while the baseline model without this training only reached 74.6% accuracy (kappa 0.88), a substantial improvement of 8.5 percentage points.

TL;DR: A multi-task training approach pairs stain normalization with segmentation, forcing the model to learn stain-invariant features. This boosted accuracy on the completely unseen Karolinska test set from 74.6% (kappa 0.88) to 83.1% (kappa 0.93), an 8.5 percentage point improvement, as confirmed by t-SNE visualization of overlapping feature distributions.
Pages 2-3
Inter-Observer Agreement and System Performance

Pathologist variability: The panel of four pathologists evaluated the internal set of 425 biopsies. The inter-observer agreement between the two uropathologists was kappa 0.89. Between general surgical pathologists, it was kappa 0.69. The agreement between uropathologists and general surgical pathologists ranged from kappa 0.50 to 0.59. The mean inter-observer agreement with the consensus reference was kappa 0.79. Individual uropathologists achieved kappa values of 0.90 and 0.91, while general surgical pathologists scored 0.71 and 0.65.
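For readers unfamiliar with the statistic, kappa measures agreement beyond chance between two raters. The sketch below implements plain (unweighted) Cohen's kappa; whether the paper's reported values are linearly or quadratically weighted is not restated here:

```python
import numpy as np

def cohens_kappa(rater_a, rater_b, n_classes):
    """Unweighted Cohen's kappa between two raters' label sequences."""
    conf = np.zeros((n_classes, n_classes))
    for a, b in zip(rater_a, rater_b):
        conf[a, b] += 1                       # joint label counts
    conf /= conf.sum()
    po = np.trace(conf)                       # observed agreement
    pe = conf.sum(axis=1) @ conf.sum(axis=0)  # agreement expected by chance
    return float((po - pe) / (1 - pe))
```

A kappa of 1.0 means perfect agreement and 0.0 means chance-level agreement, which is why the gap between uropathologists (0.89) and cross-specialty pairs (0.50 to 0.59) is so striking.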

System grading accuracy: The proposed system demonstrated 89.4% accuracy and kappa 0.92 on the internal MPUH test set of 425 biopsies. On the external RUMC test set, accuracy was 85.3% with kappa 0.96. On the completely unseen Karolinska test set (1,303 WSI not used in training), the system achieved 83.1% accuracy with kappa 0.93. Notably, the system's kappa values exceeded those of individual uropathologists (0.90 and 0.91) and far surpassed general surgical pathologists (0.71 and 0.65).

Clinically relevant classifications: ROC analysis tested the system's ability to classify images into clinically important categories. For cancer vs. benign detection, the system achieved AUC values of 0.997 (MPUH), 0.991 (RUMC), and 0.920 (Karolinska). For low-grade vs. high-grade tumor classification, AUC values were 0.990, 0.960, and 0.930 across the three test sets. For the challenging distinction between grade group 2 (3+4) and grade group 3 (4+3), AUCs were 0.900, 0.920, and 0.830 respectively.
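The AUC values above summarize threshold-free separability: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. A minimal sketch of that statistic (the Mann-Whitney formulation, not the authors' evaluation code):

```python
import numpy as np

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs the model ranks
    correctly, with half credit for ties."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))
```

On this reading, the Karolinska cancer-detection AUC of 0.920 means a randomly chosen cancerous core outranks a randomly chosen benign one 92% of the time.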

Percent Gleason grade correlation: The association between percent Gleason grade 4/5 as scored by pathologists and by the DL system was also analyzed. The Pearson correlation coefficients were r = 0.97 for Gleason grade 4 and r = 0.95 for grade 5 on the MPUH and Radboud test sets, indicating excellent agreement between algorithm output and pathologist assessment of tumor area proportions.

TL;DR: The DL system achieved kappa values of 0.92, 0.96, and 0.93 across three test sets, outperforming individual pathologists (mean inter-observer kappa 0.79). Cancer detection AUCs reached 0.997, 0.991, and 0.920. Percent Gleason grade 4/5 correlations with pathologists were r = 0.97 and r = 0.95.
Pages 6-7
Pixel-Level Segmentation and Architecture Comparison

Unlike most previous studies that focus only on core-level Gleason grade prediction, this system also provides pixel-level segmentation overlays. These overlays allow pathologists to see exactly which glands the system identified and what Gleason pattern was assigned, functioning as a "second read" to ensure no areas are missed.

Segmentation benchmarks on MPUH: The proposed ensemble distillation model was compared against state-of-the-art architectures. On the MPUH test set, the model achieved F1 scores of 0.887 (benign), 0.897 (grade 3), 0.912 (grade 4), and 0.923 (grade 5). By comparison, U-Net with FPN and ASPP using ResNeXt50 scored 0.852, 0.881, 0.900, and 0.923. UNet++ with EfficientNetB4 scored 0.887, 0.836, 0.905, and 0.950. DeepLabV3+ scored 0.860, 0.856, 0.889, and 0.952. The proposed model achieved the highest overall F1 score of 0.898 on MPUH.
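The pixel-level F1 used in these benchmarks is the per-class harmonic mean of precision and recall (equivalently, the Dice score). A sketch of how such scores are computed from two label maps, not the authors' evaluation code:

```python
import numpy as np

def per_class_f1(pred, target, n_classes):
    """Pixel-level F1 per class from predicted and reference label maps."""
    f1 = []
    for c in range(n_classes):
        tp = np.sum((pred == c) & (target == c))  # true positive pixels
        fp = np.sum((pred == c) & (target != c))  # false positives
        fn = np.sum((pred != c) & (target == c))  # false negatives
        f1.append(2 * tp / max(2 * tp + fp + fn, 1))  # Dice / F1
    return f1
```

Because the score is computed per class, a model can post a strong grade-5 F1 while still lagging on benign tissue, exactly the pattern seen in the RUMC numbers below.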

Segmentation benchmarks on RUMC: On the Radboud test set, the model achieved F1 scores of 0.673 (benign), 0.830 (grade 3), 0.795 (grade 4), and 0.925 (grade 5), with an overall F1 score of 0.823. The decline in performance relative to MPUH was attributed to label noise in the Radboud dataset. When students annotated the Radboud images for comparison, accuracy against consensus labels was only 72% (kappa 0.85), revealing the presence of labelling errors. By contrast, the MPUH dataset was annotated by expert pathologists with low label noise.

The Karolinska Institute dataset lacked pixel-level labels and was therefore excluded from segmentation performance evaluation, though it was still used for core-level grading assessment.

TL;DR: The proposed model outperformed U-Net++, DeepLabV3+, and U-Net-ASPP-FPN on pixel-level segmentation. Overall F1 scores were 0.898 (MPUH) and 0.823 (RUMC). The RUMC performance gap was linked to label noise (only 72% accuracy in ground truth annotations), not model deficiency.
Pages 7-9
How This System Compares to Previous Deep Learning Approaches

The authors provide an extensive comparison with previous prostate cancer DL systems. Litjens et al. developed a CNN with 225 slides achieving an AUC of 0.98 for cancer detection, but focused only on detection rather than grading. Lucas et al. found their CNN could distinguish benign from malignant with 92% accuracy (sensitivity 90%, specificity 93%). Arvaniti et al. trained a CNN using extensive Gleason annotations and achieved model-pathologist agreement of kappa 0.75 and 0.71 against two pathologists, on par with the inter-pathologist agreement of 0.71.

Large-scale systems: Campanella et al. achieved AUC values of 0.976 (ResNet34) and 0.977 (VGG11-BN) for cancer vs. benign classification using 12,160 WSI, but focused only on binary classification. Nagpal et al. trained a system on 1,557 slides and compared it to 29 pathology experts, achieving mean accuracy of 0.61 on validation. Bulten et al. published a large trial where their DL system outperformed 10 out of 15 pathologists in determining biopsy malignancy (AUC 0.990). Strom et al. digitized 6,682 slides and achieved AUC 0.960 for cancer identification, with mean pairwise kappa of 0.62 for Gleason grading.

Clinical impact systems: Raciti et al. showed that the Paige Prostate Alpha system improved the sensitivity of all pathologists from 74% to 90% for diagnosing prostate cancer, with sensitivity gains of 20% for grade group 1, 13% for grade group 2, and 11% for grade group 3. Mun et al. published the YAAGGS system trained on data from two hospitals, achieving 77.5% accuracy and kappa of 0.65 for grade group prediction.

Key differentiators: Most previous studies focus on algorithms that only predict core-level Gleason grade groups. In contrast, this system provides a multi-task output: malignant vs. benign classification, percentage area of tumor, core-level grading, percentage area of Gleason scores, and pixel-level segmentation overlays. The system can also identify Gleason pattern transitions in terms of likelihood probabilities and assign finer-grained patterns to glands, particularly at the 3/4 and 4/5 transitions where pathologists often disagree. Furthermore, the domain-agnostic training enables generalization to unseen institutions, which most earlier approaches lack.

TL;DR: Previous systems achieved strong binary detection (AUC up to 0.990) but limited grading performance (kappa 0.62 to 0.71). This system achieves kappa 0.92 to 0.96 for grading and uniquely provides pixel-level segmentation, continuous Gleason pattern probability maps, and domain-agnostic generalization to unseen institutions.
Pages 8-10
Current Constraints and Paths Forward

Misclassification patterns: The system occasionally produced misclassifications in the pixel-level overlays, particularly in stromal regions and at tissue border margins. Most tissue border errors were caused by preparation artifacts that the network could not recognize. Specific failure cases included cribriform grade 4 patterns misclassified as grade 5, lymphocyte infiltration incorrectly predicted as grade 5, and cutting artifacts producing false positives in compressed, dense tissue at biopsy margins. The authors suggest that a dedicated artifact-detection neural network could be trained as a pre-processing step to filter out these cases in clinical applications.

Inherent scoring subjectivity: A more fundamental limitation is the subjective nature of the Gleason scoring system itself. The mean inter-observer agreement among the four-pathologist panel was only kappa 0.79, and the agreement between uropathologists and general surgical pathologists was particularly low (kappa 0.50 to 0.59). Any automated system trained on these labels inherits this inconsistency. The authors note that consensus annotations from larger panels of pathologists could help improve automated Gleason grading approaches.

Label noise impact: The performance gap between the MPUH dataset (expert-annotated, low label noise) and the Radboud dataset (higher label noise, 72% accuracy against consensus) highlights how data quality directly affects system performance. The ensemble distillation approach partially mitigates label noise, but it remains a limiting factor.

Future directions: Larger-scale studies involving multiple medical facilities are needed to consolidate and develop a system suitable for clinical deployment. Critically, the current study did not assess the algorithm's predictive efficacy or compare it to long-term clinical outcomes. While Gleason grading methods provide useful prognostic information, the ultimate test is whether DL-assisted grading translates into better treatment decisions and patient survival. The authors plan to analyze long-term clinical outcomes from biopsy cases to improve risk stratification in future work.

TL;DR: Key limitations include misclassifications from tissue artifacts, inherent Gleason scoring subjectivity (inter-observer kappa only 0.79), and label noise in training data. No long-term clinical outcome validation was performed. Future work will focus on multi-center scaling, artifact pre-filtering, and linking DL-assisted grading to patient survival outcomes.
Citation: Singhal N, Soni S, Bonthu S, et al. A deep learning system for prostate cancer diagnosis and grading in whole slide images. Sci Rep. 2022. PMC8888647. DOI: 10.1038/s41598-022-07217-0. License: CC BY (Open Access).