Prostate cancer (PCa) is the second most common cancer in men, affecting over 1.4 million people annually according to GLOBOCAN 2020 data, with 375,304 deaths in a single year. When detected early and managed appropriately, patients can achieve long-term survival rates of up to 98%. The Gleason grading system is the gold standard for assessing prostate tissue histopathology: it assigns grades from 1 to 5 based on glandular differentiation patterns, which are combined into a Gleason score (e.g., 3+4=7). The International Society for Urological Pathology (ISUP) revised this system in 2014, mapping Gleason scores to ISUP grades 1 through 5.
The problem with manual grading: Gleason grading is subjective and labor-intensive. Pathologists in the USA alone must assess over 10 million tissue samples per year from prostate biopsies. Significant interobserver variability exists among pathologists, with studies showing kappa agreement scores ranging from 0.60 to 0.73 among experts. This variability risks overtreatment or missed diagnoses, directly affecting patient care decisions.
Deep learning as a solution: Recent advances in deep learning have demonstrated the ability to detect intricate patterns in large datasets with consistent precision. Campanella et al. analyzed 44,000 whole slide images (WSIs) across breast, skin, and prostate tissues, achieving an AUC of 0.986 for PCa detection using a multiple-instance learning framework with a recurrent neural network. Strom et al. trained ensembles of 30 InceptionV3 models on 6,682 WSIs, reaching an AUC of 0.997 on an independent test set. These results highlight the potential for AI to match or exceed human-level performance in cancer detection.
Prior approaches and gaps: While classification-only or segmentation-only methods have been explored, much of the existing work has been limited to basic binary tasks (benign vs. malignant) or distinguishing only between Gleason grades 3 and 4. This paper proposes a three-stage framework that integrates both classification and segmentation, refined by a machine learning classifier, to predict the full ISUP grade across all categories.
The study used the Prostate Cancer Grade Assessment (PANDA) challenge dataset, the most extensive public collection of whole-slide images for PCa grading. It comprises 5,160 hematoxylin and eosin (H&E) cases from Radboud University Medical Center and 5,456 from Karolinska Institute, scanned at 20x magnification and stored in TIFF format. The Radboud dataset includes segmentation masks with detailed Gleason pattern annotations (background, stroma, benign, Gleason 3, 4, and 5), while the Karolinska dataset only distinguishes benign from malignant areas.
Why extensive cleaning was needed: The PANDA dataset is notoriously noisy. The PANDA challenge winners identified 1,153 cases in the Radboud dataset where the Gleason scores derived from mask images did not match the ground truth labels. The authors found an additional 845 cases with incorrect Gleason representations or incomplete mask data, plus 463 cases with image markings, empty masks, or missing masks entirely. Combined, these exclusions reduced the usable dataset from 5,160 to 2,699 cases. Two clinical experts were consulted to validate the cleaning process.
Patch sampling strategy: Because WSIs are gigapixel-scale images that exceed GPU memory capacities, the authors used patch-based sampling, experimenting with two patch sizes: 500 x 500 and 1,000 x 1,000 pixels. For classification, a patch was labeled benign only if 100% of its epithelial tissue was benign, while a malignant label required the corresponding Gleason pattern to cover at least 50% of the patch's tumor area. Subject-wise five-fold cross-validation was used to evaluate model performance.
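The labeling rule can be sketched as follows. The mask values follow the published PANDA Radboud convention; the `label_patch` helper and its `overlap` parameter are our reading of the thresholds, not the authors' code:

```python
import numpy as np

# Radboud mask convention (per the PANDA documentation):
# 0=background, 1=stroma, 2=benign epithelium, 3/4/5 = Gleason patterns.
BENIGN, GLEASON = 2, (3, 4, 5)

def label_patch(mask, overlap=0.5):
    """Assign a single class label to a mask patch, or None if ambiguous.

    Sketch of the rule described above: benign only when all epithelial
    tissue is benign; a Gleason pattern only when it covers at least
    `overlap` of the patch's tumor area.
    """
    tumor = np.isin(mask, GLEASON)
    epithelium = tumor | (mask == BENIGN)
    if epithelium.sum() == 0:
        return None                      # no gradable tissue in the patch
    if tumor.sum() == 0:
        return "benign"                  # 100% of epithelium is benign
    fractions = {g: (mask == g).sum() / tumor.sum() for g in GLEASON}
    g, frac = max(fractions.items(), key=lambda kv: kv[1])
    return f"gleason_{g}" if frac >= overlap else None

# Toy 4x4 patch: 10 px Gleason 4, 2 px Gleason 3, one row of stroma.
patch = np.array([[4, 4, 4, 4],
                  [4, 4, 4, 4],
                  [4, 4, 3, 3],
                  [1, 1, 1, 1]])
print(label_patch(patch))  # gleason_4 (covers 10/12 of the tumor area)
```

Patches whose dominant pattern falls below the threshold return `None` here, standing in for whatever discard-or-relabel policy the authors applied.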
Data augmentation for class balance: To address class imbalance, augmentation techniques including horizontal flips, vertical flips, and 90-degree rotations were applied. After augmentation, each class contained 168,141 training patches for 500 x 500 classification and 34,315 training patches for 1,000 x 1,000 classification, ensuring balanced representation across benign, Gleason 3, Gleason 4, and Gleason 5 categories.
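A minimal sketch of these augmentations, assuming patches arrive as `(H, W, C)` arrays; the `augment` generator and the exact variant set are illustrative, not the authors' pipeline:

```python
import numpy as np

def augment(patch):
    """Yield flip and rotation variants of an (H, W, C) image patch.

    Covers the transformations named above: horizontal flip, vertical
    flip, and 90-degree rotations (label-preserving for histopathology,
    since tissue has no canonical orientation).
    """
    yield patch
    yield np.fliplr(patch)               # horizontal flip
    yield np.flipud(patch)               # vertical flip
    for k in (1, 2, 3):                  # 90/180/270-degree rotations
        yield np.rot90(patch, k)

patch = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)
variants = list(augment(patch))
print(len(variants))  # 6 variants per patch
```

In practice such transforms would be applied selectively to minority classes until the per-class patch counts reported above are balanced.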
The first stage of the framework classifies extracted patches into four categories: benign, Gleason 3, Gleason 4, and Gleason 5. The authors evaluated four pretrained architectures fine-tuned on the PANDA dataset, with the final fully connected layers replaced for the four-class task. All models were pretrained on ImageNet and fine-tuned at a learning rate of 0.0001 using the Adam optimizer. Patches were resized to 224 x 224 pixels for classification.
DenseNet121: A densely connected CNN where each layer in a dense block receives feature maps from all preceding layers, promoting feature reuse and efficient gradient flow. It uses bottleneck layers, batch normalization, ReLU activation, and transition layers between blocks. EfficientNet_b0: Uses compound scaling to simultaneously optimize network width, depth, and resolution. Designed via neural architecture search to maximize accuracy per FLOP. Inception_v3: A Google-developed CNN that uses factorized convolutions and label smoothing to reduce parameters without sacrificing depth. Vision Transformer (ViT): Applies the transformer self-attention mechanism to image patches, capturing global relationships within visual data.
Results on 500 x 500 patches: EfficientNet_b0 achieved the best overall accuracy of 90.13% and F1-score of 83.83%. DenseNet121 followed closely at 89.48% accuracy with the highest specificity of 92.95%. Inception_v3 reached 89.46% accuracy, while Vision Transformer trailed at 88.51%. Most misclassifications occurred between the Gleason 4 and Gleason 5 categories.
Results on 1000 x 1000 patches: EfficientNet_b0 maintained its robust performance at 89.2% accuracy and 83.83% F1-score. DenseNet121 dipped slightly to 88.9% accuracy but improved its F1-score to 83.51%. Inception_v3 dropped notably across all metrics (84.64% accuracy, 75.20% F1), suggesting it struggles with larger tissue regions. Vision Transformer improved slightly to 88.94% accuracy. Overall, EfficientNet_b0 outperformed the other models across both patch sizes.
The second stage performs pixel-level multiclass segmentation, assigning each region of a patch to benign, Gleason 3, Gleason 4, or Gleason 5. The authors tested seven segmentation architectures: a vanilla U-Net baseline, U-Net with DenseNet and EfficientNet encoders, DeepLabV3 and DeepLabV3+ with EfficientNet encoders, and both DeepLabV3 variants enhanced with Self-Organized Operational Neural Networks (Self-ONN). Patches were resized to 256 x 256 pixels for segmentation.
What is Self-ONN? Traditional CNNs use homogeneous, linear neuron structures that do not fully capture the diversity of biological neural systems. Operational Neural Networks (ONNs) introduce heterogeneous, non-linear operational units per neuron, incorporating diverse operators beyond standard linear convolutions. Self-ONN extends this by allowing the network to self-organize its operational structure. The authors replaced all CNN layers in DeepLabV3 and DeepLabV3+ with Self-ONN layers.
Results on 500 x 500 patches: The DeepLabV3+ with Self-ONN achieved the highest overall Dice Similarity Coefficient (DSC) of 84.90%, IoU of 83.47%, and accuracy of 81.14%. It excelled at segmenting Gleason 5 with a DSC of 93.04%. By comparison, the vanilla U-Net managed only 52.1% DSC and 49.58% IoU, demonstrating the dramatic impact of using advanced encoders. The standard EfficientNet U-Net achieved 89.56% DSC but lower accuracy (55.62%), while DeepLabV3 variants consistently outperformed U-Net variants across all metrics.
Results on 1000 x 1000 patches: The DeepLabV3+ with Self-ONN again topped the rankings with a DSC of 83.08% and accuracy of 83.99%. Its Gleason 5 DSC reached 93.56%. Interestingly, the vanilla U-Net improved dramatically on larger patches (77.85% DSC vs. 52.1%), suggesting that more tissue context helps even simple architectures. DenseNet U-Net achieved the highest IoU of 79.00% among U-Net variants. The Self-ONN enhancement consistently boosted performance over standard CNN-based DeepLabV3 architectures across both patch sizes.
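The DSC and IoU figures above can be computed per class from predicted and ground-truth label maps. A minimal sketch; the `dice_iou` helper and its empty-class handling are illustrative, since the authors' exact averaging scheme is not stated here:

```python
import numpy as np

def dice_iou(pred, target, cls):
    """Per-class Dice similarity coefficient and IoU for 2D label maps."""
    p, t = (pred == cls), (target == cls)
    inter = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    denom = p.sum() + t.sum()
    dice = 2 * inter / denom if denom else 1.0   # class absent everywhere
    iou = inter / union if union else 1.0
    return dice, iou

# Toy 2x3 maps: one Gleason-4 pixel predicted as Gleason 3.
pred   = np.array([[3, 3, 4], [4, 4, 0]])
target = np.array([[3, 4, 4], [4, 4, 0]])
print(dice_iou(pred, target, 4))  # (0.857..., 0.75)
```

An overall score would then average these per-class values across classes and patches.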
The third and final stage uses machine learning classifiers to predict the ISUP grade (0-5) from the percentage distributions of each Gleason pattern computed by the classification and segmentation networks. For classification, the proportion of patches assigned to each class was calculated. For segmentation, the proportion of area covered by each class within the predicted masks was quantified. These proportions served as feature vectors for seven ML classifiers: MLP, RandomForest, Linear Regression, ExtraTrees, KNN, XGBoost, and SVM.
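Stage 3 can be sketched as follows; the `slide_features` helper and the random data are illustrative stand-ins for the real PANDA-derived proportion vectors:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def slide_features(patch_labels, n_classes=4):
    """Stage-1-style features: fraction of patches in each class."""
    counts = np.bincount(patch_labels, minlength=n_classes)
    return counts / counts.sum()

# A slide whose patches are mostly Gleason 3 (class index 1 here).
print(slide_features(np.array([1, 1, 1, 0, 2])))  # [0.2 0.6 0.2 0. ]

# Fit one of the stage-3 classifiers on toy proportion vectors;
# targets are ISUP grades 0-5.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(4), size=60)     # 60 fake slides, proportions sum to 1
y = rng.integers(0, 6, size=60)            # fake ISUP grades
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict(X[:2]))
```

Segmentation-derived features would be built the same way, with pixel areas replacing patch counts, and the two vectors concatenated for the combined experiments.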
Classification features alone: Using features from 500 x 500 patch models, EfficientNet_b0 features with XGBoost achieved the highest QWK of 0.8587. DenseNet121 features with ExtraTrees reached 0.8469. Features from 1000 x 1000 patches performed significantly worse, with the best QWK being only 0.6424, indicating that smaller patches yield better classification features for this task.
Segmentation features alone: Segmentation-derived features outperformed classification features overall. The DeepLabV3+ with Self-ONN trained on 1000 x 1000 patches achieved a QWK of 0.9140 with the SVM classifier. Unlike classification, segmentation features from larger patches outperformed smaller patches (best QWK of 0.914 vs. 0.840). This reversal suggests that segmentation benefits from the broader tissue context of larger patches, while classification benefits from the focused patterns in smaller patches.
Combined features: The highest QWK of 0.9215 was achieved by the RandomForest classifier trained on concatenated features from both classification and segmentation models across both patch sizes. This result demonstrates that combining both approaches, each operating at different spatial scales, captures complementary information about tissue architecture. The confusion matrix for the best classifier shows strong diagonal concentration, confirming high agreement with ground truth ISUP grades across all six categories (grades 0-5).
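Quadratic weighted kappa, the metric behind every QWK figure in this stage, penalizes disagreements by the squared distance between predicted and true ISUP grades, so an off-by-one error costs far less than a large one. It is available directly in scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Toy ISUP-grade predictions (grades 0-5) with one off-by-one error.
y_true = [0, 1, 2, 3, 4, 5, 2, 3]
y_pred = [0, 1, 2, 3, 4, 5, 3, 3]
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(round(qwk, 4))  # 0.9722
```

Because the weighting is quadratic, the same single error placed five grades away would drag the score down far more sharply.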
The authors benchmarked their framework against a range of existing approaches for PCa grade assessment. Arvaniti et al. achieved a kappa of 0.85 using CNN classification of tissue microarray images on 641 training patients. Nagpal et al. reached only 70% accuracy on 331 validation slides using patch-based classification. Bulten et al. obtained a kappa of 0.918 using a U-Net variant with CycleGAN normalization on 5,759 samples, though this was evaluated on a single internal dataset. Singhal et al. reported a QWK of 0.93 on 1,303 test biopsies using a custom U-Net architecture.
Key differentiators of this framework: Most prior studies used either classification or segmentation in isolation. This study is one of the first to combine both, leveraging their complementary strengths. Additionally, most prior work used a single patch size (ranging from 299x299 to 911x911), while this study systematically compared 500x500 and 1000x1000 patches and combined features from both. The use of five-fold cross-validation across all 2,699 cases provides more reliable performance estimates than the train/test splits used in many prior studies.
Recent deep learning approaches: Liang et al. proposed an attention-LSTM aggregator that achieved a QWK of 0.903 using five-fold cross-validation. Zhongyi et al. developed an Intensive-Sampling Multiple Instance Learning framework reaching a QWK of 0.860 on the PANDA dataset. Balaha et al. reported 88.91% classification accuracy using transfer learning on the PANDA dataset. The proposed framework's QWK of 0.9215 surpasses all of these, though direct comparisons are complicated by differences in dataset subsets, preprocessing, and evaluation protocols.
The Self-ONN advantage: A unique contribution is the introduction of Self-ONN layers into the DeepLabV3 architecture for histopathology. By replacing standard convolutional layers with self-organizing operational layers, the model gains heterogeneous, non-linear processing capabilities that better capture the complex morphological patterns in prostate tissue. This consistently improved segmentation metrics over standard CNN-based variants across all experiments.
Single-institution data: All training and testing was conducted using data from Radboud University Medical Center only. The Karolinska dataset could not be used because its segmentation masks only distinguish benign from malignant tissue without Gleason pattern annotations. This means the framework's generalizability to tissue processed and scanned at other institutions remains unproven. The authors themselves acknowledge that this "leaves the model's efficacy on new, untested data open to question."
Stain variation sensitivity: Differences in staining protocols between institutions produce color variations in H&E slides that can degrade model performance. The Karolinska dataset exhibited different staining colors from the Radboud data, which would have made cross-institutional predictions less effective. The framework currently has no mechanism to handle such stain domain shifts, which is a significant barrier to real-world clinical deployment where tissue preparation varies widely.
Noisy label reliance: Even after extensive data cleaning, the remaining 2,699 cases inherit the subjective grading of individual pathologists. The reference standard accuracy was only 0.675 (kappa 0.819) for Gleason scores and 0.720 (kappa 0.853) for grade groups when compared against semi-automatic labels. This ceiling on label quality limits how accurately any model can learn, since the ground truth itself contains disagreement.
No prospective clinical validation: The study was entirely retrospective, using existing digitized slides. There is no evidence of how this framework would perform in a real clinical workflow, including factors like digitization quality, turnaround time requirements, and integration with pathologist review processes. The absence of any external validation cohort is a notable gap that must be addressed before clinical adoption.
Generative adversarial networks for stain adaptation: The authors propose using GANs to normalize staining variations between institutions. This would allow models trained on Radboud data to generalize to slides processed with different staining protocols, such as those from Karolinska. Stain normalization using CycleGAN or similar architectures has already shown promise in other digital pathology studies and could be a practical path toward multi-institutional deployment.
Expanding the Karolinska dataset: Annotating the Karolinska cases with cancer-grade masks would effectively double the available training data and enable cross-institutional evaluation. The current limitation of binary (benign vs. malignant) masks in this dataset prevents its use for Gleason pattern-level training. Creating detailed Gleason annotations for these 5,456 cases would be a significant effort but could substantially improve model generalizability.
Relabeling noisy data: The authors suggest that better-performing segmentation models could be used to relabel the noisy cases that were excluded during preprocessing. This bootstrapping approach, where a trained model helps create improved training labels, could recover many of the 2,461 discarded cases and increase the effective training set size. However, this approach carries the risk of propagating model biases into the training labels.
Path to clinical utility: For this framework to become a practical prognostic tool, it would need prospective validation across multiple clinical sites, integration with existing laboratory information systems, and studies demonstrating its impact on pathologist workflow efficiency and diagnostic concordance. The strong QWK of 0.9215 suggests the technical foundation is solid, but the translational gap from retrospective research to clinical deployment remains significant.