Urothelial cell carcinoma is the most common cancer of the urinary bladder, accounting for approximately 90% of all bladder cancers. It predominantly affects males over 50 years of age, with higher incidence in Caucasian populations. Established risk factors include both genetic predispositions and environmental exposures such as smoking, petrochemicals, aniline dyes, auramines, phenacetin, and cyclophosphamide. Two distinct genetic pathways drive the disease: the FGFR3-associated pathway linked to low-grade non-invasive tumors, and the TP53/CDKN2A pathway associated with high-grade urothelial carcinoma.
Clinical significance of grading: The distinction between low-grade urothelial carcinoma (LGUC) and high-grade urothelial carcinoma (HGUC) has direct implications for surgical management and prognosis. HGUC can invade the detrusor muscle of the bladder wall, necessitating more aggressive surgery and more frequent follow-up. The WHO 2016 classification divides non-invasive papillary urothelial carcinoma into low and high grade, but this grading remains a diagnostic challenge for pathologists, with no immunohistochemical biomarkers available to reliably differentiate the two grades. Interobserver agreement for grading ranges from 65% to 88%, with kappa values between 0.30 and 0.73 across published studies.
Study objective: This study from Sri Ramachandra Institute of Higher Education and IIT Madras aims to classify urothelial carcinoma into low and high grade using a deep learning approach based on convolutional neural networks (CNNs). The researchers digitized hematoxylin and eosin-stained transurethral resection of bladder tumor (TURBT) specimens, extracted image patches from whole slide images, and trained a pretrained VGG16 model to perform binary classification. The study also incorporates Grad-CAM visualization to provide interpretability into the model's decision-making process.
Why AI is needed here: Traditional histopathological grading relies on subjective assessment of architectural and cytological features under light microscopy. LGUC shows orderly cellular arrangement with mild nuclear irregularity, while HGUC displays fused solid papillae, marked nuclear atypia, and numerous mitoses. A threshold of just 5% high-grade component suffices to classify a specimen as HGUC. This subjectivity, combined with the lack of objective biomarkers, makes this classification problem well suited for deep learning, which can provide a more objective, repeatable screening method.
Patient cohort: A total of 20 non-muscle invasive bladder cancer (NMIBC) specimens from 20 patients were collected from SRMC, Chennai. All patients underwent TURBT between January 2020 and December 2021. Only non-invasive papillary urothelial carcinoma cases (both low and high grade) were included. Muscle-invasive urothelial carcinoma, other urothelial pathologies, and resection specimens were excluded. The study was approved by the Sri Ramachandra Institutional Ethics Committee.
Digitization process: Hematoxylin and eosin-stained slides were prepared from 4-micrometer thick sections of formalin-fixed paraffin-embedded (FFPE) tissue blocks. These were digitized using the Morphle DigiPath 6T Scanner at 40x resolution, producing whole slide images (WSIs) with an in-plane resolution of 0.22 micrometers per pixel. The scanned WSIs ranged from 500 MB to 8.8 GB in size. Care was taken to clean slides before scanning to minimize artifacts related to slide preparation, staining, and scanning.
Patch generation: Two patch sizes were used. Patches of 256 x 256 pixels were extracted with zero overlap, yielding 134,500 and 144,500 patches for low- and high-grade carcinoma respectively. After filtering to retain only patches with at least 90% tumor tissue and manually removing patches with stain deposits, blurring, scanning artifacts, and pure stroma, the final dataset contained 67,500 LGUC patches and 64,700 HGUC patches. Larger patches of 1,024 x 1,024 pixels were also generated, yielding 12,450 LGUC and 13,440 HGUC patches after similar preprocessing.
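The zero-overlap tiling step can be sketched in a few lines. The following is a minimal illustration (not the authors' code), using a simple brightness threshold as a stand-in for the study's annotation-based 90% tumor-tissue filter:

```python
import numpy as np

def extract_patches(wsi, patch=256, tissue_frac=0.90, white_thresh=220):
    """Tile a WSI array into non-overlapping patches, keeping only those
    whose tissue fraction meets the threshold. Brightness thresholding
    stands in here for the study's annotation-driven tumor filtering."""
    h, w = wsi.shape[:2]
    kept = []
    for y in range(0, h - patch + 1, patch):       # zero overlap: stride == patch
        for x in range(0, w - patch + 1, patch):
            tile = wsi[y:y + patch, x:x + patch]
            # pixels darker than white_thresh on all channels count as tissue
            tissue = (tile < white_thresh).all(axis=-1).mean()
            if tissue >= tissue_frac:
                kept.append(((y, x), tile))
    return kept

# Toy example: a 512 x 512 "slide" that is half tissue, half background
wsi = np.full((512, 512, 3), 255, dtype=np.uint8)   # white background
wsi[:, :256] = 100                                  # dark "tissue" on the left
patches = extract_patches(wsi)
```

Only the two tiles over the dark left half survive the filter; the white background tiles are discarded, mirroring the removal of blank areas and pure stroma described above.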
Annotation quality control: All exported images were annotated manually by experienced pathologists. Tumor regions were delineated, and non-atypical urothelium, fibrovascular tissue, tissue folds, cauterization artifacts, blank areas, and out-of-focus regions were annotated and excluded. All annotations were verified by a specialized uropathologist. The grade of each tumor was initially assessed by 3 experienced pathologists using the WHO 2016 grading system, with a subsequent consensus reading in cases of disagreement. This rigorous multi-pathologist validation established a reliable ground truth for training the deep learning model.
Network architecture: The study employed VGG16, a 16-layer convolutional neural network developed by the Visual Geometry Group at Oxford University. The model was pretrained on the ImageNet dataset for generic image classification and then adapted for the binary grading task. The final classification layer of the original VGG16 was removed and replaced with a single fully connected layer for the two-class problem (LGUC vs. HGUC). A filter size of 8 x 8 was used. The softmax function served as the activation function for the output layer.
Transfer learning approach: Rather than training the entire network from scratch, the study fixed the ImageNet-pretrained feature maps in the earlier convolutional layers and only trained the additional fully connected layer on the bladder cancer dataset using gradient descent. This transfer learning strategy is particularly valuable when annotated histopathology data is scarce, as the pretrained layers already capture generalizable visual features (edges, textures, patterns) that transfer well to medical imaging tasks. The Adam optimizer was used with a learning rate of 0.001.
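A minimal Keras sketch of this setup might look as follows. The input size, the single-layer head, and `weights=None` (used here only to keep the example self-contained and offline, where the study loaded ImageNet weights) are illustrative assumptions, not the authors' exact code:

```python
import tensorflow as tf

def build_grader(input_shape=(256, 256, 3), pretrained=False):
    """VGG16 backbone with its original classifier removed and replaced
    by a single two-way softmax layer, as described in the study."""
    base = tf.keras.applications.VGG16(
        include_top=False,                    # drop the original 1000-class head
        weights="imagenet" if pretrained else None,  # study: ImageNet weights
        input_shape=input_shape,
    )
    base.trainable = False                    # freeze pretrained feature maps
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(2, activation="softmax"),  # LGUC vs. HGUC
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="categorical_crossentropy",
    )
    return model

model = build_grader()
```

Freezing the backbone (`base.trainable = False`) means gradient descent updates only the new fully connected layer, which is what makes the approach tractable with a 20-patient dataset.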
Training infrastructure and data split: The model was trained on an NVIDIA Quadro P2200 GPU with CUDA cores. The Keras framework (v2.4.0) with TensorFlow (v2.7.0) served as the backend. The dataset was split at the patient level into a 70:15:15 ratio for training, validation, and testing. This patient-level split ensured that patches from the same patient appeared only in one subset, preventing data leakage. For the 256 x 256 pixel patches, this corresponded to 92,540 training, 19,830 validation, and 19,830 testing patches.
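The leakage-free split can be sketched as follows — an illustrative helper, not the authors' code: patients are partitioned 70:15:15 first, and only then are each patient's patches pooled into the corresponding subset:

```python
import random

def patient_level_split(patches_by_patient, ratios=(0.70, 0.15, 0.15), seed=42):
    """Split patients (not patches) into train/val/test so that no
    patient's patches appear in more than one subset.
    `patches_by_patient` maps patient id -> list of patches."""
    ids = sorted(patches_by_patient)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)
    groups = (ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:])
    return [
        [p for pid in group for p in patches_by_patient[pid]]
        for group in groups
    ]

# Toy example: 20 patients with 10 patches each
data = {f"patient_{i}": [f"patient_{i}_patch_{j}" for j in range(10)]
        for i in range(20)}
train, val, test = patient_level_split(data)
```

With 20 patients this yields 14/3/3 patients per subset; every patch inherits its patient's assignment, so no patient contributes to both training and testing.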
Loss function and training dynamics: Cross-entropy loss was used to train the network, which is standard for classification problems. The training curves showed a steady gain in accuracy and decrease in loss over the training epochs, indicating that the model learned to classify patches into the two classes with increasing confidence. The curve reached a steady state after approximately 30 epochs, attributed to the relatively small number of unique patients in the dataset.
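For a single patch, cross-entropy is simply the negative log of the probability the softmax assigns to the true grade, which is why rising confidence in correct predictions drives the loss down. A toy illustration:

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy loss for one patch: -log of the probability the
    softmax output assigns to the true class (0 = LGUC, 1 = HGUC)."""
    return -np.log(probs[label])

# As the model grows more confident in the correct grade, the loss shrinks:
early = cross_entropy(np.array([0.45, 0.55]), label=1)   # uncertain prediction
late  = cross_entropy(np.array([0.05, 0.95]), label=1)   # confident prediction
```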
Patch size comparison: The most significant hyperparameter affecting accuracy was patch size. Using 1,024 x 1,024 pixel patches, the model achieved 91% accuracy for HGUC and 89% for LGUC, giving an overall accuracy of 90%. In contrast, 256 x 256 pixel patches yielded substantially lower accuracy: 78% for HGUC and 70% for LGUC. The authors attribute this difference to the fact that tissue architecture, which is a critical feature for grading, is best appreciated at low magnification and cannot be adequately captured in small patches. The 1,024 x 1,024 patches also demonstrated better interobserver correlation among pathologists.
Batch size effects: With the 1,024 x 1,024 pixel patches and a learning rate of 0.001, a batch size of 16 achieved the best overall accuracy of 90% (84% LGUC, 91% HGUC). Increasing the batch size to 32 slightly reduced accuracy to 88% (84% LGUC, 88% HGUC). The smaller batch size likely allowed the optimizer to explore a broader set of minima in the loss landscape, resulting in better generalization.
Learning rate impact: The learning rate controls the step size the algorithm takes during feature learning. At the optimal learning rate of 0.001, overall accuracy was 90%. Increasing the learning rate tenfold to 0.01 degraded performance to 81% for LGUC and 84% for HGUC (overall approximately 78%). A higher learning rate causes the optimizer to overshoot optimal weight values, preventing the network from converging to a well-performing solution. This sensitivity to learning rate underscores the importance of careful hyperparameter tuning in histopathology deep learning applications.
Epoch analysis: The training curve reached a steady state after approximately 30 epochs out of the 50 used for training. No considerable differences in accuracy were observed when changing the epoch count beyond this point. This early saturation is characteristic of relatively small datasets, where the model extracts the available information quickly. Despite the limited number of patients (20), the large number of patches (over 130,000) provided sufficient training signal for the transfer learning approach to converge reliably.
Performance on 1,024 x 1,024 patches: The confusion matrix for the larger patch size revealed strong classification performance. The model correctly classified 6,250 LGUC patches and 8,091 HGUC patches. There were 1,171 false positives (LGUC patches misclassified as HGUC) and 820 false negatives (HGUC patches misclassified as LGUC). This yielded a sensitivity of 88%, specificity of 87%, precision of 84%, and an F1 score of 86%. The low type I (false positive) and type II (false negative) error rates demonstrate robust classification capability.
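These figures follow the standard confusion-matrix formulas. A minimal helper, shown here with small hypothetical counts rather than the study's (the paper's positive-class convention is not fully explicit), might look like:

```python
def grading_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts,
    treating the high-grade call as the positive class."""
    sensitivity = tp / (tp + fn)          # HGUC patches correctly flagged
    specificity = tn / (tn + fp)          # LGUC patches correctly left alone
    precision   = tp / (tp + fp)          # fraction of HGUC calls that were right
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, f1

# Hypothetical counts for illustration (not the study's figures):
sens, spec, prec, f1 = grading_metrics(tp=8, fn=2, fp=2, tn=8)
```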
Performance on 256 x 256 patches: The smaller patch size produced notably weaker results. The confusion matrix showed 4,244 correct LGUC classifications and 5,740 correct HGUC classifications, but with substantially more errors: 1,740 false positives and 1,820 false negatives. This translated to sensitivity of 70%, specificity of 76%, precision of 70%, and an F1 score of 70%. The inferior performance confirms that small patches fail to capture the architectural patterns that pathologists use for grading decisions.
Asymmetric accuracy between grades: An interesting finding is the slightly better accuracy for HGUC (91%) compared to LGUC (89%) at the 1,024 x 1,024 patch size. The authors hypothesize that LGUC represents a more heterogeneous class because low-grade specimens can encompass areas of normal urothelium and papilloma-like architecture. HGUC, with its more uniformly atypical features (fused papillae, marked nuclear atypia, frequent mitoses), presents a more consistent pattern for the CNN to learn. This asymmetry was also reflected in the interobserver agreement: kappa was 0.9 for HGUC (almost perfect agreement) but 0.7 for LGUC (substantial agreement).
Comparison with existing literature: The 90% overall accuracy compares favorably with a similar study by Jansen et al., who used smaller 224 x 224 patches for a 3-class classification and achieved only 76% accuracy for LGUC and 71% for HGUC. That study employed a U-Net segmentation network and operated at 20x magnification. The improved performance in the current study is attributed to the larger patch size, which preserves the architectural context essential for grade discrimination, and the focused 2-class (rather than 3-class) problem formulation.
What Grad-CAM does: Gradient-weighted Class Activation Mapping (Grad-CAM) is a visualization technique that produces a coarse heat map highlighting the regions of an image that the deep learning model considers most important for its classification decision. It works by computing the gradients of the target class flowing into the final convolutional layer, using these gradient signals to weight the feature maps and produce a localization map. In the resulting heat map, red areas indicate high importance (regions the model relies on most), while blue and green areas represent low importance.
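The weighting step at the heart of Grad-CAM can be sketched in a few lines of NumPy. Here the final-layer activations and their gradients are assumed to be supplied by the framework's autodiff, and the array shapes are illustrative:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM localization map from the final conv layer's activations
    (H x W x K) and the gradients of the target class score with respect
    to them (same shape), both assumed precomputed by the framework."""
    # 1. Global-average-pool the gradients: one importance weight per map
    weights = gradients.mean(axis=(0, 1))                 # shape (K,)
    # 2. Weighted sum of feature maps, then ReLU keeps positive evidence only
    cam = np.maximum((feature_maps * weights).sum(axis=-1), 0.0)
    # 3. Normalize to [0, 1] so it can be overlaid as a heat map
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: a 4 x 4 spatial grid with 8 feature maps
rng = np.random.default_rng(0)
cam = grad_cam(rng.standard_normal((4, 4, 8)), rng.standard_normal((4, 4, 8)))
```

The normalized map is then upsampled to the patch size and color-mapped, producing the red (high importance) through blue (low importance) overlays described above.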
Application to urothelial carcinoma: In this study, Grad-CAM was applied to the final convolutional layer of the pretrained VGG16 model. The visualization consistently showed that the model identified cellular areas of the tumor as the most important features for classification. Stroma, blood vessels, and blank areas within the tissue were consistently colored blue or green, confirming that the model appropriately focused on tumor cellularity rather than non-diagnostic tissue components. This aligns with pathologist practice, where nuclear features and cellular arrangement are the primary criteria for grading.
Insights and inaccuracies: While the heat maps generally corroborated the model's learning behavior, some notable inaccuracies were observed. Certain images showed regions of regular, uniform LGUC cells highlighted in blue, suggesting the model did not always recognize orderly low-grade tumor as diagnostically significant. Additionally, the visualization revealed that the model sometimes misclassified artifacts as one of the two tumor classes rather than ignoring them. These insights are valuable because they expose failure modes that could be addressed through improved data preprocessing or artifact detection layers.
Clinical interpretability value: The overlay of Grad-CAM heat maps on the original H&E-stained slides provides an intuitive visual tool that helps pathologists understand the model's reasoning. This transparency is critical for adoption of AI as a companion diagnostic tool. Rather than operating as a "black box," the model can show exactly which tissue regions drove its grading decision. For a pathologist reviewing ambiguous cases, seeing the model's attention map on specific cellular areas can serve as a confirmatory second opinion, particularly in cases where the distinction between LGUC and HGUC is borderline.
Traditional vs. deep learning approaches: The authors discuss two broad strategies for automated grading. Traditional image processing, useful for tasks like Ki67 quantification and nucleus segmentation, relies on handcrafted features based on parameters such as size and hyperchromasia. While traditional machine learning algorithms such as Support Vector Machines (SVM) and Random Forest can yield decent results for some problems, CNNs are fundamentally better suited for image classification because they automatically learn hierarchical features from the data, layer by layer, without manual feature engineering.
Related work in other cancers: The study situates itself within a growing body of AI-pathology research. Inception v3 has been used to detect epithelial tumors in stomach and colon with good results. Dimensionality reduction techniques such as t-SNE have been applied to neuropathological tissue samples to better understand learned features. AI applications extend to breast pathology, bone pathology, prostatic pathology (automated Gleason grading), and lung pathology. Notably, AI-based pathology has been shown to predict the origin of cancers of unknown primary, demonstrating the breadth of CNN applicability across oncology.
Grad-CAM in medical imaging: The use of Grad-CAM for model interpretability extends well beyond this study. It has been deployed for adenocarcinoma classification in lung and colon histology slides, for detecting colorectal cancer in colonoscopy images, and for identifying diabetic retinopathy in fundus photographs. The consistent utility of Grad-CAM across these diverse imaging modalities validates its use as a standard visualization tool for explaining deep learning decisions in clinical contexts.
Potential for expanded applications: The authors note that if combined with gene expression data, AI models could potentially predict tumor recurrence. A prognostic AI-monitor has already been developed for metastatic urothelial cancer patients receiving immunotherapy. AI is also making significant inroads in radiology, dermatology, and neurology, suggesting that the computational pathology approach demonstrated in this study could be integrated into broader multi-modal diagnostic workflows in the future.
Sample size limitation: The most significant constraint is the small cohort of only 20 patients. While the patch extraction strategy generated over 130,000 image patches, the underlying biological diversity is limited to 20 individuals from a single institution. This raises concerns about generalizability to broader populations with different demographics, tissue preparation protocols, and scanning equipment. The authors acknowledge that inter-institutional comparison would be a better test for the practical utility of such algorithms in real-world clinical settings.
Sources of the 10% inaccuracy: The approximately 10% error rate can be partially explained by the presence of low-grade areas within predominantly high-grade specimens, and vice versa. LGUC samples sometimes contain areas of normal urothelium and papilloma-like architecture, making them a more heterogeneous class. Additionally, artifacts from cauterization, crush damage during TURBT extraction, blurred regions, stain deposits, and Von Brunn's nests (solid nests of benign urothelium in the lamina propria) can confound classification. Despite the 90% tumor tissue threshold for patch inclusion, some of these confounders persisted.
Ethical considerations: The authors raise an important point about the ethical implications of deploying AI screening tools without human oversight: an independent AI screening test raises questions of responsibility when a diagnosis is wrong. Very high accuracy would be needed before deploying such a tool as a standalone screening instrument. The current 90% accuracy supports its use as a companion diagnostic tool that assists pathologists, rather than a replacement for human judgment.
Future research directions: Several promising avenues are outlined. Segmentation of images to delineate tumor areas from stroma, muscle, lymphatics, and blood vessels could improve classification accuracy. Incorporating life expectancy data could enable predictions about 5-year survival rate and prognosis. Another compelling application is the prediction of germline mutations based on subtle morphological clues detected by deep learning algorithms. The authors also note that cheaper storage, easier-to-deploy AI tools, and growing computational infrastructure will make computational pathology more economical and practical for routine use in pathology laboratories and even in pathology education.
Path to clinical deployment: For this model to reach clinical utility, it would need validation on multi-institutional datasets, testing across different scanners and staining protocols, and demonstration of consistent performance across diverse patient populations. The infrastructure requirements for computational pathology, including high-performance GPUs, large file storage for whole slide images, and integration with laboratory information systems, represent additional practical barriers that must be addressed before widespread adoption.