Urothelial cell carcinoma is the most common cancer of the urinary bladder, accounting for approximately 90% of all bladder cancers. It predominantly affects males over 50 years of age, with higher incidence in Caucasian populations. Established risk factors include both genetic predispositions and environmental exposures such as smoking, petrochemicals, aniline dyes, auramines, phenacetin, and cyclophosphamide. Two distinct genetic pathways drive the disease: the FGFR3-associated pathway linked to low-grade non-invasive tumors, and the TP53/CDKN2A pathway associated with high-grade urothelial carcinoma.
Clinical significance of grading: The distinction between low-grade urothelial carcinoma (LGUC) and high-grade urothelial carcinoma (HGUC) has direct implications for surgical management and prognosis. HGUC can invade the detrusor muscle of the bladder wall, necessitating more aggressive surgery and more frequent follow-up. The WHO 2016 classification divides non-invasive papillary urothelial carcinoma into low and high grade, but this grading remains a diagnostic challenge for pathologists, with no immunohistochemical biomarkers available to reliably differentiate the two grades. Interobserver agreement for grading ranges from 65% to 88%, with kappa values between 0.30 and 0.73 across published studies.
Study objective: This study from Sri Ramachandra Institute of Higher Education and IIT Madras aims to classify urothelial carcinoma into low and high grade using a deep learning approach based on convolutional neural networks (CNNs). The researchers digitized hematoxylin and eosin-stained transurethral resection of bladder tumor (TURBT) specimens, extracted image patches from whole slide images, and trained a pretrained VGG16 model to perform binary classification. The study also incorporates Grad-CAM visualization to provide interpretability into the model's decision-making process.
Why AI is needed here: Traditional histopathological grading relies on subjective assessment of architectural and cytological features under light microscopy. LGUC shows orderly cellular arrangement with mild nuclear irregularity, while HGUC displays fused solid papillae, marked nuclear atypia, and numerous mitoses. A threshold of just 5% high-grade component suffices to classify a specimen as HGUC. This subjectivity, combined with the lack of objective biomarkers, makes this classification problem well suited for deep learning, which can provide a more objective, repeatable screening method.
Patient cohort: A total of 20 non-muscle invasive bladder cancer (NMIBC) specimens from 20 patients were collected from SRMC, Chennai. All patients underwent TURBT between January 2020 and December 2021. Only non-invasive papillary urothelial carcinoma cases (both low and high grade) were included. Muscle-invasive urothelial carcinoma, other urothelial pathologies, and resection specimens were excluded. The study was approved by the Sri Ramachandra Institutional Ethics Committee.
Digitization process: Hematoxylin and eosin-stained slides were prepared from 4-micrometer thick sections of formalin-fixed paraffin-embedded (FFPE) tissue blocks. These were digitized using the Morphle DigiPath 6T Scanner at 40x resolution, producing whole slide images (WSIs) with an in-plane resolution of 0.22 micrometers per pixel. The scanned WSIs ranged from 500 MB to 8.8 GB in size. Care was taken to clean slides before scanning to minimize artifacts related to slide preparation, staining, and scanning.
Patch generation: Two patch sizes were used. Patches of 256 x 256 pixels were extracted with zero overlap, yielding 134,500 and 144,500 patches for low- and high-grade carcinoma respectively. After filtering to retain only patches with at least 90% tumor tissue and manually removing patches with stain deposits, blurring, scanning artifacts, and pure stroma, the final dataset contained 67,500 LGUC patches and 64,700 HGUC patches. Larger patches of 1,024 x 1,024 pixels were also generated, yielding 12,450 LGUC and 13,440 HGUC patches after similar preprocessing.
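The zero-overlap tiling step can be sketched in a few lines. The following is a minimal illustration (not the authors' code), using a simple brightness threshold as a stand-in for the study's annotation-based 90% tumor-tissue filter:

```python
import numpy as np

def extract_patches(wsi, patch=256, tissue_frac=0.90, white_thresh=220):
    """Tile a WSI array into non-overlapping patches, keeping only those
    whose tissue fraction meets the threshold. Brightness thresholding
    stands in here for the study's annotation-driven tumor filtering."""
    h, w = wsi.shape[:2]
    kept = []
    for y in range(0, h - patch + 1, patch):       # zero overlap: stride == patch
        for x in range(0, w - patch + 1, patch):
            tile = wsi[y:y + patch, x:x + patch]
            # pixels darker than white_thresh on all channels count as tissue
            tissue = (tile < white_thresh).all(axis=-1).mean()
            if tissue >= tissue_frac:
                kept.append(((y, x), tile))
    return kept

# Toy example: a 512 x 512 "slide" that is half tissue, half background
wsi = np.full((512, 512, 3), 255, dtype=np.uint8)   # white background
wsi[:, :256] = 100                                  # dark "tissue" on the left
patches = extract_patches(wsi)
```

Only the two tiles over the dark left half survive the filter; the white background tiles are discarded, mirroring the removal of blank areas and pure stroma described above.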
Annotation quality control: All exported images were annotated manually by experienced pathologists. Tumor regions were delineated, and non-atypical urothelium, fibrovascular tissue, tissue folds, cauterization artifacts, blank areas, and out-of-focus regions were annotated and excluded. All annotations were verified by a specialized uropathologist. The grade of each tumor was initially assessed by 3 experienced pathologists using the WHO 2016 grading system, with a subsequent consensus reading in cases of disagreement. This rigorous multi-pathologist validation established a reliable ground truth for training the deep learning model.
Network architecture: The study employed VGG16, a 16-layer convolutional neural network developed by the Visual Geometry Group at Oxford University. The model was pretrained on the ImageNet dataset for generic image classification and then adapted for the binary grading task. The final classification layer of the original VGG16 was removed and replaced with a single fully connected layer for the two-class problem (LGUC vs. HGUC). A filter size of 8 x 8 was used. The softmax function served as the activation function for the output layer.
Transfer learning approach: Rather than training the entire network from scratch, the study fixed the ImageNet-pretrained feature maps in the earlier convolutional layers and only trained the additional fully connected layer on the bladder cancer dataset using gradient descent. This transfer learning strategy is particularly valuable when annotated histopathology data is scarce, as the pretrained layers already capture generalizable visual features (edges, textures, patterns) that transfer well to medical imaging tasks. The Adam optimizer was used with a learning rate of 0.001.
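A minimal Keras sketch of this setup might look as follows. The input size, the single-layer head, and `weights=None` (used here only to keep the example self-contained and offline, where the study loaded ImageNet weights) are illustrative assumptions, not the authors' exact code:

```python
import tensorflow as tf

def build_grader(input_shape=(256, 256, 3), pretrained=False):
    """VGG16 backbone with its original classifier removed and replaced
    by a single two-way softmax layer, as described in the study."""
    base = tf.keras.applications.VGG16(
        include_top=False,                    # drop the original 1000-class head
        weights="imagenet" if pretrained else None,  # study: ImageNet weights
        input_shape=input_shape,
    )
    base.trainable = False                    # freeze pretrained feature maps
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(2, activation="softmax"),  # LGUC vs. HGUC
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="categorical_crossentropy",
    )
    return model

model = build_grader()
```

Freezing the backbone (`base.trainable = False`) means gradient descent updates only the new fully connected layer, which is what makes the approach tractable with a 20-patient dataset.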
Training infrastructure and data split: The model was trained on an NVIDIA Quadro P2200 GPU with CUDA cores. The Keras framework (v2.4.0) with TensorFlow (v2.7.0) served as the backend. The dataset was split at the patient level into a 70:15:15 ratio for training, validation, and testing. This patient-level split ensured that patches from the same patient appeared only in one subset, preventing data leakage. For the 256 x 256 pixel patches, this corresponded to 92,540 training, 19,830 validation, and 19,830 testing patches.
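The leakage-free split can be sketched as follows — an illustrative helper, not the authors' code: patients are partitioned 70:15:15 first, and only then are each patient's patches pooled into the corresponding subset:

```python
import random

def patient_level_split(patches_by_patient, ratios=(0.70, 0.15, 0.15), seed=42):
    """Split patients (not patches) into train/val/test so that no
    patient's patches appear in more than one subset.
    `patches_by_patient` maps patient id -> list of patches."""
    ids = sorted(patches_by_patient)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)
    groups = (ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:])
    return [
        [p for pid in group for p in patches_by_patient[pid]]
        for group in groups
    ]

# Toy example: 20 patients with 10 patches each
data = {f"patient_{i}": [f"patient_{i}_patch_{j}" for j in range(10)]
        for i in range(20)}
train, val, test = patient_level_split(data)
```

With 20 patients this yields 14/3/3 patients per subset; every patch inherits its patient's assignment, so no patient contributes to both training and testing.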
Loss function and training dynamics: Cross-entropy loss was used to train the network, which is standard for classification problems. The training curves showed a steady gain in accuracy and decrease in loss over the training epochs, indicating that the model learned to classify patches into the two classes with increasing confidence. The curve reached a steady state after approximately 30 epochs, attributed to the relatively small number of unique patients in the dataset.
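For a single patch, cross-entropy is simply the negative log of the probability the softmax assigns to the true grade, which is why rising confidence in correct predictions drives the loss down. A toy illustration:

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy loss for one patch: -log of the probability the
    softmax output assigns to the true class (0 = LGUC, 1 = HGUC)."""
    return -np.log(probs[label])

# As the model grows more confident in the correct grade, the loss shrinks:
early = cross_entropy(np.array([0.45, 0.55]), label=1)   # uncertain prediction
late  = cross_entropy(np.array([0.05, 0.95]), label=1)   # confident prediction
```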
Patch size comparison: The most significant hyperparameter affecting accuracy was patch size. Using 1,024 x 1,024 pixel patches, the model achieved 91% accuracy for HGUC and 89% for LGUC, giving an overall accuracy of 90%. In contrast, 256 x 256 pixel patches yielded substantially lower accuracy: 78% for HGUC and 70% for LGUC. The authors attribute this difference to the fact that tissue architecture, which is a critical feature for grading, is best appreciated at low magnification and cannot be adequately captured in small patches. The 1,024 x 1,024 patches also demonstrated better interobserver correlation among pathologists.
Batch size effects: With the 1,024 x 1,024 pixel patches and a learning rate of 0.001, a batch size of 16 achieved the best overall accuracy of 90% (84% LGUC, 91% HGUC). Increasing the batch size to 32 slightly reduced accuracy to 88% (84% LGUC, 88% HGUC). The smaller batch size likely allowed the optimizer to explore a broader set of minima in the loss landscape, resulting in better generalization.
Learning rate impact: The learning rate controls the step size the algorithm takes during feature learning. At the optimal learning rate of 0.001, overall accuracy was 90%. Increasing the learning rate tenfold to 0.01 degraded performance to 81% for LGUC and 84% for HGUC (overall approximately 78%). A higher learning rate causes the optimizer to overshoot optimal weight values, preventing the network from converging to a well-performing solution. This sensitivity to learning rate underscores the importance of careful hyperparameter tuning in histopathology deep learning applications.
Epoch analysis: The training curve reached a steady state after approximately 30 epochs out of the 50 used for training. No considerable differences in accuracy were observed when changing the epoch count beyond this point. This early saturation is characteristic of relatively small datasets, where the model extracts the available information quickly. Despite the limited number of patients (20), the large number of patches (over 130,000) provided sufficient training signal for the transfer learning approach to converge reliably.
Performance on 1,024 x 1,024 patches: The confusion matrix for the larger patch size revealed strong classification performance. The model correctly classified 6,250 LGUC patches and 8,091 HGUC patches. There were 1,171 false positives (LGUC patches misclassified as HGUC) and 820 false negatives (HGUC patches misclassified as LGUC). This yielded a sensitivity of 88%, specificity of 87%, precision of 84%, and an F1 score of 86%. The low type I (false positive) and type II (false negative) error rates demonstrate robust classification capability.
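These figures follow the standard confusion-matrix formulas. A minimal helper, shown here with small hypothetical counts rather than the study's (the paper's positive-class convention is not fully explicit), might look like:

```python
def grading_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts,
    treating the high-grade call as the positive class."""
    sensitivity = tp / (tp + fn)          # HGUC patches correctly flagged
    specificity = tn / (tn + fp)          # LGUC patches correctly left alone
    precision   = tp / (tp + fp)          # fraction of HGUC calls that were right
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, f1

# Hypothetical counts for illustration (not the study's figures):
sens, spec, prec, f1 = grading_metrics(tp=8, fn=2, fp=2, tn=8)
```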
Performance on 256 x 256 patches: The smaller patch size produced notably weaker results. The confusion matrix showed 4,244 correct LGUC classifications and 5,740 correct HGUC classifications, but with substantially more errors: 1,740 false positives and 1,820 false negatives. This translated to sensitivity of 70%, specificity of 76%, precision of 70%, and an F1 score of 70%. The inferior performance confirms that small patches fail to capture the architectural patterns that pathologists use for grading decisions.
Asymmetric accuracy between grades: An interesting finding is the slightly better accuracy for HGUC (91%) compared to LGUC (89%) at the 1,024 x 1,024 patch size. The authors hypothesize that LGUC represents a more heterogeneous class because low-grade specimens can encompass areas of normal urothelium and papilloma-like architecture. HGUC, with its more uniformly atypical features (fused papillae, marked nuclear atypia, frequent mitoses), presents a more consistent pattern for the CNN to learn. This asymmetry was also reflected in the interobserver agreement: kappa was 0.9 for HGUC (almost perfect agreement) but 0.7 for LGUC (substantial agreement).
Comparison with existing literature: The 90% overall accuracy compares favorably with a similar study by Jansen et al., who used smaller 224 x 224 patches for a 3-class classification and achieved only 76% accuracy for LGUC and 71% for HGUC. That study employed a U-Net segmentation network and operated at 20x magnification. The improved performance in the current study is attributed to the larger patch size, which preserves the architectural context essential for grade discrimination, and the focused 2-class (rather than 3-class) problem formulation.
What Grad-CAM does: Gradient-weighted Class Activation Mapping (Grad-CAM) is a visualization technique that produces a coarse heat map highlighting the regions of an image that the deep learning model considers most important for its classification decision. It works by computing the gradients of the target class flowing into the final convolutional layer, using these gradient signals to weight the feature maps and produce a localization map. In the resulting heat map, red areas indicate high importance (regions the model relies on most), while blue and green areas represent low importance.
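The weighting step at the heart of Grad-CAM can be sketched in a few lines of NumPy. Here the final-layer activations and their gradients are assumed to be supplied by the framework's autodiff, and the array shapes are illustrative:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM localization map from the final conv layer's activations
    (H x W x K) and the gradients of the target class score with respect
    to them (same shape), both assumed precomputed by the framework."""
    # 1. Global-average-pool the gradients: one importance weight per map
    weights = gradients.mean(axis=(0, 1))                 # shape (K,)
    # 2. Weighted sum of feature maps, then ReLU keeps positive evidence only
    cam = np.maximum((feature_maps * weights).sum(axis=-1), 0.0)
    # 3. Normalize to [0, 1] so it can be overlaid as a heat map
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: a 4 x 4 spatial grid with 8 feature maps
rng = np.random.default_rng(0)
cam = grad_cam(rng.standard_normal((4, 4, 8)), rng.standard_normal((4, 4, 8)))
```

The normalized map is then upsampled to the patch size and color-mapped, producing the red (high importance) through blue (low importance) overlays described above.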
Application to urothelial carcinoma: In this study, Grad-CAM was applied to the final convolutional layer of the pretrained VGG16 model. The visualization consistently showed that the model identified cellular areas of the tumor as the most important features for classification. Stroma, blood vessels, and blank areas within the tissue were consistently colored blue or green, confirming that the model appropriately focused on tumor cellularity rather than non-diagnostic tissue components. This aligns with pathologist practice, where nuclear features and cellular arrangement are the primary criteria for grading.
Insights and inaccuracies: While the heat maps generally corroborated the model's learning behavior, some notable inaccuracies were observed. Certain images showed regions of regular, uniform LGUC cells highlighted in blue, suggesting the model did not always recognize orderly low-grade tumor as diagnostically significant. Additionally, the visualization revealed that the model sometimes misclassified artifacts as one of the two tumor classes rather than ignoring them. These insights are valuable because they expose failure modes that could be addressed through improved data preprocessing or artifact detection layers.
Clinical interpretability value: The overlay of Grad-CAM heat maps on the original H&E-stained slides provides an intuitive visual tool that helps pathologists understand the model's reasoning. This transparency is critical for adoption of AI as a companion diagnostic tool. Rather than operating as a "black box," the model can show exactly which tissue regions drove its grading decision. For a pathologist reviewing ambiguous cases, seeing the model's attention map on specific cellular areas can serve as a confirmatory second opinion, particularly in cases where the distinction between LGUC and HGUC is borderline.
Traditional vs. deep learning approaches: The authors discuss two broad strategies for automated grading. Traditional image processing, useful for tasks like Ki67 quantification and nucleus segmentation, relies on handcrafted features based on parameters such as size and hyperchromasia. While traditional machine learning algorithms such as Support Vector Machines (SVM) and Random Forest can yield decent results for some problems, CNNs are fundamentally better suited for image classification because they automatically learn hierarchical features from the data, layer by layer, without manual feature engineering.
Related work in other cancers: The study situates itself within a growing body of AI-pathology research. Inception v3 has been used to detect epithelial tumors in stomach and colon with good results. Dimensionality reduction techniques such as t-SNE have been applied to neuropathological tissue samples to better understand learned features. AI applications extend to breast pathology, bone pathology, prostatic pathology (automated Gleason grading), and lung pathology. Notably, AI-based pathology has been shown to predict the origin of cancers of unknown primary, demonstrating the breadth of CNN applicability across oncology.
Grad-CAM in medical imaging: The use of Grad-CAM for model interpretability extends well beyond this study. It has been deployed for adenocarcinoma classification in lung and colon histology slides, for detecting colorectal cancer in colonoscopy images, and for identifying diabetic retinopathy in fundus photographs. The consistent utility of Grad-CAM across these diverse imaging modalities validates its use as a standard visualization tool for explaining deep learning decisions in clinical contexts.
Potential for expanded applications: The authors note that if combined with gene expression data, AI models could potentially predict tumor recurrence. A prognostic AI-monitor has already been developed for metastatic urothelial cancer patients receiving immunotherapy. AI is also making significant inroads in radiology, dermatology, and neurology, suggesting that the computational pathology approach demonstrated in this study could be integrated into broader multi-modal diagnostic workflows in the future.
Sample size limitation: The most significant constraint is the small cohort of only 20 patients. While the patch extraction strategy generated over 130,000 image patches, the underlying biological diversity is limited to 20 individuals from a single institution. This raises concerns about generalizability to broader populations with different demographics, tissue preparation protocols, and scanning equipment. The authors acknowledge that inter-institutional comparison would be a better test for the practical utility of such algorithms in real-world clinical settings.
Sources of the 10% inaccuracy: The approximately 10% error rate can be partially explained by the presence of low-grade areas within predominantly high-grade specimens, and vice versa. LGUC samples sometimes contain areas of normal urothelium and papilloma-like architecture, making them a more heterogeneous class. Additionally, artifacts from cauterization, crush damage during TURBT extraction, blurred regions, stain deposits, and Von Brunn's nests (solid nests of benign urothelium in the lamina propria) can confound classification. Despite the 90% tumor tissue threshold for patch inclusion, some of these confounders persisted.
Ethical considerations: The authors raise an important point about the ethical implications of deploying AI screening tools without human oversight: an independent AI screening test raises questions of responsibility when a diagnosis is wrong. Very high accuracy would be needed before deploying such a tool as a standalone screening instrument. The current 90% accuracy supports its use as a companion diagnostic tool that assists pathologists, rather than a replacement for human judgment.
Future research directions: Several promising avenues are outlined. Segmentation of images to delineate tumor areas from stroma, muscle, lymphatics, and blood vessels could improve classification accuracy. Incorporating life expectancy data could enable predictions about 5-year survival rate and prognosis. Another compelling application is the prediction of germline mutations based on subtle morphological clues detected by deep learning algorithms. The authors also note that cheaper storage, easier-to-deploy AI tools, and growing computational infrastructure will make computational pathology more economical and practical for routine use in pathology laboratories and even in pathology education.
Path to clinical deployment: For this model to reach clinical utility, it would need validation on multi-institutional datasets, testing across different scanners and staining protocols, and demonstration of consistent performance across diverse patient populations. The infrastructure requirements for computational pathology, including high-performance GPUs, large file storage for whole slide images, and integration with laboratory information systems, represent additional practical barriers that must be addressed before widespread adoption.