Bladder cancer (BCa) is one of the most expensive cancers to manage over a patient's lifetime, primarily because it demands long-term surveillance with repeated cystoscopy procedures and imaging. Current diagnostic workflows rely on white-light cystoscopy, urine cytology, CT urography, and MRI, all of which carry well-documented limitations. White-light cystoscopy misses tumors at a rate of 10 to 20%, urine cytology has sensitivity as low as 16% for low-grade tumors, and biopsies obtained during transurethral resection of bladder tumor (TURBT) understage muscle invasion in up to 50% of T1 cases.
This comprehensive 2023 review by Ferro et al. from the European Institute of Oncology and 13 other institutions searched PubMed for articles from the last ten years using keywords including "artificial intelligence," "machine learning," "deep learning," "diagnosis," and "bladder cancer." From 94 initially retrieved papers, 48 studies met the inclusion criteria and were analyzed. The review spans the full spectrum of AI applications in BCa diagnosis: cystoscopy-based tumor detection, urine cytology automation, urine metabolome profiling, bladder segmentation, radiomics-based imaging, tumor grading, histopathology analysis, and cancer staging.
The paper introduces the foundational AI concepts relevant to BCa research. Machine learning (ML) uses mathematical and statistical algorithms trained on labeled datasets to build classification and prediction models. Deep learning (DL) relies on multi-layer artificial neural networks, particularly convolutional neural networks (CNNs), which excel at image classification tasks. CNNs recognize visual patterns through successive convolutional and pooling layers and have achieved expert-level performance in dermatology, ophthalmology, and oncology. The review notes that DL approaches increasingly combine genomic, transcriptomic, and histopathological data to enhance diagnosis, prognosis, and treatment selection.
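How convolution and pooling work: The convolutional and pooling layers described above can be illustrated with a minimal NumPy sketch. This is a toy for intuition only; the function names (`conv2d`, `max_pool`) and the edge-detector kernel are illustrative and not taken from any reviewed model.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: slide the kernel over the image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the strongest response per window."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# A tiny edge detector applied to a toy 6x6 "image":
image = np.zeros((6, 6))
image[:, 3:] = 1.0                      # right half bright, left half dark
edge_kernel = np.array([[-1., 1.]])     # responds to dark-to-bright transitions
fmap = conv2d(image, edge_kernel)       # shape (6, 5), fires at the boundary
pooled = max_pool(np.maximum(fmap, 0))  # ReLU then 2x2 max pool -> shape (3, 2)
```

Stacking many such filter-and-pool stages, with the filters learned from labeled data rather than hand-set, is what lets CNNs build up from edges to tumor-level visual patterns.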
A key limitation flagged throughout the paper is overfitting, where algorithms perform well on their training data but poorly on new patient populations. This is compounded by limited data availability in healthcare due to privacy regulations, economic constraints, and the difficulty of standardizing certain data types like cystoscopy video.
Cystoscopy is the standard tool for initial diagnosis and follow-up surveillance of BCa, but human interpretation errors under white light remain a significant problem with miss rates up to 20%. Multiple research groups developed CNN-based systems to serve as physician-assistant tools during cystoscopy, and nearly all achieved AUCs above 0.95. Because AI is immune to fatigue, stress, and burnout, these tools can complement the endoscopist's judgment in real time.
Top-performing models: Eminaga et al. tested several CNN architectures on cystoscopy images and found the Xception model performed best with an F1 score of 99.52%. Lorencin et al. used an ANN on 1997 BCa images and 986 benign images, achieving an AUC of 0.99. The same team later applied another CNN to 2983 cystoscopy images and again achieved AUC 0.99. Ikeda et al. developed a CNN on 2102 cystoscopy images that reached 89.7% sensitivity and 94% specificity. Yang et al. compared CNNs including LeNet, AlexNet, and GoogLeNet with the EasyDL platform and found EasyDL achieved the best accuracy at 96.9%. Du et al. confirmed this finding, with EasyDL reaching 96.9% accuracy compared to 82.9% for Caffe DL on 1736 cystoscopy images.
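Computing the reported metrics: The sensitivity, specificity, accuracy, and F1 scores quoted throughout these studies all derive from the four confusion-matrix counts. A minimal sketch; the counts below are hypothetical, chosen only for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall: tumors correctly flagged
    specificity = tn / (tn + fp)            # benign frames correctly cleared
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "f1": f1}

# Hypothetical counts for illustration only (not from any reviewed study).
m = classification_metrics(tp=90, fp=5, tn=95, fn=10)
```

Note that accuracy alone can mislead when tumor frames are rare, which is why these studies report sensitivity and specificity separately.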
CystoNet and ResNet-based systems: Shkolyar et al. developed CystoNet, an automated CNN system trained on 2335 normal urothelium images and 417 papillary carcinoma images from 95 patients, achieving 91% sensitivity and 99% specificity. Wu et al. built a diagnostic system using the ResNet 101 model with the pyramid scene parsing network (PSPNet) framework and demonstrated 93.9% accuracy and 95.4% sensitivity, outperforming expert urologists in speed and accuracy. Ali et al. combined photodynamic detection (PDD) blue-light cystoscopy with a CNN, achieving 95.77% sensitivity and 87.84% specificity for tumor detection, and 88% sensitivity and 96.56% specificity for assessing invasiveness.
SVM-based approach: Yoo et al. used SVM models with white-light cystoscopy and a red-green-blue (RGB) color analysis method to detect and grade BCa. Their system achieved 95.0% sensitivity, 93.7% specificity, 94.1% diagnostic accuracy, and a dice similarity coefficient (DSC) of 74.7%. The RGB method distinguished benign from low-grade lesions with 98% accuracy and detected inflammatory lesions and carcinoma in situ (CIS) with over 90% accuracy.
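Sketch of color-feature classification: Yoo et al.'s exact pipeline is not detailed here; the sketch below only illustrates the general idea of collapsing a lesion region of interest into per-channel color statistics and classifying on them. A nearest-centroid rule stands in for their SVM, and all images and color values are synthetic.

```python
import numpy as np

def rgb_features(roi):
    """Collapse an (H, W, 3) region of interest to per-channel mean and std."""
    pixels = roi.reshape(-1, 3)
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])

def nearest_centroid(feat, centroids):
    """Assign the feature vector to the closest class centroid."""
    dists = {label: np.linalg.norm(feat - c) for label, c in centroids.items()}
    return min(dists, key=dists.get)

# Synthetic ROIs: tumor tissue rendered redder than benign mucosa (illustrative).
rng = np.random.default_rng(0)
benign = rng.normal([0.7, 0.5, 0.5], 0.05, size=(32, 32, 3))
tumor = rng.normal([0.8, 0.3, 0.3], 0.05, size=(32, 32, 3))
centroids = {"benign": rgb_features(benign), "tumor": rgb_features(tumor)}

query = rng.normal([0.8, 0.3, 0.3], 0.05, size=(32, 32, 3))
label = nearest_centroid(rgb_features(query), centroids)
```

The design point is that a handful of interpretable color statistics can feed a classical classifier, in contrast to the end-to-end CNNs above that learn their features directly from pixels.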
The clinical problem: Urine cytology is a non-invasive test for BCa, but it suffers from low sensitivity, particularly for low-grade tumors (as low as 16%). It is also user-dependent and can be hampered by low cellular yield, urinary tract infections, and stones. Several research groups applied DL models to automate and improve urine cytology interpretation, addressing both the sensitivity gap and the inter-observer variability among pathologists.
VGG16 CNN for cytology: Nojima et al. used a 16-layer Visual Geometry Group network (VGG16) to predict whether urinary cytology samples contained malignant or high-grade lesions. The model achieved excellent performance for cancer-versus-benign differentiation (AUC 0.989, F1 score 0.900), for identifying invasive BCa (AUC 0.863, F1 score 0.824), and for high-grade BCa detection (AUC 0.866, F1 score 0.822). Awan et al. developed a method to automatically identify atypical and neoplastic cells and found that the Xception model performed best in their validation set with an AUC of 0.99.
Paris System automation: Vaickus et al. built a hybrid DL and morphometric model using AlexNet/ResNet to automate the Paris System for reporting urine cytology. Using whole slide images from 51 negative, 60 atypical, 52 suspicious, and 54 positive cases, their model achieved a 95% accuracy rate for cell type and atypia detection. Sanghvi et al. applied a CNN to 2405 ThinPrep glass slides and validated on a separate dataset, achieving an AUC of 0.88 (95% CI: 0.83 to 0.93) with 79.5% sensitivity and 84.5% specificity for high-grade urothelial carcinoma. Lilli et al. reported that standard CNN algorithms had weak performance on urinary cytopathology, but applying focal loss improved accuracy to 89.90%.
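What focal loss does: The focal loss applied by Lilli et al. (introduced by Lin et al. for object detection) down-weights well-classified examples by a factor of (1 − p_t)^γ, so training concentrates on hard, often minority-class cells. A minimal binary version:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy, well-classified examples.

    p: predicted probability of the positive class; y: 0/1 labels.
    With gamma = 0 and alpha = 0.5 this reduces to half the usual
    cross-entropy, so gamma directly controls the re-weighting.
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))

# An easy example (p_t = 0.95) contributes far less than a hard one (p_t = 0.3).
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.30]), np.array([1]))
```

Because malignant cells are heavily outnumbered by benign ones on a cytology slide, this re-weighting plausibly explains the accuracy gain Lilli et al. report over a plain cross-entropy CNN.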
Atomic force microscopy and biomarkers: Sokolov et al. used atomic force microscopy to scan bladder cell surfaces, analyzing multiple parameters with an ML model to non-invasively detect BCa in 25 cancer and 43 control patients, achieving 94% accuracy, significantly outperforming cystoscopy in that cohort. Khosravi et al. used CNN methods to differentiate four biomarkers of BCa and four immunohistochemistry staining scores, with a fine-tuned Inception V1 model reaching 99% accuracy for blood biomarker discrimination.
Metabolomic biomarkers: Beyond cytology, researchers explored whether urine metabolites could serve as BCa biomarkers, since the bladder is in direct contact with urine. Shao et al. profiled 87 BCa samples and 65 control samples using an ML model with a decision tree (DT) classifier. They identified imidazoleacetic acid as a marker potentially related to BCa, achieving 76.60% accuracy, 71.88% sensitivity, and 86.67% specificity.
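Decision tree intuition: A decision tree classifier like the one Shao et al. applied splits samples on learned thresholds over feature values. The single-split "stump" below shows the core idea on hypothetical metabolite intensities; the values are synthetic, not their data:

```python
import numpy as np

def fit_stump(values, labels):
    """Learn a single-split decision stump: scan candidate thresholds on one
    metabolite's level and keep the one that best separates cancer (1)
    from control (0), assuming higher values indicate cancer."""
    v = np.sort(values)
    best_thr, best_acc = v[0], 0.0
    for thr in (v[:-1] + v[1:]) / 2:            # midpoints between sorted values
        acc = np.mean((values > thr).astype(int) == labels)
        if acc > best_acc:
            best_acc, best_thr = acc, thr
    return best_thr, best_acc

# Hypothetical metabolite intensities (arbitrary units).
rng = np.random.default_rng(1)
control = rng.normal(1.0, 0.3, 40)
cancer = rng.normal(2.0, 0.3, 40)
values = np.concatenate([control, cancer])
labels = np.concatenate([np.zeros(40, int), np.ones(40, int)])
thr, acc = fit_stump(values, labels)
```

A full decision tree recursively applies such splits across many metabolites, which is how a profile of 87 cancer and 65 control samples can yield a multi-feature classification rule.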
Stage-specific biomarkers: Kouznetsova et al. investigated urine metabolites as biomarkers for classifying BCa stage using ML and ANN models combined with logistic regression. They identified D-glucose as an early-stage BCa biomarker with potential effects on neoplasia-related genes (AKT, EGFR, and MAPK3). For late-stage BCa, the identified biomarkers included glycerol, choline, 13(S)-hydroxyoctadecadienoic acid, 2'-fucosyllactose, and insulin. Their best-performing model predicted metabolite class with 82.54% accuracy and AUC of 0.84 on the training set.
Clinical context: While the metabolomics-based AI models showed lower accuracy than cystoscopy-based AI (which routinely achieved AUCs above 0.95), they offer a fundamentally non-invasive approach. The ability to distinguish early from late-stage BCa through urine analysis could have important implications for screening programs and reducing the need for invasive cystoscopy, particularly in low-risk surveillance populations. However, these results remain preliminary, with relatively small sample sizes and no external validation reported.
Why segmentation matters: Accurate bladder auto-segmentation, differentiating the bladder wall from surrounding tissues, is a prerequisite for automated diagnosis of bladder wall lesions. The bladder is a "shifting" organ whose shape and size change with urine volume, pressure, and physiology. This variability means some tumors may fall outside the region of interest (false negatives) while non-bladder structures may be misidentified as tumors (false positives). Manual delineation by radiologists is time-consuming and expensive, making automated segmentation a high-priority target for AI.
CNN for CT urography: Cha et al. used a CNN to segment regions of interest from CT urography images of 173 patients (81 training, 92 validation), outperforming previously used methods. Dolz et al. applied a CNN to MRI data from 60 confirmed BCa patients and obtained accuracy of 0.98 for inner wall, 0.84 for outer wall, and 0.69 for tumor region segmentation. Ma et al. developed a U-Net DL model for CT urography segmentation using 81 training and 92 testing patients, with the deep CNN (DCNN) showing statistically significant improvement over U-Net alone (p < 0.001).
U-Net architectures: Li et al. proposed an automatic segmentation method on 1092 MRI images demonstrating that the U-Net method achieved a dice similarity coefficient (DSC) of 85.48%. Niazi et al. applied a U-Net for multi-class segmentation of bladder layers in T1 histopathology specimens, using hematoxylin-eosin stained images. Their 12-layer model achieved 89.3% accuracy for layer identification. Zhang et al. used an attention-mechanism-based U-Net DL-CNN model for cystoscopic image segmentation, achieving a DSC of 82.7% and mean Intersection over Union (MIoU) of 69%.
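Computing DSC and IoU: The dice similarity coefficient and Intersection over Union reported by these segmentation studies compare a predicted binary mask against the ground-truth delineation. A minimal sketch on toy masks:

```python
import numpy as np

def dice(pred, target):
    """Dice similarity coefficient for binary masks: 2|A∩B| / (|A| + |B|)."""
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum())

def iou(pred, target):
    """Intersection over Union (Jaccard index): |A∩B| / |A∪B|."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union

# Toy masks: the predicted region overlaps the ground truth only partially.
truth = np.zeros((8, 8), bool); truth[2:6, 2:6] = True   # 16 true pixels
pred = np.zeros((8, 8), bool); pred[3:7, 3:7] = True     # 16 pixels, 9 overlap
```

The two metrics are monotonically related (DSC = 2·IoU / (1 + IoU)), with DSC always the higher number, which is worth remembering when comparing a study reporting DSC 85.48% against one reporting MIoU 69%.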
Radiomics in bladder cancer: Radiomics extracts quantitative features from medical images using computer-aided diagnosis (CAD) and mathematical algorithms, capturing information not visible to the human eye. While AI-driven imaging is more advanced in prostate and renal cancers, several studies combined radiomics and AI for bladder cancer, primarily to address the approximately 50% understaging rate of biopsies for distinguishing non-muscle invasive (NMIBC) from muscle-invasive bladder cancer (MIBC).
MRI-based radiomics: Xu et al. analyzed 3D texture features from T2-weighted MRI in 62 cancer lesions and 62 controls, using a recursive feature elimination support vector machine (RFE-SVM) classifier. With data augmentation via the synthetic minority oversampling technique (SMOTE), the model achieved 89.67% sensitivity, 87.80% specificity, 88.74% accuracy, and AUC of 0.94. Wu et al. used radiomic MRI features with the LASSO algorithm on 103 BCa patients (69 training, 34 validation) to predict lymph node metastasis, building a nomogram that achieved AUCs of 0.91 in training and 0.89 in validation.
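How SMOTE works: SMOTE augments the minority class by interpolating between a real sample and one of its nearest minority-class neighbors, rather than simply duplicating cases. A minimal sketch; this is not the reference imbalanced-learn implementation, and the feature values are hypothetical:

```python
import numpy as np

def smote(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE: synthesize n_new minority samples, each placed at a
    random point on the segment between a seed sample and one of its
    k nearest minority-class neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]    # skip the seed point itself
        j = rng.choice(neighbors)
        lam = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Hypothetical 2-feature minority class (e.g., radiomic features of rare lesions).
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]])
X_new = smote(X_min, n_new=6)
```

Because the synthetic points lie between real minority samples, the classifier sees a denser but plausible minority region instead of exact copies, which is what makes SMOTE attractive for small radiomics cohorts like the one above.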
Clinical nomograms: Zheng et al. extracted 2602 radiomics features from T2-weighted MRI and used LASSO to build a combined radiomics-clinical nomogram for pre-operative muscle invasiveness discrimination, yielding AUCs of 0.913 for training and 0.874 for validation with demonstrated clinical usefulness. A systematic review and meta-analysis by Kozikowski et al. found an overall HSROC AUC of 0.88, with 81% specificity and 82% sensitivity for predicting muscle wall invasiveness across multiple radiomics studies.
VI-RADS with DL reconstruction: Taguchi et al. validated the Vesical Imaging-Reporting and Data System (VI-RADS) in a prospective multicenter study of 68 BCa patients using the latest MRI technology with DL reconstruction. Muscle invasion diagnosis using a VI-RADS score of 4 or higher achieved 94% accuracy (AUC 0.92). DL reconstruction identified four additional patients initially misdiagnosed at VI-RADS score 3, with the correct diagnosis established through T2-weighted imaging plus denoising DL reconstruction.
AI-assisted grading: Grading is critical in BCa management because up to 30% of non-muscle invasive cases are high-grade and can progress to muscle invasion or develop metastases. Patients with high-grade NMIBC may receive BCG (Bacillus Calmette-Guerin) immunotherapy or undergo early cystectomy, making accurate grading essential for treatment decisions. Zhang et al. used texture features from MRI to discriminate between low-grade (32 patients) and high-grade (29 patients) BCa. The SVM classifier achieved the best performance with AUC 0.861, 82.9% accuracy, 78.4% sensitivity, and 87.1% specificity.
Multimodal radiomics grading: Wang et al. used multimodal MRI radiomics (T2w, DWI, and ADC) with the LASSO algorithm on 70 training and 30 validation patients, with multimodality models achieving an AUC of 0.923, outperforming any single imaging modality. Jansen et al. developed a fully automated grading system using a U-Net segmentation network trained to detect bladder urothelium. The system correctly graded 76% of low-grade and 71% of high-grade cancers in agreement with expert consensus, addressing the significant inter-observer and intra-observer variability that plagues manual grading of TURBT specimens.
Whole slide image analysis: Zhang et al. automated the analysis of 913 whole slide images of BCa pathology using a CNN algorithm, achieving results comparable to expert pathologists with an AUC of 0.97. Velmahos et al. used a CNN on histopathology slides from 418 BCa patients to predict fibroblast growth factor receptor (FGFR) alterations by identifying tumor-infiltrating lymphocyte percentages. The best model for FGFR2/FGFR3 mutation prediction achieved 82% sensitivity, 85% specificity, and AUC of 0.86.
Lymph node status prediction: Several studies used AI to predict pathological lymph node status after cystectomy. Seiler et al. employed a K-nearest neighbor (KNN51) classifier on whole transcriptome profiles from 199 cystectomy patients, achieving AUC 0.82, significantly outperforming the 15-gene cancer recurrence signature (AUC 0.62) and 20-gene lymph node signature (AUC 0.46). Wu et al. built a genomic-clinical-pathological nomogram using five mRNAs from the TCGA database in 325 BCa patients, achieving AUC 0.89 for lymph node status prediction with logistic regression.
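KNN in brief: The K-nearest-neighbor classifier used by Seiler et al. (reported as KNN51, presumably k = 51) labels a new profile by majority vote among the k closest training samples. A minimal sketch with toy two-feature data standing in for transcriptome profiles, using k = 3:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classic KNN: majority vote among the k training samples nearest to x
    (0/1 labels, so the vote is just a rounded mean)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return int(np.round(y_train[nearest].mean()))

# Toy data: two well-separated classes in a 2-feature space (illustrative).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3)
```

KNN makes no assumption about the decision boundary's shape, which suits high-dimensional transcriptome data, though in practice distances are usually computed after feature selection or dimensionality reduction.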
The staging challenge: Accurate staging is essential for therapeutic decision-making in BCa, yet current tools are limited by the sub-optimal ability to correctly determine muscle layer infiltration. Up to 50% of patients with T1 disease on TURBT actually have muscle-invasive BCa (MIBC). CT and MRI provide additional staging information but show sub-optimal performance in evaluating microscopic invasion (T1 versus T2 disease), with their main use being assessment of locally advanced disease (T3b or higher).
MRI-based staging: Xu et al. used multiparametric MRI with 1104 radiomics features from 54 patients to differentiate NMIBC from MIBC. Using SVM-RFE with SMOTE data augmentation, 19 features from T2w and DWI sequences achieved an AUC of 0.986 and outperformed experts in diagnostic accuracy at 96.30%. Li et al. compared radiomics, single-task DL, and multi-task DL on T2w MRI from 121 BCa lesions, with AUCs of 0.920 (radiomics), 0.933 (single-task DL), and 0.932 (multi-task DL). In a separate study, Li et al. showed that a DL-CNN model based on T2w achieved higher AUCs than two expert radiologists (0.963 vs. 0.843 and 0.852), with significantly higher accuracy for VI-RADS 2 or 3 score lesions (p = 0.006).
CT-based staging: Yang et al. used a DL-CNN model on 1200 CT images from 369 patients to differentiate non-muscle from muscle-invasive BCa, achieving AUC 0.997, 88.9% sensitivity, and 98.9% specificity. Xu et al. developed a DL algorithm using the YOLO (You Only Look Once) architecture on CT images from 60 patients, with clinical staging coincidence rates of 50.01% for T1, 91.65% for T2a, and 100.00% for T2b/T3/T4. Garapati et al. compared four AI algorithms (LDA, CNN, SVM, and Random Forest) for staging BCa at or above T2 from CT urography in 76 patients, achieving AUCs ranging from 0.89 to 0.97.
Histopathology-based staging: Yin et al. differentiated Ta from T1 BCa on hematoxylin and eosin-stained images from 1177 BCa tissues using an ML-CNN model with image processing through ImageJ and CellProfiler, achieving accuracies between 91% and 96%. Zou et al. used T2w images with the Inception V3 platform to build a multi-task BCa muscular invasion prediction model, achieving 92.3% accuracy, 100% sensitivity, and 88.5% specificity in the prospective data group. Sarkar et al. developed a hybrid ML-DL model using the LDA classifier on XceptionNet, reaching 86.07% accuracy for detection and 79.72% accuracy for invasiveness staging.
Data quality and availability: The review identifies data quality as the primary barrier to clinical adoption of AI in BCa. AI algorithms require large, well-annotated datasets for training and validation, but comprehensive datasets with good image quality are only available for digital pathology and radiology. Cystoscopy images are scarce and difficult to standardize because cystoscopy is a dynamic procedure, unlike radiology or histopathology. All retrieved studies were retrospective, with most AI tools trained, validated, and tested on the same dataset, frequently leading to overfitting where an algorithm works well on its own data but performs poorly on new patient populations.
Methodological limitations: Performance metrics across the 48 studies relied mainly on sensitivity, specificity, accuracy, and AUC, but substantial heterogeneity in study design, algorithm architecture, and outcome definitions makes meaningful cross-study comparison difficult. External validation across multiple institutions was rare, raising questions about generalizability. The interpretability problem, often described as the "black box" nature of AI models, hinders trust among clinicians who need transparent methods to justify clinical decisions, especially for consequential recommendations like early cystectomy or BCG therapy.
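What AUC measures: Cross-study comparison is further complicated because a single AUC value compresses the entire ROC curve. AUC equals the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative one (the Mann-Whitney statistic), so identical AUCs can hide very different sensitivity-specificity trade-offs at clinically relevant thresholds. A minimal sketch with hypothetical scores:

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
    pairs in which the positive case scores higher; ties count half."""
    sp = np.asarray(scores_pos, float)[:, None]
    sn = np.asarray(scores_neg, float)[None, :]
    return float(np.mean((sp > sn) + 0.5 * (sp == sn)))

# Hypothetical model scores for illustration (not from any reviewed study).
pos = [0.9, 0.8, 0.7, 0.6]   # scores given to true tumor cases
neg = [0.5, 0.4, 0.6, 0.2]   # scores given to benign cases
result = auc(pos, neg)
```

A perfect ranker scores 1.0 and a random one 0.5, which is why the cystoscopy models above, with AUCs near 0.99, rank tumor frames above benign frames almost without exception.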
Ethical and practical barriers: Analysis of confidential electronic health records raises legal and governance issues related to patient privacy versus research benefit. Algorithm bias from skewed or poorly representative datasets could disproportionately affect certain populations. Adoption challenges include the massive computational power required for complex ML algorithms, integration with existing healthcare systems, and the training requirements for healthcare professionals. Pending FDA and CE regulatory approvals further delay the clinical implementation of AI/ML-based medical devices.
Future promise: Despite these limitations, AI has the potential to enhance diagnostic accuracy across all BCa diagnostic modalities, from cystoscopy to imaging to histopathology. The authors envision AI enabling precision medicine through integration of patient factors with multi-omics data (genomics, proteomics, transcriptomics) to identify molecular signatures and predict therapy responses. Combined with telemedicine, AI could enable remote patient monitoring, real-time evaluation of symptoms and treatment response, and reduced healthcare costs. Hybrid generative-discriminative DL models and the integration of AI with traditional imaging analysis represent the most promising trajectories for future research.
Scalability and regulation: The ability of AI-based diagnostic tools to operate at the size, speed, and complexity required for BCa assessment is still suboptimal, representing a major obstacle to widespread clinical implementation. Robust infrastructures to support large-scale adoption are needed. Additionally, ML interpretability suffers from a lack of mature definitions and formal methods, creating ambiguity that has limited adoption in sensitive clinical domains. Patient perspectives toward AI remain controversial, with mixed results in terms of understanding and acceptance of this technology in common clinical practice.