AI Prognostic Subgroups in Muscle-Invasive Bladder Cancer

Plain-English Explanations

Overview

Pages 1-3

Why AI-Based Risk Stratification Matters for Muscle-Invasive Bladder Cancer

Clinical burden of bladder cancer: Bladder cancer (BC) is the tenth most common cancer in the United States, predominantly affecting individuals over 55 years of age, with men affected roughly four times more frequently than women. While most cases present as non-muscle-invasive bladder cancer (NMIBC), up to 25% are muscle-invasive (MIBC), carrying substantial mortality risk. Five-year survival rates drop steeply by stage: 96% for stage I, 70% for stage II, 38% for stage III, and just 6% for stage IV. MIBC is highly heterogeneous and costly, with distinct multilevel molecular subtype profiles that influence both prognosis and treatment response.

The molecular profiling bottleneck: Determining a patient's molecular subtype profile requires analysis of protein expression, gene mutations, mRNA, DNA methylation, and miRNA. This demands complex and expensive infrastructure that is unavailable in most cancer centers worldwide. A cost-effective solution that can stratify patients by risk and identify those who would benefit from specific treatment regimens is therefore a major unmet clinical need.

What this study proposes: Eminaga et al. developed an AI-based interpretable scoring system using weakly supervised learning and neural architecture search (NAS) to capture prognostic histopathological patterns from H&E-stained whole-slide images (WSIs). They constructed and externally validated this scoring system using multi-institutional datasets comprising 653 whole-slide images from the PLCO trial (development set, 196 WSIs from 113 patients across nine U.S. centers) and the TCGA bladder cancer cohort (external validation, 412 images). The central hypothesis is that morphometric patterns visible in standard histology slides encode information about multi-omics molecular signatures and can therefore serve as a prognostic digital biomarker.

Key limitation of prior work: Previous deep learning studies in bladder cancer typically treated model confidence scores as probability scores, ignoring the well-documented overconfidence problem in neural networks. They also lacked interpretable methods for verifying whether features in the latent space actually correspond to meaningful histological patterns. This study explicitly addresses both shortcomings.

TL;DR: This study develops an AI scoring system from H&E-stained whole-slide images to stratify muscle-invasive bladder cancer patients into prognostic risk groups. Using 653 WSIs from PLCO and TCGA cohorts, the approach bypasses expensive molecular profiling while revealing associations with multi-omics signatures, addressing the overconfidence and interpretability gaps in prior deep learning studies.

Data and Preprocessing

Pages 3-4

Development Cohort Design and Image Preprocessing Pipeline

PLCO development set: The development set drew from 113 patients with urothelial carcinoma of the bladder enrolled in the Prostate, Lung, Colon, and Ovarian (PLCO) Cancer Screening Trial, a randomized controlled trial of 154,900 participants aged 55 to 74 enrolled between 1993 and 2001. Although PLCO did not screen for bladder cancer, it tracked BC diagnoses during the trial and linked to the National Death Index for mortality follow-up of up to 19 years. From 1,430 BC cases, 285 were randomly selected, yielding 196 H&E-stained slides from nine U.S. centers, digitized at 40x objective magnification (~0.2532 micrometers per pixel) using Leica Biosystems scanners.

Data splitting strategy: The images were divided by institution into a training set (26 patients, 46 WSIs), an optimization set (6 patients, 8 WSIs), and a validation set (81 patients, 142 WSIs). The institution-based split prevented any patient overlap between sets. The center with the largest proportion of cases supplied the training data, the smallest supplied the optimization data, and remaining centers formed the validation set. No significant differences in cohort characteristics (age, sex, WHO grade, T stage, N stage, M stage, or follow-up duration) were found between subsets (all p > 0.05).

Tile extraction and labeling: The tissue area on each slide was identified by thresholding the grayscale thumbnail at 1x magnification, then upscaling the boundary to 40x. Tissue was divided into 2048 x 2048 pixel tiles, and tiles where more than 50% of pixels matched the white background were excluded. Remaining tiles were downsized to 512 x 512 pixels (approximately 10x objective magnification). Each tile inherited the patient-level binary label of cancer-specific death (CSD) status from the death certificate, implementing the weakly supervised learning paradigm where patch-level labels are derived from case-level outcomes.

Patch volume: This preprocessing yielded 26,949 patches (16.5%) from the training set, 7,574 patches (4.6%) from the optimization set, and 129,122 patches (78.9%) from the validation set. The training set was intentionally diverse in pathology (77% NMIBC, reflecting the population distribution), while the validation set had a balanced distribution of NMIBC and MIBC cases as well as G1/2 and G3 grades, minimizing sampling bias.

TL;DR: The PLCO development cohort included 113 patients (196 WSIs) from nine U.S. centers, split by institution into training (26 patients), optimization (6), and validation (81) sets. Slides were tiled into 512 x 512 pixel patches labeled with cancer-specific death status under a weakly supervised framework, generating over 163,000 total patches across all subsets.

Model Architecture

Pages 4-5

Neural Architecture Search and PlexusNet Model Design

Neural architecture search (NAS): Rather than using an off-the-shelf architecture like ResNet or Inception, the study applied a PlexusNET-based NAS algorithm to systematically identify the optimal model design for cancer-specific death prediction. The search space encompassed 1,296 candidate architectures, varying across seven parameters: block type (inception, residual/ResNet, conventional/VGG, or attention block), network width (2, 4, or 6), depth (3, 4, or 5), number of pathways (2, 3, or 4), junction interconnections (1, 2, or 3), global pooling type (average vs. maximum), and whether to include a transformer component (yes vs. no).

NAS procedure: Each of the 1,296 configurations was trained for one epoch using the ADAM optimizer with a learning rate of 1 x 10^-3, cross-entropy loss, and a batch size of 64. To optimize computational efficiency, patches were downsized to 32 x 32 pixels during the search phase. Two-fold cross-validation was used to evaluate each architecture for balanced classification accuracy. The entire NAS process required approximately 12 hours.

Winning architecture: The NAS selected a remarkably shallow model with the configuration VGG D6L2J1F2 + transformer and global average pooling. This architecture has only 23,783 trainable parameters and produces just 20 fully connected representation features. For context, ResNet18 has 512 features in its final layer, meaning the PlexusNET model uses roughly 4% of the feature representations of a standard off-the-shelf architecture. This extreme compactness is deliberate: fewer features improve both computational efficiency for downstream analysis and human interpretability.

Full training protocol: The selected architecture was then trained on full 512 x 512 pixel patches until convergence, with early stopping triggered after 10 epochs without improvement on the optimization set. AdamW (Adam with weight decay) was used with a learning rate of 1 x 10^-4. Binary labels were randomly smoothed with +/- 0.25 to combat overconfidence and improve calibration. Data augmentation included random rotation, flipping, clipping, and color space augmentations. The Levene test confirmed that 18 of the 20 two-dimensional feature maps showed significantly different variance between cancer-specific death patches and survivor patches, verifying that the model extracted meaningful prognostic features from the histology.

TL;DR: A neural architecture search across 1,296 PlexusNET configurations (~12 hours) selected a compact VGG-transformer hybrid with only 23,783 parameters and 20 feature representations. Label smoothing (+/- 0.25) addressed neural network overconfidence. The Levene test confirmed 18 of 20 feature maps captured statistically significant differences between cancer death and survivor patches.

Scoring System

Pages 5-7

Deriving the Interpretable CSD Scoring System from Feature Space

Feature space visualization: The model's penultimate convolutional features were visualized using t-SNE (t-distributed stochastic neighbor embedding) on 25,000 randomly selected patches from the validation set, representing all 81 patients. The prediction deciles intuitively sorted the feature space, with the t-SNE plots revealing that patches with similar prediction scores clustered together. Critically, the corresponding patch images confirmed visible differences in histopathological appearance across deciles, validating that the model's numerical predictions corresponded to genuine morphological differences.

Decile selection: Based on both the t-SNE feature visualization and expert assessment of histopathological patterns, the authors selected two reference deciles: the second decile (D2) and the fifth decile (D5). D2 (orange color in the visualizations) was predominantly associated with negative patches (more than 50%), including benign bladder tissue. D5 (lilac color) was the center decile between the first and ninth deciles, representing an intermediate prediction zone. The tenth decile was excluded due to negligible sample size.

Score computation algorithm: For each patient, the algorithm first calculates a histogram of patch predictions across 10 equal-width bins. The bin width is computed as (s_max - s_min) / sqrt(10), where s_max and s_min are the maximum and minimum CSD scores for that case. After maximum normalization, the unadjusted CSD score is calculated as Su = Dm - Dr, where Dm and Dr are the normalized frequencies at deciles D5 and D2 respectively. To handle out-of-distribution data without requiring ground truth, the authors introduced an adjustment: SCSD = -(mean + median_deviation) + Dm - Dr. A threshold T = mean_0 + 1.05 * sigma_0 + alpha divides patients into binary risk categories, where alpha corrects for distribution shifts between development and external cohorts.

Generalization verification: The authors validated that the median midrange CSD scores for D2 (MR = 0.17, IQR: 0.16-0.18) and D5 (MR = 0.41, IQR: 0.37-0.42) were comparable between the development and out-of-distribution cohorts, ensuring that the equal-width binning strategy generalized across datasets. This is a notable methodological contribution, as most AI-based scoring systems lack explicit mechanisms for handling distributional shifts between training and deployment populations.

TL;DR: The scoring system uses t-SNE-guided selection of two reference deciles (D2 and D5) from the model's feature space, then computes a per-patient risk score as the difference in normalized patch frequencies at these deciles. A distribution-aware normalization strategy adjusts for out-of-distribution cohorts without requiring ground truth labels, and a threshold converts continuous scores into binary low-risk vs. high-risk categories.

Internal Validation

Pages 7-9

Prognostic Performance on the PLCO Validation Set

Cancer-specific survival prediction: On the PLCO validation set (81 patients), the AI-derived risk score was prognostic for cancer-specific mortality with a hazard ratio of 8.0 (95% CI: 1.4-46.1; z: 2.332; p = 0.0197). The 5-year time-dependent AUC was 0.772 +/- 0.04, indicating good discriminative accuracy for cancer-specific death at the five-year mark. The high hazard ratio indicates that patients in the high-risk group had an 8-fold increased risk of cancer-specific death compared to the low-risk group.

Multivariate analysis on PLCO: The multivariate Cox regression analysis, adjusting for age at diagnosis and tumor grade (WHO 1973 grading), demonstrated that the risk score remained an independent prognosticator. In the adjusted model, the risk score had an HR of 8.39 (95% CI: 1.53-46.12; z: 2.45; p = 0.01). Notably, G3 tumors had an HR of 11.99 (95% CI: 1.61-89.21; z: 2.43; p = 0.02) relative to G1, confirming that the AI score provides complementary prognostic information distinct from conventional grading. Age at diagnosis was not independently significant (HR: 1.03, p = 0.39) in this model.

Kaplan-Meier survival analysis: The Kaplan-Meier curve revealed two statistically distinct risk groups (log-rank p = 0.014). The median cancer-specific survival for the high-risk group was achieved between 204 months (17 years) and 216 months (18 years) after initial diagnosis. This long time horizon reflects the PLCO cohort's extended follow-up (median 168 months in the validation set), which is unusual for bladder cancer studies and provides robust long-term outcome data.

Feature space interpretation: The density histograms of the 8 x 8 two-dimensional feature maps (each 1 x 1 pixel on a feature map corresponding to a 64 x 64 pixel area on the original 512 x 512 patch) confirmed that specific features such as F13, F14, F15, F16, and F20 exhibited histogram ranges for pixel values more commonly associated with cancer-specific death patches. This granular feature-level analysis provides a degree of interpretability rarely seen in deep learning prognostic models.

TL;DR: On the PLCO validation set, the AI risk score predicted cancer-specific death with HR 8.0 (p = 0.02) and 5-year AUC of 0.772. After adjusting for age and tumor grade, the score remained independently prognostic (HR 8.39, p = 0.01). Kaplan-Meier analysis showed statistically distinct survival curves (p = 0.014), with the high-risk group reaching median survival between 17 and 18 years.

External Validation

Pages 10-13

External Validation on the TCGA Cohort: Overall Survival in MIBC

TCGA validation cohort: External validation used 412 H&E-stained whole-slide images from The Cancer Genome Atlas (TCGA) Urothelial Bladder Carcinoma cohort, contributed by 36 institutions worldwide. The tissue specimens came from radical cystectomies. The cohort was 74% male, with a median age of 68 years (IQR: 60-76). The vast majority of cases were high-grade MIBC (95% high-grade), with 46% at pathologic stage T3/T4. Follow-up was relatively short (median 19 months, IQR: 12-33), and 45% of patients had died by the time of analysis.

Prognostic performance: The risk groups were prognostic for overall survival with a univariate hazard ratio of 1.46 (95% CI: 1.05-2.02; z: 2.23; p = 0.03). The multivariate Cox regression, adjusting for AJCC pathologic tumor stage and age at diagnosis, confirmed the risk score as an independent prognosticator: HR 1.35 (95% CI: 1.01-1.80; z: 1.99; p = 0.046). Stage III had an HR of 1.51 (95% CI: 1.03-2.21; p = 0.036) and stage IV had an HR of 2.21 (95% CI: 1.54-3.18; p < 0.0001) relative to stage I/II, with multicollinearity negligibly small (VIFs < 2).

Survival separation: The Kaplan-Meier curve with log-rank test confirmed statistically distinct risk groups (p = 0.037). Both groups eventually reached median overall survival, but at markedly different time points: approximately 30 months for the high-risk group versus 60 months for the low-risk group. The high-risk group reached median survival roughly 2.5 years earlier than the low-risk group. This is clinically meaningful for MIBC, where treatment decisions about neoadjuvant chemotherapy, radical cystectomy timing, and adjuvant therapy depend heavily on predicted survival trajectory.

Histopathologic pattern correspondence: The categorization was driven by the dominance of either D2 or D5 in each case. D2 and D5 were associated with distinct histopathologic patterns in the TCGA cohort. D2-dominant cases (low-risk) showed patterns more consistent with papillary architecture, while D5-dominant cases (high-risk) showed features associated with more aggressive phenotypes. A negligible fraction of patches in D2 solely included arterial vessels as luminal structures, suggesting the model did not rely on vascular artifacts for its predictions.

TL;DR: On external validation with 412 TCGA MIBC cases, the AI risk score was independently prognostic for overall survival (adjusted HR 1.35, p = 0.046). Median survival was ~60 months for low-risk vs. ~30 months for high-risk, a 2.5-year separation. The Kaplan-Meier log-rank test confirmed statistical significance (p = 0.037), validating the scoring system across institutional and cohort boundaries.

Molecular Associations

Pages 14-16

Associations with Molecular Subtypes and Multi-Omics Signatures

Transcriptomic subtypes: The AI-derived risk groups showed significant associations with TCGA molecular clusters across multiple omics levels. At the mRNA level (p = 0.010), the luminal papillary subtype was associated with the low-risk group, while the basal/squamous and neuronal subtypes were associated with the high-risk group. At the miRNA level (p = 0.004), cluster 3 was more frequent in the low-risk group and cluster 1 in the high-risk group. For lncRNA (p = 0.012), cluster 3 predominated in the low-risk group and cluster 4 in the high-risk group.

Epigenomic associations: DNA hypomethylation cluster 2, characterized by greater DNA hypermethylation signals (i.e., more gene inactivation), was associated with the high-risk group (p = 0.093). In contrast, cluster 4, which has lesser DNA hypomethylation, was linked with the low-risk group. Low-risk patients showed molecular profiles associated with papillary tumors, high FGFR3 mutations, elevated miR-200 levels, and low Epithelial-Mesenchymal Transition (EMT) scores. High-risk patients exhibited profiles linked to lymphocyte infiltration, high expression of carcinoma in situ (CIS) signature genes, elevated CD274 (PD-L1) and PDCD1 (PD-1) levels, high TP53 mutations, and high EMT scores.

Histopathologic features: The squamous phenotype showed a trend toward association with the high-risk group (p = 0.066). This aligns with the mRNA-level finding that the basal/squamous subtype clusters with high-risk patients. The 63.3% of squamous pathology cases fell into the high-risk group, consistent with the known aggressive biology of the squamous variant of urothelial carcinoma.

Multi-omics integration insight: These findings collectively demonstrate that the AI scoring system captures information encoded across multiple molecular levels simultaneously. The risk groups are not merely reflecting a single mutation or subtype, but rather integrating signals from transcriptomic, epigenomic, and histopathological dimensions. The authors emphasize that this underscores the risk of confounding bias when reducing complex tumor biology to a single mutation, because histopathological changes can only be fully captured through comprehensive multi-omics profiles.

TL;DR: The AI risk groups significantly correlated with TCGA molecular clusters for miRNA (p = 0.004), mRNA (p = 0.010), and lncRNA (p = 0.012). Low-risk mapped to luminal papillary subtypes with high FGFR3 and low EMT scores. High-risk aligned with basal/squamous and neuronal subtypes, elevated PD-L1/PD-1, and DNA hypermethylation, demonstrating that the scoring system captures multi-omics biology from histology alone.

Gene Mutations

Pages 14-16

Gene Mutation Distributions and FGFR3 Prediction Accuracy

TSC1 mutation: The low-risk group captured 72% of all TSC1 mutations (28 of 39 TSC1 mutations) in the cohort. The odds ratio of TSC1 mutation in the high-risk group was 0.36 (95% CI: 0.15-0.76; p = 0.004), indicating that high-risk patients were significantly less likely to harbor TSC1 mutations. TSC1 is a tumor suppressor gene involved in the mTOR signaling pathway, and its mutation is being investigated as a target for nab-sirolimus therapy (NCT05103358).

ERBB3 mutation: Similarly, the low-risk group included 67% of all ERBB3 mutations (30 of 45 ERBB3 mutations). The odds ratio for ERBB3 mutation in the high-risk group was 0.46 (95% CI: 0.22-0.91; p = 0.018). ERBB3 mutations are therapeutically actionable through agents such as afatinib, a pan-ERBB inhibitor that has shown activity in platinum-refractory urothelial carcinoma.

FGFR3 mutation: The true positive rate of the low-risk group for FGFR3 mutation detection was 65%, with an AUC of 0.593 (95% CI: 0.55-0.69). The odds ratio for FGFR3 mutation in the high-risk group was 0.49 (95% CI: 0.27-0.87; p = 0.010). The authors note that this AUC is comparable to previous reports attempting to predict FGFR3 mutations directly from histology, suggesting that multi-omics confounders explain why single-mutation prediction from histology has limited ceiling accuracy. FGFR3 is clinically significant because erdafitinib, an FGFR inhibitor, is FDA-approved for FGFR-altered locally advanced or metastatic urothelial carcinoma.

Additional significant mutations: Several other gene mutations showed significant differential distribution: PIK3CA (p = 0.028), FAT1 (p = 0.023), KANSL1 (p = 0.034), TMCO4 (p = 0.038), and KDM6A (p = 0.045), all with lower mutation frequencies in the high-risk group. KDM6A is a histone lysine demethylase with a known female gender bias, suggesting that the risk groups may also capture gender-linked epigenetic vulnerabilities. Mutations in GNA13 (p = 0.094), ZNF773 (p = 0.092), PSIP1 (p = 0.075), and METTL3 (p = 0.058) showed marginal significance at the p < 0.10 level.

TL;DR: The low-risk group concentrated 72% of TSC1 mutations (OR 0.36, p = 0.004), 67% of ERBB3 mutations (OR 0.46, p = 0.018), and 65% of FGFR3 mutations (OR 0.49, p = 0.010). Additional mutations in PIK3CA, FAT1, KDM6A, KANSL1, and TMCO4 also showed differential distribution (all p < 0.05), linking the AI risk groups to multiple actionable therapeutic targets.

Clinical Implications

Pages 17-18

Treatment Implications and the Case Against Single-Mutation Prediction

Treatment stratification potential: The risk groups map directly onto therapeutic targets. The low-risk group, enriched for FGFR3 mutations, could benefit from erdafitinib (FGFR inhibitor). ERBB3-mutated cases in the low-risk group are candidates for afatinib (pan-ERBB inhibitor). PIK3CA mutations suggest sensitivity to PI3K/mTOR inhibitors like LY294002. TSC1-mutated patients may respond to nab-sirolimus. The high-risk group, associated with basal/squamous subtypes showing elevated PD-L1 and PD-1 levels, may benefit from immune checkpoint inhibitors (anti-PD-1 or anti-PD-L1 agents). Additionally, the DNA hypermethylation pattern in the high-risk group suggests that epigenetic therapy could be a potential treatment option.

Why single-mutation prediction from histology is inadequate: A major intellectual contribution of this study is the demonstration that histopathological appearance reflects a collection of multifaceted molecular modulations, not a single mutation. The FGFR3 prediction AUC of only 0.593 illustrates this point: while histology encodes molecular information, attempting to extract a single mutation signal from what is fundamentally a multi-omics readout produces modest accuracy at best. Earlier studies reporting FGFR3 or molecular subtype prediction from histology were likely confounded by the interconnected nature of these molecular features.

Interpretability advantages: The study used neural architecture search to achieve a model with only 20 feature representations, compared to 512 in ResNet18, a roughly 96% reduction. These compact features preserve three-dimensional topological information and allow analysis of 8,000,000 data points in approximately 30 minutes using parallel computing. Comparable studies using off-the-shelf models are limited to one-dimensional features that lose topological information, or face prohibitive computational costs for three-dimensional feature analysis.

Critique of saliency methods: The authors note that comparable studies relied on activation maps or top-scoring tiles to interpret model inference. However, saliency methods carry a high risk of misinterpretation, limited reproducibility, and sparse visualization. Moreover, selecting only top-scoring tiles ignores the variance in histology patterns between categories. Their decile-based approach instead captures the full distribution of histological patterns across the prediction spectrum, providing a more robust and interpretable framework for understanding what the model has learned.

TL;DR: The risk groups suggest distinct treatment strategies: FGFR3/ERBB3/PIK3CA-targeting therapies for low-risk patients, and immune checkpoint inhibitors or epigenetic therapy for high-risk patients. The study demonstrates that single-mutation prediction from histology (FGFR3 AUC = 0.593) is inherently limited because histology encodes multi-omics information. The compact 20-feature model (96% fewer features than ResNet18) offers superior interpretability.

Limitations and Future Directions

Pages 18-20

Study Limitations, Domain Shift Challenges, and Future Research

Image quality variability: Whole-slide images from different institutions and scanners inevitably introduce variability in staining, scanning technology, and image artifacts. The authors mitigated this by using PlexusNET's domain shift handling, conducting comprehensive manual review by domain experts, and validating on multicentric datasets. Feature visualization techniques were employed to assess whether staining variations influenced the selected histological patterns, though some residual variability is unavoidable in real-world pathology workflows.

Intra-tumor heterogeneity: TCGA slide images provide only a snapshot of a specific tumor region, which may not fully capture the complex spatial and molecular heterogeneity within and between tumors. Importantly, TCGA slide images (H&E-stained FFPE sections) do not directly correspond to the tissue regions used for molecular profiling, as genomic, transcriptomic, and proteomic analyses use separate tissue preparations and protocols. This means the associations between histological patterns and molecular features are indirect rather than spatially matched. The authors selected FFPE tissues over frozen sections because FFPE offers standardized staining and more reliable histology, whereas frozen tissue preparation can cause structural damage and staining inconsistencies.

Out-of-distribution normalization: The authors acknowledge that distributional divergence between development and deployment cohorts is a persistent challenge. Their proposed normalization strategy follows principles of the central limit theorem and requires a representative cohort for reliable outcomes. A continuous normalization approach with instantaneous threshold adjustments is proposed, but this requires either a latency period or initial representative data. The application boundary of the approach is determined by image quality and the cohort characteristics of the development set.

Future directions: The authors envision prospective randomized studies to validate clinical utility for patient selection in real-world settings. Digital biomarkers like histomics could serve as companion variables for disease staging. Integration with Electronic Health Records (EHRs) and clinical decision support systems is needed for practical deployment. Multi-omics data harmonization across transcriptomics, genomics, and epigenomics remains a significant computational challenge. Whether omics strategies provide superior clinical benefits compared to a single data modality remains an open question. Possessing a scoring system that captures multi-omics features from a single FFPE histology image may help justify customized molecular profiling in the clinical setting, potentially reducing unnecessary testing while ensuring that patients who need comprehensive molecular characterization receive it.

TL;DR: Key limitations include image quality variability across institutions, spatial mismatch between H&E slides and molecular profiling regions, and the challenge of distributional shifts between development and deployment cohorts. Future work requires prospective validation, EHR integration, and multi-omics harmonization. The scoring system could serve as a cost-effective pre-screening tool to guide which patients need expensive molecular characterization.