Upper tract urothelial carcinoma (UTUC) is a rare and aggressive malignancy arising from the urothelial lining of the renal pelvis and ureter, accounting for only 5 to 10% of all urothelial carcinomas. The remaining 90 to 95% are urothelial bladder cancers (UBC). Despite its rarity, UTUC carries a poor prognosis: two-thirds of patients are diagnosed at an invasive tumor stage. Because UTUC and UBC share histopathological similarities, and because UBC is far more common, UTUC has been historically understudied. This knowledge gap means that molecular subtypes well-established in muscle-invasive bladder cancer (MIBC), such as luminal and basal classifications, have not been systematically extended to UTUC.
Clinical relevance of subtypes: In MIBC, the luminal subtype is associated with higher responsiveness to FGFR3-targeted therapies, while the basal subtype tends to respond better to immunotherapies such as PD-L1 and PD-1 inhibitors. Determining whether similar subtype-therapy associations exist in UTUC could open new treatment options for patients. However, molecular subtyping via high-throughput sequencing is neither ubiquitously available nor cost-effective. An alternative approach uses immunohistochemistry (IHC) to identify protein-based subtypes, which have been shown to overlap substantially with transcriptome-based subtypes in MIBC studies.
The study's proposition: This study by Angeloni et al. from the University Hospital Erlangen-Nurnberg and collaborators across Germany and the Netherlands proposes a two-part strategy. First, they identify UTUC protein-based subtypes using hierarchical clustering of six IHC markers: three luminal markers (FOXA1, GATA3, and CK20) and three basal markers (CD44, CK5, and CK14). Second, they develop a deep-learning (DL) model to predict these protein-based subtypes directly from routine hematoxylin and eosin (H&E) whole slide images, bypassing the need for IHC entirely. This could allow pathologists to prioritize which UTUC patients should undergo molecular testing for targeted therapies.
Training cohort (German cohort): The primary dataset comprised N = 163 retrospectively analyzed patients diagnosed with UTUC between 1995 and 2012 at University Hospital Erlangen-Nurnberg and University Hospital Giessen and Marburg, Germany. All patients underwent radical nephroureterectomy or partial ureterectomy without prior treatment. All samples were invasive (tumor stage pT1 or higher), and one whole slide image (WSI) per patient was selected, specifically the slide showing the most representative invasive portion of the tumor. The cohort had a median age of 73 years (range 47 to 94), with 68.1% male patients. Tumor stage distribution included 20.2% pT1, 17.2% pT2, 49.7% pT3, and 12.9% pT4.
Independent test cohort (Dutch cohort): An external validation set of N = 55 patients came from a multicenter, phase II prospective trial conducted at University Medical Center Rotterdam between 2017 and 2020. This cohort had a median age of 71 years (range 52 to 85), with 61.8% male patients. Tumor stages were 30.9% pT1, 23.6% pT2, and 45.5% pT3, with no pT4 cases. The prospective nature of this cohort provided a particularly rigorous test of model generalizability.
IHC marker selection rationale: The six-marker panel was chosen based on the biology of urothelial differentiation. Normal urothelium consists of three layers: a basal layer expressing CK5/6, CK14, and CD44; an intermediate layer with variable CD44 and high CK18 expression; and a superficial (umbrella cell) layer expressing CK20 and uroplakin proteins. Urothelial neoplasms arise via two oncogenic pathways: the luminal pathway driven by the transcription factors GATA3, FOXA1, and PPARG, and the basal pathway driven by p63, STAT3, and EGFR. IHC staining was performed on tissue microarray (TMA) sections using four representative 1 mm cores per patient (two from the tumor center, two from the invasion front), and expression was quantified using the H-score (range 0 to 300).
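The source gives only the range of the H-score, not its formula. As a minimal sketch, assuming the conventional definition (percentage of weakly, moderately, and strongly stained cells weighted by 1, 2, and 3), the score could be computed as follows. The function name and argument names are illustrative, not taken from the study.

```python
def h_score(pct_weak: float, pct_moderate: float, pct_strong: float) -> float:
    """Conventional H-score: 1*weak% + 2*moderate% + 3*strong%, giving 0 to 300.

    Assumption: this is the standard definition; the study summary does not
    spell out which formula was used.
    """
    percentages = (pct_weak, pct_moderate, pct_strong)
    if any(p < 0 for p in percentages) or sum(percentages) > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    return 1 * pct_weak + 2 * pct_moderate + 3 * pct_strong


# Example: 20% weak, 30% moderate, 10% strong staining
print(h_score(20, 30, 10))
```

Under this definition a core in which every tumor cell stains strongly reaches the maximum of 300, and an unstained core scores 0.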
Additional biomarker assessments: PD-L1 expression was evaluated using the SP263 assay, with positivity defined as immune cell (IC) score of 5% or higher, or combined positive score (CPS) of 10 or higher. FGFR3 mutational status was assessed via the SNaPshot method, which simultaneously detects nine hotspot mutations. These biomarkers served as clinically relevant endpoints for validating whether DL-predicted subtypes captured biologically meaningful distinctions related to targetable alterations.
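The two PD-L1 cut-offs above can be written as a small rule. This is an illustrative sketch only: the function name and the choice to report the two criteria separately are assumptions, made because the results are later given per criterion (IC score and CPS).

```python
IC_CUTOFF_PCT = 5.0   # immune cell score threshold from the text
CPS_CUTOFF = 10.0     # combined positive score threshold from the text


def pd_l1_status(ic_score_pct: float, cps: float) -> dict:
    """Return PD-L1 positivity by each criterion (hypothetical helper)."""
    return {
        "ic_positive": ic_score_pct >= IC_CUTOFF_PCT,
        "cps_positive": cps >= CPS_CUTOFF,
    }


print(pd_l1_status(ic_score_pct=6.0, cps=4.0))
```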
Clustering methodology: Unsupervised hierarchical clustering was performed on the standardized expression of the six IHC markers across all 163 German cohort samples. The expression for each marker per patient was taken as the median H-score across the four TMA cores. Statistical analyses used R (v.4.0.3), with Fisher's exact test for categorical variables, the Wilcoxon rank-sum test and Kruskal-Wallis test for continuous variables, and the Kaplan-Meier estimator with log-rank test for survival analyses. A p-value threshold of 0.05 was applied for statistical significance.
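The input to the clustering can be sketched in a few lines. The study used R; the Python below is only an illustration of the two preparation steps described above (median H-score across the four cores, then per-marker standardization). It assumes that "standardized" means z-scoring with the sample standard deviation, which the source does not state, and the function names are hypothetical.

```python
from statistics import mean, median, stdev

MARKERS = ("FOXA1", "GATA3", "CK20", "CD44", "CK5", "CK14")


def median_h_scores(cores):
    """cores: list of dicts (one per TMA core) mapping marker -> H-score."""
    return {m: median(core[m] for core in cores) for m in MARKERS}


def standardize(patients):
    """patients: dict patient_id -> {marker: median H-score}.

    Returns per-marker z-scores (assumption: sample standard deviation).
    """
    z = {pid: {} for pid in patients}
    for m in MARKERS:
        values = [patients[pid][m] for pid in patients]
        mu, sigma = mean(values), stdev(values)
        for pid in patients:
            # A marker with no variation carries no information; map it to 0.
            z[pid][m] = (patients[pid][m] - mu) / sigma if sigma > 0 else 0.0
    return z


# Two toy patients, each with four cores carrying the same value for all markers
cores_a = [{m: v for m in MARKERS} for v in (100, 200, 300, 200)]
cores_b = [{m: v for m in MARKERS} for v in (0, 0, 100, 100)]
patients = {"A": median_h_scores(cores_a), "B": median_h_scores(cores_b)}
print(standardize(patients)["A"]["FOXA1"])
```

The resulting z-score matrix (patients by markers) is what a hierarchical clustering routine would then receive.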
Three subtypes identified: The clustering analysis revealed three protein-based subtypes: a luminal cluster (80 samples, 49.1%), a basal cluster (42 samples, 25.8%), and an indifferent cluster (41 samples, 25.1%) characterized by low expression of both basal and luminal markers. Notably, only 2 of the 41 indifferent samples had marker expression equal to zero; the remainder showed weak but detectable expression. The basal subtype was associated with shorter overall survival (OS) and disease-specific survival (DSS) compared to the luminal and indifferent subtypes.
Morphological associations: Tumor stages differed significantly across the three subtypes (p = 0.01), with nearly half of pT4 samples falling in the basal group. Infiltration type (p = 0.02) and tumor type (p = 0.02) differed significantly between basal and indifferent cases. Basal samples showed a clear prevalence of diffusely infiltrative, non-papillary tumors, whereas luminal and indifferent subtypes showed higher proportions of pushing and papillary tumors. Importantly, no significant morphological differences were found between the luminal and indifferent subtypes, indicating their histopathological similarity.
Implications for modeling: The histopathological similarity between indifferent and luminal subtypes had direct consequences for the deep-learning approach. Because these two groups looked morphologically alike under H&E staining, the DL model would face inherent difficulty distinguishing them based on visual features alone. This observation motivated the team's eventual decision to train a focused two-class model (luminal vs. basal) rather than attempting a three-class classification.
Slide preprocessing: Whole slide images from both cohorts were digitized using a Panoramic P250 scanner at multiple resolution levels. For each WSI, tumor tissue was manually annotated in QuPath (v.0.2.3) by a trained observer under expert pathologist supervision. An automated Python-based pipeline called TilGenPro (publicly available on GitHub) tessellated the annotated tumor areas into non-overlapping tiles of 512 x 512 pixels, performed quality filtering to remove background and artifact tiles, and applied stain normalization to reduce inter-scanner variability. This produced a library of 100,178 luminal tiles, 66,770 basal tiles, and 57,874 indifferent tiles.
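The tessellation step can be illustrated with a simplified sketch. This is not the TilGenPro code: it assumes the annotated tumor area is reduced to a rectangular bounding box in pixel coordinates, and the function name is hypothetical. Quality filtering and stain normalization are omitted.

```python
def tile_origins(x0, y0, width, height, tile_size=512):
    """Top-left corners of non-overlapping tiles lying fully inside a box.

    Assumption: the annotation is approximated by an axis-aligned bounding
    box; a real pipeline would also test each tile against the polygon.
    """
    return [
        (x, y)
        for y in range(y0, y0 + height - tile_size + 1, tile_size)
        for x in range(x0, x0 + width - tile_size + 1, tile_size)
    ]


# A 1500 x 600 pixel region holds two full 512 x 512 tiles in a single row
print(tile_origins(0, 0, 1500, 600))
```

Partial tiles at the right and bottom edges are dropped in this sketch, so every returned tile has the full 512 x 512 extent.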
Model architecture: The DL framework used a transfer-learning approach by fine-tuning a ResNet50 convolutional neural network initialized with weights pre-trained on the ImageNet database. ResNet50 is a 50-layer deep residual network that uses skip connections to avoid the vanishing gradient problem common in very deep networks. By starting from ImageNet-pretrained weights, the model leveraged general visual features (edges, textures, shapes) learned from millions of natural images, then adapted them to histopathological features through fine-tuning on the UTUC tile dataset.
Weakly supervised labeling: The approach was weakly supervised, meaning each tile inherited the protein-based subtype label of its parent slide (luminal, basal, or indifferent) from the hierarchical clustering. This avoids the need for expensive tile-level annotations by pathologists. The trade-off is label noise: not every tile in a "luminal" slide necessarily shows purely luminal morphology. To account for class imbalance, the number of tiles belonging to each class was equalized within each training set. WSI-level predictions were obtained by averaging tile-level prediction scores.
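The class equalization and the slide-level aggregation described above can be sketched as follows. The source says only that tile counts were equalized; random downsampling to the smallest class is an assumption here, and both function names are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean


def balance_tiles(tiles, seed=0):
    """tiles: list of (tile_id, slide_level_label) pairs.

    Randomly downsamples every class to the size of the smallest class
    (assumption: the study may have equalized classes differently).
    """
    by_label = defaultdict(list)
    for tile_id, label in tiles:
        by_label[label].append(tile_id)
    n = min(len(ids) for ids in by_label.values())
    rng = random.Random(seed)
    return {label: rng.sample(ids, n) for label, ids in by_label.items()}


def slide_score(tile_scores):
    """WSI-level score as the mean of tile-level prediction scores."""
    return mean(tile_scores)


tiles = [(f"lum_{i}", "luminal") for i in range(5)] + [(f"bas_{i}", "basal") for i in range(3)]
balanced = balance_tiles(tiles)
print({label: len(ids) for label, ids in balanced.items()})
print(slide_score([0.9, 0.8, 0.4]))
```

Averaging keeps the slide score on the same 0 to 1 scale as the tile scores, so a slide with a mix of confidently luminal and confidently basal tiles ends up near the middle of the range.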
Validation strategy: Model performance was estimated using a three-fold cross-validation repeated three times, with random splitting performed at the patient level to ensure no data leakage between training and validation folds. For each repetition, the area under the receiver operating characteristic curve (AUROC), accuracy, precision, recall, and F1-score were computed as the mean and 95% confidence interval across the three validation folds using Student's t-distribution. Confusion matrices for each repetition were obtained by concatenating predictions across the three validation folds.
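A minimal sketch of the two ingredients of this validation scheme follows: splitting at the patient level and summarizing three fold-level metrics with a t-based 95% confidence interval. The function names and the seeding scheme are assumptions. With three folds there are two degrees of freedom, for which the t quantile has a closed form, so no statistics library is needed.

```python
import random
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 2 degrees of freedom.
# For df = 2 the CDF is 1/2 + t / (2 * sqrt(2 + t^2)), which inverts exactly.
T_975_DF2 = sqrt(2 * 0.95**2 / (1 - 0.95**2))


def patient_folds(patient_ids, n_folds=3, seed=0):
    """Shuffle patients and deal them into folds, so that all tiles of one
    patient land in the same fold (no leakage between folds)."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def mean_ci95_three_folds(values):
    """Mean and t-based 95% CI of a metric measured on exactly three folds."""
    if len(values) != 3:
        raise ValueError("expected one value per fold (three folds)")
    m = mean(values)
    half_width = T_975_DF2 * stdev(values) / sqrt(3)
    return m, m - half_width, m + half_width


folds = patient_folds(range(10), seed=1)
print([len(f) for f in folds])
print(mean_ci95_three_folds([0.7, 0.8, 0.9]))
```

Because the critical value for two degrees of freedom is large (about 4.3), intervals built from only three folds are wide, which is consistent with the broad confidence intervals reported for the models.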
Three-class model limitations: The initial three-class model (luminal, basal, indifferent) encountered predictable difficulties. While most basal samples (AUROC = 0.77, 95% CI: 0.67 to 0.86 in repetition three) and luminal samples (AUROC = 0.71, 95% CI: 0.44 to 0.99 in repetition two) were correctly classified, more than 55% of indifferent samples were predicted as luminal by the DL model. This confirmed that the DL model could not distinguish the indifferent subtype from luminal based on H&E morphology alone, consistent with the histopathological similarity observed during the clustering analysis.
Two-class model performance: A refined model was trained using only the 80 luminal and 42 basal samples, producing 100,178 luminal and 66,770 basal tiles. This binary classifier achieved substantially better results across all three repetitions. The mean AUROC values were 0.83 (95% CI: 0.67 to 0.99) in repetition one, 0.80 (95% CI: 0.62 to 0.99) in repetition two, and 0.81 (95% CI: 0.65 to 0.96) in repetition three. The best mean accuracy was 0.79 (95% CI: 0.75 to 0.84) in repetition two, which also showed the most consistent metrics across folds.
High-confidence predictions: From repetition two (the best-performing run), slides were stratified by prediction confidence. "High-confidence" slides were those with a prediction score of 0.7 or higher for either the luminal or basal class. Of the 58 slides predicted luminal with high confidence, 50 (86.2%) were correctly classified, as were 14 of the 16 slides (87.5%) predicted basal with high confidence. In the top-scoring luminal slide, 99.9% of tiles were predicted luminal, and in the top-scoring basal slide, 90% of tiles were predicted basal. Whole-slide IHC validation with all six markers confirmed these top predictions.
Low-confidence and heterogeneous slides: Twenty-two "low-confidence" slides had prediction scores between 0.4 and 0.6. These slides showed no significant difference in the distribution of luminal versus basal marker expression (p = 0.43). Tile-level prediction maps revealed that some of these were "heterogeneous slides" with distinguishable clusters of luminal-predicted and basal-predicted tiles. Whole-slide IHC validation of a candidate heterogeneous slide confirmed the co-presence of luminal and basal areas, with the basal marker CK14 expressed only in the outer cell layer in luminal-predicted areas but across all cell layers in basal-predicted areas. This finding suggests that the model can flag intratumoral heterogeneity, although it was validated on a single candidate slide.
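The confidence thresholds used for the two-class model can be expressed as a simple rule. This sketch assumes that the luminal and basal scores of a slide sum to one, which the source does not state explicitly; the label "intermediate" for scores that fall between the two stated ranges is also an assumption, as is the function name.

```python
def stratify(luminal_score: float):
    """Return (predicted_class, confidence_bucket) for one slide.

    Assumption: basal score = 1 - luminal score (two-class softmax output).
    """
    basal_score = 1.0 - luminal_score
    predicted = "luminal" if luminal_score >= basal_score else "basal"
    top = max(luminal_score, basal_score)
    if top >= 0.7:
        bucket = "high"          # score of 0.7 or higher for either class
    elif top <= 0.6:
        bucket = "low"           # both scores between 0.4 and 0.6
    else:
        bucket = "intermediate"  # not described in the source
    return predicted, bucket


for score in (0.95, 0.10, 0.50, 0.65):
    print(score, stratify(score))
```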
Morphological validation: High-confidence DL-predicted subtypes showed significant associations with key morphological features (p < 0.001 for all). High-confidence luminal predictions were predominantly papillary tumors with not otherwise specified (NOS) histological subtype and pushing infiltration type. High-confidence basal predictions were mainly non-papillary tumors with squamous or other subtype histology and diffuse infiltration. Visual inspection of tile-level prediction maps confirmed these associations: luminal tiles displayed dense nuclei with small stroma bridges, while basal tiles showed dense stroma and keratinization.
Marker expression concordance: High-confidence predicted luminal slides exhibited significantly higher expression of luminal markers compared to basal markers (p = 6.62 x 10^-9), and high-confidence predicted basal slides showed significantly higher basal marker expression (p = 0.00241). Even among the incorrectly predicted slides, the morphological features were consistent with the DL-predicted subtype rather than the protein-assigned subtype. Five of the eight WSIs labeled "basal" by protein expression but predicted luminal by DL showed NOS histology characteristic of the luminal subtype. Both slides wrongly predicted as basal were diffusely infiltrating, non-papillary tumors with variant (non-NOS) histology.
PD-L1 association: The proportion of PD-L1-positive samples was significantly higher in high-confidence predicted basal slides, both using the IC score (p = 0.01) and the combined positive score (p < 0.001). This is clinically significant because PD-L1 positivity is a biomarker for response to immune checkpoint inhibitors. The finding that DL-predicted basal subtype, determined solely from H&E slides, enriches for PD-L1-positive samples suggests the model captures morphological correlates of immune activation.
FGFR3 mutation association: The proportion of FGFR3-mutated samples was significantly higher in high-confidence predicted luminal slides (p = 0.002). FGFR3 mutations are targetable with specific inhibitors such as erdafitinib. Interestingly, among incorrectly predicted slides, one wrongly predicted basal slide was actually PD-L1 positive, and four wrongly predicted luminal slides were FGFR3 mutated. This suggests the DL model may in some cases identify biologically meaningful subtype features that the TMA-based protein assay missed due to sampling limitations.
Dutch cohort clustering: Hierarchical clustering of the independent Dutch cohort (N = 55) identified the same three protein-based subtypes observed in the German cohort: luminal (31 samples, 56.3%), basal (4 samples, 7.3%), and indifferent (20 samples, 36.4%). The expression profiles matched those of the German cohort, providing strong support for the biological reproducibility of these subtypes. The consistency of this finding is particularly notable given that the Dutch cohort came from a prospective multicenter trial, whereas the German cohort was retrospective.
External test performance: WSI-level predictions on the Dutch cohort were generated using an ensemble of the three DL models trained during repetition two. The model correctly classified all 31 luminal samples, with an average prediction score of 0.89. However, it misclassified all four basal samples. Three of these four basal samples exhibited histological subtypes with glandular, squamous, and sarcomatoid features, which may have made their H&E morphology atypical relative to the basal patterns learned from the German training set.
The papillary invasion front case: The fourth misclassified basal sample was predominantly characterized by papillary growth with NOS histology, which typically appears luminal. However, basal features could be observed at the invasion front, where two of the four TMA cores had been punched for IHC assessment. The tile-level prediction map revealed that the DL model did identify a small basal-predicted area corresponding to the invasion front, while the remainder of the papillary tumor was predicted luminal. This case highlights how TMA-based subtype labels can be misleading when tumor heterogeneity is present.
Implications for model development: The external validation results underscore that while the DL model performs well on luminal cases, the presence of histological subtypes (squamous, glandular, sarcomatoid) in basal tumors presents a challenge. These variant histologies were not excluded from the study due to the rarity of UTUC, but their distinct morphological appearances may confound a model trained primarily on NOS and conventional subtype features. Future studies with larger UTUC cohorts containing sufficient representation of each histological subtype would be needed for robust multi-class DL training.
Clinical utility vision: The study's DL model, predicting protein-based subtypes from routine H&E slides, could serve as a triage tool for UTUC patients. By identifying which patients are likely basal (and thus more likely to be PD-L1 positive and candidates for immunotherapy) or luminal (and thus more likely to carry FGFR3 mutations and be candidates for targeted therapy), pathologists could prioritize expensive FGFR3 and PD-L1 testing. This is especially relevant as healthcare systems face increasing demand for these molecular tests due to the expanding use of anti-PD-L1/PD-1 therapies and specific FGFR3 inhibitors in urothelial carcinomas. A fully digital diagnostics workflow would be needed to implement such prioritization in daily practice.
Label uncertainty challenges: The authors identify two sources of label uncertainty that affect model training. First, hierarchical clustering assigns subtypes at the cohort level, meaning the cluster membership of individual samples may shift if different samples are included. Second, IHC-based subtype labels are derived from TMA cores (four 1 mm punches per patient), which may not be representative of the entire tumor. The predominantly papillary Dutch cohort case with basal invasion front features illustrates this problem clearly. RNA-sequencing analyses could provide more robust subtype assignments through genome-wide profiling, though they too may fail to capture intratumoral heterogeneity.
Histological subtype considerations: Given the rarity of UTUC, the study did not exclude tumors with variant histological subtypes (squamous, glandular, sarcomatoid), as had been done in some UBC studies. This decision reflects a practical reality but introduces a modeling challenge, since variant histologies have gained increasing clinical importance due to their impact on pathological and clinical outcomes. Developing machine-learning approaches that can accommodate the prediction of histological subtypes from H&E slides would be an important advance, but requires cohorts large enough to represent each subtype adequately for robust DL training.
Future workflow extensions: Several improvements could enhance the practical utility of this framework. An upstream automated tumor segmentation step would eliminate the need for manual QuPath annotations when processing new samples. A downstream postprocessing tool could automatically detect candidate heterogeneous slides by analyzing luminal/basal spatial patterns in tile-level prediction maps. Extension to biopsy samples would be clinically valuable, though the limited and superficial nature of biopsies may not provide sufficient morphological information for subtype prediction. Larger multicenter UTUC cohorts with comprehensive molecular profiling will be essential for training next-generation models.
Broader significance: This study represents one of the first demonstrations that protein-based UTUC subtypes, associated with the presence of targetable alterations, can be predicted directly from H&E slides using deep learning. The approach lays a foundation for AI-based tools that could support UTUC diagnosis and extend patient access to targeted treatments without requiring upfront IHC or molecular testing on every case. However, prospective validation in larger cohorts, integration into digital pathology workflows, and solutions for histological subtype diversity remain necessary before clinical deployment.