Artificial Intelligence in Renal Cell Carcinoma Histopathology: Current Applications and Future Perspectives

Plain-English Explanations
Pages 1-3
Why AI Is Needed in RCC Histopathology

Renal cell carcinoma (RCC) ranks among the top 10 most common cancers in both men and women, and its incidence has been steadily rising. Accurate diagnosis depends on histological analysis supported by genetic and cytogenetic profiling. Key prognostic features include tumor grade, RCC subtype (such as ccRCC, pRCC, and chRCC), lymphovascular invasion, tumor necrosis, and sarcomatoid dedifferentiation. Classifying these features manually is challenging, as RCC encompasses a broad spectrum of histopathological entities that have recently undergone reclassification.

A major limitation of conventional pathology is interobserver variability, where different pathologists may reach different conclusions when examining the same tissue. This inconsistency affects both renal mass biopsy (RMB) specimens, which are non-diagnostic in approximately 10 to 15% of cases, and surgical resection specimens, which exhibit substantial heterogeneity. The 5-year survival rate for low-grade RCC is around 90%, compared to roughly 12% for high-grade disease, making accurate grading critical for treatment decisions.

Whole Slide Imaging (WSI) technology has enabled the digitization of pathology slides, creating enormous datasets of high-quality images that can be used to train and test AI models. Machine learning (ML) and its subset deep learning (DL) use algorithms that enable computers to learn from these digital images, automatically identifying complex patterns and relationships within large datasets. Convolutional neural networks (CNNs) are a class of deep neural networks consisting of sequential convolutional layers that process input images through learnable filters, extracting features in an end-to-end fashion.
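
To make the WSI-to-CNN pipeline concrete, here is a minimal sketch of how a digitized slide is typically tiled into training patches, assuming the openslide-python library; the file name, patch size, and tissue-brightness threshold are illustrative, not from the review.

```python
# Minimal sketch: tiling a whole-slide image (WSI) into patches for CNN
# training. Assumes the openslide-python package; the file path, patch
# size, and tissue threshold below are illustrative assumptions.
import numpy as np
import openslide

slide = openslide.OpenSlide("example_slide.svs")  # hypothetical file
patch_px = 512                                    # patch edge length in pixels
width, height = slide.dimensions                  # full-resolution (level 0) size

patches = []
for y in range(0, height - patch_px, patch_px):
    for x in range(0, width - patch_px, patch_px):
        region = slide.read_region((x, y), 0, (patch_px, patch_px))
        rgb = np.asarray(region.convert("RGB"))
        # Keep only patches that contain mostly tissue (simple brightness cue:
        # glass background is near-white, so its mean intensity is high).
        if rgb.mean() < 220:
            patches.append(rgb)
```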

The authors conducted a narrative review of the Medline database, screening articles published in English between January 2017 and January 2023. They retrieved 98 results, focusing on original studies and case series while excluding reviews, editorials, and letters. The review covers AI applications in RCC diagnosis, subtyping, grading, molecular-morphological connections, therapy response prediction, and prognosis modeling.

TL;DR: RCC is a top-10 cancer with rising incidence, where 5-year survival ranges from 90% (low-grade) to 12% (high-grade). Interobserver variability and complex histopathology make diagnosis challenging. This 2023 review of 98 Medline articles covers AI/ML applications in RCC diagnosis, grading, and prognosis using WSI-based computational pathology.
Pages 4-6
AI for RCC Subtyping in Biopsy and Surgical Specimens

Biopsy-level classification: Fenstermaker et al. developed a DL-based algorithm using a custom CNN with six convolutional layers (two layers each of 32, 64, and 128 filters) for RCC diagnosis, grading, and subtype assessment. Using only a 100 square micrometer patch, the model achieved 99.1% accuracy for RCC diagnosis, 97.5% for subtyping, and 98.4% for grading on the test set. The study included 15 ccRCC, 15 pRCC, and 12 chRCC cases, and training used early stopping: it was halted when validation performance ceased to improve.
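
A minimal PyTorch sketch of a network in the layout described above (two convolutional layers each of 32, 64, and 128 filters); the kernel sizes, pooling, and classification head are assumptions, since the summary does not specify them.

```python
# Sketch of a six-convolutional-layer CNN in the described layout
# (2 x 32, 2 x 64, 2 x 128 filters). Kernel sizes, pooling, and the
# classifier head are assumptions, not the authors' exact design.
import torch
import torch.nn as nn

class SmallRCCNet(nn.Module):
    def __init__(self, n_classes: int = 3):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(block(3, 32), block(32, 64), block(64, 128))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(128, n_classes))

    def forward(self, x):
        return self.head(self.features(x))

model = SmallRCCNet(n_classes=3)                 # e.g., ccRCC / pRCC / chRCC
logits = model(torch.randn(4, 3, 128, 128))      # batch of image patches
```

Early stopping, as described, would simply monitor a validation metric each epoch and halt training once it stops improving.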

Oncocytoma distinction: Zhu et al. tackled one of pathology's well-known challenges: differentiating oncocytomas from chRCC. They trained and tested ResNet-based models (comparing ResNet-18, ResNet-34, ResNet-50, and ResNet-101) on 486 surgical resection specimens and 79 RMB slides (including 24 renal oncocytomas); ResNet-18 was selected because it achieved the highest average F1-score (0.96). The model reached 97% accuracy on both surgical resection and RMB specimens, and external validation on 908 patients (505 ccRCC, 294 pRCC, 109 chRCC) yielded 95% accuracy.
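
The model-selection step can be sketched as follows; fine_tune and predict_validation_set are hypothetical helpers standing in for the training loop, so this illustrates backbone selection by validation F1 rather than Zhu et al.'s exact code.

```python
# Sketch: fine-tune several ResNet depths and keep the one with the best
# validation F1-score. fine_tune and predict_validation_set are
# hypothetical helpers, not real library functions.
import torch.nn as nn
from torchvision import models
from sklearn.metrics import f1_score

candidates = {
    "resnet18": models.resnet18, "resnet34": models.resnet34,
    "resnet50": models.resnet50, "resnet101": models.resnet101,
}

best_name, best_f1 = None, -1.0
for name, ctor in candidates.items():
    net = ctor(weights="IMAGENET1K_V1")              # ImageNet pre-training
    net.fc = nn.Linear(net.fc.in_features, 2)        # oncocytoma vs. chRCC
    fine_tune(net)                                   # hypothetical helper
    y_true, y_pred = predict_validation_set(net)     # hypothetical helper
    f1 = f1_score(y_true, y_pred, average="macro")
    if f1 > best_f1:
        best_name, best_f1 = name, f1
```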

LASSO-based approaches: Chen et al. used LASSO (least absolute shrinkage and selection operator) feature selection on The Cancer Genome Atlas (TCGA) cohort of 362 ccRCC, 128 pRCC, and 84 chRCC cases. Their pipeline achieved 94.5% accuracy distinguishing ccRCC from normal tissue, 97% accuracy separating ccRCC from pRCC and chRCC, and predicted 1-, 3-, and 5-year disease-free survival at 88.8%, 90.0%, and 89.6% accuracy. External validation on 436 patients showed 87.6% accuracy for diagnosis and 81.4% for subtyping.
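
A hedged sketch of LASSO-style feature selection feeding a downstream classifier, assuming precomputed feature matrices X_train/X_test and labels y_train/y_test; the regularization strength and the refit classifier are illustrative choices, not Chen et al.'s published pipeline.

```python
# Sketch: L1-penalized (LASSO-style) feature selection followed by a
# classifier refit on the surviving features. X_* are assumed
# n_samples x n_features arrays of precomputed image features.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lasso_selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
)
clf = make_pipeline(
    StandardScaler(),
    lasso_selector,                        # keeps features with non-zero weights
    LogisticRegression(max_iter=1000),     # refit on the selected features
)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```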

TL;DR: Fenstermaker's CNN achieved 99.1% diagnostic accuracy on 100-micrometer patches. Zhu's ResNet-18 model reached 97% accuracy on both surgical and biopsy specimens with 95% on external validation (n=908). Chen's LASSO pipeline hit 94.5% for ccRCC diagnosis and predicted 5-year DFS at 89.6% accuracy using TCGA data.
Pages 6-7
Surgical Resection Subtyping and Unsupervised Approaches

CNN with DAG-SVM: Tabibu et al. used two pre-trained CNNs based on ResNet-18 and ResNet-34 architectures to distinguish RCC subtypes from normal tissue, using 509 normal, 1,027 ccRCC, 303 pRCC, and 254 chRCC samples. Test set accuracy was 93.9% for ccRCC vs. normal parenchyma, 87.34% for chRCC vs. normal parenchyma, and 92.16% for overall subtyping. A directed acyclic graph support vector machine (DAG-SVM) layered on top of the deep network improved classification. Data augmentation included random patches, vertical flips, rotation, and noise addition, with weighted resampling to address class imbalance.
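
The augmentation and class-rebalancing steps map naturally onto standard PyTorch tooling; a sketch follows, with the crop size and noise scale as assumptions and train_dataset/train_labels as placeholders.

```python
# Sketch of the described augmentations (random patches, vertical flip,
# rotation, noise) plus weighted resampling for class imbalance.
# train_dataset and train_labels are assumed placeholders.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomCrop(224),                    # random patches
    transforms.RandomVerticalFlip(),               # vertical flip
    transforms.RandomRotation(degrees=90),         # rotation
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t + 0.01 * torch.randn_like(t)),  # noise
])

# Weighted resampling: draw each class inversely to its frequency.
labels = torch.tensor(train_labels)                # per-sample class labels
class_counts = torch.bincount(labels)
sample_weights = (1.0 / class_counts.float())[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels))
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```

Note that scikit-learn has no DAG-SVM; its one-vs-one SVC would be the closest off-the-shelf analogue for the final classification layer.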

Clear cell papillary RCC identification: Abdeltawab et al. developed a multi-scale CNN model using three CNNs for small, medium, and large patch sizes to distinguish ccRCC from clear cell papillary RCC (ccpRCC), a recently recognized distinct entity. Using 27 ccRCC and 14 ccpRCC cases with 50% patch overlap, the model achieved 91% accuracy for identifying ccpRCC and 90% accuracy for diagnosing ccRCC on an external dataset of 10 cases. This distinction matters because ccpRCC carries a favorable prognosis and has since been reclassified by the WHO as a clear cell papillary renal cell tumor.
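
A multi-scale design of this kind can be sketched as three backbone branches whose embeddings are concatenated before a shared classifier; the ResNet-18 branches and 512-dimensional embeddings are assumptions, not the authors' exact configuration.

```python
# Sketch of a multi-scale patch model: one CNN branch per patch size,
# embeddings concatenated before classification. Backbones and feature
# sizes are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision import models

def branch():
    net = models.resnet18(weights="IMAGENET1K_V1")
    net.fc = nn.Identity()                 # expose the 512-d embedding
    return net

class MultiScaleNet(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.small, self.medium, self.large = branch(), branch(), branch()
        self.classifier = nn.Linear(3 * 512, n_classes)

    def forward(self, x_small, x_medium, x_large):
        # Each input is the same slide location sampled at a different patch
        # size, then resized to the backbone's expected input resolution.
        z = torch.cat([self.small(x_small), self.medium(x_medium),
                       self.large(x_large)], dim=1)
        return self.classifier(z)
```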

Marostica et al. created a transfer learning pipeline comparing three deep CNN (DCNN) architectures: VGG-16, Inception-v3, and ResNet-50. Training on 537 ccRCC, 288 pRCC, and 103 chRCC cases, the model achieved test set AUCs of 0.990 for ccRCC, 1.00 for pRCC, and 0.9998 for chRCC. External validation on 913 patients yielded AUCs of 0.964 to 0.985 for ccRCC and 0.782 to 0.993 for subtyping. Faust et al. also demonstrated that an unsupervised ML system originally trained on brain tumors could be applied to cluster and analyze RCC specimens without RCC-specific labeling.
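
Per-subtype AUCs like these are computed one-vs-rest; a short sketch, assuming a matrix probs of predicted class probabilities and integer labels y_true (both placeholders).

```python
# Sketch: one-vs-rest AUC per subtype. y_true (integer labels) and probs
# (n_samples x 3 predicted probabilities) are assumed to exist.
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

classes = ["ccRCC", "pRCC", "chRCC"]
y_bin = label_binarize(y_true, classes=[0, 1, 2])  # one indicator column per subtype
for k, name in enumerate(classes):
    print(f"{name}: AUC = {roc_auc_score(y_bin[:, k], probs[:, k]):.3f}")
```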

TL;DR: Tabibu's ResNet/DAG-SVM achieved 92.16% subtyping accuracy across 2,093 samples. Marostica's DCNN pipeline (VGG-16, Inception-v3, ResNet-50) reached near-perfect AUCs (0.990 to 1.00) on test data and 0.964 to 0.985 on 913-patient external validation. Abdeltawab's multi-scale CNN distinguished the newly recognized ccpRCC subtype at 91% accuracy.
Pages 8-9
AI-Powered Nuclear Grading: Fuhrman and WHO/ISUP Systems

Tumor grading is one of the most critical prognostic factors in RCC. The Fuhrman grading system focuses on nuclear morphology (size, shape, and prominent nucleoli) but suffers from significant inter- and intra-observer variability. The newer WHO/ISUP grading system relies solely on nucleolar prominence for grades 1 through 3, which reduces interobserver variation. AI approaches have been developed for both systems.

SVM for Fuhrman grading: Yeh et al. trained a support vector machine (SVM) classifier on 39 ccRCC specimens that effectively identified nuclei, estimated their size, and calculated spatial distribution. The model distinguished low from high Fuhrman grades with an AUC of 0.97. However, it could not differentiate between specific grades (e.g., grade III vs. IV), and no survival analyses were presented. A non-specialist with no pathology training was able to participate in training the classifier through an interactive interface.
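
A sketch of the general recipe, measuring nuclei from a segmentation mask and feeding simple morphometrics to an SVM; the feature set is illustrative rather than Yeh et al.'s exact one, and nucleus_masks and grades are assumed placeholders.

```python
# Sketch: nuclear morphometrics from binary segmentation masks feeding an
# SVM for two-tier grading. Features and segmentation are illustrative.
import numpy as np
from skimage.measure import label, regionprops
from sklearn.svm import SVC

def nuclear_features(binary_mask):
    """Per-image summary of nuclear size, shape, and count from a mask."""
    props = regionprops(label(binary_mask))
    areas = np.array([p.area for p in props])
    ecc = np.array([p.eccentricity for p in props])
    return np.array([areas.mean(), areas.std(), ecc.mean(), len(props)])

X = np.stack([nuclear_features(m) for m in nucleus_masks])  # masks assumed given
clf = SVC(kernel="rbf", probability=True).fit(X, grades)    # low vs. high grade
```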

WHO/ISUP grading with nucleoli detection: Holdbrook et al. developed an automated pipeline on 59 ccRCC patients that detected prominent nucleoli and quantified nuclear pleomorphic patterns. The system used a cascade detector of prominent nucleoli (20 classifiers stacked sequentially) followed by classification via SVM, logistic regression, and AdaBoost. Grade prediction achieved an F-score of 0.78 to 0.83, and the image-derived scores correlated (R = 0.59) with a multigene prognostic score.
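
The cascade idea can be sketched generically: candidates pass through fitted classifiers in sequence, and any stage can reject them early; the stage count and candidate features here are assumptions.

```python
# Generic sketch of a cascade detector: a candidate counts as a detection
# only if every stage accepts it, so cheap stages reject negatives early.
import numpy as np

def cascade_predict(stages, X):
    """stages: list of fitted binary classifiers applied in order.
    X: candidate feature matrix. Returns a boolean keep-mask."""
    keep = np.ones(len(X), dtype=bool)
    for clf in stages:
        idx = np.flatnonzero(keep)
        if idx.size == 0:
            break
        keep[idx] = clf.predict(X[idx]) == 1   # this stage rejects negatives
    return keep
```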

LASSO-based histomic grading: Tian et al. used 395 ccRCC cases from TCGA, extracting 72 nuclear features and applying LASSO regression for feature selection. Of seven ML classification methods tested, LASSO demonstrated the highest performance with 84.6% sensitivity and 81.3% specificity for grade prediction. The predicted grade was significantly associated with overall survival (HR: 2.05; 95% CI 1.21 to 3.47), and outperformed both TCGA-assigned and pathologist-assigned grades in discordant cases.
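
The survival association can be sketched with lifelines' Cox proportional-hazards model, assuming per-patient predicted grades, follow-up times, and event indicators; the column names are illustrative.

```python
# Sketch: associate an ML-predicted grade with overall survival via a
# Cox proportional-hazards model. Input arrays are assumed placeholders.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "predicted_high_grade": predicted_grades,   # 0/1 from the ML model
    "os_months": survival_months,               # follow-up time
    "death_observed": event_indicator,          # 1 = death, 0 = censored
})
cph = CoxPHFitter()
cph.fit(df, duration_col="os_months", event_col="death_observed")
print(cph.hazard_ratios_)   # an HR near 2 would mirror the reported effect
```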

TL;DR: Yeh's SVM classifier achieved AUC 0.97 for two-tiered Fuhrman grading (n=39). Holdbrook's nucleoli-detection pipeline reached F-scores of 0.78 to 0.83 for WHO/ISUP grading (n=59). Tian's LASSO model on 395 TCGA cases achieved 84.6% sensitivity and 81.3% specificity, with predicted grades outperforming pathologist grades in survival prediction (HR: 2.05).
Pages 10-12
Linking Histopathology to Molecular Biomarkers and Treatment Response

Copy number alterations and mutations: Marostica et al. used transfer learning with VGG-16, Inception-v3, and ResNet-50 to predict copy number alterations (CNAs) and somatic mutations from WSI images. They demonstrated that CNAs in genes including KRAS, EGFR, and VHL affect quantitative histopathology patterns. For ccRCC KRAS CNA prediction, the model achieved an AUC of 0.724, while pRCC somatic mutation AUCs ranged from 0.419 to 0.684. The model also predicted tumor mutational burden with a Spearman correlation coefficient of 0.419. This approach was weakly supervised and did not require pixel-level segmentation, making it clinically applicable.
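
Weak supervision here means slide-level labels only, with no pixel-level annotation; a common sketch aggregates patch probabilities to a slide score and then correlates that score with TMB via Spearman's rho (mean pooling and the variable names are assumptions).

```python
# Sketch: weakly supervised slide-level scoring by averaging patch-level
# probabilities, then Spearman correlation against tumor mutational burden.
import numpy as np
from scipy.stats import spearmanr

slide_scores = [np.mean(p) for p in patch_probs_per_slide]  # one array per slide
rho, pval = spearmanr(slide_scores, tmb_values)
print(f"Spearman rho = {rho:.3f} (p = {pval:.3g})")
```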

VEGFR-TKI response prediction: Go et al. developed an ML-based classifier to identify which metastatic ccRCC patients would respond to VEGFR-TKI treatment. Using clinical, pathology, and molecular data from 101 patients with a 10-fold cross-validated SVM and decision tree analysis, the model achieved 87.5% apparent accuracy. The C-index was 0.7001 for progression-free survival (PFS) and 0.6552 for overall survival (OS). Features showing statistical differences between responder and non-responder groups were selected, with secondary feature selection via SVM to build the most efficient model.
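
The C-index for the PFS and OS endpoints can be computed with lifelines, assuming per-patient risk scores and follow-up data (placeholders below); scores are negated because concordance_index treats higher predicted values as predicting longer survival.

```python
# Sketch: concordance index (C-index) for PFS and OS, given model risk
# scores. All input arrays are assumed placeholders.
from lifelines.utils import concordance_index

c_index_pfs = concordance_index(pfs_months, -risk_scores, pfs_event_observed)
c_index_os = concordance_index(os_months, -risk_scores, os_event_observed)
```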

Vascular phenotype analysis: Ing et al. used ML to analyze tumor vasculature in ccRCC cases from TCGA. Using SVM and random forest classifiers to identify endothelial cells and generate vascular area masks within WSIs, they identified 9 vascular features with predictive value for disease-free survival (AUC = 0.79). They discovered a 14-gene expression signature correlated with these features and built two GLMNET models. The combined Stage + 14-gene model achieved a C-Index of 0.74, outperforming staging alone (C-Index = 0.70).

Methylation prediction from morphology: Zheng et al. used morphometric features extracted from histopathological images to predict DNA methylation values. They tested six ML models (logistic regression with LASSO, random forest, SVM, AdaBoost, naive Bayes, and a two-layer fully connected neural network) on 326 RCC cases, achieving average AUC and F1 scores above 0.6 across 30 training/testing splits with 5-fold cross-validation.

TL;DR: Marostica's DCNN predicted KRAS CNAs with AUC 0.724 from H&E slides alone. Go's SVM classifier predicted VEGFR-TKI response at 87.5% accuracy (n=101 metastatic ccRCC). Ing's vascular analysis identified a 14-gene signature improving prognostic C-Index from 0.70 to 0.74. These studies bridge histopathology with molecular biomarkers for precision oncology.
Pages 13-15
Multimodal AI Models for RCC Survival Prediction

Current prognostic models for localized ccRCC include the Leibovich score and the UISS (UCLA Integrated Staging System) score, while metastatic RCC relies on the MSKCC and IMDC risk groupings. However, MSKCC and IMDC classifications may differ in up to 23% of cases, and AI multimodal approaches can improve accuracy by up to 27.7% compared to single-modality methods.

CT plus histopathology integration: Ning et al. combined features from CT scans, histopathological images, clinical data, and genomic data using two CNNs with identical structures for image feature extraction. Their multimodal model achieved a mean C-index of 0.832 (range 0.761 to 0.903) on 209 ccRCC patients, with the BFPS (block filtering post-pruning search) algorithm for feature selection. Radiologic imaging proved to be the single modality with the best predictive performance.

Gene-morphology fusion: Cheng et al. were the first to combine gene expression data and histopathologic features for ccRCC prognosis, using 410 patients. Their unsupervised segmentation method for cell nuclei with LASSO-Cox modeling generated a risk index that outperformed predictions from morphologic features or eigengenes alone. The predicted risk could also stratify early-stage patients (stages I and II), whereas staging alone showed no significant survival difference in these cases.

Multimodal deep learning: Schulz et al. built a multimodal DL model using an 18-layer ResNet per image modality (histopathology slides, CT scans, MR scans) plus a dense layer for genomic data, with an attention layer to weight each modality. On 248 ccRCC patients, the model achieved a mean C-index of 0.7791 and mean accuracy of 83.43% for 5-year survival prediction. External validation on 18 patients yielded a C-index of 0.799 (maximum 0.8662) and accuracy of 79.17% (maximum 94.44%). The model outperformed T-stage, N-stage, M-stage, and grading as individual predictors.
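
A sketch of attention-weighted modality fusion in the spirit of this design, assuming each modality has already been embedded to a common dimension; the 512-dimensional embeddings and single-logit head are illustrative, not Schulz et al.'s exact architecture.

```python
# Sketch: learned attention weights over per-modality embeddings, followed
# by a weighted sum and a prediction head. Dimensions are assumptions.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Weight per-modality embeddings by learned attention, then predict."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # relevance score per modality
        self.head = nn.Linear(dim, 1)    # e.g., a 5-year survival logit

    def forward(self, embeddings):
        # embeddings: (batch, n_modalities, dim), e.g., one row each for
        # histopathology, CT, MR, and genomics.
        weights = torch.softmax(self.score(embeddings), dim=1)  # (B, M, 1)
        fused = (weights * embeddings).sum(dim=1)               # (B, dim)
        return self.head(fused)

model = AttentionFusion()
logit = model(torch.randn(8, 4, 512))    # 8 patients, 4 modalities
```

The attention weights also offer a degree of interpretability, since they indicate how much each modality contributed to a given prediction.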

TL;DR: Ning's multimodal model (CT + histopathology + genomics) achieved C-index 0.832 (n=209). Cheng's gene-morphology fusion outperformed single-modality prognosis in 410 patients and stratified early-stage cases where staging alone failed. Schulz's attention-based ResNet reached 83.43% accuracy for 5-year survival with C-index 0.799 on external validation.
Pages 16-17
The Black Box Problem and Interpretability in RCC Pathomics

A fundamental concern with supervised learning models is the "black box" problem: the machine generates answers (e.g., low or high grade, or a specific subtype) based on learned internal representations that humans cannot directly inspect or verify. This opacity makes pathologists reluctant to trust AI-generated findings before approving reports and discussing them in multidisciplinary meetings. The authors identify this as a significant barrier to clinical adoption.

One proposed solution is gradient-weighted class activation mapping (Grad-CAM), a visualization tool that overlays heatmaps on images to highlight which cell types or regions contributed most to the model's decision. Marostica et al. used Grad-CAM to identify regions of greatest importance for CNA and mutation predictions. This type of explainability is essential for pathologists to understand and validate AI outputs rather than relying on blind trust.
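
A minimal Grad-CAM sketch for a torchvision CNN: capture activations and gradients at the last convolutional block via hooks, weight the activation channels by their pooled gradients, and upsample the result to image size. The backbone and layer choice are assumptions for illustration.

```python
# Minimal Grad-CAM sketch: hooks capture activations and gradients at the
# last convolutional block; channels are weighted by pooled gradients.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
acts, grads = {}, {}

model.layer4.register_forward_hook(lambda m, i, o: acts.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 224, 224)        # stand-in for an H&E patch tensor
score = model(x)[0].max()              # logit of the top-scoring class
score.backward()

w = grads["g"].mean(dim=(2, 3), keepdim=True)            # pooled gradients
cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))   # channel-weighted map
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
# `cam` can now be normalized and overlaid on the patch as a heatmap.
```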

Another approach is unsupervised "search and match" methods rather than direct classification. Faust et al. demonstrated this by applying an AI system trained on brain tumors to cluster and analyze RCC specimens. This method resembles the current pathology workflow where pathologists compare specimen images to atlases of known conditions. While this approach does not eliminate the need for human expert interpretation, it provides a more transparent process where the pathologist can visually verify the matches.

TL;DR: The "black box" nature of supervised AI models creates trust issues for pathologists. Grad-CAM heatmaps and unsupervised search-and-match methods offer transparency. Explainability tools are critical for clinical adoption, as pathologists need to verify AI decisions before acting on them in multidisciplinary settings.
Pages 16-18
Overfitting, Stain Variability, and the Generalization Gap

A major obstacle to deploying AI in clinical practice is generalization error, which occurs when models trained on data from one pathology laboratory fail to perform well on data from another institution. The color distribution of WSIs varies across different laboratories due to differences in the staining process, tissue preparation, and digital scanning equipment. This inter-center variability impacts state-of-the-art CNN-based algorithms, which often show reduced performance on images from centers other than the one used for training.

Overfitting is a related problem where a model becomes so finely tuned to its training dataset that it memorizes specific patterns rather than learning generalizable features. If the training and test data come from the same laboratory, overfitting may go undetected because the model performs well on familiar data. The authors stress that any features selected based on idiosyncrasies in the training data, such as technical or sampling biases, will likely fail when applied to new datasets.

Solutions include stain color augmentation and stain color normalization, with ML-based methods that perform normalization using neural networks. The most effective method to detect overfitting is external validation, which tests the model on a completely separate group of patients from a different institution. Adequate performance on a reasonably extensive external validation set provides the strongest evidence of a model's generalizability. However, many of the reviewed studies lacked external validation entirely. The authors call for a global standard for tissue processing, staining, slide preparation, and digital acquisition.
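
Stain color augmentation is often implemented by perturbing the stain-separated (HED) representation of a patch; a sketch using scikit-image's color deconvolution follows, with the jitter range as an assumption (normalization methods vary and are not shown).

```python
# Sketch of stain color augmentation: decompose an H&E patch into the
# hematoxylin-eosin-DAB (HED) stain space, randomly perturb each stain
# channel, and recompose to RGB. The jitter range is an assumption.
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def hed_jitter(rgb_patch, sigma=0.03):
    """rgb_patch: float image in [0, 1]; returns a stain-augmented copy."""
    hed = rgb2hed(rgb_patch)
    scale = 1.0 + np.random.uniform(-sigma, sigma, size=3)  # per-stain factor
    shift = np.random.uniform(-sigma, sigma, size=3)        # per-stain offset
    return np.clip(hed2rgb(hed * scale + shift), 0.0, 1.0)
```

Exposing the model to such synthetic stain variation during training is one way to reduce the inter-center performance drop described above.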

TL;DR: CNN-based models often fail when applied to WSIs from different laboratories due to stain color variability and overfitting. External validation is critical but missing from many studies. Stain normalization, color augmentation, and standardized tissue processing protocols are needed to close the generalization gap for clinical deployment.
Pages 16-18
The Road Ahead for AI in RCC Computational Pathology

The authors note a significant shift in computational pathology research over the past decade. Initially, the goal was to replicate the diagnostic process already performed by pathologists. More recently, the field has moved toward uncovering "sub-visual" prognostic image cues from histopathological images that are imperceptible to the human eye. This transition represents a fundamental change in how AI can contribute to pathology, moving from automation to genuine augmentation.

The fusion of radiomics (extracting computational features at the macroscopic level from imaging) and pathomics (quantitative analysis at the microscopic level from tissue slides) offers a future opportunity to combine tumor heterogeneity information at both macro and micro scales. Many radiomic studies already use histopathology results as the reference standard to evaluate their models, suggesting that integrating both modalities could produce stronger combined signatures.

Before these promising technologies reach widespread clinical use, several critical steps must be addressed. Models need to be externally validated on large, diverse cohorts from multiple institutions. Cross-validation alone is insufficient because biased input data can lead to biased evaluations. The authors recommend always checking for potential sample bias and assessing whether issues related to sample size, heterogeneity, noise, and confounding factors exist before model training.

The review concludes that AI in RCC pathology holds substantial promise for overcoming inter- and intra-observer variability, reducing time consumption, and revealing microscopic patterns invisible to human assessment. Current algorithms perform on par with, or better than, state-of-the-art conventional methods, but most remain unavailable for widespread clinical use. The path forward requires standardized validation workflows, larger and more diverse datasets, transparent and interpretable models, and prospective comparison with established clinical prognostic tools.

TL;DR: The field is shifting from replicating pathologist tasks to discovering sub-visual prognostic cues invisible to humans. Radiomics-pathomics fusion, external validation on diverse multi-institutional cohorts, and standardized tissue processing are essential next steps. Current AI models match or outperform traditional methods but need rigorous prospective validation before clinical deployment.
Citation: Distante A, Marandino L, Bertolo R, et al. Artificial Intelligence in Renal Cell Carcinoma Histopathology: Current Applications and Future Perspectives. Diagnostics. 2023;13(13):2294. DOI: 10.3390/diagnostics13132294. PMC: PMC10340141. Open access under a CC BY license.