Deep learning techniques for imaging diagnosis of renal cell carcinoma: current and emerging trends

Frontiers in Oncology, 2023

Plain-English Explanations
Pages 1-2
Why Deep Learning Matters for Kidney Cancer Diagnosis

Renal cell carcinoma (RCC) is one of the most common and lethal cancers of the urinary system, accounting for roughly 4% of all human malignancies. In 2020 alone, there were approximately 431,288 new cases worldwide. The cancer originates from the tubular epithelial cells of the renal parenchyma, and clear cell RCC (ccRCC) is by far the most prevalent pathological subtype, comprising 60% to 80% of all RCC cases. RCC is often asymptomatic in its early stages, with an estimated 15-40% of cases discovered incidentally during CT scans performed for unrelated reasons. By the time of diagnosis, roughly 25-30% of patients already have metastatic disease, making early detection critically important for improving outcomes.

This 2023 review from Frontiers in Oncology is described by the authors as the first comprehensive review of deep learning specifically applied to RCC. The paper covers multiple domains of application: differentiating benign from malignant renal tumors, identifying pathological subtypes, grading tumor severity, analyzing digital pathology images, working with ultrasound data, and predicting patient prognosis. The imaging modalities discussed include contrast-enhanced computed tomography (CECT), magnetic resonance imaging (MRI), contrast-enhanced ultrasound (CEUS), and digital histopathology slides.

Deep learning versus traditional machine learning: The authors distinguish between traditional machine learning methods (support vector machines, random forests, decision trees, K-nearest neighbor, naive Bayes, logistic regression) and deep learning approaches built on convolutional neural networks (CNNs). CNNs and their improved variants have reached human-level classification performance in many medical imaging tasks, including rectal, breast, and lung cancer. In urology specifically, deep learning models have shown strong results across RCC, prostate cancer, bladder cancer, and urolithiasis detection.

Clinical motivation: Percutaneous biopsy is the standard for confirming renal tumor diagnosis, but it carries risks including a 1.2% rate of biopsy tract seeding overall, rising to 12.5% for papillary RCC. Complications such as hematoma, severe hematuria, pneumothorax, and hemorrhage, though relatively uncommon, further motivate the need for reliable non-invasive diagnostic tools. Deep learning-based preoperative imaging systems, trained against pathological ground truth, can often reach more than 90% accuracy and could help patients avoid unnecessary biopsy risks.

TL;DR: RCC accounts for 4% of human cancers with 431,288 new cases globally in 2020. About 25-30% of patients present with metastases at diagnosis. This review, described as the first on deep learning in RCC, covers tumor classification, subtyping, grading, pathology, ultrasound, and prognosis prediction across CT, MRI, CEUS, and histopathology.
Pages 2-3
Distinguishing Benign from Malignant Renal Tumors

Not all incidentally discovered renal masses are cancerous. Up to 20% of solid renal tumors smaller than 4 cm are benign, most commonly renal oncocytoma (RO) and fat-poor angiomyolipoma (fpAML). Accurately distinguishing benign from malignant tumors before surgery is essential because precise preoperative diagnosis can spare patients from unnecessary surgical intervention. The review summarizes multiple deep learning studies tackling this binary classification problem using different imaging modalities and architectures.

CT-based approaches: Zhou et al. (2019) used the Inception-v3 architecture on CECT images from 134 malignant and 58 benign tumors, achieving 97% accuracy, 95% sensitivity, and 97% specificity. Tanaka et al. (2020) focused specifically on small renal tumors (4 cm or less), training six Inception-v3 models on four-phase CECT data from 168 tumors. The nephrographic (NP) phase images yielded the highest accuracy at 88% with an AUC of 0.846. Zabihollahy et al. (2020) developed a custom 6-layer CNN with semi-automatic and fully automatic tumor segmentation on 77 benign and 238 malignant tumors. The semi-automatic method achieved 83.75% accuracy, 89.05% precision, and 91.73% recall, while the fully automated method reached 77.36% accuracy, 85.92% precision, and 87.22% recall.
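The accuracy, sensitivity (recall), and precision figures quoted throughout these studies all derive from the same confusion-matrix counts. A minimal sketch in plain Python, using hypothetical counts (not taken from any of the cited studies):

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall: malignant cases caught
        "specificity": tn / (tn + fp),  # benign cases correctly cleared
        "precision": tp / (tp + fp),    # how many "malignant" calls were right
    }

# Hypothetical test set: 100 malignant, 100 benign tumors.
m = confusion_metrics(tp=90, fp=5, tn=95, fn=10)
```

With these illustrative counts, accuracy is 0.925, sensitivity 0.90, and specificity 0.95, which shows why a model can look strong on accuracy while its sensitivity (the number that matters most for not missing cancers) lags behind.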

MRI-based approaches: Xi et al. (2020) assembled a large dataset of 1,162 renal lesions (655 malignant, 507 benign) and applied a ResNet residual network to MRI (T1C and T2WI sequences). The deep learning model achieved an accuracy of 0.70, sensitivity of 0.92, and specificity of 0.41, and its overall performance was significantly higher than that of both the radiomics-only model and expert radiologist assessments. The high sensitivity of 0.92 is particularly noteworthy for a screening context, where missing malignant cases is the primary concern.

Across all of these studies, CT-based models generally achieved higher accuracy and specificity than MRI-based models for benign versus malignant classification, likely due to the sharper contrast enhancement patterns captured by multi-phase CECT. However, MRI remains critical for patients with contraindications to CT contrast agents, including allergies and pregnancy.

TL;DR: Up to 20% of small renal tumors are benign. Inception-v3 on CECT achieved 97% accuracy for benign vs. malignant classification (Zhou et al.), while small-tumor-specific models reached 88% accuracy with 0.846 AUC. MRI-based ResNet on 1,162 lesions reached 0.92 sensitivity. Semi-automatic CNN segmentation yielded 83.75% accuracy and 91.73% recall.
Pages 3-5
Classifying RCC into Pathological Subtypes with Deep Learning

The World Health Organization (WHO) classifies renal tumors into multiple subtypes, each with distinct growth patterns, treatment requirements, and recurrence risks. Clear cell RCC (ccRCC) dominates at 60-80% of cases, while the remaining non-ccRCC tumors include papillary RCC (pRCC) and chromophobe RCC (chRCC) as the most common variants. Precise preoperative identification of subtype matters clinically because angiomyolipoma, oncocytoma, and cystic lesions can be followed conservatively, while different subtypes of malignant tumors require different targeted therapy and immunotherapy regimens.

pRCC versus chRCC: These two non-ccRCC subtypes share overlapping imaging features, especially in early-stage or small tumors, where distinguishing characteristics such as cysts, necrosis, and calcification (pRCC) or central whorl-like enhancement (chRCC) are often atypical. Teng et al. (2021) tested six deep learning architectures: MobileNetV2, EfficientNet, ShuffleNet, ResNet-34, ResNet-50, and ResNet-101. The best performer, MobileNetV2, achieved 96.9% accuracy in the validation set (99.4% sensitivity, 94.1% specificity) and 100% case accuracy with 93.3% image accuracy on the external test set. This substantially outperformed manual classification, which achieved only 85% accuracy (100% sensitivity, 70% specificity).

Multi-class subtype discrimination: Han et al. (2019) built a GoogLeNet-based model to classify ccRCC, pRCC, and chRCC simultaneously using three-phase CECT data from 169 tumors. The model achieved approximately 0.85 accuracy, sensitivity ranging from 0.64 to 0.98 across subtypes, specificity of 0.83-0.93, and an AUC of 0.9. Zheng et al. (2021) took a different approach using MRI T2WI with ResNet to classify four tumor types (ccRCC, pRCC, chRCC, and AML) in 199 patients. Overall accuracy was 60.4% with a macro-average AUC of 0.82, but performance varied by subtype: AUCs for ccRCC, chRCC, AML, and pRCC were 0.94, 0.78, 0.80, and 0.76, respectively.

Binary subtype classifiers: Several studies focused on clinically challenging binary distinctions. Baghdadi et al. (2020) differentiated RO from chRCC using NiftyNet on CECT, achieving 95% accuracy (100% sensitivity, 89% specificity). Nikpanah et al. (2021) used ResNet-50 V2 to distinguish RO from RCC on three-phase CECT with 369 patients, achieving AUC of 0.973 and 93.3% accuracy on one test set, and AUC of 0.946 with 90.0% accuracy on a second. Zuo et al. (2021) applied AlexNet to MRI for ccRCC versus RO classification, reaching 91% accuracy with an AUC of 0.9.

TL;DR: MobileNetV2 distinguished pRCC from chRCC with 96.9% accuracy (vs. 85% for manual classification). GoogLeNet achieved 0.85 accuracy and 0.9 AUC for three-subtype classification. NiftyNet reached 95% accuracy for RO vs. chRCC, and ResNet-50 V2 achieved AUC of 0.973 for RO vs. RCC. Performance varied by subtype, with ccRCC consistently easiest to identify (AUC 0.94).
Pages 4-5
Predicting Tumor Grade Non-Invasively with Deep Learning

Pathological grading of ccRCC is one of the strongest predictors of patient outcomes, and accurate preoperative grading can help urologists develop treatment strategies earlier. Two main grading systems are used: the Fuhrman system, long established in oncology, and the newer WHO/ISUP grading system introduced in 2012 by the International Society of Urological Pathology (ISUP), which classifies tumors into four grades (I through IV). Most deep learning studies simplify this into a binary task, classifying grades I-II as "low grade" and grades III-IV as "high grade."

CT-based grading models: Lin et al. (2020) trained ResNet models on CECT images of 410 ccRCC patients, achieving 73.7% accuracy with an AUC of 0.82 on internal validation, and 77.9% accuracy with AUC of 0.81 on an external test set. Xu et al. (2022) used the largest cohort in this area, with 706 ccRCC patients. They refined the training process by adding a mixed loss strategy and sample reweighting to address domain shift, noisy labels, and class imbalance. By combining four deep learning networks (RegNet-400, RegNet-800, ResNet-50, ResNet-101), the ensemble model achieved an AUC of 0.882, compared to 0.864 for the best single model. Yang et al. (2022) introduced TransResNet, a transformer-based architecture, on 759 ccRCC patients and achieved the highest performance in this category: 86.5% accuracy and an AUC of 0.912.
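The ensemble step in Xu et al. is, at its core, a fusion of the per-case probabilities produced by several trained networks. A minimal soft-voting sketch in plain Python, with made-up probabilities standing in for the outputs of four hypothetical models (the paper does not specify the exact fusion rule, so simple averaging is assumed here):

```python
# Probability of "high grade" predicted by four hypothetical models
# for three hypothetical cases (one inner list per model).
model_probs = [
    [0.70, 0.20, 0.55],
    [0.80, 0.10, 0.60],
    [0.60, 0.30, 0.45],
    [0.90, 0.20, 0.40],
]

def soft_vote(per_model):
    """Average each case's probability across models (soft voting)."""
    n_models = len(per_model)
    n_cases = len(per_model[0])
    return [sum(m[c] for m in per_model) / n_models for c in range(n_cases)]

fused = soft_vote(model_probs)
labels = ["high" if p >= 0.5 else "low" for p in fused]
```

Averaging smooths out each model's individual errors, which is the usual explanation for why the ensemble AUC (0.882) exceeded the best single model (0.864).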

MRI-based grading: Zhao et al. (2020) used ResNet-50 on MRI data (T2WI and T1C sequences) from 376 patients with 430 RCC lesions, restricting the analysis to AJCC stage I and II tumors. For the Fuhrman grading test set, the model achieved 0.88 accuracy, 0.89 sensitivity, and 0.88 specificity. On a separate WHO/ISUP-graded test set of 77 tumors, it achieved 0.83 accuracy, 0.92 sensitivity, and 0.78 specificity. The higher sensitivity on the WHO/ISUP set (0.92) suggests the model was particularly effective at identifying high-grade tumors, which is the more clinically urgent task.

The progression across these studies shows a clear trend: larger datasets, ensemble methods, and transformer architectures are steadily improving non-invasive grading accuracy. The jump from single-model AUC of 0.864 to ensemble AUC of 0.882 (Xu et al.) and further to 0.912 with TransResNet (Yang et al.) illustrates the impact of architectural innovation combined with larger training cohorts.

TL;DR: TransResNet achieved the best grading performance at 86.5% accuracy and AUC of 0.912 on 759 patients. Ensemble models on 706 patients reached AUC of 0.882 vs. 0.864 for single models. MRI-based ResNet-50 achieved 0.88 accuracy on Fuhrman grading and 0.92 sensitivity on WHO/ISUP grading. Larger cohorts and transformer architectures are driving performance gains.
Page 5
Combining Deep Learning with Traditional Radiomics

Radiomics is a technique that extracts high-throughput quantitative features from medical images, including shape, texture, and intensity metrics, then uses those features to build predictive models. Traditional radiomics relies on hand-crafted feature extraction paired with classical machine learning algorithms like support vector machines (SVM) and random forests. Deep learning-based radiomics takes a fundamentally different approach: instead of manually designing features, CNNs automatically learn optimal feature representations directly from raw image data through end-to-end training.
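To make "hand-crafted features" concrete, here is a minimal sketch of first-order intensity features of the kind a traditional radiomics pipeline computes from a segmented region of interest. The ROI values are hypothetical CT intensities; real pipelines extract hundreds of such features (including shape and texture):

```python
# Hypothetical 3x3 region of interest (CT intensity values).
roi = [
    [40, 55, 60],
    [35, 50, 65],
    [45, 70, 30],
]

def first_order_features(roi):
    """Basic first-order intensity statistics over the ROI voxels."""
    vals = [v for row in roi for v in row]
    n = len(vals)
    mean = sum(vals) / n
    return {
        "mean": mean,
        "variance": sum((v - mean) ** 2 for v in vals) / n,
        "energy": sum(v * v for v in vals),   # sum of squared intensities
        "range": max(vals) - min(vals),
    }

feats = first_order_features(roi)
```

In traditional radiomics these numbers feed an SVM or random forest; in deep learning-based radiomics, the CNN replaces this hand-designed stage with learned features.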

The review highlights several key differences between these approaches. Traditional radiomics performance is limited by the quality and selection of manually engineered features, and the researcher must decide which features to extract before training begins. Deep learning-based radiomics eliminates this bottleneck by learning hierarchical, abstract features through multiple neural network layers. This automatic feature learning is particularly advantageous when dealing with large-scale, complex medical image datasets where the most informative features may not be obvious to human designers.

Some studies have taken a hybrid approach, using CNNs to model features that were originally extracted using traditional radiomics pipelines. This combination leverages the domain knowledge embedded in radiomics feature design while gaining the pattern recognition power of deep learning. The review notes that these hybrid models can outperform either approach alone, though the advantage of fully end-to-end deep learning typically grows with dataset size. For smaller datasets where overfitting is a concern, the structure imposed by traditional radiomics features can serve as useful regularization.

TL;DR: Traditional radiomics uses hand-crafted features (shape, texture, intensity) with classical ML, while deep learning radiomics learns features automatically through CNNs. Hybrid approaches that combine radiomics feature extraction with deep learning modeling can outperform either method alone, especially on smaller datasets where hand-crafted features help prevent overfitting.
Pages 6-7
Deep Learning in Digital Pathology, Ultrasound, and Prognosis Prediction

Digital pathology: Manual identification of RCC subtypes under the microscope is time-consuming and suffers from high inter-observer and intra-observer variability. The emergence of whole-slide imaging in digital pathology has enabled deep learning models to automatically classify histopathology images. Tabibu et al. (2019) trained ResNet-18 and ResNet-34 models on 1,027 ccRCC, 303 pRCC, 254 chRCC, and 477 normal tissue samples, achieving 93.39% accuracy for distinguishing ccRCC from normal tissue, 87.34% for chRCC from normal tissue, and 94.07% for the three-way RCC subtype classification.

Biopsy and surgical slide classification: Zhu et al. (2021) developed a model that classifies digitized slides into five categories (ccRCC, pRCC, chRCC, RO, and normal tissue) and validated it across multiple settings. The mean AUC was 0.98 on both internal surgical resection sections and internal biopsy sections, and 0.97 on 917 external sections from The Cancer Genome Atlas (TCGA) database. Abu Haeyeh et al. (2022) trained three multi-scale CNNs with decision fusion for classifying four tissue types including the challenging distinction between ccRCC and clear cell papillary RCC (ccpRCC), achieving 93.0% overall accuracy, 91.3% sensitivity, and 95.6% specificity at the slide level.

Ultrasound applications: A systematic review and meta-analysis of 16 studies found that contrast-enhanced ultrasound (CEUS) and CECT have comparable diagnostic sensitivity for benign versus malignant renal masses (0.90 vs. 0.96). Zhu et al. (2022) developed MUF-Net (Multimodal Ultrasound Fusion Network), trained on 9,794 images cropped from CEUS videos across 81 benign and 100 malignant tumors. MUF-Net achieved 80.0% accuracy, 80.4% sensitivity, 79.1% specificity, and AUC of 0.877. Notably, this outperformed both junior radiologists (70.6% accuracy, AUC 0.740) and senior radiologists (75.7% accuracy, AUC 0.794).
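The AUC values used to compare MUF-Net with the radiologists have a simple probabilistic reading: the chance that a randomly chosen malignant case receives a higher model score than a randomly chosen benign case. A minimal rank-based sketch in plain Python, with illustrative scores:

```python
def auc(pos_scores, neg_scores):
    """AUC via the Mann-Whitney interpretation: fraction of
    positive/negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model scores for malignant and benign tumors.
malignant = [0.9, 0.8, 0.4]
benign = [0.7, 0.3]
result = auc(malignant, benign)
```

Here 5 of the 6 malignant/benign pairs are ranked correctly, giving an AUC of about 0.83, close to the 0.877 reported for MUF-Net.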

Prognosis prediction: Schulz et al. (2021) pioneered multimodal deep learning for RCC prognosis by integrating histopathological images, CT/MRI scans, and genomic data from whole-exome sequencing of 248 ccRCC patients. Their multimodal deep learning model (MMDLM) achieved an average C-index of 0.7791 and an average accuracy of 83.43%. However, the study was limited by missing imaging data for some patients and a relatively small dataset, highlighting the early stage of prognosis prediction research in RCC.
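The C-index (concordance index) reported by Schulz et al. generalizes AUC to survival data: among comparable patient pairs, it is the fraction where the patient who died earlier was also assigned the higher risk score. A minimal sketch of Harrell's C-index in plain Python, using made-up survival data and the basic censoring rule (a pair is usable only when the earlier event was actually observed):

```python
# Illustrative data: follow-up time (months), event indicator
# (1 = death observed, 0 = censored), and model risk score per patient.
times = [5, 12, 20, 30]
events = [1, 1, 0, 1]
risks = [0.9, 0.6, 0.7, 0.2]

def c_index(times, events, risks):
    """Harrell's concordance index (ties in risk count half)."""
    concordant, permissible = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: patient i's event was observed first.
            if events[i] == 1 and times[i] < times[j]:
                permissible += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / permissible

ci = c_index(times, events, risks)
```

A C-index of 0.5 means random risk ranking and 1.0 means perfect ranking, so the MMDLM's 0.7791 indicates a clearly informative but imperfect prognostic model.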

TL;DR: Digital pathology models achieved AUCs of 0.97-0.98 across internal and external validation (Zhu et al.). MUF-Net on ultrasound reached 80.0% accuracy and 0.877 AUC, outperforming senior radiologists (75.7%, 0.794 AUC). The first multimodal prognosis model combining histopathology, imaging, and genomics achieved a C-index of 0.7791 and 83.43% accuracy on 248 patients.
Pages 7-8
Critical Limitations Facing Deep Learning in RCC

Single-center bias and small sample sizes: Most studies reviewed were conducted at single institutions and have not been validated in independent cohorts, which limits generalizability. The typical sample size in hotspot research areas ranges from only 100-200 cases, creating significant overfitting risk. The authors stress the urgent need for multi-center image sharing platforms and larger datasets to establish models with stable, reproducible performance.

Lack of prospective validation: Nearly all current studies are retrospective in design, lacking the large-sample, randomized, multi-center prospective trials needed to bridge the gap to clinical application. Without prospective validation, it remains unclear how well these models would perform in real-world clinical workflows where patient populations, imaging equipment, and acquisition protocols vary substantially from the training data.

Standardization and reproducibility issues: The deep learning image acquisition process lacks a unified standard or evaluation system. Differences in imaging equipment parameters, image reconstruction methods, imaging physician habits, and patient compliance all reduce the comparability of studies. Image segmentation, a critical step in model building, presents its own reproducibility challenges, as manual, semi-automatic, and fully automatic methods each have different accuracy and repeatability characteristics. Both overfitting and underfitting can compromise model repeatability, and optimizing segmentation accuracy with high repeatability remains an open problem.

Model explainability: Deep learning models typically consist of many layers with complex nonlinear mappings, making their decision-making processes opaque. This "black box" nature raises trust and acceptance issues in clinical practice. Researchers have proposed interpretability tools such as Grad-CAM (Gradient-weighted Class Activation Mapping), which visualizes which image regions drive a specific prediction, as well as LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations). These methods provide some explanatory insights, but fully transparent clinical decision support remains elusive.
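Mechanically, a Grad-CAM heat map is the ReLU of a gradient-weighted sum of the final convolutional layer's activation maps. A minimal sketch of that combination step in plain Python, assuming the per-channel weights (the spatially pooled gradients of the target class score) have already been computed; all values here are illustrative:

```python
# Hypothetical 2x2 activation maps for two channels of the final conv layer.
activations = [
    [[1.0, 0.0],
     [0.0, 1.0]],
    [[0.0, 1.0],
     [1.0, 0.0]],
]
# In real Grad-CAM these weights are the global-average-pooled gradients
# of the target class score with respect to each activation map.
alphas = [1.0, -1.0]

def grad_cam(acts, alphas):
    """Weighted sum of activation maps followed by ReLU."""
    h, w = len(acts[0]), len(acts[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for a, alpha in zip(acts, alphas):
        for i in range(h):
            for j in range(w):
                cam[i][j] += alpha * a[i][j]
    # ReLU keeps only regions that push the prediction toward the class.
    return [[max(0.0, v) for v in row] for row in cam]

heatmap = grad_cam(activations, alphas)
```

Upsampled and overlaid on the input image, this map highlights which tumor regions most influenced the prediction, which is what makes Grad-CAM useful as a sanity check in clinical imaging.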

Self-supervised learning gaps: Self-supervised learning, which leverages unlabeled data by designing tasks that generate labels automatically, has received limited research attention in RCC despite its potential to address data scarcity. This approach could reduce reliance on expensive labeled datasets and accelerate training, particularly for segmentation tasks where pixel-level annotation is especially costly.

TL;DR: Key limitations include single-center study designs with only 100-200 cases, purely retrospective validation, no standardized imaging protocols, poor model explainability (addressed partially by Grad-CAM, LIME, and SHAP), and insufficient exploration of self-supervised learning to address data scarcity.
Pages 7-9
Emerging Opportunities and Research Frontiers

Prognosis and treatment response prediction: The authors note that deep learning research in RCC currently remains concentrated on diagnosis and identification, with very limited work on predicting patient prognosis. Studies evaluating the efficacy of immunotherapy and targeted therapy in RCC patients using deep learning are particularly scarce. Given the increasing importance of personalized treatment selection in kidney cancer, this represents one of the largest unmet needs in the field.

Radiogenomics and molecular biology integration: The combination of radiomics with genomic data is forming a new discipline called radiogenomics. Several machine learning studies have already demonstrated the ability to identify specific gene expression patterns from preoperative images, including PET/MRI-based identification of VEGF gene expression, and CT-based identification of PBRM1, BAP1, and VHL gene mutation levels. Integrating proteomics data adds another dimension. The authors suggest that upgrading these existing machine learning models to deep learning architectures could significantly improve prediction accuracy and provide biologically plausible explanations for what deep learning models detect in imaging data.

Fine clinicopathological indicators: Beyond traditional benign-malignant differentiation and TNM staging, several clinically important indicators have not yet been addressed by deep learning in RCC, despite existing radiomics studies. These include juxtatumoral perinephric fat invasion, inferior vena cava tumor thrombosis and vessel wall invasion, and evaluation of perirenal fat adhesions. The authors note that methodologically, these studies are no longer difficult to perform, and it is only a matter of time before deep learning models are applied to these endpoints.

Pharmacokinetic DCE-MRI: Dynamic contrast-enhanced MRI (DCE-MRI) tracks the distribution and clearance of contrast agents to provide information about tumor blood flow, vascular permeability, and extracellular space. Wang et al. have demonstrated the potential of pharmacokinetic parameters for differentiating RCC subtypes and determining tumor malignancy. Applying CNNs to automatically extract and learn these pharmacokinetic parameters from dynamic sequences is a promising but challenging frontier, particularly regarding accurate parameter extraction across diverse dynamic sequences and the trade-off between temporal and spatial resolution.

Multidisciplinary collaboration: The training of deep learning models requires high-throughput data processing that exceeds traditional statistical methods. This creates higher demands for interdisciplinary communication among radiologists, surgeons, and computer engineers. The authors emphasize that figuring out how clinicians can better interface with engineers remains a practical challenge that must be addressed for deep learning research in RCC to mature and reach clinical deployment.

TL;DR: Key future directions include prognosis and treatment response prediction (especially for immunotherapy and targeted therapy), radiogenomics integration (VEGF, PBRM1, BAP1, VHL gene identification from imaging), DCE-MRI pharmacokinetic analysis, evaluation of fine indicators like perinephric fat invasion and vena cava thrombosis, and improved multidisciplinary collaboration between clinicians and engineers.
Citation: Wang Z, Zhang X, Wang X, et al. Deep learning techniques for imaging diagnosis of renal cell carcinoma: current and emerging trends. Frontiers in Oncology, 2023 (Open Access, CC BY). Available at PMC10505614. DOI: 10.3389/fonc.2023.1152622.