Artificial Intelligence and Radiomics in Evaluation of Kidney Lesions: A Comprehensive Literature Review

Therapeutic Advances in Urology, 2023

Plain-English Explanations
Pages 1-2
Why Radiomics and AI Matter for Kidney Lesions

Renal cell carcinoma (RCC) is the sixth most commonly diagnosed cancer in men and the tenth in women. Over recent decades, the growing availability of imaging technologies like ultrasound, CT, and MRI has led to a surge in incidental findings of kidney lesions. Up to 50% of all diagnosed renal lesions are considered small renal masses (SRMs), measuring 4 cm or less in diameter. Critically, up to 30% of these lesions turn out to be benign at final histology after surgical removal, meaning many patients undergo unnecessary nephrectomy.

Current diagnostic imaging still struggles to reliably distinguish RCC from benign lesions before surgery. Renal lesion biopsy offers high accuracy but is invasive, carries a non-diagnostic rate of approximately 15%, and has an erroneous diagnosis rate of around 10% due to tumor heterogeneity. Computer-aided diagnosis (CAD) using artificial intelligence (AI), machine learning (ML), and deep learning (DL) represents a promising new approach to close this diagnostic gap.

Radiomics is the field that extracts quantitative data from digital radiographic images, including statistical, geometrical, and textural features that capture contour characteristics, internal heterogeneity, and gray zone properties of lesions. By combining these features with clinical and pathological variables, radiomics can build descriptive and predictive models that capture subtle patterns invisible to human observers. This comprehensive review analyzed 49 studies from the last 7 years (through July 2022), plus 6 earlier landmark studies and 7 ongoing clinical trials, following PRISMA guidelines.
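To make the idea of first-order (histogram) features concrete, here is a minimal Python sketch that computes mean, variance, skewness, kurtosis, and Shannon entropy from a flat list of pixel intensities. The function name, the population-moment formulas, and the two toy intensity lists are illustrative assumptions. They are not the feature definitions or data used by any study in the review.

```python
import math
from collections import Counter


def first_order_features(pixels):
    """First-order statistics of the intensities inside a region of interest."""
    n = len(pixels)
    mean = sum(pixels) / n
    # Central moments in population form (divide by n).
    m2 = sum((p - mean) ** 2 for p in pixels) / n
    m3 = sum((p - mean) ** 3 for p in pixels) / n
    m4 = sum((p - mean) ** 4 for p in pixels) / n
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurtosis = m4 / m2 ** 2 if m2 > 0 else 0.0
    # Shannon entropy (in bits) of the discrete intensity histogram.
    counts = Counter(pixels)
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return {
        "mean": mean,
        "variance": m2,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "entropy": entropy,
    }


# Hypothetical toy regions: one uniform, one with four distinct intensities.
homogeneous = first_order_features([100] * 16)
heterogeneous = first_order_features([40, 60, 80, 100] * 4)
print(homogeneous["entropy"], heterogeneous["entropy"])
```

In this toy case the uniform region has zero entropy, while the mixed region has higher entropy. This is the sense in which such features are said to capture internal heterogeneity.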

TL;DR: Up to 30% of surgically removed kidney lesions are benign, highlighting a major diagnostic gap. Radiomics extracts quantitative imaging features (textures, shapes, histograms) from CT and MRI to build AI models that distinguish benign from malignant kidney masses. This review covers 55 studies and 7 ongoing trials through July 2022.
Pages 2-4
CT-Based Radiomics for Distinguishing Benign from Cancerous Kidney Tumors

Early CT texture analysis: Yu et al. studied 119 patients using histogram-based features (skewness and kurtosis) and reported an AUC of 0.91 and 0.93, respectively, for differentiating renal cancer from oncocytoma (ONC), with an AUC of 0.92 for distinguishing ONC from other tumors. Coy et al. (n=200) achieved an AUC of 0.85 using CAD to discriminate malignant versus benign lesions. Erdim et al. (n=79) used a random forest (RF) algorithm on 198 unenhanced CT features and 244 contrast-enhanced CT features, achieving 90.5% accuracy and AUC of 0.915, which improved to 91.7% after eliminating collinear features.
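The summary does not say how collinear features were eliminated in the Erdim et al. study. One common approach, sketched below as an assumption, is a greedy filter that drops any feature whose absolute Pearson correlation with an already-kept feature exceeds a threshold. The 0.95 threshold, the feature names, and the values are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def drop_collinear(features, threshold=0.95):
    """Keep a feature only if |r| with every previously kept feature is <= threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature table: one value per lesion for each feature.
features = {
    "mean_hu": [30.0, 42.0, 55.0, 61.0, 78.0],
    "median_hu": [31.0, 41.0, 56.0, 60.0, 79.0],  # nearly duplicates mean_hu
    "entropy": [2.1, 3.4, 1.9, 3.8, 2.5],
}
selected = drop_collinear(features)
print(selected)
```

The filter is order-dependent: the first feature of a correlated pair is kept and the later one is dropped. A real pipeline would need a principled ordering.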

Deep learning models: Zhou et al. (n=192) applied the InceptionV3 DL radiomics model and achieved an AUC of 0.97 for ROI-based data and 0.93 for rectangular box region data. Sun et al. (n=290) selected 57 features and built a radiomics ML model yielding AUC of 0.93 to 0.94 for differentiating RCC from fat-poor benign lesions. Uhlig et al. (n=94) found that the RF algorithm achieved AUC of 0.83, outperforming radiologists' assessment (AUC = 0.68), even when using 18 different CT scanners.

Larger cohorts: Nassiri et al. (n=684) tested two predictive models: REAL AdaBoost (AUC = 0.84) and RF (AUC = 0.77) for distinguishing benign from cancerous lesions, especially for SRMs when combined with clinical variables. Yap et al. (n=735) demonstrated increasing AUC from 0.67 to 0.75 across shape-only, texture-only, and combined radiomic models for malignant versus benign classification.
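Since nearly every result in this review is reported as an AUC, a small sketch of what that number means may help. The AUC equals the probability that a randomly chosen malignant case receives a higher model score than a randomly chosen benign case, with ties counted as half. The scores below are made up for illustration, and the pairwise loop is chosen for clarity rather than speed.

```python
def auc(scores_malignant, scores_benign):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for m in scores_malignant:
        for b in scores_benign:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(scores_malignant) * len(scores_benign))


# Hypothetical model outputs (higher = more likely malignant).
malignant = [0.9, 0.8, 0.7, 0.4]
benign = [0.6, 0.3, 0.2]
print(round(auc(malignant, benign), 3))
```

Under this reading, an AUC of 0.5 means the scores carry no ranking information, and an AUC of 1.0 means every malignant case outranks every benign one.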

TL;DR: CT-based radiomics achieves strong results: InceptionV3 deep learning reached AUC 0.97 (Zhou et al., n=192), RF algorithms hit 90.5% accuracy (Erdim et al., n=79), and RF outperformed radiologists (AUC 0.83 vs. 0.68) even across 18 different CT scanners (Uhlig et al.). Larger studies (n=684, n=735) reported more modest AUCs of 0.67 to 0.84.
Pages 4-5
MRI-Based Radiomics for Benign vs. Malignant Classification

Ensemble deep learning on MRI: Xi et al. analyzed a large cohort of 1,162 renal lesions using an ensemble DL model on MRI data. The model's AUC ranged from 0.52 to 0.76 across different clinical radiomics features. Compared with expert radiologists, the model achieved higher accuracy (0.70 vs. 0.60), sensitivity (0.92 vs. 0.80), and specificity (0.41 vs. 0.35). Said et al. (n=125) reported AUCs ranging from 0.62 to 0.90 for individual features, with the ML model achieving AUC of 0.73 on validation sets.
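The following sketch shows how sensitivity, specificity, and accuracy are derived from a confusion matrix, and why a model can pair high sensitivity with low specificity yet still post a moderate accuracy. The counts are a hypothetical cohort of 100 malignant and 100 benign lesions, chosen only to reproduce rates of 0.92 and 0.41. They are not the study's data, and the 50% prevalence is an assumption.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of malignant lesions flagged
    specificity = tn / (tn + fp)  # share of benign lesions cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


def accuracy_from_rates(sensitivity, specificity, prevalence):
    """Accuracy is a prevalence-weighted average of the two rates."""
    return sensitivity * prevalence + specificity * (1 - prevalence)


# Hypothetical counts: 100 malignant (92 detected), 100 benign (41 cleared).
sens, spec, acc = diagnostic_metrics(tp=92, fn=8, tn=41, fp=59)
print(sens, spec, acc)
print(accuracy_from_rates(0.92, 0.41, prevalence=0.5))
```

Because accuracy depends on the benign-to-malignant mix, the same sensitivity and specificity give different accuracies in cohorts with different prevalence. This is one reason accuracy figures are hard to compare across studies.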

ResNet-18 deep learning: Xu et al. (n=217) created three DL models using ResNet-18 architecture combined with RF classifiers on T2-weighted imaging (T2WI) and diffusion-weighted imaging (DWI). The results were: AUC of 0.906 for T2WI alone, 0.846 for DWI alone, and 0.925 for the combined model. This suggests that fusing multiple MRI sequences can improve classification performance over either sequence alone.

SVM on MRI features: Massa'a et al. (n=160) tested multiple ML algorithms and found that the support vector machine (SVM) trained on T2WI features achieved the best results with AUC of 0.79. Similar results were obtained for T1WI 4-minute delayed features. Interestingly, combining multiple radiomics features in this study did not improve model performance, suggesting that feature selection and quality matter more than quantity in some contexts.

TL;DR: MRI-based radiomics also shows promise: ResNet-18 combined T2WI and DWI data achieved AUC 0.925 (Xu et al., n=217), ensemble DL outperformed radiologists in sensitivity (0.92 vs. 0.80) on 1,162 lesions (Xi et al.), and SVM on T2WI reached AUC 0.79 (Massa'a et al., n=160).
Pages 5-7
Differentiating Angiomyolipoma from Renal Cell Carcinoma

Angiomyolipoma (AML) accounts for 40 to 55% of resected benign renal tumors and is typically identified by the presence of macroscopic fat in imaging. However, some AMLs contain low intratumor fat that cannot be detected, making their accurate characterization critical given AML's benign course and favorable prognosis. Misclassifying AML as RCC leads to unnecessary surgery.

CT-based approaches: Feng et al. (n=58) evaluated 42 CT-extracted features and achieved AUC of 0.939 using SVM recursive feature elimination. Cui et al. (n=171) developed an SVM classifier reaching AUC of 0.96 for AML vs. RCC differentiation. Yang et al. (n=113) extracted 774 radiomics features and reported AUC of 0.917. Ma et al. (n=84) built four logistic classifiers with AUCs from 0.839 to 0.950. Nie et al. (n=198) created a radiomics nomogram achieving training AUC of 0.879 and validation AUC of 0.846 from over 2,800 CT-based features. Ma et al. (n=163) built a CT nomogram reaching AUC of 0.968.

MRI-based approaches: Razik et al. (n=60) performed MRI analysis to distinguish AML, RCC, and ONC, reporting AUC greater than 0.8 with the best parameter being mean of positive pixels on DWI (AUC of 0.891). Jian et al. (n=171) found the T2WI model achieved AUC of 0.874, which increased to 0.919 when combined with urinary creatinine. Matsumoto et al. (n=58) showed the ADC map alone could differentiate AML from RCC with AUC of 0.87.

TL;DR: AML vs. RCC differentiation is critical because AML is benign but can mimic cancer on imaging. CT radiomics achieves excellent performance: SVM reached AUC 0.96 (Cui et al.), CT nomogram hit AUC 0.968 (Ma et al.), and MRI combined with urinary creatinine reached AUC 0.919 (Jian et al.).
Pages 7-9
Distinguishing Oncocytoma from RCC to Avoid Unnecessary Surgery

Renal oncocytoma (ONC) is a benign solid kidney neoplasm accounting for 3 to 7% of all renal tumors. Despite its benign nature and excellent prognosis, ONC is usually treated with surgical resection because current imaging cannot reliably distinguish it from RCC, particularly chromophobe RCC (chRCC) and clear cell RCC (ccRCC). A reliable non-invasive method to differentiate ONC from RCC before surgery would prevent many unnecessary procedures.

CNN-based classification: Baghdadi et al. (n=192) used convolutional neural networks (CNNs) on CT images to differentiate CD117-positive oncocytomas from chRCC. The tumor-to-cortex peak early-phase enhancement ratio (PEER) evaluation achieved 95% accuracy, 100% sensitivity, and 89% specificity. Li et al. (n=61) tested five classifiers and found SVM performed best with AUC of 0.96, sensitivity of 0.99, specificity of 0.80, and accuracy of 0.94 for distinguishing chRCC from ONC.

Texture and quantitative analysis: Raman et al. (n=99) used CT texture analysis with random forest (RF) and identified 90% of oncocytomas with sensitivity of 89% and specificity of 99%. Varghese et al. (n=174) found that adding Fourier analysis to histogram parameters improved combined model AUC to 0.87 to 1.00 for differentiating ONC from cancerous lesions. Sasaguri et al. (n=166) achieved AUCs of 0.82, 0.95, and 0.84 for differentiating ONCs from ccRCCs. Hoang et al. (n=41) used MRI texture parameters to achieve accuracy of 77.9%, sensitivity of 64.7%, and specificity of 85.9% for distinguishing ONC from RCC subtypes.

TL;DR: Oncocytoma is benign but often leads to unnecessary surgery due to imaging overlap with RCC. CNN-based PEER evaluation achieved 95% accuracy (Baghdadi et al.), SVM reached AUC 0.96 for chRCC vs. ONC (Li et al.), and RF identified 90% of oncocytomas with 99% specificity (Raman et al.).
Pages 9-11
AI for Differentiating RCC Subtypes: ccRCC, papRCC, and chRCC

RCC encompasses three major subtypes that differ in aggressiveness, prognosis, and treatment approach. Clear cell RCC (ccRCC) is the most aggressive, comprising 75% of all RCCs with significant metastatic potential. Papillary RCC (papRCC) accounts for 10 to 15%, and chromophobe RCC (chRCC) about 5%, both showing better survival rates. Accurate preoperative subtyping has direct implications for treatment selection, particularly with the growing use of molecular targeted drugs.

CT-based subtyping: Kocak et al. (n=68) achieved 84.6% accuracy using an artificial neural network (ANN) classifier with adaptive boosting for differentiating ccRCC from other types. The SVM classifier achieved 69.2% accuracy for the three-way ccRCC vs. papRCC vs. chRCC distinction. Han et al. (n=169) used a DL neural network that achieved AUC of 0.93 for ccRCC, 0.91 for papRCC, and 0.87 for chRCC. Li et al. (n=170 training, n=85 validation) built a radiomics model from corticomedullary CT images achieving AUC of 0.95 (accuracy 92.9%) for ccRCC vs. non-ccRCC, with five of eight key features strongly associated with VHL gene mutation.

MRI-based subtyping: Hoang et al. (n=41) used MRI texture parameters to differentiate papRCC from ccRCC with accuracy of 77.9%, sensitivity of 65.5%, and specificity of 88%. Paschall et al. (n=55) demonstrated whole-lesion ADC values could distinguish papRCC from ccRCC with AUC of 0.952, sensitivity of 84.5%, and specificity of 93.1%. Li et al. (n=92) used volumetric ADC histogram analysis to achieve the best AUC of 0.851 for SRM characterization.

Raman et al. (n=99) used CT texture analysis with RF modeling to correctly categorize ccRCCs in 91% of patients (sensitivity 91%, specificity 97%) and papRCCs in 100% of patients (sensitivity 100%, specificity 98%), suggesting RF-based CT texture analysis is a particularly effective approach for renal mass characterization.

TL;DR: RCC subtyping is critical for treatment planning. DL neural networks achieved AUC 0.93 for ccRCC classification (Han et al., n=169). CT radiomics with VHL gene correlation reached AUC 0.95 (Li et al.). RF on CT correctly classified 91% of ccRCCs and 100% of papRCCs (Raman et al., n=99). MRI ADC analysis achieved AUC 0.952 for papRCC vs. ccRCC (Paschall et al., n=55).
Pages 11-12
Predicting Tumor Nuclear Grade from Imaging with Machine Learning

Fuhrman grade is one of the most important pathological risk factors influencing patient outcomes, particularly the risk of cancer recurrence. While renal mass biopsy can determine grade, it is invasive and carries complications. Being able to predict nuclear grade directly from imaging before surgery would significantly improve patient management and treatment planning.

CT-based grading: Shu et al. extracted 1,029 radiomics features from corticomedullary and nephrographic CT scans and found that 11 and 24 features correlated with Fuhrman grades, confirming that radiomics can preoperatively assess the Fuhrman grade of kidney lesions. A retrospective study of 290 patients with 298 confirmed RCCs demonstrated a significant increase in entropy on CT imaging both in clear cell carcinoma and in tumors of higher Fuhrman grade. Yin et al. (n=25) found that SVMRadial, RF, and Bayesian models were best at predicting the Fuhrman grade of ccRCC using radiomics from contrast-enhanced CT (CECT) images.

MRI-based grading: A study on 34 RCC masses demonstrated that entropy at spatial scaling factors (SSF) on DWI, corticomedullary phase, and nephrographic phase were the best parameters for assessing RCC grading. Stanzione et al. developed five ML algorithms incorporating different MRI features to predict tumor grading, achieving accuracy greater than 90%. This texture analysis post-processing technique can quantify tumor heterogeneity across multiple parameters, whether applied to CT or MRI.

Semantic segmentation is also gaining popularity in this field, with promising results for differentiating RCC subtypes, though studies applying it to nuclear grading differentiation are still primarily based on pathological samples rather than imaging data.

TL;DR: Radiomics can predict Fuhrman nuclear grade from imaging. Shu et al. identified 11 to 24 features from 1,029 CT radiomics features that correlate with grade. MRI-based ML algorithms achieved over 90% accuracy for grade prediction (Stanzione et al.). SVMRadial, RF, and Bayesian models showed the best performance on CECT images.
Pages 12-14
Predicting Gene Mutations and Treatment Response with Radiomics

Radiogenomics integrates multi-scale genome data with imaging features through refined CAD systems. Lee et al. (n=58) combined radiomics parameters from CT scans with whole transcriptome sequencing (WTS) gene expression data to predict metastasis in pT1 RCC. Four radiomic features, including histogram features, gray-level co-occurrence matrix (GLCM), and voxel ratios from ROIs, were used to predict metastasis and identify correlated heterogeneous gene signatures.
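For readers unfamiliar with the gray-level co-occurrence matrix (GLCM), here is a minimal sketch. It counts how often a pixel with gray level i has a right-hand neighbor with gray level j, normalizes the counts to probabilities, and derives two common texture summaries from them. The 4x4 image, the single horizontal offset, and the choice of contrast and homogeneity are illustrative assumptions, not the configuration used by Lee et al.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for pixel pairs at offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """Large when neighboring pixels differ strongly in gray level."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def homogeneity(p):
    """Large when neighboring pixels have similar gray levels."""
    n = len(p)
    return sum(p[i][j] / (1 + abs(i - j)) for i in range(n) for j in range(n))


# Hypothetical 4x4 region quantized to four gray levels (0 to 3).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(image, levels=4)
print(round(contrast(p), 4), round(homogeneity(p), 4))
```

Unlike histogram features, the GLCM depends on where pixels sit relative to each other, so two regions with identical histograms can yield different GLCM features.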

Specific gene mutation prediction: Two studies assessed BAP1 mutation prediction using CT-based texture radiomics: the first reported AUC of 0.77, while the second achieved sensitivity of 90.4%, specificity of 78.8%, accuracy of 81%, and AUC of 0.89 on 65 ccRCC tumors. The PBRM1 mutation was investigated with a radiomics model achieving AUC of 0.925. A multi-gene study found AUC of 0.85 for VHL, PBRM1, and BAP1 genes combined. BAP1 prediction from The Cancer Genome Atlas data achieved AUC of 0.71 on nephrographic-phase CT images.

Treatment response in metastatic RCC: Goh et al. (n=39) analyzed 87 metastatic lesions in patients receiving TKIs (Sunitinib, Cediranib, Pazopanib, or Regorafenib) and found that texture uniformity was an independent predictor of progression (p = 0.005). Haider et al. (n=40) showed that size-normalized standard deviation (nSD) predicted OS (p = 0.01) and PFS (p = 0.01, p = 0.003). Khene et al. (n=48) predicted response to Nivolumab immunotherapy with accuracy of 0.71 to 0.91 and AUC of 0.67 to 0.92, with logistic regression outperforming RF, KNN, and SVM classifiers.

TL;DR: Radiogenomics predicts gene mutations from imaging: BAP1 mutation at AUC 0.89, PBRM1 at AUC 0.925, and multi-gene VHL/PBRM1/BAP1 at AUC 0.85. For metastatic RCC treatment response, texture uniformity predicts TKI progression (p = 0.005), and radiomics predicted Nivolumab response with up to AUC 0.92.
Pages 15-17
Comparing Machine Learning Algorithms Used Across Radiomics Studies

Support vector machines (SVM) were identified as the most frequently used classification algorithm across the reviewed studies. SVMs search for the optimal boundary (hyperplane) between two classes of data points and are effective when the classes are clearly separable. However, their performance degrades when data are perturbed or noisy. SVMs achieved strong results in studies differentiating benign from malignant tumors, distinguishing AML from RCC, and predicting gene mutations and therapy response.
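As a rough illustration of how an SVM finds a separating hyperplane, the sketch below trains a linear SVM by subgradient descent on the regularized hinge loss. It is a toy under stated assumptions: the two features per lesion, the six data points, and the hyperparameters (`lam`, `lr`, `epochs`) are hypothetical. The reviewed studies would have used established libraries and often nonlinear kernels.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=2000):
    """Minimize (lam/2)*||w||^2 + mean hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 penalty shrinks the weights on every step.
            w = [wj * (1 - lr * lam) for wj in w]
            # Points inside the margin (or misclassified) pull the hyperplane.
            if margin < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, clearly separable data: two made-up features per lesion.
X = [(2.0, 2.0), (3.0, 3.0), (3.0, 2.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
y = [1, 1, 1, -1, -1, -1]  # +1 = malignant, -1 = benign (illustrative labels)
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

The toy data are well separated, which is the regime where the text says SVMs work well. With overlapping or noisy classes, more points would fall inside the margin and the boundary would become less stable.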

Artificial neural networks (ANNs) model human brain functions with layered artificial nodes. Two studies used ANNs, achieving the best AUC results alongside SVMs. ANNs perform well with large quantities of data but face challenges with overfitting and with converging to a local rather than the global minimum of the error. Convolutional neural networks (CNNs) were used in two additional studies, with the advantage of not requiring hand-crafted features from experts, allowing end-to-end learning from raw imaging data.

Random forests (RF) combine predictions from multiple decision trees, with each tree learning from randomly selected data. RF handles nonlinear data well and can reduce the variable space to emphasize important features. RF obtained the best AUC values in multiple kidney lesion radiomics studies. LASSO regression (least absolute shrinkage and selection operator) pushes regression coefficients toward zero, improving interpretability and selecting important predictors. It was successfully used across many studies to reduce overfitting.
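The way LASSO pushes coefficients toward zero can be seen in a short coordinate-descent sketch. Each coefficient is updated with a soft-thresholding step, which sets it exactly to zero when the feature's correlation with the residual falls below the penalty `alpha`. The objective used here, (1/(2n))·||y − Xw||² + alpha·||w||₁ with no intercept, along with the two toy features and the target, are illustrative assumptions.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, alpha, iterations=100):
    """Cyclic coordinate descent for the LASSO objective (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iterations):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(w[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Hypothetical standardized features: the first drives the target strongly,
# the second contributes only a tiny amount.
X = [(1.0, 1.0), (-1.0, 1.0), (1.0, -1.0), (-1.0, -1.0)]
y = [2.05, -1.95, 1.95, -2.05]  # equals 2.0 * feature_1 + 0.05 * feature_2
w = lasso_coordinate_descent(X, y, alpha=0.1)
print(w)
```

In this toy case the weak feature is removed entirely and the strong feature's coefficient is shrunk slightly, which is how LASSO performs feature selection and regularization together.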

Approximately 30% of the reviewed studies still used traditional algorithms for comparison rather than advanced ML methods. Direct head-to-head comparisons between algorithms have not yet been published, and more data are needed to identify the optimal ML algorithms for kidney radiomics research.

TL;DR: SVM is the most commonly used classifier in kidney radiomics, effective for clear class separation. RF achieved the best AUC in multiple studies and handles nonlinear data well. CNNs offer end-to-end learning without hand-crafted features. LASSO regression reduces overfitting. About 30% of studies still use traditional methods, and direct algorithm comparisons are lacking.
Pages 17-19
Current Challenges and the Path Toward Clinical Implementation

Standardization gaps: A major limitation is the lack of standardized imaging protocols across studies. Different CT phasing (contrast-enhanced vs. unenhanced), different scanners, and varying MRI sequences all contribute to inconsistency. The classifiers used to analyze gray zone features (SVM, histograms, and others) have not been independently validated. Most studies employed retrospective designs with small sample sizes and required manual delineation of regions of interest (ROIs) or volumes of interest (VOIs), limiting reproducibility and scalability.

Performance variability: AUC values across the reviewed studies ranged widely, from 0.64 to 0.97, reflecting the diversity of AI algorithms, mathematical models, and imaging approaches used. MRI appears to provide less reliable evidence for radiomics compared with CT. The DL methods show strong potential, particularly when combined with genomics data (radiogenomics), with ROI-based data reaching an AUC of 0.97 and rectangular box regions 0.93. However, the lack of both internal and external validation in many studies limits the generalizability of these findings.

Ongoing clinical trials: Seven clinical trials were actively investigating radiomics in kidney lesions at the time of review. These include studies using PET/MRI for molecular subtype characterization of ccRCC (NCT04271254), SPECT/CT for renal mass management (NCT03996850, n=100), multiparametric MRI for diagnosing small renal tumors (NCT03470285, n=500), and CEUS correlation with Fuhrman grade (NCT03821376, n=40).

Path forward: The authors conclude that AI and radiomics show a strong association with improved sensitivity, specificity, and accuracy in detecting and differentiating renal lesions. Standardization of scanner protocols between institutions will be essential. Future clinical integration will require standardized radiomics models and methodology, prospective external validation, and comparison with existing well-validated diagnostic tools. AI-driven automated segmentation of kidneys and tumors using DL algorithms on both ultrasound and CT is already emerging as a practical first step toward reducing workload and improving accuracy.

TL;DR: Key challenges include non-standardized imaging protocols, small retrospective study designs, wide AUC variability (0.64 to 0.97), and insufficient external validation. Seven ongoing clinical trials are advancing the field. Clinical implementation requires standardized models, prospective validation, and head-to-head comparisons with traditional diagnostic tools.
Citation: Ferro M, Crocetto F, Barone B, et al. Open Access, 2023. Available at: PMC10126666. DOI: 10.1177/17562872231164803. License: CC BY-NC.