Many renal masses are discovered incidentally during abdominal CT imaging for unrelated conditions. While the majority of these masses are benign (most commonly simple cysts), approximately 70-80% of solid renal masses turn out to be malignant, typically renal cell carcinoma (RCC). Accurately characterizing these masses is essential for treatment planning, because the management of a benign cyst, a benign solid tumor like angiomyolipoma (AML), and a malignant RCC differs dramatically. Radiologists can generally distinguish cystic from solid masses with high confidence on contrast-enhanced CT, but differentiating benign solid masses from malignant ones remains a persistent challenge.
Why automated approaches matter: The evaluation pipeline for a renal mass begins with localizing the kidneys, then detecting and segmenting the mass, and finally classifying it by type. Each step traditionally requires manual radiologist input. This study, conducted by researchers at Peking University First Hospital in Beijing, aimed to automate this entire pipeline using deep learning. The team combined two well-established neural network architectures: 3D U-Net for volumetric segmentation and ResNet for classification. The goal was a fully automated system that could localize kidneys, delineate renal masses, and classify them as AML, cystic, or solid without human intervention.
Bosniak classification and mass definitions: The study followed the 2019 Bosniak classification system for distinguishing cystic from solid masses. A cystic mass is defined as one where less than approximately 25% of the mass consists of enhancing tissue. A solid mass is one where more than 25% of the mass is composed of tumor tissue showing significant contrast uptake (a change of more than 20 Hounsfield Units). For solid masses, the presence of macroscopic fat without calcification supports a diagnosis of AML. These well-defined criteria provided the framework for the classification labels used in training the algorithm.
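Illustrative sketch (labeling rule): The definitions above reduce to a short decision rule. The code below is a simplified illustration rather than the authors' labeling procedure, since the study's labels were assigned by radiologists. The names `is_enhancing` and `label_mass`, their inputs, and the treatment of a mass at exactly 25% as solid are assumptions made for the example.

```python
def is_enhancing(pre_contrast_hu, post_contrast_hu):
    """Significant contrast uptake: attenuation rises by more than 20 HU."""
    return (post_contrast_hu - pre_contrast_hu) > 20


def label_mass(enhancing_fraction, has_macroscopic_fat=False, has_calcification=False):
    """Toy rule following the definitions in the text.

    enhancing_fraction: share of the mass (0..1) made up of enhancing tissue.
    """
    if not 0.0 <= enhancing_fraction <= 1.0:
        raise ValueError("enhancing_fraction must be between 0 and 1")
    if enhancing_fraction < 0.25:
        return "cystic"
    # Solid mass: macroscopic fat without calcification supports AML.
    if has_macroscopic_fat and not has_calcification:
        return "AML"
    return "solid"
```

For example, `label_mass(0.6, has_macroscopic_fat=True)` returns `"AML"`, while the same mass with calcification falls back to `"solid"`.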
This retrospective study was approved by the institutional review board at Peking University First Hospital. The primary dataset included 490 patients with renal masses who underwent CT scans between August 2009 and August 2021, yielding 610 total image series (since some patients had both corticomedullary phase and nephrographic phase images). The cohort comprised 263 males and 227 females with a mean age of 49.75 years (range: 2-86 years). Inclusion required contrast-enhanced CT images in the corticomedullary phase (CMP, 30-35 seconds post-injection), the nephrographic phase (NP, 60-70 seconds post-injection), or both. Patients with prominent CT artifacts or prior renal biopsy/surgery were excluded.
External validation dataset: An independent external validation set of 81 patients was enrolled between August 2018 and July 2022 (42 males, 39 females; mean age 51.01 years; range 23-77 years). There was no statistically significant difference in gender or age between the two datasets (P>0.05). Within the external validation set, radiologists manually identified 42 AML, 352 cystic masses, and 33 solid masses across the 81 image series.
Mass distribution in the primary dataset: Across the 610 image series, fellowship-trained radiologists manually defined a total of 198 AML, 1,296 cystic masses, and 397 solid masses. The image series were split approximately 8:1:1 into training (487 series), validation (58 series), and test (65 series) sets, with all series from the same patient allocated to the same set to prevent data leakage. The study additionally validated segmentation performance on the publicly available KiTS21 challenge dataset, which comprises 300 labeled CT scans with kidney and tumor annotations.
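Illustrative sketch (patient-level split): Keeping all series from one patient in the same subset means the split has to be drawn over patients, not over series. The code below is a minimal sketch of that idea. The function name, the series and patient IDs, and the seed are hypothetical, and the authors' actual splitting code is not described in the source.

```python
import random
from collections import defaultdict


def split_by_patient(series_to_patient, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign every series of a patient to the same subset."""
    by_patient = defaultdict(list)
    for series_id, patient_id in series_to_patient.items():
        by_patient[patient_id].append(series_id)

    # Shuffle patients (not series) so no patient can straddle two subsets.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_train = round(ratios[0] * len(patients))
    n_val = round(ratios[1] * len(patients))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {name: [s for p in members for s in by_patient[p]]
            for name, members in groups.items()}


# Hypothetical IDs: 20 patients, every fourth one has two series (CMP and NP).
demo = {}
for p in range(20):
    demo[f"p{p}_CMP"] = f"p{p}"
    if p % 4 == 0:
        demo[f"p{p}_NP"] = f"p{p}"
splits = split_by_patient(demo)
```

Because patients contribute different numbers of series, the series-level ratio ends up only approximately 8:1:1, which is consistent with the counts reported above.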
CT protocol details: Imaging was performed on multiple multidetector CT systems. Contrast agent (iopromide 370 mg I/ml or iohexol 320 mg I/ml) was injected at 2 ml/kg with a flow rate of 2.5 ml/s. All patients underwent four-phase CT scanning (plain, corticomedullary, nephrographic, and delayed phases), with tube voltage of 80-120 kV, reconstructed slice thickness of 1-1.5 mm, and reconstruction interval of 1 mm.
Two-stage segmentation approach: Rather than attempting to segment kidneys and masses simultaneously, the authors designed a cascaded (two-stage) pipeline. In the first stage, a 3D U-Net model identifies and segments the bilateral kidneys on contrast-enhanced CT images. In the second stage, the CT images are cropped to the segmented kidney regions, and a separate 3D U-Net is trained specifically for renal mass segmentation within those cropped volumes. This cascade approach was explicitly compared against a one-stage model that attempted simultaneous kidney and mass segmentation using the same architecture and hyperparameters, and the two-stage approach consistently outperformed the single-stage alternative.
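Illustrative sketch (cropping between stages): The hand-off between the two stages amounts to taking the bounding box of the stage-one kidney mask and cropping the CT volume to it. The code below shows that step on a toy volume stored as nested lists. The helper names, the `margin` parameter, and the voxel-set representation are assumptions for illustration; the source does not describe how the authors implemented the crop.

```python
def bounding_box(mask_voxels, shape, margin=0):
    """Axis-aligned box around a set of (z, y, x) voxels, clamped to the volume."""
    if not mask_voxels:
        raise ValueError("empty mask: stage one found no kidney")
    box = []
    for axis in range(3):
        coords = [v[axis] for v in mask_voxels]
        lo = max(min(coords) - margin, 0)
        hi = min(max(coords) + margin + 1, shape[axis])  # exclusive upper bound
        box.append((lo, hi))
    return box


def crop(volume, box):
    """Crop a nested-list volume indexed as volume[z][y][x]."""
    (z0, z1), (y0, y1), (x0, x1) = box
    return [[row[x0:x1] for row in plane[y0:y1]] for plane in volume[z0:z1]]


# Toy 4x4x4 volume whose voxel value encodes its own position.
shape = (4, 4, 4)
volume = [[[100 * z + 10 * y + x for x in range(4)] for y in range(4)] for z in range(4)]
kidney = {(1, 1, 2), (2, 2, 2), (2, 1, 3)}  # stand-in for a stage-one prediction
box = bounding_box(kidney, shape)
patch = crop(volume, box)  # what the stage-two network would receive
```

Restricting the second network to this patch is what shrinks its search space relative to the one-stage model.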
3D U-Net architecture: The 3D U-Net is a fully convolutional neural network designed for volumetric segmentation tasks. It uses an encoder-decoder structure with skip connections. The encoder extracts hierarchical features from the input 3D volume, the decoder reconstructs the segmentation map, and the skip connections preserve spatial information between corresponding encoder and decoder layers to improve segmentation accuracy. Before training, all images were resized from their native 512x512xN resolution to 128x128x128 volumes.
Data augmentation and training parameters: To increase dataset diversity and prevent overfitting, the authors applied several augmentation techniques during training: random rotation between -10 and 10 degrees, random noise injection, and random horizontal and vertical translation between -0.1 and 0.1. The segmentation models were trained with the following hyperparameters: 16 initial filters, batch size of 4, 400 training epochs, and a learning rate of 0.0001. Image window settings were adjusted to 300 HU window width and 30 HU window level prior to training.
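Illustrative sketch (windowing and augmentation sampling): Windowing clips Hounsfield values to a range centered on the window level and rescales them, and augmentation draws random transform parameters for each training sample. The code below is a minimal sketch of both. The function names are hypothetical, the window values in the usage line are illustrative and not taken from the study, and treating the -0.1 to 0.1 translation as a fraction of image size is an assumption.

```python
import random


def apply_window(hu, level, width):
    """Clip a Hounsfield value to [level - width/2, level + width/2], scale to 0..1."""
    if width <= 0:
        raise ValueError("width must be positive")
    lo = level - width / 2
    hi = level + width / 2
    clipped = min(max(hu, lo), hi)
    return (clipped - lo) / (hi - lo)


def sample_augmentation(rng):
    """Draw one set of augmentation parameters in the ranges given in the text."""
    return {
        "rotation_deg": rng.uniform(-10, 10),
        "shift_x": rng.uniform(-0.1, 0.1),  # assumed: fraction of image width
        "shift_y": rng.uniform(-0.1, 0.1),  # assumed: fraction of image height
    }


# Illustrative window only (level 40 HU, width 400 HU), not the study's setting.
mid_gray = apply_window(40, level=40, width=400)
params = sample_augmentation(random.Random(0))
```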
Manual segmentation as ground truth: Two radiologists performed the manual segmentation that served as the reference standard. A fellowship-trained radiologist with 15 years of experience led the process in collaboration with a junior radiologist. They delineated kidney boundaries (including cortex, medulla, and renal sinus, but excluding retroperitoneal fat and hilar structures) and mass contours (including necrotic, cystic change, and hemorrhagic areas within the tumor). Both readers were blinded to clinical and pathological information.
ResNet for mass classification: After segmentation, the classification stage used a 3D ResNet (residual network) to categorize detected masses as AML, cystic, or solid. ResNet addresses the vanishing gradient problem in deep neural networks by learning residual mappings through shortcut connections rather than direct mappings, which lets deeper networks train effectively while keeping the parameter count manageable. The specific architecture was a 10-layer residual network with the original pretrained weights preserved, augmented by a global average pooling layer, a fully connected layer, and a final classification layer.
Classification training details: Before training, original images were cropped using manually labeled mass contours, with non-covered areas automatically removed. The cropped images were then resized to 128x128x128. The same data augmentation techniques used for segmentation (random rotation of -10 to 10 degrees, random noise, random translation of -0.1 to 0.1) were applied. Training parameters included a model depth of 10, a hidden layer size of 128, dropout of 0.1, batch size of 4, 400 epochs, and a learning rate of 0.0001. The network outputs the category with the highest predicted probability.
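Illustrative sketch (choosing the output class): Picking the category with the highest predicted probability is an argmax over the class probabilities. The code below assumes the final layer produces one raw score per class that is turned into probabilities with a softmax; the source does not state this explicitly, and the scores in the usage line are made up.

```python
import math

CLASSES = ("AML", "cystic", "solid")


def softmax(scores):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Return the most probable class name together with all probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs


# Hypothetical raw scores for one cropped mass.
label, probs = predict([0.2, 2.5, 1.0])
```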
Grad-CAM for interpretability: To make the classification decisions more transparent, the authors generated class activation maps using Gradient-weighted Class Activation Mapping (Grad-CAM). This technique produces heatmaps that highlight which regions of the input image contributed most to a particular classification decision. These visualizations provide insight into the reasoning of the network and help clinicians understand why the model assigned a specific label to a given mass, which is critical for building clinical trust in AI-assisted diagnostic tools.
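Illustrative sketch (Grad-CAM combination step): Grad-CAM weights each feature map of a chosen convolutional layer by the spatial average of the class-score gradient for that channel, sums the weighted maps, and keeps only the positive part. The code below shows that final step on tiny 2D maps for brevity, whereas the study worked with 3D volumes. Obtaining the feature maps and gradients from a trained network is left out, and all values are made up.

```python
def grad_cam(feature_maps, gradients):
    """feature_maps, gradients: lists of K 2D maps (lists of rows) of equal size."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel weight = spatial average of the class-score gradient for that channel.
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w * fmap[i][j]
    # ReLU keeps only regions that push the class score up.
    return [[max(v, 0.0) for v in row] for row in cam]


# Two toy channels: the first supports the class, the second argues against it.
maps = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],
         [[-0.5, -0.5], [-0.5, -0.5]]]
heatmap = grad_cam(maps, grads)
```

In the toy example only the top-left location stays lit, because the second channel has a negative weight and is removed by the ReLU.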
Size-stratified analysis: Prior work indicates that renal masses smaller than 10 mm, and in practice those measuring 5 mm or less, are typically too small to be characterized reliably on CT. Accordingly, the authors used a 5 mm average diameter threshold to stratify their classification analysis, reporting separate accuracy figures for masses smaller than 5 mm and for masses 5 mm or larger. This approach acknowledges a real-world clinical limitation and yields more informative performance metrics.
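Illustrative sketch (size-stratified accuracy): Stratifying by size means computing accuracy separately for masses on either side of the 5 mm cutoff. The code below is a minimal sketch. The function name, the tuple layout, and the example cases are hypothetical, and how the authors computed the average diameter is not specified in the source.

```python
def stratified_accuracy(cases, threshold_mm=5.0):
    """cases: iterable of (average_diameter_mm, true_label, predicted_label)."""
    counts = {"small": [0, 0], "large": [0, 0]}  # [correct, total] per stratum
    for diameter, truth, pred in cases:
        key = "small" if diameter < threshold_mm else "large"
        counts[key][0] += truth == pred
        counts[key][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}


# Made-up cases, not study data.
cases = [
    (3.0, "AML", "cystic"),
    (4.2, "cystic", "cystic"),
    (5.0, "cystic", "cystic"),
    (8.0, "AML", "AML"),
    (12.0, "solid", "solid"),
    (30.5, "solid", "cystic"),
]
result = stratified_accuracy(cases)
```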
Kidney segmentation performance: The algorithm achieved near-perfect kidney segmentation, with a mean Dice Similarity Coefficient (DSC) of 0.99 for both left and right kidneys in the test set. On the external KiTS21 dataset, kidney segmentation DSCs were 0.96 (left) and 0.97 (right). In the external validation set, the mean DSC was 0.98 for both sides. These results confirm that the first stage of the pipeline, localizing the kidneys, is highly reliable across different datasets and scanning protocols.
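Illustrative sketch (Dice Similarity Coefficient): The DSC is twice the overlap between the predicted and reference masks divided by the sum of their sizes, so 1.0 means identical masks and 0.0 means no overlap. The code below computes it for masks stored as sets of voxel coordinates. The set representation, the toy masks, and the convention of returning 1.0 for two empty masks are assumptions for the example.

```python
def dice(pred, truth):
    """pred, truth: sets of voxel coordinates labeled as foreground."""
    if not pred and not truth:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


# Toy masks: four voxels each, three of them shared.
truth_mask = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred_mask = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1)}
score = dice(pred_mask, truth_mask)  # 2 * 3 / (4 + 4) = 0.75
```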
Renal mass segmentation performance: Mass segmentation proved more challenging, as expected. In the test set, the mean DSC was 0.75 for left kidney masses and 0.83 for right kidney masses. On the KiTS21 dataset, DSCs dropped to 0.68 (left) and 0.64 (right), likely due to differences in data distribution, scan protocols, and annotation conventions between datasets. In the external validation set, mass DSCs were 0.70 (left) and 0.72 (right). Hausdorff Distances (HD), which measure the worst-case boundary error, were 5.10 mm and 4.26 mm for left and right kidneys in the test set, and 3.75 mm and 4.88 mm in the external validation set.
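Illustrative sketch (Hausdorff Distance): The worst-case boundary error is the largest distance from any point on one contour to its nearest point on the other, taken in both directions. The code below is a brute-force version for small point sets. The function name and the `spacing` parameter (voxel size in mm) are assumptions, and the source does not say whether the authors used the maximum or a percentile variant.

```python
import math


def hausdorff(a, b, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance in mm between two non-empty point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def dist(p, q):
        return math.sqrt(sum(((pi - qi) * s) ** 2 for pi, qi, s in zip(p, q, spacing)))

    def directed(src, dst):
        # Farthest that any source point lies from its nearest destination point.
        return max(min(dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))


# Toy contours: the second one has an outlier 3 voxels away from the first.
contour_a = {(0, 0, 0), (0, 0, 1)}
contour_b = {(0, 0, 0), (0, 3, 0)}
hd = hausdorff(contour_a, contour_b)
```

A single outlying voxel is enough to drive this metric up, which is why it complements the DSC: a segmentation can overlap well on average and still have one badly placed boundary region.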
Detection performance by mass type: The algorithm detected renal masses with precision of 67.77% (left) and 60.58% (right) in the test set, with corresponding recalls of 84.54% and 75.90% and F1-scores of 0.75 and 0.67. In the external validation set, precision was 76.96% (left) and 72.35% (right), recall was 72.54% and 67.21%, and F1-scores were 0.76 and 0.70. Solid renal masses were segmented more accurately than cystic masses and AML, with statistically significant differences (P<0.01 for right kidney, P<0.05 for left kidney in the test set).
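Illustrative sketch (precision, recall, F1): Precision is the share of detections that are real masses, recall is the share of real masses that were detected, and F1 is their harmonic mean. The code below computes all three from detection counts. The counts in the usage lines are hypothetical values chosen so that the percentages match the left-kidney test-set figures quoted above; the source does not report the underlying counts.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical counts that reproduce 67.77% precision and 84.54% recall.
precision, recall, f1 = detection_metrics(tp=82, fp=39, fn=15)

# The reported right-kidney test-set F1 follows from its precision and recall.
right_f1 = f1_from(0.6058, 0.7590)
```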
Two-stage vs. one-stage comparison: The cascaded two-stage model consistently outperformed the one-stage model that attempted simultaneous kidney and mass segmentation. This validates the design decision to first localize the kidneys and then focus the second network on mass segmentation within the cropped kidney region, reducing the search space and improving accuracy.
Test set classification accuracy: The ResNet classification model achieved an overall accuracy of 90.56% on the test set. When stratified by mass size, accuracy was 86.05% for masses smaller than 5 mm and 91.97% for masses 5 mm or larger. This size-dependent performance gap is expected, because very small masses provide less imaging information for the network to work with and are inherently harder to characterize even for experienced radiologists.
Per-class performance for masses 5 mm or larger: For AML, the model achieved 100% precision and 94.12% recall (F1-score: 0.97). For cystic masses, precision was 90.24% and recall was 96.10% (F1-score: 0.93). For solid masses, precision was 94.74% and recall was 83.72% (F1-score: 0.89). The AUC values were 1.00 for AML, 0.98 for cystic masses, and 0.99 for solid masses. These results demonstrate excellent discrimination ability across all three categories when the mass is large enough to be meaningfully characterized.
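Illustrative sketch (per-class metrics from a confusion matrix): In a three-class problem, each class's precision and recall come from treating that class as positive and the other two as negative. The code below is a minimal sketch. The matrix is a made-up example, not the study's confusion matrix, which is not given in the source.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of cases with true class i predicted as class j."""
    out = {}
    for k, name in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(len(labels))) - tp  # column minus diagonal
        fn = sum(confusion[k]) - tp                                 # row minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        out[name] = (precision, recall, f1)
    return out


def accuracy(confusion):
    """Share of all cases that lie on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / sum(map(sum, confusion))


labels = ("AML", "cystic", "solid")
toy = [[8, 2, 0],    # true AML
       [1, 17, 2],   # true cystic
       [0, 3, 7]]    # true solid
metrics = per_class_metrics(toy, labels)
overall = accuracy(toy)
```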
Error patterns: Most solid masses that were misclassified in the test set were larger tumors with features of cystoid degeneration and necrosis, conditions that make them visually resemble cystic masses on CT. Conversely, most misclassified AML and cystic masses were smaller than 5 mm, falling below the threshold for reliable CT characterization. For the smaller mass subgroup (less than 5 mm), AML precision was only 66.67% with recall of 28.57% (F1: 0.40), while cystic mass performance remained relatively strong at 87.50% precision and 97.22% recall (F1: 0.92).
External validation classification: In the external validation set, 300 true positive and 100 false positive domains were identified after segmentation, with 217 true positive connected domains having average diameters larger than 5 mm. The ResNet algorithm achieved 84.33% accuracy on this external set, with particularly high sensitivity for identifying solid masses (recall: 92.86%, precision: 54.17%, F1: 0.68). Cystic mass classification remained strong (precision: 95.21%, recall: 88.54%, F1: 0.92), while AML performance was moderate (precision: 78.26%, recall: 56.25%, F1: 0.65).
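Illustrative sketch (connected domains): A connected domain is a group of predicted foreground voxels that touch each other, and each one is treated as a candidate mass. The code below labels such groups with a breadth-first search. Using 6-connectivity (face neighbors only) and counting a domain as a true positive when it overlaps the reference mask at all are assumptions; the source does not state the authors' connectivity or matching criterion.

```python
from collections import deque

FACE_NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def connected_domains(voxels):
    """Group foreground voxels (z, y, x) into 6-connected components."""
    remaining = set(voxels)
    domains = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in FACE_NEIGHBORS:
                neighbor = (z + dz, y + dy, x + dx)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        domains.append(component)
    return domains


def is_true_positive(domain, truth_voxels):
    """Assumed criterion: any overlap with the reference mask counts as a hit."""
    return not domain.isdisjoint(truth_voxels)


# Toy prediction: one 3-voxel blob plus an isolated voxel far away.
predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 1), (5, 5, 5)}
reference = {(0, 0, 1)}
domains = connected_domains(predicted)
hits = [is_true_positive(d, reference) for d in domains]
```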
Context within existing literature: Several previous studies have tackled renal mass segmentation using deep learning. Chen et al. used 3D segmentation software with interpolation on 27 patients and achieved high concordance. He et al. employed a grayscale adaptive network on 123 patients to simultaneously segment kidneys, tumors, arteries, and veins on CTA images, achieving 86.4% DSC with 29.85 mm HD. Houshyar et al. used a CNN on 319 patients and reported median DSCs of 0.970 (kidney) and 0.816 (tumor). Turk et al. developed a hybrid V-Net model on the KiTS19 dataset, reaching 97.7% DSC for kidney and 86.5% for tumor segmentation, later improving to 86.9% with a double-stage bottleneck block architecture.
What this study adds: Compared to prior work, this study used a substantially larger and more diverse dataset covering most common renal mass types encountered in clinical practice (AML, cystic, and solid), rather than focusing exclusively on tumors. The dataset was collected over 12 years across multiple CT scanner types, enhancing real-world applicability. Although the broader mass type coverage led to slightly lower segmentation performance compared to some narrower studies, the trade-off was a more clinically representative evaluation. Critically, this study went beyond segmentation alone by integrating a classification component, producing a complete end-to-end diagnostic pipeline.
Clinical value of the full pipeline: The ability to detect multiple lesions simultaneously is a significant practical advantage, since patients frequently present with more than one renal mass. By automating the entire workflow from kidney localization through mass detection and classification, the system could serve as a time-efficient screening tool, flagging suspicious masses for radiologist review. The Pearson correlation analysis showed a weak but statistically significant positive correlation (r=0.209, P<0.05) between segmentation quality (DSC of true positive masses larger than 5 mm) and classification accuracy. This suggests that better segmentation contributes to better classification outcomes, although a correlation of this size accounts for only a small share of the variation.
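Illustrative sketch (Pearson correlation): The Pearson coefficient is the covariance of two variables divided by the product of their standard deviations, and its square gives the share of variance one variable explains in the other. The code below is a minimal sketch with made-up inputs. Encoding classification correctness as 0 or 1 per mass is an assumption about how such an analysis could be set up, not a description of the authors' computation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-mass values: DSC and whether the class was predicted correctly.
dsc = [0.40, 0.55, 0.70, 0.80, 0.90, 0.95]
correct = [0, 1, 0, 1, 1, 1]
r = pearson_r(dsc, correct)

# Share of variance explained by the correlation reported in the text.
explained = 0.209 ** 2
```

For the reported r of 0.209, the squared value is about 0.04, so segmentation quality accounts for roughly 4% of the variation in classification correctness.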
Retrospective design: The study was retrospective in nature, which introduces inherent selection biases. Patients were drawn from a single institution (Peking University First Hospital), meaning the dataset reflects the demographics, scanner types, and imaging protocols of one hospital. While external validation on an independent cohort and the KiTS21 dataset partially addresses generalizability concerns, prospective multi-center validation would be needed before clinical deployment.
Missing pathological confirmation: Not all renal masses in the study had histopathological information available, since some patients did not undergo surgery at the study hospital. The classification labels (AML, cystic, solid) were based on imaging features and clinical consensus rather than universal pathological confirmation. The authors appropriately note that imaging-based designations of "cystic" or "solid" should not be equated with pathological diagnoses. This limits the strength of the ground truth for the classification model, particularly for ambiguous cases.
Performance gaps in specific scenarios: The model struggled with several specific clinical scenarios. Cystic masses and AML had lower segmentation DSCs compared to solid masses. Some solid masses in the external validation set had DSCs near zero due to poor contrast with surrounding normal tissue, resulting in near iso-density on nephrographic phase images. Solid masses with cystoid degeneration and necrosis were frequently misclassified as cystic. These failure modes highlight the importance of understanding where the algorithm may underperform.
Incomplete end-to-end evaluation: The segmentation and classification models were evaluated separately rather than as a fully integrated pipeline. The impact of combining both models in sequence, where segmentation errors propagate into classification inputs, was not fully quantified. Future work should evaluate the complete pipeline performance, including how false positive and false negative segmentation results affect downstream classification accuracy. Additionally, the study focused only on three mass categories (AML, cystic, solid) and did not subtype malignant solid masses into specific RCC subtypes, which would add further clinical value.
Future opportunities: The foundation established here could be extended to incorporate malignancy subtyping (clear cell RCC, papillary RCC, chromophobe RCC), integration of multi-phase CT features more explicitly, and prospective clinical trials measuring the impact on radiologist workflow and diagnostic accuracy. The publicly available KiTS challenge datasets offer a natural benchmark for continued model improvement and cross-institutional comparison.