Renal cell carcinoma (RCC) is one of the most common cancers in Taiwan, and its incidence continues to rise. Computed tomography (CT) is the standard tool for diagnosing and staging renal tumors, with prior studies reporting staging accuracy around 91%. However, CT interpretation depends heavily on radiologist experience, and inter-rater and intra-rater variability can significantly affect diagnostic accuracy. Studies have found that 6.4% to 40.4% of renal tumors classified as malignant based on preoperative CT turned out to be benign after surgical resection, leading to unnecessary procedures.
Renal tumor biopsy (RTB) provides a tissue-based alternative, but it carries risks including tumor cell seeding, bleeding, fistula formation, pseudoaneurysm, infection, and pneumothorax. Biopsies are also nondiagnostic in roughly 11-14% of cases, which limits their routine clinical use. Meanwhile, treatment options for kidney cancer have expanded beyond surgery to include active surveillance, targeted therapy, and immunotherapy. Being able to classify tumor subtypes noninvasively from imaging alone would be a major clinical advantage.
Most existing deep learning models for renal tumors are limited to binary classification, such as benign vs. malignant, or clear cell RCC vs. non-clear cell RCC. Many of these studies also used small cohorts of fewer than 200 patients. This study aimed to go beyond binary classification by training convolutional neural network (CNN) models to distinguish among the five most common renal tumor subtypes: angiomyolipoma (AML), oncocytoma, clear cell RCC (ccRCC), chromophobe RCC (chRCC), and papillary RCC (pRCC).
This retrospective study was approved by the Institutional Review Board of Chang Gung Memorial Hospital, Linkou Branch, Taiwan (IRB No. 201901321B0). Between January 2008 and September 2018, the researchers enrolled 691 patients who had been diagnosed with renal tumors and undergone surgical resection. Patients were excluded if they lacked preoperative CT scans or had only non-enhanced CT. Additional exclusion criteria included renal cysts, polycystic kidney disease, maintenance hemodialysis, tumors smaller than 1 cm, and severe imaging artifacts.
The final cohort comprised 554 patients: 328 males (59.2%) and 226 females (40.8%) with a median age of 56 years (IQR: 47-66 years). The subtype distribution was as follows: ccRCC (n = 246, 44.4%), chRCC (n = 124, 22.4%), pRCC (n = 83, 15%), AML (n = 67, 12%), and oncocytoma (n = 34, 6.1%). The median largest tumor diameter was 53.5 mm (IQR: 36-74 mm). This distribution reflects the known epidemiology of renal tumors, where ccRCC is the most common malignant subtype and oncocytoma is comparatively rare (3-7% of solid renal tumors).
Many patients were referred from external hospitals, so contrast-enhanced CT images came from multiple institutions. Standard scanning parameters included 5 mm slice thickness, contrast agent injection rate of 1-2 mL/sec, contrast dose of 1-2 mL/kg, and whole-abdomen coverage with a non-contrast phase followed by an enhanced phase taken 80-120 seconds after injection. For patients with bilateral or multiple tumors, pathology reports were correlated with CT images, and cases with disagreement between pathology and imaging were excluded.
Renal tumor outlines on axial nephrographic-phase CT images were manually segmented by two urologists to define regions of interest (ROIs). The CT images were then converted to PNG format using the default abdominal imaging window of Chang Gung Medical Center, mapping Hounsfield Units (HUs) in the range of -115 to 227 onto 8-bit PNG pixel values (0 to 255). This range was chosen to clearly image abdominal organs, though it meant the model could not learn features from tissue densities outside this window. After conversion, each tumor ROI was cropped to its minimal bounding rectangle.
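The window-to-pixel mapping described above can be sketched as a simple linear rescaling; this is a minimal illustration of the stated -115 to 227 HU window, not the hospital's actual conversion code:

```python
def hu_to_pixel(hu, hu_min=-115.0, hu_max=227.0):
    """Clip a Hounsfield value to the abdominal window and map it to an
    8-bit pixel value (0-255). Densities outside the window saturate,
    which is why the model cannot learn from them."""
    hu = min(max(hu, hu_min), hu_max)
    return round((hu - hu_min) / (hu_max - hu_min) * 255)

# Window edges map to the extremes of the 8-bit range:
# hu_to_pixel(-115) -> 0, hu_to_pixel(227) -> 255
# Values outside the window are clipped:
# hu_to_pixel(-300) -> 0 (e.g., fat below the window floor loses contrast)
```

The clipping step makes the limitation noted in the text concrete: macroscopic fat (around -100 HU and below) and dense calcifications (often above 227 HU) both collapse to the same saturated pixel values.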
The 554 patients were randomly split into a training set (90%, n = 501) and a testing set (10%, n = 53). The test set contained the following distribution: AML (n = 6), oncocytoma (n = 3), ccRCC (n = 24), chRCC (n = 12), and pRCC (n = 8). For 5-fold cross-validation, the training set was further split at an 8:2 ratio into training and validation subsets within each fold.
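A key detail of such a design is that the split happens at the patient level, not the slice level, so slices from one tumor never leak between sets. The source says only "randomly split"; the sketch below additionally stratifies by subtype (an assumption, since the reported test distribution roughly mirrors the cohort):

```python
import random

def stratified_patient_split(labels, test_frac=0.10, seed=0):
    """Split patient indices so each subtype contributes ~test_frac of its
    patients to the test set. `labels` is one subtype label per patient.
    Stratification is an assumption; the study states only a random split."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    rng = random.Random(seed)
    test = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_frac))
        test.extend(idxs[:n_test])
    test_set = set(test)
    train = [i for i in range(len(labels)) if i not in test_set]
    return train, sorted(test)

# Hypothetical cohort mirroring the paper's subtype counts:
cohort = (["ccRCC"] * 246 + ["chRCC"] * 124 + ["pRCC"] * 83
          + ["AML"] * 67 + ["oncocytoma"] * 34)
train_idx, test_idx = stratified_patient_split(cohort)
```

Splitting by patient before extracting slices is what keeps the test set "unmodified" and unbiased, as the augmentation paragraph below emphasizes.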
Data augmentation: To address class imbalance, images from underrepresented groups (AML and oncocytoma) were augmented to approximately 50% of the count in the largest group (ccRCC). Augmentation techniques included horizontal flipping, vertical flipping, and rotation. After augmentation, the training dataset expanded to 4,238 CT images: AML (966 images), oncocytoma (881 images), ccRCC (1,811 images), chRCC (1,087 images), and pRCC (642 images). Importantly, only the training data was augmented; the original test data remained unmodified for unbiased performance evaluation.
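The three augmentation operations named above (horizontal flip, vertical flip, rotation) are standard geometric transforms. A minimal stdlib-only sketch, treating an image as a list of pixel rows (a real pipeline would use a library such as NumPy or TensorFlow):

```python
def hflip(img):
    """Horizontal flip: reverse each row (mirror left-right)."""
    return [row[::-1] for row in img]

def vflip(img):
    """Vertical flip: reverse the row order (mirror top-bottom)."""
    return img[::-1]

def rot90(img):
    """Rotate 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment(img):
    """Original plus three geometric variants; applied to training
    images only, so test-set evaluation stays unbiased."""
    return [img, hflip(img), vflip(img), rot90(img)]
```

These transforms are label-preserving for tumor subtype, which is what makes them safe for rebalancing the minority classes.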
The researchers trained two well-established CNN architectures: Inception V3 (311 layers) and ResNet-50 (175 layers). Both models were developed using Python 3.8.5 and TensorFlow 2.5.0, with initial weights pretrained on ImageNet. The study systematically explored how many layers to "unfreeze" for fine-tuning, rather than using a fixed approach. For Inception V3, the team tested 0 (pure transfer learning), 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 260, and 280 trainable layers. For ResNet-50, they tested 0, 25, 50, 75, 100, 125, and 150 trainable layers.
Both models were trained with a learning rate of 10^-5 for 30 epochs. This systematic layer-by-layer exploration of fine-tuning depth is notable because it demonstrated that pure transfer learning (zero trainable layers) performed poorly on this medical imaging task. Inception V3 with zero trainable layers achieved only 0.689 average accuracy and 0.727 weighted precision. ResNet-50 with zero trainable layers achieved only 0.717 average accuracy and 0.760 weighted precision. Allowing additional layers to be retrained significantly improved performance.
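The "unfreezing" procedure explored above amounts to toggling each layer's trainable flag: the last N layers are retrained while earlier layers keep their ImageNet weights. In Keras one would iterate over `model.layers`; the sketch below uses a stub Layer class so it stays self-contained, and the layer counts come from the study:

```python
class Layer:
    """Stand-in for a Keras layer; only the `trainable` flag matters here."""
    def __init__(self):
        self.trainable = True

def set_trainable_depth(layers, n_trainable):
    """Freeze every layer except the last n_trainable (the fine-tuning depth).
    n_trainable = 0 corresponds to pure transfer learning."""
    cutoff = len(layers) - n_trainable
    for i, layer in enumerate(layers):
        layer.trainable = (i >= cutoff)

model_layers = [Layer() for _ in range(311)]  # Inception V3: 311 layers
set_trainable_depth(model_layers, 220)       # best-performing depth in the study
```

Sweeping `n_trainable` over {0, 20, 40, ..., 280} (Inception V3) or {0, 25, ..., 150} (ResNet-50) reproduces the study's search over fine-tuning depths.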
Patient-level aggregation: Since each patient had multiple 2D CT slices through the tumor, the study employed a pixel-weighted voting scheme to produce a final per-patient classification. Each image's predicted class probabilities were multiplied by the number of pixels in the cropped ROI for that slice. The products were summed across all slices, and the class with the highest cumulative score became the patient-level prediction. This approach gave greater weight to slices with larger tumor cross-sections.
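The pixel-weighted voting scheme described above can be written in a few lines; the probability values in the comment are illustrative, not from the study:

```python
def patient_level_prediction(slice_probs, roi_pixel_counts):
    """Aggregate per-slice softmax outputs into one patient-level class.
    Each slice's class probabilities are weighted by its ROI pixel count,
    so slices with larger tumor cross-sections count for more."""
    n_classes = len(slice_probs[0])
    scores = [0.0] * n_classes
    for probs, pixels in zip(slice_probs, roi_pixel_counts):
        for c, p in enumerate(probs):
            scores[c] += p * pixels
    return max(range(n_classes), key=scores.__getitem__)

# Illustrative: a small slice (100 px) favoring class 0 is outvoted by a
# large slice (1000 px) favoring class 1.
# patient_level_prediction([[0.9, 0.1], [0.3, 0.7]], [100, 1000]) -> 1
```

The design choice is sensible for 2D-slice models: slices near the tumor poles show little tissue and tend to be noisier, so down-weighting them stabilizes the final call.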
The Inception V3 model achieved its best performance when 220 of its 311 layers were set as trainable. At this configuration, the model reached a peak accuracy of 0.830, peak weighted precision (WP) of 0.885, peak macro F1-score of 0.786, and peak weighted F1-score of 0.833. These represent single-fold peak values; the 5-fold cross-validation averages were slightly lower but consistent.
5-fold cross-validation averages at 220 trainable layers: accuracy of 0.804 +/- 0.019, weighted precision of 0.847 +/- 0.021, macro F1-score of 0.757 +/- 0.028, and weighted F1-score of 0.813 +/- 0.018. The relatively tight standard deviations across folds suggest stable model performance rather than overfitting to a particular data split.
The gap between the macro F1-score (0.757) and the weighted F1-score (0.813) is worth noting. Macro F1 gives equal weight to every class regardless of sample size, while weighted F1 accounts for class frequency. The lower macro F1 indicates that the model performed less well on the minority classes (oncocytoma and AML) compared to the more common subtypes, reflecting the persistent challenge of class imbalance even after augmentation.
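The macro/weighted distinction reduces to two different averages over the same per-class F1 scores; the per-class F1 values below are hypothetical, but the supports match the study's test set (ccRCC 24, chRCC 12, pRCC 8, AML 6, oncocytoma 3):

```python
def macro_f1(per_class_f1):
    """Unweighted mean: every class counts equally, so weak minority
    classes drag the score down."""
    return sum(per_class_f1) / len(per_class_f1)

def weighted_f1(per_class_f1, supports):
    """Mean weighted by class frequency (support): dominated by the
    common classes."""
    return sum(f * s for f, s in zip(per_class_f1, supports)) / sum(supports)

# Hypothetical per-class F1s, ordered ccRCC, chRCC, pRCC, AML, oncocytoma:
f1s = [0.9, 0.8, 0.8, 0.6, 0.4]
supports = [24, 12, 8, 6, 3]
# macro_f1(f1s) -> 0.70; weighted_f1(f1s, supports) -> 0.80
```

With strong scores on the frequent subtypes and weak scores on AML and oncocytoma, weighted F1 sits well above macro F1, which is exactly the pattern the study reports.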
ResNet-50 achieved its highest accuracy of 0.849 using just 50 trainable layers out of 175, outperforming Inception V3's peak accuracy of 0.830. The 5-fold cross-validation average accuracy was 0.811 +/- 0.027 (50 trainable layers). The highest weighted precision (0.887) came at 150 trainable layers, with an average of 0.865 +/- 0.015. The highest macro F1-score (0.813) was achieved using 75 trainable layers, with an average of 0.753 +/- 0.040. The highest weighted F1-score (0.852) was achieved with 50 trainable layers, averaging 0.838 +/- 0.027.
Interestingly, ResNet-50 needed far fewer trainable layers to reach peak accuracy compared to Inception V3 (50 vs. 220), suggesting that its residual connections allowed more efficient adaptation to this medical imaging domain. The overall finding is that ResNet-50 was slightly more accurate (0.849 vs. 0.830 peak, 0.811 vs. 0.804 average) and also achieved higher weighted F1-scores (0.838 vs. 0.813).
Comparison with prior studies: Most earlier deep learning work on renal tumors was limited to binary classification. Lee et al. achieved 76.6% accuracy distinguishing AML from ccRCC. Baghdadi et al. reached 95% accuracy differentiating oncocytoma from chRCC. Zhou et al. achieved 97% accuracy for benign vs. malignant classification. While these binary results are higher in absolute terms, they solve a much simpler clinical problem. In multi-class renal subtype discrimination, Uhlig et al. used radiomic features with extreme gradient boosting (XGBoost) and achieved an AUC of only 0.72 across the same five subtypes. The current study's deep learning approach substantially outperformed this radiomic/machine learning baseline.
Single-center cohort: All patients came from a single tertiary center (Chang Gung Memorial Hospital), even though some were referrals from other hospitals. This limits the generalizability of results to broader populations, different scanner types, and varying imaging protocols across institutions. External validation on independent multi-center datasets was not performed.
Hounsfield Unit windowing: The study used an HU range of -115 to 227 for image preprocessing, which is the standard abdominal window at their institution. However, this means the models could not learn features from tissue densities outside this range. Some renal tumor characteristics, such as fat content in AML or calcifications, may produce HU values outside this window, potentially limiting classification performance for certain subtypes.
Persistent class imbalance: Despite augmentation, oncocytoma (n = 34 patients, only 3 in the test set) and AML (n = 67, only 6 in the test set) remained underrepresented. The lower macro F1-scores compared to weighted F1-scores across both models confirm that the minority classes were harder to classify accurately. With only 3 oncocytoma patients in the test set, per-class performance estimates for this subtype have very wide confidence intervals.
Manual segmentation: Tumor ROIs were drawn manually by two urologists, which is not scalable for clinical deployment. Automatic segmentation would be needed for real-world use. Additionally, the study included only five renal tumor subtypes, and rarer histologic variants were not represented.
Multi-center validation: The most critical next step is validating these models on datasets from multiple medical centers with different CT scanners, protocols, and patient demographics. Without this, the clinical utility of an 80-85% accuracy 5-class model remains theoretical. Multi-site data would also increase sample sizes for the underrepresented subtypes, particularly oncocytoma.
Automated segmentation: Replacing manual tumor delineation with an automated segmentation model (such as a U-Net or similar architecture) would be essential for practical deployment. The current workflow requires urologists to manually outline each tumor, which is time-consuming and introduces its own variability. An end-to-end pipeline that takes a CT scan and outputs subtype classification without manual input would be far more useful in clinical practice.
Expanded subtype coverage and advanced architectures: Future work could incorporate additional rare renal tumor subtypes and explore more modern architectures beyond Inception V3 and ResNet-50, such as EfficientNet, Vision Transformers (ViT), or 3D CNNs that use volumetric information rather than individual 2D slices. Incorporating multi-phase CT data (corticomedullary, nephrographic, and excretory phases) as separate input channels could also improve discriminative power, particularly for subtypes that show distinct enhancement patterns across phases.
The study's finding that transfer learning alone was insufficient, and that substantial fine-tuning was needed, is an important insight for the field. It suggests that medical imaging features for renal tumor classification diverge significantly from the natural image features learned by ImageNet-pretrained models, and future studies should budget for extensive fine-tuning or consider pretraining on large medical imaging datasets instead.