AI-Driven Robust Kidney and Renal Mass Segmentation and Classification on 3D CT Images


Plain-English Explanations
Pages 1-2
Why Automated Kidney and Renal Mass Diagnosis on CT Matters

Kidney cancer is among the ten most common cancers worldwide, and renal cell carcinoma (RCC) is the most common renal malignancy. Abdominal CT imaging is the standard tool for detecting and diagnosing renal masses, but manual segmentation and quantification of organs and tumors in clinical practice are expensive and time-consuming. The subtypes considered here are clear cell, chromophobe, oncocytoma, papillary, and a catch-all "other" category. Classifying histologic subtypes is clinically crucial because it can help avoid unnecessary biopsies or surgeries, and it directly informs treatment decisions.

The AI gap in renal mass diagnosis: While AI algorithms have shown the ability to differentiate benign from malignant renal masses on CT, their performance still falls short when distinguishing between different kidney cancer subtypes or making detailed tumor assessments. Additionally, most existing studies are limited in data volume and lack robustness for images collected from various institutions using different scanners and acquisition protocols. This creates a significant generalizability problem for any model trained on a single dataset.

The proposed framework: This paper from The City College of New York and Memorial Sloan Kettering Cancer Center introduces a novel end-to-end AI-driven diagnosis framework with two main components: a 3D segmentation network (3D Res-UNet) for automatic kidney and renal mass segmentation, and a dual-path classification network that predicts five histological subtypes of RCC. The authors also introduce a weakly supervised learning schema that uses annotations from only three CT slices per volume to improve cross-institutional robustness, bridging the domain gap between datasets collected from different vendors.

TL;DR: This study proposes an end-to-end framework combining 3D Res-UNet segmentation with dual-path subtype classification across five RCC subtypes (clear cell, chromophobe, oncocytoma, papillary, and other). A weakly supervised method requiring only three annotated slices per volume improves cross-institutional robustness.
Pages 3-4
3D Res-UNet: The Segmentation Backbone

The segmentation component of the framework is built on a 3D U-Net encoder-decoder architecture, a widely used design in medical image segmentation. The encoder consists of five convolutional neural network blocks that progressively aggregate semantic features while losing spatial resolution. To recover spatial information needed for precise segmentation, skip connections pass high-resolution features from the encoder to the corresponding decoder layers. The key modification over a standard 3D U-Net is the addition of residual blocks to all convolutional layers, which helps overcome the vanishing and exploding gradient problems common in deep networks.

Network dimensions and training details: The first convolution block starts with 32 channels, doubling at each subsequent block, and the feature map is downsampled until it reaches 4 x 4 x 4 voxels. The final decoder layer produces kidney and renal mass prediction masks via a convolution layer followed by SoftMax activation. The model uses leaky ReLU activations and instance normalization, with a combined loss function of dice loss and cross-entropy loss. Training ran for 300 epochs using the Adam optimizer (beta = 0.99) with an initial learning rate of 5 x 10^-3 selected via grid search from five candidates. The patch size was 128 x 128 x 128 with a batch size of 2 on a single NVIDIA GTX 1080Ti GPU.
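The combined loss described above can be sketched in a few lines of NumPy. This is a minimal illustration for a single foreground channel rather than the full multi-class output, and it assumes equal weighting of the two terms (the paper does not state the weighting); the function names are illustrative, not the authors' code:

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss between predicted foreground probabilities and a binary target mask."""
    intersection = float(np.sum(probs * target))
    denom = float(np.sum(probs) + np.sum(target))
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

def cross_entropy_loss(probs, target, eps=1e-12):
    """Voxel-wise binary cross-entropy, averaged over the volume."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs)))

def combined_loss(probs, target):
    """Sum of dice and cross-entropy losses (equal weighting assumed here)."""
    return soft_dice_loss(probs, target) + cross_entropy_loss(probs, target)
```

A confident, correct prediction drives both terms toward zero; the dice term rewards overlap with the mask while the cross-entropy term penalizes every miscalibrated voxel.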

Post-processing for false positive reduction: Connected component-based post-processing removes small, discontinuous predicted regions while keeping the top two largest components. During training, if removing a region improves the dice coefficient, that region is flagged as a false positive. This step is particularly beneficial for the combined kidney and renal mass segmentation, where spurious predictions in adjacent organs are common.
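The component-filtering step can be sketched in plain NumPy with a breadth-first flood fill. The 6-connectivity here is an assumption (the paper does not specify it), and `keep_largest_components` is an illustrative name, not the authors' code:

```python
import numpy as np
from collections import deque

def keep_largest_components(mask, k=2):
    """Label the 6-connected components of a 3D binary mask and keep only the
    k largest, discarding small spurious regions (the paper keeps the top two)."""
    mask = np.asarray(mask, dtype=bool)
    labels = np.zeros(mask.shape, dtype=np.int32)
    sizes = {}
    neighbors = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    current = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue  # voxel already assigned to a component
        current += 1
        labels[start] = current
        queue, size = deque([start]), 0
        while queue:  # breadth-first traversal of one component
            z, y, x = queue.popleft()
            size += 1
            for dz, dy, dx in neighbors:
                n = (z + dz, y + dy, x + dx)
                if (0 <= n[0] < mask.shape[0] and 0 <= n[1] < mask.shape[1]
                        and 0 <= n[2] < mask.shape[2] and mask[n] and not labels[n]):
                    labels[n] = current
                    queue.append(n)
        sizes[current] = size
    keep = sorted(sizes, key=sizes.get, reverse=True)[:k]
    return np.isin(labels, keep)
```

In practice a library routine such as `scipy.ndimage.label` would replace the hand-rolled flood fill; the logic of ranking components by voxel count and keeping the top two is the same.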

Pre-processing pipeline: CT images were converted from NIfTI to DICOM format and resampled to a standard spacing of 3.22 x 1.62 x 1.62 mm, resulting in 128 x 248 x 248 voxel volumes. The Hounsfield unit window width and level were set to 400 and 30 respectively to optimize kidney area visualization, with pixel values normalized to 0-1. Data augmentation included random rotations (-45 to 55 degrees, probability 0.2), random scaling (factor 0.9-1.1, probability 0.2), random elastic deformations (probability 0.05), gamma correction (gamma range 0.7-1.5, probability 0.15), and mirroring (probability 0.5).
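The windowing step can be sketched as follows: a width of 400 and level of 30 give a clipping range of [-170, 230] HU, which is then scaled to [0, 1]. The function name is illustrative:

```python
import numpy as np

def window_and_normalize(hu, level=30.0, width=400.0):
    """Clip CT Hounsfield units to the [level - width/2, level + width/2] window
    (here [-170, 230] HU, the kidney window from the paper) and scale to [0, 1]."""
    lo, hi = level - width / 2.0, level + width / 2.0
    clipped = np.clip(hu, lo, hi)
    return (clipped - lo) / (hi - lo)
```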

TL;DR: The segmentation network is a 3D Res-UNet with five encoder blocks (32 to 512 channels), residual connections, combined dice and cross-entropy loss, trained for 300 epochs. Connected component post-processing removes false positives by keeping only the two largest predicted components. CT volumes were resampled to 128 x 248 x 248 voxels with extensive data augmentation.
Pages 4-5
Dual-Path Classification for RCC Subtype Prediction

Once the segmentation network identifies kidney and renal mass regions, the classification component takes over to predict the histological subtype. The key design choice here is a dual-path learning schema that simultaneously processes two different views of the detected mass. The global path receives the entire cropped region containing both the kidney and the renal mass, capturing contextual information about where the mass sits relative to surrounding structures. The local path receives only the cropped renal mass region itself, focusing on fine-grained texture and morphological details within the tumor.
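The two crops can be sketched as bounding-box extractions driven by the predicted masks; `bounding_box_crop` is an illustrative helper, not the authors' code:

```python
import numpy as np

def bounding_box_crop(image, mask, margin=0):
    """Crop `image` to the axis-aligned bounding box of `mask`'s nonzero region,
    padded by `margin` voxels per side (clamped to the volume boundary)."""
    coords = np.nonzero(mask)
    slices = tuple(
        slice(max(int(c.min()) - margin, 0), min(int(c.max()) + 1 + margin, dim))
        for c, dim in zip(coords, image.shape)
    )
    return image[slices]
```

Under this sketch, the global path would receive `bounding_box_crop(ct, kidney_mask | mass_mask)` while the local path would receive `bounding_box_crop(ct, mass_mask)`, matching the global-context / local-detail split described above.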

Network architecture: Both paths use a ResNet-50 backbone consisting of five convolution layers, each with skip connections, ReLU activation, and a stride of 2. The five layers contain 64, 128, 256, 256, and 512 channels of 3x3 filters. The output of the fifth layer is flattened and transformed into a 64-dimensional feature vector through a fully connected layer. The global and local feature vectors are concatenated and fed into a six-class SoftMax classifier for subtype prediction (clear cell, chromophobe, oncocytoma, papillary, other, and background). A dropout rate of 50% is applied in the fully connected layer to prevent overfitting. The fourth and fifth layers use dense convolutional connections, where each layer receives feature maps from all preceding layers.

Training configuration: The cropped images were resized to 64 x 64 pixels as input to the classification network. The model was trained for 100 epochs using the Adam optimizer (beta = 0.99) with weighted cross-entropy loss to handle class imbalance. The batch size was 32 with an initial learning rate of 0.001, reduced by a factor of 0.1 every 25 epochs. Pathological labels of renal subtypes from clinical records were used as ground truth for training.
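The learning-rate schedule is fully specified above and can be written as a one-liner. The class weighting for the weighted cross-entropy is not spelled out in the paper, so the second helper shows one common choice (inverse-frequency weights) purely as an assumption:

```python
import numpy as np

def step_decay_lr(epoch, base_lr=1e-3, factor=0.1, step=25):
    """Learning-rate schedule from the paper: start at 0.001 and
    multiply by 0.1 every 25 epochs."""
    return base_lr * factor ** (epoch // step)

def inverse_frequency_weights(labels, num_classes):
    """One plausible class-weighting for the weighted cross-entropy loss:
    weights inversely proportional to class frequency, normalized to mean 1.
    The paper does not specify its exact weighting scheme."""
    counts = np.bincount(np.asarray(labels), minlength=num_classes).astype(float)
    weights = counts.sum() / np.maximum(counts, 1.0)
    return weights / weights.mean()
```

With the KiTs19 skew toward clear cell RCC, an inverse-frequency scheme would up-weight rare subtypes such as oncocytoma during training.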

TL;DR: A dual-path ResNet-50 classification network processes both global context (kidney + mass) and local detail (mass only) to predict five RCC subtypes. Features from both paths are concatenated into a 128-dimensional vector and classified with SoftMax. The model was trained for 100 epochs with weighted cross-entropy loss, 50% dropout, and a learning rate decaying from 0.001 by 0.1 every 25 epochs.
Page 3
Weakly Supervised Learning for Cross-Institutional Robustness

A core challenge in deploying AI models across different hospitals is the domain gap: CT images acquired on different scanners with different protocols look subtly different, and a model trained on one dataset may underperform on another. Full annotation of new datasets is impractical in clinical settings because pixel-level labeling of 3D CT volumes is extremely labor-intensive. The authors address this with a weakly supervised method that requires annotations from only three slices per CT volume to fine-tune the pre-trained segmentation network.

Slice selection strategy: The three annotated slices were chosen based on empirical experiments: (1) the first slice containing the renal mass, (2) the slice with the largest diameter of the renal mass, and (3) the last slice of the renal mass. These three slices capture the beginning, maximum extent, and end of the tumor, providing sufficient spatial context for the network to infer the complete 3D segmentation from minimal labeled data. This dramatically reduces the annotation burden compared to full voxel-level labeling of every slice.
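A sketch of the slice-selection rule, assuming axial slices along axis 0 and approximating "largest diameter" by largest in-plane area (the paper selects by diameter); the function name is illustrative:

```python
import numpy as np

def select_annotation_slices(mass_mask):
    """Pick the three axial slices to annotate: the first slice containing the
    renal mass, the slice of maximum extent (approximated here by in-plane
    area), and the last slice containing the mass."""
    areas = mass_mask.reshape(mass_mask.shape[0], -1).sum(axis=1)
    present = np.nonzero(areas)[0]
    if present.size == 0:
        raise ValueError("mask contains no renal mass")
    first, last = int(present[0]), int(present[-1])
    largest = int(np.argmax(areas))
    return first, largest, last
```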

Validation approach: To assess quantitative performance in this semi-automated setting, the authors cross-validated all annotation slices across all training samples using five-fold cross-validation, testing both with and without batch normalization. The intersection over union (IoU) metric was used to compare predicted 3D volumes against ground truth. The weakly supervised schema was specifically designed to bridge the domain gap between datasets from various vendors, making the framework applicable to images from institutions not represented in the training set.

TL;DR: The weakly supervised method requires only three annotated slices per CT volume (first slice, largest-diameter slice, and last slice of the renal mass) to fine-tune the model for new institutions. Five-fold cross-validation with IoU metrics confirmed that this approach effectively bridges domain gaps between different scanner vendors and acquisition protocols.
Page 6
Training and Testing Datasets: KiTs19 and Three NIH Collections

The primary training dataset was KiTs19, a publicly available collection from the 2019 MICCAI Renal Tumor Segmentation Challenge. It contains 210 arterial-phase CT volumes from 210 patients who underwent nephrectomy for a renal mass. The subtype distribution was heavily skewed toward clear cell RCC: 203 cases (67.7%) of clear cell, 28 cases (9.3%) of papillary, 27 cases (9.0%) of chromophobe, 16 cases (5.3%) of oncocytoma, and 26 cases (8.7%) of other subtypes. This imbalance reflects the real-world prevalence of clear cell RCC and was addressed in the classification network through weighted cross-entropy loss.

External validation datasets: Three additional publicly available NIH datasets were used for cross-institutional testing: TCGA-KICH (15 cases of chromophobe subtype), TCGA-KIRP (20 cases of papillary subtype), and TCGA-KIRC (40 cases of clear cell subtype). These datasets were collected from different institutions using different scanners and protocols. Importantly, the NIH datasets contain cases with various tumor shapes and tumor sizes that are on average 23% larger than those in the KiTs19 training set, making them a rigorous test of generalization.

Evaluation metrics: Segmentation performance was measured using the Dice similarity coefficient (spatial overlap between predicted and ground-truth volumes, ranging from 0 to 1), pixel accuracy, sensitivity, and specificity. Classification performance was evaluated using the area under the ROC curve (AUC) via one-vs-rest multiclass classification with 95% confidence intervals. Five-fold cross-validation was applied throughout.
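For reference, the two overlap metrics used above follow directly from their standard definitions on binary masks:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-6):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|), in [0, 1]."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def iou(pred, target, eps=1e-6):
    """Intersection over union: |A ∩ B| / |A ∪ B|, used for the
    weakly supervised validation."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (intersection + eps) / (union + eps)
```

The two are monotonically related (Dice = 2 IoU / (1 + IoU)), so they rank methods identically, but Dice gives numerically higher values for partial overlap.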

TL;DR: Training used 210 KiTs19 CT volumes (67.7% clear cell, 9.3% papillary, 9.0% chromophobe, 5.3% oncocytoma, 8.7% other). External testing used three NIH datasets totaling 75 cases (TCGA-KICH: 15, TCGA-KIRP: 20, TCGA-KIRC: 40) with tumors averaging 23% larger than training data.
Pages 7-8
Segmentation and Classification Performance on KiTs19

On the KiTs19 test set, the framework achieved state-of-the-art segmentation results. The average Dice coefficient was 96.7% for kidney regions and 86.6% for renal mass regions. Pixel accuracy reached 95.9% (edging out the 95.7% of the next best method, Ruan et al.), with sensitivity of 91.2% and specificity of 88.1%. These results outperformed all comparison methods, including approaches by Xia et al. (Dice 79.6%), Yin et al. (Dice 83.8%), Yu et al. (Dice 80.4%), and Ruan et al. (Dice 85.9%). The improvements are attributed to the combination of residual blocks, extensive hyperparameter tuning, and the pre-processing and post-processing pipeline.

Ablation study results: The ablation study systematically evaluated each component's contribution. A baseline 3D U-Net achieved 80.9% Dice and 86.2% pixel accuracy. Adding residual connections improved these to 82.4% and 89.1%. Pre-processing alone brought the Dice to 83.1% and pixel accuracy to 91.7%. The combination of all components (residual blocks, pre-processing, and post-processing) yielded the final 86.6% Dice and 95.9% pixel accuracy, a total improvement of 5.7 percentage points in Dice and 9.7 percentage points in pixel accuracy over the baseline.

Subtype classification: The dual-path classification network achieved an average AUC of 85% across all five subtypes on the KiTs19 dataset. The breakdown by subtype was: clear cell RCC at 87.1% AUC (84.3% specificity, 84.1% sensitivity), papillary at 86.2% AUC (83.7% specificity, 82.4% sensitivity), chromophobe at 85.9% AUC (82.1% specificity, 83.7% sensitivity), oncocytoma at 79.9% AUC (77.7% specificity, 76.9% sensitivity), and other subtypes at 85.8% AUC (81.6% specificity, 80.4% sensitivity). The lower performance on oncocytoma is expected, as this subtype is notoriously difficult to differentiate from RCC due to similarities in morphologic, histologic, and imaging characteristics.

TL;DR: Segmentation achieved 96.7% kidney Dice and 86.6% renal mass Dice on KiTs19, outperforming all prior methods. Classification achieved an average AUC of 85% across five RCC subtypes, with clear cell highest at 87.1% and oncocytoma lowest at 79.9%. The ablation study showed that residual blocks, pre-processing, and post-processing together improved Dice by 5.7 percentage points over baseline 3D U-Net.
Page 8
Weakly Supervised Results Across Three NIH Datasets

The cross-institutional validation tested the framework on three NIH datasets that differ significantly from the KiTs19 training data in scanner manufacturer, acquisition protocol, and tumor characteristics (on average 23% larger tumors). As expected, the deep learning model showed decreased performance on out-of-distribution data, particularly at the edges of very large tumors rarely seen in training. However, the weakly supervised fine-tuning with just three annotated slices per volume substantially recovered performance.

TCGA-KICH (chromophobe, 15 cases): With weakly supervised fine-tuning, the kidney segmentation Dice reached 91.2% and the renal mass Dice reached 83.4%, representing improvements of 5.3% and 4.2% respectively over the model without fine-tuning. Subtype classification accuracy was 84.5%, a 2.0% improvement with the weakly supervised method.

TCGA-KIRP (papillary, 20 cases): After weakly supervised fine-tuning, the kidney Dice was 92.8% and renal mass Dice was 81.8%, with improvements of 3.3% and 5.4% respectively. The subtype prediction accuracy reached 79.5%, a 2.0% improvement over the non-fine-tuned baseline.

TCGA-KIRC (clear cell, 40 cases): This was the most challenging external dataset. With weakly supervised fine-tuning, kidney Dice was 93.2% and renal mass Dice was 80.1%, improvements of 2.5% and 1.8% respectively. Subtype classification reached 86.1%, a 2.5% improvement. Despite the substantial domain shift and larger tumor sizes, the results remained clinically meaningful.

TL;DR: Weakly supervised fine-tuning with only three annotated slices per volume improved cross-institutional performance by 2.5-5.4 percentage points in segmentation Dice and 2.0-2.5 percentage points in classification accuracy. Best results: TCGA-KIRC with 93.2% kidney Dice and 86.1% subtype accuracy. Worst renal mass Dice: TCGA-KIRC at 80.1%, reflecting larger and more variable tumors.
Pages 9-10
Constraints of the Current Framework and Paths Forward

Class imbalance: The KiTs19 dataset is heavily skewed toward clear cell RCC (67.7% of cases), with oncocytoma representing only 5.3%. This imbalance likely contributes to the notably lower AUC for oncocytoma (79.9%) compared to clear cell (87.1%). While weighted cross-entropy loss was used to partially address this, the limited number of minority-class training examples constrains the model's ability to learn discriminative features for rarer subtypes.

Single-phase imaging: The framework operates on single-phase arterial CT images, whereas clinical practice often uses multi-phase contrast-enhanced CT (corticomedullary, nephrographic, and excretory phases) for renal mass characterization. Multi-phase imaging provides additional contrast patterns that help differentiate subtypes, particularly between oncocytoma and chromophobe RCC. The single-phase limitation means the model cannot leverage these differential enhancement patterns.

External validation scale: While the cross-institutional testing on three NIH datasets (totaling 75 cases) demonstrated robustness, these datasets are relatively small. The TCGA-KICH dataset contained only 15 cases, which limits the statistical power of the chromophobe-specific results. Larger, multi-center prospective validation studies would be needed before clinical deployment. The weakly supervised approach also requires at least three annotated slices per volume for fine-tuning, which, while minimal, still requires some expert input at each new institution.

Future directions: The dual-path classification design could be extended to incorporate multi-phase CT data, potentially improving differentiation of morphologically similar subtypes like oncocytoma and chromophobe. Integration of additional clinical metadata (patient demographics, lab values, tumor growth rates) could further enhance classification accuracy. The weakly supervised schema could be explored with even fewer annotations, such as single-slice or image-level labels, to further reduce the adaptation burden for new institutions. The framework could also be expanded to predict treatment-relevant features such as tumor grade, staging, and surgical margin risk.

TL;DR: Key limitations include class imbalance in training data (67.7% clear cell vs. 5.3% oncocytoma), reliance on single-phase arterial CT rather than multi-phase imaging, and small external validation sets (15-40 cases per NIH dataset). Future work could incorporate multi-phase CT, expand to larger multi-center cohorts, and reduce the weakly supervised annotation requirement below three slices per volume.
Citation: Liu J, Yildirim O, Akin O, Tian Y. Bioengineering, 2023. Open Access (CC BY license). Available at: PMC9854669. DOI: 10.3390/bioengineering10010116.