Semantic segmentation of medical images, the process of labeling every voxel in a scan as belonging to a specific structure, has become essential for managing renal cell carcinoma (RCC). RCC is the eighth most common malignancy in the United States, and small renal masses are frequently detected incidentally on cross-sectional CT imaging. Accurate automated segmentations of kidneys and tumors can support pre-operative decision-making, 3D surgical simulation, patient education, and even intra-operative image overlays.
The nnU-Net standard: The no-new-U-Net (nnU-Net) framework, published by Isensee et al., has emerged as the state-of-the-art approach for medical image segmentation. It extends the classic U-Net architecture by automating best practices for pre-processing, model selection, hyperparameter tuning, and post-processing. nnU-Net won the 2018 Medical Segmentation Decathlon, and all top submissions in the 2021 Kidney and Kidney Tumor Segmentation Challenge (KiTS21) were nnU-Net variants.
The dataset size question: While nnU-Net is well established as a segmentation model, a critical unanswered question remains: how many annotated training images does a researcher actually need before adding more data stops improving performance? Curating labeled medical image datasets is one of the most time-consuming and expensive steps in model development, and revisiting this step after modeling has begun is especially costly. This study proposes an exponential-plateau model to predict the exact dataset size at which Dice score performance plateaus, tested across CT and MR imaging modalities, kidney and tumor targets, and 2D versus 3D architectures.
This retrospective study, conducted at Mayo Clinic under IRB approval with HIPAA compliance, drew on two internal datasets and one public dataset. The CT dataset comprised 1,233 abdomen/pelvis CT scans, spanning non-contrast and multiple contrast phases, from patients who underwent radical nephrectomy for renal tumors between 2000 and 2017. After 356 images were excluded (for shifted voxel intensities, non-axial orientation, or missing segmentations), 877 images remained. CT images were cropped around both kidneys and resampled to a standard in-plane matrix of 256 x 128 pixels.
MR dataset: A total of 501 patients who underwent partial (n = 313) or radical (n = 188) nephrectomy were identified from the radiology database, and only T2-weighted fat-saturated coronal abdominal/pelvic MR images were selected (n = 419). After excluding patients whose small lesions were not visible on the single coronal series (n = 28) and patients with total kidney volume greater than 600 mL due to polycystic kidney disease (n = 7), 384 images remained. A U-Net algorithm produced initial kidney segmentations, which two urologic oncology fellows manually refined; the fellows also manually annotated the tumors.
Dataset stratification: For both CT and MR, training-validation sets of 50, 100, 150, 200, 250, and 300 images were assembled using an 80-20 training-validation split. Fifty random images per modality were held out as a fixed test set. All nnU-Net models used fivefold cross-validation with the 3d_fullres configuration. The final ensemble model used majority voting across all five folds, where the most common voxel prediction (background, kidney, or tumor) was selected as the final label.
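The fivefold majority-vote ensembling described above can be sketched as follows; `ensemble_majority_vote` is a hypothetical helper written for illustration, not part of nnU-Net's API:

```python
import numpy as np

def ensemble_majority_vote(fold_predictions):
    """Combine per-fold label maps (0=background, 1=kidney, 2=tumor)
    by selecting the most common label at each voxel."""
    stacked = np.stack(fold_predictions, axis=0)  # (n_folds, *volume_shape)
    n_classes = int(stacked.max()) + 1
    # Count votes per class at every voxel, then take the winning class.
    votes = np.stack([(stacked == c).sum(axis=0) for c in range(n_classes)], axis=0)
    return votes.argmax(axis=0)

# Toy example: five folds voting on a 4-voxel volume.
folds = [np.array([0, 1, 2, 2]),
         np.array([0, 1, 1, 2]),
         np.array([0, 1, 2, 2]),
         np.array([1, 1, 2, 0]),
         np.array([0, 2, 2, 2])]
print(ensemble_majority_vote(folds))  # [0 1 2 2]
```

Note that `argmax` breaks ties in favor of the lowest class index (background first); a real pipeline would need an explicit tie-breaking policy.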
KiTS21 external dataset: The publicly available KiTS21 challenge dataset, consisting of 300 corticomedullary contrast-phase CT images, was used as a third experiment. Here, 60 images were held out for testing and training-validation sets of 80, 120, 160, 200, and 240 were constructed. Both 2D and 3D nnU-Net configurations were tested on this data to examine the effect of model architecture on data requirements.
The core analytical tool in this study is the exponential-plateau model, D(x) = D_M - (D_M - D_0) * e^(-kx), where D_M is the maximum achievable Dice score, D_0 is the minimum Dice (the model's value at x = 0), k is the exponential rate constant, and x is the number of training images. The parameters were fit using the curve_fit function from the SciPy library, which employs a non-linear least squares method. The plateau point was defined as the dataset size at which predicted performance comes within 0.01 Dice of D_M.
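A minimal sketch of the fit and the plateau-point calculation, using synthetic Dice values generated from the model itself (not the paper's data) so the recovered parameters are known:

```python
import numpy as np
from scipy.optimize import curve_fit

def dice_plateau(x, d_max, d_0, k):
    """Exponential-plateau model: D(x) = D_M - (D_M - D_0) * exp(-k x)."""
    return d_max - (d_max - d_0) * np.exp(-k * x)

# Dataset sizes as in the study; Dice values are synthetic (generated
# from the model with D_M=0.85, D_0=0.70, k=0.015 for illustration).
sizes = np.array([50, 100, 150, 200, 250, 300], dtype=float)
dice = dice_plateau(sizes, 0.85, 0.70, 0.015)

(d_max, d_0, k), _ = curve_fit(dice_plateau, sizes, dice, p0=[0.8, 0.6, 0.01])

# Plateau point: smallest x with D_M - D(x) <= 0.01, i.e.
# (D_M - D_0) * exp(-k x) <= 0.01  =>  x >= ln((D_M - D_0) / 0.01) / k.
plateau_x = np.log((d_max - d_0) / 0.01) / k
print(round(d_max, 3), round(plateau_x))  # recovers 0.85; plateau ~181 images
```

The closed-form plateau expression follows directly from setting the exponential term equal to the 0.01 tolerance.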
Primary metric: The Dice coefficient was the main evaluation metric, measuring the degree of overlap between predicted and reference standard segmentations on a 0-to-1 scale (where 1 is perfect overlap). This is the most widely used metric in 3D medical image segmentation. Paired Student's t-tests compared Dice scores across different dataset sizes to determine statistically significant differences in performance.
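For reference, the Dice coefficient for binary masks is straightforward to compute; this is a generic sketch, not the study's evaluation code:

```python
import numpy as np

def dice_coefficient(pred, ref):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks; 1.0 = perfect overlap."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:  # both masks empty: treat as perfect agreement
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / denom

pred = np.array([[1, 1, 0], [0, 1, 0]])
ref  = np.array([[1, 1, 1], [0, 0, 0]])
print(dice_coefficient(pred, ref))  # 2*2 / (3+3) ≈ 0.667
```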
Additional metrics: For the final 300-dataset-size ensemble models, the authors also reported the Jaccard index, true positive rate (TPR), and mean surface distance (MSD). Bland-Altman analysis and linear regression were used to compare predicted segmentation volumes against reference standard volumes, assessing both systematic bias and agreement.
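The two overlap-based secondary metrics can be sketched as below (a generic illustration, not the authors' code); mean surface distance is omitted since it requires extracting boundary surfaces from each mask:

```python
import numpy as np

def overlap_metrics(pred, ref):
    """Jaccard = |A ∩ B| / |A ∪ B|; TPR (sensitivity) = |A ∩ B| / |B|,
    where B is the reference standard mask."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return inter / union, inter / ref.sum()

pred = np.array([1, 1, 0, 0, 1])
ref  = np.array([1, 1, 1, 0, 0])
jac, tpr = overlap_metrics(pred, ref)
print(jac, tpr)  # 0.5, 2/3
```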
Plateau stability analysis: To validate the robustness of the exponential-plateau model, the authors progressively dropped the largest dataset sizes from the fitting procedure. If the predicted maximum Dice remained stable after removing the 300-size and then the 250-size datasets, it indicated that the model was reliably predicting the true performance ceiling. Divergent predictions when dropping larger datasets signaled that the true plateau had not yet been reached.
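The stability check amounts to refitting the curve on truncated data and comparing the predicted maxima; a sketch using noise-free synthetic Dice values (so all refits recover the same ceiling, the "stable" case):

```python
import numpy as np
from scipy.optimize import curve_fit

def dice_plateau(x, d_max, d_0, k):
    return d_max - (d_max - d_0) * np.exp(-k * x)

sizes = np.array([50, 100, 150, 200, 250, 300], dtype=float)
dice = dice_plateau(sizes, 0.85, 0.70, 0.015)  # synthetic, noise-free curve

# Refit while progressively dropping the largest dataset sizes; stable
# estimates of d_max suggest the observed data already span the plateau.
estimates = []
for n_drop in (0, 1, 2):
    end = len(sizes) - n_drop
    (d_max, _, _), _ = curve_fit(dice_plateau, sizes[:end], dice[:end],
                                 p0=[0.8, 0.6, 0.01])
    estimates.append(round(d_max, 3))
print(estimates)  # identical estimates → plateau reliably reached
```

With real, noisy Dice measurements the estimates diverge when the plateau lies beyond the observed sizes, which is exactly the MR tumor pattern described below.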
Segmenting non-neoplastic kidney tissue proved relatively straightforward for the nnU-Net framework. On CT, the best observed ensemble Dice score was 0.93 (from the 250-image dataset), and the exponential-plateau model estimated that a Dice of 0.93 could be reached with just 54 training-validation images. On MR, the best observed Dice was 0.92 (from the 300-image dataset), with the plateau estimated at 122 images at a Dice of 0.91. No statistically significant difference was found between ensemble models past either plateau point.
Why CT outperforms MR at lower dataset sizes: CT images generally produced higher Dice scores at smaller dataset sizes compared to MR. The nnU-Net ensemble framework achieved over 0.89 mean test set Dice for both modalities even with only 50 training-validation examples. However, MR kidney tissue had substantially more variable voxel intensities, with a standard deviation of 37 compared to 9.64 for CT kidney. This increased variability likely explains why MR required roughly twice as many images (122 vs. 54) to reach its performance ceiling.
Cohort characteristics: The CT cohort included 350 patients (229 male, 121 female, mean age 63 +/- 13 years) while the MR cohort included 350 patients (217 male, 133 female, mean age 59 +/- 14 years). CT images had a mean in-plane voxel size of 1.03 x 1.03 mm with a mean slice thickness of 4.03 +/- 1.38 mm. MR images had a mean in-plane voxel size of 1.34 x 1.34 mm with a mean slice thickness of 6.26 +/- 1.65 mm and fewer slices per volume (mean 32 +/- 12 vs. 45 +/- 24 for CT).
Tumor segmentation was substantially harder than kidney segmentation due to the increased heterogeneity of tumor size, shape, and intensity, as well as the difficulty of differentiating tumors from other renal structures like simple cysts. The best-performing tumor ensemble models for CT and MR were both from the 300-image dataset, yielding average test Dice scores of 0.86 and 0.76, respectively. The exponential-plateau model estimated a plateau at 126 images (Dice 0.84) for CT tumor and 389 images (Dice 0.76) for MR tumor.
CT tumor plateau stability: For CT tumors, the plateau prediction was highly stable. Under the full-data, drop-300, and drop-300-and-250 fitting conditions, the predicted maximum Dice values were 0.84, 0.84, and 0.83, respectively. No statistically significant difference was observed beyond the CT tumor plateau point, providing strong confidence in the estimate.
MR tumor plateau instability: The MR tumor plateau was less reliable. Under the same progressively dropped conditions, the predicted maximum Dice values were 0.76, 0.76, and 0.71. The sharp drop between the "drop 300" and "drop 300 and 250" conditions suggested that the fit had not yet stabilized. Additionally, a statistically significant difference (p = 0.03, paired Student's t-test) was observed between the 200 and 250-dataset-size ensemble models for MR tumor. The predicted plateau of 389 images exceeds the 300 training-validation images available, indicating that more MR data would likely improve performance further.
Small tumor performance: The smallest quartile of tumors proved especially challenging. For CT, the smallest quartile (mean 22.0 +/- 12.7 mL) required an estimated 378 images to reach a plateau of 0.711 Dice. For MR, the smallest quartile (mean 8.7 +/- 4.0 mL, all from partial nephrectomies) needed an estimated 338 images to reach only 0.53 Dice. In both modalities, median Dice scores were higher than means, reflecting the outsized impact of difficult small-tumor outliers.
Linear regression between reference standard and predicted segmentation volumes showed excellent agreement for kidney segmentation, with R-squared values of 0.969 for CT and 0.904 for MR. Tumor segmentation also showed strong concordance, with R-squared values of 0.932 for CT and 0.982 for MR. These high correlations indicate that the models accurately capture volumetric measurements needed for clinical decision-making.
Bland-Altman analysis: For kidney volume predictions, the bias +/- standard deviation was -0.99% +/- 6.21% for CT and -0.79% +/- 7.23% for MR, indicating minimal systematic error. Tumor volume predictions showed larger variability: 6.36% +/- 46.17% for CT and 22.69% +/- 58.58% for MR. The wider spread for tumors reflects the difficulty of segmenting heterogeneous masses, particularly small ones that the model sometimes misses entirely.
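A percent-difference Bland-Altman bias can be sketched as below, normalizing by the reference volume (one common convention; the paper's exact normalization may differ, and the volumes here are illustrative, not the study's data):

```python
import numpy as np

def bland_altman_percent(pred_vol, ref_vol):
    """Percent volume difference relative to the reference standard;
    returns (bias, sd) for limits of agreement bias ± 1.96 * sd."""
    pred_vol, ref_vol = np.asarray(pred_vol, float), np.asarray(ref_vol, float)
    pct_diff = 100.0 * (pred_vol - ref_vol) / ref_vol
    return pct_diff.mean(), pct_diff.std(ddof=1)

# Illustrative kidney volumes in mL.
ref  = np.array([150.0, 180.0, 200.0, 160.0])
pred = np.array([148.0, 183.0, 196.0, 159.0])
bias, sd = bland_altman_percent(pred, ref)
print(round(bias, 2), round(sd, 2))  # small negative bias, low spread
```

A near-zero bias with a narrow sd corresponds to the kidney results; the wide tumor sds reported above would show up here as large limits of agreement.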
Full metrics on the 300-image models: For CT kidney, the ensemble achieved a Dice of 0.93 +/- 0.02, Jaccard of 0.87 +/- 0.04, TPR of 0.93 +/- 0.03, and MSD of 0.60 +/- 0.27. For CT tumor: Dice 0.85 +/- 0.20, Jaccard 0.77 +/- 0.22, TPR 0.86 +/- 0.21, MSD 1.42 +/- 2.12. For MR kidney: Dice 0.92 +/- 0.04, Jaccard 0.85 +/- 0.07, TPR 0.92 +/- 0.05, MSD 0.50 +/- 0.35. For MR tumor: Dice 0.76 +/- 0.27, Jaccard 0.66 +/- 0.26, TPR 0.75 +/- 0.28, MSD 15.15 +/- 55.54. The strikingly high MSD for MR tumor reflects cases where the model entirely missed small tumors.
Qualitative findings: Visual inspection of good and poor cases revealed predictable patterns. Large, homogeneous tumors with clear borders received excellent predictions (e.g., a large hypointense CT tumor scored kidney Dice 0.95 and tumor Dice 0.96). Small, poorly defined lesions with low contrast were the primary failure mode. One CT case with a small hypointense lesion scored kidney Dice 0.95 but tumor Dice of only 0.05, and an MR case with a small hypointense tumor achieved kidney Dice 0.88 but tumor Dice of 0.17. The model tended to produce more false-negative than false-positive segmentations, sometimes entirely missing areas of smaller tumors.
The KiTS21 public dataset provided an opportunity to compare model architectures using a standardized, single-phase (corticomedullary contrast) CT dataset. Only tumor segmentation was analyzed on KiTS21 because even small datasets already achieved high kidney Dice scores, making the plateau analysis less informative for kidney tissue. Both 2D and 3D nnU-Net configurations were tested across training-validation sets of 80, 120, 160, 200, and 240 images, with 60 held out for testing.
Performance comparison: The top-performing tumor ensemble models for 2D and 3D architectures were both from the 240-image training set, achieving average test Dice scores of 0.67 +/- 0.29 and 0.84 +/- 0.18, respectively. The 3D model clearly outperformed the 2D model, which is expected given that volumetric context helps distinguish tumors from surrounding structures across slices. The exponential-plateau model predicted a maximum Dice of 0.76 at 177 images for the 2D model and 0.88 at 440 images for the 3D model.
Data efficiency trade-off: A notable finding is that the 2D model required far fewer images to reach its plateau (177 vs. 440 for 3D), despite predicting a lower ceiling (0.76 vs. 0.88 Dice). This reveals an important principle: architectures with higher performance ceilings are not necessarily more data-efficient. The 3D model leverages volumetric spatial information more effectively but needs substantially more labeled examples to do so. Researchers must weigh whether the performance gain justifies the additional annotation burden.
Plateau stability on KiTS21: The stability analysis confirmed reliable predictions for both architectures. For the 2D model, the maximum predicted Dice under "all data," "drop 240," and "drop 240 and 200" conditions was 0.76, 0.75, and 0.75. For the 3D model, corresponding values were 0.88, 0.87, and 0.85. The tight clustering of these estimates, especially for the 2D model, indicates that the exponential-plateau approach generalizes well across different datasets and architectures.
Different preprocessing for CT and MR: A key limitation is that the CT images were pre-cropped around the kidneys, presenting an easier task for the segmentation model compared to the full abdominal MR images. This cropping is analogous to the coarse-to-fine segmentation strategy used by the best-performing KiTS21 submissions, where an initial model identifies the renal region of interest before segmenting specific tissue. However, it means the CT and MR results are not directly comparable in terms of task difficulty, and the CT performance figures may be somewhat optimistic for real-world deployment on uncropped images.
Cohort differences between modalities: The MR dataset included patients who underwent both partial and radical nephrectomies, while the CT dataset was limited to radical nephrectomy patients only. Because partial nephrectomy patients tend to have smaller tumors (mean 8.7 +/- 4.0 mL in the smallest MR quartile vs. 22.0 +/- 12.7 mL in the smallest CT quartile), the MR dataset presented an inherently harder segmentation challenge. Additionally, MR tumor voxel intensities had roughly seven times the standard deviation of CT tumor voxels, further increasing the difficulty of learning consistent features from MR data.
Architecture and configuration scope: The study evaluated only the 3d_fullres nnU-Net configuration for the internal datasets and added the 2D configuration only for KiTS21. It did not explore multi-model ensembles or cascaded configurations that some top KiTS21 submissions employed. The performance plateau is also specific to the holdout test set, and the authors emphasize that model developers must independently ensure their test set is representative of real-world images for the intended clinical task.
Future directions: The authors propose extending this approach to additional renal anatomic structures, such as renal cysts, which may only be present in a subset of training examples and would presumably require larger datasets. The KiTS21 challenge includes labels for simple cysts, providing a natural starting point. Investigating whether plateau points differ across organ systems and pathologies is another promising direction. The exponential-plateau framework itself could become a standard tool for researchers planning dataset curation for any medical image segmentation task, helping isolate whether suboptimal performance is due to insufficient training data or inherent architectural limitations.