Automatic Segmentation of Kidneys and Kidney Tumors: The KiTS19 International Challenge

Frontiers in Digital Health, 2021

Plain-English Explanations
Page 1
Why Automated Kidney Tumor Segmentation Matters for Clinical Decision-Making

Incidental detection of renal masses has risen sharply as cross-sectional abdominal imaging (CT and MRI) has become routine for non-urological indications. Once a mass is identified, clinicians rely on imaging characteristics to assess its malignancy potential and guide treatment strategy. Nephrometry scores, such as the RENAL scoring system developed by Kutikov and Uzzo, quantify tumor complexity by measuring size, endophytic proportion, proximity to the collecting system, anterior/posterior orientation, and location relative to polar lines. These scores directly influence decisions about surgical technique, prognosis, and patient counseling.
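The RENAL score assigns 1 to 3 points to each of four measured components (Radius, Exophytic/endophytic, Nearness to collecting system, Location relative to polar lines), with anterior/posterior recorded as a suffix rather than points. A minimal sketch of the published point assignments (Kutikov and Uzzo); the function name, parameters, and the simplified boolean handling of the L component are illustrative assumptions, not the authors' code:

```python
def renal_score(diameter_cm, pct_exophytic, dist_to_collecting_mm,
                anterior, crosses_polar_line, mostly_central):
    """Sketch of RENAL nephrometry point assignments (Kutikov & Uzzo).
    Returns (total points, 4-12, and an 'a'/'p' suffix)."""
    # R: radius (maximal tumor diameter in cm)
    r = 1 if diameter_cm <= 4 else (2 if diameter_cm < 7 else 3)
    # E: exophytic/endophytic proportion
    e = 1 if pct_exophytic >= 50 else (2 if pct_exophytic > 0 else 3)
    # N: nearness to the collecting system or sinus (mm)
    n = 1 if dist_to_collecting_mm >= 7 else (2 if dist_to_collecting_mm > 4 else 3)
    # L: location relative to polar lines (simplified to two booleans here)
    l = 3 if mostly_central else (2 if crosses_polar_line else 1)
    # A: anterior/posterior contributes a suffix, not points
    return r + e + n + l, ("a" if anterior else "p")

# a small (3 cm), mostly exophytic, peripheral anterior tumor -> low complexity
print(renal_score(3.0, 60, 8.0, True, False, False))  # (4, 'a')
```

Totals of 4-6, 7-9, and 10-12 are conventionally read as low, moderate, and high surgical complexity.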

The bottleneck of manual scoring: In current clinical practice, nephrometry scores are calculated manually from cross-sectional imaging. This process is labor-intensive, unreimbursed, and subject to considerable interobserver variability. The intraclass correlation coefficient among radiology fellows, urology fellows, radiology residents, and medical students for the C-index, PADUA, and RENAL scores has been reported as only 0.77, 0.68, and 0.66, respectively. This inconsistency limits widespread adoption despite the established clinical benefit of these scoring systems.

The promise of deep learning: Renal tumors image well on contrast-enhanced CT, distinguishable from normal kidney parenchyma at diameters as small as 10 mm. This characteristic makes them strong candidates for automated delineation through deep learning. AI-generated segmentations could be translated into fully automated nephrometry scores, eliminating the manual effort and variability that currently limit their use. However, reliable automatic segmentation of both kidneys and kidney tumors remains a prerequisite for any downstream automation.

Prior work using CT texture analysis attempted to differentiate benign angiomyolipomas from malignant tumors, but these approaches relied on expert manual segmentation to extract discriminative features and therefore still required considerable error-prone manual effort. The KiTS19 Challenge was designed to address this gap by crowdsourcing the development of automatic segmentation algorithms through an international competition.

TL;DR: Manual nephrometry scoring of kidney tumors is labor-intensive and suffers from interobserver variability (intraclass correlations of only 0.66-0.77). Automated segmentation via deep learning could eliminate this bottleneck, enabling consistent, reproducible scoring for clinical decision-making.
Pages 2-3
300 Patients, 50,000+ Annotated Regions, and Images from Over 70 Clinics

The dataset was drawn from 544 consecutive patients who underwent surgery for a renal mass between January 2010 and July 2018 at a single surgical site. Inclusion was restricted to patients with available pre-operative CT abdominal/pelvic imaging in the late arterial contrast phase (n = 326), which was selected for consistency and because it was the most commonly available phase. Patients with tumor thrombus were excluded to avoid ambiguity in defining kidney tumor voxels (n = 26), leaving 300 patients in the final dataset. Importantly, although all surgeries occurred at one center, the imaging was acquired from over 70 different clinics across the country, using scanners from four different manufacturers.

Comprehensive clinical data: Beyond imaging, the researchers extracted pre-operative demographic and clinical data, intra-operative details (surgical technique, operative time, ischemia time for partial nephrectomy, blood transfusion), detailed pathological data (histological subtype, T stage, ISUP grade), and post-operative outcomes (complications, renal function, survival). This rich clinical context enables correlation between segmentation performance and tumor characteristics.

Annotation process: Twenty-five medical students performed the annotations after receiving 60 minutes of virtual training from a Computer Science Ph.D. student. For one week after training, students were monitored and their performance validated against that of a staff urologic oncologist. Given the different radiodensities of normal renal parenchyma, cysts, tumors, and perinephric fat, simple image processing techniques such as denoising and thresholding were used to consistently delineate boundaries between structures. In total, more than 50,000 regions were delineated across the 300 scans, encompassing several hundred hours of effort.

Interobserver agreement: The annotation quality was assessed by calculating the average Sorensen-Dice coefficient between human annotators on 30 randomly selected cases. The mean Dice score for kidney regions was 0.983, and the mean Dice for tumor alone was 0.923. These values establish the human performance benchmark against which all automated methods were compared.
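The Sorensen-Dice coefficient used throughout the paper measures voxel overlap between two segmentation masks: twice the intersection divided by the sum of the two mask sizes. A minimal sketch, representing each mask as a set of voxel coordinates (the variable names and toy coordinates are illustrative):

```python
def dice(a, b):
    """Sorensen-Dice overlap between two sets of voxel indices."""
    if not a and not b:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2 * len(a & b) / (len(a) + len(b))

# two annotators' tumor masks as sets of (z, y, x) voxel coordinates
ann1 = {(0, 1, 1), (0, 1, 2), (0, 2, 1), (0, 2, 2)}
ann2 = {(0, 1, 1), (0, 1, 2), (0, 2, 1)}
print(dice(ann1, ann2))  # 2*3 / (4+3) = 0.857...
```

A Dice of 1.0 indicates identical masks and 0.0 indicates no overlap, so the reported interobserver values of 0.983 (kidney) and 0.923 (tumor) represent near-perfect agreement.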

TL;DR: The dataset comprised 300 patients with CT scans from 70+ clinics and four scanner manufacturers. Twenty-five medical students annotated 50,000+ regions with interobserver Dice scores of 0.983 for kidney and 0.923 for tumor, establishing the human performance ceiling.
Page 3
How the KiTS19 Competition Was Structured: Training, Testing, and Scoring

The KiTS19 Challenge was hosted on grand-challenge.org from March 1 to October 13, 2019, in conjunction with the 2019 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) held in Shenzhen, China. The aim was to identify the best method for automatic semantic segmentation of kidneys and kidney tumors from contrast-enhanced CT scans.

Data split and timeline: A training set of 210 fully annotated cases was made publicly available three months before the testing phase. When the test phase began, a set of 90 unannotated cases was released, and teams had two weeks to produce fully automatic segmentations with no manual intervention allowed. Teams were permitted to use other publicly available data to supplement their training but were restricted to a single submission. Each team also had to submit a detailed manuscript describing their methods to qualify.

Scoring metric: Teams were ranked based on the average Sorensen-Dice coefficient between their predicted kidney and tumor segmentations and the ground truth across all 90 test cases. This composite score equally weighted kidney segmentation accuracy and tumor segmentation accuracy. To incentivize participation, Intuitive Surgical offered a $5,000 cash prize to the winning team. The official leaderboard was released shortly after the testing phase closed and has remained open, accumulating 657 total submissions at the time of writing.
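The composite ranking described above can be sketched as the mean, over all test cases, of the equally weighted average of kidney Dice and tumor Dice. A minimal illustration with hypothetical per-case scores (the data and names are made up for the example):

```python
def composite_score(cases):
    """Mean over cases of the equally weighted kidney/tumor Dice.
    `cases` is a list of (kidney_dice, tumor_dice) pairs, one per test case."""
    return sum((k + t) / 2 for k, t in cases) / len(cases)

# hypothetical per-case (kidney, tumor) Dice scores for one team
scores = [(0.97, 0.85), (0.98, 0.80), (0.96, 0.90)]
print(round(composite_score(scores), 3))  # 0.91
```

Because the two classes are weighted equally, a team strong on kidney but weak on tumor segmentation is penalized as much as the reverse.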

Survey of participants: A convenience sample of 67 teams was anonymously surveyed. On average, teams reported spending approximately 170 hours (SD 212 hours) developing their models. Only 6% of teams reported working with a physician, highlighting a potential gap in clinical-AI collaboration during algorithm development.

TL;DR: KiTS19 gave 100+ teams 210 annotated training CTs and 3 months to build models, then tested on 90 held-out cases scored by composite Sorensen-Dice. Teams invested an average of 170 hours, and only 6% collaborated with a physician. A $5,000 prize from Intuitive Surgical incentivized participation.
Pages 3-4
100 Submissions, 20,000 Hours of Global Effort, and a Winning Dice of 0.912

The KiTS19 Challenge attracted 106 unique teams from across five continents, of which 100 met all submission requirements and were included in the final MICCAI 2019 leaderboard. It was recognized as the challenge with the greatest number of participants at MICCAI 2019. All 100 valid submissions were based on deep neural networks, though they exhibited considerable differences in pre-processing strategies, architectural details, and training procedures. Collectively, the competition coordinated an estimated 20,000 hours of global development effort focused on kidney tumor segmentation.

The winning algorithm: The top-ranking model was submitted by a team from the German Cancer Research Center. It used an ensemble of three 3D U-Nets; the U-Net is a convolutional neural network architecture designed specifically for volumetric segmentation in biomedicine. This submission achieved a kidney Dice of 0.974, a tumor Dice of 0.851, and a composite score of 0.912. The 3D U-Net approach, adapted through the nnU-Net framework, proved to be the most effective strategy in this challenge.
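Ensembling segmentation networks typically means averaging the per-voxel class probabilities predicted by each model and then taking the argmax. The paper does not detail the winner's fusion rule, so the following is a generic toy sketch of probability averaging, with three hypothetical models and classes 0 = background, 1 = kidney, 2 = tumor:

```python
def ensemble_vote(per_model_probs):
    """Average per-voxel class probabilities across models, then argmax.
    `per_model_probs`: one entry per model, each a list of per-voxel
    probability triples [p_background, p_kidney, p_tumor]."""
    n_models = len(per_model_probs)
    n_voxels = len(per_model_probs[0])
    labels = []
    for v in range(n_voxels):
        avg = [sum(m[v][c] for m in per_model_probs) / n_models
               for c in range(3)]
        labels.append(max(range(3), key=lambda c: avg[c]))
    return labels

# three models disagree on the second voxel; averaging resolves it to tumor
m1 = [[0.9, 0.1, 0.0], [0.2, 0.5, 0.3]]
m2 = [[0.8, 0.2, 0.0], [0.1, 0.3, 0.6]]
m3 = [[0.7, 0.3, 0.0], [0.1, 0.2, 0.7]]
print(ensemble_vote([m1, m2, m3]))  # [0, 2]
```

Averaging smooths out idiosyncratic errors of any single network, which is one reason ensembles tend to outrank their individual members on leaderboards like this one.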

Kidney vs. tumor performance: Automated segmentation of the kidney by participating teams performed comparably to expert manual segmentation. The winning algorithm's kidney Dice of 0.974 was close to the human interobserver agreement of 0.983. However, tumor segmentation was notably less reliable. The winning algorithm's tumor Dice of 0.851 fell short of the human interobserver agreement of 0.923, representing a gap of 0.072 points. This disparity is expected, since tumors are smaller, more variable in shape and appearance, and can be difficult to distinguish from surrounding tissue.

TL;DR: 100 teams from five continents submitted deep learning models. The winner (German Cancer Research Center) used three 3D U-Nets to achieve 0.974 kidney Dice and 0.851 tumor Dice (0.912 composite). Kidney segmentation matched human performance (0.974 vs. 0.983), but tumor segmentation lagged behind (0.851 vs. 0.923).
Pages 4-6
How Tumor Size, Depth, Location, and Collecting System Proximity Affected Segmentation

The study systematically analyzed how each component of the RENAL nephrometry score influenced automatic segmentation accuracy. While no nephrometry component significantly affected kidney Dice scores on multivariable analysis, tumor Dice was significantly associated with all four components examined: tumor size, endophytic proportion, collecting system involvement, and location relative to polar lines (all p < 0.01).

Tumor size (R component): Larger tumors were segmented more accurately. The mean tumor Dice for all teams was 0.45 (SD 0.21) for tumors under 4 cm, 0.70 (SD 0.21) for tumors between 4 and 7 cm, and 0.75 (SD 0.19) for tumors over 7 cm. The winning algorithm followed a similar pattern, scoring 0.80 for small tumors, 0.91 for medium, and 0.89 for large. This pattern is intuitive: smaller tumors occupy fewer voxels, making precise boundary delineation more difficult.

Endophytic proportion (E component): Tumors that grew inward (endophytic) were harder to segment. The mean tumor Dice across all teams was 0.61 (SD 0.21) for mostly exophytic tumors, 0.51 (SD 0.20) for less than 50% endophytic, and 0.41 (SD 0.19) for entirely endophytic tumors. The winning algorithm scored 0.88, 0.86, and 0.74 for these three categories respectively. Endophytic tumors blend more with surrounding parenchyma on imaging, reducing contrast at the tumor-kidney boundary.

Collecting system proximity (N component) and polar line location (L component): Tumors farther from the collecting system (7 mm or more) had a mean Dice of only 0.44, compared to 0.64 for tumors within 4 mm. For polar line location, tumors entirely above or below polar lines had a mean Dice of 0.43, while those between polar lines scored 0.68. The winning algorithm showed similar trends, with Dice scores of 0.75 vs. 0.88 for collecting system proximity and 0.77 vs. 0.92 for polar line location. These patterns suggest that tumors in more central, anatomically prominent positions are segmented more accurately.
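Subgroup analyses like those above amount to binning cases by a RENAL component and averaging tumor Dice within each bin. A sketch for the R component (tumor size), using hypothetical (diameter_cm, tumor_dice) pairs rather than the study's actual case-level data:

```python
def mean_dice_by_size(cases):
    """Group (diameter_cm, tumor_dice) pairs into RENAL R-component
    bins (<4 cm, 4-7 cm, >7 cm) and average Dice within each bin."""
    bins = {"<4 cm": [], "4-7 cm": [], ">7 cm": []}
    for diameter, dice_score in cases:
        key = ("<4 cm" if diameter < 4
               else "4-7 cm" if diameter <= 7
               else ">7 cm")
        bins[key].append(dice_score)
    return {k: round(sum(v) / len(v), 2) for k, v in bins.items() if v}

# hypothetical cases illustrating the small-tumor penalty
cases = [(2.5, 0.40), (3.0, 0.50), (5.0, 0.70), (8.2, 0.75), (9.0, 0.77)]
print(mean_dice_by_size(cases))
```

The same binning pattern applies to the E, N, and L components; only the bin-assignment rule changes.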

TL;DR: Tumor segmentation accuracy varied significantly by RENAL score components (all p < 0.01). Small tumors (<4 cm) had mean Dice of 0.45 vs. 0.75 for tumors over 7 cm. Entirely endophytic tumors scored 0.41 vs. 0.61 for exophytic. Algorithms struggled most with small, deeply embedded, peripherally located tumors.
Pages 4-5
From Segmentation to Automated Diagnosis, Nephrometry Scoring, and Surgical Planning

The authors envision automated segmentation as the foundation for a range of clinical applications in kidney cancer care. The most immediate use case is automated calculation of nephrometry scores. The RENAL score, C-index, and PADUA score all depend on precise measurements of tumor geometry relative to kidney anatomy. Currently, interobserver variation for these scores is substantial (intraclass correlations of 0.66-0.77), limiting their reliability. Automation would eliminate this variability and make these scoring systems accessible even in resource-poor and medically underserved settings that lack subspecialty expertise.

Incidental lesion detection: Automatic segmentation could also aid radiologists in identifying concerning renal lesions on CT scans performed for other indications. The literature reports that malignant kidney lesion detection sensitivity is only 84% for incidental findings. An AI system running in the background could flag suspicious regions for radiologist review, reducing the risk of missed diagnoses.

Radiomics and imaging biomarkers: Beyond geometry, automated segmentation opens the door to sophisticated tumor analytics such as radiomics and texture analysis. By consistently delineating tumor boundaries, these systems could extract quantitative imaging features that correlate with tumor biology, potentially predicting malignancy probability, histological grade, or aggressive behavior based solely on imaging. Such predictive capabilities could inform treatment decisions before any biopsy or surgical intervention.

Surgical training and planning: Reliable 3D segmentations are also a necessary step toward creating high-fidelity surgical training models, including 3D-printed kidney replicas and augmented or virtual reality environments. These tools could enable surgeons to practice complex partial nephrectomy procedures on patient-specific anatomical models, improving surgical precision and reducing complications.

TL;DR: Automated segmentation enables automated nephrometry scoring (eliminating interobserver variability with intraclass correlations of 0.66-0.77), incidental lesion flagging (current detection sensitivity is only 84%), radiomics-based outcome prediction, and 3D-printed or virtual surgical planning models.
Page 7
How Open Competitions Accelerate Medical AI: 20,000 Hours of Effort in 7 Months

The KiTS19 Challenge demonstrates the power of community-driven, incentivized competitions for developing medical AI applications. By making high-quality, annotated data publicly available and structuring an international competition, the organizers coordinated approximately 20,000 hours of development effort from 100+ teams across five continents in just seven months. This approach contrasts sharply with isolated institutional efforts, which typically require years of dedicated funding and personnel to produce comparable results.

Comparison to other challenges: The success of KiTS19 mirrors similar efforts in other cancer types. Mak et al. conducted a financially incentivized online challenge for lung tumor segmentation for radiation therapy targeting, attracting 34 algorithms across multiple phases and producing a model with a Dice score of 0.68 that outperformed commercially available software and matched interobserver variation among five radiation oncologists. The Digital Mammography DREAM Challenge achieved similar success for breast cancer detection from mammograms. These results collectively suggest that open competitions in oncology can accelerate the development of high-quality tools in an open, low-cost, and time-efficient manner.

The 6% physician involvement problem: One notable finding from the participant survey was that only 6% of teams reported working with a physician. This highlights a systemic gap in clinical-AI collaboration. Without clinical input, an algorithm that misses a tumor entirely may go unnoticed during development, and such failures could translate into less aggressive treatment plans, such as surveillance rather than intervention. Bridging this gap will likely require institutional incentives and interdisciplinary training programs that pair computer scientists with clinicians from the earliest stages of algorithm development.

TL;DR: KiTS19 crowdsourced roughly 20,000 hours of global AI development effort in 7 months. Similar competitions in lung and breast cancer have produced models matching or exceeding commercial software. Only 6% of teams collaborated with physicians, highlighting a critical gap in clinical-AI partnerships.
Pages 7-8
Geographic Bias, Small Dataset Size, and the Road to Clinical Translation

Limited geographic diversity: Although the imaging was acquired from over 70 different clinics using four scanner manufacturers, all patients underwent surgery at a single center in Minnesota. This geographic restriction means the algorithms may not generalize well to different patient populations, ethnicities, or clinical settings. Scanner parameters, contrast protocols, and patient demographics can vary significantly across regions, and performance degradation on out-of-distribution data remains a recognized challenge for deep learning models in medical imaging.

Relatively small dataset: With 300 cases (210 training, 90 test), the KiTS19 dataset is small compared to AI challenges outside medical image segmentation, which often involve tens of thousands of examples. The performance estimates are therefore less precise than they would be with a larger, more diverse dataset. The small test set of 90 cases also limits statistical power for subgroup analyses, particularly for less common tumor characteristics.

Single contrast phase: The dataset was restricted to late arterial phase CT images for consistency. In clinical practice, renal masses are often evaluated across multiple contrast phases (non-contrast, corticomedullary, nephrographic, excretory), and the absence of multi-phase data limits the applicability of these algorithms to real-world multi-phase protocols. Future datasets should incorporate multiple phases to more closely reflect clinical workflows.

Path forward: The authors concluded that rapid advancement in automated semantic segmentation of kidney lesions is achievable when data is released publicly and participation is incentivized. The KiTS Challenge has continued to evolve (the leaderboard remains open), and additional training scans from a wider range of centers and contrast phases could push algorithms to approach and potentially surpass human-level performance. Achieving clinical deployment will require prospective validation studies, regulatory approval pathways, and integration into radiology PACS workflows.

TL;DR: Key limitations include single-center surgery data (despite 70+ imaging sites), a relatively small dataset of 300 cases, and restriction to a single contrast phase. Future work needs multi-center, multi-phase datasets, prospective validation, and PACS integration for clinical translation.
Citation: Sathianathen NJ, Heller N, Tejpaul R, et al. Frontiers in Digital Health, 2021. Open Access (CC BY). Available via PMC: PMC8763784. DOI: 10.3389/fdgth.2021.797607.