Kidney tumors are the seventh most common tumor type in both men and women worldwide. About a third of cases are discovered only after the cancer has already spread to other organs, partly because kidney tumors often produce no obvious symptoms. When symptoms do appear, they can include blood in the urine, abdominal pain, anemia (present in roughly 30% of patients), weakness, and vomiting. Because early detection dramatically improves survival and treatment outcomes, computed tomography (CT) scans of the abdomen and pelvis are a primary diagnostic tool for identifying kidney masses and distinguishing tumors from cysts or stones.
The case for automation: Traditional diagnosis by radiologists is tedious and time-consuming, and the subjective nature of manual CT interpretation introduces the risk of misdiagnosis. Deep learning (DL), a subset of machine learning that automatically learns features and patterns from raw images, has consistently matched or surpassed human-level performance in medical image analysis tasks. Convolutional neural networks (CNNs) are especially well suited for this work because they use weight-sharing and hierarchical feature extraction to identify visual patterns at multiple scales, from low-level edges to high-level anatomical structures.
Study objectives: This paper from Jordan University of Science and Technology presents four deep learning models for kidney tumor diagnosis on CT scans: three for tumor detection (distinguishing normal kidneys from those with tumors) and one for tumor classification (distinguishing benign from malignant tumors). The models include a custom 6-layer CNN (CNN-6), a ResNet50 (50 layers), a VGG16 (16 layers), and a custom 4-layer CNN (CNN-4) for classification. Critically, the authors also introduce a novel dataset of 8,400 CT images from 120 patients collected at King Abdullah University Hospital (KAUH) in Jordan.
The paper reviews seven prior studies that tackled kidney tumor detection or classification using CT scans. Ghalib et al. (2014) used traditional artificial neural networks (ANN) with self-organizing maps (SOM), after preprocessing that included noise removal and contrast-limited adaptive histogram equalization, and reported an average execution time of 0.85 seconds. Liu et al. (2014) applied machine learning to 167 CT scans for exophytic renal lesion detection, reaching 95% sensitivity for exophytic lesions and 80% for endophytic lesions using belief propagation-based kidney segmentation.
Segmentation and radiomics approaches: Mredhula and Dorairangaswamy (2015) used only 28 CT scans and proposed an associative neural network (ASNN) combined with k-nearest neighbor (KNN) for tumor classification, achieving 83% accuracy. Zhou et al. (2019) used transfer learning with InceptionV3 on 192 CT scans, reaching 97% accuracy through five-fold cross-validation evaluated by receiver operating characteristic (ROC) curves. Zabihollahy et al. (2020) compared semi-automated majority voting 2D-CNN, fully automated 2D-CNN, and 3D-CNN on 315 patients (77 benign, 238 malignant), achieving 83.75% accuracy, 89.05% precision, and 91.73% recall.
Feature-based machine learning: Schieda et al. (2020) used XGBoost on manually segmented features from three CT scan phases (nephrographic, corticomedullary, and non-contrast) for 177 patients, achieving AUC values of 0.70 for distinguishing renal cell carcinoma (RCC) from benign tumors and 0.77 for classifying clear cell RCC subtypes. Yap et al. (2020) used AdaBoost and Random Forest on shape and texture radiomics features from 735 patients (196 benign, 539 malignant), with Random Forest reaching AUC values of 0.68 to 0.75.
Key gap identified: A consistent limitation across these studies was the small size of their datasets, typically ranging from 28 to 735 patients. The authors note that data scarcity in medical imaging leads to high risks of overtraining and reduced generalization. This motivated the creation of a larger, more comprehensive dataset from KAUH, which at 8,400 images from 120 patients exceeds previously available public kidney CT datasets in both size and image diversity.
The novel dataset was collected from King Abdullah University Hospital (KAUH) in Jordan and consists of 8,400 CT scan images from 120 adult patients aged 30 to 80 years (55 females and 65 males). Each patient contributed approximately 70 images from different CT dimensions, stored originally in DICOM format and later converted to JPEG. The dataset includes both contrast-enhanced and non-contrast CT scans. Data were collected from patients seen in 2020 (83 cases) and 2021 (37 cases), with clinical metadata reviewed and validated by radiologists and urological specialists.
Patient composition: Of the 120 patients, 60 had kidney tumors, 32 were classified as normal and healthy, and 28 were normal but had non-tumor kidney conditions such as cysts, hydronephrosis, or stones. Among the 60 tumor patients, 38 were benign and 22 were malignant. The benign cases broke down into 28 adenomas, 9 angiomyolipomas, and 1 lipoma. The malignant cases included 11 renal cell carcinomas (RCC) and 11 secondary metastases from cancers in neighboring organs (breast, colon, uterus). Tumor staging showed 92% of cases in Stage I, 3% in Stage II, 3% in Stage III, and 2% in Stage IV.
Clinical annotations: The dataset contains 20 attributes per patient, including patient ID, age, gender, test area, contrast status, clinical history, symptoms, diagnosis for each kidney, injury range, segmentation location (upper, middle, lower), tumor stage, tumor type (benign or malignant), and tumor subtype. For tumor location, left kidney tumors were most often in the upper segment (21 cases), while right kidney tumors also concentrated in the upper part (18 cases). This level of metadata annotation distinguishes the KAUH dataset from prior public datasets, which typically provide only images without detailed clinical context.
Comparison with existing datasets: The authors compared their dataset against six publicly available kidney CT datasets. The largest prior dataset was the G037-RCP from the Royal College of Pathologists in London with 5,339 images, but many other datasets were far smaller, such as TCGA-KICH with just 15 patients and TCGA-KIRP with 33 patients. The C4KC-KiTS19 dataset from the University of Minnesota had 210 patients. The KAUH dataset is notable for its 70 images per patient, comprehensive metadata, and availability on GitHub for research use.
The study employed a two-phase classification framework. In the first phase, the models classify each CT scan as either "normal" (combining healthy cases and cases with cysts) or "tumor." In the second phase, detected tumors are further classified as "benign" or "malignant." The normal/cyst labels were merged into a single "normal" class to balance the binary detection task. Each phase uses binary classification with labels encoded as 0 and 1.
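The two-phase framework amounts to a simple cascade: only images flagged as "tumor" in phase one reach the phase-two classifier. A minimal sketch, in which `detector` and `classifier` are hypothetical stand-ins for the trained models (each mapping an image to a 0/1 label; the specific 0/1 encoding below is an assumption, since the paper states only that labels are encoded as 0 and 1):

```python
def diagnose(image, detector, classifier):
    """Two-phase cascade: phase 1 detects a tumor, phase 2 grades it.

    `detector` and `classifier` are hypothetical stand-ins for trained
    models; each maps an image to a 0/1 label (encoding assumed here).
    """
    if detector(image) == 0:         # 0 = "normal" (healthy or cyst)
        return "normal"
    if classifier(image) == 0:       # 0 = "benign"
        return "benign"
    return "malignant"               # 1 = "malignant"

# Toy usage with constant stand-in models:
always_tumor = lambda img: 1
always_benign = lambda img: 0
print(diagnose(None, always_tumor, always_benign))  # benign
```

The cascade mirrors clinical workflow: a negative phase-one result short-circuits the pipeline, so the benign/malignant model is only ever applied to suspected tumors.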
Preprocessing pipeline: Raw CT video files were manually divided into frames, and images were converted from DICOM to JPEG format. From each patient's scan, 70 images were selected showing the kidneys from different dimensions. Images were then normalized by converting from 3-channel RGB to 1-channel grayscale and resized to 224 x 224 pixels. The CT window level and width were adjusted to emphasize the renal area while suppressing information from surrounding organs. Edge features were extracted using the OpenCV Canny edge detection algorithm, which helped identify anatomical structures and kidney boundaries.
Data augmentation: To address the risk of overfitting and to expand the training set, four augmentation techniques were applied: rescaling (normalizing pixel intensities, e.g. dividing by 255), shear transformation (with a 0.2 range), zoom (0.2 range), and horizontal flipping. These operations quadrupled the dataset size. For the tumor detection task, the dataset grew from 8,400 to 33,600 images. For the tumor classification task, it grew from 4,200 to 16,800 images. Augmentation was applied only to the training set, not to validation or test data, using Keras's ImageDataGenerator class.
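The four operations map directly onto Keras's ImageDataGenerator; the exact call shown in the comment below is a plausible reconstruction, not quoted from the paper. As a library-free sketch, here are the two simplest operations (rescaling and horizontal flip) in NumPy:

```python
import numpy as np

# Plausible Keras equivalent of the four reported operations:
#   ImageDataGenerator(rescale=1./255, shear_range=0.2,
#                      zoom_range=0.2, horizontal_flip=True)

def rescale(img):
    """Normalize uint8 pixel intensities to the [0, 1] range."""
    return img.astype(np.float32) / 255.0

def hflip(img):
    """Mirror the image left-to-right."""
    return img[:, ::-1]

slice_ = np.random.randint(0, 256, (224, 224), dtype=np.uint8)
out = rescale(hflip(slice_))
print(out.shape, out.dtype)  # (224, 224) float32
```

Applying these only to training data, as the authors do, keeps validation and test metrics honest: the model never scores on synthetically varied copies of images it was evaluated against.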
Data splitting: Both tasks used an 80/20 training-testing split, applied to the original pre-augmentation images. For the detection task, the 8,400 images were divided into 5,376 for training, 1,344 for validation, and 1,680 for testing; augmentation then expanded only the training portion. For the classification task, the 4,200 images were divided into 2,688 for training, 672 for validation, and 840 for testing.
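The reported detection-task counts are consistent with a nested 80/20 split of the original 8,400 images (20% held out for testing, then 20% of the remainder for validation):

```python
n_total = 8400                  # original detection images (pre-augmentation)
n_test = int(0.2 * n_total)     # 20% held-out test set
n_trainval = n_total - n_test
n_val = int(0.2 * n_trainval)   # 20% of the remainder for validation
n_train = n_trainval - n_val
print(n_train, n_val, n_test)   # 5376 1344 1680
```

The same arithmetic on the 4,200 classification images yields 2,688 / 672 / 840, matching the reported classification split.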
VGG16 detection model: The VGG16 architecture contains 16 weight layers: 13 convolutional layers and 3 fully connected layers, interleaved with 5 max-pooling layers (which carry no learnable weights). It accepts 224 x 224 x 3 input images and uses 3 x 3 convolution filters throughout with a fixed stride of 1 pixel. Max-pooling is performed through 2 x 2 windows. The three fully connected layers end with a softmax classifier for binary prediction (normal vs. tumor). All hidden layers use the ReLU activation function. The model originally achieved 92.7% top-5 accuracy on the ImageNet benchmark.
ResNet50 detection model: ResNet50 is a 50-layer deep residual network consisting of 49 convolutional layers and 1 fully connected layer. It also accepts 224 x 224 x 3 inputs. Each convolution block contains three convolution layers with skip connections (residual connections) that allow gradients to flow directly through the network, helping to solve the vanishing gradient problem. The fully connected output layer uses softmax for classification. Its reported per-step time is 12.83 ms, slightly faster than VGG16's 16.55 ms.
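Both backbones are available off the shelf in Keras. A sketch of instantiating ResNet50 with a 2-way softmax head for the detection task (VGG16 follows the same pattern); `weights=None` gives random initialization to keep the example offline, whereas a transfer-learning setup like the study's would start from `weights="imagenet"` with a custom top:

```python
from tensorflow.keras.applications import ResNet50

# Randomly initialized for illustration only; the study's setup would
# instead load pretrained ImageNet weights and replace the classifier head.
model = ResNet50(weights=None, include_top=True,
                 input_shape=(224, 224, 3), classes=2)
print(model.output_shape)  # (None, 2)
```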
CNN-6 detection model (custom): The authors' proposed detection model consists of 6 layers: a batch normalization layer, a Conv2D input layer (32 filters of 3 x 3 with ReLU activation on 224 x 224 x 3 input), a Max-Pooling2D layer (2 x 2), a dropout layer for regularization, a flatten layer, and a dense output layer with softmax activation. The batch normalization step standardizes inputs for each mini-batch, reducing overfitting. The dropout layer randomly disables neurons during training to prevent co-adaptation of features.
CNN-4 classification model (custom): Designed for the second phase (benign vs. malignant classification), this simpler architecture has just 4 layers: a Conv2D input layer, a Max-Pooling2D layer (2 x 2), a flatten layer, and a dense output layer. It omits batch normalization and dropout, reflecting the simpler nature of the binary classification task on a smaller subset of data (only tumor cases). All models used the Adam optimizer, binary cross-entropy loss, a learning rate of 0.001, a batch size of 32, and 128 hidden neurons.
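A Keras sketch of the CNN-6 detector assembled from the description above. The dropout rate and the placement of the reported 128 hidden neurons are assumptions (the six-layer list does not state them), and binary labels are assumed one-hot so the 2-way softmax can pair with binary cross-entropy as the paper reports:

```python
from tensorflow.keras import layers, models

def build_cnn6(input_shape=(224, 224, 3)):
    """Sketch of the paper's CNN-6 detector; rates/placements assumed."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.BatchNormalization(),                   # per-mini-batch standardization
        layers.Conv2D(32, (3, 3), activation="relu"),  # 32 filters of 3 x 3
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.5),                           # rate assumed, not stated
        layers.Flatten(),
        layers.Dense(128, activation="relu"),          # reported 128 hidden neurons
        layers.Dense(2, activation="softmax"),         # normal vs. tumor
    ])
    model.compile(optimizer="adam",                    # Adam's default lr is 0.001
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn6()
print(model.output_shape)  # (None, 2)
```

The CNN-4 classifier follows the same pattern with the BatchNormalization and Dropout layers removed.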
VGG16 results (detection): The VGG16 model performed poorly, achieving only 60% test accuracy with a loss of 0.3506 and a training time of 3s 68ms per step over 44 epochs. On the test set of 848 normal and 832 tumor samples, it correctly classified 764 normal cases but only 247 tumor cases, misclassifying 585 tumor images. Its precision was 0.57 for normal and 0.75 for tumor, with recall of 0.90 for normal but only 0.30 for tumor, yielding F1 scores of 0.70 and 0.42 respectively. The model's performance was clearly unsuitable for clinical application.
ResNet50 results (detection): ResNet50 performed dramatically better, reaching 96% training accuracy and 97.47% test accuracy with a much lower loss of 0.0806. Trained for 25 epochs at 3s 70ms per step, it correctly identified 806 of 848 normal samples and 813 of 832 tumor samples. Precision was 0.98 for normal and 0.95 for tumor, with recall of 0.95 and 0.98 respectively. Both classes achieved an F1 score of 0.96, demonstrating strong and balanced performance across categories.
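ResNet50's reported per-class metrics can be reproduced directly from the confusion counts above (806 of 848 normal and 813 of 832 tumor samples correct):

```python
def prf(tp, fn, fp):
    """Precision, recall, F1 from true-positive, false-negative,
    and false-positive counts, rounded to two decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Each class's false positives are the other class's false negatives:
print(prf(tp=806, fn=42, fp=19))  # normal: (0.98, 0.95, 0.96)
print(prf(tp=813, fn=19, fp=42))  # tumor:  (0.95, 0.98, 0.96)
```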
CNN-6 results (detection): The authors' custom 6-layer CNN achieved the best detection results, with 97% training accuracy and 95.31% test accuracy at a loss of 0.1480. It trained for 50 epochs at 3s 62ms per step, the fastest training time among the detection models. It correctly classified 823 of 848 normal samples and 801 of 832 tumor samples. Precision was 0.96 for normal and 0.97 for tumor, with recall of 0.97 and 0.96. The F1 score was 0.97 for both classes, slightly outperforming ResNet50 on the F1 metric despite a marginally lower raw test accuracy.
CNN-4 results (classification): The 4-layer CNN for benign vs. malignant classification achieved 97.77% training accuracy and 92% test accuracy with a notably low loss of 0.0643, trained for 50 epochs at just 1s 64ms per step. Tested on 531 malignant and 234 benign samples, it correctly classified 229 of 234 benign cases (precision 0.80, recall 0.98, F1 0.88) and 474 of 531 malignant cases (precision 0.99, recall 0.89, F1 0.94). Nearly every malignant prediction was correct (99% precision), though the 89% malignant recall means roughly one in ten cancers was misclassified as benign, a clinically important limitation.
The authors present a direct comparison table against seven prior studies spanning 2014 to 2021. Their 97% detection accuracy (CNN-6) matches or exceeds all prior work: Ghalib et al. (2014) reached 85% with SOM and ANN on an unspecified dataset, Liu et al. (2014) achieved 95% sensitivity with HOG and SURF features on 167 scans, Mredhula and Dorairangaswamy (2015) reached 83% with ASNN/KNN on just 28 scans, and Zabihollahy et al. (2020) achieved 83.75% accuracy with CNN on 315 patients.
Closest competitors: Zhou et al. (2019) achieved 97% accuracy using transfer learning with InceptionV3 on 192 CT scans with five-fold cross-validation, matching the detection accuracy of this study. However, the KAUH dataset is substantially larger (8,400 images vs. 192 scans) and includes a richer set of clinical annotations. The radiomics-based approaches by Schieda et al. (2020, XGBoost, AUC 0.70-0.77) and Yap et al. (2020, Random Forest, AUC 0.68-0.75) focused on different classification tasks using hand-crafted texture and shape features, making direct comparison less straightforward.
The authors emphasize that their study is the first to perform both detection and classification on a single large dataset from a single institution. Most previous studies addressed either detection or classification, but not both sequentially. The 92% classification accuracy for benign vs. malignant tumors, combined with the 97% detection accuracy, provides a more complete diagnostic pipeline than any single prior study offered.
Data collection challenges: The manual nature of the data pipeline was a significant bottleneck. CT scan videos had to be manually segmented into frames, images were manually selected (70 per patient), DICOM-to-JPEG conversion required careful quality control, and clinical text data had to be manually structured and validated by radiologists. The authors encountered technical problems that required re-collecting data for some patients, and missing data issues arose during the labeling process. These challenges highlight the difficulty of building high-quality medical imaging datasets from scratch.
VGG16 failure: The poor performance of VGG16 (60% accuracy) compared to ResNet50 (97.47%) and CNN-6 (95.31%) is notable and likely attributable to overfitting. VGG16's 138 million parameters may be excessive for the relatively small training dataset even after augmentation, while the simpler CNN-6 with far fewer parameters generalized better. The ResNet50 architecture's skip connections likely helped it avoid the vanishing gradient problems that may have hindered VGG16's deeper layers from learning effectively.
Single-center design: All 120 patients came from a single Jordanian hospital (KAUH), which limits generalizability. Differences in CT scanner hardware, imaging protocols, patient demographics, and disease prevalence across institutions could affect model performance when deployed elsewhere. The dataset also includes only 120 patients, which, despite the 8,400 total images, may not capture the full spectrum of kidney tumor presentations. The 80/20 train-test split without external validation on an independent dataset further limits confidence in real-world applicability.
Future work: The authors plan to extend their framework beyond detection and classification of tumor type to include classification of tumor subtypes (adenoma, angiomyolipoma, lipoma, RCC, and secondary metastases), tumor staging (Stages I through IV), and segmentation of tumor location within each kidney. They aim to build a complete diagnostic pipeline that covers all aspects of kidney tumor characterization. The availability of the KAUH dataset on GitHub is intended to facilitate follow-up studies by other research groups, and the authors express interest in incorporating 3D CT scan analysis for improved spatial understanding of tumor morphology.