Prostate cancer is the most common cancer in males in the United States. In 2017, there were roughly 161,360 new cases (19% of all new cancer cases) and 26,730 deaths (8% of all cancer deaths). Despite these numbers, early detection dramatically improves survival because prostate cancer tends to progress slowly. The challenge is that current screening methods, particularly the prostate-specific antigen (PSA) test, lead to significant over-diagnosis, which in turn triggers unnecessary and painful needle biopsies and potential over-treatment.
Multiparametric MRI and PI-RADS: Diffusion-weighted imaging (DWI), a key component of multiparametric MRI, has become increasingly central to prostate cancer diagnosis. Radiologists interpreting DWI achieve AUC values ranging from 0.69 to 0.81 for detecting clinically significant prostate cancer. The standardized PI-RADS v2 scoring system was developed to guide interpretation, but inter-observer variability remains a persistent problem, meaning different radiologists can reach different conclusions from the same images.
Computer-aided detection (CAD): Traditional CAD approaches combine handcrafted imaging features (texture, shape, volume, intensity) with machine learning classifiers such as Support Vector Machines (SVM), Adaboost, and Decision Trees. While these methods have shown potential, deep learning methods using convolutional neural networks (CNNs) have increasingly outperformed them in computer vision tasks like segmentation, classification, and object detection, motivating their application to prostate cancer detection.
This paper proposes an end-to-end automated CNN-based pipeline that classifies prostate cancer at both the individual MRI slice level and the patient level using DWI images from 427 patients. The key innovation is that the pipeline does not require manually drawn regions of interest (ROIs), making it more practical for clinical deployment.
The authors provide a thorough review of prior deep learning approaches to prostate cancer detection on MRI, highlighting a key distinction: most existing methods require user-drawn or automatically generated regions of interest (ROIs) as input. This reliance on ROIs introduces either manual labor or dependence on a separate segmentation algorithm, both of which limit real-world applicability.
ROI-based methods: Tsehay et al. used a 5-layer VGGNet-inspired CNN on 3x3 pixel windows from 196 patients and achieved an AUC of 0.90 on a test set of 52 patients, but this was based on small pixel-level ROIs. Le et al. combined a multimodal Residual Network (ResNet) with handcrafted features and reached an ROI-level AUC of 0.91, though they used the test set for fine-tuning, which inflates reported performance. Liu et al. applied a VGGNet-inspired 2D CNN to 32x32 ROIs centered around biopsy locations in the ProstateX challenge dataset (341 patients) and achieved an AUC of 0.84, but on an augmented test set. Mehrtash et al. used a 9-layer 3D CNN on the same ProstateX dataset and achieved a lesion-level AUC of 0.80.
Slice-level and patient-level methods: Ishioka et al. performed slice-level analysis using U-Net combined with ResNet on 316 patients but evaluated on only 17 test slices, achieving an AUC of 0.79. Wang et al. compared deep learning to non-deep learning methods on 172 patients and converted slice-level predictions to patient-level results via simple voting, achieving a patient-level AUC of 0.84 and a PPV of 79%, but this was based on cross-validation rather than an independent test set.
The central argument is that ROI-based methods are inherently limited: they require time-consuming manual annotation or risk propagating errors from automated segmentation, and they struggle to aggregate thousands of small ROI-level predictions into a meaningful patient-level diagnosis. The proposed pipeline addresses all of these issues by operating on full DWI slices without ROI generation.
Patient cohort: The dataset consisted of 427 consecutive patients who had a PI-RADS score of 3 or higher and underwent biopsy. Of these, 175 patients had clinically significant prostate cancer (defined as Gleason score of 7 or higher, corresponding to ISUP grade group 2 or above) and 252 patients did not. In total, 5,832 2D DWI slices containing the prostate gland were used.
MRI acquisition: All DWI data were acquired between January 2014 and July 2017 on a Philips Achieva 3T MRI scanner. The imaging protocol used a single-shot spin-echo echo-planar imaging sequence with four b-values (0, 100, 400, and 1000 s/mm2), TR of 5000-7000 ms, TE of 61 ms, slice thickness of 3 mm, FOV of 240 x 240 mm, and a 140 x 140 matrix. From these sequences, ADC maps and computed high b-value images (b1600) were also derived.
Data preprocessing: All DWI slices were resized to 144 x 144 pixels and center-cropped to 66 x 66 pixels to ensure the prostate was covered. The CNNs were modified to accept 6-channel input (ADC, b0, b100, b400, b1000, and b1600) instead of the standard 3-channel RGB input. All images were normalized across the entire dataset using z-score normalization (subtracting the dataset mean and dividing by the standard deviation).
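The preprocessing steps above can be sketched as follows; this is a minimal NumPy illustration of the channel stacking, center crop, and dataset-level z-score normalization, not the authors' code (function and parameter names are hypothetical):

```python
import numpy as np

def preprocess_slice(channels, mean, std, crop=66):
    """Stack the six DWI-derived channels (ADC, b0, b100, b400, b1000, b1600),
    center-crop to crop x crop pixels, and z-score normalize using statistics
    computed over the entire dataset. `channels` is a list of six 144x144
    arrays for one slice."""
    x = np.stack(channels, axis=0)               # shape (6, 144, 144)
    h, w = x.shape[1:]
    top, left = (h - crop) // 2, (w - crop) // 2
    x = x[:, top:top + crop, left:left + crop]   # center crop to (6, 66, 66)
    return (x - mean) / std                      # dataset-level z-score
```

The 6-channel stack replaces the standard 3-channel RGB input that pretrained CNNs expect, so the first convolutional layer must be adapted accordingly.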
Data splitting: The 427 patients were divided into three sets: a training set of 271 patients (3,692 slices), a validation set of 48 patients (654 slices), and a test set of 108 patients (1,486 slices), following a 64%/11%/25% split. The ratio of cancer-positive to cancer-negative patients was kept roughly similar across all three sets. Critically, the test set was never used during training or fine-tuning, ensuring unbiased performance evaluation.
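A patient-level stratified split of this kind can be sketched as below; a minimal illustration under the stated 64%/11%/25% fractions (the helper name and seed are hypothetical), with the key point being that splitting happens per patient, not per slice, so no patient's slices leak across sets:

```python
import random

def split_patients(patients, labels, fracs=(0.64, 0.11, 0.25), seed=0):
    """Split patient IDs into train/validation/test while keeping the
    positive:negative ratio roughly constant across the three sets."""
    rng = random.Random(seed)
    pos = [p for p, y in zip(patients, labels) if y == 1]
    neg = [p for p, y in zip(patients, labels) if y == 0]
    sets = ([], [], [])
    for group in (pos, neg):            # stratify: split each class separately
        rng.shuffle(group)
        n = len(group)
        cut1 = round(fracs[0] * n)
        cut2 = cut1 + round(fracs[1] * n)
        sets[0].extend(group[:cut1])    # train
        sets[1].extend(group[cut1:cut2])  # validation
        sets[2].extend(group[cut2:])    # test
    return sets
```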
CNN architecture: The authors built a 41-layer deep ResNet for slice-level classification. The architecture begins with a 2D convolutional layer with a 7x7 filter, followed by a 3x3 max pooling layer. It then uses two sets of residual blocks: ResNet Block 1 consists of 3-layer bottleneck blocks with filter sizes 64, 64, and 256, stacked 4 times. ResNet Block 2 uses filter sizes 128, 128, and 512, stacked 9 times. The network ends with 2D average pooling (7x7), a dropout layer (rate 0.90), and a 1000-node fully connected layer that yields the two probabilistic class outputs (PCa vs. non-PCa).
Pre-activated residual blocks: Rather than using the original ResNet structure (convolution, then batch normalization and ReLU), the authors implemented fully pre-activated residual blocks where batch normalization and ReLU come before the convolution layers. This design prevents gradient vanishing even when weights become very small. The 3-layer "bottleneck" building blocks were chosen over 2-layer blocks because they significantly reduce training time without sacrificing performance.
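The ordering difference can be made concrete with a toy sketch. The snippet below only illustrates the pre-activation structure (BN and ReLU before each convolution, identity shortcut added unchanged); the `batch_norm` and `conv` functions are simplified stand-ins, not the authors' implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x):
    # Stand-in for batch normalization: standardize the activations
    return (x - x.mean()) / (x.std() + 1e-5)

def conv(x, w):
    # Stand-in for a 2D convolution (an elementwise scale keeps the sketch short)
    return x * w

def preactivated_bottleneck(x, w1, w2, w3):
    """Fully pre-activated 3-layer bottleneck block: BN and ReLU come BEFORE
    each convolution, and the identity shortcut is added unchanged, so the
    skip path always passes gradients through even when weights are tiny."""
    out = conv(relu(batch_norm(x)), w1)    # 1x1 conv (e.g. 64 filters)
    out = conv(relu(batch_norm(out)), w2)  # 3x3 conv (e.g. 64 filters)
    out = conv(relu(batch_norm(out)), w3)  # 1x1 conv (e.g. 256 filters)
    return x + out                         # identity shortcut
```

In the original (post-activation) ResNet, BN and ReLU instead follow each convolution, which places a nonlinearity on the shortcut path's output and can attenuate gradients.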
Training details: Stochastic Gradient Descent (SGD) was used with an initial learning rate of 0.001, reduced by a factor of 10 when the model plateaued. Batch size was set to 8, weight decay to 0.000001, and momentum to 0.90. Binary cross entropy served as the loss function; note that the dataset is extremely imbalanced at the slice level, with only 1-3 of an average 14 slices per patient containing tumor.
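For reference, binary cross entropy over a batch of slice-level predictions can be written as below; a minimal NumPy illustration, not the authors' code (the clipping epsilon is an assumption for numerical stability):

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    """Mean binary cross entropy: p are predicted probabilities of the
    positive class, y are the binary ground-truth labels."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```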
Stacked generalization: Because CNN training involves randomness (random weight initialization, for instance), each trained CNN captures slightly different features. The authors used stacked generalization, an ensemble technique that combines predictions from five individually trained CNNs. Five was selected as the optimal number based on validation performance. Increasing beyond five led to overfitting due to the limited patient-level sample size (48 validation patients). The stacked approach significantly improved patient-level AUC from 0.71 (single CNN) to 0.84 (five CNNs), with a 2-tailed P-value of 0.048.
Converting slice-level CNN outputs into a meaningful patient-level prediction is one of the paper's core contributions. Rather than using a simple voting or averaging strategy (as in prior work), the authors designed a multi-step statistical feature extraction pipeline that captures richer information from the distribution of slice-level probabilities.
Probability filtering: Each of the five CNNs produces two probability sets for every patient: one for the PCa class and one for the non-PCa class. From each set, only the top five probabilities exceeding a threshold of 0.74 were retained. This threshold was determined via grid search on the validation set, ensuring that low-confidence slice predictions did not introduce noise into the patient-level classifier.
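The filtering step can be sketched in a few lines; a minimal NumPy illustration (the function name is hypothetical), applied to each probability set separately:

```python
import numpy as np

def top_confident_probs(slice_probs, threshold=0.74, top_k=5):
    """Keep at most the top_k slice-level probabilities that exceed the
    threshold (0.74 came from a grid search on the validation set),
    sorted from highest to lowest."""
    kept = np.asarray(slice_probs, dtype=float)
    kept = kept[kept > threshold]          # drop low-confidence slices
    return np.sort(kept)[::-1][:top_k]     # top_k highest probabilities
```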
First-order statistical features: From each filtered probability set, nine statistical features were extracted: mean, standard deviation, variance, median, sum, skewness, kurtosis, range, and one class-specific extreme (the minimum for the non-PCa set, the maximum for the PCa set). This produced 90 features per patient (9 features x 2 classes x 5 CNNs). A decision tree-based feature selector, trained with 10-fold cross-validation on the validation set, then identified the 26 most important features.
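The nine statistics per probability set can be sketched as follows; a minimal NumPy illustration (function name and the moment-based skewness/kurtosis formulas are my assumptions, not necessarily the authors' exact definitions):

```python
import numpy as np

def first_order_features(p, use_max):
    """Nine first-order statistics from one filtered probability set.
    use_max selects the class-specific extreme: the maximum for the PCa
    set, the minimum for the non-PCa set."""
    p = np.asarray(p, dtype=float)
    mu, sd = p.mean(), p.std()
    c = p - mu
    skew = (c ** 3).mean() / (sd ** 3 + 1e-12)   # third standardized moment
    kurt = (c ** 4).mean() / (sd ** 4 + 1e-12)   # fourth standardized moment
    extreme = p.max() if use_max else p.min()
    return [mu, sd, sd ** 2, np.median(p), p.sum(),
            extreme, skew, kurt, p.max() - p.min()]
```

Calling this for both classes across all five CNNs yields the 90-dimensional patient-level feature vector described above.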
Random Forest classifier: The selected 26 features were fed into a Random Forest classifier for final patient-level classification. The Random Forest was trained and fine-tuned using 10-fold cross-validation exclusively on the validation set. On a single Nvidia Titan X GPU with an Intel i7 CPU and 32 GB RAM, training all five CNNs took 6 hours, training the Random Forest took less than 10 seconds, and testing all 108 patients took under 1 minute.
Performance was evaluated using AUC of the receiver operating characteristic (ROC) curve on the completely held-out test set of 108 patients (1,486 slices). AUC was chosen as the primary metric because the dataset is extremely unbalanced, with only 1-3 slices per patient containing tumor out of an average of 14 total slices.
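AUC's robustness to imbalance follows from its pairwise interpretation: it equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one, independent of class frequencies. A minimal illustration of this Mann-Whitney formulation (not the evaluation code used in the paper):

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs where the
    positive is scored higher; ties count half. This is equivalent to
    the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```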
Slice-level results: Each of the five individually trained CNNs was evaluated separately. CNN1 and CNN2 both achieved an AUC of 0.87 (95% CI: 0.84-0.90), CNN3 achieved 0.86 (95% CI: 0.83-0.89), and CNN4 and CNN5 each achieved 0.85 (95% CI: 0.82-0.88). The consistency across all five CNNs (AUC range of 0.85-0.87) demonstrates the stability of the architecture despite random initialization differences.
Patient-level results: The Random Forest classifier, using statistical features extracted from the stacked CNN outputs, achieved a patient-level AUC of 0.84 (95% CI: 0.76-0.91). This result was obtained on a fully independent test set that was never exposed to the model during training or fine-tuning, making it a more reliable estimate than cross-validation-based results reported in comparable studies.
The authors emphasize that their slice-level AUC of 0.87 surpasses Liu et al.'s AUC of 0.84 (which was measured on an augmented test set and only on slices with biopsy locations). Their patient-level AUC of 0.84 matches Wang et al.'s result, but Wang et al. used cross-validation on only 17 patients per fold rather than an independent test set of 108 patients. The test set size and independence from training data make this pipeline's performance estimates substantially more robust.
Selection bias: The dataset is inherently biased because it includes only patients who were sent for MRI due to clinical suspicion of prostate cancer (e.g., elevated PSA). This means the cohort does not reflect the general population, and the model's performance may differ when applied to lower-risk screening populations or patients without prior clinical indication.
Label quality issues: The ground truth labels carry a structural inconsistency. Positive slices (those containing cancer) are labeled based on pathology reports from targeted biopsies. However, negative slices are labeled based on the absence of biopsy at that location, as determined by radiologist reports. This means some "negative" slices might actually contain cancer that was missed or not biopsied. This labeling asymmetry could affect both training and evaluation accuracy.
Single-center design: All data were acquired on a single Philips Achieva 3T scanner at one institution (Sunnybrook Health Sciences Centre). The pipeline has not been validated on data from different scanners, different field strengths, or different institutions. External validation across multiple sites and scanner vendors is essential to confirm that the model generalizes beyond the specific imaging conditions under which it was trained.
Limited user intervention still required: While the pipeline eliminates the need for manual ROI annotation, it does require a user to indicate the first and last MRI slice containing the prostate gland. Although this is a minimal intervention compared to drawing lesion contours, it is not fully automated and introduces a small dependency on human input.
The authors outline several avenues for extending this work. The most immediate is the exploration of 3D CNN architectures that could process entire DWI volumes rather than individual 2D slices. Since prostate tumors span multiple adjacent slices, a 3D approach could capture spatial continuity that the current 2D slice-by-slice analysis misses. This could improve both slice-level and patient-level performance by incorporating volumetric context.
Recurrent neural networks (RNNs): The authors also suggest investigating recurrent neural networks, particularly for modeling the sequential nature of lesions across neighboring slices. An RNN or LSTM architecture could learn temporal or spatial dependencies between consecutive slices, potentially improving the aggregation of slice-level predictions into patient-level diagnoses without requiring the separate statistical feature extraction step.
External validation: As noted in the limitations, the pipeline was developed and tested on data from a single institution with one scanner model. Multi-site validation studies using data from different MRI vendors, field strengths, and patient populations would be necessary before clinical deployment. This is a common bottleneck for medical imaging AI tools, where performance often degrades when models are applied to data that differ from the training distribution.
Clinical integration: The ROI-free design of this pipeline is a practical advantage for potential clinical integration, as it reduces dependence on expert annotation. However, prospective studies comparing the pipeline's diagnostic accuracy against radiologist performance in real-time clinical workflows would be needed to demonstrate clinical utility. The sub-one-minute inference time for 108 patients suggests that computational speed would not be a barrier to clinical adoption.