Prostate cancer (PCa) is the second most common cancer in men and the fifth leading cause of cancer death globally. Current diagnostic workflows rely on multiparametric MRI (mpMRI) for pelvic imaging, PET/CT for detecting lymph node and distant metastases, and androgen deprivation therapy (ADT) as frontline treatment for high-risk patients. However, all of these depend heavily on subjective specialist interpretation, making them vulnerable to inter-reader variability, unnecessary biopsies, and inconsistent risk stratification.
Deep learning (DL), a subset of machine learning, can automatically learn hierarchical data representations from complex, unstructured inputs such as medical images. Unlike traditional ML approaches, which depend on extensive manual feature engineering, DL algorithms learn discriminative features directly from raw data and typically improve as more training data become available. This makes them well suited for tasks such as prostate segmentation, cancer detection, and treatment-response prediction, where subtle patterns in imaging data can be difficult even for experienced radiologists to identify consistently.
This review set out to analyze the current state of DL-based prostate cancer diagnosis across six key clinical areas: MR-based prostate reconstruction, PCa detection and stratification, PCa reconstruction, PET/CT diagnosis, ADT optimization, and prostate biopsy assistance. The authors searched PubMed and Google Scholar in October 2023 for English-language articles published within the past five years that described DL methods in these domains, including details on datasets, segmentation approaches, and validation strategies.
From an initial pool of 784 articles, 64 met the inclusion criteria. These were distributed across the six focus areas: 21 studies on prostate reconstruction, 22 on PCa detection and stratification, 6 on PCa reconstruction, 7 on PET/CT, 2 on ADT, and 6 on prostate biopsy. Descriptive statistics were generated using SPSS version 26.0, including normality testing via the Kolmogorov-Smirnov test, with results presented as means with standard deviations or medians with ranges depending on the data distribution.
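The descriptive workflow above (normality check, then mean with SD or median with range) can be sketched outside SPSS. The Python helper below is an illustrative re-implementation, not the authors' code: the `summarize` function and its 0.15 D-statistic cutoff are hypothetical choices for demonstration only.

```python
import math
import statistics

def ks_normal_D(values):
    """One-sample Kolmogorov-Smirnov D statistic against a normal
    distribution fitted to the sample (Lilliefors-style).
    Illustrative sketch, not the review's SPSS procedure."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    xs = sorted(values)
    n = len(xs)
    cdf = lambda x: 0.5 * (1 + math.erf((x - mu) / (sd * math.sqrt(2))))
    D = 0.0
    for i, x in enumerate(xs):
        F = cdf(x)
        # Compare the fitted CDF to the empirical CDF on both sides of the step.
        D = max(D, abs(F - i / n), abs(F - (i + 1) / n))
    return D

def summarize(values, normal_threshold=0.15):
    """Report mean±SD if approximately normal, else median with range.
    The 0.15 cutoff is a hypothetical stand-in for a formal p-value test."""
    if ks_normal_D(values) < normal_threshold:
        return ("mean±SD", statistics.mean(values), statistics.stdev(values))
    return ("median(range)", statistics.median(values),
            (min(values), max(values)))
```

In practice one would use a proper significance test (e.g. `scipy.stats.kstest`) rather than a fixed D cutoff; the sketch only shows the shape of the decision rule.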
Twenty-one studies investigated DL algorithms for 3D prostate segmentation on MRI, a critical preprocessing step for cancer detection and treatment planning. Local and open datasets were used in 13 and 14 studies, respectively, with Promise12 being the most frequently used open dataset (11 studies), followed by ProstateX (3 studies). Nearly all datasets used 3 T magnetic field strength (18 studies), with multi-vendor scanners employed in 16 studies. The median number of cases per study was 146 (range: 25 to 648), and T2-weighted images (T2WIs) were the dominant input sequence, used in all 21 studies.
Architectures and performance: The Dice similarity coefficient (DSC) ranged widely across the 21 studies, from 0.85 to 0.9865. At the lower end, da Silva et al. achieved a DSC of 0.85 using a coarse-to-fine segmentation method combining a deep convolutional neural network (DCNN) with the particle swarm optimization algorithm. Wang et al. achieved 0.86 with a 3D deeply supervised densely fully convolutional network (DSD-FCN), significantly outperforming U-Net (0.836, p = 0.023) and V-Net (0.838, p = 0.018). At the higher end, Yan et al. reported the best DSC of 0.9865 using PSPNet (Pyramid Scene Parsing Network), which substantially outperformed FCN (0.8924) and U-Net (0.9107).
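Since the DSC is the headline metric throughout these segmentation studies, a minimal sketch of how it is computed from two binary masks may be useful. The `dice` helper below is illustrative only and does not come from any of the cited papers.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks,
    given as flattened 0/1 sequences: 2|A∩B| / (|A| + |B|)."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    # Convention: two empty masks are treated as perfect agreement.
    return 2 * inter / total if total else 1.0
```

For 3D MRI volumes the same formula is applied voxel-wise after flattening the predicted and reference masks.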
Notable approaches: Several architectures pushed segmentation accuracy above DSC 0.92. Bardis et al. used three sequential hybrid 3D/2D U-Nets for whole prostate (DSC 0.940), transitional zone (0.910), and peripheral zone (0.774) segmentation. To et al. described a 3D deep dense multi-path neural network achieving DSC 0.9511, outperforming 3D U-Net (0.9380), 2D DS-net (0.9247), and 3D MRC-net (0.9237). Meyer et al. used an anisotropic 3D multi-stream CNN with multiplanar T2 images, showing statistically significant DSC improvements at the prostate base (0.906 vs. 0.898) and apex (0.901 vs. 0.888) compared to monoplanar approaches.
Multimodal inputs: While most studies relied on T2WI alone, Nai et al. evaluated monomodal DenseVNet versus multimodal HighRes3DNet and ScaleNet using T2WI, DWI, and ADC. The multimodal HighRes3DNet achieved the highest DSC of 0.890, with statistically significant improvements in zonal reconstruction of the peripheral and central zones compared to the monomodal approach. However, the difference was not significant for whole-prostate isolation, suggesting multimodal inputs primarily benefit zonal-level segmentation.
Twenty-two studies explored DL for prostate cancer detection and risk stratification on MRI. These used local datasets (16 studies) and open datasets (11 studies), with ProstateX being the most common open resource. The median number of cases was 344 (range: 37 to 2,170). Multi-vendor scanners were used in 18 studies, and 3 T field strength dominated (19 studies). Input sequences varied considerably: T2WI was used in 17 studies, ADC in 19, DWI in 13, and DCE in 5. Biopsy served as the reference standard in 16 studies, while whole-mount histopathology was used in 6.
AUC performance range: The reported AUC values spanned from 0.645 to 0.97, illustrating the wide performance gap across architectures and datasets. At the lower end, Ishioka et al. combined U-net with ResNet50 for pelvic structure differentiation and cancer detection, achieving AUC values of only 0.645 and 0.636 on two validation sets. At the top, Xu et al. used a ResNet-based framework to identify suspicious lesions on mp-MRI, achieving an AUC of 0.97 with an average Jaccard score of 71% for lesion segmentation agreement with radiologists.
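For context, the AUC values quoted in these studies are typically the empirical (Mann-Whitney) estimator, and a sensitivity/specificity pair corresponds to a single operating threshold on the model's scores. The helpers below are hypothetical sketches of both quantities, not code from any cited study.

```python
def auc(pos_scores, neg_scores):
    """Empirical AUC: the probability that a positive case scores
    higher than a negative one, ties counting 1/2 (Mann-Whitney)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def sens_spec(pos_scores, neg_scores, threshold):
    """Sensitivity and specificity at one fixed operating threshold."""
    sens = sum(p >= threshold for p in pos_scores) / len(pos_scores)
    spec = sum(n < threshold for n in neg_scores) / len(neg_scores)
    return sens, spec
```

This makes explicit why a model can report, say, 100% sensitivity with 79% specificity: those figures describe one chosen threshold, while the AUC summarizes ranking quality across all thresholds.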
Architectures that performed well: Pellicer-Valero et al. tested a 3D Retina U-Net combining single-stage RetinaNet and U-Net, reaching an AUC of 0.96 with 100% sensitivity and 79% specificity. Song et al. modified VGGNet into a DCNN architecture trained on T2, DWI, and ADC images from 195 patients, achieving an AUC of 0.944 with 87.0% sensitivity and 90.6% specificity. The joint model of PI-RADS v2 and their DCNN provided additional net benefits over either system alone. Saha et al. designed two parallel 3D CNNs on a large 2,137-patient cohort with external validation, reaching an AUC of 0.885, significantly outperforming Attention U-Net (0.861), nnU-Net (0.872), UNet++ (0.850), and U-SEResNet (0.500).
Transfer learning approaches: Several groups leveraged transfer learning to overcome limited dataset sizes. Chen et al. used InceptionV3 (AUC 0.81) and VGG-16 (AUC 0.83) with transfer learning for PCa detection. Zhong et al. applied ResNet with transfer learning to distinguish clinically insignificant from significant lesions, achieving AUC 0.726 compared to 0.687 without transfer learning and 0.711 for PI-RADS v2 alone, though the difference versus PI-RADS v2 was not statistically significant. Sobecki et al. extended VGG-16 to a 3D model with knowledge encoding, improving AUC from 0.82 to 0.84.
Six studies tackled the more difficult problem of segmenting the cancer itself (as opposed to the prostate organ) in 3D on MRI. This is significantly harder than prostate segmentation because tumor boundaries are often diffuse and poorly defined. Local and open datasets were each used in three studies, with ProstateX appearing in two. The median number of cases was only 129 (range: 16 to 204), reflecting the difficulty of acquiring large annotated tumor datasets. DSC values for tumor segmentation were substantially lower than for prostate organ segmentation, ranging from 0.32 to 0.892.
Low-performing results: Gunashekar et al. used a 3D U-Net with gradient-weighted class activation maps (Grad-CAM) for interpretability, but achieved a DSC of only 0.32, a score that, notably, did not differ significantly from the agreement between radiologists' manual tumor annotations. De Vente et al. used a 2D U-Net to segment lesions while encoding ISUP grade, achieving a weighted kappa of 0.446 and a DSC for clinically significant cancers of 0.370. These low scores reflect both the inherent difficulty of tumor delineation and the limitations of small single-modality datasets.
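The weighted kappa reported by de Vente et al. scores ordinal agreement so that near-misses between adjacent ISUP grades are penalized less than large errors. The review does not state their exact weighting scheme; the sketch below uses quadratic weights, a common choice for ordinal labels, and is illustrative only.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic-weighted Cohen's kappa for ordinal labels 0..n_classes-1.
    Illustrative sketch; the cited study's exact weights may differ."""
    n = len(y_true)
    # Observed joint distribution over (true, predicted) label pairs.
    O = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1 / n
    row = [sum(O[i]) for i in range(n_classes)]                      # true marginals
    col = [sum(O[i][j] for i in range(n_classes)) for j in range(n_classes)]
    w = lambda i, j: ((i - j) / (n_classes - 1)) ** 2                # quadratic penalty
    num = sum(w(i, j) * O[i][j]
              for i in range(n_classes) for j in range(n_classes))
    den = sum(w(i, j) * row[i] * col[j]
              for i in range(n_classes) for j in range(n_classes))
    return 1 - num / den
```

A kappa of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement, which puts the reported 0.446 in the "moderate" range.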
Better-performing architectures: Alkadi et al. described a DCNN with a modified VGG16 architecture featuring an encoder-decoder structure with SoftMax and pixel-classification layers. Tested on T2 images from the I2CVB dataset, it achieved a DSC of 0.892, the highest in this category. Chen et al. proposed a multibranch UNet (MB-UNet) using T2, DWI, and ADC maps, reaching a test DSC of 0.6333 and emphasizing that DWI was the most important sequence for PCa segmentation. Lai et al. used SegNet with an encoder-decoder structure and found that combining all three sequences (T2, DWI, ADC) yielded the best DSC of 0.5273.
The stark contrast between prostate organ segmentation (DSC up to 0.9865) and tumor segmentation (DSC mostly below 0.65 for multi-sequence approaches) underscores the fundamental challenge of cancer delineation. Tumor boundaries on MRI are inherently ambiguous, and the small dataset sizes (as few as 16 cases) severely limit what DL models can learn.
Seven studies investigated DL for PET/CT-based prostate cancer diagnosis, all using institutional datasets. Six of seven were single-institution studies. The dominant radiotracer was [68Ga]Ga-PSMA-11, used in five studies, while [18F]DCFPyl and [18F]PSMA-1007 were each used in one study. The median number of cases was 193 (range: 39 to 660). Manual labeling was used in five studies and semi-automated labeling in two. All seven studies provided internal validation, and all performed testing, though only one included external testing.
Lymph node staging: Hartenstein et al. tested whether CNNs could predict 68Ga-PSMA-PET/CT lymph node status from CT images alone. The CNNs achieved AUC values of 0.95 (status) and 0.86 (balanced location, masked), compared to 0.81 for experienced radiologists. The CNNs improved by learning infiltration probabilities at different anatomical locations. Capobianco et al. developed a transfer-learning approach using 18F-FDG PET/CT training data to classify 68Ga-PSMA-11 images, achieving 80.4% average precision (CI: 71.1-87.8) for suspect uptake sites and 77% accuracy for anatomical location classification.
Lesion detection and classification: Kendrick et al. developed an automated system for identifying metastatic PCa lesions in whole-body [68Ga]Ga-PSMA-11 PET/CT, achieving patient-level accuracy, sensitivity, and positive predictive value (PPV) all exceeding 90%, with the best at 97.2%. Lesion-level PPV and sensitivity were 88.2% and 73.0%, respectively. Leung et al. built a DL and radiomics framework for lesion and patient classification using PSMA-RADS groups, with lesion-level and patient-level AUROC scores of 0.87 and 0.90, respectively. For prostate cancer classification specifically, the AUROC values reached 0.92 (lesion-level) and 0.85 (patient-level).
Tumor burden quantification: Zhao et al. used a triple-combining 2.5D U-Net to automatically characterize PCa lesions on 68Ga-PSMA-11 PET/CT for optimizing PSMA-directed radionuclide therapy. The network detected bone lesions with 99% accuracy, 99% recall, and 99% F1 score, and lymph node lesions with 94% precision, 89% recall, and 92% F1 score. Tragardh et al. developed AI segmentation achieving 79% sensitivity for prostate tumors, 79% for lymph node metastases, and 62% for bone metastases, comparable to nuclear medicine physicians (78%, 78%, and 59%, respectively).
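The lesion-level figures quoted above (precision/PPV, recall/sensitivity, F1) all derive from the same true-positive, false-positive, and false-negative counts. A minimal, generic sketch (not any cited study's evaluation code):

```python
def detection_metrics(tp, fp, fn):
    """Lesion-level precision (PPV), recall (sensitivity), and F1
    from detection counts. Illustrative helper only."""
    precision = tp / (tp + fp)           # fraction of detections that are real
    recall = tp / (tp + fn)              # fraction of real lesions detected
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Reporting all three matters because, as in Kendrick et al., PPV and sensitivity can diverge sharply (88.2% vs. 73.0%) when a detector is conservative.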
Androgen Deprivation Therapy (ADT): Only two studies met the inclusion criteria for DL in the context of ADT, both using single-institution datasets with manual labeling. Spratt et al. used digital pathology images and clinical data from 5,727 individuals across five phase III randomized trials to build an AI-based predictive model for ADT benefit. In the validation cohort (NRG/RTOG 9408, 1,594 males), the model identified 34% of patients (n = 543) as ADT-positive, for whom ADT significantly reduced distant metastasis (hazard ratio 0.34, 95% CI: 0.19-0.63, p < 0.001). For the 66% classified as ADT-negative (n = 1,051), there was no significant treatment benefit (sHR 0.92, 95% CI: 0.59-1.43, p = 0.71).
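Spratt et al.'s hazard ratios come from time-to-event modeling of the trial data. As rough intuition only: under constant hazards, a hazard ratio reduces to an incidence-rate ratio between arms. The helper below is a hypothetical illustration of that approximation with a Wald-style 95% CI, not the study's methodology.

```python
import math

def rate_ratio_hr(events_a, persontime_a, events_b, persontime_b):
    """Crude incidence-rate ratio as a constant-hazard approximation to
    a hazard ratio, with a Wald 95% CI on the log scale.
    Hypothetical teaching sketch, not a substitute for a Cox model."""
    hr = (events_a / persontime_a) / (events_b / persontime_b)
    se = math.sqrt(1 / events_a + 1 / events_b)   # SE of log rate ratio
    lo = hr * math.exp(-1.96 * se)
    hi = hr * math.exp(1.96 * se)
    return hr, (lo, hi)
```

An HR of 0.34 for the ADT-positive group thus corresponds, loosely, to roughly one third the event rate of the comparison arm over the follow-up period.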
Multimodal ADT prediction: Mobadersany et al. developed a multimodal framework combining clinical features, digitized H&E pathology, and radiology bone scans (rBS) to predict outcomes in non-metastatic castration-resistant prostate cancer (nmCRPC). Using survival convolutional neural networks (SCNNs) and the Cox proportional-hazards model (CPH), the multimodal approach improved clinical CPH prediction by 14-16% across overall survival and time to PSA progression endpoints. The improvement was statistically significant (Wilcoxon signed-rank test, p < 0.0001). However, this study included only 154 patients, creating a vast gap compared to Spratt et al.'s 5,727-patient cohort.
Prostate biopsy: Six studies explored DL for biopsy assistance, using temporal enhanced ultrasound (TeUS, 2 studies), transrectal ultrasound (TRUS, 3 studies), and MRI (1 study) as input modalities. Sedghi et al. used deep neural mapping (DNM) on TeUS data, achieving projection AUC greater than 0.8. Azizi et al. demonstrated a recurrent neural network architecture processing TeUS data from 157 individuals and 255 biopsy cores, reaching an AUC of 0.85. Van Sloun et al. used a U-net for automated real-time prostate zonal segmentation on TRUS, achieving 98% median pixel accuracy and a Jaccard index of 0.93. Soerensen et al. integrated ProGNet into clinical MR-ultrasound fusion biopsy, achieving a DSC of 0.93 prospectively, outperforming both U-Net and radiology technicians (DSC 0.90), while reducing segmentation time from 10 minutes to 35 seconds per case.
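Van Sloun et al. report a Jaccard index while most segmentation studies in this review report DSC. For binary masks the two are monotonic transforms of each other, so results can be compared directly; the converters below are a small illustrative sketch.

```python
def dice_from_jaccard(j):
    """Convert Jaccard index J = |A∩B|/|A∪B| to Dice: D = 2J / (1 + J)."""
    return 2 * j / (1 + j)

def jaccard_from_dice(d):
    """Inverse mapping: J = D / (2 - D)."""
    return d / (2 - d)
```

By this mapping, the Jaccard index of 0.93 reported by Van Sloun et al. corresponds to a DSC above 0.96, i.e. it is on the same footing as the strongest DSC results cited for organ segmentation.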
Data diversity and generalizability: The most fundamental limitation across these 64 studies is the lack of dataset diversity. Only 15 of 49 studies (31%) on MR-based prostate reconstruction and PCa diagnosis used multi-center datasets, and none of the PCa reconstruction studies did. A single field strength of 3 T dominated (86%), with combined 3 T/1.5 T data used in only 12% of studies. Castillo et al. demonstrated this problem concretely: three single-center radiomics models achieved a mean AUC of 0.75, which dropped to 0.54 when applied to external data, whereas a multi-center radiomics model achieved the same AUC of 0.75 with better generalizability.
Insufficient sample sizes: The minimum case numbers across the six focus areas were alarmingly small: 25 for prostate reconstruction, 37 for PCa detection/stratification, 16 for PCa reconstruction, 34 for PET/CT, 154 for ADT, and 157 for biopsy. Hosseinzadeh et al. found that PI-RADS-trained DL can detect and localize ISUP greater than 1 lesions accurately but requires substantially more than 2,000 training cases to match expert performance. Most studies in this review fall far short of that threshold.
Sequence selection bias: Very few studies systematically evaluated the added value of different MRI sequences. The findings that did exist were contradictory. Aldoj et al. found that T2 added nothing to PCa detection accuracy, while Wang et al. found the T2 and ADC combination was justified. Mehrtash et al. reported the best 3D-CNN accuracy using DWI at maximum diffusion factor and Ktrans, not ADC. Meanwhile, a separate meta-analysis showed ADC correlated significantly with ISUP grade, adding to the confusion about optimal input selection.
Segmentation and reference standard bias: Manual segmentation was used in 92% of all studies. While some used open datasets with pre-existing annotations, those annotations were also manually created, inheriting the same inter-observer variability. For PCa detection and reconstruction, biopsy was the reference standard in 71% of studies (20 of 28), despite biopsy sampling only a small area of the prostate gland. Alqahtani et al. found that 31.6% of patients had ISUP grade upgrades from 12-core biopsy to radical prostatectomy specimens, highlighting how biopsy-based ground truth can systematically underestimate cancer grade and extent.
Validation gaps: While all studies reported internal validation, testing was performed in only 55% of studies (35 of 64), and external validation was completed in just 10 papers. This means nearly half of the models have never been evaluated on data they were not trained on, making claims of clinical readiness premature.
Multi-center, multi-vendor datasets: The most urgent need is large, diverse, multi-center datasets that include images from different MRI scanners, field strengths (both 1.5 T and 3 T), and patient populations. The current reliance on single-center data and 3 T-only datasets means most models will likely fail when deployed in hospitals with different equipment. Sanford et al.'s approach of using deep multilevel transformation as a data-augmentation method to handle MR image heterogeneity from different sources is a step in the right direction, but systematic multi-site validation remains rare.
Better reference standards: The field needs to move away from biopsy as the primary ground truth for DL training. Whole-mount histopathology after radical prostatectomy provides far more accurate tumor delineation but was used in only 6 of 22 detection/stratification studies. Automated segmentation approaches, such as the deep learning mask (DLM) described by Bleker et al. for auto-fixed VOI placement, can reduce manual segmentation time by 97% while improving accuracy for clinically significant PCa detection.
Systematic sequence evaluation: Future studies should rigorously compare different MRI sequence combinations rather than choosing them in advance. The contradictory findings about T2, ADC, DWI, and DCE contributions suggest that optimal input configurations may vary by task (detection vs. stratification vs. reconstruction) and by anatomical zone. Bonekamp et al.'s finding that biparametric contrast-free radiomic ML had comparable but not superior performance to simple mean ADC assessment raises questions about whether complex multi-sequence models always justify their additional computational cost.
Clinical integration: The comparison between AI systems and clinicians of varying experience levels provides an important framework. Youn et al. found that the Siemens Prostate AI system outperformed only the least experienced radiologists, while experienced readers achieved significantly greater accuracy. This suggests that DL tools may find their greatest immediate clinical value as decision support for less experienced practitioners and trainees rather than as replacements for senior radiologists. Prospective, multi-site clinical trials comparing DL-augmented workflows to standard care are essential before any of these tools can be responsibly integrated into routine practice.