A Narrative Review of Artificial Intelligence in MRI-Guided Prostate Cancer Diagnosis


Plain-English Explanations
Pages 1-3
Why MRI-Guided Prostate Cancer Diagnosis Needs AI

Prostate cancer (PCa) is the second most common cancer in men worldwide, with approximately 1.4 million new cases diagnosed annually. Clinically significant prostate cancer (csPCa) is associated with rapid progression and higher mortality, making early detection essential. Performing a prostate MRI before biopsy has been shown to reduce unnecessary procedures, overdiagnosis, and overtreatment while improving csPCa detection. As a result, the European Association of Urology (EAU), the National Institute for Health and Care Excellence (NICE), and the American Urological Association (AUA) have all incorporated multi-parametric MRI (mpMRI) into their clinical guidelines as a key element of PCa diagnosis.

The variability problem: Despite the central role of prostate MRI, its positive predictive value (PPV) is limited by false-positive rates of up to 50%. Patient motion artifacts, rectal gas, body habitus, and hip prostheses all negatively affect MRI quality. Differences in magnet strength, coil quality, and pulse sequence parameters introduce further variability. The PI-RADS scoring system (versions 2 and 2.1) was created to standardize both acquisition and interpretation, but studies show that csPCa detection accuracy is significantly influenced by the interpreting radiologist's expertise, resulting in considerable inter-reader variability.

Scope of this review: This narrative review explores AI techniques, particularly machine learning (ML) and deep learning (DL), in three specific areas of MRI-guided PCa diagnosis: (1) standardizing and improving MRI quality, (2) enhancing csPCa detection while minimizing unnecessary interventions, and (3) reducing inter-reader variability. The authors note that in a paired multi-center retrospective confirmatory study, an AI-based system outperformed radiologists in detecting Gleason grade group 2 or higher cancers, with AUROC values of 0.91 for the AI system versus 0.86 for radiologists.

TL;DR: PCa affects 1.4 million men annually. MRI false-positive rates can reach 50%, and inter-reader variability remains a major problem. This review examines how AI can standardize MRI quality, improve csPCa detection, and reduce diagnostic inconsistency across radiologists.
Pages 3-4
Literature Search Strategy and Study Selection

The authors conducted a structured literature search using PubMed and Scopus, covering publications from January 2010 through February 2025. Search terms included combinations of prostate cancer (PCa, csPCa), magnetic resonance imaging (MRI, mpMRI, bpMRI), artificial intelligence (AI, machine learning, deep learning, radiomics), and diagnostic concepts (lesion detection, image quality, PI-RADS, interpretation variability).

Inclusion criteria: The review included peer-reviewed, full-text English-language articles involving human participants diagnosed with or clinically suspected of having PCa/csPCa. Studies had to utilize mpMRI or bi-parametric MRI (bpMRI), confirm csPCa through pathological reference standards (biopsy or radical prostatectomy), and explicitly focus on AI applications in MRI-guided diagnosis. This covered AI-driven enhancements in MRI acquisition quality, AI-based assessment and standardization of image quality, AI performance metrics in lesion detection and PI-RADS scoring, and AI-assisted reduction in inter-reader variability.

Exclusion criteria: Non-human studies (animal models, phantoms), conference abstracts, commentaries, editorials, letters, case reports, and articles without direct clinical relevance were excluded. Reference lists from included studies were reviewed to identify additional relevant articles. Each selected article underwent critical analysis and was categorized according to core themes, with particular attention to high-impact journals and methodologically rigorous studies.

Given the narrative approach of this review rather than a systematic review, the authors did not include a PRISMA-style flowchart but provided detailed descriptions of their literature selection methodology to ensure transparency and reproducibility.

TL;DR: Structured search of PubMed and Scopus (January 2010 to February 2025) using AI and prostate MRI terms. Included only peer-reviewed human studies with pathological confirmation of csPCa. This is a narrative review, so no PRISMA flowchart was included.
Pages 4-8
How AI Enhances Image Quality and Reduces Scan Times

Deep learning reconstruction: AI integration in prostate MRI scans leads to significantly reduced acquisition times compared to traditional techniques while also improving image quality. DL reconstruction algorithms effectively reduce noise and minimize Gibbs ringing artifacts. The longest portions of MRI acquisition time are dedicated to T2-weighted imaging (T2WI) and diffusion-weighted imaging (DWI) sequences, so efforts to shorten scan duration have primarily targeted these two sequences. In one retrospective study of 30 patients, DL-reconstructed T2WI images outperformed conventional MRI in lesion detectability, diagnostic confidence, and overall image quality.

Synthesizing contrast sequences: Huang et al. developed a DL model (pix2pix algorithm) to synthesize dynamic contrast-enhanced MRI (DCE-MRI) sequences from non-contrast MRI, including T1WI, T2WI, DWI, and apparent diffusion coefficient (ADC) maps. The simulated DCE-MRI showed high similarity to original sequences, with excellent agreement with radiologists' PI-RADS scores. Notably, 34 out of 323 patients were upgraded from PI-RADS 3 to PI-RADS 4 when simulated DCE-MRI was added to bpMRI. Ueda et al. demonstrated that DL-reconstructed DWI images with b-values from 1000 to 5000 s/mm² exhibited significantly higher signal-to-noise ratio (SNR) and contrast-to-noise ratio (CNR) compared to non-DL-reconstructed DWI.
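
SNR and CNR are simple region-based image metrics. The sketch below shows how they are typically computed from region-of-interest (ROI) statistics; the pixel values are invented for illustration and are not data from the study:

```python
from statistics import mean, pstdev

# Hypothetical pixel intensities sampled from regions of interest (ROIs);
# these numbers are invented for illustration, not data from the study.
lesion = [180.0, 175.0, 190.0, 185.0]       # signal ROI (e.g., a lesion)
background = [120.0, 118.0, 122.0, 121.0]   # reference tissue ROI
noise = [4.0, -3.0, 5.0, -2.0, 1.0]         # background/air noise ROI

# Signal-to-noise ratio: mean signal over the noise standard deviation.
snr = mean(lesion) / pstdev(noise)

# Contrast-to-noise ratio: absolute difference of the two tissue ROI
# means, again normalized by the noise standard deviation.
cnr = abs(mean(lesion) - mean(background)) / pstdev(noise)

print(f"SNR = {snr:.1f}, CNR = {cnr:.1f}")
```

A higher noise standard deviation drags both metrics down, which is why DL denoising that narrows the noise distribution raises SNR and CNR even when mean signal is unchanged.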

Quality assessment with PI-QUAL: The PI-QUAL scoring system was developed by European prostate MRI experts to assess diagnostic quality. PI-QUAL V1 uses a 5-point scale evaluating T2WI, DWI, ADC, and DCE images. PI-QUAL V2, developed by the European Society of Urogenital Radiology (ESUR), simplifies this to a 3-point scale. AI holds significant potential for automated MRI quality assessment. Cipollari et al. demonstrated that CNNs could classify prostate MRI images at the individual slice level with high accuracy and at the sequence level with near-perfect performance. Lin et al. showed that a DL-based model achieved performance comparable to expert radiologists in classifying T2WI image quality, and that high-quality T2WI led to higher detection rates of csPCa in targeted biopsies.

Limitations of current approaches: A retrospective study of 155 patients showed that DL-reconstructed 2D turbo spin echo (TSE) images had superior SNR compared to conventional images, yet lesion detection rates were similar. Some studies found no significant differences in image quality between DL-reconstructed and conventional modalities. A study of 46 patients showed DL-reconstructed images with shorter acquisition times were of comparable quality to conventional T2WI. These mixed results highlight that while AI can improve technical image quality metrics, the clinical impact on lesion detection is not always proportional.

TL;DR: AI reduces MRI scan times and improves SNR through DL reconstruction. The pix2pix algorithm synthesized DCE-MRI from non-contrast data, upgrading 34 of 323 patients from PI-RADS 3 to 4. CNNs classify MRI quality at near-perfect accuracy. However, improved technical quality does not always translate to better lesion detection rates.
Pages 9-13
AI as a Second Reader for Prostate MRI Interpretation

The introduction of mpMRI has greatly enhanced PCa detection, shifting from traditional transrectal ultrasound (TRUS)-guided biopsies to MRI-targeted methods. Although mpMRI has only moderate specificity, large-scale studies have shown that MRI-targeted biopsies outperform systematic TRUS-guided biopsies in detecting csPCa, providing high sensitivity. The success of this diagnostic method depends on obtaining high-quality MR images, precise interpretation, accurate biopsy guidance, and thorough pathological analysis for diagnosis and Gleason grading.

AI in clinical workflow: AI tools can serve as second-opinion systems, aiding in lesion detection and classification while reducing inter-reader variability. AI-powered models can also play a role in radiologist training and certification by providing real-time feedback and pinpointing areas for improvement. In the authors' own radiology clinic, a DL-based application integrated within the picture archiving and communication system (PACS) detects and segments suspicious prostate lesions while performing PI-RADS scoring. Suspicious lesions are highlighted on the prostate gland sector map with size and volume information, contributing significantly to biopsy guidance.

ML approaches (radiomics): Radiomics involves extracting quantitative features from medical images to characterize tumor properties such as shape, texture, and intensity. The workflow consists of two phases: feature extraction (morphological features, textural attributes from the Gray Level Co-occurrence Matrix, and intensity distributions) followed by classification using Support Vector Machines (SVMs), Random Forests, or Gradient Boosting Machines. These models classify lesions by comparing features against established patterns. However, radiomics requires extensive human input during feature engineering, introducing variability and limiting reproducibility.
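
As a concrete illustration of the feature-extraction phase, the sketch below computes a Gray Level Co-occurrence Matrix (GLCM) and two classic texture descriptors (contrast and homogeneity) for a toy image patch. This is a generic illustration of the radiomics idea, not the pipeline of any cited study:

```python
from collections import Counter

def glcm_features(patch):
    """Horizontal-offset GLCM reduced to two common radiomics texture
    features: contrast (intensity jumps) and homogeneity (smoothness)."""
    # Count co-occurrences of gray-level pairs (i, j) for horizontally
    # adjacent pixels (offset dx=1, dy=0).
    pairs = Counter()
    for row in patch:
        for a, b in zip(row, row[1:]):
            pairs[(a, b)] += 1
    total = sum(pairs.values())
    # Normalize counts to joint probabilities, then reduce to scalars.
    contrast = sum(c / total * (i - j) ** 2 for (i, j), c in pairs.items())
    homogeneity = sum(c / total / (1 + (i - j) ** 2)
                      for (i, j), c in pairs.items())
    return contrast, homogeneity

# Toy 4x4 patches with gray levels 0..3.
smooth = [[0, 0, 1, 1],   # uniform regions: low contrast, high homogeneity
          [0, 0, 1, 1],
          [2, 2, 3, 3],
          [2, 2, 3, 3]]
noisy = [[0, 3, 0, 3],    # alternating levels: high contrast, low homogeneity
         [3, 0, 3, 0],
         [0, 3, 0, 3],
         [3, 0, 3, 0]]

print(glcm_features(smooth))
print(glcm_features(noisy))
```

Scalars like these, pooled over a segmented lesion, are the "hand-crafted features" that an SVM, Random Forest, or Gradient Boosting Machine then classifies; the segmentation and feature choices are where the human-input variability enters.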

DL approaches (CNNs): Convolutional Neural Networks eliminate the need for manual feature selection by learning patterns directly from raw pixel-level data. Initial layers detect basic features like edges and shapes, intermediate layers capture textures and tumor boundaries, and deeper layers identify the likelihood of clinical significance. Chen et al. introduced a multimodal DL nomogram integrating clinical variables, PI-RADS scores, and radiomic features from bpMRI, achieving an AUC of 0.986 in the training set and 0.965 in the testing set for predicting csPCa in gray-zone PSA patients. Hybrid models combining radiomic features, clinical data, and DL approaches have demonstrated superior performance compared to individual modalities.
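
The "initial layers detect edges" intuition can be demonstrated with a single convolution: a small vertical-edge kernel responds strongly at intensity boundaries and not at all in flat regions. A minimal pure-Python sketch (illustrative only; a real CNN learns its kernels from data rather than using a fixed one):

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation, the core operation of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# A Sobel-like vertical-edge kernel: large response where intensity
# changes left-to-right, zero response in uniform regions.
edge_kernel = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

# Toy 5x5 "image": dark left half, bright right half -> a vertical edge.
image = [[0, 0, 9, 9, 9]] * 5

response = conv2d(image, edge_kernel)
print(response)  # strong activations at the dark/bright boundary, 0 elsewhere
```

Stacking many such learned filters, with nonlinearities and pooling between them, is what lets deeper layers combine edge responses into textures, boundaries, and ultimately lesion-level predictions.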

TL;DR: AI serves as a second reader in prostate MRI, performing lesion detection, segmentation, and PI-RADS scoring within PACS. Radiomics-based ML uses hand-crafted features (SVMs, Random Forests), while CNNs learn directly from pixel data. Chen et al.'s hybrid DL nomogram achieved AUC of 0.986 (training) and 0.965 (testing) for csPCa prediction.
Pages 14-17
How Well Does AI Actually Detect Prostate Cancer on MRI?

Systematic review findings: A systematic review of 12 studies evaluating ML models for detecting csPCa found pooled AUC of 0.85 (95% CI: 0.79-0.91) using biopsy as the reference standard and 0.88 (95% CI: 0.76-0.99) using radical prostatectomy specimens. Notably, non-DL methods outperformed DL-based models (pooled AUC 0.90 vs. 0.78). A separate review of 29 studies comparing ML and DL techniques on bpMRI scans found no clear performance advantage between the two approaches, with detection rates and tumor identification comparable to trained radiologists.
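
AUC/AUROC figures like these summarize ranking performance: the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative one. A minimal sketch of that rank-based (Mann-Whitney) interpretation, using made-up scores and labels:

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative
    (ties count half) - the Mann-Whitney interpretation of AUROC."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores (e.g., predicted csPCa probability) paired
# with invented biopsy-confirmed labels (1 = csPCa, 0 = benign).
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

print(f"AUROC = {auroc(scores, labels):.2f}")
```

Because AUROC depends only on score ordering, it is threshold-free, which is also why a high AUROC can coexist with poor specificity at any single operating point.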

Key individual studies: Yu et al. developed a DL-based AI-assisted PI-RADS system that outperformed 70% of radiologists. Hosseinzadeh et al. demonstrated 87% sensitivity for PI-RADS 4 or higher lesions with zonal segmentation and the largest training set (1586 scans). Khosravi et al. achieved AUCs of 0.89 for distinguishing cancerous from benign cases and 0.78 for high-risk versus low-risk disease. Winkel et al. showed that incorporating AI improved radiologists' AUC from 0.84 to 0.88 in 100 patients, with inter-reader agreement (Fleiss' kappa) improving from 0.22 to 0.36 and reading times reduced by 21%.
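
Fleiss' kappa, the agreement statistic Winkel et al. report, measures how much multiple readers agree beyond what chance would produce. A self-contained sketch of the standard formula; the rating matrix below is invented for illustration:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for an N-subjects x K-categories count matrix,
    where ratings[i][k] = number of raters assigning subject i to k."""
    n_raters = sum(ratings[0])   # raters per subject (assumed constant)
    n_subjects = len(ratings)
    # Mean per-subject observed agreement P_i.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_subjects
    # Chance agreement P_e from overall category proportions.
    totals = [sum(row[k] for row in ratings) for k in range(len(ratings[0]))]
    grand = n_subjects * n_raters
    p_e = sum((t / grand) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Invented example: 4 lesions, each rated by 3 readers into 2 categories
# (e.g., "PI-RADS >= 4" vs. "PI-RADS < 4").
ratings = [[3, 0],   # all three readers agree
           [0, 3],   # all three readers agree
           [2, 1],   # partial disagreement
           [1, 2]]   # partial disagreement
print(f"kappa = {fleiss_kappa(ratings):.2f}")
```

Kappa of 0 means chance-level agreement and 1 means perfect agreement, so the reported improvement from 0.22 to 0.36 moves readers from "slight" toward "fair" agreement on the usual interpretive scale.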

Large-scale validation (PI-CAI): The international PI-CAI study analyzed over 10,000 MRI examinations and found that an AI system achieved an AUROC of 0.91 in detecting Gleason grade group 2 or higher cancers, compared to 0.86 for a pool of 62 radiologists from 45 centers across 20 countries. The AI system reduced false-positive results by 50.4% and identified 20% fewer indolent cancers. However, it did not achieve non-inferiority compared to routine clinical reports in real-world multidisciplinary practice, underscoring the gap between controlled reading conditions and clinical workflows.

Commercial AI tools: A commercial AI software (mdprostate), integrated into PACS, achieved an AUC of 0.803 for detecting PCa of any grade in 123 patients. At a PI-RADS 4 or higher threshold, it reported 85.5% sensitivity and 63.2% specificity. However, at lower thresholds (PI-RADS 2 or higher), specificity dropped to just 5.9-7.5%, raising concerns about false positives in routine screening. Zhao et al. developed multicenter models from seven hospitals, with their integrated PIDL-CS model achieving AUC up to 0.914. Karagoz et al. achieved AUROC of 0.888-0.889 on external validation data using the PI-CAI dataset.

TL;DR: Pooled AUC for ML models: 0.85 (biopsy reference) and 0.88 (prostatectomy reference). The PI-CAI study (10,000+ exams) showed AI AUROC of 0.91 vs. 0.86 for 62 radiologists, with 50.4% fewer false positives. AI improved inter-reader agreement from kappa 0.22 to 0.36 and cut reading times by 21%. Commercial tools reached AUC 0.803, but specificity at low thresholds can drop below 8%.
Pages 17-20
Key Challenges in Deploying AI for Prostate MRI

MRI quality control gaps: The current state of AI in prostate MRI quality control remains in its early stages. DL-based reconstruction algorithms for accelerating MRI acquisition have been restricted to particular modalities or a single imaging plane, which may result in volume averaging and hinder visual assessment of anatomical structures. Most methods have been developed and tested using images from the same center, scanner, or protocol, leading to inconsistent performance across different distributions. The PI-QUAL scoring system itself includes subjective components, and moderate agreement among readers with varying experience has been reported for PI-QUAL V2.

Cohort selection and annotation bias: Selecting the appropriate cohort and ensuring accurate segmentation are critical yet challenging steps. Lesion contouring is subjective and strongly influenced by the annotator's expertise. In a systematic review of 25 studies, private databases were unicentric in ten studies and multicentric in nine, with training datasets ranging from 78 to 2,170 patients (median: 637, mean: 724) and testing datasets from 41 to 1,002 patients (median: 333, mean: 365). The small size and limited diversity of some cohorts can lead to biases affecting generalizability.

Reference standard variability: Approximately 20% of studies used both systematic and targeted biopsies as ground truth, 16% used only MRI-guided targeted biopsies, 24% relied on targeted biopsy pathology alone, and about 32% used radical prostatectomy for the entire population or selected cases. The remaining 8% applied different ground truths to different subcohorts. This variability in reference standards makes it difficult to compare AI performance across studies.

Population bias: AI models trained on specific disease prevalences may not perform well when applied to populations with different demographics or varying definitions of PCa. According to Penzkofer et al., the benefits of MRI have been most apparent in Western populations, where secondary screening shows a prevalence of ISUP grades 2-5 ranging from 30% to 50%. The local prevalence and anatomical characteristics of PCa in different populations directly affect AI system performance, limiting generalizability.

TL;DR: Most AI models are single-center with training sets of 78-2,170 patients (median 637). Reference standards vary widely: 32% use prostatectomy, 24% targeted biopsy alone, and 20% combined approaches. PI-QUAL V2 shows only moderate inter-reader agreement. Population-specific training limits global applicability.
Pages 20-22
Standardizing AI Research Through CLAIM and Transparent Reporting

The variability in methods, data, and result reporting among AI studies has led to serious concerns about reproducibility. To address this, radiology AI experts introduced the Checklist for Artificial Intelligence in Medical Imaging (CLAIM), modeled on the Standards for Reporting of Diagnostic Accuracy Studies (STARD) guidelines, to ensure transparent and reproducible reporting in AI-based medical imaging research.

Compliance gaps: A systematic meta-analysis evaluating CLAIM adherence found that most studies provide comprehensive details on model descriptions, training methods, ground truth definitions, data partitioning, and performance metrics. However, information about the study population is reported less consistently, with fewer studies detailing eligibility criteria, demographic and clinical characteristics, or providing participant flowcharts. The tools used for annotation, assessments of inter- and intra-reader variability, external validation, ensembling techniques, and failure analyses are all infrequently reported.

Sequence and input choices: Most AI developments in prostate MRI interpretation have focused on bi-parametric MRI (bpMRI), which includes T2WI and DWI with emphasis on axial plane imaging. All models incorporate T2WI as an input. A significant number use bpMRI with various combinations of T2W, ADC, and DWI sequences, while a smaller subset uses mpMRI (adding DCE-MRI). Some networks incorporate zonal segmentations, and others integrate additional information such as lesion location or histopathological data to enhance performance.

All relevant performance metrics, including sensitivity, specificity, PPV, negative predictive value (NPV), and the frequency of false-positive results per patient, should be thoroughly disclosed. The authors note that these metrics are now being reported more consistently in recent studies, but standardization across the field remains incomplete. Most studies still lack comparative analysis on computational efficiency, such as inference speed, hardware dependencies, and scalability across institutional infrastructures.
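
All of these metrics derive from the same four confusion-matrix counts. A minimal helper showing how they relate (the counts below are illustrative, not from any cited study):

```python
def diagnostic_metrics(tp, fp, fn, tn, n_patients):
    """Standard diagnostic-accuracy metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),      # true-positive rate
        "specificity": tn / (tn + fp),      # true-negative rate
        "ppv": tp / (tp + fp),              # positive predictive value
        "npv": tn / (tn + fn),              # negative predictive value
        "fp_per_patient": fp / n_patients,  # false positives per patient
    }

# Hypothetical reading session: 200 patients, lesion-level counts.
m = diagnostic_metrics(tp=85, fp=40, fn=15, tn=160, n_patients=200)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Reporting all five together matters because PPV and false positives per patient shift with disease prevalence even when sensitivity and specificity stay fixed, which is exactly the population-bias problem discussed earlier.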

TL;DR: The CLAIM checklist, modeled on STARD, aims to standardize AI reporting in medical imaging. Most studies report model details and performance metrics, but population demographics, annotation tools, inter-reader variability, and external validation are frequently missing. Most models use bpMRI (T2WI + DWI) rather than full mpMRI.
Pages 22-24
What Needs to Happen Before AI Reaches the Clinic

Prospective validation: The vast majority of existing studies are retrospective and based on single-center or single-vendor datasets. Moving forward, robust prospective validation across diverse patient populations and institutional settings is essential. Studies must assess not only diagnostic accuracy but also AI's impact on clinical workflows, inter-reader agreement, patient outcomes, and healthcare system efficiency. The authors emphasize that large-scale benchmarking efforts like PI-CAI, PROSTATEx, and Prostate158 play a critical role in establishing external validity and mitigating bias.

Operational benchmarks: Future studies should incorporate practical deployment metrics into their evaluation protocols, including inference latency, integration cost, interface usability, and radiologist workflow impact. Comparative evidence on these operational dimensions remains scarce. Limited evidence also exists on how seamlessly AI tools integrate into routine clinical workflows, including compatibility with PACS and radiology information systems (RIS), radiologist usability, and interoperability challenges.

Addressing the specificity gap: A meta-analysis of 25 studies found AUROC values ranging from 0.573 to 0.892 at the lesion level and 0.82 to 0.875 at the patient level. While AI sensitivity often matches experienced radiologists, specificity tends to be lower, particularly for PI-RADS 3 lesions. This can increase false-positive findings and potentially lead to unnecessary biopsies, higher costs, patient anxiety, and procedural harm. Improving specificity without sacrificing sensitivity is a key research priority.

Ethical and regulatory considerations: Ensuring model explainability, performance monitoring, and regulatory compliance will be essential for gaining clinical trust. Data privacy, equity of access, and mitigation of automation bias must remain central to development and deployment efforts. AI is positioned not as a replacement for radiologists but as an augmentative tool supporting consistency, reproducibility, and diagnostic confidence. Standardized evaluation frameworks like CLAIM and PI-QUAL should be routinely applied to ensure transparent reporting and replicability.

TL;DR: Prospective, multicenter validation is the top priority. AI specificity lags behind sensitivity, especially for PI-RADS 3 lesions, risking unnecessary biopsies. Future studies need operational benchmarks (inference speed, PACS integration, cost) alongside diagnostic metrics. Large public datasets (PI-CAI, PROSTATEx, Prostate158) are critical for reducing bias and establishing generalizability.
Citation: Alis D, Onay A, Colak E, Karaarslan E, Bakir B. Open access, 2025. Available at PMC: PMC12154491. DOI: 10.3390/diagnostics15111342. License: CC BY.