The Future of Breast Cancer Organized Screening Program Through Artificial Intelligence: A Scoping Review

Plain-English Explanations
Pages 1-2
Why Breast Cancer Screening Needs an AI Upgrade

Breast cancer (BC) remains the second highest cause of cancer-related death among women worldwide, with 2 million new cases recorded in 2020 and approximately 80% of patients being over the age of 50. The risk of developing breast cancer climbs steadily with age: 1.5% at age 40, 3% at age 50, and more than 4% at age 70. Projections estimate that by 2030 the global count of new cases will reach 2.7 million per year, with 0.87 million deaths annually. The estimated economic cost of all cancers from 2020 to 2050 is a staggering $25.2 trillion, and breast cancer alone accounts for 7.7% of that burden.
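
To put the 7.7% figure in absolute terms, the arithmetic is straightforward (a minimal Python sketch; the per-year average is our own illustration, not a number from the review):

```python
total_cancer_cost = 25.2e12   # projected cost of all cancers, 2020-2050 (USD)
bc_share = 0.077              # breast cancer's share of that burden

bc_cost = total_cancer_cost * bc_share
print(f"breast cancer burden 2020-2050: ${bc_cost / 1e12:.2f} trillion")    # ~$1.94 trillion
print(f"average per year over 30 years: ${bc_cost / 30 / 1e9:.0f} billion") # ~$65 billion
```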

The screening paradox: Screening mammography has been proven to reduce breast cancer mortality through early detection, and many countries have established population-based screening programs. However, current mammography screening is associated with a high rate of both false positives (unnecessary callbacks and biopsies) and false negatives (missed cancers). These diagnostic inaccuracies translate directly into patient harm: false positives cause anxiety and invasive follow-up procedures, while false negatives allow cancers to progress undetected. Furthermore, not all countries have organized, population-based screening infrastructure, leaving large populations reliant on opportunistic (non-organized) screening with fewer quality controls.

Enter artificial intelligence: The introduction of AI into medical imaging has created a new frontier in mammographic screening. The central research question driving this scoping review is whether AI, when integrated into breast cancer screening workflows, can help resolve persistent diagnostic shortcomings. Specifically, the authors ask whether AI can reduce missed cancers, cut false-positive rates, and detect tumors at earlier stages. They also assess whether AI performs differently depending on the type of screening program (organized vs. opportunistic) and how many human readers are involved.

Scope of the review: The authors searched PubMed, Web of Science, Scopus, and Embase for English-language articles published within the last 10 years, with the search updated to 28 May 2024. Using the PRISMA method, they selected studies and classified them by publication type (meta-analyses, trials, prospective, and retrospective studies). They also categorized studies by how AI was deployed: AI applied only on datasets, AI used in comparison with readers, or AI used as a support tool for readers. Quality assessment was performed using AMSTAR 2 for reviews, Cochrane for randomized trials, and Newcastle-Ottawa for observational studies.

TL;DR: Breast cancer accounts for roughly 2 million new cases per year globally and for 7.7% of the $25.2 trillion projected cost of all cancers from 2020 to 2050. Mammography screening reduces mortality but suffers from high false-positive and false-negative rates. This scoping review examines whether integrating AI into screening workflows can fix these diagnostic problems, drawing on 26 studies spanning meta-analyses, trials, and retrospective designs.
Pages 2-4
How the Evidence Was Gathered and Organized

PRISMA selection process: The review followed the PRISMA framework for article selection. After removing duplicates and screening titles and abstracts, the authors narrowed down their literature to 26 studies that met inclusion criteria. Conference proceedings, preprints, and non-peer-reviewed publications were excluded. The final set included 2 meta-analyses, 1 systematic review, 1 narrative review, 1 randomized controlled trial, 1 prospective study, and 17 retrospective studies. This hierarchy of evidence, adapted from the evidence-based pyramid model, places meta-analyses and systematic reviews at the top, followed by trials, then observational studies.

Classification by AI deployment: A particularly useful contribution of this review is its dual classification system. First, studies were organized by publication type (the traditional approach). Second, and more practically, they were organized by how AI was used within each study: (i) AI applied only on retrospective datasets without a human reader comparison; (ii) AI compared directly against human readers; and (iii) AI used as a support tool alongside human readers. This classification helps clinicians and policymakers understand not just whether AI "works," but how it integrates into real-world diagnostic workflows.
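
To make the dual classification concrete, here is one way it might be expressed in code (a hypothetical sketch; the class and field names are ours, not the authors'):

```python
from dataclasses import dataclass
from enum import Enum

class Deployment(Enum):
    STANDALONE = "AI applied only on datasets, no reader comparison"
    VS_READERS = "AI compared directly against human readers"
    READER_SUPPORT = "AI as a support tool alongside human readers"

class Screening(Enum):
    ORGANIZED = "population-based invitations, double reading"
    OPPORTUNISTIC = "self-initiated, typically single reading"
    MIXED = "multicenter data from both program types"

@dataclass
class Study:
    first_author: str
    publication_type: str  # meta-analysis, trial, prospective, retrospective
    deployment: Deployment
    screening: Screening

# Example entry: the MASAI randomized trial discussed later in the review.
masai = Study("Lang", "randomized controlled trial",
              Deployment.READER_SUPPORT, Screening.ORGANIZED)
```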

Classification by screening type: The authors also separated studies based on how patients were recruited. Some studies drew data exclusively from organized (population-based) screening programs, where women are systematically invited for mammography at regular intervals with double reading by two radiologists. Other studies used data from opportunistic (non-organized) screening, where individual women seek mammography on their own initiative, typically read by a single radiologist. A third group of multicenter studies combined data from both organized and non-organized programs. This distinction is critical because the baseline standard of care differs between these settings, which directly affects how much value AI adds.

Quality assessment: Study quality was evaluated using validated scales appropriate to each study type. Meta-analyses and systematic reviews were assessed with AMSTAR 2, randomized trials with the Cochrane Collaboration tool, and observational studies with the Newcastle-Ottawa scale. A final checklist from Tricco et al. was applied for the scoping review itself. This multi-layered quality control ensures that the review's conclusions rest on methodologically sound evidence.

TL;DR: The review used PRISMA to select 26 studies, among them 2 meta-analyses, 1 randomized trial, 1 prospective study, and 17 retrospective studies. Studies were classified both by evidence type and by how AI was deployed (standalone, vs. readers, or as reader support). A key innovation was separating results by screening type: organized (population-based with double reading) vs. opportunistic (single-reader).
Pages 3-5
What the Two Meta-Analyses Reveal About AI Accuracy and Workload

Hickman et al. meta-analysis: The first meta-analysis, conducted across 14 eligible studies (7 triage studies and 8 comparison studies, with 185,252 patients), delivered two major findings. First, AI demonstrated an effective reduction in radiologists' reading time, with the workload decrease ranging from 17% to 91% depending on the study. Second, AI-missed cancers ranged from only 0% to 7% of all cancers, meaning that even in the worst-case triage scenario, AI correctly flagged at least 93% of malignancies. The AI's pooled sensitivity was 0.75 (95% CI: 0.65-0.83), specificity was 0.90 (95% CI: 0.82-0.95), and AUC was 0.89 (95% CI: 0.84-0.98). By comparison, human readers achieved a sensitivity of 0.73 (95% CI: 0.61-0.83), specificity of 0.87 (95% CI: 0.72-0.95), and AUC of 0.85 (95% CI: 0.78-0.97).
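
For readers less familiar with these metrics, the sketch below shows how sensitivity and specificity fall out of a confusion matrix. The counts are invented and chosen to land near the pooled AI values; they are not data from the meta-analysis:

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of actual cancers that get flagged (true-positive rate)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Fraction of cancer-free exams correctly cleared (true-negative rate)."""
    return tn / (tn + fp)

# Hypothetical screening cohort: 100 cancers among 10,000 exams.
tp, fn = 75, 25        # 75 of 100 cancers flagged      -> sensitivity 0.75
tn, fp = 8910, 990     # 8,910 of 9,900 normals cleared -> specificity 0.90

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
```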

Yoon et al. meta-analysis: The second meta-analysis, covering 16 studies and 1,108,328 mammograms from 497,091 women, confirmed and extended the first. Across six reader studies on digital mammography, standalone AI achieved significantly higher AUCs than radiologists (0.87 vs. 0.81, p = 0.002). However, for historic cohort studies (where AI was tested on previously collected datasets), the difference was not statistically significant (0.89 vs. 0.96, p = 0.152). Notably, four studies on digital breast tomosynthesis (DBT) showed that AI significantly outperformed radiologists (AUC 0.90 vs. 0.79, p < 0.001). A consistent pattern emerged: AI tended to show higher sensitivity but lower specificity compared with radiologists.

Clinical interpretation: The meta-analytic evidence paints a clear picture: AI can match or exceed the diagnostic accuracy of individual human readers while dramatically cutting the time radiologists spend reviewing images. The 17-91% workload reduction is particularly significant for healthcare systems facing radiologist shortages. However, the slightly lower specificity of AI compared with human readers means AI may generate more false positives, which has implications for patient anxiety and follow-up costs. The ideal deployment, both meta-analyses suggest, is not AI replacing radiologists but AI working alongside them.

TL;DR: Two meta-analyses (covering 14 and 16 studies respectively) found that AI reduced radiologist reading workload by 17-91% while missing only 0-7% of cancers. AI achieved pooled AUCs of 0.87-0.89, matching or exceeding individual radiologists (0.81-0.85). AI tended to have higher sensitivity but slightly lower specificity than human readers.
Pages 5-8
AI Performance in Population-Based Screening With Double Reading

The MASAI randomized trial: The strongest evidence comes from the MASAI trial conducted in Sweden, a true randomized controlled trial enrolling 80,033 women aged 40-74. Women were randomly assigned to either AI-supported screening (n = 40,003) or standard double reading without AI (n = 40,030). Cancer detection rates were 6.1 per 1,000 in the AI group and 5.1 per 1,000 in the control group (ratio 1.2, 95% CI: 1.0-1.5, p = 0.052). Recall rates were 2.2% vs. 2.0%, and the false-positive rate was 1.5% in both groups. The positive predictive value (PPV) of recall was higher in the AI group (28.3% vs. 24.8%). Most importantly, the screen-reading workload was reduced by 44.3% when AI was used.
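
The headline ratio can be reproduced from the reported rates (a quick Python check; the absolute cancer counts are back-calculated from the per-1,000 rates, so treat them as approximate):

```python
ai_n, ctrl_n = 40_003, 40_030     # women randomized to each MASAI arm
ai_rate, ctrl_rate = 6.1, 5.1     # cancers detected per 1,000 screened

ai_cancers = ai_rate / 1000 * ai_n         # ~244 cancers in the AI arm
ctrl_cancers = ctrl_rate / 1000 * ctrl_n   # ~204 cancers in the control arm

print(f"detection ratio = {ai_rate / ctrl_rate:.2f}")   # ~1.20, as reported
print(f"extra cancers found ~ {ai_cancers - ctrl_cancers:.0f}")
```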

Dembrower prospective study: A prospective study of 58,344 women in Sweden compared multiple reading configurations. Double reading by one radiologist plus AI was non-inferior for cancer detection compared with double reading by two radiologists (261 [0.5%] vs. 250 [0.4%] detected cases, relative proportion 1.04, 95% CI: 1.00-1.09). Even single reading by AI alone was non-inferior (246 [0.4%] vs. 250 [0.4%], relative proportion 0.98). Triple reading by two radiologists plus AI detected the most cancers (269 [0.5%] vs. 250 [0.4%], relative proportion 1.08).

Sharma multi-vendor evaluation: A large study of 304,360 mammograms from Hungary and the UK tested AI across multiple mammography vendors. Double reading with AI showed at least non-inferior recall rate, cancer detection rate, sensitivity, specificity, and PPV for each vendor. For two systems, AI plus double reading actually achieved superior recall rate, specificity, and PPV. The study estimated that AI would increase the arbitration rate (from 3.3% to 12.3%) but could reduce the human workload by 30.0% to 44.8%.

Key pattern from organized screening: Across studies from Turkey, Norway, Denmark, Germany, Spain, Sweden, The Netherlands, and Switzerland, a consistent finding emerged. In European organized screening programs that already employ double reading, AI does not dramatically improve diagnostic performance because the baseline is already high. In the studies by Lauritzen and Leibig, AI sensitivity was actually slightly lower than that of human readers (69.7% vs. 70.8% and 84.6% vs. 87.2%, respectively). Where AI clearly adds value in these settings is workload reduction, enabling systems to maintain double-reading-level quality with fewer radiologist hours.

TL;DR: In organized screening with double reading, the MASAI trial (80,033 women) showed AI reduced workload by 44.3% while maintaining comparable detection rates (6.1 vs. 5.1 per 1,000). In Dembrower's study, one radiologist plus AI was non-inferior to two radiologists. AI's main value in organized screening is not better accuracy (the baseline is already high) but significant workload reduction of 30-44%.
Pages 8-11
AI Performance in Non-Organized Screening and Risk Prediction

AI vs. clinical risk models: Several U.S.-based studies compared AI against established clinical risk tools. Arasu et al. (13,628 patients) found that AI predicted incident cancers at 0-5 years better than the Breast Cancer Surveillance Consortium (BCSC) clinical risk model (AI AUC range 0.63-0.67 vs. BCSC AUC 0.61, p < 0.0016). Lehman et al. (57,635 patients) showed that deep learning detected 8.6 cancers per 1,000 patients screened versus 4.4 for Tyrer-Cuzick and 3.8 for the NCI BCRAT model (p < 0.001). The DL model AUC of 0.68 was significantly higher than both traditional models (0.57 each).
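
AUC, the metric used throughout these comparisons, is the probability that a randomly chosen cancer case receives a higher risk score than a randomly chosen non-case. A minimal, dependency-free illustration with toy scores (not study data):

```python
from itertools import product

def auc(pos_scores, neg_scores):
    """Probability that a positive outranks a negative (ties count half)."""
    pairs = list(product(pos_scores, neg_scores))
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

# Toy risk scores for 4 cancers and 6 non-cancers.
cancers = [0.9, 0.8, 0.6, 0.4]
normals = [0.7, 0.5, 0.5, 0.3, 0.2, 0.1]
print(f"AUC = {auc(cancers, normals):.2f}")   # 0.83
```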

Radiologist augmentation: Lee's study from South Korea is particularly revealing. With 200 patients, AI achieved an AUC of 0.915 (95% CI: 0.876-0.954), while experienced breast radiologists (BSR) averaged 0.813 and general radiologists (GR) averaged only 0.684. When AI was added as a decision-support tool, the BSR group's AUC rose significantly to 0.884 (p = 0.007) and the GR group's AUC jumped to 0.833 (p < 0.001). Sensitivity improved dramatically in both groups: from 74.6% to 88.6% for breast specialists and from 52.1% to 79.4% for general radiologists (both p < 0.001). Specificity, however, did not change significantly.

Dang's French study: In a study of 314 patients read by 12 different radiologists, the AUC improved significantly from 0.74 without AI to 0.77 with AI support (p = 0.004). Although the absolute improvement appears modest, it is statistically significant and clinically meaningful in a high-volume screening context, where even small gains in accuracy affect thousands of patients.

The single-reader context: The critical insight from non-organized screening data is that in settings where only one radiologist reads each mammogram (as is typical in the United States), AI provides a far more substantial boost than in European double-reading settings. AI effectively serves as a "virtual second reader," raising diagnostic performance to a level comparable with double human reading. Sasaki's study from Japan, however, provided a counterpoint: the AUC was higher for human readers than standalone AI (0.816 vs. 0.706, p < 0.001), though AI achieved higher sensitivity at certain cutoff thresholds (93% at cutoff 4 vs. 89% for unaided humans).
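
The cutoff effect seen in Sasaki's study is easy to demonstrate: lowering the operating threshold on an AI score raises sensitivity at the cost of specificity. An illustrative sketch with invented scores and labels:

```python
def operating_point(scores, labels, cutoff):
    """Sensitivity and specificity when exams scoring >= cutoff are recalled."""
    flagged = [s >= cutoff for s in scores]
    tp = sum(f and y for f, y in zip(flagged, labels))
    fn = sum(not f and y for f, y in zip(flagged, labels))
    tn = sum(not f and not y for f, y in zip(flagged, labels))
    fp = sum(f and not y for f, y in zip(flagged, labels))
    return tp / (tp + fn), tn / (tn + fp)

# Toy data: AI scores on a 1-10 scale; True marks a confirmed cancer.
scores = [9, 8, 7, 4, 3, 9, 6, 5, 2, 1]
labels = [True, True, True, True, False, False, False, False, False, False]

for cutoff in (7, 4):
    sens, spec = operating_point(scores, labels, cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```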

TL;DR: In single-reader (opportunistic) screening, AI provides the biggest gains. AI outperformed traditional clinical risk models (AUC 0.63-0.68 vs. 0.57-0.61). In Lee's study, AI assistance boosted general radiologists' sensitivity from 52.1% to 79.4%. Where screening involves only one human reader, AI effectively acts as a virtual second reader, raising performance to match double-reading standards.
Pages 9-11
AI's Unfinished Business: Interval Cancers and Dense Breast Tissue

What are interval cancers: Interval cancers are tumors that appear between scheduled screening rounds. They are among the most dangerous missed diagnoses because they often present at a more advanced stage. Several studies in this review specifically tested AI's ability to detect or predict interval cancers, and the results were mixed. Lang's study found that AI scored one in three (143 of 429) interval cancers with the highest risk score (10), and of these, 67% (96 of 143) showed minimal signs or were originally false negatives. Critically, 58% (83 of 143) were correctly located by AI, suggesting a potential 19.3% (95% CI: 15.9-23.4%) reduction in interval cancers.
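
The 19.3% figure follows directly from the reported counts, as a quick verification shows:

```python
interval_cancers = 429
highest_risk = 143   # interval cancers given the top risk score (10) by AI
minimal_signs = 96   # of those 143: minimal signs or originally false negative
localized = 83       # of those 143: correctly located by AI

print(f"highest-risk share:  {highest_risk / interval_cancers:.0%}")  # ~33%, one in three
print(f"minimal signs / FN:  {minimal_signs / highest_risk:.0%}")     # 67%
print(f"correctly located:   {localized / highest_risk:.0%}")         # 58%
print(f"potential reduction: {localized / interval_cancers:.1%}")     # 19.3%
```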

Hickman's three-model triage approach: Three different deep learning models were tested as both triage tools and interval cancer detectors on 78,849 mammograms. As triage tools, the models successfully triaged 35.0-55.6% of mammograms with only 0.0-0.1% of screening-detected cancers going undetected. For interval cancers, the DL algorithms flagged 4.6-8.2% of interval cancers and 5.2-6.1% of subsequent-round cancers when applied after routine double-reading workflow. Overall, the adaptive AI workflow showed non-inferior specificity (difference -0.9%, p < 0.001) and superior sensitivity (difference 2.7%, p < 0.001) compared with routine double reading.
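
A triage deployment of this kind amounts to a simple rule on the AI score: exams below a low threshold skip human reading, exams above a high threshold are flagged for extra scrutiny, and everything in between follows the routine workflow. The sketch below is a hypothetical illustration; the thresholds and function are ours, not taken from the Hickman study:

```python
def route_exam(ai_score: float, low: float = 0.05, high: float = 0.95) -> str:
    """Route a screening mammogram based on a normalized AI risk score."""
    if ai_score < low:
        return "auto-clear (no human reading)"   # the 35.0-55.6% triaged away
    if ai_score > high:
        return "flag for enhanced review"        # may catch interval cancers
    return "routine double reading"

for score in (0.01, 0.40, 0.99):
    print(f"score {score:.2f} -> {route_exam(score)}")
```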

The breast density challenge: Breast density is an independent risk factor for cancer, and dense tissue on mammograms can mask tumors. Gastounioti's study demonstrated that a hybrid framework built on CNN-extracted features achieved an AUC of 0.90 (95% CI: 0.82-0.98) for separating cancer cases from controls, far outperforming breast density alone. Ha's study similarly showed that the CNN risk model had greater predictive potential (OR = 4.42, 95% CI: 3.4-5.7) than breast density alone (OR = 1.67, 95% CI: 1.4-1.9). However, Wanders' study on interval cancer risk found that combining AI with breast density measures improved diagnosis but remained sensitive to threshold values.

Remaining gaps: The review highlights that AI still underperforms in interval cancer detection. Hinton's study found that while the deep learning model achieved an AUC of 0.82 for overall classification, incorrect classifications were slightly more common for interval cancer mammograms than for other cases. Zhu's study confirmed that combining DL with clinical risk factors improved screening-detected cancer identification but lost effectiveness for interval cancers. The authors conclude that breast density assessment and interval cancer prediction remain areas requiring significantly more research and AI development.

TL;DR: AI showed promise for interval cancers, with Lang's study suggesting a 19.3% reduction, but performance remains inconsistent. For breast density, CNN models outperformed density-only measures (AUC 0.90 vs. density alone). However, AI still struggles with interval cancers overall, and breast density integration needs more research. These are the two biggest unsolved problems in AI-assisted screening.
Pages 11-14
How AI Performance Varies Across Countries and Healthcare Systems

Schaffter's Sweden-U.S. comparison: This multicenter study compared AI algorithm performance between Swedish organized screening and U.S. non-organized screening. The top-performing algorithm achieved an AUC of 0.903 in Sweden compared with 0.858 in the United States. At radiologists' sensitivity levels, AI specificity was 81.2% in Sweden but only 66.2% in the U.S., both lower than community-practice radiologists' specificity of 98.5% (Sweden) and 90.5% (U.S.). However, combining top-performing algorithms with U.S. radiologist assessments boosted the AUC to 0.942 and achieved significantly improved specificity of 92.0% at the same sensitivity level.

McKinney's UK-U.S. study: This study compared AI performance in the UK (organized screening with double reading) and the U.S. (opportunistic screening with single reading). In the UK, AI showed a 1.2% improvement in specificity over the first reader and a 2.7% improvement in sensitivity. Compared with the second reader, AI demonstrated non-inferiority for both specificity (p < 0.001) and sensitivity (p = 0.02). In the U.S. context with single readers, AI demonstrated a 5.7% improvement in specificity over the typical reader, a substantially larger gain than in the UK double-reading setting.

Kim's South Korea-U.S. study: Analyzing 166,578 mammograms from 68,008 patients across South Korea (organized screening) and the U.S. (non-organized screening), this study found that AI achieved an overall AUC of 0.95, significantly higher than human readers at 0.81. This was one of the largest performance gaps observed in the review, though the authors note that sensitivity and specificity breakdowns were not available for this study.

The organized vs. opportunistic divide: The cross-national evidence reinforces the review's central finding: AI's incremental value depends heavily on the baseline standard of care. In European organized screening programs with two trained readers and established quality infrastructure, AI adds relatively modest diagnostic improvement but delivers major workload savings. In U.S.-style opportunistic screening with single readers and less standardized protocols, AI delivers substantially larger improvements in both sensitivity and specificity. The review's authors argue that this evidence supports universal adoption of organized screening programs, and where that is not feasible, AI can help bridge the quality gap.

TL;DR: Cross-national studies confirm that AI performs better in organized screening settings (AUC 0.903 in Sweden vs. 0.858 in the U.S.) but adds more value in single-reader systems. McKinney's study showed AI improved U.S. specificity by 5.7% vs. only 1.2% in the UK. The bottom line: where double reading exists, AI saves time; where only single reading exists, AI improves accuracy to match double-reading quality.
Pages 15-17
What This Means for Screening Policy, Ethics, and the Future

The case for organized screening + AI: The FDA approved computer-aided diagnosis (CAD) for mammographic images as early as 1998, and AI has evolved enormously since then. The review argues that deep learning models integrating both image analysis and clinical risk scores are more effective than any single tool alone, emphasizing the need for a multidisciplinary diagnostic approach. A critical factor is the volume of mammograms: more exams lead to better training data and higher diagnostic accuracy. Organized population-based screening programs generate the largest volumes and the most standardized datasets, making them the ideal environment for AI deployment.
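
One common way to integrate image analysis with clinical risk scores is late fusion: feed the image model's output and the clinical variables into a small second-stage classifier. A minimal sketch assuming scikit-learn, with synthetic data and illustrative feature names (not the review's models):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
labels = rng.integers(0, 2, n)                 # 1 = cancer, 0 = no cancer

# Stand-ins for real inputs: an image-model score plus clinical factors.
image_score = labels * 0.5 + rng.normal(0.3, 0.2, n)   # CNN output per exam
age = rng.normal(58, 8, n)                             # clinical risk factor
density = rng.integers(1, 5, n)                        # BI-RADS a-d as 1-4

X = np.column_stack([image_score, age, density])

# Second-stage "fusion" model over the image score and clinical features.
fusion = LogisticRegression(max_iter=1000).fit(X, labels)
print("combined risk:", fusion.predict_proba(X[:3])[:, 1].round(2))
```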

Practical implications for different healthcare systems: For European countries already operating organized screening with double reading, AI does not necessarily improve diagnostic accuracy but can reduce the workload by 30-44%, potentially addressing chronic radiologist shortages. For systems relying on opportunistic screening with single readers (such as the United States), AI provides a cost-effective path to achieving double-reading-level performance. Dembrower's and Lang's Swedish studies demonstrated that AI would not actually improve upon the diagnostic performance already achieved by double human reading, but could maintain it with fewer resources.

Desirable properties of clinical AI: The review references Combi et al.'s framework for evaluating AI systems, requiring four characteristics: interpretability (can users understand how the system makes decisions), understandability (can users comprehend results and mechanisms), usability (ease of interface use), and usefulness (practical value for its intended purpose). In clinical practice, this translates to a system focused on the doctor's decision-making needs and one that integrates seamlessly into the patient's diagnostic and treatment pathway. Several ongoing trials are investigating these real-world integration questions.

Ethical considerations and bias: The authors emphasize that AI systems trained on large datasets may contain biases reflecting historical inequalities. Ensuring fairness requires recognizing and mitigating bias in data, algorithms, and decision-making processes. AI's capacity to collect and process vast amounts of personal data also raises privacy concerns. The ethical deployment of AI in screening demands clear guidelines, informed consent, and responsible data governance. Without these safeguards, AI could inadvertently perpetuate healthcare disparities rather than reduce them.

Future directions: Several randomized trials are currently ongoing, reflecting growing interest in AI-assisted breast cancer screening. The review identifies breast density analysis and interval cancer detection as the two areas most in need of further AI development. If future studies confirm AI's safety and benefit, the technology could become a standard support tool, potentially eliminating the need for double reading or for a third arbitrating radiologist in cases of disagreement. This would free specialists to focus on more complex diagnostics and reduce patient waiting times.

TL;DR: The review concludes that AI is most transformative in single-reader (opportunistic) screening, where it can match double-reading accuracy. In organized screening, AI's main benefit is 30-44% workload reduction. Key unsolved challenges include interval cancer detection, breast density analysis, dataset bias, and privacy. Multiple ongoing trials will determine whether AI becomes a standard component of screening workflows.
Citation: Altobelli E, Angeletti PM, Ciancaglini M, Petrocelli R. The Future of Breast Cancer Organized Screening Program Through Artificial Intelligence: A Scoping Review. Healthcare. 2025;13(4):378. DOI: 10.3390/healthcare13040378. PMCID: PMC11855082. Open access under a CC BY license.