Deep Learning in Breast Cancer Imaging: State of the Art and Recent Advancements in Early 2024

PMC (Open Access), 2024

Plain-English Explanations
Pages 1-3
Why Deep Learning Is Reshaping Breast Cancer Imaging

Breast cancer is the most common neoplastic disease in women worldwide, with over 2.3 million diagnoses and 685,000 deaths registered globally in 2020. Screening programs using mammography (MG) and ultrasonography (US) remain the primary defense for early detection, targeting early signs such as microcalcifications, architectural distortions, and solid masses. However, the sheer volume of high-resolution images that radiologists must evaluate each day creates an enormous cognitive burden where fatigue, biases, and distractions inevitably degrade performance, even among highly trained specialists.

The AI opportunity: Appropriately trained deep learning (DL) algorithms can serve multiple roles in this pipeline: as second or third independent readers providing failsafe mechanisms, as real-time assistants enhancing radiologist sensitivity and specificity, or even as automated single readers to increase throughput and reduce costs. DL models represent the most advanced form of computer-aided detection (CADe), surpassing earlier statistical classifier-based CAD tools that suffered from low specificity and limited clinical adoption. Convolutional neural networks (CNNs) in particular have become the dominant DL architecture for complex computer vision tasks in the medical imaging domain.

Beyond detection: After a tumor is detected, staging and cancer burden monitoring with magnetic resonance imaging (MRI) often rely on subjective bi-dimensional measurements that introduce significant interobserver variability. AI-based tools provide automated or semi-automated lesion identification with more consistent and reproducible results, enabling more precise staging and dramatically reducing the time required for comparing different studies and evaluating treatment response. Additionally, radiomics-based models can predict histopathological features, prognosis, and treatment response from imaging examinations by analyzing quantitative image patterns hidden from human qualitative observation.

Scope of this review: This narrative review from Carriero, Groenhoff, and colleagues at the Maggiore della Carità Hospital in Novara, Italy, aims to bridge a gap identified in the existing literature. Most previous reviews focused on either clinical or technical aspects in isolation. This paper targets a middle ground, providing essential technical information about AI development alongside up-to-date clinical research results, specifically geared toward radiologists. The review also addresses novel techniques like thermography and microwave-based imaging that have received limited prior coverage.

TL;DR: Breast cancer affects over 2.3 million women annually, and radiologists face enormous workloads analyzing screening images. Deep learning, particularly CNNs, has emerged as the leading approach for computer-aided detection, staging, and prognosis prediction. This 2024 review bridges technical and clinical perspectives on DL in breast cancer imaging, including both conventional and novel modalities.
Pages 3-7
Core DL Architectures: CNNs, GANs, and Large Language Models

Convolutional Neural Networks (CNNs): CNNs are feedforward neural networks consisting of multiple convolutional layers followed by pooling layers and fully connected layers, inspired by the animal visual cortex. Convolutional layers apply filters or kernels to extract features from local regions of input images, detecting edges, shapes, textures, and patterns. After several convolution operations, the output feature maps undergo downsampling through max-pooling, average-pooling, or min-pooling layers to reduce spatial resolution while retaining salient information. Fully connected layers then perform high-level reasoning to produce classification results. Key CNN architectures mentioned include ResNet, DenseNet, MobileNets, EfficientNet, and ConvNeXt for classification; YOLO, R-CNN, and SSD for object detection; and U-Net and nnU-Net for segmentation.
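The convolve-then-pool sequence described above can be illustrated with a toy NumPy sketch. This is a generic illustration of the mechanics, not any specific architecture from the review; the image, kernel, and sizes are made up:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max-pooling: halves spatial resolution, keeps peaks."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# A Sobel-like kernel responding to vertical edges, applied to a toy image
# whose right half is bright (i.e., it contains one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)
fmap = conv2d(image, kernel)   # (4, 4) feature map: strong response at the edge
pooled = max_pool(fmap)        # (2, 2): downsampled, edge response retained
```

In a real CNN the kernel weights are learned rather than hand-set, and many such filters run in parallel before the fully connected layers classify the pooled features.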

Classification, detection, and segmentation: Classification models evaluate an image and assign it to a labeled class (e.g., normal or abnormal). Object detection models expand on this by also identifying the approximate location of an abnormality through bounding boxes. Segmentation models go further still, identifying the exact boundary of an object in an image or volume, enabling precise calculation of physical properties such as diameters, surface area, and volume. The most widely used segmentation network for biomedical imaging is U-Net, with the self-configuring nnU-Net pipeline achieving state-of-the-art performance across tasks ranging from whole-body CT segmentation to brain cancer MRI segmentation.
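The clinical payoff of segmentation — measurable physical properties — follows directly from a binary mask. A minimal NumPy sketch (the pixel spacing and lesion size here are made-up values):

```python
import numpy as np

def lesion_area_mm2(mask, pixel_spacing_mm=(0.1, 0.1)):
    """Area of a segmented lesion from a binary mask and the pixel spacing."""
    return float(mask.sum()) * pixel_spacing_mm[0] * pixel_spacing_mm[1]

# A hypothetical 20x20-pixel lesion on a 100x100 image
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 40:60] = True
area = lesion_area_mm2(mask)   # 400 pixels x 0.01 mm^2 per pixel = 4.0 mm^2
```

The same counting idea extends to volumes from 3-D masks (voxel count times voxel volume), which is what makes segmentation outputs more reproducible than manual bi-dimensional measurements.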

Generative Adversarial Networks (GANs): GANs comprise a generator that synthesizes artificial samples mimicking real data distributions and a discriminator that distinguishes genuine from fake instances. During training, both components engage in a minimax game until equilibrium is reached. In medical imaging, GANs have been used for synthetic data generation to augment existing datasets (particularly for imbalanced classes or rare events) and for image enhancement, including noise suppression, artifact reduction, and harmonization of inconsistent features across multi-center studies.
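The minimax game can be written as the standard GAN value function (Goodfellow et al.'s original formulation, restated here for reference): the discriminator $D$ maximizes it while the generator $G$ minimizes it, until $G(z)$ becomes indistinguishable from real data.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\bigl(1 - D(G(z))\bigr)\right]
```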

Large Language Models (LLMs): LLMs have recently been explored for radiology report generation and medical information retrieval. General-purpose models like GPT and LLaMA have spawned domain-specific variants such as Med-PaLM, MedAlpaca, and LLaVA-Med, some integrating image assessment capabilities to fully automate the reporting workflow. Performance metrics discussed include accuracy, precision, sensitivity (recall), F1 score, the area under the ROC curve (AUC), and the Dice Similarity Coefficient (DSC) for segmentation.

TL;DR: CNNs remain the workhorse for breast cancer imaging tasks (classification, detection, segmentation), with architectures like ResNet, U-Net, and YOLO dominating. GANs generate synthetic training data and enhance image quality. LLMs are being explored for automated radiology reporting. Key evaluation metrics include AUC for classification and DSC for segmentation.
Pages 8-10
Public Datasets Fueling Breast Cancer DL Research

Mammography datasets: The review catalogs the principal public datasets available for breast cancer imaging. The Digital Database for Screening Mammography (DDSM), released in 1999 with approximately 2,620 patients, was the pioneering dataset for CAD research using screen-film mammography. INbreast (2011, 115 patients from Portugal) was the first high-quality full-field digital mammography (FFDM) dataset with standardized, high-resolution images. CBIS-DDSM (2017, 1,566 patients) addressed shortcomings of DDSM with improved image quality, uniform resolution, and absence of overlapping patches. VinDr-Mammo (2022, 5,000 Vietnamese women, 30,000 ROIs) introduced a large, diverse FFDM collection with bounding box annotations.

The ADMANI mega-dataset: The Annotated Digital Mammograms and Associated Non-Image (ADMANI) datasets, introduced in late 2022, represent one of the largest mammography collections available, containing over 4.4 million screening images from 630,000 women with histopathological confirmation. A subset of 40,000 images from 10,000 screening episodes was donated as the test set for the RSNA Screening Mammography Breast Cancer Detection challenge, which attracted over 1,600 competing teams.

Ultrasound and MRI datasets: Compared to mammography, far fewer public datasets exist for breast ultrasound and MRI. Recently released US datasets include BrEaST (2023, 256 patients from Poland) and BUS-BRA (2023, 1,064 patients from Brazil). For dynamic contrast-enhanced MRI (DCE-MRI), the Duke Breast Cancer MRI dataset (2022, 922 patients) and BreastDM (2023, 232 patients from China) are the primary public resources. This scarcity of non-mammographic datasets has likely contributed to the slower development of DL models for these modalities.

Private datasets and cost barriers: Larger private datasets, such as those from New York University covering breast cancer screening, ultrasound, and MRI, have been commercially released but carry significant entry costs that constitute a substantial adoption barrier for researchers worldwide. The heterogeneity of demographic characteristics and scanning devices across public datasets also poses challenges, as a model trained at one institution may not perform well when tested externally. Transfer learning and local retraining have emerged as potential strategies for addressing this generalizability gap.

TL;DR: Public mammography datasets range from the pioneering DDSM (2,620 patients) to the massive ADMANI (630,000 women, 4.4 million images). Ultrasound and MRI datasets remain scarce and small. The RSNA challenge (1,600+ teams, 40,000 test images) has become a key benchmark. Private datasets exist but cost barriers limit access for global researchers.
Pages 10-14
Deep Learning for Screening Mammography: From Retrospective to Prospective Evidence

Early retrospective studies: Becker et al. (2017) tested a deep artificial neural network (dANN) on 1,144 patients (143 cancer-positive), achieving sensitivity/specificity of 73.7%/72.0% and an AUC of 0.82 in a screening-like cohort, comparable to experienced radiologists. Performance was highest in low-density breasts (AUC = 0.94). Watanabe et al. (2019) showed that AI-CAD software improved mean cancer detection rate (CDR) from 51% to 62% across all radiologists, with less than 1% increase in false-positive recalls. Akselrod-Ballin et al. (2019) evaluated a combined ML-DL model on over 13,000 women, achieving AUC = 0.91 with 77.3% specificity at 87% sensitivity.

Large-scale challenge results: Schaffter et al. (2020) organized a challenge using data from over 85,000 US women and 68,000 Swedish women. The top-performing algorithm achieved AUC = 0.858 and 0.903 on internal and external validation, respectively. While standalone AI specificity (66.2% internal, 81.2% external) was worse than both American (90.5%) and Swedish (98.5%) radiologists, combining top algorithms with radiologists yielded AUC = 0.942 and significantly improved specificity (92.0%). Kim et al. (2020) found standalone AI achieved AUC = 0.940, significantly outperforming unassisted radiologists (0.810), with AI-assisted radiologists reaching 0.881.

Landmark prospective studies: Dembrower et al. conducted the most comprehensive evaluation, first showing in a 2020 retrospective study (547 diagnosed patients, 6,817 controls) that AI triage could halve human workload while detecting a substantial proportion of human-missed cancers. Their 2023 prospective non-inferiority study on over 55,000 women demonstrated that double reading by one radiologist plus AI was non-inferior to double reading by two radiologists, with a 4% increase in screen-detected cancers. Ng et al. (2023) showed AI-assisted reading could increase detection rate by 0.7 to 1.6 per 1,000 cases with only 0 to 0.23% extra unnecessary recalls, and most additional detections were invasive, small-sized tumors (10 mm or less).

The RSNA 2022 challenge: The RSNA Screening Mammography Breast Cancer Detection challenge attracted over 1,600 teams. The winning solution used a multi-step pipeline: a YOLOX-nano detector for breast ROI extraction, several preprocessing operations, and a ConvNeXt-small classifier ensembled across four cross-validation folds, achieving AUC = 0.93. Notably, participants were required to publicly release their source code, promoting transparency and reproducibility. Seven FDA-approved AI-based tools for breast cancer detection are now commercially available, including cmAssist, Genius AI Detection, INSIGHT MMG, MammoScreen 2.0, ProFound AI, Saige-Dx, and Transpara.

TL;DR: Screening mammography is the most mature DL application. Standalone AI achieves AUC up to 0.940, and AI-assisted radiologists consistently outperform unassisted readers. A prospective study on 55,000+ women confirmed AI plus one radiologist is non-inferior to two radiologists. The RSNA challenge winner (ConvNeXt-small) achieved AUC = 0.93. Seven FDA-approved commercial tools are now available.
Pages 14-17
DL Beyond Mammography: DBT, Contrast-Enhanced Mammography, Ultrasound, and MRI

Digital Breast Tomosynthesis (DBT): Studies show that AI-assisted DBT interpretation can increase sensitivity, reduce recall rates, and decrease double-reading workload. Romero-Martin et al. (2021) evaluated 15,999 examinations, finding AI achieved AUC = 0.93 for digital mammography and 0.94 for DBT. For digital mammography, AI demonstrated non-inferior sensitivity with up to a 2% recall rate reduction. For DBT, sensitivity was also non-inferior, but at the cost of a recall rate increase of up to 12.3%, suggesting the need for further optimization.

Contrast-Enhanced Mammography (CEM): Zheng et al. (2023) conducted a prospective multicenter study on over 1,900 Chinese women with single-mass breast lesions, achieving automated segmentation DSC = 0.837 and classification AUC = 0.891. Beuque et al. (2023) combined deep learning segmentation with handcrafted radiomics classification, reaching AUC = 0.88 on manual segmentations and 0.95 on automatic segmentations. Qian et al. (2023) used a multi-feature fusion network with dual-energy subtracted and low-energy bilateral dual-view CEM images, achieving AUC = 0.92 on an external dataset.

Breast Ultrasound: Compared to mammography, DL studies for breast ultrasound have been fewer, smaller-sampled, and more heterogeneous. A 2024 review by Dan et al. found no consistent superiority of AI over human readers, though some studies showed improvement when pairing AI with inexperienced radiologists. Gu et al. (2022) developed a DL classifier tested on over 5,000 patients, achieving AUC = 0.913 (comparable to experienced radiologists, significantly higher than inexperienced ones). Lyu et al. developed a segmentation model with DSC = 0.8 on external datasets using attention-enhanced edge recognition.

Breast MRI: Deep learning applications for breast MRI remain largely investigational but include promising directions. Janse et al. (2023) used nnU-Net to segment locally advanced breast cancer for neoadjuvant chemotherapy response assessment, achieving median DSC = 0.87 on 55 patients from four institutions. Chung et al. (2022) demonstrated the feasibility of generating synthetic contrast-enhanced T1-weighted breast MRI from pre-contrast sequences using a deep neural network, with DSC = 0.75 for tumor similarity. Li et al. (2023) showed a DL-based radiomic model combining pre- and early-treatment DCE-MRI information outperformed conventional radiomics (AUC = 0.900 vs. 0.644 and 0.888) for predicting pathological complete response, with a combined clinical model reaching AUC = 0.925.

TL;DR: DBT AI achieves AUC = 0.94, and CEM classifiers reach AUC = 0.92. Breast ultrasound AI (AUC = 0.913) matches experienced radiologists. MRI applications include nnU-Net segmentation (DSC = 0.87), GAN-based synthetic contrast generation, and DL radiomic models for chemotherapy response prediction (AUC = 0.925). All modalities lag behind mammography in study volume and maturity.
Pages 17-20
Thermography, Microwave Imaging, and Other Emerging Modalities

Thermography: Thermographic breast examination uses infrared cameras to detect abnormal heat patterns from altered blood vessel growth and metabolic activity in tumors. Multiple DL studies have reported excellent results: Mambou et al. (2017) achieved 100% classification accuracy on 67 subjects using a CNN coupled with an ML classifier. Alshehri et al. (2022) used attention mechanisms to boost CNN accuracy from 92.3% to 99.46%, reaching 99.8% in a 2023 follow-up with a deeper architecture. Mohamed et al. (2022) combined U-Net segmentation with a bespoke classifier for 99.3% accuracy, and Civilibal et al. (2023) developed a Mask R-CNN with a ResNet-50 backbone achieving 0.921 mean average precision for detection and 0.868 overlap for segmentation.

Thermography in clinical settings: Clinical validation of thermographic AI remains limited but promising. Singh et al. (2021) evaluated a commercial AI-based thermal screening device (Thermalytix) on 258 symptomatic patients, achieving AUC = 0.845 with 82.5% sensitivity versus 92% for mammography. Bansal et al. (2023) tested the same device on 459 women (symptomatic and asymptomatic), finding non-inferior performance compared to mammography with better sensitivity in women with dense breasts. Key advantages of thermography include no ionizing radiation, no breast compression, lower costs, and reduced performance loss in dense breast tissue.

Microwave Breast Imaging (MBI): MBI uses low-power radio waves to create images based on dielectric tissue properties, offering similar advantages to thermography: no radiation, no compression, and less degradation in dense breasts. Moloney et al. (2022) published the first clinical MBI study on 24 symptomatic patients, correctly detecting 12 of 13 benign lesions and 9 of 11 cancers, including a radiographically occult invasive lobular neoplasm. A large European-funded prospective study involving 10,000 patients across 10 centers is ongoing through November 2026. No studies have yet explored deep learning applied to MBI in clinical settings, representing a wide-open research opportunity.

Other investigational modalities: Breast elastography (ultrasound-based tissue stiffness evaluation) has benefited from DL feature extraction since Zhang et al.'s 2015 two-layer neural network, with recent fully DL models reducing inter-observer variability. For breast-specific gamma imaging (BSGI), Yu et al. demonstrated positive results with a ResNet18 classifier. Zhang et al. developed a fusion optical tomography-ultrasound DL model achieving AUC = 0.931 for breast cancer classification. Positron emission mammography (PEM) remains the only modality with no significant DL studies to date.

TL;DR: Thermographic AI achieves up to 99.8% accuracy in research settings, with clinical studies showing AUC = 0.845 and superior dense-breast sensitivity. MBI detected 9 of 11 cancers in its first clinical study, with a 10,000-patient European trial underway. These modalities offer radiation-free, compression-free alternatives, but clinical validation and DL integration remain at early stages.
Pages 20-24
Vision Transformers, ConvNeXt, YOLO, nnU-Net, and DL-Based Radiomics

Vision Transformers (ViTs): Introduced by Dosovitskiy et al. in 2020, ViTs treat images as sequences of non-overlapping patches processed through a transformer architecture originally designed for NLP. Unlike CNNs that focus on local patterns, ViTs capture long-range dependencies and global context. They have achieved state-of-the-art performance on ImageNet and shown promise in medical imaging, with Ayana et al. demonstrating a ViT-based mammography classifier outperforming CNN baselines via transfer learning. However, Cantone et al. (2023) found ViTs may underperform CNNs when trained with small datasets, and their high computational cost limits adoption in resource-constrained settings.
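Patch tokenization — the step that turns an image into a transformer-ready sequence of non-overlapping patches — can be sketched in a few lines of NumPy. This is a generic illustration, not any particular ViT implementation; the 224x224/16x16 sizes are the canonical ViT defaults:

```python
import numpy as np

def to_patches(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    n_h, n_w = h // patch, w // patch
    # (n_h, p, n_w, p, C) -> (n_h, n_w, p, p, C) -> (n_h*n_w, p*p*C)
    tiles = image.reshape(n_h, patch, n_w, patch, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(n_h * n_w, patch * patch * c)

# A 224x224 RGB image with 16x16 patches yields the familiar 196-token sequence;
# each 768-dimensional token is then linearly embedded and fed to the transformer.
img = np.random.rand(224, 224, 3)
tokens = to_patches(img, 16)   # shape (196, 768)
```

Self-attention over these 196 tokens is what gives ViTs their global receptive field from the very first layer, in contrast to the gradually growing receptive fields of stacked convolutions.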

ConvNeXt: ConvNeXt modernizes the classic ResNet architecture by incorporating design principles from transformers: larger kernel sizes, depthwise separable convolutions, layer normalization (replacing batch normalization), and strided convolutions (replacing pooling). This yields transformer-competitive accuracy with CNN-level simplicity and computational efficiency. ConvNeXt v2 added improved layer normalization, dynamic depth, and better activation functions. Hassanien et al. (2022) used ConvNeXt for breast tumor malignancy prediction on ultrasound, outperforming both CNN and ViT alternatives. The RSNA challenge winner also employed ConvNeXt-small, achieving AUC = 0.93.
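The efficiency of the depthwise convolutions ConvNeXt borrows shows up in a quick parameter count. The arithmetic below is generic (the 7x7 kernel and 96-channel width echo ConvNeXt-style choices but are not its exact block dimensions):

```python
def standard_conv_params(k, c_in, c_out):
    """A k x k standard convolution mixes every input channel into every output."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """One k x k filter per input channel, then a 1x1 pointwise mixing step."""
    return k * k * c_in + c_in * c_out

std = standard_conv_params(7, 96, 96)        # 451,584 weights
sep = depthwise_separable_params(7, 96, 96)  # 13,920 weights, roughly 32x fewer
```

This is why large 7x7 kernels, prohibitively expensive as standard convolutions, become affordable in the depthwise form — one of the transformer-inspired design choices behind ConvNeXt's efficiency.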

YOLO and nnU-Net: The YOLO (You-Only-Look-Once) architecture treats object detection as a regression problem, enabling real-time performance. YOLOv7 (2022), YOLOv8, and YOLOv9 (early 2024) have each brought incremental improvements. In breast imaging, Aly et al. (2021) demonstrated YOLO-based mass detection outperforming conventional CNNs. Su et al. (2022) combined YOLOv5 with a local-global transformer for mass detection and segmentation. For segmentation, nnU-Net is a self-configuring pipeline that automatically determines optimal network architecture without extensive manual tuning, with nnU-Net V2 (2023) adding improved usability and broader platform support.
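Detector outputs like YOLO's bounding boxes are scored against ground truth by intersection-over-union (IoU), the detection analogue of the Dice coefficient. A minimal sketch using (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero-sized if the boxes are disjoint)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A prediction shifted by half a box width against its ground truth:
score = iou((0, 0, 10, 10), (5, 0, 15, 10))   # overlap 50, union 150 -> 1/3
```

Metrics such as the mean average precision reported for detection studies in this review are built on top of IoU thresholds (a predicted mass typically counts as correct when IoU with the annotation exceeds 0.5).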

DL-based radiomics: Traditional radiomics manually extracts predefined quantitative features from medical images, then applies statistical ML classifiers. DL-based radiomics replaces this with neural networks that automatically learn hierarchical feature representations, handling noisy or missing data more robustly. Applications in breast cancer include tumor and lymph node malignancy assessment, pathologic marker evaluation, and treatment response prediction. However, challenges persist around the need for large, diverse training datasets, reproducibility across imaging platforms, and the black-box nature of predictions that limits clinical interpretability.

TL;DR: ViTs capture global image context but need large datasets and compute. ConvNeXt bridges CNNs and transformers with competitive accuracy at lower cost (RSNA winner, AUC = 0.93). YOLO enables real-time detection, and nnU-Net automates segmentation pipeline creation. DL-based radiomics replaces manual feature engineering for prognosis and treatment response prediction.
Pages 24-27
Prospective Studies, AI Integration Strategies, and Public Challenges

Prospective vs. retrospective approaches: Historically, most DL studies in breast cancer imaging followed retrospective designs for rapid evaluation of novel models. However, retrospective studies suffer from selection bias (particularly from cancer-enriched datasets that do not reflect true disease prevalence), limited follow-up periods, and incomplete clinical characterization. The shift toward prospective studies, including works by Dembrower et al., Ng et al., Zheng et al., and Gu et al., provides more realistic performance estimates by reproducing common clinical scenarios in terms of disease prevalence, information availability, and interpretation setting.

AI integration strategies: The review identifies three primary integration approaches. Standalone AI systems process data and generate reports independently, with triage-based approaches only flagging cases above a critical threshold. AI-assisted single reading pairs AI with one radiologist, combining DL efficacy with human expertise while reducing cognitive strain. AI-assisted double reading (triple reading) adds AI analysis to two independent radiologist assessments. Dembrower et al.'s 2023 prospective study compared all three strategies: assisted double reading produced the most abnormal interpretations, followed by assisted single reading and unassisted double reading, while standalone AI had the lowest recall rate. Crucially, cancer detection rates were similar across all strategies, suggesting higher specificity for standalone AI without meaningful sensitivity loss.
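The triage idea — standalone AI escalating only cases above a critical suspicion threshold — amounts to a simple filter over model scores. A minimal sketch; the score scale, case names, and threshold value are hypothetical (in practice the operating point is tuned on a validation set to keep sensitivity near 100%):

```python
def triage(cases, threshold=0.3):
    """Split cases into those flagged for radiologist review and those cleared.

    `cases` maps a case identifier to an AI suspicion score in [0, 1].
    """
    flagged = {cid: s for cid, s in cases.items() if s >= threshold}
    cleared = {cid: s for cid, s in cases.items() if s < threshold}
    return flagged, cleared

flagged, cleared = triage({"case-1": 0.92, "case-2": 0.05, "case-3": 0.41})
# case-1 and case-3 go to a human reader; case-2 is cleared by the AI alone
```

The workload saving comes entirely from the cleared set, which is why the threshold must be conservative: a missed cancer in that set is never seen by a human.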

The automation bias concern: A key risk in AI-assisted reading is automation bias, where radiologists develop excessive reliance on AI outputs, potentially leading to complacency or overconfidence. The interaction between healthcare professionals and AI algorithms is a relatively unexplored field with complex implications for human performance. Successful implementation requires seamless PACS integration, trust-building between radiologists and AI, and strategies to prevent complacency. Standalone AI reading raises additional ethics and liability issues that current regulations have not yet resolved.

Public challenges: Competitions organized by bodies like RSNA, MICCAI, and Grand Challenge have become powerful drivers of DL progress. They provide standardized benchmarks with identical datasets and evaluation criteria, foster interdisciplinary collaboration, facilitate data sharing and access, and promote transparency through mandatory code releases. The RSNA Screening Mammography challenge exemplifies these benefits, with over 1,600 teams competing on 40,000 test images and all source code made publicly available for external reproducibility.

TL;DR: The field is shifting from retrospective to prospective study designs for more realistic performance estimates. Three AI integration strategies (standalone, assisted single reading, assisted double reading) show similar cancer detection rates but different recall profiles. Automation bias in human-AI interaction remains a concern. Public challenges (RSNA, MICCAI) drive innovation through standardized benchmarks and mandatory code release.
Pages 27-30
Barriers to Clinical Adoption: Generalizability, Privacy, Costs, and Explainability

Generalizability: Medical imaging data varies significantly across centers due to differences in imaging protocols, equipment specifications, and patient demographics. These variations introduce biases and hinder model generalizability. A model trained on images from one institution may perform poorly when deployed elsewhere. Addressing this requires standardized data collection protocols, cross-institutional collaboration to build representative datasets, affordable dataset access for researchers worldwide, and systematic assessment of AI consistency across different clinical settings. Transfer learning and local retraining show promise but need more rigorous evaluation.

Multimodal interpretation: Experienced radiologists routinely integrate information from multiple sources: prior studies, different imaging modalities, lab tests, pathological specimens, and clinical status. Current AI systems largely interpret single-modality data in isolation, unable to replicate this holistic reasoning. DL models capable of effectively handling multimodal healthcare data will be essential for fully leveraging the complementary information provided by different techniques, but the complexity of integrating such diverse data sources into a single algorithm remains a substantial technical challenge.

Costs and privacy: High-performance GPUs and specialized accelerators are essential for DL training and inference, representing a significant financial barrier for clinical facilities. Many commercial AI solutions offload computations to external servers, reducing hardware costs but introducing privacy concerns around transmitting sensitive medical data. Implementing secure, privacy-preserving solutions while maintaining computational efficiency is critical for widespread adoption. The development of open training and inference platforms for medical imaging AI (analogous to llama.cpp for LLMs or ComfyUI for image-generation models) could significantly democratize access, but no such unified platform exists for medical imaging as of early 2024.

Explainability, ethics, and liability: Deep neural networks function as black boxes, converting inputs to predictions without direct insight into their reasoning. This opacity raises serious ethical and legal considerations for clinical deployment, particularly in scenarios where diagnoses and treatment decisions are strongly influenced by AI. Explainable AI (XAI) efforts to increase model interpretability will be paramount for establishing the feasibility and scope of AI implementation in clinical practice. The authors conclude that collaborative research from hardware and software vendors, clinicians, and policymakers will be required to improve computational infrastructure, enhance data security, promote responsible radiologist use, and confront the complex bioethical implications of AI-driven medicine.

TL;DR: Key barriers to clinical adoption include poor cross-site generalizability (different scanners, demographics), inability to integrate multimodal data, high hardware and software costs, privacy risks from cloud-based inference, and the black-box nature of DL predictions. No open-source unified medical imaging inference platform yet exists. Explainable AI and interdisciplinary collaboration between technologists, clinicians, and regulators are essential for moving forward.
Citation: Carriero A, Groenhoff L, Vologina E, Basile P, Albera M. Deep Learning in Breast Cancer Imaging: State of the Art and Recent Advancements in Early 2024. Diagnostics, 2024 (Open Access). DOI: 10.3390/diagnostics14080848. PMCID: PMC11048882. License: CC BY.