This paper is a comprehensive review of how artificial intelligence (AI) is being applied to breast cancer detection across three imaging modalities: mammography, ultrasound, and thermography. The authors note that while many existing reviews focus on individual imaging techniques or AI methods in isolation, there is a gap in the literature when it comes to understanding how these approaches compare and complement one another across modalities. This review consolidates that perspective by surveying both traditional machine learning (ML) and deep learning (DL) approaches from studies published between 2020 and 2025.
Beyond standard ML and DL techniques, the review incorporates emerging themes that are only beginning to appear in academic discourse. These include explainable AI (XAI), large language models (LLMs) such as GPT-4 and LLaMA, and multimodal LLMs (MLLMs) that can process text, images, and other data types within a single framework. The authors argue these tools represent a transformative shift in diagnostic AI, particularly for integrating imaging data with clinical narratives and medical records.
The paper is structured across nine major sections covering medical imaging modalities and their publicly available datasets, ML and DL technique descriptions, performance evaluation metrics, literature search methodology, modality-specific AI analyses (mammography, ultrasound, thermography), and a discussion of LLMs and XAI for breast cancer diagnostics. The review was designed to be readable not only by AI specialists or physicians but by the broader scientific community working in interdisciplinary cancer research.
Mammography: Recommended by NCCN, WHO, and ACR as the standard screening method for women above age 40, mammography uses low-dose X-rays to produce detailed images of soft and dense breast tissue. It has evolved through digital mammography, 3D tomosynthesis, and contrast-enhanced variants, which have improved sensitivity, especially in dense breasts. However, mammography has well-known limitations, including high cost, patient discomfort, radiation exposure, and the problem of overdiagnosis, where benign changes or clinically insignificant tumors may be misinterpreted as cancer. Dense breast tissue, in which glandular and fibrous structures appear white on mammograms just as tumors do, can lead to false positives. A systematic review identified 254 mammography datasets, of which only 22 were openly accessible. Key public datasets include DDSM (2,620 cases), CBIS-DDSM (10,239 mammograms with masks), INbreast (410 images from 115 patients), Mini-MIAS (322 images), and BCDR (1,010 cases, 3,703 mammograms).
Ultrasound: This modality serves as an alternative screening tool, particularly for women with dense breast tissue. It uses high-frequency sound waves rather than ionizing radiation, making it non-invasive, cost-effective, and safer for younger women who may require repeated screenings. Ultrasound can clearly differentiate fluid-filled cysts from solid masses and, when used alongside mammography, has been shown to increase cancer detection by 1.9 to 3.5 cancers per 1,000 women under age 50 with dense breasts. Its main limitations are a restricted ability to identify calcifications, reduced specificity compared with mammography, and dependence on operator expertise. Notable public datasets include BUSI (780 images from 600 patients), OASBUD (200 scans from 78 women), and BUS-BRA (1,875 images from 1,064 patients).
Thermography: First approved by the FDA in 1982 as an adjunctive screening tool, and still not approved as a standalone diagnostic method, thermography captures temperature differences on the breast surface using thermal infrared cameras. Differences in surface temperature distribution may indicate tumors due to increased vascularity and metabolic activity. Advantages include no X-ray exposure, lower cost than mammography or MRI, and portability suitable for resource-limited settings. The main limitation is that it captures only surface-level measurements, limiting detection of deeper tumors. Data availability is far more limited, with the Mastology Research Database (287 individuals, ages 23 to 120) being the most widely used thermography dataset.
The review distinguishes two main architectural workflows for AI-based medical imaging. Traditional ML follows a manual, segmented pipeline: images undergo preprocessing and normalization (using algorithms such as k-means clustering, fuzzy c-means, or watershed methods to isolate regions of interest), followed by hand-selected radiomic feature extraction. Techniques like PCA, LDA, t-SNE, UMAP, recursive feature elimination, or chi-square tests handle dimensionality reduction and feature selection. The selected features then feed into classifiers such as logistic regression (LR), SVM, decision trees (DT), random forests (RF), gradient boosting methods (XGBoost, LightGBM), naive Bayes, or k-Nearest Neighbors (kNN).
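As a rough illustration of this segmented workflow, the sketch below wires scaling, PCA, and an SVM into a single scikit-learn pipeline; the radiomic features and labels are synthetic placeholders, not data from any reviewed study.

```python
# Minimal sketch of a traditional ML pipeline: feature scaling,
# PCA for dimensionality reduction, and an SVM classifier.
# X would hold hand-crafted radiomic features extracted from
# segmented regions of interest; data here is synthetic.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # placeholder radiomic features
y = rng.integers(0, 2, size=200)    # placeholder labels (0=benign, 1=malignant)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),             # normalize feature ranges
    ("pca", PCA(n_components=10)),           # reduce dimensionality
    ("svm", SVC(kernel="rbf", probability=True)),
])
pipe.fit(X_tr, y_tr)
print(classification_report(y_te, pipe.predict(X_te)))
```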
Deep learning models eliminate manual segmentation and feature engineering: the network learns feature extraction and selection automatically during training. Architectures include U-Net, Mask R-CNN, and fully convolutional networks (FCNs) for segmentation, while CNNs dominate classification tasks. For temporal ultrasound data, RNNs and LSTMs are employed. Transformer models are increasingly adopted for multimodal data integration. A key advantage of DL is its capacity to learn complex relationships directly from imaging data; with transfer learning and related techniques, it can deliver reliable results even when datasets are modest in size.
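For contrast with the ML pipeline above, here is a minimal end-to-end CNN classifier in PyTorch; the architecture and input sizes are illustrative, not any specific published model.

```python
# Minimal sketch of an end-to-end CNN classifier: convolutional layers
# learn the features, a linear head produces benign/malignant scores.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = SmallCNN()
dummy = torch.randn(4, 1, 224, 224)   # batch of 4 grayscale image crops
logits = model(dummy)                 # shape (4, 2): benign/malignant scores
print(logits.shape)
```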
Hybrid and ensemble models combine ML and DL strengths through stacking, bagging, boosting, or fusing data from multiple modalities. XAI tools, including Grad-CAM, LIME, SHAP, and attention mechanisms, are increasingly incorporated into these hybrid systems to improve transparency. The review emphasizes that both ML and DL approaches can be enhanced by incorporating additional parameters such as demographic information, risk factors, and molecular profiles. However, combining different data types into a single model remains a key challenge, and the principle of "garbage in, garbage out" is especially relevant for data quality in AI-driven medical research.
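A minimal sketch of one such ensemble pattern, stacking an SVM and a random forest under a logistic-regression meta-learner with scikit-learn; the input features are synthetic and could equally be radiomic features, deep features, or a fusion of both.

```python
# Minimal sketch of a stacked ensemble: base classifiers' predictions
# are combined by a logistic-regression meta-learner.
import numpy as np
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 40))      # placeholder fused features
y = rng.integers(0, 2, size=300)

stack = StackingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),
        ("rf", RandomForestClassifier(n_estimators=100)),
    ],
    final_estimator=LogisticRegression(),   # meta-learner on base predictions
    cv=5,
)
print(cross_val_score(stack, X, y, cv=3).mean())
```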
The authors performed a systematic search across three major databases: Scopus, Web of Science, and ScienceDirect. The search combined terms including "breast cancer" with modality-specific terms ("mammography," "ultrasound," or "thermography"), AI-related terms ("artificial intelligence," "machine learning," "deep learning"), XAI terms ("explainable AI," "XAI"), and LLM-related terms ("large language models," "LLMs," "multimodal AI"). The search was restricted to peer-reviewed articles published in English between 2020 and 2025.
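An illustrative query of this form, reconstructed from the terms listed above rather than copied from the authors' verbatim search string:

```
"breast cancer" AND ("mammography" OR "ultrasound" OR "thermography")
  AND ("artificial intelligence" OR "machine learning" OR "deep learning"
       OR "explainable AI" OR "XAI"
       OR "large language models" OR "LLMs" OR "multimodal AI")
```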
Studies were included if they reported on ML, DL, or XAI techniques for breast cancer detection, classification, or diagnosis in at least one of the three imaging modalities. Exclusion criteria removed studies not related to breast imaging, those lacking sufficient methodological detail, and purely technical papers without clinical application. The authors also screened reference lists of key articles and reviews to identify additional relevant publications.
The authors acknowledge that given the enormous number of publications on AI and breast cancer detection, they applied strict inclusion criteria focused on studies with clearly described AI methods and evaluation metrics, research demonstrating direct relevance to clinical or diagnostic applications, and papers providing sufficient methodological detail for reproducibility. This approach prioritized quality and clinical relevance over comprehensiveness. As a result, for some modalities such as ultrasound, fewer studies met all criteria, which explains why a limited number are presented despite the broader literature available.
The review found that approximately 44% of studies published between 2020 and 2025 used SVM as the primary classifier for breast cancer detection, most frequently with mammogram images. ANNs (including RNNs, CNNs, and transformer-based networks) were the second most common, also primarily applied to mammographic data. Other methods, such as kNN, decision trees, fuzzy logic, naive Bayes, random forests, and logistic regression, each accounted for between 2% and 10% of the studies reviewed.
Several mammography studies achieved exceptional results. Ahmad et al. (2024) proposed a CAD system combining YOLOv7 for lesion recognition, Associated-ResUNet for segmentation, and BreastNet-SVM (AlexNet-based) for classification on the CBIS-DDSM dataset, achieving 99.16% accuracy, 97.13% sensitivity, and 99.30% specificity. Mahmood et al. (2024) introduced a hybrid CNN+LSTM and CNN+SVM approach with modified VGGNet and SEResNet152 models using transfer learning, reaching an AUC of 0.99 and a sensitivity of 0.99 on the MIAS and INbreast datasets. Umamaheswari et al. (2024) developed ViT-MAENB7, a hybrid combining EfficientNetB7 and Vision Transformer architectures, achieving 96.6% accuracy with 93.4% precision and a 94.9% F1-score.
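For reference, the accuracy, sensitivity, specificity, and AUC figures reported throughout this review follow the standard confusion-matrix definitions; a minimal sketch with scikit-learn and synthetic labels:

```python
# How the reported metrics are typically computed: sensitivity (true
# positive rate), specificity (true negative rate), accuracy, and AUC.
# y_true / y_score here are synthetic placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1])  # model probabilities
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, specificity, accuracy, roc_auc_score(y_true, y_score))
```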
The large-scale MASAI study from Sweden, involving over 105,000 women, provided real-world clinical evidence. The AI-supported screening system (Transpara) demonstrated a 29% increase in cancer detection compared with the standard double-reading routine, with higher positive predictive values for recalls and lower false-positive rates. Importantly, screen-reading workload decreased by 44.2%. Mannarsamy et al. (2025) presented a SIFT-CNN integrated fuzzy decision tree method on CBIS-DDSM, achieving up to 99.74% accuracy for benign case classification, while Puttegowda et al. (2025) used YOLOv3, Faster R-CNN, and RetinaNet on the DDSM, INbreast, and AIIMS datasets, reaching 98.8% accuracy, 98.5% sensitivity, and an AUC of 0.99.
Despite these high performance numbers, the authors note that many models were developed and tested on small or homogeneous datasets. The need for further validation on larger, more diverse datasets remains a common challenge across mammography-based AI studies.
AI-enabled ultrasound imaging demonstrated strong potential for accurate and interpretable breast cancer diagnosis. Ametefe et al. (2025) explored deep transfer learning using pre-trained CNNs (VGG16, VGG19, EfficientNetB3) for classification and U-Net for segmentation on 780 ultrasound images. VGG19 performed best with 95.5% accuracy, 97% specificity, and 96.9% precision, while U-Net achieved an average Dice similarity coefficient of 85.97% for tumor segmentation. The study noted limitations including high computational requirements and class imbalance in datasets.
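A minimal sketch of the two ingredients described here: transfer learning with a frozen, ImageNet-pretrained VGG19 backbone, and the Dice similarity coefficient used to score segmentations. The masks below are random placeholders.

```python
# Transfer learning: freeze the pretrained convolutional backbone,
# replace the final layer with a new benign/malignant head.
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
for p in vgg.features.parameters():
    p.requires_grad = False               # freeze convolutional backbone
vgg.classifier[6] = nn.Linear(4096, 2)    # new two-class output layer

def dice_coefficient(pred: torch.Tensor, target: torch.Tensor,
                     eps: float = 1e-6) -> torch.Tensor:
    """Dice = 2|A intersect B| / (|A| + |B|) on binary masks."""
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

mask_pred = (torch.rand(1, 1, 128, 128) > 0.5).float()   # placeholder masks
mask_true = (torch.rand(1, 1, 128, 128) > 0.5).float()
print(dice_coefficient(mask_pred, mask_true))
```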
Wang et al. (2025) introduced ABUS-Net, a graph convolutional network (GCN)-based model for Automated Breast Ultrasound. Unlike traditional 3D patch-based methods, ABUS-Net uses coronal plane features and models spatial relations between tumor slices through a graph-based structure, built on ResNet50 for multi-scale feature extraction. It achieved 96.6% accuracy and 94.9% F1-score on private and public datasets. Kiran et al. (2024) developed a hybrid model combining EfficientNetB3 with kNN classification and PCA for dimensionality reduction, reporting 100% accuracy, precision, recall, and F1-score on a curated ultrasound dataset, outperforming VGG16, AlexNet, and VGG19 baselines.
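A minimal sketch of the hybrid pattern attributed to Kiran et al.: deep features compressed with PCA and classified with kNN. Here the deep features are simulated with random vectors; in practice they would come from an EfficientNetB3 penultimate layer (whose feature size is 1,536).

```python
# Hybrid DL+ML sketch: CNN-derived feature vectors, PCA, then kNN.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
deep_feats = rng.normal(size=(400, 1536))  # stand-in for EfficientNetB3 features
labels = rng.integers(0, 2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(deep_feats, labels, random_state=0)
clf = Pipeline([
    ("pca", PCA(n_components=50)),             # compress deep features
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```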
Tian et al. (2024) proposed a diagnostic model combining traditional radiomics with DL features extracted from pretrained MNASNet, using a dataset of 1,050 annotated cases. By fusing handcrafted radiomics features with deep features and applying ensemble ML classifiers (SVM, XGBoost, LightGBM), the model achieved a balanced accuracy of 0.964 and AUC of 0.981. Multi-center validation demonstrated the model's generalizability. Liu et al. (2020) took a more traditional approach using SVM with edge-based and morphological features, achieving 82.69% accuracy, 93.55% specificity, and 87.5% positive predictive value on 192 images, demonstrating that even simpler feature engineering approaches can contribute meaningful diagnostic value in ultrasound imaging.
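A minimal sketch of the early-fusion idea described for Tian et al.: handcrafted radiomic features concatenated with deep features, then fed to an ensemble. scikit-learn's GradientBoostingClassifier stands in for XGBoost/LightGBM, and all features are synthetic.

```python
# Early fusion of handcrafted and deep features, then soft voting.
import numpy as np
from sklearn.ensemble import VotingClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
radiomic = rng.normal(size=(500, 60))   # placeholder texture/shape features
deep = rng.normal(size=(500, 128))      # placeholder deep features
y = rng.integers(0, 2, size=500)

X_fused = np.concatenate([radiomic, deep], axis=1)   # simple early fusion

ens = VotingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("gb", GradientBoostingClassifier())],  # XGBoost/LightGBM stand-in
    voting="soft",
)
print(cross_val_score(ens, X_fused, y, cv=3).mean())
```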
Thermography-based AI systems have achieved notable diagnostic accuracy, often exceeding 97%, despite thermography not yet being approved as a standalone diagnostic tool. Ekici et al. (2020) developed a CNN model optimized by Bayesian algorithms, trained on 3,895 thermographic images from 140 patients in the Mastology Research Database. The model achieved 98.95% testing accuracy, demonstrating that DL-enhanced thermography has potential as a complement or alternative to mammography, especially for young women with dense breast tissue.
Mohamed (2022) proposed an automatic framework combining U-Net for breast region segmentation (removing noise from neck and shoulders) with a nine-layer CNN for classification. Tested on 1,000 frontal thermographic images from the DMR-IR dataset (500 normal, 500 abnormal), it achieved 99.33% accuracy, 100% sensitivity, and 98.67% specificity, outperforming pre-trained models like VGG16 and ResNet18 as well as traditional classifiers such as SVM and kNN. Allugunti (2022) compared CNN, SVM, and RF classifiers on over 1,000 thermal images from Kaggle, with CNN achieving 99.65% accuracy versus 89.84% for SVM and 90.55% for RF.
Civilibal (2023) employed Mask R-CNN with transfer learning using ResNet-50 pre-trained on the COCO dataset, processing thermal images from 56 women (19 healthy, 37 with tumors). The system performed simultaneous detection, segmentation, and classification, achieving 97.1% accuracy, mAP of 0.921, and 86.8% segmentation overlap. Ramacharan (2024) presented HERA-Net, integrating VGG19, U-Net, GRU, and ResNet-50 for deep feature extraction, segmentation, temporal analysis, and classification. Preprocessing included grayscale conversion, CLAHE, bilateral filtering, and NLMS filtering, with LBP and HOG for feature extraction. Trained on 3,534 thermographic images from the DMR database, it achieved 99.86% accuracy, 100% sensitivity, and 99.81% specificity.
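A minimal sketch of the preprocessing and feature-extraction chain described for HERA-Net, using OpenCV and scikit-image on a synthetic frame; the NLMS step and the downstream deep networks are omitted, and all parameter values are illustrative.

```python
# Grayscale thermal frame -> CLAHE -> bilateral filter -> LBP + HOG features.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern, hog

img = (np.random.rand(256, 256) * 255).astype(np.uint8)  # placeholder frame

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
img_eq = clahe.apply(img)                         # contrast-limited equalization
img_smooth = cv2.bilateralFilter(img_eq, d=9, sigmaColor=75, sigmaSpace=75)

lbp = local_binary_pattern(img_smooth, P=8, R=1, method="uniform")  # texture codes
hog_vec = hog(img_smooth, pixels_per_cell=(16, 16), cells_per_block=(2, 2))

features = np.concatenate([np.histogram(lbp, bins=10)[0], hog_vec])
print(features.shape)
```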
Resmini et al. (2021) combined dynamic infrared thermography (DIT) and static infrared thermography (SIT) in a hybrid approach. The screening phase used k-means clustering with the K-Star algorithm, reaching 98.57% accuracy. The diagnostic phase used GLCM, LTP, wavelets, and fractal dimensions for feature extraction with SVM classification, achieving 94.61% accuracy and 94.87% AUC. Although these results are promising, the authors emphasize that further validation on diverse, real-world datasets is essential for widespread clinical adoption.
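A minimal sketch of the diagnostic-phase pattern described for Resmini et al.: GLCM texture features feeding an SVM, via scikit-image and scikit-learn on synthetic thermogram patches (LTP, wavelet, and fractal features are omitted).

```python
# GLCM texture descriptors per image, then SVM classification.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.svm import SVC

def glcm_features(img: np.ndarray) -> np.ndarray:
    glcm = graycomatrix(img, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

rng = np.random.default_rng(4)
images = (rng.random((60, 64, 64)) * 255).astype(np.uint8)  # placeholder patches
y = rng.integers(0, 2, size=60)

X = np.array([glcm_features(im) for im in images])
clf = SVC().fit(X, y)
print(clf.score(X, y))   # training accuracy on the toy data
```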
Large Language Models: LLMs such as GPT-4, LLaMA, Gemini, and DeepSeek represent a new frontier in breast cancer diagnostics. Models like ChatGPT-4 and Gemini have shown the ability to align with multidisciplinary tumor board recommendations, with reported accuracies ranging from 70% to 98%. Rao et al. (2023) tested GPT-4 and GPT-3.5 on breast cancer screening prompts, finding that GPT-4 achieved 98.4% accuracy on select-all-that-apply (SATA) screening prompts and 77.7% on breast pain questions, compared to GPT-3.5's 88.9% and 58.3%, respectively. Sorin et al. (2023) evaluated ChatGPT-3.5 against actual tumor board decisions for 10 breast cancer patients and found 70% agreement, though the model occasionally omitted critical clinical details such as HER2 status.
Multimodal LLMs: MLLMs process multiple data types (text, images, audio) within a single framework. Models such as GPT-4o, LLaVA, CLIP-ViT, and Flamingo combine transformers with vision models (CNNs or ViTs). Guo et al. (2024) proposed KAMnet, a framework using contrast-enhanced ultrasound and B-mode ultrasound videos with temporal attention and feature fusion, achieving 90.91% sensitivity, 88.24% accuracy, and AUC of 0.943 on 332 cases. Nakach et al. (2024) reviewed 47 studies and found multimodal fusion methods significantly improve prognostic performance compared to unimodal models, with most studies reporting accuracy above 80%. However, GPT-4V showed only 35.2% accuracy for ultrasound pathology identification versus 66.7% on X-rays, highlighting inconsistent performance across modalities.
Explainable AI: XAI tools are critical for clinical adoption. The most relevant methods for medical imaging include LIME (Local Interpretable Model-agnostic Explanations), SHAP (Shapley values for feature contribution), saliency maps and gradient-based methods (including Grad-CAM and layer-wise relevance propagation, LRP), feature importance scores from tree-based models, counterfactual explanations, anchors, surrogate models, and Bayesian Network-based knowledge representation. In breast imaging, tools like Grad-CAM generate heatmaps that highlight the regions most influential for the model's decision, resembling how radiologists interpret suspicious areas. However, a major concern is saliency map instability, where the same input image may generate different explanation maps depending on model architecture or training conditions, meaning XAI outputs should be treated as supportive evidence rather than definitive diagnostic proof.
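A minimal Grad-CAM sketch in PyTorch shows the core mechanism: gradients of the target class score with respect to the last convolutional feature map are pooled into channel weights and used to form a coarse heatmap. The backbone here is torchvision's ResNet18 with random weights, purely for shape demonstration.

```python
# Grad-CAM: pool gradients over spatial dims, weight the feature maps,
# ReLU, and upsample to image size to get a class-discriminative heatmap.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()
feats, grads = {}, {}

def fwd_hook(module, inputs, output): feats["a"] = output
def bwd_hook(module, grad_in, grad_out): grads["a"] = grad_out[0]

layer = model.layer4[-1]                      # last convolutional block
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)               # placeholder image
score = model(x)[0].max()                     # top-class logit
score.backward()

weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # pooled gradients
cam = F.relu((weights * feats["a"]).sum(dim=1))      # weighted channel sum
cam = F.interpolate(cam.unsqueeze(0), size=(224, 224), mode="bilinear")
print(cam.shape)   # (1, 1, 224, 224) heatmap to overlay on the input
```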
Haver et al. (2023) tested ChatGPT on 25 patient questions about breast cancer prevention and BI-RADS screening, finding responses appropriate in 88% of cases, unreliable in 8%, and inappropriate in 4%. LLM limitations include risk of hallucination, lack of source attribution, data bias, cybersecurity risks, and sensitivity to prompt formulation. The review recommends developing oncology-specific domain models (such as Med-PaLM M), improving interpretability through human-in-the-loop validation, and establishing LLM-specific reporting standards.
Dataset limitations: The most significant barrier to clinical AI deployment is the dependence on large, high-quality labeled databases. Many reviewed studies used small or homogeneous datasets, creating overfitting risk and limiting generalizability. Medical images collected from multiple centers introduce variations in equipment, imaging techniques, and clinical protocols, contributing to data inconsistency. Only 22 of 254 identified mammography datasets were openly accessible. Thermography datasets are even more scarce, with the Mastology Research Database (287 individuals) being the primary resource.
Standardization gaps: The field lacks standardized benchmarks and evaluation protocols, making it difficult to compare and reproduce results across studies. Many models are developed on narrow datasets without K-fold cross-validation or validation on external, collaborative cohorts. While commercial AI systems such as Transpara, ProFound AI, and Lunit INSIGHT are already in clinical use, they are often limited to specific imaging modalities and face challenges related to transparency and integration across diverse healthcare settings.
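A minimal sketch of the stratified K-fold protocol the review finds missing from many studies, with synthetic data; in a rigorous setup, folds or held-out cohorts would come from different centers.

```python
# Stratified K-fold cross-validation: each fold preserves the class
# balance, and performance is reported as mean and spread across folds.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(250, 30))      # placeholder feature matrix
y = rng.integers(0, 2, size=250)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv)
print(scores.mean(), scores.std())  # report mean and std across folds
```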
Learning paradigm gaps: Most current studies rely on supervised learning, which performs well with abundant labeled data but degrades when data is limited. Unsupervised and semi-supervised learning methods, as well as reinforcement learning paradigms, remain insufficiently explored. These alternative approaches could be valuable for developing AI models that are more adaptive, perform well under data scarcity, and can learn with minimal supervision. The review also notes that the state-of-the-art has advanced beyond legacy architectures like VGGNet toward Vision Transformers (ViT), Swin Transformers, EfficientNetV2, ConvNeXt, and foundation models like SAM (Segment Anything Model), which offer better accuracy, faster convergence, and cross-modality adaptability.
Future directions: The review recommends several priority areas. First, creation of multicenter, high-quality datasets that represent diverse patient populations and imaging conditions. Second, investigation of hybrid and unsupervised learning methods that can handle data scarcity. Third, development of understandable, adaptive AI systems capable of sharing data across modalities and clinical settings. Fourth, better integration of imaging, genomics, and clinical data through multimodal models. Fifth, improvement of XAI methods to address saliency instability and support clinical trust. Finally, the authors call for development of oncology-specific LLMs, improved regulatory frameworks, and rigorous multi-site validation before these technologies can be responsibly adopted in routine clinical practice.