Explainable artificial intelligence in breast cancer detection and risk prediction: A systematic scoping review

Cancer Innovation (Open Access), 2024

Plain-English Explanations
Pages 1-2
Why Explainable AI Matters for Breast Cancer Diagnosis

Breast cancer (BC) is one of the most common cancers worldwide, with high morbidity and mortality. Early detection and treatment significantly improve survival chances, and computer-aided diagnosis (CAD) systems powered by artificial intelligence (AI) have become valuable tools for detection, classification, and diagnosis. However, many AI algorithms behave as "black boxes," meaning clinicians cannot understand how or why the model reached a particular decision. This lack of transparency raises serious concerns about accountability, fairness, and trust in high-stakes clinical settings where a minor error can have irreparable consequences.

The explainability gap: Explainable artificial intelligence (XAI), a term coined by DARPA, has emerged as a field dedicated to making AI systems more transparent, understandable, interpretable, safe, and reliable. XAI methods act as a bridge between model predictions and human understanding. After an AI model generates a result, a suitable XAI method processes that output and creates transparency that allows a human agent (typically a clinician collaborating with an XAI expert) to validate, confirm, or enhance the prediction. If the results are unsatisfactory, the XAI module investigates the model, data, or both, iterating until consensus is reached.

The human-in-the-loop (HITL) concept: The authors emphasize that XAI alone is insufficient for trustworthiness. The HITL framework combines XAI transparency with active human participation in decision-making. This means clinicians do not passively accept AI outputs but instead use XAI-generated explanations to apply their domain expertise, identifying whether the model focused on clinically relevant features or was misled by artifacts. This iterative collaboration between AI systems and human experts is central to achieving reliable, trustworthy AI-enabled healthcare.

Scope of this review: While most existing review articles on XAI focus on general healthcare applications, this paper specifically targets breast cancer screening, risk detection, and prediction. The authors note that their contribution is distinctive in both scope and breadth compared to prior reviews, which have covered XAI in other cancer types but not comprehensively in the breast cancer domain.

TL;DR: AI models for breast cancer diagnosis are often black boxes that clinicians cannot trust. Explainable AI (XAI) bridges this gap by making predictions transparent and interpretable. This scoping review systematically examines how XAI methods have been applied specifically to breast cancer detection and risk prediction across 30 published studies.
Pages 2-5
How XAI Methods Are Classified: Scope, Stage, and Type

The accuracy-explainability trade-off: AI models exist on a spectrum between accuracy and explainability. White-box models (such as linear regression, logistic regression, and decision trees) are intrinsically transparent and easy to interpret but are limited to learning linear associations and may not achieve high accuracy. Black-box models (such as deep neural networks and complex ensembles) can deliver outstanding performance but are non-transparent by nature. Gray-box models strike a balance, where connections from input data to output can be explained despite the model not being fully transparent. The fundamental challenge is that achieving both high accuracy and high explainability simultaneously remains non-trivial.

Scope of explanation (local vs. global): Local XAI methods explain why a particular decision was made for a specific input by highlighting the features that influenced the model's output. However, local methods cannot identify general relationships between features and outputs. Global XAI methods provide a broader understanding by analyzing the model's overall structure and general patterns across the entire dataset, helping users understand biases, limitations, and general decision-making patterns.

Stage of explanation (intrinsic vs. post hoc): Intrinsic methods use white-box models that are interpretable by nature, like decision trees or linear models. Post hoc methods explain model predictions after the training and inference processes are complete; they typically accompany the more accurate systems, because they can be applied to black-box models, which inherently perform better. Post hoc methods are further divided into model-specific approaches (designed for specific architectures, such as gradient-based methods for CNNs) and model-agnostic approaches (providing explanations independent of the underlying AI model, such as SHAP and LIME).

Type of explanation: XAI outputs fall into four categories. Feature importance methods assign numerical values to input features reflecting their contribution. White-box model methods create an interpretable surrogate that mimics the original black-box model. Example-based methods use training samples to explain the model's behavior. Visual explanation methods produce heatmaps or saliency maps highlighting relevant image regions. These categories map directly to the XAI techniques examined in the 30 reviewed breast cancer studies.

TL;DR: XAI methods are classified along three axes: scope (local vs. global), stage (intrinsic vs. post hoc), and type (feature importance, white-box surrogates, example-based, or visual). Post hoc model-agnostic methods like SHAP and LIME can explain any model, while model-specific methods like Grad-CAM and LRP target specific deep learning architectures.
Pages 5-6
Systematic Search Strategy: PRISMA Framework and Study Selection

Review framework: This systematic scoping review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, executed in three structured steps. The search covered publications from January 2017 to July 2023, focusing on peer-reviewed studies implementing XAI methods in breast cancer datasets. The time range begins in 2017, when the concept of XAI gained significant traction following DARPA's program announcement.

Database search (Step 1): The authors conducted a comprehensive search across four major databases using Boolean search strings combining terms like "Explainable Artificial Intelligence," "XAI," "Interpretable Machine Learning," and "Breast Cancer." Scopus returned 104 results, PubMed returned 30 results, IEEE Xplore returned 9 results, and Google Scholar (first 50 citations) returned 50 results. In total, 193 studies were identified in this initial search step.

Study selection (Step 2): Two independent reviewers screened citations by title and abstract against specific inclusion criteria: studies had to be original, published in peer-reviewed English-language journals, and utilize at least one XAI methodology in the breast cancer context. The screening excluded review papers (n=16), XAI studies unrelated to breast cancer (n=18), breast cancer studies unrelated to XAI (n=20), preprints (n=7), conference papers (n=34), duplicate titles (n=37), and non-research materials such as books, dissertations, and editorials (n=12). This eliminated 144 studies, leaving 49 articles for full-text scrutiny. After further review for accessibility and relevance, 30 studies met the final inclusion criteria.
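
The screening arithmetic reported above can be checked directly. A quick sketch, with the counts taken from the review's PRISMA flow:

```python
# PRISMA flow counts as reported in the review (Steps 1-2)
identified = {"Scopus": 104, "PubMed": 30, "IEEE Xplore": 9,
              "Google Scholar": 50}
excluded = {"review papers": 16, "XAI unrelated to BC": 18,
            "BC unrelated to XAI": 20, "preprints": 7,
            "conference papers": 34, "duplicate titles": 37,
            "non-research materials": 12}

total_identified = sum(identified.values())          # 193 studies
total_excluded = sum(excluded.values())              # 144 studies
full_text_pool = total_identified - total_excluded   # 49 for full-text review
```

The final step, 49 full-text articles narrowed to 30, came from the additional accessibility and relevance review and is not captured by the exclusion counts above.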

Data extraction (Step 3): A structured data extraction form was developed in Google Sheets covering eight variables: authors, year, objective, dataset(s), data type, important features, type of AI (ML or DL), and the explained model. Two reviewers independently extracted data from all 30 included articles, resolving any disagreements through consensus. This systematic approach ensures reproducibility and minimizes selection bias in the review findings.

TL;DR: Following PRISMA guidelines, the authors searched Scopus, PubMed, IEEE Xplore, and Google Scholar for studies from 2017-2023. From 193 initial results, 144 were excluded against defined criteria (reviews, off-topic studies, preprints, conference papers, duplicates, non-research materials), and 30 peer-reviewed studies implementing XAI in breast cancer datasets were included in the final analysis.
Pages 7-9
SHAP Dominates as the Most Used Model-Agnostic XAI Method (13/30 Studies)

How SHAP works: SHapley Additive exPlanations (SHAP) is grounded in cooperative game theory, specifically the Shapley value concept. In SHAP, input features of an observation act as "players" in a game, and the model's prediction serves as the "reward." SHAP computes the average marginal contribution of each feature to the prediction, ensuring that the distribution of credit among features is mathematically fair. This provides both local explanations (why this specific prediction was made) and global explanations (which features matter most overall).
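
The "players and reward" idea can be made concrete with a small, self-contained sketch that computes exact Shapley values by averaging marginal contributions over all feature orderings. The toy `predict` function and its feature names are invented for illustration; the real SHAP library uses much faster approximations:

```python
from itertools import permutations

def shapley_values(features, value_fn):
    """Exact Shapley values: average each feature's marginal
    contribution to the prediction over all feature orderings."""
    phi = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            phi[f] += value_fn(frozenset(present)) - before
    return {f: v / len(orderings) for f, v in phi.items()}

# Toy risk "model" with invented feature names; note the interaction
# term, whose credit the Shapley values split fairly between features.
def predict(present):
    score = 0.1  # baseline output with no features known
    if "tumor_size" in present:
        score += 0.4
    if "node_status" in present:
        score += 0.2
    if {"tumor_size", "node_status"} <= present:
        score += 0.1  # interaction term
    return score

phi = shapley_values(["tumor_size", "node_status"], predict)
# Efficiency property: the values sum to f(all) - f(none) = 0.7
```

The efficiency property shown in the final comment is what makes the credit assignment "mathematically fair": the feature contributions always account for exactly the gap between the full prediction and the baseline.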

Dominance in breast cancer studies: SHAP was the most frequently used XAI method, appearing in 13 out of 30 studies (43%). Notably, 12 of 13 SHAP studies used ensemble machine learning models as the underlying predictors, with XGBoost appearing in 9 of 13 studies as the most common pairing. No deep learning models were used with SHAP in any of the reviewed studies. This pattern exists because SHAP has a high-speed algorithm (TreeSHAP) specifically optimized for tree-based models (XGBoost, CatBoost, GBM, AdaBoost, LightGBM), making the computation tractable and efficient.

Clinical applications of SHAP: Across the 13 studies, SHAP was applied to diverse breast cancer tasks. Chakraborty et al. used SHAP to reveal that boosting B cell and CD8+ T cell fractions in the tumor microenvironment (TME) above their inflection points could increase 5-year survival rates by up to 18%. Rezazadeh et al. applied SHAP to ultrasound texture analysis based on gray-level co-occurrence matrix (GLCM) features to predict malignancy likelihood. Other studies used SHAP for breast cancer subtyping from genomic data (TCGA), predicting radiation-induced lymphopenia, detecting diagnostic biomarkers from peripheral blood mononuclear cells (PBMC), predicting distant metastasis in male breast cancer, and assessing lymph node metastasis status for neoadjuvant systemic therapy (NST) eligibility.

Why SHAP over LIME: Among the 18 studies using model-agnostic methods, SHAP was strongly preferred over LIME. The review attributes this to three factors: SHAP is relatively easy to implement, it provides both local and global explanations (whereas LIME only offers local), and SHAP operates at higher speed for global-level explanations on high-performance ensemble ML models. SHAP's compatibility with tree-based models and its mathematical grounding in game theory make it particularly well-suited for tabular clinical and genomic data in breast cancer research.

TL;DR: SHAP was the top XAI method (13/30 studies, 43%), almost exclusively paired with tree-based ensemble models like XGBoost (9/13 studies). It was used for survival analysis, tumor microenvironment investigation, ultrasound texture analysis, biomarker detection, and metastasis prediction. SHAP's game-theory foundation, dual local/global scope, and speed with tree models explain its dominance.
Pages 9-10
LIME: Surrogate Model Explanations in 5 Studies

How LIME works: Local Interpretable Model-Agnostic Explanations (LIME) provides local explanations by creating a linear surrogate model around a specific data point. LIME perturbs the input features and generates modified instances to observe how the output changes, building a simple interpretable model in the neighborhood of the sample. For image data, perturbation might involve replacing certain image regions with gray pixels. The surrogate model then produces feature importance values that highlight which features most influenced the prediction for that specific instance.
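
A stripped-down version of this procedure can be sketched in pure Python: perturb an instance with random keep/drop masks, query the model, and fit a linear surrogate whose coefficients serve as local importances. This is illustrative only; the actual LIME implementation also weights perturbed samples by their proximity to the original instance and supports image superpixels:

```python
import random

def solve(A, b):
    """Gauss-Jordan elimination for a small linear system A w = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        for r in range(n):
            if r != c:
                f = M[r][c] / pivot
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lime_sketch(x, predict, n_samples=2000, seed=0):
    """Perturb instance x with random keep/drop masks (dropped
    features revert to a zero baseline), query the model, and fit
    an ordinary-least-squares linear surrogate; its coefficients
    act as local feature importances."""
    rng = random.Random(seed)
    names = list(x)
    k = len(names)
    rows, ys = [], []
    for _ in range(n_samples):
        mask = [1.0 if rng.random() < 0.5 else 0.0 for _ in names]
        z = {f: x[f] * m for f, m in zip(names, mask)}
        rows.append(mask + [1.0])  # final column = intercept
        ys.append(predict(z))
    d = k + 1
    A = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(d)]
    w = solve(A, b)  # normal equations: (X^T X) w = X^T y
    return dict(zip(names, w[:k]))

# For a model that is already linear, the surrogate recovers it exactly
imp = lime_sketch({"a": 1.0, "b": 1.0}, lambda z: 2 * z["a"] + 3 * z["b"] + 1.0)
```

For a genuinely nonlinear model the surrogate is only faithful in the neighborhood of `x`, which is exactly the "local" in LIME's name.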

Studies using LIME (5/30): LIME appeared in 5 out of 30 reviewed studies. In Kaplun et al., LIME explained image classification from the BreakHis histopathological database by placing yellow pixel masks to highlight important image segments the model focused on when classifying breast cancer cell images. Saarela and Jauhiainen compared linear (logistic regression) and nonlinear (random forest) ML classifiers with LIME, finding that the nonlinear model offered better explainability by focusing on fewer features (the random forest relied on nine features, whereas the logistic regression used all but one). In Adnan et al., SHAP and LIME were used together to explain that small, biologically compact gene cluster features achieved similar or better performance than classifiers built with many more individual genes. Maouche et al. used cost-sensitive CatBoost with LIME to rank factors by their impact on distant metastasis, from high (nonuse of adjuvant chemotherapy) through moderate to low. Deshmukh et al. used LIME with a hybrid classical-quantum clustering approach (qk-means) on breast cancer data.

LIME vs. SHAP trade-offs: While LIME only offers local interpretation (unlike SHAP's local and global capabilities), it has an advantage when a large volume of individual predictions needs to be explained quickly: LIME can be faster than SHAP for high-volume local explanations. However, across the model-agnostic studies (18/30 total), SHAP was preferred for its dual-scope explanations and its speed advantage when computing global-level explanations for ensemble ML models.

TL;DR: LIME appeared in 5/30 studies, creating local surrogate models that perturb inputs to identify important features. It was applied to histopathology image classification, gene cluster analysis, and metastasis prediction. While LIME can be faster for individual predictions, SHAP was generally preferred because it offers both local and global explanations.
Pages 10-12
Visual Explanation Methods: CAM (5 Studies) and Grad-CAM (4 Studies) for Image-Based Detection

Class Activation Map (CAM): CAM is a local, propagation-based method that uses a global average pooling (GAP) layer after the last convolutional layer to identify the most discriminative regions of an image within a CNN. It combines a linear weighted sum of feature maps to produce a heatmap highlighting class-specific regions. CAM appeared in 5 out of 30 studies, all focused on imaging modalities. Qi et al. proposed two CNN-based networks (Mt-Net and Sn-Net) that used CAM as an enhancement mechanism to improve classification of malignant tumors and solid nodules from breast ultrasound. Other studies applied CAM to mammography (MIAS, DDSM datasets), MRI, and ultrasound from multiple hospital cohorts involving thousands of patients. CAM is limited to CNN architectures that specifically include a GAP layer before classification.
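
The "linear weighted sum of feature maps" reduces to a few lines of code. A minimal sketch, with plain Python lists standing in for tensors and the weights taken from the classifier connection of each GAP-pooled channel:

```python
def class_activation_map(feature_maps, class_weights):
    """CAM: heatmap = ReLU(sum_k w_k * A_k), where A_k is the k-th
    feature map of the last conv layer and w_k is the classifier
    weight linking its GAP-pooled value to the target class."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for wk, fmap in zip(class_weights, feature_maps):
        for i in range(h):
            for j in range(w):
                cam[i][j] += wk * fmap[i][j]
    relu = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in relu) or 1.0  # avoid div-by-zero
    return [[v / peak for v in row] for row in relu]

# Two 2x2 feature maps; channel 0 fires top-left, channel 1 bottom-right
maps = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 2.0]]]
heatmap = class_activation_map(maps, class_weights=[1.0, 0.5])
```

In practice the resulting low-resolution map is upsampled to the input image size and overlaid as a heatmap; the GAP-layer requirement mentioned above exists because these `class_weights` only have this one-weight-per-channel form when a GAP layer sits directly before the classifier.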

Gradient-weighted Class Activation Mapping (Grad-CAM): Grad-CAM builds on CAM but removes the architectural constraint, as it can be applied to any CNN architecture without retraining or modification as long as the layers are differentiable. Grad-CAM uses the feature maps from the last convolutional layer to create a coarse localization heatmap where "hot" regions correspond to the predicted class. It appeared in 4 out of 30 studies. Hussain et al. combined Grad-CAM with LIME to investigate how two different XAI methods explain misclassification of breast masses in digital breast tomosynthesis (DBT). Gerbasi et al. implemented Deep SHAP alongside Grad-CAM for mammogram microcalcification analysis, producing maps where pink pixels strongly contributed to malignant predictions and blue pixels to benign predictions.
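
Grad-CAM differs from CAM only in where the channel weights come from: each weight is the spatial average of the class score's gradient with respect to that feature map. A sketch, assuming the gradient maps have already been computed (e.g., by a framework's autograd):

```python
def grad_cam(feature_maps, gradient_maps):
    """Grad-CAM: alpha_k = spatial mean of dScore/dA_k; heatmap =
    ReLU(sum_k alpha_k * A_k). Gradients are assumed precomputed."""
    weights = [sum(sum(row) for row in g) / (len(g) * len(g[0]))
               for g in gradient_maps]
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for alpha, fmap in zip(weights, feature_maps):
        for i in range(h):
            for j in range(w):
                cam[i][j] += alpha * fmap[i][j]
    return [[max(v, 0.0) for v in row] for row in cam]

# One channel whose gradient is uniformly 1 -> weight 1, so the
# heatmap reproduces the (ReLU'd) feature map itself
heat = grad_cam([[[1.0, 2.0], [3.0, 4.0]]], [[[1.0, 1.0], [1.0, 1.0]]])
```

Because the weights come from gradients rather than from a GAP classifier, this works on any differentiable CNN, which is exactly the architectural constraint Grad-CAM removes.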

Grad-CAM++ and LRP: Grad-CAM++, an enhanced version providing better localization of multiple objects in a single image, was used in only 1 out of 30 studies. To et al. applied it with a pretrained DenseNet169 to classify deep ultraviolet whole-slide images (DUV-WSI) as cancerous or benign. Layer-wise Relevance Propagation (LRP), which computes relevance scores backward through the network to highlight critical input regions, appeared in 2 out of 30 studies. Grisci et al. introduced relevance aggregation based on LRP for tabular microarray data (CuMiDa database) using LSTM networks. Chereda et al. extended LRP to Graph-CNNs (GLRP) on genomic breast cancer data, identifying patient-specific molecular subnetworks that could reveal druggable drivers of tumor progression.
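
The "relevance scores backward through the network" idea behind LRP can be illustrated for a single dense layer with the epsilon-rule. This is a one-layer sketch; real LRP applies such rules layer by layer through the whole network:

```python
def lrp_dense(x, W, relevance_out, eps=1e-9):
    """LRP epsilon-rule for one dense layer z_j = sum_i x_i * W[i][j]:
    each output's relevance is redistributed to the inputs in
    proportion to their contribution x_i * W[i][j] to z_j."""
    n_in, n_out = len(x), len(relevance_out)
    z = [sum(x[i] * W[i][j] for i in range(n_in)) for j in range(n_out)]
    R = [0.0] * n_in
    for j in range(n_out):
        denom = z[j] + (eps if z[j] >= 0 else -eps)  # stabilizer
        for i in range(n_in):
            R[i] += x[i] * W[i][j] / denom * relevance_out[j]
    return R

# Identity-like layer: relevance passes straight through, and the
# total relevance is conserved (sums to the same amount on both sides)
R_in = lrp_dense([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [3.0, 4.0])
```

The conservation property (input relevances summing to the output relevance, up to the stabilizer) is what distinguishes LRP from plain gradient saliency.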

Model-specific methods summary: In total, 12 out of 30 studies used model-specific XAI methods. CAM and Grad-CAM were the most frequently used, both being propagation-based, local explanation methods designed for convolutional neural networks. These methods are computationally less intensive than SHAP, making them better suited for real-time image processing applications in clinical settings such as mammography screening and intraoperative margin assessment.

TL;DR: CAM (5 studies) and Grad-CAM (4 studies) dominated the model-specific XAI methods, generating visual heatmaps on breast ultrasound, mammography, MRI, and histopathology images. Grad-CAM++ (1 study) and LRP (2 studies) were less common. These methods are computationally efficient for image data, making them well-suited for real-time clinical imaging applications.
Pages 13-14
How Different XAI Methods Map to Clinical Scenarios

Two primary clinical domains: The 30 reviewed studies focused on two main clinical application areas: (1) diagnosis and classification of breast cancer, and (2) survival and prognosis analysis. Within each domain, studies either employed image recognition techniques on radiology data (ultrasound, mammography, MRI, histopathology) or used alternative approaches with clinical, demographic, and genomic data. The distribution of XAI methods across these domains reveals clear preferences driven by computational requirements and data characteristics.

SHAP for clinical and tabular data: SHAP was predominantly used in studies analyzing clinical, demographic, and genomic data rather than imaging studies. The review attributes this to SHAP's computational intensity, which poses challenges when handling the high-dimensional feature space inherent in image data. SHAP excels at explaining which clinical variables (such as tumor size, hormone receptor status, Ki-67 levels, lymphovascular invasion, or immune cell compositions) most influence survival predictions or subtyping decisions. For 5-year and 10-year invasive disease event prediction, SHAP revealed that age, tumor diameter, surgery type, and therapy-related features were the most important predictors.

CAM and Grad-CAM for imaging: Conversely, CAM and Grad-CAM were the preferred methods for image recognition tasks. These techniques are computationally less intensive than SHAP and naturally produce spatial heatmaps that clinicians can overlay on the original medical images. In diagnosis and classification studies, these visual XAI methods helped clinicians understand which image regions (masses, calcifications, irregular boundaries) the CNN models used to distinguish between healthy and diseased tissue. This alignment between model attention and known radiological features builds clinical confidence in AI-assisted interpretation.

Prognosis and survival analysis: In survival and prognosis models, clinicians sought to predict events such as mortality, metastasis, or treatment response. XAI methods proved instrumental in interpreting each factor's contribution to patient outcomes. For example, Cox Proportional Hazards models explained by SHAP revealed which prognostic factors (age, pathological tumor size, lymph node status) had the greatest impact on recurrence or survival. This interpretability makes the models more understandable, usable, and trustworthy for both clinicians and patients, fostering confidence in the decision-making process.
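
As a worked illustration of how such a model is read: in a Cox model the log-hazard is linear in the covariates, so each factor's effect can be inspected directly. The coefficients and patient values below are invented for the example, not taken from any reviewed study:

```python
import math

# Illustrative Cox proportional-hazards coefficients (log hazard ratios);
# invented numbers, not from the review
beta = {"age_decades": 0.25, "tumor_size_cm": 0.18, "nodes_positive": 0.40}

patient = {"age_decades": 6.0, "tumor_size_cm": 2.5, "nodes_positive": 1.0}
reference = {"age_decades": 5.0, "tumor_size_cm": 1.0, "nodes_positive": 0.0}

# hazard(t | x) = h0(t) * exp(beta . x): the baseline hazard h0(t)
# cancels when comparing two patients, leaving a pure hazard ratio
log_hr = sum(beta[k] * (patient[k] - reference[k]) for k in beta)
hazard_ratio = math.exp(log_hr)
```

Each per-feature term in `log_hr` is exactly the kind of additive contribution that SHAP summary plots visualize for these survival models.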

TL;DR: SHAP dominated studies using clinical, genomic, and tabular data for survival analysis and prognosis, while CAM and Grad-CAM were preferred for image-based diagnosis using ultrasound, mammography, and histopathology. The computational demands of SHAP make it less practical for high-dimensional image data, whereas CAM/Grad-CAM produce efficient spatial heatmaps ideal for clinical imaging workflows.
Pages 2-3, 7-10
The Machine Learning and Deep Learning Architectures Behind the Reviewed Studies

Machine learning models with SHAP: The dominant ML models in the reviewed studies were tree-based ensemble methods. XGBoost appeared in 9 of the 13 SHAP studies, making it the single most commonly explained model. Other ensemble models included LightGBM, GBM, CatBoost, AdaBoost, Extra-Trees (ET), and Gradient Boosted Decision Trees (GBDT). Random Forest (RF) appeared in several studies as both a classifier and comparison model. The preference for tree-based models stems from SHAP's high-speed algorithm specifically optimized for these architectures. Non-ensemble models like logistic regression and decision trees appeared less frequently.

Deep learning models with visual XAI: For image-based studies using CAM, Grad-CAM, Grad-CAM++, and LRP, the underlying models were exclusively deep learning architectures. Popular CNN variants included VGG-16, ResNet (ResNet18, ResNet34, ResNet50), DenseNet (DenseNet121, DenseNet169), Inception V3, GoogLeNet, U-Net, MobileNet-v2, SqueezeNet, and EfficientNet. More specialized architectures included Graph Convolutional Networks (GCN) for genomic data and LSTM networks for tabular microarray data. Federated Learning (FL) was used in one study to preserve patient data privacy while training on histopathological images across institutions.

Datasets used across studies: The reviewed studies drew from a wide range of breast cancer datasets spanning multiple data modalities. Clinical datasets included the Breast Cancer Wisconsin (Diagnostic) dataset, TCGA (The Cancer Genome Atlas), SEER database, and institution-specific clinical records. Imaging datasets included DDSM (Digital Database for Screening Mammography), MIAS (Mammographic Image Analysis Society), BreakHis (histopathological images), INbreast, and institutional ultrasound collections from hospitals in China, Belgium, Italy, Croatia, and Chile. Genomic datasets included NCBI-GEO (Gene Expression Omnibus), CuMiDa (Curated Microarray Database), and ACES (Amsterdam Classification Evaluation Suite).

Data types and modalities: The data types spanned text-based clinical records, ultrasound images, mammography (X-ray), MRI, digital breast tomosynthesis (DBT), microscopic histopathology images, deep ultraviolet whole-slide images (DUV-WSI), genomic/omics data (DNA, RNA, CNV), and blood test data from peripheral blood mononuclear cells (PBMC). This multimodal landscape reflects the complexity of breast cancer diagnosis, where different XAI methods are better suited to different data types.

TL;DR: XGBoost was the most commonly explained model (9/13 SHAP studies), paired with other tree-based ensembles. Image-based studies used CNNs including VGG-16, ResNet, DenseNet, and Inception. Datasets ranged from public benchmarks (Wisconsin, DDSM, BreakHis, TCGA) to multi-institutional clinical cohorts, covering ultrasound, mammography, MRI, histopathology, genomic, and clinical data.
Pages 14-15
Current Gaps, Study Limitations, and the Road Ahead for XAI in Breast Cancer

Key findings in summary: SHAP was the most used model-agnostic XAI method (13/30), primarily paired with tree-based ensemble ML models due to speed and compatibility. Grad-CAM and CAM were the most used model-specific methods (9/30 combined), applied to CNN-based image classification tasks. The review found that many established XAI methods listed in the literature (such as Anchors, occlusion sensitivity, partial dependence plots, counterfactuals, integrated gradients, DeepLIFT, deep Taylor decomposition, guided backpropagation, activation maximization, TCAV, and GraphLIME) have not yet been applied to breast cancer studies, representing untapped opportunities for future research.

Critical gap identified -- bias detection: A significant finding is that the XAI methods used across the 30 studies primarily served as a "sanity check" for model predictions. The authors note that finding biases in models and data through explainability methods was either missing or only mentioned in a few studies. Given that AI systems are susceptible to biases stemming from low-quality datasets, faulty algorithms, and human cognitive biases, the underutilization of XAI for bias detection represents a major gap. Future studies should actively use XAI to identify and mitigate biases, not merely to verify predictions.

Clinical evaluation missing: Although the review investigated clinical applications of XAI methods, the results generated by these methods were not evaluated by oncologists in most studies. The authors stress that clinical evaluation by domain experts is essential to establish trustworthiness and reliability. Without this validation step, XAI outputs remain technically interesting but clinically unverified. Researchers in other medical domains have already begun using XAI for health intervention evaluation, disease causal pathway analysis, mental health surveillance, precision dermatology, and immune response prediction.

Challenges and future directions: The rapid evolution of advanced AI frameworks (such as large language models and generative AI) is transforming healthcare, making the need for XAI increasingly imperative. Potential challenges include data availability in appropriate temporal and geographic resolutions, representativeness and diversity of datasets, semantic heterogeneity, fusion of heterogeneous data streams, AI-readiness of clinical datasets, and algorithmic and human biases in explanations. The review's own limitations include exclusion of non-English articles, gray literature, conference papers, and low-citation studies, as well as the possibility that diverse XAI terminology may have caused some studies to be missed.

Multimodal XAI opportunity: Addressing the challenges of complex multimodal clinical data is key to the widespread acceptance of XAI in cancer care. Breast cancer datasets are typically high-dimensional, multimodal, noisy, and sparse. Future research should prioritize refinement of XAI methods for advanced AI models, ensure the synergy between AI advancements and XAI evolution, and work toward personalized healthcare where innovative models translate into tangible benefits for clinicians and patients.

TL;DR: SHAP (13/30 studies) and CAM/Grad-CAM (9/30 combined) dominate current XAI use in breast cancer research. Major gaps remain: many XAI methods are unexplored in this domain, bias detection is underutilized, and clinical evaluation by oncologists is largely absent. Future work should address multimodal data challenges, validate XAI outputs clinically, and leverage newer AI architectures with robust explainability.
Citation: Ghasemi A, Hashtarkhani S, Schwartz DL, Shaban-Nejad A. Explainable artificial intelligence in breast cancer detection and risk prediction: A systematic scoping review. Cancer Innovation, 2024 (Open Access). Available at: PMC11488119. DOI: 10.1002/cai2.136. License: CC BY.