A Deep Learning and Explainable Artificial Intelligence based Scheme for Breast Cancer Detection

Plain-English Explanations
Pages 1-3
Why Breast Cancer Detection Needs Both Accuracy and Explainability

Breast cancer is the second leading cause of cancer-related deaths in women, with approximately 310,720 new cases and 42,250 deaths projected in the United States for 2024. About 90% of cases are linked to genetic anomalies that accumulate with age, while roughly 5 to 10% are hereditary. Conventional diagnostic techniques, including mammography, ultrasound, biopsy, and MRI, can be time-consuming, expensive, and sometimes lack the precision needed for personalized therapy. Machine learning (ML) and deep learning (DL) models have emerged as powerful tools for analyzing medical data and detecting subtle patterns, but their adoption in clinical settings has been limited by a fundamental problem: opacity.

The black-box barrier: DL models, despite their exceptional classification performance, often function as opaque entities whose internal decision-making processes are not interpretable by human clinicians. Medical practitioners depend on evidence-based reasoning to guide treatment choices, and they are understandably reluctant to adopt AI models that cannot explain why they arrived at a particular diagnosis. This skepticism is well-founded in a domain where a single incorrect prediction can endanger a patient's life. False negative predictions (predicting benign when the tumor is malignant) are particularly dangerous because they may lead patients and doctors to skip further conventional diagnostic steps, allowing the cancer to progress undetected.

The recall imperative: In breast cancer detection, recall (sensitivity) is arguably more critical than overall accuracy. A system that achieves high accuracy but misses even a small percentage of malignant cases can have fatal consequences. The authors emphasize that falsely optimistic predictions (false negatives) can endanger lives, as patients or doctors who rely on these predictions may forgo further diagnosis. Explanations become even more important when a prediction is optimistic (benign), because medical practitioners can evaluate the reasoning behind such a prediction and decide whether additional testing is warranted.

Existing gaps in the literature: Many prior systems for breast cancer prediction already achieve high accuracy, but most lack two critical properties simultaneously: robust recall to minimize false negatives and meaningful explainability to justify each prediction. The authors frame two research questions guiding this work: (1) How can hybrid and ensemble deep learning models effectively predict malignant cases of breast cancer with better recall and sensitivity? (2) How can eXplainable AI (XAI) methods assist physicians and patients in understanding the reasoning behind each prediction?

TL;DR: Breast cancer kills over 42,000 Americans annually, and AI models that detect it often cannot explain their reasoning. This paper targets two gaps at once: improving recall (sensitivity) so fewer malignant cases are missed, and adding SHAP-based explainability so clinicians understand why a specific prediction was made.
Pages 3-4
The Three-Layer DXAIB System Model

The proposed scheme, called "DXAIB" (Deep Learning and eXplainable AI for Breast cancer), is built around a three-layer architecture: a data layer, a prediction layer, and an explainability layer. Each layer handles a distinct phase of the diagnostic pipeline, and together they form an end-to-end system that ingests patient data, produces a cancer prediction, and generates human-readable explanations for that prediction.

Data layer: This layer is responsible for providing breast cancer data for training and testing at regular intervals. It records and delivers patient vitals (10 clinical features derived from fine needle aspiration of breast masses) to the prediction layer. The features include radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Each of these parameters characterizes the cell nuclei observed in ultrasound images. The outcome variable is binary: benign (B, encoded as 0) or malignant (M, encoded as 1).

Prediction layer: This layer trains the DXAIB model using available data and produces breast cancer status predictions based on patient vitals. The model is a hybrid combining a Convolutional Neural Network (CNN) for automated feature learning with a Random Forest (RF) classifier for final prediction. The CNN extracts learned feature representations from tabular data through its convolutional layers, and these features are then fed into the RF model for class label prediction. Both patients and medical practitioners can use this layer to obtain a diagnosis.

Explainability layer: After the prediction layer generates an output, the explainability layer uses SHAP (SHapley Additive exPlanations) to produce both local and global explanations. Local explanations describe the specific features that drove a particular patient's prediction, while global explanations reveal overarching patterns across the dataset. This layer allows medical practitioners to examine the logical reasoning behind each prediction outcome, evaluating the role of each individual input vital in the diagnosis. The objective function minimizes binary cross-entropy loss, subject to constraints ensuring predicted probabilities fall between 0 and 1 and outcomes are binary.
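
Written out, the objective described above is standard binary cross-entropy minimization (the symbols below follow common convention rather than the paper's exact notation):

```latex
\min_{\theta}\; \mathcal{L}(\theta)
  = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{p}_i
      + (1 - y_i)\log\big(1 - \hat{p}_i\big) \Big]
\quad \text{subject to} \quad 0 \le \hat{p}_i \le 1,\;\; y_i \in \{0, 1\},
```

where \(\hat{p}_i\) is the model's predicted probability that patient \(i\)'s tumor is malignant and \(y_i\) is the true label (0 = benign, 1 = malignant).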

TL;DR: DXAIB has three layers: a data layer (10 clinical features from FNA), a prediction layer (CNN + Random Forest hybrid), and an explainability layer (SHAP). The system ingests patient data, predicts benign or malignant status, and provides both per-patient and dataset-wide explanations for each diagnosis.
Pages 4-5
The Wisconsin Breast Cancer Dataset and SMOTE Balancing

Dataset description: The study uses the "Breast Cancer Wisconsin (Diagnostic)" dataset from the UCI Machine Learning Repository. This dataset contains 569 individual patient samples, each described by 30 numeric features. These 30 features are derived from 10 base measurements (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension), with three statistical variants computed for each: the mean, standard error, and "worst" (largest) value. All features are extracted from digitized images of fine needle aspirates (FNA) of breast masses and describe cell nuclei characteristics. The dataset is clean, well-structured, and has been extensively validated in breast cancer research.
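
The same UCI dataset ships with scikit-learn, which makes the shapes and class counts easy to verify. Note one difference from the paper's encoding: scikit-learn uses 0 = malignant and 1 = benign.

```python
from sklearn.datasets import load_breast_cancer

# Breast Cancer Wisconsin (Diagnostic): 569 samples, 30 numeric features
# (10 base measurements x {mean, standard error, worst}).
data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)                   # (569, 30)
print(data.feature_names[:3])    # mean radius, mean texture, mean perimeter
print((y == 0).sum(), (y == 1).sum())  # 212 malignant, 357 benign
```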

Preprocessing steps: The raw data undergoes several transformation steps before model training. First, the target column's categorical values ("M" for malignant and "B" for benign) are label-encoded as 1 and 0, respectively. Superfluous or redundant features are removed, and duplicate data instances are eliminated. Following these cleaning steps, all features are standardized by subtracting the mean and dividing by the standard deviation, ensuring all characteristics are measured on a consistent scale and minimizing biases caused by differences in measurement units. After standardization, MinMaxScaler normalization is applied to the training and testing splits.
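
A minimal sketch of this preprocessing order on a tiny synthetic table (the column names are placeholders, not the paper's exact schema):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M", "B", "B"],
    "radius_mean": [17.9, 12.1, 13.0, 20.2, 11.8, 13.0],
    "texture_mean": [10.4, 18.6, 21.8, 14.3, 17.0, 21.8],
})

df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})  # label-encode target
df = df.drop_duplicates()                                # remove duplicate rows

X = df.drop(columns="diagnosis").to_numpy(dtype=float)
y = df["diagnosis"].to_numpy()

X = StandardScaler().fit_transform(X)  # zero mean, unit variance
X = MinMaxScaler().fit_transform(X)    # then rescale into [0, 1]
```

In a real pipeline, both scalers would be fit on the training split only and then applied to the test split, to avoid information leaking from test data into preprocessing.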

Class imbalance and SMOTE: The original dataset is imbalanced, with 62.85% benign cases and only 37.15% malignant cases. In a cancer detection task where recall is paramount, this imbalance is problematic because the model may develop a bias toward predicting the majority class (benign), leading to more false negatives on malignant cases. To address this, the authors apply Synthetic Minority Over-sampling Technique (SMOTE), a data augmentation method for tabular data. SMOTE creates synthetic samples for the minority malignant class by interpolating between existing data points and their nearest neighbors, expanding the feature space of the minority class. This helps the model learn more diverse patterns and reduces the chance of misclassifying malignant instances as benign.
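
The interpolation step at SMOTE's core can be sketched in a few lines of NumPy (a deliberately minimal version; the imbalanced-learn `SMOTE` class, which real pipelines typically use, adds k-NN indexing and many options):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: pick a random minority point, pick one of its
    k nearest minority neighbours, and interpolate at a random position
    on the segment between them."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy imbalanced data: 40 majority (benign) vs 15 minority (malignant).
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(40, 2))
X_min = rng.normal(3.0, 1.0, size=(15, 2))

X_new = smote_oversample(X_min, n_new=len(X_maj) - len(X_min), rng=1)
X_bal = np.vstack([X_maj, X_min, X_new])
y_bal = np.array([0] * 40 + [1] * (15 + len(X_new)))
```

Like the scalers, SMOTE should only be applied to the training split, never to the held-out test set.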

Train-test split: The dataset is partitioned using an 80-20 split, with 80% of samples used for training and 20% held out for testing. This split ensures a sufficient amount of data for model training while reserving an independent set for unbiased performance evaluation.

TL;DR: The Wisconsin Breast Cancer (Diagnostic) dataset contains 569 samples with 30 features from FNA images. The data was 62.85% benign and 37.15% malignant, so SMOTE was applied to balance classes and improve recall. Features were standardized and normalized, with an 80/20 train-test split.
Pages 5-7
The Hybrid CNN + Random Forest Pipeline

CNN for feature learning: The core of the DXAIB scheme is a hybrid model that uses a Convolutional Neural Network not for direct classification, but for automated feature extraction from tabular data. The CNN consists of 10 layers: four Conv1D layers, two MaxPooling1D layers, one flatten layer, and three dense layers. The first two convolutional layers use 128 filters each with a kernel size of 3 (the layers are one-dimensional, so each kernel spans three adjacent feature values), while the subsequent two use 256 filters. MaxPooling1D layers with a pool size of 2 follow each pair of convolutional layers. All convolutional layers use ReLU activation. A dropout rate of 0.20 is applied after every max-pooling and dense layer to combat overfitting.

Dense layers and feature transformation: After the convolutional and pooling layers, the data is flattened into a one-dimensional vector and passed through three dense layers with 512, 256, and 2 neurons respectively. The first two dense layers use ReLU activation, while the final dense layer uses Softmax. Critically, however, this CNN is used only for feature learning rather than final classification: before the output reaches the Softmax layer, a dense layer restructures the CNN's learned representations into a vector format suitable for the Random Forest classifier.

Random Forest as the classification layer: The classification layer of the CNN is replaced by a Random Forest (RF) classifier. RF was selected based on its superior accuracy compared to other ML techniques when applied to this particular dataset. The RF model receives the CNN-extracted features and generates the final class prediction (benign or malignant). This hybrid approach leverages CNN's strength in automated feature learning and RF's robust ensemble decision-making, which aggregates predictions from multiple decision trees to reduce variance and improve generalization.
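
A hedged Keras sketch of this hybrid pipeline: the Conv1D stack treats the 30 tabular inputs as a length-30 sequence with one channel, and a Random Forest replaces the softmax head for the final prediction. Layer sizes follow the paper; padding, the choice of feature layer, and the toy data are assumptions (in the paper the CNN is trained before features are extracted; it is left untrained here for brevity).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(n_features=30):
    inp = keras.Input(shape=(n_features, 1))
    x = layers.Conv1D(128, 3, activation="relu", padding="same")(inp)
    x = layers.Conv1D(128, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Dropout(0.20)(x)
    x = layers.Conv1D(256, 3, activation="relu", padding="same")(x)
    x = layers.Conv1D(256, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Dropout(0.20)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dropout(0.20)(x)
    feats = layers.Dense(256, activation="relu", name="features")(x)
    out = layers.Dense(2, activation="softmax")(layers.Dropout(0.20)(feats))
    return keras.Model(inp, out)

cnn = build_cnn()
# Expose the 256-unit dense layer as the feature output for the RF head.
extractor = keras.Model(cnn.input, cnn.get_layer("features").output)

# Toy data stands in for the preprocessed Wisconsin features.
rng = np.random.default_rng(0)
X = rng.random((64, 30, 1)).astype("float32")
y = rng.integers(0, 2, 64)

features = extractor.predict(X, verbose=0)  # shape (64, 256)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(features, y)
pred = rf.predict(features)
```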

Training configuration: The model is compiled with Adam optimizer and categorical cross-entropy loss. The learning rate is set to 0.0001, with a ReduceLROnPlateau callback (factor of 0.1, patience of 5) monitoring validation loss. Training runs for 100 epochs with a batch size of 64. Hyperparameter tuning for the Random Forest component uses logistic chaotic maps, with n_estimators ranging from 100 to 500, min_samples_split from 2 to 20, and max_depth options including None and values from 5 to 20.
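
The paper samples the Random Forest's hyperparameter ranges with logistic chaotic maps; as a stand-in, the same ranges can be explored with scikit-learn's conventional `RandomizedSearchCV` (the scoring choice, CV folds, and iteration budget below are assumptions for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Search ranges from the paper; sampled uniformly here rather than
# via chaotic maps.
param_distributions = {
    "n_estimators": list(range(100, 501, 50)),
    "min_samples_split": list(range(2, 21)),
    "max_depth": [None] + list(range(5, 21)),
}

X, y = load_breast_cancer(return_X_y=True)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,            # small budget for illustration
    scoring="recall",     # the paper's priority metric
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```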

TL;DR: DXAIB uses a 10-layer CNN (four Conv1D, two MaxPooling1D, three dense layers) purely for feature extraction, then feeds the learned features into a Random Forest classifier for final prediction. The CNN trains for 100 epochs with Adam optimizer (lr=0.0001) and batch size 64, with 0.20 dropout throughout.
Page 7
How SHAP Provides Local and Global Explanations

Why SHAP over other XAI techniques: The authors chose SHAP (SHapley Additive exPlanations) over alternatives like LIME or Grad-CAM for several reasons. SHAP has a rigorous theoretical foundation based on Shapley values from cooperative game theory, which guarantees consistency and additivity in feature importance attribution. It provides both model-agnostic and model-specific variants, supports global and local interpretability simultaneously, reliably captures feature interactions, and produces visualizations that are well-suited for clinical interpretation. Its wide adoption and continuous improvement in the research community further support its selection.

Local explanations: SHAP local explanations describe the specific factors driving a particular patient's diagnosis. After the DXAIB model produces predictions, SHAP values are computed for every feature across all patients. Each SHAP value measures the effect of a feature's presence or absence on the model's output. Force plots (Figures 2 and 3 in the paper) visualize these values for individual patients: features shown in red push the prediction toward class 1 (malignant), while features in blue push toward class 0 (benign). The model's baseline value is 0.3783, which serves as the classification threshold. Predictions above this threshold are classified as malignant; those at or below it are classified as benign.
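
SHAP attributions are Shapley values from game theory, and for a tiny model they can be computed exactly by brute force over all feature coalitions. The sketch below (toy three-feature "risk model", not the paper's) illustrates the additivity property that force plots rely on: baseline plus the sum of a patient's SHAP values equals that patient's prediction.

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values for prediction f(x), brute-forcing all feature
    subsets. Features absent from a coalition take their background
    (dataset-average) values -- the same idea SHAP uses."""
    n = len(x)
    def value(subset):
        z = background.copy()
        z[list(subset)] = x[list(subset)]
        return f(z)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

# Toy linear "risk model" over three features (illustrative only).
f = lambda z: 0.4 * z[0] + 0.3 * z[1] - 0.2 * z[2]
background = np.array([0.5, 0.5, 0.5])  # baseline patient (feature means)
x = np.array([0.9, 0.2, 0.1])           # one patient's features

phi = shapley_values(f, x, background)
baseline = f(background)
# Additivity: baseline + phi.sum() reproduces f(x) exactly.
```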

Waterfall plots: Figure 4 in the paper presents waterfall force plots that rank features by their SHAP values for specific patient instances. Features are arranged from lowest to highest impact, with red bars indicating features contributing to a breast cancer (malignant) prediction and blue bars indicating features associated with a benign outcome. These detailed visualizations allow physicians to identify exactly which clinical measurements most influenced a particular patient's diagnosis and by how much.

Global explanations: At the dataset level, SHAP values are averaged across all samples to reveal which features are consistently most influential across the entire patient population. The global similarity plot (Figure 5) clusters 115 test instances based on their explanation patterns, while the global ordering plot (Figure 6) shows the sequential organization of samples revealing consistent trends. The global summary plot (Figure 7) displays every feature's SHAP value distribution, using color coding (blue for low feature values, red for high) and horizontal spread to indicate each feature's relative significance in shaping predictions.

TL;DR: SHAP provides both per-patient (local) and dataset-wide (global) explanations. Force plots show which features push each prediction toward malignant or benign, with a baseline threshold of 0.3783. Global summary plots rank all 30 features by their overall influence on the model, giving clinicians a transparent view of how the AI reaches its decisions.
Pages 8-10
DXAIB Performance: 98.35% Accuracy and Comparison to 23 State-of-the-Art Methods

Headline metrics: The proposed DXAIB scheme achieves 98.35% accuracy, 98.76% precision, 98.74% recall, and 98.72% F1 score on the Wisconsin Breast Cancer (Diagnostic) dataset. The near-perfect recall of 98.74% is particularly significant because it indicates that the model misses very few malignant cases, directly addressing the paper's central concern about minimizing false negative predictions that could endanger patient lives.

Baseline ML comparisons: Before comparing to external methods, the authors benchmarked several standard ML algorithms on the same dataset. Naive Bayes achieved 92.12% accuracy with 93.16% recall; standalone Random Forest reached 94.16% accuracy with 93.23% recall; Support Vector Machine scored only 83.21% accuracy with 87.65% recall; and Light Gradient Boosting Machine obtained 92.34% accuracy with 95.67% recall. The DXAIB hybrid approach outperforms all of these by a significant margin, demonstrating that the CNN feature extraction step adds substantial value beyond what the RF classifier can achieve on raw features alone.
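
The four reported metrics are standard and worth pinning down, since recall = TP/(TP+FN) is the only one that directly counts missed malignancies. A toy example with hypothetical labels (1 = malignant), containing one false negative and one false positive:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]  # one FN, one FP

acc  = accuracy_score(y_true, y_pred)   # (TP+TN)/total = 8/10 = 0.80
prec = precision_score(y_true, y_pred)  # TP/(TP+FP)    = 3/4  = 0.75
rec  = recall_score(y_true, y_pred)     # TP/(TP+FN)    = 3/4  = 0.75
f1   = f1_score(y_true, y_pred)         # harmonic mean of prec and rec
```

The single false negative at position 4 is the error the paper's recall-first framing targets: accuracy barely moves, but a malignant case would have gone undetected.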

Comparison to 23 published methods: The DXAIB scheme was benchmarked against 23 state-of-the-art methods from the literature (Table 4). Among these, the closest competitors include: Wani et al. with 98.29% accuracy, 98.72% precision, and 98.72% recall using a hybrid DL approach with SHAP; Singh et al. at 98.31% accuracy and 98.00% precision; Hasan et al. at 97.95% accuracy; and Obaid et al. at 98.10% accuracy. The DXAIB scheme's 98.35% accuracy edges out all of these, and its simultaneous high recall (98.74%) and precision (98.76%) demonstrate a well-balanced model that avoids trading off one metric for the other.

Key differentiator: While several competing methods achieve accuracy above 97%, most of them do not include any explainability component. Of the 23 compared methods, only Wani et al. incorporated SHAP explainability. The DXAIB scheme is distinguished by delivering top-tier performance across all four metrics (accuracy, precision, recall, F1) while simultaneously providing both local and global SHAP explanations. The K-fold cross-validation technique further strengthens the reliability and validity of the results.

TL;DR: DXAIB achieves 98.35% accuracy, 98.76% precision, 98.74% recall, and 98.72% F1 score, outperforming all 23 compared state-of-the-art methods. It surpasses the closest competitor (Wani et al. at 98.29%) while being one of only two methods in the comparison that include explainability.
Pages 10-13
What the SHAP Visualizations Reveal About Feature Importance

Local force plots: The SHAP force plots for individual patients reveal how specific clinical measurements drive each diagnosis. For a patient classified as malignant (class 1), features shown in red (such as elevated worst concave points or worst perimeter) push the prediction above the 0.3783 baseline threshold. For a benign patient (class 0), features in blue dominate, pulling the predicted value below the threshold. These individual-level explanations allow a physician to examine a specific patient's result and understand exactly which measurements contributed most to the model's conclusion.

Waterfall analysis: The waterfall plots provide a more granular decomposition for selected patient instances. Features are ranked by the magnitude of their SHAP values, from least to most impactful. Red bars represent features that increase the probability of a malignant diagnosis, while blue bars represent features that decrease it. This ranking helps clinicians quickly identify the two or three most decisive features for any given patient, enabling targeted follow-up examination of those specific clinical characteristics.

Global patterns: The global similarity plot groups all 115 test instances by the similarity of their SHAP explanation patterns. Patients numbered approximately 1 through 40 tend to cluster together, predominantly associated with blue coloring (benign class), while instances beyond this range form a distinct cluster with more red coloring (malignant class). The global ordering plot arranges all samples from 0 to 115, revealing consistent trends in how explanations are distributed across the dataset. These global views help researchers and clinicians understand the overall decision-making behavior of the model rather than just individual cases.

Global summary (beeswarm) plot: Figure 7 presents the most comprehensive global visualization, ranking all features by their aggregate SHAP value magnitudes. Each dot represents a single patient's SHAP value for a given feature, colored by the actual feature value (red = high, blue = low). Features with wider horizontal spread of dots have greater influence on predictions. This plot reveals which of the 30 clinical features are consistently most important across the entire patient population, providing radiologists with a transparent understanding of the model's overall reasoning strategy and allowing them to compare these AI-derived feature importances against established clinical diagnostic criteria.

TL;DR: SHAP force plots show per-patient feature contributions (red = pushes toward malignant, blue = toward benign) relative to a 0.3783 baseline threshold. Global summary plots rank all 30 features by importance across the full test set. These visualizations let clinicians verify that the model's reasoning aligns with established clinical knowledge about breast cancer indicators.
Pages 13-14
Clinical Implications, Limitations, and What Comes Next

Clinical value: The DXAIB scheme's combination of 98.35% accuracy, 98.74% recall, and SHAP-based explainability addresses two of the most pressing barriers to AI adoption in oncology: prediction reliability and clinical trust. The high recall means the system misses very few malignant cases, reducing the risk of patients forgoing necessary conventional diagnosis based on a false negative. The SHAP explanations allow medical practitioners to evaluate the reasoning behind each prediction, significantly reducing the downstream impact of any remaining false negatives by providing a basis for human judgment. The system identifies the specific clinical factors that contribute to breast tumor formation, improving early planning for healthcare coordination, resource allocation, and patient support.

Strengths of the hybrid approach: By combining CNN-based automated feature learning with Random Forest classification, DXAIB leverages the strengths of both paradigms. The CNN extracts complex, non-linear feature representations from the 30 tabular features that would not be captured by a standalone RF model, while the RF provides robust ensemble classification that aggregates multiple decision trees. This hybrid design is paired with the model-agnostic SHAP technique, which can explain the RF component's predictions regardless of the CNN's internal complexity. The result is a system that is simultaneously high-performing and transparent.

Acknowledged limitations: The authors identify several constraints. The Wisconsin Breast Cancer (Diagnostic) dataset, while well-validated, lacks diversity in several dimensions: it uses a single imaging modality (FNA), has limited demographic and clinical heterogeneity, and represents data collected at a specific point in time rather than longitudinal patient monitoring. Real-world breast cancer diagnosis often involves analyzing data across multiple time periods to track tumor development. The relatively small sample size of 569 instances, even after SMOTE augmentation, may limit generalizability to larger, more diverse clinical populations. The binary classification (benign vs. malignant) also does not capture the full spectrum of clinical subtypes.

Future directions: The authors plan to address these limitations in future work by incorporating more diverse datasets with multiple imaging modalities and longitudinal patient data. They also intend to explore additional explainability techniques beyond SHAP and to validate the system in real clinical environments where practitioners can provide feedback on the utility of the explanations. Extending the model to multi-class classification covering specific breast cancer subtypes represents another key direction for enhancing the scheme's practical clinical applicability.

TL;DR: DXAIB delivers 98.35% accuracy with 98.74% recall and full SHAP explainability, addressing both prediction reliability and clinical trust. Key limitations include a small, single-modality dataset (569 samples from FNA only), binary-only classification, and no longitudinal validation. Future work targets multi-modal data, multi-class subtypes, and real-world clinical testing.
Citation: Saharan S, Wani NA, Chatterji S, Kumar N, Almuhaideb AM. A Deep Learning and Explainable Artificial Intelligence based Scheme for Breast Cancer Detection. Open access, 2025. Available at PMC12402191. DOI: 10.1038/s41598-024-80535-7. License: CC BY-NC-ND.