Breast cancer remains the most common malignancy among women worldwide and one of the most fatal, accounting for roughly 31% of cancer cases in women globally. Conventional diagnostic approaches, including mammography, ultrasound examination, and histopathological analysis, suffer from time-consuming workflows, human error, and inter-observer variability that elevate false-positive rates. Although deep learning (DL) has shown strong performance in medical image analysis, individual models often struggle to capture complex or subtle patterns in ultrasound data, resulting in suboptimal classification accuracy.
The single-model limitation: A standalone CNN architecture may learn certain types of features well (for instance, VGG16 excels at hierarchical spatial features through its 13 convolutional layers, while Xception captures channel-wise and spatial information via depthwise separable convolutions) but miss complementary patterns that a different architecture would catch. DenseNet121, with its dense connectivity where every layer receives input from all preceding layers, promotes feature reuse and mitigates vanishing gradients. Each architecture has distinct strengths, and none alone reliably captures the full spectrum of features in breast ultrasound images.
Model fusion vs. ensemble techniques: The authors distinguish their fusion approach from ensemble methods like voting and stacking. Voting and stacking aggregate model outputs at the decision level, whereas intermediate-layer fusion concatenates feature representations extracted by each model before the final classification head. This allows the fused system to learn joint representations from a richer, multi-perspective feature space rather than simply averaging or voting on separate predictions. The proposed fusion of VGG16, DenseNet121, and Xception at intermediate layers captures multi-scale features that a single model cannot.
The explainability gap: Many existing AI systems for breast cancer detection are "black boxes," providing a prediction without any justification. Clinicians and radiologists need to understand why a model reached a particular decision before they can trust and incorporate it into their diagnostic workflow. This paper addresses that gap by pairing the fused DL model with Grad-CAM++, a class-discriminative visualization technique that generates heatmaps showing which image regions drove the prediction.
Literature search strategy: The authors conducted a systematic keyword-based search using phrases such as "AI for early BC detection," "DL in cancer diagnosis," "ultrasound image analysis," and "mammography image processing." They searched IEEE Xplore, Elsevier, SpringerLink, Scopus, and Google Scholar, prioritizing studies that used minimally invasive or non-invasive diagnostic methods and those employing machine learning or deep learning with explainable outputs.
Key prior work: Several notable studies are reviewed. In one study, five fine-tuned pre-trained models (Xception, InceptionV3, VGG16, MobileNet, ResNet50) were applied to a GAN-augmented MRI dataset and classified images into eight categories (four benign, four malignant). Another study used an upgraded Deep CNN with ResNet50 and a novel adaptive learning rate to achieve 88% accuracy across four abnormality types. Jahangeer et al. implemented VGG16 with median filtering on mammography images. A transfer-learning framework combining GoogleNet, VGGNet, and ResNet with an average pooling classifier demonstrated superior performance over individual models for cytology image classification. A VGG16-based transfer learning method with a two-layer DNN classifier and dropout regularization, combined with Grad-CAM, achieved 91% accuracy on ultrasound data.
XAI techniques in the literature: Prior explainability efforts include Grad-CAM visualizations for DenseNet-based breast cancer detection (achieving 89.87% accuracy), LIME-based explanations on an EfficientNetB7 classifier (91% accuracy on ultrasound segmentation with U-Net), and SHAP-based feature importance in an XGBoost model reaching 85% accuracy. Attention-guided Grad-CAM with DenseNet201, VGG19, and EfficientNetB7 ensembles, and the DALAResNet50 lightweight attention model with Dynamic Threshold Grad-CAM, represent more advanced attempts. However, many high-performing models in the literature lack any explainability, and those that include XAI often show lower classification accuracy.
Identified gap: The systematic review reveals that most existing approaches either use a single model (limited feature extraction), employ ensemble techniques at the output level without deep feature fusion, or provide XAI but with reduced accuracy. Few studies combine intermediate-layer model fusion with robust explainability. The proposed methodology fills this gap by implementing feature-level fusion of three architectures with Grad-CAM++ interpretability on the Breast Ultrasound Image (BUSI) dataset.
Preprocessing pipeline: Input breast ultrasound images are resized to 128x128 pixels to standardize dimensions across all three models. Pixel values are normalized to the [0, 1] range by dividing by 255. Data augmentation through slight rotation and flipping is applied to reduce overfitting and improve generalization. These resized, normalized, and augmented images are then fed into the hybrid deep learning model.
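The resize-and-normalize steps can be sketched in plain NumPy; nearest-neighbor resizing and a horizontal flip are illustrative stand-ins for whatever interpolation and augmentation routines the authors' Keras pipeline actually used:

```python
import numpy as np

def nearest_resize(img, size=(128, 128)):
    """Resize an HxW (or HxWxC) image with nearest-neighbor sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return img[rows][:, cols]

def preprocess(img):
    """Resize to 128x128 and scale pixel values to [0, 1]."""
    return nearest_resize(img).astype(np.float32) / 255.0

def augment(img):
    """Yield the image plus a horizontal flip; the paper's slight
    rotations would typically use a routine such as scipy.ndimage.rotate."""
    yield img
    yield np.fliplr(img)
```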
Three backbone networks: The fusion model integrates three pre-trained CNN architectures, each contributing distinct feature extraction capabilities. VGG16 uses 13 convolutional layers with small 3x3 filters followed by three fully connected layers, providing detailed hierarchical spatial feature extraction suited for complex medical imaging patterns. DenseNet121 uses dense connectivity blocks where each layer receives inputs from all preceding layers, promoting feature reuse and mitigating vanishing gradients. Xception is built on depthwise separable convolutions, efficiently capturing both channel-wise and spatial information while reducing computational costs. All three models are pre-trained on ImageNet, and their weights are frozen during training to preserve the rich feature representations learned from the large-scale dataset and to prevent overfitting on the smaller breast ultrasound dataset.
Feature extraction and fusion: Each backbone processes input images independently and extracts features from its penultimate layer. These features are then passed through Global Average Pooling (GAP), which reduces spatial dimensions without sacrificing discriminative information and helps handle variations in the ultrasound images. The feature vectors from all three models are concatenated into a single high-dimensional fusion vector, combining the hierarchical features from VGG16, the densely connected features from DenseNet121, and the depthwise separable features from Xception.
Classification head: The concatenated fusion vector passes through a classification head consisting of two fully connected (dense) layers with 256 and 128 neurons respectively, each using ReLU activation for non-linear feature transformation. Dropout regularization (rate of 0.5) is applied at each dense layer to prevent overfitting. The final output layer uses a sigmoid activation function to produce a binary prediction: benign (0) or malignant (1). The model is trained to minimize binary cross-entropy loss.
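Shape-wise, the fusion and classification head can be sketched in NumPy. The 512/1024/2048 channel depths match the final convolutional feature maps of VGG16, DenseNet121, and Xception, but the 4x4 spatial size and the random weights are purely illustrative; in the paper the backbones are Keras models and the head is trained with binary cross-entropy (dropout, active only at training time, is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def gap(feature_map):
    """Global Average Pooling: (H, W, C) -> (C,)."""
    return feature_map.mean(axis=(0, 1))

def dense(x, w, b, activation):
    z = x @ w + b
    if activation == "relu":
        return np.maximum(z, 0.0)
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid

# Illustrative backbone outputs; channel depths are the real final
# conv depths of VGG16 (512), DenseNet121 (1024), Xception (2048).
feats = {
    "vgg16":       rng.normal(size=(4, 4, 512)),
    "densenet121": rng.normal(size=(4, 4, 1024)),
    "xception":    rng.normal(size=(4, 4, 2048)),
}

# GAP each backbone's features, then concatenate into one fusion vector.
fused = np.concatenate([gap(f) for f in feats.values()])  # (3584,)

# Head: Dense(256, relu) -> Dense(128, relu) -> Dense(1, sigmoid).
h1 = dense(fused, rng.normal(size=(3584, 256)) * 0.01, np.zeros(256), "relu")
h2 = dense(h1, rng.normal(size=(256, 128)) * 0.01, np.zeros(128), "relu")
p  = dense(h2, rng.normal(size=(128, 1)) * 0.01, np.zeros(1), "sigmoid")
print(fused.shape, p.shape)  # -> (3584,) (1,)
```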
Why Grad-CAM++ over other XAI methods: The authors chose Grad-CAM++ over alternatives like SHAP and LIME for several reasons. SHAP and LIME are model-agnostic and tend to show lower performance on high-dimensional visual data like ultrasound images. Grad-CAM++, by contrast, is specifically designed for convolutional neural networks and produces pixel-level heatmaps directly from the gradient information flowing through the network. It computes second-order gradients to determine the importance of each pixel, resulting in better localization of relevant regions compared to the original Grad-CAM technique.
How it works: For each input image, Grad-CAM++ computes gradients of the predicted class score with respect to the feature activations of the last convolutional layer. These gradients are used to generate a class-discriminative saliency map, essentially a heatmap that highlights which spatial regions of the input image contributed most to the model's classification decision. The heatmap is then overlaid on the original ultrasound image, producing a composite visualization where red and yellow areas indicate regions of high model attention (strongly influencing the prediction), while blue and black areas indicate minimal influence.
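Given the last conv layer's activations and gradients, the Grad-CAM++ weighting can be sketched in NumPy. This uses the closed-form alpha terms from the Grad-CAM++ paper, where the second- and third-order derivatives reduce to elementwise powers of the first-order gradients under the usual exponential-score assumption:

```python
import numpy as np

def grad_cam_pp(activations, grads):
    """Grad-CAM++ saliency map from last-conv activations A (H, W, K)
    and gradients dY/dA (H, W, K) of the class score."""
    grads_2 = grads ** 2
    grads_3 = grads ** 3
    # Per-channel sum of activations, broadcast back over H and W.
    sum_a = activations.sum(axis=(0, 1), keepdims=True)
    denom = 2.0 * grads_2 + sum_a * grads_3
    denom = np.where(denom != 0.0, denom, 1e-8)  # numerical guard
    alpha = grads_2 / denom
    # Channel weights: alpha-weighted positive gradients.
    weights = (alpha * np.maximum(grads, 0.0)).sum(axis=(0, 1))
    # Weighted combination of feature maps, then ReLU.
    cam = np.maximum((activations * weights).sum(axis=-1), 0.0)
    # Normalize to [0, 1] for display as a heatmap.
    return cam / cam.max() if cam.max() > 0 else cam
```

In the full pipeline the resulting (H, W) map is upsampled to the input resolution and color-mapped before being overlaid on the ultrasound frame.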
Clinical interpretation of heatmaps: In benign cases, the Grad-CAM++ heatmaps rarely show intense red regions, and the overlay images display predominantly blue highlighting with prediction values near 0.00, indicating confident benign diagnosis. For malignant cases, the heatmaps show prominent red regions corresponding to areas of high-level model attention, with prediction values near 1.00 indicating confident malignancy detection. Clinicians can use these visualizations to correlate model attention with clinical and anatomical features such as masses, irregular edges, and shadowing artifacts in the ultrasound images.
Advantage over standard Grad-CAM: Standard Grad-CAM can struggle when multiple instances of the same class appear in an image. Grad-CAM++ addresses this with its second-order gradient computation, enabling better localization of multiple lesions with finer edges from the ultrasound images. This is particularly relevant in breast cancer diagnosis where tumors may present with varying size, texture, and shading effects in ultrasound imaging.
Dataset details: The study uses the Ultrasound Breast Images for Breast Cancer dataset, a publicly available collection from Kaggle. It contains 8,116 ultrasound images divided into two classes, 4,074 benign and 4,042 malignant, making the dataset approximately balanced. All images are in JPEG format at 224x224 pixel resolution. Professional radiologists acquired and annotated the original images, providing clinical-grade, expert-level ground truth labels for supervised learning. Image augmentation techniques including sharpening and rotation were applied to improve dataset diversity and model robustness.
Data splitting: The dataset was divided in a 75:25 ratio. 75% of the data (approximately 6,087 images) was used for model training, and the remaining 25% (approximately 2,029 images) was held out for testing. From the training data, 10% was reserved for model validation during training, allowing the researchers to monitor performance and detect overfitting.
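A minimal index-splitting sketch reproduces the stated counts, assuming a single shuffled split (the paper does not specify stratification or a seed, so both are illustrative here):

```python
import numpy as np

def split_indices(n, test_frac=0.25, val_frac=0.10, seed=42):
    """Shuffle n sample indices into train/val/test: 25% held out for
    testing, then 10% of the remaining training data for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(n * test_frac)
    test, rest = idx[:n_test], idx[n_test:]
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split_indices(8116)
print(len(train), len(val), len(test))  # -> 5479 608 2029
```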
Training configuration: All experiments were conducted on a laptop with a 2.8 GHz Intel Core i7 processor, 32GB of DDR4 RAM, and an NVIDIA GeForce RTX 3090 GPU with 24GB of VRAM. The programming environment used Anaconda with Python 3, along with NumPy, Pandas, Keras, TensorFlow, Matplotlib, OpenCV, and Scikit-learn. The fusion model was trained for 50 epochs with a batch size of 32, chosen after testing batch sizes of 8, 16, 32, 64, and 128: the smaller sizes (8 and 16) caused slower gradient updates, while the larger ones (64 and 128) demanded excessive memory and converged faster but less stably. A batch size of 32 balanced training stability with computational efficiency. The Adam optimizer was used, and training beyond 50 epochs was avoided because validation losses increased past that point.
Phase 1, individual model benchmarking: Nine standalone CNN architectures were evaluated on the breast ultrasound dataset. VGG16 led with 84.43% accuracy, 80.65% precision, 90.40% recall, and 85.25% F1 score. DenseNet121 followed at 83.54% accuracy, 84.36% precision, 83.54% recall, and 83.45% F1 score. Xception recorded 82.45% accuracy, 83.15% precision, 82.24% recall, and 82.37% F1 score. The remaining models trailed: Inception at 78.8% accuracy, GoogleNet at 76.43%, MobileNetV2 at 71.1%, ResNet50 at 70.8%, EfficientNetB0 at 69.33%, and AlexNet at 62.32%.
Phase 2, fusion model: The top three performing individual models (VGG16, DenseNet121, Xception) were fused using intermediate-layer feature concatenation. The fusion model achieved 97.14% accuracy, 95.96% precision, 98.42% recall, and 97.18% F1 score. This represents an improvement of approximately 13 percentage points over VGG16 (the best individual model at 84.43%). The recall of 98.42% is particularly notable in a clinical context, as it means the model correctly identifies the vast majority of malignant cases, minimizing dangerous false negatives.
Epoch selection rationale: The authors trained the fusion model across multiple epoch settings: 10, 20, 35, 50, 70, and 100. They observed rising validation losses after 50 epochs, indicating the onset of overfitting. Training was therefore fixed at 50 epochs. This finding is consistent with the relatively small dataset size (8,116 images), where extended training allows the model to memorize training patterns rather than learning generalizable features.
Confusion matrix analysis: Confusion matrices for VGG16, DenseNet121, Xception, and the fusion model were reported. The fusion model demonstrated notably improved prediction performance, producing more accurate classifications across both benign and malignant categories compared to any individual model. The improvement was driven by the complementary feature extraction capabilities of the three architectures working together through the intermediate fusion process.
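The reported metrics follow directly from confusion-matrix counts. The counts below are hypothetical (the paper presents its matrices as figures, not raw numbers), chosen only to show the arithmetic on a balanced 2,000-image example with malignant as the positive class:

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy, precision, recall, F1 for the positive (malignant) class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 1000 benign and 1000 malignant test images.
acc, prec, rec, f1 = metrics_from_confusion(tn=958, fp=42, fn=16, tp=984)
print(f"acc={acc:.4f} prec={prec:.4f} rec={rec:.4f} f1={f1:.4f}")
```

Note how a low false-negative count (fn) drives recall up, which is exactly the property emphasized for clinical use.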
Visualization structure: Each Grad-CAM++ result consists of three image sections presented side by side. The first section shows the original breast ultrasound image selected from the testing dataset. The second section displays the standalone Grad-CAM++ heatmap, showing spatial activation patterns that influence the classification output. The third section overlays the heatmap on the original image, creating a composite view that maps model attention to anatomical structures. Red and yellow regions indicate high model attention (strong influence on prediction), while blue and black regions indicate minimal or insignificant impact.
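The composite third panel amounts to alpha-blending a colorized heatmap over the grayscale frame. A minimal NumPy sketch, with a simple red/blue mapping standing in for the full "jet"-style colormap the figures appear to use:

```python
import numpy as np

def overlay(image_gray, cam, alpha=0.4):
    """Blend a [0, 1] saliency map onto a [0, 1] grayscale image.
    Red channel tracks attention (hot), blue its absence (cold)."""
    h, w = image_gray.shape
    rgb = np.repeat(image_gray[..., None], 3, axis=-1)  # grayscale -> RGB
    heat = np.zeros((h, w, 3))
    heat[..., 0] = cam          # high attention -> red
    heat[..., 2] = 1.0 - cam    # low attention -> blue
    return (1 - alpha) * rgb + alpha * heat
```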
Benign case interpretations: For benign tumor examples, the Grad-CAM++ heatmaps rarely contain intense red regions indicating high malignancy attention. The resultant overlay images show a prediction value of 0.00, demonstrating confident benign classification. The critical regions are predominantly highlighted in blue, indicating low concern. This visual pattern helps clinicians quickly recognize benign cases and provides validation that the model is not being triggered by normal tissue structures or imaging artifacts.
Malignant case interpretations: For malignant tumor examples, the Grad-CAM++ heatmaps show prominent red activation regions that correspond to areas of high-level model attention. The overlay images display prediction values of 1.00, indicating confident malignancy detection. The critical regions are highlighted in red, and these regions align with clinically relevant features such as irregular masses, boundary irregularities, and shadowing patterns. This alignment between model attention and known diagnostic features supports the clinical validity of the model's decision-making process.
Clinical utility: The visualization approach enables radiologists and clinicians to relate model attention to anatomical and clinical features, including masses, irregular edges, and acoustic shadowing. Rather than simply accepting a binary benign/malignant output, clinicians can examine the heatmap to assess whether the model focused on diagnostically relevant regions. This promotes trust in the AI-assisted system and helps validate predictions before clinical decisions are made. The authors emphasize that this interpretability is essential in sensitive fields like breast cancer diagnosis, where incorrect predictions carry significant consequences.
Benchmark comparison: The proposed fusion model was compared against ten recent approaches for breast cancer detection on the same Breast Ultrasound Image Dataset. VGG19 with transfer learning and five-fold cross-validation achieved 87.8% accuracy but lacked explainability. An explainable DenseNet framework with Grad-CAM reached 89.87%. A multi-layer CNN achieved 96.10% but had no XAI component. A deep CNN with multi-scale kernels recorded 90.13%. EfficientNetB7 with U-Net segmentation and LIME explainability reached 91.67%. A hybrid ensemble combining MobileNetV2, ResNet101, VGG16, and ResNet50 with Grad-CAM achieved 93.5%. XGBoost with SHAP scored 85%. An ensemble meta-learning approach (ResNet50 + DenseNet121 + InceptionV3) reached 90%. A voting learning model with AlexNet, ResNet101, and InceptionV3 achieved 94.20%. Finally, a VGG16 transfer learning approach with Grad-CAM achieved 91%.
Proposed model superiority: The fusion model's 97.14% accuracy, 95.96% precision, 98.42% recall, and 97.18% F1 score exceeded all compared methods. The closest competitor in accuracy was the standalone CNN at 96.10%, which lacked any explainability. Among explainable approaches, the best competitor was the hybrid ensemble model at 93.5% with Grad-CAM. The proposed approach achieves both the highest accuracy and includes Grad-CAM++ explainability, making it the only method to simultaneously lead in both dimensions.
XAI coverage across studies: A notable finding from the comparative analysis is that many high-performing models (the 96.10% CNN, the 94.20% voting model, the 90% ensemble learning model) did not employ any XAI technique. Conversely, models that did include XAI (DenseNet with Grad-CAM at 89.87%, EfficientNetB7 with LIME at 91.67%, XGBoost with SHAP at 85%) generally showed lower accuracy. The proposed fusion model breaks this pattern by delivering the highest accuracy while maintaining robust visual explainability through Grad-CAM++.
Feature extraction advantage: The authors attribute the performance gap to intermediate-layer fusion, which extracts both global and local features through the complementary architectures. Unlike voting or stacking ensembles that operate at the decision level, the fusion approach creates a richer shared feature representation. This produces more reliable feature extraction across varying tumor morphologies, sizes, textures, and shading effects commonly encountered in breast ultrasound imaging.
Clinical value: The fusion model's 97.14% accuracy and 98.42% recall translate to meaningful clinical impact. The high recall is particularly important because it means the system misses very few malignant cases, minimizing false negatives that could lead to delayed cancer diagnosis. Combined with Grad-CAM++ heatmaps, clinicians gain both a highly accurate classification and a visual explanation showing the anatomical regions driving each prediction. The model fusion approach reduces both false positives and false negatives relative to individual models, and the explainable outputs promote the clinical trust needed for real-world adoption.
Strengths of the approach: By extracting low-level structural features and high-level semantic characteristics from ultrasound images through the complementary capabilities of VGG16, DenseNet121, and Xception, the fused model captures a wider range of diagnostic patterns than any single architecture. The intermediate-layer fusion strategy is more powerful than output-level ensemble techniques (voting, stacking) because it allows the classification head to learn from a joint feature space rather than aggregating independent decisions. The frozen backbone weights from ImageNet pre-training prevent overfitting on the relatively small BUSI dataset (8,116 images).
Acknowledged limitations: The authors recognize that the model faces difficulties in achieving strong generalization across datasets with different imaging setups and acquisition techniques. The evaluation was conducted on a single dataset (BUSI), and performance on data from different ultrasound machines, hospitals, or patient populations remains unvalidated. The authors also note that the model was trained and tested on a binary classification task (benign vs. malignant), which does not capture the full spectrum of clinical subtypes such as ductal carcinoma in situ (DCIS), invasive ductal carcinoma (IDC), or invasive lobular carcinoma (ILC).
Future work: The authors outline several directions for improvement: incorporating sophisticated feature selection techniques, improving computational efficiency, and enhancing generalizability through domain adaptation strategies. They plan to explore self-supervised and unsupervised learning techniques to further improve diagnostic accuracy and model resilience in real clinical settings. Privacy-preserving techniques for patient data protection are also listed as a future priority, which would be necessary for any deployment in healthcare environments subject to data protection regulations.