Breast cancer remains one of the most dangerous cancers affecting women globally and is the leading cause of cancer-related deaths in this population. According to recent figures from the American Cancer Society, over 40,000 women and approximately 600 men die from breast cancer each year. Histopathologically, breast tissue is commonly categorized into four primary classes: normal tissue, benign lesions (non-cancerous growths that do not invade surrounding tissue), in situ carcinoma (confined to the mammary ducts or lobules and treatable if detected early), and invasive carcinoma, the most severe type, which can metastasize to other organs.
Conventional detection methods: Breast cancer has historically been identified through multiple imaging approaches, including mammography, X-ray, ultrasound, Positron Emission Tomography (PET), Computed Tomography (CT), thermography (temperature measurement), and Magnetic Resonance Imaging (MRI). The gold standard for diagnosis remains pathological analysis, in which extracted tissue is stained with Hematoxylin and Eosin (H&E) and examined under the microscope. Histopathological imaging and genomics represent the two primary modalities used for breast cancer identification.
The rise of deep learning: Computer-Aided Detection (CAD) systems were introduced to simplify breast cancer identification, but traditional CAD systems depend on hand-crafted features that limit overall performance. Deep learning changed this landscape by offering representation learning across multiple layers, with concrete, interpretable features at the lower levels and increasingly abstract features at the higher levels. Compared with classical machine learning (ML) methods, deep learning requires less human intervention for pattern recognition and can effectively solve complex problems in image analysis and natural language processing.
Gaps in existing reviews: While several literature reviews on breast cancer detection have been published, most focus narrowly on image-based methods or traditional ML approaches. Reviews by Yassin et al. examined ML-based classifiers across image modalities. Others, such as Oyelade et al., focused specifically on mammography. Husaini et al. examined thermography. Most existing reviews that address deep learning cover very limited studies with no comprehensive, systematic analysis. This systematic literature review (SLR) aims to fill that gap by analyzing deep learning methods for breast cancer detection from 2010 to 2021, covering both histopathological imaging and genomics data across 98 qualifying articles.
Convolutional Neural Networks (CNN): CNN is the most popular deep learning method applied for breast cancer diagnosis. It consists of three fundamental layer types: a convolutional layer, a pooling layer, and a fully connected layer. These layers are stacked to create deep architectures for automatic feature extraction from raw data. Notable CNN variants used in breast cancer research include VGG, AlexNet, and GoogLeNet. CNNs can be grouped into two categories for breast cancer work: transfer learning (TL)-based models that use pre-trained networks such as AlexNet, ResNet, and VGG, and de novo trained models built and trained from scratch.
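The three fundamental layer types can be sketched in a few lines of NumPy (an illustrative toy, not code from any reviewed study; all shapes and values are invented):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation): slide the kernel over the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: downsample the feature map."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

def dense(x, weights, bias):
    """Fully connected layer applied to the flattened feature map."""
    return x.flatten() @ weights + bias

# Toy 6x6 image patch passed through one conv -> ReLU -> pool -> dense stage.
rng = np.random.default_rng(0)
patch = rng.random((6, 6))
kernel = rng.random((3, 3))
features = max_pool(np.maximum(conv2d(patch, kernel), 0))  # ReLU + 2x2 pooling
w = rng.random((features.size, 2))
logits = dense(features, w, np.zeros(2))                   # two-class scores
```

In a real CNN, many such kernels are learned per layer and the conv/pool stages are stacked; the sketch only shows how one convolution, one pooling step, and one fully connected layer transform a patch into class scores.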
Deep Neural Networks (DNN) and Recurrent Neural Networks (RNN): DNNs stack multiple layers (in the reviewed studies typically convolutional, pooling, fully connected, and output layers) and have proven effective for breast cancer detection. The convolution layer extracts high-level characteristics, while the pooling layer reduces computation by shrinking the convolved feature maps. RNNs are supervised deep learning techniques designed for sequential data. Standard RNNs, however, suffer from vanishing or exploding gradients, which limits their ability to model long-range temporal dependencies. To mitigate this, Long Short-Term Memory (LSTM) networks incorporate memory cells that retain relevant information over time, while Gated Recurrent Units (GRU) achieve similar gating with fewer parameters, making training faster and less complicated.
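To illustrate the gating idea, a single GRU step can be written directly from its update equations (a minimal NumPy sketch with invented dimensions; real implementations add bias terms and batching):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, params):
    """One GRU step: the update gate z decides how much of the candidate
    state replaces the old state; the reset gate r gates the previous
    state when forming the candidate."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)             # update gate
    r = sigmoid(Wr @ x + Ur @ h)             # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_cand          # interpolate old and new state

# Run a short sequence through the cell (hypothetical sizes: 4-dim input, 3-dim state).
rng = np.random.default_rng(1)
Wz, Wr, Wh = (rng.standard_normal((3, 4)) for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((3, 3)) for _ in range(3))
params = (Wz, Uz, Wr, Ur, Wh, Uh)
h = np.zeros(3)
for t in range(5):
    h = gru_cell(rng.standard_normal(4), h, params)
```

Because the new state is a convex combination of the old state and a bounded candidate, gradients flow more stably than in a vanilla RNN, which is the property that mitigates the vanishing-gradient problem.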
Autoencoders (AE) and Deep Belief Networks (DBN): Autoencoders replicate input values as output through encoder and decoder units, reducing data dimensions to obtain the most discriminative features from unlabeled data. The encoder converts input data into hidden features that are reconstituted by the decoder. Stacked Sparse Autoencoders (SSAEs) have been specifically applied to histopathological images for nuclei detection. DBNs are constructed by stacking multiple Restricted Boltzmann Machines (RBMs) in a greedy, layer-by-layer fashion. The top layer contains undirected connections, while lower layers have directed connections, with weight fine-tuning performed via Contrastive Divergence.
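The encode-reconstruct loop can be illustrated with a minimal linear autoencoder trained by gradient descent (a self-contained NumPy sketch on synthetic data, not a model from the reviewed studies):

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy unlabeled data: 200 samples in 8 dimensions lying near a 3-dim subspace.
basis = rng.standard_normal((3, 8))
X = rng.standard_normal((200, 3)) @ basis + 0.05 * rng.standard_normal((200, 8))

# Linear autoencoder: encoder W1 (8 -> 3 hidden features), decoder W2 (3 -> 8).
W1 = 0.1 * rng.standard_normal((8, 3))
W2 = 0.1 * rng.standard_normal((3, 8))
lr = 0.01

def recon_error(X, W1, W2):
    """Mean squared reconstruction error over the dataset."""
    return np.mean((X @ W1 @ W2 - X) ** 2)

err_before = recon_error(X, W1, W2)
for _ in range(500):
    H = X @ W1                       # encoder: compress to hidden features
    X_hat = H @ W2                   # decoder: reconstruct the input
    R = X_hat - X                    # reconstruction residual
    gW2 = H.T @ R / len(X)           # gradient of the loss w.r.t. W2
    gW1 = X.T @ (R @ W2.T) / len(X)  # gradient w.r.t. W1 (chain rule)
    W1 -= lr * gW1
    W2 -= lr * gW2
err_after = recon_error(X, W1, W2)
```

The bottleneck (3 hidden units for 8-dimensional input) forces the network to keep only the most discriminative directions of variation, which is the same principle SSAEs exploit on histopathology patches, with sparsity penalties and nonlinear layers added.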
Generative Adversarial Networks (GAN) and Multilayer Perceptrons (MLP): GANs use two competing neural networks (generator and discriminator) in a zero-sum game with both supervised and unsupervised learning. For breast cancer, GANs have been applied for mammographic image generation, tumor segmentation, and data augmentation to address limited training data. Extensions like Wasserstein GAN and Loss Sensitive GAN (LSGAN) were developed to improve training convergence. MLPs, the simplest deep learning architecture, employ additional layers and nonlinear activation functions. They are the foundational framework upon which most deep learning architectures are built, and have been applied across multiple breast cancer studies.
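The zero-sum objective reduces to two simple loss functions. The sketch below (illustrative NumPy with made-up discriminator outputs) shows the standard binary cross-entropy form, using the non-saturating generator loss common in practice:

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator loss: push D(x) -> 1 on real samples, D(G(z)) -> 0 on fakes."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def g_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) -> 1."""
    return -np.mean(np.log(d_fake))

# A discriminator that is confident and correct yields a low D loss ...
confident = d_loss(np.array([0.95, 0.9]), np.array([0.05, 0.1]))
# ... while a fooled discriminator (D(G(z)) near 1) yields a low G loss.
fooled = g_loss(np.array([0.9, 0.95]))
```

Training alternates gradient steps on these two losses; variants such as Wasserstein GAN replace the log-loss with a different distance between the real and generated distributions to stabilize exactly this alternating game.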
Search strategy: The review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, which provide a replicable and consistent method for identifying, selecting, and critically examining existing studies. Eight bibliographic databases were searched: Scopus, Google Scholar, IEEE Xplore, Web of Science, SpringerLink, ScienceDirect, ACM Digital Library, and PubMed, over the timespan 2010 to 2021. Search strings combined terms such as "breast cancer," "deep learning," and "Artificial Neural Network" with Boolean operators, e.g., "Artificial Intelligence AND breast cancer AND detection techniques."
Selection and screening process: In the identification stage, a total of 1,267 studies were obtained from both automatic and manual searches. During screening, 1,060 published papers remained after filtering duplicates, unsuitable, and irrelevant articles. Application of removal criteria discarded 774 articles, leaving 286. After quality assessment, 98 research articles were designated for the final review. The inclusion criteria required studies published from 2010 to 2021, involving breast cancer detection, focusing on deep learning-based approaches, written in English, and from journals or conferences with experimental results. Exclusion criteria removed studies without experimental results, those published before 2010, papers focusing on other cancer types or non-DL techniques, non-English publications, and non-journal/conference sources such as books and theses.
Quality assessment: To ensure quality, a standard quality checklist (SQC) of 10 questions was applied. These questions evaluated whether reports were clear and coherent, whether aims were clearly specified, whether data collection methods were described, whether the diversity of study contexts was explored, and whether findings were reliable and replicable. Following established conventions, only articles answering "yes" to at least seven of the ten questions were included in the final analysis.
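The scoring rule itself is a simple threshold over the checklist answers; a sketch of how such screening might be automated (article names and answers below are hypothetical):

```python
def passes_quality_check(answers, threshold=7):
    """answers: list of 10 booleans, one per checklist question.
    An article is retained only if at least `threshold` answers are 'yes'."""
    return sum(answers) >= threshold

# Hypothetical screening table: each article's 10 yes/no checklist answers.
screened = {
    "article_A": [True] * 8 + [False] * 2,   # 8 "yes" answers -> kept
    "article_B": [True] * 6 + [False] * 4,   # 6 "yes" answers -> discarded
}
kept = [name for name, answers in screened.items() if passes_quality_check(answers)]
```

In the actual review this filter, applied after the inclusion/exclusion criteria, reduced the candidate pool to the final 98 articles.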
Research questions: The SLR was guided by five research questions: (RQ1) What are the most common deep learning methods for breast cancer detection? (RQ2) Which deep learning methods perform most effectively? (RQ3) What evaluation metrics are commonly used? (RQ4) What datasets are available? (RQ5) What are the challenges and future directions? Data extraction captured key details including paper titles, publication years, author lists, publishers, deep learning methods used, reported accuracy, and evaluation metrics.
CNN leads by a wide margin: The systematic review revealed that CNN is the most extensively used deep learning method for breast cancer detection, appearing in 23 of the 98 reviewed studies. CNN models were applied for diverse tasks including extracting features from validated gene expression data to detect clinical outcomes, identifying the mitosis process for invasive breast cancer grading from histopathological imaging, and classifying tumor-associated stroma in diagnostic breast biopsies. CNN techniques were divided into transfer learning-based models (using pre-trained networks like AlexNet, ResNet, VGG) and de novo models trained from scratch.
MLP and DNN as strong runners-up: The Multilayer Perceptron (MLP) appeared in 13 studies, making it the second most commonly used architecture. Despite being the simplest deep learning design, MLP's nonlinear transformation capabilities made it applicable across multiple breast cancer detection tasks. Deep Neural Networks (DNN) were used in 8 studies, typically for tasks such as multi-NMF attention-based breast cancer prognosis, deep feature representation of tumor-infiltrating lymphocytes, and identification of cancer subtypes by integrating multiple types of transcriptomics data.
GAN, AE, DBN, and RNN: GANs appeared in 8 studies and were used not only for classification but also as image-augmentation tools to address limited training data. For example, Shams et al. designed a deep generative multi-tasking model combining GAN and CNN for mammography diagnosis, while Singh et al. used GAN for breast tumor segmentation. Autoencoders appeared in 6 studies, with applications including stacked sparse autoencoders (SSAEs) for nuclei detection on histopathology images. DBN appeared in only 2 studies, and RNN in just 1 study, reflecting their more limited applicability to breast cancer detection compared to the dominant CNN approach.
The deep learning system needs to process 200 to 300 cells per frame for effective cancer detection, which is impossible through manual tracking. This requirement underscores why automated deep learning approaches have become essential. The reviewed methods demonstrated the capability of diagnosing breast cancer up to 12 months earlier than conventional clinical procedures, highlighting the transformative potential of these approaches for early detection and improved patient outcomes.
Binary vs. multiclass classification: The reviewed deep learning methods used two classification approaches: binary (e.g., benign vs. malignant) and multiclass (multiple cancer subtypes). A clear pattern emerged: binary classification consistently produced higher accuracies than multiclass classification. The best multiclass accuracy reported was 95.7% by Mostavi et al. using CNN for breast cancer subtype classification based on feature extraction from gene expression data. In contrast, binary classification results frequently exceeded 96%, with multiple studies reaching 98% or higher.
Top-performing methods: Among binary classification approaches, the highest reported accuracy was 98.7% by Togacar et al. using a CNN combined with linear discriminant analysis and ridge regression on autoencoder-processed invasive breast cancer images. Other high performers included Feed-Forward Neural Networks at 98.3% accuracy for tumor detection based on cancer subtypes, CNN at 98% for feature selection-based detection, and MLP at 98% for malignant/benign classification. The CNN model by Ha et al. achieved 97% accuracy for binary classification on imaging data, demonstrating strong performance across modalities.
Genomic vs. imaging data: When imaging data were used for binary classification, high performance was generally obtained. However, for multiclass classification (subtypes), genomic sequencing data exhibited better results than imaging data. For example, multiclass genomic approaches achieved up to 95.7% accuracy, while multiclass imaging approaches topped out at approximately 90%. This suggests that gene expression data may carry richer information for distinguishing among cancer subtypes, while imaging excels at the fundamental benign-vs.-malignant distinction.
Hybrid models: The reviewed studies used both standalone deep learning models and hybrid approaches combining deep learning with traditional machine learning. Hybrid models that leveraged CNN for feature extraction paired with classical ML classifiers tended to achieve competitive or superior results. The extensive use of different algorithms across studies, with CNN and MLP being the most frequently employed, reflects the field's active experimentation with architectural choices optimized for specific data types and classification tasks.
Accuracy dominates: Among the 98 reviewed studies, accuracy was the most frequently used evaluation metric, appearing in 42 studies. Accuracy is calculated by dividing the number of correct predictions (true positives plus true negatives) by the total number of predictions. While it provides a straightforward measure of overall performance, accuracy alone can be misleading when datasets are imbalanced, as a model could achieve high accuracy simply by predicting the majority class.
F1-score and precision: The F1-score, which is the harmonic mean of precision and recall, was the second most used metric, appearing in 23 studies. Precision (measuring how many positive predictions were actually correct) appeared in 21 studies. The authors note an important trade-off: enforcing higher precision may reduce recall, and vice versa. In breast cancer detection, this trade-off is clinically significant because a false negative (missing a malignant case) is far more dangerous than a false positive. The F1-score helps balance these competing objectives.
Recall, specificity, and AUC-ROC: Recall (sensitivity), which measures the proportion of actual positive cases correctly detected, appeared in 18 studies. Specificity, measuring the proportion of negative cases correctly identified, appeared in 14 studies. The AUC-ROC (Area Under the Receiver Operating Characteristic Curve), which captures the trade-off between false positive rate and true positive rate, appeared in 13 studies. A perfect classifier achieves an AUC of 1.0, while a random classifier scores 0.5. For false positive rate (FPR) and false negative rate (FNR), lower values indicate better generalization ability.
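All of these threshold metrics derive from the four confusion-matrix counts; a small self-contained example (hypothetical labels, with 1 = malignant as the positive class):

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Derive the standard metrics from confusion-matrix counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # sensitivity: malignant cases caught
    specificity = tn / (tn + fp)      # benign cases correctly cleared
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return dict(accuracy=accuracy, precision=precision, recall=recall,
                specificity=specificity, f1=f1)

# Hypothetical predictions on 10 cases (4 malignant, 6 benign).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
m = confusion_metrics(y_true, y_pred)
```

The example also illustrates the imbalance caveat: a model predicting "benign" for every case would score 60% accuracy here while achieving zero recall, which is why accuracy alone is insufficient.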
Inconsistency in evaluation: A notable finding from the review is that deep learning methods for breast cancer detection are not rigorously evaluated using any specific standardized set of metrics. Not all published works adopted confusion matrix parameters, limiting the review's ability to perform consistent cross-study comparisons. Only accuracy metrics were universally available for analysis. This inconsistency suggests a need for standardized evaluation protocols in future breast cancer deep learning research to enable more meaningful comparisons across methods and datasets.
Genomic datasets: The Cancer Genome Atlas (TCGA) is the most popular genomic dataset, containing 11,429 instances with clinical data for each participant, including genetic and genomic information. It aims to detect the complete set of DNA changes in various cancer types. The METABRIC dataset, the second most widely used, contains 543 instances with clinical characteristics, SNP genotypes, CNV profiles, and expression data derived from breast cancer patients. Other genomic resources include the NCI Genomic Data Commons (GDC) with 9,114 instances, the GEO database with 404 instances of gene sequencing data, and the Spark Dataset with 106 instances. Resources such as ArrayExpress and STRING/BIOGRID additionally contribute high-throughput functional genomics data and protein interaction networks.
Imaging datasets: The Wisconsin Breast Cancer Dataset (WBCD) from the UCI repository is the most used imaging-derived dataset, containing 569 records computed from fine needle aspiration (FNA) of breast tissue, each described by features of the cell nuclei such as radius, texture, perimeter, and smoothness. The DDSM (Digital Database for Screening Mammography) is the most comprehensive imaging resource, with 10,239 instances combining benign, normal, and cancer volumes. Its enhanced version, CBIS-DDSM, provides 1,644 instances with improved ROI-segmented images, including 891 mass cases and 753 microcalcification instances.
Additional imaging resources: The MIAS (Mammographic Image Analysis Society) database contains 322 digitized mediolateral oblique (MLO) images from 161 cases, covering benign tumors, malignant tumors, and normal tissue. The INbreast dataset contains 410 mammography images from screening, diagnosis, and follow-up cases captured between 2008 and 2010 at the Breast Center in CHSJ, Porto, with expertly annotated ground truth. MRI datasets with 1,500 instances are also available through HIPAA-compliant institutional approvals. Several private datasets, such as those from Helsinki University and the University of Vermont Medical Center, supplement these public resources.
Data availability challenges: Deep learning methods require large amounts of training data, and a persistent lack of sufficient data remains a major barrier to applying these algorithms for medical diagnosis. Imaging datasets are more readily available than genomic datasets. The review identified approximately 10 public and 9 private datasets, so nearly half of the available resources are not openly accessible. The availability of comprehensive, diverse, and well-annotated public databases remains a critical bottleneck for advancing deep learning research in breast cancer detection.
Balanced dataset challenge: Deep learning methods for breast cancer face a fundamental data imbalance problem. Publicly available imaging and genomic databases lack pathological heterogeneity and representation of coexisting benign malignancy across different populations. Private datasets are arbitrary in terms of size, number, and format. The labor- and time-intensive annotation process, for which clinical radiologists are not always available, further compounds this issue. While oversampling methods and SMOTE (Synthetic Minority Oversampling Technique) have been applied, research has shown that oversampling can lead to choice-based sample biases. Future work should explore both under- and oversampling methods combined and invest in creating adequate public and private breast cancer databases.
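The core of SMOTE is straightforward: each synthetic sample interpolates between a minority-class sample and one of its nearest minority-class neighbours. A minimal NumPy sketch (synthetic data, illustrative only; production work would use a library implementation such as imbalanced-learn):

```python
import numpy as np

def smote(minority, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating each chosen
    sample toward one of its k nearest minority-class neighbours."""
    rng = rng or np.random.default_rng(0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from sample i to every other minority sample.
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        out.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(out)

rng = np.random.default_rng(3)
malignant = rng.standard_normal((12, 5))      # hypothetical under-represented class
synthetic = smote(malignant, n_new=20, rng=rng)
```

Because every synthetic point lies on a segment between two real minority samples, the technique densifies the minority region rather than duplicating records, though, as the review notes, it can still introduce choice-based sampling biases.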
Interpretable deep learning architectures: Designing deep learning models appropriate for medical data remains challenging, especially given the complexity of radiological images. Deep learning networks that are difficult to interpret pose a common problem in breast cancer detection. Because deep learning can handle heterogeneous datasets, combining images with diagnostic and histological reports increases the possibilities for interpretable models. Image captioning, which couples computer vision and NLP in a standard encoder-decoder architecture, has made significant progress; by using text features to express radiomic features, this hybrid approach can describe breast cancer diagnostic data more accurately, creating a promising direction for interpretable breast cancer AI.
Clinical application gaps: Despite their best efforts, most studies did not fully utilize all relevant available data. Important breast cancer characteristics from imaging, such as detailed malignant lesion features on ultrasound (hypoechogenicity, angular margin, posterior shadowing, internal vascularity), are often overlooked. Similarly, risk factors routinely considered by clinicians, including age, family history, and genetics, are frequently absent from deep learning models. The deep features obtained through deep learning representations could serve as valuable data for additional research, and multitask learning represents a promising future direction that can reduce overfitting through shared learning features.
Attention mechanisms and gene data: The review found that most image classification studies had not properly utilized attention mechanisms, opening an opportunity for researchers to improve the precision of deep learning methods through attention-based architectures. Meanwhile, researchers have increasingly focused on gene sequence data, with future opportunities to combine different gene sequencing datasets for larger-scale predictions. More studies should emphasize deriving important aspects from gene expression data while employing confusion matrix parameters for more rigorous evaluation.
Source type restriction: The SLR is restricted exclusively to journal and conference materials discussing deep learning for breast cancer detection. While the search method systematically identified and eliminated irrelevant publications in the early stages, the authors acknowledge that incorporating additional sources, such as books and technical reports, could have enriched the review. The automated search criteria removed several irrelevant publications, which ensured that the chosen papers met the investigation requirements but may also have excluded borderline relevant work.
Language and database limitations: The review was limited to English-language publications. While this introduces potential linguistic bias, the authors note that all papers gathered for the study were written in English, reducing the practical impact of this restriction. However, related publications in other languages may exist in this area of study. Additionally, although primary databases were considered when searching study articles, it is possible that digital libraries with pertinent studies were disregarded. The authors attempted to mitigate this by comparing search phrases and keywords against well-known collections of research studies.
Keyword coverage: When searching for keywords, certain synonyms might have been missed despite the broad search strings employed. To address this issue, the SLR protocol was updated to ensure no crucial phrases were omitted. The search strategy combined multiple Boolean operators across terms like "breast cancer," "deep learning," "Artificial Neural Network," "Artificial Intelligence," and "detection techniques." However, the rapidly evolving terminology in deep learning means that newer architectural names or domain-specific jargon may not have been captured by the search strings designed at the study's outset.
Evaluation consistency limitations: Because not all published works in the reviewed articles adopted confusion matrix parameters, only accuracy metrics were considered for cross-study performance analysis. This limitation means that the review could not comprehensively compare precision, recall, F1-score, or AUC across all 98 studies, potentially masking important performance differences between methods. Studies that reported only accuracy may appear comparable to those with more complete evaluation profiles, even though the latter may reveal important distinctions in clinical utility such as sensitivity to malignant cases.
The CNN consensus: The systematic review of 98 articles established CNN as the most widely used deep learning model for breast cancer detection, and the one most consistently reporting high accuracy. The widespread application of CNN algorithms to both MRI images and gene expression data represents a significant advance. Compared with other algorithms, CNN produces strong results across different data modalities and classification types. The authors recommend additional studies applying hybrid algorithms built on CNN to potentially push performance even higher.
Attention mechanisms as an untapped resource: A striking finding was that most image classification studies have not properly utilized attention mechanisms. Attention mechanisms allow deep learning models to selectively focus on the most relevant regions of an image or the most informative features in a dataset. Their underutilization in breast cancer detection represents a significant opportunity. Researchers who incorporate attention-based architectures into CNN or hybrid models could potentially improve classification precision, especially for challenging cases where subtle visual differences distinguish malignant from benign tissue.
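At its core, an attention layer computes a weighted average of value vectors, with the weights derived from query-key similarity. A minimal NumPy sketch of scaled dot-product self-attention (invented sizes, illustrative only):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each query attends to all keys, and the
    resulting weights form a probability distribution over input positions."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # query-key similarity, scaled
    weights = softmax(scores, axis=-1)       # one distribution per query
    return weights @ V, weights              # weighted average of values

rng = np.random.default_rng(4)
# Six hypothetical image-patch features of dimension 8, attending over
# themselves (self-attention); real models first project X into Q, K, V.
X = rng.standard_normal((6, 8))
out, w = attention(X, X, X)
```

The attention weights `w` are exactly what makes such models attractive for diagnosis: they indicate which tissue regions the model focused on, offering both a potential accuracy gain and a built-in interpretability signal.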
Gene sequence data and multiclass classification: The review highlighted that binary classification (benign vs. malignant) dominates current research, with relatively few studies tackling multiclass classification of cancer subtypes. Future researchers have various opportunities to contribute by combining different gene sequencing datasets for larger-scale predictions and by leveraging genetic data to create multiclass predictors. The majority of studies using genetic sequencing data focused on breast cancer detection and survival likelihood with binary categorization, leaving multiclass subtype prediction, risk level determination, and recurrence likelihood prediction as open frontiers.
The data and standardization imperative: Large-scale, thorough, and fully labeled whole-slide image (WSI) datasets are currently lacking. The creation of sizable public databases is crucial for future research. More studies should emphasize deriving important aspects from gene expression data to improve outcomes and accuracy by employing confusion matrix parameters for more rigorous evaluation. The accuracy metric alone, while the most commonly reported (42 of 98 studies), is insufficient for evaluating clinical utility. Standardizing evaluation across metrics including recall, precision, F1-score, specificity, and AUC-ROC would enable more meaningful progress tracking across the field.