Leukemia is a cancer of the blood cells arising from abnormal bone marrow, and it requires prompt, accurate diagnosis for effective treatment and a favorable patient prognosis. Traditional diagnostic methods include microscopy of blood smears, flow cytometry, and bone marrow biopsy. Each has drawbacks: microscopy demands manual inspection of densely packed cells where morphological changes are easy to miss, flow cytometry requires fresh blood draws and cannot assess cell morphology, and bone marrow biopsy is invasive with a turnaround time of one to two weeks. The realistic possibility of human error and the time-intensive nature of these approaches raise an important question: can deep learning (DL) streamline and improve leukemia diagnosis?
Deep neural networks (DNNs) are a subset of artificial neural networks inspired by the human brain, composed of interconnected artificial neurons organized into layers. Convolutional neural networks (CNNs) are specialized DNNs for image and video processing, while recurrent neural networks (RNNs) handle sequential data like text. Pre-trained DNNs have widespread medical applications because they can be trained on large image datasets to surface new biomedical insights and associations. Prior work has shown CNN-based systems recognizing myeloblasts with over 90% accuracy from 18,000 images, and other models achieving 99% accuracy distinguishing malignant cells from healthy ones based on cell size and nucleus features.
Scope of this review: This scoping review used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to survey articles published between 2010 and 2023 that applied DL specifically to leukemia diagnosis. The authors searched Embase, Ovid MEDLINE, and Web of Science using terms combining "leukemia" with "deep learning," "artificial neural network," "neural network," "diagnosis," and "detection." The review ultimately included 20 articles, reported chronologically to highlight the progression of DL technology over a 13-year span.
The authors note that researchers have also used DNNs to predict risk of leukemia development based on genetic factors, including predicting mutations in nucleophosmin 1 (NPM1), a mutation characteristic of acute myeloid leukemia (AML). One DL model achieved 86% accuracy in predicting NPM1 mutation status from bone marrow cytomorphology, and 91% accuracy when differentiating a leukemia subtype from healthy donor samples. Data augmentation, color standardization, and image pre-processing have all proven important for maximizing DNN performance in distinguishing leukemic cells from healthy cells.
The literature search was conducted in September 2023 across three databases: Embase, Ovid MEDLINE, and Web of Science. Eligible articles were published between 2010 and 2023 in English. The search terms were structured as "leukemia" AND ("deep learning" OR "artificial neural network" OR "neural network") AND ("diagnosis" OR "detection"). The Embase search alone yielded a complex Boolean query across multiple fields, beginning with 437,276 results for leukemia-related terms and 117,784 for AI-related terms, ultimately narrowing to 439 articles when intersected with detection and diagnosis terms.
Screening process: The initial search identified 1,229 citations across all three databases. After removing 375 duplicates, 854 studies remained for screening. Team members reviewed titles and abstracts, achieving consensus through discussion among all three reviewers. A total of 834 articles were excluded: 470 for wrong topic, 152 for wrong publication type (abstracts and dissertations), 96 for being too old, 74 for wrong population, 30 for not aligning with the scoping review objective, and 12 for wrong study design. This left 20 articles for the final review.
Critical appraisal: All 20 articles underwent evaluation using the Joanna Briggs Institute (JBI) critical appraisal tools, which are known for reliability and ongoing improvement. Two team members independently performed blinded appraisals. Articles were categorized by risk of bias: below 50% was high risk, between 50% and 70% was moderate risk, and above 70% was low risk. Only articles scoring above 70% were retained. After deliberation, all 20 articles met the threshold and were included. Data charting was performed using Excel, with an iterative extraction process covering each article's purpose, study population, sample, methods, limitations, and key findings based on the percentage accuracy of the DL model.
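The appraisal thresholds above amount to a simple scoring rule. A minimal sketch (function names are illustrative, not from the review):

```python
def risk_of_bias(score_pct: float) -> str:
    """Map a JBI appraisal score (percent) to a risk-of-bias category.

    Rule as stated in the review: below 50% is high risk, 50-70% is
    moderate risk, above 70% is low risk.
    """
    if score_pct < 50:
        return "high"
    if score_pct <= 70:
        return "moderate"
    return "low"


def retain(score_pct: float) -> bool:
    # Only articles scoring above 70% (low risk) were kept.
    return risk_of_bias(score_pct) == "low"
```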
The inclusion criteria encompassed both experimental and nonexperimental studies, required full-text availability, and limited articles to the English language. The review specifically targeted studies using DL and its subsets (deep neural networks) rather than broader concepts like general artificial intelligence or traditional machine learning. Abstracts, opinion pieces, presentations, and gray literature were all excluded.
Only two articles from the 2010-2018 period met the inclusion criteria, reflecting how nascent DL-based leukemia detection was at the time. In 2010, Adjouadi et al. created an early DL neural network model called Neural Studio, which classified and detected leukemia with 96.67% accuracy using 220 blood samples (60 abnormal, 160 normal). The model relied on Beckman-Coulter flow cytometry data containing 24 parameters including direct current impedance, opacity, and light scatter. Statistical feature extraction reduced each sample to a manageable 5x93 matrix, and a binary classifier categorized samples as normal or abnormal. The true positive fraction for acute myeloid leukemia (AML) samples was 90%, with a false positive fraction of only 2%.
The 2018 leap: Vogado et al. represented a major advance by eliminating the need for image segmentation entirely. Using three pre-trained CNNs (AlexNet, Vgg-f, and CaffeNet) for feature extraction on a dataset of 891 blood smear images, and a support vector machine (SVM) for classification, they achieved over 99% accuracy. Their approach used a structured pipeline: CNN-based feature extraction, gain ratio-based feature selection, and SVM classification. A key finding was that more features were needed to classify images containing multiple leukocytes, while fewer features sufficed for single-leukocyte images.
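Gain ratio-based feature selection, the middle stage of that pipeline, scores each feature by its information gain normalized by the feature's own entropy (the split information). A minimal sketch for discrete features (illustrative only; Vogado et al.'s exact implementation is not described here):

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy of a list of discrete values, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `labels` given `feature`, divided by the
    feature's own entropy so many-valued features are not favored."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == v]
        conditional += len(subset) / n * entropy(subset)
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0
    return (entropy(labels) - conditional) / split_info
```

Features would then be ranked by this score and only the top-ranked ones passed to the SVM.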
These two early studies demonstrate a critical methodological shift. In 2010, DL models depended on flow cytometry data from bone marrow samples, a technique not seen in later studies because of the ability to train networks directly on microscopic images for feature extraction. By 2018, pre-trained CNNs combined with SVM classifiers set a new accuracy standard. The accuracy improvement from 96.67% in 2010 to over 99% in 2018 reflects the broader transition from specialized, data-dependent approaches to more generalized methods leveraging transfer learning and established CNN architectures.
The early works also revealed important limitations. Adjouadi et al. noted that increasing the dataset size did not always guarantee better results when dealing with data contaminated by high-class overlap. Vogado et al. worked with a relatively small sample of 891 images. Nevertheless, these studies provided the historical foundation and proof of concept that DL could meaningfully assist in leukemia cell classification.
Seven articles published in 2020-2021 were included in this review, reflecting a burst of activity in the field. Four 2020 studies used different approaches to achieve high detection accuracy. Abou El-Seoud et al. built a five-layer CNN with grayscale conversion that classified four types of white blood cells (monocytes, lymphocytes, neutrophils, eosinophils) at 96.78% accuracy. Huang et al. applied noise reduction in grayscale images with three CNN frameworks (GoogleNet, ResNet, and DenseNet) combined with transfer learning, achieving 90% accuracy for the normal group and 97% average for leukemia detection across AML, ALL, and CML categories using bone marrow microscopy images from 104 subjects.
Novel architectures in 2020: Joshi et al. introduced a hybrid disruption-based salp swarm and cat swarm (DSSCS) CNN model with novel techniques for noise suppression, image segmentation, and color normalization, reaching 97% global classification accuracy on 15,920 images from a Barcelona clinic. The DSSCS-CNN model outperformed standalone SVM, SVM+NN, and CNN approaches by resolving the hyperparameter convergence problem in traditional CNNs, achieving 99% accuracy with VGG-16 training. Kalaiselvi et al. used a six-layer CNN with color normalization on nearly 10,000 microscopic blood images, attaining 98% accuracy for classification and approximately 97% validation accuracy.
2021 innovations: The 2021 studies marked the introduction of image augmentation and segmentation as standard techniques. Amin et al. achieved 99.57% accuracy using Open Neural Network Exchange (ONNX) and a YOLOv2 CNN (a single-stage real-time object detection model) for feature extraction, combined with multi-kernel SVM classifiers on 6,250 images across five WBC types. Their segmentation method yielded IoU and F1 scores of 0.97 and 1.0, respectively. Loddo et al. used k-nearest neighbor (KNN) for feature extraction with SVM and random forest for classification, reaching 98% accuracy, with CNN-trained models on large datasets averaging 97.9% versus 88.9% for the best traditional machine learning classifier (random forest).
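The IoU and F1 segmentation scores reported by Amin et al. can be computed directly from binary masks. A generic sketch over flattened 0/1 masks (not the authors' code):

```python
def iou(pred, truth):
    """Intersection over union of two flat binary (0/1) masks."""
    inter = sum(p & t for p, t in zip(pred, truth))
    union = sum(p | t for p, t in zip(pred, truth))
    return inter / union if union else 1.0


def f1(pred, truth):
    """F1 (Dice) score: harmonic mean of precision and recall."""
    tp = sum(p & t for p, t in zip(pred, truth))
    fp = sum(p & (1 - t) for p, t in zip(pred, truth))
    fn = sum((1 - p) & t for p, t in zip(pred, truth))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```

For binary masks the two metrics are monotonically related: F1 = 2·IoU / (1 + IoU).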
Vogado et al. (2021) evaluated LeukNet, a fine-tuned CNN based on VGG-16, using a leave-one-dataset-out cross-validation approach on 3,536 blood smear images from multiple sources. LeukNet achieved 98.61% accuracy through transfer learning. Their key finding was that fine-tuning may be more efficient than off-the-shelf feature extraction, and that CNNs with more feature map representations perform better in cross-dataset experiments. The choice of fine-tuning technique proved essential for correct definition of CNN parameters.
Seven articles from 2022 were included, representing the most productive single year in the review. Anilkumar et al. used AlexNet and LeukNet CNNs to perform dual work (feature extraction and classification) on 56 peripheral blood smear images of B-cell ALL and T-cell ALL, achieving 94.12% accuracy with both architectures. LeukNet, with only 17 layers and a depth of 5, matched AlexNet's accuracy while training in just 2 minutes and 26 seconds compared to AlexNet's 8 minutes and 27 seconds. Data augmentation techniques including flipping, rotation, translation, and scaling helped overcome the limited dataset size.
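The augmentation operations mentioned (flipping, rotation, translation) are simple, label-preserving image transforms. A pure-Python sketch on images represented as 2-D lists (illustrative only; real pipelines typically use a library such as torchvision or albumentations):

```python
def hflip(img):
    """Mirror each row (horizontal flip)."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def translate(img, dx, pad=0):
    """Shift right by dx pixels, padding the vacated columns."""
    return [[pad] * dx + row[:len(row) - dx] for row in img]


def augment(img):
    # One labeled image yields several label-preserving variants,
    # expanding a small dataset without new blood samples.
    variants = [img, hflip(img)]
    rotated = img
    for _ in range(3):  # 90, 180, 270 degrees
        rotated = rot90(rotated)
        variants.append(rotated)
    variants.append(translate(img, 1))
    return variants
```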
Ensemble approaches: Baig et al. used hybrid CNN models (CNN-1 and CNN-2) for feature extraction on 4,150 blood smear images, with canonical correlation analysis (CCA) fusion of extracted features. Individual class accuracies were 77.27% for ALL, 98.91% for AML, and 92.22% for multiple myeloma. The bagging ensemble model with CCA fusion achieved the highest accuracy at 97.04%. Claro et al. conducted the most comprehensive augmentation study, using 18 public datasets with 3,536 images. DenseNet121 with data augmentation achieved 97.11% accuracy in multiclass classification, with rotation identified as the most effective augmentation technique. ResNet-50 with augmentation excelled in the binary leukemia vs. healthy classification.
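A bagging ensemble like Baig et al.'s trains each base model on a bootstrap resample of the data and combines predictions by majority vote. A toy sketch with a nearest-centroid base learner on 1-D features (the base learner and all parameters are stand-ins, not the authors' CNN features or CCA fusion):

```python
import random
from collections import Counter


def centroid_fit(X, y):
    """Tiny base learner: per-class mean of 1-D features."""
    return {label: sum(x for x, l in zip(X, y) if l == label)
                   / sum(1 for l in y if l == label)
            for label in set(y)}


def centroid_predict(model, x):
    return min(model, key=lambda label: abs(model[label] - x))


def bagging_fit(X, y, n_models=15, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: sample len(X) indices with replacement.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(centroid_fit([X[i] for i in idx],
                                   [y[i] for i in idx]))
    return models


def bagging_predict(models, x):
    # Majority vote across the ensemble.
    votes = Counter(centroid_predict(m, x) for m in models)
    return votes.most_common(1)[0][0]
```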
Transfer learning and microarray approaches: Muhamad et al. used transfer learning with three pre-trained networks on 1,728 images from Hiwa Cancer Hospital in Iraq, yielding 95.2% for their CNN model, 97.2% for MobileNet-v2, and 81.5% for AlexNet. Despite different architectures and layer counts, overall model performance did not differ dramatically. Prabhakar et al. took a unique approach by applying probabilistic neural networks (PNN) to microarray gene expression data from the Golub dataset (7,129 genes, 47 ALL samples and 25 AML samples), achieving 95.705% accuracy without using data augmentation, image segmentation, or color normalization.
IoMT and near-perfect accuracy: Sakthiraj introduced a hierarchical CNN with integrated attention and spatial optimization (HCNN-IASO) within an Internet of Medical Things (IoMT) framework, achieving a remarkable 99.87% accuracy. The system classified leukemia subtypes (healthy, CML, CLL, AML, ALL) using data from the American Society of Hematology image database. Saleem et al. achieved an average of 99.7% accuracy using DarkNet-53 and ShuffleNet for feature extraction combined with SVM, ensemble methods, KNN, and naive Bayes for classification. The ensemble subspace KNN classifier hit 100% accuracy on the ALL-IDB database, and their DeepLab V3+ with ResNet-18 semantic segmentation achieved a global accuracy of 98.6%.
The year 2023 demonstrated a pivotal shift toward using a single model for both feature extraction and cell classification, moving away from the multi-model pipelines that defined earlier work. Four articles were included from this year. Houssein et al. achieved 99.80% accuracy with DenseNet-161 using a one-cycle cyclical learning rate (CLR) policy on the Blood Cell Count and Detection (BCCD) database, which contained approximately 9,966 training images and 2,487 validation images across four leukocyte types (eosinophils, lymphocytes, monocytes, and neutrophils). The combination of DenseNet-161 with CLR allowed rapid hyperparameter optimization, outperforming existing state-of-the-art methods.
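A one-cycle learning rate policy ramps the learning rate up to a maximum and then anneals it down well below the starting value over a single cycle, which is what enables the rapid hyperparameter search described above. A schedule sketch with cosine phases (the numeric defaults are illustrative, not Houssein et al.'s settings):

```python
from math import cos, pi


def one_cycle_lr(step, total_steps, max_lr=0.01, start_div=25.0,
                 final_div=1e4, pct_warmup=0.3):
    """One-cycle policy: cosine ramp from max_lr/start_div up to
    max_lr, then cosine anneal down to max_lr/final_div."""
    warm = pct_warmup * total_steps
    if step < warm:
        t = step / warm
        lo = max_lr / start_div
        return lo + (max_lr - lo) * (1 - cos(pi * t)) / 2
    t = (step - warm) / (total_steps - warm)
    lo = max_lr / final_div
    return lo + (max_lr - lo) * (1 + cos(pi * t)) / 2
```

The schedule starts at max_lr/25, peaks at max_lr 30% of the way through, and finishes four orders of magnitude below the peak.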
Simplified approaches: Kadmin et al. proposed a CNN classifier to detect acute myeloid leukemia from single blood smears using just 100 images from the American Society of Hematology database (45 blast cells, 55 non-blast cells, with an estimated 35,000 distinct blood components). The system achieved accuracy of 96% for benign, 97% for early, 99% for pre, and 99% for pro categories. Despite the small dataset, the model demonstrated reliable automated processing including color correlation, segmentation of nucleated cells, and efficient validation.
Wavelet-based and large-scale approaches: Naz et al. developed an automated classification system using AlexNet with wavelet transformation for extracting low- and high-frequency information from augmented microscopic blood images. They achieved 96.9% accuracy on the LISC dataset (400 microscopic blood images augmented to 3,600 samples) but only 81.9% on the Dhruv dataset (augmented to 10,000 samples from the Australian National Database). The significant accuracy drop between datasets highlights generalizability challenges. Wang et al. used the largest dataset in the review, with 11,788 fully annotated micrographs from 728 smears and 131,300 expert-annotated single-cell images. Their YOLOX-s model for feature extraction combined with Meta-Learning Fusion and Learning Network (MLFL-Net) for classification achieved 92.50% overall accuracy, with some leukemia subtypes reaching 100% accuracy while others, particularly acute myelomonocytic leukemia, showed lower performance due to monoblast-myeloblast confusion.
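Wavelet transformation of the kind Naz et al. applied splits a signal into low- and high-frequency components. A one-level 1-D Haar sketch (their pipeline applies 2-D transforms to images; this only illustrates the low/high split and its exact invertibility):

```python
from math import sqrt


def haar_step(signal):
    """One level of the Haar wavelet transform: pairwise averages carry
    the low-frequency content, pairwise differences the high-frequency.
    Assumes an even-length signal."""
    s = sqrt(2)
    low = [(a + b) / s for a, b in zip(signal[::2], signal[1::2])]
    high = [(a - b) / s for a, b in zip(signal[::2], signal[1::2])]
    return low, high


def haar_inverse(low, high):
    """Reconstruct the original signal from the two subbands."""
    s = sqrt(2)
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) / s, (l - h) / s])
    return out
```

On images, applying such a transform along rows and then columns yields the low- and high-frequency subbands that are fed to the CNN as complementary feature channels.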
Early era (2010-2018): The foundational studies demonstrated that trained models were nearly perfect in feature extraction but depended on SVM for cell classification. The shift from Neural Studio's 96.67% accuracy in 2010 to over 99% in 2018 reflects the broader transition from specialized, flow-cytometry-dependent techniques to generalized approaches leveraging pre-trained CNNs. This methodological evolution from using flow cytometry data to direct microscopic image analysis opened the door to purely image-based diagnostic pipelines.
Middle era (2020-2022): Innovations including grayscale conversion, noise reduction, the DSSCS-CNN model, and color normalization yielded accuracy percentages ranging from 81.5% to 99.57%. The DSSCS-CNN model was distinct because it handled both feature extraction and cell classification, a departure from earlier approaches that used separate techniques like SVM, bagging, and multiclass ensembles. Image augmentation proved especially valuable by allowing datasets to be artificially expanded without requiring additional bone marrow samples, which is critical where large patient datasets are hard to obtain. This contributes to more efficient and cost-effective development of neural network models.
Current era (2023): The shift from using several CNNs for specific tasks to applying a single model for both feature extraction and classification represents a transition toward simplicity and efficiency. DenseNet-161 achieved 99.80% accuracy, demonstrating that densely connected layers can be highly effective for both tasks. This architectural approach enhances the model's ability to learn complex features without excessively increasing model size, making it more feasible for clinical implementation. However, the highly variable accuracy rates across the 2023 studies (81.9% to 99.80%) indicate that more research and replication are still necessary.
The authors emphasize that despite 13 years of published evidence, none of these models have been validated in real-world clinical settings where patient outcomes depend on diagnostic accuracy. Integrating image augmentation, segmentation, and advanced CNN architectures shows promise, but the range of different methodologies also suggests an ongoing need for standardization and refinement.
Dataset variability: A fundamental challenge across the 20 reviewed studies is the lack of standardized datasets. Different studies used different image databases (ALL-IDB, LISC, BCCD, ASH, Raabin-WBC, and hospital-specific collections), making direct comparison of accuracy scores unreliable. Several studies relied on single-center data from one hospital, which limits the generalizability of findings to broader patient populations. The variability in image acquisition protocols, microscope magnifications (ranging from 300x to 500x), and staining techniques further complicates cross-study comparisons.
Augmentation and segmentation risks: While data augmentation has been valuable for expanding limited datasets, excessive segmentation and augmentation can introduce artifacts and omit key data, decreasing the validity of reported accuracy rates. Blood cell images reflect a degree of underlying genetic and biological variability that augmentation cannot artificially reproduce. Researchers specifically identified the "curse of dimensionality," where the number of variables at the genetic level far exceeds the number of available samples, as a barrier to robust model training.
Review-level limitations: The scoping review itself has notable constraints. The search was limited to three databases (Embase, Ovid MEDLINE, Web of Science), and other databases might have yielded additional articles. Only English-language articles were included, potentially missing relevant studies published in other languages. The search terms, while comprehensive, may not have captured all relevant terminology used in the DL and leukemia literature. These factors may affect the comprehensiveness, generalizability, and relevance of the synthesized evidence.
Future research priorities: The authors identify clinical validation as the most salient gap in the field. While DL has shown promising results in controlled research settings, performance in real-world clinical environments may vary substantially. Future work should focus on enhancing model robustness by addressing variations in data quality, acquisition protocols, and patient populations to mitigate overfitting risks. Exploring new modalities such as gene expression profiles and diverse imaging modalities, combined with attention to patient demographics and disease subtypes, could strengthen model applicability. Collaborative efforts among researchers on a global scale are described as essential to overcome these limitations and move toward clinical deployment.