Deep learning enhances acute lymphoblastic leukemia diagnosis and classification using bone marrow images


Plain-English Explanations
Pages 1-2
Why Bone Marrow Analysis Is the Gold Standard for ALL, and Why It Needs Automation

Acute lymphoblastic leukemia (ALL) is the most common pediatric malignancy, with pediatric cases representing approximately 80% of all ALL diagnoses. In the United States, the estimated incidence is about 1.6 cases per 100,000 individuals. ALL arises from precursor cells of the B and T lineages, and the World Health Organization classifies these conditions as B- or T-lymphoblastic leukemia/lymphoma. Certain genetic conditions, including Down syndrome, Fanconi anemia, Bloom syndrome, and ataxia-telangiectasia, have been identified as risk factors in children.

Bone marrow as the diagnostic cornerstone: While peripheral blood smears (PBS) offer a rapid and non-invasive initial screening method, bone marrow aspiration and biopsy remain the gold standard for confirming ALL. Bone marrow analysis provides a complete examination of cellular structure and appearance, which helps indicate prognosis and disease evolution. Additional tests such as flow cytometric immunophenotyping complement the evaluation. However, bone marrow aspiration is invasive, painful (particularly for pediatric patients), and obtaining quality samples can be difficult.

The gap this review fills: Prior reviews of AI-based ALL classification, including those by Das et al. and Mustaqim et al., primarily focused on peripheral blood smear samples. This review specifically targets deep learning (DL) applied to bone marrow images, an area that had been underrepresented in the literature. The authors reviewed ten studies published between 2013 and 2023 across India, China, Saudi Arabia, and Mexico, evaluating how convolutional neural networks (CNNs) and related architectures perform in detecting and classifying ALL from bone marrow aspirates.

TL;DR: ALL is the most common childhood cancer (80% of cases are pediatric, incidence ~1.6 per 100,000 in the US). Bone marrow biopsy is the gold standard for diagnosis but is invasive and subjective. This review covers 10 studies (2013-2023) on deep learning applied specifically to bone marrow images for ALL detection, filling a gap left by prior reviews that focused on peripheral blood.
Pages 2-3
How the Authors Searched, Screened, and Selected the Ten Studies

The search strategy was developed on June 11, 2023, beginning with the PubMed/MEDLINE database. The authors used broad search terms including "acute lymphoblastic leukemia," "acute lymphocytic leukemia," "acute lymphoid leukemia," "ALL," "artificial intelligence," "machine learning," "deep learning," and "neural network." No language or time frame restrictions were applied. The search strategy was then transferred to Scopus, Embase, and Web of Science using the Polyglot translator tool to ensure consistency across databases.

Screening process: The initial database search yielded 496 results, plus one article identified through manual extraction. After removing 282 duplicates using EndNote X9 and the Rayyan screening platform, 215 articles remained for title and abstract screening. Two reviewers independently screened these, resolving discrepancies by consensus. This process excluded 195 articles, leaving 20 for full-text review. After examining the complete texts and applying exclusion criteria, 10 more articles were eliminated, resulting in 10 studies included in the final review.

Inclusion criteria: Studies had to meet five specific requirements: (1) use of human ALL samples, (2) publication in English, (3) employment of deep learning techniques for diagnosing or classifying ALL, (4) use of bone marrow samples specifically, and (5) reporting of performance metrics. This last criterion was essential because the review aimed to directly compare model performance across studies using accuracy, precision, sensitivity, specificity, and F1 score.
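The five metrics the review compares all derive from the binary confusion matrix. A minimal sketch of their standard definitions (the counts below are illustrative, not taken from any reviewed study):

```python
# The five metrics the review uses to compare models, computed from a
# binary confusion matrix: true/false positives and true/false negatives.
def metrics(tp, fp, tn, fn):
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    precision   = tp / (tp + fp)            # positive predictive value
    sensitivity = tp / (tp + fn)            # recall / true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    f1          = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, sensitivity, specificity, f1

# Illustrative counts: 90 TP, 5 FP, 95 TN, 10 FN
acc, prec, sens, spec, f1 = metrics(90, 5, 95, 10)
print(round(acc, 3), round(sens, 3))  # 0.925 0.9
```

Note that accuracy alone can be misleading on imbalanced data, which is why the review required all available metrics to be reported.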

Data extraction: Two investigators independently extracted data from each eligible study, recording the primary author, publication year, country of origin, dataset used, target outcome, validation methodology, model architecture, and all available performance metrics. Disagreements were resolved in team meetings, and a study was included in the final review only if the team reached consensus.

TL;DR: The authors searched PubMed, Scopus, Embase, and Web of Science with no language or date restrictions. From 496 initial results, 282 duplicates were removed, 215 were screened by title/abstract, 20 underwent full-text review, and 10 studies met all five inclusion criteria (human ALL samples, English, DL techniques, bone marrow images, reported metrics).
Pages 3-5
The Ten Studies at a Glance: Datasets, Models, and Key Performance Numbers

The ten included studies were published between 2013 and 2023, with five originating from India, three from China, one from Saudi Arabia, and one from Mexico. Five of the ten studies used the SN-AM dataset, which contains microscopic bone marrow aspirate images from patients diagnosed with B-cell ALL and multiple myeloma (MM). The remaining studies used retrospectively collected hospital bone marrow samples of varying sizes, ranging from datasets with fewer than 50 patients to collections of over 1,000 images.

Validation approaches: Only two out of ten studies conducted external validation, a significant limitation. Yang et al. used the SN-AM dataset alongside the ALL-IDB1 database of peripheral blood smear images for external testing. Zhou et al. created a novel "AI-cell platform" database for white blood cell classification and externally validated their ensemble model on real clinical samples. The remaining eight studies relied solely on internal validation, using either train-test split (6 studies) or k-fold cross-validation (2 studies).
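The two internal-validation schemes the eight remaining studies relied on can be sketched as simple index bookkeeping (sizes here are illustrative): a single train-test split holds out one fixed portion, while k-fold cross-validation tests every sample exactly once across k rounds.

```python
# Minimal sketch contrasting the two internal-validation schemes:
# a single train-test split versus k-fold cross-validation (indices only).
def train_test_split_idx(n, test_frac=0.25):
    cut = int(n * (1 - test_frac))
    return list(range(cut)), list(range(cut, n))      # train, test indices

def k_fold_idx(n, k=5):
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

train, test = train_test_split_idx(100)
print(len(train), len(test))                  # 75 25
covered = sorted(i for _, te in k_fold_idx(100) for i in te)
print(covered == list(range(100)))            # True: each sample tested once
```

Both schemes draw train and test data from the same source, which is why neither substitutes for external validation on an independent dataset.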

Model architectures: The majority of studies employed CNNs as their primary classifier, incorporating various enhancements such as additional layers, optimization algorithms, and boosting techniques. Specific architectures included DenseNet121, AlexNet, ResNet50, VGG-19, ResNeXt101, and custom designs like "i-Net." One study by Ordaz-Gutierrez et al. diverged from the CNN paradigm entirely, using radial basis function neural networks combined with a fuzzy logic algorithm. Transfer learning and gradient boosting algorithms (CatBoost, XGBoost) were commonly employed to boost performance.

Top-line performance: The CNN with CatBoosting by Devi et al. achieved 100% accuracy, 100% precision, 99.9% sensitivity, and 100% specificity. The "i-Net" model by Ikechukwu et al. reached 99.18% accuracy. Kavitha et al.'s CNN with Cat-Swarm Optimization attained 99.6% accuracy, 99.2% precision, 99.5% sensitivity, and 99.3% specificity. Huang et al.'s DenseNet121 with transfer learning achieved 99% accuracy for ALL classification. At the lower end, Zhou et al.'s ensemble achieved 89% accuracy on external validation with 86% sensitivity and 95% specificity.

TL;DR: Ten studies across 4 countries; 5 used the SN-AM dataset. Only 2 of 10 performed external validation. Top accuracy: 100% (CNN + CatBoost), 99.6% (CNN + Cat-Swarm Optimization), 99.18% (i-Net), 99% (DenseNet121 + transfer learning). Most models used CNNs with boosting algorithms or transfer learning.
Pages 5-7
B-ALL vs. Multiple Myeloma Classification Using the SN-AM Dataset

Devi et al. (CLR-CXG model): This study combined a Convolutional Leaky ReLU architecture with CatBoost and XGBoost boosting algorithms for classifying B-ALL and MM in bone marrow images. The pipeline began with data preprocessing to eliminate anomalies, followed by augmentation techniques to expand the dataset. The CNN extracted features, which were then fed into the boosting algorithms for refined classification. The CatBoost variant achieved 100% accuracy, 100% precision, 100% specificity, a sensitivity of 99.9%, and an F1 score of 100%. The XGBoost variant reached 97.12% accuracy, 98.5% precision, 99% sensitivity, and 97.2% specificity. However, the study lacked external validation and did not report on resource allocation, memory usage, or energy efficiency.

Ikechukwu et al. (i-Net): This custom CNN was designed specifically for ALL classification using the SN-AM and ALL-IDB datasets. The preprocessing pipeline included grayscale conversion, contrast enhancement, and resizing. Segmentation used a UNet model with InceptionV2 architecture, while the classification CNN was built from scratch with additional convolutional layers and fine-tuned hyperparameters. The "i-Net" achieved 99.18% accuracy on the SN-AM dataset, substantially outperforming pre-trained ResNet-50 (84.5%) and VGG-19 (93.5%). Overfitting was mitigated through data augmentation, dropout regularization, and batch normalization.
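The preprocessing steps described for i-Net can be sketched with numpy. This is a hedged illustration: the paper's exact conversion weights, stretch method, and target size are not reported here, so the luminance coefficients, min-max stretch, strided resize, and 64-pixel output side are all assumed choices.

```python
import numpy as np

# Illustrative sketch of the i-Net preprocessing pipeline: grayscale
# conversion, contrast enhancement, and resizing. Parameters are assumptions.
def preprocess(rgb, out_side=64):
    gray = rgb @ np.array([0.299, 0.587, 0.114])   # luminance grayscale
    lo, hi = gray.min(), gray.max()
    stretched = (gray - lo) / (hi - lo + 1e-8)     # min-max contrast stretch
    step = stretched.shape[0] // out_side          # naive strided resize
    return stretched[::step, ::step][:out_side, :out_side]

img = np.random.default_rng(0).uniform(0, 255, size=(256, 256, 3))
out = preprocess(img)
print(out.shape)  # (64, 64)
```

A production pipeline would use proper interpolation for resizing (e.g., bilinear), but the strided version keeps the sketch dependency-free.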

Kavitha et al. (Cat-Swarm Optimization): This study introduced a CNN with hyperparameters tuned using a Cat Swarm Optimization algorithm, inspired by the seeking and tracing behaviors of cats. The three-phase pipeline involved data preparation from Jenner-Giemsa-stained bone marrow aspirate slides, data augmentation, and CNN classification with convolutional layers for feature extraction, pooling layers for dimension reduction, and fully connected layers for classification. The model achieved 99.6% accuracy, outperforming AlexNet, VGG-16, U-Net, support vector machine, random forest, and naive Bayes classifiers.
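The "seeking mode" idea behind cat swarm optimization can be illustrated with a toy stand-in: each cat perturbs copies of its current position and keeps the best copy. This is a deliberately simplified sketch on a toy objective, not the paper's implementation, and all numbers (swarm size, step size, iteration count) are invented.

```python
import random

# Highly simplified sketch of cat-swarm-style "seeking mode": each cat
# evaluates perturbed copies of itself and moves to the best one.
# A toy stand-in, not the paper's algorithm; parameters are invented.
def seek(objective, dim=2, n_cats=5, copies=4, steps=30, seed=1):
    rng = random.Random(seed)
    cats = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_cats)]
    for _ in range(steps):
        for i, cat in enumerate(cats):
            candidates = [cat] + [
                [x + rng.gauss(0, 0.5) for x in cat] for _ in range(copies)
            ]
            cats[i] = min(candidates, key=objective)  # keep the best copy
    return min(cats, key=objective)

# Toy objective: sphere function, minimum at the origin.
best = seek(lambda p: sum(x * x for x in p))
print(sum(x * x for x in best))  # small: the swarm converges toward 0
```

In the paper's setting, the objective would be validation loss as a function of CNN hyperparameters rather than a sphere function.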

Kumar et al. (Dense CNN): This study used a dense convolutional neural network (DCNN) for classifying B-ALL and MM. Training utilized an Adam optimizer with a sigmoid cross-entropy loss function and a learning rate of 0.01. Feature selection relied on the Chi-square test. The model achieved 97.25% overall accuracy, with precision of 100%, sensitivity of 93.97%, specificity of 95.19%, and an F1 score of 96.89%. Random Forests on the same dataset reached only 96.83%, and the proposed DCNN also outperformed VGG-16.
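The chi-square test used by Kumar et al. for feature selection scores a feature by how far its observed class counts deviate from the counts expected if feature and class were independent. A minimal sketch for the 2x2 case (the counts are made up):

```python
# Chi-square statistic for feature selection: compares observed counts
# against those expected under feature/class independence (2x2 case).
def chi_square(observed):
    # observed[i][j]: count with feature value i and class label j
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    total = sum(row)
    stat = 0.0
    for i in range(len(row)):
        for j in range(len(col)):
            expected = row[i] * col[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# A feature strongly associated with the class scores much higher
# than one that is independent of it.
print(chi_square([[40, 10], [10, 40]]))  # 36.0
print(chi_square([[25, 25], [25, 25]]))  # 0.0
```

Features with the highest statistics are retained; independent (uninformative) features score near zero.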

TL;DR: Four studies used the SN-AM dataset for B-ALL vs. MM classification. Devi et al.'s CNN + CatBoost hit 100% accuracy. Ikechukwu et al.'s i-Net reached 99.18%, beating ResNet-50 (84.5%) and VGG-19 (93.5%). Kavitha et al.'s Cat-Swarm Optimization CNN achieved 99.6%. Kumar et al.'s DCNN reached 97.25% with 100% precision. All four lacked external validation.
Pages 7-9
ALL Diagnosis from Retrospectively Collected Bone Marrow Samples

Duggal et al. (Stain Deconvolution Layer): This study addressed a fundamental limitation of standard CNNs: they operate in the RGB color space and can miss nuanced tissue-stain interactions critical for diagnostics. The authors introduced a Stain Deconvolution Layer (SD-Layer) placed at the front of CNN architectures, operating in the optical density (OD) color space using the Beer-Lambert law. The SD-Layer converted RGB microscopic images into OD space, revealing pixel stain quantities that hold diagnostic information. On a dataset of approximately 9,000 cell nuclei stained with Jenner-Giemsa, the Texture-CNN with SD-Layer achieved 93.20% accuracy and a 93.08% F1 score, while CNN (AlexNet) with SD-Layer reached 88.5% accuracy and 88.32% F1 score on 5-fold cross-validation.
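The RGB-to-OD conversion the SD-Layer builds on follows directly from the Beer-Lambert law: OD = -log10(I / I0), where I0 is the intensity of unattenuated light (white background). A sketch with illustrative pixel values:

```python
import numpy as np

# RGB -> optical density conversion per the Beer-Lambert law:
# OD = -log10(I / I0). Background pixels (I == I0) map to OD 0;
# darker (more stained) pixels map to higher OD.
def rgb_to_od(rgb, i0=255.0):
    rgb = np.clip(np.asarray(rgb, dtype=float), 1.0, i0)  # avoid log(0)
    return -np.log10(rgb / i0)

od = rgb_to_od([[255, 128, 26]])   # illustrative intensities
print(np.round(od, 3))             # OD rises as intensity falls
```

In OD space, stain quantities combine approximately linearly, which is what makes deconvolving individual stain contributions tractable for a learnable front-end layer.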

Rehman et al. (AlexNet with Transfer Learning): This study used stained bone marrow images from patients with ALL subtypes (L1: 100 images, L2: 100, L3: 30) and healthy controls (100 images). The pipeline included a novel thresholding-based segmentation technique followed by CNN classification using the AlexNet architecture with transfer learning. The model achieved 97.78% accuracy on the test dataset, with training taking approximately 163.63 seconds for 20 epochs. It outperformed naive Bayesian, K-nearest neighbor, and support vector machine classifiers on the same data.

Huang et al. (DenseNet121 with Transfer Learning): This study tackled multi-class leukemia classification, distinguishing ALL, acute myeloid leukemia (AML), and chronic myelocytic leukemia (CML) from healthy controls. The dataset included 23 ALL, 53 AML, 10 CML, and 18 healthy bone marrow samples, with the resulting images split 3:1 into training (991 images) and prediction (331 images) sets. Three CNN architectures were tested: Inception-V3, ResNet50, and DenseNet121. DenseNet121 on preprocessed data initially achieved 74.8% accuracy, but after applying transfer learning, accuracy surged to 95.3%, a 20.5-percentage-point improvement. Class-specific accuracies reached 99% for ALL, 97% for CML, 95% for AML, and 90% for normal samples. The model struggled to distinguish immature granulocytes from lymphocytes, which impacted AML classification.
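The transfer-learning recipe described here, keep a pretrained feature extractor frozen and retrain only a small classification head, can be sketched without a deep learning framework. In this hedged illustration a fixed random projection stands in for the frozen DenseNet121 backbone, and the data, dimensions, and learning rate are all invented.

```python
import numpy as np

# Transfer-learning sketch: frozen "backbone" features + a trainable
# logistic head. The random projection stands in for a pretrained CNN;
# all sizes and data are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))                 # stand-in inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # toy binary labels

W_frozen = rng.normal(size=(32, 16))           # "pretrained" weights, never updated
feats = np.tanh(X @ W_frozen)                  # frozen feature extraction

w, b = np.zeros(16), 0.0                       # the only trainable parameters
for _ in range(500):                           # gradient descent on logistic loss
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    grad = p - y
    w -= 0.1 * feats.T @ grad / len(y)
    b -= 0.1 * grad.mean()

p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
train_acc = float(np.mean((p > 0.5) == (y == 1)))
print(train_acc)  # the head alone fits the task on top of frozen features
```

Freezing the backbone is what lets transfer learning work on small medical datasets like this one: only a few thousand head parameters need to be estimated rather than millions of backbone weights.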

Ordaz-Gutierrez et al. (Fuzzy Logic + RBFNN): This study, designed for resource-constrained settings in Mexico, used a hybrid of robust fuzzy logic and radial basis function neural networks (RBFNN). Bone marrow aspirate images were converted to grayscale, enhanced via histogram equalization, and segmented using Sobel edge detection and mathematical morphology. Features including cell size, circularity, and nuclei-to-cytoplasm ratio were analyzed. The fuzzy logic algorithm generated a diagnosis variable, and the RBFNN improved classification accuracy. The system achieved 96.7% accuracy with 98% sensitivity and 91% specificity on 118 ALL and 62 healthy samples.
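Two of the hand-crafted features mentioned above have simple standard definitions. Circularity is commonly computed as 4*pi*area / perimeter^2 (exactly 1 for a perfect circle), and one common form of the nuclei-to-cytoplasm ratio divides nuclear area by cytoplasmic area. The numbers below are illustrative, not measurements from the paper:

```python
import math

# Two morphological features analyzed by the fuzzy-logic system.
def circularity(area, perimeter):
    # 1.0 for a perfect circle; lower for irregular shapes
    return 4 * math.pi * area / perimeter ** 2

def nc_ratio(nucleus_area, cell_area):
    # one common definition: nuclear area over cytoplasmic area
    return nucleus_area / (cell_area - nucleus_area)

disk = circularity(area=math.pi * 10 ** 2, perimeter=2 * math.pi * 10)
print(round(disk, 3))                                          # 1.0
print(round(nc_ratio(nucleus_area=60.0, cell_area=100.0), 2))  # 1.5
```

Lymphoblasts tend toward high N:C ratios, which is part of why such features carry diagnostic signal.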

TL;DR: Four hospital-based studies used diverse approaches. Duggal et al.'s SD-Layer CNN reached 93.20% accuracy on ~9,000 nuclei. Rehman et al.'s AlexNet achieved 97.78% across ALL subtypes (L1, L2, L3). Huang et al.'s DenseNet121 + transfer learning hit 99% ALL-specific accuracy (a 20.5-percentage-point gain over baseline). Ordaz-Gutierrez et al.'s fuzzy logic + RBFNN reached 96.7% accuracy with 98% sensitivity.
Pages 9-10
Vision Transformers and Ensemble CNNs for Leukemia Diagnosis

Yang et al. (MobileViTv2 + MultiPathGAN): This study collected 2,033 microscopic bone marrow images covering 6 disease types and 1 healthy control from two Chinese medical websites. To handle variations in staining styles across different laboratories, the authors introduced "stain domain augmentation" using a MultiPathGAN model, which normalized stain styles and expanded the dataset. They then developed MobileViTv2, a lightweight hybrid model combining the local feature extraction strengths of CNNs with the global context modeling of vision transformers (ViTs). Despite using only 9.8 million parameters, MobileViTv2 outperformed both standalone CNNs and standalone ViTs. On the test set, it achieved an average accuracy of 94.28%, with class-specific accuracies of 98% for MM, 96% for ALL, and 96% for lymphoma. Patient-level prediction accuracy averaged 96.72%. External validation on the ALL-IDB1 and SN-AM public datasets yielded accuracy values of 99.75% and 99.72%, respectively.
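The patient-level accuracy Yang et al. report implies aggregating per-image predictions for each patient. A common aggregation rule is majority vote, sketched below; note the rule and labels here are assumptions for illustration, as the review does not state how per-image outputs were combined.

```python
from collections import Counter

# Image-to-patient aggregation by majority vote (assumed rule;
# labels invented). Ties resolve to the first label encountered.
def patient_label(image_preds):
    return Counter(image_preds).most_common(1)[0][0]

print(patient_label(["ALL", "ALL", "MM", "ALL"]))  # ALL
```

Aggregating over multiple fields of view smooths out single-image errors, which is consistent with the patient-level accuracy (96.72%) exceeding the per-image average (94.28%).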

Zhou et al. (Ensemble of ResNet and ResNeXt): This study developed a deep learning system that mimicked the actual workflow of hematologists. The researchers collected 1,732 bone marrow images containing 27,184 cells from children with leukemia, creating the "AI-cell platform" dataset. Unlike prior studies that relied on preprocessed images, this system used raw clinical images. It detected and excluded uncountable and crushed cells, classified remaining cells, and generated diagnoses using an ensemble of ResNeXt101_32x8d, ResNeXt50_32x4d, and ResNet50. On internal validation, the ensemble achieved 82.93% accuracy, 86.07% precision, and 82.02% F1 score for WBC classification. On external validation using real-world clinical bone marrow samples, the system achieved 89% accuracy, 86% sensitivity, and 95% specificity for ALL diagnosis. It also accurately detected bone marrow metastasis of lymphoma and neuroblastoma with an average accuracy of 82.93%.
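The ensemble idea can be sketched as soft voting: average the class-probability outputs of several backbones and take the argmax. The review does not specify Zhou et al.'s exact combination rule, so averaging is an assumption, and the three arrays below are invented stand-ins for the ResNeXt101, ResNeXt50, and ResNet50 outputs.

```python
import numpy as np

# Soft-voting ensemble sketch: average per-model class probabilities,
# then take the argmax. All probability values are invented.
preds = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],   # model 1: two cells, three classes
    [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]],   # model 2
    [[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]],   # model 3
])
ensemble = preds.mean(axis=0)              # average over the model axis
labels = ensemble.argmax(axis=1)
print(labels)  # [0 2]
```

Averaging tends to cancel uncorrelated per-model errors, which is one reason ensembles of diverse backbones can outperform any single member.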

The contrast between these two studies is instructive. Yang et al.'s MobileViTv2 produced very high accuracy numbers but was validated on public datasets rather than in clinical practice. Zhou et al.'s ensemble posted lower headline numbers but was tested on real clinical samples with raw, unprocessed images, representing a more realistic evaluation of what deep learning can achieve in actual hospital workflows. Both studies conducted external validation, making them the only two of the ten reviewed studies to do so.

TL;DR: Yang et al.'s MobileViTv2 (only 9.8M parameters) achieved 94.28% average accuracy on test data and 99.72% on external validation (SN-AM). Zhou et al.'s ensemble of ResNet/ResNeXt models reached 89% accuracy and 95% specificity on real clinical samples. These were the only 2 of 10 studies with external validation.
Pages 10-11
Small Datasets, Missing External Validation, and Interpretability Gaps

Dataset size: The most pervasive limitation across the reviewed studies was small dataset sizes. Five of the ten studies used the SN-AM dataset, which contains only 90 B-ALL and 100 MM images. Other datasets ranged from 49 patients (Zhou et al.'s AI-cell platform for training) to a few hundred images. These small sample sizes raise serious concerns about whether the reported accuracy values, some reaching 100%, would hold on larger, more diverse patient populations. Models trained on limited data are prone to overfitting and may not generalize across different imaging equipment, staining protocols, or patient demographics.

External validation deficit: Only 2 of the 10 studies performed external validation. The remaining 8 relied solely on internal validation through train-test splits or k-fold cross-validation. Internal validation alone can produce inflated performance metrics because the training and test data often come from the same institution, use the same imaging equipment, and follow the same staining protocols. Without external validation on independent datasets from different centers, the reported performance numbers must be interpreted cautiously.

Additional recurring limitations: Several studies acknowledged dependency on image quality, meaning performance could degrade with poorly stained or poorly captured samples. Computational complexity was flagged as a concern, particularly for deployment in resource-constrained settings. Limited interpretability was noted in multiple studies, as CNN-based models function largely as "black boxes," making it difficult for clinicians to understand why a particular diagnosis was generated. The lack of standardized evaluation frameworks across studies also makes direct comparison of models challenging, since different studies used different datasets, different validation approaches, and different subsets of performance metrics.

The study by Ordaz-Gutierrez et al. specifically highlighted the importance of developing models that work in developing countries with limited laboratory resources. While the fuzzy logic + RBFNN approach was designed with this constraint in mind, most CNN-based models in the review require substantial computational infrastructure that may not be available in all clinical settings. This tension between accuracy and accessibility remains unresolved.

TL;DR: Key limitations: small datasets (the SN-AM dataset has only 190 images), only 2 of 10 studies performed external validation, dependency on image quality, computational complexity, and limited model interpretability. The 100% accuracy reported by some models should be viewed cautiously given these constraints.
Pages 11-12
Genomic Integration, Standardized Benchmarks, and Clinical Deployment

Molecular and genomic data integration: The authors emphasize that future research should combine image-based analyses with molecular and genomic data. Currently, all ten reviewed studies rely exclusively on morphological features from bone marrow images. Incorporating genomic information could provide a more holistic assessment of ALL cases, potentially enabling not just diagnosis but also subtype classification and prognostic stratification. Building AI models that can interpret both imaging and molecular/genomic data simultaneously represents a major opportunity for the field.

Larger and more diverse datasets: Expanding dataset sizes and ensuring representation across different patient populations, imaging equipment, staining protocols, and clinical centers is critical. The authors note that including more complex samples, such as those with 10-15% blast cells in otherwise normal marrow, would provide a more rigorous test of deep learning's ability to distinguish normal from malignant blasts in borderline cases. Multi-institutional data sharing and federated learning approaches could help address dataset limitations without requiring centralized data collection.

Standardized evaluation protocols: Establishing standardized protocols for external validation and cross-institutional benchmarking would make it possible to reliably compare models. Currently, the diversity in datasets, validation methods, and reported metrics makes it difficult to determine which architectures truly perform best. The authors advocate for comprehensive evaluation frameworks that incorporate external validation and real-world clinical testing as mandatory components.

Finer disease subtype classification: Most reviewed models focused on binary classification (ALL vs. MM or ALL vs. healthy) or broad multi-class tasks. Future work should explore finer-grained subtype classification, distinguishing between specific ALL subtypes (L1, L2, L3) or classifying rare leukemia variants. Yang et al.'s MobileViTv2, which handled 6 disease categories plus healthy controls, represents a step in this direction, but further refinement is needed. Optimizing lightweight architectures for clinical deployment, where inference speed and computational cost matter as much as accuracy, is another priority.

TL;DR: Key future priorities include integrating genomic/molecular data with image analysis, building larger multi-institutional datasets (including borderline 10-15% blast cases), establishing standardized external validation protocols, and developing finer subtype classification. Lightweight models like MobileViTv2 (9.8M parameters) point toward clinically deployable architectures.
Citation: Elsayed B, Elhadary M, Elshoeibi RM, et al. Deep learning enhances acute lymphoblastic leukemia diagnosis and classification using bone marrow images. 2023. PMC10731043. DOI: 10.3389/fonc.2023.1330977. Open access under a CC BY license.