Diagnosis and typing of leukemia using a single peripheral blood cell through deep learning


Plain-English Explanations
Pages 1-2
Diagnosing Leukemia from a Single Blood Cell, Without Bone Marrow

Leukemia is a highly heterogeneous group of blood cancers, and different subtypes require different treatments and carry different prognoses. Current diagnostic workflows rely on bone marrow (BM) aspiration followed by morphologic analysis, immunophenotyping via flow cytometry (FCM), cytogenetics, and molecular genetics. These tests are time-consuming, expensive, and invasive. Critically, the requirement for BM aspiration leads to poor patient compliance and contributes to delayed diagnosis, particularly in primary hospitals and resource-limited settings where specialized equipment and trained professionals are scarce.

Peripheral blood cell (PBC) morphology has long been used to screen for benign versus malignant hematologic disorders. However, diagnosing and typing leukemia solely from a peripheral blood smear remains a challenge for human experts. The manual inspection process is labor-intensive, subjective, and limited by the fact that certain morphological features are too subtle for the naked eye to reliably distinguish across leukemia subtypes.

This study from Shanxi Medical University and Taiyuan University of Technology proposes a deep learning approach that can diagnose and type leukemia using a single peripheral blood cell image. The system was trained on a self-built dataset of 21,208 cell images from 237 patients, spanning five types of normal white blood cells (WBCs) and eight types of leukemic cells. The authors drew on the concept of imaging genomics, hypothesizing that different leukemia types, characterized by distinct chromosomal aberrations and transcriptional networks, ultimately produce recognizable morphological patterns in individual cells.

Clinical significance: If peripheral blood analysis could reliably type leukemia, it would eliminate the need for invasive BM aspiration as a first-line screening tool. This would improve patient compliance, accelerate time to diagnosis, and make leukemia screening feasible in primary care settings and low-resource regions where bone marrow biopsies are not routinely available.

TL;DR: A deep learning model trained on 21,208 peripheral blood cell images from 237 patients can diagnose and type leukemia from a single cell image, potentially replacing invasive bone marrow aspiration for initial screening. The dataset covers 5 normal WBC types and 8 leukemic cell types.
Pages 2-3
Building a Comprehensive Peripheral Blood Cell Database

The study was approved by the Ethics Committee of Shanxi Medical University, with informed consent waived due to full de-identification of samples per the Declaration of Helsinki. From 2020 to 2022, the researchers enrolled 161 patients diagnosed with hematological malignancies and 118 controls from the Second Hospital of Shanxi Medical University and Shanxi Provincial People's Hospital. All diagnoses were confirmed by independent hematopathologists using clinical information, BM cell morphology, BM biopsy results, FCM, and genetic data.

Leukemia cohort: The malignant dataset comprised 8,955 images from 119 patients, covering acute myeloid leukemia (AML, 50 patients, 5,127 images), myelodysplastic syndromes (MDS, 22 patients, 496 images), chronic myeloid leukemia (CML, 5 patients, 452 images), chronic myelomonocytic leukemia (CMML, 3 patients, 62 images), acute lymphoblastic leukemia (ALL, 18 patients, 1,232 images), chronic lymphocytic leukemia (CLL, 17 patients, 1,250 images), plasma cell leukemia (PCL, 3 patients, 64 images), and hairy cell leukemia (HCL, 1 patient, 15 images). This distribution reflects real-world incidence rates, which inherently creates class imbalance.

Control cohort: The benign dataset contained 12,253 images from 118 controls, including 68 individuals undergoing routine physicals and 50 patients with conditions unrelated to hematological malignancies. Only five types of normal leukocytes were included: neutrophils (8,181 images), lymphocytes (3,261), monocytes (472), eosinophils (248), and basophils (91). Any suspected or ambiguous malignant cells were excluded from the control dataset. Atypical lymphocytes and neutrophils with toxic granules were retained to increase diversity.

Slide preparation and imaging: Slides were prepared on an SP-10 automated slide preparation unit (Sysmex) and stained with May-Grünwald Giemsa. Images were captured using the DI-60 automated digital cell image analyzer (Sysmex), producing approximately 120 images per slide at 250 x 250 pixels. Two to three slides were prepared per tumor sample, while a single slide sufficed for controls. PB smears were obtained from individuals with suspected hematological malignancies shortly after diagnosis and before chemotherapy.

TL;DR: The dataset contained 21,208 images total: 8,955 malignant cell images from 119 patients across 8 leukemia subtypes and 12,253 benign images from 118 controls across 5 normal WBC types. All slides were processed on Sysmex equipment with standardized staining and imaging at 250 x 250 pixels.
Pages 3-4
Segmentation-Enhanced Residual Network with Progressive Multigranularity Training

Classifying leukemia cells is a fine-grained visual classification problem because different leukemia subtypes exhibit only subtle morphological differences. The authors adopted a segmentation-first approach, using the Segment Anything Model (SAM) with the ViT-L backbone and SamAutomaticMaskGenerator to automatically segment individual cells from background. Manual inspection confirmed correct segmentation of intact cells. This step was critical because initial experiments without segmentation showed the model focusing on image regions outside the cell boundaries, as revealed by heatmap visualization.
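The paper does not detail how a single cell mask is chosen from SAM's output, but SamAutomaticMaskGenerator returns a list of mask records carrying `segmentation` and `area` fields, so the post-processing might look like this sketch. The centre-coverage heuristic and both function names are assumptions for illustration, not the authors' code:

```python
import numpy as np

def pick_cell_mask(masks, image_shape):
    """Pick the mask most likely to be the cell: the largest mask that
    covers the image centre. Hypothetical filtering rule; the paper only
    reports using SamAutomaticMaskGenerator plus manual inspection."""
    cy, cx = image_shape[0] // 2, image_shape[1] // 2
    central = [m for m in masks if m["segmentation"][cy, cx]]
    if not central:
        return None
    return max(central, key=lambda m: m["area"])

def apply_mask(image, mask):
    """Zero out everything outside the cell so the classifier
    cannot attend to background artefacts."""
    out = image.copy()
    out[~mask["segmentation"]] = 0
    return out
```

Masking out the background is what, per the heatmap analysis reported later, keeps the network's attention inside the cell boundary.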

Progressive multigranularity (PMG) training: The core classification network is based on an improved ResNeXt framework that employs progressive multigranularity training with jigsaw patches, originally developed by Du et al. (2020). PMG works by augmenting training data through a jigsaw generator and sequentially feeding it into the network in multiple steps at each iteration. This strategy forces the model to identify the most discriminative features within local regions and leverage information across diverse granularities, improving classification accuracy for fine-grained distinctions.
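The jigsaw generator at the heart of PMG can be illustrated with a few lines of numpy: split the image into an n x n grid of patches and reassemble them in random order, destroying global layout while preserving local texture. This is a simplified sketch following Du et al. (2020), not the authors' implementation:

```python
import numpy as np

def jigsaw(image, n, rng=None):
    """Split a square image into an n x n grid of patches and
    reassemble them in a random order (jigsaw generator sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = image.shape[:2]
    ph, pw = h // n, w // n
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(n) for j in range(n)]
    order = rng.permutation(len(patches))
    rows = [np.concatenate([patches[k] for k in order[i * n:(i + 1) * n]], axis=1)
            for i in range(n)]
    return np.concatenate(rows, axis=0)
```

During training, progressively coarser grids (e.g. n = 8, then 4, then 2, then the intact image) are fed to successive network stages within each iteration, so early stages must rely on local, fine-grained cues.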

Multistage hierarchical framework: Rather than attempting a single multiclass classification, the authors designed a five-stage framework reflecting the natural taxonomy of leukemia. Stage 1 classifies cells as benign or malignant. Stage 2B classifies benign cells into five normal types. Stage 2T classifies malignant cells into broader leukemia categories. Stage 3M further classifies myeloid neoplasm subtypes (AML, MDS, CML, CMML). Stage 3L classifies lymphoid neoplasm types (ALL, CLL, PCL, HCL). The output of each stage's PMG feeds into the next stage's loss calculation and parameter update, constraining and guiding downstream classification.
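A minimal sketch of how the five stages could be chained at inference time follows. The callable interface and label strings are hypothetical, and note that during training the paper couples the stages through shared loss calculation, which this routing sketch does not capture:

```python
def route(cell_image, stages):
    """Route one cell through the five-stage hierarchy.
    `stages` maps stage names to classifier callables (assumed API)."""
    if stages["stage1"](cell_image) == "benign":
        return stages["stage2B"](cell_image)    # 5 normal WBC types
    lineage = stages["stage2T"](cell_image)     # broad leukemia category
    if lineage == "myeloid":
        return stages["stage3M"](cell_image)    # AML / MDS / CML / CMML
    return stages["stage3L"](cell_image)        # ALL / CLL / PCL / HCL
```

The hierarchy mirrors how hematopathologists reason: first benign vs. malignant, then lineage, then subtype, so each classifier faces a narrower, easier decision.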

Training details: The network was trained on a server with two NVIDIA RTX A5000 GPUs and 64 GB of memory, with total PMG training taking 48 hours. Comparative architectures included ViT-B/16 and ResNeXt50. Models were initialized with Xavier initialization, optimized using Stochastic Gradient Descent (SGD) with a learning rate of 1e-4, weight decay of 5e-5, and 150 training iterations. Data was split at the subject level (80% training, 20% testing), ensuring no cell images from the same patient appeared in both sets.
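The subject-level split can be sketched as follows; this is a generic implementation of the leakage-prevention idea, not the authors' code:

```python
import random
from collections import defaultdict

def subject_level_split(samples, train_frac=0.8, seed=0):
    """Split (patient_id, image) pairs so every image from a given
    patient lands in exactly one partition, preventing leakage of
    patient-specific staining or morphology cues across the split."""
    by_patient = defaultdict(list)
    for pid, img in samples:
        by_patient[pid].append(img)
    pids = sorted(by_patient)
    random.Random(seed).shuffle(pids)
    n_train = int(round(train_frac * len(pids)))
    train_ids = set(pids[:n_train])
    train = [(p, i) for p in pids if p in train_ids for i in by_patient[p]]
    test = [(p, i) for p in pids if p not in train_ids for i in by_patient[p]]
    return train, test
```

A naive image-level split would scatter each patient's cells across both sets, letting the model "recognize the patient" rather than the disease and inflating test accuracy.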

TL;DR: The model uses SAM for cell segmentation, then a ResNeXt-based PMG network with jigsaw patch training across a 5-stage hierarchical framework (benign/malignant, cell type, leukemia subtype). Training ran 48 hours on dual NVIDIA RTX A5000 GPUs. Data was split at the subject level to prevent data leakage.
Pages 4-6
99.53% Top-1 Accuracy with Multistage PMG, but Performance Varies by Subtype

The multistage PMG architecture achieved a top-1 accuracy of 99.53% (+/- 3.05%), with mean precision of 89.26%, mean recall of 89.89%, and mean F1 score of 90.28%. This represented a dramatic improvement over the baseline architectures: ViT-B/16 achieved only 76.25% top-1 accuracy (mean F1: 53.69%), plain ResNeXt reached 72.45% (mean F1: 55.15%), and single-stage ResNeXt + PMG managed 82.43% (mean F1: 65.91%). The multistage design, which leverages the hierarchical nature of leukemia classification, was the key differentiator.
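The gap between 99.53% top-1 accuracy and roughly 90% mean F1 is expected under class imbalance: accuracy is dominated by the majority classes, while the "mean" metrics macro-average over classes, weighting rare subtypes equally. A small sketch of how such macro metrics come out of a confusion matrix (illustrative, not the authors' evaluation code):

```python
import numpy as np

def macro_metrics(cm):
    """Macro-averaged precision/recall/F1 from a confusion matrix
    (rows = true class, columns = predicted class)."""
    cm = np.asarray(cm, float)
    tp = np.diag(cm)
    prec = tp / np.maximum(cm.sum(axis=0), 1e-12)  # per predicted class
    rec = tp / np.maximum(cm.sum(axis=1), 1e-12)   # per true class
    f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
    return prec.mean(), rec.mean(), f1.mean()
```

Because every class contributes equally to the macro average, a single poorly classified rare subtype (such as PCL or HCL here) drags the mean F1 well below the overall accuracy.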

Benign cell classification: The model excelled at classifying normal WBC types. Neutrophils achieved 99.69% precision and 99.37% recall. Lymphocytes reached 97.17% across all three metrics. Monocytes showed 94.15% precision and 89.58% recall (F1: 91.98%). Eosinophils achieved 90.70% precision and 95.12% recall. Even basophils, the rarest normal cell type with only 91 images, reached 79.17% precision with a perfect 100% recall.

Malignant cell classification: Performance was strong for common subtypes but dropped sharply for rare ones. AML (5,127 images) achieved 81.42% precision and 96.94% recall (F1: 88.50%). ALL (1,232 images) reached 93.90% precision and 81.63% recall (F1: 87.33%). CLL (1,250 images) showed 82.31% precision and 94.16% recall (F1: 87.84%). MDS (496 images) achieved 78.57% precision and 67.35% recall (F1: 72.53%). However, CML struggled with only 25.93% recall (F1: 39.13%), CMML reached 48.70% F1, PCL managed only 20.00% F1, and HCL (just 15 images from 1 patient) hovered around 50% across metrics.

Stage-by-stage breakdown: Stage 1 (benign vs. malignant) achieved 96.07% top-1 accuracy with balanced precision and recall around 96%. Stage 2B (benign cell typing) reached 97.72% accuracy. Stage 2T (tumor classification) achieved 93.03% accuracy. Stage 3M (myeloid subtyping) was the weakest at 80.58% accuracy, with mean recall dropping to 47.55%. Stage 3L (lymphoid typing) performed well at 87.59% accuracy. The myeloid-versus-lymphoid distinction showed over 90% accuracy, while acute-versus-chronic classification was more difficult, with F1 just over 50%.

TL;DR: Multistage PMG achieved 99.53% top-1 accuracy overall. Benign/malignant classification reached 96% F1. AML, ALL, and CLL each exceeded 87% F1. Rare subtypes (CML: 39.13% F1, PCL: 20% F1, HCL: ~50% F1) suffered from limited training data. Myeloid subtyping (Stage 3M) was the hardest stage at 80.58% accuracy.
Pages 6-7
Distinguishing APL from Non-APL and Ph+ALL from Ph-ALL by Morphology Alone

Beyond the broad classification task, the authors trained separate binary classifiers for two clinically critical distinctions. The first separated acute promyelocytic leukemia (APL) from other AML subtypes (non-APL). APL is a medical emergency requiring immediate treatment with all-trans retinoic acid (ATRA) and arsenic trioxide, and misdiagnosis can be fatal. The APL classifier, trained on 216 APL images from 7 patients and 4,911 non-APL images from 43 patients, achieved 89.34% precision, 97.37% recall, and 93.18% F1 for APL detection. For non-APL cases, precision was 92.86%, recall was 74.63%, and F1 was 82.75%.
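The reported F1 scores are simply the harmonic mean of the quoted precision and recall, which can be checked directly:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# APL detection: precision 89.34%, recall 97.37%
print(round(100 * f1(0.8934, 0.9737), 2))  # -> 93.18
# non-APL: precision 92.86%, recall 74.63%
print(round(100 * f1(0.9286, 0.7463), 2))  # -> 82.75
```

Both values reproduce the figures reported in the paper, confirming the metrics are internally consistent.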

Ph chromosome detection in ALL: The second binary classifier distinguished Ph+ALL (ALL carrying the Philadelphia chromosome) from Ph-ALL. The Ph chromosome produces the BCR-ABL fusion gene, and Ph+ALL is a high-risk subtype with elevated relapse rates and poor prognosis that requires targeted therapy with tyrosine kinase inhibitors. Trained on 247 Ph+ALL images from 7 patients and 985 Ph-ALL images from 9 patients, the classifier achieved 92.86% precision, 83.06% recall, and 87.68% F1 for Ph+ALL. For Ph-ALL, precision was 87.25%, recall was 94.78%, and F1 was 90.86%.

Why this matters genetically: These results validate the hypothesis that specific genetic aberrations leave morphological imprints on individual cells that deep learning can detect. Different leukemia types are characterized by chromosomal abnormalities and alterations in transcription factors, epigenetic regulators, and signaling molecules. Despite the complexity of these regulatory networks, they maintain recognizable cell morphologies. The BCR-ABL fusion gene in Ph+ALL suppresses the apoptotic pathway, leading to impaired differentiation, enhanced proliferation, and elevated nuclear-to-cytoplasmic ratio. The model also identified a case of AML where all 13 routinely tested fusion genes were negative, demonstrating that morphology-based classification can work even when standard molecular markers are absent.

TL;DR: APL vs. non-APL classification achieved 93.18% F1 for APL (89.34% precision, 97.37% recall). Ph+ALL vs. Ph-ALL classification achieved 87.68% F1 for Ph+ALL (92.86% precision, 83.06% recall). These results demonstrate that genetic aberrations produce detectable morphological signatures in peripheral blood cells.
Pages 7-8
Grad-CAM Reveals What the Model Learns to See in Leukemia Cells

To understand the model's classification decisions, the authors applied Gradient-weighted Class Activation Mapping (Grad-CAM) to the last three layers of the PMG network. Grad-CAM generates heatmaps that highlight which regions of the input image contributed most to the classification decision, with red areas indicating higher importance. This analysis was performed for both the APL vs. non-APL and Ph+ALL vs. Ph-ALL classifiers.
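Grad-CAM itself reduces to a few lines: the gradients of the class score with respect to a convolutional layer's activations are pooled into per-channel weights, which then combine the activation maps into a ReLU-ed heatmap. A minimal numpy sketch (the framework hooks that capture activations and gradients are omitted):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one conv layer.
    activations, gradients: (channels, H, W) arrays for the target class."""
    weights = gradients.mean(axis=(1, 2))             # alpha_k: pooled grads
    cam = np.tensordot(weights, activations, axes=1)  # sum_k alpha_k * A_k
    return np.maximum(cam, 0)                         # ReLU keeps positive evidence
```

The ReLU step keeps only regions that push the score toward the target class, which is why the resulting heatmaps highlight class-supporting structures like cytoplasmic granules or nuclear chromatin.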

APL morphological features: The Grad-CAM heatmaps showed that the model focused on the cytoplasm for APL classification. APL cells are densely packed or agglomerated with large granules, causing blurring of the boundaries between nucleus and cytoplasm. In some cells, the cytoplasm is filled with fine dust-like granules. In contrast, the model focused on the nucleus for non-APL cells. These attention patterns align with known morphological hallmarks of APL that pathologists use in practice.

Ph+ALL morphological features: For the Ph chromosome classifier, Grad-CAM revealed that nuclear chromatin in Ph+ALL lymphoblasts is coarser, more condensed, and less homogeneous, reflecting the morphological features of aged nuclei. In contrast, Ph-ALL lymphocytes showed more refined and homogeneous nuclear chromatin distribution. These morphological differences had not been formally described prior to this study, demonstrating the model's ability to discover new morphological biomarkers.

Segmentation improves focus: Without cell segmentation, the model's attention was scattered across regions outside the cell, including background artifacts. The segmentation-first approach using SAM brought the model's focus back to the cell itself, which likely explains why segmentation improved classification performance. In comparison, the baseline ResNeXt model without PMG showed inconsistent foci, such as scattered dots in the cytoplasm or nucleus, which do not align with human intuitive interpretation.

TL;DR: Grad-CAM visualization showed the model focuses on cytoplasmic granules for APL and nuclear chromatin patterns for Ph+ALL, consistent with known pathology. The model also identified previously undescribed morphological differences between Ph+ALL and Ph-ALL. SAM-based segmentation was essential for directing the model's attention to relevant cellular features.
Pages 8-9
Generalization Tested on External Datasets, with Mixed Results

The authors evaluated the model on several external datasets to assess generalizability. A newly collected self-built external set of 795 cell images from 5 patients (3 AML, 1 CML, 1 CLL) yielded precision rates of 92.44% for AML, 83.75% for CML, and 92.85% for CLL. These results were encouraging, though the small number of patients limits strong conclusions about generalization.

Public dataset performance: The model was also tested on two annotated public datasets from The Cancer Imaging Archive (TCIA). On the AML-Cytomorphology_LMU dataset by Matek et al., which contains 3,268 myeloblast (MYO) images, the model achieved 96.96% precision, 98.17% recall, and 97.56% F1. On the C-NMC 2019 dataset by Gupta et al., using 2,397 ALL single-cell images from the fold 0 ALL group, precision reached 95.45% with 98.52% recall and 90.17% F1. These strong results on public benchmarks support the model's robustness for common leukemia types.

Domain shift challenges: However, performance on the ALL-IDB2 classic benchmark was significantly lower, with ALL identification precision at only 46.51%. The authors attribute this to domain shift: ALL-IDB2 images were captured with a laboratory optical microscope and Canon PowerShot G5 camera at 257 x 257 pixels in TIF format, using different staining techniques and imaging methods from the training data. These differences in staining, imaging equipment, and file format introduce distributional shifts that adversely affect classification. This result highlights that the model's generalizability is currently bounded by the imaging pipeline used during training.

TL;DR: External validation showed strong results on compatible datasets: 96.96% precision on AML-Cytomorphology_LMU (3,268 images) and 95.45% on CNMC 2019 (2,397 images). Performance dropped to 46.51% on ALL-IDB2 due to domain shift from different staining and imaging equipment. Generalizability remains tied to imaging standardization.
Pages 9-11
Class Imbalance, Domain Shift, and the Road to Prospective Validation

Class imbalance: The most significant limitation is the severe data imbalance reflecting natural incidence rates. HCL had only 15 images from 1 patient, PCL had 64 images from 3 patients, and CMML had 62 images from 3 patients. Despite attempts at data augmentation through image rotation and flipping, performance improvements for rare subtypes were minimal. The authors chose not to adjust class weights or loss functions, reasoning that this would distort true incidence rates and hinder recognition of the majority class during testing. This is a principled choice but means the system cannot reliably identify rare leukemia subtypes in its current form.
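For reference, rotation-and-flip augmentation of the kind described yields at most eight distinct variants per image. The exact parameters are not given in the summary, so this eight-variant scheme (four rotations times an optional flip) is an assumption:

```python
import numpy as np

def augment(image):
    """Rotation/flip augmentation: 4 right-angle rotations, each with
    and without a horizontal flip, giving 8 variants per image."""
    variants = []
    for k in range(4):
        rot = np.rot90(image, k)
        variants.append(rot)
        variants.append(np.fliplr(rot))
    return variants
```

An eight-fold multiplier cannot rescue a class with 15 images from one patient: the variants share the same staining, illumination, and patient-specific morphology, which is consistent with the minimal gains the authors report for rare subtypes.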

CML classification difficulty: CML posed a unique challenge beyond sample size. Unlike other leukemias where blast morphology is the diagnostic feature, CML diagnosis relies primarily on an abnormally high WBC count and the presence of numerous basophils and eosinophils, not on individual blast cell appearance. The model's poor CML recall (25.93%) reflects this fundamental mismatch between the single-cell classification approach and the diagnostic criteria for CML, which depend on population-level features rather than individual cell morphology.

Domain shift and imaging variability: The 46.51% precision on the ALL-IDB2 dataset underscores that heterogeneity in image pixels, format, staining, and imaging systems can dramatically affect diagnostic accuracy. The model was trained on Sysmex DI-60 images with May-Grunwald Giemsa staining, and performance degrades when these conditions change. Considerable variation in imaging and annotation strategies across institutions remains a barrier to deployment.

Retrospective design: The current model is based entirely on retrospective data. The authors acknowledge that prospective validation is needed to confirm diagnostic utility. Additionally, a correct diagnosis of hematological malignancies still requires clinical information, BM examination, flow cytometric data, and genetic tests. The system currently serves as an aided screening tool, not a standalone diagnostic. In cases where leukemia cells have not yet broken through the BM to enter peripheral blood, the information available from PB cells is inherently limited.

Future work: The authors plan to integrate morphological and genetic information from BM to classify disease subtypes and risk levels for stratified and targeted therapy. Expanding the dataset to include more patients with rare subtypes, standardizing imaging protocols across institutions, and conducting prospective clinical trials are all necessary steps before this approach can be clinically deployed.

TL;DR: Key limitations include severe class imbalance (HCL: 15 images, PCL: 64 images), CML's reliance on population-level rather than single-cell features (recall: 25.93%), domain shift across imaging platforms (ALL-IDB2 precision: 46.51%), and the retrospective study design. Future work will integrate BM data and pursue prospective validation.
Citation: Yan G, Mingyang G, Wei S, et al. Diagnosis and typing of leukemia using a single peripheral blood cell through deep learning. Open access, 2025. Available at: PMC11786304. DOI: 10.1111/cas.16374. License: CC BY-NC-ND.