Customized Deep Learning Classifier for Detection of Acute Lymphoblastic Leukemia Using Blood Smear Images


Plain-English Explanations
Pages 1-2
Why Early Detection of Acute Lymphoblastic Leukemia Matters

Acute lymphoblastic leukemia (ALL) is a blood cancer driven by the uncontrolled overproduction of immature lymphocytes (white blood cells) in the bone marrow. According to World Health Organization data cited in the paper, there were 437,033 new leukemia cases and 303,006 leukemia deaths globally as of 2022. ALL is one of the most common childhood cancers, with a reasonable chance of cure when caught early. In adults, however, the prognosis worsens sharply when diagnosis comes late, as cancerous blast cells can spread through the bloodstream to the kidneys, liver, brain, and heart.

Current diagnostic bottleneck: The standard diagnostic approach relies on bone marrow examination, which is labor-intensive, time-consuming, and prone to subjective error. The peripheral blood smear (PBS) test offers a less invasive screening alternative, where a blood sample is examined under a microscope. However, manual microscopic analysis still depends heavily on trained pathologists and is subject to human variability.

The case for automation: This paper proposes a customized convolutional neural network (CNN) called ALLNet to automate the classification of white blood cell images as either leukemic blast cells or healthy cells. The authors argue that deep learning can eliminate the need for manual hand-crafted feature engineering while reducing both time and error in the screening process. The goal is to build a computer-aided diagnosis (CAD) system suitable for pre-screening during complete blood count (CBC) and peripheral blood tests.

TL;DR: Leukemia caused 303,006 deaths globally per the WHO data cited in the paper. Current diagnosis via bone marrow examination is slow and error-prone. This paper proposes ALLNet, a custom CNN that classifies blood smear images as leukemic or healthy to enable faster, automated pre-screening.
Pages 2-3
Landscape of Existing Deep Learning Approaches for ALL Detection

The authors survey a broad range of prior work applying machine learning and deep learning to leukemia detection. Jiang et al. used a ViT-CNN ensemble model and achieved 89% accuracy, employing difference enhancement-random sampling (DERS) to handle data imbalance. Ghaderzadeh et al. tested ten well-known CNN architectures on a Kaggle competition dataset, reaching accuracy of 99.85%, specificity of 99.89%, and sensitivity of 99.82% for diagnosing B-ALL subtypes from peripheral smear images.

Other notable results: Qiao et al. used a compact CNN for preliminary ALL screening on two datasets (APL-Cytomorphology-JHH and APL-Cytomorphology-LMU), achieving precision values of 96.53% and 99.20%, respectively. A hybrid model using mutual information reached 98% accuracy on the ALL-IDB2 database. Additional studies reported accuracies ranging from 75% (DNA sequence-based approaches with traditional ML models) to 99% (CNN combined with SVM on CellaVision databases).

Key comparison methods: The literature includes approaches using AlexNet (96% accuracy on ALL-IDB2), VCGNet (96% accuracy, 93% precision on the GRTD dataset), and various CNN architectures on the LISC, Dhruv, JTSC, and BCCD datasets, with accuracies typically in the 97% range. The diversity of datasets and architectures makes direct comparison difficult, which is why the authors focus their evaluation specifically on the C_NMC_2019 challenge dataset to enable fair benchmarking.

TL;DR: Prior work spans a wide range of architectures and datasets, with reported accuracies from 75% to 99.85%. The authors focus on the C_NMC_2019 dataset to enable standardized comparison, as dataset variability across studies makes cross-study benchmarking unreliable.
Pages 4-6
C_NMC_2019 Dataset and Image Segmentation Pipeline

The study uses the C_NMC_2019 challenge dataset from ISBI 2019, which contains 10,661 microscopic images collected from 73 participants. The dataset includes 7,272 images of blast cells (ALL-positive) and 3,389 images of healthy cells. All images are uniform at 450 x 450 x 3 pixels and have been pre-processed so that only the white blood cell (WBC) region of interest is included, with everything else padded in black. Importantly, the blast/healthy classification labels were assigned by expert oncologists, lending credibility to the ground truth.

Color segmentation: The authors convert the original blood smear images into HSI (Hue, Saturation, Intensity) color space, where the white blood cells exhibit better contrast than surrounding components. The saturation component is extracted because it best captures the color intensity of WBCs. This saturation image is then converted to a binary mask using gray-scale thresholding: pixels in the 180-255 range become white, and all values below this threshold become black. The final segmented image is obtained by multiplying the original image with this binary mask.
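The segmentation steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' code: the function name and the particular HSI saturation formula (S = 1 - 3·min(R,G,B)/(R+G+B)) are assumptions; only the threshold range (180-255) and the mask-multiplication step come from the paper.

```python
import numpy as np

def segment_wbc(rgb, thresh=180):
    """Sketch of the paper's pipeline: extract the HSI saturation
    channel, binarize it (180-255 -> white), and multiply the
    original image by the resulting mask."""
    img = rgb.astype(np.float64)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    total = r + g + b
    sat = np.zeros_like(total)
    nz = total > 0                                  # avoid divide-by-zero
    sat[nz] = 1.0 - 3.0 * np.minimum(np.minimum(r, g), b)[nz] / total[nz]
    sat_255 = (sat * 255).astype(np.uint8)          # scale to 0-255
    mask = (sat_255 >= thresh).astype(np.uint8)     # binary mask
    return rgb * mask[..., None]                    # masked original image

# Toy 2x2 image: one strongly stained (saturated) pixel, one gray pixel.
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = [200, 20, 180]    # WBC-like stain, saturation ~0.85
img[1, 1] = [120, 120, 120]   # gray background, saturation 0
out = segment_wbc(img)        # keeps the stained pixel, zeroes the rest
```

Because a gray pixel has near-zero saturation, the background falls below the 180 threshold and is masked out, while the strongly stained WBC region survives.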

Data imbalance correction: The original dataset is imbalanced, with roughly twice as many blast cell images as healthy cell images (7,272 vs. 3,389). To address this bias, the authors applied four augmentation techniques: (1) vertical and horizontal flipping, (2) clockwise and anti-clockwise rotation, (3) random brightness adjustments, and (4) random Gaussian blur with salt-and-pepper noise. After augmentation, the final balanced dataset consisted of 12,000 images, with 6,000 images per class.
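The four augmentation techniques can be sketched with plain numpy. The paper does not give exact parameters, so the brightness delta and noise amount below are illustrative assumptions; only the four operation types come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def flips(img):
    return img[::-1, :], img[:, ::-1]              # vertical, horizontal

def rotations(img):
    return np.rot90(img, 1), np.rot90(img, -1)     # anti-clockwise, clockwise

def random_brightness(img, delta=30):
    shift = int(rng.integers(-delta, delta + 1))
    return np.clip(img.astype(np.int16) + shift, 0, 255).astype(np.uint8)

def salt_and_pepper(img, amount=0.01):
    noisy = img.copy()
    u = rng.random(img.shape[:2])
    noisy[u < amount / 2] = 0                      # pepper
    noisy[u > 1 - amount / 2] = 255                # salt
    return noisy

img = rng.integers(0, 256, size=(450, 450, 3), dtype=np.uint8)
v_flip, h_flip = flips(img)
ccw, cw = rotations(img)
bright = random_brightness(img)
noisy = salt_and_pepper(img)
```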

TL;DR: The C_NMC_2019 dataset has 10,661 images from 73 participants (7,272 blast, 3,389 healthy). HSI color-space segmentation isolates WBCs. After augmentation (flipping, rotation, brightness, noise), the balanced dataset grew to 12,000 images with 6,000 per class.
Pages 7-9
ALLNet: A Custom CNN with 95 Million Parameters

The ALLNet architecture is a custom-built convolutional neural network designed specifically for this binary classification task (blast cell vs. healthy cell). The model consists of 4 convolutional layers alternated with 4 max-pooling layers, followed by 3 fully connected (dense) layers. The total parameter count is 95,099,266, of which 95,097,474 are trainable and 1,792 are non-trainable (from batch normalization).

Layer progression: The input layer accepts 450 x 450 x 3 images. The first Conv2D layer produces a 450 x 450 x 64 feature map (1,792 parameters), followed by max pooling that reduces it to 150 x 150 x 64. The second Conv2D layer expands to 256 filters (147,712 parameters), with another pooling step reducing the spatial dimensions to 50 x 50 x 256. The third Conv2D layer outputs 384 filters (885,120 parameters), and after pooling the dimensions are 17 x 17 x 384. The fourth Conv2D layer reaches 512 filters (1,769,984 parameters), followed by a final pooling to 6 x 6 x 512. This output is flattened into 18,432 values and fed to a 4,096-neuron dense layer (75,501,568 parameters), then a second 4,096-neuron layer (16,781,312 parameters), before the 2-neuron output layer (8,194 parameters).
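The per-layer parameter counts above can be checked by hand. The sketch below reproduces them exactly under the assumption of 3 x 3 convolution kernels (the kernel size is inferred from the counts, not stated explicitly in this summary):

```python
def conv2d_params(k, c_in, filters):
    # Each filter has k*k*c_in weights plus one bias.
    return (k * k * c_in + 1) * filters

def dense_params(n_in, n_out):
    # Fully connected layer: weight matrix plus one bias per output.
    return n_in * n_out + n_out

conv1  = conv2d_params(3, 3, 64)            # 1,792
conv2  = conv2d_params(3, 64, 256)          # 147,712
conv3  = conv2d_params(3, 256, 384)         # 885,120
conv4  = conv2d_params(3, 384, 512)         # 1,769,984
dense1 = dense_params(6 * 6 * 512, 4096)    # 75,501,568 (flattened input)
dense2 = dense_params(4096, 4096)           # 16,781,312
output = dense_params(4096, 2)              # 8,194
```

These seven layers account for roughly 95.1 million parameters; the small remainder in the reported total comes from the batch normalization layers.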

Regularization strategies: Batch normalization is applied after every alternate max-pooling layer to stabilize activations and prevent vanishing or exploding gradients. Dropout layers are inserted to prevent overfitting by randomly deactivating neurons during training. The loss function is categorical cross-entropy, and the Adam optimizer (a combination of RMSProp and AdaGrad) is used for weight updates.

TL;DR: ALLNet has 95.1 million parameters across 4 Conv2D layers (64, 256, 384, 512 filters), 4 max-pooling layers, and 3 dense layers (4,096 and 4,096 neurons). Batch normalization and dropout regularize the model. Adam optimizer and categorical cross-entropy loss drive training.
Pages 10-11
Training Protocol: 5-Fold Cross-Validation with Careful Epoch Selection

The dataset was split 80/20 into training and test sets. The 80% training portion was further divided using 5-fold cross-validation. The model was trained on 12,000 images (after augmentation) and tested on 2,132 images. Each fold was trained for 65 epochs. The training was conducted on Google Colaboratory using an Nvidia Tesla P-100 GPU.

Hyperparameter tuning: The learning rate was initially set at 1 x 10^-3, but the authors found that reducing it to 1 x 10^-5 gave a marginal improvement during training. The batch size was set at 16. The authors originally planned for 50 to 100 epochs but observed that training beyond 70 epochs led to overfitting; after fine-tuning, they settled on 65 epochs as the optimal value. The categorical cross-entropy loss fell sharply between epochs 10 and 30, then continued to decrease more slowly after epoch 35.

Overfitting prevention: The 5-fold cross-validation approach was specifically chosen to guard against overfitting that could occur from repeated model exposure to one class of images. The best-performing fold was selected for final evaluation on the holdout 20% test set. The augmented data was mixed with the original data to provide training diversity and help the model generalize across different image variations.
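The evaluation protocol described above can be sketched as an index-splitting routine. The dataset size, split ratio, and fold count come from the paper; the shuffling seed is an illustrative assumption, and augmentation (which grows the training portion to 12,000 images) is omitted for brevity.

```python
import numpy as np

n_total = 10_661                              # original C_NMC_2019 images
rng = np.random.default_rng(42)
idx = rng.permutation(n_total)

n_test = round(0.2 * n_total)                 # 2,132 held-out test images
test_idx, train_idx = idx[:n_test], idx[n_test:]

# 5-fold cross-validation on the 80% training portion.
folds = np.array_split(rng.permutation(train_idx), 5)
for k in range(5):
    val = folds[k]
    trn = np.concatenate([folds[j] for j in range(5) if j != k])
    # Each fold: train for 65 epochs on `trn`, validate on `val`;
    # the best-performing fold is then evaluated on `test_idx`.
    assert len(trn) + len(val) == len(train_idx)
```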

TL;DR: Training used 12,000 images with 5-fold cross-validation over 65 epochs, a learning rate of 1 x 10^-5, and batch size of 16. The model was tested on 2,132 images. Training beyond 70 epochs caused overfitting, so 65 was selected as optimal.
Pages 11-13
Performance Across Five Training Instances: Up to 95.45% Accuracy

The model was evaluated across five separate training instances, each using a different fold configuration. Out of 2,132 test images (1,454 blast cells and 678 healthy cells), the model correctly classified 1,996 images (93.6%) and misclassified 136 (6.4%). A confusion matrix was generated for each of the five runs to visualize true positives, true negatives, false positives, and false negatives.

Instance-by-instance breakdown: Instance 1 achieved accuracy of 94.94%, specificity of 94.87%, recall of 94%, F1-score of 94.96%, and precision of 95.95%, with a Matthews Correlation Coefficient (MCC) of 89.42. Instance 2 showed accuracy of 94.5%, specificity of 95.8%, recall of 93.2%, and precision of 96%. Instances 3 and 4 maintained similar metrics, with Instance 4 reaching 95% accuracy and 95.91% specificity. Instance 5 delivered the best overall performance: accuracy of 95.45%, recall of 95.91%, F1-score of 95.43%, precision of 94.94%, and MCC of 89.5.

Consistency of results: Across all five instances, the F1-score remained above 93%, and the MCC ranged from 88 to 89.53. These consistently high F1-scores indicate that the balance between precision and recall was maintained throughout. This is particularly important for a screening tool like ALLNet, where both false positives (unnecessary follow-up) and false negatives (missed cancers) carry significant clinical consequences.
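The metrics reported above follow the standard confusion-matrix definitions. The sketch below applies them to an illustrative confusion matrix; the per-instance TP/TN/FP/FN counts are not reproduced in this summary, so the four counts are made up (chosen only to sum to the 2,132 test images).

```python
import math

# Illustrative confusion-matrix counts (NOT the paper's actual values).
tp, tn, fp, fn = 1390, 640, 38, 64

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)                 # a.k.a. sensitivity
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
```

Note that MCC mathematically lies in [-1, 1]; the paper reports it scaled by 100 (e.g., 89.5), as quoted above.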

TL;DR: Across 5 training instances, ALLNet achieved peak accuracy of 95.45%, peak recall of 95.91%, peak precision of 96%, and peak F1-score of 95.43%. MCC ranged from 88 to 89.53. Of 2,132 test images, 93.6% were classified correctly.
Pages 13-14
How ALLNet Compares to Other Methods on the Same Dataset

The authors provide a head-to-head comparison of ALLNet against other published methods that used the same C_NMC_2019 dataset. Abunadi et al. applied a bagging ensemble with deep learning and achieved an F1-score of 88%. Yongsheng Pan et al. used a neighborhood correction algorithm (fine-tuning a pre-trained residual network, constructing Fisher vectors from feature maps, and applying weighted majority correction), reaching an F1-score of 92%. Christian et al. applied an attention-based CNN with a regional proposal subnetwork and obtained an F1-score of 83%.

ALLNet's advantage: The proposed ALLNet model achieved an F1-score of 96% on the same C_NMC_2019 dataset, outperforming all three comparison methods. This represents a 4-percentage-point improvement over the next-best neighborhood correction algorithm (92%) and an 8-percentage-point improvement over the bagging ensemble approach (88%). The gains are notable because ALLNet uses a relatively straightforward custom CNN without the complexity of ensemble methods or attention mechanisms.

Additional context from other datasets: Khandekar et al. used the YOLOv4 algorithm on a different dataset and reached a maximum recall of 96%. While the architectures and datasets differ, the comparison underscores that ALLNet's performance is competitive with state-of-the-art object detection frameworks. The simplicity of the CNN approach, compared to more complex ensemble or attention architectures, is itself a practical benefit for potential clinical deployment.

TL;DR: On the C_NMC_2019 dataset, ALLNet's F1-score of 96% outperformed bagging ensemble (88%), neighborhood correction (92%), and attention-based CNN (83%). The simpler architecture achieved better results than more complex methods.
Pages 14-15
Preprocessing Dependencies and Paths to Stronger Generalization

Preprocessing reliance: The authors acknowledge that the need for image preprocessing (HSI conversion, thresholding, segmentation) before feeding images to the classifier is a drawback. In a real clinical setting, blood smear images come with variable staining quality, lighting conditions, and microscope calibrations. The current pipeline assumes relatively clean, standardized input images, which may not reflect the variability encountered in routine laboratory practice.

Dataset limitations: While the C_NMC_2019 dataset is well-curated and expert-labeled, it represents a single data source. The study does not include external validation on independent hospital datasets, which limits the ability to assess how well ALLNet would generalize to different patient populations, staining protocols, or imaging equipment. The model was trained on augmented data from 73 participants, a relatively small cohort for drawing broad clinical conclusions.

Future directions: The authors propose expanding the training dataset with noisier, less pre-processed images to better simulate real medical imaging conditions. They also suggest combining the classifier with explainability models (such as Grad-CAM or SHAP) to provide interpretable outputs for clinicians. Additional architectures including YOLOv4, ResNet, and AlexNet are identified as candidates for further exploration, as they may achieve even stronger performance on these image classification tasks. The integration of such tools into telehealth frameworks is also mentioned as a pathway toward remote diagnostic capability.

Clinical deployment gap: The paper does not address prospective clinical validation, regulatory considerations, or integration with existing laboratory information systems. Moving from a proof-of-concept classifier to a deployable clinical tool would require multi-center validation studies, robustness testing against edge cases (e.g., overlapping cells, artifacts, rare WBC morphologies), and compliance with medical device standards.

TL;DR: Key limitations include reliance on image preprocessing, a single-source dataset of 73 participants, and no external validation. Future work should test on noisier real-world images, add explainability (Grad-CAM, SHAP), explore architectures like YOLOv4 and ResNet, and pursue multi-center clinical validation.
Citation: Sampathila N, Chadaga K, Goswami N, et al. Customized Deep Learning Classifier for Detection of Acute Lymphoblastic Leukemia Using Blood Smear Images. Healthcare. 2022. PMC9601337. DOI: 10.3390/healthcare10101812. License: CC BY (open access).