Acute lymphoblastic leukemia (ALL) is one of the four main subtypes of leukemia, a cancer that originates in the bone marrow and affects the production of blood cells. ALL specifically targets lymphocytes and the cells that produce them. It is the most common cancer in children and the second most common acute leukemia in adults. While treatment is successful in roughly 90% of pediatric cases, ALL remains one of the leading causes of death in both children and adults. Early diagnosis is critical because it directly influences survival rates and opens up a wider range of treatment options, including chemotherapy, radiotherapy, anticancer drugs, or combinations of these approaches.
The diagnostic bottleneck: Current leukemia diagnosis methods are entirely manual and depend on trained, experienced medical professionals. A pathologist must examine blood cell images under a microscope to identify abnormal patterns. This process is prone to error due to tissue and cell diversity, specialist shortages, long working hours, eye fatigue, and lapses in concentration. Computer-assisted image analysis offers a way to reduce these human limitations and accelerate the diagnostic pipeline.
The promise of deep learning: Recent advances in artificial intelligence, particularly deep learning and machine learning algorithms, have shown remarkable success in medical image analysis. Deep learning is especially attractive because it eliminates the manual feature engineering step required by traditional machine learning. Instead of a human designing and applying feature extractors to an image, deep learning networks learn features directly from raw input data, automatically extracting increasingly abstract representations through multiple layers.
This study set out to develop and compare three deep learning architectures and six machine learning classifiers for the binary classification of microscopic blood cell images into healthy versus leukemic categories. The goal was not only to maximize accuracy but also to evaluate trade-offs between model complexity, execution time, and classification performance.
The dataset was sourced from a CodaLab competition designed for classifying leukemic cells from normal cells in microscopic images. It contains images of leukemic B-lymphoblast cells (malignant) and normal B-lymphoid cells from 118 patients. After preprocessing, the dataset was reduced to images from 73 patients: 47 with cancer and 26 healthy. The final dataset comprised 12,528 images total, split into 4,037 healthy cell images and 8,491 leukemia cell images.
Image preprocessing: The original images had a resolution of approximately 300 x 300 pixels and had already been segmented and normalized from raw microscopy captures. The authors resized all images to 70 x 70 pixels for their experiments. The data was divided using an 80/20 split, with 10,022 images for training and 2,506 for testing (814 healthy and 1,692 leukemia images in the test set). Importantly, the split was performed at the image level rather than the patient level, meaning images from the same patient could appear in both training and testing sets.
Handling class imbalance: Because the dataset was imbalanced (8,491 leukemia images vs. 4,037 healthy images, roughly a 2:1 ratio), the authors applied a class weighting technique during training. This method assigns higher weights to the minority class (healthy cells) and lower weights to the majority class (leukemia cells), penalizing misclassification of the underrepresented class more heavily. This is a standard approach to prevent models from simply predicting the majority class and achieving superficially high accuracy.
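The idea can be sketched in a few lines. The paper does not give its exact weighting formula; the inverse-frequency rule below is a common convention (the form typically encoded in a Keras-style class-weight dictionary), shown here with the dataset's actual class counts:

```python
# Inverse-frequency class weights for the imbalanced dataset
# (4,037 healthy vs. 8,491 leukemia images). The formula
# weight_c = total / (n_classes * count_c) is a common convention;
# the paper does not specify its exact weighting scheme.
counts = {"healthy": 4037, "leukemia": 8491}
total = sum(counts.values())

class_weight = {c: total / (len(counts) * n) for c, n in counts.items()}
# The minority class (healthy) gets the larger weight (~1.55 vs. ~0.74),
# so misclassifying healthy cells contributes more to the loss.

# Without weighting, always predicting "leukemia" on the test set
# (814 healthy, 1,692 leukemia) already scores 1692/2506 ~ 67.5%,
# illustrating why raw accuracy can be superficially high.
majority_baseline = 1692 / 2506
```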
ResNet-50: The Residual Neural Network (ResNet) was proposed by Microsoft and won first place in the ILSVRC 2015 competition. The version used here, ResNet-50, contains 50 layers organized into five stages. The core innovation of ResNet is the residual module, which allows information to bypass layers through "skip connections." This solves the vanishing gradient problem that makes very deep networks difficult to train. The architecture progresses through stages with the number of filters increasing from 64 to 2048, using convolutional blocks, identity blocks, batch normalization, and ReLU activations throughout. The final layers consist of average pooling, a flatten layer, and a dense layer with softmax for binary classification.
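The skip-connection idea can be illustrated in a few lines of numpy. This is a toy sketch, not the actual ResNet-50 code: the matrix `W` is a made-up weight standing in for a whole stack of convolutions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Toy residual unit: output = ReLU(F(x) + x), where F is a learned
# transformation (here a single linear map standing in for a block of
# convolutions) and "+ x" is the identity skip connection.
def residual_unit(x, W):
    return relu(x @ W + x)

x = rng.standard_normal(8)
W = np.zeros((8, 8))  # degenerate case: the block learns F(x) = 0 ...
y = residual_unit(x, W)
# ... and the unit reduces to ReLU(x): the skip path lets information
# (and gradients) pass through unchanged, which is what makes very
# deep networks trainable.
```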
VGG-16: Developed by the Visual Geometry Group at Oxford, VGG-16 contains 16 weight layers. Its architecture is more straightforward than ResNet's: two blocks of two convolution layers followed by max-pooling, then three blocks of three convolution layers followed by max-pooling, and finally fully connected dense layers. Filter counts progress from 64 to 512, with all kernels sized at 3 x 3 and all pool sizes at 2 x 2 with stride 2. The original VGG-16 accepts 224 x 224 x 3 input images, but the authors modified this to 70 x 70 x 3 to match their dataset. Two 4096-unit fully connected layers precede the final softmax classification layer.
A critical design choice: The authors explicitly did not use the pretrained weights that come with ResNet-50 and VGG-16. Instead, they used only the architectures and trained both networks from scratch with their own parameters. Both networks used the Adam optimizer with a learning rate of 0.0001, ReLU activation on all layers except the last (softmax), and were trained for 100 epochs. ResNet-50 used a batch size of 16 with an epoch duration of 28 seconds, while VGG-16 used a batch size of 32 with an epoch duration of 22 seconds.
All implementations were done in Keras with a TensorFlow backend, running on Google Colaboratory. This choice of environment is notable because Colab provides free GPU access, making it accessible to researchers without dedicated hardware.
In addition to the two established architectures, the authors designed a custom convolutional neural network with 10 convolutional layers. The architecture uses a decreasing filter strategy: two layers of 128 filters, two layers of 64 filters, two layers of 32 filters, two layers of 16 filters, and two layers of 8 filters. All kernels are 2 x 2, and 2 x 2 max-pooling with stride 2 is applied between each pair of convolutional layers.
Regularization strategy: To prevent overfitting, the proposed CNN incorporates batch normalization after every convolutional layer and dropout layers with a rate of 0.1 after every max-pooling layer. The output of the final pooling layer is flattened and fed into a 1,024-unit fully connected layer, followed by two additional dense layers for binary classification via softmax. Same padding was used for all convolution layers, and the Adam optimizer with a learning rate of 0.0001 was again selected.
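As a quick sanity check on the architecture as described, the feature-map size can be traced through the network. With same padding the convolutions preserve spatial size, and each 2 x 2, stride-2 max-pool roughly halves it; the sketch below assumes floor division for odd sizes, which is the usual pooling default (the paper does not state its rounding explicitly):

```python
# Spatial size of the feature maps through the proposed 10-layer CNN.
# "Same" padding keeps the size through each conv pair; each 2x2
# max-pool with stride 2 halves it (floor division assumed).
# Filter counts per conv pair: 128, 64, 32, 16, 8.
size = 70
sizes = [size]
for filters in (128, 64, 32, 16, 8):
    # two 2x2 convolutions with same padding: size unchanged
    size = size // 2  # 2x2 max-pool, stride 2
    sizes.append(size)
# The 70x70 input shrinks to a small map before the 1,024-unit
# dense layer: 70 -> 35 -> 17 -> 8 -> 4 -> 2.
```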
Efficiency advantage: This network was trained with a batch size of 32 over 100 epochs, with each epoch taking only 14 seconds. That is half the time of ResNet-50 (28 seconds per epoch) and significantly faster than VGG-16 (22 seconds per epoch). The authors argue that this simpler architecture can run on ordinary hardware without high-end GPU requirements, making it more practical for deployment in clinical laboratory settings where computational resources may be limited.
The total training time for the proposed CNN was approximately 23 minutes (100 epochs x 14 seconds), compared to roughly 37 minutes for VGG-16 and 47 minutes for ResNet-50. This efficiency comes from the lower complexity of the network, with far fewer parameters than either pretrained architecture.
Training vs. validation accuracy: VGG-16 achieved the highest training accuracy at 97.41% and the highest validation accuracy (mean of training and test accuracy over 100 cycles) at 84.62%, with a training loss of 0.0653 and test loss of 0.9236. ResNet-50 followed with a training accuracy of 95.76% and validation accuracy of 81.63% (train loss: 0.1064, test loss: 0.7592). The proposed CNN achieved 85.79% training accuracy and 82.10% validation accuracy (train loss: 0.3356, test loss: 0.4517). Notably, the proposed CNN showed the smallest gap between training and test loss, suggesting less overfitting than the pretrained architectures.
Precision, recall, and F-measure: For VGG-16, the precision was 0.81 for healthy cells and 0.89 for leukemia cells, with recall of 0.75 and 0.92, respectively, and F-measure of 0.78 (healthy) and 0.90 (leukemia). ResNet-50 showed similar patterns: precision of 0.80/0.88, recall of 0.74/0.91, and F-measure of 0.77/0.90 for healthy/leukemia classes. The proposed CNN achieved precision of 0.81/0.86, recall of 0.70/0.92, and F-measure of 0.75/0.89. All three networks showed stronger performance on leukemia detection than on healthy cell identification.
ROC curve analysis: The area under the receiver operating characteristic curve (AUROC) provided additional insight. ResNet-50 achieved an AUC of 0.91 for healthy and 0.90 for leukemia, with a micro-average of 0.92. VGG-16 achieved 0.90 for both classes with a micro-average of 0.92. The proposed CNN had an AUC of 0.88 for both classes with a micro-average of 0.91. All three networks demonstrated clinically meaningful discriminative ability.
Confusion matrix details: Taking healthy as the positive class on the test set of 2,506 images (814 healthy, 1,692 leukemia), VGG-16 correctly classified 616 healthy images and misclassified 201, while misclassifying only 141 of the 1,692 leukemia images. The proposed CNN correctly identified 571 healthy images (243 false negatives) and misclassified 137 leukemia images as healthy (1,555 true negatives). ResNet-50 produced 606 true positives and 208 false negatives for healthy cells, with 153 leukemia images misclassified as healthy (false positives).
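The per-class figures reported earlier can be reproduced directly from these counts. Using the proposed CNN's confusion matrix with healthy as the positive class:

```python
# Proposed CNN confusion counts on the 2,506-image test set,
# with "healthy" as the positive class:
TP = 571    # healthy correctly identified
FN = 243    # healthy misclassified as leukemia
FP = 137    # leukemia misclassified as healthy
TN = 1555   # leukemia correctly identified

def prf(tp, fp, fn):
    """Precision, recall, and F-measure from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p_h, r_h, f_h = prf(TP, FP, FN)   # healthy class
p_l, r_l, f_l = prf(TN, FN, FP)   # leukemia class (roles swapped)
print(round(p_h, 2), round(r_h, 2), round(f_h, 2))  # 0.81 0.7 0.75
print(round(p_l, 2), round(r_l, 2), round(f_l, 2))  # 0.86 0.92 0.89
```

These match the 0.81/0.86 precision, 0.70/0.92 recall, and 0.75/0.89 F-measure reported for the proposed CNN.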
Six traditional machine learning classifiers were also evaluated on the same dataset to provide a baseline comparison against the deep learning approaches. The classifiers tested were random forest (RF), logistic regression (LR), support vector machine (SVM), k-nearest neighbor (KNN), stochastic gradient descent (SGD), and multilayer perceptron (MLP). These methods rely on hand-crafted or extracted features rather than learning representations directly from raw pixel data.
Accuracy rankings: Random forest achieved the highest accuracy at 81.72%, followed closely by logistic regression at 79.88%, SVM at 79.28%, and KNN at 77.89%. SGD lagged significantly at 68.91%, and multilayer perceptron performed worst at just 27.33%. The poor MLP performance likely reflects the difficulty of training a shallow neural network on high-dimensional image data without the architectural advantages of convolutional layers.
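A comparison of this kind is typically run on flattened pixel features with scikit-learn. The sketch below uses synthetic data in place of the real images, and the hyperparameters are illustrative defaults, not the paper's settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in for flattened image features; the real inputs would be the
# preprocessed 70x70x3 blood-cell images reshaped to vectors.
X = rng.standard_normal((400, 50))
y = (X[:, 0] + 0.5 * rng.standard_normal(400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The six baselines evaluated in the study.
classifiers = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "SGD": SGDClassifier(random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}
accuracies = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
              for name, clf in classifiers.items()}
```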
Per-class performance: For healthy cell classification, random forest achieved the highest precision at 83% and the highest F1-score at 68% (tied with logistic regression). KNN had the highest healthy-class recall at 66%. For leukemia detection, logistic regression and KNN shared the highest precision at 83%, while random forest achieved the highest recall at 94% and the highest F1-score at 87%. Random forest's strength with high-dimensional data, its ability to work on subsets of features, and its robustness to outliers and nonlinear data contributed to its top performance among ML methods.
Importantly, even the best traditional ML classifier (RF at 81.72%) fell below VGG-16's validation accuracy of 84.62%, though it came close to ResNet-50 (81.63%) and the proposed CNN (82.10%). This suggests that for this particular dataset size, the advantage of deep learning over well-tuned traditional ML is present but moderate.
The authors compared their results against several prior studies. Shafique and Tehsin achieved 99.50% accuracy and 96.06% subtype classification accuracy using a pretrained AlexNet, but their AUC of 0.98 was obtained on a much smaller test set of 306 images in a MATLAB environment. Ahmed et al. reported 88.25% accuracy with a 6-layer CNN on the ALL-IDB and ASH Image Bank datasets (511 test images). Kasani et al. achieved 96.58% accuracy using ensemble deep learning methods, but their training time was approximately 130 minutes per epoch, far exceeding the 14-28 seconds per epoch in this study. Rehman et al. reached 97.78% with a fine-tuned AlexNet on 330 test images.
Why VGG-16 outperformed ResNet-50: The authors attribute VGG-16's superior performance to its straightforward architecture and strong feature extraction capability. VGG-16 uses a consistent pattern of stacked convolution blocks with increasing depth, which the authors argue provides better feature learning ability than ResNet-50 for this particular task. The simplicity of VGG-16's design, with uniform 3 x 3 convolutions throughout, allows it to capture sparse features effectively.
The case for the proposed CNN: While the proposed CNN achieved slightly lower accuracy than VGG-16 (82.10% vs. 84.62%), its advantages are significant in practical terms. Its execution time of 14 seconds per epoch is roughly two-thirds that of VGG-16 (22 seconds) and half that of ResNet-50 (28 seconds). It requires less computational memory and can run on ordinary hardware, making it suitable for clinical deployment in resource-constrained settings. The network also showed less overfitting, with a narrower gap between training accuracy (85.79%) and validation accuracy (82.10%) compared to VGG-16's gap (97.41% vs. 84.62%).
The study highlights a fundamental trade-off in clinical AI: higher accuracy often comes with greater computational cost and complexity. For rapid screening in settings with limited infrastructure, a faster, simpler model with slightly lower accuracy may be more valuable than a complex model that requires specialized hardware and longer processing times.
Data leakage concern: One significant methodological limitation is that the train-test split was performed at the image level rather than the patient level. Since multiple images can come from the same patient, images from a single patient may appear in both the training and test sets. This introduces data leakage, where the model may learn patient-specific features rather than generalizable disease characteristics. Patient-level splitting would provide a more realistic estimate of how the model would perform on completely unseen patients in clinical practice.
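A patient-level split can be sketched in a few lines of pure Python. Here `patient_of` is a hypothetical mapping from image index to patient ID, since the dataset's actual file layout is not described in the study:

```python
import random

# Hypothetical mapping: image index -> patient ID. In the real dataset
# this would come from the image filenames or metadata.
patient_of = {i: f"patient_{i % 73}" for i in range(730)}

def patient_level_split(patient_of, test_frac=0.2, seed=0):
    """Assign whole patients (all their images) to train or test."""
    patients = sorted(set(patient_of.values()))
    random.Random(seed).shuffle(patients)
    n_test = int(len(patients) * test_frac)
    test_patients = set(patients[:n_test])
    train = [i for i, p in patient_of.items() if p not in test_patients]
    test = [i for i, p in patient_of.items() if p in test_patients]
    return train, test

train_idx, test_idx = patient_level_split(patient_of)
# No patient contributes images to both sides, unlike an image-level split:
train_p = {patient_of[i] for i in train_idx}
test_p = {patient_of[i] for i in test_idx}
assert train_p.isdisjoint(test_p)
```

scikit-learn's `GroupShuffleSplit` implements the same idea with the patient ID as the group key.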
Limited dataset size: The authors acknowledge that they used a limited training and evaluation dataset, which can affect the training process of deep neural networks. The dataset of 12,528 images from 73 patients, while reasonable for a proof-of-concept study, is small compared to datasets used in state-of-the-art medical imaging research. The post-preprocessing reduction from 118 patients to 73 further constrains the diversity of cases the models can learn from. Additionally, the dataset came from a single source (CodaLab competition), raising questions about how well the models would generalize to images from different labs, staining protocols, or microscope types.
Single leukemia subtype: The study focused exclusively on acute lymphoblastic leukemia (ALL), one of four main leukemia subtypes. The authors note their intent to expand to a dataset covering all four types: ALL, acute myeloid leukemia (AML), chronic lymphoid leukemia (CLL), and chronic myeloid leukemia (CML). A multi-class classification system would be far more clinically useful, as pathologists need to distinguish not just healthy from leukemic cells but also differentiate between leukemia types for appropriate treatment planning.
Absence of pretrained weight comparison: The decision to train ResNet-50 and VGG-16 from scratch, without using their ImageNet-pretrained weights, is an unusual choice. Transfer learning typically improves performance on medical imaging tasks, especially with limited data. The study does not include a comparison showing how much performance might improve with pretrained weights, leaving open the question of whether the reported accuracy represents the ceiling for these architectures on this dataset.
Future work: The authors plan to build larger datasets and train deep learning models from scratch with more diverse image collections. They envision these computational systems being integrated into everyday clinical workflows, assisting specialists and oncologists in detecting leukemia more effectively. The ethical approval (IR.UMSHA.REC.1399.1056) from Hamadan University of Medical Sciences provides a foundation for future clinical validation studies.