Acute lymphoblastic leukemia (ALL) is a blood cancer affecting both children and adults, responsible for roughly 25% of all childhood cancers. Globally, leukemia affected 2.3 million individuals in 2015, causing 353,500 deaths. In the United States alone, approximately 59,610 new leukemia diagnoses and 23,710 deaths were reported in a recent year. ALL progresses rapidly and can damage the bone marrow, blood, liver, brain, and kidneys. Without timely detection and treatment, it can be fatal within months.
The diagnostic bottleneck: Traditional diagnosis requires hematologists to manually examine blood smears or bone marrow samples under a microscope, a process that is laborious, expensive, and heavily dependent on individual expertise. This manual workflow introduces variability and delays that can compromise patient outcomes, particularly in resource-limited settings where specialist pathologists are scarce.
The proposed solution: This study introduces an AI-based Internet of Medical Things (IoMT) framework that automatically classifies ALL from peripheral blood smear (PBS) images. The core innovation is a deep feature fusion model that combines VGG16 (processing original images) and DenseNet-121 (processing segmented images), merging their extracted features before classification. The framework is designed to operate through a cloud server, where a WiFi-enabled microscope uploads blood smear images for automated analysis, and results are delivered to both the medical center and the patient's personal devices.
The authors trained their fusion model on a dataset of 6,512 images (original and segmented) from 89 individuals. The proposed model achieved an accuracy of 99.89%, precision of 99.80%, and recall of 99.72%, outperforming several state-of-the-art CNN models. A beta web application was also developed to demonstrate the end-to-end workflow from image upload to leukemia prediction.
The authors review over a dozen existing approaches to leukemia classification. Sakthiraj et al. proposed a Hybrid CNN with Interactive Autodidactic School (HCNN-IAS) on an IoMT platform, achieving 99% precision and recall on the ASH image bank. Bibi et al. built an IoMT-based framework using DenseNet-121 and ResNet-34, reaching 99.91% and 99.56% accuracy, respectively, on publicly available datasets. Dese et al. created a real-time SVM-based diagnostic system with 97.69% test accuracy and processing times under one minute.
Feature fusion approaches: Several prior studies explored combining features from multiple models. Yadav et al. used SqueezeNet and ResNet-50 fusion to achieve 99.3% classification accuracy with 5-fold cross-validation. Ahmed et al. combined DenseNet121, ResNet50, and MobileNet features with PCA and a Random Forest classifier, reaching 98.8% accuracy and 99.1% AUC on the C-NMC 2019 and ALL-IDB2 datasets. These fusion-based methods consistently outperformed single-model approaches.
Identified gaps: The literature reveals several recurring limitations. Many studies used only original images or only segmented images, potentially losing valuable features from the other representation. Some methods suffered from small datasets causing overfitting. Others lacked the ability to diagnose leukemia subcategories. Additionally, few prior studies integrated their classification models into a complete IoMT framework capable of end-to-end automated diagnosis from image acquisition to result delivery.
The authors position their work as addressing these gaps by combining both original and segmented image features through a dual-channel architecture, using a sufficiently large dataset of 6,512 images, classifying ALL subtypes (Early Pre-B, Pre-B, and Pro-B) along with benign hematogone cells, and embedding the model within a practical IoMT cloud infrastructure.
Dataset: The ALL dataset was sourced from Kaggle, originally created at Taleqani Hospital in Tehran, Iran. It contains 3,256 peripheral blood smear (PBS) images collected from 89 patients suspected of having ALL. A matching set of 3,256 segmented images was also available, bringing the total to 6,512 images. The dataset includes four classes: benign hematogone cells and three malignant ALL subtypes, specifically ALL (Early Pre-B), ALL (Pre-B), and ALL (Pro-B).
Image preprocessing: All images were decoded and resized to 128 x 128 pixels to ensure dimensional uniformity. Pixel values were normalized from the 0-255 range to 0-1. Six data augmentation techniques were applied to the training set: brightness adjustment (plus or minus 5%), contrast modification (plus or minus 8%), rotation (plus or minus 15 degrees), JPEG noise injection (quality range 30-100), horizontal flipping, and vertical flipping. These augmentation strategies address common medical imaging challenges, including class imbalance and noisy data.
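As a rough sketch, the normalization and two of the augmentations might look like the following in NumPy; the paper does not specify its implementation, so the function names and the clipping behavior here are assumptions:

```python
import numpy as np

def normalize(img):
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return img.astype(np.float32) / 255.0

def adjust_brightness(img, delta):
    """Shift brightness by a fraction of full scale (e.g. delta=0.05 for +5%)."""
    return np.clip(img + delta, 0.0, 1.0)

def adjust_contrast(img, factor):
    """Scale contrast around the image mean (e.g. factor=1.08 for +8%)."""
    mean = img.mean()
    return np.clip((img - mean) * factor + mean, 0.0, 1.0)

def flip(img, horizontal=True):
    """Mirror the image along the chosen axis."""
    return img[:, ::-1] if horizontal else img[::-1, :]

# Example: run a random "image" through the pipeline
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
x = flip(adjust_contrast(adjust_brightness(normalize(raw), 0.05), 1.08))
```

In practice a library such as TensorFlow's `tf.image` or Albumentations would supply these operations; the key point is that every transform keeps pixel values inside the normalized [0, 1] range.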
Image segmentation: Segmentation isolates blast cells, the immature white blood cells that indicate leukemia, from the rest of the blood smear. The method uses HSV color space thresholding. The original RGB image is first converted to HSV color space, which is better suited for color-based segmentation. Two thresholds are set for the purple hue characteristic of blast cells, and a mask is applied to isolate them from surrounding cells. This entire segmentation pipeline was implemented on the cloud server to automatically generate segmented images before feeding them to the classification model.
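The HSV thresholding step can be sketched as follows. The paper does not publish its exact thresholds, so the purple hue band and the saturation/value floors below are illustrative assumptions, and the standard-library colorsys module stands in for an image-processing library:

```python
import colorsys
import numpy as np

def hsv_mask(rgb_img, hue_lo=0.66, hue_hi=0.83, sat_min=0.25, val_min=0.25):
    """Return a boolean mask of pixels whose hue falls in the purple band.

    hue_lo/hue_hi are fractions of the hue circle (0.66-0.83 roughly spans
    blue-violet to magenta). Saturation/value floors reject washed-out
    background pixels.
    """
    h, w, _ = rgb_img.shape
    mask = np.zeros((h, w), dtype=bool)
    for i in range(h):
        for j in range(w):
            r, g, b = rgb_img[i, j] / 255.0
            hue, sat, val = colorsys.rgb_to_hsv(r, g, b)
            mask[i, j] = hue_lo <= hue <= hue_hi and sat >= sat_min and val >= val_min
    return mask

# A purple (blast-like) pixel should be kept; a white background pixel dropped.
img = np.array([[[130, 60, 180], [240, 240, 240]]], dtype=np.uint8)
mask = hsv_mask(img)
segmented = img * mask[..., None]  # zero out non-blast pixels
```

A production pipeline would vectorize the conversion (e.g. with OpenCV's `cv2.cvtColor`), but the per-pixel loop makes the thresholding logic explicit.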
The dataset was split in a 7:2:1 ratio, allocating 2,279 images for training, 652 for testing, and 325 for validation. These counts sum to 3,256, indicating the split was applied to the paired original/segmented images rather than to all 6,512 files individually. The split was applied consistently across all experimental conditions to ensure fair comparison between models.
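A minimal sketch of such a ratio split is shown below. The authors' exact shuffling and rounding are not stated, so this will not reproduce their 652/325 test/validation counts exactly (simple rounding gives 651/326):

```python
import random

def split_indices(n, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle indices 0..n-1 and cut them into train/test/val partitions."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * ratios[0])
    n_test = round(n * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]

train, test, val = split_indices(3256)
# lengths: 2279 / 651 / 326 (the paper reports 2,279 / 652 / 325,
# so its rounding evidently differed slightly)
```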
Transfer learning backbone: The fusion model uses two pre-trained CNN architectures as feature extractors. VGG16, a 16-layer network known for capturing high-detail hierarchical features, processes the original blood smear images. DenseNet-121, a 121-layer network with dense connectivity where each layer receives input from all preceding layers, processes the segmented images. Both models were pre-trained on ImageNet, and their weights were frozen during training so that only the newly added classification layers were updated.
Feature extraction and fusion: From VGG16, the block5_conv3 layer is extracted; from DenseNet-121, the conv4_block9_0_bn layer is used. Both branches output 512 features in an 8 x 8 spatial format from the 128 x 128 input images. Global Average Pooling 2D is applied to each branch, collapsing each 8 x 8 feature map to a single value per channel; this discards spatial layout but retains the channel-wise activations while sharply reducing the parameter count. The two 512-dimensional feature vectors are then concatenated into a single 1,024-dimensional vector.
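The pooling-and-concatenation step can be illustrated with NumPy stand-ins for Keras's GlobalAveragePooling2D and Concatenate layers. The 8 x 8 x 512 branch shapes come from the paper; the random inputs are placeholders:

```python
import numpy as np

def global_average_pool(feature_map):
    """Collapse an (H, W, C) feature map to a length-C vector by
    averaging each channel over its spatial grid."""
    return feature_map.mean(axis=(0, 1))

# Simulated branch outputs: both backbones yield 8 x 8 x 512 maps
rng = np.random.default_rng(0)
vgg_features = rng.standard_normal((8, 8, 512))       # from VGG16 block5_conv3
densenet_features = rng.standard_normal((8, 8, 512))  # from conv4_block9_0_bn

fused = np.concatenate([global_average_pool(vgg_features),
                        global_average_pool(densenet_features)])
# fused.shape == (1024,) -- the vector fed to the classification head
```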
Classification head: The concatenated 1,024-dimensional vector passes through a series of dense layers that progressively halve in size: 1,024, 512, 256, 128, 64, 32, 16, and finally 4 units (matching the four output classes). Two dropout layers with a rate of 0.2 are placed after the first two dense layers to prevent overfitting by randomly deactivating 20% of neurons during training. The final 4-unit layer produces the classification output.
Model complexity: The total parameter count is 18,598,836, of which only 1,749,556 (9.4%) are trainable. The remaining 16,849,280 parameters belong to the frozen VGG16 and DenseNet-121 backbones. This design leverages pre-trained feature extraction capability while keeping the trainable parameter space manageable, reducing the risk of overfitting on the relatively modest training set.
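The trainable-parameter figure can be checked arithmetically from the dense-layer widths given above, counting weights plus biases for each fully connected layer:

```python
def dense_params(units_in, units_out):
    """Weight matrix plus bias vector for one fully connected layer."""
    return units_in * units_out + units_out

# Input vector (1,024) followed by each dense layer in the head
widths = [1024, 1024, 512, 256, 128, 64, 32, 16, 4]
trainable = sum(dense_params(a, b) for a, b in zip(widths, widths[1:]))
print(trainable)  # 1749556 -- matches the paper's trainable-parameter count
```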
Hyperparameters: The model was trained for 50 epochs with a batch size of 32. The Adam optimizer was used with a learning rate of 0.001. The loss function was sparse categorical cross-entropy, appropriate for multi-class classification with integer-encoded labels. Two dropout layers at 0.2 rate within the classification block served as additional regularization alongside the data augmentation applied during preprocessing.
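For reference, sparse categorical cross-entropy over integer labels reduces to the mean negative log-probability assigned to the correct class. A small NumPy illustration (the probabilities below are made up):

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    """Mean negative log-likelihood of the correct class, where y_true
    holds integer labels (0-3) and probs holds per-class softmax outputs."""
    eps = 1e-7
    picked = probs[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(np.clip(picked, eps, 1.0)))

# Two samples, four classes (Benign, Early Pre-B, Pre-B, Pro-B)
probs = np.array([[0.90, 0.05, 0.03, 0.02],
                  [0.10, 0.10, 0.70, 0.10]])
labels = np.array([0, 2])
loss = sparse_categorical_crossentropy(labels, probs)
# -(log 0.90 + log 0.70) / 2 ≈ 0.231
```

Using the sparse variant avoids one-hot encoding the labels, which is why it pairs naturally with integer-encoded class indices.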
Experimental design: The authors conducted a systematic set of experiments to isolate the contribution of each component. Three models were tested: VGG16 alone, DenseNet-121 alone, and the proposed fusion model. Each was trained under three conditions: using only original images, using only segmented images, and using the combination of both. This 3 x 3 experimental matrix (9 total configurations) enabled direct comparison of the impact of both the model architecture and the input data type.
Environment: All experiments ran on the Kaggle platform using dual NVIDIA T4 GPUs with 13 GB RAM. The TensorFlow Keras framework was used for model implementation and training. Performance was evaluated using accuracy, precision, recall, F1-score, specificity, and confusion matrices.
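All of the reported metrics derive from the confusion matrix. A sketch of the standard per-class formulas, applied to an illustrative 4-class matrix (not the paper's actual counts), with rows as true labels and columns as predictions:

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall, F1, and specificity for each class of a
    confusion matrix (rows = true labels, columns = predicted labels)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as this class but wrong
    fn = cm.sum(axis=1) - tp   # this class but predicted otherwise
    tn = cm.sum() - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return precision, recall, f1, specificity

# Illustrative matrix with 2 total misclassifications over 652 samples
cm = [[160, 1, 0, 0],
      [0, 162, 1, 0],
      [0, 0, 164, 0],
      [0, 0, 0, 164]]
precision, recall, f1, specificity = per_class_metrics(cm)
accuracy = np.trace(np.asarray(cm)) / np.sum(cm)  # 650 / 652
```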
The pre-trained ImageNet weights for both VGG16 and DenseNet-121 were loaded at initialization and remained frozen throughout training. Only the newly added fully connected classification layers were updated during backpropagation. This transfer learning strategy exploits the rich feature representations learned from ImageNet's millions of images while adapting the classification head to the specific leukemia detection task.
Single-model baselines: When trained on original images alone, VGG16 achieved 96.62% accuracy while DenseNet-121 reached 97.54% precision and 97.53% recall. On segmented images alone, both models performed slightly worse: VGG16 dropped to 95.38% accuracy, and DenseNet-121's precision fell to 96.62%. However, when both image types were combined, performance jumped significantly. VGG16 reached 98.95% accuracy (up from 96.62%), and DenseNet-121 reached 99.08% accuracy with 99.10% precision and 99.09% recall. DenseNet-121 consistently outperformed VGG16 across all three data conditions.
Fusion model performance: The proposed fusion model surpassed both individual models in every comparison. On original images alone, it achieved 98.46% accuracy. On segmented images alone, 97.85%. On combined images, the fusion model reached its peak: 99.89% accuracy, 99.80% precision, 99.72% recall, and 99.76% F1-score. The confusion matrix for combined images showed only 2 misclassifications total, compared to 5 misclassifications with original-only and 7 with segmented-only inputs.
Alternative fusion combinations: The authors also tested other feature fusion pairings. DenseNet121-ResNet50 achieved 98.9% accuracy with 99.2% sensitivity, 97.8% precision, and 94.7% specificity. DenseNet121-MobileNet reached 99.2% accuracy with 98.8% sensitivity, 98.6% precision, and 97.4% specificity. The proposed DenseNet121-VGG16 combination at 99.87% accuracy outperformed both alternatives, confirming that the VGG16 and DenseNet-121 pairing captures the most complementary features.
Comparison with prior work: Against Mohamed E. Karar et al., who achieved 99.58% accuracy, the proposed model's 99.89% represents a meaningful improvement, especially in precision (99.80% vs. 96.67%), recall (99.72% vs. 94%), and F1-score (99.76% vs. 95%). Against Mustafa Ghaderzadeh et al. (99.85% accuracy), the improvement is narrower but consistent across all metrics: precision 99.80% vs. 99.74%, recall 99.72% vs. 99.52%, and F1-score 99.76% vs. 99.63%.
IoMT workflow: The proposed framework operates through a multi-step pipeline. Blood samples are collected at a hospital, and a WiFi-enabled microscope captures peripheral blood smear images and uploads them to a cloud server. On the cloud, the system automatically generates segmented images from the originals using the HSV thresholding method, then feeds both versions into the trained fusion model. The classification result is sent back to the medical center and to the patient's personal devices, enabling remote access to diagnostic reports.
AWS implementation: The testbed was deployed on Amazon Web Services using S3 buckets for image storage and Lambda functions for model execution. When a user uploads a blood sample image, it is stored in an S3 container, and the user receives confirmation. A Lambda function then retrieves the image, runs the segmentation pipeline, and executes the fusion model. The prediction, along with associated confidence probabilities, is transmitted back to the user's devices.
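A heavily simplified, hypothetical sketch of the Lambda-side logic is shown below. The function names, stubbed dependencies, and response shape are assumptions for illustration, not the authors' code; a real deployment would fetch the image from S3 via boto3 and load the trained model from storage:

```python
import json

CLASSES = ["Benign", "Early Pre-B", "Pre-B", "Pro-B"]

def handle_prediction(image_bytes, segment_fn, model_fn):
    """Cloud-side handler sketch: segment the uploaded image, run the
    fusion model on (original, segmented), and package the response.
    segment_fn and model_fn stand in for the HSV pipeline and the
    trained network."""
    segmented = segment_fn(image_bytes)
    probs = model_fn(image_bytes, segmented)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return {
        "statusCode": 200,
        "body": json.dumps({
            "prediction": CLASSES[best],
            "probabilities": dict(zip(CLASSES, probs)),
        }),
    }

# Stubbed dependencies for local testing
response = handle_prediction(
    b"fake-image",
    segment_fn=lambda img: img,  # identity stand-in for HSV segmentation
    model_fn=lambda orig, seg: [0.9, 0.04, 0.03, 0.03],
)
```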
Web application demo: A beta web application was developed to demonstrate the complete workflow. Users can upload blood smear images and receive predictions with probability scores. In the demonstration, a benign sample was classified with 90% confidence as benign, with negligible probabilities for other classes. A leukemia-positive sample was classified as the "Pre" subtype with 95% probability, triggering a recommendation to consult a specialist. The application provides an accessible interface for clinical staff who may not have deep technical expertise.
The IoMT architecture addresses several practical challenges in leukemia diagnostics. It eliminates the need for on-site computational infrastructure, reduces dependency on specialist pathologists for initial screening, and enables timely diagnosis even in remote or under-resourced facilities. The cloud-based approach also allows for centralized model updates, meaning improvements to the classification model can be deployed once and immediately benefit all connected facilities.
Dataset size: The most significant limitation acknowledged by the authors is the relatively small dataset. At 6,512 combined images (3,256 original plus 3,256 segmented) from only 89 patients, the dataset is too small to guarantee model robustness across diverse clinical populations. The data originated from a single hospital (Taleqani Hospital, Tehran), which introduces potential demographic and equipment-specific biases. Without multi-center validation, it is unclear how well the model generalizes to blood smear images acquired with different microscopes, staining protocols, or from patients of different ethnic backgrounds.
No hyperparameter optimization: The authors explicitly note that the model was not hyperparameter-tuned. The learning rate (0.001), batch size (32), dropout rate (0.2), and epoch count (50) were set without systematic search. Techniques such as grid search, random search, or Bayesian optimization could potentially improve performance further or achieve equivalent performance with a simpler architecture. The absence of tuning means the reported 99.89% accuracy may not represent the model's ceiling.
Validation methodology: The study uses a single train-test-validation split (70:20:10) rather than k-fold cross-validation. While some prior studies in this space used 5-fold cross-validation, this work's single split makes it harder to assess performance variability. The extremely high accuracy (99.89%), with only 2 misclassifications out of 652 test images, is impressive, but a single split yields only a point estimate: the statistical uncertainty around the true error rate remains wide, which is a concern for clinical deployment.
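To quantify that concern, one can place a Wilson score interval around the observed error count. Taking the reported 2 misclassifications out of 652 at face value, the 95% interval on the true error rate is surprisingly wide:

```python
import math

def wilson_interval(errors, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = errors / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom, (center + margin) / denom

lo, hi = wilson_interval(2, 652)
# The true error rate could plausibly be anywhere from roughly 0.08%
# to 1.1%, i.e. true accuracy anywhere between about 98.9% and 99.9%.
```

This is exactly the variability that k-fold cross-validation or a larger multi-center test set would help pin down.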
Future directions: The authors plan to expand the training dataset with more images to improve robustness. They also intend to apply systematic hyperparameter tuning. Beyond these immediate improvements, the framework's architecture is designed to generalize to other diseases diagnosed through Complete Blood Count (CBC) or blood cell analysis. The IoMT infrastructure could support multi-disease detection platforms, integration with electronic health records, and real-time monitoring with continuous feedback. The broader vision includes extending AI-driven diagnostics to other hematological malignancies and blood disorders.