AI-Assisted Detection for Early Screening of Acute Myeloid Leukemia Using Infrared Spectra and Clinical Biochemical Reports of Blood


Plain-English Explanations
Pages 1-2
What This Study Is About and Why It Matters

Acute myeloid leukemia (AML) is one of the most common and deadliest forms of adult leukemia. It is characterized by the uncontrolled proliferation of abnormal white blood cells in the bone marrow, which suppresses the production of healthy red blood cells, white blood cells (WBC), and platelets. Patients frequently experience infections, easy bruising, fever, and fatigue. In China alone, approximately 15,000 new leukemia cases are reported each year, with AML carrying an extremely high mortality rate.

Current diagnostic methods for leukemia are comprehensive but burdensome. A complete blood count (CBC) is inexpensive and fast but lacks the specificity to directly identify leukemia cells. Flow cytometry offers high accuracy through multi-parameter analysis but requires expensive equipment and trained operators. PCR and mass spectrometry provide high sensitivity and specificity but involve complex workflows and specialized infrastructure. The gold standard, bone marrow biopsy, is an invasive surgical procedure that causes significant patient discomfort and carries risks of pain, bleeding, infection, and nerve damage.

This study, conducted by researchers at Zhejiang University and Taizhou Hospital, introduces an AI-powered screening system that combines two non-invasive data sources: attenuated total reflectance Fourier transform infrared spectroscopy (ATR-FTIR) measurements of lyophilized (freeze-dried) serum and standard clinical blood biochemical test results. The goal is to provide a rapid, minimally invasive method that can screen adults for AML using only a small blood draw, avoiding the need for bone marrow biopsy as a first-line test.

The researchers developed a novel deep learning architecture called the multi-modality spectral transformer network (MSTNetwork) that fuses features from both infrared spectral data and biochemical indicators. This multi-modality approach achieved 98% accuracy and 98% sensitivity, improving sensitivity by 12 percentage points over biochemical indicators alone and by 6 percentage points over FTIR spectra alone.

TL;DR: This study presents an AI system that screens for AML using FTIR infrared spectroscopy of blood serum combined with standard biochemical tests. The MSTNetwork deep learning model fuses both data types to achieve 98% accuracy and 98% sensitivity, significantly outperforming either data source used alone and offering a minimally invasive alternative to bone marrow biopsy for initial screening.
Pages 2-3
Why Infrared Spectroscopy and AI Are a Powerful Combination

Infrared spectroscopy is a rapid, highly sensitive technique that identifies materials by measuring how molecules absorb infrared light at specific wavelengths. Each molecule has a unique spectral "fingerprint" based on its chemical bonds and structure. In clinical medicine, FTIR can detect characteristic patterns from proteins, nucleic acids, lipids, and other biomolecules in biological samples. It has already been applied to identify diseases including Alzheimer's disease, cervical cancer, gastric cancer, and leukemia.

Previous work by Chaber et al. used FTIR spectroscopy for early screening of pediatric leukemia but achieved only 85% accuracy. Zelig et al. analyzed monocytes in peripheral blood using infrared spectroscopy, demonstrating its potential for pre-screening in pediatric leukemia. However, these earlier methods suffered from relatively low diagnostic accuracy, highlighting the need for more sophisticated analytical approaches.

Artificial intelligence has shown strong results in spectroscopy-based cancer detection. Du et al. built a one-dimensional convolutional neural network (1D-CNN) for processing Raman spectroscopy data, achieving 94.5% accuracy across four cancer types. A handheld Raman spectrometer integrated with a CNN achieved over 95% accuracy for pancreatic cancer classification. Multi-modality fusion models have shown even stronger performance: fusing infrared and Raman spectroscopy increased precision by nearly 20% compared to single-modality approaches in some disease contexts.

While clinical serum biochemical testing remains essential for measuring specific blood compounds and guiding treatment, its drawbacks include complex operation, long detection times, and limited sensitivity. The authors argue that combining biochemical data with spectroscopic data through AI can overcome the limitations of each method used independently, creating a more accurate and robust screening tool.

TL;DR: FTIR spectroscopy provides molecular-level fingerprints from blood serum but prior leukemia studies achieved only 85% accuracy. CNN-based AI models have boosted spectroscopy-based cancer detection to over 94% accuracy, and multi-modality fusion (combining spectral and biochemical data) consistently outperforms single-modality approaches, motivating this study's combined approach.
Pages 3-5
Patient Population, Sample Collection, and FTIR Measurement

Patient cohort: The study was approved by the Ethics Committee of Zhejiang Taizhou Enze Medical Center. A total of 70 patients and 24 healthy individuals participated. The patient group comprised 36 AML patients and 34 individuals with non-leukemia diseases (hyperuricemia, gout, and rheumatoid arthritis). All leukemia diagnoses were confirmed through established clinical methods such as blood analysis, bone marrow examination, and morphological analysis, and the non-leukemia diseases were likewise diagnosed through standard clinical methods. Blood samples were collected from each patient at multiple time points.

Sample processing: Blood was collected in anticoagulant tubes and centrifuged at 3000 rpm within two hours to obtain serum, which was stored at -80 degrees Celsius. For spectroscopic analysis, 100 microliters of each serum sample was freeze-dried for 48 hours using an LGJ-12A freeze dryer. This lyophilization step is critical because it eliminates water interference in the infrared spectra, producing clearer molecular signatures. The freeze-dried serum solids were then analyzed by FTIR.

FTIR measurement protocol: Infrared spectra were obtained using an INVENIO S FTIR spectrometer equipped with a diamond attenuated total reflectance (ATR) accessory. All spectra were recorded over the wavenumber range of 400 to 4000 cm-1 at a spectral resolution of 2 cm-1. Before each measurement, the detection area was cleaned with anhydrous ethanol, and air was used as the background. Each serum sample was measured at a minimum of five different locations, and the averaged spectrum was used. This multi-point sampling reduces measurement variability and improves reliability.
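As a rough sketch of the multi-point protocol, the per-sample spectrum can be formed by averaging the scans taken at the different measurement locations. The NumPy code below is illustrative, not the authors' implementation; the simulated scan data and function name are hypothetical:

```python
import numpy as np

def average_spectrum(scans: np.ndarray) -> np.ndarray:
    """Average repeated ATR-FTIR scans of one sample.

    scans: array of shape (n_points, n_wavenumbers), one row per
    measurement location on the freeze-dried serum (>= 5 in the study).
    Returns the mean absorbance at each wavenumber.
    """
    if scans.shape[0] < 5:
        raise ValueError("protocol calls for at least five measurement points")
    return scans.mean(axis=0)

# Example with five simulated scans on the paper's wavenumber grid:
# 400-4000 cm^-1 sampled at 2 cm^-1 resolution.
rng = np.random.default_rng(0)
wavenumbers = np.arange(400, 4001, 2)
scans = rng.normal(1.0, 0.05, size=(5, wavenumbers.size))
mean_spec = average_spectrum(scans)   # one averaged value per wavenumber
```

Averaging suppresses point-to-point variability roughly in proportion to the square root of the number of measurement locations.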

The researchers found that certain biochemical indicators showed clear differences between positive (AML) and negative (control) cases. Compounds such as albumin, total protein, and lactate dehydrogenase differed significantly between groups, playing an important role in AI classification. Other compounds like potassium and uric acid showed smaller variations and contributed less to the classification task.

TL;DR: The study used 36 AML patients, 34 non-leukemia disease patients, and 24 healthy individuals. Blood serum was freeze-dried and analyzed with ATR-FTIR spectroscopy across 400-4000 cm-1. Key biochemical indicators like albumin, total protein, and lactate dehydrogenase showed significant differences between AML and control groups, while the multi-point sampling protocol ensured reliable spectral measurements.
Pages 6-7
The Multi-Modality Spectral Transformer Network

Why not use LDA alone? Linear discriminant analysis (LDA) is a classic machine learning algorithm that works well when data are linearly separable and feature dimensionality is moderate. However, LDA struggles with high-dimensional data because the number of samples is often much smaller than the number of features, causing singular matrices when computing the within-class scatter matrix. Furthermore, LDA lacks a natural mechanism to integrate heterogeneous data from multiple modalities, requiring separate processing that risks information loss and introduces integration difficulties.
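The singularity problem is easy to reproduce: when features outnumber samples, the within-class scatter matrix cannot have full rank. A small NumPy illustration with toy data (not the study's code or dimensions):

```python
import numpy as np

# Toy illustration: with more features than samples, the within-class
# scatter matrix S_w is rank-deficient, so classic LDA's inversion of
# S_w fails without dimensionality reduction or shrinkage.
rng = np.random.default_rng(0)
n_samples, n_features = 20, 100          # e.g. 20 spectra, 100 wavenumber bins
X = rng.normal(size=(n_samples, n_features))
y = np.repeat([0, 1], n_samples // 2)    # two balanced classes

S_w = np.zeros((n_features, n_features))
for c in (0, 1):
    Xc = X[y == c] - X[y == c].mean(axis=0)  # center within the class
    S_w += Xc.T @ Xc                         # accumulate class scatter

rank = np.linalg.matrix_rank(S_w)
# rank is at most n_samples - 2 here, far below n_features, so S_w is singular
```

This is why the authors first compress the data with a learned encoder before handing it to LDA.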

The MSTNetwork solution: To address these limitations, the authors propose the multi-modality spectral transformer network (MSTNetwork), a novel transformer-based neural network that encodes raw data before feeding them into LDA for final classification. The transformer architecture uses a self-attention mechanism to capture long-range dependencies across different dimensions of the data, identifying important disease-related features. Unlike traditional CNNs, which rely on local perception, transformers employ a multi-head attention mechanism to simultaneously focus on multiple subspaces of the data.

Dual-path input and cross-modal attention: The MSTNetwork takes two types of input: infrared spectral data and biochemical data. Because these data types have fundamentally different physical properties (absorption rates at various wavelengths versus concentrations of biochemical indicators), they are encoded separately into query (Q), key (K), and value (V) vectors using fully connected layers. The FTIR data is encoded to dimension 32 and the biochemical data to dimension 8. Batch normalization is applied to stabilize the input distribution. The Q, K, and V vectors from both modalities are then concatenated and processed through a multi-head attention mechanism, enabling the model to learn both intra-modal features and cross-modal correlations simultaneously.

The key innovation of MSTNetwork compared to existing transformer architectures is its dual-path attention layer, which dynamically generates attention weight matrices for both within-modality and between-modality interactions. This enables hierarchical feature interaction in an end-to-end learning paradigm, eliminating the need for manually designed feature fusion rules and significantly enhancing compatibility with heterogeneous modal data.
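A minimal sketch of cross-modal self-attention may make the dual-path idea concrete. The sketch simplifies the paper's design by projecting both modalities to a single shared 32-dimensional token width (the paper encodes FTIR to 32 dimensions and biochemistry to 8 and then concatenates); the random matrices stand in for learned fully connected layers, and all shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence.
    The (n_tokens x n_tokens) attention matrix mixes information within
    each token (intra-modal) and between tokens (cross-modal)."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return attn @ V, attn

rng = np.random.default_rng(0)
d_model = 32  # shared token width in this sketch only

# Random matrices stand in for the learned fully connected encoders.
ftir = rng.normal(size=1801)                  # one averaged spectrum
biochem = rng.normal(size=36)                 # 36 biochemical indicators
ftir_token = ftir @ (rng.normal(size=(1801, d_model)) * 0.02)
bio_token = biochem @ (rng.normal(size=(36, d_model)) * 0.1)
tokens = np.stack([ftir_token, bio_token])    # (2, 32): one token per modality

Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
out, attn = self_attention(tokens, Wq, Wk, Wv)
# attn is 2x2: diagonal entries are intra-modal, off-diagonal cross-modal
```

The 2x2 attention matrix is what lets the model weigh, per sample, how much the spectral representation should borrow from the biochemical one and vice versa.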

TL;DR: The MSTNetwork is a transformer-based architecture that fuses FTIR spectral data and biochemical indicators through a dual-path attention mechanism. It encodes each modality separately (dimensions 32 and 8), then uses multi-head attention to learn both within-modality and cross-modality features before passing the output to LDA for classification, overcoming the limitations of LDA alone on high-dimensional, multi-modal data.
Pages 7-8
Training Strategy: Contrastive Learning for Feature Separation

What is contrastive learning? Contrastive learning is a deep learning method that learns feature representations by comparing similarities and differences between samples. The core idea is to pull similar samples (positive pairs) closer together in the learned feature space while pushing different samples (negative pairs) apart. This forces the model to develop representations that are highly sensitive to the differences between sample classes.

How it works in this study: A set of infrared spectral data and biochemical data is input into the MSTNetwork, which reduces their combined dimensionality to produce a compact two-dimensional feature matrix. The Euclidean distances between different row vectors in this matrix are calculated to form a distance matrix. A contrastive loss function is then defined based on the ground truth labels: when two samples belong to the same category (both AML or both control), the loss minimizes their distance; when they belong to different categories, the loss maximizes their distance up to a margin threshold of 0.5.
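A common formulation of this margin-based pairwise contrastive loss (the paper's exact expression may differ in detail) can be sketched as:

```python
import numpy as np

def contrastive_loss(z1, z2, same_class, margin=0.5):
    """Pairwise contrastive loss on two embedded samples. Same-class pairs
    are penalized by squared distance; different-class pairs are penalized
    only when they fall inside the margin."""
    d = np.linalg.norm(z1 - z2)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Toy points in the learned 2-D feature space
aml_a = np.array([0.10, 0.20])
aml_b = np.array([0.15, 0.25])
ctrl = np.array([0.90, 0.80])

loss_same = contrastive_loss(aml_a, aml_b, same_class=True)   # small: pull together
loss_diff = contrastive_loss(aml_a, ctrl, same_class=False)   # zero: beyond the margin
```

The margin is what stops the loss from pushing already well-separated pairs ever further apart, keeping the feature space compact.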

Training details: The batch size was set to 100 after experimentation. The Adam optimizer was used with a learning rate of 0.001. A StepLR learning rate scheduler reduced the rate by 50% every 60 epochs, helping the model escape local minima and accelerate convergence. Training lasted 240 epochs in total. The choice of batch size matters: too small a batch makes it difficult for the model to comprehensively distinguish positive from negative samples, while too large a batch may lead to overfitting.
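The StepLR schedule is simple to reproduce; a short sketch using the stated settings (function name hypothetical):

```python
def steplr(base_lr=0.001, step=60, gamma=0.5, epochs=240):
    """Learning rate per epoch under a StepLR schedule: the rate is
    multiplied by gamma (here 0.5) every `step` epochs."""
    return [base_lr * gamma ** (epoch // step) for epoch in range(epochs)]

lrs = steplr()
# Halves at epochs 60, 120, and 180: 0.001 -> 0.0005 -> 0.00025 -> 0.000125
```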

TL;DR: The MSTNetwork is trained using contrastive learning, which pulls same-class samples together and pushes different-class samples apart in the feature space. Training used a batch size of 100, the Adam optimizer with a learning rate of 0.001, and a StepLR scheduler that halved the learning rate every 60 epochs across 240 total epochs.
Pages 8-9
Visualizing How the Model Separates AML from Controls

PCA visualization: To demonstrate the feature extraction capability of the pre-trained model, the authors applied principal component analysis (PCA) to both the raw data and the data processed by the MSTNetwork. PCA is a dimensionality reduction technique that projects high-dimensional data onto a smaller number of axes that capture the most variance, allowing researchers to visualize complex data in two or three dimensions.
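The 2-D PCA projection used for such visualizations can be sketched in a few lines of NumPy via the SVD (illustrative code, not the authors'; the data shape is hypothetical):

```python
import numpy as np

def pca_2d(X):
    """Project data onto its first two principal components via SVD."""
    Xc = X - X.mean(axis=0)                   # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                      # scores on PC1 and PC2

rng = np.random.default_rng(0)
X = rng.normal(size=(94, 40))                 # e.g. 94 encoded samples, 40 features
scores = pca_2d(X)                            # (94, 2): ready to scatter-plot
```

Plotting `scores` colored by class label is exactly the kind of figure the authors use to compare raw data with MSTNetwork-encoded data.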

Raw data overlap: When PCA was applied to the original, unprocessed data, the visualization showed a high degree of overlap between AML samples and control samples. This means that without feature extraction, the two classes are not easily separable in the raw feature space. Using an end-to-end deep learning model directly on this overlapping data would require a very deep network to capture the distinguishing features, resulting in poor model robustness.

Clear separation after MSTNetwork encoding: In contrast, after pre-training with the contrastive learning approach, the PCA visualization showed that different data types were distinctly separated in the learned feature space. The distances between different classes became relatively large, which significantly reduced the classification difficulty for subsequent models (in this case, LDA) and enhanced the overall robustness of the system. This visualization was consistent across training, validation, and test sets, confirming that the separation generalizes to unseen data.

TL;DR: PCA visualization showed that raw data had heavy overlap between AML and control samples, making classification difficult. After MSTNetwork encoding with contrastive learning, the classes became clearly separated in feature space across training, validation, and test sets, confirming the model learns robust, generalizable representations.
Pages 9-10
Tuning Batch Size and Experimental Design

Batch size optimization: The performance of the MSTNetwork is influenced by the batch size used during training. As the batch size increases, the model differentiates or merges features from more pairs of data, increasing its fit to the training set. However, a batch size that is too large can cause the model to overfit by learning overly specific feature relationships, while a batch size that is too small may prevent effective feature extraction. The authors tested batch sizes in increments of 50 and found that a batch size of 100 yielded the highest accuracy for their dataset.

Experimental protocol for robustness: To mitigate interference from model initialization and test set variations, the researchers conducted 10 parallel experiments. For each experiment, the dataset was divided into training, validation, and test sets at a 7:0.5:2.5 ratio, with five random model initializations performed per split. The final accuracy was evaluated as the average of 50 total trials. A 5% validation set provided reliable feedback during training and tuning, while the larger 25% test set ensured robust and unbiased evaluation.

Patient-level splitting: A critical design choice was splitting the dataset based on individual subjects rather than individual samples. Since a single patient may contribute multiple serum samples collected at different time points, splitting at the sample level could allow the same patient's data to appear in both training and test sets, creating data leakage. By ensuring all samples from one patient stay within the same split, the authors preserved the independence of training and testing sets and produced a more realistic assessment of clinical generalizability.
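A patient-level split can be sketched as follows (hypothetical helper, assuming a mapping from sample IDs to patient IDs; the ratios follow the paper's 7 : 0.5 : 2.5 scheme):

```python
import random

def patient_level_split(sample_ids, patient_of, ratios=(0.70, 0.05, 0.25), seed=0):
    """Assign every sample from a given patient to the same split,
    so the same person never appears in both training and test data."""
    patients = sorted(set(patient_of[s] for s in sample_ids))
    random.Random(seed).shuffle(patients)
    n_train = round(len(patients) * ratios[0])
    n_val = round(len(patients) * ratios[1])
    train_p = set(patients[:n_train])
    val_p = set(patients[n_train:n_train + n_val])
    split = {"train": [], "val": [], "test": []}
    for s in sample_ids:
        p = patient_of[s]
        key = "train" if p in train_p else ("val" if p in val_p else "test")
        split[key].append(s)
    return split

# Toy example: patient P1 contributed two samples at different time points;
# both are guaranteed to land in the same split.
patient_of = {"s1": "P1", "s2": "P1", "s3": "P2", "s4": "P3"}
split = patient_level_split(sorted(patient_of), patient_of)
```

Shuffling patients (not samples) before slicing is the step that closes the leakage loophole.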

TL;DR: Batch size of 100 was selected after systematic testing. The study used 10 parallel experiments with 5 random initializations each (50 total trials) and a 70/5/25 train/validation/test split. Patient-level splitting prevented data leakage from multiple samples per patient, ensuring a realistic evaluation of clinical generalizability.
Pages 10-11
Multi-Modality Fusion Dramatically Outperforms Single-Modality Approaches

Ablation study design: The researchers evaluated three configurations to demonstrate the benefit of multi-modality fusion: biochemical indicators alone (36 standard major biochemical indicators), FTIR spectra alone, and the combined multi-modality system. Classification performance was measured using accuracy, sensitivity (the ability to correctly identify AML cases), and specificity (the ability to correctly identify non-AML cases). When the numbers of positive and negative samples are approximately equal, accuracy approximates the average of sensitivity and specificity.
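The three metrics, and the balanced-class relationship noted above, can be checked with a few lines (illustrative confusion-matrix counts, not the study's results):

```python
def screening_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts
    (tp = AML cases correctly flagged, fn = AML cases missed, etc.)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity

# Balanced toy example: 50 AML and 50 control samples, one error in each group.
# With equal class sizes, accuracy equals the mean of sensitivity and specificity.
acc, sens, spec = screening_metrics(tp=49, fn=1, tn=49, fp=1)
```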

Biochemical indicators only: Using only the 36 standard biochemical indicators achieved 93% accuracy but only 86% sensitivity with a large standard deviation of 0.09. The low sensitivity means a substantial proportion of true AML cases were missed (false negatives), which is clinically dangerous because missed leukemia cases delay critical treatment. The high variability (standard deviation) also indicates inconsistent performance across different data splits.

FTIR spectra only: Using only infrared spectroscopy data achieved 93% accuracy with improved sensitivity of 92% (standard deviation 0.03) and specificity of 94%. The spectral approach captures vibrational energy information from diverse molecules, providing a richer set of features than biochemical indicators alone. The smaller standard deviation indicates more consistent performance.

Combined multi-modality system: When both FTIR spectra and biochemical indicators were integrated through the MSTNetwork, accuracy rose to 98% and sensitivity rose to 98% (standard deviation 0.02). This represents a 12-percentage-point improvement in sensitivity over biochemical indicators alone and a 6-percentage-point improvement over FTIR alone. The multi-modality system also produced the smallest standard deviation, indicating it is both the most accurate and the most robust of the three approaches. Additional experiments comparing AML patients to only healthy individuals (excluding other disease patients) showed similarly high accuracy and sensitivity.

TL;DR: Biochemical data alone achieved 93% accuracy / 86% sensitivity; FTIR alone reached 93% / 92%. The combined MSTNetwork multi-modality system achieved 98% accuracy and 98% sensitivity with the smallest standard deviation (0.02), proving that fusing spectral and biochemical data dramatically improves both accuracy and robustness for AML screening.
Pages 11-14
Clinical Significance, Current Limitations, and Next Steps

Clinical significance: The system requires only a few hundred microliters of serum per test, making it far less invasive than bone marrow biopsy. By combining the molecular-level fingerprint information from infrared spectroscopy with the metabolic insights from biochemical analysis, and analyzing them through AI, this approach offers a novel solution for early leukemia screening. The 98% sensitivity is particularly important in a screening context because it means very few AML cases would be missed, allowing patients to be identified earlier when treatment is most effective.

Sample processing time: One limitation is the freeze-drying step, which takes 48 hours. This delay between blood collection and spectroscopic analysis could affect screening efficiency in a clinical setting where rapid turnaround is valued. The authors plan to explore more efficient and rapid sample processing techniques to accelerate the workflow. Eliminating or shortening the lyophilization step would be a significant practical improvement.

Binary classification scope: The current study performed only a simple binary classification, distinguishing AML from the control adult group (healthy individuals and patients with non-leukemia diseases). All leukemia patients in the study were adults aged 40 to 70 with AML, which accounts for 60 to 70% of adult acute leukemia. The study did not address acute lymphoblastic leukemia (ALL) or other leukemia subtypes, nor did it include pediatric patients. Future research should refine the classification to include all leukemia subtypes across all ages to provide a more comprehensive diagnostic tool.

Sample size and generalizability: The cohort of 94 total participants (36 AML, 34 other diseases, 24 healthy) is relatively small. While the patient-level splitting and 50-trial averaging help mitigate small-sample effects, larger multi-center validation studies would be needed to confirm generalizability across diverse populations. The authors' robust experimental design with multiple random initializations is commendable, but external validation at independent hospitals remains an important next step before clinical deployment.

TL;DR: The system needs only a few hundred microliters of blood and achieves 98% sensitivity, offering a minimally invasive alternative to bone marrow biopsy for AML screening. Key limitations include the 48-hour freeze-drying step, binary-only classification (no ALL or subtype differentiation), and a small cohort of 94 participants. Future work should address faster sample processing, multi-subtype classification, and larger multi-center validation.
Citation: Zhang C, Li J, Luo W, He S. AI-Assisted Detection for Early Screening of Acute Myeloid Leukemia Using Infrared Spectra and Clinical Biochemical Reports of Blood. Bioengineering. 2025. Open access (CC BY). DOI: 10.3390/bioengineering12040340. PMCID: PMC12024367.