Diagnosis of Acute Leukemia by Multiparameter Flow Cytometry with the Assistance of Artificial Intelligence

Published in Diagnostics, 2022. The original paper is available on PMC.

Plain-English Explanations
Pages 1-2
What This Study Is About and Why It Matters

Multiparameter flow cytometry (MFC) has become an essential tool over the past 30 years for diagnosing leukemia. By measuring multiple surface and intracellular antigens on cells simultaneously, MFC allows pathologists to identify abnormal blast populations, assign their lineage (myeloid versus lymphoid), and distinguish malignant cells from normal progenitors. However, interpreting MFC data is a demanding task that requires skilled, extensively trained pathologists, a limitation that is especially problematic in countries like China, where pathologist shortages and heavy workloads are common.

This study, conducted at Sun Yat-Sen University's First Affiliated Hospital in collaboration with DeepCyto LLC, evaluated an AI-assisted methodology for diagnosing acute leukemia using MFC data. The researchers enrolled 200 acute leukemia patients (including AML, B-ALL, and T-ALL subtypes) and 94 non-leukemic control patients with cytopenia or hematocytosis from non-neoplastic conditions such as infection, autoimmune cytopenias, and post-chemotherapy bone marrow suppression. All diagnoses followed current WHO classification criteria, combining clinical findings, morphologic examination, cytogenetic data, and molecular analysis.

Existing computational approaches to MFC analysis, such as t-SNE, K-means, SPADE, FlowSOM, and PhenoGraph, have been developed primarily for research settings. Clinical MFC data analysis still relies heavily on manual logic gating with conventional software, where detection efficiency depends on the examiner's experience. Manual gating is also limited to two-dimensional scatter plot combinations, making it difficult to separate and gate cells with high-dimensional data or consistently measure antigen expression levels across multiple dimensions.

The central goal of this paper was to validate a clinic-oriented AI workflow that could perform automatic MFC data analysis while producing not only final diagnostic results but also human-understandable and editable intermediate steps, allowing pathologists to review, adjust, or reject the AI's output at each stage.

TL;DR: This 2022 study validated an AI-assisted workflow for diagnosing acute leukemia from multiparameter flow cytometry data. Using 200 leukemia patients and 94 controls, the researchers compared AI-generated diagnoses against manual expert analysis, focusing on diagnostic accuracy, abnormal cell quantification, and immunophenotypic classification.
Pages 2-4
Patient Cohort and MFC Immunophenotyping Protocol

The leukemia-positive group included 200 patients (95 men, 105 women, mean age 43.12 years) referred between March 2019 and June 2020. Diagnosis was established according to WHO classification criteria using a combination of clinical findings, morphologic examination of peripheral blood and bone marrow specimens, and cytogenetic and molecular data. Cases with equivocal findings or insufficient data were excluded. The non-leukemic group comprised 94 patients (44 men, 50 women, mean age 39.59 years) with conditions including infection, post-chemotherapy bone marrow suppression, autoimmune cytopenias, chronic renal insufficiency, iron deficiency anemia, and drug-induced cytopenias.

Bone marrow aspirate samples were collected in EDTA anticoagulant and processed within 24 hours. An eight-color flow cytometry analysis was performed on a FACS Canto Plus flow cytometer, standardized daily using CS&T beads. The antibody panel included CD45 (present in all tubes for blast identification), along with CD2, CD3, CD4, CD5, CD7, CD8, CD10, CD11b, CD13, CD14, CD15, CD16, CD19, CD20, CD22, CD33, CD34, CD38, CD56, CD64, CD117, HLA-DR, MPO, CD79a, and cCD3. Additional markers such as CD235a, CD71, CD41, CD42b, and CD61 were included for specific AML subtypes (M6 and M7).

This comprehensive panel of CD markers is critical because each leukemia subtype has a characteristic immunophenotypic signature. For example, AML blasts typically express myeloid markers like CD13, CD33, and MPO, while B-ALL blasts express CD19, CD22, and CD79a, and T-ALL blasts express CD3, CD7, and CD5. The AI system needed to learn these complex patterns across all marker combinations to produce accurate lineage assignments.
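To make the idea of an immunophenotypic signature concrete, here is a toy sketch (not the paper's method) that scores a blast population's positive markers against simplified lineage signatures. The marker sets are illustrative abbreviations of the patterns described above, not complete diagnostic criteria:

```python
# Simplified lineage signatures drawn from the markers named above.
# These short sets are illustrative only, not WHO diagnostic criteria.
LINEAGE_MARKERS = {
    "AML":   {"CD13", "CD33", "MPO", "CD117"},
    "B-ALL": {"CD19", "CD22", "CD79a", "CD10"},
    "T-ALL": {"CD3", "cCD3", "CD7", "CD5"},
}

def naive_lineage(positive_markers: set) -> str:
    """Assign the lineage whose signature overlaps most with the blasts'
    positive markers -- a toy stand-in for the AI's learned patterns."""
    scores = {lin: len(markers & positive_markers)
              for lin, markers in LINEAGE_MARKERS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "abnormal"

# A CD19+ CD22+ CD10+ blast population scores highest for B-ALL
call = naive_lineage({"CD19", "CD22", "CD10"})
```

Real lineage assignment weighs marker intensity, aberrant co-expression, and cross-tube evidence, which is exactly the complexity the AI system has to learn.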

TL;DR: The study used 200 acute leukemia patients and 94 non-leukemic controls, all tested with eight-color flow cytometry using a panel of over 25 CD markers. Bone marrow samples were processed within 24 hours and analyzed on standardized equipment following WHO classification criteria.
Pages 3-5
The Five-Phase DeepFlow AI Pipeline

The AI system, called DeepFlow (version 1.0.1, developed by DeepCyto LLC), follows a five-phase analysis pipeline. Phase 1: Data Validation incorporates multiple machine learning models to extract nucleated single cells from raw MFC data. This includes flow time stability screening (checking the moving average of forward scattering signals for inconsistent changes), doublets filtering (using a linear regression model to separate single cells from doublets in FSC-A/FSC-H space), and debris removal (using an unsupervised learning algorithm for clustering combined with a supervised learning model to identify debris based on mean fluorescence intensity on FSC-A and SSC-A). These data preparation models were trained on 500 cases and validated on an additional 227 cases.
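The doublet-filtering idea can be sketched in a few lines: singlets follow a roughly linear FSC-A/FSC-H relationship, so a linear fit plus a residual cutoff flags events with excess area (FSC-A) for their height (FSC-H). The 3-SD cutoff and the simulated values below are assumptions for illustration, not DeepFlow's actual parameters:

```python
import numpy as np

def flag_doublets(fsc_a, fsc_h, residual_sd=3.0):
    """Flag events whose FSC-A deviates from the linear FSC-A/FSC-H
    trend followed by single cells. residual_sd is an illustrative
    cutoff, not a published DeepFlow value."""
    slope, intercept = np.polyfit(fsc_h, fsc_a, deg=1)  # fit the singlet trend
    residuals = fsc_a - (slope * fsc_h + intercept)
    return np.abs(residuals) > residual_sd * residuals.std()

# Simulated events: 950 singlets on the line, 50 doublets with inflated area
rng = np.random.default_rng(0)
fsc_h = rng.uniform(50_000, 150_000, 1000)
fsc_a = 1.1 * fsc_h + rng.normal(0, 2_000, 1000)
fsc_a[:50] *= 1.6  # doublets carry ~60% extra area for their height
doublets = flag_doublets(fsc_a, fsc_h)
```

The same fit-then-threshold pattern generalizes to the flow-time stability screen, where a moving average of forward scatter replaces the linear fit.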

Phase 2: Population Classification applies a multidimensional density-phenotype coupling (MDPC) algorithm across all nucleated cell data. Unlike manual gating, which is restricted to two-dimensional scatter plots, this algorithm considers all channel distributions and phenotypes simultaneously and automatically adjusts cluster spans based on overall distribution. Two criteria define cell populations: the cell distribution density across all markers and the marker expression phenotype on all markers. Expression levels for each channel are classified into five tiers: bright, positive, partial, dim, and negative. The MDPC algorithm is optimized for large cell groups (5% and above), which aligns with the clinical threshold of 20% abnormal cells for acute leukemia diagnosis.
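The five-tier expression scale can be sketched as a simple threshold map. The cutoff values below (on a log10 intensity scale) are invented for illustration, since the paper does not publish the actual boundaries:

```python
# Illustrative tier cutoffs on a log10 fluorescence-intensity scale.
# The real DeepFlow boundaries are not published; these values exist
# only to show the five-tier structure used in Phase 2.
TIER_CUTOFFS = [(1.0, "negative"), (2.0, "dim"), (2.5, "partial"),
                (3.5, "positive"), (float("inf"), "bright")]

def expression_tier(log_mfi: float) -> str:
    """Map a cluster's log10 mean fluorescence intensity on one channel
    to one of the five expression tiers."""
    for cutoff, tier in TIER_CUTOFFS:
        if log_mfi < cutoff:
            return tier
    return "bright"
```

In the actual MDPC algorithm, these per-channel tiers combine with density information across all channels at once, which is what lets it escape the two-dimensional limits of manual gating.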

Phase 3: Immune-Phenotype Classification uses a random forest classifier with bootstrap aggregating built separately for each antibody tube. The classifier characterizes five common cell categories (lymphocytes, monocytes, granulocytes, blasts, and nucleated red blood cells) and distinguishes subcategories such as T-cells and B-cells. Each cell cluster's raw attributes are encoded into a cluster-level feature vector that includes statistical parameters such as mean fluorescence intensity, standard deviation, and channel distributions. A cross-tube match algorithm then integrates antibody expression information from multiple tubes to improve classification accuracy when a single tube's markers are insufficient.
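The cluster-level feature encoding described above might look like the following sketch, which packs per-channel mean fluorescence intensity, per-channel spread, and the cluster's share of nucleated cells into one fixed-length vector. In the paper, vectors like these feed a bagged random forest per antibody tube; the classifier itself is omitted here:

```python
import numpy as np

def cluster_features(events: np.ndarray, total_events: int) -> np.ndarray:
    """Encode one cell cluster (events x channels matrix of fluorescence
    values) as a feature vector: per-channel mean fluorescence intensity,
    per-channel standard deviation, and the cluster's share of all
    nucleated cells."""
    mfi = events.mean(axis=0)
    spread = events.std(axis=0)
    ratio = np.array([len(events) / total_events])
    return np.concatenate([mfi, spread, ratio])

# Example: a 500-event cluster measured on 8 channels, out of 10,000 cells
rng = np.random.default_rng(1)
cluster = rng.lognormal(mean=2.0, sigma=0.5, size=(500, 8))
vec = cluster_features(cluster, total_events=10_000)  # 8 + 8 + 1 = 17 features
```

Working at the cluster level rather than the single-cell level keeps the feature space small and stable, which suits tree-based classifiers like the random forest used here.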

Phase 4: AI-Assisted Diagnosis employs a boosted random forest algorithm that aggregates all previously obtained information, including categorized cell clusters parameterized by ratio, category, and phenotypes. The model examines normal cell clusters for unusual expression, analyzes abnormal clusters, combines those with similar expressions, computes cell percentages, and diagnoses the acute leukemia subtype. Phase 5: Report Generation automatically produces a comprehensive FCM diagnosis report in PDF format, including 2D scatterplots, t-SNE visualizations, and heat maps, along with quality control measures at critical stages.

TL;DR: DeepFlow uses a five-phase pipeline: data validation (debris/doublet removal), multidimensional density-based clustering, random forest classification of cell populations across antibody tubes, boosted random forest diagnosis, and automated report generation with visualizations. Each phase produces human-readable intermediate results that pathologists can review and edit.
Pages 5-7
Comparison of AI and Manual Diagnostic Results

The overall diagnostic consistency between AI and manual analysis was 0.976, with a kappa value of 0.963, indicating excellent agreement. Breaking this down by subtype, AML cases showed 0.971 consistency (134 of 138 correctly classified), B-ALL showed 0.981 consistency (52 of 53 correctly classified), T-ALL showed 0.778 consistency (7 of 9 correctly classified), and non-leukemic cases showed perfect 1.000 consistency (all 94 correctly identified). In total, only 7 of 294 cases were classified as "abnormal" by AI rather than receiving a specific subtype diagnosis.

The seven discordant cases were clinically challenging even for human experts. All four misclassified AML cases were MPO-negative and cross-expressed lymphoid antigens such as CD7 or CD56, making lineage assignment difficult. In the three ALL cases, the lymphoid-lineage-specific markers (cCD3 or CD79a) showed only dim expression, while myeloid antigens (CD13 or CD33) were aberrantly co-expressed. In these situations, the AI correctly identified the cells as abnormal blasts and flagged them for manual review rather than forcing an incorrect subtype classification. This behavior represents a deliberate safety feature of the system.
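The reported agreement statistics can be checked directly from these counts. Reconstructing the confusion matrix (with "abnormal" as the AI's fifth output category) and computing Cohen's kappa reproduces both figures:

```python
import numpy as np

# Confusion matrix reconstructed from the reported counts.
# Rows: manual diagnosis; columns: AI output.
# Order: AML, B-ALL, T-ALL, non-leukemic, "abnormal" (no subtype assigned).
confusion = np.array([
    [134, 0, 0, 0, 4],   # AML: 4 MPO-negative cases flagged as abnormal
    [0, 52, 0, 0, 1],    # B-ALL: 1 dim-marker case flagged
    [0, 0, 7, 0, 2],     # T-ALL: 2 dim-marker cases flagged
    [0, 0, 0, 94, 0],    # non-leukemic: all concordant
    [0, 0, 0, 0, 0],     # manual analysis never used "abnormal"
])

n = confusion.sum()
p_o = np.trace(confusion) / n                              # observed agreement
p_e = (confusion.sum(1) * confusion.sum(0)).sum() / n**2   # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
# p_o ≈ 0.976 and kappa ≈ 0.963, matching the reported values
```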

A critical finding was the dramatic difference in analysis speed. The AI analysis time per case averaged 83.72 seconds (SD: 23.90 s), compared to 15.64 minutes (SD: 7.16 min) for manual analysis. That works out to an approximately 11-fold speedup (about 938 seconds versus 84 seconds per case), which has significant implications for clinical laboratory throughput, especially in high-volume settings where pathologist time is a bottleneck.

TL;DR: The AI achieved 97.6% overall diagnostic consistency with manual analysis (kappa = 0.963). AML accuracy was 97.1%, B-ALL was 98.1%, and non-leukemic cases were 100%. The seven discordant cases involved atypical immunophenotypes that are challenging even for experts. AI analysis was more than 10 times faster than manual analysis.
Pages 7-8
Comparing AI and Manual Abnormal Cell Proportions

Beyond binary diagnosis, accurate quantification of the abnormal cell proportion is clinically important for staging, treatment planning, and monitoring response to therapy. The mean abnormal cell proportion was 64.49% (SD: 23.36) for AI analysis and 62.97% (SD: 22.69) for manual analysis. The Pearson correlation coefficient between the two methods was 0.913 (p < 0.04), indicating strong statistical correlation.

Bland-Altman analysis provided a more detailed assessment of agreement. The bias (mean difference) was only 0.752 percentage points, with a standard deviation of 6.646. The 95% limits of agreement ranged from -12.775 to 13.779 percentage points. A paired t-test found no statistically significant difference (p = 0.1225), indicating no systematic bias between the AI and manual cell proportion estimates from a clinical perspective.
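The Bland-Altman computation itself is straightforward: the bias is the mean of the paired differences, and the limits of agreement sit 1.96 standard deviations on either side of it. A minimal sketch on toy data (not the study's per-patient values):

```python
import numpy as np

def bland_altman(a, b):
    """Bias (mean difference) and 95% limits of agreement for paired
    measurements from two methods (here: AI vs manual proportions)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Toy paired abnormal-cell percentages (invented): AI reads ~1 point higher
ai     = [64.0, 55.2, 71.9, 80.1, 40.3]
manual = [63.1, 54.0, 70.8, 79.5, 39.0]
bias, (lo, hi) = bland_altman(ai, manual)
```

A case falling outside the limits of agreement, like the twelve large-difference patients below, is exactly the kind of result that warrants manual review.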

Twelve patients showed a difference of more than 20 percentage points between AI and manual analysis. In nine cases, the manual proportion was higher, likely because the AI flagged some ambiguous cells as unclassified, pending manual review, rather than counting them as abnormal. In three cases, the AI proportion was higher, probably because the AI misidentified some granulocytes or monocytes as abnormal cells. These edge cases highlight the importance of the system's design philosophy: providing editable intermediate results so pathologists can catch and correct such errors.

TL;DR: AI and manual methods showed strong agreement in abnormal cell quantification, with a Pearson correlation of 0.913 and a non-significant mean difference of only 0.75 percentage points. Bland-Altman analysis confirmed the two methods are clinically equivalent, though 12 cases showed differences exceeding 20 percentage points in either direction.
Pages 8-10
AI Performance on Individual CD Marker Classification

The most granular level of comparison examined how well the AI classified the expression level of each individual CD marker on abnormal cells. Expression was categorized as positive, partial, or negative. Across all 25 markers evaluated on 200 leukemia cases (5,000 total marker assessments), the overall consistency between AI and manual classification was 0.889, with a kappa value of 0.775, indicating good agreement.

Individual marker consistencies ranged from 0.75 to 0.99. The highest-performing markers included CD16 (0.99), CD3 (0.985), CD56 (0.935), cCD3 (0.975), and CD8 (0.975). These markers tend to have clearer bimodal expression patterns that are easier for the AI to classify. Markers with lower consistency included CD13 (0.75), CD38 (0.775), and CD11b (0.885). CD13 and CD38 are particularly challenging because they often show continuous, overlapping expression distributions rather than clear-cut positive or negative populations, making the boundary between "partial" and "positive" or "partial" and "negative" subjective even among human experts.

The kappa values for individual markers showed more variation, ranging from 0.139 (CD14) to 0.925 (CD10). The low kappa for CD14 is notable because while raw consistency was high (0.97), nearly all cases were negative for CD14, resulting in a statistical paradox where high agreement on a dominant category produces a low kappa. This underscores the importance of examining both consistency percentages and kappa values together when evaluating AI performance on immunophenotypic classification.
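The CD14 effect is easy to demonstrate with invented numbers (these are not the paper's raw data): when one category dominates, chance agreement is already close to observed agreement, so kappa collapses even though the raters almost always agree:

```python
def cohens_kappa(pairs):
    """Cohen's kappa for two raters over a list of (rater1, rater2) labels."""
    n = len(pairs)
    p_o = sum(a == b for a, b in pairs) / n
    cats = {c for pair in pairs for c in pair}
    p_e = sum((sum(a == c for a, _ in pairs) / n) *
              (sum(b == c for _, b in pairs) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical CD14-like distribution: 200 assessments, 97% raw agreement,
# but almost everything is "negative" for both raters.
pairs = ([("neg", "neg")] * 193 + [("pos", "pos")] * 1 +
         [("neg", "pos")] * 3 + [("pos", "neg")] * 3)
# Raw agreement is 194/200 = 0.97, yet kappa is only ~0.23 here because
# chance agreement on the dominant "negative" category is already ~0.96.
```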

TL;DR: Across 5,000 individual marker assessments, the AI achieved 88.9% overall consistency with manual immunophenotyping (kappa = 0.775). Markers with clear bimodal patterns (CD16, CD3, cCD3) performed best, while markers with continuous distributions (CD13, CD38) proved more challenging for both AI and human experts.
Pages 10-12
Why a Clinic-Oriented AI Workflow Is Different from Research Tools

The authors draw an important distinction between AI tools designed for research flow cytometry and those suitable for clinical practice. Research-oriented algorithms like t-SNE are innovative but impractical for clinical use because they can take hours or even days to process million-event minimal residual disease (MRD) MFC data, which is unacceptable when clinical turnaround time is a priority. Clinical laboratories operate under economic constraints and regulatory requirements that demand fast, reproducible, and transparent analysis.

Most previously published AI approaches for flow cytometry used end-to-end black-box models that lack human-understandable intermediate results. Clinical practitioners found these models difficult to review and validate because there was no way to inspect the reasoning behind a diagnosis. The DeepFlow system explicitly addresses this gap by producing interpretable intermediate outputs at every stage: debris removal gates, cell cluster assignments, phenotype classifications, and diagnostic logic. This mirrors the step-by-step reasoning that pathologists use in manual analysis, making it far easier for a clinician to understand why the AI reached a particular conclusion and where it might have erred.

The interactive editing capability is another clinically important feature. Rather than presenting a fixed diagnosis, the system allows pathologists to adjust gating boundaries, reassign cell clusters, or modify phenotype calls at any step. Over time, this adaptive learning means the AI can be tuned to each pathologist's gating preferences and institutional protocols. This flexibility is particularly valuable because flow cytometry panels are highly customized across different laboratories, and a rigid model trained on one panel design would not generalize well to others.

TL;DR: Unlike research tools such as t-SNE that are too slow for clinical use, DeepFlow produces interpretable intermediate results at every stage, mirroring manual analysis logic. Pathologists can review, adjust, or reject AI output at each step. This transparency and editability are what make the system practical for real clinical laboratory deployment.
Pages 11-13
Current Limitations and Where the Field Goes Next

The study focused exclusively on acute leukemia, covering AML, B-ALL, and T-ALL. The authors acknowledge that additional flow cytometry panels for other applications, including B-ALL minimal residual disease (MRD), AML MRD, and B-cell lymphoma, will need to be validated in future research. Testing the AI methodology on variant MFC panels from different laboratories will be essential for demonstrating generalizability, since panel design and instrument configuration vary substantially between institutions.

The current AI model relies primarily on clustering, classification, and dimensionality reduction algorithms, specifically the MDPC clustering algorithm, random forest classifiers, and boosted random forest for diagnosis. The authors note their intention to explore convolutional neural network architectures in future work to potentially improve the detection of acute leukemia cells, reduce false positives, and further enhance diagnostic accuracy. CNNs could be particularly beneficial for learning spatial patterns in high-dimensional MFC data that tree-based models may not capture as effectively.

The T-ALL subtype had the lowest consistency at 0.778, reflecting the fact that only 9 T-ALL cases were included in the study. This small sample size limits the model's ability to learn the full range of T-ALL immunophenotypic variation. Expanding the training dataset with more atypical phenotypes, particularly cases with aberrant antigen co-expression and dim lineage-specific marker expression, would help the AI move from flagging difficult cases for review to making more definitive diagnoses on its own.

Looking forward, the authors envision integrating the AI-assisted MFC analysis with other diagnostic modalities, including morphological examination, cytogenetic analysis, and molecular testing. Such integration would create a comprehensive decision-support system for acute leukemia diagnosis and prognostic stratification, moving closer to the multimodal diagnostic standard described in the WHO classification framework.

TL;DR: The study was limited to acute leukemia with a small T-ALL subset (only 9 cases). Future work will expand to MRD panels and lymphoma, test generalization across laboratories, explore CNN architectures, and integrate MFC analysis with morphological, cytogenetic, and molecular data for comprehensive AI-assisted diagnosis.
Citation: Zhong P, Hong M, He H, et al. Diagnostics, 2022. DOI: 10.3390/diagnostics12040827. Available at PMC9029950. Open access under a CC BY license.