Deep Learning Diagnosis & Molecular Characterization of AML

Plain-English Explanations

Overview & Background

Pages 1-3

Why AML Diagnosis Needs Automation, and What This Study Set Out to Do

Diagnosing acute myeloid leukemia (AML) currently requires integrating flow cytometry, microscopic assessment, cytogenetics, and targeted sequencing panels. Flow cytometry can measure physical properties and surface marker expression of hundreds of thousands of hematopoietic cells within hours, but the analysis depends heavily on manual intervention for compensation, gating, and interpretation. This reliance on highly trained hematopathologists introduces subjectivity and limits accessibility, particularly in low-resource settings. Meanwhile, concurrent molecular characterization through karyotyping and sequencing often takes multiple days, delaying identification of AML subtypes that require targeted therapies or changes in induction chemotherapy.

Prior machine learning approaches to flow cytometry data have included dimensionality reduction (PCA, tSNE, UMAP), unsupervised clustering (flowSOM, PhenoGraph), and supervised methods such as linear discriminant analysis and support vector machines. More recent deep learning efforts used convolutional neural networks (CNNs) on data histograms for Hodgkin lymphoma detection and on CyTOF data for CMV infection diagnosis. However, CNNs are architecturally suited for image data, not the tabular structure of flow cytometry. Attention-based multi-instance learning models (ABMILMs), which have shown strong results in histopathology, had not yet been applied to flow cytometry despite their ability to identify which individual events carry the most diagnostic weight.

This study from Brigham and Women's Hospital, Northwestern University, and Emory University developed a computational pipeline using ABMILMs for three sequential tasks: detecting acute leukemia, differentiating AML from ALL, and predicting 9 cytogenetic aberrancies and 32 pathogenic variants. The pipeline was tested on 1,820 flow cytometry samples collected from 2019 to 2022. It is the first application of attention-based multi-instance learning to flow cytometry for automated AML diagnosis and molecular characterization.

TL;DR: AML diagnosis relies on slow, subjective manual flow cytometry analysis. This study is the first to apply attention-based multi-instance learning models (ABMILMs) to flow cytometry data, building a pipeline that diagnoses acute leukemia, distinguishes AML from ALL, and predicts 41 cytogenetic/molecular variants, using a dataset of 1,820 samples.

Methodology

Pages 4-7

Dataset, Preprocessing, and Model Architecture

Dataset composition: The study analyzed 1,820 flow cytometry cases from Brigham and Women's Hospital (2019-2022), run on BD FACSCanto II systems using two-tube 10-color leukemia panels. Of these, 732 cases had definitive absence of acute leukemia and 736 had confirmed acute leukemia (568 AML, 168 B- or T-ALL). Positive acute leukemia cases had to meet a 20% blast count threshold, applied uniformly regardless of AML-defining translocations. Among the 568 AML cases, 478 had concurrent cytogenetics from karyotyping and/or FISH, and 476 had molecular data from Rapid Heme Panel testing. Only variants with at least 2% dataset prevalence were modeled.

Preprocessing approach: Raw FCS files were loaded via the FlowKit Python package and normalized to mean 0 and standard deviation 1 per marker. Critically, no manual compensation, gating, or doublet exclusion was performed, making the pipeline fully automated from raw data. A maximum threshold of 150,000 events per tube was imposed due to memory constraints, affecting roughly 8.3-8.4% of samples.
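The preprocessing step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the events for one tube are already loaded into a NumPy array (the FlowKit loading step is omitted), that the 150,000-event cap is applied by random subsampling, and that the mean and standard deviation are computed per sample. The text does not specify any of these details, and the function name is hypothetical.

```python
import numpy as np

MAX_EVENTS = 150_000  # per-tube cap stated in the text


def preprocess_tube(events: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Cap the event count, then z-score each marker column.

    `events` has shape (n_events, n_markers). Random subsampling and
    per-sample statistics are assumptions made for this sketch.
    """
    if events.shape[0] > MAX_EVENTS:
        keep = rng.choice(events.shape[0], size=MAX_EVENTS, replace=False)
        events = events[keep]
    mean = events.mean(axis=0)
    std = events.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant channels
    return (events - mean) / std


# Synthetic stand-in for one tube: 12 channels (scatter plus 10 colors).
rng = np.random.default_rng(0)
raw = rng.normal(loc=500.0, scale=120.0, size=(160_000, 12))
normalized = preprocess_tube(raw, rng)
```

No compensation, gating, or doublet exclusion appears anywhere in the sketch, which mirrors the point of the paragraph: the model sees every recorded event.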

Encoder networks: Two encoder neural networks (one per tube) were built as multilayer perceptrons (MLPs) with periodic activation function embeddings (MLP-PLR), which perform comparably to transformer models on tabular data at significantly lower computational cost. These encoders were pre-trained on all 1,820 samples using Self-Supervised Contrastive Learning using Random Feature Corruption (SCARF), with hyperparameter optimization over 100 iterations on a 1% data subset.
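The core idea of SCARF pre-training is to build a second "view" of each event by overwriting a random subset of its features with values drawn from the same feature in other events, then training the encoder so that the original and corrupted views map close together. The sketch below shows only the corruption step, in NumPy. The corruption rate of 0.5 and the function name are illustrative assumptions; the study's tuned hyperparameters and its contrastive loss are not reproduced here.

```python
import numpy as np


def scarf_corrupt(batch: np.ndarray, corruption_rate: float,
                  rng: np.random.Generator) -> np.ndarray:
    """Return a corrupted copy of `batch` (shape: n_events x n_markers).

    For every row, int(corruption_rate * n_markers) randomly chosen features
    are replaced by the value of the same feature in a randomly chosen row,
    i.e. a draw from that feature's empirical marginal distribution.
    """
    n, m = batch.shape
    q = int(corruption_rate * m)

    # Pick q distinct columns per row by ranking random numbers.
    cols = rng.random((n, m)).argsort(axis=1)[:, :q]
    mask = np.zeros((n, m), dtype=bool)
    np.put_along_axis(mask, cols, True, axis=1)

    # replacement[i, j] = batch[donor_rows[i, j], j]
    donor_rows = rng.integers(0, n, size=(n, m))
    replacement = batch[donor_rows, np.arange(m)]

    return np.where(mask, replacement, batch)


rng = np.random.default_rng(0)
events = rng.normal(size=(1000, 10))
corrupted = scarf_corrupt(events, corruption_rate=0.5, rng=rng)
```

Because the replacement values come from real events, each corrupted feature still looks plausible on its own, which is what distinguishes this scheme from adding noise or zeroing features out.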

Predictive models: For each classification task, separate ABMILMs with gated attention mechanisms (following the Ilse et al. architecture) processed each tube's encoded features. The outputs from both tubes were concatenated and fed into a final MLP for sample-level prediction. Hyperparameter optimization ran for 50 iterations per model, using negative log-likelihood loss with class weights set to offset class imbalance. Performance was assessed via 5-fold cross-validation: AUROC was reported for each model, along with accuracy, sensitivity, and specificity at the threshold that maximizes Youden's J.
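Gated attention pooling, as defined by Ilse et al., assigns each event k a weight a_k proportional to exp(w · (tanh(V h_k) * sigmoid(U h_k))) and summarizes the tube as the weighted sum of its event embeddings. The NumPy sketch below shows that forward pass for two tubes and the concatenation that would feed the final MLP. The dimensions and random weights are placeholders chosen for illustration, not values from the study, and training is not shown.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_attention_pool(H, V, U, w):
    """Gated attention MIL pooling (forward pass only).

    H: (n_events, d) encoded events for one tube
    V, U: (d, L) projection matrices; w: (L,) scoring vector
    Returns the bag embedding (d,) and the attention weights (n_events,).
    """
    gate = np.tanh(H @ V) * sigmoid(H @ U)  # (n_events, L)
    scores = gate @ w                       # (n_events,)
    scores = scores - scores.max()          # numerical stability for softmax
    attn = np.exp(scores)
    attn = attn / attn.sum()
    return attn @ H, attn


rng = np.random.default_rng(0)
d, L = 16, 8  # placeholder sizes
bag_embeddings = []
for n_events in (500, 300):  # one bag of events per tube
    H = rng.normal(size=(n_events, d))
    V, U, w = rng.normal(size=(d, L)), rng.normal(size=(d, L)), rng.normal(size=L)
    z, attn = gated_attention_pool(H, V, U, w)
    bag_embeddings.append(z)

sample_features = np.concatenate(bag_embeddings)  # input to the final MLP
```

The per-event weights in `attn` are what make the approach interpretable: they can be read back onto individual cells to see which events the model relied on.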

TL;DR: 1,820 flow cytometry cases (568 AML, 168 ALL, 732 non-leukemic) from BD FACSCanto II two-tube panels. No manual gating or compensation. MLP-PLR encoders pre-trained with SCARF, then ABMILMs with gated attention for classification. 5-fold cross-validation across all models. Fully automated from raw FCS files.

Acute Leukemia Detection

Pages 8-10

Detecting Acute Leukemia with AUROC 0.961

The first model in the pipeline distinguishes non-leukemic cases from acute leukemia. It achieved an AUROC of 0.961 (plus or minus 0.011), with an overall accuracy of 0.903 (plus or minus 0.013). At the optimal Youden's J threshold, sensitivity was 0.881 (plus or minus 0.020) and specificity was 0.925 (plus or minus 0.027). The authors note that adjusting the probability threshold along the ROC curve can substantially increase sensitivity at the cost of specificity, enabling a triaging or screening role where missing leukemia cases carries higher clinical risk than false positives.
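Youden's J is defined as sensitivity + specificity - 1, and the "optimal" threshold is the cutoff that maximizes it. A minimal sketch of that selection follows, using a small made-up set of scores rather than study data. The function name is illustrative, and ties are resolved in favor of the lowest threshold, which is a choice made here rather than one stated in the source.

```python
import numpy as np


def youden_threshold(y_true, y_score):
    """Return (threshold, sensitivity, specificity) maximizing Youden's J.

    A case is called positive when its score is >= the threshold.
    Ties in J keep the lowest threshold (the more sensitive operating point).
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    best_j, best = -np.inf, None
    for t in np.unique(y_score):  # candidate thresholds in ascending order
        pred = y_score >= t
        sens = np.mean(pred[y_true])     # true positive rate
        spec = np.mean(~pred[~y_true])   # true negative rate
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best = j, (float(t), float(sens), float(spec))
    return best


# Toy example: 4 non-leukemic (0) and 4 leukemic (1) cases.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
threshold, sensitivity, specificity = youden_threshold(labels, scores)
```

In this toy data, thresholds of 0.4 and 0.7 tie on J: the first gives full sensitivity with one false positive, the second full specificity with one missed case. A screening deployment would prefer the lower cutoff, which is the trade-off the authors describe.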

Attention pattern analysis: Examination of event-level attention values revealed that the model learned to focus most heavily on blast cells for its predictions, mirroring what a hematopathologist does during manual review. Monocytic cells received mixed attention, while mature lymphocytes and granulocytes drew less focus. This pattern suggests that the ABMILM learned clinically meaningful features without any manual gating to define blast populations.

Predictive power scores (PPS): PPS analysis quantified which markers drove model predictions using a non-parametric score that captures non-linear relationships between attention values and marker expression. Forward scatter (FSC) and side scatter (SSC) carried small but non-zero importance in the largest proportion of cases. Among surface markers, CD15 (the Lewis x antigen, a myeloid cell marker) had both the highest percentage of non-zero PPS values and the highest magnitude PPS values. CD19, important for B-ALL and AML with t(8;21), also showed strong predictive performance.

Interestingly, markers traditionally used for manual blast identification, including CD45, CD34, and HLA-DR, ranked among the least important surface markers for the model. CD3 was rarely used but showed the second-highest mean PPS when active, likely reflecting its role in identifying the small subset of T-ALL cases. These findings suggest that automated models may rely on different marker hierarchies than human experts.
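A predictive power score asks how much better a simple model predicts one variable from another than a naive baseline does, so it can register non-linear relationships that a correlation coefficient misses. The sketch below is a simplified stand-in written in plain NumPy, not the implementation used in the study: it predicts the target with per-bin medians over quantile bins of the feature and scores the result as 1 - MAE(model) / MAE(naive median), clipped at 0. The binning model, the single train/test split, and the function name are all assumptions made for illustration.

```python
import numpy as np


def pps_sketch(x, y, n_bins=10, seed=0):
    """Rough predictive power score of feature x for target y, in [0, 1].

    Stand-in model: the training median of y within quantile bins of x.
    Score: 1 - MAE(model) / MAE(naive median predictor), clipped at 0.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.random.default_rng(seed).permutation(len(x))
    half = len(x) // 2
    train, test = order[:half], order[half:]

    # Interior quantile edges define n_bins bins with indices 0..n_bins-1.
    edges = np.quantile(x[train], np.linspace(0, 1, n_bins + 1)[1:-1])
    train_bins = np.searchsorted(edges, x[train])
    test_bins = np.searchsorted(edges, x[test])

    global_median = np.median(y[train])
    bin_medians = np.full(n_bins, global_median)
    for b in range(n_bins):
        in_bin = train_bins == b
        if in_bin.any():
            bin_medians[b] = np.median(y[train][in_bin])

    mae_model = np.mean(np.abs(y[test] - bin_medians[test_bins]))
    mae_naive = np.mean(np.abs(y[test] - global_median))
    if mae_naive == 0:
        return 0.0
    return float(max(0.0, 1.0 - mae_model / mae_naive))


# Synthetic example: "attention" depends on a marker non-linearly.
rng = np.random.default_rng(1)
marker = rng.uniform(-1.0, 1.0, size=2000)
score_dependent = pps_sketch(marker, marker ** 2)
score_unrelated = pps_sketch(marker, rng.normal(size=2000))
```

In the synthetic example the linear correlation between the marker and its square is near zero, yet the score for the dependent target is high while the score for the unrelated target stays near zero. That is the property that makes this kind of score useful for relating attention values to marker expression.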

TL;DR: Acute leukemia detection achieved AUROC 0.961, accuracy 0.903, sensitivity 0.881, specificity 0.925. The model focused on blast cells and relied most on CD15 and FSC/SSC, while traditional markers like CD45 and CD34 were less important. No manual gating was required.

AML vs. ALL Classification

Pages 10-12

Distinguishing AML from ALL with AUROC 0.965

The second model classified confirmed acute leukemia cases as AML versus ALL (including both B-ALL and T-ALL). Performance matched or exceeded the leukemia detection model, with an AUROC of 0.965 (plus or minus 0.015) and accuracy of 0.922 (plus or minus 0.025). This strong performance is notable given that AML and ALL can present with overlapping immunophenotypic features, particularly in cases with aberrant marker expression.

Attention beyond blasts: Unlike the leukemia detection model, which concentrated attention on blast cells, the AML-versus-ALL model distributed attention more broadly. Mature granulocytes with high SSC values and subsets of mature lymphocytes received comparable or greater attention than blasts in many cases. This suggests that the immunophenotypic profile of non-blast cell populations may carry underexplored diagnostic value for classifying acute leukemia subtypes.

Confidence thresholds and accuracy: The authors analyzed how the confidence of the model's output related to prediction accuracy. Restricting predictions to those at or above a 95% confidence threshold, which covered 560 of 736 cases (76.1%), raised cumulative accuracy from 91.4% to 97.9%. At an extreme 99.9% threshold, covering 254 of 736 cases (34.5%), only one misclassification occurred, yielding 99.6% accuracy. This tiered approach enables high-confidence automated classification for the majority of cases while flagging uncertain ones for manual review.
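The tiering logic can be sketched as a coverage-versus-accuracy calculation. The example below assumes that confidence means max(p, 1 - p) for a binary AML-versus-ALL probability p and that predictions are made at a 0.5 cutoff. The source does not define either, so both are assumptions, and the data are made up.

```python
import numpy as np


def accuracy_at_confidence(p_aml, is_aml, min_confidence):
    """Coverage and accuracy among predictions at or above a confidence level.

    Confidence is taken as max(p, 1 - p); the predicted class is AML when
    p >= 0.5. Returns (fraction of cases kept, accuracy among kept cases).
    """
    p_aml = np.asarray(p_aml, dtype=float)
    is_aml = np.asarray(is_aml, dtype=bool)
    confidence = np.maximum(p_aml, 1.0 - p_aml)
    kept = confidence >= min_confidence
    if not kept.any():
        return 0.0, float("nan")
    correct = (p_aml[kept] >= 0.5) == is_aml[kept]
    return float(kept.mean()), float(correct.mean())


# Toy example: six cases with predicted P(AML) and true labels (1 = AML).
probs = [0.99, 0.97, 0.60, 0.02, 0.45, 0.96]
truth = [1, 1, 0, 0, 1, 0]
coverage_all, accuracy_all = accuracy_at_confidence(probs, truth, 0.5)
coverage_95, accuracy_95 = accuracy_at_confidence(probs, truth, 0.95)
```

The reported figures are internally consistent with this kind of calculation: 560/736 is about 76.1%, 254/736 is about 34.5%, and one error among 254 cases gives 253/254, or about 99.6%.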

Key markers: PPS analysis showed forward scatter carried small but consistent importance, reflecting subtle size differences between myeloblasts and lymphoblasts. CD123 (IL-3 receptor, expressed in most AML and B-ALL but not T-ALL) and CD10 (commonly expressed in ALL but not AML) showed relatively high importance. CD15 and T-cell markers (CD3, CD5, CD7) demonstrated high importance in smaller subsets of AML and T-ALL cases, respectively. Side scatter showed surprisingly low relative importance despite known granularity differences between blast types.

TL;DR: AML vs. ALL classification reached AUROC 0.965, accuracy 0.922. At 95% confidence (76.1% of cases), accuracy jumped to 97.9%. At 99.9% confidence (34.5% of cases), accuracy hit 99.6% with just one error. Key markers included CD123, CD10, and FSC. Non-blast cell populations contributed meaningfully to predictions.

Molecular Characterization

Pages 12-14

Predicting Cytogenetic Aberrancies and Pathogenic Variants from Flow Data Alone

Individual models were trained to predict the presence or absence of 9 cytogenetic aberrancies and 32 pathogenic variants among the 568 AML cases (478 with cytogenetic data and 476 with molecular data), using only flow cytometry data. This is the most ambitious component of the pipeline, as it attempts to infer genomic information from immunophenotypic profiles alone, potentially saving days of wait time for cytogenetic and sequencing results.

Top-performing models: The strongest results came from variants with known immunophenotypic signatures. The t(15;17)(PML::RARA) model, corresponding to acute promyelocytic leukemia (APL), achieved AUROC 0.929 (plus or minus 0.032), accuracy 0.885, sensitivity 0.925, and specificity 0.882. APL has a distinctive profile: typically CD34-negative (in the hypergranular variant), HLA-DR-negative, and CD117-positive. The t(8;21)(RUNX1::RUNX1T1) model reached AUROC 0.814 (plus or minus 0.050), consistent with this translocation's known association with CD19 and CD56 expression. NPM1 variant prediction achieved AUROC 0.807 (plus or minus 0.020), reflecting NPM1's links to CD19 and CD4 expression and monocytic markers.

Overall landscape: Of 41 variant models, 9 (22.0%) achieved AUROC above 0.7, and 32 (78.0%) exceeded AUROC 0.6. While not all variants were predicted with high accuracy, the fact that any molecular information can be extracted from flow cytometry alone is clinically significant. Variants with weaker immunophenotypic correlations naturally produced lower-performing models, but even moderate predictive accuracy could help prioritize cases for expedited molecular testing.

Marker importance across models: PPS analysis across all 41 models revealed three tiers of marker importance. Forward scatter (FSC-A/H) and CD33 showed strong predictive utility across nearly all models, with CD33 expression implicated in NPM1- and FLT3-mutated AML and apparently associated with many other variants as well. B-cell markers (CD19, CD20) and T-cell markers (CD3, CD5, CD7) had minimal predictive value for most models. A third tier contained markers important for specific variants only: monocytic markers CD14 and CD64 were highly important for inv(16)/t(16;16)(CBFB::MYH11), which often shows monocytic differentiation, and CD117 had elevated importance for IDH2 mutations, consistent with prior reports linking IDH2 R172 mutations to high CD117 expression.

TL;DR: 41 models predicted cytogenetic/molecular variants from flow data alone. Top performers: t(15;17) AUROC 0.929, t(8;21) AUROC 0.814, NPM1 AUROC 0.807. Overall, 22% of models exceeded AUROC 0.7 and 78% exceeded 0.6. CD33 and FSC were important across nearly all models. CD14/CD64 were specific to inv(16) prediction, and CD117 to IDH2.

Case Study

Pages 14-16

A Challenging APL Case Where the Model Found What Humans Almost Missed

The authors presented a detailed case study of acute promyelocytic leukemia (APL) with confirmed t(15;17)(PML::RARA) that had significant diagnostic uncertainty during manual interpretation. The model correctly identified the presence of leukemia (predicted probability 98.6%), classified it as AML rather than ALL (100.0% confidence), and correctly predicted the t(15;17) translocation (75.4% probability). It also flagged a possible FLT3 pathogenic variant with lower confidence (50.4%).

Why this case was difficult: The immature cell population showed an immunophenotype of SSC(increased)/CD45(dim)/CD34(variable)/HLA-DR-negative/CD117+/CD13(variable)/CD33+/CD14-/CD64+. While most of this profile is consistent with hypergranular APL, the variable CD34 positivity is unusual, occurring in only about 16% of hypergranular APL cases. This atypical feature likely drove the diagnostic uncertainty during manual review.

Novel marker associations: PPS analysis for this case revealed that the model relied heavily on markers not conventionally used in APL assessment, specifically CD123, CD10, and CD38. The promyelocyte population demonstrated high expression of these markers, and model attention values correlated strongly with their expression levels. Conversely, the markers traditionally used for APL diagnosis (CD45, CD34, HLA-DR) carried little or no importance in the model's prediction. In effect, the model surfaced a possible, previously uncharacterized association between t(15;17) and increased expression of CD123, CD10, and CD38.

This case illustrates two key advantages of the automated approach: its ability to reach correct diagnoses in ambiguous presentations, and its capacity to uncover novel biological associations between flow cytometric marker expression and specific genetic aberrancies that may not be apparent during routine manual analysis.

TL;DR: In a diagnostically challenging APL case with atypical CD34 positivity (seen in only ~16% of hypergranular APL), the model correctly predicted t(15;17) at 75.4% probability. It relied on non-traditional markers (CD123, CD10, CD38) rather than conventional ones (CD45, CD34, HLA-DR), revealing a novel marker-translocation association.

Clinical Applications

Pages 16-18

Use Cases Across the Clinical Workflow

For clinicians: The pipeline provides accurate, rapid predictions within minutes after flow cytometry data collection, including insight into molecular subtypes. This is particularly important for AML patients with genetic abnormalities requiring targeted therapies, such as all-trans retinoic acid (ATRA) for APL with t(15;17), where treatment delays can be life-threatening. Model output visualizations can be tailored to give clinicians succinct immunophenotypic summaries with therapeutic implications.

For hematopathologists: The system serves as both a triaging tool and a diagnostic assistant. Cases with very low predicted probabilities of acute leukemia can be deprioritized, freeing up expert time for complex cases. For diagnostically uncertain cases, the pipeline provides unbiased predictions alongside attention-weighted data visualizations that highlight which specific cell populations and markers drove the model's conclusion. The case study of atypical APL demonstrates this utility clearly.

For laboratories and low-resource settings: Because the pipeline operates directly on raw FCS files without manual compensation, gating, or doublet exclusion, it offers a software- and personnel-agnostic approach to flow cytometry-based diagnosis. Laboratories without subspecialty-trained hematopathologists could use this system as a first-pass diagnostic tool. The fully automated nature also eliminates inter-observer variability, a known source of diagnostic inconsistency in flow cytometric analysis.

For researchers: The attention and PPS analyses generate novel hypotheses about associations between cell surface marker expression and AML molecular subtypes. The finding that CD33 associates broadly with many cytogenetic and mutational variants has implications for CD33-targeted therapies such as gemtuzumab ozogamicin. Similarly, the discovery of CD123, CD10, and CD38 associations with t(15;17) provides new avenues for biological investigation and potentially improved diagnostic criteria.

TL;DR: Four use cases: (1) clinicians get rapid molecular subtype predictions within minutes, (2) hematopathologists gain a triaging and decision-support tool with interpretable visualizations, (3) laboratories can run fully automated diagnostics from raw FCS files without manual gating, and (4) researchers discover novel marker-variant associations like CD33's broad relevance and CD123/CD10/CD38 links to t(15;17).

Limitations & Future Directions

Pages 18-20

Single-Center Design, Panel Rigidity, and the Road to Clinical Deployment

Single-center limitation: All 1,820 samples came from Brigham and Women's Hospital, using BD FACSCanto II instruments with a fixed two-tube 10-color panel. The models have not been validated on data from other institutions, instruments, or panel configurations. Flow cytometry panels vary significantly across laboratories in terms of markers, antibodies, fluorochromes, and the number of colors per tube. External validation on multi-center, multi-instrument datasets is essential before clinical deployment.

Blast threshold constraint: The current pipeline requires a 20% blast count for positive acute leukemia cases, regardless of AML-defining translocations. This means it cannot currently detect AML cases defined by specific translocations (such as t(15;17) or t(8;21)) that present with less than 20% blasts. Additionally, it does not distinguish new from recurrent cases, and cannot assess minimal or measurable residual disease (MRD) in post-treatment samples with minute blast populations.

Molecular prediction ceiling: While some variant models performed strongly (t(15;17) at AUROC 0.929, NPM1 at 0.807), many of the 41 models achieved only moderate accuracy. Variants without strong immunophenotypic correlations inherently limit what flow cytometry-based prediction can achieve. The approach is best understood as a complement to, not a replacement for, cytogenetic and sequencing analysis. It could be used to prioritize expedited molecular testing for patients most likely to have actionable variants.

Future technical directions: The authors identify several paths forward. Incorporating a patient's previously characterized immunophenotype could improve performance on recurrent cases and MRD assessment. Cross-institutional training with marker-mapping techniques could enable the models to handle different panel configurations. Adding CNNs to the architecture would allow integration of cell morphology from blood and bone marrow smears alongside flow cytometry, potentially improving diagnostic accuracy. The overall trajectory points toward multi-modal models combining flow cytometry, morphology, and potentially clinical data for comprehensive automated hematopathology diagnostics.

TL;DR: Key limitations: single-center data (Brigham and Women's Hospital only), fixed panel configuration, 20% blast threshold excluding low-blast AML, and no MRD capability. Future work includes multi-center validation, cross-panel generalization, morphology integration via CNNs, and recurrent case handling. The approach complements rather than replaces molecular testing.

Original paper: Automated Deep Learning-Based Diagnosis and Molecular Characterization of Acute Myeloid Leukemia