Artificial intelligence-based assessment of PD-L1 expression in diffuse large B-cell lymphoma

Pathology - Research and Practice, 2024

Plain-English Explanations
Page 1
Why PD-L1 Quantification in DLBCL Matters for Immunotherapy

Diffuse large B-cell lymphoma (DLBCL) is the most common aggressive form of non-Hodgkin lymphoma, characterized by rapid progression and high incidence. A growing body of evidence suggests that PD-L1 checkpoint inhibitors show promising performance in treating lymphoma, making the accurate assessment of PD-L1 expression a critical step in identifying patients who may benefit from targeted immunotherapy. However, unlike solid tumors such as lung or breast cancer, DLBCL lacks clearly defined tumor boundaries on immunohistochemistry (IHC) slides, making it extremely difficult to distinguish malignant tumor cells from normal cells in PD-L1-stained whole slide images (WSIs).

The core challenge: In DLBCL, tumor B-cells and non-malignant immune cells can both express PD-L1 within the tumor microenvironment. This means that cells appearing in PD-L1-positive regions are not necessarily tumor cells. The tumor proportion score (TPS), which calculates the percentage of PD-L1-stained tumor cells among all viable tumor cells, is the standard quantitative indicator. But manual counting by pathologists is time-consuming and prone to inter-reader variability.

What this study proposes: The researchers developed an AI-based image analysis framework specifically designed for assessing PD-L1 expression in DLBCL patients using IHC slides. The framework was built on large-scale annotations covering 5,101 tissue regions and 146,439 cells across primary and external validation cohorts. They also introduced a novel digital quantification rule tailored to the unique challenge of identifying tumor cells in lymphoma, where the tumor region segmentation approaches traditionally used in solid cancers do not apply.

TL;DR: DLBCL is the most common aggressive non-Hodgkin lymphoma. PD-L1 expression guides immunotherapy decisions, but manual scoring is slow and inconsistent. This study built an AI framework covering 5,101 tissue regions and 146,439 cells with a new quantification rule designed specifically for lymphoma.
Pages 1-2
Patient Cohorts, Annotations, and Specimen Types

The study collected specimens from 220 DLBCL patients diagnosed or treated at Shanghai Ruijin Hospital between June 2019 and June 2020. All specimens were diagnosed as DLBCL (either germinal center B-cell subtype or activated B-cell subtype) by three hematopathologists following the 2016 WHO Classification. All 220 WSIs underwent PD-L1 immunohistochemical staining (clone 22C3, DAKO) and were digitized on the SQS-600P scanner at 40x magnification. The dataset included specimens from 30 different anatomical sites, comprising 88 surgical specimens and 132 fine needle biopsies.

Annotation process: Three senior clinical pathologists (with 15, 5, and 3 years of experience) scored the TPS on all 220 WSIs under double-blind conditions. A labeling expert with 7 years of pathology experience then annotated non-regions of interest (Non-ROIs), which included areas with extrusion, burn artifacts, carbon foam, inflammation, fat, blood cells, interstitial cells, necrotic cells, and debris. This produced 4,101 tissue region annotations for the primary cohort. The team applied an inversion operation on the annotated Non-ROIs to delineate effective ROIs for algorithm input. Additionally, two individuals with medical training labeled cell center points using LabelMe and SenseCare software, ultimately yielding 498 patches (256 x 256 pixels at 40x magnification) for cell detection model training.

External validation cohort: A separate set of 61 PD-L1-stained WSIs from unique patients was collected from the North Branch of Shanghai Ruijin Hospital. This cohort included 1,000 annotated tissue regions as Non-ROIs and 475 patches for cell center point annotations, with patient statistics and annotation settings consistent with the primary cohort.

TL;DR: Primary cohort of 220 DLBCL patients (88 surgical, 132 fine needle biopsies) with 4,101 tissue regions annotated. External validation cohort of 61 patients with 1,000 tissue regions. Three pathologists scored TPS in double-blind conditions across both cohorts.
Pages 3-5
The Four-Stage AI Pipeline: From ROI Segmentation to PD-L1 Scoring

The proposed framework comprises four major components. Stage 1, ROI segmentation: The system treats ROI identification as a binary classification problem at the patch level. Each WSI is partitioned into patches, and the algorithm classifies each as inside or outside the ROI. The team fine-tuned several pre-trained models including DenseNet121, ResNet18, and Vision Transformer (ViT), ultimately selecting the ViT (tiny version) based on superior performance. The ViT model was trained with a batch size of 256, learning rate of 3e-4, for 30 epochs using five-fold cross-validation.
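The patch-level setup of Stage 1 can be illustrated with a minimal sketch. The function names and the mask-reassembly step below are my own, not from the paper; the trained ViT classifier would supply the per-patch binary predictions:

```python
import numpy as np

def tile_wsi(wsi: np.ndarray, patch: int = 256):
    """Partition a WSI array (H, W, 3) into non-overlapping patches,
    returning each patch together with its top-left grid coordinate."""
    h, w = wsi.shape[:2]
    coords, patches = [], []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            coords.append((y, x))
            patches.append(wsi[y:y + patch, x:x + patch])
    return coords, patches

def roi_mask_from_predictions(coords, preds, shape, patch: int = 256):
    """Reassemble per-patch inside/outside-ROI predictions (0/1)
    into a patch-resolution boolean ROI mask over the whole slide."""
    mask = np.zeros(shape[:2], dtype=bool)
    for (y, x), p in zip(coords, preds):
        mask[y:y + patch, x:x + patch] = bool(p)
    return mask
```

Downstream stages then restrict cell detection to pixels where the ROI mask is true, which is how Non-ROI artifacts (necrosis, debris, fat) are kept out of the cell counts.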

Stage 2, cell detection: The team used the AuxCNN model, built on a concatenated fully convolutional regression network (C-FCRN), as the cell detection backbone. AuxCNN uses auxiliary convolutional neural networks to assist intermediate layer training for automatic cell counting. The model was retrained using manually annotated PD-L1 cell center points as ground truth, with a batch size of 256, learning rate of 3e-4, for 90 epochs.
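Because a density-regression detector like AuxCNN outputs a density map rather than explicit coordinates, a post-processing step must recover the cell centers. A minimal sketch of one common approach, local-maximum peak picking, is shown below; the thresholds, window size, and function name are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from scipy import ndimage

def centers_from_density(density: np.ndarray, min_peak: float = 0.1,
                         nms_size: int = 5):
    """Extract cell-center coordinates from a predicted density map:
    keep pixels that equal the maximum of their nms_size x nms_size
    neighborhood (non-maximum suppression) and exceed min_peak."""
    local_max = ndimage.maximum_filter(density, size=nms_size) == density
    peaks = local_max & (density > min_peak)
    return list(zip(*np.nonzero(peaks)))
```

Each recovered center is then handed to Stage 3 as a click point for segmentation.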

Stage 3, cell segmentation: Following cell detection, the detected center point locations were fed into NuClick, an interactive point-to-mask instance segmentation model. The team also benchmarked against SAM (Segment Anything Model) for comparative analysis. Although neither NuClick nor SAM was explicitly fine-tuned for PD-L1 data, NuClick was chosen because the H&E dataset (MoNuSeg) it was trained on shares the hematoxylin staining channel with PD-L1, making it medically instructive for nuclei segmentation. After segmentation, a dilation operation expanded the masks to encompass the cell membrane for more accurate positive cell identification.

Stage 4, positive/negative classification: Cells with brownish membranes after DAB staining are classified as PD-L1-positive. Rather than relying on fixed-size regions around cell centers, the system calculated the ratio of brown area within the dilated membrane boundary to the total area within that boundary. A threshold parameter (t) determined whether a cell was classified as positive or negative.
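A minimal sketch of the Stage 4 decision for a single cell is shown below. It assumes a crude RGB rule for "brown" (R high, G and B low) as an illustrative stand-in for proper DAB color deconvolution; the thresholds, dilation radius, and function name are my own assumptions:

```python
import numpy as np
from scipy import ndimage

def is_pdl1_positive(rgb: np.ndarray, nucleus_mask: np.ndarray,
                     t: float = 0.1, dilate_px: int = 3):
    """Classify one cell as PD-L1 positive/negative: dilate the nucleus
    mask to cover the membrane, then compare the fraction of brown
    (DAB-stained) pixels inside the dilated boundary against t."""
    membrane = ndimage.binary_dilation(nucleus_mask, iterations=dilate_px)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    brown = (r > 120) & (g < 100) & (b < 100)   # crude DAB-brown rule
    ratio = (brown & membrane).sum() / membrane.sum()
    return ratio >= t
```

The key design point from the paper is preserved: the denominator is the dilated membrane region of each individual cell, not a fixed-size box around the cell center.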

TL;DR: Four-stage pipeline using ViT for ROI segmentation, AuxCNN (C-FCRN) for cell detection, NuClick for point-to-mask cell segmentation, and brown-area ratio thresholding for PD-L1 positive/negative classification. All models trained with five-fold cross-validation on A100 GPUs.
Pages 5-6
A New Digital TPS Rule Designed Specifically for DLBCL

A central contribution of this study is the proposal of a new PD-L1 digital quantification rule tailored for DLBCL. The traditional tumor proportion score (TPS) simply divides the number of PD-L1-stained tumor cells by the total number of viable tumor cells. However, in DLBCL, reliably identifying which cells are tumor cells is extremely difficult because lymphoma does not form distinct solid masses, and tumor B-cells, macrophages, and other immune cells intermingle.

The new rule is based on three observations: (1) Not all cells within the ROIs are tumor cells, as macrophages and other cell types are also present. (2) Certain cell categories are inherently difficult to determine from PD-L1 slides alone. (3) DLBCL tumor cells typically exhibit medium-to-large nuclei, equal to or larger than those of normal macrophages, or more than twice the size of normal lymphocytes. The algorithm exploits this morphological characteristic by sorting all detected cells by nuclear area and applying a filtering strategy.

Three key parameters govern the rule: The parameter m excludes the fraction m of cells with the largest areas (likely non-tumor cells such as macrophages or vascular endothelial cells). The parameter k selects the top-k remaining cells with the largest areas as the presumed tumor cell population. The parameter t sets the threshold on the proportion of brown area in a single cell that determines positive vs. negative classification. After optimization, the final parameters were m = 0.06 (remove the largest 6% of cells), k = 3,000 (select the top 3,000 cells per WSI), and t = 0.1 (10% brown-area threshold).

The modified TPS* formula calculates the number of PD-L1-stained tumor cells among the top-k selected cells, divided by k. This approach sidesteps the need to explicitly classify every cell as tumor or non-tumor, instead leveraging cell size distributions to approximate the tumor cell population. Experiments showed that both the m and t parameters had meaningful effects on scoring accuracy, with high m values over-filtering tumor cells and low m values failing to exclude non-tumor contaminants.
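Putting the three parameters together, the TPS* rule can be sketched as follows; the function name and array layout are illustrative, and the paper's exact implementation may differ in details:

```python
import numpy as np

def tps_star(areas, positives, m=0.06, k=3000):
    """Modified TPS* for DLBCL, following the paper's three-step rule:
    sort cells by nuclear area, drop the largest fraction m (likely
    macrophages or endothelial cells), take the next k largest cells as
    the presumed tumor population, and divide the positives among them
    by k. `positives` is the Stage-4 output: one boolean per cell,
    already thresholded at brown-area ratio t."""
    areas = np.asarray(areas)
    positives = np.asarray(positives)
    order = np.argsort(areas)[::-1]        # cell indices, largest area first
    order = order[int(m * len(order)):]    # exclude the top m fraction
    top_k = order[:k]                      # presumed tumor cells
    return positives[top_k].sum() / k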

TL;DR: New TPS* formula filters cells by nuclear area: removes top 6% largest cells (likely macrophages), selects top 3,000 cells per WSI as presumed tumor population, and applies a 10% brown-area threshold for positive classification. This bypasses the impossible task of explicitly classifying every DLBCL cell type.
Pages 2-4
Strong AI-Pathologist Concordance in the Primary Cohort

The correlation analysis between AI algorithmic outcomes and pathologist scores demonstrated consistently high agreement. The algorithm provided stable quantitative findings closely aligned with the mean and median scores of the three pathologists. When TPS was categorized into three clinically relevant stages using cutoff levels of 5% and 50% (based on treatment guidelines), the AI system maintained reliable stratification performance.

Overall concordance: The inter-pathologist concordance (agreement among the three human raters) yielded an intraclass correlation coefficient (ICC) of 0.94 (95% CI: 0.92, 0.95). The correlation between the automated AI scores and manual pathologist scores was even higher, with an ICC of 0.96 (95% CI: 0.94, 0.97). This means the AI was more consistent with the pathologists' average than the pathologists were with each other.
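This summary does not state which ICC variant the paper used, so as a minimal sketch the code below computes ICC(2,1) (two-way random effects, absolute agreement, single rater), a common choice for quantifying agreement on continuous scores like TPS; the function name is illustrative:

```python
import numpy as np

def icc2_1(scores):
    """ICC(2,1): two-way random-effects, absolute-agreement, single-rater
    intraclass correlation. `scores` is an (n_subjects, k_raters) matrix,
    e.g. one TPS value per slide per pathologist."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    mean_s = scores.mean(axis=1)                      # per-subject means
    mean_r = scores.mean(axis=0)                      # per-rater means
    ss_total = ((scores - grand) ** 2).sum()
    ss_rows = k * ((mean_s - grand) ** 2).sum()       # between subjects
    ss_cols = n * ((mean_r - grand) ** 2).sum()       # between raters
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

An ICC of 1.0 means perfect absolute agreement; values near 0.9 or above, as reported here, indicate excellent reliability.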

Surgical specimens vs. fine needle biopsies: For surgical specimens, the inter-pathologist concordance was 0.91 (95% CI: 0.87, 0.94), while the ICC between the mean pathologist scores and the algorithm was 0.95 (95% CI: 0.93, 0.97). For fine needle biopsies, the inter-pathologist concordance rose to 0.96 (95% CI: 0.94, 0.97), and the ICC between mean pathologist scores and the algorithm was 0.96 (95% CI: 0.95, 0.97). Fine needle biopsies showed higher agreement across the board, likely because they contain less noisy background tissue compared to surgical specimens.

The graded stratification at the 5% and 50% cutoffs is clinically significant because these thresholds can determine whether patients receive immunotherapy. The AI system maintained stable and consistent performance across different cutoff levels, as confirmed in supplementary analyses. Six slides from the primary cohort were excluded from outcome assessment due to insufficient cell counts for TPS calculation, leaving 214 slides in the final analysis.

TL;DR: AI-pathologist ICC reached 0.96 (95% CI: 0.94, 0.97), exceeding inter-pathologist ICC of 0.94. Fine needle biopsies showed higher concordance (ICC 0.96) than surgical specimens (ICC 0.91) among pathologists. The AI consistently aligned with mean/median pathologist scores across clinically relevant TPS cutoffs.
Pages 4-5
Validation on an Independent Cohort with Different Image Quality

The external validation cohort of 61 patients from the North Branch of Shanghai Ruijin Hospital introduced notable variations in image quality derived from different scanners and DAB staining conditions. Despite these differences, the AI algorithm maintained strong concordance with pathologist ratings. While the algorithm's results were slightly lower than those of pathological experts for surgical specimens, they remained relatively stable in the more challenging setting of fine needle biopsies.

Overall validation results: The inter-pathologist concordance in the validation cohort was 0.97 (95% CI: 0.95, 0.98). The ICC between the mean pathologist scores and the algorithm was 0.96 (95% CI: 0.93, 0.98), and between median pathologist scores and the algorithm was 0.96 (95% CI: 0.93, 0.97). These numbers demonstrate strong generalizability of the AI framework.

Surgical specimens in validation: The inter-pathologist concordance was 0.96 (95% CI: 0.92, 0.98). The ICCs between the mean and median pathologist scores and the algorithm were both approximately 0.94 (95% CI: 0.87, 0.97 and 0.88, 0.97, respectively). Fine needle biopsies in validation: The inter-pathologist concordance reached 0.98 (95% CI: 0.96, 0.99). The ICC between mean pathologist scores and the algorithm was 0.98 (95% CI: 0.95, 0.99), and between median scores and the algorithm was approximately 0.97 (95% CI: 0.95, 0.99). These results confirm the finding from the primary cohort that fine needle biopsies yield higher AI-pathologist agreement than surgical specimens.

The researchers attribute the discrepancy between specimen types to the fact that fine needle biopsies contain less extraneous tissue (Non-ROIs), while surgical specimens carry more noisy information such as inflammation, necrosis, and non-tumor cell populations that can interfere with PD-L1 quantification. Importantly, compared to the pathologists, whose TPS ratings tended to fluctuate, the algorithm produced relatively stable results across both specimen types.

TL;DR: External validation (61 patients) confirmed strong generalizability. Fine needle biopsy ICC between AI and pathologists reached 0.98 (95% CI: 0.95, 0.99). Surgical specimen ICC was 0.94. The AI was more stable than pathologists across different scanner and staining conditions.
Pages 6-8
How This Framework Differs from Prior PD-L1 Quantification Work

The authors highlight four primary ways their approach differs from existing PD-L1 quantification methods. First, unlike joint analysis approaches that require rigorous alignment of H&E and PD-L1 or other IHC slides, this framework operates solely on PD-L1-stained slides. This simplifies the workflow by eliminating the high demand on paired whole slide preparation and registration.

Clinical workflow alignment: Second, the analysis follows the routine diagnostic workflow for quantifying protein expression in PD-L1 images, offering a clinically relevant and interpretable tool for pathologists. This contrasts with approaches that rely on simple color thresholding to calculate the percentage of positive area in the WSI, which does not reflect how pathologists actually evaluate slides. Third, the immunohistochemical quantitative rule is specifically tailored for DLBCL, as opposed to rules designed for solid tumors like lung, breast, or head and neck cancers.

Whole-slide analysis: Fourth, the system performs automatic quantification across entire WSIs rather than being limited to manually selected specific regions. This whole-slide strategy mimics the clinical evaluation process and provides a more comprehensive and unbiased characterization of overall PD-L1 expression. The framework also provides detailed explainability through thumbnail-level visualizations showing the area distribution of each detected cell, the exact number of tumor cells, and the count of positive tumor cells selected for TPS calculation.

The integrated pipeline of cell detection, segmentation, and quantification produces highly correlated results compared to subjective pathologist assessments. The two-stage cell segmentation model streamlines cell sorting by nuclear area and qualitative discrimination in DLBCL by leveraging the morphological hypothesis that DLBCL tumor cells generally exhibit larger nuclear sizes than non-malignant cells.

TL;DR: Four key advantages over prior work: no need for H&E slide pairing, alignment with routine clinical workflow, DLBCL-specific quantification rules, and whole-slide (not region-selected) analysis. The system also provides visual explainability at the thumbnail level.
Pages 7-9
What Remains to Be Done Before Clinical Deployment

Prospective validation needed: Although the results are promising, the study was conducted retrospectively on data from a single hospital system (Shanghai Ruijin Hospital and its north branch). Prospective clinical trials are necessary to validate the AI framework's utility in real-world immunotherapy decision-making with more diverse and geographically varied datasets. The current cohort, while substantial at 281 total patients, may not capture the full spectrum of DLBCL presentations across different populations and healthcare settings.

Limited public datasets: The authors note an ongoing lack of large-scale, publicly available IHC image cell datasets, particularly for PD-L1 membrane staining. The datasets used in this study are not publicly available due to data usage agreement restrictions, which limits independent reproducibility. The team plans to advance their work by constructing high-quality, annotation-ready cell cohorts from immunohistochemistry images. They also suggest that synthetic image samples could bolster the robustness of AI models by augmenting limited training data.

Single biomarker scope: This study focuses exclusively on PD-L1 expression quantification in DLBCL. The authors acknowledge that developing new IHC quantification rules and jointly training models with other immunohistochemical biomarkers, including CD3, CD20, CD5, BCL2, BCL6, Ki67, and MUM1, would provide more comprehensive insights for lymphoma diagnosis and treatment planning. Multi-biomarker integration could enable more nuanced patient stratification beyond what PD-L1 alone can offer.

Foundation model potential: The team explored the Segment Anything Model (SAM) for cell segmentation but ultimately chose NuClick due to its better alignment with PD-L1 staining characteristics. However, they note that foundation model evaluation could be a promising direction for improving prediction performance. As these large-scale pre-trained models continue to evolve, fine-tuning them specifically for lymphoma IHC analysis may yield further gains in accuracy and generalizability.

TL;DR: Key limitations include retrospective single-center design, no public dataset availability, and restriction to PD-L1 only. Future work should include prospective trials, multi-biomarker integration (CD3, CD20, Ki67, BCL2, and others), synthetic data augmentation, and fine-tuning of foundation models for IHC analysis.
Citation: Yan F, Da Q, Yi H, et al. Open Access, 2024. Available at: PMC10973523. DOI: 10.1038/s41698-024-00577-y. License: CC BY.