Recent advances in deep learning for lymphoma segmentation: Clinical applications and future directions

Medical Image Analysis 2024

Plain-English Explanations
Pages 1-3
Why Lymphoma Segmentation Is a Critical Clinical Challenge

Lymphoma is one of the most common malignancies of the hematological system globally. It originates in the lymphatic system and encompasses a complex classification with each subtype demonstrating distinct biological behaviors and clinical manifestations. Because the lymphatic system is distributed throughout the body, lymphomas can appear in lymph nodes, spleen, liver, bone marrow, and other organs. This widespread distribution, combined with significant heterogeneity in radiological presentations (differences in size, shape, density, and signal intensity even within the same lymphoma type), makes manual evaluation of lymphoma images extremely challenging and prone to subjective variability across clinicians.

Why segmentation matters: Medical image segmentation aims to delineate specific anatomical or pathological structures from the background in medical images. For lymphoma, segmentation enables robust and reproducible extraction of clinically meaningful quantitative features. Among these, total metabolic tumor volume (TMTV) has emerged as a key biomarker for staging, risk stratification, treatment planning, and response monitoring. Accurate segmentation also supports defining radiation fields and surgical margins, tracking disease progression, assessing therapeutic outcomes, and informing prognosis.
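To make concrete why segmentation feeds directly into TMTV, here is a minimal sketch (the `tmtv_ml` helper is hypothetical, not from the paper): once a lesion mask exists, TMTV is just the count of segmented voxels times the voxel volume.

```python
import numpy as np

def tmtv_ml(mask: np.ndarray, voxel_dims_mm: tuple) -> float:
    """Total metabolic tumor volume in millilitres from a binary
    lesion mask and the scanner's voxel spacing in mm."""
    voxel_vol_mm3 = float(np.prod(voxel_dims_mm))
    return mask.astype(bool).sum() * voxel_vol_mm3 / 1000.0  # mm^3 -> mL

# A toy 3D mask: 500 tumor voxels at 4 x 4 x 4 mm spacing
mask = np.zeros((32, 32, 32), dtype=np.uint8)
mask[10:15, 10:20, 10:20] = 1  # 5 * 10 * 10 = 500 voxels
print(tmtv_ml(mask, (4.0, 4.0, 4.0)))  # 500 * 64 mm^3 = 32.0 mL
```

This is why segmentation errors propagate directly into the biomarker: every missed or spurious voxel shifts TMTV by one voxel volume.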

Traditional vs. deep learning approaches: Earlier methods relied on thresholding based on pixel intensity, edge detection, region-growing algorithms, and clustering techniques. While interpretable and computationally modest, these methods depend heavily on handcrafted features and expert intervention, leading to inconsistent results across institutions. Deep learning has become the dominant paradigm because it learns complex visual patterns directly from raw imaging data, supports fully automated pipelines, and achieves superior performance under standardized evaluation settings. Frameworks like nnU-Net have achieved state-of-the-art results in 23 different biomedical segmentation challenges without any manual feature engineering.
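For contrast with the learned approaches, a toy version of the classic intensity-threshold baseline (the 41%-of-SUVmax fraction is one common convention in PET, not something this review prescribes):

```python
import numpy as np

def fixed_fraction_threshold(suv: np.ndarray, fraction: float = 0.41) -> np.ndarray:
    """Classic intensity-threshold segmentation: keep voxels whose SUV
    exceeds a fixed fraction of the image's SUVmax."""
    return suv >= fraction * suv.max()

# Toy 2D SUV slice; SUVmax = 9.5, so the cutoff is 0.41 * 9.5 = 3.895
suv = np.array([[0.5, 1.0, 8.0],
                [0.2, 4.0, 9.5],
                [0.1, 0.3, 7.0]])
mask = fixed_fraction_threshold(suv)
print(mask.sum())  # 4 voxels exceed the cutoff
```

The brittleness is visible even here: the cutoff moves with SUVmax, so a single hot voxel changes the whole segmentation, which is exactly the kind of handcrafted-rule fragility deep learning avoids.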

Review scope: The authors searched Google Scholar and Web of Science using keywords "lymphoma segmentation" and "deep learning," screening 45 studies published from 2018 to 2024. They excluded non-English articles, studies outside deep learning-based segmentation, and duplicates. The review covers dataset characteristics, backbone network architectures, network structure adjustments, model performance, and clinical applicability.

TL;DR: This review covers 45 deep learning studies (2018-2024) on lymphoma segmentation from PET/CT, MRI, and CT imaging. Lymphoma's heterogeneity and multi-organ distribution make manual segmentation unreliable. Deep learning automates this process and enables accurate extraction of TMTV and other biomarkers critical for staging and treatment planning.
Pages 4-8
Dataset Characteristics and Imaging Modalities Across 45 Studies

The construction of high-quality datasets is the critical first step in deep learning-based lymphoma segmentation. The authors identified several key requirements: dataset diversity and balance (covering different types and stages of lymphoma), precise pixel-level or voxel-level annotations (not just bounding boxes or region-of-interest labels), sufficient dataset scale to prevent overfitting, and multimodal image datasets that support tasks like lymphoma identification, prognosis prediction, and treatment monitoring.

Imaging modalities: Among the 45 studies, PET/CT was the dominant modality with 38 studies, combining functional and structural imaging to provide information on lesion metabolism and anatomical location. PET/CT is the primary modality for lymphoma diagnosis, staging, treatment evaluation, and prognosis assessment. Six studies used different MRI sequences, which offer higher soft tissue contrast and multiplanar imaging capabilities, particularly useful for evaluating central nervous system lymphomas where PET/CT has limitations. Two studies used contrast-enhanced CT, chosen for its low radiation and rapid imaging characteristics, making it suitable for assessing lymphoma in children.

Lymphoma subtypes covered: The majority of studies focused on non-Hodgkin lymphomas, with a smaller portion addressing Hodgkin lymphomas. The primary subtypes included diffuse large B-cell lymphoma (DLBCL), primary mediastinal large B-cell lymphoma (PMBCL), mantle cell lymphoma, and follicular lymphoma. Most PET/CT datasets covered the entire body excluding the head and neck, though two studies focused on the chest and mediastinum, one on nasopharynx, and one on abdominal organs.

Public datasets: Due to ethical and privacy considerations, most datasets were private. The only publicly accessible datasets were HECKTOR (head and neck tumor segmentation with PET/CT images from multiple institutions), BraTS (brain tumor segmentation benchmark from multimodal MRI), and AutoPET (whole-body FDG-PET/CT with expert-annotated lesion segmentations). The lack of high-quality annotated public datasets remains a significant constraint on research progress in lymphoma segmentation.

TL;DR: PET/CT dominated with 38 of 45 studies, followed by MRI (6 studies) and contrast-enhanced CT (2 studies). Most datasets were private. Only three public benchmarks exist: HECKTOR, BraTS, and AutoPET. DLBCL, PMBCL, mantle cell lymphoma, and follicular lymphoma were the most studied subtypes.
Pages 8-13
Backbone Networks: U-Net, ResNet, nnU-Net, V-Net, and Beyond

U-Net: Originally proposed by Ronneberger et al. in 2015, U-Net is the most widely used backbone across the 45 studies. Its characteristic "U" shape features a contracting encoder path and an expansive decoder path connected by skip connections that preserve fine-grained spatial information. Initially a 2D architecture, researchers extended it to 3D U-Net by replacing 2D convolutions with 3D counterparts, significantly improving volumetric segmentation for lymphoma where lesions exhibit irregular shapes, heterogeneous uptake, and complex spatial distributions. Its open and modular design makes it highly adaptable, allowing researchers to incorporate attention mechanisms, residual connections, or multiscale feature fusion modules.
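The encoder-downsample / decoder-upsample / skip-concatenate bookkeeping that defines the "U" shape can be sketched in a few lines of NumPy (shapes only, no learned convolutions; purely illustrative):

```python
import numpy as np

def max_pool2x(x):
    """2x2 max pooling over (C, H, W) feature maps."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbour 2x upsampling."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Encoder: keep the high-resolution features for the skip connection
enc = np.random.rand(16, 64, 64)      # 16 channels at full resolution
bottleneck = max_pool2x(enc)          # (16, 32, 32)

# Decoder: upsample, then concatenate the skip features channel-wise
dec = upsample2x(bottleneck)          # (16, 64, 64)
fused = np.concatenate([dec, enc], axis=0)  # (32, 64, 64)
print(fused.shape)
```

The skip connection is the concatenation step: it hands the decoder the fine-grained spatial detail that pooling destroyed, which is why U-Net recovers sharp lesion boundaries. The 3D variant replaces these 2D operations with their volumetric counterparts.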

ResNet: Proposed by He et al. in 2015, ResNet (Residual Network) addresses vanishing and exploding gradient problems through residual blocks with identity shortcut connections. This allows construction of very deep networks (ResNet-50, ResNet-101) with more complex feature representations. While computationally heavier than lightweight alternatives and less efficient at capturing global context compared to transformer-based models, ResNet remains foundational in lymphoma segmentation, especially when combined with attention mechanisms and multiscale processing.

nnU-Net: Proposed by Isensee et al. in 2018, nnU-Net is an adaptive framework that automates the entire segmentation pipeline, including preprocessing, network structure selection, hyperparameter tuning, training, and postprocessing. It adapts 2D and 3D U-Net architectures to different data modalities without manual intervention. Its ability to handle multimodal and multiviewpoint data makes it well-suited for lymphoma segmentation.

V-Net: Proposed by Milletari et al. in 2016, V-Net is natively designed for 3D images, introducing residual structures to accelerate convergence and a Dice similarity coefficient-based objective function.

DenseNet: Introduced by Huang et al. in 2016, DenseNet connects each layer to every other layer in a feed-forward fashion, promoting feature reuse and parameter efficiency.

DeepMedic: An open-source backbone using 3D CNNs with a patch-based training strategy, particularly effective for neuroimaging tasks and central nervous system lymphoma segmentation.

DeepLabv3: A semantic segmentation model featuring atrous (dilated) convolutions and pyramid pooling for multiscale feature fusion.

All backbone architectures in the 45 studies were based on CNN architectures and their variants. The choice of backbone significantly impacts feature extraction, spatial resolution preservation, handling of multiscale and multimodal data, computational efficiency, and generalization.

TL;DR: U-Net and its variants were the most common backbones across 45 studies. Other key architectures included ResNet, nnU-Net (self-configuring framework), V-Net (native 3D with Dice-based loss), DenseNet (feature reuse via dense connections), DeepMedic (patch-based 3D CNNs), and DeepLabv3 (atrous convolutions). All were CNN-based.
Pages 13-16
Strategies for Improving Segmentation Accuracy: Edges, Class Imbalance, and Feature Fusion

Edge detection: Lymphoma boundaries are inherently uncertain in medical imaging. Several studies proposed specialized modules to improve boundary delineation. Zhu et al. developed the cross-shaped structure extraction (CCE) method based on axial context, along with a boundary gradient change-based loss function using the Sobel operator to supervise boundary segmentation. Their CGBO-Net achieved a Dice similarity coefficient (DSC) of 90.7%, precision of 89.4%, and Jaccard of 83.1%. Luo et al. integrated a multi-atlas boundary awareness (MABA) module using gradient maps, uncertainty maps, and level set maps to capture potential tumor boundaries. Jurdi et al. optimized boundary segmentation through a Boundary Irregularity Index (BI) that minimizes the difference between smoothed and true segmentation maps.
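A minimal sketch of the idea behind a Sobel-supervised boundary loss (an illustrative reconstruction, not Zhu et al.'s exact formulation): compare the gradient magnitudes of the predicted and ground-truth masks, so the loss responds specifically to boundary disagreement.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
SOBEL_Y = SOBEL_X.T

def conv2d_valid(img, k):
    """Naive 'valid' 2D convolution with a 3x3 kernel."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = (img[i:i+3, j:j+3] * k).sum()
    return out

def sobel_boundary_loss(pred, target):
    """L1 distance between Sobel gradient magnitudes of the predicted
    and ground-truth masks, penalizing boundary disagreement."""
    def grad_mag(m):
        gx, gy = conv2d_valid(m, SOBEL_X), conv2d_valid(m, SOBEL_Y)
        return np.sqrt(gx ** 2 + gy ** 2)
    return np.abs(grad_mag(pred) - grad_mag(target)).mean()

target = np.zeros((8, 8)); target[2:6, 2:6] = 1.0
perfect = target.copy()
shifted = np.zeros((8, 8)); shifted[3:7, 3:7] = 1.0
print(sobel_boundary_loss(perfect, target))      # 0.0 for identical boundaries
print(sobel_boundary_loss(shifted, target) > 0)  # True: a shifted boundary is penalized
```

Because the Sobel response is nonzero only near edges, this term supervises exactly the uncertain boundary region that plain overlap losses under-weight.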

Class imbalance: In lymphoma images, tumor pixels represent a small minority compared to background pixels. Liu et al. proposed Class Balanced Dice Loss (CBDL), which considers the effective number of samples to prevent loss and gradient computations from being dominated by majority negative samples. Wang et al. introduced the Prior-Shift Regularization (PSR) module, which performs Online Informative Voxel Mining (OIVM) based on Expected Prediction Confidence to extract informative voxels for regularization, achieving a DSC of 90.94% and sensitivity of 87.18%.
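One way to weight a soft Dice loss by the "effective number of samples" (a sketch following the generic effective-number formulation; Liu et al.'s exact CBDL may differ in detail):

```python
import numpy as np

def class_balanced_dice_loss(pred, target, beta=0.999, eps=1e-7):
    """Soft Dice loss with per-class weights proportional to the inverse
    effective number of samples, (1 - beta) / (1 - beta**n), so the
    scarce tumor class is not drowned out by abundant background."""
    losses, weights = [], []
    for cls in (0, 1):  # background, tumor
        p = pred if cls == 1 else 1.0 - pred
        t = (target == cls).astype(float)
        n = t.sum()
        w = (1.0 - beta) / (1.0 - beta ** n + eps)  # inverse effective number
        dice = (2 * (p * t).sum() + eps) / (p.sum() + t.sum() + eps)
        losses.append(1.0 - dice)
        weights.append(w)
    weights = np.array(weights) / sum(weights)
    return float((weights * np.array(losses)).sum())

# 16 tumor voxels against 240 background voxels
target = np.zeros((16, 16)); target[6:10, 6:10] = 1
good = np.where(target == 1, 0.9, 0.05)   # confident, mostly correct
bad = np.full_like(good, 0.05)            # misses the tumor entirely
print(class_balanced_dice_loss(good, target) < class_balanced_dice_loss(bad, target))  # True
```

The key property: because the tumor class has few voxels, its effective-number weight is large, so a prediction that ignores the tumor is punished even though it matches almost all background voxels.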

Spatial and modal feature fusion: Since PET/CT combines metabolic and anatomical information across multiple perspectives, several groups developed fusion strategies. Hu et al. proposed multiview and 3D fusion, training three separate 2D ResU-Net networks to capture information from different directions and fusing results with a Conv3D strategy. Yuan et al. designed dual encoder branches for PET and CT respectively, with a hybrid learning component generating spatial fusion maps to quantify each modality's contribution, achieving a DSC of 73.03%. Diao et al. proposed spatial compression and multimodal feature fusion attention (CSAE-Net + PFAS-Net), achieving a DSC of 79.81% with 99.90% specificity.
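The per-voxel gating idea behind a spatial fusion map can be sketched as follows (illustrative only; Yuan et al.'s hybrid learning component is more elaborate): a learned map in [0, 1] decides, voxel by voxel, how much to trust each modality.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_fusion(f_pet, f_ct, w_logits):
    """Blend PET and CT feature maps with a per-voxel gate:
    w near 1 trusts the metabolic (PET) branch, w near 0 the
    anatomical (CT) branch."""
    w = sigmoid(w_logits)
    return w * f_pet + (1.0 - w) * f_ct

f_pet = np.full((4, 4), 2.0)
f_ct = np.full((4, 4), -2.0)
w_logits = np.zeros((4, 4))  # w = 0.5 everywhere -> equal trust
print(spatial_fusion(f_pet, f_ct, w_logits)[0, 0])  # 0.5*2 + 0.5*(-2) = 0.0
```

In a trained network `w_logits` would itself be produced by convolutional layers, so each voxel's blend reflects how informative each modality is at that location.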

Temporal correlation handling: PET/CT captures metabolic information at different time points. Wang et al. introduced the Recurrent Dense Siamese Decoder (RDS-Decoder), simulating recurrent neural network behavior to capture temporal dependencies between feature maps, achieving a DSC of 85.58% and sensitivity of 94.63%. Pang et al. introduced metabolic variance features (MVF) and metabolic heterogeneity features (MHF) to quantify metabolic differences between tissues at different time points, achieving DSC of 86.67%.

TL;DR: Key accuracy strategies include edge detection (CGBO-Net DSC 90.7%), class imbalance correction (PSR-Net DSC 90.94%), spatial/modal feature fusion (dual-encoder approaches, DSC 73-80%), and temporal correlation handling (RDS-Decoder DSC 85.58%). Boundary-aware loss functions and multi-atlas modules improved tumor boundary delineation.
Pages 16-18
Tackling Uncertainty, Reducing Label Dependence, and Enriching Datasets

Solving uncertainty with Dempster-Shafer theory: Uncertainty in lymphoma segmentation arises from variability in imaging quality, blurred boundaries between lymphoma and normal tissues, challenges in multimodal data fusion, and subjective annotations. Huang et al., Diao et al., and Huang et al. all utilized Dempster-Shafer (DS) evidence theory to construct evidence fusion layers. DS theory provides a mathematical framework for handling uncertain information without requiring the additivity condition of traditional probabilities. Using Dempster's combination rule, evidence from PET and CT images is effectively fused, and when conflicting segmentation results arise from different modalities, DS theory quantifies and resolves these conflicts. Huang et al. achieved DSC of 84.6%, while another Huang et al. study reached DSC 86.90% with Hausdorff distance of 2.71, sensitivity 94.62%, and specificity 99.86%.
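Dempster's combination rule itself is compact. A sketch over the two-class frame {T (tumor), B (background)}, with "TB" standing for ignorance (the mass values below are hypothetical, not from the cited studies):

```python
def dempster_combine(m1, m2):
    """Dempster's rule for mass functions over {T, B}, with 'TB' for
    ignorance. Conflicting mass (T from one source, B from the other)
    is discarded and the remainder renormalized."""
    combined = {"T": 0.0, "B": 0.0, "TB": 0.0}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = set(a) & set(b)
            if not inter:
                conflict += ma * mb  # incompatible evidence
            else:
                key = "TB" if inter == {"T", "B"} else inter.pop()
                combined[key] += ma * mb
    k = 1.0 - conflict  # normalization constant
    return {s: v / k for s, v in combined.items()}

# PET strongly suggests tumor; CT is less certain
m_pet = {"T": 0.8, "B": 0.1, "TB": 0.1}
m_ct = {"T": 0.5, "B": 0.2, "TB": 0.3}
fused = dempster_combine(m_pet, m_ct)
print(round(fused["T"], 3))  # 0.873: agreement reinforces the tumor hypothesis
```

Note how the "TB" mass lets a source express "I don't know" without committing probability to either class, which is exactly the flexibility ordinary additive probabilities lack.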

Weakly supervised learning: Obtaining accurate pixel-level annotations is extremely costly for lymphoma, which involves over 100 distinct pathological subtypes. Huang et al. proposed a weakly supervised approach where only a portion of lymphoma volume was manually annotated, introducing multiscale feature consistency constraints and cosine similarity-based feature-level distances between tumor and normal tissue. Their combined loss framework (supervised loss + deep supervision loss + regularization loss) achieved DSC of 75.2%.

Semi-supervised learning: Yousefirizi et al. developed a semi-supervised framework incorporating a loss function derived from Fuzzy C-Means (FCM) clustering, capturing inherent fuzziness in medical image classification and modeling soft tumor boundaries. This approach used soft cluster memberships of unlabeled voxels alongside labeled samples, achieving DSC of 69.0%.
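Soft cluster memberships of the kind FCM provides can be computed as follows (a generic FCM membership update on 1-D intensities, not Yousefirizi et al.'s full loss):

```python
import numpy as np

def fcm_memberships(values, centers, m=2.0, eps=1e-9):
    """Fuzzy C-Means memberships for 1-D intensities: u[i, k] is how
    strongly voxel i belongs to cluster k, summing to 1 across clusters,
    instead of a hard tumor/background label."""
    d = np.abs(values[:, None] - centers[None, :]) + eps   # voxel-to-center distances
    inv = d ** (-2.0 / (m - 1.0))                          # standard FCM update
    return inv / inv.sum(axis=1, keepdims=True)

voxels = np.array([0.1, 0.2, 0.9, 5.0, 6.0])   # SUV-like intensities
centers = np.array([0.3, 5.5])                 # background vs tumor clusters
u = fcm_memberships(voxels, centers)
print(u.shape, bool(u[3, 1] > 0.9))
```

Voxels near a cluster center get membership near 1, while intermediate voxels keep graded memberships, which is how soft tumor boundaries are modeled for unlabeled data.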

Dataset enrichment with GANs: Medical imaging datasets face ethical and acquisition challenges that limit scale, while deep learning models demand large-scale, diverse datasets for clinical generalization. Generative adversarial networks (GANs) offer a solution: a generator produces realistic synthetic images from random noise while a discriminator distinguishes between generated and real data. Conte et al. used U-Net combined with a GAN to enrich limited MRI lymphoma data, achieving DSC of 82.0% and MSE of 0.006. Shi et al. similarly employed GAN-augmented PET/CT data, reaching DSC of 86.09% and precision of 81.08%. Both studies verified the feasibility and utility of GAN-generated synthetic lymphoma images.

TL;DR: DS evidence theory resolved multimodal uncertainty (DSC up to 86.90%, specificity 99.86%). Weakly supervised methods achieved DSC 75.2% with partial annotations. Semi-supervised FCM-based learning reached DSC 69.0%. GANs successfully enriched limited datasets, with GAN-augmented models achieving DSC 82.0-86.09%.
Pages 18-19
Multitask and Cascade Learning for TMTV Computation and Prognosis

Multitask learning (MTL): MTL allows a single network to handle multiple related tasks simultaneously, such as lesion segmentation and prognosis prediction, by sharing underlying feature representations while maintaining separate objective functions. Liu et al. proposed a U-Net-based MTL framework where shared features supported both lesion segmentation and prediction of 2-year event-free survival (2y-EFS), with a weighted sum of segmentation and classification losses facilitating joint optimization. Their model achieved DSC of 86.8%. MTL faces challenges including task conflicts and negative transfer, imbalanced task difficulty (pixel-level segmentation vs. patient-level survival data), and limited interpretability. Solutions being explored include dynamic loss weighting, curriculum learning, task-specific modules, and attention visualization for explainability.
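The joint objective is simply a weighted sum of the per-task losses; a one-line sketch (the weights here are hypothetical, not Liu et al.'s values):

```python
def multitask_loss(seg_loss, cls_loss, w_seg=1.0, w_cls=0.5):
    """Weighted sum of the segmentation and classification objectives;
    the weights balance the pixel-level and patient-level tasks."""
    return w_seg * seg_loss + w_cls * cls_loss

print(multitask_loss(0.20, 0.60))  # 1.0*0.20 + 0.5*0.60 = 0.5
```

Dynamic loss weighting, one of the mitigations mentioned above, amounts to making `w_seg` and `w_cls` functions of training progress or task uncertainty rather than fixed constants.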

Cascade learning: Unlike MTL, cascade learning sequentially addresses a series of tasks where the output of each task serves as input for the next, forming a processing pipeline. This is a primary strategy for achieving downstream derivative tasks like TMTV and metabolic tumor volume (MTV) computation. Jemaa et al. cascaded U-Net and V-Net to sequentially achieve whole-body-region-organ-lesion segmentation and TMTV computation, reaching DSC of 88.6% and sensitivity of 93.0%. Yousefirizi et al. cascaded two 3D U-Nets using a soft voting module under a semi-supervised setting, achieving DSC of 77.0% and Hausdorff distance of 0.16.

Clinical relevance: TMTV quantifies tumor metabolic activity based on FDG uptake and its accurate computation relies on precise segmentation. ROC curve analysis determines optimal TMTV thresholds for patient risk stratification. Baseline PET/CT scans assess TMTV to predict prognosis, while interim scans monitor treatment response. TMTV serves as an independent prognostic factor in multivariate Cox regression models. Additional imaging-derived indicators, such as tumor dissemination patterns, metabolic heterogeneity, and total lesion glycolysis, further contribute to patient stratification.
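Reading the "optimal" TMTV cutoff off an ROC curve is commonly done by maximizing Youden's J; a sketch with made-up TMTV values and outcomes (not data from the reviewed studies):

```python
import numpy as np

def optimal_tmtv_cutoff(tmtv, event):
    """Pick the TMTV cutoff maximizing Youden's J = sensitivity +
    specificity - 1, i.e. the threshold closest to the ROC curve's
    upper-left corner, for high-risk vs low-risk stratification."""
    best_j, best_t = -1.0, None
    for t in np.unique(tmtv):
        pred_high_risk = tmtv >= t
        tp = np.sum(pred_high_risk & (event == 1))
        fn = np.sum(~pred_high_risk & (event == 1))
        tn = np.sum(~pred_high_risk & (event == 0))
        fp = np.sum(pred_high_risk & (event == 0))
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t

tmtv = np.array([50, 80, 120, 300, 450, 600], float)  # mL, per patient
event = np.array([0, 0, 0, 1, 1, 1])                  # e.g. progression within 2 years
print(optimal_tmtv_cutoff(tmtv, event))  # 300.0 separates the two groups perfectly
```

Patients above the learned cutoff go into the high-risk stratum; in practice the cutoff is then carried into a multivariate Cox model as a candidate independent prognostic factor.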

TL;DR: Multitask learning jointly optimizes segmentation and prognosis (Liu et al. DSC 86.8% with 2y-EFS prediction). Cascade learning pipelines (Jemaa et al. DSC 88.6%, sensitivity 93.0%) enable sequential segmentation-to-TMTV computation. TMTV is a critical independent prognostic factor derived from accurate PET/CT segmentation.
Pages 19-21
Evaluation Metrics and Model Performance Across Studies

Primary metrics: The Dice similarity coefficient (DSC) was the most widely reported metric, measuring spatial overlap between predicted and ground truth segmentation (0 = no overlap, 1 = perfect agreement). Across all 45 studies, DSC values ranged from 39.90% (Nam et al., CT-based cervical lymphadenopathy detection) to 93.49% (Zhao et al., MRI-based nnU-Net). The Jaccard Index, measuring set intersection over union, was reported in several studies with values ranging from 53.05% to 83.1%. Hausdorff distance (HD) measured maximum boundary deviation, with reported values ranging from 0.16 to 8.

Top-performing models: The highest DSC scores included Zhao et al. at 93.49% (nnU-Net on MRI), Wang et al. at 90.94% (PSR-Net on PET/CT), Zhu et al. at 90.7% (CGBO-Net on PET/CT), Jemaa et al. at 88.6% (cascaded U-Net/V-Net on PET/CT), Chen et al. at 87.75% (HAFS-Net on PET+CT with sensitivity 88.94% and specificity 99.93%), and Constantino et al. at 87.0% (3D U-Net on PET/CT). Sensitivity values ranged from 71% to 96.16%, with Wang et al.'s Memory-HDRDS-UNet achieving the highest at 96.16%.

Why accuracy alone is misleading: The authors emphasized that in medical image segmentation, particularly PET, CT, and MRI, there is often severe class imbalance between background and lesion voxels. Even models with suboptimal segmentation performance can yield deceptively high accuracy scores due to the predominance of background voxels. Therefore, overlap-based metrics like DSC and Jaccard, along with distance-based metrics like Hausdorff distance, are more informative and widely accepted for evaluating segmentation quality. Other reported metrics included AUC (Ahamed et al. at 0.92, Thiery et al. at 0.72), mean squared error, mean absolute error, and mean absolute deviation.
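The accuracy pitfall is easy to demonstrate with toy numbers: a model that predicts background everywhere scores near-perfect accuracy yet zero Dice.

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

# 100x100 slice with a tiny 5x5 lesion: 25 of 10,000 voxels
target = np.zeros((100, 100), bool)
target[40:45, 40:45] = True

# A degenerate model that predicts 'background everywhere'
pred = np.zeros_like(target)

accuracy = (pred == target).mean()
print(round(accuracy, 4))   # 0.9975 -- looks excellent
print(dice(pred, target))   # 0.0    -- reveals the model found nothing
```

Because Dice ignores true-negative background voxels entirely, it cannot be inflated by class imbalance, which is why it (with Jaccard and Hausdorff distance) dominates the reviewed literature.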

TL;DR: DSC ranged from 39.90% to 93.49% across 45 studies. Top performers: Zhao et al. 93.49% (nnU-Net/MRI), Wang et al. 90.94% (PSR-Net/PET-CT), Zhu et al. 90.7% (CGBO-Net). Sensitivity reached up to 96.16%. Standard accuracy is misleading due to class imbalance. DSC, Jaccard, and Hausdorff distance are the preferred evaluation metrics.
Pages 21-23
Technical Limitations and the Path Toward Clinical Deployment

Limited generalizability: Models trained on data from a single institution or scanner often underperform on external datasets due to domain shifts from variations in imaging protocols, scanner manufacturers, and reconstruction parameters. The presence of over 100 histological lymphoma subtypes, each with significant heterogeneity in radiological and metabolic features, further constrains generalization. Current public and private datasets fail to represent the full spectrum of lymphoma subtypes. Strategies being explored include domain adaptation, multicenter collaborative training, and federated learning to enhance cross-site robustness.

Computational demands: Processing high-resolution 3D volumetric PET/CT data imposes substantial computational and memory requirements, limiting practical deployment in real-time and resource-constrained clinical environments. While model compression, quantization, and network pruning offer potential solutions, achieving efficiency without compromising accuracy remains a key bottleneck. The 3D U-Net variants, while more powerful than their 2D counterparts, are particularly demanding in GPU memory usage.

Interpretability gap: Most deep learning models operate as black boxes, providing little insight into their decision-making processes. Clinicians require transparent, explainable outputs to validate and trust automated suggestions. Recent efforts focus on integrating attention mechanisms, saliency maps, and feature attribution methods, but the clinical relevance and robustness of these interpretability techniques remain under investigation. This is especially critical in lymphoma, where diagnostic complexity involves over 100 pathological subtypes with varying imaging presentations.

Reproducibility and dataset scarcity: Variations in preprocessing steps, model architectures, training settings, and evaluation metrics make it difficult to reproduce and fairly compare results across studies. The adoption of standardized frameworks like nnU-Net and public benchmarks such as AutoPET and HECKTOR are crucial for advancing reproducible research. Nearly all 45 studies relied on private datasets, severely limiting external validation. Future priorities include enhancing clinical generalizability, integrating segmentation models into clinical workflows, reducing computational demands, and expanding high-quality annotated datasets to facilitate broad application of deep learning in lymphoma diagnosis and treatment monitoring.

TL;DR: Key limitations include poor cross-site generalizability (domain shift from scanner and protocol differences), high computational demands for 3D PET/CT data, black-box interpretability, and near-total reliance on private datasets. Future directions center on federated learning, model compression, explainable AI, standardized benchmarks, and multicenter validation across 100+ lymphoma subtypes.
Citation: Liang W, Yang F, Teng P, Zhang T, Shen W. Open access, 2025. Available at: PMC12304644. DOI: 10.1177/20552076251362508. License: CC BY-NC.