A Deep Learning Framework for Predicting Endometrial Cancer from Cytopathologic Images with Different Staining Styles

Plain-English Explanations
Pages 1-3
Why Cytopathology-Based Screening for Endometrial Cancer Needs Deep Learning

Endometrial cancer is one of the most common tumors of the female reproductive system, predominantly affecting postmenopausal women and responsible for approximately 76,000 deaths each year worldwide. Its incidence and mortality continue to rise, making early screening essential for improving long-term patient outcomes and survival rates. However, established screening methods remain limited. Endometrial biopsy and hysteroscopy are invasive, expensive, and require cooperation with anaesthetists. Transvaginal ultrasound, while less invasive, lacks specificity. Cytopathology, which involves collecting and staining endometrial cells for microscopic analysis, offers a minimally invasive and cost-effective alternative that is already widely used in countries such as Japan.

The staining problem: A major challenge for automated screening is that cytopathology slides come from different medical centers and use different staining protocols. Some samples are stained with hematoxylin and eosin (H&E), while others use Papanicolaou staining. Slides can also vary due to preservation conditions and scanner differences. These visual inconsistencies reduce the generalization ability of deep learning models, causing them to perform poorly when encountering staining styles not seen during training.

The gap in existing research: While deep learning has been extensively applied to radiology images and histopathology images of endometrial cancer, there is very little work on cytopathology-based screening. Most prior studies focus on segmenting tumors on MRI, classifying histopathology slides at the whole-slide level, or predicting lymph node metastasis from histopathologic images. This study by Wang et al. from Xi'an Jiaotong University addresses that gap by developing the first deep learning framework specifically designed to screen endometrial cancer from cytopathologic images across different staining styles.

Core contributions: The authors make four key contributions. First, they built the XJTU-EC dataset, the first endometrial cytopathology dataset containing both segmentation and classification labels. Second, they proposed CM-UNet for robust cell clump segmentation across staining styles. Third, they developed ECRNet, a contrastive learning-based classifier that reduces false negatives. Fourth, the complete two-stage framework achieved 98.50% accuracy, 99.32% precision, and 97.67% sensitivity on the test set.

TL;DR: This study presents the first deep learning framework for endometrial cancer screening from cytopathology images, tackling the challenge of variable staining styles (H&E vs. Papanicolaou). The two-stage pipeline achieved 98.50% accuracy and 99.32% precision on the XJTU-EC dataset of 139 patients.
Pages 4-6
Building the XJTU-EC Dataset: Three Years of Collection and Annotation

The authors collected endometrial cells from 139 women at the First Affiliated Hospital of Xi'an Jiaotong University between December 2019 and December 2020, using a custom-designed sampling device called the Li Brush. Patients who underwent curettage or hysterectomy were enrolled, with exclusions for suspected pregnancy, acute reproductive system inflammation, prior hysterectomy for cervical or ovarian cancer, clotting disorders, and elevated body temperature. The cohort included 81 inpatients and 58 outpatients, with 32 patients under 40 years old and 107 aged 40 or older. Of these, 77 were premenopausal and 35 were postmenopausal.

Histopathological breakdown: The pathological diagnoses included 62 endometrial carcinomas (47 endometrioid carcinoma G1/G2, 11 endometrioid carcinoma G3, 2 serous carcinoma, and 2 clear cell carcinoma), 4 endometrial atypical hyperplasia cases, 39 endometrial hyperplasia without atypia, and 34 benign endometrial conditions (proliferative, secretory, atrophic, and mixed). Written informed consent was obtained from all patients, and minors were excluded.

Image preparation: The collected cells were stained using either H&E or Papanicolaou protocols, producing 100 H&E-stained and 39 Papanicolaou-stained whole slide images (WSIs). A MOTIC digital biopsy scanner (EasyScan 60) with a 20x lens was used to digitize slides. Because each WSI is extremely large (e.g., 95,200 x 87,000 pixels), the images were cropped into 1,024 x 1,024 pixel tiles. A thresholding algorithm retained only tiles with a mean pixel value between 50 and 230 and a standard deviation above 20, filtering out meaningless background.
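As a rough illustration, that tile-filtering heuristic can be sketched in a few lines of NumPy. Whether the statistics are computed on grayscale or per color channel is not specified in the paper; this sketch assumes a single-channel (grayscale) tile, and the function name is our own:

```python
import numpy as np

def keep_tile(tile, mean_lo=50, mean_hi=230, std_min=20):
    """Keep a tile only if its mean intensity lies between 50 and 230
    and its standard deviation exceeds 20, discarding blank background
    and saturated regions with little texture."""
    return mean_lo < tile.mean() < mean_hi and tile.std() > std_min

# A flat near-white background tile is rejected; a textured mid-gray
# tile with plausible tissue-like variation is kept.
background = np.full((1024, 1024), 245, dtype=np.uint8)
tissue = np.random.default_rng(0).integers(60, 200, (1024, 1024)).astype(np.uint8)
print(keep_tile(background), keep_tile(tissue))  # False True
```

A threshold like this is cheap enough to run over tens of thousands of tiles per WSI before any deep learning model is involved.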

Annotation process: Two experienced cytopathologists annotated the images using Adobe Photoshop CC 2019. One senior cytopathologist segmented cell clumps, and the second reviewed the results. Classification labels followed the International Society of Gynecologic Pathologists and the 2014 WHO classification of uterine tumors. Cell clumps were classified as malignant (atypical cells of undetermined significance, suspected malignant, or confirmed malignant) or benign (non-malignant). When the two pathologists disagreed, they discussed the case, and unresolvable cases were discarded. The final XJTU-EC dataset contained 3,620 positive images and 2,380 negative images.

TL;DR: The XJTU-EC dataset includes 139 patients, 100 H&E-stained and 39 Papanicolaou-stained WSIs, cropped into 6,000 tiles (3,620 positive, 2,380 negative). Two cytopathologists annotated both segmentation masks and classification labels over a three-year period, making this the first endometrial cytopathology dataset with dual annotations.
Pages 7-9
Stage 1: CM-UNet for Cell Clump Segmentation Across Staining Styles

The first stage of the framework focuses on extracting regions of interest (ROIs), specifically endometrial cell clumps, from cytopathology images. In clinical practice, cell clumps are what cytopathologists examine, while the background contains noise from neutrophils, dead cells, and other impurities. The authors developed CM-UNet, a modified UNet architecture that uses ResNet-101 as its backbone and applies dense connections to aggregate feature maps at each decoder node. Dense skip connections allow more flexible multi-scale feature fusion compared to the standard UNet, which only uses simple concatenation between corresponding encoder and decoder levels.

Channel Attention (CA) module: To handle the visual differences between staining styles, the authors placed a channel attention module at the bottleneck of the encoder-decoder network. The CA module computes a channel affinity matrix using matrix multiplication between the feature maps and their transpose, followed by a softmax normalization. This mechanism integrates semantic relationships between different channel mappings and emphasizes strongly interdependent channels by adjusting weights. A scaling parameter (beta) is initialized at 0 and learned gradually during training. The key insight is that channel attention helps the network focus on structural features of cell clumps rather than color information that varies with staining protocol.
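A minimal NumPy sketch of this style of channel attention follows, using the commonly seen formulation (flattened features multiplied by their transpose, row-wise softmax, a learned residual scale beta initialized at 0). The paper's exact normalization direction and residual form may differ; shapes and names here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(feat, beta=0.0):
    """Channel-attention sketch for a feature map of shape (C, H, W).
    The (C x C) affinity matrix comes from multiplying the flattened
    features with their transpose; beta starts at 0, so at initialization
    the module is an identity and its influence is learned gradually."""
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)                  # (C, N)
    affinity = softmax(flat @ flat.T, axis=-1)     # (C, C) channel affinities
    refined = (affinity @ flat).reshape(C, H, W)   # reweighted channels
    return beta * refined + feat                   # residual connection

x = np.random.default_rng(1).standard_normal((8, 4, 4))
print(np.allclose(channel_attention(x, beta=0.0), x))  # True: identity at init
```

Because the affinity matrix relates whole channels rather than spatial positions, it captures inter-channel (structural) dependencies rather than local color statistics, which is the claimed source of staining robustness.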

Multi-level Semantic Supervision (MSS) module: To address gradient vanishing and explosion problems in the deep network, the authors introduced a multi-level semantic supervision module. Each side output layer performs a 1x1 convolution followed by global average pooling to extract global contextual information. Weight factors are assigned to each layer, with alpha values set to 0.1, 0.3, 0.6, and 0.9 from the deepest to shallowest layers. This design forces the network to learn meaningful semantic representations at multiple scales rather than relying solely on the final output layer.
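The two pieces of the MSS design — a 1x1 convolution plus global average pooling per side branch, and the fixed alpha weighting of per-depth losses — can be sketched as follows. The channel counts and the 1x1-convolution weights here are illustrative, not the paper's:

```python
import numpy as np

ALPHAS = (0.1, 0.3, 0.6, 0.9)  # deepest -> shallowest side-output weights

def side_output(feat, w):
    """One MSS side branch: a 1x1 convolution (equivalent to mixing
    channels of a (C, H, W) map with a (C_out, C) matrix w) followed by
    global average pooling over the spatial dimensions."""
    mixed = np.tensordot(w, feat, axes=([1], [0]))  # (C_out, H, W)
    return mixed.mean(axis=(1, 2))                  # GAP -> (C_out,)

def mss_loss(side_losses, alphas=ALPHAS):
    """Weighted sum of per-depth side losses: 0.1 at the deepest layer
    rising to 0.9 at the shallowest, so high-resolution layers dominate
    while deep layers still receive a direct gradient signal."""
    return sum(a * l for a, l in zip(alphas, side_losses))

feat = np.random.default_rng(2).standard_normal((16, 8, 8))
w = np.random.default_rng(3).standard_normal((1, 16))  # 1x1 conv: 16 -> 1 channel
print(side_output(feat, w).shape)                      # (1,)
print(mss_loss([1.0, 1.0, 1.0, 1.0]))                  # ≈ 1.9 (= 0.1+0.3+0.6+0.9)
```

Supervising each depth directly in this way gives every decoder stage its own short gradient path, which is what mitigates vanishing gradients in deep encoder-decoder networks.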

Loss function design: The segmentation model uses a hybrid loss function that combines binary cross-entropy with a Dice loss component to address class imbalance between cell clump pixels and background pixels. The overall loss is the weighted sum of this hybrid segmentation loss and the multi-level side loss across all depth levels. After training, morphological processing is applied to fill holes and remove small artifacts in the segmentation masks. The model was trained using ten-fold cross-validation on the annotated dataset.
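A minimal sketch of such a hybrid BCE-plus-Dice loss is below. The equal 0.5/0.5 mixing weight is our assumption for illustration; the paper's exact weighting between the two terms, and between the hybrid loss and the side losses, is not reproduced here:

```python
import numpy as np

def bce_dice_loss(pred, target, w=0.5, eps=1e-7):
    """Hybrid segmentation loss: binary cross-entropy scores every pixel
    equally, while the Dice term counters the imbalance between sparse
    cell-clump pixels and abundant background. pred holds probabilities
    in [0, 1]; target holds binary ground-truth masks."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    bce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()
    inter = (pred * target).sum()
    dice = 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    return w * bce + (1 - w) * dice

t = np.array([[0.0, 1.0], [1.0, 0.0]])
print(bce_dice_loss(t, t) < bce_dice_loss(1 - t, t))  # True: perfect beats inverted
```

The Dice term matters because with heavy class imbalance a model can achieve low BCE by predicting mostly background, while Dice directly penalizes poor overlap with the (small) foreground.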

TL;DR: CM-UNet uses a ResNet-101 backbone with dense skip connections, a channel attention module at the bottleneck to handle staining variation, and multi-level semantic supervision (alpha weights 0.1 to 0.9) to prevent gradient issues. The hybrid loss function combines cross-entropy and Dice loss for class-balanced segmentation.
Pages 9-11
Stage 2: ECRNet and Contrastive Learning for Cell Clump Classification

After segmentation, the extracted ROIs vary in shape and size. Each ROI is padded with zero-valued pixels to a uniform 512 x 512 size before being passed to the classification stage. The authors developed ECRNet, a classification algorithm that combines contrastive learning with supervised learning to handle the representation differences caused by varying staining styles. The core idea is that cell clumps with the same diagnostic label should be grouped together in the feature space, regardless of whether they come from H&E-stained or papanicolaou-stained slides.
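The zero-padding step can be sketched as below. The paper states only that zero-valued pixels are used; centering the ROI on the canvas is our assumption (top-left placement would work equally well):

```python
import numpy as np

def pad_roi(roi, size=512):
    """Pad a variable-size ROI with zero-valued pixels onto a uniform
    size x size canvas, centering the ROI. Works for (H, W) grayscale
    or (H, W, C) color arrays; assumes H, W <= size."""
    h, w = roi.shape[:2]
    canvas = np.zeros((size, size) + roi.shape[2:], dtype=roi.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    canvas[top:top + h, left:left + w] = roi
    return canvas

roi = np.ones((300, 420, 3), dtype=np.uint8)
print(pad_roi(roi).shape)  # (512, 512, 3)
```

Zero padding (rather than resizing) preserves the absolute scale of cell clumps, so morphological size cues are not distorted before classification.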

Label memory bank and contrastive loss: Unlike standard contrastive learning methods that treat different augmentations of the same image as positive pairs, ECRNet introduces a label memory bank to store image representations alongside their classification labels. Two instances with the same label (e.g., both malignant) are treated as a positive pair, while two instances with different labels form a negative pair. The contrastive loss function uses an indicator function to identify same-label pairs and a temperature parameter (tau = 0.07) to scale the similarity scores. This approach leverages class-level discriminative information rather than instance-level augmentation similarity.
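The label-driven contrastive loss for a single query against the memory bank can be sketched as follows. This follows the usual supervised-contrastive form (normalized features, temperature-scaled similarities, indicator over same-label bank entries); the paper's exact formulation may differ in detail, and all names here are illustrative:

```python
import numpy as np

def bank_contrastive_loss(query, q_label, bank_feats, bank_labels, tau=0.07):
    """Contrastive loss for one query against a label memory bank: bank
    entries sharing the query's label are positives, all others negatives.
    Similarities are L2-normalized dot products scaled by temperature tau."""
    q = query / np.linalg.norm(query)
    k = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
    sims = np.exp(k @ q / tau)          # temperature-scaled similarities
    pos = bank_labels == q_label        # indicator function for same-label pairs
    # Average over positives of -log(positive similarity / all similarities).
    return float(np.mean(-np.log(sims[pos] / sims.sum())))

bank = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([1, 0])
query = np.array([1.0, 0.0])
aligned = bank_contrastive_loss(query, 1, bank, labels)     # positive matches query
mismatched = bank_contrastive_loss(query, 0, bank, labels)  # positive is orthogonal
print(aligned < mismatched)  # True
```

The low temperature (tau = 0.07) sharpens the softmax-like ratio, so the loss focuses on the hardest negatives — exactly the visually confusable clumps from a different staining style.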

Momentum-based updating: To maintain a large and consistent dictionary of image representations in the label memory bank without excessive computational overhead, the authors use a momentum update method with a coefficient of m = 0.9. The key encoder's parameters are updated as a weighted average of its previous parameters and the query encoder's current parameters. This allows the dictionary to remain large enough to provide meaningful contrastive signals while keeping the representations temporally consistent.
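The update rule described above — each key-encoder parameter becomes a weighted average of its previous value and the query encoder's current value — is a one-liner; the helper name here is our own:

```python
def momentum_update(key_params, query_params, m=0.9):
    """Momentum update for the key encoder: new_key = m * old_key +
    (1 - m) * query, so bank representations drift slowly and stay
    temporally consistent across training iterations."""
    return [m * k + (1 - m) * q for k, q in zip(key_params, query_params)]

# With m = 0.9, the key encoder moves only 10% of the way toward the
# query encoder per update.
print(momentum_update([0.0], [1.0]))
```

The large coefficient (m = 0.9) is what keeps entries written to the memory bank at different iterations comparable to one another, without the cost of re-encoding the whole bank every step.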

Supervised learning component: The supervised branch uses VGG-16 as the classifier with a cross-entropy loss function. The total ECRNet loss combines the classification loss and the contrastive loss, weighted by a hyperparameter beta = 0.5. The model was trained using the Adam optimizer with an initial learning rate of 5 x 10^-3, a batch size of 32, and ImageNet pre-trained weights for initialization. Data augmentation included vertical flipping, horizontal flipping, random rotation (90, 180, 270 degrees), scaling, and grayscale conversion. All experiments used ten-fold cross-validation on two NVIDIA GeForce GTX 1080 GPUs.

TL;DR: ECRNet combines contrastive learning (with a label memory bank, temperature tau = 0.07, momentum m = 0.9) and supervised VGG-16 classification (beta = 0.5 weighting). It groups same-label cell clumps together in feature space regardless of staining style, using Adam optimizer at learning rate 5 x 10^-3 with ten-fold cross-validation.
Pages 12-13
CM-UNet Outperforms Six Classical Segmentation Models

The authors benchmarked CM-UNet against six established segmentation architectures: FCN, UNet, UNet++, LinkNet, DeepLabV3, and DeepLabV3+. CM-UNet achieved the highest average Dice coefficient of 0.89, exceeding the 0.85 threshold considered excellent for cell clump segmentation. For comparison, FCN scored only 0.61, basic UNet reached 0.75, LinkNet achieved 0.79, DeepLabV3 scored 0.81, and both UNet++ and DeepLabV3+ matched at 0.85. The inference time for CM-UNet was 0.039 seconds per image with 33M parameters, comparable to UNet++ (0.029s, 30M parameters) and substantially smaller than FCN (270M parameters).

Ablation study on modules: The ablation experiments confirmed the contribution of each proposed module. Adding only the channel attention module to UNet++ raised the Dice from 0.85 to 0.86. Adding only the multi-level semantic supervision module raised it further to 0.88. Combining both modules in CM-UNet achieved the best Dice of 0.89, demonstrating that the CA and MSS modules provide complementary benefits. The training time increased modestly from 1.49 hours (UNet++) to 1.90 hours (CM-UNet), a reasonable trade-off for the improved segmentation quality.

Qualitative analysis of staining robustness: Visual inspection of segmentation results across both H&E-stained and Papanicolaou-stained images revealed important differences between models. FCN and UNet frequently under-segmented, failing to identify all cell clumps. LinkNet and DeepLabV3 tended to over-segment, mistaking mucus and single cells for cell clumps. UNet++ performed well on H&E-stained images but poorly on Papanicolaou-stained images, occasionally missing clumps. DeepLabV3+ made fewer errors on Papanicolaou-stained images but missed clumps on H&E-stained images. CM-UNet was the only model that performed consistently well across both staining styles, producing segmentation results closest to the pathologists' annotations.

TL;DR: CM-UNet achieved a Dice coefficient of 0.89, outperforming FCN (0.61), UNet (0.75), LinkNet (0.79), DeepLabV3 (0.81), UNet++ (0.85), and DeepLabV3+ (0.85). Ablation showed the CA module added +0.01 and the MSS module added +0.03 over UNet++, and CM-UNet was the only model robust across both H&E and Papanicolaou staining styles.
Pages 14-15
ECRNet Achieves 98.50% Accuracy, Beating Eight Baseline Classifiers

In the classification stage, ECRNet was compared against eight deep learning models and three CNN+SVM pipelines. ECRNet achieved 98.50% accuracy, 99.32% precision, 97.67% recall, and 99.33% F1-score. The closest competitor was DenseNet-121 at 93.50% accuracy and 93.59% F1-score. ResNet-101 achieved 92.17% accuracy with 97.03% precision (second only to ECRNet) but had a recall of just 87.00%, meaning it would miss more cancer cases. ResNeXt-101 had the highest recall among competitors (99.12%) but only 86.50% accuracy, indicating many false positives.

Failure of lightweight and transformer models: ViT (Vision Transformer) performed the worst among end-to-end classifiers at 65.00% accuracy, likely due to the small dataset size causing overfitting given ViT's 343M parameters. MobileNet-V1, a lightweight architecture with only 5M parameters, also struggled at 82.99% accuracy, suggesting that lightweight networks lack the capacity to learn the complex morphological features of endometrial cytopathology. The CNN+SVM approaches performed poorly across the board, with VGG-16+SVM reaching only 78.83% accuracy, indicating that fixed CNN feature extraction followed by a separate SVM classifier is inadequate for this task.

Two-stage vs. one-stage strategy: Ablation experiments demonstrated the critical importance of the segmentation-first approach. When VGG-16 classified raw cytopathology images directly (one-stage), accuracy was only 84.29%. With the two-stage approach (segment first, then classify ROIs), VGG-16 accuracy rose to 91.07%, a gain of 6.78 percentage points. For ECRNet, the improvement was even more dramatic: one-stage accuracy was 89.17%, while the two-stage strategy achieved 98.50%, a gain of 9.33 percentage points. This confirms that removing background noise through segmentation is essential for reliable classification.

Contrastive learning contribution: The contrastive learning component of ECRNet added 7.43 percentage points of accuracy over the VGG-16 backbone alone in the two-stage setting (98.50% vs. 91.07%). This improvement stems from the label memory bank's ability to bring same-class representations closer together in feature space, which is particularly valuable when the same type of cell clump can look visually different due to staining variation.

TL;DR: ECRNet reached 98.50% accuracy, 99.32% precision, 97.67% recall, and 99.33% F1-score. The two-stage strategy added +9.33 percentage points over one-stage for ECRNet, and contrastive learning added +7.43 percentage points over the VGG-16 backbone. ViT (65.00%) and MobileNet (82.99%) performed poorly due to overfitting and insufficient model capacity, respectively.
Pages 15-17
External Validation on a Public Dataset Confirms ECRNet's Generalizability

To test whether the framework generalizes beyond the single-institution XJTU-EC dataset, the authors performed external validation using a publicly available dataset from the AIstudio platform. This external dataset contained 848 negative and 785 positive endometrial cytopathology images (ratio 1.08:1), all Papanicolaou-stained. Because the external dataset lacked segmentation labels, this validation focused solely on ECRNet's classification performance. The models were trained on 1,024 x 1,024 pixel images from the XJTU-EC dataset (not ROIs) and tested directly on the external data.

ECRNet dominated the external validation: ECRNet achieved 95.32% accuracy, 94.57% precision, 96.17% recall, and 95.37% F1-score on the external dataset. The next best model, EfficientNet-B7, reached only 83.53% accuracy with 59.2% recall. DenseNet-121 achieved 80.10% accuracy with 67.20% recall. ResNet-101 had perfect precision (100%) but only 53.10% recall, meaning it would miss nearly half of all positive cases. ResNeXt-101 performed worst at 64.50% accuracy with just 44.20% recall.

Clinical significance of recall: The recall metric is especially important in cancer screening because a missed positive case (false negative) means a patient with cancer is not flagged for follow-up. ResNet-101's 53.10% recall and ResNeXt-101's 44.20% recall would be unacceptable in a clinical setting. ECRNet's 96.17% recall on external data demonstrates that its contrastive learning approach produces features robust enough to transfer across datasets from different institutions and staining conditions. The 11.79 percentage point gap between ECRNet's accuracy (95.32%) and the next-best classifier (EfficientNet-B7 at 83.53%) is a substantial margin.

Error analysis: Among the classification failure cases on the internal test set, 4 false-negative cases consisted of 1 well-differentiated and 3 poorly differentiated endometrial adenocarcinomas. The 4 false-positive cases included 3 normal cell clumps misclassified as cancerous. Common failure patterns involved cell stacking that obscured structural features and images with very few cells, which were easier for the model to misclassify. The authors noted that improving ECRNet's ability to classify small targets is a priority for future work.

TL;DR: On an external public dataset (848 negative, 785 positive Papanicolaou-stained images), ECRNet achieved 95.32% accuracy and 96.17% recall. The next-best model (EfficientNet-B7) reached only 83.53% accuracy and 59.2% recall. ResNet-101 had 100% precision but missed 46.9% of positive cases. ECRNet's contrastive learning approach proved robust across institutions.
Pages 17-18
Single-Institution Data and Annotation Bottlenecks Limit Current Scope

The authors acknowledge two primary limitations of this work. First, all training data came from a single institution (the First Affiliated Hospital of Xi'an Jiaotong University), which limits external generalizability despite the inclusion of two staining styles and the use of contrastive learning to enhance robustness. While the external validation on the AIstudio public dataset showed promising results (95.32% accuracy), the authors recognize that validation across multiple medical centers with their own institutional staining protocols and scanner equipment is needed before clinical deployment.

Annotation scarcity: The second major limitation is the annotation bottleneck. Endometrial cytopathologists are scarce, and the annotation process required three years to complete for just 139 patients. Each whole slide image required expert segmentation of individual cell clumps followed by classification labeling, with a two-pathologist review process that discarded ambiguous cases. This high-quality annotation pipeline does not scale easily, which limits the ability to expand the dataset or create multi-center training sets.

Future directions: The authors plan to extend their method to datasets from other medical centers for broader external validation. They also intend to investigate self-supervised learning techniques to reduce the annotation workload on cytopathologists. Self-supervised learning could allow the model to learn useful representations from large amounts of unlabeled cytopathology images, requiring expert annotation only for a smaller subset used for fine-tuning. Additionally, improving the model's ability to classify cell clumps with few cells or heavily stacked cells was identified as a specific area for improvement.

Dataset availability: The XJTU-EC dataset cannot be shared publicly due to patient privacy protections enforced by the hospital's Ethics Committee, though it is available to qualified researchers upon request. The external validation data is publicly available on the AIstudio platform (Baidu). The authors have stated that the framework code will be released on GitHub, which would allow other research groups to adapt and test the approach on their own institutional datasets.

TL;DR: Key limitations include single-institution training data (139 patients from one hospital) and a three-year annotation process constrained by cytopathologist scarcity. Future work will pursue multi-center validation, self-supervised learning to reduce annotation burden, and improved classification of small or stacked cell clumps.
Citation: Wang R, Li Q, Shi G, Li Q, Zhong D. PLOS ONE, 2024 (Open Access). Available at: PMC11290691. DOI: 10.1371/journal.pone.0306549. License: CC BY.