A Deep Learning-Based Computer-Aided Diagnosis System for Detecting Atypical Endometrial Hyperplasia and Endometrial Cancer Through Hysteroscopy


Plain-English Explanations

1. Why Early Detection of AEH and Endometrial Cancer Matters

Endometrial cancer (EC) is the sixth most common cancer among women worldwide, with 417,336 new cases reported in 2020. While 67% of patients are diagnosed at an early stage and enjoy a 5-year overall survival rate of 81%, outcomes deteriorate dramatically for advanced disease. The 5-year survival rate plummets to just 17% for stage IVA and 15% for stage IVB. Atypical endometrial hyperplasia (AEH), a recognized precursor to EC, carries a 28% risk of progressing to cancer over 20 years. Compounding this risk, AEH may coexist with occult EC in up to one-third of cases, making early and accurate differentiation between benign and malignant conditions clinically essential.

Hysteroscopy as a diagnostic tool: Hysteroscopic-guided curettage is increasingly favored for treatment planning in EC patients. A large meta-analysis by Gkrozou et al. covering over 9,000 patients found that hysteroscopy achieves 82.6% sensitivity and 99.7% specificity for diagnosing EC. However, traditional hysteroscopy still has significant blind spots. A separate meta-analysis of 1,106 patients with preoperative AEH diagnoses showed that uterine curettage and hysteroscopic-guided biopsy underestimated the presence of EC in 32.7% to 45.3% of cases. Another systematic review found an 11% failure rate and 31% infeasible endometrial samples in postmenopausal women, leading to missed diagnoses in roughly 7% of cases.
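To make the pooled sensitivity and specificity figures concrete, predictive values can be derived via Bayes' rule for an assumed disease prevalence. The sketch below uses the Gkrozou et al. estimates; the 10% prevalence is an illustrative assumption, not a figure from the paper:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Derive PPV and NPV from test characteristics via Bayes' rule."""
    tp = sensitivity * prevalence                # true positives per unit population
    fp = (1 - specificity) * (1 - prevalence)    # false positives
    tn = specificity * (1 - prevalence)          # true negatives
    fn = (1 - sensitivity) * prevalence          # false negatives
    return tp / (tp + fp), tn / (tn + fn)

# Pooled hysteroscopy estimates; 10% prevalence is an assumed illustration
ppv, npv = predictive_values(0.826, 0.997, 0.10)
print(f"PPV={ppv:.3f}, NPV={npv:.3f}")
```

Even with 99.7% specificity, the PPV depends heavily on the prevalence plugged in, which is why reported sensitivity/specificity alone do not fully characterize clinical usefulness.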

The case for AI assistance: These diagnostic gaps create a clear opening for computer-aided diagnosis. Deep learning has already gained traction in gastrointestinal endoscopy for detecting polyps, adenomas, and cancers. Contrastive learning (CL), a self-supervised approach, has demonstrated particular promise in medical imaging by enabling models to learn discriminative features even when labeled data are scarce. This study introduces ECCADx, a deep learning-based computer-aided diagnosis system that applies contrastive learning to hysteroscopic image analysis for differentiating AEH and EC from benign lesions.

Prior work in hysteroscopic AI: Existing approaches have been limited. Neofytou et al. built a CADx system validated on just 516 regions of interest from 52 subjects, achieving 81% accuracy using statistical features with an SVM classifier. Zhang et al. used VGG-Net-16 to classify endometrial lesions but achieved only 68% sensitivity for atypical hyperplasia. Takahashi et al. combined three neural networks with continuity analysis to reach 90.29% accuracy on 177 patients, but all prior studies were single-center. ECCADx addresses these limitations with multicenter validation and contrastive learning pre-training.

TL;DR: Endometrial cancer affects 417,336 women annually, and hysteroscopic-guided biopsy underestimates coexisting EC in 32.7% to 45.3% of AEH cases. ECCADx is a deep learning system using contrastive learning to detect AEH/EC from hysteroscopic images, addressing the limitations of prior single-center studies with smaller datasets.

2. Study Design: Multicenter Retrospective Cohort Across Three Hospitals

This multicenter retrospective study was conducted across three tertiary hospitals in China and adhered to the Declaration of Helsinki. It enrolled a total of 1,394 patients contributing 55,874 hysteroscopy images in PNG format, all of which were confirmed by two expert reviewers. The AEH/EC group comprised both atypical endometrial hyperplasia and endometrial cancer cases, while the control group encompassed benign lesions including endometrial polyps, submucosal uterine leiomyoma, endometrial hyperplasia without atypia, and normal uterine cavities.

Dataset composition: The data were stratified into a training set of 1,204 patients (49,646 images) and two independent test sets. The internal test set from the Maternal and Child Hospital of Hubei Province (MCH) included 85 patients (3,419 images), with 23 AEH/EC cases (698 images) and 62 controls (2,721 images). The external test set from Tongji Hospital (TJH) and the Second Affiliated Hospital of Zhengzhou University (ZZSH) included 105 patients (2,809 images), with 16 AEH/EC cases (760 images) and 89 controls (2,049 images). There was no overlap between training and test datasets.

Temporal and equipment separation: The training dataset was collected from January 2008 to December 2017 at MCH using Olympus OTV-S190 and Karl Storz 26105FA or 26120BA devices. The internal test set came from January 2018 to June 2019 at MCH using the same equipment, providing temporal separation. The external test set was collected from January 2019 to December 2019 at TJH and ZZSH, predominantly using Olympus OTV-S190 equipment. This design ensures that the model was evaluated on data from different time periods, institutions, and patient populations.

Expert evaluation protocol: Twelve gynecological endoscopists were recruited for human comparison: four junior (less than 1 year experience), four intermediate (1 to 5 years), and four senior (over 10 years). Importantly, the MCH test dataset was evaluated by endoscopists from TJH, and the TJH/ZZSH dataset was assessed by endoscopists from MCH. This cross-institutional evaluation design avoided any familiarity bias. Each endoscopist classified patient images on a six-point scale from "Definitely benign" to "Definitely malignant."

TL;DR: The study enrolled 1,394 patients (55,874 images) across three hospitals. Training used 1,204 patients (49,646 images), with separate internal (85 patients) and external (105 patients) test sets. Twelve endoscopists at three experience levels provided human baseline comparisons using cross-institutional evaluation.

3. ECCADx Architecture: ResNet-50 with SimCLR Contrastive Learning

The ECCADx system is built on a ResNet-50 backbone, a 50-layer deep convolutional neural network (CNN) known for its residual connections that mitigate the vanishing gradient problem. The model followed a three-stage training pipeline: initial pre-training on ImageNet (a corpus of over 14 million natural images), self-supervised contrastive learning pre-training on external colonoscopy data, and finally supervised fine-tuning on hysteroscopy images for AEH/EC classification. All hysteroscopy images were resized to 224x224 pixels before processing.

Contrastive learning with SimCLR: The second pre-training phase used the SimCLR framework, a self-supervised contrastive learning method. SimCLR learns discriminative features by maximizing agreement between different augmented views of the same image while pushing apart representations of distinct images using the NT-Xent (normalized temperature-scaled cross-entropy) loss. The model architecture included the ResNet-50 backbone followed by a two-layer MLP projection head that transformed 2,048-dimensional backbone features through a 2,048-dimensional hidden layer into a 128-dimensional latent space for loss calculation. The temperature parameter was set to 0.07.
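The NT-Xent objective described above can be sketched in a few lines of numpy. This is a minimal illustration of the loss formula with the paper's temperature of 0.07, not the authors' implementation; in the actual pipeline the inputs would be the 128-dimensional outputs of the MLP projection head:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.07):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.

    z1[i] and z2[i] are projections of two augmented views of image i;
    rows are L2-normalized internally, and every other sample in the
    batch serves as a negative.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)               # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity prep
    sim = z @ z.T / temperature                        # (2n, 2n) similarity logits
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    # the positive for row i is the other view of the same image
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Maximizing agreement between the two views of each image while repelling all other samples is what drives the backbone to learn features that separate lesion types without any labels.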

Pre-training on colonoscopy data: Rather than relying solely on ImageNet weights (which come from natural images, not endoscopic images), the authors pre-trained ECCADx on four publicly available colonoscopy datasets: CP-CHILD, PolypGen, IPCL, and Hyper Kvasir. Although colonoscopy and hysteroscopy examine distinct anatomical sites, they share fundamental characteristics as endoscopic images, including mucosal patterns, vascular structures, luminal views, and common imaging challenges like variable illumination and specular reflections. This cross-domain strategy provided a much stronger feature initialization than ImageNet alone.

Training infrastructure and hyperparameters: Contrastive pre-training was conducted on four NVIDIA A800 80GB GPUs with a global batch size of 2,048, using the AdamW optimizer with an initial learning rate of 3e-4, weight decay of 1e-5, and cosine annealing schedule for 300 epochs. Global Batch Normalization was implemented to aggregate batch statistics across all GPUs and prevent information leakage during distributed training.
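The cosine annealing schedule mentioned above follows the standard formula, decaying the learning rate from its initial value toward zero over the training run. A minimal sketch using the paper's stated pre-training values (initial LR 3e-4, 300 epochs):

```python
import math

def cosine_annealed_lr(epoch, total_epochs=300, lr_max=3e-4, lr_min=0.0):
    """Cosine annealing: smoothly decay lr_max -> lr_min over total_epochs."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))
```

The schedule starts at the full learning rate, passes through half its value at the midpoint, and reaches the minimum at the final epoch, which tends to stabilize the late stages of contrastive pre-training.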

TL;DR: ECCADx uses a ResNet-50 backbone pre-trained first on ImageNet, then on four colonoscopy datasets via SimCLR contrastive learning (128-dimensional latent space, NT-Xent loss, temperature 0.07), before fine-tuning on hysteroscopy data. Training used 4x NVIDIA A800 GPUs with batch size 2,048 for 300 epochs.

4. Fine-Tuning, Data Augmentation, and Handling Class Imbalance

Fine-tuning strategy: After contrastive pre-training, the model was fine-tuned for the downstream AEH/EC classification task. During fine-tuning, only the initial convolutional layer (conv1) and its corresponding batch normalization layer (bn1) were frozen. This strategic freezing preserved robust low-level feature extraction capabilities while allowing the remaining layers to adapt extensively to the specific characteristics of hysteroscopy images. The SGD optimizer was used with momentum 0.9, an initial learning rate of 1e-4, cosine annealing schedule, a global batch size of 4,096, and training ran for 50 epochs on the same 4x NVIDIA A800 GPU setup.

Domain-specific data augmentation: Beyond conventional augmentations (random cropping, horizontal flipping, color distortions in brightness, contrast, and saturation, and random Gaussian blur), the authors designed augmentations tailored specifically to hysteroscopic imaging challenges. These included simulating realistic variations in lighting and brightness, random color cast adjustments to account for different equipment and physiological fluid influences, simulated instrument occlusion with random black patches mimicking surgical tools, and mild geometric distortions to account for lens aberrations or scope movements. This combination of conventional and domain-adaptive augmentations was critical for bridging the domain gap from colonoscopy pre-training to hysteroscopy fine-tuning.
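One of these domain-adaptive augmentations, the simulated instrument occlusion, can be sketched as a simple array operation. The patch-size cap and randomization details below are assumptions for illustration; the paper does not specify them:

```python
import numpy as np

def random_instrument_occlusion(img, max_frac=0.25, rng=None):
    """Blank a random rectangle to mimic a surgical tool crossing the view.

    img: H x W x C uint8 array; max_frac (assumed) caps the patch size per axis.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape[:2]
    ph = rng.integers(1, max(2, int(h * max_frac)))
    pw = rng.integers(1, max(2, int(w * max_frac)))
    y = rng.integers(0, h - ph + 1)
    x = rng.integers(0, w - pw + 1)
    out = img.copy()
    out[y:y + ph, x:x + pw] = 0  # black patch over the occluded region
    return out
```

Training on images with such synthetic occlusions teaches the model to ignore instruments that partially block the lesion, a situation common in real hysteroscopic frames.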

Addressing class imbalance: The training dataset exhibited significant class imbalance, with only 106 AEH/EC cases (3,204 images) versus 1,098 control cases (46,442 images). To handle this, the authors adopted Focal Loss, which multiplies the standard cross-entropy with a modulating factor that increases sensitivity toward misclassified AEH/EC observations. This was complemented by an oversampling strategy to further compensate for data imbalance. The optimal classification threshold was determined by maximizing the F1 score on a dedicated validation set.
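Focal Loss follows the standard form FL = -alpha_t (1 - p_t)^gamma log(p_t), where the modulating factor (1 - p_t)^gamma shrinks the contribution of well-classified examples. The numpy sketch below illustrates the formula; the gamma and alpha values shown are common defaults, not values reported in the paper:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples via (1 - p_t)^gamma.

    p: predicted probability of the positive (AEH/EC) class; y: 0/1 labels.
    With gamma=0, this reduces to alpha-weighted cross-entropy.
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)           # avoid log(0)
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

Because confident correct predictions are damped, the gradient signal concentrates on the rare, hard AEH/EC images, which is exactly what a 3,204-vs-46,442 image imbalance requires.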

Statistical evaluation framework: Model performance was assessed using accuracy, sensitivity, specificity, precision, recall, F1 score, and AUC, all with 95% confidence intervals. McNemar's chi-square test was used for paired comparisons between ECCADx and individual human experts, while Fisher's exact test was used for unpaired comparisons. AUC differences were assessed using DeLong's test, with statistical significance set at a two-sided p-value below 0.05.
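McNemar's test compares two paired classifiers (here, ECCADx versus an individual endoscopist on the same patients) using only the discordant pairs. A minimal continuity-corrected sketch, illustrative rather than the authors' statistical code:

```python
import math

def mcnemar_chi2(b, c):
    """McNemar's chi-square test with continuity correction.

    b: patients only classifier A got right; c: patients only classifier B
    got right (the discordant pairs). Returns (statistic, p_value), 1 df.
    """
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p
```

Because concordant pairs carry no information about which classifier is better, the test is sensitive even when both readers agree on most patients.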

TL;DR: Fine-tuning froze only conv1/bn1, used SGD with batch size 4,096 for 50 epochs. Focal Loss and oversampling addressed the severe class imbalance (106 AEH/EC vs. 1,098 controls). Domain-specific augmentations simulated instrument occlusion, color casts, and lighting variations unique to hysteroscopy.

5. Internal Test Results: ECCADx Outperforms Senior Endoscopists at MCH

On the MCH internal test dataset (85 patients, 3,419 images), ECCADx with contrastive learning achieved an AUC of 0.979 (95% CI: 0.942 to 1.000), accuracy of 94.1% (95% CI: 89.1% to 99.1%), sensitivity of 95.2% (95% CI: 89.5% to 100%), specificity of 91.3% (95% CI: 78.2% to 100%), and an F1 score of 0.959 (95% CI: 0.920 to 0.992). The positive predictive value was 96.7% (95% CI: 91.8% to 100%) and the negative predictive value was 87.5% (95% CI: 72.7% to 100%). The Brier score, a measure of calibration quality where lower is better, was 0.040 (95% CI: 0.014 to 0.075).
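All of the reported metrics, including the Brier score, derive from patient-level probabilities and a confusion matrix. The helper below is a generic sketch of those definitions (the toy inputs in the test are made up, not the study's data), and it assumes a non-degenerate confusion matrix with at least one sample in every cell:

```python
import numpy as np

def diagnostic_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, PPV, NPV, F1, and Brier score for binary labels."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sens / (ppv + sens)
    brier = float(np.mean((y_prob - y_true) ** 2))  # lower = better calibrated
    return {"sensitivity": sens, "specificity": spec, "ppv": ppv,
            "npv": npv, "f1": f1, "brier": brier}
```

Note that the Brier score uses the raw probabilities rather than the thresholded predictions, which is why it captures calibration quality that accuracy alone misses.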

Comparison with endoscopists: ECCADx with CL outperformed the average of senior endoscopists across every metric. The AI achieved AUC 0.979 versus 0.952 for seniors, accuracy 94.1% versus 90.9%, sensitivity 95.2% versus 87.0%, and F1 score 0.959 versus 0.833. The gap widened further when comparing against junior endoscopists, who averaged AUC 0.872, accuracy 83.5%, sensitivity 73.9%, and F1 score 0.701. Intermediate endoscopists fell between the two groups with AUC 0.934 and accuracy 87.9%.

Impact of contrastive learning: Even ECCADx without CL performed competitively, achieving AUC 0.969, accuracy 91.8%, sensitivity 96.8%, and specificity 78.3%. However, the addition of CL provided a substantial improvement in specificity (from 78.3% to 91.3%), which is clinically important because low specificity means more false positives, leading to unnecessary biopsies and patient anxiety. The Kappa coefficient improved from 0.782 to 0.853, indicating stronger agreement with the ground truth.
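The Kappa coefficient cited here is Cohen's kappa, which corrects raw agreement with the ground truth for the agreement expected by chance. A minimal sketch of the binary case (illustrative, not the study's code):

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa for binary 0/1 labels: agreement corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # chance agreement from the marginal rates of each label
    pos_chance = (sum(y_true) / n) * (sum(y_pred) / n)
    neg_chance = (1 - sum(y_true) / n) * (1 - sum(y_pred) / n)
    chance = pos_chance + neg_chance
    return (observed - chance) / (1 - chance)
```

Kappa is preferable to raw accuracy on imbalanced test sets like these, because a classifier that always predicts "benign" would score high accuracy but a kappa near zero.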

The Kappa coefficient of 0.853 for ECCADx with CL indicates near-perfect agreement with histopathological ground truth, substantially exceeding even the senior endoscopists' average Kappa of 0.770. This consistency is particularly notable because AI systems are immune to the fatigue and perceptual biases that affect human readers during extended evaluation sessions.

TL;DR: On the internal MCH test set, ECCADx with CL achieved AUC 0.979, 95.2% sensitivity, 91.3% specificity, and F1 0.959, outperforming senior endoscopists (AUC 0.952, sensitivity 87.0%, F1 0.833). Contrastive learning boosted specificity from 78.3% to 91.3%.

6. External Validation: 100% Specificity on the TJH/ZZSH Dataset

On the external TJH/ZZSH test dataset (105 patients, 2,809 images), ECCADx with contrastive learning achieved an AUC of 0.975 (95% CI: 0.942 to 0.998), accuracy of 93.3% (95% CI: 88.6% to 98.1%), sensitivity of 92.1% (95% CI: 86.4% to 96.8%), specificity of 100% (95% CI: 100% to 100%), and an F1 score of 0.959 (95% CI: 0.925 to 0.988). The positive predictive value was 100% and the negative predictive value was 69.6% (95% CI: 50.0% to 87.5%). The Brier score was 0.072 (95% CI: 0.044 to 0.109).

Dramatic improvement over human experts: The performance gap between ECCADx and endoscopists was even larger on the external dataset. ECCADx with CL achieved AUC 0.975 versus the senior endoscopists' average of 0.862, accuracy 93.3% versus 80.2%, and sensitivity 92.1% versus 71.9%. Most strikingly, the AI's F1 score of 0.959 far exceeded the seniors' 0.530, underscoring a massive consistency advantage. Junior endoscopists performed even worse, with sensitivity of only 65.6% and F1 of 0.448.

The CL effect on external generalization: The benefit of contrastive learning was especially pronounced on external data. Without CL, ECCADx achieved AUC 0.891, accuracy 89.5%, and specificity of just 62.5%. With CL, specificity jumped to 100% and AUC increased by 0.084 points to 0.975. This demonstrates that contrastive pre-training on colonoscopy data substantially improved the model's ability to generalize across institutions, equipment types, and patient populations. The Kappa coefficient rose from 0.584 without CL to 0.781 with CL.

Why 100% specificity matters: Achieving 100% specificity and 100% PPV on the external test set means ECCADx produced zero false positives for benign cases. In clinical practice, this translates to no unnecessary biopsies triggered by the AI system when lesions are actually benign. This is particularly significant because the external dataset used different hysteroscopy equipment (predominantly Olympus) and involved geographically distinct patient populations, demonstrating that ECCADx can handle real-world inter-institutional variability without sacrificing precision.

TL;DR: On the external TJH/ZZSH dataset, ECCADx with CL achieved AUC 0.975, 92.1% sensitivity, and 100% specificity (zero false positives), vastly outperforming senior endoscopists (AUC 0.862, sensitivity 71.9%, F1 0.530). Contrastive learning lifted specificity from 62.5% to 100%.

7. Feature Visualization and Interpretability with t-SNE and Grad-CAM

t-SNE visualization: To understand why contrastive learning improved performance, the authors used t-distributed stochastic neighbor embedding (t-SNE) to visualize the high-dimensional features extracted by the model. t-SNE is a nonlinear dimensionality reduction technique that maps high-dimensional data onto a 2D plane while preserving local neighborhood structures. Without CL, the t-SNE plots for both the MCH and TJH/ZZSH datasets showed significant intermixing between control (orange) and AEH/EC (blue) sample clusters, indicating poor feature separation. With CL, the clusters became far more distinct, compact, and well-separated on both datasets. This qualitative evidence confirms that CL enables the model to learn highly discriminative features that transfer across clinical sites.
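The kind of projection described above can be reproduced with scikit-learn's TSNE. The sketch below substitutes synthetic Gaussian clusters for the real 2,048-dimensional backbone features, so the two-class structure is assumed for illustration:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Synthetic stand-ins for 2,048-d backbone features of the two classes
control_feats = rng.normal(0.0, 1.0, size=(60, 2048))
aeh_ec_feats = rng.normal(3.0, 1.0, size=(20, 2048))
features = np.vstack([control_feats, aeh_ec_feats])

# Project to 2-D for plotting; perplexity must stay below the sample count
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(features)
print(embedding.shape)  # one 2-D point per image
```

In the paper's figures, each point would be colored by its histopathological label, and the degree of cluster separation serves as a qualitative readout of feature quality.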

Grad-CAM analysis: The Gradient-weighted Class Activation Mapping (Grad-CAM) algorithm was used to identify which regions of hysteroscopic images the model relied on for predictions. The resulting heatmaps highlighted areas containing significant morphological and vascular features, including gross distortion of the endometrial cavity, focal necrosis, friable consistency, and atypical vessels. These are all recognized pathological hallmarks of AEH and EC. The model with CL focused more precisely on these crucial diagnostic regions compared to the model without CL, demonstrating that contrastive pre-training sharpened the model's attention to clinically relevant features.
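The core Grad-CAM computation is framework-agnostic once the forward activations and backpropagated gradients of a convolutional layer are in hand: channel weights are the spatially averaged gradients, and the heatmap is the ReLU of the weighted channel sum. A numpy sketch of that step (the surrounding forward/backward plumbing is omitted):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one conv layer.

    activations: (C, H, W) feature maps from the forward pass;
    gradients: (C, H, W) d(class score)/d(activations) from backprop.
    """
    weights = gradients.mean(axis=(1, 2))             # GAP over spatial dims
    cam = np.tensordot(weights, activations, axes=1)  # weighted channel sum -> (H, W)
    cam = np.maximum(cam, 0)                          # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1] for overlay
    return cam
```

The normalized map is then upsampled to the input resolution and overlaid on the hysteroscopic frame, producing the heatmaps that highlighted atypical vessels and necrotic regions.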

Analysis of false negatives: On the MCH internal test set, ECCADx produced two false negative cases. The first involved polyp cystic degeneration, where smooth, translucent surfaces closely mimicked benign endometrial polyps. The second involved papillary proliferation with fine, delicate fronds showing uniform coloration that appeared nearly identical to surrounding healthy endometrium. Both cases lacked the pronounced hyper-vascular patterns or surface necrosis that typically signal malignancy, which explains why they fell below the model's detection threshold. Notably, on the TJH/ZZSH external dataset, ECCADx correctly identified all cases of AEH and EC with no false negatives.

The combination of t-SNE and Grad-CAM provides both global and local interpretability. t-SNE demonstrates that the learned feature space meaningfully separates disease classes, while Grad-CAM reveals the specific image regions driving individual predictions. This dual interpretability approach is important for building clinical trust, as it shows that ECCADx's decisions are grounded in pathologically relevant image features rather than spurious correlations.

TL;DR: t-SNE visualization showed CL dramatically improved feature separation between AEH/EC and benign classes. Grad-CAM heatmaps confirmed the model attends to clinically relevant features like atypical vessels and focal necrosis. Only 2 false negatives occurred (both mimicked benign polyps), and all external AEH/EC cases were correctly detected.

8. Limitations and Future Directions

Imbalanced datasets and inter-rater variability: The TJH/ZZSH external dataset contained only 16 AEH/EC cases out of 105 total patients, creating a significant class imbalance that challenged both the model and human evaluators. Human diagnostic accuracy on this dataset ranged from 76.2% to 86.7%, reflecting notable inter-rater variability among endoscopists. The model was trained exclusively on MCH data, and performance was somewhat lower on external data due to inter-hospital differences in imaging equipment, patient demographics, and image quality. The variability inherent in hysteroscopy images from different machine models (Olympus vs. Karl Storz) introduces differences in color rendition, image size, and visual nuances that affect generalization.

Binary classification limitation: ECCADx currently performs only binary classification, distinguishing AEH/EC from non-cancerous conditions. This simplification overlooks the clinical need for granular differentiation among various pathological subtypes. Benign conditions include diverse types such as simple hyperplasia, complex hyperplasia without atypia, polyps, and submucosal myomas, each with distinct clinical implications. EC itself comprises multiple histological types and grades that are critical for determining prognosis and guiding treatment strategies. A multi-class classification system would be substantially more useful in clinical practice.

Retrospective design: All data were collected retrospectively, which introduces inherent selection biases. The study's cross-institutional design partially mitigates this concern, but prospective validation is needed before clinical deployment. The authors have not yet tested ECCADx in a real-time clinical workflow, where factors like image acquisition speed, integration with existing hysteroscopy systems, and clinician interaction with the AI output would all influence practical performance.

Planned next steps: The authors outline two primary future directions. First, they plan to extend ECCADx to support multi-class classification, enabling differentiation across the full spectrum of endometrial pathologies rather than just binary AEH/EC versus benign. Second, they intend to conduct prospective, randomized multicenter studies to rigorously evaluate ECCADx's real-world performance across diverse patient populations and clinical environments. On the deployment side, because ECCADx is built on a standard ResNet-50 architecture, it can run on conventional medical workstations with modern GPUs or dedicated edge computing devices, processing each image in milliseconds and supporting real-time auxiliary diagnosis during hysteroscopic examinations.

TL;DR: Key limitations include binary-only classification (no subtype differentiation), retrospective design, class imbalance in the external dataset (only 16 AEH/EC out of 105 patients), and training on single-center data. Future plans include multi-class classification, prospective multicenter trials, and real-time clinical integration leveraging the model's millisecond inference speed.
Citation: Wang W, Cai Y, Guo Z, et al. Open Access, 2025. Available at: PMC12283552. DOI: 10.1016/j.isci.2025.113045. License: CC BY.