The Efficacy of Deep Learning Models in the Diagnosis of Endometrial Cancer Using MRI: A Comparison with Radiologists

PMC, 2022

Plain-English Explanations
Pages 1-2
Why Diagnosing Endometrial Cancer on MRI Is Ripe for Deep Learning

Endometrial cancer is the sixth most common malignant disorder in women worldwide, with approximately 417,000 new cases and 97,000 deaths reported in 2020 alone. The incidence is rising globally, making early and accurate diagnosis increasingly important. Surgery and biopsy remain the standards for staging, but MRI plays a critical supporting role in preoperative evaluation by predicting myometrial invasion depth, cervical stromal involvement, and lymph node metastases. In situations where biopsy is not possible, such as closure of the internal uterine ostium or in patients who have not had sexual intercourse, MRI may be the primary diagnostic tool for detecting the presence of cancer.

The deep learning opportunity: Convolutional neural networks (CNNs) have demonstrated remarkable performance in image pattern recognition tasks, including segmentation, lesion detection, and classification across imaging modalities such as ultrasound, radiograph, CT, and MRI. Despite this progress, no prior study had developed a CNN specifically to diagnose the presence of endometrial cancer. Furthermore, few studies had investigated which MRI sequences and cross-sections are optimal for deep learning-based classification of tumors.

Study objectives: This retrospective study from the University of Tsukuba set out to accomplish three goals. First, it constructed CNNs for diagnosing endometrial cancer using several MRI sequences (T2-weighted images, ADC maps, and contrast-enhanced T1-weighted images) in both axial and sagittal cross-sections. Second, it compared the diagnostic performance of these CNNs against three board-certified radiologists with 27, 26, and 9 years of pelvic MRI experience. Third, it tested whether adding different types of image sets to the training data could improve CNN diagnostic performance.

The study used data from patients scanned between January 2015 and May 2020. The investigators organized their work into two experiments: Experiment 1 compared CNN performance against radiologists on single and combined image sets, while Experiment 2 tested the impact of training data augmentation with images from different sequences and cross-sections.

TL;DR: This is the first study to develop a CNN for diagnosing the presence of endometrial cancer on MRI. It compared deep learning models against three expert radiologists (9 to 27 years of experience) across multiple MRI sequences and cross-sections using data from 485 patients collected between 2015 and 2020.
Pages 2-4
Patient Selection, MRI Protocols, and Dataset Construction

Inclusion and exclusion criteria: The study enrolled women over 20 years of age who underwent pelvic MRI at the University of Tsukuba Hospital. The cancer group required hysterectomy with pathological confirmation of endometrial cancer. The non-cancer group included patients with pathologically or clinically confirmed benign lesions. Patients with a history of treatment for uterine diseases were excluded, as were those with macroscopically non-mass-forming cancers. A total of 485 women (mean age 52 years, range 21 to 91) were evaluated.

Training and testing split: The 485 patients were randomly assigned to training (388 patients: 204 cancer, 184 non-cancer) and testing (97 patients: 51 cancer, 46 non-cancer) groups. For training, 2,905 axial images per sequence (1,471 cancer, 1,434 non-cancer) and 1,105 sagittal images (624 cancer, 481 non-cancer) were used. For testing, only one central image per stack was extracted per patient, yielding 97 test images per sequence and cross-section. Two radiologists reached consensus on which image slices depicted the tumor.
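The random assignment above operates at the patient level, which keeps slices from one patient out of both sets. A minimal sketch of such a split (an illustrative reconstruction, not the authors' code; the function name and seed are assumptions):

```python
import random

def split_patients(patient_ids, test_fraction=0.2, seed=42):
    """Randomly split patient IDs into training and test groups.

    Splitting at the patient level (not the image level) prevents
    slices from the same patient leaking across the two sets.
    """
    rng = random.Random(seed)
    ids = list(patient_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]  # (train, test)

# 485 patients split as in the study: 388 train / 97 test
train, test = split_patients(range(485), test_fraction=97 / 485)
print(len(train), len(test))  # 388 97
```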

MRI acquisition: Scans were performed on either 3T or 1.5T Philips equipment (Ingenia or Achieva) with a 32-channel phased-array body coil. The protocol captured T2-weighted images (T2WI), diffusion-weighted images (DWI with b-values of 0 and 1000), and contrast-enhanced T1-weighted images (CE-T1WI) in the equilibrium phase using gadolinium-based contrast. Both sagittal and axial cross-sections were obtained along the uterine axis.

Image preparation: DICOM images were converted to JPEG format and resized to 240 x 240 pixels. Five single image sets (axial T2WI, sagittal T2WI, axial ADC map, axial CE-T1WI, sagittal CE-T1WI) and four combined image sets were created. Combined sets were formed by vertically stacking axial images (240 x 480 or 240 x 720 pixels) or horizontally combining sagittal images (480 x 240 pixels) using ImageMagick. Notably, the entire pelvic image was used for each slice rather than a cropped image of the uterus alone.
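The combined-set geometry can be reproduced with a few lines of array stacking. The study used ImageMagick on JPEG exports; this numpy sketch only mirrors the resulting layouts, with shapes reported in numpy's (rows, columns) order rather than the width x height convention used above:

```python
import numpy as np

def combine_axial(*images):
    """Vertically stack axial slices: two 240 x 240 slices give a
    240-wide, 480-tall image; three give 240 x 720 (width x height)."""
    return np.vstack(images)

def combine_sagittal(*images):
    """Horizontally stack sagittal slices: two 240 x 240 slices give
    a 480-wide, 240-tall image (width x height)."""
    return np.hstack(images)

# Dummy 240 x 240 grayscale slices standing in for real MRI exports
t2, adc, ce = (np.zeros((240, 240), dtype=np.uint8) for _ in range(3))
print(combine_axial(t2, adc).shape)      # (480, 240) -> 240 x 480 px
print(combine_axial(t2, adc, ce).shape)  # (720, 240) -> 240 x 720 px
print(combine_sagittal(t2, ce).shape)    # (240, 480) -> 480 x 240 px
```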

TL;DR: 485 patients (255 cancer, 230 non-cancer) were split into training (n = 388) and testing (n = 97). MRI was acquired at 3T or 1.5T across T2WI, ADC, and CE-T1WI sequences. The training set included 2,905 axial and 1,105 sagittal images per sequence, while the test set used one central image per patient per sequence (97 images).
Pages 4-5
CNN Architecture, Training Configuration, and Statistical Analysis

Network architecture: The study used Xception, a CNN architecture characterized by depthwise separable convolutions that enable more efficient use of model parameters compared to earlier architectures. The network was pre-trained on ImageNet (a large-scale dataset of natural images) and then fine-tuned on the MRI data. Deep learning was performed on a Deep Station Entry workstation (UEI, Tokyo) equipped with a GeForce RTX 2080Ti GPU and an Intel Core i7-8700 CPU, using the graphical deep learning software Deep Analyzer (GHELIA, Tokyo).

Training parameters: The Adam optimizer was used with a learning rate of 0.0001, beta1 of 0.9, beta2 of 0.999, and epsilon of 1e-7. The batch size was automatically selected. Data augmentation included horizontal flipping, rotation (plus or minus 4.5 degrees), shearing (0.05), and zooming (0.05). CNNs were generated by varying the training/validation split ratio (9:1, 8:2, or 7:3) and the number of epochs (50, 100, 200, 500, or 1000). For each image set, the best-performing configuration with both sensitivity and specificity above 0.75 was selected.
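The configuration sweep (three split ratios times five epoch counts, keeping the best model whose sensitivity and specificity both exceed 0.75) could be organized as below. The metrics are fabricated for illustration, and `select_best` is a hypothetical helper; in the study each configuration corresponds to one trained CNN:

```python
from itertools import product

# Grid explored in the study: train/validation split ratios and epoch counts
SPLIT_RATIOS = ["9:1", "8:2", "7:3"]
EPOCHS = [50, 100, 200, 500, 1000]

def select_best(results):
    """Pick the configuration with the highest AUC among those whose
    sensitivity and specificity both exceed 0.75 (the study's criterion).

    `results` maps (split_ratio, epochs) -> dict with 'auc',
    'sensitivity', and 'specificity' from that configuration's CNN.
    """
    eligible = {
        cfg: m for cfg, m in results.items()
        if m["sensitivity"] > 0.75 and m["specificity"] > 0.75
    }
    if not eligible:
        return None
    return max(eligible, key=lambda cfg: eligible[cfg]["auc"])

# Hypothetical metrics for illustration only
fake = {cfg: {"auc": 0.80, "sensitivity": 0.80, "specificity": 0.80}
        for cfg in product(SPLIT_RATIOS, EPOCHS)}
fake[("8:2", 200)] = {"auc": 0.95, "sensitivity": 0.94, "specificity": 0.87}
print(select_best(fake))  # ('8:2', 200)
```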

Radiologist comparison protocol: Three board-certified radiologists (with 27, 26, and 9 years of pelvic MRI experience) independently reviewed the 97 randomly ordered test images for each image set. They were blinded to clinical and pathological findings and rated cancer confidence on a 6-point scale: 0.0 (definitely absent), 0.2 (probably absent), 0.4 (possibly absent), 0.6 (possibly present), 0.8 (probably present), and 1.0 (definitely present). Scores of 0.0 to 0.4 were classified as non-cancer, and 0.6 to 1.0 as cancer. For the CNN, output values below 0.50 were classified as non-cancer and 0.50 to 1.0 as cancer.
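The two binarization rules amount to simple thresholds; a sketch with hypothetical helper names, using the cutoffs stated in the protocol above:

```python
def reader_label(score):
    """Binarize a radiologist's 6-point confidence score:
    0.0-0.4 -> non-cancer, 0.6-1.0 -> cancer."""
    return "cancer" if score >= 0.6 else "non-cancer"

def cnn_label(output):
    """Binarize the CNN's continuous output:
    below 0.50 -> non-cancer, 0.50 and above -> cancer."""
    return "cancer" if output >= 0.50 else "non-cancer"

print(reader_label(0.4), cnn_label(0.49), cnn_label(0.50))
# non-cancer non-cancer cancer
```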

Statistical analysis: Sensitivity, specificity, accuracy, and area under the ROC curve (AUC) were calculated with 95% confidence intervals. ROC analysis evaluated diagnostic performance across all conditions. Interobserver agreement was assessed using kappa (κ) statistics. All analyses were performed in EZR (a graphical interface for R) and SPSS Statistics 27.0, with P less than 0.05 considered significant.
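Hand-rolled versions of the confusion-matrix metrics and Cohen's kappa, assuming binary labels with 1 = cancer, might look like this (illustrative only; the study used EZR and SPSS):

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from binary labels (1 = cancer)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / len(y_true)}

def cohen_kappa(a, b):
    """Cohen's kappa for two binary raters (chance-corrected agreement)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    p_pos = (sum(a) / n) * (sum(b) / n)          # both rate 1 by chance
    p_neg = (1 - sum(a) / n) * (1 - sum(b) / n)  # both rate 0 by chance
    pe = p_pos + p_neg                           # chance agreement
    return (po - pe) / (1 - pe)

# Toy labels for eight cases (fabricated, not study data)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
cnn    = [1, 1, 1, 0, 0, 0, 1, 0]
reader = [1, 1, 0, 0, 0, 0, 0, 0]
print(binary_metrics(y_true, cnn))       # all three metrics = 0.75
print(round(cohen_kappa(cnn, reader), 2))  # 0.5
```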

TL;DR: The Xception architecture (pre-trained on ImageNet) was fine-tuned with Adam optimizer (learning rate 0.0001) on a GeForce RTX 2080Ti. Training/validation splits of 7:3 to 9:1 and epochs of 50 to 1000 were tested. Three radiologists (9 to 27 years of experience) independently rated 97 test images on a 6-point confidence scale.
Pages 6-8
CNN vs. Radiologists: Single and Combined Image Sets

Best single image set performance: The CNN achieved its highest diagnostic performance on the axial ADC map, with an AUC of 0.95 (95% CI: 0.91 to 1.00), sensitivity of 0.94 (0.87 to 0.98), specificity of 0.87 (0.79 to 0.91), and accuracy of 0.91 (0.83 to 0.95). By comparison, the three radiologists achieved AUCs of only 0.78, 0.77, and 0.77 on the same image set, all significantly lower than the CNN (P less than 0.001 for all three). This was the largest performance gap observed in the entire study.

Other single image sets: Across the remaining four single image sets, the CNN also performed well. On axial T2WI, the CNN achieved an AUC of 0.90 (0.84 to 0.96), significantly outperforming Reader 2 (AUC 0.77, P = 0.015). On sagittal T2WI, the CNN's AUC was 0.88 (0.81 to 0.95), with no significant difference from the radiologists. On axial CE-T1WI, the CNN reached an AUC of 0.93 (0.87 to 0.98), significantly better than all three radiologists (P = 0.006 for Reader 1, P = 0.002 for Reader 2, P = 0.014 for Reader 3). On sagittal CE-T1WI, the CNN's AUC was 0.90 (0.84 to 0.97), with no significant differences from the radiologists.

Combined image sets: The four combined image sets yielded AUCs ranging from 0.87 to 0.93. Interestingly, the combined axial T2WI + ADC map + CE-T1WI set produced the lowest AUC among all CNN results at 0.87 (0.80 to 0.94). This was unexpected because combining more sequences was anticipated to provide richer information. The combined axial T2WI + ADC map achieved an AUC of 0.93 (0.88 to 0.98), which was significantly higher than Reader 1's AUC of 0.58 (P less than 0.001). The combined image sets generally showed comparable performance to the radiologists but did not consistently outperform the single image sets.

Key pattern: The CNN tended to achieve higher sensitivity (ranging from 0.80 to 0.94 across image sets), while the radiologists tended to have higher specificity, sometimes reaching 0.96 to 1.00 on certain sequences. This suggests the CNN was better at detecting cancer cases, while the radiologists were more conservative: they misclassified fewer non-cancer cases as cancer but missed more cancers.

TL;DR: The CNN's best performance was on axial ADC maps (AUC 0.95, significantly better than all three radiologists at P less than 0.001). On axial CE-T1WI, the CNN also significantly outperformed all three radiologists (AUC 0.93 vs. 0.84). Combined image sets ranged from AUC 0.87 to 0.93 but did not consistently beat single image sets.
Pages 8-10
Where the CNN and Radiologists Disagreed, and What the False Negatives Reveal

Kappa statistics: The interobserver agreement (kappa) between the CNN and the three radiologists ranged widely from 0.32 to 0.81, which was generally lower than the agreement among the radiologists themselves. This finding suggests that the CNN may have used a fundamentally different visual strategy than human readers when interpreting MRI images. The highest agreement between the CNN and radiologists occurred on CE-T1WI sequences and combined sets that included CE-T1WI, where the contrast between tumor and myometrium is most visually distinct.

False negatives by radiologists only: In one illustrative case involving the axial ADC map, a 55-year-old woman with grade 1 endometrioid carcinoma had a tiny tumor filling the uterine cavity. The CNN correctly identified this as cancer with 99.9% confidence, while all three radiologists missed it. This highlights the CNN's potential advantage in detecting subtle signal changes on ADC maps that may be difficult for the human eye to perceive, particularly in low-spatial-resolution images.

False negatives by the CNN only: In another case, a 34-year-old woman with grade 1 endometrioid carcinoma had a massive tumor protruding into the myometrium of the posterior uterine wall. All three radiologists correctly identified this cancer, but the CNN assigned only 18.8% confidence. This case involved a tumor with clear morphological features that radiologists easily recognized but that apparently confused the CNN, possibly because the tumor's appearance did not match the patterns learned during training.

Shared false negatives: Both the CNN and all three radiologists failed in the case of a 31-year-old woman with grade 2 endometrioid carcinoma. The CNN confidence was only 22.5%. The tumor filled the uterine cavity, and the slight decrease in the ADC signal may have been insufficient to trigger detection by either the CNN or the human readers when viewed as a single image without additional sequences for comparison.

TL;DR: CNN-radiologist kappa ranged from 0.32 to 0.81, indicating the CNN used different visual cues than human readers. The CNN caught subtle ADC signal changes that all three radiologists missed (99.9% confidence), while radiologists recognized morphologically obvious tumors the CNN missed (18.8% confidence). Some cases fooled both.
Pages 11-12
Can Adding Different Image Types to Training Data Improve CNN Performance?

Experiment design: Experiment 2 investigated whether training the CNN on images from different sequences or cross-sections (beyond the same type used for testing) could boost diagnostic accuracy. For each of the five single test image sets, the CNN was trained using: (1) all images of the same sequence regardless of cross-section, (2) all images of the same cross-section regardless of sequence, and (3) all available images regardless of sequence and cross-section.
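A minimal sketch of assembling the three expanded training conditions, assuming a catalogue keyed by (sequence, cross-section) with hypothetical image names; `None` stands for "regardless of" that attribute:

```python
# Hypothetical catalogue: (sequence, cross_section) -> training image IDs
catalogue = {
    ("T2WI", "axial"):       ["t2_ax_001", "t2_ax_002"],
    ("T2WI", "sagittal"):    ["t2_sag_001"],
    ("ADC", "axial"):        ["adc_ax_001"],
    ("CE-T1WI", "axial"):    ["ce_ax_001"],
    ("CE-T1WI", "sagittal"): ["ce_sag_001"],
}

def expanded_training_set(catalogue, sequence=None, cross_section=None):
    """Collect training images matching the given sequence and/or
    cross-section; None means 'regardless of' that attribute, mirroring
    the three expanded conditions of Experiment 2."""
    return [img
            for (seq, cs), imgs in catalogue.items()
            for img in imgs
            if (sequence is None or seq == sequence)
            and (cross_section is None or cs == cross_section)]

# The three expanded conditions for a sagittal T2WI test set:
same_seq = expanded_training_set(catalogue, sequence="T2WI")           # (1)
same_cs  = expanded_training_set(catalogue, cross_section="sagittal")  # (2)
all_imgs = expanded_training_set(catalogue)                            # (3)
print(len(same_seq), len(same_cs), len(all_imgs))  # 3 2 6
```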

Sequences that benefited: For sagittal T2WI, adding other image types improved the AUC from 0.88 (baseline) to 0.91 or 0.92 across all three expanded training conditions. For sagittal CE-T1WI, the most dramatic improvement occurred when all available images were used for training, boosting the AUC from 0.90 to 0.95 (0.89 to 1.00). For axial T2WI, training with all T2WI images improved the AUC from 0.90 to 0.94, while training with all available images yielded 0.91. None of these improvements reached statistical significance, but the trend was consistent.

Sequences that did not benefit: For axial ADC maps, which already had the highest baseline AUC of 0.95, adding other image types either maintained or slightly reduced performance (AUC 0.89 to 0.93 with expanded training). Similarly, for axial CE-T1WI (baseline AUC 0.93), adding other types did not improve results and in one condition (all axial images) dropped the AUC to 0.88. These findings suggest that when a sequence already provides highly distinctive cancer features, mixing in less informative images may introduce noise rather than useful variation.

Cross-section vs. sequence additions: The study found that adding other cross-sections of the same sequence was especially beneficial, likely because images from the same sequence share similar signal intensity characteristics even when acquired in different planes. Adding different sequences from the same cross-section contributed similar morphological information. The authors suggest that for sequences with limited training data (such as sagittal views, which had fewer images than axial), the additional data from other types had the greatest impact.

TL;DR: Adding cross-sections and sequences to training improved AUC for sagittal T2WI (0.88 to 0.92), sagittal CE-T1WI (0.90 to 0.95), and axial T2WI (0.90 to 0.94), but did not help axial ADC (already at 0.95) or axial CE-T1WI (0.93). Same-sequence, different-cross-section additions were most beneficial.
Pages 12-13
Single-Slice Evaluation, JPEG Conversion, and Other Design Constraints

Single-image testing: The most significant limitation is that only one selected image per patient was evaluated during testing. In clinical practice, radiologists review an entire series of images to form their assessment. Evaluating a single central slice does not reflect real-world diagnostic conditions and likely underestimates the performance of both the CNN and the radiologists. A series-level evaluation would provide a more clinically meaningful comparison.

JPEG conversion: The DICOM images were converted to JPEG format because the deep learning software could not handle DICOM data directly. JPEG compression is lossy, meaning that some image information (including potentially diagnostic pixel-level data) was discarded during conversion. The images were also resized to 240 x 240 pixels, which is smaller than the original acquisition matrices (up to 704 x 704). These processing steps may have reduced the available diagnostic information for both the CNN and the radiologists.

Non-cancer group composition: Some patients in the non-cancer group (47 in training, 8 in testing) were classified based on clinical and imaging findings rather than pathological confirmation. Additionally, the classification of atypical endometrial hyperplasia as benign is debatable since it is a known precursor lesion for endometrial cancer. However, the authors reasoned that excluding atypical hyperplasia would be arbitrary and that classifying it as benign aligned with the study's goal of detecting frank cancer.

No dynamic contrast study: The study used only equilibrium-phase contrast-enhanced images and did not examine dynamic contrast studies, which can provide additional information about tumor vascularity and the degree of myometrial invasion. While the equilibrium phase offers the greatest contrast between tumor and myometrium, dynamic imaging might have provided complementary diagnostic data. The authors chose to limit the analysis to avoid excessive complexity.

Single-center design: All data came from a single institution (University of Tsukuba Hospital) using Philips MRI equipment. The CNN's performance on data from other institutions, other MRI vendors, and different acquisition protocols remains unknown. External validation would be necessary to confirm the generalizability of these results.

TL;DR: Key limitations include single-image (not full-series) testing, lossy JPEG conversion from DICOM, images resized to 240 x 240 pixels, some non-cancer cases without pathological confirmation, no dynamic contrast imaging, and single-center data from one MRI vendor (Philips). All of these factors limit real-world generalizability.
Pages 13-14
What Comes Next for Deep Learning in Endometrial Cancer Diagnosis

The study demonstrated that CNNs achieved high diagnostic performance for detecting endometrial cancer on MRI, with AUCs ranging from 0.87 to 0.95 across all tested image sets. On two specific image types, axial ADC maps and axial CE-T1WI, the CNN significantly outperformed all three expert radiologists. This finding is particularly notable for ADC maps, where the CNN's AUC of 0.95 far exceeded the radiologists' AUCs of 0.77 to 0.78. The low spatial resolution of ADC maps makes them challenging for human interpretation but apparently provides CNN-friendly contrast patterns for distinguishing cancer from non-cancer tissue.

Three-dimensional analysis: One of the most promising future directions is moving from two-dimensional single-slice analysis to three-dimensional volumetric evaluation. Mehrtash et al. previously demonstrated the value of using three-dimensional prostate images for CNNs, and a similar approach for endometrial cancer could capture spatial relationships between adjacent slices that are invisible in single-slice analysis. This would also better align with clinical practice, where radiologists scroll through entire image stacks.

Improved data handling: Future studies should work directly with DICOM data rather than converting to JPEG, preserving the full bit depth and resolution of the original acquisitions. Incorporating clinical data such as tumor markers, patient age, and other relevant factors into the model could provide additional diagnostic context. Multi-vendor and multi-institutional datasets would test and improve the model's robustness across different scanning environments.

Broader implications: The finding that combined image sets did not consistently outperform single image sets, combined with the observation that adding more training data from related sequences can improve performance on weaker image types, has practical implications for how deep learning studies in radiology should be designed. Rather than always combining multiple sequences into multi-channel inputs, it may be more effective to optimize single-sequence models and augment their training data with related image types. This approach could simplify both model architecture and clinical deployment.

TL;DR: CNNs achieved AUCs of 0.87 to 0.95 across all conditions, significantly outperforming radiologists on axial ADC maps (0.95 vs. 0.77 to 0.78) and axial CE-T1WI (0.93 vs. 0.84). Future work should use 3D volumetric analysis, native DICOM data, multi-center validation, and clinical data integration to move toward real-world deployment.
Citation: Urushibara A, Saida T, Mori K, et al. Open access, 2022. PMC9063362. DOI: 10.1186/s12880-022-00808-3. License: CC BY.