Colorectal cancer (CRC) remains one of the most common and deadly cancers worldwide, with 15,410 new cases diagnosed in Taiwan alone in 2016. The annual incidence rate of CRC in Taiwan stands at 0.044%, and the mortality rate at 0.015%, across both sexes. Colonoscopy is considered the gold standard for detecting precancerous polyps, and research from the United States has shown that it can reduce CRC mortality by 65% to 75% when polyps are found and removed early.
The polyp miss problem is significant. Previous studies have demonstrated that an average of 22% to 28% of polyps and 20% to 24% of potentially cancerous adenomas are missed during standard colonoscopy examinations. Polyps can be overlooked due to their uncommon shapes (flat or very small) or simple operator error. Critically, it is estimated that for every 1% increase in the polyp detection rate, the incidence of CRC could be further reduced by 3%, making even modest improvements in detection clinically meaningful.
Two main polyp types drive clinical decisions. Hyperplastic polyps most frequently appear in the sigmoid colon and rectum, display tiny flat mucosal protrusions, and carry a very low probability of malignant transformation. Adenomatous polyps, on the other hand, serve as direct precursors to cancer and can grow in various parts of the large intestine. Among adenomatous polyps, the villous subtype carries the highest risk of becoming cancerous. Distinguishing between these two categories is essential for determining the appropriate treatment.
Data scale sets this study apart. The researchers collected colonoscopy images from 5,000 colorectal cancer patients at Fu-Jen Catholic University Hospital (FJUH) between September 2017 and September 2020, using the Picture Archiving and Communication System (PACS). A total of 430,921 images were initially gathered for the detection task. After preprocessing to remove blurry, low-contrast, underexposed, or overexposed images, 256,220 images remained for training and evaluation.
The study uses a two-model pipeline. The first model handles polyp detection (finding where polyps are in the image), while the second handles polyp classification (determining whether a detected polyp is adenomatous or non-adenomatous). For detection, 85% of the data (218,720 images) was used for fivefold cross-validation, with the remaining 15% reserved for testing. For classification, 17,485 images were collected, with 5,394 images remaining after preprocessing for fivefold cross-validation.
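The splitting scheme above can be sketched with a few lines of stdlib Python. This is an illustrative frame-level split, not the authors' code; note that a naive 85% of 256,220 gives 217,787 images, slightly fewer than the paper's reported 218,720, presumably because the actual split was not done frame by frame.

```python
import random

def split_for_cv(n_images, test_frac=0.15, n_folds=5, seed=0):
    """Shuffle frame indices, hold out a test set, split the rest into CV folds."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)
    n_test = round(n_images * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    # Round-robin assignment keeps the five folds within one frame of equal size.
    folds = [train[f::n_folds] for f in range(n_folds)]
    return train, test, folds

train, test, folds = split_for_cv(256_220)
```

In practice a patient-level split (all frames from one patient in the same partition) would be preferable to avoid near-duplicate frames leaking between training and test sets.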
Rigorous annotation protocol. All images containing polyps were submitted to three experienced colorectal surgeons and gastroenterologists for annotation and classification. Images were classified according to the Japan NBI Expert Team (JNET) classification system using Narrow Band Imaging (NBI). An image was only included in the deep learning model if at least two of the three physicians agreed on both the annotated area and the polyp classification, ensuring high-quality ground truth labels.
External validation was built into the design. After internal training and testing, the team performed external validation using data from three separate hospitals. This included both a prospective arm (150 colonoscopy videos from 150 patients, yielding 516 polyps) and a retrospective arm (385 NBI images with matching pathology reports). This dual-mode external validation is uncommon in the field and strengthens the generalizability claims.
The detection model is built on a Convolutional Neural Network (CNN). The team selected stochastic gradient descent (SGD) as the optimizer, with an initial learning rate of 0.01 and weight decay of 5 x 10^-4. The box loss gain and class loss gain were set at 0.05 and 0.5, respectively. Before training on their hospital data, the team conducted a pilot study using 10,000 colonoscopy polyp images from the publicly available Kvasir dataset, which confirmed that a CNN-based detector offered the best combination of speed and accuracy for this object detection task.
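The cited hyperparameters correspond to a standard SGD update with L2 weight decay, in which the decay term is folded into the gradient before the step. A minimal sketch (not the authors' code) of a single scalar parameter update under those settings:

```python
def sgd_step(w, grad, lr=0.01, weight_decay=5e-4):
    """One plain SGD update with coupled L2 weight decay:
    the penalty term weight_decay * w is added to the gradient."""
    return w - lr * (grad + weight_decay * w)

# Example: weight 1.0, gradient 0.2, using the paper's lr and weight decay.
w = sgd_step(1.0, grad=0.2)  # 1.0 - 0.01 * (0.2 + 0.0005) = 0.997995
```

Real training would apply this per tensor (typically with momentum), but the arithmetic per parameter is exactly this.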
The classification model uses EfficientNet-b0. This architecture, introduced by Tan and Le in 2019, systematically scales network width, depth, and resolution using a compound scaling method. The team applied transfer learning from ImageNet pretrained weights, which has been shown to boost model performance significantly and reduce training times. The Adam optimizer was used with an initial learning rate of 1 x 10^-4, zero weight decay, and training ran for 150 epochs.
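The compound scaling method from Tan and Le's paper ties depth, width, and resolution to a single coefficient phi; the constants below (alpha = 1.2, beta = 1.1, gamma = 1.15) are the values reported in that paper, and b0 is the phi = 0 baseline used in this study:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """EfficientNet compound scaling (Tan & Le, 2019): for roughly 2**phi
    the FLOPS of the baseline, scale depth, width, and input resolution jointly."""
    return {"depth": alpha**phi, "width": beta**phi, "resolution": gamma**phi}

# phi = 0 is the EfficientNet-b0 baseline: no scaling applied.
scales = compound_scale(0)  # all factors equal 1.0
```

The constants are chosen so that alpha * beta**2 * gamma**2 is approximately 2, i.e. each increment of phi roughly doubles compute.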
The system was named EndoAIM and deployed into a graphical user interface (GUI). When a polyp appears on the endoscopic screen, the system immediately marks it with a green bounding box. A yellow warning map is displayed for two seconds to reinforce the alert. A thumbnail reminder function was also added so that the most recently detected polyp image persists on screen, preventing physicians from missing lesions that flash by briefly. For classification, the system switches to NBI display mode and classifies detected polyps as adenomatous or non-adenomatous.
Internal testing performance was exceptional. On the held-out test set, the polyp detection model achieved a sensitivity of 0.9709 (95% CI: 0.9646 to 0.9757) and a specificity of 0.9701 (95% CI: 0.9663 to 0.9749). The AUC score reached 0.9902, which the authors describe as state-of-the-art. The mean average precision (mAP) at IoU thresholds of 0.5 to 0.95 was 0.8845, indicating strong localization accuracy across varying overlap thresholds.
These numbers translate to real clinical value. A sensitivity of 97.09% means the model correctly identified polyps in approximately 97 out of every 100 frames containing polyps. A specificity of 97.01% means it correctly identified normal tissue in about 97 out of every 100 polyp-free frames. This balance between sensitivity and specificity is crucial for a clinical tool, as high sensitivity alone would generate excessive false alarms, while high specificity alone would miss too many true polyps.
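The two rates are simple ratios over the confusion-matrix counts; the example counts below are illustrative (the paper reports rates, not raw frame counts), chosen to reproduce the reported values:

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of polyp-containing frames that were flagged."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: fraction of polyp-free frames left unflagged."""
    return tn / (tn + fp)

# Hypothetical: 10,000 polyp frames and 10,000 clean frames at the reported rates.
sens = sensitivity(tp=9709, fn=291)  # 0.9709
spec = specificity(tn=9701, fp=299)  # 0.9701
```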
The mAP metric adds spatial precision. Unlike sensitivity and specificity, which measure whether a polyp is present or absent, mAP evaluates how accurately the model draws its bounding box around the polyp. An mAP of 0.8845 at the strict 0.5:0.95 IoU range indicates that the model not only detects polyps reliably but also localizes them with high spatial precision, which is critical for guiding endoscopists to the exact location of a lesion.
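The IoU underlying the mAP metric is the overlap area of the predicted and ground-truth boxes divided by the area of their union; a detection counts as correct at threshold t only if IoU >= t. A minimal sketch for axis-aligned (x1, y1, x2, y2) boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle; clamp to zero when the boxes do not overlap.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

# Two equal boxes offset by half their width overlap with IoU = 1/3,
# enough to pass the 0.5 threshold? No -- it would be scored as a miss there.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

mAP at 0.5:0.95 averages the precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which is why it rewards tight localization rather than mere presence.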
The classification model performed at near-perfect levels. When distinguishing adenomatous from non-adenomatous polyps, the EfficientNet-b0 model achieved a sensitivity of 0.9889, a specificity of 0.9778, and an F1 score of 0.9834. The AUC score reached 0.9989 (95% CI: 0.9954 to 1.00), meaning the model could almost perfectly separate the two polyp categories across all classification thresholds.
Why this distinction matters clinically. Adenomatous polyps are direct precursors to colorectal cancer and require removal, while hyperplastic polyps generally carry negligible malignant potential. Current clinical practice requires physicians to bring the endoscopic lens close to a polyp and switch to Narrow Band Imaging (NBI) to visually assess polyp characteristics. Even with NBI, interobserver variability among endoscopists is a well-documented problem. The deep learning model can reduce this variability by providing a consistent, objective classification.
The high F1 score (0.9834) reflects balanced performance. The F1 score is the harmonic mean of precision and recall, meaning the model avoids both excessive false positives (incorrectly calling non-adenomas dangerous) and excessive false negatives (missing true adenomas). In clinical terms, this balance means the system neither over-treats patients with unnecessary biopsies nor under-treats patients by missing precancerous lesions.
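The harmonic-mean definition also lets us back-solve the precision implied by the reported F1 and sensitivity (recall) figures; the helper below is for illustration only, not something reported in the paper:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: punishes imbalance between the two."""
    return 2 * precision * recall / (precision + recall)

def implied_precision(f1, recall):
    """Back-solve precision from a reported F1 and recall."""
    return f1 * recall / (2 * recall - f1)

# With F1 = 0.9834 and recall = 0.9889, the implied precision is about 0.978,
# i.e. roughly 2 of every 100 positive calls would be false alarms.
p = implied_precision(0.9834, 0.9889)
```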
Prospective detection validation spanned three hospitals. The model was tested on 150 colonoscopy videos (50 per hospital) from 150 patients, encompassing 516 polyps total. The average lesion-based sensitivity reached 0.9516 (95% CI: 0.9295 to 0.9670), with individual hospital results of 0.9817 (Hospital A), 0.9389 (Hospital B), and 0.9360 (Hospital C). The frame-based specificity averaged 0.9720 (95% CI: 0.9713 to 0.9726), with individual values of 0.9676, 0.9833, and 0.9605 for hospitals A, B, and C, respectively.
Retrospective classification validation confirmed robustness. Using 385 NBI images with matching pathology reports (193 non-adenoma, 192 adenoma), the model achieved an average AUC of 0.9521 (95% CI: 0.9308 to 0.9734). Hospital A yielded the highest AUC at 0.9947 (95% CI: 0.9817 to 1.00), while Hospital C had the lowest at 0.9207 (95% CI: 0.8749 to 0.9665). Importantly, the pathology report for each patient served as the ground truth, providing a more reliable reference standard than endoscopist judgment alone.
The performance gap between internal and external validation is expected but modest. Detection sensitivity dropped from 0.9709 (internal) to 0.9516 (external), a difference of about 2 percentage points. Classification AUC dropped from 0.9989 to 0.9521. While the classification drop is larger, an AUC above 0.95 still represents strong discriminative ability. The variation across hospitals (Hospital A consistently outperforming Hospital C) likely reflects differences in equipment, imaging conditions, and patient populations.
The dual validation approach is a notable strength. Using prospective data for detection and retrospective data for classification tests the model under conditions that closely approximate real clinical deployment. Prospective video data captures the full complexity of live colonoscopy, including varying bowel preparation quality and camera angles, while retrospective NBI images with pathology confirmation provide the most reliable ground truth for classification accuracy.
Dataset size is a key differentiator. The authors trained on 256,220 images from 5,000 patients, which is substantially larger than most comparable studies. Yamada et al. (2019) achieved slightly higher sensitivity (97.3%) and specificity (99.0%), but their dataset contained only 705 images from 752 lesions. Ozawa et al. (2020) used 27,598 images and reached 90% sensitivity with an 83% positive predictive value for white light images (improving to 97% sensitivity and 98% PPV with NBI). Another study achieved 98% accuracy using 11,300 images with a modified ZF-Net architecture.
The trade-off between dataset size and raw accuracy is important. A model trained on a larger, more diverse dataset with marginally lower accuracy may actually generalize better to real clinical settings than a model trained on a small, homogeneous dataset with slightly higher test accuracy. The authors explicitly note that comparing models trained on different datasets using common metrics like sensitivity and specificity does not allow for direct head-to-head performance benchmarking.
The EndoAIM system offers practical deployment advantages. Unlike many research prototypes, EndoAIM was deployed into an Olympus CV290 endoscopy machine and tested in real clinical workflows. The GUI includes real-time detection alerts, NBI-triggered classification, a yellow warning overlay, and a thumbnail reminder system. These practical features address the real-world problem of physicians missing brief polyp appearances during the dynamic process of colonoscopy.
External validation size remains a limitation. While the training dataset was large (256,220 images), the external validation sets were comparatively small: 150 colonoscopy videos for detection and 385 NBI images for classification. The authors acknowledge that this modest validation size does not sufficiently reflect the model's robustness and accuracy across the full range of clinical conditions. They plan to recruit more patients from additional health institutions in a future prospective study to address this gap.
The lack of common benchmarking datasets hinders the field. Although numerous public datasets exist (CVC-CLINIC, ETIS-LARIB, CVC-ColonDB, CVC-EndoSceneStill, and the Kvasir dataset), in-house hospital datasets remain closed to public access. This fragmentation makes it difficult to compare models on equal footing. A model reporting 98% accuracy on 11,300 images cannot be meaningfully compared to one reporting 97% accuracy on 256,220 images without testing both on the same external dataset.
Operator dependence persists. The screening guidelines for CRC indicate the necessity of removing precancerous polyps, but this process remains highly dependent on the experience of the colonoscopy practitioner. The deep learning system is designed to assist, not replace, the endoscopist. The final decision must always remain with the operator, though with AI assistance, that decision can be made with increased confidence and consistency.
Regulatory and policy frameworks are still catching up. The authors note that with numerous AI-based polyp detection models being developed, there is a pressing need for specific regulations and policies governing the use of deep learning systems in clinical endoscopy settings. Standardized evaluation protocols, transparent reporting of model limitations, and clear guidelines for clinical integration are necessary before these tools can achieve widespread adoption.