Detection of Colorectal Polyps from Colonoscopy Using Machine Learning: A Survey on Modern Techniques

PMC (Open Access), 2023

Plain-English Explanations
Pages 1-2
Why Polyp Detection Matters and Where Machines Can Help

Colorectal polyps are small growths that form in the lining of the colon. Most begin as harmless bumps, but if left undetected and untreated over time, they can transform into colorectal cancer (CRC). According to the International Agency for Research on Cancer (IARC), approximately 2 million individuals were diagnosed with CRC in 2020 alone, with nearly 1 million reported deaths. This makes CRC one of the leading causes of cancer mortality worldwide.

The standard method for finding polyps is colonoscopy, a procedure in which a camera-equipped tube is inserted into the patient's rectum after bowel preparation. However, physicians sometimes miss polyps due to fatigue, limited experience, or the inherent difficulty of spotting flat or small lesions. Factors such as age, obesity, family history, smoking, alcohol consumption, and intestinal conditions like Crohn's disease all contribute to polyp development, making reliable screening essential.

From a machine learning perspective, polyp detection involves training a model to recognize features representing a polyp from colonoscopy images or live video streams. These models can be designed for three objectives: polyp segmentation (drawing a mask around the polyp), polyp detection (placing a bounding box around it), or polyp classification (determining whether it is benign or malignant). Some systems combine multiple objectives, such as detection followed by classification.
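To make the three objectives concrete, here is a toy sketch (not from the survey) showing how the outputs relate: a segmentation model emits a pixel mask, and a detection-style bounding box can be derived from that mask. The `mask_to_bbox` helper is purely illustrative.

```python
import numpy as np

mask = np.zeros((6, 8), dtype=bool)   # toy segmentation mask (segmentation output)
mask[2:5, 3:7] = True                 # "polyp" pixels

def mask_to_bbox(mask):
    """Tightest bounding box (x_min, y_min, x_max, y_max) around a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

bbox = mask_to_bbox(mask)             # detection-style output derived from the mask
print(bbox)                           # (3, 2, 6, 4)
```

A classification head would instead reduce the same frame to a single benign/malignant label, which is why some systems chain detection and classification.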

This survey by ELKarazle et al. from Swinburne University of Technology reviews 20 recently published articles on automated polyp detection, covering benchmark datasets, evaluation metrics, common challenges, standard architectures, and the latest proposed methods. Articles were selected from Google Scholar, ScienceDirect, the NIH database, Nature, and SpringerLink based on relevance, novelty, recency, and citation impact.

TL;DR: CRC kills nearly 1 million people per year, and colonoscopists miss polyps due to fatigue and difficult lesion types. This survey reviews 20 recent ML-based polyp detection methods, covering datasets, architectures, and challenges.
Pages 2-4
Six Obstacles That Make Automated Polyp Detection Difficult

Data disparity: Polyps come in vastly different shapes, sizes, and types, yet existing public datasets are small and do not cover all possible variations. Flat, depressed polyps are especially hard to capture, and the lack of diverse training samples leads to models that miss these lesions. Augmentation techniques such as flips, rotations, and random crops can help, but they do not fully solve the problem.
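The augmentation techniques mentioned above (flips, rotations, random crops) can be sketched in a few lines of numpy; this is an illustrative pipeline, not code from any reviewed study, and the 48 x 48 crop size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))  # stand-in for a colonoscopy frame

def augment(img, rng):
    """Toy augmentation: random horizontal flip, 90-degree rotation, random crop."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                      # horizontal flip
    img = np.rot90(img, k=rng.integers(4))      # random 90-degree rotation
    h, w = img.shape[:2]
    top = rng.integers(h - 48 + 1)              # random 48x48 crop position
    left = rng.integers(w - 48 + 1)
    return img[top:top + 48, left:left + 48]

aug = augment(img, rng)
print(aug.shape)  # (48, 48, 3)
```

Note that such transforms only recombine existing pixels; they cannot invent the flat, depressed polyp morphologies the datasets lack, which is why the survey treats augmentation as a partial remedy.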

Poor bowel preparation: If the colon is not properly cleaned before the procedure, residual fecal matter and bubbles can confuse the model. Studies have reported that most false positives come from the model mistaking non-polyp objects, such as feces or normal blood vessels, for actual polyps. This is an external factor that depends on the physician and patient, not on model tuning.

Light reflection: Colonoscopes are equipped with a light source for navigation, but the resulting white light reflection can either hide polyps or create polyp-like anomalies in the image. Multiple studies reported degraded performance when specular highlights were present. A related issue is colonoscopy viewpoint: polyps near the boundary of a video frame are hard to detect even with the naked eye, and several methods failed to correctly identify lesions at frame edges.

Computational demands and domain mismatch: High-performing models often require sophisticated, high-end hardware, creating a gap between accuracy and real-time feasibility. Additionally, most models rely on transfer learning from networks pre-trained on ImageNet, a general-purpose dataset of everyday objects. The domain gap between natural images and colonoscopy images has led to poor performance in several studies, as features learned from cars and dogs do not transfer perfectly to mucosal tissue.

TL;DR: Key obstacles include limited diverse training data, poor bowel preparation causing false positives, white light reflections hiding or mimicking polyps, edge-of-frame lesions, high computational costs, and the domain gap between ImageNet pre-training and colonoscopy images.
Pages 4-6
15 Benchmark Datasets: From 56 Images to 110,000+

Kvasir-SEG is one of the most popular datasets, introduced in 2020, containing 1,000 polyp images with corresponding segmentation masks. Images were captured in unfiltered, real-life settings with resolutions ranging from 332 x 487 to 1,920 x 1,072 pixels. Another widely used dataset is CVC-ClinicDB (2015), with 612 static frames at 384 x 288 pixels extracted from 31 colonoscopy sequences. It has been used extensively to test segmentation methods including the well-known U-Net architecture.

ETIS-Larib (2014) provides 196 samples at a fixed 1,225 x 996 resolution and is mainly used for testing due to its small size. CVC-ColonDB (2012) offers 300 images at 574 x 500 pixels with diverse polyp types and sizes. The smallest dataset reviewed, CVC-PolypHD, contains just 56 high-resolution images (1,920 x 1,080). On the larger end, EndoTect provides over 110,000 static images plus 373 videos totaling about 11.26 hours and 1.1 million frames.

For video-based tasks, the SUN Colonoscopy Video Database includes 49,136 samples across 100 different polyps (82 low-grade adenomas, 7 hyperplastic, 4 sessile serrated lesions, among others). The ASU-Mayo Clinic dataset contains 38 video sequences split into 20 for training and 18 for testing. Newer datasets like NeoPolyp (7,500 images), PolypGen (8,037 samples with both positive and negative examples), and EndoTest (48 short videos plus 10 full-length colonoscopy videos) round out the available resources.

The survey notes a recurring theme: most publicly available datasets remain too small to cover all possible variations of polyp size, shape, and type. Several methods had to combine multiple datasets for training, and many relied on privately collected data that cannot be reproduced or compared against by other researchers.

TL;DR: The survey catalogs 15 benchmark datasets ranging from 56 to over 110,000 images. Kvasir-SEG (1,000 images), CVC-ClinicDB (612 images), and the SUN database (49,136 samples) are among the most used. Dataset scarcity remains a major bottleneck.
Pages 6-8
U-Net, SegNet, FCN, and PSPNet: How Models Draw Polyp Boundaries

U-Net was introduced in 2015 and has become the preferred segmentation network in medical imaging. It follows an encoder-decoder design with 23 convolutional layers and skip connections that concatenate encoder feature maps into the decoder, allowing it to recover fine spatial details lost during downsampling. Each layer uses the ReLU activation function. U-Net has been the foundation for numerous polyp detection studies, including the Y-Net variant reviewed in this survey, which extends U-Net with two encoders and a single decoder.
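A minimal numpy sketch of the skip-connection idea (for intuition only, not U-Net's actual implementation): downsampling discards spatial detail, so the decoder concatenates the matching encoder feature map along the channel axis before convolving.

```python
import numpy as np

x = np.random.rand(1, 16, 64, 64)              # (batch, channels, H, W) encoder input

enc = x[:, :, ::2, ::2]                         # encoder: 2x downsample -> (1, 16, 32, 32)
dec = enc.repeat(2, axis=2).repeat(2, axis=3)   # decoder: 2x nearest upsample -> (1, 16, 64, 64)

# Skip connection: concatenate the original encoder features on the channel
# axis, restoring fine detail the pooled path lost.
fused = np.concatenate([dec, x], axis=1)
print(fused.shape)  # (1, 32, 64, 64)
```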

SegNet, also from 2015, uses a similar encoder-decoder layout but replaces skip connections with a different approach: it stores pooling indices from the encoder and uses them during upsampling in the decoder. The encoder consists of 13 convolutional layers matching the VGG16 design, and the decoder mirrors this with 13 convolutional layers where pooling is replaced by upsampling. SegNet has been used to segment colonoscopy images, achieving an IoU of 81.7% in one reviewed study.
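SegNet's index-passing trick can be sketched in plain numpy (again, for intuition only): max-pooling records where each window's maximum came from, and the decoder "unpools" by writing each value back at its remembered position, with zeros elsewhere.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pool that also returns each maximum's flat index into x."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    idx = np.zeros((h // 2, w // 2), dtype=int)
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            win = x[i:i + 2, j:j + 2]
            k = int(win.argmax())                        # 0..3 within the window
            pooled[i // 2, j // 2] = win.flat[k]
            idx[i // 2, j // 2] = (i + k // 2) * w + (j + k % 2)
    return pooled, idx

def max_unpool(pooled, idx, shape):
    """Sparse upsampling: each value goes back to its pooled-from position."""
    out = np.zeros(shape)
    out.flat[idx.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 5., 2., 0.],
              [3., 4., 8., 6.],
              [0., 2., 1., 1.],
              [7., 0., 3., 2.]])
pooled, idx = max_pool_with_indices(x)
up = max_unpool(pooled, idx, x.shape)
print(pooled)  # [[5. 8.] [7. 3.]]
```

Because only indices (not full feature maps) are carried across, this uses less memory than U-Net-style concatenation, at the cost of sparser recovered detail.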

Fully Convolutional Networks (FCN) contain only convolutional layers with no dense (fully connected) layers, making them flexible in terms of input size and faster to train. They use a downsampling path, an upsampling path, and skip connections. PSPNet (Pyramid Scene Parsing Network, 2017) uses a pyramid parsing module that aggregates context information at different regional scales, producing more accurate global context. PSPNet is less common than U-Net or SegNet for polyp detection but has been benchmarked against them.
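The pyramid parsing idea can be illustrated with a hedged numpy sketch (a simplification, not PSPNet's real module): average-pool the feature map into grids of several sizes, upsample each back, and concatenate, so the next layer sees context at multiple scales.

```python
import numpy as np

def pyramid_pool(feat, bins=(1, 2, 4)):
    """feat: (C, H, W). Returns the input plus multi-scale context channels."""
    c, h, w = feat.shape
    outs = [feat]
    for b in bins:
        # average-pool into a b x b grid
        pooled = feat.reshape(c, b, h // b, b, w // b).mean(axis=(2, 4))
        # nearest-neighbour upsample back to (H, W)
        up = pooled.repeat(h // b, axis=1).repeat(w // b, axis=2)
        outs.append(up)
    return np.concatenate(outs, axis=0)

feat = np.random.rand(8, 16, 16)
ctx = pyramid_pool(feat)
print(ctx.shape)  # (32, 16, 16)
```

The 1 x 1 bin contributes a per-channel global average, which is what gives the network its "whole-scene" context.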

A key insight from the survey is that segmentation models produce pixel-level masks around detected polyps, offering more precise localization than bounding-box detection. However, they require correspondingly detailed training labels: each training image must have a manually created segmentation mask, which is time-consuming and demands expert annotation from physicians.

TL;DR: U-Net (23 layers with skip connections) is the dominant segmentation architecture for polyp detection. SegNet uses VGG16-based encoders, FCN eliminates dense layers for speed, and PSPNet adds multi-scale context. All require labor-intensive pixel-level mask annotations.
Pages 7-9
YOLO, Faster R-CNN, SSD, and Pre-Trained Classifiers

YOLO (You Only Look Once) is the most widely used detection algorithm in the reviewed literature. First introduced in 2016, YOLO treats detection as a regression problem where a single neural network predicts bounding boxes and class probabilities from an entire image in one pass, processing 45 frames per second. The survey covers methods using YOLOv3, YOLOv4, and YOLOv5m. Modifications include replacing the DarkNet53 backbone with CSPNet, swapping ReLU for SiLU activation, and integrating the SWIN transformer to capture global context alongside local features.
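The "detection as regression" idea can be sketched as follows. This is a simplified, illustrative decoding of one grid cell's output (one box per cell; the grid size and anchor-prior dimensions are arbitrary choices, not values from the survey).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_cell(tx, ty, tw, th, col, row, S=13, prior_w=0.2, prior_h=0.2):
    """Turn a cell's raw regression outputs into an image-relative box.

    Returns (cx, cy, w, h) in [0, 1] coordinates: the sigmoid keeps the
    centre offset inside the cell, and exp scales an anchor-box prior.
    """
    cx = (col + sigmoid(tx)) / S
    cy = (row + sigmoid(ty)) / S
    w = prior_w * np.exp(tw)
    h = prior_h * np.exp(th)
    return cx, cy, w, h

# Zero outputs from the centre cell of a 13x13 grid decode to a centred box.
cx, cy, w, h = decode_cell(0.0, 0.0, 0.0, 0.0, col=6, row=6)
print(round(cx, 3), round(w, 3))  # 0.5 0.2
```

A real YOLO head predicts several such boxes per cell plus class probabilities, all in a single forward pass, which is what enables the 45 fps throughput.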

Faster R-CNN uses a Region Proposal Network (RPN) that shares convolutional features with the detection network, making it suitable for real-time use. One reviewed method combined Faster R-CNN with ResNet101 for feature extraction and a gradient-boosted decision tree classifier, achieving 97.4% sensitivity and an AUC of 91.7%. The Single-Shot Detector (SSD) discretizes bounding box output space into boxes of different aspect ratios and uses multiple feature maps at different resolutions. One SSD-based method achieved 92% accuracy on a University of Leeds dataset.

For classification, pre-trained convolutional networks dominate. VGG16 (16 layers, 224 x 224 input) and VGG19 (19 layers) serve as backbone feature extractors. ResNet50 (50 layers with skip connections, 227 x 227 input) is used for both detection and segmentation. AlexNet (8 layers) and GoogLeNet (22 layers) are also employed, though GoogLeNet consistently underperformed as a standalone network and is mainly used in ensemble configurations. One ensemble method stacking ResNet101, GoogLeNet, and Xception achieved 98.6% precision and 98.01% recall for polyp detection.

All these pre-trained models were originally trained on ImageNet, a collection of 14 million images across 22,000 classes. Transfer learning from ImageNet is the preferred approach because it is faster and more accurate than training from scratch, though the domain gap between natural images and colonoscopy remains a recognized limitation.
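The transfer-learning workflow can be sketched conceptually in numpy. Everything here is invented for illustration (a random frozen "backbone" standing in for ImageNet weights, toy features and labels): the pre-trained feature extractor is held fixed while only a small classification head is trained on the new polyp/no-polyp task.

```python
import numpy as np

rng = np.random.default_rng(0)
W_backbone = rng.standard_normal((64, 10)) / 8.0   # frozen "pre-trained" weights

def backbone(x):
    """Fixed feature extractor (stand-in for an ImageNet-pretrained CNN)."""
    return np.maximum(x @ W_backbone, 0)            # ReLU features

# Toy dataset: 100 "images" as 64-d vectors with binary labels
X = rng.standard_normal((100, 64))
y = (X[:, 0] > 0).astype(float)

feats = backbone(X)                 # computed once; the backbone is never updated
w, b = np.zeros(10), 0.0            # the only trainable parameters (the head)

for _ in range(500):                # plain gradient descent on the head alone
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    grad = p - y
    w -= 0.1 * feats.T @ grad / len(y)
    b -= 0.1 * grad.mean()

acc = ((feats @ w + b > 0) == (y == 1)).mean()
```

Training only the head is what makes transfer learning fast, but the sketch also shows the limitation the survey highlights: if the frozen features do not capture what distinguishes polyps (the domain gap), no amount of head training can fully compensate.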

TL;DR: YOLO is the most popular detection algorithm (45 fps), with Faster R-CNN and SSD as alternatives. Classification relies on ImageNet-pretrained networks like VGG16, ResNet50, and ensembles (up to 98.6% precision). Transfer learning is standard but the natural-to-medical domain gap persists.
Pages 9-13
20 Methods Benchmarked: From Modified YOLO to Transformer Hybrids

Modified YOLO networks: One study replaced DarkNet53 with CSPNet in YOLOv3 and introduced CSPDarkNet53 for YOLOv4, achieving a precision of 90.61%, recall of 91.04%, and F1 of 90.82% on the ETIS-Larib dataset. A YOLOv5m variant combined with the SWIN transformer replaced YOLO's bottleneck module with transformer blocks and added a temporal information fusion module to reduce white light reflection artifacts, scoring 83.6% precision and 73.1% recall on CVC-ClinicVideo.

Best performers: A dual-path CNN that converted images to HSV color space and applied gamma correction achieved 100% precision, 99.2% recall, and 99.6% F1 on CVC-ColonDB, though it assumed the polyp's location was manually identified in advance, limiting real-world applicability. The sECA-NET (shuffle-efficient channel attention network) combined a CNN with a region proposal network and achieved 94.9% precision, 96.9% recall, and 95.9% F1. An ensemble of ResNet101, GoogLeNet, and Xception with weighted majority voting scored 98.6% precision and 98.01% recall.

Novel approaches: NeutSS-PLS used Neutrosophic theory to suppress white light reflection before feeding images to a U-Net-inspired saliency network, achieving 92.3% precision and 92.4% F1. A method combining SWIN transformers with EfficientNet used multi-dilation convolutional blocks and multi-feature aggregation to capture both global and local features, reporting a mean dice coefficient of 0.906 and IoU of 0.842. An instance tracking head (ITH) plug-in module, designed to be compatible with any detection algorithm, scored 92.6% precision when paired with YOLOv4.

A GI Genius V2 endoscopy system embedded two pre-trained ResNet18 networks for real-time adenoma classification, achieving 84.8% accuracy, 80.7% sensitivity, and 87.3% specificity. A random forest approach using Pyradiomics for texture feature extraction achieved an AUC of 91.0%, sensitivity of 82.0%, and specificity of 85.0% for premalignant polyp detection. A 2D/3D hybrid CNN used ResNet101 for spatial features and 3D convolutions for temporal coherence, scoring 93.45% precision and 89.65% F1.

TL;DR: The 20 reviewed methods averaged 90.26% precision, 86.51% recall, and 88.38% F1. Top performers used multi-path or ensemble strategies (up to 99.6% F1), while real-time systems trading accuracy for speed scored lower (78-88% F1).
Pages 13-15
How Polyp Detectors Are Scored: Precision, Recall, IoU, and mAP

Classification and localization metrics: The most common evaluation combination is precision, recall, and F1 score. Precision measures how many of the model's positive predictions are correct (TP / (TP + FP)), while recall measures how many actual polyps were found (TP / (TP + FN)). F1 is the harmonic mean of the two, with a value of 1.0 being perfect. Sensitivity (identical to recall) and specificity (TN / (TN + FP)) are also widely used.

Accuracy is defined as (TP + TN) / (TP + TN + FP + FN) but is considered unreliable when datasets are unbalanced, which is common in polyp detection since most colonoscopy frames contain no polyps. Researchers therefore prefer F1 as a more robust measure. For object detection algorithms like YOLO, R-CNN, and Faster R-CNN, mean average precision (mAP) is the preferred metric, as it summarizes the precision-recall trade-off across all recall levels and averages the result over all classes.
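The formulas above translate directly into code; the confusion-matrix counts below are made up to mimic an unbalanced screening scenario (900 negative frames, 100 with polyps) and show why accuracy can flatter a mediocre detector.

```python
def metrics(tp, fp, fn, tn):
    """Standard classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # a.k.a. sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, specificity, accuracy

p, r, f1, spec, acc = metrics(tp=70, fp=30, fn=30, tn=870)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.2f}")
# precision=0.70 recall=0.70 f1=0.70 accuracy=0.94
```

A detector that finds only 70% of polyps still posts 94% accuracy here, purely because negatives dominate; F1 stays at the honest 0.70.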

Segmentation metrics: The standard measure is Intersection-over-Union (IoU), also called the Jaccard Index. IoU calculates the overlap between the predicted segmentation mask and the ground truth mask, divided by their union. An IoU greater than or equal to 0.5 is generally considered acceptable. Some studies also report the dice coefficient, which is related to IoU but weights overlap differently (dice = 2 * IoU / (1 + IoU)).
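Both overlap measures, and the identity relating them, are easy to verify on toy binary masks:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-Union (Jaccard Index) of two binary masks."""
    return (a & b).sum() / (a | b).sum()

def dice(a, b):
    """Dice coefficient: twice the overlap over the summed mask areas."""
    return 2 * (a & b).sum() / (a.sum() + b.sum())

pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:6] = True    # 16 predicted pixels
truth = np.zeros((8, 8), dtype=bool); truth[3:7, 3:7] = True  # 16 ground-truth pixels

i, d = iou(pred, truth), dice(pred, truth)
print(round(i, 3), round(d, 3))       # 0.391 0.562
assert np.isclose(d, 2 * i / (1 + i))  # dice = 2 * IoU / (1 + IoU)
```

Here the two masks overlap in a 3 x 3 region (9 pixels) out of a 23-pixel union, so IoU = 9/23 and dice = 18/32; by the 0.5 convention this prediction would not count as acceptable.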

The survey reveals that not all methods report the same metrics, making direct comparisons difficult. Some studies report only accuracy, others only IoU, and several use private test sets. This inconsistency is itself a barrier to progress, as the field lacks a standardized evaluation protocol that would allow fair head-to-head comparisons across different polyp detection approaches.

TL;DR: Precision, recall, and F1 are the primary metrics for detection; IoU (Jaccard Index, threshold 0.5+) is standard for segmentation; mAP is preferred for bounding-box detectors. Inconsistent metric reporting across studies makes fair comparison difficult.
Pages 15-17
Key Trends, Unresolved Gaps, and Recommendations for Future Work

YOLO dominates real-time detection: The survey found that YOLO-based architectures are the most widely used method for detecting polyps from both static images and live-stream colonoscopy video, owing to their speed (45 fps) and strong bounding-box IoU performance. The second major trend is the increasing adoption of vision transformers, used either as standalone networks or as modules integrated with CNNs. Transformers process images as patches rather than pixel-by-pixel, offering efficiency advantages for capturing global context.

Flat polyp detection remains unsolved: Using a pre-trained ResNet50, the authors demonstrated that the network successfully extracted edges and shape features from an elevated polyp but completely failed on a flat polyp, instead detecting only the white light reflection edges. To the authors' knowledge, no adequately tested method exists to optimize feature extraction so that flat polyps are reliably identified, particularly from unseen samples. This represents one of the most significant unresolved challenges.

Real-life performance gap: Models trained and tested on unprocessed, real-life colonoscopy samples consistently reported lower scores than those using curated datasets. Methods in studies using private real-world data reported the lowest specificity scores (60.3% and 85.0%). The authors also note that no comprehensive testing across different colonoscope manufacturers and resolutions has been conducted, meaning model scalability across hardware platforms remains unverified.

Recommendations: The authors propose several directions for future research. First, generative adversarial networks (GANs), including architectures like StyleGAN and conditional GAN, should be explored to generate synthetic polyp images and address data scarcity. Second, the gap between model accuracy and computational efficiency must be bridged to enable deployment on low-power colonoscopy equipment. Third, professional physicians should be engaged to verify whether AI tools genuinely reduce stress and fatigue among colonoscopists. Finally, standardized cross-device testing and uniform evaluation protocols would enable more meaningful comparisons across methods.

TL;DR: YOLO and vision transformers lead the field, but flat polyp detection remains unsolved and real-world performance lags behind curated-dataset results. Future priorities include GAN-based data augmentation, efficiency optimization for low-power devices, cross-colonoscope testing, and physician engagement studies.
Citation: ELKarazle K, Raman V, Then P, Chua C. Detection of Colorectal Polyps from Colonoscopy Using Machine Learning: A Survey on Modern Techniques. Sensors, 2023 (Open Access). PMC: PMC9953705. DOI: 10.3390/s23031225. License: CC BY.