Deep Learning Localizes and Identifies Polyps in Real Time with 96% Accuracy in Screening Colonoscopy

Gastroenterology (PMC), 2018

Plain-English Explanations
Pages 1-2
Why Missed Polyps Are a Major Problem and How Deep Learning Could Help

Colorectal cancer (CRC) is the second leading cause of cancer-related death in the United States. It develops from precancerous polyps over a mean dwell time of more than 10 years, and the National Polyp Study showed that 70% to 90% of colorectal cancers are preventable through regular colonoscopy with polyp removal. Despite this, 7% to 9% of colorectal cancers still occur in patients who are up to date with screening colonoscopy. An estimated 85% of these "interval cancers" are attributed to polyps that were missed or incompletely removed during the procedure.

Adenoma detection rate (ADR) measures the percentage of screening colonoscopies in which at least one adenoma is found. The estimated prevalence of precancerous polyps in the over-50 screening population exceeds 50%, yet ADR varies widely among colonoscopists, ranging from just 7% to 53%. Tandem colonoscopy studies reveal that 22% to 28% of polyps and 20% to 24% of adenomas are missed on the first pass. A large Kaiser Permanente study demonstrated that every 1% increase in ADR reduces interval cancer rates by 3%, while a Polish study of nearly 1 million person-years found a 6% reduction per 1% ADR improvement.

Several technologies have been developed to improve ADR, including enhanced optics, chromoendoscopy, wide-angle colonoscopes, and cap-assisted techniques. However, a meta-analysis and a large randomized study found no ADR difference between extra-wide-angle systems and standard forward-viewing colonoscopy. Most studies on Narrow-Band Imaging (NBI) also showed no improvement in ADR over white light. The authors argue that computer-assisted image analysis using convolutional neural networks (CNNs) offers a distinct advantage: it requires no alteration of the colonoscope or the procedure itself, and deep learning has already demonstrated success across computer vision, speech recognition, and other scientific domains.

TL;DR: Colonoscopists miss 20-28% of polyps, and every 1% increase in ADR cuts interval cancer by 3-6%. Hardware upgrades have not reliably improved ADR. Deep learning offers a software-only solution that requires no changes to existing colonoscopes.
Pages 3-4
CNN Architectures: VGG, ResNet, and the YOLO-Inspired Ensemble

The researchers trained and evaluated multiple convolutional neural network (CNN) architectures for two separate tasks: polyp detection (binary classification of whether a frame contains a polyp) and polyp localization (predicting a bounding box around the polyp). Both tasks used architecturally identical networks except for the final output layer. Detection models used softmax outputs optimized with KL-divergence loss, while localization models used linear output units optimized with L2 loss. All networks were built with standard building blocks: convolutional layers, fully connected layers, max/average pooling, ReLU activation functions, optional batch normalization, and skip connections.
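The two output heads described above differ only in their final layer and loss. As a rough illustration (not the authors' code), a minimal NumPy sketch of the two losses might look like this; the helper names and all numbers are illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    """Convert raw detection-head outputs to class probabilities."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def kl_divergence(target, predicted, eps=1e-12):
    """KL-divergence loss for the detection (classification) head.
    With one-hot targets this reduces to cross-entropy."""
    return float(np.sum(target * np.log((target + eps) / (predicted + eps))))

def l2_loss(target_box, predicted_box):
    """L2 (squared-error) loss for the localization (bounding-box) head."""
    return float(np.sum((np.asarray(target_box) - np.asarray(predicted_box)) ** 2))

# Detection: one-hot target [no polyp, polyp] vs. softmax of two logits
probs = softmax(np.array([0.3, 2.1]))
det_loss = kl_divergence(np.array([0.0, 1.0]), probs)

# Localization: target vs. predicted box as (x, y, width, height)
loc_loss = l2_loss([0.4, 0.5, 0.2, 0.2], [0.45, 0.48, 0.25, 0.18])
```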

The team tested architectures in two categories. Not pre-initialized (NPI) models started from random weights and served as baselines. Pre-initialized (PI) models used weights from training on the ImageNet corpus of 1.2 million natural images before being fine-tuned on colonoscopy data. Within the PI category, the specific architectures tested were VGG16, VGG19, and ResNet50. The ImageNet pre-training strategy is known as transfer learning, and the authors hypothesized that low-level visual features learned from natural images (edges, textures, shapes) would transfer effectively to the medical imaging domain.

For polyp localization, the team tested three training variants: (1) standard L2 mean-squared-error loss for bounding box coordinates; (2) Dice loss, which directly maximizes the overlap between predicted and ground-truth bounding boxes; and (3) an "internal ensemble" approach inspired by the YOLO (You Only Look Once) algorithm, where the CNN produces and aggregates 49 individual weighted predictions of polyp size and location in a single forward pass. All detection and localization variants had nearly identical runtime complexity, with less than 1% difference in processing speed.
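The "internal ensemble" idea can be sketched as a confidence-weighted average over the 49 per-cell proposals from one forward pass. This is a simplified illustration under assumed shapes and synthetic values, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of a single forward pass: a 7x7 grid of box proposals,
# each with (x, y, w, h), plus one confidence weight per cell (49 in total).
boxes = rng.uniform(0.3, 0.7, size=(49, 4))   # normalized box parameters
weights = rng.uniform(0.0, 1.0, size=49)      # per-cell confidence

def aggregate(boxes, weights):
    """Confidence-weighted average of the 49 individual predictions,
    mimicking the YOLO-style internal-ensemble aggregation."""
    w = weights / weights.sum()
    return w @ boxes  # final (x, y, w, h)

final_box = aggregate(boxes, weights)
```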

All experiments were implemented using the Keras and TensorFlow software libraries, running on Titan X (Pascal) GPUs with 12 GB of RAM and 11 TFLOPS of processing power. Input images were rescaled to 224 x 224 pixels (the native resolution for VGG and ResNet) and standardized to zero mean and unit variance by subtracting the mean pixel value and dividing by the standard deviation.
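A minimal sketch of that preprocessing, using a nearest-neighbour resize for simplicity (the paper does not specify the interpolation method, so that choice is an assumption):

```python
import numpy as np

def preprocess(frame, size=224):
    """Nearest-neighbour resize to size x size, then per-image
    standardization (subtract mean, divide by standard deviation)."""
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = frame[rows][:, cols].astype(np.float64)
    return (resized - resized.mean()) / resized.std()

# A synthetic 480x640 RGB frame standing in for a colonoscopy image
frame = np.random.default_rng(1).integers(0, 256, size=(480, 640, 3))
x = preprocess(frame)
```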

TL;DR: The team tested VGG16, VGG19, and ResNet50 architectures, all pre-initialized with ImageNet weights. Localization used a YOLO-inspired "internal ensemble" producing 49 bounding-box predictions per frame. Everything ran on a single consumer-grade GPU with Keras/TensorFlow.
Pages 4-5
Building the Dataset: 8,641 Images From Over 2,000 Patients

The primary training dataset consisted of 8,641 hand-selected colonoscopy images drawn from over 2,000 patients to avoid intra-patient polyp similarity bias. This set contained 4,088 images of unique polyps and 4,553 images without polyps, making it nearly perfectly balanced. The images included both white light endoscopy (WLE) and NBI images (840 NBI and 7,801 WLE), and covered all portions of the colorectum including retro-views in the rectum and cecum, the appendiceal orifice, and the ileocecal valve.

The researchers deliberately included features such as forceps, snares, cuff devices, debris, melanosis coli, and diverticula at random in both polyp and non-polyp images, in balanced proportions. This was a critical design choice to prevent the CNN from learning spurious associations between the appearance of tools and the presence of polyps. Bounding box locations and dimensions were recorded for all polyp-containing images by a team of colonoscopists (fellows and faculty at UCI with ADR above 45% and more than 100 procedures).

A separate validation set of 1,330 colonoscopy images (672 polyp, 658 non-polyp) was collected from different patients. For video validation, two sets were used: 9 randomly selected archived colonoscopy videos and 11 deliberately more challenging videos performed by a senior colonoscopist (ADR at or above 50%) who intentionally withdrew the scope without closing in on already-identified polyps to simulate missed-polyp scenarios. The combined duration of all 20 videos was approximately 5 hours, totaling roughly 500,000 frames.

To reduce overfitting, the team applied dropout at a rate of 0.5 to fully connected layers, data augmentation (random horizontal and vertical mirroring, rotations from 0 to 90 degrees, and shearing), and early stopping based on a reserved validation subset. A larger combined dataset was also created by augmenting the 8,641 images with 44,947 frames extracted from the 9 videos, sampling every 8th non-polyp frame and every 4th polyp frame to reduce temporal correlation.
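The temporal-decorrelation sampling (every 8th non-polyp frame, every 4th polyp frame) can be sketched as a simple filter over a labeled frame stream; the function name and data layout are illustrative assumptions:

```python
def subsample_frames(frames, labels, polyp_step=4, nonpolyp_step=8):
    """Keep every 4th polyp frame and every 8th non-polyp frame to reduce
    temporal correlation between consecutive video frames."""
    kept, p_count, n_count = [], 0, 0
    for frame, has_polyp in zip(frames, labels):
        if has_polyp:
            if p_count % polyp_step == 0:
                kept.append((frame, has_polyp))
            p_count += 1
        else:
            if n_count % nonpolyp_step == 0:
                kept.append((frame, has_polyp))
            n_count += 1
    return kept

# Toy stream: 40 polyp frames followed by 60 non-polyp frames
kept = subsample_frames(range(100), [i < 40 for i in range(100)])
```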

TL;DR: The dataset included 8,641 labeled images from 2,000+ patients (balanced polyp/non-polyp) plus 20 colonoscopy videos totaling 5 hours. Tools and artifacts were deliberately included in both classes to prevent spurious learning. Dropout, augmentation, and early stopping reduced overfitting.
Pages 6-7
96.4% Accuracy and 0.991 AUC: Polyp Detection Performance

The headline result of this study was the polyp detection performance of the best model, PI-CNN 2 (VGG19 pre-initialized on ImageNet), which achieved a cross-validation accuracy of 96.4% and an area under the ROC curve (AUC) of 0.991 on the 8,641-image dataset. Models that were not pre-initialized on ImageNet (NPI-CNN 1 and 2) performed comparably to previously published state-of-the-art polyp classifiers but were surpassed by a significant margin by all three pre-initialized architectures. Interestingly, the scores across VGG16, VGG19, and ResNet50 were remarkably similar once ImageNet pre-training was applied.

At a sensitivity of 90%, the best model had a false positive rate of only 0.5%. At 95.2% sensitivity, the false positive rate was 1.6%, and at 97.1% sensitivity it rose to 6.5%. This range of high sensitivities with low false-positive burden is critical for clinical deployment, where excessive false alarms could desensitize the endoscopist and negate the benefit of the assistance system.
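Each of those operating points is just a probability threshold applied to the model's frame scores. A small sketch of how sensitivity and false positive rate fall out of one threshold, using synthetic scores rather than the paper's predictions:

```python
import numpy as np

def operating_point(probs, labels, threshold):
    """Sensitivity and false positive rate at one probability threshold."""
    preds = probs >= threshold
    labels = labels.astype(bool)
    sensitivity = preds[labels].mean()   # TP / (TP + FN)
    fpr = preds[~labels].mean()          # FP / (FP + TN)
    return sensitivity, fpr

# Toy scores: the first three frames contain a polyp, the last does not
probs = np.array([0.9, 0.8, 0.4, 0.2])
labels = np.array([1, 1, 1, 0])
sens, fpr = operating_point(probs, labels, threshold=0.5)
```

Sweeping the threshold from 0 to 1 and plotting (fpr, sensitivity) pairs traces the ROC curve whose area the paper reports as 0.991.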

The team also investigated whether the CNN handled different polyp morphologies equally well. A review of 1,578 true positive and all 228 false negative predictions revealed 381 nonpolypoid lesions (Paris IIa, IIb, IIc) and 678 polypoid polyps (Paris Ip, Is). The CNN missed 12% of polypoid polyps (84 of 678) and 11% of nonpolypoid lesions (41 of 381), demonstrating that the model detected flat and depressed lesions just as effectively as protruding polyps. After correcting for sampling bias by including all remaining true positives, the overall miss rate dropped to approximately 5%.

Testing at a higher resolution of 480 x 480 pixels produced virtually identical results (96.4% accuracy, 0.990-0.992 AUC across architectures) but more than doubled computational cost. The VGG19 model trained on the 8,641 images also achieved 96.4% accuracy and 0.974 AUC on the fully independent 1,330-image test set, confirming that intra-patient polyp similarity did not introduce notable bias in the cross-validation results.

TL;DR: The best CNN achieved 96.4% accuracy and 0.991 AUC for polyp detection, with only 0.5% false positives at 90% sensitivity. Flat and polypoid lesions were detected equally well. Results held on an independent 1,330-image test set (96.4% accuracy, 0.974 AUC).
Pages 7-8
Bounding-Box Localization: Dice Score of 0.83 in 10 Milliseconds

For the polyp localization task, models were trained on the subset of images containing a single polyp per frame, which represented the vast majority of polyp-containing samples. Consistent with detection results, pre-initialized ImageNet CNNs significantly outperformed randomly initialized networks at localizing polyps. Neither the L2 loss nor the Dice loss showed a consistent advantage over the other when used alone.

The key finding was that the YOLO-inspired "internal ensemble" approach substantially improved localization accuracy, boosting the Dice coefficient from 0.79 to 0.83 for the best model (PI-CNN 2 / VGG19). For context, the best previously published Dice score for polyp segmentation was 0.55, achieved on a different dataset using a different method. The improvement from 0.55 to 0.83 represents a major advance in how precisely a CNN can outline polyp locations during colonoscopy.
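The Dice coefficient itself is a simple overlap ratio, 2|A∩B| / (|A| + |B|). A minimal sketch for axis-aligned bounding boxes (corner-coordinate convention assumed for illustration):

```python
def dice_boxes(a, b):
    """Dice coefficient between two axis-aligned boxes given as
    (x_min, y_min, x_max, y_max): 2*|A∩B| / (|A| + |B|)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return 2.0 * inter / (area_a + area_b)

half = dice_boxes((0, 0, 2, 2), (1, 0, 3, 2))  # boxes overlapping by half
```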

A critical advantage of using bounding-box localization rather than pixel-level segmentation was processing speed. The best model processed 98 images per second (approximately 10 milliseconds per frame) for both detection and localization on a single consumer-grade GPU. This is roughly four times faster than the 25 to 30 frames per second required for real-time video (PAL/NTSC standards). In comparison, the fastest polyp localization model from prior work at the MICCAI 2015 Endoscopic Vision Challenge could process only 7 frames per second, and the slowest managed just 0.1 frames per second.

The bounding-box approach was a deliberate engineering choice. Pixel-level segmentation provides unnecessarily precise polyp boundaries for clinical purposes but struggles to operate in real time. By choosing bounding boxes, the researchers achieved a practical system that could realistically overlay polyp locations on live video during colonoscopy without introducing lag or disruption to the endoscopist's workflow.

TL;DR: The YOLO-inspired internal ensemble achieved a Dice score of 0.83 for polyp localization, far exceeding the prior best of 0.55. Processing at 98 frames per second (10 ms/frame) was 4x faster than real-time requirements, making bounding-box overlay during live colonoscopy feasible.
Pages 8-9
Expert Video Study: CNN Found Every Polyp and Revealed 17 More

The video validation study was designed to simulate real clinical conditions. Three expert colonoscopists (all with ADR above 50%) independently reviewed nine de-identified colonoscopy videos without CNN assistance, recording every polyp they encountered. Their findings were combined by consensus. CNN predictions were filtered by requiring at least 8 contiguous frames with above 40% probability for polyp presence to count as a detection event, balancing sensitivity and specificity.
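The detection-event rule (at least 8 contiguous frames above 40% probability) can be sketched as a run-length filter over per-frame probabilities; the function name and the example sequence are illustrative:

```python
def detection_events(frame_probs, threshold=0.40, min_run=8):
    """Count detection events: runs of at least `min_run` contiguous
    frames whose polyp probability exceeds `threshold`."""
    events, run = 0, 0
    for p in frame_probs:
        if p > threshold:
            run += 1
            if run == min_run:   # run just reached qualifying length
                events += 1
        else:
            run = 0
    return events

# 10-frame run (event), 7-frame run (too short), 8-frame run (event)
seq = [0.5] * 10 + [0.1] * 3 + [0.5] * 7 + [0.1] + [0.6] * 8
n_events = detection_events(seq)
```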

In the nine videos, the original colonoscopists had removed 28 polyps during the actual procedures. Without CNN assistance, the three expert reviewers identified 8 additional polyps beyond those 28, for a total of 36. When reviewing CNN-overlaid videos (where a green bounding box appeared on frames with above 95% polyp probability), the senior expert identified 9 more polyps, bringing the total to 45. No unique polyps were missed by the CNN. Frame-level sensitivity and specificity of CNN predictions relative to CNN-assisted expert review were both 0.93 (chi-square p less than 0.00001).

A second, more challenging validation used 11 additional videos containing 73 unique polyps. These were deliberately difficult: a senior colonoscopist (ADR at or above 50%, over 20,000 career colonoscopies) withdrew the scope without closing in on already-identified polyps, mimicking missed-polyp scenarios. The CNN trained on the 8,641 images alone identified 68 of 73 polyps at a 7% false positive rate. After fine-tuning on labeled frames from the first video study, it identified 67 of 73 at 5% FPR, or 72 of 73 at 12% FPR with a more sensitive threshold.

False negatives were enriched with views of distant and field-edge polyps, while false positives were enriched with near-field collapsed mucosa, debris, suction marks, NBI artifacts, and polypectomy sites. The additional training data from the first video study notably reduced false positive detections, likely due to the abundance and variety of artifacts (water, air bubbles, fecal matter, blurry frames) encountered during real procedures.

TL;DR: Expert reviewers found 36 polyps without CNN help; with CNN overlay, they found 45 (17 more than the original colonoscopists removed). The CNN missed zero unique polyps. In 11 harder "flyby" videos, it caught 68-72 of 73 polyps at 5-12% false positive rates.
Pages 9-10
Transfer Learning Across Imaging Modalities: WLE and NBI Combined

An important secondary experiment tested whether a CNN trained on both white light endoscopy (WLE) and Narrow-Band Imaging (NBI) data would outperform models trained on a single modality. The researchers retrained the VGG19 CNN on three subsets: NBI-only, WLE-only, and the combined WLE+NBI dataset from the 8,641 images, using 7-fold cross-validation for each configuration.

Training and testing on WLE-only data yielded 96.1% accuracy and 0.991 AUC. Training and testing on NBI-only data yielded 92.9% accuracy and 0.970 AUC. In both cases, these results were worse than the combined WLE+NBI model, which achieved 96.4% accuracy with 0.992 AUC on WLE data and 94.8% accuracy with 0.988 AUC on NBI data. The lower NBI-only performance likely reflects the much smaller amount of NBI training data (840 images versus 7,801 WLE images) rather than any inherent difficulty of the modality.

This finding demonstrates a synergistic knowledge transfer between the two imaging modalities. Features learned from WLE images helped the model perform better on NBI images, and vice versa. The practical implication is significant: a single CNN can be deployed across colonoscopy systems regardless of whether NBI capability is present, and training on mixed-modality data actually improves performance compared to modality-specific training.

TL;DR: Training on combined WLE+NBI data outperformed single-modality training in all cases. The combined model reached 96.4% accuracy on WLE and 94.8% on NBI, demonstrating cross-modality transfer learning. A single CNN works across both imaging modes.
Pages 10-11
From Feasibility to Clinical Trials: What Comes Next

The authors acknowledge several important limitations of the feasibility study. The video review was conducted on archived, de-identified recordings, which excluded information about colonoscopy indications (screening versus surveillance) and polyp histology. CNN performance may vary by indication. The effects of live CNN overlay on endoscopist inspection behavior are unknown, and extrapolation from video review to real-time clinical use cannot be assumed without prospective validation.

Polyp histology is especially relevant to clinical value. Time spent on polypectomy is "added value" when the lesion is precancerous or malignant, or when it affects ADR or surveillance interval calculations. However, if the CNN primarily triggers removal of clinically irrelevant lesions (normal tissue, lymphoid aggregates), the added time and unnecessary pathology costs would be unacceptable. Future randomized studies need to examine effects on colonoscopy time, pathology costs, ADR, polyps per procedure, and the ratio of surveillance-relevant to surveillance-irrelevant polyps.

The estimated impact on procedure time is modest. False positives would likely require less than 5 seconds each to evaluate, at an estimated rate of fewer than 8 per colonoscopy. This relatively minor time cost could be reduced further with more training data, user interface refinements (color selection, sound effects), and the simultaneous deployment of optical pathology AI algorithms that characterize polyps in real time. While the study used Olympus endoscopes (which hold approximately 70% of the endoscope market), the transfer learning approach should generalize to other vendors with minimal additional tuning.

The authors conclude that this is the first polyp detection AI system ready for real-time validation studies, meeting both the accuracy and speed constraints simultaneously. They call for prospective multicenter trials to test whether CNN-assisted colonoscopy actually improves ADR and reduces adenoma miss rates in clinical practice, and they note that the same deep learning methods could be adapted to address other real-time needs in endoscopy beyond polyp detection.

TL;DR: Key limitations include the retrospective video-only design, unknown histology, and untested effects on endoscopist behavior. Future randomized trials must measure ADR, procedure time, pathology costs, and clinically relevant versus irrelevant polyp removal. The system is positioned as the first AI ready for prospective real-time validation.
Citation: Urban G, Tripathi P, Alkayali T, et al. Gastroenterology, 2018. Available at PMC6174102. DOI: 10.1053/j.gastro.2018.06.037. Open Access.