Colorectal cancer (CRC) is the third most commonly occurring cancer in men and the second in women, with the global burden expected to rise to more than 2.2 million new cases and 1.1 million deaths by 2030. The diagnostic pathway typically begins with the detection of polyps during colonoscopy, from which biopsies or resection specimens are obtained. In Europe alone, colorectal cancer screening programs target approximately 110 million people per year, with about 5% of participants requiring further examination via colonoscopy.
This screening surge has created a bottleneck in pathology laboratories. After colonoscopy, histopathologists must examine every tissue sample under a microscope, classifying glandular formations on a spectrum from normal through hyperplasia and dysplasia to cancer. A high percentage of these samples turn out to be negative (not containing cancer), meaning pathologists spend substantial time reviewing benign tissue. The sheer volume of screening-detected polyps has made the diagnostic pipeline increasingly unsustainable without computational assistance.
Beyond simple cancer detection, pathologists also assess prognostic biomarkers such as the tumor-stroma ratio, tumor budding (small clusters of up to four tumor cells at the invasive margin), and tumor deposits (discrete cancer nodules in surrounding fat tissue). These assessments rely on visual estimation and are therefore subject to inter-observer variability. Computers can assist by objectively detecting and quantifying tissue compartments, leading to more reproducible and reliable biomarker measurements.
The core idea of this paper is multi-class semantic segmentation, a technique where every single pixel in a whole-slide image (WSI) is assigned to one of 14 tissue categories. These categories include: (1) normal glands, (2) low-grade dysplasia, (3) high-grade dysplasia/tumor, (4) submucosal stroma, (5) desmoplastic stroma, (6) stroma lamina propria, (7) mucus, (8) necrosis and debris, (9) lymphocytes, (10) erythrocytes, (11) adipose tissue, (12) muscle, (13) nerve, and (14) background. This comprehensive labeling creates a detailed, color-coded map of the entire tissue section.
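For downstream tooling, such a segmentation output is naturally represented as an integer label map, one class index per pixel. A minimal sketch (the indices follow the list above; the helper function is our illustration, not the paper's code):

```python
# Integer labels for the 14 tissue classes, following the paper's ordering.
CLASSES = {
    1: "normal glands", 2: "low-grade dysplasia", 3: "high-grade dysplasia/tumor",
    4: "submucosal stroma", 5: "desmoplastic stroma", 6: "stroma lamina propria",
    7: "mucus", 8: "necrosis and debris", 9: "lymphocytes", 10: "erythrocytes",
    11: "adipose tissue", 12: "muscle", 13: "nerve", 14: "background",
}

def class_histogram(label_map):
    """Normalized per-class pixel fractions for a 2-D label map (list of lists)."""
    counts = {c: 0 for c in CLASSES}
    total = 0
    for row in label_map:
        for px in row:
            counts[px] += 1
            total += 1
    return {c: counts[c] / total for c in CLASSES}
```

A histogram like this is exactly the kind of compact tissue summary that later feeds the slide-level classifier.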
Earlier work in CRC histopathology, most notably by Kather et al., used patch classification to segment nine different tissue types, and the publicly released dataset spawned a large body of follow-up deep learning research. Other groups focused on narrower tasks such as gland segmentation (fostered by international challenges like GLAS and CRAG) or used tumor segmentation to predict disease-free survival. This paper expands the scope to 14 classes, covering not just the primary cancer-associated epithelial and stroma classes but also peripheral tissue types like nerve, adipose tissue, and erythrocytes.
The authors argue that this broad segmentation serves as a versatile building block for many downstream applications. The same model could be repurposed for identifying peri-neural invasion, quantifying the tumor-stroma ratio, or analyzing immune cell distributions across different tissue compartments. By generating a rich, interpretable tissue map first, the system avoids the "black box" problem of end-to-end classification models and instead provides outputs that pathologists can visually verify.
Segmentation dataset: The team collected 79 formalin-fixed, paraffin-embedded tissue samples (surgical resections and biopsies) from five medical centers across the Netherlands and Germany. Slides were stained with H&E at each center's own laboratory, introducing natural staining variation. Three different scanner types were used for digitization: the Pannoramic P250 Flash II (3D-Histech), the IntelliSite (Philips), and the NanoZoomer 2.0 HT (Hamamatsu), all at 0.24 micrometers per pixel resolution. This multi-scanner, multi-center design was intentional, ensuring the model would encounter realistic variation during testing.
From these 79 slides, 52 WSIs from a single center (Radboud University Medical Center) were split into 40 for training and 12 for validation. The remaining 27 WSIs from all five centers formed the multi-centric test set, with 10 originating from the same center as training and 17 from external centers. Within each WSI, regions of interest were exhaustively annotated at the pixel level by one pathologist and two trained analysts, labeling every pixel into one of the 14 categories. The annotation tool used was the open-source ASAP software.
Classification dataset: A separate cohort of 1,054 colon biopsies was collected from the Cannizzaro Hospital in Catania, Italy, scanned with an Aperio AT2 scanner. Each case received a single slide-level label based on the pathology report: high-risk (tumor and high-grade dysplasia, n = 292), low-grade dysplasia (n = 693), hyperplasia (n = 36), or other/benign (n = 33). When multiple findings co-existed in a single slide, the highest-risk label was assigned. This independent Italian cohort served as the external validation for the biopsy classification system.
The segmentation model is based on the U-Net architecture, the dominant encoder-decoder design for biomedical image segmentation since its introduction by Ronneberger et al. in 2015. The encoder path progressively downsamples the image through max-pooling layers while doubling the number of filters at each level, starting with 32 filters in the first layer. The decoder path mirrors this structure, progressively upsampling to recover the original spatial resolution. Standard U-Net skip connections link matching encoder and decoder levels, preserving fine-grained spatial detail.
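The halve-resolution/double-filters rule gives a simple per-level sizing of the encoder. A quick sketch (the depth of 5 is illustrative; the paper does not state the exact network depth):

```python
def encoder_levels(input_size=512, base_filters=32, depth=5):
    """Per-level (spatial size, filter count) for a U-Net-style encoder:
    each level halves the spatial resolution (2x2 max-pooling) and
    doubles the number of filters."""
    levels = []
    size, filters = input_size, base_filters
    for _ in range(depth):
        levels.append((size, filters))
        size //= 2      # max-pooling halves the spatial size
        filters *= 2    # filter count doubles at the next level
    return levels
```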
The authors introduced two key modifications. First, inspired by ResNet, they added residual-style skip connections within each convolutional block (what they call a "U-Net block"). Specifically, the input to each block is concatenated with the block's final feature map before pooling. The authors observed experimentally that this improved gradient flow during training. Second, they replaced transposed convolutions in the expansion path with nearest-neighbor upsampling followed by a 2x2 convolution layer, which helps avoid checkerboard artifacts that transposed convolutions can produce.
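Nearest-neighbor upsampling simply replicates each value into a 2x2 block, and the following convolution smooths the result; because every output pixel receives the same number of contributions, the uneven-overlap checkerboard pattern of transposed convolutions cannot arise. A minimal pure-Python sketch of the upsampling step:

```python
def nn_upsample(feature_map, factor=2):
    """Nearest-neighbor upsampling of a 2-D feature map (list of lists):
    every value is replicated into a factor x factor block."""
    out = []
    for row in feature_map:
        wide = []
        for v in row:
            wide.extend([v] * factor)                    # repeat along width
        out.extend([list(wide) for _ in range(factor)])  # repeat along height
    return out
```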
The model processes 512 x 512 pixel RGB patches sampled at 1 micrometer per pixel (equivalent to 10x magnification). Training used the Adam optimizer with a learning rate starting at 1e-4, halved whenever validation performance had not improved for 20 epochs. Extensive data augmentation was applied: random flipping, rotation, elastic deformation, blurring, brightness changes (random gamma), stain variation, color shifts, and contrast adjustments. Networks trained for up to 500 epochs with early stopping after 50 epochs without improvement, using mini-batches of 5 instances and 300 iterations per epoch.
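The learning-rate schedule amounts to a plateau rule. A sketch of that bookkeeping (the exact implementation in the paper's code may differ):

```python
class PlateauHalver:
    """Halve the learning rate when validation loss has not improved
    for `patience` consecutive epochs (20 in the paper's setup)."""
    def __init__(self, lr=1e-4, patience=20):
        self.lr, self.patience = lr, patience
        self.best, self.stale = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:          # new best: reset the counter
            self.best, self.stale = val_loss, 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= 0.5            # halve on plateau
                self.stale = 0
        return self.lr
```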
A major contribution of this paper is the systematic comparison of four loss functions for multi-class histopathology segmentation. Categorical cross-entropy (CC) is the traditional default, computing pixel-wise divergence between predicted and ground-truth distributions. However, CC struggles with class imbalance because over-represented classes (like background or stroma) dominate the training signal, causing the model to neglect small but clinically important classes like erythrocytes or nerve tissue.
Focal loss addresses class imbalance by down-weighting "easy" examples (pixels the model already classifies correctly with high confidence) and up-weighting hard examples. It adds two hyperparameters (alpha and gamma) to the cross-entropy formula. Bi-tempered loss tackles a different problem: label noise. Small annotation inaccuracies can push cross-entropy loss values toward infinity, distorting decision boundaries. The bi-tempered formulation replaces the standard softmax with a heavy-tailed version and modifies the entropy function, making the model more robust to mislabeled pixels. It also requires two hyperparameters (t1 and t2).
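For the predicted probability p of the ground-truth class, focal loss is FL(p) = -alpha * (1 - p)^gamma * log(p); with gamma = 0 and alpha = 1 it reduces to plain cross-entropy. A per-pixel sketch:

```python
import math

def focal_loss(p_true, alpha=0.25, gamma=2.0):
    """Focal loss for one pixel, given the predicted probability of the
    ground-truth class. The (1 - p)^gamma factor down-weights easy,
    confidently classified pixels."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)
```

With gamma = 2, a confidently correct pixel (p = 0.95) has its contribution scaled by (0.05)^2 = 0.0025, i.e. 400x less than under unmodulated cross-entropy, so training effort concentrates on hard pixels.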
Lovasz-softmax loss takes a fundamentally different approach by directly optimizing a surrogate of the Intersection-over-Union (IoU) metric, using submodular optimization. Unlike pixel-wise losses, it evaluates performance at the class level, naturally handling imbalance by treating small and large tissue compartments with equal importance. Crucially, it requires no additional hyperparameters beyond the standard training setup.
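For a single class, the Lovász extension sorts pixel errors in decreasing order and weights them by the discrete gradient of the Jaccard index. A minimal pure-Python sketch for one class, following the construction in Berman et al. (real implementations vectorize this and average over classes):

```python
def jaccard_grad(gt_sorted):
    """Gradient of the Jaccard loss w.r.t. sorted errors. gt_sorted holds
    0/1 ground-truth labels, ordered by decreasing prediction error."""
    gts = sum(gt_sorted)
    inter, union, grads, prev = gts, gts, [], 0.0
    for g in gt_sorted:
        inter -= g
        union += 1 - g
        jacc = 1.0 - inter / union
        grads.append(jacc - prev)   # discrete derivative of the Jaccard loss
        prev = jacc
    return grads

def lovasz_single_class(probs, labels):
    """Lovasz-softmax loss for one class: dot product of the sorted
    absolute errors with the Jaccard gradient."""
    errors = [abs(l - p) for p, l in zip(probs, labels)]
    order = sorted(range(len(errors)), key=lambda i: -errors[i])
    e_sorted = [errors[i] for i in order]
    g_sorted = [labels[i] for i in order]
    return sum(e * g for e, g in zip(e_sorted, jaccard_grad(g_sorted)))
```

A perfect prediction yields zero loss, and because the weighting depends on ranks rather than raw pixel counts, small classes are not drowned out by large ones.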
The authors used the original recommended hyperparameter values for Focal and Bi-tempered losses rather than tuning them, which keeps the comparison fair but means those functions might perform better with optimization. The Dice loss was excluded because it proved difficult to combine with the team's preferred class-balancing method, though the Dice metric was still used for evaluation.
The four models were evaluated on 999 non-overlapping tiles (512 x 512 micrometers each) extracted from 27 WSIs across five centers. The Lovasz-softmax loss achieved the best overall mean Dice score of 0.72, closely followed by Bi-tempered loss at 0.71, while categorical cross-entropy and Focal loss both scored 0.69. A Wilcoxon signed-rank test found no statistically significant difference between the overall scores, but the Lovasz model showed the most consistent performance across all 14 classes.
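The per-class Dice score is 2|A∩B| / (|A| + |B|) over predicted and ground-truth pixel sets, and the reported number is its mean over classes. A sketch over flattened label lists (classes absent from both prediction and ground truth are skipped here; the paper's exact handling of absent classes may differ):

```python
def mean_dice(pred, truth, num_classes=14):
    """Mean per-class Dice score between two flat lists of class labels."""
    scores = []
    for c in range(1, num_classes + 1):
        p = {i for i, v in enumerate(pred) if v == c}
        t = {i for i, v in enumerate(truth) if v == c}
        if not p and not t:
            continue  # class absent from both: skip rather than score it
        scores.append(2 * len(p & t) / (len(p) + len(t)))
    return sum(scores) / len(scores)
```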
Class-level differences revealed interesting patterns. Bi-tempered loss excelled at segmenting low-grade dysplasia (Dice = 0.88 versus 0.77-0.80 for the others), likely benefiting from its noise-robust formulation in this challenging class. Lovasz-softmax achieved the best scores for lymphocytes (0.84), erythrocytes (0.68), nerve (0.83), and normal glands (0.88). Categorical cross-entropy led on tumor segmentation (0.89) and desmoplastic stroma (0.69). Focal loss, surprisingly, offered no clear advantage over plain cross-entropy on this 14-class task, possibly because its intended benefit diminishes when the number of classes is large.
On the public CRAG and GLAS benchmarks, results shifted slightly. Cross-entropy and Focal loss achieved the highest CRAG Dice scores (0.77 without lumen), while Bi-tempered dropped to 0.69. On GLAS, cross-entropy and Lovasz tied at 0.80. Including lumen pixels (which the network was not designed to segment) reduced scores by roughly 0.1 on CRAG, demonstrating the importance of matching evaluation protocols to model design.
One external test center with very dark H&E staining caused all models to struggle, with submucosal stroma Dice scores dropping from an average of 0.54 to just 0.28. This highlighted staining variation as a persistent challenge even in multi-centric training setups and pointed to the need for more aggressive stain augmentation or normalization strategies.
The best-performing Lovasz-softmax segmentation model was used as the foundation for a biopsy classification system tested on 1,054 colon biopsies from the Cannizzaro Hospital in Italy. Rather than building a complex end-to-end classifier, the authors extracted five simple, interpretable features from each segmentation map: (1) the normalized histogram of all 14 tissue types, (2) the number of tumor clusters, (3) the average cluster size, (4) the minimum cluster size, and (5) the maximum cluster size. All tumor clusters smaller than 30 square micrometers were excluded to filter out false-positive segmentation noise.
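Tumor clusters can be found with a connected-components pass over the segmentation map, dropping components below the 30 µm² area threshold before computing the count and size statistics. A minimal 4-connectivity sketch (at 1 µm/pixel each pixel covers 1 µm², so the threshold is 30 pixels; the function name and flood-fill approach are our illustration):

```python
def tumor_cluster_sizes(label_map, tumor_label=3, min_area=30):
    """Sizes of 4-connected tumor components in a 2-D label map, dropping
    clusters smaller than min_area pixels (30 um^2 at 1 um/pixel)."""
    h, w = len(label_map), len(label_map[0])
    seen, sizes = set(), []
    for y in range(h):
        for x in range(w):
            if label_map[y][x] != tumor_label or (y, x) in seen:
                continue
            stack, area = [(y, x)], 0        # iterative flood fill
            seen.add((y, x))
            while stack:
                cy, cx = stack.pop()
                area += 1
                for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                    if (0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen
                            and label_map[ny][nx] == tumor_label):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            if area >= min_area:
                sizes.append(area)
    return sizes
```

The cluster count, mean, minimum, and maximum then follow directly from the returned list.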
A random forest classifier with 1,000 decision trees was trained on these features using five-fold cross-validation with class-balanced folds. When a WSI contained multiple tissue fragments, each was processed independently and the worst (highest-risk) classification was adopted as the final slide label. This mirrors clinical practice, where pathologists flag the most concerning finding.
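The fragment-merging step is a worst-case rule over a fixed risk ordering. A sketch (the explicit ordering below is our assumption, inferred from the four slide-level labels):

```python
# Higher index = higher risk (assumed ordering, benign -> high-risk).
RISK_ORDER = ["other/benign", "hyperplasia", "low-grade dysplasia", "high-risk"]

def slide_label(fragment_labels):
    """Adopt the highest-risk label among a slide's tissue fragments."""
    return max(fragment_labels, key=RISK_ORDER.index)
```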
The system achieved one-vs-all AUC values of 0.87 (+/- 0.03) for high-grade dysplasia/tumor, 0.82 (+/- 0.02) for low-grade dysplasia, 0.89 (+/- 0.03) for hyperplasia, and 0.79 (+/- 0.05) for other/benign conditions. The overall quadratic weighted kappa score was 0.91, indicating strong agreement with pathologist diagnoses. Of the 36 hyperplasia cases, 29 were correctly classified (80.6%), though the system struggled to distinguish hyperplasia from low-grade dysplasia, a known challenge even for experienced pathologists.
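Quadratic weighted kappa penalizes disagreements by the squared distance between ordinal categories: kappa = 1 - sum(w·O) / sum(w·E), with weights w_ij = (i - j)², O the observed confusion matrix, and E the matrix expected from the marginals. A compact sketch:

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=4):
    """Quadratic weighted kappa for ordinal labels 0..num_classes-1."""
    n = len(y_true)
    obs = [[0.0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        obs[t][p] += 1
    row = [sum(obs[i]) for i in range(num_classes)]
    col = [sum(obs[i][j] for i in range(num_classes)) for j in range(num_classes)]
    num = den = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2                 # quadratic disagreement weight
            num += w * obs[i][j]             # observed weighted disagreement
            den += w * row[i] * col[j] / n   # expected under independence
    return 1.0 - num / den
```

Perfect agreement gives 1.0, and near-miss errors between adjacent grades are penalized far less than confusing benign tissue with tumor, which suits ordinal diagnostic labels.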
Staining-related errors: Detailed error analysis revealed that roughly 40% of classification mistakes could be traced back to faulty segmentation output. In overstained specimens, dark tissue regions were sometimes incorrectly identified as tumor when they should have been labeled as normal or low-grade dysplastic epithelium. The authors proposed two remedies: more substantial stain augmentation during training (requiring retraining), or applying stain normalization at inference time using methods like Cycle-GANs, which would not require retraining.
Artifact sensitivity: Incidental artifacts such as small air bubbles on glass slides and staining defects also contributed to errors. The paper showed an example where an air bubble caused a dramatic segmentation failure. These artifact-driven errors could be mitigated by implementing quality control on digitized slides before AI processing, or by explicitly including artifact-containing regions in the training data so the model learns to handle them.
Missing hyperplasia class: The segmentation model was not trained with a dedicated hyperplasia class due to insufficient training material, yet hyperplasia was included as a classification target, which helps explain the frequent confusion between hyperplasia and low-grade dysplasia noted above. The authors recommended incorporating hyperplasia directly into the segmentation model's class set. Similarly, adding more spatial features for classes beyond tumor epithelium (such as low-grade dysplasia cluster statistics) could improve the classifier's discriminative power.
Looking ahead, the authors envision integration into clinical workflows through two scenarios: AI pre-reading cases to fill reports for pathologist sign-off, or AI risk-scoring cases to prioritize the order in which pathologists review them. The segmentation model has been made publicly available for research use on the Grand Challenge platform, enabling the broader community to build upon this 14-class tissue mapping approach for applications ranging from tumor-stroma ratio quantification to spatial immune cell profiling.