Colorectal cancer (CRC) is the second most common cancer in women and the third most common in men, with global incidence estimated to increase by 80% by the year 2035. Most CRCs are sporadic (70 to 80%), while up to one third carry a hereditary component. The cornerstone of CRC diagnosis remains the pathologic examination of tissue stained with hematoxylin and eosin (H&E), followed by immunohistochemistry (IHC) and molecular techniques. However, the high incidence, combined with a worldwide shortage of pathologists, has led to significant delays in diagnosis and patient care.
The variability problem: Pathologists make diagnoses based on image-based pattern recognition. In several instances, accurate diagnosis or estimation of prognostic and predictive factors is subject to personal interpretation, leading to documented inter-observer and intra-observer variability. This inconsistency between experts has created a strong rationale for reliable, objective computational methods that can reduce diagnostic uncertainty.
Molecular complexity: Beyond morphology, CRC requires molecular classification. The Cancer Genome Atlas has classified CRC into three groups: hypermutated tumors (~13%), ultramutated tumors (~3%), and chromosomal instability (CIN) tumors (~84%). A four-subtype consensus molecular classification (CMS1 through CMS4) was proposed in 2015. Microsatellite instability (MSI), present in approximately 13 to 16% of cases, has significant diagnostic and prognostic value. All of these features must be assessed, compounding the pathologist's workload.
Enter deep learning: Over the last five years, the development of reliable computational approaches using machine learning and deep learning (DL) has grown exponentially. DL algorithms, particularly Convolutional Neural Networks (CNNs), have shown the potential to assist in diagnosis, predict clinically relevant molecular phenotypes and MSI status, identify histological features related to prognosis and metastasis, and assess tumor microenvironment components.
Database and search terms: The authors systematically searched PubMed from inception through 31 December 2021 using a comprehensive algorithm combining terms like "convolutional neural networks," "CNN," and "deep learning" with CRC-related terms (colon, colorectal, intestinal, bowel) and histopathology-specific terms (biopsy, microscopy, histology, slide, eosin). The search was conducted on 14 January 2022, followed PRISMA guidelines, and was registered in the PROSPERO database (2020).
Eligibility criteria: Included studies had to present at least one DL model for histopathological assessment of large bowel slides and CRC. Eligible applications covered diagnosis, tumor tissue classification, tumor microenvironment analysis, prognosis, survival and metastasis risk evaluation, tumor mutational burden characterization, and microsatellite instability detection. Excluded were in vitro models, studies using endoscopic or radiological images rather than histological sections, non-photonic microscopy, review articles, meta-analyses, non-human studies, and non-English publications.
Screening process: The initial search returned 166 articles. Four independent researchers screened the citations using the Rayyan online software: three assessed the medical aspects, while one CNN expert evaluated the technical components. During screening, 74 articles were excluded for reasons including use of endoscopic images (n=21), being review articles (n=20), or lacking relevant clinical endpoints (n=11), leaving 92 articles for full-text eligibility assessment. A further 10 were excluded after full-text review, leaving 82 articles in the final systematic review.
Dual-viewpoint analysis: From each paper, the authors extracted information on first author, year, journal, aim of medical research, technical method, classification details, dataset, and performance metrics. Uniquely, this review analyzes each study from both a medical viewpoint (what clinical question is addressed) and a technical viewpoint (what DL architecture is used), providing a comprehensive two-dimensional analysis of the field.
Binary cancer detection: Seventeen of the 82 studies focused on diagnostic classification, such as cancer versus non-cancer, benign versus colon adenocarcinoma, or benign versus malignant. A patch-cluster-based aggregation model developed by Wang et al. classified CRC images (cancer/not cancer) using 14,234 whole-slide images (WSIs) and achieved 98.11% accuracy with an AUC of 0.9983. Yu et al. demonstrated that semi-supervised learning (SSL) with large amounts of unlabeled data achieved a patient-level AUC of 0.974, comparable to pathologists, using 13,111 WSIs.
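Patch-based pipelines like those above must combine many patch-level predictions into a single slide-level call. A minimal stdlib sketch of one common aggregation rule, mean-pooling of patch probabilities (the threshold and the rule itself are illustrative assumptions, not the exact method of Wang et al. or Yu et al.):

```python
from statistics import mean

def slide_level_prediction(patch_probs, threshold=0.5):
    """Aggregate per-patch cancer probabilities into one slide-level label.

    patch_probs: list of floats in [0, 1], one per tissue patch.
    Returns (label, score) where score is the mean patch probability.
    Mean-pooling is one simple rule; max-pooling or learned
    cluster-based aggregation are common alternatives.
    """
    if not patch_probs:
        raise ValueError("slide produced no tissue patches")
    score = mean(patch_probs)
    return ("cancer" if score >= threshold else "not cancer", score)

# Hypothetical per-patch probabilities from a trained classifier:
label, score = slide_level_prediction([0.9, 0.8, 0.2, 0.95, 0.7])
```

The choice of aggregation rule matters clinically: mean-pooling dilutes small tumor foci across a large slide, which is why several reviewed studies adopted cluster- or attention-based alternatives.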
Multi-class tissue classification: Another 17 studies addressed tumor tissue classification across multiple categories. Iizuka et al. classified CRC into adenocarcinoma, adenoma, and normal tissue across 4,036 WSIs, achieving AUCs of 0.967 for adenocarcinoma and 0.99 for adenoma using Inception v3. The ARA-CNN model by Raczkowski et al. achieved 99.11% accuracy on binary tasks and 92.44% on 8-class tissue classification. Grading into normal, low-grade, and high-grade CRC was accomplished by Awan et al. and Shaban et al. with 91% and 95.7% accuracy, respectively.
Gland segmentation: The challenging task of gland segmentation, critical for tumor grading, was addressed by multiple studies. Kainz et al. trained two networks that achieved 95% and 98% classification accuracy for recognizing and separating glands. Graham et al. proposed the MILD-Net+ architecture for simultaneous gland and lumen segmentation. Chen et al. developed the deep contour-aware network (DCAN) for accurate gland and nuclei segmentation on histological CRC images.
Aggregate performance: Across all reviewed studies, binary classification problems achieved a mean accuracy of 94.11% (±1.3%) with a mean AUC of 0.852 (±0.066). Three-class problems reached a mean accuracy of 95.5% (±1.7%) with a mean AUC of 0.931 (±0.051). Eight-class problems achieved a mean accuracy of 94.4% (±2.0%) with a mean AUC of 0.972 (±0.022).
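The pooled figures above are unweighted per-study means with sample standard deviations, computed separately per problem type. A sketch of the computation (the per-study accuracies below are hypothetical placeholders, not values from the review):

```python
from statistics import mean, stdev

# Hypothetical per-study accuracies (%) for one problem type,
# e.g. binary classification:
accuracies = [93.1, 95.2, 94.0, 94.5, 93.8]

pooled_mean = mean(accuracies)   # unweighted mean across studies
pooled_sd = stdev(accuracies)    # sample standard deviation (n - 1)
summary = f"{pooled_mean:.2f}% \u00b1 {pooled_sd:.2f}%"
```

Note the mean is unweighted: a study with 14,000 WSIs counts the same as one with 100 slides, which is one reason the authors caution against over-interpreting pooled values.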
Scope of microenvironment studies: Nineteen of the 82 reviewed studies investigated the tumor microenvironment, the largest single category. The tumor microenvironment includes stroma, necrosis, lymphocytes, immune cell infiltrates, and other non-tumor tissue components that play critical roles in cancer progression and treatment response. Jiao et al. demonstrated that a higher tumor-stroma ratio was a risk factor for progression, while high levels of necrosis and lymphocyte features were associated with a lower progression-free interval.
Immune cell quantification: Pai et al. conducted a tumor microenvironment analysis on colorectal tissue microarrays (TMAs), where the algorithm efficiently detected differences between mismatch-repair deficient (MMRD) and mismatch-repair proficient (MMRP) slides based on inflammatory stroma, tumor-infiltrating lymphocytes (TILs), and mucin. Swiderska-Chadaj et al. compared four different CNN architectures for lymphocyte detection in IHC images stained for CD3 and CD8, with U-Net showing the best performance (F1 score: 0.80). Xu et al. proposed a DL model for quantifying CD3 and CD8 T-cell density within the stroma using IHC slides, achieving high accuracy and showing that a higher stromal immune score predicted improved survival.
Biomarker and mutation prediction: The ImmunoAIzer model by Bian et al. achieved 90.4% accuracy in biomarker prediction (CD3, CD20, TP53, DAPI) and detected tumor gene mutations with AUCs of 0.76 to 0.79 for APC, KRAS, and TP53. Schrammen et al. proposed the Slide-Level Assessment Model (SLAM) for simultaneous tumor detection (AUROC 0.980) and BRAF status detection (AUROC 0.821). Predictions of genetic mutations such as APC, KRAS, PIK3CA, SMAD4, TP53, and BRAF from H&E slides alone could support clinical diagnosis and better stratify patients for targeted therapies.
Nuclei classification: Graham et al. proposed HoVer-Net, a novel CNN architecture based on Preact-ResNet50, for simultaneous segmentation and classification of nuclei into four types: normal, malignant, dysplastic epithelial, and inflammatory. Shapcott et al. trained a CNN on 853 annotated images for four nuclei types (epithelial, inflammatory, fibroblasts, "other"), achieving 76% classification accuracy and finding that fewer inflammatory cells were related to mucinous carcinoma.
Lymph node metastasis prediction: Fourteen studies focused on histological features related to prognosis, metastasis, and survival. Chuang et al. used ResNet-50 to predict nodal metastasis in slides containing one or more lymph nodes, achieving an AUC of 0.9993 for macrometastasis and 0.9956 for micrometastasis across 3,182 WSIs. Kiehl et al. reported an AUROC of 0.71 on an internal test set using ResNet-18 pre-trained on the CAMELYON16 challenge. Brockmoeller et al. used ShuffleNet with transfer learning, achieving an AUROC of 0.733 for predicting more than one positive lymph node in both pT1 and pT2 CRC cohorts.
Survival prediction: Bychkov et al. used TMAs with a VGG-16 followed by a recurrent ResNet to predict 5-year disease-specific survival (DSS), achieving a hazard ratio of 2.3 (95% CI: 1.79 to 3.03) and an AUC of 0.69. Skrede et al. deployed an ensemble of ten CNN models based on DoMore.v1 using over 12 million image tiles, classifying patients into good, uncertain, and poor prognosis groups. The poor-prognosis group had an unadjusted hazard ratio of 3.84 compared to the good-prognosis group, with an overall AUC of 0.713 for 3-year cancer-specific survival.
Stromal and prognostic features: The tumor-stroma ratio (TSR), measured by DL, emerged as an important prognostic factor. Zhao et al. and Geessink et al. showed that a stroma-high score was associated with reduced overall survival. Kather et al. proposed a "deep stroma score" combining non-tumor tissue components as an independent prognostic factor, particularly for patients with advanced CRC. Jones et al. found that a lower desmoplastic-to-inflamed stroma ratio predicted disease recurrence after rectal excision (AUC 0.71, sensitivity 0.92).
Crohn-like lymphoid reaction: Zhao et al. demonstrated that a Crohn-like lymphoid reaction (CLR) density at the invasive front of the tumor was a good predictor of prognosis independent of TNM stage. High CLR density was associated with improved overall survival, with hazard ratios of 0.58 in the discovery cohort and 0.45 in the validation cohort.
Clinical importance of MSI: Microsatellite instability, driven by defective mismatch repair (MMR) DNA mechanisms, is present in approximately 13 to 16% of CRCs. MSI status guides treatment decisions, particularly for immunotherapy eligibility, and helps identify Lynch syndrome (hereditary nonpolyposis colorectal cancer), which accounts for 2 to 3% of all CRCs. Ten of the 82 reviewed studies specifically addressed MSI detection from histopathological images.
Top-performing MSI classifiers: Echle et al. developed a modified ShuffleNet-based DL detector using data from the MSIDETECT consortium, achieving a cross-validation AUC of 0.92 and a validation AUROC of 0.96 after color normalization. Lee et al. proposed a two-stage framework (initial custom CNN followed by Inception v3), achieving an AUC of 0.972 on the Seoul St. Mary's Hospital dataset. Bilal et al. used ResNet-18 and adaptive ResNet-34 to predict multiple molecular pathways, achieving a mean AUROC of 0.86 for MSI status.
Consensus molecular subtypes: Beyond binary MSI/MSS classification, Sirinukunwattana et al. used Inception v3 with adversarial learning to classify CRC into the four consensus molecular subtypes (CMS1 through CMS4) across 1,206 slides from three datasets, achieving AUCs of 0.81 to 0.88. Nguyen et al. associated CMS classification with mucin-to-tumor area quantification, finding that CMS2 CRC had no mucin and that MUC5AC protein expression indicated worse overall survival. Shimada et al. developed a CNN for predicting tumor mutational burden-high (TMB-H), achieving an AUC of 0.91.
Synthetic data for MSI prediction: Krause et al. explored an innovative approach using a Conditional Generative Adversarial Network (CGAN) to generate synthetic histology images for MSI detection. A synthetic dataset achieved an AUROC of 0.743, nearly matching real image performance (AUROC 0.742 and 0.757 for the two patient cohorts). The best results (AUROC 0.777) came from combining both synthetic and real images, suggesting GANs can augment limited training data.
Distribution of approaches: Of the 82 studies, 80 employed CNNs for image segmentation or classification, while only 2 used Generative Adversarial Networks (GANs) for image simulation. Among the CNN studies, 10 proposed custom architectures built from scratch, 42 used pre-trained popular architectures with transfer learning, and 26 implemented novel architectures including modifications (5 studies), combinations of CNNs with other AI techniques (15 studies), and ensemble methods (6 studies). Two studies did not provide architectural details.
Popular pre-trained architectures: The most commonly used pre-trained model family was VGG, employed in various configurations (VGG-16 and VGG-19) across at least eight studies. Inception v3 was the second most popular, used for tasks ranging from tumor tissue classification to MSI prediction. ResNet variants (ResNet-18, ResNet-34, ResNet-50) were the third most common, often combined with other architectures. Additional models included U-Net for encoding-decoding segmentation tasks, ShuffleNet, AlexNet, YOLO, DenseNet, MobileNet, LSTM, Xception, DarkNet, and EfficientNetB1. Nearly all studies utilized popular machine learning environments such as PyTorch, TensorFlow, Keras, and Fastai.
Novel and ensemble architectures: Several modified architectures stood out: HoVer-Net (based on Preact-ResNet50), KimiaNet (based on DenseNet, achieving 96.80% accuracy), and a MobileNetV2 modification by Yamashita et al. that outperformed pathologists in MSI prediction (model AUROC 0.865 versus pathologist AUROC 0.605). Ensemble approaches included the DoMore.v1-based system by Skrede et al. with ten CNN models, and the Mean-Ensemble-CNN and NN-Ensemble-CNN by Paladini et al. using ResNet-101, ResNeXt-50, Inception-v3, and DenseNet-161 together.
Transfer learning advantage: Pre-trained models, typically trained on ImageNet, proved less computationally expensive and often achieved strong results even though the original training images differed from histological sections. Transfer learning was the most popular approach (42 of 80 CNN studies), confirming its effectiveness for CRC histopathology where annotated datasets are limited. Custom architectures, though simpler (typically 4 to 15 layers), performed well for straightforward binary classification tasks.
Dominant datasets: The Cancer Genome Atlas (TCGA) was the most frequently used public dataset across the reviewed studies, providing WSIs for colon adenocarcinoma (TCGA-COAD, 461 patients) and rectal adenocarcinoma (TCGA-READ, 172 patients). The University Medical Center Mannheim dataset, curated by Kather et al., provided 100,000 image patches across 9 tissue classes and was reused in multiple studies. The 2015 MICCAI Gland Segmentation Challenge dataset (85 training and 80 testing images) served as the benchmark for gland segmentation tasks. Other notable datasets included the DACHS cohort (2,431 patients), LC25000 (10,000 colon images), and DigestPath (660 H&E images).
Dataset sizes varied dramatically: Studies ranged from as few as 10 slides (Holland et al.) to over 14,000 WSIs (Wang et al.) and 12 million image tiles (Skrede et al.). The largest datasets generally produced the most clinically relevant results, though some smaller, carefully curated collections achieved competitive performance. Patch-level analysis was common, with individual WSIs typically generating hundreds to thousands of patches at 20x or 40x magnification. A Hamamatsu NanoZoomer scanner at 40x produces pixels corresponding to 227 nm each, capturing cellular-level detail.
GANs for data augmentation: Two studies specifically explored GANs to improve training data. Krause et al. used a Conditional GAN (CGAN) with 6 convolutional layers for both generator and discriminator networks, generating synthetic histology images for MSI classification training. Deshpande et al. introduced SAFRON, a novel GAN architecture that generates images of arbitrarily large sizes from small patch training, achieving 97% median classification accuracy with generated images added to the baseline set versus 93% without. These approaches address the fundamental challenge of limited annotated histopathology data.
Whole-slide image processing: WSIs are gigapixel images, with a single slide potentially generating 170,000+ patches. Most scanning systems (Hamamatsu NanoZoomer, Omnyx, Zeiss, Pannoramic 250 Flash II, Leica Aperio) offer 20x and 40x optical magnifications. Images are stored in compressed JPEG or uncompressed TIFF format. The computational challenge of processing these massive images has driven the adoption of patch-based approaches, where representative regions are extracted and classified individually before aggregation.
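The patch counts quoted above follow directly from slide and patch geometry. A stdlib-only sketch makes the scale concrete (the slide dimensions are hypothetical; the 224 px patch size is a common convention, and 227 nm/px at 40x is the figure mentioned earlier):

```python
def patch_grid(slide_w, slide_h, patch=224):
    """Return top-left (x, y) coordinates of a non-overlapping grid
    of patch x patch tiles covering a slide_w x slide_h pixel slide.
    Real pipelines also drop background-only tiles first."""
    return [(x, y)
            for y in range(0, slide_h - patch + 1, patch)
            for x in range(0, slide_w - patch + 1, patch)]

# Hypothetical 40x gigapixel slide: 100,000 x 90,000 pixels.
coords = patch_grid(100_000, 90_000)
n_patches = len(coords)   # on the order of 10^5 tiles per slide

# At 227 nm per pixel (40x), each 224 px patch spans ~50.8 um,
# i.e. a few cell diameters.
patch_um = 224 * 227 / 1000
```

Even before tissue filtering, this hypothetical slide yields well over 170,000 tiles, which is why patch-level classification followed by aggregation dominates the reviewed literature.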
AI matches pathologists: Several studies demonstrated that DL-based model predictions did not differ significantly from pathologists' assessments. Wang et al. showed that their CNN's CRC classification was comparable to pathologists' diagnoses, with no statistically significant difference. Yamashita et al. found their modified MobileNetV2 achieved an AUROC of 0.865 for MSI prediction, significantly outperforming pathologists at 0.605 in a reader study. These results suggest DL algorithms could provide valuable second opinions, especially when diagnostic inconsistencies occur.
Quantitative microenvironment analysis: In clinical practice, DL algorithms could provide valuable quantitative information about the tumor microenvironment that is difficult for pathologists to assess manually. Better patient stratification for targeted therapies through DL-based mutation prediction (MSI, BRAF, KRAS, TP53) represents one of the most promising applications. The ability to predict molecular status from routine H&E slides alone could reduce the need for expensive molecular testing in many cases.
Evolution of the field: The review documents a clear trajectory: early studies employed simple custom CNNs, followed by wider adoption of transfer learning from pre-trained networks, and finally the development of novel architectures tailored to specific medical questions. In the last two years covered, alternative deep learning techniques like GANs emerged. The authors note this progression is faster in CRC histopathology than in other fields, likely driven by the disease's high incidence and the availability of large public datasets like TCGA.
Infrastructure requirements: An efficient, fully digital workflow requires dedicated technology infrastructure: computers, scanners, workstations, and medical displays. Once scanned, histological images can be reviewed by pathologists simultaneously from different locations, enabling remote collaboration. However, the transition to a fully digital pathology workflow remains a significant investment for most institutions.
Dataset limitations: The most significant barrier is the need for larger, higher-quality datasets with expert annotations and external validation cohorts. Many studies relied on single-institution data or the same public datasets (particularly TCGA and the Mannheim collection), raising questions about generalizability. Large datasets are not always available from pathologist annotations, and enrichment with simulated training sets via GANs remains underexplored, with only 2 of 82 studies using this approach.
Cross-study comparison challenges: Different studies used varying performance metrics and classification problem structures, making direct cross-study comparison difficult. The authors note that "it is not meaningful to calculate the average performance value for all the studies" because the nature of each classification problem differs. Only accuracy and AUC, the two most commonly reported metrics, could be aggregated across problem types. Standardized benchmarking protocols would significantly advance the field.
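Since AUC is the metric most often aggregated across these heterogeneous studies, it helps to recall what it measures: the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one. A minimal stdlib implementation of this Mann-Whitney formulation (illustrative only, not tied to any reviewed study):

```python
def auc(scores_pos, scores_neg):
    """AUC as the normalized Mann-Whitney U statistic: the fraction
    of (positive, negative) pairs in which the positive case scores
    higher, counting ties as half a win."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos
        for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# A perfect ranking gives 1.0; chance-level ranking hovers near 0.5.
perfect = auc([0.9, 0.8], [0.2, 0.1])
```

This pairwise-ranking interpretation is also why AUC transfers across problems with different class balances better than accuracy does, though it still cannot reconcile studies that posed structurally different classification problems.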
Technical and practical barriers: Inter-laboratory variability in tissue preparation, staining protocols, and scanning equipment creates domain shift problems that can degrade model performance. The computational demands of processing gigapixel WSIs remain significant. Most reviewed models focused on a single classification task, but real clinical workflows require multi-task models that can simultaneously assess grade, stage, molecular markers, and microenvironment features from the same slide.
Path forward: The authors call for larger numbers of datasets, quality image annotations, and external validation cohorts to establish the diagnostic accuracy of DL models in clinical practice. They suggest that parts of this systematic review could be extended to a formal meta-analysis, particularly utilizing data from retrospective studies and survival analysis. The ultimate goal is to complement, not replace, the pathologist, providing objective outputs that reduce inter-observer variability and accelerate diagnosis for the benefit of patients.