Interpretable Survival Prediction for Colorectal Cancer Using Deep Learning

PMC (Open Access), 2021

Plain-English Explanations
Page 1
Why Colorectal Cancer Prognosis Needs More Than TNM Staging

Colorectal adenocarcinoma is the third most commonly diagnosed cancer worldwide and the second leading cause of cancer-related death, trailing only lung cancer. For patients diagnosed at stage II or stage III, the key clinical question is whether they will benefit from adjuvant chemotherapy after surgical tumor removal. Current decision-making relies heavily on TNM staging (Tumor extent, Node involvement, Metastasis), but even within a single TNM stage, patient outcomes vary substantially. Some stage II patients experience recurrence and death despite appearing low-risk, while others classified as high-risk never relapse.

For stage II patients specifically, adjuvant chemotherapy benefits only a small subset, and overtreatment carries substantial adverse effects. For stage III patients, chemotherapy is generally standard care, but prognostic information still matters because it determines regimen intensity and treatment duration. Known histopathologic features such as tumor budding, lymphovascular invasion, and tumor grade can provide useful information, but their clinical utility is limited by poor sensitivity and high inter-pathologist variability in assessment.

The deep learning opportunity: Recent machine learning approaches have shown promise for extracting novel prognostic information directly from routine histopathology slides. However, most deep learning prognostic models function as "black boxes," making it difficult for clinicians to understand or trust the features driving their predictions. This study set out to build a deep learning system (DLS) for survival prediction and, critically, to develop a method for making its learned features interpretable and reproducibly identifiable by human pathologists.

Study scale: The researchers assembled 3,652 cases (27,300 whole-slide images) for model development, with two separate validation datasets containing 1,239 cases (9,340 slides) and 738 cases (7,140 slides). All cases came from the Medical University of Graz Biobank, spanning patients diagnosed between 1984 and 2013, providing a uniquely large and long-follow-up dataset for this type of research.

TL;DR: Colorectal cancer is the third most common cancer globally, and TNM staging alone leaves substantial outcome variability unexplained. This study used 3,652 cases (27,300 slides) to build a deep learning survival predictor for stage II/III patients and developed a method to make its predictions interpretable.
Pages 1-2
How the Two-Stage Deep Learning System Was Built

Data preparation: The study used archived formalin-fixed, paraffin-embedded, hematoxylin and eosin (H&E) stained slides from the Medical University of Graz Biobank. All slides were scanned at 20x magnification (0.5 micrometers per pixel) using a Leica Aperio AT2 scanner. The initial pool contained 6,437 cases and 114,561 slides. After excluding immunohistochemistry-stained slides, non-colorectal tissue, cases with low tumor content, deaths within 30 days of surgery, and secondary tumor resections, the final dataset comprised 5,629 cases (43,780 slides).

Two-model pipeline: The DLS consisted of two sequential models. First, a tumor segmentation model based on the Inception-v3 architecture classified every region of a whole-slide image as tumor or non-tumor. This model was trained on pixel-level annotations from 265 slides and achieved an AUC of 0.985 for patch-level tumor detection. The identified tumor regions then served as the input for the second prognostic model, which predicted case-level disease-specific survival (DSS) risk scores.
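To make the handoff between the two models concrete, here is a minimal sketch of tiling a whole-slide image into patches and keeping only the ones a patch-level tumor classifier flags. The helper names, patch size, and the `tumor_prob` callable standing in for the trained Inception-v3 model are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def tile_slide(slide, patch=256, stride=256):
    """Split a whole-slide RGB array (H, W, 3) into non-overlapping patches.

    Returns the stacked patches and their top-left (y, x) coordinates so
    tumor-positive patches can be mapped back onto the slide.
    """
    h, w = slide.shape[:2]
    patches, coords = [], []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(slide[y:y + patch, x:x + patch])
            coords.append((y, x))
    return np.stack(patches), coords

def tumor_patches(patches, coords, tumor_prob, threshold=0.5):
    """Keep only patches the segmentation model calls tumor.

    `tumor_prob` is a placeholder for the trained classifier: it maps a
    batch of patches to per-patch tumor probabilities.
    """
    keep = tumor_prob(patches) > threshold
    return patches[keep], [c for c, k in zip(coords, keep) if k]
```

Only the patches surviving this filter feed the downstream prognostic model, which is what lets the second model ignore normal mucosa, stroma-only regions, and background glass.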

Prognostic model design: The prognostic model used a weakly supervised approach with convolutional neural networks (CNNs) built from depth-wise separable convolution layers (similar to MobileNet architecture). During training, 16 image patches per case were randomly sampled from tumor-containing regions. The patch-level features were extracted by shared-weight CNN modules, merged into a case-level feature vector via average pooling, and passed through a final Cox regression layer to produce a scalar risk score. The loss function was Cox partial likelihood, and training ran for 2 million steps across 50 distributed workers.
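The case-level pooling and Cox loss described above can be sketched in a few lines of NumPy. This is a simplified illustration of the stated design (average pooling over patch features, a linear Cox layer, Breslow-style partial likelihood with no tie handling), not the authors' TensorFlow implementation.

```python
import numpy as np

def case_risk(patch_features, w, b=0.0):
    """Average-pool patch feature vectors (16 per case during training)
    into one case-level vector, then apply a linear Cox layer to get a
    scalar risk score."""
    case_vec = patch_features.mean(axis=0)
    return float(case_vec @ w + b)

def cox_loss(risk, times, events):
    """Negative Cox partial log-likelihood (Breslow form, ignoring ties).

    risk   : (n,) scalar risk scores, one per case
    times  : (n,) follow-up times
    events : (n,) 1 = disease-specific death observed, 0 = censored
    """
    order = np.argsort(-times)               # sort by descending follow-up time
    r, e = risk[order], events[order]
    # position i's risk set is everyone still under observation at t_i,
    # i.e. the prefix 0..i in descending-time order
    log_denominator = np.logaddexp.accumulate(r)
    return -np.sum((r - log_denominator)[e == 1])
```

Ranking cases correctly (higher risk scores for earlier disease-specific deaths) lowers this loss, which is what lets the network learn from censored survival data without any patch-level outcome labels.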

Validation strategy: Cases from 1984 to 2007 were randomly split 2:1:1 into training, tuning, and validation set 1. All cases from 2008 to 2013 formed validation set 2, providing a temporal validation that tested whether the model could generalize to a more recent cohort with potentially different treatment practices and slide preparation techniques. The primary evaluation metric was 5-year disease-specific survival AUC, chosen because all cases had at least 5 years of follow-up.
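The 5-year DSS AUC treats prediction as a binary problem (disease-specific death within 5 years versus survival past 5 years), which is well defined here precisely because every case had at least 5 years of follow-up. A minimal rank-based AUC, equivalent to the probability that a randomly chosen event case receives a higher risk score than a randomly chosen non-event case, can be sketched as:

```python
import numpy as np

def five_year_auc(risk, died_within_5y):
    """AUC as the fraction of (event, non-event) pairs that the risk
    score orders correctly; ties count as half."""
    risk = np.asarray(risk, dtype=float)
    y = np.asarray(died_within_5y, dtype=bool)
    pos, neg = risk[y], risk[~y]
    # pairwise comparison of every event score against every non-event score
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

This pairwise form is O(n²) and only meant to show the metric's meaning; production code would use a sorted-rank implementation.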

TL;DR: The DLS used two sequential models: a tumor segmentation model (AUC 0.985) to find cancer regions, then a weakly supervised CNN prognostic model trained on 3,652 cases using Cox partial likelihood loss. Two validation sets (including a temporal validation from a later era) tested generalizability.
Pages 2-3
The DLS Achieved Meaningful Survival Prediction Across Both Validation Sets

AUC results: For the combined stage II/III cohort, the DLS achieved a 5-year disease-specific survival AUC of 0.698 in validation set 1 (1,239 cases) and 0.686 in validation set 2 (738 cases). When broken down by stage, stage II AUCs were 0.680 and 0.663 in the two sets, while stage III AUCs were 0.655 in both validation sets. These results were consistent with those reported by Skrede et al. (2020) using a comparable weakly supervised approach on a different cohort, providing cross-study validation that this type of deep learning approach can achieve substantial risk stratification.

Kaplan-Meier stratification: When patients were divided into quartile-based risk groups, the DLS produced highly significant separation in survival curves (p < 0.001 for log-rank tests). Among stage II patients, the 5-year DSS rates for the high-risk versus low-risk groups were 73% versus 89% in validation set 1, and 57% versus 86% in validation set 2. For stage III patients, the corresponding rates were 41% versus 76% and 43% versus 73%. These survival differences are comparable to or greater than those observed for established prognostic factors such as T-category, tumor-infiltrating lymphocytes, lymphovascular invasion, and perineural invasion.
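The survival curves behind those percentages come from the Kaplan-Meier product-limit estimator, computed separately per risk group. A minimal version (standard tie handling, no confidence intervals) looks like this:

```python
import numpy as np

def km_survival(times, events, t_eval):
    """Kaplan-Meier survival probability at time t_eval.

    At each distinct event time t, multiply by (1 - d/n), where d is the
    number of deaths at t and n is the number of patients still at risk
    (not yet dead or censored before t).
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    for t in np.unique(times[events]):
        if t > t_eval:
            break
        n_at_risk = np.sum(times >= t)
        deaths = np.sum((times == t) & events)
        surv *= 1.0 - deaths / n_at_risk
    return surv
```

Evaluating this at 5 years for the high-risk and low-risk quartiles yields the kind of 5-year DSS comparison reported above; the log-rank test then asks whether the full curves differ by more than chance would allow.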

Additive value beyond clinical features: In multivariable Cox regression adjusting for nine clinicopathologic features (including age, sex, tumor grade, T-category, N-category, R-status, L-status, and V-status), the DLS remained a statistically significant independent predictor of survival (p < 0.001). Adding the DLS risk score to the clinicopathologic baseline model increased the 5-year AUC by 0.120 and 0.085 for stage II (across the two validation sets), and by 0.065 and 0.022 for stage III. For the combined cohort, the final AUCs reached 0.733 and 0.721, representing absolute improvements of 0.055 and 0.038 over the baseline.

Clinical implications: These results suggest the DLS could meaningfully inform treatment decisions. For stage II patients, the model could help identify the high-risk subset most likely to benefit from adjuvant chemotherapy, reducing overtreatment. For stage III patients, it could guide decisions about therapy regimen intensity and duration. The authors note that prospective studies evaluating the impact of DLS-informed treatment decisions on patient outcomes are warranted.

TL;DR: The DLS achieved 5-year survival AUCs of 0.70 and 0.69 across two validation sets and added significant predictive value beyond nine standard clinicopathologic features. High-risk versus low-risk survival differences reached 29 percentage points for stage II and 35 points for stage III patients.
Pages 4-5
Making the Black Box Transparent: Clustering-Derived Interpretable Features

The interpretability challenge: A critical barrier to clinical adoption of deep learning prognostic models is their "black box" nature. If clinicians cannot understand what features drive a model's predictions, they cannot build the trust necessary for AI-supported decision-making. The researchers developed a novel method for generating human-interpretable histologic features by clustering image embeddings, then measured how well these features could explain the DLS predictions.

Standard clinical features fall short: The team first tested whether known clinicopathologic features (including T-category, N-category, grade, sex, age, margin status, and lymphatic and venous invasion) could explain DLS scores. Using multivariable linear regression, these nine features accounted for only 18% of the variance in DLS scores (R-squared = 0.18 in both validation sets). The most significant associations were with T-category and N-category. This result confirmed that the DLS had learned something substantially beyond what standard clinical staging captures.

Clustering-derived features explain the majority: The researchers used a pre-trained image-similarity deep learning model (SMILY) to generate patch-level embeddings that captured visual similarity. They clustered 100,000 tumor-containing training patches into 200 groups using k-means. For each case, the percentage of patches belonging to each cluster was calculated as a feature. Remarkably, these 200 clustering-derived features explained 73% of the variance in DLS scores for validation set 1 and 80% for validation set 2. Even a curated subset of just 10 features (selected via forward stepwise selection) explained 57% and 61% of the variance, respectively.
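The feature construction and variance-explained analysis are straightforward to sketch. Given a cluster assignment for each tumor patch (from k-means on the SMILY embeddings, omitted here and taken as given), each case becomes a vector of cluster fractions, and ordinary least squares measures how much DLS-score variance those fractions explain. This is an illustrative reconstruction of the analysis, not the authors' code.

```python
import numpy as np

def cluster_fractions(patch_clusters, n_clusters=200):
    """Per-case feature vector: fraction of the case's tumor patches
    assigned to each of the n_clusters embedding clusters."""
    counts = np.bincount(patch_clusters, minlength=n_clusters)
    return counts / counts.sum()

def variance_explained(X, y):
    """R^2 of an ordinary least-squares fit of y on X (with intercept),
    i.e. the share of DLS-score variance the features account for."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
```

Stacking one `cluster_fractions` vector per case into `X` and regressing the case-level DLS scores `y` on it is exactly the kind of analysis that yielded the 73-80% figures; the 10-feature result simply restricts `X` to the stepwise-selected columns.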

Pathologist review: Three pathologists independently reviewed 15 sample patches per feature for each of the top 10 clusters and provided structured histopathological descriptions. The four high-risk features generally involved intermediate to high-grade tumor in small or solid clusters, while the six low-risk features typically contained lower-grade tumor forming glands and tubules with high tumor-to-stroma ratios. This demonstrated that the DLS had learned histologically meaningful and describable features rather than inscrutable mathematical artifacts.

TL;DR: Standard clinicopathologic features explained only 18% of DLS score variance, but 200 clustering-derived histologic features explained 73-80%. Even 10 selected features explained 57-61%. High-risk clusters showed poorly differentiated tumor in small clusters; low-risk clusters showed well-formed glands.
Pages 5-6
The Tumor-Adipose Feature: A Potentially Novel Prognostic Marker

Identifying the strongest signal: Among all 200 clustering-derived features, one stood out dramatically. Cluster #72, which the researchers named the "Tumor-Adipose Feature" (TAF), had the highest regression coefficient (strongest association with DLS scores) and the highest average patch-level DLS score of 2.76, nearly triple the next-highest feature at 0.97. TAF was characterized by small clusters of moderately to poorly differentiated tumor cells adjacent to a substantial component of adipose (fat) tissue, with a minor component of desmoplastic stroma.

Independent prognostic value: Case-level quantitation of TAF (measuring what percentage of a patient's tumor patches contained this feature) was independently and highly prognostic. Kaplan-Meier curves showed significant survival separation when patients were stratified by TAF content. Notably, TAF remained significantly associated with survival even within T3-only cases (a subgroup analysis), demonstrating that its prognostic value was independent of T-category staging. This is important because an initial interpretation might have been that TAF simply represents deep tumor invasion (T3/T4 stage).

Reproducible identification by humans: To test whether TAF could be reliably recognized, five participants (two anatomic pathologists and three non-pathologist researchers) were trained on 50 example patches and then assessed 200 new patches. Accuracy ranged from 87.0% to 95.5%, with inter-pathologist concordance of 93.5%. This high agreement suggests TAF is a visually distinctive and learnable feature that could be incorporated into routine pathological assessment without requiring AI assistance.

Biological hypotheses: The researchers proposed several hypotheses for why tumor cells near adipose tissue predict worse outcomes. One possibility involves submucosal adipose tissue as a prognostic factor potentially linked to inflammatory bowel disease or obesity. Evidence suggests that body-mass index, visceral fat, and subcutaneous fat may be associated with adverse outcomes in metastatic colorectal cancer. Another hypothesis involves cancer-associated adipocytes playing an adverse role, as described in breast cancer. There are also morphologic similarities between TAF and irregular tumor growth at the invasive edge, potentially representing an "infiltrative" rather than "pushing" border configuration, a known poor-prognosis pattern.

TL;DR: The "Tumor-Adipose Feature" (poorly differentiated tumor cells near fat tissue) was the strongest predictor in the model, with a DLS score of 2.76 (nearly 3x the next feature). Pathologists identified it with 87-95.5% accuracy, and it predicted survival independently of T-category staging.
Pages 6-7
What the DLS Scored as High-Risk and Low-Risk at the Tissue Level

Known histoprognostic features: To further understand the DLS, pathologists annotated 161 slides from validation set 2 for established prognostic features. Among known features, patches containing lymphovascular invasion had the highest average DLS score (1.03), followed by perineural invasion (0.75), intratumoral budding (0.33), peritumoral fibrosis (0.26), and peritumoral budding (0.10). Polyp patches had the lowest average score (-0.86). These associations align with established clinical knowledge, confirming the DLS learned clinically relevant patterns.

Clustering-derived features dwarf known features: However, the TAF cluster (#72) had a dramatically higher average DLS score (2.76) than any known histoprognostic feature. The next three high-risk clusters (#139, #96, #23) had average scores of 0.97, 0.74, and 0.74 respectively. Among the six low-risk clusters, scores ranged from -0.56 to -0.87. This comparison reveals that the DLS placed far greater weight on the novel TAF pattern than on any traditional prognostic marker, suggesting the model discovered morphological information not captured by conventional assessment.

High-risk versus low-risk patterns: The four high-risk clusters shared common themes: small or solid clusters of intermediate to high-grade tumor cells with substantial stromal or adipose components. Cluster #139 showed predominant stroma with mature and intermediate desmoplasia and relatively little tumor. Clusters #96 and #23 both featured small clusters of high-grade tumor, including single tumor cells, with mature desmoplasia. In contrast, the six low-risk clusters consistently showed lower-grade tumor forming well-defined glands and tubules with high tumor-to-stroma ratios, suggesting that organized, well-differentiated growth patterns signaled better prognosis.

Inference speed: The complete two-model pipeline took 11 ± 7 minutes per case on a single 16-core machine, 13 ± 8 seconds using 50 distributed cloud machines, and just 8 ± 5 seconds on a Google Cloud Tensor Processing Unit (TPU v2). Given that slide preparation and digitization alone take a few minutes per slide (with an average of 10 slides per case), the computational inference is not a bottleneck for clinical deployment.

TL;DR: The TAF cluster scored 2.76 on the DLS risk scale, nearly 3x higher than lymphovascular invasion (1.03), the highest-scoring known feature. High-risk patterns involved poorly differentiated tumor in small clusters; low-risk patterns showed well-formed glands. Inference takes 8-13 seconds with cloud computing.
Pages 7-8
Caveats: Single Institution, Retrospective Design, and Unexplained Variance

Retrospective confounding: Because this was a retrospective study, treatment pathways represent an important confounding factor. While treatment guidelines within stage II and stage III are fairly uniform, some variability in neoadjuvant and adjuvant therapy likely existed across the 1984-2013 study period. Progression-free survival might be less susceptible to treatment confounding than disease-specific survival, but it was not available at the required scale.

Single-institution data: All cases came from the Medical University of Graz Biobank. Although validation set 2 (2008-2013) provided temporal validation with different baseline characteristics and likely different treatment practices, geographic validation in diverse cohorts was not performed. Differences in patient demographics, tissue preparation, staining protocols, and scanning equipment across institutions could affect model performance. The authors explicitly note that geographically diverse data with the necessary imaging and clinical information were not available for this study.

Missing clinical correlates: Several known prognostic factors could not be evaluated for their association with the DLS, including tumor budding, number of lymph nodes examined, tumor location, obstruction, microsatellite instability, molecular profiles (BRAF, KRAS mutations), and formal tumor-infiltrating lymphocyte scoring. While the clustering analysis did not reveal obvious associations with TILs or desmoplasia, the relationship between the DLS and these factors requires formal examination in future work.

Unexplained variance: Even with all 200 clustering-derived features, approximately 20% of the variance in DLS scores remained unexplained. This suggests the model captured additional patterns not fully represented by the clustering approach. Furthermore, the clusters were based on image similarity rather than histopathological concepts, so pathologist-guided refinement of these clusters could potentially yield more prognostic and better-defined features. The TAF feature itself, while reproducibly identifiable at the patch level, still requires validation of pathologists' case-level quantitation, which will need standardized scoring guidelines.

TL;DR: Key limitations include single-institution data (Graz, Austria only), retrospective design with potential treatment confounding over a 30-year span, inability to assess several known prognostic factors, and roughly 20% of DLS score variance remaining unexplained by the clustering approach.
Pages 8-9
From Black Box to Discovery Tool: What This Means for Cancer Pathology

A framework for explainable AI in pathology: The central contribution of this study extends beyond the prognostic model itself. The clustering-based interpretability method provides a general framework that can be applied to any weakly supervised deep learning model in histopathology. By showing that human-interpretable features can explain 73-80% of a black-box model's predictions, the authors demonstrated that deep learning does not have to remain opaque. This approach bridges the gap between high-performance AI and the clinical trust required for real-world adoption.

Potential for novel feature discovery: The identification of TAF illustrates how machine learning can uncover potentially novel prognostic markers that were not previously defined or systematically studied. Unlike traditional hypothesis-driven research where pathologists first define features and then test their prognostic value, this study reversed the process: the model identified what mattered for survival, and pathologists then described and validated it. This "learn from the machine" paradigm could accelerate the discovery of new histomorphologic biomarkers across many cancer types.

Next steps for clinical translation: Before TAF or the DLS can enter clinical practice, several steps are needed. Multicenter, geographically diverse validation studies must confirm generalizability across different patient populations, tissue preparation methods, and scanning equipment. Standardized scoring guidelines for TAF need to be developed and tested for inter-pathologist reproducibility at the case level (not just the patch level). Prospective clinical trials should evaluate whether DLS-informed treatment decisions improve patient outcomes compared to standard staging alone. Integration with molecular biomarkers such as microsatellite instability status and BRAF/KRAS mutation profiles could further enhance prognostic accuracy.

Open science and reproducibility: The authors made their custom deep learning architecture, loss function, sample training code, and statistical analysis code publicly available on GitHub (under the google-health repository). The deep learning framework used was TensorFlow. While the pathology data itself requires ethics review for access through the Graz Biobank, the code transparency supports independent verification and extension of this work. The study followed the REMARK checklist for prognostic study reporting, further supporting reproducibility standards.

TL;DR: This study provides a reusable framework for making any weakly supervised pathology model interpretable. The TAF discovery shows AI can identify novel prognostic features, but multicenter validation, standardized scoring guidelines, and prospective trials are needed before clinical deployment. Code is publicly available on GitHub.
Citation: Wulczyn E, Steiner DF, Moran M, et al. Interpretable survival prediction for colorectal cancer using deep learning. npj Digital Medicine, 2021 (Open Access). PMC8055695. DOI: 10.1038/s41746-021-00427-2. License: CC BY.