Auto-Segmentation of Bladder Cancer on MRI


Plain-English Explanations
Page 1
Why Automatic Bladder Cancer Segmentation Matters and What This Study Set Out to Do

Clinical motivation: Bladder cancer (BC) is the tenth most common malignancy worldwide, with a rising mortality rate. Staging decisions depend heavily on MRI, and a growing body of research uses radiomics, the extraction of hundreds of quantitative features from medical images, to predict muscle invasion and other clinically relevant outcomes. However, radiomics analysis requires a precisely drawn region of interest (ROI) around the tumor on every MRI slice. Manual segmentation by radiologists is labor-intensive, time-consuming, and inherently subjective, making it a bottleneck for large-scale radiomics research and clinical deployment.

Gap in the literature: Prior studies on CNN-based automatic segmentation of bladder cancer on MRI were all conducted at single institutions with small patient cohorts. These models therefore lacked sufficient generalizability for clinical application because they were trained and tested on MR images from a single scanner and a single patient population. No prior work had attempted multi-center, multi-vendor automatic segmentation of bladder cancer using deep learning.

Study design: This two-center retrospective study from Kyoto University Hospital and Osaka Red Cross Hospital enrolled 170 patients with pathologically confirmed bladder cancer. MR images were acquired on seven different scanner models from three vendors (Siemens, Philips, and GE Healthcare) at both 1.5-T and 3-T field strengths. Patients were randomly split into a training set of 140 and a test set of 30. The study had two goals: (1) develop a modified U-Net model for accurate automatic segmentation of BC on diffusion-weighted imaging (DWI) and apparent diffusion coefficient (ADC) maps, and (2) assess the reproducibility of radiomics features extracted from automatically segmented tumors versus manually drawn ROIs.

Significance of the two-center approach: By pooling data from two hospitals with diverse MR scanners and acquisition parameters, the authors aimed to create a segmentation model robust enough to handle the heterogeneity encountered in real clinical practice. This is a critical step toward making radiomics-based bladder cancer assessment practical at scale, where images come from many different institutions and scanners.

TL;DR: This two-center study of 170 bladder cancer patients from seven different MR scanners developed a modified U-Net model for automatic tumor segmentation on MRI. The dual goals were to achieve high segmentation accuracy across diverse imaging conditions and to verify that radiomics features extracted from automatic segmentations are reproducible compared to manual expert ROIs.
Pages 2-3
Patient Selection, Multi-Vendor MRI Acquisition, and Image Annotation

Patient cohort: Consecutive patients with pathologically proven BC who underwent preoperative bladder MRI between January 2016 and June 2020 were identified. Pathological diagnosis was confirmed by transurethral resection of bladder tumor (TURBT) or cystectomy. Exclusion criteria included prior BC treatment within 6 months, uncertain T stage, insufficient MR images, artificial devices in the imaging field, severe artifacts, and no detectable tumor on MRI. Of the final 170 patients, mean age was 73.6 years (range 47 to 94), with 136 males and 34 females. The training and test groups showed no statistically significant differences in age, sex, histological grade, or muscle invasion status (all p-values > 0.05).

Multi-vendor MRI protocol: Images were obtained using scanners from three manufacturers: Siemens Healthineers (Skyra, Prisma, Avanto), Philips Healthcare (Achieva, Ingenia, Intera), and GE Healthcare (SIGNA EXCITE). Each examination included at minimum axial T2-weighted images, axial T1-weighted images, and axial diffusion-weighted images (DWI). The b-values applied ranged from b = 0 to b = 1000 s/mm2, with ADC maps automatically generated using a mono-exponential decay model. The matrix size of DWI and ADC maps ranged from 110-192 x 80-128 pixels, reflecting the heterogeneity of acquisition parameters across scanners and institutions.
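The mono-exponential decay model mentioned above relates the DWI signal at two b-values to the ADC: S_b = S_0 · exp(−b · ADC), so ADC = ln(S_0 / S_b) / b. A minimal NumPy sketch of that relationship follows; the function name `adc_map` and the two-b-value setup are illustrative assumptions, not the scanners' actual reconstruction code:

```python
import numpy as np

def adc_map(s_b0, s_bhigh, b=1000.0, eps=1e-6):
    """Mono-exponential model: S_b = S_0 * exp(-b * ADC), hence
    ADC = ln(S_0 / S_b) / b. `eps` guards against zero signal.
    Units are mm^2/s when b is given in s/mm^2."""
    s_b0 = np.asarray(s_b0, dtype=float)
    s_bhigh = np.asarray(s_bhigh, dtype=float)
    ratio = np.clip(s_b0, eps, None) / np.clip(s_bhigh, eps, None)
    return np.log(ratio) / b

# Synthetic voxel with a true ADC of 1.0e-3 mm^2/s
s0 = 1000.0
sb = s0 * np.exp(-1000.0 * 1.0e-3)
```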

Image annotation process: A board-certified radiologist with 12 years of urogenital radiology experience (Y.M.) manually segmented all bladder cancers on each axial DWI slice at b = 1000 s/mm2 using 3D Slicer software, referencing other MR sequences and pathological reports. A second board-certified radiologist (Y.K., also 12 years of experience) verified every ROI. These expert-drawn annotations served as the reference standard for evaluating segmentation accuracy.

Image preprocessing: All axial DWI images were resized to 128 x 128 pixels. Signal intensities were normalized using a formula that divides the difference between each pixel's intensity and the image mean by 12 times the standard deviation. This normalization step was essential for handling the intensity variability introduced by different scanner manufacturers, field strengths, and acquisition protocols.
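The normalization described above can be sketched in a few lines of NumPy, assuming the stated formula (pixel minus image mean, divided by 12 times the standard deviation); the function name `normalize_dwi` is hypothetical and the resizing step is handled separately:

```python
import numpy as np

def normalize_dwi(img, scale=12.0):
    """Normalize signal intensities as (x - mean) / (scale * SD),
    per the preprocessing described in the text (scale = 12).
    A uniform image (SD = 0) maps to all zeros."""
    img = np.asarray(img, dtype=float)
    sd = img.std()
    if sd == 0:
        return np.zeros_like(img)
    return (img - img.mean()) / (scale * sd)
```

Dividing by 12 SD (rather than 1 SD, as in a plain z-score) compresses the dynamic range, which helps when intensities vary widely across vendors and field strengths.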

TL;DR: The study enrolled 170 patients (140 training, 30 test) imaged on seven scanner models from three vendors. Two expert radiologists annotated every tumor slice as the reference standard. DWI images were resized to 128 x 128 pixels and intensity-normalized to account for cross-scanner variability.
Pages 3-4
Modified U-Net Architecture and Training Strategy

U-Net design: The authors built a modified U-Net that was deeper than the original architecture, using five encoding/decoding layers instead of the standard four. U-Net is a convolutional neural network originally developed for biomedical image segmentation, featuring a contracting path (encoder) that captures context and an expansive path (decoder) that enables precise localization. The additional layer in this modified version allows the network to learn higher-level abstract features from the input images, which can be particularly valuable when dealing with heterogeneous multi-vendor data.

Training configuration: The model was trained using the Adam optimizer with a Dice loss cost function. Key hyperparameters included 30 epochs, a batch size of 56, and an initial learning rate of 0.001. Training was performed with five-fold cross-validation, where 80% of the 140 training patients were used for training and 20% for validation in each fold. The model was built using TensorFlow 2.5.0 on a Linux workstation running Ubuntu 18.04 with an NVIDIA GeForce RTX 3090 GPU (24 GB memory).
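The paper's cost function is the Dice loss; a minimal NumPy sketch of the standard soft Dice loss follows (the actual training used TensorFlow 2.5, and the `smooth` constant here is a common convention, not a value reported by the authors):

```python
import numpy as np

def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - (2*|P.T| + smooth) / (|P| + |T| + smooth).
    `pred` holds probabilities in [0, 1]; `target` is a binary mask.
    The smooth term keeps the loss defined when both masks are empty."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + smooth) / (pred.sum() + target.sum() + smooth)
```

Minimizing this loss directly maximizes the Dice similarity coefficient used later to score the model, which is why it is a natural choice for segmentation tasks with small foreground regions.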

Multi-sequence input strategy: A critical design decision was the evaluation of different input data configurations. The authors tested four input types: single-sequence images (b0 images alone, b1000 images alone, or ADC maps alone) and multi-sequence images (b0 + b1000 + ADC maps fed into three channels simultaneously). For single-sequence inputs, the same image was duplicated across all three input channels. For the multi-sequence approach, each of the three MR sequences was assigned to its own channel, allowing the network to leverage complementary information from all three sources.
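The channel assembly described above can be sketched as follows; the helper name `make_input` is illustrative, not from the paper:

```python
import numpy as np

def make_input(b0, b1000=None, adc=None):
    """Assemble an (H, W, 3) network input.
    Multi-sequence: b0, b1000, and ADC each get their own channel.
    Single-sequence: the same image is duplicated across all three
    channels, as described in the text."""
    if b1000 is None and adc is None:
        return np.stack([b0, b0, b0], axis=-1)
    return np.stack([b0, b1000, adc], axis=-1)
```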

Ensemble method for testing: For final evaluation on the test dataset, the authors employed an ensemble model combining the five models generated during cross-validation. A voxel was classified as bladder cancer only if three or more of the five models predicted it as tumor (a majority-vote rule). This ensemble approach reduces false positives from any single model and produces more robust, consensus-based segmentation masks.
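The majority-vote rule is a one-liner in NumPy; this sketch assumes the five models' outputs have already been thresholded to binary masks:

```python
import numpy as np

def majority_vote(masks, threshold=3):
    """Combine binary masks from the cross-validation models:
    a voxel is labeled tumor only if at least `threshold` of the
    models (3 of 5 in the paper) predicted it as tumor."""
    votes = np.sum(np.asarray(masks, dtype=int), axis=0)
    return (votes >= threshold).astype(np.uint8)
```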

TL;DR: The modified U-Net used five layers (one deeper than the original) and was trained with Dice loss, Adam optimizer, and five-fold cross-validation on 140 patients. Multi-sequence input (b0 + b1000 + ADC) fed three channels simultaneously. For testing, an ensemble of the five cross-validation models used majority voting to produce the final segmentation.
Pages 4-5
Segmentation Performance: Multi-Sequence Input Outperforms Single-Sequence Alternatives

Input comparison during cross-validation: The impact of input image type on segmentation accuracy was substantial. Using b0 images alone yielded the lowest performance, with mean Dice similarity coefficients (DSC) of 0.69 for training and only 0.37 for validation, indicating severe overfitting to the training data. The b1000 images achieved DSCs of 0.79 (training) and 0.64 (validation), while ADC maps alone reached 0.78 (training) and 0.66 (validation). The multi-sequence input (b0 + b1000 + ADC maps) decisively outperformed all single-sequence inputs with mean DSCs of 0.83 for training and 0.79 for validation.

Test dataset performance: With multi-sequence images as the chosen input, the final ensemble model achieved a median DSC of 0.81 (interquartile range: 0.70 to 0.88) on the held-out test set of 30 patients. This is a strong result given the multi-vendor, multi-center nature of the data. Representative cases showed that a large tumor was segmented almost perfectly (DSC = 0.95), while a case with two spatially distant tumors was well-segmented with a DSC of 0.82, demonstrating the model's ability to handle multiple lesions within the same patient.

Comparison with existing architectures: The authors benchmarked their modified U-Net against five established segmentation networks using the same multi-sequence input data. Mean validation DSCs were: original U-Net = 0.75, attention U-Net = 0.74, UNet++ = 0.46, U2-Net = 0.69, and TransUNet = 0.75. All of these were lower than the modified U-Net's 0.79, confirming that the deeper five-layer architecture provided a meaningful performance advantage for this specific task. The poor performance of UNet++ (0.46) suggests that its nested skip connections may not be well-suited to the relatively small and heterogeneous bladder cancer imaging datasets.

Why multi-sequence works best: The superiority of multi-sequence input is clinically intuitive. The b0 images partially contain T2-weighted information (showing anatomical context), the b1000 images highlight areas of restricted diffusion (where tumors appear bright), and ADC maps provide quantitative diffusion values that help distinguish true restriction from T2 shine-through artifacts. Combining these three complementary data sources in a single model allows the CNN to exploit all available diagnostic information simultaneously.

TL;DR: The multi-sequence model (b0 + b1000 + ADC) achieved a mean validation DSC of 0.79 and a median test DSC of 0.81 (IQR: 0.70-0.88), outperforming all single-sequence inputs and five competing architectures including original U-Net (0.75), attention U-Net (0.74), and TransUNet (0.75). A large tumor was segmented at DSC = 0.95.
Pages 5-6
Reproducibility of Radiomics Features Extracted from Automatic Segmentations

Why reproducibility matters: For radiomics to be clinically useful, the quantitative features extracted from tumor ROIs must be consistent regardless of whether the ROI was drawn manually by an expert or generated automatically by a CNN. If automatic segmentation introduces too much variability in computed features, it could undermine any downstream prediction model built on those features. The authors evaluated this by comparing 107 radiomics features extracted from ADC maps using both manual and automatic segmentations, assessed via the intraclass correlation coefficient (ICC 2.1).
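ICC(2,1) is the two-way random-effects, absolute-agreement, single-measurement form from Shrout and Fleiss. A compact NumPy implementation follows; this is a textbook sketch, not the authors' code (they did not report which software computed their ICCs in this summary):

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single
    measurement. `x` has shape (n subjects, k raters); here k = 2
    (manual vs. automatic segmentation)."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_r = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_c = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_t = ((x - grand) ** 2).sum()
    ms_r = ss_r / (n - 1)
    ms_c = ss_c / (k - 1)
    ms_e = (ss_t - ss_r - ss_c) / ((n - 1) * (k - 1))  # residual
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
```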

Feature categories tested: The radiomics features were calculated using PyRadiomics 3.0.1 and fell into three broad groups: first-order features (n = 18), which capture intensity histogram statistics; shape-based features (n = 14), which describe tumor geometry; and high-order texture features (n = 75), which quantify spatial patterns including gray-level co-occurrence matrix (GLCM, n = 24), gray-level run length matrix (GLRLM, n = 16), gray-level size zone matrix (GLSZM, n = 16), neighboring gray tone difference matrix (NGTDM, n = 5), and gray-level dependence matrix (GLDM, n = 14).

ICC results by feature group: All feature groups demonstrated good median ICC values (range: 0.83 to 0.86). First-order features had a median ICC of 0.83 (IQR: 0.77-0.94). Shape-based features achieved a median ICC of 0.85 (IQR: 0.71-0.94). Among the high-order texture features, GLCM reached 0.85, GLRLM 0.84, GLSZM 0.85, NGTDM 0.85, and GLDM 0.86. Overall, 61 of 75 high-order features demonstrated good to excellent reproducibility (ICC 0.76 to 1.00), and all features with good to excellent ICC had p-values below 0.05.

Notable exceptions: A subset of features showed only moderate to poor reproducibility (ICC 0.39 to 0.73). These included kurtosis, mean absolute deviation, robust mean absolute deviation, and variance among first-order features, as well as elongation, flatness, major axis length, maximum 3D diameter, and sphericity among shape-based features. These features are known to be sensitive to small differences in segmentation boundaries, particularly for shape descriptors that depend on the precise outer contour of the ROI. This information is practically useful: researchers building radiomics models for bladder cancer should consider excluding these less stable features when using automated segmentation.

TL;DR: Of 107 radiomics features compared between manual and automatic segmentation, median ICC values ranged from 0.83 to 0.86 across all feature groups, with 61 of 75 high-order texture features showing good to excellent reproducibility (ICC 0.76-1.00). A small subset of shape and first-order features (e.g., kurtosis, sphericity, flatness) had lower reliability (ICC 0.39-0.73).
Pages 6-7
How This Model Compares to Prior Work and Why Generalizability Matters

Context of prior single-center studies: Several earlier studies reported automatic segmentation of bladder cancer with DSC values higher than 0.81, but all were conducted at single centers with small cohorts and single MR scanner types. The authors note that single-site machine learning studies carry a well-documented risk of overfitting, where the model learns scanner-specific artifacts and institution-specific patterns rather than genuinely tumor-relevant features. The result is degraded generalization performance when the model encounters images from a new site or scanner.

Advantage of multi-vendor data: This study was the first to perform automatic segmentation of bladder cancer using multi-vendor MR scanners at two institutions with the largest cohort to date (170 patients). Despite the greater heterogeneity in image quality, resolution, and acquisition parameters, the model still achieved a median test DSC of 0.81 and outperformed five established segmentation architectures. This demonstrates that a well-designed modified U-Net can learn generalizable tumor features even from diverse data sources, a necessary prerequisite for any model intended for real-world clinical deployment.

Comparison with cervical cancer studies: The authors contextualize their radiomics reproducibility results against studies in other pelvic cancers. A prior study on uterine cervical cancer using a similar approach reported poor reliability for most features except first-order features. A study on endometrial cancer found good to excellent reliability for many features but poor reliability for GLCM, GLRLM, and NGTDM categories. In contrast, the bladder cancer model presented here yielded better reliability across all feature categories, with median ICC values consistently above 0.83. This suggests that the model's segmentation quality is high enough to support downstream radiomics analysis.

Clinical implications: The combination of accurate automatic segmentation and reliable radiomics feature extraction has direct practical value. It means that large-scale radiomics studies on bladder cancer, such as those predicting muscle invasion or treatment response, can use automatically generated ROIs rather than requiring radiologists to manually segment every tumor. This could dramatically reduce the time and cost of radiomics research and accelerate translation toward clinical decision-support tools.

TL;DR: Unlike previous single-center studies, this multi-vendor, two-center model achieved DSC 0.81 despite greater image heterogeneity, and outperformed five established architectures. Radiomics reproducibility exceeded that reported in comparable cervical and endometrial cancer studies, supporting the use of automatic segmentation for large-scale bladder cancer radiomics research.
Pages 7-8
Study Limitations and What Needs to Happen Next

DWI-only input: The model was trained exclusively on diffusion-weighted images and ADC maps, without incorporating T2-weighted images or dynamic contrast-enhanced T1-weighted images. The authors chose this approach deliberately because bladder deformation caused by fluctuating urine volume during scanning creates misregistration between sequences. While deformable image registration could theoretically align T2WI with DWI, misregistration errors might actually degrade segmentation accuracy. The authors note that b0 images partially contain T2-weighted information, which may compensate in part for the absence of dedicated T2WI input.

Mixed institutional data without external validation: Cases from both institutions were pooled and randomly divided into training and test sets rather than using one institution's data for training and the other's for external validation. This means the model has not been tested in a fully independent external cohort. True external validation with bladder cancer cases from entirely new institutions and scanners would provide stronger evidence of generalizability and is explicitly called for by the authors.

Small test dataset: The test set contained only 30 patients, which limits the statistical power for evaluating both DSC and ICC values. With such a small sample, the interquartile range of DSC (0.70 to 0.88) is relatively wide, meaning there is considerable case-to-case variability that a larger test cohort might better characterize. Similarly, ICC estimates based on 30 subjects have wider confidence intervals than would be ideal for establishing reproducibility benchmarks.

Path forward: Future work should incorporate multi-institutional external validation, larger patient cohorts, and potentially the inclusion of additional MRI sequences with robust deformable registration. Integrating this automatic segmentation pipeline with downstream radiomics-based prediction models for muscle invasion staging or treatment response could create an end-to-end clinical tool. The demonstrated reproducibility of most radiomics features suggests that such an automated pipeline would yield results comparable to those obtained with labor-intensive manual segmentation.

TL;DR: Key limitations include DWI-only input (no T2WI due to bladder deformation risk), lack of true external validation (both centers' data were mixed), and a small test set of 30 patients. Future studies should pursue multi-institutional external validation, larger cohorts, and integration with radiomics-based prediction models for a fully automated clinical workflow.
Citation: Moribata Y, Kurata Y, Nishio M, et al. Scientific Reports, 2023. Open access (CC BY). DOI: 10.1038/s41598-023-27883-y. PMC: PMC9837183.