Clinical motivation: Urinary bladder cancer ranks among the ten most common cancers worldwide and is characterized by high metastatic potential and a high recurrence rate. Accurate and timely diagnosis is critical for successful treatment, yet the standard diagnostic approach, cystoscopy, has known limitations, including difficulty distinguishing carcinoma in situ from scarring or inflammatory changes. Computed tomography (CT) offers a less invasive alternative: modern multi-detector scanners with 64 to 320 detector rows can generate images in the frontal, axial, and sagittal planes during a single breath-hold.
The semantic segmentation task: This study by Baressi Segota et al. proposes a transfer learning approach to automatically segment (identify and outline) urinary bladder cancer masses from CT images. Semantic segmentation differs from simple classification in that instead of labeling an entire image as "cancer" or "no cancer," the algorithm assigns a label to every single pixel, producing an output mask that highlights precisely where the malignant tissue is located. The dataset was collected from the Clinical Hospital Center of Rijeka and consists of 10,402 CT images across three planes: 4,413 frontal, 4,993 axial, and 996 sagittal images, all from patients with confirmed bladder cancer.
System architecture: The researchers designed a two-stage pipeline. First, an AlexNet classifier automatically determines which anatomical plane a given CT image was captured in. Then, depending on the identified plane, the image is routed to one of three specialized U-Net architectures, each optimized for segmenting bladder cancer in that particular plane. This plane-specific design recognizes that anatomical structures appear differently depending on the viewing angle, and training separate models for each plane could yield better segmentation performance than a single one-size-fits-all model.
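The routing logic of this two-stage pipeline is simple to sketch. The Python fragment below uses trivial stand-in models (the names and stubs are illustrative, not the authors' code) to show how a plane classifier selects one of three plane-specific segmenters:

```python
# Two-stage routing sketch: a plane classifier picks which of three
# plane-specific U-Nets segments the image. Models are hypothetical stubs.
PLANES = ("frontal", "axial", "sagittal")

def route(image, plane_classifier, unets):
    """Classify the anatomical plane, then segment with that plane's U-Net."""
    plane = PLANES[plane_classifier(image)]  # classifier returns 0, 1, or 2
    return plane, unets[plane](image)

# Usage with trivial stand-ins:
fake_classifier = lambda img: 1  # always predicts "axial"
fake_unets = {p: (lambda img, p=p: f"{p}-mask") for p in PLANES}
plane, mask = route("ct_image", fake_classifier, fake_unets)
# plane == "axial", mask == "axial-mask"
```

In a real deployment the stubs would be the trained AlexNet and the three trained U-Nets; the control flow is exactly this dictionary dispatch.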
Annotation quality: All image annotations (ground truth masks) were created by a specialist urologist and validated against additional medical procedures including cystoscopy. Three independent urologists evaluated the medical findings, achieving a Fleiss' kappa coefficient of 0.83, indicating a high degree of inter-observer agreement. This rigorous annotation process provides confidence that the ground truth labels used to train and evaluate the deep learning models are clinically reliable.
U-Net fundamentals: The core segmentation engine in this study is the U-Net architecture, a fully convolutional network originally designed for biomedical image segmentation. Unlike standard CNNs that output a single class label, U-Net produces a pixel-level output mask the same size as the input image. Its architecture is divided into two halves: a contractive (encoder) path that progressively downsamples the image to capture high-level features, and an expansive (decoder) path that upsamples those features back to the original image resolution to produce a detailed segmentation map.
Skip connections: A defining feature of U-Net is its skip connections, which concatenate feature maps from the contractive path directly to corresponding layers in the expansive path. During downsampling, fine-grained spatial details (such as the exact boundaries of a tumor) are progressively lost. The skip connections recover this information by feeding high-resolution features from earlier layers into the decoder, enabling the network to produce precise pixel-level segmentation boundaries. This is particularly important for bladder cancer, where the tumor margins against surrounding tissue can be subtle in CT images.
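The effect of a skip connection can be illustrated without a full network. The NumPy toy below (an illustration of the idea, not the U-Net implementation) shows how 2x2 max pooling discards fine spatial detail and how concatenating the pre-pooling feature map back onto the upsampled map restores it:

```python
import numpy as np

# Toy skip connection: detail lost to 2x2 max pooling is re-injected by
# concatenating the pre-pooling map onto the (coarse) upsampled map.
def max_pool2(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2(x):
    return x.repeat(2, axis=0).repeat(2, axis=1)

encoder_features = np.arange(16, dtype=float).reshape(4, 4)  # fine detail
pooled = max_pool2(encoder_features)       # 2x2 map; detail lost
decoded = upsample2(pooled)                # back to 4x4, but coarse
skip = np.stack([decoded, encoder_features], axis=-1)  # concat "channels"
# skip[..., 1] carries the exact pre-pooling values the decoder alone lacks
```

In the actual U-Net the concatenated encoder features are further processed by convolutions, but the mechanism, channel-wise concatenation of high-resolution encoder maps into the decoder, is the same.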
Parallel U-Net design: Because CT urography captures images in three anatomical planes, the researchers implemented three parallel U-Net models, one dedicated to each plane. The frontal plane U-Net processes the 4,413 frontal images, the axial plane U-Net handles the 4,993 axial images, and the sagittal plane U-Net works on the 996 sagittal images. This design decision is motivated by the observation that anatomical structures appear differently in each orientation, and specialized models can learn plane-specific features more effectively than a single generalized model.
Output interpretation: The output of each U-Net is a binary segmentation mask where pixels corresponding to bladder cancer regions are highlighted. This mask can then be overlaid on the original CT image to visually indicate the extent and location of the malignant mass. Such annotated images support clinicians in evaluating bladder cancer spread and can facilitate more standardized assessments compared to purely manual interpretation, which varies between radiologists.
The transfer learning concept: Transfer learning is a machine learning paradigm in which a model trained on one task (the source domain) is repurposed for a different but related task (the target domain). In this study, the source domain is the ImageNet dataset, a massive collection of over 14 million natural images across 1,000 categories, and the target domain is bladder cancer segmentation from CT scans. The underlying hypothesis is that low-level visual features learned from natural images (edges, textures, shapes) are transferable to medical imaging tasks, providing a better initialization than training from scratch with random weights.
How it works in practice: The contractive (encoder) portion of each U-Net is replaced with the convolutional layers of a pretrained CNN architecture. The fully connected classification layers at the end of the pretrained network are removed since semantic segmentation requires a fully convolutional configuration. During training, the pretrained encoder layers are frozen (their parameters remain fixed), while only the expansive (decoder) layers are updated through backpropagation. This approach substantially reduces the number of trainable parameters and helps prevent overfitting, which is particularly valuable when the medical image dataset is limited in size.
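A minimal sketch of this frozen-encoder construction, assuming Keras (the framework is an assumption of this sketch) with VGG-16 as the backbone; `weights=None` keeps the example self-contained, whereas actual transfer learning would load `weights="imagenet"`:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sketch: U-Net-style decoder on a frozen VGG16 encoder (shapes and layer
# counts are illustrative, not the paper's exact configuration).
encoder = tf.keras.applications.VGG16(include_top=False, weights=None,
                                      input_shape=(256, 256, 3))
encoder.trainable = False  # freeze: only the decoder is trained

x = encoder.output  # (8, 8, 512) after five poolings
for filters in (256, 128, 64, 32, 16):
    x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same",
                               activation="relu")(x)  # 8 -> 256 in 5 steps
mask = layers.Conv2D(1, 1, activation="sigmoid")(x)   # per-pixel probability

model = tf.keras.Model(encoder.input, mask)
```

Note that `include_top=False` is exactly the removal of the fully connected classification layers described above, and freezing the encoder means backpropagation updates only the `Conv2DTranspose` decoder stack. (A full U-Net would also add the skip connections from intermediate encoder layers, omitted here for brevity.)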
Backbone architectures tested: The researchers evaluated six pretrained backbone architectures: VGG-16 (16 layers with small 3x3 kernels), InceptionV3 (parallel convolutions with multiple kernel sizes), ResNet50, ResNet101, and ResNet152 (residual networks with identity shortcut connections to combat vanishing gradients), and Inception-ResNet (a hybrid combining Inception modules with residual connections). Each backbone brings different strengths: VGG-16 offers simplicity, Inception provides multi-scale feature extraction, and the ResNet variants enable very deep architectures without gradient degradation.
Comparison to baseline: In every plane, the transfer learning approach yielded substantially higher Dice coefficients than the standard U-Net trained from scratch. For example, in the frontal plane, the baseline U-Net achieved a DSC of only 0.7846, while the best transfer learning model (ResNet101 backbone) reached 0.9587, an improvement of over 17 percentage points. This demonstrates that ImageNet-pretrained features provide a powerful foundation even when the target domain (grayscale medical CT) is visually quite different from natural photographs.
Why plane recognition is needed: Since the system uses three separate U-Net models (one per anatomical plane), it must first determine which plane a given CT image belongs to before routing it to the correct segmentation network. Rather than requiring manual labeling by a radiologist, the researchers automated this step using AlexNet, a relatively compact CNN with 5 convolutional layers and 3 fully connected layers. AlexNet was chosen because plane recognition is a straightforward three-class classification problem that does not require the complexity of deeper architectures.
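A compact Keras sketch of such a classifier with the stated 5 convolutional and 3 fully connected layers; the specific filter counts and kernel sizes follow the classic AlexNet and are assumptions, since the summary gives only the layer counts:

```python
import tensorflow as tf
from tensorflow.keras import layers

# AlexNet-style classifier (5 conv + 3 dense) for the three-plane problem.
# Layer sizes follow the original AlexNet and are assumptions of this sketch.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(227, 227, 1)),
    layers.Conv2D(96, 11, strides=4, activation="relu"),
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(256, 5, padding="same", activation="relu"),
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(384, 3, padding="same", activation="relu"),
    layers.Conv2D(384, 3, padding="same", activation="relu"),
    layers.Conv2D(256, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(3, strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dense(4096, activation="relu"),
    layers.Dense(3, activation="softmax"),  # frontal / axial / sagittal
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```

The three-way softmax output is what makes this a lightweight problem: the network only has to separate three visually distinct orientations, not localize anything.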
Training configuration: The AlexNet classifier was optimized using a grid search over hyperparameters: six optimizers (Adam, AdaMax, Adagrad, AdaDelta, RMSprop, Nadam), five batch sizes (1, 2, 4, 8, 16), and seven epoch counts. The best configuration used the RMSprop optimizer with a batch size of 16, trained for just 10 epochs. Interestingly, training for more epochs led to significantly worse performance, a clear sign of overfitting, while training for only 1 epoch was also insufficient.
Classification performance: The optimal AlexNet configuration achieved an AUC_micro of 0.9999 with a standard deviation of only 0.0006 across the five-fold cross-validation procedure. This near-perfect classification performance means the plane recognition step introduces virtually zero error into the overall pipeline. The system can reliably distinguish frontal, axial, and sagittal CT images, ensuring that each image is processed by the correct plane-specific U-Net.
Practical significance: The high performance of the plane classifier means the entire two-stage system can operate autonomously: a new CT image is first classified by AlexNet, then automatically routed to the appropriate U-Net for segmentation, with no human intervention required at any step. This end-to-end automation is essential for potential clinical deployment, where the system would need to process large volumes of CT images efficiently without requiring radiologists to manually sort images by plane.
Dice Similarity Coefficient (DSC): The primary evaluation metric is the Dice coefficient, defined as DSC = 2|X ∩ Y| / (|X| + |Y|), where X is the ground truth mask and Y is the predicted mask. DSC ranges from 0 (no overlap) to 1 (perfect overlap) and is the standard metric in medical image segmentation because it balances precision and recall while being sensitive to both shape and positional accuracy. A DSC of 0.95, for example, means that twice the overlapping area accounts for 95% of the combined area of the two masks, leaving little room for either false positives or false negatives.
Intersection over Union (IoU): The secondary metric is IoU, calculated as the intersection of the predicted and ground truth masks divided by their union. IoU is more strict than DSC because the union in the denominator penalizes any non-overlapping pixels more heavily. Both metrics are reported to provide a complete picture of segmentation quality. For a given DSC value, the corresponding IoU is always lower (e.g., DSC of 0.9587 corresponds to IoU of 0.9438 for the frontal plane).
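Both metrics are a few lines of NumPy on binary masks (the tiny masks below are illustrative). They are linked by the fixed per-mask relation IoU = DSC / (2 − DSC), which is why IoU is always the lower of the two for imperfect overlap:

```python
import numpy as np

# DSC and IoU on binary masks, matching the definitions above.
def dice(x, y):
    inter = np.logical_and(x, y).sum()
    return 2.0 * inter / (x.sum() + y.sum())

def iou(x, y):
    inter = np.logical_and(x, y).sum()
    return inter / np.logical_or(x, y).sum()

truth = np.array([[1, 1, 0, 0]], dtype=bool)
pred  = np.array([[1, 0, 0, 0]], dtype=bool)
# dice = 2*1/(2+1) = 0.666..., iou = 1/2 = 0.5
# and indeed iou == dice / (2 - dice)
```

One caveat: the paper reports fold-averaged DSC and IoU, and averages need not satisfy the per-mask relation exactly, which is why the quoted pairs (e.g. 0.9587/0.9438) do not line up with the formula.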
Five-fold cross-validation: To assess not just segmentation accuracy but also generalization performance, the researchers employed five-fold cross-validation. The dataset for each plane is divided into five equal folds; in each iteration, four folds are used for training and one for testing, rotating until all folds have served as the test set. The mean DSC across all five folds measures segmentation performance, while the standard deviation of DSC (denoted sigma(DSC)) measures generalization. A low sigma(DSC) indicates consistent performance across different data splits, suggesting the model will generalize well to unseen patients.
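The procedure can be sketched as follows; `evaluate_fold` is a hypothetical stand-in for training a U-Net on four folds and scoring DSC on the held-out fold:

```python
import numpy as np

# Five-fold CV as described: mean score measures segmentation accuracy,
# the standard deviation across folds measures generalization stability.
def cross_validate(n_images, evaluate_fold, k=5):
    idx = np.arange(n_images)
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(evaluate_fold(train, test))
    return np.mean(scores), np.std(scores)
```

A real run would shuffle the indices and pass a function that actually trains and evaluates a model; the returned pair is exactly the (DSC, sigma(DSC)) the paper reports per configuration.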
Multi-objective model selection: The researchers used a multi-objective ranking to select the best model for each plane. Models were first sorted by descending DSC; in case of ties, the model with lower sigma(DSC) was preferred. This ensures that the selected model achieves both high segmentation accuracy and stable generalization. The extensive grid search across 6 optimizers, 5 batch sizes, and 7 epoch counts (210 configurations per backbone per plane) ensures thorough exploration of the hyperparameter space.
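The ranking rule reduces to a single sort with a composite key; the records below are illustrative, not values from the paper:

```python
# Multi-objective selection: sort by descending mean DSC, breaking ties
# with ascending sigma(DSC). Example records are made up for illustration.
results = [
    {"name": "A", "dsc": 0.95, "sigma": 0.010},
    {"name": "B", "dsc": 0.95, "sigma": 0.004},  # ties A on DSC, more stable
    {"name": "C", "dsc": 0.93, "sigma": 0.002},
]
best = sorted(results, key=lambda r: (-r["dsc"], r["sigma"]))[0]
# best["name"] == "B": equal accuracy to A, but better generalization
```

The negated DSC in the sort key is what makes the primary criterion "higher is better" while keeping the tie-breaker "lower is better".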
Frontal plane (4,413 images): The best segmentation in the frontal plane was achieved using ResNet101 as the U-Net backbone, trained with the Nadam optimizer for 50 epochs at batch size 2. This configuration reached a DSC of 0.9587 and IoU of 0.9438, with a sigma(DSC) of just 0.0059, indicating excellent generalization. The baseline U-Net without transfer learning only managed DSC of 0.7846. Among all backbones, ResNet50 showed the lowest sigma(DSC) at 0.0019 but had a slightly lower DSC of 0.9314, while Inception-ResNet had the poorest generalization (sigma(DSC) = 0.1212) despite a moderate DSC of 0.8991.
Axial plane (4,993 images): For the axial plane, ResNet50 delivered the best results with DSC of 0.9372 and IoU of 0.9372, using the Adam optimizer for 150 epochs at batch size 4. The generalization performance showed sigma(DSC) of 0.0147. The baseline U-Net reached only DSC of 0.8347. InceptionV3 performed well with DSC of 0.9147 and notably low sigma(DSC) of 0.0051, suggesting it could be preferred if generalization stability is prioritized. VGG-16 showed poor generalization with sigma(DSC) of 0.2456 despite a reasonable DSC of 0.8804.
Sagittal plane (996 images): The sagittal plane achieved the highest raw DSC of 0.9660 using VGG-16 as the backbone, trained with Adam for 200 epochs at batch size 2, with IoU of 0.9482. However, the generalization was notably weaker, with sigma(DSC) of 0.0486, the highest among the three planes. The baseline U-Net managed DSC of 0.8639. The weaker generalization is directly attributable to the smaller dataset size: only 996 images compared to 4,413 and 4,993 for the other planes, resulting in fewer training samples per fold during cross-validation.
Backbone performance patterns: No single backbone dominated across all three planes. ResNet101 excelled in the frontal plane, ResNet50 in the axial plane, and VGG-16 in the sagittal plane. This variation reflects the different characteristics of each plane's images and dataset size. The deeper ResNet architectures (101, 152) did not consistently outperform shallower ones, suggesting that depth alone is not the determining factor. Transfer learning consistently and substantially outperformed isolated training, with DSC improvements ranging from 10 to 17 percentage points depending on the plane.
Segmentation vs. generalization trade-off: A central finding of the discussion is the tension between raw segmentation performance and generalization stability. The sagittal plane achieved the highest DSC (0.9660) but the worst sigma(DSC) (0.0486), while the frontal plane had a slightly lower DSC (0.9587) but far better sigma(DSC) (0.0059). This pattern reveals that a high average score can mask inconsistent fold-to-fold performance. For clinical deployment, a model that performs reliably across different patient subsets (low sigma) may be more trustworthy than one with a slightly higher but more variable average score.
Dataset size as the limiting factor: The sagittal plane's weaker generalization is directly explained by its smaller dataset: only 996 images compared to 4,413 (frontal) and 4,993 (axial). With five-fold cross-validation, each fold in the sagittal set contains roughly 199 test images and 797 training images, far fewer than the approximately 883/3,530 and 999/3,994 splits for frontal and axial planes respectively. Fewer training examples per fold increase the variance in learned representations, leading to larger fluctuations in DSC across folds. The authors explicitly recommend collecting more sagittal images before deploying the system clinically.
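The per-fold arithmetic quoted above is easy to reproduce (rounding the held-out fold to the nearest whole image):

```python
# Approximate (train, test) images per fold under five-fold CV,
# reproducing the split sizes quoted above.
splits = {}
for plane, n in [("frontal", 4413), ("axial", 4993), ("sagittal", 996)]:
    test = round(n / 5)          # one fold held out
    splits[plane] = (n - test, test)
# splits == {"frontal": (3530, 883), "axial": (3994, 999),
#            "sagittal": (797, 199)}
```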
Overfitting observations: The grid search results revealed a consistent overfitting pattern across planes and backbone architectures: performance typically peaked at an intermediate number of training epochs and then degraded with further training. The AlexNet plane classifier, for instance, peaked at just 10 epochs, while the VGG-16-based sagittal U-Net was the exception, peaking at 200 epochs, the longest run tested. These findings underscore the importance of early stopping or careful epoch selection, especially when working with moderately sized medical imaging datasets.
IoU trends mirror DSC: The IoU analysis followed similar patterns to DSC, with the frontal plane showing the most stable behavior and the sagittal plane the least. The frontal plane's ResNet101 model achieved IoU of 0.9438 with sigma(IoU) of 0.0079, while the sagittal VGG-16 model reached IoU of 0.9482 with sigma(IoU) of 0.0398. The consistency between DSC and IoU metrics strengthens confidence in the results, as both metrics independently confirm the same ranking of models and the same generalization patterns across planes.
Transfer learning validated for bladder cancer segmentation: The study conclusively demonstrates that using pretrained CNN backbones as U-Net encoders significantly improves semantic segmentation of urinary bladder cancer from CT images. Across all three anatomical planes, transfer learning models outperformed standard U-Nets by substantial margins. The best models achieved DSC values of 0.9587 (frontal, ResNet101), 0.9372 (axial, ResNet50), and 0.9660 (sagittal, VGG-16). These results fall within the range of state-of-the-art performance for medical image segmentation tasks.
Plane-specific optimization works: The four research questions posed at the outset were all answered affirmatively. It is possible to design separate segmentation systems for each plane. Automated plane recognition is feasible with near-perfect accuracy (AUC_micro 0.9999). Transfer learning substantially improves both segmentation and generalization performance. And the optimal backbone varies by plane, with no single architecture universally dominating. This suggests that for multi-planar medical imaging applications, independent optimization per viewing angle is a worthwhile strategy.
Clinical applicability: The high segmentation and generalization performances, particularly for the frontal and axial planes, suggest the system could serve as a clinical decision support tool. By automatically delineating cancer regions on CT images, the system could assist radiologists and urologists in evaluating bladder cancer spread, potentially reducing interpretation time and inter-reader variability. The automated pipeline from plane classification through segmentation requires no manual intervention, making it suitable for integration into clinical workflows where efficiency is critical.
Remaining challenges: Before clinical deployment, several issues need to be addressed. The sagittal plane model requires more training data to achieve acceptable generalization. The study uses images from a single institution (Clinical Hospital Center of Rijeka), so external validation on multi-center data is necessary to confirm generalizability to different CT scanners, imaging protocols, and patient populations. Additionally, the frozen encoder strategy, while effective for preventing overfitting, may limit the model's ability to adapt pretrained features specifically to medical imaging. Fine-tuning selected encoder layers could potentially improve results further.
Future directions: The authors suggest that this transfer learning-based segmentation system, combined with the automated plane recognition component, opens the possibility for routine clinical utilization. Expanding the sagittal dataset, incorporating multi-institutional data, and exploring fine-tuning strategies for the pretrained backbones are natural next steps. The framework could also potentially be extended to other cancers visible on CT imaging or to three-dimensional volumetric segmentation by combining predictions across all three planes.