Deep Learning for Urine Cytopathology Prediction

Pathology and Oncology Research, 2022

Plain-English Explanations
Pages 1-2
Why Link Urine Cytology to Histopathology, and What This Study Sets Out to Do

Clinical motivation: Urothelial carcinoma (UC) is among the most common cancers worldwide and tends to be multifocal and recurrent, demanding lifelong screening and surveillance. The definitive diagnosis of UC relies on histopathological assessment of tissue obtained through cystoscopy, an invasive, expensive procedure that is not easily accessible in all clinical settings. Urine cytology offers an effective, inexpensive, and non-invasive alternative, but there is currently no established gold standard for correlating cytological findings with histopathological outcomes (cyto-histo correlation) in urine specimens.

Prior work in automated cytology: Three previous studies had already demonstrated that deep learning systems (DLSs) could automate urine cytology diagnosis. Vaickus et al. achieved over 90% accuracy with a hybrid deep-learning and morphometric algorithm. Sanghvi et al. reached 79.5% sensitivity and 84.5% specificity using a pure neural network on 2,405 ThinPrep slides. Most notably, Nojima et al. showed that a 16-layer VGG CNN could not only detect UC cells but also determine whether lesions were invasive and whether they were high grade, achieving AUCs above 0.86 and F1 scores above 0.82 for both tasks. However, that model required retraining with histopathological data from the eighth layer onward.

Study hypothesis: Liu et al. hypothesized that routine urine cytology images contain sufficient morphological information to predict the presence of malignant tissue in the urinary tract. The rationale is that malignant tissues undergo constant exfoliation, shedding tumor cells whose morphology is influenced by the underlying tissue pathology. The key innovation is that the DLS was trained only on cytology data to detect cancer cells, then applied directly to predict histopathology results without any additional training on histopathological specimens.

Study scope: The research was conducted at Peking University First Hospital (a collaboration between the Department of Urology and the School of Cyber Science and Technology at Beihang University). Data were collected retrospectively from consecutive patients examined or treated between September 2014 and January 2020. The system was designed to both assist pathologists in cytology reading and provide novel histopathologic insights for urologists planning therapeutic strategies.

TL;DR: This study from Peking University developed a deep learning system trained solely on urine cytology images to predict histopathological malignancy, without requiring any histopathology training data. Prior DLS approaches needed retraining with tissue-level data, making this a more streamlined and explainable design for bridging the gap between cytology and histopathology.
Pages 2-3
Patient Cohort, Data Acquisition, and Dataset Structure

Patient population: The study retrieved archival glass slides of hematoxylin and eosin-stained urine cytocentrifugation cytology from consecutive patients at Peking University First Hospital between 2014 and 2020. Urine cytology was diagnosed using Papanicolaou's classification, where classes III, IV, and V were defined as positive (suspicious or malignant) and classes I and II (including atypical) were defined as negative. The cohort included 441 positive cases and 395 negative cases for a total of 836 patients.

Surgical follow-up: Among the 441 positive cytology cases, 211 underwent surgery within the following year, all of which were confirmed as UC on histological examination. Among the 395 negative cytology cases, all received surgery: 333 were ultimately diagnosed with UC on histopathology (contradicting the cytology), and 62 were confirmed benign. A blinded pathologist review of the 333 contradicted cases found that 63 actually had cancer cells in their cytology images that had been initially overlooked, underscoring the difficulty of manual cytology interpretation.

Image preparation: From the original slides, 1,280 x 960-pixel JPEG images were exported: 466 images from positive cases and 417 images from negative cases. The training-validation set and preliminary test set were drawn from positive cytology cases with malignant histology, allocated at an 8:1 ratio. The internal test set comprised positive cytology cases with matched surgical results. The extra test set consisted of the 333 negative-cytology cases that had malignant histopathological outcomes, representing the most diagnostically challenging subset.

Annotation protocol: A certified urological pathologist annotated malignant cells with marked atypia using the open-source software LabelMe. The training and validation set encompassed 387 positive cases with 2,668 labeled cells across 411 images. These images were subdivided into 175 x 200-pixel panel subimages, yielding 1,953 subimages each containing at least one labeled cell. The training and validation sets were further split at a 5:1 ratio for early stopping during network training.
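The panel-extraction step above can be illustrated with a simple grid-tiling sketch. The exact tiling scheme is not specified in the paper; the non-overlapping grid, the function names, and the example cell coordinates below are all hypothetical:

```python
def make_panels(image_w, image_h, panel_w=175, panel_h=200):
    """Enumerate top-left corners of a non-overlapping panel grid."""
    return [(x, y)
            for y in range(0, image_h - panel_h + 1, panel_h)
            for x in range(0, image_w - panel_w + 1, panel_w)]

def panels_with_cells(panels, cells, panel_w=175, panel_h=200):
    """Keep only panels that contain at least one labeled cell centroid."""
    def contains(panel, cell):
        px, py = panel
        cx, cy = cell
        return px <= cx < px + panel_w and py <= cy < py + panel_h
    return [p for p in panels if any(contains(p, c) for c in cells)]

# Example: one 1,280 x 960 exported image with two annotated cell centroids
grid = make_panels(1280, 960)
kept = panels_with_cells(grid, [(100, 150), (600, 700)])
```

Only the `kept` panels would enter training, matching the paper's description that every retained subimage contains at least one labeled cell.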

TL;DR: The study included 836 patients (441 cytology-positive, 395 cytology-negative) with matched surgical outcomes. A pathologist annotated 2,668 cancer cells across 411 images. The extra test set of 333 cases with negative cytology but malignant histology represented the hardest diagnostic challenge, with 63 of those cases having cancer cells missed during initial cytology reading.
Pages 3-4
ResNet101 Faster R-CNN: Architecture and Training Strategy

Core architecture: The deep learning system was built on a ResNet101 Faster R-CNN (Faster Region-based Convolutional Neural Network). ResNet101 is a 101-layer Residual Network originally proposed at the 2016 IEEE Conference on Computer Vision and Pattern Recognition by He et al. Its residual connections allow the network to train much deeper architectures without degradation, and it has shown high performance in domains including skin lesion detection and brain disease detection on MRI. The Faster R-CNN component combines object detection and classification into a single network, extracting features, making detections through those features, and assigning a confidence score (ranging from 0 to 100) to each detection.

Transfer learning and training details: The ResNet101 backbone was pretrained on the ImageNet database (1.2 million training images across 1,000 object classes). These pretrained weights initialized all convolutional layers, which were then fine-tuned on the cytology images. The images passed through 33 convolution blocks followed by 1 dense layer, with the SoftMax function as the activation function. The model was implemented in Python 3.8 using TensorFlow 1.12.0 and Keras 2.0.3.

Training protocol: Spatial augmentation was applied during training, including 90-degree rotation and vertical and horizontal flips. The maximum epoch was set at 80, with early stopping triggered if validation loss did not improve after 15 consecutive epochs. Both total loss and system accuracy stabilized after 45 to 50 epochs, and the final model was selected at epoch 48 when validation loss reached its minimum of 1.6 and classification accuracy hit its peak of 0.77.
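The early-stopping rule described above can be sketched generically. The loop below is a stand-in, not the study's code; `next_val_loss` is a hypothetical callable, and the toy loss curve is constructed only to mimic the reported minimum of 1.6 at epoch 48:

```python
def train_with_early_stopping(next_val_loss, max_epochs=80, patience=15):
    """Stop when validation loss has not improved for `patience` epochs.

    `next_val_loss(epoch)` returns that epoch's validation loss;
    returns (best_epoch, best_loss) for model selection.
    """
    best_loss = float("inf")
    best_epoch = 0
    stale = 0
    for epoch in range(1, max_epochs + 1):
        loss = next_val_loss(epoch)
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_loss

# Toy loss curve that bottoms out at epoch 48, mimicking the reported run
toy_curve = lambda e: 1.6 + abs(e - 48) * 0.01
best_epoch, best_loss = train_with_early_stopping(toy_curve)
```

With the toy curve, the loop stops 15 epochs after the minimum, and the selected checkpoint is epoch 48, mirroring the study's final model choice.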

Two-stage design: The system operated in two stages. In the first stage, the Faster R-CNN was trained exclusively to detect cancer cells in cytology images. In the second stage, an additional classifier was appended to the end of the initial DLS without any retraining of the convolutional layers. This classifier selected the highest confidence score among all detected cells in an image and applied a threshold to make a binary classification (benign or malignant). This design ensured that both the cell detection and malignancy prediction tasks shared the exact same set of convolutional features, making the prediction directly traceable to specific cells identified by the model.
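In code, the appended classifier amounts to a one-line reduction over the detector's output. This is a minimal sketch: representing the stage-1 output as `(bounding_box, confidence)` pairs is an assumption, not the paper's data structure:

```python
def predict_malignancy(detections, threshold=55):
    """Stage 2: threshold the single highest cell-detection confidence.

    `detections` is the stage-1 Faster R-CNN output for one image, as
    (bounding_box, confidence 0-100) pairs. No new features are learned;
    the cell-level confidences are reused directly.
    """
    if not detections:
        return "benign"  # no candidate cancer cells detected at all
    top_score = max(conf for _box, conf in detections)
    return "malignant" if top_score >= threshold else "benign"

# Hypothetical image with three detected cells
cells = [((10, 20, 60, 80), 31.0),
         ((200, 40, 250, 95), 58.5),
         ((400, 300, 460, 370), 12.2)]
label = predict_malignancy(cells)  # highest score 58.5 >= 55 -> "malignant"
```

Because the prediction is literally the maximum cell confidence, a pathologist can trace any "malignant" call back to the single highest-scoring cell the detector flagged.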

TL;DR: The system used a ResNet101 Faster R-CNN pretrained on ImageNet and fine-tuned on cytology images. Training converged at epoch 48 with 0.77 validation accuracy. The two-stage design first detected cancer cells, then predicted histopathological malignancy by thresholding the maximum confidence score, all without retraining on histopathology data.
Pages 4-5
Performance of the DLS for Detecting Urothelial Carcinoma Cells

Threshold optimization: The DLS assigned a confidence value (from 0 to 100) to each detected cell, and this value served as the threshold for cell detection. As the threshold decreased, sensitivities increased but at the cost of more benign cells being mistakenly flagged as malignant. The accuracy initially rose with increasing thresholds, with the rate of improvement slowing around the 50 to 55 range. Based on this analysis across the preliminary and internal test sets, 55 was selected as the optimal threshold.
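The sensitivity-versus-false-positive trade-off the authors describe can be explored with a simple threshold sweep. This is illustrative only: the per-image score lists and the split into matched/unmatched detections are invented for the example:

```python
def sweep_thresholds(images, thresholds=range(0, 101, 5)):
    """For each threshold, compute cell-level sensitivity and the average
    number of false-positive cells per image.

    `images` is a list of dicts whose detected-cell confidences are split
    into those matching a labeled cancer cell ("tp_scores") and those that
    do not ("fp_scores"); "total_cells" is the labeled-cell count.
    """
    results = {}
    for t in thresholds:
        tp = sum(sum(s >= t for s in img["tp_scores"]) for img in images)
        fp = sum(sum(s >= t for s in img["fp_scores"]) for img in images)
        total = sum(img["total_cells"] for img in images)
        results[t] = {"sensitivity": tp / total,
                      "fp_per_image": fp / len(images)}
    return results

# Two hypothetical annotated images
imgs = [{"tp_scores": [70, 40], "fp_scores": [30], "total_cells": 2},
        {"tp_scores": [60], "fp_scores": [58, 20], "total_cells": 1}]
curve = sweep_thresholds(imgs)
```

Plotting `curve` over the real test sets is how one would reproduce the analysis that led the authors to pick 55: raising the threshold cuts false positives per image at the cost of sensitivity.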

Preliminary test set results: Under the optimal threshold of 55, the DLS achieved 41% sensitivity for cancer cell detection in the preliminary test set, with an average of 3.09 false-positive cells per image. The cell detection accuracy was 50.0%. While these numbers may appear modest for cell-level detection, the system's purpose was not to identify every cancer cell but rather to identify cells with the highest degree of atypia, which would then drive the malignancy prediction.

Internal and extra test set results: In the internal test set, the DLS reached 36% sensitivity with 0.72 false-positive cells per image and 50.3% accuracy. In the extra test set (the most challenging subset of initially negative-cytology cases), sensitivity was 41% with only 0.31 false-positive cells per image, though accuracy dropped to 14.5%. The lower accuracy in the extra test set is expected given that these cases had cancer cells that were originally missed by experienced pathologists.

Subgroup analysis of the extra test set: Among the 64 positive images in the extra test set, sensitivity was 41% with an average of 0.95 false-positive cells per image. For the 281 negative images, the average false-positive rate was only 0.16 cells per image, and sensitivity was not calculable because these images contained no true cancer cells. The overall false-positive rate of 0.31 cells per image across the entire extra test set (the image-weighted average of the two subgroup rates) indicates that the model was quite conservative in its detections.

TL;DR: At the optimal threshold of 55, the DLS detected cancer cells with 36-41% sensitivity across test sets and low false-positive rates (0.31 to 3.09 per image). The design prioritized identifying the most atypical cells rather than exhaustive detection, since only the highest-confidence cell in each image was used for the downstream malignancy prediction.
Pages 5-6
Predicting Histopathological Malignancy from Cytology Images

Prediction mechanism: For malignancy prediction, the DLS took the maximum confidence score among all detected cells in a given image and compared it against the threshold. A total of 97 images in the internal test set and 345 images in the extra test set could be paired with corresponding histopathological specimens. The hypothesis was straightforward: a case with higher-confidence cell detections in cytology was more likely to harbor malignant tissue on surgical pathology.

Internal test set (85 cases): The DLS achieved an area under the curve (AUC) of 0.90 (95% CI: 0.84-0.96). Under the optimal threshold of 55, sensitivity was 71% (95% CI: 52%-85%) and specificity was 94% (95% CI: 84%-98%). The F1 score was 0.76, and the kappa score was 0.68 (95% CI: 0.52-0.84), indicating substantial agreement with the pathologist reference standard. Only 4 images that scored above 55 had benign histologic results (false positives). The highest kappa score of 0.71 occurred at threshold 57, and the highest F1 score of 0.78 at threshold 58.

Extra test set (333 cases): Despite this being the most diagnostically challenging group (negative cytology with malignant histology), the DLS achieved an even higher AUC of 0.93 (95% CI: 0.90-0.95). Under the optimal threshold, sensitivity was 67% (95% CI: 54%-78%) and specificity was 92% (95% CI: 88%-95%). The F1 score was 0.66, and the kappa score was 0.58 (95% CI: 0.46-0.70), indicating moderate agreement. The highest kappa and F1 scores both occurred at threshold 52, at 0.60 and 0.69 respectively.
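The reported summary statistics follow standard definitions; the helper below shows how sensitivity, specificity, F1, and Cohen's kappa all fall out of a single 2x2 confusion matrix. The example counts are hypothetical (chosen to give values in the same range as the internal test set), not taken from the paper:

```python
def binary_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, F1, and Cohen's kappa for a 2x2 table."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    observed = (tp + tn) / n                       # observed agreement
    expected = ((tp + fp) * (tp + fn)
                + (fn + tn) * (fp + tn)) / n ** 2  # chance agreement
    kappa = (observed - expected) / (1 - expected)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "f1": f1, "kappa": kappa}

# Hypothetical confusion matrix for an 85-case test set
m = binary_metrics(tp=20, fp=4, fn=8, tn=53)
```

Note that kappa corrects raw agreement for chance, which is why a test set with imbalanced classes (like the extra test set) can show a lower kappa than F1 at the same threshold.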

Clinical significance: The relatively high specificity in both test sets means that when the DLS predicted malignancy, the patient was very likely to indeed have UC. The lower sensitivity indicates that some UC patients were missed, which the authors attribute to cases where urothelial carcinomas may not shed morphologically abnormal cells into the urine. The fact that the extra test set achieved a higher AUC than the internal test set is notable, suggesting the DLS captured patterns that human pathologists initially overlooked in these difficult cases.

TL;DR: The DLS predicted histopathological malignancy with AUC 0.90 (internal test, 85 cases) and AUC 0.93 (extra test, 333 cases). Specificity was 92-94%, meaning very few false alarms. The system achieved this without any histopathology training data, relying solely on the confidence scores from cytology-based cancer cell detection.
Pages 6-7
Explainability: How the DLS Makes Its Predictions Transparent

Design philosophy: A central contribution of this study is the explainability of the deep learning system. Unlike many "black box" models that output a prediction without revealing the reasoning, this DLS bases its malignancy prediction directly on the abnormal cells it selects from each image. The prediction can be verified by a pathologist who examines the candidate abnormal cells highlighted by the model. This design was intentional: the additional classifier appended to the Faster R-CNN does not introduce new learned features, but instead reuses the biological meaning of the confidence scores already computed during cell detection.

Comparison with prior approaches: Previous DLS designs for cyto-histo correlation, such as the model by Nojima et al. using a 16-layer VGG CNN, required retraining with histopathological data starting from the eighth layer onward. This "mixing-training" approach introduced a new set of features that partially diverged from cytology-specific features, making it harder to trace exactly which cytological features drove histopathology predictions. In contrast, the current study's DLS used the same convolutional features for both cell detection and malignancy prediction, meaning the system's reasoning was always grounded in identifiable morphological abnormalities visible in the cytology image.

No manually engineered features: The DLS contained no manually designed features. All features were learned from the cytology images through the convolutional network. The degree of fit calculated by the DLS during cancer cell detection served double duty: it measured how likely a cell was to be cancerous, and the maximum degree of fit across an entire image measured how likely a case was to harbor malignant tissue. This elegant reuse eliminated the need for separate feature engineering or additional training with histopathological data.

Limitations of explainability: The authors acknowledge that while the system is more explainable than mixing-training models, Faster R-CNN still lacks established technical methods for fully disentangling the individual representative features it uses for detection. The specific morphological characteristics (such as nuclear-cytoplasm ratio, chromatin quality, or cell quantity) that drive the confidence scores cannot be individually isolated and quantified in the way that manually engineered features can. This remains an area for future development.

TL;DR: The DLS is explainable because its malignancy prediction is directly derived from identified abnormal cells, allowing pathologists to verify each prediction by inspecting the flagged cells. Unlike prior models that retrained on histopathology data, this system uses a single set of cytology-derived features for both detection and prediction, though individual features within the CNN remain difficult to isolate.
Pages 7-8
Interpreting the Results: Sensitivity, Specificity, and the Cyto-Histo Correlation Problem

High specificity, moderate sensitivity: The DLS exhibited relatively high specificity (92-94%) but more moderate sensitivity (67-71%) across both test sets. This pattern means that most cases the model flagged as malignant were truly malignant, while some UC patients were missed. The authors explain this asymmetry by noting that certain urothelial carcinomas may not shed morphologically abnormal cells into the urine, making them fundamentally undetectable by any cytology-based method. The system does not need to detect every cancer cell to function, only the most atypical ones that provide the strongest signal for malignancy prediction.

Performance on the extra test set: The extra test set, consisting of cases with negative cytology but positive surgical histology, achieved a higher AUC (0.93) than the internal test set (0.90). This counterintuitive result indicates the DLS was able to detect subtle patterns in these difficult cases. However, the sensitivity and specificity at the optimal threshold were lower for the extra test, and the cell detection accuracy was still increasing beyond the selected threshold of 55. This suggests the optimal threshold for this particular population may differ from the one derived from the preliminary and internal test sets, and future studies should explore population-specific threshold calibration.

Inherited annotation bias: The authors candidly address the issue of inherited bias. Since no cytopathological scoring system is perfect, pathologists themselves do not spot every true cancer cell. A DLS trained on pathologist annotations inevitably inherits these biases and errors. In the extra test set, where pathologists had originally failed to identify cancer cells, the DLS faced the additional challenge of detecting cells that its own training data may have underrepresented. A blinded pathologist review of the extra test set found that only 63 of 333 contradicted cases actually had identifiable cancer cells.

Redefining cyto-histo correlation: The study raises an important conceptual point about the nature of cyto-histo correlation in urine. There is ongoing debate about whether a negative cytology with a concurrent positive surgical result should be classified as a "false negative." Similarly, positive urine cytology followed by a negative surgical result may not be a true "false positive." The DLS results imply that when the model predicts malignancy, it focuses on cytological features that partially overlap with those used for UC cell detection, suggesting that cytology images contain latent information about histopathological status that standard manual interpretation may not fully capture.

TL;DR: Specificity of 92-94% means very few false-positive malignancy predictions, while 67-71% sensitivity reflects the inherent limitation that some UCs do not shed detectable abnormal cells. The DLS outperformed initial human cytology reading on the extra test set (AUC 0.93), but threshold optimization and annotation bias remain challenges requiring further study.
Pages 8-9
Study Limitations and Prospects for Clinical Translation

Single-center retrospective design: The study was conducted at a single institution (Peking University First Hospital), and all data were collected retrospectively. This limits the generalizability of the findings to other patient populations, cytology preparation methods, and imaging systems. The authors explicitly call for multicentered prospective studies to validate these results. Without external validation, the reported AUCs of 0.90 and 0.93 may not fully reflect real-world performance across diverse clinical settings with different staining protocols, scanner resolutions, and pathologist annotation practices.

Feature interpretability gap: Although the DLS is more explainable than mixing-training models, the inability to fully disentangle individual features within the Faster R-CNN represents a limitation. The study cannot define precisely which morphological characteristics (nuclear size, chromatin texture, nuclear-cytoplasm ratio, etc.) the network weighs most heavily for its predictions. Techniques such as gradient-weighted class activation mapping (Grad-CAM) or feature ablation studies could potentially address this gap in future work, providing pathologists with more granular insight into the model's decision-making process.

Threshold generalization: The optimal threshold of 55, selected based on the preliminary and internal test sets, may not generalize to all clinical populations. The extra test set showed that cell detection accuracy was still rising beyond this threshold, suggesting that population-specific calibration could improve performance. Developing adaptive thresholding strategies that account for the distribution of cell atypia scores within a given patient population is a promising direction for future research.

Clinical impact and risk stratification: If validated in prospective multicenter trials, this DLS could serve as a non-invasive risk-stratification tool for urologists. By predicting the likelihood of histologically confirmed malignancy at the time of urine cytology collection, the system could help prioritize patients for cystoscopy and surgical intervention, potentially reducing unnecessary invasive procedures for low-risk patients while accelerating workup for high-risk individuals. The dual functionality of assisting pathologists in cytology reading while simultaneously providing histopathologic predictions makes this approach unique among existing DLS tools for urothelial carcinoma surveillance.

Broader implications: The study contributes to the growing evidence that deep learning can bridge the gap between cytopathology and histopathology without requiring histopathological training data. This principle could extend beyond bladder cancer to other organ systems where cytology screening precedes definitive histological diagnosis, such as cervical cancer (Pap smear to biopsy) or lung cancer (sputum cytology to tissue biopsy). The explainable design, where predictions are grounded in identifiable cells rather than abstract features, may also help address the "black box" trust barrier that has slowed clinical adoption of AI in pathology.

TL;DR: Key limitations include single-center retrospective design, inability to isolate individual CNN features, and a fixed threshold that may not generalize. If validated prospectively, this DLS could serve as a non-invasive risk-stratification tool, prioritizing patients for cystoscopy based on cytology-derived malignancy predictions. The explainable, cytology-only training approach could extend to other cancer screening workflows.
Citation: Liu Y, Jin S, Shen Q, et al. 2022. Open access under a CC BY license. PMC9170952. DOI: 10.3389/fonc.2022.901586.