This systematic review, authored by Samy A. Azer and published in the World Journal of Gastrointestinal Oncology in 2019, evaluates how convolutional neural networks (CNNs) have been applied to hepatocellular carcinoma (HCC) and liver mass imaging. HCC is the fifth most common cancer in men, the seventh most common in women, and the third leading cause of cancer-related death worldwide. Liver cancer incidence in men is two to three times that in women in most regions, likely driven by a greater prevalence of risk factors (hepatitis B and C, obesity, diabetes, metabolic syndrome, and non-alcoholic fatty liver disease) and by differences in sex steroid hormones and epigenetic factors.
Clinical imaging context: Current guidelines from the American and European liver societies recommend ultrasound for HCC surveillance, with CT and MRI reserved for characterizing suspected focal lesions. Contrast-enhanced CT and MRI can identify up to 65% of small nodules under 2 cm in size, but detection of small nodules depends heavily on the vascular dynamic enhancement pattern across different phases. Inter-operator variability from qualitative visual assessment further limits diagnostic consistency, creating a clear opening for computer-aided diagnosis frameworks.
CNN applications in medicine: CNNs are a class of deep learning algorithms that have already demonstrated value across multiple clinical domains, including detection of gastrointestinal bleeding in wireless capsule endoscopy, diagnosis of Helicobacter pylori infection from endoscopy images, and identification of gastrointestinal polyps. In medical imaging more broadly, CNNs perform four core tasks: lesion detection, classification (sorting lesions into categories such as malignant vs. benign), segmentation (delineating organs or anatomical structures), and image reconstruction (generating noiseless CT images from subsampled data).
The review set out to answer two specific questions: What is the current status of CNN research for assessing HCC, liver metastases, and other liver masses? And what is the accuracy of CNN-based deep learning systems for lesion detection, classification, and segmentation of these images?
The review followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. Three databases were searched: PubMed, EMBASE, and the Web of Science, along with proceedings volumes publishing full conference papers. The search covered all studies up to January 2019 and was restricted to English-language studies conducted on humans. Keywords included "Cancer," "Liver," "Hepatocellular carcinoma," "HCC," "Liver mass," "Metastasis," "Hepatic," "Radiology," "Pathology," "Histopathology," "Ultrasound," "Computed tomography," and "Magnetic resonance images."
Comprehensive journal screening: Beyond database searches, the authors manually searched journals listed in the Journal Citation Reports-2017 of the Web of Science across six categories: Gastroenterology and Hepatology (32 journals), Oncology (41 journals), Radiology (6 journals), Pathology (14 journals), Computer Sciences and Engineering (18 journals), and Medical Informatics (7 journals). Reference lists of primary articles and reviews were also hand-searched to identify missed studies.
PICOS framework: The inclusion criteria used a formal PICOS framework. The population was worldwide datasets involving human patients. The intervention was CNN-based diagnostic models. The comparison was manual detection by radiologists, hepatologists, or anatomical pathologists. The outcomes were lesion detection, classification, segmentation, or image reconstruction. Studies needed to include controlled comparisons, comparisons to manual assessment, or benchmarking against other AI models.
Selection and extraction: Two independent researchers reviewed titles and abstracts, with disagreements resolved through discussion. Data extraction covered first author, year of publication, objectives, methods, cancer type, main results, accuracy metrics (sensitivity, specificity), and institutional affiliation. Inter-rater agreement was measured using the Cohen kappa coefficient via SPSS software. From 129 initially identified publications, 78 remained after removing duplicates, 36 full-text articles were assessed for eligibility, and 11 met the final selection criteria.
The 11 studies covered a range of CNN applications in liver oncology: 6 studies focused on differentiating liver masses or distinguishing HCC from other lesions, 3 on differentiating HCC from cirrhosis or detecting new tumour development, and 2 on HCC nuclei grading or segmentation. In terms of task type, 4 studies targeted lesion detection, 5 addressed classification, and 2 focused on segmentation. The imaging modalities spanned CT scans (6 studies), ultrasound (1 study), 3D multi-parameter MRI scans (2 studies), and cellular or histopathological images (2 studies).
Dataset sizes: Sample sizes varied dramatically across studies. Ben-Cohen et al. used CT data from just 20 patients with 68 lesions (testing on 14 patients with 55 lesions). Frid-Adar et al. worked with a limited dataset of 182 liver lesions (53 cysts, 64 metastases, 65 haemangiomas). Todoroki et al. used 3D multi-phase contrast-enhanced liver CT images from 75 patients across 5 lesion types. Yasaka et al. conducted the largest CT study, using 55,536 images from 460 patients for training and 100 liver mass image sets for testing. For MRI, Trivizakis et al. examined scans from 134 patients (37.7% primary liver mass, 62.3% metastatic), while Zhang et al. used images from 20 patients generating 1,700 non-overlapping patches. Only one study (Bharti et al.) used ultrasound from 94 patients, and two studies used 127 liver pathology images.
Geographic distribution: The research was conducted across Japan (2 studies), China (3), the United States (2), India (1), Greece (1), and Israel (4), with some papers involving authors from multiple countries. The leading institutions included the University of Tokyo Hospital, Tel Aviv University's Biomedical Engineering Medical Image Processing Laboratory, Yale University, the Hebrew University of Jerusalem, and Zhejiang University.
Author backgrounds: Of the 58 authors across all 11 studies, only 5 came from radiology departments, 1 from pathology, and 2 had other medical backgrounds. The remaining 50 were from engineering, computer science, and medical image processing. This imbalance is reflected in the publication venues: most articles appeared in computer science and biomedical informatics journals (Neurocomputing, IEEE Journal of Biomedical and Health Informatics, Computers in Biology and Medicine) rather than clinical journals.
The CNN architectures varied considerably across the 11 studies. Ben-Cohen et al. combined a global context approach using a fully convolutional network with local patch-level analysis through superpixel sparse-based classification for detecting liver metastases on CT. Trivizakis et al. proposed a 3D CNN architecture with four consecutive strided 3D convolutional layers (3 x 3 x 3 kernel size), ReLU activation functions, a fully connected layer with 2,048 neurons, and a softmax layer for binary classification, trained and validated on 130 Diffusion Weighted MR image scans.
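The source describes the Trivizakis et al. network as four consecutive strided 3D convolutions with 3 × 3 × 3 kernels. A strided convolution downsamples the volume at each layer, which is how such a network trades spatial resolution for feature depth. The minimal sketch below computes that downsampling; the kernel size comes from the description above, while stride 2 and zero padding are assumptions for illustration, not details from the paper.

```python
# Sketch of spatial downsampling through strided 3D conv layers.
# Kernel size 3 is from the architecture summary; stride=2 and
# padding=0 are assumed values for illustration only.

def conv3d_out_size(size, kernel=3, stride=2, padding=0):
    """Output length along one spatial axis after a single conv layer."""
    return (size + 2 * padding - kernel) // stride + 1

def through_layers(size, n_layers=4):
    """Spatial size after passing through n consecutive strided convs."""
    for _ in range(n_layers):
        size = conv3d_out_size(size)
    return size

# A hypothetical 64-voxel-wide input shrinks layer by layer:
sizes = [64]
for _ in range(4):
    sizes.append(conv3d_out_size(sizes[-1]))
print(sizes)  # [64, 31, 15, 7, 3]
```

The shrinking trajectory shows why such architectures end in a fully connected layer: after four strided convolutions, the remaining 3 × 3 × 3 grid of features is small enough to flatten into a dense classifier head.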
Ultrasound and pathology models: Bharti et al. designed a system for ultrasound images based on higher-order features with hierarchical organization and multi-resolution analysis, enabling characterization of liver echotexture and surface roughness across four stages (normal, chronic liver disease, cirrhosis, and HCC on cirrhosis). For histopathology, Li et al. developed a joint multiple fully connected CNN with extreme learning machine (MFC-CNN-ELM) architecture for HCC nuclei grading, using a centre-proliferation segmentation method with labels marked under the guidance of three pathologists. A related approach by the same group introduced a structured convolutional extreme learning machine (SC-ELM) with case-based shape templates for HCC nucleus segmentation.
Data augmentation with GANs: Frid-Adar et al. tackled the small dataset problem by generating synthetic medical images using Generative Adversarial Networks (GANs), then training a CNN classifier on the augmented data. This approach improved classification performance from 78.6% sensitivity and 88.4% specificity (with classic augmentation) to 85.7% sensitivity and 92.4% specificity (with synthetic augmentation), demonstrating that GAN-generated data can meaningfully boost CNN performance when real training data is scarce.
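Sensitivity and specificity, the two metrics on which Frid-Adar et al. report their improvement, derive directly from the confusion matrix. The sketch below shows that derivation; the counts are invented for illustration and are not data from the study.

```python
# Sensitivity and specificity from confusion-matrix counts.
# The counts below are made up for illustration.

def sensitivity(tp, fn):
    """True positive rate: fraction of actual lesions correctly flagged."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: fraction of negative cases correctly cleared."""
    return tn / (tn + fp)

tp, fn, tn, fp = 85, 15, 92, 8  # hypothetical counts
print(f"sensitivity={sensitivity(tp, fn):.3f}, "
      f"specificity={specificity(tn, fp):.3f}")
```

Reporting both matters because augmentation can inflate one at the expense of the other; the Frid-Adar et al. result is notable precisely because synthetic data improved both simultaneously.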
Multi-phase and longitudinal approaches: Todoroki et al. used a deep CNN incorporating multi-layered architecture with a two-step approach: first segmenting the liver from CT images, then estimating, for each pixel within the segmented liver, the probability that it belongs to a lesion. Zhang et al. developed a novel 3D deep CNN with auto-context elements and a U-Net-like architecture, using multi-level hierarchical design and multi-phase training procedures to classify liver tissue types in HCC patients from MRI. Vivanti et al. developed methods for both automatic tumour delineation in longitudinal CT studies and detection of new tumours in follow-up scans.
CT-based classification: Yasaka et al. achieved a median accuracy of 0.84 for test data when classifying liver masses into five categories (classic HCC, other malignant tumours, intermediate masses, haemangiomas, and cysts), with an area under the receiver operating characteristic curve (AUC) of 0.92. This was the largest study in the review, using 55,536 images from 460 patients across three contrast phases (non-contrast, arterial, and delayed). Trivizakis et al. demonstrated that their 3D CNN achieved 83% classification performance in discriminating primary from metastatic liver tumours, compared to 69.6% and 65.2% for two different 2D CNN approaches, confirming the advantage of 3D convolutional architectures for tissue classification.
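The AUC of 0.92 reported by Yasaka et al. has a simple probabilistic reading: it is the chance that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. The sketch below computes AUC via that rank-comparison identity; the scores are invented for illustration.

```python
# AUC via the pairwise rank-comparison (Mann-Whitney) identity.
# Scores below are hypothetical, not data from any reviewed study.

def auc(scores_pos, scores_neg):
    """P(random positive outranks random negative); ties count as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

pos = [0.9, 0.8, 0.75, 0.6]  # model scores for malignant cases (invented)
neg = [0.7, 0.5, 0.3, 0.2]   # model scores for benign cases (invented)
print(auc(pos, neg))  # 0.9375
```

Unlike a single accuracy figure, AUC is threshold-free, which is why it is the preferred summary when a classifier's operating point has not yet been fixed for clinical use.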
Ultrasound and pathology results: Bharti et al. achieved a classification accuracy of 96.6% in differentiating four liver stages (normal, chronic liver disease, cirrhosis, and HCC evolved on cirrhosis) from ultrasound images acquired from 94 patients. For histopathology, Li et al. demonstrated that their MFC-CNN-ELM architecture had superior performance in grading HCC nuclei compared to related methods, with external validation on ICPR 2014 HEp-2 cells confirming generalizability. The companion study on nucleus segmentation using SC-ELM also outperformed published comparison methods.
Metastasis detection: Ben-Cohen et al. reported a true positive rate of 94.6% with only 2.9 false positives per case for detecting small liver metastases on CT using 3-fold cross-validation. This is clinically significant because identifying small metastatic deposits in the liver is one of the most challenging tasks in abdominal imaging, and a high true positive rate with low false positives could meaningfully reduce missed lesions in clinical practice.
Longitudinal tumour tracking: Vivanti et al. demonstrated two important capabilities. First, their tumour delineation method achieved an average overlap error of 17% (SD = 11.2) and a surface distance of 2.1 mm (SD = 1.8) across 222 tumours from 31 patients, far surpassing stand-alone segmentation and without requiring large annotated training datasets. Second, their tumour detection method achieved a true positive new tumour detection rate of 86% (compared to 72% with stand-alone detection) and a tumour burden volume overlap error of 16% across 246 tumours from 37 longitudinal CT studies, enabling both new tumour detection and volumetric burden estimation.
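The volume overlap error Vivanti et al. report is one minus the Jaccard index between the predicted and reference tumour masks. The sketch below computes it on sets of voxel coordinates; the masks are toy data, not study data.

```python
# Volume overlap error = 1 - |pred ∩ ref| / |pred ∪ ref|
# (one minus the Jaccard index). Masks below are toy examples.

def overlap_error(pred, ref):
    """Overlap error between two segmentations given as voxel-coordinate sets."""
    inter = len(pred & ref)
    union = len(pred | ref)
    return 1.0 - inter / union

# A 10x10 predicted slice vs. a reference shifted 2 voxels along x:
pred = {(x, y, 0) for x in range(10) for y in range(10)}
ref = {(x, y, 0) for x in range(2, 12) for y in range(10)}
print(overlap_error(pred, ref))  # 80 shared of 120 total -> ~0.333
```

An error of 17%, as reported for the delineation method, thus means the predicted and reference masks disagree on roughly a sixth of their combined volume, which is tight for tumours that change shape between scans.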
One of the review's critical observations is the inconsistency of accuracy measurement across the 11 studies. The approaches used included: assessing system performance and evaluation metrics (Ben-Cohen et al., Bharti et al.); measuring automatic liver segmentation and lesion detection accuracy; generating receiver operating characteristic curves and precision-recall curves (Trivizakis et al.); comparing outcomes with other CNN architectures (Li et al., Frid-Adar et al., Zhang et al.); measuring sensitivity, specificity, and accuracy parameters (Frid-Adar et al., Vivanti et al.); and comparing deep CNN performance against Bayesian models and benchmark methods (Todoroki et al., Zhang et al.).
Missing metrics: Notably, some studies did not report sensitivity or specificity at all (Bharti et al., Todoroki et al., Zhang et al.), making direct comparisons across studies impossible. Only Frid-Adar et al. compared their CNN results with visual inspection by expert radiologists using precision and recall rates. The lack of standardized reporting is a significant issue: without consistent accuracy parameters applied to the same types of images, the field cannot reliably determine which CNN architectures or approaches perform best for specific clinical tasks.
Inter-rater reliability: The review itself maintained methodological rigor in its own assessment, with inter-rater agreement between the two evaluators yielding Cohen kappa scores ranging from 0.779 to 0.894, indicating substantial to almost perfect agreement. However, because the underlying studies used heterogeneous data, had gaps in reported results, and varied substantially in methods, the authors determined that a meta-analysis was not feasible. This limitation means the review could only provide a narrative synthesis rather than pooled quantitative estimates of CNN performance.
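Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance given their individual decision rates. The sketch below computes it for two raters' include/exclude decisions; the review used SPSS for this calculation, and the decisions here are invented for illustration.

```python
# Cohen's kappa for two raters: (observed - expected) / (1 - expected),
# where "expected" is chance agreement from each rater's label rates.
# The decision lists below are invented examples.

def cohen_kappa(rater_a, rater_b):
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    return (observed - expected) / (1 - expected)

a = ["in", "in", "out", "out", "in", "out", "in", "out"]
b = ["in", "in", "out", "out", "in", "out", "out", "out"]
print(cohen_kappa(a, b))  # 0.75: raters disagree on one of eight studies
```

On the conventional Landis and Koch scale, values of 0.61 to 0.80 are read as substantial agreement and 0.81 to 1.00 as almost perfect, which is how the review's range of 0.779 to 0.894 is interpreted.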
The Todoroki et al. study on multi-phase CT demonstrated that the deep CNN tumour detection method could discriminate between 5 different lesion types (cysts, focal nodular hyperplasia, HCC, haemangioma, and metastases) with results superior to boundaries delineated by doctors and outperforming other convolutional methods. Zhang et al. showed that multi-resolution input, auto-context design, and multi-phase training procedures collectively improved classification of liver tissue types from MRI compared to single-resolution and single-phase approaches.
Insufficient training data: The most consistent limitation across the 11 studies was the small size of training datasets. Several researchers explicitly reported difficulty obtaining medical images, which limits the direct applicability of machine learning algorithms. While some architectures (notably Vivanti et al.) were designed to work with small training sets, the field as a whole lacked access to the large, annotated datasets that drive reliable CNN performance. The largest study (Yasaka et al., 460 patients) is still modest by deep learning standards, and most studies used far fewer patients.
Imbalanced authorship and reporting: With 50 of 58 authors coming from engineering and computer science backgrounds, the methods sections of most studies emphasized technical details (CNN architecture, test data preparation, algorithm development) while providing minimal clinical information about patient populations, image sources, and clinical procedures. This imbalance limits the clinical interpretability of the results. Medical readers cannot fully assess how generalizable the findings are without knowing the patient demographics, disease severity distribution, and imaging protocols used.
Lack of standardization: Most studies did not provide enough information about the exact sources of their datasets, making it impossible to determine whether different studies used overlapping data or the same testing protocols. Without this transparency, comparing performance across studies is unreliable. The review called for journals publishing deep learning research to develop standardized guidelines requiring authors to disclose dataset sources and testing protocols.
Additional biases: Publication bias may have suppressed negative results. The English-language restriction excluded potentially relevant studies published in other languages. The diversity of lesion types included across studies means findings must be interpreted with caution, as performance on one lesion type does not necessarily generalize to others. The retrospective design of most studies also limits conclusions about real-world clinical performance.
Large-scale multi-centre studies: The review's primary recommendation is multi-institute, multi-centre collaborations involving large numbers of patients with cirrhosis from different pathological causes, as well as patients with HCC on top of cirrhosis, liver secondaries, and other liver masses. Such collaborations would resolve the insufficient training data problem that affects nearly every study reviewed, and would enable more reliable measurement of CNN accuracy and performance across diverse patient populations.
Longitudinal CT studies: The results from Vivanti et al. (86% new tumour detection rate vs. 72% with stand-alone methods) suggest that longitudinal approaches, comparing follow-up scans against baseline, could offer superior detection of new small tumours. The review hypothesizes that this longitudinal paradigm deserves dedicated research investment, particularly for monitoring patients at high risk for recurrence or new tumour development. Comparing CNN-tracked changes against existing stand-alone and follow-up methods would establish whether deep learning adds meaningful clinical value in surveillance contexts.
Clinical comparison studies: A critical gap is the near-total absence of case-control studies comparing CNN performance directly against manual image assessment by expert radiologists, hepatologists, and pathologists. Only one study in the review (Frid-Adar et al.) made this comparison. Without head-to-head evaluations, the field cannot determine whether CNNs offer genuine diagnostic improvement or simply replicate existing expert performance. The review emphasizes that selecting discriminative features that capture clinical characteristics, and embedding them as key features in CNN algorithms for both segmentation and classification, requires active collaboration between medical experts and computer engineering teams.
Standardized accuracy assessment: Future studies should prioritize reporting sensitivity, specificity, and positive predictive values using consistent methods. Ideally, individual studies should apply two or three different CNN methods to the same image set and compare accuracy parameters directly. The literature currently lacks such comparative studies, making any cross-study performance comparison unreliable due to multiple confounding variables in models, datasets, and evaluation approaches. Differentiation between primary and secondary liver tumours remains a particularly important clinical target that CT-based deep learning methods have shown initial promise in addressing.