Hepatocellular carcinoma (HCC) ranks fifth among the most common malignancies worldwide and is the third most common cause of cancer-related death globally. Despite advances in treatment and diagnostics, prognosis remains poor because of delayed diagnosis and limited treatment strategies. This systematic review set out to catalog and evaluate every published study applying artificial intelligence to the radiological detection and characterization of HCC, regardless of the imaging technique used.
The authors note that AI has potential across three key domains in HCC: (a) risk factor stratification, (b) lesion characterization, and (c) improved prognostication in established cases. HCC is a particularly challenging cancer because of its multiple overlapping risk factors, including non-alcoholic fatty liver disease (NAFLD), non-alcoholic steatohepatitis (NASH), and progressive cirrhosis. Several AI models have been developed to stratify and predict the risk of incident HCC, while deep learning (DL) and radiomics-based approaches using CT and MRI have shown promise in distinguishing HCC from non-HCC liver nodules with high diagnostic accuracy.
AI hierarchy: The paper defines the relevant AI subfields clearly. Machine learning (ML) is a subset of AI in which models learn from prior data and improve iteratively. Deep learning (DL) is a further subset of ML that uses multi-layered neural networks to process large training datasets and produce meaningful predictions across diagnostic, therapeutic, and prognostic domains. These distinctions matter because the included studies span the full range from classical ML classifiers (support vector machines [SVM], k-nearest neighbors [KNN], random forests) to advanced deep learning architectures (CNN, DNN, LSTM).
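To make the hierarchy concrete, the sketch below (synthetic data and shapes, my own illustration rather than anything from the reviewed studies) contrasts a classical ML classifier trained on hand-crafted features with a small CNN that learns features directly from image patches:

```python
# Classical ML vs. deep learning, on synthetic stand-in data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Classical ML: an SVM trained on hand-crafted (e.g., radiomics) features.
X = np.random.rand(200, 30)           # 200 lesions x 30 extracted features
y = np.random.randint(0, 2, 200)      # 0 = benign, 1 = HCC (synthetic labels)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
svm = SVC(kernel="rbf").fit(X_tr, y_tr)
print("SVM test accuracy:", svm.score(X_te, y_te))

# Deep learning: a CNN learns its own features from raw pixels.
import torch
import torch.nn as nn

cnn = nn.Sequential(                   # toy 2D CNN for 64x64 grayscale patches
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(16 * 16 * 16, 2),
)
logits = cnn(torch.randn(4, 1, 64, 64))   # batch of 4 patches -> class logits
print(logits.shape)                        # torch.Size([4, 2])
```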
The review follows PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. Because of inconsistent reporting of outcome measures and differences in populations and study design across the included studies, the authors opted for qualitative synthesis rather than a formal meta-analysis. The review ultimately included 46 articles published between 1998 and 2022, making it a comprehensive snapshot of a 24-year research trajectory.
The authors searched three databases: PubMed, Scopus, and Cochrane, using Boolean combinations of terms including "Artificial Intelligence," "Machine Learning," "Hepatocellular Carcinomas," "HCC," and "Liver Cancer." Records were included from database inception up to May 5, 2022. The search terms were adapted to each database, and the reference lists of included articles and relevant reviews were manually checked for additional papers.
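For illustration only, a PubMed-style Boolean combination of the reported terms might look like the query below; the review does not publish its exact per-database strings, so this is a hypothetical reconstruction:

```
("Artificial Intelligence" OR "Machine Learning") AND
("Hepatocellular Carcinomas" OR "HCC" OR "Liver Cancer")
```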
Inclusion criteria: Only published articles reporting the application of AI in detecting or diagnosing HCC were included. Studies reporting AI applications outside diagnosis (risk prediction, prognosis, treatment) were excluded. Only diagnoses based on CT, MRI, ultrasound (US), 18F-FDG PET, or X-ray were selected, while other methods like pathology reports or biomarkers were excluded. Reviews, letters, editorials, conference papers, preprints, commentaries, book chapters, and non-English articles were all excluded.
Search results: The initial search identified 3,160 records (1,677 from PubMed, 1,426 from Scopus, 57 from Cochrane). After removing 1,052 duplicates via EndNoteX9, 2,108 studies were screened by title and abstract. Of these, 2,032 were excluded for various reasons: 1,813 not related to the topic, 80 additional duplicates not caught by software, 62 conference papers, 26 reviews, and others. From 76 potentially eligible full texts, 27 articles were included, and a manual search identified 19 additional articles, for a final total of 46 studies.
Quality assessment: The authors developed a composite quality score (QS) based on three items, each scored 0-2 for a maximum of 6 points. The first item assessed the number of images (fewer than 50 = 0, 50-100 = 1, more than 100 = 2) as a proxy for the risk of bias and overfitting. The second assessed the use of a validation cohort (none = 0, internal train/test split = 1, external validation cohort = 2). The third considered whether the data dated from 2011 or later, when GPU-accelerated CNN training became feasible. Two authors independently screened titles and abstracts, with a third reviewer resolving disagreements.
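A minimal sketch of this rubric as code follows; the function is my own reconstruction from the description above (not the authors' instrument), and the year item is binarized because the summary does not specify how an intermediate score of 1 would be awarded:

```python
def quality_score(n_images: int, validation: str, year: int) -> int:
    """Composite QS out of 6: images (0-2) + validation (0-2) + year (0-2)."""
    if n_images < 50:
        img_score = 0
    elif n_images <= 100:
        img_score = 1
    else:
        img_score = 2
    val_score = {"none": 0, "internal": 1, "external": 2}[validation]
    year_score = 2 if year >= 2011 else 0   # assumption: binarized year item
    return img_score + val_score + year_score

# e.g., >100 images (2) + internal split (1) + 2011-or-later data (2) -> QS 5
print(quality_score(n_images=250, validation="internal", year=2020))  # 5
```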
The mean "Number of Images" score was 1.70, with 36 articles (78.3%) analyzing at least 100 images, which is a positive indicator for model robustness. However, the "Cohort for Validation" score was only 0.609 on average. An external validation cohort was used in only 2 articles (4.3%), while 43.5% of studies had no validation cohort at all and 52.2% relied solely on an internal train/test split. This is a critical weakness, as the absence of external validation severely limits how generalizable these AI models are to new patient populations and imaging protocols.
Publication timing: The mean "Year of Publication" score was 1.87, with 87.0% of included works published in 2011 or later, after GPU-accelerated deep learning became practical. The median publication year was 2019, and 63.0% of all included studies were published between 2019 and 2022, reflecting the explosive growth of AI in medical imaging during this period. The years 2019-2020 alone accounted for 21.7% and 15.2% of included studies, respectively.
On average, the total quality score was 4.17 out of 6 (median 4.00, SD 1.04). Only 3 articles (6.52%) scored below 3, while 2 (4.34%) received the maximum score of 6. A contingency table analysis showed a statistically significant, consistent improvement in quality scores over the years (p = 0.004), likely driven by the publication of reporting guidelines, dedicated AI methodology checklists, and technological improvements that make larger datasets and more rigorous validation feasible.
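The review does not describe how its contingency table was binned, so the sketch below only illustrates the mechanics of such a test, with invented counts (summing to 46) rather than the paper's actual data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: publication era; columns: quality-score band (QS <= 3 vs. QS >= 4).
# Counts are invented for illustration, not taken from the review.
table = np.array([[6, 4],     # pre-2011
                  [5, 12],    # 2011-2018
                  [3, 16]])   # 2019-2022
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```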
Key takeaway: While the field is improving in terms of dataset size and methodological rigor, the near-complete absence of external validation cohorts (only 4.3% of studies) remains the single biggest weakness. Without external validation, even impressive accuracy numbers cannot be confidently generalized beyond the institution where the model was trained.
A total of 19 articles (41.30%) in the review used CT as their diagnostic modality. CT-based AI studies spanned a wide range of approaches, from classical radiomics feature extraction to fully automated deep learning systems. Ziegelmayer et al. (2022) compared CNN features against radiomics features for robustness to technical variations in image acquisition parameters and found that CNN features were more stable. Xu et al. (2022) built an SVM model using radiomic features from non-contrast CT to discriminate between early-stage HCC and intrahepatic cholangiocarcinoma (ICCA).
Detection performance: Kim et al. (2021) developed a CNN-based model for detecting primary hepatic malignancies in multiphase CT images of patients at high risk for HCC. The model achieved 84.8% sensitivity with 4.80 false positives per CT scan in the test set, using a cohort of 1,320 patients. Krishan et al. (2020) tested multiple classifiers including R-part decision trees, AdaBoost, random forest, k-SVM, GLM, and neural networks on 794 normal and 844 abnormal liver images. Their accuracy ranged from 98.39% to 100% for tumor identification and 76.38% to 87.01% for tumor classification, with a multi-level ensemble model performing best.
Protocol optimization: Shi et al. (2020) demonstrated that when combined with a Convolutional Dense Network (CDN), a three-phase CT protocol without the pre-contrast phase showed similar diagnostic accuracy to a four-phase protocol in differentiating HCC from other focal liver lesions (FLLs). This is clinically significant because reducing CT phases means lower radiation dose for patients. Mokrane et al. (2019) used KNN, SVM, and random forest on triphasic CT to diagnose HCC in cirrhotic patients with indeterminate nodules, establishing a proof of concept that radiomics signatures from arterial and portal venous phase changes can enable non-invasive diagnosis.
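Most of these CT studies share the same overall pipeline: extract quantitative features per nodule, then fit and compare classical classifiers. Below is a hedged sketch of that pipeline with synthetic stand-in features; nothing here is taken from Mokrane et al. or the other cited papers:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-ins for arterial- and portal-venous-phase radiomic features.
X = rng.normal(size=(120, 40))        # 120 indeterminate nodules x 40 features
y = rng.integers(0, 2, 120)           # 0 = non-HCC, 1 = HCC (synthetic)

for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                  ("SVM", SVC(kernel="rbf")),
                  ("RF", RandomForestClassifier(n_estimators=200, random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy = {scores.mean():.2f}")
```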
Earlier CT studies by Stoitsis et al. (2006), Gletsos et al. (2003), and Chen et al. (1998) used texture features and neural network classifiers to classify hepatic lesions. Stoitsis et al. achieved 100%, 93.75%, and 90.63% classification accuracy in training, validation, and testing sets, respectively. These earlier works, while limited in dataset size, laid the groundwork for modern deep learning approaches by demonstrating that quantitative image features could reliably differentiate liver tissue types.
Ultrasound was the most frequently studied modality, with 20 articles (43.47%) using US in their work. Many of these studies specifically leveraged contrast-enhanced ultrasound (CEUS), which provides dynamic information about lesion perfusion that is well suited to AI analysis. Turco et al. (2022) developed an interpretable radiomics approach using logistic regression, SVM, random forest, and KNN to differentiate malignant from benign FLLs on CEUS, finding that perfusion-related features (peak time, wash-in time), microvascular architecture (spatiotemporal coherence), and spatial contrast enhancement characteristics (global kurtosis, GLCM Energy) were most relevant.
Multimodal integration: Sato et al. (2022) built a deep multimodal representation learning model that integrated tumor images, patient background, and blood biomarkers for differentiating liver tumors on B-mode US. The multimodal model outperformed a CNN using US images alone, demonstrating that combining clinical and imaging data improves diagnostic accuracy. Huang et al. (2020) proposed a novel computer-aided diagnosis (CAD) approach extracting spatial-temporal semantics for atypical HCC, achieving an average accuracy of 94.40%, recall of 94.76%, F1-score of 94.62%, specificity of 93.62%, and sensitivity of 94.76%.
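A minimal sketch of a late-fusion architecture of this kind, in PyTorch: a CNN branch encodes the B-mode image while an MLP branch encodes clinical and laboratory features, and the two embeddings are concatenated before classification. All layer sizes and input dimensions are illustrative placeholders, not Sato et al.'s published network:

```python
import torch
import torch.nn as nn

class MultimodalNet(nn.Module):
    def __init__(self, n_clinical: int = 12, n_classes: int = 2):
        super().__init__()
        self.image_branch = nn.Sequential(       # 1x64x64 US patch -> 32-d
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.tabular_branch = nn.Sequential(     # clinical + biomarker vector
            nn.Linear(n_clinical, 32), nn.ReLU(),
        )
        self.head = nn.Linear(32 + 32, n_classes)

    def forward(self, image, tabular):
        z = torch.cat([self.image_branch(image),
                       self.tabular_branch(tabular)], dim=1)
        return self.head(z)

model = MultimodalNet()
logits = model(torch.randn(4, 1, 64, 64), torch.randn(4, 12))
print(logits.shape)  # torch.Size([4, 2])
```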
CAD systems: Multiple groups developed CAD systems for FLL classification. Acharya et al. (2018) achieved 92.95% accuracy, 90.80% sensitivity, and 97.44% specificity for lesion classification using Radon transform and bi-directional empirical mode decomposition features. Ta et al. (2018) showed that CAD systems using ANN and SVM classified benign and malignant FLLs with accuracy comparable to an expert reader, and that CAD improved the accuracy of both novice and experienced readers. Time-based features of time-intensity curves were more discriminating than intensity-based features.
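A sketch of how such time-intensity-curve (TIC) features can be computed is shown below; the 10%-of-peak wash-in definition and the synthetic bolus curve are common conventions chosen for illustration, not necessarily those used in the cited papers:

```python
import numpy as np

def tic_features(t: np.ndarray, intensity: np.ndarray) -> dict:
    """Simple time-based features from a CEUS time-intensity curve."""
    peak_idx = int(np.argmax(intensity))
    peak_time = float(t[peak_idx])
    baseline = intensity[0]
    rise = intensity[peak_idx] - baseline
    # Wash-in time: first sample exceeding 10% of the peak rise (assumed convention).
    wash_in_idx = int(np.argmax(intensity >= baseline + 0.1 * rise))
    wash_in_time = float(t[wash_in_idx])
    return {"peak_time": peak_time,
            "wash_in_time": wash_in_time,
            "wash_in_slope": float(rise / max(peak_time - wash_in_time, 1e-6))}

t = np.linspace(0, 60, 300)                 # seconds after contrast injection
tic = (t / 10) ** 2 * np.exp(-t / 10)       # synthetic gamma-variate-like curve
print(tic_features(t, tic))
```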
Earlier US studies: Hassan et al. (2017) proposed a stacked sparse autoencoder framework achieving 97.2% overall accuracy, outperforming multi-SVM, KNN, and Naive Bayes. Kondo et al. (2017) demonstrated that combining features from arterial, portal, and post-vascular phases was important for SVM-based classification of Sonazoid CEUS. Sugimoto et al. (2010) used ANNs for differential diagnosis of FLLs by CEUS, achieving classification accuracies of 84.8% for metastasis, 93.3% for hemangioma, and 98.6% for all HCCs.
Only 7 articles (15.21%) studied MRI, making it the least represented imaging modality. The authors attribute this to MRI being a more recent and less widely accessible technology compared to US and CT. MRI did not appear in the AI-HCC literature before 2019 in their analysis. Despite the smaller number of studies, MRI-based AI models produced some of the most compelling results in the review.
Small HCC detection: Zheng et al. (2021) investigated the feasibility of automatic detection of small HCC (2 cm or less) in cirrhotic livers using a pattern-matching and deep learning (PM-DL) model. The model showed superior performance in both a validation cohort and an external test cohort, suggesting feasibility for automatic detection of small HCCs with high accuracy. Detecting small HCCs is particularly important clinically because earlier detection correlates with better treatment outcomes and eligibility for curative interventions like resection or transplantation.
Multi-class lesion classification: Stollmayer et al. (2021) compared 2D and 3D DenseNet architectures for classifying FLLs on multi-sequence MRI and found both could differentiate focal nodular hyperplasia (FNH), HCC, and metastases with good accuracy when trained on hepatocyte-specific contrast-enhanced sequences. Kim et al. (2020) developed a fully automated deep learning model to detect HCC using hepatobiliary phase MR images, achieving 94% sensitivity, 99% specificity, and an AUC of 0.97 for HCC cases in the test dataset. On external validation, performance remained strong at 87% sensitivity, 93% specificity, and an AUC of 0.90.
3D CNN innovation: Trivizakis et al. (2019) proposed a novel 3D CNN for discriminating between primary and metastatic liver tumors from diffusion-weighted MRI data, demonstrating that volumetric deep learning architectures can bring significant benefit to MRI liver discrimination, especially in size-limited, disease-specific clinical datasets. Hamm et al. (2019) developed a CNN-based deep learning system that classified common hepatic lesions on multi-phasic MRI, demonstrating feasibility for classifying lesions with typical imaging features across six common hepatic lesion types.
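A minimal volumetric CNN sketch in PyTorch shows the shape of such a model; the architecture and input size are placeholders, not the network Trivizakis et al. published:

```python
import torch
import torch.nn as nn

model3d = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
    nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(16, 2),                        # primary vs. metastatic logits
)
volume = torch.randn(2, 1, 32, 64, 64)       # batch of 2 DWI sub-volumes
print(model3d(volume).shape)                 # torch.Size([2, 2])
```

The point of the 3D convolutions is that filters span across slices, so inter-slice context contributes to the learned features, which is what distinguishes volumetric models from slice-by-slice 2D approaches.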
A striking finding of this review is that no study investigated AI applied to PET or X-ray for HCC diagnosis. For X-ray, this is understandable because the modality is used for interventional therapeutic procedures rather than diagnostic imaging in liver cancer. For PET, the absence is more significant. PET combined with CT is already used in other cancers to define undetermined lesions with high sensitivity, but HCC poses a unique challenge: the liver maintains glucose homeostasis, leading to low 18F-FDG uptake in low-grade tumors. At most two-thirds of HCC tumors are 18F-FDG avid, though higher standardized uptake values (SUV) indicate more malignant tumors.
Alternative PET tracers: The authors suggest that using tracers such as 18F-Choline and 11C-Acetate may increase PET accuracy for HCC and open the door to AI applications in PET-based HCC diagnosis. This represents a clear gap in the current literature and a potential avenue for future research that could combine the metabolic information from PET with AI pattern recognition.
Data sharing crisis: The review highlights an urgent need for data sharing in collaborative repositories. Most AI studies were conducted on small, single-institution datasets in high-income countries, with little to no data from lower-middle- and low-income countries. This geographic and economic bias calls into question the credibility of these models for global deployment. The authors point to the European Union's Human Brain Project and EBRAINS as a model for collaborative data infrastructure, arguing that similar efforts are needed for liver cancer imaging data.
Generalizability gap: The mean quality score of 4.17/6 is reasonable, but the "Cohort for Validation" score was the lowest component at only 0.609. The fundamental challenge is that ML algorithms require large training datasets and are subject to the "Garbage In, Garbage Out" (GIGO) principle: model quality is bounded by data quality. The mismatch between the datasets models are trained on and real-world data remains a fundamental barrier. The significant heterogeneity across studies in the gold standards used for HCC diagnosis, patient features, radiologist opinions, contrast agent type and dose, and follow-up imaging protocols made it impossible to aggregate results into a meta-analysis.
Methodological diversity: The most significant limitation of this review is the wide variation across articles in the textural parameters and methods used. Even for similar research questions, it was challenging to aggregate and compare articles. This heterogeneity is why the authors could not perform a meta-analysis and had to rely on qualitative synthesis. The imaging modalities (US, CT, MRI), the AI architectures (CNN, SVM, random forest, DNN, LSTM, and others), and the clinical endpoints all varied substantially.
Quality scoring limitations: The three-item quality score used in this review was practical and highly reproducible, but the authors acknowledge it was rather simplistic. It enabled evaluation of many articles at the expense of a thorough analysis of individual study methods. More granular quality assessment tools like QUADAS-2 or PROBAST, which are specifically designed for diagnostic accuracy and prediction model studies, might have provided deeper insight into methodological strengths and weaknesses. However, these tools would have been difficult to apply consistently across 46 studies spanning 24 years of methodology.
Future directions: The authors envision deep learning algorithms that combine clinical, radiological, pathological, and molecular information to identify and prognosticate HCC patients more effectively. Algorithms trained on post-chemotherapy patients could enable early identification of treatment response and timely switching between therapeutic options, allowing pre-emptive therapy adjustment based on molecular signature and imaging. Multi-center AI studies and pooled imaging data repositories are identified as essential for advancing the field beyond single-institution proof-of-concept studies.
Strengths of this work: To the best of the authors' knowledge, this is the first systematic review focused specifically on AI in radiological HCC detection and characterization, deliberately excluding pathology and prognosis. This narrow focus allowed for detailed analysis of all the scientific techniques studied in this specific domain, providing a comprehensive index that can guide future research planning. The review catalogs every study's scope, AI method, and key findings in a single reference table.