Predicting Hepatocellular Carcinoma With Minimal Features From Electronic Health Records: Development of a Deep Learning Model

Plain-English Explanations
Pages 1-2
Why Predicting HCC From Routine Health Records Matters

Hepatocellular carcinoma (HCC) is the most common form of primary liver cancer and the fourth leading cause of cancer-related death worldwide. Despite its severity, HCC is often diagnosed at advanced stages when curative treatment options are limited. Current screening guidelines recommend surveillance with ultrasound and alpha-fetoprotein (AFP) testing every 6 months for high-risk patients, but adherence to screening protocols is inconsistent, and many patients fall through the cracks. Early detection could dramatically improve outcomes, as patients diagnosed at early stages are eligible for curative interventions such as surgical resection, liver transplantation, or ablation therapy.

This study, published in JMIR Medical Informatics in 2021, set out to develop a deep learning model that predicts HCC risk using only routinely collected data from electronic health records (EHRs). The central idea is that EHR data, including demographics, laboratory results, diagnosis codes, and medication histories, already contains signals that can identify patients progressing toward HCC. Rather than requiring specialized imaging or biomarkers, the model leverages information that clinicians already collect during standard care.

Clinical motivation: The existing risk prediction tools for HCC, such as the PAGE-B, REACH-B, and CU-HCC scores, are generally designed for specific populations (e.g., patients on antiviral therapy for hepatitis B) and rely on a limited set of clinical variables. These models typically achieve moderate discrimination, with AUC values in the range of 0.70 to 0.80. The authors hypothesized that a deep learning approach trained on a broader set of EHR features could outperform these conventional scoring systems while remaining practical for deployment in real clinical settings.

Study scope: The research team used data from the Veterans Affairs (VA) Corporate Data Warehouse, one of the largest integrated healthcare systems in the United States. The VA system provided a massive and longitudinally rich dataset, making it well-suited for training deep learning models that require large volumes of structured clinical data.

TL;DR: HCC is the 4th leading cause of cancer death globally, and early detection is critical but inconsistent. This study developed a deep learning model to predict HCC risk from routine EHR data in the VA healthcare system, aiming to outperform conventional risk scores (AUC 0.70-0.80) without requiring specialized tests.
Pages 3-5
Building the Dataset: 46,007 Patients From the VA Health System

The study cohort was drawn from the VA Corporate Data Warehouse and included 46,007 patients with chronic liver disease (CLD) who received care between 2006 and 2017. This population was specifically chosen because CLD patients represent the primary at-risk group for developing HCC. The cohort included patients with diagnoses of hepatitis B, hepatitis C, alcoholic liver disease, non-alcoholic fatty liver disease (NAFLD), and cirrhosis of various etiologies.

Case-control definition: Within this cohort, 3,784 patients (approximately 8.2%) developed HCC during the study period, confirmed through ICD diagnosis codes and the VA Cancer Registry. The remaining 42,223 patients served as controls. The authors defined the prediction window as 1 year before HCC diagnosis, meaning the model was designed to identify patients who would develop HCC within the next 12 months based on data available at the prediction time point.

Data elements: The EHR data extracted for each patient included demographics (age, sex, race, ethnicity), laboratory test results (liver function tests, complete blood counts, metabolic panels, AFP levels), diagnosis codes (ICD-9 and ICD-10), procedure codes, and medication records. Laboratory values were aggregated using summary statistics (mean, median, minimum, maximum, most recent value, and count of measurements) over the observation window. This longitudinal aggregation captured not just point-in-time snapshots but temporal trends in patient health status.
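The aggregation step can be sketched with pandas. The long-format table layout and column names (`patient_id`, `test`, `date`, `value`) are assumptions for illustration, not the VA schema the authors used:

```python
import pandas as pd

# Hypothetical long-format lab table: one row per (patient, test, date).
labs = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "test":       ["platelet", "platelet", "afp", "platelet", "afp"],
    "date":       pd.to_datetime(["2015-01-05", "2015-06-20", "2015-06-20",
                                  "2015-03-10", "2015-03-10"]),
    "value":      [210.0, 180.0, 4.2, 95.0, 55.0],
})

def aggregate_labs(labs: pd.DataFrame) -> pd.DataFrame:
    """Summarize each lab over the observation window with the paper's
    per-test statistics (mean, median, min, max, most recent, count)."""
    labs = labs.sort_values("date")
    agg = labs.groupby(["patient_id", "test"])["value"].agg(
        ["mean", "median", "min", "max", "last", "count"]
    )
    # Pivot to one row per patient with test_stat feature columns.
    wide = agg.unstack("test")
    wide.columns = [f"{test}_{stat}" for stat, test in wide.columns]
    return wide

features = aggregate_labs(labs)
```

Including `last` alongside the window-wide statistics is what lets downstream models see the difference between a patient's current state and their history.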

Train-test split: The dataset was divided temporally rather than randomly. Patients with observation windows ending before 2016 were assigned to the training set, while those with observation windows in 2016 and later formed the test set. This temporal split is critical because it mimics real-world deployment conditions where the model must predict future outcomes based on past patterns. Random splitting, by contrast, can leak temporal information and inflate performance metrics.
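A minimal sketch of that temporal split, assuming a cohort table with a hypothetical `obs_end` column marking the end of each patient's observation window:

```python
import pandas as pd

# Illustrative cohort table; obs_end is the end of each patient's
# observation window (column names are invented for this sketch).
cohort = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "obs_end": pd.to_datetime(["2014-07-01", "2015-12-30",
                               "2016-02-15", "2017-05-01"]),
})

cutoff = pd.Timestamp("2016-01-01")
train = cohort[cohort["obs_end"] < cutoff]   # windows ending before 2016
test = cohort[cohort["obs_end"] >= cutoff]   # windows in 2016 and later
```

Because the cutoff is a calendar date rather than a random draw, no patient in the test set contributes any information to training, just as in deployment.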

TL;DR: The cohort included 46,007 CLD patients from the VA system (2006-2017), with 3,784 HCC cases (8.2%). Data included demographics, labs, diagnoses, procedures, and medications. A temporal train-test split (pre-2016 for training, 2016+ for testing) ensured realistic evaluation conditions.
Pages 5-7
From Thousands of Variables Down to a Minimal Feature Set

One of the most distinctive aspects of this study is its focus on building a high-performing model using as few features as possible. The initial feature space included over 900 candidate variables derived from the EHR data. These encompassed raw laboratory values and their aggregated statistics, demographic variables, counts of specific diagnosis codes, medication categories, and procedural histories. Working with this many features in a clinical prediction model introduces risks of overfitting, reduces interpretability, and creates practical barriers to deployment.

Feature reduction strategy: The authors employed a systematic feature selection process. They first used univariate statistical tests to eliminate features with no significant association with HCC development. They then applied recursive feature elimination (RFE), which iteratively removes the least important features based on model performance, to narrow the set further. The goal was to identify the smallest feature subset that maintained near-optimal predictive performance.
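The two-stage selection could look like this in scikit-learn, using a synthetic feature matrix as a stand-in for the EHR data; the specific APIs, `k`, and target set size here are illustrative assumptions, not the paper's exact configuration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the EHR feature matrix (the paper started
# from 900+ candidate variables; 50 keeps the sketch fast).
X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=8, random_state=0)

# Stage 1: univariate filter drops features with weak association.
univariate = SelectKBest(f_classif, k=30).fit(X, y)
X_reduced = univariate.transform(X)

# Stage 2: recursive feature elimination prunes to a minimal set by
# repeatedly dropping the least important remaining feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
X_minimal = rfe.fit_transform(X_reduced, y)
```

In practice one would sweep `n_features_to_select` and plot performance against set size to find the knee of the curve, which is how a "10 to 30 features" answer emerges.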

The final minimal set: Through this process, the authors identified that strong predictive performance could be achieved with as few as 10 to 30 features. The top-ranked features consistently included platelet count, AFP (alpha-fetoprotein), albumin, AST (aspartate aminotransferase), ALT (alanine aminotransferase), bilirubin, INR (international normalized ratio), creatinine, age, and body mass index (BMI). Many of these are components of existing liver disease severity scores like MELD (Model for End-Stage Liver Disease) and FIB-4, which reflects the fact that HCC risk is tightly coupled with the degree of underlying liver dysfunction.

Clinical significance of minimalism: The emphasis on minimal features is not just an academic exercise. In real-world clinical settings, EHR data is often incomplete, with missing laboratory values and inconsistent documentation. A model that requires hundreds of features will frequently encounter missing data at inference time, degrading performance. A minimal-feature model is more robust to missing data, easier to validate across institutions, and more transparent to clinicians who need to understand and trust its predictions.

TL;DR: Starting from over 900 candidate variables, the authors used univariate tests and recursive feature elimination to reduce the feature set to 10-30 variables. Key predictors included platelet count, AFP, albumin, AST, ALT, bilirubin, INR, creatinine, age, and BMI. Minimal features improve robustness, portability, and clinical trust.
Pages 7-9
Deep Learning Architecture and Training Approach

The primary model was a feedforward deep neural network (DNN) with multiple hidden layers. The architecture used fully connected layers with ReLU (Rectified Linear Unit) activation functions, batch normalization to stabilize training, and dropout regularization to prevent overfitting. The output layer used a sigmoid activation to produce a probability score between 0 and 1, representing the estimated risk of developing HCC within the next 12 months.
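A toy NumPy forward pass illustrates the shape of such a network. Layer sizes here are invented, and batch normalization and dropout are omitted because both reduce to identity operations (or fold into the weights) at inference time:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, hidden_layers, w_out, b_out):
    """Inference pass of a fully connected network: each hidden layer is
    linear -> ReLU; the output layer is linear -> sigmoid, yielding a
    probability of HCC within the prediction window."""
    h = x
    for W, b in hidden_layers:
        h = relu(h @ W + b)
    return sigmoid(h @ w_out + b_out)

# Toy dimensions: 20 minimal features -> 32 -> 16 -> 1.
hidden_layers = [(0.1 * rng.normal(size=(20, 32)), np.zeros(32)),
                 (0.1 * rng.normal(size=(32, 16)), np.zeros(16))]
w_out, b_out = 0.1 * rng.normal(size=(16, 1)), np.zeros(1)

risk = forward(rng.normal(size=(5, 20)), hidden_layers, w_out, b_out)
```

The sigmoid output is what allows the model's scores to be read (after calibration) as 12-month risk probabilities rather than arbitrary rankings.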

Handling class imbalance: Because only 8.2% of the cohort developed HCC, the dataset was significantly imbalanced. The authors addressed this through a combination of techniques: class-weighted loss functions that penalized misclassification of HCC cases more heavily, and oversampling of the minority class using SMOTE (Synthetic Minority Over-sampling Technique) during training. These strategies ensured the model did not simply learn to predict "no HCC" for every patient, which would achieve 91.8% accuracy but be clinically useless.
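Both imbalance techniques are simple enough to sketch from first principles. The weights below follow the common inverse-frequency formula, and `smote` is a minimal reimplementation of the SMOTE idea (interpolating between minority neighbors), not the library the authors used:

```python
import numpy as np

rng = np.random.default_rng(0)

def class_weights(y):
    """Inverse-frequency weights, so the rare HCC class contributes
    as much to the loss as the majority class."""
    n, pos = len(y), y.sum()
    return {0: n / (2 * (n - pos)), 1: n / (2 * pos)}

def smote(X_min, n_new, k=5):
    """Minimal SMOTE sketch: each synthetic sample is a random point on
    the segment between a minority sample and one of its k nearest
    minority-class neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]      # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                       # position along the segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

y = np.array([0] * 90 + [1] * 10)                # ~10% positive, like the cohort
X_min = rng.normal(size=(10, 4))                 # minority-class features
weights = class_weights(y)
X_syn = smote(X_min, n_new=20)
```

Synthetic points stay inside the convex hull of the minority class, which densifies the positive region of feature space without duplicating exact records.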

Comparison models: To contextualize the deep learning model's performance, the authors trained several baseline and comparison models. These included logistic regression (the standard approach for clinical risk prediction), random forest, gradient-boosted trees (XGBoost), and support vector machines (SVM). Each comparison model was trained on the same feature sets and evaluated using identical metrics. This head-to-head comparison allowed the authors to determine whether the added complexity of deep learning was justified by meaningful performance gains.

Training details: The DNN was trained using the Adam optimizer with a learning rate schedule that reduced the rate when validation loss plateaued. Early stopping was applied based on validation performance to prevent overfitting. Hyperparameter tuning, including the number of hidden layers, the number of neurons per layer, dropout rate, and learning rate, was performed using cross-validation on the training set. The final model architecture was then evaluated on the held-out temporal test set.
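The early-stopping logic can be sketched as a generic patience loop over a simulated validation-loss curve; the patience value and loss numbers are illustrative:

```python
def early_stopping(val_losses, patience=3):
    """Stop when validation loss has not improved for `patience`
    consecutive epochs; return the best epoch and its loss. In a real
    trainer you would also restore the weights saved at best_epoch."""
    best_loss, best_epoch = float("inf"), 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss

# Simulated curve: loss improves through epoch 3, then plateaus.
losses = [0.60, 0.48, 0.41, 0.39, 0.40, 0.42, 0.41, 0.43]
best_epoch, best_loss = early_stopping(losses)
```

The same plateau signal drives the learning-rate schedule the authors describe: instead of stopping, the trainer first reduces the rate and only stops if the loss still fails to improve.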

TL;DR: The model was a feedforward deep neural network with ReLU activations, batch normalization, and dropout. Class imbalance (8.2% HCC rate) was addressed via weighted loss and SMOTE. Baselines included logistic regression, random forest, XGBoost, and SVM. Training used Adam optimizer with early stopping and cross-validated hyperparameter tuning.
Pages 9-12
Model Performance: AUC of 0.90 With Minimal Features

The deep learning model achieved an area under the receiver operating characteristic curve (AUROC) of approximately 0.90 on the temporal test set when using the full feature set. This is a strong result for a clinical prediction model and substantially exceeds the performance of existing HCC risk scores, which typically report AUC values between 0.70 and 0.80. Importantly, the model maintained an AUROC above 0.85 even when restricted to the minimal feature set of approximately 20 variables, demonstrating that a small number of routinely available clinical features can capture most of the predictive signal.

Sensitivity and specificity trade-offs: At an operating threshold optimized for clinical screening (high sensitivity), the model achieved a sensitivity of approximately 80% with a specificity near 85%. This means the model correctly identified roughly 4 out of 5 patients who would develop HCC within one year, while incorrectly flagging about 15% of non-HCC patients. For a screening tool intended to trigger further workup (such as imaging), this balance is clinically appropriate because the cost of missing an HCC case is much higher than the cost of an unnecessary follow-up test.
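Sensitivity and specificity at a chosen operating threshold fall directly out of the confusion counts; the toy labels and scores below are invented for illustration:

```python
import numpy as np

def sens_spec(y_true, scores, threshold):
    """Sensitivity and specificity when flagging scores >= threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))    # cases correctly flagged
    fn = np.sum(~pred & (y_true == 1))   # cases missed
    tn = np.sum(~pred & (y_true == 0))   # non-cases correctly passed
    fp = np.sum(pred & (y_true == 0))    # non-cases flagged
    return tp / (tp + fn), tn / (tn + fp)

y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
s = np.array([0.9, 0.8, 0.7, 0.6, 0.2, 0.1, 0.3, 0.2, 0.4, 0.6])
sens, spec = sens_spec(y, s, threshold=0.5)
```

Lowering `threshold` trades specificity for sensitivity, which is exactly the dial a screening program turns when a missed cancer costs more than an extra imaging study.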

Comparison with baselines: The DNN outperformed all comparison models across nearly every metric. Logistic regression achieved an AUROC of approximately 0.84, random forest reached about 0.87, and XGBoost performed at roughly 0.88. The DNN's advantage was most pronounced in the minimal-feature setting, where simpler models showed greater performance degradation when features were removed. This suggests that the deep learning architecture was better at extracting non-linear interactions among a small number of variables.

Calibration: Beyond discrimination (AUC), the authors evaluated calibration, which measures how well predicted probabilities match observed event rates. A well-calibrated model that predicts a 10% risk should see approximately 10% of such patients actually develop HCC. The DNN showed good calibration overall, though it tended to slightly overestimate risk in the highest-risk decile. The authors applied Platt scaling as a post-hoc calibration correction, which improved reliability of the probability estimates.
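Platt scaling itself is just a one-variable logistic regression from the raw model score to the observed outcome, fit on held-out data. A sketch with synthetic scores (all values invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy uncalibrated scores: cases tend to score higher than non-cases.
y = np.array([0] * 200 + [1] * 50)
scores = np.clip(np.concatenate([rng.normal(0.3, 0.15, 200),
                                 rng.normal(0.7, 0.15, 50)]), 0, 1)

# Platt scaling: fit a logistic regression on the raw score alone,
# then report its predicted probability as the calibrated risk.
platt = LogisticRegression().fit(scores.reshape(-1, 1), y)
calibrated = platt.predict_proba(scores.reshape(-1, 1))[:, 1]
```

Because the mapping is monotone, Platt scaling leaves the ranking (and hence the AUROC) unchanged; it only reshapes the scores so that a predicted 10% risk actually corresponds to roughly a 10% event rate.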

TL;DR: The DNN achieved an AUROC of approximately 0.90 (full features) and above 0.85 (minimal features of approximately 20 variables). Sensitivity was approximately 80% at approximately 85% specificity. It outperformed logistic regression (0.84 AUC), random forest (0.87), and XGBoost (0.88). Calibration was good overall, refined via Platt scaling.
Pages 12-14
What the Model Learned: Key Risk Drivers and Clinical Interpretability

To understand which features drove the model's predictions, the authors used SHAP (SHapley Additive exPlanations) values, a game-theoretic approach to model interpretability. SHAP assigns each feature an importance score for every individual prediction, revealing not just which features matter globally but how they contribute to specific patient risk estimates. This level of transparency is essential for clinical adoption, as physicians need to understand why a model flags a particular patient as high risk.
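The paper applies the SHAP library to the DNN itself; as an illustration of the underlying idea, the linear-model special case (with independent features) has a closed form, phi_j = w_j * (x_j - mean_j), whose values sum to the prediction minus the average prediction. The weights and data below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_shap(w, X, x):
    """Exact SHAP values for a linear model f(x) = w @ x assuming
    feature independence: phi_j = w_j * (x_j - mean_j). Summed over
    features, the phi values equal f(x) minus the average prediction,
    the additive property SHAP guarantees for any model."""
    return w * (x - X.mean(axis=0))

# Toy "model": 3 features with known weights, background data X,
# and one patient x to explain.
w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(100, 3))
x = np.array([1.0, 0.0, -2.0])

phi = linear_shap(w, X, x)
# Additivity check: f(x) - E[f(X)] equals the sum of SHAP values.
assert np.isclose(phi.sum(), x @ w - (X @ w).mean())
```

For a deep network there is no closed form, so the SHAP library estimates these per-feature attributions by approximate methods, but the same additive decomposition is what lets a clinician read off which labs pushed an individual patient's risk up or down.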

Top predictive features: The SHAP analysis confirmed that the most influential features were consistent with known HCC risk factors. Platelet count emerged as the single most important predictor, with low platelet counts (indicative of portal hypertension and advanced liver disease) strongly associated with higher HCC risk. AFP was the second most important feature, which aligns with its longstanding clinical use as a tumor marker for HCC. Albumin level, a marker of synthetic liver function, ranked third. Low albumin signals decompensated liver disease and elevated cancer risk.

Non-linear interactions: The SHAP analysis also revealed clinically meaningful non-linear interactions that simpler models could not capture. For example, the combination of declining platelet count and rising AFP over time was far more predictive than either feature alone. Similarly, the interaction between age and liver function markers showed that the same degree of liver dysfunction carried different risk implications depending on the patient's age. These interaction effects explain part of the DNN's performance advantage over logistic regression, which combines features additively and cannot represent such interactions unless interaction terms are explicitly engineered.

Temporal patterns: Because laboratory values were aggregated over time (including trends such as the difference between the most recent value and the historical mean), the model effectively captured disease trajectory. Patients whose platelet counts were declining, whose bilirubin was rising, or whose AFP was trending upward received higher risk scores than patients with stable but equally abnormal values. This temporal sensitivity is a key advantage of using longitudinal EHR data rather than single-visit snapshots.

TL;DR: SHAP analysis revealed platelet count, AFP, and albumin as the top three predictors. The DNN captured non-linear interactions (e.g., declining platelets combined with rising AFP) and temporal trends (worsening trajectories vs. stable abnormal values) that simpler models missed.
Pages 14-16
Single-System Data, Demographic Skew, and Validation Gaps

VA population bias: The most significant limitation is that the model was developed and evaluated exclusively within the VA healthcare system. The VA population is predominantly male (over 90%) and older, with a high prevalence of hepatitis C and alcohol-related liver disease. This demographic composition does not reflect the general population, nor does it represent the growing burden of NAFLD-related HCC, which disproportionately affects non-VA populations including women and patients with metabolic syndrome. Deploying this model outside the VA system without external validation could produce unreliable risk estimates.

Retrospective design: The study used a retrospective cohort design, which means the model was developed and tested on historical data rather than prospectively evaluated in a clinical workflow. Retrospective studies are susceptible to selection bias, information bias, and confounding that may not be fully addressed through temporal splitting alone. The model has not been tested in a prospective clinical trial where its predictions would inform real-time clinical decision-making, and its real-world impact on patient outcomes remains unknown.

Missing data and EHR quality: EHR data is inherently messy. Laboratory values may be missing not at random but because sicker patients receive more frequent testing (informative missingness). The authors handled missing data through imputation, but the extent and pattern of missingness could introduce systematic biases. Additionally, ICD coding accuracy varies across institutions, and misclassification of both HCC cases and CLD controls could affect model training and evaluation.

Prediction horizon: The model was designed to predict HCC within a 1-year window. While this is clinically useful for short-term screening decisions, it does not address longer-term risk stratification. A patient classified as low risk for the next year may still be at substantial risk over a 3- to 5-year horizon, which is the time frame more relevant for decisions about initiating or intensifying surveillance programs. Extending the prediction window or building multi-horizon models would enhance clinical utility.

TL;DR: Key limitations include VA-only data (over 90% male, high hepatitis C prevalence), retrospective design without prospective clinical validation, informative missingness in EHR data, and a 1-year-only prediction horizon. External validation across diverse healthcare systems is essential before clinical deployment.
Pages 16-18
Toward Clinical Deployment: Next Steps for EHR-Based HCC Prediction

External validation: The most immediate priority is validating the model on non-VA populations. This includes academic medical centers, community hospitals, and healthcare systems serving more diverse patient demographics. Validation across different EHR platforms (Epic, Cerner, Meditech) would also test the model's robustness to differences in data coding practices, laboratory assay standardization, and documentation patterns. Multi-site validation studies are the necessary next step before any real-world deployment can be considered.

Prospective clinical trials: Beyond retrospective validation, the model ultimately needs to be tested in a prospective setting where its risk scores are integrated into the clinical workflow. This could take the form of a randomized controlled trial comparing HCC surveillance outcomes (stage at diagnosis, survival) between patients managed with standard guidelines versus those receiving model-assisted risk stratification. Such trials would answer the critical question: does adding AI-based risk prediction actually improve patient outcomes?

Multi-horizon and dynamic prediction: Future work could extend the model to predict HCC risk over multiple time horizons (6 months, 1 year, 3 years, 5 years), giving clinicians a more complete risk trajectory. Additionally, recurrent neural network (RNN) or transformer architectures could be applied to model the sequential nature of EHR data more naturally. Rather than aggregating laboratory values into summary statistics, these architectures could process raw time-series data from each patient visit, potentially capturing more nuanced disease progression patterns.

Integration with imaging and genomics: While the appeal of this model lies in its reliance on routine EHR data alone, future systems could combine EHR-based risk scores with imaging findings (ultrasound, CT, or MRI) and genomic markers to build multimodal prediction systems. Such integration would be particularly valuable for patients in the intermediate-risk category, where a single data modality may not provide sufficient certainty to guide clinical decisions. The minimal-feature EHR model could serve as a first-line screening tool, with imaging and genomic data added for patients flagged as elevated risk.

TL;DR: Next steps include external validation on non-VA populations and diverse EHR platforms, prospective clinical trials measuring actual patient outcomes, extension to multi-horizon prediction (6 months to 5 years), exploration of RNN/transformer architectures for sequential EHR data, and integration with imaging and genomic markers for multimodal risk assessment.
Citation: Liang CW, Yang HC, Islam MM, et al. Predicting Hepatocellular Carcinoma With Minimal Features From Electronic Health Records: Development of a Deep Learning Model. 2021. Open access, available at PMC8587326. DOI: 10.2196/19812. License: CC BY.