Applications of artificial intelligence in non-small cell lung cancer: from precision diagnosis to personalized treatment


Plain-English Explanations
Pages 1-3
Why NSCLC Demands AI: The Scale of the Problem and This Review's Scope

Lung cancer remains the leading cause of cancer death globally. According to GLOBOCAN 2022, approximately 2.48 million new cases and 1.82 million deaths occurred worldwide. Non-small cell lung cancer (NSCLC) accounts for 80-85% of all lung cancer cases and includes adenocarcinoma, squamous cell carcinoma, and large-cell carcinoma. Despite advances in low-dose computed tomography (LDCT) screening, targeted therapies, and immunotherapy, outcomes remain poor: the overall 5-year relative survival for NSCLC sits at roughly 32%, with stark stage gradients of about 67% for localized disease, 40% for regional, and just 12% for distant metastases.

The core challenge is heterogeneity. Spatially distinct tumor subclones, variable target expression, divergent microenvironmental niches (inflamed, excluded, and desert phenotypes), and shifting selective pressures under therapy create complex response patterns and resistance. Environmental exposures, particularly fine particulate matter (PM2.5), contribute to lung adenocarcinoma among never-smokers and interact with molecular drivers like EGFR. This biological, clinical, and environmental complexity motivates AI methods that can learn structure across scales and generate individualized predictions that update over time.

This review covers AI advances for NSCLC published between January 2023 and August 2025. The authors searched PubMed, MEDLINE, Embase, Web of Science, Scopus, and Cochrane CENTRAL, combining controlled vocabulary and free-text terms for NSCLC and AI across imaging, digital pathology, multi-omics, prognosis, treatment decision support, and drug discovery. Two reviewers independently screened records, prioritizing human NSCLC studies with clinically meaningful endpoints. Because this is a narrative review, no pre-registered protocol or formal risk-of-bias assessment was employed.

Architectural families surveyed: The review organizes contemporary AI systems into four major families: (1) Transformer backbones using self-attention for spatial and contextual integration, (2) temporal and frequency attention via Fourier attention (FA) and wavelet attention (WA) for long-range periodicity and multiscale transients, (3) graph neural networks (GNNs) encoding pathway and topological constraints for multi-omics integration, and (4) generative adversarial networks for denoising, super-resolution, and stain normalization.
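As a minimal illustration of family (1), the self-attention primitive underlying Transformer backbones can be sketched in a few lines of NumPy. This is a single head with illustrative, randomly initialized projection matrices; real backbones stack many heads and layers.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence.

    X          : (n_tokens, d_model) input embeddings (e.g., image patches)
    Wq, Wk, Wv : (d_model, d_k) learned projection matrices
    Returns (n_tokens, d_k) context-mixed representations.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over tokens
    return weights @ V                             # attention-weighted mix

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                        # 6 tokens, d_model = 8
Wq = rng.normal(size=(8, 4))
Wk = rng.normal(size=(8, 4))
Wv = rng.normal(size=(8, 4))
out = self_attention(X, Wq, Wk, Wv)
```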

TL;DR: NSCLC kills 1.82 million people annually with a 5-year survival of only 32%. This narrative review surveys 2023-2025 AI advances across imaging, pathology, multi-omics, prognosis, and treatment, organized around Transformer, frequency-attention, GNN, and GAN architectures.
Pages 3-5
AI-Assisted Imaging Diagnosis: CT, PET-CT, and MRI for Nodule Detection and Risk Stratification

In thoracic radiology, deep learning CNNs trained on chest CT detect and characterize pulmonary nodules, prioritize worklists, and generate quantitative malignancy-risk estimates. The landmark 2019 Google system achieved an AUROC of approximately 0.94 for per-case cancer detection on LDCT, reducing false positives by about 11% and false negatives by about 5% compared to expert radiologists in retrospective testing. In parallel, the Sybil model (published in the Journal of Clinical Oncology, 2023) predicts individual 1-to-6-year lung cancer risk from a single LDCT scan without additional clinical covariates, reporting external AUROCs of approximately 0.75 to 0.81 across multiple validation centers.

Beyond detection: Deep learning also enhances tumor and organ-at-risk segmentation for volumetry, growth-kinetics modeling, and radiotherapy planning. In PET-CT, AI assists mediastinal staging with AUROCs around 0.90 for classifying nodal involvement and detecting occult metastases. In MRI, AI supports brain-metastasis surveillance and target-volume delineation. Denoising and super-resolution techniques further enhance low-dose image quality while preserving quantitative imaging features.

Radiomics vs. deep learning: The paper draws a methodological distinction between hand-crafted radiomics and end-to-end deep learning. Radiomics extracts shape, intensity, and texture features but is sensitive to heterogeneity in slice thickness, reconstruction kernel, and vendor. Pipelines must follow Image Biomarker Standardisation Initiative (IBSI) guidelines and document voxel resampling (e.g., isotropic 1.0 mm), intensity discretization, and pre-filter parameters. Deep learning, by contrast, leverages voxel-level signals and peritumoral context, often outperforming classical models on diverse, harmonized datasets. Both paradigms require external validation, calibration curves, and decision curve analysis (DCA).
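A hedged sketch of the two preprocessing steps named above. The nearest-lower-neighbour resampling and fixed-bin-size discretization here are simplified stand-ins; production pipelines typically use tri-linear or B-spline interpolation and document every parameter per the IBSI checklist.

```python
import numpy as np

def resample_isotropic(volume, spacing, new_spacing=1.0):
    """Resample a CT volume to isotropic voxels via nearest-lower-neighbour
    index mapping (simplified; IBSI-compliant pipelines document the exact
    interpolator used, typically tri-linear or B-spline).
    volume: (z, y, x) array; spacing: (sz, sy, sx) voxel sizes in mm."""
    spacing = np.asarray(spacing, dtype=float)
    new_shape = np.maximum(
        np.round(np.array(volume.shape) * spacing / new_spacing), 1
    ).astype(int)
    idx = [np.minimum((np.arange(n) * new_spacing / s).astype(int), dim - 1)
           for n, s, dim in zip(new_shape, spacing, volume.shape)]
    return volume[np.ix_(*idx)]

def discretize_fbs(roi_hu, bin_width=25.0, min_hu=-1000.0):
    """Fixed-bin-size intensity discretization (IBSI 'FBS'): HU values are
    mapped to 1-based bin codes of constant width."""
    return np.floor((np.asarray(roi_hu) - min_hu) / bin_width).astype(int) + 1
```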

Domain shift: Performance on external cohorts often drops by 5 to 10 AUROC points compared to internal testing, a recurring challenge. Mitigation strategies include multi-domain training across vendors and protocols, physics-informed harmonization (ComBat, kernel-aware resampling), out-of-distribution (OOD) detection with case-level uncertainty estimation, and test-time adaptation. Several nodule management solutions have received regulatory clearance, including Riverain ClearRead CT (FDA 510(k)) and Veye Lung Nodules (EU MDR CE mark), which function as concurrent readers integrated into PACS workflows.
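For intuition, a deliberately simplified location-scale harmonization in the spirit of ComBat might look like the following. Real ComBat adds empirical-Bayes shrinkage of the per-batch parameters and can preserve known biological covariates, so this is only a sketch.

```python
import numpy as np

def location_scale_harmonize(features, batch):
    """Align per-scanner feature distributions to the pooled mean/SD.

    Simplified stand-in for ComBat (no empirical-Bayes shrinkage, no
    covariate preservation).
    features: (n_samples, n_features); batch: (n_samples,) scanner labels.
    """
    features = np.asarray(features, dtype=float)
    batch = np.asarray(batch)
    grand_mean = features.mean(axis=0)
    grand_sd = features.std(axis=0) + 1e-8
    out = np.empty_like(features)
    for b in np.unique(batch):
        m = batch == b
        mu = features[m].mean(axis=0)
        sd = features[m].std(axis=0) + 1e-8
        # standardize within the batch, then rescale to pooled statistics
        out[m] = (features[m] - mu) / sd * grand_sd + grand_mean
    return out
```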

TL;DR: The Google LDCT system achieved AUROC 0.94 for cancer detection. Sybil predicts 1-to-6-year risk with external AUROCs of 0.75-0.81. PET-CT staging models reach AUROC ~0.90. External validation typically shows 5-10 point AUROC drops. Regulatory-cleared tools like ClearRead CT and Veye are already deployed in clinical workflows.
Pages 5-6
AI in Pathological Subtyping and Biomarker Inference from Whole-Slide Images

Digital pathology represents a major frontier for AI in NSCLC. Traditional histologic subtyping and molecular alteration identification require meticulous microscopy plus multiple ancillary assays (IHC for TTF-1 and p40, FISH or sequencing for genomic alterations). Deep learning models trained on labeled H&E whole-slide images (WSIs) now classify NSCLC histopathology, distinguishing adenocarcinoma from squamous carcinoma with accuracy of approximately 0.95 in published benchmarks. These WSI pipelines use multiple instance learning (MIL) with attention pooling, where tile-level embeddings are aggregated via self-attention or gated attention to produce slide-level predictions. Recent variants apply token-based Transformers to tile sequences with two-dimensional positional encodings.
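The attention-pooling MIL aggregation described above can be sketched roughly as follows: a gated-attention head in the style of Ilse et al., with randomly initialized matrices standing in for learned parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_attention_pool(H, V, U, w):
    """Gated attention pooling over tile embeddings from a WSI.

    H : (n_tiles, d) tile-level embeddings
    V, U : (d, h) learned projections; w : (h,) attention vector
    Returns the slide-level embedding and per-tile attention weights.
    """
    gate = np.tanh(H @ V) * sigmoid(H @ U)  # gated transform per tile
    scores = gate @ w
    scores -= scores.max()                  # numerical stability
    a = np.exp(scores)
    a /= a.sum()                            # softmax attention over tiles
    return a @ H, a                         # attention-weighted slide embedding
```

The slide embedding would then feed a small classification head; the attention weights `a` double as a tile-level heatmap.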

EAGLE for EGFR prescreening: A particularly impactful application is molecular virtual staining, predicting biomarkers directly from routine H&E histology. Building on foundational work by Coudray et al. (AUROC ~0.82 for EGFR prediction from H&E), Campanella and colleagues developed EAGLE in 2025. Trained on more than 5,000 digitized biopsies, EAGLE achieved internal and external AUROCs of approximately 0.85 and 0.87, which translated into approximately 43% fewer reflex molecular tests while maintaining sensitivity. In a prospective deployment simulation, EAGLE reached an AUROC of 0.89 on new cases and substantially reduced biomarker reporting time, conserving tissue for comprehensive sequencing and shortening turnaround for initiating targeted therapy.

IHC quantification: AI-based analysis of immunohistochemical slides produces more consistent protein-biomarker quantification than manual scoring. Wu and colleagues (2022) developed a deep learning system to score PD-L1 IHC in NSCLC, achieving a correlation coefficient of about 0.94 with pathologist tumor-proportion scores and improving inter-pathologist consistency. Lightweight Vision Transformer heads operating on nucleus or patch tokens are increasingly used to standardize PD-L1 TPS scoring. Similar approaches are being explored for predicting ALK and ROS1 fusions from morphology, though their rarity demands larger training datasets.

Bias concerns: The authors flag several important limitations. External validation across H&E biomarker studies is inconsistent and subgroup calibration is seldom reported. Many studies do not enforce patient-level splits that separate sites and scanners, increasing leakage risk. Tile-level heatmaps often co-localize with staining or batch signatures rather than tumor morphology. The authors recommend color-space and stain normalization ablations, negative-region controls, concept-based validation (e.g., TCAV for gland formation and keratinization), and slide-level counterfactuals to test causal relevance.

TL;DR: WSI-based subtyping reaches ~0.95 accuracy for adenocarcinoma vs. squamous. EAGLE achieves AUROC 0.85-0.89 for EGFR prescreening and cuts reflex molecular tests by 43%. PD-L1 scoring AI correlates at 0.94 with pathologist scores. Key gaps include inconsistent external validation and leakage from acquisition-related signatures.
Pages 6-8
Multimodal Survival Models and Dynamic Risk Prediction

Prognostication improves substantially when multiple data types are fused. The review describes how AI integrates genomics (mutations, copy number alterations), transcriptomics, methylomics, proteomics, metabolomics, ctDNA, imaging features (radiomics and pathomics), and routine clinical variables into composite risk scores. Model classes include penalized Cox models with learned embeddings, deep survival models (DeepSurv and DeepHit), Transformer-based fusion for variable-length longitudinal sequences, and GNNs that capture pathway and interaction structure. Fusion occurs at the feature level (early), decision level (late), or at intermediate layers via cross-attention.

Effect sizes: Multimodal survival models combining CT radiomics, WSI embeddings, mutational signatures, and clinical covariates typically increase the concordance index (C-index) by 0.05 to 0.12 over clinical baselines and lower the integrated Brier score. External validations show good portability with modest calibration drift that is usually amenable to recalibration. Early-stage (I-II) studies combining genomics and pathomics have identified high-risk subgroups that gain an absolute 5% to 10% overall survival benefit from adjuvant therapy, while low-risk groups may be candidates for therapy de-escalation.
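Since the C-index is the headline metric here, a plain reference implementation of Harrell's concordance index (ties in risk scored as 0.5) may help make the reported 0.05-0.12 gains concrete.

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's C-index: among comparable pairs, the fraction where the
    higher predicted risk belongs to the patient who failed earlier.

    time  : observed survival or censoring times
    event : 1 if the event was observed, 0 if censored
    risk  : predicted risk scores (higher = worse prognosis)
    """
    time, event, risk = map(np.asarray, (time, event, risk))
    conc = comp = 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                   # patient i must have an observed event
        for j in range(n):
            if time[j] > time[i]:      # j outlived i, so the pair is comparable
                comp += 1
                if risk[i] > risk[j]:
                    conc += 1
                elif risk[i] == risk[j]:
                    conc += 0.5        # ties count half, per convention
    return conc / comp
```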

Image-based prognosis: A deep learning model applied to pre-treatment CT predicted overall survival in NSCLC with an AUROC of approximately 0.70, outperforming a clinical-factors-only model at approximately 0.60. Studies of AI-derived pathomic features from H&E slides reported 5-year survival prediction AUCs between 0.64 and 0.85, exceeding stage-alone or grade-alone models. Song et al. (2023) developed a nomogram combining deep learning radiomics from CT with clinicopathologic variables to predict progression-free survival in stage IV EGFR-mutant NSCLC treated with EGFR inhibitors.

Dynamic risk models: Beyond static baselines, landmarking and joint models update risk after each assessment, incorporating ctDNA kinetics, radiographic response, and laboratory trends. These approaches use sequence Transformers consuming tokens indexed by visit, with Fourier layers for long-range periodicity and wavelet blocks for abrupt transients. They outperform static baselines on time-dependent AUC and enable earlier therapy escalation or de-escalation. The authors also note that reinforcement learning policies for adaptive sequencing are promising but require prospective oversight and clearly defined safety constraints.

TL;DR: Multimodal fusion raises the C-index by 0.05-0.12 over clinical baselines. Pathomic features yield 5-year survival AUCs of 0.64-0.85. CT-based prognosis reaches AUROC ~0.70 vs. ~0.60 for clinical factors alone. Early-stage genomics-pathomics models identify subgroups gaining 5-10% absolute survival benefit from adjuvant therapy.
Pages 9-11
AI for Radiotherapy, Chemotherapy, Targeted Therapy, and Immunotherapy Response Prediction

Radiotherapy: AI improves both planning and outcome prediction. Standardized, externally validated segmentation pipelines enable reliable auto-contouring across sites. For response prediction, a radiomics signature predicted 2-year local recurrence after definitive chemoradiation with an AUROC of approximately 0.75, outperforming traditional stage-based estimates. A multi-institutional deep learning model combining dosiomics, radiomics, and clinical data predicted grade 2 or higher radiation pneumonitis with an external validation AUROC of approximately 0.80 and well-calibrated risk estimates. ML models also predict radiation-related cardiac toxicity and pulmonary fibrosis by analyzing pre-treatment scans together with dose-volume parameters.

Chemotherapy: Predicting chemotherapy response remains challenging. A deep learning model using radiomic features distinguished responders from non-responders to first-line chemotherapy on baseline CT with modest accuracy of approximately 70%. Gene expression-based chemotherapy response scores show potential for predicting response to neoadjuvant chemotherapy in resectable NSCLC. AI analysis of blood-based biomarkers like serum N-glycome changes has also been investigated. No AI test for chemotherapy response is yet in routine clinical use.

Targeted therapy (EGFR/ALK): For EGFR-mutant NSCLC patients starting tyrosine kinase inhibitors (TKIs), ML models applied to baseline clinical and imaging data can flag patients at high risk of early progression (within 6-9 months). These predictions are clinically actionable: a patient predicted to respond poorly to TKI monotherapy might receive upfront combination with chemotherapy or a VEGF inhibitor. In ALK-positive NSCLC, AI models are being explored to predict which specific ALK inhibitor a tumor is most likely to respond to based on omics-derived biological differences.

Immunotherapy: Only about 20-30% of unselected NSCLC patients respond to immune checkpoint inhibitors, making predictive biomarkers essential. Deep-IO, a deep learning model by Rakaee et al., predicts checkpoint inhibitor outcomes from pre-treatment H&E slides. In 958 patients with advanced NSCLC, Deep-IO achieved an AUC of 0.66 for objective response in external validation (vs. 0.62 for PD-L1 at 50% or above). Combining the AI score with PD-L1 pushed the AUC to 0.70. A separate radiomic classifier separated hyperprogressors from ordinary progressors with an AUC of about 0.87. Multimodal models integrating radiomics, PD-L1, and ctDNA metrics have outperformed individual predictors for 1-year survival on immunotherapy.

TL;DR: Radiotherapy: recurrence prediction AUROC ~0.75, pneumonitis prediction AUROC ~0.80. Chemotherapy response: ~70% accuracy. Immunotherapy: Deep-IO AUC 0.66 for response (0.70 combined with PD-L1), hyperprogression classifier AUC ~0.87. Only 20-30% of NSCLC patients respond to checkpoint inhibitors, making AI-based selection critical.
Pages 11-13
AI-Accelerated Drug Discovery: From Target Identification to Virtual Screening

AI is reshaping early-phase NSCLC drug development across several fronts. Deep generative approaches learn from large chemical libraries and bioassay data to predict compounds that inhibit cancer targets or overcome resistance. A key application is the discovery of next-generation inhibitors against established oncogenic drivers. Resistance to third-generation EGFR TKIs like osimertinib often emerges through EGFR T790M and C797S mutations. In 2024, Zhou et al. used an ML-aided approach to identify CDDO-Me as a potential fourth-generation EGFR inhibitor active against T790M-mutant NSCLC, confirmed experimentally in xenograft models.

Structure-activity modeling: Zhang et al. (2023) used ML with support vector machines and random forests to design new small molecules targeting EGFR active-site mutations, achieving external accuracy greater than 95% and an R-squared of approximately 0.93 between predicted and experimental activity. GNNs operating on molecular graphs predict compounds that may inhibit novel targets like KRAS G12C and MET exon 14 skipping. AlphaFold-supported pipelines are being integrated for complex structure prediction to inform docking and molecular design.

Drug repurposing: ML algorithms predict drug-target interactions and synergistic combinations by mining historical pharmacologic data. One platform analyzing transcriptomic profiles predicted that an FDA-approved kinase inhibitor (not originally indicated for lung cancer) could have activity in KRAS-mutant NSCLC; subsequent laboratory testing confirmed this and led to a new clinical trial. The open-source tool D3EGFR provides a deep learning-based web server that predicts the sensitivity of EGFR-mutant lung cancers to various TKIs and their combinations.

Limitations of AI drug discovery: Many studies use random or scaffold-split validation, which allows near-duplicate chemotypes to inflate performance; temporal splits and external assays are needed instead. Assay heterogeneity and batch effects can confound activity labels. Generative models may propose molecules that are difficult to synthesize, unstable, or outside the applicability domain, yet synthesizability metrics and medicinal chemistry review are rarely reported. Translation from cell lines to patients is limited by context mismatch, off-target effects, and ADMET constraints. Uncertainty and calibration are seldom quantified.
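A temporal split of the kind recommended above is straightforward to implement. This sketch assumes each compound record carries an assay date; the tuple schema is illustrative only.

```python
from datetime import date

def temporal_split(compounds, cutoff):
    """Time-based train/test split for activity models: everything assayed
    before the cutoff trains the model, later assays test it, so near-
    duplicate chemotypes measured after the cutoff cannot leak backwards.

    compounds: list of (identifier, assay_date, label) tuples (illustrative
    schema); cutoff: datetime.date separating development from test data.
    """
    train = [c for c in compounds if c[1] < cutoff]
    test = [c for c in compounds if c[1] >= cutoff]
    return train, test
```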

TL;DR: ML identified CDDO-Me as a potential 4th-gen EGFR inhibitor (confirmed in xenograft models). SVM/random forest models designed EGFR-targeting molecules with external accuracy above 95% and R-squared ~0.93. GNNs target KRAS G12C and MET exon 14 skipping. Drug repurposing identified a new clinical trial candidate for KRAS-mutant NSCLC.
Pages 13-15
Data Bias, Explainability Gaps, and Systematic Pitfalls Across NSCLC AI Studies

Data bias and quality: AI performance depends heavily on data provenance. Variation in CT acquisition parameters (tube voltage, slice thickness, reconstruction kernel, vendor) and PET protocols introduces batch effects that distort learned representations. Digital pathology pre-analytic factors (fixation, staining, scanner optics) have similar effects. Labels are often noisy due to interreader variability and evolving diagnostic criteria, and class imbalance from rare histologies and uncommon fusions predisposes models to majority-class bias with poorer performance in minority subgroups.

Generalizability: The authors document that AUROC drops of 5-10 points across scanners or sites are common. A systematic review of externally validated imaging models found most algorithms performed worse on external datasets. In digital pathology, a 2024 meta-analysis reported variable accuracy and frequent risk of bias, and a public audit of commercial products showed that only about 40% had peer-reviewed external validation. Even the EAGLE system, with internal/external AUROCs of 0.85 and 0.87 and a prospective AUROC of 0.89, still requires translation into net benefit, explicit decision thresholds, and measurable reductions in time-to-treatment.

Explainability failures: Post-hoc saliency methods (Grad-CAM, Integrated Gradients, Layer-wise Relevance Propagation) can remain stable when labels or weights are randomized, highlight scanner-specific artifacts, or change with minor preprocessing adjustments. The authors propose a minimum evidence package for explainability: sanity checks, faithfulness metrics (deletion/insertion curves, ROAR/IROF), stability across seeds and scanners, concept-level validation (TCAV with statistical testing), and prospective human-factors studies demonstrating improved decisions without automation bias.
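One of the proposed faithfulness metrics, the deletion curve, can be sketched as follows: the most-salient features are removed first and the area under the resulting score curve is reported (lower is more faithful). The `predict` callable and baseline value are placeholders for illustration.

```python
import numpy as np

def deletion_auc(x, saliency, predict, baseline=0.0, steps=10):
    """Deletion-curve faithfulness check: zero out the most-salient features
    first and track how quickly the model's score collapses. A faithful
    saliency map yields a steeply falling curve (low area under it).

    x: (d,) input; saliency: (d,) attributions; predict: callable on (d,).
    Returns (area under the curve over [0, 1], the score trajectory).
    """
    order = np.argsort(-np.asarray(saliency, dtype=float))  # most salient first
    x_cur = np.array(x, dtype=float)
    scores = [predict(x_cur)]
    for chunk in np.array_split(order, steps):
        x_cur[chunk] = baseline            # delete the next-most-salient slice
        scores.append(predict(x_cur))
    scores = np.array(scores)
    auc = float(np.mean((scores[:-1] + scores[1:]) / 2.0))  # trapezoid rule
    return auc, scores
```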

Systematic data pitfalls: The review catalogs five pervasive weaknesses: (1) single-center, small-sample cohorts that inflate internal discrimination but fail under domain shift, (2) subjective and inconsistent annotations, particularly near PD-L1 thresholds at 1% and 50%, (3) systemic case-mix and acquisition biases that models learn and amplify, (4) leakage risks from patient overlap across splits and refitting normalization on combined data, and (5) imbalanced outcomes producing unstable thresholds in under-represented groups. The authors recommend locked pipelines, patient-level splits separating sites and scanners, and joint reporting of discrimination, calibration, and DCA stratified by multiple demographic and clinical variables.
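Recommendation (4) above — patient-level splits that also keep sites separate — can be enforced with a few lines of plain Python. The record schema (`patient_id`, `site`) is an assumption for illustration.

```python
import random
from collections import defaultdict

def patient_level_split(records, test_sites, seed=0):
    """Leakage-aware split: every slide/scan from a given patient, and every
    patient touching a held-out site, ends up entirely in train OR test.

    records: list of dicts with 'patient_id' and 'site' keys (illustrative).
    test_sites: sites (and hence their scanners) reserved for external testing.
    """
    by_patient = defaultdict(list)
    for r in records:
        by_patient[r["patient_id"]].append(r)
    train, test = [], []
    for pid, recs in sorted(by_patient.items()):
        sites = {r["site"] for r in recs}
        # a patient imaged at ANY held-out site goes entirely to test,
        # so no patient (or their site's acquisition signature) straddles splits
        (test if sites & set(test_sites) else train).extend(recs)
    random.Random(seed).shuffle(train)
    return train, test
```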

TL;DR: External AUROC drops of 5-10 points are typical. Only about 40% of commercial pathology AI products have peer-reviewed external validation. Saliency maps can highlight artifacts instead of biology. The authors identify five systematic data pitfalls and recommend locked pipelines, patient-level splits, and joint discrimination/calibration/DCA reporting.
Pages 15-17
The Validation Ladder, Regulatory Frameworks, and Clinical Workflow Integration

The authors propose a staged validation ladder: (1) technical validation using cross-validation and internal/external splits, (2) clinical validation through multicenter retrospective studies with pre-specified analysis plans, and (3) clinical utility demonstrations through prospective evaluations such as DECIDE-AI pilots, stepped-wedge or cluster trials, and randomized controlled trials. At each stage, they recommend reporting discrimination, calibration, DCA, and net reclassification compared with standard care.

Execution standards: External validation must use a locked pipeline with preprocessing learned on development data applied unchanged to external cohorts. Patient-level splits must keep sites and scanners separate. Reports should include AUROC and area under the precision-recall curve with 95% confidence intervals, calibration slope and intercept, smooth calibration curves, expected calibration error, and DCA across pre-specified thresholds. For survival outcomes, the concordance index, time-dependent AUROC, and integrated Brier score are required. Sample size targets include at least 100 events and 100 non-events for binary outcomes and at least 200 events for survival models.
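The calibration quantities listed above (slope, intercept, expected calibration error) can be computed as sketched below. The two-parameter logistic recalibration is fitted with a small Newton-Raphson loop, and equal-width binning for ECE is one of several reasonable choices.

```python
import numpy as np

def calibration_report(y_true, p_hat, n_bins=10):
    """Calibration slope/intercept via logistic recalibration on logit(p_hat),
    plus expected calibration error (ECE) with equal-width bins.
    A well-calibrated model has slope ~1, intercept ~0, and low ECE."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_hat, dtype=float)
    logit = np.log(p / (1.0 - p))
    a, b = 0.0, 1.0                            # intercept, slope
    X = np.column_stack([np.ones_like(logit), logit])
    for _ in range(50):                        # Newton-Raphson for the 2-par fit
        mu = 1.0 / (1.0 + np.exp(-(a + b * logit)))
        W = mu * (1.0 - mu)
        grad = X.T @ (y - mu)
        hess = X.T @ (X * W[:, None])
        step = np.linalg.solve(hess, grad)
        a, b = a + step[0], b + step[1]
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = sum(abs(y[bins == k].mean() - p[bins == k].mean()) * (bins == k).mean()
              for k in range(n_bins) if (bins == k).any())
    return {"intercept": a, "slope": b, "ece": ece}
```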

Regulatory alignment: The paper maps AI/ML software-as-a-medical-device (SaMD) to multiple frameworks: the International Medical Device Regulators Forum risk frameworks, FDA total product lifecycle principles with predetermined change-control plans, the EU Medical Device Regulation and In Vitro Diagnostic Regulation alongside the EU AI Act, and the UK MHRA change programme. Quality management must comply with ISO 13485, risk management per ISO 14971, software lifecycle standards IEC 62304 and IEC 82304-1, current cybersecurity guidance, and Good ML Practice principles.

Clinical integration: Deployment requires seamless integration with RIS, PACS, DICOM Structured Reports and Segmentation, FHIR, and CDS Hooks. Shadow-mode pilots should quantify turnaround time, alert burden (alerts per 100 cases), coverage-accuracy trade-offs, and re-review rates. Service-level agreements must define inference latency targets, failure rates, audit-log completeness, and automatic escalation near decision thresholds (e.g., PD-L1 at 1% and 50%). Lifecycle governance includes drift triggers (AUROC drops beyond a preset margin or ECE exceeding a threshold), recalibration and rollback plans, and monitoring for adverse AI events.
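A drift trigger of the kind described might be sketched as below; the margin and threshold values are illustrative placeholders, not recommendations from the paper.

```python
import numpy as np

def check_drift(window_auroc, window_ece, baseline_auroc,
                auroc_margin=0.05, ece_threshold=0.10):
    """Lifecycle-governance check on a rolling monitoring window.

    window_auroc / window_ece: recent per-batch metrics; baseline_auroc:
    the locked validation value. Thresholds are illustrative only.
    Returns the list of triggered actions (empty if healthy).
    """
    actions = []
    if baseline_auroc - np.mean(window_auroc) > auroc_margin:
        actions.append("freeze-and-recalibrate")   # discrimination drift
    if np.mean(window_ece) > ece_threshold:
        actions.append("recalibrate-or-rollback")  # calibration drift
    return actions
```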

TL;DR: A 3-stage validation ladder progresses from internal/external splits to prospective trials. External cohorts need at least 100 events for binary outcomes and 200 for survival. Regulatory alignment spans FDA, EU MDR, EU AI Act, and UK MHRA. Deployment demands PACS/FHIR integration, shadow-mode piloting, and drift monitoring with rollback plans.
Pages 17-20
Foundation Models, Digital Twins, Causal AI, and Equitable Deployment

Multimodal foundation models: The field is converging on foundation models trained on diverse medical corpora (CT, PET, WSI, clinical notes, structured lab data) paired with cross-modal Transformers learning joint latent spaces. For NSCLC, these typically comprise three components: (1) modality-specific tokenizers (3D patch embedding for CT, tile encoder for WSI, gene-set projector for omics, text encoder for clinical notes), (2) a shared Transformer with 12-48 layers and 12-24 attention heads connected by cross-attention bridges, and (3) personalization layers using adapters and LoRA for site-specific adaptation. Self-supervised and weakly supervised objectives (masked modeling, contrastive pairing) reduce labeling burden and improve cross-institutional transfer.
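The LoRA personalization layers mentioned in component (3) apply a trainable low-rank update on top of a frozen base weight. A minimal NumPy sketch, using the conventional B=0 initialization so the adapted model starts identical to the base:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=8.0):
    """Low-rank adaptation (LoRA): the frozen base weight W is augmented by a
    trainable low-rank update (alpha / r) * B @ A, so site-specific tuning
    touches only r * (d_in + d_out) parameters instead of d_in * d_out.

    x: (n, d_in); W: (d_in, d_out); A: (r, d_in); B: (d_out, r).
    """
    r = A.shape[0]
    delta = (alpha / r) * (B @ A).T  # (d_in, d_out) low-rank update
    return x @ (W + delta)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16))
W = rng.normal(size=(16, 16))        # frozen base weight
A = rng.normal(size=(4, 16))         # rank-4 adapter, Gaussian init
B = np.zeros((16, 4))                # B = 0 => no change before tuning
```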

Temporal modeling and digital twins: The authors advocate a shift from static snapshots to trajectory-based models. Sequence models should represent longitudinal imaging, ctDNA kinetics, laboratory trends, and therapy timelines, updating risk in real time. These can be combined with digital twin constructs that simulate counterfactual responses under alternative regimens, enabling "what-if" exploration during tumor boards. Fourier attention layers capture long-range periodicity, while wavelet blocks capture abrupt transients induced by treatment changes.

Causal and mechanism-aware AI: The review highlights causal forests, uplift modeling, and targeted maximum likelihood estimation as methods to estimate individualized treatment effects under confounding and heterogeneity. Mechanism-aware representation learning encodes biological and physical priors, improving transportability and safety under distribution shift. Pathway-regularized networks and graph causal models link radiomic heterogeneity to hypoxia and immune evasion programs.

Equity and access: Priorities include proactive inclusion of underrepresented populations in training and validation, calibration-within-groups and equalized-odds reporting by sex, ancestry, socioeconomic status, and geography, and remediation through reweighting or domain-specific adapters. For low-resource settings, the authors recommend optimizing for edge inference, minimizing dependencies, supporting offline operation, and providing tiered models matched to local infrastructure. Federated learning with secure aggregation and differential privacy enables multi-institutional training while preserving confidentiality.

Three inflection points: The authors identify three key transitions ahead: (1) multimodal foundation models should replace fragmented single-task pipelines, (2) causal and longitudinal modeling should become standard, combining treatment-effect estimators with dynamic risk models, and (3) AI-derived computable biomarkers should advance along formal qualification pathways from analytical validity through multicenter clinical validity to prospective demonstrations of clinical utility.

TL;DR: Foundation models for NSCLC use 12-48 layer Transformers with cross-modal attention, LoRA adapters for site personalization, and self-supervised pretraining. Digital twins will enable counterfactual treatment simulation. Causal forests and uplift modeling address individualized treatment effects. Equitable deployment requires federated learning, edge inference, and proactive fairness auditing across subgroups.
Citation: Chang L, Li H, Wu W, et al. Applications of artificial intelligence in non-small cell lung cancer: from precision diagnosis to personalized treatment. Open Access, 2025. Available at: PMC12836995. DOI: 10.1186/s12967-025-07591-z. License: CC BY.