Deep Learning Approaches for Automated Prediction of Treatment Response in Non-Small-Cell Lung Cancer


Plain-English Explanations
Pages 1-3
Why Automating NSCLC Treatment Response Assessment Matters

Non-small-cell lung cancer (NSCLC) accounts for 85-90% of all lung cancers and remains one of the leading causes of cancer death worldwide for both men and women. In 2020, lung cancer was responsible for 22% of the 8,164,372 cancer deaths globally, far outpacing colorectal (11.45%), liver (10.17%), gastric (9.42%), and breast cancer (8.39%). Despite advances in chemotherapy, radiotherapy, and surgery, five-year overall survival for lung cancer patients remains poor. This bleak prognosis underscores the urgency of monitoring treatment effectiveness quickly and accurately so that ineffective regimens can be abandoned before causing unnecessary toxicity.

Two imaging modalities dominate treatment monitoring: Computed tomography (CT) measures morphological (structural) changes in tumor size, while positron emission tomography (PET) captures metabolic changes by tracking glucose uptake via fluorodeoxyglucose (FDG) tracers. Both approaches are essential, but manually evaluating these scans is time-consuming and subject to inter-reader variability. A single CT evaluation by an experienced radiologist can take 35-40 minutes, including 15-20 minutes to identify target lesions and another 10-15 minutes to measure them.

Deep learning as a transformative tool: This systematic review examines how deep learning (DL), a subset of artificial intelligence that uses deep neural networks to process images, can automate and improve the assessment of treatment response in NSCLC. The authors surveyed the literature on DL-based methods that analyze CT and PET images for classification, segmentation, and prediction of tumor response. The review spans 47 pages and encompasses research on architectures like CNNs, U-Net, GANs, and VAEs, along with clinical evaluation criteria such as RECIST and PERCIST.

Radiomics as the bridge: The field of radiomics combines medical imaging with AI tools to extract quantitative features from images, including intensity, shape, size, volume, and tissue structure. This approach is particularly valuable for cancer treatment monitoring because it can reveal patterns invisible to the human eye. The review positions DL within the broader radiomics framework and examines how these methods can be integrated into the cancer care workflow, specifically during the follow-up phase after treatment.

TL;DR: Lung cancer caused 22% of the 8.16 million global cancer deaths in 2020, and NSCLC accounts for 85-90% of lung cancers. Manual CT evaluation takes 35-40 minutes per patient. This 47-page systematic review surveys deep learning methods for automating treatment response assessment using CT and PET imaging, covering architectures like CNN, U-Net, and GAN across classification, segmentation, and prediction tasks.
Pages 4-8
RECIST, EORTC, and PERCIST: The Criteria That Define Treatment Response

Before any deep learning model can be trained to evaluate treatment response, it must align with the clinical criteria that oncologists already use. The review details two major families of evaluation criteria. For CT-based morphological assessment, the Response Evaluation Criteria in Solid Tumors (RECIST) has been the gold standard since its introduction in 2000, with RECIST 1.1 becoming the preferred version after 2009. RECIST classifies tumor response into four categories: complete response (CR, disappearance of all target lesions), partial response (PR, at least 30% decrease in the sum of target lesion diameters), progressive disease (PD, at least 20% increase plus an absolute increase of at least 5 mm), and stable disease (SD, neither sufficient shrinkage nor growth).
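The four RECIST categories reduce to simple threshold logic. A minimal sketch of that logic (the function name and the explicit nadir argument are illustrative; real RECIST reads also account for new lesions and non-target disease, which are omitted here):

```python
def recist_response(baseline_sum_mm, current_sum_mm, nadir_sum_mm,
                    all_lesions_gone=False):
    """Classify response using the RECIST 1.1 thresholds described above.

    Sums are of target-lesion longest diameters (mm). PD is judged
    against the nadir (smallest sum on study); PR against baseline.
    """
    if all_lesions_gone:
        return "CR"
    growth = current_sum_mm - nadir_sum_mm
    # PD: at least 20% increase over nadir AND at least 5 mm absolute increase
    if growth >= 0.20 * nadir_sum_mm and growth >= 5.0:
        return "PD"
    # PR: at least 30% decrease from baseline
    if (baseline_sum_mm - current_sum_mm) >= 0.30 * baseline_sum_mm:
        return "PR"
    return "SD"
```

For example, a baseline sum of 100 mm shrinking to 65 mm qualifies as PR, while 90 mm against an 85 mm nadir is only SD (a 5.9% increase, below the 20% threshold).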

PET-based metabolic criteria: For functional imaging, two criteria dominate. The European Organization for Research and Treatment of Cancer (EORTC) criteria, developed in 1999, evaluate specific lesion regions using SUV (Standardized Uptake Value) adjusted for body surface area. PERCIST (PET Response Criteria in Solid Tumors), introduced in 2009, is generally regarded as more straightforward and uses SULpeak (peak standardized uptake value corrected for lean body mass) measurements with a 12 mm ROI. EORTC requires a 25% reduction in SUVmax for partial metabolic response, while PERCIST requires a 30% reduction in SULpeak plus an absolute drop of 0.8 SULpeak units.
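The difference between the two metabolic partial-response thresholds is easy to see in code. A hedged sketch covering only the partial-response branch (CR/PD rules and baseline-qualification checks are omitted; the function name is illustrative):

```python
def partial_metabolic_response(baseline_uptake, current_uptake,
                               criteria="PERCIST"):
    """Check the partial-metabolic-response thresholds described above.

    PERCIST: SULpeak must drop >= 30% AND by >= 0.8 absolute units.
    EORTC:   SUVmax must drop >= 25%.
    """
    drop = baseline_uptake - current_uptake
    if criteria == "PERCIST":
        return drop >= 0.30 * baseline_uptake and drop >= 0.8
    if criteria == "EORTC":
        return drop >= 0.25 * baseline_uptake
    raise ValueError(f"unknown criteria: {criteria}")
```

Note the effect of the absolute PERCIST floor: a lesion dropping from SULpeak 2.0 to 1.3 clears the 30% relative threshold but fails the 0.8-unit requirement.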

The dimensional measurement debate: The review highlights an ongoing discussion about whether 1D, 2D, or 3D measurements are most effective. RECIST relies on 1D (longest diameter), while the earlier WHO method used 2D (area). Multiple studies have compared these approaches, including volumetric (3D) assessments. So far, it has not been conclusively demonstrated that area and volume measurements evaluate treatment response better than unidirectional measurement of the target lesion, though 2D and 3D approaches may enhance precision in monitoring changes over time. The modified RECIST (mRECIST) guidelines have expanded the framework to accommodate area and volume changes.

These clinical criteria serve a dual purpose in the DL pipeline: they provide the labeling framework for training data (assigning response categories to image sets) and the benchmark against which DL model predictions are validated. The ability of DL systems to replicate specialist judgment according to RECIST or PERCIST classifications is a key measure of their clinical utility.

TL;DR: RECIST 1.1 defines treatment response as CR (lesion disappearance), PR (30%+ diameter decrease), PD (20%+ increase with 5 mm minimum), or SD (no qualifying change). PERCIST uses a 30% SULpeak reduction threshold for metabolic response versus EORTC's 25% SUVmax threshold. These criteria both label training data and validate DL model outputs.
Pages 9-18
Neural Network Architectures and Evaluation Metrics for Medical Imaging

The review provides a thorough technical primer on the deep learning architectures relevant to NSCLC treatment monitoring. Convolutional Neural Networks (CNNs), introduced by LeCun et al. in 1989, remain dominant. They consist of three primary layer types: convolutional layers that compute weighted sums of neighboring pixels using convolution kernels, pooling layers (max pooling or mean pooling) that reduce spatial dimensions while preserving key information, and fully connected layers that perform the final classification. CNN families tested for treatment response prediction include AlexNets, VGG Nets, ResNets, and DenseNets.
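The two layer types doing the spatial work are small enough to sketch directly. A minimal numpy illustration of what a convolutional layer (weighted sum over a pixel neighborhood) and a pooling layer (dimension reduction keeping the strongest activation) actually compute; real CNN layers add channels, strides, padding, and learned kernels:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution: each output pixel is a weighted sum of
    its neighborhood, with the kernel entries as the weights."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the strongest activation in
    each size x size window, shrinking spatial resolution by `size`."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))
```

A 3x3 all-ones kernel over a 4x4 all-ones image yields a 2x2 map of 9s (each output sums its nine neighbors), and 2x2 max pooling then collapses a 4x4 map to 2x2.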

U-Net is the most widely adopted architecture for segmentation tasks in this domain. Originally designed to handle limited annotated medical data, U-Net uses a contraction path (convolution and pooling for contextual encoding) and an expansion path (deconvolution layers for decoding) connected by skip connections. Variants like MSDS-UNet, SquExUNet, GUNET3++, SegChaNet, RRc-Unet, and RAD-UNet appear throughout the reviewed literature, each modifying the base architecture to improve segmentation accuracy for lung lesions.

Other architectures covered: Recurrent Neural Networks (RNNs) for sequential and time-series data processing, Recursive Neural Networks (RvNNs) for hierarchical structure prediction, Variational Autoencoders (VAEs) for generating new data through encoder-decoder learning, and Generative Adversarial Networks (GANs) for producing realistic synthetic images using generator-discriminator pairs. The review notes that deep learning methods for treatment monitoring primarily use CNN, VAE, and GAN structures, including hybrid combinations.

Evaluation metrics: The most commonly reported metrics across the reviewed studies are Accuracy (ACC), Dice Similarity Coefficient (DC), Sensitivity (SEN), Specificity (SPE), Area Under the ROC Curve (AUC), and Average Hausdorff Distance (AHD). The Dice Coefficient, which quantifies the overlap between automated and manual segmentations, is particularly important for assessing segmentation quality. For CT-based methods, Accuracy is more frequently reported (ranging from 72% to 98.5%), while PET-based methods favor the Dice Coefficient (ranging from 78% to 93%).
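Since the Dice Coefficient is the workhorse metric for segmentation quality here, a one-function sketch of it on binary masks (the empty-mask convention of returning 1.0 is a common choice, not mandated by the review):

```python
import numpy as np

def dice(pred, target):
    """Dice similarity coefficient between two binary masks:
    2|A intersect B| / (|A| + |B|). Returns 1.0 if both masks are empty."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, target).sum() / total
```

A prediction covering half of a two-pixel lesion plus nothing spurious scores 2/3, which makes the metric's penalty for both misses and false positives concrete.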

Activation functions and preprocessing: The review details the role of activation functions (Sigmoid, Tanh, ReLU, ELU) and their tradeoffs regarding vanishing gradients and computational efficiency. Image preprocessing filters used in reviewed DL methods include Wiener filters, Gaussian filters, and Wavelet filters for noise removal and blurring, as well as gradient-based filters (Roberts, Prewitt, Sobel, Isotropic) for edge detection. Notably, only a few reviewed studies on treatment monitoring use image-filtering techniques, though diagnosis-focused research has shown that specific filters can improve model precision.
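To make the gradient-based filter family concrete, a minimal Sobel edge-detection sketch in plain numpy (a sketch only; in practice these filters are applied via optimized library routines rather than explicit loops):

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude from the two 3x3 Sobel kernels ('valid' region).

    kx responds to horizontal intensity changes, ky (its transpose) to
    vertical ones; the hypotenuse combines them into an edge map.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    return np.hypot(gx, gy)
```

On a flat image the response is zero everywhere; a sharp vertical intensity step produces a strong response along the boundary, which is exactly the behavior that makes such filters useful for delineating lesion borders.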

TL;DR: CNNs (especially U-Net and its variants) dominate NSCLC treatment monitoring. CT-based methods report Accuracy from 72% to 98.5%, while PET-based methods report Dice Coefficients from 78% to 93%. Key architectures include ResNet, VGG-16, Mask R-CNN, GANs, and VAEs. Preprocessing filters (Wiener, Gaussian, Wavelet) are underutilized in treatment assessment despite proven benefits in diagnosis.
Pages 19-26
Deep Learning for Morphological Treatment Assessment via CT Imaging

Direct classification without segmentation: Chang et al. developed a DL method that classifies treatment responses by directly comparing pre- and post-chemotherapy CT images, bypassing lesion isolation entirely. Using a Multiple Instance Learning (MIL) approach, their system assigns a single class label ("response" for CR/PR or "non-response" for PD/SD) to a series of images rather than labeling each slice individually. Various pretrained backbone CNNs (AlexNets, VGG Nets, ResNets, DenseNets) were tested as feature extractors, with an attention mechanism pooling process for computational efficiency. The model was trained using datasets from two hospitals with ImageNet-pretrained backbones, and its predictions showed high similarity to RECIST categorizations by radiologists.
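The attention-pooling step in such MIL setups can be sketched compactly: each slice-level CNN feature vector gets a learned relevance score, and the softmax-weighted average becomes the single bag-level representation. A numpy sketch under stated assumptions (random parameters stand in for learned ones; the shapes and function name are illustrative, not Chang et al.'s exact design):

```python
import numpy as np

def attention_mil_pool(features, V, w):
    """Attention-based MIL pooling: score each instance embedding with
    w^T tanh(V h_k), softmax the scores into weights, and return the
    weighted average as one bag-level feature vector."""
    scores = np.tanh(features @ V.T) @ w          # one score per instance
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax attention
    return weights @ features, weights            # bag embedding, weights

rng = np.random.default_rng(0)
features = rng.normal(size=(12, 64))   # e.g. 12 CT slices, 64-dim features
V = rng.normal(size=(32, 64))          # attention hidden projection
w = rng.normal(size=32)                # attention scoring vector
bag, attn = attention_mil_pool(features, V, w)
```

The attention weights also offer a modest interpretability benefit: slices with high weight are the ones driving the "response"/"non-response" call.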

Text-based RECIST estimation: Arbour et al. took a different approach by building a DL model that estimates RECIST responses from text in clinical radiology reports, rather than directly from scan images. Their deep neural network, constructed with encoding, interaction, and two fully connected layers using hyperbolic tangent activation followed by a softmax output layer, was trained on gold-standard RECIST reports from qualified radiologists. The model targeted patients with advanced NSCLC treated with PD-1/PD-L1 blockade. A key finding was that performance remained consistent regardless of reporting style across institutions, reinforcing the standardized nature of RECIST criteria.

Segmentation-focused approaches: Multiple studies addressed automated lesion segmentation for CT-based treatment monitoring. Tang et al. developed a semiautomatic 1D method using a cascaded CNN combining a Spatial Transformer Network (STN) for lesion normalization with a Stacked Hourglass Network (SHN) for estimating RECIST endpoints. Xie et al. created RECIST-Net to detect four extreme points and the central point of lesions. Woo et al. used three cascaded CNNs to classify whether target lesion size exceeded 32 pixels, achieving excellent agreement with radiologist measurements. Jiang et al. applied a multiscale CNN for volumetric segmentation of NSCLC tumors in immunotherapy patients, while Chen et al. developed a novel encoder-decoder CNN architecture for precise tumor segmentation.

Automated volumetric measurement: Kidd et al. advanced automation further with a completely automated method for evaluating chemotherapy response in malignant pleural mesothelioma (MPM). Their 2D CNN segmented each axial slice and classified the response according to mRECIST criteria, with manual annotations used only for training and validation. The results agreed with specialist assessments to an acceptable degree, though the authors characterized the work as a proof of principle. Across all reviewed CT segmentation studies, U-Net was the predominant architecture. Reported accuracy scores ranged from 72% (multiscale CNN, Jiang et al.) to 98.5% (GAN-based, Gonzalez-Crespo). Dice Coefficients ranged from 62% (standard U-Net) to 98.86% (ResNet50 combined with U-Net). Dataset sizes varied from 493 to 57,793 training images, with image resolutions typically at 512 x 512 pixels.

Computational efficiency: The review includes timing benchmarks from the authors' own experiments. Training a U-Net on 613 brain CT images (394 x 394 pixels) took 5 hours, while 800 lung CT images (512 x 512 pixels) required 12 hours. On an Intel Xeon workstation (64 cores, 3.0 GHz), prediction took 3.3 seconds, compared with 4.7 seconds on a lower-spec Intel Core i5 system (2 cores, 2.6 GHz). The lower-performance system required nearly twice the training time (2 hours 48 minutes vs. 1 hour 21 minutes). In contrast, a radiologist's manual evaluation takes 35-40 minutes per case.

TL;DR: CT-based DL methods span direct classification (MIL with pretrained CNNs), text-based RECIST estimation (DNN on radiology reports), and segmentation approaches (cascaded CNNs, U-Net variants). Accuracy ranges from 72% to 98.5%, Dice Coefficients from 62% to 98.86%. U-Net training on 800 lung CTs takes approximately 12 hours, but predictions complete in seconds versus 35-40 minutes for manual radiologist evaluation.
Pages 27-30
Deep Learning for Metabolic Treatment Assessment via PET and PET/CT

PET's unique role: PET imaging uses fluorodeoxyglucose (FDG) to characterize tumor lesions within a metabolic frame of reference. Unlike CT, PET can reveal heterogeneous texture and variable contours of nodules that may not be apparent in structural imaging. PET is also inherently prognostic: because metabolic changes tend to appear before structural ones, it can evaluate treatment responses in solid cancers faster and more effectively than anatomical assessment alone. However, limited availability of PET equipment in oncology centers and specific usage conditions result in its less frequent clinical use.

Detection methods: Zhang et al. introduced a multiscale region-based CNN using three Mask R-CNN models trained at different scales on PET images, with weighted voting to reduce false positives. Chen et al. proposed a multimodal attention-guided 3D CNN for combined PET/CT images, where attention mechanisms weight inputs by task relevance, achieving improved detection of cases that might go unnoticed through visual assessment alone.

Segmentation of metabolic activity: Fruh et al. developed a weakly supervised segmentation method for tumor lesions from preprocessed PET/CT images using a VGG-16-based CNN (16 layers, 13 convolutional plus 3 fully connected). Class activation maps (CAM, GradCAM, GradCAM++, ScoreCAM) identified tumor regions relevant to the network's decision, followed by adaptive image segmentation. Performance was measured using 3D Dice score, metabolic tumor volume (MTV), and total lesion glycolysis (TLG). Protonotarios et al. addressed limited training data with a few-shot learning scheme based on U-Net, incorporating user feedback for continuous improvement in PET/CT lesion segmentation.
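The plain-CAM step underlying such weakly supervised localization is compact enough to sketch: the final convolutional feature maps are combined using the classifier weights for the target class, giving a coarse heatmap of the evidence. A numpy sketch of that combination only (GradCAM and its successors derive the weights from gradients instead; shapes here are illustrative):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Plain CAM: weight each final-layer feature map (channels, H, W)
    by the classifier weight for the target class, sum over channels,
    and keep only positive evidence (ReLU)."""
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # -> (H, W)
    return np.maximum(cam, 0.0)
```

Thresholding the resulting heatmap is what turns a classification network into a rough tumor localizer, which the adaptive segmentation step then refines.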

Treatment response prediction from PET: Amyar et al. proposed a multiscale DL framework using a U-Net backbone for 3D medical images that simultaneously performs tumor segmentation, image reconstruction, and treatment response prediction for lung and esophageal cancer. The architecture combined encoding and decoding paths with skip connections, followed by a multilayer perceptron (MLP) for classification. Li et al. developed a DL model predicting PD-L1 expression in NSCLC patients undergoing immunotherapy using PET/CT data, employing ResNet-101 (101 layers deep, pretrained on ImageNet) for feature extraction combined with PyRadiomics and logistic regression. Across PET-based segmentation studies, Dice Coefficients ranged from 78% (U-Net, Xu et al.) to 93% (U-Net, Ghimire et al.), with most studies employing 3D volumetric approaches and image resolutions from 128 x 128 to 512 x 512 pixels.

AI denoising considerations: Weyts et al. investigated whether AI denoising of PET images affects lesion quantification during treatment evaluations under EORTC and PERCIST guidelines. The study concluded that AI-denoised PET images can be safely incorporated into clinical workflows and that treatment responses remain satisfactory according to established criteria. This is important because denoising may slightly modify image information in unknown ways, but the changes do not negatively affect treatment evaluation effectiveness.

TL;DR: PET-based DL methods achieve Dice Coefficients of 78-93% for tumor segmentation. Key approaches include Mask R-CNN with weighted voting, VGG-16 with class activation maps, few-shot U-Net learning, and ResNet-101 for PD-L1 prediction. AI denoising of PET images does not compromise treatment evaluation accuracy under EORTC/PERCIST criteria. No fully automated metabolic treatment evaluation method has been reported yet.
Pages 25-26
Deep Learning vs. Healthcare Professionals: Head-to-Head Evidence

The review cites a comprehensive meta-analysis by Liu et al. (2019) that compared DL methods with healthcare professionals across multiple disease areas, including lung cancer, breast cancer, ophthalmology, and dermatology. Most studies (87.8%) relied on retrospective data, while only 12.2% used prospective datasets. CNNs were particularly prominent among the DL architectures evaluated.

Key performance figures: Among 14 externally validated studies, sensitivity was 87% for DL methods compared with 86.4% for professional clinicians, and specificity was 92.5% for DL versus 90.5% for clinicians. However, when the analysis expanded to 69 quantifiable studies (out of 82 total), the numbers dropped to mean sensitivity of 79.1% and mean specificity of 88.3% for DL methods. This decline highlights the inflating effect of internal validation, which was preferred over external (out-of-sample) validation in 70% of cases.
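For readers less familiar with these two metrics, they come straight from the confusion matrix. A one-line sketch, with hypothetical counts chosen only to reproduce the pooled externally validated percentages quoted above:

```python
def sens_spec(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative counts only: 87 of 100 true positives caught,
# 925 of 1000 true negatives correctly ruled out.
sensitivity, specificity = sens_spec(tp=87, fn=13, tn=925, fp=75)
```

This also clarifies why internal validation inflates both numbers: a test set drawn from the same distribution as the training data yields fewer FN and FP cases than genuinely out-of-sample data would.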

Validation concerns: The data underscores a critical issue in the field. Internal validation tends to overestimate diagnostic accuracy for both DL systems and clinicians. The gap between internally and externally validated performance (sensitivity dropping from 87% to 79.1%, specificity from 92.5% to 88.3%) demonstrates why external validation must become standard practice. The review authors recommend broader and standardized validation in clinical settings, improved transparency and methodological rigor, and international reporting standards to facilitate DL integration in healthcare.

The takeaway is that DL models perform comparably to healthcare professionals in controlled settings, but real-world deployment requires more rigorous validation frameworks. For the time being, the review positions DL as an aid for clinical applications rather than a replacement for clinicians, with the primary goal of streamlining case triage and flagging abnormalities effectively.

TL;DR: In 14 externally validated studies, DL achieved 87% sensitivity (vs. 86.4% for clinicians) and 92.5% specificity (vs. 90.5%). Performance dropped to 79.1% sensitivity and 88.3% specificity across 69 broader studies. Internal validation was used in 70% of cases, which inflates accuracy estimates. Only 12.2% of studies used prospective data.
Pages 31-33
Regulatory Frameworks and Validation Protocols for AI in Oncology

The review provides an unusually detailed section on regulation, covering both the EU and US frameworks for AI-based medical devices. In the EU, Regulation 2017/745 (Medical Devices Regulation, or MDR) imposes strict standards including quality management systems and post-marketing surveillance. The General Data Protection Regulation (GDPR) governs health data, and the 2018 Cybersecurity Directive protects essential services. In the US, the 21st Century Cures Act emphasizes data usage and privacy while enabling precision medicine, and HIPAA governs health information technology. The FDA regulates medical device safety and efficacy but faces challenges keeping pace with rapid technological change.

The EU AI Act: The 2024 EU Artificial Intelligence Act introduces a risk-based framework that will significantly impact healthcare AI. It categorizes AI systems by risk level (unacceptable, high, low, or minimal) and establishes legal obligations for developers, healthcare professionals, and public health authorities. AI tools for evaluating disease treatment fall into the moderate-to-high-risk category. A regulatory "sandbox" mechanism (Recital 47) allows exploration of new AI products under supervision, permitting legally collected personal data to be used for developing, training, and testing AI systems designed for public health, including disease detection, diagnosis, prevention, and treatment.

Four-phase validation protocol: The review outlines a recommended multitest validation approach from the Panel for the Future of Science and Technology. Phase 1 (feasibility testing) evaluates algorithms under ideal conditions, comparing AI performance with medical experts. Phase 2 (capability) tests via simulation or "in silico" clinical trials under more realistic conditions with clinician participation. Phase 3 (effectiveness) shifts to real clinical settings for optimization. Phase 4 (durability) involves ongoing performance monitoring, auditing, and iterative algorithm updates with larger datasets. The review notes that most current research remains at the feasibility stage, with academic work needing better integration into this regulatory framework to progress through capability, effectiveness, and durability phases.

The authors emphasize that the US currently trails the EU in regulatory strength for medical devices and digital security, particularly lacking comprehensive cybersecurity measures. Collaboration between healthcare institutions and industry partners must ensure AI models are robust and clinically relevant while meeting strict compliance requirements. The goal is balancing innovation with ethical standards, including transparency, patient privacy, and equitable healthcare access.

TL;DR: The 2024 EU AI Act classifies AI treatment monitoring tools as moderate-to-high risk. A four-phase validation protocol (feasibility, capability, effectiveness, durability) is recommended, but most current research is still at phase 1. The EU leads with MDR + GDPR + Cybersecurity Directive, while the US relies on the 21st Century Cures Act, HIPAA, and FDA oversight. Regulatory sandboxes allow controlled testing of new AI medical devices.
Pages 33-34, 40-41
Key Limitations and Gaps in Current Deep Learning Methods

Data volume and quality: Successful training of DL models requires large healthcare datasets with substantial labeled data, but medical images frequently contain noise and irregular features that complicate object detection, feature extraction, and segmentation. Many of the reviewed studies relied on small, single-center datasets, and the largest CT training set encompassed 57,793 images while some PET studies used as few as 50 training samples. Standardized, multi-institutional data collection remains a major unresolved challenge.

Black box interpretability: DL models are frequently treated as "black boxes," making it difficult for clinicians to understand how decisions are reached. This opacity undermines clinician trust and limits acceptance of DL medical devices in clinical environments. While attention mechanisms and class activation maps (CAM, GradCAM) provide some interpretability, these remain insufficient for the level of transparency demanded in oncological decision-making.

Automation gaps: Many DL methods still require manual inputs, such as radiologist-drawn bounding boxes for ROI selection or manual annotations for training. Semi-automated approaches dominate the literature. For CT-based methods, the 2D measurements that serve as the primary step before neural network segmentation are still manually obtained. For PET-based methods, no DL system has achieved fully automated metabolic treatment evaluation. The review specifically notes that Woo et al.'s CNN set struggled when lesion size matched 32 pixels exactly, illustrating the edge cases that hinder full automation.

Validation weaknesses: Internal validation was preferred over external validation in 70% of analyzed studies, inflating performance metrics. Only 12.2% of studies used prospective data. The sensitivity gap between internally validated studies (87%) and the broader pool (79.1%) is substantial. Furthermore, many studies lack direct comparisons between DL and clinician performance, and there is considerable variability in study design quality and reporting standards.

Computational demands: Model training is time- and computationally intensive, with training times ranging from 1 hour 21 minutes (325 MR images on a 64-core workstation) to 12 hours (800 lung CT images). Hardware specifications significantly affect both training and inference times. The authors' experiments showed that a lower-specification system took nearly twice as long to train and produced prediction times 1.5 times longer than a high-performance workstation.

TL;DR: Major limitations include small datasets (some PET studies used only 50 training images), black-box opacity, persistent need for manual input (no fully automated PET treatment evaluation exists), internal-only validation in 70% of studies, and significant computational demands (up to 12 hours training for 800 CT images). The sensitivity gap between internal (87%) and broader validation (79.1%) highlights overfitting concerns.
Pages 33-34, 41-42
Where Deep Learning for NSCLC Treatment Monitoring Is Heading

Feature enrichment and multimodal data: To address dataset limitations, the review advocates leveraging diverse data sources beyond imaging alone. Electronic health records (EHRs), wearable devices, genomics data, social media, and environmental information can create more comprehensive patient profiles. This multimodal approach could feed into Scalable Personalized Health Systems (SPHSs) that combine imaging with clinical and molecular data for individualized disease risk prediction and treatment recommendations.

Interpretable and temporal modeling: Two technical priorities emerge clearly from the review. First, developing interpretable modeling techniques is essential for explaining how DL models arrive at predictions, which is a prerequisite for clinician trust and regulatory approval. Second, temporal modeling, meaning architectures designed to handle time-dependent data, could track disease progression across sequential scans and enhance both classification and treatment assessment. Recurrent neural networks and attention-based architectures are natural candidates for this direction, though they remain underexplored in the treatment monitoring context.

Federated learning for privacy-preserving collaboration: The concept of federated inference involves developing models across multiple institutions that share learned parameters without sharing raw patient data. This approach preserves privacy while improving model generalization, a critical need given that most current models are trained and validated at single centers. Multi-institutional training would also help address the spectrum bias inherent in single-site datasets.
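The aggregation at the heart of this idea fits in a few lines. A sketch of one FedAvg-style round (the function name and sample-size weighting follow the standard federated averaging recipe, which the review does not prescribe in detail):

```python
import numpy as np

def federated_average(site_weights, site_sizes):
    """One federated-averaging round: each institution trains locally
    and shares only its parameter vector; the server returns the
    sample-size-weighted mean. Raw patient images never leave a site."""
    sizes = np.asarray(site_sizes, dtype=float)
    stacked = np.stack(site_weights)               # (n_sites, n_params)
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()
```

A site contributing three times as many patients pulls the global model three times as hard, so larger centers dominate without ever exporting their data.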

Toward full automation: The review identifies a clear trajectory from manual to semi-automated to fully automated systems. Key milestones needed include automated target lesion detection without radiologist-drawn bounding boxes, automated ROI selection in PET scans, end-to-end pipelines that take raw imaging input and output RECIST or PERCIST classifications, and integration of preprocessing (filtering, denoising) as learnable pipeline components rather than manual preprocessing steps. Liu et al.'s highly automated system for cancer follow-up using 3D U-Net for liver metastases segmentation with automatic RECIST 1.1 evaluation provides a template that could be adapted for NSCLC.

Hypothesis-driven discovery: Beyond automation of existing workflows, the review points to DL models being used for exploratory analysis and hypothesis formulation in clinical research. Rather than simply replicating what radiologists do, future systems could identify novel imaging biomarkers predictive of treatment response that are invisible to human observers. The integration of radiomics features extracted by CNN with clinical outcomes data could reveal new patterns linking tumor morphology, metabolism, and treatment efficacy.

TL;DR: Key future directions include multimodal data integration (EHR, genomics, wearables), interpretable DL models for regulatory compliance, federated learning across institutions to improve generalization without sharing raw data, fully automated end-to-end pipelines for RECIST/PERCIST classification, and temporal modeling architectures for tracking disease progression across sequential scans.
Citation: Guzmán Gómez R, Lopez Lopez G, Alvarado VM, Lopez Lopez F, Esqueda Cisneros E, López Moreno H. Open access, 2025. Available at: PMC12298732. DOI: 10.3390/tomography11070078. License: CC BY.