Lung cancer is the leading cause of cancer-related death worldwide, and more than 80% of primary lung cancers are classified as non-small cell lung cancer (NSCLC). Within NSCLC, the two most common histological subtypes are adenocarcinoma (ADC) and squamous cell carcinoma (SCC), which arise from the small and large airway epithelia, respectively. Histologic phenotype is a critical predictor of treatment response and clinical outcome, making accurate classification essential for therapy selection.
Limitations of tissue biopsy: While manual tissue assessment under light microscopy remains the gold standard, biopsy has well-known drawbacks. Tumor heterogeneity means a single biopsy may fail to capture the full morphological profile. Of every tissue block sent for diagnosis, only 1 or 2 slides are typically assessed, limiting the pathologist's view of the overall tumor environment. Molecular testing (e.g., EGFR/KRAS) can help identify driver mutations, but integrating molecular pathology into routine workflows remains challenging due to cost and expertise barriers.
The radiomics opportunity: This study proposes using deep learning on standard-of-care computed tomography (CT) images to non-invasively predict NSCLC histology. The core idea is that CT data, which is routinely acquired for every lung cancer patient, encodes quantitative information about tumor phenotype that CNNs can learn to decode. Prior work had already shown deep learning achieving greater than 99% sensitivity and specificity for lung nodule screening, and AUC values of 0.74 for predicting pathological response in NSCLC treated with chemoradiation.
The authors built their models using the Boston Lung Cancer Survival (BLCS) cohort, a well-characterized dataset of 311 early-stage NSCLC patients treated at Massachusetts General Hospital (MGH). Their goal was to create a non-invasive pathological biomarker that could augment biopsy-based diagnosis and serve as a corrective aid for diagnosticians.
The study cohort comprised 311 patients with early-stage NSCLC who received surgical treatment at MGH between 1999 and 2011. Of these, 186 (59.8%) had Stage I disease and 125 (40.2%) had Stage II. Median follow-up was 3.9 years, with 86.2% two-year survival. Pathologist-confirmed histology broke down as follows: 155 (49.8%) adenocarcinoma, 68 (21.9%) squamous cell carcinoma, and 88 (28.3%) classified as "Other" (large cell, mixed histology, bronchoalveolar carcinoma, carcinoid, and multi-primary cases). Only 18 patients (5.8%) had EGFR/KRAS mutation data available, so molecular analysis was not pursued.
Train-test split: Data were randomly partitioned in an approximately 75:25 ratio for fine-tuning versus testing, with no statistically significant differences between the two sets in histology distribution (p = 0.892), stage (p = 0.417), smoking status, sex, or survival. For Model A (the primary binary classifier), the tuning set contained 120 ADC and 52 SCC patients (n = 172), while the test set had 35 ADC and 16 SCC patients (n = 51). Model B, which included all three histology groups, used 228 patients for tuning and 83 for testing.
Image preprocessing: Each patient's pre-resection CT was processed by placing a clinician-located seed-point at the tumor center using 3D Slicer software. From this seed-point, 3D volumes were extracted and converted into 2D input tiles measuring 50 mm x 50 mm. Isotropic rescaling was applied with a linear interpolator to achieve uniform 1 mm x 1 mm pixel spacing. Density normalization was performed with mean subtraction and linear transformation to standardize CT intensity values across the heterogeneous dataset.
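The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `extract_tile`, the synthetic slice, and the 0.7 mm pixel spacing are all assumptions for demonstration; `scipy.ndimage.zoom` with `order=1` stands in for the linear interpolator.

```python
import numpy as np
from scipy.ndimage import zoom

def extract_tile(ct_slice, spacing_mm, seed_rc, tile_mm=50):
    """Resample an axial CT slice to 1 x 1 mm pixels with linear interpolation,
    crop a tile_mm x tile_mm patch centred on the seed point, and normalize
    densities by mean subtraction and scaling to unit variance."""
    resampled = zoom(ct_slice.astype(np.float32), spacing_mm, order=1)  # order=1: linear
    r = int(round(seed_rc[0] * spacing_mm[0]))  # seed row/col in resampled coordinates
    c = int(round(seed_rc[1] * spacing_mm[1]))
    half = tile_mm // 2
    tile = resampled[r - half:r + half, c - half:c + half]
    return (tile - tile.mean()) / (tile.std() + 1e-8)

# Demo on a synthetic 512 x 512 slice with 0.7 mm pixels, seed at the centre
slice_hu = np.random.default_rng(0).normal(-300, 200, (512, 512))
tile = extract_tile(slice_hu, (0.7, 0.7), (256, 256))
```

In a real pipeline the seed point would come from the clinician annotation made in 3D Slicer, and the rescaling would use the spacing stored in the DICOM header.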
The authors employed a transfer learning strategy to overcome the challenge of limited labeled medical data. They used the VGG-16 (Visual Geometry Group) architecture, a 16-layer convolutional neural network pre-trained on ImageNet, a dataset of more than 14 million hand-annotated natural images. Transfer learning allowed the model to retain previously learned low- and mid-level image features (edges, shadows, textures) while adapting higher layers to the specific task of histology classification from CT data.
Model A (binary classifier): Fine-tuned on 172 patients with ADC or SCC histology only. The last convolutional, pooling, and fully connected layers were unfrozen for fine-tuning over 100 epochs. The softmax output layer was set to 2 classes (ADC vs. SCC). Input was 50 mm x 50 mm image patches fed as three identical grayscale channels to match VGG-16's expected 3-channel input format.
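A Keras sketch of this fine-tuning setup follows. It is an illustration under stated assumptions, not the published configuration: `weights=None` avoids the ImageNet download (in practice `weights="imagenet"` would load the pre-trained filters), and the 256-unit head is an arbitrary placeholder.

```python
import tensorflow as tf

# VGG-16 convolutional base; weights="imagenet" would be used in practice,
# weights=None here just keeps the sketch self-contained (no download).
base = tf.keras.applications.VGG16(weights=None, include_top=False,
                                   input_shape=(50, 50, 3))
# Freeze everything except the last convolutional block
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),   # illustrative head size
    tf.keras.layers.Dense(2, activation="softmax"),  # ADC vs. SCC
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(tiles, labels, epochs=100, ...)  # tiles: grayscale stacked to 3 channels
```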
Model B (three-class classifier): Fine-tuned with the same architecture on a heterogeneous dataset of 228 cases containing all three histology groups (ADC, SCC, and Other). The softmax layer was set to 3 classes. This model was tested on the same 83-patient heterogeneous test set as Model A to enable direct comparison. The ResNet50 architecture was also evaluated but showed no significant improvement over VGG-16.
Hyperparameter optimization: The authors iteratively explored hyperparameters including batch size and the depth of fine-tuning. Performance was evaluated using AUC, accuracy, sensitivity, specificity, the Wilcoxon rank sum statistic, and two-sided p-values. Models with AUC above 0.60 and p-value below 0.05 were considered predictive.
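The pairing of AUC with the Wilcoxon rank-sum statistic is natural because the two are mathematically equivalent: AUC equals the Mann-Whitney U statistic divided by the product of the class sizes. A small sketch on hypothetical prediction scores (the 16/35 SCC/ADC split mirrors the test set; the score distributions are invented) illustrates this:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical prediction scores for 16 positives and 35 negatives
y_true = np.array([1] * 16 + [0] * 35)
scores = np.where(y_true == 1,
                  rng.normal(0.6, 0.2, 51),
                  rng.normal(0.4, 0.2, 51))

auc = roc_auc_score(y_true, scores)
# Two-sided Wilcoxon rank-sum (Mann-Whitney U) test on the same scores
u_stat, p_value = mannwhitneyu(scores[y_true == 1], scores[y_true == 0],
                               alternative="two-sided")
# Equivalence: AUC = U / (n_pos * n_neg)
print(auc, u_stat / (16 * 35), p_value)
```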
Beyond using the CNN as an end-to-end classifier, the authors explored a "deep radiomics" approach: extracting quantitative feature vectors from intermediate CNN layers and feeding them into traditional machine learning classifiers. Two feature sets were generated from Model A: a 512-dimensional (512-D) vector from the last pooling layer and a 4096-dimensional (4096-D) vector from the first fully connected layer. These high-dimensional features preserve global spatial information through convolutional kernel operations, giving them an advantage in fine-grained recognition and texture analysis over hand-crafted radiomics features.
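Extracting such intermediate-layer features can be sketched with a second Keras model that exposes the relevant layer outputs. This is a generic illustration, not the study's extractor: `weights=None` avoids the ImageNet download, the default 224 x 224 input and the global average pooling of the last pooling layer to 512-D are assumptions, and a random array stands in for a preprocessed tile.

```python
import numpy as np
import tensorflow as tf

# Full VGG-16 including the fully connected head (the study's extractor would
# carry Model A's fine-tuned weights; weights=None keeps the sketch offline)
vgg = tf.keras.applications.VGG16(weights=None, include_top=True)

# 4096-D vector from the first fully connected layer ("fc1") and a 512-D
# vector from globally pooling the last pooling layer ("block5_pool")
extractor = tf.keras.Model(
    inputs=vgg.input,
    outputs=[vgg.get_layer("fc1").output,
             tf.keras.layers.GlobalAveragePooling2D()(
                 vgg.get_layer("block5_pool").output)])

tile = np.random.rand(1, 224, 224, 3).astype("float32")  # placeholder input
fc_4096, pool_512 = extractor(tile)
```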
Dimensionality reduction: Principal component analysis (PCA) reduced both the 512-D and 4096-D feature spaces to 60 principal components, capturing 95% of cumulative explained variance. The LASSO method (alpha = 0.01) then selected the 18 most predictive features. Notably, LASSO selected the same 18 features from both the 512-D and 4096-D vectors, demonstrating strong reproducibility across network layers.
Machine learning classifiers: Four classifiers were independently evaluated on these reduced feature sets: k-nearest neighbors (kNN, k = 5), support vector machine (SVM) with linear and non-linear kernels, and random forest (RF). The features were normalized by mean subtraction and scaling to unit variance before classification, which is essential for SVM performance. Individual features appeared to follow Gaussian or Gaussian mixture distributions, validating the normalization approach.
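The PCA → LASSO → classifier pipeline described above can be sketched with scikit-learn. The data here are synthetic stand-ins (random 4096-D "features" for 172 patients with a weak planted signal), so the resulting AUCs are illustrative only; `PCA(n_components=0.95)` retains 95% of cumulative variance and `Lasso(alpha=0.01)` keeps the components with non-zero coefficients, mirroring the study's parameters.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the 4096-D deep features of the 172 tuning patients
X = rng.normal(size=(172, 4096))
y = rng.integers(0, 2, size=172)
X[y == 1, :20] += 0.8                      # plant a weak class signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

pca = PCA(n_components=0.95).fit(X_tr)     # keep 95% cumulative variance
Z_tr, Z_te = pca.transform(X_tr), pca.transform(X_te)

lasso = Lasso(alpha=0.01).fit(Z_tr, y_tr)  # LASSO-based feature selection
keep = lasso.coef_ != 0
Z_tr, Z_te = Z_tr[:, keep], Z_te[:, keep]

aucs = {}
for name, clf in [("kNN", KNeighborsClassifier(n_neighbors=5)),
                  ("linear SVM", SVC(kernel="linear", probability=True)),
                  ("RBF SVM", SVC(kernel="rbf", probability=True)),
                  ("RF", RandomForestClassifier(random_state=0))]:
    # Mean subtraction and unit-variance scaling before each classifier
    pipe = make_pipeline(StandardScaler(), clf).fit(Z_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, pipe.predict_proba(Z_te)[:, 1])
print(aucs)
```

Putting the scaler inside the pipeline ensures its statistics are fit on the training fold only, avoiding leakage into the held-out evaluation.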
This two-stage pipeline (CNN feature extraction followed by classical ML classification) provides an alternative to fully connected neural networks and has been shown in the broader literature to sometimes outperform the original CNN in classification tasks.
Model A (CNN binary classifier): The VGG-16 based model achieved AUC of 0.71 (p = 0.018) on the held-out test set of 51 ADC/SCC patients, with 68.6% accuracy, 82.9% specificity, and 37.5% sensitivity. With high specificity but low sensitivity, the model is better suited to confirming a suspected histology than to screening for it.
Clinical parameter comparison: Univariate logistic regression using clinical variables yielded lower performance. Smoking status produced AUC of 0.64 (p = 0.118), age gave AUC of 0.55 (p = 0.544), and sex was the strongest clinical predictor at AUC of 0.69 (p = 0.039). The deep learning model outperformed all three clinical parameters, with the advantage of being fully automated and independent of patient-reported data.
Feature-based classifiers: The kNN model on 4096-D features matched the CNN at AUC = 0.71 (p = 0.017), with 76.5% accuracy, 85.7% specificity, and 56.3% sensitivity. Linear SVM on 4096-D features achieved AUC of 0.68 (p = 0.042), non-linear SVM reached AUC of 0.64 (p = 0.107), and RF had the weakest performance at AUC of 0.57 (p = 0.423). On the 512-D feature vector, all classifiers showed somewhat lower performance, with kNN at AUC of 0.64, linear SVM at 0.62, non-linear SVM at 0.63, and RF at 0.61. The 4096-D features from the fully connected layer consistently outperformed the 512-D pooling layer features, likely because fully connected neurons have access to all activations from the previous layer rather than only local features.
External validation: Model A was also tested on the independent Lung3 dataset from The Cancer Imaging Archive (TCIA), comprising 49 patients (30 SCC, 19 ADC). It achieved AUC of 0.60 (p = 0.251), showing some transfer of learned signal despite the different institutional context and reversed class balance (SCC-dominant rather than ADC-dominant).
The authors tested Model A (trained only on ADC and SCC) as a probabilistic classifier on a heterogeneous test set of 83 patients containing all three histology groups: 35 ADC, 16 SCC, and 32 "Other." The Kruskal-Wallis H-test on prediction probability distributions across the three groups showed a statistically significant overall difference (p = 0.015). Post-hoc comparisons revealed that the ADC vs. SCC distinction was the strongest (p = 0.003), while SCC vs. "Other" showed a trend (p = 0.235), and ADC vs. "Other" was not significant (p = 0.355).
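This testing scheme — an omnibus Kruskal-Wallis H-test followed by pairwise post-hoc comparisons — can be sketched with SciPy. The probability distributions below are invented beta-distributed placeholders (the real inputs would be Model A's output probabilities for the 35/16/32 ADC/SCC/Other cases), and Mann-Whitney U is assumed as the pairwise post-hoc test:

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(1)
# Hypothetical Model A output probabilities for the three histology groups
p_adc   = rng.beta(2, 3, 35)   # 35 ADC
p_scc   = rng.beta(3, 2, 16)   # 16 SCC
p_other = rng.beta(2, 3, 32)   # 32 "Other"

h_stat, p_overall = kruskal(p_adc, p_scc, p_other)   # omnibus test
pairs = {
    "ADC vs SCC":   mannwhitneyu(p_adc, p_scc,   alternative="two-sided").pvalue,
    "SCC vs Other": mannwhitneyu(p_scc, p_other, alternative="two-sided").pvalue,
    "ADC vs Other": mannwhitneyu(p_adc, p_other, alternative="two-sided").pvalue,
}
print(p_overall, pairs)
```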
Clinical significance of the "Other" overlap: The lack of a significant difference between ADC and "Other" predictions is not necessarily a failure. The "Other" category likely contains misclassified adenocarcinomas, including bronchoalveolar carcinoma (BAC), which has historically been confused with ADC. Revised classification systems have since eliminated the BAC term entirely. The CNN's inability to distinguish these groups may actually reflect genuine radiographic similarity, validating the known challenges in manual histological classification.
Model A vs. Model B in heterogeneous data: Despite being trained only on two histology types, Model A still predicted ADC correctly in the heterogeneous set with AUC of 0.66 (p = 0.013), 85% specificity, and 31% sensitivity. Model B, which was specifically trained on all three histology groups, actually performed worse: AUC of 0.62 (p = 0.127) for SCC vs. all others, and AUC of 0.58 (p = 0.234) for ADC vs. all others. The binary-trained model's superiority suggests that adding the heterogeneous "Other" category during training introduced noise rather than useful signal.
Higher prediction certainty was consistently associated with correct histology type prediction, meaning the model's confidence score itself carried diagnostic value. This is an important property for clinical deployment, where a confidence threshold could be used to flag cases requiring additional review.
To address the "black box" criticism of neural networks, the authors used Gradient-weighted Class Activation Mapping (Grad-CAM) to generate heatmaps showing which image regions most influenced Model A's predictions. Grad-CAM works by computing the gradient of the target class score with respect to the feature maps of the last convolutional layer. Deeper layers in a CNN capture higher-level visual constructs while retaining spatial information that is lost in fully connected layers, making them ideal for activation mapping.
Key findings from heatmaps: The first convolutional layers highlighted tumor edges, consistent with what is observed when pre-trained models are applied to natural images. Deeper layers activated on regions on or immediately around the tumor itself. Critically, the model also highlighted areas surrounding the tumor, suggesting that peritumoral contextual information carries predictive value. These "at-risk" zones likely correspond to anatomic regions harboring occult microscopic disease that contributes to local treatment failure with surgery or radiation.
Robustness near chest wall: For lesions near the chest wall, the CNN appropriately focused on the lesion and lung parenchyma while placing less weight on bone and soft tissue, even though these structures have similar CT density to tumor. This demonstrates the model's ability to learn complex and representative features rather than relying on simple density thresholds. The findings provide reassurance that the model is activating on clinically relevant structures within the region of interest.
The authors note that while this interpretability analysis is qualitative, it confirms that the CNN has learned meaningful radiographic patterns. Quantitative interpretability metrics and experimental designs that mitigate bias (such as blinding and blocking) would strengthen future validation efforts.
Sample size: The most significant limitation is the relatively small cohort of 311 patients, with only 51 in the primary test set and 49 in external validation. While the 75:25 split was designed to maximize test set representativeness, the limited numbers constrain statistical power, particularly for subgroup analyses. The external Lung3 validation cohort had a reversed class balance (SCC-dominant) compared to the training data (ADC-dominant), which likely contributed to the drop in AUC from 0.71 to 0.60.
Retrospective, single-center design: All primary data came from MGH patients treated between 1999 and 2011. This introduces potential institutional bias in imaging protocols, scanner types, and pathology practices. The 12-year collection window also means CT technology evolved substantially during the study period, adding heterogeneity to the imaging data. While the authors argue this heterogeneity tests model robustness, it also introduces noise that could degrade performance.
Missing molecular data: Only 18 of 311 patients (5.8%) had EGFR/KRAS mutation status available, since routine molecular testing was not standard at MGH for early-stage NSCLC during the collection period. This prevented any analysis of the relationship between radiomics features and oncogenic driver mutations, which would have substantially strengthened the clinical relevance of the work. The "Other" histology category is also a known source of misclassification, particularly for bronchoalveolar carcinoma and undifferentiated NSCLC.
Future directions: The authors highlight several paths forward. Prospective validation on additional large external datasets is the most immediate need. Federated or collaborative learning could enable model training on decentralized data across institutions without requiring data sharing, helping overcome inter- and intra-institutional data silos. Integration with complementary approaches such as liquid biopsy could provide multi-modal diagnostic support. The ultimate vision is transforming rigid histological classification into a more analytical framework that combines radiological, biological, and clinical variables through deep learning-based radiomics.