Deep learning-based lung cancer classification of CT images

Journal of Digital Imaging, 2025

Plain-English Explanations
What This Paper Is About

Lung cancer is the leading cause of cancer-related death worldwide, responsible for an estimated 2.2 million new cases and 1.8 million deaths globally as of 2020. Early detection of lung nodules (small spots in the lungs between 3 mm and 3 cm in size) is critical because these nodules can be either benign or malignant. However, radiologists working without computer-aided diagnosis (CAD) tools have false-positive rates between 51% and 83.2%, meaning they frequently flag benign nodules as suspicious.

This paper introduces DCSwinB, a new deep learning model designed to classify pulmonary nodules in CT images as benign or malignant. The model combines two powerful approaches: Convolutional Neural Networks (CNNs) for extracting fine local details, and the Swin Transformer (a type of Vision Transformer) for capturing broader, global patterns across the image. This "dual-branch" design allows the model to see both the small details and the big picture simultaneously.

The authors evaluated DCSwinB on the LUNA16 and LUNA16-K datasets, which are derived from the well-known LIDC-IDRI collection of annotated thoracic CT scans. Using ten-fold cross-validation, DCSwinB achieved 90.96% accuracy, 90.56% recall, 89.65% specificity, and an AUC of 0.94 on the LUNA16-K dataset. These results outperformed all comparison models, including VGG16, ResNet50, DenseNet, Swin-T, and several advanced hybrid architectures.

TL;DR: DCSwinB is a new hybrid deep learning model that combines CNNs and Swin Transformers to classify lung nodules as benign or malignant in CT images. It achieved 90.96% accuracy and 0.94 AUC, outperforming all tested alternatives.
Why Existing AI Models Fall Short for Lung Nodule Classification

CNN limitations: Convolutional Neural Networks have been the gold standard in medical image analysis for years. They excel at detecting local patterns, textures, and spatial features in images. However, standard CNNs have a restricted receptive field, meaning they struggle to capture long-range dependencies across distant parts of an image. For volumetric CT scans, where global context is often essential for accurate diagnosis, this is a significant limitation.

Vision Transformer limitations: Vision Transformers (ViTs) solve the global context problem by using self-attention mechanisms that model relationships across the entire image. However, they divide images into fixed-size patches, which can disrupt fine-grained local details needed to identify small lung nodules. ViTs also require large datasets for effective training (because they lack the built-in spatial biases of CNNs), and the self-attention mechanism has quadratic computational complexity, making it expensive for high-resolution medical images.

Hybrid approaches: Models like TransUNet, E-TransUNet, and GLoG-CSUnet have tried combining CNNs and transformers. While these hybrids show improved results, balancing local detail extraction and global context modeling without one overshadowing the other remains challenging. Hybrid architectures can also become complex to design and computationally demanding, limiting deployment in resource-constrained clinical environments.

The DCSwinB model was specifically designed to address all three of these challenges: capturing both local and global features, maintaining computational efficiency, and achieving strong classification performance on pulmonary nodules.

TL;DR: CNNs miss global context, Vision Transformers miss local detail and are computationally expensive, and existing hybrids struggle to balance both. DCSwinB was designed to overcome all three limitations.
How the DCSwinB Model Works: Dual-Branch Design

Overall structure: DCSwinB is built on the Swin-Tiny Vision Transformer as its backbone. The model uses a hierarchical structure with four successive stages containing 2, 2, 6, and 2 layers of Swin Transformer blocks, respectively. Input 3D CT images are first processed through a convolution layer with a 4x4 patch size, generating an initial embedding with 96 channels. Feature maps are progressively downsampled between stages through a patch merging procedure based on linear layers.
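The hierarchy above can be traced with a short helper. This is a sketch under stated assumptions: the paper gives a 4x4 patch embedding with 96 channels, and this sketch further assumes the standard Swin-Tiny convention that each patch-merging step halves spatial resolution and doubles channels; the default input size of 64 matches the 64x64x64 nodule ROIs described later.

```python
def swin_stage_shapes(input_size=64, patch_size=4, embed_dim=96, num_stages=4):
    """Trace (spatial_size, channels) through a Swin-Tiny-style hierarchy.

    Patch embedding divides resolution by `patch_size`; each patch-merging
    step between stages halves resolution and doubles channels (assumed
    Swin-Tiny behavior, not stated explicitly in the paper).
    """
    size = input_size // patch_size
    channels = embed_dim
    shapes = []
    for stage in range(num_stages):
        shapes.append((size, channels))
        if stage < num_stages - 1:  # patch merging between stages
            size //= 2
            channels *= 2
    return shapes

# For a 64-voxel input this yields stages at 16, 8, 4, and 2 resolution
# with 96, 192, 384, and 768 channels respectively.
```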

Dual-branch splitting: At each stage, the input features are split into two parallel branches using a 1x1 convolutional layer. Each branch receives half the channels of the original input. The first branch (the ViT branch) processes features through the Swin Transformer with window-based multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA). The second branch (the CNN branch) applies a 3x3 convolution followed by max pooling to extract local features efficiently.

Conv-MLP module: A key innovation is the integration of a Conv-MLP module within the Swin Transformer branch. Between two standard MLP layers, the model inserts a depthwise 3x3 convolution that strengthens connections between adjacent attention windows. This allows the model to capture long-range spatial dependencies that standard window-based attention would miss. The depthwise convolution operates independently per channel, reducing parameter complexity from 3x3xCxC (standard convolution) to just 3x3xC.

Feature fusion: After processing through both branches, the outputs are concatenated along the channel dimension to produce the final feature representation for each stage. This design allows DCSwinB to combine the local pattern recognition of CNNs with the global context modeling of transformers in a computationally efficient way.

TL;DR: DCSwinB splits features into two branches: a Swin Transformer branch (enhanced with Conv-MLP for long-range dependencies) and a CNN branch (for local features). Both branches use only half the channels, keeping computation manageable.
Training Data: LUNA16 and LIDC-IDRI Datasets

Source data: The study used the publicly available LUNA16 dataset, which is derived from the Lung Image Database Consortium image collection (LIDC-IDRI). LIDC-IDRI contains thoracic CT scans annotated by multiple radiologists, making it a standard benchmark for pulmonary nodule analysis. CT scans with inconsistent slice spacing, missing slices, or slice thickness greater than 3 mm were excluded. Nodules smaller than 3 mm in diameter were also filtered out as clinically insignificant.

Preprocessing: All CT scans were resampled to a uniform isotropic voxel spacing of 1x1x1 mm using trilinear interpolation. Raw voxel intensities in Hounsfield Units (HU) were clipped to a lung window range of [-1000, 400] HU and then normalized to [0, 1] using min-max scaling. For each annotated nodule, a Region of Interest (ROI) cube of 64x64x64 voxels was extracted, centered on the nodule's centroid coordinates.
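The intensity step of this pipeline is straightforward to express in code. The sketch below implements only the HU clipping and min-max scaling described above (the resampling and ROI extraction are omitted), operating on a flat list of voxel values for illustration.

```python
def normalize_hu(voxels, hu_min=-1000, hu_max=400):
    """Clip raw CT intensities to the lung window and scale to [0, 1].

    `voxels` is any iterable of Hounsfield Unit values; values outside
    [hu_min, hu_max] are clipped before min-max normalization.
    """
    span = hu_max - hu_min
    return [(min(max(v, hu_min), hu_max) - hu_min) / span for v in voxels]
```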

Labeling: Four experienced radiologists graded each nodule on a scale from 1 (highly unlikely malignant) to 5 (highly suspicious). Nodules with an average rating of 3 were excluded due to ambiguity. Nodules rated 1 or 2 were labeled benign, and those rated 4 or 5 were labeled malignant. This produced a dataset of 450 malignant nodules and 554 benign nodules from a total of 1,004 nodules in the LUNA16 database.
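One plausible reading of this labeling rule, expressed as code: average the radiologist ratings, exclude an exact average of 3, and threshold the rest. The paper describes labels in terms of ratings 1-2 and 4-5; how fractional averages (e.g. 2.5) were handled is an assumption here.

```python
def malignancy_label(ratings):
    """Map radiologist malignancy ratings (1-5) to a training label.

    Returns 0 (benign) for an average rating below 3, 1 (malignant)
    above 3, and None for exactly 3 (ambiguous nodules are excluded).
    Treatment of non-integer averages is assumed, not from the paper.
    """
    avg = sum(ratings) / len(ratings)
    if avg < 3:
        return 0
    if avg > 3:
        return 1
    return None  # ambiguous; excluded from the dataset
```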

Data augmentation: To improve robustness, the training data was augmented with random rotations (plus or minus 15 degrees around each axis), random scaling (factor between 0.9 and 1.1), random translations (up to plus or minus 5 voxels per axis), and random horizontal/vertical flipping with 0.5 probability each.
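The augmentation policy above amounts to drawing a few random parameters per training sample. This sketch draws one such parameter set with the ranges quoted in the paragraph; applying the transforms to a voxel grid is left out.

```python
import random

def sample_augmentation(rng=random):
    """Draw one set of 3D augmentation parameters matching the ranges above:
    rotation within +/-15 degrees per axis, scale in [0.9, 1.1], translation
    within +/-5 voxels per axis, and independent flips with p = 0.5.
    """
    return {
        "rotation_deg": [rng.uniform(-15, 15) for _ in range(3)],
        "scale": rng.uniform(0.9, 1.1),
        "translation_vox": [rng.randint(-5, 5) for _ in range(3)],
        "flip_horizontal": rng.random() < 0.5,
        "flip_vertical": rng.random() < 0.5,
    }
```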

TL;DR: The model was trained on 1,004 lung nodules (450 malignant, 554 benign) from the LUNA16/LIDC-IDRI dataset, with 64x64x64 voxel ROIs extracted from preprocessed CT scans and extensive 3D data augmentation.
How the Model Was Trained and Validated

Data splitting: The dataset was split at the patient level (not nodule level) to prevent data leakage, with an 80:10:10 ratio for training, validation, and testing. A ten-fold cross-validation strategy was employed, also with patient-level splits. Within each fold, stratified sampling ensured that the distribution of benign and malignant nodules was approximately maintained across training and validation subsets. The final reported metrics are averaged across all 10 folds.
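A patient-level split can be sketched as follows: shuffle patient IDs, partition the patients, then assign every nodule to the subset containing its patient. This is a minimal illustration of the leakage-prevention idea, not the authors' code; it omits the stratification and fold rotation described above.

```python
import random

def patient_level_split(nodules, train=0.8, val=0.1, seed=0):
    """Split (patient_id, nodule) pairs so no patient spans two subsets.

    The remainder after the train and validation fractions of patients
    forms the test set. Stratified sampling and k-fold rotation are
    omitted from this sketch.
    """
    patients = sorted({pid for pid, _ in nodules})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_train = int(len(patients) * train)
    n_val = int(len(patients) * val)
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    return {name: [n for pid, n in nodules if pid in pids]
            for name, pids in groups.items()}
```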

Optimization: Training used the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9. The model was trained for 300 epochs with a batch size of 8. The learning rate started at 0.01, was reduced to 0.001 after 60 epochs, and further decreased to 0.0001 after 120 epochs. Weight decay of 0.0001 was applied for regularization, along with dropout regularization to prevent overfitting.
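The stepped schedule above is a simple piecewise-constant function of the epoch number; the sketch below assumes epochs are counted from 0, which the paper does not specify.

```python
def learning_rate(epoch):
    """Stepped schedule from the paper: 0.01 initially, 0.001 after
    epoch 60, 0.0001 after epoch 120 (epoch counted from 0, assumed)."""
    if epoch < 60:
        return 0.01
    if epoch < 120:
        return 0.001
    return 0.0001
```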

Hardware and framework: Experiments were conducted on an Ubuntu 22.04 system using PyTorch with CUDA acceleration on an NVIDIA RTX 4060 GPU with 16 GB of memory. The full training process with ten-fold cross-validation completed in approximately 12 hours.

The small batch size of 8 was deliberately chosen to introduce more randomness into gradient updates, which can improve generalization. It also balanced efficient GPU usage against the regularization benefits of smaller batches, which is particularly important when working with a relatively small dataset of around 1,000 nodules.

TL;DR: Training used patient-level 10-fold cross-validation, SGD with a stepped learning rate schedule over 300 epochs, and completed in about 12 hours on a single NVIDIA RTX 4060 GPU.
Performance Comparison: DCSwinB vs. 10 Other Models

LUNA16 dataset results: On the LUNA16 dataset, DCSwinB achieved the highest accuracy of 87.94%, exceeding SpikingResformer (86.96%) by 0.98%. It attained 85.56% recall, improving upon SpikingResformer (84.96%) and FiT (82.31%). DCSwinB also recorded the highest AUC of 0.94 among all tested models. Traditional CNN models performed notably worse: VGG16 achieved only 81.35% accuracy with 70.64% recall, ResNet50 reached 82.36% accuracy with 72.53% recall, and DenseNet scored 82.58% accuracy with 75.96% recall.

LUNA16-K dataset results: On the LUNA16-K dataset, DCSwinB showed even stronger performance with 90.96% accuracy, 90.56% recall, 89.65% specificity, an AUC of 0.90, and an F1-score of 90.56%. This represented a 1.00% accuracy improvement over SpikingResformer (89.96%) and a 1.60% improvement in recall (90.56% vs. 88.96%). Compared to ResNet50, DCSwinB showed a remarkable 4.60% improvement in recall on this dataset.
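For readers unfamiliar with these metrics, they all derive from the binary confusion matrix, with malignant as the positive class. This helper shows the standard definitions; the counts in the example are illustrative, not from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, recall (sensitivity), specificity, and F1 from a binary
    confusion matrix. Recall measures malignant nodules correctly flagged;
    specificity measures benign nodules correctly cleared.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "specificity": specificity, "f1": f1}
```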

Transformer-based comparisons: Among transformer-based models, Swin-T achieved 84.35% accuracy and 0.84 AUC on LUNA16-K, while ConvNeXt reached 85.58% accuracy and 0.85 AUC. DaViT and CrossViT performed better at 86.58% and 87.58% accuracy respectively, but both still fell well short of DCSwinB's 90.96%. The hybrid models FiT (89.24%) and GVT (89.66%) came closer but still could not match DCSwinB's overall performance.

TL;DR: DCSwinB outperformed all 10 comparison models on both datasets, achieving 90.96% accuracy and 90.56% recall on LUNA16-K, with notable gains over CNN baselines (VGG16, ResNet50, DenseNet) and hybrid architectures alike.
Ablation Study: What Each Component Contributes

Baseline comparison: The ablation study systematically tested the contribution of each architectural component. Starting from the Swin-Tiny baseline (87.94% accuracy, 85.56% recall, 85.65% specificity, 0.92 AUC on LUNA16-K), the authors incrementally added the dual-branch structure and the Conv-MLP module to measure their individual effects.

Dual-branch without Conv-MLP: Adding the dual-branch structure alone (DCSwinB without Conv-MLP) improved accuracy from 87.94% to 88.56%, recall from 85.56% to 87.02%, and specificity from 85.65% to 86.15%. The AUC increased from 0.92 to 0.93. This confirms that splitting features into parallel CNN and transformer branches provides meaningful performance gains even without the Conv-MLP enhancement.

Full DCSwinB with Conv-MLP: Adding the Conv-MLP connections produced the largest gains. Accuracy jumped to 90.96% (a 3.02% improvement over Swin-Tiny), recall reached 90.56% (up 5.00%), and specificity improved to 89.65% (up 4.00%). The AUC reached 0.94. The Conv-MLP module's depthwise convolutions enhance local feature interactions between neighboring attention windows while maintaining global feature modeling through the transformer.

Training convergence: All four model variants (Swin-T, Swin-DMLP, Swin-DB, and DCSwinB) showed rapid accuracy improvement during the first 140 epochs, followed by gradual enhancement from epochs 140 to 260, and stable convergence between epochs 260 and 300. DCSwinB consistently achieved the highest validation accuracy throughout training.

TL;DR: Both the dual-branch design and the Conv-MLP module contribute meaningfully. The full DCSwinB model improved accuracy by 3.02%, recall by 5.00%, and specificity by 4.00% over the Swin-Tiny baseline.
Computational Efficiency: Faster and Lighter Than the Baseline

Parameter reduction: Despite its dual-branch architecture, DCSwinB actually reduces the number of parameters by approximately 19.8% compared to the standard Swin-Tiny model. This reduction comes from the dual-branch design, which splits channels in half for each branch, and from the use of lightweight depthwise convolutions instead of standard convolutions in the Conv-MLP module.

FLOPs reduction: DCSwinB achieves a 24.4% decrease in floating-point operations (FLOPs) compared to Swin-Tiny. This is significant because FLOPs directly relate to the computational cost of running the model. The reduction comes largely from replacing multi-head self-attention and MLP processing on the full feature set with simpler convolution and max pooling operations on the CNN branch, and from the efficient depthwise convolution in the Conv-MLP module.

Inference speed: In practical terms, DCSwinB is 2.6 milliseconds faster per image than Swin-Tiny, representing a 16% improvement in inference speed. These benchmarks were measured on an NVIDIA RTX 4060 GPU with identical input sizes and batch settings. Faster inference is particularly important for clinical deployment, where real-time or near-real-time processing of CT scans is desirable.

The combination of better accuracy and lower computational cost is uncommon in deep learning, where performance gains typically come at the expense of increased computation. DCSwinB achieves this by intelligently distributing work between its two branches, letting the lightweight CNN branch handle tasks that do not require the full power of transformer-based attention.

TL;DR: DCSwinB is 19.8% smaller in parameters, 24.4% lower in FLOPs, and 16% faster at inference than Swin-Tiny, while simultaneously achieving higher accuracy.
Limitations and Future Directions

Computational demands: Although DCSwinB is more efficient than Swin-Tiny, the dual-branch architecture still requires significant computational resources and memory. This may limit its use in resource-constrained clinical environments or on devices with limited processing power. Deploying the model on standard hospital workstations or edge devices would likely require further optimization or model compression techniques.

Dataset dependency: The model's performance relies heavily on pretraining with the LUNA16 and LUNA16-K datasets. While this pretraining enables strong generalization within those data distributions, performance may degrade when applied to CT scans from different institutions with varied image quality, noise levels, scanner types, or patient demographics. The authors acknowledge that generalization to other types of pulmonary nodules or CT imaging data remains uncertain.

Binary classification scope: The current model only classifies nodules as benign or malignant. It does not distinguish between lung cancer subtypes such as non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC), or between adenocarcinoma and squamous cell carcinoma. In clinical practice, subtype classification is important for treatment planning. Additionally, nodules with ambiguity scores of 3 (the borderline cases) were excluded from the study, which means the model has not been tested on the most diagnostically challenging cases.

Future directions: The authors plan to extend DCSwinB to fully 3D volumetric data processing, incorporate multimodal information (combining imaging with clinical and genomic data), and develop lightweight model variants for broader clinical deployment. They also suggest exploring unsupervised or self-supervised learning strategies to reduce dependency on large annotated datasets.

TL;DR: Key limitations include resource requirements for clinical deployment, dependence on LUNA16 training data, and restriction to binary benign/malignant classification. Future work aims to extend the model to 3D volumetric data, multimodal inputs, and cancer subtyping.
Citation: Faizi MK, Qiang Y, Wei Y, et al. Open access, 2025. Available on PMC (PMC12210548). DOI: 10.1186/s12885-025-14320-8. License: CC BY-NC-ND.