Revolutionizing Lung Cancer Detection: A High-Accuracy Machine Learning Framework for Early Diagnosis


Plain-English Explanations
What This Paper Is About

Lung cancer remains the leading cause of cancer-related mortality worldwide. According to GLOBOCAN 2022, there were approximately 2.5 million new cases and 1.82 million deaths from lung cancer globally, accounting for 16.8% of all cancer deaths. Early detection can improve survival rates by as much as 20%, yet many patients are diagnosed at advanced stages when treatment options are limited. Machine learning has emerged as a powerful tool for improving early prediction and risk stratification based on patient symptoms and clinical data.

This paper combines a systematic literature review (SLR) of 40 published studies on machine learning for lung cancer prediction with the development of a new voting ensemble framework. The SLR uses the tollgate methodology to select and evaluate existing research, answering four research questions about classifier effectiveness, model limitations, AI versus traditional methods, and key predictive features. The authors then propose their own framework built on Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR) combined through hard voting.

The proposed model was evaluated on two publicly available Kaggle datasets containing clinical symptom data. Using SMOTE for class balancing, ANOVA-based SelectKBest for feature selection, and 10-fold cross-validation, the ensemble achieved 99% accuracy on the cancer patient dataset (1,000 instances, 25 features) and 92.5% accuracy on the survey lung cancer dataset (309 instances, 16 features). These results outperformed individual classifiers and several previously published models.

TL;DR: This paper reviews 40 existing studies on ML-based lung cancer prediction and proposes a voting ensemble of RF, SVM, and LR that achieves 99% and 92.5% accuracy on two clinical symptom datasets, outperforming prior approaches.
Systematic Review: What 40 Studies Reveal About ML for Lung Cancer

The authors conducted a systematic literature review covering 130 initial papers from eight electronic databases: IEEE Xplore, PubMed, Springer Link, MDPI, Elsevier, Nature, Hindawi, and Google Scholar. Using the tollgate approach (a five-phase filtering process), they narrowed these down to 40 final studies published between 2018 and 2024. Quality evaluation scores were applied at each phase based on whether the methodology, features, and findings adequately addressed the research questions.

The review found SVM to be the most frequently used algorithm (9 of the 26 classifier uses tallied across the studies), followed by RF and LR (4 uses each), Naive Bayes (NB) (3 uses), and KNN, Rotation Forest, and CNN (2 uses each). Among the reviewed studies, the most commonly used predictive features were age, chronic disease, and cough (each appearing in 11 studies), followed by alcohol consumption and swallowing difficulty (10 each), and smoking, allergy, and BMI (9 each).

The review identified several critical gaps: most studies used relatively small, domain-specific datasets without clear preprocessing or class imbalance handling. Feature selection methods were frequently unreported. Classical ML methods (RF, SVM, LR) remained dominant for tabular clinical data, while CNN-based architectures dominated image-driven studies. The authors noted that ensemble techniques and deep learning classifiers like ANNs were gaining traction but often lacked interpretability.

Notable existing results included 95.56% accuracy with SVM on a UCI dataset, 94.42% accuracy with XGBoost using SMOTE, 93% accuracy with an ANN using CRISP-DM methodology, and 98.76% accuracy with GBM using PCA and random oversampling. These benchmarks formed the baseline against which the proposed framework was compared.

TL;DR: A tollgate-based review of 40 studies found SVM, RF, and LR as the most popular classifiers, with age, chronic disease, and cough as top predictive features. Key gaps included poor handling of class imbalance and limited feature selection reporting.
Two Clinical Datasets: Symptoms, Lifestyle, and Cancer Risk Levels

Dataset 1 (Cancer Patient Dataset): Obtained from Kaggle, this dataset contains 1,000 instances with 25 attributes covering demographics (age, gender), lifestyle factors (smoking, alcohol use, obesity, balanced diet), environmental exposures (air pollution, occupational hazards, passive smoking), genetic risk, and symptoms (chest pain, coughing of blood, fatigue, shortness of breath, wheezing, swallowing difficulty, clubbing of fingernails, frequent cold, dry cough, snoring). The target variable "level" has three classes: low (303 instances), medium (332 instances), and high (365 instances).

Dataset 2 (Survey Lung Cancer Dataset): Also from Kaggle, this smaller dataset contains 309 instances with 16 attributes including gender, age, smoking, yellow fingers, anxiety, peer pressure, chronic disease, fatigue, allergy, wheezing, alcohol consumption, cough, shortness of breath, swallowing difficulty, and chest pain. The target variable is binary (lung cancer: yes or no), with a highly imbalanced distribution of 270 "yes" and only 39 "no" instances.

Both datasets are symptom-based and tabular, meaning they rely on clinical features rather than imaging data such as CT scans or x-rays. This makes them suitable for early screening and triage scenarios where a patient's reported symptoms and risk factors can be used to flag individuals who may need further diagnostic workup, such as low-dose CT (LDCT) screening. The datasets do not distinguish between specific lung cancer subtypes such as non-small cell lung cancer (NSCLC), small cell lung cancer (SCLC), adenocarcinoma, or squamous cell carcinoma.

TL;DR: Two Kaggle datasets were used: a 1,000-instance, 25-feature dataset with three risk levels and a 309-instance, 16-feature binary dataset. Both rely on clinical symptoms and lifestyle factors rather than imaging data.
Data Preprocessing: SMOTE, ANOVA, and Feature Selection

Missing value handling: The authors used pandas' isnull() function to verify data completeness and checked for duplicate records. Both datasets were confirmed to have no missing values, which simplified the preprocessing pipeline.

Class imbalance correction with SMOTE: The Synthetic Minority Oversampling Technique (SMOTE) was applied to address class imbalance, particularly in Dataset 2 where "no" cases (39) were vastly outnumbered by "yes" cases (270). SMOTE generates synthetic samples by interpolating between existing minority-class data points and their nearest neighbors (k=5). After SMOTE, Dataset 1 was balanced to 365 instances per class, and Dataset 2 was balanced to 270 instances per class. Importantly, SMOTE was applied only within training folds during cross-validation to prevent data leakage.
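The interpolation at the heart of SMOTE can be sketched in a few lines of NumPy. This is a toy illustration of the idea (random minority point plus a random step toward one of its k nearest minority neighbors), not the imbalanced-learn implementation the authors presumably used:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority point and one of its k nearest minority neighbours,
    as SMOTE does (toy sketch, not the imbalanced-learn implementation)."""
    rng = np.random.default_rng(rng)
    n, d = X_min.shape
    # pairwise distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    # indices of the k nearest neighbours, skipping self at column 0
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]
    out = np.empty((n_new, d))
    for i in range(n_new):
        a = rng.integers(n)                   # pick a random minority point
        b = nn[a, rng.integers(nn.shape[1])]  # pick one of its neighbours
        lam = rng.random()                    # interpolation factor in [0, 1)
        out[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return out

# four minority points at the corners of the unit square (made-up data)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(minority, n_new=5, k=3, rng=0)
```

Because each synthetic point lies on a segment between two existing minority points, the new samples always fall inside the minority class's local neighborhood rather than in arbitrary regions of feature space.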

Feature selection with ANOVA: The SelectKBest method with the f_classif scoring function was used for feature selection. This performs an ANOVA F-test, computing the ratio of between-class to within-class variance for each feature and ranking features by their discriminative power. Nine features were selected for each dataset. For Dataset 1: air pollution, alcohol consumption, genetic risk, occupational hazards, coughing of blood, balanced diet, obesity, allergy, and passive smoking. For Dataset 2: yellow fingers, alcohol consumption, peer pressure, fatigue, allergy, wheezing, cough, swallowing difficulty, and chest pain.
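The F-test ranking behind SelectKBest reduces to a between-class over within-class variance ratio per feature. A minimal NumPy sketch of that computation (mirroring what scikit-learn's f_classif does, on made-up data, not the authors' code):

```python
import numpy as np

def anova_f_scores(X, y):
    """Per-feature ANOVA F statistic: mean-square between classes
    divided by mean-square within classes."""
    classes = np.unique(y)
    N, g = len(y), len(classes)
    grand = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        ss_between += len(Xc) * (mu - grand) ** 2
        ss_within += ((Xc - mu) ** 2).sum(axis=0)
    return (ss_between / (g - 1)) / (ss_within / (N - g))

def select_k_best(X, y, k):
    """Indices of the k features with the highest F scores."""
    return np.argsort(anova_f_scores(X, y))[::-1][:k]

rng = np.random.default_rng(1)
y = np.array([0] * 50 + [1] * 50)
# feature 0 tracks the class label; feature 1 is pure noise
X = np.column_stack([y + 0.1 * rng.standard_normal(100),
                     rng.standard_normal(100)])
top = select_k_best(X, y, k=1)
```

On this toy data the class-correlated feature gets a far larger F score than the noise feature, which is exactly the behavior that lets SelectKBest prune 25 or 16 attributes down to the 9 most discriminative ones.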

SHAP analysis was also performed to validate feature importance. In Dataset 1, passive smoking, fatigue, and coughing of blood emerged as the top contributors. In Dataset 2, age, allergy, and yellow fingers were most influential. This confirmed that smoking-related risk factors and respiratory symptoms are the strongest predictors of lung cancer across both datasets.

TL;DR: SMOTE balanced the imbalanced classes (applied only within training folds), ANOVA-based SelectKBest selected the top 9 features per dataset, and SHAP analysis confirmed passive smoking, fatigue, and respiratory symptoms as the most important predictors.
The Voting Ensemble: How RF, SVM, and LR Work Together

Ensemble design: The proposed framework uses a hard voting ensemble that combines Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR). In hard voting, each classifier independently predicts a class label, and the final prediction is the class that receives the most votes. This approach leverages the complementary strengths of each algorithm: RF captures non-linear feature interactions through multiple decision trees, SVM finds optimal hyperplanes for class separation (using an RBF kernel with C=1.0 and gamma=scale), and LR provides probabilistic baseline predictions with strong interpretability.

Individual classifiers: The Random Forest was configured with 100 estimators, entropy as the splitting criterion, and a max depth of 10. RF is an ensemble of decision trees where each tree is trained on a random subset of the data and features, reducing overfitting through averaging. SVM maps data into a higher-dimensional space using a radial basis function (RBF) kernel and finds the maximum-margin separating hyperplane. LR uses a sigmoid function to model the probability of class membership, making it well-suited for both binary and multiclass problems.
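In scikit-learn, the described ensemble can be sketched roughly as follows. The estimator settings (100 trees, entropy criterion, max depth 10; RBF SVM with C=1.0 and gamma="scale"; hard voting; random state 42) come from the paper; the synthetic data is a stand-in for the 9 selected clinical features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# synthetic stand-in for the 9 selected clinical features
X, y = make_classification(n_samples=300, n_features=9, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, criterion="entropy",
                                      max_depth=10, random_state=42)),
        ("svm", SVC(kernel="rbf", C=1.0, gamma="scale", random_state=42)),
        ("lr", LogisticRegression(max_iter=1000, random_state=42)),
    ],
    voting="hard",  # final label = majority vote of the three predictions
)
acc = ensemble.fit(X, y).score(X, y)  # training accuracy on the toy data
```

With voting="hard" each base model casts one vote per sample, so a single model's error is overruled whenever the other two agree; this is the mechanism by which the ensemble can beat each of its members.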

Validation strategy: A stratified 10-fold cross-validation pipeline was used. The data was split into 10 folds while maintaining class distribution in each fold. SMOTE was applied only to the training portion of each fold, ensuring the test fold remained untouched by synthetic data. This nested approach prevents the optimistic bias that can occur when oversampling is applied before splitting. The model was implemented with a fixed random state of 42 for reproducibility, and all experiments were conducted in Python 3.10.11 with scikit-learn 1.2.1.
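The leak-safe pattern (oversample inside each training fold, never the held-out fold) can be sketched as below. A naive duplicate-the-minority oversampler stands in for SMOTE to keep the sketch dependency-light, and the random features, label split, and LogisticRegression model are all placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def oversample(X, y, rng):
    """Stand-in for SMOTE: duplicate minority-class rows until the
    classes are balanced (SMOTE would interpolate new points instead)."""
    counts = np.bincount(y)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for c, n in enumerate(counts):
        if n < target:
            extra = rng.choice(np.flatnonzero(y == c), target - n)
            X_parts.append(X[extra])
            y_parts.append(y[extra])
    return np.vstack(X_parts), np.concatenate(y_parts)

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 9))    # placeholder features
y = np.array([0] * 160 + [1] * 40)   # imbalanced labels, like Dataset 2

scores = []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True,
                              random_state=42).split(X, y):
    # balance the training fold only; the held-out fold never sees
    # synthetic rows, so evaluation stays leak-free
    X_tr, y_tr = oversample(X[tr], y[tr], rng)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X[te], y[te]))
mean_acc = float(np.mean(scores))
```

Running the oversampler before the split would let near-copies of the same minority point land in both train and test folds, which is precisely the optimistic bias the authors' nested approach avoids.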

TL;DR: The framework combines RF (100 trees, entropy, depth 10), SVM (RBF kernel), and LR through hard voting. Stratified 10-fold cross-validation with SMOTE applied only to training folds prevents data leakage and ensures reliable evaluation.
Performance Results: 99% and 92.5% Accuracy

Dataset 1 (Cancer Patient Dataset): The proposed voting ensemble achieved 99% accuracy across all key metrics. The 10-fold cross-validation results showed accuracy of 0.991 +/- 0.009, F1 macro of 0.991 +/- 0.010, precision macro of 0.991 +/- 0.009, and recall macro of 0.991 +/- 0.010. The confusion matrix showed extremely high true positive counts and minimal misclassifications across all three risk levels (low, medium, high), with counts of 55, 61, and 82 correct predictions respectively, and only a handful of errors.

Dataset 2 (Survey Lung Cancer Dataset): The ensemble achieved a final test accuracy of 92.5%. The cross-validation metrics were accuracy of 0.894 +/- 0.050, F1 macro of 0.763 +/- 0.090, precision macro of 0.815 +/- 0.129, and recall macro of 0.782 +/- 0.117. The confusion matrix showed 55 true positives and 47 true negatives, with only 4 false positives and 2 false negatives. The wider variance in cross-validation scores reflects the smaller dataset size (309 instances).

Individual classifier comparison: On Dataset 1, Decision Tree achieved 96% accuracy, LR achieved 94%, NB scored 57.5%, and KNN reached 79%. On Dataset 2, Decision Tree achieved 89%, LR 94%, NB 89%, and KNN 86.1%. The voting ensemble consistently outperformed all individual classifiers, confirming the benefit of combining multiple algorithms through majority voting.

TL;DR: The voting ensemble hit 99% accuracy (0.991 F1) on the 1,000-instance dataset and 92.5% accuracy on the 309-instance dataset, outperforming Decision Tree (96%/89%), LR (94%/94%), KNN (79%/86.1%), and NB (57.5%/89%).
How the Proposed Model Compares to Existing Systems

Against published benchmarks: The authors compared their framework against several previously published models evaluated on the same or similar UCI/Kaggle lung cancer datasets. Abdullah et al. (2021) achieved 95.56% using SVM with correlation-based feature selection. Mamun et al. (2022) reached 94.42% with XGBoost and cross-validation. Vieira et al. (2021) obtained 93% using an ANN with information gain and chi-square feature selection. Faisal et al. (2018) achieved 90% with a gradient-boosted tree. The proposed model's 99% accuracy on Dataset 1 exceeded all of these by a significant margin.

Key differentiators: Unlike most prior studies, the proposed framework explicitly addresses class imbalance through SMOTE (applied within cross-validation folds to prevent leakage), uses a systematic ANOVA-based feature selection method that reduces the feature space to 9 attributes, and combines three complementary classifiers through ensemble voting. Many existing studies either used single classifiers without ensemble methods, did not address class imbalance, or applied oversampling before data splitting, which can inflate performance metrics.

Additional benchmarks included Radhika et al. (2019) with LR at 96.9% using 7-fold cross-validation, Viji Cripsy and Divya (2023) with LR at 91.90% using PCA and Ranker method, and Dritsas and Trigka (2022) with SVM at 95.4%. The proposed model's Dataset 2 accuracy of 92.5% fell below some of these single-classifier benchmarks, likely due to the smaller, more imbalanced dataset and the stricter nested cross-validation approach that avoids data leakage.

TL;DR: The proposed model's 99% accuracy on Dataset 1 surpassed all existing benchmarks, including XGBoost (94.42%), ANN (93%), and SVM (95.56%). Key advantages include proper SMOTE application within CV folds and ANOVA-based feature selection.
SHAP Analysis: Understanding Which Features Drive Predictions

Dataset 1 feature importance: SHAP (SHapley Additive exPlanations) analysis revealed that passive smoking had the highest mean absolute SHAP value for the cancer patient dataset, followed by fatigue, coughing of blood, and wheezing. These features showed the greatest influence on the model's predictions across all three risk levels (low, medium, high). Alcohol use, clubbing of fingernails, and swallowing difficulty also contributed meaningfully. The SHAP values varied by class, with passive smoking being particularly important for distinguishing high-risk patients.

Dataset 2 feature importance: For the survey lung cancer dataset, age emerged as the most influential predictor, followed by allergy, yellow fingers, alcohol consumption, and fatigue. Peer pressure, wheezing, and chronic disease also contributed but with smaller SHAP values. The dominance of age aligns with clinical knowledge that lung cancer incidence increases substantially after age 50, particularly in individuals with additional risk factors.

The SHAP results from both datasets converge on a consistent clinical picture: smoking-related exposures (active and passive), respiratory symptoms (wheezing, cough, shortness of breath), and general health indicators (fatigue, chronic disease) are the strongest predictors of lung cancer risk. This aligns with established epidemiological evidence and supports the model's clinical validity. The use of SHAP provides transparency that many prior lung cancer prediction studies lacked, making the model more suitable for clinical decision support where practitioners need to understand why a particular prediction was made.

TL;DR: SHAP analysis identified passive smoking, fatigue, and blood in cough as top predictors for Dataset 1, and age, allergy, and yellow fingers for Dataset 2. These findings align with established clinical risk factors for lung cancer.
Limitations and Future Directions

Dataset constraints: Both datasets are publicly available from Kaggle and are relatively small (1,000 and 309 instances). While SMOTE helps balance class distributions, its synthetic nature can inflate performance metrics, particularly on small datasets. The high accuracy of 99% on Dataset 1 should be interpreted cautiously, as the dataset may contain inherently separable patterns that do not fully reflect the complexity of real-world clinical data. The authors acknowledge that generalizability to larger, more diverse patient populations remains untested.

Scope of prediction: The framework is limited to symptom-based tabular data and does not incorporate imaging modalities such as CT scans, LDCT screening, or chest x-rays. In clinical practice, the detection of pulmonary nodules through imaging is a critical component of lung cancer diagnosis. The model also does not distinguish between cancer subtypes (NSCLC vs. SCLC, or adenocarcinoma vs. squamous cell carcinoma), which is essential for treatment planning. The datasets' binary or three-level classification oversimplifies the staging complexity of real lung cancer diagnosis.

Methodological considerations: The proposed framework represents an incremental benchmarking effort rather than a fundamentally new algorithm. The voting ensemble of RF, SVM, and LR uses well-established classifiers without novel architectural modifications. Additionally, while the nested cross-validation approach is methodologically sound, external validation on independent clinical cohorts was not performed, which limits confidence in real-world deployment potential.

Future work: The authors plan to validate the framework on more diverse, larger, and clinically annotated datasets. They suggest integrating different data types (imaging, genomic, and clinical features) to create multimodal prediction models. The framework could serve as a decision-support tool for early screening and risk assessment, complementing clinical judgment with patient-specific risk factors such as age, lifestyle habits, and comorbidities.

TL;DR: Key limitations include small Kaggle datasets, potential SMOTE-inflated accuracy, no imaging data, no cancer subtype classification, and lack of external validation. Future work aims for larger clinical datasets and multimodal integration.
Citation: Ali TM, Mir A, Rehman AU, Humayun M, Shaheen M, Alshammari RTS. Open Access, 2025. Available at: PMC12699974. DOI: 10.1155/bmri/9961773. License: CC BY.