Leveraging Data Mining for Inference and Prediction in Lung Cancer Research

Document Type: Original Article

Authors

1 Department of Mathematics and Computer Science, Fontbonne University, USA

2 Department of Statistics, Western Michigan University, USA

3 Institute for Data Science and Informatics, University of Missouri Columbia, USA

4 Senior Officer, Product Development at Radiant Nutraceuticals Ltd, Bangladesh

Abstract

Lung cancer is the second most common cancer worldwide, with an estimated 2.21 million new diagnoses and 1.8 million deaths in 2020, according to the WHO. Early detection, accurate diagnosis, and successful treatment improve survival rates. This study included 270 patients with lung cancer and 39 individuals without lung cancer. Logistic regression was used to analyze the association between variables for inference, and Linear Discriminant Analysis, Quadratic Discriminant Analysis, Logistic Regression, k-Nearest Neighbors (KNN), Decision Tree, Bagging, Random Forest, and Support Vector Machine were used to predict the likelihood of an individual developing lung cancer based on these factors. In terms of accuracy, 5-fold cross-validation yielded higher accuracy than the validation set approach, with the Logistic Regression model achieving the highest accuracy of 93.54%, followed by Linear Discriminant Analysis at 92.09%, the Support Vector Machine at 91.29%, Bagging and Random Forest at 90.90% and 91.23% respectively, Quadratic Discriminant Analysis at 89.97%, the Decision Tree at 89.97%, the KNN-10 model at 17.74%, and lastly the KNN-5 model at 16.12%. The logistic regression model identified key associations between lung cancer and factors such as allergy, peer pressure, swallowing difficulty, smoking, chronic disease, alcohol consumption, yellow fingers, fatigue, and coughing. The accuracy rankings varied between the 5-fold cross-validation and validation set approaches. Notably, the logistic regression model consistently demonstrated superior performance, achieving an accuracy of 93.54%.
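The comparison described in the abstract, in which each classifier is scored once on a single held-out validation set and once by 5-fold cross-validation, can be sketched roughly as follows. This is a minimal illustration and not the authors' code: it assumes scikit-learn, uses a synthetic dataset of 309 rows as a stand-in for the study data, and every hyperparameter, the 70/30 split, the random seeds, and the helper name `compare_models` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the study data (assumption): 309 rows with an
# imbalanced binary outcome, loosely mirroring 270 cases vs 39 non-cases.
X, y = make_classification(
    n_samples=309,
    n_features=10,
    n_informative=5,
    n_redundant=0,
    weights=[0.13, 0.87],
    random_state=0,
)

# The eight methods named in the abstract; KNN appears with k=5 and k=10.
# All settings below are illustrative assumptions, not the paper's values.
MODELS = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(reg_param=0.1),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN-5": KNeighborsClassifier(n_neighbors=5),
    "KNN-10": KNeighborsClassifier(n_neighbors=10),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="rbf"),
}


def compare_models(X, y, models=MODELS):
    """Return {name: (validation_set_accuracy, mean_5fold_cv_accuracy)}."""
    # Validation set approach: one stratified 70/30 split (assumed ratio).
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )
    # 5-fold cross-validation: every row is held out exactly once.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    results = {}
    for name, model in models.items():
        holdout_acc = model.fit(X_train, y_train).score(X_val, y_val)
        cv_acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        results[name] = (float(holdout_acc), float(cv_acc))
    return results


results = compare_models(X, y)
for name, (holdout_acc, cv_acc) in sorted(
    results.items(), key=lambda item: item[1][1], reverse=True
):
    print(f"{name:20s} validation set: {holdout_acc:.4f}   5-fold CV: {cv_acc:.4f}")
```

Because the data here are synthetic, the printed accuracies will not match the figures reported above; the sketch only shows how the two evaluation schemes can produce different rankings for the same set of models.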

Keywords

Main Subjects