Introduction
According to the 2022 GLOBOCAN study, there were 2.48 million new cases of lung cancer and 1.81 million deaths worldwide, making it one of the most common cancers and a leading cause of cancer-related mortality globally. 1 Research shows that most patients with stage IV lung cancer die within 5 years of diagnosis, whereas patients with stage IA (early-stage) lung cancer have a 5-year survival rate of over 75%; survival is therefore markedly lower in late-stage than in early-stage disease.2,3 Lung cancer is classified histopathologically into non-small-cell lung cancer (NSCLC), accounting for approximately 85% of cases, and small-cell lung cancer (SCLC), accounting for approximately 15% of cases, with lung adenocarcinoma being the most common subtype of NSCLC.4,5 In recent decades, with the rapid development of targeted therapy, the survival time of lung cancer patients has increased significantly. Compared with traditional treatments such as radiotherapy, chemotherapy, and surgery, targeted therapy is characterized by high efficacy and minimal side effects. 6 Common driver genes in lung cancer include epidermal growth factor receptor (EGFR), anaplastic lymphoma kinase (ALK), and Kirsten rat sarcoma viral oncogene (KRAS). Among these, EGFR mutations are the most common, and EGFR is one of the most frequently targeted genes in the treatment of lung adenocarcinoma. 7 EGFR is a glycoprotein with tyrosine kinase activity that regulates the proliferation, differentiation, and apoptosis of normal cells. Mutations or amplifications in this gene can convert EGFR into an oncogenic protein. 8 Lung adenocarcinoma patients with EGFR mutations often exhibit significant clinical responses to EGFR tyrosine kinase inhibitors (TKIs).
However, not all such patients benefit from EGFR-TKIs, underscoring the need for early identification of EGFR mutation status and specific mutation sites to optimize patient prognosis. 9 Currently, detecting driver genes primarily relies on invasive methods such as surgical pathology and tissue biopsy. However, due to the significant heterogeneity of tumors, tissue samples obtained through local puncture may not fully represent the lesions. Repeated punctures add to the economic burden and increase patient discomfort. While serum circulating tumor DNA extraction can serve as an auxiliary method for traditional gene detection, it has a false-negative rate of up to 30%, limiting its ability to accurately determine EGFR gene mutation status. 10 Therefore, noninvasive and efficient methods for detecting driver genes are crucial. CT is the most commonly used noninvasive imaging technique for diagnosing lung cancer due to its simplicity. Studies have shown that CT features of lung adenocarcinoma patients can predict EGFR mutation status to a certain extent. 11 Computer-aided diagnosis leverages advanced technologies like machine learning to analyze CT imaging and clinical features. It uses computer algorithms to build models for diagnosing specific diseases and to perform tasks such as disease classification, prediction, and localization. 12 This study established three machine learning models: logistic regression, support vector machine, and random forest, and optimized their parameters to predict the EGFR gene mutation status in patients with lung adenocarcinoma. The development of these predictive models aids in diagnosing EGFR gene mutations, allowing for the formulation of precise targeted treatment strategies for patients with such mutations, thereby improving their survival and prognosis. 
Based on real-world clinical data, this study investigates the performance of machine learning models in predicting EGFR gene mutations in lung adenocarcinoma and their practical significance for clinical decision making.
Materials and methods
Patient data
This study is a retrospective analysis of patients who underwent EGFR gene mutation testing and were subsequently diagnosed with lung adenocarcinoma at The First Affiliated Hospital of Hainan Medical University, Haikou, China, between January 2016 and December 2020. Patient clinical and thoracic CT imaging data were obtained from the hospital's electronic medical records. To ensure privacy and confidentiality, all patient information was de-identified before analysis. This study is based on existing clinical data, and all data usage complies with the ethical principles outlined in the 1975 Declaration of Helsinki and its 2013 revision. The research protocol was reviewed and approved by the relevant ethics committee. This study is reported in accordance with STROBE guidelines. 13
Inclusion criteria
Local residents of Hainan province.
Patients diagnosed with pulmonary adenocarcinoma by histopathology or cytology between January 2016 and September 2019.
All patients underwent EGFR gene mutation testing of tumor tissue using the multiplex fluorescence polymerase chain reaction (PCR) method, and definitive test results were obtained.
Complete baseline CT imaging data before antitumor treatment (including surgical resection, chemotherapy, radiotherapy, targeted therapy, and traditional Chinese medicine treatment).
Complete clinical data.
No second primary tumor present.
Exclusion criteria
Patients with severe artifacts in CT images.
Patients with large amounts of pleural effusion, severe obstructive pneumonia, or lung collapse, causing the primary lesion to be unclearly identifiable on CT images.
A total of 222 patients with pulmonary adenocarcinoma were initially included in this study. Eight patients were excluded due to severe artifacts in their CT images, and 20 patients were excluded because the primary lesions could not be clearly identified on CT images due to significant pleural effusion, severe obstructive pneumonia, or lung collapse. As a result, 194 patients remained for analysis. Of these, 135 patients were assigned to the training set and 59 patients to the testing set, with a ratio of 7:3, as shown in Figure 1.

Figure 1. The patient inclusion process flowchart.
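The 7:3 allocation described above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study; the shuffling procedure, the seed, and the use of floor rounding are assumptions, since the text reports only the resulting group sizes (135 and 59).

```python
import random


def split_train_test(ids, train_fraction=0.7, seed=42):
    """Shuffle patient IDs reproducibly and split them into two sets.

    The seed and the floor-rounding rule are assumptions made for this
    sketch; the paper reports only the final 135/59 split.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)  # floor of 194 * 0.7 = 135
    return shuffled[:n_train], shuffled[n_train:]


# 194 hypothetical de-identified patient IDs
patient_ids = list(range(1, 195))
train_ids, test_ids = split_train_test(patient_ids)
print(len(train_ids), len(test_ids))
```

A stratified split that preserves the mutation rate in both sets would be a reasonable alternative; the text does not say which approach was used.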
CT image acquisition
All CT images for this study were obtained using a GE Medical Systems LightSpeed 64-row CT scanner. All operators were trained to follow a consistent, standardized scanning protocol. The scanning parameters were as follows: tube voltage of 120 kV, tube current ranging from 90 to 200 mA, slice thickness and slice interval both set to 5.0 mm, reconstructed into 2 mm thin slices, with a window width of 350 Hounsfield units (HU) and a window level of 40 HU, and a reconstructed matrix size of 512 × 512. Contrast-enhanced scans were performed using a power injector, with iodine meglumine (dose: 1.5–2.0 mL/kg) injected through the antecubital vein at a flow rate of 3.0 mL/s. Scanning began 25 to 30 s after contrast injection (arterial phase), followed by a scan at 75 s (venous phase) and a third scan at 240 s (delayed phase). Both contrast-enhanced and plain CT scans were conducted with the patient in the supine position with both arms raised.
CT image feature extraction
CT images were interpreted and analyzed by two experienced radiologists. When disagreements occurred, a senior physician was consulted, and a consensus was reached after discussion. The key CT imaging indicators evaluated in this study included: tumor growth location, maximum tumor diameter, tumor margin clarity, cavitation, lobulation sign, spiculation sign, bubble sign, air bronchogram sign, pleural retraction sign, thickening sign, segmental or higher airway obstruction, tumor necrosis, pleural effusion, hilar and mediastinal lymph node metastasis, intrapulmonary metastasis, distant organ metastasis, vessel convergence sign, calcification, and tumor density. Tumor density was further categorized into solid and part-solid nodules based on the presence of ground-glass opacity within the lesion. 14
Feature selection
In this study, a total of 22 features were included, with “mutation” as the dependent variable and the remaining 21 features as independent variables. Among these, there were 17 CT imaging features and 4 clinical features. LASSO regression was used to select from these 21 features. As lambda increased, both the degrees of freedom and residuals gradually decreased. The optimal lambda value identified was 0.02458241, as shown in Figure 2. Additionally, 10-fold cross-validation was employed in LASSO regression to optimize the model, as illustrated in Figure 3. Ultimately, 10 significant features were selected, including smoking, alcohol consumption, vessel convergence sign, pleural retraction sign, segmental or higher airway obstruction, hilar and mediastinal lymph node enlargement, intrapulmonary metastasis, pleural effusion, tumor location, and lobulation sign. Of these, two were clinical features and eight were CT imaging features.

Figure 2. Lambda graph.

Figure 3. Ten-fold cross-validation.
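To illustrate why LASSO performs feature selection, the sketch below shows the soft-thresholding operator, which sets small coefficients exactly to zero as lambda grows. It is not the study's procedure: the study fitted a penalized logistic model with glmnet in R, whereas this Python example uses the closed-form special case of squared-error loss with orthonormal predictors, and the coefficient values are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(ols_coefs, lam):
    """LASSO solution for orthonormal predictors and squared-error loss.

    Under the objective 0.5 * ||y - Xb||^2 + lam * ||b||_1 with X'X = I,
    each LASSO coefficient is the soft-thresholded least-squares coefficient.
    """
    return [soft_threshold(b, lam) for b in ols_coefs]


def count_selected(coefs):
    """Number of features with a non-zero coefficient."""
    return sum(1 for b in coefs if b != 0.0)


# Hypothetical unpenalized coefficients for four features
ols = [0.8, -0.05, 0.3, 0.01]
for lam in (0.0, 0.1, 0.5):
    print(lam, count_selected(lasso_orthonormal(ols, lam)))
```

In practice, lambda is chosen by cross-validation, as described in the text, rather than fixed in advance.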
Decision curve analysis and nomogram construction
Decision curve analysis (DCA) is a statistical method used to evaluate the clinical utility and effectiveness of predictive models. By comparing the net benefits of different models or treatment strategies across various decision thresholds, DCA helps decision-makers choose the optimal strategy. 15 The nomogram is a graphical tool that illustrates the impact of variables on outcomes. It simplifies the model equation, making it easier to understand the importance of each variable. 16 In this study, 10 variables were initially selected using LASSO regression, followed by multifactorial logistic regression analysis. Variables with statistical significance (p < 0.05) from the logistic regression were used to construct the nomogram. The nomogram provides a clear visual representation of the importance of these variables, allowing for an intuitive evaluation of their impact on prediction results. By combining DCA and the nomogram, researchers can achieve a more comprehensive and intuitive analysis, which aids in optimizing predictive models and decision-making processes.
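The net benefit underlying DCA can be written as TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The following Python sketch computes it for a model and for the treat-all reference strategy (treat-none always has a net benefit of zero). The labels and predicted probabilities are hypothetical and serve only to make the arithmetic concrete; the study's own curves were produced in R.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    weight = threshold / (1.0 - threshold)  # harm of a false positive
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(y_true, threshold):
    """Net benefit when every patient is treated as mutation-positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical outcomes (1 = mutation) and predicted probabilities
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1, 0.7, 0.3]

nb_model = net_benefit(y_true, y_prob, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
print(nb_model, nb_all)
```

Evaluating both functions over a range of thresholds and plotting the results yields a decision curve.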
Statistical analysis
This study used SPSS 20.0 and R 4.2.3 software for statistical analysis. Continuous variables were expressed as means and standard deviations, and categorical variables were compared between the mutation and wild-type groups using Pearson's chi-square test. For small sample sizes or expected frequencies <5, a chi-square test with continuity correction or Fisher's exact test was applied. LASSO regression was employed to select features for constructing a predictive model for EGFR mutations in lung adenocarcinoma. The model was fitted using the “glmnet” package in R, with 10-fold cross-validation to determine the optimal lambda value. The maximum number of iterations was set to 100,000 to ensure convergence. The lambda value that minimized the cross-validation error was selected, and the corresponding regression coefficients were extracted. 17 To compare the diagnostic performance of the models, we calculated sensitivity, specificity, accuracy, area under the curve (AUC), and 95% confidence intervals (CI) for the logistic regression, support vector machine (SVM), and random forest (RF) algorithms using receiver operating characteristic (ROC) curves. 18
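The performance measures named above can be computed as in the following Python sketch. The data are hypothetical, the 0.5 classification cutoff is an assumption, and the AUC is obtained from its rank interpretation (the probability that a randomly chosen mutation case scores higher than a randomly chosen wild-type case). Confidence intervals are not covered here.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from binary predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = mutation) and predicted probabilities
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1, 0.7, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed cutoff of 0.5

metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, y_score)
print(metrics, auc_value)
```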
Model construction and optimization
In this study, logistic regression, SVM, and RF models were developed based on the 10 features selected by LASSO regression, and their prediction performance for EGFR mutation status was compared. To optimize model performance, 10-fold cross-validation and grid search were used for parameter tuning in the logistic regression model. The “expand.grid” function was used to define the parameter grid, with alpha and lambda values ranging from 0 to 1. For the SVM model, four kernel functions were tested: linear, polynomial, radial, and sigmoid. The optimal kernel function was selected to build the final SVM model. The RF model was developed using the randomForest package. To optimize the model, hyperparameter tuning was performed using the mlr3verse package. The number of features (mtry), the number of trees (num.trees), and the minimum node size (min.node.size) were tuned automatically within specified parameter ranges.
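The grid search with 10-fold cross-validation described above can be outlined as follows. This Python sketch mimics the role of R's expand.grid and a fold assignment; the helper names are hypothetical, and the step size of 0.1 is an assumption, since the text states only that alpha and lambda ranged from 0 to 1.

```python
from itertools import product


def expand_grid(**params):
    """Return every combination of the supplied parameter values."""
    names = list(params)
    return [dict(zip(names, combo)) for combo in product(*params.values())]


def k_fold_indices(n_samples, k=10):
    """Assign sample indices 0..n_samples-1 to k folds in round-robin order."""
    return [list(range(fold, n_samples, k)) for fold in range(k)]


# Assumed step of 0.1 between 0 and 1 for both tuning parameters
values = [i / 10 for i in range(11)]
grid = expand_grid(alpha=values, lam=values)

# Folds for a training set of 135 patients
folds = k_fold_indices(135, k=10)
print(len(grid), [len(f) for f in folds])
```

Each grid entry would then be scored by fitting on nine folds and evaluating on the held-out fold, and the combination with the best mean score retained. In a real analysis, the rows should be shuffled before folds are assigned.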
Results
Clinical data and CT imaging features
The study included 194 patients, with 130 males and 64 females, and a mean age of 60 ± 10 years. Among these patients, 89 were in the mutation group and 105 in the wild-type group. A comparison of 10 feature factors was conducted between the mutation and wild-type groups, with data divided into training and testing sets. In the training set, the factors of alcohol consumption, intrapulmonary metastasis, and pleural effusion showed statistical significance at the 0.05 level between the mutation and wild-type groups. In the testing set, smoking, vessel convergence sign, and hilar and mediastinal lymph node enlargement demonstrated statistical significance. Details are provided in Table 1.
Table 1. Clinical data and CT imaging features.
Note:①* Pearson's chi-square test, p
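For reference, the Pearson chi-square comparison used for the categorical features can be sketched for a 2 × 2 table as shown below. The counts are hypothetical and are not taken from Table 1. With one degree of freedom, the p-value follows from the complementary error function, so only the Python standard library is needed.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2 x 2 table.

    No continuity correction is applied. With 1 degree of freedom the
    upper-tail probability equals erfc(sqrt(chi2 / 2)).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: rows = mutation / wild-type, columns = sign present / absent
counts = [[30, 20], [15, 35]]
chi2, p_value = chi_square_2x2(counts)
print(round(chi2, 4), p_value < 0.05)
```

When any expected count falls below 5, the continuity-corrected test or Fisher's exact test mentioned in the statistical analysis would be used instead.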
Model parameter
For the logistic regression model, 10-fold cross-validation and grid search were applied to the training set for parameter tuning. The “expand.grid” function was used to define the parameter grid, with alpha and lambda values ranging from 0 to 1. The optimal alpha and lambda values were 0 and 0.2 for the clinical feature model, 0 and 0.2 for the CT imaging feature model, and 0 and 0.3 for the combined feature model.
For the SVM model, linear, polynomial, radial, and sigmoid kernel functions were tested. The optimal kernel function was selected to build the SVM model, as shown in Supplemental Table S1. Parameter tuning was performed for SVM models with different features, as detailed in Supplemental Table S2.
The RF model was constructed using the randomForest package in R. To optimize model performance, hyperparameter tuning was conducted using the mlr3verse package. This included selecting the number of features (mtry), the number of trees (num.trees), and the minimum node size (min.node.size). The parameter tuning range is provided in Supplemental Table S3.
Model performance comparison
Clinical feature model
Of the 10 variables selected by LASSO regression, the two clinical features (smoking and alcohol consumption) were used to construct logistic regression, SVM, and RF models. The performance of these models was evaluated in both the training and testing sets. In the training set, the logistic regression model demonstrated the best performance, with an AUC of 0.643, sensitivity of 0.591, and specificity of 0.638. However, in the testing set, the SVM model showed superior performance, achieving an AUC of 0.655, sensitivity of 0.484, and specificity of 0.750. Although the models had relatively low sensitivity, indicating only moderate effectiveness in identifying actual EGFR gene mutations, their higher specificity suggested they were more effective in excluding cases without EGFR gene mutations. Detailed performance metrics are presented in Table 2 and Figures 4 and 5. These results highlight the strengths and weaknesses of each model and can guide model selection and further optimization of predictive ability and clinical applicability.

Figure 4. Comparison of ROC curves for clinical feature models in the training set.

Figure 5. Comparison of ROC curves for clinical feature models in the testing set.
Table 2. Performance comparison of three models based on clinical features.
AUC: area under the curve; LR: logistic regression; RF: random forest; SVM: support vector machine.
CT imaging feature model
In this study, models were constructed using the eight CT imaging feature variables selected by LASSO regression: vessel convergence sign, pleural retraction sign, segmental or higher airway obstruction, hilar and mediastinal lymph node enlargement, intrapulmonary metastasis, pleural effusion, tumor location, and lobulation sign. In the training set, the SVM model showed the best performance, with an AUC of 0.815, and sensitivity, specificity, and accuracy of 0.734, 0.789, and 0.815, respectively. In contrast, in the testing set, the logistic regression model performed best, achieving an AUC of 0.805, sensitivity of 0.875, and accuracy of 0.729. These results suggest that the SVM model performed strongly in the training set, while the logistic regression model generalized best to the testing set. Detailed performance data are available in Table 3 and Figures 6 and 7.

Figure 6. Comparison of ROC curves for CT imaging feature models in the training set.

Figure 7. Comparison of ROC curves for CT imaging feature models in the testing set.
Table 3. Performance comparison of three models based on CT imaging features.
AUC: area under the curve; CT: computed tomography; RF: random forest; SVM: support vector machine.
Combined feature model
This study combined two clinical features (smoking and alcohol consumption) with eight CT imaging features (vessel convergence sign, pleural retraction sign, segmental or higher airway obstruction, hilar and mediastinal lymph node enlargement, intrapulmonary metastasis, pleural effusion, tumor location, and lobulation sign) to construct a comprehensive predictive model. In the training set, the SVM model demonstrated the best performance, achieving the highest AUC value of 0.877, with the best sensitivity, specificity, and accuracy among the three models. In contrast, in the testing set, the logistic regression model outperformed both the SVM and RF models, with an AUC value of 0.827, and it also had the best sensitivity and accuracy. These results indicate that the combined model has strong predictive ability in both the training and testing sets. Detailed performance data are available in Table 4 and Figures 8 and 9. These findings underscore the importance of integrating clinical and CT imaging features for diagnosing EGFR gene mutations in lung adenocarcinoma.

Figure 8. Comparison of ROC curves for combined feature models in the training set.

Figure 9. Comparison of ROC curves for combined feature models in the testing set.
Table 4. Comparison of performance among three models using combined features.
AUC: area under the curve; CT: computed tomography; RF: random forest; SVM: support vector machine.
Decision curve analysis
As shown in Figure 10, the net benefits of the CT imaging feature model and the combined model are significantly greater than those of the clinical feature model at threshold probabilities greater than 40%. These benefits increase as the decision threshold rises. When the threshold probability exceeds 60%, the CT imaging feature model offers more benefit than the combined model. For any given threshold probability, predictions using the clinical feature, CT imaging feature, and combined models provide equal or greater benefit than the “treat-all” or “treat-none” schemes. These results highlight the superiority of CT imaging features across various decision thresholds, particularly at high-risk thresholds where the benefits are highest.

Figure 10. Decision curve analysis.
Nomogram construction
In this study, 10 variables were initially selected through LASSO regression, followed by multivariable logistic regression analysis. Four variables were statistically significant: vessel convergence sign, hilar and mediastinal lymph node enlargement, intrapulmonary metastasis, and lobulation sign (Table 5). Using these four variables, a nomogram was constructed to rank their importance. Among them, the most important feature for predicting EGFR gene mutation in lung adenocarcinoma was the lobulation sign, followed by intrapulmonary metastasis, vessel convergence sign, and hilar and mediastinal lymph node enlargement. The lobulation sign has a score of 100, indicating that it is very important for predicting EGFR mutation in the model, with a predictive value of 0.9. This high predictive value indicates that patients with the lobulation sign are more likely to harbor EGFR mutations, as shown in Figure 11. Through this quantitative risk assessment, clinicians can more accurately estimate the likelihood that a patient carries an EGFR gene mutation. When the predicted probability is high, clinicians may recommend EGFR mutation testing to confirm the diagnosis and develop a targeted treatment plan. Additionally, the calibration plot of the nomogram showed good agreement between the predicted and actual probabilities, as shown in Figure 12.

Figure 11. Nomogram.

Figure 12. Calibration plot.
Table 5. Nomogram constructed from multivariable logistic regression.
Note: * p < 0.05 indicates statistical significance. The presence of VCS is coded as 1, while its absence is coded as 0. The odds ratio is 4.016, indicating a higher likelihood of EGFR mutation occurrence in patients with vessel convergence sign. The presence of hilar and mediastinal lymph node enlargement (enlarge) is coded as 1, while its absence is coded as 0. The odds ratio is 0.346, suggesting a higher likelihood of EGFR mutation occurrence in patients without hilar and mediastinal lymph node enlargement. The presence of intra-PM is coded as 1, while its absence is coded as 0. The odds ratio is 2.529, indicating a higher likelihood of EGFR mutation occurrence in patients with intrapulmonary metastasis. The presence of lobulation sign (lobulation) is coded as 1, while its absence is coded as 0. The odds ratio is 10.083, suggesting a higher likelihood of EGFR mutation occurrence in patients with lobulation sign. EGFR: epidermal growth factor receptor; PE: pleural effusion; PM: pulmonary metastasis; VCS: vessel convergence sign.
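The odds ratios in the note above can be related to logistic regression coefficients and to predicted probabilities as in the following Python sketch. The odds ratios are those reported in the note; the baseline probability of 0.30 is purely hypothetical, because the model intercept is not reported, so the resulting probabilities are illustrative only.

```python
import math

# Odds ratios as reported in the table note
ODDS_RATIOS = {
    "vessel_convergence_sign": 4.016,
    "lymph_node_enlargement": 0.346,
    "intrapulmonary_metastasis": 2.529,
    "lobulation_sign": 10.083,
}


def log_odds_coefficient(odds_ratio):
    """Logistic regression coefficient corresponding to an odds ratio."""
    return math.log(odds_ratio)


def probability_after_odds_ratio(baseline_prob, odds_ratio):
    """Update a baseline probability by multiplying its odds by odds_ratio."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


BASELINE = 0.30  # hypothetical probability when the feature is absent

for name, odds_ratio in ODDS_RATIOS.items():
    beta = log_odds_coefficient(odds_ratio)
    prob = probability_after_odds_ratio(BASELINE, odds_ratio)
    print(f"{name}: beta={beta:.3f}, probability={prob:.3f}")
```

An odds ratio below 1, as for lymph node enlargement, gives a negative coefficient and lowers the predicted probability when the feature is present.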
Discussion
EGFR-TKIs are the most effective first-line treatment for lung adenocarcinoma with EGFR mutations. Compared to traditional chemotherapy, EGFR-TKIs can prolong progression-free survival and improve the quality of life. Therefore, accurately identifying EGFR mutation status is crucial for optimizing treatment outcomes.19,20 Currently, most studies predicting EGFR gene mutations focus solely on models using a single feature.21–23 Only a few studies combine both CT imaging features and clinical features to develop EGFR prediction models.24,25
The study found that the EGFR gene mutation rate was 45.9% among the 194 patients with lung adenocarcinoma, which is similar to the findings of Wang et al. 26 Unlike previous studies that relied on a single feature to build prediction models,27,28 this study developed three types of models: a clinical feature model, a CT feature model, and a combined feature model incorporating both clinical and CT features. The performance of these models was compared in predicting EGFR gene mutations. The results indicate that the combined feature model outperforms the single feature models, aligning with the findings of Jiang and Lu et al.29–31 Although their studies primarily focused on radiomic feature analysis, their conclusions support the reliability of our CT imaging feature analysis. Additionally, compared to most other EGFR mutation prediction models, our study not only combined clinical and CT features but also included parameter tuning to avoid overfitting and underfitting, thereby enhancing model performance and generalizability. 32
The DCA results of this study reveal that for threshold probabilities greater than 40%, the benefits of using the CT imaging feature model and the combined model are significantly higher than those of the clinical feature model, and these benefits increase as the decision threshold rises. When the threshold probability exceeds 60%, the CT imaging feature model provides greater benefit than the combined model, which aligns with Hong's research findings. The key difference is that Hong's study relied solely on CT imaging features, while this study incorporated both clinical and CT imaging features. 27 By comparing the benefits of different models across various thresholds, it is possible to determine which model offers the most advantage in practical applications. 15 Although the combined feature model has the highest AUC value, the DCA indicates that the CT imaging feature model provides superior benefit compared to both the combined model and the clinical feature model, suggesting that CT imaging features may offer more clinical utility in real-world decision making. While the AUC is an important indicator of model performance, it does not fully capture the model's effectiveness in clinical settings. Clinical practicality is influenced by numerous factors, and the ultimate goal of the model is to support clinical decision-making. Therefore, a comprehensive evaluation of the model's predictive ability, practicality, and clinical utility is essential for making accurate clinical decisions.
The nomogram developed in this study identified the lobulation sign as the most important predictor of EGFR gene mutations in lung adenocarcinoma, followed by intrapulmonary metastasis, the vessel convergence sign, and hilar and mediastinal lymph node enlargement. Zhang et al. 33 constructed a combined model through retrospective analysis to predict EGFR gene mutations and found that nonsmoking status and the vascular bundle sign were independent predictors of EGFR mutations. Zhao et al. 34 performed a meta-analysis of risk factors for NSCLC and determined that the vascular cluster sign is a significant risk factor for EGFR mutations in patients with NSCLC. Despite methodological differences, the findings of Zhang's and Zhao's studies align with our results, reinforcing the validity of our conclusions.
However, some studies have indicated that there is no significant difference in predictive ability between models combining imaging features with clinical features and those using single imaging features.35,36 Therefore, the effectiveness of combined feature prediction models compared to single-feature models remains debated, and further research is needed to clarify this issue. The EGFR mutation prediction model developed in our study holds significant potential for clinical application. It can assist physicians in risk assessment, enhance diagnostic accuracy and efficiency, and help in optimizing personalized treatment plans, thereby improving medical quality. However, challenges such as technological integration and physician acceptance may arise during implementation. Future research should focus on validating the model in larger populations to confirm its applicability and address these challenges.
This study has several limitations. First, the small sample size may impact the generalizability of the model. Due to constraints, external validation using data from other institutions was not performed. Future studies should involve larger sample sizes from diverse sources to validate the model externally. Second, the extraction of CT imaging features is subject to subjective interpretation by radiologists. High consistency among radiologists can enhance data reliability and ensure the credibility of research findings. Low consistency could introduce observer bias, affecting the stability and reproducibility of the results. Finally, the study's participants were all local residents of Hainan, which limits the generalizability of the findings. Caution should be exercised when applying these results to other populations.
Conclusion
In summary, the results from the test set suggest that the combined model may provide better performance compared to both the single clinical feature model and the CT imaging feature model. The DCA indicates that the net benefit of the CT imaging feature model might exceed that of both the combined model and the clinical feature model. Additionally, the nomogram analysis highlights that the lobulation sign could be an important predictor of EGFR gene mutations in lung adenocarcinoma. Therefore, in clinical decision making, it is crucial to not only consider the model's predictive accuracy and practicality but also its clinical significance to make more informed and accurate decisions.
Supplemental Material
Supplemental material, sj-zip-1-sci-10.1177_00368504241293008 for A predictive model of computed tomography and clinical features of EGFR gene mutation in lung adenocarcinoma by Youjian Yao, Nengde Zhang, Caiwei Lu, Lianhua Liu, Yu Fu and Mei Gui in Science Progress
Supplemental material, sj-docx-2-sci-10.1177_00368504241293008 for A predictive model of computed tomography and clinical features of EGFR gene mutation in lung adenocarcinoma by Youjian Yao, Nengde Zhang, Caiwei Lu, Lianhua Liu, Yu Fu and Mei Gui in Science Progress
Supplemental material, sj-pdf-3-sci-10.1177_00368504241293008 for A predictive model of computed tomography and clinical features of EGFR gene mutation in lung adenocarcinoma by Youjian Yao, Nengde Zhang, Caiwei Lu, Lianhua Liu, Yu Fu and Mei Gui in Science Progress
