Abstract

Purpose: This study aims to develop a predictive model for epidermal growth factor receptor (EGFR) mutations in lung adenocarcinoma by integrating computed tomography (CT) imaging features with clinical characteristics. Methods: A retrospective analysis was conducted using electronic medical records from 194 patients diagnosed with lung adenocarcinoma between January 2016 and December 2020, with approval from the institutional review board. Features were selected using LASSO regression, and predictive models were built using logistic regression, support vector machine, and random forest methods. Individual models were created for clinical features, CT imaging features, and a combined model to predict EGFR mutations. Results: The training set revealed that alcohol consumption, intrapulmonary metastasis, and pleural effusion were statistically significant in distinguishing between wild-type and mutation groups (p < 0.05). In the testing set, hilar and mediastinal lymphadenopathy showed statistical significance (p < 0.05). The combined model outperformed the individual clinical and CT imaging feature models. In the testing set, the logistic regression model achieved the highest AUC of 0.827, with sensitivity, specificity, and accuracy of 0.714, 0.712, and 0.712, respectively. Nomogram analysis identified lobulation as an important feature, with a predicted probability of up to 0.9. The decision curve analysis showed that the CT imaging feature model provided a higher net benefit compared to both the clinical feature model and the combined model. Conclusion: In summary, while the combined model outperformed the individual feature models in the testing set, the CT imaging feature model demonstrated the greatest clinical net benefit. Lobulation was identified as an important predictor of EGFR mutations in lung adenocarcinoma.

Keywords

lung adenocarcinoma, epidermal growth factor receptor, prediction model, logistic regression model, support vector machine model, random forest model

Introduction

According to the 2022 GLOBOCAN study, there were 2.48 million new cases of lung cancer and 1.81 million deaths worldwide, making it one of the most common cancers and a leading cause of cancer-related mortality globally.¹ Research shows that most stage IV lung cancer patients die within 5 years of diagnosis, whereas patients with stage IA (early-stage) lung cancer have a 5-year survival rate of over 75%. The survival rates of late-stage lung cancer patients are significantly lower than those of early-stage lung cancer patients.^2,3 Lung cancer is classified histopathologically into non-small-cell lung cancer (NSCLC), accounting for approximately 85% of cases, and small-cell lung cancer (SCLC), accounting for approximately 15% of cases, with lung adenocarcinoma being the most common subtype of NSCLC.^4,5 In recent decades, with the rapid development of targeted therapy, the survival time of lung cancer patients has significantly increased. Targeted therapy is characterized by high efficacy and minimal side effects compared to traditional treatments such as radiotherapy, chemotherapy, and surgery.⁶ Common driver genes in lung cancer include epidermal growth factor receptor (EGFR), anaplastic lymphoma kinase, and Kirsten rat sarcoma viral oncogene. Among these, EGFR is the most commonly mutated and one of the most frequently targeted genes in the treatment of lung adenocarcinoma.⁷ EGFR is a glycoprotein with tyrosine kinase activity that regulates the proliferation, differentiation, and apoptosis of normal cells. Mutations or amplifications in this gene can convert EGFR into an oncogenic protein.⁸ Lung adenocarcinoma patients with EGFR mutations often exhibit significant clinical responses to EGFR tyrosine kinase inhibitors (TKIs). 
However, not all such patients benefit from EGFR-TKIs, underscoring the need for early identification of EGFR mutation status and specific mutation sites to optimize patient prognosis.⁹ Currently, detecting driver genes primarily relies on invasive methods such as surgical pathology and tissue biopsy. However, due to the significant heterogeneity of tumors, tissue samples obtained through local puncture may not fully represent the lesions. Repeated punctures add to the economic burden and increase patient discomfort. While serum circulating tumor DNA extraction can serve as an auxiliary method for traditional gene detection, it has a false-negative rate of up to 30%, limiting its ability to accurately determine EGFR gene mutation status.¹⁰ Therefore, noninvasive and efficient methods for detecting driver genes are crucial. CT is the most commonly used noninvasive imaging technique for diagnosing lung cancer due to its simplicity. Studies have shown that CT features of lung adenocarcinoma patients can predict EGFR mutation status to a certain extent.¹¹ Computer-aided diagnosis leverages advanced technologies like machine learning to analyze CT imaging and clinical features. It uses computer algorithms to build models for diagnosing specific diseases and to perform tasks such as disease classification, prediction, and localization.¹² This study established three machine learning models: logistic regression, support vector machine, and random forest, and optimized their parameters to predict the EGFR gene mutation status in patients with lung adenocarcinoma. The development of these predictive models aids in diagnosing EGFR gene mutations, allowing for the formulation of precise targeted treatment strategies for patients with such mutations, thereby improving their survival and prognosis. 
Based on real-world clinical data, this study investigates the performance of machine learning models in predicting EGFR gene mutations in lung adenocarcinoma and their practical significance for clinical decision making.

Materials and methods

Patient data

This study is a retrospective analysis of patients who underwent EGFR gene mutation testing and were subsequently diagnosed with lung adenocarcinoma at The First Affiliated Hospital of Hainan Medical University, Haikou, China, between January 2016 and December 2020. Patient clinical and thoracic CT imaging data were obtained from the hospital's electronic medical records. To ensure privacy and confidentiality, all patient information was de-identified before analysis. This study is based on existing clinical data, and all data usage complies with the ethical principles outlined in the 1975 Declaration of Helsinki and its 2013 revision. The research protocol was reviewed and approved by the relevant ethics committee. This study is reported in accordance with STROBE guidelines.¹³

Inclusion criteria

Local residents of Hainan province.

Patients diagnosed with pulmonary adenocarcinoma by histopathology or cytology between January 2016 and September 2019.

All patients underwent EGFR gene mutation testing of tumor tissue using the multiplex fluorescence polymerase chain reaction (PCR) method, and definitive test results were obtained.

Complete baseline CT imaging data before antitumor treatment (including surgical resection, chemotherapy, radiotherapy, targeted therapy, and traditional Chinese medicine treatment).

Complete clinical data.

No second primary tumor present.

Exclusion criteria

Patients with severe artifacts in CT images.

Patients with large amounts of pleural effusion, severe obstructive pneumonia, or lung collapse, causing the primary lesion to be unclearly identifiable on CT images.

A total of 222 patients with pulmonary adenocarcinoma were initially included in this study. Eight patients were excluded due to severe artifacts in their CT images, and 20 patients were excluded because the primary lesions could not be clearly identified on CT images due to significant pleural effusion, severe obstructive pneumonia, or lung collapse. As a result, 194 patients remained for analysis. Of these, 135 patients were assigned to the training set and 59 patients to the testing set, with a ratio of 7:3, as shown in Figure 1.

Figure 1.

The patient inclusion process flowchart.

CT image acquisition

All CT images for this study were obtained using a GE Medical Systems LightSpeed 64-row CT scanner. All operators involved were standardized and trained to follow consistent scanning protocols. The specific scanning parameters were as follows: tube voltage of 120 kV, tube current ranging from 90 to 200 mA, slice thickness and slice interval both set to 5.0 mm, reconstructed into 2 mm thin slices, with a window width of 350 Hounsfield units (HU) and a window level of 40 HU, and a reconstructed matrix size of 512 × 512. Contrast-enhanced scans were performed using a power injector, with iodine meglumine (dose: 1.5–2.0 mL/kg) injected through the antecubital vein at a flow rate of 3.0 mL/s. Scanning began 25 to 30 s after contrast injection (arterial phase), followed by a scan at 75 seconds (venous phase), and a third scan at 240 seconds (delayed phase). Both contrast-enhanced and plain CT scans were conducted with the patient in the supine position, with both arms raised.

CT image feature extraction

CT images were interpreted and analyzed by two experienced radiologists. When disagreements occurred, a senior physician was consulted, and a consensus was reached after discussion. The key CT imaging indicators evaluated in this study included: tumor growth location, maximum tumor diameter, tumor margin clarity, cavitation, lobulation sign, spiculation sign, bubble sign, air bronchogram sign, pleural retraction sign, thickening sign, segmental or higher airway obstruction, tumor necrosis, pleural effusion, hilar and mediastinal lymph node metastasis, intrapulmonary metastasis, distant organ metastasis, vessel convergence sign, calcification, and tumor density. Tumor density was further categorized into solid and part-solid nodules based on the presence of ground-glass opacity within the lesion.¹⁴

Feature selection

In this study, a total of 22 features were included, with “mutation” as the dependent variable and the remaining 21 features as independent variables. Among these, there were 17 CT imaging features and 4 clinical features. LASSO regression was used to select from these 21 features. As lambda increased, both the degrees of freedom and residuals gradually decreased. The optimal lambda value identified was 0.02458241, as shown in Figure 2. Additionally, 10-fold cross-validation was employed in LASSO regression to optimize the model, as illustrated in Figure 3. Ultimately, 10 significant features were selected, including smoking, alcohol consumption, vessel convergence sign, pleural retraction sign, segmental or higher airway obstruction, hilar and mediastinal lymph node enlargement, intrapulmonary metastasis, pleural effusion, tumor location, and lobulation sign. Of these, two were clinical features and eight were CT imaging features.

Figure 2.

Lambda graph.

Figure 3.

Ten-fold cross-validation.
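To illustrate why a larger lambda leaves fewer features in the model, the following is a minimal Python sketch of L1-penalized logistic regression fitted by proximal gradient descent. It is not the glmnet procedure used in this study; the function names, the toy data, and the lambda values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Shrink value toward zero; this step sets weak coefficients exactly to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.5, n_iter=2000):
    """Proximal-gradient fit of mean log-loss + lam * sum(|w|).

    The intercept is left unpenalized. X is a list of feature rows, y a list
    of 0/1 labels. Returns (weights, intercept).
    """
    n, p = len(X), len(X[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        residuals = [
            sigmoid(intercept + sum(w * x for w, x in zip(weights, row))) - label
            for row, label in zip(X, y)
        ]
        grad_intercept = sum(residuals) / n
        grad_weights = [
            sum(r * row[j] for r, row in zip(residuals, X)) / n for j in range(p)
        ]
        intercept -= step * grad_intercept
        weights = [
            soft_threshold(w - step * g, step * lam)
            for w, g in zip(weights, grad_weights)
        ]
    return weights, intercept


def selected_features(weights):
    """Indices of features whose coefficient was not shrunk to zero."""
    return [j for j, w in enumerate(weights) if w != 0.0]


# Hypothetical toy data: feature 0 is associated with the label,
# feature 1 is balanced within every (feature 0, label) group.
X_TOY = [[1, 0], [1, 1], [1, 0], [1, 1], [1, 0], [1, 1],
         [0, 0], [0, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
Y_TOY = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

if __name__ == "__main__":
    for lam in (0.01, 0.05, 10.0):
        w, _ = fit_l1_logistic(X_TOY, Y_TOY, lam)
        print(lam, selected_features(w))
```

In practice, the lambda value would be chosen by cross-validation, as described above, rather than fixed by hand.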

Decision curve analysis and nomogram construction

Decision curve analysis (DCA) is a statistical method used to evaluate the clinical utility and effectiveness of predictive models. By comparing the net benefits of different models or treatment strategies across various decision thresholds, DCA helps decision-makers choose the optimal strategy.¹⁵ The nomogram is a graphical tool that illustrates the impact of variables on outcomes. It simplifies the model equation, making it easier to understand the importance of each variable.¹⁶ In this study, 10 variables were initially selected using LASSO regression, followed by multifactorial logistic regression analysis. Variables with statistical significance (p < 0.05) from the logistic regression were used to construct the nomogram. The nomogram provides a clear visual representation of the importance of these variables, allowing for an intuitive evaluation of their impact on prediction results. By combining DCA and the nomogram, researchers can achieve a more comprehensive and intuitive analysis, which aids in optimizing predictive models and decision-making processes.
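The net benefit compared in DCA is conventionally defined as TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The following Python sketch shows this calculation under that standard definition; the function names are hypothetical, and the code is an illustration rather than the software used in this study.

```python
def net_benefit(true_pos, false_pos, n, threshold):
    """Net benefit = TP/n - (FP/n) * pt / (1 - pt)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(n_events, n, threshold):
    """Treat-all scheme: every patient is classified as positive."""
    return net_benefit(n_events, n - n_events, n, threshold)


def net_benefit_treat_none():
    """Treat-none scheme: no true or false positives, so net benefit is 0."""
    return 0.0


def net_benefit_model(probabilities, labels, threshold):
    """Classify as positive when the predicted probability reaches the threshold."""
    tp = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and y == 0)
    return net_benefit(tp, fp, len(labels), threshold)


if __name__ == "__main__":
    # Cohort-level counts from this study: 89 mutation cases among 194 patients.
    for pt in (0.2, 0.4, 0.6):
        print(pt, net_benefit_treat_all(89, 194, pt))
```

A model is useful at a given threshold only if its net benefit exceeds both the treat-all value and zero (treat-none).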

Statistical analysis

This study used SPSS 20.0 and R 4.2.3 software for statistical analysis. Continuous data (metric variables) were represented as means and standard deviations, while categorical variables were compared between the mutation and wild-type groups using Pearson's chi-square test. For small sample sizes or expected frequencies <5, a chi-square test with continuity correction and Fisher's exact test were applied. LASSO regression was employed to select features for constructing a predictive model for EGFR mutations in lung adenocarcinoma. The model was trained using the “glmnet” package in R, with 10-fold cross-validation to determine the optimal lambda value. Cross-validation was set to a maximum of 100,000 iterations to ensure model convergence and stability. The lambda value that minimized cross-validation error was selected, and the corresponding regression coefficients were extracted to ensure model accuracy and stability.¹⁷ To compare the diagnostic performance of different models, we calculated sensitivity, specificity, accuracy, area under the curve (AUC), and 95% confidence intervals (CI) for logistic regression, support vector machine (SVM), and random forest (RF) algorithms using receiver operating characteristic curves.¹⁸
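The performance measures listed above can be written out from their definitions. The following Python sketch computes sensitivity, specificity, and accuracy at a probability cutoff, and AUC as the proportion of mutation/wild-type pairs that are ranked correctly. The function names and the 0.5 cutoff are assumptions made for illustration; confidence intervals are not covered.

```python
def classification_metrics(labels, probabilities, cutoff=0.5):
    """Sensitivity, specificity, and accuracy at a given probability cutoff."""
    predictions = [1 if p >= cutoff else 0 for p in probabilities]
    tp = sum(1 for y, d in zip(labels, predictions) if y == 1 and d == 1)
    tn = sum(1 for y, d in zip(labels, predictions) if y == 0 and d == 0)
    fp = sum(1 for y, d in zip(labels, predictions) if y == 0 and d == 1)
    fn = sum(1 for y, d in zip(labels, predictions) if y == 1 and d == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(labels),
    }


def auc(labels, probabilities):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [p for y, p in zip(labels, probabilities) if y == 1]
    negatives = [p for y, p in zip(labels, probabilities) if y == 0]
    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


if __name__ == "__main__":
    toy_labels = [1, 1, 0, 0]          # hypothetical: 1 = mutation, 0 = wild-type
    toy_probs = [0.9, 0.4, 0.6, 0.1]   # hypothetical predicted probabilities
    print(classification_metrics(toy_labels, toy_probs))
    print(auc(toy_labels, toy_probs))
```

Unlike sensitivity, specificity, and accuracy, the AUC does not depend on the chosen cutoff, which is why the two kinds of measure can rank models differently.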

Model construction and optimization

In this study, logistic regression, SVM, and RF models were developed based on the 10 features selected by LASSO regression, and their prediction performance for EGFR mutation status was compared. To optimize model performance, 10-fold cross-validation and grid search were used for parameter tuning in the logistic regression model. The “expand.grid” function defined the parameter grid, with alpha and lambda each ranging from 0 to 1. For the SVM model, four kernel functions were tested: linear, polynomial, radial, and sigmoid. The optimal kernel function was selected to build the final SVM model. The RF model was developed using the randomForest package. To optimize the model, hyperparameter tuning was performed using the mlr3verse package. Parameters such as the number of features (mtry), the number of trees (num.trees), and the minimum node size (min.node.size) were automatically tuned, with specified parameter ranges.
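The combination of a parameter grid and 10-fold cross-validation can be sketched as follows. This is a Python analogue, not the R code used in this study: the helper names, the grid step of 0.1, and the user-supplied scoring function are all assumptions, and `lambda_` is used because `lambda` is a reserved word in Python.

```python
import itertools
import random


def expand_grid(**parameters):
    """All combinations of the supplied values, similar in spirit to R's expand.grid."""
    names = list(parameters)
    return [dict(zip(names, combo))
            for combo in itertools.product(*parameters.values())]


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search(grid, folds, score_fn):
    """Return the parameter set with the highest mean cross-validated score.

    score_fn(params, train_idx, test_idx) must fit a model on train_idx and
    return a score on test_idx (higher is better); it is supplied by the user.
    """
    best_params, best_score = None, float("-inf")
    for params in grid:
        scores = []
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(score_fn(params, train_idx, test_idx))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Assumed grid: 0 to 1 in steps of 0.1 for both alpha and lambda.
GRID_VALUES = [round(0.1 * i, 1) for i in range(11)]
PARAMETER_GRID = expand_grid(alpha=GRID_VALUES, lambda_=GRID_VALUES)
```

Each fold serves once as the held-out set, so every training-set patient contributes to exactly one validation score per parameter combination.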

Results

Clinical data and CT imaging features

The study included 194 patients, with 130 males and 64 females, and a mean age of 60 ± 10 years. Among these patients, 89 were in the mutation group and 105 in the wild-type group. A comparison of 10 feature factors was conducted between the mutation and wild-type groups, with data divided into training and testing sets. In the training set, the factors of alcohol consumption, intrapulmonary metastasis, and pleural effusion showed statistical significance at the 0.05 level between the mutation and wild-type groups. In the testing set, smoking, vessel convergence sign, and hilar and mediastinal lymph node enlargement demonstrated statistical significance. Details are provided in Table 1.

Table 1.

Clinical data and CT imaging features.

Variable	Training set			Testing set
Variable	Mutation	Wild	p	Mutation	Wild	p
Smoking			0.105			0.006*
Yes	29(44.6%)	41(58.6%)		7(29.2%)	23(65.7%)
No	36(55.4%)	29(41.4%)		17(70.8%)	12(34.3%)
Alcohol consumption			0.003*			0.268
Yes	8(12.3%)	24(34.3%)		2(8.3%)	8(22.9%)
No	57(87.7%)	46(65.7%)		22(91.7%)	27(77.1%)
VCS			0.135			0.012*
Yes	17(26.2%)	11(15.7%)		7(29.2%)	1(2.9%)
No	48(73.8%)	59(84.3%)		17(70.8%)	34(97.1%)
PRS			0.292			0.508
Yes	54(83.1%)	53(75.7%)		19(79.2%)	25(71.4%)
No	11(16.9%)	17(24.3%)		5(20.8%)	10(28.6%)
Obstruction			0.082			0.232
Yes	49(75.4%)	43(61.4%)		18(75.0%)	21(60.0%)
No	16(24.6%)	27(38.6%)		6(25.0%)	14(40.0%)
Enlarge			0.077			0.022*
Yes	32(49.2%)	45(64.3%)		10(41.7%)	25(71.4%)
No	33(50.8%)	25(35.7%)		14(58.3%)	10(28.6%)
PM			0.013*			0.261
Yes	29(44.6%)	17(24.3%)		11(45.8%)	11(31.4%)
No	36(55.4%)	53(75.7%)		13(54.2%)	24(68.6%)
PE			0.045*			0.878
Yes	23(35.4%)	14(20.0%)		8(33.3%)	11(31.4%)
No	42(64.6%)	56(80.0%)		16(66.7%)	24(68.6%)
Location			0.076			0.836
Right lung	50(76.9%)	44(62.9%)		11(45.8%)	17(48.6%)
Left lung	15(23.1%)	26(37.1%)		13(54.2%)	18(51.4%)
Lobulation			0.408			0.235
Yes	64(98.5%)	66(94.3%)		24(100.0%)	31(88.6%)
No	1(1.5%)	4(5.7%)		0(0.0%)	4(11.4%)

Note: * Pearson's chi-square test, p < 0.05. CT: computed tomography; Enlarge: hilar and mediastinal lymph node enlargement; Location: tumor location; Lobulation: lobulation sign; Obstruction: segmental or higher airway obstruction; PE: pleural effusion; PM: intrapulmonary metastasis; PRS: pleural retraction sign; VCS: vessel convergence sign.
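The p values in Table 1 can be checked from the tabulated counts. The following Python sketch computes Pearson's chi-square statistic for a 2 × 2 table without continuity correction and converts it to a p value with 1 degree of freedom. The function name is hypothetical, and the sketch does not cover the continuity-corrected or Fisher's exact tests used for small expected frequencies.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]] (no continuity correction).

    Returns (statistic, p_value). With 1 degree of freedom the upper-tail
    probability equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column must contain at least one case")
    statistic = n * (a * d - b * c) ** 2 / denominator
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Alcohol consumption, training set (rows: Yes/No; columns: Mutation/Wild).
    print(chi_square_2x2(8, 24, 57, 46))
    # Smoking, testing set.
    print(chi_square_2x2(7, 23, 17, 12))
```

Rounded to three decimals, these two calls reproduce the p values of 0.003 and 0.006 reported in Table 1.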

Model parameter

In the training set, 10-fold cross-validation and grid search were used to tune the logistic regression model. The “expand.grid” function defined the parameter grid, with alpha and lambda each ranging from 0 to 1. The optimal alpha and lambda values were 0 and 0.2 for the clinical feature model, 0 and 0.2 for the CT imaging feature model, and 0 and 0.3 for the combined feature model.

For the SVM model, linear, polynomial, radial, and sigmoid kernel functions were tested. The optimal kernel function was selected to build the SVM model, as shown in Supplemental Table S1. Parameter tuning was performed for SVM models with different features, as detailed in Supplemental Table S2.

The RF model was constructed using the randomForest package in R. To optimize model performance, hyperparameter tuning was conducted using the mlr3verse package. This included selecting the number of features (mtry), the number of trees (num.trees), and the minimum node size (min.node.size). The parameter tuning range is provided in Supplemental Table S3.

Model performance comparison

Clinical feature model

Logistic regression, SVM, and RF models were constructed using the two clinical features (smoking and alcohol consumption) among the 10 variables selected by LASSO regression. The performance of these models was evaluated in both the training and testing sets. In the training set, the logistic regression model demonstrated the best performance, with an AUC of 0.643, sensitivity of 0.591, and specificity of 0.638. However, in the testing set, the SVM model showed superior performance, achieving an AUC of 0.655, sensitivity of 0.484, and specificity of 0.750. Although the models had relatively low sensitivity, indicating moderate effectiveness in identifying actual EGFR gene mutations, their higher specificity suggested they were more effective in excluding cases without EGFR gene mutations. Detailed performance metrics are presented in Table 2 and Figures 4 and 5. These results provide valuable insights for selecting appropriate models and highlight the strengths and weaknesses of each model, facilitating further optimization and improvement of their predictive ability and clinical applicability.

Figure 4.

Comparison of ROC curves for clinical feature models in the training set.

Figure 5.

Comparison of ROC curves for clinical feature models in the testing set.

Table 2.

Performance comparison of three models based on clinical features.

Model	Training set				Testing set
Model	Sensitivity	Specificity	Accuracy	AUC	Sensitivity	Specificity	Accuracy	AUC
LR	0.591	0.638	0.615	0.643	0.615	0.606	0.610	0.632
SVM	0.547	0.648	0.600	0.639	0.484	0.750	0.610	0.655
RF	0.574	0.640	0.610	0.640	0.520	0.576	0.552	0.625

AUC: area under the curve; LR: logistic regression; RF: random forest; SVM: support vector machine.

CT imaging feature model

In this study, a model was constructed using eight CT imaging feature variables selected by LASSO regression, including vessel convergence sign, pleural retraction sign, segmental or higher airway obstruction, hilar and mediastinal lymph node enlargement, intrapulmonary metastasis, pleural effusion, tumor location, and lobulation sign. In the training set, the SVM model showed the best performance, with an AUC of 0.815, and sensitivity, specificity, and accuracy of 0.734, 0.789, and 0.763, respectively. In contrast, in the testing set, the logistic regression model performed best, achieving an AUC of 0.805, sensitivity of 0.875, and accuracy of 0.729, indicating relatively high accuracy. These results suggest that the SVM model demonstrated strong performance in the training set, while the logistic regression model excelled in the testing set. Detailed performance data are available in Table 3 and Figures 6 and 7.

Figure 6.

Comparison of ROC curves for CT imaging feature models in the training set.

Figure 7.

Comparison of ROC curves for CT imaging feature models in the testing set.

Table 3.

Performance comparison of three models based on CT imaging features.

Model	Training set				Testing set
Model	Sensitivity	Specificity	Accuracy	AUC	Sensitivity	Specificity	Accuracy	AUC
LR	0.613	0.781	0.704	0.757	0.875	0.674	0.729	0.805
SVM	0.734	0.789	0.763	0.815	0.655	0.700	0.678	0.726
RF	0.816	0.680	0.719	0.794	0.727	0.676	0.695	0.743

AUC: area under the curve; CT: computed tomography; LR: logistic regression; RF: random forest; SVM: support vector machine.

Combined feature model

This study combined two clinical features (smoking and alcohol consumption) with eight CT imaging features (vessel convergence sign, pleural retraction sign, segmental or higher airway obstruction, hilar and mediastinal lymph node enlargement, intrapulmonary metastasis, pleural effusion, tumor location, and lobulation sign) to construct a comprehensive predictive model. In the training set, the SVM model demonstrated the best performance, achieving the highest AUC value of 0.877, with the highest sensitivity and accuracy among the three models; the logistic regression model had the highest specificity (0.808). In contrast, in the testing set, the logistic regression model outperformed both the SVM and RF models, with an AUC value of 0.827, and it also had the best sensitivity and accuracy. These results indicate that the combined model has strong predictive ability in both the training and testing sets. Detailed performance data are available in Table 4 and Figures 8 and 9. These findings underscore the importance of integrating clinical and CT imaging features for diagnosing EGFR gene mutations in lung adenocarcinoma.

Figure 8.

Comparison of ROC curves for combined feature models in the training set.

Figure 9.

Comparison of ROC curves for combined feature models in the testing set.

Table 4.

Comparison of performance among three models using combined features.

Model	Training set				Testing set
Model	Sensitivity	Specificity	Accuracy	AUC	Sensitivity	Specificity	Accuracy	AUC
LR	0.710	0.808	0.763	0.823	0.714	0.712	0.712	0.827
SVM	0.833	0.790	0.807	0.877	0.514	0.917	0.678	0.787
RF	0.764	0.750	0.756	0.839	0.472	0.870	0.627	0.802

AUC: area under the curve; CT: computed tomography; LR: logistic regression; RF: random forest; SVM: support vector machine.

Decision curve analysis

As shown in Figure 10, for threshold probabilities greater than 40%, the net benefits of the CT imaging feature model and the combined model are significantly greater than those of the clinical feature model. These benefits increase as the decision thresholds rise. When the threshold probability exceeds 60%, the CT imaging feature model offers more benefit than the combined model. For any given threshold probability, predictions using the clinical feature model, the CT imaging feature model, and the combined model provide equal or greater benefits than the “treat-all” or “treat-none” schemes. These results highlight the superiority of CT imaging features across various decision thresholds, particularly at high-risk thresholds where the benefits are highest.

Figure 10.

Decision curve analysis.

Nomogram construction

In this study, 10 variables were initially selected through LASSO regression, followed by multivariable logistic regression analysis. Four variables were statistically significant: vessel convergence sign, hilar and mediastinal lymph node enlargement, intrapulmonary metastasis, and lobulation sign (Table 5). These four variables were used to construct a nomogram that ranks the importance of the variables. Among the four variables, the most important feature for predicting EGFR gene mutation in lung adenocarcinoma is the lobulation sign, followed by intrapulmonary metastasis, vessel convergence sign, and hilar and mediastinal lymph node enlargement. The lobulation sign has a score of 100, indicating that it is very important for predicting EGFR mutation in the model, with a predictive value of 0.9. A high predictive value indicates that lung cancer patients with lobulation signs are more likely to harbor EGFR mutations, as shown in Figure 11. Through this quantitative risk assessment, clinicians can more accurately estimate the likelihood that a patient carries an EGFR gene mutation. Based on a high predicted probability, clinicians may recommend EGFR mutation testing to confirm the diagnosis and develop a targeted treatment plan for the patient. Additionally, the calibration results of the nomogram showed good agreement between the actual and predicted probabilities, as shown in Figure 12.

Figure 11.

Nomogram.

Figure 12.

Calibration plot.

Table 5.

Nomogram constructed from multifactor logistic regression.

Variable	Coefficients	Odds ratio	95% CI		p
Variable	Coefficients	Odds ratio	Lower	Upper	p
Intercept	−3.083				0.014*
Smoke	−0.388	0.679	0.330	1.390	0.289
Drink	−0.639	0.528	0.194	1.373	0.198
VCS	1.390	4.016	1.652	10.444	0.003*
PRS	0.401	1.494	0.668	3.418	0.333
Obstruction	0.512	1.668	0.807	3.503	0.170
Enlarge	−1.060	0.346	0.162	0.714	0.005*
PM	0.928	2.529	1.188	5.521	0.017*
PE	0.604	1.830	0.818	4.136	0.142
Location	0.301	1.351	0.692	2.649	0.379
Lobulation	2.311	10.083	1.533	202.190	0.041*

Note: * p < 0.05 indicates statistical significance. The presence of VCS is coded as 1, while its absence is coded as 0. The odds ratio is 4.016, indicating a higher likelihood of EGFR mutation in patients with vessel convergence sign. The presence of hilar and mediastinal lymph node enlargement (Enlarge) is coded as 1, while its absence is coded as 0. The odds ratio is 0.346, suggesting a higher likelihood of EGFR mutation in patients without hilar and mediastinal lymph node enlargement. The presence of intrapulmonary metastasis (PM) is coded as 1, while its absence is coded as 0. The odds ratio is 2.529, indicating a higher likelihood of EGFR mutation in patients with intrapulmonary metastasis. The presence of lobulation sign (Lobulation) is coded as 1, while its absence is coded as 0. The odds ratio is 10.083, suggesting a higher likelihood of EGFR mutation in patients with lobulation sign. EGFR: epidermal growth factor receptor; Enlarge: hilar and mediastinal lymph node enlargement; PE: pleural effusion; PM: intrapulmonary metastasis; PRS: pleural retraction sign; VCS: vessel convergence sign.
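Each odds ratio in Table 5 is the exponential of the corresponding coefficient, and a predicted probability follows from the logistic function of the linear predictor. The following Python sketch shows both relationships using the tabulated values. The function names are hypothetical, all variables are treated as 0/1 indicators, and the coding of tumor location is an assumption because it is not stated; the sketch is illustrative and is not the nomogram itself, which was built from the four significant variables only.

```python
import math

# Coefficients as reported in Table 5 (rounded to three decimals).
INTERCEPT = -3.083
COEFFICIENTS = {
    "Smoke": -0.388,
    "Drink": -0.639,
    "VCS": 1.390,
    "PRS": 0.401,
    "Obstruction": 0.512,
    "Enlarge": -1.060,
    "PM": 0.928,
    "PE": 0.604,
    "Location": 0.301,
    "Lobulation": 2.311,
}


def odds_ratio(coefficient):
    """Odds ratio for a one-unit change in a predictor: exp(coefficient)."""
    return math.exp(coefficient)


def predicted_probability(features, intercept=INTERCEPT, coefficients=COEFFICIENTS):
    """Logistic-model probability for a dict of 0/1 indicators.

    Variables missing from `features` are treated as 0 (absent).
    """
    logit = intercept + sum(
        coef * features.get(name, 0) for name, coef in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    for name, coef in COEFFICIENTS.items():
        print(name, round(odds_ratio(coef), 3))
    # Hypothetical patient with lobulation sign and vessel convergence sign only.
    print(predicted_probability({"Lobulation": 1, "VCS": 1}))
```

Small differences from the tabulated odds ratios in the third decimal place are expected because the coefficients are themselves rounded.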

Discussion

EGFR-TKIs are the most effective first-line treatment for lung adenocarcinoma with EGFR mutations. Compared to traditional chemotherapy, EGFR-TKIs can prolong progression-free survival and improve the quality of life. Therefore, accurately identifying EGFR mutation status is crucial for optimizing treatment outcomes.^19,20 Currently, most studies predicting EGFR gene mutations focus solely on models using a single feature.^21–23 Only a few studies combine both CT imaging features and clinical features to develop EGFR prediction models.^24,25

The study found that the EGFR gene mutation rate was 45.9% among the 194 patients with lung adenocarcinoma, which is similar to the findings of Wang et al.²⁶ Unlike previous studies that relied on a single feature to build prediction models,^27,28 this study developed three types of models: a clinical feature model, a CT feature model, and a combined feature model incorporating both clinical and CT features. The performance of these models was compared in predicting EGFR gene mutations. The results indicate that the combined feature model outperforms the single feature models, aligning with the findings of Jiang and Lu et al.^29–31 Although their studies primarily focused on radiomic feature analysis, their conclusions support the reliability of our CT imaging feature analysis. Additionally, compared to most other EGFR mutation prediction models, our study not only combined clinical and CT features but also included parameter tuning to avoid overfitting and underfitting, thereby enhancing model performance and generalizability.³²

The DCA results of this study show that for threshold probabilities greater than 40%, the benefits of using the CT imaging feature model and the combined model are significantly higher than those of the clinical feature model, and these benefits increase as the decision thresholds rise. When the threshold probability exceeds 60%, the CT imaging feature model provides greater benefits than the combined model, which aligns with Hong's research findings. The key difference is that Hong's study relied solely on CT imaging features, while this study incorporated both clinical and CT imaging features.²⁷ By comparing the benefits of different models across various thresholds, it is possible to determine which model offers the most advantage in practical applications.¹⁵ Although the combined feature model has the highest AUC value, the DCA indicates that the CT imaging feature model provides superior benefits compared to both the combined model and the clinical feature model, suggesting that CT imaging features may offer more clinical utility in real-world decision making. While the AUC value is an important indicator of model performance, it does not fully capture the model's effectiveness in clinical settings. Clinical practicality is influenced by numerous factors, and the ultimate goal of the model is to support clinical decision-making. Therefore, a comprehensive evaluation of the model's predictive ability, practicality, and clinical utility is essential for making accurate clinical decisions.

The nomogram developed in this study identified the “lobulation sign” as an important predictor of EGFR gene mutations in lung adenocarcinoma, followed by intrapulmonary metastasis, the vessel convergence sign, and hilar and mediastinal lymph node enlargement. Zhang et al.³³ constructed a combined model through retrospective analysis to predict EGFR gene mutations and found that nonsmoking status and the vascular bundle sign were independent predictors of EGFR mutations. Zhao et al.³⁴ performed a meta-analysis on risk factors for NSCLC and determined that the vascular cluster sign is a significant risk factor for EGFR mutations in patients with NSCLC. Despite methodological differences, the findings from Zhang and Zhao's studies align with our results, reinforcing the validity of our research conclusions.

However, some studies have indicated that there is no significant difference in predictive ability between models combining imaging features with clinical features and those using single imaging features.^35,36 Therefore, the effectiveness of combined feature prediction models compared to single-feature models remains debated, and further research is needed to clarify this issue. The EGFR mutation prediction model developed in our study holds significant potential for clinical application. It can assist physicians in risk assessment, enhance diagnostic accuracy and efficiency, and help in optimizing personalized treatment plans, thereby improving medical quality. However, challenges such as technological integration and physician acceptance may arise during implementation. Future research should focus on validating the model in larger populations to confirm its applicability and address these challenges.

This study has several limitations. First, the small sample size may limit the generalizability of the model. Because of study constraints, external validation using data from other institutions was not performed. Future studies should use larger samples from diverse sources to validate the model externally. Second, the extraction of CT imaging features depends on subjective interpretation by radiologists. High consistency among radiologists enhances data reliability and the credibility of the findings, whereas low consistency could introduce observer bias and affect the stability and reproducibility of the results. Finally, all participants were local residents of Hainan, which limits the generalizability of the findings. Caution should be exercised when applying these results to other populations.

Conclusion

In summary, the results from the test set suggest that the combined model may perform better than both the clinical feature model and the CT imaging feature model. The DCA indicates that the net benefit of the CT imaging feature model might exceed that of both the combined model and the clinical feature model. Additionally, the nomogram analysis highlights that the lobulation sign could be an important predictor of EGFR gene mutations in lung adenocarcinoma. Therefore, in clinical decision making, it is crucial to consider not only the model's predictive accuracy and practicality but also its clinical significance in order to make more informed and accurate decisions.

Supplemental Material

Supplemental material, sj-zip-1-sci-10.1177_00368504241293008 for A predictive model of computed tomography and clinical features of EGFR gene mutation in lung adenocarcinoma by Youjian Yao, Nengde Zhang, Caiwei Lu, Lianhua Liu, Yu Fu and Mei Gui in Science Progress

Supplemental material, sj-docx-2-sci-10.1177_00368504241293008 for A predictive model of computed tomography and clinical features of EGFR gene mutation in lung adenocarcinoma by Youjian Yao, Nengde Zhang, Caiwei Lu, Lianhua Liu, Yu Fu and Mei Gui in Science Progress

Supplemental material, sj-pdf-3-sci-10.1177_00368504241293008 for A predictive model of computed tomography and clinical features of EGFR gene mutation in lung adenocarcinoma by Youjian Yao, Nengde Zhang, Caiwei Lu, Lianhua Liu, Yu Fu and Mei Gui in Science Progress

Footnotes

Authors’ contributions

YY was involved in formal analysis, model construction, writing – original draft, and writing – review & editing; NZ in formal analysis and writing – original draft; CL in literature research and collation; LL in model construction; YF in data curation; and MG in methodology, model construction, writing – original draft, writing – review & editing, and supervision.

Data availability statement

The data used in this study were clinical data collected from hospitals. These data contain sensitive patient information and therefore cannot be shared publicly.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethics approval

This study has been approved by the Ethics Committee of Hainan Medical University (HYLL-2021–391).

Funding

This work was supported by the Hainan Provincial Natural Science Foundation of China (grant numbers 821MS044 and 821QN0895) and the Hainan Philosophy and Social Science Planning Project of China (grant number HNSK(YB)23–40).

ORCID iD

Youjian Yao

Supplemental material

Supplemental material for this article is available online.

References

1. Bray, Laversanne, Sung, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2024; 74: 229–263.

2. Lancaster, Heuvelmans, Oudkerk. Low-dose computed tomography lung cancer screening: clinical evidence and implementation research. J Intern Med 2022; 292: 68–80.

3. Marx, Chan JKC, Chalabreysse, et al. The 2021 WHO classification of tumors of the thymus and mediastinum: what is new in thymic epithelial, germ cell, and mesenchymal tumors? J Thorac Oncol 2022; 17: 200–213.

4. Dayen, Debieuvre, Molinier, et al. New insights into stage and prognosis in small cell lung cancer: an analysis of 968 cases. J Thorac Dis 2017; 9: 5101–5111.

5. Divan, Bittoni, Krishna, et al. Real-world patient characteristics and treatment patterns in US patients with advanced non-small cell lung cancer. BMC Cancer 2024; 24: 424.

6. Yuan, Huang, Chen, et al. The emerging treatment landscape of targeted therapy in non-small-cell lung cancer. Signal Transduct Target Ther 2019; 4: 61.

7. Casula, Pisano, Paliogiannis, et al. Comparison between three different techniques for the detection of EGFR mutations in liquid biopsies of patients with advanced stage lung adenocarcinoma. Int J Mol Sci 2023; 24: 6410.

8. Guo, Song, Wang, et al. Concurrent genetic alterations and other biomarkers predict treatment efficacy of EGFR-TKIs in EGFR-mutant non-small cell lung cancer: a review. Front Oncol 2020; 10: 610923.

9. Martin-Fernandez. Fluorescence imaging of epidermal growth factor receptor tyrosine kinase inhibitor resistance in non-small cell lung cancer. Cancers (Basel) 2022; 14: 686.

10. Zhang, Yao, et al. Pan-cancer circulating tumor DNA detection in over 10,000 Chinese patients. Nat Commun 2021; 12: 11.

11. Zhang, Zhao, Cao, et al. Relationship between epidermal growth factor receptor mutations and CT features in patients with lung adenocarcinoma. Clin Radiol 2021; 76: 473.e17–473.e24.

12. Thakur, Singh, Choudhary. Lung cancer identification: a review on detection and classification. Cancer Metastasis Rev 2020; 39: 989–998.

13. von Elm, Altman, Egger, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Lancet 2007; 370: 1453–1457.

14. Henschke, Yankelevitz, Mirtcheva, et al. CT screening for lung cancer: frequency and significance of part-solid and nonsolid nodules. AJR Am J Roentgenol 2002; 178: 1053–1057.

15. Vickers, Van Calster, Steyerberg. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. Br Med J 2016; 352: i6.

16. Keam, Kim, Park, et al. Nomogram predicting clinical outcomes in non-small cell lung cancer patients treated with epidermal growth factor receptor tyrosine kinase inhibitors. Cancer Res Treat 2014; 46: 323–330.

17. Musoro, Zwinderman, Puhan, et al. Validation of prediction models based on lasso regression with multiply imputed data. BMC Med Res Methodol 2014; 14: 116.

18. Liu, Zhou, et al. Development and validation of machine learning models to predict epidermal growth factor receptor mutation in non-small cell lung cancer: a multi-center retrospective radiomics study. Cancer Control 2022; 29: 10732748221092926.

19. Ding, Zhang, et al. Radiomics for the prediction of EGFR mutation subtypes in non-small cell lung cancer. Med Phys 2019; 46: 4545–4552.

20. Nguyen, DKN, Nguyen, et al. Predicting EGFR mutation status in non-small cell lung cancer using artificial intelligence: a systematic review and meta-analysis. Acad Radiol 2024; 31: 660–683.

21. Shiri, Amini, Nazari, et al. Impact of feature harmonization on radiogenomics analysis: prediction of EGFR and KRAS mutations from non-small cell lung cancer PET/CT images. Comput Biol Med 2022; 142: 105230.

22. Shiri, Maleki, Hajianfar, et al. Next-generation radiogenomics sequencing for prediction of EGFR and KRAS mutation status in NSCLC patients using multimodal imaging and machine learning algorithms. Mol Imaging Biol 2020; 22: 1132–1148.

23. NQK, Kha, Nguyen, et al. Machine learning-based radiomics signatures for EGFR and KRAS mutations prediction in non-small-cell lung cancer. Int J Mol Sci 2021; 22: 9254.

24. Shen, Mao, et al. CT radiomics in predicting EGFR mutation in non-small cell lung cancer: a single institutional study. Front Oncol 2020; 10: 542957.

25. Yang, Liu, Ren, et al. Using contrast-enhanced CT and non-contrast-enhanced CT to predict EGFR mutation status in NSCLC patients-a radiomics nomogram analysis. Eur Radiol 2022; 32: 2693–2703.

26. Wang, et al. Value of serum tumor markers for predicting EGFR mutations and positive ALK expression in 1089 Chinese non-small-cell lung cancer patients: a retrospective analysis. Eur J Cancer 2020; 124: 1–14.

27. Hong, Zhang, et al. Radiomics signature as a predictive factor for EGFR mutations in advanced lung adenocarcinoma. Front Oncol 2020; 10: 28.

28. Yang, Chen, Gong, et al. Application of CT radiomics features to predict the EGFR mutation status and therapeutic sensitivity to TKIs of advanced lung adenocarcinoma. Transl Cancer Res 2020; 9: 6683–6690.

29. Jiang, Yang, et al. Computed tomography-based radiomics quantification predicts epidermal growth factor receptor mutation status and efficacy of first-line targeted therapy in lung adenocarcinoma. Front Oncol 2022; 12: 985284.

30. Zhang, et al. A novel radiomic nomogram for predicting epidermal growth factor receptor mutation in peripheral lung adenocarcinoma. Phys Med Biol 2020; 65: 055012.

31. Zhang, Cao, Zhang, et al. Predicting EGFR mutation status in lung adenocarcinoma: development and validation of a computed tomography-based radiomics signature. Am J Cancer Res 2021; 11: 546–560.

32. Infante, Miceli, Ambrogi. Sample size and predictive performance of machine learning methods with survival data: a simulation study. Stat Med 2023; 42: 5657–5675.

33. Zhang, Cai, Wang, et al. CT and clinical characteristics that predict risk of EGFR mutation in non-small cell lung cancer: a systematic review and meta-analysis. Int J Clin Oncol 2019; 24: 649–659.

34. Zhao, Han, et al. Clinicoradiological features associated with epidermal growth factor receptor exon 19 and 21 mutation in lung adenocarcinoma. Clin Radiol 2019; 74: 80.e7–80.e17.

35. Yang, Dong, Wang, et al. Computed tomography-based radiomics signature: a potential indicator of epidermal growth factor receptor mutation in pulmonary adenocarcinoma appearing as a subsolid nodule. Oncologist 2019; 24: e1156–e1164.

36. Wang, Wan, Xia, et al. Value of radiomics model based on multi-parametric magnetic resonance imaging in predicting epidermal growth factor receptor mutation status in patients with lung adenocarcinoma. J Thorac Dis 2021; 13: 3497–3508.
