Abstract
Introduction
In recent years, with the rapid development of railway transportation, the security assurance of railway train control systems has attracted widespread concern. Global navigation satellite systems (GNSSs), which can provide accurate and real-time information regarding the position, speed, direction, and other states of trains, have become a natural choice for precise train positioning in novel location-enabled railway systems. 1 However, satellite signals are weak upon reaching the ground, and the details of civilian signals are open to the public. Consequently, the GNSS receiver embedded in the train positioning unit is highly susceptible to intentional or unintentional interference. 2 In the past, GNSS-based applications primarily focused on effects and countermeasures related to functional safety, 3 and lacked a full-life-cycle design for active protection against cybersecurity risks, which have received in-depth attention in various areas, such as smart cities 4 and intelligent systems. 5 The threat arising from GNSS vulnerability lies in interference, particularly GNSS spoofing interference. A GNSS spoofing signal, which is generated and transmitted to imitate authentic satellite signals, can force the receiver to derive an inaccurate position, velocity, and time report, thereby misleading the operation of target systems equipped with the receiver. If a malicious spoofing attack leads the target system to a predetermined location, the safety of the operational system and human lives is seriously threatened. 6 For specific GNSS-based safety-related services and applications, effective spoofing protection measures must be emphasized and prioritized.
GNSS spoofing has been reported in several incidents worldwide, which demonstrate the vulnerability of GNSS receivers to false signals. A portable civilian global positioning system (GPS) spoofer was developed and utilized to assess the threat of spoofing, and deception experiments were successfully conducted on a superyacht. 7 In addition, a GNSS navigation spoofing device can be constructed using a GNSS signal simulator, signal amplifier, generator, and other equipment, and such a device successfully deceived an operating truck through a spoofing attack. 8 With the development of specific techniques, the realization of GNSS spoofing attacks has become easier. For example, even an inexpensive software-defined radio can make a smartphone believe that it is in a different location. These cases illustrate the severe consequences of intentional GNSS spoofing interference and highlight the urgent need to develop effective spoofing protection measures to enhance the security levels of GNSS devices.
Numerous studies have been conducted on the detection and recognition of GNSS spoofing. To detect the direction of the spoofing signal, Ong et al. 9 proposed a method to estimate the azimuth angle of the GPS carrier signal using a dual antenna and an interferometer. However, the addition of an interferometer increases the recognition cost. An antenna array can be used to detect a single deception signal or deception signals from multiple directions and to eliminate the effects of deception interference.10,11 This solution requires multiple antennae, and the antenna length affects the recognition performance. Because modifying the receiver hardware is usually difficult in systems already in use, antispoofing can instead be realized by identifying unreasonable jumps in quantities such as the carrier amplitude and carrier phase, especially at the beginning of spoofing. 12 Monitoring the received power provides an effective solution, which requires viewing all the received carrier amplitude values and automatic gain control settings at the RF front-end of the receiver. Because a spoofer requires a high power level, a sudden change in the received power may indicate a spoofing attack.13,14 However, for this category, the abrupt change only occurs at the initial stage of the spoofing event; thus, this method is suitable for strong and short-lived scenarios. To continuously identify the existence of GNSS spoofing, different solutions have been proposed that encrypt the signal navigation information, making it difficult for an attacker to obtain and change the navigation information in the satellite signal. An ideal method is based on symmetric encryption. 15 If correlating the two noisy versions of the encrypted signal yields a sufficiently high peak, the signal is regarded as authentic; otherwise, the potential victim receiver is considered to have been spoofed.16,17 The encryption-enabled solutions require a secure receiver network to generate a real version of the encryption code and a secure communication network to send the signal to the signal processing unit that can check the correlation. However, the specified requirements for related facilities may limit the implementation of this solution.
Considering the possible involvement of auxiliary devices that are immune to GNSS spoofing, non-GNSS information-assisted solutions have been proposed for spoofing recognition and protection. By analyzing the measurement models of the GNSS and inertial measurement unit, a spoofing recognition model has been proposed and examined using a generalized likelihood ratio test. 18
With the rapid development of machine learning (ML) technology, data-driven spoofing recognition methods have attracted considerable attention in recent years owing to their robustness, high generalization ability, and ability to analyze and extract effective information. Typical ML solutions, such as the gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM), have been adopted and demonstrated in spoofing recognition and suppression. ML-based solutions have shown great potential for modeling and recognition; however, there remains scope for enhancing the accuracy and robustness of spoofing awareness and recognition. To address the problems of existing methods, this study proposes a data-driven spoofing recognition solution using the LightGBM model. Among the various optimization algorithms that can be utilized to enhance artificial intelligence techniques 19 and data-driven classifiers, 20 the Bayesian optimization (BO) algorithm 21 has shown superior empirical performance in several optimization-based applications. Therefore, the BO algorithm is introduced into the original LightGBM to realize a BO-LightGBM solution that further improves the spoofing recognition performance. The advantages of the proposed BO-LightGBM method in terms of recognition accuracy and robustness are illustrated using field-data-based spoofing injection tests.
The remainder of this paper is organized as follows. The "Sample features in spoofing recognition" section describes the sample features for spoofing recognition. The BO-LightGBM model and the corresponding spoofing recognition solution for GNSS-based train positioning are presented in the "BO-LightGBM spoofing recognition" section. The "Test and analysis" section presents the test results and the corresponding analysis. The conclusions section draws conclusions and briefly reviews future plans.
Sample features in spoofing recognition
GNSS signal observation model
A GNSS-based autonomous train positioning system (TPS) with a train-centric design is composed of a GNSS receiver, an antenna, assistant location sensors, and a location processing unit (LPU). Specific assistant sensors, such as an odometer (ODO) and a Doppler radar, can be used in a TPS. Figure 1 shows the typical architecture of a TPS.

Architecture of GNSS-based TPS.
The GNSS receiver obtains the satellite signal for navigation calculations, and the result is used to determine the final location report by the LPU. The satellite signal
The received signal undergoes different processing stages in the receiver, including capture, tracking, and navigation calculations with different observation variables, such as the pseudo-range, carrier phase, and carrier-to-noise density ratio (
Typical sample features
Pseudo-range residual
The pseudo-range refers to the measured satellite-user distance converted with the signal propagation time from a satellite to the receiver. Observation errors exist owing to several factors, such as propagation delay, clock error, and signal interference. Using satellite ephemeris data and specific propagation delay models, the corrected pseudo-range
Under normal operating conditions with four or more satellites observed at the same instant, the actual distance
The pseudo-range residual
The sensitivity of
Carrier phase smoothed pseudo-range residual
The carrier phase is a measure of the range between a satellite and the receiver, expressed in units of cycles of the carrier frequency, and it reflects changes in the carrier waveform of the satellite signal. The carrier phase can be used to smooth the pseudo-range, and the smoothed result can be used as an observational characteristic quantity.
Therefore, the carrier-phase-smoothed pseudo-range residual can be obtained using (7) to identify whether the receiver has been deceived.
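As an illustration of carrier-phase smoothing, the following sketch implements the widely used Hatch filter. Whether the smoothing behind (7) takes exactly this form is an assumption, as are the function name `hatch_filter`, the window length `window`, and the toy data; the carrier phase is taken to be already converted from cycles to metres.

```python
def hatch_filter(pseudoranges, carrier_ranges, window=100):
    """Smooth code pseudo-ranges with carrier-phase ranges (both in metres).

    Assumed recursion (Hatch filter):
        s[k] = rho[k] / n + (n - 1) / n * (s[k-1] + phi[k] - phi[k-1])
    where n grows from 1 up to `window`.
    """
    if len(pseudoranges) != len(carrier_ranges):
        raise ValueError("inputs must have the same length")
    smoothed = []
    for k, (rho, phi) in enumerate(zip(pseudoranges, carrier_ranges)):
        if k == 0:
            smoothed.append(rho)  # initialise with the raw pseudo-range
            continue
        n = min(k + 1, window)
        # Propagate the previous smoothed value with the carrier-phase change.
        predicted = smoothed[-1] + (phi - carrier_ranges[k - 1])
        smoothed.append(rho / n + (n - 1) / n * predicted)
    return smoothed


# Toy example: the true range grows by 1 m per epoch; the code noise alternates
# between +2 m and -2 m, while the carrier range only has a constant offset.
true_range = [20_000_000.0 + k for k in range(50)]
code = [r + (2.0 if k % 2 == 0 else -2.0) for k, r in enumerate(true_range)]
carrier = [r + 123.456 for r in true_range]
smoothed = hatch_filter(code, carrier, window=20)
raw_err = abs(code[-1] - true_range[-1])
smooth_err = abs(smoothed[-1] - true_range[-1])
```

In this toy run the smoothed error at the last epoch is much smaller than the raw code error, because the constant carrier offset cancels in the epoch-to-epoch difference.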
Carrier-to-noise density ratio (C/N0)
Noise affects the measurement accuracy of GNSS receivers.
When the receiver is spoofed, it is forced to receive the spoofing signal, resulting in a stronger baseband signal. A high
GNSS/ODO speed difference
In a GNSS spoofing attack scenario, the GNSS-derived speed becomes inaccurate. By contrast, an ODO evaluates the along-track running distance and speed of a train and is less susceptible to GNSS interference. Hence, the difference in train speed between GNSS and ODO can be used as an indication to evaluate the negative effects of spoofing. The GNSS/ODO speed difference
The measurement from ODO is immune to electromagnetic effects; it is relatively reliable, even under a GNSS spoofing attack scenario. When spoofing interference exists, the receiver-derived speed
GNSS/ODO pseudo-range difference
The GNSS/ODO pseudo-range difference indicates the difference between the ODO-derived and the receiver-observed pseudo-ranges. Using the ODO-derived speed, time integration can be performed to estimate the train's along-track traveling distance
Figure 2 shows the details of the

Principle of ODO-based coordinate prediction based on the trackmap data.
Procedures of ODO-based coordinate prediction based on trackmap data.
ODO: odometer.
The equivalent pseudo-range
The immunity of the ODO-derived coordinate location enables the evaluation of the effect of GNSS spoofing by calculating the difference
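The ODO-based procedure described in this subsection can be sketched as follows, under several assumptions that are not taken from the paper: the trackmap is a hypothetical piecewise-linear list of `(mileage, x, y, z)` points, the ODO speed is integrated with the trapezoidal rule, and the receiver clock bias and propagation delays are assumed to have been removed from the observed pseudo-range already. All function names are illustrative.

```python
import math
from bisect import bisect_right


def integrate_speed(times, speeds, s0=0.0):
    """Trapezoidal integration of ODO speed (m/s) into along-track distance (m)."""
    dist = [s0]
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        dist.append(dist[-1] + 0.5 * (speeds[k] + speeds[k - 1]) * dt)
    return dist


def position_on_track(trackmap, s):
    """Linearly interpolate a position at along-track distance s.

    `trackmap` is a list of (mileage_m, x, y, z) tuples sorted by mileage;
    this format is a hypothetical stand-in for the real trackmap data.
    """
    miles = [p[0] for p in trackmap]
    i = min(max(bisect_right(miles, s) - 1, 0), len(trackmap) - 2)
    (m0, *p0), (m1, *p1) = trackmap[i], trackmap[i + 1]
    w = (s - m0) / (m1 - m0)
    return tuple(a + w * (b - a) for a, b in zip(p0, p1))


def odo_pseudorange_difference(observed_pr, sat_pos, train_pos):
    """Observed (corrected) pseudo-range minus the ODO-derived geometric range."""
    return observed_pr - math.dist(sat_pos, train_pos)


# Toy run: constant 50 m/s for 10 s along a straight 1 km track on the x-axis.
times = [float(t) for t in range(11)]
speeds = [50.0] * 11
s = integrate_speed(times, speeds)[-1]
trackmap = [(0.0, 0.0, 0.0, 0.0), (1000.0, 1000.0, 0.0, 0.0)]
pos = position_on_track(trackmap, s)
sat = (500.0, 0.0, 20_000_000.0)
clean = odo_pseudorange_difference(math.dist(sat, pos), sat, pos)
spoofed = odo_pseudorange_difference(math.dist(sat, pos) + 150.0, sat, pos)
```

With an authentic observation the difference stays near zero, whereas an injected 150 m pseudo-range offset appears directly in the difference, which is what makes this quantity useful as a feature.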
Dilution of Precision
The dilution of precision (DOP) describes how the geometry of the observed satellites affects the accuracy of the GNSS receiver's current position solution, and it is expressed as the square root of the sum of the squared latitude, longitude, and elevation errors of the position. Among the different DOP quantities, the position accuracy factor, denoted as the position dilution of precision (PDOP), is related to the spatial distribution of the satellites. Similarly, the horizontal and vertical accuracy factors are denoted as the horizontal dilution of precision (HDOP) and vertical dilution of precision (VDOP), respectively.
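A minimal sketch of how PDOP, HDOP, and VDOP are commonly computed is given below. It assumes the textbook formulation in a local east-north-up frame, where the cofactor matrix is the inverse of the normal matrix built from line-of-sight vectors; the function names and the two sky plots are hypothetical, and the receiver used in the tests may compute its DOP outputs differently.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular geometry matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def dops(az_el_deg):
    """Return (PDOP, HDOP, VDOP) from satellite (azimuth, elevation) pairs in degrees.

    Rows of the geometry matrix G are [east, north, up, 1]; Q = (G^T G)^-1.
    """
    G = []
    for az, el in az_el_deg:
        a, e = math.radians(az), math.radians(el)
        G.append([math.cos(e) * math.sin(a), math.cos(e) * math.cos(a),
                  math.sin(e), 1.0])
    GtG = [[sum(row[i] * row[j] for row in G) for j in range(4)]
           for i in range(4)]
    Q = invert(GtG)
    hdop = math.sqrt(Q[0][0] + Q[1][1])
    vdop = math.sqrt(Q[2][2])
    pdop = math.sqrt(Q[0][0] + Q[1][1] + Q[2][2])
    return pdop, hdop, vdop


# Hypothetical sky plots: low satellites spread in azimuth versus satellites
# that are all close to the zenith.
spread = [(0, 20), (90, 20), (180, 20), (270, 20), (45, 85)]
near_zenith = [(0, 60), (90, 60), (180, 60), (270, 60), (45, 85)]
pdop_spread, hdop_spread, vdop_spread = dops(spread)
pdop_zenith, _, _ = dops(near_zenith)
```

The near-zenith constellation yields a larger PDOP than the well-spread one, which illustrates why a change in the set of tracked satellites under spoofing can show up in the DOP features.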
Using the abovementioned feature quantities, the structure of the model training sample for spoofing recognition can be determined. Because the spatial movement of a train is constrained by the railway track and repeated within a period according to a specific railway time schedule, it is possible to construct a sample dataset considering the spatiotemporal similarity of a specific railway line. By combining these feature quantities with the spoofing label according to the sample data collection conditions, the structure of the training sample can be determined. The sample feature format
BO-LightGBM spoofing recognition
Using the GNSS observations received by the train-borne receiver, data samples can be collected for data-driven modeling, which describes the relationship between the feature set and the spoofing indication. To ensure the performance of the derived model, an advanced modeling algorithm is proposed in this study. The procedures of the entire spoofing recognition solution can be summarized in the following three steps.
1. Dataset collection: The field data collected in practical railway operations can be recorded to generate a dynamic operation scenario in GNSS spoofing injection tests. Based on a practical train trajectory and specific GNSS observation configurations, such as satellite visibility and signal propagation features, different GNSS spoofing interference events can be incorporated into a GNSS signal generator. Using the data recorded by the receiver under test (RUT), the datasets for model training and testing can be established through preprocessing.
2. Off-line model training: Using datasets from the RUT under different GNSS spoofing scenarios, the proposed modeling and classification algorithm is adopted to optimize the spoofing recognition model through multiple iterations. The offline modeling results are then utilized in the online TPS operation. When new data samples arrive from additional tests and scenarios, offline modeling can be reactivated to update the model for subsequent applications.
3. Online spoofing recognition: During the practical operation of a TPS, real-time GNSS observations are extracted and converted into samples according to the structure of the feature quantities. By employing the spoofing interference recognition model, the indicator of the GNSS spoofing attack status can be identified and fed to the LPU to switch the online location determination logic and mitigate the effect of the spoofing attack.
The core of this solution is the capability and performance of the proposed modeling method. To address the problem of conventional classification modeling algorithms with constrained performance, an advanced modeling solution is proposed as follows:
LightGBM classification method
LightGBM is an effective machine-learning algorithm based on the GBDT method. It was proposed to overcome the limitations of the GBDT, namely low accuracy and high time cost when managing large datasets. By contrast, the LightGBM method significantly accelerates prediction and reduces memory consumption. 24 It adopts a leafwise strategy for decision tree growth and leaf node splitting. Training is iterative: each tree is designed to reduce the residuals of the previous iteration, that is, a new model is built in the direction of residual reduction (negative gradient) so that the difference between the prediction result and the target value of the training data decreases in each iteration, and the loss function is eventually minimized by stepwise approximation. Thus, a strong learner for recognition is derived, and the classification performance of the model can be further improved.
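The residual-fitting idea behind gradient boosting can be illustrated with a deliberately small sketch. It is a generic boosting loop with one-split trees (stumps) and squared loss on a single feature, not LightGBM's implementation; the function names, learning rate, and toy data are assumptions made for illustration.

```python
def fit_stump(x, residuals):
    """Find the threshold on a 1-D feature that best fits the residuals."""
    best = None
    for thr in sorted(set(x))[:-1]:  # skip the maximum so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda xi: lmean if xi <= thr else rmean


def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Gradient boosting with squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps, losses = [], []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - p)^2 the negative gradient equals the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
        losses.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    model = lambda xi: base + learning_rate * sum(s(xi) for s in stumps)
    return model, losses


# Toy 1-D data: label 1 ("spoofed") when the feature exceeds roughly 5.
x = [0.5, 1.0, 2.0, 3.5, 4.0, 6.0, 6.5, 7.0, 8.5, 9.0]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
model, losses = boost(x, y)
```

The recorded training loss shrinks in every round, which is the stepwise approximation described above; thresholding `model(x)` at 0.5 then separates the two classes in this toy set.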
Figure 3 depicts the principle and key features of the LightGBM method. The advantages of LightGBM are reflected in three aspects: less memory, smaller samples, and fewer features, corresponding to the three technical implementations of the histogram algorithm (HA), gradient-based one-side sampling (GOSS), and exclusive feature bundling (EFB). HA is the core feature of the LightGBM solution; it constructs discrete histograms of the features and performs data splitting on them to reduce computational complexity and improve training speed. The GOSS strategy is adopted by LightGBM for gradient sampling during training. Because samples with different gradients contribute differently to the splitting of decision tree nodes, GOSS retains the full information of large-gradient samples and partial information of small-gradient samples to accelerate the training process while maintaining the accuracy of the model. EFB is a feature engineering technique in LightGBM that bundles mutually exclusive features (features that rarely take nonzero values simultaneously) to reduce the number of features, thereby reducing the complexity of the model. The utilization of the EFB can accelerate model training, improve generalization performance, and reduce the risk of overfitting to a certain extent.
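To make the GOSS idea concrete, the sketch below keeps all large-gradient samples and a random subset of the remaining ones, re-weighting the latter so that their total contribution is roughly preserved. The function name `goss_sample`, the rates, and the toy gradients are illustrative assumptions rather than the settings used in this study.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Gradient-based one-side sampling.

    Keeps the `top_rate` fraction of samples with the largest |gradient| and a
    random `other_rate` fraction (of the full set) drawn from the rest; the
    latter are up-weighted by (1 - top_rate) / other_rate.
    Returns a list of (index, weight) pairs.
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, n_other)
    weight = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, weight) for i in sampled]


# Toy gradients: 10 poorly fitted samples (large gradients) among 100.
grads = [0.01 * (i % 7) for i in range(90)] + [1.0 + 0.1 * i for i in range(10)]
subset = goss_sample(grads)
```

Only 30 of the 100 samples are used to evaluate splits in this example, yet every poorly fitted sample is retained with unit weight.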

Principle and features of LightGBM.
Bayesian optimization strategy
Tuning the hyperparameters is an important task in ML and must be performed before training. The performance of a specific ML algorithm depends strongly on the choice of appropriate hyperparameters. The LightGBM method involves numerous hyperparameters that directly affect the structure and performance of the derived model, and BO can tune these parameters quickly and efficiently. 25 The Bayesian optimization strategy assumes that the loss function to be optimized is functionally related to the hyperparameters and provides a method to evaluate uncertainties. 26 A new combination of hyperparameters is selected based on the current estimate of the posterior probability function, and the objective function is invoked to evaluate the performance of this combination. A smaller objective function value indicates that the model prediction is closer to the true label and thus that the model performs better. Through several iterations and evaluations, the Bayesian optimization gradually approaches the optimal hyperparameter combination. 27 In each iteration, the existing samples are used to select the next sample according to a specific strategy, which further refines the approximation of the objective function. By iteratively evaluating different combinations of hyperparameters and adopting the derived optimal hyperparameter set, the performance and generalization of the model can be improved. Compared with traditional grid search or random search strategies, the Bayesian algorithm can obtain the optimal combination of hyperparameters with enhanced efficiency and performance. 28
BO-LightGBM-based spoofing recognition
To enhance the performance of the original LightGBM algorithm, a BO strategy was used to enable an integrated BO-LightGBM solution for spoofing recognition. The overall framework of BO-LightGBM for data-driven offline model training and online spoofing recognition is shown in Figure 4. The framework consisted of three parts: dataset collection, offline model training, and online spoofing recognition.
1. Dataset collection: The GNSS-based train positioning data on a practical line was used to construct a dynamic operation scenario to implement the train positioning simulation. A spoofing interference signal was injected to obtain receiver observations. The required observations were extracted and combined with the ODO data to construct the feature quantities, as in (16), which constituted the training set of the spoofing scene.
2. Off-line model training: Twenty percent of the training set was randomly selected to construct a validation set to enhance model training performance. An improved scheme based on BO-LightGBM was proposed to optimize the recognition performance of the model. The corresponding test set was used to test the trained model and determine whether it met the expected target. Finally, the trained model was stored for invocation.
3. Online spoofing recognition: During train operation, the observation feature quantities were extracted from the GNSS positioning unit in real time, and the model obtained by offline training was utilized to determine whether the TPS was disturbed by a spoofing attack.
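The random hold-out of twenty percent of the training set can be sketched as follows. The sample layout (a feature vector paired with a 0/1 spoofing label), the function name, and the fixed seed are assumptions made for illustration only.

```python
import random


def train_validation_split(samples, validation_fraction=0.2, seed=42):
    """Randomly hold out a fraction of the training samples for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_val = int(round(validation_fraction * len(samples)))
    val_idx = set(indices[:n_val])
    train = [s for i, s in enumerate(samples) if i not in val_idx]
    val = [s for i, s in enumerate(samples) if i in val_idx]
    return train, val


# Hypothetical samples: (feature_vector, label) with label 1 = spoofed.
samples = [([float(i), float(i % 3)], int(i >= 60)) for i in range(100)]
train, val = train_validation_split(samples)
```

Fixing the seed keeps the split reproducible, so that different hyperparameter combinations are compared on the same validation samples.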

Architecture of BO-LightGBM-based spoofing recognition for GNSS-based train positioning.
A flowchart of the BO-LightGBM-based modeling is shown in Figure 5. The significant difference between the LightGBM-based and advanced BO-LightGBM solutions lies in the model training procedures, which are described in the following steps:

Flow of the BO-LightGBM-based modeling for spoofing recognition.
The search ranges of the hyperparameters to be optimized, including the learning rate, maximum number of leaf nodes in the tree, and maximum tree depth, were determined. In this study, the negative of the mean of the log loss function (“
The objective function measures the difference between the model-predicted probability and the true label. It reflects the measure of uncertainty by the model and indicates the loss of information between the predictions and actual outcomes.
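For reference, the mean log loss of a binary classifier can be computed as in the sketch below. The function name `mean_log_loss`, the clipping constant, and the example probabilities are assumptions; only the use of the negated mean log loss as the objective is taken from the text.

```python
import math


def mean_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy between labels (0/1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [0, 0, 1, 1]
confident = mean_log_loss(labels, [0.1, 0.2, 0.9, 0.8])
uncertain = mean_log_loss(labels, [0.5, 0.5, 0.5, 0.5])
# An optimizer that maximizes its objective would use the negated value.
objective_confident = -confident
objective_uncertain = -uncertain
```

Predictions that are both correct and confident give a lower log loss, and therefore a higher negated objective, than uninformative predictions of 0.5.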
(a) Calculate the gradient and Hessian matrix for the samples on the current leaf node.
(b) Determine the optimal split point.
(c) Perform the split.
By setting the seeds of the random number generator, a set of initial hyperparameters was generated, and the LightGBM model was built on the training set to calculate the value of the objective function. With training datasets
By comparing the current objective function output with the value from the previous round, the current optimal hyperparameter combination can be determined based on the smaller output of the objective function. Using the expected improvement (EI) sampling function, 29 the settings of the next hyperparameters were selected for evaluation. In contrast to the original BO strategy, which uses the posterior probability to reflect the evaluation of the LightGBM hyperparameters under the current dataset, the EI sampling function is adopted in this solution to select the next combination of hyperparameters that should be evaluated and guided accordingly.
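The role of the EI sampling function can be illustrated with its standard closed form under a Gaussian posterior. The sketch assumes that a loss is being minimized and that a surrogate model (not shown) supplies a posterior mean and standard deviation for each candidate; the candidate names and numbers are hypothetical.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected improvement for minimization given a posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(best - mu - xi, 0.0)
    z = (best - mu - xi) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu - xi) * cdf + sigma * pdf


# Hypothetical surrogate predictions (mean, std) of the loss for three candidates.
candidates = {
    "lr=0.05": (0.20, 0.01),  # slightly worse than the incumbent, low uncertainty
    "lr=0.10": (0.17, 0.02),  # better predicted mean
    "lr=0.30": (0.22, 0.10),  # worse predicted mean but highly uncertain
}
best_so_far = 0.19
scores = {k: expected_improvement(m, s, best_so_far)
          for k, (m, s) in candidates.items()}
next_candidate = max(scores, key=scores.get)
```

EI rewards both a low predicted loss and a high uncertainty, so a candidate with a worse mean can still be selected when its outcome is poorly known; this balance between exploitation and exploration is what guides the choice of the next hyperparameter combination.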
Using the selected hyperparameters, the LightGBM model can be trained, and the value of the objective function can be calculated. The LightGBM training process is summarized as follows.
In each round of training, a new decision tree model
LightGBM uses a histogram algorithm that calculates the split gain for each eigenvalue and histogram bucket and finds the maximum split point gain as
LightGBM uses a leafwise strategy, where the node with the highest gain among all leaf nodes is selected preferentially for splitting into two new leaf nodes at each split. Consequently, a new decision tree model is obtained.
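The histogram-based search for a split point can be sketched as below. The gain expression is the common second-order form used by gradient boosting libraries and is stated here as an assumption, as are the equal-width binning, the regularization constant `reg_lambda`, and the toy data; LightGBM's actual binning and gain computation contain further details.

```python
def best_histogram_split(feature, grad, hess, n_bins=8, reg_lambda=1.0):
    """Scan histogram bin boundaries of one feature for the highest split gain.

    Gain: G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - G^2/(H+lambda).
    Returns (best_gain, threshold); samples with feature < threshold go left.
    """
    lo, hi = min(feature), max(feature)
    width = (hi - lo) / n_bins or 1.0
    g_hist, h_hist = [0.0] * n_bins, [0.0] * n_bins
    for x, g, h in zip(feature, grad, hess):
        b = min(int((x - lo) / width), n_bins - 1)  # bin index of this sample
        g_hist[b] += g
        h_hist[b] += h
    g_tot, h_tot = sum(g_hist), sum(h_hist)
    parent = g_tot ** 2 / (h_tot + reg_lambda)
    best_gain, best_thr, g_left, h_left = 0.0, None, 0.0, 0.0
    for b in range(n_bins - 1):  # candidate split after bin b
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = g_tot - g_left, h_tot - h_left
        gain = (g_left ** 2 / (h_left + reg_lambda)
                + g_right ** 2 / (h_right + reg_lambda) - parent)
        if gain > best_gain:
            best_gain, best_thr = gain, lo + (b + 1) * width
    return best_gain, best_thr


# Toy data with logistic loss at p = 0.5: grad = p - y, hess = p * (1 - p).
feature = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
grad = [0.5 - y for y in labels]
hess = [0.25] * len(labels)
gain, threshold = best_histogram_split(feature, grad, hess)
```

Because only bin totals are scanned, the cost of finding a split depends on the number of bins rather than on the number of distinct feature values. In a leafwise scheme, this search would be run for every current leaf, and the leaf offering the largest gain would be split next.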
According to the new decision tree model,
The current value of the objective function is calculated using the model obtained from this round of training.
Based on the prediction of the existing sampling and EI sampling function-based guidance mechanisms, hyperparameters with the objective function reaching the maximum value are selected to identify the optimal hyperparameters.
The final LightGBM model was trained using the optimal hyperparameters, and the performance of the model was evaluated using the test set. Through all the operation steps, the ultimately derived learner can be described as
Pseudocode for BO-LightGBM modeling.
BO-LightGBM: Bayesian optimization-light gradient boosting machine
Through BO-LightGBM model training using specific datasets, the importance of all the sample features presented in the typical sample features section can be determined. The feature importance sequence represents the weights of all features, which determine the characteristics of the derived spoofing recognition model. Note that the feature weights of different models may vary considerably across training sample datasets. The advantages of the proposed modeling solution, as reflected in its architecture and procedures, are summarized as follows.
1. Automatic adjustment of hyperparameters. BO can automatically explore the hyperparameters and find the optimal hyperparameter combination without manually traversing all the different combinations. Automatic adjustments reduce the subjectivity and workload of manual operations.
2. Global optimal performance. The BO strategy takes the results of previously evaluated hyperparameter combinations into account, which allows it to search the hyperparameter range more comprehensively and avoid falling into a local optimum.
3. High efficiency. LightGBM provides an efficient gradient-boosting framework for processing large-scale datasets and high-dimensional features. With BO, the BO-enhanced LightGBM solution can identify suitable hyperparameters at a limited time cost, thereby significantly accelerating the model training and tuning process.
Test and analysis
Test platform configuration
To demonstrate the interference recognition capability of the TPS under a spoofing attack, a practical TPS data log recorded on the Shenyang–Heishan high-speed railway was used to generate GNSS observation scenarios under spoofing attacks by injecting different spoofing events in a spoofing simulation test platform. A Spirent GSS9000 signal generator, GNSS signal simulator with the SimGEN spoofing control tool, and RUT (u-blox EVK-M8N) were employed in the test platform. The construction and information flow are shown in Figure 6, and Figure 7 shows the setup of the entire test environment.

Platform setup and information flow.

Overview of the spoofing test environment.
The collected data were utilized to generate the train trajectory script and GPS satellite observation scenario using the SimGEN tool to reconstruct the real GNSS observation conditions. In the spoofing test scenario, the corresponding GPS ephemeris was imported. The SimSAFE tool was configured with SimGEN using specific spoofing commands covering both ramp and sinusoidal GPS spoofing modes. Note that the following assumptions must be made on this platform to ensure that the RUT can be successfully spoofed.
1. The power level of the generated spoofing signal is higher than that of the actual GPS satellite signal, and the RUT can successfully capture the spoofing signal.
2. Differences exist in the Doppler and code phases between the actual GPS and spoofing signals.
3. The power level of the spoofing signal can be adjusted to increase gradually; thus, the spoofing signal can occupy satellite-tracking channel(s) and mislead the RUT.
4. The satellite channels of the signal simulator are sufficient to realize both authentic GNSS and spoofing signals.
All assumptions were fulfilled by the signal generator in this platform, which ensured a successful spoofing scenario simulation and testing of the spoofed RUT for data collection.
Dataset construction
The training and validation datasets were constructed by collecting the RUT output data under different GPS spoofing and nonspoofing scenarios. Spoofing interference was injected when the scenario ran for 600 s. There were 40968 groups of samples obtained from the tests under different testing scenarios.
For the proposed BO-LightGBM modeling solution trained with these datasets, the search ranges and the selected optimal combination of hyperparameters are listed in Table 3.
Search range and optimal combination of selected hyperparameters of BO-LightGBM model.
BO-LightGBM: Bayesian optimization-light gradient boosting machine.
In the spoofing injection test, two typical GPS spoofing modes—sinusoidal and ramp—were investigated to illustrate the capability of the proposed modeling solution. Three different settings were used for the sinusoidal mode, and the ramp mode involved two configurations. Details of five different spoofing scenarios, represented as T1–T5, are listed in Table 4. To illustrate the impact of the spoofing attack, the results from the RUT under two representative scenarios (T1 for the sinusoid and T5 for the ramp) were extracted to analyze the time-varying characteristics of the sample features.
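The two spoofing modes can be pictured with simple offset profiles, as in the sketch below. It assumes that the attack acts as a time-varying offset on the pseudo-range and that samples are labeled as spoofed from the injection time onward; the 600 s onset is taken from the dataset description, whereas the ramp rate, amplitude, period, and function names are illustrative placeholders and not the settings of T1–T5.

```python
import math


def spoofing_offset(t, mode, t_start=600.0, rate=1.0, amplitude=100.0, period=60.0):
    """Pseudo-range offset (m) at time t (s) for a ramp or sinusoidal attack.

    All numeric defaults except the 600 s onset are illustrative placeholders.
    """
    if t < t_start:
        return 0.0
    dt = t - t_start
    if mode == "ramp":
        return rate * dt
    if mode == "sinusoid":
        return amplitude * math.sin(2.0 * math.pi * dt / period)
    raise ValueError(f"unknown mode: {mode}")


def spoofing_label(t, t_start=600.0):
    """Label of a sample: 1 once the injection has started, otherwise 0."""
    return int(t >= t_start)


times = [0.0, 599.0, 600.0, 615.0, 630.0, 660.0]
ramp = [spoofing_offset(t, "ramp") for t in times]
sine = [spoofing_offset(t, "sinusoid") for t in times]
labels = [spoofing_label(t) for t in times]
```

A ramp offset grows steadily after the onset, while a sinusoidal offset oscillates around zero, so the two modes leave different signatures in range-related features.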
Details of different spoofing interference tests.
1. Results under T1 scenario
Figure 8 shows the detailed changes in the sample features with time when spoofing interference is injected into the two satellite channels. The channels can be represented by satellite pseudo-random noise (PRN) indices of 15 and 20. In all the subfigures, the absolute time (in seconds) of the test system is shown on the

Sample features under the T1 scenario.
It can be observed that after 600 s, the pseudo-range residual and carrier-phase-smoothed pseudo-range residual oscillate periodically in a sinusoidal form. The GNSS/ODO pseudo-range difference exhibits a large abrupt change after 600 s, and sinusoidal changes occur in the two spoofing-injected satellite channels (PRN = 15 and 20). Because the ODO measurement is immune to spoofing, this feature shows that the RUT was affected by the injected attack event and that the navigation calculation performance might degrade. In addition, the pseudo-range residual, carrier-phase-smoothed pseudo-range residual, and GNSS/ODO pseudo-range difference of the satellites without spoofing injection also show obvious sinusoidal characteristics after 600 s. For the satellite observation status-related features, significant changes were observed after the spoofing injection. The

SPP errors under the T1 scenario.
Periodic oscillation in the form of a sinusoid is present in all three directions. The characteristics of the errors are consistent with the behavior of all sample features derived from the tests for data collection.
2. Results under T5 scenario
Figure 10 shows the detailed variation of all the sample features when four satellites (PRN = 15, 20, 24, and 32) were injected with spoofing interference. Note that not all satellites can be observed at all times, owing to the relative positional relationship between each satellite and the target train. According to the GPS ephemeris, a satellite with PRN = 25 can be observed near the end of the time span, as shown in Figure 10.

Sample features under the T5 scenario.
After 600 s from the start of the test, the pseudo-range residual, carrier-phase-smoothed pseudo-range residual, and GNSS/ODO pseudo-range difference increased significantly in a ramp form. The

SPP errors under the T5 scenario.
Based on the analysis of the sample features under specific scenarios, the sensitivity of these features to spoofing attacks can be revealed, which can be used to illustrate the potential of sample sets in achieving effective model training for recognizing spoofing interference.
Recognition performance analysis
To demonstrate the performance of the proposed BO-LightGBM solution for spoofing recognition, independent operations were performed for four different solutions, as follows.
M1: GBDT (Gradient Boosting Decision Tree)
M2: XGBoost (eXtreme Gradient Boosting)
M3: Original LightGBM (Light Gradient Boosting Machine)
M4: Proposed BO-LightGBM
The same sample sets were employed in parallel operations of the four solutions for comparison. To illustrate the performance of the proposed solution (M4) over other referencing methods, a comparative analysis of the receiver operating characteristic (ROC) curves and confusion matrices from all solutions was performed. The ROC curve was used to illustrate the combined sensitivity and specificity trends of derived classifiers using a specific solution. The confusion matrix reflects the model prediction results for each category and provides an effective indication of the classification performance of the model. Figures 12–15 show the derived ROC curves and confusion matrices from all involved solutions under five scenarios (T1–T5).

ROC curves and confusion matrices (M1).

ROC curves and confusion matrices (M2).

ROC curves and confusion matrices (M3).

ROC curves and confusion matrices (M4).
The results illustrate that the proposed solution (M4) achieves a superior performance level over the other reference methods under different spoofing scenarios. It achieved a higher accuracy level in model prediction for all testing scenarios, and the classification performance was more stable and reliable than that of the other methods. In addition, the area under the curve (AUC) was investigated to intuitively evaluate the prediction accuracy of the model. Specifically, the AUC of the proposed solution (M4) was significantly improved under T1–T3. Considering the comparison under the T1 scenario as an example, the AUC from M4 was improved by 17.54%, 7.85%, and 2.70% over the other three methods. Only for the T4 and T5 cases with ramp-mode spoofing did a slight decrease of 0.089% and 0.059%, respectively, occur for M4 relative to the M2 solution. However, among these methods, only M4 achieved a high level (above 0.99799) for all situations. According to the above local and global comparative analyses, the overall performance of M4 surpassed that of the other three solutions.
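For readers who wish to reproduce an AUC value from classifier scores, one simple way is the pairwise-ranking definition sketched below. It is a generic illustration with hypothetical labels and scores, not the evaluation code used for M1–M4, and its quadratic cost makes it suitable only for small sets.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 0, 1, 1, 1]
perfect = roc_auc(y_true, [0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
mixed = roc_auc(y_true, [0.1, 0.6, 0.3, 0.7, 0.2, 0.9])
```

A classifier that ranks every spoofed sample above every authentic one reaches an AUC of 1, and each misordered pair lowers the value.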
Building on the model index analyses, we examined the recognition capability of the four classification models to evaluate their effectiveness in implementation. Four quantitative indicators, accuracy (A), precision (P), recall rate (R), and F1 score, were statistically compared, and the results are shown in Figures 16–20. The results in these figures represent the recognition capability of M1–M4 for injected GPS spoofing interference under the T1–T5 schemes.

Results of performance indicators under T1.

Results of performance indicators under T2.

Results of performance indicators under T3.

Results of performance indicators under T4.

Results of performance indicators under T5.
The quantitative indicators adopted were defined as follows:
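In their conventional confusion-matrix form, assumed here with spoofing treated as the positive class, the four indicators can be written as shown below. The symbols TP, TN, FP, and FN (true positives, true negatives, false positives, and false negatives) are notation introduced for this sketch.

```latex
\[
\begin{aligned}
A  &= \frac{TP + TN}{TP + TN + FP + FN}, \\
P  &= \frac{TP}{TP + FP}, \\
R  &= \frac{TP}{TP + FN}, \\
F1 &= \frac{2\,P\,R}{P + R}.
\end{aligned}
\]
```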
All four methods thus showed some spoofing recognition capability under the different spoofing interference modes and scenarios. The four quantitative indicators from our proposed solution reached a high level of over 99.78% in identifying whether spoofing interference occurred. Under scenarios T1–T3 with a sinusoid-type attack, the accuracy of M4 was improved by 25.51% over the other three methods, and a significant improvement of 21.36% was realized for the F1 score of the model. The precision improved by a maximum of 42.73%. However, the recall of M4 differed slightly from those of M1 and M2, by 0.17%, under these spoofing attack scenarios. Under the T4 and T5 scenarios, which test the performance with the ramp-type attack, all four quantitative indicators of our proposed solution reached a high performance level but did not outperform M2, although the difference was limited to less than 0.11%. The comparison with the original LightGBM without BO shows that the BO mechanism effectively improved the capability of the model to recognize spoofing attacks. Note that uncertainty quantification plays a pivotal role in easing the impact of uncertainties during both optimization and decision-making. 30 Although the BO directly influences the determination of the combination of hyperparameters for optimizing the LightGBM, the effect of uncertainty in the BO can be indicated by the recognition performance achieved by the entire solution, which validates the effectiveness of the design of this solution.
Considering both nonspoofing and GPS spoofing scenarios, Table 5 summarizes the missed and false alarm rates in the identification and classification of injected spoofing interference during the tests. Under the sinusoidal spoofing interference scenarios (T1–T3), the M4 solution proposed in this study was significantly better than the other three modeling schemes. Under the T1 scenario, the missed alarm rate of M4 was reduced by 38.48%, 23.48%, and 0.62%, and the false alarm rate was reduced by 29.87%, 14.60%, and 0.27% compared with M1–M3, respectively. Under the T2 scenario, the missed alarm rate of M4 was reduced by 9.73%, 9.87%, and 1.08%, and the false alarm rate was reduced by 5.22%, 5.30%, and 0.48%, respectively. The T3 scenario showed an improvement similar to that of T2: reductions of 11.65%, 10.64%, and 1.18% were achieved for the missed alarm rate, and those for the false alarm rate were 6.25%, 5.64%, and 0.51%, respectively. The situation differs under the ramp-spoofing interference scenarios (T4 and T5), where the proposed solution (M4) did not always perform the best among the tested models. For the T4 scenario, M4 did not improve on the missed and false alarm rates of M1 and M2, but the differences were small, at less than 0.09%. Compared with M3, the two indices improved by 0.54% and 0.26%, respectively. Under the T5 scenario, M4 outperformed both M1 and M3. It did not perform better than M2; however, the differences were limited to only 0.27% for missed alarms and 0.11% for false alarms. Overall, the M4 solution proposed in this study achieves significant comprehensive advantages under these test scenarios, particularly in the sinusoidal spoofing scenarios. It achieves stable and constrained missed and false alarm rates under all scenarios, ensuring that it can effectively manage different spoofing interference situations.
Missed and false alarm rates for spoofing identification.
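The indicators and alarm rates discussed above can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch under stated assumptions: spoofing is the positive class, the missed alarm rate is taken as FN / (TP + FN), and the false alarm rate as FP / (FP + TN). These are the conventional definitions and may differ in detail from those used to produce Table 5. The function name `detection_metrics` and the example counts are hypothetical.

```python
def _safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, fp, tn, fn):
    """Compute recognition indicators from confusion-matrix counts.

    Assumes spoofing is the positive class (an assumption of this sketch).
    """
    precision = _safe_ratio(tp, tp + fp)
    recall = _safe_ratio(tp, tp + fn)
    return {
        "accuracy": _safe_ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": _safe_ratio(2 * precision * recall, precision + recall),
        # Spoofed samples that were classified as nonspoofed
        "missed_alarm_rate": _safe_ratio(fn, tp + fn),
        # Nonspoofed samples that were classified as spoofed
        "false_alarm_rate": _safe_ratio(fp, fp + tn),
    }


# Hypothetical counts, for illustration only
example = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```

Under these definitions the missed alarm rate equals one minus the recall, so the two quantities always carry the same information.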
Analysis under synthetic spoofing scenario
Based on the performance evaluation and analysis using a single spoofing signal mode, we further investigated the model performance of the proposed solution under more complicated spoofing attack situations. In contrast to tests using only the sinusoid or ramp modes, the two spoofing modes were integrated to constitute a synthetic interference condition using the same train trajectory and GPS satellite observation scenario settings. The same RUT was used to collect datasets for model training and testing. As before, the spoofing injection was triggered when the test scenario had run for 600 s. Sinusoid-mode spoofing was first injected into three satellite channels (PRN = 15, 20, and 24), with the same sinusoid signal settings as in the T1 scenario. The spoofing power slope was held at a stable level of 0.3 from the start of the spoofing injection. When the test scenario had run for 1200 s, the ramp-spoofing attack was activated in the same satellite channels (PRN = 15, 20, and 24), with the slope set to 3 m/s, the same as in the T5 scenario. Using the datasets collected under the synthetic scenario, all four solutions (M1–M4) were run independently for comparative analysis.
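The timeline of the synthetic scenario can be sketched as a piecewise offset applied to each spoofed channel. The 600 s and 1200 s trigger times and the 3 m/s ramp slope come from the description above. Everything else is an assumption made for illustration: the sinusoid amplitude and period are hypothetical placeholders (the T1 settings are not restated in this section), the offset is interpreted as a range-type error in metres, and the ramp is assumed to be superimposed on the sinusoid rather than replacing it.

```python
import math

SINUSOID_START_S = 600.0     # injection trigger time (from the text)
RAMP_START_S = 1200.0        # ramp activation time (from the text)
RAMP_SLOPE_M_PER_S = 3.0     # ramp slope (from the text)
SINUSOID_AMPLITUDE_M = 10.0  # hypothetical placeholder, not the T1 value
SINUSOID_PERIOD_S = 100.0    # hypothetical placeholder, not the T1 value
SPOOFED_PRNS = (15, 20, 24)  # spoofed satellite channels (from the text)


def spoofing_offset_m(t_s):
    """Illustrative spoofing offset (metres) at scenario time t_s (seconds)."""
    offset = 0.0
    if t_s >= SINUSOID_START_S:
        phase = 2.0 * math.pi * (t_s - SINUSOID_START_S) / SINUSOID_PERIOD_S
        offset += SINUSOID_AMPLITUDE_M * math.sin(phase)
    if t_s >= RAMP_START_S:
        # Assumption: the ramp adds to the sinusoid instead of replacing it
        offset += RAMP_SLOPE_M_PER_S * (t_s - RAMP_START_S)
    return offset


def channel_offset_m(prn, t_s):
    """Offset for a given PRN: nonzero only on the spoofed channels."""
    return spoofing_offset_m(t_s) if prn in SPOOFED_PRNS else 0.0
```

For example, `channel_offset_m(15, 625.0)` evaluates the sinusoid a quarter period after injection, whereas `channel_offset_m(7, 625.0)` returns zero because PRN 7 is not among the spoofed channels.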
The recognition capabilities of the four models were investigated using a specific testing set (T6) under the synthetic spoofing scenario. The same quantitative indicators, namely accuracy (A), precision (P), recall rate (R), and F1 score, were statistically compared, as shown in Figure 21, which illustrates the recognition capabilities of M1–M4 under this synthetic scenario. Considering both nonspoofing and spoofing attack scenarios, Table 6 summarizes the missed and false alarm rates in the recognition of complicated spoofing interference events.

Results of performance indicators under T6.
Missed and false alarm rates under synthetic spoofing scenarios.
The results reconfirm the performance advantages of the proposed solution (M4). Under this complicated spoofing scenario, the four quantitative indicators from our proposed solution reached a high level of over 99.80% in identifying whether spoofing interference occurred. The accuracy, precision, recall, and F1 score improved by 4.35%, 4.89%, 6.12%, and 8.00%, respectively, over the three reference methods. Regarding the missed and false alarm rates, all four solutions achieve lower rates than in the single-mode spoofing scenarios T1–T5, which results from the more pronounced spoofing effects when the two spoofing signal settings are integrated. The performance advantage of M4 remained under the T6 scenario: M4 reduced the missed alarm rate by 4.76%, 4.25%, and 2.59% compared with M1–M3, and the false alarm rate by 6.25%, 5.64%, and 0.51%, respectively. The results from this complicated GNSS spoofing scenario further demonstrate that the proposed BO-LightGBM solution achieves high stability and the strongest reliability among all methods involved. The revealed characteristics of this solution indicate significant advantages for spoofing recognition in GNSS-based train positioning and show great potential for smart diagnosis and active spoofing suppression in specific GNSS-based implementations.
Conclusions
To realize trustworthy train positioning using GNSS, this study proposes a novel active recognition solution for spoofing interference attacks. Compared with traditional machine learning-based approaches, the proposed BO-enhanced LightGBM solution can learn the characteristics of GNSS spoofing attacks through a data-driven model under the railway train operation scenario, while considering the spoofing-affected satellite observation features and the spatial constraints of the trains. In addition, the automatic adjustment of the hyperparameters through Bayesian optimization enhances the coverage of different spoofing modes by LightGBM. The results of the spoofing interference tests demonstrate the advanced recognition performance of the proposed solution over other data-driven methods, including GBDT, XGBoost, and LightGBM. In addition to the sinusoidal and ramp signal-based attack modes, the synthetic signal-based interference test further illustrates the spoofing recognition capability under a more complicated attack situation.
Note that a key challenge to the proposed solution is the uncertainty and diversity of the spoofing interference patterns, which may constrain its performance and application. It is difficult to collect sufficient datasets under real but unknown spoofing attack scenarios. Thus, we chose to create different spoofing attack modes in the spoofing injection test environment to alleviate the challenges and limitations caused by interference patterns. In the future, effective spoofing monitoring and recording tools will be introduced in practical train operation environments to capture practical events and obtain additional field datasets. We will also focus further on active protection measures against multisource and multimode GNSS spoofing interference to ensure the credibility and safety of novel GNSS-enabled railway systems. More advanced data-driven models and calculation platforms should be considered to further improve the solution and cope with the risks posed by various GNSS attacks.
