Sage Journals: Discover world-class research

Abstract

Trustworthy positioning is critical in the operational control and management of trains. For a train positioning system (TPS) based on a global navigation satellite system (GNSS), a spoofing attack significantly threatens the trustworthiness of positioning. However, the influence and recognition of GNSS spoofing attacks are not considered in the existing research on GNSS-enabled TPS. Spoofing attacks affect the performance of GNSS observations and the positioning results, allowing the development of data-driven spoofing recognition solutions. This study aims to achieve effective spoofing recognition for active security protection in TPS. Different features were designed to reflect the effects of a spoofing attack, including GNSS observation-related indicators and odometer-enabled parameters, and a novel Bayesian optimization-light gradient boosting machine (BO-LightGBM) solution was proposed. In particular, a Bayesian optimization technique was introduced into the LightGBM framework to improve the hyperparameter determination capability for recognition model training. Using a GNSS spoofing test platform with a specific GNSS signal generator and the SimSAFE spoofing test tool, different spoofing attack modes were tested to collect sample datasets for model training and evaluation. The results of model establishment and comparison of the model performance indicators illustrated the advantages of the proposed solution, its adaptability to different spoofing attack situations, and its superiority over state-of-the-art modeling strategies.

Keywords

Railway, train positioning system, spoofing attack, recognition model, machine learning

Introduction

In recent years, with the rapid development of railway transportation, there have been widespread concerns regarding the security assurance of railway train control systems. Global navigation satellite systems (GNSSs), which can provide accurate and real-time information regarding the position, speed, direction, and other states of trains, have become an inevitable choice for precise train positioning in novel location-enabled railway systems.¹ However, satellite signals are weak upon reaching the ground, and the details of civilian signals are open to the public. Consequently, the GNSS receiver embedded in the train positioning unit is highly susceptible to intentional or unintentional interferences.² In the past, the GNSS-based applications primarily focused on the effect and countermeasures to assure the functional safety,³ lacking the full-life-cycle design for an active protection against cybersecurity risks, which had received in-depth attention in various areas, such as the smart cities⁴ and intelligent systems.⁵ The threat owing to GNSS vulnerability lies in interference, particularly GNSS spoofing interference. A GNSS spoofing signal, which is transmitted or generated to simulate fake satellite signals, can force the receiver to derive an inaccurate position, velocity, and time report, thereby misleading the operation of target systems equipped with the receiver. By leading the target system to a predetermined location through a malicious spoofing attack, the safety of the operational system and human lives will be significantly threatened with serious risks.⁶ For specific GNSS-based safety-related services and applications, effective spoofing protection measures must be emphasized and prioritized.

GNSS spoofing has been reported in several incidents worldwide, which demonstrate the vulnerability of GNSS to false signals. A portable civilian global positioning system (GPS) spoofer was developed and utilized to assess the threat of spoofing and successfully conduct deception experiments on a superyacht.⁷ In addition, a GNSS navigation spoofing device can be constructed using a GNSS signal simulator, signal amplifier, generator, and other equipment, and such a device has successfully deceived an operating truck through a spoofing attack.⁸ With the development of specific techniques, the realization of GNSS spoofing attacks has become easier. For example, even an inexpensive software-defined radio can make a smartphone believe that it is in a different location. These cases illustrate the severe consequences of intentional GNSS spoofing interference and highlight the urgent need to develop effective spoofing protection measures to enhance the security levels of GNSS devices.

Numerous studies have been conducted on the detection and recognition of GNSS spoofing. To detect the direction of the spoofing signal, Ong et al.⁹ proposed a method to estimate the azimuth angle of the GPS carrier signal using a dual antenna and an interferometer. However, the addition of an interferometer increases the recognition cost. An antenna array can be used to detect a single deception signal or deception signals from multiple directions and eliminate deception interference effects.¹⁰,¹¹ This solution requires multiple antennas, and the antenna length affects the recognition performance. Because the modification of receiver hardware facilities is usually difficult in currently utilized systems, antispoofing can instead be realized by identifying unreasonable jumps in signal quantities, such as the carrier amplitude and carrier phase, especially at the beginning of spoofing.¹² Monitoring the received power provides an effective solution, which requires viewing all the received carrier amplitude values and automatic gain control settings at the RF front-end of the receiver. Because a spoofer requires a high power level, a sudden change in the received power may indicate a spoofing attack.¹³,¹⁴ However, for this category, the abrupt change only occurs at the initial stage of the spoofing event; thus, this method is suitable for strong and short-lived scenarios. To continuously identify the existence of GNSS spoofing, different solutions have been proposed that encrypt the signal navigation information, making it difficult for an attacker to obtain and change the navigation information in the satellite signal. An ideal method is based on symmetric encryption.¹⁵ If the signal correlation peak is extremely high, the two noisy versions of the encrypted signal interact, indicating that the signal is true. Otherwise, the potential victim receiver is proven to have been spoofed.¹⁶,¹⁷ The encryption-enabled solutions require a secure receiver network to generate a real version of the encryption code and a secure communication network to send the signal to the signal processing unit that can check the correlation. However, the specified requirements for related facilities may limit the implementation of this solution. Considering the possible involvement of auxiliary devices that are immune to GNSS spoofing, non-GNSS information-assisted solutions have been proposed for spoofing recognition and protection. By analyzing the measurement models of the GNSS and inertial measurement unit, a spoofing recognition model has been proposed and examined using a generalized likelihood ratio test.¹⁸

With the rapid development of machine learning (ML) technology, data-driven spoofing recognition methods have become a major concern in recent years owing to their robustness, high generalization ability, and ability to analyze and extract effective information. Typical ML solutions have been adopted and demonstrated in spoofing recognition and suppression, such as the gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light-gradient boosting machine (LightGBM). ML-based solutions have shown great potential for modeling and recognition. However, there is a scope for enhancing the accuracy and robustness of spoofing awareness and recognition. To address the problems of existing methods, this study proposes a data-driven spoofing recognition solution using the LightGBM model. Among the various optimization algorithms that can be utilized to enhance artificial intelligence techniques¹⁹ and data-driven classifiers,²⁰ the Bayesian optimization (BO) algorithm²¹ has shown superior empirical performance in several optimization-based applications. Therefore, the BO algorithm is introduced into the original LightGBM to realize a BO-LightGBM solution to further improve the spoofing recognition performance. Using field-data-based spoofing injection tests, the advantages of the proposed BO-LightGBM method in terms of recognition accuracy and robustness were illustrated.

The remainder of this paper is organized as follows: the sample features in spoofing recognition section describes sample features for spoofing recognition. The BO-LightGBM model and corresponding spoofing recognition solution for GNSS-based train positioning are presented in the BO-LightGBM spoofing recognition section. The test and analysis section presents the test results and corresponding analysis. The conclusions section draws conclusions and briefly reviews future plans.

Sample features in spoofing recognition

GNSS signal observation model

A GNSS-based autonomous train positioning system (TPS) with a train-centric design is composed of a GNSS receiver, an antenna, assistant location sensors, and a location processing unit (LPU). Specific assistant sensors can be used in a TPS, such as an odometer (ODO) and a Doppler radar. Figure 1 shows the typical architecture of a TPS.

Figure 1.

Architecture of GNSS-based TPS.

The GNSS receiver obtains the satellite signal for navigation calculations, and the result is used to determine the final location report by the LPU. The satellite signal $s (t)$ received by a train-borne receiver at instant t can be expressed as follows:

$s (t) = s^{r e} (t) + s^{s p} (t) + n (t)$ (1)

$s^{r e} (t) = \sum_{i = 1}^{M} \sqrt{2 \cdot P_{R}} \cdot x_{i} (t - τ_{R}) \cdot D_{R} (t - τ_{R}) \times \sin (2 π (f_{1} + f_{d R}) (t - τ_{R}) + θ_{R}),$ (2)

$s^{s p} (t) = \sum_{j = 1}^{N} \sqrt{2 \cdot P_{S}} \cdot x_{j} (t - τ_{S}) \cdot D_{S} (t - τ_{S}) \times \sin (2 π (f_{1} + f_{d S}) (t - τ_{S}) + θ_{S}),$ (3)

where $s^{r e} (t)$ and $s^{s p} (t)$ denote the authentic and spoofing signals, respectively; $n (t)$ indicates the observation noise; M and N denote the number of authentic and fake satellites observed at instant t, respectively; $P_{R}$ and $P_{S}$ represent the average received power of satellites i and j, respectively; $x_{i}$ and $x_{j}$ are the ranging codes; $τ_{R}$ and $τ_{S}$ are propagation delays; $D_{R}$ and $D_{S}$ are the corresponding data codes; $f_{1}$ is the carrier frequency; $f_{d R}$ and $f_{d S}$ represent the Doppler shifts; and $θ_{R}$ and $θ_{S}$ are the initial carrier phases of the authentic and spoofing signals, respectively.

The received signal undergoes different processing stages in the receiver, including capture, tracking, and navigation calculations with different observation variables, such as the pseudo-range, carrier phase, and carrier-to-noise density ratio (C/N₀).²² When the TPS is subjected to spoofing interference, the features of these observation variables change, enabling the recognition of the existence and level of spoofing interference. When sufficient data samples under specific spoofing attack scenarios can be collected, a data-driven recognition model becomes practical; such a model is expected to be sensitive to the presence of GNSS spoofing and to provide a timely alert that identifies the spoofing attack. However, it is difficult to evaluate the relationship between direct GNSS observations and spoofing labels because spoofing attacks can occur under numerous satellite-receiver situations and signal power-level conditions. To obtain effective features for establishing a data-driven spoofing recognition model, additional information that is not affected by spoofing should be introduced. As a typical assistant train positioning sensor, an ODO calculates the relative along-track location of a train through wheel rotation without receiving any external signal. Hence, both the GNSS observation models and the odometer can be investigated to design sample features for dataset collection.

Typical sample features

Pseudo-range residual

The pseudo-range refers to the measured satellite-user distance converted from the signal propagation time from a satellite to the receiver. Observation errors exist owing to several factors, such as propagation delay, clock error, and signal interference. Using satellite ephemeris data and specific propagation delay models, the corrected pseudo-range $ρ_{i, t}^{p}$ at instant t can be obtained as $ρ_{i, t}^{p} = ρ_{i, t} + c \cdot (δ_{s} - δ_{r}) - ε_{i, ion} - ε_{i, tro},$ (4) where $ρ_{i, t}$ indicates the pseudo-range observation of the $i$ th satellite at instant t; c is the speed of light; $δ_{s}$ and $δ_{r}$ denote the clock errors of the satellite and receiver, respectively; and $ε_{i, ion}$ and $ε_{i, tro}$ indicate the ionospheric and tropospheric delay errors, respectively.

Under normal operating conditions with four or more satellites observed at the same instant, the actual distance $ρ_{i, t}^{a}$ at t between the satellite and receiver can be obtained by combining the positions between the satellite and receiver based on the following criterion: $ρ_{i, t}^{a} = \sqrt{{(x_{a, t} - x_{i, t})}^{2} + {(y_{a, t} - y_{i, t})}^{2} + {(z_{a, t} - z_{i, t})}^{2}},$ (5)where $(x_{a, t}, y_{a, t}, z_{a, t})$ is the actual position at t of the TPS GNSS antenna, which can be obtained using the least squares (LS) estimator, and $(x_{i, t}, y_{i, t}, z_{i, t})$ represents the coordinates of the $i$ th satellite at t, which can be obtained from the ephemeris file.

The pseudo-range residual $λ_{i, t}^{pr}$ of satellite i at instant t can be evaluated as the difference between the corrected pseudo-range and the calculated actual distance as follows: $λ_{i, t}^{pr} = ρ_{i, t}^{p} - ρ_{i, t}^{a} .$ (6) When a receiver is spoofed, the characteristics of the set of $\{λ_{i, t}^{pr}\}$ will be different owing to the change in the pseudo-range observation $ρ_{i, t}$ , which will affect the pseudo-range residual as $λ_{i, t}^{pr (sp)} = (ρ_{i, t}^{p} + e_{i, t}^{ρ}) - ρ_{i, t}^{a},$ (7) where $e_{i, t}^{ρ}$ denotes the spoofing-induced pseudo-range error of the $i$ th satellite.

The sensitivity of $λ_{i, t}^{pr (sp)}$ to spoofing attacks provides an opportunity to deliver alerts to users. Thus, the pseudo-range residual can be employed as an observation feature to determine whether a spoofing attack has occurred.
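As an illustration of equations (4) to (6), the following minimal sketch computes the pseudo-range residual for one satellite. The function names, the coordinates, and the 150 m spoofing offset are hypothetical values chosen for the example and are not taken from the test data of this study.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in m/s


def corrected_pseudorange(rho, clock_sat, clock_rcv, iono, tropo):
    """Eq. (4): raw pseudo-range (m) corrected for clock and atmospheric errors."""
    return rho + C_LIGHT * (clock_sat - clock_rcv) - iono - tropo


def geometric_range(rcv_pos, sat_pos):
    """Eq. (5): Euclidean distance between receiver and satellite (ECEF, m)."""
    return math.dist(rcv_pos, sat_pos)


def pseudorange_residual(rho, clock_sat, clock_rcv, iono, tropo, rcv_pos, sat_pos):
    """Eq. (6): corrected pseudo-range minus geometric range."""
    corrected = corrected_pseudorange(rho, clock_sat, clock_rcv, iono, tropo)
    return corrected - geometric_range(rcv_pos, sat_pos)


# Hypothetical geometry: receiver on the surface, satellite directly overhead.
receiver = (0.0, 0.0, 6_371_000.0)
satellite = (0.0, 0.0, 26_571_000.0)
true_range = geometric_range(receiver, satellite)

# Raw observation = true range + 4 m ionospheric + 3 m tropospheric delay.
clean = pseudorange_residual(true_range + 7.0, 0.0, 0.0, 4.0, 3.0, receiver, satellite)
# The same observation with a hypothetical 150 m spoofing-induced error.
spoofed = pseudorange_residual(true_range + 157.0, 0.0, 0.0, 4.0, 3.0, receiver, satellite)
```

In this idealized case the clean residual vanishes, whereas the spoofed residual carries the injected error. In practice the receiver position comes from the LS estimate, so part of the error is absorbed into the position solution.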

Carrier phase smoothed pseudo-range residual

The carrier phase is a measure of the range between a satellite and receiver, expressed in units of cycles of the carrier frequency, which evaluates the change in the behavior of the carrier waveform of the satellite signal. The carrier phase can be used to smooth the pseudo-range, and the smoothed result can be used as an observational characteristic quantity: $ρ_{i, t}^{cp} = \frac{1}{M} ρ_{i, t} + \frac{M - 1}{M} [ρ_{i, t - 1}^{cp} + λ (ϕ_{i, t} - ϕ_{i, t - 1})],$ (8) where $ρ_{i, t}^{cp}$ is the carrier-smoothed pseudo-range of the $i$ th satellite at t, $ρ_{i, t}$ is the pseudo-range observation, $ϕ_{i, t}$ represents the carrier phase at t, λ is the carrier wavelength, and M denotes the smoothing time.

Therefore, the carrier phase-smoothing pseudo-range residual can be obtained using (7) to identify whether the receiver has been deceived. $λ_{i, t}^{cp} = (ρ_{i, t}^{cp} + e_{i, t}^{ρ}) - ρ_{i, t}^{a}$ (9)
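The recursion in equation (8) can be sketched as below. This is a minimal illustration and not the implementation used in this study: the function name is hypothetical, and the ramp-up of the smoothing constant during the first epochs is an assumption (a common choice for this type of filter) that equation (8) does not specify.

```python
def carrier_smooth(pseudoranges, carrier_phases, wavelength, window):
    """Carrier-smooth a pseudo-range series following Eq. (8).

    pseudoranges: raw pseudo-ranges in metres, one per epoch
    carrier_phases: carrier phase in cycles, one per epoch
    wavelength: carrier wavelength in metres
    window: smoothing constant M in epochs
    """
    smoothed = [pseudoranges[0]]  # initialise with the first raw observation
    for t in range(1, len(pseudoranges)):
        # Assumption: M grows from 2 up to `window` so early epochs are not over-weighted.
        m = min(t + 1, window)
        # Range change predicted from the carrier phase difference.
        delta = wavelength * (carrier_phases[t] - carrier_phases[t - 1])
        smoothed.append(pseudoranges[t] / m + (m - 1) / m * (smoothed[-1] + delta))
    return smoothed


# Hypothetical static case: true range 100 m, code noise of +/- 1 m, constant phase.
noisy = [101.0, 99.0, 101.0, 99.0]
phase = [0.0, 0.0, 0.0, 0.0]
result = carrier_smooth(noisy, phase, wavelength=0.19, window=100)
```

The smoothed series stays closer to the true range than the raw code observations because the low-noise carrier phase carries the epoch-to-epoch change.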

Carrier-to-noise density ratio (C/N₀)

Noise affects the measurement accuracy of GNSS receivers. C/N₀ is defined as the ratio of the baseband signal power to the noise power within a unit bandwidth, and the receiver uses C/N₀ to describe the degree of noise interference.²³ A large C/N₀ indicates that the baseband signal is stronger and the raw observations are more reliable. The baseband processing module outputs C/N₀ digitally together with the raw observations. It is typically expressed in decibel-hertz (dB-Hz) as $(C / N_{0})_{i, t} = 10 \cdot \lg (P_{i, t}^{(S)} / P_{i, t}^{(N)}),$ (10) where $P_{i, t}^{(S)}$ and $P_{i, t}^{(N)}$ represent the carrier signal power and the noise power within a unit bandwidth at t, respectively.

When the receiver is spoofed, it is forced to receive the spoofing signal, resulting in a stronger baseband signal. A high C/N₀ will be obtained under a strong baseband signal situation, as in (10); thus, a recognition feature can be constructed by examining the change in the C/N₀ value as $(C / N_{0})_{i, t}^{(sp)} = 10 \cdot \lg (P_{i, t}^{(S)} / P_{i, t}^{(N)}),$ (11)where it has the same meaning as $(C / N_{0})_{i, t}$ , but specifies the value under a possible spoofing attack.
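A short numerical sketch of equations (10) and (11) is given below. The power values and the 3 dB jump threshold are hypothetical and only illustrate how a stronger baseband signal raises C/N₀; they are not parameters of the proposed model, which learns the decision from data.

```python
import math


def cn0_db_hz(signal_power, noise_density):
    """Eq. (10): carrier power (W) over noise power per unit bandwidth (W/Hz), in dB-Hz."""
    return 10.0 * math.log10(signal_power / noise_density)


def cn0_jump(previous, current, threshold_db=3.0):
    """Flag an increase in C/N0 larger than a hypothetical threshold."""
    return current - previous > threshold_db


nominal = cn0_db_hz(1e-16, 1e-20)         # authentic signal only
under_attack = cn0_db_hz(4e-16, 1e-20)    # hypothetical spoofer with 4x the power
suspicious = cn0_jump(nominal, under_attack)
```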

GNSS/ODO speed difference

In a GNSS spoofing attack scenario, the GNSS-derived speed is inaccurate. By contrast, an ODO can evaluate the along-track running distance and speed of a train and is not susceptible to GNSS interference. Hence, the difference in train speed between GNSS and ODO can be used as an indication to evaluate the negative effects of spoofing. The GNSS/ODO speed difference $λ_{t}^{VODO}$ allows us to generate a sample feature quantity as follows: $λ_{t}^{VODO} = v_{O D O, t} - v_{r, t},$ (12) where $v_{O D O, t}$ denotes the odometer-derived speed at instant t and $v_{r, t}$ denotes the ground speed measurement by the GNSS receiver.

The measurement from ODO is immune to electromagnetic effects; it is relatively reliable, even under a GNSS spoofing attack scenario. When spoofing interference exists, the receiver-derived speed $v_{r, t}$ is biased; thus, the speed difference $λ_{t}^{VODO}$ can be employed as a direct indicator of the effect of the interference.

GNSS/ODO pseudo-range difference

The GNSS/ODO pseudo-range difference indicates the difference between the ODO-derived and the receiver-observed pseudo-ranges. Using the ODO-derived speed, time integration can be performed to estimate the train's along-track traveling distance $l_{t}$ from the last epoch location as $l_{t} = \int_{t - 1}^{t} v_{O D O, t} d t .$ (13) Because the train must move along railway tracks, the odometer can determine the 1D along-track distance from the last reference point (i.e. the last relevant Balise group). Using railway electronic trackmap data, the distance $l_{t}$ at each epoch calculated by the ODO measurement can be projected onto a specific track piece described by two adjacent points-of-interest (POIs), with which the coordinates can be calculated. Therefore, the difference between the GNSS pseudo-range observations and the ODO-derived equivalent pseudo-ranges can be evaluated.

Figure 2 shows the details of the $l_{t}$ projection based on the electronic trackmap data. From the starting position $P_{t - 1}$ within a specific track piece derived at the last epoch, the traveling distance $| P_{t - 1} P_{t} |$ can be used to predict the target track piece for the following epoch location, and the track piece can be determined by two adjacent endpoints: $b_{m} (N_{P}, E_{q})$ and $b_{m + 1} (N_{P + 1}, E_{q + 1})$ . Table 1 lists the procedures of the ODO-projection-based coordinate prediction based on trackmap data.

Figure 2.

Principle of ODO-based coordinate prediction based on the trackmap data.

Table 1.

Procedures of ODO-based coordinate prediction based on trackmap data.

Input: Traveling distance and electronic trackmap data
1:	for m = 1, 2, 3, … do
2:	Find the minimum distance between $P_{t - 1}$ and $b_{m}$ on the electronic map.
3:	end for
4:	According to the vectors $\bar{b_{m} P_{t - 1}}$ and $\bar{b_{m} b_{m + 1}}$ , calculate the projection length $L_{t - 1}$ of vector $\bar{b_{m} P_{t - 1}}$ on vector $\bar{b_{m} b_{m + 1}}$ as $L_{t - 1} = \frac{\bar{b_{m} P_{t - 1}} \cdot \bar{b_{m} b_{m + 1}}}{\| \bar{b_{m} b_{m + 1}} \|}$ .
5:	Calculate the coordinates of $P_{t - 1}$ within the trackmap piece as $P_{t - 1}^{'} = b_{m} + L_{t - 1} \cdot \frac{\bar{b_{m} b_{m + 1}}}{\| \bar{b_{m} b_{m + 1}} \|}$
6:	for $\| P_{1} P_{2} \|, \| P_{2} P_{3} \|, \| P_{3} P_{4} \|, \dots$ do
7:	if ( $\| P_{t - 1} P_{t} \| \leq \| b_{k - 1} b_{k} \|$ ) then
8:	$P_{t}^{'} = P_{t - 1}^{'} + \| P_{t - 1} P_{t} \| \cdot \frac{\bar{b_{k} b_{k + 1}}}{\| \bar{b_{k} b_{k + 1}} \|}$
9:	else
10:	$P_{t}^{'} = b_{k} + (\| P_{t - 1} P_{t} \| - \| P_{t - 1}^{'} b_{k} \|) \cdot \frac{\bar{b_{k} b_{k + 1}}}{\| \bar{b_{k} b_{k + 1}} \|}$
11:	end if
12:	end for
13:	Return the ODO-projected coordinates $P_{t}^{'}$

ODO: odometer.
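The procedure in Table 1 can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation of this study: the trackmap is treated as a planar polyline of (north, east) points in metres, the starting piece is selected by the smallest perpendicular distance instead of the nearest endpoint, and the function names and the three-point track are hypothetical.

```python
import math


def locate_on_track(nodes, p):
    """Return (piece index m, along-piece offset in m) of the piece closest to p."""
    best = None
    for m in range(len(nodes) - 1):
        a, b = nodes[m], nodes[m + 1]
        seg_len = math.dist(a, b)
        # Scalar projection of vector a->p onto a->b, clamped to the piece.
        offset = ((p[0] - a[0]) * (b[0] - a[0]) + (p[1] - a[1]) * (b[1] - a[1])) / seg_len
        offset = max(0.0, min(seg_len, offset))
        foot = (a[0] + offset * (b[0] - a[0]) / seg_len,
                a[1] + offset * (b[1] - a[1]) / seg_len)
        gap = math.dist(p, foot)
        if best is None or gap < best[0]:
            best = (gap, m, offset)
    return best[1], best[2]


def advance(nodes, m, offset, distance):
    """Move `distance` metres forward from (m, offset) and return the coordinates."""
    remaining = distance
    # Step over piece boundaries while the travelled distance exceeds the current piece.
    while m < len(nodes) - 2:
        seg_len = math.dist(nodes[m], nodes[m + 1])
        if offset + remaining <= seg_len:
            break
        remaining -= seg_len - offset
        m, offset = m + 1, 0.0
    a, b = nodes[m], nodes[m + 1]
    seg_len = math.dist(a, b)
    s = offset + remaining
    return (a[0] + s * (b[0] - a[0]) / seg_len, a[1] + s * (b[1] - a[1]) / seg_len)


# Hypothetical trackmap with two pieces and a last-epoch position near the first piece.
track = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0)]
piece, start_offset = locate_on_track(track, (30.0, 2.0))
predicted = advance(track, piece, start_offset, 90.0)  # ODO reports 90 m travelled
```

Here the train starts 30 m into the first piece, so a 90 m run consumes the remaining 70 m of that piece and continues 20 m into the second one. A position beyond the last POI is extrapolated along the final piece in this sketch.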

The equivalent pseudo-range $ρ_{i, t}^{(ODO)}$ based on the ODO measurement can be obtained as $ρ_{i, t}^{(ODO)} = \sqrt{{(x_{ODO, t} - x_{i, t})}^{2} + {(y_{ODO, t} - y_{i, t})}^{2} + {(z_{ODO, t} - z_{i, t})}^{2}}$ (14) where $(x_{ODO, t}, y_{ODO, t}, z_{ODO, t})$ represents the ODO-derived earth-centered earth-fixed coordinates.

The immunity of the ODO-derived coordinate location enables the evaluation of the effect of GNSS spoofing by calculating the difference $λ_{i, t}^{ODO}$ between the GNSS pseudo-range observation and the equivalent pseudo-range derived from the ODO-based projection with respect to the same satellite. Thus, an additional indicator can be introduced into the sample feature set, as follows: $λ_{i, t}^{ODO} = ρ_{i, t}^{(ODO)} - (ρ_{i, t}^{cp} + e_{i, t}^{ρ})$ (15)

Dilution of Precision

The dilution of precision (DOP) value describes how the satellite geometry affects the accuracy of the GNSS receiver's current position, and is expressed as the square root of the sum of the squared latitude, longitude, and elevation errors of the position. Among the different DOP quantities, the three-dimensional position accuracy factor, denoted as the position dilution of precision (PDOP), is related to the geometric distribution of the satellites. Similarly, the horizontal and vertical accuracy factors are denoted as horizontal dilution of precision (HDOP) and vertical dilution of precision (VDOP), respectively.
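The text does not give a formula for the DOP values, so the sketch below uses the standard textbook construction as an assumption: a geometry matrix built from east-north-up line-of-sight unit vectors plus a clock column, with the DOP values read from the diagonal of the inverse normal matrix. The function names and the satellite constellation are hypothetical.

```python
import math


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def dop_values(az_el_deg):
    """Return (PDOP, HDOP, VDOP) from satellite (azimuth, elevation) pairs in degrees."""
    g = []
    for az, el in az_el_deg:
        az, el = math.radians(az), math.radians(el)
        # East, north, up components of the line-of-sight vector, plus the clock term.
        g.append([math.cos(el) * math.sin(az), math.cos(el) * math.cos(az), math.sin(el), 1.0])
    normal = [[sum(row[i] * row[j] for row in g) for j in range(4)] for i in range(4)]
    q = invert(normal)
    pdop = math.sqrt(q[0][0] + q[1][1] + q[2][2])
    hdop = math.sqrt(q[0][0] + q[1][1])
    vdop = math.sqrt(q[2][2])
    return pdop, hdop, vdop


# Hypothetical constellation: four satellites at 30 degrees elevation and one at zenith.
pdop, hdop, vdop = dop_values([(0, 30), (90, 30), (180, 30), (270, 30), (0, 90)])
```

By construction, the squared PDOP equals the sum of the squared HDOP and VDOP, which offers a quick consistency check on receiver-reported values.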

Using the abovementioned feature quantities, the structure of the model training sample for spoofing recognition can be determined. Because the spatial movement of a train is constrained by the railway track and repeated within a period according to a specific railway time schedule, it is possible to construct a sample dataset considering the spatiotemporal similarity of a specific railway line. By combining these feature quantities with the spoofing label according to the sample data collection conditions, the structure of the training sample can be determined. The sample feature format K with respect to a satellite channel is recorded as $K = [λ_{i, t}^{pr}, λ_{i, t}^{cp}, (C / N_{0})_{i, t}, λ_{t}^{VODO}, λ_{i, t}^{ODO}, D_{t}^{P}, D_{t}^{H}, D_{t}^{V}]$ (16) where $λ_{i, t}^{pr}$ is the pseudo-range residual of the $i$ th satellite at instant t; $λ_{i, t}^{cp}$ is the carrier-phase smoothed pseudo-range residual of the $i$ th satellite at instant t; $(C / N_{0})_{i, t}$ is the carrier-to-noise density ratio of the $i$ th satellite at instant t; $λ_{t}^{VODO}$ is the GNSS/ODO speed difference at instant t; $λ_{i, t}^{ODO}$ is the GNSS/ODO pseudo-range difference of the $i$ th satellite at instant t; $D_{t}^{P}$ is the PDOP value at instant t; $D_{t}^{H}$ is the HDOP value at instant t; and $D_{t}^{V}$ is the VDOP value at instant t.

Considering the difference in the range space of the involved feature quantities, the original quantities were preprocessed using a normalization operation. To meet the requirements of model training, the entire sample dataset was divided into different subsets according to the principle of specific proportions, including the model training, verification, and test sets. To effectively utilize the datasets under specific GNSS spoofing scenarios, an effective modeling solution is the key to achieving a desirable GNSS spoofing recognition performance for resilient train positioning purposes.
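The preprocessing described above can be sketched as follows. The normalization type and the split proportions are not stated at this point in the text, so min-max scaling and a 60/20/20 split are assumptions made only for illustration, as are the function names and the toy rows.

```python
import random


def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]


def split_dataset(samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and split samples into training, verification, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Hypothetical feature rows with very different value ranges per column.
rows = [[float(i), 10.0 * i, 5.0] for i in range(10)]
normalized = min_max_normalize(rows)
train_set, val_set, test_set = split_dataset(normalized)
```

In an operational setting the scaling bounds would normally be computed on the training set only and then reused for the other subsets and for online samples.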

BO-LightGBM spoofing recognition

Using the GNSS observations received by the train-borne receiver, data samples can be collected for data-driven modeling, which describes the relationship between the feature set and the spoofing indication. To ensure the performance of the derived model, an advanced modeling algorithm is proposed in this study. The procedures of the entire spoofing recognition solution can be summarized in the following three steps.

1. Dataset collection: The field data collected in practical railway operations can be recorded to generate a dynamic operation scenario in GNSS spoofing injection tests. Based on a practical train trajectory and specific GNSS observation configurations, such as satellite visibility and signal propagation features, different GNSS spoofing interference events can be incorporated into a GNSS signal generator. Using the data recorded by the receiver under test (RUT), the datasets for model training and testing can be established through preprocessing.

2. Off-line model training: Using datasets from the RUT under different GNSS spoofing scenarios, the proposed modeling and classification algorithm was adopted to optimize the spoofing recognition model through multiple iterations. The offline modeling results were utilized in the online TPS operation. With the new incoming data samples from additional tests and scenarios, offline modeling can be reactivated to update the model for subsequent applications.

3. Online spoofing recognition: During the practical operation of a TPS, real-time GNSS observations can be extracted and converted into samples according to the structure of the feature quantities. By employing the spoofing interference recognition model, the indicator of the GNSS spoofing attack status can be identified and fed to the LPU to switch the online location determination logic and mitigate the effect of the spoofing attack.

The core of this solution is the capability and performance of the proposed modeling method. To address the problem of conventional classification modeling algorithms with constrained performance, an advanced modeling solution is proposed as follows:

LightGBM classification method

LightGBM is an effective machine-learning algorithm based on the GBDT method. It was proposed to overcome the problems of the GBDT, which suffers from low accuracy and high time cost when managing large datasets. In contrast, the LightGBM method significantly accelerates prediction and reduces memory consumption.²⁴ It adopts a leafwise strategy for data splitting, decision tree growth, and leaf node splitting. This strategy is employed in an iterative process in which each tree is designed to reduce the residuals of the previous iteration: a new model is built in the direction of residual reduction (negative gradient), the difference between the prediction result and the target value of the training data is reduced in each iteration, and the loss function is eventually minimized by stepwise approximation. Thus, the classification performance of the model can be further improved by deriving a strong learner for recognition.

Figure 3 depicts the principle and key features of the LightGBM method. The powerful advantages of LightGBM are reflected in three aspects: less memory, smaller samples, and fewer features, corresponding to the three technical implementations of the histogram algorithm (HA), gradient-based one-side sampling (GOSS), and exclusive feature bundling (EFB). HA is the core feature of the LightGBM solution and is involved in constructing discrete histograms of features and performing data splitting to reduce computational complexity and improve training speed. The GOSS strategy is adopted by the LightGBM solution for gradient sampling during training. Because gradients contribute differently to the splitting of decision tree nodes, GOSS retains the full information of large-gradient samples and partial information of small-gradient samples to accelerate the training process while maintaining the accuracy of the model. EFB is a feature engineering technology in LightGBM that is used to bundle mutually exclusive features and reduce the number of features, thereby reducing the complexity of the model. The utilization of the EFB can accelerate model training, improve generalization performance, and reduce the risk of overfitting to a certain extent.
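The GOSS idea can be illustrated with the following minimal sketch, which is not the LightGBM library code. It keeps the samples with the largest absolute gradients, draws a random subset of the rest, and reweights that subset so that the gradient statistics stay approximately unbiased. The function name, the rates, and the toy gradients are hypothetical.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (sample index, weight) pairs selected by one-side sampling."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]  # all large-gradient samples are kept
    rest = random.Random(seed).sample(order[n_top:], n_other)
    # Small-gradient samples are amplified to compensate for the subsampling.
    weight = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, weight) for i in rest]


# Hypothetical gradients that grow with the sample index.
gradients = [i / 100.0 for i in range(100)]
selected = goss_sample(gradients)
```

With these rates, only 30 of the 100 samples enter the split search, which is the source of the training speed-up described above.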

Figure 3.
Principle and features of LightGBM.

Bayesian optimization strategy

Tuning the hyperparameters is an important task in ML and must be performed in advance. The performance of a specific ML algorithm is highly dependent on the setting of appropriate hyperparameters. The LightGBM method involves numerous hyperparameters that directly affect the structure and performance of the derived model, and BO can tune these parameters quickly and efficiently.²⁵ The Bayesian optimization strategy assumes that the loss function to be optimized is functionally related to the hyperparameters and provides a method to evaluate uncertainties.²⁶ A new combination of hyperparameters is selected based on the current posterior probability function estimate, and the objective function is invoked to evaluate the performance of this hyperparameter combination. A smaller objective function value indicates that the model prediction is closer to the true label, indicating a better model performance level. Through several iterations and evaluations, the Bayesian optimization gradually approaches the optimal hyperparameter combination.²⁷ In each iteration, the existing samples are used to select the next sample according to a sampling strategy, which further refines the approximation of the objective function. By iteratively evaluating different combinations of hyperparameters and adopting the derived optimal hyperparameter set, the performance and generalization of the model can be improved. Compared to traditional grid searching or random searching strategies, the Bayesian algorithm can obtain the optimal combination of hyperparameters with enhanced efficiency and performance.²⁸

BO-LightGBM-based spoofing recognition

To enhance the performance of the original LightGBM algorithm, a BO strategy was used to enable an integrated BO-LightGBM solution for spoofing recognition. The overall framework of BO-LightGBM for data-driven offline model training and online spoofing recognition is shown in Figure 4. The framework consisted of three parts: dataset collection, offline model training, and online spoofing recognition.

Dataset collection: The GNSS-based train positioning data on a practical line was used to construct a dynamic operation scenario to implement the train positioning simulation. A spoofing interference signal was injected to obtain receiver observations. The required observations were extracted and combined with the ODO data to construct the feature quantities, as in (16), which constituted the training set of the spoofing scene.

Off-line model training: Twenty percent of the training set was randomly selected to construct a validation set to enhance model training performance. An improved scheme based on BO-LightGBM was proposed to optimize the recognition performance of the model. The corresponding test set was used to test the trained model and determine whether it met the expected target. Finally, the trained model was stored for invocation.

On-line spoofing recognition: During train operation, the observation feature quantity was extracted from the GNSS positioning unit in real-time, and the model obtained by offline training was utilized to determine whether the TPS was disturbed by a spoofing attack.

Figure 4.
Architecture of BO-LightGBM-based spoofing recognition for GNSS-based train positioning.

A flowchart of the BO-LightGBM-based modeling is shown in Figure 5. The significant difference between the LightGBM-based and advanced BO-LightGBM solutions lies in the model training procedures, which are described in the following steps:

Figure 5.
Flow of the BO-LightGBM-based modeling for spoofing recognition.

Step 1. Defining the hyperparameter range and objective function.
The range of the hyperparameters to be optimized, including the learning rate, maximum number of leaf nodes in the tree, and maximum tree depth, was determined. In this study, the negative of the mean of the log loss function (“binary_logloss”) for the model in cross-validation is adopted as the objective function, denoted as $\begin{aligned} E (x) = & - \frac{1}{N} \sum_{i = 1}^{N} L (y_{i}, F (x_{i})) \\ = & - \frac{1}{N} \sum_{i = 1}^{N} (y_{i} \cdot \ln {\hat{y}}_{i} + (1 - y_{i}) \cdot \ln (1 - {\hat{y}}_{i})), \end{aligned}$ (17) where ${\hat{y}}_{i}$ indicates the probability of predicting a positive class and $y_{i}$ is the true class label of the sample.

The objective function measures the difference between the model-predicted probability and the true label. It reflects the measure of uncertainty by the model and indicates the loss of information between the predictions and actual outcomes.
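As a small numerical illustration, the sketch below evaluates the second line of (17) as written for a list of labels and predicted probabilities. The function name and the clipping constant `eps` are assumptions made for this example; how the value is signed before being handed to an optimizer depends on whether that optimizer minimizes or maximizes.

```python
import math

def mean_log_loss(y_true, y_prob, eps=1e-15):
    """Evaluate -(1/N) * sum(y*ln(p) + (1-y)*ln(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0); eps is an assumption
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Hypothetical labels (1 = spoofed) and predicted probabilities.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
uninformative = [0.5, 0.5, 0.5, 0.5]

loss_confident = mean_log_loss(labels, confident)
loss_uninformative = mean_log_loss(labels, uninformative)
```

Predictions close to the true labels give a value near zero, whereas a constant prediction of 0.5 gives ln 2 regardless of the labels.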
Step 2. Initialize sampling.
By setting the seed of the random number generator, a set of initial hyperparameters was generated, and the LightGBM model was built on the training set to calculate the value of the objective function. Given the training dataset $D = \{(x_{i}, y_{i})\}$, $i = 1, 2, \dots, N$, with N samples, the LightGBM model trains regression trees on these samples, and the goal is to generate an additive model as
$F (x) = \sum_{m = 1}^{M} T_{m} (x),$ (18)
where $T_{m} (x)$ denotes the mth tree.
Step 3. Selecting the next sampling step.
By comparing the current objective function output with the value from the previous round, the current optimal hyperparameter combination can be determined based on the smaller output of the objective function. Using the expected improvement (EI) sampling function,²⁹ the settings of the next hyperparameters were selected for evaluation. In contrast to the original BO strategy, which uses the posterior probability to reflect the evaluation of the LightGBM hyperparameters under the current dataset, the EI sampling function is adopted in this solution to select the next combination of hyperparameters that should be evaluated and guided accordingly.
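The EI sampling function in Step 3 can be illustrated with its common closed form under a Gaussian predictive distribution. The sketch below assumes a surrogate model that returns a mean `mu` and standard deviation `sigma` for each candidate and assumes the objective is being maximized; the surrogate itself, the exploration margin `xi`, and the candidate values are hypothetical and are not specified in the source.

```python
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """Closed-form EI for maximization under a Gaussian prediction N(mu, sigma^2)."""
    improvement = mu - best - xi
    if sigma <= 0.0:
        return max(improvement, 0.0)      # no uncertainty: plain improvement
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + sigma * pdf

# Hypothetical surrogate predictions for three candidate settings:
# (label, predicted objective mean, predicted standard deviation).
candidates = [("A", -0.30, 0.02), ("B", -0.26, 0.10), ("C", -0.24, 0.01)]
best_so_far = -0.25                       # best objective value observed so far

scores = [expected_improvement(mu, sigma, best_so_far) for _, mu, sigma in candidates]
next_index = max(range(len(candidates)), key=lambda i: scores[i])
next_candidate = candidates[next_index][0]
```

In this toy case the candidate with the larger predictive uncertainty is selected even though its predicted mean is slightly below the best observed value, which shows how EI balances exploitation against exploration.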
Step 4. Training LightGBM model.
Using the selected hyperparameters, the LightGBM model can be trained, and the value of the objective function can be calculated. The LightGBM training process is summarized as follows.
Step 4.1. Decision tree learning
In each round of training, a new decision tree model $T_{m} (x)$ is generated. To minimize the loss function $L (y, F (x))$ , the previous $(m - 1)$ decision trees are combined as follows: $F_{m - 1} (x) = \sum_{i = 1}^{m - 1} T_{i} (x) .$ (19)
(a) Calculate the gradient and Hessian for the samples on the current leaf node
$G_{i} = \frac{\partial L (y_{i}, F (x_{i}))}{\partial F (x_{i})}$ (20)
$H_{i} = \frac{\partial^{2} L (y_{i}, F (x_{i}))}{\partial F^{2} (x_{i})}$ (21)
(b) Optimal split point determination
LightGBM uses a histogram algorithm that calculates the split gain for each feature value and histogram bucket and finds the split point with the maximum gain as
$\mathrm{Gain} = \frac{1}{2} [\frac{G_{L}^{2}}{H_{L} + \varphi} + \frac{G_{R}^{2}}{H_{R} + \varphi} - \frac{{(G_{L} + G_{R})}^{2}}{H_{L} + H_{R} + \varphi}],$ (22)
where $\varphi$ is the regularization parameter; $G_{L}$ and $G_{R}$ are the sums of the sample gradients on the left and right child nodes, respectively; and $H_{L}$ and $H_{R}$ denote the sums of the sample Hessians (second-order derivatives) on the left and right child nodes, respectively.
(c) Splitting is performed
LightGBM uses a leafwise strategy, where the node with the highest gain among all leaf nodes is selected preferentially for splitting into two new leaf nodes at each split. Consequently, a new decision tree model is obtained.
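The quantities in (20)–(22) can be traced on a toy example. The sketch below assumes the log loss is evaluated on a raw score passed through a sigmoid, for which the gradient is p − y and the second derivative is p(1 − p). It scans the exact feature values instead of histogram buckets, so it is a simplification of the histogram algorithm rather than LightGBM's implementation; all names and data are hypothetical.

```python
import math

def grad_hess(y, score):
    """Gradient and second derivative of the log loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-score))
    return p - y, p * (1.0 - p)

def split_gain(g_left, h_left, g_right, h_right, reg=1.0):
    """Split gain of (22); reg plays the role of the regularization parameter."""
    def term(g, h):
        return g * g / (h + reg)
    return 0.5 * (term(g_left, h_left) + term(g_right, h_right)
                  - term(g_left + g_right, h_left + h_right))

def best_threshold(values, labels, scores, reg=1.0):
    """Scan thresholds on one feature and return (best gain, best threshold)."""
    stats = [grad_hess(y, s) for y, s in zip(labels, scores)]
    best = (0.0, None)
    for thr in sorted(set(values))[:-1]:          # last value leaves the right side empty
        left = [gh for v, gh in zip(values, stats) if v <= thr]
        right = [gh for v, gh in zip(values, stats) if v > thr]
        gain = split_gain(sum(g for g, _ in left), sum(h for _, h in left),
                          sum(g for g, _ in right), sum(h for _, h in right), reg)
        if gain > best[0]:
            best = (gain, thr)
    return best

# Hypothetical one-feature toy data with all raw scores initialised to zero.
feature = [0.1, 0.2, 0.8, 0.9]
label = [0, 0, 1, 1]
gain, threshold = best_threshold(feature, label, [0.0] * 4)
```

Under the leaf-wise strategy, the leaf whose best split has the largest gain would be the one split next.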
Step 4.2. Loss function update
With the new decision tree model $T_{m} (x)$ , the loss function is updated and, for simplicity, approximated using a second-order Taylor expansion. Thus, the current loss function can be obtained as follows:
$L (y, F_{m - 1} (x) + T_{m} (x)) \approx L (y, F_{m - 1} (x)) + \frac{\partial L (y, F_{m - 1} (x))}{\partial F_{m - 1} (x)} T_{m} (x) + \frac{1}{2} \frac{\partial^{2} L (y, F_{m - 1} (x))}{\partial F_{m - 1}^{2} (x)} T_{m}^{2} (x) .$ (23)
By substituting $G_{m}$ and $H_{m}$ into (23) to represent the first- and second-order derivative terms, the current loss function becomes
$L (y, F_{m - 1} (x) + T_{m} (x)) \approx L (y, F_{m - 1} (x)) + G_{m} \cdot T_{m} (x) + \frac{1}{2} \cdot H_{m} \cdot T_{m}^{2} (x) .$ (24)
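A quick numerical check shows how close the second-order expansion in (24) is to the exact loss for a small tree output. The sketch assumes the log loss written as a function of the raw score F, that is, L(y, F) = ln(1 + e^F) − yF; the function names and the sample values are hypothetical.

```python
import math

def log_loss_raw(y, score):
    """Log loss as a function of the raw (pre-sigmoid) score."""
    return math.log1p(math.exp(score)) - y * score

def second_order_approx(y, score, step):
    """Right-hand side of (24): L + G*T + 0.5*H*T^2, with T = step."""
    p = 1.0 / (1.0 + math.exp(-score))
    g = p - y                     # first derivative w.r.t. the score
    h = p * (1.0 - p)             # second derivative w.r.t. the score
    return log_loss_raw(y, score) + g * step + 0.5 * h * step * step

# Hypothetical sample: true label 1, current score 0.3, new tree output 0.1.
exact = log_loss_raw(1, 0.3 + 0.1)
approx = second_order_approx(1, 0.3, 0.1)
error = abs(exact - approx)
```

The remaining error is of third order in the tree output, which is why the quadratic form is an adequate surrogate when each new tree makes a small correction.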
Step 5. Calculating the objective function value
The current value of the objective function is calculated using the model obtained from this round of training.
Step 6. Determination of optimal hyperparameters
Based on the evaluated samples and the guidance of the EI sampling function, the hyperparameter combination for which the objective function reaches its maximum value is selected as the optimal one.
Step 7. Training of final LightGBM model
The final LightGBM model was trained using the optimal hyperparameters, and the performance of the model was evaluated using the test set. After all steps, the final learner can be described as
$\hat{F} (x) = F_{M} (x) = \sum_{m = 1}^{M} T_{m} (x) .$ (25)
Table 2 presents the detailed pseudocode of the proposed BO-LightGBM modeling solution, where H records the hyperparameter combinations and the corresponding objective function values.

Table 2.
Pseudocode for BO-LightGBM modeling.

BO-LightGBM modeling
1: Define the hyperparameter range and objective function $E (x)$ .
2: Randomly select a set of hyperparameters $x$ to initialize the LightGBM.
3: for n = 1, 2, 3, … do
4: Determine the next set of the hyperparameter combinations to be evaluated through the sampling function $E I$ .
5: Evaluate $x_{n}$ as $x_{n} = \arg \max_{x} E I (x | H_{1 : n - 1})$ .
6: Use the selected $x_{n}$ to train the LightGBM model and compute $E (x_{n})$ .
7: Update H as $H \leftarrow H \cup \{(x_{n}, E (x_{n}))\}$ .
8: end for
9: Return $\hat{x}$ with the maximum $E (x)$ in H .
10: Train the LightGBM model using $\hat{x}$ .
11: Test and evaluate the derived model.

BO-LightGBM: Bayesian optimization-light gradient boosting machine.

Through BO-LightGBM model training using specific datasets, the importance of all the sample features presented in the typical sample features section can be determined. The feature importance sequence can be used to represent the weights of all features, which determine the characteristics of the derived spoofing recognition model. Note that the feature weights of different models may vary considerably with the training sample datasets. The architecture and procedures of the proposed modeling solution offer the following advantages.
Automatic adjustment of hyperparameters. BO can automatically explore the hyperparameters and find the optimal hyperparameter combination without manually traversing all the different combinations. Automatic adjustments reduce the subjectivity and workload of manual operations.

Global optimal performance. The BO strategy takes the results of the previously evaluated hyperparameter combinations into account, which allows the hyperparameter range to be searched more comprehensively and helps avoid falling into a local optimum.

High efficiency. LightGBM provides an efficient gradient-boosting framework for processing large-scale datasets and high-dimensional features. With the BO, the BO-enhanced LightGBM solution can identify suitable hyperparameters at a limited time cost, thus significantly accelerating the model training and tuning process.

Test and analysis

Test platform configuration

To demonstrate the interference recognition capability of the TPS under a spoofing attack, a practical TPS data log recorded on the Shenyang–Heishan high-speed railway was used to generate GNSS observation scenarios under spoofing attacks by injecting different spoofing events in a spoofing simulation test platform. A Spirent GSS9000 signal generator, a GNSS signal simulator with the SimGEN spoofing control tool, and an RUT (u-blox EVK-M8N) were employed in the test platform. The construction and information flow are shown in Figure 6, and Figure 7 shows the setup of the entire test environment.

Figure 6.
Platform setup and information flow.

Figure 7.
Overview of the spoofing test environment.

The collected data were used to generate the train trajectory script and the GPS satellite observation scenario using the SimGEN tool to reconstruct the real GNSS observation conditions. In the spoofing test scenario, the corresponding GPS ephemeris was imported. The SimSAFE tool was configured with SimGEN using specific spoofing commands covering both ramp and sinusoidal GPS spoofing modes. Note that the following assumptions must be made on this platform to ensure that the RUT can be spoofed successfully:
The power level of the generated spoofing signal was higher than that of the actual GPS satellite signal, and the RUT could successfully capture the spoofing signal.

Differences exist in the Doppler and code phases between the actual GPS and spoofing signals. In addition, the power level of the spoofing signal can be adjusted to increase gradually. Thus, the spoofing signal can occupy satellite-tracking channel(s) and mislead the RUT.

The satellite channels of the signal simulator are sufficient to realize both pure GNSS and spoofing signals.

All assumptions were fulfilled by the signal generator in this platform, which ensured a successful spoofing scenario simulation and testing of the spoofed RUT for data collection.

Dataset construction

The training and validation datasets were constructed by collecting the RUT output data under different GPS spoofing and nonspoofing scenarios. Spoofing interference was injected after the scenario had run for 600 s. In total, 40,968 groups of samples were obtained from the tests under the different testing scenarios.

The search ranges and the selected optimal combination of hyperparameters obtained by applying the proposed BO-LightGBM modeling solution to the datasets are listed in Table 3.

Table 3.
Search range and optimal combination of selected hyperparameters of BO-LightGBM model.

Parameter	Range	Value
boosting_type	—	gbdt
objective	—	binary
metrics	—	binary_logloss
learning_rate	(0.01,1)	0.861444123
num_leaves	(10,300)	226
n_estimators	(100,300)	222
feature_fraction	(0.1,1)	0.73793965
bagging_fraction	(1,10)	0.82800952
bagging_freq	(0,10)	5
lambda_l1	(0,10)	0.36290894
lambda_l2	(0,10)	2.41620196

BO-LightGBM: Bayesian optimization-light gradient boosting machine.
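For readers who wish to reuse the setting, the selected values in Table 3 can be collected in a plain parameter dictionary, as in the sketch below. The variable names are hypothetical, and the keys simply repeat the parameter names as printed in the table.

```python
# Values copied from Table 3; the dictionary name is hypothetical.
bo_lightgbm_params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metrics": "binary_logloss",
    "learning_rate": 0.861444123,
    "num_leaves": 226,
    "n_estimators": 222,
    "feature_fraction": 0.73793965,
    "bagging_fraction": 0.82800952,
    "bagging_freq": 5,
    "lambda_l1": 0.36290894,
    "lambda_l2": 2.41620196,
}

# Parameters that were tuned by BO (those with a search range in Table 3).
fixed = ("boosting_type", "objective", "metrics")
tuned = [name for name in bo_lightgbm_params if name not in fixed]
```

Assuming the LightGBM Python package is available, a dictionary of this kind could be supplied as the parameter set when training a model; that call is not shown here because the source does not describe the training script.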

In the spoofing injection test, two typical GPS spoofing modes—sinusoidal and ramp—were investigated to illustrate the capability of the proposed modeling solution. Three different settings were used for the sinusoidal mode, and the ramp mode involved two configurations. Details of five different spoofing scenarios, represented as T1–T5, are listed in Table 4. To illustrate the impact of the spoofing attack, the results from the RUT under two representative scenarios (T1 for the sinusoid and T5 for the ramp) were extracted to analyze the time-varying characteristics of the sample features.
1. Results under T1 scenario

Table 4.
Details of different spoofing interference tests.

Spoofing mode	Programmer options	Pseudo-range setting	Frequency setting	Spoofing power setting
Sinusoid	Option 1 (T1)	Amplitude: 7 m	Frequency: 0.01 Hz	Slope: 0.3
	Option 2 (T2)	Amplitude: 8 m	Frequency: 0.01 Hz	Slope: 0.3
	Option 3 (T3)	Amplitude: 8 m	Frequency: 0.01 Hz	Slope: 0.5
Ramp	Option 4 (T4)	Slope: 2 m/s	—	Slope: 0.3
	Option 5 (T5)	Slope: 3 m/s	—	Slope: 0.1
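To give a feel for the pseudo-range settings in Table 4, the sketch below evaluates simple time profiles for the two modes. The functional forms (a zero-phase sine and a linear ramp, both starting at the 600 s injection time) are assumptions made for illustration; the source does not state the exact waveform definition used by the spoofing tool.

```python
import math

SPOOF_START_S = 600.0   # injection time used in the tests

def sinusoid_offset(t, amplitude_m, frequency_hz, start_s=SPOOF_START_S):
    """Assumed sinusoidal pseudo-range offset (m) at scenario time t (s)."""
    if t < start_s:
        return 0.0
    return amplitude_m * math.sin(2.0 * math.pi * frequency_hz * (t - start_s))

def ramp_offset(t, slope_m_per_s, start_s=SPOOF_START_S):
    """Assumed linearly growing pseudo-range offset (m) at scenario time t (s)."""
    if t < start_s:
        return 0.0
    return slope_m_per_s * (t - start_s)

# T1: amplitude 7 m, frequency 0.01 Hz; T5: slope 3 m/s (values from Table 4).
t1_peak = sinusoid_offset(625.0, 7.0, 0.01)   # a quarter period after injection
t5_after_100s = ramp_offset(700.0, 3.0)       # 100 s after injection
```

Under these assumptions, the T1 offset stays bounded by its amplitude with a 100 s period, whereas the T5 offset grows without bound, which matches the qualitative difference between the oscillating and the steadily increasing feature curves reported for the two scenarios.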

Figure 8 shows the detailed changes in the sample features with time when spoofing interference is injected into the two satellite channels. The channels can be represented by satellite pseudo-random noise (PRN) indices of 15 and 20. In all the subfigures, the absolute time (in seconds) of the test system is shown on the x-axis.

Figure 8.
Sample features under the T1 scenario.

After 600 s, the pseudo-range residual and the carrier-phase-smoothed pseudo-range residual oscillate periodically in a sinusoidal form. The GNSS/ODO pseudo-range difference exhibits a large jump after 600 s, and sinusoidal changes occur in the two spoofing-injected satellite channels (PRN = 15 and 20). Because the ODO measurement is immune to spoofing, the behavior of this feature shows that the RUT was affected by the injected attack event and that the navigation calculation performance might degrade. In addition, the pseudo-range residual, carrier-phase-smoothed pseudo-range residual, and GNSS/ODO pseudo-range difference of the satellites without spoofing injection also showed obvious sinusoidal characteristics after 600 s. For the satellite observation status-related features, significant changes were observed after the spoofing injection. The C/N₀ values of the two satellites increased sharply after 600 s. The HDOP and VDOP values also increased, whereas the PDOP decreased after 600 s, and did not change significantly during most of the subsequent period. Figure 9 shows the single point positioning (SPP) errors under the T1 scenario in the sinusoidal interference mode.

Figure 9.
SPP errors under the T1 scenario.

Periodic oscillation in the form of a sinusoid is present in all three directions. The characteristics of the errors are consistent with the behavior of the sample features derived from the tests for data collection.
2. Results under T5 scenario
Figure 10 shows the detailed changing status of all the sample features when four satellites (PRN = 15, 20, 24, and 32) were injected with spoofing interference. Note that not all satellites can always be observed based on the relative positional relationship between each satellite and the target train. According to the GPS ephemeris, a satellite with PRN = 25 can be observed near the end of the time span, as shown in Figure 10.

Figure 10.
Sample features under the T5 scenario.

After 600 s from the start of the test, the pseudo-range residual, carrier-phase-smoothed pseudo-range residual, and GNSS/ODO pseudo-range difference increased significantly in a ramp form. The C/N₀ values of the four spoofing-injected satellites rose sharply to approximately 48. The three DOP values behaved similarly to those in the T1 case. Figure 11 shows the SPP errors, where obvious ramp characteristics can be observed as the errors increase over time. The error characteristics followed the form of the injected ramp spoofing mode.

Figure 11.
SPP errors under the T5 scenario.

Based on the analysis of the sample features under specific scenarios, the sensitivity of these features to spoofing attacks can be revealed, which can be used to illustrate the potential of sample sets in achieving effective model training for recognizing spoofing interference.

Recognition performance analysis

To demonstrate the performance of the proposed BO-LightGBM solution for spoofing recognition, independent operations were performed for four different solutions, as follows.
M1: GBDT (Gradient Boosting Decision Tree)

M2: XGBoost (eXtreme Gradient Boosting)

M3: Original LightGBM (Light Gradient Boosting Machine)

M4: Proposed BO-LightGBM
The same sample sets were employed in parallel operations of the four solutions for comparison. To illustrate the performance of the proposed solution (M4) over other referencing methods, a comparative analysis of the receiver operating characteristic (ROC) curves and confusion matrices from all solutions was performed. The ROC curve was used to illustrate the combined sensitivity and specificity trends of derived classifiers using a specific solution. The confusion matrix reflects the model prediction results for each category and provides an effective indication of the classification performance of the model. Figures 12–15 show the derived ROC curves and confusion matrices from all involved solutions under five scenarios (T1–T5).
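The area under the ROC curve discussed with these figures can be computed directly from classifier scores through its rank interpretation: it equals the probability that a randomly chosen spoofed sample receives a higher score than a randomly chosen nonspoofed one, with ties counted as one half. The sketch below uses this pairwise definition; the function name and the toy scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0       # positive ranked above negative
            elif p == n:
                wins += 0.5       # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = spoofed) and predicted spoofing probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
```

The pairwise form is quadratic in the number of samples, so it is suited to illustration rather than to datasets of the size used in the tests.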

Figure 12.
ROC curves and confusion matrices (M1).

Figure 13.
ROC curves and confusion matrices (M2).

Figure 14.
ROC curves and confusion matrices (M3).

Figure 15.
ROC curves and confusion matrices (M4).

The results illustrate that the proposed solution (M4) achieves a superior performance level over the other referencing methods under different spoofing scenarios. It achieved a higher accuracy level in model prediction for all testing scenarios, and the classification performance was more stable and reliable than that of the other methods. In addition, the area under curve (AUC) was investigated to intuitively evaluate the prediction accuracy of the model. Specifically, it can be found that the AUC by our proposed solution (M4) has been significantly improved under T1–T3. Considering the comparison under the T1 scenario as an example, the AUC from M4 was improved by 17.54%, 7.85%, and 2.70% over the three methods. Only for the T4 and T5 cases with ramp-mode spoofing, a slight decrease of 0.089% and 0.059%, respectively, occurred for M4 over the M2 solution. However, among these methods, only M4 achieved a high level (above 0.99799) for all situations. According to the above local and global comparative analyses, the overall performance of M4 surpassed that of the other three solutions.

Beyond the ROC-based analysis, the recognition capability of the four classification models was examined using model performance indices. Four quantitative indicators, accuracy (A), precision (P), recall rate (R), and F1 score, were statistically compared, and the results are shown in Figures 16–20. The results in these figures represent the recognition capability of M1–M4 for injected GPS spoofing interference under the T1–T5 schemes.

Figure 16.
Performance indicator results for T1.

Figure 17.
Results of performance indicators under T2.

Figure 18.
Results of performance indicators under T3.

Figure 19.
Results of performance indicators under T4.

Figure 20.
Results of performance indicators for T5.

The quantitative indicators adopted were defined as follows:
$\left\{\begin{matrix} A = \frac{T P + T N}{T P + T N + F P + F N} \\ P = \frac{T P}{T P + F P} \\ R = \frac{T P}{T P + F N} \\ F 1 = 2 \times \frac{P \times R}{P + R} \end{matrix}\right.,$ (26)
where TP and TN denote the numbers of samples correctly identified as positive and negative, respectively, and FP and FN denote the numbers of samples incorrectly identified as positive and negative, respectively.
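The four indicators can be computed from predicted and true labels as in the following sketch. The function name, the zero-division fallbacks, and the toy labels are assumptions for illustration only.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 from binary labels (1 = spoofed)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # fallback is an assumption
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"A": accuracy, "P": precision, "R": recall, "F1": f1}

# Hypothetical labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(truth, predicted)
```

For this toy case all four indicators equal 0.75, because the single false positive and the single false negative affect precision and recall equally.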

Thus, all four methods achieved certain spoofing recognition capabilities under different spoofing interference modes and scenarios. The four quantitative indicators from our proposed solution achieved a high accuracy of over 99.78% in identifying whether spoofing interference occurred. Under scenarios T1–T3 with a sinusoid-type attack, the accuracy of M4 was improved by 25.51% over the other three methods, and a significant improvement of 21.36% was realized for the F1 score of the model. The precision improved by a maximum of 42.73%. However, the recall of M4 differed slightly from those of M1 and M2 by 0.17% under the spoofing attack scenarios. Under the T4 and T5 scenarios in testing the performance with the ramp-type attack, all four quantitative indicators of our proposed solution realized an advanced performance level while failing to outperform M2 with a limited difference of less than 0.11%. According to the comparison results with the original LightGBM without introducing the BO, it was found that the involvement of the BO mechanism effectively improved the capability of the model to recognize spoofing attacks. Note that uncertainty quantification plays a pivotal role in easing the impact of uncertainties during both optimization and decision-making.³⁰ Although the BO directly influences the determination of the combination of hyperparameters for optimizing the LightGBM, the effect of uncertainty in the BO can be indicated by the recognition performance achieved by the entire solution, which validates the effectiveness of the design of this solution.

Considering both nonspoofing and GPS spoofing scenarios, Table 5 summarizes the missed and false alarm rates in the identification and classification of injected spoofing interference during the tests. Under the sinusoidal spoofing interference scenarios (T1–T3), the M4 solution proposed in this study was significantly better than the other three modeling schemes. Under the T1 scenario, the missed alarm rate by M4 was reduced by 38.48%, 23.48%, and 0.62%, and the false alarm rate was reduced by 29.87%, 14.60%, and 0.27% compared with M1–M3, respectively. Under the T2 scenario, the missed alarm rate of M4 was reduced by 9.73%, 9.87%, and 1.08%, and the false alarm rate was reduced by 5.22%, 5.30%, and 0.48%, respectively. The T3 scenario achieved an improvement similar to that of T2. Enhancements of 11.65%, 10.64%, and 1.18% were achieved for the missed alarm rate, and those for the false alarm rate were 6.25%, 5.64%, and 0.51%, respectively. These reductions are absolute differences between the rates listed in Table 5. However, the situation differs under the ramp-spoofing interference scenarios (T4 and T5), where the proposed solution (M4) did not always perform the best among the tested models. For the T4 scenario, M4 did not realize an enhancement in the missed and false alarm rates over M1 and M2, but the differences were relatively small and did not exceed 0.09%. Compared with M3, the two indices improved by 0.54% and 0.26%, respectively. Under the T5 scenario, M4 outperformed both M1 and M3. It failed to perform better than M2; however, the differences were also limited to only 0.27% for missed alarms and 0.11% for false alarms. Generally, the M4 solution proposed in this study can achieve significant comprehensive advantages under these test scenarios, particularly in the sinusoidal spoofing scenarios. It achieves stable and constrained missed and false alarm rates under all scenarios, ensuring that it can effectively manage different spoofing interference situations.

Table 5.
Missed and false alarm rates for spoofing identification.

Testing set	Nonspoofing		Spoofing	
	Model	R_MA	Model	R_FA
T1	M1	38.94%	M1	30.09%
	M2	23.94%	M2	14.82%
	M3	1.08%	M3	0.49%
	M4	0.46%	M4	0.22%
T2	M1	9.82%	M1	5.26%
	M2	9.96%	M2	5.34%
	M3	1.17%	M3	0.52%
	M4	0.09%	M4	0.04%
T3	M1	11.72%	M1	6.28%
	M2	10.71%	M2	5.67%
	M3	1.25%	M3	0.54%
	M4	0.07%	M4	0.03%
T4	M1	0.11%	M1	0.05%
	M2	0.16%	M2	0.08%
	M3	0.74%	M3	0.36%
	M4	0.20%	M4	0.10%
T5	M1	0.72%	M1	0.33%
	M2	0.14%	M2	0.07%
	M3	1.06%	M3	0.48%
	M4	0.41%	M4	0.18%
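The reductions quoted in the discussion of Table 5 are absolute differences between the tabulated rates, which can be reproduced with a few lines of arithmetic. The sketch below copies the missed alarm rates (R_MA) from Table 5; the variable and function names are hypothetical.

```python
# Missed alarm rates (percent) copied from Table 5.
R_MA = {
    "T1": {"M1": 38.94, "M2": 23.94, "M3": 1.08, "M4": 0.46},
    "T2": {"M1": 9.82, "M2": 9.96, "M3": 1.17, "M4": 0.09},
    "T3": {"M1": 11.72, "M2": 10.71, "M3": 1.25, "M4": 0.07},
    "T4": {"M1": 0.11, "M2": 0.16, "M3": 0.74, "M4": 0.20},
    "T5": {"M1": 0.72, "M2": 0.14, "M3": 1.06, "M4": 0.41},
}

def reduction_vs_m4(table, scenario):
    """Absolute reduction (percentage points) achieved by M4 over M1-M3.

    A negative value means that M4 has the higher rate in that comparison.
    """
    row = table[scenario]
    return {m: round(row[m] - row["M4"], 2) for m in ("M1", "M2", "M3")}

t1_reduction = reduction_vs_m4(R_MA, "T1")
t4_reduction = reduction_vs_m4(R_MA, "T4")
```

The T1 result reproduces the reductions stated for the sinusoidal case, and the negative entries for T4 correspond to the small margins by which M1 and M2 remain ahead of M4 under ramp-mode spoofing.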

Analysis under synthetic spoofing scenario

Based on the performance evaluation and analysis using a single spoofing signal mode, we further investigated the model performance of the proposed solution under more complicated spoofing attack situations. In contrast to tests using only the sinusoid or ramp modes, the two spoofing modes were integrated to constitute a synthetic interference condition using the same train trajectory and GPS satellite observation scenario settings. The same RUT was used to collect datasets for model training and testing. A spoofing injection was also triggered when the test scenario had run for 600 s. Sinusoid-mode spoofing was first injected into three satellite channels (PRN = 15, 20, and 24), where the same settings of the sinusoid signal were adopted as in the T1 scenario. The spoofing power slope was held at a stable level of 0.3 from the start of the spoofing injection. When the test scenario had run for 1200 s, the ramp-spoofing attack was activated in the same satellite channels (PRN = 15, 20, and 24), where the slope was set to 3 m/s, which was the same as in the T5 scenario. Using the datasets collected under the synthetic scenario, all four solutions (M1–M4) were run independently for comparative analysis.

The recognition capabilities of the four involved models were investigated using a specific testing set (T6) under the synthetic spoofing scenario. The same quantitative indicators, including accuracy (A), precision (P), recall rate (R), and F1 value, were statistically compared, as shown in Figure 21, which illustrates the recognition capabilities of M1–M4 under this synthetic scenario. Considering both nonspoofing and spoofing attack scenarios, Table 6 summarizes the missed and false alarm rates in the recognition of complicated spoofing interference events.

Figure 21.
Results of performance indicators under T6.

Table 6.
Missed and false alarm rates under synthetic spoofing scenarios.

Testing set	Nonspoofing		Spoofing	
	Model	R_MA	Model	R_FA
T6	M1	5.86%	M1	6.28%
	M2	5.35%	M2	5.67%
	M3	2.69%	M3	0.54%
	M4	0.10%	M4	0.03%

The results revalidate the performance advantages of the proposed solution (M4). Under this complicated spoofing scenario, the four quantitative indicators from our proposed solution reached a high level of over 99.80% in identifying whether spoofing interference occurred. The accuracy, precision, recall, and F1 score improved by 4.35%, 4.89%, 6.12%, and 8.00%, respectively, over the three referencing methods. For the comparison of missed and false alarm rates, it can be observed that all four solutions perform with relatively lower rates compared with the single spoofing scenarios T1–T5, which is a result of the more significant spoofing effects with the integration of two spoofing signal settings. The performance advantage of M4 remained under the T6 scenario. This is reflected in reductions of 4.76%, 4.25%, and 2.59% in the missed alarm rate achieved by M4 compared with M1–M3, and in reductions of 6.25%, 5.64%, and 0.51% in the false alarm rate, respectively. The results from a complicated GNSS spoofing scenario further demonstrate that the proposed BO-LightGBM solution can achieve advanced stability and the strongest reliability among all referencing methods involved. The revealed characteristics of this solution highlight its significant advantages for spoofing recognition in GNSS-based train positioning and show great potential for smart diagnosis and active spoofing suppression in specific GNSS-based implementations.

Conclusions

To realize trustworthy train positioning using GNSS, this study proposes a novel active recognition solution for spoofing interference attacks. Compared with traditional machine learning-based approaches, the proposed BO-enhanced LightGBM solution can learn the characteristics of GNSS spoofing attacks through a data-driven model under the railway train operation scenario, while considering the spoofing-affected satellite observation features and the spatial constraints of the trains. In addition, the automatic adjustment of the hyperparameters by introducing Bayesian optimization enhances the coverage of different spoofing modes by LightGBM. The results of spoofing interference tests demonstrate the advanced recognition performance of the proposed solution over other data-driven methods, including GBDT, XGBoost, and LightGBM. In addition to the sinusoidal and ramp signal-based attack modes, the synthetic signal-based interference test further illustrates the spoofing recognition capability under a more complicated attack situation.

Note that a key challenge to the proposed solution is the uncertainty and diversity of the spoofing interference patterns, which may constrain the performance and application of the proposed solution. It is difficult to collect sufficient datasets under real but unknown spoofing attack scenarios. Thus, we chose to create different spoofing attack modes in the spoofing injection test environment to alleviate the challenges and limitations caused by interference patterns. In the future, effective spoofing monitoring and recording tools will be introduced in practical train operation environments to capture practical events and obtain additional field datasets. Therefore, we will further focus on active protection measures against multisource and multimode GNSS spoofing interference to ensure the credibility and safety of novel GNSS-enabled railway systems. More advanced data-driven models and calculation platforms should be considered to further improve the solution and cope with the risks posed by various GNSS attacks.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Beijing Natural Science Foundation (4232031), National Natural Science Foundation of China (T2222015, U2268206), Technological R&D Program of China State Railway Group Co., Ltd (K2023X002), and Fundamental Research Funds for the Central Universities (2022JBQY003).

ORCID iD

Jiaqi Bi

References

1. Beugin, Marais. Simulation-based evaluation of dependability and safety properties of satellite technologies for railway localization. Transport Res C-Emer 2012; 36: 13–26.

2. Humphreys, Ledvina, Psiaki, et al. Assessing the spoofing threat: development of a portable GPS civilian spoofer. In: Paper presented at: 21st International Technical Meeting of the Satellite Division of The Institute of Navigation, Savannah, GA, September 16–19, 2008. https://www.ion.org/publications/abstract.cfm?articleID=8132.

3. Kim, Lee, Kim, et al. Localization of solar panel cleaning robot combining vision processing and extended Kalman filter. Sci Progress-UK 2024; 107: 1–29.

4. Gheisari, Najafabadi, Alzubi, et al. OBPP: an ontology-based framework for privacy-preserving in IoT-based smart city. Future Gener Comp Sy 2021; 123: 1–13.

5. Alzubi, Qiqieh, Alzubi. Fusion of deep learning based cyberattack detection and classification model for intelligent systems. Cluster Comput 2023; 26: 1363–1374.

6. Song, Xiao, et al. Optimal order of time-domain adaptive filter for anti-jamming navigation receiver. Remote Sens-Basel 2021; 14: 1–15.

7. Bhatti, Humphreys. Hostile control of ships via false GPS signals: demonstration and detection. Navigation-US 2017; 64: 51–66.

8. James. Vulnerability assessment of the US transportation infrastructure that relies on the global positioning system. J Navigation 2003; 56: 185–193.

9. Ong, Hyoungmin. Direction of arrival estimation of GNSS signal using dual antenna. J Position Navig Timing 2020; 9: 215–220.

10. Fan, Gan, et al. Adaptive spoofing suppression algorithm for GNSS based on multiple antennas array. Sensors-Basel 2020; 20: 1–19.

11. Yang, Zhang, Tang, et al. A combined anti jamming and antispoofing algorithm for GPS arrays. Int J Antenn Propag 2019; 8: 1–9.

12. Dehghanian, Nielsen, Lachapelle. GNSS spoofing detection based on signal power measurements: statistical analysis. Int J Navig Obs 2012; 2012: 1–8.

13. Chu, Wen, et al. Statistical model and performance evaluation of a GNSS spoofing detection method based on the consistency of Doppler and pseudo-range positioning results. J Navigation 2019; 72: 447–466.

14. Zhu, Ouyang, et al. GNSS spoofing detection technology based on Doppler frequency shift difference correlation. Meas Sci Technol 2022; 33: 1–11.

15. Humphreys. Detection strategy for cryptographic GNSS anti-spoofing. IEEE T Aero Elec Sys 2013; 49: 1073–1090.

16. Wesson, Rothlisberger, Humphreys. A proposed navigation message authentication implementation for civil GPS anti-spoofing. In: Paper presented at: 24th International Technical Meeting of the Satellite Division of the Institute of Navigation, Portland, OR, September 20–23, 2011. https://www.ion.org/publications/abstract.cfm?articleID=9870.

17. Wesson, Rothlisberger, Humphreys. Practical cryptographic civil GPS signal authentication. Navigation-US 2012; 59: 177–193.

18. Ceccato, Formaggio, Laurenti, et al. Generalized likelihood ratio test for GNSS spoofing detection in devices with IMU. IEEE T Inf Foren Sec 2021; 16: 3496–3509.

19. Movassagh, Alzubi, Gheisari, et al. Artificial neural networks training algorithm integrating invasive weed optimization with differential evolutionary model. J Amb Intel Hum Comp 2021; 14: 6017–6025.

20. Alzubi, Alweshah, et al. An optimal pruning algorithm of classifier ensembles: dynamic programming approach. Neural Comput Appl 2020; 32: 16091–16107.

21. Yuen, Kuok. Bayesian methods for updating dynamic models. App Mech Rev 2011; 64: 1–18.

22. Mosavi, Azad, Emamgholipour. Position estimation in single-frequency GPS receivers using Kalman filter with pseudo-range and carrier phase measurements. Wireless Pers Commun 2013; 72: 2563–2576.

23. Sharawi, Akos, Aloi. GPS C/N₀ estimation in the presence of interference and limited quantization levels. IEEE T Aero Elec Sys 2007; 43: 227–238.

24. Sun, Chen, et al. A model combining convolutional neural network and LightGBM algorithm for ultra-short-term wind power forecasting. IEEE Access 2019; 7: 28309–28318.

25. Snoek, Larochelle, Adams, et al. Practical Bayesian optimization of machine learning algorithms. In: Paper presented at: 25th International Conference on Neural Information Processing Systems, Lake Tahoe, Nevada, December 3–6, 2012. https://dl.acm.org/doi/10.5555/2999325.2999464.

26. Mueller, Quintana, Page. Nonparametric Bayesian inference in applications. Stat Method Appl-Ger 2017; 27: 175–206.

27. Wang, Zhou, et al. The improvement and application of XGBoost method based on the Bayesian optimization. J Guangdong U Technol 2018; 35: 23–28.

28. Malikov AKU, Cho, Kim, et al. A novel ultrasonic inspection method of the heat exchangers based on circumferential waves and deep neural networks. Sci Progress-UK 2023; 106: 1–26.

29. Yuan, Chen, Wang, et al. A compensation method based on extreme learning machine to enhance absolute position accuracy for aviation drilling robot. Adv Mech Eng 2018; 10: 1–11.

30. Phan, Khan, Salay, et al. Bayesian uncertainty quantification with synthetic data. Lect Notes Comput Sci 2019; 11699: 378–390.

Testing set	Model	R_MA (nonspoofing)	R_FA (spoofing)
T1	M1	38.94%	30.09%
T1	M2	23.94%	14.82%
T1	M3	1.08%	0.49%
T1	M4	0.46%	0.22%
T2	M1	9.82%	5.26%
T2	M2	9.96%	5.34%
T2	M3	1.17%	0.52%
T2	M4	0.09%	0.04%
T3	M1	11.72%	6.28%
T3	M2	10.71%	5.67%
T3	M3	1.25%	0.54%
T3	M4	0.07%	0.03%
T4	M1	0.11%	0.05%
T4	M2	0.16%	0.08%
T4	M3	0.74%	0.36%
T4	M4	0.20%	0.10%
T5	M1	0.72%	0.33%
T5	M2	0.14%	0.07%
T5	M3	1.06%	0.48%
T5	M4	0.41%	0.18%

Spoofing attack recognition for GNSS-based train positioning using a BO-LightGBM method
