Introduction
Civil engineering structures, such as bridges, buildings, and pipelines, rely on structural health monitoring (SHM) systems to ensure their safety and longevity. These systems utilize an array of sensors and non-destructive evaluation techniques to continuously or periodically assess the condition of these structures, providing accurate data on their current health. 1 Such data are crucial for the early detection of potential structural issues and for the precise evaluation of any detected damage. By incorporating advanced technologies and data analysis methods, SHM systems facilitate timely and proactive maintenance, enabling civil engineers to prioritize repairs based on the urgency and importance of the identified problems.2,3 The early detection and accurate severity assessment offered by SHM not only ensure that resources are allocated efficiently but also help minimize risks and prolong the service life of critical infrastructure.4,5
The concept of early damage detection in SHM became a critical focus during the field’s formative years, driven by the need to identify structural issues before they escalated. Doebling et al. 6 provided one of the first comprehensive reviews of vibration-based damage identification methods, detailing approaches such as changes in modal frequencies, mode shapes, and advanced techniques using modal strain energy to detect incipient damage. Building on this foundation, Sohn et al. 7 expanded the scope by reviewing statistical pattern recognition paradigms in SHM, stressing the importance of data normalization and feature extraction for identifying subtle changes indicative of early damage. Farrar et al. 8 introduced statistical process control methods for distinguishing between environmental variations and actual structural damage, significantly enhancing the sensitivity of early detection methods.
Recent advancements in SHM have been significantly influenced by the integration of supervised learning techniques, which have greatly enhanced early damage detection. Supervised learning models, utilizing labeled datasets, have shown remarkable precision in identifying known damage patterns. Avci et al. 9 demonstrated the superiority of convolutional neural networks (CNNs) for damage detection compared to traditional methods. Rafiei and Adeli 10 further advanced the field with a dynamic neural classification algorithm for high-rise buildings, showing the adaptability of supervised learning across diverse structures. Azimi et al. 11 reviewed state-of-the-art deep learning methods in SHM, while more recently, Ahmadian et al. 12 developed a supervised machine learning model capable of detecting subtle structural deviations, marking a key step forward in applying supervised methods for real-time damage detection.
While supervised learning models have advanced SHM, they rely on large labeled datasets, which are often difficult to obtain in real-world scenarios. Unsupervised learning techniques address this challenge by detecting anomalies and potential damage using unlabeled data, making them increasingly important for early damage detection in the absence of labeled data. 13 Bull et al. 14 introduced an unsupervised novelty detection approach that uses probabilistic models to account for environmental and operational variability, thereby improving the reliability of structural anomaly detection. Similarly, Rastin et al. 15 used convolutional autoencoders to detect and quantify structural damage, achieving success with both numerical models and full-scale structures like the Tianjin Yonghe Bridge. Sarmadi and Yuen 16 employed a one-class kernel null space algorithm with probabilistic threshold estimation for early damage detection, effectively managing environmental variability. Wang and Cha 17 proposed an unsupervised deep learning approach utilizing a deep auto-encoder with a one-class support vector machine to detect structural damage by extracting damage-sensitive features from acceleration response data, achieving high accuracy in both numerical and experimental studies. Lei et al. 18 proposed a deep convolutional generative adversarial network for damage detection by reconstructing lost sensor data to enable accurate identification of structural conditions and damage. This reflects the growing trend of using AI to enhance the sensitivity of early damage detection in vibration-based monitoring systems. 19
Once damage is detected, accurately quantifying its severity is crucial for determining appropriate maintenance and repair strategies. Over the past few decades, significant progress has been made in damage severity assessment methods in SHM. Early contributions by Pandey et al. 20 introduced the use of changes in modal curvature for damage localization and severity assessment, laying the groundwork for future developments. Farrar and Jauregui 21 compared damage identification algorithms using both experimental and numerical modal data, offering insights into how different methods can assess damage severity effectively. Ren and Sun 22 further advanced the field by applying wavelet analysis for quantitative damage assessment, demonstrating how frequency analysis can accurately localize and measure severity. As SHM methods evolved, more sophisticated techniques emerged. Fan and Qiao 23 provided a comprehensive review of vibration-based damage identification methods, focusing on algorithms for quantifying damage severity. Their work highlighted the potential of data-driven methods for precisely estimating the extent of structural damage. 24
As SHM systems evolved, machine learning emerged as a pivotal tool for enhancing the accuracy and efficiency of damage severity assessments. 25 Tibaduiza et al. 26 proposed a method that combined feature selection with extreme learning machines, significantly improving the precision of severity estimation. This marked a shift toward data-driven approaches, emphasizing the extraction and analysis of relevant features from extensive datasets. Building on this, Zhang et al. 27 introduced CNNs for assessing damage severity directly from raw sensor data, streamlining the assessment process by reducing the need for extensive feature engineering. Concurrently, the integration of probabilistic methods into SHM has addressed the inherent uncertainties in structural models and measurement data. Behmanesh et al. 28 developed a Bayesian probabilistic framework for assessing damage severity in steel structures, incorporating uncertainty into the assessment process to enhance the reliability of results. Huang et al. 29 extended this approach by applying Gaussian process regression to damage severity assessment, offering probabilistic predictions crucial for informed decision-making in SHM.
Recent advancements in SHM have also seen the integration of deep learning and data fusion, particularly in multi-sensor contexts, significantly enhancing damage severity assessment. Entezami and Shariatmadar 30 introduced an unsupervised learning approach for damage localization and severity assessment, using novel damage indices based on time series modeling. Their method identifies robust model orders through an iterative process and uses AutoRegressive model parameters and residuals as damage-sensitive features to quantify damage severity. Following this, Xu et al. 31 proposed a Long Short-Term Memory (LSTM) neural network framework for real-time seismic damage assessment, demonstrating how deep learning can effectively quantify the extent of structural damage at a regional scale by analyzing data from multiple sources. Postorino et al. 32 demonstrated the robustness of CNNs in predicting damage severity and location in composite structures, even under manufacturing uncertainties and noise. Building on these advancements, Nguyen-Ngoc et al. 33 introduced a method combining Deep Neural Networks with the Artificial Rabbit Optimization algorithm, effectively localizing and quantifying damage in truss bridges while overcoming challenges such as local minima in optimization processes.
In addition to these data-driven advancements, innovation in signal processing techniques continues to play a crucial role in damage severity assessment. Esmaielzadeh et al. 34 applied such signal processing techniques to concrete gravity dams, using relative frequency error to estimate damage severity. By analyzing changes in natural frequencies, they accurately identified the location and severity of structural damage in the non-linear, non-stationary signals common in SHM. Mousavi et al. 35 used the complete ensemble empirical mode decomposition with adaptive noise technique, focusing on key features like energy and instantaneous amplitude, to achieve more accurate damage severity classification compared to traditional methods.
Hybrid methodologies that integrate various techniques have emerged as powerful tools in severity assessment. 36 Dang et al. 37 proposed a hybrid one-dimensional Deep Convolutional Neural Network (1D-DCNN)-LSTM model for accurate damage severity assessment in civil structures. By combining signal processing techniques with deep learning, their method efficiently handles noisy sensor data and achieves high accuracy in real-time SHM. Similarly, Svendsen et al. 38 developed a hybrid SHM framework that uses both numerical and experimental data to assess damage severity in steel bridges. Another innovative hybrid methodology, as described by Sakiyama et al., 39 combines principal component analysis (PCA), finite element simulations, and Monte Carlo simulations to quantify damage severity in aging infrastructure.
Recent advances in SHM have explored diverse methodologies across well-established benchmarks, yet key limitations persist, especially regarding scalability, automation, and interpretability. Traditional approaches to the International Association for Structural Control and Monitoring (IASC)–American Society of Civil Engineers (ASCE) Phase I benchmark often rely on supervised learning frameworks with hand-crafted features and static thresholds,40–42 limiting their adaptability to new damage conditions or unseen environments. Similarly, SHM studies using the Old ADA truss bridge, a full-scale, artificially damaged structure, typically require extensive manual calibration and often lack generalized scoring mechanisms.43,44 In both cases, these models are prone to false alarms due to environmental noise and cannot distinguish between damage severity levels without supervised labels or retraining.
Valdez-Yepez et al. 45 introduced the only public dataset for bolt-loosening detection in offshore wind-turbine jacket supports and applied PCA plus Mahalanobis distance for anomaly detection. 46 Although effective in small-scale laboratory tests, this pipeline has limited practical scalability: manual thresholding impedes real-time operation, and the method yields only a binary healthy/damaged flag, offering neither bolt-level localization nor multi-severity discrimination (e.g., 6 vs 9 Nm loosening) without retraining. Such limitations restrict its utility for progressive on-the-fly damage assessment.
Despite significant advancements in both early damage detection and severity assessment, several key challenges continue to hinder the full potential of these methods. Environmental and operational variability pose significant challenges in SHM, where fluctuations in temperature, humidity, and operational loads can obscure or mimic structural damage, complicating the accurate detection of anomalies. Robust algorithms capable of distinguishing between benign variations and genuine structural problems are urgently needed. 47 Additionally, setting reliable thresholds for anomaly detection is problematic; arbitrary or context-dependent thresholds can lead to false positives or negatives, particularly in complex machine learning and deep learning models. 48 The issue of generalization across different datasets and structural types further limits the applicability of current SHM methods, as many models perform well in specific contexts, but struggle in varied environments. 49 Scalability and computational efficiency are also critical concerns, especially as SHM systems increasingly rely on large-scale sensor networks that generate vast amounts of data requiring real-time analysis. 50 Furthermore, the interpretability of SHM models, especially those based on deep learning, remains a challenge. The “black-box” nature of these models can impede trust and practical application, underscoring the need for interpretable AI that provides clear and understandable rationales for their decisions. 51
Recent studies have begun to address the most pressing of these gaps, namely, robustness to environmental variability. Manifold learning-aided clustering followed by non-parametric probabilistic scoring substantially reduces false alarms by forming environment-specific data manifolds before anomaly detection. 52 Complementary work removes freezing-temperature artifacts from bridge modal frequencies by normalizing unsupervised data, thus restoring damage sensitivity without manual tuning. 53 Moreover, cyclostationarity-based signal modelling offers a physics-guided route to suppress speed-induced variance in rotating machinery while retaining wear signatures. 54 Together, these advances demonstrate that hybrid environment-aware learning pipelines can outperform purely data-driven baselines, yet they remain largely confined to single-domain case studies and require custom parameter choices, leaving open questions of cross-domain generalization and computational scalability. In parallel, hybrid physics-informed neural networks have been introduced to merge first-principles structural models with data-driven learning, 55 and probabilistic generative models such as variational autoencoders offer an unsupervised framework for quantifying uncertainty and detecting anomalies without labeled damage data. 56
Digital twin platforms now pair high-rate Internet of Things (IoT) sensor streams with physics-based simulations to deliver real-time asset-specific health assessments of gearboxes, wind turbine drivetrains, and bridge decks, enabling condition-based maintenance and reducing unplanned downtime. 57 In intelligent manufacturing, cyclostationarity-aware vibration analysis takes advantage of the periodic statistics of rotating machinery to isolate weak fault signatures in spur gears with highly variable speed and load profiles. 58 These successes underscore the advantage of blending domain physics with advanced deep learning pipelines for SHM. 48 However, a unified and transferable framework that can accommodate the sparse sensor layouts, large spatial scales, and severe environmental variability typical of civil infrastructure remains an open research challenge. Existing approaches often face serious limitations: in such settings, algorithms typically require labeled damage data, hand-tuned thresholds, or extensive sensitivity analyses to avoid false alarms caused by operational or environmental fluctuations.52–54 These limitations highlight the pressing need for an unsupervised and calibration-free framework that enables robust early damage detection and interpretable severity assessment, even under complex real-world variability.
Addressing the complex challenges of damage detection in SHM has been a central focus of our research. Our development of the EdgeConvFormer 59 model marked a significant advancement in this area, integrating graph convolutional networks and transformers to capture complex spatiotemporal patterns. A key innovation of EdgeConvFormer is its use of dynamic Graph Convolutional Network (GCN)s within a spatial-temporal 2D framework, which adapts to varying structural conditions and sensor layouts, enhancing its ability to capture intricate spatial dependencies. Additionally, the Parallel Sensor-Specific Transformer enables precise modeling of temporal dependencies across sensors, preserving individual sensor characteristics while learning inter-sensor relationships. These features collectively provide a more accurate and robust analysis of the relationships between different sensors over time, setting EdgeConvFormer apart from existing methods.
While EdgeConvFormer demonstrated superior performance in anomaly detection and damage localization using a reconstruction approach, 60 it still has some limitations. The model’s encoder excelled at extracting multi-scale features; however, its decoder, based on a simplified multi-layer perceptron (MLP) structure, did not fully leverage the richness of the encoded information. The MLP-based decoder used max and mean pooling followed by linear transformations to synthesize these features into the final reconstructed data. This straightforward approach, while easy to implement and effective in certain contexts, ultimately fell short in preserving the intricate spatial and temporal details crucial for accurate anomaly detection and severity assessment. As a result, some critical contextual information was lost during the reconstruction process, reducing the model’s sensitivity to subtle yet significant structural changes, particularly in noisy environments where early detection is crucial.
Recognizing the need to overcome the limitations of EdgeConvFormer, we have focused on enhancing the model to better preserve the intricate details captured by the encoder, thereby improving its ability to accurately identify and assess structural damage, especially in complex scenarios. This work builds on the strengths of EdgeConvFormer, addressing its shortcomings to further improve its effectiveness in SHM applications. As a solution, we propose U-GraphFormer, an enhanced model that significantly advances feature extraction and reconstruction processes by incorporating a U-Net-inspired encoder–decoder architecture. This new approach ensures more comprehensive utilization of the encoded information, leading to improved performance in anomaly detection and severity assessment.
In U-GraphFormer, the simple MLP decoder used in EdgeConvFormer is replaced with a more sophisticated structure that mirrors the encoder, integrating skip connections between corresponding layers. This design allows for more dynamic and progressive refinement of features across multiple layers, akin to the U-Net architecture. The inclusion of skip connections between each layer of the encoder and decoder ensures that detailed spatial and temporal information is preserved throughout the reconstruction process, mitigating the loss of critical contextual information that was a limitation in EdgeConvFormer. By progressively refining features at each layer, U-GraphFormer enhances the model’s ability to capture and utilize multi-scale features more effectively, leading to more accurate and detailed analyses. This advanced decoding mechanism not only improves the reliability of early damage detection but also significantly enhances the model’s capacity for severity assessment. The skip connections allow for the direct flow of fine-grained information from the encoder to the decoder, enabling the model to reconstruct complex patterns with greater precision. As a result, U-GraphFormer is better equipped to differentiate between normal structural variations and subtle indicators of damage, even in challenging environments.
Additionally, U-GraphFormer is engineered for real-time monitoring, with the capability to trigger alarms immediately when an adaptively generated threshold is exceeded. This real-time functionality, combined with the enhanced decoding architecture, ensures timely interventions and provides detailed insights into the severity of detected damage. This facilitates informed decision-making and prioritization of maintenance efforts, making U-GraphFormer a comprehensive and reliable tool for monitoring critical infrastructure. The main contributions of this work are as follows:
A novel model, U-GraphFormer, enhances SHM by integrating spatio-temporal graph learning with sensor-specific temporal self-attention within a U-Net-inspired encoder–decoder architecture, enabling dynamic refinement of features for more accurate anomaly detection and severity assessment.
The development of a Segment-Level Self-Adaptive Scoring Decision Mechanism for early damage detection and severity assessment. By dynamically selecting the most appropriate scoring method, base anomaly scoring or dynamic Gaussian kernel scoring (Gauss_D_K), based on an adaptively determined threshold, this mechanism ensures accurate, context-sensitive assessments. This innovation allows for simple and intuitive use of the mean and standard deviation to reliably gauge damage progression, facilitating timely and informed maintenance decisions.
For each structure, U-GraphFormer is trained once on its healthy baseline data. Afterwards, any new segment (400 steady-state points) from that same structure can be processed in less than 2 min on a standard Graphics Processing Unit (GPU)—without further threshold tuning or retraining—enabling straightforward, continuous monitoring of the system.
The research evaluates U-GraphFormer on three SHM case studies—(i) the IASC-ASCE Phase I synthetic steel-frame benchmark, (ii) the offshore wind turbine jacket bolt loosening data set, and (iii) the Old ADA full-scale steel-truss bridge. In all cases, the model achieves a segment-wise damage detection accuracy of 100%, allowing reliable early warnings from short testing segments of a few minutes. In the synthetic benchmark, time-point scores and mean/standard-deviation statistics over segments reproduce the ground-truth severity ranking exactly.
Methodology
As depicted in Figure 1, the U-GraphFormer architecture builds upon the foundational EdgeConvFormer model 59 and is specifically designed for effective early detection and differentiation of structural damage severity in SHM. This methodology integrates advanced techniques to preprocess raw sensor data, encode complex spatiotemporal relationships, and accurately reconstruct the data for anomaly detection. The process includes data smoothing through a sliding window approach, sophisticated encoding using Time2Vec embeddings, spatiotemporal graph learning, and sensor-specific temporal self-attention. The decoder part of U-GraphFormer is enhanced by incorporating edge convolution and transformer layers connected via a U-Net structure, which allows for the seamless integration of high-resolution and low-resolution features, ensuring more precise reconstruction of the data. Anomalies are identified by comparing the reconstructed data to the original and applying thresholding to detect significant deviations. For early detection, the model continuously monitors the reconstruction errors, raising an alarm when these errors exceed predefined thresholds, allowing for timely intervention and maintenance to prevent further deterioration. For severity assessment, anomaly scores are analyzed over specific time segments, with the mean and standard deviation calculated to provide insights into the extent and progression of the damage. These statistical measures facilitate a more informed and proactive approach to maintenance and intervention, ensuring that the U-GraphFormer architecture not only detects anomalies early but also accurately assesses the severity of structural damage.

The architecture of U-GraphFormer. (1) Data preprocessing: Moving average is applied independently to each sensor’s data, smoothing out noise and highlighting trends. (2) Encoder: The encoder uses Time2Vec embeddings and a multilayer stack of GraphFormer Blocks to capture both temporal patterns and spatial relationships in the sensor data. (3) Decoder: Mirroring the encoder’s architecture, the decoder employs GraphFormer blocks connected via a U-Net structure to reconstruct the data from encoded representations. (4) Reconstruction: The decoder’s output is passed through a final linear layer to produce the reconstructed data, which is then compared to the original moving mean data. (5) Anomaly detection and severity assessment: Reconstruction errors and anomaly scores are calculated and significant deviations are flagged as anomalies using Tail-p thresholding. An alarm is raised when these errors exceed the threshold, and the mean and standard deviation of reconstruction errors over specific time segments are analyzed to assess the extent and progression of the damage.
Data preprocessing
In this study, data preprocessing plays a crucial role in enhancing the accuracy of anomaly detection and localization within SHM. The preprocessing pipeline consists of several key steps that work together to prepare the data for effective analysis.
To begin with, a moving average is applied to the time series data, which serves to smooth out noise and emphasize significant trends. This technique, calculated over a specified window size $w$, replaces each point with the mean of the $w$ most recent observations:
$$\tilde{x}_t = \frac{1}{w} \sum_{i=0}^{w-1} x_{t-i}$$
After smoothing, the data are standardized to a mean of zero and a standard deviation of one, ensuring equal feature contribution. The standardized data $\hat{x}_t$ are computed as
$$\hat{x}_t = \frac{\tilde{x}_t - \mu}{\sigma}$$
where $\mu$ and $\sigma$ denote the per-sensor mean and standard deviation of the smoothed data.
The final step in the preprocessing pipeline involves the implementation of an overlapping window approach. This technique divides the standardized data into overlapping sub-sequences of length $L$ taken every $s$ time steps (the stride), so that consecutive windows share $L - s$ points and every region of the signal is covered by multiple windows.
Through empirical testing and cross-validation, the optimal window size and stride were determined to maximize precision, recall, and F1 score, leading to effective damage detection and localization. This overlapping window approach not only enhances detection accuracy but also supports real-time monitoring, allowing for timely assessments of damage severity.
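To make the pipeline concrete, the following is a minimal NumPy sketch of the three preprocessing steps. The window length of 256 and stride of 90 follow the values reported later for the IASC-ASCE case study; the smoothing window of 5 is an illustrative assumption, as the paper does not state it here, and exact window counts may differ slightly depending on boundary handling.

```python
# Sketch of the preprocessing pipeline: moving average, standardization,
# and overlapping windowing (hypothetical implementation).
import numpy as np

def moving_average(x, w):
    """Smooth each sensor channel (columns of x) with a length-w moving mean."""
    kernel = np.ones(w) / w
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, x)

def standardize(x, mean=None, std=None):
    """Zero-mean, unit-variance scaling per sensor; statistics come from
    healthy training data and are reused at test time."""
    mean = x.mean(axis=0) if mean is None else mean
    std = x.std(axis=0) if std is None else std
    return (x - mean) / (std + 1e-8), mean, std

def sliding_windows(x, length=256, stride=90):
    """Split (T, n_sensors) data into overlapping (L, n_sensors) sub-sequences."""
    starts = range(0, x.shape[0] - length + 1, stride)
    return np.stack([x[s:s + length] for s in starts])

raw = np.random.randn(240_000, 16)      # e.g. 4 min at 1000 Hz, 16 sensors
smoothed = moving_average(raw, w=5)     # smoothing window is an assumption
scaled, mu, sigma = standardize(smoothed)
windows = sliding_windows(scaled)       # -> (num_windows, 256, 16)
```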
Encoder
The encoder of the U-GraphFormer architecture, depicted in Figure 1, consists of four layers, each integrating an EdgeConv 61 module with a parallel sensor-specific temporal self-attention (ParaAtten) module. The encoder processes the input sensor data to capture complex spatial-temporal relationships, enhancing data representation at each stage. Initiated by inputs from the Time2Vec 62 module, this layered configuration gradually refines the features to improve the detection of spatiotemporal patterns in multivariate time series data.
Time2Vec embedding
Positional embedding is crucial in Transformer models to address their inherent limitation in recognizing sequence order. While traditional Transformers use sine and cosine functions to embed positional information, these methods often struggle with time series data, which can show both regular and irregular patterns. To overcome this, we employ Time2Vec, 62 an advanced approach that differentiates between periodic and aperiodic patterns in time series data.
Time2Vec transforms each sensor’s moving mean data into a learned time representation. Following the original formulation, the $i$th component of the embedding of a time value $\tau$ is
$$\mathbf{t2v}(\tau)[i] = \begin{cases} \omega_i \tau + \varphi_i, & i = 0, \\ \mathcal{F}(\omega_i \tau + \varphi_i), & 1 \leq i \leq k. \end{cases}$$
Here, $\omega_i$ and $\varphi_i$ are learnable parameters and $\mathcal{F}$ is a periodic activation (typically the sine function); the linear term captures aperiodic trends while the periodic terms capture recurring patterns.
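For illustration, a minimal PyTorch sketch of a Time2Vec layer following Kazemi et al.’s formulation is shown below; the embedding dimension is an arbitrary choice, not the paper’s configuration.

```python
# Minimal Time2Vec layer: one linear (aperiodic) component plus
# sine-activated (periodic) components with learnable frequency and phase.
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    def __init__(self, out_dim):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(1))            # linear term
        self.b0 = nn.Parameter(torch.randn(1))
        self.w = nn.Parameter(torch.randn(out_dim - 1))   # periodic terms
        self.b = nn.Parameter(torch.randn(out_dim - 1))

    def forward(self, tau):                        # tau: (batch, seq_len, 1)
        linear = self.w0 * tau + self.b0           # captures trends
        periodic = torch.sin(self.w * tau + self.b)  # captures periodicity
        return torch.cat([linear, periodic], dim=-1)  # (batch, seq_len, out_dim)

emb = Time2Vec(out_dim=64)(torch.linspace(0, 1, 256).view(1, 256, 1))
```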
GraphFormer block
Each layer of the encoder uses a GraphFormer Block to dynamically update embeddings based on spatial-temporal relationships. The GraphFormer Block consists of two main components:
Edge Convolution: The EdgeConv module applies a graph convolution over the input data to capture spatial relationships among sensors. For the input features $x_i$ of sensor node $i$, edge features are formed with its $k$ nearest neighbours $j \in \mathcal{N}(i)$ and aggregated as
$$x_i' = \max_{j \in \mathcal{N}(i)} h_\Theta\big(x_i \,\Vert\, (x_j - x_i)\big)$$
where $h_\Theta$ is a learnable multilayer perceptron, $\Vert$ denotes feature concatenation, and the maximum is taken channel-wise over the neighbourhood.
Parallel sensor-specific temporal self-attention (ParaAtten): The attention mechanism employs ParaAtten to focus on temporal features unique to each sensor. The output from the EdgeConv module is reshaped so that each sensor’s temporal sequence is processed independently.
The attention output for each sensor in the batch follows the standard scaled dot-product form,
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
with queries $Q$, keys $K$, and values $V$ derived from that sensor’s temporal features and $d_k$ the key dimension.
This attention output is further processed with a feedforward network and added residual connections, enhancing the temporal feature representation. The resulting tensor from each layer is permuted back to the original dimensions before being passed to the next layer.
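The following PyTorch sketch illustrates the flow of one such block under stated assumptions: the sensor graph is built from $k$ nearest neighbours in feature space, and all layer sizes are illustrative. It is a conceptual rendering of the EdgeConv-plus-ParaAtten pattern described above, not the authors’ released code.

```python
# Conceptual GraphFormer block: EdgeConv over the sensor graph followed by
# per-sensor temporal self-attention (ParaAtten), both with residuals.
import torch
import torch.nn as nn

class GraphFormerBlock(nn.Module):
    def __init__(self, dim, k=4, heads=4):
        super().__init__()
        self.k = k
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                 # x: (batch, sensors, time, dim)
        b, s, t, d = x.shape
        # --- EdgeConv: k nearest neighbours in feature space (self included) ---
        feat = x.mean(dim=2)              # (b, s, d) summary per sensor
        dist = torch.cdist(feat, feat)    # pairwise sensor distances
        idx = dist.topk(self.k, largest=False).indices        # (b, s, k)
        nbrs = torch.gather(
            x.unsqueeze(2).expand(b, s, self.k, t, d), 1,
            idx.view(b, s, self.k, 1, 1).expand(b, s, self.k, t, d))
        edges = torch.cat([x.unsqueeze(2).expand_as(nbrs), nbrs - x.unsqueeze(2)], -1)
        x = self.edge_mlp(edges).max(dim=2).values             # max over neighbours
        # --- ParaAtten: temporal attention run independently per sensor ---
        x = x.reshape(b * s, t, d)
        x = x + self.attn(x, x, x, need_weights=False)[0]      # residual attention
        x = x + self.ffn(x)                                    # residual feed-forward
        return x.reshape(b, s, t, d)

out = GraphFormerBlock(64)(torch.randn(2, 16, 128, 64))  # (2, 16, 128, 64)
```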
Decoder
Multi-level GraphFormer decoder blocks
As shown in Figure 2, the decoder in U-GraphFormer is central to the model’s performance, building upon the encoder’s architecture while introducing critical enhancements for improved reconstruction and anomaly detection. This core component features a multi-level GraphFormer structure, incorporating U-Net style skip connections that integrate features across different resolutions and scales, significantly enhancing the model’s ability to reconstruct sensor data with high fidelity.

The architecture of U-GraphFormer featuring a multi-level encoder–decoder structure with skip connections. The encoder consists of four layers, each employing EdgeConv blocks followed by a transformer layer and reshaping step. The decoder mirrors the encoder’s structure, integrating high-resolution and low-resolution features through U-Net-style skip connections.
The decoder mirrors the structure of the encoder but incorporates added complexity to effectively capture and refine multi-scale features essential for accurate anomaly detection. It processes the encoded data by progressively refining it through multiple layers that blend high- and low-resolution features. Skip connections link these layers, enabling the model to retain and utilize information from earlier encoding stages, ensuring that crucial details are preserved during the reconstruction process.
The decoder starts with the final encoder layer’s output, $E_4$, which the first decoder GraphFormer Block processes into $D_1$.
For each subsequent decoder layer $l$ (with $l = 2, 3, 4$), the input is the concatenation of the previous decoder output and the skip-connected encoder feature at the matching scale:
$$D_l = \mathrm{GraphFormerBlock}\big(\left[\, D_{l-1} \,\Vert\, E_{5-l} \,\right]\big)$$
This concatenation ensures multi-scale feature integration, improving reconstruction quality. The significance of these multi-level features is profound. By merging high-resolution (local) and low-resolution (global) features, the decoder can accurately reconstruct the original sensor data, capturing both subtle anomalies and broader patterns in the structural health data. This capability is crucial for enabling the model to accurately detect anomalies and precisely assess their severity.
Each layer in the decoder includes a GraphFormer Block, which is crucial for processing and refining the encoded features. These blocks consist of two main components: EdgeConv and ParaAtten, which work together to model the relationships between different sensors and time steps effectively. Skip connections in a U-Net style architecture allow the decoder to incorporate information from earlier stages of encoding, ensuring that detailed features are preserved and utilized in the reconstruction.
The skip connections serve a dual purpose; they prevent the loss of spatial and temporal information as the data are passed through multiple layers, and they facilitate the integration of multi-level features, leading to a more accurate and detailed reconstruction of the original sensor data.
The multi-level refinement process in the decoder is particularly significant for anomaly detection and severity assessment. By effectively integrating features from different scales, the decoder ensures that the reconstructed data retains the essential characteristics of the original signals, allowing for precise detection of both subtle and significant anomalies. This multi-level approach also enhances the model’s ability to differentiate between various types of structural damage, leading to more accurate and reliable severity assessments.
The output of each decoder layer $D_l$ is then passed to the next layer, and the final decoder output is forwarded to the reconstruction stage described below.
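A compact sketch of this skip-connected decoding pass is given below, reusing the GraphFormerBlock sketched earlier; the linear fusion of concatenated features is an assumption about how channel dimensions are restored after concatenation.

```python
# U-Net-style decoder pass: each layer consumes the previous decoder output
# concatenated with the matching encoder feature map (skip connection).
import torch
import torch.nn as nn

class UGraphFormerDecoder(nn.Module):
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.fuse = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(n_layers - 1))
        self.blocks = nn.ModuleList(GraphFormerBlock(dim) for _ in range(n_layers))

    def forward(self, enc_feats):           # enc_feats: [E1, E2, E3, E4]
        d = self.blocks[0](enc_feats[-1])   # start from the deepest encoding
        for i, skip in enumerate(reversed(enc_feats[:-1])):
            fused = self.fuse[i](torch.cat([d, skip], dim=-1))  # skip connection
            d = self.blocks[i + 1](fused)   # progressive multi-scale refinement
        return d
```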
Reconstruction
The output from the decoder, denoted as $D_4$, is passed through a final linear projection to produce the reconstructed data, $\hat{X} = W D_4 + b$, where $W$ and $b$ are the learnable weight matrix and bias of the output layer.
During the unsupervised training phase, the model is trained exclusively on data from normal operational conditions, aiming to learn and recognize the standard patterns in the data. The reconstructed data $\hat{X}$ are then compared with the original moving-mean data $X$ to measure how faithfully normal behavior is reproduced.
The reconstruction error, measured using the mean squared error (MSE) loss function, is calculated as follows:
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{t=1}^{N} \left\lVert x_t - \hat{x}_t \right\rVert_2^2$$
where $x_t$ and $\hat{x}_t$ denote the original and reconstructed sensor vectors at time step $t$, and $N$ is the number of time steps in the window.
Anomaly detection
Anomaly scoring
This stage of the architecture focuses on identifying anomalies by calculating the reconstruction error at each time step and transforming it into an anomaly score that can be compared against a threshold.
In SHM, accurate and timely detection of anomalies is crucial for maintaining the integrity and safety of structures. Traditional anomaly detection methods often rely on static scoring mechanisms, which may not effectively capture minor anomalies or adapt to evolving data trends. To address these limitations, we propose a segment-level adaptive scoring approach that selectively uses either base anomaly scoring or Gauss_D_K depending on the characteristics of each segment. This adaptive approach aims to enhance the sensitivity and robustness of anomaly detection across various damage scenarios.
Base anomaly scoring: Once the reconstruction errors $e_t^{(s)}$ for each sensor $s$ at time step $t$ are obtained, they are evaluated against the error statistics learned from healthy data.
We assume that the reconstruction errors for each sensor follow a specific statistical distribution (modeled here as Gaussian) characterized by parameters $\mu_s$ and $\sigma_s$, estimated from healthy validation data.
The probability of observing a reconstruction error $e_t^{(s)}$ is then given by $p(e_t^{(s)} \mid \mu_s, \sigma_s)$.
We obtain a vector of log probabilities for each time step by computing the log probabilities for all sensors. The aggregated anomaly score is
$$a_t = -\sum_{s=1}^{S} \log p\big(e_t^{(s)} \mid \mu_s, \sigma_s\big)$$
so that larger values of $a_t$ correspond to less probable, and hence more anomalous, observations.
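Assuming the per-sensor error distribution is Gaussian, the base scoring reduces to a summed negative log-likelihood, as in the following NumPy/SciPy sketch.

```python
# Base anomaly scoring sketch: fit a per-sensor Gaussian to reconstruction
# errors on healthy validation data, then score each test time step by the
# summed negative log-likelihood across sensors (Gaussian is an assumption).
import numpy as np
from scipy.stats import norm

def fit_error_model(val_errors):
    """val_errors: (T_val, n_sensors) reconstruction errors on healthy data."""
    return val_errors.mean(axis=0), val_errors.std(axis=0) + 1e-8

def base_anomaly_score(test_errors, mu, sigma):
    """Aggregate -log p(e_t | mu_s, sigma_s) over sensors per time step."""
    logp = norm.logpdf(test_errors, loc=mu, scale=sigma)  # (T, n_sensors)
    return -logp.sum(axis=1)                              # (T,) anomaly scores
```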
Gauss_D_K 63 : While the base anomaly scoring method is intuitive and effective, certain scenarios in SHM require a more sophisticated approach to handle temporal misalignments in sensor readings. Gauss_D_K was introduced to address these challenges. By applying a Gaussian kernel, this method smooths the anomaly scores, reducing noise and aligning temporal differences across sensors.
Anomaly scores are first computed from dynamically updated local statistics: over a sliding window of recent errors, a local mean $\mu_t^{(s)}$ and variance $\big(\sigma_t^{(s)}\big)^2$ are maintained for each sensor, and the dynamic score is
$$z_t^{(s)} = \frac{e_t^{(s)} - \mu_t^{(s)}}{\sigma_t^{(s)}}$$
where the sliding window allows the statistics to track ongoing trends in the data. These scores are then smoothed with a Gaussian kernel $g_\sigma$,
$$\tilde{z}^{(s)} = z^{(s)} * g_\sigma$$
where $*$ denotes convolution, and the overall health score at each time step is obtained by summing the smoothed scores across all sensors.
The scoring process involves dynamically updating the local mean and variance of sensor errors over a sliding window, which helps in capturing ongoing trends and subtle changes in the data. The resulting anomaly scores are then convolved with a Gaussian kernel, which enhances the temporal alignment of anomalies detected across different sensors. This additional scoring mechanism is particularly beneficial in complex scenarios where structural changes might not be immediately apparent or are spread over time. By summing the smoothed anomaly scores, Gauss_D_K provides a more unified and consistent analysis, ensuring more reliable detection of anomalies and offering a clearer picture of the system’s overall health.
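A possible implementation of this dynamic scoring and kernel smoothing is sketched below; the sliding-window length and kernel width are illustrative assumptions, not the paper’s calibrated values.

```python
# Gauss_D_K sketch: normalise errors by sliding-window mean/variance
# (dynamic), smooth per-sensor scores with a Gaussian kernel, then sum
# across sensors into one health signal.
import numpy as np
from scipy.ndimage import gaussian_filter1d, uniform_filter1d

def gauss_dk_score(errors, win=200, kernel_sigma=10.0):
    """errors: (T, n_sensors) reconstruction errors."""
    local_mu = uniform_filter1d(errors, size=win, axis=0)           # running mean
    local_var = uniform_filter1d(errors**2, size=win, axis=0) - local_mu**2
    z = (errors - local_mu) / np.sqrt(np.maximum(local_var, 1e-8))  # dynamic scores
    smoothed = gaussian_filter1d(np.abs(z), sigma=kernel_sigma, axis=0)
    return smoothed.sum(axis=1)             # unified health signal per time step
```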
Segment-level self-adaptive scoring decision mechanism: As shown in Figure 3, the process of determining the appropriate scoring method for each segment begins with calculating the Segment-Level Anomaly Scores, which include the mean and standard deviation of the base anomaly scores computed over all time steps in the segment.

Flowchart of the segment-level decision mechanism.
Furthermore, a threshold $\tau$ is derived from the healthy baseline as
$$\tau = \mu_h + 2\sigma_h$$
where $\mu_h$ and $\sigma_h$ are the mean and standard deviation of the base anomaly scores on healthy data (for the IASC–ASCE benchmark reported later, $13.87 + 2 \times 5.23 = 24.33$). Once the threshold $\tau$ is fixed, each segment’s mean anomaly score is compared against it: segments whose mean exceeds $\tau$ retain the base scoring, while the remaining segments are re-scored with Gauss_D_K to enhance sensitivity to subtle damage.
Empirical validation: Our empirical observations confirm that the proposed segment-level decision method effectively detects both serious and minor damages. For serious damages, segment-level scores tend to exceed the threshold $\tau$ by a wide margin, so base scoring suffices; for minor damages, scores fall near or below $\tau$, and the Gauss_D_K path recovers the subtle deviations that base scoring alone would miss.
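Putting the pieces together, the decision mechanism can be sketched as follows, reusing the scoring functions above; the threshold form $\mu_h + 2\sigma_h$ matches the healthy-baseline statistics reported in the case study ($13.87 + 2 \times 5.23 = 24.33$).

```python
# Segment-level self-adaptive scoring sketch: pick the scoring method whose
# sensitivity matches the segment's apparent severity.
import numpy as np

def adaptive_segment_score(errors, mu, sigma, mu_h, sigma_h):
    """errors: (T, n_sensors); (mu, sigma): per-sensor error model;
    (mu_h, sigma_h): mean/std of base scores on healthy data."""
    tau = mu_h + 2.0 * sigma_h                       # adaptive threshold
    base = base_anomaly_score(errors, mu, sigma)     # defined earlier
    if base.mean() > tau:                            # serious damage: base score
        return base, "base"
    return gauss_dk_score(errors), "Gauss_D_K"       # minor damage: smoothed score
```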
Thresholding
The computed anomaly score is subjected to thresholding to identify significant deviations. Errors that exceed a predefined threshold are flagged as anomalies, indicating potential structural damage. This approach allows for early detection and differentiation of the severity of the damage, facilitating timely intervention and maintenance.
Tail-p thresholding 63 : Tail-p thresholding identifies significant deviations across multiple sensors by summing the negative log probabilities across all sensors at each time step and flagging time steps whose aggregated score falls in the extreme upper tail, of probability $p$, of the score distribution estimated from healthy data.
Our Tail-$p$ setting is selected once on healthy validation data and then held fixed during monitoring, so no manual threshold calibration is required at test time.
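A minimal sketch of this thresholding step, assuming the threshold is taken as the empirical $(1-p)$ quantile of healthy validation scores, is:

```python
# Tail-p thresholding sketch: the detection threshold is the (1 - p)
# quantile of aggregated anomaly scores on healthy validation data
# (p = 0.01 is an illustrative value).
import numpy as np

def tail_p_threshold(healthy_scores, p=0.01):
    return np.quantile(healthy_scores, 1.0 - p)

def flag_anomalies(scores, threshold):
    return scores > threshold            # boolean mask of time-wise anomalies
```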
Early detection of damage
Early detection of potential structural damage is critical for timely intervention and maintenance, helping to prevent further deterioration. Building on the segment-level decision mechanism and the thresholding method described previously, the following process is used to achieve early detection: Once the appropriate anomaly scores $a_t$ have been selected for a segment, each score is compared against the Tail-$p$ threshold to label individual time points as normal or anomalous.
The process starts by continuously monitoring the anomaly scores $a_t$; within each segment, the ratio of positive (above-threshold) time points is tracked, and an alarm is raised as soon as this ratio exceeds a preset value, with the first positive time step recorded as the estimated onset of damage.
The alarm mechanism, triggered by the ratio of positive anomalies, acts as an early warning system, enabling timely inspection and intervention. This early detection approach is designed to facilitate timely maintenance actions to address potential structural issues before they worsen. It provides a comprehensive and unified view of the system’s health by aggregating data from multiple sensors, and ensures robustness by reducing false positives through the ratio-based detection method. The immediate alert triggered when the ratio of positive anomalies exceeds the threshold enables rapid identification of potential damage, allowing maintenance teams to take swift action. This early detection not only helps prevent further deterioration but also plays a crucial role in ensuring the safety and longevity of the structure. When a segment triggers an alarm, identifying the first positive time step as the onset of damage provides valuable information for understanding the origin of structural issues and planning effective remedial actions.
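The alarm logic can be summarized in a few lines; the 10% alarm ratio below is an illustrative placeholder, not the paper’s calibrated value.

```python
# Segment-wise alarm sketch: raise an alarm when the fraction of flagged
# time points exceeds a ratio threshold, and report the first flagged
# time step as the estimated damage onset.
import numpy as np

def segment_alarm(scores, threshold, ratio_threshold=0.10):
    flags = scores > threshold
    ratio = flags.mean()                     # fraction of positive points
    if ratio > ratio_threshold:
        onset = int(np.argmax(flags))        # first positive time step
        return True, onset
    return False, None
```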
Severity assessment of damage
Following the early detection of potential anomalies, assessing the severity of the detected damage is crucial for prioritizing maintenance efforts and determining the appropriate remedial actions. Once an alarm is raised for a segment, a detailed severity assessment is conducted to gauge the extent of the damage.
In this work, the severity assessment process involves analyzing anomaly scores over specific time segments, where the mean and standard deviation of these scores are computed to gain insights into the extent and progression of the damage. The effectiveness of this approach lies not merely in the computation of these statistical measures, but in the inherent quality of the anomaly scores generated by the model.
The mean reconstruction error, aggregated into the segment-level anomaly score $\mu_{\mathrm{seg}} = \frac{1}{T}\sum_{t=1}^{T} a_t$ over a segment of $T$ time steps, and the corresponding standard deviation $\sigma_{\mathrm{seg}}$ together summarize the level and variability of the detected deviations.
The mean anomaly score serves as a reliable indicator of the overall level of structural anomalies detected, with higher mean values suggesting more significant deviations and, consequently, more severe damage. The standard deviation provides insight into the variability of these anomalies; higher values imply sporadic or inconsistent anomalies, potentially indicating intermittent issues or fluctuating severity levels.
What sets this work apart from previous approaches, including EdgeConvFormer, is the informativeness of the anomaly scores produced by U-GraphFormer. The model’s advanced architecture, particularly its U-Net-inspired encoder–decoder structure and refined graph learning capabilities, generates anomaly scores that are inherently more reflective of true structural conditions. This means that even the simple computation of mean and standard deviation from these scores yields highly reliable severity assessments. The anomaly scores produced by U-GraphFormer are more stable and consistent across varying levels of damage, making them a robust foundation for severity assessment.
Unlike other models, which may produce noisy or less consistent scores requiring complex post-processing to achieve accuracy, U-GraphFormer’s scores are directly meaningful. The simplicity of using mean and standard deviation for severity assessment is a testament to the model’s ability to capture and represent structural anomalies with high fidelity. This approach ensures a consistent and accurate ranking of damage severity, aligned closely with the ground truth, offering a clear advantage over previous methods.
In conclusion, the severity assessment methodology in U-GraphFormer leverages the intrinsic quality of the anomaly scores, ensuring that the computed mean and standard deviation not only provide a clear picture of the current state but also enable precise and reliable decision-making for maintenance and intervention strategies. This approach represents a significant advancement over the EdgeConvFormer model, demonstrating the enhanced capability of U-GraphFormer to deliver actionable insights into structural health.
Evaluation metrics
Metrics for early detection: The early detection capability of the model is evaluated using standard time-wise anomaly detection metrics: Precision, Recall, and the F1-score, defined as $P = \frac{TP}{TP+FP}$, $R = \frac{TP}{TP+FN}$, and $F1 = \frac{2PR}{P+R}$, where $TP$, $FP$, and $FN$ are the counts of true positive, false positive, and false negative time points.
Metric for severity assessment: To assess the model’s effectiveness in evaluating damage severity, we introduce the Ranking Accuracy (Spearman’s rank correlation) metric. 67
This metric evaluates how well the anomaly scores correlate with the true severity ranking of the different damage scenarios. The Ranking Accuracy is calculated using Spearman’s rank correlation coefficient
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the predicted and true severity ranks of scenario $i$, and $n$ is the number of scenarios.
In this context, $\rho = 1$ indicates that the ranking induced by the anomaly scores matches the ground-truth severity ordering exactly, while lower values indicate disagreement.
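In practice this metric can be computed directly with SciPy, as in the sketch below; apart from the 380.81 and 15.62 values echoed from the results reported later, the scores are illustrative placeholders.

```python
# Ranking Accuracy via Spearman's rank correlation. Scores follow the
# severity order reported for the benchmark (Damage2 > Damage1 > Damage5 >
# Damage4 > Damage3 > Damage6).
from scipy.stats import spearmanr

mean_scores = [380.81, 520.0, 20.1, 24.7, 33.2, 15.62]  # Damage1..Damage6
true_rank = [2, 1, 5, 4, 3, 6]                          # 1 = most severe
# Higher score should mean more severe (lower rank number), so negate ranks:
rho, _ = spearmanr(mean_scores, [-r for r in true_rank])
print(f"Ranking Accuracy (Spearman rho): {rho:.2f}")    # -> 1.00
```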
Case study
Phase I IASC–ASCE benchmark
As shown in Figure 4, the Phase I IASC–ASCE SHM Benchmark Problem, organized by IASC and ASCE, evaluates SHM algorithms using simulated data from a 4-story, 2-bay by 2-bay 3D steel-frame model structure. This benchmark includes datasets from healthy conditions and six distinct damage patterns, progressing from easily detectable extreme damage to more challenging cases. The severity ranking, based on visual representation, places Damage2 as the most severe, followed by Damage1, Damage5, Damage4, Damage3, and Damage6 as the least severe.

The six damage patterns (a-g) and their severity ranking. 68
The ASCE Benchmark Problem outlines five cases designed to evaluate SHM techniques, each varying in complexity with respect to load distribution and structural symmetry. These cases include both symmetric and asymmetric structures, with degrees of freedom (DOF) ranging from 12 to 120, and different load application points, such as all stories or roof-only. Among these, we have chosen to focus on Case 5 due to its complexity and relevance to real-world applications. This case features a 120-DOF asymmetric structure subjected to roof-level loading, making it particularly effective for testing the robustness of advanced anomaly detection and damage assessment methods. The asymmetry in Case 5 is introduced by replacing one 400-kg floor slab on the roof with a 550-kg slab, resulting in a roof configuration of three 400-kg slabs and one 550-kg slab. Additionally, instead of distributed floor excitations, a shaker is positioned at the top of the center column on the roof to simulate specific dynamic events. Gaussian noise is also added to the acceleration data to represent sensor measurement noise, further simulating real-world conditions. These settings allow for a precise simulation of dynamic events, providing a rigorous challenge for SHM techniques to accurately capture the structure’s response. More details on this benchmark can be found in Johnson et al. 68
Data generation
In the training phase of our model, we efficiently processed 45 min of high-frequency data sampled at 1000 Hz using a sliding window with a window size of 256 and a stride of 90, resulting in 29,998 time steps. Both the training and validation sets were derived from this single continuous 45-min segment of healthy condition data, split in an 8:2 ratio. During training, the model was exclusively exposed to this segment to learn patterns characterizing a healthy structural state.
For the test dataset, which was also sampled at 1000 Hz using a sliding window with a window size of 256 and a stride of 90, data were collected over durations of 1, 2, 4, 6, and 8 min for both the healthy state and the six damage patterns described earlier. This approach aimed to determine the optimal duration for early damage detection and severity assessment. Analyzing various time segments helped us understand how different observation lengths impact the model’s capability to accurately and promptly detect anomalies and assess structural damage severity. By comparing the model’s performance across these segments, we identified the most effective observation window for early damage detection, ensuring timely interventions and thorough damage severity assessments. This comprehensive strategy ensures that the SHM algorithms are robust and effective across a range of damage scenarios and observation periods.
As shown in Table 1, during the training phase, we convert the healthy dataset into a set of sub-sequences using overlapping windows of length 256 with a stride of 90, matching the preprocessing configuration described earlier.
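As a quick sanity check on the reported count, the standard window-count formula $\lfloor (N - L)/s \rfloor + 1$ reproduces the 29,998 sub-sequences:

```python
# 45 min at 1000 Hz, windows of length 256 with stride 90:
samples = 45 * 60 * 1000                  # 2,700,000 data points
n_windows = (samples - 256) // 90 + 1     # floor((N - L)/s) + 1
print(n_windows)                          # -> 29998
```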
U-GraphFormer configuration parameters for the ASCE dataset.
U-GraphFormer’s performance on the IASC–ASCE Phase I benchmark for Damage1 under varying Tail-$p$ values.
During the training phase, our model processed 45 min of high-frequency data sampled at 1000 Hz, employing moving average smoothing to improve reconstruction accuracy. A sliding window with a size of 256 and a stride of 90 was then applied, resulting in 29,998 time steps. Thanks to an early stopping criterion, the training was completed in 48 min and 56 s, with each iteration averaging 85.36 s. This efficiency is impressive given the high sampling rate and the complexity of handling numerous time steps.
In the testing phase, we analyzed a 4-min segment of data, also sampled at 1000 Hz, using identical sliding window parameters. This resulted in 2666 time steps. The model evaluated the test data in under 2 min, demonstrating its ability to deliver timely insights. This performance highlights the model’s ability to efficiently handle large volumes of high-frequency data, allowing for quick detection and response to structural changes.
Early detection and severity assessment of damages in testing
Early detection: To evaluate the model’s early detection performance, we created six testing segments, each consisting of 1-min healthy data combined with 1-min data from a different severity damage scenario (Damage1–Damage6). With a sample frequency of 1000 Hz, each 1-min portion contributes 60,000 data points, yielding 120,000 data points per segment. Preprocessing was performed using a sliding window with a window size of 256 and a stride of 90, yielding 1328 time points per segment. As the reconstruction is window-based, utilizing overlapping windows of length 256, the scores of overlapping positions are aggregated to obtain point-wise anomaly scores.
Our objective is to assess the model’s capability to accurately and promptly detect the onset of damage within these segments. Anomaly scores were initially computed using base scoring to evaluate the preliminary severity. Based on this evaluation, a decision was made whether to further apply Gauss_D_K according to the selection method described in part Segment-Level Self-Adaptive Scoring Decision Mechanism of section “Anomaly detection.” The detection threshold was established based on the training data, and anomalies were identified when the anomaly score exceeded this threshold. This approach allows for a comprehensive evaluation of the model’s effectiveness in early damage detection across varying severity levels.
Severity assessment: To assess the severity of damage, we collected data over durations of 1, 2, 4, 6, and 8 min for both the healthy state and the six different severity damage scenarios (Damage1–Damage6). Each duration-specific dataset was preprocessed using a sliding window with a window size of 256 and a stride of 90. This preprocessing step yielded various time points depending on the duration of the dataset. We tested these seven segments separately to evaluate the model’s performance in assessing the severity of structural damage. Anomaly scores were computed for each segment using base scoring to determine the preliminary severity. The mean and standard deviation of these anomaly scores were then calculated for each damage scenario and duration.
As mentioned before, the base anomaly score is more effective in assessing the severity of damage compared to the Gauss_D_K anomaly score. The base anomaly score captures the overall deviation from normal behavior more directly, making it more sensitive to significant structural changes. Higher mean anomaly scores indicate more substantial structural deviations, while higher standard deviations suggest greater variability and inconsistency in the detected anomalies. This analysis provides critical insights into the extent and variability of the damage, enabling a nuanced understanding of the structural health. The rationale behind the effectiveness of the base anomaly score lies in its ability to reflect the immediate and cumulative impact of damage on the structure, whereas the Gauss_D_K score may sometimes smooth out critical deviations due to its kernel-based approach. We will validate this observation in our experiments by comparing the performance of both scoring methods across different damage scenarios and durations.
By comparing the mean and standard deviation of the anomaly scores across different durations, we can evaluate the effectiveness of the model in detecting and assessing the severity of damage over time. This comprehensive assessment helps in planning appropriate maintenance and intervention strategies, ensuring timely and accurate responses to various levels of structural damage.
Experimental results and discussions
We conducted a comprehensive series of experiments on the test dataset to evaluate the model’s performance in early detection of damage and severity assessment.
Early detection of damage: Early detection of structural damage is crucial for ensuring timely interventions and minimizing the risk of severe damage. In this context, our model employs a segment-wise alarm mechanism based on the ratio of time-wise point anomaly scoring, providing an effective strategy for early detection across different damage severities.
The model continuously monitors the anomaly score at each time point. When these scores exceed predefined thresholds, they contribute to the time-wise point anomalies. Within each segment, the model calculates the ratio of these anomalies—instances where the anomaly score exceeds the threshold—to the total number of time points in the segment; a segment-wise alarm is raised when this ratio surpasses the alarm criterion.
Table 3 presents a comparative analysis of Base Anomaly Scoring and Gauss_D_K Anomaly Scoring across different test segments, combining healthy data with each damage pattern (Damage1–Damage6). The baseline for the healthy segment has a mean of 13.87 and a standard deviation of 5.23, setting the threshold at $13.87 + 2 \times 5.23 = 24.33$.
Comparison of base anomaly scoring and Gauss_D_K anomaly scoring under different test segments on metrics precision, recall, and F1 score for time-wise anomaly detection. Test segments include combinations of healthy state with each damage pattern (Damage1–Damage6). The baseline for the healthy segment has a mean of 13.87 and a standard deviation of 5.23, determining the threshold of 24.33.
The effectiveness of this strategy is reflected in the time-wise precision, recall, and F1-score metrics, which determine the accuracy of early detection. Precision measures the proportion of true positive anomalies among all detected anomalies, ensuring that when an alarm is raised, it is likely due to genuine damage rather than false positives. High precision is crucial for minimizing unnecessary inspections or repairs. Recall indicates the model’s ability to detect all actual damage events. High recall is essential for early detection because missing any signs of damage can allow it to progress unnoticed, leading to potentially catastrophic failures. The F1-score, which balances precision and recall, provides a comprehensive measure of the model’s overall performance, making it a critical metric for evaluating the effectiveness of early detection.
For example, in the case of severe damage (Healthy + Damage1), the model achieves a precision of 0.99, recall of 0.98, and F1-score of 0.98 using Base Anomaly Scoring, as presented in Figure 5. The mean anomaly score for this segment is 747.74 with a standard deviation of 1837.00, and the overall mean anomaly score is 380.81, significantly above the threshold of 24.33. This demonstrates the model’s ability to promptly flag anomalies in alignment with the ground truth, capturing substantial deviations in structural behavior within a short timeframe. Moreover, the threshold-independent metrics corroborate this result: the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves in Figure 6 yield an Area Under the Curve (AUC) of 0.991 and an average-precision (AP) of 0.994, respectively, demonstrating that the anomaly score separates healthy and Damage1 windows almost perfectly across all decision thresholds.

Anomaly detection for the test segment combining healthy data with Damage1 using base anomaly scoring.

ROC curve for the test segment combining healthy data with Damage1 using base anomaly scoring.
Conversely, in the case of the most minor damage scenario (Healthy + Damage6) as shown in Figure 7, where Damage6 reduces the stiffness of a single brace on the first floor to two-thirds of its original value, the model applies the Gauss_D_K scoring method. The mean anomaly score in this case is 17.14 with a standard deviation of 3.21, resulting in a precision of 0.80, recall of 0.71, and F1-score of 0.75. The overall mean anomaly score is 15.62, which falls below the base threshold, justifying the use of Gauss_D_K scoring to enhance detection sensitivity. Despite the subtlety of the damage, the segment-wise alarm accuracy for this scenario remains at 100%, demonstrating the model’s robustness in detecting even the most minor anomalies. Although the Damage6 case is intentionally subtle, the ROC analysis in Figure 8 still reports an AUC of 0.794 and an AP of 0.834, confirming that the score distribution retains sufficient contrast to support early detection—particularly once the Gauss_D_K anomaly-scoring is applied.

Anomaly detection for the test segment combining healthy data with Damage6 using Gauss_D_K anomaly scoring.

ROC curve for the test segment combining healthy data with Damage6 using Gauss_D_K anomaly scoring.
The model’s segment-wise alarm accuracy is consistently 100% across all tested scenarios, including both serious and minor damages. This high level of accuracy indicates that the model reliably triggers alarms whenever there is a genuine structural issue, regardless of the severity of the damage. By leveraging both Base Anomaly Scoring and Gauss_D_K scoring within this framework, the model can adaptively detect early signs of damage, ensuring that even minor issues are identified before they escalate into significant problems. This approach not only optimizes the early detection process but also enhances the overall reliability of the model in diverse SHM scenarios.
Severity assessment of damage: Table 4 evaluates damage detection and severity across different segment lengths for various damage states, including Healthy, Damage1, Damage2, Damage3, Damage4, Damage5, and Damage6. Segment lengths of 1, 2, 4, 6, and 8 min are analyzed. The mean and standard deviation of anomaly scores are used to assess the severity of damage, with higher values indicating more severe damage.
Evaluation of damage detection and severity across different segment lengths. The table presents the mean and standard deviation of anomaly scores for different damage states: Healthy, Damage1, Damage2, Damage3, Damage4, Damage5, and Damage6. Segment lengths evaluated include 1, 2, 4, 6, and 8 min. The Ranking Accuracy (Rank Correlation) metric indicates how well the mean and standard deviation of the anomaly scores align with the true severity ranking. The 4-min segment length shows the best performance with a perfect ranking correlation, highlighted in yellow. The 2- and 6-min segment lengths also demonstrate strong performance with high-ranking accuracy, highlighted in light yellow.
The detailed examination of the anomaly scores across different segment lengths provides several key insights into the model’s performance and its implications for SHM. The 4-min segment length stands out as the most effective, striking an optimal balance between data granularity and detection accuracy. This segment length captures sufficient data to reliably detect anomalies while maintaining responsiveness to structural changes. The highlighted yellow rows in the table illustrate how the 4-min segments’ mean and standard deviation of anomaly scores closely match the true severity rankings, demonstrating both high sensitivity and precision. In contrast, the 2- and 6-min segments also show a good alignment with the true severity rankings but exhibit more variability in their standard deviations. This suggests that while these segments can still accurately detect damage, their detection precision may fluctuate slightly, indicating a trade-off between stability and sensitivity. The 1-min segment length, with its higher variability in anomaly scores, particularly in the standard deviation, indicates less stable detection performance. This shorter segment length is more susceptible to noise and transient variations, which can obscure the true signal of structural anomalies and lead to inconsistent detection results. On the other hand, the 8-min segment length displays lower anomaly scores, reflecting a less sensitive detection capability. The averaging effect over a longer duration may dilute the impact of transient damage events, causing critical anomalies to be smoothed out and potentially missed. This highlights the potential drawback of longer segments in failing to capture rapid or transient changes in structural health.
Overall, the analysis underscores the importance of selecting an appropriate segment length for effective damage detection. The 4-min segment length provides a robust and reliable framework for SHM, balancing the need for sensitivity to anomalies and stability in detection. These insights are crucial for optimizing monitoring strategies and ensuring timely and accurate maintenance interventions, ultimately contributing to the safety and longevity of structures.
The boxplot as shown in Figure 9 provides a comprehensive visualization of the model’s performance in detecting and differentiating various levels of structural damage severity. In the main plot, the anomaly scores for the most severe damage scenarios (Damage1 and Damage2) are substantially higher than those for less severe damage scenarios. This clear distinction underscores the model’s capability to identify and prioritize significant structural issues accurately.

Boxplot of mean and standard deviation for all damage severities. This plot illustrates the distribution of time-wise anomaly scores across different damage severities, ranging from Healthy to Damage6. Due to the substantial difference in anomaly scores, a zoomed-in boxplot is included for the lesser severity groups (Healthy, Damage6, Damage3, Damage4, and Damage5) to highlight their anomaly scores, which are significantly smaller compared to the more severe damage scenarios (Damage1 and Damage2). The smaller boxplot ensures the visibility of anomaly scores for less severe damage phases, demonstrating the effective differentiation of damage severity by the model.
The zoomed-in inset plot highlights the lesser severity damage scenarios (Healthy, Damage6, Damage3, Damage4, and Damage5). This inset is crucial as it brings to light the subtle variations in anomaly scores that might be overshadowed by the higher scores of the more severe damages. Despite the smaller anomaly scores, the model effectively differentiates between these less severe scenarios, demonstrating its sensitivity to even minor structural anomalies.
The ranking of the anomaly scores across all scenarios is completely consistent with the ground truth, validating the model’s accuracy in assessing damage severity. The mean and standard deviation values for each damage scenario, presented in the plot, further confirm the model’s robust performance in providing a nuanced understanding of the structural health. This capability is essential for prioritizing maintenance interventions, thereby enhancing the safety and integrity of the structure. The visualization not only highlights the model’s effectiveness in early damage detection but also its precision in ranking the severity of damage, which is critical for timely and appropriate structural maintenance.
Model comparison
To assess the improvements introduced by U-GraphFormer, a comprehensive comparison was conducted with the previous EdgeConvFormer model. 59 The U-GraphFormer model incorporates a more sophisticated architecture to enhance its feature extraction and reconstruction capabilities. The results presented in Tables 5 and 6 demonstrate that U-GraphFormer significantly outperforms EdgeConvFormer across various validation metrics, including precision, recall, and F1 score for time-wise anomaly detection, as well as segment-wise detection accuracy. The improved architecture allows for more accurate identification of structural damages across different test segments, leading to better overall performance in SHM applications.
Comparison of EdgeConvFormer and U-GraphFormer based on validation metrics.
Comparison of EdgeConvFormer and U-GraphFormer under different test segments on metrics precision, recall, F1 score, and segment-wise detection results (Seg. detection). Test segments include combinations of healthy state with each damage pattern (Damage1–Damage6).
Table 5 provides a comparative analysis of the validation loss and validation reconstruction error for both models. The results clearly demonstrate the superior performance of U-GraphFormer, which achieves markedly lower values on both metrics than EdgeConvFormer.
Further insights can be drawn from the performance metrics on different test segments, as shown in Table 6. Each test segment consists of 1-min healthy data combined with 1-min data from various damage scenarios (Damage1–Damage6). The U-GraphFormer consistently outperforms the original model across all test segments in terms of precision, recall, and F1 score for time-wise anomaly detection. For example, in the “Healthy + Damage1” test segment, U-GraphFormer achieves a precision of 0.9910, recall of 0.9770, and F1 score of 0.9839, compared to EdgeConvFormer’s precision of 0.5853, recall of 0.9309, and F1 score of 0.7187. This significant improvement highlights the model’s ability to accurately and promptly detect anomalies.
Moreover, the U-GraphFormer shows remarkable performance even in more challenging scenarios such as “Healthy + Damage6,” where the damage is subtle and harder to detect. The enhanced model attains a precision of 0.7960, recall of 0.7057, and F1 score of 0.7481, significantly outperforming EdgeConvFormer which records lower precision (0.3210), recall (0.3085), and F1 score (0.3146).
U-GraphFormer extends the U-Net/Transformer paradigm by operating on a learned spatio-temporal graph: nodes represent sensors, edges encode both physical proximity and feature similarity, and temporal self-attention captures per-sensor dynamics. In the decoder, we mirror the encoder’s graph layers with skip-connections for multi-scale fusion rather than an MLP (EdgeConvFormer) or standard deconvolution (U-Net). This hybrid design not only preserves both global and local context but also scales with edge sparsity (avoiding the quadratic blow-up of pure Transformers). Compared to a vanilla transformer of similar depth, U-GraphFormer incurs 30% more per-layer FLOPs and 10% higher peak GPU memory (≈40 GB on 4 × TITAN V), yet still processes a 60 s segment (400 × 8 sensors) in 16 s of inference time—well within real-time SHM requirements.
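The architectural ingredients above can be condensed into a compact sketch. The PyTorch code below is a minimal, illustrative approximation and not the authors' implementation: the learned sparse spatio-temporal graph is reduced to a single learnable adjacency matrix, and the multi-scale U-Net hierarchy to two mirrored blocks with concatenation-based skip connections. Shapes follow the paper's 400-step, 8-sensor segments; all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class GraphTemporalBlock(nn.Module):
    """One encoder/decoder block: graph mixing across sensors, then temporal
    self-attention per sensor (a simplified stand-in for the paper's learned
    spatio-temporal graph layers)."""
    def __init__(self, n_sensors: int, d_model: int, n_heads: int = 4):
        super().__init__()
        # Learnable dense adjacency as a stand-in for the learned sparse graph.
        self.adj = nn.Parameter(torch.eye(n_sensors) + 0.01 * torch.randn(n_sensors, n_sensors))
        self.lin = nn.Linear(d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (B, T, S, d)
        B, T, S, d = x.shape
        # Graph step: mix sensor features through a softmax-normalized adjacency.
        a = torch.softmax(self.adj, dim=-1)    # (S, S)
        x = torch.einsum("uv,btvd->btud", a, self.lin(x))
        # Temporal step: self-attention along time, one sensor at a time.
        h = x.permute(0, 2, 1, 3).reshape(B * S, T, d)
        h, _ = self.attn(h, h, h)
        h = h.reshape(B, S, T, d).permute(0, 2, 1, 3)
        return self.norm(x + h)

class UGraphFormerSketch(nn.Module):
    """Mirrored encoder-decoder with U-Net-style skip connections."""
    def __init__(self, n_sensors=8, d_model=32, depth=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)
        self.enc = nn.ModuleList([GraphTemporalBlock(n_sensors, d_model) for _ in range(depth)])
        self.dec = nn.ModuleList([GraphTemporalBlock(n_sensors, d_model) for _ in range(depth)])
        self.fuse = nn.ModuleList([nn.Linear(2 * d_model, d_model) for _ in range(depth)])
        self.out = nn.Linear(d_model, 1)

    def forward(self, x):                      # x: (B, T, S) raw signals
        h = self.embed(x.unsqueeze(-1))        # (B, T, S, d)
        skips = []
        for blk in self.enc:
            h = blk(h)
            skips.append(h)
        # Decoder mirrors the encoder; skip connections fuse multi-scale context.
        for blk, fuse, skip in zip(self.dec, self.fuse, reversed(skips)):
            h = blk(fuse(torch.cat([h, skip], dim=-1)))
        return self.out(h).squeeze(-1)         # reconstruction, (B, T, S)

# Smoke test: reconstruct a 400-step, 8-sensor segment.
model = UGraphFormerSketch()
seg = torch.randn(2, 400, 8)
print(model(seg).shape)                        # torch.Size([2, 400, 8])
```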
These findings validate the efficacy of the architectural modifications in U-GraphFormer. The mirrored encoder–decoder structure with skip connections enhances feature extraction and anomaly detection, resulting in improved performance metrics across different test segments. This underscores the robustness and adaptability of the U-GraphFormer model, making it a more reliable tool for SHM and damage detection. The ablation experiments further demonstrate that the enhancements in U-GraphFormer not only reduce validation loss and reconstruction error but also significantly improve the accuracy and reliability of anomaly detection across various damage scenarios.
Bolt-loosening detection in jacket-type offshore wind turbine supports
In this case study, we evaluate our unsupervised U-GraphFormer model on vibration data collected from a scaled jacket-type offshore wind turbine support, focusing on early detection and severity assessment of bolt loosening (see Figure 10). The dataset, originally described by Valdez-Yepez et al., 45 comprises high-frequency voltage measurements from eight triaxial accelerometers (PCB® 356A17) mounted at key joints on the jacket structure. Vibrations were induced by white noise excitation to simulate operational loading, while bolt conditions spanned four states: fully tightened (12 Nm), slight loosening (9 Nm), moderate loosening (6 Nm), and fully removed.

Experimental setup for bolt-loosening detection on the scaled jacket-type support structure. (Left) Eight PCB® 356A17 triaxial accelerometers mounted at key leg-to-brace joints and labeled Sensor 1–Sensor 8. (Top right) Close-up of the specific bolts whose preload was systematically varied. (Bottom right) Annotated view of the four structural levels (Level 1 through Level 4) where bolt conditions (healthy, 9 Nm, 6 Nm, removed) were induced and tested. 45
Data split and testing
The vibration dataset utilized in this study was acquired from the publicly accessible Dataverse archive published by Valdez-Yepez et al., 45 available online via DOI: 10.34810/data1011. We specifically focused on the dataset corresponding to a mid-level white-noise excitation amplitude (folder A_1). Each data file in this subset comprises 24-channel accelerometer voltage responses collected at approximately 25,840 time points, equivalent to roughly 26 s of data recorded at a sampling rate of 1 kHz.
For the unsupervised training phase, we aggregated 19 CSV files corresponding to the healthy structural condition, resulting in a comprehensive training set containing a total of 491,055 time points across 24 channels. An additional healthy CSV file was reserved exclusively for validation purposes, ensuring that the model remained unbiased and robust in distinguishing normal structural behavior.
As shown in Table 7, the test dataset includes a total of 13 CSV files, carefully assembled to comprehensively assess the model’s anomaly detection capability and severity ranking accuracy. This set includes a single held-out healthy file along with 12 damaged-condition recordings. Specifically, for each of the three damage states (slightly loose, 9 Nm; moderately loose, 6 Nm; and fully removed bolts), one recording is randomly selected for each of the four structural levels (Levels 1–4). Collectively, this results in roughly 326,214 time points dedicated to testing and inference.
Test set composition for white-noise excitation at nominal intensity (amplitude 1.0). Each trial (one healthy and twelve damaged at different levels) is nominally described as a 60-s recording at 1 kHz. Each file contains 24 sensor channels.
Data preprocessing and inference procedure
The preprocessing pipeline begins by concatenating 19 CSV recordings corresponding to the healthy state into a unified training dataset, comprising approximately 491,055 time points across 24 sensor channels. Missing values in the concatenated dataset are imputed using the column-wise mean to maintain consistency and data integrity. Considering the spatial characteristics of structural vibrations, the original 24-dimensional dataset—corresponding to eight triaxial accelerometers—is condensed into eight magnitude-based scalar features. Each feature represents the vibration magnitude at a sensor location, computed as the Euclidean norm across the three orthogonal axes (x, y, z) recorded by that sensor.
This fusion step transforms the triaxial sensor data into scalar magnitudes, capturing vibration energy comprehensively and effectively reducing dimensionality while preserving spatial insights critical for anomaly detection.
Following fusion, the magnitude-based training data are segmented into overlapping subsequences using a sliding window approach, configured with a window length of 60 samples (equivalent to 60 ms at 1 kHz sampling rate) and a stride of 30 samples (50% overlap). Each subsequence is then summarized by averaging across the temporal dimension to generate consistent feature vectors for model training.
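A minimal NumPy sketch of this pipeline follows, assuming the 24 channels are ordered sensor-wise as consecutive (x, y, z) triplets; the function and array names are illustrative, not the authors' code.

```python
# Sketch of the preprocessing described above: column-mean imputation,
# triaxial-to-magnitude fusion (24 -> 8 channels), then an overlapping
# sliding window (length 60 = 60 ms at 1 kHz, stride 30 = 50% overlap)
# whose windows are summarized by their temporal mean.
import numpy as np

def preprocess(raw: np.ndarray) -> np.ndarray:
    """raw: (T, 24) voltages from 8 triaxial accelerometers -> (n_windows, 8)."""
    # 1) Impute missing values with the column-wise mean.
    col_mean = np.nanmean(raw, axis=0)
    raw = np.where(np.isnan(raw), col_mean, raw)

    # 2) Fuse each sensor's (x, y, z) triplet into one vibration magnitude,
    #    assuming channels are grouped per sensor in axis order.
    mags = np.linalg.norm(raw.reshape(-1, 8, 3), axis=2)   # (T, 8)

    # 3) Sliding window with 50% overlap, averaged over the time dimension.
    win, stride = 60, 30
    n_windows = (len(mags) - win) // stride + 1
    return np.stack([mags[i * stride:i * stride + win].mean(axis=0)
                     for i in range(n_windows)])

features = preprocess(np.random.randn(25_840, 24))   # one ~26-s file at 1 kHz
print(features.shape)                                # (860, 8)
```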
The inference phase employs an identical preprocessing strategy across all test datasets, which include recordings for healthy conditions and multiple bolt-loosening damage states (minor (9 Nm), moderate (6 Nm), and severe (NoBolt)). These test recordings undergo the same dimensional fusion, segmentation, and standardization procedures, with feature scaling based exclusively on training data statistics.
Subsequent to preprocessing, each standardized test sequence is fed into the U-GraphFormer model. Reconstruction errors generated by the model serve as anomaly scores for each sequence, enabling both the identification of anomalies and the quantification of structural integrity. Thresholds for anomaly detection are adaptively derived from the distribution of anomaly scores obtained from healthy training segments, ensuring robust, and generalizable anomaly detection performance.
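The thresholding step can be sketched as follows. The paper states only that the threshold is derived adaptively from the healthy-score distribution; the percentile-based rule below is an assumption chosen for illustration.

```python
# Hedged sketch of adaptive thresholding from healthy anomaly scores.
# The percentile choice (q = 99) is an assumption, not the paper's rule.
import numpy as np

def adaptive_threshold(healthy_scores: np.ndarray, q: float = 99.0) -> float:
    """Return the q-th percentile of reconstruction errors on healthy data."""
    return float(np.percentile(healthy_scores, q))

healthy_scores = np.abs(np.random.randn(5_000))   # placeholder healthy errors
tau = adaptive_threshold(healthy_scores)
print(f"anomaly threshold tau = {tau:.3f}")
```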
Training and inference efficiency
U-GraphFormer was trained once on the 19 healthy-state CSV trials provided by the wind-turbine dataset (491,055 samples per channel, i.e., ≈8 min at 1 kHz) in 1194 s on a workstation equipped with four NVIDIA TITAN V 12 GB GPUs (peak memory ≈40 GiB). For inference, we evaluated all 13 test trials—each containing 24,772–25,882 samples per channel (≈25–26 s of data at 1 kHz; see Table 7)—in a total of 207.1 s, averaging ≈16 s per file. This corresponds to ≈0.63 s of computation per 1 s of recorded data. Extrapolating to a full 60-s recording (60,000 samples per channel) yields an estimated inference time under 40 s. These runtimes and resource footprints confirm that U-GraphFormer is both efficient and practical for near-real-time SHM.
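For reference, the quoted throughput figures reconstruct directly from the reported totals:

```latex
\[
  \frac{207.1\ \text{s}}{13\ \text{files}} \approx 15.9\ \text{s per file},
  \qquad
  \frac{15.9\ \text{s}}{\approx 25.5\ \text{s of data}} \approx 0.63,
  \qquad
  0.63 \times 60\ \text{s} \approx 38\ \text{s} < 40\ \text{s}.
\]
```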
Results and discussions
Early detection of damage: In a real-world monitoring scenario, each nominal 60-s acquisition (one CSV file) is treated as a continuous segment whose condition must be assessed promptly. After applying the preprocessing steps described in section “Data preprocessing and inference procedure,” each segment yields 400 steady-state feature points. An anomaly score, given by the model’s reconstruction error, is then computed for each point.

Anomaly detection was performed on a single continuous test sequence obtained by concatenating 13 CSV files—one healthy baseline and four distinct bolt-loosening positions (Levels 1–4) for each of three fault modes (9 Nm “minor,” 6 Nm “moderate,” and NoBolt “severe”). After preprocessing, the first 100 transitional samples of each 500-point segment were removed, yielding 400 steady-state points per file and a 5200-point sequence in total. The reconstruction-error-based anomaly score plotted over this sequence, together with precision metrics, shows a clear escalation in detected deviation corresponding to the progression from healthy through each loosening position and fault mode.
To support reliable early detection, we apply tail-based thresholding, deriving the alarm threshold from the upper tail of the healthy anomaly-score distribution, and aggregate the resulting time-wise detections into a segment-level, ratio-based decision.
Time-point-wise performance is further illustrated in Figure 12, which shows an ROC AUC of 0.832 and an average precision of 0.985. These values confirm that the model effectively distinguishes healthy and anomalous points. However, relying on individual scores alone may be overly sensitive to brief transients. Our ratio-based mechanism provides a principled way to aggregate these time-wise signals into a stable yet responsive decision process.
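A sketch of that ratio-based aggregation is given below, assuming the 50% ratio reported later in the bridge study serves as the default cutoff; the function and variable names are illustrative.

```python
# Ratio-based segment-level decision: a segment raises an alarm when more
# than `ratio` of its time points exceed the anomaly threshold tau.
import numpy as np

def segment_alarm(scores: np.ndarray, tau: float, ratio: float = 0.5) -> bool:
    """True if the fraction of points with score > tau exceeds `ratio`."""
    return float(np.mean(scores > tau)) > ratio

segment = np.abs(np.random.randn(400))   # 400 steady-state points per file
print(segment_alarm(segment, tau=2.5))
```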

ROC and precision-recall curves for base anomaly scoring on the concatenated test set of healthy operation and progressive bolt-loosening faults (Levels 1–4 for each of 9 Nm minor, 6 Nm moderate and NoBolt severe).
Severity assessment of damage: Once a segment is flagged as anomalous, we assess the severity of the damage using two descriptive statistics of the reconstruction-error sequence: its mean and its standard deviation.
In the healthy baseline segment, the scores are consistently low, with both the mean and the standard deviation remaining close to the levels observed on the healthy training data, while the 9-Nm “minor” segments already show a measurable upward shift in both statistics.
This trend persists in the 6-Nm “moderate” and NoBolt “severe” cases. The 6-Nm segments show stronger mean deviations due to sustained bias in the reconstruction errors, and the NoBolt segments push both statistics higher still.
Importantly, this variation across bolt positions adds practical value: it allows the system to prioritize maintenance based on the structural importance of the affected joint. While some damage levels may yield lower anomaly scores, this is consistent with their reduced contribution to global stiffness or load transfer. The method thus offers not only accurate segment-wise detection, but also an interpretable, physically grounded severity ranking—allowing domain experts to differentiate between faults that are merely detectable and those that are truly critical.
Real-world case: A steel truss bridge subject to artificial damage
In this study, we focus on enhancing the early detection and severity assessment of structural damage in bridge structures by analyzing progressive scenarios. The Old ADA Bridge in Japan provides a valuable case study for SHM and damage detection. 44 As shown in Figure 13, this steel truss bridge, extensively tested before its removal in 2012, serves as an ideal testbed for understanding the effects of artificial damage on structural integrity and system identification.

Old ADA bridge. 44
This study utilized ambient vibration testing to measure the bridge’s structural response to natural environmental excitations. This method, which leverages the inherent vibrations from environmental factors such as wind and traffic, is advantageous due to its non-intrusive nature and cost-effectiveness. Capturing these low-level vibrations requires the deployment of high-quality accelerometers. Eight uniaxial accelerometers were strategically placed on the bridge deck to gather detailed vibration data. Five sensors were positioned near the truss member subjected to artificial damage, while the remaining three were located on the opposite side of the bridge deck. This arrangement ensured comprehensive coverage of the structural response across the entire bridge.
Figure 14(a) shows the sensor layout and related information. The bridge was subjected to five distinct damage scenarios, as depicted in Figure 14(b) and (c).

(a) Sketch and sensor information; (b) sketch of damage scenarios; and (c) artificial damage applied to tension members. 44
To ensure the statistical reliability of the data, measurements were repeated multiple times across these scenarios. Specifically, the intact (INT) state was recorded three times, while the half-cut, full-cut, and 5/8th span cut scenarios were each recorded once. The repaired state was recorded twice. Data were collected at a high sampling rate of 200 Hz, providing a detailed and high-resolution dataset of the bridge’s dynamic responses.
This comprehensive dataset offers a rare opportunity to analyze the behavior of a steel truss bridge under controlled damage conditions. It serves as a critical benchmark for developing and validating new methods for damage detection and SHM using accelerometer data. By analyzing this dataset, researchers can gain significant insights into the structural behavior under various damage scenarios, thereby enhancing the reliability and effectiveness of monitoring systems designed to ensure the safety and integrity of such critical infrastructure.
Data preparing and preprocessing
In this study, ambient vibration tests were conducted on a bridge structure with a sampling rate of 200 Hz to evaluate the structural condition under various scenarios, including INT and damaged states. The vibration data were collected using eight accelerometers, providing a high-resolution capture of the bridge’s vibrational behavior. The five different scenarios tested include INT, three damage states (DMG1, DMG2, and DMG3), and a recovery state (RCV).
As shown in Table 8, for the INT scenario, data were collected in three separate tests labeled No1, No2, and No3, with sample sizes of 48,615, 84,125, and 75,964, respectively. The recovery scenario included two tests, No1 and No2, with 10,574 and 75,294 samples, respectively. The damage scenarios, DMG1, DMG2, and DMG3, each involved a single test, with 57,553, 73,859, and 71,553 samples, respectively.
Ambient vibration data collected from five different structural states of a steel truss bridge. The scenarios include: the INT state, half-cut vertical member (DMG1), fully cut vertical member (DMG2), repaired state (RCV), and 5/8th span cut (DMG3). Data were collected for three repetitions of the INT test and two repetitions of the RCV test, while only one dataset was collected for DMG1, DMG2, and DMG3. Each dataset’s size indicates the number of time steps (rows) recorded using 8 accelerometers (columns). The details of each structural state are provided to highlight the different damage levels and locations. 69
To ensure a consistent and comprehensive dataset for model training and evaluation, we selected specific datasets based on their length and variability. The INT datasets from tests No2 and No3, consisting of 84,125 and 75,964 samples respectively, were chosen for training. These datasets provide extensive coverage and variability, crucial for robust model learning. For testing, we utilized the INT No1 dataset, with 48,615 samples, as a baseline to assess the model’s ability to recognize the INT condition. Additionally, the DMG1, DMG2, and DMG3 datasets, containing 57,553, 73,859, and 71,553 samples, respectively, were used to evaluate the model’s sensitivity to different damage states. The RCV No2 dataset, with 75,294 samples, was included to assess the model’s capability to detect improvements in structural integrity after recovery.
In the preprocessing phase, we implemented several techniques to enhance the model’s efficiency, carefully chosen through empirical assessment and cross-validation. We began with moving average smoothing, using a window size of 60 and a stride of 30, to reduce noise and highlight significant patterns within the vibration data, thereby enhancing the signal-to-noise ratio for more accurate anomaly detection. Following this, we standardized the data to a mean of zero and a standard deviation of one, ensuring uniformity across all accelerometer readings and allowing each feature to contribute equally to the analysis. We then applied an overlapping sliding window technique with a window size of 100 and a stride of 10, generating multiple overlapping segments from each time series to enrich the training dataset. This approach enabled the model to focus on capturing essential vibration patterns and temporal cycles, improving its sensitivity to subtle variations that may indicate potential anomalies or structural damage.
Training and testing process
The training dataset consisted of 160,089 time points, maintaining a high sampling rate of 200 Hz. The training process adhered closely to the parameters set in the ASCE benchmark, as detailed in Table 1, except for using a window size of 60 and a stride of 30 for the moving average, which resulted in a total of 5335 time steps. The training process was efficient, requiring only 887 s and utilizing 34,436 MiB of GPU memory distributed across 4 NVIDIA TITAN V 12 GB GPUs.
For the testing phase, a more extensive dataset of 326,874 time points was assembled by combining data from INT No1, DMG1, DMG2, RCV No2, and DMG3. This dataset was designed to simulate a progressive structural state. As in the training phase, a sliding window technique with a window size of 60 and a stride of 30 was applied, resulting in 10,894 time steps. To emphasize the critical early stages of structural changes, each phase was truncated to include only the first 500 time points, yielding a total of 2500 time steps for detailed analysis.
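The reported time-step counts follow from the standard sliding-window count, \(n = \lfloor (T - w)/s \rfloor + 1\) with window \(w = 60\) and stride \(s = 30\); a quick check against the quoted figures:

```latex
\[
  n_{\text{train}} = \left\lfloor \frac{160{,}089 - 60}{30} \right\rfloor + 1 = 5335,
  \qquad
  n_{\text{test}} = \left\lfloor \frac{326{,}874 - 60}{30} \right\rfloor + 1 = 10{,}894.
\]
```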
The model efficiently processed these test data in just 31.61 s, demonstrating its ability to rapidly deliver insights. This performance highlights the model’s proficiency in managing large volumes of high-frequency data, ensuring timely detection and response to structural changes. By capturing and analyzing subtle variations in structural behavior, the model provides a robust solution for real-time monitoring of structural health.
Results and discussions
Early detection of damage: As shown in Figure 15, the model’s performance is evaluated using time-point-wise precision, recall, and F1-score metrics. These metrics are essential for identifying structural anomalies at their earliest stages, enabling proactive maintenance strategies. The subplot depicting predicted labels shows the model’s metrics: an overall precision of 1.00, recall of 0.68, and F1-score of 0.81. The recall value is primarily affected by the DMG3 phase, where the anomaly detection is inconsistent between the first and second halves.

Anomaly detection across progressive structural states. This plot illustrates the model’s performance in detecting anomalies across different structural states (INT, DMG1, DMG2, RCV, and DMG3) within a progressive scenario. Each phase originally consisted of 500 time points; however, the first 100 time points of each phase were removed to eliminate transitional effects, allowing the analysis to focus on steady-state behavior. This resulted in a total of 2000 time points. The model effectively identifies structural deviations using anomaly scores and precision metrics, highlighting its capability to detect and differentiate damage levels.
The model’s time-point-wise detection performance is robust across different phases, particularly in DMG1 and DMG2. In DMG1, the model effectively detects half-cut damage at the midspan with high precision and recall due to pronounced structural deviations that are easily captured. During DMG2, the model identifies full-cut damage, achieving high anomaly scores indicating severe structural compromise. This demonstrates the model’s capability to detect critical damage points that pose significant risks. The INT and RCV phases further illustrate its reliability: the model raises only a few false alarms under stable conditions and effectively detects the recovery post-repair, maintaining high precision and confirming its ability to recognize intact structures and repair efforts.
However, the DMG3 phase presents a unique challenge with notable variance in recall, where the consistency of anomaly detection varies between the first and second halves. In the first half of DMG3, there is a significant spike in anomaly scores, reflecting the full-cut damage at the 5/8th span, indicating a severe structural deviation captured by the model. In contrast, the second half of DMG3 shows a marked drop in anomaly scores, as illustrated in Figure 15. Upon closer examination of the reconstruction results in Figure 16, this decrease in anomaly scores aligns with a substantial reduction in the ground truth values. This correlation suggests that the low anomaly scores in the second half may not reflect a limitation of the model but rather the reduced distinguishable structural deviation in the data. Moreover, factors such as sensor placement and the quality of the collected data could further impact the model’s ability to consistently detect anomalies, especially when structural changes are subtle or masked by noise. These insights underscore the importance of considering both model performance and data integrity in evaluating the effectiveness of SHM systems.

Reconstruction results for sensors A1–A4: The figure illustrates the reconstruction performance of the model for sensors A1–A4. It demonstrates how well the model captures the original signal patterns and highlights any discrepancies that indicate potential anomalies.
Additionally, the transition from the RCV phase to DMG3, which involves recovery efforts, may introduce changes in structural dynamics that further complicate anomaly detection. These shifts in sensor readings during the DMG3 phase can affect the model’s consistency in identifying anomalies, highlighting the complexity of monitoring structural health during dynamic recovery processes.
Furthermore, despite the reduced recall in the second half of the DMG3 phase, the model’s ability to closely align anomaly scores with the actual time points where damage begins is noteworthy. This alignment, along with the remarkably high anomaly scores at the critical moment when damage begins, ensures that severe structural deviation is accurately identified. Additionally, because more than 50% of the time points within this segment are detected as abnormal, the segment-wise alarm is still triggered, keeping the segment-wise alarm accuracy at 100%. Even in this challenging phase, the model effectively raises an alarm, demonstrating its robustness and reliability in early damage detection.
These findings underscore the model’s effectiveness in early detection, as it accurately flags anomalies within the initial 1-min sensor data, closely aligning the detection time with the actual occurrence of damage. The segment-level adaptive scoring strategy proves robust and adaptable, effectively capturing both significant and minor anomalies across varying levels of damage severity, with a consistent segment-wise alarm accuracy of 100%.
Severity assessment of damage: Assessing the severity of detected damage is critical for prioritizing maintenance efforts and resource allocation. Anomaly scores, which quantify structural deviations, provide valuable insights into the extent of damage across different phases. Higher mean anomaly scores and greater variability indicate more pronounced structural deterioration, as shown in Table 9 and visualized in Figure 15.
Anomaly score statistics and location-based insights for each damage phase.
In the INT phase, which serves as the baseline, the anomaly scores are low, with a mean of 6.96 and a standard deviation of 1.46, indicating stable conditions across the bridge with no detected anomalies. This confirms the structural integrity of the bridge under normal conditions, demonstrating the model’s capability to recognize a healthy structure.
Moving to the DMG1 phase, half-cut damage is introduced at the midspan, where bending moments are highest, making it particularly vulnerable. The model reports medium anomaly scores, reflected by a mean of 240.75 and a high standard deviation of 378.02, indicating noticeable structural changes that could potentially affect stability. Early detection in this phase is crucial for initiating timely maintenance and ensuring the structural integrity of the bridge.
In the DMG2 phase, a full-cut at the midspan results in severe damage, with the structural integrity heavily compromised. The mean anomaly score spikes to 1084.82 with a substantial standard deviation of 1563.54, indicating a critical point in the structure’s lifecycle. The model detects correspondingly high anomaly scores, accurately reflecting the severe degradation and increased failure risk.
The RCV phase involves welded recovery to restore the bridge’s integrity. The anomaly scores reduce significantly, with a mean of 7.49 and a standard deviation of 1.45, indicating successful recognition of the recovery efforts and diminished structural anomalies. This underscores the model’s ability to detect improvements in structural conditions following repairs.
In the DMG3 phase, a full-cut occurs at the 5/8th span, representing severe damage at a new location on the bridge. While the mean anomaly score of 590.85 is lower than that of DMG2 due to varied load distributions, it still indicates high severity, as reflected by the high standard deviation of 1890.57. Although both DMG2 and DMG3 signify critical damage, the central position of DMG2 is expected to have a more significant impact on overall stability. The model effectively identifies these structural variations, demonstrating its ability to monitor and respond to severe damage across different spans of the bridge.
The model exhibits strong early damage detection, with high precision and recall in the DMG1, DMG2, and DMG3 phases, which is crucial for preventing minor issues from escalating into severe failures. These phases underscore the model’s effectiveness in identifying significant structural changes at key points. During the damaged phases, while the mean anomaly scores are high, the standard deviations are even higher, highlighting the model’s heightened sensitivity to structural anomalies. This variability results from the model’s unsupervised training on healthy states, leading to greater reconstruction errors under damaged conditions. In contrast, during INT and recovery phases, the model consistently produces low mean anomaly scores with low standard deviation, accurately distinguishing non-damage states and confirming the effectiveness of repairs. These results collectively demonstrate the model’s reliability in monitoring the bridge’s condition and enabling timely maintenance interventions.
Conclusion
U-GraphFormer represents a significant advancement in SHM by offering enhanced capabilities in damage detection and severity assessment. This novel model combines advanced data processing techniques, such as spatiotemporal graph learning and sensor-specific temporal self-attention, within a mirrored encoder–decoder architecture reminiscent of U-Net, to effectively capture and analyze complex patterns in sensor data. The integration of skip connections further enhances the model’s ability to reconstruct and identify anomalies with greater accuracy, making it a powerful tool for early detection and intervention.
The successful application of U-GraphFormer in both benchmark tests and real-world scenarios underscores its robustness and adaptability, particularly in early damage detection and severity assessment. The model excels in segment-wise detection, achieving 100% accuracy in all scenarios, and demonstrates high precision in detecting the exact start time of damage. This is evident in its performance on Damage1 in the benchmark, where it achieved precision, recall, and F1-scores of 0.99, 0.98, and 0.98, respectively. Even when detecting the most minor damage, such as Damage6, U-GraphFormer continues to perform commendably, with precision, recall, and F1-scores of 0.80, 0.71, and 0.75. This capability is vital for preventing minor issues from escalating into severe structural failures. By utilizing the mean and standard deviation of anomaly scores as interpretable indicators, U-GraphFormer delivers a nuanced assessment of damage severity. Rigorous testing on the ASCE benchmark and a real-world steel truss bridge demonstrated the model’s effectiveness, achieving a ranking accuracy of 1. This flawless accuracy reflects the model’s ability to provide severity assessments that closely align with actual damage progression, making U-GraphFormer an invaluable tool for timely maintenance and informed decision-making.
Future work could improve U-GraphFormer in several important directions. First, investigating sensor-level and temporal attribution techniques, such as saliency maps or SHAP, may enhance diagnostic precision. Second, evaluating robustness under diverse environmental and operational conditions—by incorporating environment-adaptive scoring or domain-adaptive approaches—could further strengthen reliability in real-world settings. Finally, validating U-GraphFormer on a wider range of civil structures and damage types would help demonstrate its generalizability and scalability for SHM applications.
