Abstract
Introduction and project motivation
There have been many efforts to develop collections of ground motion data and metadata (e.g. Ancheta et al., 2014; Bahrampouri et al., 2021; Bozorgnia and Stewart, 2020; Castro et al., 2022; Goulet et al., 2021; Hutchinson et al., 2024; Ji et al., 2023; Luzi et al., 2016; Palmer et al., 2022; Rekoske et al., 2020; Rennolet et al., 2018; Sandikkaya et al., 2024). While these efforts have produced substantial volumes of data, researchers and engineers who wish to combine datasets for model development may encounter inconsistencies and incompatibilities between databases developed in different projects. Inconsistencies may arise from differences in processing procedures, metadata assignment protocols, or how intensity measures are computed. Incompatibilities arise from the different organizational structures used in different databases. Data integrity issues may also arise within a database depending on how the data are organized (e.g. different parameter values for the same site in a flatfile, or typos). These issues need to be considered when researchers and engineers utilize data from multiple databases.
The Engineering Strong-Motion Database (ESM) (Luzi et al., 2016) organized ground motion data and metadata coming from different sources in the European-Mediterranean region. The Next-Generation Attenuation (NGA) program developed a commonly used suite of databases that are divided by tectonic regime: shallow crustal events in active tectonic regions (NGA-West2; Ancheta et al., 2014), crustal events in stable continental regions (NGA-East; Goulet et al., 2021), and subduction earthquakes (NGA-Sub; Bozorgnia and Stewart, 2020). Each of the NGA projects has developed a uniformly processed dataset of recorded earthquake ground motions and associated source, path, and site metadata; however, there is no linkage across the three databases. Since completion of the NGA-West2 and NGA-East data collection efforts around 2011, the amount of seismic instrumentation has increased, particularly in the state of California (Kuyuk and Allen, 2013), which means that the quantity and spatial distribution of data produced from a single event can be much greater than for similar events included in earlier datasets. This has led to efforts to curate databases from more recent events (e.g. Farghal et al., 2020; Ji et al., 2023; Rekoske et al., 2020).
Ground motion processing has historically been performed manually for each record, but processing tools have evolved to a point where fully automated procedures can be used to select corner frequencies, accept/reject records based on measured quantities like signal-to-noise ratio (
Creating algorithms that match the judgment of experienced analysts is challenging, and because that judgment is used to ensure data quality while optimizing usable bandwidth, additional measures of quality assurance may be required for accepting/rejecting recordings and for corner frequency selection for data sets to be used in important applications. Furthermore, essential metadata, such as the preferred finite-fault solutions and basin depth parameters that have been used in the NGA projects (e.g. Ahdi et al., 2022; Contreras et al., 2022; Goulet et al., 2021b), are not contained in the databases queried by
The subject of this article is a ground motion database (GMDB) that has been developed for engineering applications and, at the time of writing (June 2025), contains data from several published databases including NGA-West2, NGA-East, and the Hellenic strong-motion database (Margaris et al., 2021), as well as data from more recent events that were processed by one or more of the authors. An important feature of the data included in the GMDB is the application of consistent protocols that include human inspection during record processing and source, path, and site metadata compilation. Following this introduction, we present the ground motion data stored in the database, and the procedures used to process new data and assign metadata using protocols consistent with those established during NGA projects. The GMDB has been assembled as a relational database, the organization of which is presented after discussion of the data. Finally, we discuss how to interact with the GMDB through a publicly accessible web portal and application programming interface (API).
Ground motion datasets
The GMDB includes ground motions (including time-series) processed by one or more of the authors for recent events in the western US (WUS), central and eastern North America (CENA), and Türkiye. The GMDB metadata (not including time-series) compilation began with the following published datasets: the NGA-West2 global database for shallow crustal earthquakes in active tectonic regions (Ancheta et al., 2014), the NGA-East database for CENA for shallow crustal earthquakes in stable continental regions (Goulet et al., 2021), and the Hellenic database from mostly shallow crustal and some subduction earthquakes within and near Greece assembled using protocols consistent with NGA projects (Margaris et al., 2021). As appropriate, metadata have been updated using results of more recent studies (e.g. updated site parameters based on new site measurements). The information from published databases was adapted into the GMDB data structure presented below, but otherwise was not modified unless explicitly stated.
The authors have processed ground motions in a consistent manner for a number of events since 2011 as part of prior studies. Those studies include site response and path studies in southern and northern California (Buckreis et al., 2023b, 2024a; Nweke et al., 2022; Wang, 2020; Mohammed et al., 2025), studies to assess the combined data misfits of NGA-East GMMs with site response models (Ramos-Sepúlveda et al., 2024), studies to examine the usability of ground motions recorded by the Community Seismic Network (CSN) (Mohammed et al., 2024), and ground motion analysis for the 2014 Napa earthquake, 2019 Ridgecrest earthquake sequence, and 2023 Kahramanmaraş, Türkiye earthquake sequence (Kishida et al., 2014; Ahdi et al., 2020; and Buckreis et al., 2024b, respectively). These efforts significantly increased the quantity of available ground motions in California, CENA, and the Mediterranean relative to NGA databases (Figure 1). The “Other” slice shown in Figure 1 includes data from Alaska, China, Japan, Taiwan, and New Zealand from the NGA-West2 database. The following subsection describes the different data collection efforts.

Distribution of recordings in the Ground Motion DataBase (GMDB) by region and collection study.
Data selection, processing, and distributions
Although the domains of the recent collection efforts contributing to the GMDB are different, protocols consistent with NGA projects were used to select and curate the data in each study. In all cases, only events with magnitudes (

Locations of event focal mechanisms and stations in California and neighboring states. Symbol colors represent different studies as shown in the legend.

Locations of event focal mechanisms and stations in Central and Eastern North America (CENA). Symbol colors represent different studies as shown in the legend. Boundary between Western United States (WUS) and CENA given by Moschetti et al. (2024).

Locations of event focal mechanisms and stations in the Mediterranean. Symbol colors represent different studies as shown in the legend.
Raw time-series records for all available stations for each U.S. event were obtained from the Incorporated Research Institutions for Seismology (IRIS; Trabant et al., 2012) or directly from seismic network operators. Records from events in and around California were cross-checked against those available from the Center for Engineering Strong Motion Data (CESMD) data repository maintained by the California Strong Motion Instrumentation Program (CSMIP) (https://www.strongmotioncenter.org/) and those from the California Department of Water Resources (DWR) seismic network. Buckreis et al. (2024b) obtained waveforms from the Earthquake Data Center System of Türkiye (TDVMS; https://tdvms.afad.gov.tr/) and IRIS. For all regions, data screening was performed to remove apparently unreliable and duplicated records, which may exist due to multiple collocated instruments at a site. In the event of collocated accelerometers and seismometers with meaningful recorded signals, which occurred for portions of the data processed by Kishida et al. (2014), Ahdi et al. (2020), Wang (2020), Buckreis et al. (2022), Buckreis et al. (2024b), Ramos-Sepúlveda et al. (2024), and Mohammed et al. (2024), we preferred the motion recorded by the seismometer unless there was evidence of amplitude clipping, in which case the time-series recorded by the accelerometer was used. In the case of data prepared by Mohammed et al. (2025), and anticipated future efforts, the motion with the widest usable bandwidth, as a function of the corner frequencies selected during signal processing and the sampling rate, is preferred.
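The two preference rules for collocated instruments described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by the authors: the record fields (`sensor`, `clipped`, `fc_hp`, `fc_lp`, `fs`) are hypothetical, and the definition of usable bandwidth (the log-width of the band from the high-pass corner to the smaller of the low-pass corner and the Nyquist frequency) is an assumption made only for this example.

```python
import math


def usable_bandwidth(fc_hp, fc_lp, fs):
    """Log10 width of the usable band (ASSUMED definition for illustration).

    The band runs from the high-pass corner fc_hp (Hz) up to the smaller of
    the low-pass corner fc_lp (Hz) and the Nyquist frequency fs / 2 (Hz).
    """
    upper = min(fc_lp, fs / 2.0)
    return math.log10(upper / fc_hp) if upper > fc_hp else 0.0


def prefer_seismometer(records):
    """Earlier rule: use the seismometer unless it clipped, else the accelerometer."""
    for rec in records:
        if rec["sensor"] == "seismometer" and not rec["clipped"]:
            return rec
    for rec in records:
        if rec["sensor"] == "accelerometer":
            return rec
    return None


def prefer_widest_band(records):
    """Later rule: keep the record with the widest usable bandwidth."""
    return max(records, key=lambda r: usable_bandwidth(r["fc_hp"], r["fc_lp"], r["fs"]))


# Hypothetical collocated pair; all values are made up for the example.
clipped_pair = [
    {"sensor": "seismometer", "clipped": True, "fc_hp": 0.02, "fc_lp": 40.0, "fs": 100.0},
    {"sensor": "accelerometer", "clipped": False, "fc_hp": 0.10, "fc_lp": 80.0, "fs": 200.0},
]
unclipped_pair = [dict(rec, clipped=False) for rec in clipped_pair]

legacy_choice = prefer_seismometer(clipped_pair)       # seismometer clipped
bandwidth_choice = prefer_widest_band(unclipped_pair)  # compare band widths
```

In this made-up case the first rule falls back to the accelerometer because the seismometer clipped, while the bandwidth rule selects the seismometer because its lower high-pass corner gives it the wider band.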
Each record component was processed individually according to standard protocols developed within NGA projects (e.g. Goulet et al., 2021). In the case of the records assembled in Ahdi et al. (2020), Wang (2020), and Buckreis et al. (2022, 2024b), signal processing was performed manually in R (R Core Team, 2022) using a processing code developed during NGA-West2 (Ancheta et al., 2014). The records assembled by Ramos-Sepúlveda et al. (2024) and Mohammed et al. (2024, 2025) were processed semi-automatically using a version of gmprocess (Thompson et al., 2025) that automatically adjusts high-pass corner frequencies to remove displacement artifacts and includes a manual review using a tool that allows corner frequencies to be adjusted or records to be rejected (Ramos-Sepúlveda et al., 2025). Because every record is reviewed and potentially adjusted by an analyst, the manner in which we have used gmprocess is equivalent to the NGA processing procedure, but is significantly more efficient because the automated selections made by
In total, we processed 34,371 records in California, 7029 records in CENA, and 1215 records in Türkiye as summarized in Table 1. The distribution of these data with respect to
Summary of the number of events, stations, and records in each dataset

Data distributions for California and neighboring states with respect to magnitude (

Data distributions for CENA with respect to

Data distributions for the Mediterranean region with respect to

Number of usable ground motions as a function of period for (a) California and neighboring states, (b) Central and Eastern North America (CENA), and (c) Mediterranean region.
Intensity measures
Intensity measures (IMs) are parameters computed from ground motion time-series and are commonly used to quantify certain engineering attributes of ground motions. NGA databases typically include peak IMs such as peak ground acceleration (
In the present dataset, all IMs excluding
Metadata
The ground motion data discussed in the previous section are most useful when accompanied by source, path, and site parameters. These parameters are associated with each individual ground motion, and are required for model development and for time-series record selection for use in response history analyses and other research studies and engineering applications. This section presents the sources of metadata and the methods used to assign metadata to records, sites, and earthquakes. In some cases, these methods incorporate approaches for resolving conflicts among metadata contained in the original NGA datasets.
Seismic-source parameters
Seismic-source parameters include origin date and time, hypocenter coordinates (longitude, latitude, and depth), seismic moment,
Parameters that describe the fault rupture surface as one or more rectangles (upper-left corner coordinates and dimensional length and width) are necessary to calculate source-to-site distances. Finite-fault solutions are often formulated for large magnitude events based on inversion of recorded surface motions. Of the post-NGA events in the GMDB, only the
FFMs for the remaining events were generated using the simulation procedure described in Contreras et al. (2022), performed using an updated version of the program originally developed by Chiou and Youngs (2008). This algorithm generates a stochastic set of possible rupture surfaces given the available source metadata, and selects the most probable surface that does not result in atypically short or long finite-fault distances for any given site. We re-coded the procedure for use in Python (Buckreis, 2024) and updated the procedure to include
Site / station parameters
We distinguish sites from stations because stations include information about instruments and networks that may change over time, whereas site data are constant. Furthermore, multiple stations may be installed at the same site, and separating them avoids duplication in data entry. Site parameters include location (latitude, longitude, and elevation), topographic slope, terrain class, surficial geological unit, recommended
Computed from a nearby (<150 m; same geology) measured
Estimated using the depth extrapolation relationship of Dai et al. (2013) with region-specific regression coefficients from Kwak et al. (2017a) when a nearby shallow measured
Estimated using regional
Estimated using a weighted combination of Kriging-interpolated (Thompson, 2018), slope- (Wald and Allen, 2007), terrain- (Yong et al., 2012), and/or geology-based proxy models as described in Wang (2020).
Geology-based
Following the hierarchy above, some sites from the NGA-West2 and NGA-East projects were updated in cases where a measured

(a) Distribution of
Shear-wave isosurface depth parameters (
Station parameters include station name, location, the network and station codes specified by the Standard for the Exchange of Earthquake Data (SEED) (IRIS, 2012), station type/housing (i.e. free-field vs in a structure), sensor depth, installation and removal dates, and information about the network that operates the station. Station locations were compared among the NGA flatfiles, IRIS, NCEDC, CESMD, and the Southern California Earthquake Data Center (SCEDC) housed at Caltech. For nearly all stations, locations agreed well and data from the catalog with the most significant digits was adopted. In some cases, the station parameters were different among the different sources of information, and were handled on a case-by-case basis using a semi-automated procedure to resolve the conflicts. Stations spaced within 1 km of each other (or ∼0.01° in coordinate precision) were flagged as initial candidate duplicate stations. These stations were then further screened based on similarity of station code and station name using
In addition to screening for duplicate stations, the aforementioned approach was adapted to assign networks (and their associated network codes) to the majority of NGA-West2 stations, because this information was not always accurately recorded. The motivation for this effort arose from the fact that it is easier to screen for duplicate stations when their unique network and station SEED code combinations, as defined by the International Federation of Digital Seismograph Networks (FDSN), are available. This process will need to be undertaken each time new data are added. We were able to assign networks to all but 94 NGA-West2 California stations by comparing NGA-West2 station metadata to all available California stations downloaded from IRIS, NCEDC, SCEDC, and CESMD. We suspect that most of these 94 stations are older, decommissioned stations that will not produce data in the future.
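The candidate-duplicate screening described above (a 1 km proximity check followed by a comparison of station codes and names) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the string-similarity measure (`difflib.SequenceMatcher` from the Python standard library), the 0.8 similarity threshold, and the station dictionaries are all hypothetical choices made for this example.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (ASSUMED measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_duplicates(stations, max_km=1.0, min_similarity=0.8):
    """Return (i, j, score) for station pairs that are close and similarly named."""
    pairs = []
    for (i, s), (j, t) in combinations(enumerate(stations), 2):
        # Step 1: proximity screen (1 km, as stated in the text).
        if haversine_km(s["lat"], s["lon"], t["lat"], t["lon"]) > max_km:
            continue
        # Step 2: similarity of station code or station name (threshold is hypothetical).
        score = max(similarity(s["code"], t["code"]), similarity(s["name"], t["name"]))
        if score >= min_similarity:
            pairs.append((i, j, round(score, 2)))
    return pairs


# Made-up stations: the first two are about 0.4 km apart and share a code.
stations = [
    {"code": "ABC", "name": "Alpha Creek", "lat": 34.0000, "lon": -118.0000},
    {"code": "ABC", "name": "ALPHA CREEK DAM", "lat": 34.0030, "lon": -118.0020},
    {"code": "XYZ", "name": "Zulu Ridge", "lat": 36.5000, "lon": -120.0000},
]
flagged = candidate_duplicates(stations)
```

Pairs returned by such a screen would still need case-by-case review, consistent with the semi-automated handling described in the text.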
Path parameters
Site-to-source distances are calculated between station locations and one or more finite-fault segments used to represent the rupture surface using P4CF (Chiou, 2021). Distance metrics include
Relational database
The objective of the work presented herein is to develop a web-served publicly accessible relational database that users can query. Motivation for a web-served relational database is driven by three factors:
Users often desire specific fields of data that can be accessed by targeted queries rather than having to download undesired data.
The relational structure improves data/metadata integrity by avoiding repetition of data entry that would be required if all metadata were stored in a single table (e.g. site parameters for different recordings made at the same site) (Krüger, 2004).
Indexes can make database searches faster than input / output operations on local files.
The relational database utilizes the MySQL InnoDB storage engine and can be queried using Structured Query Language (SQL). Fields have predefined data types that include numeric (e.g. “INT,” “FLOAT,” etc.), temporal (e.g. “DATE” and “DATETIME”), or string types (e.g. “VARCHAR,” “MEDIUMTEXT,” etc.), which supports data integrity by allowing only entries that conform to valid formats. This section describes the database structure, the reasoning for its organization, and other elements developed to help users access the data.
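The integrity argument for a relational structure can be illustrated with a small, self-contained sketch. It uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the table names, field names, and values are hypothetical and do not reproduce the GMDB schema. Note also that SQLite is more permissive about column types than MySQL, so the sketch illustrates the key relationships rather than type checking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical two-table layout: site parameters are stored once and
# referenced by every record made at that site.
conn.executescript(
    """
    CREATE TABLE site (
        site_id   INTEGER PRIMARY KEY,
        site_name TEXT NOT NULL,
        vs30      REAL
    );
    CREATE TABLE record (
        record_id  INTEGER PRIMARY KEY,
        site_id    INTEGER NOT NULL REFERENCES site(site_id),
        event_name TEXT NOT NULL,
        pga        REAL
    );
    CREATE INDEX idx_record_site ON record(site_id);
    """
)

conn.execute("INSERT INTO site (site_id, site_name, vs30) VALUES (1, 'Example site', 360.0)")
conn.executemany(
    "INSERT INTO record (site_id, event_name, pga) VALUES (?, ?, ?)",
    [(1, "Event A", 0.12), (1, "Event B", 0.05)],
)

# A site parameter is updated in one place and every joined record sees it.
conn.execute("UPDATE site SET vs30 = 375.0 WHERE site_id = 1")
rows = conn.execute(
    """
    SELECT r.event_name, r.pga, s.vs30
    FROM record AS r
    JOIN site AS s ON s.site_id = r.site_id
    ORDER BY r.record_id
    """
).fetchall()

# A record that points at a non-existent site is rejected by the database.
try:
    conn.execute("INSERT INTO record (site_id, event_name, pga) VALUES (99, 'Event C', 0.2)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

In a single flat table, the same site parameter would be repeated on every row for that site, and an update that missed one row would leave conflicting values of the kind described in the introduction.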
Database structure
The organizational structure of the database—the schema—defines the tables, fields, and relationships among the tables in the database. The current GMDB schema contains 32 tables that can be broadly grouped into five categories: event, site/station, ground motion, auxiliary, and junction. A list of table names in each group is provided in Tables 2 through 6. A diagram depicting the tables and their relationships is presented in Fig. 10. Linkages between tables are identified by shared fields called “keys.” A
Event tables
Site/Station tables
Ground motion tables
Auxiliary tables
Junction tables

Diagram of simplified GMDB schema depicting primary and foreign key relationships; symbol colors and shapes indicate the nature of the table contents; lighter shades (e.g.
The GMDB utilizes indexed integer primary and foreign keys that are automatically assigned each time a new entry is inserted into the database. Using single integer keys facilitates simpler SQL queries when tables are joined, but does not inherently enforce uniqueness of the combination of fields that define a unique entry (which is inherent to composite keys consisting of multiple columns). We separately require certain combinations of fields to be unique using the UNIQUE INDEX option in MySQL. For example, the combination of
Event tables
Event metadata are organized into six tables:
Rupture surface representations and associated data for each event are stored in the
The
The
Site / station tables
Station and site metadata are organized into seven tables:
The
The
Turkish Disaster and Emergency Management Authority (AFAD): AFAD—network-station code
CESMD: CESMD—network-station code
NGA-West 2 flatfile: NGAW2—ssn
NGA-East flatfile: NGAE—ssn
USGS
Shear-wave velocity profile database (VSPDB, Kwak et al., 2021): VSPDB—profile id
For sites where
“Geo”: geology- or hybrid geology-slope proxies (e.g. Wills et al., 2015).
“Kri”: Kriging interpolation if site is located near measured
“Terr”: terrain-based proxies (e.g. Yong, 2016).
“Pea17—
The expanded list of proxy models implemented in the NGA-Sub project (Ahdi et al., 2022) has not yet been implemented in the GMDB.
The
Ground motion tables
Ground-motion metadata are organized into two tables:
Ground motion data are organized into five tables:
Acceleration time-series data are stored separately in the
The
Auxiliary information tables
Auxiliary information describes anything that is not inherently related to ground motion data, and is organized into five tables:
The
The
Finally, the
Junction tables
Junction tables are used to efficiently store many-to-many relationships between two tables. There are seven junction tables in the GMDB:
The “earthquake ID” (
Sometimes information about the location of the earthquake or site is also important, such as knowing if they occur within a particular geographic area. This type of metadata is stored in the
The
Version control
The GMDB is constantly changing as admins enter new data and update existing fields, which poses significant challenges for reproducibility of research findings. Version control is needed so users are able to query specific data collections that were used in a study. Two general approaches for version control in relational databases are (1) schema-based, in which version information is stored within database tables and versioning is handled through SQL queries, and (2) copy-based, in which the database is copied and archived at a specific point in time and version control is achieved by querying the appropriate version. Schema-based methods are complicated to implement because they require additional tables and fields, and the SQL queries must be written specifically to retrieve data from the desired version. Copy-based methods are simple to implement, but involve significant duplication of information in the different database copies, which is inefficient with respect to memory. We adopt a blend of these approaches within the GMDB. We use a schema-based approach for the
The version control approach is illustrated in Fig. 10. The schema-based version control database contains the large tables (e.g.
Application programming interface
This section discusses how the ground motion data and associated metadata in the GMDB are made publicly accessible. One approach would be to open the database to user SQL queries, but this would require users to be proficient in SQL and to understand the database structure to write queries. Moreover, allowing users to submit custom SQL queries would pose a security risk. To overcome these challenges, we developed a representational state transfer (REST) API and online tool to assist users with accessing and interacting with the data. This API enables users to request data from the GMDB using relatively straightforward query string parameters appended to the end of a URL that serves as an endpoint. Using the API to retrieve data requires only a basic understanding of the database structure and no knowledge of SQL. Key features of the API are explained here, and complete documentation can be found in Buckreis et al. (2023a) (https://doi.org/10.34948/G4RP4K).
The API can be accessed through a web portal, where data can be viewed in HTML tables, or through HTTP requests, where data are returned in a JSON format. The API requires authentication and authorization to prevent unauthorized users and bots from accessing data. When using the web portal, authentication is provided through a login script, and authorization is handled by a dropdown menu defining the user’s role. When retrieving JSON data, users must first authenticate by submitting their username and password via a Basic Auth HTTP request. Successful authentication returns an authorization token that must then be included in the header of subsequent HTTP requests to retrieve data.
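The two-step pattern described above (Basic Auth first, then a token in the header of later requests) can be sketched with the Python standard library. The sketch only shows how the headers are built; the authentication URL, the name of the token field in the response, and the `Bearer` scheme used for the token header are assumptions for illustration and should be checked against the API documentation (Buckreis et al., 2023a).

```python
import base64
import json
import urllib.request


def basic_auth_header(username, password):
    """Build a standard HTTP Basic Auth header (RFC 7617)."""
    credentials = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(credentials).decode("ascii")}


def token_header(token, scheme="Bearer"):
    """Header for subsequent requests. ASSUMPTION: the scheme name is hypothetical."""
    return {"Authorization": f"{scheme} {token}"}


def fetch_json(url, headers):
    """Send a GET request with the given headers and parse the JSON response."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder credentials and token, shown only to illustrate the header format.
auth_headers = basic_auth_header("user", "secret")
data_headers = token_header("example-token")

# Usage sketch (not executed here; the authentication URL and the "token" key
# are hypothetical):
#   reply = fetch_json(AUTH_URL, basic_auth_header(username, password))
#   events = fetch_json("https://gmdatabase.org/events", token_header(reply["token"]))
```

Keeping header construction separate from the network call makes it easy to adapt the sketch once the actual authentication endpoint and token format are confirmed.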
To request data from a certain table, the user appends an endpoint to the end of the base URL: https://gmdatabase.org/. For example, if a user is interested in earthquake-source metadata, they can query the

Screenshot of information returned by API (https://gmdatabase.org/events).
In addition to providing access to individual tables, the GMDB has a flatfile endpoint in which data from the following tables are joined together into a flattened representation:
Users can view time-series data through the web GUI, though results are truncated to the first 100 characters of the JSON strings to facilitate responsive web viewing (https://gmdatabase.org/timeSeriesData). However, users can access the full JSON strings by submitting an HTTP request through the API.
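The endpoint and query-string pattern described in this section can be sketched as a small URL builder. The base URL and the `events` and `timeSeriesData` endpoints are taken from the text; the query parameter names (`min_magnitude`, `limit`) are hypothetical placeholders, since the accepted parameters are defined in the API documentation rather than here. The `preview` helper mimics the 100-character truncation applied by the web portal.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://gmdatabase.org/"


def endpoint_url(endpoint, **params):
    """Append an endpoint and optional query-string parameters to the base URL."""
    url = urljoin(BASE_URL, endpoint)
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"{url}?{query}" if query else url


def preview(json_string, max_chars=100):
    """Truncate a JSON string for display, as the web portal does for time-series."""
    return json_string[:max_chars]


events_url = endpoint_url("events")
time_series_url = endpoint_url("timeSeriesData")

# HYPOTHETICAL parameter names, used only to show how a query string is formed.
filtered_url = endpoint_url("events", min_magnitude=6.0, limit=10)
```

The resulting URLs would be sent as authenticated HTTP requests; a full time-series JSON string returned that way is not truncated, unlike the portal view.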
Summary and recommendations
The GMDB departs from past practices for disseminating ground motion data by making data available through a web-served database via an API. The GMDB currently stores the unified ground motion data from active tectonic and stable continental regions assembled by NGA-West2, NGA-East, Kishida et al. (2014), Ahdi et al. (2020), Wang (2020), Margaris et al. (2021), Buckreis et al. (2022, 2024b), Ramos-Sepúlveda et al. (2024), and Mohammed et al. (2024, 2025). The California portion of the database contains 49,797 multi-component records from 525 events, the CENA subset is made up of 16,405 multi-component records from 183 events, and the Mediterranean subset is made up of 4403 multi-component records from 559 events.
As of this writing (June 2025), multiple efforts are ongoing to extend the GMDB, including global data from shallow crustal earthquakes in active tectonic regions as part of the NGA-West3 database (Buckreis et al., 2024c). The schema was developed with these expansions in mind; however, organizational additions and/or alterations are anticipated as the database grows. The source and site metadata compiled as part of ongoing projects, such as NGA-West3, are publicly available in the GMDB since these types of information are generally compiled from public sources. However, the ground motions and associated intensity measures are restricted pending formal release by the project.
The GMDB API is a complex product that is still routinely undergoing updates and advancements. Requests for individual tables and batches of time-series data are operational, and ongoing developments aim to provide customizable datasets containing ground metadata and intensity measures in an end-to-end computing workflow for the next generation of ground motion analytics. We encourage researchers and engineers who utilize the GMDB to reference the specific GMDB version used and to cite all original sources of data and metadata.
