Sage Journals: Discover world-class research

Abstract

Accurate forecasts of ambulance demand are crucial inputs when planning and deploying staff and fleet. Such demand forecasts are required at national, regional, and sub-regional levels and must take account of the nature of incidents and their priorities. These forecasts are often generated independently by different teams within the organization. As a result, forecasts at different levels may be inconsistent, resulting in conflicting decisions and a lack of coherent coordination in the service. To address this issue, we exploit the hierarchical and grouped structure of the demand time series and apply forecast reconciliation methods to generate both point and probabilistic forecasts that are coherent and use all the available data at all levels of disaggregation. The methods are applied to daily incident data from an ambulance service in Great Britain, from October 2015 to July 2019, disaggregated by nature of incident, priority, managing health board, and control area. We use an ensemble of forecasting models and show that the resulting forecasts are better than any individual forecasting model. We validate the forecasting approach using time series cross-validation.

Graphical Abstract

Keywords

healthcare emergency services forecast reconciliation ambulance demand regression

Introduction

A failure to match available resources to demand in Emergency Medical Services (EMS) results in patient flow problems, with serious consequences for patients, staff, and the entire care system (Ekström et al. 2015; Rostami-Tabar and Ziel 2022). Demand forecasting in EMS helps service planners to avoid the mismatch, potentially providing massive savings in costs and lives, and leading to better patient outcomes. Accurate daily demand forecasting enables planners and decision-makers to manage resources to meet anticipated patients, reconfigure units, and redeploy staff and vehicles as necessary.

Demand forecasts at EMS are typically required at multiple levels of an organization to inform various planning and decision-making processes (Hulshof et al. 2012). There are some planning processes at the national level (strategic and long-term) such as workforce resource planning and budgeting; sub-national, regional, or healthcare level (tactical and medium-term) such as temporary capacity expansions, resource sharing; and hospital or station level (operational and short-term) such as planning rosters for staff and ambulance deployment. Demand forecasts might also be required at different levels for a specific area of interest, such as the nature of demand or the priority level. Moreover, the time series data in EMS has an inherent hierarchical and grouped structure to support such forecasting requirements. Demand for emergency medical services at the national level can be disaggregated in a geographical hierarchy into sub-national, regions, health boards, and stations/hospitals, or divided into groups such as the nature of incidents or demand priority. Forecasts produced at both higher and lower levels of hierarchies are necessary for effective decision-making in EMS. For example, control area EMS forecasts can inform strategic decisions about how to allocate limited resources to lower levels, such as health boards and stations/hospitals. At the lower levels, hospitals or ambulance stations could use such forecasts to plan for staffing and resource allocation, ambulance dispatching, staff-to-shift assignment, and staff rescheduling based on the anticipated volume, and priority and nature of incidents. Additionally, generating forecasts at lower levels could potentially improve the accuracy of the high-level forecasts by providing more detailed information on the nature and priority of incidents. This could help to identify patterns in demand that may not be apparent at the higher level. Therefore, employing forecasting techniques that consider the hierarchical and/or grouped patterns of time series in EMS aligns naturally, offering the possibility to enhance forecast accuracy and facilitate coordination.

To illustrate the problem, let’s consider a very simple example where we have an EMS provider with a national level of governance and two regions (A and B), each with a health board and a station. There is a total national budget to be split between the regions in proportion to the forecast number of incidents in each region. The two regions have very different incident patterns, and so they must be forecast using different models. However, the data are noisy at the regional level, so the national forecasts are best obtained by summing the demand from the two regions. The resulting national forecasts are not equal to the sum of the regional forecasts, and so they are not coherent. In fact, the national forecasts show a decreasing trend in demand, and so the national governing body decides to cut the budget for the next year. But neither of the regional forecasts shows a trend, and so the regions argue that the budget cut is unfair. In addition, Region A has much more variable demand than Region B, and so to cope with periods of peak demand, Region A needs to hold more resources in reserve. Therefore, the budget distribution needs to be made in a way that ensures the probability of each region being unable to meet demand is equal. Our solution to this problem is to use a hierarchical forecasting approach that ensures the forecasts are probabilistically coherent. Then, any trends or other forecast characteristics at the national level will also be reflected in the regional forecasts, and the probabilistic forecasts allow for the different levels of uncertainty in the two regions. Budgets can be allocated by controlling the probability of demand exceeding available resources, rather than being simply in proportion to the expected demand.

Despite a large number of studies dedicated to forecasting for EMS (Gul and Celik 2020; Ibrahim et al. 2016; Shi, Rostami-Tabar, and Gartner 2022; Wargon et al. 2009), the hierarchical data structure has been largely ignored, and the main focus has been on producing independent (base) forecasts at a single level. Generating independent forecasts can result in a lack of consistency and coordination, which leads to less effective planning and decision-making.

With hierarchical forecasting, plans at any level are based on coherent forecasts and, therefore, can be aligned. Implementing and sustaining improvements in EMS require alignments and coordination between different stakeholders, without which teams operate in isolation, leading to conflicts, duplication work, rework, or work that runs counter to the overall goal to improve the quality of delivery service. A hierarchical forecasting framework can be used as a tool to improve coordination between teams across care services at the national, sub-national, regional, and local levels. The hierarchical forecasting approaches not only create consistent forecasts but are usually also more accurate than the independent (base) forecasts (Hyndman et al. 2011). To our knowledge, there has been no previous research involving hierarchical and grouped forecasting in the entire field of forecasting for healthcare management.

In this paper, we address this gap by investigating the application of hierarchical forecasting approaches in the EMS using a daily time series of attended incidents from 2015 to 2019 in a major ambulance service in Great Britain. The data has hierarchical and grouped structures, with hierarchies at the national, control (i.e., sub-national), and health board (i.e., regional) levels, as well as groups by priority and nature of incidents. We produce consistent point forecasts and forecast distributions for all levels over an 84 day horizon, which is critical for effective planning and associated risk management. We compare the point and probabilistic forecast accuracy of the independent forecasts, bottom-up, and optimal reconciliation approaches. We first generate independent/base forecasts using Exponential Smoothing State Space (ETS), Poisson regression using Generalized Linear Model (GLM), and tscount (TSGLM), a simple empirical distribution and an ensemble method, followed by applying bottom-up and optimal reconciliation approaches. Forecast performance is assessed by the Mean Absolute Scaled Error (MASE) and Mean Squared Scaled Error (MSSE) for point forecasts and Continuous Ranked Probability Scores (CRPS) for probabilistic forecasts. This paper complies with reproducibility principles (Boylan et al. 2015; Stodden and Miguez 2013). We provide the R codes for the proposed models and benchmarks. Therefore, they can be applied to any healthcare service (e.g., emergency department, primary or social care), subject to the time series having a hierarchical and/or grouped structure. While our research focuses on emergency medical services, it is important to emphasize the suggested framework’s adaptability, which expands its relevance to a variety of service sectors such as supply chains, tourism, finance, and call centers. Our approach can be generalized in cases with hierarchically structured and/or grouped time series data, which is common in many service sectors.

The remainder of this article is structured as follows: In the Research Background section, we provide a brief review of the literature and discuss its limitation to position our work; in the Experiment Setup section, we present the experimental design describing the data set, forecasting methods and forecast evaluation metrics. In the Results and Discussion section, we discuss the hierarchical time series forecasting approaches to generate both point and probabilistic forecasts. In the Conclusion section, we present and discuss our results; in Section 6, we summarize our findings and present ideas for future research.

Research Background

Emergency medical services (EMS) are a critical component in the delivery of urgent medical care to communities. Effective service delivery requires accurate resource planning that generally relies on demand forecasts at operational, tactical, and strategic levels. There are a substantial number of studies on the application of time series forecasting in Emergency Medical Services. For example, Ibrahim et al. (2016) provide an extensive review of the models used in forecasting call volume arrivals. Another important area is related to forecasting ambulance demand. Although the definition of demand might not always be clearly stated, this typically refers to a situation where a physical resource has been deployed to respond to an incident. This might be also called attended incidents. Another demand-related variable is verified incidents; these are all incidents that require action: either by sending a physical vehicle, responding via the Clinical Support Desk, requesting an external provider to respond to it, or forwarding it to other channels such as police, firefighters, or general practitioners. Our study is aligned with this stream of literature. Another similar area that has received considerable attention is Emergency Department forecasting (Rostami-Tabar, Browell, and Svetunkov 2024); we refer interested readers to Shi, Rostami-Tabar, and Gartner (2022), Gul and Celik (2020), and Wargon et al. (2009) for extensive reviews of the relevant literature. In this section, we provide a brief review of studies on forecasting ambulance demand in EMS.

There are generally two main streams of research related to forecasting ambulance demand in EMS: (i) the first stream focuses on the application of time series methods and regression approaches to forecasting aggregate ambulance demand (Sasaki et al. 2010; Vile et al. 2012); and (ii) the second stream considers forecasting EMS demand in finer temporal and geographical granularities by employing temporal-spatial prediction methods (Zhou 2016; Zhou and Matteson 2016). The focus of our study is related to the first stream of research.

Sasaki et al. (2010) develop a multivariable regression model to estimate future EMS demands. In addition to the historical demand, the population census for different age groups and counts of the number of companies employing more than five people are included in the regression. The census variables describe groups who are more likely to need an ambulance. A stepwise ordinary least squares regression analysis is used for estimating the parameter and generating forecasts. The only performance measure reported in this study is R², which is not an effective measure of forecast accuracy (Armstrong 2001, 457). The research design of this study is not rigorous, and the study is not reproducible. Vile et al. (2012) explore using a Singular Spectrum Analysis (SSA) method to generate forecasts of the EMS demand at the national level for 7-day, 14-day, 21-day, and 28-day forecast horizons using data provided by an ambulance service in Great Britain. The performance of this approach is compared to Auto-Regressive Integrated Moving Average (ARIMA) and Holt-Winters time series methods using Root Mean Squared Error (RMSE). They concluded that point forecasts generated by SSA are more accurate for longer-term horizons, but that ARIMA and Holt-Winters performance is superior for shorter-term horizons. Vile et al. (2016) further develop a decision support system to integrate forecasts generated by SSA. However, the study does not compare and contrast the performance of forecasting methods based on utility measures such as cost, resource utilization and response time. The tool contains options that allow generating forecasts at various levels of granularity; however, it ignores the structure of the hierarchical and grouped relationships, preventing aligned decision-making and coordination. Al-Azzani, Davari, and England (2021) utilizes data from the Welsh Ambulance Service to explore the forecast accuracy of four forecasting approaches: ARIMA, Holt Winters, Multiple Regression, and Singular Spectrum Analysis (SSA) in predicting call volume demand. The aim is to compare these approaches with the current method across various planning horizons (7, 30, and 90 days) for both total call volume and category-specific demand. Forecast accuracy performance is evaluated using root mean square error (RMSE) and mean absolute percentage error (MAPE). The findings indicate that ARIMA performs the best in predicting weekly and monthly demand. However, when it comes to long-term demand, the SSA method proves to be the most effective. Ibrahim et al. (2016) conducted a case study to assess the effectiveness of multiple forecasting methods: the multiplicative univariate forecasting model (MU), univariate mixed-effects model (ME), and two variations of bivariate mixed-effects models (BME). Call center data were utilized to forecast for periods of 1, 7, and 14 days ahead, using only a limited dataset of 42 days. The performance of these forecasting methods was evaluated using two metrics: RMSE for point forecasts and coverage probability for the 95% prediction interval. The findings indicate that the ME consistently produces the most accurate point forecasts. On the other hand, BME models demonstrate superior coverage probabilities when forecasting for 1 day or 1 week ahead. For a 2-week leading period, MU shows better coverage probability.

Hermansen and Mengshoel (2021) investigate forecasting EMS demand at a high spatio-temporal resolution of 1 km² spatial regions and 1-hr time intervals using total incidents in Oslo, Norway, from 1 January 2015 to 11 February 2019. They used multi-layer perceptron (MLP) and long short-term memory (LSTM) models to forecast the EMS demand and compare the results to simple aggregation methods and baselines. The point forecast accuracy is evaluated using Mean Absolute Error (MAE) and Mean Squared Error (MSE), and the forecast distribution is measured by Categorical Cross-Entropy. They found that Neural Network models performed better in producing point forecasts, while a distribution baseline method based on the spatial distribution of the incidents across all time steps provided more accurate forecast distributions. Zhou (2016) proposed three methods based on Gaussian mixture models, kernel density estimation, and kernel warping to predict hourly data 4 weeks ahead for a 1 km² spatial region. Two years of incidents from Toronto, Canada (years 2007 and 2008 with 391,296 events) and Melbourne, Australia, were used to build the model and examine the performance on test data using mean negative log-likelihood. They show that forecasts generated by the proposed methods were significantly more accurate than the current industry practice (a simple averaging formula). Grekousis and Liu (2019) investigated the combination of spatial analysis methods with data mining techniques based on an improved Hungarian algorithm and a MLP neural network to identify the most likely locations of future emergency events. The proposed approach was tested using data from 2851 events attended by the EMS in Athens, Greece, over 24 weeks. They showed that 23% of real emergency events lie within 50 m of the predicted ones, and nearly 70% of the real emergency events lie no further than 150 m away, which is rather accurate given the granularity of the problem at the city level.

Table 1 provides a summary of some studies in the literature on forecasting in Emergency Medical Services. We note a number of limitations in the literature on EMS forecasting, that encourage us to undertake this research. These limitations are summarized as follows:

1. Current studies ignore the inherent hierarchical and/or grouped structure of the time series data, and the relationship between series at different levels of hierarchy. While the hierarchical forecasting methodology has been developed and applied in various domains over the past 10 years (Panagiotelis et al. 2023), it has never been explored in this area.

2. Current research is mainly concerned with generating point forecasts at a single level of hierarchy. There is a lack of studies considering the entire forecast distribution of daily ambulance demand for the whole hierarchy to better represent the uncertainty of future demand, providing a risk management tool for planners.

3. Reproducibility is still a major challenge in EMS forecasting, as it is unlikely that any reader can reproduce prior studies without the help of the authors of those papers.

4. Another limitation is related to the generated forecasts not being in the sample space of non-negative counts. Since actual ambulance counts cannot be negative or non-integer, ambulance demand forecast distributions should reflect the data. Of course, point forecasts represent means, so they should be non-negative but may be non-integer. While this might not be an issue when producing forecasts at a single level, producing non-negative count forecasts in a hierarchical or grouped structure is challenging and requires further investigation in the future.

Table 1.

Summary of Some Studies on Forecasting in Emergency Medical Services.

Reference	Year	Variable	Horizon	Method	Metric	Probabilistic	Reconciliation	Reproducibility
Current study	2023	Ambulance demand	84 days	Stationary, Exponential Smoothing State Space (ETS), Poisson regression using Generalized Linear Model (GLM) and tscount (TSGLM), a simple empirical distribution and an ensemble method	MASE, MSSE, CRPS	YES	YES	YES
Al-Azzani et al	2021	Call volume	7, 30, 90 days	ARIMA, Holt Winters, Multiple Regression, and Singular Spectrum Analysis	RMSE, MAPE	NO	NO	NO
Haugsbø et al.	2021	Ambulance demand in spatio-temporal	1 hour	MLP, LSTM	MAE, MSE, Cross-Entropy	YES	NO	NO
Grekousis et al.	2019	Locations of incidents	1 hour	MLP and Hungarian algorithm	RMSE	NO	NO	NO
Ibrahim et al.	2016	Call volume	1, 7, 14 days	multiplicative univariate forecasting, univariate mixed-effects, bivariate mixed-effects model, and bivariate mixed-effects	RMSE, prediction interval coverage	Partial	NO	NO
Vile et al.	2012	Ambulance demand	7, 14, 21, 28 days	Singular Spectrum	RMSE	NO	NO	NO
Vile et al.	2012	Ambulance demand	7, 14, 21, 28 days	Analysis, ARIMA, Holt-Winters	RMSE	NO	NO	NO
Sasaki et al.	2010	Ambulance demand	5 years	OLS regression	R^2	NO	NO	NO

This paper concerns the problem of hierarchical forecasting in EMS and generates and evaluates both point and probabilistic forecasts across different levels of the hierarchy, hence addressing some important gaps identified in the literature.

Experiment Setup

Planners in the ambulance service work with a planning horizon of 6 weeks. That is, planning is generally frozen for the next 42 days, so any forecasts will only affect plans for the time period beyond the next 42 days. Consequently, the forecast horizon in this study is 2 × 42 = 84 days ahead, with performance evaluation assessed based on the last 42 days and not the whole forecast period. The forecasts are produced for various training and test sets using time series cross-validation (Hyndman and Athanasopoulos 2021). In the following section, we discuss the dataset, describe the forecasting methods used to generate base forecasts, and present the point and probabilistic accuracy measures.

Data

The dataset used in this study is from a major ambulance service in Great Britain. It contains information relating to the daily number of attended incidents from 1 October 2015 to 31 July 2019, disaggregated by nature of incidents, priority, the health board managing the service, and the control area (or region). Figure 1 depicts both the hierarchical and grouped structure of the data. Figure 1(a) illustrates the nested hierarchical structure based on the control area and health board, and Figure 1(b) shows the grouped structure by priority and the nature of the incident.^1,2

Figure 1.

The hierarchical and grouped structure of attended incidents (ambulance demand).

Table 2 also displays the structure of the data, with the total number of series at each level. At the top level, we have the total attended incidents for the country. We can split these total attended incidents by control area, by health board, by priority or by nature of the incident. There are 3 control areas, which are broken down by 7 local health boards. Attended incident data is categorized into three priority classes of red, amber, and green. There are also 35 different nature of incidents, such as chest pain, strokes, breathing problems, etc. In total, across all levels of disaggregation, there are 1530 time series.

Table 2.

Number of Time Series in Each Level for the Hierarchical and Grouped Structure of Attended Incidents.

Level	Number of Series
All country	1
Control	3
Health board	7
Priority	3
Priority * Control	9
Priority * Health board	21
Nature of incident	35
Nature of incident * Control	105
Nature of incident * Health board	245
Priority * Nature of incident	104
Control * Priority * Nature of incident	306
Control * Health board * Priority * Nature of incident (Bottom level)	691
Total	1530

Note. Due to certain combinations of the nature of incident with other variables, there is a lack of representation in the dataset. As a result, for example, instead of the calculation 3 * 35 = 105, it would be modified to 3 * 35-1 = 104.

Given the total number of time series, direct visual analysis is infeasible. Therefore, we first compute features of all 1530 time series (Kang, Hyndman, and Smith-Miles 2017) and display the strength of trend and weekly seasonality strength in Figure 2. Each point represents one time series with the strength of trend on the x-axis and the strength of seasonality on y-axis. Both measures are on a scale of [0, 1].

Figure 2.

The strength of the trend and weekly seasonality in the time series of attended incidents.³

In this paper, the strength of trend and seasonality were calculated using the “STL” (Seasonal and Trend Decomposition using Loess) decomposition method, as described by Bandara, Hyndman, and Bergmeir (2024). STL is a widely used and flexible method for decomposing time series data into trend, seasonal, and remainder components. The decomposition of a time series $y_{t}$ is written as $y_{t} = T_{t} + S_{t} + R_{t}$ , where $T_{t}$ is the smoothed trend component, $S_{t}$ is the seasonal component and $R_{t}$ is a remainder component. The strength of trend is defined as $F_{T} = \max (0, 1 - \frac{Var (R_{t})}{Var (T_{t} + R_{t})})$

For strongly trended data, the seasonally adjusted data should have much more variation than the remainder component. Therefore Var( $R_{t}$ )/Var( $T_{t} + R_{t}$ ) should be relatively small. But for data with little or no trend, the two variances should be approximately the same.

The strength of seasonality is defined similarly: $F_{S} = \max (0, 1 - \frac{Var (R_{t})}{Var (S_{t} + R_{t})}) .$ series with seasonal strength $F_{S}$ , close to 0 exhibits almost no seasonality, while a series with strong seasonality will have $F_{S}$ close to 1 because Var( $R_{t}$ ) will be much smaller than Var( $S_{t} + R_{t}$ ). It is clear that there are some series showing strong trends and/or seasonality, corresponding to series at the higher levels of the hierarchy. The majority of series show low trends and seasonality. These are time series belonging to the bottom series, series related to the nature of incidents for a given control, health board, and priority level. The bottom series are dominated by noise with little or no systematic patterns.

In addition to displaying the trend and seasonality strength (Hyndman and Athanasopoulos 2021), we also visualize a few time series at various levels of aggregation. Figure 3 reveals different information, such as trend, seasonality, and noise. For example, some series depict seasonality and trends, whereas others report a low volume of attended incidents and entropy, making them more volatile and difficult to forecast. At the level of the nature of incidents combined with categories of other levels, there are many series that contain zeros with low counts. As such, the data set represents a diverse set of daily time series patterns.

Figure 3.

Daily time plot of attended incidents at various levels.⁴

We consider several forecasting models that account for the diverse patterns of the time series across the entire hierarchy. In developing the forecasting models, the time series of holidays is also used in addition to the attended incidents. We use public holidays, school holidays, and Christmas Day and New Year’s Day as predictors of attended incidents. These types of holidays will affect people's activities and may increase or decrease the number of attended incidents.

Forecasting Methods

Given the presence of various patterns in the past attended incidents, we consider three different forecasting models to generate the base forecasts. Once the base forecasts are produced, hierarchical and grouped time series methods are used to reconcile them across all levels. We briefly discuss forecasting models in the following sections, and the hierarchical forecasting methods are discussed in Section 4.

Stationary

We start with a simple forecasting approach, assuming that future days will be similar to past days. We use the empirical distribution of the past daily attended incidents to create the forecast distribution of future attended incidents. We have chosen this “stationary” method as a benchmark due to its widespread usage and simplicity, making it easily understandable for users. Forecasts serve as inputs for various decision-making systems that frequently employ simulations, wherein it is common to utilize the empirical distribution of demand as a forecast. Additionally, the stationary method has shown surprisingly high accuracy. Hence, any forecasting approach that can offer superior results compared to the stationary method would validate its practical use; otherwise, there is no necessity for employing more complex methods.

Exponential Smoothing State Space model (ETS)

ETS models (Hyndman and Athanasopoulos 2021) can combine trend, seasonality, and error components in a time series through various forms that can be additive, multiplicative, or mixed. The trend component can be none (“N”), Additive (“A”), or damped (“Ad”); the seasonality can be none (“N”), Additive (“A”), or multiplicative (“M”); and the error term can be additive (“A”) or multiplicative (“M”). To forecast the attended incidents at each level, we use the ets() function in the forecast package (Hyndman et al. 2022; Hyndman and Khandakar 2008) in R. To identify the best model for a given time series, the ets function uses the corrected Akaike’s Information Criterion (AICc). In our study, we use an automated algorithm to determine the suitable configuration for the trend, seasonality, and error terms in each time series. Specifically, we utilize the ets() function in the forecast package of R, which employs Akaike’s Information Criterion (AIC) to identify the optimal model for each time series. Given the large number of time series we work with (1530), it is impractical to manually select the appropriate form for each component in every time series. Consequently, the automated algorithm selects the best model based on the unique characteristics of each individual time series. As a result, a combination of additive or multiplicative forms for the components is employed, depending on the specific attributes of each time series. Despite the popularity and the relevance of automatic ETS in this study, it may produce forecast distributions that are non-integer and include negative values, although the number of attended incidents is always integer and non-negative. When using ETS, a time series transformation approach could be used to generate strictly positive forecasts, although forecast distributions will still be non-integer. An alternative is to use forecasting models that produce integer, non-negative forecasts. In the following section, we present Generalized Linear Models (GLMs) and poisson time series regression to produce count-base forecasts.

Generalized Linear Model (GLM)

GLMs are a family of models developed to extend the concept of linear regression models to non-Gaussian distributions (Faraway 2016). They model the response variable as a particular member of the exponential family, with the mean being a transformation of a linear function of the predictors. One of the models that is frequently used in practice to generate count forecasts is Poisson regression. Suppose the time series is denoted by y₁, . . ., y_T, then the Poisson GLM can be written as $y_{t} \sim P o i s s o n (λ_{t})$ $where \log (λ_{t}) = x_{t}' β$ and x_t is a vector of covariates, β is a vector of coefficients, and λ_t is the mean of the Poisson distribution. In our model, these include cubic splines for the time trend, day-of-week dummy variables (from Monday to Sunday), Fourier terms to capture the yearly seasonality, dummy variables indicating public holidays (1 when is a public holiday, 0 otherwise), school holidays (1 when is a school holiday, 0 otherwise), and Christmas Day (1 when is Christmas Day, 0 otherwise), and New Year’s Day (1 when is New Year’s Day, 0 otherwise). The Fourier terms are as defined in Hyndman and Athanasopoulos (2021, Section 7.4). This model takes account of weekly seasonality and annual seasonality. Monthly seasonality in time series data is extremely rare, and it does not exist in the ambulance demand used in this study. There is no reason for occurrences to occur more frequently at certain times of the month than others.

We fit a Poisson regression model using the function glm() from the stats package in R, with the argument family = Poisson to specify that we wish to fit a Poisson regression model with a log link function.

Poisson Regression Using tscount (TSGLM)

We also consider another Poisson regression model that takes into account serial dependence. This model captures the short-range serial dependence by including auto-regressive terms in addition to the same covariates that were used in the GLM model. To distinguish this from the previous GLM model, we will refer to this model as TSGLM. The Poisson TSGLM is similar to the GLM, with an additional auto-regressive component accounting for serial dependence. The term serial dependence refers to instances in which the number of incidents on a current day correlates with the number of incidents on previous days. $y_{t} \sim Poisson (λ_{t}) w h e r e \log (λ_{t}) = (y_{t - k}^{'}, x_{t}^{'}) β,$ and y_t−k is a vector of k lagged values. The TSGLM model explicitly accounts for serial dependence by including lagged values (i.e., past values) of the ambulance demand in the model. This is important in EMS forecasting because it allows the model to capture patterns in the data that are dependent on the past values of the time series, which might not be captured via the predictor variables.

We use the tsglm() function in the tscount package in R (Liboschik, Fokianos and, and Fried 2017) to model the attended incidents. Again, the logarithmic link function is used to ensure that the mean of the Poisson distribution is always positive.

Provided accidents occur independently, they will inherently follow a Poisson distribution (Feller 1991, pp. 156–158). Hence, it is reasonable to assume a Poisson distribution in this context. To account for changes over time, we incorporate trend and seasonality covariates, as well as public holiday effects, allowing the mean of the Poisson distribution to vary. However, it is important to note that if there are additional factors influencing the mean of the Poisson distribution that are not accounted for in our model, we might observe over- or under-dispersion in the data.

Ensemble Method

Finally, one effective strategy for improving forecast accuracy includes the simultaneous application of multiple forecasting methods on a given time series, followed by combining the forecasts rather than relying on separate forecasts generated by each individual method (Clemen 1989). In this paper, we use an ensemble method that combines the forecasts generated from the Stationary, ETS, GLM, and TSGLM models using a simple average to form a mixture distribution (Wang et al. in press).

To generate forecast probability distributions using the above methods, we use a form of bootstrapping described in Panagiotelis et al. (2023). This involves simulating 1000 future sample paths from each of the models by bootstrapping the model residuals, taking into account the cross-sectional correlations between the different aggregated and disaggregated series. In this way, we can generate an empirical distribution of forecasts for each model. The ensemble forecast distribution is a simple mixture of these empirical distributions.

It is important to emphasize that the aim of this study is not to provide an exhaustive compilation of forecasting models or to promote a particular model class. Instead, we have developed a flexible framework that can accommodate any forecasting model. Our primary objective is to demonstrate its practicality and effectiveness in integrating base forecasts from any model and generating coherent forecasts within a hierarchical structure.

Performance Evaluation

To evaluate the performance of the various forecasting approaches, we split the data into a series of ten training and test sets. We use a time series cross-validation approach (Hyndman and Athanasopoulos 2021), with a forecast horizon of 84 days and each training set expanding in 42-day steps. The first training set uses all data up to 2018-04-25, and the first test set uses the 84 days beginning 2018-04-26. The second training set uses all data up to 2018-06-06, with the second test set using the following 84 days. The largest training set ends on 2019-05-09, with the test set ending on 2019-07-31. Model development and hyper-parameter tuning is performed using the training data, and the errors are assessed using the corresponding test set. While we compute forecast errors for the entire 12 weeks, we are most interested in the last 42 days of each test set because that corresponds to how forecasts are generated for planning in practice. Forecasting performance is evaluated using both point and probabilistic error measures.

The error metrics provided below consider a forecasting horizon denoted by j, representing the number of time periods ahead we are predicting. In our study, this forecasting horizon ranges from 1 to 84 days, j = 1, 2, . . ., 84.

Point forecast accuracy is measured via the Mean Squared Scaled Error (MSSE) and the Mean Absolute Scaled Error (MASE). The Mean Absolute Scaled Error (MASE) (Hyndman and Athanasopoulos 2021; Hyndman and Koehler 2006) is calculated as $MASE = mean (| q_{j} |),$ (1)where $q_{j} = \frac{e_{j}}{\frac{1}{T - m} \sum_{t = m + 1}^{T} | y_{t} - y_{t - m} |},$ and e_j is the point forecast error for forecast horizon j, m = 7 (as we have daily seasonal series), y_t is the observation for period t, and T is the sample size (the number of observations used for training the forecasting model). The denominator is the mean absolute error of the seasonal naive method in the fitting sample of T observations and is used to scale the error. Smaller MASE values suggest more accurate forecasts. Note that the measure is scale-independent, thus allowing us to average the results across series. A related measure is MSSE (Hyndman and Athanasopoulos 2021; Makridakis, Spiliotis, and Assimakopoulos 2022), which uses squared errors rather than absolute errors: $MSSE = mean (q_{j}^{2}),$ (2)where $q_{j}^{2} = \frac{e_{j}^{2}}{\frac{1}{T - m} \sum_{t = m + 1}^{T} {(y_{t} - y_{t - m})}^{2}},$

Again, this is scale-independent, and smaller MSSE values suggest more accurate forecasts.

Using scale-independent measures, such as MASE and MSSE, enables more appropriate comparisons between time series at different levels and scales, as these measures are not influenced by the magnitude of the data. This is of particular importance in our study, as we work with time series at various levels of hierarchy with varying scales, resulting in different magnitudes of error. By employing scale-independent measures, we can meaningfully assess the forecast accuracy across the entire hierarchy, ensuring a more robust comparison.

To measure the forecast distribution accuracy, we calculate the Continuous Rank Probability Score (Gneiting and Katzfuss 2014; Hyndman and Athanasopoulos 2021). It rewards sharpness and penalizes miscalibration, so it measures the overall performance of the forecast distribution. $CRPS = mean (p_{j}),$ (3)where $p_{j} = \int_{- \infty}^{\infty} {(G_{j} (x) - F_{j} (x))}^{2} dx,$ where G_j(x) is the forecasted probability distribution function for forecast horizon j, and F_j(x) is the true probability distribution function for the same period.

Calibration refers to the statistical consistency between the distributional forecasts and the observations. It measures how well the predicted probabilities match the observations. On the other hand, sharpness refers to the concentration of the forecast distributions—a sharp forecast distribution results in narrow prediction intervals, indicating high confidence in the forecast. A model is well-calibrated if the predicted probabilities match the distribution of the observations, and it is sharp if it is confident in its predictions. The CRPS rewards sharpness and calibration by assigning lower scores to forecasts with sharper distributions and to forecasts that are well calibrated. Thus, it is a metric that combines both sharpness and miscalibration into a single score, making it a useful tool for evaluating the performance of probabilistic forecasts.

CRPS can be considered an average of all possible Winkler scores (Winkler 1972; Hyndman and Athanasopoulos 2021 Section 5.9) or percentile scores (Hyndman and Athanasopoulos 2021, Section 5.9), and thus provides an evaluation of all possible prediction intervals or quantiles. A specific prediction interval could be evaluated using a Winkler score. Certain situations may also require assessing accuracy for a particular quantile, such as lower (e.g., 5%) or higher (e.g., 95%) quantiles. In such cases, a percentile score becomes useful in meeting this specific requirement.

Hierarchical and Grouped Time Series Forecasting Techniques

There are many applications in healthcare, and in particular in EMS, where a collection of time series is available. These series are generally hierarchically organized based on multiple levels, such as area/region, health board, and/or are aggregated at different levels in groups based on nature of demand, priority of demand, or some other attribute. While series could be strictly hierarchical or only grouped based on some attributes, in many situations, more complex structures arise when attributes of interest are both nested and crossed, having a hierarchical and grouped structure. This is also the case for our application, as discussed in Section 3.1.

Independent (Base Forecast)

A common practice in healthcare (and EMS) to predict hierarchical and grouped series relies on producing independent forecasts, also referred to as base forecasts, typically by different teams as the need for such forecasts arises. We observe n time series at time t, across the entire hierarchical and grouped structure, written as y _t. The base forecasts of y _T+h given data y ₁ , . . ., y _T are denoted by ${\hat{y}}_{t}$ for h steps-ahead for all n series (n = 1530 in this study). Forecasts generated in this way are not coherent.

Reconciliation Methods

Traditionally, approaches producing coherent forecasts for hierarchical and grouped time series involve using bottom-up and top-down methods by generating forecasts at a single level and then aggregating or disaggregating. Top-down methods require having a unique hierarchical structure to disaggregate forecasts generated at the top level by proportions. However, given that we have multiple grouped attributes combined with the hierarchical structure, there is no unique way to disaggregate top forecasts. Hence, the top-down approach cannot be used in our application. The recommended approach is to use forecast reconciliation (Hyndman et al. 2011). In the following sections, we first discuss some notation and then present bottom-up and forecast reconciliation approaches used in this study to generate coherent forecasts.

Notations

Let b _t be a vector of n _b bottom-level time series at time t, and let a _t be a corresponding vector of n _a = n − n _b aggregated time series, where $a_{t} = A b_{t},$ and A is the n _a × n _b “aggregation” matrix specifying how the bottom-level series b _t are to be aggregated to form a _t. The aggregation matrix A is determined by the structure of the hierarchy. It maps the bottom-level time series to the corresponding higher-level time series. For example, if there are two bottom-level series and one aggregated series (equal to the sum of the two bottom-level series), then A = [1, 1]. The full vector of the time series is given by $y_{t} = [\begin{array}{l} a_{t} \\ b_{t} \end{array}] .$

This leads to the n × n _b “summing” or “structural” matrix given by $S = [\begin{array}{l} A \\ I_{n_{b}} \end{array}]$ such that y _t = S b _t.

The term “bottom-level series” refers to the most disaggregated series within the hierarchical and grouped time series structure. For instance, in Table 2, each distinct combination of values in the control area (e.g., south and east), health board (e.g., CV), priority (e.g., green), and nature of incident (e.g., chest pain), corresponds to one individual time series. In the dataset at hand, there are 691 unique combinations, resulting in 691 bottom-level time series. The “aggregate time series” describes how these bottom-level series are combined to create higher-level series. For instance, to obtain the incidents at the national level (i.e., all country level), the time series are aggregated across all control areas, health boards, priorities, and natures of incidents. Any desired aggregation level can be achieved based on the data structure, utilizing the bottom-level series available.

Bottom-Up (BU) and Linear Reconciliation Methods

Bottom-Up is a simple approach to generate coherent forecasts. It involves first creating the base forecasts for the bottom-level series (i.e., the most disaggregated series). These forecasts are then aggregated to the upper levels, which naturally results in coherent forecasts. The BU approach can capture the dynamics of the series at the bottom level, but these series may be noisy and difficult to forecast. The approach uses only the data at the most disaggregated level and so does not utilize all the information available across the hierarchical and grouped structure.

The bottom-up (BU) approach is constrained by its reliance solely on base forecasts from a single level of aggregation at the bottom level. While it does result in consistent forecasts, the BU approach lacks forecast reconciliation since no reconciliation is performed.

Forecast reconciliation approaches bridge this gap by combining and reconciling all base forecasts to generate coherent forecasts. This technique utilizes all the base forecasts produced within a hierarchical structure to create consistent forecasts at every level of the hierarchy. As a result, it goes beyond relying solely on base forecasts from a single level of aggregation and instead leverages all available information at each level to generate forecasts that minimize the total forecast variance of the set of coherent forecasts. Linear reconciliation involves projecting the base forecasts onto the coherent space. It is derived by minimizing the sum of the variances of the reconciled forecasts, subject to the resulting forecasts being coherent and unbiased (Wickramasuriya, Athanasopoulos, and Hyndman 2019). Linear forecast reconciliation methods can be written (Wickramasuriya, Athanasopoulos, and Hyndman 2019) as ${\tilde{y}}_{h} = S {(S^{'} W^{- 1} S)}^{- 1} W^{- 1} {\tilde{y}}_{h},$ where W is an n × n positive definite matrix, and ${\hat{y}}_{h}$ contains the h-step forecasts of $y_{T + h}$ given data to time T. When W _h is the covariance matrix of ${\hat{y}}_{h}$ , the resulting forecasts are optimal in the sense that the sum of the variances of the reconciled forecasts is minimized, provided the base forecasts ${\hat{y}}_{h}$ are unbiased. However, W _h is difficult to estimate, and so there have been various suggested approximations to W _h, leading to different types of reconciliation such as Ordinary Least Squares (OLS), Weighted Least Squares (WLS) and Minimum Trace (MinT).

Ordinary Least Squares (OLS) is the simplest and most commonly used method. In this approach, the estimation of W is based on the assumption that all the errors are uncorrelated and have equal variance and that multi-step forecast variances are proportional to one-step forecast variances. Then, W is simply the identity matrix multiplied by a constant factor. The main weakness of this approach is that it does not take account of the different scales of the base time series; the aggregated series will usually have higher variance than the disaggregated series simply because the values are larger, but OLS treats all series the same. A strength of the approach is that it is simple and does not involve estimating a covariance matrix.

Weighted Least Squares (WLS) is an extension of OLS where the variance of the errors is assumed to be heteroscedastic, that is, different for each series. But it assumes that the errors of each series are uncorrelated with each other and that multi-step forecast variances are proportional to one-step forecast variances. In this approach, W is defined as a diagonal matrix with the variance of the errors on the diagonal. The intuition behind WLS is that it assigns a higher weight to series with a smaller error variance and thereby takes into account the different scales of the base time series. The main weakness of this approach is that it ignores the relationships between series. A strength of WLS is that it is relatively easy to compute W as it is based only on error variances, which are readily estimated.

Minimum Trace (MinT) is a further generalization where W is defined as the covariance matrix of the one-step base forecast errors. Therefore, it takes account of both the scale of each series and the relationships between the series. But it still assumes that multi-step forecast variances are proportional to one-step forecast variances. The main weakness of this approach is that it is difficult to estimate the full covariance matrix, even with the one-step errors. In practice, we usually need to use a shrinkage estimate where the off-diagonal elements are shrunk towards zero.

We use the implementation of these methods in the hts package in R in the experiment.

Certainly, other approaches can be applied to hierarchical forecasting problems. Pennings and Van Dalen (2017) and Villegas and Pedregal (2018) proposed the idea of using a state space model to ensure consistent forecasts. However, when dealing with larger hierarchies, these models encounter difficulties in estimating covariance matrices. In contrast, our approach provides a clear advantage by allowing the incorporation of different forecasting methods for the base forecasts and even accommodating distinct methods for individual series. The decoupling of time series models from the reconciliation step adds significant flexibility in exploring a wide range of models.

Results and Discussion

In this section, we compare the forecasting performance of the Stationary, ETS, GLM, and TSGLM models along with the ensemble using base forecast and Minimum Trace (MinT) reconciliation methods. We have also computed the forecast accuracy for Ordinary Least Square (OLS) and Weighted Least Square (WLS) approaches, along with bottom-up forecasting. However, they are not reported here because their accuracy is outperformed by MinT. We should also note that forecasts, and consequently their corresponding errors, are generated for the entire hierarchy, and they could be reported at any level if required. But to save space, we have reported only the top level (Total), the bottom level, and the levels corresponding to control areas and health boards. The latter are chosen because this is where decision-making takes place, so these forecasts are the most important.

The overall forecasting performance is reported in Table 3, in which the average forecast accuracy over horizons of 43–84 days (corresponding to the planning horizon) is presented per model, method, and hierarchical level. Reported forecast accuracy is averaged across all forecast horizons, rolling origins, and series at each level. Table 3 presents both point and probabilistic forecast accuracy at total, control area, health board, and bottom-level series. Point forecast performance is reported using MASE and MSSE, while probabilistic forecast accuracy is reported using CRPS. The bold entries in each table identify a combination of method and model that performs best for the corresponding level (i.e., each column), based on the smallest values of accuracy measures.

Table 3.

Average Forecast Performance Calculated on the Test Sets at Forecast Horizons h = 43, . . . , 84 Days, With Time Series Cross-Validation Applied to Attended Incident Data.⁸

Reconciliation	Model	Total	Control Areas	Health Boards	Bottom
MSSE
Base	Stationary	1.139	1.059	1.047	1.019
Base	ETS	0.963	0.930	0.899	1.038
Base	GLM	0.910	0.940	0.923	1.002
Base	TSGLM	0.911	0.939	0.924	1.005
Base	Ensemble	0.782	0.856	0.876	1.008
MinT	Stationary	1.138	1.059	1.047	2.651
MinT	ETS	0.877	0.916	0.915	1.289
MinT	GLM	0.848	0.901	0.902	2.493
MinT	TSGLM	0.852	0.903	0.903	2.513
MinT	Ensemble	0.753	0.844	0.872	2.260
MASE
Base	Stationary	1.169	1.056	1.062	1.031
Base	ETS	0.979	0.875	0.816	0.975
Base	GLM	0.813	0.897	0.875	1.009
Base	TSGLM	0.822	0.901	0.875	1.050
Base	Ensemble	0.599	0.729	0.774	0.993
MinT	Stationary	1.168	1.057	1.062	2.095
MinT	ETS	0.785	0.852	0.845	0.994
MinT	GLM	0.720	0.827	0.837	1.803
MinT	TSGLM	0.722	0.833	0.839	1.851
MinT	Ensemble	0.560	0.706	0.765	1.557
CRPS
Base	Stationary	30.387	10.882	5.500	0.302
Base	ETS	14.309	6.074	3.476	0.244
Base	GLM	15.396	6.253	3.576	0.244
Base	TSGLM	15.316	6.227	3.575	0.245
Base	Ensemble	12.978	5.727	3.430	0.243
MinT	Stationary	30.368	10.902	5.498	0.313
MinT	ETS	13.515	5.967	3.547	0.243
MinT	GLM	13.839	5.917	3.453	0.246
MinT	TSGLM	14.000	5.947	3.455	0.248
MinT	Ensemble	12.585	5.728	3.426	0.247

Bold values show the lowest score for the given accuracy measure and the level of hierarchy.

Table 3 shows that forecast reconciliation (i.e., MinT) improves point forecast accuracy at the higher levels of the hierarchy, including total, control area, and health board. However, it does not result in accuracy improvement at the bottom-level series, for which base forecasts are more accurate. This might be due to the noisy structure of time series at the bottom level and possibly to very different patterns in the aggregated series. It is also clear from Table 3 that the ensemble method improves forecast accuracy at the total, control area, and health board. However, this does not remain valid for bottom series where different individual methods perform best, depending on the accuracy measure. While the forecast reconciliation approach aims to enhance forecast accuracy, its effectiveness is not guaranteed, especially if the bottom-level series exhibit excessive noise and lack systematic patterns. Despite this, reconciling forecasts at the bottom level can offer advantages by generating coherent forecasts that facilitate alignment in planning across various teams within an organization, promote better coordination, and prevent conflicting decisions. Moreover, even when dealing with noisy and irregular bottom-level series, reconciliation can still improve forecast accuracy at higher levels of the hierarchy by leveraging the information available across the hierarchy. Therefore, although the bottom-level forecasts may not be highly accurate on their own, reconciling them with higher-level forecasts can still provide a more consistent view of future demand and potentially yield more accurate forecasts at other levels.

Table 3 presents the accuracy of the forecast distribution measures by CRPS, which considers both forecasting reliability and interval sharpness. The smaller the value of CRPS, the better the comprehensive performance. We observe that forecast reconciliation results in forecast improvement at the total and health board levels. CRPS is almost identical at the control area and bottom levels. Base forecasts are slightly better at the control area level, while reconciliation is marginally more accurate than base at the bottom level. The ensemble method is also more accurate for higher levels, but ETS performs well at the bottom level. Table 3 also indicates that reconciliation using Mint generates accurate distributional forecasts. The marginal improvement in the average probabilistic forecast accuracy at the bottom level might be due to the reconciliation method giving improved forecast accuracy in the tails of the forecast distribution, which are critical for managing risks.

Overall, our results indicate that forecast reconciliation using the MinT method provides reliable forecasts and improves upon the base (unreconciled) forecasts at all levels except the bottom-level series. But even there, forecast reconciliation using MinT improves accuracy in the tails of the distribution.

In addition to the overall forecast accuracy presented in Table 3, we also report the point and probabilistic forecast accuracy measures for each forecast horizon in Figure 4. The figure focuses on the hierarchical levels important for decision-making, including total, control area, and health board; however, the accuracy could be calculated for any level. We only illustrate the results of the MinT method, given its strong performance described in Table 3. For illustration purposes, we report the average weekly forecast accuracy instead of the daily forecast horizon, as this reduces the visual noise in the figure. Thus, the x-axis shows horizons from week 1 (h = 1, . . . , 7) to week 12 (h = 78, . . . , 84). The forecast horizon from week 7 to week 12 corresponds to the upcoming planning horizon, which is used by planners and decision-makers. For both the point forecast and distributional accuracy, we can see that the ensemble approach performs best across almost all horizons, with the biggest differences at the highest levels of aggregation. It is important to highlight that all forecasting models outperform the stationary empirical distribution that is used as a benchmark for both point and probabilistic forecasts.

Figure 4.

Average accuracy by week for 12 weeks using MinT reconciliation.⁵

Despite using Poisson regression models to create count distributions of attended incidents for the base forecasts, it is important to note that the reconciled forecast distributions do not maintain a count format. In practical scenarios, there might be a need to use integer forecasts. Count forecast reconciliation is an active area of research, and it would be interesting to explore how our approach could be adapted to generate count-reconciled probabilistic forecasts in future studies. Rounding the forecasts is one possible solution to this problem. However, the impact of rounding on forecast accuracy varies depending on the level of hierarchy and the scale of the data. In situations with high volume demand, the effects of rounding may be negligible, and forecast accuracy calculations can overlook integer effects. On the other hand, in low-volume demand settings, such as forecasts at the bottom level of the hierarchy, integer (rounding) effects may have a more noticeable influence on forecast accuracy.

An Illustration of Probabilistic Forecast for EMS Demand

In this section, we provide an illustrative example of a probabilistic forecast for future demand, based on the total attended incidents at the SB health board. Due to the complexity of including such plots for the entire hierarchy and and the 84 days ahead in the manuscript, only one example is presented here. However, it is feasible to generate these plots for the entire hierarchy and for any forecast horizon, if necessary.

In practice, point forecasts are commonly used, but they have limitations as they ignore the uncertainty associated with the forecast. The future is inherently characterized by an irreducible level of uncertainty. Being prepared entails considering alternate courses of action. Probabilistic forecasts offer an alternative approach to anticipate future demand. Rather than providing a single value, they assign likelihoods to all possible demand outcomes, acknowledging that different numbers of attended incidents are possible, but with varying likelihoods.

The purpose of probabilistic forecasting, as demonstrated in Figures 5 and 6, is to quantify uncertainty. Figure 5 depicts the forecast distribution of total incidents in one health board over a 7-day period. It also gives the point forecast as well as the 80% and 90% prediction intervals. Figure 6 zooms in on the first day to show the histogram more clearly, illustrating the range of possible outcomes and their likelihood.

Figure 5.

A graphical illustration of the forecast distribution of ambulance demand (i.e., total incidence attended) for the SB health board for a horizon of 7 days.⁶

Figure 6.

An illustrative example of the forecast distribution of ambulance demand (i.e., total incidence attended) for the SB health board for 1 day ahead.⁷

Decisions based on these forecasts could focus on the tails of the distribution: unexpectedly high demand leading to crowding and inefficiency, or unexpectedly low demand resulting in wasted resources. Such forecasts are valuable tools for decision-makers and planners, especially when dealing with low-probability, high-cost situations. Different EMS managements may have varying risk attitudes depending on resource availability, making it crucial to consider the entire distribution when making decisions. For instance, these forecasts enable management to calculate the probability of demand exceeding a certain threshold of available resources (e.g., 90%), which can serve as an informative early warning measure for overcrowding.

It is important to note that while point forecasts and prediction intervals can be obtained from probabilistic forecasts, the reverse is not possible. A single number cannot be used to directly derive a probabilistic forecast. Prediction intervals, although helpful in indicating possible ranges, do not provide information on the probabilities of low or high demand.

In EMS planning, future demand is just one aspect to consider. Other inputs, such as capacity, should also be treated as probability distributions to adopt a probabilistic approach to planning. To extract valuable insights and make informed decisions from probabilistic forecasts, specialized numerical tools are required, as the forecasts themselves are typically represented as explicit probability density functions or Monte Carlo generators.

Conclusion

Forecasting problems at Emergency Medical Services often have inherent hierarchical and grouped structures. For example, looking at time series of arrival calls in a clinical desk service, Emergency Department admissions, verified incidents, or attended incidents in a country, they could be disaggregated by various attributes of interest. Total demand in the country could be disaggregated by region, then within each region by health board, within each health board, by station/hospital, and so on down to the postcode area. Alternative structures may arise when attributes of interest are crossed rather than nested. For example, the total demand could be disaggregated by priority (e.g., red, amber, and green) or by the nature of incidents. It is also natural to have a mixed structure; for example, the total demand could be disaggregated by priority and by health board.

Despite the inherent hierarchical structure of the forecasting problem in EMS, the common practice is to produce point forecasts for each time series independently. This practice may lead to a lack of coordination and possibly undesirable and conflicting outcomes. Furthermore, due to the asymmetric impact of resource allocation in this area, quantifying forecast uncertainty through probabilistic forecasts is also of value as it enables planners to manage associated risks. In this paper, we investigate the application of hierarchical forecasting methods for producing probabilistic forecasts of daily incidents attended up to 84 days ahead, using different forecasting methods.

Our results indicate that forecast reconciliation in EMS can not only contribute to a more coordinated approach to planning and decision-making through the production of coherent forecasts, but it can result in forecast accuracy improvements. Our proposed forecasting models, combined with reconciliation approaches, outperform the empirical distribution benchmark. We show that a substantial forecast improvement can be achieved at higher levels of aggregation by applying forecast reconciliation methods. When a point forecast is of interest at the bottom level of the series, we observe that reconciliation may not improve the forecast accuracy if the bottom series are noisy and lack systematic patterns. However, forecast reconciliation may result in more accurate forecast results for bottom series if we are interested in the tails of the forecast distribution rather than just center measures like mean (i.e., point forecast). Coherent forecasts are also crucial for informing planning activities, and we demonstrate that the proposed models produce coherent forecasts across all forecast horizons. Therefore, we recommend that forecast reconciliation approaches be adopted for routine use in EMS, whenever hierarchical and/or grouped time series data need to be forecasted. Moreover, we found that using an ensemble forecasting model, combining all the models developed in this paper instead of using each individually, works remarkably well for our mixed hierarchical and grouped structure.

Our research establishes a strong basis for future investigations and practical implementation in EMS. Leveraging the hierarchical and grouped structure of demand time series, EMS can use this advanced forecasting framework to generate coherent point and probabilistic forecasts, making the most of all available data at every level of the hierarchy. We acknowledge that a forecast serves a greater purpose beyond its mere existence, ideally enabling the best utility in terms of efficient allocation of medical services, response time, and cost, all informed by the forecast. While we fully appreciate the importance of evaluating forecast quality based on its impact on decision-making processes, it is essential to address the data requirements and methodology involved in measuring this impact. For a comprehensive assessment of the forecasts’ implications, access to additional data beyond ambulance demand, covering various decision types, capacity information, constraints in the decision system, and more, becomes necessary. This additional data would offer valuable insights into the specific decisions based on the forecasts, resulting in a more accurate evaluation of their impact on medical services. Furthermore, measuring the actual impact of forecasts would necessitate an approach that goes beyond forecasting itself. This would involve developing and implementing simulation models capable of replicating decision-making processes based on the forecast inputs. These simulation models would then evaluate the quality of the final decisions, taking into consideration the utilities that are particularly significant in the context of EMS.

Future research can build upon this study in several ways. In future investigations, we aim to explore this avenue by incorporating operational information, simulating decision processes, and assessing the decision impact of this framework on utilities that are significant to the EMS. Linking forecasts with their utilities (e.g., response time, allocation of medical services, resource utilization, cost, etc.) can offer an opportunity to maximize benefits through a more holistic planning approach. Additionally, in our study, we employed Poisson regression models to generate count distributions of attended incidents for the base forecasts. However, it is essential to note that the reconciled forecast distributions are not integer. This observation presents an interesting avenue for future research. Also, since the dataset used in this study only includes information on attended incidents, it would be valuable for future research to investigate the impact of failed responses on EMS forecasting if data on these incidents became available. It is also important to note that our methodology for hierarchical time series forecasting can be applied to any time series data in EMS, including those that may include failed responses.

Although our study primarily focuses on Emergency Medical Services, it is essential to emphasize that the framework we propose has broad applicability across various service industries (Ostrom et al. 2010). Our approach is particularly valuable in situations where time series data is structured hierarchically and/or grouped, a common characteristic found in many sectors. This occurs when data can be naturally organized into different levels of hierarchies or when dependencies and relationships exist among entities within the system. For instance, in supply chains (Shugan and Xie 2000), demand forecasting at different levels of the distribution network, such as regional warehouses or retail stores, is vital for efficient inventory management and minimizing stockouts. Our framework allows the reconciliation of forecasts, ensuring consistency, and alignment throughout the supply chain, leading to improved decision-making and operational efficiency. In the financial industry (Kimes and Chase 1998), where investments span multiple asset classes, geographical regions, or customer segments, our framework can be applied to forecast portfolio performance, asset allocation, or customer demand. Similarly, in transportation, the framework supports forecasting transportation demand at various levels, optimizing route planning and resource allocation. Likewise, in the hospitality and tourism industry (Dekimpe, Peers, and van Heerde 2016), it facilitates forecasting demand rates at state, regional, and departmental levels, enabling strategic pricing, capacity planning, and revenue management for hotels and other travel-related businesses. Additionally, in call centers, accurate call volume forecasting at different levels of the call center hierarchy or grouped structure is crucial for workforce management and resource allocation. Implementing our framework, call centers can generate accurate forecasts for different skill groups, shifts, and locations, ensuring efficient staffing and optimal service levels to meet customer demands.

Reproducibility

To enhance transparency and reproducibility, we not only provide data and the code, but also the entire paper that is written in R using Quarto. All materials to reproduce this paper is available at github.com/bahmanrostamitabar/forecasting-emergency-medicine.

The repository contains the raw data, all R scripts used in experiments, the results used in the paper, as well as the quarto files for producing this paper. Full instructions are provided in the repository.

Supplemental Material

Supplemental Material - Hierarchical Time Series Forecasting in Emergency Medical Services

Supplemental Material for Hierarchical Time Series Forecasting in Emergency Medical Services by Bahman Rostami-Tabar and Rob J. Hyndman in Journal of Service Research

Footnotes

Author Contributions

BRT and RJH conceptualize the study and designed the experiment. BRT obtained a funding to meet RJH at Monash University in Melbourne. BRT and RJH provided the analysis of the time series data. RJH developed forecasting models and hierarchical methods;BRT and RJH use R programming to conduct analysis. BRT drafted the manuscript,and RJH contributed substantially to its revision. BRT takes responsibility for the paper as a whole.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research,authorship,and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research,authorship,and/or publication of this article: A funding awarded by Cardiff Business School (AC2950RC17) allowed BRT to travel to Melbourne to visit RJH to discuss the research design,run the experiment and write the article. RJH received funding through the Australian Research Council Industrial Transformation Training Centre in Optimisation Technologies,Integrated Methodologies,and Applications (OPTIMA),Project ID IC200100009.

ORCID iD

Bahman Rostami-Tabar

Supplemental Material

Supplemental material for this article is available online.

Notes

Author Biographies

Rob Hyndman,Professor of Statistics at Monash University,Australia,is an elected Fellow of the Australian Academy of Science,the Academy of Social Sciences in Australia and the International Institute of Forecasters. Winner of the Moran Medal and Pitman Medal,he has authored over 200 research papers and 5 books.

Bahman Rostami-Tabar,Professor of Data-Driven Decision Science at Cardiff University,the United Kingdom,is the director of Data Lab for Social Good at Cardiff Business School. His research focus lies in leveraging analytical methodologies to provide informed insights from data for policy & decision-making processes in sectors contributing to social good.

References

Al-Azzani

Mohamed AK

Davari

Soheil

England

Tracey Jane

(2021). “An Empirical Investigation of Forecasting Methods for Ambulance Calls-A Case Study.” Health Systems, 10(4), 268–285.

Armstrong

Jon Scott

(2001). Evaluating Forecasting Methods, in: Armstrong, J. S. (Ed.), Principles of Forecasting: A Handbook for Researchers and Practitioners, Chap. 14. Kluwer Academic Publishers, 443-472.

Bandara

Kasun

Hyndman

Rob J.

Bergmeir

Christoph

(2024 in press) “MSTL: A Seasonal-Trend Decomposition Algorithm for Time Series with Multiple Seasonal Patterns,” International J Operational Research.

Boylan

John E.

Goodwin

Paul

Mohammadipour

Maryam

Syntetos

Aris A.

(2015). “Reproducibility in Forecasting Research,” International Journal of Forecasting, 31 (1), 79-90.

Clemen

Robert T.

(1989). “Combining Forecasts: A Review and Annotated Bibliography,” International Journal of Forecasting, 5 (4), 559-583.

Dekimpe

Marnik G.

Peers

Yuri

van Heerde

Harald J.

(2016). “The Impact of the Business Cycle on Service Providers: Insights From International Tourism.” Journal of Service Research, 19 (1), 22–38.

Ekström

Andreas

Kurland

Lisa

Farrokhnia

Nasim

Castrén

Maaret

Nordberg

Martin

(2015). “Forecasting Emergency Department Visits Using Internet Data,” Annals of Emergency Medicine, 65 (4), 436-442.

Faraway

Julian J.

(2016). Extending the Linear Model With R: Generalized Linear, Mixed Effects and Nonparametric Regression Models (2nd ed.). Boca Raton London New York: Chapman and Hall/CRC.

Feller

William

(1991). An Introduction to Probability Theory and Its Applications, Volume 2. (Vol. 81). New York : John Wiley & Sons.

10.

Gneiting

Tilmann

Katzfuss

Matthias

(2014). “Probabilistic Forecasting,” Annual Review of Statistics and Its Application,” (1), 1125–151.

11.

Grekousis

George

Liu

(2019). “Where Will the Next Emergency Event Occur? Predicting Ambulance Demand in Emergency Medical Services Using Artificial Intelligence,” Computers, Environment and Urban Systems, 76 (1), 110-122.

12.

Gul

Muhammet

Celik

Erkan

(2020). “An Exhaustive Review and Analysis on Applications of Statistical Forecasting in Hospital Emergency Departments,” Health Systems, 9 (4), 263-284.

13.

Hermansen

Anna Haugsbø

Mengshoel

Ole Jakob

(2021). “Forecasting Ambulance Demand Using Machine Learning: A Case Study from Oslo, Norway,” in 2021 IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA (December 05–07 2021), 01–10.

14.

Hulshof

Peter JH

Kortbeek

Nikky

Boucherie

Richard J.

Hans

Erwin W.

Bakker

Piet JM

(2012). “Taxonomic Classification of Planning Decisions in Health Care: A Structured Review of the State of the Art in OR/MS.” Health Systems, 1, 129–175.

15.

Hyndman

Rob

Athanasopoulos

George

Bergmeir

Christoph

Caceres

Gabriel

Chhay

Leanne

O’Hara-Wild

Mitchell

Petropoulos

Fotios

Razbash

Slava

Wang

Earo

Yasmeen

Farah

(2022). “Forecast: Forecasting Functions for Time Series and Linear Models,” R package version 8.19.

16.

Hyndman

Rob J.

Ahmed

Roman A.

Athanasopoulos

George

Shang

Han Lin

(2011). “Optimal Combination Forecasts for Hierarchical Time Series,” Computational Statistics & Data Analysis, 55 (9), 2579-2589.

17.

Hyndman

Rob J.

Athanasopoulos

George

(2021). Forecasting: Principles and Practice, 3rd ed. OTexts.

18.

Hyndman

Rob J.

Khandakar

Yeasmin

(2008). “Automatic Time Series Forecasting: the Forecast Package for R.” Journal of Statistical Software, 27(3), 1–22.

19.

Hyndman

Rob J.

Koehler

Anne B.

(2006). “Another Look at Measures of Forecast Accuracy,” International Journal of Forecasting, 22 (4), 679-688.

20.

Ibrahim

Rouba

Han

L’Ecuyer

Pierre

Shen

Haipeng

(2016). “Modeling and Forecasting Call Center Arrivals: A Literature Survey and a Case Study,” International Journal of Forecasting, 32 (3), 865-874.

21.

Kang

Yanfei

Hyndman

Rob J.

Smith-Miles

Kate

(2017). “Visualising Forecasting Algorithm Performance Using Time Series Instance Spaces,” International Journal of Forecasting, 33 (2), 345-358.

22.

Kimes

Sheryl E.

Chase

Richard B.

(1998). “The Strategic Levers of Yield Management,” Journal of Service Research, 1 (2), 156-166.

23.

Liboschik

Tobias

Fokianos

Konstantinos

Fried

Roland

(2017). “Tscount: An R Package for Analysis of Count Time Series Following Generalized Linear Models,” Journal of Statistical Software, 82 (5), 1-51.

24.

Makridakis

Spyros

Spiliotis

Evangelos

Assimakopoulos

Vassilios

(2022). “The M5 Competition: Background, Organization, and Implementation,” International Journal of Forecasting, 38 (4), 1325-1336.

25.

Ostrom

Amy L.

Bitner

Mary Jo

Brown

Stephen W.

Burkhard

Kevin A.

Goul

Michael

Smith-Daniels

Vicki

Demirkan

Haluk

Rabinovich

Elliot

(2010). “Moving Forward and Making a Difference: Research Priorities for the Science of Service,” Journal of Service Research, 13 (1), 4-36.

26.

Panagiotelis

Anastasios

Gamakumara

Puwasala

Athanasopoulos

George

Hyndman

Rob J.

(2023). “Probabilistic Forecast Reconciliation: Properties, Evaluation and Score Optimisation,” European Journal of Operational Research, 306 (2), 693-706.

27.

Pennings

Clint LP

Van Dalen

Jan

(2017). “Integrated Hierarchical Forecasting,” European Journal of Operational Research, 263 (2), 412-418.

28.

Rostami-Tabar

Bahman

Ziel

Florian

(2022). “Anticipating Special Events in Emergency Department Forecasting,” International Journal of Forecasting, 38 (3), 1197-1213.

29.

Rostami-Tabar

Bahman

Browell

Jethro

Svetunkov

Ivan

(2024). “Probabilistic Forecasting of Hourly Emergency Department Arrivals,” Health Systems, 13 (2), 133-149.

30.

Sasaki

Satoshi

Comber

Alexis J.

Suzuki

Hiroshi

Brunsdon

Chris

(2010). “Using Genetic Algorithms to Optimise Current and Future Health Planning-The Example of Ambulance Locations.” International Journal of Health Geographics, 9(1), 1–10.

31.

Shi

Mingzhe.

Rostami-Tabar

Bahman

Gartner

Daniel

(2022). “Forecasting for Unplanned Care Services: A Literature Review,” working paper, Health Care Management Science, Cardiff Business School.

32.

Shugan

Steven M.

Xie

Jinhong

(2000). “Advance Pricing of Services and Other Implications of Separating Purchase and Consumption,” Journal of Service Research, 2 (3), 227-239.

33.

Stodden

Victoria

Miguez

Sheila

(2013). “Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research,” Journal of Open Research Software, 2 (1), e21.

34.

Vile

Julie Leanne

Gillard

Jonathan W.

Harper

Paul R.

Knight

Vincent A.

(2012). “Predicting Ambulance Demand Using Singular Spectrum Analysis,” Journal of the Operational Research Society, 63 (11), 1556-1565.

35.

Vile

Julie Leanne

Gillard

Jonathan William

Harper

Paul Robert

Knight

Vincent Anthony

(2016). “Time-dependent Stochastic Methods for Managing and Scheduling Emergency Medical Services.” Operations Research for Health Care, 8, 42–52.

36.

Villegas

Marco A.

Pedregal

Diego J.

(2018). “Supply Chain Decision Support Systems Based on a Novel Hierarchical Forecasting Approach.” Decision Support Systems, 114, 29–36.

37.

Wang

Xiaoqian

Hyndman

Rob J.

Feng

Kang

Yanfei

(2023). “Forecast Combinations: A Over 50-Year Review.” International Journal of Forecasting 39 (4), 1518-1547.

38.

Wargon

Mathias

Guidet

Bertrand

Hoang

T. D.

Hejblum

Gilles

(2009). “A Systematic Review of Models for Forecasting the Number of Emergency Department Visits,” Emergency Medicine Journal, 26 (6), 395-399.

39.

Wickramasuriya

Shanika L.

Athanasopoulos

George

Hyndman

Rob J.

(2019). Optimal Forecast Reconciliation for Hierarchical and Grouped Time Series through Trace Minimization. Journal of the American Statistical Association, 114 (526), 804–819.

40.

Winkler

Robert L.

(1972). A Decision-Theoretic Approach to Interval Estimation. Journal of the American Statistical Association, 67(337), 187–191.

41.

Zhou

Zhengyi

(2016). “Predicting Ambulance Demand: Challenges and Methods” arXiv preprint arXiv:1606.05363.

42.

Zhou

Zhengyi

Matteson

David S.

(2016). “Predicting Melbourne Ambulance Demand Using Kernel Warping.” The Annals of Applied Statistics, 10(4), 1977–1996.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.29 MB