Abstract
Introduction
A failure to match available resources to demand in Emergency Medical Services (EMS) results in patient flow problems, with serious consequences for patients, staff, and the entire care system (Ekström et al. 2015; Rostami-Tabar and Ziel 2022). Demand forecasting in EMS helps service planners to avoid the mismatch, potentially providing massive savings in costs and lives, and leading to better patient outcomes. Accurate daily demand forecasting enables planners and decision-makers to manage resources to meet anticipated patients, reconfigure units, and redeploy staff and vehicles as necessary.
Demand forecasts at EMS are typically required at multiple levels of an organization to inform various planning and decision-making processes (Hulshof et al. 2012). There are some planning processes at the national level (strategic and long-term) such as workforce resource planning and budgeting; sub-national, regional, or healthcare level (tactical and medium-term) such as temporary capacity expansions, resource sharing; and hospital or station level (operational and short-term) such as planning rosters for staff and ambulance deployment. Demand forecasts might also be required at different levels for a specific area of interest, such as the nature of demand or the priority level. Moreover, the time series data in EMS has an inherent hierarchical and grouped structure to support such forecasting requirements. Demand for emergency medical services at the national level can be disaggregated in a geographical hierarchy into sub-national, regions, health boards, and stations/hospitals, or divided into groups such as the nature of incidents or demand priority. Forecasts produced at both higher and lower levels of hierarchies are necessary for effective decision-making in EMS. For example, control area EMS forecasts can inform strategic decisions about how to allocate limited resources to lower levels, such as health boards and stations/hospitals. At the lower levels, hospitals or ambulance stations could use such forecasts to plan for staffing and resource allocation, ambulance dispatching, staff-to-shift assignment, and staff rescheduling based on the anticipated volume, and priority and nature of incidents. Additionally, generating forecasts at lower levels could potentially improve the accuracy of the high-level forecasts by providing more detailed information on the nature and priority of incidents. This could help to identify patterns in demand that may not be apparent at the higher level. Therefore, employing forecasting techniques that consider the hierarchical and/or grouped patterns of time series in EMS aligns naturally, offering the possibility to enhance forecast accuracy and facilitate coordination.
To illustrate the problem, let’s consider a very simple example where we have an EMS provider with a national level of governance and two regions (A and B), each with a health board and a station. There is a total national budget to be split between the regions in proportion to the forecast number of incidents in each region. The two regions have very different incident patterns, and so they must be forecast using different models. However, the data are noisy at the regional level, so the national forecasts are best obtained by summing the demand from the two regions. The resulting national forecasts are not equal to the sum of the regional forecasts, and so they are not coherent. In fact, the national forecasts show a decreasing trend in demand, and so the national governing body decides to cut the budget for the next year. But neither of the regional forecasts shows a trend, and so the regions argue that the budget cut is unfair. In addition, Region A has much more variable demand than Region B, and so to cope with periods of peak demand, Region A needs to hold more resources in reserve. Therefore, the budget distribution needs to be made in a way that ensures the probability of each region being unable to meet demand is equal. Our solution to this problem is to use a hierarchical forecasting approach that ensures the forecasts are probabilistically coherent. Then, any trends or other forecast characteristics at the national level will also be reflected in the regional forecasts, and the probabilistic forecasts allow for the different levels of uncertainty in the two regions. Budgets can be allocated by controlling the probability of demand exceeding available resources, rather than being simply in proportion to the expected demand.
Despite a large number of studies dedicated to forecasting for EMS (Gul and Celik 2020; Ibrahim et al. 2016; Shi, Rostami-Tabar, and Gartner 2022; Wargon et al. 2009), the hierarchical data structure has been largely ignored, and the main focus has been on producing independent (base) forecasts at a single level. Generating independent forecasts can result in a lack of consistency and coordination, which leads to less effective planning and decision-making.
With hierarchical forecasting, plans at any level are based on coherent forecasts and, therefore, can be aligned. Implementing and sustaining improvements in EMS require alignments and coordination between different stakeholders, without which teams operate in isolation, leading to conflicts, duplication work, rework, or work that runs counter to the overall goal to improve the quality of delivery service. A hierarchical forecasting framework can be used as a tool to improve coordination between teams across care services at the national, sub-national, regional, and local levels. The hierarchical forecasting approaches not only create consistent forecasts but are usually also more accurate than the independent (base) forecasts (Hyndman et al. 2011). To our knowledge, there has been no previous research involving hierarchical and grouped forecasting in the entire field of forecasting for healthcare management.
In this paper, we address this gap by investigating the application of hierarchical forecasting approaches in the EMS using a daily time series of attended incidents from 2015 to 2019 in a major ambulance service in Great Britain. The data has hierarchical and grouped structures, with hierarchies at the national, control (i.e., sub-national), and health board (i.e., regional) levels, as well as groups by priority and nature of incidents. We produce consistent point forecasts and forecast distributions for all levels over an 84 day horizon, which is critical for effective planning and associated risk management. We compare the point and probabilistic forecast accuracy of the independent forecasts, bottom-up, and optimal reconciliation approaches. We first generate independent/base forecasts using Exponential Smoothing State Space (ETS), Poisson regression using Generalized Linear Model (GLM), and tscount (TSGLM), a simple empirical distribution and an ensemble method, followed by applying bottom-up and optimal reconciliation approaches. Forecast performance is assessed by the Mean Absolute Scaled Error (MASE) and Mean Squared Scaled Error (MSSE) for point forecasts and Continuous Ranked Probability Scores (CRPS) for probabilistic forecasts. This paper complies with reproducibility principles (Boylan et al. 2015; Stodden and Miguez 2013). We provide the R codes for the proposed models and benchmarks. Therefore, they can be applied to any healthcare service (e.g., emergency department, primary or social care), subject to the time series having a hierarchical and/or grouped structure. While our research focuses on emergency medical services, it is important to emphasize the suggested framework’s adaptability, which expands its relevance to a variety of service sectors such as supply chains, tourism, finance, and call centers. Our approach can be generalized in cases with hierarchically structured and/or grouped time series data, which is common in many service sectors.
The remainder of this article is structured as follows: In the Research Background section, we provide a brief review of the literature and discuss its limitation to position our work; in the Experiment Setup section, we present the experimental design describing the data set, forecasting methods and forecast evaluation metrics. In the Results and Discussion section, we discuss the hierarchical time series forecasting approaches to generate both point and probabilistic forecasts. In the Conclusion section, we present and discuss our results; in Section 6, we summarize our findings and present ideas for future research.
Research Background
Emergency medical services (EMS) are a critical component in the delivery of urgent medical care to communities. Effective service delivery requires accurate resource planning that generally relies on demand forecasts at operational, tactical, and strategic levels. There are a substantial number of studies on the application of time series forecasting in Emergency Medical Services. For example, Ibrahim et al. (2016) provide an extensive review of the models used in forecasting call volume arrivals. Another important area is related to forecasting ambulance demand. Although the definition of demand might not always be clearly stated, this typically refers to a situation where a physical resource has been deployed to respond to an incident. This might be also called attended incidents. Another demand-related variable is verified incidents; these are all incidents that require action: either by sending a physical vehicle, responding via the Clinical Support Desk, requesting an external provider to respond to it, or forwarding it to other channels such as police, firefighters, or general practitioners. Our study is aligned with this stream of literature. Another similar area that has received considerable attention is Emergency Department forecasting (Rostami-Tabar, Browell, and Svetunkov 2024); we refer interested readers to Shi, Rostami-Tabar, and Gartner (2022), Gul and Celik (2020), and Wargon et al. (2009) for extensive reviews of the relevant literature. In this section, we provide a brief review of studies on forecasting ambulance demand in EMS.
There are generally two main streams of research related to forecasting ambulance demand in EMS: (i) the first stream focuses on the application of time series methods and regression approaches to forecasting aggregate ambulance demand (Sasaki et al. 2010; Vile et al. 2012); and (ii) the second stream considers forecasting EMS demand in finer temporal and geographical granularities by employing temporal-spatial prediction methods (Zhou 2016; Zhou and Matteson 2016). The focus of our study is related to the first stream of research.
Sasaki et al. (2010) develop a multivariable regression model to estimate future EMS demands. In addition to the historical demand, the population census for different age groups and counts of the number of companies employing more than five people are included in the regression. The census variables describe groups who are more likely to need an ambulance. A stepwise ordinary least squares regression analysis is used for estimating the parameter and generating forecasts. The only performance measure reported in this study is R2, which is not an effective measure of forecast accuracy (Armstrong 2001, 457). The research design of this study is not rigorous, and the study is not reproducible. Vile et al. (2012) explore using a Singular Spectrum Analysis (SSA) method to generate forecasts of the EMS demand at the national level for 7-day, 14-day, 21-day, and 28-day forecast horizons using data provided by an ambulance service in Great Britain. The performance of this approach is compared to Auto-Regressive Integrated Moving Average (ARIMA) and Holt-Winters time series methods using Root Mean Squared Error (RMSE). They concluded that point forecasts generated by SSA are more accurate for longer-term horizons, but that ARIMA and Holt-Winters performance is superior for shorter-term horizons. Vile et al. (2016) further develop a decision support system to integrate forecasts generated by SSA. However, the study does not compare and contrast the performance of forecasting methods based on utility measures such as cost, resource utilization and response time. The tool contains options that allow generating forecasts at various levels of granularity; however, it ignores the structure of the hierarchical and grouped relationships, preventing aligned decision-making and coordination. Al-Azzani, Davari, and England (2021) utilizes data from the Welsh Ambulance Service to explore the forecast accuracy of four forecasting approaches: ARIMA, Holt Winters, Multiple Regression, and Singular Spectrum Analysis (SSA) in predicting call volume demand. The aim is to compare these approaches with the current method across various planning horizons (7, 30, and 90 days) for both total call volume and category-specific demand. Forecast accuracy performance is evaluated using root mean square error (RMSE) and mean absolute percentage error (MAPE). The findings indicate that ARIMA performs the best in predicting weekly and monthly demand. However, when it comes to long-term demand, the SSA method proves to be the most effective. Ibrahim et al. (2016) conducted a case study to assess the effectiveness of multiple forecasting methods: the multiplicative univariate forecasting model (MU), univariate mixed-effects model (ME), and two variations of bivariate mixed-effects models (BME). Call center data were utilized to forecast for periods of 1, 7, and 14 days ahead, using only a limited dataset of 42 days. The performance of these forecasting methods was evaluated using two metrics: RMSE for point forecasts and coverage probability for the 95% prediction interval. The findings indicate that the ME consistently produces the most accurate point forecasts. On the other hand, BME models demonstrate superior coverage probabilities when forecasting for 1 day or 1 week ahead. For a 2-week leading period, MU shows better coverage probability.
Hermansen and Mengshoel (2021) investigate forecasting EMS demand at a high spatio-temporal resolution of 1 km2 spatial regions and 1-hr time intervals using total incidents in Oslo, Norway, from 1 January 2015 to 11 February 2019. They used multi-layer perceptron (MLP) and long short-term memory (LSTM) models to forecast the EMS demand and compare the results to simple aggregation methods and baselines. The point forecast accuracy is evaluated using Mean Absolute Error (MAE) and Mean Squared Error (MSE), and the forecast distribution is measured by Categorical Cross-Entropy. They found that Neural Network models performed better in producing point forecasts, while a distribution baseline method based on the spatial distribution of the incidents across all time steps provided more accurate forecast distributions. Zhou (2016) proposed three methods based on Gaussian mixture models, kernel density estimation, and kernel warping to predict hourly data 4 weeks ahead for a 1 km 2 spatial region. Two years of incidents from Toronto, Canada (years 2007 and 2008 with 391,296 events) and Melbourne, Australia, were used to build the model and examine the performance on test data using mean negative log-likelihood. They show that forecasts generated by the proposed methods were significantly more accurate than the current industry practice (a simple averaging formula). Grekousis and Liu (2019) investigated the combination of spatial analysis methods with data mining techniques based on an improved Hungarian algorithm and a MLP neural network to identify the most likely locations of future emergency events. The proposed approach was tested using data from 2851 events attended by the EMS in Athens, Greece, over 24 weeks. They showed that 23% of real emergency events lie within 50 m of the predicted ones, and nearly 70% of the real emergency events lie no further than 150 m away, which is rather accurate given the granularity of the problem at the city level.
Table 1 provides a summary of some studies in the literature on forecasting in Emergency Medical Services. We note a number of limitations in the literature on EMS forecasting, that encourage us to undertake this research. These limitations are summarized as follows: 1. Current studies ignore the inherent hierarchical and/or grouped structure of the time series data, and the relationship between series at different levels of hierarchy. While the hierarchical forecasting methodology has been developed and applied in various domains over the past 10 years (Panagiotelis et al. 2023), it has never been explored in this area. 2. Current research is mainly concerned with generating point forecasts at a single level of hierarchy. There is a lack of studies considering the entire forecast distribution of daily ambulance demand for the whole hierarchy to better represent the uncertainty of future demand, providing a risk management tool for planners. 3. Reproducibility is still a major challenge in EMS forecasting, as it is unlikely that any reader can reproduce prior studies without the help of the authors of those papers. 4. Another limitation is related to the generated forecasts not being in the sample space of non-negative counts. Since actual ambulance counts cannot be negative or non-integer, ambulance demand forecast distributions should reflect the data. Of course, point forecasts represent means, so they should be non-negative but may be non-integer. While this might not be an issue when producing forecasts at a single level, producing non-negative count forecasts in a hierarchical or grouped structure is challenging and requires further investigation in the future. Summary of Some Studies on Forecasting in Emergency Medical Services.
This paper concerns the problem of hierarchical forecasting in EMS and generates and evaluates both point and probabilistic forecasts across different levels of the hierarchy, hence addressing some important gaps identified in the literature.
Experiment Setup
Planners in the ambulance service work with a planning horizon of 6 weeks. That is, planning is generally frozen for the next 42 days, so any forecasts will only affect plans for the time period beyond the next 42 days. Consequently, the forecast horizon in this study is 2 × 42 = 84 days ahead, with performance evaluation assessed based on the last 42 days and not the whole forecast period. The forecasts are produced for various training and test sets using time series cross-validation (Hyndman and Athanasopoulos 2021). In the following section, we discuss the dataset, describe the forecasting methods used to generate base forecasts, and present the point and probabilistic accuracy measures.
Data
The dataset used in this study is from a major ambulance service in Great Britain. It contains information relating to the daily number of attended incidents from 1 October 2015 to 31 July 2019, disaggregated by nature of incidents, priority, the health board managing the service, and the control area (or region). Figure 1 depicts both the hierarchical and grouped structure of the data. Figure 1(a) illustrates the nested hierarchical structure based on the control area and health board, and Figure 1(b) shows the grouped structure by priority and the nature of the incident.
1,2
The hierarchical and grouped structure of attended incidents (ambulance demand).
Number of Time Series in Each Level for the Hierarchical and Grouped Structure of Attended Incidents.
Given the total number of time series, direct visual analysis is infeasible. Therefore, we first compute features of all 1530 time series (Kang, Hyndman, and Smith-Miles 2017) and display the strength of trend and weekly seasonality strength in Figure 2. Each point represents one time series with the strength of trend on the x-axis and the strength of seasonality on y-axis. Both measures are on a scale of [0, 1]. The strength of the trend and weekly seasonality in the time series of attended incidents.
3

In this paper, the strength of trend and seasonality were calculated using the “STL” (Seasonal and Trend Decomposition using Loess) decomposition method, as described by Bandara, Hyndman, and Bergmeir (2024). STL is a widely used and flexible method for decomposing time series data into trend, seasonal, and remainder components. The decomposition of a time series
For strongly trended data, the seasonally adjusted data should have much more variation than the remainder component. Therefore Var(
The strength of seasonality is defined similarly:
In addition to displaying the trend and seasonality strength (Hyndman and Athanasopoulos 2021), we also visualize a few time series at various levels of aggregation. Figure 3 reveals different information, such as trend, seasonality, and noise. For example, some series depict seasonality and trends, whereas others report a low volume of attended incidents and entropy, making them more volatile and difficult to forecast. At the level of the nature of incidents combined with categories of other levels, there are many series that contain zeros with low counts. As such, the data set represents a diverse set of daily time series patterns. Daily time plot of attended incidents at various levels.
4

We consider several forecasting models that account for the diverse patterns of the time series across the entire hierarchy. In developing the forecasting models, the time series of holidays is also used in addition to the attended incidents. We use public holidays, school holidays, and Christmas Day and New Year’s Day as predictors of attended incidents. These types of holidays will affect people's activities and may increase or decrease the number of attended incidents.
Forecasting Methods
Given the presence of various patterns in the past attended incidents, we consider three different forecasting models to generate the base forecasts. Once the base forecasts are produced, hierarchical and grouped time series methods are used to reconcile them across all levels. We briefly discuss forecasting models in the following sections, and the hierarchical forecasting methods are discussed in Section 4.
Stationary
We start with a simple forecasting approach, assuming that future days will be similar to past days. We use the empirical distribution of the past daily attended incidents to create the forecast distribution of future attended incidents. We have chosen this “stationary” method as a benchmark due to its widespread usage and simplicity, making it easily understandable for users. Forecasts serve as inputs for various decision-making systems that frequently employ simulations, wherein it is common to utilize the empirical distribution of demand as a forecast. Additionally, the stationary method has shown surprisingly high accuracy. Hence, any forecasting approach that can offer superior results compared to the stationary method would validate its practical use; otherwise, there is no necessity for employing more complex methods.
Exponential Smoothing State Space model (ETS)
ETS models (Hyndman and Athanasopoulos 2021) can combine trend, seasonality, and error components in a time series through various forms that can be additive, multiplicative, or mixed. The trend component can be none (“N”), Additive (“A”), or damped (“Ad”); the seasonality can be none (“N”), Additive (“A”), or multiplicative (“M”); and the error term can be additive (“A”) or multiplicative (“M”). To forecast the attended incidents at each level, we use the ets() function in the forecast package (Hyndman et al. 2022; Hyndman and Khandakar 2008) in R. To identify the best model for a given time series, the ets function uses the corrected Akaike’s Information Criterion (AICc). In our study, we use an automated algorithm to determine the suitable configuration for the trend, seasonality, and error terms in each time series. Specifically, we utilize the ets() function in the forecast package of R, which employs Akaike’s Information Criterion (AIC) to identify the optimal model for each time series. Given the large number of time series we work with (1530), it is impractical to manually select the appropriate form for each component in every time series. Consequently, the automated algorithm selects the best model based on the unique characteristics of each individual time series. As a result, a combination of additive or multiplicative forms for the components is employed, depending on the specific attributes of each time series. Despite the popularity and the relevance of automatic ETS in this study, it may produce forecast distributions that are non-integer and include negative values, although the number of attended incidents is always integer and non-negative. When using ETS, a time series transformation approach could be used to generate strictly positive forecasts, although forecast distributions will still be non-integer. An alternative is to use forecasting models that produce integer, non-negative forecasts. In the following section, we present Generalized Linear Models (GLMs) and poisson time series regression to produce count-base forecasts.
Generalized Linear Model (GLM)
GLMs are a family of models developed to extend the concept of linear regression models to non-Gaussian distributions (Faraway 2016). They model the response variable as a particular member of the exponential family, with the mean being a transformation of a linear function of the predictors. One of the models that is frequently used in practice to generate count forecasts is Poisson regression. Suppose the time series is denoted by y1, . . ., yT, then the Poisson GLM can be written as
We fit a Poisson regression model using the function glm() from the stats package in R, with the argument family = Poisson to specify that we wish to fit a Poisson regression model with a log link function.
Poisson Regression Using tscount (TSGLM)
We also consider another Poisson regression model that takes into account serial dependence. This model captures the short-range serial dependence by including auto-regressive terms in addition to the same covariates that were used in the GLM model. To distinguish this from the previous GLM model, we will refer to this model as TSGLM. The Poisson TSGLM is similar to the GLM, with an additional auto-regressive component accounting for serial dependence. The term serial dependence refers to instances in which the number of incidents on a current day correlates with the number of incidents on previous days.
We use the tsglm() function in the tscount package in R (Liboschik, Fokianos and, and Fried 2017) to model the attended incidents. Again, the logarithmic link function is used to ensure that the mean of the Poisson distribution is always positive.
Provided accidents occur independently, they will inherently follow a Poisson distribution (Feller 1991, pp. 156–158). Hence, it is reasonable to assume a Poisson distribution in this context. To account for changes over time, we incorporate trend and seasonality covariates, as well as public holiday effects, allowing the mean of the Poisson distribution to vary. However, it is important to note that if there are additional factors influencing the mean of the Poisson distribution that are not accounted for in our model, we might observe over- or under-dispersion in the data.
Ensemble Method
Finally, one effective strategy for improving forecast accuracy includes the simultaneous application of multiple forecasting methods on a given time series, followed by combining the forecasts rather than relying on separate forecasts generated by each individual method (Clemen 1989). In this paper, we use an ensemble method that combines the forecasts generated from the Stationary, ETS, GLM, and TSGLM models using a simple average to form a mixture distribution (Wang et al. in press).
To generate forecast probability distributions using the above methods, we use a form of bootstrapping described in Panagiotelis et al. (2023). This involves simulating 1000 future sample paths from each of the models by bootstrapping the model residuals, taking into account the cross-sectional correlations between the different aggregated and disaggregated series. In this way, we can generate an empirical distribution of forecasts for each model. The ensemble forecast distribution is a simple mixture of these empirical distributions.
It is important to emphasize that the aim of this study is not to provide an exhaustive compilation of forecasting models or to promote a particular model class. Instead, we have developed a flexible framework that can accommodate any forecasting model. Our primary objective is to demonstrate its practicality and effectiveness in integrating base forecasts from any model and generating coherent forecasts within a hierarchical structure.
Performance Evaluation
To evaluate the performance of the various forecasting approaches, we split the data into a series of ten training and test sets. We use a time series cross-validation approach (Hyndman and Athanasopoulos 2021), with a forecast horizon of 84 days and each training set expanding in 42-day steps. The first training set uses all data up to 2018-04-25, and the first test set uses the 84 days beginning 2018-04-26. The second training set uses all data up to 2018-06-06, with the second test set using the following 84 days. The largest training set ends on 2019-05-09, with the test set ending on 2019-07-31. Model development and hyper-parameter tuning is performed using the training data, and the errors are assessed using the corresponding test set. While we compute forecast errors for the entire 12 weeks, we are most interested in the last 42 days of each test set because that corresponds to how forecasts are generated for planning in practice. Forecasting performance is evaluated using both point and probabilistic error measures.
The error metrics provided below consider a forecasting horizon denoted by j, representing the number of time periods ahead we are predicting. In our study, this forecasting horizon ranges from 1 to 84 days, j = 1, 2, . . ., 84.
Point forecast accuracy is measured via the Mean Squared Scaled Error (MSSE) and the Mean Absolute Scaled Error (MASE). The Mean Absolute Scaled Error (MASE) (Hyndman and Athanasopoulos 2021; Hyndman and Koehler 2006) is calculated as
Again, this is scale-independent, and smaller MSSE values suggest more accurate forecasts.
Using scale-independent measures, such as MASE and MSSE, enables more appropriate comparisons between time series at different levels and scales, as these measures are not influenced by the magnitude of the data. This is of particular importance in our study, as we work with time series at various levels of hierarchy with varying scales, resulting in different magnitudes of error. By employing scale-independent measures, we can meaningfully assess the forecast accuracy across the entire hierarchy, ensuring a more robust comparison.
To measure the forecast distribution accuracy, we calculate the Continuous Rank Probability Score (Gneiting and Katzfuss 2014; Hyndman and Athanasopoulos 2021). It rewards sharpness and penalizes miscalibration, so it measures the overall performance of the forecast distribution.
Calibration refers to the statistical consistency between the distributional forecasts and the observations. It measures how well the predicted probabilities match the observations. On the other hand, sharpness refers to the concentration of the forecast distributions—a sharp forecast distribution results in narrow prediction intervals, indicating high confidence in the forecast. A model is well-calibrated if the predicted probabilities match the distribution of the observations, and it is sharp if it is confident in its predictions. The CRPS rewards sharpness and calibration by assigning lower scores to forecasts with sharper distributions and to forecasts that are well calibrated. Thus, it is a metric that combines both sharpness and miscalibration into a single score, making it a useful tool for evaluating the performance of probabilistic forecasts.
CRPS can be considered an average of all possible Winkler scores (Winkler 1972; Hyndman and Athanasopoulos 2021 Section 5.9) or percentile scores (Hyndman and Athanasopoulos 2021, Section 5.9), and thus provides an evaluation of all possible prediction intervals or quantiles. A specific prediction interval could be evaluated using a Winkler score. Certain situations may also require assessing accuracy for a particular quantile, such as lower (e.g., 5%) or higher (e.g., 95%) quantiles. In such cases, a percentile score becomes useful in meeting this specific requirement.
Hierarchical and Grouped Time Series Forecasting Techniques
There are many applications in healthcare, and in particular in EMS, where a collection of time series is available. These series are generally hierarchically organized based on multiple levels, such as area/region, health board, and/or are aggregated at different levels in groups based on nature of demand, priority of demand, or some other attribute. While series could be strictly hierarchical or only grouped based on some attributes, in many situations, more complex structures arise when attributes of interest are both nested and crossed, having a hierarchical and grouped structure. This is also the case for our application, as discussed in Section 3.1.
Independent (Base Forecast)
A common practice in healthcare (and EMS) to predict hierarchical and grouped series relies on producing independent forecasts, also referred to as base forecasts, typically by different teams as the need for such forecasts arises. We observe
Reconciliation Methods
Traditionally, approaches producing coherent forecasts for hierarchical and grouped time series involve using bottom-up and top-down methods by generating forecasts at a single level and then aggregating or disaggregating. Top-down methods require having a unique hierarchical structure to disaggregate forecasts generated at the top level by proportions. However, given that we have multiple grouped attributes combined with the hierarchical structure, there is no unique way to disaggregate top forecasts. Hence, the top-down approach cannot be used in our application. The recommended approach is to use forecast reconciliation (Hyndman et al. 2011). In the following sections, we first discuss some notation and then present bottom-up and forecast reconciliation approaches used in this study to generate coherent forecasts.
Notations
Let
This leads to the
The term “bottom-level series” refers to the most disaggregated series within the hierarchical and grouped time series structure. For instance, in Table 2, each distinct combination of values in the control area (e.g., south and east), health board (e.g., CV), priority (e.g., green), and nature of incident (e.g., chest pain), corresponds to one individual time series. In the dataset at hand, there are 691 unique combinations, resulting in 691 bottom-level time series. The “aggregate time series” describes how these bottom-level series are combined to create higher-level series. For instance, to obtain the incidents at the national level (i.e., all country level), the time series are aggregated across all control areas, health boards, priorities, and natures of incidents. Any desired aggregation level can be achieved based on the data structure, utilizing the bottom-level series available.
Bottom-Up (BU) and Linear Reconciliation Methods
Bottom-Up is a simple approach to generate coherent forecasts. It involves first creating the base forecasts for the bottom-level series (i.e., the most disaggregated series). These forecasts are then aggregated to the upper levels, which naturally results in coherent forecasts. The BU approach can capture the dynamics of the series at the bottom level, but these series may be noisy and difficult to forecast. The approach uses only the data at the most disaggregated level and so does not utilize all the information available across the hierarchical and grouped structure.
The bottom-up (BU) approach is constrained by its reliance solely on base forecasts from a single level of aggregation at the bottom level. While it does result in consistent forecasts, the BU approach lacks forecast reconciliation since no reconciliation is performed.
Forecast reconciliation approaches bridge this gap by combining and reconciling all base forecasts to generate coherent forecasts. This technique utilizes all the base forecasts produced within a hierarchical structure to create consistent forecasts at every level of the hierarchy. As a result, it goes beyond relying solely on base forecasts from a single level of aggregation and instead leverages all available information at each level to generate forecasts that minimize the total forecast variance of the set of coherent forecasts. Linear reconciliation involves projecting the base forecasts onto the coherent space. It is derived by minimizing the sum of the variances of the reconciled forecasts, subject to the resulting forecasts being coherent and unbiased (Wickramasuriya, Athanasopoulos, and Hyndman 2019). Linear forecast reconciliation methods can be written (Wickramasuriya, Athanasopoulos, and Hyndman 2019) as
Ordinary Least Squares (OLS) is the simplest and most commonly used method. In this approach, the estimation of W is based on the assumption that all the errors are uncorrelated and have equal variance and that multi-step forecast variances are proportional to one-step forecast variances. Then,
Weighted Least Squares (WLS) is an extension of OLS where the variance of the errors is assumed to be heteroscedastic, that is, different for each series. But it assumes that the errors of each series are uncorrelated with each other and that multi-step forecast variances are proportional to one-step forecast variances. In this approach,
Minimum Trace (MinT) is a further generalization where
We use the implementation of these methods in the hts package in R in the experiment.
Certainly, other approaches can be applied to hierarchical forecasting problems. Pennings and Van Dalen (2017) and Villegas and Pedregal (2018) proposed the idea of using a state space model to ensure consistent forecasts. However, when dealing with larger hierarchies, these models encounter difficulties in estimating covariance matrices. In contrast, our approach provides a clear advantage by allowing the incorporation of different forecasting methods for the base forecasts and even accommodating distinct methods for individual series. The decoupling of time series models from the reconciliation step adds significant flexibility in exploring a wide range of models.
Results and Discussion
In this section, we compare the forecasting performance of the Stationary, ETS, GLM, and TSGLM models along with the ensemble using base forecast and Minimum Trace (MinT) reconciliation methods. We have also computed the forecast accuracy for Ordinary Least Square (OLS) and Weighted Least Square (WLS) approaches, along with bottom-up forecasting. However, they are not reported here because their accuracy is outperformed by MinT. We should also note that forecasts, and consequently their corresponding errors, are generated for the entire hierarchy, and they could be reported at any level if required. But to save space, we have reported only the top level (Total), the bottom level, and the levels corresponding to control areas and health boards. The latter are chosen because this is where decision-making takes place, so these forecasts are the most important.
Average Forecast Performance Calculated on the Test Sets at Forecast Horizons h = 43, . . . , 84 Days, With Time Series Cross-Validation Applied to Attended Incident Data. 8
Bold values show the lowest score for the given accuracy measure and the level of hierarchy.
Table 3 shows that forecast reconciliation (i.e., MinT) improves point forecast accuracy at the higher levels of the hierarchy, including total, control area, and health board. However, it does not result in accuracy improvement at the bottom-level series, for which base forecasts are more accurate. This might be due to the noisy structure of time series at the bottom level and possibly to very different patterns in the aggregated series. It is also clear from Table 3 that the ensemble method improves forecast accuracy at the total, control area, and health board. However, this does not remain valid for bottom series where different individual methods perform best, depending on the accuracy measure. While the forecast reconciliation approach aims to enhance forecast accuracy, its effectiveness is not guaranteed, especially if the bottom-level series exhibit excessive noise and lack systematic patterns. Despite this, reconciling forecasts at the bottom level can offer advantages by generating coherent forecasts that facilitate alignment in planning across various teams within an organization, promote better coordination, and prevent conflicting decisions. Moreover, even when dealing with noisy and irregular bottom-level series, reconciliation can still improve forecast accuracy at higher levels of the hierarchy by leveraging the information available across the hierarchy. Therefore, although the bottom-level forecasts may not be highly accurate on their own, reconciling them with higher-level forecasts can still provide a more consistent view of future demand and potentially yield more accurate forecasts at other levels.
Table 3 presents the accuracy of the forecast distribution measures by CRPS, which considers both forecasting reliability and interval sharpness. The smaller the value of CRPS, the better the comprehensive performance. We observe that forecast reconciliation results in forecast improvement at the total and health board levels. CRPS is almost identical at the control area and bottom levels. Base forecasts are slightly better at the control area level, while reconciliation is marginally more accurate than base at the bottom level. The ensemble method is also more accurate for higher levels, but ETS performs well at the bottom level. Table 3 also indicates that reconciliation using Mint generates accurate distributional forecasts. The marginal improvement in the average probabilistic forecast accuracy at the bottom level might be due to the reconciliation method giving improved forecast accuracy in the tails of the forecast distribution, which are critical for managing risks.
Overall, our results indicate that forecast reconciliation using the MinT method provides reliable forecasts and improves upon the base (unreconciled) forecasts at all levels except the bottom-level series. But even there, forecast reconciliation using MinT improves accuracy in the tails of the distribution.
In addition to the overall forecast accuracy presented in Table 3, we also report the point and probabilistic forecast accuracy measures for each forecast horizon in Figure 4. The figure focuses on the hierarchical levels important for decision-making, including total, control area, and health board; however, the accuracy could be calculated for any level. We only illustrate the results of the MinT method, given its strong performance described in Table 3. For illustration purposes, we report the average weekly forecast accuracy instead of the daily forecast horizon, as this reduces the visual noise in the figure. Thus, the x-axis shows horizons from week 1 ( Average accuracy by week for 12 weeks using MinT reconciliation.
5

Despite using Poisson regression models to create count distributions of attended incidents for the base forecasts, it is important to note that the reconciled forecast distributions do not maintain a count format. In practical scenarios, there might be a need to use integer forecasts. Count forecast reconciliation is an active area of research, and it would be interesting to explore how our approach could be adapted to generate count-reconciled probabilistic forecasts in future studies. Rounding the forecasts is one possible solution to this problem. However, the impact of rounding on forecast accuracy varies depending on the level of hierarchy and the scale of the data. In situations with high volume demand, the effects of rounding may be negligible, and forecast accuracy calculations can overlook integer effects. On the other hand, in low-volume demand settings, such as forecasts at the bottom level of the hierarchy, integer (rounding) effects may have a more noticeable influence on forecast accuracy.
An Illustration of Probabilistic Forecast for EMS Demand
In this section, we provide an illustrative example of a probabilistic forecast for future demand, based on the total attended incidents at the SB health board. Due to the complexity of including such plots for the entire hierarchy and and the 84 days ahead in the manuscript, only one example is presented here. However, it is feasible to generate these plots for the entire hierarchy and for any forecast horizon, if necessary.
In practice, point forecasts are commonly used, but they have limitations as they ignore the uncertainty associated with the forecast. The future is inherently characterized by an irreducible level of uncertainty. Being prepared entails considering alternate courses of action. Probabilistic forecasts offer an alternative approach to anticipate future demand. Rather than providing a single value, they assign likelihoods to all possible demand outcomes, acknowledging that different numbers of attended incidents are possible, but with varying likelihoods.
The purpose of probabilistic forecasting, as demonstrated in Figures 5 and 6, is to quantify uncertainty. Figure 5 depicts the forecast distribution of total incidents in one health board over a 7-day period. It also gives the point forecast as well as the 80% and 90% prediction intervals. Figure 6 zooms in on the first day to show the histogram more clearly, illustrating the range of possible outcomes and their likelihood. A graphical illustration of the forecast distribution of ambulance demand (i.e., total incidence attended) for the SB health board for a horizon of 7 days.
6
An illustrative example of the forecast distribution of ambulance demand (i.e., total incidence attended) for the SB health board for 1 day ahead.
7


Decisions based on these forecasts could focus on the tails of the distribution: unexpectedly high demand leading to crowding and inefficiency, or unexpectedly low demand resulting in wasted resources. Such forecasts are valuable tools for decision-makers and planners, especially when dealing with low-probability, high-cost situations. Different EMS managements may have varying risk attitudes depending on resource availability, making it crucial to consider the entire distribution when making decisions. For instance, these forecasts enable management to calculate the probability of demand exceeding a certain threshold of available resources (e.g., 90%), which can serve as an informative early warning measure for overcrowding.
It is important to note that while point forecasts and prediction intervals can be obtained from probabilistic forecasts, the reverse is not possible. A single number cannot be used to directly derive a probabilistic forecast. Prediction intervals, although helpful in indicating possible ranges, do not provide information on the probabilities of low or high demand.
In EMS planning, future demand is just one aspect to consider. Other inputs, such as capacity, should also be treated as probability distributions to adopt a probabilistic approach to planning. To extract valuable insights and make informed decisions from probabilistic forecasts, specialized numerical tools are required, as the forecasts themselves are typically represented as explicit probability density functions or Monte Carlo generators.
Conclusion
Forecasting problems at Emergency Medical Services often have inherent hierarchical and grouped structures. For example, looking at time series of arrival calls in a clinical desk service, Emergency Department admissions, verified incidents, or attended incidents in a country, they could be disaggregated by various attributes of interest. Total demand in the country could be disaggregated by region, then within each region by health board, within each health board, by station/hospital, and so on down to the postcode area. Alternative structures may arise when attributes of interest are crossed rather than nested. For example, the total demand could be disaggregated by priority (e.g., red, amber, and green) or by the nature of incidents. It is also natural to have a mixed structure; for example, the total demand could be disaggregated by priority and by health board.
Despite the inherent hierarchical structure of the forecasting problem in EMS, the common practice is to produce point forecasts for each time series independently. This practice may lead to a lack of coordination and possibly undesirable and conflicting outcomes. Furthermore, due to the asymmetric impact of resource allocation in this area, quantifying forecast uncertainty through probabilistic forecasts is also of value as it enables planners to manage associated risks. In this paper, we investigate the application of hierarchical forecasting methods for producing probabilistic forecasts of daily incidents attended up to 84 days ahead, using different forecasting methods.
Our results indicate that forecast reconciliation in EMS can not only contribute to a more coordinated approach to planning and decision-making through the production of coherent forecasts, but it can result in forecast accuracy improvements. Our proposed forecasting models, combined with reconciliation approaches, outperform the empirical distribution benchmark. We show that a substantial forecast improvement can be achieved at higher levels of aggregation by applying forecast reconciliation methods. When a point forecast is of interest at the bottom level of the series, we observe that reconciliation may not improve the forecast accuracy if the bottom series are noisy and lack systematic patterns. However, forecast reconciliation may result in more accurate forecast results for bottom series if we are interested in the tails of the forecast distribution rather than just center measures like mean (i.e., point forecast). Coherent forecasts are also crucial for informing planning activities, and we demonstrate that the proposed models produce coherent forecasts across all forecast horizons. Therefore, we recommend that forecast reconciliation approaches be adopted for routine use in EMS, whenever hierarchical and/or grouped time series data need to be forecasted. Moreover, we found that using an ensemble forecasting model, combining all the models developed in this paper instead of using each individually, works remarkably well for our mixed hierarchical and grouped structure.
Our research establishes a strong basis for future investigations and practical implementation in EMS. Leveraging the hierarchical and grouped structure of demand time series, EMS can use this advanced forecasting framework to generate coherent point and probabilistic forecasts, making the most of all available data at every level of the hierarchy. We acknowledge that a forecast serves a greater purpose beyond its mere existence, ideally enabling the best utility in terms of efficient allocation of medical services, response time, and cost, all informed by the forecast. While we fully appreciate the importance of evaluating forecast quality based on its impact on decision-making processes, it is essential to address the data requirements and methodology involved in measuring this impact. For a comprehensive assessment of the forecasts’ implications, access to additional data beyond ambulance demand, covering various decision types, capacity information, constraints in the decision system, and more, becomes necessary. This additional data would offer valuable insights into the specific decisions based on the forecasts, resulting in a more accurate evaluation of their impact on medical services. Furthermore, measuring the actual impact of forecasts would necessitate an approach that goes beyond forecasting itself. This would involve developing and implementing simulation models capable of replicating decision-making processes based on the forecast inputs. These simulation models would then evaluate the quality of the final decisions, taking into consideration the utilities that are particularly significant in the context of EMS.
Future research can build upon this study in several ways. In future investigations, we aim to explore this avenue by incorporating operational information, simulating decision processes, and assessing the decision impact of this framework on utilities that are significant to the EMS. Linking forecasts with their utilities (e.g., response time, allocation of medical services, resource utilization, cost, etc.) can offer an opportunity to maximize benefits through a more holistic planning approach. Additionally, in our study, we employed Poisson regression models to generate count distributions of attended incidents for the base forecasts. However, it is essential to note that the reconciled forecast distributions are not integer. This observation presents an interesting avenue for future research. Also, since the dataset used in this study only includes information on attended incidents, it would be valuable for future research to investigate the impact of failed responses on EMS forecasting if data on these incidents became available. It is also important to note that our methodology for hierarchical time series forecasting can be applied to any time series data in EMS, including those that may include failed responses.
Although our study primarily focuses on Emergency Medical Services, it is essential to emphasize that the framework we propose has broad applicability across various service industries (Ostrom et al. 2010). Our approach is particularly valuable in situations where time series data is structured hierarchically and/or grouped, a common characteristic found in many sectors. This occurs when data can be naturally organized into different levels of hierarchies or when dependencies and relationships exist among entities within the system. For instance, in supply chains (Shugan and Xie 2000), demand forecasting at different levels of the distribution network, such as regional warehouses or retail stores, is vital for efficient inventory management and minimizing stockouts. Our framework allows the reconciliation of forecasts, ensuring consistency, and alignment throughout the supply chain, leading to improved decision-making and operational efficiency. In the financial industry (Kimes and Chase 1998), where investments span multiple asset classes, geographical regions, or customer segments, our framework can be applied to forecast portfolio performance, asset allocation, or customer demand. Similarly, in transportation, the framework supports forecasting transportation demand at various levels, optimizing route planning and resource allocation. Likewise, in the hospitality and tourism industry (Dekimpe, Peers, and van Heerde 2016), it facilitates forecasting demand rates at state, regional, and departmental levels, enabling strategic pricing, capacity planning, and revenue management for hotels and other travel-related businesses. Additionally, in call centers, accurate call volume forecasting at different levels of the call center hierarchy or grouped structure is crucial for workforce management and resource allocation. Implementing our framework, call centers can generate accurate forecasts for different skill groups, shifts, and locations, ensuring efficient staffing and optimal service levels to meet customer demands.
Reproducibility
To enhance transparency and reproducibility, we not only provide data and the code, but also the entire paper that is written in R using Quarto. All materials to reproduce this paper is available at github.com/bahmanrostamitabar/forecasting-emergency-medicine.
The repository contains the raw data, all R scripts used in experiments, the results used in the paper, as well as the quarto files for producing this paper. Full instructions are provided in the repository.
Supplemental Material
Supplemental Material - Hierarchical Time Series Forecasting in Emergency Medical Services
Supplemental Material for Hierarchical Time Series Forecasting in Emergency Medical Services by Bahman Rostami-Tabar and Rob J. Hyndman in Journal of Service Research
Footnotes
Author Contributions
Declaration of Conflicting Interests
Funding
Supplemental Material
Notes
Author Biographies
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
