Introduction
Urban transportation is closely connected with overall urban planning and development, commuting efficiency, and quality of life. As a key strategy and component of a smart city, an intelligent transportation system can efficiently improve the running capacity of the whole system. Built on long-standing theory and methods for the prediction and analysis of traffic flow, data mining over a large-scale traffic network has important practical value, including urban traffic control and dynamic route guidance. 1 The prediction interval that traffic control and guidance require is usually less than or equal to 15 min, 2 which defines short-term traffic flow forecasting. Short-term traffic flow forecasting is a real-time, periodical, and non-linear prediction process.
In theory, traffic flow forecasting predicts the traffic flow at a future time point by extracting features from historical data. With the rapid development of storage technology and systematic data flow frameworks, traffic data have been recorded at a large scale for prediction. These advanced infrastructures benefit research on traffic flow prediction.
Considering the large scale of data now available, a balance should be maintained between high accuracy and training efficiency. 3 Shallow-structured models can achieve acceptable prediction performance and have been applied in most small-scale applications. 4 However, these shallow models have an obvious limitation: they cannot extract effective information from large datasets or fit non-linear objective functions. 3 Recently, deep learning networks and their hybrid structures have achieved remarkable breakthroughs in both theoretical research and real-world applications.5,6 In theoretical research, powerful networks are introduced continuously. Goodfellow et al. 7 proposed generative adversarial networks, which consist of a generative model and a discriminative model. Devlin et al. 8 designed bidirectional encoder representations from transformers for language understanding, which can pre-train deep bidirectional representations by combining contexts in all layers. However, traffic flow is relatively complicated because it contains a large amount of information in both the time and space dimensions. 9 Extracting information from both dimensions is definitely a big challenge.
To address these issues, this article introduces a parallel spatiotemporal (PST) network that combines a convolutional neural network (CNN) and long short-term memory (LSTM). The newly proposed model can be applied to large-scale traffic flow prediction and can fit non-linear objective functions. Concretely, the CNN layers extract spatial features and the LSTM layers capture temporal features. Since the two parts serve different functions, temporal and spatial information is learnt separately, and the outputs of both parts are concatenated to make the prediction.
The structure of this article is organized as follows: section “Related work” provides a literature review on the models used for traffic flow prediction; section “Methodology” describes the problem statement and the structure of a parallel spatiotemporal deep learning network (PST-DNN); section “Experiment” presents the experiments with the real traffic dataset and evaluation of the performance of the newly proposed model and other popular models; section “Conclusion and future work” provides the conclusion and directions for future research.
Related work
In this section, we provide a literature review on the background of the applied models for traffic flow forecasting and the combination of CNN and LSTM.
Traffic flow forecasting
In recent research, multiple traffic flow prediction models have been proposed, most using closely related data wrangling methods. These methods can be broadly divided into parametric and non-parametric learning models. Parametric models include linear and non-linear regression models, the exponential moving average model, and the Kalman filtering model. Xie et al. 10 first combined Kalman filtering and wavelet transformation to predict traffic flow. The method applied discrete wavelet decomposition to remove noise and used Kalman filtering to estimate multi-dimensional weights of historical traffic flow. When traffic flow fluctuates frequently, the model achieves a much better result than Kalman filtering alone, but short-term changes in the data do not satisfy the linearity assumption of Kalman filtering. Wang et al. 11 proposed a non-linear algorithm combining generalized least squares and stochastic user equilibrium to train the parameters of an origin–destination matrix. Van Hinsbergen et al. 12 proposed an advanced Kalman filter model based on partial extension. This model runs much faster than the global extension and can be applied to large-scale traffic flow prediction. Tchrakian et al. 13 employed a short-term traffic flow forecasting algorithm based on spectrum analysis. Pan et al. 14 predicted the transport state with temporal and spatial features and extended the cell transmission model. Although these models are easy to implement, they cannot handle abrupt changes under unstable situations. For example, Zhou 15 identified that the Kalman filter can produce abnormal predicted values with inertia, even when professional knowledge and abundant engineering experience are applied to the model.
Non-parametric learning models mainly include artificial neural networks and support vector regressors. Boto-Giralda et al. 16 studied wavelet-based denoising for traffic volume time series forecasting with self-organizing neural networks. Jeong et al. 17 proposed a supervised weighting-online learning algorithm for short-term traffic flow prediction. Lippi et al. 1 proposed two traffic flow forecasting methods based on support vector regression and the periodic features of traffic flow. These algorithms depend on the amount of original data, which often cannot meet the demands of real situations. To solve this problem, Huang et al.18,19 and Lv et al. 20 applied a deep belief network and stacked autoencoders (SAEs), respectively, to predict traffic flow. These two deep learning networks store the coefficients of traffic flow in the hidden layers to make the prediction. Du et al.21,22 proposed a hybrid multimodal deep learning network, which series-connected CNN and LSTM models over several features, including traffic flow, average speed, and density. Yang et al. 23 studied an optimized structure of the SAE model. However, a small amount of input data can lead the model toward multiple seemingly accurate feasible solutions, whereas the absence of certain transport states from the training set will reduce the accuracy of the prediction result. Although traffic flow data can be easily collected by different types of sensors, the size of the sliding window should not be too long. For a small traffic flow training set, unsupervised pre-training is beneficial to avoid overfitting. 24 Recent research identified that most ridge points under zero gradient are located on the hyperplane of a deep learning network, 25 and most gradient orientations around these points are upward while only a few are downward. 26 Consequently, when a shallow neural network is used to predict traffic flow, once the calculation falls into a local optimum, the prediction model will be nullified. Due to the diversity and randomness of traffic flow, transport states that do not appear in the training set cannot be predicted by an artificial neural network. The common solution to these problems is to increase the scale of the deep learning network, both in depth and in width.27,28 Some recent research increases the depth of the network and uses a large amount of input data to train the model, but the calculation cost increases simultaneously.29–31 Therefore, building an efficient deep learning network with high performance is challenging.
Combination of CNN and LSTM
CNN is widely used in computer vision and image recognition to process image information, where it achieves great results. Compared with traditional methods of extracting spatial features, CNN has several properties: first, the features are related only to nearby values rather than global ones; second, the introduced pooling layers tremendously improve running efficiency without erasing many features. In effect, the locally connected weights enable the network to resolve spatial information.32,33 The structure of LSTM can solve sequence-to-sequence problems. 34 The idea is to use one LSTM unit to accept the input sequence, one timestamp at a time, and train a large fixed-dimensional vector representation, whose output weights then flow to another LSTM unit that produces the output sequence. The ability to learn features with long-range temporal dependencies makes it a popular choice for time series problems, given the considerable time lag between the input sequence and the corresponding outputs.
Since spatiotemporal features can be extracted by CNN and LSTM separately, the combination of both is popular in several research fields. For example, the combined structure has been applied to supervised sentiment classification, 35 air quality forecasting,21,22 short text classification, 36 and three-dimensional (3D) object classification. 37 Traffic flow is similar in data format to text and audio and can be treated as a time series problem. Thus, the geographic information of traffic flow corresponds to spatial information, and the change of traffic flow over time corresponds to temporal information. Inspired by applications of the combined structure, a newly proposed structure for traffic flow prediction is introduced in this study.
Methodology
Problem statement
Traffic prediction estimates future traffic flow from a series of historical observations. More concretely, based on previous data
where
The prediction model
In this section, we describe our deep learning network; see Figure 1 for a graphical illustration of the proposed model. The temporal features are learnt by two layers of LSTM, and the spatial features are captured by five layers of a one-dimensional convolutional neural network (1D CNN). The parallel structure is achieved by processing the time and space dimensions separately, concatenating all features, and passing the result into a linear regression layer. The whole network is designed as an end-to-end framework, and the following parts describe each component.

The overview of the proposed model.
Spatial feature modeling
Beyond the traffic flow of one sensor over time, the data of adjacent sensors can be important for improving prediction accuracy, because the correlation between detectors at nearby locations is strong. 38
Thus, we choose the CNN structure as a key layer of our proposed model. In this case, for the traffic flow dataset shown in Figure 1, a 1D CNN is deployed to extract spatial information. Specifically, the 1D CNN is applied to the vector
where
A fully convolutional network is used in the model, so pooling layers are not applied after each 1D CNN layer; fully convolutional networks are widely used for small-size image recognition, 39 and the spatial information needed for prediction here is similarly limited. Considering that overfitting happens more easily with a larger filter size, the filter size of the 1D CNN is set at 3. In a deep learning framework, batch normalization normalizes the output of each hidden layer; it reduces the oscillations of gradient descent when approaching the minimum point and speeds up convergence. Thus, one batch normalization layer is added after each 1D CNN layer. Also, since rectified linear activation units (ReLU) can avoid the vanishing gradient problem, a ReLU layer is added after each batch normalization layer.
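One spatial block as described above (Conv1D followed by batch normalization and ReLU) can be sketched with the Keras functional API, which the paper's experiments are built on. The input layout of (window, features) and "same" padding are assumptions for illustration; the filter size of 3 follows this section.

```python
from tensorflow.keras import Input, Model, layers

# One Conv1D -> BatchNorm -> ReLU block; sizes are illustrative.
x_in = Input(shape=(288, 5))  # assumed layout: (window size, neighboring sensors)
x = layers.Conv1D(filters=3, kernel_size=3, padding="same")(x_in)
x = layers.BatchNormalization()(x)  # normalize each hidden layer's output
x = layers.ReLU()(x)                # non-saturating activation
block = Model(x_in, x)
```

In the full model, five such blocks are stacked before global average pooling, as described in the section "Implementation details of the model structure."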
Beyond one-way highway networks, there are more complex road networks, for which a 1D CNN cannot comprehensively capture spatial information. For such networks, the application of CNN to graph-structured data introduced by Henaff et al. 40 could be an appropriate solution.
Temporal feature modeling
LSTM is a variant of the recurrent neural network (RNN) algorithm, first proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997. Before introducing LSTM, it is necessary to have a preliminary understanding of how RNN works.
RNN is a variant of the traditional feedforward neural network in which sequences of different lengths are permitted as input to the model. With a sequence of inputs
where
where
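Since the recurrence equations are elided above, a single vanilla RNN step is sketched below in NumPy using generic notation (the paper's exact symbols are not given): the hidden state is updated from the current input and the previous hidden state through a tanh non-linearity.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: h_t = tanh(x_t @ W_xh + h_prev @ W_hh + b_h)."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
x_t = rng.standard_normal(4)          # input at one timestamp (4 features)
h_prev = np.zeros(8)                  # previous hidden state (8 units)
W_xh = rng.standard_normal((4, 8))    # input-to-hidden weights
W_hh = rng.standard_normal((8, 8))    # hidden-to-hidden weights
h_t = rnn_step(x_t, h_prev, W_xh, W_hh, np.zeros(8))
```

LSTM extends this step with input, forget, and output gates around a cell state, which is what allows it to retain long-range temporal dependencies.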
Experiment
In this section, real Shanghai highway traffic flow datasets are used to evaluate the performance of the proposed PST-DNN method. We also compare the robustness of the proposed model with several state-of-the-art forecasting methods. Table 1 shows the description of the environment for all experiments.
Description of environment for all experiments.
Data description and setup
Shanghai is one of the most populous cities in the world, with high population density and busy transportation. The traffic flow data used in this study come from part of the Shanghai inner ring elevated road. As shown in Figure 2, the route marked in blue is the track of the inner ring. There are 770 one-way sensors located along the inner ring.

Traffic flow location applied in the experiment.
The records between 01 January 2011 and 30 June 2011 are used in this study. As suggested by Fusco and Gori, 43 a detection interval within 5 min is recommended for traffic flow prediction when the transportation network is overcrowded day and night. Table 2 shows the description of the experimental datasets. For each sensor, since 12 values are theoretically recorded per hour, sensors that lose more than 12 values continuously are defined as invalid in this study. Besides, null values are filled in from the preceding value at that time point.
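The cleaning rule above can be sketched with pandas; this is a hedged illustration of the described procedure, not the authors' code, and the helper name is hypothetical.

```python
import pandas as pd

def clean_sensor(series: pd.Series, max_gap: int = 12):
    """Forward-fill nulls from the preceding value; return None (invalid
    sensor) if any run of consecutive nulls exceeds max_gap records."""
    is_null = series.isna()
    # Length of each consecutive run of nulls: group nulls by the count of
    # preceding non-null records, then sum the nulls inside each group.
    runs = is_null.groupby((~is_null).cumsum()).sum()
    if (runs > max_gap).any():
        return None
    return series.ffill()
```

For example, a sensor missing two consecutive records is kept and forward-filled, while one exceeding the gap threshold is discarded.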
Description of experimental datasets.
Figure 3 shows the graphical illustration of the distribution plot for the average value, minimum value, maximum value, and standard deviation of traffic flow, respectively, among 591 valid sensors. Figure 4 shows the average traffic flow at 288 time points each day under the 25th, 50th, and 75th percentiles.

The distribution plot for the average value, minimum value, maximum value, and standard deviation.

The average traffic flow at 288 time points each day under 25th, 50th, and 75th percentiles.
Evaluation metrics
Generally, the mean relative error (MRE) is an appropriate evaluation metric for model comparison. However, when traffic volumes are numerically small, MRE loses efficacy because the relative error is inflated by small denominators. To complement it, this article also introduces the root mean square error (RMSE) and the mean absolute error (MAE). The MRE, RMSE, and MAE values are defined as follows
where
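Since the metric equations are elided above, the sketch below assumes the standard forms: MRE = mean(|ŷ − y| / y), RMSE = sqrt(mean((ŷ − y)²)), and MAE = mean(|ŷ − y|).

```python
import numpy as np

def mre(y_true, y_pred):
    """Mean relative error: average of |error| / true value."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true) / y_true))

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true)))
```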
Model parameter design
Grid search for input and model structure
In the proposed model, the key input parameter is the window size and the key structural parameter is the number of CNN layers, as these may greatly affect prediction precision. Other hyperparameters of PST-DNN include the number of training epochs, the batch size, and the optimizer. In the experiment, we set the number of training epochs to 100, the batch size to 16, and the optimizer to Adam. To find the best values for the key parameters, we use grid search and compare the evaluation metrics.
In our study, we set the window size to 144, 288, 432, or 576, where 144 is the number of records in 12 h. The number of CNN layers is set at 4, 5, or 6 under the same structure: 1D CNN with three output filters and a kernel size of 2 as the length of the 1D convolution window. Specifically, 60 of the 591 valid sensors are randomly selected to make the parameter tuning experiment statistically meaningful. As shown in Figure 5, the vertical axis represents combinations of the parameters, where the first value in brackets is the window size and the second is the number of CNN layers. Comparing the three metrics introduced in section “Evaluation metrics” together with the training time, (288, 5) is the optimal combination. When the number of CNN layers increases, the training time also increases, mainly because additional CNN layers produce more parameters for calculation and increase the model complexity. Therefore, we set the window size to 288 and the number of CNN layers to 5 in all the experiments.
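The grid search loop can be sketched as follows. The `evaluate` helper is a hypothetical placeholder standing in for "train PST-DNN and return its validation MRE"; its dummy score is chosen only so the sketch runs, and does not reproduce the paper's real measurements.

```python
from itertools import product

window_sizes = [144, 288, 432, 576]
cnn_depths = [4, 5, 6]

def evaluate(window, depth):
    # Placeholder score: in practice, train the model with these
    # hyperparameters and return the validation MRE.
    return abs(window - 288) / 288 + abs(depth - 5)

best = None
for window, depth in product(window_sizes, cnn_depths):
    score = evaluate(window, depth)
    if best is None or score < best[0]:
        best = (score, window, depth)
```

With the real evaluation in place, `best` holds the lowest-error combination over all 12 candidates.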

Average evaluation metrics (MRE, RMSE, MAE, training time) of 60 randomly selected sensors with grid search.
Train–test splitting
For the traffic flow of each valid sensor, the data are randomly shuffled in advance. After that, 70% of the data are selected as the training set, and the remaining 30% as the testing set. Since the record length of each sensor is approximately 52,126, the number of samples before splitting is about 51,838, calculated as the record length minus the optimal window size. A 70–30 split is appropriate at this data size. In comparison, if over 1 million samples were available, a 70–30 split would be less suitable because extra bias may exist in that 30% of data; in that situation, a 90–10 split is recommended.
Implementation details of the model structure
In total, 591 valid sensors are used to evaluate the performance of different time series models. For each sensor, models will be evaluated according to MRE, RMSE, and MAE.
The dimension of the input data is (number of samples × number of features × window size). The number of samples depends on the train–test splitting and the total records of different sensors. The number of features is set at 5, meaning that two sensors on the left-hand side of the predicted sensor and two sensors on its right-hand side are selected. This is because the symmetric locations provide equally important geographical information when trained by the CNN layers. The window size is set to 288 as concluded from the grid search. After model training, the dimension of the output data for PST-DNN is (number of samples × 1).
The PST-DNN presented in the experiment consists of five layers of 1D CNN and two layers of LSTM. All 1D CNN layers have three output filters and a temporal convolution kernel size of 2. Besides, one batch normalization layer is used after each CNN layer to minimize the negative effect of internal covariate shift, and one ReLU layer is added after each batch normalization layer to ensure non-negative activations. After the last ReLU layer, global average pooling is used to replace the traditional fully connected layers of the CNN. Meanwhile, each LSTM layer has 128 hidden units. Finally, the outputs from both the CNN and the LSTM are concatenated and fitted with a linear regression layer.
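The architecture just described can be sketched with the Keras functional API (the paper states its neural models are built on Keras, but does not give code; the (window, features) input layout and layer arrangement here are assumptions).

```python
from tensorflow.keras import Input, Model, layers

WINDOW, FEATURES = 288, 5  # window size and number of neighboring sensors

inp = Input(shape=(WINDOW, FEATURES))

# Spatial branch: five 1D CNN blocks (Conv -> BatchNorm -> ReLU),
# then global average pooling instead of fully connected layers.
x = inp
for _ in range(5):
    x = layers.Conv1D(filters=3, kernel_size=2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
x = layers.GlobalAveragePooling1D()(x)

# Temporal branch: two stacked LSTM layers with 128 hidden units each.
y = layers.LSTM(128, return_sequences=True)(inp)
y = layers.LSTM(128)(y)

# Concatenate both branches and fit a linear regression layer.
out = layers.Dense(1, activation="linear")(layers.Concatenate()([x, y]))
model = Model(inp, out)
```

Because the two branches both read the raw input and only meet at the concatenation, the spatial and temporal features are learnt in parallel rather than in series.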
Experimental results and comparison
The proposed PST-DNN is compared with several state-of-the-art methods: LSTM, SAE, the lasso method, and the Prophet method.
To guarantee a fair comparison, the parameters of these models are chosen from recommended values in the related literature: for LSTM, one layer with 128 hidden units is used; for SAE, a three-layer neural network is applied with 300 hidden neurons per layer and a sigmoid activation. In the lasso method, the weight of the L1 norm is set at 0.01. In the Prophet method, the width of the uncertainty intervals is set at 0.8. All neural network methods are built on the Keras library, the lasso method is trained with the scikit-learn library, and the Prophet method is implemented with the Prophet library. The train–test split of the dataset is 70/30. Also, the traffic flows are normalized to the range between 0 and 1 before being input to all the models. Besides, to improve calculation efficiency while training in parallel, the 591 valid sensors are divided into six groups, with all sensors in each group located contiguously along the roadside. Thus, five groups have 98 sensors and the sixth group has 101 sensors.
Table 3 presents the median evaluation indexes of the 591 valid sensors for different models. Since the median can ignore bias caused by outliers, it is more appropriate for reflecting population performance; thus, other descriptive values, such as the mean and the first/third quartiles, are not displayed. Training time is also not provided, because neural network methods always take longer to train, and comparing training time between deep and non-deep networks is not meaningful. It is evident that the overall performance of PST-DNN is better than that of the other models, although for the second group the RMSE of PST-DNN is higher than that of SAE by about 0.035 and its MAE is higher by about 0.348. PST-DNN achieves the lowest MRE among the models, outperforming the second-best model by 2.174, 0.025, 0.086, 0.023, 0.088, and 0.003 for the six groups, respectively. The experimental results reflect that the introduced model can effectively learn spatial–temporal features from the input data. Intuitively, lasso is a simple linear prediction method that cannot capture the non-linear features of traffic flow, so its overall performance is poorer than that of the other models. Figure 6 shows the boxplot comparison for the RMSE of the five models in group 1.
Median MRE, RMSE, and MAE of different models for traffic flow forecasting.
MRE: mean relative error; RMSE: root mean square error; MAE: mean absolute error; LSTM: long short-term memory; SAE: stacked autoencoder; PST-DNN: parallel spatiotemporal deep learning network.

Boxplot for RMSE of five models in group 1.
MRE, RMSE, and MAE evaluate the error between predicted and true values. The forecasting accuracies of the spatial and temporal distributions are equally important when the prediction focuses on spatial and temporal information. Thus, the average correlation (AC) can be used to evaluate the performance of distribution forecasting for this information. 46 AC for spatial and temporal information is, respectively, defined as follows
where
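As the AC equations are elided above, the sketch below assumes AC is the Pearson correlation between predicted and true series, averaged over sensors (temporal AC) or over time points (spatial AC); this interpretation follows the common usage of average correlation and should be checked against reference 46.

```python
import numpy as np

def average_correlation(y_true, y_pred, axis=1):
    """Pearson correlation along `axis`, averaged over the other axis.
    Inputs are (sensors x time points) matrices.
    axis=1: correlate each sensor's time series (temporal AC);
    axis=0: correlate across sensors at each time point (spatial AC)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    if axis == 0:
        y_true, y_pred = y_true.T, y_pred.T
    corrs = [np.corrcoef(t, p)[0, 1] for t, p in zip(y_true, y_pred)]
    return float(np.mean(corrs))
```

A perfect forecast gives an AC of 1 in both dimensions; values near 0 indicate that the predicted distribution does not track the true one.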
Average AC for spatial and temporal information of different models.
LSTM: long short-term memory; SAE: stacked autoencoder; PST-DNN: parallel spatiotemporal deep learning network.
Figure 7 shows a partial visualization of one sensor’s prediction results from these models; PST-DNN yields more accurate predicted values in both the space and time dimensions.

Partial visualization of prediction values for sensor NHNX20: (a) prediction values of one sensor marked as NHNX20 from 7:00 am to 10:00 am on 30 June 2011; (b) prediction values of 19 continuous sensors at 7:00 am on 30 June 2011, where #1 location is the sensor marked as NHNX20.
Conclusion and future work
The contributions of the article can be summarized as follows:
The spatiotemporal features of the traffic network are considered simultaneously by a parallel-connected neural network combining CNN and LSTM for the traffic flow forecasting problem;
Models are evaluated on a large-scale road network with a half-year dataset, and the newly proposed model shows better prediction performance based on spatiotemporal features compared with the other time series models.
The proposed PST-DNN combines CNN and LSTM in a parallel structure and achieves encouraging performance compared with the other models when applied in a real environment. Since most existing research emphasizes model performance alone, statistical significance is often uncertain. To make the experiment scientific, the large network is divided into groups for training and evaluation, which suggests a rigorous design for experiments involving model comparison.
However, despite the high accuracy, deep learning applications in traffic flow forecasting still need more research. Maintaining a balance between calculation resource usage and prediction accuracy is always a challenge. Therefore, advanced deep learning architectures for large-scale traffic flow forecasting are worth studying in depth.
