Introduction
Urban transportation is closely connected with overall urban planning and development, commuting efficiency, and quality of life. As a key strategy and component of a smart city, an intelligent transportation system can efficiently improve the running capacity of the whole system. Built on long-standing theory and methods for the prediction and analysis of traffic flow, data mining over a large-scale traffic network has important practical value, including urban traffic control and dynamic route guidance. 1 The prediction interval that traffic control and guidance require is usually less than or equal to 15 min, 2 which defines short-term traffic flow forecasting. Short-term traffic flow forecasting is a real-time, periodical, and non-linear prediction process.
In theory, traffic flow forecasting predicts the traffic flow at a future time point by extracting features from historical data. With the rapid development of storage technology and systematic data flow frameworks, traffic data have been recorded at a large scale for prediction. These advanced infrastructures benefit research on traffic flow prediction.
Considering the large scale of data now available, a balance should be maintained between high accuracy and training efficiency. 3 Shallow-structured models can achieve acceptable prediction performance and have been applied in most small-scale applications. 4 However, these shallow models have an obvious limitation: they cannot extract effective information from large datasets or fit non-linear objective functions. 3 Recently, deep learning networks and their hybrid structures have achieved remarkable breakthroughs in both theoretical research and real-world applications.5,6 In theoretical research, powerful networks are introduced continuously. Goodfellow et al. 7 proposed generative adversarial networks, which consist of a generative model and a discriminative model. Devlin et al. 8 designed bidirectional encoder representations from transformers for language understanding, which can pre-train deep bidirectional representations by combining contexts in all layers. However, traffic flow is relatively complicated because it contains a large amount of information in both the time and space dimensions. 9 Extracting information from both dimensions is definitely a big challenge.
To address these issues, this article introduces a parallel spatiotemporal (PST) network that combines a convolutional neural network (CNN) and long short-term memory (LSTM). The newly proposed model can be applied to large-scale traffic flow prediction and can fit non-linear objective functions. Concretely, the CNN layers extract spatial features and the LSTM layers capture temporal features. Since the two parts serve different functions, temporal and spatial information is learnt separately, and the outputs of both parts are concatenated to make the prediction.
The structure of this article is organized as follows: section “Related work” provides a literature review on the models used for traffic flow prediction; section “Methodology” describes the problem statement and the structure of a parallel spatiotemporal deep learning network (PST-DNN); section “Experiment” presents the experiments with the real traffic dataset and evaluation of the performance of the newly proposed model and other popular models; section “Conclusion and future work” provides the conclusion and directions for future research.
Related work
In this section, we provide a literature review on the background of the applied models for traffic flow forecasting and the combination of CNN and LSTM.
Traffic flow forecasting
In recent research, multiple traffic flow prediction models have been proposed, most using closely related data wrangling methods. These methods can be broadly divided into parametric and non-parametric learning models. Parametric models include linear and non-linear regression models, the exponential moving average model, and the Kalman filtering model. Xie et al. 10 first combined Kalman filtering and wavelet transformation to predict traffic flow. The method applied discrete wavelet decomposition to remove noise and used Kalman filtering to estimate multi-dimensional weights of historical traffic flow. When traffic flow fluctuates frequently, the model achieves a much better result than Kalman filtering alone, but short-term changes in the data do not satisfy the linearity assumption of Kalman filtering. Wang et al. 11 proposed a non-linear algorithm combining generalized least squares and stochastic user equilibrium to train the parameters of an origin–destination matrix. Van Hinsbergen et al. 12 proposed an advanced Kalman filter model based on partial extension. This model runs much faster than the global extension and can be applied to large-scale traffic flow prediction. Tchrakian et al. 13 employed a short-term traffic flow forecasting algorithm based on spectrum analysis. Pan et al. 14 predicted the transport state with temporal and spatial features and extended the cell transmission model. Although these models are easy to implement, they cannot handle abrupt changes under unstable situations. For example, Zhou 15 identified that the Kalman filter can produce abnormal predicted values with inertia, even when professional knowledge and abundant engineering experience are applied to the model.
Non-parametric learning models mainly include artificial neural networks and support vector regressors. Boto-Giralda et al. 16 studied wavelet-based denoising for traffic volume time series forecasting with self-organizing neural networks. Jeong et al. 17 proposed a supervised weighting-online learning algorithm for short-term traffic flow prediction. Lippi et al. 1 proposed two traffic flow forecasting methods based on support vector regression and the periodic features of traffic flow. These algorithms depend on the amount of original data, which often cannot meet the demands of real situations. To solve this problem, Huang et al.18,19 and Lv et al. 20 applied a deep belief network and stacked autoencoders (SAEs), respectively, to predict traffic flow. These two deep learning networks store the coefficients of traffic flow in the hidden layers to make the prediction. Du et al.21,22 proposed a hybrid multimodal deep learning network, which series-connected CNN and LSTM models over several features, including traffic flow, average speed, and density. Yang et al. 23 studied an optimized structure of the SAE model. However, a small amount of input data can lead the model toward multiple seemingly accurate feasible solutions, whereas the absence of certain transport states from the training set will reduce the accuracy of the prediction result. Although traffic flow data can be easily collected by different types of sensors, the size of the sliding window should not be too long. For a small traffic flow training set, unsupervised pre-training is beneficial to avoid overfitting. 24 Recent research identified that most ridge points under zero gradient are located on the hyperplane of a deep learning network, 25 and most gradient orientations around these points are upward while only a few are downward. 26 Consequently, when a shallow neural network is used to predict traffic flow, once the calculation falls into a local optimum, the prediction model will be nullified. Due to the diversity and randomness of traffic flow, transport states that do not appear in the training set cannot be predicted by an artificial neural network. The common solution to these problems is to increase the scale of the deep learning network, both in depth and in width.27,28 Some recent research increases the depth of the network and uses a large amount of input data to train the model, but the calculation cost increases simultaneously.29–31 Therefore, building an efficient deep learning network with high performance is challenging.
Combination of CNN and LSTM
CNN is widely used in computer vision and image recognition to process image information, where it achieves great results. Compared with traditional methods of extracting spatial features, CNN has several properties: first, the features are related only to nearby values rather than global ones; second, the introduced pooling layers tremendously improve running efficiency without erasing many features. In effect, the locally connected weights enable the network to resolve spatial information.32,33 The structure of LSTM can solve sequence-to-sequence problems. 34 The idea is to use one LSTM unit to accept the input sequence, one timestamp at a time, and train a large fixed-dimensional vector representation, whose output weights then flow to another LSTM unit that produces the output sequence. The ability to learn features with long-range temporal dependencies makes it a popular choice for time series problems, given the considerable time lag between the input sequence and the corresponding outputs.
Since spatiotemporal features can be extracted by CNN and LSTM separately, the combination of both is popular in several research fields. For example, the combined structure has been applied to supervised sentiment classification, 35 air quality forecasting,21,22 short text classification, 36 and three-dimensional (3D) object classification. 37 Traffic flow is similar in data format to text and audio and can be treated as a time series problem. Thus, the geographic information of traffic flow corresponds to spatial information, and the change of traffic flow over time corresponds to temporal information. Inspired by applications of the combined structure, a newly proposed structure for traffic flow prediction is introduced in this study.
Methodology
Problem statement
Traffic prediction estimates future traffic flow from a series of historical observations. More concretely, based on previous data
where
The prediction model
In this section, we describe our deep learning network; see Figure 1 for a graphical illustration of the proposed model. The temporal features are learnt by two layers of LSTM, and the spatial features are captured by five layers of a one-dimensional convolutional neural network (1D CNN). The parallel structure is achieved by processing the time and space dimensions separately, concatenating all features, and passing the result into a linear regression layer. The whole network is designed as an end-to-end framework, and the following parts describe each component.

The overview of the proposed model.
Spatial feature modeling
Beyond the traffic flow of one sensor over time, the data of adjacent sensors can be important for improving prediction accuracy, because the correlation between detectors at nearby locations is strong. 38
Thus, we choose the CNN structure as a key layer of our proposed model. In this case, for the traffic flow dataset shown in Figure 1, a 1D CNN is deployed to extract spatial information. Specifically, the 1D CNN is applied to the vector
where
A fully convolutional network is used in the model, so pooling layers are not applied after each 1D CNN layer; fully convolutional networks are widely used for small-size image recognition, 39 and the spatial information needed for prediction here is similarly limited. Considering that overfitting happens more easily with a larger filter size, the filter size of the 1D CNN is set at 3. In a deep learning framework, batch normalization normalizes the output of each hidden layer; it reduces the oscillations of gradient descent when approaching the minimum point and speeds up convergence. Thus, one batch normalization layer is added after each 1D CNN layer. Also, since rectified linear activation units (ReLU) can avoid the vanishing gradient problem, a ReLU layer is added after each batch normalization layer.
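One spatial block as described above (Conv1D followed by batch normalization and ReLU) can be sketched with the Keras functional API, which the paper's experiments are built on. The input layout of (window, features) and "same" padding are assumptions for illustration; the filter size of 3 follows this section.

```python
from tensorflow.keras import Input, Model, layers

# One Conv1D -> BatchNorm -> ReLU block; sizes are illustrative.
x_in = Input(shape=(288, 5))  # assumed layout: (window size, neighboring sensors)
x = layers.Conv1D(filters=3, kernel_size=3, padding="same")(x_in)
x = layers.BatchNormalization()(x)  # normalize each hidden layer's output
x = layers.ReLU()(x)                # non-saturating activation
block = Model(x_in, x)
```

In the full model, five such blocks are stacked before global average pooling, as described in the section "Implementation details of the model structure."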
Beyond one-way highway networks, there are more complex road networks, for which a 1D CNN cannot comprehensively capture spatial information. For such networks, the application of CNN to graph-structured data introduced by Henaff et al. 40 could be an appropriate solution.
Temporal feature modeling
LSTM is a variant of the recurrent neural network (RNN) algorithm, first proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997. Before introducing LSTM, it is necessary to have a preliminary understanding of how RNN works.
RNN is a variant of the traditional feedforward neural network in which sequences of different lengths are permitted as input to the model. With a sequence of inputs
where
where
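Since the recurrence equations are elided above, a single vanilla RNN step is sketched below in NumPy using generic notation (the paper's exact symbols are not given): the hidden state is updated from the current input and the previous hidden state through a tanh non-linearity.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: h_t = tanh(x_t @ W_xh + h_prev @ W_hh + b_h)."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
x_t = rng.standard_normal(4)          # input at one timestamp (4 features)
h_prev = np.zeros(8)                  # previous hidden state (8 units)
W_xh = rng.standard_normal((4, 8))    # input-to-hidden weights
W_hh = rng.standard_normal((8, 8))    # hidden-to-hidden weights
h_t = rnn_step(x_t, h_prev, W_xh, W_hh, np.zeros(8))
```

LSTM extends this step with input, forget, and output gates around a cell state, which is what allows it to retain long-range temporal dependencies.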
Experiment
In this section, real Shanghai highway traffic flow datasets are used to evaluate the performance of the proposed PST-DNN method. We also compare the robustness of the proposed model with several state-of-the-art forecasting methods. Table 1 shows the description of the environment for all experiments.
Description of environment for all experiments.
Data description and setup
Shanghai is one of the most populous cities in the world, with high population density and busy transportation. The traffic flow data used in this study come from part of the Shanghai inner ring elevated road. As shown in Figure 2, the route marked in blue is the track of the inner ring. There are 770 one-way sensors located along the inner ring.

Traffic flow location applied in the experiment.
The records between 01 January 2011 and 30 June 2011 are used in this study. As suggested by Fusco and Gori, 43 a detection interval within 5 min is recommended for traffic flow prediction when the transportation network is overcrowded day and night. Table 2 shows the description of the experimental datasets. For each sensor, since 12 values are theoretically recorded per hour, sensors that lose more than 12 values continuously are defined as invalid in this study. Besides, null values are filled in from the preceding value at that time point.
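The cleaning rule above can be sketched with pandas; this is a hedged illustration of the described procedure, not the authors' code, and the helper name is hypothetical.

```python
import pandas as pd

def clean_sensor(series: pd.Series, max_gap: int = 12):
    """Forward-fill nulls from the preceding value; return None (invalid
    sensor) if any run of consecutive nulls exceeds max_gap records."""
    is_null = series.isna()
    # Length of each consecutive run of nulls: group nulls by the count of
    # preceding non-null records, then sum the nulls inside each group.
    runs = is_null.groupby((~is_null).cumsum()).sum()
    if (runs > max_gap).any():
        return None
    return series.ffill()
```

For example, a sensor missing two consecutive records is kept and forward-filled, while one exceeding the gap threshold is discarded.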
Description of experimental datasets.
Figure 3 shows the graphical illustration of the distribution plot for the average value, minimum value, maximum value, and standard deviation of traffic flow, respectively, among 591 valid sensors. Figure 4 shows the average traffic flow at 288 time points each day under the 25th, 50th, and 75th percentiles.

The distribution plot for the average value, minimum value, maximum value, and standard deviation.

The average traffic flow at 288 time points each day under 25th, 50th, and 75th percentiles.
Evaluation metrics
Generally, the mean relative error (MRE) is an appropriate evaluation metric for model comparison. However, when traffic volumes are numerically small, MRE loses efficacy because the relative error is inflated by small denominators. To complement it, this article also introduces the root mean square error (RMSE) and the mean absolute error (MAE). The MRE, RMSE, and MAE values are defined as follows
where
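Since the metric equations are elided above, the sketch below assumes the standard forms: MRE = mean(|ŷ − y| / y), RMSE = sqrt(mean((ŷ − y)²)), and MAE = mean(|ŷ − y|).

```python
import numpy as np

def mre(y_true, y_pred):
    """Mean relative error: average of |error| / true value."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true) / y_true))

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true)))
```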
Model parameter design
Grid search for input and model structure
In the proposed model, the key input parameter is the window size and the key structural parameter is the number of CNN layers, as these may greatly affect prediction precision. Other hyperparameters of PST-DNN include the number of training epochs, the batch size, and the optimizer. In the experiment, we set the number of training epochs to 100, the batch size to 16, and the optimizer to Adam. To find the best values for the key parameters, we use grid search and compare the evaluation metrics.
In our study, we set the window size to 144, 288, 432, or 576, where 144 is the number of records in 12 h. The number of CNN layers is set at 4, 5, or 6 under the same structure: 1D CNN with three output filters and a kernel size of 2 as the length of the 1D convolution window. Specifically, 60 of the 591 valid sensors are randomly selected to make the parameter tuning experiment statistically meaningful. As shown in Figure 5, the vertical axis represents combinations of the parameters, where the first value in brackets is the window size and the second is the number of CNN layers. Comparing the three metrics introduced in section “Evaluation metrics” together with the training time, (288, 5) is the optimal combination. When the number of CNN layers increases, the training time also increases, mainly because additional CNN layers produce more parameters for calculation and increase the model complexity. Therefore, we set the window size to 288 and the number of CNN layers to 5 in all the experiments.
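The grid search loop can be sketched as follows. The `evaluate` helper is a hypothetical placeholder standing in for "train PST-DNN and return its validation MRE"; its dummy score is chosen only so the sketch runs, and does not reproduce the paper's real measurements.

```python
from itertools import product

window_sizes = [144, 288, 432, 576]
cnn_depths = [4, 5, 6]

def evaluate(window, depth):
    # Placeholder score: in practice, train the model with these
    # hyperparameters and return the validation MRE.
    return abs(window - 288) / 288 + abs(depth - 5)

best = None
for window, depth in product(window_sizes, cnn_depths):
    score = evaluate(window, depth)
    if best is None or score < best[0]:
        best = (score, window, depth)
```

With the real evaluation in place, `best` holds the lowest-error combination over all 12 candidates.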

Average evaluation metrics (MRE, RMSE, MAE, training time) of 60 randomly selected sensors with grid search.
Train–test splitting
For the traffic flow of each valid sensor, the data are randomly shuffled in advance. After that, 70% of the data are selected as the training set, and the remaining 30% as the testing set. Since the record length of each sensor is approximately 52,126, the number of samples before splitting is about 51,838, calculated as the record length minus the optimal window size. A 70–30 split is appropriate at this data size. In comparison, if over 1 million samples were available, a 70–30 split would be less suitable because extra bias may exist in that 30% of data; in that situation, a 90–10 split is recommended.
Implementation details of the model structure
In total, 591 valid sensors are used to evaluate the performance of different time series models. For each sensor, models will be evaluated according to MRE, RMSE, and MAE.
The dimension of the input data is (number of samples × number of features × window size). The number of samples depends on the train–test splitting and the total records of different sensors. The number of features is set at 5, meaning that two sensors on the left-hand side of the predicted sensor and two sensors on its right-hand side are selected. This is because the symmetric locations provide equally important geographical information when trained by the CNN layers. The window size is set to 288 as concluded from the grid search. After model training, the dimension of the output data for PST-DNN is (number of samples × 1).
The PST-DNN presented in the experiment consists of five layers of 1D CNN and two layers of LSTM. All 1D CNN layers have three output filters and a temporal convolution kernel size of 2. Besides, one batch normalization layer is used after each CNN layer to minimize the negative effect of internal covariate shift, and one ReLU layer is added after each batch normalization layer to ensure non-negative activations. After the last ReLU layer, global average pooling is used to replace the traditional fully connected layers of the CNN. Meanwhile, each LSTM layer has 128 hidden units. Finally, the outputs from both the CNN and the LSTM are concatenated and fitted with a linear regression layer.
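The architecture just described can be sketched with the Keras functional API (the paper states its neural models are built on Keras, but does not give code; the (window, features) input layout and layer arrangement here are assumptions).

```python
from tensorflow.keras import Input, Model, layers

WINDOW, FEATURES = 288, 5  # window size and number of neighboring sensors

inp = Input(shape=(WINDOW, FEATURES))

# Spatial branch: five 1D CNN blocks (Conv -> BatchNorm -> ReLU),
# then global average pooling instead of fully connected layers.
x = inp
for _ in range(5):
    x = layers.Conv1D(filters=3, kernel_size=2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
x = layers.GlobalAveragePooling1D()(x)

# Temporal branch: two stacked LSTM layers with 128 hidden units each.
y = layers.LSTM(128, return_sequences=True)(inp)
y = layers.LSTM(128)(y)

# Concatenate both branches and fit a linear regression layer.
out = layers.Dense(1, activation="linear")(layers.Concatenate()([x, y]))
model = Model(inp, out)
```

Because the two branches both read the raw input and only meet at the concatenation, the spatial and temporal features are learnt in parallel rather than in series.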
Experimental results and comparison
The proposed PST-DNN is compared with several state-of-the-art methods: LSTM, SAE, the lasso method, and the Prophet method.
To guarantee a fair comparison, the parameters of these models are chosen from recommended values in the related literature: for LSTM, one layer with 128 hidden units is used; for SAE, a three-layer neural network is applied with 300 hidden neurons per layer and a sigmoid activation. In the lasso method, the weight of the L1 norm is set at 0.01. In the Prophet method, the width of the uncertainty intervals is set at 0.8. All neural network methods are built on the Keras library, the lasso method is trained with the scikit-learn library, and the Prophet method is implemented with the Prophet library. The train–test split of the dataset is 70/30. Also, the traffic flows are normalized to the range between 0 and 1 before being input to all the models. Besides, to improve calculation efficiency while training in parallel, the 591 valid sensors are divided into six groups, with all sensors in each group located contiguously along the roadside. Thus, five groups have 98 sensors and the sixth group has 101 sensors.
Table 3 presents the median evaluation indexes of the 591 valid sensors for different models. Since the median can ignore bias caused by outliers, it is more appropriate for reflecting population performance; thus, other descriptive values, such as the mean and the first/third quartiles, are not displayed. Training time is also not provided, because neural network methods always take longer to train, and comparing training time between deep and non-deep networks is not meaningful. It is evident that the overall performance of PST-DNN is better than that of the other models, although for the second group the RMSE of PST-DNN is higher than that of SAE by about 0.035 and its MAE is higher by about 0.348. PST-DNN achieves the lowest MRE among the models, outperforming the second-best model by 2.174, 0.025, 0.086, 0.023, 0.088, and 0.003 for the six groups, respectively. The experimental results reflect that the introduced model can effectively learn spatial–temporal features from the input data. Intuitively, lasso is a simple linear prediction method that cannot capture the non-linear features of traffic flow, so its overall performance is poorer than that of the other models. Figure 6 shows the boxplot comparison for the RMSE of the five models in group 1.
Median MRE, RMSE, and MAE of different models for traffic flow forecasting.
MRE: mean relative error; RMSE: root mean square error; MAE: mean absolute error; LSTM: long short-term memory; SAE: stacked autoencoder; PST-DNN: parallel spatiotemporal deep learning network.

Boxplot for RMSE of five models in group 1.
MRE, RMSE, and MAE evaluate the error between predicted and true values. The forecasting accuracies of the spatial and temporal distributions are equally important when the prediction focuses on spatial and temporal information. Thus, the average correlation (AC) can be used to evaluate the performance of distribution forecasting for this information. 46 AC for spatial and temporal information is, respectively, defined as follows
where
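As the AC equations are elided above, the sketch below assumes AC is the Pearson correlation between predicted and true series, averaged over sensors (temporal AC) or over time points (spatial AC); this interpretation follows the common usage of average correlation and should be checked against reference 46.

```python
import numpy as np

def average_correlation(y_true, y_pred, axis=1):
    """Pearson correlation along `axis`, averaged over the other axis.
    Inputs are (sensors x time points) matrices.
    axis=1: correlate each sensor's time series (temporal AC);
    axis=0: correlate across sensors at each time point (spatial AC)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    if axis == 0:
        y_true, y_pred = y_true.T, y_pred.T
    corrs = [np.corrcoef(t, p)[0, 1] for t, p in zip(y_true, y_pred)]
    return float(np.mean(corrs))
```

A perfect forecast gives an AC of 1 in both dimensions; values near 0 indicate that the predicted distribution does not track the true one.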
Average AC for spatial and temporal information of different models.
LSTM: long short-term memory; SAE: stacked autoencoder; PST-DNN: parallel spatiotemporal deep learning network.
Figure 7 shows a partial visualization of one sensor’s prediction results from these models; PST-DNN yields more accurate predicted values in both the space and time dimensions.

Partial visualization of prediction values for sensor NHNX20: (a) prediction values of one sensor marked as NHNX20 from 7:00 am to 10:00 am on 30 June 2011; (b) prediction values of 19 continuous sensors at 7:00 am on 30 June 2011, where #1 location is the sensor marked as NHNX20.
Conclusion and future work
The contributions of the article can be summarized as follows:
The spatiotemporal features of the traffic network are considered simultaneously by a parallel-connected neural network combining CNN and LSTM for the traffic flow forecasting problem;
Models are evaluated on a large-scale road network with a half-year dataset, and the newly proposed model shows better prediction performance based on spatiotemporal features compared with the other time series models.
The proposed PST-DNN combines CNN and LSTM in a parallel structure and achieves encouraging performance compared with the other models when applied in a real environment. Since most existing research emphasizes model performance alone, statistical significance is often uncertain. To make the experiment scientific, the large network is divided into groups for training and evaluation, which suggests a rigorous design for experiments involving model comparison.
However, despite the high accuracy, deep learning applications in traffic flow forecasting still need more research. Maintaining a balance between calculation resource usage and prediction accuracy is always a challenge. Therefore, advanced deep learning architectures for large-scale traffic flow forecasting are worth studying in depth.
