Abstract
Keywords
Introduction
The rapid development of China’s rail transit has seen the addition of over 4000 km of new urban rail transit lines, with a cumulative passenger volume of nearly 100 billion. There are currently 283 urban rail transit lines operating in 50 cities in China, with a total length of 9206.8 km. In 2021, the total passenger volume reached 23.69 billion, and both the scale of the network and the passenger flow ranked first in the world. Networked operation has greatly improved the efficiency and organizational level of China’s rail transit operation, but it has also brought a large increase in passenger flow and many safety risk factors. High-density passenger flow has a significant impact on the safety of the rail network. While rail transit brings convenience to people, stations, as hubs for large-scale passenger flow, carry significant safety hazards, especially in scenarios with bi-directional passenger flow, where flow conflicts are more likely to occur. For example, stampedes caused by crowded passenger flow in passageways have occurred at stations such as Shanghai Hongqiao Railway Station and Shenzhen Metro Station. Therefore, densely occupied places such as rail transit hubs urgently need real-time perception of passenger flow safety status to support the prediction of that status and to ensure the operational safety of rail transit stations.
In practical applications, rail transit stations can be equipped with dozens or even hundreds of cameras. Traditional rail transit passenger flow detection methods 1 (such as estimating passenger flow based on automatic ticketing systems) cannot meet the real-time and high-precision requirements for on-site large-scale passenger flow operation safety detection. At the same time, scenes with bi-directional passenger flow in stations such as passageways and escalators are more likely to result in flow conflicts. During peak hours, holidays, and other heavy passenger flow periods, it is necessary to achieve precise detection of the bi-directional passenger flow status in specific areas of a station.
At present, the passenger flow detection methods of integrated hub stations mainly include pressure sensor detection technology, 2 infrared detection technology, 3 radio wave detection technology, 4 manual observation statistics, 5 AFC technology 6 and intelligent video monitoring technology. The main advantages and disadvantages of these six passenger flow detection methods are shown in Table 1:
Advantages and disadvantages of station passenger flow detection methods.
An analysis of the characteristics, advantages and disadvantages of these detection technologies shows that intelligent video monitoring analysis is the most suitable technology for bi-directional passenger flow detection in station passageways. By deploying the deep learning algorithm directly on the camera back-end system and analyzing the converted video data, real-time detection and tracking of bi-directional passenger flow in the station passageways can be completed. Further calculations can then accurately provide key parameters such as bi-directional passenger flow density, speed, and volume in the station passageways.
With the development of deep learning technology, research on computer vision tracking tasks has been greatly promoted, leading to rapid development of deep learning-based object tracking algorithms. 7–10 However, there is currently little research on introducing multi-object tracking algorithms into station passageway scenes for passenger flow detection and tracking. At the same time, objects frequently occlude one another in station passageway scenes, which poses a great challenge to the detection and tracking of dense passenger flows. Therefore, this paper introduces multi-object tracking algorithms into the station passageway scene for passenger flow detection and tracking, and improves the algorithm specifically for this scenario to achieve high tracking accuracy and strong environmental adaptability. The contributions of this paper are as follows:
(1) This paper introduces the SimAM attention mechanism based on the YOLOv7 detection algorithm to improve the detection accuracy of the detector for passenger flows in passageways.
(2) This paper optimizes the Kalman Filtering (KF) method based on the Deep-Sort tracking algorithm so that the tracking box fits the target more closely, which improves the tracking accuracy.
(3) This paper uses the Fast-ReID method for tracking matching to improve the stability of target matching and to reduce target ID drift, thereby improving IDF1.
The remaining sections are arranged as follows: Section 2 presents the related work, Section 3 describes the proposed methods in detail, Section 4 presents the experimental results, and Section 5 draws the conclusion.
Related work
The main task of Multiple Target Tracking (MTT) is to locate multiple targets in a given video and keep their IDs consistent, and finally record their tracking trajectories. This paper mainly focuses on the study of passenger flow in the bi-directional passageways of the station. Currently, the mainstream pedestrian tracking algorithms are mostly based on the Tracking-by-Detection (TBD) paradigm, which detects first and subsequently tracks. This leads to the tracking results being largely dependent on the detection results of the target detector. Therefore, how to effectively obtain detection results for dense pedestrians is a key focus of this paper.
Current object detection methods mainly include traditional methods and deep learning-based methods. Traditional object detection methods mainly use hand-crafted features for detection, including local features (SIFT, 11 HOG, 12 Haar, 13 LBP, 14 etc.) and global features (DPM 15 ). Traditional object detection methods involve tedious feature extraction, generalize poorly, and place high demands on the image environment. In particular, during peak periods in station passages, occlusions and multi-scale changes in large passenger flows lower the detection accuracy, and real-time performance cannot meet the requirements.
The object detection method based on Deep Learning can accurately give the category and location information of objects. According to the detection type, they can be divided into one-stage detection algorithms (such as YOLO series, 16 SSD, 17 RetinaNet 18 ) and two-stage detection algorithms (such as RCNN series, 19 SPPnet 20 ). One-stage detection algorithms directly extract features from the network to predict object classification and location, which is relatively fast and suitable for mobile deployment. Two-stage detection algorithms need to first generate proposals and then perform fine-grained object detection. These algorithms are relatively slower and require multiple runs of detection and classification processes. Integrating deep learning-based detection algorithms can effectively deal with complex changes in passenger flow in railway transportation scenarios, but it is also necessary to optimize the model based on specific scenarios to balance the accuracy and speed of the algorithm.
Pedestrian re-identification (ReID) is considered as a sub-problem of image retrieval, 21 and feature extraction and metric learning in pedestrian re-identification can provide strong support for target tracking. Nowadays, many studies combine pedestrian re-identification technology with detection and tracking technology and are widely used in intelligent security systems.
Regarding the TBD paradigm, Bewley et al. proposed the SORT algorithm, which focuses on inter-frame prediction and association. By combining the Kalman Filter and the Hungarian Algorithm, it provides a simple and effective online tracking framework. Wojke et al. proposed the Deep-SORT algorithm. Given that the SORT algorithm pays little attention to the frequent target ID switches caused by occlusion in long-term tracking, Deep-SORT introduces pedestrian re-identification technology as an appearance model on top of SORT. Training the re-identification model enhances the network’s ability to discriminate between different targets. At the same time, a cascade matching strategy was proposed to improve the accuracy of target matching. Wang et al. proposed the JDE algorithm. To achieve real-time performance, the JDE algorithm integrates one-stage detection with pedestrian re-identification and outputs detection and ReID information at the same time to speed up inference. Within the JDE paradigm, which performs detection and tracking simultaneously, Zhang et al. proposed the Fair-MOT algorithm. It explores the integration of the object detection and pedestrian re-identification tasks, using the anchor-free object detection algorithm Center-Net 22 as the detection branch and adding a parallel branch that outputs Re-ID features to distinguish different targets, which unifies object detection and re-identification well.
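The association step of SORT-style trackers can be illustrated with a minimal sketch. It assumes boxes in (x1, y1, x2, y2) format and, to stay dependency-free, finds the minimum-cost assignment by brute-force search over permutations rather than with the Hungarian algorithm; both return an optimal assignment, but brute force is only practical for a handful of boxes. The function names and the 0.3 IoU threshold are illustrative assumptions, not values taken from the cited papers.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_min=0.3):
    """Return (track_index, detection_index) pairs minimising total (1 - IoU).

    Assumes len(tracks) <= len(detections). Pairs whose IoU falls below
    iou_min (a hypothetical gate) are discarded after the assignment.
    """
    best_cost, best_perm = None, ()
    for perm in permutations(range(len(detections)), len(tracks)):
        cost = sum(1.0 - iou(tracks[t], detections[d]) for t, d in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return [(t, d) for t, d in enumerate(best_perm)
            if iou(tracks[t], detections[d]) >= iou_min]
```

In practice the `tracks` boxes would be the Kalman-predicted boxes for the current frame, and Deep-SORT additionally combines this motion cue with an appearance distance.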
In summary, improving the results of object detection can further improve the multi-target tracking results, and the effective features extracted by pedestrian ReID techniques provide strong support for the target tracking task. To address the low tracking accuracy and poor real-time performance of current multi-target tracking methods, this paper proposes an improved Deep-Sort pedestrian tracking method based on YOLOv7 to detect and count passengers in both directions in station passageway scenarios, achieving good tracking results while largely maintaining real-time performance.
Proposed methods
This paper proposes a station passageways bi-directional pedestrian flow tracking framework based on the Deep-Sort algorithm, which includes three parts:
The first part concerns the detector: the SimAM attention mechanism is introduced on the basis of the YOLOv7 23 detection algorithm to focus more on the details of occluded pedestrians and improve the detection accuracy. The second part concerns tracking: the Kalman Filter (KF) algorithm is optimized to fit more accurate tracking boxes and improve the tracking accuracy. The third part introduces the Fast-ReID 24 method, integrating the appearance feature extractor into the tracker to extract more representative features and improve the stability of target tracking, thereby reducing the frequency of target ID switching and completing the final task of bi-directional pedestrian flow tracking in the station passageways.
The general tracking process in this paper is shown in Figure 1.

The tracking process of the method in this paper.
The first frame (F1)
First, the YOLOv7 object detection algorithm, with a SimAM attention mechanism added to its Head part in this paper, is applied to the input image of the first frame. Second, the detection results (bounding box positions and sizes) are used as inputs for tracking. In the tracking module, a KF prediction (optimized KF) predicts the position and size of each tracking box, and a matching step (using the Fast-ReID module) then matches the detection boxes with the tracking boxes and assigns the corresponding target IDs. Finally, the corresponding cost matrix and KF are updated, and the tracking results are output. Because the KF prediction requires the tracking results of the previous frame, no prediction is possible in the first frame; the tracks are therefore initialized directly, with the detection results used as the tracking results.
The second frame (F2)
The detection part is the same as for the first frame: the modified YOLOv7 detector is applied to the input image of the second frame, and its output is used as the input for tracking. In the tracking part, the optimized KF predicts the track states from the first-frame tracking results, and the detection results are then matched with the KF predictions by the Fast-ReID module. After matching, the corresponding cost matrix, the Fast-ReID feature library and the KF are updated, and the tracking results are output.
The subsequent frames repeat the same steps as the second frame, as shown in Figure 1.
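The per-frame flow described above can be summarized as a hypothetical skeleton. The detector, the KF prediction and the matching step are passed in as plain callables, so `detect`, `predict` and `match` are placeholder names rather than parts of any real library. Two simplifications are assumptions of this sketch: unmatched detections start new tracks, and unmatched tracks are simply kept instead of being aged out.

```python
def run_tracker(frames, detect, predict, match):
    """Hypothetical skeleton of the first-frame / later-frame logic.

    detect(frame)  -> list of boxes
    predict(box)   -> predicted box for the next frame (stands in for the KF)
    match(predicted_boxes, detections) -> list of (track_pos, det_pos) pairs
    Returns, for each frame, a dict mapping track ID -> box.
    """
    tracks = {}      # track ID -> last known box
    next_id = 1
    history = []
    for index, frame in enumerate(frames):
        detections = detect(frame)
        if index == 0:
            # First frame: nothing to predict from, so every detection
            # directly becomes a track.
            for box in detections:
                tracks[next_id] = box
                next_id += 1
        else:
            ids = list(tracks)
            predicted = [predict(tracks[i]) for i in ids]
            matched = set()
            for t, d in match(predicted, detections):
                tracks[ids[t]] = detections[d]   # update track with its detection
                matched.add(d)
            for d, box in enumerate(detections):
                if d not in matched:             # unmatched detection: new ID
                    tracks[next_id] = box
                    next_id += 1
        history.append(dict(tracks))
    return history
```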
Detector
The current mainstream multi-object tracking methods still adopt the Tracking-by-Detection paradigm. Therefore, the multi-object tracking paradigm based on detection is also used in this paper, where detection is regarded as a crucial first step.
The aim of this study is to monitor the relatively dense passenger flow in an integrated station, which requires detecting the position of pedestrians and labeling them with rectangular boxes. YOLOv7 is selected as the base algorithm; at the time of this study it was the most advanced algorithm in the YOLO series. Its object detection model offers both high accuracy and fast operation speed. YOLOv7 is implemented in PyTorch and can be effectively applied to embedded devices and mobile terminals.
The YOLOv7 model mainly consists of three parts: the input part, the backbone part, and the head part. In the input part, the original image is first preprocessed, including data augmentation, adaptive image scaling, and adaptive anchor box calculation, and is finally resized to an RGB image of size 640 × 640, which is input into the backbone network. The backbone part is composed of several BConv layers, E-ELAN layers, and MPConv layers, which appear alternately to halve the width and height of the feature maps, increase the number of channels, and extract features, finally outputting three layers of feature maps. The head part consists of SPPCSPC layers, several BConv layers, several ELAN-H layers, several MPConv layers, several Catconv layers, and RepVGG block layers, and subsequently outputs three heads. In the head part, the feature maps output by the backbone are processed into three layers of feature maps of different sizes, and finally, through the RepVGG block and Conv layers, the three tasks of image detection (classification, foreground/background classification, and bounding box regression) are predicted and the final result is output.
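As an illustration of the adaptive image scaling step, the following sketch computes the parameters of a letterbox-style resize, assuming (as is common in YOLO implementations, though not stated in this paper) that the aspect ratio is preserved and the remainder of the 640 × 640 canvas is padded. The function name and return layout are hypothetical.

```python
def letterbox_params(width, height, target=640):
    """Scale factor, resized size, and per-side padding for a square target.

    Returns (scale, (new_w, new_h), (left, top, right, bottom)).
    """
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_w, pad_h = target - new_w, target - new_h
    # Split the padding between both sides; an odd pixel goes right/bottom.
    left, top = pad_w // 2, pad_h // 2
    return scale, (new_w, new_h), (left, top, pad_w - left, pad_h - top)
```

For a 1920 × 1080 camera frame this gives a 640 × 360 resized image with 140 pixels of padding above and below. Detected boxes have to be mapped back to the original frame by subtracting the padding and dividing by the scale.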
Because this study monitors relatively dense passenger flow, most passenger targets are occluded. It is therefore necessary to further enrich the feature information of the occluded areas and to pay more attention to the details of pedestrians, so that partially occluded pedestrians can also be detected. This can further improve the detection accuracy and lay a foundation for tracking occluded targets later. Therefore, this paper introduces SimAM, a very effective attention module for convolutional neural networks. Compared with existing channel attention modules and spatial attention modules, this module does not need to add parameters to the original network. Instead, it infers the 3-D attention weights of the feature map in one layer. When the target is partially occluded, it can still help the detector infer the target. The Backbone is left unchanged, and the SimAM module is embedded in the Head part. The modified YOLOv7 network structure is shown in Figure 2.

Improved YOLOv7 network structure (Embedded with SimAM attention mechanism).
Research has shown that most existing attention modules (SE, 25 CBAM 26 ) generate one-dimensional or two-dimensional weights from the feature map X and then expand the generated weights for channel and spatial attention. Channel (one-dimensional) attention weights treat different channels differently but all positions equally. Spatial (two-dimensional) attention weights treat different positions differently but all channels equally. Both restrictions limit the ability to learn more discriminative cues. Assigning full three-dimensional weights largely removes this limitation and is superior to traditional one-dimensional and two-dimensional attention weights. The SimAM attention module directly estimates the three-dimensional weights. In the sub graph, the same color is assigned to points on each channel, spatial position, or feature. Embedding the SimAM attention module in the network strengthens three-dimensional feature weighting during feature extraction, thereby improving detection accuracy.
Another advantage of this module is that most of its operators are chosen based on the solution to the defined energy function, which avoids spending too much effort on structural adjustments. Through quantitative evaluations of various visual tasks, it has been shown that the module’s flexibility and effectiveness have improved the expressive power of many ConvNets. 27
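To make the parameter-free weighting concrete, the sketch below applies the closed-form SimAM weight to a single channel, following the formulation commonly used with reference 27: each neuron's inverse energy grows with its squared deviation from the channel mean, and the feature is scaled by the sigmoid of that value. It is a dependency-free illustration on a 2-D list, not the implementation used in this paper; the regularization value `e_lambda=1e-4` and the function name are assumptions.

```python
import math


def simam_channel(channel, e_lambda=1e-4):
    """Apply SimAM-style weighting to one channel (2-D list of floats)."""
    values = [v for row in channel for v in row]
    n = max(len(values) - 1, 1)              # number of "other" neurons
    mean = sum(values) / len(values)
    sq_dev = [[(v - mean) ** 2 for v in row] for row in channel]
    variance = sum(d for row in sq_dev for d in row) / n
    out = []
    for row, dev_row in zip(channel, sq_dev):
        out_row = []
        for v, d in zip(row, dev_row):
            # Inverse of the minimal energy: larger for distinctive neurons.
            inv_energy = d / (4.0 * (variance + e_lambda)) + 0.5
            weight = 1.0 / (1.0 + math.exp(-inv_energy))   # sigmoid
            out_row.append(v * weight)
        out.append(out_row)
    return out
```

Because the weight depends only on per-channel statistics, the module adds no learnable parameters, which is the property highlighted above. In a real network the same computation would run on every channel of a 4-D tensor.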
This paper used the pedestrian subset of the public COCO dataset for large object detection, with its pre-defined training and validation sets used to train the model and the test set used to evaluate its accuracy. The improved YOLOv7 model based on the SimAM attention mechanism was compared with the original YOLOv7 and YOLOv5s 28 models. The comparison results are shown in Table 2 and indicate the effectiveness of the improved YOLOv7 model proposed in this paper.
Accuracy comparison of YOLO series algorithms (best in bold).
Optimized KF
Experiments on the multi-object tracking dataset MOT16 29 showed that explicitly estimating the height of the bounding box improves tracking performance.
To establish the motion model for objects on the image plane, researchers generally use a discrete Kalman Filter with a constant-velocity model. In the Deep-Sort model, an eight-dimensional space was used to represent the state of the track at a certain time, that is,
According to prior knowledge, this paper chooses the noise factors as in Wojke et al. to be

Optimized Kalman Filter comparison results ((a) is the tracking result of the 143rd frame of the MOT16-09 dataset, the left side of the picture in (a) is the original KF result, and the right side is the optimized KF result; (b) is the tracking result of the 153rd frame of the MOT16-09 dataset, as shown above).
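The constant-velocity Kalman recursion underlying the tracker can be sketched for a single box coordinate, for example the box height, with state [value, velocity]. This is a simplified one-coordinate illustration, not the eight-dimensional filter of this paper, which applies the same predict and update equations jointly to all box coordinates. The noise values passed in are hypothetical.

```python
def kf_predict(x, P, q_pos, q_vel, dt=1.0):
    """Predict step: x' = F x, P' = F P F^T + Q with F = [[1, dt], [0, 1]]."""
    value, velocity = x
    x_pred = [value + dt * velocity, velocity]
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    P_pred = [[p00 + dt * (p01 + p10) + dt * dt * p11 + q_pos, p01 + dt * p11],
              [p10 + dt * p11, p11 + q_vel]]
    return x_pred, P_pred


def kf_update(x, P, z, r):
    """Update step for a direct measurement z of the value (H = [1, 0])."""
    s = P[0][0] + r                          # innovation variance
    k0, k1 = P[0][0] / s, P[1][0] / s        # Kalman gain
    residual = z - x[0]
    x_new = [x[0] + k0 * residual, x[1] + k1 * residual]
    P_new = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
             [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
    return x_new, P_new


# Example: a box height of 100 px with unknown velocity, then a measured 104 px.
state, cov = kf_predict([100.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], q_pos=0.0, q_vel=0.0)
state, cov = kf_update(state, cov, z=104.0, r=2.0)
```

With these example numbers the updated state is a height of 102 px and a velocity of 1 px per frame: the filter moves halfway towards the measurement and starts to infer that the box is growing.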
Fast-ReID
Re-identification (ReID) aims to use various intelligent algorithms to find objects similar to the query targets in an image database and is a sub-task of image retrieval. Fast-ReID is an open-source library of state-of-the-art ReID methods, with flexible configuration and strong recognition and feature extraction capabilities for obtaining information on pedestrians.
This paper selects the stronger appearance feature extractor BoT from the Fast-ReID library to replace the original simple CNN feature extractor, and integrates it into the tracker. The feature extraction model in the tracker is replaced with the ReID model trained by Fast-ReID. During the tracking process, the confirmed tracker stores the corresponding detection feature maps into a feature set list each time feature matching is completed. The feature set list is updated after each feature matching, such as deleting some useless features or removing target features that have gone out of the frame. The feature set plays a role in ID association in the subsequent process, and when the similarity of features exceeds a certain threshold, it is considered as a successful association. The association results largely affect whether the ID of the target is stable and thus affects the tracking results.
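The feature set and threshold-based association described above can be sketched as follows. The sketch assumes that appearance features are plain float vectors compared by cosine similarity; the 0.7 threshold, the budget of 100 stored features per track, and all function names are hypothetical choices for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def best_track_for(feature, gallery, threshold=0.7):
    """Return the track ID whose stored features are most similar, or None.

    gallery maps track ID -> list of stored feature vectors. An association
    succeeds only if the best similarity exceeds the threshold.
    """
    best_id, best_sim = None, threshold
    for track_id, stored in gallery.items():
        sim = max((cosine_similarity(feature, f) for f in stored), default=0.0)
        if sim > best_sim:
            best_id, best_sim = track_id, sim
    return best_id


def update_gallery(gallery, track_id, feature, budget=100):
    """Store a matched feature and drop the oldest ones beyond the budget."""
    stored = gallery.setdefault(track_id, [])
    stored.append(feature)
    del stored[:-budget]


def remove_track(gallery, track_id):
    """Forget the features of a target that has left the frame."""
    gallery.pop(track_id, None)
```

Keeping several recent features per track, rather than only the latest one, lets a target be re-associated after a short occlusion even if its most recent appearance was partly hidden.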
This paper trains the final pedestrian re-identification model on the Market1501 dataset, using ResNet50 as the backbone and a ResNet50-ibn model as the baseline; the results are shown in Table 3.
Market1501 datasets test results.
Similarly, this paper directly updates the appearance state of the
Where,
Experiment and data
Datasets and experimental environment
This paper selects the publicly available MOT16 dataset for multi-object tracking to test the proposed algorithm and to compare it with other advanced algorithms in order to analyze model performance. The experimental environment is based on the Ubuntu 16.04 operating system, Nvidia Titan XP 2080Ti graphics card, and 64 GB of RAM. The PyTorch 1.6.0 deep learning framework is used, and the implementation runs on a server with Python 3.7.
Evaluation indicators
In order to make the evaluation of the model more objective and accurate, and to compare it with other algorithms in a reasonable way, this paper uses commonly used evaluation metrics in the multi-object tracking field: Multi-Object Tracking Accuracy (MOTA), Multi-Object Tracking Precision (MOTP), Identification F1 Score (IDF1), and ID Switch (IDS). The formulas for some of the evaluation metrics are shown in Formulas (6) and (7):
In the formulas,
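For reference, the widely used definitions of two of these metrics can be written as short functions. These follow the standard CLEAR MOT and identity-metric definitions, which may differ in notation from Formulas (6) and (7); the argument names are chosen here for illustration, and all counts are totals accumulated over every frame of a sequence.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```

MOTA can become negative when the number of errors exceeds the number of ground-truth objects, whereas IDF1 always lies between 0 and 1.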
Experiment and analysis
For multi-object tracking, this paper uses the MOT16 dataset for experiments and compares the results with several advanced multi-object tracking algorithms. Inspired by Zou et al., 30 this paper includes the time required by the detector when calculating the speed, and the results are shown in Table 4.
Comparison results between this algorithm and other algorithms (best in bold).
The metrics show that the algorithm proposed in this paper has relative advantages over some advanced algorithms. Its MOTP, IDF1, and IDS metrics are the best, and its MOTA value is similar to that of the TubeTK algorithm. However, the MOTP of the TubeTK algorithm is too low and its real-time performance is poor, making it unsuitable for on-site deployment. Although the JDE algorithm differs considerably from the proposed algorithm in real-time performance, its ID switch frequency is too high, which is caused by the mutual occlusion of targets in the many dense pedestrian scenes. The scenario addressed in this paper is the passageway of an integrated hub station, which also contains many dense pedestrians. Therefore, the algorithm proposed in this paper is more applicable and practical.
Ablation experiments
Table 5 shows the tracking results obtained when different YOLO series models are used as the target detector for target tracking. The results demonstrate the effectiveness of the improved YOLOv7 model based on the SimAM attention mechanism proposed in this paper.
Tracking results of YOLO series plus Deep-Sort algorithm (MOT16 datasets, best in bold).
Table 6 shows the ablation experiments performed on the MOT16 dataset for the improved components of this paper, namely the addition of the SimAM attention mechanism, the optimized KF (KF+) and the introduction of the Fast-ReID method. The results of the ablation experiments show the effectiveness of each improved component.
Results of ablation experiment (best in bold).
Application of experimental scenarios
Experimental process
Station passages are typically divided into two types: transfer passages and regular passages. Regular passages are characterized by a relatively loose and non-fixed direction of passenger flow, and congestion tends to occur during peak hours. Transfer passages are more prone to flow line conflicts, resulting in personnel accidents within the station. Bi-directional passenger flow refers to passengers traveling in two different directions, as shown in Figure 4. The figure shows a scene of a subway station passage, with barriers separating the passenger flows in two different directions of travel.

Site map of bi-directional passenger flow in station passage.
The scenario studied in this paper is bi-directional passenger flow in integrated hub station passages, as shown in Figure 5: bi-directional passenger flow refers to the passenger flow traveling in any rectangular box.

Bi-directional passenger flow diagram of station passage.
This paper describes how bi-directional passenger flow tracking and statistics are carried out. (Examples are only shown for single-direction tracking methods, while subsequent experimental figures demonstrate bi-directional tracking.) First, target tracking algorithms are used to track passenger flows, and then marking lines are manually or systematically drawn to record the center trajectories of the tracked passenger flows. It is determined whether the center of the passenger flow trajectory intersects with the marking line. Once an intersection occurs, the passenger ID is recorded, and passenger flow tracking statistics are carried out as shown in Figure 6.

Algorithm statistics rules.
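The counting rule can be sketched in a few lines, assuming a horizontal marking line at `line_y` in image coordinates (y grows downwards) and trajectories stored as lists of box centres per track ID. These data structures and names are assumptions made for illustration. Each ID is counted once, at its first crossing.

```python
def count_crossings(trajectories, line_y):
    """Count track IDs whose centre trajectory crosses a horizontal line.

    trajectories: dict mapping track ID -> list of (cx, cy) in frame order.
    Returns (down_count, up_count).
    """
    down_ids, up_ids = set(), set()
    for track_id, centres in trajectories.items():
        # Compare consecutive centres to find the first crossing, if any.
        for (_, y_prev), (_, y_curr) in zip(centres, centres[1:]):
            if y_prev < line_y <= y_curr:      # moved from above to below
                down_ids.add(track_id)
                break
            if y_prev >= line_y > y_curr:      # moved from below to above
                up_ids.add(track_id)
                break
    return len(down_ids), len(up_ids)
```

Because counting is keyed on the track ID, an ID switch near the marking line can cause a passenger to be missed or counted twice, which is one reason stable IDs matter for the statistics.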
In the visualization view, the passenger IDs are not displayed in the interface; only the quantity of interest, namely the tracked passenger flow counts, is shown. Figure 7 shows some experimental data, where the black solid line in the middle of the image is the calibrated marking line. The “Cross Down” in the upper right corner shows the number of passengers passing through from top to bottom on the left side, and “Cross Up” shows the number of passengers passing through from bottom to top on the right side.

Some experimental data.
Experimental results
After statistical analysis of the experimental data, as shown in Table 7, the number of passengers tracked by the algorithm in this paper passing from top to bottom (“Down”) was 116, while the actual number of passengers passing through was 125, resulting in a tracking accuracy of 92.80%. The number of passengers tracked by the algorithm in this paper passing from bottom to top (“Up”) was 164, while the actual number of passengers passing through was 175, resulting in a tracking accuracy of 93.71%. The overall average tracking accuracy was 93.26%.
Tracking statistics results of experimental data.
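The reported percentages can be reproduced with a short calculation, taking the tracking accuracy to be the ratio of tracked to actual passengers and the overall value to be the mean of the two directional accuracies. The function name is hypothetical.

```python
def counting_accuracy(tracked, actual):
    """Tracked count as a percentage of the actual count."""
    return 100.0 * tracked / actual


down = counting_accuracy(116, 125)
up = counting_accuracy(164, 175)
average = (down + up) / 2
print(f"Down {down:.2f}%  Up {up:.2f}%  Average {average:.2f}%")
# prints: Down 92.80%  Up 93.71%  Average 93.26%
```

Pooling both directions instead (280 tracked out of 300 actual passengers) would give 93.33%, so the two ways of aggregating agree to within 0.1 percentage points here.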
Conclusion
This paper proposes an improved Deep-Sort algorithm based on YOLOv7. The SimAM attention mechanism is introduced on the basis of the YOLOv7 object detection algorithm, the Kalman Filter (KF) is optimized, and the Fast-ReID method is introduced on the basis of the Deep-Sort tracking algorithm, all with the aim of improving the tracking performance of the model. Finally, bi-directional passenger flow tracking is carried out in the passageways of an integrated station. Because of heavy occlusions in the station passageways, not all passengers can be tracked and counted. Nevertheless, in our study the overall tracking accuracy of the algorithm reached more than 93% with good real-time performance, which meets the needs of on-site application and demonstrates the applicability of the proposed algorithm.
