Sage Journals: Discover world-class research

Abstract

The normal operation of an integrated hub station is of great significance for the safe operation of the entire city’s transportation network. Accurately monitoring the passenger flow status of the station is the fundamental basis for scientific management and control of passenger flow. In response to the urgent need for accurate, real-time detection of passenger flow in station passageways, an improved Deep-Sort algorithm based on YOLOv7 is proposed to detect and track bi-directional passenger flow in the passageways of integrated hub stations. On the detection side, the SimAM attention mechanism is introduced into the YOLOv7 detection algorithm to improve the accuracy of detecting passenger flow in the passageways. On the tracking side, the Kalman Filter (KF) in the Deep-Sort tracking algorithm is optimized so that the tracking box fits the target more accurately. Meanwhile, the Fast-ReID method is used to improve long-term tracking of targets, thereby improving the IDF1 value. The algorithm can help achieve real-time and accurate detection and tracking of bi-directional passenger flow in station passageways, so that station staff can react rapidly to abnormal situations and improve the station’s operational safety.

Keywords

Target tracking, improved YOLOv7 algorithm, SimAM attention mechanism, improved Deep-Sort algorithm, Kalman Filter (KF), Fast-ReID method

Introduction

The rapid development of China’s rail transit has seen the addition of over 4000 km of new urban rail transit lines, with a cumulative passenger volume of nearly 100 billion. There are currently 283 urban rail transit lines operating in 50 cities in China, with a total length of 9206.8 km. In 2021, the total passenger volume reached 23.69 billion, and both the scale of the network and passenger flow ranked first in the world. The networked operation has greatly improved the efficiency and organizational level of China’s rail transit operation, but it has also brought about a large increase in passenger flow and many safety risk factors. High-density passenger flow has a significant impact on the safety of the road network. While rail transit brings convenience to people, stations, as hubs for large-scale passenger flow, have significant safety hazards, especially in scenarios with bi-directional passenger flow where flow conflicts are more likely to occur. For example, incidents of stampedes caused by crowded passenger flow in passageways have occurred at stations such as Shanghai Hongqiao Railway Station and Shenzhen Metro Station. Therefore, personnel-intensive places like rail transit hubs urgently need real-time perception of passenger flow safety status information to provide support for predicting passenger flow safety status and ensuring the operation safety of rail transit stations.

In practical applications, rail transit stations can be equipped with dozens or even hundreds of cameras. Traditional rail transit passenger flow detection methods¹ (such as estimating passenger flow based on automatic ticketing systems) cannot meet the real-time and high-precision requirements for on-site large-scale passenger flow operation safety detection. At the same time, scenes with bi-directional passenger flow in stations such as passageways and escalators are more likely to result in flow conflicts. During peak hours, holidays, and other heavy passenger flow periods, it is necessary to achieve precise detection of the bi-directional passenger flow status in specific areas of a station.

At present, the passenger flow detection methods of integrated hub stations mainly include pressure sensor detection technology,² infrared detection technology,³ radio wave detection technology,⁴ manual observation statistics,⁵ AFC technology⁶ and intelligent video monitoring technology. The main advantages and disadvantages of these six passenger flow detection methods are shown in Table 1:

Table 1.

Advantages and disadvantages of station passenger flow detection methods.

Passenger flow detection method	Advantages and disadvantages
Pressure sensor detection	The equipment is frequently trampled, the service cycle is short, and the detection accuracy will decline
Infrared detection	The detection equipment is low-cost and can achieve real-time detection, but its use efficiency is low
Radio wave detection	The detection environment is very demanding and the equipment is expensive, which is not suitable for large-scale deployment
Manual observation statistics	Relies on the experience of on-site personnel, cannot support quantitative analysis, low accuracy
AFC technology	The passenger flow data of inbound and outbound stations can be obtained, but the delay is great
Intelligent video monitoring analysis	High detection accuracy, strong environmental adaptability, and can be deployed in any area of the station

Through the analysis of the characteristics, advantages and disadvantages of these passenger flow detection technologies and methods, intelligent video monitoring analysis is the most suitable technology for bi-directional passenger flow detection in station passageways. By directly deploying the deep learning algorithm on the camera back-end system and analyzing the converted video data, real-time detection and tracking of bi-directional passenger flow in the station passageways can be completed. Relevant calculations can then provide key parameters of the bi-directional passenger flow in the station passageways, such as density, speed, and flow volume.

With the development of deep learning technology, research on computer vision tracking tasks has been greatly promoted, leading to rapid development of deep learning-based object tracking algorithms.⁷⁻¹⁰ However, there is currently little research on introducing multi-object tracking algorithms into station passageway scenes for passenger flow detection and tracking. At the same time, there are many object occlusions in station passageway scenes, which poses a great challenge to the detection and tracking of dense passenger flows. Therefore, this paper introduces multi-object tracking algorithms into the station passageway scene for passenger flow detection and tracking, and improves the algorithm specifically for this scenario to achieve high tracking accuracy and strong environmental adaptability. The contributions of this paper are as follows:

(1) This paper introduces the SimAM attention mechanism based on the YOLOv7 detection algorithm to improve the detection accuracy of the detector for passenger flows in passageways.

(2) This paper optimizes the Kalman Filter (KF) method based on the Deep-Sort tracking algorithm to improve the tracking accuracy by making the tracking box fit the target more accurately.

(3) This paper uses the Fast-ReID method for tracking matching to improve the stability of target matching and to reduce target ID drift as a way of improving the value of IDF1.

The remaining sections are arranged as follows: Section 2 presents the related work, Section 3 describes the proposed methods in detail, Section 4 presents the experimental results, and Section 5 draws the conclusion.

Related work

The main task of Multiple Target Tracking (MTT) is to locate multiple targets in a given video and keep their IDs consistent, and finally record their tracking trajectories. This paper mainly focuses on the study of passenger flow in the bi-directional passageways of the station. Currently, the mainstream pedestrian tracking algorithms are mostly based on the Tracking-by-Detection (TBD) paradigm, which detects first and subsequently tracks. This leads to the tracking results being largely dependent on the detection results of the target detector. Therefore, how to effectively obtain detection results for dense pedestrians is a key focus of this paper.

Current object detection methods mainly include traditional methods and deep learning-based methods. Traditional object detection methods mainly use hand-crafted features for detection, including local features (SIFT,¹¹ HOG,¹² Haar,¹³ LBP,¹⁴ etc.) and global features (DPM¹⁵). Traditional object detection methods have tedious feature extraction, weak generalization ability, and high requirements for the image environment. Especially during the peak period of the station passage, there are occlusions and multi-scale changes in the large passenger flow, which makes the detection accuracy low and the real-time performance cannot meet the requirements.

The object detection method based on Deep Learning can accurately give the category and location information of objects. According to the detection type, they can be divided into one-stage detection algorithms (such as YOLO series,¹⁶ SSD,¹⁷ RetinaNet¹⁸) and two-stage detection algorithms (such as RCNN series,¹⁹ SPPnet²⁰). One-stage detection algorithms directly extract features from the network to predict object classification and location, which is relatively fast and suitable for mobile deployment. Two-stage detection algorithms need to first generate proposals and then perform fine-grained object detection. These algorithms are relatively slower and require multiple runs of detection and classification processes. Integrating deep learning-based detection algorithms can effectively deal with complex changes in passenger flow in railway transportation scenarios, but it is also necessary to optimize the model based on specific scenarios to balance the accuracy and speed of the algorithm.

Pedestrian re-identification (ReID) is considered as a sub-problem of image retrieval,²¹ and feature extraction and metric learning in pedestrian re-identification can provide strong support for target tracking. Nowadays, many studies combine pedestrian re-identification technology with detection and tracking technology and are widely used in intelligent security systems.

Regarding the TBD paradigm, Bewley et al. proposed the SORT algorithm, which focuses on inter-frame prediction and association. By combining the Kalman Filter and the Hungarian Algorithm, it provides a simple and effective online tracking framework. Wojke et al. proposed the Deep-SORT algorithm. Given that the SORT algorithm pays little attention to the frequent target ID switching caused by occlusion in long-term tracking, pedestrian re-identification technology was introduced as an appearance model on top of the SORT algorithm. Training the re-identification model enhances the network’s ability to discriminate between different target objects. At the same time, a cascade matching strategy was proposed to improve the accuracy of target matching. Wang et al. proposed the JDE algorithm. To achieve real-time performance, the JDE algorithm integrates one-stage detection with pedestrian re-identification and outputs detection and ReID information at the same time to speed up inference. Following the JDE paradigm, which performs detection and tracking simultaneously, Zhang et al. proposed the Fair-MOT algorithm. It explores the integration of the object detection and pedestrian re-identification tasks: the anchor-free object detection algorithm Center-Net²² is used as the detection branch, and a parallel branch is added to output Re-ID features that distinguish different targets, which unifies object detection and re-identification well.

In summary, improving the results of object detection can further improve the multi-target tracking results, and the effective features extracted by pedestrian ReID techniques provide strong support for the target tracking task. To address the low tracking accuracy and poor real-time performance of current multi-target tracking methods, this paper proposes an improved Deep-Sort pedestrian tracking method based on YOLOv7 to detect and count passengers in both directions in station passageway scenarios, achieving good tracking results while largely maintaining real-time performance.

Proposed methods

This paper proposes a station passageways bi-directional pedestrian flow tracking framework based on the Deep-Sort algorithm, which includes three parts:

The first part focuses on the detector, in which the SimAM attention mechanism is introduced on the basis of the YOLOv7²³ detection algorithm to focus more on the details of occluded pedestrians and improve the detection accuracy. The second part concerns tracking, where the Kalman Filter (KF) algorithm is optimized to fit more accurate tracking boxes and improve the tracking accuracy. The third part introduces the Fast-ReID²⁴ method, integrating the appearance feature extractor into the tracker to extract more representative features and improve the stability of target tracking, thereby reducing the frequency of target ID switching and completing the final task of bi-directional pedestrian flow tracking in the station passageways.

The general tracking process in this paper is shown in Figure 1.

Figure 1.

The tracking process of the method in this paper.

The first frame (F₁)

Firstly, the improved YOLOv7 object detection algorithm, in which a SimAM attention mechanism is added to the Head part of YOLOv7, is applied to the input image of the first frame. Secondly, the detection results (bounding box positions and sizes) are used as inputs for tracking. In the tracking module, a KF prediction (optimized KF) is performed to predict the position and size of the tracking box, and then a matching step (by the Fast-ReID module) is performed to match the object detection box with the tracking box and assign the corresponding target ID. Finally, the corresponding cost matrix and KF are updated, and the tracking results are output. Since the KF prediction in the tracking process requires the previous tracking results (i.e. the previous frame), the tracking results are directly initialized in the first frame, and the detection results are directly used as the tracking results.

The second frame (F₂)

The detection part is the same as for the first frame: the modified YOLOv7 performs target detection on the input image of the second frame, and the result is used as input for tracking. In the tracking part, the optimized KF predicts from the tracking results of the first frame, and the detection results are then matched with the KF prediction results by using the Fast-ReID module. After this matching step, the corresponding cost matrix, Fast-ReID feature library and KF are updated, and the tracking results are output.

The subsequent frames repeat the same steps as the second frame, as shown in the figure for frame t − 1 ($F_{t-1}$) and frame t ($F_{t}$).
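The per-frame flow described above (detect, predict, match, update, with the first frame initializing tracks directly from detections) can be sketched as a small loop. This is only an illustrative skeleton, not the paper's implementation: the callables `detect`, `predict`, `match` and `update` are hypothetical placeholders for the improved YOLOv7, the optimized KF, the Fast-ReID matching and the KF update respectively, and `nearest_match` is a toy one-dimensional stand-in for the matcher. Starting a new track for every unmatched detection is an assumption of this sketch.

```python
def track_video(frames, detect, predict, match, update):
    """Run a detect -> predict -> match -> update loop over all frames.

    Returns one {track_id: state} snapshot per frame.
    All four callables are hypothetical placeholders supplied by the caller.
    """
    tracks = {}      # track_id -> latest state
    next_id = 0
    results = []
    for index, frame in enumerate(frames):
        detections = detect(frame)
        if index == 0:
            # First frame: detections are used directly as tracking results.
            for det in detections:
                tracks[next_id] = det
                next_id += 1
        else:
            predicted = {tid: predict(state) for tid, state in tracks.items()}
            # match returns {detection_index: track_id} for matched pairs.
            assignments = match(predicted, detections)
            for det_index, det in enumerate(detections):
                tid = assignments.get(det_index)
                if tid is None:
                    # Assumption: an unmatched detection starts a new track.
                    tracks[next_id] = det
                    next_id += 1
                else:
                    tracks[tid] = update(predicted[tid], det)
        results.append(dict(tracks))
    return results


def nearest_match(predicted, detections, max_dist=5.0):
    """Toy 1-D matcher: greedily pair each detection with the closest free track."""
    assignments = {}
    used = set()
    for det_index, det in enumerate(detections):
        candidates = [(abs(det - pos), tid)
                      for tid, pos in predicted.items() if tid not in used]
        if candidates:
            dist, tid = min(candidates)
            if dist <= max_dist:
                assignments[det_index] = tid
                used.add(tid)
    return assignments


# Toy usage: each "frame" is already a list of 1-D positions.
demo = track_video(
    frames=[[0.0, 10.0], [1.0, 11.0], [2.0, 12.0]],
    detect=lambda frame: frame,
    predict=lambda state: state,
    match=nearest_match,
    update=lambda predicted_state, detection: detection,
)
print(demo[-1])
```

A real tracker would also age out and delete tracks that stay unmatched, which this skeleton leaves out for brevity.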

Detector

The current mainstream multi-object tracking methods still adopt the Tracking-by-Detection paradigm. Therefore, the multi-object tracking paradigm based on detection is also used in this paper, where detection is regarded as a crucial first step.

The target of this study is to monitor the relatively dense passenger flow in an integrated station, and it is necessary to detect the position of pedestrians and label them with rectangular boxes. YOLOv7 is the selected basic algorithm, which is currently the most advanced algorithm in the YOLO series. Its object detection model has not only high accuracy but also fast operation speed. YOLOv7 is implemented based on PyTorch and can be effectively applied to embedded devices and mobile terminals.

The YOLOv7 model mainly consists of three parts: the input part, the backbone part, and the head part. In the input part, the original image is first preprocessed, including data augmentation, adaptive image scaling, and adaptive anchor box calculation, and the image is finally resized to an RGB image of size 640 × 640, which is input into the backbone network. The backbone part is composed of several BConv layers, E-ELAN layers, and MPConv layers, which appear alternately to halve the width and height of the input, increase the number of channels, and extract features, finally outputting three layers of feature maps. The head part consists of SPPCSPC layers, several BConv layers, several ELAN-H layers, several MPConv layers, several Catconv layers, and RepVGG block layers that subsequently output three heads. The head part processes the backbone features into three layers of feature maps of different sizes, and finally, through the RepVGG block and Conv layers, the three tasks of image detection (classification, foreground and background classification, and bounding box regression) are predicted and the final result is output.
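The adaptive image scaling step can be illustrated with a small calculation. The sketch below assumes the scaling is a "letterbox" resize, that is, the image is shrunk by a single ratio so that its longer side becomes 640 and the shorter side is padded symmetrically. The text above does not spell out the exact procedure, so the function name `letterbox_params` and the symmetric padding are assumptions of this example.

```python
def letterbox_params(height: int, width: int, target: int = 640):
    """Return (new_h, new_w, pad_top, pad_bottom, pad_left, pad_right).

    Assumed letterbox scheme: one uniform scale factor, then padding
    up to a target x target square.
    """
    scale = min(target / height, target / width)
    new_h = round(height * scale)
    new_w = round(width * scale)
    pad_h = target - new_h
    pad_w = target - new_w
    top = pad_h // 2
    left = pad_w // 2
    return new_h, new_w, top, pad_h - top, left, pad_w - left


# A 1080 x 1920 camera frame keeps its aspect ratio and is padded vertically.
print(letterbox_params(1080, 1920))  # (360, 640, 140, 140, 0, 0)
```

Keeping one scale factor for both axes avoids distorting pedestrian proportions, which matters when box width and height are later fed to the tracker.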

As the target of this study is to monitor relatively dense passenger flow, most of the passenger targets are occluded. It is therefore necessary to further enrich the feature information of the occluded area and pay more attention to the details of pedestrians, so that partially occluded pedestrians can also be detected. This can further improve the detection accuracy and lay a foundation for tracking occluded targets later. Therefore, this paper introduces a very effective attention module for convolutional neural networks – SimAM. Compared with existing channel attention modules and spatial attention modules, this module does not need to add parameters to the original network. Instead, it infers the 3-D attention weights of the feature map in one layer. When the target is partially occluded, it can still infer the target for detection. Without changes to the Backbone, the SimAM module is embedded in the Head part. The modified YOLOv7 network structure is shown in Figure 2.

Figure 2.

Improved YOLOv7 network structure (Embedded with SimAM attention mechanism).

Research has shown that most existing attention modules (SE,²⁵ CBAM²⁶) generate one-dimensional or two-dimensional weights from the feature map X and then expand the generated weights for channel and spatial attention. Channel (one-dimensional) attention weights treat different channels differently but treat all positions equally. Spatial (two-dimensional) attention weights treat different positions differently but treat all channels equally. Both limit the ability to learn more discriminative cues. Assigning full three-dimensional weights largely removes this limitation and is superior to traditional one-dimensional and two-dimensional weight attention. The SimAM attention module directly estimates the three-dimensional weights. In the sub graph, the same color is assigned to points on each channel, spatial position, or feature. Embedding the SimAM attention module in the network improves three-dimensional feature extraction, thereby improving detection accuracy.

Another advantage of this module is that most of its operators are chosen based on the solution to the defined energy function, which avoids spending too much effort on structural adjustments. Through quantitative evaluations of various visual tasks, it has been shown that the module’s flexibility and effectiveness have improved the expressive power of many ConvNets.²⁷
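To make the parameter-free, three-dimensional weighting concrete, the following is a minimal dependency-free sketch of the closed-form SimAM weighting derived from its energy function: each activation is scaled by a sigmoid of its inverse energy, computed from the per-channel mean and variance. Plain nested lists stand in for a C × H × W feature map, and the regularization value `e_lambda = 1e-4` is an assumption of this example rather than a setting reported above.

```python
import math


def simam(feature_map, e_lambda=1e-4):
    """Apply SimAM-style 3-D attention to a C x H x W nested list of floats.

    No learnable parameters are involved: every weight is derived from
    per-channel statistics of the input itself.
    """
    output = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        if len(values) < 2:
            raise ValueError("each channel needs at least two spatial positions")
        n = len(values) - 1
        mean = sum(values) / len(values)
        sq_dev = [[(v - mean) ** 2 for v in row] for row in channel]
        variance = sum(d for row in sq_dev for d in row) / n
        new_channel = []
        for row, dev_row in zip(channel, sq_dev):
            new_row = []
            for v, d in zip(row, dev_row):
                # Inverse energy: larger for activations that stand out
                # from the rest of their channel.
                inv_energy = d / (4.0 * (variance + e_lambda)) + 0.5
                weight = 1.0 / (1.0 + math.exp(-inv_energy))
                new_row.append(v * weight)
            new_channel.append(new_row)
        output.append(new_channel)
    return output


# One channel with a single distinctive activation: it receives a larger
# weight than it would in a uniform channel.
refined = simam([[[0.0, 0.0], [0.0, 4.0]]])
print(refined[0][1][1])
```

Because each position in each channel gets its own weight, the module covers channel and spatial attention at once without adding layers to the network.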

This paper used the pedestrian subset of the COCO public datasets for large object detection, and used its pre-defined training and validation set to train the model, and the test set to evaluate the accuracy of the model. The improved YOLOv7 model based on the SimAM attention mechanism was compared with the original YOLOv7 and YOLOv5s²⁸ models, and the comparison results were shown in Table 2, which indicated the effectiveness of the improved YOLOv7 model proposed in this paper.

Table 2.

Accuracy comparison of YOLO series algorithms (best in bold).

Model	Test size	Precision (%)	Recall (%)	mAP@0.5 (%)
YOLOv5s	640	66.3	58.2	64.4
YOLOv7	640	77.2	62.7	68.0
YOLOv7 + SimAM(Ours)	640	82.4	73.2	75.8

Optimized KF

This paper found through experiments on the multi-object tracking dataset MOT16²⁹ that directly estimating the specific width and height of the bounding box can indeed improve tracking performance.

In order to establish the motion model for objects on the image plane, researchers generally use a discrete Kalman Filter with a constant-velocity model. In the Deep-Sort model, an eight-dimensional space is used to represent the state of a track at a given time, that is, $(x, y, γ, h, \dot{x}, \dot{y}, \dot{γ}, \dot{h})$, where $(x, y)$ is the central coordinate of the bbox, $γ$ is the width-to-height ratio of the bbox, $h$ is the height, and $(\dot{x}, \dot{y}, \dot{γ}, \dot{h})$ are the velocities of the four variables $x, y, γ, h$ respectively. Through specific experiments, it was found that predicting the true width and height of the bounding box helps to achieve a better tracking effect. Therefore, this paper defines the state vector of the optimized KF, $X_{k}$, as in Formula (1) and the measurement vector, $Z_{k}$, as in Formula (2), where $k$ denotes a given time step. In the Deep-Sort model, it was recommended to select $Q$ and $R$ as functions of some estimated elements and some measured elements. This choice therefore makes $Q$ and $R$ time-dependent. With the change of the optimized KF state vector, the process noise covariance matrix $Q_{k}$ and the measurement noise covariance matrix $R_{k}$ are modified accordingly, as shown in Formulas (3) and (4).

$X_{k} = [x(k), y(k), γ(k), h(k), \dot{x}(k), \dot{y}(k), \dot{γ}(k), \dot{h}(k)]^{T}$ (1)

$Z_{k} = [Z_{x}(k), Z_{y}(k), Z_{γ}(k), Z_{h}(k)]^{T}$ (2)

$\begin{array}{l} Q_{k} = \operatorname{diag}((σ_{p} \hat{γ}_{k-1 \mid k-1})^{2}, (σ_{p} \hat{h}_{k-1 \mid k-1})^{2}, \\ (σ_{p} \hat{γ}_{k-1 \mid k-1})^{2}, (σ_{p} \hat{h}_{k-1 \mid k-1})^{2}, \\ (σ_{v} \hat{γ}_{k-1 \mid k-1})^{2}, (σ_{v} \hat{h}_{k-1 \mid k-1})^{2}, \\ (σ_{v} \hat{γ}_{k-1 \mid k-1})^{2}, (σ_{v} \hat{h}_{k-1 \mid k-1})^{2}) \end{array}$ (3)

$\begin{array}{l} R_{k} = \operatorname{diag}((σ_{m} \hat{γ}_{k \mid k-1})^{2}, (σ_{m} \hat{h}_{k \mid k-1})^{2}, \\ (σ_{m} \hat{γ}_{k \mid k-1})^{2}, (σ_{m} \hat{h}_{k \mid k-1})^{2}) \end{array}$ (4)

According to prior knowledge, this paper chooses the noise factors as in Wojke et al.: $σ_{p}$ = 0.05 (position noise factor), $σ_{v}$ = 0.00625 (velocity noise factor), and $σ_{m}$ = 0.05 (measurement noise factor). The experimental results are shown in Figure 3. Using the Kalman Filter to predict the specific width and height fits the tracking box better and predicts its size more accurately, and thus improves the tracking accuracy.
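As a rough illustration of Formulas (3) and (4), the diagonals of $Q_{k}$ and $R_{k}$ can be built directly from the third and fourth components of the state estimate (written $\hat{γ}$ and $\hat{h}$ in the formulas) and the three noise factors. The sketch also includes the mean prediction of a constant-velocity model. The function names and the sample state values are hypothetical, chosen only for this example.

```python
SIGMA_P = 0.05      # position noise factor
SIGMA_V = 0.00625   # velocity noise factor
SIGMA_M = 0.05      # measurement noise factor


def process_noise_diag(gamma_hat, h_hat, sigma_p=SIGMA_P, sigma_v=SIGMA_V):
    """Diagonal of Q_k per Formula (3): four position terms, four velocity terms."""
    position = [(sigma_p * gamma_hat) ** 2, (sigma_p * h_hat) ** 2] * 2
    velocity = [(sigma_v * gamma_hat) ** 2, (sigma_v * h_hat) ** 2] * 2
    return position + velocity


def measurement_noise_diag(gamma_hat, h_hat, sigma_m=SIGMA_M):
    """Diagonal of R_k per Formula (4)."""
    return [(sigma_m * gamma_hat) ** 2, (sigma_m * h_hat) ** 2] * 2


def predict_state(state, dt=1.0):
    """Constant-velocity mean prediction for an 8-dimensional state.

    The first four entries advance by their velocities; velocities are kept.
    """
    values, velocities = state[:4], state[4:]
    return [p + dt * v for p, v in zip(values, velocities)] + list(velocities)


# Hypothetical estimate: third component 0.5, height 100 pixels.
print(process_noise_diag(0.5, 100.0))
print(measurement_noise_diag(0.5, 100.0))
print(predict_state([10.0, 20.0, 0.5, 100.0, 1.0, -2.0, 0.0, 0.5]))
```

Because both diagonals scale with the current estimate, larger boxes are allowed proportionally larger uncertainty, which is what makes the covariances time-dependent.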

Figure 3.

Optimized Kalman Filter comparison results ((a) is the tracking result of the 143rd frame of the MOT16-09 dataset, where the left side of the picture is the original KF result and the right side is the optimized KF result; (b) is the tracking result of the 153rd frame of the MOT16-09 dataset, with the same layout as (a)).

Fast-ReID

Re-identification (ReID) aims to use various intelligent algorithms to find objects similar to the query target in an image database and is a sub-task of image retrieval. Fast-ReID is an open-source library of state-of-the-art (SOTA) ReID methods, with flexible configuration and strong recognition and feature extraction capabilities for obtaining information on pedestrians.

This paper selects the stronger appearance feature extractor BoT from the Fast-ReID library to replace the original simple CNN feature extractor, and integrates it into the tracker. The feature extraction model in the tracker is replaced with the ReID model trained by Fast-ReID. During the tracking process, the confirmed tracker stores the corresponding detection feature maps into a feature set list each time feature matching is completed. The feature set list is updated after each feature matching, such as deleting some useless features or removing target features that have gone out of the frame. The feature set plays a role in ID association in the subsequent process, and when the similarity of features exceeds a certain threshold, it is considered as a successful association. The association results largely affect whether the ID of the target is stable and thus affects the tracking results.
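The feature set and threshold-based association described above can be sketched with cosine similarity. This is a simplified illustration under stated assumptions, not the tracker's actual matching code: the similarity threshold of 0.7, the per-track budget of 100 stored features, and the function names are all hypothetical, and a full tracker would solve a global assignment rather than pick the best track per detection.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(detection_feature, galleries, threshold=0.7):
    """Return the track ID whose stored features are most similar, or None.

    galleries maps track_id -> list of stored feature vectors.
    A match is accepted only if the similarity exceeds the threshold.
    """
    best_id, best_sim = None, threshold
    for track_id, features in galleries.items():
        sim = max((cosine_similarity(detection_feature, f) for f in features),
                  default=-1.0)
        if sim > best_sim:
            best_id, best_sim = track_id, sim
    return best_id


def update_gallery(galleries, track_id, feature, budget=100):
    """Store a matched feature and drop the oldest one beyond the budget."""
    gallery = galleries.setdefault(track_id, [])
    gallery.append(feature)
    if len(gallery) > budget:
        gallery.pop(0)


galleries = {1: [[1.0, 0.0]], 2: [[0.0, 1.0]]}
matched = best_match([0.9, 0.1], galleries)
if matched is not None:
    update_gallery(galleries, matched, [0.9, 0.1])
print(matched, len(galleries[1]))
```

Capping the gallery keeps the cost of each comparison bounded while still retaining several past appearances of the same passenger.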

This paper uses ResNet50 as the backbone and Market1501 as the dataset to train the final pedestrian re-identification model, with ResNet50-ibn as the baseline. The test results are shown in Table 3.

Table 3.

Market1501 datasets test results.

Datasets	Rank-1	Rank-5	Rank-10	mAP	mINP
Market1501	94.36	97.95	98.78	86.55	61.79

Similarly, this paper directly updates the appearance state of the i-th trajectory at the t-th frame by using the Exponential Moving Average (EMA) method, as shown in Formula (5):

$e_{i}^{t} = α e_{i}^{t - 1} + (1 - α) f_{i}^{t}$ (5)

where $e_{i}^{t}$ is the appearance state of the i-th trajectory at the t-th frame, $f_{i}^{t}$ is the appearance embedding of the current matching detection, and $α$ is the momentum term, set to 0.9. The EMA update strategy improves the matching quality and reduces the time consumption.
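Formula (5) translates almost directly into code. The short sketch below applies the update element-wise to plain lists, with `alpha` defaulting to the 0.9 stated above. The function name and the two-dimensional sample vectors are hypothetical and only serve the illustration.

```python
def ema_update(previous_state, new_embedding, alpha=0.9):
    """Exponential moving average of a track's appearance state, Formula (5)."""
    return [alpha * e + (1.0 - alpha) * f
            for e, f in zip(previous_state, new_embedding)]


# With alpha = 0.9 the stored state moves only 10% toward the new embedding.
state = ema_update([1.0, 0.0], [0.0, 1.0])
print(state)
```

A high momentum keeps the stored appearance stable when a single detection is partly occluded, since one noisy embedding can shift the state by at most a tenth.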

Experiment and data

Datasets and experimental environment

This paper selects the publicly available multi-object tracking dataset MOT16 to test the proposed algorithm and compare it with other advanced algorithms to analyze model performance. The experimental environment is based on the Ubuntu 16.04 operating system, Nvidia Titan XP 2080Ti graphics card, and 64 GB of RAM. The PyTorch 1.6.0 deep learning framework is used, and the implementation runs on a server with Python 3.7.

Evaluation indicators

In order to make the evaluation of the model more objective and accurate, and to compare it with other algorithms in a reasonable way, this paper uses commonly used evaluation metrics in the multi-object tracking field: Multi-Object Tracking Accuracy (MOTA), Multi-Object Tracking Precision (MOTP), Identification F1 Score (IDF1), and ID Switch (IDS). The formulas for some of the evaluation metrics are shown in Formulas (6) and (7):

$MOTA = 1 - \frac{N_{FN} + N_{FP} + N_{IDs}}{N_{GT}}$ (6)

$MOTP = \frac{\sum_{i, t} d_{t}^{i}}{\sum_{t} c_{t}}$ (7)

In the formulas, t represents the current frame number, where t ∈ [1, N]; $N_{GT}$ represents the number of true bounding boxes; $N_{FN}$ represents the total number of missed detections in the entire video; $N_{FP}$ represents the total number of false detections in the entire video; $N_{IDs}$ represents the total number of ID switches; $d_{t}^{i}$ represents the intersection-over-union (IoU) distance between the i-th predicted bounding box and the true bounding box in the t-th frame; $c_{t}$ represents the number of successfully matched targets in the t-th frame.
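Formulas (6) and (7) can be checked with a few lines of code. The sketch below computes both metrics from already-accumulated counts and per-frame match distances. The function names and all input numbers are made up for illustration and are not results from this paper.

```python
def mota(n_fn, n_fp, n_ids, n_gt):
    """Multi-Object Tracking Accuracy, Formula (6)."""
    if n_gt <= 0:
        raise ValueError("n_gt must be positive")
    return 1.0 - (n_fn + n_fp + n_ids) / n_gt


def motp(distances_per_frame):
    """Multi-Object Tracking Precision, Formula (7).

    distances_per_frame holds, for each frame t, the list of d_t^i values
    of its matched pairs, so the list length of frame t equals c_t.
    """
    total_distance = sum(sum(frame) for frame in distances_per_frame)
    total_matches = sum(len(frame) for frame in distances_per_frame)
    if total_matches == 0:
        raise ValueError("no matched targets")
    return total_distance / total_matches


# Hypothetical counts: 100 misses, 50 false positives, 10 ID switches, 1000 GT boxes.
print(mota(100, 50, 10, 1000))
# Hypothetical match values for two frames (two matches, then one match).
print(motp([[0.8, 0.9], [0.7]]))
```

Note that MOTA pools all three error types over the whole video, so it can become negative when the errors outnumber the ground-truth boxes.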

Experiment and analysis

For multi-object tracking, this paper chooses the MOT16 datasets for experiments and compares the results with several advanced multi-object tracking algorithms. Inspired by Zou et al.,³⁰ this paper considers adding the time required for the detector when calculating the speed, and the results are shown in Table 4.

Table 4.

Comparison results between this algorithm and other algorithms (best in bold).

Model	MOTA(↑)	MOTP(↑)	IDF1(↑)	IDs(↓)	FPS(↑)
SORT	59.8	79.6	53.8	1423	8.6
Deep-Sort(POI)	61.1	75.9	51.2	781	6.4
JDE	64.4	-	55.8	1544	18.5
TubeTK³¹	64.9	59.4	59.4	1117	1.0
Ours	64.1	81.5	63.20	690	5.5

Through metric analysis, it can be seen that the algorithm proposed in this paper has relative advantages compared to some advanced algorithms. Its MOTP, IDF1, and IDs metrics are the best among the compared methods, and its MOTA value is similar to that of the TubeTK algorithm. However, the MOTP metric of the TubeTK algorithm is too low and its real-time performance is poor, making it unsuitable for on-site deployment. The JDE algorithm is considerably faster than the proposed algorithm, but its ID switch frequency is too high, which is caused by the mutual occlusion of targets in dense pedestrian scenes. The scenario addressed in this paper is the passageway of an integrated hub station, which also has a large number of dense pedestrians. Therefore, the applicability and practicality of the algorithm proposed in this paper are stronger.

Ablation experiments

Table 5 shows the tracking results obtained when different YOLO-series models are used as the target detector for Deep-Sort. The results demonstrate the effectiveness of the improved YOLOv7 model based on the SimAM attention mechanism.

Table 5.

Tracking results of YOLO series plus Deep-Sort algorithm (MOT16 datasets, best in bold).

Model	MOTA(↑)	MOTP(↑)	IDF1(↑)
Deep-Sort(YOLOv3³²)	54.89	78.08	51.61
Deep-Sort(YOLOv4³³)	59.24	81.55	56.70
Deep-Sort(YOLOv5s)	60.10	80.22	56.70
Deep-Sort(YOLOv7)	60.93	80.97	55.40
Deep-Sort(YOLOv7 + SimAM)	62.23	80.57	61.46

Table 6 shows the ablation experiments performed on the MOT16 datasets for the improved part of this paper, namely the addition of the SimAM attention mechanism, the optimized KF (KF+) and the introduction of the Fast-ReID method. The results of the ablation experiments show the effectiveness of the improved part of this paper.

Table 6.

Results of ablation experiment (best in bold).

Method	SimAM	KF⁺	Fast-ReID	MOTA(↑)	MOTP(↑)	IDF1(↑)
Baseline(YOLOv7 + DS)	-	-	-	60.93	80.97	55.40
Baseline + column 1	√			62.23	80.57	61.46
Baseline + column 1–2	√	√		63.30	81.98	59.93
Baseline + column 1–3	√	√	√	64.11	81.51	63.20

Application of experimental scenarios

Experimental process

Station passages are typically divided into two types: transfer passages and regular passages. Regular passages are characterized by a relatively loose and non-fixed direction of passenger flow, and congestion tends to occur during peak hours. Transfer passages are more prone to flow line conflicts, resulting in personnel accidents within the station. Bi-directional passenger flow refers to passengers traveling in two different directions, as shown in Figure 4. The figure shows a scene of a subway station passage, with barriers separating the passenger flows in two different directions of travel.

Figure 4.

Site map of bi-directional passenger flow in station passage.

The scenario studied in this paper is bi-directional passenger flow in integrated hub station passages, as shown in Figure 5: bi-directional passenger flow refers to the passenger flow traveling in any rectangular box.

Figure 5.

Bi-directional passenger flow diagram of station passage.

This paper describes how bi-directional passenger flow tracking and statistics are carried out. (Examples are only shown for single-direction tracking methods, while subsequent experimental figures demonstrate bi-directional tracking.) First, target tracking algorithms are used to track passenger flows, and then marking lines are manually or systematically drawn to record the center trajectories of the tracked passenger flows. It is determined whether the center of the passenger flow trajectory intersects with the marking line. Once an intersection occurs, the passenger ID is recorded, and passenger flow tracking statistics are carried out as shown in Figure 6.

Figure 6.

Algorithm statistics rules.
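The counting rule can be sketched as follows. The example assumes a horizontal marking line at a fixed image row and image coordinates in which y grows downward, so a center moving from above the line to on or below it counts as "Down" and the opposite as "Up". The class name `LineCounter`, the line position and the sample trajectories are hypothetical. Recording IDs in sets means each passenger is counted at most once per direction.

```python
class LineCounter:
    """Count track IDs whose center trajectory crosses a horizontal line."""

    def __init__(self, line_y):
        self.line_y = line_y
        self.last_y = {}        # track_id -> previous center y
        self.down_ids = set()   # IDs that crossed from top to bottom
        self.up_ids = set()     # IDs that crossed from bottom to top

    def update(self, track_id, center_y):
        """Feed the newest box center of one track and record any crossing."""
        previous = self.last_y.get(track_id)
        self.last_y[track_id] = center_y
        if previous is None:
            return
        was_below = previous >= self.line_y
        is_below = center_y >= self.line_y
        if not was_below and is_below:
            self.down_ids.add(track_id)
        elif was_below and not is_below:
            self.up_ids.add(track_id)


counter = LineCounter(line_y=100)
trajectories = {
    1: [80, 95, 105],    # moves downward across the line
    2: [120, 101, 90],   # moves upward across the line
    3: [50, 60, 70],     # never reaches the line
}
for track_id, centers in trajectories.items():
    for y in centers:
        counter.update(track_id, y)
print(len(counter.down_ids), len(counter.up_ids))  # 1 1
```

A track that jitters around the line can still be registered in both sets, so a practical deployment may add a small dead zone around the marking line.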

For the visualization view, this paper does not record the passenger ID in the interface, but only displays the quantity of interest, namely the tracked passenger flow counts. Figure 7 shows some experimental data, where the black solid line in the middle of the image is the calibrated marking line. The “Cross Down” in the upper right corner shows the number of passengers passing through from top to bottom on the left side, and “Cross Up” shows the number of passengers passing through from bottom to top on the right side.

Figure 7.

Some experimental data.

Experimental results

After statistical analysis of the experimental data, as shown in Table 7, the number of passengers tracked by the algorithm in this paper passing from top to bottom (“Down”) was 116, while the actual number of passengers passing through was 125, resulting in a tracking accuracy of 92.80%. The number of passengers tracked by the algorithm in this paper passing from bottom to top (“Up”) was 164, while the actual number of passengers passing through was 175, resulting in a tracking accuracy of 93.71%. The overall average tracking accuracy was 93.26%.

Table 7.

Tracking statistics results of experimental data.

	Algorithm detection (Down)	Actual (Down)	Algorithm detection (Up)	Actual (Up)
Number of passengers in each direction	116	125	164	175
Tracking statistics accuracy in each direction (%)	92.80		93.71	
Total tracking statistics accuracy (%)	93.26
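
The figures in Table 7 can be checked with a few lines of arithmetic. The sketch below assumes that accuracy is defined as the algorithm count divided by the actual count, which is consistent with the reported values; the helper name `accuracy_percent` is hypothetical.

```python
def accuracy_percent(detected, actual):
    """Counting accuracy as a percentage of the actual passenger count."""
    return detected / actual * 100


# (algorithm count, actual count) per direction, taken from Table 7
counts = {"Down": (116, 125), "Up": (164, 175)}

per_direction = {name: accuracy_percent(d, a) for name, (d, a) in counts.items()}
mean_accuracy = sum(per_direction.values()) / len(per_direction)

# Alternative aggregate: pool both directions before dividing
total_detected = sum(d for d, _ in counts.values())
total_actual = sum(a for _, a in counts.values())
pooled_accuracy = accuracy_percent(total_detected, total_actual)

for name, value in per_direction.items():
    print(f"{name}: {value:.2f}%")
print(f"Mean of the two directions: {mean_accuracy:.2f}%")
print(f"Pooled over all passengers: {pooled_accuracy:.2f}%")
```

The mean of the two directions reproduces the 93.26% reported as the total accuracy. Pooling all 300 passengers instead gives 93.33%, so the two ways of aggregating differ only slightly here.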

Conclusion

This paper proposes an improved Deep-Sort algorithm based on YOLOv7. The SimAM attention mechanism is introduced into the YOLOv7 object detection algorithm, and within the Deep-Sort tracking algorithm the Kalman Filter (KF) is optimized and the Fast-ReID method is introduced, all with the aim of improving tracking performance. Finally, bi-directional passenger flow tracking is carried out in the passageways of an integrated hub station. Because of heavy occlusion in the station passageways, not every passenger can be counted. Nevertheless, the overall tracking accuracy of the algorithm exceeded 93% in our study, and its real-time performance is good, which suits practical on-site application and demonstrates the applicability and advanced nature of the proposed algorithm.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Key R&D Program of China under Grant No. 2022YFB4301305.

ORCID iD

Jianfan Wu

