Abstract
Keywords
Introduction
In recent years, unmanned aerial vehicles (UAVs) have been widely used in many fields, arousing great interest in both the military and civil domains. Military UAVs undertake danger-detection tasks, police and government UAVs perform safety and environmental monitoring, and civil UAVs are developing rapidly in the field of photography. UAVs have brought convenience to many aspects of our lives. 1 Under certain circumstances, however, UAVs violate personal privacy and disturb normal social life. Important places, such as airports and schools, need to restrict the entry of UAVs. Therefore, it is necessary to locate and track illegal and hidden UAVs in specific scenes. The mainstream detection methods include vision, radar, and electromagnetic wave detection. The vision system has the advantages of target visualization and active tracking, which can compensate for radar's inability to effectively identify targets and for the short range of electromagnetic detection.
A radar-based UAV positioning and tracking system relies on the reception and reflection of electromagnetic waves to determine the UAV's position. It is less affected by light and can operate in dark environments. 2 However, in a low-altitude environment, the electromagnetic signal emitted or reflected by the drone is easily submerged in background noise and is therefore difficult to distinguish. When the drone is far from the radar detector, high hardware costs are required to achieve detection.
Meanwhile, in a computer vision-based system, the position of the UAV is determined by visual sensors (such as cameras), a target detection algorithm, and a target tracking algorithm. It has the advantages of low sensor cost and strong anti-jamming ability.3,4 However, because the target is small, accuracy is insufficient when detecting long-distance targets, and the system is more vulnerable to the limitations of light and weather. 2 In practical UAV detection applications, radar and vision systems complement each other to meet the detection requirements of different scenarios.
The biological world often inspires industrial design. In recent years, research on robots and intelligent systems inspired by biological structures has received increasing attention. Wang et al. 5 demonstrated the potential of wing-body interaction (WBI) in the design of flapping-wing micro aerial vehicles (MAVs) that pursue higher performance. By analyzing the backward free flight of a dragonfly, Bode-Oke et al.6,7 found that wing-wing interaction could enhance the aerodynamic performance of the hindwings (HW) during backward flight. The system proposed in this paper is likewise inspired by the eyes of birds and models the focus of attention found in the biological visual system. Birds not only need to look far ahead in flight but also need to see nearby scenery underwater. They can adjust the curvature of the lens in the eye to a great extent, which gives them a larger field of view and lets them see distant objects clearly. 8 To better solve the problem of small target detection, the system uses binocular cameras with a strong zoom ability and uses a servo system to simulate the head movement of birds for target tracking.
In recent years, more and more scholars have made contributions in the field of UAV detection. Hoffmann et al. 9 proposed a method to detect and track micro UAVs using the multistatic radar NetRAD, combining the time-domain signal with the micro-Doppler signal; Shi et al. 10 integrated a variety of monitoring technologies to establish an anti-UAV system named ADS-ZJU to detect, locate, and radio-frequency-interfere with UAVs. In the field of visual UAV detection, Wang et al. 4 proposed a small flying target detection method based on Gaussian mixture background modeling in the compressed sensing domain and low-rank sparse matrix decomposition of the local image, while Li et al. 11 used a moving camera to detect and track UAVs through optical flow matching and Kalman filtering. Dorudian et al. 12 used an external RGB-D sensor and added a blind update model to adapt to sudden background changes.
Frame difference (FD), 13 Gaussian mixture model (GMM), 14 and ViBe 1 are classic moving target detection algorithms, and scholars have made various improvements over the years. Sengar and Mukhopadhyay 15 combined FD with the W4 16 algorithm. Zong et al. 17 proposed a deep autoencoding GMM for unsupervised anomaly detection. Zhou et al. 18 combined depth and color information for foreground segmentation to improve the ViBe algorithm. In target tracking, spatio-temporal context (STC) learning has been widely used. Cao et al. 19 presented a hierarchical-features-based tracker for STC learning, which enhances tracking performance by constructing a more robust model and designing more useful feature representations. A real-time updated learning rate and a fading factor were introduced to alleviate the occlusion-loss problem of the STC algorithm in Yang et al.'s 20 work. Xue et al.21,22 proposed multi-scale spatio-temporal context learning tracking, which formulates a low-dimensional representation named the fast perceptual hash algorithm to dynamically update long-term historical targets and the medium-term stable scene according to image similarity.
The systems proposed in the above articles have achieved good results in specific situations. However, for long-distance dim and small targets, the detection and tracking effects are still not ideal. The main problems lie in the inability to distinguish UAVs from birds, kites, and other objects, and in susceptibility to interference from the background environment. The main contributions of this paper are as follows: (1) detecting dim and small targets based on spatiotemporal continuity to improve detection efficiency; (2) adding a scale filter and introducing a loss criterion to optimize the tracking performance of the STC algorithm under scale change and target occlusion. In the test results, we show qualitatively and quantitatively that the proposed tracking method is superior to the old model in several evaluation indexes.
The remainder of this paper is organized as follows. Firstly, the hardware and software architecture of the system are introduced in the next section. The subsequent section discusses current moving target detection methods and our improvement based on spatiotemporal context. Then the current target tracking algorithms and our improvements to the scale filter and loss criterion are analyzed. The penultimate section analyzes the experimental test results. Finally, conclusions and future work directions are drawn in the last section.
System architecture
The system is composed of an optical zoom visible-light camera and a servo automatic tracking pan-tilt, as shown in Figure 1. Two HIKVISION DS-2ZCN3008 optical zoom network cameras are installed side by side on the YAAN YS3081 servo pan-tilt. The YAAN YS3081 can continuously rotate

Hardware composition of binocular vision system.
At the software level, the moving object detection algorithm runs first to detect suspicious moving targets. After a suspicious target is found, the servo motor moves the camera toward it so that the target is brought to the center of the field of view. The improved spatiotemporal context tracking algorithm is then used to track the target, and the slave camera adjusts its focal length to enlarge it. Finally, a deep learning algorithm recognizes the enlarged moving target, and the system decides whether to continue tracking according to the result.
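The detect-track-recognize cycle described above can be sketched as a small state machine. The state names and the decision function below are illustrative, not taken from the original implementation:

```python
# Hypothetical sketch of the system's software control loop.
# State names and transition rules are illustrative assumptions.
DETECT, TRACK, RECOGNIZE = "detect", "track", "recognize"

def next_state(state, target_found, target_confirmed_uav):
    """One step of the detect -> track -> recognize cycle."""
    if state == DETECT:
        # Suspicious motion found: slew the pan-tilt and start tracking.
        return TRACK if target_found else DETECT
    if state == TRACK:
        # Target centered: zoom the slave camera and classify it.
        return RECOGNIZE if target_found else DETECT
    if state == RECOGNIZE:
        # Keep tracking confirmed UAVs; otherwise fall back to detection.
        return TRACK if target_confirmed_uav else DETECT
    raise ValueError(state)
```

Losing the target in any state falls back to detection, matching the system's behavior of rerunning detection when tracking fails.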
Moving object detection algorithm
Analysis of current algorithm
The software architecture is shown in Figure 2. Traditional moving target detection algorithms mainly include the frame difference method, the background difference method, and the optical flow method. The frame difference method obtains the object contour through a difference operation on adjacent frames of the image sequence. The background difference method detects the object from the difference between a reference background and the video sequence. The optical flow method extracts object motion information from the temporal changes of pixels in the image sequence. The detection effect of these methods on weak and small UAV targets is relatively limited.23,24 The frame difference method cannot effectively distinguish noise from moving targets: during opening and closing operations, the background noise is removed, but weak and small targets are filtered out as well. Common background difference methods include Gaussian modeling and background modeling, which involve a large amount of calculation and cannot meet real-time detection requirements. The optical flow method is also computationally heavy and struggles to run in real time. Inspired by insect neurons, Wang et al. 25 proposed a small target motion detector neural network for small target detection in cluttered backgrounds. In the spatio-temporal domain, Li et al. 26 proposed a robust target tracking algorithm based on sparse representation and Bayesian inference that can estimate and predict the temporal and spatial structure of targets. According to the motion characteristics of UAVs, a first-in, first-out (FIFO) denoising structure and an improved method based on spatiotemporal context are proposed in this paper.
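As a minimal illustration of the frame difference method discussed above, the following sketch thresholds the absolute difference of two consecutive grayscale frames; the threshold value is an assumption, not a value from the paper:

```python
import numpy as np

def frame_difference(prev, curr, thresh=25):
    """Binary motion mask from two consecutive grayscale frames.

    Pixels whose absolute intensity change exceeds `thresh` are marked
    as moving (1); everything else is background (0).
    """
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return (diff > thresh).astype(np.uint8)

# A single bright pixel appearing between frames is flagged as motion --
# which is exactly why this method alone cannot separate noise from targets.
prev = np.zeros((8, 8), np.uint8)
curr = prev.copy()
curr[3, 3] = 200
mask = frame_difference(prev, curr)
```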

Software flow diagram of binocular vision system.
Moving object detection algorithm based on spatiotemporal context
The position of the target does not change abruptly between two adjacent frames of the image sequence; that is, there is continuity in time and a particular spatial relationship between the target and the surrounding background. The combination of this temporal and spatial information forms the spatiotemporal context. When the biological visual system focuses on a target, it concentrates on a specific area, paying more attention to points close to the target. Inspired by the biological visual system, this mechanism is used to calculate the prior probability of the tracking target in the STC algorithm.27,28
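The focus-of-attention idea above can be sketched as an STC-style context prior: the image intensity is weighted by a Gaussian window centered on the target, so nearer pixels contribute more. The normalization constant of the STC formulation is omitted here for simplicity:

```python
import numpy as np

def context_prior(intensity, center, sigma):
    """STC-style context prior (unnormalized sketch).

    intensity : 2-D array of gray values I(z)
    center    : (x, y) position of the target
    sigma     : scale of the Gaussian focus-of-attention window
    """
    h, w = intensity.shape
    yy, xx = np.mgrid[0:h, 0:w]
    dist2 = (xx - center[0]) ** 2 + (yy - center[1]) ** 2
    weight = np.exp(-dist2 / (sigma ** 2))  # attention decays with distance
    return intensity * weight

img = np.full((11, 11), 100.0)
prior = context_prior(img, (5, 5), sigma=2.0)
```

Pixels at the target center keep their full intensity, while equally bright pixels far from the target are strongly attenuated.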
If there is a small target at (

Moving target and noise region.

Spatio-temporal continuity of moving target and noise.
In Figure 3, the left side is the real scene, and the right side is the image binarized by the frame difference method. White pixels may be targets or noise. The white rectangular box is the target area, which contains UAVs. The pink rectangular box is a noise area, which contains pedestrians or shaking trees.
Figure 4(a) and (b) are enlarged views of the white and pink rectangular boxes in Figure 3. Figure 4(a) shows that the position of the target in the target area changes over a short time but remains within the initial window. Figure 4(b) shows the short-term position changes of noise points in the noise area.
The steps of moving target detection are as follows. The overall flow of the moving object detection algorithm is shown in Figure 5.

Basic flow of moving target detection FIFO.
Step 1, get the candidate target image
Step 2, accumulate the binarization results and set an appropriate threshold to remove noise. The candidate target points are selected from the gray-value image of the first frame in the channel. A 10 × 10 window is established with each candidate target at its center, and the gray values at each window position are accumulated over all images in the channel. Because isolated noise points are randomly distributed while the real target has spatio-temporal continuity, a gray-value threshold is set: if the accumulated result exceeds the threshold, a target is considered to be present in the window and its location is updated. Through this operation, most isolated noise points can be removed while the real target is retained.
Step 3, get the moving target. The noise in the window can be removed by dilating the gray values of the window area in the FIFO channel and then performing an "and" operation on the two adjacent differential images. If the target does not move for several frames, the "and" operation on adjacent frames will lose the real target. In this case, windows whose accumulated pixel value is 0 are removed first, and then the "and" operation is applied to the two adjacent frames; if the result exceeds a certain threshold, it is regarded as the target. Regarding FIFO size, a small FIFO cannot buffer the target for a short period, so the target is easily lost, while a large FIFO causes a significant delay in the tracking frame when the target is truly lost. Through testing, this article selects a FIFO of size 10, which achieves good detection results.
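The accumulation test in Step 2 can be sketched as follows: binary masks buffered in the FIFO are summed inside a window around each candidate point, and only persistent responses survive. The vote threshold is an assumed value, not the paper's:

```python
from collections import deque
import numpy as np

def accumulate_window(fifo, center, win=10, vote_thresh=6):
    """Sum the buffered binary masks inside a win x win window around
    `center` (row, col). A real target persists across frames, so its
    accumulated count is high; isolated noise appears in few frames."""
    y, x = center
    h = win // 2
    total = 0
    for mask in fifo:
        total += int(mask[max(0, y - h):y + h, max(0, x - h):x + h].sum())
    return total >= vote_thresh

# A target present in all 10 buffered frames vs. noise present in one.
fifo = deque(maxlen=10)
for i in range(10):
    m = np.zeros((32, 32), np.uint8)
    m[5, 5] = 1          # persistent target
    if i == 0:
        m[20, 20] = 1    # one-off noise point
    fifo.append(m)
```

With `maxlen=10`, `deque` drops the oldest mask automatically, matching the FIFO of size 10 chosen above.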
Improved STC tracking algorithm
Analysis of current target tracking algorithm
The UAV's image is small when it is far from the camera, with almost no texture features. At the same time, when the UAV moves from near to far or from far to near, the target scale changes significantly, so the tracking algorithm must be scale-adaptive. Besides, the UAV may resemble the background color or be covered by the background during flight, so the tracking algorithm needs to cope with complex backgrounds. The mainstream target tracking algorithms are compared on datasets of different scenes, including Staple (sum of template and pixel-wise learners29,30), STC, ECO-HC (efficient convolution operators with hand-crafted features 31 ), DSST (discriminative scale space tracking 32 ), BACF (background-aware correlation filters 33 ), CN (color names), CSK (exploiting the circulant structure of tracking-by-detection with kernels 34 ), and KCFv2 (kernelized correlation filter 35 ). All of them approach real-time detection (the STC algorithm reaches 350 FPS on an Intel Core i7 platform). The performance of each tracking algorithm is evaluated by the success rate curve and the area under the curve (AUC) enclosed with the coordinate axes.
The precision accurately reflects the performance of the tracker, while the success rate reflects the overlap between the output box of the tracking algorithm and the annotation box. Some tracking algorithms are sensitive to the selection of the initial tracking frame: giving initial boxes in different initial frames leads to different tracking results. Therefore, the initial target box is perturbed in space and time to test SRE (spatial robustness evaluation) and TRE (temporal robustness evaluation).
In the success rate analysis, we mainly consider the tracking success rate when the overlap rate threshold is above 0.4. In the precision analysis, we mainly consider the precision when the target center positioning error threshold is below 20 pixels. The different tracking algorithms are run on all sequences, including weak and small target sequences, scale transformation sequences, and complex scene sequences. Examples of tracking results are shown in Figure 6, and the results are analyzed in Tables 1 and 2. The figures in the tables represent the results at an overlap rate threshold of 0.4 or a target center positioning error threshold of 20 pixels.
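The two evaluation measures above reduce to two simple box computations: intersection-over-union for the success rate and center distance for the precision. A minimal sketch with boxes given as (x, y, w, h):

```python
import numpy as np

def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes, used for the
    success-rate curves (threshold 0.4 in the analysis above)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def center_error(box_a, box_b):
    """Euclidean distance between box centers, used for the precision
    curves (threshold 20 pixels in the analysis above)."""
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    return float(np.hypot(ax - bx, ay - by))
```

Sweeping the threshold over [0, 1] for the overlap ratio (or over pixel distances for the center error) and counting the fraction of frames that pass yields the success and precision curves; the AUC of the success curve is the summary score.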

Example of algorithm comparison results.
Success rate and precision evaluation results of SRE.
Success rate and precision evaluation results of TRE.
In Tables 1 and 2, in the SRE and TRE results over all sequences in the dataset, the STC, Staple, and ECO-HC algorithms have the higher success rates, and the STC, Staple, BACF, and ECO-HC algorithms have the higher precision. In the subdivided datasets, STC, Staple, and ECO-HC achieve relatively high success rates and precision on the weak and small target sequences; compared with the other algorithms, STC can track small targets stably and accurately. In the success rate and precision tests on the scale change sequences, STC, Staple, and DSST have high success rates, and STC, CN, Staple, and others have high precision. On the complex scene sequences, the precision and success rate of all tracking algorithms are generally low, with BACF scoring higher than the rest. The STC algorithm still updates its model when the target is occluded, which leads to model drift and tracking errors.
In the UAV detection scenario, the first requirement is to deal with the target changing from far to near: the algorithm should be highly accurate for weak and small targets and for scale change scenes. Secondly, because the above tracking algorithms cannot track accurately in complex scenes such as target occlusion, the tracking algorithm must have retrieval ability; when the target cannot be retrieved for a long time, the target detection algorithm is rerun. Therefore, this paper selects the STC algorithm, which has an outstanding tracking effect on weak and small targets and can still track stably under scale change, and improves its scale adaptation and its tracking performance in complex environments.
Improved STC tracking algorithm based on scale filter and loss criterion
Aiming at the problem that the STC algorithm cannot effectively adapt to scale transformation, and inspired by the idea of the DSST target tracking algorithm, 36 an improved scale filter is added to the original STC position filter to replace the simple scale adaptation method of the STC algorithm. Assuming that the size of the current target in the image is
In the above formula,
Perform DFT transformation on each dimension of the feature to obtain
In the formula,
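The DSST-style scale search described above can be sketched as follows. The learned scale filter is stood in for by a generic scoring function, and the scale count and step follow values commonly used with DSST, which may differ from the authors' settings:

```python
import numpy as np

def best_scale(score_fn, base_size, n_scales=33, step=1.02):
    """DSST-style scale search sketch.

    Evaluate the target at a pyramid of candidate sizes
    base_size * step**n, n in [-n_scales//2, n_scales//2], and keep the
    size with the highest filter response. `score_fn` stands in for the
    learned one-dimensional scale correlation filter."""
    exps = np.arange(n_scales) - n_scales // 2
    factors = step ** exps
    scores = [score_fn(base_size * f) for f in factors]
    return base_size * factors[int(np.argmax(scores))]
```

In the full algorithm the position filter first locates the target and the scale filter is then evaluated at the new position, so translation and scale are estimated by two separate, cheap one-dimensional searches.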
In view of the problem that the STC algorithm cannot effectively cope with complex scenes, it is necessary to distinguish whether the target is occluded or lost, and to stop updating the model when the target is lost, so as to avoid introducing wrong information that would cause subsequent tracking failure. The loss criterion
By analyzing the response of the STC tracking result, the following conclusions are obtained. When the tracking result is accurate, the confidence map is a single-peak two-dimensional Gaussian distribution and the peak is greater than zero. When the target is occluded or lost, the confidence map oscillates severely, with multiple peaks, and the maximum value may be less than zero, as shown in Figure 7. Therefore, the peak distribution of the confidence map can be used to determine whether the target is occluded and whether it has been lost. On the basis of the STC algorithm, the updating formula of the position filter model is improved as follows
where

Comparison of confidence map between normal target and occluded target.
where
As shown in Figure 8, when the target is occluded (about frame 70), the

Curve of
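The occlusion test described above can be sketched from the confidence map alone: a well-tracked target gives a single dominant peak above zero, while occlusion or loss produces several comparable peaks or a non-positive maximum. The thresholds and the peak-suppression radius below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def target_lost(conf_map, peak_thresh=0.0, ratio_thresh=0.5):
    """Loss-criterion sketch based on confidence-map peak distribution."""
    peak = conf_map.max()
    if peak <= peak_thresh:
        return True  # maximum can drop below zero when the target is lost
    # Suppress a neighbourhood around the main peak and compare the
    # strongest remaining response with it: multiple peaks -> unreliable.
    py, px = np.unravel_index(conf_map.argmax(), conf_map.shape)
    masked = conf_map.copy()
    masked[max(0, py - 2):py + 3, max(0, px - 2):px + 3] = -np.inf
    return masked.max() / peak > ratio_thresh
```

The position-filter model would then be updated only on frames where this test is false, so occluded frames do not corrupt the learned model.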
Analysis of test results
Analysis of moving target detection results based on spatio-temporal continuity
The frame difference method (FD), Gaussian background modeling (GMM), the ViBe algorithm (VIBE), and the target detection method based on spatio-temporal continuity proposed in this paper (OURS) are tested on an image sequence dataset of 6831 frames containing weak and small UAV targets. The detection results are shown in Figure 9.

Comparison of detection results.
The first row of Figure 9 shows the detection results of the frame difference method; the blue boxes are the detection result boxes. Due to slight camera jitter, the results at frames 60, 150, and 300 all contain noise caused by leaf shaking, and there is a large area of false detection around frame 150, which shows that the original frame difference method cannot effectively distinguish leaf shaking from real moving objects. The second row of Figure 9 shows the GMM background modeling results; the green boxes are the detection result boxes. Because the background model is not stable, the GMM algorithm produces a large number of false detections around frame 60. The third row shows the detection results of the ViBe algorithm; the pink boxes are the detection result boxes. Because every pixel must be modeled, the ViBe algorithm cannot meet real-time requirements, and its probability of missed detection is high.
The fourth row of Figure 9 shows the result images of the target detection algorithm based on spatiotemporal continuity proposed in this paper; the red boxes are the detection result boxes. The results show that the proposed algorithm can effectively cope with random noise and can accurately detect UAV and pedestrian targets.
Precision is defined as the ratio of the number of correctly detected targets to the number of all detected targets. Recall is defined as the ratio of the number of correctly detected targets to the number of real targets. The data analysis is shown in Figure 10.
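The two definitions above translate directly into code:

```python
def precision_recall(true_positives, detections, ground_truths):
    """Precision and recall as defined above.

    precision = correct detections / all detections
    recall    = correct detections / real targets
    """
    precision = true_positives / detections if detections else 0.0
    recall = true_positives / ground_truths if ground_truths else 0.0
    return precision, recall
```

For example, 8 correct detections out of 10 reported boxes against 16 real targets gives a precision of 0.8 and a recall of 0.5.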

Quantitative analysis of result of detection.
Analysis of improved STC tracking algorithm test result
STC, STC with scale variation (STCSCALE), and the improved STC algorithm proposed in this paper are used to track the UAV dataset sequences. The test results are as follows:
Tables 3 and 4 compare the SRE and TRE tests of the proposed algorithm on the weak and small target dataset and on the scale change dataset. Compared with the classic STC algorithm, the proposed algorithm shows a slight improvement in the success rate of weak and small target tracking, and a significant improvement on the scale change dataset sequences.
SRE and TRE test results on the dim target datasets.
SRE and TRE test results on the scale change datasets.
Figure 11(a) shows the tracking test results for the small and dim target; Figure 11(b) and (c) show the tracking test results for scale transformation sequences 1 and 2. The red box represents the STC algorithm, the green box STCSCALE, and the blue box the improved STC algorithm proposed in this article.

Examples of tracking results of STC, STCSCALE and OURS.
Besides, low computational complexity is a prime characteristic of the STC algorithm, in which only six FFT operations are involved in processing one frame. For the local context region of
In Table 5, the calculation speed of the proposed algorithm is compared with that of classical and recent algorithms. The algorithms are run on an Intel Core i7 3.4 GHz platform (a GTX 1050 Ti is used for recognition, not for tracking). When not integrated into the entire UAV monitoring system, the proposed algorithm performs well in processing time; after integration into the system, it can still meet real-time requirements.
Comparison of algorithms’ processing speed.
Analysis of dataset
YOLOv4 3 is trained to recognize the moving target and choose an initial target. Eleven video sequences in the anti-drone scene were shot and collected, totaling 14,705 pictures, and the ground truth was annotated, that is, the position coordinates of the drone in each picture were recorded. At the same time, in order to evaluate the tracking performance of different algorithms under different environmental factors and to select real-time algorithms suitable for anti-UAV scenarios, our work refers to Professor Wu Yi's benchmark 32 and labels the video sequences with several attributes, such as weak targets, scale change, and complex scenes, to facilitate specific analysis of the adaptability of the tracking algorithms to different scenes. The meaning of each attribute is as follows:
Normal target: Image sequences with target size greater than 100 pixels;
Weak targets: Image sequences whose target size is less than 100 pixels;
Scale change: There are image sequences in which the ratio of the target frame scales of the two images exceeds the interval
Complex background: The target is completely occluded or partially occluded in the scene, or is similar to the background color;
The examples of our dataset are shown in Figure 12. The composition of our dataset is shown in Table 6.

Examples of our dataset.
The composition of UAV sequences.
Conclusions
A binocular vision system for tracking UAVs is designed and built. The binocular camera is inspired by biology and modeled on biological visual focusing, giving the system good real-time performance and practicability. Detection efficiency is improved by detecting dim and small targets based on spatiotemporal continuity. The comprehensive performance of mainstream tracking algorithms is tested, and evaluations based on different quality indicators are given. The STC algorithm, which performs poorly in scale adaptation, is improved: inspired by DSST, a scale filter is added and a loss criterion is introduced to optimize the tracking performance of STC for scale-transformed and occluded targets. The improved STC algorithm achieves a rate of 58 fps. On these datasets, we show qualitatively and quantitatively that the proposed tracking method is superior to the old model in several evaluation indexes.
At present, some work still needs to be improved. First, the current system uses dual visible-light cameras, one to detect and track weak and small targets and the other to zoom in for a close view of the target; this ensures tracking robustness but wastes resources. A better scheme would use only one camera to complete target detection, recognition, and tracking. Secondly, although the improved STC algorithm is notably effective for tracking dim and small targets and scale transformation scenes, tracking accuracy in complex scenes still needs improvement. Given the strong adaptability of BACF to complex scenes, a state-machine tracking algorithm combined with BACF is a future research direction.
