Abstract
Introduction
With the development of mobile communication and smart devices, sharing sports experiences on the Internet has become very popular, with football or basketball videos uploaded to Facebook or WeChat. More than a lifestyle, these short videos also contain movement information about both the players’ arms/legs and the driving implements, such as a club, bat, cue, racket, hand, or foot, which can be used to coach beginners on how to apply their physical strength to improve their performance, giving this technology a wide prospect in the sports-training market. 1–6 In this article, we focus on obtaining swing movements by object tracking from golf videos, to provide good visualization for sharing on the Internet and meaningful data for golf training.
Object tracking is one of the most researched topics in computer vision and machine learning. In golf videos, the hand and club, which show the movements of the player’s strength and swing, are our objects of interest; they often move extremely fast and deform drastically, which are fundamental obstacles that all object tracking methods face. To be clear, the imaging quality in our golf videos is limited by several factors. Compared with the player’s whole body, the hand and club appear very small: with a video resolution of 720 × 1280 pixels (as shown in Figure 1 (with shielded face)), the hand patch is imaged at a resolution of 63 × 63 pixels (Figure 2) and the club patch at 48 × 48 pixels (Figure 3), both short of detail. The appearances of the hand and club change in sequence over one swing; each row in Figures 2 and 3 shows typical positions of one player along the backswing and downswing. The swing speed can reach 150 miles/h, and the club can hardly be seen during the fast downswing; the club of player B is blurred in Figure 1(b) and in the last two columns of the second row in Figure 3. Finally, the camera in a mobile phone performs poorly in dynamic shooting. All the videos in our application are shot by mobile phones from behind the players.

Frames of three players from videos with a resolution of 720 × 1280 pixels shot from the back side of the players.

Sequential appearances of the hands of the three players during a golf swing. The hand patch is sized at 63 × 63 pixels. The first row shows the hands of player A, the second player B, and the last player C. The first four columns in each row show the hand at the very beginning, above the knee, over the shoulder, and at the peak of the backswing, respectively. The last four columns show the hand at the beginning, at the peak, over the shoulder, and down to the knee of the faster downswing, respectively.

Sequential appearances of the clubs of the three players during a golf swing. The club patch is sized at 48 × 48 pixels. The first row shows the clubs of player A, the second player B, and the last player C. The first four columns in each row show the club at the very beginning, above the knee, over the shoulder, and at the peak of the backswing, respectively. The last four columns show the club at the beginning, at the peak, over the shoulder, and down to the knee of the faster downswing, respectively. Each patch here is from the same frame as the one in the corresponding position in Figure 2.
Therefore, in this article, we aim to identify the characteristics of the objects of interest and propose a tracking framework for golf video based on object recognition using machine learning with histograms of oriented gradients (HOG) and a spatial–temporal vector.
The tracking framework is introduced in the second section, and the performance evaluation of recognition and tracking is reported in the third section. In the fourth section, we summarize our method and discuss remaining problems.
Tracking framework
The tracking framework for golf video consists of three main parts:
Initialization
At initialization, the main task is to provide candidate windows for the objects so that object recognition can detect the initial hand and club within those candidate windows only. Since the initial objects (hand and club) are usually in fixed positions relative to the player’s body, we designed the following initialization strategy:
First, the player’s body is detected automatically according to Dollar et al. 7 with aggregated channel feature (ACF), which is not described in detail in this article.
Second, object windows for the hand and club are defined relative to the detected body box, as given by Equations (1) and (2).
Third, the hand and club patches are recognized by object recognition within the object windows defined above.
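To make the strategy concrete, the following sketch derives candidate hand and club windows from a detected body box. The fractional offsets are hypothetical placeholders for illustration only; they are not the values given by Equations (1) and (2).

```python
def object_windows(body_box):
    """Derive candidate hand/club windows from the detected body box.

    body_box: (x, y, w, h) of the player's body from the ACF detector.
    The fractional offsets below are illustrative placeholders, not the
    article's Equations (1) and (2).
    """
    x, y, w, h = body_box
    # Hand window: lower-front region of the body at the address position.
    hand_win = (x + int(0.4 * w), y + int(0.55 * h), int(0.5 * w), int(0.3 * h))
    # Club window: lower region extending beyond the body box toward the ball.
    club_win = (x + int(0.6 * w), y + int(0.75 * h), int(0.8 * w), int(0.35 * h))
    return hand_win, club_win
```

Defining the windows as fixed fractions of the body box keeps the recognition search area small, which is the point of the initialization strategy.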
Object trajectory prediction
Since the third part, object recognition, is time- and resource-consuming, this part estimates the object’s possible position on the current frame from its previous trajectory, to reduce the processing area for object recognition. Because a golf swing is approximately a circular movement centered on the shoulder joint, the trajectories of the hand and club are not irregular. Figure 4 shows the hand and club trajectories of two quite different players. The whole trajectories are complicated curves that are hard to express analytically, but neighboring positions in the trajectories are regular.

Hand and club trajectories of two players.
To simplify the prediction process, we assume that the object’s local trajectory is quadratic; that is, the coordinates on neighboring frames are modeled as quadratic functions of the frame index.
The coefficients of the quadratic are obtained by fitting the previous four object positions, and the fitted curve is evaluated at the next frame index to give the predicted position.
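Under a local-quadratic assumption with a four-position history, the prediction step can be sketched as follows; `predict_next` and its interface are illustrative, not the article’s code.

```python
import numpy as np

def predict_next(positions):
    """Predict the object's next (x, y) center from its last four positions.

    positions: list of (x, y) centers on the four previous frames.
    A degree-2 polynomial in the frame index is fitted independently
    to x and y, matching the local-quadratic assumption.
    """
    t = np.arange(len(positions))       # frame indices 0..3
    xs, ys = zip(*positions)
    cx = np.polyfit(t, xs, 2)           # quadratic coefficients for x(t)
    cy = np.polyfit(t, ys, 2)           # quadratic coefficients for y(t)
    t_next = len(positions)             # index of the incoming frame
    return (float(np.polyval(cx, t_next)), float(np.polyval(cy, t_next)))
```

For example, positions lying exactly on x = t², y = 2t extrapolate to (16, 8) at t = 4.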
Object recognition
Feature descriptor
Many methods exist to find and identify objects in an image or video even though the objects may vary in illumination, scale, or viewpoint; among them is the popular feature-based approach, which is also our choice in this article. Moreover, we prefer a simple feature descriptor to a complex one so that the tracking framework can be ported to mobile phones. Complicated descriptors designed for rotation and scale invariance, such as the scale-invariant feature transform (SIFT) 8,9 and speeded-up robust features, 10 are unnecessary in our golf videos, which contain barely any scale or rotation change. Because the simple descriptors Haar 11,12 and local binary patterns 13 are better suited to objects with rich texture, HOG, 14 which is sensitive to outline, is adopted here.
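As an illustration of the base descriptor, the following minimal sketch computes per-cell gradient-orientation histograms in the spirit of HOG. The 9-pixel cell, 9 orientation bins, and per-cell normalization are placeholder choices (block normalization is omitted), not the article’s configuration.

```python
import numpy as np

def hog_vector(patch, cell=9, bins=9):
    """Minimal HOG-style sketch: gradient-orientation histograms per cell.

    patch: 2-D grayscale array (e.g. the 63x63 hand patch).
    Each cell contributes a magnitude-weighted histogram of unsigned
    gradient orientations, L2-normalized per cell.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    h, w = patch.shape
    feats = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            a = ang[i:i + cell, j:j + cell]
            m = mag[i:i + cell, j:j + cell]
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist / (np.linalg.norm(hist) + 1e-6))
    return np.concatenate(feats)
```

For a 63 × 63 patch with these settings, the sketch yields 7 × 7 cells of 9 bins each, a 441-dimensional vector; the article’s actual descriptors are larger because they also include the spatial–temporal components.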
However, as mentioned above, the image quality of the objects is limited and the object appearances are projected differently within one swing. We found that learning from HOG alone is not sufficient and that other useful information needs to be fused in. Two further clues, one spatial and one temporal, are valuable in golf videos. The spatial clue is that the appearances of the hand and club correlate closely with the position within a golf swing: Figure 2 shows that early in the backswing and late in the downswing (columns 1, 2, and 8) the back of the fist is imaged, while late in the backswing and early in the downswing (columns 4, 5, and 6) the side of the fist is projected. The temporal clue is that the appearance sequences of the hand and club are similar across videos, so an object’s appearance can be inferred from the previous frame; Figures 2 and 3 show that different players’ hand and club appearances change similarly in sequence.
Therefore, the contribution of this article is to combine HOG’s sensitivity to object outline with the spatial and temporal regularities of object appearance. Unlike fusion across different channels, such as visual–tactile or visual–audio information, 15–21 we turn the object’s spatial and temporal information into feature vectors like HOG, yielding a complex feature descriptor composed of HOG and a spatial–temporal vector.
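The article does not spell out the exact encoding of the spatial–temporal vector at this point, so the following is only one hedged sketch of the fusion idea: append a normalized patch position (spatial) and swing progress (temporal) to the HOG vector. All names and the choice of encoding are assumptions.

```python
import numpy as np

def fused_descriptor(hog_vec, center, frame_idx, frame_size, n_frames):
    """Concatenate HOG with simple spatial and temporal vectors.

    Spatial component: patch center normalized by the frame size.
    Temporal component: swing progress (frame index / sequence length).
    Both are illustrative stand-ins for the article's exact encoding.
    """
    w, h = frame_size
    spatial = np.array([center[0] / w, center[1] / h])
    temporal = np.array([frame_idx / n_frames])
    return np.concatenate([hog_vec, spatial, temporal])
```

Because the extra components are plain feature values, the fused vector can be fed to the same boosted classifier as HOG alone.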
Training
Based on the above feature descriptor, we follow the adaptive boosting algorithm with the help of OpenCV, and a boosted classifier is trained. 22,23 The training configuration and classifier performance are shown in the Results section.
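The article trains with OpenCV’s boosting implementation; as an assumed stand-in, the sketch below trains an AdaBoost classifier with scikit-learn on synthetic descriptor vectors to illustrate the thresholdable score such a classifier produces. The data and settings are toy values, not the 13,287-positive/147,671-negative database.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-ins for descriptor vectors of object (1) and background (0).
rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 0.5, size=(200, 16))
X_neg = rng.normal(-1.0, 0.5, size=(200, 16))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 200 + [0] * 200)

# Boosted ensemble of decision stumps (scikit-learn's default weak learner).
clf = AdaBoostClassifier(n_estimators=50)
clf.fit(X, y)

# Signed score per sample; recognition compares it against a threshold.
scores = clf.decision_function(X)
```

The `decision_function` score plays the role of the classifier score that the article later thresholds at 7.5 (scikit-learn’s scores live on a different scale, so that exact value would not transfer).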
Recognition
Since the real object may not lie exactly at the predicted position, a search window centered on the prediction is scanned with the trained boosted classifier, and the patch whose score exceeds the threshold by the largest margin is taken as the object on the current frame.
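A recognition step of this kind can be sketched as a local sliding-window search around the predicted position. Here `score_fn`, the 4-pixel stride, and the 16-pixel search radius are assumptions; the 7.5 threshold is the value selected in the Results section.

```python
import numpy as np

def recognize_near(score_fn, frame, pred, patch=63, search=16, thresh=7.5):
    """Scan a neighborhood around the predicted position for the object.

    score_fn(pixels) -> classifier score for one candidate patch.
    Returns the top-left corner of the best-scoring patch, or None if no
    candidate reaches the threshold (object treated as absent).
    """
    best, best_pos = -np.inf, None
    h, w = frame.shape[:2]
    for dy in range(-search, search + 1, 4):      # coarse 4-pixel stride
        for dx in range(-search, search + 1, 4):
            x, y = pred[0] + dx, pred[1] + dy
            if 0 <= x and 0 <= y and x + patch <= w and y + patch <= h:
                s = score_fn(frame[y:y + patch, x:x + patch])
                if s > best:
                    best, best_pos = s, (x, y)
    return best_pos if best >= thresh else None
```

Returning `None` when every score falls below the threshold implements the skip-to-next-frame behavior described later for blurred or occluded objects.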
Results
Tracking based on the proposed descriptor with HOG and spatial–temporal vector was carried out, and its performance is analyzed below.
Training performance
As shown in Table 1, since the hand and club patches are sized at 63 × 63 and 48 × 48 pixels, respectively, the proposed descriptors of the hand and club have 2594 and 1802 dimensions, respectively. Our training database consists of 13,287 positive samples and 147,671 negative samples from 99 videos. These samples are divided randomly into two parts: 80% of the positive and negative samples are used only for training the adaptive boosted classifier, and the remaining 20% for testing its performance.
Configuration of the training database.
HOG: histograms of oriented gradients.
A score for every test sample is obtained from the boosted classifier, and the sample is recognized as the object or not by comparing the score with a threshold. Figure 5 gives the precision–recall curves of classifiers based on conventional HOG and on the proposed descriptor. Each point corresponds to a precision value (vertical) and recall value (horizontal) calculated at one score threshold, and a curve is drawn as the threshold varies from −15.0 to 20.0 in steps of 0.5. The red curve descends more sharply as recall approaches 1, which means both a higher precision rate and a higher recall rate can be achieved with a suitable score threshold, so the proposed descriptor is expected to outperform conventional HOG. That is why we propose this complex descriptor based on HOG and the spatial–temporal vector. The score threshold corresponding to the point (0.9706, 0.9706) on the red curve in Figure 5 is 7.5 and is applied in the following recognition and tracking.
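The threshold sweep behind such a curve can be computed directly from the held-out scores and labels; the helper below is a sketch of that computation, not the article’s code.

```python
import numpy as np

def pr_curve(scores, labels, lo=-15.0, hi=20.0, step=0.5):
    """(recall, precision) pairs as the score threshold sweeps lo..hi.

    scores: classifier scores for the test samples; labels: 1 = object.
    """
    pts = []
    for t in np.arange(lo, hi + step, step):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))     # correctly accepted objects
        fp = np.sum(pred & (labels == 0))     # background accepted in error
        fn = np.sum(~pred & (labels == 1))    # objects rejected in error
        prec = tp / (tp + fp) if tp + fp else 1.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        pts.append((float(rec), float(prec)))
    return pts
```

Plotting precision against recall over these points reproduces a curve of the kind shown in Figure 5.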

Precision–recall performance.
Tracking performance
Figure 6 shows the initialization results with the strategy of player’s body–object window–object. The black box is the detected player’s body using ACF; the blue and red boxes are the object windows defined by Equations (1) and (2); the small white boxes are the hand and club patches recognized by the trained boosted classifier within those object windows. The results show that the initial hand and club are correctly found.

The tracking framework was applied to the videos of players A, B, and C with our proposed method. Because the hands have smaller movements, the hand tracking of players A, B, and C is completely correct, shown as blue and yellow trajectories in Figures 7–9, respectively. The clubs, however, can move very fast, and their tracking is only basically correct. Player A is imaged well, and the club tracking is completely correct. Player B is imaged at dawn, and the club tracking failed in 4 frames of a 280-frame video: as shown in the seventh image of Figure 8, the fast-moving club is seriously blurred late in the downswing and is not recognized. Player C is imaged very well, but the club tracking failed in 4 frames of a 140-frame video: as shown in the seventh image of Figure 9, the club can hardly be seen even by a human observer when it moves in front of the leg, and so cannot be tracked.

Hand and club tracking of player A. The body position is shown in the black box. The initial hand and club are in white patches; the current hand and club in the current frame are in blue and red patches, respectively. The first row is from the backswing: the blue dots and red dots form the trajectories of the hand and club, respectively. The second row is from the downswing: the yellow dots and green dots form the trajectories of the hand and club, respectively.

Hand and club tracking of player B. The annotations are the same as in Figure 7.

Hand and club tracking of player C. The annotations are the same as in Figure 7.
In our application, if no patch with a sufficiently high score is found in the sliding window, we simply conclude that the object is absent and the program skips to the next frame. This avoids misrecognition and keeps the tracking trajectories reliable when the object is blurred or occluded. The hand trajectories in blue and yellow and the club trajectories in red and green in Figures 7–9 are achieved by our proposed method.
Conclusion
In this article, a hand and club tracking framework using machine learning, based on a descriptor combining HOG and a spatial–temporal vector, is proposed to improve tracking performance for golf video. After the hand and club are recognized in initial windows positioned by the body region, the boosted classifier trained with the proposed complex descriptor performs recognition and tracking in a search window predicted by trajectory fitting over the previous four object positions. The boosted classifier achieves precision and recall rates both better than 97%, and the hand and club tracking is basically correct in our testing videos, which were shot in different outdoor conditions. The tracking results provide visualization of object movements and can be used to derive other desired information.
Since our golf video database is not yet large enough, popular deep learning methods have not been applied in our framework. In the future, as we collect more videos, further work can improve tracking performance with deep learning for videos shot at night, on overcast days, or in other unfavorable conditions.
