Introduction
Performing arts such as dance and theater belong to the domain of intangible cultural heritage and risk fading into oblivion if not properly archived. It is particularly difficult to convey the expressive content of dance to an audience without showing the bodily movements explicitly. To digitize the full-body motion of a dancer, optical motion capture systems 1 are commonly used due to their high accuracy in generating motion data by tracking a set of external markers. However, these markers, attached to the body, often restrain the dancer's dynamic movements and affect the performance during the motion capture session.
In contrast, marker-free systems can produce motion data without markers or tightly fitted suits. Recently, the availability of off-the-shelf depth sensors such as Microsoft Kinect 2 has drawn much attention as a way to track and capture human motion in real time. While systems with a single Kinect sensor suffer from body occlusion and rotation problems, 3 systems with multiple sensors can capture a dancer's posture from different angles and combine these views into continuous motion data.4–7 However, these approaches have mainly targeted the capture of relatively simple steps and dance gestures, and they are not suitable for the complicated and dynamic movements in ballet, modern, and K-pop dances.
In this article, we propose a marker-free motion capture and composition system that can generate motion capture data from expert dancers and compose new dance performances for general users from the captured data. Figure 1 shows an overview of our system in two processes: motion acquisition and motion composition. During the motion acquisition process, our capture system tracks the bodily movements of an expert dancer from the input data, that is, RGB and depth (RGB-D) images retrieved from multiple RGB-D sensors, and constructs a sequence of three-dimensional (3D) skeletal postures. By archiving the expert motion data as a database in the main server, our system provides an authoring tool for general users to compose various dance performances on client devices such as tablets and kiosk PCs. During this composition process, the user can search for a specific motion clip throughout the motion database using his or her own posture and synthesize continuous motion by displacing the searched motion clips on a specified path.

Figure 1. System overview of dance motion acquisition and composition.
Our system makes two main contributions. First, we provide a marker-free system that combines off-the-shelf color and depth sensors to capture full-body dance motions. To do this, a particle filter–based method using raw sensor data (RGB-D images) is introduced to track the 3D skeletal postures of an expert dancer. In addition, we present an iterative closest point (ICP)-based method for unifying the skeleton data retrieved from multiple sensors. Second, with the expert motion data, our system can be used to compose various dance performances for a general user. We provide two authoring methods for intuitive motion composition: an online motion search based on the user's posture and a motion synthesis that displaces multiple motion clips on a specified motion path, both of which aid the user's composition activities on client devices.
The rest of this article is organized as follows. Previous approaches for human motion capture with marker-free systems are reviewed in section “Related work.” The motion acquisition process from an expert dancer and the motion composition process for a general user are detailed in sections “Motion acquisition” and “Motion composition,” respectively. The experimental results are demonstrated in section “Experimental results.” We conclude this article with a discussion of future improvements in section “Conclusion.”
Related work
Over the years, human motion capture has been actively studied by many researchers, especially in the computer vision field. Human motion acquisition based on vision techniques is well surveyed by Moeslund et al. 8 and Chen et al. 9 Without using external markers, most vision-based approaches can be grouped into three categories: generative (also known as model-based), discriminative, and hybrid approaches.
The generative approaches10–14 usually rely on an explicit model of the human body and try to estimate the model parameters that best describe the pose in an input image. Aguiar et al. 10 used a highly detailed 3D model to deform mesh appearances by estimating 3D correspondences on multi-view images. Similarly, Gall et al. 11 produced both skeletal motion and mesh deformation by fitting a template model onto silhouettes extracted from multi-view images. Ganapathi et al. 12 introduced a real-time system that tracks human motion from a sequence of depth images based on a probabilistic temporal model; in their approach, a set of physical constraints is used to deform a simplified body model. Using a Gaussian mixture model without establishing a point correspondence between the template model and the subject, Ye and Yang 13 tracked the skeletal motion and a rough mesh model of articulated objects in real time. With two synchronized RGB-D sensors, Michel et al. 14 adopted a stochastic optimization technique to track skeletal motion from the depth volume. However, the prerequisite template model and the initialization of its parameters make these approaches difficult to apply to different types of dance movements without additional data.
In contrast, the discriminative approaches15–18 try to identify body parts directly from the input images via a learning process. Michoud et al. 15 introduced a 3D shape estimation method that tracks body motions based on silhouettes segmented from multi-view images; its tracking performance depends on the size of the image set used for the segmentation. Given a labeled training set of image patches, Plagemann et al. 16 used local shape descriptors to detect salient body parts. Shotton et al. 17 estimated 3D joint positions from a single depth image by training a randomized decision forest classifier with a large image set. Recently, Jung et al. 18 achieved a large performance gain for 3D human pose estimation by training a regression tree for each joint.
Backed by an existing database, the hybrid approaches19–23 try to improve the tracking accuracy by complementing the generative methods (i.e. the optimization problems) with the discriminative methods (i.e. the database references). Ganapathi et al. 19 developed an interactive system that detects body parts throughout a kinematic chain in a stochastic framework. Combining their tracker with a randomized decision forest classifier, 17 Wei et al. 21 tracked skeletal motion in real time; this work was further improved by Zhang et al. 23 with the use of additional sensor data. Baak et al. 20 made extensive use of database references to search for similar poses with salient body part detection. 16 Later, Helten et al. 22 presented a similar approach with a personalized tracker that can estimate various body shapes. However, both the discriminative and hybrid approaches require a large database in advance of the tracking process, which is not suitable for capturing various motion types from different dancers.
Recently, the availability of off-the-shelf sensors such as Microsoft Kinect 2 has made it possible to capture real-time human motion in a cost-effective way. As a single Kinect can suffer from self-occlusion and body rotation problems, 3 multiple Kinects can be used to maximize the tracking performance all around a dancer. Berger et al. 4 and Zhang et al. 5 utilized multiple Kinects for posture estimation; however, their methods targeted non-skeletal motion data. For dance motion, Kitsikidis et al. 6 adopted a hidden conditional random field (HCRF) classifier to recognize motion patterns fused from multiple Kinects. Baek and Kim 7 presented a similar approach for combining the postures, mixing the five joint segments tracked by the multi-Kinects system. However, the dance movements in these approaches are slow and simple, whereas our system mainly targets more dynamic motions such as ballet, modern, and K-pop dances.
When dance motions are given as a set of example data, Fan et al. 24 established a relationship between the input music and motions based on dynamic programming and synthesized dance motions that are synchronized to the music. Panagiotakis et al. 25 synthesized a new sequence of motion patterns from periodic examples using a motion graph. Unlike these approaches, our system focuses on synthesizing continuous motions by displacing a set of short motion clips on an arbitrary motion path.
Motion acquisition
System setup
To capture dynamic motion from an expert dancer without using external markers, our system consists of multiple high-speed RGB 26 and time-of-flight (ToF) depth 27 sensors. As shown in Figure 2, a set of RGB-D sensors is configured at hexagonal positions in a green-walled studio with light-emitting diode (LED) lights. These sensors cover the dancer from all sides of the capture space.

Figure 2. Overview of motion capture system with multiple RGB-D sensors.
Table 1. Sensors used for the motion acquisition and composition.
Skeletal motion tracking
The proposed system generates 3D skeletal motions by tracking the input data, which are a set of RGB-D images retrieved from multiple RGB-D sensors. At first, the tracking process is initialized by registering an articulated skeleton model (initial joint positions) retrieved from the Kinect sensor, 2 which is controlled by the main server. This skeleton model consists of 21 internal and end joints in a hierarchical structure as shown in Figure 3. Due to the limited reconstruction of skeleton postures from a single depth image, 17 the dancer should face toward the sensor with all the joints visible (i.e. a T-pose or an N-pose) during the skeleton registration.

Figure 3. Skeletal motion tracking at different angles: (a) a hierarchical skeleton model used for output, (b) RGB input images, (c) depth input images, and (d) filtered RGB-D input images.
Provided with the skeleton model, our system tracks the 3D skeletal posture of the dancer with a particle filter that alternates a prediction step and a filtering step. In the prediction step, a Gaussian sampling process with the prior function generates candidate postures from the previous frame, and the filtering step weights the candidates against the current RGB-D input to estimate the posture.
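The particle filter update used for posture tracking can be sketched as follows. This is an illustrative outline only, under assumptions not stated in the article: the posture is treated as a flat joint-parameter vector, the prior is a Gaussian around each resampled particle, and the observation likelihood is a Gaussian on the distance to a hypothetical observed pose vector (in practice the likelihood would be evaluated against the RGB-D images themselves).

```python
import numpy as np

def track_posture(prev_particles, prev_weights, observation,
                  n_particles=200, motion_noise=0.02, obs_noise=0.05, rng=None):
    """One step of a generic particle filter over a flat pose vector.

    Illustrative sketch, not the article's exact formulation.
    Returns (particles, weights, posture estimate).
    """
    if rng is None:
        rng = np.random.default_rng()
    # Resample particles according to the previous weights.
    idx = rng.choice(len(prev_particles), size=n_particles, p=prev_weights)
    resampled = prev_particles[idx]
    # Prediction step: Gaussian sampling around each resampled particle.
    predicted = resampled + rng.normal(0.0, motion_noise, resampled.shape)
    # Filtering step: weight each particle by agreement with the observation.
    d2 = np.sum((predicted - observation) ** 2, axis=1)
    weights = np.exp(-d2 / (2.0 * obs_noise ** 2))
    weights /= weights.sum()
    # Posture estimate: weighted mean of the particles.
    estimate = weights @ predicted
    return predicted, weights, estimate
```

In a full tracker this update would run once per captured frame, with the likelihood replaced by a comparison between each candidate skeleton and the filtered RGB-D images.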
Skeletal motion unification
As shown in Figure 4, the visible joints and their tracked positions differ from one sensor to another over time. At each frame, our system unifies them into one skeletal motion by transforming each of the skeletons into a reference coordinate system and selecting, for each joint, the tracked position with the minimum positional difference among the sensors.

Figure 4. 3D skeletal motion tracked by multiple RGB-D sensors.
When one of the RGB-D sensors is selected to provide the reference skeleton, the skeletons tracked by the remaining sensors are aligned to it via the ICP method: given a correlation matrix computed between the corresponding joint positions of two skeletons, the rigid transformation that minimizes the alignment error is derived and applied before the joints are unified.
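Because the joint correspondences between two tracked skeletons are known, the closed-form rigid alignment that ICP iterates can be applied directly to the joint sets. The sketch below uses the standard SVD (Kabsch) solution over the cross-covariance matrix of the two point sets; it is an illustration, not the article's exact derivation.

```python
import numpy as np

def align_skeleton(source, reference):
    """Rigidly align `source` joint positions (N x 3) to `reference` (N x 3).

    Illustrative sketch: returns rotation R and translation t such that
    R @ s + t best matches the reference joints in the least-squares sense.
    """
    src_c = source.mean(axis=0)
    ref_c = reference.mean(axis=0)
    # Cross-covariance matrix between the centered joint sets.
    H = (source - src_c).T @ (reference - ref_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    t = ref_c - R @ src_c
    return R, t
```

Applying the recovered transformation to every joint of a sensor's skeleton brings it into the reference coordinate system before the per-joint selection.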
As shown in Figure 5, two point clouds with different body orientations are aligned into a unified skeleton model via the ICP method.

Figure 5. Skeleton unification: (a) two point clouds with different body orientations and (b) the unified skeleton model via the ICP method.
Motion database
The skeletal motion produced by the tracking and unification processes is archived in the motion database on the main server. A fixed length of each bone segment is maintained so that the skeleton size remains consistent over the captured frames.
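Maintaining fixed bone lengths can be sketched as a per-frame pass over the joint hierarchy that rescales each bone along its tracked direction. The 4-joint chain, parent indices, and lengths below are hypothetical placeholders for the article's 21-joint model.

```python
import numpy as np

# Hypothetical hierarchy: parent index per joint (root has parent -1).
PARENTS = [-1, 0, 1, 2]
# Fixed length of the bone from each joint's parent (root entry unused).
BONE_LENGTHS = [0.0, 0.5, 0.4, 0.3]

def enforce_bone_lengths(joints):
    """Rescale each bone to its fixed length while keeping its direction.

    Processes joints in hierarchical order so each parent is already fixed.
    Illustrative sketch of keeping the skeleton size consistent per frame.
    """
    fixed = np.array(joints, dtype=float)
    for j, parent in enumerate(PARENTS):
        if parent < 0:
            continue  # the root joint stays where it was tracked
        direction = fixed[j] - fixed[parent]
        norm = np.linalg.norm(direction)
        if norm > 1e-9:
            fixed[j] = fixed[parent] + direction / norm * BONE_LENGTHS[j]
    return fixed
```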
Motion composition
Figure 6 shows an overview of motion composition on client devices such as tablets and kiosk PCs. As the size of the motion database in the main server grows quickly through capturing dance movements with high-speed RGB-D sensors, browsing the entire database to find desired motion data becomes a time-consuming task for a general user. Furthermore, each motion clip in the database contains a fixed motion path, requiring the user to edit it onto an input path for motion composition. Our system eases these difficulties by providing two authoring methods: an online motion search with the user's posture and a motion synthesis on a specified motion path.

Figure 6. Overview of motion composition on a client device.
Online motion search
To search for a specific motion clip throughout the motion database, our system represents the user's posture as a set of posture parameters: feature vectors defined from the end-effector positions and normal vectors defined for the body orientation, as shown in Figure 7.

Figure 7. Posture parameters used for online motion search: (a) skeleton structure of a user posture, (b) a set of feature vectors defined from end-effector positions, and (c) normal vectors defined for body orientations.
Given the posture parameters of the user and those of each frame in the motion database, our system measures the distance between the two parameter sets and returns the motion clips whose postures yield the smallest distances.
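A minimal version of the online posture search can be sketched as a nearest-neighbor query over feature vectors built from end-effector positions relative to the root joint. The joint indices below are hypothetical, and the body-orientation normal vectors of Figure 7 are omitted for brevity.

```python
import numpy as np

def posture_features(joints, root=0, end_effectors=(4, 8, 12, 16, 20)):
    """Feature vector: end-effector positions relative to the root joint.

    The indices are hypothetical placeholders for the 21-joint skeleton.
    """
    j = np.asarray(joints, dtype=float)
    return (j[list(end_effectors)] - j[root]).ravel()

def search_motion(database, query_joints, k=3):
    """Return indices of the k database postures closest to the query."""
    q = posture_features(query_joints)
    feats = np.stack([posture_features(p) for p in database])
    dists = np.linalg.norm(feats - q, axis=1)
    return np.argsort(dists)[:k]
```

In practice each database entry would index a frame inside a motion clip, so the nearest postures identify the clips to present to the user.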
A searched motion clip can then be displaced on a specified motion path during motion synthesis.
Motion synthesis
As shown in Figure 6, when an arbitrary path is drawn on the ground of the virtual stage, our system fits a curve with a set of control points to the input points and displaces the searched motion clips along the fitted path. The temporal location of each motion clip on the path can be adjusted via the timing editor (Figure 8).

Figure 8. Displacement of multiple motion clips on a specified motion path: three different motion clips are placed on the path via the timing editor shown. Here, the path (red) with a set of control points (yellow) is fitted to the input points (black).
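Displacing motion clips along a drawn path amounts to mapping each clip's temporal location to a position on the path. The sketch below uses a piecewise-linear path parameterized by normalized arc length as a simple stand-in for the fitted curve with control points.

```python
import numpy as np

def point_on_path(control_points, s):
    """Point at normalized arc length s in [0, 1] on a piecewise-linear path.

    Illustrative stand-in for the fitted path: a clip's temporal location
    maps to an arc-length fraction, which this converts to a ground position.
    """
    pts = np.asarray(control_points, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)  # segment lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])       # cumulative lengths
    target = np.clip(s, 0.0, 1.0) * cum[-1]
    i = np.searchsorted(cum, target, side="right") - 1
    i = min(i, len(seg) - 1)
    t = (target - cum[i]) / seg[i] if seg[i] > 0 else 0.0
    return pts[i] + t * (pts[i + 1] - pts[i])
```

A clip starting at 75% of the edited timeline would then be rooted at `point_on_path(path, 0.75)` on the stage.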
Experimental results
We demonstrated the applicability of our system by capturing various dance movements from expert dancers and providing general users with the captured motion data for dance motion composition. As the RGB and depth sensors in the system capture 120 and 30 frames per second (fps), respectively, we set
Dance motion capture
As shown in Figure 9, various dance movements are captured from expert dancers in ballet, modern, and Latin dance without using external markers or a special suit. The dancers wear the ordinary clothes they use during practice sessions, avoiding only clothing in the color of our studio walls. To evaluate the tracking accuracy, our system is compared against the multi-Kinects system 7 that consists of four Kinect sensors. At the same time, the ground-truth data are captured by a commercial system 38 that uses a set of inertial sensors embedded in a wearable suit.

Figure 9. Dance motion capture from (a) a ballet, (b) a modern, and (c) a Latin dancer with a 3D character model.
Due to the differences in skeleton structure and size between the compared systems, a direct comparison between two skeleton postures is inaccurate for tracking tests. For this reason, we adopted an online motion retargeting method, which maps the different skeleton structures onto a template one, 39 and then used the positional differences of the joints in the global coordinate system as an accuracy measure. Figure 10 compares the tracking accuracy of our system against the multi-Kinects system based on the ground-truth data captured by the commercial system. In this comparison, a total of 30,728 frames (about 256 s) are captured from ballet, modern, and Latin dances, which include dynamic movements such as cross steps, rapid turns, stretches, and high jumps. As a result, our system tracks the dance movements at an average accuracy of 89.5% for the wrist, 82.5% for the ankle, and 92.0% for the head joints against the ground-truth data, making it considerably more accurate than the multi-Kinects system (69.3% for the wrist, 58.8% for the ankle, and 74.3% for the head joints). It is noticeable that the accuracy for the legs is lower than for other parts due to the higher noise levels around the dancer's feet in the depth images. In addition, the tracking accuracy of the legs in Latin dance is relatively lower than in the other dances. This is mainly because parts of the dancer's legs are occluded by the stage costume, which affects the tracking performance during the motion acquisition process. However, this decrease in accuracy is small compared to that of the multi-Kinects system.
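A thresholded per-joint accuracy measure of this kind can be computed as follows. The exact criterion behind the reported percentages is not specified in the text, so treating them as the fraction of frames whose global positional difference falls under a fixed threshold is an assumption made here.

```python
import numpy as np

def joint_accuracy(tracked, ground_truth, threshold=0.1):
    """Per-joint accuracy as % of frames within `threshold` of ground truth.

    tracked, ground_truth: arrays of shape (frames, joints, 3) in a shared
    global coordinate system (after motion retargeting). The threshold
    value is an assumption for illustration.
    """
    diff = np.linalg.norm(np.asarray(tracked) - np.asarray(ground_truth), axis=2)
    return 100.0 * (diff < threshold).mean(axis=0)
```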

Dance motion composition
For the dance motion database used by general users, we captured motion data from ballet, modern, K-pop, Latin, and traditional Korean dancers. Table 2 shows the total frames and processing time for capturing each type of motion with our system. All the motion data in the database are down-sampled to 30 fps to speed up the online search with the user's posture.
Table 2. Dance motion database: archived at 30 fps.
As shown in Figure 11, our system provides user interfaces that let the user draw a motion path and displace the searched motion clips directly on the device screen. Figure 12 shows an instance of synthesizing four motion clips on the specified path into one continuous motion. Each of the motion clips is searched throughout the database with a length of

Figure 11. User interfaces for motion composition.

Figure 12. Motion synthesis: (a) searched motion clips with a fixed path and (b) displaced motion clips on a specified path.

Figure 13. Composition of various dance performances on stages: (a) ballet, (b) K-pop, and (c) traditional Korean dance with corresponding motion clips used.
Conclusion
In this article, we have introduced a motion capture system that can track dynamic movements in dance motions without using external markers. Based on the RGB-D cues retrieved from multiple RGB-D sensors, our system can generate 3D skeletal motions from various expert dancers. To compose dance performances on virtual stages for general users, our system provides authoring methods that can search a set of desired motion clips from the given motion database and synthesize the stage scene by displacing the motion clips on a specified path. As demonstrated in the experimental results, various dance performances can be produced on the client devices in intuitive and efficient ways. In practice, our system can be used to archive various dance performances into the motion database to be used for theater stage plans, movement education, and ultimately, heritage preservation in dance.
One ongoing improvement to the current system is enhancing the motion quality around the feet. Due to the ambiguity between the feet and the ground they touch, the high noise levels in the depth images degrade the tracking performance of our system. We expect that attaching a small, lightweight inertial sensor to each foot can provide more precise joint rotations without restricting the dancer's freedom of performance. In addition, using an articulated template model with image segmentation techniques is a potential solution for better motion quality. 40
The tracking performance of our system is also affected by the type of dress. During a motion capture session, it is not unusual for the expert dancer to wear a voluminous costume that makes it difficult to track the bodily movements inside. We are currently working on the extraction of skeletal postures under such costumes based on a probability model.
The current system requires numerous sensors and system PCs to capture motions, incurring a high system cost. Depending on the dance type and acceptable processing time, the system configuration can be scaled down by removing some of the sensors and servers. For example, for slower and simpler types of dance, three or four RGB-D cameras in optimal capturing positions can generate comparable motion data. The number of servers can be reduced if the processing time for motion data generation is not critical for dance composition. Finally, a predefined skeleton template can be used as the initial posture model to eliminate the need for the Kinect sensor in our system.
