Abstract
Introduction
Augmented reality is a live view of a physical, real-world environment whose objects or elements are augmented by computer-generated sensory input such as sound, video, or graphics. It enhances one’s current perception of reality and is frequently used in the military, visual art, video games, and education. 1 Reconstruction of real-world objects and projection mapping are two techniques for achieving an augmented reality experience. For indoor static objects, real-time reconstruction techniques are very mature.2,3 However, real-time reconstruction of flexible or unsteady surfaces under desultory interference noise is still a big challenge. The ability to reconstruct flexible surfaces with fine-grained motions in live circumstances would enable many applications. For example, virtual fitting rooms (VFRs) can improve the shopping experience by letting customers try on all kinds of dresses and accessories without being physically present in the retail shop. 4 In addition, children may dig into or heap up sand in a sand pool, and the deformed sand surface can be captured for real-time feedback. Besides real-time reconstruction, systems such as IllumiRoom and RoomAlive have brought projection mapping into the consumer domain.5,6 This notion of spatial augmented reality (SAR) 1 augments physical objects with virtual content through projection mapping, which is easier to share among spectators than virtual reality systems based on head-mounted devices.
Despite considerable advances in the field of reconstruction, there has been limited research on real-time techniques that work on flexible surfaces under desultory interference noise. Special cases such as hands, faces, or full bodies have been studied for years, and researchers have demonstrated compelling real-time reconstructions of non-rigid articulated motion 7 and shape. 8 However, these cases rely on strong priors based on pre-learned statistical models or morphable shape models, which cannot work in general scenes. As for projection mapping, many existing projection systems require a static setup and handle only simple interactions (e.g. activity recognition). Nevertheless, many novel interactive scenarios require rendered content to be projected on a flexible, deforming surface, 9 which demands more convenient kinds of interaction. Systems like RoomAlive or the Augmented Reality Sandbox 10 are expensive and hard to set up, which greatly restricts their application scenarios.
In this article, we introduce an RGB-depth (RGB-D) sensor–based interactive system for immersive augmented reality applications. As demonstrated, our system allows live smooth surface reconstruction and high-quality projection mapping without any specific shape model priors or markers. Moreover, our system provides several convenient forms of interaction, giving spectators a more immersive experience. The hardware setup is easy and the calibration procedure is fast, while the real-time performance is pleasing and compelling. The main contributions of our work are as follows:
A real-time interactive system with an RGB-D sensor and a projector that uses consumer hardware for an immersive augmented reality experience, providing real-time reconstruction, dynamic projection mapping, and convenient interactions.
A fast, adaptable, and precise calibration schema to estimate models for correction and transformation of raw depth data.
An optimization framework for denoising and stabilizing the updates of the target surface mesh, bringing spectators a comfortable and fluid experience.
Related work
There has been a lot of work on three-dimensional (3D) mapping over the past few decades, mostly for automatic localization and reconstruction. Earlier techniques rely on expensive sensors such as range sensors or laser rangefinders to generate raw point clouds for 3D mapping. 11 Alternatively, a stereo-camera setup 12 is used in the reconstruction field, and the depth of each point is computed using stereo-matching algorithms. The emergence of RGB-D sensors, such as the Microsoft Kinect, has brought new interest in real-time 3D mapping, as exemplified by systems like KinectFusion.2,13 With the appearance of SAR techniques, it is therefore a natural next step to think about developing augmented reality applications by real-time capture of flexible surfaces using RGB-D sensors and dynamic projection mapping using projectors. Follow-up research based on KinectFusion mostly focused on modeling objects, especially humans (e.g. for 3D printing or avatar modeling), where the target object rotates in front of the Kinect while remaining steady.14,15 These systems aim to produce a single mesh or 3D model by analyzing and merging an RGB-D sequence, whereas we intend to reconstruct the target object during live capture and perform projection mapping simultaneously.
With the availability of consumer depth sensors, further real-time reconstruction approaches have been proposed.16,17 With three hand-held Kinects, Ye et al. 18 captured multi-person skeletal poses using a markerless algorithm. Furthermore, for targets such as hands, arms, and human bodies, researchers have demonstrated impressive real-time reconstructions of articulated motion.7,19 However, these approaches rely on strong priors based on pre-learned statistical models 7 or morphable shape models, 8 which cannot work in general scenes. In addition, these real-time methods are unable to reconstruct target-object detail under desultory interference noise.
With the development of SAR techniques, there has been extensive work on enabling augmented reality scenarios using projection mapping. 20 Some early work explored the application of projection mapping on large and curved surfaces, which could be semi-automatically calibrated, 1 and much more work has followed on projection systems, calibration techniques, and sensor–projector networks. In recent years, the problem of projecting and interacting across multiple flat surfaces has received more attention,21,22 and some researchers use a moving hand-held projector and a passive pen to interact with multiple virtual information spaces embedded in a physical environment. 23 These systems mainly studied the blending of multiple projections on flat or pseudo-flat surfaces. However, they have not explored deforming or moving surfaces.
More recently, with the emergence of real-time RGB-D sensors, projector systems have begun to deal with more complex geometries. For example, IllumiRoom 5 explored projecting around the periphery of a television screen to augment traditional gaming experiences with projected visualizations, and RoomAlive 6 deployed multiple projectors and sensors within an entire room to bring new interactive projection mapping experiences. However, neither of these systems deals with deforming surfaces or moving scenes. In follow-up work, Lee et al. 12 demonstrated projecting on a moving target, which required infrared markers and magnetic sensors and showed many limitations. Besides, for texture mapping with interactive behavior, DisplayObjects 24 used a marker-based tracking system to track the target physical models. Interactive systems like the Augmented Reality Sandbox 10 allow users to create topographic models by shaping real “kinetic” sand, but the calibration procedure is complex and error-prone; in addition, its content is designed only for earth science education and is hard to extend. In the systems introduced above, the projectors and sensors are assumed to remain static and support only limited interaction. In this article, projection mapping focuses on deforming surfaces with convenient interactions in real time while handling desultory interference noise.
Our system combines flexible surface reconstruction and dynamic projection mapping, demonstrating real-time performance for both simultaneously. Using an RGB-D sensor–based system, our approach does not require any specific shape model prior (e.g. a pre-defined surface model or terrain model) and provides smooth surface reconstruction and high-quality projection mapping during live capture. With a fast, precise, and adaptable calibration schema, the devices in our system are easy to use and set up. In addition, in terms of reconstructed geometry detail and real-time performance, our system can run on a regular consumer personal computer (PC), which brings us closer to highly immersive augmented reality applications for consumer scenarios, including gaming, teaching, and human–computer interaction.
System overview
Our interactive system is designed for indoor applications of real-time surface reconstruction and dynamic projection mapping. The RGB-D sensor (e.g. Microsoft Kinect) provides the system with a raw depth stream of the target surface, which is used to perform real-time reconstruction and generate a fine mesh at every time step. Furthermore, we use a projector to illuminate the target surface with the right color at every pixel, based on the generated mesh. To bring spectators an excellent augmented reality experience, precise calibrations are essential. With the models estimated during the calibration procedure, the raw depth stream can be corrected and transformed to the color frames of the projector.
The system pipeline is depicted in Figure 1 and comprises two phases: calibration and real-time performance. The calibration procedure needs some manual operations for sampling. Because we use a chessboard pattern and the RGB sensor, recognition and sampling are easier and faster than in the Augmented Reality Sandbox, 10 which uses a disk and the depth sensor. The whole calibration procedure takes about 5 min, while that of the Augmented Reality Sandbox 10 may take nearly 10 min. The RGB-D sensor provides the depth stream of the target surface. To compensate for erroneous depth information, an optimization framework is applied after the transformation and correction, and a denoising method is used to remove meaningless noise. According to a given scenario or color scheme, the renderer then computes the pixel colors for the projector.

Pipeline of the interactive system with RGB-D sensor and projector. In the calibration part, the correction model and transformation model are estimated through base-plane calibration and projector calibration, using the RGB-D sensor and projector. In the real-time performance part, the depth stream of the target surface from the depth sensor is processed by the solver, interpolation, stabilization, and smoothing; the pixel colors are then projected onto the target surface by the projector.
Hardware setup
An exemplary setup with a regular consumer PC, an RGB-D sensor, a single projector, and a target surface is depicted in Figure 5. The regular PC consists of an Intel Core i3 6100 (3.7 GHz), 8 GB of memory, and an NVIDIA GeForce GTX 1050. As the RGB-D sensor we use a Microsoft Kinect v2, whose depth frame has a resolution of 512 by 424 pixels. For projection mapping, we use an Epson CB-S04 with a resolution of 1024 by 768 pixels for the demo. White quartz sand in a 1.2 m by 0.9 m box serves as the target surface, because sand easily forms different terrains and flexible surfaces.
Base-plane calibration
The RGB-D sensor can hardly be mounted to “look at” the target surface exactly vertically. So even when the target surface is flat, the depth data collected by the RGB-D sensor are not uniform, which would make the pixel colors projected on the surface visibly inconsistent. To avoid this poor augmented reality experience, base-plane calibration is performed to estimate a base-plane model of the target surface, ensuring that the processed depth data are uniform when the target surface is flat. To perform this calibration, we pave the sand in the box to form a visually horizontal plane. Then, random sampling is performed with the depth sensor to calculate the correction model, which is described in detail in section “Calibration schema.”
Projector calibration
In this step, we need an extrinsic calibration between the projector and the depth sensor. We project specific sequences of chessboard patterns onto planes at different heights above the target surface and recognize them in the color stream captured by the RGB sensor. The RGB sensor is only used for this step. Similar to base-plane calibration, sequential sampling is performed to calculate the transformation model, which is described in detail in section “Calibration schema.” We have only a single projector to calibrate in our exemplary setup. In a multi-projector system, the calibration procedure is similar for each projector, estimating one transformation model per projector.
Optimization
The raw depth data of the target surface captured by the depth sensor are discrete and contain many erroneous or missing values. To provide a good visual experience, we implement an optimization framework to remove noise, repair the mesh, and match the difference in resolution between the depth sensor and the projector. In the real-time performance phase, spectators’ interactions with the target surface affect the surface mesh. To filter out undesired disturbances, we present a stabilizing strategy to update the surface mesh, which brings spectators a smooth and fluent experience.
Calibration schema
The calibration schema includes two steps, as shown in Figure 2. Step one estimates a base-plane model for correcting depth data in the subsequent calibration procedure and during real-time demonstrations, while step two estimates a transformation model, which is only used for projection mapping in the real-time demonstration phase. Both estimations require sampling and solving for model parameters, but both are easy to perform and work robustly.

Illustration of two-step calibration schema.
The flowchart of the calibration procedures is depicted in Figure 3. We mark the target surface’s boundary by simple interactions and calculate the base-plane model automatically. The process for the transformation model is a little more complicated. We need to sample a sequence of known patterns (i.e. chessboards) following pre-defined rules, and the base-plane model estimated in step one is used to correct the depth data at each sample point pair. With enough sample point pairs, the transformation model is estimated in seconds, and the model estimation error is reported at the same time.

Flowchart of the calibration approach: the exact number of sample points is 15, which is discussed in section “Results.”
Base-plane model
In our interactive system, the RGB-D sensor is required to be mounted to look directly at the target surface, which means that when the target surface is flat or in its initial state, the depth data collected by the RGB-D sensor should be uniform. This setup requirement is essential for a perfect visual experience during real-time projection mapping but is hard to achieve in practice. In this step, we present a convenient procedure to eliminate the deviation caused by an inclined RGB-D sensor.
First, the target surface (i.e. sand or clothing) has to be arranged into a visually flat plane, which we treat as having the same standard depth everywhere; we call this particular plane “the base-plane.” Then, we mark the target surface’s boundaries in the depth image captured by the RGB-D sensor. We randomly sample a series of points in the marked area based on the Latin hypercube sampling algorithm, 25 which makes the samples more representative of the real variability than traditional random sampling. We use
where
Based on least-squares approach, we construct objective function of this model estimation problem as
Here,
The equations can be rewritten in matrix form
Until now, the desired linear equations, which can be solved by many methods (e.g. Jacobi iteration or lower–upper (LU) decomposition), have been obtained. Finally, we can use the base-plane model to correct the raw depth data by
where
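The equations themselves are not reproduced in this text; as a minimal sketch, assuming the base-plane is modeled as a plane d = a·u + b·v + c over depth-image coordinates and fit by least squares (the function and variable names below are illustrative, not the paper’s):

```python
import numpy as np

def fit_base_plane(samples):
    """Least-squares fit of a plane d = a*u + b*v + c to sampled
    (u, v, d) depth points; returns the coefficients (a, b, c)."""
    pts = np.asarray(samples, dtype=float)
    A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    # solve the over-determined system A @ (a, b, c) ~= d
    coeffs, *_ = np.linalg.lstsq(A, pts[:, 2], rcond=None)
    return coeffs

def correct_depth(u, v, d, plane, d0):
    """Remove the tilt predicted by the base-plane so that a flat
    surface maps to the constant reference depth d0."""
    a, b, c = plane
    return d - (a * u + b * v + c) + d0
```

With the tilt removed, a flat sand surface produces the same corrected depth at every pixel, which is exactly the property the base-plane calibration is meant to guarantee.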
Transformation model
Projection mapping is a very important part of our interactive system. To ensure it works perfectly, a precise transformation model is needed. In this step, we present a convenient and compelling calibration approach, which is also fast and easy to perform. The derivation of the particular approximate model is explained first, and then the specific calibration procedure is elaborated.
We define two 3D orthogonal spaces with Cartesian coordinate system called camera space and projector space.
With this matrix equation, each point in camera space can be mapped to its corresponding point in projector space. However, in order to continuously adapt the right color of each projector pixel,
where
Combining the transformation matrix,
Here, we use subscript
where
And,
Here,
The calibration procedure for the transformation model is shown in Figure 3. The important parts of the whole process are described below.
Project known pattern
We use a
Recognize
Sample points in screen space are easily obtained, because the sequences of chessboard patterns are already known. We recognize the chessboard pattern in the RGB image captured by the RGB-D sensor at each position of the projection sequence. Then, each point recognized on the chessboard is mapped to the depth image based on the alignment of the RGB and depth images. We use OpenCV to perform the recognition, and 15 point pairs are collected in total to estimate the transformation model.
Estimate model
As illustrated above, the transformation model can be estimated by solving the parameter estimation problem. The non-zero solution of this problem is
Estimate error
After the calibration procedure, we can estimate the root-mean-square (RMS) error of the transformation model by analyzing the variance between the sampling points and the mapping results. Let
The physical meaning of
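Since the RMS formula itself is not reproduced above, here is a minimal sketch of the error computation, assuming it is the root-mean-square Euclidean distance between the points produced by the transformation model and the corresponding sampled points (function and argument names are illustrative):

```python
import numpy as np

def rms_error(predicted, observed):
    """Root-mean-square distance between points mapped by the
    transformation model and the corresponding sampled points."""
    diff = np.asarray(predicted, float) - np.asarray(observed, float)
    # per-point squared Euclidean distance, averaged, then rooted
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
```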

Calibration result test: the transformation model has an
We only have one single projector in our exemplary setup. However, Microsoft Kinect v2 has a depth image resolution of
Real-time reconstruction
After the calibration procedures are done, we obtain the base-plane model and the transformation model, which are used for depth data correction and space transformation. This is necessary preparatory work before the real-time demonstration. In this section, we introduce an optimization framework, including denoising and stabilizing, which is the main process enabling projection mapping to achieve an immersive augmented reality experience. The whole process of real-time reconstruction is roughly as follows:
Get raw depth data from RGB-D sensor.
Use calibration models to calculate the real depth data and corresponding pixel.
Perform image restoration algorithm to intelligently fix missing depth data.
Use stabilizing algorithm to determine whether the current frame needs to be discarded or not.
Perform smoothing filter to achieve a smoother result.
Calculation
The raw depth data read from the RGB-D sensor cannot be used directly for projection mapping. They have to be corrected by equation (7) to obtain the real depth data, and each pixel of the depth image has to be transformed by equation (13) to obtain the corresponding pixel on the projector screen. The calibration models are used in this phase to perform correction and transformation on each pixel of the depth image read from the RGB-D sensor.
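Equation (13) is not reproduced in this text; as an illustrative sketch only, assuming the transformation can be expressed as a homogeneous projective matrix applied to a corrected depth sample (the 3 × 4 matrix `P` and the function name are assumptions, not the paper’s derivation):

```python
import numpy as np

def map_to_projector(P, u, v, d):
    """Apply a 3x4 homogeneous projective transformation P to a
    corrected depth sample (u, v, d), returning the projector
    pixel (x, y) after the perspective divide."""
    x, y, w = P @ np.array([u, v, d, 1.0])
    return x / w, y / w
```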
Interpolation
After correction and transformation, a two-dimensional (2D) depth matrix with many missing values is obtained. This matrix will be used to generate the virtual sand surface displayed by the projector. However, we must first interpolate the missing values, as mentioned above. In this section, we introduce an image restoration algorithm 26 to do so. The specific steps are as follows:
Assume the 2D depth matrix
To maintain a real-time frame rate, we use a resizing method based on the nearest neighbor (NN) algorithm to properly reduce
We use
Using the inverse of the same resizing method, we enlarge
Since the resizing process may cause data loss, we cannot directly use
Finally,
Stabilization
To eliminate undesirable interactions (e.g. waving a hand quickly or unconsciously putting one’s head under the RGB-D sensor), we present a stabilizing strategy to update the surface mesh and achieve natural and fluent interactions. The pseudocode of the multi-frame stabilization algorithm is given below (Algorithm 1).
As illustrated in Algorithm 1, the essence of the multi-frame stabilization algorithm is to determine whether a new depth value is valid. If the new depth value stays fairly stable over several frames, the algorithm regards it as caused by an effective interaction and updates the surface with it. The condition of multi-frame stabilization is
Here,
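Algorithm 1 is not reproduced in this text; the sketch below is one plausible per-pixel reading of the multi-frame stabilization idea, where a changed value is committed only after it stays within a tolerance for several consecutive frames (the class name and the `frames` and `tol` parameters are assumptions):

```python
import numpy as np

class DepthStabilizer:
    """Accept a changed depth value only after it stays within `tol`
    of its candidate for `frames` consecutive frames."""
    def __init__(self, shape, frames=3, tol=5.0):
        self.frames = frames
        self.tol = tol
        self.stable = np.zeros(shape, np.float32)     # displayed depth
        self.candidate = np.zeros(shape, np.float32)  # pending value
        self.count = np.zeros(shape, np.int32)        # stability counter

    def update(self, depth):
        depth = np.asarray(depth, np.float32)
        near = np.abs(depth - self.candidate) <= self.tol
        # consistent with the candidate: count up; otherwise restart
        self.count = np.where(near, self.count + 1, 1)
        self.candidate = np.where(near, self.candidate, depth)
        # commit candidates that have been stable long enough
        commit = self.count >= self.frames
        self.stable = np.where(commit, self.candidate, self.stable)
        return self.stable
```

Under this reading, a one-frame spike (a hand sweeping through the sensor’s view) never reaches the displayed mesh, while a sustained change (reshaped sand) is committed after a few frames.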
Smoothing
A low-pass filter can eliminate the aliasing that appears at the edges of depth mutations and provide a better visual experience. In this article, we apply the filter to the real depth data along each column and each row, so each pixel is processed twice per pass. Experiments demonstrate that after repeating this procedure twice, the depth data show a smooth transition, which leads to a comfortable visual experience.
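A minimal sketch of such a separable low-pass pass, assuming a simple moving-average kernel (the kernel size and number of passes are illustrative, not the paper’s filter design):

```python
import numpy as np

def smooth_depth(depth, k=5, passes=2):
    """Separable moving-average low-pass: filter along each row, then
    along each column; repeating the pass softens depth-mutation edges."""
    kernel = np.ones(k, np.float32) / k
    out = np.asarray(depth, np.float32)
    for _ in range(passes):
        # axis 1: each row; axis 0: each column (each pixel touched twice)
        out = np.apply_along_axis(np.convolve, 1, out, kernel, mode='same')
        out = np.apply_along_axis(np.convolve, 0, out, kernel, mode='same')
    return out
```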
Optimization
To improve the efficiency of the real-time reconstruction process, corresponding parallel optimization schemes are used in the calculation, interpolation, and stabilization steps.
For calculation, we use a graphics processing unit (GPU) parallel optimization scheme: since the calculation for each pixel of the raw depth image is independent, we use a Compute Shader to compute in parallel. The interpolation step is mainly image processing, so we rely on the OpenCV image library’s accelerated resizing and inpainting methods. For stabilization, we use the central processing unit (CPU) to create multiple parallel threads that perform the stability judgment on each pixel.
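As a rough CPU-side illustration of the multi-threaded stability judgment, the sketch below splits the depth image into row bands and judges each band in a thread pool (the simplified single-frame test and all names are assumptions; in Python specifically, a process pool may parallelize better because of the GIL):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def stabilize_chunk(depth, stable, tol):
    """Per-band judgment: keep the old value unless the new one
    differs by more than tol (a simplified per-pixel test)."""
    return np.where(np.abs(depth - stable) > tol, depth, stable)

def parallel_stabilize(depth, stable, tol=5.0, workers=4):
    """Split the image into row bands and judge each band in a pool,
    mirroring the multi-threaded CPU scheme described above."""
    bands = np.array_split(np.arange(depth.shape[0]), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda rows: stabilize_chunk(depth[rows], stable[rows], tol),
            bands)
    return np.vstack(list(parts))
```

Because every pixel’s judgment is independent, the bands can be processed in any order and simply stacked back together, which is the same property that makes the Compute Shader version embarrassingly parallel.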
Results
In this section, we evaluate the approach described above, in particular comparing the calibration models estimated under different conditions. In addition, we detail some example applications based on our calibration procedures and optimization framework.
We use Kinect v2 as RGB-D sensor, Epson CB-S04 as projector, and Unity3D Engine for visual display. In addition, the display system consists of a regular PC with an Intel Core i3 6100 (3.7 GHz) and NVIDIA GeForce GTX 1050, and a large sand container full of white quartz sand (Figure 5). For the sake of convenience, we call it

System illustration: we use Microsoft Kinect v2 as RGB-D sensor, Epson CB-S04 as projector, and a 1.2 m by 0.9 m box, which is full of white quartz sand as the target surface.
Calibration results comparison
Many factors can affect the result of the calibration procedures. In this section, we choose the mounting height of the Kinect v2, the ambient illuminance, and the number of sample points as conditions for comparing the
Calibration results comparison: the mounting height of the Kinect v2, the ambient illuminance, and the number of sample points are the three conditions. 1.70 and 2.50 m are the two heights at which the Kinect v2 is mounted; 50, 100, and 200 are three levels of ambient illuminance; and 5, 10, 12, 15, and 20 are the sizes of five different groups of sample points. Numbers in the last column are the calculated
Example applications
We demonstrate three example applications in this section, after the calibration procedures are done. They share one typical calibration model, which has an

The Summer: deep blue for oceans, bright green for plants, with flying flowers, this scene represents summer. Corresponding time of each frame is listed as follows: (a) −00:03, (b) −00:06, (c) −00:08, (d) −00:18, (e) −00:23, and (f) −00:29.

The Grasslands: light blue for rivers, grass green for lands, with trees and animals, this scene represents grasslands. Corresponding time of each frame is listed as follows: (a) −00:01, (b) −00:12, (c) −00:20, (d) −00:26, (e) −00:38, and (f) −00:51.

The Beaches: blue for oceans, green for lands, khaki for beaches, with seagulls and volcanoes, this scene represents beaches. Corresponding time of each frame is listed as follows: (a) −00:02, (b) −00:08, (c) −00:15, (d) −00:22, (e) −00:29, and (f) −00:35.
The real-time performance of the different applications is shown in Table 2. When the frame rate is above 20 frames per second (FPS), spectators have a good visual augmented reality experience.
Real-time performances: applications are run on a regular PC with an Intel Core i3 6100 (3.7 GHz), 8 GB of memory, and NVIDIA GeForce GTX 1050.
CPU: central processing unit; FPS: frames per second.
Conclusion
In this article, we have introduced a combined hardware and software solution for surface reconstruction and dynamic projection mapping in real time, which contains a robust calibration schema and a highly efficient optimization framework. For the calibration models, we present a base-plane model to fix the attitude deviation of the Kinect and a transformation model for perfect projection mapping. The objective function of the transformation model is derived by space transformation and solved by LU decomposition, which proves fast and effective. After a series of experiments, we found that the number of sample points has a significant impact on the calibration result, and a proper value is around 15, as shown in Table 1. To provide a good visual experience, we implement an optimization framework to remove noise, repair the mesh, and match the difference in resolution between the depth sensor and the projector. In addition, we present a stabilizing strategy to update the surface mesh, which runs well on a regular PC and brings spectators a smooth and fluent experience.
Our system leaves a lot of potential for the future. For example, it could be extended to multiple projectors and RGB-D sensors, which would enable more complex interaction and bring more challenges as well. Another interesting direction for future research is changing the medium of projection mapping, that is, projecting onto arbitrary real-world objects rather than simply sand. 27
