Abstract
Introduction
Augmented reality is a live view of a physical, real-world environment whose objects or elements are augmented by computer-generated sensory input such as sound, video, or graphics. It enhances one’s current perception of reality and is frequently used in the military, visual art, video games, and education. 1 Reconstruction of real-world objects and projection mapping are two techniques for achieving an augmented reality experience. For indoor static objects, real-time reconstruction techniques are very mature.2,3 However, real-time reconstruction of flexible or unsteady surfaces under desultory interference noise is still a big challenge. The ability to reconstruct flexible surfaces with fine-grained motions in live circumstances would enable many applications. For example, virtual fitting rooms (VFRs) can improve the shopping experience by letting customers try on all kinds of dresses and accessories without being physically present in the retail shop. 4 In addition, children may dig into or heap up sand in a sand pool, and the deformed sand surface can be captured for real-time feedback. Besides real-time reconstruction, systems such as IllumiRoom and RoomAlive have brought projection mapping into the consumer domain.5,6 This notion of spatial augmented reality (SAR) 1 augments physical objects with virtual content through projection mapping, which is easier to share among spectators than virtual reality systems based on head-mounted devices.
Despite considerable advances in the field of reconstruction, there has been limited research on real-time techniques that work on flexible surfaces under desultory interference noise. Special cases such as hands, faces, or full bodies have been studied for years, and researchers have demonstrated compelling real-time reconstructions of non-rigid articulated motion 7 and shape. 8 However, these cases rely on strong priors based on pre-learned statistical models or morphable shape models, which cannot work in general scenes. As for projection mapping, many existing projection systems require a static setup and handle only simple interactions (e.g. activity recognition). Nevertheless, many novel interactive scenarios require rendered content to be projected on a flexible, deforming surface, 9 which demands more convenient kinds of interaction. Systems like RoomAlive or the Augmented Reality Sandbox 10 are expensive and hard to set up, which greatly restricts their application scenarios.
In this article, we introduce an RGB-depth (RGB-D) sensor–based interactive system for immersive augmented reality applications. As demonstrated, our system allows live smooth surface reconstruction and high-quality projection mapping without any specific shape model priors or markers. Moreover, our system provides several convenient forms of interaction, giving spectators a more immersive experience. The hardware setup is easy and the calibration procedure is fast, while the real-time performance is pleasing and compelling. The main contributions of our work are as follows:
A real-time interactive system with an RGB-D sensor and a projector that uses consumer hardware for an immersive augmented reality experience, providing real-time reconstruction, dynamic projection mapping, and convenient interactions.
A fast, adaptable, and precise calibration schema to estimate models for correction and transformation of raw depth data.
An optimization framework for denoising and stabilizing the updates of the target surface mesh, bringing spectators a comfortable and fluid experience.
Related work
There has been a lot of work on three-dimensional (3D) mapping over the past few decades, mostly for automatic localization and reconstruction. Earlier techniques rely on expensive sensors such as range sensors or laser rangefinders to generate raw point clouds for 3D mapping. 11 Alternatively, a stereo-camera setup 12 is used in the reconstruction field, and the depth of each point is computed using stereo-matching algorithms. The emergence of RGB-D sensors, such as the Microsoft Kinect, has brought new interest in real-time 3D mapping, as exemplified by systems like KinectFusion.2,13 With the appearance of SAR techniques, it is therefore a natural next step to think about developing augmented reality applications by real-time capture of flexible surfaces using RGB-D sensors and dynamic projection mapping using projectors. Follow-up research based on KinectFusion mostly focused on modeling objects, especially humans (e.g. for 3D printing or avatar modeling), where the target object rotates in front of the Kinect while remaining steady.14,15 These systems aim to produce a single mesh or 3D model by analyzing and merging an RGB-D sequence, whereas we intend to reconstruct the target object during live capture and perform projection mapping simultaneously.
With the availability of consumer depth sensors, further real-time reconstruction approaches have been proposed.16,17 With three hand-held Kinects, Ye et al. 18 captured multi-person skeletal poses using a markerless algorithm. Furthermore, for targets such as hands, arms, and human bodies, researchers have demonstrated impressive real-time reconstructions of articulated motion.7,19 However, these approaches rely on strong priors based on pre-learned statistical models 7 or morphable shape models, 8 which cannot work in general scenes. In addition, these real-time methods are unable to reconstruct target-object detail under desultory interference noise.
With the development of SAR techniques, there has been extensive work on enabling augmented reality scenarios using projection mapping. 20 Some early work explored the application of projection mapping on large and curved surfaces, which could be semi-automatically calibrated, 1 and much more work has followed on projection systems, calibration techniques, and sensor–projector networks. In recent years, the problem of projecting and interacting across multiple flat surfaces has received more attention,21,22 and some researchers use a moving hand-held projector and a passive pen to interact with multiple virtual information spaces embedded in a physical environment. 23 These systems mainly studied the blending of multiple projections on flat or pseudo-flat surfaces. However, they have not explored deforming or moving surfaces.
More recently, with the emergence of real-time RGB-D sensors, projector systems have begun to deal with more complex geometries. For example, IllumiRoom 5 explored projecting around the periphery of a television screen to augment traditional gaming experiences with projected visualizations, and RoomAlive 6 deployed multiple projectors and sensors within an entire room to bring new interactive projection mapping experiences. However, neither of these systems deals with deforming surfaces or moving scenes. In follow-up work, Lee et al. 12 demonstrated projecting on a moving target, which required infrared markers and magnetic sensors and showed many limitations. Besides, for texture mapping with interactive behavior, DisplayObjects 24 used a marker-based tracking system to track the target physical models. Interactive systems like the Augmented Reality Sandbox 10 allow users to create topographic models by shaping real “kinetic” sand, but the calibration procedure is complex and error-prone; in addition, its content is designed only for earth science education and is hard to extend. In the systems introduced above, the projectors and sensors are assumed to remain static and support only limited interaction. In this article, projection mapping focuses on deforming surfaces with convenient interactions in real time while handling desultory interference noise.
Our system combines flexible surface reconstruction and dynamic projection mapping, demonstrating real-time performance for both simultaneously. Using an RGB-D sensor–based system, our approach does not require any specific shape model prior (e.g. a pre-defined surface model or terrain model) and provides smooth surface reconstruction and high-quality projection mapping during live capture. With a fast, precise, and adaptable calibration schema, the devices in our system are easy to use and set up. In addition, in terms of reconstructed geometry detail and real-time performance, our system can run on a regular consumer personal computer (PC), which brings us closer to highly immersive augmented reality applications for consumer scenarios, including gaming, teaching, and human–computer interaction.
System overview
Our interactive system is designed for indoor applications of real-time surface reconstruction and dynamic projection mapping. The RGB-D sensor (e.g. Microsoft Kinect) provides the system with a raw depth stream of the target surface, which is used to perform real-time reconstruction and generate a fine mesh at every time step. Furthermore, we use a projector to illuminate the target surface with the right color at every pixel, based on the generated mesh. To bring spectators an excellent augmented reality experience, precise calibrations are essential. With the models estimated during the calibration procedure, the raw depth stream can be corrected and transformed to the color frames of the projector.
The system pipeline is depicted in Figure 1 and comprises two phases: calibration and real-time performance. The calibration procedure needs some manual operations for sampling. Because we use a chessboard pattern and the RGB sensor, recognition and sampling are easier and faster than in the Augmented Reality Sandbox, 10 which uses a disk and the depth sensor. The whole calibration procedure takes about 5 min, while that of the Augmented Reality Sandbox 10 may take nearly 10 min. The RGB-D sensor provides the depth stream of the target surface. To compensate for erroneous depth information, an optimization framework is applied after the transformation and correction, and a denoising method is used to remove meaningless noise. According to a given scenario or color scheme, the renderer then computes the pixel colors for the projector.

Pipeline of the interactive system with RGB-D sensor and projector. In the calibration part, the correction model and transformation model are estimated through base-plane calibration and projector calibration, using the RGB-D sensor and projector. In the real-time performance part, the depth stream of the target surface from the depth sensor is processed by the solver, interpolation, stabilization, and smoothing; the pixel colors are then projected onto the target surface by the projector.
Hardware setup
An exemplary setup with a regular consumer PC, an RGB-D sensor, a single projector, and a target surface is depicted in Figure 5. The regular PC consists of an Intel Core i3 6100 (3.7 GHz), 8 GB of memory, and an NVIDIA GeForce GTX 1050. As the RGB-D sensor we use a Microsoft Kinect v2, whose depth frame has a resolution of 512 by 424 pixels. For projection mapping, we use an Epson CB-S04 with a resolution of 1024 by 768 pixels for the demo. White quartz sand in a 1.2 m by 0.9 m box serves as the target surface, because sand easily forms different terrains and flexible surfaces.
Base-plane calibration
The RGB-D sensor can hardly be mounted to “look at” the target surface exactly vertically. So even when the target surface is flat, the depth data collected by the RGB-D sensor are not uniform, which would make the pixel colors projected on the surface visibly inconsistent. To avoid this poor augmented reality experience, base-plane calibration is performed to estimate a base-plane model of the target surface, ensuring that the processed depth data are uniform when the target surface is flat. To perform this calibration, we pave the sand in the box to form a visually horizontal plane. Then, random sampling is performed with the depth sensor to calculate the correction model, which is described in detail in section “Calibration schema.”
Projector calibration
In this step, we need an extrinsic calibration between the projector and the depth sensor. We project specific sequences of chessboard patterns onto planes at different heights above the target surface and recognize them in the color stream captured by the RGB sensor. The RGB sensor is only used for this step. Similar to base-plane calibration, sequential sampling is performed to calculate the transformation model, which is described in detail in section “Calibration schema.” We have only a single projector to calibrate in our exemplary setup. In a multi-projector system, the calibration procedure is similar for each projector, estimating one transformation model per projector.
Optimization
The raw depth data of the target surface captured by the depth sensor are discrete and contain many erroneous or missing values. To provide a good visual experience, we implement an optimization framework to remove noise, repair the mesh, and match the difference in resolution between the depth sensor and the projector. In the real-time performance phase, spectators’ interactions with the target surface affect the surface mesh. To filter out undesired disturbances, we present a stabilizing strategy to update the surface mesh, which brings spectators a smooth and fluent experience.
Calibration schema
The calibration schema includes two steps, as shown in Figure 2. Step one estimates a base-plane model for correcting depth data in the subsequent calibration procedure and during real-time demonstrations, while step two estimates a transformation model, which is only used for projection mapping in the real-time demonstration phase. Both estimations require sampling and solving for model parameters, but both are easy to perform and work robustly.

Illustration of two-step calibration schema.
The flowchart of the calibration procedures is depicted in Figure 3. We mark the target surface’s boundary by simple interactions and calculate the base-plane model automatically. The process for the transformation model is a little more complicated. We need to sample a sequence of known patterns (i.e. chessboards) following pre-defined rules, and the base-plane model estimated in step one is used to correct the depth data at each sample point pair. With enough sample point pairs, the transformation model is estimated in seconds, and the model estimation error is reported at the same time.

Flowchart of the calibration approach: the exact number of sample points is 15, which is discussed in section “Results.”
Base-plane model
In our interactive system, the RGB-D sensor is required to be mounted to look directly at the target surface, which means that when the target surface is flat or in its initial state, the depth data collected by the RGB-D sensor should be uniform. This setup requirement is essential for a perfect visual experience during real-time projection mapping but is hard to achieve in practice. In this step, we present a convenient procedure to eliminate the deviation caused by an inclined RGB-D sensor.
First, the target surface (i.e. sand or clothing) has to be arranged into a visually flat plane, which we treat as having the same standard depth everywhere; we call this particular plane “the base-plane.” Then, we mark the target surface’s boundaries in the depth image captured by the RGB-D sensor. We randomly sample a series of points in the marked area based on the Latin hypercube sampling algorithm, 25 which makes the samples more representative of the real variability than traditional random sampling. We use
where
Based on least-squares approach, we construct objective function of this model estimation problem as
Here,
The equations can be rewritten in matrix form
Until now, the desired linear equations, which can be solved by many methods (e.g. Jacobi iteration or lower–upper (LU) decomposition), have been obtained. Finally, we can use the base-plane model to correct the raw depth data by
where
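The equations themselves are not reproduced in this text; as a minimal sketch, assuming the base-plane is modeled as a plane d = a·u + b·v + c over depth-image coordinates and fit by least squares (the function and variable names below are illustrative, not the paper’s):

```python
import numpy as np

def fit_base_plane(samples):
    """Least-squares fit of a plane d = a*u + b*v + c to sampled
    (u, v, d) depth points; returns the coefficients (a, b, c)."""
    pts = np.asarray(samples, dtype=float)
    A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    # solve the over-determined system A @ (a, b, c) ~= d
    coeffs, *_ = np.linalg.lstsq(A, pts[:, 2], rcond=None)
    return coeffs

def correct_depth(u, v, d, plane, d0):
    """Remove the tilt predicted by the base-plane so that a flat
    surface maps to the constant reference depth d0."""
    a, b, c = plane
    return d - (a * u + b * v + c) + d0
```

With the tilt removed, a flat sand surface produces the same corrected depth at every pixel, which is exactly the property the base-plane calibration is meant to guarantee.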
Transformation model
Projection mapping is a very important part of our interactive system. To ensure it works perfectly, a precise transformation model is needed. In this step, we present a convenient and compelling calibration approach, which is also fast and easy to perform. The derivation of the particular approximate model is explained first, and then the specific calibration procedure is elaborated.
We define two 3D orthogonal spaces with Cartesian coordinate system called camera space and projector space.
With this matrix equation, each point in camera space can be mapped to its corresponding point in projector space. However, in order to continuously adapt the right color of each projector pixel,
where
Combining the transformation matrix,
Here, we use subscript
where
And,
Here,
The calibration procedure for the transformation model is shown in Figure 3. The important parts of the whole process are described below.
Project known pattern
We use a
Recognize
Sample points in screen space are easily obtained, because the sequences of chessboard patterns are already known. We recognize the chessboard pattern in the RGB image captured by the RGB-D sensor at each position of the projection sequence. Then, each point recognized on the chessboard is mapped to the depth image based on the alignment of the RGB and depth images. We use OpenCV to perform the recognition, and 15 point pairs are collected in total to estimate the transformation model.
Estimate model
As illustrated above, the transformation model can be estimated by solving the parameter estimation problem. The non-zero solution of this problem is
Estimate error
After the calibration procedure, we can estimate the root-mean-square (RMS) error of the transformation model by analyzing the variance between the sampling points and the mapping results. Let
The physical meaning of
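Since the RMS formula itself is not reproduced above, here is a minimal sketch of the error computation, assuming it is the root-mean-square Euclidean distance between the points produced by the transformation model and the corresponding sampled points (function and argument names are illustrative):

```python
import numpy as np

def rms_error(predicted, observed):
    """Root-mean-square distance between points mapped by the
    transformation model and the corresponding sampled points."""
    diff = np.asarray(predicted, float) - np.asarray(observed, float)
    # per-point squared Euclidean distance, averaged, then rooted
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
```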

Calibration result test: the transformation model has an
We only have one single projector in our exemplary setup. However, Microsoft Kinect v2 has a depth image resolution of
Real-time reconstruction
After the calibration procedures are done, we obtain the base-plane model and the transformation model, which are used for depth data correction and space transformation. This is necessary preparatory work before the real-time demonstration. In this section, we introduce an optimization framework, including denoising and stabilizing, which is the main process enabling projection mapping to achieve an immersive augmented reality experience. The whole process of real-time reconstruction is roughly as follows:
Get raw depth data from RGB-D sensor.
Use calibration models to calculate the real depth data and corresponding pixel.
Perform image restoration algorithm to intelligently fix missing depth data.
Use stabilizing algorithm to determine whether the current frame needs to be discarded or not.
Perform smoothing filter to achieve a smoother result.
Calculation
The raw depth data read from the RGB-D sensor cannot be used directly for projection mapping. They have to be corrected by equation (7) to obtain the real depth data, and each pixel of the depth image has to be transformed by equation (13) to obtain the corresponding pixel on the projector screen. The calibration models are used in this phase to perform correction and transformation on each pixel of the depth image read from the RGB-D sensor.
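Equation (13) is not reproduced in this text; as an illustrative sketch only, assuming the transformation can be expressed as a homogeneous projective matrix applied to a corrected depth sample (the 3 × 4 matrix `P` and the function name are assumptions, not the paper’s derivation):

```python
import numpy as np

def map_to_projector(P, u, v, d):
    """Apply a 3x4 homogeneous projective transformation P to a
    corrected depth sample (u, v, d), returning the projector
    pixel (x, y) after the perspective divide."""
    x, y, w = P @ np.array([u, v, d, 1.0])
    return x / w, y / w
```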
Interpolation
After correction and transformation, a two-dimensional (2D) depth matrix with many missing values is obtained. This matrix will be used to generate the virtual sand surface displayed by the projector. However, we must first interpolate the missing values, as mentioned above. In this section, we introduce an image restoration algorithm 26 to do so. The specific steps are as follows:
Assume the 2D depth matrix
To maintain a real-time frame rate, we use a resizing method based on the nearest neighbor (NN) algorithm to properly reduce
We use
Using the inverse of the same resizing method, we enlarge
Since the resizing process may cause data loss, we cannot directly use
Finally,
Stabilization
To eliminate undesirable interactions (e.g. waving a hand quickly or unconsciously putting one’s head under the RGB-D sensor), we present a stabilizing strategy to update the surface mesh and achieve natural and fluent interactions. The pseudocode of the multi-frame stabilization algorithm is given below (Algorithm 1).
As illustrated in Algorithm 1, the essence of the multi-frame stabilization algorithm is to determine whether a new depth value is valid. If the new depth value stays fairly stable over several frames, the algorithm regards it as caused by an effective interaction and updates the surface with it. The condition of multi-frame stabilization is
Here,
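Algorithm 1 is not reproduced in this text; the sketch below is one plausible per-pixel reading of the multi-frame stabilization idea, where a changed value is committed only after it stays within a tolerance for several consecutive frames (the class name and the `frames` and `tol` parameters are assumptions):

```python
import numpy as np

class DepthStabilizer:
    """Accept a changed depth value only after it stays within `tol`
    of its candidate for `frames` consecutive frames."""
    def __init__(self, shape, frames=3, tol=5.0):
        self.frames = frames
        self.tol = tol
        self.stable = np.zeros(shape, np.float32)     # displayed depth
        self.candidate = np.zeros(shape, np.float32)  # pending value
        self.count = np.zeros(shape, np.int32)        # stability counter

    def update(self, depth):
        depth = np.asarray(depth, np.float32)
        near = np.abs(depth - self.candidate) <= self.tol
        # consistent with the candidate: count up; otherwise restart
        self.count = np.where(near, self.count + 1, 1)
        self.candidate = np.where(near, self.candidate, depth)
        # commit candidates that have been stable long enough
        commit = self.count >= self.frames
        self.stable = np.where(commit, self.candidate, self.stable)
        return self.stable
```

Under this reading, a one-frame spike (a hand sweeping through the sensor’s view) never reaches the displayed mesh, while a sustained change (reshaped sand) is committed after a few frames.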
Smoothing
A low-pass filter can eliminate the aliasing that appears at the edges of depth mutations and provide a better visual experience. In this article, we apply the filter to the real depth data along each column and each row, so each pixel is processed twice per pass. Experiments demonstrate that after repeating this procedure twice, the depth data show a smooth transition, which leads to a comfortable visual experience.
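A minimal sketch of such a separable low-pass pass, assuming a simple moving-average kernel (the kernel size and number of passes are illustrative, not the paper’s filter design):

```python
import numpy as np

def smooth_depth(depth, k=5, passes=2):
    """Separable moving-average low-pass: filter along each row, then
    along each column; repeating the pass softens depth-mutation edges."""
    kernel = np.ones(k, np.float32) / k
    out = np.asarray(depth, np.float32)
    for _ in range(passes):
        # axis 1: each row; axis 0: each column (each pixel touched twice)
        out = np.apply_along_axis(np.convolve, 1, out, kernel, mode='same')
        out = np.apply_along_axis(np.convolve, 0, out, kernel, mode='same')
    return out
```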
Optimization
To improve the efficiency of the real-time reconstruction process, corresponding parallel optimization schemes are used in the calculation, interpolation, and stabilization steps.
For calculation, we use a graphics processing unit (GPU) parallel optimization scheme: since the calculation for each pixel of the raw depth image is independent, we use a Compute Shader to compute in parallel. The interpolation step is mainly image processing, so we rely on the OpenCV image library’s accelerated resizing and inpainting methods. For stabilization, we use the central processing unit (CPU) to create multiple parallel threads that perform the stability judgment on each pixel.
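As a rough CPU-side illustration of the multi-threaded stability judgment, the sketch below splits the depth image into row bands and judges each band in a thread pool (the simplified single-frame test and all names are assumptions; in Python specifically, a process pool may parallelize better because of the GIL):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def stabilize_chunk(depth, stable, tol):
    """Per-band judgment: keep the old value unless the new one
    differs by more than tol (a simplified per-pixel test)."""
    return np.where(np.abs(depth - stable) > tol, depth, stable)

def parallel_stabilize(depth, stable, tol=5.0, workers=4):
    """Split the image into row bands and judge each band in a pool,
    mirroring the multi-threaded CPU scheme described above."""
    bands = np.array_split(np.arange(depth.shape[0]), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda rows: stabilize_chunk(depth[rows], stable[rows], tol),
            bands)
    return np.vstack(list(parts))
```

Because every pixel’s judgment is independent, the bands can be processed in any order and simply stacked back together, which is the same property that makes the Compute Shader version embarrassingly parallel.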
Results
In this section, we evaluate the approach described above, in particular comparing the calibration models estimated under different conditions. In addition, we detail some example applications based on our calibration procedures and optimization framework.
We use Kinect v2 as RGB-D sensor, Epson CB-S04 as projector, and Unity3D Engine for visual display. In addition, the display system consists of a regular PC with an Intel Core i3 6100 (3.7 GHz) and NVIDIA GeForce GTX 1050, and a large sand container full of white quartz sand (Figure 5). For the sake of convenience, we call it

System illustration: we use Microsoft Kinect v2 as RGB-D sensor, Epson CB-S04 as projector, and a 1.2 m by 0.9 m box, which is full of white quartz sand as the target surface.
Calibration results comparison
Many factors can affect the result of the calibration procedures. In this section, we choose the mounting height of the Kinect v2, the ambient illuminance, and the number of sample points as conditions for comparing the
Calibration results comparison: the mounting height of the Kinect v2, the ambient illuminance, and the number of sample points are the three conditions. 1.70 and 2.50 m are the two heights at which the Kinect v2 is mounted; 50, 100, and 200 are three levels of ambient illuminance; and 5, 10, 12, 15, and 20 are the sizes of five different groups of sample points. Numbers in the last column are the calculated
Example applications
We demonstrate three example applications in this section, after the calibration procedures are done. They share one typical calibration model, which has an

The Summer: deep blue for oceans, bright green for plants, with flying flowers, this scene represents summer. Corresponding time of each frame is listed as follows: (a) −00:03, (b) −00:06, (c) −00:08, (d) −00:18, (e) −00:23, and (f) −00:29.

The Grasslands: light blue for rivers, grass green for lands, with trees and animals, this scene represents grasslands. Corresponding time of each frame is listed as follows: (a) −00:01, (b) −00:12, (c) −00:20, (d) −00:26, (e) −00:38, and (f) −00:51.

The Beaches: blue for oceans, green for lands, khaki for beaches, with seagulls and volcanoes, this scene represents beaches. Corresponding time of each frame is listed as follows: (a) −00:02, (b) −00:08, (c) −00:15, (d) −00:22, (e) −00:29, and (f) −00:35.
The real-time performance of the different applications is shown in Table 2. When the frame rate is above 20 frames per second (FPS), spectators have a good visual augmented reality experience.
Real-time performances: applications are run on a regular PC with an Intel Core i3 6100 (3.7 GHz), 8 GB of memory, and NVIDIA GeForce GTX 1050.
CPU: central processing unit; FPS: frames per second.
Conclusion
In this article, we have introduced a combined hardware and software solution for surface reconstruction and dynamic projection mapping in real time, which contains a robust calibration schema and a highly efficient optimization framework. For the calibration models, we present a base-plane model to fix the attitude deviation of the Kinect and a transformation model for perfect projection mapping. The objective function of the transformation model is derived by space transformation and solved by LU decomposition, which proves fast and effective. After a series of experiments, we found that the number of sample points has a significant impact on the calibration result, and a proper value is around 15, as shown in Table 1. To provide a good visual experience, we implement an optimization framework to remove noise, repair the mesh, and match the difference in resolution between the depth sensor and the projector. In addition, we present a stabilizing strategy to update the surface mesh, which runs well on a regular PC and brings spectators a smooth and fluent experience.
Our system leaves a lot of potential for the future. For example, it could be extended to multiple projectors and RGB-D sensors, which would enable more complex interaction and bring more challenges as well. Another interesting direction for future research is changing the medium of projection mapping, that is, projecting onto arbitrary real-world objects rather than simply sand. 27
