Abstract
1. Introduction
Pose estimation from a referenced rigid structure is one of the most basic and important problems in robot vision and computer graphics. It can be used to obtain the six degrees of freedom (6DOF) of the cameras needed for further processing, e.g., in helping robots to locate targets in referenced coordinates [1, 2], in calculating the image coordinates of virtual 3D objects to synthesize augmented reality scenes [3], in locating flying balls for robots [4], and in calibrating camera-laser sensor systems [5].
Although the single-camera pose estimation problem has been widely researched in the last decade, existing methods often suffer from low robustness and from ill-conditioned pose estimation problems. Furthermore, a relatively small viewing angle reduces the accuracy of the estimated pose.
In practical robot vision applications (e.g., catching objects with robot arms or playing ping-pong with robots, which may all require the vision system to perform in real time), the task of pose estimation (which aims to locate the absolute coordinates of objects via the robot's vision system) requires high accuracy and robust performance in most conditions. A single camera cannot provide enough precision or robustness for such pose estimation tasks, and thus a multiple-camera system is required. Furthermore, task-oriented robots normally need to obtain the absolute coordinates of a tracked target in the referenced space via rigid point correspondences from landmarks, while a single-camera system yields infinitely many solutions for the target due to the nature of perspective projection.
In the specific case of mobile robots, pose estimation methods for multiple camera systems have always faced the following challenges:
In applications of mobile robots, especially robots working at high speed, obtaining the poses of all the cameras simultaneously is the most important requirement. Otherwise, poses estimated individually will break the rigid constraint among the cameras in the multiple-camera system, since the robot's high-speed motion may occur between the pose estimations of the different cameras.
Methods are required that work accurately, stably (with a low standard deviation, STD) and robustly under any viewing angle, due to the uncertain poses of mobile robots.
The task of pose estimation is always followed by a localization task; thus, the methods should also provide good localization performance from their estimates.
As the movement of the robot may shake the rigid rig holding the cameras and introduce bias, the pose estimation methods should also be robust to such bias in the rig.
Mobile robots are real-time systems, so the pose estimation methods must compute as quickly as possible and with as few point correspondences as possible.
In this paper, we address the above challenges and present a novel approach that estimates the poses of all the cameras in a multi-camera system accurately, robustly and simultaneously, using only a few coplanar points, and we aim to resolve the main challenges above as they arise in the practical vision system of a humanoid ping-pong robot.
2. Related Works
Pose estimation for single cameras [6–11] has been studied for many years. Recently, much of the research has focused on pose estimation in multi-camera systems, due to the limitations of single cameras, e.g., their low accuracy and limited field-of-view.
One of the most important advantages of a multi-camera system is that it can recover stereo information easily (e.g., the visual odometry [12], which employs calibrated cameras to recover the 3D information of targets), and it can help to estimate the motion of the cameras from the optical flow via the Kalman filter method or by minimizing a cost function based on the geometric and 3D properties of the features.
Generally, there are two kinds of multi-camera systems: one is designed for overlapping fields of view [13–17] and the other for non-overlapping fields of view [18, 19].
The methods for non-overlapping systems [18, 19] require cameras that are placed rigidly on a moving object, where the translation and rotation between the cameras are known. When the object moves rigidly, these methods can recover the 6DOF motion of the multi-camera system via the point correspondences between images seen before and after the motion. These methods are normally implemented on a vehicle or other moving objects, and require relative motion between consecutive images.
The methods for overlapping systems also employ rigidly placed cameras with known translation and rotation between each pair, and they can recover the pose of the system from only one frame of the multiple images from the different cameras. As such, these methods can also process static scenes. Since there are many efficient pose estimation methods for single cameras, an intuitive solution for multi-camera systems with overlapping fields of view is to estimate the pose of each camera and then reduce the ambiguities among the estimated poses based on their rigid constraint via fusing or polling policies. The methods presented by Baker et al. [14] and Viksté et al. [15] belong to this category. However, this kind of method does not obtain a unifying pose for the multi-camera system from all the information of all the cameras simultaneously. It may introduce inconsistencies with the rigid constraint between each pair of cameras, which can reduce the precision of multi-camera systems for measurement or object localization (e.g., stereo vision for grasping with robots [20]). In this paper, we present a novel approach which estimates the pose of a multi-camera system with overlapping fields of view and calculates the unifying pose for the system from all the information of all the cameras simultaneously.
3. Our Approach to Pose Estimation for Multi-camera Systems
Pose estimation for a multi-camera system should calculate the orientation and translation with the rigid pose constraint among the various cameras. Most existing methods attempt to solve the orientation and translation directly, by optimization [21] or iteration [11, 22]. In contrast, we employ homography, which is widely implemented in calibration [23, 24] and can map image points with 3D coplanar referenced points. Firstly, we establish the corresponding relations between each camera using their Euclidean geometries and optimize the homographies of the cameras; then, we solve the orientation and translation from the optimal homographies.
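As a minimal sketch of this mapping (with our own hypothetical notation, since the paper's symbols are not reproduced in this excerpt), a plane homography sends the 2D coordinates of a coplanar reference point to its image projection up to a projective scale:

```python
import numpy as np

def project_with_homography(H, plane_pt):
    """Map a point (X, Y) on the reference plane to pixel coordinates."""
    X, Y = plane_pt
    p = H @ np.array([X, Y, 1.0])
    return p[:2] / p[2]          # divide out the projective scale

# Hypothetical example homography: identity plus a translation of (5, 3).
H = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, 3.0],
              [0.0, 0.0, 1.0]])
print(project_with_homography(H, (2.0, 4.0)))  # -> [7. 7.]
```

The homographies themselves encode the orientation and translation we are after, which is why the approach first optimizes them and only then extracts the pose.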
3.1 Problem definitions
The intrinsic parameters of the
The task of estimating the absolute pose of a multi-camera system can be formalized as follows. Given a set of coplanar non-collinear 3D coordinates of referenced points
3.2 Homography for coplanar corresponding points
The homogeneous coordinate of the referenced point is mapped to its image projection by the homography, up to a projective scale factor λ.
3.3 The global optimum for multi-camera systems
In our approach, we try to minimize the image distances of all the cameras with respect to the referenced coplanar points.
Given
The above optimal function is established on the assumption that the image points of each camera are perturbed by Gaussian noise, which is quite usual in many image noise removal methods [25–27] and vision practices [28–30]. For cameras that are assembled rigidly, the homography of each camera has inherent constraint relations, which will help us to obtain the global optimization for the multi-camera system using only a few points.
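Under this Gaussian-noise assumption, the objective amounts to the sum of squared image residuals over every camera; a minimal sketch with our own (hypothetical) names is:

```python
import numpy as np

def joint_reprojection_cost(Hs, plane_pts, observed):
    """Sum of squared image distances over all cameras.

    Hs        : list of 3x3 homographies, one per camera (assumed notation)
    plane_pts : (N, 2) coplanar reference points on the world plane
    observed  : list of (N, 2) arrays of measured image points per camera
    """
    pts_h = np.hstack([plane_pts, np.ones((len(plane_pts), 1))])  # homogeneous
    cost = 0.0
    for H, obs in zip(Hs, observed):
        proj = pts_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]     # perspective division
        cost += np.sum((proj - obs) ** 2)     # Gaussian-noise ML objective
    return cost
```

Because the homographies of rigidly assembled cameras are not independent, this cost can be parametrized by a single camera's homography plus the known rigid transforms, which is what allows a global optimum from only a few points.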
3.4 Extrinsic translation for multiple cameras with homography
In this section, we present the method for calculating one camera's homography from another known homography, given the rigid rotation and translation between the cameras. Assume that there are two cameras,
Thus, the homography of camera
The above formulation presents only the scale transformation relation between the homographies of two different cameras. However, in our approach, the homography is defined with the scale
With homography
Accordingly, the optimization function in (5) can be rewritten as:
Here, the scale factor
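Under the common convention that the reference plane is z = 0 (an assumption here; the paper's exact equations are not reproduced in this excerpt), camera 2's homography follows from camera 1's pose and the known rigid transform without any re-estimation:

```python
import numpy as np

# For the world plane z = 0, a camera with intrinsics K and world-to-camera
# pose (R, t) induces the plane homography H = K [r1  r2  t], built from the
# first two rotation columns and the translation.
def plane_homography(K, R, t):
    return K @ np.column_stack([R[:, 0], R[:, 1], t])

def transfer_homography(K2, R1, t1, R21, t21):
    """Homography of camera 2, given camera 1's pose (R1, t1) and the
    known rigid transform (R21, t21) from camera 1 to camera 2."""
    R2 = R21 @ R1               # compose the rigid transforms
    t2 = R21 @ t1 + t21
    return plane_homography(K2, R2, t2)
```

The transferred homography agrees, up to scale, with the one camera 2 would induce directly, which is exactly the inherent constraint exploited by the joint optimization.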
3.5 Estimation of the initial homography
Solving formula (10) is a typical nonlinear optimization problem. In our approach, the Levenberg-Marquardt method is adopted, which has been widely used in computer vision. When solving the optimization of formula (10), an initial guess for
The method to calculate the initial homography
Since
Before calculating the initial guess as to the homography using the above method, the data should first be normalized [28] to obtain more stable and accurate results.
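The initialization can be sketched with the standard normalized direct linear transform (DLT), using the normalization of [28]; the helper names below are our own, and the paper's exact derivation is not reproduced here:

```python
import numpy as np

def normalize(pts):
    """Similarity transform sending pts to zero mean, mean distance sqrt(2)."""
    c = pts.mean(axis=0)
    d = np.sqrt(((pts - c) ** 2).sum(axis=1)).mean()
    s = np.sqrt(2) / d
    return np.array([[s, 0.0, -s * c[0]],
                     [0.0, s, -s * c[1]],
                     [0.0, 0.0, 1.0]])

def dlt_homography(src, dst):
    """Estimate H with dst ~ H @ src (both (N, 2), N >= 4)."""
    Ts, Td = normalize(src), normalize(dst)
    src_h = (Ts @ np.hstack([src, np.ones((len(src), 1))]).T).T
    dst_h = (Td @ np.hstack([dst, np.ones((len(dst), 1))]).T).T
    A = []
    for (x, y, w), (u, v, m) in zip(src_h, dst_h):
        A.append([0, 0, 0, -m * x, -m * y, -m * w, v * x, v * y, v * w])
        A.append([m * x, m * y, m * w, 0, 0, 0, -u * x, -u * y, -u * w])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    Hn = Vt[-1].reshape(3, 3)          # right singular vector of smallest
    H = np.linalg.inv(Td) @ Hn @ Ts    # singular value; undo normalization
    return H / H[2, 2]
```

Without the normalization step, the linear system is badly conditioned for pixel-scale coordinates, which is precisely why [28] is applied first.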
If a more accurate initial guess for the homography is desired, the maximum likelihood estimation of
Such that
The optimization of the above function can also be solved using the Levenberg-Marquardt method, and the initial guess can be obtained with the solution to equation (12).
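The maximum-likelihood refinement can be sketched as a compact Levenberg-Marquardt loop over the nine homography entries; this is a stand-in with a numeric Jacobian, not the paper's exact formulation of equation (12):

```python
import numpy as np

def refine_homography_lm(H0, src, dst, iters=50, lam=1e-3):
    """Levenberg-Marquardt refinement of a homography (numeric Jacobian).

    Minimizes the reprojection error of dst ~ H @ src, starting from a
    DLT-style initial guess H0.
    """
    src_h = np.hstack([src, np.ones((len(src), 1))])

    def residuals(h):
        p = src_h @ h.reshape(3, 3).T
        return ((p[:, :2] / p[:, 2:3]) - dst).ravel()

    h = H0.ravel().astype(float).copy()
    r = residuals(h)
    for _ in range(iters):
        # Forward-difference Jacobian of the residual vector.
        J = np.empty((r.size, 9))
        eps = 1e-7
        for j in range(9):
            hp = h.copy()
            hp[j] += eps
            J[:, j] = (residuals(hp) - r) / eps
        step = np.linalg.solve(J.T @ J + lam * np.eye(9), -J.T @ r)
        h_new = h + step
        r_new = residuals(h_new)
        if r_new @ r_new < r @ r:      # accept the step, relax damping
            h, r, lam = h_new, r_new, lam * 0.5
        else:                          # reject the step, increase damping
            lam *= 10.0
    return h.reshape(3, 3) / h[8]
```

The damping term also absorbs the scale-gauge freedom of the homography (scaling all nine entries leaves the residuals unchanged), so no explicit gauge fixing is needed during the iterations.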
4. Experiments
We compared our method with various state-of-the-art methods, using both simulations and real image data. Finally, we present the practical implementation of our method in ping-pong robots.
In the following experiments, we used seven pose estimation methods.
The above seven methods could be classified according to three categories. One set includes the methods estimated from coplanar points, e.g.,
With the above seven methods,
4.1 Simulation experiments
In these experiments, we simulated several cameras placed rigidly. The distances between the cameras were about 100. Each camera's focal length ratio was set as
We used the relative error to evaluate the experimental results. Given the true results of camera
where
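Since the error formula itself is elided in this excerpt, a common definition (our assumption, not necessarily the paper's exact one) divides the norm of the estimation error by the norm of the true value:

```python
import numpy as np

def relative_error(estimated, true):
    """Relative error of an estimate (rotation, translation or position),
    assumed here as ||estimated - true|| / ||true||."""
    est = np.asarray(estimated, dtype=float)
    tru = np.asarray(true, dtype=float)
    return np.linalg.norm(est - tru) / np.linalg.norm(tru)
```

The same scalar can then be averaged over repeated trials, matching how the plots below report mean errors and standard deviations.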
To evaluate the estimation when the two cameras are subject to different rigid motions between them, all the simulation experiments were designed with random relative positions of the cameras. This was achieved by limiting the three axis-rotation angles to the interval [0, 2] degrees and the translation vectors to between [50, 0, 0] and [60, 10, 10], while keeping all the reference points within the overlapping view.
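The random relative poses described above can be drawn as follows; uniform sampling and an XYZ Euler-angle composition are our assumptions, as the paper does not specify them:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_relative_pose():
    """Draw a random relative pose: each axis rotation limited to
    [0, 2] degrees, translation between [50, 0, 0] and [60, 10, 10]."""
    ax, ay, az = np.deg2rad(rng.uniform(0.0, 2.0, size=3))
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(ax), -np.sin(ax)],
                   [0.0, np.sin(ax),  np.cos(ax)]])
    Ry = np.array([[ np.cos(ay), 0.0, np.sin(ay)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(ay), 0.0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0.0],
                   [np.sin(az),  np.cos(az), 0.0],
                   [0.0, 0.0, 1.0]])
    t = rng.uniform([50.0, 0.0, 0.0], [60.0, 10.0, 10.0])
    return Rz @ Ry @ Rx, t
```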
4.1.1. Simulation experiments for R, T, localization and re-projection error
As the performance of the pose estimation methods may be influenced by the viewing angle, we evaluated the performance (R, T, localization and image re-projection error) of the above methods given small viewing angles (which occurred quite frequently in the humanoid ping-pong robot).
In the simulation experiment, the 3D referenced points were restricted in the plane with

Figure 1. Simulation experiment results under varied Gaussian noise. There were two cameras in the multi-camera system using eight referenced points. (c), (d), (g) and (h) are the corresponding standard deviations of (a), (b), (e) and (f).

Figure 2. Simulation experiment results under varied numbers of referenced points. There were two cameras in the multi-camera system, with Gaussian noise σ = 0.5. (c), (d), (g) and (h) are the corresponding standard deviations of (a), (b), (e) and (f).
We also carried out an experiment to evaluate the absolute localization performance of these methods. When executing the pose estimation, a noised point (with a ground-truth value
According to figure 1(a),(b),(c) and (d), we can see that the object space-minimizing-based approaches, i.e.,
According to figure 2(a)–(d), we can see that the object space-minimizing-based approaches, i.e.,
The simulation experimental results also show that estimating the pose of each camera in turn can perform as well as the global optimization methods; however, the localization performance of the individually estimated methods was always worse than that of the global methods. In our view, the reason may be that the global optimization methods always enforce the rigid rig as a constraint in the optimization, which forces a compromise towards the average system bias, whereas a method that optimizes each camera singly can ignore this constraint and converge to poses with less individual bias; the real rig constraint is then broken, which leads to poor performance in localization applications.
4.1.2. Simulation Experiments on Other Metrics
In the second simulation experiment, we further evaluated four approaches, i.e.,
We first evaluated the performances of the four approaches with different numbers of mounted cameras. Figure 3 shows the experimental results using various numbers of cameras (3-5) in the multi-camera system. In this experiment, we randomly chose 25 poses for the multi-camera system, distributed on the half-sphere facing the origin (0, 0, 0) at a distance of 1,500, and then executed the estimation 200 times and output the average. The results in figure 3 show that the performances of all four methods improve as the number of cameras increases. Although the number of cameras has been increased, the

Figure 3. Simulation experiment results with increasing numbers of cameras. Gaussian noise σ = 0.5, with eight referenced points.
We also compared the execution time of the four methods (

Figure 4. Experimental results for average time consumption.
In order to evaluate the multi-camera algorithms' robustness, we also inspected the disturbance introduced by a small calibration error of the multi-camera system. The experiment was carried out by adding Gaussian noise to the parameters of the principal point (

Figure 5. Simulation experiment results introducing a small calibration error. (a), (b) and (c) show the rotation, translation and location errors of the four methods with noise on their
According to figure 5(a) and (b), we can see that the object space-minimizing-based methods (e.g.,
4.1.3. Overall performance analysis of the simulation experiments
In the above simulation experiments, we have presented the performances of seven state-of-the-art pose estimation methods on various metrics. In this section, we present an objective analysis of our approach, in terms of both robustness and accuracy, comparing it with the other six methods. In our analysis, we use a comparable performance diagram to illustrate the performance of our method against the others. The results are shown in figure 6. In figure 6, each row compares the performance of our

Figure 6. Overall performance comparison of
The results in figure 6 illustrate that

Figure 7. Overall performance comparison of
4.2 Real case experiments
We also carried out comparative experiments using real image data. In these experiments, we built a miniature multi-camera system with two Toshiba Teli cameras, as shown in figure 8 (left). Each camera was equipped with a 4 mm lens and operated at a resolution of 640 × 480. In the experiments, there were eight green referenced points placed on the ping-pong table and one static table tennis ball as a target point, as shown in figure 8 (right). We randomly placed the cameras of the system and obtained their translation poses using the methods described above. Next, we computed the positions of the table tennis ball via the poses estimated by the different methods. The true positions of the table tennis ball were obtained via a 3D Micro-hite DCC coordinate-measuring machine (CMM). There were 12 random positions for the translation pose estimation and ball location; at each position, we executed each method 200 times and removed the maximum and minimum of each method. The final outputs were the averages of the remaining data. The experimental results are shown in figure 9. Figure 9 shows that

Figure 8. Hardware (left) and software (right) of the multi-camera system used for the real case experiments. In the right-hand figure, the upper-left view shows the 3D pose of the multi-camera system in the virtual environment, while the lower views show the images from the system and the upper-right view shows the detailed 6DOF poses.

Figure 9. Localization experimental results in the ping-pong robot vision system with different methods; the x-coordinate is the ratio of the distance error compared with the absolute distance
4.3 Practical implementation in ping-pong robots
The method presented in this article was originally motivated by our humanoid ping-pong robot project, which uses a multiple-camera system to guide the robot arms during ping-pong games. We thus use the hardware described in Section 4.2 as the on-board vision system of the humanoid ping-pong robot. For the vision system of the ping-pong robot, there are two main challenges in pose estimation due to the shaking of the robot. Firstly, there are few referenced points in this system, and the referenced 3D points are all located in the plane of the table. There are normally fewer than 10 correspondences, so the vision system needs to estimate its pose and calculate the coordinates of the balls in the reference coordinate frame with high accuracy using only a few coplanar points. Secondly, the vision system can be deployed at any angle relative to the ping-pong table, so it may be placed in some locations with small viewing angles, which greatly affects the accuracy of the pose estimation.
We implemented
The two-camera system as described in Section 4.2 tracks the ping-pong ball and generates its traces while playing. Figure 10 (right) shows the accuracy of the trace estimated by our method. The blue points show the true positions in this trace, which were obtained using an additional high-speed camera system with 500

Figure 10. Left: the camera pose in our ping-pong robot system. Right: the trace obtained by our method (white) compared with the true trace (blue).
5. Conclusion
In this paper, we presented an efficient and robust pose estimation algorithm for multi-camera systems which can obtain the 6DOF poses of all the cameras simultaneously, using only a few coplanar points. Large-scale simulation experiments have shown that this algorithm can be more robust than classical iterative pose estimation algorithms in both small- and large-angle viewing conditions. Practical experiments also showed that this method is more accurate and robust.
Hence, our method is especially suitable for tasks where the cameras may take various poses, including ill-conditioned configurations or relatively small viewing angles.
