1. Introduction
Occluded object imaging is a significant challenge in many computer vision applications, such as video surveillance and monitoring, hidden object detection and recognition, and tracking through occlusion. However, because traditional photography simply captures a 2D projection of the 3D world, it is fundamentally unable to handle occlusion.
Recently, computational photography has been changing the traditional way of imaging by capturing additional visual information with generalized optics. Synthetic aperture imaging (SAI) [1–7] is one of the key aspects of computational photography. Figure 1 visualizes the principle of multiple camera synthetic aperture imaging. In a convex lens, rays from the red point on the plane of focus converge after refraction to a single point on the sensor plane, forming a sharp image (Figure 1a). Rays from the blue point, which is not on the plane of focus, form a circle of confusion on the sensor plane, resulting in a blurred image (Figure 1b). A camera array is analogous to a "synthetic" lens aperture, with each camera being a sample point on a virtual lens (Figure 1c). We synthetically focus the camera array by choosing a plane of focus and summing all the rays corresponding to each point on the chosen plane to obtain a pixel in a "synthetic aperture" image. By warping and integrating the multiple view images, synthetic aperture imaging can simulate a virtual camera with a large convex lens, and focus on different frontal-parallel or oblique planes with a narrow depth of field. As a result, occluded objects on the virtual focal plane appear sharp, while objects off that plane are blurred.
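The warping-and-integrating step above can be illustrated with a minimal sketch. For a linear camera array focusing on a frontal-parallel plane, the warp for each view reduces to a horizontal shift proportional to the camera's baseline; the function below (names and the single-shift simplification are illustrative assumptions, not the paper's implementation) averages the shifted views to form a synthetic aperture image:

```python
import numpy as np

def synthetic_aperture_image(views, baselines, disparity):
    """Refocus a linear camera array on one frontal-parallel plane.

    views     : list of 2D grayscale images (H x W numpy arrays)
    baselines : per-camera offsets along the array
    disparity : pixel shift per unit baseline for the chosen focal plane

    For a frontal-parallel focal plane and a linear array, the planar
    homography warp degenerates to a horizontal shift; summing the
    shifted views simulates a large synthetic aperture.
    """
    acc = np.zeros_like(views[0], dtype=np.float64)
    for img, b in zip(views, baselines):
        shift = int(round(b * disparity))  # warp reduced to a shift
        acc += np.roll(img.astype(np.float64), shift, axis=1)
    return acc / len(views)
```

Points on the chosen plane align across the shifted views and stay sharp, while off-plane points land at different positions in each view and average out to blur.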

Explanation of the principle of multiple camera synthetic aperture imaging via geometric optics.
Synthetic aperture imaging [1, 2] provides a new way to address the occluded object imaging problem; however, it still suffers from the following limitations:
The clarity of the occluded object image is often significantly degraded by shadows of the foreground occluder. Although some methods have been presented to label the foreground via object segmentation or 3D reconstruction, they fail in the case of a complicated occluder or severe occlusion.
Because state-of-the-art SAI methods average the intensities of multiple cameras, the varying colour responses of the cameras often significantly reduce the colour smoothness and consistency of the synthetic aperture image.
The occluded object's contour and contrast are sensitive to calibration error, which is especially serious for unstructured light field synthetic aperture imaging with a moving camera.
In this paper, we address the above issues by proposing a new algorithm which, for the first time, formulates occluded object imaging as an optimal camera selection problem. A multi-label energy minimization formulation is designed on each plane to select the optimal camera. The energy is estimated in the 3D synthetic aperture image volume, and integrates multiple view intensity consistency clustering, visibility probability propagation and camera view smoothness. When focusing on a hidden object, instead of naively averaging all camera views in the synthetic aperture image, our method actively selects the rays from only one optimal camera via multi-label graph cuts-based energy minimization [8–11].
The organization of this paper is as follows: Section 2 introduces related work. Our algorithm is presented in Section 3. Section 4 then presents the dataset, implementation details and experimental results, together with performance analysis and discussion. Finally, we conclude the paper and point to future work in Section 5.
2. Related Work
In 1999, the first famous camera array setup was devised for the film
The MIT computer graphics group [12] used 64 USB webcams for synthesizing dynamic depth of field effects. Lei
When focusing on the occluded object, outlier rays that actually hit occluders blur the focused object and decrease the clarity and contrast of the synthesized image. Several methods have been presented to overcome this problem. Vaish
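The effect of such outlier rays can be seen in a toy one-pixel example. One family of strategies in the literature replaces the mean of the rays with a robust statistic; the sketch below (with made-up intensity values) contrasts the mean, which is pulled toward the occluder, with the median, which largely ignores the outlier rays:

```python
import statistics

# Intensities of the rays through one focal-plane point in 9 camera views:
# six rays see the hidden object (~100), three hit a dark occluder (~10).
rays = [100, 98, 101, 10, 99, 12, 100, 11, 102]

mean_val = sum(rays) / len(rays)      # dragged down by the occluder rays
median_val = statistics.median(rays)  # robust to the outlier rays

print(mean_val, median_val)
```

The mean lands around 70, well away from the object's true intensity, while the median stays at 99; this is why simple averaging loses contrast under occlusion.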
3. Our Approach
In this section we introduce our optimal camera selection-based occluded object imaging method. Instead of averaging all rays from the entire camera array [1] or from partially visible cameras [22], our approach selects only one
3.1. Algorithm framework
In this subsection, we give an overview of the optimal camera selection and imaging approach by describing the flow of information on the Stanford light field dataset. The overall method mainly includes two parts: (1)

Algorithm framework of our approach.
An imaging cycle begins by capturing multi-view images with a camera array or a moving camera.
The image warping results are then fed as input to the
Finally, the camera selection results are combined to generate a high quality occluded object image (Figure 2, right top image), which is much better than the traditional synthetic aperture imaging result (Figure 2, right bottom image).
3.2. Optimal camera selection via multi-labelling graph cut
Let
Considering the labelling redundancy of the camera array (the labels in different cameras are highly correlated), we label all the pixels in the reference camera view instead of in all camera views. Thus, we only seek a more succinct labelling,
The objective of choosing the visible view can be formulated as the following energy minimization problem:
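The equation itself does not survive in this copy. Given the data term and pairwise smoothness term described below, it presumably takes the standard multi-label Markov random field form used in graph cuts papers [8–11]; a reconstruction under that assumption:

```latex
\min_{L} \; E(L) \;=\; \sum_{p \in \mathcal{P}} D_p(L_p)
\;+\; \sum_{(p,q) \in \mathcal{N}} V_{p,q}(L_p, L_q)
```

where \(L_p\) is the camera label assigned to pixel \(p\) of the reference view, \(\mathcal{P}\) is the set of reference-view pixels, and \(\mathcal{N}\) is the neighbourhood system.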
where the data term
where
Because of the different colour responses among multiple camera views and calibration errors in the camera positions, the colour values of the corresponding visible pixels in multiple camera views always differ, even for the same visible point. This prior is therefore very important, as it determines the colour smoothness of the final synthetic image; surprisingly, traditional synthetic aperture imaging methods seldom consider this problem.
Here, we adopt the standard four-connected neighbourhood system and penalize neighbouring pixels whose labels differ:
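The penalty term is missing from this copy; the behaviour described (a cost incurred exactly when two neighbouring labels differ) is the Potts model, so a reconstruction under that assumption is:

```latex
V_{p,q}(L_p, L_q) \;=\; \lambda \cdot \bigl[\, L_p \neq L_q \,\bigr]
```

where \([\cdot]\) equals 1 when its argument is true and 0 otherwise, and \(\lambda\) weights the smoothness against the data term.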
The intuition behind the design of the above energy is as follows:
If the point is a fully visible focus point, then the appearance can be modelled as a unimodal distribution. In this case, the cost of choosing any visible camera is small and therefore the imaging quality will not be greatly influenced by the choice of labelling.
If the point is a partially occluded focus point, the visible views are still likely to yield a unimodal distribution, and the case is similar to 1).
If the point is a free point, the distribution will tend to be a uniform distribution. In this case all the cameras have relatively large cost and as a result the smoothness term will play the decisive role.
In the experiment, we adopt Boykov's graph cuts methods [8, 9] to solve the above energy minimization problem.
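The overall labelling procedure can be sketched as follows. The code below is an illustrative stand-in, not the paper's implementation: it assumes a precomputed per-pixel data cost for each camera and a Potts smoothness term, and uses iterated conditional modes (ICM) as a simple substitute optimizer for the multi-label graph cuts of Boykov et al. [8, 9]:

```python
import numpy as np

def select_cameras(costs, lam=1.0, iters=5):
    """Per-pixel camera selection by minimizing data cost + Potts smoothness.

    costs : (K, H, W) array; costs[k, y, x] is the data cost of assigning
            camera k to reference-view pixel (y, x).
    lam   : Potts smoothness weight.
    Uses ICM as a simple stand-in for the graph cuts optimizer.
    """
    K, H, W = costs.shape
    labels = costs.argmin(axis=0)  # initialise at the data-term optimum
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                best_k, best_e = labels[y, x], np.inf
                for k in range(K):
                    e = costs[k, y, x]
                    # Potts penalty against 4-connected neighbours
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W:
                            e += lam * (labels[ny, nx] != k)
                    if e < best_e:
                        best_k, best_e = k, e
                labels[y, x] = best_k
    return labels
```

ICM only reaches a local minimum, whereas alpha-expansion graph cuts give a strong approximation guarantee for Potts energies; the sketch is meant to show the structure of the optimization, not to reproduce the paper's solver.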
4. Experiments
4.1. Imaging system and datasets
In order to evaluate the performance of the proposed method under various circumstances, we have designed and set up several moving camera-based light field capture systems.
Figure 3(a) shows our moving linear camera array light field capture system. The vertical linear array contains eight Point Grey Flea3 cameras and can move smoothly on a sliding track. The entire system can simulate a virtual camera with a three-metre convex lens. In this experiment, we use this moving linear array to capture multiple occluded toys in an indoor environment (as shown in Figure 5).

Our moving camera synthetic aperture imaging systems. (a) and (b) are designed for light field capture in an indoor environment. (c) is designed for capturing an unstructured light field in an outdoor environment.
In order to capture a dense 3D light field of the scene, as well as to avoid the varying colour responses among different cameras, we have also set up a single moving camera-based imaging system (as shown in Figure 3(b)). By moving the camera in different directions, the system can simulate a virtual camera array with 900 camera views and a three-metre convex lens (Figure 3(b), right image).
To evaluate our approach in outdoor scenes, we have also set up an outdoor unstructured light field imaging system based on a moving Canon 5D Mark III camera. Figure 3(c) shows the system and examples of an outdoor building occluded by the trees in front of it. We use Zhang's automatic camera tracking method [24] to estimate the moving camera's pose and position for synthetic aperture imaging. The imaging results of this system are shown in Figures 6 and 7. Besides our own systems and datasets, we also adopt the publicly available UCSD light field dataset [23] to evaluate and compare the imaging performance of our method.
4.2. UCSD Santa dataset
The UCSD Santa light field dataset was acquired using an eight-camera array and a linear translating gantry [23]. It contains 120 views on a 120×1 grid with an image resolution of 640×512. Figure 4(a) displays five examples from the Santa dataset. We adopt view #57 as the reference camera view (Figure 4(b1)), and Figure 4(b2) shows the synthetic aperture imaging result using Vaish's method [20]. Note that when focusing on the distant chairs and windows, shadows from the foreground Santa significantly blur the image. By contrast, through multi-label graph cuts-based optimal camera selection (Figure 4(b4)), our approach successfully removes the false shadows and produces a high quality occluded object image with far greater clarity (Figure 4(b3)).

Imaging results on the UCSD Santa light field dataset [23]. (a) shows examples of the original camera views. (b1) to (b4) display the comparison results of imaging the chair through occlusion.
4.3. Our multiple occluded toys dataset
To further test our method on severe occlusion cases, we have conducted another experiment with multiple objects. We use our moving linear camera array (Figure 3(a)) to capture this set of images. As shown in Figure 5(a), the flower pot, teddy bear and penguin are lined up in a column: the penguin is occluded by the teddy bear, which is in turn occluded by the front flower pot. In particular, we can see nothing but the feet of the teddy bear from the reference camera view (Figure 5(b1)). The standard synthetic aperture imaging results of the teddy bear and penguin are shown in Figures 5(b2) and 5(c2) respectively. Due to the severe occlusion, Vaish's method [20] can only obtain a blurred image of the occluded object. By contrast, our method successfully selects the optimal camera views via energy minimization (Figures 5(b4) and 5(c4)), and provides a clear and complete image of the teddy bear (Figure 5(b3)) and penguin (Figure 5(c3)) through severe occlusion.

Imaging results through multiple occluders. Note that in this challenging scene, the penguin is occluded by the teddy bear, which is in turn occluded by the front flower pot. Our approach successfully sees through multiple objects ((b3) and (c3)).
4.4. Our window and building dataset
Figure 6 shows the imaging results of the outdoor scene through a large window. We capture this dataset with a hand-held single moving camera in our laboratory. Because of occlusion by the black window frame, we cannot get a complete image of the distant building (Figure 6(a)). Figure 6(b2) gives the synthetic aperture imaging result using Vaish's method [20], which is blurry due to the foreground window frame and the depth variation of the distant building. By contrast, our approach virtually "removes" the foreground occluder and creates a complete, clear image with rich detail via optimal camera selection.

Imaging results of the outdoor scene through the window. Note that the standard synthetic aperture imaging result is very blurred (b2). By contrast, our method successfully removes the window and generates a clear image of the occluded building (b3).
4.5. Our street dataset
To evaluate our method on a challenging street view, we have conducted another experiment with a complex outdoor scene. As shown in Figure 7, the distant buildings are occluded by the nearby trees (Figure 7(a)). Our aim is to see the building at the rear of the scene through the trees in the foreground. Comparison results using Vaish's method [20] and our method are shown in Figures 7(b2) and 7(b3). As Vaish's method [20] simply averages the intensity values of the multiple view images, it cannot select different camera views for different regions, and the resulting image is significantly blurred. In addition, since Vaish's method only focuses on a single depth plane, the image clarity is further reduced by the depth variation of the building's surface (Figure 7(b2)). Note that by selecting the optimal camera views, our method produces the desired imaging result even under depth changes and occlusion (Figure 7(b3)).

Imaging results of the distant buildings through the trees. Note that the standard synthetic aperture imaging result is very blurred (b2). By contrast, our method successfully removes the complex foreground trees and generates a clear image of the occluded building (b3).
5. Conclusion
A novel occluded object imaging approach has been presented. Experimental results with qualitative and quantitative analysis demonstrate that the proposed method can reliably select the optimal camera and generate a clear image even through severe occlusion. Moreover, the satisfactory imaging results with a moving camera indicate that this approach has great potential for many applications. Our future work will focus on extending the method and developing new applications of occluded object imaging techniques on smartphones.
