Abstract
Introduction
Three-dimensional (3-D) vision aims to recover the 3-D structure of the real world, which is a challenging problem in computer vision. Within 3-D vision, depth perception is the key sensing issue for robots and other autonomous applications. Depth-perception technologies can generally be divided into two groups: active and passive methods. Active methods emit light and analyze it by triangulation or time-of-flight measurement; lasers, the Kinect, and time-of-flight cameras are active sensors. These sensors produce remarkable results but usually suffer from limited range and from outdoor sunlight, since the emitted light falls off with distance and is swamped by ambient light. Passive methods process pairs of images and use stereo or structure-from-motion techniques to reconstruct depth, as is done by the ZED and the VI-Sensor. These methods usually have a relatively long detection range and high resolution.
Thanks to publicly available benchmarks such as Middlebury 1 and KITTI, 2 which provide good test images for both indoor and automotive environments, stereo vision continues to mature steadily. However, frame-based stereo algorithms are not sufficient for agile motion such as fast flight and immediate obstacle avoidance. One reason is that frame-based sensing pipelines do not meet real-time performance requirements: the latency of current pipelines is typically on the order of 50–200 ms, including the time for sensor data capture and data processing. A brute-force solution is to use high-frame-rate cameras and a high-performance computer, but this approach is not suitable for autonomous flight or mobile robots because of their limited payload and power budget.
The bioinspired vision sensor, also called the event-based camera, mimics the retina by asynchronously generating responses to relative light-intensity variations rather than to absolute image intensity. 3 Event-based cameras are data-driven sensors with low redundancy, high temporal resolution (on the order of microseconds), low latency, and high dynamic range (130 dB compared with 60 dB for standard cameras). These properties make them ideal for mobile robots, but traditional frame-based algorithms are not well suited to operating on event-based data.
In this article, we present a fully event-based stereo matching algorithm for reliable 3-D depth estimation using semiglobal matching (SGM). For that, we use two dynamic and active-pixel vision sensors (DAVISs) to build an event-based stereo setup (“Event-based stereo camera setup” section) as input and calculate the matching cost from the spatiotemporal attributes of events (“Event-based matching cost calculation” section). Cost aggregation is performed as an event-driven approximation of a global cost function by path-wise optimizations from all directions through the image (“Event-driven semiglobal cost aggregation” section). Disparity computation is done by a winner-takes-all mechanism (“Event-driven disparity computation” section). The “Experiments and results” section compares our method with state-of-the-art event-based stereo matching methods on data sets of five real scenes. The results demonstrate that our method achieves higher estimation accuracy and is robust across different scenes.
Related work
Event-based vision sensor
Before moving directly to event-based stereo, we first introduce the event-based vision sensor. Unlike a conventional frame-based camera, an event-based camera is not driven by a fixed-frequency clock signal. Each pixel independently records an event (a structure representing the pixel’s activity at a particular time) whenever it detects an illuminance change larger than a threshold. Today’s event-based cameras are similar to Mahowald and Mead’s silicon retina, 4 and the data are encoded with the address-event representation. 5 The operating principle and a visualized output are shown in Figure 1.

The operating principle of a single DVS pixel and a single snapshot of the event stream over 20 ms. 6 DVS: dynamic vision sensor.
In this work, DAVIS is used, which is an extension of the dynamic vision sensor (DVS) 7 with higher resolution (240 × 180) and an additional frame-based intensity readout (not used in this work). 8 Each event is represented as a quadruplet e = (x, y, t, p), where (x, y) is the pixel coordinate, t is the timestamp, and p ∈ {+1, −1} is the polarity of the intensity change.
Event-based stereo vision
The event-based stereo matching task is to find corresponding events from two different views and estimate the disparity. However, events do not encode absolute intensity, so the general features widely used in frame-based stereo matching cannot be constructed.
Mahowald and Delbruck 9 first explored event-based stereo vision by developing a chip for one-dimensional (1-D) image matching. The result was quite promising, but the chip was not flexible enough for two-dimensional (2-D) matching or a variable disparity range.
In recent years, more researchers have explored event-based matching criteria using the spatiotemporal information of the DVS. Kogler et al. 10 proposed a more flexible approach using temporal and polarity correlation with DVS sensors and achieved promising initial results.
Rogister et al. found that using temporal and polarity criteria alone is prone to errors because the latency of events varies (jitter), so they added geometric constraints, event ordering, and a temporal activity constraint.
Camuñas-Mesa et al. noted that previous methods using temporal and geometric constraints alone still achieved relatively low reconstruction accuracy because ambiguities cannot be resolved uniquely. 11 They therefore explored more motion-invariant constraints, such as orientation and time surfaces, for event-based stereo matching. 11,12
Piatkowska et al. and Firouzi and Conradt 13 both proposed modifications of the cooperative network to exploit the event stream. These methods take the idea from Marr’s cooperative computing approach 14 but make it dynamic to account for the temporal aspects of the stereo events. Our previous work 6 also uses a Markov random field (MRF) to consider depth continuity between nearby events. The results show that the estimation rate of these methods considerably outperforms earlier work.
Recently, Eibensteiner et al. 15 implemented an event-based stereo matching algorithm in hardware on a field-programmable gate array. Osswald et al. 16 proposed a spiking-based cooperative neural network that can be directly implemented with neuromorphic engineering devices.
These methods, which consider local temporal and geometric constraints, work remarkably well on some simple artificial data sets, but the estimation rate (ratio of depth estimates to input events) and the matching rate (ratio of correct depth estimates to total depth estimates) are still low. Recent cooperative-network 13 and belief-propagation 6 based methods show their advantages by using disparity uniqueness and depth continuity constraints between nearby events. However, the neighborhood radius is only 1 or 2, because a larger radius increases the computational cost, and such a small neighborhood may lead to local optima and cause mismatches.
Event-based stereo camera setup
Our stereo setup is built from two DAVISs mounted side by side and a ZED sensor below the event stereo cameras, as shown in Figure 2. The hardware setup and software driver are the same as in our previous work. 6 The baseline of the event-based cameras is 12 cm. The two DAVISs are synchronized by hardware to output event streams. The ZED sensor records depth frames as the depth ground truth. Although the ZED works at 100 Hz, depth information is still unavailable between frames at the very moment events are generated. We therefore use an approach similar to that of Weikersdorfer et al., 17 assigning each event the smallest depth value from the latest frame within a one-pixel neighborhood, which suppresses noise and missing values. This ground-truth depth is later used to evaluate the event-based matching method.

The stereo camera setup. 6
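The ground-truth assignment described above (smallest valid depth in a one-pixel neighborhood of the latest depth frame) can be sketched as follows; the row-major frame layout and the use of 0 as an invalid-depth marker are assumptions of this sketch.

```python
def assign_event_depth(depth_frame, x, y):
    """Assign ground-truth depth to an event at pixel (x, y): take the
    smallest valid depth from the latest depth frame within a one-pixel
    neighborhood, which suppresses noise and fills in missing values."""
    h, w = len(depth_frame), len(depth_frame[0])
    candidates = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h:
                d = depth_frame[ny][nx]
                if d > 0:                 # 0 marks an invalid depth reading
                    candidates.append(d)
    return min(candidates) if candidates else None
```

Taking the minimum over the neighborhood biases the label toward the foreground object, which is usually what triggered the event.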
Algorithm
The main idea of our algorithm is borrowed from SGM, which offers a good trade-off among accuracy, robustness, and runtime compared with other global stereo matching algorithms. In this section, we first give a brief review of the SGM algorithm and then introduce the three main steps of our event-driven stereo matching: event-based matching cost calculation, event-driven semiglobal cost aggregation, and event-driven disparity computation.
Semiglobal matching
Stereo matching can be defined as a labeling problem in which each label corresponds to a disparity. Typically, the quality of a labeling is formulated as a cost function, and finding the labels that minimize this cost function is the key problem. However, the labeling problem is nondeterministic polynomial time (NP)-complete. Graph cuts and belief propagation 18 are good approximate solutions, but their main drawback is high computational cost.
SGM 18 has become a popular choice for real-time depth perception applications because it successfully combines the advantages of global and local stereo methods. Several extensions and modifications of the original SGM have been proposed to improve runtime and robustness in different environments. 19–21
Typically, the SGM framework can be broken down into three steps: matching cost estimation, cost aggregation, and disparity computation. 22 The general cost function is

E(D) = Σ_p { C(p, D_p) + Σ_{q∈N_p} P1·T[|D_p − D_q| = 1] + Σ_{q∈N_p} P2·T[|D_p − D_q| > 1] }

where C(p, D_p) is the matching cost of pixel p under disparity D_p, N_p is the neighborhood of p, T[·] equals 1 if its argument is true and 0 otherwise, and P1 and P2 (with P2 ≥ P1) penalize small and large disparity discontinuities, respectively.
The problem can be modeled as an MRF, and optimization algorithms such as graph cuts and belief propagation can be used to minimize the cost function. However, such global methods incur a large computational cost, so SGM was proposed to speed up the disparity optimization. The energy propagates only along 1-D paths, rather than over the whole 2-D grid as in the MRF model. Along each path, the minimum cost is calculated by dynamic programming; the cost of each pixel is then accumulated over all paths from all directions.
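The dynamic programming step along one path can be sketched as below. This follows the standard SGM recursion (not yet this paper’s event-driven variant), and the penalty values P1 and P2 are illustrative.

```python
def path_cost(costs, P1=1.0, P2=4.0):
    """Aggregate matching costs along a 1-D path (standard SGM recursion):
    L(p, d) = C(p, d) + min(L(p-1, d),
                            L(p-1, d-1) + P1, L(p-1, d+1) + P1,
                            min_k L(p-1, k) + P2) - min_k L(p-1, k).
    `costs` is a list of per-pixel cost vectors over all disparities."""
    n_disp = len(costs[0])
    L = [list(costs[0])]                      # first pixel: raw matching costs
    for C in costs[1:]:
        prev = L[-1]
        m = min(prev)                         # best previous cost over all disparities
        row = []
        for d in range(n_disp):
            best = min(
                prev[d],                                              # same disparity
                (prev[d - 1] + P1) if d > 0 else float("inf"),        # change by 1
                (prev[d + 1] + P1) if d < n_disp - 1 else float("inf"),
                m + P2,                                               # larger jump
            )
            row.append(C[d] + best - m)       # subtract m to bound cost growth
        L.append(row)
    return L
```

Note how a strong match at the previous pixel (low prev[d]) pulls the current pixel toward a nearby disparity unless its own matching cost overrules it.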
However, traditional SGM algorithms operate only on static frame information to solve the correspondence problem: the matching cost estimation needs absolute gray-scale values or gradients, and the cost aggregation considers only spatial disparity smoothness constraints. In this work, the input is a stream of events instead of image-pair sequences, so we must redefine the matching cost and construct an event-driven SGM to handle the spatiotemporally correlated event stream.
Event-based matching cost calculation
The input event streams are preprocessed with nearest-neighbor noise filtering 24 and event-based rectification. Since each input event has exactly one rectified location, rectification can be implemented with a lookup table.
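Such a rectification lookup table might be built once from the calibration as sketched below; the `rectify` mapping is a hypothetical stand-in for the actual calibrated undistort-and-rectify transform.

```python
def build_rect_lut(width, height, rectify):
    """Precompute an event-rectification lookup table: for every pixel, store
    the rounded rectified coordinate once, so each incoming event is remapped
    with a single table access instead of a full undistort-and-rectify step.
    `rectify` is any (x, y) -> (x', y') mapping obtained from calibration."""
    lut = {}
    for y in range(height):
        for x in range(width):
            rx, ry = rectify(x, y)
            rx, ry = int(round(rx)), int(round(ry))
            if 0 <= rx < width and 0 <= ry < height:
                lut[(x, y)] = (rx, ry)    # events mapping outside the frame are dropped
    return lut
```

At run time, rectifying an event then costs one dictionary lookup, which matters at microsecond-scale event rates.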
After the preprocessing, for each incoming event
where
To make our matching cost estimation event driven, two mechanisms are used. First, instead of using a gray-scale image, a last-spike map over all pixels is created for each sensor to store the spatiotemporal information; that is, each camera has its own last-spike map, which is updated locally for each incoming event. The size of the map is
Then the matching cost of an incoming event in the left camera can be calculated against candidate events in the right camera. Events carry no intensity information, so our algorithm considers only events of the same polarity and uses time and epipolar geometry to compute the matching cost for potential matches. The matching cost of every pixel is initialized to a large constant value. When a new event arrives, the matching cost is calculated as
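Since the exact cost expression is not reproduced in this excerpt, the sketch below uses a simple stand-in: a per-polarity last-spike map per camera, and the absolute time difference between last spikes of the same polarity along the rectified row as the matching cost. Both the map layout and the time-difference cost are assumptions of this sketch, not the paper’s definitive formula.

```python
import math

class LastSpikeMap:
    """Per-camera map storing, for each pixel and polarity, the timestamp of
    the most recent event; updated locally as each event arrives."""
    def __init__(self, width, height):
        self.t = {+1: [[-math.inf] * width for _ in range(height)],
                  -1: [[-math.inf] * width for _ in range(height)]}

    def update(self, x, y, t, polarity):
        self.t[polarity][y][x] = t

def matching_costs(left_event, right_map, max_disp, big=1e9):
    """Temporal matching cost of a left event against right-camera candidates
    on the same rectified row, restricted to the same polarity."""
    x, y, t, p = left_event
    costs = []
    for d in range(max_disp + 1):
        xr = x - d                         # epipolar candidate in the right view
        if xr < 0:
            costs.append(big)
            continue
        tr = right_map.t[p][y][xr]
        costs.append(abs(t - tr) if tr > -math.inf else big)
    return costs
```

Pixels with no recent spike keep the large initialization value, so only temporally close candidates compete.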
Event-driven semiglobal cost aggregation
Traditionally, the SGM algorithm accumulates cost along all paths at every pixel of the image. Our modified event-driven SGM does not update the entire image simultaneously but updates only when a new event arrives from the stereo matching step.
We redefine the path costs
where the
The temporal correlation kernel ensures that only recently updated nodes are active. The kernel can be defined as inverse-linear, Gaussian, or quadratic; in our algorithm, we use a Heaviside step function, which ensures that only active pixels are used to update the current incoming event. Meanwhile, considering the edge-like sparsity of the event data, each path does not cross the whole image but uses
After all paths in all directions are calculated, the cost for each incoming event and each disparity is obtained using
Event-driven disparity computation
Finally, the disparity that minimizes the cost at each pixel is selected as the output disparity. Stereo matching is a labeling problem; we select the label
This mechanism makes our disparity computation event driven. Whenever there is a new event arriving, the most likely disparity at the location of the observation can be output.
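The winner-takes-all selection itself reduces to taking the arg-min over the aggregated cost vector of the event’s pixel:

```python
def wta_disparity(aggregated):
    """Winner-takes-all: pick the disparity with the minimum aggregated cost
    for the pixel of the incoming event."""
    return min(range(len(aggregated)), key=aggregated.__getitem__)
```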
The general workflow of the algorithm is depicted in Algorithm 1.
Experiments and results
Experiment setup
The experiments compare our algorithm with other event-based stereo matching algorithms. 11,13,25 Previous studies typically use simple objects such as pens, 25 rings, and cubes 11 as stimuli and show the detected disparity, but the accuracy of the algorithms is not analyzed quantitatively, and there is no per-event ground-truth depth with which to analyze the results precisely. Firouzi and Conradt used more complex stimuli, such as hands shaking at different depths; however, the ground truth was estimated by manually measuring the distance between the camera and the object, and all triggered events were assumed to be at the same disparity. 13
Recently, some data sets for event-based simultaneous localization and mapping 26,27 have become available, but none of them were created for event-based stereo matching, and the aforementioned previous studies do not release their test data sets. We therefore replicated these algorithms and used our own data sets, 6 including not only simple rigid objects such as boxes but also flexible objects such as walking people, with ground-truth depth.
Five event stereo data sets covering different scenes were recorded for comparison: one box moving sideways (one box), two boxes at different depths moving sideways (two boxes), one person walking sideways (one person), two people at different depths walking sideways (two people), and one person walking from near to far (one person at different depths).
The parameters used for the comparison are manually tuned using the
The depth map and the disparity histogram are used to evaluate the performance of each algorithm. To visualize events with depth, the depth maps are accumulated over 40 ms of events; each pixel represents an event, and the jet color map encodes depth from red to blue (red means close, blue means far away). The disparity histograms show the number of events at each disparity.
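The disparity histogram used for evaluation can be computed as sketched below, assuming each event record carries its estimated disparity as a fourth field (a representational assumption of this sketch).

```python
def disparity_histogram(events, max_disp):
    """Count how many events received each disparity value; a sharp peak at
    the ground-truth disparity indicates accurate matching."""
    hist = [0] * (max_disp + 1)
    for _, _, _, disp in events:          # event = (x, y, t, estimated disparity)
        if 0 <= disp <= max_disp:
            hist[disp] += 1
    return hist
```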
In order to quantitatively evaluate the result, we used four measures:
where
In the experiments,
Result and discussion
In the one-box and one-person scenes, we test the algorithms on a single moving object of different complexity. The depth maps and disparity histograms using the ST, STS, Cop-Net, and our ESGM methods (from left to right) are shown in Figures 3 and 4, respectively.

The result of the one box scene. The upper row is a color-coded depth map generated by accumulating 40 ms of depth estimates. The ground-truth depth of the box is 2 m, and the corresponding disparity is 15. The lower row shows the event disparity histograms over a period of 3 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.

The result of the one person scene. The upper row is a color-coded depth map generated by accumulating 40 ms of disparity estimates. The ground-truth depth of the person is 3 m, and the corresponding disparity is 10. The lower row shows the event disparity histograms over a period of 5 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.
For the depth maps, STS has fewer mismatches than ST; mismatches typically appear in complex areas, where events are triggered almost simultaneously and in nearby rows. The Cop-Net result is denser than the others and removes some mismatches, but mismatches or blank areas remain on the hands and legs of the walking people. Our method clearly has fewer wrong matches (red or dark blue pixels). For the disparity histograms, the Cop-Net and ESGM results are obviously sharper around the ground truth, and our algorithm performs better at eliminating mismatches.
To evaluate performance with temporally overlapping objects, the two-object scenes (two boxes and two walking people) are used. The depth maps and disparity histograms using the ST, STS, Cop-Net, and our ESGM methods (from left to right) are shown in Figures 5 and 6, respectively. The performance of each method follows the same pattern as in the one-object scenes.

The result of the two boxes scene. The upper row is a color-coded depth map of a 40-ms long stream of events for two moving boxes (one at 1.5 m and the other at 3 m). The lower row shows the event disparity histograms over a period of 5 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.

The result of the two people scene. The upper row is a color-coded depth map of a 40-ms long stream of events for two walking people (one at 1.5 m and the other at 3 m). The lower row shows the event disparity histograms over a period of 5 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.
To evaluate performance with one flexible object moving through different depths, the scene of one person walking from near to far is tested. The disparity maps and histograms using the ST, STS, Cop-Net, and our ESGM methods (from left to right) are shown in Figure 7. This data set is more difficult because the disparity of the object varies over time and different parts of the body trigger events across common epipolar lines. None of the methods perform as well as on the previous data sets, but the proposed algorithm comes closest to the ground truth.

The result of the one person at different depths scene. The upper row is a color-coded depth map of a 40-ms long stream of events for one walking person (from 1 to 5 m). The lower row shows the event disparity histograms over a period of 5 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.
The quantitative results for the estimation rate, estimation accuracy, and depth accuracy of each method are shown in Table 1. The estimation rate of Cop-Net is higher than that of the other methods. The average estimation accuracy of the proposed method is 10–20% better than that of the others, and its mean depth error is much smaller. The depth accuracy,
The choice of options.
SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM. The bold values indicate the quantitative results of the estimation rate, the estimation accuracy, and depth accuracy.

The relationship between depth accuracy
It should be mentioned that some black grid patterns appear in the disparity maps, which become more obvious when the events are dense. They are caused by rectification: for each incoming event, we must find the corresponding location in rectified camera coordinates, and this location usually has to be rounded to an integer coordinate.
For all the recordings tested in this article, the ESGM algorithm achieves better estimation accuracy and shows robustness across data sets. The ST and STS algorithms use properties or local features of events to find corresponding matches; although their criteria are well defined, these methods may still suffer from noise and occlusion. The cooperative network algorithm creates a network that stores the “state” of recently detected events, so the disparity of each incoming event relies not only on the matching result but also on its spatiotemporal neighborhood, which suppresses mismatches and increases the estimation rate. However, a small neighborhood is easily affected by noise and previous mismatches, while updating a large neighborhood is time-consuming and unsuited to edge-like event streams.
A good example can be found in the first row of Figure 4. Focusing on the front leg of the walking person, the ST and STS methods both produce mismatches because the spatial, temporal, and local-feature criteria are similar across the leg area. The cooperative network method might resolve this situation, but its parameters were not tuned on the walking data set, and a high outlier threshold may remove all the events as mismatches. The advantage of our method is that it uses information from paths in all directions to determine the influence between a pixel and its neighborhood. The 1-D cost regularization can be computed efficiently, so the range of each path is much larger than the neighborhood of the cooperative network, which suppresses local mismatches and makes the algorithm robust across different data sets with the same parameters.
Conclusion
We have proposed a fully event-based 3-D depth perception algorithm using a message passing method. Unlike previous algorithms that consider the constraints of only a single event, our algorithm considers uniqueness constraints and disparity continuity constraints between adjacent events. Based on the traditional idea of SGM, we propose a novel event-driven SGM framework. Compared with several state-of-the-art event-based stereo matching methods, the results show that our method achieves higher estimation accuracy. Future work will address the situation of camera motion and dense depth reconstruction.
