Abstract
Introduction
Three-dimensional (3-D) vision aims to recover the 3-D structure of the real world, which is a challenging problem in computer vision. Within 3-D vision, depth perception is the key sensing issue for robots and other autonomous applications. Depth-perception technologies can generally be divided into two groups: active and passive methods. Active methods emit light and analyze it by triangulation or time-of-flight measurement; lasers, the Kinect, and time-of-flight cameras are active sensors. These sensors produce remarkable results but usually suffer from limited range and from outdoor sunlight, since the emitted light falls off with distance and is swamped by ambient light. Passive methods process pairs of images and use stereo or structure-from-motion techniques to reconstruct depth, as is done by the ZED and the VI-Sensor. These methods usually have a relatively long detection range and high resolution.
Thanks to publicly available benchmarks such as Middlebury 1 and KITTI, 2 which provide good test images for both indoor and automotive environments, stereo vision continues to mature steadily. However, frame-based stereo algorithms are not sufficient for agile motion such as fast flight and immediate obstacle avoidance. One reason is that frame-based sensing pipelines do not meet real-time performance requirements: the latency of current pipelines is typically on the order of 50–200 ms, including the time for sensor data capture and data processing. A brute-force solution is to use high-frame-rate cameras and a high-performance computer, but this approach is not suitable for autonomous flight or mobile robots because of their limited payload and power budget.
The bioinspired vision sensor, also called the event-based camera, mimics the retina by asynchronously generating responses to relative light-intensity variations rather than to absolute image intensity. 3 Event-based cameras are data-driven sensors with low redundancy, high temporal resolution (on the order of microseconds), low latency, and high dynamic range (130 dB compared with 60 dB for standard cameras). These properties make them ideal for mobile robots, but traditional frame-based algorithms are not well suited to operating on event-based data.
In this article, we present a fully event-based stereo matching algorithm for reliable 3-D depth estimation using semiglobal matching (SGM). For that, we use two dynamic and active-pixel vision sensors (DAVISs) to build an event-based stereo setup (“Event-based stereo camera setup” section) as input and calculate the matching cost from the spatiotemporal attributes of events (“Event-based matching cost calculation” section). Cost aggregation is performed as an event-driven approximation of a global cost function by path-wise optimizations from all directions through the image (“Event-driven semiglobal cost aggregation” section). Disparity computation is done by a winner-takes-all mechanism (“Event-driven disparity computation” section). The “Experiments and results” section compares our method with state-of-the-art event-based stereo matching methods on data sets of five real scenes. The results demonstrate that our method achieves higher estimation accuracy and is robust across different scenes.
Related work
Event-based vision sensor
Before moving directly to event-based stereo, we first introduce the event-based vision sensor. Unlike a conventional frame-based camera, an event-based camera is not driven by a fixed-frequency clock signal. Each pixel independently records an event (a structure representing the pixel’s activity at a particular time) whenever it detects an illuminance change larger than a threshold. Today’s event-based cameras are similar to Mahowald and Mead’s silicon retina, 4 and the data are encoded with the address-event representation. 5 The operating principle and a visualized output are shown in Figure 1.

The operating principle of a single DVS pixel and a single snapshot of the event stream over 20 ms. 6 DVS: dynamic vision sensor.
In this work, DAVIS is used, which is an extension of the dynamic vision sensor (DVS) 7 with higher resolution (240 × 180) and an additional frame-based intensity readout (not used in this work). 8 Each event is represented as a quadruplet e = (x, y, t, p), where (x, y) is the pixel coordinate, t is the timestamp, and p ∈ {+1, −1} is the polarity of the intensity change.
Event-based stereo vision
The event-based stereo matching task is to find corresponding events from two different views and estimate the disparity. However, events do not encode absolute intensity, so the general features widely used in frame-based stereo matching cannot be constructed.
Mahowald and Delbruck 9 first explored event-based stereo vision by developing a chip for one-dimensional (1-D) image matching. The result was quite promising, but the chip was not flexible enough for two-dimensional (2-D) matching or a variable disparity range.
In recent years, more researchers have explored event-based matching criteria using the spatiotemporal information of the DVS. Kogler et al. 10 proposed a more flexible approach using temporal and polarity correlation with DVS sensors and achieved promising initial results.
Rogister et al. found that using temporal and polarity criteria alone is prone to errors because the latency of events varies (jitter), so they added geometric constraints, event ordering, and a temporal activity constraint.
Camuñas-Mesa et al. noted that previous methods using temporal and geometric constraints alone still achieved relatively low reconstruction accuracy because ambiguities cannot be resolved uniquely. 11 They therefore explored more motion-invariant constraints, such as orientation and time surfaces, for event-based stereo matching. 11,12
Piatkowska et al. and Firouzi and Conradt 13 both proposed modifications of the cooperative network to exploit the event stream. These methods take the idea from Marr’s cooperative computing approach 14 but make it dynamic to account for the temporal aspects of the stereo events. Our previous work 6 also uses a Markov random field (MRF) to consider depth continuity between nearby events. The results show that the estimation rate of these methods considerably outperforms earlier work.
Recently, Eibensteiner et al. 15 implemented an event-based stereo matching algorithm in hardware on a field-programmable gate array. Osswald et al. 16 proposed a spiking-based cooperative neural network that can be directly implemented with neuromorphic engineering devices.
These methods, which consider local temporal and geometric constraints, work remarkably well on some simple artificial data sets, but the estimation rate (ratio of depth estimates to input events) and the matching rate (ratio of correct depth estimates to total depth estimates) are still low. Recent cooperative-network 13 and belief-propagation 6 based methods show their advantages by using disparity uniqueness and depth continuity constraints between nearby events. However, the neighborhood radius is only 1 or 2, because a larger radius increases the computational cost, and such a small neighborhood may lead to local optima and cause mismatches.
Event-based stereo camera setup
Our stereo setup is built from two DAVISs mounted side by side and a ZED sensor below the event stereo cameras, as shown in Figure 2. The hardware setup and software driver are the same as in our previous work. 6 The baseline of the event-based cameras is 12 cm. The two DAVISs are synchronized by hardware to output event streams. The ZED sensor records depth frames as the depth ground truth. Although the ZED works at 100 Hz, depth information is still unavailable between frames at the very moment events are generated. We therefore use an approach similar to that of Weikersdorfer et al., 17 assigning each event the smallest depth value from the latest frame within a one-pixel neighborhood, which suppresses noise and missing values. This ground-truth depth is later used to evaluate the event-based matching method.

The stereo camera setup. 6
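The ground-truth assignment described above (smallest valid depth in a one-pixel neighborhood of the latest depth frame) can be sketched as follows; the row-major frame layout and the use of 0 as an invalid-depth marker are assumptions of this sketch.

```python
def assign_event_depth(depth_frame, x, y):
    """Assign ground-truth depth to an event at pixel (x, y): take the
    smallest valid depth from the latest depth frame within a one-pixel
    neighborhood, which suppresses noise and fills in missing values."""
    h, w = len(depth_frame), len(depth_frame[0])
    candidates = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h:
                d = depth_frame[ny][nx]
                if d > 0:                 # 0 marks an invalid depth reading
                    candidates.append(d)
    return min(candidates) if candidates else None
```

Taking the minimum over the neighborhood biases the label toward the foreground object, which is usually what triggered the event.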
Algorithm
The main idea of our algorithm is borrowed from SGM, which offers a good trade-off among accuracy, robustness, and runtime compared with other global stereo matching algorithms. In this section, we first give a brief review of the SGM algorithm and then introduce the three main steps of our event-driven stereo matching: event-based matching cost calculation, event-driven semiglobal cost aggregation, and event-driven disparity computation.
Semiglobal matching
Stereo matching can be defined as a labeling problem in which each label corresponds to a disparity. Typically, the quality of a labeling is formulated as a cost function, and finding the labels that minimize this cost function is the key problem. However, the labeling problem is nondeterministic polynomial time (NP)-complete. Graph cuts and belief propagation 18 are good approximate solutions, but their main drawback is high computational cost.
SGM 18 has become a popular choice for real-time depth perception applications because it successfully combines the advantages of global and local stereo methods. Several extensions and modifications of the original SGM have been proposed to improve runtime and robustness in different environments. 19–21
Typically, the SGM framework can be broken down into three steps: matching cost estimation, cost aggregation, and disparity computation. 22 The general cost function is

E(D) = Σ_p { C(p, D_p) + Σ_{q∈N_p} P1·T[|D_p − D_q| = 1] + Σ_{q∈N_p} P2·T[|D_p − D_q| > 1] }

where C(p, D_p) is the matching cost of pixel p under disparity D_p, N_p is the neighborhood of p, T[·] equals 1 if its argument is true and 0 otherwise, and P1 and P2 (with P2 ≥ P1) penalize small and large disparity discontinuities, respectively.
The problem can be modeled as an MRF, and optimization algorithms such as graph cuts and belief propagation can be used to minimize the cost function. However, such global methods incur a large computational cost, so SGM was proposed to speed up the disparity optimization. The energy propagates only along 1-D paths, rather than over the whole 2-D grid as in the MRF model. Along each path, the minimum cost is calculated by dynamic programming; the cost of each pixel is then accumulated over all paths from all directions.
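The dynamic programming step along one path can be sketched as below. This follows the standard SGM recursion (not yet this paper’s event-driven variant), and the penalty values P1 and P2 are illustrative.

```python
def path_cost(costs, P1=1.0, P2=4.0):
    """Aggregate matching costs along a 1-D path (standard SGM recursion):
    L(p, d) = C(p, d) + min(L(p-1, d),
                            L(p-1, d-1) + P1, L(p-1, d+1) + P1,
                            min_k L(p-1, k) + P2) - min_k L(p-1, k).
    `costs` is a list of per-pixel cost vectors over all disparities."""
    n_disp = len(costs[0])
    L = [list(costs[0])]                      # first pixel: raw matching costs
    for C in costs[1:]:
        prev = L[-1]
        m = min(prev)                         # best previous cost over all disparities
        row = []
        for d in range(n_disp):
            best = min(
                prev[d],                                              # same disparity
                (prev[d - 1] + P1) if d > 0 else float("inf"),        # change by 1
                (prev[d + 1] + P1) if d < n_disp - 1 else float("inf"),
                m + P2,                                               # larger jump
            )
            row.append(C[d] + best - m)       # subtract m to bound cost growth
        L.append(row)
    return L
```

Note how a strong match at the previous pixel (low prev[d]) pulls the current pixel toward a nearby disparity unless its own matching cost overrules it.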
However, traditional SGM algorithms operate only on static frame information to solve the correspondence problem: the matching cost estimation needs absolute gray-scale values or gradients, and the cost aggregation considers only spatial disparity smoothness constraints. In this work, the input is a stream of events instead of image-pair sequences, so we must redefine the matching cost and construct an event-driven SGM to handle the spatiotemporally correlated event stream.
Event-based matching cost calculation
The input event streams are preprocessed with nearest-neighbor noise filtering 24 and event-based rectification. Since each input event has exactly one rectified location, rectification can be implemented with a lookup table.
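Such a rectification lookup table might be built once from the calibration as sketched below; the `rectify` mapping is a hypothetical stand-in for the actual calibrated undistort-and-rectify transform.

```python
def build_rect_lut(width, height, rectify):
    """Precompute an event-rectification lookup table: for every pixel, store
    the rounded rectified coordinate once, so each incoming event is remapped
    with a single table access instead of a full undistort-and-rectify step.
    `rectify` is any (x, y) -> (x', y') mapping obtained from calibration."""
    lut = {}
    for y in range(height):
        for x in range(width):
            rx, ry = rectify(x, y)
            rx, ry = int(round(rx)), int(round(ry))
            if 0 <= rx < width and 0 <= ry < height:
                lut[(x, y)] = (rx, ry)    # events mapping outside the frame are dropped
    return lut
```

At run time, rectifying an event then costs one dictionary lookup, which matters at microsecond-scale event rates.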
After the preprocessing, for each incoming event
where
To make our matching cost estimation event driven, two mechanisms are used. First, instead of using a gray-scale image, a last-spike map over all pixels is created for each sensor to store the spatiotemporal information; that is, each camera has its own last-spike map, which is updated locally for each incoming event. The size of the map is
Then the matching cost of an incoming event in the left camera can be calculated against candidate events in the right camera. Events carry no intensity information, so our algorithm considers only events of the same polarity and uses time and epipolar geometry to compute the matching cost for potential matches. The matching cost of every pixel is initialized to a large constant value. When a new event arrives, the matching cost is calculated as
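Since the exact cost expression is not reproduced in this excerpt, the sketch below uses a simple stand-in: a per-polarity last-spike map per camera, and the absolute time difference between last spikes of the same polarity along the rectified row as the matching cost. Both the map layout and the time-difference cost are assumptions of this sketch, not the paper’s definitive formula.

```python
import math

class LastSpikeMap:
    """Per-camera map storing, for each pixel and polarity, the timestamp of
    the most recent event; updated locally as each event arrives."""
    def __init__(self, width, height):
        self.t = {+1: [[-math.inf] * width for _ in range(height)],
                  -1: [[-math.inf] * width for _ in range(height)]}

    def update(self, x, y, t, polarity):
        self.t[polarity][y][x] = t

def matching_costs(left_event, right_map, max_disp, big=1e9):
    """Temporal matching cost of a left event against right-camera candidates
    on the same rectified row, restricted to the same polarity."""
    x, y, t, p = left_event
    costs = []
    for d in range(max_disp + 1):
        xr = x - d                         # epipolar candidate in the right view
        if xr < 0:
            costs.append(big)
            continue
        tr = right_map.t[p][y][xr]
        costs.append(abs(t - tr) if tr > -math.inf else big)
    return costs
```

Pixels with no recent spike keep the large initialization value, so only temporally close candidates compete.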
Event-driven semiglobal cost aggregation
Traditionally, the SGM algorithm accumulates cost along all paths at every pixel of the image. Our modified event-driven SGM does not update the entire image simultaneously but updates only when a new event arrives from the stereo matching step.
We redefine the path costs
where the
The temporal correlation kernel ensures that only recently updated nodes are active. The kernel can be defined as inverse-linear, Gaussian, or quadratic; in our algorithm, we use a Heaviside step function, which ensures that only active pixels are used to update the current incoming event. Meanwhile, considering the edge-like sparsity of the event data, each path does not cross the whole image but uses
After all paths in all directions are calculated, the cost for each incoming event and each disparity is obtained using
Event-driven disparity computation
Finally, the disparity that minimizes the cost at each pixel is selected as the output disparity. Stereo matching is a labeling problem; we select the label
This mechanism makes our disparity computation event driven. Whenever there is a new event arriving, the most likely disparity at the location of the observation can be output.
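The winner-takes-all selection itself reduces to taking the arg-min over the aggregated cost vector of the event’s pixel:

```python
def wta_disparity(aggregated):
    """Winner-takes-all: pick the disparity with the minimum aggregated cost
    for the pixel of the incoming event."""
    return min(range(len(aggregated)), key=aggregated.__getitem__)
```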
The general workflow of the algorithm is depicted in Algorithm 1.
Experiments and results
Experiment setup
The experiments compare our algorithm with other event-based stereo matching algorithms. 11,13,25 Previous studies typically use simple objects such as pens, 25 rings, and cubes 11 as stimuli and show the detected disparity, but the accuracy of the algorithms is not analyzed quantitatively, and there is no per-event ground-truth depth with which to analyze the results precisely. Firouzi and Conradt used more complex stimuli, such as hands shaking at different depths; however, the ground truth was estimated by manually measuring the distance between the camera and the object, and all triggered events were assumed to be at the same disparity. 13
Recently, some data sets for event-based simultaneous localization and mapping 26,27 have become available, but none of them were created for event-based stereo matching, and the aforementioned previous studies do not release their test data sets. We therefore replicated these algorithms and used our own data sets, 6 including not only simple rigid objects such as boxes but also flexible objects such as walking people, with ground-truth depth.
Five event stereo data sets covering different scenes were recorded for comparison: one box moving sideways (one box), two boxes at different depths moving sideways (two boxes), one person walking sideways (one person), two people at different depths walking sideways (two people), and one person walking from near to far (one person at different depths).
The parameters used for the comparison are manually tuned using the
The depth map and the disparity histogram are used to evaluate the performance of each algorithm. To visualize events with depth, the depth maps are accumulated over 40 ms of events; each pixel represents an event, and the jet color map encodes depth from red to blue (red means close, blue means far away). The disparity histograms show the number of events at each disparity.
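The disparity histogram used for evaluation can be computed as sketched below, assuming each event record carries its estimated disparity as a fourth field (a representational assumption of this sketch).

```python
def disparity_histogram(events, max_disp):
    """Count how many events received each disparity value; a sharp peak at
    the ground-truth disparity indicates accurate matching."""
    hist = [0] * (max_disp + 1)
    for _, _, _, disp in events:          # event = (x, y, t, estimated disparity)
        if 0 <= disp <= max_disp:
            hist[disp] += 1
    return hist
```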
In order to quantitatively evaluate the result, we used four measures:
where
In the experiments,
Result and discussion
In the one-box and one-person scenes, we test the algorithms on a single moving object of different complexity. The depth maps and disparity histograms using the ST, STS, Cop-Net, and our ESGM methods (from left to right) are shown in Figures 3 and 4, respectively.

The result of the one box scene. The upper row is a color-coded depth map generated by accumulating 40 ms of depth estimates. The ground-truth depth of the box is 2 m, and the corresponding disparity is 15. The lower row shows the event disparity histograms over a period of 3 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.

The result of the one person scene. The upper row is a color-coded depth map generated by accumulating 40 ms of disparity estimates. The ground-truth depth of the person is 3 m, and the corresponding disparity is 10. The lower row shows the event disparity histograms over a period of 5 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.
For the depth maps, STS has fewer mismatches than ST; mismatches typically appear in complex areas, where events are triggered almost simultaneously and in nearby rows. The Cop-Net result is denser than the others and removes some mismatches, but mismatches or blank areas remain on the hands and legs of the walking people. Our method clearly has fewer wrong matches (red or dark blue pixels). For the disparity histograms, the Cop-Net and ESGM results are obviously sharper around the ground truth, and our algorithm performs better at eliminating mismatches.
To evaluate performance with temporally overlapping objects, the two-object scenes (two boxes and two walking people) are used. The depth maps and disparity histograms using the ST, STS, Cop-Net, and our ESGM methods (from left to right) are shown in Figures 5 and 6, respectively. The performance of each method follows the same pattern as in the one-object scenes.

The result of the two boxes scene. The upper row is a color-coded depth map of a 40-ms long stream of events for two moving boxes (one at 1.5 m and the other at 3 m). The lower row shows the event disparity histograms over a period of 5 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.

The result of the two people scene. The upper row is a color-coded depth map of a 40-ms long stream of events for two walking people (one at 1.5 m and the other at 3 m). The lower row shows the event disparity histograms over a period of 5 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.
To evaluate performance with one flexible object moving through different depths, the scene of one person walking from near to far is tested. The disparity maps and histograms using the ST, STS, Cop-Net, and our ESGM methods (from left to right) are shown in Figure 7. This data set is more difficult because the disparity of the object varies over time and different parts of the body trigger events across common epipolar lines. None of the methods perform as well as on the previous data sets, but the proposed algorithm comes closest to the ground truth.

The result of the one person at different depths scene. The upper row is a color-coded depth map of a 40-ms long stream of events for one walking person (from 1 to 5 m). The lower row shows the event disparity histograms over a period of 5 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.
The quantitative results for the estimation rate, estimation accuracy, and depth accuracy of each method are shown in Table 1. The estimation rate of Cop-Net is higher than that of the other methods. The average estimation accuracy of the proposed method is 10–20% better than that of the others, and its mean depth error is much smaller. The depth accuracy,
The choice of options.
SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM. The bold values indicate the quantitative results of the estimation rate, the estimation accuracy, and depth accuracy.

The relationship between depth accuracy
It should be mentioned that some black grid patterns appear in the disparity maps, which become more obvious when the events are dense. They are caused by rectification: for each incoming event, we must find the corresponding location in rectified camera coordinates, and this location usually has to be rounded to an integer coordinate.
For all the recordings tested in this article, the ESGM algorithm achieves better estimation accuracy and shows robustness across data sets. The ST and STS algorithms use properties or local features of events to find corresponding matches; although their criteria are well defined, these methods may still suffer from noise and occlusion. The cooperative network algorithm creates a network that stores the “state” of recently detected events, so the disparity of each incoming event relies not only on the matching result but also on its spatiotemporal neighborhood, which suppresses mismatches and increases the estimation rate. However, a small neighborhood is easily affected by noise and previous mismatches, while updating a large neighborhood is time-consuming and unsuited to edge-like event streams.
A good example can be found in the first row of Figure 4. Focusing on the front leg of the walking person, the ST and STS methods both produce mismatches because the spatial, temporal, and local-feature criteria are similar across the leg area. The cooperative network method might resolve this situation, but its parameters were not tuned on the walking data set, and a high outlier threshold may remove all the events as mismatches. The advantage of our method is that it uses information from paths in all directions to determine the influence between a pixel and its neighborhood. The 1-D cost regularization can be computed efficiently, so the range of each path is much larger than the neighborhood of the cooperative network, which suppresses local mismatches and makes the algorithm robust across different data sets with the same parameters.
Conclusion
We have proposed a fully event-based 3-D depth perception algorithm using a message passing method. Unlike previous algorithms that consider the constraints of only a single event, our algorithm considers uniqueness constraints and disparity continuity constraints between adjacent events. Based on the traditional idea of SGM, we propose a novel event-driven SGM framework. Compared with several state-of-the-art event-based stereo matching methods, the results show that our method achieves higher estimation accuracy. Future work will address the situation of camera motion and dense depth reconstruction.
