Abstract
1. Introduction
Simultaneous Localization and Mapping (SLAM), or Concurrent Mapping and Localization (CML), is a well-known and well-studied problem among the members of the robotics community, being one of the most active fields of research over the last years. The SLAM problem concerns how a mobile robot can operate in an a priori unknown environment by means of only on-board sensors to simultaneously build a map of its surroundings using this to track its position. Thus, SLAM is one of the most important problems to solve in robotics in order to build truly autonomous mobile robots.
There are many techniques and algorithms that have been developed in order to address this problem, many of them aiming to be run on-line on a robotic device. Most of these solutions focus on the estimation of self-mapped features located through assorted types of sensors. The most frequently used sensors for SLAM techniques include odometers, radar, GPS and several kinds of range finders such as laser, sonar and infrared-based ones [1][2]. All these sensors have their own advantages, but also several drawbacks have to be considered, such as: the increasing difficultly regarding correspondence or data association, the limitation to 2D mapping, excessive computational requirements, or being too expensive to be deployed on certain commercial platforms. At the same time, consumer demand has pushed the industry to the mass production of cheap and reliable camera devices at relatively low prices. All these factors have contributed to the appearance of recent works concerning the use of cameras as main sensors. The requirements regarding the accessibility and easiness of the use of the cameras, and the amount and diversity of the data provided by them as sensors, make computer vision an obvious choice for streamlining autonomous robotics. This will lead to robotic platforms with less high-end sensors with complex requirements and being hard to integrate, substituted by off-the-shelf cameras, maintaining their levels of performance and accuracy. The use of a camera as a main sensor for SLAM is of interest given the information that it provides, can serve as the information required for solving the data association problem and do so easily. Research into computer vision is constantly developing techniques to obtain information from images that can be used in visual SLAM. In any case, Monocular SLAM with six degrees of freedom (as presented in this work) remains one of the hardest SLAM variants, as only bearing data is provided by the camera, thus special techniques are needed to obtain information about the depth of a given point in an image.
This problem has several solutions stemming from the structure-from-motion field of research [3][4], being closely related to monocular SLAM. But a great deal of these solutions is based on methods of global nonlinear optimization, best performed offline, making them unsuitable for SLAM necessities.
Some relevant works on monocular SLAM rely on additional sensors, such as Strelow [5], who proposes mixing inertial sensors with a camera into an Iterated Extended Kalman Filter (IEKF). Other works try to employ different estimation techniques, such as Particle Filters (PF), in Kwok and Dissanayake [6][7]. Still, some of the most notable works are based on the well-known EKF: Davison [8] proved the feasibility of real-time monocular SLAM using an EKF; and Montiel [9] developed the Inverse-Depth parameterization, which allows initializing features to be seen by the robot with a heuristically chosen value for depth.
With regards to the monocular SLAM problem, the batch validation problem has been considered only in recent years. A good survey of the techniques for dealing with this can be found in the works of Durrant-Whyte and Bailey [10][11]. The early implementation of SLAM techniques dealt with each data association individually, but this validation strategy frequently leads to unreliable results, as it is easy to have a wrong match due to the texture and geometry of the environment. Eventually batch validation was introduced, considering multiple data associations to be validated at the same time. Currently, several tests and methodologies exist for dealing with the batch validation of data association. One of the most usual validation methods is the Joint Compatibility Branch and Bound technique, JCBB [12]. The JCBB method is considered as a strong batch validation technique, but the algorithm has a worst-case exponential cost, partly mitigated by several optimizations. This method gives great results within the context of undelayed depth feature initialization, as it allows for those matches deemed incompatible to be ignored with the rest of the data association pairs. Another widely known batch validation technique is the Combined Constraint Data Association (CCDA) developed by Bailey [13], based upon graphs instead of trees. The strengths of these techniques reside in the ability to test batch validation without knowing the device pose robustly in a cluttered environment.
Developments within the field of monocular SLAM are having a growing impact on applied robotics, as seen over the last years. Examples of visual SLAM-navigated applications are to be found deployed on land, at sea and in the air, creating new ways of and techniques for autonomous navigation. Though based on stereo vision, in [14] a high density map is reconstructed from a stereo video, leading to a dense cloud map allowing for accurate navigation. For sea depths, a monocular SLAM-based mosaicing technique [15] allows ROVs to reconstruct visual maps of the seabed. Recent developments in Micro Aerial Vehicles (MAV), as presented in [16], introduce monocular SLAM as a way to stabilize and navigate MAVs without external tracking or prior knowledge of the environment. Although this last work is based on a new parallel tracking batch optimization technique [17], it shows the potential of monocular SLAM for autonomous robotics applications. Further developments in filtering SLAM techniques will expand the possibilities of the limited autonomous robotics systems, where filtering monocular SLAM techniques still have an edge over more complex but also computationally expensive approaches [18].
This paper significantly expands on the authors' previous works proposed in [19] and [20]. Those works presented and dealt with several aspects of DI-D initialization, such as parameter adjustment and a comparison with similar approaches, without dealing with the data association validation problem. The aim of this work is to explicitly address the batch validation problem with the proposal of a novel algorithm, the HOHCT, which evaluates the compatibility of a given set of matches, and tries to obtain the largest subset of compatible matches efficiently. These improvements allow developers to start testing the technique deployed in a robotic system in order to determine its feasibility as an enabling technology with which to decrease the sensory requirements of robotics systems. The following sections deal with the accurate description of the system in terms of its formulation, with emphasis on the novel HOHCT proposed algorithm; after the description of the method, several experimental cases are presented which test the proposal with real data, discussing the results and their implications on the performance of this novel contribution.
2. Monocular SLAM with DI-D initialization
The procedure and formulation of DI-D Monocular SLAM is described in this section. For the sake of simplicity, at each step subscript
2.1 State and system specification
The EKF SLAM methodology requires that data regarding localization and mapping are maintained within the so-called augmented state vector,
The vector
The map to be found and estimated is composed of a set of features,
The obtained values form the following model for feature localization:
In the model represented in (Eq. 7),

Camera, scene point feature and scale initialization reference in DI-D model.
2.1 Scale and System Initialization
The initialization of a metric scale with previous knowledge is analogous to a well-known and previously solved problem in computer vision, the PnP (perspective of n-points) problem, [21]. This problem tries to find the orientation of a camera with respect to an object from a set of points. If the points are coplanar, with four points with spatial coordinates (
At the beginning of the iterative process, the system state
where
where
2.2 State prediction
An unconstrained constant-acceleration camera motion prediction model can be defined by the following equation (Eq. 3), from [24]:
being
This model, (Eq. 11), updates the robotic camera part of the state, while the features are assumed to remain static, thus the complete state prediction model is defined as:
The prediction step is completed by propagating the estimation uncertainty through the covariance matrix, Eq. 13, wher
2.3 Measurement prediction
The different locations of the camera, along with the location of the already mapped features, are used to predict the feature position
A description of this process is found in Algorithm 1, where ▽
2.4 Features Matching Process
When a feature
If the score of a pixel location (
2.5 Batch Validation of Data Association
In SLAM, the injurious effect of incorrect or incompatible matches is well known. In monocular SLAM systems delayed initialization techniques implicitly prune some weak image features prior to their addition to the stochastic map, e.g., image features produced by fast lighting changes, shining on highly reflective surfaces, or even caused by some dynamic elements in the scene. Nevertheless, the risk of incorrect or incompatible matches could remain due to several factors:
Incompatibility due to a repeated design. Fake composite landmark. Incompatibilities produced by reflections on curved surfaces and materials. Detection of landmarks running along edges.
The main contribution of this article is to present a novel validation technique that the authors call the
2.6 Filter Update
With the information obtained from the data association pairs found by the matching process and validated by the HOHCT the filter state and covariance are updated according to Eqs. 17 and 18 respectively:
where the innovation
And the Kalman gain
2.7 Delayed Inverse Depth Initialization of Features
Depth information cannot be obtained in a single measurement when bearing sensors (e.g., a single camera) are used. In this case, in order to incorporate new features into the map, special techniques for feature initialization are needed to enable the use of bearing sensors in SLAM systems.
In this work the Delayed Inverse-Depth (DI-D) Feature Initialization is used to incorporate new features

Diagram of parallax
The initialization process can be summarized in a few steps, with further details in [19]:
When a feature is detected for the first time In subsequent instants If a minimum parallax threshold
3. Highest Order Hypothesis Compatibility Test, HOHCT
The matching methodology described in the previous section uses an active search technique to address the problem of data association. This problem is usually a critical part of any EKF-based SLAM system, as errors could dampen the convergence of the filter. These data association errors may even not be incorrect matching: a moving object can be correctly matched, but can give landmark information which disrupts the map, as this ‘fake’ landmark is not static. Other errors may arise when dealing with ambiguous textures and features on the mapped environment.
It is worth noting that previous works based on DI-D did not feature data association validation, but could achieve good results [19][20] due to the resilience to both ‘fake’ (formed by the superposition of different elements seen from a given perspective) and ‘weak’ (due to texture, lightning or any other factors that are hard to track in consecutive frames) features. Figure 3 and 4 illustrate examples of both a ‘fake’ composite feature and ‘weak’ features, respectively. The fake landmark appears at the intersection of the display edge and the blinds, and as the camera moves, it would slide along the blinds and the display border so as to always be found at the intersection point. The example of weak features (Fig. 4) shows two features (on the desktop PC case) that could be correctly initialized, but later on they proved to be weak due to reflections on the surface.

Fake composite landmark (marked by a star) at the intersection of the display border and the blinds.

Incompatibilities produced by reflections on materials and curved surfaces, leading to weak features.
DI-D SLAM without data validation could reject many of these errors because candidate features must satisfy a series of tests and conditions before being considered as suitable landmarks to be initialized. In the implemented technique, a feature must be tracked correctly within a minimal number of frames (normally 10 frames) and achieve a parallax value greater than a minimum
Thus, accounting for DI-D initialization features, a new technique for batch validation of the data association has been developed, the
So, this ‘compatibility test’ evaluates sets of data parings, known as ‘hypotheses’. Each hypothesis is a subset of the data association pairings obtained from the matching process. To determine if a hypothesis is compatible or not, its innovation Mahalanobis distance is estimated (Eq. 21). This distance is used as a quality metric tested against a threshold given by the Chi-squared distribution:
where
As the proposed monocular SLAM performs data association based on an active search strategy, each mapped landmark is associated to an unique feature on the image, so the mentioned ‘hypothesis’ can be represented as an array of Boolean values the size of the total set of data pairing, N, where each of the pairings is marked as having been considered by the hypothesis (true, 1) or ignored (false, 0). Then, a complete search considering all the possible hypotheses (binary arrays of size N), would be a time consuming effort of exponential cost. Thus, the proposed HOHCT algorithm performs a search ordered in an ascending number of rejected data association pairings.
An example of pair rejection after the HOHCT application can be seen in Fig. 5. Initially an optimistic hypothesis taking all the pairings found during the matching step is tested. If this hypothesis fails to pass the compatibility test, a search for a satisfactory hypothesis is performed. In Fig. 5, the third feature matching is incompatible, thus both the prediction and the match found are rejected and ignored.

Example of a pair elimination and hypothesis.
The proposed search algorithm combines iteration and recursion, as shown in Algorithm 1. The HOHCT will look for a compatible hypothesis, formed as a subset of an initial group of
So, each of the successive searches explores an

Example of search increasing number of pairs rejected
Given the sparse error conditions found in the DI-D initialization SLAM, this ordered search will normally have a linear cost with the number of landmarks matched, with exceptional cases achieving a cubic cost over some frames, still very far from the exponential cost that binary tree recursion could suppose over the whole number of matched landmarks. In this work, the rejected landmarks are eliminated as they are deemed incompatible.
In Fig. 5 the different trees that could be considered are represented. Tree 1 represents the first set of hypotheses that will be used to calculate the distance between the pairs. If any hypothesis is not compatible (if the statistical test fails) then Tree 2 is built and the compatibility is checked again, this process of tree creation is iteratively repeated until a compatible hypothesis is found.
4. Experimental Results and Discussion
In order to test the performance of the proposed method, several video sequences were acquired using a low cost camera, in two different environments. Later, a MATLAB© implementation of the algorithm was executed off-line using those video sequences as input signals.
4.1 Indoor and Outdoor Experiments
In experiments, the following parameter values were used: variances for linear and angular velocity respectively
A Logitech C920 HD camera was used in the experiments. This low cost camera has an USB interface and a wide angle lens. It is capable of acquiring HD colour video. However, in experiments, grey level video sequences, with a resolution of 424 × 240 pixels captured at 15 frames per second, were used. It is important to note that all the sequences of the video were captured at a relatively low frame rate of 15 frames per second (fps). While this frame rate would increase the difficulty of the SLAM process itself, and make it more prone to error, it would also give a bigger window of time in which to process each frame in an implementation aiming for real-time. So, although satisfactory results would be easier to achieve assuming 30 fps streams of image (in the literature, most of the experiments are reported to be captured at least at 25 frames per second), it has been considered a better option to evaluate SLAM results at 15 fps, to eventually allow for an easier implementation into systems with limited power, such as autonomous robots. All the indoor video sequences were captured inside the Vision and Intelligent Systems laboratory at the authors' university. A rail guide was assembled in order to provide an approximate ground truth reference. Every video sequence in this scenario was captured by sliding the camera (manually), at different speeds, over the rail guide. The duration of the different sequences for both scenarios ran from 35 seconds to 1 minute (525 to 900 frames) for the same trajectory.
Figure 7 illustrates the estimated map and trajectory for two different video sequences for the first scenario. The left and right columns show the results, for each sequence, respectively with and without HOHCT validation. As we would expect, the estimations obtained with the HOHCT validation were consistently better. In this case the experimental results show that the HOHCT validation test significantly improves the algorithm robustness, by rejecting harmful matches, clearly noted in the improvement of orientation estimation, as seen in sequence 2. Another observed improvement was the enhanced preservation of the metric scale on estimations. The sequences taken with slower camera movements showed generally better results, although this can be easily attributed to the low frame rate deliberately employed.

Map and trajectory estimation results obtained from two sequences of video: (1) 845 frames and (2) 615 frames. The left column displays results using HOHCT, while the right column displays results for the same sequences without using HOHCT batch validation.
The outdoor sequences were captured with the help of a robotic platform Pioneer 3-AT to provide accurate navigation. This robotic platform repeatedly traversed a known trajectory in a nearby courtyard with columns, benches and multiple reflective surfaces among other elements (Fig. 8). This trajectory constituted an ‘L’ shaped course running for 12m, with a 90° curve. While going along this trajectory, a camera installed on top of the platform captured the sequences, looking sideways. The courtyard allowed us to perform outdoor tests, with an open space and longer trajectories, while keeping disturbances from uncontrolled lightning to a minimum and other difficulties usually associated with this kind of test.

Outdoor environment setup used for tests.
The sequences were taken with the platform moving at different speeds, ranging from 0.25 m/s to 1m/s. Thus, the duration of the sequences went approximately from 20 seconds to 90 seconds (600 to 1300 frames) for different tours in the same trajectory. Figure 9 shows the results of the off-line application of the DI-D SLAM technique, with and without the application of the HOHCT, with the robotic platform moving at 0.65 m/s. The most notable difference with the indoor handheld experiments is the capability to move at greater speeds while maintaining filter convergence in the SLAM process. This was due mainly to two facts: the robotic platform described a much cleaner trajectory, with a constant speed along the straight parts, and smoother turns; and the presence of objects at a wider range, which allowed it to keep a better estimation of the orientation. The impact of the HOHCT can be clearly seen in the different trajectory estimations: while both SLAM with and without HOHCT are able to estimate quite accurately the length of the trajectory, the estimation drifts greatly without data association validation, especially in terms of orientation.

Outdoor experiment results navigating with a Pioneer 3-AT platform at 0.65m/s.
4.2 Discussion
The benefits of the application of the HOHCT validation come together with the addition of the computational cost
where
Table 1 shows the results obtained in the indoor tests. For a sample of 7350 frames (accumulated over different sequences of video) in only 22 (0.3%) of the cases did the computational cost of the test become quadratic. For the indoor sample a case of r = 3 never occurred.
Results of HOHCT failed tests and number of matching deemed incompatible for indoor test.
On the other hand, the outdoor test revealed a greater ratio of pairing rejection. Though being a relatively controlled outdoor space (inside a courtyard), a not much bigger sample (of 12750 frames) produced a greater ratio of rejection of data association pairings, and also produced cases with a cubic cost for the search. Note how the ratio of the rejection of single pairings almost doubles. This means that the cost will grow in less structured and/or less controlled environments.
The cost of the HOHCT can be optimized still when trying to reach real time performance. As the HOHCT already tries to search and test the minimal number of hypotheses, the most time consuming part of the algorithm is during the estimation of Mahalanobis distance. At this step, which is repeated for each hypothesis, a matrix inversion takes place. A strategy to compute the Mahalanobis distances in an incremental way could be developed, allowing for the exploitation of techniques to optimally compute the inversion of iteratively growing matrices, as demonstrated in [26].
Results of HOHCT failed tests and number of matching deemed incompatible for outdoor sample.
5. Conclusions
A method to address the problem of batch validation within the context of the monocular SLAM technique [20] has been presented in this paper, within the context of the DI-D, although the proposed technique, the HOHCT, could in fact be used in any feature based SLAM approach. The considered monocular SLAM technique, the inverse depth delayed initialization, presents a series of particular features and characteristics due to its estimation methodology.
Landmarks are only introduced into the Kalman filter once the available depth estimation is accurate enough and fewer landmarks are introduced than in the undelayed approaches. Despite these particularities, a data association gating technique is needed, mainly when the estimation process fails to converge to a good solution. Thus a batch validation technique based on statistical thresholding is introduced. This batch validation technique, the Highest Order Hypothesis Compatibility Test, has been shown to greatly improve the monocular SLAM results under certain circumstances. These circumstances include the emergence of ‘false’ landmarks, difficult illumination conditions and irregular trajectories with non-smooth changes to linear or angular velocities. Additionally, tests show that the algorithm responsible for finding the best hypothesis will rarely go beyond the quadratic cost with respect to the set of observed features, in fact being more prone to linear cost. This occurs when the batch validation test fails despite having a worst case scenario cost which is almost exponential.
Future works are expected to progress along two main lines of research. First, a method to deal with bigger maps is to be introduced, exploiting the characteristics of the inverse depth delayed initialization as the SLAM technique. A view was presented in [19], though new approaches are being studied in order to determine which would better fit the technique. The second line will work towards the production of a real-time implementation of the techniques proposed. This shall help evaluate the perspectives for the introduction of monocular SLAM as a reliable replacement of high end sensors, such as laser range finders.
