Sage Journals: Discover world-class research

Abstract

In robotic applications of visual simultaneous localization and mapping (SLAM) techniques, loop-closure detection detects whether or not a current location has previously been visited. We present an online and incremental approach to detect loops when images come from an already visited scene and learn new information from the environment. Instead of utilizing a bag-of-words model, the attributed graph model is applied to represent images and measure the similarity between pairs of images in our method. In order to position a camera in visual environments in real-time, the method demands retrieval of images from the database through a clustering tree that we call RSOM (recursive self-organizing feature map). As long as the match is found between the current graph and several graphs in the database, a threshold will be chosen to judge whether loop-closure is accepted or rejected. The results demonstrate the method's accuracy and real-time performance by testing several videos collected from a digital camera fixed on vehicles in indoor and outdoor environments.

Keywords

loop-closure detection simultaneous localization and mapping clustering tree-RSOM attributed graph

1. Introduction

In the past decade, the solution for SLAM has became one of the prevalent issues in the field of robotics research, and is also considered to be a key technology for autonomous navigation of mobile robots [1, 2]. For a robot in an unknown environment, due to the lack of prior knowledge and the uncertainty of the surroundings, it has to recognize whether a current place is being revisited. This process is known as loop-closure detection [3–5]. Considering that visual sensors with lower costs contain rich information on the environment and are suitable for various types of robots for localization and mapping researchers have proposed numerous solutions for visual SLAM [2, 6–8]. However, solutions of positioning using visions suffer from large accumulated errors as the distance increases and often encounter gross errors in dynamic environments. Detecting loop-closure events allows accumulated errors in positioning and mapping to be corrected; hence, accurate long-term positioning is achieved [6].

Recently, a great number of research studies on loop-closure detection have been conducted [9–18]. According to [8], the BoWSLAM can accurately locate itself by using loop-closure detection in suburbs even after it has moved more than 2.5km and the visual dictionary based on bag-of-words is adopted to retrieve matched images. In [9], authors extend a bag-of-words (BoW) method that utilizes image classification to incremental conditions and relies on a Bayesian framework to estimate the loop-closure probability.

Although BoW has been shown to be a useful model for retrieving similar images and measuring the similarity between pairs of images in loop-closure detection, it has some drawbacks. The step of vector quantization, which converts visual features into visual words required by indexing, causes perceptual aliasing [19], i.e., past images that look like the current image but come from the different scene. To solve such a problem, researchers use epipolar geometry as a powerful tool to reject outliers [9]. In [11], researchers have developed an appearance-based system called RTAB-Map. In this system, by using memory management and geometric constraints, the loop will be closed only when the robot is near its previous location. [12] has discussed a loop-closure detection method called BoRF (bag of raw features), which uses scale-invariant visual features directly, rather than their vector-quantized representation or bag-of-words, so will not be affected by the perceptual aliasing. However, the real-time performance of this method cannot meet requirements.

Much research has been conducted on loop-closure detection, and most of these studies are based on the BoW model. However, the BoW modelling method deems image features as a random set without using geometric constraints between features, which is one of the most important clues for image matching. In order to precisely measure the similarity between pairs of images, we applied the attributed graph model [20]. As displayed in Figure 1, attributed graph model contains geometric constraints between features which can represent images more precisely. We also applied clustering tree-RSOM [21] (as developed by our group for more than ten years), which has several advantages: first, by utilizing a voting scheme, its retrieval speed is affected slightly by the enlargement of the database; second, if a RSOM tree has already been trained by a large dataset, we can apply this tree to operate other datasets: in other words, we do not need to train a new RSOM tree when recognizing a new dataset; and third, we can use the RSOM tree to learn and recognize millions of images in a short time. The above characteristics can be referred to in our previous published paper [21]. In addition, we use our original image grouping method, database management method and weighted algorithm to ensure our system's accuracy and efficiency. Generally, we use attributed graph and clustering tree-RSOM in our work instead of BoW, Bayesian framework and visual dictionary methods [8, 9, 11, 16] to develop a loop-closure detection system. These properties are the essential contributions of our work and will be introduced in detail in the following sections.

Based on the attributed graph model and clustering tree-RSOM, we construct a precise and real-time loop-closure detection system referred to as RSOM-SLAM. First of all, we extract SIFT [22] features from a given image I, and represent this image with an attributed graph G. After attributed graph G has been constructed, we group graph G and then obtain K-nearest neighbours of graph G (KNNG) using a RSOM tree. The loop-closure thresholds for matching images in each group are weighted through another RSOM tree so that every group has a corresponding loop-closure threshold. Finally, we judge whether the current scene has been ever visited by comparing images’ similarities with some selected loop-closure thresholds.

Figure 1.

Example of attributed graph, where blue dots are salient representative features and red lines are geometric constraints between features

The rest of this paper is organized as follows: in Section 2, we present two basic technologies for our work. The proposed method is explained in detail in Section 3 and the experiment results are illustrated in Section 4. The last section is devoted to discussion and conclusion.

2. Image Representation and Retrieval

In this section, we will introduce the basic technologies for better understanding our proposed methods.

2.1 Attributed graph model

For an image, instead of taking all SIFT features as a set of unordered elementary features, these SIFT features and their geometric information are selected, which can be denoted as U= {V^t, t = 1,2, …, T) where Vt=((Xt)^T,(R^t) ^T,(U^t)^T)^T. Here, X^t=(x^t₁, x₂^t)^T is the location, R^t=(r^t_→a^t)^T is the gradient direction r t and value vector at and U^t is the descriptor of a SIFT feature. Then, we construct an attributed graph G using the geometric information of these SIFT features to represent this image. Formally, an attributed graph G = (V, E) is a 2-tuple model, where V is the set of vertices, and $E \subseteq V \times U$ is the set of edges [21].

Pairwise graph similarity measurement: we perform pairwise graph measurement (PGM) with the aim of discovering a maximum common subgraph (MCS) between two graphs Gl and G_q, and the result is denoted as MCS(G_l, G). For an image, we find a set of SIFT features that are robustly matched with the SIFT features in similar images using the distance ratio between the nearest and the second-nearest neighbour, and then take them as salient representative features. Moreover, we apply the method proposed in [23] to rank SIFT features in descending order according to a matching frequency. In order to save time, T best ranked SIFT features and their geometric information are selected. In our experiments, T is set to be 30. If the number of features is less than 30 in an image, all of the available SIFT features are selected. Then, we use a RANSAC procedure to align the features and discard those not satisfying the spatial arrangement constraints.

Given MCS (G_l, Gq) obtained by the above PGM procedure, we construct a similarity measurement R(G_l, G_q), which is called PGSM (pairwise graph similarity measurement) between graph Gl and Gq as follows:

$R (G {}_{l}, G_{q}) = ‖ M C S (G {}_{l}, G_{q}) ‖ \times {(\exp (- d (X_{l}, X_{q})))}^{k}$ (1)

$d (X_{l}, X_{q}) = \min_{Q} {‖ X_{l} - X_{q} Q ‖}_{F}$ (2)

where MCS(G_l, G_q) is the cardinality of the MCS of G_l and G_q; the geometric dissimilarity between Xl and X is denoted as −d(X_l, X_q), and k is the number of mismatched feature pairs discarded by RANSAC matching, for amplifying the influence of the geometric dissimilarities between X_l and Xq.

2.2 RSOM tree

RSOM tree is a cluster method allowing our system to retrieve images efficiently. At first, we train a set of labelled samples using a SOM net, so a set of output nodes is obtained. Then, each sample will be distributed into a corresponding node based on the nearest neighbour criterion, and this SOM net is now denoted as the root node of an RSOM tree. When examining samples in each output node by discriminative condition, if a node does not meet the criterion, this node is labelled as a leaf node. Then, the decomposition of this node is stopped. On the other hand, if a node met the criterion, the node would be trained in the same way as the root node, so a new SOM net would be constructed and samples in this node distributed into a corresponding output node. By applying a recursive method to analyse every node based on the discriminative condition, an RSOM tree is constructed successfully when no nodes need further growth. During the growth of the RSOM tree, some control factors are used to ensure the RSOM tree has a good structure and excellent recognizing ability. In this paper, each leaf node contains a set of SIFT features labelled with their image ID. After a RSOM tree is constructed, it can be used to retrieve images with the method of voting and weighting (introduced in later section). The example of the RSOM tree is shown in Figure 2.

Figure 2.

RSOM tree model

3. Loop-closure Detection

Loop-closure detection, a useful method to modify accumulated errors in simultaneous localization and mapping, can be achieved by collecting data association between map and map, image and image, or images and map. Our purpose is to provide a precise and long-term loop-closure detection method based on similarity measurement between images. In this paper, we take loop-closure detection as an image retrieval task. In order to avoid loop closures on locations that have just been visited (the high similarity between current captured image and images captured from the location that have just been visited could easily lead to loop closure), we need to group captured images before learning them into RSOM. At the same time, we use the RSOM tree to retrieve a set of similar graphs of the input graph G_l then learn new images incrementally if a group has been built. A loop-closure threshold is weighted and selected when our system has the most similar graph G_mostSimilar from this set. At last, the decision will be made by comparing the R(G_l,G_mostSimilar) with the selected loop-closure threshold. Figure 3 illustrates the overall procedure of our method.

Figure 3.

Overall processing diagram (see the text for details)

3.1 Grouping images

During the movement, most of the time the last image looks similar to the most recent ones. In [11], authors used a memory management that they called STM to avoid this problem, while epipolar geometry is introduced as a powerful method in [9]. In our system, we adopt a grouping method to prevent loop-closure problems mentioned above.

We define the first captured image and its corresponding attributed graph G in building Group_i as the template of Group_i (i is the index of the Group_i). Every graph in Group_i has high similarity with graph G and represents the same scene. Supposing that the index of the group represented current scene is i; the robot is moving at an even speed of v m/s and images are captured at k frames/s. Based on the assumptions, our grouping strategy is explained as follows:

Step 1: Extract SIFT features from the input image I_l; then construct its attributed graph G_l to represent image I_l.

Step 2: If Gl is the template of Group_i, return to Step 1; Otherwise, calculate the PGSM R(G_m, G_l), where G_m is the template for the rest images in the Group_i.

Step 3: If R(G_m, G_l)<0.70, transfers Gl to a temporary group called TG, return to Step 1; if R(G_m, G_l)≥0.70, put Gl into Group_i and then empty TG_i.

When consecutive Z(x(i)) PGSMs are between captured graphs and historical graphs, $R (G_{m}, G_{l + x (i) + 1}), R (G_{m}, G_{l + x (i) + 2}) \dots R (G_{m}, G_{l + x (i) + Z (x (i))})$ are lower than a specified threshold (e.g., 0.7 in our experiments), indicating that robot has moved into a new scene. Thus, stop putting more graphs into Group_i, meanwhile using G_l_+x+Z(x(i)) as the template for the rest images in a new group Group_i+₁.

The value of Z(x(i)) is based on the number of attributed graphs in current group and it is updated when a new graph is acquired. The more graphs current group has, the higher value Z(x(i)) will be. The definition of Z(x(i)) in our experiments is given as follows:

$Z (x (i)) = {\begin{matrix} 3, x (i) \leq 10 \\ 3 + (x (i) - 5) / 5, 10 < x (i) \leq 20 \\ 9, x (i) > 20 \end{matrix}$ (3)

where i is the index of current group and x(i) is the current number of attributed graphs in Grou p_i.

Step 4: Calculate the number of attributed graphs in Group_i. If x(i)≤2k/v, cancel Group_i for this scene that may contain large dynamic obstacles. Then, return to Step 1; whereas for x(i) >2k / v, our method will switch to the next step.

A large number of attributed graphs in a group indicate that the scene represented by this group remains almost the same for a long distance. On the other hand, with a small number of graphs, the group could only represent a small scene.

Step 5: Transfer graphs in Group_i which are similar to the template of Group_i+1. We adopt a dichotomy to find $G = {G_{s} | R (G_{m'}, G_{s}) < 0.70, R (G_{m'}, G_{s + 1}) \geq 0.70, G_{s}, G_{s + 1} ∊ G r o u p_{i}}$ , where G_m_' is the template of Grou p_i+1; then, transfer every graph $G = {G_{d} | d > s, G_{d} ∊ G r o u p_{i}}$ $T G_{i}$ since the closer a captured image is to the template of Grou p_i+₁, the higher the similarity between them.

Step 6: Calculate the number of attributed graphs in Group_i, if x(i)≤4, all available graphs are selected; if x(i)>4, we rank all graphs G of Group_i in descending order. From the rank list, four graphs G_R1, G_R2, G_R3, G_R₄ in Group_i are selected to represent the scene while others are regarded as redundant graphs. Thus, these graphs would be discarded, where G_R1 is the template of Group_i, G_R2 is located at the 1/3rd in the list, G_R3 is at the 2/3rds in the list, and G_R₄ is at the end of the list.

Step 7: Our system will measure similarities between graphs in TG_i−1 and graphs in Group_i. Consider that too many graphs in a database will lower the operating speed of our system, instead of learning all graphs in TGi ₁, only a subset of TG_i−₁ will be put into Group, denoted by STG_i−₁ as follows:

$S T G_{i - 1} {G_{q}} = {\begin{cases} G_{q} | 0.80 > R (G_{R}, G_{q}) \geq 0.70, \\ G_{q} ∊ T G_{i - 1}, G_{R} ∊ G r o u p_{i}, R = 1, 2, 3, 4 \end{cases}}$ (4)

Then, empty TG_i−₁. This step would minimize information loss in grouping.

Step 8: Group_i has been built up. Now all graphs in Group_i are marked with the index i and learned into the RSOM tree. Then, we assign the number i+1 to the index number for the next group and return to Step 1.

3.2 Retrieval

As we take loop-closure detection as an image retrieval task, the computation time will increase sharply with the enlargement of the database if a sequential search strategy is adopted. Motivated by this, we use a clustering-tree RSOM to retrieve loop-closure candidates. For a captured attributed graph G_l = (V_l, E_l), we put extracted SIFT features U ={V_l^t, t = 1, 2,…,T} into the RSOM tree; then each SIFT feature is distributed into a winner leaf node. In these leaf nodes, there are many SIFT features labelled with image ID which are learned previously. Therefore, we can vote each ID according to its frequency. We define the union of all graphs for the winners as follows:

$U G {G_{l}} = {G_{q} | U_{q}^{j} ∊ G_{q}, U_{q}^{j} ∊ W L {U_{l}^{t}}, U_{l}^{t} ∊ G_{l}}$ (5)

where WL {U_l^t} is the winner of the leaf nodes for descriptor U_l^t. The frequency of graph Gq, denoted as H, represents the number of roughly matched descriptors between G_q and G_l. Upon the frequency, every graph Gq in UG{G_l} is ranked in descending order. Instead of using all graphs in UG{G_l}, the K top ranked graphs are defined as the generalized K-nearest neighbour graphs (KNNG) of graph G_l, denoted as K{G_l} as follows [24].

$K {G_{l}} = {G_{q} | K G_{q} ∊ U G {G_{l}}, F_{q} > F_{q + 1}, q = 1, 2…, k}$ (6)

In general, when an image Il is captured from outside, the system will efficiently construct its attributed graph G_l and then retrieve candidate matching graphs and obtained K{G_l} using the RSOM tree.

3.3 Database management

In dynamic scenes, dynamic obstacles such as cars and pedestrians may change or disappear when a robot revisits the same place. For this sake, RSOM-SLAM needs to learn these new information as well as discard the information that might no longer exist. The purpose of database management is to enhance the recognition and real-time performance of our system.

3.3.1 Incremental learning

The definition of incremental learning is that when new samples are input into the RSOM tree, there is no need for a well-trained RSOM tree to learn all samples again, but only to learn samples with certain new information while the structure of the tree remains highly robust. The incremental learning algorithm is presented in [21].

When a robot revisits a scene, RSOM-SLAM will learn new graphs similar to graphs in the database whilst having certain new information. In our work, we adopt two PGSM thresholds to limit the scale of graphs acquired through incremental learning step. The higher threshold is 0.80 and the lower threshold is equal to the current loop-closure threshold. Based on the setting, we can represent each scene more precisely as well as keeping the scale of the database from being too large.

3.3.2 Simplification of groups

During the movement, groups of graphs are accumulated, which leads to the enlargement of the database and lowering of operating speed. To enhance the real-time performance of RSOM-SLAM, we need to discard invalid groups. The criterion for invalid groups is: the group is defined as an invalid group when it is not detected by RSOM-SLAM even after its geometric neighbours (see in 3.4.2) have been visited K times (in our experiments, K = 10).

3.4 Weighting threshold

In visual SLAM's positioning, perceptual aliasing is one of the major problems to be solved. When a robot operates in an area with many highly similar scenes (such as same chairs, tables indoors, and buildings, streets outdoors etc.), it is easy for the robot to close incorrect loops. Reference [9, 11] has proposed some methods that weighted locations or visual words to minimize this problem. We present a time-saving weighting method that weighs two components of the loop-closure threshold of each scene instead of weighing scenes directly.

3.4.1 Similar groups

During the movement, the more similar scenes a scene has, the easier a misjudgement can be made. Therefore, we would need a higher detection threshold for scenes of this kind to prevent incorrect loop-closure from being detected.

In this step we use another RSOM tree, and denote this RSOM tree trained by graph templates from every group as the RTGW (RSOM tree for groups weighting). When Group_i begins its building, we input the template graph G_m of Group_i into RTGW and obtain K {G_m}. We define groups for which the template matches the condition:

$G = {G_{c} | G_{c} ∊ K^{'} {G_{m}}, R (G_{m}, G_{c}) > 0.50, c = 1, 2… n}$ (7)

as the similar groups of Group_i, then vote these groups once and vote Group_in times.

After voting, a component of the loop-closure threshold denoted as GSC (group similarity component) has been weighted. Moreover, weighting values of GSC for each group are updated in real time. The weighting function for GSC is defined as follows (see in Figure 4):

$W (i, j) = \log (1 + k (i, j)) / \log (1 + m (j)), (i \leq j)$ (8)

where i is the index of group, j is the total number of groups, k(i, j) is the number of Grou p_i ‘s SG when grouping of Group_j has been achieved, and m(j) is the number of total groups when Group_j has been built up.

3.4.2 Geometric neighbours

In areas where landmarks look similar or even the same, it is easy for a robot to make a wrong judgment if we only observe these landmarks separately. To position more precisely, some researchers have proposed loop-closure detection methods in which matches are found between current observations and a limited region of the map, but such a method cannot be guaranteed in a real-world situation [11]. Thus, without using geometric constraint information between images and the global map, we utilize geometric constraint information of each scene by weighting the GGC (group geometric component) of the loop-closure threshold.

Figure 4.

As we can see, the weighting value of GSC is increasing with k(i, j) / m(j) increasing, which suggests that the larger the proportion of Group_i ‘s SG in the total groups, the higher the weighting value for this group. On the contrary, when the proportion of a group's SG in the total groups is low, the weighting value for Group_i becomes relatively low.

Since the trajectory of robots’ movement is uninterrupted, we define time-adjacent groups to Group_i as the geometric neighbour of Group_i. However, if a group is the geometric neighbour of the geometric neighbour of Group_i and not adjacent to Group_i directly, we define the group as an indirect neighbour of Group_i. Here, we take Group_j as the geometric neighbour of Group_i under two conditions. First, Group_j is being built or visited as soon as Group_i has been already built up or visited; second, Group_i is being built or visited as soon as Group_j has been already built up or visited (an example is shown in Figure 5).

Figure 5.

Example of geometric neighbour

A robot has more opportunities to move into scenes that near current locations, and as a result the loop-closure threshold should be decreased in these scenes. Motivated by this, the weighting rule of GGC is introduced as follows: if the current location of a robot is the neighbour of Group_i, N(i) = 0, or if the current location of robot is the indirect neighbour of Group_i, N(i) = 0.5. In other cases, N (i) = 1.0, where i (i = 0, 1, 2… j) is the group index, j is the number of total groups, and N (i) is the GGC weighting value of Group_i. With this weighting approach, RSOM-SLAM can evaluate loop-closure probabilities with the geometric constraints of the trajectory without using the global map.

In general, by these time-saving weighting steps we can approach a higher loop-closure threshold to detect scenes that are difficult to identify to minimize the perceptual aliasing problem. During the movement of RSOM-SLAM, both of these weighting values are updated in real time.

3.5 Detecting loop closures

The rationale behind the loop-closure judgment is to combine the results of the KNNG of a captured image with the results of weighted threshold to evaluate the probabilities as to whether a current location is being revisited. For the KNNG of a captured image $K {G_{l}} = {G_{S} | S = 1, 2 \dots, k}$ , we calculate all PGSMs $R (G_{l}, G_{S}) = {R (G_{l}, G_{S}) | G_{S} ∊ K^{'} {G_{l}}}$ ; then select G_h which is the most similar graph to graph G_l and its corresponding group index H. We obtain weightings of Group_H and calculate the loop-closure threshold of Group_H. If R(G_l, G_h) is above the loop-closure threshold of Group_H, the loop will be closed, and this also indicates that the current location is in the scene represented by G_h. If R(G_l,G_h) is above the loop-closure threshold of Group_H while lower than 0.80, G_l would be learned into the RSOM tree. If R(G_l,G_h) is lower than the loop-closure threshold of Group_H, reject this loop-closure. The loop-closure threshold for Group_i is defined as follows:

$T_{l o o p - c l o s u r e} (i, j) = T_{b a s i c T h r e s h o l d} + w_{G S C} W (i, j) + w_{G G C} N (i)$ (9)

where i is the index of group, j is the number of total groups, W (i, j) is the weighting of GSC of Group_i and N(i) is the weighting of GGC of Group_i. In our model's setting, we set T_{basicThreshold} = 0.7, w_GSC = 0.2, w_GGC = 0.1.

4. Experimental Results

In this section, we will evaluate the performance of our method by precision-recall metrics. Precision is the ratio of true-positive loop-closure detections to the total number of detections. Recall is defined as the ratio of true-positive loop-closure detections to the number of ground-truth loop closures [27].

In our indoor experiments, test videos were captured with a Canon Digital camera fixed on the experimental vehicle at an even speed of 0.5m/s, and the resolution of the videos is 640×480 pixels. In our outdoor experiment, one test video was captured from a Canon Digital camera fixed on an electro motorcycle moving at an even speed of 5m/s. The video is at a resolution of 640×480 pixels. We also test a video from [8] which is captured outdoors.

As long as videos are captured, we use a computer of Intel(R) Core i5-2320 CPU@3.00GHz, 500G hard drive and 4G RAM as a server to test our method. During experiments, the resolution of each image will be converted to 160×120 pixels online, the time consumption of calculating PGSM between each pair of images is an average 17ms by testing 500 couples of images at a resolution of 160×120 pixels.

4.1 Indoor experiments

4.1.1 Precision-recall performance

We conducted two indoor experiments at region C and region A in the Chemical Building of Hunan University. Region C is a circular corridor area with distinguishable scenes. In this experiment, a video of five consecutive circles consisting of 1880 frames was collected.

The experimental result in region C, which contains PGSMs from measuring each captured graph against historical graphs and loop-closure threshold, is represented in Figure 7, and some accepted loop closures are shown in Figure 6. In the first circle, although some PGSM values between some captured images and historical images were high (high similarities between captured images and images in database), no wrong loop-closure was detected. We can see from Figure 7 that the first loop-closure detection occurs when RSOM-SLAM is coming back to its starting position, at the beginning of the second circle. When RSOM-SLAM visited region C for the fifth time, only three correct loop-closures were rejected.

Figure 6.

Examples of positive loop-closure detection for the indoor image sequence. Where (a), (b), (c) are scenes of Region C and (d), (e), (f) are scenes of Region A. (g) is the path trajectory of Region C.

Figure 7.

Five circles’ recall performance of indoor experiments, where (a) is the result tested in Region C and (b) is the result tested in Region A. The blue dots are the PGSMs between each frame and historical images. Loop-closure threshold is shown with an irregular red line. There is no loop-closure detected in the first round according to there being no dot over the line.

The reasons that the recall performance of our first indoor experiment did not reach 100% are as follows: first, the interferences such as dynamic obstacles and camera wobbling caused failure to detect loop-closure at the 524th frame, 863rd frame, 1042nd frame; secondly, the 28th-36th frame, 93rd-94th frame, 99th-100th frame, 249th-251st frame were missed in inputting into the RSOM tree and database from the grouping procedure, resulting in RSOM-SLAM failure to close loops at the 353rd-361st frame, 424th-433rd frame and 578th-585th frame. Despite suffering from these problems, the recall performance of the method is over 98% at 100% precision in the fourth and fifth circle.

The second indoor experiment is conducted at region A. In this experiment, a video of five consecutive circles consisting of 2498 frames was collected. The region A is a circular corridor area as well. However, unlike region C, region A is a corridor with many highly similar scenes, which means this area is under high perceptual aliasing. For this reason, the recall performance of this experiment is worse than the first indoor experiment. The reasons as to why the recall performance in region A did not reach 100% are the same as in region C, mentioned above. The results of the experiment are represented in Figure 7 and the recall performances of two indoor experiments are summarized in Table 2.

4.1.2 Real-time performance

To test real-time performance of our method in an indoor environment, we collected a video of 20 cycles from the region A. According to Figure 10 (b), we can observe that the average time consumption of operating a frame was not increasing significantly with the growth of moving distance, and stayed at around at 120ms per frame. Therefore, we can conclude that RSOM-SLAM can operate in real-time indoors where there are only a small number of dynamic obstacles.

Table 1.
The number of groups and average loop-closure threshold in indoor experiments

Cycles	Experiment(distance)
	Region C (5×90m)		Region A (5×120m)
	Groups	Average threshold	Groups	Average threshold
1	27	80.13%	24	80.37%
2	30	72.14%	27	73.04%
3	30	70.38%	28	72.18%
4	30	70.15%	28	71.83%
5	30	70.24%	28	72.10%

During the movement, RSOM-SLAM has built up several groups to represent different scenes with selected loop-closure threshold in real time. By comparing the results of two different indoor experiments in Table 1, we can observe that RSOM-SLAM has built more groups, having a lower average loop-closure threshold and detected more precisely in a differentiated environment than in an environment with many similar scenes in unit distances.

4.2 Outdoor Experiments

4.2.1 Precision-recall performance

By using the suburb's dataset in [8], we conducted the first outdoor experiment. The video consisting of two consecutive circles outdoors has 2345 frames. In this environment, pedestrians and cars imposed a negative effect on our method's recall performance, which is 80% at 98% precision. In the second outdoor experiment, we collected a video of 15 cycles of 110815 frames in a residential area with a circle length of around 1.8km. This environment contained far more pedestrians and cars than that of the environment of the first outdoor experiment, which made RSOM-SLAM difficult to detect loops. However, the recall performance of our method still reached 74% at 93% precision. Some positive loop-closure detections and path trajectories are shown in Figure 8. The recall performance of this experiment is detailed in Table 2.

Table 2.
Recall performance of experiments

Circles	Experiment
Circles	Region C	Region A	Residential area
1	0.00%	0.00%	0.00%
2	90.42%	88.98%	74.28%
3	96.71%	95.60%	79.65%
4	99.14%	98.98%	85.51%
5	99.27%	99.34%	88.29%

Figure 8.

Examples of positive loop-closure detection for the outdoor image sequence. Where (a), (b), (c) are scenes of the residential area and (d), (e), (f) are scenes of the dataset from [8]. (g) is the path trajectory of the residential area.

Figure 9 illustrates the grouping and scene representation results of our second outdoor experiment. As we can see in this figure, the number of graphs in these groups may be changed through learning TG and incremental learning.

4.2.2 Real-time performance

To test the real-time performance of our method in a large and dynamic environment, we use the video from the second outdoor experiment. In addition, there are a large number of dynamic obstacles in this video in which the recorded distance was more than 27km.

In order to operate in real time, our system is designed as two threads – thread grouping and loop-closure detection and thread incremental learning. As these two threads are working independently, when a captured image needs to be learned it will be put into a vector, which is used to store the graphs needed to be learned into RSOM. According to Figure 10, the time consumption of incremental learning for each graph is around 80 to 160 milliseconds. These results demonstrate that the time consumption of incremental learning does not tend to increase with the scale of database growing. Unfortunately, in this environment, the operating time of our system still increased with the moving distance growth. When the result K {G_l} is obtained (ID of K nearest neighbour graphs), RSOM-SLAM will retrieve these graphs from the hard drive by this ID linearity. Thus, the time consumption of operating an image will increase with the enlargement of the database. The reasons for the database growth are as follows: first, the number of newly built groups is more than the number of groups discarded (RSOM-SLAM may build some groups just for representing large dynamic obstacles); second, when the vehicle revisited the area, the RSOM tree has learned some new graphs through incremental learning. After capturing 73159 images and learning more than 12806 graphs into a RSOM tree, which has 65536 leaf nodes been trained, the time consumption of retrieving and matching five nearest neighbour graphs from the database has become more than 500ms; the results are shown in Figure 10.

Figure 9.

Examples of grouping and scene representation, where the first graph in each group is the template graph

5. Discussion and Conclusion

Based on similarity measurement between image and image, this paper has presented an incremental vision-based method allowing loop-closure detection in real time. The attributed graph model introduced in [20] provides a precise way to manage each image's representation and to measure the similarity between a pair of images. By weighting components of the loop-closure threshold, RSOM-SLAM can make use of the continuity of robots’ travelling trajectory and similarity relationship among scenes to enhance recognize ability. We conducted tests with two datasets from [9] to compare the performance between our system and IAB-MAP. We also tested BoWSLAM from http://hilandtom.com/tombotterill/code/index.php with our datasets to compare the performance between our system and this approach. According to Figure 11, we conclude that RSOM-SLAM can achieve better recall performances both indoors and outdoors. In the indoor environment, RSOM-SLAM can achieve recall performances at around 90% at 100% precision.

Figure 10.

The time consumption of our method

In the outdoor environment, although a large number of pedestrians and cars made RSOM-SLAM difficult to position, recall performances over 74% at 93% precision were still achieved by our method when a robot revisited the circle for the first time. According to Table 2, we can infer that when a robot revisited a scene additional times, the recall performance would be improved, even reaching near to 100% in the indoor experiments.

The clustering tree-RSOM introduced in [21] offers a fast solution to obtain the KNNG of given images from the database. Through grouping steps, instead of using all captured graphs to represent a scene, only a small part of the graphs were selected to represent scenes, which enhances the real-time performance of RSOM-SLAM. The results of experiments in region A presented in Figure 10 suggest that the time consumption of our system to operate an image does not tend to increase in the long term. By analysing the results, our conclusion for these results is: after RSOM-SLAM visits an small area with no or only a few dynamic obstacles several times, there is no new information that the RSOM-SLAM needs to learn; hence, the scale of the database is constant. Therefore, the time consumption is only from two procedures: pairwise graph similarity measurement, which costs only 17ms on average, and obtaining KNNG. Therefore, we can deduce from the results that RSOM-SLAM can operate for a long distance in this kind of environment. In the residential area, RSOM-SLAM could operate in real time for more than 73000 frames with a distance of more than 18km (1.8km/circle) by discarding invalid groups. Otherwise, our system could only operate in real time for about 46300 frames without discarding invalid groups. With the enlargement of the database, time consumption of some frames was higher than 500ms after the 73000th frame.

Figure 11.

(a) and (b) are the examples of positive loop-closure detection from the test on L6I by RSOM-SLAM while (c) is the recall performance. (d) is the comparison result of recall performance between RSOM-SLAM and other methods.

In summary, as the results of our experiments, RSOM-SLAM has good recall performance both in indoor and outdoor environments. Moreover, it can operate long distance in real time even in a large and dynamic environment. Future tasks are expected to consist of two parts: first, improvement of the real-time performance of our system based on cluster-computer technology and RSOM dictionary; and second, constructing the whole SLAM system based on some data association methods and the achievement that RSOM-SLAM represents each scene by using groups of graphs marked with the corresponding group index.

References

Csorba

(1998). Simultaneous localisation and map building. Unpublished, University of Oxford.

Dissanayake

Newman

Clark

Durrant-Whyte

Csorba

(2001). A solution to the simultaneous localization and map building (SLAM) problem. Robotics and Automation, IEEE Transactions on. v. 17(3), pp. 229–241.

Folkesson

Christensen

(2007). Closing the loop with graphical SLAM. Robotics, IEEE Transactions on. v. 23(4), pp. 731–741.

Callmer

Granstr

M K

Nieto

Ramos

(2008). Tree of words for visual loop closure detection in urban SLAM. Proceedings of the 2008 Australasian Conference on Robotics and Automation. pp. 8.

Cummins

Newman

(2007). Probabilistic appearance based navigation and loop closing. Robotics and Automation, 2007 IEEE International Conference. pp. 2042–2048.

Davison

(2003). Real-time simultaneous localisation and mapping with a single camera. Computer Vision, 2003. Proceedings. Ninth IEEE International Conference. pp. 1403–1410.

Lemaire

Berger

Jung

I-K

Lacroix

(2007). Vision-based slam: Stereo and monocular approaches. International Journal of Computer Vision. v. 74(3), pp. 343–364.

Botterill

Mills

Green

(2011). Bag-of-Words-Driven, Single-Camera Simultaneous Localization and Mapping. Journal of Field Robotics. v. 28(2), pp. 204–226.

Angeli

Filliat

Doncieux

Meyer

J-A

(2008). Fast and incremental method for loop-closure detection using bags of visual words. Robotics, IEEE Transactions on. v. 24(5), pp. 1027–1037.

10.

Newman

(2007). Detecting loop closure with scene sequences. International Journal of Computer Vision. v. 74(3), pp. 261–286.

11.

Labbe

Michaud

(2013). Appearance-based loop closure detection for online large-scale and long-term operation. Robotics, IEEE Transactions on. v. 29(3), pp. 734–745.

12.

Zhang

(2011). BoRF: Loop-closure detection with scale invariant visual features. Robotics and Automation (ICRA), 2011 IEEE International Conference. pp. 3125–3130.

13.

Galvez-Lopez

Tardos

(2011). Real-time loop detection with bags of binary words. Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference. pp. 51–58.

14.

Labbe

and Michaud

F. O.

(2011) Memory management for real-time appearance-based loop closure detection. in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on: IEEE. pp. 1271–1276.

15.

Sariyanidi

Sencan

Temeltas

(2012). An image-to-image loop-closure detection method based on unsupervised landmark extraction. in Intelligent Vehicles Symposium (IV), 2012 IEEE. pp. 420–425.

16.

Zhiwei

Xiang

Yanyan

Songhao

(2013). A novel loop closure detection method in monocular SLAM. Intelligent Service Robotics. v. 6(2), pp. 79–87.

17.

Liu

Zhang

(2012). Visual loop closure detection with a compact image descriptor. Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference. pp. 1051–1056.

18.

Vorst

Yang

Zell

(2009). Loop closure and trajectory estimation with long-range passive RFID in densely tagged environments. Advanced Robotics, 2009. ICAR 2009. International Conference. pp. 1–6.

19.

Newman

Cole

(2006). Outdoor SLAM using visual appearance and laser ranging. in Robotics and Automation, 2006. ICRA 2006. Proceedings 2006 IEEE International Conference. pp. 1180–1187.

20.

Chung

F. R. K.

and Theory

C. C. o. R. A. i. S. G.

(1997) Spectral graph theory. Regional Conference.

21.

Xia

S-P

Liu

J-j

Yuan

Zhang

(2007). Cluster-computer based incremental and distributed rsom data-clustering. ACTA Electronica sinica. v. 35(3), pp. 385.

22.

Lowe

(2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. v. 60(2), pp. 91–110.

23.

Xia

Ren

Hancock

(2008). Ranking the local invariant features for the robust visual saliencies. in Pattern Recognition, 2008. ICPR 2008. 19th International Conference. pp. 1–4.

24.

Xia

Hancock

. (2009) Pairwise similarity propagation based graph clustering for scalable object indexing and retrieval. Graph-Based Representations in Pattern Recognition. Springer. pp. 184–194.

25.

Fei-Fei

Perona

(2005). A Bayesian hierarchical model for learning natural scene categories. Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference. pp. 524–531.

26.

Sivic

Russell

Efros

Zisserman

Freeman

(2005). Discovering objects and their location in images. Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference. pp. 370–377.

27.

Cummins

and Newman

(2009) Highly scalable appearance-only SLAM-FAB-MAP 2.0. in Robotics: Science and Systems: Seattle, USA. pp. 12–18.

A Precise and Real-Time Loop-closure Detection for SLAM Using the RSOM Tree

Abstract

Keywords

1. Introduction

2.1 Attributed graph model

3.3.1 Incremental learning

3.3.2 Simplification of groups

3.4 Weighting threshold

3.4.1 Similar groups

4.1 Indoor experiments

4.1.1 Precision-recall performance

Table 1. The number of groups and average loop-closure threshold in indoor experiments

4.2.1 Precision-recall performance

Table 2. Recall performance of experiments

References

Table 1.
The number of groups and average loop-closure threshold in indoor experiments

Table 2.
Recall performance of experiments