Abstract
Keywords
Introduction
In the past decade, solving SLAM has become one of the prevalent issues in robotics research and is also considered a key technology for the autonomous navigation of mobile robots [1, 2]. A robot in an unknown environment, lacking prior knowledge and facing the uncertainty of its surroundings, has to recognize whether a current place is being revisited. This process is known as loop-closure detection [3–5]. Because low-cost visual sensors capture rich information about the environment and are suitable for many types of robots, researchers have proposed numerous solutions for visual SLAM in localization and mapping [2, 6–8]. However, vision-based positioning suffers from large accumulated errors as the travelled distance increases and often encounters gross errors in dynamic environments. Detecting loop-closure events allows accumulated errors in positioning and mapping to be corrected; hence, accurate long-term positioning is achieved [6].
Recently, a great number of studies on loop-closure detection have been conducted [9–18]. According to [8], BoWSLAM can locate itself accurately by using loop-closure detection in suburban areas even after moving more than 2.5 km, adopting a bag-of-words visual dictionary to retrieve matched images. In [9], the authors extend a bag-of-words (BoW) method that utilizes image classification to incremental conditions and rely on a Bayesian framework to estimate the loop-closure probability.
Although BoW has been shown to be a useful model for retrieving similar images and measuring the similarity between pairs of images in loop-closure detection, it has some drawbacks. The vector-quantization step, which converts visual features into the visual words required for indexing, causes perceptual aliasing [19].
Much research has been conducted on loop-closure detection, and most of it is based on the BoW model. However, BoW treats image features as an unordered set and ignores the geometric constraints between features, which are among the most important clues for image matching. To measure the similarity between pairs of images precisely, we applied the attributed graph model [20]. As displayed in Figure 1, the attributed graph model encodes the geometric constraints between features and can therefore represent images more precisely. We also applied the RSOM clustering tree [21] (developed by our group over more than ten years), which has several advantages: first, thanks to its voting scheme, its retrieval speed is affected only slightly by the enlargement of the database; second, once an RSOM tree has been trained on a large dataset, it can be applied to other datasets, so there is no need to train a new RSOM tree for every new dataset; and third, the RSOM tree can learn and recognize millions of images in a short time. These characteristics are detailed in our previously published paper [21]. In addition, we use our original image grouping method, database management method and weighting algorithm to ensure the accuracy and efficiency of our system. In general, we use the attributed graph and the RSOM clustering tree instead of the BoW, Bayesian-framework and visual-dictionary methods [8, 9, 11, 16] to develop a loop-closure detection system. These properties are the essential contributions of our work and will be introduced in detail in the following sections.
Based on the attributed graph model and the RSOM clustering tree, we construct a precise, real-time loop-closure detection system referred to as RSOM-SLAM. First, we extract SIFT [22] features from a given image I and represent the image with an attributed graph G. After graph G has been constructed, we assign it to a group and obtain its K-nearest neighbours (KNNG) using an RSOM tree. The loop-closure thresholds for matching images in each group are weighted through another RSOM tree, so every group has a corresponding loop-closure threshold. Finally, we judge whether the current scene has been visited before by comparing image similarities with the selected loop-closure thresholds.

Example of attributed graph, where blue dots are salient representative features and red lines are geometric constraints between features
The rest of this paper is organized as follows: in Section 2, we present two basic technologies underlying our work. The proposed method is explained in detail in Section 3, and the experimental results are illustrated in Section 4. The last section is devoted to discussion and conclusion.
In this section, we will introduce the basic technologies for better understanding our proposed methods.
Attributed graph model
For an image, instead of taking all SIFT features as a set of unordered elementary features, these SIFT features and their geometric information are selected, which can be denoted as
Pairwise graph similarity measurement: we perform pairwise graph measurement (PGM) with the aim of discovering a maximum common subgraph (MCS) between two graphs
Given
where
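The exact PGSM formulation appears in the equations above; as an illustration only, the idea of scoring two attributed graphs by an approximate common subgraph can be sketched as follows. This is a hedged toy version, not the paper's algorithm: the function name `pgsm`, the tolerances `attr_tol` and `geo_tol`, and the greedy matching strategy are all our own simplifications.

```python
# Toy sketch of a pairwise graph similarity measurement (PGSM):
# match node attributes greedily, then keep only matches whose pairwise
# distances are mutually consistent (an approximate common subgraph),
# and normalize by the smaller graph size.
import math

def pgsm(graph_a, graph_b, attr_tol=0.2, geo_tol=0.15):
    """graph_* : list of (attribute_vector, (x, y)) tuples."""
    # 1. candidate matches by attribute distance
    cands = []
    for i, (fa, _) in enumerate(graph_a):
        best, best_d = None, attr_tol
        for j, (fb, _) in enumerate(graph_b):
            d = math.dist(fa, fb)
            if d < best_d:
                best, best_d = j, d
        if best is not None:
            cands.append((i, best))
    # 2. keep only matches consistent with the geometric constraints
    kept = []
    for (i, j) in cands:
        ok = True
        for (i2, j2) in kept:
            da = math.dist(graph_a[i][1], graph_a[i2][1])
            db = math.dist(graph_b[j][1], graph_b[j2][1])
            if abs(da - db) > geo_tol * max(da, db, 1e-9):
                ok = False
                break
        if ok:
            kept.append((i, j))
    # 3. similarity = size of the common subgraph / smaller graph size
    return len(kept) / min(len(graph_a), len(graph_b))
```

Identical graphs score 1.0 and graphs with no attribute matches score 0.0, mirroring the role PGSM plays in the loop-closure thresholds used later.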
The RSOM tree is a clustering method that allows our system to retrieve images efficiently. First, a set of labelled samples is trained with a SOM net, yielding a set of output nodes; each sample is then distributed to its corresponding node by the nearest-neighbour criterion, and this SOM net becomes the root node of the RSOM tree. The samples in each output node are then examined against a discriminative condition. If a node does not meet the criterion, it is labelled a leaf node and its decomposition stops. If a node meets the criterion, it is trained in the same way as the root node: a new SOM net is constructed and the samples in the node are distributed to its output nodes. By applying this method recursively to every node, the RSOM tree is complete once no node needs further growth. During the growth of the tree, several control factors are used to ensure that the RSOM tree has a good structure and excellent recognition ability. In this paper, each leaf node contains a set of SIFT features labelled with their image IDs. Once an RSOM tree is constructed, it can be used to retrieve images by voting and weighting (introduced in a later section). An example of an RSOM tree is shown in Figure 2.
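The recursive growth described above can be sketched in miniature. Note the simplifications: a real RSOM node trains a SOM net, whereas this sketch simply picks evenly spaced samples as stand-in centroids, and the discriminative condition is reduced to a sample-count `capacity`; the names `RSOMNode`, `capacity` and `branches` are ours, not from [21].

```python
# Simplified sketch of recursive clustering in the spirit of an RSOM tree:
# each internal node partitions its samples around a few centroids (standing
# in for a trained SOM net); a node stops growing once it satisfies the
# discriminative condition (here: few enough samples).
import math

class RSOMNode:
    def __init__(self, samples, capacity=4, branches=2):
        self.samples = samples            # list of (feature_vector, image_id)
        self.centroids = []
        self.children = []
        if len(samples) > capacity:       # node still needs further growth
            step = max(1, len(samples) // branches)
            self.centroids = [samples[i][0]
                              for i in range(0, len(samples), step)][:branches]
            buckets = [[] for _ in self.centroids]
            for vec, img in samples:      # nearest-neighbour distribution
                k = min(range(len(self.centroids)),
                        key=lambda c: math.dist(vec, self.centroids[c]))
                buckets[k].append((vec, img))
            if max(len(b) for b in buckets) < len(samples):
                self.children = [RSOMNode(b, capacity, branches) for b in buckets]
            else:                         # indistinguishable samples: stop growing
                self.centroids = []

    def leaf_for(self, vec):
        """Descend to the leaf whose centroid path is nearest to vec."""
        if not self.children:
            return self
        k = min(range(len(self.centroids)),
                key=lambda c: math.dist(vec, self.centroids[c]))
        return self.children[k].leaf_for(vec)
```

Each leaf ends up holding a small set of features labelled with image IDs, which is exactly the structure the voting step below relies on.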

RSOM tree model
Loop-closure detection, a useful method for correcting accumulated errors in simultaneous localization and mapping, can be achieved by collecting map-to-map, image-to-image or image-to-map data associations. Our purpose is to provide a precise, long-term loop-closure detection method based on similarity measurement between images; in this paper, we treat loop-closure detection as an image-retrieval task. To avoid closing loops on locations that have only just been visited (the high similarity between the current image and images captured at such locations could easily trigger a loop closure), we group captured images before learning them into the RSOM tree. At the same time, we use the RSOM tree to retrieve a set of graphs similar to the input graph

Overall processing diagram (see the text for details)
During the movement, the current image usually looks similar to the most recently captured ones. In [11], the authors used a memory-management scheme that they called STM to avoid this problem, while epipolar geometry is introduced as a powerful method in [9]. In our system, we adopt a grouping method to prevent the loop-closure problems mentioned above.
We define the first captured image and its corresponding attributed graph
Step 1: Extract SIFT features from the input image
Step 2: If
Step 3: If
When consecutive
The value of
where
Step 4: Calculate the number of attributed graphs in
A large number of attributed graphs in a group indicates that the scene represented by this group remains almost the same over a long distance. Conversely, a group with only a few graphs can represent only a small scene.
Step 5: Transfer graphs in
Step 6: Calculate the number of attributed graphs in
Step 7: Our system will measure similarities between graphs in
Then, empty
Step 8:
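Several of the numbered steps above are truncated in this copy, so only the overall idea is reproduced here: consecutive graphs that remain similar to a group's template graph are collected into one group, and a drop in similarity starts a new group. The following is a loose sketch under that assumption; the function name `group_sequence`, the threshold `group_thresh`, and the use of the first graph as template are illustrative choices, not the paper's exact procedure.

```python
# Hypothetical sketch of the grouping procedure: consecutive graphs that stay
# similar to the group's template graph (here, its first graph) belong to one
# group; when similarity drops below a grouping threshold, the group is
# closed and a new one is started.  sim() stands for a PGSM-style measure.
def group_sequence(graphs, sim, group_thresh=0.5):
    groups = []
    current = []
    for g in graphs:
        if not current:
            current = [g]               # first graph becomes the template
        elif sim(current[0], g) >= group_thresh:
            current.append(g)           # scene continues: same group
        else:
            groups.append(current)      # scene changed: close the group
            current = [g]
    if current:
        groups.append(current)
    return groups
```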
As we treat loop-closure detection as an image-retrieval task, the computation time would increase sharply with the enlargement of the database if a sequential search strategy were adopted. Motivated by this, we use the RSOM clustering tree to retrieve loop-closure candidates. For a captured attributed graph
where
In general, when an image
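The voting-based retrieval can be sketched as follows: each feature of the query image is routed to a leaf of the RSOM tree, every image ID stored in that leaf receives one vote, and the K most-voted images become the KNNG candidates. This is a minimal stand-in, not the paper's exact weighting scheme; `knng` and `leaf_for` (assumed here to map a feature to an iterable of image IDs) are hypothetical names.

```python
# Minimal sketch of voting-based retrieval of the K-nearest-neighbour
# graphs (KNNG): one vote per (query feature, stored image ID) pair.
from collections import Counter

def knng(query_features, leaf_for, k=5):
    votes = Counter()
    for feat in query_features:
        for image_id in leaf_for(feat):   # image IDs stored in the leaf
            votes[image_id] += 1
    # the k images with the most votes are the loop-closure candidates
    return [image_id for image_id, _ in votes.most_common(k)]
```

Because votes are accumulated per image rather than per feature comparison, the cost of this step grows with the depth of the tree rather than with the size of the whole database, which is the property claimed for the RSOM tree.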
In dynamic scenes, dynamic obstacles such as cars and pedestrians may change or disappear by the time a robot revisits the same place. For this reason, RSOM-SLAM needs to learn this new information as well as discard information that may no longer be valid. The purpose of database management is to enhance the recognition and real-time performance of our system.
Incremental learning
Incremental learning means that when new samples are input into a well-trained RSOM tree, the tree does not need to relearn all samples; it learns only the samples carrying new information, while its structure remains highly robust. The incremental learning algorithm is presented in [21].
When a robot revisits a scene, RSOM-SLAM learns new graphs that are similar to graphs in the database yet carry certain new information. In our work, we adopt two PGSM thresholds to limit the number of graphs acquired through the incremental learning step: the higher threshold is 0.80 and the lower threshold is equal to the current loop-closure threshold. With this setting, we can represent each scene more precisely while keeping the database from growing too large.
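The two-threshold rule above reduces to a simple predicate: a graph is learned only if its best PGSM against the database lies between the current loop-closure threshold and the fixed upper bound of 0.80. The function name `should_learn` is ours; the thresholds are taken from the text.

```python
# Two-threshold rule for incremental learning: learn a graph only if it is
# similar enough to describe an already-stored scene (at or above the current
# loop-closure threshold) yet not so similar that it adds no new information
# (below the fixed upper threshold of 0.80).
def should_learn(best_pgsm, loop_threshold, upper=0.80):
    return loop_threshold <= best_pgsm < upper
```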
Simplification of groups
During the movement, groups of graphs accumulate, which enlarges the database and lowers the operating speed. To enhance the real-time performance of RSOM-SLAM, we need to discard invalid groups. A group is defined as invalid when it is not detected by RSOM-SLAM even after its geometric neighbours (see Section 3.4.2) have been visited K times (in our experiments,
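The invalid-group criterion can be sketched as a pruning pass over the stored groups. This is a hedged illustration: `prune_invalid`, `neighbour_visits` and `detected` are hypothetical names, and the value of K is left as a parameter since it is not legible in this copy.

```python
# Sketch of the group-simplification rule: a group is discarded as invalid
# once its geometric neighbours have been visited K times without the group
# itself ever having been detected.
def prune_invalid(groups, neighbour_visits, detected, K=3):
    """groups: iterable of group IDs; neighbour_visits[g]: number of visits
    to g's geometric neighbours; detected: set of group IDs ever detected."""
    return [g for g in groups
            if g in detected or neighbour_visits.get(g, 0) < K]
```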
Weighting threshold
Perceptual aliasing is one of the major problems to be solved in visual SLAM positioning. When a robot operates in an area with many highly similar scenes (such as identical chairs and tables indoors, or similar buildings and streets outdoors), it can easily close incorrect loops. References [9, 11] proposed methods that weight locations or visual words to minimize this problem. We present a time-saving weighting method that weights two components of each scene's loop-closure threshold instead of weighting scenes directly.
Similar groups
During the movement, the more similar scenes a scene has, the more easily a misjudgement can be made. Therefore, scenes of this kind need a higher detection threshold to prevent incorrect loop closures from being detected.
In this step we use another RSOM tree, trained on the template graphs of every group, which we denote the RTGW (RSOM tree for group weighting). When
as the similar groups of
After voting, a component of the loop-closure threshold denoted as GSC (group similarity component) has been weighted. Moreover, weighting values of GSC for each group are updated in real time. The weighting function for GSC is defined as follows (see in Figure 4):
where
In areas where landmarks look similar or even identical, a robot can easily make a wrong judgement if these landmarks are only observed separately. To position more precisely, some researchers have proposed loop-closure detection methods in which matches are sought between current observations and a limited region of the map, but such a method cannot be guaranteed in real-world situations [11]. Thus, instead of using geometric constraints between images and the global map, we utilize the geometric constraint information of each scene by weighting the GGC (group geometric component) of the loop-closure threshold.

As we can see, the weighting value of GSC increases with
Since the trajectory of robots’ movement is uninterrupted, we define time-adjacent groups to

Example of geometric neighbour
A robot is more likely to move into scenes near its current location, so the loop-closure threshold should be decreased for these scenes. Motivated by this, the weighting rule of GGC is introduced as follows: if the current location of the robot is a neighbour of
In general, these time-saving weighting steps let us apply a higher loop-closure threshold to scenes that are difficult to distinguish, minimizing the perceptual-aliasing problem. During the movement of RSOM-SLAM, both weighting values are updated in real time.
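Since the exact weighting functions are truncated in this copy, the following only illustrates how the two components could combine into a per-group threshold, consistent with the rules stated above: GSC raises the threshold with the number of similar groups, and GGC lowers it for geometric neighbours of the current location. All names and constants (`loop_threshold`, `gsc_step`, `ggc_bonus`, `cap`) are hypothetical.

```python
# Illustrative combination of the two threshold components: the group
# similarity component (GSC) grows with the number of similar groups, and
# the group geometric component (GGC) lowers the threshold for geometric
# neighbours of the current location.
def loop_threshold(base, n_similar_groups, is_neighbour,
                   gsc_step=0.02, ggc_bonus=0.05, cap=0.95):
    gsc = min(gsc_step * n_similar_groups, cap - base)  # capped increase
    ggc = -ggc_bonus if is_neighbour else 0.0           # neighbour discount
    return base + gsc + ggc
```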
The rationale behind the loop-closure judgement is to combine the KNNG of a captured image with the weighted thresholds to evaluate the probability that the current location is being revisited. For the KNNG of a captured image
where
In this section, we will evaluate the performance of our method by precision-recall metrics. Precision is the ratio of true-positive loop-closure detections to the total number of detections. Recall is defined as the ratio of true-positive loop-closure detections to the number of ground-truth loop closures [27].
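The two metrics defined above can be written out directly; the function name `precision_recall` and the convention of returning 1.0 for an empty denominator are our own choices.

```python
# Precision: true positives over all detections.
# Recall: true positives over all ground-truth loop closures.
def precision_recall(true_positives, false_positives, ground_truth_loops):
    detections = true_positives + false_positives
    precision = true_positives / detections if detections else 1.0
    recall = true_positives / ground_truth_loops if ground_truth_loops else 1.0
    return precision, recall
```

For instance, 98 true positives with no false positives against 100 ground-truth loops gives 100% precision at 98% recall, the operating point reported for the indoor experiments.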
In our indoor experiments, test videos were captured with a Canon digital camera fixed on the experimental vehicle moving at a constant speed of 0.5 m/s; the resolution of the videos is 640×480 pixels. In our outdoor experiment, one test video was captured with a Canon digital camera fixed on an electric motorcycle moving at a constant speed of 5 m/s, also at a resolution of 640×480 pixels. We also test an outdoor video from [8].
Once the videos were captured, we used a computer with an Intel(R) Core i5-2320 CPU @ 3.00 GHz, a 500 GB hard drive and 4 GB of RAM as a server to test our method. During the experiments, the resolution of each image is converted to 160×120 pixels online; the time to calculate the PGSM between a pair of images at this resolution averages 17 ms over 500 image pairs.
Indoor experiments
Precision-recall performance
We conducted two indoor experiments at region C and region A in the Chemical Building of Hunan University. Region C is a circular corridor area with distinguishable scenes. In this experiment, a video of five consecutive circles consisting of 1880 frames was collected.
The experimental result in region C, which contains the PGSMs obtained by measuring each captured graph against the historical graphs together with the loop-closure threshold, is presented in Figure 7, and some accepted loop closures are shown in Figure 6. In the first circle, although the PGSM values between some captured images and historical images were high (indicating high similarity with images in the database), no wrong loop closure was detected. We can see from Figure 7 that the first loop closure is detected when RSOM-SLAM returns to its starting position at the beginning of the second circle. When RSOM-SLAM visited region C for the fifth time, only three correct loop closures were rejected.

Examples of positive loop-closure detection for the indoor image sequence, where (a), (b) and (c) are scenes of Region C and (d), (e) and (f) are scenes of Region A; (g) is the path trajectory of Region C.

Recall performance over five circles in the indoor experiments, where (a) is the result in Region C and (b) the result in Region A. The blue dots are the PGSMs between each frame and the historical images; the loop-closure threshold is shown as an irregular red line. No loop closure is detected in the first circle, as no dot lies above the line.
The recall performance of our first indoor experiment did not reach 100% for two reasons. First, interference such as dynamic obstacles and camera wobble caused loop-closure detection to fail at the 524th, 863rd and 1042nd frames. Second, the 28th-36th, 93rd-94th, 99th-100th and 249th-251st frames were not input into the RSOM tree and database during the grouping procedure, so RSOM-SLAM failed to close loops at the 353rd-361st, 424th-433rd and 578th-585th frames. Despite these problems, the recall performance of the method is over 98% at 100% precision in the fourth and fifth circles.
The second indoor experiment was conducted in region A, where a video of five consecutive circles consisting of 2498 frames was collected. Region A is also a circular corridor area; however, unlike region C, it contains many highly similar scenes and is therefore subject to severe perceptual aliasing. For this reason, the recall performance of this experiment is worse than that of the first indoor experiment; the reasons it did not reach 100% are the same as those for region C mentioned above. The results are presented in Figure 7, and the recall performances of the two indoor experiments are summarized in Table 2.
To test the real-time performance of our method in an indoor environment, we collected a video of 20 cycles in region A. As Figure 10(b) shows, the average time to process a frame did not increase significantly with the distance travelled and stayed at around 120 ms per frame. We can therefore conclude that RSOM-SLAM operates in real time indoors where there are only a small number of dynamic obstacles.
The number of groups and average loop-closure threshold in indoor experiments
During the movement, RSOM-SLAM built up several groups to represent different scenes, with the loop-closure thresholds selected in real time. Comparing the results of the two indoor experiments in Table 1, we observe that per unit distance RSOM-SLAM built more groups, had a lower average loop-closure threshold and detected loops more precisely in the distinctive environment than in the environment with many similar scenes.
Precision-recall performance
We conducted the first outdoor experiment using the suburban dataset from [8]; the video, consisting of two consecutive circles outdoors, has 2345 frames. In this environment, pedestrians and cars had a negative effect on our method's recall performance, which is 80% at 98% precision. In the second outdoor experiment, we collected a video of 15 cycles (110815 frames) in a residential area with a circle length of around 1.8 km. This environment contained far more pedestrians and cars than the first outdoor experiment, which made it difficult for RSOM-SLAM to detect loops; nevertheless, the recall performance of our method still reached 74% at 93% precision. Some positive loop-closure detections and path trajectories are shown in Figure 8, and the recall performance of this experiment is detailed in Table 2.
Recall performance of experiments

Examples of positive loop-closure detection for the outdoor image sequence, where (a), (b) and (c) are scenes of the residential area and (d), (e) and (f) are scenes of the dataset from [8]; (g) is the path trajectory of the residential area.
Figure 9 illustrates the grouping and scene-representation results of our second outdoor experiment. As the figure shows, the number of graphs in these groups may change through learning the TG and through incremental learning.
To test the real-time performance of our method in a large, dynamic environment, we use the video from the second outdoor experiment, which contains a large number of dynamic obstacles and covers a recorded distance of more than 27 km.
To operate in real time, our system is designed with two threads: a grouping and loop-closure detection thread, and an incremental learning thread. As these two threads work independently, a captured image that needs to be learned is put into a vector storing the graphs waiting to be learned into the RSOM tree. According to Figure 10, the time consumption of incremental learning for each graph is around 80 to 160 ms; these results demonstrate that it does not tend to increase as the database grows. Unfortunately, in this environment the operating time of our system still increased with the distance travelled. When the result

Examples of grouping and scene representation, where the first graph in each group is the template graph
Based on similarity measurement between pairs of images, this paper has presented an incremental vision-based method allowing real-time loop-closure detection. The attributed graph model introduced in [20] provides a precise way to represent each image and to measure the similarity between a pair of images. By weighting the components of the loop-closure threshold, RSOM-SLAM exploits the continuity of the robot's trajectory and the similarity relationships among scenes to enhance its recognition ability. We conducted tests with two datasets from [9] to compare the performance of our system with IAB-MAP. We also tested BoWSLAM from http://hilandtom.com/tombotterill/code/index.php on our datasets to compare its performance with ours. According to Figure 11, RSOM-SLAM achieves better recall performance both indoors and outdoors; in the indoor environment, it achieves recall of around 90% at 100% precision.

The time consumption of our method
In the outdoor environment, although a large number of pedestrians and cars made it difficult for RSOM-SLAM to position itself, our method still achieved recall of over 74% at 93% precision when the robot revisited the circle for the first time. According to Table 2, we can infer that when a robot revisits a scene additional times, the recall performance improves, even reaching nearly 100% in the indoor experiments.
The RSOM clustering tree introduced in [21] offers a fast way to obtain the KNNG of a given image from the database. Through the grouping steps, only a small subset of the captured graphs, rather than all of them, is selected to represent each scene, which enhances the real-time performance of RSOM-SLAM. The results of the experiments in region A presented in Figure 10 suggest that the time our system needs to process an image does not tend to increase in the long term. Our explanation is that after RSOM-SLAM has visited a small area with no or only a few dynamic obstacles several times, there is no new information left to learn, so the scale of the database remains constant. The time consumption then comes from only two procedures: the pairwise graph similarity measurement, which costs only 17 ms on average, and obtaining the KNNG. We can therefore deduce that RSOM-SLAM can operate over long distances in this kind of environment. In the residential area, RSOM-SLAM operated in real time for more than 73000 frames over a distance of more than 18 km (1.8 km/circle) by discarding invalid groups; without discarding invalid groups, it could operate in real time for only about 46300 frames, and as the database grew the time consumption of some frames exceeded 500 ms after the 73000th frame.

(a) and (b) are the examples of positive loop-closure detection from the test on L6I by RSOM-SLAM while (c) is the recall performance. (d) is the comparison result of recall performance between RSOM-SLAM and other methods.
In summary, the results of our experiments show that RSOM-SLAM has good recall performance in both indoor and outdoor environments, and that it can operate over long distances in real time even in a large, dynamic environment. Future work consists of two parts: first, improving the real-time performance of our system based on cluster-computing technology and an RSOM dictionary; and second, constructing a complete SLAM system based on data-association methods and on RSOM-SLAM's representation of each scene by groups of graphs marked with the corresponding group index.
