Abstract
Introduction
Simultaneous localization and mapping (SLAM) has become a very popular research direction in recent years; it requires constructing and updating an environment map while simultaneously tracking an agent's position. 1,2 SLAM has a variety of applications such as autonomous driving, mobile robots, and virtual reality. In particular, visual SLAM has received extensive attention due to the large amount of information, wide range of application scenarios, and low cost of visual sensors. 3,4 Compared with monocular and stereo cameras, RGB-Depth (RGB-D) cameras are widely used in indoor environments because they directly provide depth and color measurements of the scene. In this article, we focus on RGB-D SLAM.
For traditional visual SLAM, the feature-based approach 5–7 and the direct method 8,9 are the mainstream solutions, in which low-level point information plays an important role. The former associates points in successive frames according to the local appearance near every feature point, while the latter tracks points on the basis of the constant-brightness assumption. 10 However, both methods suffer from illumination and viewpoint changes. 11,12 Different viewpoints and illuminations alter the local appearance and brightness of the same point, leading to incorrect data association and tracking failures, which in turn decrease the localization accuracy of visual SLAM. Moreover, traditional visual SLAM mainly relies on low-level geometric information, which can result in weak interaction with complex surrounding environments. 13
With the development of deep learning, great progress has been made in object detection and object segmentation, whose high-level semantic information adapts better to viewpoint and illumination changes. The purpose of object detection is to infer the locations and class labels of objects, where the location of an object is represented by a bounding box. Deep-learning-based object detection can be classified into approaches with and without region proposals. The former is a two-stage process: first generate a series of candidate regions, then extract features of these regions for classification and boundary regression. Popular methods include regions with convolutional neural network features (R-CNN), Fast R-CNN, Faster R-CNN, and so on. Approaches without region proposals use the global information of the image directly; you only look once (YOLO), 14 YOLO9000, 15 YOLOv3, 16 and the single-shot multibox detector 17 are representative methods. Unlike the bounding box of object detection, object segmentation predicts class labels pixel by pixel, and it is related to semantic segmentation 18,19 and instance segmentation. 20 A drawback of object segmentation is its computational cost, which makes it hard to integrate into a real-time SLAM system.
Driven by deep-learning-based object detection and segmentation, researchers have turned to semantic visual SLAM, which combines SLAM with object detection or segmentation. Semantics can not only help SLAM achieve better localization 11,21–23 but also build richer maps. To improve localization accuracy, semantic constraints are added. Lianos et al. constructed a semantic error function that exploits semantic segmentation to promote point–point association. 11 An et al. evaluated the importance of each semantic category based on semantic segmentation to obtain better visual features and remove outliers in the matching process, 21 thereby improving the accuracy and robustness of localization. Besides semantic constraints, pose optimization of objects has also been considered. A 3-D cuboid object detection approach was proposed 22 and combined with Oriented FAST and Rotated BRIEF (ORB) feature points to build semantic error functions for static and dynamic environments; on this basis, the poses of points, 3-D cuboids, and cameras are jointly optimized. Similarly, Li et al. used 3-D object detection with viewpoint classification together with feature points to construct semantic constraints, 23 which is suitable for both static and dynamic conditions.
It should be noted that existing semantic SLAM approaches mainly consider camera–landmark constraints, camera–camera constraints, and constraints between different types of landmarks, where a landmark can be a point or an object. Constraints between landmarks of the same kind are seldom considered. In fact, the relative distance and orientation between two static object landmarks are invariant, but this invariance may be violated if only the aforementioned constraints are employed. Introducing relative constraints among objects into the SLAM optimization process is therefore beneficial for localization. In this article, we propose a real-time visual Point-Object SLAM (PO-SLAM) approach built on RGB-D ORB-SLAM2, which incorporates the object–object constraint in the bundle adjustment (BA) optimization process. To ensure real-time performance while still instantiating objects, YOLOv3 16 is adopted and combined with a rough geometric segmentation based on the depth histogram to obtain object contours, which improves the association quality. Moreover, the object–object constraint reflects the relative position invariance of objects, which is converted into the length and orientation invariances of the line segment connecting every two objects in each frame. This provides additional information for pose optimization.
In the following, we will describe the proposed PO-SLAM approach combining points and objects in detail. Then, the experiments are presented, and finally, we conclude the article.
The proposed semantic PO-SLAM with points and objects
The framework of the proposed semantic PO-SLAM is shown in Figure 1, where point features, point–point association, and point–point constraints are used directly as in ORB-SLAM2. 7 In the feature extraction module, object features are extracted from the color image provided by the RGB-D camera using YOLOv3. 16 Considering that object detection cannot accurately delineate object contours, we use the depth image to geometrically segment the detected objects based on depth histograms. Then, combined with point features, point–object association is executed to obtain the feature points on each detected object. After extracting the features of every frame, we track the features between the current and previous frames; besides point–point association, object–object interframe association is also performed. On this basis, the extracted point and object features as well as the association results are involved in the BA optimization process. With the help of the loop closing of ORB-SLAM2, SLAM is finally implemented. In the following, we address the PO-SLAM in detail.

Overall framework of the semantic PO-SLAM approach. SLAM: simultaneous localization and mapping.
Feature extraction
Low-level point features are combined with high-level semantic object features in our SLAM. The reader may refer to the study by Mur-Artal and Tardós 7 for point feature extraction; in this section, we focus on the extraction of object features.
Object features extraction
Object features, including the number of objects, their categories, and their positions, are favorable for the data association of SLAM due to the reliability of high-level features. In this article, YOLOv3 16 is utilized to detect the objects in each frame, where the deep network is trained on the MS COCO data set covering 80 categories of common objects. Object detection yields the bounding boxes, labels, and label confidences of objects. Note that we only keep results with a confidence of more than 70%.
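The confidence filter above can be sketched as follows. This is an illustrative snippet, not the paper's implementation; the detection record format (label, confidence, box) is an assumption made for the example.

```python
# Hypothetical sketch: keep only detections above the 70% confidence
# threshold used in the paper. The detection dictionary layout (label,
# confidence, box as x, y, w, h) is assumed for illustration.

def filter_detections(detections, min_confidence=0.70):
    """Discard low-confidence bounding boxes before data association."""
    return [d for d in detections if d["confidence"] >= min_confidence]

detections = [
    {"label": "teddy bear", "confidence": 0.92, "box": (120, 80, 60, 70)},
    {"label": "cup",        "confidence": 0.55, "box": (300, 200, 30, 40)},
]
kept = filter_detections(detections)
# Only the teddy bear (0.92 >= 0.70) survives the filter.
```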
Geometric segmentation
For object detection, the resulting bounding box surrounding an object cannot fit the actual boundary of the object completely, and some background information is inevitably contained. In this case, it is not easy to judge whether a feature point lies on an object, which affects the determination of the object's position. Also, despite its good segmentation quality, instance segmentation based on deep learning takes considerably more time. A fast segmentation solution to extract the foreground within the bounding box of an object is therefore required. Herein, a geometric segmentation based on the depth histogram is presented.
In a detection bounding box, there are only two types of pixels: background and foreground. They can be differentiated using depth information, which reflects the distance between an object and the camera; a depth threshold separating the foreground from the background needs to be determined. Given the depth values of foreground and background, we utilize the Otsu threshold segmentation method 24 to split the depth values by maximizing the interclass variance of these two parts. Otsu's method determines the threshold automatically; however, it is sensitive to noise. In the depth map provided by the RGB-D camera, some pixels have a depth value of 0, caused by pixels outside the depth range or by missed measurements. Those zero-depth pixels should be filtered out before calculating the depth threshold.
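The thresholding step can be sketched as follows. This is a minimal illustration of Otsu's method applied to the valid depths of a bounding-box crop, with zero-depth pixels filtered out first; the bin count and synthetic data are assumptions, not values from the paper.

```python
import numpy as np

def otsu_depth_threshold(depth_patch, bins=64):
    """Otsu threshold over the valid (non-zero) depths of a crop.

    Zero depths mark missing measurements and are filtered out first,
    so they do not bias the foreground/background split.
    """
    depths = depth_patch[depth_patch > 0].astype(np.float64)
    hist, edges = np.histogram(depths, bins=bins)
    p = hist / hist.sum()                       # bin probabilities
    centers = (edges[:-1] + edges[1:]) / 2.0
    best_k, best_var = 1, -1.0
    for k in range(1, bins):                    # candidate split points
        w0, w1 = p[:k].sum(), p[k:].sum()       # class weights
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (p[:k] * centers[:k]).sum() / w0  # class means
        mu1 = (p[k:] * centers[k:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_k = var_between, k
    return edges[best_k]                        # split edge between classes

# Synthetic crop: foreground near 1.0 m, background near 3.0 m, holes (0).
rng = np.random.default_rng(0)
patch = np.concatenate([rng.normal(1.0, 0.05, 500),
                        rng.normal(3.0, 0.05, 500),
                        np.zeros(100)])
t = otsu_depth_threshold(patch)
# t falls between the two depth modes, separating foreground from background.
```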
To obtain the geometrical segmentation for an object, the depth image of current frame is cropped according to the predicted bounding box, and one can obtain the depth submap
Geometrical segmentation process.
Figure 2 illustrates the segmentation result. Take the teddy bear in the original image from the TUM data set 25 (see Figure 2(a)) as an example. Figure 2(b) provides the detection result, and the depth histogram of the pixels in the bounding box is presented in Figure 2(c). One can see that the depth values are divided into two parts by the yellow dashed line corresponding to the depth threshold

The geometric segmentation. (a) Original image. (b) One detected object. (c) The depth histogram of the bounding box in (b). (d) The extracted foreground after the segmentation.
Data association
As a reflection of the common view between frames, data association is important for solving camera poses and landmark positions in SLAM. In addition to the interframe association of point features used in ORB-SLAM2, 7 we also take into account the correlation of point features and object features within each frame as well as the association of object features between frames.
Point–object association
As mentioned above, for each detected bounding box in each frame, the foreground is separated using the depth image, and the feature points located in the foreground area are taken as the feature points of the corresponding object. This association of points and objects is used to calculate the point–object error in the subsequent BA optimization. Figure 3 illustrates the association results for a selected image in fr2/desk of the TUM RGB-D data set. 25 The bounding boxes of different object classes are drawn in different colors, and the feature points belonging to an object share the color of its bounding box. Notice that multiple object instances of the same class can be distinguished by the positions of their bounding boxes, and the green points, which do not belong to any detected object, are considered background. Points that fall within the bounding box of an object and whose color matches that of the bounding box are regarded as the feature points associated with the object.
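The assignment rule above can be sketched as follows. This is an illustrative snippet under assumed conventions (box as x, y, w, h; mask indexed row-first), not the paper's code.

```python
# Hypothetical sketch: assign feature points to a detected object when they
# fall inside its bounding box AND on the segmented foreground mask.

def associate_points(keypoints, box, foreground_mask):
    """Return indices of keypoints lying on the object's foreground."""
    x, y, w, h = box
    associated = []
    for i, (u, v) in enumerate(keypoints):   # (u, v) = pixel column, row
        inside = x <= u < x + w and y <= v < y + h
        if inside and foreground_mask[v - y][u - x]:
            associated.append(i)
    return associated

# 4x4 box at (10, 10); only the left half of the crop is foreground.
mask = [[c < 2 for c in range(4)] for _ in range(4)]
pts = [(11, 11), (13, 12), (50, 50)]  # on foreground, on background, outside
print(associate_points(pts, (10, 10, 4, 4), mask))  # → [0]
```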

Results of point–object association for an image in fr2/desk of TUM RGB-D data set, where the color of points belonging to the same object is the same as that of the corresponding bounding box.
Object–object association
Object–object association between two frames is similar to standard object tracking. Since the categories of the objects in each frame are already known, we only need to consider the object categories that appear in both frames. At first, the center
where
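The interframe matching described above can be sketched as follows. The greedy nearest-center rule and the pixel distance gate below are assumptions made for illustration; the paper's exact matching criterion is given by the (omitted) equations.

```python
import math

def center(box):
    """Center of a bounding box given as (x, y, w, h)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def associate_objects(prev_objs, curr_objs, max_dist=80.0):
    """Match objects of the same category across two frames by nearest
    bounding-box centers. Inputs: lists of (label, box); returns index pairs.
    """
    matches, used = [], set()
    for i, (label_p, box_p) in enumerate(prev_objs):
        best_j, best_d = None, max_dist
        for j, (label_c, box_c) in enumerate(curr_objs):
            if j in used or label_c != label_p:   # same category only
                continue
            d = math.dist(center(box_p), center(box_c))
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            matches.append((i, best_j))
    return matches

prev = [("cup", (100, 100, 40, 40)), ("book", (300, 200, 80, 60))]
curr = [("book", (305, 204, 80, 60)), ("cup", (104, 98, 40, 40))]
print(associate_objects(prev, curr))  # → [(0, 1), (1, 0)]
```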
Bundle adjustment
Combining point and object features, constraints with geometric and semantic relationships are constructed to optimize camera poses and 3-D point positions. The sets of image sequence, positions of 3-D points, and objects in the world coordinate system are denoted as
We can observe the measurements corresponding to 3-D points and objects from each frame.
BA formulation
Our semantic optimization process can be described as the following problem: given the observations
where
Error functions
where
where
The relative position between two objects is constrained by distance and orientation. To solve the problem, we connect the positions of two objects into an abstract line segment, and thus the distance and direction constraints can be converted to the invariance of length and direction of the line segment. We define
According to the direction invariance constraint, we can infer that the projection points of object
The length invariance of the line segment indicates that the distance between the projected points is the same as that of
where
where
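As a hedged sketch of the constraint just described (the notation below is assumed for illustration and is not necessarily the paper's exact formulation), let $o_i$, $o_j$ denote the estimated object positions and $\hat{o}_i^{k}$, $\hat{o}_j^{k}$ their observed positions in frame $k$. The length and direction invariances of the connecting segment then suggest errors of the form:

```latex
% Length invariance: the observed segment keeps the estimated length.
e_{\mathrm{len}}^{k}(i,j) \;=\;
  \big\| \hat{o}_i^{k} - \hat{o}_j^{k} \big\|
  \;-\; \big\| o_i - o_j \big\|,
% Direction invariance: the observed segment stays parallel to the
% estimated one (error vanishes when the directions coincide).
\qquad
e_{\mathrm{dir}}^{k}(i,j) \;=\; 1 \;-\;
  \frac{ \big( \hat{o}_i^{k} - \hat{o}_j^{k} \big)^{\!\top} \big( o_i - o_j \big) }
       { \big\| \hat{o}_i^{k} - \hat{o}_j^{k} \big\| \, \big\| o_i - o_j \big\| }.
```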
Experiments and results
In this section, we will evaluate the localization performance of our approach and conduct the comparison with ORB-SLAM2.
Experimental setup
We adopt the TUM RGB-D SLAM data set and benchmark 25,27 to test and validate the approach. The TUM data set consists of different types of sequences, which provide color and depth images with a resolution of 640 × 480 captured by a Microsoft Kinect sensor. YOLOv3 scales the original images to 416 × 416. Considering the objects of interest, such as book, keyboard, mouse, TV monitor, cup, cell phone, remote, bottle, teddy bear, and potted plant, 10 sequences related to office environments are selected.
We adopt the following evaluation metrics 27 : absolute trajectory error with root mean square (ATE) and mean relative pose error (RPE), where ATE quantifies the difference between points of the estimated trajectory and their ground truth, whereas RPE assesses the local accuracy of the estimated poses over a fixed interval. All experiments are repeated five times, and the median of the five results is taken as the final result. To clearly demonstrate the improvement of our method,
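The ATE metric above can be sketched as follows. This is an illustrative computation over already-aligned trajectories; the trajectory alignment step (e.g. Horn's method, as used by the benchmark tools) is omitted for brevity.

```python
import math

def ate_rmse(estimated, ground_truth):
    """Root mean square of translational differences between an estimated
    trajectory and its ground truth (both lists of (x, y, z) positions,
    assumed time-synchronized and aligned)."""
    assert len(estimated) == len(ground_truth)
    sq = [sum((e - g) ** 2 for e, g in zip(p, q))
          for p, q in zip(estimated, ground_truth)]
    return math.sqrt(sum(sq) / len(sq))

est = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (2.0, 0.1, 0.0)]
gt  = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(round(ate_rmse(est, gt), 4))  # → 0.0816
```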
Experiment on the TUM RGB-D data set 25
Tables 1 and 2 give the comparison of our PO-SLAM and ORB-SLAM2 over 10 sequences. To better analyze our approach, we also consider two variants, PO-SLAM1 and PO-SLAM2, corresponding to PO-SLAM without the point–object error in (3) and PO-SLAM without the object–object error in (3), respectively. Notice that the first seven sequences describe static scenes, whereas the last three are related to dynamic scenes.
Comparison of our methods with ORB-SLAM2 according to absolute trajectory errors.
SLAM: simultaneous localization and mapping; ATE: absolute trajectory error with root mean square.
Comparison of our methods with ORB-SLAM2 according to relative pose errors.
SLAM: simultaneous localization and mapping; RPE: mean relative pose error.
As can be seen in Tables 1 and 2, our PO-SLAM has an improvement of up to 10.46% in ATE and up to 10.95% in RPE compared with ORB-SLAM2. Overall, our three methods perform better than ORB-SLAM2 in both ATE and RPE for most of the sequences, and PO-SLAM performs best.
Figure 4 depicts the trajectories obtained by PO-SLAM and ORB-SLAM2 on four sequences, compared with the ground truth. It can be seen that our trajectories are closer to the ground truth than those of ORB-SLAM2. Note that all ORB features extracted by ORB-SLAM2 are used in our point–point error. From Tables 1 and 2, our method shows better adaptability to dynamic environments. Figure 5 illustrates a performance comparison of PO-SLAM and ORB-SLAM2 on the dynamic sequence fr3/walking_xyz. 25 Clearly, ORB-SLAM2 fails to track at frames 696 and 768, while PO-SLAM remains in SLAM mode with enough points matched to the previous frame.

Comparison of trajectories estimated by our PO-SLAM, ORB-SLAM2, and ground truth on the TUM RGB-D data set. (a) fr1/desk2, (b) fr1/room, (c) fr3/office, and (d) fr2/desk. SLAM: simultaneous localization and mapping.

Comparison of our PO-SLAM and ORB-SLAM2 on fr3/walking_xyz. (a) and (b) The results of ORB-SLAM2; (c) and (d) the results of PO-SLAM. SLAM: simultaneous localization and mapping.
The average running time per frame of PO-SLAM over the 10 sequences of the TUM RGB-D data set is shown in Figure 6. The average time is 71.47 ms, that is, a speed of about 14 fps, which meets the real-time requirement.

Average running time per frame of PO-SLAM on the TUM RGB-D data set. SLAM: simultaneous localization and mapping.
Conclusions
In this article, we propose a semantic visual SLAM approach that combines 2-D object detection and ORB feature points, adding semantic constraints to the BA optimization process. An object segmentation approach combining object detection with the depth histogram of the 2-D bounding box is used to associate feature points with their corresponding objects. In addition, the correlation between any two detected objects within the field of view of each frame is introduced. Experimental results on the TUM RGB-D data set indicate that our approach improves accuracy and robustness compared with ORB-SLAM2.
