Abstract
Keywords
1. Introduction
The next generation of robots and autonomous systems will need to build actionable, metric-semantic, multi-resolution, persistent representations of large-scale unknown environments in real-time.
The pioneering work of Davison (2018) introduced the notion of spatial AI, and the requirement that spatial AI build actionable and persistent representations can already be found there. In this paper we take a step further by arguing that such representations must be hierarchical in nature.
1.1. Hierarchical representations
3D Scene Graphs (Armeni et al., 2019; Kim et al., 2019; Rosinol et al. 2020b, 2021; Wald et al., 2020; Wu et al., 2021) have recently emerged as expressive hierarchical representations of 3D environments. A 3D scene graph (e.g., Figure 1) is a layered graph where nodes represent spatial concepts at multiple levels of abstraction (from low-level geometry to objects, places, rooms, buildings, etc.) and edges represent relations between concepts. Armeni et al. (2019) pioneered the use of 3D scene graphs in computer vision and proposed the first algorithms to parse a metric-semantic 3D mesh into a 3D scene graph. Kim et al. (2019) reconstruct a 3D scene graph of objects and their relations. Rosinol et al. (2021, 2020b) propose a novel 3D scene graph model that (i) is built directly from sensor data, (ii) includes a sub-graph of places (useful for robot navigation), (iii) models objects, rooms, and buildings, and (iv) captures moving entities in the environment. Recent work (Gothoskar et al., 2021; Izatt and Tedrake 2021; Wald et al., 2020; Wu et al., 2021) infers objects and relations from point clouds, RGB-D sequences, or object detections. In this paper, we formalize the intuition behind these works that suitable representations for robot perception have to be hierarchical in nature, and discuss guiding principles behind the choice of “symbols” we have to include in these representations. We introduce 
1.2. Real-time systems
While 3D scene graphs can serve as an advanced “mental model” for robots, methods to build such a rich representation in real-time remain largely unexplored. The works (Kim et al., 2019; Wald et al., 2020; Wu et al., 2021) allow real-time operation but are restricted to “flat” 3D scene graphs that only include objects and their relations while disregarding the top layers in Figure 1. The works (Armeni et al., 2019; Rosinol et al. 2020b, 2021), which focus on building truly hierarchical representations, run offline and require several minutes to build a 3D scene graph (Armeni et al. (2019) even assumes the availability of a complete metric-semantic mesh of the environment built beforehand). Extending our prior works (Rosinol et al., 2020b, 2021) to operate in real-time is non-trivial. These works utilize a Euclidean Signed Distance Function (ESDF) of the entire environment to build the scene graph. Unfortunately, the memory requirements of ESDFs scale poorly with the size of the environment; see Oleynikova et al. (2017) and Section 2. Moreover, the extraction of places and rooms in Rosinol et al. (2021, 2020b) involves batch algorithms that process the entire ESDF, whose computational cost grows over time and is incompatible with real-time operation. Finally, the ESDF is reconstructed from the robot trajectory estimate, which keeps changing in response to loop closures. The approaches (Rosinol et al., 2020b, 2021) would therefore need to rebuild the scene graph from scratch after every loop closure, clashing with real-time operation.
The present paper extends our prior work (Hughes et al., 2022) and proposes the first real-time system to build hierarchical 3D scene graphs of large-scale environments. Following (Hughes et al., 2022), recent works have explored constructing
1.3. Contribution 1: Foundations of hierarchical representations (Section 2)
We start by observing that flat metric-semantic representations scale poorly with the size of the environment and the size of the vocabulary of semantic labels the robot has to incorporate in the map. For instance, a voxel-based metric-semantic map picturing the floor of an office building with 40 semantic labels per voxel (like the one underlying the approaches of Rosinol et al. (2021) and Grinvald et al. (2019)) already requires roughly 450 MiB to be stored. Envisioning future robots operating at much larger scales (e.g., an entire city) and using a much larger vocabulary (e.g., the English dictionary includes roughly 500,000 words), we argue that research should move beyond flat representations. We show that hierarchical representations allow us to greatly reduce the memory requirements, by enabling lossless compression of semantic information into a layered graph, as well as lossy compression of the geometric information into meshes and graph-structured representations of the free space. Additionally, we show that hierarchical representations are amenable to efficient inference. In particular, we prove that the layered graphs appearing in hierarchical map representations have small treewidth.
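As a back-of-the-envelope illustration of this scaling argument (the grid dimensions and resolution below are our own assumptions, not the exact configuration behind the 450 MiB figure):

```python
# Rough memory estimate for a flat voxel-based metric-semantic map that
# stores a full label distribution in every voxel. All numbers below are
# illustrative assumptions, not the paper's exact configuration.

BYTES_PER_LABEL = 4  # one float32 probability per semantic label


def flat_map_bytes(num_voxels: int, num_labels: int) -> int:
    """Memory needed to store num_labels per-voxel label probabilities."""
    return num_voxels * num_labels * BYTES_PER_LABEL


# Hypothetical office floor: 50 m x 50 m x 3 m at 0.1 m voxel resolution.
num_voxels = int((50 / 0.1) * (50 / 0.1) * (3 / 0.1))  # 7.5 million voxels
office_mib = flat_map_bytes(num_voxels, num_labels=40) / 2**20

# The same map with a dictionary-sized vocabulary (~500,000 labels)
# needs terabytes, which is why flat representations do not scale.
city_vocab_mib = flat_map_bytes(num_voxels, num_labels=500_000) / 2**20
print(f"40 labels: {office_mib:.0f} MiB; 500k labels: {city_vocab_mib:.0f} MiB")
```

Even with modest per-voxel storage, the cost grows as the product of map volume and vocabulary size, which motivates the compression arguments of Section 2.2.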
1.4. Contribution 2: Real-time incremental 3D scene graph construction (Section 3)
After establishing the importance of hierarchical representations, we move to developing a suite of algorithms to estimate 3D scene graphs from sensor data. In particular, we develop real-time algorithms that can incrementally estimate a 3D scene graph of an unknown building from visual-inertial sensor data. We use existing methods for geometric processing to incrementally build a metric-semantic mesh of the environment and reconstruct a sparse graph of “places”; intuitively, the mesh describes the occupied space (including objects in the environment), while the graph of places provides a succinct description of the free space. Then we propose novel algorithms to efficiently cluster the places into rooms; here, we use tools from topology, and in particular the notion of persistent homology.
1.5. Contribution 3: Persistent representations via hierarchical loop closure detection and 3D scene graph optimization (Section 4)
Building a persistent map representation requires the robot to recognize that it is revisiting a location it has seen before, and to correct the map accordingly. We propose a novel hierarchical approach for loop closure detection: the approach involves (i) a
1.6. Contribution 4: Hydra, a real-time spatial perception system (Section 5)
We conclude the paper by integrating the proposed algorithms into a highly parallelized perception system, named Hydra.
1.7. Novelty with respect to Hughes et al. (2022); Talak et al. (2021)
This paper builds on our previous conference papers (Hughes et al., 2022; Talak et al., 2021) but includes several novel findings. First, rather than postulating a 3D scene graph structure as done in Hughes et al. (2022), we formally show that hierarchical representations are crucial to achieve scalable scene understanding and fast inference. Second, we propose a novel room segmentation method based on the notion of persistent homology, as a replacement for the heuristic method in Hughes et al. (2022). Third, we develop novel learning-based hierarchical descriptors for place recognition, which further improve performance over the handcrafted hierarchical descriptors in Hughes et al. (2022). Fourth, the real-time system described in this paper is also able to assign room labels, leveraging the neural tree architecture from Talak et al. (2021); while Talak et al. (2021) uses the neural tree over small graphs corresponding to a single room, in this paper we provide an efficient way to obtain a tree decomposition of the top layers of the scene graph, hence extending Talak et al. (2021) to simultaneously operate over
2. Symbol grounding and the need for hierarchical representations
The goal of this section is twofold. First, we remark that in order to support the execution of complex instructions, map representations must be metric-semantic, and hence ground symbolic knowledge (i.e., semantic aspects in the scene) into the map geometry (Section 2.1). Second, we observe that “flat” representations, which naively store semantic attributes for each geometric entity in the map (e.g.
2.1. Symbols and symbol grounding
High-level instructions issued by humans (e.g., “go and pick up the chair”) involve symbols. A
2.1.1. Which symbols should a map contain?
In this paper, we directly specify the set of symbols the robot is interested in (i.e., certain classes of objects and rooms, to support navigation in indoor environments). However, it is worth discussing how to choose the symbols more generally. The choice of symbols clearly depends on the task the robot has to execute. A mobile robot may implement a path planning query by just using the “free-space” symbol and the corresponding grounding, while a more complex domestic robot may instead need additional symbols (e.g., “cup of coffee,” “table,” and “kitchen”) and their groundings to execute instructions such as “bring me the cup of coffee on the table in the kitchen.” A potential way to choose the symbols to include in the map therefore consists of inspecting the set of instructions the robot has to execute, and then extracting the list of symbols (e.g., objects and relations of interest) these instructions involve. For instance, in this paper, we mostly focus on indoor environments and assume the robot is tasked with navigating to certain objects and rooms in a building. Therefore, the symbols we include in our maps include free-space, objects, rooms, and buildings. After establishing the list of relevant symbols, the goal is for the robot to build a compact sub-symbolic representation (i.e., a geometric map), and then ground these symbols into the sub-symbolic representation. A pictorial representation of this idea is given in Figure 2(a): the robot builds an occupancy map (sub-symbolic representation) and attaches a number of semantic labels (symbols) to each cell. We refer to this representation as “flat” since each cell is assigned a list of symbols of interest. As we will see shortly, there are more clever ways to store the same information.
Figure 2. (a) Example of a flat metric-semantic representation. We show labels only for a subset of the cells in the grid map for the sake of readability. (b) Example of a hierarchical metric-semantic representation with a grid-map as sub-symbolic layer. (c) A hierarchical metric-semantic representation with a compressed sub-symbolic layer. The cells representing each object are compressed into bounding polygons, and the free-space cells are compressed into a sparse graph where each node and edge is also assigned a radius that defines a circle of free-space around it.
2.2. Hierarchical representations enable large-scale mapping
While the flat representation of Figure 2(a) may contain all the information the robot needs to complete its task, it is unnecessarily expensive to store. Here we show that the symbolic knowledge underlying such representation naturally lends itself to be efficiently stored using a hierarchical representation.
Consider the case where we use a flat representation (as shown in Figure 2(a)) to represent a 3D scene and assume our dictionary contains
A key observation here is that multiple voxels encode the same grounded symbols (e.g., a chair). In addition, many symbols naturally admit a hierarchical organization where higher-level concepts (i.e., buildings or rooms for indoor environments) contain lower-level symbols (e.g., objects). This suggests that we can more efficiently store information
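A minimal sketch of this observation (cells and labels below are made up): storing one label per grounded symbol instead of one label per cell makes the semantic storage proportional to the number of symbols, not the number of geometric primitives, without losing any information.

```python
# Flat representation: every cell stores its own semantic label.
flat_map = {}
for cell in range(1000):
    flat_map[cell] = "floor"
for cell in range(1000, 1020):
    flat_map[cell] = "chair"

# Hierarchical alternative: one node per grounded symbol; the label is
# stored once and the cells are kept as a compact range (the grounding).
hierarchical = {
    "floor": range(0, 1000),
    "chair": range(1000, 1020),
}

labels_stored_flat = len(flat_map)      # one label entry per cell
labels_stored_hier = len(hierarchical)  # one label entry per symbol

# Both encode the same semantic information (lossless compression).
same_info = all(flat_map[c] == symbol
                for symbol, cells in hierarchical.items() for c in cells)
```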
While we “compressed” the symbolic representation using a hierarchical data structure, the term
2.3. Hierarchical representations enable efficient inference
While above we showed that hierarchical representations are more scalable in terms of memory requirements, this section shows that the graphs underlying hierarchical representations also enable fast inference. Specifically, we show that these hierarchical graphs have small treewidth.
In the rest of this section, we formalize the notion of hierarchical graph and show that the treewidth of a hierarchical graph is always bounded by the maximum treewidth of each of its layers. We do this by proving that the tree decomposition of the hierarchical graph can be obtained by a suitable concatenation of the tree decompositions of its layers. The results in this section are general and apply to arbitrary hierarchical representations (as defined below) beyond the representations of indoor environments we consider later in the paper.
Hierarchical Graph.
To gain some insight into Definition 1, consider a hierarchical graph where the bottom layer
We now show that a tree decomposition of a hierarchical graph can be constructed by concatenating the tree decomposition of each of its layers. This result will allow us to obtain the treewidth bound in Proposition 3. The resulting tree decomposition algorithm (Algorithm 1) will also be used in the neural tree approach used for room classification in Section 3.4.2. The interested reader can find a refresher about tree decomposition and treewidth in Appendix A.1
and an example of execution of Algorithm 1 in Figure 3. In the following we assume that the provided hierarchical graph is undirected, or is obtained from a directed graph through moralization (Koller and Friedman 2009).

Figure 3. (a) Hierarchical object-room-building graph with objects O1, O2, …, O8, rooms R1, R2, R3, R4, and a single building B1. (b) Tree decomposition
Tree Decomposition of Hierarchical Graph.
Figure 3 shows an example of execution of Algorithm 1 for the graph in Figure 3(a). Figure 3(b) shows the (disconnected) tree decompositions produced by line 2 (which produces the single green bag B1) and the first execution of line 8. The proof verifies the following conditions:
(C1)
(C2) Every bag
(C3) For all edges
(C4) For every node
(C5) For every node
Algorithm 1 constructs a tree decomposition
We prove that, at any iteration
For C1, we note that
Theorem 2 above showed that we can easily build a tree decomposition of a hierarchical graph by cleverly “gluing together” tree decompositions of sub-graphs in the hierarchy. Now we want to use that result to bound the treewidth of a hierarchical graph
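Since the exact statements of conditions C1–C5 are abbreviated above, the sketch below checks the three textbook tree-decomposition conditions (node coverage, edge coverage, and the running-intersection property); it is a generic validity checker under those standard definitions, not a reproduction of the paper's proof.

```python
def is_tree_decomposition(graph_edges, nodes, bags, tree_edges):
    """Check the textbook tree-decomposition conditions:
    (1) every graph node appears in some bag;
    (2) both endpoints of every graph edge share some bag;
    (3) for each node, the bags containing it form a connected subtree."""
    covered = set().union(*bags.values()) if bags else set()
    if set(nodes) - covered:                      # (1) node coverage
        return False
    for u, v in graph_edges:                      # (2) edge coverage
        if not any(u in bag and v in bag for bag in bags.values()):
            return False
    adj = {t: set() for t in bags}                # adjacency of the tree
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)
    for v in nodes:                               # (3) running intersection
        holding = {t for t, bag in bags.items() if v in bag}
        stack, seen = [next(iter(holding))], set()
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(adj[t] & holding)
        if seen != holding:
            return False
    return True


# Width-2 tree decomposition of the 4-cycle 1-2-3-4-1.
bags = {"A": {1, 2, 3}, "B": {1, 3, 4}}
ok = is_tree_decomposition([(1, 2), (2, 3), (3, 4), (4, 1)],
                           [1, 2, 3, 4], bags, [("A", "B")])
# Putting the endpoints of an edge in separate bags violates condition (2).
bad = is_tree_decomposition([(1, 2)], [1, 2],
                            {"A": {1}, "B": {2}}, [("A", "B")])
```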
Treewidth of Hierarchical Graph. Intuitively, the proposition states that the treewidth of a hierarchical graph does not grow with the number of nodes in the graph, but rather with the treewidth (roughly speaking, the complexity) of each layer in the graph.
2.4. 3D scene graphs and indoor environments
While the considerations in the previous sections apply to generic hierarchical representations, in this section we tailor the discussion to a specific hierarchical representation, namely a
2.4.1. Choice of symbols for indoor navigation
We focus on indoor environments and assume the robot is tasked with navigating to certain objects and rooms in a building. Therefore, we include the following groups of symbols that form the
As we mentioned in Section 2.2, this choice of symbols and relations is dictated by the tasks we envision the robot to perform and therefore it is not universal. For instance, other tasks might require breaking down the free-space symbol into multiple symbols (e.g., to distinguish the front from the back of a room), or might require considering other agents (e.g., humans moving in the environment as in Rosinol et al. (2020b, 2021)). This dependence on the task also justifies the different definitions of a 3D scene graph found in recent literature: for instance, the original proposal in Armeni et al. (2019) focuses on visualization and human-machine interaction tasks, rather than robot navigation, hence the corresponding scene graphs do not include the free-space as a symbol; the proposals of Wu et al. (2021) and Kim et al. (2019) consider smaller-scale tasks and disregard room and building symbols. Similarly, the choice of relations is task dependent and can include a much broader set of relations beyond inclusion and adjacency. For instance, relations can describe attributes of an object (e.g., “has color”), material properties (e.g., “is made of”), can be used to compare entities (e.g., “is taller than”), or may encode actions (e.g., a person “carries” an object, a car “drives on” the road). While these other relations are beyond the scope of our paper, we refer the reader to Zhu et al. (2022) for further discussion.
2.4.2. Choice of sub-symbolic representations
We ground each symbol in compact sub-symbolic representations; intuitively, this reduces to attaching geometric attributes to each symbol observed by the robot. We ground the “obstacle” symbol into a 3D mesh describing the observed surfaces in the environment. We ground the “free-space” symbol using a places sub-graph, which can be understood as a topological map of the environment. Specifically, each place node in the graph of places is associated with an obstacle-free 3D location (more precisely, a sphere of free space described by a centroid and radius), while edges represent traversability between places. We ground the “agent” symbol using the pose graph describing the robot trajectory (Cadena et al., 2016). We ground each “object,” “room,” and “building” symbol using a centroid and a bounding box. Somewhat redundantly, we also store the mesh vertices corresponding to each object, and the set of places included in each room, which can also be understood as additional groundings for these symbols. As discussed in the next section, this is mostly a byproduct of our algorithms, rather than a deliberate choice of sub-symbolic representation.
2.4.3. 3D scene graphs for indoor environments
The choices of symbolic and sub-symbolic representations outlined above lead to the 3D scene graph structure visualized in Figure 1. In particular,
Our 3D scene graph definition has been shown to be a suitable model to support various indoor tasks in robotics. Ravichandran et al. (2022) show that the use of 3D scene graphs improves performance in RL-based object search. Other works have examined the benefits of hierarchical representations for planning efficiency, including path-planning (Rosinol et al., 2021) and symbolic task planning (Agia et al., 2022). More recently, Rana et al. (2023) use 3D scene graphs for indoor mobile manipulation tasks, using the hierarchy present in the representation to ground tasks specified in natural language. While these examples of early adoption of our 3D scene graph definition provide reassuring evidence of the “actionable” nature of our representation, we remind the reader that our definition is not intended to be universal to all indoor robotics tasks. Instead, we intend this definition to be a foundation that can be easily extended in a task-driven manner.
2.4.4. Treewidth of indoor 3D scene graphs
In Section 2.3 we concluded that the treewidth of a hierarchical graph is bounded by the treewidth of its layers. In this section, we particularize the treewidth bound to 3D scene graphs modeling indoor environments. We first prove bounds on the treewidth of key layers in the 3D scene graph (Lemmas 4 and 5). Then we translate these bounds into a bound on the treewidth of the object-room-building sub-graph of a 3D scene graph (Proposition 6); the bound is important in practice since we will need to perform inference over such a sub-graph to infer the room (and potentially building) labels, see Section 3.4.2.
We start by bounding the treewidth of the room layer.
Treewidth of Room Layer. The insight behind Lemma 4 is that the room sub-graph is not very far from a tree (i.e., it has low treewidth) as long as there are not many rooms with multiple doors (or passage ways). In particular, the room sub-graph has treewidth equal to 2 if each room has at most two passage ways to other rooms with multiple entries. Note that the lemma statement allows a room to be connected to an arbitrary number of rooms with a single entry (i.e., a corridor leading to multiple rooms that are only accessible via that corridor). Figure 4(a) reports empirical upper-bounds on the treewidth of the room sub-graphs in the 90 scenes of the Matterport3D dataset (Chang et al., 2017) (see Section 6 for more details about the experimental setup). The observed treewidth is at most 2, confirming that the conclusions of Lemma 4 hold in several apartment-like environments.

Figure 4. Histogram of the treewidth upper-bounds for (a) the room sub-graph and (b) the object sub-graph of the 3D scene graphs obtained from the 90 scenes of the Matterport3D dataset (Chang et al., 2017). We use the minimum-degree and minimum-fill-in heuristics to obtain treewidth upper-bounds (Bodlaender and Koster 2010), and report the lower of the two.
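For concreteness, here is a stdlib sketch of the minimum-degree elimination heuristic used to obtain such upper-bounds (the minimum-fill-in variant differs only in the vertex-selection rule); the example graphs are our own.

```python
def treewidth_upper_bound(adjacency):
    """Minimum-degree elimination heuristic: repeatedly eliminate a
    minimum-degree vertex and connect its neighbors into a clique.
    The largest eliminated neighborhood upper-bounds the treewidth."""
    adj = {v: set(ns) for v, ns in adjacency.items()}  # mutable copy
    width = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))  # minimum-degree vertex
        neighbors = adj.pop(v)
        width = max(width, len(neighbors))
        for u in neighbors:
            adj[u].discard(v)
            adj[u] |= neighbors - {u}  # add fill-in edges among neighbors
    return width


# A tree has treewidth 1 and a cycle has treewidth 2;
# the heuristic is exact on both of these small examples.
tree = {1: {2, 3}, 2: {1}, 3: {1, 4}, 4: {3}}
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
tw_tree = treewidth_upper_bound(tree)
tw_cycle = treewidth_upper_bound(cycle)
```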
Treewidth of Object Layer. The result is a simple consequence of the fact that, in our 3D scene graph, there is no edge connecting objects in different rooms; therefore, the graph of objects consists of disconnected components, one per room, each with treewidth bounded by the size of that component, i.e., the number of objects in the corresponding room. Figure 4(b) reports the treewidth upper-bounds for the object sub-graphs in the Matterport3D dataset. We observe that the treewidth of the object sub-graphs tends to be larger than that of the room sub-graphs, but still remains relatively small (below 20) in all tests. We can now conclude with a bound on the treewidth of the object-room-building graph in a 3D scene graph.
Treewidth of the Object-Room-Building Graph. Proposition 6 indicates that the treewidth of the object-room-building graph does not grow with the size of the scene graph, but rather depends on how cluttered each room is. This is in stark contrast with the treewidth of social network graphs (Maniu et al., 2019; Talak et al., 2021), and further motivates the proposed hierarchical organization. The treewidth bounds in this section open the door to tractable inference techniques; in particular, in Section 3.4.2, we show they allow applying novel graph-learning techniques, namely, the neural tree (Talak et al., 2021).
Directed vs. Undirected Graphs.
3. Real-time incremental 3D scene graph layers construction
This section describes how to estimate an
We start by reconstructing a metric-semantic 3D mesh to populate
3.1. Layer 1: Mesh
We build a metric-semantic 3D mesh (
In more detail, for each keyframe,
we use a 2D semantic segmentation network to obtain a pixel-wise semantic segmentation of the RGB image, and reconstruct a depth-map using stereo matching (when using a stereo camera) or from the depth channel of the sensor (when using an RGB-D camera). We then convert the semantic segmentation and depth into a semantically-labeled 3D point cloud and transform it according to the odometric estimate of the robot pose (Section 3.2). We use Voxblox (Oleynikova et al., 2017) to integrate the semantically-labeled point cloud into a Truncated Signed Distance Field (TSDF) using ray-casting, and Kimera (Rosinol et al., 2021) to perform Bayesian updates over the semantic label of each voxel during ray-casting. Both Voxblox (Oleynikova et al., 2017) and Kimera (Rosinol et al., 2021) operate over an
3.2. Layer 2: Objects and agents
3.2.1. Agents
The agent layer consists of the pose graph describing the robot trajectory (we refer the reader to Rosinol et al. (2021) for extensions to multiple agents, including humans). During exploration, the odometric pose graph is obtained using stereo or RGB-D visual-inertial odometry, which is also available in Kimera (Rosinol et al., 2020a, 2021). The poses in the pose graph correspond to the keyframes selected by Kimera, and for each pose we also store visual features and descriptors, which are used for loop closure detection in Section 4. As usual, edges in the pose graph correspond to odometry measurements between consecutive poses, while we will also add loop closures as described in Section 4.
3.2.2. Objects
The object layer consists of a graph where each node corresponds to an object (with a semantic label, a centroid, and a bounding box) and edges connect nearby objects. After extracting the 3D mesh within the active window, we segment objects by performing Euclidean clustering (Rusu 2009) on
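The clustering step can be illustrated with a brute-force stand-in (single-linkage grouping of points within a radius; a real implementation such as PCL's Euclidean clustering uses a k-d tree for the neighbor queries, and the points below are made up):

```python
def euclidean_clusters(points, radius):
    """Group 3D points so that points within `radius` of each other end
    up in the same cluster (single-linkage, brute-force union-find)."""
    r2 = radius * radius
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if d2 <= r2:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


# Two well-separated blobs of same-label mesh vertices
# yield two object candidates.
pts = [(0, 0, 0), (0.1, 0, 0), (0.2, 0.1, 0), (5, 5, 0), (5.1, 5, 0)]
objs = euclidean_clusters(pts, radius=0.5)
```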
3.3. Layer 3: Places
We build the places layer by computing a Generalized Voronoi Diagram (GVD) of the environment, and then sparsifying it into a graph of places. While the idea of sparsifying the GVD into a graph has appeared in related work (Oleynikova et al., 2018; Rosinol et al., 2021), these works extract such representations from a monolithic ESDF of the environment, a process that is computationally expensive and confined to offline use (Rosinol et al. (2021) reports computation times in the order of tens of minutes). Instead, we combine and adapt the approaches of Voxblox (Oleynikova et al., 2017), which presents an incremental version of the brushfire algorithm that converts a TSDF to an ESDF, and of Lau et al. (2013), who present an incremental version of the brushfire algorithm that constructs a GVD during the construction of the ESDF, but takes as input a 2D occupancy map. As a result, we show how to obtain a local GVD and a graph of places as a byproduct of the 3D mesh construction.
3.3.1. From voxels to the GVD
The GVD is a data structure commonly used in computer graphics and computational geometry to compress shape information (Tagliasacchi et al., 2016). The GVD is the set of voxels that are equidistant to at least two closest obstacles (“basis points” or “parents”), and intuitively forms a skeleton of the environment (Tagliasacchi et al., 2016) (see Figure 5(a)). Following the approach in Lau et al. (2013), we use the fact that voxels belonging to the GVD can be easily detected from the wavefronts of the brushfire algorithm used to create and update the ESDF in the active window. Intuitively, GVD voxels have the property that multiple wavefronts “meet” at a given voxel (i.e., fail to update the signed distance of the voxel), as these voxels are equidistant from multiple obstacles. The brushfire algorithm (and its incremental variant) are traditionally seeded at voxels containing obstacles (e.g., Lau et al. (2013)), but as mentioned previously, we follow the approach of Oleynikova et al. (2017) and use the TSDF to seed the brushfire wavefronts instead. In particular, we track all TSDF voxels that correspond to zero-crossings (i.e., voxels containing a surface) when creating
Figure 5. (a) GVD (wireframe) of an environment. GVD voxels having three or more basis points are shown in red, while GVD voxels with two basis points are shown in yellow. Note that voxels with three or more basis points correspond to topological features in the GVD such as edges. (b) Sparse places graph (in black) and the associated spheres of free-space, shown in red. Note that the union of the spheres roughly approximates the geometry of the free-space.
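The wavefront intuition can be illustrated with a toy 2D version (this is our own simplification: an integer-distance BFS on an occupancy grid rather than the incremental 3D TSDF/ESDF machinery of Voxblox and Lau et al. (2013)):

```python
from collections import deque


def gvd_2d(grid):
    """Toy 2D stand-in for GVD extraction: a multi-source BFS
    ("brushfire") from obstacle cells labels every free cell with its
    nearest obstacle (its parent). Free cells where wavefronts from
    different, non-adjacent parents meet are marked as GVD cells."""
    H, W = len(grid), len(grid[0])
    parent, q = {}, deque()
    for r in range(H):
        for c in range(W):
            if grid[r][c] == 1:            # obstacle seeds the wavefront
                parent[(r, c)] = (r, c)
                q.append((r, c))
    while q:                               # brushfire expansion
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            n = (r + dr, c + dc)
            if 0 <= n[0] < H and 0 <= n[1] < W and n not in parent:
                parent[n] = parent[(r, c)]
                q.append(n)
    gvd = set()
    for (r, c), p in parent.items():
        if grid[r][c] == 1:
            continue
        for dr, dc in ((1, 0), (0, 1)):    # compare each free-cell pair once
            n = (r + dr, c + dc)
            if n in parent and grid[n[0]][n[1]] == 0:
                pn = parent[n]
                # Wavefronts met: parents differ and are not adjacent cells.
                if pn != p and abs(pn[0] - p[0]) + abs(pn[1] - p[1]) > 1:
                    gvd.add((r, c))
                    gvd.add(n)
    return gvd


# A corridor between two walls: the skeleton runs down the middle.
corridor = [[1, 0, 0, 0, 1]] * 4
skeleton = gvd_2d(corridor)
```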
3.3.2. From the GVD to the places graph
While the GVD already provides a compressed representation of the free-space, it still typically contains a large number of voxels (e.g., more than 10,000 in the Office dataset considered in Section 6). We could instantiate a graph of places with one node for each GVD voxel and edges between nearby voxels, but such a representation would be too large to manipulate efficiently (e.g., our room segmentation approach in Section 3.4.1 would not scale to operating on the entire GVD). Previous attempts at sparsifying the GVD (notably Oleynikova et al. (2018) and our earlier paper Hughes et al. (2022)) used a subset of topological features (edges and vertices in the GVD, which correspond to GVD voxels having more than three and more than four basis points, respectively) to form a sparse graph of places. However, the resulting graph does not capture the same connectivity of free-space as the full GVD, and may lead to a graph of places with multiple connected components. It would also be desirable to let the user balance the level of compression against the quality of the free-space approximation, instead of always picking certain GVD voxels.
To resolve these issues, we first form a graph
These proposed nodes and edges are assigned information from the voxels that generated them: each new or updated node is assigned a position and distance to the nearest obstacle from the GVD voxel with the most basis points in the cluster; each new or updated edge is assigned a distance that is the minimum distance to the nearest obstacle of all the GVD voxels that the edge passes through. We denote this stored distance in each node and edge as
3.3.3. Inter-layer edges
After building the places graph, we add inter-layer edges from each object or agent node to the nearest place node in the active window. To accomplish this, we use nanoflann (Blanco and Rai 2014) as a nearest-neighbor search structure to find the nearest place node to each object or agent node position.
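Sketched with a brute-force search standing in for nanoflann's k-d tree (place and object coordinates below are illustrative):

```python
def nearest_place(position, places):
    """Return the id of the place node closest to `position`.
    places: dict mapping place id -> (x, y, z) centroid.
    Brute force for clarity; the system uses nanoflann's k-d tree."""
    return min(places, key=lambda pid: sum(
        (a - b) ** 2 for a, b in zip(position, places[pid])))


places = {"p1": (0.0, 0.0, 0.0), "p2": (4.0, 0.0, 0.0)}
objects = {"chair": (0.5, 0.2, 0.0), "table": (3.5, 0.0, 0.0)}

# One inter-layer edge per object, to its nearest place node.
inter_layer_edges = {obj: nearest_place(pos, places)
                     for obj, pos in objects.items()}
```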
3.4. Layer 4: Rooms
We segment the rooms by first geometrically clustering the graph of places into separate rooms (using persistent homology, Section 3.4.1), and then assigning a semantic label to each room (using the neural tree, Section 3.4.2). We remark that the construction of the room layer is fundamentally different from the construction of the object layer. While we can directly observe the objects (e.g., via a network trained to detect certain objects), we do not have a network trained to detect certain rooms. Instead, we have to rely on prior knowledge to infer both their geometry and semantics. For instance, we exploit the fact that rooms are typically separated by small passageways (e.g., doors), and that their semantics can be inferred from the objects they contain.
3.4.1. Layer 4: Room clustering via persistent homology
This section discusses how to geometrically cluster the environment into different rooms. Many approaches for room segmentation or detection (including Rosinol et al. (2021)) require volumetric or voxel-based representations of the full environment and make assumptions on the environment geometry (i.e., a single floor or ceiling height) that do not easily extend to arbitrary buildings. To resolve these issues, we present a novel approach for constructing
3.4.1.1. Identifying rooms via persistent homology
We use the key insight that dilating the voxel-based map helps expose rooms in the environment: if we inflate obstacles, apertures in the environment (e.g., doors) will gradually close, naturally partitioning the voxel-based map into disconnected components (i.e., rooms); however, we apply the same insight to the graph of places
Connected components (in orange and red) in the sub-graph of places induced by a dilation distance
Directly specifying a single best dilation distance
(a) Example of Betti curve; we only consider components with at least 15 nodes. (b–d) Example of filtration of the graph for various thresholds

The resulting mapping between the dilation distance and the number of connected components is an example of a Betti curve.
Choices of rooms that are more persistent correspond to large “flat” horizontal regions of the Betti curve, where the number of connected components stays the same across a large subset of dilation distances.
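The dilation-and-count procedure can be sketched on a toy places graph (our own example; the paper's version also discards components with fewer than 15 nodes, which we omit here):

```python
def components(nodes, edges):
    """Count connected components with a simple union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in nodes})


def betti0_curve(radii, edges, dilations):
    """Dilating obstacles by d removes every place whose free-space
    radius is <= d; return the component count at each dilation."""
    curve = []
    for d in dilations:
        alive = {v for v, r in radii.items() if r > d}
        kept = [(u, v) for u, v in edges if u in alive and v in alive]
        curve.append(components(alive, kept))
    return curve


# Two rooms joined by a narrow door (small free-space radius at "door").
radii = {"a": 2.0, "b": 2.0, "door": 0.5, "c": 2.0, "d": 2.0}
edges = [("a", "b"), ("b", "door"), ("door", "c"), ("c", "d")]
curve = betti0_curve(radii, edges, [0.0, 0.4, 0.8, 1.2, 1.6])
# Once the door closes, the two components (rooms) persist over a long
# interval of dilation distances: a flat region of the Betti curve.
```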
3.4.1.2. Assigning remaining nodes via flood-fill
The connected components produced by the persistent homology do not account for all places in
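The flood-fill step can be sketched as a BFS over the places graph (the adjacency and seed labels below are illustrative; ties between rooms go to whichever wavefront reaches the node first):

```python
from collections import deque


def flood_fill_rooms(adjacency, room_of):
    """Propagate room labels from labeled place nodes (the persistent
    components) to unlabeled place nodes via breadth-first search."""
    labels = dict(room_of)
    q = deque(labels)
    while q:
        v = q.popleft()
        for n in adjacency[v]:
            if n not in labels:
                labels[n] = labels[v]  # inherit label of nearest component
                q.append(n)
    return labels


# Places "a" and "d" were labeled by the persistence step; the rest
# (including the door node removed by the dilation) are filled in.
adjacency = {"a": ["b"], "b": ["a", "door"], "door": ["b", "c"],
             "c": ["door", "d"], "d": ["c"]}
labels = flood_fill_rooms(adjacency, {"a": "R1", "d": "R2"})
```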
Novelty, Advantages, and Limitations.
3.4.1.3. Intra-layer and inter-layer edges
We connect pairs of rooms, say (
3.4.2. Layer 4: Room classification via neural tree
While in the previous section, we (geometrically) clustered places nodes into different rooms, the approach described in this section assigns a semantic label (e.g., kitchen, bedroom) to each room. The key insight that we are going to use is that objects in a room are typically correlated with the room type (e.g., a room that contains a refrigerator, an oven, and a toaster is likely to be a kitchen; a room with a bathtub and a toilet is likely to be a bathroom). Edges connecting rooms may also carry information about the room types (e.g., a master bedroom is more likely to be connected to a bathroom). We construct a sub-graph of the 3D scene graph that includes the object and the room layers and the corresponding intra- and inter-layer edges. Given this graph, and the object semantic labels, centroids, and bounding boxes (see Section 3.2), we infer the semantic labels of each room.
While room inference could be attacked via standard techniques for inference in probabilistic graphical models, those techniques typically require handcrafting or estimating expressions for the factors connecting the nodes, inducing a probability distribution over the labels in the graph. We instead rely on more modern techniques that use
3.4.2.1. Neural tree overview
While traditional GNNs perform neural message passing on the edges of the given graph
3.4.2.2. Constructing the H-Tree
The neural tree performs message passing on the H-tree, a tree-structured graph constructed from the input graph. Each node in the H-tree corresponds to a sub-graph of the input graph. These sub-graphs are arranged hierarchically in the H-tree such that the parent of a node in the H-tree always corresponds to a larger sub-graph in the input graph. The leaf nodes in the H-tree correspond to singleton subsets (i.e., individual nodes) of the input graph.
The first step in constructing an H-tree is to compute a tree decomposition of the input graph.
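For readers unfamiliar with tree decompositions, the following sketch states the three defining properties and checks them on a toy graph; the checker and the example bags are ours, for illustration only. A tree decomposition assigns a “bag” of graph vertices to each tree node such that (1) every graph vertex appears in some bag, (2) both endpoints of every graph edge appear together in some bag, and (3) the bags containing any given vertex form a connected subtree (the running-intersection property).

```python
def is_tree_decomposition(graph_nodes, graph_edges, bags, tree_edges):
    # (1) vertex coverage: every graph vertex appears in some bag
    if set(graph_nodes) != set().union(*bags.values()):
        return False
    # (2) edge coverage: every graph edge fits inside some bag
    for u, v in graph_edges:
        if not any({u, v} <= bag for bag in bags.values()):
            return False
    # (3) running intersection: bags holding v form a connected subtree.
    # Since tree_edges form a tree, any induced edge subset is a forest,
    # so k nodes with k - 1 edges must be connected.
    for v in graph_nodes:
        holders = {b for b, bag in bags.items() if v in bag}
        internal = sum(1 for a, b in tree_edges if a in holders and b in holders)
        if internal != len(holders) - 1:
            return False
    return True

# A 4-cycle o1-o2-o3-o4-o1 (say, four objects in a room) admits a
# decomposition with two overlapping bags of size 3 (treewidth 2):
bags = {0: {"o1", "o2", "o3"}, 1: {"o1", "o3", "o4"}}
tree_edges = [(0, 1)]
```

The treewidth of the graph is one less than the size of the largest bag in the best decomposition; it reappears in the expressiveness result of Section 3.4.2.4.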
3.4.2.3. Message passing and node classification
Message passing on the H-tree generates embeddings for all the nodes and important sub-graphs of the input graph. Any of the existing message passing protocols (e.g., the ones used in Graph Convolutional Networks (GCN) (Bronstein et al., 2017; Defferrard et al., 2016; Henaff et al., 2015; Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017), or Graph Attention Networks (GAT) (Busbridge et al., 2019; Lee et al., 2019; Veličković et al., 2018)) can be re-purposed to operate on the neural tree. We provide an ablation of different choices of message passing protocols and node features in Section 6. After message passing is complete, the final label for each node is extracted by pooling embeddings from all leaf nodes in the H-tree corresponding to the same node in the input graph, as in Talak et al. (2021).
One important difference between the H-tree in Talak et al. (2021), and the H-tree constructed for the object-room graph is the heterogeneity of the latter. The H-tree of a heterogeneous graph will also be heterogeneous, i.e., the H-tree will now contain nodes that correspond to various kinds of sub-graphs in the input object-room graph. Specifically, the H-tree has nodes that correspond to sub-graphs: (i) containing only room nodes, (ii) containing one room node and multiple object nodes, (iii) containing only object nodes, and (iv) leaf nodes which correspond to either an object or a room node. Accordingly, we treat the neural tree as a heterogeneous graph when performing message passing. Message passing over heterogeneous graphs can be implemented using off-the-shelf functionalities in the PyTorch geometric library (Fey and Lenssen 2019).
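To make heterogeneous message passing and leaf pooling concrete, here is a minimal numpy sketch (Hydra uses PyTorch Geometric; the node-type names, mean aggregation, and tanh nonlinearity below are illustrative assumptions): each H-tree node type gets its own linear transform, and after message passing the embeddings of all leaves that copy the same input-graph node are averaged.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # embedding dimension
types = ["room_clique", "room_object_clique", "object_leaf", "room_leaf"]
# one (untrained, random) weight matrix per node type
W = {t: rng.standard_normal((D, D)) * 0.1 for t in types}

def message_pass(x, node_type, tree_edges, rounds=2):
    # x: dict node -> feature vector; tree_edges: undirected H-tree edges
    nbrs = {n: [] for n in x}
    for a, b in tree_edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    for _ in range(rounds):
        new_x = {}
        for n in x:
            agg = np.mean([x[m] for m in nbrs[n]], axis=0) if nbrs[n] else x[n]
            # type-specific transform followed by a nonlinearity
            new_x[n] = np.tanh(W[node_type[n]] @ (x[n] + agg))
        x = new_x
    return x

def pool_leaves(x, leaf_of):
    # average embeddings of all H-tree leaves that copy the same input node
    out = {}
    for leaf, orig in leaf_of.items():
        out.setdefault(orig, []).append(x[leaf])
    return {n: np.mean(v, axis=0) for n, v in out.items()}
```

In practice the per-type transforms are learned, and attention-based aggregation (GAT) replaces the plain mean.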
3.4.2.4. Expressiveness of the neural tree and graph treewidth
The following result, borrowed from our previous work (Talak et al., 2021), establishes a connection between the expressive power of the neural tree and the treewidth of the corresponding graph.
Neural Tree Expressiveness, Theorem 7 and Corollary 8 in Talak et al. (2021). While we refer the reader to Talak et al. (2021) for a more extensive discussion, the intuition is that graph-compatible functions can model (the logarithm of) any probability distribution over the given graph. Hence, Theorem 9 essentially states that the neural tree can learn any (sufficiently well-behaved) graphical model over the given graph.
4. Persistent representations: Detecting and enforcing loop closures in 3D scene graphs
The previous section discussed how to estimate the layers of an “odometric” 3D scene graph as the robot explores an unknown environment. In this section, we discuss how to use the 3D scene graph to
4.1. Loop closure detection and geometric verification
We augment visual loop closure detection and geometric verification by using information across multiple layers in the 3D scene graph. Standard approaches for visual place recognition rely on visual features (e.g., SIFT, SURF, ORB) and fast retrieval methods (e.g., bag of words (Gálvez-López and Tardós 2012)) to detect loop closures. Advantageously, the 3D scene graph not only contains visual features (included in each node of the agent layer), but also additional information about the semantics of the environment (described by the object layer) and the geometry and topology of the environment (described by the places layer). In the following we discuss how to use this additional 3D information to develop better descriptors for loop closure detection and geometric verification.
4.1.1. Top-down loop closure detection
As mentioned in Section 3.2, the agent layer stores visual features for each keyframe pose along the robot trajectory. We refer to each such pose as an agent node.
4.1.1.1. Top-down loop closure detection overview
For each agent node, we construct a hierarchy of descriptors describing statistics of the node’s surroundings, from low-level appearance to semantics and geometry. At the lowest level, our hierarchical descriptors include standard DBoW2 appearance descriptors (Gálvez-López and Tardós 2012). We augment the appearance descriptor with an object-based descriptor and a place-based descriptor computed from the objects and places in a sub-graph surrounding the agent node. We provide details about two choices of descriptors (hand-crafted and learning-based) below. To detect loop closures, we compare the hierarchical descriptor of the current (query) node with all the past agent node hierarchical descriptors, searching for a match. When comparing descriptors, we walk down the hierarchy of descriptors (from places, to objects, to appearance descriptors). In particular, we first compare the places descriptor and—if the descriptor distance is below a threshold—we move on to comparing object descriptors and then appearance descriptors. If any of the appearance or object descriptor comparisons return a strong enough match (i.e., if the distance between two descriptors is below a threshold), we perform geometric verification; see Figure 8 for a summary.
[Figure 8: Loop closure detection (left) and geometric verification (right). To find a match, we “descend” the 3D scene graph layers, comparing descriptors. We then “ascend” the 3D scene graph layers, attempting registration.]
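The matching cascade described above can be sketched as follows; the thresholds and the L1 distance are made-up placeholders (the actual descriptor distances and thresholds are tuned in Hydra), but the control flow mirrors the text: prune cheaply at the places level, then tighten the comparison down the hierarchy.

```python
def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cascade_match(query, candidate, thr_places=0.5, thr_objects=0.4, thr_appear=0.3):
    # query/candidate: dicts with 'places', 'objects', 'appearance' descriptors.
    # Returns True when the candidate should go to geometric verification.
    if l1_distance(query["places"], candidate["places"]) > thr_places:
        return False  # cheap rejection at the top of the hierarchy
    if l1_distance(query["objects"], candidate["objects"]) <= thr_objects:
        return True   # strong object-level match
    return l1_distance(query["appearance"], candidate["appearance"]) <= thr_appear
```

Ordering the comparisons from coarse (places) to fine (appearance) lets most non-matching candidates be discarded before the expensive appearance comparison.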
4.1.1.2. Hand-crafted scene graph descriptors
Top-down loop closure detection relies on having descriptors (i.e., vector embeddings) of the sub-graphs of objects and places around each agent node. In the conference version (Hughes et al., 2022) of this paper, we proposed hand-crafted descriptors. In particular, for the objects, we use the histogram of the semantic labels of the object nodes in the sub-graph as an object-level descriptor. For the places, we use the histogram of the distances associated with each place node in the sub-graph as a place-level descriptor. As shown in Hughes et al. (2022) and confirmed in Section 6, the resulting hierarchical descriptors already lead to improved loop closure detection performance over traditional appearance-based loop closures. However, these descriptors fail to capture relevant information about objects and places, e.g., their spatial layout and connectivity. In the following, we describe learning-based descriptors that use graph neural networks to automatically find a suitable embedding for the object and place sub-graphs; these are observed to further improve loop closure detection performance in some cases; see Section 6.
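The hand-crafted descriptors amount to normalized histograms; the sketch below shows one plausible realization (bin edges and the normalization choice are our assumptions, not necessarily those of Hughes et al. (2022)).

```python
def object_descriptor(labels, num_classes):
    # normalized histogram of semantic labels over the object sub-graph
    hist = [0.0] * num_classes
    for l in labels:
        hist[l] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def place_descriptor(distances, edges=(0.0, 0.5, 1.0, 1.5, 2.0)):
    # normalized histogram of obstacle distances over the place sub-graph;
    # the last bin is open-ended (distances beyond the last edge)
    hist = [0.0] * len(edges)
    for d in distances:
        idx = sum(1 for e in edges[1:] if d >= e)
        hist[idx] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]
```

Normalizing by the sub-graph size makes the descriptors comparable across sub-graphs with different numbers of nodes, though, as noted above, histograms discard the spatial layout that the learned descriptors can exploit.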
4.1.1.3. Learning-based scene graph descriptors
Given a sub-graph of objects and places around the agent node, we learn fixed-size embeddings using a Graph Neural Network (GNN). At a high level, we learn such embeddings from scene graph datasets, such that the Euclidean distance between descriptors is smaller if the corresponding agent nodes are spatially close.
In more detail, we learn separate embeddings for the sub-graph of objects and the sub-graph of places. For every object layer sub-graph, we encode the bounding-box size and semantic label of each object as node features. For every places layer sub-graph, we encode the distance of the place node to the nearest obstacle and the number of basis points of the node as node features. Rather than including absolute node positions in the respective node features, we assign a weight to each edge (
4.1.2. Bottom-up geometric verification
After we have a putative loop closure between our query and match agent nodes (say
4.2. 3D scene graph optimization
This section describes a framework to correct the entire 3D scene graph in response to putative loop closures. Assume we use the algorithms in Section 3 to build an “odometric” 3D scene graph, which drifts over time as it is built from the odometric trajectory of the robot—we refer to this as the odometric 3D scene graph.
[Figure: Loop closure detection and optimization: (a) after a loop closure is detected, (b) we extract and optimize a sub-graph of the 3D scene graph—the deformation graph.]
4.2.1. 3D scene graph optimization
We propose an approach to simultaneously deform the 3D scene graph layers using an embedded deformation graph approach (Sumner et al., 2007).
Specifically, we form the deformation graph as the sub-graph of the 3D scene graph that includes (i) the agent layer, consisting of a pose graph that includes both odometry and loop closures edges, (ii) the
The embedded deformation graph approach associates a local frame (i.e., a pose) to each node in the deformation graph and then solves an optimization problem to adjust the local frames in a way that minimizes deformations associated to each edge (including loop closures). Let us call
We can find an optimal configuration for the poses
In hindsight, 3D scene graph optimization transforms a subset of the 3D scene graph into a
4.2.2. Interpolation and reconciliation
Once the optimization terminates, the agent and place nodes are updated with their new (optimized) positions, and the full mesh is interpolated back from its control points according to the deformation graph approach in Sumner et al. (2007) and Rosinol et al. (2021). After the 3D scene graph optimization and the interpolation step, certain portions of the scene graph—corresponding to areas revisited multiple times by the robot—contain redundant information. To avoid this redundancy, we merge overlapping nodes. For place nodes, we merge nodes within a distance threshold (0.4 m in our implementation). For object nodes, we merge nodes if the corresponding objects have the same semantic label and if one of the nodes is contained inside the bounding box of the other node. After this process is complete, we recompute the object centroids and bounding boxes from the positions of the corresponding vertices in the optimized mesh. Finally, we recompute the rooms from the graph of places using the approach in Section 3.4.
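The merging rules can be sketched as a greedy pass over the nodes (the thresholds come from the text; the greedy ordering and the box-containment test are our assumptions about the implementation):

```python
def merge_places(places, thresh=0.4):
    # places: list of (node_id, (x, y, z)); greedily keep the first node
    # seen and drop later nodes within the distance threshold
    merged = []
    for pid, pos in places:
        for _, mpos in merged:
            if sum((a - b) ** 2 for a, b in zip(pos, mpos)) ** 0.5 < thresh:
                break
        else:
            merged.append((pid, pos))
    return merged

def contains(box_outer, box_inner):
    # axis-aligned containment: each min/max corner of the inner box
    # lies inside the outer box
    (lo1, hi1), (lo2, hi2) = box_outer, box_inner
    return all(l1 <= l2 and h2 <= h1
               for l1, h1, l2, h2 in zip(lo1, hi1, lo2, hi2))

def merge_objects(objects):
    # objects: list of (obj_id, label, (min_corner, max_corner));
    # drop an object if a kept object with the same label contains it
    kept = []
    for oid, label, box in objects:
        if any(l == label and contains(b, box) for _, l, b in kept):
            continue
        kept.append((oid, label, box))
    return kept
```

After merging, centroids and bounding boxes are recomputed from the optimized mesh, so the greedy choice of which duplicate survives has little lasting effect.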
5. Thinking fast and slow: The Hydra architecture
We integrate the algorithms described in this paper into a highly parallelized architecture, named Hydra.
We visualize Hydra in Figure 10. Each block in the figure denotes an algorithmic module, matching the discussion in the previous sections. Hydra starts with fast
[Figure 10: Hydra’s functional blocks. We conceptualize three different functional block groupings: low-level perception, mid-level perception, and high-level perception, in order of increasing latency. Each functional block is labeled with a number that identifies the “logical” thread that the module belongs to.]
Hydra runs in real-time on a multi-core CPU. The only module that relies on GPU computing is the 2D semantic segmentation, which uses a standard off-the-shelf deep network. The neural tree (Section 3.4.2) and the GNN-based loop closure detection (Section 4.1) can be optionally executed on a GPU, but the forward pass is relatively fast even on a CPU (see Section 6.3.4). The fact that most modules run on CPU has the advantage of (i) leaving the GPU to learning-oriented components, and (ii) being compatible with the power limitations imposed by current mobile robots. In the next section, we will report real-time results with Hydra running on a mobile robot, a Unitree A1 quadruped, with onboard sensing and computation (an NVIDIA Xavier embedded computer).
6. Experiments
The experiments in this section (i) qualitatively and quantitatively compare the 3D scene graph produced by Hydra to another state-of-the-art 3D scene graph construction method, SceneGraphFusion (Wu et al., 2021), (ii) examine the performance of Hydra in comparison to batch offline methods, i.e., Kimera (Rosinol et al., 2021), (iii) validate design choices for learned components in our method via ablation studies, and (iv) present a runtime analysis of Hydra. We also document our experimental setup, including training details for both the GNN-based loop closure descriptors and the neural-tree room classification, and datasets used. Our implementation of Hydra is available at https://github.com/MIT-SPARK/Hydra.
6.1. Datasets
We use four primary datasets for training and evaluation: two simulated datasets (Matterport3D (Chang et al., 2017) and uHumans2 (Rosinol et al., 2021)) and two real-world datasets (SidPac and Simmons). In addition, we use the Stanford3D dataset (Armeni et al., 2019) to motivate neural tree design choices with respect to our initial proposal in Talak et al. (2021). Note that Matterport3D, uHumans2, and Stanford3D are all publicly available.
6.1.1. Matterport3D
We utilize the Matterport3D (MP3D) dataset (Chang et al., 2017), an RGB-D dataset consisting of 90 reconstructions of indoor building-scale scenes. We use the Habitat Simulator (Savva et al., 2019) to traverse the scenes from the MP3D dataset and render color imagery, depth, and ground-truth 2D semantic segmentation. We generate two different 3D scene graph datasets for the 90 MP3D scenes by running Hydra on pre-generated trajectories: one for training descriptors and one for room classification. For training the GNN-based descriptors for loop closure detection (GNN-LCD), we generate a single trajectory for each scene through the navigable scene area such that we get coverage of the entire scene, resulting in 90 scene graphs. For training the room classification approaches, we generate 5 trajectories for each scene by randomly sampling navigable positions until a total path length of at least 100 m is reached, resulting in 450 trajectories. When running Hydra on these 450 trajectories, we save intermediate scene graphs every 100 timesteps (resulting in roughly 15 scene graphs per trajectory), giving us 6810 total scene graphs.
6.1.2. uHumans2
The uH2 dataset is a Unity-based simulated dataset (Rosinol et al., 2021) that includes four scenes: a small apartment, an office, a subway station, and an outdoor neighborhood. For the purposes of this paper, we only use the apartment and office scenes. The dataset provides visual-inertial data, ground-truth depth, and 2D semantic segmentation. The dataset also provides ground truth trajectories that we use for benchmarking.
6.1.3. SidPac
The SidPac dataset is a real dataset collected in a graduate student residence using a visual-inertial hand-held device. We used a Kinect Azure camera as the primary collection device, providing color and depth imagery, with an Intel RealSense T265 rigidly attached to the Kinect to provide an external odometry source. The dataset consists of two separate recordings, both of which are used in our previous paper Hughes et al. (2022). We only use the first recording for the purposes of this paper. This first recording covers two floors of the building (Floors 1 & 3), where we walked through a common room, a music room, and a recreation room on the first floor of the graduate residence, went up a stairwell, through a long corridor as well as a student apartment on the third floor, then finally down another stairwell to revisit the music room and the common room, ending where we started. These scenes are particularly challenging given the scale of the scenes (average traversal of around 400 m), the prevalence of glass and strong sunlight in regions of the scenes (causing partial depth estimates from the Kinect), and feature-poor regions in hallways. We obtain a proxy for the ground-truth trajectory for the Floor 1 & 3 scene via a hand-tuned pose graph optimization with additional height priors, to reduce drift and qualitatively match the building floor plans.
6.1.4. Simmons
The Simmons dataset is a real dataset collected on a single floor of an undergraduate student residence with a Clearpath Jackal rover and a Unitree A1 quadruped robot (Figure 11). The Clearpath Jackal rover uses the RealSense D455 camera as the primary collection device to provide color and depth imagery, but the rover is also equipped with a Velodyne to perform LIDAR-based odometry (Reinke et al., 2022) as an external odometry source. The A1 also uses the RealSense D455 camera as the primary collection device, but is not equipped with a Velodyne; instead, it is equipped with an industrial grade Microstrain IMU to improve the performance of the visual-inertial odometry (Rosinol et al., 2020a). The dataset consists of two recordings, one recorded on the Jackal, and one recorded using the A1. The Jackal recording covers the rooms scattered throughout half of a single floor of the building, where the Jackal traverses a distance of around 500 m through mostly student bedrooms, but also a lounge area, a kitchen, and a laundry room; the rooms are all joined by a long hallway that spans the full floor. The A1 sequence takes place on one end of the floor and maps 4 bedrooms, a small lounge, and a section of the hallway that connects all the rooms. The dataset is challenging due to the visual and structural similarity across student rooms that have similar layouts and furnishing. We obtain a proxy for the ground-truth trajectory for the Jackal using LIDAR-based SLAM, by running LOCUS (Reinke et al., 2022) with a flat-ground assumption and LAMP (Chang et al., 2022) for loop closure corrections. The ground-truth trajectory of the A1 is obtained by registering individual visual keyframes with the visual keyframes in the Jackal sequence, and then corrected using the proxy ground-truth of the Jackal sequence.
[Figure 11: (a) The Clearpath Jackal platform used to record one of the Simmons sequences. The Jackal is equipped with both a RealSense D455 camera and a Velodyne LIDAR. (b) The Unitree A1 platform used to record the second Simmons sequence. The A1 is equipped with a RealSense D455 camera and a Microstrain IMU. Both platforms have an Nvidia Xavier NX embedded computer onboard.]
6.1.5. Stanford3D
We use the 35 human-verified scene graphs from the Stanford 3D Scene Graph (Stanford3d) dataset (Armeni et al., 2019) to compare the neural tree against standard graph neural networks for node classification and to assess new design choices against our initial proposal in Talak et al. (2021). These scene graphs represent individual residential units, and each consists of building, room, and object nodes with inter-layer connectivity. We use the same pre-processed graphs as in Talak et al. (2021), where the single-type building nodes (residential) are removed and an additional 4920 intra-layer object edges are added to connect nearby objects in the same room. This results in 482 room-object graphs, each containing one room and at least one object. The full dataset has 482 room nodes with 15 semantic labels, and 2338 objects with 35 labels.
6.2. Experimental setup
In this section, we first discuss the implementation of Hydra, including our choice of networks for the 2D semantic segmentation. Then we provide details regarding the training of our learning-based approaches: the GNN-LCD descriptors and the neural tree for room classification.
6.2.1. Hydra
To provide 2D semantic segmentation for the real-world datasets, we compare three different models, all of which use the ADE20k (Zhou et al., 2017) label space. We use HRNet (Wang et al., 2021) as a “nominal” semantic segmentation source and MobileNetV2 (Sandler et al., 2018) as a light-weight source of semantics for use on a robot. For both HRNet and MobileNetV2, we use the pre-trained model from the MIT Scene Parsing challenge (Zhou et al., 2017) to export a deployable model for our inference toolchain (ONNX and TensorRT). Additionally, we use a state-of-the-art model, OneFormer (Jain et al., 2023), to provide more-accurate-but-slower 2D semantic segmentation. A comparison of the accuracy and frame-rate of these models is shown in Figure 12.
[Figure 12: (a) A sample RGB image from SidPac Floor 1-3. (b–d) 2D semantic segmentation from OneFormer (Jain et al., 2023), HRNet (Wang et al., 2021), and MobileNetV2 (Sandler et al., 2018). The framerate of each approach is overlaid on the images for both the GPU in the workstation used for evaluation (an Nvidia RTX 3080) and a less powerful embedded computer (the Nvidia Xavier NX). OneFormer was not tested on the Xavier NX.]
For both simulated and real datasets we use Kimera-VIO (Rosinol et al., 2020a) for visual-inertial odometry. For SidPac, we fuse the Kimera-VIO estimates with the output of the RealSense T265 to improve the quality of the odometric trajectory. For Simmons, we fuse the Kimera-VIO estimates with the odometry output of Reinke et al. (2022) to improve the quality of the odometric trajectory for the sequence recorded by the Jackal platform.
All the remaining blocks in Figure 10 are implemented in C++, following the approach described in this paper. In the experiments we primarily use a workstation with an AMD Ryzen Threadripper 3960X with 24 cores and two Nvidia RTX 3080s. We also report timing results on an embedded computer (Nvidia Xavier NX) at the end of this section, and demonstrate the full Hydra system running on the same embedded computer onboard the A1 robot; see the video attachment.
6.2.2. GNN-LCD training
We use the MP3D scene graphs to train loop closure descriptors for the object and place sub-graphs. We generate a sub-graph around every agent pose which consists of all nodes and edges within a provided radius of the parent place node that the agent pose is nearest to. We adaptively determine the radius for each sub-graph; each sub-graph has a minimum radius of 3 m and is grown until either a minimum number of nodes (10) or a maximum radius of 5 m is reached. Each object and place sub-graph contains the node and edge features as described in Section 4.1. The semantic label of each object node is encoded either using word2vec (Mikolov et al., 2013) or a one-hot encoding. We explore the impact of this label encoding in Section 6.3.
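The adaptive radius rule can be sketched as follows; the parameters (3 m minimum, 5 m maximum, 10 nodes) come from the text, while the growth step size and the plain Euclidean-ball formulation are our assumptions about the implementation.

```python
def extract_subgraph(center, positions, min_radius=3.0, max_radius=5.0,
                     min_nodes=10, step=0.5):
    # grow the radius from min_radius until enough nodes are captured
    # or max_radius is reached
    def within(r):
        cx, cy, cz = positions[center]
        return [n for n, (x, y, z) in positions.items()
                if ((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2) ** 0.5 <= r]

    r = min_radius
    nodes = within(r)
    while len(nodes) < min_nodes and r < max_radius:
        r = min(r + step, max_radius)
        nodes = within(r)
    return nodes
```

Growing the radius in sparse regions keeps the sub-graphs informative enough for the GNN, while the cap at 5 m bounds the descriptor's receptive field.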
We use triplet loss to train our model. To construct the triplets, we have an anchor (a candidate sub-graph), and need to find positive and negative examples to compute the loss. A good proxy for the similarity of two sub-graphs is the spatial overlap of their node sets. We compute this overlap by taking the IoU of the bounding boxes that encompass the node positions of each sub-graph. If the overlap between two sub-graphs is at least 40%, we consider them to be similar (and candidates for positive examples); otherwise, they are considered negative examples for that particular anchor. Sub-graphs from different scenes are always considered negative examples. To train our GNN-LCD descriptors, we use online triplet mining with batch-all triplet loss (Schroff et al., 2015), where we construct valid triplets for a batch of sub-graph input embeddings and average the loss over the triplets that produce a positive loss.
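The positive/negative labeling rule reduces to a bounding-box IoU test; the 2D axis-aligned formulation below is a simplification for illustration (the 40% threshold and the different-scene rule come from the text).

```python
def bbox(points):
    # axis-aligned bounding box over 2D node positions
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def iou(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def is_positive(subgraph_a, subgraph_b, same_scene, thresh=0.4):
    # sub-graphs from different scenes are always negatives
    return same_scene and iou(bbox(subgraph_a), bbox(subgraph_b)) >= thresh
```

These labels then feed the batch-all triplet loss: within a batch, every (anchor, positive, negative) combination is formed, and only triplets with a positive loss contribute to the average.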
The message-passing architecture that we selected is GCNConv (Kipf and Welling 2017). While more expressive or performant architectures exist, few are compatible with the computational framework used for inference in Hydra (i.e., ONNX). We split the dataset by scene; we use 70% of the original 90 scene graphs to train on, 20% of the scene graphs for validation, and the last 10% for testing. Our learning rate is 5 × 10−4, and we train the object models for 50 epochs and place models for 80 epochs, saving the model when the average validation error is at a minimum. Each model produces a descriptor of dimension 64. Other key model parameters are reported in Appendix A.7.
6.2.3. Neural tree training
We train the neural tree and related baselines on two datasets: Stanford3d and MP3D.
For the 482 object-room scene graphs in the Stanford3D dataset, we train the neural tree and GNN baselines for the same semi-supervised node classification task examined in Talak et al. (2021), where the architecture has to label a subset of room and object nodes. The goal of this comparison is to understand the impact of some design choices (e.g., heterogeneous graphs, edge features) with respect to our original proposal in Talak et al. (2021) and related work. For this comparison, we implement the neural tree and baseline approaches with four different message passing functions: GCN (Kipf and Welling 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), and GIN (Xu et al., 2019b). We consider both homogeneous and heterogeneous graphs; for all nodes, we use their centroid and bounding box size as node features. In some of our comparisons, we also examine the use of relative node positions as edge features, and discard the centroid from the node features. For the GNN baselines, we construct heterogeneous graphs consisting of two node types: rooms and objects. For the neural tree, we obtain graphs with four node types: room cliques, room-object cliques, room leaves, and object leaves; see Section 3.4.2. There are few message passing functions that are compatible with both heterogeneous graphs and edge features; therefore, we compare all heterogeneous approaches using only GAT (Veličković et al., 2018). For all tests on Stanford3d, we report average test accuracy over 100 runs. For each run, we randomly generate a 70%, 10%, 20% split across all nodes (i.e., objects and rooms) for training, validation, and testing, respectively.
For the 6810 scene graphs in the MP3D dataset, we test the neural tree and baselines for room classification on object-room graphs. The goal of this experiment is to understand the impact of the node features and the connectivity between rooms on the accuracy of room classification. We only consider heterogeneous graphs for this dataset, and as such only use GAT (Veličković et al., 2018) for message passing. The heterogeneous node types are the same as for Stanford3D. We use the bounding box size as the base feature for each node, and use relative positions between centroids as edge features. We also evaluate the impact of using semantic labels as additional features for the object nodes using word2vec (Mikolov et al., 2013). The MP3D dataset contains scene graphs with partially explored rooms. We discard any room nodes where the IoU between the 2D footprint of the places within the room and the 2D footprint of the ground truth room is less than a specified ratio (60% in our tests). Further details on the construction and pre-processing of the dataset are provided in Appendix A.5. We predict a set of 25 room labels which are provided in Appendix A.4. For training, we use the scene graphs from the official 61 training scenes of the MP3D dataset. For the remaining 29 scenes, we use graphs from two trajectories of the five total trajectories for validation and the other three trajectories for testing. For use with Hydra, we select the best-performing heterogeneous neural tree architecture; this architecture uses bounding box size and word2vec embeddings of the semantic labels of the object nodes as node features, as well as relative positions between nodes as edge features.
All training and testing is done using a single Nvidia A10 G Tensor Core GPU. For both datasets, we use cross entropy loss between predicted and ground-truth labels during training, and save the models with the highest validation accuracy for testing. All models are implemented using PyTorch 1.12.1 (Paszke et al., 2019) and PyTorch Geometric 2.2.0 (Fey and Lenssen 2019). We base our implementation on our previous open-source version of the neural tree (Talak et al., 2021). We provide additional training details, including model implementation, timing, and hyper-parameter tuning in Appendix A.5.
6.3. Results and ablation study
We begin this section with a comparison between Hydra and SceneGraphFusion (Wu et al., 2021) (Section 6.3.1). We then analyze the accuracy and provide an ablation of the modules in Hydra (Section 6.3.2), and show an example of the quality of scene graph that Hydra is able to produce while running onboard the Unitree A1 (Section 6.3.3). Finally, we report a breakdown of the runtime of our system (Section 6.3.4).
6.3.1. Comparison against SceneGraphFusion
This section shows that Hydra produces better object maps compared to SceneGraphFusion (Wu et al., 2021), while also creating additional layers in the scene graph (SceneGraphFusion does not estimate rooms or places). We compare our system with SceneGraphFusion (Wu et al., 2021) for both the uHumans2 apartment and office scenes. For this comparison, the only learned component used by our system is the off-the-shelf 2D semantic segmentation network; hence, for a fair comparison, we do not retrain the object label prediction GNN used by SceneGraphFusion (Wu et al., 2021). Examples of the objects produced by Hydra and SceneGraphFusion are shown in Figure 13. Note that SceneGraphFusion tends to over-segment larger objects (e.g., the sofa in the lower right corner of Figure 13 or the bed in the top right corner of the same figure), while Hydra tends to under-segment nearby objects (such as the dining table and chairs in the bottom left of Figure 13). For the purposes of a fair comparison, we use OneFormer (Jain et al., 2023) to provide semantics for Hydra, and use ground-truth robot poses as input to both systems. A quantitative comparison is reported in Table 1, which also includes results for Hydra using ground-truth semantics as an upper bound on performance. We report two metrics. The first,
[Figure 13: (a) Objects produced by Hydra for the uHumans2 apartment scene. (b) 3D scene graph produced by SceneGraphFusion for the uHumans2 apartment scene. Note that SceneGraphFusion directly infers geometric object relationships, and edges are drawn between objects for each detected relationship.]
[Table 1: Object accuracy for Hydra and SceneGraphFusion. Best results in bold.]
Table 1 shows that Hydra largely outperforms SceneGraphFusion in terms of both
6.3.2. Accuracy evaluation and ablation
Here, we examine the accuracy of Hydra, running in real-time, as benchmarked against the accuracy attained by constructing a 3D scene graph in an offline manner (i.e., by running Kimera (Rosinol et al., 2021)). We also present a detailed quantitative analysis to justify key design choices in Hydra. To do this, we break down this quantitative analysis across the objects, places, and rooms layers. We examine the impact of (i) the choice of loop closure detection and 2D semantic segmentation on the accuracy of each layer of the 3D scene graph, and (ii) the impact of model architectures and training parameters on the accuracy of the room classification network. As such, we first present the accuracy of the object and place layers predicted by Hydra. We then examine room classification accuracy, including two ablation studies on the neural tree and an evaluation of our room clustering approach based on persistent homology. Finally, we examine the quality of predicted loop closures when using the proposed hierarchical descriptors.
When benchmarking the accuracy of the layers of Hydra, we consider five different configurations for odometry estimation and loop closure detection. The first configuration (“
6.3.2.1. Accuracy evaluation: Objects
Figure 14 evaluates the object layer of Hydra across the five configurations of Hydra, as well as across various 2D semantic segmentation sources: ground-truth semantics (when available), OneFormer, HRNet, and MobileNetV2. For this evaluation, we benchmark against the ground-truth objects from the underlying Unity simulation for uHumans2. As the recording for the uHumans2 scenes only explores a portion of the scene, we only compare against ground-truth objects that have a high enough number of mesh vertices inside their bounding boxes (40 vertices). For the real datasets, where we do not have ground-truth objects, we consider the object layer of the batch scene graph constructed from ground-truth poses and OneFormer-based 2D semantic segmentation using Kimera (Rosinol et al., 2021) as the ground-truth objects for the purposes of evaluation. For the object layer, we report two metrics: the percentage of objects in the ground-truth scene graph that have an estimated object with the correct semantic label within a specified radius (“
[Figure 14: (a)–(e) Object accuracy metrics for all scenes across different 2D semantic segmentation sources and different LCD and pose source configurations. The metrics are averaged over 5 trials for each pair of configurations (i.e., there were 5 trials for every combination of 2D semantics and a pose source). For semantic segmentation sources, “OF” is OneFormer, “HR” is HRNet, “MN” is MobileNetV2, and “GT” is the ground-truth semantic segmentation provided by the simulator. Each cell reports the mean and standard deviation. Values are colored from highest (dark green, best performance) to lowest (light yellow, worst performance) by mean. (f) Place layer accuracy for all scenes across different configurations for Hydra. Each bar shows the average distance error for a given scene and configuration over 5 trials; standard deviation across the 5 trials is shown as a black error bar.]
We note some important trends in Figure 14. First, when using
As expected, the quality of the semantic segmentation impacts the accuracy; this is much more pronounced for “
Finally, we remark that even the
6.3.2.2. Accuracy evaluation: Places
Figure 14(f) evaluates the place layer by comparing the five configurations for Hydra. We use either the ground-truth semantics (when available) or OneFormer (when ground-truth semantics is not available), as the places are not influenced by the semantic segmentation quality. For the places evaluation, we construct a “ground-truth” GVD from a 3D reconstruction of the entire scene using the ground-truth trajectory of the scene. Using this, we measure the mean distance of an estimated place node to the nearest voxel in the ground-truth GVD, which we call “
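This distance metric amounts to a nearest-neighbor query from each estimated place node into the ground-truth GVD. A minimal brute-force sketch (assuming place positions and GVD voxel centers as 3D point arrays; a KD-tree would be used in practice for large maps):

```python
import numpy as np

def mean_place_position_error(place_positions, gvd_voxel_centers):
    """Mean distance (m) from each estimated place node to the nearest
    voxel center of the 'ground-truth' GVD (brute-force for clarity)."""
    places = np.asarray(place_positions, dtype=float)    # shape (N, 3)
    voxels = np.asarray(gvd_voxel_centers, dtype=float)  # shape (M, 3)
    # pairwise distances, then minimum over GVD voxels for each place node
    d = np.linalg.norm(places[:, None, :] - voxels[None, :, :], axis=2)
    return float(d.min(axis=1).mean())
```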
We note some important trends in Figure 14(f). Similar to the case of objects, the place position errors are almost identical across all configurations of Hydra for the uHumans2 scenes. These scenes have very low drift and are relatively small, so it is expected that a higher-quality trajectory makes less of an impact. For the larger scenes, we see a range of results. However, we see for SidPac,
6.3.2.3. Neural tree ablation 1: Node classification
Stanford3d: Node classification accuracy for different message passing architectures.
Best results in bold.
We note that the absolute position of a node centroid is not invariant to translation, and that invariance is important to generalize to new scenes regardless of the choice of coordinate frames. We therefore examine using the relative positions of node centroids as edge features, instead of including them directly as node features. For the H-tree, which contains clique nodes that are comprised of multiple objects or rooms, we use the mean room centroid as the centroid of the clique when computing relative positions. In addition, we also investigate the impact of using heterogeneous graphs (which can accommodate different node and edge types, as the ones arising in the 3D scene graph) against standard homogeneous graphs. For this comparison, we only use the GAT message passing function since it is the only one in Table 2 that can both handle heterogeneous message passing and incorporate edge features.
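The translation invariance of relative-position edge features can be made concrete with a short sketch (function name and input conventions are ours, for illustration only):

```python
import numpy as np

def relative_position_edge_features(centroids, edges):
    """Edge features as relative centroid positions x_j - x_i for each
    directed edge (i, j). Unlike absolute node positions, these features
    are invariant to a global translation of the scene (illustrative
    sketch of the idea discussed in the text)."""
    c = np.asarray(centroids, dtype=float)   # (num_nodes, 3)
    return np.stack([c[j] - c[i] for i, j in edges])
```

Shifting every centroid by the same offset leaves the features unchanged, which is exactly the property needed to generalize across coordinate frames.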
Stanford3d: Node classification accuracy for different graph types and position encodings.
Best results in bold.
6.3.2.4. Neural tree ablation 2: Room classification
MP3D: Room classification accuracy for different graph types and node features.
Best results in bold.
6.3.2.5. Accuracy evaluation: rooms
We evaluate the accuracy of the room segmentation in Hydra, by first evaluating the quality of the geometric room clustering described in Section 3.4.1 and then testing the room classification from Section 3.4.2 for different choices of 2D segmentation network.
Figure 15 evaluates the room clustering performance, using the precision and recall metrics defined in Bormann et al. (2016) (here we compute precision and recall over 3D voxels instead of 2D pixels). More formally, these metrics are:

Figure 15: Room metrics for all scenes across different LCD and pose source configurations. Each bar shows the average precision or recall over 5 trials; the standard deviation across the 5 trials is shown as a black error bar.
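In the spirit of Bormann et al. (2016), precision and recall can be viewed as directed overlap scores between the estimated and ground-truth room segmentations. A simplified voxel-level sketch (labelings as dicts from voxel index to room id; this is our simplification, not Hydra's exact code):

```python
from collections import Counter, defaultdict

def directed_room_score(a, b):
    """Size-weighted fraction of each room in labeling `a` whose voxels
    fall into that room's single best-matching room in labeling `b`.
    Precision is directed_room_score(est, gt); recall is
    directed_room_score(gt, est)."""
    sizes = Counter(a.values())
    overlap = defaultdict(Counter)  # room in `a` -> Counter of rooms in `b`
    for voxel, room_a in a.items():
        if voxel in b:
            overlap[room_a][b[voxel]] += 1
    total = sum(sizes.values())
    if total == 0:
        return 0.0
    matched = sum(max(overlap[room].values(), default=0) for room in sizes)
    return matched / total
```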

Figure 15 shows that Hydra generally outperforms Kimera (Rosinol et al., 2021) when given the ground-truth trajectory, and that Hydra is slightly more robust to multi-floor environments. This is expected, as Kimera performs room segmentation by taking a 2D slice of the voxel-based map of the environment, which does not generalize to multi-floor scenes. For both the split-level Apartment scene and the multi-floor SidPac scene, we achieve higher precision compared to Kimera. These differences stem from the difficulty of choosing an appropriate height at which to segment rooms for Kimera (i.e., the height at which Kimera takes a 2D slice of the ESDF). Finally, it is worth noting that our room segmentation approach is able to mostly maintain the same levels of precision and recall for
Room classification accuracy for Hydra.
We note two interesting trends. First, the richness of the object label space impacts the predicted room label accuracy, as shown by the relatively poor performance of the GT column compared to either HRNet or OneFormer for the uHumans2 scenes (27% versus 40% or 45%): the GT semantics available in the simulator include fewer semantic classes for the objects, hindering performance. This is consistent with the observations in Chen et al. (2022). Second, scenes such as the uHumans2 office or Simmons that are out of distribution (compared to the MP3D dataset we use for training) perform poorly. An example of a resulting 3D scene graph for SidPac with room category labels is shown in Figure 1.
6.3.2.6. Loop closure ablation
Finally, we take a closer look at the quality of the loop closure candidates proposed by our hierarchical loop closure detection approach, and compare it against traditional vision-based approaches on the Office scene. In particular, we compare our approach against a vision-based loop closure detection pipeline that uses DBoW2 for place recognition and ORB feature matching, as described in Rosinol et al. (2021).
Figure 16 shows the number of detected loop closures against the error of the registered solution (i.e., the relative pose between query and match computed by the geometric verification) for four different loop closure configurations: (i) “SG-LC”: the proposed handcrafted scene graph loop closure detection, (ii) “SG-GNN”: the proposed learned scene graph loop closure detection, (iii) “V-LC (Nominal)”: a traditional vision-based loop closure detection with nominal parameters, and (iv) “V-LC (Permissive)”: a vision-based loop closure detection with more permissive parameters (i.e., a decreased score threshold and less restrictive geometric verification settings). We report key parameters used in this evaluation in Appendix A.7. As expected, making the vision-based detection parameters more permissive leads to more but lower-quality loop closures. On the other hand, the scene graph loop closure approach produces approximately twice as many loop closures within 10 cm of translation error and 1° of rotation error as the permissive vision-based approach. The proposed approach produces qualitatively and quantitatively better loop closures compared to both baselines. Notably, SG-GNN also outperforms both vision baselines and SG-LC. We present further discussion and a breakdown of additional statistics of the loop closures proposed by each method in Appendix A.8.

Figure 16: Number of detected loop closures versus error of the estimated loop closure pose for four different loop closure detection configurations. Five individual trials and a trend line are shown for each configuration.
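The 10 cm / 1° thresholds above are applied to the standard translation and rotation errors of a registered relative pose. A sketch of this computation (variable names are ours; both poses are 4x4 homogeneous transforms):

```python
import numpy as np

def registration_errors(T_est, T_gt):
    """Translation (m) and rotation (deg) error between an estimated
    loop-closure relative pose and its ground truth."""
    dT = np.linalg.inv(T_gt) @ T_est          # residual transform
    t_err = float(np.linalg.norm(dT[:3, 3]))
    # rotation angle of the residual rotation from its trace
    cos_angle = np.clip((np.trace(dT[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    r_err = float(np.degrees(np.arccos(cos_angle)))
    return t_err, r_err
```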
p@k results for loop closure detection.
Best results in bold.
We note some interesting trends in Table 6. First, the one-hot encoding outperforms the word2vec encoding for both datasets, and appears to transfer better between datasets (i.e., showing 5% better performance for the MP3D dataset, but 10% better performance for uHumans2). Additionally, the original handcrafted descriptors maintain good performance compared to the learned descriptors, and only the learned object descriptors appear competitive to the handcrafted descriptors in terms of precision. We believe that the high performance of the handcrafted descriptors is due to the semantic and geometric diversity among the scenes of the MP3D dataset. Previous experiments (using the
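The p@k metric reported in Table 6 is the standard precision-at-k from retrieval; a minimal sketch (inputs are hypothetical, for illustration):

```python
def precision_at_k(ranked_candidates, true_matches, k):
    """p@k: fraction of the top-k retrieved loop-closure candidates
    that are true matches."""
    top = ranked_candidates[:k]
    return sum(1 for c in top if c in true_matches) / k
```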
6.3.3. Onboard operation on a robot
This section shows that Hydra is capable of running in real-time and is deployable on a robot. We show this by performing a qualitative experiment, running Hydra online on the Unitree A1. We run Hydra and the chosen 2D semantic segmentation network (MobileNetV2; Sandler et al., 2018) on the Nvidia Xavier NX mounted on the back of the A1. Additionally, we run our room classification approach in the loop with Hydra on the CPU of the same Nvidia Xavier. In this test, to circumvent the computational cost of running Kimera-VIO, we use an Intel RealSense T265 to provide odometry estimates to Hydra. As a result, we use our proposed scene-graph loop closure detection method without the appearance-based descriptors (which would rely on Kimera-VIO for computation of the visual descriptors and vision-based geometric verification). To maintain real-time performance, we configure the frontend of Hydra to run every fifth keyframe (instead of every keyframe), and limit the reconstruction range to 3.5 m (instead of the nominal 4.5 m). Note that this results in a nominal update rate of 1 Hz for Hydra, though we still perform dense 3D metric-semantic reconstruction at keyframe rate (5 Hz). A breakdown of the runtime of Hydra’s modules during this experiment is available in Section 6.3.4.
For this experiment, we partially explore a floor of building 31 on the MIT campus, consisting of a group of cubicles, a conference room, and two impromptu lounge areas, all of which are connected by a hallway. An intermediate scene graph produced by Hydra while running the experiment is shown in Figure 17, and the experiment itself is shown in the video supplement.

Figure 17: Intermediate 3D scene graph created by Hydra onboard the A1 quadruped while exploring building 31 on the MIT campus. Estimated room labels are shown. The entire mapping sequence is shown in the video supplement.
Hydra estimates four rooms for the scene graph; of these four rooms, room 3 is centered over one of the two lounges, and room 0 covers both the conference room (located in the lower right corner of Figure 17) and a portion of the hallway. Qualitatively, Hydra over-segments the scene, but the produced rooms and labels are still somewhat consistent with the underlying room structure and labels. In this instance, only room 0 has an incorrectly estimated label; however, the room categories that Hydra estimates over the course of the experiment are not as consistent. This is likely due to the poor quality of the 2D semantics from MobileNetV2 and the general lack of useful object categories for inferring room labels.
6.3.4. Runtime evaluation
Figure 18 reports the runtime of Hydra versus the batch approach in Rosinol et al. (2021). This plot shows that the runtime of the batch approach increases over time and takes more than 5 minutes to generate the entire scene graph for moderate scene sizes; as we mentioned, most processes in the batch approach (Rosinol et al., 2021) entail processing the entire ESDF (e.g., place extraction and room detection), inducing a linear increase in the runtime as the ESDF grows. On the other hand, our scene graph frontend (

Figure 18: Runtime required for scene graph construction versus timestamp for the SidPac Floor 1–3 dataset for a batch approach (Kimera) and for the proposed incremental approach (Hydra). The middle subplot shows runtime information from the same experiment as the top, but zoomed in to highlight the behavior of Hydra High-Level. The bottom subplot shows runtime information from the same experiment as the first, but zoomed in to highlight the behavior of Hydra Mid-Level. We filter out measurements of Hydra High-Level less than 20 ms to clarify trends in the bottom subplot. For timing of the low-level processes in Hydra, we refer the reader to the analysis in Rosinol et al. (2021), as we also rely on Kimera-VIO.
When reconstructing the scene shown in Figure 18 (SidPac Floor 1–3), Hydra uses a maximum of 7.2 MiB for the TSDF, 19.1 MiB for the storage of the semantic labels, and 47.8 MiB for the storage of the dense GVD inside the active window, for a total of 74.1 MiB. Kimera instead uses 79.2 MiB for the TSDF, 211 MiB for the storage of semantic labels, and 132 MiB for the storage of the ESDF when reconstructing the entire environment, for a total of 422 MiB of memory. Note that the memory storage requirement of Hydra for the active window is less than a fifth of the memory required by Kimera, and that the memory usage of Kimera grows with the size of the scene.
Hydra: Timing breakdown.
While the timing results in Table 7 are obtained with a relatively powerful workstation, here we restate that Hydra can run in real-time on embedded computers commonly used in robotics applications. Towards this goal, we report the timing statistics from the online experiment running Hydra onboard the Unitree A1 as shown in Figure 17. Hydra processes the objects in 83.9 ± 65 ms, places in 114.8 ± 103 ms, and rooms in 34.7 ± 37.6 ms. While these numbers imply that Hydra can run faster than the 1 Hz target rate on the Xavier, we note that the 1 Hz limit is chosen so as not to fully saturate the computational resources of the Xavier. Additionally, other modules of Hydra and external processes compete (sometimes significantly) for the computational resources of the Xavier (e.g., the reconstruction of the TSDF and GVD). While there is still margin to optimize computation (see conclusions), these initial results underscore the practicality and real-time capability of Hydra in building 3D scene graphs.
7. Related work
We provide a broad literature review touching on abstractions and symbolic representations (Section 7.1), metric-semantic and hierarchical map representations and algorithms to build them from sensor data (Section 7.2), and loop closure detection and optimization (Section 7.3).
7.1. Need for abstractions and symbolic representations
7.1.1. State and action abstractions
The need to abstract sensor data into higher-level representations has been studied in the context of planning and decision-making. Konidaris (2019) points out the necessity of state and action abstraction for efficiently solving task and motion planning problems. Konidaris et al. (2018) extract task-specific state abstractions. James et al. (2020) show how task-independent state abstractions can also be learned. James et al. (2022) show how to autonomously learn object-centric representations of a continuous and high-dimensional environment, and argue that such a representation enables efficient planning for long-horizon tasks. Berg et al. (2022) propose a hierarchical representation for planning in large outdoor environments. The hierarchy contains three levels: landmarks (e.g., forests, buildings, streets), neighborhoods, and cities. Several related works also discuss action abstractions, e.g., how to group a sequence of actions into macro-actions (usually referred to as
7.1.2. Ontologies and knowledge graphs
A fundamental requirement for an embodied agent to build and communicate a meaningful representation of an environment is to use a common vocabulary describing concepts of interest and their relations. Such a vocabulary may be represented as an
The need to create a standard ontology was identified by Schlenoff et al. (2012), which resulted in the emergence of several knowledge processing frameworks focused on robotics applications (Beetz et al., 2018; Diab et al., 2019; Lemaignan et al., 2010; Tenorth and Beetz 2013). A significant effort has been made in the creation of common-sense ontologies and knowledge graphs (Auer et al., 2007; Bollacker et al., 2008; Carlson et al., 2010; Lenat 1995; Miller 1995; Niles and Pease 2001; Speer et al., 2017; Suchanek et al., 2008; Vrandečić and Krötzsch 2014). In recent years, there has been a surge in applying these ontologies and knowledge graphs to problems such as 2D scene graph generation (Amodeo et al., 2022; Chen et al., 2023; Guo et al., 2021; Zareian et al., 2020), image classification (Movshovitz-Attias et al., 2015; Porello et al., 2015), visual question answering (Aditya et al., 2018; Ding et al., 2022; Gardères et al., 2020; Marino et al., 2021; Zheng et al., 2021), task planning Chen et al. (2020); Daruna et al. (2021); Tuli et al. (2022), and representation learning (Hao et al., 2019; Kwak et al., 2022), to name a few.
7.1.3. Scene grammars and compositional models
Compositionality, which is the ability of humans to perceive reality as a hierarchy of parts, has been considered fundamental to human cognition (Geman et al., 2002; Lake et al., 2017). Geman et al. (2002) propose a mathematical formulation of
Recent works have tended to use such a hierarchical structure along with deep neural networks to provide better learning models. Wang et al. (2022) show that using a hierarchical, compositional model of the human body results in better human pose estimation. The model consists of body parts (e.g., shoulder, elbow, arm) organized as a hierarchy (e.g., shoulder, elbow are children of arm). A bottom-up/top-down inference strategy is proposed that is able to correct ambiguities in perceiving the lowest-level parts. Niemeyer and Geiger (2021) show that modeling a 3D scene as one composed of objects and background leads to more controllable and accurate image synthesis. Mo et al. (2020) model object shape as a hierarchical assembly of individual parts, and generate the object shape by transforming the shape parameters of each part. Ichien et al. (2021) show that compositional models significantly outperform deep learning models on the task of analogical reasoning. Yuan et al. (2023) survey work on compositional scene representation. While early work by Fodor and Pylyshyn (1988) argued that neural-network-based models are not compositional in nature, recent works have suggested otherwise. The works (Mhaskar 1996; Mhaskar and Poggio 2016; Poggio et al., 2017) show that deep convolutional networks avoid the curse of dimensionality in approximating a class of compositional functions. Webb et al. (2022) show that large language models such as GPT-3 exhibit compositional generalization as an emergent property. Such compositionality, however, is not yet evident in generative models trained on multi-object scenes (Xie et al., 2022).
7.2. Metric-semantic and hierarchical representations for scene understanding and mapping
7.2.1. 2D scene graphs in computer vision
2D scene graphs are a popular model for image understanding that describes the content of an image in terms of objects (typically grounded by bounding boxes in the image) and their attributes and relations (Johnson et al., 2015; Krishna et al., 2016). The estimation of 2D scene graphs (either from a single image or a sequence of images) has been well studied in the computer vision community and is surveyed in Zhu et al. (2022). The seminal works (Johnson et al., 2015; Krishna et al., 2016) advocated the use of 2D scene graphs to perform cognitive tasks such as image search, image captioning, and answering questions. These tasks, unlike object detection, require the model to reason about object attributes and object-to-object relations, and therefore enable better image understanding. 2D scene graphs have been successfully used in image retrieval (Johnson et al., 2015), caption generation (Anderson et al., 2016; Karpathy and Fei-Fei 2015), visual question answering (Krishna et al., 2016; Ren et al., 2015), and relationship detection (Lu et al., 2016). GNNs are a popular tool for jointly inferring object labels and/or relationships on scene graphs (Li et al., 2017; Xu et al., 2017; Yang et al., 2018; Zellers et al., 2017). Chang et al. (2023a) provide a comprehensive survey on the various methods that have been proposed to infer a 2D scene graph from an image.
Despite their popularity, 2D scene graphs have several limitations. First of all, they are designed to ground concepts in the image space, hence they are not suitable to model large-scale scenes (see discussion in Section 2). Second, many of the annotated object-to-object relationships in Krishna et al. (2016) like “behind,” “next to,” “near,” “above,” and “under” are harder to assess from a 2D image due to the lack of depth information, and are more easily inferred in 3D. Finally, 2D representations such as 2D scene graphs are not invariant to viewpoint changes (i.e., viewing the same 3D scene from a different viewing angle may result in a different 2D scene graph), as observed in Armeni et al. (2019).
7.2.2. Flat metric-semantic representations
The last few years have seen a surge of interest towards
7.2.3. Building parsing
A somewhat parallel research line investigates how to
7.2.4. Hierarchical representations and 3D scene graphs
Hierarchical maps have been pervasive in robotics since its inception (Chatila and Laumond 1985; Kuipers 1978, 2000; Thrun 2003). Early work focuses on 2D maps and investigates the use of hierarchical maps to resolve the divide between metric and topological representations (Beeson et al., 2010; Choset and Nagatani 2001; Galindo et al., 2005; Ruiz-Sarmiento et al., 2017; Zender et al., 2008). These works preceded the “deep learning revolution” and could not leverage the rich semantics currently accessible via deep neural networks.
More recently,
7.3. Maintaining persistent representations
7.3.1. Loop closures detection
Established approaches for visual loop closure detection in robotics trace back to place recognition and image retrieval techniques in computer vision; these approaches are broadly adopted in SLAM pipelines but are known to suffer from appearance and viewpoint changes (Lowry et al., 2016). Alternative approaches investigate place recognition using image sequences (Garg and Milford 2021; Milford and Wyeth 2012; Schubert et al., 2021) or deep learning (Arandjelovic et al., 2016). More related to our proposal is the set of papers leveraging semantic information for loop closure detection. Gawel et al. (2018) perform object-graph-based loop closure detection using random-walk descriptors built from 2D images. Liu et al. (2019) use similar object-based descriptors but built from a 3D reconstruction. Lin et al. (2021) adopt random-walk object-based descriptors and then compute loop closure poses via object registration. Qin et al. (2021) propose an object-based approach based on sub-graph similarity matching. Zheng et al. (2020) propose a room-level loop closure detector. None of these approaches are hierarchical in nature.
7.3.2. Loop closures correction
After a loop closure is detected, the map needs to be corrected accordingly. While this process is easy in sparse (e.g., landmark-based) representations (Cadena et al., 2016), it is non-trivial to perform in real-time when using dense representations. Stückler and Behnke (2014) and Whelan et al. (2016) optimize a map of
8. Conclusions
This paper argues that large-scale spatial perception for robotics requires hierarchical representations. In particular, we show that hierarchical representations scale better in terms of memory and are more suitable for efficient inference. Our second contribution is to introduce algorithms to build a hierarchical representation of an indoor environment, namely a
8.1. Limitations
While we believe the proposed contributions constitute a substantial step towards high-level scene understanding and spatial perception for robotics, our current proposal has several limitations. First, the sub-graph of places captures the free-space in 3D, which is directly usable for a drone to navigate. However, traversability for ground robots is also influenced by other aspects (e.g., terrain type, steepness). These aspects are currently disregarded in the construction of the GVD and the places sub-graph, but we believe they are particularly important for outdoor extensions of 3D scene graphs. Second, our approach for room segmentation, which first clusters the rooms geometrically, and then labels each room, mainly applies to rooms with clear geometric boundaries (e.g., it would not work in an open floor-plan). We believe this limitation is surmountable (at the expense of using extra training data) by replacing the 2-stage approach with a single learning-based method (e.g., a neural tree or a standard graph neural network) that can directly classify places into rooms. Third, for computational reasons, we restricted the inference in the neural tree to operate at the level of objects, rooms, and buildings. However, it would be desirable for high-level semantic concepts (e.g., room and object labels) to propagate information downward towards the mesh geometry. While a more compact representation of the low-level geometry (Czarnowski et al., 2020; Sucar et al., 2020) might facilitate this process, it remains unclear how to fully integrate top-down reasoning in the construction of the 3D scene graph, which is currently a bottom-up process.
8.2. Future work
This work opens many avenues for current and future investigation. First of all, while this paper mostly focused on inclusion and adjacency relations, it would be interesting to label nodes and edges of the 3D scene graph with a richer set of relations and affordances, for instance building on Wu et al. (2021) or Looper et al. (2023). Second, the connections between our scene graph optimization approach and pose graph optimization offer opportunities to improve the efficiency of the optimization by leveraging recent advances in pose graph sparsification as well as novel solvers. Third, it would be interesting to replace the sub-symbolic representation (currently, a 3D mesh and a graph of places) with neural models, including neural implicit representations (Park et al., 2019) or neural radiance fields (Rosinol et al., 2023), which can more easily incorporate priors and better support shape completion (Kong et al., 2023). Fourth, the current set of detectable objects is fairly limited and restricted by the use of pre-trained 2D segmentation networks. However, we have noticed in Section 6 and in Chen et al. (2022) that a larger object vocabulary leads to better room classification; therefore, it would be interesting to investigate novel techniques that leverage language models for open-set segmentation, e.g., Ha and Song (2022); Jatavallabhula et al. (2023). Fifth, it would be desirable to extend the framework in this paper to arbitrary environments (i.e., mixed indoor-outdoor environments). Finally, the implications of using 3D scene graphs for prediction, planning, and decision-making are still mostly unexplored (see Ravichandran et al. (2022); Agia et al. (2022); Chang et al. (2023b); Rana et al. (2023) for early examples).
Supplemental Material
Footnotes
Declaration of conflicting interests
Funding
Disclaimer
Notes
References