Sage Journals: Discover world-class research

Abstract

In this paper, we propose a real-world benchmark for studying robotic learning in the context of functional manipulation: a robot needs to accomplish complex long-horizon behaviors by composing individual manipulation skills in functionally relevant ways. The core design principles of our Functional Manipulation Benchmark (FMB) emphasize a harmonious balance between complexity and accessibility. Tasks are deliberately scoped to be narrow, ensuring that models and datasets of manageable scale can be utilized effectively to track progress. Simultaneously, they are diverse enough to pose a significant generalization challenge. Furthermore, the benchmark is designed to be easily replicable, encompassing all essential hardware and software components. To achieve this goal, FMB consists of a variety of 3D-printed objects designed for easy and accurate replication by other researchers. The objects are procedurally generated, providing a principled framework to study generalization in a controlled fashion. We focus on fundamental manipulation skills, including grasping, repositioning, and a range of assembly behaviors. The FMB can be used to evaluate methods for acquiring individual skills, as well as methods for effectively combining and ordering such skills in order to solve complex, multi-stage manipulation tasks. We also offer an imitation learning framework that includes a suite of policies trained to solve the proposed tasks. This enables researchers to utilize our tasks as a versatile toolkit for examining various parts of the pipeline. For example, researchers could propose a better design for a grasping controller and evaluate it in combination with our baseline reorientation and assembly policies as part of a pipeline for solving multi-stage tasks. Our dataset, object CAD files, code, and evaluation videos can be found on our project website: https://functional-manipulation-benchmark.github.io.

Keywords

Manipulation imitation learning benchmarking

1. Introduction

Manipulation is one of the foundational problems in robotics research, but enabling robots to perform dexterous manipulation skills that reflect the capabilities of humans is still out of reach. In fact, even matching the performance of human teleoperation remains a major challenge, particularly in environments that require generalization and are not constrained to a specific fixed set of well-characterized objects. As Cui and Trinkle (2021) point out, two primary difficulties in robotic manipulation lie in intelligently handling complex contact dynamics and the variability in the environment and objects. Robotic learning techniques hold the potential to address these challenges. However, making effective and measurable progress will require a comprehensive and accessible framework to offer essential components: sufficiently challenging tasks of practical relevance, reasonable amounts of high-quality data, an easy-to-reproduce setup, a collection of relevant methods providing baseline results, and thorough analysis of the experimental findings on the proposed tasks.

While significant recent research in robotic learning has made progress on various aspects of manipulation problems (Abbeel et al., 2006; Brohan et al., 2023; Buchli et al., 2011; Gu et al., 2017; Hopcroft et al., 1991; Kalashnikov et al., 2018; Levine et al., 2016; Mahler et al., 2017; Peters and Schaal, 2008; Salehian and Billard, 2018; Xu et al., 2022; Zeng et al., 2021), much of the emphasis in recent works have either been on broad generalization with relatively simple skills, which often do not capture many physical challenges of manipulation (e.g., imprecise pick-and-place tasks) (Ebert et al., 2022; Levine et al., 2018; Pinto and Gupta, 2016), or performing narrow tasks with physically more complex skills without extensive generalization (Hu et al., 2023; Kimble et al., 2020; OpenAI et al., 2019; Vecerik et al., 2019). This is not unreasonable: it is very difficult to simultaneously make progress on broad generalization (which often requires huge datasets) and tackle the full physical complexity of dexterous manipulation. So how can we take a step toward facilitating robotic learning research that emphasizes both generalization and physically intricate skills while still keeping the problem constrained enough so as to enable meaningful progress?

In this paper, we propose such a real-world benchmark, which we call the functional manipulation benchmark (FMB). FMB aims to cover important dimensions of physical complexity and object generalization while still providing a degree of accessibility by carefully restricting the scope to a domain where we can make progress with reasonably sized datasets and models. We approach the design of this benchmark by defining functional manipulation as the problem of manipulating objects in ways functionally relevant to a sequence of manipulation behaviors, such as picking up an object with an appropriate pose, repositioning it if necessary, and then using it for physical interactions. Two such examples can be seen in Figures 2 and 3. While this definition is more restrictive, we believe it captures a broad range of practical manipulation tasks and includes both the challenges of contact dynamics and object generalization.

The specific tasks we instantiate to capture functional manipulation are themed around assembly problems, including pick-and-place tasks and more complex long-horizon multi-stage multi-part assemblies. These tasks, illustrated in Figure 1, require picking up the individual pieces, reorienting them (potentially using environment fixtures and regrasping), and then slotting them into their corresponding location. Each phase requires addressing the challenge of complex contact dynamics, skill sequencing strategies, as well as object generalization. The objects vary in shape, size and color between training and testing phases, and their locations are randomized. The grasping phase requires selecting a grasp that is suitable for reorienting or inserting the object, the reorientation phase requires positioning the object so that its pose can be adjusted in the desired way, and the assembly phase requires compliant insertion and proper accounting for the contact forces on the object. Each phase requires handling different objects (including held-out objects) and different poses. The overall sequencing strategy needs to serve as the mechanism of composing such skills appropriately, as well as recovering from failed execution. For example, for the task presented in Figure 2, the robot may need to retry grasping on failed ones multiple times until it firmly holds the object before advancing to the next stage. In tasks illustrated in Figure 3, the robot must further reason the right sequence of manipulation as these objects are assembled in an interlocking fashion.

Figure 1.

Left: The 3D-printed parts for single-object manipulation tasks. Right: Three instantiations of the complex assembly task. These tasks require similar functional manipulation behaviors as the simpler set of tasks but with multiple interlocking objects and a more complex higher-level structure that requires assembling the parts in the right order.

Figure 2.

An illustration of the steps for completing a single-object manipulation task, which requires grasping the part, reorienting it (potentially using an environment fixture), and then inserting it into the appropriate slot.

Figure 3.

An illustration of the steps for solving a multi-object manipulation task, which requires performing the same skills as the single-object task repeatedly for each component in the interlocking assembly.

To ensure the reproducibility and portability of such tasks, we designed 66 3D-printed objects with diverse shapes and sizes that can be easily replicated by other researchers. Accompanying these objects, we collected a dataset of 22,500 human demonstrations of grasping, repositioning, and assembly skills. Our dataset contains a variety of sensory modalities, as presented in Figure 7: we record RGB and depth images from multiple cameras, relevant robot kinematics information, as well as force/torque measurement at the robot’s end-effector frame. We also trained a set of imitation learning policies to perform either individual stages or the entire assembly tasks. These policies are also provided as pre-trained model checkpoints so that they can be reused by others as component parts of larger systems or as scaffolds for studying improvements to individual stages. FMB is modular so that other researchers can repurpose it for a variety of methods that they may wish to develop and can focus on any stage or aspect of the task. For example, some researchers might choose to focus on better functional grasping or assembly methods, while the other stages are handled by our baseline system. Some researchers might focus on skill sequencing, utilizing trained skills from our system for the individual steps. Others might also focus on developing an end-to-end method for the entire multi-stage task, fully utilizing the provided training data. With the accessible and extensive framework that FMB provides, our hope is that it can serve as a “toolkit” to facilitate the entry of researchers into the field of robot learning with ease.

2. Related work

Considerable recent progress in robotic manipulation has studied generalization, though often in the context of simpler tasks such as grasping (Dasari et al., 2020; Levine et al., 2018; Yang et al., 2019), pushing (Dasari et al., 2020; Finn et al., 2016), and imprecise repositioning (Dasari et al., 2020; Lee et al., 2022). A number of other works have studied tasks that are dynamic (Seita et al., 2022), precise (e.g., insertion) (Zakka et al., 2020), contact-rich (Falco et al., 2016), or otherwise physically challenging (Kimble et al., 2020; OpenAI et al., 2019). Fewer works have studied these factors in combination (Heo et al., 2023). We believe many of the central challenges in robotic manipulation lie at the confluence of these two challenges: tasks that require handling contact dynamics, not by memorizing the particular pattern needed for a single narrow task, but by learning general behaviors for handling object interaction that can generalize to new objects. Our aim is to propose a benchmark that can study this combination of challenges while keeping the scope narrow enough that it remains accessible to many researchers.

Our functional manipulation tasks combine aspects of grasping, repositioning, and assembly. A number of works have studied functional grasping (Aleotti and Caselli, 2008; Levine et al., 2018; Li and Sastry, 1988; Liu et al., 2020; Zhao et al., 2021), and insertion (Mahler et al., 2017) separately. Our goal is not to attain the best possible performance in narrow settings for any of these stages (e.g., ultra-high-precision industrial insertion e.g., NIST board challenge (Kimble et al., 2020)) but to use these tasks as a lens through which to gauge general manipulation capabilities learned via general-purpose robotic learning methods.

A number of prior works have proposed datasets for robotic learning, including datasets consisting of demonstrations (Ebert et al., 2022; Fang et al., 2023; Walke et al., 2023) and autonomously collected data (Levine et al., 2018; Pinto and Gupta, 2016), as well as annotated datasets of grasp points (Fang et al., 2019), object geometries (Padalunkal et al., 2023; Tyree et al., 2022), simulated environments (James et al., 2019), and multi-modal inputs (Fang et al., 2023). However, there has been comparatively little work on standard and accessible object sets that are combined with multi-stage tasks for studying generalization. The YCB object set comes with a number of evaluation protocols (Calli et al., 2015), but these protocols generally focus on object repositioning tasks that do not evaluate the complex contacts challenges that we discuss in the previous section. A number of existing demonstration datasets cover many different behaviors (Bharadhwaj et al., 2023; Dasari et al., 2020; Ebert et al., 2022; Mandlekar et al., 2019; Walke et al., 2023), but also focus more on behaviors that emphasize basic pick-and-place skills rather than precise or contact-rich manipulation. Some works have focused on insertion skills in particular (e.g., connector insertion) (Bruyninckx et al., 1995; De Magistris et al., 2018; kook Yun, 2008; Luo et al., 2019, 2021; Tang et al., 2016; Zhao et al., 2022). While FMB is related, we aim specifically to cover a range of skills, including grasping and repositioning, that we believe cover a basis of basic manipulation capabilities. We also emphasize generalization as a primary challenge for FMB.

We use 3D-printed objects to facilitate reproducibility. Other prior works have also proposed standard meshes and 3D-printed parts for benchmarking and reproducibility (Calli et al., 2015), typically focusing on object grasping. These efforts are related, but our aim is to provide parts that are specifically well suited for evaluating all of the stages: grasping, reorientation, and assembly, rather than only grasping.

3. Functional manipulation benchmark

In this section, we introduce the basic principles behind FMB and the protocols to evaluate different methods on this benchmark. FMB tasks can broadly fall into two categories: single-object multi-stage manipulation tasks and multi-object multi-stage manipulation tasks. They both require acquiring individual manipulation skills such as grasping, repositioning, and insertion, as well as composing these individual skills to complete the full task as depicted in Figures 2 and 3. These two categories bear similar design principles but differ in the additional complexity of the second category, which involves selecting the appropriate object for manipulation. We are primarily concerned with studying the generalization of each individual functional manipulation skill as well as evaluating the performance of different methods on the full assembly task. Therefore, we collect a diverse dataset of robotic behaviors with different objects, viewpoints, and robot initial poses. We also provide novel objects to evaluate the generalization capability of individual skills. Thus, we test the generalization of learned manipulation skills in terms of object location and physical attributes.

3.1. Object set

The objects in FMB are 3D-printable, and the CAD files are available on our website. In total, we have 66 objects as in Figure 1, 54 of them belong to single-object manipulation tasks; the remaining compose the multi-object manipulation tasks. Out of these 54 objects, we designed nine different basic shapes and six different sizes for each shape; each object is assigned one of eight colors specified on our website. These objects are paired with three boards with matching openings as in the left of Figure 1. We additionally designed three more complex boards to facilitate multi-stage assembly tasks, shown in the right of Figure 1; objects there are generated procedurally so that they can only be fit together in specific orders. The tolerance for mating all objects is between 1 mm and 2 mm, which is practical for commercial 3D printers available on the market. Additionally, we created five test objects used to evaluate the generalization capabilities. These vary in shape, size, and color from the training objects.

3.2. Individual manipulation skills

In this section, we describe the “primitive” manipulation skills included in FMB for evaluation as well as our data collection system. For each type of skill, we provide demonstration trajectories collected with a Franka robot (see Figure 7) and an evaluation protocol as in Section 3.7. This modular design of our benchmark facilitates extension to add new tasks with the provided objects, and the tasks we describe here are suitable both for evaluating generalization and for testing a range of manipulation capabilities.

3.2.1. Grasping

The grasping task in our benchmark is a functional grasping task, in the sense that the robot must grasp the object in a way that facilitates downstream manipulation rather than simply picking the object in any pose. We illustrate this task in Figure 4. For example, if we are going to perform insertion after grasping an object, a top-down grasp is reasonable if the object is placed in a vertical pose, as shown on the right side of Figure 4. However, a horizontal grasp is much more desirable if the object is positioned as in the second row of the left side of Figure 4; because it can be impossible to find a collision-free path to grasp vertically on the top of the blue object or easily violating the robot’s kinematics constraints to perform downstream manipulation even if such grasps can be found. In such scenarios, the robot needs to perform additional repositioning steps to adjust the feasible grasp pose. The robot must learn grasping skills that deploy the appropriate grasp conditioning on the object’s current configuration and also generalize across different object shapes, colors, and sizes. Our demonstration dataset for the grasping task consists of 50 trajectories per object, with varying object rest poses in the randomization zone, for a total of 2700 trajectories performing functional grasping over the 54-object set; additionally, we collect 1800 grasping trajectories for objects in the three multi-object assembly tasks, so each object gets 150 demonstrations in a randomized setting of placements among other objects.

Figure 4.

Illustration of individual skills in the single-object task. Note that the grasp and rotate skills have to manipulate the object in both the vertical and horizontal orientation. For isometric shapes like the rectangle, the insert skill needs to decide whether to rotate the object to line up with the hole.

3.2.2. Repositioning

A repositioning step is sometimes necessary to adjust the grasping pose so that the object is held in a way that is suitable for downstream assembly, as mentioned in the last paragraph. For objects with asymmetric geometries, a rotation operation is usually desirable for the downstream insertion task. For example, the objects in the second column of Figure 4 need to be rotated 180° so that they can slot into the matching holes in the board more easily. On the other side, manipulating and reorienting objects by leveraging environment affordances (e.g., tilting the object in the gripper by levering it against a table or wall) may often be necessary for fluent and complex manipulation, and this reorientation task exercises this capability. We provide a simple fixture that can serve as an environment affordance to rest the object at an angle, as shown in Figure 4. To reorient the objects into the right pose, the robot may need to use this fixture, resting the object on it and then regrasping it in a more appropriate pose for reorientation. We collected 4500 demonstrations for placing and regrasping, which can be used to learn strategies for using environmental affordances for regrasping and reorientation. Since objects land in the fixture in a relatively deterministic fashion, we partially script our demonstration collection process while maintaining a certain degree of randomness for the purpose of data diversity. We detail such process and code of implementing it on our website.

3.2.3. Insertion

Our assembly tasks require inserting objects with diverse shapes into their matching slots, which requires performing fine-grained precise manipulation. An illustrative example is shown in Figure 4. Here, having completed the preceding steps, the robot is holding an object and needs to insert it into the matching slot on the board. For the single-object task, we collected 125 human demonstrations that include various robot initial poses and board positions, for a total of 6750 demonstrations performing the assembly task from various initial conditions. Note that in the single-object task, the board’s pose is randomized within a 35 × 35 cm region and rotated up to 15° in each episode, requiring a reactive strategy that localizes the board and the appropriate matching slot, and guides the object into the correct location. Similarly, 150 human demonstrations were collected per object in the multi-object assembly tasks, resulting in a total of 1800 trajectories.

3.3. Single-object multi-stage manipulation tasks

Aside from performing individual steps mentioned above, such as grasping, reorientation, and assembly, our benchmark and demonstrations can be used to learn the entire long-horizon sequence, composing these steps to insert a free object into the assembly board; one such example is shown in Figure 2. The difficulty of this task mainly comes from the compounding errors accumulated over each individual step which gets even more magnified when switching between tasks. For instance, after completing the grasping and repositioning stages, an object might be held in a pose different from the ones in the human demonstration data used for insertion.

3.4. Multi-object multi-stage manipulation tasks

We also present three sets of more challenging objects for assembly, as presented in the right of Figure 1. These tasks are more challenging than the single-object tasks since the pieces fit in an interlocking fashion, so there is much more variability in which object to perform manipulation skills on. For the grasping stage, as pictured in Figure 5, the robot needs to grasp a desired object among several others with the added complexity of randomized object placements for each attempt. For the insertion stage, as illustrated in Figure 6, the robot needs to insert objects while coming into contact with other objects already present on the assembly board. This situation introduces more complexity in contact dynamics, necessitating a higher level of precision in manipulation. Another major challenge with these tasks is that the interlocking pieces need to be put together in a specific order. While it may not be too hard to perform individual steps alone, the difficulty increases rapidly when a policy needs to simultaneously reason the manipulation sequence as well as accounting for compounding manipulation errors introduced by individual steps.

Figure 5.

Example of different initial configurations for one multi-object task assembly board. We randomize both the orientation and position relative to other objects at the start of each assembly demonstration episode.

Figure 6.

Various initial distributions of the insert skill for the multi-object task. In each instance, there are different numbers of objects already inserted into the board.

3.5. Robotic system and data collection

We now describe the robotic system and details of the 22,550 demonstration trajectories that we collected and released as part of the benchmark. The dataset composition can be seen in Table 1.

Table 1.

Number of demonstration trajectories in our dataset separated by primitive and task. Each trajectory is approximately 5 s in length, for a total of 22,550 trajectories.

	Grasp	Place_on_fixture	Regrasp	Rotate	Move_to_board	Insert	Total
Single-object task	2700	1350	1350	500	2700	6,750	15,350
Multi-object task	1800	900	900	0	1800	1800	7200
Subtotal	4500	2250	2250	500	4500	8,550	22,550

3.5.1. Robotic system overview

Our system can be seen in Figure 7. We use a Franka Panda robot to collect our dataset since it is widely adopted for research and offers a torque control interface which is very desirable in contact-rich manipulation tasks. To tele-operate the robot, we use a SpaceMouse to command 6 DoF end-effector twist at 10 Hz, which is then tracked by a low-level impedance controller running at 1 K Hz. The software for operating the robot, as well as the low-level controller, is also included in our open-sourced release. In total, we have four Intel RealSense D405 cameras, two of which are mounted on the robot end-effector, and the rest are placed on each side of the bin to provide a complementary view of objects in the bin. To ensure the image observations are free of background distractions, we put white curtains around the side of the workspace. We simultaneously capture RGB and depth images from these cameras, and we also provide calibrated camera intrinsics. This calibration allows for the conversion of depth images into point clouds when necessary. We also log the end-effector force/torque information provided by the Franka Panda robot. We did not use an additional force/torque sensor as it simplifies the standardization process by utilizing the robot’s inherent sensing capabilities.¹ Our robotic system setup is simple and modular; one can reproduce our exact setup by following the procedure on our website https://functional-manipulation-benchmark.github.io/files/index.html.

Figure 7.

Illustration of the robot setup, with a standard Franka arm equipped with four cameras (two on the wrist and two attached to the environment), each with RGB and depth channels, positioned in front of a workspace containing an object, reorientation fixture, and assembly board. The board is placed into a random pose within the randomization region, and the object is located in a randomized pose on the table, from where it must be picked up, reoriented, and inserted.

3.5.2. Single-object task dataset

Our dataset comprises 2700 demonstrations of the complete single-object task, encompassing every aspect from grasping and reorientation to object insertion. Each stage within these complete trajectories is automatically labeled, enabling the segmentation of trajectories into individual skills by querying the corresponding labels. We also collected an additional 4050 demonstrations of the insertion stage alone since it’s a much harder task, thus requiring more data. Each end-to-end demonstration trajectory ranges from 20 to 40 s in length. One can directly learn a “flat” policy on these long trajectories or break them into “primitive” trajectory sequences using the labels mentioned before. In our dataset, these primitives include grasp, place on fixture, regrasp, rotate, move to board, and insertion. After segmenting by primitives, we end up with a total of 15,350 demonstrations, with an average length of about 5 s. As shown in Figure 7, the pose of the task object for the grasping task is randomized around a 20 cm × 20 cm rectangular area in the bin. For the insertion task, the board is randomized inside a 35 cm × 35 cm area. A drawing of such a protocol can be found on our website. We also include distractors (i.e., objects not needed for a task) when performing the insertion task. One-fifth of the insertion demonstrations were carried out when there were distractors present.

3.5.3. Multi-object task dataset

In addition to the single-object manipulation task dataset, we also collected 150 end-to-end demonstrations of solving each of the three multi-object assemblies. Each trajectory contains steps to grasp, reorient, and insert the four components of the assembly sequentially and can exceed 100 s in length. We again break them down into separate primitives like grasp, place on fixture, regrasp, move to board, and insert for each manipulation object. After segmentation, this part of the dataset contains 7200 trajectories with lengths of about 5 s.

For the multi-object manipulation task, all four assembly objects are randomly placed in the 20 cm × 30 cm area, requiring the learned system to determine the desired piece to pick up. Unlike the single-object task, the assembly board is fixed to the table. A drawing of such a protocol can be found on our website.

3.6. Using the benchmark

To use the FMB benchmark, users would first need to reproduce the setup. This includes purchasing relevant equipment, such as the bin and cameras, as well as printing the FMB objects and tools with specified materials and colors. The detailed instructions can be found on our website https://functional-manipulation-benchmark.github.io/usage/index.html. Our dataset was collected using a Franka panda robot; however, users could still use relevant components within the FMB framework to collect their own data if they choose to use a different robot.

3.7. Evaluation protocol

In order to evaluate the performance of different methods, we designed a set of detailed evaluation protocols for each task of FMB. In these protocols, we specify a set of object initial poses within the randomization region to test the proposed methods’ generalization capability while ensuring consistency across different experiments and labs.

3.7.1. Single-object tasks

For grasping and repositioning tasks, one can hold out a specific object in the training set, train a policy without seeing any data associated with that object, and then test on the held-out object. Additionally, we also provide novel objects that are not contained in the dataset for which researchers can directly evaluate the trained policies, such as the five objects shown in Figure 8. Furthermore, we define a set of specific starting poses for both the object and the insertion board, aiming to consistently evaluate the adaptability of different policies in handling various grasping and insertion points.

Figure 8.

Unseen test objects used for evaluating generalization to new combinations of shapes, sizes, and colors.

3.7.2. Multi-object tasks

For the multi-object task, the assembly components for each board are placed in one of five specified starting arrangements within the designated grasp randomization area, as illustrated in Figure 5. A successful policy must choose the intended piece to grasp amidst the presence of other items within the same vicinity. However, the board is fixed to the workspace within the insertion randomization region to reduce the complexity required during the insertion phase.

The precise protocols for each individual skill and the multi-stage tasks can be found on our website: https://functional-manipulation-benchmark.github.io/procedure/index.html.

4. An imitation learning system for the FMB

One significant benefit of the FMB framework is its ability to function as a standardized “toolkit” for researchers, facilitating a convenient and unified starting point for studying various robot learning challenges. In this section, we will describe an imitation learning system we built for the FMB that serves both to provide baseline performance and a collection of components that researchers can extend to study the FMB tasks. In the next section, we analyze the performance of this system and various baselines and ablations.

4.1. Imitation learning policies for individual skills

By using the FMB dataset together with an evaluation protocol described in Section 3.7, we trained and tested various imitation learning models, detailed below, on individual manipulation skills. As we will discuss in Section 5, we also study the most effective sensor modalities for each skill as well as how the performance scales with the data available. In all our experiments, we use two types of architectures to learn imitation learning policies, ResNet (He et al., 2016) and Transformer (Vaswani et al., 2017). In this section, we describe the detailed architectures of our imitation learning policies.

4.1.1. ResNet-based policy

Our ResNet-based policy’s overall architecture can be seen in Figure 9(b). It is composed of ResNet-34 vision backbones and an MLP as the policy head, representing a Gaussian distribution. We use this general structure for all of our tasks, only adapting the inputs specific to each task. It takes multiple RGB and depth images and encodes them separately with weight-shared ResNets before concatenating the features. It also takes the robot’s proprioceptive information, such as end-effector pose, twist, or force/torque measurements, and then performs linear projection before being fed into the MLP. Furthermore, the system is capable of conditioning on both the object ID and manipulation skill ID, which are represented as one-hot vectors. This mechanism is crucial for employing a hierarchical approach to effectively address long-horizon, multi-stage tasks. The output is a 6D end-effector twist as well as a binary variable that indicates whether the gripper should open or close. In our experiments, we vary the input space to fit the needs of each scenario—for example, when evaluating a single-task policy, the skill ID is omitted, and when evaluating the importance of force/torque measurements, we vary whether or not they are included in the input.

Figure 9.

Architecture diagrams of the baseline policies. Both models encode each image view with weight-shared ResNet encoders before concatenating with proprioceptive information and optional object and primitive ID features to predict 7DoF actions. (a) Architecture of the transformer model used to train the baseline policies. (b) Architecture of the ResNet model used to train the baseline policies.

4.1.2. Transformer-based policy

Several recent works (Brohan et al., 2023; Collaboration et al., 2023; Zitkovich et al., 2023) showed that high-capacity models such as Transformers (Vaswani et al., 2017) can be effective in robotic control. The major advantages of these models lie in handling multi-modal inputs and scaling with large, diverse datasets. Our decoder-only Transformer architecture is shown in Figure 9(a). We use weight-shared ResNet-34 encoders to tokenize images from multiple camera views. We additionally add FiLM (Perez et al., 2018) layers to condition on the object ID or primitive ID if they are required as part of the inputs to the policies. This prevents the one-hot ID vectors from being ignored by the neural network, thus making the conditioning procedure more stable. Robot proprioceptive information is tokenized via an MLP separately. These tokens, after being concatenated together with sinusoidal position embeddings, are then processed through self-attention layers with four attention heads and four MLP layers. The network outputs a discretized action consisting of a 6D end-effector twist as well as a binary variable indicating whether the gripper should open or close. Each dimension of the continuous 6D robot action space is discretized into 256 bins during training by using a Gaussian quantizer. The discretized action space is converted back into continuous values when sending commands to the robot at runtime.

4.2. Composing skills to solve long-horizon tasks

An important part of the FMB consists of the two long-horizon sequential manipulation tasks. One way of solving such tasks is to just train a “flat” imitation learning policy on the long-horizon trajectories. However, this would suffer from compounding error issues (Ross et al., 2011), potentially requiring a significant amount of data to achieve desirable performance. Alternatively, we can perform the long-horizon task by employing hierarchical methods to compose individual manipulation skills with a high-level policy. In our experiments, we simply used a human-provided sequence of steps to trigger associated low-level primitives in time. This “human oracle” can sequence a set of primitives to generate recovery behaviors, thus reducing compounding errors. For example, the robot can repeatedly execute the grasping primitive until the object is securely held, or opting to use a repositioning primitive to adjust the object’s pose after unsuccessful grasping attempts, thus simplifying subsequent attempts. This can be achieved by using our ResNet or Transformer policy architectures with the proposed conditioning mechanism. Future work could explore learning such high-level policies that dynamically choose the best primitives based on the current observations. Such tasks necessitate explicit reasoning of the spatial relationships between objects and the associated manipulation skills, facilitating the use of a suitable abstract representation.

5. Experiments

Our experiments study the performance of the imitation learning system described previously in order to compare different variants of the imitation learning approach, understand the properties and challenges of the FMB tasks, and study the impact of different input modalities and design decisions. Specifically, our experiments study the following research questions: (1) How do various imitation learning techniques perform in our tasks so we can establish stable baselines? (2) What do the failure modes of these methods suggest about the challenges of the FMB tasks? (3) How does the difficulty of the various FMB tasks change with the choice of input modality and policy architecture? (4) How do hierarchical policies compare to “flat” policies on long-horizon tasks?

To achieve this, we train a set of imitation learning policies with either ResNet (He et al., 2016) or Transformer (Vaswani et al., 2017) architectures, shown in Figure 9. We also combine these architectures with techniques such as diffusion (Chi et al., 2023) and action chunking (Zhao et al., 2023). We’ll detail these choices in the section. All pre-trained model checkpoints associated with experiments in this section can be found on our website.

5.1. Grasping task

An important aspect of FMB is to study the generalization across objects’ physical attributes and their locations. We conduct the grasping task to get baseline numbers of our imitation learning system, as well as to verify that we can study the proposed generalization. To achieve this, we prepare different training datasets by randomly extracting portions of data from the diverse pool of grasping data available. Specifically, we sample 20%, 50%, 80%, and 100% of the overall grasping data and study the policy’s performance with the randomized evaluation procedure mentioned in Section 3.7. To test the generalization across objects, we conduct evaluations for both objects in the FMB dataset as well as unseen objects illustrated in Figure 8 in accordance with our evaluation protocol detailed in Section 3.7.

For this task, we train both the ResNet and Transformer-based policies on RGB inputs to assess the general completion rate of the task. The specific input modality includes RGB images and TCP velocity. To test if depth information is helpful for the grasping task, we additionally train the ResNet policy with depth alongside RGB. We test each policy by evaluating it on five objects in the training set and five unseen objects shown in Figure 8, for five trials each, and report the performance over the 50 trials.

Summarized in Figure 10, the ResNet policy’s performance generally scales with the amount of training data. The Transformer policy trained on all grasping data with RGB inputs is able to grasp the objects 28 out of the 50 times tested. The ResNet policy trained on the same data and observation achieves a comparable 27 out of 50 success. The policy performance drops to 12 out of 50 as the amount of training data decreases to 20%. It is interesting to note that the ResNet policies are able to generalize and grasp unseen objects shown in Figure 8 with comparable success as seen objects regardless of the amount of training data. Furthermore, we find that depth information is beneficial as the ResNet policies trained with both depth and RGB information consistently outperform RGB-only policies trained with the same number of data. The common failure modes of this task include missing the objects and not closing the gripper at the right time.

Figure 10.

Number of successful grasps out of 50 trials across five seen and five unseen objects for ResNet and transformer policies trained on various observation spaces and data percentages. The policies are able to grasp unseen objects with similar success rates as seen objects, while the overall success rate grows with the amount of training data. Training ResNet policies with depth information increases the performance across the board.

5.2. Repositioning task

For the three repositioning skills, rotate, place on fixture, and regrasp, the human demonstration data can induce multiple modes of actions given the same observation. For example, an object can be rotated towards left or right contingent upon the context derived from its observational history. We thus also train the ResNet-based policies with action chunking (Zhao et al., 2023), a recent method of showing promising performance handling multi-modalities in human demonstrations. We tested the performance of ResNet policies with and without action chunking, along with a Transformer-based policy without action chunking on seen and unseen objects. The results are presented in Figure 11. The ResNet policy without action chunking outperforms its counterpart with action chunking and Transformer on the rotate skill. In contrast, the Transformer policies outperform ResNet policies with or without action chunking for the place on fixture and regrasp skills. The common failure modes for this task include not opening or closing the gripper at the right time and rotating in the wrong direction.

Figure 11.

Number of successes for policies trained on the three repositioning tasks: rotate, place on fixture, and regrasp. We tested three models: ResNet without action chunking, ResNet with action chunking length 3, and transformer without action chunking. Each policy is evaluated 50 times across 5 seen and 5 unseen objects.

5.3. Insertion task

For the insertion task, we studied the effect of different observation spaces of different input modalities, experimented with training a single policy for all insertion object shapes, and compared the performance of policies only trained on particular shapes.

We first experimented with the policy’s input modality by training ResNet policies on rectangular object insertion data, ablating the use of depth maps as well as force/torque information. As presented in Table 2, we found that the input modality has a large impact on the insertion performance, with the best being two wrist camera RGB views, one side camera RGB view, TCP pose, velocity, and force/torque. This shows that force/torque information is crucial for these contact-rich tasks, as the policy is able to tell whether the objects are in contact and execute a searching behavior. Surprisingly, using depth deteriorated the insertion performance. This could be because TCP pose information is already present in the observation space, and the noisy depth information does not aid in this precise task. Instead, it confuses the accurate end-effector pose readings. For the following experiments, we will utilize the input modalities that have led to the best performance, as demonstrated in this table.

Table 2.

Ablation on input modality for insertion policy. For all policies, we include one RGB side view, two RGB wrist views, and velocity. We experimented with adding depth information from each view and adding force/torque information. By evaluating 25 trials across 5 different object sizes, we can conclude that force/torque information is crucial for contact-rich manipulation tasks like this and that depth information deteriorates performance.

Observation	Success rate
Three RGB, pose, vel	2/25
Three RGBD, pose, vel	2/25
Three RGB, pose, vel, force/torque	11/25
Three RGBD, pose, vel, force/torque	5/25

To carry out an initial study to understand the complexity of the insertion task, we train different ResNet policies for each object shape and evaluate them according to the procedure in Section 3.7 We can see that the success rate does decrease as the shape becomes more complex, with the hardest one being the three-prong object shown in Figure 12. The common failure modes include getting stuck near the holes, impeding fine-grained adjustments, difficulty in locating the matching openings, and challenges in handling multi-modalities in the demonstration data. For example, the two-pronged object with asymmetrical shapes may require a rotation between 0 and 90°, depending on its grasping pose, to align with the shapes of the hole openings. This implies the assembly task is indeed a challenging robotic manipulation task for future benchmarking.

Figure 12.

Left to right: rectangle, hexagon, circle-square, three-prong, star. We show the number of successful insertions out of 25 trials across 5 different object sizes for ResNet policies trained on individual shapes. Notice that the success rate can vary dramatically depending on the geometry of the object, creating a gradient of difficulties, which is ideal for a benchmark.

To study if co-training with data from other shapes helps, we then perform experiments on training policies with the insertion data that contains all the shapes and sizes. Table 3 shows that, when we naïvely train an unconditional ResNet policy with all the data, the policy achieves a success rate of only 7 out of 45 across 9 shapes. The main failure mode is trying to insert the objects into the wrong slots as well as struggling with the fine-grained execution of the insertion when in close proximity to the slots. This is not unreasonable because the policy needs to infer the right matching opening from the camera inputs, together with predicting the fine motor commands to perform the precise insertion. The combined complexities of these tasks significantly heighten the challenge beyond that of any individual component. When we provide the policy with a one-hot vector indicating the object shape, the performance increases to 14 out of 45. Qualitatively, the policy sometimes goes to the wrong opening and sometimes fails to insert the object after going to the vicinity of the correct hole. When we train a Transformer policy with the same object shape conditioning, it achieves a 27 out of 45 success rate. We hypothesize that with the attention mechanism and the FiLM conditioning layer, our transformer policy architecture is able to pay more attention to the shape conditioning and, therefore, never reaches for the wrong hole.

Table 3.

We train policies on all insertion data and evaluate 5 trials for each of the 9 object shapes. We find that using one-hot vector embedding according to the shape of the object being assembled helps the policy spatially separate the target insertion position.

Policy	Success rate
Unconditioned ResNet	7/45
Object ID conditioned ResNet	14/45
Object ID conditioned transformer	27/45

5.4. Multi-stage manipulation tasks

As described in Section 3, the difficulties of the multi-stage assembly tasks mainly come from dealing with compounding errors introduced by each stage of manipulation, as well as reasoning the manipulation sequences. To verify these points so as to facilitate the use of proposed hierarchical policy structures, we train “flat” end-to-end imitation learning policies directly on the full long-horizon demonstrations. We train both ResNet and Transformer policies on all the RGB camera views together with other necessary robot proprioceptive information. The goal of trained policies is to successfully grasp, reorient, and perform assembly. We assess the performance of the trained policies by conducting 10 trials for each object shape in the case of the single-object task and 10 trials for each initial object configuration in scenarios involving multi-object manipulation.

Tables 4 and 5 present results for both single-object and multi-object tasks, as illustrated in Figures 2 and 3. In the single-object manipulation task, both the ResNet and Transformer models recorded a success rate of 0/10 when evaluated on objects of three distinct shapes. Similarly, in the multi-object manipulation task, both of them achieved a success rate of 0/10. The observed failure modes encompassed errors such as positioning the objects incorrectly, executing inappropriate gripper actions, and generating entirely irrational robot movements. While these outcomes serve as plausible indicators of the previously mentioned issue of error-compounding, they do not entirely rule out a significant confounding factor, namely, the multi-modalities present in the human demonstration data. To further investigate this, we also train a diffusion policy (Chi et al., 2023) using a ResNet. This approach models the conditional action distribution with diffusion models, which already show promising results in representing complex multi-modal distributions of human demonstration data. However, as the results presented in Tables 4 and 5, diffusion policies achieved 0/10 on both tasks. These experimental results confirmed FMB multi-stage manipulation tasks are indeed challenging, and error-compounding issues must be addressed appropriately to fully solve these tasks; which necessitate the use of hierarchical policies.

Table 4.

We conducted an evaluation of various policies for single-object multi-stage manipulation tasks, focusing on the performance of transformer and ResNet models across three distinct shapes. Notably, all unconditional policies, including those trained with diffusion models, recorded a zero success rate. We compared two types of hierarchical policies differentiated by the conditioning mechanism between the high-level and low-level policies. We found the transformer-based policy achieved the most compelling results while providing a flexible structure for handling different input modalities.

Method	Hexagon (10 trials)	Circle − square (10 trials)	Three − prong (10 trials)	Total (30 trials)
Diffusion ResNet
Unconditioned policy	0	0	0	0
ResNet
Unconditioned policy	0	0	0	0
Hierarchical policy
Human oracle → one policy (skill ID conditioned)	0	0	0	0
Human oracle → five policies (one policy per skill)	9	8	1	18
Transformer
Unconditioned policy	0	0	0	0
Hierarchical policy
Human oracle → one policy (skill ID conditioned)	7	6	2	15
Human oracle → five policies (one policy per skill)	9	8	2	19

Table 5.

We conducted an evaluation of various policies for multi-object multi-stage manipulation tasks, focusing on the red board as shown in Figure 1. The hierarchical policies use a human oracle as the high-level policy, sequentially triggering a low-level policy with the appropriate primitive and object IDs for each stage. Similar to single-object manipulation tasks, all unconditioned policies achieved zero success. Remarkably, the transformer-based policy outperformed others, achieving a success rate of 7/10.

Method	Assembly board one (10 trials)
Diffusion ResNet
Unconditioned policy	0
ResNet
Unconditioned policy	0
Hierarchical policy	5
Transformer
Unconditioned policy	0
Hierarchical policy	7

We studied two ways of instantiating such hierarchical methods as presented in Figure 13. In both cases, we employ a high-level human oracle that functions as a state machine, determining the appropriate low-level skill to execute. This oracle maintains a sequence of skills to be executed at each decision point. It is also responsible for re-executing any primitive skill that either failed in the previous step or the resulting state is deemed unsuitable for the subsequent step. For example, it may retry grasping if the object was initially grasped at a location unfavorable for insertion. The procedure is designed to terminate under two conditions: either when an unrecoverable state is encountered or when a pre-set maximum number of trial steps is reached. While they use the same high-level policy, these approaches diverge in their representation of low-level skills. To assess the efficacy of the conditioning mechanism integrated into the architecture depicted in Figure 9, we conducted a comparative study. This involved training five distinct policies, each representing a specific low-level skill, which were then directly invoked by the high-level policy.

Figure 13.

Illustration of the policies tested on the multi-stage task. (a) An unconditioned policy is trained on the end-to-end task. (b) A task-conditioned policy is trained on multiple skills, and a human oracle provides the appropriate skill ID, and optionally object ID, sequentially. (c) 5 unconditioned policies are trained on the 5 skills separately, and the human oracle selects the best policy to execute sequentially.

First, we observed that the hierarchical policies attained measurable levels of success, in contrast to the flat policies, which demonstrated zero success as in Tables 4 and 5. However, despite employing a human oracle as the high-level policy endowed with a profound understanding of the tasks to make near-optimal decisions, the maximum success rate achieved was only 19 out of 30 for single-object tasks and 7/10 for multi-object tasks. This indicates the inherently complex challenges presented by the FMB, affirming its suitability as a benchmark for developing advanced robotic learning methods.

For the single-object task as presented in Table 4, the Transformer-based policies achieve comparable performance between the two aforementioned hierarchical methods, namely, 19/30 compared to 15/30. However, for the ResNet-based policies, conditioned ResNet achieved zero success out of 30 trials, whereas chaining separate policies attained an 18/30 success rate, which is comparable to that of the Transformer-based policies. For the multi-object task presented in Table 5, the conditioned hierarchical ResNet policy achieved 5/10 success compared to the conditioned hierarchical Transformer policy’s 7/10 success rate. To understand this phenomenon, we found that the primary factor that causes performance difference is the ability to handle multi-modal sensory inputs between ResNet and Transformer policies. For each skill, there is an optimal set of sensory inputs. For example, the insertion skill reached its peak performance using three RGB camera views, supplemented with additional sensory data, as outlined in Table 2. However, we observed that incorporating a fourth camera view, specifically the right-side camera, into a ResNet policy significantly impairs its performance. This decline is primarily due to the randomized positions of the assembly board. The distant camera struggles to precisely locate the corresponding holes, leading to incorrect spatial feature associations, such as the board’s edge, rather than the target location. This observation is further corroborated by the fact that incorporating a fourth camera view in multi-stage tasks, as detailed in Table 5, did not adversely affect performance. This is largely attributable to the fixed position of the assembly board. In such scenarios, the redundant information provided by the additional camera remains consistent, making it sufficiently apparent for the system to effectively ignore it. Similarly, the grasping skill generally does not benefit from adding end-effector force/torque information as it does not perform contact-rich fine-grained manipulation. In fact, we selected distinct sets of sensory inputs to tailor the specific requirements of each task and supplied these to five different ResNet policies. On the other hand, we fed all available sensory inputs to the conditioned policies. These policies are then required to learn the skill of selecting the appropriate set of input modalities, guided by supervision from their respective actions. The performance of the ResNet-based policies was observed to degrade due to their difficulty in disregarding task-irrelevant inputs, leading to incorrect feature associations. In contrast to the ResNet-based policies, the Transformer-based policies learned to effectively ignore task-irrelevant modalities, such as the non-essential fourth camera in the insertion task. This attribute is particularly beneficial in the multi-stage, multi-task imitation learning settings characteristic of FMB tasks.

6. Discussion and limitations

In this paper, we present the Functional Manipulation Benchmark (FMB). Through the careful design of tasks, the provision of a comprehensive dataset and reproducible hardware and software system, FMB enables studying several critical challenges in robotic manipulation learning: complexity of task and skills, generalization across varied objects, and reproducibility of research.

One of the primary contributions of FMB is its focus on the complexity of manipulation tasks and the need for generalization. The tasks, ranging from single-object manipulation to complex multi-object multi-stage assemblies, capture important aspects of real-world manipulation challenges.

The inclusion of diverse 3D-printed objects enhances the need for robots to generalize their learned skills to new and unseen objects, as well as easing the burden of reproducing the proposed tasks. Our open-sourced imitation learning system, complemented by a comprehensive analysis of our experimental findings on FMB tasks, offers a foundation for researchers seeking to develop and enhance their methodologies.

Researchers can get started with FMB by first replicating our publicly available setup and trying out some of our pre-trained models. We anticipate that this initial exploration will pave the way for them to develop and evaluate new methods. For this reason, we look forward to their contributions and insights on the tasks proposed by FMB. Additionally, the nature of the FMB tasks is inherently conducive to ongoing development. Researchers have the opportunity to create novel 3D-printed objects and collect demonstrations, thereby enriching the FMB project. Notably, since the objects in multi-stage assembly tasks are constructed using a specific “grammar,” there is potential to incorporate a far greater variety of assembly boards than those currently present in FMB tasks.

Our hope is that FMB can serve as a user-friendly toolkit for individuals eager to delve into robot learning. Its inherent task complexity will foster the advancement of cutting-edge robot learning methodologies. We wish that the value FMB adds to the robot learning community will ultimately encourage community contributions, further supporting its ongoing development.

Supplemental Material

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research,authorship,and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research,authorship,and/or publication of this article: This research was partly supported by the National Science Foundation through IIS-2150826,ONR through N00014-22-1-2773 and N00014-20-1-2383,Intrinsic,and the AI Institute. We also thank the computing resource support from the Berkeley Research Computing (BRC) program,the NSF Cloudbank program,and the Google TPU Research Cloud program.

ORCID iD

Jianlan Luo

Supplemental Material

Supplemental material for this article is available online.

Note

References

Abbeel

Coates

Quigley

, et al. (2006) An application of reinforcement learning to aerobatic helicopter flight. In: Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS’06, Cambridge, MA, 2006, 1–8. MIT Press. URL: https://dl.acm.org/citation.cfm?id=2976456.2976457

Aleotti

Caselli

(2008) Grasp programming by demonstration: a task-based quality measure. In: RO-MAN 2008 - The 17th IEEE International Symposium on Robot and Human Interactive Communication, Munich, Germany, 1–8 August 2008, pp. 383–388. DOI: 10.1109/ROMAN.2008.4600696.

Bharadhwaj

Vakil

Sharma

, et al. (2023) Roboagent: generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In: First Workshop on Out-of-Distribution Generalization in Robotics at CoRL 2023, Atlanta, GA, 06 November 2023. URL: https://openreview.net/forum?id=Pt5N3OG5wP

Brohan

Brown

Carbajal

, et al. (2023) Rt-1: robotics transformer for real-world control at scale. In: Proceedings of Robotics: Science and Systems (RSS), Daegu, Republic of Korea, 10–14 July 2023.

Bruyninckx

Dutre

De Schutter

(1995) Peg-on-hole: a model based solution to peg and hole alignment. In: Proceedings of 1995 IEEE International Conference on Robotics and Automation, Nagoya, Japan, 21–27 May 1995, pp. 1919–1924. DOI: 10.1109/ROBOT.1995.525545.

Buchli

Stulp

Theodorou

, et al. (2011) Learning variable impedance control. The International Journal of Robotics Research 30(7): 820–833. DOI: 10.1177/0278364911402527.

Calli

Walsman

Singh

, et al. (2015) Benchmarking in manipulation research: using the yale-CMU-Berkeley object and model set. IEEE Robotics and Automation Magazine 22(3): 36–52. DOI: 10.1109/mra.2015.2448951.

Chi

Feng

, et al. (2023) Diffusion policy: visuomotor policy learning via action diffusion. In: Proceedings of Robotics: Science and Systems (RSS), Daegu, Republic of Korea, 10–14 July 2023.

Collaboration

Padalkar

Pooley

, et al. (2023) Open x-embodiment: robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864.

10.

Cui

Trinkle

(2021) Toward next-generation learned robot manipulation. Science Robotics 6(54): eabd9461. DOI: 10.1126/scirobotics.abd9461.

11.

Dasari

Ebert

Tian

, et al. (2020) Robonet: large-scale multi-robot learning. In: Kaelbling

Kragic

Sugiura

(eds) Proceedings of the Conference on Robot Learning, Proceedings of Machine Learning Research, Osaka, Japan, 2020, 885–897. PMLR. URL: https://proceedings.mlr.press/v100/dasari20a.html

12.

De Magistris

Munawar

Pham

, et al. (2018) Experimental force-torque dataset for robot learning of multi-shape insertion. arXiv preprint arXiv:1807.06749.

13.

Ebert

Yang

Schmeckpeper

, et al. (2022) Bridge data: boosting generalization of robotic skills with cross-domain datasets. In: Proceedings of Robotics: Science and Systems (RSS), Daegu, Republic of Korea, 10–14 July 2023.

14.

Falco

Marvel

Norcross

, et al. (2016) Benchmarking Robot Force Control Capabilities: Experimental Results. Gaithersburg, MD: National Institute of Standards and Technology. DOI: 10.6028/NIST.IR.8097.

15.

Fang

Wang

Gou

, et al. (2019) Graspnet: a large-scale clustered and densely annotated dataset for object grasping. arXiv preprint arXiv:1912.13470.

16.

Fang

Tang

, et al. (2023) Rh20t: a robotic dataset for learning diverse skills in one-shot. In: RSS 2023 Workshop on Learning for Task and Motion Planning, Daegu, Republic of Korea, 10–14 July 2023.

17.

Finn

Goodfellow

Levine

(2016) Unsupervised learning for physical interaction through video prediction. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Barcelona, Spain, 5 December 2016, p. 64–72.

18.

Holly

Lillicrap

, et al. (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May 2017–03 June 2017, pp. 3389–3396. DOI: 10.1109/ICRA.2017.7989385.

19.

Zhang

Ren

, et al. (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, 27–30 June 2016, pp. 770–778.

20.

Heo

Lee

, et al. (2023) Furniturebench: reproducible real-world benchmark for long-horizon complex manipulation. arXiv preprint arXiv:2305.12821.

21.

Hopcroft

Kearney

Krafft

(1991) A case study of flexible object manipulation. The International Journal of Robotics Research 10(1): 41–50. DOI: 10.1177/027836499101000105.

22.

Rovinsky

Luo

, et al. (2023) Reboot: reuse data for bootstrapping efficient real-world dexterous manipulation. In: Tan

Toussaint

Darvish

(eds) Proceedings of the 7th Conference on Robot Learning, Proceedings of Machine Learning Research, Atlanta, GA, 06–09 November 2023, 1930–1949. PMLR. URL: https://proceedings.mlr.press/v229/hu23a.html

23.

James

Wohlhart

Kalakrishnan

, et al. (2019) Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, San Diego, CA, 27–28 August 2019, pp. 12627–12637.

24.

Kalashnikov

Irpan

Pastor

, et al. (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Billard

Dragan

Peters

, et al. (eds) Proceedings of the 2nd Conference on Robot Learning, Proceedings of Machine Learning Research, Zürich, Switzerland, 29–31 October 2018, 651–673. PMLR. URL: https://proceedings.mlr.press/v87/kalashnikov18a.html

25.

Kimble

Van Wyk

Falco

, et al. (2020) Benchmarking protocols for evaluating small parts robotic assembly systems. IEEE Robotics and Automation Letters 5(2): 883–889. DOI: 10.1109/LRA.2020.2965869.

26.

kook Yun

(2008) Compliant manipulation for peg-in-hole: is passive compliance a key to learn contact motion? In: 2008 IEEE International Conference on Robotics and Automation, Pasadena, CA, 19–23 May 2008, pp. 1647–1652. DOI:10.1109/ROBOT.2008.4543437.

27.

Lee

Devin

Zhou

, et al. (2022) Beyond pick-and-place: tackling robotic stacking of diverse shapes. In: Faust

Hsu

Neumann

(eds) Proceedings of the 5th Conference on Robot Learning, Proceedings of Machine Learning Research, London, UK, 08–11 November 2021, 1089–1131. PMLR. URL: https://proceedings.mlr.press/v164/lee22b.html

28.

Levine

Finn

Darrell

, et al. (2016) End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17(1): 1334–1373.

29.

Levine

Pastor

Krizhevsky

, et al. (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37(4–5): 421–436. DOI: 10.1177/0278364917710318.

30.

Sastry

(1988) Task-oriented optimal grasping by multifingered robot hands. IEEE Journal of Robotics and Automation 4(1): 32–44. DOI: 10.1109/56.769.

31.

Liu

Daruna

Chernova

(2020) Cage: context-aware grasping engine. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May 2020–31 August 2020, pp. 2550–2556. DOI: 10.1109/ICRA40945.2020.9197289.

32.

Luo

Solowjow

Wen

, et al. (2019) Reinforcement learning on variable impedance controller for high-precision robotic assembly. In: 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, 20–24 May 2019, 3080–3087. IEEE.

33.

Luo

Sushkov

Pevceviciute

, et al. (2021) Robust multi-modal policies for industrial assembly via reinforcement learning and demonstrations: a large-scale study. In: Proceedings of Robotics: Science and Systems, Virtual Conference, July 2021. DOI: 10.15607/RSS.2021.XVII.088.

34.

Mahler

Liang

Niyaz

, et al. (2017) Dex-net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In: Proceedings of Robotics: Science and Systems (RSS), Daegu, Republic of Korea, 10–14 July 2023.

35.

Mandlekar

Booher

Spero

, et al. (2019) Scaling robot supervision to hundreds of hours with roboturk: robotic manipulation dataset through human reasoning and dexterity. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 4–8 November 2019, pp. 1048–1055. DOI: 10.1109/IROS40897.2019.8968114.

36.

OpenAI

Andrychowicz

Chociej

, et al. (2019) Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113.

37.

Padalunkal

Chao

Xiang

(2023) Fewsol: a dataset for few-shot object learning in robotic environments. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May 2023–2 Jun 2023, pp. 9140–9146. DOI: 10.1109/ICRA48891.2023.10161143.

38.

Perez

Strub

De Vries

, et al. (2018) Film: visual reasoning with a general conditioning layer. In: Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, 2–7 February 2018.

39.

Peters

Schaal

(2008) Learning to control in operational space. The International Journal of Robotics Research 27(2): 197–212. DOI: 10.1177/0278364907087548.

40.

Pinto

Gupta

(2016) Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016, 3406–3413. IEEE.

41.

Ross

Gordon

Bagnell

(2011) A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, 11–13 April 2011, pp. 627–635.

42.

Salehian

SSM

Billard

(2018) A dynamical-system-based approach for controlling robotic manipulators during noncontact/contact transitions. IEEE Robotics and Automation Letters 3(4): 2738–2745.

43.

Seita

Jamali

Laskey

, et al. (2022) Deep transfer learning of pick points on fabric for robot bed-making. In: Asfour

Yoshida

Park

, et al. (eds) Robotics Research. Cham, Switzerland: Springer International Publishing, 275–290.

44.

Tang

Lin

Zhao

, et al. (2016) Autonomous alignment of peg and hole by force/torque measurement for robotic assembly. In: 2016 IEEE International Conference on Automation Science and Engineering (CASE), Fort Worth, TX, 21–25 August 2016, pp. 162–167. DOI: 10.1109/COASE.2016.7743375.

45.

Tyree

Tremblay

, et al. (2022) 6-dof pose estimation of household objects for robotic manipulation: an accessible dataset and benchmark. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022, pp. 13081–13088. DOI: 10.1109/IROS47612.2022.9981838.

46.

Vaswani

Shazeer

Parmar

, et al. (2017) Attention is all you need. In: Guyon

Luxburg

Bengio

, et al. (eds) Advances in Neural Information Processing Systems, Long Beach, CA, 4–9 December 2017. Curran Associates, Inc. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

47.

Vecerik

Sushkov

Barker

, et al. (2019) A practical approach to insertion with variable socket position using deep reinforcement learning. In: 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, 20–24 May 2019, pp. 754–760. DOI: 10.1109/ICRA.2019.8794074.

48.

Walke

Black

Zhao

, et al. (2023) Bridgedata v2: a dataset for robot learning at scale. In: 7th Annual Conference on Robot Learning, Atlanta, GA, 6 November 2023. URL: https://openreview.net/forum?id=f55MlAT1Lu

49.

Chi

Burchfiel

, et al. (2022) Dextairity: deformable manipulation can be a breeze. In: Proceedings of Robotics: Science and Systems (RSS), Daegu, Republic of Korea, 10–14 July 2023.

50.

Yang

Jayaraman

Zhang

, et al. (2019) Replab: a reproducible low-cost arm benchmark for robotic learning. In: 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, 20–24 May 2019, pp. 8691–8697. DOI: 10.1109/ICRA.2019.8794390.

51.

Zakka

Zeng

Lee

, et al. (2020) Form2fit: learning shape priors for generalizable assembly from disassembly. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May 2020–31 August 2020, pp. 9404–9410. DOI: 10.1109/ICRA40945.2020.9196733.

52.

Zeng

Florence

Tompson

, et al. (2021) Transporter networks: rearranging the visual world for robotic manipulation. In: Kober

Ramos

Tomlin

(eds) Proceedings of the 2020 Conference on Robot Learning, Proceedings of Machine Learning Research, Virtual Conference, 16–18 November 2020, 726–747. PMLR. URL: https://proceedings.mlr.press/v155/zeng21a.html

53.

Zhao

Troniak

Kroemer

(2021) Towards robotic assembly by predicting robust, precise and task-oriented grasps. In: Kober

Ramos

Tomlin

(eds) Proceedings of the 2020 Conference on Robot Learning, Proceedings of Machine Learning Research, Virtual Conference, 16–18 November 2020, 1184–1194. PMLR. URL: https://proceedings.mlr.press/v155/zhao21c.html

54.

Zhao

Luo

Sushkov

, et al. (2022) Offline meta-reinforcement learning for industrial insertion. In: 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, 23–27 May 2022, pp. 6386–6393. DOI: 10.1109/ICRA46639.2022.9812312.

55.

Zhao

Kumar

Levine

, et al. (2023) Learning fine-grained bimanual manipulation with low-cost hardware. In: Proceedings of Robotics: Science and Systems (RSS), Daegu, Republic of Korea, 10–14 July 2023.

56.

Zitkovich

, et al. (2023) Rt-2: vision-language-action models transfer web knowledge to robotic control. In: Tan

Toussaint

Darvish

(eds) In: Proceedings of Machine Learning Research, Atlanta, GA, 6–9 November 2023, 2165–2183. PMLR. URL: https://proceedings.mlr.press/v229/zitkovich23a.html

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB