Abstract
1. Introduction
Manipulation is one of the foundational problems in robotics research, but enabling robots to perform dexterous manipulation skills that reflect the capabilities of humans is still out of reach. In fact, even matching the performance of human
While significant recent research in robotic learning has made progress on various aspects of manipulation problems (Abbeel et al., 2006; Brohan et al., 2023; Buchli et al., 2011; Gu et al., 2017; Hopcroft et al., 1991; Kalashnikov et al., 2018; Levine et al., 2016; Mahler et al., 2017; Peters and Schaal, 2008; Salehian and Billard, 2018; Xu et al., 2022; Zeng et al., 2021), much of the emphasis in recent works has been either on broad generalization with relatively simple skills, which often do not capture many physical challenges of manipulation (e.g., imprecise pick-and-place tasks) (Ebert et al., 2022; Levine et al., 2018; Pinto and Gupta, 2016), or on performing narrow tasks with physically more complex skills without extensive generalization (Hu et al., 2023; Kimble et al., 2020; OpenAI et al., 2019; Vecerik et al., 2019). This is not unreasonable: it is very difficult to simultaneously make progress on broad generalization (which often requires huge datasets) and tackle the full physical complexity of dexterous manipulation. So how can we take a step toward facilitating robotic learning research that emphasizes both generalization and physically intricate skills while still keeping the problem constrained enough to enable meaningful progress?
In this paper, we propose such a real-world benchmark, which we call the functional manipulation benchmark (FMB). FMB aims to cover important dimensions of physical complexity and object generalization while still providing a degree of accessibility by carefully restricting the scope to a domain where we can make progress with reasonably sized datasets and models. We approach the design of this benchmark by defining functional manipulation as the problem of manipulating objects in ways functionally relevant to a sequence of manipulation behaviors, such as picking up an object with an appropriate pose, repositioning it if necessary, and then using it for physical interactions. Two such examples can be seen in Figures 2 and 3. While this definition is more restrictive, we believe it captures a broad range of practical manipulation tasks and includes both the challenges of contact dynamics and object generalization.
The specific tasks we instantiate to capture functional manipulation are themed around assembly problems, including pick-and-place tasks and more complex long-horizon multi-stage multi-part assemblies. These tasks, illustrated in Figure 1, require picking up the individual pieces, reorienting them (potentially using environment fixtures and regrasping), and then slotting them into their corresponding locations. Each phase requires addressing the challenges of complex contact dynamics, skill sequencing, and object generalization. The objects vary in shape, size, and color between training and testing phases, and their locations are randomized. The grasping phase requires selecting a grasp that is suitable for reorienting or inserting the object, the reorientation phase requires positioning the object so that its pose can be adjusted in the desired way, and the assembly phase requires compliant insertion and proper accounting for the contact forces on the object. Each phase requires handling different objects (including held-out objects) and different poses. The overall sequencing strategy needs to serve as the mechanism for composing such skills appropriately, as well as for recovering from failed execution. For example, for the task presented in Figure 2, the robot may need to retry a failed grasp multiple times until it firmly holds the object before advancing to the next stage. In the tasks illustrated in Figure 3, the robot must further reason about the right sequence of manipulation, as these objects are assembled in an interlocking fashion.
Figure 2. An illustration of the steps for completing a single-object manipulation task, which requires grasping the part, reorienting it (potentially using an environment fixture), and then inserting it into the appropriate slot.
Figure 3. An illustration of the steps for solving a multi-object manipulation task, which requires performing the same skills as the single-object task repeatedly for each component in the interlocking assembly.


To ensure the reproducibility and portability of such tasks, we designed 66 3D-printed objects with diverse shapes and sizes that can be easily replicated by other researchers. Accompanying these objects, we collected a dataset of 22,500 human demonstrations of grasping, repositioning, and assembly skills. Our dataset contains a variety of sensory modalities, as presented in Figure 7: we record RGB and depth images from multiple cameras, relevant robot kinematics information, as well as force/torque measurement at the robot’s end-effector frame. We also trained a set of imitation learning policies to perform either individual stages or the entire assembly tasks. These policies are also provided as pre-trained model checkpoints so that they can be reused by others as component parts of larger systems or as scaffolds for studying improvements to individual stages. FMB is modular so that other researchers can repurpose it for a variety of methods that they may wish to develop and can focus on any stage or aspect of the task. For example, some researchers might choose to focus on better functional grasping or assembly methods, while the other stages are handled by our baseline system. Some researchers might focus on skill sequencing, utilizing trained skills from our system for the individual steps. Others might also focus on developing an end-to-end method for the entire multi-stage task, fully utilizing the provided training data. With the accessible and extensive framework that FMB provides, our hope is that it can serve as a “toolkit” to facilitate the entry of researchers into the field of robot learning with ease.
2. Related work
Considerable recent progress in robotic manipulation has studied generalization, though often in the context of simpler tasks such as grasping (Dasari et al., 2020; Levine et al., 2018; Yang et al., 2019), pushing (Dasari et al., 2020; Finn et al., 2016), and imprecise repositioning (Dasari et al., 2020; Lee et al., 2022). A number of other works have studied tasks that are dynamic (Seita et al., 2022), precise (e.g., insertion) (Zakka et al., 2020), contact-rich (Falco et al., 2016), or otherwise physically challenging (Kimble et al., 2020; OpenAI et al., 2019). Fewer works have studied these factors in combination (Heo et al., 2023). We believe many of the central challenges in robotic manipulation lie at the confluence of these two challenges: tasks that require handling contact dynamics, not by memorizing the particular pattern needed for a single narrow task, but by learning general behaviors for handling object interaction that can generalize to new objects. Our aim is to propose a benchmark that can study this combination of challenges while keeping the scope narrow enough that it remains accessible to many researchers.
Our functional manipulation tasks combine aspects of grasping, repositioning, and assembly. A number of works have studied functional grasping (Aleotti and Caselli, 2008; Levine et al., 2018; Li and Sastry, 1988; Liu et al., 2020; Zhao et al., 2021) and insertion (Mahler et al., 2017) separately. Our goal is not to attain the best possible performance in narrow settings for any of these stages (e.g., ultra-high-precision industrial insertion as in the NIST board challenge (Kimble et al., 2020)) but to use these tasks as a lens through which to gauge general manipulation capabilities learned via general-purpose robotic learning methods.
A number of prior works have proposed datasets for robotic learning, including datasets consisting of demonstrations (Ebert et al., 2022; Fang et al., 2023; Walke et al., 2023) and autonomously collected data (Levine et al., 2018; Pinto and Gupta, 2016), as well as annotated datasets of grasp points (Fang et al., 2019), object geometries (Padalunkal et al., 2023; Tyree et al., 2022), simulated environments (James et al., 2019), and multi-modal inputs (Fang et al., 2023). However, there has been comparatively little work on standard and accessible object sets that are combined with multi-stage tasks for studying generalization. The YCB object set comes with a number of evaluation protocols (Calli et al., 2015), but these protocols generally focus on object repositioning tasks that do not evaluate the complex contact challenges discussed in the previous section. A number of existing demonstration datasets cover many different behaviors (Bharadhwaj et al., 2023; Dasari et al., 2020; Ebert et al., 2022; Mandlekar et al., 2019; Walke et al., 2023), but they also emphasize basic pick-and-place skills rather than precise or contact-rich manipulation. Some works have focused on insertion skills in particular (e.g., connector insertion) (Bruyninckx et al., 1995; De Magistris et al., 2018; kook Yun, 2008; Luo et al., 2019, 2021; Tang et al., 2016; Zhao et al., 2022). While FMB is related, we aim specifically to cover a range of skills, including grasping and repositioning, that we believe form a basis of fundamental manipulation capabilities. We also emphasize generalization as a primary challenge for FMB.
We use 3D-printed objects to facilitate reproducibility. Other prior works have also proposed standard meshes and 3D-printed parts for benchmarking and reproducibility (Calli et al., 2015), typically focusing on object grasping. These efforts are related, but our aim is to provide parts that are specifically well suited for evaluating all of the stages: grasping, reorientation, and assembly, rather than only grasping.
3. Functional manipulation benchmark
In this section, we introduce the basic principles behind FMB and the protocols to evaluate different methods on this benchmark. FMB tasks can broadly fall into two categories: single-object multi-stage manipulation tasks and multi-object multi-stage manipulation tasks. They both require acquiring individual manipulation skills such as grasping, repositioning, and insertion, as well as composing these individual skills to complete the full task as depicted in Figures 2 and 3. These two categories bear similar design principles but differ in the additional complexity of the second category, which involves selecting the appropriate object for manipulation. We are primarily concerned with studying the generalization of each individual functional manipulation skill as well as evaluating the performance of different methods on the full assembly task. Therefore, we collect a diverse dataset of robotic behaviors with different objects, viewpoints, and robot initial poses. We also provide novel objects to evaluate the generalization capability of individual skills. Thus, we test the generalization of learned manipulation skills in terms of object location and physical attributes.
3.1. Object set
The objects in FMB are 3D-printable, and the CAD files are available on our website. In total, we have 66 objects, as shown in Figure 1: 54 of them belong to the single-object manipulation tasks, and the remaining 12 compose the multi-object manipulation tasks. The 54 single-object pieces span nine basic shapes with six sizes each, and each object is assigned one of eight colors specified on our website. These objects are paired with three boards with matching openings, shown on the left of Figure 1. We additionally designed three more complex boards to facilitate multi-stage assembly tasks, shown on the right of Figure 1; the objects there are generated procedurally so that they can only be fit together in specific orders. The tolerance for mating all objects is between 1 mm and 2 mm, which is practical for commercially available 3D printers. Additionally, we created five test objects to evaluate generalization capabilities. These vary in shape, size, and color from the training objects.
3.2. Individual manipulation skills
In this section, we describe the “primitive” manipulation skills included in FMB for evaluation as well as our data collection system. For each type of skill, we provide demonstration trajectories collected with a Franka robot (see Figure 7) and an evaluation protocol as in Section 3.7. This modular design of our benchmark facilitates extension to add new tasks with the provided objects, and the tasks we describe here are suitable both for evaluating generalization and for testing a range of manipulation capabilities.
3.2.1. Grasping
The grasping task in our benchmark is a
Figure 4. Illustration of individual skills in the single-object task. Note that the grasp and rotate skills have to manipulate the object in both the vertical and horizontal orientation. For isometric shapes like the rectangle, the insert skill needs to decide whether to rotate the object to line up with the hole.
3.2.2. Repositioning
A repositioning step is sometimes necessary to adjust the grasping pose so that the object is held in a way that is suitable for downstream assembly, as mentioned in the last paragraph. For objects with asymmetric geometries, a rotation operation is usually desirable for the downstream insertion task. For example, the objects in the second column of Figure 4 need to be rotated 180° so that they can slot into the matching holes in the board more easily. In addition, manipulating and reorienting objects by leveraging environment affordances (e.g., tilting the object in the gripper by levering it against a table or wall) may often be necessary for fluent and complex manipulation, and this reorientation task exercises this capability. We provide a simple fixture that can serve as an environment affordance to rest the object at an angle, as shown in Figure 4. To reorient the objects into the right pose, the robot may need to use this fixture, resting the object on it and then regrasping it in a pose more appropriate for reorientation. We collected 4500 demonstrations for placing and regrasping, which can be used to learn strategies for using environmental affordances for regrasping and reorientation. Since objects land in the fixture in a relatively deterministic fashion, we partially script our demonstration collection process while maintaining a certain degree of randomness for data diversity. We detail this process and provide the code implementing it on our website.
3.2.3. Insertion
Our assembly tasks require inserting objects with diverse shapes into their matching slots, which requires fine-grained, precise manipulation. An illustrative example is shown in Figure 4. Here, having completed the preceding steps, the robot is holding an object and needs to insert it into the matching slot on the board. For the single-object task, we collected 125 human demonstrations per object that include various robot initial poses and board positions, for a total of 6750 demonstrations performing the assembly task from various initial conditions. Note that in the single-object task, the board’s pose is randomized within a 35 × 35 cm region and rotated up to 15° in each episode, requiring a reactive strategy that localizes the board and the appropriate matching slot, and guides the object into the correct location. Similarly, 150 human demonstrations were collected per object in the multi-object assembly tasks, resulting in a total of 1800 trajectories.
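The board randomization described above can be sketched as follows. This is a minimal illustration, not the protocol's actual sampling code: the text gives only the region size (35 × 35 cm) and the maximum rotation (15°), so the uniform distribution, the symmetric rotation range, the region center, and the function name `sample_board_pose` are all assumptions.

```python
import math
import random

REGION_SIDE_M = 0.35   # 35 cm x 35 cm randomization region (from the text)
MAX_YAW_DEG = 15.0     # board rotated up to 15 degrees (from the text)


def sample_board_pose(rng, center_xy=(0.0, 0.0)):
    """Sample a planar board pose (x, y, yaw).

    Assumptions (not stated in the text): sampling is uniform, the region is
    centered at `center_xy`, and the rotation range is symmetric about zero.
    """
    half = REGION_SIDE_M / 2.0
    x = center_xy[0] + rng.uniform(-half, half)
    y = center_xy[1] + rng.uniform(-half, half)
    yaw = math.radians(rng.uniform(-MAX_YAW_DEG, MAX_YAW_DEG))
    return x, y, yaw


# Draw a few reproducible poses for inspection.
rng = random.Random(0)
example_poses = [sample_board_pose(rng) for _ in range(5)]
```

Seeding the generator keeps the sampled poses repeatable, which matters when the same initial conditions need to be replayed across methods.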
3.3. Single-object multi-stage manipulation tasks
Aside from performing the individual steps mentioned above, such as grasping, reorientation, and assembly, our benchmark and demonstrations can be used to learn the entire long-horizon sequence, composing these steps to insert a free object into the assembly board; one such example is shown in Figure 2. The difficulty of this task mainly comes from the compounding errors accumulated over each individual step, which are magnified further when switching between tasks. For instance, after completing the grasping and repositioning stages, an object might be held in a pose different from the ones in the human demonstration data used for insertion.
3.4. Multi-object multi-stage manipulation tasks
We also present three sets of more challenging objects for assembly, as presented on the right of Figure 1. These tasks are more challenging than the single-object tasks since the pieces fit in an interlocking fashion, so there is much more variability in which object to perform manipulation skills on. For the grasping stage, as pictured in Figure 5, the robot needs to grasp a desired object among several others, with the added complexity of randomized object placements for each attempt. For the insertion stage, as illustrated in Figure 6, the robot needs to insert objects while coming into contact with other objects already present on the assembly board. This situation introduces more complexity in contact dynamics, necessitating a higher level of precision in manipulation. Another major challenge with these tasks is that the interlocking pieces need to be put together in a specific order. While it may not be too hard to perform individual steps alone, the difficulty increases rapidly when a policy needs to simultaneously reason about the manipulation sequence and account for the compounding manipulation errors introduced by individual steps.
Figure 5. Example of different initial configurations for one multi-object task assembly board. We randomize both the orientation and position relative to other objects at the start of each assembly demonstration episode.
Figure 6. Various initial distributions of the insert skill for the multi-object task. In each instance, there are different numbers of objects already inserted into the board.

3.5. Robotic system and data collection
Number of demonstration trajectories in our dataset separated by primitive and task. Each trajectory is approximately 5 s in length, for a total of 22,550 trajectories.
3.5.1. Robotic system overview
Our system can be seen in Figure 7. We use a Franka Panda robot to collect our dataset since it is widely adopted for research and offers a torque control interface, which is very desirable in contact-rich manipulation tasks. To teleoperate the robot, we use a SpaceMouse to command a 6-DoF end-effector twist at 10 Hz, which is then tracked by a low-level impedance controller running at 1 kHz. The software for operating the robot, as well as the low-level controller, is also included in our open-sourced release. In total, we have four Intel RealSense D405 cameras: two are mounted on the robot end-effector, and the other two are placed on each side of the bin to provide a complementary view of objects in the bin. To ensure the image observations are free of background distractions, we put white curtains around the sides of the workspace. We simultaneously capture RGB and depth images from these cameras, and we also provide calibrated camera intrinsics. This calibration allows for the conversion of depth images into point clouds when necessary. We also log the end-effector force/torque information provided by the Franka Panda robot. We did not add a separate force/torque sensor, since relying on the robot’s inherent sensing capabilities simplifies standardization.
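As an illustration of how calibrated intrinsics allow depth images to be converted into point clouds, the sketch below back-projects each pixel with the standard pinhole camera model. It is not taken from the released code: the function name, the nested-list image layout, metric depth units, and the absence of lens distortion are assumptions made for the example.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image into 3D points in the camera frame.

    depth: row-major nested list of depth values in meters (assumed layout);
           non-positive values are treated as invalid and skipped.
    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    Assumes an ideal pinhole camera with no lens distortion.
    """
    points = []
    for v, row in enumerate(depth):          # v: pixel row index
        for u, z in enumerate(row):          # u: pixel column index
            if z <= 0.0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2 x 2 example with one invalid pixel and made-up intrinsics.
example_cloud = depth_to_points([[1.0, 0.0], [2.0, 2.0]],
                                fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

In practice one would vectorize this over the full image, but the per-pixel form makes the geometry explicit.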
Our robotic system setup is simple and modular; one can reproduce our exact setup by following the procedure on our website https://functional-manipulation-benchmark.github.io/files/index.html.
Figure 7. Illustration of the robot setup, with a standard Franka arm equipped with four cameras (two on the wrist and two attached to the environment), each with RGB and depth channels, positioned in front of a workspace containing an object, reorientation fixture, and assembly board. The board is placed into a random pose within the randomization region, and the object is located in a randomized pose on the table, from where it must be picked up, reoriented, and inserted.
3.5.2. Single-object task dataset
Our dataset comprises 2700 demonstrations of the complete single-object task, encompassing every aspect from grasping and reorientation to object insertion. Each stage within these complete trajectories is automatically labeled, enabling the segmentation of trajectories into individual skills by querying the corresponding labels. We also collected an additional 4050 demonstrations of the insertion stage alone, since it is a much harder task and thus requires more data. Each end-to-end demonstration trajectory ranges from 20 to 40 s in length. One can directly learn a “flat” policy on these long trajectories or break them into “primitive” trajectory sequences using the labels mentioned before. In our dataset, these primitives include grasp, place on fixture, regrasp, rotate, move to board, and insertion. After segmenting by primitives, we end up with a total of 15,350 demonstrations, with an average length of about 5 s. As shown in Figure 7, the pose of the task object for the grasping task is randomized within a 20 cm × 20 cm rectangular area in the bin. For the insertion task, the board is randomized inside a 35 cm × 35 cm area. A drawing of such a protocol can be found on our website. We also include distractors (i.e., objects not needed for a task) when performing the insertion task. One-fifth of the insertion demonstrations were carried out with distractors present.
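Segmenting a labeled trajectory into primitives can be sketched as below. This is only an illustration of the idea: the per-step dictionary layout and the `"primitive"` field name are hypothetical and do not reflect the released dataset format.

```python
from itertools import groupby


def segment_by_primitive(steps):
    """Split a trajectory into contiguous segments that share a primitive label.

    steps: list of per-timestep dicts carrying a "primitive" label
           (hypothetical field name, not the dataset's actual schema).
    Returns a list of (label, segment_steps) pairs in temporal order.
    """
    segments = []
    for label, group in groupby(steps, key=lambda step: step["primitive"]):
        segments.append((label, list(group)))
    return segments


# A toy end-to-end trajectory using the primitive names listed in the text.
labels = ["grasp", "grasp", "place on fixture", "regrasp",
          "rotate", "move to board", "insertion", "insertion"]
trajectory = [{"t": i, "primitive": name} for i, name in enumerate(labels)]
segments = segment_by_primitive(trajectory)
```

Because `groupby` only merges adjacent steps, a primitive that recurs later in the same trajectory (for example, a retried grasp) yields a separate segment.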
3.5.3. Multi-object task dataset
In addition to the single-object manipulation task dataset, we also collected 150 end-to-end demonstrations of solving each of the three multi-object assemblies. Each trajectory contains steps to grasp, reorient, and insert the four components of the assembly sequentially and can exceed 100 s in length. We again break them down into separate primitives like grasp, place on fixture, regrasp, move to board, and insert for each manipulation object. After segmentation, this part of the dataset contains 7200 trajectories with lengths of about 5 s.
For the multi-object manipulation task, all four assembly objects are randomly placed in the 20 cm × 30 cm area, requiring the learned system to determine the desired piece to pick up. Unlike the single-object task, the assembly board is fixed to the table. A drawing of such a protocol can be found on our website.
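The object and demonstration counts quoted in Sections 3.1 through 3.5 can be cross-checked with a few lines of arithmetic. The snippet below only restates numbers given in the text; the variable names are ours.

```python
# Object set (Section 3.1).
single_objects = 9 * 6            # nine shapes, six sizes each
multi_objects = 3 * 4             # three assemblies, four components each
total_objects = single_objects + multi_objects

# Single-object insertion demonstrations (Sections 3.2.3 and 3.5.2).
full_task_demos = 2700            # complete grasp-to-insert demonstrations
insertion_only_demos = 4050       # additional insertion-only demonstrations
single_insertions = full_task_demos + insertion_only_demos

# Multi-object insertion demonstrations: 150 per object.
multi_insertions = 150 * multi_objects

# Segmented primitive trajectories (Sections 3.5.2 and 3.5.3).
segmented_total = 15350 + 7200

print(total_objects, single_insertions, multi_insertions, segmented_total)
```

The totals agree with the figures in the text: 66 objects, 6750 single-object insertions (125 per object over 54 objects), 1800 multi-object insertions, and 22,550 segmented trajectories.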
3.6. Using the benchmark
To use FMB, users first need to reproduce the setup. This includes purchasing relevant equipment, such as the bin and cameras, as well as printing the FMB objects and tools with the specified materials and colors. The detailed instructions can be found on our website https://functional-manipulation-benchmark.github.io/usage/index.html. Our dataset was collected using a Franka Panda robot; however, users who choose a different robot can still use the relevant components of the FMB framework to collect their own data.
3.7. Evaluation protocol
In order to evaluate the performance of different methods, we designed a set of detailed evaluation protocols for each task of FMB. In these protocols, we specify a set of object initial poses within the randomization region to test the proposed methods’ generalization capability while ensuring consistency across different experiments and labs.
3.7.1. Single-object tasks
For grasping and repositioning tasks, one can hold out a specific object in the training set, train a policy without seeing any data associated with that object, and then test on the held-out object. We also provide novel objects that are not contained in the dataset, on which researchers can directly evaluate the trained policies, such as the five objects shown in Figure 8. Furthermore, we define a set of specific starting poses for both the object and the insertion board, aiming to consistently evaluate the adaptability of different policies in handling various grasping and insertion points.
Figure 8. Unseen test objects used for evaluating generalization to new combinations of shapes, sizes, and colors.
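The hold-out procedure for a training-set object can be sketched as a simple filter over trajectories. The `"object_id"` field and the trajectory records below are hypothetical and serve only to illustrate the split.

```python
def split_by_object(trajectories, held_out_ids):
    """Partition trajectories into (train, held_out) by object identity.

    trajectories: list of dicts with an "object_id" key (hypothetical schema).
    held_out_ids: iterable of object IDs that must not appear in training.
    """
    held_out = set(held_out_ids)
    train = [t for t in trajectories if t["object_id"] not in held_out]
    test = [t for t in trajectories if t["object_id"] in held_out]
    return train, test


# Toy dataset: three objects with two demonstrations each.
toy_data = [{"object_id": obj, "demo": k} for obj in (1, 2, 3) for k in range(2)]
train_set, test_set = split_by_object(toy_data, held_out_ids=[2])
```

Splitting by object rather than by trajectory ensures that no demonstration of the held-out object leaks into training.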
3.7.2. Multi-object tasks
For the multi-object task, the assembly components for each board are placed in one of five specified starting arrangements within the designated grasp randomization area, as illustrated in Figure 5. A successful policy must choose the intended piece to grasp amidst the presence of other items within the same vicinity. However, the board is fixed to the workspace within the insertion randomization region to reduce the complexity required during the insertion phase.
The precise protocols for each individual skill and the multi-stage tasks can be found on our website: https://functional-manipulation-benchmark.github.io/procedure/index.html.
4. An imitation learning system for the FMB
One significant benefit of the FMB framework is its ability to function as a standardized “toolkit” for researchers, facilitating a convenient and unified starting point for studying various robot learning challenges. In this section, we will describe an imitation learning system we built for the FMB that serves both to provide baseline performance and a collection of components that researchers can extend to study the FMB tasks. In the next section, we analyze the performance of this system and various baselines and ablations.
4.1. Imitation learning policies for individual skills
By using the FMB dataset together with an evaluation protocol described in Section 3.7, we trained and tested various imitation learning models, detailed below, on individual manipulation skills. As we will discuss in Section 5, we also study the most effective sensor modalities for each skill as well as how the performance scales with the data available. In all our experiments, we use two types of architectures to learn imitation learning policies, ResNet (He et al., 2016) and Transformer (Vaswani et al., 2017). In this section, we describe the detailed architectures of our imitation learning policies.
4.1.1. ResNet-based policy
Our ResNet-based policy’s overall architecture can be seen in Figure 9(b). It is composed of ResNet-34 vision backbones and an MLP policy head that represents a Gaussian distribution. We use this general structure for all of our tasks, only adapting the inputs specific to each task. The policy takes multiple RGB and depth images and encodes them separately with weight-shared ResNets before concatenating the features. It also takes the robot’s proprioceptive information, such as end-effector pose, twist, or force/torque measurements, which is linearly projected before being fed into the MLP. Furthermore, the policy can be conditioned on both the object ID and the manipulation skill ID, which are represented as one-hot vectors. This mechanism is crucial for employing a hierarchical approach to address long-horizon, multi-stage tasks. The output is a 6D end-effector twist as well as a binary variable that indicates whether the gripper should open or close. In our experiments, we vary the input space to fit the needs of each scenario: for example, when evaluating a single-task policy, the skill ID is omitted, and when evaluating the importance of force/torque measurements, we vary whether or not they are included in the input.
Figure 9. Architecture diagrams of the baseline policies. Both models encode each image view with weight-shared ResNet encoders before concatenating with proprioceptive information and optional object and primitive ID features to predict 7-DoF actions. (a) Architecture of the Transformer model used to train the baseline policies. (b) Architecture of the ResNet model used to train the baseline policies.
4.1.2. Transformer-based policy
Several recent works (Brohan et al., 2023; Collaboration et al., 2023; Zitkovich et al., 2023) showed that high-capacity models such as Transformers (Vaswani et al., 2017) can be effective in robotic control. The major advantages of these models lie in handling multi-modal inputs and scaling with large, diverse datasets. Our decoder-only Transformer architecture is shown in Figure 9(a). We use weight-shared ResNet-34 encoders to tokenize images from multiple camera views. We additionally add FiLM (Perez et al., 2018) layers to condition on the object ID or primitive ID if they are required as part of the inputs to the policies. This prevents the one-hot ID vectors from being ignored by the neural network, thus making the conditioning procedure more stable. Robot proprioceptive information is tokenized via an MLP separately. These tokens, after being concatenated together with sinusoidal position embeddings, are then processed through self-attention layers with four attention heads and four MLP layers. The network outputs a discretized action consisting of a 6D end-effector twist as well as a binary variable indicating whether the gripper should open or close. Each dimension of the continuous 6D robot action space is discretized into 256 bins during training by using a Gaussian quantizer. The discretized action space is converted back into continuous values when sending commands to the robot at runtime.
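The text does not spell out the Gaussian quantizer, so the following is one plausible reading rather than the authors' implementation: each action dimension is modeled as a Gaussian, and the 256 bins are chosen to carry equal probability mass under it. The mean and standard deviation below are placeholders; in practice they would be estimated per dimension from the training actions.

```python
from statistics import NormalDist

NUM_BINS = 256  # bins per action dimension (from the text)


def encode(value, dist):
    """Map a continuous action value to a bin index in [0, NUM_BINS - 1].

    Bins are equal-probability intervals under `dist` (an assumption).
    """
    p = dist.cdf(value)
    return min(int(p * NUM_BINS), NUM_BINS - 1)


def decode(index, dist):
    """Map a bin index back to a continuous value at the bin's probability midpoint."""
    return dist.inv_cdf((index + 0.5) / NUM_BINS)


# Placeholder statistics for one twist dimension (illustrative values only).
twist_dist = NormalDist(mu=0.0, sigma=0.1)
token = encode(0.03, twist_dist)
recovered = decode(token, twist_dist)
```

Compared with uniform binning, this choice spends more bins near the mean, where most demonstrated actions fall, and fewer in the tails.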
4.2. Composing skills to solve long-horizon tasks
An important part of the FMB consists of the two long-horizon sequential manipulation tasks. One way of solving such tasks is to simply train a “flat” imitation learning policy on the long-horizon trajectories. However, this would suffer from compounding errors (Ross et al., 2011), potentially requiring a significant amount of data to achieve desirable performance. Alternatively, we can perform the long-horizon task by employing hierarchical methods that compose individual manipulation skills with a high-level policy. In our experiments, we simply used a human-provided sequence of steps to trigger the associated low-level primitives in time. This “human oracle” can sequence a set of primitives to generate recovery behaviors, thus reducing compounding errors. For example, the robot can repeatedly execute the grasping primitive until the object is securely held, or opt to use a repositioning primitive to adjust the object’s pose after unsuccessful grasping attempts, thus simplifying subsequent attempts. This can be achieved by using our ResNet or Transformer policy architectures with the proposed conditioning mechanism. Future work could explore learning such high-level policies that dynamically choose the best primitives based on the current observations. Such tasks necessitate explicit reasoning about the spatial relationships between objects and the associated manipulation skills, facilitating the use of a suitable abstract representation.
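The oracle-driven composition with retries can be sketched as a small control loop. This is an illustrative simplification, not the system's actual code: `execute` stands in for running a primitive-conditioned low-level policy, and `succeeded` stands in for the human operator's judgment of whether to advance; both callbacks and the attempt limit are hypothetical.

```python
def run_sequence(primitives, execute, succeeded, max_attempts=3):
    """Run primitives in a fixed, human-provided order, retrying on failure.

    execute(name): runs the low-level policy conditioned on primitive `name`.
    succeeded(name): returns True when the oracle decides to advance.
    Returns (task_completed, log), where log holds (name, attempt, ok) tuples.
    """
    log = []
    for name in primitives:
        for attempt in range(1, max_attempts + 1):
            execute(name)
            ok = succeeded(name)
            log.append((name, attempt, ok))
            if ok:
                break
        else:
            # All attempts for this primitive failed; abort the task.
            return False, log
    return True, log


# Toy stand-ins: the first grasp attempt fails, everything else succeeds.
calls = {}


def fake_execute(name):
    calls[name] = calls.get(name, 0) + 1


def fake_succeeded(name):
    return name != "grasp" or calls["grasp"] >= 2


done, history = run_sequence(["grasp", "rotate", "insert"],
                             fake_execute, fake_succeeded)
```

A learned high-level policy, as suggested for future work, would replace both the fixed primitive list and the `succeeded` callback.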
5. Experiments
Our experiments study the performance of the imitation learning system described previously in order to compare different variants of the imitation learning approach, understand the properties and challenges of the FMB tasks, and study the impact of different input modalities and design decisions. Specifically, our experiments study the following research questions: (1) How do various imitation learning techniques perform in our tasks so we can establish stable baselines? (2) What do the failure modes of these methods suggest about the challenges of the FMB tasks? (3) How does the difficulty of the various FMB tasks change with the choice of input modality and policy architecture? (4) How do hierarchical policies compare to “flat” policies on long-horizon tasks?
To achieve this, we train a set of imitation learning policies with either ResNet (He et al., 2016) or Transformer (Vaswani et al., 2017) architectures, shown in Figure 9. We also combine these architectures with techniques such as diffusion (Chi et al., 2023) and action chunking (Zhao et al., 2023). We detail these choices in this section. All pre-trained model checkpoints associated with the experiments in this section can be found on our website.
5.1. Grasping task
An important aspect of FMB is to study generalization across objects’ physical attributes and their locations. We use the grasping task to obtain baseline numbers for our imitation learning system, as well as to verify that we can study the proposed generalization. To achieve this, we prepare different training datasets by randomly extracting portions of data from the diverse pool of grasping data available. Specifically, we sample 20%, 50%, 80%, and 100% of the overall grasping data and study the policy’s performance with the randomized evaluation procedure described in Section 3.7. To test generalization across objects, we evaluate on both objects in the FMB dataset and the unseen objects illustrated in Figure 8.
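The data-scaling setup above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the seed, and the pool size are hypothetical, and drawing nested subsets (each smaller subset contained in the larger ones) is a choice made here that the text does not specify.

```python
import random


def make_training_subsets(demo_ids, fractions=(0.2, 0.5, 0.8, 1.0), seed=0):
    """Map each fraction to a random subset of demonstration ids.

    The ids are shuffled once and each subset is a prefix of that order, so
    smaller subsets are contained in larger ones (an assumption of this
    sketch).
    """
    order = list(demo_ids)
    random.Random(seed).shuffle(order)
    return {f: order[: round(f * len(order))] for f in fractions}


# Hypothetical pool of 1000 grasping demonstrations, identified by index.
subsets = make_training_subsets(range(1000))
print({f: len(ids) for f, ids in subsets.items()})
```

Nesting the subsets means that any difference between, say, the 20% and 50% policies comes from added demonstrations rather than from a different random draw.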
For this task, we train both the ResNet and Transformer-based policies on RGB inputs to assess the general completion rate of the task. The specific input modality includes RGB images and TCP velocity. To test if depth information is helpful for the grasping task, we additionally train the ResNet policy with depth alongside RGB. We test each policy by evaluating it on five objects in the training set and five unseen objects shown in Figure 8, for five trials each, and report the performance over the 50 trials.
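The evaluation protocol (five seen and five unseen objects, five trials each) amounts to a simple tally per split. The sketch below shows one way to aggregate such trial records; the record layout and the outcomes are placeholders invented for illustration and are not results from the paper.

```python
from collections import Counter


def tally_by_split(trials):
    """Summarize (object_id, is_seen, success) records per split."""
    successes, totals = Counter(), Counter()
    for _object_id, is_seen, success in trials:
        split = "seen" if is_seen else "unseen"
        totals[split] += 1
        successes[split] += int(success)
    return {split: (successes[split], totals[split]) for split in totals}


# Placeholder outcomes: 10 objects x 5 trials, with trials 0, 2, and 4
# marked as successes. Objects 0-4 stand in for the seen objects.
trials = [
    (obj, obj < 5, trial % 2 == 0)
    for obj in range(10)
    for trial in range(5)
]
summary = tally_by_split(trials)
total_successes = sum(s for s, _ in summary.values())
print(summary, f"overall: {total_successes}/{len(trials)}")
```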
As summarized in Figure 10, the ResNet policy’s performance generally scales with the amount of training data. The Transformer policy trained on all grasping data with RGB inputs grasps the objects in 28 of the 50 trials. The ResNet policy trained on the same data and observations achieves a comparable 27 successes out of 50. Performance drops to 12 out of 50 as the amount of training data decreases to 20%. It is interesting to note that the ResNet policies generalize to the unseen objects shown in Figure 8, grasping them with success comparable to that on seen objects regardless of the amount of training data. Furthermore, we find that depth information is beneficial, as the ResNet policies trained with both depth and RGB information consistently outperform RGB-only policies trained with the same amount of data. The common failure modes of this task include missing the objects and not closing the gripper at the right time.
Figure 10. Number of successful grasps out of 50 trials across five seen and five unseen objects for ResNet and Transformer policies trained on various observation spaces and data percentages. The policies are able to grasp unseen objects with similar success rates as seen objects, while the overall success rate grows with the amount of training data. Training ResNet policies with depth information increases the performance across the board.
5.2. Repositioning task
For the three repositioning skills (rotate, place on fixture, and regrasp), the human demonstration data can induce multiple modes of actions given the same observation. For example, an object can be rotated to the left or to the right depending on the context derived from its observation history. We thus also train the ResNet-based policies with action chunking (Zhao et al., 2023), a recent method that has shown promising performance in handling multi-modality in human demonstrations. We tested the performance of ResNet policies with and without action chunking, along with a Transformer-based policy without action chunking, on seen and unseen objects. The results are presented in Figure 11. The ResNet policy without action chunking outperforms both its counterpart with action chunking and the Transformer policy on the rotate skill. In contrast, the Transformer policies outperform ResNet policies with or without action chunking on the place on fixture and regrasp skills. The common failure modes for this task include not opening or closing the gripper at the right time and rotating in the wrong direction.
Figure 11. Number of successes for policies trained on the three repositioning tasks: rotate, place on fixture, and regrasp. We tested three models: ResNet without action chunking, ResNet with action chunking length 3, and Transformer without action chunking. Each policy is evaluated 50 times across 5 seen and 5 unseen objects.
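The idea behind action chunking can be sketched in a few lines: instead of predicting one action per observation, the policy is trained to predict the next few actions and then executes them before being queried again. The code below is a generic illustration with a chunk length of 3, matching the setting reported above. The function names, the padding convention (repeating the final action), and the open-loop execution of each chunk are assumptions of this sketch; this section does not state how the FMB system handles either detail.

```python
def make_chunk_targets(actions, chunk_len=3):
    """For each timestep t, the target is actions[t : t + chunk_len].

    Near the end of the trajectory the last action is repeated as padding
    (an assumed convention).
    """
    last = len(actions) - 1
    return [
        [actions[min(t + k, last)] for k in range(chunk_len)]
        for t in range(len(actions))
    ]


def run_chunked(policy, step, obs, horizon, chunk_len=3):
    """Query the policy once per chunk and execute the chunk open-loop."""
    executed = []
    while len(executed) < horizon:
        chunk = policy(obs)[:chunk_len]
        if not chunk:
            raise ValueError("policy returned an empty chunk")
        for action in chunk:
            obs = step(obs, action)
            executed.append(action)
            if len(executed) == horizon:
                break
    return executed


# Toy stand-ins: scalar actions, and an environment whose next observation
# is simply the action that was applied.
targets = make_chunk_targets([0, 1, 2, 3, 4])
rollout = run_chunked(
    policy=lambda obs: [obs + 1, obs + 2, obs + 3],
    step=lambda obs, action: action,
    obs=0,
    horizon=7,
)
print(targets)
print(rollout)
```

Because the policy commits to a short sequence at a time, a decision such as rotating left versus right is made once per chunk rather than re-sampled at every step.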
5.3. Insertion task
For the insertion task, we studied the effect of observation spaces built from different input modalities, experimented with training a single policy for all insertion object shapes, and compared the performance of policies trained only on particular shapes.
Ablation on input modality for insertion policy. For all policies, we include one RGB side view, two RGB wrist views, and velocity. We experimented with adding depth information from each view and adding force/torque information. By evaluating 25 trials across 5 different object sizes, we can conclude that force/torque information is crucial for contact-rich manipulation tasks like this and that depth information deteriorates performance.
As an initial study of the complexity of the insertion task, we train different ResNet policies for each object shape and evaluate them according to the procedure in Section 3.7. We can see that the success rate does decrease as the shape becomes more complex, with the hardest one being the three-prong object shown in Figure 12. The common failure modes include getting stuck near the holes, impeding fine-grained adjustments, difficulty in locating the matching openings, and challenges in handling multi-modalities in the demonstration data. For example, the two-pronged object with asymmetrical shapes may require a rotation between 0 and 90°, depending on its grasping pose, to align with the shapes of the hole openings. This implies the assembly task is indeed a challenging robotic manipulation task for future benchmarking.
We train policies on all insertion data and evaluate 5 trials for each of the 9 object shapes. We find that conditioning on a one-hot vector embedding of the shape of the object being assembled helps the policy spatially separate the target insertion position.
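A one-hot shape embedding of this kind can be illustrated as follows. This is a minimal sketch: the helper names are hypothetical, and appending the vector to an observation feature vector is only one possible way to inject it. How the embedding enters the architecture in Figure 9 is not described in this section.

```python
NUM_SHAPES = 9  # the text evaluates 9 object shapes


def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at the given index."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} is outside [0, {size})")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def condition_features(features, shape_id, num_shapes=NUM_SHAPES):
    """Append the shape one-hot to an observation feature vector.

    Concatenation is an assumption of this sketch, not the confirmed
    conditioning mechanism.
    """
    return list(features) + one_hot(shape_id, num_shapes)


# Hypothetical 2-dimensional feature vector for an object with shape id 2.
conditioned = condition_features([0.5, -1.0], shape_id=2)
print(conditioned)
```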
5.4. Multi-stage manipulation tasks
As described in Section 3, the difficulties of the multi-stage assembly tasks mainly come from dealing with compounding errors introduced by each stage of manipulation, as well as reasoning about the manipulation sequences. To verify these points and thereby motivate the use of the proposed hierarchical policy structures, we train “flat” end-to-end imitation learning policies directly on the full long-horizon demonstrations. We train both ResNet and Transformer policies on all the RGB camera views together with other necessary robot proprioceptive information. The goal of the trained policies is to successfully grasp, reorient, and perform assembly. We assess the performance of the trained policies by conducting 10 trials for each object shape in the case of the single-object task and 10 trials for each initial object configuration in scenarios involving multi-object manipulation.
We conducted an evaluation of various policies for single-object multi-stage manipulation tasks, focusing on the performance of transformer and ResNet models across three distinct shapes. Notably, all unconditional policies, including those trained with diffusion models, recorded a zero success rate. We compared two types of hierarchical policies differentiated by the conditioning mechanism between the high-level and low-level policies. We found the transformer-based policy achieved the most compelling results while providing a flexible structure for handling different input modalities.
We conducted an evaluation of various policies for multi-object multi-stage manipulation tasks, focusing on the red board as shown in Figure 1. The hierarchical policies use a human oracle as the high-level policy, sequentially triggering a low-level policy with the appropriate primitive and object IDs for each stage. Similar to single-object manipulation tasks, all unconditioned policies achieved zero success. Remarkably, the transformer-based policy outperformed others, achieving a success rate of 7/10.
We studied two ways of instantiating such hierarchical methods, as presented in Figure 13. In both cases, we employ a high-level human oracle that functions as a state machine, determining the appropriate low-level skill to execute. This oracle maintains a sequence of skills to be executed at each decision point. It is also responsible for re-executing any primitive skill that either failed in the previous step or produced a state deemed unsuitable for the subsequent step. For example, it may retry grasping if the object was initially grasped at a location unfavorable for insertion. The procedure terminates under two conditions: either when an unrecoverable state is encountered or when a pre-set maximum number of trial steps is reached. While they use the same high-level policy, these approaches diverge in their representation of low-level skills. To assess the efficacy of the conditioning mechanism integrated into the architecture depicted in Figure 9, we conducted a comparative study. This involved training five distinct policies, each representing a specific low-level skill, which were then directly invoked by the high-level policy.
Figure 13. Illustration of the policies tested on the multi-stage task. (a) An unconditioned policy is trained on the end-to-end task. (b) A task-conditioned policy is trained on multiple skills, and a human oracle provides the appropriate skill ID, and optionally object ID, sequentially. (c) Five unconditioned policies are trained on the five skills separately, and the human oracle selects the best policy to execute sequentially.
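The oracle's control flow described above can be sketched as a small state machine. This is an illustrative approximation only: in the experiments a human makes these decisions, whereas here a hypothetical `execute` callback stands in for running a low-level policy and judging the outcome. The names `StepResult` and `run_oracle`, the primitive labels, and the step budget are all assumptions, and the sketch omits the option of inserting a repositioning primitive after failed grasps.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    success: bool             # state is suitable for the next primitive
    recoverable: bool = True  # False means the episode cannot continue


def run_oracle(plan, execute, max_steps=10):
    """Trigger primitives in order, re-executing any that fail.

    Stops when the plan is complete, an unrecoverable state is reported,
    or max_steps primitive executions have been used.
    Returns (task_completed, log of (primitive, success) pairs).
    """
    log = []
    stage = 0
    steps = 0
    while stage < len(plan) and steps < max_steps:
        primitive = plan[stage]
        result = execute(primitive)
        steps += 1
        log.append((primitive, result.success))
        if result.success:
            stage += 1          # advance to the next primitive
        elif not result.recoverable:
            return False, log   # unrecoverable state: terminate
        # otherwise stay on the same stage and retry it
    return stage == len(plan), log


# Scripted outcomes standing in for real executions: the first grasp fails,
# then every primitive succeeds.
outcomes = iter([StepResult(False), StepResult(True), StepResult(True), StepResult(True)])
done, log = run_oracle(["grasp", "rotate", "insert"], lambda primitive: next(outcomes))
print(done, log)
```

The same loop serves both hierarchical variants: `execute` can either call one conditioned policy with a skill ID or dispatch to a separate policy per skill.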
First, we observed that the hierarchical policies attained measurable levels of success, in contrast to the flat policies, which demonstrated zero success, as shown in Tables 4 and 5. However, despite employing a human oracle as the high-level policy, one endowed with a profound understanding of the tasks and able to make near-optimal decisions, the maximum success rate achieved was only 19 out of 30 for single-object tasks and 7 out of 10 for multi-object tasks. This indicates the inherently complex challenges presented by FMB, affirming its suitability as a benchmark for developing advanced robotic learning methods.
For the single-object task as presented in Table 4, the Transformer-based policies achieve comparable performance between the two aforementioned hierarchical methods, namely, 19/30 compared to 15/30. However, for the ResNet-based policies, conditioned ResNet achieved zero success out of 30 trials, whereas chaining separate policies attained an 18/30 success rate, which is comparable to that of the Transformer-based policies. For the multi-object task presented in Table 5, the conditioned hierarchical ResNet policy achieved 5/10 success compared to the conditioned hierarchical Transformer policy’s 7/10 success rate. Investigating this gap, we found that the primary factor behind the performance difference is how well the ResNet and Transformer policies handle multi-modal sensory inputs. For each skill, there is an optimal set of sensory inputs. For example, the insertion skill reached its peak performance using three RGB camera views, supplemented with additional sensory data, as outlined in Table 2. However, we observed that incorporating a fourth camera view, specifically the right-side camera, into a ResNet policy significantly impairs its performance. This decline is primarily due to the randomized positions of the assembly board. The distant camera struggles to precisely locate the corresponding holes, leading the policy to associate its actions with incorrect spatial features, such as the board’s edge, rather than the target location. This observation is further corroborated by the fact that incorporating a fourth camera view in multi-stage tasks, as detailed in Table 5, did not adversely affect performance. This is largely attributable to the fixed position of the assembly board. In such scenarios, the redundant information provided by the additional camera remains consistent, making it sufficiently apparent for the system to effectively ignore it.
Similarly, the grasping skill generally does not benefit from adding end-effector force/torque information, as it does not perform contact-rich fine-grained manipulation. In fact, we selected distinct sets of sensory inputs tailored to the specific requirements of each task and supplied these to five different ResNet policies. On the other hand, we fed all available sensory inputs to the conditioned policies. These policies are then required to learn to select the appropriate set of input modalities, guided only by supervision from their respective actions. The performance of the ResNet-based policies was observed to degrade due to their difficulty in disregarding task-irrelevant inputs, leading to incorrect feature associations. In contrast, the Transformer-based policies learned to effectively ignore task-irrelevant modalities, such as the non-essential fourth camera in the insertion task. This attribute is particularly beneficial in the multi-stage, multi-task imitation learning settings characteristic of FMB tasks.
6. Discussion and limitations
In this paper, we present the Functional Manipulation Benchmark (FMB). Through the careful design of tasks and the provision of a comprehensive dataset and a reproducible hardware and software system, FMB enables studying several critical challenges in robotic manipulation learning: complexity of tasks and skills, generalization across varied objects, and reproducibility of research.
One of the primary contributions of FMB is its focus on the complexity of manipulation tasks and the need for generalization. The tasks, ranging from single-object manipulation to complex multi-object multi-stage assemblies, capture important aspects of real-world manipulation challenges.
The inclusion of diverse 3D-printed objects enhances the need for robots to generalize their learned skills to new and unseen objects, as well as easing the burden of reproducing the proposed tasks. Our open-sourced imitation learning system, complemented by a comprehensive analysis of our experimental findings on FMB tasks, offers a foundation for researchers seeking to develop and enhance their methodologies.
Researchers can get started with FMB by first replicating our publicly available setup and trying out some of our pre-trained models. We anticipate that this initial exploration will pave the way for them to develop and evaluate new methods. For this reason, we look forward to their contributions and insights on the tasks proposed by FMB. Additionally, the nature of the FMB tasks is inherently conducive to ongoing development. Researchers have the opportunity to create novel 3D-printed objects and collect demonstrations, thereby enriching the FMB project. Notably, since the objects in multi-stage assembly tasks are constructed using a specific “grammar,” there is potential to incorporate a far greater variety of assembly boards than those currently present in FMB tasks.
Our hope is that FMB can serve as a user-friendly toolkit for individuals eager to delve into robot learning, and that its inherent task complexity will foster the advancement of cutting-edge robot learning methodologies. We also hope that the value FMB adds to the robot learning community will ultimately encourage community contributions, further supporting its ongoing development.
