1. Introduction
A growing number of autonomous robots are being deployed in a variety of real-world applications. Examples include office service (Ahn et al., 2022), package delivery (Murray and Raj, 2020), agricultural inspection (Liu et al., 2018), search and rescue (Queralta et al., 2020), autonomous vehicles (Rosenband, 2017; Tang, 2019), sports (Jolly et al., 2007; Liu et al., 2021), and others. For example, consider an assembly warehouse (Figure 1(a)) where a team of autonomous robots assists two humans by delivering tools. To support the humans efficiently, the robots have to predict when each human will need each tool and collaborate with each other to search for tools on a table (Figure 1(a)), pass tools (Figure 1(b)), and deliver them (Figure 1(c)). Performing such high-quality coordination in large, stochastic, and uncertain environments is challenging because it requires the robots to operate asynchronously using local information while reasoning about cooperation with teammates.
Figure 1. Example of a real-world multi-robot tool delivery task.
Multi-agent reinforcement learning (MARL) is a promising framework for generating solutions to these kinds of multi-robot problems. Recently, by leveraging deep neural networks to handle large state and observation spaces, deep MARL has solved many challenging multi-agent problems. Unfortunately, state-of-the-art deep MARL methods (Foerster et al., 2018; Iqbal and Sha, 2019; Lowe et al., 2017; Omidshafiei et al., 2017b; Rashid et al., 2018, 2020; Son et al., 2019; Su et al., 2021; Wang et al., 2021b, 2021c; Wang and Dong, n.d.) struggle to solve large-scale real-world multi-robot problems that involve long-term reasoning and asynchronous behavior, because they were developed for cases where agents synchronously execute primitive actions at every time step.
Temporally-extended actions have been widely used in both learning and planning to improve scalability and reduce complexity in single-robot domains. For example, they have come in the form of motion primitives (Dalal et al., 2021; Stulp and Schaal, 2011), skills (Konidaris et al., 2011, 2018), spatial action maps (Wu et al., 2020), and macro-actions (He et al., 2010; Hsiao et al., 2010; Lee et al., 2021; Theocharous and Kaelbling, 2004). The idea of temporally-extended actions has also been incorporated into multi-agent approaches. In particular, we consider the macro-action-based decentralized partially observable Markov decision process (MacDec-POMDP).
The MacDec-POMDP is a general model for cooperative multi-agent problems with partial observability and potentially different action durations. As a result, agents can start and end macro-actions at different time steps, so decision-making can be asynchronous. MacDec-POMDPs assume the macro-actions are given (just as primitive-action methods assume the primitive actions are given). This is well motivated by the fact that, in real-world multi-robot systems, each robot is already equipped with certain controllers (e.g., a navigation controller and a manipulation controller) that can be modeled as macro-actions (Amato et al., 2015a; Omidshafiei et al., 2017; Wu et al., 2021a; Xiao and Hoffman, n.d.).
The MacDec-POMDP framework has shown strong scalability with planning-based methods (Amato et al., 2015a, 2015; Hoang et al., 2018; Omidshafiei et al., 2016, 2017). These methods have generated complex solutions for multi-robot problems ranging from warehousing (Amato et al., 2015b) and logistics (Omidshafiei et al., 2016) to aerial delivery (Omidshafiei et al., 2017a). As planning methods, these approaches assume the problem model is known, whereas we propose model-free RL methods.
While several hierarchical multi-agent reinforcement learning (MARL) approaches have been developed, they typically do not address asynchronicity, since they assume agents' high-level decisions share the same duration (de Witt et al., 2019; Han et al., 2019; Nachum et al., 2019; Wang et al., 2020b, 2021a; Xu et al., 2021; Yang et al., 2020a). Notably, none of them provides a general formulation for multi-agent reinforcement learning that allows agents to learn and execute asynchronously.
The focus of our proposed algorithms is then on learning asynchronous high-level policies over macro-actions. Our contributions fall into two groups.
These methods represent all the major classes of MARL algorithms and serve as a foundation for extending primitive-action methods to the asynchronous case. To our knowledge, this is the first general formalization of macro-action-based multi-agent frameworks under the three state-of-the-art training paradigms that allow multiple agents to asynchronously learn and execute.
We evaluate our proposed frameworks on diverse macro-action-based multi-robot problems: a benchmark Box Pushing domain, a variant of the Overcooked domain (Wu et al., 2021b), and a large warehouse service domain. Experimental results: (1) demonstrate that our methods are able to learn high-quality solutions while primitive-action-based methods cannot; (2) validate that the proposed macro-action-based CTDE Q-learning approaches can learn better decentralized policies than fully decentralized learning methods in most of the domains; and (3) show the strength of Mac-IAICC for learning decentralized policies over Naive Mac-IACC and Mac-IAC. Overall, we find that Mac-IAICC is more robust and scalable than the other proposed algorithms, achieving the best overall performance in learning decentralized policies, while the utility of the value-based approaches is very domain-dependent. Additionally, decentralized policies learned using Mac-IAICC are successfully deployed on real robots to solve two warehouse tool delivery tasks efficiently.
This work extends our earlier conference papers (Amato et al., 2015; Xiao et al., 2022; Xiao and Hoffman, n.d.) in three ways: (1) we present all proposed approaches in a coherent and systematic fashion; (2) we conduct extensive extra simulated experiments to compare the two families of algorithms with a deep analysis of their different pros and cons; and (3) we showcase new real-robot experiments in a warehouse tool delivery scenario, where a team of robots shows more complex and interesting collaborative behaviors.
2. Background
This section introduces the formal definitions of the Dec-POMDP and the MacDec-POMDP and provides an overview of single-agent and multi-agent reinforcement learning algorithms with primitive actions.
2.1. Dec-POMDPs
The decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek and Amato, 2016) is a general model for fully cooperative multi-agent tasks, where agents make decisions in a decentralized way based only on local information. A Dec-POMDP is formally defined by a tuple ⟨I, S, A, T, R, Ω, O, H⟩, consisting of a finite set of agents I, a state space S, a joint action space A, a transition function T, a shared reward function R, a joint observation space Ω, an observation function O, and a horizon H.
2.2. MacDec-POMDPs
The macro-action decentralized partially observable Markov decision process (MacDec-POMDP) (Amato et al., 2014, 2019) incorporates temporally-extended macro-actions into the Dec-POMDP model.
2.3. Single-agent reinforcement learning
We focus on model-free reinforcement learning (RL), where the agent aims to learn an optimal policy by interacting with the environment without explicit world models (e.g., transition and reward functions).
2.3.1. DQN, double-DQN, and DRQN
Q-learning (Watkins and Dayan, 1992) is a popular model-free method that optimizes a policy by learning an action-value function.
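As a concrete illustration of the double Q-learning idea used throughout this paper, the sketch below computes a Double-DQN-style target for a tabular case: the online network selects the greedy next action and the target network evaluates it. The function and variable names are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def double_dqn_target(q_online, q_target, reward, next_state, gamma, done):
    """Double-DQN target: the online values pick the greedy next action,
    the target values evaluate it, reducing overestimation bias."""
    if done:
        return reward
    a_star = int(np.argmax(q_online[next_state]))       # action selection
    return reward + gamma * q_target[next_state][a_star]  # action evaluation
```

In a deep implementation, `q_online` and `q_target` would be the outputs of the online and periodically copied target networks.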
2.3.2. Actor-critic policy gradient
In single-agent reinforcement learning, the actor-critic architecture combines a parameterized policy (the actor) with a learned value function (the critic) that guides the policy-gradient updates.
2.4. Multi-agent reinforcement learning
We consider fully cooperative multi-agent reinforcement learning (MARL), where multiple agents interact with the same environment, perceiving observations and selecting actions while accounting for each other's effects, in order to optimize the global return. In this section, we introduce three standard MARL training paradigms: centralized learning, decentralized learning, and centralized training for decentralized execution (CTDE), and we discuss the corresponding algorithms under each paradigm.
2.4.1. Centralized learning and execution
Perhaps the most straightforward way to solve fully cooperative MARL problems is centralized learning and execution. Specifically, all agents are treated as a single big agent that learns a centralized policy mapping joint information to joint actions. However, this requires perfect online communication during execution, and the joint action and observation spaces grow exponentially with the number of agents.
2.4.2. Decentralized learning and execution
Because of the aforementioned issues in the fully centralized case, having a decentralized policy for each agent is preferable, where each agent independently makes decisions based on only local information.
2.4.3. Centralized training for decentralized execution
In recent years, centralized training for decentralized execution (CTDE) has shown considerable promise in learning high-quality decentralized policies in Dec-POMDPs. To address the main difficulties encountered in independent learning, CTDE provides agents with access to global information during offline training while maintaining decentralized online execution based on local information. This paradigm is potentially more feasible for solving real-world multi-agent tasks, where the policies are first trained in a simulator and then deployed on the real system.
3. Approach
Multi-agent deep reinforcement learning with asynchronous decision-making and macro-actions is more challenging, as it is difficult to determine when to perform updates and what information to maintain for each agent when macro-actions begin and terminate at different time steps.
The algorithms proposed in this paper assume the existence of macro-actions, which can be either pre-defined by humans or learned beforehand (e.g., a navigation controller for moving to a specific waypoint and a manipulation controller for object pick and place). The algorithms then focus on learning policies over these macro-actions. All proposed algorithms are model-free RL methods, meaning that agents have no prior knowledge of the state space, observation space, transition dynamics, or reward model. All agents have to learn directly from the reward signals and transitions they experience through trial and error.
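As a rough illustration of how a pre-existing controller can be wrapped as a macro-action, the sketch below pairs a low-level controller with a termination condition, in the spirit of an options-style formulation. All names here are hypothetical and not part of the paper's codebase.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MacroAction:
    """A temporally-extended action: a low-level controller plus a
    termination condition (illustrative sketch, not the paper's API)."""
    name: str
    controller: Callable[[object], object]  # local observation -> primitive action
    terminates: Callable[[object], bool]    # True when the macro-action ends

def run_macro_action(mac, obs_stream):
    """Execute `mac` over a sequence of observations; return the primitive
    actions taken and the number of steps until termination."""
    actions = []
    for obs in obs_stream:
        actions.append(mac.controller(obs))
        if mac.terminates(obs):
            break
    return actions, len(actions)
```

A navigation macro-action, for instance, would keep issuing motion commands until its waypoint condition triggers termination.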
3.1. Macro-action-based decentralized Q-learning (Mac-Dec-Q)
In the decentralized case, each agent only has access to its own macro-actions and macro-observations as well as the joint reward at each time step. As a result, there are several choices for how information is maintained. For example, each agent could maintain exactly the information mentioned above (as seen on the left side of Figure 2), the time-step information could be removed (losing the duration information), or some other representation could be used that explicitly tracks time. We choose the middle approach (see Appendix B for a theoretical analysis of this choice). As a result, updates only need to take place for each agent after the completion of its own macro-action, and we introduce a replay buffer of macro-action-based concurrent experience replay trajectories (Mac-CERTs).
Figure 2. An example of Mac-CERTs. Two agents first sample concurrent trajectories from the replay buffer; the valid experience (when the macro-action terminates, marked in red) is then selected to compose a squeezed sequential experience for each agent. Note that we collect agents' high-level transition tuples at every primitive step, where each agent obtains a new macro-observation if and only if its current macro-action terminates; otherwise, the next macro-observation is set to be the same as the previous one. The superscript distinguishes different macro-actions and macro-observations.
More concretely, under a macro-action-observation history
In each training iteration, agents first sample a concurrent mini-batch of sequential experiences (either random traces with the same length or entire episodes) from the replay buffer
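The per-agent squeezing operation described above can be sketched as follows, assuming each primitive-step tuple carries a flag marking whether the agent's current macro-action terminated at that step. The discounted accumulation of rewards over a macro-action's duration is an assumption of this sketch rather than a verbatim reproduction of the paper's update.

```python
def squeeze_agent_trajectory(transitions, gamma=0.95):
    """Mac-CERTs-style squeeze for one agent (illustrative sketch).

    `transitions` is a per-primitive-step list of tuples
    (macro_obs, macro_action, reward, next_macro_obs, terminated).
    Rewards accrued while a macro-action runs are discounted and
    accumulated into a single macro-level reward; a macro-level
    transition is emitted only when the macro-action terminates.
    """
    squeezed, acc, t = [], 0.0, 0
    for (o, m, r, o_next, done) in transitions:
        acc += (gamma ** t) * r
        t += 1
        if done:  # macro-action terminated: emit one macro-level transition
            squeezed.append((o, m, acc, o_next))
            acc, t = 0.0, 0
    return squeezed
```

Each agent's squeezed sequence is what its decentralized Q-network is trained on.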
In this work, we implement Dec-HDRQN with Double Q-learning (Section 2.3.1) to train the decentralized macro-action-value function
3.2. Macro-action-based centralized Q-learning (Mac-Cen-Q)
Achieving centralized control in the macro-action setting requires learning a joint macro-action-value function.
In this case, we introduce a centralized replay buffer that we call macro-action-based joint experience replay trajectories (Mac-JERTs).
In our work, we use Double-DRQN (DDRQN) to train the centralized macro-action-value function. In each training iteration, a mini-batch of sequential joint experiences is first sampled from Mac-JERTs, and a filtering operation similar to the one presented in Section 3.1 is used to obtain the "squeezed" joint experiences (shown in Figure 3). In this case, however, only one joint reward is maintained, accumulated from the selection of any agent's macro-action to the completion of any (possibly other) agent's macro-action.
Figure 3. An example of Mac-JERTs. A joint sequential experience is first sampled from the memory buffer, and then, depending on the termination (red) of each joint macro-action, a squeezed sequential experience is generated for centralized training. Each agent's macro-action that is responsible for the termination of the joint one is marked in red.
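A minimal sketch of the corresponding joint squeezing for Mac-JERTs, where a squeezed joint transition is emitted whenever any agent's macro-action terminates and the joint reward is accumulated in between. Variable names and the discounting detail are assumptions of this sketch.

```python
def squeeze_joint_trajectory(joint_transitions, gamma=0.95):
    """Mac-JERTs-style squeeze (illustrative sketch).

    Each element of `joint_transitions` describes one primitive step:
    (joint_macro_obs, joint_macro_action, joint_reward,
     next_joint_macro_obs, term_flags), where `term_flags` holds one
    boolean per agent marking whether that agent's macro-action ended.
    The joint reward is accumulated between consecutive steps at which
    *any* agent's macro-action terminates.
    """
    squeezed, acc, t = [], 0.0, 0
    for (o, m, r, o_next, term_flags) in joint_transitions:
        acc += (gamma ** t) * r
        t += 1
        if any(term_flags):  # any agent finished: emit one joint transition
            squeezed.append((o, m, acc, o_next))
            acc, t = 0.0, 0
    return squeezed
```

The `any(term_flags)` test is exactly what makes the joint squeeze differ from the per-agent squeeze in Section 3.1.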
Using the squeezed joint sequential experiences, the centralized macro-action-value function (in equation (18)) is optimized via double Q-learning.
3.3. Macro-action-based CTDE Q-learning
In multi-agent environments, decentralized learning causes the environment to be non-stationary from each agent’s perspective as other agents’ policies change during learning.
3.3.1. MacDec-DDRQN
Double DQN has been implemented in multi-agent domains for learning either centralized or decentralized policies (Simões et al., 2017; Xiao and Hoffman, n.d.; Zheng et al., 2018). However, in the decentralized learning case, each agent independently adopts double Q-learning purely based on its own local information. Learning only from local information often impedes agents from achieving high-quality cooperation.
In order to take advantage of centralized information for learning decentralized action-value functions, we train a centralized macro-action-value function alongside each agent's decentralized one, and use the centralized function to select greedy target macro-actions for the decentralized updates.
More concretely, consider a domain with
The experience replay buffer
In equation (23),
Additionally, similar to the conditional operation used for training the centralized joint macro-action-value function discussed in Section 3.2, in order to obtain a more accurate prediction by taking each agent's macro-action execution status into account, equation (23) can be rewritten as:
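Since equations (23) and (24) are not reproduced here, the following is only a hedged sketch of the underlying idea: the centralized Q-function selects a greedy joint macro-action, and agent i's decentralized target network evaluates its own component of that joint action. All names are illustrative.

```python
import numpy as np

def macdec_ddrqn_target(q_cen, q_dec_target_i, joint_actions, agent_i,
                        reward, gamma, done):
    """MacDec-DDRQN-style target for agent i (sketch, not equation (23)
    verbatim).

    q_cen           : centralized Q-values over joint macro-actions at the
                      next joint history
    q_dec_target_i  : agent i's target-network values at its next local history
    joint_actions   : list of joint macro-actions (tuples), aligned with q_cen
    """
    if done:
        return reward
    greedy_joint = joint_actions[int(np.argmax(q_cen))]  # centralized selection
    a_i = greedy_joint[agent_i]                          # agent i's component
    return reward + gamma * q_dec_target_i[a_i]          # decentralized evaluation
```

The centralized selection step is what injects global information into each agent's otherwise local update.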
3.3.2. Parallel-MacDec-DDRQN
Exploration is also a difficult problem in multi-agent reinforcement learning.
Therefore, in our approach, besides tuning
However, without enough knowledge about the properties of a given domain at the very beginning, it is not clear which exploration choice is best. To cope with this, we propose a more general version of MacDec-DDRQN, called Parallel-MacDec-DDRQN, which runs parallel environments: centralized exploration collects the data for training the centralized joint Q-value function, while decentralized exploration collects realistic decentralized data for training the decentralized ones.
3.4. Macro-action-based independent actor-critic
Similar to the idea of IAC with primitive actions (Section 2.4.2), a straightforward extension is to have each agent independently optimize its own macro-action-based policy (actor) using a local macro-action-value function (critic). Hence, we start by deriving a macro-action-based policy gradient for the decentralized actors.
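For intuition, the sketch below computes the policy-gradient contribution of a single macro-level transition for a softmax policy over macro-actions, i.e., the gradient of −log π(m | h) · A with respect to the action logits. This is an illustrative stand-in for the paper's derivation, not its actual network update.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a logit vector."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def mac_iac_actor_grad(logits, macro_action, advantage):
    """Gradient of -log pi(macro_action) * advantage w.r.t. the logits of
    a softmax policy (one macro-level transition, one agent; a sketch)."""
    pi = softmax(logits)
    grad = pi.copy()
    grad[macro_action] -= 1.0  # d(-log pi[m]) / d logits = pi - one_hot(m)
    return grad * advantage
```

In Mac-IAC, the advantage would come from each agent's local critic, evaluated only at that agent's own macro-action terminations.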
3.5. Macro-action-based centralized actor-critic
In the fully centralized learning case, we treat all agents as a single joint agent to learn a centralized actor with a centralized critic, both conditioned on joint information.
3.6. Macro-action-based independent actor with centralized critic
As mentioned earlier, fully centralized learning requires perfect online communication, which is often hard to guarantee, and fully decentralized learning suffers from environmental non-stationarity due to agents' changing policies. In order to learn better decentralized macro-action-based policies, in this section we propose two macro-action-based actor-critic algorithms under the CTDE paradigm. In particular, the difference between a joint macro-action termination from the centralized perspective and a macro-action termination from each agent's local perspective gives rise to a new challenge: the centralized critic and each agent's decentralized actor must be trained on different notions of macro-action termination.
A naive way of incorporating macro-actions into a CTDE-based actor-critic framework is to directly adapt the idea of the primitive-action-based IACC (Section 2.4.3) and have a shared joint macro-action-value function as the centralized critic; we refer to this method as Naive Mac-IACC.
Figure 4. An example of the trajectory squeezing process in Naive Mac-IACC. The joint trajectory is first squeezed depending on joint macro-action termination for training the centralized critic. Then, the trajectory is further squeezed for each agent depending on the agent's own macro-action termination for training the decentralized policy.

3.6.1. Independent actor with individual centralized critic (Mac-IAICC)
Note that Naive Mac-IACC is technically incorrect. The cumulative reward used to train the shared centralized critic is accumulated according to joint macro-action terminations, which generally does not align with each agent's own macro-action execution.
To tackle the aforementioned issues, we propose to learn a separate centralized critic for each agent, trained with the reward accumulated over that agent's own macro-action execution.
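A sketch of the per-agent squeezing this implies: a critic transition for agent i is emitted only when agent i's own macro-action terminates, with the reward accumulated over that macro-action's duration, while the critic input remains the joint history. Names and the discounting detail are assumptions of this sketch.

```python
def squeeze_for_agent_critic(joint_steps, agent_i, gamma=0.95):
    """Mac-IAICC-style squeeze for agent i's individual centralized
    critic (illustrative).  Each primitive step carries the joint
    macro-observation-action history, the joint reward, and per-agent
    termination flags; a critic transition is emitted only when agent
    i's own macro-action terminates.
    """
    out, acc, t, start = [], 0.0, 0, None
    for (joint_hist, r, term_flags) in joint_steps:
        if start is None:
            start = joint_hist      # joint history when i's macro-action began
        acc += (gamma ** t) * r
        t += 1
        if term_flags[agent_i]:     # agent i's own macro-action ended
            out.append((start, acc, joint_hist))
            acc, t, start = 0.0, 0, None
    return out
```

Contrast this with Naive Mac-IACC, where a single shared critic is squeezed on *any* agent's termination.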
Figure 5. An example of the trajectory squeezing process in Mac-IAICC: each agent learns an individual centralized critic for decentralized policy optimization. To better utilize centralized information, each agent's critic receives all the valid joint macro-observation-action history.
4. Experiments in simulation
We investigate the performance of our algorithms on a variety of multi-robot problems with macro-actions (Figure 6): Box Pushing (Xiao and Hoffman, n.d.), Overcooked (Wu et al., 2021b), and a larger Warehouse Tool Delivery (Xiao and Hoffman, n.d.) domain. Macro-actions are defined using prior domain knowledge, as they are straightforward in these tasks. Notably, we also include primitive actions in the macro-action set (as one-step macro-actions), which gives agents the chance to learn more complex policies that use both when necessary. The horizon of each problem domain, as well as the environmental partitions and labels, is not known to the agents. We describe the domains' key properties here, with more details in Appendix D.
Figure 6. Experimental environments.
4.1. Experimental setup
4.1.1. Box pushing (Figure 6(a))
Two robots are in an environment with two small boxes and one large box. The optimal solution is to cooperatively push the big box to the yellow goal area for a terminal reward, but partial observability makes this difficult. Specifically, the robots have four primitive actions:
4.1.2. Overcooked (Figure 6(b) and (c))
Three robots must learn to cooperatively prepare a lettuce-tomato-onion salad and deliver it to the “star” cell. The challenge is that the salad’s recipe (Figure 6(d)) is unknown to the robots. With primitive actions (
4.1.3. Warehouse tool delivery (Figure 6(e)–6(h))
In each workshop (e.g., W-0), a human is working on an assembly task (involving four sub-tasks that each take a number of time steps to complete) and requires three different tools for future sub-tasks to continue. A robot arm (gray) must find tools for each human on the table (brown) and pass them to mobile robots (green, blue, and yellow), which are responsible for delivering the tools to the humans. Note that the correct tools needed by each human are unknown to the robots and must be learned during training in order to perform efficient delivery. A delayed delivery leads to a penalty. We consider variants with two or three mobile robots and two to four humans to examine the scalability of our methods (Figure 6(f)–6(h)). We also consider one faster human (orange) to check whether the robots can prioritize that human (Figure 6(g)). Mobile robots have the following macro-actions:
4.2. Results
We evaluate the performance of each training trial as a mean discounted return, measured by periodically (every 100 episodes) evaluating the learned policies over 10 testing episodes. We plot the average performance of each method over 20 independent trials with one standard error and smooth the curves over 10 neighbors. We also show the optimal expected return in the Box Pushing domain as a dash-dot line. To ensure a fair comparison, we perform hyper-parameter tuning for the methods in each comparison over the same set of hyper-parameters and then choose the best performance of each method, prioritizing the final converged value first and sample efficiency second. More training details are in Appendix E.
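For reference, the curve smoothing mentioned above (averaging over 10 neighbors) can be implemented as a simple running mean; the exact smoothing used for the paper's plots may differ from this sketch.

```python
import numpy as np

def smooth(curve, k=10):
    """Running mean over a window of k neighbors, for plotting learning
    curves (illustrative; shortens the curve by k-1 points)."""
    curve = np.asarray(curve, dtype=float)
    kernel = np.ones(k) / k
    return np.convolve(curve, kernel, mode="valid")
```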
4.2.1. Advantage of learning with macro-actions
We first present a comparison between our macro-action-based methods and the primitive-action-based methods in the fully decentralized and fully centralized cases, showing the results of value-based methods and actor-critic methods in Figures 7 and 8, respectively. The comparisons consider various grid-world sizes of the Box Pushing domain and two Overcooked scenarios. The results show significant performance improvements from using macro-actions over primitive actions. More concretely, in the Box Pushing domain, reasoning about primitive movements at every time step makes the problem intractable, so the robots cannot learn any good behaviors with primitive-action-based approaches other than to keep moving around. Conversely, Mac-Cen-Q (solid brown) and Mac-CAC (solid orange) reach near-optimal performance, enabling the robots to push the big box together. Unlike the centralized critic, which can access joint information, even in the macro-action case it is hard for each robot's on-policy decentralized critic to correctly measure the responsibility for a penalty caused by a teammate pushing the big box alone. Mac-IAC (solid blue) thus converges to a local optimum of pushing the two small boxes in order to avoid the penalty. Mac-Dec-Q (solid purple) learns slowly at the early stage, but, as an off-policy approach, it takes advantage of the replay buffer to revisit the good experience of pushing the big box and eventually achieves near-optimal performance.
Figure 7. Value-based approaches with macro-actions versus primitive actions.
Figure 8. Actor-critic approaches with macro-actions versus primitive actions.

In the Overcooked domain, an efficient solution requires the robots to asynchronously work on independent subtasks (e.g., in scenario A, one robot gets a plate while the other two robots pick up and chop vegetables; in scenario B, the right robot transports items while the left two robots prepare the salad). This large amount of independence explains why Mac-Dec-Q (solid purple) and Mac-IAC (solid blue) can solve the task well, and it indicates that local information is enough for the robots to achieve high-quality behaviors. Accordingly, Mac-Cen-Q (solid brown) and Mac-CAC (solid orange) learn more slowly, because they must filter out the redundant part of the joint information in much larger joint macro-level history and action spaces than in the decentralized case. The primitive-action-based methods begin to learn but perform poorly in such long-horizon tasks.
4.2.2. Property analysis of macro-action-based CTDE Q-learning
Figure 9 shows the results of three variants of our macro-action-based CTDE Q-learning algorithm, MacDec-DDRQN: (a) with centralized exploration as the default; (b) with decentralized exploration (Dec-Explore); and (c) with parallel environments (Parallel), compared against our fully decentralized method (Mac-Dec-Q) and fully centralized method (Mac-Cen-Q).
Figure 9. Comparison of value-based methods with macro-actions.
In the Box Pushing domain, the key sequential cooperative behavior, such as reaching the big box and pushing it at the same moment, is much easier to generate from the centralized perspective than from the decentralized one. Thus, relying on centralized exploration, MacDec-DDRQN achieves near-centralized performance and better sample efficiency than fully decentralized learning (Mac-Dec-Q). The joint Q-value function learned in Dec-Explore is based purely on decentralized data, so it fails to provide better target actions for decentralized policy updates. The failure of Parallel shows that, in this particular domain, having a well-trained joint Q-value function from a separate environment, but without the centralized cooperative behavior data, may hurt the learning of decentralized policies.
Due to the aforementioned independence of the subtasks across agents in the Overcooked domain, Mac-Dec-Q performs the best while Mac-Cen-Q learns slowly because of the huge joint macro-action space (15³). The variants of our CTDE-based approaches therefore cannot learn high-quality solutions, as they all rely on guidance from the trained joint Q-value function (refer to the
In the warehouse domain, Mac-Dec-Q cannot solve the problem well due to its natural limitations and the domain's partial observability. In particular, it is difficult for the gray robot (arm) to learn an efficient way to find the correct tools purely from local information and very delayed rewards that depend on the mobile robots' behaviors. Taking advantage of joint information over agents, Mac-Cen-Q eventually outperforms all other methods. The Parallel variant achieves a significant improvement while learning decentralized policies. The failure of Dec-Explore and of MacDec-DDRQN with centralized exploration demonstrates both the necessity of using centralized data to achieve well-trained joint Q-value functions and the importance of having realistic decentralized data for decentralized policy training, which are attained in the Parallel manner.
4.2.3. Advantage of having individual centralized critics
Figure 10 shows the evaluation of our methods in all three domains. As each agent's observation is extremely limited in Box Pushing, we allow the centralized critics in both Mac-IAICC and Naive Mac-IACC to access the state (agents' poses and boxes' positions), but use the joint macro-observation-action history in the other two domains.
Figure 10. Comparison of macro-action-based asynchronous actor-critic methods.
In the Box Pushing task (the left two at the top row in Figure 10), Naive Mac-IACC (green) can learn policies almost as good as the ones for Mac-IAICC (red) for the smaller domain, but as the grid world size grows, Naive Mac-IACC performs poorly while Mac-IAICC keeps its performance near the centralized approach. From each agent’s perspective, the bigger the world size is, the more time steps a macro-action could take, and the less accurate the critic of Naive Mac-IACC becomes since it is trained depending on any agent’s macro-action termination. Conversely, Mac-IAICC gives each agent a separate centralized critic trained with the reward associated with its own macro-action execution.
In Overcooked-A (the third plot in the top row of Figure 10), as Mac-IAICC's performance depends on the training of three agents' critics, it learns more slowly than Naive Mac-IACC in the early stage but converges to a slightly higher value with better learning stability in the end. The result for scenario B (the last plot in the top row of Figure 10) shows that Mac-IAICC outperforms the other methods in terms of sample efficiency, final return, and variance. The middle wall in scenario B limits each agent's moving space and leads to a higher frequency of macro-action terminations. The shared centralized critic in Naive Mac-IACC thus provides noisier value estimations for each agent's actions, so Naive Mac-IACC performs worse with more variance. Mac-IAICC, however, is not hurt by this change in environmental dynamics. Neither Mac-CAC nor Mac-IAC is competitive with Mac-IAICC in this domain.
In the Warehouse scenarios (the bottom row of Figure 10), Mac-IAC (blue) performs the worst for the same reason as Mac-Dec-Q mentioned above. In contrast, in the fully centralized Mac-CAC (orange), both the actor and the critic have global information, so it can learn faster in the early training stage. However, Mac-CAC eventually gets stuck in a local optimum in all five scenarios due to the exponential dimensionality of the joint history and action spaces over robots. By leveraging the CTDE paradigm, both Mac-IAICC and Naive Mac-IACC perform the best in Warehouse-A. Yet, the weakness of Naive Mac-IACC is clearly exposed when the problem is scaled up in Warehouse-B, C, and D. In these larger cases, the robots' asynchronous macro-action executions (e.g., traveling between rooms) become more complex and cause more mismatch between terminations from each agent's local perspective and terminations from the centralized perspective; therefore, Naive Mac-IACC's performance significantly deteriorates, even falling below Mac-IAC in Warehouse-D. In contrast, Mac-IAICC maintains its outstanding performance, converging to a higher value with much lower variance than the other methods. This outcome confirms not only Mac-IAICC's scalability but also the effectiveness of having an individual critic for each agent to handle variable degrees of asynchronicity in agents' high-level decision-making.
4.2.4. Comparative analysis between actor-critic and value-based approaches in decentralized and centralized training paradigms
Here, we compare our actor-critic methods (Mac-IAC and Mac-CAC) with the value-based approaches (Mac-Dec-Q and Mac-Cen-Q), shown in Figure 11. As mentioned above, the Box Pushing task requires agents to simultaneously reach the big box and push it together. This consensus is rarely achieved when agents independently sample actions using stochastic policies in Mac-IAC, and it is hard to learn from pure on-policy data. By having a replay buffer, the value-based approaches show much stronger sample efficiency than the on-policy actor-critic approaches in this domain with a small action space (the first row of Figure 11). Such an advantage is sustained by the decentralized value-based method (Mac-Dec-Q) but lost in the centralized one (Mac-Cen-Q) in the Overcooked domains due to a huge joint macro-action space (15³) (the middle row of Figure 11). On the contrary, our actor-critic methods can scale to large domains and learn high-quality solutions. This is particularly noticeable in Warehouse-A, where the policy gradient methods quickly learn a high-quality policy while the centralized Mac-Cen-Q is slow to learn and the decentralized Mac-Dec-Q is unable to learn. In addition, the stochastic policies in actor-critic methods potentially have a better exploration property, so that, in the Warehouse domains, Mac-IAC can bypass an obvious local optimum that Mac-Dec-Q falls into, where the robot arm greedily chooses
Figure 11. Comparisons of macro-action-based decentralized and centralized methods.
4.2.5. Comparative analysis between actor-critic and value-based approaches in CTDE paradigm
We also conduct comparisons between our CTDE-based actor-critic method (Mac-IAICC) and our CTDE-based Q-learning methods (MacDec-DDRQN and Parallel-MacDec-DDRQN). In the Box Pushing task, we consider MacDec-DDRQN with centralized exploration.
Figure 12. Comparisons of Mac-IAICC and MacDec-DDRQN.
In the Overcooked domain, Parallel-MacDec-DDRQN cannot learn any good behaviors, which makes sense given that fully decentralized learning has already solved this problem very well (as shown in Figure 11). Also, as seen in Figure 11, the centralized Q-function learns quite slowly due to the huge joint macro-action space, so it becomes the bottleneck in Parallel-MacDec-DDRQN: it cannot offer a good target action for optimizing the decentralized action-value functions and hurts the learning (the top row of Figure 13). Mac-IAICC successfully avoids the dilemma of the exponential joint action space by letting each agent learn an individual joint history-value function as its critic.
Figure 13. Comparisons of Mac-IAICC and Parallel-MacDec-DDRQN.
The results in Figure 9 demonstrated the necessity of having parallel environments to learn the different Q-value functions when solving the warehouse task; we thus consider Parallel-MacDec-DDRQN in two warehouse scenarios and show the comparison with Mac-IAICC in Figure 13. Given the results of Mac-Dec-Q (purple curve) shown in Figure 11, we can conclude that the centralized Q-value function in Parallel-MacDec-DDRQN keeps the decentralized policies out of a very bad local optimum. Eventually, however, the learned decentralized policies converge to another local optimum. We suspect that the way the centralized Q-net is used to optimize the decentralized policies (as in equation (24)) limits the improvement. One hypothesis is that the target action suggested by the centralized Q-net, conditioned on joint information, cannot always be reproduced by the decentralized Q-nets conditioned on only local information. Finally, Mac-IAICC's leading performance across the three scenarios further demonstrates its strong scalability to large and long-horizon problems.
5. Experiments on hardware
5.1. Experimental setup
While evaluating the proposed approaches in simulation, we also extend the Warehouse-A environment (Figure 6(e)) to a hardware domain. Figure 14 provides an overview of the real-world experimental setup. An open area is divided into regions: a tool room, a corridor, and two workshops, to resemble the configuration shown in Figure 6(e). The mission involves one Fetch robot (Wise et al., 2016) and two Turtlebots (Koubaa et al., 2016) that cooperatively find and deliver three YCB tools (Calli et al., 2015), a tape measure, a clamp, and an electric drill, in the order required by each human to assemble an IKEA table. This real-world setup mirrors the simulated environment (Figure 6(e)) at a fixed scale in terms of warehouse dimensions and robot execution speeds. In the experiments, we train decentralized policies in the simulator and then deploy them on the real robots under two scenarios: (A) both humans work at the same speed on their assembly tasks; (B) the human in workshop-0 works faster than the other human. Overview of the Warehouse-A hardware domain.
5.2. Results
Figure 15 shows the sequential collaborative behaviors of the robots in one hardware trial under scenario A. Fetch was able to find tools in parallel, such that two tape measures (Figure 15(a)), two clamps (Figure 15(b)), and two electric drills were found, instead of finding all three types of tool for one human and then moving on to the other, which would have left one of the humans waiting. Fetch’s efficiency is also reflected in its behavior: it passed a tool to the Turtlebot that arrived first (Figure 15(b)) and continued to find the next tool when no Turtlebot was waiting beside it (Figure 15(c)). Meanwhile, the Turtlebots successfully avoided delayed deliveries by sending tools one by one to the nearby workshop (e.g., T-0 focused on W-0 as shown in Figures 15(b) and 15(d), and T-1 focused on W-1 as shown in Figure 15(c)), rather than waiting for all tools before delivering, traveling a longer distance to serve the human at the diagonal, or prioritizing one of the humans altogether. Collaborative behaviors generated by running the decentralized policies learned by Mac-IAICC in scenario A. Turtlebot-0 (T-0) is outlined in red and Turtlebot-1 (T-1) is outlined in blue. (a) After staging a tape measure at the left, Fetch looks for the second one while the Turtlebots approach the table; (b) T-0 delivers a tape measure to W-0 and T-1 waits for a clamp from Fetch; (c) T-1 delivers a clamp to W-1, while T-0 carries the other clamp to W-0 and Fetch searches for an electric drill; (d) T-0 delivers an electric drill (the last tool) to W-0, completing the entire delivery task.
Figure 16 shows more complex and interesting collaborative behaviors under scenario B, where the team of robots prioritized the faster human (wearing a black shirt) and successfully delivered all tools in time. More concretely, Fetch first found a tape measure (Figure 16(a)) and a clamp (Figure 16(b)) for the faster human, and then passed one tool to each Turtlebot, which allowed the Turtlebots to deliver the tools separately (Figure 16(b) and (c)) rather than having a single Turtlebot deliver both tools, which would have forced the human to pause. Interestingly, instead of finding the third tool for the faster human, Fetch realized that it had to find the other tape measure (Figure 16(c)) to avoid a delayed delivery to the slower human, and T-0, after receiving the tape measure from Fetch, immediately transported it to the slower human (Figure 16(d)). Meanwhile, Fetch had staged the last tool the faster human needed, an electric drill (Figure 16(d)), and T-1 carried it to W-0 at the right time (Figure 16(e)). After observing that the faster human had received all necessary tools, T-1 first went to W-1 to check the slower human’s status (Figure 16(f)) and then collaborated with T-0 to assist the slower human together: T-0 conveyed the other clamp (Figure 16(g)) and T-1 eventually delivered the other electric drill (Figure 16(h)). Collaborative behaviors generated by running the decentralized policies learned by Mac-IAICC in scenario B. Turtlebot-0 (T-0) is outlined in red and Turtlebot-1 (T-1) is outlined in blue.
(a) Fetch passes a tape measure to T-1 and T-0 is waiting for a tool; (b) T-1 delivers a tape measure to W-0, while Fetch is passing a clamp to T-0; (c) T-0 delivers a clamp to W-0 and Fetch is passing the other tape measure to T-1; (d) T-1 delivers a tape measure to W-1, while T-0 returns to the tool room; (e) T-0 sends an electric drill to W-0 and the faster human obtains all required tools, while Fetch finds the other clamp; (f) T-0 arrives at W-1 to observe the slower human’s status, and Fetch passes the clamp to T-1; (g) T-1 delivers a clamp to W-1, and T-0 waits beside Fetch; (h) T-0 delivers an electric drill (the last tool) to the slower human and the entire delivery task is completed.
6. Related work
To scale up learning in MARL problems, hierarchy has been introduced into multi-agent scenarios. One line of hierarchical MARL still focuses on learning a primitive-action-based policy for each agent while leveraging a hierarchical structure for knowledge transfer (Yang et al., 2021), credit assignment (Ahilan and Dayan, 2019), and low-level policy factorization over agents (Vezhnevets et al., 2020). Because decision-making in these works remains at the low level, none of them has been evaluated in large-scale realistic domains. In contrast, by using macro-actions, our methods equip agents with the ability to exploit abstracted skills, sub-task allocation, and problem decomposition via hierarchical decision-making, which is critical for scaling to real-world multi-robot tasks.
Another line of research allows agents to learn both a high-level and a low-level policy, but these methods either force agents to make a high-level choice at every time step (de Witt et al., 2019; Han et al., 2019) or require all agents’ high-level decisions to have the same time duration (Nachum et al., 2019; Wang et al., 2020b, 2021a; Xu et al., 2021; Yang et al., 2020a), so agents are effectively synchronized at both levels. In contrast, our frameworks are more general and applicable to real-world multi-robot systems because they allow agents to execute asynchronously at the high level, without synchronization or waiting for all agents to terminate.
Recently, some asynchronous hierarchical approaches have been developed. Wu et al. (2021a) extend Deep Q-Networks (Mnih et al., 2015) to learn a high-level pixel-wise spatial-action-value map for each agent in a fully decentralized way; our work, in contrast, accepts any representation of high-level actions. Menda et al. (2019) frame multi-agent asynchronous decision-making problems as event-driven processes under two assumptions: that it is acceptable to lose the ability to capture low-level interactions between agents within an event duration, and that agents are homogeneous. Our frameworks instead rely on the time-driven simulators used for general multi-agent and single-agent RL problems and make neither assumption. Chakravorty et al. (2019) adapt a single-agent
7. Conclusion
In this paper, we consider fully cooperative multi-agent systems in which agents are allowed to asynchronously execute macro-actions under partial observability. Such asynchronicity matches the nature of real-world multi-robot behavior, but it also raises the key challenges of when to perform updates and what information to maintain in MARL with macro-actions. To address these challenges, we introduce the first formulation and approaches that extend deep Q-networks to learning decentralized and centralized macro-action-value functions, together with two new replay buffers, Mac-CERTs and Mac-JERTs, which correctly capture agents’ sequential macro-action-based experiences for asynchronous policy updates. These two approaches lay the foundation for developing MARL algorithms with macro-actions. Next, we present MacDec-DDRQN and Parallel-MacDec-DDRQN, the first set of value-based frameworks achieving CTDE with macro-actions, to learn better decentralized policies for solving complex tasks.
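As a rough illustration of the macro-action replay-buffer idea (a simplified sketch of our own, not the paper's implementation; for simplicity, rewards are accumulated without per-step discounting, which the actual buffers may apply), the per-step experiences collected while one macro-action runs can be "squeezed" into a single transition at the macro-action's termination, and only such transitions are used for updates:

```python
def squeeze_macro_experience(steps):
    """Squeeze one macro-action's per-step experiences into one transition.

    steps: list of (obs, macro_action, reward, next_obs, mac_done) tuples
    collected at every primitive time step while a single macro-action runs;
    mac_done is True only at the step where the macro-action terminates.
    Returns (initial_obs, macro_action, cumulative_reward, final_next_obs).
    """
    assert steps and steps[-1][4], "last step must terminate the macro-action"
    obs0, macro, _, _, _ = steps[0]               # state when the macro-action began
    cumulative = sum(r for _, _, r, _, _ in steps)  # reward over its whole duration
    _, _, _, next_obs, _ = steps[-1]              # observation at termination
    return (obs0, macro, cumulative, next_obs)
```

Because different agents' macro-actions terminate at different time steps, each agent's buffer yields such squeezed transitions at its own pace, which is what enables asynchronous updates.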
Since value-based algorithms do not scale well to large action spaces, we further formulate a set of macro-action-based actor-critic algorithms that allow agents to asynchronously optimize parameterized policies via policy gradients: a decentralized actor-critic method (Mac-IAC), a centralized actor-critic method (Mac-CAC), and two CTDE-based actor-critic methods (Naive Mac-IACC and Mac-IAICC). These are the first approaches able to incorporate controllers that may require different amounts of time to complete (macro-actions) into a general asynchronous multi-agent actor-critic framework.
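The asynchronous critic bootstrapping underlying these actor-critic methods can be sketched as follows (our own simplification, with our own notation: `tau` denotes how many primitive time steps the macro-action lasted). Each agent computes an advantage over its completed macro-action, bootstrapping its critic across the macro-action's full duration; since each agent's `tau` differs, agents update at different times.

```python
def macro_action_advantage(cum_reward, tau, v_next, v_now, gamma, done):
    """Advantage estimate for one completed macro-action.

    cum_reward: reward accumulated over the macro-action's tau primitive steps.
    v_next:     critic value at the history where the macro-action terminated.
    v_now:      critic value at the history where the macro-action started.
    """
    # Bootstrap across the whole macro-action duration (tau steps), unless
    # the episode ended during its execution.
    bootstrap = 0.0 if done else (gamma ** tau) * v_next
    return cum_reward + bootstrap - v_now
```

In Mac-IAICC, `v_now` and `v_next` would come from the agent's individual centralized critic over joint information; in Mac-IAC, from a critic over local information only, which is what distinguishes the variants.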
Empirically, our methods are able to learn high-quality macro-action-based policies, allowing agents to perform asynchronous collaborations in a variety of multi-robot domains. Importantly, our most advanced method, Mac-IAICC, demonstrates strong scalability and efficiency by outperforming the other methods in long-horizon and large domains. Additionally, the practicality of Mac-IAICC is validated in a real-world multi-robot setup based on the warehouse domain.
Our formalism and methods open the door for other macro-action-based multi-agent reinforcement learning methods, ranging from extensions of current methods to new approaches and domains. We expect this to lead to even more scalable learning methods that are feasible and flexible enough to solve realistic multi-robot problems.