Abstract
Introduction
Reinforcement learning (RL) has received increasing attention from the artificial intelligence (AI) research community in recent years. Deep reinforcement learning (DRL) 1 in single-agent tasks is a practical framework for solving decision-making tasks at a human level 2 by training an agent that interacts with a dynamic environment. Cooperative multi-agent reinforcement learning (MARL) is a more complicated problem in the RL field due to the exponential growth of the joint decision space. 3 The approach encourages multiple agents to achieve a common goal through credit assignment, 4 and it is closely linked to many real-world problems, such as multi-player video games 5 and traffic light control. 6 However, MARL poses many challenges because agents must interact with each other in a shared environment. 7 Furthermore, in a large-scale multi-agent task, the dynamic environment becomes even more complicated and can be intractable. 8
Transfer learning (TL) is an efficient way to solve RL problems by leveraging prior knowledge. 9,10 Reusing existing knowledge can accelerate the RL agent's learning process and make complex tasks learnable. It is crucial to decide how, when, and what knowledge to store and reuse. 9 There is no generally valid solution for all domains: an improper transfer may harm the learning process instead of accelerating it, which is known as negative transfer. Therefore, it is crucial to design transfer principles for different RL training scenarios, especially for complex tasks. In this article, we use TL methods to improve training efficiency and effectiveness in large-scale multi-agent tasks.
Under partially observable settings, the state observed by each agent in multi-agent training changes dynamically. This poses an obstacle to transferring policies across multi-agent tasks with different numbers of agents. To solve this problem, we must handle dynamic states with a state representation approach. To the best of our knowledge, no related work has explored this topic in a multi-agent scenario. In this article, we use an attention mechanism to handle dynamic observations so that the multi-agent observation dimension remains stable across tasks with different numbers of agents. This approach also lays the foundation for the transfer of policies.
Thus, on the basis of the above observations, we use TL to help solve the large-scale multi-agent training problem. First, we propose a dynamic state representation network (DSRNet) to remove the transfer barrier between tasks with different numbers of agents. In turn, various classical RL algorithms can be combined with the transfer method. We then select representative methods to verify the capability of our approach in different experimental settings.
This article focuses on a real-time strategy (RTS) game to explore large-scale fully cooperative MARL. StarCraft is an RTS game that is very popular around the world and provides a suitable environment for AI researchers to simulate combat scenarios. SMAC 11 has become a standard benchmark for evaluating discrete cooperative MARL algorithms. We scaled SMAC to our large-scale multi-agent needs, using the classical centralized training with distributed execution algorithm QMIX 12 as a baseline to test our transfer framework. A second set of experiments used MAgent, 13 a platform that supports a larger number of agents. On this platform, we combined the classical independent RL training methods double deep Q-network (Double DQN) 14 and asynchronous advantage actor-critic (A3C) 15 to validate our transfer method. Moreover, we conduct UAV collision avoidance planning simulations that demonstrate our framework's ability to support large-scale robot control training.
The main contributions of this article can be summarized as follows. First, our approach verifies the feasibility of TL in large-scale multi-agent cooperation schemes. Second, we introduce an attention network into the single agent under the partially observable setting to represent a variable number of units. Third, we achieve practical TL with good performance from a few agents to many agents in different environments, which sheds light on the training problem of very large-scale multi-agent scenes.
The remainder of this article presents related work, problem formulation and background knowledge, the MARL transfer framework, experiments, potential robotic applications, and the conclusion.
Related work
TL has played an important role in accelerating single-agent RL by adapting learnt knowledge from past relevant tasks. 10,16,17 Inspired by this scenario, TL in MARL 18–21 is also studied with respect to transferring knowledge across multi-agent tasks to help improve the learning performance. The above work has two main directions: knowledge transfer across tasks and transfer among agents. However, few works consider transferring knowledge across different numbers of agents, especially from a small number of agents to a large number of agents.
Attention mechanisms have become an essential component of many deep neural networks. In particular, self-attention 22 computes the attention weight at a specific position in a sequence by considering all other positions in that sequence. Vaswani et al. 23 showed that a machine translation model composed only of self-attention modules could achieve state-of-the-art results. Wang et al. 24 reconstructed self-attention as a non-local operation to model spatial-temporal dependencies in video sequences. Nevertheless, self-attention mechanisms have not been fully explored in MARL.
State representation learning (SRL) aims to capture the changes in the environment caused by the agent's actions; this special representation is particularly suitable for extracting dynamic states in RL tasks. The main function of SRL is to generate a low-dimensional state space in which an RL policy can perform well and be efficient. The studies 25–29 adopt SRL methods to make the RL training process faster by separating the representation learning process from the policy learning process.
The purpose of policy distillation is to remove parameters that are unnecessary in the original model, thereby improving the generalization of the traditional model. 30 Distillation is performed by comparing the classification results of the teacher and student networks, using soft labels to avoid loss of information. Policy distillation, which distils one or more behavioral policies from a teacher model into a student model, has been introduced to RL. 31 Policy distillation has three cases: (1) the student model is trained with a negative log-likelihood (NLL) loss to predict the teacher's greedy action, (2) the student model is trained with a mean-squared-error (MSE) loss, and (3) the student model is trained with a Kullback–Leibler (KL) divergence loss. This approach allows the network size to be compressed without performance degradation, and multiple task-specific policies can be consolidated into a single policy. Policy distillation is widely used in single-agent RL. 32 A few previous works have focused on the transfer properties of MARL; for example, Barrett and Stone 33 proposed an ad hoc teamwork algorithm, Omidshafiei et al. 34 proposed the LeCTR algorithm to accomplish knowledge transfer between two agents, and Hernandez-Leal et al. 35 combined the Bayesian method and the Pepper model in multi-agent matchmaking. However, the above methods do not target large numbers of agents and do not scale well as the number of agents grows, which is the challenge addressed in this work. In this article, we devise a method that efficiently distils large, heavy network policies into small, light networks in the deep MARL environment by drawing on these distillation methods.
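To make the three loss cases concrete, here is a minimal NumPy sketch (illustrative helper names, not the implementation used in this article) of the NLL, MSE, and KL distillation losses computed from teacher and student outputs:

```python
import numpy as np

def softmax(z, tau=1.0):
    """Softened action distribution at temperature tau."""
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def nll_loss(teacher_probs, student_probs):
    """Case 1: negative log-likelihood of the teacher's greedy action."""
    best = int(np.argmax(teacher_probs))
    return -float(np.log(student_probs[best]))

def mse_loss(teacher_q, student_q):
    """Case 2: mean-squared error between raw Q-value vectors."""
    return float(np.mean((np.asarray(teacher_q) - np.asarray(student_q)) ** 2))

def kl_loss(teacher_probs, student_probs, eps=1e-12):
    """Case 3: KL divergence from the student's to the teacher's distribution."""
    t = np.asarray(teacher_probs) + eps
    s = np.asarray(student_probs) + eps
    return float(np.sum(t * np.log(t / s)))
```

In practice the KL variant with a softened (temperature-scaled) teacher distribution is the one most often reported to work best for Q-value distillation.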
Problem formulation and background
Partially observable stochastic games
Fully cooperative multi-agent tasks can be modelled as decentralized partially observable stochastic games (POSGs) 36 that extend from Markov decision processes (MDPs). In this article, we follow the POSG setting, where agents cannot obtain complete environmental information.
A POSG is composed of a tuple $\langle \mathcal{I}, \mathcal{S}, \{\mathcal{A}_i\}, \mathcal{P}, \{\mathcal{R}_i\}, \{\Omega_i\}, \{\mathcal{O}_i\} \rangle$, where $\mathcal{I}$ is the set of agents, $\mathcal{S}$ is the state space, $\mathcal{A}_i$ is the action set of agent $i$, $\mathcal{P}$ is the state transition function, $\mathcal{R}_i$ is the reward function of agent $i$, $\Omega_i$ is the observation set of agent $i$, and $\mathcal{O}_i$ is the observation function. At every time step, each agent receives only a partial observation $o_i \in \Omega_i$ of the underlying state and chooses an action to maximize its expected cumulative reward.
State representation learning
State representation learning (SRL) is a special form of representation learning that learns abstract state features in low dimensions. Formally, the SRL task is to learn a mapping function $\phi : o_t \mapsto s_t$ that encodes a raw observation $o_t$ into a low-dimensional state $s_t$ sufficient for the downstream policy.
In particular, Martin et al. 37 defined a good state representation as being able to represent the actual value of the current state and generalize the learned policy to unseen states, even unseen tasks.
Self-attention mechanism
Attention mechanisms have been widely adopted in computer vision and natural language processing. 38 Such mechanisms make neural networks focus on important feature representations.
Vaswani et al. 23 adopt queries, keys, and values that can be described by three matrices $Q$, $K$, and $V$, obtained by linear projections of the input. The attention output is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the keys and the scaling factor $\sqrt{d_k}$ keeps the dot products from growing too large.
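As a concrete reference, the scaled dot-product attention above can be sketched in a few lines of NumPy (an illustrative re-derivation, not code from this article):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row is a distribution
    return weights @ V, weights                    # weighted sum of values
```

Each output row is a convex combination of the value vectors, with weights given by the softmax over query-key similarities.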
QMIX
To address the centralized training with decentralized execution paradigm of the multi-agent problem, QMIX 12 proposed a method that learns a joint action-value function

$$Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = f\big(Q_1(\tau_1, u_1), \ldots, Q_n(\tau_n, u_n); s\big), \qquad \frac{\partial Q_{tot}}{\partial Q_i} \geq 0 \;\; \forall i$$

In the above equation, $\tau_i$ and $u_i$ are the action-observation history and action of agent $i$, $s$ is the global state available only during centralized training, and the mixing function $f$ is produced by hypernetworks conditioned on $s$ with non-negative weights. The monotonicity constraint guarantees that a global argmax over $Q_{tot}$ decomposes into individual argmax operations over each $Q_i$, which enables decentralized execution.
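The monotonic mixing idea can be made concrete with a small NumPy sketch (illustrative shapes and random hypernetwork weights; the real QMIX mixer is trained end-to-end, and the paper uses ELU where we use ReLU for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, HIDDEN, STATE_DIM = 4, 8, 16  # assumed sizes, not from the article

# Hypernetwork parameters: linear maps from the global state to mixer weights.
P = {
    "w1": rng.normal(size=(STATE_DIM, N_AGENTS * HIDDEN)),
    "b1": rng.normal(size=(STATE_DIM, HIDDEN)),
    "w2": rng.normal(size=(STATE_DIM, HIDDEN)),
    "b2": rng.normal(size=(STATE_DIM, 1)),
}

def qmix_mix(agent_qs, state):
    """Q_tot = |W2(s)|^T relu(|W1(s)|^T q + b1(s)) + b2(s).

    Taking abs() of the state-conditioned weights keeps them non-negative,
    which enforces dQ_tot/dQ_i >= 0 for every agent i."""
    W1 = np.abs(state @ P["w1"]).reshape(N_AGENTS, HIDDEN)
    b1 = state @ P["b1"]
    W2 = np.abs(state @ P["w2"])
    b2 = float(state @ P["b2"])
    hidden = np.maximum(agent_qs @ W1 + b1, 0.0)  # ReLU stand-in for ELU
    return float(hidden @ W2) + b2
```

Because every weight applied to the per-agent utilities is non-negative and the activation is non-decreasing, raising any single agent's Q-value can never lower Q_tot, which is exactly the property that lets the joint argmax decompose.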
MARL transfer framework
In this study, we propose a multi-agent transfer framework based on policy distillation and state representation. This framework consists of two main components: a state representation that allows policies to transfer across different multi-agent tasks and a policy distillation approach that reduces the number of parameters in the model to reduce the transfer cost. The entire transfer process is illustrated in Figure 1.

Transfer process from an 8-agent trained policy to a 16-agent mission. The purpose of this transfer framework is to reuse the policy model trained on the 8-agent case in training the 16-agent scenario using the state representation and policy distillation.
Dynamic state representation network
In our POSG setting, each agent’s total observation consists of the environment’s state information, the agent’s own state information and other agents’ partial observations. In the scenario of the RTS game, the game agent’s state information can be divided into three different parts: the agent’s own observations, the allies’ information and the enemies’ information. 39

From another perspective, these observational states can be divided into dimensionally dynamic observations and dimensionally static observations. The dimensionality of the dynamic observations increases linearly with the number of agents. To facilitate knowledge transfer and model reloading between multi-agent tasks with different numbers of agents, we use multi-head attention networks to align and represent dynamic observations in a static observation space. Precisely, we propose the dynamic state representation network (DSRNet).
Observation classification
SMAC provides a range of different micromanagement control scenarios for cooperative MARL research. 11 A given SMAC scenario has several homogeneous or heterogeneous types of allies and enemies. Agents can only receive partial observations within their range of view at every time step. The range can be described as a circular area around every unit with a radius equal to the observable range, as shown in Figure 2. In this range, agents can observe the following attributes for all alive units:

Some observable examples of agents in SMAC. The inner ring of yellow dashed circles is the range the central agent can attack, while the outer ring of grey circles is the range the central agent can observe.
We can classify observed features into two categories based on whether the dimension of the feature changes:

The DSRNet structure. The left part is the network architecture of the multi-headed attention mechanism, while the right part is the overall representational network for agent observations.
Each agent $i$ thus receives an observation $o_i$ that concatenates a fixed-dimensional static part (the agent's own features) with a variable-dimensional dynamic part (the features of the allies and enemies in view), where the dimension of the dynamic part grows linearly with the number of visible units.
Dynamic state representation
If the structure of the network is not affected by the variation of the observation space and the number of agents, we can easily exploit prior knowledge from different tasks. Inspired by the attention model, we propose a new network architecture (DSRNet) to solve this problem.
Figure 3 illustrates the overall network architecture of DSRNet. The original observation is first divided into its static and dynamic parts. The dynamic part, whose dimension varies with the number of visible allies and enemies, is processed by the multi-head attention module to obtain a fixed-size representation, while the static part is encoded by fully connected layers.
Next, the outputs of the above two parts of DSRNet are concatenated to the downstream of the following NN layers. The final output is the Q-values generated by the QMIX algorithm.
By adopting our DSRNet, small-scale learned models can easily be reloaded as initialization models for large-scale tasks. This training scheme can greatly accelerate learning and improve the final policy.
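To illustrate why the representation stays transferable, the following NumPy sketch (our simplified single-head reading of the idea, with assumed dimensions; the article's DSRNet is multi-headed and trained end-to-end) shows attention pooling producing a fixed-size vector regardless of how many allies are visible:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def represent(own_obs, ally_obs, W_q, W_k, W_v):
    """Embed a variable-length set of ally observations into a fixed-size
    vector by attending from the agent's own observation (the query)."""
    q = own_obs @ W_q                       # (d,) query from the agent itself
    K = ally_obs @ W_k                      # (n_allies, d): n_allies may vary
    V = ally_obs @ W_v
    w = softmax(q @ K.T / np.sqrt(K.shape[-1]))
    pooled = w @ V                          # (d,): fixed size for any n_allies
    return np.concatenate([own_obs, pooled])
```

Because the pooled vector has the same dimension for 3 allies or 30, a policy head built on top of this representation can be reloaded unchanged when the number of agents grows.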
Gradient-based policy distillation
We propose a policy distillation approach that performs well in a multi-agent environment. In MARL, a policy is a rule of action, described by a model, that maximizes the reward for the agents. Thus, distillation transfers the action probability distribution $\pi_T(a \mid s)$ produced by the teacher model to the student model. The student model outputs its own probability distribution $\pi_S(a \mid s)$ and is trained to match the teacher's softened targets, for example by minimizing the KL divergence $D_{KL}(\pi_T \,\|\, \pi_S)$ over states encountered during training.
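A minimal sketch of one such distillation update follows (NumPy, using the analytic gradient of the KL loss for a softmax student; learning rate and temperature are illustrative, not values from this article):

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_step(student_logits, teacher_logits, lr=0.5, tau=0.01):
    """One gradient step on KL(teacher || student) w.r.t. the student logits.

    With a softmax student and a fixed teacher target, the gradient of the
    KL loss w.r.t. the logits is simply p_student - p_teacher."""
    t = softmax(teacher_logits, tau)   # sharp teacher targets (low temperature)
    s = softmax(student_logits)
    return student_logits - lr * (s - t)
```

Repeating this step drives the student's action distribution toward the teacher's, so after enough updates the student selects the same greedy action.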
Experiments
In this section, we test the performance of our transfer framework on two sets of video game experiments, one based on the extended SMAC 40 environment and the other based on MAgent. 13 Moreover, we also test our transfer framework on a UAV planning simulation. 41
The StarCraft multi-agent challenge
The StarCraft multi-agent challenge (SMAC) 40 is based on the popular RTS game StarCraft 2 and focuses on micromanagement challenges, where an independent agent controls each unit that must act based on local observations. It is a popular benchmark for fully cooperative multi-agent tasks. SMAC provides many battle scenarios.
Our experiments are based on the open-source libraries PyTorch, SMAC, and PySC2, and were performed on a single Linux server. SMAC provides different battle scenarios and difficulty options. However, to simplify the transfer process and increase the richness of the experiment, the original map settings were suitably modified. We expanded the number of agents based on the original map “3m,” which has three marines on each team, and limited the active attack options of our ally units. In the transfer process, agents first learn on a mission with 4 marines versus 4 marines, which we name 4m. Then, agents progressively learn on an 8 marines versus 8 marines mission (8m) and a 16 marines versus 16 marines mission (16m), as shown in Figure 4. The red units are allies in these maps, while the blue units are enemies. We train only the red units; the blue units are controlled by the built-in game AI.

Screenshot of three different large-scale MARL maps extended from SMAC: 4 marines versus 4 marines (a), 8 marines versus 8 marines (b), and 16 marines versus 16 marines (c).
To ensure the validity of the experiment, the parameters were fixed throughout the experiment. These parameters include observation space, action space, game mechanics, environmental parameters, and game difficulty. We also strictly ensure that the model execution is distributed, that is, the agent can obtain data from only its own valid observation range when executing the strategy. The adjustment of the hyperparameter settings greatly affects the final result of QMIX, 42 so we strictly use the same hyperparameter settings as in the original SMAC experiment, as shown in Table 1.
Hyperparameter settings of QMIX.
The primary evaluation metric of a single task is the average win rate and reward, which varies with the number of steps agents run. The whole evaluation process can be tested periodically during the training process. Therefore, the transfer performance can be compared easily via the average win rate or reward. In the experiment, our curves are the average of five independent runs.
We evaluate the effects of our solution for large-scale MARL tasks based on the proposed framework. These experimental results are based on one main source task (4m) and two target tasks (8m, 16m), as shown in Figure 4. The basic algorithm used here for training multiple agents is QMIX. The final experimental results are shown in detail in Figure 5.

Training results in the target task. (a) The win rate and reward comparison when transferring a model trained on the 4m task to the 8m task. (b) The win rate and reward comparison when transferring the model trained on the 8m task to the 16m task.
Figure 5 shows strong results. The left part (a) of the figure presents the average win rate and reward on the 8m mission, and the right part (b) shows the same for 16m. The policy model with transfer outperforms the original QMIX algorithm. More precisely, the effect of the transfer can be understood in three ways. First, there is a clear initial performance improvement at the beginning of training, which is particularly evident in the reward curves. Second, there is an increase in asymptotic performance in the later stages of training, that is, a higher final win rate. Third, if an 80% win rate is taken as an acceptable threshold, the transferred algorithm reaches it much faster: on the 16m mission it needs only 100,000 time steps, compared to 500,000 time steps when learning from scratch, saving 80% of the training steps.
MAgent
MAgent is a platform for the research and development of large-scale MARL. Unlike other single-agent or multi-agent tasks, MAgent can support thousands of agents. This large-scale agent interaction not only helps to study the optimal policy of multi-agent cooperation but also enables observation and investigation of group behavior patterns and even social phenomena through the behavior of a large number of agents interacting with each other. Even in the case of a single GPU, MAgent can scale to millions of agents. It provides three main types of scenarios, Pursuit, Gathering, and Battle.
Our experiments focus on the Battle scenario. In this scenario, each agent can either move to another position or attack an opponent within its local observation range. We assign a −0.001 reward for each move, a 0.1 reward for attacking an enemy, a −0.1 penalty when an agent is killed, and a reward of 10 for wiping out all opponents in the local observation range. We consider missions with different numbers of agents (30 vs. 30, 40 vs. 40, 50 vs. 50).
As shown in Figure 6, red and blue indicate two opposing armies. Agents receive a bonus for killing their opponents and a penalty for being killed, and they must cooperate with their teammates to destroy all enemy agents.

An example of a battle mission in MAgent.
We validate the effect of transfer with DSRNet and non-transfer by two different independent RL algorithms (A3C and Double DQN). Table 2 shows the average number of remaining teammates and killed opponents in a 50 versus 50 target task. The performance of Double DQN and A3C trained from scratch can be significantly improved with the transfer framework; that is, more teammates remain, and a higher average number of enemies are killed.
Max and mean standard error of 50 versus 50 task in MAgent.
UAV trajectory planning simulation
In this part, we test our transfer framework in a simulated UAV environment based on the ML-Agents Toolkit. 41 As shown in Figure 7, this platform contains static and dynamic obstacles. The main cooperative control task for the UAVs is to traverse the pathway while avoiding collisions.

The UAV simulated environment. The UAV agents are marked with a red circle. The random blue balls are dynamic obstacles, while other objects are static obstacles.
We assume the UAVs are shuttling in a 3D space, and the terminal time is set to
In this experimental setting, we used the QMIX algorithm to model the speed control and evasion strategies of the UAVs and compared the resulting collision avoidance success rates. Collision avoidance includes two parts: collision avoidance between UAVs and obstacles, and collision avoidance among the UAV agents themselves.
Figure 8 shows the success probability of collision avoidance. We select different numbers of UAVs ranging from 2 to 32, and the final probability is averaged over five independent runs. Figure 8(a) shows a decreasing trend in the success probability of collision avoidance between UAVs as the number of UAVs grows; however, our transfer framework clearly slows this decline. Moreover, Figure 8(b) shows that the UAVs avoid dynamic and static obstacles well. Our transfer framework combining QMIX and DSRNet also helps improve the success probability, especially as the number of UAVs increases (up to 32).

Collision avoidance performance of a large number of UAVs in 3D environment with static and dynamic obstacles. (a) The collision avoidance success probability between UAVs, (b) the collision avoidance success probability between UAVs and obstacles.
Potential robotic applications
Robot collaboration is an important challenge in robotics. Robot swarming integrates simple, inexpensive, and modular units into task-based working groups that can perform large and complex tasks. 43 The use of machine learning, especially MARL, is a promising direction for inter-agent communication and coordinated collaboration. For example, with the MARL transfer framework proposed in this work, the knowledge learned by a few robots can be extended to a large number of robots for cooperative tasks, such as industrial cooperative control and physical UAV swarm control. 44 Furthermore, the framework supports both partially observable centralized training with distributed execution and fully independent training, which adapts well to practical agent deployment and resource allocation.
Conclusion
This article proposes a new scheme for training large-scale multi-agent cooperation tasks via TL. The scheme is built on a dynamic state representation network (DSRNet) that handles a variable number of agents in the partially observable setting. Additionally, DSRNet can work with the centralized training and decentralized execution (CTDE) algorithm QMIX as well as with the independent RL algorithms A3C and Double DQN. We conducted extensive comparison experiments on the well-known multi-agent testing platforms SMAC and MAgent, and we also applied our method to UAV trajectory planning simulation experiments. With the proposed scheme, both the training time and the final results show impressive improvement. These results provide novel insights into the problem of training large-scale multi-agent tasks.
