Abstract
Introduction
Reinforcement learning (RL) has received increasing attention from the artificial intelligence (AI) research community in recent years. Deep reinforcement learning (DRL) 1 in single-agent tasks is a practical framework for solving decision-making tasks at a human level 2 by training an agent that interacts with a dynamic environment. Cooperative multi-agent reinforcement learning (MARL) is a more complicated problem in the RL field due to the exponential growth of the joint decision space. 3 The approach encourages multiple agents to achieve a common goal through credit assignment, 4 and it is closely linked to many real-world problems, such as multi-player video games 5 and traffic light control. 6 However, MARL poses many challenges because agents must interact with each other in a shared environment. 7 Furthermore, in a large-scale multi-agent task, the dynamic environment becomes even more complicated and can be intractable. 8
Transfer learning (TL) is an efficient way to solve RL problems by leveraging prior knowledge. 9,10 Reusing existing knowledge can accelerate the RL agent's learning process and make complex tasks learnable. It is crucial to decide how, when, and what knowledge to store and reuse. 9 There is no generally valid solution for all domains: an improper transfer may harm the learning process instead of accelerating it, which is known as negative transfer. Therefore, it is crucial to design transfer principles for different RL training scenarios, especially for complex tasks. In this article, we use TL methods to improve training efficiency and effectiveness in large-scale multi-agent tasks.
Under partially observable settings, the state observed by each agent in multi-agent training changes dynamically. This poses an obstacle to transferring policies across multi-agent tasks with different numbers of agents. To solve this problem, we must handle dynamic states with a state representation approach. To the best of our knowledge, no related work has explored this topic in a multi-agent scenario. In this article, we use an attention mechanism to handle dynamic observations so that the multi-agent observation dimension remains stable across tasks with different numbers of agents. This approach also lays the foundation for the transfer of policies.
Thus, on the basis of the above observations, we use TL to help solve the large-scale multi-agent training problem. First, we propose a dynamic state representation network (DSRNet) to remove the transfer barrier between tasks with different numbers of agents. In turn, various classical RL algorithms can be combined with the transfer method. We then select representative methods to verify the capability of our approach in different experimental settings.
This article focuses on a real-time strategy (RTS) game to explore large-scale fully cooperative MARL. StarCraft is an RTS game that is very popular around the world and provides a suitable environment for AI researchers to simulate combat scenarios. SMAC 11 has become a standard benchmark for evaluating discrete cooperative MARL algorithms. We scaled SMAC to our large-scale multi-agent needs, using the classical centralized training with distributed execution algorithm QMIX 12 as a baseline to test our transfer framework. A second set of experiments used MAgent, 13 a platform that supports a larger number of agents. On this platform, we combined the classical independent RL training methods double deep Q-network (Double DQN) 14 and asynchronous advantage actor-critic (A3C) 15 to validate our transfer method. Moreover, we conduct UAV collision avoidance planning simulations that demonstrate our framework's ability to support large-scale robot control training.
The main contributions of this article can be summarized as follows. First, our approach verifies the feasibility of TL in large-scale multi-agent cooperation schemes. Second, we introduce an attention network into the single agent under the partially observable setting to represent a variable number of units. Third, we achieve practical TL with good performance from a few agents to many agents in different environments, which sheds light on the training problem of very large-scale multi-agent scenes.
The remainder of this article presents related work, problem formulation and background knowledge, the MARL transfer framework, experiments, potential robotic applications, and the conclusion.
Related work
TL has played an important role in accelerating single-agent RL by adapting learnt knowledge from past relevant tasks. 10,16,17 Inspired by this scenario, TL in MARL 18–21 is also studied with respect to transferring knowledge across multi-agent tasks to help improve the learning performance. The above work has two main directions: knowledge transfer across tasks and transfer among agents. However, few works consider transferring knowledge across different numbers of agents, especially from a small number of agents to a large number of agents.
Attention mechanisms have become an essential component of many deep neural networks. In particular, self-attention 22 computes the attention weight at a specific position in a sequence by considering all other positions in that sequence. Vaswani et al. 23 showed that a machine translation model composed only of self-attention modules could achieve state-of-the-art results. Wang et al. 24 reconstructed self-attention as a non-local operation to model spatial-temporal dependencies in video sequences. Nevertheless, self-attention mechanisms have not been fully explored in MARL.
State representation learning (SRL) aims to capture the changes in the environment caused by the agent's actions; this special representation is particularly suitable for extracting dynamic states in RL tasks. The main function of SRL is to generate a low-dimensional state space in which an RL policy can perform well and be efficient. The studies 25–29 adopt SRL methods to make the RL training process faster by separating the representation learning process from the policy learning process.
The purpose of policy distillation is to remove parameters that are unnecessary in the original model, thereby improving the generalization of the traditional model. 30 Distillation is performed by comparing the classification results of the teacher and student networks, using soft labels to avoid loss of information. Policy distillation, which distils one or more behavioral policies from a teacher model into a student model, has been introduced to RL. 31 Policy distillation has three cases: (1) the student model is trained with a negative log-likelihood (NLL) loss to predict the teacher's greedy action, (2) the student model is trained with a mean-squared-error (MSE) loss, and (3) the student model is trained with a Kullback–Leibler (KL) divergence loss. This approach allows the network size to be compressed without performance degradation, and multiple task-specific policies can be consolidated into a single policy. Policy distillation is widely used in single-agent RL. 32 A few previous works have focused on the transfer properties of MARL; for example, Barrett and Stone 33 proposed an ad hoc teamwork algorithm, Omidshafiei et al. 34 proposed the LeCTR algorithm to accomplish knowledge transfer between two agents, and Hernandez-Leal et al. 35 combined the Bayesian method and the Pepper model in multi-agent matchmaking. However, the above methods do not target large numbers of agents and do not scale well as the number of agents grows, which is the challenge addressed in this work. In this article, we devise a method that efficiently distils large, heavy network policies into small, light networks in the deep MARL environment by drawing on these distillation methods.
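To make the three loss cases concrete, here is a minimal NumPy sketch (illustrative helper names, not the implementation used in this article) of the NLL, MSE, and KL distillation losses computed from teacher and student outputs:

```python
import numpy as np

def softmax(z, tau=1.0):
    """Softened action distribution at temperature tau."""
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def nll_loss(teacher_probs, student_probs):
    """Case 1: negative log-likelihood of the teacher's greedy action."""
    best = int(np.argmax(teacher_probs))
    return -float(np.log(student_probs[best]))

def mse_loss(teacher_q, student_q):
    """Case 2: mean-squared error between raw Q-value vectors."""
    return float(np.mean((np.asarray(teacher_q) - np.asarray(student_q)) ** 2))

def kl_loss(teacher_probs, student_probs, eps=1e-12):
    """Case 3: KL divergence from the student's to the teacher's distribution."""
    t = np.asarray(teacher_probs) + eps
    s = np.asarray(student_probs) + eps
    return float(np.sum(t * np.log(t / s)))
```

In practice the KL variant with a softened (temperature-scaled) teacher distribution is the one most often reported to work best for Q-value distillation.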
Problem formulation and background
Partially observable stochastic games
Fully cooperative multi-agent tasks can be modelled as decentralized partially observable stochastic games (POSGs) 36 that extend from Markov decision processes (MDPs). In this article, we follow the POSG setting, where agents cannot obtain complete environmental information.
A POSG is composed of a tuple $\langle \mathcal{I}, \mathcal{S}, \{\mathcal{A}_i\}, \mathcal{P}, \{\mathcal{R}_i\}, \{\Omega_i\}, \{\mathcal{O}_i\} \rangle$, where $\mathcal{I}$ is the set of agents, $\mathcal{S}$ is the state space, $\mathcal{A}_i$ is the action set of agent $i$, $\mathcal{P}$ is the state transition function, $\mathcal{R}_i$ is the reward function of agent $i$, $\Omega_i$ is the observation set of agent $i$, and $\mathcal{O}_i$ is the observation function. At every time step, each agent receives only a partial observation $o_i \in \Omega_i$ of the underlying state and chooses an action to maximize its expected cumulative reward.
State representation learning
State representation learning (SRL) is a special form of representation learning that learns abstract state features in low dimensions. Formally, the SRL task is to learn a mapping function $\phi : o_t \mapsto s_t$ that encodes a raw observation $o_t$ into a low-dimensional state $s_t$ sufficient for the downstream policy.
In particular, Martin et al. 37 defined a good state representation as being able to represent the actual value of the current state and generalize the learned policy to unseen states, even unseen tasks.
Self-attention mechanism
Attention mechanisms have been widely adopted in computer vision and natural language processing. 38 Such mechanisms make neural networks focus on important feature representations.
Vaswani et al. 23 adopt queries, keys, and values that can be described by three matrices $Q$, $K$, and $V$, obtained by linear projections of the input. The attention output is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the keys and the scaling factor $\sqrt{d_k}$ keeps the dot products from growing too large.
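As a concrete reference, the scaled dot-product attention above can be sketched in a few lines of NumPy (an illustrative re-derivation, not code from this article):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row is a distribution
    return weights @ V, weights                    # weighted sum of values
```

Each output row is a convex combination of the value vectors, with weights given by the softmax over query-key similarities.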
QMIX
To address the centralized training with decentralized execution paradigm of the multi-agent problem, QMIX 12 proposed a method that learns a joint action-value function

$$Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = f\big(Q_1(\tau_1, u_1), \ldots, Q_n(\tau_n, u_n); s\big), \qquad \frac{\partial Q_{tot}}{\partial Q_i} \geq 0 \;\; \forall i$$

In the above equation, $\tau_i$ and $u_i$ are the action-observation history and action of agent $i$, $s$ is the global state available only during centralized training, and the mixing function $f$ is produced by hypernetworks conditioned on $s$ with non-negative weights. The monotonicity constraint guarantees that a global argmax over $Q_{tot}$ decomposes into individual argmax operations over each $Q_i$, which enables decentralized execution.
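The monotonic mixing idea can be made concrete with a small NumPy sketch (illustrative shapes and random hypernetwork weights; the real QMIX mixer is trained end-to-end, and the paper uses ELU where we use ReLU for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, HIDDEN, STATE_DIM = 4, 8, 16  # assumed sizes, not from the article

# Hypernetwork parameters: linear maps from the global state to mixer weights.
P = {
    "w1": rng.normal(size=(STATE_DIM, N_AGENTS * HIDDEN)),
    "b1": rng.normal(size=(STATE_DIM, HIDDEN)),
    "w2": rng.normal(size=(STATE_DIM, HIDDEN)),
    "b2": rng.normal(size=(STATE_DIM, 1)),
}

def qmix_mix(agent_qs, state):
    """Q_tot = |W2(s)|^T relu(|W1(s)|^T q + b1(s)) + b2(s).

    Taking abs() of the state-conditioned weights keeps them non-negative,
    which enforces dQ_tot/dQ_i >= 0 for every agent i."""
    W1 = np.abs(state @ P["w1"]).reshape(N_AGENTS, HIDDEN)
    b1 = state @ P["b1"]
    W2 = np.abs(state @ P["w2"])
    b2 = float(state @ P["b2"])
    hidden = np.maximum(agent_qs @ W1 + b1, 0.0)  # ReLU stand-in for ELU
    return float(hidden @ W2) + b2
```

Because every weight applied to the per-agent utilities is non-negative and the activation is non-decreasing, raising any single agent's Q-value can never lower Q_tot, which is exactly the property that lets the joint argmax decompose.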
MARL transfer framework
In this study, we propose a multi-agent transfer framework based on policy distillation and state representation. This framework consists of two main components: a state representation that allows policies to transfer across different multi-agent tasks and a policy distillation approach that reduces the number of parameters in the model to reduce the transfer cost. The entire transfer process is illustrated in Figure 1.

Transfer process from an 8-agent trained policy to a 16-agent mission. The purpose of this transfer framework is to reuse the policy model trained on the 8-agent case in training the 16-agent scenario using the state representation and policy distillation.
Dynamic state representation network
In our POSG setting, each agent’s total observation consists of the environment’s state information, the agent’s own state information and other agents’ partial observations. In the scenario of the RTS game, the game agent’s state information can be divided into three different parts: the agent’s own observations, the allies’ information and the enemies’ information. 39

From another perspective, these observational states can be divided into dimensionally dynamic observations and dimensionally static observations. The dimensionality of the dynamic observations increases linearly with the number of agents. To facilitate knowledge transfer and model reloading between multi-agent tasks with different numbers of agents, we use multi-head attention networks to align and represent dynamic observations in a static observation space. Precisely, we propose the dynamic state representation network (DSRNet).
Observation classification
SMAC provides a range of different micromanagement control scenarios for cooperative MARL research. 11 A given SMAC scenario has several homogeneous or heterogeneous types of allies and enemies. Agents can only receive partial observations within their range of view at every time step. The range can be described as a circular area around every unit with a radius equal to the observable range, as shown in Figure 2. In this range, agents can observe the following attributes for all alive units:

Some observable examples of agents in SMAC. The inner ring of yellow dashed circles is the range the central agent can attack, while the outer ring of grey circles is the range the central agent can observe.
We can classify observed features into two categories based on whether the dimension of the feature changes:

The DSRNet structure. The left part is the network architecture of the multi-headed attention mechanism, while the right part is the overall representational network for agent observations.
Each agent $i$ thus receives an observation $o_i$ that concatenates a fixed-dimensional static part (the agent's own features) with a variable-dimensional dynamic part (the features of the allies and enemies in view), where the dimension of the dynamic part grows linearly with the number of visible units.
Dynamic state representation
If the structure of the network is not affected by the variation of the observation space and the number of agents, we can easily exploit prior knowledge from different tasks. Inspired by the attention model, we propose a new network architecture (DSRNet) to solve this problem.
Figure 3 illustrates the overall network architecture of DSRNet. The original observation is first divided into its static and dynamic parts. The dynamic part, whose dimension varies with the number of visible allies and enemies, is processed by the multi-head attention module to obtain a fixed-size representation, while the static part is encoded by fully connected layers.
Next, the outputs of the above two parts of DSRNet are concatenated to the downstream of the following NN layers. The final output is the Q-values generated by the QMIX algorithm.
By adopting our DSRNet, small-scale learned models can easily be reloaded as initialization models for large-scale tasks. This training scheme can greatly accelerate learning and improve the final policy.
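To illustrate why the representation stays transferable, the following NumPy sketch (our simplified single-head reading of the idea, with assumed dimensions; the article's DSRNet is multi-headed and trained end-to-end) shows attention pooling producing a fixed-size vector regardless of how many allies are visible:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def represent(own_obs, ally_obs, W_q, W_k, W_v):
    """Embed a variable-length set of ally observations into a fixed-size
    vector by attending from the agent's own observation (the query)."""
    q = own_obs @ W_q                       # (d,) query from the agent itself
    K = ally_obs @ W_k                      # (n_allies, d): n_allies may vary
    V = ally_obs @ W_v
    w = softmax(q @ K.T / np.sqrt(K.shape[-1]))
    pooled = w @ V                          # (d,): fixed size for any n_allies
    return np.concatenate([own_obs, pooled])
```

Because the pooled vector has the same dimension for 3 allies or 30, a policy head built on top of this representation can be reloaded unchanged when the number of agents grows.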
Gradient-based policy distillation
We propose a policy distillation approach that performs well in a multi-agent environment. In MARL, a policy is a rule of action, described by a model, that maximizes the reward for the agents. Thus, distillation transfers the action probability distribution $\pi_T(a \mid s)$ produced by the teacher model to the student model. The student model outputs its own probability distribution $\pi_S(a \mid s)$ and is trained to match the teacher's softened targets, for example by minimizing the KL divergence $D_{KL}(\pi_T \,\|\, \pi_S)$ over states encountered during training.
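A minimal sketch of one such distillation update follows (NumPy, using the analytic gradient of the KL loss for a softmax student; learning rate and temperature are illustrative, not values from this article):

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_step(student_logits, teacher_logits, lr=0.5, tau=0.01):
    """One gradient step on KL(teacher || student) w.r.t. the student logits.

    With a softmax student and a fixed teacher target, the gradient of the
    KL loss w.r.t. the logits is simply p_student - p_teacher."""
    t = softmax(teacher_logits, tau)   # sharp teacher targets (low temperature)
    s = softmax(student_logits)
    return student_logits - lr * (s - t)
```

Repeating this step drives the student's action distribution toward the teacher's, so after enough updates the student selects the same greedy action.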
Experiments
In this section, we test the performance of our transfer framework on two sets of video game experiments, one based on the extended SMAC 40 environment and the other based on MAgent. 13 Moreover, we also test our transfer framework on a UAV planning simulation. 41
The StarCraft multi-agent challenge
The StarCraft multi-agent challenge (SMAC) 40 is based on the popular RTS game StarCraft 2 and focuses on micromanagement challenges, where an independent agent controls each unit that must act based on local observations. It is a popular benchmark for fully cooperative multi-agent tasks. SMAC provides many battle scenarios.
Our experiments are based on the open-source libraries PyTorch, SMAC, and PySC2, and were performed on a single Linux server. SMAC provides different battle scenarios and difficulty options. However, to simplify the transfer process and increase the richness of the experiment, the original map settings were suitably modified. We expanded the number of agents based on the original map “3m,” which has three marines on each team, and limited the active attack options of our ally units. In the transfer process, agents first learn on a mission with 4 marines versus 4 marines, which we name 4m. Then, agents progressively learn on an 8 marines versus 8 marines mission (8m) and a 16 marines versus 16 marines mission (16m), as shown in Figure 4. The red units are allies in these maps, while the blue units are enemies. We train only the red units; the blue units are controlled by the built-in game AI.

Screenshot of three different large-scale MARL maps extended from SMAC: 4 marines versus 4 marines (a), 8 marines versus 8 marines (b), and 16 marines versus 16 marines (c).
To ensure the validity of the experiment, the parameters were fixed throughout the experiment. These parameters include observation space, action space, game mechanics, environmental parameters, and game difficulty. We also strictly ensure that the model execution is distributed, that is, the agent can obtain data from only its own valid observation range when executing the strategy. The adjustment of the hyperparameter settings greatly affects the final result of QMIX, 42 so we strictly use the same hyperparameter settings as in the original SMAC experiment, as shown in Table 1.
Hyperparameter settings of QMIX.
The primary evaluation metric of a single task is the average win rate and reward, which varies with the number of steps agents run. The whole evaluation process can be tested periodically during the training process. Therefore, the transfer performance can be compared easily via the average win rate or reward. In the experiment, our curves are the average of five independent runs.
We evaluate the effects of our solution for large-scale MARL tasks based on the proposed framework. These experimental results are based on one main source task (4m) and two target tasks (8m, 16m), as shown in Figure 4. The basic algorithm used here for training multiple agents is QMIX. The final experimental results are shown in detail in Figure 5.

Training results in the target task. (a) The win rate and reward comparison when transferring a model trained on the 4m task to the 8m task. (b) The win rate and reward comparison when transferring the model trained on the 8m task to the 16m task.
Figure 5 shows strong results. The left part (a) of the figure presents the average win rate and reward on the 8m mission, and the right part (b) shows the same for 16m. The policy model with transfer outperforms the original QMIX algorithm. More precisely, the effect of the transfer can be understood in three ways. First, there is a clear initial performance improvement at the beginning of training, which is particularly evident in the reward curves. Second, there is an increase in asymptotic performance in the later stages of training, that is, a higher final win rate. Third, if an 80% win rate is taken as an acceptable threshold, the transferred algorithm reaches it much faster: on the 16m mission it needs only 100,000 time steps, compared to 500,000 time steps when learning from scratch, saving 80% of the training steps.
MAgent
MAgent is a platform for the research and development of large-scale MARL. Unlike other single-agent or multi-agent tasks, MAgent can support thousands of agents. This large-scale agent interaction not only helps to study the optimal policy of multi-agent cooperation but also enables observation and investigation of group behavior patterns and even social phenomena through the behavior of a large number of agents interacting with each other. Even in the case of a single GPU, MAgent can scale to millions of agents. It provides three main types of scenarios, Pursuit, Gathering, and Battle.
Our experiments focus on the Battle scenario. In this scenario, each agent can either move to another position or attack an opponent within its local observation range. We assign a −0.001 reward for each move, a 0.1 reward for attacking an enemy, a −0.1 penalty when an agent is killed, and a reward of 10 for wiping out all opponents in the local observation range. We consider missions with different numbers of agents (30 vs. 30, 40 vs. 40, 50 vs. 50).
As shown in Figure 6, red and blue indicate two opposing armies. Agents receive a bonus for killing their opponents and a penalty for being killed, and they must cooperate with their teammates to destroy all enemy agents.

An example of a battle mission in MAgent.
We validate the effect of transfer with DSRNet and non-transfer by two different independent RL algorithms (A3C and Double DQN). Table 2 shows the average number of remaining teammates and killed opponents in a 50 versus 50 target task. The performance of Double DQN and A3C trained from scratch can be significantly improved with the transfer framework; that is, more teammates remain, and a higher average number of enemies are killed.
Max and mean standard error of 50 versus 50 task in MAgent.
UAV trajectory planning simulation
In this part, we test our transfer framework in a simulated UAV environment based on the ML-Agents Toolkit. 41 As shown in Figure 7, this platform contains static and dynamic obstacles. The main cooperative control task for the UAVs is to traverse the pathway while avoiding collisions.

The UAV simulated environment. The UAV agents are marked with a red circle. The random blue balls are dynamic obstacles, while other objects are static obstacles.
We assume the UAVs are shuttling in a 3D space, and the terminal time is set to
In this experimental setting, we used the QMIX algorithm to model the speed control and evasion strategies of the UAVs and compared the resulting collision avoidance success rates. Collision avoidance includes two parts: collision avoidance between UAVs and obstacles, and collision avoidance among the UAV agents themselves.
Figure 8 shows the success probability of collision avoidance. We select different numbers of UAVs ranging from 2 to 32, and the final probability is averaged over five independent runs. Figure 8(a) shows a decreasing trend in the success probability of collision avoidance between UAVs as the number of UAVs grows; however, our transfer framework clearly slows this decline. Moreover, Figure 8(b) shows that the UAVs avoid dynamic and static obstacles well. Our transfer framework combining QMIX and DSRNet also helps improve the success probability, especially as the number of UAVs increases (up to 32).

Collision avoidance performance of a large number of UAVs in 3D environment with static and dynamic obstacles. (a) The collision avoidance success probability between UAVs, (b) the collision avoidance success probability between UAVs and obstacles.
Potential robotic applications
Robot collaboration is an important challenge in robotics. Robot swarming integrates simple, inexpensive, and modular units into task-based working groups that can perform large and complex tasks. 43 The use of machine learning, especially MARL, is a promising direction for inter-agent communication and coordinated collaboration. For example, with the MARL transfer framework proposed in this work, the knowledge learned by a few robots can be extended to a large number of robots for cooperative tasks, such as industrial cooperative control and physical UAV swarm control. 44 Furthermore, the framework supports both partially observable centralized training with distributed execution and fully independent training, which adapts well to practical agent deployment and resource allocation.
Conclusion
This article proposes a new scheme for training large-scale multi-agent cooperation tasks via TL. The scheme is built on a dynamic state representation network (DSRNet) that handles a variable number of agents in the partially observable setting. Additionally, DSRNet can work with the centralized training and decentralized execution (CTDE) algorithm QMIX as well as with the independent RL algorithms A3C and Double DQN. We conducted extensive comparison experiments on the well-known multi-agent testing platforms SMAC and MAgent, and we also applied our method to UAV trajectory planning simulation experiments. With the proposed scheme, both the training time and the final results show impressive improvement. These results provide novel insights into the problem of training large-scale multi-agent tasks.
