Qatten is a novel Q-value attention network for the multi-agent Q-value decomposition problem. Qatten provides a theoretical linear decomposition formula relating $Q_{tot}$ and the $Q^{i}$, which covers previous methods and achieves state-of-the-art performance on StarCraft II micromanagement tasks across different scenarios.
Paper
Introduction
In many real-world settings, a team of cooperative agents must learn to coordinate their behavior under private observations and communication constraints. Deep multi-agent reinforcement learning (deep MARL) algorithms have shown superior performance on these realistic and difficult problems but still face challenges. One branch is multi-agent value decomposition, which decomposes the global shared multi-agent Q-value $Q_{tot}$ into individual Q-values $Q^{i}$ to guide each agent's behavior. The few related works either lack theoretical depth or perform poorly on realistic and complex tasks. To overcome these issues, we propose the Q-value Attention network (Qatten), which consists of a theoretical linear decomposition formula from $Q_{tot}$ to each $Q^{i}$ and a theoretically grounded multi-head attention implementation. Combining the decomposition theory with careful practice, Qatten achieves state-of-the-art performance on the challenging and widely adopted StarCraft Multi-Agent Challenge (SMAC) testbed.
Motivation
Related Works
Several previous methods are related to our work. The Value Decomposition Network (VDN) (Sunehag et al., 2018) learns a centralized but factored $Q_{tot}$, where $Q_{tot}(s, \vec{a})= \sum_{i} Q^{i}(s, a^{i})$. VDN assumes additivity holds when each $Q^{i}$ is evaluated based on $o^{i}$, which is only an approximation and introduces inaccuracy. Moreover, VDN severely limits the complexity of the centralized action-value function and ignores any extra state information available during training. Different from VDN, QMIX (Rashid et al., 2018) learns a monotonic multi-agent Q-value approximation of $Q_{tot}$. QMIX factors the joint action-value $Q_{tot}$ into a monotonic nonlinear combination of the individual Q-values $Q^{i}$ that each agent learns, via a mixing network. The mixing network, with non-negative weights produced by a hypernetwork, is responsible for combining the agents' utilities for the chosen actions into $Q_{tot}(s, \vec{a})$. This non-negativity ensures that $\frac{\partial Q_{tot}}{\partial Q^{i}} \ge 0$, which in turn guarantees the IGM property (Son et al., 2019). However, QMIX adopts an implicit, unexplained mixing method that lacks theoretical insight. Recently, QTRAN (Son et al., 2019) was proposed to guarantee optimal decentralization by imposing linear constraints between agent utilities and joint action-values. However, the constraints on the optimization problem involved are computationally intractable, and the corresponding relaxations make QTRAN perform poorly in complex tasks (Mahajan et al., 2019).
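To make the contrast concrete, here is a minimal NumPy sketch of the two factorizations (our own illustration, not the original implementations; a fixed random linear map of the state stands in for QMIX's learned hypernetwork, and a single mixing layer stands in for its two-layer network):

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, state_dim = 3, 4

def vdn_mix(q_agents):
    """VDN: Q_tot is the plain sum of individual Q-values."""
    return float(q_agents.sum())

def qmix_mix(q_agents, state, w_hyper, b_hyper):
    """One-layer QMIX sketch: a 'hypernetwork' (here a fixed linear map of
    the state) produces mixing weights, made non-negative with abs() so
    that dQ_tot/dQ^i >= 0 -- the monotonicity that preserves IGM."""
    w = np.abs(w_hyper @ state)   # non-negative, state-dependent weights
    b = b_hyper @ state           # state-dependent bias (scalar)
    return float(w @ q_agents + b)

state = rng.normal(size=state_dim)
w_hyper = rng.normal(size=(n_agents, state_dim))
b_hyper = rng.normal(size=state_dim)
q = rng.normal(size=n_agents)

# Monotonicity check: raising any single agent's Q-value cannot lower
# Q_tot under QMIX, because all mixing weights are non-negative.
q_up = q.copy()
q_up[1] += 1.0
```

Note the trade-off this sketch exposes: VDN ignores the state entirely, while QMIX conditions the mixing weights on it but keeps the combination monotonic.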
Qatten
In this paper, for the first time, we theoretically derive a linear decomposition formula from $Q_{tot}$ to each $Q^{i}$. Based on this theoretical finding, we introduce the multi-head attention mechanism to approximate each term in the decomposition formula, with theoretical explanations. In short, when we investigate the global Q-value $Q_{tot}$ near the maximum point in action space, the dependence of $Q_{tot}$ on each individual Q-value $Q^{i}$ is approximately linear. Below we explain this theory.
$Q_{tot}$ can be viewed as a function of the individual $Q^{i}$:
\[Q_{tot}=Q_{tot}(s, Q^{1}, Q^{2}, ..., Q^{n}).\]We prove (see details in our paper) that:
Theorem 1. There exist constants $c(s), \lambda_i(s)$ (depending on the state $s$) such that, when we neglect higher-order terms $o(\|\vec{a}-\vec{a}_{0}\|^{2})$, the local expansion of $Q_{tot}$ admits the following form:
\[Q_{tot}(s,\vec{a})=c(s)+\sum_{i}\lambda_{i}(s)\,Q^{i}(s,a^{i}).\]
And in a cooperative setting, the constants satisfy $\lambda_i(s) \ge 0$.
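A one-line sketch of the reasoning behind Theorem 1 (assuming $Q_{tot}$ is smooth in the $Q^{i}$; see the paper for the full proof): a first-order Taylor expansion of $Q_{tot}$ around the individual values $\bar{Q}^{i}$ attained at the maximizing joint action gives
\[Q_{tot}(s, Q^{1}, \ldots, Q^{n}) \approx Q_{tot}(s, \bar{Q}^{1}, \ldots, \bar{Q}^{n}) + \sum_{i} \left.\frac{\partial Q_{tot}}{\partial Q^{i}}\right|_{\bar{Q}} \bigl(Q^{i} - \bar{Q}^{i}\bigr).\]
Collecting all state-dependent constants into $c(s)$ and setting $\lambda_{i}(s)=\left.\partial Q_{tot}/\partial Q^{i}\right|_{\bar{Q}}$ recovers the linear form above.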
Theorem 2. The functional relation between $Q_{tot}$ and $Q^{i}$ appears to be linear in the action space, yet contains all the nonlinear information. We have the following finer structure of $\lambda_{i}$: each coefficient decomposes into per-order terms, $\lambda_{i}(s)=\sum_{h}\lambda_{i,h}(s)$.
Then we have
\[Q_{tot}(s,\vec{a})=c(s) +\sum_{i,h} \lambda_{i,h}(s)\, Q^{i}(s,a^i),\]where $\lambda_{i,h}$ is a linear functional of all partial derivatives $\frac{\partial^{h}Q_{tot}}{\partial Q^{i_1}\cdots\partial Q^{i_h}}$ of order $h$, and decays super-exponentially fast in $h$. Based on the above findings, we introduce multi-head attention to realize a deep implementation (Qatten).
The overall architecture consists of the agents' recurrent Q-value networks, representing each agent's individual value function $Q^{i}(\tau^{i}, a^{i})$, and a refined attention-based value-mixing network that models the relation between $Q_{tot}$ and the individual Q-values. The attention-based mixing network takes the individual agents' Q-values and local information as input and mixes them with the global state to produce $Q_{tot}$. Qatten's mixing network directly implements the theorems above.
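The mixing step can be sketched as follows. This is a minimal NumPy illustration under our own simplifying assumptions, not the released implementation: fixed random matrices stand in for the learned per-head query/key networks, a random linear map of the state stands in for the constant head $c(s)$, and a softmax per head produces the non-negative weights $\lambda_{i,h}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, n_heads, state_dim, feat_dim, d_k = 4, 3, 8, 6, 5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def qatten_mix(q_agents, state, agent_feats, W_q, W_k, w_c):
    """Multi-head attention mixing, following the decomposition
    Q_tot = c(s) + sum_{i,h} lambda_{i,h}(s) Q^i(s, a^i):
    each head h turns the global state into a query and each agent's
    features into a key; a softmax over agents yields non-negative
    weights lambda_{., h} for that head."""
    q_tot = float(w_c @ state)                       # c(s) term
    for h in range(n_heads):
        query = W_q[h] @ state                       # (d_k,)
        keys = agent_feats @ W_k[h].T                # (n_agents, d_k)
        lam = softmax(keys @ query / np.sqrt(d_k))   # lambda_{., h} >= 0
        q_tot += float(lam @ q_agents)               # head h's contribution
    return q_tot

state = rng.normal(size=state_dim)
agent_feats = rng.normal(size=(n_agents, feat_dim))
W_q = rng.normal(size=(n_heads, d_k, state_dim))
W_k = rng.normal(size=(n_heads, d_k, feat_dim))
w_c = rng.normal(size=state_dim)
q_agents = rng.normal(size=n_agents)
q_tot = qatten_mix(q_agents, state, agent_feats, W_q, W_k, w_c)

# Since every lambda_{i,h} is a softmax output (hence >= 0), raising any
# single agent's Q-value cannot decrease Q_tot.
q_up = q_agents.copy()
q_up[2] += 1.0
```

The softmax plays the same role as QMIX's absolute-value hypernetwork weights: it keeps every $\lambda_{i,h}$ non-negative, so the monotonicity (and hence the cooperative condition of Theorem 1) holds by construction.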
Demonstration
Like previous works, we test Qatten in the SMAC (Samvelyan et al., 2019) platform. Here are some video demonstrations.
We also give the median win rate (%) table on all maps. Qatten beats other popular MARL methods across almost all scenarios, which validates its effectiveness.
| Scenario | Qatten | QMIX | COMA | VDN | IQL | QTRAN |
| --- | --- | --- | --- | --- | --- | --- |
| 2s_vs_1sc | 100 | 100 | 97 | 100 | 100 | 100 |
| 2s3z | 97 | 97 | 34 | 97 | 75 | 83 |
| 3s5z | 94 | 94 | 0 | 84 | 9 | 13 |
| 1c3s5z | 97 | 94 | 23 | 84 | 11 | 67 |
| 5m_vs_6m | 74 | 63 | 0 | 63 | 49 | 57 |
| 3s_vs_5z | 96 | 85 | 0 | 87 | 43 | 0 |
| bane_vs_bane | 97 | 62 | 40 | 90 | 97 | 100 |
| 2c_vs_64zg | 65 | 45 | 0 | 19 | 2 | 10 |
| MMM2 | 79 | 61 | 0 | 0 | 0 | 0 |
| 3s5z_vs_3s6z | 16 | 1 | 0 | 0 | 0 | 0 |
References

Rashid, T., Samvelyan, M., Witt, C. S. d., Farquhar, G., Foerster, J. N., and Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 4292–4301, 2018.

Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, pp. 5887–5896, 2019.

Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson, S. MAVEN: Multi-Agent Variational Exploration. In Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F. d., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 7611–7622, 2019.