Qatten is a novel Q-value Attention network for the multiagent Q-value decomposition problem. Qatten provides a theoretical linear decomposition formula relating $Q_{tot}$ to the individual $Q^{i}$, which covers previous methods, and achieves state-of-the-art performance on StarCraft II micro-management tasks across different scenarios. Paper
Introduction
In many real-world settings, a team of cooperative agents must learn to coordinate their behavior under private observations and communication constraints. Deep multiagent reinforcement learning (Deep-MARL) algorithms have shown superior performance on these realistic and difficult problems, but still face challenges. One branch is multiagent value decomposition, which decomposes the global shared multiagent Q-value $Q_{tot}$ into individual Q-values $Q^{i}$ to guide each agent's behavior. The few existing works along this line either lack theoretical depth or perform poorly on realistic and complex tasks. To overcome these issues, we propose the Q-value Attention network (Qatten), which consists of a theoretically derived linear decomposition from $Q_{tot}$ to each $Q^{i}$ and a theoretically grounded multi-head attention implementation. Combining the decomposition theory with this practical design, Qatten achieves state-of-the-art performance on the challenging and widely adopted StarCraft Multi-Agent Challenge (SMAC) testbed.
Motivation
Related Works
Several previous methods are related to our work. The Value Decomposition Network (VDN) (Sunehag et al., 2018) learns a centralized but factored $Q_{tot}$, where $Q_{tot}(s, \vec{a})= \sum_{i} Q^{i}(s, a^{i})$. VDN assumes this additivity holds when each $Q^{i}$ is evaluated from the local observation $o^{i}$, which is only an approximation and introduces inaccuracy. Besides, VDN severely limits the complexity of the centralized action-value function and ignores any extra state information available during training. Different from VDN, QMIX learns a monotonic multiagent Q-value approximation $Q_{tot}$ (Rashid et al., 2018). QMIX factors the joint action-value $Q_{tot}$ into a monotonic non-linear combination of the individual Q-values $Q^{i}$, learned via a mixing network. The mixing network, whose non-negative weights are produced by a hypernetwork, is responsible for combining the agents' utilities for the chosen actions into $Q_{tot}(s, \vec{a})$. This non-negativity ensures that $\frac{\partial Q_{tot}}{\partial Q^{i}} \ge 0$, which in turn guarantees the IGM property (Son et al., 2019). However, QMIX adopts an implicit, hard-to-interpret mixing method that lacks theoretical insight. Recently, QTRAN (Son et al., 2019) was proposed to guarantee optimal decentralization by imposing linear constraints between agent utilities and the joint action value. However, the constraints in the resulting optimization problem are computationally intractable, and the corresponding relaxations make QTRAN perform poorly in complex tasks (Mahajan et al., 2019).
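To make the contrast concrete, here is a minimal PyTorch sketch of the two mixing schemes (illustrative only, not the original implementations; class names and dimensions are assumptions): VDN simply sums the chosen per-agent Q-values, while a QMIX-style mixer passes them through a network whose weights are produced by hypernetworks conditioned on the global state and forced non-negative, so that $\frac{\partial Q_{tot}}{\partial Q^{i}} \ge 0$.

```python
import torch
import torch.nn as nn


def vdn_mix(agent_qs):
    # VDN: Q_tot is simply the sum of the chosen per-agent Q-values.
    # agent_qs: (batch, n_agents)
    return agent_qs.sum(dim=1, keepdim=True)


class QMixStyleMixer(nn.Module):
    """QMIX-style monotonic mixer (sketch): mixing weights are generated by
    hypernetworks conditioned on the global state and made non-negative,
    so that dQ_tot/dQ^i >= 0 for every agent i."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        # Non-negative first-layer weights from the hypernetwork.
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        # Non-negative second-layer weights and a state-dependent bias.
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2  # (batch, 1, 1)
        return q_tot.view(bs, 1)
```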
Qatten
In this paper, for the first time, we theoretically derive a linear decomposition formula from $Q_{tot}$ to each $Q^{i}$. Based on this theoretical finding, we introduce the multi-head attention mechanism to approximate each term in the decomposition formula, with theoretical justification. In short, when we investigate the global Q-value $Q_{tot}$ near the maximum point in action space, its dependence on the individual Q-values $Q^{i}$ is approximately linear. Below we explain this theory.
$Q_{tot}$ can be viewed as a function of the individual Q-values $Q^{i}$:
\[Q_{tot}=Q_{tot}(s, Q^{1}, Q^{2}, ..., Q^{n}).\]
We prove (see the paper for details) the following.
- Theorem 1. There exist constants $c(s)$ and $\lambda_i(s)$ (depending on the state $s$) such that, when we neglect the higher-order terms $o(\| \vec{a}- \vec{a}_{o} \|^2)$, the local expansion of $Q_{tot}$ admits the form
\[Q_{tot}(s,\vec{a}) \approx c(s)+\sum_{i} \lambda_{i}(s)\, Q^{i}(s,a^{i}).\]
Moreover, in a cooperative setting the constants satisfy $\lambda_i(s) \ge 0$. (A first-order expansion sketch that motivates this form is given after the theorems.)
- Theorem 2. The functional relation between $Q_{tot}$ and the $Q^{i}$ appears to be linear in action space, yet still contains all the non-linear information. We have the following finer structure of $\lambda_{i}$, namely $\lambda_{i}(s)=\sum_{h}\lambda_{i,h}(s)$, so that
\[Q_{tot}(s,\vec{a})=c(s) +\sum_{i,h} \lambda_{i,h}(s)\, Q^{i}(s,a^i),\]
where $\lambda_{i,h}$ is a linear functional of the order-$h$ partial derivatives $\frac{\partial^{h}Q_{tot}}{\partial Q^{i_1}\cdots\partial Q^{i_h}}$ and decays super-exponentially fast in $h$.
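For intuition only (a sketch, not the full proof; the paper carefully controls the higher-order terms), a first-order expansion of $Q_{tot}$ in the individual Q-values around the expansion point $\vec{a}_{o}$ already yields the linear form of Theorem 1:
\[Q_{tot}(s,\vec{a}) \approx Q_{tot}\big(s, Q^{1}_{o},\dots,Q^{n}_{o}\big)+\sum_{i}\frac{\partial Q_{tot}}{\partial Q^{i}}\Big|_{o}\big(Q^{i}(s,a^{i})-Q^{i}_{o}\big)=c(s)+\sum_{i}\lambda_{i}(s)\,Q^{i}(s,a^{i}),\]
where $Q^{i}_{o}=Q^{i}(s,a^{i}_{o})$, $\lambda_{i}(s)=\frac{\partial Q_{tot}}{\partial Q^{i}}\big|_{o}$, and $c(s)$ collects the action-independent terms; keeping the higher-order terms of the expansion is what produces the refined coefficients $\lambda_{i,h}$ of Theorem 2.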
Based on the above findings, we introduce multi-head attention to realize a deep implementation of this decomposition (Qatten). The overall architecture consists of the agents' recurrent Q-value networks, each representing an individual value function $Q^{i}(\tau^{i}, a^{i})$, and a refined attention-based value-mixing network that models the relation between $Q_{tot}$ and the individual Q-values. The mixing network takes the individual agents' Q-values and local information as input and mixes them, conditioned on the global state, to produce $Q_{tot}$. Qatten's mixing network directly implements the theorems above.
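To make this concrete, here is a minimal PyTorch sketch of how such an attention-based mixing head can realize $Q_{tot}=c(s)+\sum_{i,h}\lambda_{i,h}(s)\,Q^{i}$ (illustrative only, not the authors' exact network; class names, the per-agent feature dimension `unit_dim`, and hyperparameters such as `n_heads` are assumptions): each head $h$ attends from a query derived from the global state to keys derived from per-agent features, yielding weights $\lambda_{i,h}$, and $Q_{tot}$ is the $\lambda$-weighted sum of the individual Q-values plus a state-dependent constant $c(s)$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionMixer(nn.Module):
    """Sketch of an attention-based mixing network realizing
    Q_tot = c(s) + sum_h sum_i lambda_{i,h}(s) * Q^i(s, a^i)."""

    def __init__(self, n_agents, state_dim, unit_dim, n_heads=4, embed_dim=32):
        super().__init__()
        self.n_heads = n_heads
        # One query per head, derived from the global state s.
        self.query = nn.ModuleList(
            [nn.Linear(state_dim, embed_dim) for _ in range(n_heads)])
        # Keys derived from each agent's local features.
        self.key = nn.ModuleList(
            [nn.Linear(unit_dim, embed_dim, bias=False) for _ in range(n_heads)])
        # State-dependent constant c(s).
        self.constant = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state, agent_feats):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        # agent_feats: (batch, n_agents, unit_dim)
        head_sums = []
        for h in range(self.n_heads):
            q = self.query[h](state).unsqueeze(1)          # (batch, 1, embed)
            k = self.key[h](agent_feats)                   # (batch, n_agents, embed)
            logits = torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5
            lam = F.softmax(logits, dim=-1).squeeze(1)     # lambda_{i,h} >= 0
            head_sums.append((lam * agent_qs).sum(dim=1, keepdim=True))
        # Q_tot = c(s) + sum over heads of the lambda-weighted agent Q-values.
        return self.constant(state) + torch.stack(head_sums, dim=0).sum(dim=0)
```

The softmax keeps every $\lambda_{i,h}$ non-negative, consistent with the cooperative-setting condition $\lambda_{i}(s)\ge 0$ in Theorem 1.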
Demonstration
Following previous works, we test Qatten on the SMAC (Samvelyan et al., 2019) platform. Here are some video demonstrations.
We also report the median test win rates (%) on all maps. Qatten beats the other popular MARL methods on almost all scenarios, which validates its effectiveness.
Scenario | Qatten | QMIX | COMA | VDN | IQL | QTRAN |
---|---|---|---|---|---|---|
2s_vs_1sc | 100 | 100 | 97 | 100 | 100 | 100 |
2s3z | 97 | 97 | 34 | 97 | 75 | 83 |
3s5z | 94 | 94 | 0 | 84 | 9 | 13 |
1c3s5z | 97 | 94 | 23 | 84 | 11 | 67 |
5m_vs_6m | 74 | 63 | 0 | 63 | 49 | 57 |
3s_vs_5z | 96 | 85 | 0 | 87 | 43 | 0 |
bane_vs_bane | 97 | 62 | 40 | 90 | 97 | 100 |
2c_vs_64zg | 65 | 45 | 0 | 19 | 2 | 10 |
MMM2 | 79 | 61 | 0 | 0 | 0 | 0 |
3s5z_vs_3s6z | 16 | 1 | 0 | 0 | 0 | 0 |
References
- Rashid, T., Samvelyan, M., Witt, C. S. d., Farquhar, G., Foerster, J. N., and Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 4292–4301, 2018.
- Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 5887–5896, 2019.
- Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson, S. MAVEN: Multi-Agent Variational Exploration. In Advances in Neural Information Processing Systems 32, pp. 7611–7622, 2019.