Qatten: A General Framework for Cooperative MARL

2020-02-17

Qatten is a novel Q-value Attention network for the multiagent Q-value decomposition problem. Qatten provides a theoretical linear decomposition formula relating $Q_{tot}$ to the individual $Q^{i}$, which covers previous methods, and achieves state-of-the-art performance on StarCraft II micro-management tasks across different scenarios. Paper

Introduction

In many real-world settings, a team of cooperative agents must learn to coordinate their behavior with only private observations and under communication constraints. Deep multiagent reinforcement learning (Deep-MARL) algorithms have shown strong performance on these realistic and difficult problems but still face challenges. One branch of work is multiagent value decomposition, which decomposes the global shared multiagent Q-value $Q_{tot}$ into individual Q-values $Q^{i}$ to guide each agent's behavior. The few existing works in this line either lack theoretical depth or perform poorly on realistic and complex tasks. To overcome these issues, we propose the Q-value Attention network (Qatten), which consists of a theoretical linear decomposition formula from $Q_{tot}$ to each $Q^{i}$ and a theoretically grounded multi-head attention implementation. Combining this decomposition theory with a careful practical design, Qatten achieves state-of-the-art performance on the challenging and widely adopted StarCraft Multi-Agent Challenge (SMAC) testbed.

Motivation

Several previous methods are related to our work. The Value Decomposition Network (VDN) (Sunehag et al., 2018) learns a centralized but factored $Q_{tot}$, where $Q_{tot}(s, \vec{a})= \sum_{i} Q^{i}(s, a^{i})$. VDN assumes additivity holds when each $Q^{i}$ is evaluated from its local observation $o^{i}$, which is only an approximation and introduces inaccuracy. Besides, VDN severely limits the complexity of the centralized action-value function and ignores any extra state information available during training. Different from VDN, QMIX learns a monotonic multiagent Q-value approximation of $Q_{tot}$ (Rashid et al., 2018). QMIX factors the joint action-value $Q_{tot}$ into a monotonic non-linear combination of the individual Q-values $Q^{i}$, learned via a mixing network. The mixing network, whose non-negative weights are produced by a hypernetwork, is responsible for combining the agents' utilities for the chosen actions into $Q_{tot}(s, \vec{a})$. This non-negativity ensures that $\frac{\partial Q_{tot}}{\partial Q^{i}} \ge 0$, which in turn guarantees the IGM property (Son et al., 2019). However, QMIX adopts an implicit, hard-to-interpret mixing method that lacks theoretical insight. Recently, QTRAN (Son et al., 2019) was proposed to guarantee optimal decentralization by imposing linear constraints between agent utilities and the joint action-value. However, the constraints of the resulting optimization problem are computationally intractable, and the corresponding relaxations make QTRAN perform poorly in complex tasks (Mahajan et al., 2019).
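
To make the contrast concrete, here is a minimal PyTorch sketch of the two mixing strategies discussed above; it is an illustration for this post, not code from any of the cited works, and the module names, layer sizes, and the single mixing layer in the QMIX-style mixer are simplifying assumptions. VDN simply sums the per-agent utilities, while QMIX passes them through a monotonic mixer whose non-negative weights come from a hypernetwork conditioned on the global state.

```python
import torch
import torch.nn as nn

class VDNMixer(nn.Module):
    """VDN-style mixer: Q_tot is the plain sum of the per-agent utilities."""
    def forward(self, agent_qs, state=None):
        # agent_qs: (batch, n_agents)
        return agent_qs.sum(dim=1, keepdim=True)

class QMixStyleMixer(nn.Module):
    """QMIX-style monotonic mixer (simplified to one hidden mixing layer).
    A hypernetwork maps the global state to non-negative mixing weights,
    which guarantees dQ_tot / dQ_i >= 0 for every agent i."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)  # first-layer weights
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)             # first-layer bias
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)             # second-layer weights
        self.v = nn.Sequential(                                     # state-dependent bias V(s)
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)   # (batch, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        return torch.bmm(hidden, w2).squeeze(1) + self.v(state)          # (batch, 1)
```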

Qatten

In this paper, for the first time, we theoretically derive a linear decomposition formula from $Q_{tot}$ to each $Q^{i}$. Based on this theoretical finding, we introduce a multi-head attention mechanism to approximate each term in the decomposition formula, with theoretical explanations. In short, when we examine the global Q-value $Q_{tot}$ near a maximum point in action space, the dependence of $Q_{tot}$ on each individual Q-value $Q^{i}$ is approximately linear. Below we explain this theory.

$Q_{tot}$ can be viewed as a function of the individual Q-values $Q^{i}$:

\[Q_{tot}=Q_{tot}(s, Q^{1}, Q^{2}, ..., Q^{n}).\]

We can prove (see details in our paper) the following:

  • Theorem 1. There exist state-dependent constants $c(s), \lambda_i(s)$ such that, when we neglect the higher-order terms $o(|| \vec{a}- \vec{a}_{o} ||^2)$, the local expansion of $Q_{tot}$ around the maximizing joint action $\vec{a}_{o}$ admits the following form
\[Q_{tot}(s,\vec{a})=c(s) +\sum_i \lambda_i(s) Q^{i}(s,a^i).\]

Moreover, in a cooperative setting, the constants satisfy $\lambda_i(s) \ge 0$.

  • Theorem 2. The functional relation between $Q_{tot}$ and $Q^{i}$ appears to be linear in action space, yet it contains all the non-linear information. We have the following finer structure of $\lambda_{i}$:
\[\lambda_i(s)=\sum_{h} \lambda_{i,h}(s).\]

Then we have

\[Q_{tot}(s,\vec{a})=c(s) +\sum_{i,h} \lambda_{i,h}(s) Q^{i}(s,a^i),\]

where $\lambda_{i,h}$ is a linear functional of all partial derivatives $\frac{\partial^{h}Q_{tot}}{\partial Q^{i_1}\cdots\partial Q^{i_h}}$ of order $h$, and it decays super-exponentially fast in $h$. Based on the above findings, we introduce multi-head attention to realize a deep implementation (Qatten).
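
As a concrete (and hedged) reading of how multi-head attention can realize these weights: if a state embedding $e_s$ serves as the query and per-agent embeddings $e_i$ serve as the keys, a per-head softmax over agents yields non-negative weights by construction. The embeddings $e_s$, $e_i$ and the projection matrices $W_{q,h}$, $W_{k,h}$ below are illustrative notation, not necessarily the exact parameterization used in the paper:

\[\lambda_{i,h}(s) \propto \exp\big(e_i^{\top} W_{k,h}^{\top} W_{q,h}\, e_s\big), \qquad \sum_i \lambda_{i,h}(s)=1, \qquad Q_{tot}(s,\vec{a}) \approx c(s) + \sum_{h}\sum_{i} \lambda_{i,h}(s)\, Q^{i}(s,a^i).\]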

Qatten Framework

The overall architecture consists of each agent's recurrent Q-value network, which represents the individual value function $Q^{i}(\tau^{i}, a^{i})$, and a refined attention-based value-mixing network that models the relation between $Q_{tot}$ and the individual Q-values. The attention-based mixing network takes the individual agents' Q-values and local information as input and mixes them with the global state to produce $Q_{tot}$. Qatten's mixing network thus directly implements the theorems above.
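
Below is a minimal PyTorch sketch of such an attention-based mixing network, reconstructed from the formulas above rather than copied from the released code; the number of heads, the embedding sizes, the per-head softmax, and the small network for the state-dependent constant $c(s)$ are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMixer(nn.Module):
    """Attention-based mixer in the spirit of Qatten: each head attends from the
    global state (query) to per-agent features (keys), producing non-negative
    weights lambda_{i,h}; Q_tot = c(s) + sum_h sum_i lambda_{i,h} * Q_i."""
    def __init__(self, n_agents, state_dim, unit_dim, n_heads=4, embed_dim=32):
        super().__init__()
        self.n_heads = n_heads
        # One query (from the state) and one key (from agent features) projection per head.
        self.query = nn.ModuleList([nn.Linear(state_dim, embed_dim) for _ in range(n_heads)])
        self.key = nn.ModuleList([nn.Linear(unit_dim, embed_dim) for _ in range(n_heads)])
        # State-dependent constant c(s) from Theorem 1.
        self.c = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                               nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state, agent_feats):
        # agent_qs: (batch, n_agents); state: (batch, state_dim);
        # agent_feats: (batch, n_agents, unit_dim), e.g. each agent's own features.
        head_sums = []
        for h in range(self.n_heads):
            q = self.query[h](state).unsqueeze(1)                 # (batch, 1, embed_dim)
            k = self.key[h](agent_feats)                          # (batch, n_agents, embed_dim)
            logits = torch.bmm(q, k.transpose(1, 2)).squeeze(1)   # (batch, n_agents)
            lam = F.softmax(logits, dim=-1)                       # lambda_{i,h} >= 0, sums to 1
            head_sums.append((lam * agent_qs).sum(dim=-1, keepdim=True))  # (batch, 1)
        return torch.stack(head_sums, dim=-1).sum(dim=-1) + self.c(state)  # (batch, 1)
```

The softmax keeps each head's weights non-negative and normalized over agents, matching the requirement $\lambda_i(s) \ge 0$ from Theorem 1, while the sum over heads mirrors the finer structure $\lambda_i(s)=\sum_{h} \lambda_{i,h}(s)$ from Theorem 2.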

Demonstration

Following previous works, we evaluate Qatten on the SMAC (Samvelyan et al., 2019) platform. Here are some video demonstrations.


We also report the median win rate (%) on all maps. Qatten beats other popular MARL methods on almost all scenarios, which validates its effectiveness.

Scenario        Qatten  QMIX  COMA  VDN  IQL  QTRAN
2s_vs_1sc          100   100    97  100  100    100
2s3z                97    97    34   97   75     83
3s5z                94    94     0   84    9     13
1c3s5z              97    94    23   84   11     67
5m_vs_6m            74    63     0   63   49     57
3s_vs_5z            96    85     0   87   43      0
bane_vs_bane        97    62    40   90   97    100
2c_vs_64zg          65    45     0   19    2     10
MMM2                79    61     0    0    0      0
3s5z_vs_3s6z        16     1     0    0    0      0

References

  • Sunehag, P., Lever, G., Gruslys, A., et al. Value-Decomposition Networks for Cooperative Multi-Agent Learning Based on Team Reward. AAMAS, 2018.
  • Rashid, T., Samvelyan, M., Schroeder de Witt, C., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ICML, 2018.
  • Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning. ICML, 2019.
  • Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson, S. MAVEN: Multi-Agent Variational Exploration. NeurIPS, 2019.
  • Samvelyan, M., Rashid, T., Schroeder de Witt, C., et al. The StarCraft Multi-Agent Challenge. AAMAS, 2019.