Qatten: A General Framework for Cooperative MARL

2020-02-17

Qatten is a novel Q-value Attention network for the multiagent Q-value decomposition problem. Qatten provides a theoretical linear decomposition formula relating $Q_{tot}$ to the individual $Q^{i}$, which covers previous methods, and achieves state-of-the-art performance on StarCraft II micro-management tasks across different scenarios. Paper

Introduction

In many real-world settings, a team of cooperative agents must learn to coordinate their behavior with only private observations and under communication constraints. Deep multiagent reinforcement learning (Deep-MARL) algorithms have shown strong performance on these realistic and difficult problems but still face challenges. One branch of work is multiagent value decomposition, which decomposes the global shared multiagent Q-value $Q_{tot}$ into individual Q-values $Q^{i}$ to guide each agent's behavior. The few existing works in this line either lack theoretical depth or perform poorly on realistic and complex tasks. To overcome these issues, we propose the Q-value Attention network (Qatten), which consists of a theoretical linear decomposition formula from $Q_{tot}$ to each $Q^{i}$ and a theoretically grounded multi-head attention implementation. Combining this decomposition theory with a careful practical design, Qatten achieves state-of-the-art performance on the challenging and widely adopted StarCraft Multi-Agent Challenge (SMAC) testbed.

Motivation

Several previous methods are related to our work. The Value Decomposition Network (VDN) (Sunehag et al., 2018) learns a centralized but factored $Q_{tot}$, where $Q_{tot}(s, \vec{a})= \sum_{i} Q^{i}(s, a^{i})$. VDN assumes additivity holds when each $Q^{i}$ is evaluated from its local observation $o^{i}$, which is only an approximation and introduces inaccuracy. Besides, VDN severely limits the complexity of the centralized action-value function and ignores any extra state information available during training. Different from VDN, QMIX learns a monotonic multiagent Q-value approximation of $Q_{tot}$ (Rashid et al., 2018). QMIX factors the joint action-value $Q_{tot}$ into a monotonic non-linear combination of the individual Q-values $Q^{i}$, learned via a mixing network. The mixing network, whose non-negative weights are produced by a hypernetwork, is responsible for combining the agents' utilities for the chosen actions into $Q_{tot}(s, \vec{a})$. This non-negativity ensures that $\frac{\partial Q_{tot}}{\partial Q^{i}} \ge 0$, which in turn guarantees the IGM property (Son et al., 2019). However, QMIX adopts an implicit, hard-to-interpret mixing method that lacks theoretical insight. Recently, QTRAN (Son et al., 2019) was proposed to guarantee optimal decentralization by imposing linear constraints between agent utilities and the joint action-value. However, the constraints of the resulting optimization problem are computationally intractable, and the corresponding relaxations make QTRAN perform poorly in complex tasks (Mahajan et al., 2019).
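
To make the contrast concrete, here is a minimal PyTorch sketch of the two mixing strategies discussed above; it is an illustration for this post, not code from any of the cited works, and the module names, layer sizes, and the single mixing layer in the QMIX-style mixer are simplifying assumptions. VDN simply sums the per-agent utilities, while QMIX passes them through a monotonic mixer whose non-negative weights come from a hypernetwork conditioned on the global state.

```python
import torch
import torch.nn as nn

class VDNMixer(nn.Module):
    """VDN-style mixer: Q_tot is the plain sum of the per-agent utilities."""
    def forward(self, agent_qs, state=None):
        # agent_qs: (batch, n_agents)
        return agent_qs.sum(dim=1, keepdim=True)

class QMixStyleMixer(nn.Module):
    """QMIX-style monotonic mixer (simplified to one hidden mixing layer).
    A hypernetwork maps the global state to non-negative mixing weights,
    which guarantees dQ_tot / dQ_i >= 0 for every agent i."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)  # first-layer weights
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)             # first-layer bias
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)             # second-layer weights
        self.v = nn.Sequential(                                     # state-dependent bias V(s)
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)   # (batch, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        return torch.bmm(hidden, w2).squeeze(1) + self.v(state)          # (batch, 1)
```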

Qatten

In this paper, for the first time, we theoretically derive a linear decomposition formula from $Q_{tot}$ to each $Q^{i}$. Based on this theoretical finding, we introduce a multi-head attention mechanism to approximate each term in the decomposition formula, with theoretical explanations. In short, when we examine the global Q-value $Q_{tot}$ near a maximum point in action space, the dependence of $Q_{tot}$ on each individual Q-value $Q^{i}$ is approximately linear. Below we explain this theory.

$Q_{tot}$ can be viewed as a function of the individual Q-values $Q^{i}$:

\[Q_{tot}=Q_{tot}(s, Q^{1}, Q^{2}, ..., Q^{n}).\]

We can prove (see details in our paper) the following:

  • Theorem 1. There exist state-dependent constants $c(s), \lambda_i(s)$ such that, when we neglect the higher-order terms $o(|| \vec{a}- \vec{a}_{o} ||^2)$, the local expansion of $Q_{tot}$ around the maximizing joint action $\vec{a}_{o}$ admits the following form
\[Q_{tot}(s,\vec{a})=c(s) +\sum_i \lambda_i(s) Q^{i}(s,a^i).\]

Moreover, in a cooperative setting, the constants satisfy $\lambda_i(s) \ge 0$.

  • Theorem 2. The functional relation between $Q_{tot}$ and $Q^{i}$ appears to be linear in action space, yet it contains all the non-linear information. We have the following finer structure of $\lambda_{i}$:
\[\lambda_i(s)=\sum_{h} \lambda_{i,h}(s).\]

Then we have

\[Q_{tot}(s,\vec{a})=c(s) +\sum_{i,h} \lambda_{i,h}(s) Q^{i}(s,a^i),\]

where $\lambda_{i,h}$ is a linear functional of all partial derivatives $\frac{\partial^{h}Q_{tot}}{\partial Q^{i_1}\cdots\partial Q^{i_h}}$ of order $h$, and it decays super-exponentially fast in $h$. Based on the above findings, we introduce multi-head attention to realize a deep implementation (Qatten).
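
As a concrete (and hedged) reading of how multi-head attention can realize these weights: if a state embedding $e_s$ serves as the query and per-agent embeddings $e_i$ serve as the keys, a per-head softmax over agents yields non-negative weights by construction. The embeddings $e_s$, $e_i$ and the projection matrices $W_{q,h}$, $W_{k,h}$ below are illustrative notation, not necessarily the exact parameterization used in the paper:

\[\lambda_{i,h}(s) \propto \exp\big(e_i^{\top} W_{k,h}^{\top} W_{q,h}\, e_s\big), \qquad \sum_i \lambda_{i,h}(s)=1, \qquad Q_{tot}(s,\vec{a}) \approx c(s) + \sum_{h}\sum_{i} \lambda_{i,h}(s)\, Q^{i}(s,a^i).\]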

Qatten Framework

The overall architecture consists of each agent's recurrent Q-value network, which represents the individual value function $Q^{i}(\tau^{i}, a^{i})$, and a refined attention-based value-mixing network that models the relation between $Q_{tot}$ and the individual Q-values. The attention-based mixing network takes the individual agents' Q-values and local information as input and mixes them with the global state to produce $Q_{tot}$. Qatten's mixing network thus directly implements the theorems above.
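
Below is a minimal PyTorch sketch of such an attention-based mixing network, reconstructed from the formulas above rather than copied from the released code; the number of heads, the embedding sizes, the per-head softmax, and the small network for the state-dependent constant $c(s)$ are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMixer(nn.Module):
    """Attention-based mixer in the spirit of Qatten: each head attends from the
    global state (query) to per-agent features (keys), producing non-negative
    weights lambda_{i,h}; Q_tot = c(s) + sum_h sum_i lambda_{i,h} * Q_i."""
    def __init__(self, n_agents, state_dim, unit_dim, n_heads=4, embed_dim=32):
        super().__init__()
        self.n_heads = n_heads
        # One query (from the state) and one key (from agent features) projection per head.
        self.query = nn.ModuleList([nn.Linear(state_dim, embed_dim) for _ in range(n_heads)])
        self.key = nn.ModuleList([nn.Linear(unit_dim, embed_dim) for _ in range(n_heads)])
        # State-dependent constant c(s) from Theorem 1.
        self.c = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                               nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state, agent_feats):
        # agent_qs: (batch, n_agents); state: (batch, state_dim);
        # agent_feats: (batch, n_agents, unit_dim), e.g. each agent's own features.
        head_sums = []
        for h in range(self.n_heads):
            q = self.query[h](state).unsqueeze(1)                 # (batch, 1, embed_dim)
            k = self.key[h](agent_feats)                          # (batch, n_agents, embed_dim)
            logits = torch.bmm(q, k.transpose(1, 2)).squeeze(1)   # (batch, n_agents)
            lam = F.softmax(logits, dim=-1)                       # lambda_{i,h} >= 0, sums to 1
            head_sums.append((lam * agent_qs).sum(dim=-1, keepdim=True))  # (batch, 1)
        return torch.stack(head_sums, dim=-1).sum(dim=-1) + self.c(state)  # (batch, 1)
```

The softmax keeps each head's weights non-negative and normalized over agents, matching the requirement $\lambda_i(s) \ge 0$ from Theorem 1, while the sum over heads mirrors the finer structure $\lambda_i(s)=\sum_{h} \lambda_{i,h}(s)$ from Theorem 2.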

Demonstration

Following previous works, we evaluate Qatten on the SMAC (Samvelyan et al., 2019) platform. Here are some video demonstrations.


We also report the median win rate (%) on all maps. Qatten beats other popular MARL methods on almost all scenarios, which validates its effectiveness.

Scenario        Qatten  QMIX  COMA  VDN  IQL  QTRAN
2s_vs_1sc          100   100    97  100  100    100
2s3z                97    97    34   97   75     83
3s5z                94    94     0   84    9     13
1c3s5z              97    94    23   84   11     67
5m_vs_6m            74    63     0   63   49     57
3s_vs_5z            96    85     0   87   43      0
bane_vs_bane        97    62    40   90   97    100
2c_vs_64zg          65    45     0   19    2     10
MMM2                79    61     0    0    0      0
3s5z_vs_3s6z        16     1     0    0    0      0

References

  • Sunehag, P., Lever, G., Gruslys, A., et al. Value-Decomposition Networks for Cooperative Multi-Agent Learning Based on Team Reward. AAMAS, 2018.
  • Rashid, T., Samvelyan, M., Schroeder de Witt, C., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ICML, 2018.
  • Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning. ICML, 2019.
  • Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson, S. MAVEN: Multi-Agent Variational Exploration. NeurIPS, 2019.
  • Samvelyan, M., Rashid, T., Schroeder de Witt, C., et al. The StarCraft Multi-Agent Challenge. AAMAS, 2019.