Learning to cooperate, compete, and communicate
Multiagent environments where agents compete for resources are stepping stones on the path to AGI. Multiagent environments have two useful properties: first, there is a natural curriculum—the difficul
Multiagent environments where agents compete for resources are stepping stones on the path to AGI. Multiagent environments have two useful properties: first, there is a natural curriculum—the difficulty of the environment is determined by the skill of your competitors (and if you’re competing against clones of yourself, the environment exactly matches your skill level). Second, a multiagent environment has no stable equilibrium: no matter how smart an agent is, there’s always pressure to get smarter. These environments have a very different feel from traditional environments, and it’ll take a lot more research before we become good at them.
Traditional decentralized RL approaches—DDPG, actor-critic learning, deep Q-learning, and so on—struggle to learn in multiagent environments, as at every time step each agent will be trying to learn to predict the actions of other agents while also taking its own actions. This is especially true in competitive situations. MADDPG employs a centralized critic to supply agents with information about their peers’ observations and potential actions, transforming an unpredictable environment into a predictable one.
Using policy gradient methods presents further challenges: because these exhibit high variance learning the right policy is difficult to do when the reward is inconsistent. We also found that adding in a critic, while improving stability, still failed to solve several of our environments such as cooperative communication. It seems that considering the actions of others during training is important for learning collaborative strategies.
Before we developed MADDPG, when using decentralized techniques, we noticed that listener agents would often learn to ignore the speaker if it sent inconsistent messages about where to go to. The agent would then set all the weights associated with the speaker’s message to 0, effectively deafening itself. Once this happens, it’s hard for training to recover, since the speaker will never know if it says the right thing due to the absence of any feedback. To fix this, we looked at a technique outlined in a recent hierarchical reinforcement project(opens in a new window), which lets us force the listener to incorporate the utterances of the speaker in its decision-making process. This fix didn’t work, because though it forces the listener to pay attention to the speaker, it doesn’t help the speaker figure out what to say that is relevant. Our centralized critic method helps deal with these challenges, by helping the speaker to learn which utterances might be relevant to the actions of other agents. For more of our results, you can watch the following video: