Learning to model other minds

LOLA, a collaboration by researchers at OpenAI and the University of Oxford, lets a reinforcement learning (RL) agent take account of the learning of others when updating its own strategy. Each LOLA agent adjusts its policy in order to shape the learning of the other agents in a way that is advantageous. This is possible since the learning of the other agents depends on the rewards and observations occurring in the environment, which in turn can be influenced by the agent.

This means that the LOLA agent, “Alice,” models how the parameter updates of the other agent, “Bob,” depend on its own policy and how Bob’s parameter update impacts its own future expected reward. Alice then updates its own policy in order to make the learning step of the other agents, like Bob, more beneficial to its own goals.

LOLA agents can discover effective, reciprocative strategies, in games like the iterated _prisoner’s dilemma_⁠(opens in a new window), or the coin game⁠(opens in a new window). In contrast, state-of-the-art deep reinforcement learning methods, like Independent PPO, fail to learn such strategies in these domains. These agents typically learn to take selfish actions that ignore the objectives of other agents. LOLA solves this by letting agents act out of a self-interest that incorporates the goals of others. It also works without requiring hand-crafted rules, or environments set up to encourage cooperation.

The inspiration for LOLA comes from how people collaborate with one another: Humans are great at reasoning about how their actions can affect the future behavior of other humans, and frequently invent ways to collaborate with others that leads to a win–win. One of the reasons humans are good at collaborating with each other is that they have a sense of a “theory of mind” about other humans, letting them come up with strategies that lead to benefits for their collaborators. So far, this sort of “theory of mind” representation has been absent from deep multi-agent reinforcement learning. To a state of the art deep-RL agent there is no inherent difference between another learning agent and a part of the environment, say a tree.

The key to LOLA’S performance is the inclusion of term:

(∂V1(θi1,θi2)∂θi2)T∂2V2(θi1,θi2)∂θi1∂θi2⋅δη,\left( \frac{\partial V^1 (\theta^1_i,\theta^2_i) }{\partial \theta^2_i} \right)^T \frac{\partial^2 V^2 (\theta^1_i,\theta^2_i)}{\partial \theta^1_i \partial \theta^2_i} \cdot \delta \eta,

LOLA lets us train agents that succeed at the coin game⁠(opens in a new window), in which two agents, red and blue, compete with one another to pick up red and blue colored coins. Each agent gets a point for picking up any coin, but if they pick up a coin which isn’t their color then the other agent will receive a –2 penalty. Thus, if both agents greedily pick up both coins, everyone gets zero points on average. LOLA agents learn to predominantly pick up coins of their own color, leading to high scores (shown above).