OpenAI Five

Given a learning algorithm capable of handling long horizons, we still need to explore the environment. Even with our restrictions, there are hundreds of items, dozens of buildings, spells, and unit types, and a long tail of game mechanics to learn about, many of which yield powerful combinations. It’s not easy to explore this combinatorially vast space efficiently.

OpenAI Five learns from self-play (starting from random weights), which provides a natural curriculum for exploring the environment. To avoid “strategy collapse”, the agent trains 80% of its games against itself and the other 20% against its past selves. In the first games, the heroes walk aimlessly around the map. After several hours of training, concepts such as laning, farming, or fighting over mid emerge. After several days, they consistently adopt basic human strategies: attempt to steal Bounty runes from their opponents, walk to their tier one towers to farm, and rotate heroes around the map to gain lane advantage. And with further training, they become proficient at high-level strategies like 5-hero push.
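The 80/20 opponent split can be sketched as follows. This is a minimal illustration, not the actual training code: the function name `pick_opponent`, the snapshot list, and the uniform choice among past selves are all assumptions, since the text only gives the two percentages.

```python
import random


def pick_opponent(current_params, past_params, rng, p_current=0.8):
    """Choose whose parameters the opponent uses for one training game.

    Hypothetical helper: `past_params` is an assumed list of earlier snapshots.
    """
    # Play the latest self with probability p_current, or always when no
    # past snapshot exists yet (as at the very start of training).
    if not past_params or rng.random() < p_current:
        return current_params
    # Otherwise play a past self. Uniform sampling is an assumption; the
    # text does not say how past selves are weighted.
    return rng.choice(past_params)


rng = random.Random(0)
picks = [pick_opponent("current", ["v1", "v2", "v3"], rng) for _ in range(10_000)]
share_current = picks.count("current") / len(picks)
print(f"share of games against the current self: {share_current:.3f}")
```

Keeping some games against older snapshots means a strategy that only beats the newest version of the agent is not enough to score well.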

In March 2017, our first agent defeated bots but got confused against humans. To force exploration in strategy space, during training (and only during training) we randomized the properties (health, speed, start level, etc.) of the units, and it began beating humans. Later on, when a test player was consistently beating our 1v1 bot, we increased our training randomizations and the test player started to lose. (Our robotics team concurrently applied similar randomization techniques to physical robots to transfer from simulation to the real world.)
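As a rough illustration of training-only randomization, the sketch below perturbs a unit's health, speed, and start level at the start of a training game and leaves them untouched otherwise. The `UnitStats` type, the 20% scale, and the level bonus range are hypothetical; the text names the randomized properties but not their distributions.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class UnitStats:
    """Hypothetical container for the properties named in the text."""
    health: float
    speed: float
    start_level: int


def randomize_unit(base, rng, scale=0.2, max_level_bonus=2, training=True):
    """Return a perturbed copy of `base` during training, `base` itself otherwise."""
    if not training:
        # Randomization applies during training only.
        return base

    def jitter(value):
        # Multiplicative noise in [1 - scale, 1 + scale]; the range is an assumption.
        return value * rng.uniform(1.0 - scale, 1.0 + scale)

    return replace(
        base,
        health=jitter(base.health),
        speed=jitter(base.speed),
        start_level=base.start_level + rng.randint(0, max_level_bonus),
    )


base = UnitStats(health=600.0, speed=300.0, start_level=1)
print(randomize_unit(base, random.Random(0)))
print(randomize_unit(base, random.Random(0), training=False))
```

Widening `scale` corresponds to the step described above, where the training randomizations were increased once a test player found a consistent way to win.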

OpenAI Five uses the randomizations we wrote for our 1v1 bot. It also uses a new “lane assignment” one. At the beginning of each training game, we randomly “assign” each hero to some subset of lanes and penalize it for straying from those lanes until a randomly-chosen time in the game.
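One way to picture the lane assignment randomization is sketched below. Everything specific here is an assumption made for illustration: the lane names, a single release time shared by all heroes, the 600-second upper bound, and the penalty size are not given in the text.

```python
import random

LANES = ("top", "mid", "bottom")


def sample_lane_assignment(hero_ids, rng, max_release_time=600.0):
    """Assign each hero a random non-empty subset of lanes and draw a release time."""
    assignment = {}
    for hero in hero_ids:
        subset_size = rng.randint(1, len(LANES))
        assignment[hero] = frozenset(rng.sample(LANES, subset_size))
    # Game time (in seconds) after which the penalty no longer applies.
    release_time = rng.uniform(0.0, max_release_time)
    return assignment, release_time


def lane_penalty(hero, current_lane, game_time, assignment, release_time, penalty=0.02):
    """Return a negative reward term while the hero strays before the release time."""
    if game_time >= release_time:
        return 0.0
    return 0.0 if current_lane in assignment[hero] else -penalty


assignment, release_time = sample_lane_assignment(range(5), random.Random(0))
print(assignment, release_time)
print(lane_penalty(0, "mid", 30.0, {0: frozenset({"top"})}, 120.0))
```

Because the assignment and the release time are resampled every game, the agents see many different lane layouts during training rather than settling on one.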

Exploration is also helped by a good reward. Our reward consists mostly of metrics humans track to decide how they’re doing in the game: net worth, kills, deaths, assists, last hits, and the like. We postprocess each agent’s reward by subtracting the other team’s average reward to prevent the agents from finding positive-sum situations.
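The postprocessing step can be written in a few lines. The sketch below assumes each team's raw per-hero rewards arrive as a plain list; the function name and the example numbers are made up for illustration.

```python
def postprocess_rewards(team_a, team_b):
    """Subtract the opposing team's average raw reward from each agent's reward."""
    mean_a = sum(team_a) / len(team_a)
    mean_b = sum(team_b) / len(team_b)
    adjusted_a = [r - mean_b for r in team_a]
    adjusted_b = [r - mean_a for r in team_b]
    return adjusted_a, adjusted_b


# Made-up raw rewards for three heroes per side.
adjusted_a, adjusted_b = postprocess_rewards([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
print(adjusted_a)  # [0.0, 1.0, 2.0]
print(adjusted_b)  # [-2.0, -1.0, 0.0]
```

After the subtraction, team A's average adjusted reward is mean_a - mean_b and team B's is mean_b - mean_a, so the two averages always sum to zero. A situation that raises both teams' raw rewards by the same amount therefore gives neither side an advantage.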

We hardcode item and skill builds (originally written for our scripted baseline), and choose which of the builds to use at random. Courier management is also imported from the scripted baseline.

Authors

Greg Brockman, Christy Dennison, Susan Zhang, Jakub Pachocki, Michael Petrov, Henrique Pondé, Przemysław Dębiak, David Farhi, Filip Wolski, Jonathan Raiman, Jie Tang, Szymon Sidor, Brooke Chan

Contributors

Quirin Fischer, Christopher Hesse, Shariq Hashme, Ilya Sutskever, Alec Radford, Scott Gray, Jack Clark, Paul Christiano, David Luan, Christopher Berner, Eric Sigler, Jonas Schneider, Larissa Schiavo, Diane Yoon, John Schulman

Current set of restrictions

Mirror match of Necrophos, Sniper, Viper, Crystal Maiden, and Lich
No warding
No Roshan
No invisibility (consumables and relevant items)
No summons/illusions
No Divine Rapier, Bottle, Quelling Blade, Boots of Travel, Tome of Knowledge, Infused Raindrop
5 invulnerable couriers, no exploiting them by scouting or tanking
No Scan

The hero set restriction makes the game very different from how Dota is played at world-elite level (i.e. Captains Mode drafting from all 100+ heroes). However, the difference from regular “public” games (All Pick / Random Draft) is smaller.

Most of the restrictions come from remaining aspects of the game we haven’t integrated yet. Some of the restricted features, in particular wards and Roshan, are central components of professional-level play. We’re working to add these as soon as possible.

Draft feedback

Thanks to the following for feedback on drafts of this post: Alexander Lavin, Andrew Gibiansky, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, David Dohan, David Ha, Denny Britz, Erich Elsen, James Bradbury, John Miller, Luke Metz, Maddie Hall, Miles Brundage, Nelson Elhage, Ofir Nachum, Pieter Abbeel, Rumen Hristov, Shubho Sengupta, Solomon Boulos, Stephen Merity, Tom Brown, Zak Stone