Quantifying generalization in reinforcement learning

We’re releasing CoinRun, a training environment which provides a metric for an agent’s ability to transfer its experience to novel situations and has already helped clarify a longstanding puzzle in re

We trained 9 agents to play CoinRun, each with a different number of available training levels. The first 8 agents trained on sets ranging from 100 to 16,000 levels. We trained the final agent on an unrestricted set of levels, so this agent never sees the same level twice. Our agents’ policies use a common 3-layer convolutional architecture, which we call Nature-CNN. We trained the agents with Proximal Policy Optimization (PPO) for a total of 256M timesteps. Since an episode lasts 100 timesteps on average, agents with fixed training sets will see each training level hundreds to tens of thousands of times. The final agent, trained with the unrestricted set, will see roughly 2 million distinct levels, each of them exactly once.
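The exposure figures in this setup follow from simple division. The sketch below is a rough estimate only: it assumes levels are sampled uniformly and that every episode lasts exactly the quoted average of 100 timesteps. The function names are illustrative and are not part of the CoinRun code.

```python
# Back-of-the-envelope estimate of how often each training level is replayed.
# The two constants come from the text; everything else is derived from them.
TOTAL_TIMESTEPS = 256_000_000   # training budget quoted in the text
AVG_EPISODE_LENGTH = 100        # average timesteps per episode quoted in the text


def total_episodes(timesteps: int = TOTAL_TIMESTEPS,
                   episode_length: int = AVG_EPISODE_LENGTH) -> int:
    """Approximate number of episodes an agent plays during training."""
    return timesteps // episode_length


def visits_per_level(num_levels: int) -> float:
    """Average visits to each level, assuming levels are sampled uniformly."""
    return total_episodes() / num_levels


if __name__ == "__main__":
    print(f"episodes per agent: {total_episodes():,}")
    for n in (100, 16_000):  # smallest and largest fixed training sets
        print(f"{n:>6} levels: about {visits_per_level(n):,.0f} visits per level")
```

Under these assumptions each agent plays about 2.56 million episodes, which is consistent with the unrestricted agent seeing roughly 2 million distinct levels.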

We collected each data point in the following graphs by averaging the trained agent’s final performance across 10,000 episodes. At test time, the agent is evaluated on never-before-seen levels. We found that substantial overfitting occurs when there are fewer than 4,000 training levels. In fact, we still see overfitting even with 16,000 training levels! Unsurprisingly, agents trained with the unrestricted set of levels performed best, as these agents had access to the most data. These agents are represented by the dotted line in the following graphs.
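The evaluation described here can be sketched as a comparison between average return on training levels and average return on held-out levels. This is a minimal illustration, not the CoinRun evaluation code: it assumes each level is identified by an integer seed, and `play_episode`, `generalization_gap`, and the toy agent are hypothetical names with made-up return values.

```python
import random
from statistics import mean
from typing import Callable, Iterable, Optional, Tuple


def average_return(play_episode: Callable[[int], float],
                   level_seeds: Iterable[int]) -> float:
    """Mean episode return over the given level seeds."""
    return mean(play_episode(seed) for seed in level_seeds)


def generalization_gap(play_episode: Callable[[int], float],
                       train_seeds: Iterable[int],
                       num_episodes: int = 10_000,
                       rng: Optional[random.Random] = None
                       ) -> Tuple[float, float, float]:
    """Return (train score, test score, train minus test)."""
    rng = rng or random.Random(0)
    train_list = list(train_seeds)
    train_set = set(train_list)

    # Held-out levels: random seeds that never appear in the training set.
    test_seeds = []
    while len(test_seeds) < num_episodes:
        seed = rng.randrange(2**31)
        if seed not in train_set:
            test_seeds.append(seed)

    train_score = average_return(play_episode,
                                 rng.choices(train_list, k=num_episodes))
    test_score = average_return(play_episode, test_seeds)
    return train_score, test_score, train_score - test_score


def toy_agent(seed: int) -> float:
    """Hypothetical agent that memorized levels 0-99 and is weaker elsewhere."""
    return 10.0 if seed < 100 else 5.0


if __name__ == "__main__":
    train, test, gap = generalization_gap(toy_agent, range(100))
    print(f"train={train:.2f} test={test:.2f} gap={gap:.2f}")
```

A large positive gap indicates that the agent has overfit to its training levels. An agent trained on an unrestricted set has no fixed training set to overfit, so only its test score is meaningful.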

We compared our Nature-CNN baseline against the convolutional architecture used in IMPALA and found that the IMPALA-CNN agents generalized much better with any training set, as seen below.

Even impressive RL policies are often trained without supervised learning techniques such as dropout and batch normalization. In the CoinRun generalization regime, however, we find that these methods do have a positive impact and that our previous RL policies were overfitting to particular MDPs.
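For readers unfamiliar with the two techniques, here is a minimal pure-Python sketch of what each one computes. It is illustrative only: the post does not say where in the policy network these layers were placed or which dropout rate was used, so the function names and values below are assumptions. The batch normalization shown uses batch statistics only, without the learned scale and shift that a full layer would include.

```python
import math
import random
from typing import List, Optional


def dropout(features: List[float], rate: float, training: bool,
            rng: Optional[random.Random] = None) -> List[float]:
    """Inverted dropout: zero each unit with probability `rate` while training
    and rescale the survivors by 1 / (1 - rate); identity at test time."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(features)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


def batch_norm(batch: List[List[float]], eps: float = 1e-5) -> List[List[float]]:
    """Normalize each feature to zero mean and unit variance across the batch."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)]
            for row in batch]


if __name__ == "__main__":
    print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, training=True,
                  rng=random.Random(0)))
    print(batch_norm([[1.0, 10.0], [3.0, 30.0]]))
```

Both techniques inject noise or normalization that depends on more than the single input at hand, which is one common explanation for why they discourage memorizing individual training examples.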

Author

Karl Cobbe

Acknowledgments

Thanks to the many people who contributed to this paper and blog post:

Oleg Klimov, Chris Hesse, Taehoon Kim, John Schulman, Mira Murati, Jack Clark, Ashley Pilipiszyn, Matthias Plappert, Ilya Sutskever, Greg Brockman

External reviewers:

Jon Walsh, Caleb Kruse, Nikhil Mishra