Quantifying Generalization in Reinforcement Learning

Quantifying Generalization in Reinforcement Learning – Source openai.com

We’re releasing a new training environment, CoinRun, which provides a metric for an agent’s ability to transfer its experience to novel situations, and has already helped clarify a longstanding puzzle in reinforcement learning [1]. CoinRun strikes a desirable balance in complexity: the environment is much simpler than traditional platformer games like Sonic the Hedgehog, but it still poses a worthy generalization challenge for state-of-the-art algorithms.

The Generalization Challenge

Generalizing between tasks remains difficult for state-of-the-art deep reinforcement learning (RL) algorithms. Although trained agents can solve complex tasks, they struggle to transfer their experience to new environments. Even though people know that RL agents tend to overfit (that is, to latch onto the specifics of their environment rather than learn generalizable skills), RL agents are still benchmarked by evaluating on the environments they trained on. This would be like testing on your training set in supervised learning!

Previous work has used the Sonic benchmark, procedurally generated gridworld mazes, and the General Video Game AI (GVG-AI) framework to address this problem. In all cases, generalization is measured by training and testing agents on different sets of levels. Agents trained on our Sonic benchmark performed well on the training levels but poorly on the test levels without any fine-tuning. In similar displays of overfitting, agents trained on procedurally generated mazes learned to memorize a large number of training levels, and GVG-AI agents performed poorly under difficulty settings that weren’t seen during training.

Rules of the Game

CoinRun was designed to be tractable for existing algorithms and mimics the style of platformer games like Sonic. The levels of CoinRun are procedurally generated, providing agents access to a large and easily quantifiable supply of training data. The goal of each CoinRun level is simple: collect the single coin that lies at the end of the level. Several obstacles, both stationary and non-stationary, lie between the agent and the coin. A collision with an obstacle results in the agent’s immediate death. The only reward in the environment is obtained by collecting the coin, and this reward is a fixed positive constant. The level terminates when the agent dies, the coin is collected, or after 1000 time steps.
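
The reward and termination rules above can be summarized in a small sketch. This is not CoinRun’s actual implementation: the function name `run_episode`, the event labels, and the default `coin_reward` value are illustrative assumptions (the post only says the reward is a fixed positive constant).

```python
from typing import Iterable, Tuple

MAX_STEPS = 1000  # time limit stated in the post


def run_episode(events: Iterable[str], coin_reward: float = 1.0,
                max_steps: int = MAX_STEPS) -> Tuple[float, int, str]:
    """Replay a stream of per-step events and apply the episode rules.

    Each event is one of "nothing", "obstacle", or "coin" (hypothetical labels).
    Returns (total_reward, steps_taken, termination_reason).
    """
    steps = 0
    for event in events:
        steps += 1
        if event == "obstacle":
            # Any collision with an obstacle ends the episode with no reward.
            return 0.0, steps, "died"
        if event == "coin":
            # Collecting the coin is the only source of reward.
            return coin_reward, steps, "coin"
        if steps >= max_steps:
            # The level also terminates once the time limit is reached.
            return 0.0, steps, "timeout"
    # The event stream ran out before any termination rule fired.
    return 0.0, steps, "stream_ended"


print(run_episode(["nothing", "nothing", "coin"]))  # (1.0, 3, 'coin')
```

Because the reward is sparse and binary, an episode’s return directly indicates whether the level was solved.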

Each level of CoinRun has a difficulty setting from 1 to 3. Two levels are displayed above: Difficulty-1 (left) and Difficulty-3 (right)

Evaluating Generalization

We trained 9 agents to play CoinRun, each with a different number of available training levels. The first 8 agents trained on sets ranging from 100 to 16,000 levels. We trained the final agent on an unrestricted set of levels, so this agent never sees the same level twice. Our agents use policies with a common 3-layer convolutional architecture, which we call Nature-CNN, and were trained with Proximal Policy Optimization (PPO) for a total of 256M timesteps. Since an episode lasts 100 timesteps on average, agents with fixed training sets will see each training level thousands to millions of times. The final agent, trained with the unrestricted set, will see roughly 2 million distinct levels, each of them exactly once.
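
As a back-of-the-envelope check, the "roughly 2 million distinct levels" figure follows from dividing the training budget by the average episode length. The helper name `estimated_episodes` is hypothetical; the two constants are the ones stated above.

```python
TOTAL_TIMESTEPS = 256_000_000  # training budget stated in the post
AVG_EPISODE_LENGTH = 100       # average episode length stated in the post


def estimated_episodes(total_timesteps: int = TOTAL_TIMESTEPS,
                       avg_episode_length: int = AVG_EPISODE_LENGTH) -> int:
    """Rough number of episodes played over a whole training run."""
    return total_timesteps // avg_episode_length


print(estimated_episodes())  # 2560000
```

With an unrestricted level set, each episode uses a fresh level, so the episode count is also the approximate number of distinct levels seen.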

We collected each data point in the following graphs by averaging each agent’s final performance across 10,000 episodes. At test time, the agent is evaluated on never-before-seen levels. We discovered that substantial overfitting occurs when there are fewer than 4,000 training levels. In fact, we still see overfitting even with 16,000 training levels! Unsurprisingly, agents trained with the unrestricted set of levels performed best, as these agents had access to the most data. These agents are represented by the dotted line in the following graphs.
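
One way to turn this evaluation into a single number is to subtract test performance from train performance. The sketch below assumes performance is measured as the fraction of levels solved; the function names `success_rate` and `generalization_gap` are illustrative, not taken from the released code.

```python
from statistics import mean
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of episodes in which the level was solved."""
    if not outcomes:
        raise ValueError("need at least one episode outcome")
    return mean(1.0 if solved else 0.0 for solved in outcomes)


def generalization_gap(train_outcomes: Sequence[bool],
                       test_outcomes: Sequence[bool]) -> float:
    """Train success rate minus test success rate (larger means more overfitting)."""
    return success_rate(train_outcomes) - success_rate(test_outcomes)


# Toy data: 9 of 10 training episodes solved, 6 of 10 unseen test levels solved.
train = [True] * 9 + [False]
test = [True] * 6 + [False] * 4
print(round(generalization_gap(train, test), 3))  # 0.3
```

A gap near zero means the agent does about as well on unseen levels as on the levels it trained on.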

We compared our Nature-CNN baseline against the convolutional architecture used in IMPALA and found that the IMPALA-CNN agents generalized much better with any training set, as seen below.

Final train and test performance of Nature-CNN agents after 256M timesteps, as a function of the number of training levels.

Final train and test performance of IMPALA-CNN agents after 256M timesteps, as a function of the number of training levels.

Improving Generalization Performance

In our next experiments, we used a fixed training set of 500 CoinRun levels. Our baseline agents struggle to generalize with so few levels, making this an ideal training set for a benchmark. We encourage others to evaluate their own methods by training on the same 500 levels, directly comparing test time performance. Using this training set, we investigated the impact of several regularization techniques:

  • Dropout and L2 regularization: Both noticeably reduce the generalization gap, though L2 regularization has a bigger impact.
  • Data augmentation (modified Cutout) and batch normalization: Both data augmentation and batch normalization significantly improve generalization.
  • Environmental stochasticity: Training with stochasticity improves generalization to a greater extent than any of the previously mentioned techniques (see the paper for details).
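
To make the data augmentation item concrete, here is a minimal Cutout-style sketch in plain Python. It is an assumption-laden illustration, not the authors’ modified Cutout: it masks a single random rectangle with a single random color, and the names `cutout`, `Frame`, and `max_frac` are hypothetical. See the paper for the exact variant used.

```python
import random
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]  # rows of RGB pixels


def cutout(frame: Frame, rng: random.Random, max_frac: float = 0.5) -> Frame:
    """Return a copy of `frame` with one random rectangle filled with a random color.

    `max_frac` (assumed to be in (0, 1]) caps the rectangle's height and width
    as a fraction of the frame's height and width.
    """
    height, width = len(frame), len(frame[0])
    rect_h = rng.randint(1, max(1, int(height * max_frac)))
    rect_w = rng.randint(1, max(1, int(width * max_frac)))
    top = rng.randint(0, height - rect_h)
    left = rng.randint(0, width - rect_w)
    color: Pixel = (rng.randrange(256), rng.randrange(256), rng.randrange(256))

    out = [row[:] for row in frame]  # copy rows so the input frame is untouched
    for y in range(top, top + rect_h):
        for x in range(left, left + rect_w):
            out[y][x] = color
    return out


# Usage: augment an 8x8 black frame with a seeded generator for reproducibility.
frame = [[(0, 0, 0)] * 8 for _ in range(8)]
augmented = cutout(frame, random.Random(0))
```

Applying a fresh mask to each observation during training discourages the policy from relying on any one patch of pixels.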

Additional Environments

We also developed two additional environments to investigate overfitting: a CoinRun variant called CoinRun-Platforms and a simple maze navigation environment called RandomMazes. In these experiments, we used the original IMPALA-CNN architecture followed by an LSTM, since memory is necessary to perform well in these environments.

In CoinRun-Platforms, there are several coins that the agent attempts to collect within the 1000-step time limit. Coins are randomly scattered across platforms in the level. Levels in CoinRun-Platforms are a larger, fixed size, so the agent must explore more actively, occasionally retracing its steps.

Final train and test performance in CoinRun-Platforms after 2B timesteps, as a function of the number of training levels.

When we ran both CoinRun-Platforms and RandomMazes through our baseline experiment, our agents strongly overfit in all cases. We observe particularly strong overfitting in the case of RandomMazes, as a sizeable generalization gap remains even when using 20,000 training levels.

A level in RandomMazes showing the agent’s observation space (left). Final train and test performance shown as a function of the number of training levels (right).

Next Steps

Our results provide insight into the challenges underlying generalization in RL. Using the procedurally generated CoinRun environment, we can precisely quantify such overfitting. With this metric, we can better evaluate key architectural and algorithmic decisions. We believe that the lessons learned from this environment will apply in more complex settings, and we hope to use this benchmark, and others like it, to iterate towards more generalizable agents.

We suggest the following for future research:

  • Investigate the relationship between environment complexity and the number of levels required for good generalization
  • Investigate whether different recurrent architectures are better suited for generalization in these environments
  • Explore ways to effectively combine different regularization methods

If you are interested in this line of research, consider working at OpenAI!


[1] Even impressive RL policies are often trained without supervised learning techniques such as dropout and batch normalization. In the CoinRun generalization regime, however, we find that these methods do have a positive impact and that our previous RL policies were overfitting to particular MDPs.


Acknowledgements

Thanks to the many people who contributed to this paper and blog post:
Oleg Klimov, Chris Hesse, Taehoon Kim, John Schulman, Mira Murati, Jack Clark, Ashley Pilipiszyn, Matthias Plappert, Ilya Sutskever, Greg Brockman

External reviewers: Jon Walsh, Caleb Kruse, Nikhil Mishra
Assets: Kenney