Better exploration with parameter noise

Parameter noise lets us teach agents tasks much more rapidly than with other approaches. After learning for 20 episodes on the HalfCheetah⁠(opens in a new window) Gym environment (shown above), the policy achieves a score of around 3,000, whereas a policy trained with traditional action noise only achieves around 1,500.

Parameter noise adds adaptive noise to the parameters of the neural network policy, rather than to its action space. Traditional RL uses action space noise to change the likelihoods associated with each action the agent might take from one moment to the next. Parameter space noise injects randomness directly into the parameters of the agent, altering the types of decisions it makes such that they always fully depend on what the agent currently senses. The technique is a middle ground between evolution strategies⁠(opens in a new window) (where you manipulate the parameters of your policy but don’t influence the actions a policy takes as it explores the environment during each rollout) and deep reinforcement learning approaches like TRPO⁠(opens in a new window), DQN⁠(opens in a new window), and DDPG (where you don’t touch the parameters, but add noise to the action space of the policy).

When we first conducted this research, we found that the perturbations we applied to the Q function of DQN could sometimes be so extreme it would lead to the algorithm repeatedly executing the same action. To deal with this, we added a separate head that explicitly represents the policy as in DDPG (in regular DQN, the policy is only represented implicitly by the Q function) to make the setup more similar to our other experiments. However, when preparing our code for this release we ran an experiment that used parameter space noise without the separate policy head. We found that this worked comparably to our version with the separate policy head while being much simpler to implement. Further experiments confirmed that the separate policy head was indeed unnecessary as the algorithm had likely improved since our early experiments due to us altering how we re-scaled the noise. This led to a simpler, easier to implement, and less costly to train algorithm while still achieving very similar results. It’s important to remember that AI algorithms, especially in reinforcement learning, can fail silently and subtly⁠(opens in a new window), which can lead to people engineering solutions around missed bugs.