Generalizing from simulation

The abundance of RL results with simulated robots can make it seem like RL easily solves most robotics tasks. But common RL algorithms work well only on tasks where small perturbations to your action produce an incremental change in the reward. Some robotics tasks, like walking, have simple rewards: you can be scored on distance traveled. But most tasks do not. To define a dense reward for block stacking, you’d need to encode that the arm is close to the block, that the arm approaches the block in the correct orientation, that the block is lifted off the ground, the distance of the block to the desired position, etc.
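To make the contrast concrete, here is a minimal sketch of a binary reward next to a hand-shaped dense reward for a block-placing task. Everything in it is an assumption for illustration: the function names, the tolerance, the term weights, and the use of plain 3-D positions. The orientation term is left out for brevity, and this is not the reward used in the work described here.

```python
import math

def sparse_reward(block_pos, goal_pos, tol=0.02):
    """Binary reward: 1.0 only when the block is within `tol` of the goal."""
    return 1.0 if math.dist(block_pos, goal_pos) <= tol else 0.0

def dense_reward(gripper_pos, block_pos, goal_pos, table_height=0.0):
    """Hand-shaped reward with one term per sub-goal (weights are made up)."""
    reach = -math.dist(gripper_pos, block_pos)                    # arm close to the block
    lifted = 1.0 if block_pos[2] > table_height + 0.01 else 0.0   # block off the table
    place = -math.dist(block_pos, goal_pos)                       # block close to the goal
    return 0.5 * reach + 0.25 * lifted + 1.0 * place

# Hypothetical positions (x, y, z) in meters.
block = (0.0, 0.0, 0.0)
goal = (0.5, 0.0, 0.05)
far_gripper = (0.3, 0.0, 0.2)
near_gripper = (0.1, 0.0, 0.1)

# Moving the gripper closer changes the dense reward but not the sparse one.
print(sparse_reward(block, goal), sparse_reward(block, goal))
print(dense_reward(far_gripper, block, goal), dense_reward(near_gripper, block, goal))
```

The sparse version takes one line to specify but gives no signal until the task is already solved. The dense version gives a gradient to follow, but every term and weight is a design decision that has to be tuned.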

We spent a number of months unsuccessfully trying to get conventional RL algorithms working on pick-and-place tasks before ultimately developing a new reinforcement learning algorithm, Hindsight Experience Replay (HER), which allows agents to learn from a binary reward by pretending that a failure was what they wanted to do all along and learning from it accordingly. (By analogy, imagine looking for a gas station but ending up at a pizza shop. You still don’t know where to get gas, but you’ve now learned where to get pizza.) We also used domain randomization on the visual shapes to learn a vision system robust enough for the physical world.
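The relabeling idea can be sketched in a few lines. The example below is a toy under stated assumptions: a hypothetical 1-D world where the state is a position, a 0/1 reward, and a single relabeling strategy that substitutes the final state reached for the original goal. The names `Transition`, `run_episode`, and `relabel_with_final_state` are invented for illustration and do not come from any HER implementation.

```python
from typing import List, NamedTuple

class Transition(NamedTuple):
    state: float
    action: float
    next_state: float
    goal: float
    reward: float

def binary_reward(achieved: float, goal: float, tol: float = 1e-6) -> float:
    """1.0 if the achieved position matches the goal, otherwise 0.0."""
    return 1.0 if abs(achieved - goal) <= tol else 0.0

def run_episode(start: float, goal: float, actions: List[float]) -> List[Transition]:
    """Roll out fixed actions in a toy world where next_state = state + action."""
    episode, state = [], start
    for action in actions:
        next_state = state + action
        reward = binary_reward(next_state, goal)
        episode.append(Transition(state, action, next_state, goal, reward))
        state = next_state
    return episode

def relabel_with_final_state(episode: List[Transition]) -> List[Transition]:
    """Pretend the final state reached was the goal all along."""
    new_goal = episode[-1].next_state
    return [
        t._replace(goal=new_goal, reward=binary_reward(t.next_state, new_goal))
        for t in episode
    ]

# The agent wanted to reach 5.0 (the gas station) but stopped at 3.0 (the pizza shop).
failed = run_episode(start=0.0, goal=5.0, actions=[1.0, 1.0, 1.0])
hindsight = relabel_with_final_state(failed)

replay_buffer = failed + hindsight  # both versions are stored for off-policy training
print([t.reward for t in failed])
print([t.reward for t in hindsight])
```

The original episode carries no reward signal at all, while the relabeled copy contains a successful transition for the goal that was actually reached. Storing both is what lets an off-policy learner make progress from failures.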

Our HER implementation uses the actor-critic technique with asymmetric information. (The actor is the policy, and the critic is a network which receives action/state pairs and estimates their Q-value, or expected sum of future rewards, providing training signal to the actor.) While the critic has access to the full state of the simulator, the actor only has access to RGB and depth data. Thus the critic can provide fully accurate feedback, while the actor uses only data present in the real world.
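The split in information can be illustrated with a small sketch. It assumes linear functions in place of neural networks, a hypothetical three-field simulator state, and an `observe` function standing in for RGB and depth rendering. The actor step follows the deterministic policy gradient, which is an assumption about the update rule. Critic training is omitted.

```python
from typing import NamedTuple, Sequence, Tuple

class FullState(NamedTuple):
    """Everything the simulator knows (available only during training)."""
    gripper_x: float
    block_x: float
    block_mass: float  # hidden quantity a camera cannot see directly

def observe(state: FullState) -> Tuple[float, float]:
    """Stand-in for rendering RGB and depth: drops simulator-only fields."""
    return (state.gripper_x, state.block_x)

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class Actor:
    """Linear policy that sees only the observation."""
    def __init__(self, weights: Sequence[float]):
        self.weights = list(weights)

    def act(self, obs: Sequence[float]) -> float:
        return dot(self.weights, obs)

class Critic:
    """Linear Q-function over the full simulator state and the action."""
    def __init__(self, state_weights: Sequence[float], action_weight: float):
        self.state_weights = list(state_weights)
        self.action_weight = action_weight

    def q_value(self, state: FullState, action: float) -> float:
        return dot(self.state_weights, state) + self.action_weight * action

    def dq_daction(self, state: FullState, action: float) -> float:
        return self.action_weight  # derivative of the linear Q w.r.t. the action

def actor_update(actor: Actor, critic: Critic, state: FullState, lr: float = 0.1) -> None:
    """One gradient-ascent step on Q(state, actor(observe(state)))."""
    obs = observe(state)
    action = actor.act(obs)
    grad_a = critic.dq_daction(state, action)
    # Chain rule: dQ/dw = dQ/da * da/dw, and da/dw = obs for a linear actor.
    actor.weights = [w + lr * grad_a * o for w, o in zip(actor.weights, obs)]

state = FullState(gripper_x=0.2, block_x=0.5, block_mass=1.0)
actor = Actor([0.0, 0.0])
critic = Critic(state_weights=[0.1, 0.1, 0.1], action_weight=2.0)

q_before = critic.q_value(state, actor.act(observe(state)))
actor_update(actor, critic, state)
q_after = critic.q_value(state, actor.act(observe(state)))
print(q_before, q_after)
```

Only `Actor.act` and `observe` would be needed at deployment time. The critic and the full state exist only inside the training loop, which is why the critic can safely depend on information the real robot never has.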