OpenAI Baselines: ACKTR & A2C

We’re releasing two new OpenAI Baselines implementations: ACKTR and A2C. A2C is a synchronous, deterministic variant of Asynchronous Advantage Actor Critic (A3C) which we’ve found gives equal performa

For machine learning algorithms, two costs are important to consider: sample complexity and computational complexity. Sample complexity refers to the number of timesteps of interaction between the agent and its environment, and computational complexity refers to the number of numerical operations that must be performed.

ACKTR has better sample complexity than first-order methods such as A2C because it takes a step in the natural gradient direction, rather than the gradient direction (or a rescaled version as in Adam). The natural gradient gives us the direction in parameter space that achieves the largest (instantaneous) improvement in the objective per unit of change in the output distribution of the network, as measured using the KL divergence. By limiting the KL divergence, we ensure that the new policy does not behave radically differently from the old one, which could cause a collapse in performance.
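To make the idea of a KL-limited natural gradient step concrete, here is a minimal sketch on a toy problem. It is not the ACKTR implementation: it assumes a hypothetical policy made of independent Bernoulli action choices with one logit each, so the Fisher matrix is diagonal and can be inverted exactly, whereas ACKTR approximates the Fisher matrix of a neural network with KFAC. The function names, the example gradient, and the `kl_limit` value are all illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) for p, q strictly between 0 and 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def natural_gradient_step(logits, grad, kl_limit=0.01):
    """Take one natural gradient step whose predicted KL equals kl_limit.

    Toy assumption: independent Bernoulli outputs, one logit each, so the
    Fisher matrix is diagonal with entries p * (1 - p).
    """
    probs = [sigmoid(t) for t in logits]
    fisher_diag = [p * (1 - p) for p in probs]

    # Natural gradient: F^-1 g (elementwise here because F is diagonal).
    nat_grad = [g / f for g, f in zip(grad, fisher_diag)]

    # Second-order approximation: KL ~= 0.5 * step^T F step.
    # Scale the step so that this approximation equals kl_limit.
    quad = sum(g * n for g, n in zip(grad, nat_grad))  # g^T F^-1 g
    scale = math.sqrt(2.0 * kl_limit / quad)
    return [t + scale * n for t, n in zip(logits, nat_grad)]


logits = [0.0, 2.0]   # second output is closer to saturation
grad = [1.0, 1.0]     # made-up objective gradient, same for both logits
new_logits = natural_gradient_step(logits, grad, kl_limit=0.01)

actual_kl = sum(
    bernoulli_kl(sigmoid(old), sigmoid(new))
    for old, new in zip(logits, new_logits)
)
print("step per logit:", [n - o for o, n in zip(logits, new_logits)])
print("actual KL:", actual_kl)
```

Although both logits receive the same plain gradient, the natural gradient moves the second logit further, because changing it alters the output distribution less per unit of parameter change. The measured KL lands near the requested limit rather than exactly on it, since the step size is derived from a quadratic approximation.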

As for computational complexity, the KFAC update used by ACKTR is only 10–25% more expensive per update step than a standard gradient update. This contrasts with methods like TRPO (i.e., Hessian-free optimization), which require a more expensive conjugate-gradient computation.

In the following video you can see comparisons at different timesteps between agents trained with ACKTR to solve the game Q-Bert and those trained with A2C. ACKTR agents get higher scores than ones trained with A2C.