OpenAI Baselines: DQN

When transforming the screen images into greyscale we had incorrectly calibrated our coefficients for the green color values, which led to the fish disappearing. After we noticed the bug we tweaked the color values and our algorithm was able to see the fish again.

To debug issues like this in the future, Gym now contains a play⁠(opens in a new window) function, which lets a researcher easily see the same observations as the AI agent would.

Fix bugs, then hyperparameters: After debugging, we started to calibrate our hyperparameters. We ultimately found that setting the annealing schedule for epsilon, a hyperparameter which controlled the exploration rate, had a huge impact on performance. Our final implementation decreases epsilon to 0.1 over the first million steps and then down to 0.01 over the next 24 million steps. If our implementation contained bugs, then it’s likely we would come up with different hyperparameter settings to try to deal with faults we hadn’t yet diagnosed.

Double check your interpretations of papers: In the DQN Nature⁠(opens in a new window) paper the authors write: “We also found it helpful to clip the error term from the update [...] to be between -1 and 1.”. There are two ways to interpret this statement — clip the objective, or clip the multiplicative term when computing gradient. The former seems more natural, but it causes the gradient to be zero on transitions with high error, which leads to suboptimal performance, as found in one DQN implementation⁠(opens in a new window). The latter is correct and has a simple mathematical interpretation — Huber Loss⁠(opens in a new window). You can spot bugs like these by checking that the gradients appear as you expect — this can be easily done within TensorFlow by using compute_gradients⁠(opens in a new window).

The majority of bugs in this post were spotted by going over the code multiple times and thinking through what could go wrong with each line. Each bug seems obvious in hindsight, but even experienced researchers tend to underestimate how many passes over the code it can take to find all the bugs in an implementation.