Prover-Verifier Games improve legibility of language model outputs

Ensuring that language models produce understandable text is crucial to making them helpful for people, especially on complex tasks like solving math problems.

We found that when we optimize the problem-solving process of strong models solely for getting the correct answer, the resulting solutions can become harder to understand. In fact, when we asked human evaluators with limited time to assess these highly optimized solutions, they made nearly twice as many errors as when they evaluated less optimized solutions. This finding highlights the importance of not just correctness, but also clarity and ease of verification in AI-generated text.

By training advanced language models to create text that weaker models can easily verify, we found that humans could also evaluate these texts more effectively – a process we call improving legibility.

This is where prover-verifier games come into play. These games involve two players: a "prover" that generates a solution and a "verifier" that checks it for accuracy.
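To make the two roles concrete, here is a minimal toy sketch in Python. It is not the training setup itself, where both players are language models; the `Solution` class and the `prover` and `verifier` functions are hypothetical names chosen for illustration, and the task (summing a list of integers) is an assumption picked only because it is easy to check. The point it shows is that a solution written as small, explicit steps can be checked by a much simpler verifier.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    steps: list[str]  # human-readable working, one "a + b = c" line per step
    answer: int


def prover(numbers: list[int]) -> Solution:
    """Hypothetical toy prover: sums the numbers and shows every running total."""
    steps = []
    total = 0
    for n in numbers:
        steps.append(f"{total} + {n} = {total + n}")
        total += n
    return Solution(steps=steps, answer=total)


def verifier(numbers: list[int], solution: Solution) -> bool:
    """Hypothetical toy verifier: accepts only if each step is valid
    and the chain of steps ends at the claimed answer."""
    if len(solution.steps) != len(numbers):
        return False
    total = 0
    try:
        for n, step in zip(numbers, solution.steps):
            lhs, rhs = step.split(" = ")
            a, b = lhs.split(" + ")
            if int(a) != total or int(b) != n or int(rhs) != total + n:
                return False
            total = int(rhs)
    except ValueError:
        # A step that cannot be parsed is treated as not verifiable.
        return False
    return total == solution.answer


if __name__ == "__main__":
    problem = [2, 3, 4]
    honest = prover(problem)
    print(verifier(problem, honest))

    # A solution with a wrong final answer and no matching working is rejected.
    sloppy = Solution(steps=honest.steps, answer=10)
    print(verifier(problem, sloppy))
```

In this sketch the verifier never solves the problem from scratch; it only checks that each written step follows from the previous one, which is the kind of easy checkability that legible solutions are meant to allow.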

This approach is essential not only for ensuring that outputs are correct, but also for making them easy for both humans and other AI systems to understand and verify.

Understanding and addressing the balance between performance and legibility can lead to more effective and trustworthy AI applications, benefiting a wide range of fields where precise and clear communication is essential.

Authors

Yining Chen, Jan Hendrik Kirchner

Contributors

Angela Baek, Yuri Burda, Thomas Degry, Harri Edwards, Elie Georges, Cary Hudson, Jan Leike, Nat McAleese, Wes McCabe, Lindsay McCallum, Freddie Sulit