AI safety needs social scientists

The goal of long-term artificial intelligence (AI) safety is to ensure that advanced AI systems are aligned with human values, meaning that they reliably do things that people want them to do. At OpenAI, we hope to achieve this by asking people questions about what they want, training machine learning (ML) models on this data, and optimizing AI systems to do well according to these learned models. Examples of this research include Learning from human preferences, AI safety via debate, and Learning complex goals with iterated amplification.
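To make the ask, train, optimize loop concrete, here is a minimal sketch in Python. It assumes the human data takes the form of pairwise choices and that a Bradley-Terry model is fit to them; both are illustrative assumptions, not a description of the exact method in the papers above. The names fit_scores and best_option are hypothetical.

```python
import math


def fit_scores(comparisons, n_items, lr=0.1, steps=2000):
    """Learn one scalar score per option from pairwise human choices.

    comparisons: list of (preferred, other) index pairs collected from people.
    Assumed model (Bradley-Terry): P(a preferred to b) = sigmoid(s[a] - s[b]).
    The scores are fit by plain gradient ascent on the log-likelihood.
    """
    scores = [0.0] * n_items
    for _ in range(steps):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            p = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # Gradient of log(p) with respect to each of the two scores.
            grads[winner] += 1.0 - p
            grads[loser] -= 1.0 - p
        for i in range(n_items):
            scores[i] += lr * grads[i] / len(comparisons)
    return scores


def best_option(scores):
    """'Optimize' against the learned model: pick the highest-scoring option."""
    return max(range(len(scores)), key=scores.__getitem__)
```

In this toy setting the "AI system" is just an argmax over a handful of options. The learned scores are only as reliable as the human answers that produced them, which is the concern the rest of this post takes up.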

Unfortunately, human answers to questions about their values may be unreliable. Humans have limited knowledge and reasoning ability, and exhibit a variety of cognitive biases and ethical beliefs that turn out to be inconsistent on reflection. We anticipate that different ways of asking questions will interact with human biases in different ways, producing higher- or lower-quality answers. For example, judgments about how wrong an action is can vary depending on whether the word “morally” appears in the question, and people can make inconsistent choices between gambles if the task they are presented with is complex.

We have several methods that try to target the reasoning behind human values, including amplification and debate, but we do not know how they behave with real people in realistic situations. If a problem with an alignment algorithm appears only in natural language discussion of a complex value-laden question, current ML may be too weak to uncover the issue.

To avoid the limitations of ML, we propose experiments that consist entirely of people, replacing ML agents with people playing the role of those agents. For example, the debate approach to AI alignment involves a game with two AI debaters and a human judge; we can instead use two human debaters and a human judge. Humans can debate whatever questions we like, and lessons learned in the human case can be transferred to ML.
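One way to picture the substitution is to write the debate game so that it does not care who fills each role. The sketch below is a hypothetical illustration, not the protocol from the debate paper: the function run_debate, the fixed number of rounds, and the alternating turn order are all assumptions. Debaters and the judge are plain callables, so a person typing answers or an ML model could stand behind either one.

```python
from typing import Callable, Dict, List, Tuple

# One transcript entry: (speaker name, statement).
Turn = Tuple[str, str]
# A debater sees the question and the transcript so far, and returns a statement.
Debater = Callable[[str, List[Turn]], str]
# A judge sees the question and the full transcript, and names the winner.
Judge = Callable[[str, List[Turn]], str]


def run_debate(
    question: str,
    debaters: Dict[str, Debater],
    judge: Judge,
    rounds: int = 2,
) -> Tuple[str, List[Turn]]:
    """Run a fixed number of alternating rounds, then ask the judge for a winner."""
    transcript: List[Turn] = []
    for _ in range(rounds):
        for name, speak in debaters.items():
            # Pass a copy so a debater cannot alter the shared record.
            statement = speak(question, list(transcript))
            transcript.append((name, statement))
    winner = judge(question, list(transcript))
    if winner not in debaters:
        raise ValueError(f"judge named an unknown debater: {winner!r}")
    return winner, transcript
```

In a human-only experiment, each callable would wrap whatever interface collects a participant's input. Swapping in ML debaters later would change only the callables passed in, not the game itself.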