Lessons learned on language model safety and misuse

Footnotes

This post is based on our approach to deploying language models through an API, and as such the lessons and mitigations described are most relevant to those also pursuing API-based deployment. However, we also expect some of the discussion to be relevant to those building first-party applications using language models and those considering the open source release of language models.

This post is intended to explain and share learnings from our approach, rather than to suggest that all actors should necessarily adopt the same approach, or that the same approach is applicable to all possible AI systems. There are benefits and costs associated with different deployment approaches, different models will benefit more or less from study prior to deployment, and in some cases it can be valuable for distinct deployment paths to be pursued by different actors.

More details on this workshop will be included in the forthcoming publication based on it.

The mitigations that we emphasize in response to misuse have also evolved. For example, we initially focused on long form text generation as a threat vector, given prior cases of influence operations that involved people manually writing long form misleading content. Given that emphasis, we set maximum output lengths for generated text. Based on a pilot study of long form generation, however, we saw that output restrictions had little effect on policy violations—we’ve come to believe instead that short-form content amplifying or increasing engagement on misleading content could be the greater risk.

Examples of limitations in existing datasets, from the perspective of practitioners seeking a holistic assessment of the safety of real language model outputs, include the following: an overly narrow focus (e.g., just measuring occupational gender bias), an overly broad focus (e.g., measuring all under the umbrella of “toxicity”), a tendency to abstract away the specifics of use and context, a failure to measure the generative dimension of language model use (e.g., using multiple choice style), prompts that differ stylistically from those typically used in real language model use cases, not capturing dimensions of safety that are important in practice (e.g., an output following or ignoring a safety-motivated constraint in the instruction), or not capturing types of outputs we have found to be correlated with misuse (e.g., erotic content).

While our efforts are specifically oriented towards addressing limitations in existing benchmarks and in our own models, we also acknowledge that there are limitations to the methods we use such as classifier-based data filtration. For instance, operationally defining the content areas we aim to detect via filtration is challenging and filtration itself can introduce harmful biases. Additionally, the labeling of toxic data is a critical component of this work and ensuring the mental health of these labelers is an industry-wide challenge.

The relevant “user” of our API may be a developer building an application or an end-user interacting with such an application, depending on context. There are deep questions about the values our aligned models reflect and we hope to build a more nuanced understanding of how to balance the values of wide range of possible users and competing objectives when aligning language models to be more helpful, more truthful and less harmful.

More aligned models also have more practical advantages such as reducing the need for “prompt engineering” (providing examples of the desired behavior to steer the model in the right direction), saving space in the model’s context window which can be used for other purposes.

Beyond research, we have found that other safety-motivated interventions sometimes have unexpected benefits to customers. For example, rate limits intended to curb spam or misleading content also help customers to control expenses.

Authors

Miles Brundage, Katie Mayer, Tyna Eloundou, Sandhini Agarwal, Steven Adler, Gretchen Krueger, Jan Leike, Pamela Mishkin

Acknowledgments

Thanks to Lilian Weng, Rosie Campbell, Anna Makanju, Bob McGrew, Hannah Wong, Ryan Lowe, Steve Dowling, Mira Murati, Sam Altman, Greg Brockman, Ilya Sutskever, Percy Liang, Peter Welinder, Ethan Perez, Ellie Evans, Helen Ngo, Helen Toner, Justin Jay Wang, Jack Clark, Rishi Bommasani, Girish Sastry, Sarah Shoker, Matt Knight, Bianca Martin, Bob Rotsted, Lama Ahmad, Toki Sherbakov, and others for providing feedback on this post and related work.