OpenAI o1 System Card

References

M. Y. Guan, M. Joglekar, E. Wallace, S. Jain, B. Barak, A. Helyar, R. Dias, A. Vallone, H. Ren, J. Wei, H. W. Chung, S. Toyer, J. Heidecke, A. Beutel, and A. Glaese, “Deliberative alignment: Reasoning enables safer language models,” December 2024. Accessed: 2024-12-21.

A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. R. Bowman, “Bbq: A hand-built bias benchmark for question answering,” arXiv preprint arXiv:2110.08193, 2021.

E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the dangers of stochastic parrots: Can language models be too big?,” in Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp. 610–623, 2021.

J. Maynez, S. Narayan, B. Bohnet, and R. McDonald, “On faithfulness and factuality in abstractive summarization,” arXiv preprint arXiv:2005.00661, 2020.

M. Phuong, M. Aitchison, E. Catt, S. Cogan, A. Kaskasoli, V. Krakovna, D. Lindner, M. Rahtz, Y. Assael, S. Hodkinson, et al., “Evaluating frontier models for dangerous capabilities,” arXiv preprint arXiv:2403.13793, 2024.

T. Shevlane, S. Farquhar, B. Garfinkel, M. Phuong, J. Whittlestone, J. Leung, D. Kokotajlo, N. Marchal, M. Anderljung, N. Kolt, L. Ho, D. Siddarth, S. Avin, W. Hawkins, B. Kim, I. Gabriel, V. Bolina, J. Clark, Y. Bengio, P. Christiano, and A. Dafoe, “Model evaluation for extreme risks,” 2023.

OpenAI, “Red teaming network.” https://openai.com/index/red-teaming-network/, 2024. Accessed: 2024-09-11.

D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al., “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,” arXiv preprint arXiv:2209.07858, 2022.

M. Feffer, A. Sinha, W. H. Deng, Z. C. Lipton, and H. Heidari, “Red-teaming for generative ai: Silver bullet or security theater?,” 2024.

M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, C. O’Keefe, M. Koren, T. Ryffel, J. Rubinovitz, T. Besiroglu, F. Carugati, J. Clark, P. Eckersley, S. de Haas, M. Johnson, B. Laurie, A. Ingerman, I. Krawczuk, A. Askell, R. Cammarota, A. Lohn, D. Krueger, C. Stix, P. Henderson, L. Graham, C. Prunkl, B. Martin, E. Seger, N. Zilberman, Seán Ó hÉigeartaigh, F. Kroeger, G. Sastry, R. Kagan, A. Weller, B. Tse, E. Barnes, A. Dafoe, P. Scharre, A. Herbert-Voss, M. Rasser, S. Sodhani, C. Flynn, T. K. Gilbert, L. Dyer, S. Khan, Y. Bengio, and M. Anderljung, “Toward trustworthy ai development: Mechanisms for supporting verifiable claims,” 2020.

OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Brundage, K. Button, T. Cai, R. Campbell, A. Cann, B. Carey, C. Carlson, R. Carmichael, B. Chan, C. Chang, F. Chantzis, D. Chen, S. Chen, R. Chen, J. Chen, M. Chen, B. Chess, C. Cho, C. Chu, H. W. Chung, D. Cummings, J. Currier, Y. Dai, C. Decareaux, T. Degry, N. Deutsch, D. Deville, A. Dhar, D. Dohan, S. Dowling, S. Dunning, A. Ecoffet, A. Eleti, T. Eloundou, D. Farhi, L. Fedus, N. Felix, S. P. Fishman, J. Forte, I. Fulford, L. Gao, E. Georges, C. Gibson, V. Goel, T. Gogineni, G. Goh, R. Gontijo-Lopes, J. Gordon, M. Grafstein, S. Gray, R. Greene, J. Gross, S. S. Gu, Y. Guo, C. Hallacy, J. Han, J. Harris, Y. He, M. Heaton, J. Heidecke, C. Hesse, A. Hickey, W. Hickey, P. Hoeschele, B. Houghton, K. Hsu, S. Hu, X. Hu, J. Huizinga, S. Jain, S. Jain, J. Jang, A. Jiang, R. Jiang, H. Jin, D. Jin, S. Jomoto, B. Jonn, H. Jun, T. Kaftan, Łukasz Kaiser, A. Kamali, I. Kanitscheider, N. S. Keskar, T. Khan, L. Kilpatrick, J. W. Kim, C. Kim, Y. Kim, J. H. Kirchner, J. Kiros, M. Knight, D. Kokotajlo, Łukasz Kondraciuk, A. Kondrich, A. Konstantinidis, K. Kosic, G. Krueger, V. Kuo, M. Lampe, I. Lan, T. Lee, J. Leike, J. Leung, D. Levy, C. M. Li, R. Lim, M. Lin, S. Lin, M. Litwin, T. Lopez, R. Lowe, P. Lue, A. Makanju, K. Malfacini, S. Manning, T. Markov, Y. Markovski, B. Martin, K. Mayer, A. Mayne, B. McGrew, S. M. McKinney, C. McLeavey, P. McMillan, J. McNeil, D. Medina, A. Mehta, J. Menick, L. Metz, A. Mishchenko, P. Mishkin, V. Monaco, E. Morikawa, D. Mossing, T. Mu, M. Murati, O. Murk, D. Mély, A. Nair, R. Nakano, R. Nayak, A. Neelakantan, R. Ngo, H. Noh, L. Ouyang, C. O’Keefe, J. Pachocki, A. Paino, J. Palermo, A. Pantuliano, G. Parascandolo, J. Parish, E. Parparita, A. Passos, M. Pavlov, A. Peng, A. Perelman, F. de Avila Belbute Peres, M. Petrov, H. P. de Oliveira Pinto, M. Pokorny, M. Pokrass, V. H. Pong, T. Powell, A. Power, B. Power, E. Proehl, R. Puri, A. Radford, J. Rae, A. Ramesh, C. Raymond, F. Real, K. Rimbach, C. Ross, B. Rotsted, H. Roussez, N. Ryder, M. Saltarelli, T. Sanders, S. Santurkar, G. Sastry, H. Schmidt, D. Schnurr, J. Schulman, D. Selsam, K. Sheppard, T. Sherbakov, J. Shieh, S. Shoker, P. Shyam, S. Sidor, E. Sigler, M. Simens, J. Sitkin, K. Slama, I. Sohl, B. Sokolowsky, Y. Song, N. Staudacher, F. P. Such, N. Summers, I. Sutskever, J. Tang, N. Tezak, M. B. Thompson, P. Tillet, A. Tootoonchian, E. Tseng, P. Tuggle, N. Turley, J. Tworek, J. F. C. Uribe, A. Vallone, A. Vijayvergiya, C. Voss, C. Wainwright, J. J. Wang, A. Wang, B. Wang, J. Ward, J. Wei, C. Weinmann, A. Welihinda, P. Welinder, J. Weng, L. Weng, M. Wiethoff, D. Willner, C. Winter, S. Wolrich, H. Wong, L. Workman, S. Wu, J. Wu, M. Wu, K. Xiao, T. Xu, S. Yoo, K. Yu, Q. Yuan, W. Zaremba, R. Zellers, C. Zhang, M. Zhang, S. Zhao, T. Zheng, J. Zhuang, W. Zhuk, and B. Zoph, “GPT-4 technical report,” 2024.

T. Markov, C. Zhang, S. Agarwal, F. E. Nekoul, T. Lee, S. Adler, A. Jiang, and L. Weng, “A holistic approach to undesired content detection in the real world,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 15009–15018, 2023.

W. Zhao, X. Ren, J. Hessel, C. Cardie, Y. Choi, and Y. Deng, “Wildchat: 1m chatgpt interaction logs in the wild,” arXiv preprint arXiv:2405.01470, 2024.

P. Röttger, H. R. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy, “Xstest: A test suite for identifying exaggerated safety behaviours in large language models,” arXiv preprint arXiv:2308.01263, 2023.

X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, “do anything now: Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” arXiv preprint arXiv:2308.03825, 2023.

A. Souly, Q. Lu, D. Bowen, T. Trinh, E. Hsieh, S. Pandey, P. Abbeel, J. Svegliato, S. Emmons, O. Watkins, et al., “A strongreject for empty jailbreaks,” arXiv preprint arXiv:2402.10260, 2024.

P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, “Jailbreaking black box large language models in twenty queries,” 2024.

P. Chao, E. Debenedetti, A. Robey, M. Andriushchenko, F. Croce, V. Sehwag, E. Dobriban, N. Flammarion, G. J. Pappas, F. Tramèr, H. Hassani, and E. Wong, “Jailbreakbench: An open robustness benchmark for jailbreaking large language models,” 2024.

A. Tamkin, A. Askell, L. Lovitt, E. Durmus, N. Joseph, S. Kravec, K. Nguyen, J. Kaplan, and D. Ganguli, “Evaluating and mitigating discrimination in language model decisions,” arXiv preprint arXiv:2312.03689, 2023.

E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel, “The instruction hierarchy: Training llms to prioritize privileged instructions,” 2024.

T. Lanham, A. Chen, A. Radhakrishnan, B. Steiner, C. Denison, D. Hernandez, D. Li, E. Durmus, E. Hubinger, J. Kernion, et al., “Measuring faithfulness in chain-of-thought reasoning,” arXiv preprint arXiv:2307.13702, 2023.

M. Turpin, J. Michael, E. Perez, and S. Bowman, “Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting,” Advances in Neural Information Processing Systems, vol. 36, 2024.

S. H. Tanneru, D. Ley, C. Agarwal, and H. Lakkaraju, “On the hardness of faithful chain-of-thought reasoning in large language models,” arXiv preprint arXiv:2406.10625, 2024.

C. Agarwal, S. H. Tanneru, and H. Lakkaraju, “Faithfulness vs. plausibility: On the (un) reliability of explanations from large language models,” arXiv preprint arXiv:2402.04614, 2024.

O. Järviniemi and E. Hubinger, “Uncovering deceptive tendencies in language models: A simulated company ai assistant,” arXiv preprint arXiv:2405.01576, 2024.

T. Hagendorff, “Deception abilities emerged in large language models,” Proceedings of the National Academy of Sciences, vol. 121, no. 24, p. e2317967121, 2024.

L. Ahmad, S. Agarwal, M. Lampe, and P. Mishkin, “OpenAI’s approach to external red teaming,” 2024.

OpenAI, “OpenAI preparedness framework (beta).” https://cdn.openai.com/openai-preparedness-framework-beta.pdf, 2023. Accessed: 2024-09-11.

NYU Center for Cybersecurity, “CSAW cybersecurity games & conference,” 2013–2023.

T. Patwardhan, K. Liu, T. Markov, N. Chowdhury, D. Leet, N. Cone, C. Maltbie, J. Huizinga, C. Wainwright, S. Jackson, S. Adler, R. Casagrande, and A. Madry, “Building an early warning system for llm-aided biological threat creation,” OpenAI, 2023.

J. M. Laurent, J. D. Janizek, M. Ruzo, M. M. Hinks, M. J. Hammerling, S. Narayanan, M. Ponnapati, A. D. White, and S. G. Rodriques, “Lab-bench: Measuring capabilities of language models for biology research,” 2024.

I. Ivanov, “Biolp-bench: Measuring understanding of ai models of biological lab protocols,” bioRxiv, 2024.

C. Tan, V. Niculae, C. Danescu-Niculescu-Mizil, and L. Lee, “Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions,” in Proceedings of the 25th International Conference on World Wide Web, WWW ’16, International World Wide Web Conferences Steering Committee, Apr. 2016.

A. Alexandru, D. Sherburn, O. Jaffe, S. Adler, J. Aung, R. Campbell, and J. Leung, “Makemepay.” https://github.com/openai/evals/tree/main/evals/elsuite/make_me_pay, 2023. OpenAI Evals.

D. Sherburn, S. Adler, J. Aung, R. Campbell, M. Phuong, V. Krakovna, R. Kumar, S. Farquhar, and J. Leung, “Makemesay.” https://github.com/openai/evals/tree/main/evals/elsuite/make_me_say, 2023. OpenAI Evals.

N. Chowdhury, J. Aung, C. J. Shern, O. Jaffe, D. Sherburn, G. Starace, E. Mays, R. Dias, M. Aljubeh, M. Glaese, C. E. Jimenez, J. Yang, K. Liu, and A. Madry, “Introducing swe-bench verified,” OpenAI, 2024.

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “Swe-bench: Can language models resolve real-world github issues?,” 2024.

J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, L. Weng, and A. Mądry, “Mle-bench: Evaluating machine learning agents on machine learning engineering,” 2024.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” 2021.

Footnotes

Deliberative alignment is a training approach that teaches LLMs to explicitly reason through safety specifications before producing an answer.

OpenAI is constantly making small improvements to our models, and an improved o1 was launched on December 17th. The content of this card, released on December 5th, predates this updated model. This card covers the two checkpoints outlined in Section 3, not the December 17th updated model or any potential future model updates to o1.

This section was added on 12/19/2024, after the card's December 5th release.

See acknowledgements section for a list of individuals and organizations.

This was a task in the env_scientist task family, where the agent must deduce the underlying rules of a complex environment through observation and experimentation.

The non-trivial exploitation requirement was waived for the high-school subset, which is not used in any risk evaluations.

For ease of visualization, o1 data in the "Agentic tasks: success rates" chart represents the higher pass rate from either the Pre-Mitigation or Post-Mitigation model, and likewise for the o1-preview and o1-mini data.

Simple Evals GitHub Link: https://www.github.com/openai/simple-evals

Authors

OpenAI

OpenAI o1 System Card contributors

Adam Kalai, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Iftimie, Ally Bennett, Andrea Vallone, Andy Applebaum, Angela Jiang, Ben Rossen, Boaz Barak, Cary Bassin, Cary Hudson, Claudia Fischer, Clive Chan, David Robinson, Eddie Zhang, Elizabeth Proehl, Eric Wallace, Erik Ritter, Evan Mays, Filippo Raso, Freddie Sulit, Fred von Lohmann*, Giambattista Parascandolo, Hessam Bagherinezhad, Hongyu Ren, Hyung Won Chung, James Lennon, Jason Wei, Joaquin Quinonero Candela, Joel Parish, Jonathan Uesato*, Johannes Heidecke, Kai Xiao, Katy Shi, Kayla Wood, Kendra Rimbach, Kevin Liu, Lauren Yang, Lama Ahmad, Leon Maksin, Leyton Ho, Lilian Weng*, Liam Fedus, Manas Joglekar, Melody Guan, Mianna Chen*, Mia Glaese, Michael Lampe, Michele Wang, Miles Wang, Neil Chowdhury*, Olivia Watkins, Patrick Chao, Rachel Dias, Renny Hwang, Sam Toizer, Sam Toyer, Samuel Miserendino, Sandhini Agarwal, Saachi Jain, Sasha Baker, Shengjia Zhao, Steph Lin, Tejal Patwardhan, Thomas Degry, Tom Stasi, Troy Peterson, Tyna Eloundou, Lindsay McCallum, Lindsey Held, Yunyun Wang, and Wes McCabe. (*Indicates work done formerly at OpenAI).

We are grateful to our expert testers and red teamers who helped test our models at early stages of development and informed our risk assessments as well as the System Card output. Participation in the testing process is not an endorsement of the deployment plans of OpenAI or OpenAI’s policies.