Learning to summarize with human feedback

Footnotes

For training, we use the Reddit TL;DR dataset instead of the more popular CNN/DM dataset because simple copying baselines perform better than the human-written reference summaries on CNN/DM, which is not the case for TL;DR (see Appendix D of our paper). We performed a new web crawl to increase the TL;DR dataset size, required summaries to be between 24 and 48 tokens, and performed some other cleaning and filtering.

We hire human labelers to judge summary quality, and implement quality control to ensure that labeler judgments agree with our own. We describe our human data collection procedure below.

Interestingly, we found that human evaluators preferred the Lead-3 baseline (taking the first 3 sentences of the article) to the dataset’s reference summaries, and we confirmed this ourselves.
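A Lead-3 baseline of this kind can be sketched in a few lines. This is a minimal illustration, not the evaluation code: the function name `lead_n` and the naive regex sentence splitter are assumptions, since the footnote does not say how sentences were segmented.

```python
import re


def lead_n(article: str, n: int = 3) -> str:
    """Return the first n sentences of an article as its 'summary'."""
    # Naive split: break on whitespace that follows '.', '!' or '?'.
    # Real articles (abbreviations, quotes) would need a proper sentence tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:n])


article = "Storms hit the coast. Thousands lost power! Will repairs take long? Crews are working."
print(lead_n(article))
```

Because the baseline only copies text, it needs no training, which is part of why it is a useful sanity check on reference-summary quality.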

We generate all of our samples at temperature 0, which we found humans preferred most.
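For readers unfamiliar with the term, temperature 0 is usually taken to mean greedy decoding: the most likely token is picked at every step. The sketch below illustrates that convention on a toy list of logits; the function name `sample_token` and the logit values are hypothetical, and this is not the sampling code used for the models.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token index from logits at the given sampling temperature."""
    if temperature == 0:
        # Limit of temperature -> 0: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [logit / temperature for logit in logits]
    top = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0]
print(sample_token(toy_logits, temperature=0))    # deterministic
print(sample_token(toy_logits, temperature=1.0))  # random draw
```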

While we use human-written TL;DRs as our main point of comparison, they don’t always represent optimal human performance; they are sometimes intended to be funny or to summarize only a part of the post, and their grammar and style are all over the map.

We control for length by training a logistic regression model to predict the preferred summary given only the policy ID and the log ratio of the lengths of the summaries. Then, we report the regression coefficients on each policy ID, corresponding to a length ratio of 1 with the reference summaries.
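The regression described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: the data layout (each comparison pits one policy's summary against the reference), the plain gradient-descent fit, the synthetic data, and all names are made up for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_length_controlled(comparisons, step=0.5, epochs=500):
    """Fit P(policy summary preferred) = sigmoid(b[policy] + w * log_len_ratio).

    comparisons: list of (policy_id, policy_len, reference_len, preferred),
    where preferred is 1 if the policy's summary beat the reference, else 0.
    Returns (per-policy coefficients, length coefficient).
    """
    policies = sorted({c[0] for c in comparisons})
    b = {p: 0.0 for p in policies}  # one coefficient per policy ID (one-hot feature)
    w = 0.0                         # coefficient on log(policy_len / reference_len)
    n = len(comparisons)
    for _ in range(epochs):
        grad_b = {p: 0.0 for p in policies}
        grad_w = 0.0
        for policy, policy_len, ref_len, preferred in comparisons:
            x = math.log(policy_len / ref_len)
            err = sigmoid(b[policy] + w * x) - preferred
            grad_b[policy] += err
            grad_w += err * x
        for p in policies:
            b[p] -= step * grad_b[p] / n
        w -= step * grad_w / n
    return b, w


def synthetic_comparisons(seed=0, per_policy=400):
    # Made-up data: policy "A" is better than "B", and longer summaries win more often.
    rng = random.Random(seed)
    true_b, true_w = {"A": 1.0, "B": -0.5}, 2.0
    data = []
    for policy in true_b:
        for _ in range(per_policy):
            ref_len = rng.randint(24, 48)
            policy_len = rng.randint(24, 48)
            x = math.log(policy_len / ref_len)
            preferred = int(rng.random() < sigmoid(true_b[policy] + true_w * x))
            data.append((policy, policy_len, ref_len, preferred))
    return data


b, w = fit_length_controlled(synthetic_comparisons())
for policy, coef in b.items():
    # At length ratio 1 the log ratio is 0, so the length term vanishes.
    print(policy, "length-controlled preference rate:", round(sigmoid(coef), 2))
```

Because there is no separate intercept, each policy coefficient is the log-odds of that policy being preferred when the two summaries have equal length, which is why it can be read as a length-controlled quality estimate.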

We took this approach because it is hard to directly compare our TL;DR-trained models to models trained on CNN/DM; the CNN/DM summaries are much longer and written in bullet-point form.

In terms of ROUGE results on CNN/DM, our 6.7B supervised models are a bit worse than T5, but a bit better than state-of-the-art models from mid-2019.

Our main models are trained on about 65K comparisons, though we achieve good results with as few as 8K comparisons.

Specifically, we use Upwork, Scale, and Lionbridge. Our contractors have a range of ages, genders, and educational backgrounds, and are mostly American or Filipino (see Appendix C of our paper for demographic data).

Our criteria for hiring contractors were: (1) they were willing to do the task, and (2) they passed a minimum threshold of speed and agreement with researcher labels. We paid all our contractors at least $15/hr.

This is impressive relative to the TL;DR reference summaries, which get a perfect overall score 23% of the time, but indicates there is still room for improvement.

References

Völske, M., Potthast, M., Syed, S., & Stein, B. (2017). “TL;DR: Mining Reddit to learn automatic summarization.” In Proceedings of the Workshop on New Frontiers in Summarization 2017.

Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., & Blunsom, P. (2015). “Teaching machines to read and comprehend.” In Advances in Neural Information Processing Systems 2015.

Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020). “On faithfulness and factuality in abstractive summarization.” arXiv preprint.

Sheng, E., Chang, K. W., Natarajan, P., & Peng, N. (2019). “The woman worked as a babysitter: On biases in language generation.” arXiv preprint.

Bordia, S., & Bowman, S. R. (2019). “Identifying and reducing gender bias in word-level language models.” arXiv preprint.

Nadeem, M., Bethke, A., & Reddy, S. (2020). “StereoSet: Measuring stereotypical bias in pretrained language models.” arXiv preprint.

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2019). “Fine-tuning language models from human preferences.” arXiv preprint.

Böhm, F., Gao, Y., Meyer, C. M., Shapira, O., Dagan, I., & Gurevych, I. (2019). “Better rewards yield better summaries: Learning to summarise without references.” arXiv preprint.

Jaques, N., Ghandeharioun, A., Shen, J. H., Ferguson, C., Lapedriza, A., Jones, N., Gu, S., & Picard, R. (2019). “Way off-policy batch deep reinforcement learning of implicit human preferences in dialog.” arXiv preprint.

Yi, S., Goel, R., Khatri, C., Cervone, A., Chung, T., Hedayatnia, B., ... & Hakkani-Tur, D. (2019). “Towards coherent and engaging spoken dialog response generation using automatic conversation evaluators.” arXiv preprint.

Hancock, B., Bordes, A., Mazare, P. E., & Weston, J. (2019). “Learning from dialogue after deployment: Feed yourself, chatbot!” arXiv preprint.

Lawrence, C., & Riezler, S. (2018). “Improving a neural semantic parser by counterfactual learning from human bandit feedback.” arXiv preprint.

Kreutzer, J., Khadivi, S., Matusov, E., & Riezler, S. (2018). “Can neural machine translation be improved with user feedback?” arXiv preprint.

Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., ... & Bengio, Y. (2016). “An actor-critic algorithm for sequence prediction.” arXiv preprint.

Zhou, W., & Xu, K. (2020). “Learning to compare for better training and evaluation of open domain natural language generation models.” In AAAI 2020.

Cho, W., Zhang, P., Zhang, Y., Li, X., Galley, M., Brockett, C., Wang, M., & Gao, J. (2018). “Towards coherent and cohesive long-form text generation.” arXiv preprint.

Perez, E., Karamcheti, S., Fergus, R., Weston, J., Kiela, D., & Cho, K. (2019). “Finding generalizable evidence by learning to convince Q&A models.” arXiv preprint.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). “Deep reinforcement learning from human preferences.” In Advances in Neural Information Processing Systems 2017.

Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., & Amodei, D. (2018). “Reward learning from human preferences and demonstrations in Atari.” In Advances in Neural Information Processing Systems 2018.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Agarwal, S. (2020). “Language models are few-shot learners.” arXiv preprint.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2019). “Exploring the limits of transfer learning with a unified text-to-text transformer.” arXiv preprint.

Zhang, Y., Li, D., Wang, Y., Fang, Y., & Xiao, W. (2019). “Abstractive text summarization with a convolutional seq2seq model.” In Applied Sciences.

Christiano, P., Shlegeris, B., & Amodei, D. (2018). “Supervising strong learners by amplifying weak experts.” arXiv preprint.

Authors

Nisan Stiennon, Paul Christiano, Daniel Ziegler, Ryan Lowe, Jeffrey Wu, Chelsea Voss, Long Ouyang

Acknowledgments

We’d like to thank the following people who gave feedback on various iterations of the blog post: Douwe Kiela, Zach Lipton, Alex Irpan, Jack Clark, Jacob Hilton, Raul Puri, Miles Brundage, Greg Brockman, Ilya Sutskever, Kelly Sims, Wojciech Kryscinski, and Dzmitry Bahdanau. We’d also like to thank Justin Jay Wang for driving the blog post design, Ashley Pilipiszyn for editing, Alec Radford and Dario Amodei for guidance on the project, Shan Carter for help designing the main diagram, Gretchen Krueger for co-writing the model card, Beth Barnes for help with labeler hiring and general encouragement, and many other people at OpenAI for training our large pre-trained models, supporting us through computing infrastructure improvements and maintenance, and writing fast GPU kernels. Finally, we’d like to thank all of our contractors for providing the data that was essential for training the models in this post.