AI and efficiency | xifan.uno


Footnotes

In the sorting example, the “difficulty” of the problem is the length of the list, n. The average cost of quicksort, a commonly used algorithm, is written in Big O notation as O(n log n).
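As a rough illustration of that scaling, the sketch below counts the pivot comparisons made by a simple randomized quicksort. The function name `quicksort_comparisons` and the list-based partitioning are assumptions chosen for readability, not something taken from the post.

```python
import math
import random

def quicksort_comparisons(items):
    """Sort a copy of `items` with a simple randomized quicksort.

    Returns (sorted_list, comparison_count), where the count increases by
    one for every element examined against a pivot.
    """
    comparisons = 0

    def sort(xs):
        nonlocal comparisons
        if len(xs) <= 1:
            return xs
        pivot = xs[random.randrange(len(xs))]  # random pivot choice
        less, equal, greater = [], [], []
        for x in xs:
            comparisons += 1  # one pivot comparison per element
            if x < pivot:
                less.append(x)
            elif x > pivot:
                greater.append(x)
            else:
                equal.append(x)
        return sort(less) + equal + sort(greater)

    return sort(list(items)), comparisons

if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        data = [random.random() for _ in range(n)]
        _, count = quicksort_comparisons(data)
        # The ratio to n * log2(n) should stay roughly constant as n grows.
        print(n, count, round(count / (n * math.log2(n)), 2))
```

If the printed ratio stays roughly flat while n grows by factors of ten, the measured cost is tracking n log n rather than n².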

Inference costs dominate total costs for successful deployed systems. Inference costs scale with usage of the system, whereas training costs only need to be paid once.
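The arithmetic behind this claim can be sketched as follows. All dollar figures and function names here are hypothetical, picked only to show how a one-time training cost is overtaken by a per-query inference cost as usage grows.

```python
def total_cost(training_cost, cost_per_query, num_queries):
    """One-time training cost plus inference cost that grows with usage."""
    return training_cost + cost_per_query * num_queries

def inference_share(training_cost, cost_per_query, num_queries):
    """Fraction of the total cost that is spent on inference."""
    inference_cost = cost_per_query * num_queries
    return inference_cost / (training_cost + inference_cost)

def breakeven_queries(training_cost, cost_per_query):
    """Number of queries at which inference cost equals training cost."""
    return training_cost / cost_per_query

if __name__ == "__main__":
    # Hypothetical numbers: $1,000 to train, $0.50 per query served.
    training, per_query = 1000.0, 0.5
    print(breakeven_queries(training, per_query))
    for queries in (100, 2_000, 200_000):
        print(queries, round(inference_share(training, per_query, queries), 3))
```

Past the break-even point, every additional query pushes the inference share closer to 1, which is the sense in which inference dominates for heavily used systems.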

Throughout this post we refer to Moore’s Law as the consistent, long-observed 2-year doubling time of flop per dollar (equivalently, a 2-year halving time of dollars/flop). One could also interpret Moore’s Law as the actual trend in dollars/flop, which has recently slowed down.
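Under the fixed 2-year doubling reading, the implied hardware improvement over any period follows from a single exponent. The sketch below is a minimal illustration; the function name and the sample periods are assumptions, not values from the post.

```python
def moore_improvement(years, doubling_time_years=2.0):
    """Factor by which compute per dollar grows after `years`,
    assuming a constant doubling time (2 years by default)."""
    return 2.0 ** (years / doubling_time_years)

if __name__ == "__main__":
    for years in (2, 4, 10):
        factor = moore_improvement(years)
        # Dollars per flop shrink by the reciprocal of the same factor.
        print(years, factor, 1.0 / factor)
```

A slower recent trend corresponds to passing a larger `doubling_time_years`, which lowers the improvement factor for the same period.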

For instance, algorithmic progress could change the complexity class of some task from exponential to polynomial cost. Such efficiency gains on capabilities of interest are intractable to observe directly, though they may be observable through asymptotic analysis or by extrapolating empirically derived scaling laws.

Making credible forecasts on such topics is a substantial enterprise, one we’d rather avoid here than treat insufficiently.

In fact, this work was primarily done by training the PyTorch example models, with tweaks to improve early learning.

ImageNet is the only training data source allowed for the vision benchmark. No human captioning, other images, or other data is allowed. Automated augmentation is OK.

References

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). “ImageNet: A Large-Scale Hierarchical Image Database.” In CVPR09.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). “Imagenet classification with deep convolutional neural networks.” In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25 (pp. 1097–1105). Curran Associates, Inc.

Moore, G. E. (1965). “Cramming more components onto integrated circuits.” Electronics 38(8).

Amodei, D. & Hernandez, D. (2018). “AI and Compute.”

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2014). “Going deeper with convolutions.”

Simonyan, K. & Zisserman, A. (2014). “Very deep convolutional networks for large-scale image recognition.”

He, K., Zhang, X., Ren, S., & Sun, J. (2015). “Deep residual learning for image recognition.”

Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size.”

Zagoruyko, S. & Komodakis, N. (2016). “Wide residual networks.”

Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2016). “Aggregated residual transformations for deep neural networks.”

Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K. Q. (2016). “Densely connected convolutional networks.”

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). “Mobilenets: Efficient convolutional neural networks for mobile vision applications.”

Zhang, X., Zhou, X., Lin, M., & Sun, J. (2017). “Shufflenet: An extremely efficient convolutional neural network for mobile devices.”

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). “Mobilenetv2: Inverted residuals and linear bottlenecks.”

Ma, N., Zhang, X., Zheng, H.-T., & Sun, J. (2018). “Practical guidelines for efficient cnn architecture design.”

Tan, M. & Le, Q. V. (2019). “Efficientnet: Rethinking model scaling for convolutional neural networks.”

Sawyer, E. (2011). “High Throughput Sequencing and Cost Trends.”

Roberts, D. (2019). “Getting to 100% renewables requires cheap energy storage. But how cheap?”

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). “Automatic differentiation in PyTorch.” In NIPS Autodiff Workshop.

Huang, J. (2017). “Shufflenet in pytorch.”

Xiao, H. (2017). “Pytorch mobilenet implementation of ‘mobilenets: Efficient convolutional neural networks for mobile vision applications’.”

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). “Attention is all you need.” CoRR, abs/1706.03762.

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). “Sequence to sequence learning with neural networks.” CoRR, abs/1409.3215.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2018). “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.” Science, 362(6419), 1140–1144.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., & Hassabis, D. (2017). “Mastering the game of go without human knowledge.” Nature, 550, 354–.

OpenAI: Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., de Oliveira Pinto, H. P., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., & Zhang, S. (2019). “Dota 2 with Large Scale Deep Reinforcement Learning.”

Coleman, C. A., Narayanan, D., Kang, D., Zhao, T., Zhang, J., Nardi, L., Bailis, P., Olukotun, K., Ré, C., & Zaharia, M. (2017). “DAWNBench: An End-to-End Deep Learning Benchmark and Competition.” NIPS ML Systems Workshop, 2017.

Perrault, R., Shoham, Y., Brynjolfsson, E., Clark, J., Etchemendy, J., Grosz, B., Lyons, T., Manyika, J., Mishra, S., & Niebles, J. C. (2019). “The AI Index 2019 Annual Report.” Technical report, AI Index Steering Committee, Human-Centered AI Institute, Stanford University, Stanford, CA.

McCandlish, S., Kaplan, J., Amodei, D., & OpenAI Dota Team (2018). “An empirical model of large-batch training.”

van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., Lockhart, E., Cobo, L. C., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D., & Hassabis, D. (2017). “Parallel wavenet: Fast high-fidelity speech synthesis.”

Clark, J. (2019). “Written Testimony of Jack Clark, Policy Director at OpenAI.” Hearing on “Artificial Intelligence: Societal and Ethical Implications” before the House Committee on Science, Space, & Technology.

Authors

Danny Hernandez, Tom Brown

Acknowledgments

We’d like to thank the following people for helpful conversations and/or feedback on this post: Dario Amodei, Jack Clark, Alec Radford, Paul Christiano, Sam McCandlish, Ilya Sutskever, Jacob Steinhardt, Jared Kaplan, Amanda Askell, John Schulman, Jacob Hilton, Asya Bergal, Katja Grace, Ryan Carey, Nicholas Joseph, Geoffrey Irving, Jeff Clune, and Ashley Pilipiszyn.

Thanks to Justin Jay Wang for design.

Thanks to Niki Parmar for providing the relevant points from the original transformer learning curves.

Also thanks to Mingxing Tan for providing the relevant points from EfficientNet learning curves and running an experiment with reduced warmup.