Image GPT | xifan.uno

Footnotes

Measured through logistic regression on learned features (linear probe).

A transformer is trained to maximize likelihood and is therefore mode covering, which automatically ensures the diversity of its samples.

The original analysis-by-synthesis idea is more of an argument for generative models with latent variables, but because generative models without latent variables were so much better at modeling the data distribution, we thought the analysis-by-synthesis conjecture should hold for them as well.

We only show linear probe accuracy on ImageNet for iGPT-XL since other experiments did not finish before we needed to transition to different supercomputing facilities.

To extract features for a linear probe, we take the post-layernorm attention block inputs at some layer and average pool over the sequence dimension.

To fine-tune, we take the post-layernorm transformer output and average pool over the sequence dimension, and use the result as input for the classification head.

A generative model which learns features in a purely unsupervised fashion.
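
The pooling described in the footnotes on linear probes and fine-tuning can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the function names (`layer_norm`, `pooled_features`, `classification_logits`), the tensor shapes, and the layer norm without learned gain and bias are all assumptions made for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector over the feature dimension.
    # Assumption: no learned gain/bias, to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pooled_features(hidden):
    # hidden: (batch, seq_len, d_model) activations from some layer.
    # Apply layer norm, then average pool over the sequence dimension.
    return layer_norm(hidden).mean(axis=1)  # (batch, d_model)

def classification_logits(features, weight, bias):
    # A single linear map from pooled features to class logits.
    # For a linear probe only weight/bias would be trained on frozen
    # features; for fine-tuning the whole model would be updated too.
    return features @ weight + bias

# Hypothetical shapes: 2 images, 5 tokens each, 8 features, 3 classes.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 8))
weight = rng.normal(size=(8, 3))
bias = np.zeros(3)

feats = pooled_features(hidden)
logits = classification_logits(feats, weight, bias)
print(feats.shape, logits.shape)
```

In this sketch the two settings share the same pooling step and differ only in which activations are pooled (an intermediate layer for the probe, the final transformer output for fine-tuning) and in which parameters would be trained.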


Authors

Mark Chen, Alec Radford, Ilya Sutskever

Acknowledgments

Foremost, we would like to acknowledge our paper co-authors Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, and David Luan.

Thanks to the following for their feedback on this work and contributions to this release: Vedant Misra, Noah Golmant, Johannes Otterbach, Pranav Shyam, Aditya Ramesh, Yura Burda, Harri Edwards, Chris Hallacy, Jeff Clune, Jack Clark, Irene Solaiman, Ryan Lowe, Greg Brockman, Kelly Sims, David Farhi, Will Guss, Quoc V. Le, and Ashish Vaswani.

Editor: Ashley Pilipiszyn

Design: Justin Jay Wang

Cover artwork: Ben Barry