DALL·E: Creating images from text

Footnotes

A token is any symbol from a discrete vocabulary; for humans, each English letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192.
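The token budget in this footnote can be made concrete with a small sketch. The constants come from the footnote; the function name `build_stream`, and the choice to offset image token IDs so they do not collide with text token IDs, are illustrative assumptions and not a description of the actual implementation.

```python
# Sizes stated in the footnote.
TEXT_VOCAB_SIZE = 16384   # BPE vocabulary for captions
IMAGE_VOCAB_SIZE = 8192   # vocabulary of discrete image codes
MAX_TEXT_TOKENS = 256     # maximum caption length in tokens
IMAGE_TOKENS = 1024       # tokens per image


def build_stream(text_tokens, image_tokens):
    """Concatenate caption tokens and image tokens into one sequence.

    Hypothetical helper: shifting image IDs by TEXT_VOCAB_SIZE is an
    assumption made here so both kinds of token share one ID space.
    """
    if len(text_tokens) > MAX_TEXT_TOKENS:
        raise ValueError("caption exceeds 256 tokens")
    if len(image_tokens) != IMAGE_TOKENS:
        raise ValueError("an image must be exactly 1024 tokens")
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_tokens):
        raise ValueError("text token out of range")
    if any(not 0 <= t < IMAGE_VOCAB_SIZE for t in image_tokens):
        raise ValueError("image token out of range")
    return list(text_tokens) + [TEXT_VOCAB_SIZE + t for t in image_tokens]


# Toy inputs: a three-token caption and an image made of code 0 only.
stream = build_stream([5, 17, 42], [0] * IMAGE_TOKENS)
```

Under these assumptions, the longest possible sequence is 256 + 1024 = 1280 tokens.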

The images are preprocessed to 256x256 resolution during training. Similar to VQVAE, each image is compressed to a 32x32 grid of discrete latent codes using a discrete VAE that we pretrained using a continuous relaxation. We found that training using the relaxation obviates the need for an explicit codebook, EMA loss, or tricks like dead code revival, and can scale up to large vocabulary sizes.
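As a rough illustration of what a continuous relaxation of discrete codes can look like, the sketch below applies the Gumbel-softmax relaxation (Jang et al.; Maddison et al., both in the reference list) to a 32x32 grid of logits. The footnote does not say which relaxation was used, so treating it as Gumbel-softmax is an assumption, as are the function name `gumbel_softmax`, the temperature value, and the toy vocabulary of 16 codes used to keep the example small.

```python
import numpy as np

IMAGE_SIZE = 256
GRID = 32                              # 32x32 grid of latent codes
DOWNSAMPLE = IMAGE_SIZE // GRID        # 8x spatial reduction per side
TOKENS_PER_IMAGE = GRID * GRID         # 1024 tokens


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot samples over the last axis of `logits`."""
    # Clip to keep both logarithms finite.
    u = np.clip(rng.random(logits.shape), 1e-9, 1.0 - 1e-9)
    gumbel_noise = -np.log(-np.log(u))
    y = (logits + gumbel_noise) / temperature
    y = y - y.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
# Toy vocabulary of 16 codes (the footnote states 8192 for the real model).
logits = rng.normal(size=(GRID, GRID, 16))
soft_codes = gumbel_softmax(logits, temperature=1.0, rng=rng)
hard_codes = soft_codes.argmax(axis=-1)     # discrete token indices
```

Lower temperatures push each relaxed sample closer to a one-hot vector, which is what makes the relaxation usable as a differentiable stand-in for a discrete choice during training.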

Further details are provided in a later section.

This task is called variable binding, and has been extensively studied in the literature.

References

Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H. (2016). “Generative adversarial text to image synthesis”. In ICML 2016.

Reed, S., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H. (2016). “Learning what and where to draw”. In NIPS 2016.

Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D. (2016). “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks”. In ICCV 2017.

Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D. (2017). “StackGAN++: Realistic image synthesis with stacked generative adversarial networks”. In IEEE TPAMI 2018.

Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X. (2017). “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks”.

Li, W., Zhang, P., Zhang, L., Huang, Q., He, X., Lyu, S., Gao, J. (2019). “Object-driven text-to-image synthesis via adversarial training”. In CVPR 2019.

Koh, J. Y., Baldridge, J., Lee, H., Yang, Y. (2020). “Text-to-image generation grounded by fine-grained user attention”. In WACV 2021.

Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J. (2016). “Plug & play generative networks: Conditional iterative generation of images in latent space”.

Cho, J., Lu, J., Schwenk, D., Hajishirzi, H., Kembhavi, A. (2020). “X-LXMERT: Paint, caption, and answer questions with multi-modal transformers”. In EMNLP 2020.

Kingma, D. P., Welling, M. (2013). “Auto-encoding variational Bayes”. arXiv preprint.

Rezende, D. J., Mohamed, S., Wierstra, D. (2014). “Stochastic backpropagation and approximate inference in deep generative models”. arXiv preprint.

Jang, E., Gu, S., Poole, B. (2016). “Categorical reparameterization with Gumbel-softmax”.

Maddison, C., Mnih, A., Teh, Y. W. (2016). “The Concrete distribution: A continuous relaxation of discrete random variables”.

van den Oord, A., Vinyals, O., Kavukcuoglu, K. (2017). “Neural discrete representation learning”.

Razavi, A., van den Oord, A., Vinyals, O. (2019). “Generating diverse high-fidelity images with VQ-VAE-2”.

Andreas, J., Klein, D., Levine, S. (2017). “Learning with latent language”.

Smolensky, P. (1990). “Tensor product variable binding and the representation of symbolic structures in connectionist systems”.

Plate, T. (1995). “Holographic reduced representations: Convolution algebra for compositional distributed representations”.

Gayler, R. (1998). “Multiplicative binding, representation operators & analogy”.

Kanerva, P. (1997). “Fully distributed representations”.

Primary Authors

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray

Supporting Authors

Mark Chen, Rewon Child, Vedant Misra, Pamela Mishkin, Gretchen Krueger, Sandhini Agarwal, Ilya Sutskever