
Showing posts from April, 2022

DALL-E (Zero-Shot Text-to-Image Generation) - Part (2/2)

Link to my deep learning blogs: https://rakshithv-deeplearning.blogspot.com/

DALL-E consists of two components: the first is the d-VAE (discrete Variational Auto-Encoder) and the second is an autoregressive transformer. The first component is responsible for generating 1024 tokens for an image of size 256x256; more details on this were covered in part 1: https://rakshithv-deeplearning.blogspot.com/2022/04/dall-e-zero-shot-text-to-image.html .

The transformer is a decoder-only network with 64 layers; each layer has 62 attention heads, with a per-head state size of 64. Most of the ideas are borrowed from the sparse transformer paper, which shows a way to reduce the computation of default self-attention, which is quadratic in sequence length (link to the sparse transformer paper: https://arxiv.org/pdf/1904.10509.pdf ). Three kinds of attention are used in the transformer: row attention, column attention, and causal convolutional attention. From layer 1 to layer 63 we only have row or column attention; the final layer uses causal convolutional attention.
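The row/column idea above can be sketched as attention masks over the flattened grid of image tokens. This is a minimal illustrative sketch, not DALL-E's actual implementation: it assumes the image tokens are flattened row-major, and uses a small 4x4 grid (DALL-E uses 32x32 = 1024 tokens) so the masks stay readable.

```python
# Hypothetical sketch of row- and column-attention masks in the spirit of
# the sparse transformer paper; sizes and helper names are illustrative.
GRID = 4            # DALL-E uses a 32x32 grid of image tokens
N = GRID * GRID     # flattened sequence length

def row_attention_mask(grid=GRID):
    """Causal mask: each token attends only to earlier tokens
    (including itself) in the same row of the image grid."""
    n = grid * grid
    return [[k <= q and k // grid == q // grid for k in range(n)]
            for q in range(n)]

def column_attention_mask(grid=GRID):
    """Causal mask: each token attends only to earlier tokens
    (including itself) in the same column of the image grid."""
    n = grid * grid
    return [[k <= q and k % grid == q % grid for k in range(n)]
            for q in range(n)]

def allowed_pairs(mask):
    """Count the query-key pairs the mask permits."""
    return sum(sum(row) for row in mask)

# A dense causal mask over n tokens allows n*(n+1)/2 pairs; each sparse
# mask allows only about n*sqrt(n), which is where the savings come from.
print(N * (N + 1) // 2,
      allowed_pairs(row_attention_mask()),
      allowed_pairs(column_attention_mask()))
```

For the 4x4 grid this prints 136 allowed pairs for the dense causal mask versus 40 for each sparse mask; at 32x32 the gap is far larger, which is why stacking alternating row and column layers is much cheaper than full self-attention while still letting information flow across the whole grid.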

DALL-E (Zero-Shot Text-to-Image Generation) - Part (1/2)

Link to my deep learning blogs: https://rakshithv-deeplearning.blogspot.com/

Last week OpenAI released DALL-E 2: https://twitter.com/OpenAI/status/1511707245536428034?s=20&t=iYtfg3SC-WPupM4IkTeQfA . This system is essentially capable of generating an image from a text description. The Twitter thread below has a few examples generated by DALL-E 2: https://twitter.com/OpenAI/status/1511714511673126914?s=20&t=4iYWQtFoQ326tSzOyGZcUA . The following is my favourite example:

DALL-E 2 example

In this blog, I want to discuss the technical details of DALL-E (version 1), which was released almost a year ago. I personally felt that paper is more content-rich than the recent one. This work is exciting because a system trained on image-text pairs is able to generate a very meaningful image from text it probably hasn't seen before (more like out-of-distribution input); of course, this claim would be more convincing if there were more transparency about the