DALL-E (Zero-Shot Text-to-Image Generation) -PART(2/2)
Link to my deep learning blogs: https://rakshithv-deeplearning.blogspot.com/

DALL-E consists of two components: a dVAE (discrete Variational Auto-Encoder) and an autoregressive transformer. The dVAE is responsible for compressing a 256x256 image into a grid of 1024 image tokens. More details on this were covered in part 1: https://rakshithv-deeplearning.blogspot.com/2022/04/dall-e-zero-shot-text-to-image.html

The transformer is a decoder-only network with 64 layers; each layer has 62 attention heads, and each head has a state size of 64. Most of the ideas are borrowed from the sparse transformer paper, which shows a way to reduce the computation of default self-attention, whose cost is quadratic in sequence length (link to the sparse transformer paper: https://arxiv.org/pdf/1904.10509.pdf). Three kinds of attention masks are used in the transformer: row attention, column attention, and (causal) convolutional attention. From layer 1 to layer 63, only row or column attention is used; the sketch below illustrates what these masks look like.
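As a rough illustration (not the paper's code), the sketch below builds causal row- and column-attention masks over a flattened 32x32 grid of image tokens, in the spirit of the sparse-attention patterns DALL-E borrows. The grid size (32x32 = 1024 tokens) comes from the dVAE; the mask-building functions and their names are my own assumptions for illustration.

```python
# Sketch of row / column attention masks for a flattened 32x32 token grid.
# Assumption: masks are boolean matrices where mask[q, k] = True means
# query position q is allowed to attend to key position k.
import numpy as np

def row_attention_mask(grid: int = 32) -> np.ndarray:
    """Each token attends to earlier tokens (causal) in the same row of the grid."""
    n = grid * grid
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):            # causal: only keys up to the query
            if k // grid == q // grid:    # same row in the 2D grid
                mask[q, k] = True
    return mask

def column_attention_mask(grid: int = 32) -> np.ndarray:
    """Each token attends to earlier tokens (causal) in the same column of the grid."""
    n = grid * grid
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):
            if k % grid == q % grid:      # same column in the 2D grid
                mask[q, k] = True
    return mask

if __name__ == "__main__":
    # Tiny 4x4 grid so the masks are easy to inspect by eye.
    row_mask = row_attention_mask(4)
    col_mask = column_attention_mask(4)
    # Each query now attends to at most `grid` keys instead of all previous
    # tokens, which is how sparse attention cuts the quadratic cost.
    print(row_mask.sum(axis=1))           # number of visible keys per query
    print(col_mask.sum(axis=1))
```

The point of the sketch is only to show why these patterns are cheaper: with a full causal mask every query can see up to 1024 keys, while a row or column mask limits it to at most 32, which is where the savings over default self-attention come from.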