Posts

DALL-E (Zero-Shot Text-to-Image Generation) -PART(2/2)

  Link to my deep learning blogs: https://rakshithv-deeplearning.blogspot.com/ DALL-E consists of two components: a dVAE (discrete Variational Auto-Encoder) and an autoregressive transformer. The first component is responsible for generating 1024 tokens for an image of size 256x256. More details on this were covered in part 1: https://rakshithv-deeplearning.blogspot.com/2022/04/dall-e-zero-shot-text-to-image.html . The transformer is a decoder-only network with 64 layers; each layer has 62 attention heads with a per-head state size of 64. Most of the ideas are borrowed from the sparse transformer paper, which shows a way to reduce the computation of default self-attention, which is quadratic in time complexity (link to the sparse transformer paper: https://arxiv.org/pdf/1904.10509.pdf ). There are three kinds of attention used in the transformer: row attention, column attention, and causal convolutional attention. From layer 1 to layer 63, we only have row or column attention.
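To make the row/column attention idea concrete, here is a minimal sketch (my own illustration, not the DALL-E code, and ignoring the text tokens) of causal row and column attention masks for image tokens laid out on a square grid; the helper names and the small grid size in the example are assumptions for illustration.

```python
import numpy as np

# Sketch: causal row and column attention masks for image tokens on a
# grid_size x grid_size grid. In row attention a token attends only to earlier
# tokens in its own row; in column attention only to earlier tokens in its own column.

def row_attention_mask(grid_size):
    n = grid_size * grid_size
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):                    # causal: only positions <= q
            if q // grid_size == k // grid_size:  # same row
                mask[q, k] = True
    return mask

def column_attention_mask(grid_size):
    n = grid_size * grid_size
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):
            if q % grid_size == k % grid_size:    # same column
                mask[q, k] = True
    return mask

print(row_attention_mask(4).sum(), column_attention_mask(4).sum())
```

With a 32x32 grid of image tokens, each query attends to at most 32 keys instead of all 1024, which is where the saving over dense, quadratic self-attention comes from.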

DALL-E (Zero-Shot Text-to-Image Generation) -PART(1/2)

  Link to my deep learning blogs: https://rakshithv-deeplearning.blogspot.com/ Last week OpenAI released DALL-E 2: https://twitter.com/OpenAI/status/1511707245536428034?s=20&t=iYtfg3SC-WPupM4IkTeQfA . This system is capable of generating an image from a text description. The Twitter thread below has a few examples generated by DALL-E 2: https://twitter.com/OpenAI/status/1511714511673126914?s=20&t=4iYWQtFoQ326tSzOyGZcUA . The following is my favourite example: DALL-E 2 example (image). In this blog, I want to discuss the technical details of DALL-E (version 1), which was released almost a year ago. I personally felt that paper is richer in content than the recent one. This work is exciting because a system trained on image-text pairs is able to generate a very meaningful image from text it probably hasn't seen (more like OOD); of course, this claim would be more appreciable if there were more transparency about the…

NeurIPS 2021 - Curated papers - Part 2

Link to my deep learning blogs: https://rakshithv-deeplearning.blogspot.com/ Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training. This work aims at Vision-Language Pre-training (VLP), or multi-modal learning through pre-training. 1. Given an image and text, the image is passed through a Vision Transformer (ViT) and the output of the ViT is taken as the visual tokens; for text, tokens are generated with the BERT tokenizer. 2. [CLS]_[Visual tokens]_[SEP]_[Text tokens] are concatenated and masks are generated similar to BERT. 3. The concatenated text and image tokens are given to the Multi-modal Transformer (MT). 4. Pre-training has 3 objective functions (a sketch of the ITM objective is given below): a. MLM (Masked Language Modelling): similar to BERT, predicting the masked tokens in the text. b. ITM (Image-Text Matching): the image is replaced with another image with probability 0.5, and a binary classification (same image or different image) is used to learn inter-modal alignment. c. MFR (Masked Feature Regres…
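As a concrete illustration of the ITM objective (4b), here is a minimal sketch assuming the visual tokens from the ViT and the embedded text tokens are already available; the function and tensor names are my own, not from the paper, and the special tokens are omitted.

```python
import torch

# Sketch of Image-Text Matching: with probability 0.5 the image paired with a
# caption is swapped for another image in the batch, and a binary classifier on
# the multi-modal transformer's [CLS] output must decide if the pair still matches.

def itm_batch(image_tokens, text_tokens, swap_prob=0.5):
    """image_tokens: (B, Nv, D) visual tokens, text_tokens: (B, Nt, D) text embeddings."""
    B = image_tokens.size(0)
    swap = torch.rand(B) < swap_prob
    perm = torch.randperm(B)
    mixed_images = torch.where(swap[:, None, None], image_tokens[perm], image_tokens)
    labels = (~swap).float()                      # 1 = matching pair, 0 = mismatched pair
    seq = torch.cat([mixed_images, text_tokens], dim=1)  # concatenated multi-modal sequence
    return seq, labels

image_tokens = torch.randn(4, 197, 768)
text_tokens = torch.randn(4, 32, 768)
seq, labels = itm_batch(image_tokens, text_tokens)
print(seq.shape, labels)
```

The multi-modal transformer consumes `seq`, and a head on its [CLS] output is trained with binary cross-entropy against `labels`.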

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

  Why: Adaptation of NLP's famous transformer architecture for vision tasks; state-of-the-art results have been achieved with relatively less computational resources compared to convolutions. How: Convert the image into a sequence of patches and treat them as tokens, like we do in NLP applications; feed the embedding of each patch as input to the transformer, and classification happens after an MLP head (a sketch of the patch embedding is given below). What: It is called a Vision Transformer, where how much attention needs to be paid between patches is learned. TL;DR: 1. This has worked well for large datasets compared to moderate ones, because it doesn't capture local properties (edges shared across 2 different patches) and hence doesn't generalize as well on smaller data. 2. Previous approaches tend to work at the pixel level, focusing on self-attention within a neighbourhood, or on a mixture of CNN + transformer. 3. Present work: architecture mentioned in the paper. Points to ponder: 1. What if we change the order in which we give the crops (one example: randomly changing the order or overl…
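To make the "image as a sequence of patches" idea concrete, here is a minimal sketch (my own illustration, assuming 224x224 inputs and 16x16 patches) of turning an image into patch tokens and projecting them with a learned patch embedding.

```python
import torch

# Sketch: split an image into 16x16 patches, flatten each patch, and project it
# to the transformer width, giving the token sequence the ViT consumes.

def patchify(images, patch=16):
    """images: (B, C, H, W) -> (B, num_patches, C * patch * patch)"""
    B, C, H, W = images.shape
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return patches

images = torch.randn(2, 3, 224, 224)
tokens = patchify(images)                        # (2, 196, 768): 14x14 patches of 16x16x3 values
embed = torch.nn.Linear(tokens.size(-1), 768)    # learned patch embedding
x = embed(tokens)                                # sequence fed to the transformer
# (the [CLS] token and position embeddings would be added before the encoder)
```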

MLP-Mixer: An all-MLP Architecture for Vision

  Why: Comparable results can be achieved for vision-related tasks without using a CNN or ViT (self-attention), simply by using MLPs. How: Like in Vision Transformers, image patches are fed as input tokens and processed through MLPs. What: This technique is called MLP-Mixer; it has 2 kinds of MLPs, one which interacts with the channels and another which interacts across the spatial region. TL;DR: My interpretation of the architecture (architecture figure from the paper). The architecture is self-explanatory; the main idea is that we have 2 types of MLP layers: MLP1 deals with the channel component (its own patch) and MLP2 interacts across the spatial region (other patches); a sketch of one Mixer block is given below. Points to ponder: 1. The property of parameter sharing still isn't used as fully as in CNNs. 2. As with Vision Transformers, it would be interesting to observe what happens if we change the order of the patches or use overlapping patches. Link to the paper: https://arxiv.org/pdf/2105.01601.pdf
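Here is a minimal sketch of one Mixer block, assuming the standard token-mixing/channel-mixing layout with LayerNorm and residual connections; the class name and hidden sizes are my own choices, not taken from the paper.

```python
import torch
from torch import nn

# Sketch of a Mixer block: a token-mixing MLP acts across patches (spatial mixing),
# a channel-mixing MLP acts within each patch, each with a residual connection.

class MixerBlock(nn.Module):
    def __init__(self, num_patches, channels, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, channels))

    def forward(self, x):                          # x: (B, num_patches, channels)
        y = self.norm1(x).transpose(1, 2)          # (B, channels, num_patches): mix across patches
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))    # mix across channels within each patch
        return x

x = torch.randn(2, 196, 512)
print(MixerBlock(196, 512)(x).shape)               # torch.Size([2, 196, 512])
```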

Emerging Properties in Self-Supervised Vision Transformers (DINO)

  Why: It captures semantic properties of the image better than a normal ViT; it achieves good accuracy with a k-NN classifier, which means the representations of different classes are well separated for the final classification. How: Similar to typical contrastive learning, different augmented views are passed through 2 networks (student and teacher), and the student network then learns to match the probability distribution of the teacher network (a sketch of this loss is given below). What: It is called DINO (self-distillation with no labels); instead of learning the difference between representations, it is about matching the representations. TL;DR: 1. Augmentation of the images (multi-crop, Gaussian blur, etc.). 2. All views are passed through the student network, and only the global views are passed through the teacher network. 3. For a given image, V different views can be generated (with at least 2 global views). 4. The student and teacher networks share the same architecture (ViT or ConvNet). 5. The network outputs a K-dimensional dis…
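Here is a minimal sketch of the distillation loss the student is trained with, assuming the teacher's output is centered and sharpened with a lower temperature; the names, temperatures, and the way the center is computed here are illustrative, not the official DINO code.

```python
import torch
import torch.nn.functional as F

# Sketch of the DINO objective: the student's K-dimensional distribution is
# trained with cross-entropy to match the teacher's centered, sharpened distribution.

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """student_out, teacher_out: (B, K) raw outputs for views of the same image."""
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()  # teacher: no gradient
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

student_out = torch.randn(8, 65536)
teacher_out = torch.randn(8, 65536)
center = teacher_out.mean(dim=0)   # in practice the center is an exponential moving average
print(dino_loss(student_out, teacher_out, center))
```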

Einsum equation:

  It's an elegant way to perform matrix or vector manipulations. I find it extremely useful when I have to multiply matrices of higher dimensions; it gives great flexibility to sum and multiply along certain axes. Ex: if you have to multiply matrix A of shape (1,200,2,32) with matrix B of shape (2,32,32), resulting in a matrix C of shape (1,200,32), this can be implemented as follows: np.einsum('abcd,cde->abe', A, B). That's it! It can be implemented similarly in TensorFlow & PyTorch. I was going through the Keras implementation of multi-head attention: https://github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/python/keras/layers/multi_head_attention.py#L124-L516 In the "Attention is All You Need" paper, they concatenate the different heads, but in the implementation they multiply the different heads with a weight matrix; I will be discussing a few examples from this paper. Syntax of the "einsum" equation: np.einsum('shape_of_A, shape_of_B -> shape_of_C', A, B). Le…
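Here is a quick, runnable check of the example above (the shapes are as stated; the variable names are mine):

```python
import numpy as np

# Contract the shared (2, 32) axes of A and B, as in the example above.
A = np.random.rand(1, 200, 2, 32)   # labelled a, b, c, d
B = np.random.rand(2, 32, 32)       # labelled c, d, e
C = np.einsum('abcd,cde->abe', A, B)
print(C.shape)                      # (1, 200, 32)

# Equivalent explicit version: flatten the shared axes into one and use matmul.
C_ref = A.reshape(1, 200, 64) @ B.reshape(64, 32)
print(np.allclose(C, C_ref))        # True
```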