DALL-E (Zero-Shot Text-to-Image Generation) -PART(2/2)
Link to my deep learning blogs: https://rakshithv-deeplearning.blogspot.com/

DALL-E consists of two components: a dVAE (discrete Variational Auto-Encoder) and an autoregressive transformer. The dVAE is responsible for compressing a 256x256 image into a grid of 1024 image tokens. More details on this were covered in part 1: https://rakshithv-deeplearning.blogspot.com/2022/04/dall-e-zero-shot-text-to-image.html

The transformer is a decoder-only network with 64 layers; each layer has 62 attention heads, and each head has a state size of 64. Most of the ideas are borrowed from the sparse transformer paper, which shows a way to reduce the computation of default self-attention, whose cost is quadratic in sequence length (link to the sparse transformer paper: https://arxiv.org/pdf/1904.10509.pdf). Three kinds of attention masks are used in the transformer: row attention, column attention, and (causal) convolutional attention. From layer 1 to layer 63, only row or column attention is used; the sketch below illustrates what these masks look like.
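As a rough illustration (not the paper's code), the sketch below builds causal row- and column-attention masks over a flattened 32x32 grid of image tokens, in the spirit of the sparse-attention patterns DALL-E borrows. The grid size (32x32 = 1024 tokens) comes from the dVAE; the mask-building functions and their names are my own assumptions for illustration.

```python
# Sketch of row / column attention masks for a flattened 32x32 token grid.
# Assumption: masks are boolean matrices where mask[q, k] = True means
# query position q is allowed to attend to key position k.
import numpy as np

def row_attention_mask(grid: int = 32) -> np.ndarray:
    """Each token attends to earlier tokens (causal) in the same row of the grid."""
    n = grid * grid
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):            # causal: only keys up to the query
            if k // grid == q // grid:    # same row in the 2D grid
                mask[q, k] = True
    return mask

def column_attention_mask(grid: int = 32) -> np.ndarray:
    """Each token attends to earlier tokens (causal) in the same column of the grid."""
    n = grid * grid
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):
            if k % grid == q % grid:      # same column in the 2D grid
                mask[q, k] = True
    return mask

if __name__ == "__main__":
    # Tiny 4x4 grid so the masks are easy to inspect by eye.
    row_mask = row_attention_mask(4)
    col_mask = column_attention_mask(4)
    # Each query now attends to at most `grid` keys instead of all previous
    # tokens, which is how sparse attention cuts the quadratic cost.
    print(row_mask.sum(axis=1))           # number of visible keys per query
    print(col_mask.sum(axis=1))
```

The point of the sketch is only to show why these patterns are cheaper: with a full causal mask every query can see up to 1024 keys, while a row or column mask limits it to at most 32, which is where the savings over default self-attention come from.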