
Showing posts from December, 2021

NeurIPS 2021 - Curated papers - Part 2

Link to my deep learning blogs: https://rakshithv-deeplearning.blogspot.com/

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

This work targets Vision-Language Pre-training (VLP), i.e. multi-modal learning through pre-training.
1. Given an image and a text, the image is passed through a Vision Transformer (ViT) and the output of the ViT is taken as the visual tokens. For the text, tokens are generated with the BERT tokenizer.
2. [CLS]_[Visual tokens]_[SEP]_[Text tokens] are concatenated, and masks are generated as in BERT.
3. The concatenated text and image tokens are given to the Multi-modal Transformer (MT).
4. Pre-training has 3 objective functions (a sketch of the concatenation and the ITM objective is given below):
a. MLM (Masked Language Modelling): almost the same as in BERT, predicting the masked tokens in the text.
b. ITM (Image-Text Matching): the image is replaced with another image with probability 0.5, and a binary classification (same image or different image) is used to learn inter-modal alignment.
c. MFR (Masked Feature Regression): …
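A minimal sketch of the pipeline above, showing the [CLS]/[SEP] concatenation and the ITM objective. The module names, dimensions, and the 0.5-probability image swap are illustrative placeholders, not the paper's implementation.

# Minimal sketch of the token concatenation + ITM objective described above.
# All module names and sizes are illustrative, not the paper's actual code.
import torch
import torch.nn as nn

class TinyVLP(nn.Module):
    def __init__(self, d_model=64, n_text_vocab=1000):
        super().__init__()
        self.text_emb = nn.Embedding(n_text_vocab, d_model)        # stand-in for BERT embeddings
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))        # [CLS]
        self.sep = nn.Parameter(torch.randn(1, 1, d_model))        # [SEP]
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.mt = nn.TransformerEncoder(enc_layer, num_layers=2)   # multi-modal transformer
        self.itm_head = nn.Linear(d_model, 2)                      # matched vs. mismatched

    def forward(self, visual_tokens, text_ids):
        # visual_tokens: (B, Nv, d_model), assumed to come from a ViT
        # text_ids:      (B, Nt) integer token ids
        B = visual_tokens.size(0)
        text_tokens = self.text_emb(text_ids)
        seq = torch.cat([self.cls.expand(B, -1, -1), visual_tokens,
                         self.sep.expand(B, -1, -1), text_tokens], dim=1)
        out = self.mt(seq)
        return self.itm_head(out[:, 0])           # ITM logits taken from the [CLS] position

# ITM training step (sketch): with probability 0.5 pair the text with a different image.
model = TinyVLP()
visual = torch.randn(8, 16, 64)                   # pretend ViT outputs, batch of 8
text = torch.randint(0, 1000, (8, 12))
mismatch = torch.rand(8) < 0.5
visual_in = visual.clone()
visual_in[mismatch] = visual[torch.randperm(8)][mismatch]   # swap in other images
labels = (~mismatch).long()                       # 1 = matched pair, 0 = mismatched
loss = nn.CrossEntropyLoss()(model(visual_in, text), labels)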

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Why: Adapts NLP's famous transformer architecture to vision tasks; state of the art is achieved with relatively less compute than convolutional networks.

How: Convert the image into a sequence of patches and treat the patches as tokens, as we do in NLP applications; feed the patch embeddings as input to a transformer, and classification happens after an MLP head (a patch-embedding sketch is given below).

What: It is called the Vision Transformer, and how much attention should be paid between patches is learned.

TL;DR:
1. This has worked well for large datasets compared to moderate ones, because it doesn't capture local properties (e.g. edges shared across 2 different patches) and hence doesn't generalize as well from less data.
2. Previous approaches tend to work at the pixel level, focusing on self-attention over a neighborhood, or on CNN+transformer hybrids.
3. Present work: the architecture described in the paper.

Points to ponder:
1. What if we change the order in which we give the crops (one example: randomly changing the order, or using overlapping crops)? …
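A minimal sketch of the patch-embedding step described above: the image is sliced into non-overlapping 16x16 patches, each patch is linearly projected, and a [CLS] token plus positional embeddings are added. The sizes are the usual ViT-Base defaults, used here only for illustration.

# Minimal sketch of "image -> sequence of patch tokens".
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        # A conv with kernel = stride = patch size slices the image into
        # non-overlapping patches and linearly projects each one.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                         # x: (B, 3, 224, 224)
        x = self.proj(x)                          # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)          # (B, 196, dim) patch tokens
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos   # prepend [CLS], add positions

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))     # (2, 197, 768), ready for a transformer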

MLP-Mixer: An all-MLP Architecture for Vision

Why: Comparable results can be achieved on vision tasks without using a CNN or a ViT (self-attention), simply by using MLPs.

How: As in Vision Transformers, image patches are fed in as input tokens and are processed through MLPs.

What: The technique is called MLP-Mixer; it has 2 kinds of MLPs, one that interacts with the channels and another that interacts across the spatial region (a sketch of one mixer layer is given below).

TL;DR: My interpretation of the architecture (architecture figure from the paper): the architecture is self-explanatory; the main idea is that there are 2 types of MLP layers, where MLP1 deals with the channel component (its own patch) and MLP2 interacts across the spatial region (other patches).

Points to ponder:
1. The property of parameter sharing still isn't used as fully as in a CNN.
2. As with Vision Transformers, it would be interesting to observe what happens if we change the order of patches or use overlapping patches.

Link to the paper: https://arxiv.org/pdf/2105.01601.pdf
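A minimal sketch of one Mixer layer under the description above: a token-mixing MLP that acts across patches (the "MLP2" role) followed by a channel-mixing MLP that acts within each patch (the "MLP1" role). The hidden sizes and patch count are illustrative, not the paper's exact configuration.

# One Mixer layer: token mixing (across patches) then channel mixing (within a patch).
import torch
import torch.nn as nn

def mlp(dim, hidden):
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

class MixerLayer(nn.Module):
    def __init__(self, n_patches=196, dim=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mix = mlp(n_patches, 256)     # "MLP2": mixes across patches
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mix = mlp(dim, 2048)        # "MLP1": mixes a patch's own channels

    def forward(self, x):                        # x: (B, n_patches, dim)
        # Token mixing: transpose so the MLP acts along the patch axis.
        x = x + self.token_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Channel mixing: MLP acts along the channel axis of each patch.
        return x + self.channel_mix(self.norm2(x))

out = MixerLayer()(torch.randn(2, 196, 512))     # (2, 196, 512)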

Emerging Properties in Self-Supervised Vision Transformers (DINO)

Why: Its features carry more information about the semantic properties of an image than a normally trained ViT; it achieves good accuracy with a k-NN classifier, which means representations of different classes are well separated for the final classification.

How: Similar to typical contrastive learning, different augmented views are passed through 2 networks (student and teacher), and the student network learns to match the probability distribution of the teacher network.

What: It is called DINO (self-distillation with no labels); instead of learning the difference between representations, it is about matching the representations (a sketch of the matching step is given below).

TL;DR:
1. Augmentation of images (multi-crop, Gaussian blur, etc.).
2. All views are passed through the student network and only the global views are passed through the teacher network.
3. For a given image, V different views can be generated (with at least 2 global views).
4. The student and teacher networks share the same architecture (ViT or ConvNet).
5. The network outputs a K-dimensional distribution…
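A minimal sketch of the student-teacher matching step, assuming a toy backbone: the teacher sees only a global view, its output is centered and sharpened, the student is trained with a cross-entropy loss to match it, and the teacher weights are updated as an EMA of the student. The temperatures, momentum, and backbone are illustrative choices, not the paper's exact values.

# Student matches the (centered, sharpened) teacher distribution; no labels used.
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 128
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, K))
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, K))
teacher.load_state_dict(student.state_dict())            # same architecture, same init
for p in teacher.parameters():
    p.requires_grad = False                               # teacher gets no gradients

center = torch.zeros(K)

def dino_loss(global_view, local_view, t_s=0.1, t_t=0.04):
    t_out = teacher(global_view)                           # only global views go to the teacher
    s_out = student(local_view)                            # all views go to the student
    t_prob = F.softmax((t_out - center) / t_t, dim=-1)     # center + sharpen the teacher
    return -(t_prob * F.log_softmax(s_out / t_s, dim=-1)).sum(-1).mean()

# One toy step: two augmented views of the same batch of images.
g, l = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
loss = dino_loss(g, l)
loss.backward()
with torch.no_grad():                                      # teacher = EMA of student
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(0.996).add_(ps, alpha=0.004)
    center = 0.9 * center + 0.1 * teacher(g).mean(0)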

Einsum equation:

It's an elegant way to perform matrix or vector manipulations. I find it extremely useful when I have to multiply matrices of higher dimensions; it gives great flexibility to sum and multiply along particular axes.

Example: if you have to multiply a matrix A of shape (1, 200, 2, 32) with a matrix B of shape (2, 32, 32) and obtain a matrix C of shape (1, 200, 32), it can be implemented as follows:

np.einsum('abcd,cde->abe', A, B)

That's it! It can be written similarly in TensorFlow and PyTorch (a runnable version of this example is shown below). I was going through the Keras implementation of multi-head attention: https://github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/python/keras/layers/multi_head_attention.py#L124-L516

In the "Attention Is All You Need" paper the different heads are concatenated, but in the implementation the heads are multiplied with a weight matrix; I will be discussing a few examples from this paper.

Syntax of an einsum equation: np.einsum('indices_of_A, indices_of_B -> indices_of_C', A, B) …
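A runnable version of the example above, with the equivalent torch.einsum call; the arrays are random and only the shapes matter.

# Contract A (1, 200, 2, 32) with B (2, 32, 32) over the shared axes c and d.
import numpy as np
import torch

A = np.random.rand(1, 200, 2, 32)
B = np.random.rand(2, 32, 32)

C = np.einsum('abcd,cde->abe', A, B)      # sum over c and d, keep a, b, e
print(C.shape)                            # (1, 200, 32)

# The same expression works unchanged in PyTorch (and with tf.einsum in TensorFlow):
C_t = torch.einsum('abcd,cde->abe', torch.from_numpy(A), torch.from_numpy(B))
print(C_t.shape)                          # torch.Size([1, 200, 32])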

BEIT: BERT Pre-Training of Image Transformers

We are all aware of how successful BERT has been for NLP applications. BERT is built on the transformer architecture, and recently there has been significant success in vision and audio as well using transformers. In the last year we have seen a lot of work in the vision domain (DINO, "An Image is Worth 16x16 Words", etc.) around transformers. One of the key ideas is to use image tokens the way text tokens are used in NLP. Huge language models like BERT and GPT benefit greatly from pre-training on a large corpus, whereas Vision Transformers rely on different versions of contrastive learning for the pre-training task. Although they get close to SOTA, they still require a lot more unlabeled data than conventional convolution-based neural networks. Pre-training through contrastive learning has certain limitations, because it depends on a large number of negative samples and can suffer from mode collapse; of course, there are works trying to solve these limitations…
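The snippet cuts off before the method itself, but to make the "image tokens as text tokens" idea concrete, here is a rough BERT-style masked-prediction sketch: some patch positions are masked and the model predicts a discrete visual-token id for each masked position. The tokenizer, vocabulary size, and dimensions are assumptions for illustration, not BEIT's actual configuration.

# Mask some patch positions and predict their discrete visual-token ids (BERT-style).
import torch
import torch.nn as nn

VOCAB, D, N = 8192, 256, 196                        # visual vocab size, dim, patches (assumed)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(D, VOCAB)                          # predicts visual-token ids
mask_emb = nn.Parameter(torch.zeros(1, 1, D))       # learnable [MASK] embedding

patch_emb = torch.randn(4, N, D)                    # embedded patches of 4 images
token_ids = torch.randint(0, VOCAB, (4, N))         # ids from an (assumed) image tokenizer
mask = torch.rand(4, N) < 0.4                       # mask ~40% of positions

x = torch.where(mask.unsqueeze(-1), mask_emb.expand(4, N, D), patch_emb)
logits = head(encoder(x))                           # (4, N, VOCAB)
loss = nn.CrossEntropyLoss()(logits[mask], token_ids[mask])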

NeurIPS 2021 — Curated papers — Part 1

UniDoc: Unified Pretraining Framework for Document Understanding

The authors propose a self-supervised framework for document understanding from a multi-modal point of view. Language pre-training using transformers has become extremely popular; in this work, the authors show how to do SSL with transformers by taking inputs from different modalities, namely image and text. UniDoc has mainly 4 steps (a cross-attention sketch is given below):

Feature extraction: given a document image I and the locations of its document elements, sentences and their corresponding bounding boxes are extracted using OCR.
Feature embedding: for the bounding boxes, features are extracted through a CNN backbone + RoIAlign and quantized using Gumbel-softmax (similar to wav2vec 2.0); embeddings for the sentences are extracted from pre-trained hierarchical transformers.
Gated cross-attention: one of the main ingredients of the work, where cross-modal interaction takes place between text and visual embeddings through a typical cross-attention mechanism. …
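A generic illustration of gated cross-attention between the two modalities: text embeddings attend to visual region features, and a learned gate scales how much of the attended visual signal is mixed back in. The gating form, dimensions, and module names here are assumptions meant to illustrate the mechanism, not UniDoc's exact formulation.

# Text tokens attend to visual region features; a learned gate scales the result.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))      # starts closed, learned during training
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, visual):
        # text: (B, T, dim) sentence embeddings; visual: (B, R, dim) region features
        attended, _ = self.attn(query=text, key=visual, value=visual)
        return self.norm(text + torch.tanh(self.gate) * attended)

text = torch.randn(2, 12, 256)               # 12 sentence embeddings
visual = torch.randn(2, 30, 256)             # 30 quantized region features
out = GatedCrossAttention()(text, visual)    # (2, 12, 256)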