NeurIPS 2021 - Curated papers - Part 2
Link to my deep learning blogs: https://rakshithv-deeplearning.blogspot.com/

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

This work targets Vision-Language Pre-training (VLP), i.e., multi-modal learning through pre-training.
1. Given an image and text, the image is passed through a Vision Transformer (ViT), and the ViT outputs are taken as visual tokens. For the text, tokens are generated with the BERT tokenizer.
2. The sequence [CLS]_[Visual tokens]_[SEP]_[Text tokens] is formed by concatenation, and masks are generated similarly to BERT.
3. The concatenated image and text tokens are given to the Multi-modal Transformer (MT).
4. Pre-training has 3 objective functions:
   a. MLM (Masked Language Modelling): almost the same as in BERT, predicting the masked tokens in the text.
   b. ITM (Image-Text Matching): with probability 0.5 the image is replaced with another image, and a binary classification (same image or different image) is used to learn inter-modal alignment.
   c. MFR (Masked Feature Re...
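The input construction in steps 2 and 4 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the token ids (CLS/SEP/MASK), the masking probability, and the helper names are all assumptions for the sketch.

```python
import random

# Hypothetical special-token ids for illustration; a real model would use
# the BERT vocabulary's actual [CLS]/[SEP]/[MASK] ids.
CLS, SEP, MASK = 101, 102, 103

def build_multimodal_input(visual_tokens, text_tokens):
    """Step 2: concatenate as [CLS] + visual tokens + [SEP] + text tokens."""
    return [CLS] + list(visual_tokens) + [SEP] + list(text_tokens)

def mask_text_tokens(tokens, num_visual, p=0.15, rng=None):
    """BERT-style masking applied to the text segment only (MLM sketch).

    Returns the masked sequence and per-position labels; -100 marks
    positions ignored by the MLM loss (the usual PyTorch convention).
    """
    rng = rng or random.Random(0)
    masked, labels = list(tokens), [-100] * len(tokens)
    text_start = 1 + num_visual + 1  # skip [CLS], visual tokens, and [SEP]
    for i in range(text_start, len(tokens)):
        if rng.random() < p:
            labels[i] = masked[i]    # remember the original token for the loss
            masked[i] = MASK
    return masked, labels

def sample_itm_pair(image_id, all_image_ids, rng=None):
    """ITM sketch: with probability 0.5 swap in a different image.

    Returns (image_id_used, label) with label 1 = matched pair, 0 = mismatched.
    """
    rng = rng or random.Random(0)
    if rng.random() < 0.5:
        negatives = [i for i in all_image_ids if i != image_id]
        return rng.choice(negatives), 0
    return image_id, 1
```

For example, `build_multimodal_input([1, 2], [5, 6])` yields `[101, 1, 2, 102, 5, 6]`, and `mask_text_tokens` only ever replaces positions after the [SEP], so the visual tokens are never masked here (the actual MFR objective handles the visual side separately).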